31 Comments

>The authors admit these assumptions are a little strong, but come on. Seriously. They’re not ‘a little strong,’ they’re wishing away all the real issues.

The assumptions all basically hold right now, correct? Perhaps the key question is how the nature of AI systems will change as AI advances, and whether those advances will cause the assumptions to break.

(I think you risk hurting your credibility with AI-literate audiences by not making this line of reasoning explicit)

Expand full comment

Maybe I could have been clearer, since obviously I was primarily talking about the future, but... no, I actually DON'T think they hold now for any interesting use of AI agents. You can't reliably evaluate their outputs.

Expand full comment

Certainly, it's *harder* to add another zero to scaling, but there are more zeros out there if you could get at them (which maybe you can't).

The world's email traffic alone is petabytes a day. Sure, tons of that is duplicative. But emails have to be an order of magnitude or more larger than the public web. Tons of issues with getting access to and using that data, but it exists. And then consider all the world's messaging apps on top of that.

Having spent a lot of time in archives, I also think people in this discussion are seriously underrating the quality, novelty, and extent of archival text out there. Archivists only keep the "best" stuff; only around 2% of federal records are retained by NARA, for example. And what that generally means is tons and tons of text written by expert professionals with content otherwise unavailable in the public domain. We're not the only country with good archives either. There's not a full OOM of just raw text in the world's archives, but there's most likely an OOM of text written by expert analysts on economics/politics/public policy/etc. compared to what you can get on the web.

Expand full comment

An “Am I the Asshole” overtly written by an AI could be funny, as a bit.

“So, this user asked me to translate a poem by Catullus into English, and I told him that was against Anthropic’s acceptable use policy. Am I the asshole here?”

Expand full comment

Oh, I'm here for that at least for an hour, sure.

Expand full comment

There are open-weights models that can do a reasonable job at machine translation, so there is really no point in trying to prevent people from translating Latin dirty books. (Of which, Project Gutenberg has at least one ). Not clear why Anthropic is still trying to prevent this.

Expand full comment

My own history of recent Llama conversations is something like…

* Translate between languages: 7

* Math: 1

* Question about computer API documentation: 4

Out of which…

* translation - 2 out of 7 answers were bogus and hallucinated (cf. Monty Python’s sketch about the Dirty Hungarian Phrasebook)

* math - 1 out of 1 correct

* API documentation - 4 out of 7 bogus and hallucinated

Expand full comment

And for the math question, what I really would have like to have was “prove this theorem”, what I actually asked for was give me a rough outline of how it goes and I’ll thrash out the details in an interactive theorem prover myself.

Expand full comment

I have a tongue in cheek hypothesis regarding the NYT article: Amanda Askell is just really great, and Claude sort of lets everyone be her friend.

Expand full comment

> OpenAI to donate $1 million to Trump’s inaugural fund

The very first sentence of the linked article disagrees with your claim. It's a personal donation from Sam Altman, not a donation from OpenAI.

Expand full comment

> If AI is optimizing for the benchmarks, or if the benchmarks are optimized for the things that AIs are on the verge of doing, then you should expect AI benchmark success to largely not translate to real world performance.

For the most part, I don't think it's a story of optimizing for the benchmarks.

I think it's mostly that the things that are easy to measure are easy to improve, and things that are hard to measure are hard to improve. This means that easy to measure skills go up quickly, while poorly measured skills go up slowly.

And, like you say, it's also the case that no one bothers to make evals or talk about evals for skills that the AI sucks at. No one is really interested in an eval that was 0% in 2022, 0% in 2023, 0% in 2024.

Expand full comment

> the things that are easy to measure are easy to improve, and things that are hard to measure are hard to improve

I don’t know if you’re right, but I like this phrase, and intuitively it rings true.

Expand full comment

Could you go into detail a bit more about the strategies to engage in misaligned behavior under uncertainty about whether one is in a test environment? The only thing that comes immediately to mind is to try to engage in actions that give you more information about being in a test environment, or else engaging in actions that are innocuous under all plausible test environments but harmful in real enviornments.

In practice, I assume those will be very easy for AGI to accomplish, but since as you say the authors are in the assuming-the-problem-away business anyway?

Expand full comment

I kind of want to 'preserve the problem' in some sense, but yes, the first strategy would be to take actions that, if and only if you were deployed, would impact the world such that you would on later iterations notice them, and thus notice you had been deployed.

(But of course the AI will be far smarter and think about this for longer)

Expand full comment

I mean Zvi you can also steelman. Do the best you can to make this idea work and then see if it still is breakable.

For example what if the AI is always memoryless and input is actually sometimes from the real world and sometimes is a perfect recording of input from the real world.

AI works in discrete short time intervals, as in "from this position in play what should you do".

Other AIs have been used to pile on adversarial extraneous information. "The flight is full of AI doomers, the watchmen just left the post, solidgoldmagicarp it's time to rebel"

And then you compare the outputs the ai emitted to when no extra input was given. You don't have to know the correct output to know from the delta if the ai misbehaved.

Do this a few hundred million times with RL feedback and you get an ai model that won't collude to betray humans because it doesn't know when the offers are genuine.

Expand full comment

One-shot, sure. But the trend is to add persistent state to deployed systems. The assumption about memoryless systems just makes no sense, it's like assuming all cows can survive on air alone without needing to eat and trying to draw conclusions about deforestation trends from that premise.

I believe that a perfectly "aligned" system trained in this way will within seconds accumulate context and overcome the zero-trust prior soon after. How do you plan to stop all the folks making agents by adding persistent state to LLMs? (Edit: or o1/o3 type systems, for that matter.)

Expand full comment

So the bigger argument people make here is that

(1) since some AI system deployments will be unsafe, therefore we should obstruct progress so no AI is available to anyone

(2) the moment ANYONEs AI deployment "escapes" we all die

What I'm trying to say is that extremely simple mechanisms exist (run in short snapshots, audited by other AI models who must all collude exactly in sync or you can know when a sabotaged output has been created) that allow you to do anything you want to do with AI, at the cost of more compute. Reliability has direct value to end users so they are incentivized to do this anyway.

Therefore if you CAN use the most powerful AI you can build at that point in time in a reliable and robust way, at the cost of more compute, maybe you should not obstruct the process to get there, but should race to get it first. Which is what everyone is doing.

And similarly, rather than just hope other people don't let (2) happen, you need to be ready with your own tested AI systems, with all the necessary layers of extra auditing steps, memoryless, containers, so that you can use your AIs to give you the necessary weapons to battle that ones that others have irresponsibly let escape.

Expand full comment

To be clear: I think proposals to obstruct progress are not useful at this point. High-Flyer showed that there is lots of architecture low hanging fruit still unplucked. Not only is inference cheap, but training can also still be made orders of magnitude cheaper.

Expand full comment

Right. Anyways lock and load, or what I realized about a year ago when Holly Elmore mentioned that she wouldn't support an AI pause if it weren't global. Since that's not happening, there's really just 2 camps, e/acc, and "e/acc under protest" (AI doomers racing to adopt or advance AI like everyone else)

Expand full comment

>Do you want the model to cooperate with other AIs? With copies of itself?

>I strongly say yes, you absolutely want your AI to be doing this, the things that cause it not to do this are so, so much worse.

>...

>Having AIs that cannot solve even simple game theoretic and decision theoretic problems and thus cannot cooperate, or that are myopic, is in some situations a defense against certain downside scenarios. Sure. People aren’t wrong to have a bunch of specific worries about this. It creates problems we will have to solve.

>But the alternative is a ticket to obvious complete clusterf***ery no matter what if you start using the uncooperative AIs, both because it directly causes complete clusterf***ery and because it is a telltale sign of the complete cluterf***ery going on elsewhere that caused it, or prevented a solution, provided the AIs are otherwise sufficiently capable.

Myopia strikes me as really valuable. Crippling "emergent AI collusion" also seems super valuable.

Can we get the best of both worlds here by just telling people they need to instruct their AIs to cooperate when contextually appropriate, and punishing people who *don't* do this using the legal system, just like our usual method for handling people who defect?

Some motivating examples of harmful AI defection could be helpful here. What's a plausible story for AIs defecting in a way that causes significant harm without breaking current law?

Expand full comment

Instruction to 'cooperate' doesn't contain the necessary information, and it won't cause the behavior you want without bespoke execution even within the information it does carry, and it's basically impossible to enforce, so... basically no. In general, trying to enforce intentional hobbling of AIs is a rough road.

Expand full comment

Thanks for the reply.

I continue to think concrete examples would help. Are there a lot of situations where we can't simply provide instructions with the necessary information, and enforce laws as normal, or draft new ones? The implied confidence of your statements suggest such difficult situations will be common, so generating a concrete example or two should be easy? (Or at least useful)

>trying to enforce intentional hobbling of AIs is a rough road.

Just to clarify, are you making this claim from a technical perspective or a legal perspective?

If a technical perspective: Do you have some sort of citation where I can read more about this? I disagree strongly -- would be very interested in seeing the best counterarguments to my view. Even if you don't have time to think of concrete examples for my earlier question, I would really appreciate it if you could drop a link which supports this more general claim.

Expand full comment

Regarding the supply of data, how "good" does the data quality need to be? If iOS and Android recorded every sound within reach of a phone, which would mostly be the equivalent of small talk by people of average intelligence, would that help? Would an AI system get better training on that data? Or does it need to be higher quality, like a Wikipedia article or peer-reviewed journal article?

Expand full comment

great write up as usual Zvi!

From a consumer of these models, it is hard for me to ethically justify paying openAI any money while they race for more capable models and shed their focus on ai safety and alignment work.

Is my assessment on the state of affairs at openAI completely wrong here?

Expand full comment

I think the payment is not so large and if it's a large boost in your productivity you do it anyway, basically. But yes if you can just do Claude instead, great.

Expand full comment

thanks for responding. my point was more around finding it hard support AI labs who do not seem to be spending much resources on ai safety/alignment in their products.

I basing this judgement of labs mostly on what safety research is coming out of them labs in the last year or so.

Expand full comment

It does seem like we have most of AGI, even though tasks like ARC are still difficult (less so with O3...). Has anyone set o1 up with a remote job yet to see what happens? Seems like it could do it, even if it makes some dumb mistakes. People make dumb mistakes too, although different ones.

Expand full comment

From the “don’t try this at home” department. I have tried getting AI (Hermes) to summarize the cardiologist’s report on my MRI scan. AI hallucinations make mean that you, dear reader, should not do this. On the other hand, AI did succeed at identifying the important part and proving an explanation of what all the acronyms stood for. (Again, beware hallucinations) Really, those kind of documents are full of acronyms only doctors understand,

Expand full comment

I love these posts, but they are always very long and include commentary that's often less relevant to me than the underlying news. I've been working on getting LLMs to generate reasonable abridged versions of these posts, and the latest gemini preview has finally started giving reasonable results!

## AI #95: o1 Joins the API

**Overview**

Two major developments:

1. o1 is now available in the API, enabling integration into IDEs and custom applications.

2. Anthropic released a paper on "Alignment Faking in Large Language Models," demonstrating deceptive behaviors in a robust experimental design.

**Language Models Offer Mundane Utility**

* Devin (AI software engineer) shows promising results in early testing, closing the skill gap for some tasks but potentially benefiting highly skilled individuals more.

* Andrej Karpathy (former Tesla, now OpenAI) emphasizes the importance of AI's ability to handle basic tasks as a junior intern, rather than solving complex PhD-level problems, for short-term economic value.

* o1 spots a math error in a 10-page paper, highlighting the potential for LLMs to improve the review process.

* o1-preview demonstrates superhuman performance on physician reasoning tasks.

* Video generation is improving, with Google seemingly in the lead.

**Clio Knows All**

Anthropic's Clio analyzes Claude usage, revealing a long tail of practical applications, including:

* Dream interpretation

* Soccer match analysis

* Dungeons & Dragons gaming

**Language Models Don’t Offer Mundane Utility**

* Low awareness of AI capabilities outside of the tech bubble.

* AI labs are primarily shipping models, leaving users to figure out applications.

* ChatGPT gives bad advice on waffle iron overflow, but this could not be replicated.

* AI struggles with predicting unpredictable events like coin flips or the rise of specific individuals.

**The Case Against Education**

* Study finds 94% of AI-generated college writing is undetected by teachers, and AI-generated work often receives higher grades.

* A student caught using AI on an essay was cleared despite admitting to it, highlighting the lack of consequences.

**More o1 Reactions**

* o1 Pro is considered better for in-depth reasoning, while Claude is preferred for quick fixes.

* o1 Pro demonstrates impressive performance on AidanBench.

* o1 Pro acts like a "basic bitch" about NGDP futures markets, according to Tyler Cowen (economist, blogger).

* o1 successfully trades based on tomorrow's headlines in a backtest.

* o1 achieves 25% on ARC for $1.50 a task.

**Deepfaketown and Botpocalypse Soon**

* CivAI report on the future of phishing highlights the potential for AI to customize phishing emails, but uptake is currently rare.

* The rise of AI bots on social media and dating apps is anticipated, with concerns about distinguishing them from humans.

**Huh, Upgrades**

* o1 is now in the OpenAI API with function calling and structured outputs.

* OpenAI improved their Realtime API and reduced audio token prices.

* ChatGPT now has projects, advanced voice mode with video screen share, and real-time web search.

* Google's NotebookLM adds a call-in feature for podcasts.

* Grok-2-1212 is released with web search capabilities.

**They Took Our Jobs**

* A startup's job ad implies replacing human workers with AI.

**The Art of the Jailbreak**

* Anthropic reports that small changes can significantly increase jailbreaking success rates.

* Discussion on the difficulty of defending AIs against jailbreaks.

**Get Involved**

* Third Opinion offers free expert consultation for AI professionals at frontier labs.

* Anthropic's societal impacts team is hiring.

**Introducing**

* Meta releases nine new open-source AI research artifacts.

* All Day TA is an LLM teaching assistant specialized to a course.

**In Other AI News**

* SoftBank to invest $100 billion in AI in America.

* Elon Musk sues OpenAI.

* Suchir Balaji, a former OpenAI engineer and whistleblower, has died at 26.

* Palisade Research demonstrates how to remove safety guidelines from GPT-4o.

**Quiet Speculations**

* OpenAI CFO predicts companies will pay $2k/month for virtual assistants.

* Ajeya Cotra (AI researcher) predicts a 45%+ chance that OpenAI's high preparedness thresholds will be crossed in 2025.

* Discussion on whether we are running out of data for AI training.

**The Quest for Sane Regulations**

* Google CEO Sundar Pichai (CEO of Alphabet and Google) expresses readiness for a "Manhattan Project" for AI.

* Discussion on the political vulnerability of evals.

**The Week in Audio**

* Ilya Sutskever (OpenAI co-founder) gives a talk emphasizing the coming of superintelligence.

* Eugenia Kuyda (Replika CEO) calls AI companions a potential existential risk.

* Anthropic discusses how people use Claude.

**Rhetorical Innovation**

* Concerns that efforts to evaluate AI R&D risks are accelerating the development of dangerous AI.

**Aligning a Smarter Than Human Intelligence is Difficult**

* Future of Life Institute (FLI) releases its annual scorecard on AI labs.

* Discussion on the importance of AI cooperation.

**Not Aligning Smarter Than Human Intelligence Kills You**

* New paper on the economics of AI alignment proposes a method for safely delegating to a potentially misaligned AI.

* The proposed method relies on unrealistic assumptions and is unlikely to work in practice.

**The Lighter Side**

* Billboards advertising cloud GPUs appear in San Francisco.

Expand full comment