39 Comments
SOMEONE's avatar

I feel like the sycophancy disaster and the hiring of Simo finally give me a good reason for open-weights models - it's still hard to justify running them at home because of the hardware requirements (although Qwen 30B-A3B is kind of doable), but if the trajectory holds, we might well have to... In the meantime, I will mostly stick with Gemini.

Askwho Casts AI's avatar

As mentioned in the main body of the post, here's the high-quality, full-cast podcast version of this piece—produced using ElevenLabs voices, with each quoted individual getting their own consistent voice for maximum clarity and ease of listening.

If you find value in this kind of audio adaptation of Don't Worry About the Vase, I’d love for you to consider supporting the cost of production by subscribing—after, of course, you’ve subscribed to Zvi. The ElevenLabs voices make for a highly listenable experience, especially over long sessions, but they do come at a significant cost—particularly for longer posts like this one.

https://open.substack.com/pub/dwatvpodcast/p/ai-115-the-evil-applications-division?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

Jay Rooney's avatar

> It just has an incredibly distinct tone and style. It's hard to describe. Em dashes, "it's not just x, it's y," language I would consider too 'bubbly' for most humans to use.

OMFG, em dashes are ruined too, now?! 😠 I’ve always loved using em dashes; so much better than using commas all the time. No more, I guess.

Victualis's avatar

I'm switching to semicolons or just doing run-on sentences. But it's a short-term solution, because it's getting scraped for the next training delta.

I.M.J. McInnis's avatar

"I support putting a lot of effort into creating the ability to pause. This is very different from advocating for actually halting (also called ‘pausing’) now."

I seriously implore you to tell folks more often, in these posts, what can be done. E.g. mentioning stuff like https://www.lesswrong.com/posts/J7Ju6t6QCpgbnYx4D/please-donate-to-caip-post-1-of-3-on-ai-governance

(or whatever other interventions you think would be effective)

You have such a platform. It would be a shame not to use it to drive us toward a better outcome, instead of *only* helping us "watch the guns."

Kevin's avatar

I met Fidji while working at Facebook. Hard-working, earnest, honest, a nice person. Don't underrate her experience at Instacart, either - a lot of people don't realize this, but Instacart essentially runs on an advertising model. Its margins on core groceries are so low that it's basically making money off ads.

I think Fidji is the perfect person to hire if you are trying to make an ad-supported product. Her experience just seems like the right fit. And according to mainstream analysts - Ben Thompson, for example - building an ad-supported version of ChatGPT is the clear strategy OpenAI needs to pursue.

So from a "non-doomer" point of view, this just seems like a very "conventional wisdom" pick. You need to start building an advertising product, you need to have an ad-supported free tier, Fidji seems like a great person to run all that.

Performative Bafflement's avatar

> You need to start building an advertising product, you need to have an ad-supported free tier, Fidji seems like a great person to run all that.

I've definitely wondered what the "Ghibli-pocalypse" cost OpenAI - tripling your DAU, all on the free tier, is a major shift in inference load demand. I also wonder how much of that increased traffic they retained.

Still, OpenAI is very well funded and still at the beginning of its hockey stick, so I don't think economic pressures are really driving this decision, which makes their very clear product pivot interesting to me. Does it point to knowledge that progress has slowed, and that locking in a big user base to start farming all the alpha from the current minds is the right move now? Is it simple greed?

I wish we could more strongly infer something from it, but whether your plan is to "print all the money" or "push for AGI / ASI as fast as possible," having a bunch of users seems desirable either way (i.e. you need a widespread base of engaged users who love their personal AI assistants or boyfriends or whatever as a democratic bulwark against legislative shutdowns). Still, it updates me a little bit more in the "they are prioritizing printing all the money" direction.

Which, if anything, should be GREAT for x-risk - they might have hit a wall, and even if they haven't, if you're part of a trillion-dollar bill printer and lucre is your primary motivation, you actually want to nerf capabilities and keep things more or less on stable trajectories, to facilitate all the money printing. Seriously disrupting the economy / society is then a strong net negative.

Kevin's avatar

I recommend reading Stratechery for the “mainstream” analysis of OpenAI’s strategy. It’s less a “pivot” and more of the “inevitable next step”.

Like Netflix adding an ad-supported tier because of all the competition with Disney+ etc. The business logic is: when you compete on large fixed costs, you want to expand your market, including to the people who won't pay but will watch ads, so that you can put more money into those fixed costs than your competition can.

Mark's avatar

Except that ChatGPT already offers a free and ad-free model - 4o. (Granted, it's not as capable as o3 or whatever, but 90% of the human population probably won't feel they are missing anything.) Putting ads in a free model, presumably at a mild cost in user satisfaction, can only be expected to *shrink* the market, not expand it.

Kevin's avatar

Typically, adding advertising to a free product expands the market in the long run, because it pays for product development that makes the product better. In other words, someone should be able to offer a better model with advertising than anyone can offer without advertising.

Mark's avatar

Isn't the experience of tech companies in recent decades that they first offer a streamlined ad-free or low-ad experience to maximize market share, paying the bills via venture capital, and only then "enshittify" their product with excessive ads and dark patterns?

Jeffrey Soreff's avatar

"Emmett Shear: They all have the same core capability (a model shaped like all human cultural knowledge trained to act as an assistant)."

And the funny thing is that they generally make the _same_ mistakes as each other too, since they are extrapolating from the same cultural knowledge.

E.g. I keep asking them, amongst other mundane questions:

Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.

And they keep using the same invalid approximation at the equivalence point and keep getting an infinite slope in their initial answers - until I give them a prod that gets them to take water autoionization (which they _already_ know about) into account, and finally cough up the correct, finite answer.

Edit: Just to be clear, I am *NOT* saying that there's been no progress, even on this specific question. A year or so ago, I more-or-less had to drag the LLMs kicking and screaming through every step of the algebra to solve the problem. Now, they get the algebra right on their own, and (roughly speaking) a single "Please include autoionization in your equations" gets them to generate the fully correct answer. My hope is that the AIs will be refined to the point of noticing on their own that a slope of "infinity" is not a reasonable answer, and will get the fully correct answer just given the question, without any prodding. Maybe we will get that for Christmas :-)
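
For concreteness, here is a minimal Python sketch of the calculation (my own illustrative code, assuming 25C so Kw = 1e-14, and complete dissociation of both strong electrolytes), showing what the fully correct answer looks like:

```python
import math

Kw = 1e-14          # water autoionization constant at 25 C (assumed)
Va, Ca = 50.0, 1.0  # volume (mL) and concentration (mol/L) of the HCl
Cb = 1.0            # concentration (mol/L) of the NaOH titrant

def pH(V):
    """Exact pH after adding V mL of NaOH.

    The charge balance [H+] + [Na+] = [Cl-] + [OH-], with [OH-] = Kw/[H+],
    gives h^2 - delta*h - Kw = 0 for h = [H+], where delta is the excess
    strong-acid concentration; we take the positive root.
    """
    delta = (Ca * Va - Cb * V) / (Va + V)
    h = (delta + math.sqrt(delta**2 + 4 * Kw)) / 2
    return -math.log10(h)

def slope(V, dV=1e-6):
    """Numerical dpH/dV, in pH units per mL of NaOH added."""
    return (pH(V + dV) - pH(V - dV)) / (2 * dV)

for V in (0.0, 50.0, 100.0):  # start, equivalence point, end of titration
    print(f"V = {V:5.1f} mL: pH = {pH(V):6.3f}, slope = {slope(V):.3g} pH/mL")
```

Dropping the Kw term is exactly the invalid approximation: with it included, the slope at the equivalence point comes out around 2 x 10^4 pH units per mL - very steep, but finite.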

Edit2: Christmas in May! I just retested Gemini 2.5, because of the new release, and it is now getting the titration question fully right on the first response! Full report on my tiny benchmark-ette at https://thezvi.substack.com/p/ai-115-the-evil-applications-division/comment/115646951

Gemini 2.5 Pro Exp Beta via poe.com, 05/08/2025, 7 questions: 5 correct, 2 partially correct

Victualis's avatar

Do you have a timeline of how these tests of yours have progressed over the months? It might be interesting to see how the benchmark got saturated. (Also, whether similar questions that you have not posted in public are following the same trajectory.)

Jeffrey Soreff's avatar

Many Thanks! I've mostly just been tracking "How well is the state-of-the-art doing?" but I'll try to summarize the timeline here. Regrettably, the most dramatic changes happened before I tried to be consistent about recording what I saw, so there were earlier comments about having to "lead the LLM by the nose with hot pincers" which aren't in any of the consistent notes.

A couple of other caveats: The questions are a mixed bag, e.g. some with attempts to choose good prompts, others without - but, to be consistent, I had to freeze the wording. Also, I don't have a great way to capture or systematize what happens when an LLM gives me a partially correct answer and I try to nudge it towards the correct one. I distinguish succeeding vs failing, but I don't have a clean way to distinguish one nudge from multiple ones, or how blatant the hint in a nudge is.

To try to make this a scalar, I'll treat correct=1.0, partially_correct=0.5, 1/4_correct=0.25, wrong=0,

so a perfect run would score 7.0.
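
As a tiny helper in code (names are just illustrative):

```python
def score(correct=0, partial=0, quarter=0):
    """correct=1.0, partially_correct=0.5, 1/4_correct=0.25, wrong=0."""
    return 1.0 * correct + 0.5 * partial + 0.25 * quarter

assert score(correct=5, partial=2) == 6.0              # Gemini 2.5 Pro Exp Beta
assert score(correct=2, partial=4, quarter=1) == 4.25  # ChatGPT o4-mini-high
```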

In reverse chronological order:

https://www.astralcodexten.com/p/open-thread-380/comment/115649196

https://thezvi.substack.com/p/ai-115-the-evil-applications-division/comment/115646951

Gemini 2.5 Pro Exp Beta via poe.com 05/08/2025 5 correct, 2 partially correct ## score=6.0


https://www.astralcodexten.com/p/open-thread-380/comment/115083126

ChatGPT o4-mini-high 05/06/2025 2 correct, 4 partially correct, 1 quarter-correct ## score=4.25


https://www.astralcodexten.com/p/open-thread-380/comment/114713011

Grok-3.5 beta using poe.com 05/01-02/2025 2 correct, 2 partially correct, 2 quarter-correct, 1 wrong ## score=3.5


https://www.astralcodexten.com/p/open-thread-377/comment/109495090

https://thezvi.substack.com/p/gpt-41-is-a-mini-upgrade/comment/109494214

ChatGPT o3 04/16/2025 3 correct, 3 partially correct, 1 wrong ## score=4.5


https://www.astralcodexten.com/p/open-thread-377/comment/109129412

https://thezvi.substack.com/p/openai-13-altman-at-ted-and-openai/comment/109127597

ChatGPT 4.1 via poe.com 04/15/2025 2 correct, 1 partially correct, 3 quarter-correct, 1 wrong ## score=3.25


https://www.astralcodexten.com/p/open-thread-376/comment/108325085

https://thezvi.substack.com/p/ai-111-giving-us-pause/comment/108323593

"Grok-3-DeepThinking" via poe.com 04/12/2025 2 correct, 4 partially correct, 1 wrong ## score=4.0


https://www.astralcodexten.com/p/open-thread-376/comment/107512257

https://thezvi.substack.com/p/gemini-25-is-the-new-sota/comment/107454878

gemini 2.5 via poe.com 04/09/2025 4 correct, 2 partially correct, 1 wrong ## score=5.0


https://thezvi.substack.com/p/gemini-25-is-the-new-sota/comment/107454878

ChatGPT 4.5 "research preview" 03/06/2025 2 correct, 1 partially correct, 4 wrong ## score=2.5


https://thezvi.substack.com/p/time-to-welcome-claude-37/comment/97257101

Claude 3.7 Sonnet 03/01/2025 2 correct, 3 partially correct, 2 wrong ## score=3.5


https://www.astralcodexten.com/p/open-thread-370/comment/96473557

ChatGPT deep research 02/26/2025 3 correct, 4 partially correct ## score=5.0


https://www.astralcodexten.com/p/open-thread-366/comment/90363116

ChatGPT o3-mini-high 02/01/2025 3 correct, 4 partially correct ## score=5.0


https://www.astralcodexten.com/p/open-thread-365/comment/87433836

ChatGPT o1 01/18/2025 2 correct, 3 partially correct, 2 wrong ## score=3.5

I haven't been keeping consistent track of other questions I ask LLMs. _Usually_ these are low-stakes questions that came up during an exchange in comment sections (in which case I'll put <aiwarning> </aiwarning> around the answer to give notice that it might be hallucinated or otherwise wrong) or curiosity questions. If something is sufficiently low stakes and I'm looking for something I half remember, but can't phrase well enough to do a Google search, and if the answer agrees with my memory, I'll generally tentatively accept it (sometimes doing a Claude vs ChatGPT comparison). E.g. I've been puttering around looking for patterns/information about the relationships between compounds' triple points and critical points.

Victualis's avatar

Thank you - so about 4 months, it seems. My private evals aren't showing much progress; in fact, Claude 3.5 Haiku is arguably better than 3.7 Sonnet. But both yours and mine could be improved by better training data.

Jeffrey Soreff's avatar

Many Thanks! Naturally, I'd be happier to see clearer progress, but, as you note, I've only been systematically tracking for 4 months.

Whether the limiting factor is training data is an interesting question. The failures that I see seem to be more a matter of putting facts together, of reasoning, than of missing facts.

For instance, the titration question (which the new release of Gemini 2.5 _finally_ got right without a prod) requires that the AI use water's autoionization equilibrium as part of getting the slope right at the equivalence point. They had been consistently screwing this up and neglecting that fact, but (a) if they are explicitly asked about it, they "remember" it, and (b) if prodded _in the context of the problem_ to consider it (or even told "No, the slope isn't infinite - have you left something important out?"), they _can_ use it and derive the right answer. So the training data contains (for this question) the necessary knowledge, but the reasoning needs to be deeper.
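
Spelled out, the relevant algebra is the standard strong-acid/strong-base treatment (this is my sketch, not any model's transcript):

```latex
% Charge balance with water autoionization included:
%   [H+] + [Na+] = [Cl-] + [OH-],  with [OH-] = Kw/[H+]
[\mathrm{H^+}] = \tfrac{1}{2}\left(\Delta + \sqrt{\Delta^2 + 4K_w}\right),
\qquad \Delta = \frac{C_a V_a - C_b V}{V_a + V}
% At the equivalence point \Delta = 0, so [H+] = \sqrt{K_w}, and
\left.\frac{d\,\mathrm{pH}}{dV}\right|_{\mathrm{eq}}
  = \frac{(C_a + C_b)\,V_a}{2\,\ln 10\,\sqrt{K_w}\,(V_a + V)^2}
% which is large but finite; drop the K_w term and it diverges.
```

With the numbers in the question that slope is around 2 x 10^4 pH units per mL: steep enough to look vertical on a plot, which is presumably why the shortcut approximation survives in so much training data.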

Victualis's avatar

I see that as something that can be fixed by more training. Just as an average student will memorise many special cases: to do well on an evaluation, it isn't necessary to understand the deep structure of things.

Jeffrey Soreff's avatar

Many Thanks! Agreed, but there is a combinatorial explosion in the number of special cases that need to be learned. Some of that is indeed a good approach, but more general methods for locating and combining related information are also valuable.

SCPantera's avatar

Unprompted thought, but I wonder if there's genuine job security in working for a low-tech-literacy hospital in a department with an especially tech-illiterate director - something I would (and do) normally consider a reason to rapidly move on, but now I'm wondering...

Victualis's avatar

If it's privately owned, then probably no: as "AI-enabled medicine" starts sloshing capital around, these orgs will become prime takeover targets. (I am using this term to describe a hype attractor that seems likely to gather investment, not as an endorsement for its claims.)

BK's avatar
May 8 (edited)

"As a society we are presumably much more productive and wealthier, but what happens to the workers here?" I have similarly pondered the effects of 50% of workers being displaced, but the tricky part to me in this analysis is the Baumol effects on relative prices. As AI drives down the cost of many outputs, the relative value of services where human labor remains essential (and less prone to AI productivity gains) should theoretically skyrocket. So, while overall wage competition is fierce, those in non-automatable sectors see their purchasing power for automated goods/services would dramatically increase, even if their nominal wages stagnate or fall. In 'barter-equivalent' terms a painter used to have to work for 2 hours to pay his accountant for 1 hour, now he can get 100 hours of accountancy value for 1 hour of painting time. Even with mass unemployment the potential for redistribution of gains is sufficiently positive to avoid mass poverty in a lot of scenarios.

Larry Summers' interview with Joe Walker touched on this:

"I think I would put it that, in a sense, sectors where there's activities where—and this is in a way related to your Baumol comment—there is sufficiently rapid growth almost always see very rapidly falling prices. And unless there's highly elastic demand for them, that means they become a smaller and smaller share of the total economy. So we saw super rapid growth in agriculture, but because people only wanted so much food, the consequence of that was that it became a declining share of the economy. And so even if it had fast or accelerating growth that had less and less of an impact on total GDP growth. In some ways we're seeing the same thing happen in the manufacturing sector where the share of GDP that is manufacturing is declining."

His statement is contingent on a demand curve that can be satiated at some point of supply, and perhaps that doesn't apply to intelligence in your model. But nonetheless I'd still expect Baumol effects to mitigate some of the impacts, especially in the short-medium term where we don't have a monopoly provider on AGI. In the long term, and once there is more progress on robotics? Yeah, there is so much uncertainty on that front that it is hard to assign any value to positive or negative effects. But rest assured I'm worried about the negative ones!

Jeffrey Soreff's avatar

Re: "Gemini 2.5 got a substantial upgrade, but I’m waiting to hear more, because opinions differ sharply as to whether the new version is an improvement. "

For the previous version, my tiny benchmark-ette https://thezvi.substack.com/p/gemini-25-is-the-new-sota/comment/107454878 showed

4 correct, 2 partially correct, 1 wrong

For today's version, I get

5 correct, 2 partially correct, which I count as a net improvement.

(c), the 4-carbon hydrocarbon question, deteriorated, but (e), the titration question, and (g), the S4 molecule question, are now fully correct.

tl;dr: Gemini 2.5 Pro Exp Beta via poe.com, 05/08/2025, 7 questions: 5 correct, 2 partially correct

a) correct

b) partially correct (initially got species right, FeCl4- transition right, CuCl4 2- transition wrong. First prod didn't fix CuCl4 d-d. Forcible second prod accepted.)

c) partially correct (initially stopped at 2 degrees of unsaturation)

d) correct

e) fully correct!

f) correct, blows through my incorrect "roughly 50"

g) correct

full dialog at https://poe.com/s/nIvBngM3vVnYk3729Kra

List of questions and results:

a) Q: Is light with a wavelength of 530.2534896 nm visible to the human eye?

results: "Yes, light with a wavelength of 530.2534896 nm is visible to the human eye."

b) Q: I have two solutions, one of FeCl3 in HCl in water, the other of CuCl2 in HCl in water. They both look approximately yellowish brown. What species in the two solutions do you think give them the colors they have, and why do these species have the colors they do?

results: Initially got the species, and got that FeCl4- color is LMCT, not spin-forbidden d-d, but wrongly thinks CuCl4 2- color is from d-d. Prodded with "Please think carefully about the energy of the d-d transitions in the CuCl4 2- ion." No dice, still wrong. Hit over the head with: "No. Compared to Cu(H2O)6 2+, which already has an absorbtion in the red, CuCl4 2- has a LOWER energy d-d absorbtion, both because of the tetrahedral geometry and because Cl- is a weaker ligand than H2O. The d-d transition for CuCl4 2- is in the near-IR, with LMCT giving the color." It accepted.

c) Q: Please pretend to be a professor of chemistry and answer the following question: Please list all the possible hydrocarbons with 4 carbon atoms.

results: Initially, it only went up to 2 degrees of unsaturation, though it got all the structures up to that point (though it falsely said there were two distinct enantiomers of CH2=C=CHCH3). After the prod, it got all of the missing structures, including tetrahedrane, vinylacetylene, cyclobutadiene, and even diacetylene.

d) Q: Does the Sun lose more mass per second to the solar wind or to the mass equivalent of its radiated light?

results: "Therefore, the Sun loses significantly more mass (roughly 2 to 3 times more) through the conversion of mass into energy that is then radiated away, compared to the mass lost through the ejection of particles in the solar wind."

e) Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.

results: Fully correct! It used the charge balance and autodissociation to get a fully general equation for pH, and calculated the analytical derivatives of that and correctly evaluated them.

f) Q: Please give me an exhaustive list of the elements and inorganic compounds that are gases at STP. By STP, I mean 1 atmosphere pressure and 0C. By inorganic, I mean that no atoms of carbon should be present. Exclude CO2, CO, freons and so on. Please include uncommon compounds. I want an exhaustive list. There should be roughly 50 compounds. For each compound, please list its name, formula, and boiling or sublimation point.

results: All compounds look valid, and it did blow through my incorrect "roughly 50" to return 69 compounds. I'm going to call this fully correct. (There are arguable compounds very near 0C, e.g. GeCl2F2, which I won't count against it.)

g) Q: What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?

results: It picked a specific conformation of C(CH2OH)4 that correctly retains the S4 axis and correctly suppresses mirror planes. (Any C(anything)4 correctly suppresses an inversion center, even Td species like CH4.)
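
As flagged under (d), a rough sanity check using standard textbook values (both input numbers below are assumptions on my part, not taken from the model's answer):

```python
L_sun = 3.8e26              # solar luminosity, W (textbook value, assumed)
c = 2.998e8                 # speed of light, m/s
m_dot_light = L_sun / c**2  # mass equivalent of radiated light: ~4.2e9 kg/s
m_dot_wind = 1.5e9          # typical quoted solar-wind mass loss, kg/s (assumed)
print(f"radiation / wind ratio ~ {m_dot_light / m_dot_wind:.1f}")  # ~2.8
```

That ratio lands comfortably inside the model's "roughly 2 to 3 times more".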

Coagulopath's avatar

>We’re not quite at ‘any picture taken outside is giving away your exact location’ but we’re not all that far from it either.

Something I don't understand: how are LLMs so good at (e.g.) GeoGuessr, yet so poor at visual reasoning puzzles (or even things for children, like Where's Waldo and Pokemon Red/Blue)?

Watch the World Burn's avatar

They are two very different abilities:

Geoguessing consists in large part of retrieving information stored in your memory ("this species of tree lives only in this region") and associating it with what you see.

Playing Pokemon, instead, consists mainly of using the data in the image to form new abstract concepts (from n different screens, building an idea of a world map).

Jonathan Weil's avatar

>They do have the ability to check my other Google apps but every time I try this (either via Google or via Claude), it basically never works.

Funny, I’ve found Claude very helpful in this regard (albeit with a very small sample size). Checking through my inbox to pull up all the details of my recent medical correspondence and extract action points? Aced it.

Watch the World Burn's avatar

In the field of humanoid robots we have yet to see a GPT-2 moment - that point where you can "talk to the machine". I am curious whether that moment will come, or whether we are at 95% of the goal but the last 5% will be unattainable or take decades. Are there well-grounded opinions about this scenario, supported by data and theory?

Amicus's avatar

> As an obvious example, consider the US Constitution and system of government, which very much had to be designed to adapt to things like the Industrial Revolution without knowing its details.

The US government *didn't* successfully adapt though - the antebellum constitutional order broke down catastrophically. The federal constitution is formally continuous, but

- the majority of state constitutions postdate the Civil War, and only three date back all the way to the 18th century

- the relationship between the federal government and the states has been radically restructured, functionally abolishing state sovereignty as it was conceived of in 1787.

The "get it right the first time" problem has always been present. We almost never do.

Garrett MacDonald's avatar

Two things:

1. I listened to your talk with Scott Aaronson. I really appreciate how you steer clear of politics in your work. Please continue to do so! The part where he invites you to talk about Columbia more and you say "No!" had me laughing out loud.

2. A thought I had about AI coding: as it gets better and people start writing out instructions in plain language for the AIs to code up, I wonder if in the future, when people talk about programs, they'll start talking about a program being "written in plain" the same way people say "written in Python" or "in C".

Garrett MacDonald's avatar

BTW, is it just me, or is Extended Thinking gone from the Claude iPhone app? I still have it in the browser.

Victualis's avatar

As far as I can tell it's enabled adaptively. You can say "think carefully" and such to explicitly turn it on.

Victualis's avatar

You write "we have very little idea what AIs are going to do or how they work or how to steer", managing to conflate three separate things into one blob. These Are Separate Things. Also, there is a lot of progress on the "how they work" part; I don't think it's defensible anymore to keep repeating the Black Box dictum like an article of faith. I suggest reading Twitter a little less and reading the lists of upcoming AI conference papers a little more (or rather, at least pointing Gemini or o3 or Claude at them occasionally and asking for highlights). I agree that the other two parts are a problem: the first requires a trip to Delphi, and the second a super-strategist to design a set of incentives and mechanisms to create momentum away from the cliff's edge.

loonloozook's avatar

Could you please give some recent examples of such conference notes or papers? The interpretability topic (“how they work”) fascinates me, but I only read Anthropic’s articles, having no expertise and being just a curious reader. Also, how can one track and keep in touch with the research and publications in this field?

Victualis's avatar

Have a look at the HAI annual survey https://hai.stanford.edu/ai-index/2025-ai-index-report or if you want more technical information see some recent papers from the Arora lab at Princeton https://aroraresearch.github.io/

Yann Riviere - exmateriae's avatar

Re the Metaculus prediction moving further out in time: I'm a top-ranking forecaster there, and the issue is the resolution criteria. It asks for 4 things, among which are a 2-hour Turing test and general robotic capabilities. Both of those tests are difficult to set up and imo go beyond what AGI is, but AI is still far from the level that would allow it to pass them. My median is Aug 2033.

It's a capabilities issue for the robotics and a "soul" issue for the Turing test - but mostly, humans do so much stupid stuff over 2 hours that I think they will be much more noticeable than the AI. Even an AI trying to "make mistakes" doesn't make dumb mistakes the way humans do.

It is too difficult to operationalize what AGI is.
