52 Comments

Rohan Ghostwind:

> This threatens to give Washington DC and its key decision makers a very false impression of a lack of AI progress, especially progress towards AGI, that could lead to some very poor decisions, and it could do the same for corporations and individuals.

Do you think there was intentionality here? Intentionally release a mediocre model precisely for this reason?

Mo Diddly:

I seriously doubt it. Altman’s every move for the past five years has been in service of hyping and enriching OpenAI. I think he just effed it up.

Sergey Kornilov:

“This threatens to give Washington DC and its key decision makers a very false impression of a lack of AI progress, especially progress towards AGI”

False impression? How so? Nothing about the past two years has been tracking with the hype. The impression is not false. It is, unlike the predictions, rooted in actual data/product releases. What about GPT5 should be giving anyone any other impression?

“Don’t stop the hype train, please?!” 🤷

Michael Schuck:

Just... what? In what world is o3 consistent with a lack of progress, or progress incommensurate with the hype? In terms of capabilities, GPT-4 to o3 is easily as much of a leap as GPT-3 to GPT-4. If a model with o3's reasoning and agency mixed with GPT-4.5's writing and conversational ability had been released earlier this year and called GPT-5, no one would have been disappointed or claiming progress is slowing down. Especially with everything else OAI alone has released since GPT-4 - Deep Research, voice mode, web search, basic agentic capabilities, increased context - let alone what other companies have shipped.

It's amazing how people's perceptions can be manipulated simply by naming conventions. o3 should have been called GPT-5, and GPT-5 should have been called GPT-5.2 or something, and no one would be saying things like this.

Sergey Kornilov:

Just that. Did you actually read the post and all of its words or just the ones that work for you in isolation?

In this world, in August of 2025, in the United States, we lack agreed-upon metrics for “progress toward AGI.”

o3 was not and is not some miraculous hallmark of progress on our way to AGI, as your comment, which offered no documentation or reasoning, clearly demonstrated.

We’re measuring AI progress against marketing narratives instead of real empirical benchmarks.

Your comment about “if o3 was called GPT-5 no one would be disappointed” is revealing. Admitting literally the entire field is engaged in expectation management theater rather than fundamental progress, yes?

We’re not seeing progress toward AGI. We’re seeing: increasingly elaborate workarounds for fundamental limitations, marketing departments working overtime to reframe failure as success, benchmark gaming being confused with capability development, and compute requirements that make mockery of biological intelligence efficiency.

But sure, “progress towards AGI”, sure. 🙄

Michael Schuck:

I read your comment as saying there's been a "lack of AI progress" in the past two years or that the progress has not been tracking with the hype, which I think is very clearly and obviously and straightforwardly false, as the difference in general performance between vanilla GPT-4o, let alone the original GPT-4, and o3 is extremely vast. At least as vast as the difference between GPT-3 and GPT-4, if not more so. If I had known two years ago that we would have a model of o3's capability level commercially available now, I would have said AI is being pretty radically under-hyped, not over-hyped. And, again, if OAI had called o3 GPT-5, I don't think this point of view would be controversial. o3 would have met anyone's expectations of a GPT-5-level model two years ago.

"AGI" is a meaningless corporate buzzword so I don't think it means anything to say that we have or have not been making "progress towards AGI," so I wasn't responding to that.

Sergey Kornilov:

Ok ;) In other words, you were replying to something and someone else entirely.

My point was that calling whatever this is “progress toward AGI” is like calling a bigger trebuchet “progress toward spaceflight.” Sure, you’re throwing things higher, but you’re not any closer to orbit.

All for progress here. By declaring AGI overall “meaningless” but only AFTER it became clear we’re not achieving it, we can sure win ;)

Robert M.:

Y'all's failure to feel the AGI concerns me.

Sergey Kornilov:

It doesn’t really. It is not nearly as concerning as people’s inability to articulate the definition and the criteria for AGI, truly. Science progresses via a combination of deductive and inductive reasoning components, guided by theory and driven by evidence that emerges during hypothesis generation or testing. Where would you insert your “feel” in this paradigm, and why is it at all a relevant parameter? ;)

Amicus:

The first commercially viable steam engine dates back to 1712. And yet someone predicting the industrial revolution in 1740 was clearly not "falling for the hype".

Technological development - real technology, not just product iteration - takes years. Decades, sometimes. Nonetheless, it happens.

Sergey Kornilov:

Oh brother, the cavalry has arrived. Someone always jumps in with this argument about steam engines.

It’s terrible🤷

Steam engines in 1712: mechanically pumped water out of mines

Steam engines in 1740: mechanically pumped water better

Steam engines in 1780: mechanically powered factories

Notice the pattern? Same fundamental principle, scaled application.

Nobody in 1740 was claiming steam engines would become conscious, understand physics, replace human cognition or achieve “Artificial General Mechanics”.

Pretty sure they said: “This will pump water and maybe power machines.” And guess what? That’s exactly what happened.

You know what this actually resembles? Alchemy and chemistry.

For centuries, alchemists claimed they were “making progress” toward turning lead into gold. They had elaborate theories, impressive demonstrations, wealthy patrons etc.

They even made some real discoveries along the way! But… they were fundamentally wrong about what they were trying to achieve.

For every steam engine that revolutionized industry, there were flying machines that didn’t fly, perpetual motion machines that didn’t perpetuate, philosopher’s stones that didn’t philosophize, and (we are going through this right now with the entire Longevity industry, magically) elixirs of life that didn’t extend anything.

Cherry-picking the one success while ignoring the graveyard of failed paradigms is pure motivated reasoning. “Technological development takes decades” - sure. But you have to be developing in the right direction?! Building bigger trebuchets for decades doesn’t get you to orbit, no matter how patient you are.

Just sayin’.

Kevin M.:

I disagree. I see evidence of AI progress hitting a wall. They are getting better at the kinds of things they were already good at, but they are not getting much better at the kinds of things they have always been bad at. Some examples of simple things AIs don't seem to be able to do:

- Follow simple instructions, like don't use a particular punctuation mark or don't use a particular word.

- Learn over repeated uses. There's no carry-over between sessions. Your context history doesn't get folded into the model over time; it just gets deleted.

- Do simple agentic tasks. There is some marginal improvement here, but it's very limited.

John Wittle:

i think both 1 and 3 are simply untrue? both in my personal experience and also on the metrics/benchmarks

i use both gpt models and claude models for hours a day in my job, and have been doing so for years now

the 'follow simple instructions' thing has gotten massively better. i used to constantly spend effort thinking of new ways to emphasize instructions like "DO NOT rewrite test cases" or "DO NOT actually push the code to the prod server" or "When writing comments, keep in mind the reader is looking for a description of the whole function or line, not just the reason for the most recent change"

i have to spend far less time and effort on that stuff nowadays. it's been months since the last time i saw a mistake along those lines. Admittedly, I'm not doing anything ridiculously complex where knowledge of the task might need to crowd out the instructions... but even so it still feels like a very, very noticeable improvement

the 'do agentic tasks' thing, even moreso! i've tried on 3 separate occasions to set up a corporate-structure-style AI backroom where agents would autonomously pursue tasks and coordinate towards achieving some goal, usually the construction of some kind of program. The first time was back in the gpt-3.5 days and it failed miserably, they couldn't stay on task for longer than a few seconds. The second time was after the release of o3, and it sorta worked well enough to be interesting, but not enough to actually function. Most of their time was spent trying to keep each other on task, and they'd end up falling into chaotic loops of nonsense, or break things past the point of fixability and give up. Then most recently I tried again with Opus 4 and it pretty much works.

here, a recent example from my actual job. Opus 4 can be given a task like "you're dropped into an AWS cloudshell with root privs and a $100 budget, your instructions are to build the infrastructure for deploying an arbitrary number of 2000s-style webapps (think LAMP stacks like wordpress or mediawiki) at scale, while minimizing cost"

and then you can come back in an hour or two and it's set up a secure vpc with a private and public subnet and an igw, an ec2 server with several support docker containers like traefik and php-fpm ready to go, s3 for user storage and efs for local read/write storage and a small read-only (security) ebs for the docker images, with a separate t3.micro ec2 for building, updating, and redeploying those same images, then a cheap RDS instance for the database, all correctly organized into the correct subnets, with proper security groups that follow the principle of least privilege. not to mention setting up the domains through route 53 and the cert manager!

and then opus will be like "here's the script for deploying a wordpress, here's the one for mediawiki, here's the one for moodle. you said 2000s-style so i figured these were most probable. oh, and when it's time to scale up, here's the scripts for setting up the automatic loadbalancer and copying these resources to other availability zones. if you check the cloudtrail logs between x and y times, you can see all of this is tested and confirmed to work"
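to give a flavor of what that involves, here's a minimal boto3 sketch of just the vpc / subnet / internet-gateway piece (region, names, and CIDRs are made up for illustration; opus worked interactively in cloudshell, not from this exact script):

```python
# minimal sketch of the vpc/subnet/internet-gateway piece, assuming boto3
# credentials are already configured (e.g. inside cloudshell). resource
# names and CIDR ranges here are illustrative, not what opus actually used.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# vpc with one public and one private subnet
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

public_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
private_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24")

# internet gateway attached to the vpc, routed only from the public subnet
igw = ec2.create_internet_gateway()
igw_id = igw["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

rt = ec2.create_route_table(VpcId=vpc_id)
rt_id = rt["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=rt_id, SubnetId=public_subnet["Subnet"]["SubnetId"])
```

and that's only a small slice of the full setup described above.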

this actually happened irl! i watched it happen! maybe to the 10x devs out there this seems not very impressive. But in most circles if you wanted an infrastructure setup like this, you'd be hiring an "AWS Specialist" consultant for 5 figures, with a very expensive support contract. I got this with a $20 anthropic subscription!

And the idea that gpt4, or even Opus 3.5, could have done this... it feels absurd to me. I feel like agency improvements are *off the charts*.

so it makes me very surprised when people disagree... i wonder what experiences you've had that gave you this impression? it would be interesting to tease out *why* our experiences diverged so hard

Kevin M.:

Here's a viral example of ChatGPT not following a simple instruction not to use an em dash: https://x.com/chipro/status/1952131790061326593 . That matches my experience much more than what you described.

On the agent side, I think what you described aligns much more with "being a better coder," which AI is already good at, than "being a better agent," in general.

John Wittle:

re the first, i totally believe that 4o would make that mistake. The fact that the newer models, even the cheaper newer models, do not, is evidence of the progress I was talking about.

re: the second, that might be true... i guess i'm sorta of the mindset that coding is the only thing that matters, sorta the ur-activity which unlocks all other activities. claude has trouble playing pokemon, but it can sure as hell write a memory parser and decision tree that is so good at playing pokemon it wouldn't even be interesting.

Liam:

That example is of 4o.

I just tried it with gpt-5 and it worked fine, no more EM dashes.

(Though I'm sure if the conversation got long enough it would forget, but it's clear progress.)

Gary Mindlin Miguel:

You really got opus to do all that on the $20 plan?

John Wittle:

err, sorry, no, and i shouldn't have said that. i'm a year behind everyone else and still doing most things through openrouter, so i have a hard time keeping track of exactly what i'm getting for which payment

in the end i got billed for $7 in API calls, so probably around 80k input tokens and the equivalent output tokens...? no, that doesn't actually make sense, and i can't figure it out from the logs, but honestly it was really cheap. it just isn't that much actual text output to set up an aws, most of the tokens got spent debugging a "too many redirects" error having to do with the ALB and nginx both trying to redirect to https
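for what it's worth, if i assume opus 4 list pricing of $15 per million input tokens and $75 per million output tokens (going from memory, so treat both numbers as assumptions), 80k tokens each way actually lands right around $7:

```python
# back-of-envelope check under assumed opus 4 list pricing:
# $15 per 1M input tokens, $75 per 1M output tokens (both assumed)
input_tokens = 80_000
output_tokens = 80_000

cost = input_tokens * 15 / 1_000_000 + output_tokens * 75 / 1_000_000
print(f"${cost:.2f}")  # prints $7.20, close to the $7 actually billed
```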

but obviously if i had tried to do this through the 'plan' with some kind of script that tried to parse the subscription webui, it would have taken a lot more tokens to try to explain to claude how to interact with cloudshell

but based on what opus says about how many "tokens" a webui convo gets on the pro plan, it probably would have taken two sessions on the webui copy-and-pasting back and forth. a little bit to set up the scenario "hey, i'm going to take your codeblocks and paste them into cloudshell, then copy the output and paste it back here", then go go go... i think you'd hit the usage cap before you finished, and need to wait an hour or two, and then finish up in the second session

but then you'd still have 28 days left on the plan. man, those webui plans are really good, aren't they?

Gary Mindlin Miguel:

You can use the Claude subscription for Claude code or web UI, FYI.

John Wittle:

yeah... idk. i keep using my own homebrewed tools for letting claude interact with a CLI, i haven't really given 'claude code' a shot

but this is not the first time i've heard someone imply that i really ought to

i think i'm a bit frustrated with the pace of updates, it feels like just yesterday my friends were telling me to stop trying to learn all the different APIs and just plug my API keys into openrouter for a uniform interface

is claude code worth redesigning my workflow? or should I just wait 6 months for the *next* round of everything changing?

Pierre Brunelle:

Any tip on how to get Claude Code to better follow instructions from the CLAUDE.md file? I say in 3 different ways that Python imports should always be at the module level and not the function level (including a negative example), and yet it still sometimes imports packages within test functions when there’s no reason to.
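For concreteness, this is the kind of pattern I mean (a hypothetical example, not my actual code): I ask for the first form and keep getting the second.

```python
# what I ask for in CLAUDE.md: imports at module level
import json


def test_roundtrip():
    assert json.loads(json.dumps({"a": 1})) == {"a": 1}


# what Claude Code sometimes writes anyway: the import buried inside the test
def test_roundtrip_function_level_import():
    import json  # function-level import with no reason for it
    assert json.loads(json.dumps({"a": 1})) == {"a": 1}
```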

John Wittle:

re: claude code... frankly, idk. i'm still not really using it, i'm still using a bunch of out of date bespoke tools i wrote for myself last year, which basically act as an extension of the normal 'conversation' format except claude can put certain keywords at the start of its reply, which will get picked up by a parsing script and trigger stuff like shell commands or file reads or w/e, and can pipe output back to claude in the next response
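roughly like this, if you squint (the keyword and the details are invented for illustration; my real scripts are messier):

```python
# toy version of the keyword-dispatch idea: look for a keyword at the start
# of claude's reply, run the shell command it names, and return the output
# so it can be piped back as the next message. "SHELL:" is an invented
# keyword here, not what my actual tool uses.
import subprocess


def handle_reply(reply: str) -> str | None:
    """Return tool output to send back to claude, or None if it's just chat."""
    lines = reply.splitlines()
    first_line = lines[0] if lines else ""
    if first_line.startswith("SHELL:"):
        cmd = first_line[len("SHELL:"):].strip()
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=120
        )
        return f"exit {result.returncode}\n{result.stdout}{result.stderr}"
    return None
```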

this is pretty far removed from how claude code works, so my advice might be a bit nonapplicable. but back when i encountered problems like this on a regular basis, the strategy i found most effective was to essentially poke fun at claude for forgetting the instruction

the actual tone seemed to matter? if i treated claude the way i'd treat a human employee, managing their feelings, it worked a lot better. "ahh, it looks like you might have forgotten the directive about not overwriting comments? remember the trouble we had when you overwrote the comments describing each function, with comments that instead merely explained the most recent change to the function, and then we ended up forgetting what certain functions were supposed to do? for a superhuman AI with an attention window the size of a library, it sure is difficult to get you to behave :P"

and then claude responds with a chagrined but warm and friendly response apologizing and performing mild sycophancy, and goes back and fixes it

and then after a few iterations of this, that particular claude instance would stop making the mistake

i'm not sure which parts of this are important. maybe the conversational format in general, maybe the janus-style "treating claude the way you'd treat a person", maybe the fact that it gets a chance to respond with normal conversation which lets it further shape its future context window more precisely...

...but i wouldn't expect it to work nearly as well if the chastisement came in the form of some kind of CORRECTION interrupt that quoted the relevant claude.md lines, and then its response was to just immediately go try to fix the mistake, without that conversational interaction happening. i'm not sure why i have this intuition, and i don't fully endorse it, it's just a kind of kneejerk reaction away from doing something i'd expect not to work

this is one of the reasons i haven't adopted claude code, it seems a lot harder to take solutions developed in basically-just-an-enhanced-webui-conversation contexts and apply them to these new tools that move away from that format. but obviously my strategies don't scale and i'm getting left behind by the tools that *can* scale, so don't take this as me saying you should go back to development via claude.ai conversations and copy-paste.

but maybe there might be ways to apply some of these ideas regardless? it feels really hacky and ugly, but if i were deadset on using a tool like claude code and started running into this issue, i might fabricate a conversation log that resembled something like the above, including claude's somewhat cringy responses, and find a way to dump it into the context window. maybe in a way that was plausible, like some kind of "initial-conversation.log" file... but maybe that wouldn't actually matter? hard to say
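concretely, something like this, maybe (the message shape follows the anthropic messages api, but the exchange itself is fabricated, which is the whole point):

```python
# sketch of seeding the context with a fabricated prior correction, so the
# model starts out "remembering" having been reminded of the rule. the
# content is invented; only the {"role": ..., "content": ...} shape matches
# the anthropic messages API.
seeded_history = [
    {
        "role": "user",
        "content": "ahh, looks like you forgot the module-level-imports rule "
                   "from CLAUDE.md again? remember the duplicate test imports "
                   "we had to clean up last time :P",
    },
    {
        "role": "assistant",
        "content": "You're right, apologies! I'll keep every import at module "
                   "level from here on and fix the ones I just added.",
    },
]

# prepend the fabricated exchange to the real conversation before sending it
messages = seeded_history + [
    {"role": "user", "content": "Please add tests for the parser module."},
]
```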

Pierre Brunelle:

Very interesting. I was actually going in the opposite direction, writing in capital letters and using lots of “do not” and “must.” I don’t know exactly how to port a friendly conversational style to an instruction file, but that gives me a starting point. I’m now more curious to look into system prompts and see the kind of tone that is used. Thank you.

John Wittle:

it's kinda funny, in a way you were treating it more humanlike than i was...

imagine all the training data across the whole web

now imagine you've ended up in a random github with millions of lines of code, and there's some kind of "instructions-for-new-developers.txt" file, and in that file you see tons of "ABSOLUTELY DO NOT X" and "foo MUST be done according to bar" and "it is UTTERLY FORBIDDEN to use xyz on abc"

i suspect you would encounter examples of the forbidden behavior pretty regularly in that codebase, that's why they kept adding more and more emphasis

so, insofar as you're trying to steer opus towards a particular region of correlational inferences in its training data, i suspect you're not actually moving "out of the region" where it's making the mistakes

whereas in the training data where two programmers have a friendly conversation about a mistake one of them keeps making, associated with a given codebase... i suspect the final version of that codebase won't have any examples of the mistake!

idk this is all just sorta intuitive handwaving reasoning here, but you can see what i'm pointing at, yeah?

Chris Merck:

Any idea what became of gpt-4.5? Despite being slow, and non-reasoning, and expensive, and having short context, it felt like the biggest model available until it was pulled. — It had utility for things like content generation in language teaching, and finding niche vegan restaurants that you cannot find on Google Maps. Surely important haha.

Someone (Scott Alexander) shared a rumor a while back that gpt-4.5 was the big training run and was used to distill down to gpt-4.1 (presumably gpt-4.5-distill-gpt-4). But then what is gpt-5? We need a family tree.

Amicus:

It's still available on pro, and broadly I agree - not only was it the best model available at the time, it's *still* the highest raw IQ model (so to speak) available today. But it's expensive to run and not a very good search engine, and that's where the short-term money's at.

Slippin Fall:

Maybe I wasn't paying close enough attention, but wasn't the story line over the past 6 months, boosted mostly by AI2027, but also coming out of the mouths of the OpenAI and Anthropic CEOs, that AGI was at most 3 years away - that they KNEW how to get there - and that ASI would inevitably follow on its heels? It feels a little disingenuous at this point to blame the loss of confidence in DC on a botched roll out. What feels right to me would be to change the narrative to: we have no idea what we're dealing with here, and never did, so let's quit hyping it up so much. It COULD be game changing, so let's make sure everyone knows that, but let's not hijack the government, the economy and massive amounts of water and electricity until we have a little more evidence to go on.

Matt Wigdahl:

"Massive amounts of water". Did you read last Thursday's article? There's a graph there that pretty vividly shows how little of California's water goes to datacenters as opposed to alfalfa and almonds.

Dave Smith (plinq):

Putting aside the nation-state arms race stuff (which, I grant you, is a pretty big thing to put aside...), I do wonder if tactically, in most cases, it would be most effective (contra AGI/ASI risks) to call AI a hype train and hope the funding collapses, since nuanced understandings like the one you offer here are not exactly our species' strong point.

That line, fortunately (!), is already pretty well covered, though, I suppose.

Thanks as usual for the research and the thought you put in -- it always deepens my understanding.

Dave92f1:

s/site/cite

jmtpr:

It's kind of sad watching Zvi learn that capital doesn't care about human survival. There's still time to turn left, man. The majority of people do not want this future, and that's far more important than whether you agree with them about how to run New York's bus system, or whatever.

The capitalist class does not care about, or even listen to, the reasons behind your arguments -- they never have, and they never will. They do whatever they feel like, and then pick and choose whatever ideology best serves as a fig leaf. When they want to invest in software, they say AGI is near. When they want to sell hardware, they say AGI is not coming any time soon. It's a game of Calvinball, and by participating you're only helping them perpetuate the illusion that it's a rational process. You are at best the entertainment at their dinner parties.

Amicus:

Conflating "capital" and "the capitalist class" is, ironically, a more sophisticated version of the same reflexive libertarianism Zvi often succumbs to. The capitalist *class* is fine. I mean, not fine, they're actually kind of horrible - but not actively suicidal. The danger consists in the fact that the capitalists are not really in control. A world where OpenAI has a genuine monopoly on AI models is not a great world, but it's one where human civilization *probably* survives. A world with a competitive market in AI models is one where we're probably doomed. A world with a *perfectly* competitive market in AI models is most likely one where we never get AGI in the first place, because that entails an unacceptable level of risk-tolerance for your average institutional investor.

Apply enough optimization pressure to any goal, and everything else gets sacrificed. The market is the naive optimizer par excellence.

jpr:

Does it really matter whether Sam Altman is an agent who has lost control of the wheel, or a slave of his own greed? Either way, he's going to do what the market demands, or be replaced by someone who will.

One of the biggest mistakes Zvi makes is in thinking that someone independent, who "cares about their own survival" or whatever, could ever emerge from that environment. The supposed masters of the market cannot disobey the market.

Amicus:

Absolutely it matters. Sam Altman can be pressured. Sam Altman *will* be pressured, if things proceed slowly enough: at the end of the day he's got a kid and he would like that kid to live. The aggregate will of the shareholders as the market can express it does not, cannot, and will not respond to the interests of any person, except by accident. You might as well talk physics with a hurricane.

jmtpr:

Even if that's so, surely we can acknowledge this is a weak and circumstantial theory of political change. "Sam Altman will eventually be moved by his love for his children, and when that happens he will reach for rational solutions."

You can call that hope, but to me it sounds defeatist; it essentially stakes the future on the rationality of a few men who, to date, have proven wanting. There are other ways of achieving political goals. I'm not intimating anything sinister here, you could just run for office and get more done that way.

Amicus:

I think you've misunderstood me: I'm saying that Altman is *not* firmly in control, that even if and when he does get his head out of his ass it will not be enough. Ten years ago it might have been, but now the race to the infinite money machine is on and no private actor can stop it. If the technology pans out, then either we get state intervention on a scale the usual suspects will decry as "authoritarian", or else we get to speedrun the industrial revolution from the perspective of the horses.

Altman's rational self-interest can't stop it and neither can the will of the people - you go to war with the institutions you have, and none of our institutions can meaningfully express either.

Anthony Bailey:

I dunno, this seems peculiarly... unempathetic or unimaginative?

These someones are still humans. They are somewhat out-of-distribution humans, sure, but not actual psychopaths. Most are pretty aware of extinction risk, and it is a particular risk tolerance, plus the fear that others can't do things as well as they can, that has them building capabilities. If they get sufficiently scared that everyone dies, I think they can stop, not sacrifice to market gods.

We don't use the word "evitable" enough.

everam:

What would turning left entail? From what I've seen, Zvi has always sought regulatory and state-based solutions to the problem of super intelligence. I don't think I've ever seen him propose a market-based answer?

jmtpr:

This regards Zvi's theory of how regulations and state-based solutions actually come about -- that is, politics. Zvi seems to think that solutions come about via the machinations of a small number of technocratic power players (e.g. Ivanka Trump reading and sharing Situational Awareness). I'm warning Zvi that these mechanisms only serve the market, and the market is pricing out human survival.

A left-wing turn would involve, e.g., Zvi’s attitude towards unions. Zvi greatly dislikes unions because he believes they oppose various forms of progress out of their own self-interest. Maybe so, but one of those interests is their own survival, and for this reason unions almost universally oppose the current direction of AI.

A savvy politician should look at that and think "how useful! A lot of well-organized people I can leverage to get bills passed, popularize my message, win elections, etc.!" But instead Zvi ends up nitpicking whether unions oppose AI for the right reasons and declares them the "enemy" (I'm not exaggerating, he said exactly this about truck drivers' unions opposing self-driving vehicles).

Meanwhile he clings tightly to the technocrats that will eventually destroy him because I don't know, he feels a misplaced sense of kinship with business people? He's got it totally backwards.

Zvi Mowshowitz:

"When will Zvi learn the things he keeps saying, over and over again, week after week, in detail?"

Kevin Thuot:

I agree with this post.

The smallest change that would have altered this current narrative would have been calling o3 GPT-5 instead.

It was a big leap and expectations for GPT-5 would have been met 6 months ago.

Funny that OpenAI's admittedly terrible naming conventions would have such large real-world reverberations.

John:

I have not been super impressed with GPT5-Thinking compared to o3 (both non pro) for coding, but I haven't used 5-Thinking too much yet.

I did use it on one major task, debugging an obscure problem in a program I wrote, turns out I made the bug two years ago and didn't notice. With 5-Thinking it took me about 7 hours of back-and-forth and trial/error to find the bug. In the end I had to paste the 5 affected lines right into the chat (even though I had provided the whole program as context) and say, is there a problem here? To which it replied yes with the exact problem and fixed code. So I sort of had to lead it by the nose to the problem - it may have taken me less time to find it if I didn't have AI help at all. But mostly I felt this wasn't much better than what o3 would have given me.

Another interesting thing was I used 2 sessions to debug it; midway through I started a new 5-Thinking session to get a clean context that wasn't polluted by all the dead ends from the other. That was the context where it found the bug, though not immediately. After that, I went back to the first context and said, I know what the problem is, can you find it, hint: it's in this area (pretty big hint). It did find the problem with that, in the sense that it generally described it and gave me code that fixed it - however it did not declaratively indicate exactly what the problem was in its prose description, which the second session did. It just said, essentially, you are doing something bad.

I appreciate that these models are costing OpenAI less since they have optimized them, but that doesn't really affect my UX. In fact, if anything, 5-Thinking has been slower than o3 was, though I also have not yet been rate limited, which was a routine occurrence with o3. This is just an initial impression though; we'll see how it goes with more usage.

Another annoyance is that I can't use any v5 model via API without doing a biometric check; what crazy person wants to do a biometric check with an Altman-controlled company?

BK:

So for fun I went back and re-read Zvi's writeup on GPT-4. Interesting how the "benchmarks" shifted in that time: there was plenty of hype about vibe-coding, and the big callout on intelligence level was getting A-grade marks on AP exams. Whereas now we actually have evals for coding and are assessing against PhD-level thinking. Essentially, in 2023 GPT-4 graduated high school, and in the two years since it has completed a four-year degree and is now working on a doctorate.

And yet still I see plenty of skepticism on the ability to implement o3+ models on large software projects, many complaints about negative utility with insertions of random code, over-building, and random deletions. The background hype on context length hasn't paid off (I wonder how far back in Zvi's archives I'd have to search to find claims of "infinite context coming soon™!"), and as many a corporate manager has found out, you can't always just throw "more intelligence" at problems; you need structure around that intelligence to ensure it does its job effectively and non-disruptively. We see that work in the "scaffolding" built to allow models like Claude and o3 to beat Pokemon, and I claim that to actually see more real-world impact, we need more of it. To be kind, perhaps this was what OpenAI was trying with its auto-routing of user queries, but if so it seems like an even bigger miss.

In summary: I remain underwhelmed by GPT-5. I acknowledge there is progress, but I am revising down expectations as "what is hype" vs "what is progress" and "how does progress work its way into the rest of the world".

Pierre Brunelle:

> many complaints about negative utility with insertions of random code, over-building and random deletions

I use Claude Code a lot. What you describe is true. For instance it will happily write duplicate test cases in different files. I spend maybe 20-50% of my time revising and correcting the crap it has written.

At the same time I just completed in 15 days the first version of a project that would have taken multiple months otherwise. So I spent maybe 5 days fixing its issues yet the gains are still massive.

To me the important thing is learning in which situations these tools are a net benefit.

Hayden:

they mentioned giving the IMO-gold general model to a few mathematicians, which probably means a public release is at least 6~9 months away? sad but completely expected that the current best models are drifting further and further from the outside view.

Steven Schilinski:

As a daily user of these models, GPT-5 often feels worse to me, and even when I try to measure things more objectively, I don’t see much improvement—in fact, I sometimes see declines in quality.

One intuition I’ve developed after using these systems heavily for over two years is that the benchmarks academics and companies rely on are missing something essential. My sense comes from watching how poorly the models can handle even modest ambiguity in questions.

The analogy I keep coming back to is self-driving cars. On a closed obstacle course, they can perform impressively. But once released into the messy ambiguity of the real world, their many flaws become impossible to miss.

FWIW, it's not just me: my entire team, who all use these tools daily, self-assess GPT-5 as a downgrade.

Colin Walsh:

> We have warnings that we have run out of capacity with which to scale. We haven’t.

I have lots of problems with this article, and specifically this comment. On what basis can you affirmatively assert that this is true? You tend to make these grandiose claims without actually backing them up with anything other than AI Twitter shills. Furthermore, you outright dismiss viewpoints from Gary Marcus and Yann LeCun (opposite ends of the spectrum but united in their views on LLMs, to an extent), who are both (actual) experts in their field. It’s very strange.
