24 Comments
markus's avatar

Thanks, great write-up as always. Asked GPT-5-Thinking about the lead of Gemini 2.5 Pro over GPT-5 on Polymarket and it got it right away, with the correct scores (i.e., with style points off).

Fergus Argyll's avatar

"The point about internal testing should also be emphasized. If indeed there are superior internal models, those are the ones currently creating the most danger, and most in need of proper frontier testing."

I'll listen to any argument but I will immediately dismiss any argument of this form. Two years after "What did Ilya see" - he saw GPT-4o telling him he's a genius. The labs release their very best and this is currently their very best. Maybe it's good, maybe it's not, but deal with it.

Evan Brizius's avatar

"The default instinct is to react to GPT-5 as if it is the best OpenAI can do, and to update on progress based on that assumption. That is a dangerous assumption to be making, as it could be substantially wrong, and the same is true for Anthropic."

I'm not ready to declare the death of the LLM yet, but I have to be honest that this release is causing me to update in that direction. OpenAI had every economic incentive to wow the world with another huge step forward, and instead we get comically fucked up visuals in the live demo. "We keep the *really* powerful models in-house" reasoning could be used to explain away anything. Now, I am open to the possibility (it would explain why investors remain so confident plowing money into this), but it should carry a meaningful discount.

loonloozook's avatar

>I have to be honest that this release is causing me to update in that direction

Why? If I am not mistaken, even the most bullish estimates, such as AI 2027, predicted 2025 as rather boring regarding the development of base models.

alpaca's avatar

If they had something better, Anthropic wouldn't have to ban them from using Claude Code.

My most likely hypothesis is: what we see is what they got, and they're getting desperate because of their burn rate, for-profit conversion, etc., so they had to release something now to not lose the API market completely.

Mike Bauer's avatar

Are you familiar with the ideas of sales and marketing?

Altman is selling his company’s wares

GPT-5 is a good model, works well. Clean replies, laid out better, adds value to me as an engineer and entrepreneur.

Am I only using one model? No, just like I used different calculators, software programs, etc.

JaziTricks's avatar

Regarding "changing model for each question": this isn't true.

The Android app defaults to my previous model choice (5-Thinking) in the same chat window.

It even defaults to Thinking in new chats.

I'm a Plus user.

Fergus Argyll's avatar

Same for me on desktop

mrx's avatar
Aug 11 (edited)

I think that was in reference to the new autoswitcher/router, not the model picker in the UI.

Ted Sanders's avatar

>My understanding is that long context needle is pretty solid for other models, but it’s hard to directly compare because this is an internal benchmark.

Small correction: OpenAI MRCR is fully open source. It’s branded OpenAI not because it’s internal or because we’re trying to claim credit, but just to disambiguate that it’s the version we cleaned up and improved. https://huggingface.co/datasets/openai/mrcr

I think there are a few places comparing long context across models. When I googled, the first one I found was artificial analysis, which ranks gpt-5 #1 for long context reasoning (maybe not the best measure, since it mixed in reasoning). https://artificialanalysis.ai/articles/gpt-5-benchmarks-and-analysis

Ron Bodkin's avatar

There are a lot of users on TikTok complaining because they lost the sycophancy of 4o in the upgrade to 5...

Michael's avatar

On the other hand, for those of us with good taste, the new "Robot" personality is fantastic.

Michael's avatar

It feels like the router has improved a lot over the weekend. Of course, this is totally a case where I could be fooling myself.

The total lack of transparency breeds paranoia given imperfectly aligned interests between OpenAI and the user, so the router really does have to be a near perfect "magic" user experience for this not to be a problem, and it is definitely not that.

I do expect it can be made to work well enough and that other consumer chat assistants will adopt it, especially given that models adaptively choosing how many reasoning tokens to use is already how things work anyway (sometimes you get "thought for 1s").

Vince's avatar

“Ideally one would have a settings option at least for the $200 tier to say ‘no I do not want to use your judgment, I want you to use the version I selected.’” - this is already the case: you can use the basic model with the router, or you can select one yourself. I just always select Thinking, and it’s basically exactly what you said - a slightly better o3. It is annoying that the context-length improvements of base 5 don’t seem to translate to Thinking 5, though.

On a separate note, with the model realizing that it’s on a test environment 1/4 or so of the time, I’d be really interested to know what the percentage is for actual use cases. I.e., when I put in a prompt, does it think “since this is an evaluation” 25% of the time, or 0%, or somewhere in between? That greatly shapes what I think about the model’s situational awareness.

michael michalchik's avatar

I am hearing wildly different reports on people's experiences with ChatGPT-5. This is very interesting to me. So far, for me, the experience has been mildly positive but underwhelming, and an indication that for OpenAI, at least, progress has slowed down.

Just speculating off the top of my head, here are some factors that might make a difference.

ChatGPT-5 is not a reasoning model, or at least not automatically a reasoning model. It's unclear to me what it's doing when it engages thinking mode. This might make it inferior to o3, which has extended reasoning, but not inferior to 4o, which is the base model that doesn't.

There's an auto-switch between extended thinking and base thinking. The way you ask your questions or phrase the prompt will push it in and out of this mode, and some people may naturally prompt in ways that engage a lot of thinking while other people may not.

It may be more sensitive to the system prompt that you set up in your customization. My version of ChatGPT seems to be more responsive to my instructions in my customization than previous versions of ChatGPT, though these things are hard to quantify.

The switching between reasoning modes seems partially dependent on overall system load. High-demand times may automatically switch it out of extended thinking; low-demand times, which is when I usually operate, may be more permissive with extended thinking.

People's expectations may be interfering with their ability to judge progress. When you expect a big breakthrough, and perhaps don't have a very clear and easy-to-remember notion of past performance, you may be comparing actual performance against your expectations of large improvements, and that comparison makes it seem like things are getting worse.

People also may have gotten used to the flattery and sycophancy of previous models and be disappointed that the AI is not as friendly and supportive as it used to be. They may feel like responses are less inspiring, which also feels like lower quality. The sycophancy seems lower than 4o's to me.

There may be problems with some of the expert subsystems in the mixture of experts, or there might be problems with switching appropriately between them. So depending upon the topics and details of the prompts that you're interested in, you may get appropriate or inappropriate sub-specialist models. Again, I may be having fewer problems because my custom instructions have things in them to enhance selecting the correct expert on the topic.

Edited by 5.0 thinking mode.

Steven Adler's avatar

Re: OpenAI: "Across all our red teaming campaigns, this work comprised more than 9,000 hours of work from over 400 external testers and experts."

Zvi: "I am down for 400 external testers. I would have liked more than 22.5 hours per tester?"

^ I am pretty unsure what this number of hours represents. I notice that I feel pretty deeply skeptical that there were 9,000 hours of deep, substantive red-teaming of the form I think is trying to be conveyed.

Mark Russell's avatar

Thanks for all the work you are doing on this. You and Zvi and Eliezer are part of the quiet army trying to keep the human race alive, and I appreciate it. Out of gratitude, I will no longer imagine you as the drummer of Guns N' Roses, a comparison of which you are surely sick (I myself share a name with a famous humorist). After all, his real name was actually Michael Coletti, so you are the one and only true Steven Adler!

Steven Adler's avatar

Oh that's funny! I had no idea he was born with a different name.

Unfortunately there is still the Austin, TX mayor Steven Adler, the former IBM exec Steven Adler, etc. Thanks for the kind words.

Mark Russell's avatar

Yeah, name commonalities. I once, in the 90s, had an airline agent scroll for my frequent flyer miles by name. I was the 2nd to last of 238 Mark Russells. We laughed aloud.

Mark Russell's avatar

>Manifold markets asks, will Pliny jailbreak GPT-5 on launch day? And somehow it took several hours for people to respond with ‘obviously yes.’
>
>It actually took several hours. It did not during that time occur to me that this might be because Pliny was actually finding this hard.

Gotta jump in here with this. Prediction markets have been rubbing me the wrong way and now I can say why. This is a perfect example of (a) how prediction markets are set up incorrectly, and (b) how they are broken and bleeding epistemic/predictive validity.

So much of this stuff is responses geared to the scoring: either constantly updating, or worse, waiting to see who won a footrace, or was about to win, then posting your prediction and taking credit for being oh-so-smart. And now we have caught you in the act.

Time to change the way these markets score answers. There has to be something on the line for putting in a correct answer early. There has to be something on the line, scoring-wise, for giving a prediction that could still turn out to be wrong, but might actually be right, and then scoring that appropriately. Visionary people need to be rewarded! You can't just have a bunch of people lurking, mouse in hand, when the black swan shows up, and then give them credit for saying "Black Swan!" It has to be less credit than the person who said "Black Swan" a week ago and never updated. It has to. Really.

Thank you, Zvi, for sharing this smoking gun. Manifold has turned into what I feared it would turn into: a bunch of people waiting for X to equal Y who then say "X will equal Y" and want credit for being right.

vtsteve's avatar

"That’s your queue, sir."

Took a minute to register. But how does it cache out? lol

avalancheGenesis's avatar

Curious if you changed mail settings for this post - somehow it never hit my inbox or spam folder at all, I got no notifications, and would have missed it entirely if not linked to from the followup post. Definitely thought it was weird not to get a timely GPT-5 take from the one place I'd expect it most. Makes me a bit paranoid that I might have missed other DWATV content. Gotta keep that 100% Completion ratio up...!

Mike Lambert's avatar

“Also there’s something weird. GPT-5-Thinking has a 6% k=1 attack rate, and a 56.8% k=10 attack rate, whereas with 10 random attempts you would expect k=10 at 46%.”

I suspect the k=1 number is an average over 10 attempts. Computing 1-(1-a)**10 would be correct, except that there is variance in the per-prompt attack rate, which I believe ends up pushing this attack rate higher.
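For concreteness, a minimal sketch of the arithmetic under discussion. The independence assumption and the two-group mixture are my own illustrations, not from the post:

```python
def k_attempt_rate(a: float, k: int = 10) -> float:
    """P(at least one of k independent attempts succeeds)."""
    return 1 - (1 - a) ** k

# A uniform 6% per-attempt rate reproduces the ~46% naive figure quoted above.
print(f"uniform 6%: {k_attempt_rate(0.06):.1%}")  # 46.1%

# Hypothetical mixture with the same ~6% mean but variance across prompts:
# 10% of prompts are "easy" (55% per attempt), 90% are "hard" (~0.56%).
# The pooled k=10 rate comes out quite different from the naive figure.
groups = [(0.10, 0.55), (0.90, 0.0056)]
pooled = sum(w * k_attempt_rate(a) for w, a in groups)
print(f"mixed prompts: {pooled:.1%}")
```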