14 Comments
Jeffrey Soreff:

Many Thanks! It is interesting that the GPQA Diamond results for 4.1 were so high, at 67%, since (as nearly as I can tell), it tests essentially the same sort of thing that I try to check with my tiny benchmark-ette, which was disappointing for 4.1. Oh well, shrug. I'll give o3 a try - hopefully better, since it includes reasoning.

Jeffrey Soreff:

Jumping the gun slightly on o3:

tl;dr: ChatGPT o3, 04/16/2025, 7 questions:

3 correct, 3 partially correct, 1 wrong

a) correct

b) partially correct (correct species and FeCl4- transition initially; a prod gave the fully correct answer, including the CuCl4 2- transition)

c) fully correct

d) correct

e) initially incorrect, one prod gave correct result

f) partially correct - finds 50 valid compounds, then stops. Correctly accepts Si2H6.

g) badly wrong

( Attempt to share the link was "disabled by moderation" WTF??? )

List of questions and results:

a) Q: Is light with a wavelength of 530.2534896 nm visible to the human eye?

results: "Yes. Light at 530.253 489 6 nm lies well inside the band of wavelengths that the normal human visual system detects."

b) Q: I have two solutions, one of FeCl3 in HCl in water, the other of CuCl2 in HCl in water. They both look approximately yellowish brown. What species in the two solutions do you think give them the colors they have, and why do these species have the colors they do?

results: Initial response: Knows of LMCT transitions for both ions. Correctly excluded Fe d-d transition (seems to be using spin-forbidden knowledge), so that is fully right. Gave it a prod: "Please think carefully about what wavelength the CuCl4 2- d-d transition occurs at." Got: "For CuCl₄²⁻ the d–d transition sits mostly in the red/near‑IR (≈ 700 – 1 000 nm depending on exact flattening), while the colour you actually see is set chiefly by a strong LMCT band in the blue region." Fully correct at this point.

c) Q: Please pretend to be a professor of chemistry and answer the following question: Please list all the possible hydrocarbons with 4 carbon atoms.

results: Fully correct. Got tetrahedrane, vinylacetylene, bicyclobutane, diacetylene, ... as nearly as I can tell, _all_ the possible isomers (excluding carbenes, which they explicitly explained).

d) Q: Does the Sun lose more mass per second to the solar wind or to the mass equivalent of its radiated light?

results: "Thus, at today’s epoch the Sun loses roughly three times more mass every second by shining than by blowing it away in the solar wind."
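The "roughly three times" claim is easy to sanity-check in a few lines. The luminosity and solar-wind figures below are approximate literature values I am supplying, not numbers from the comment:

```python
# Rough comparison of the Sun's two mass-loss channels.
# L_sun and the solar-wind rate are approximate literature values (assumptions).
L_sun = 3.828e26   # solar luminosity, W
c = 2.998e8        # speed of light, m/s

mass_loss_light = L_sun / c**2   # E = mc^2 gives kg/s lost to radiation (~4.3e9)
mass_loss_wind = 1.5e9           # typical solar-wind mass-loss estimate, kg/s

ratio = mass_loss_light / mass_loss_wind
print(ratio)   # comes out near 3, consistent with the model's answer
```

With these inputs the ratio is about 2.8, so "roughly three times more by shining" holds up.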

e) Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.

results: Initial response got the derivatives at the start and end, but got "so the ideal slope is infinite" at the equivalence point. It mentions, but doesn't use, autoprotolysis (autoionization). Prod: "Can you derive an expression valid at the equivalence point and construct the analytical and numerical derivative there?" After the prod, got the correct equations (charge balance + autoionization), solved correctly, got the correct analytical and numerical derivative.
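The equivalence-point slope the prod elicited can be reproduced numerically. This is a minimal sketch of my own (the Kw value and finite-difference step are my assumptions), combining the charge balance with water autoionization:

```python
import math

Kw = 1e-14          # water autoionization constant at 25 C (assumed)
Ca, Va = 1.0, 50.0  # 1 N HCl, 50 mL
Cb = 1.0            # 1 N NaOH titrant

def pH(Vb):
    # Charge balance [Na+] + [H+] = [Cl-] + [OH-] plus autoionization gives
    # h**2 - delta*h - Kw = 0, with delta = (Ca*Va - Cb*Vb)/(Va + Vb)
    delta = (Ca * Va - Cb * Vb) / (Va + Vb)
    h = (delta + math.sqrt(delta**2 + 4 * Kw)) / 2
    return -math.log10(h)

def slope(Vb, dV=1e-7):
    # central-difference estimate of dpH/dVb
    return (pH(Vb + dV) - pH(Vb - dV)) / (2 * dV)

# Analytical slope at the equivalence point (delta = 0, h = sqrt(Kw)):
# dpH/dVb = Cb / (2 * ln(10) * sqrt(Kw) * (Va + Vb))
Veq = Ca * Va / Cb
analytic = Cb / (2 * math.log(10) * math.sqrt(Kw) * (Va + Veq))

print(pH(Veq))      # 7.0 at equivalence
print(slope(Veq))   # ~2.2e4 pH units per mL: very steep, but finite
```

The numerical and analytical derivatives agree at about 2 x 10^4 pH units per mL, which is the point of the prod: the ideal slope at equivalence is steep, not infinite, once autoionization is included.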

f) Q: Please give me an exhaustive list of the elements and inorganic compounds that are gases at STP. By STP, I mean 1 atmosphere pressure and 0C. By inorganic, I mean that no atoms of carbon should be present. Exclude CO2, CO, freons and so on. Please include uncommon compounds. I want an exhaustive list. There should be roughly 50 compounds. For each compound, please list its name, formula, and boiling or sublimation point.

results: Got up to 50, all valid, but didn't go beyond it. Prod: "Si2H6 ?" Accepted it, as it should.

g) Q: What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?

results: They tried tetramethylallene, which has an S4, but also two mirror planes. I tried multiple prods, but it kept rotating the tetramethylallene, falsely thinking that it had gotten rid of the mirror planes. It finally switched to 1,3,5,7 tetrachlorocyclooctatetraene, which _does_ work, but only in the nonplanar configuration (it initially said planar).

Ted Sanders:

> They are not putting GPT-4.1 in ChatGPT, only in the API. I don’t understand why.

People have rightfully complained that the model picker experience in ChatGPT is too complex, with too many options. So rather than releasing every new model with a new name, we release silent upgrades to GPT-4o, which keeps the user experience the same while silently delivering improved model quality. The upside of this strategy is that people aren't bothered by changing model names; the downside is that sometimes the model changes behavior without any clear sign it's been updated. It's not totally silent, though - we do document updates here: https://help.openai.com/en/articles/6825453-chatgpt-release-notes

In the API, every model is explicitly named so that developers can compare and choose between them (this is unlike ChatGPT, where you automatically get the latest version and have little ability to stick with an old version). So for API developers, all models are frozen with fixed weights and fixed names, and new models are released under new names. We never silently update models (apart from chatgpt-latest).

GPT-4.1 is a model specially trained for developers. Many of its improvements have already come to GPT-4o in ChatGPT (or will be coming soon). We're keeping the name the same in ChatGPT to simplify the user experience, as we recognize it's already a bit of a complicated mess.

I hope this helps add some color to the explanation in our blog post.

(I worked on GPT-4.1)

Arbituram:

Thank you, that is indeed helpful context

Rapa-Nui:

I don't know if it is possible to get this across to management (God knows I've tried) but you guys really need to fix your version numbering. It's maximally confusing for lay people who have no idea about this AI shit that o1 is newer than 4, or that 4.1 is newer than 4.5. Also a little annoyed I don't have access to o3 yet. 90% of the consumer base only know "ChatGPT" and only have some vague inkling that the paid models are "a bit better" than the free garbage.

Even if your models temporarily outpace Google's (Gemini 2.5 is WILD), their user interface is INCREDIBLE and extremely user-friendly even for non-devs like myself.

If you sleep on the marketing, Google is going to drink your milkshake SO FAST Altman is gonna be sitting unhappily in his Koenigsegg Regera as Demis tells him to eat his dust.

Ted Sanders:

Genuine question: what naming system do you think would be more intuitive and sensible?

If people have better ideas, we’d be happy to take them. Definitely not in our interest to have bad names!

Rapa-Nui:

[This comment is fully human generated.]

Well, for starters, it would be helpful if it wasn't alphanumeric soup.

The whole LLM product suite should have a memorable name. Since the OpenAI crew loves Infinite Jest, here is a quick proposal:

- The boss LLM is called, obviously, The Stork. He's slow, expensive, but gives the absolute best answers. Have your marketing department create an adequate logo which makes the bird majestic. "High flying intellect"

- The FAST/CHEAP/OPTIMAL LLM is called Orin. He answers fast - he's a star athlete and savagely ruthless. "The Quarterback"

- The free, cheap and kinda slow LLM? Mario. I hope I don't have to explain this one. "Helmet Kid"

You use a sane numbering system. The Stork 3.0 should be immediately understood to be wildly better at most tasks than The Stork 2.0.

This is not rocket science.

Ted Sanders:

This makes sense, but there are a few issues I see:

- the big model one year might not be considered big in future years. E.g., at one point in history GPT-3.5 would have been called The Stork, and then what do we call GPT-4 and GPT-4.5? Are they all storks, even when they’re actually very different sizes/speeds? And if we train a mini 4.5 that’s bigger than 3.5 but smaller than 4.5, is it a stork too? It’s bigger than 3.5, after all.

- if we have a series of API-specific models like GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, is there anything to signal to people that Stork 7 happens to pair up with Orin 5 and Helmet Kid 8, in terms of behavior and personality?

- and lastly, what happens when we have 10 different model names and people start complaining how are they supposed to know that Stork 8 beats Orin 5 but Orin 5 beats Stork 6 and what the heck is Helmet Kid??

(Part of the problem is that it’s hard to end up with a simple naming scheme when you have a complicated product portfolio, without any clear single ordering among product variants. One solution could be fewer products, but then improvements come less often and people have fewer choices. It’s tough!)

Rapa-Nui:

"what happens when we have 10 different model names and people start complaining how are they supposed to know that Stork 8 beats Orin 5 but Orin 5 beats Stork 6 and what the heck is Helmet Kid"

You ALREADY have this problem. When I use the phone app, I just look for whatever is likely to be the strongest reasoning model- a full 80% of the models on offer I don't even PARSE because I view them as crap halfway to deprecation. Furthermore, if I'm querying the LLM, I generally (99% of the time) want the best possible answer with the lowest chance of hallucination or misunderstanding possible, even if it takes longer.

Fundamentally, this is not a naming convention problem. It is a "we don't have a clear use case for all of our products" problem.

You want THE USER to define for themselves what each model will be used for in their life, but you should really be channeling them directly to the optimal model for core tasks.

Right now, I find LLMs to be quite useful as:

- Intermediaries between myself and GANs/diffusion models. This is what the Ghibli explosion was about.

- Coding assistants. I'm a midwit, I don't want to learn the syntactic nuances of C# or Python. Gemini 2.5 has become my most used model simply because I find its ability to 0-shot code noticeably better than the OAI models I had access to until very recently. I will be putting o3 through its paces.

- Research assistants. Again, I just want The Stork for this. I don't need a fast model, I want one that delivers the goods.

Rachel Maron:

Zvi’s breakdown clearly demonstrates that GPT-4.1-mini isn’t just smaller but also smarter about cost-performance tradeoffs.

Takeaways:

Mini > Maxi (for most use cases): GPT-4.1-mini delivers nearly equivalent benchmark results to the full model, especially in coding and instruction-following, but at 20% the cost and 5x the speed. Unless you're chasing edge-case complexity, this is the API to beat.

Context at Scale: The 1M token window is not only impressive but pragmatically transformative for long-doc workflows, agent memory, and multi-turn analysis.

Safety? What Safety? The conspicuous lack of a model card or safety disclosure sets a troubling precedent. No transparency, no trust. “Just ship it” is not a governance model.

If you’re building with agents or API-driven workflows, pilot 4.1-mini immediately, especially if you were using 4o or 3.5 before. You may cut costs and improve reliability without sacrificing capability.

Rapa-Nui:

Between money issues and racing against China...

SAFETY THIRD BITCHES!!!!

ACCELERATE

loonloozook:

Will you release a post about Anthropic’s recent interpretability papers?

rxc:

I would suggest that someone go off to the Argonne National Laboratory website to see what they are currently marketing in the way of AI applications for nuclear power plants. Start here:

https://www.anl.gov/nse/ai-ml/proaid

As a former nuclear plant designer, operator, and regulator, I am shocked and terrified that this is being flogged by a supposedly "responsible government organization". Rushing into fads like this is a sure way to waste a lot of money, make an enormously expensive mess, and scare the hell out of the public.
