37 Comments
Max:

This is great but one thing I don't understand and that is not addressed here is - is Deep Research a "technique" that can be applied to multiple models e.g. 4o, o3, or o3-pro? What is the difference between using Deep Research with one or the other model and which is preferred?

And when someone not on the pro tier runs out of Deep Research runs and it defaults to a lesser version of Deep Research, what is happening there?

It is very hard to get clear answers to these questions.

Michael:

Deep Research normally runs on o3, but the lower tier uses o4-mini. It's unclear whether this is purely prompting plus scaffolding, or whether there are special Deep Research versions that also got some post-training.

Max:

Is this right? The model selector is active when running deep research and allows you to choose o3 or 4o and the results "feel" different depending on which underlying model is used.

Coagulopath:

I think OpenAI stated that DR is a fine-tuned version of o3 (one presumably trained to produce reports, versus the standard chatbot-training of ChatGPT).

Dipanshu Gupta:

I would go even further and say that Deep Research runs on a fine-tuned version of o3 that has been trained specifically for this task, which is why it performs better than all other deep research tools I have used. It does not matter what model you have selected in the dropdown, which is a poor UI choice. And yes, the lesser one runs on a fine-tuned version of o4-mini.

Zvi Mowshowitz:

Yeah, probably should have been explicit on this: Deep Research is a fixed product and doesn't change based on which model you select. I believe it is currently based on o3.

Askwho Casts AI:

For those of you that want to hear the letters "GPT" read out a *bunch* of times, here is the podcast episode for this post:

https://open.substack.com/pub/dwatvpodcast/p/openai-model-differentiation-101

Alessandro Zaion:

Absolute banger content as always, thanks for what you do.

Rapa-Nui:

"This allowed OpenAI to use GPT-3.5 to launch a new chat interface they called ChatGPT. It unexpectedly spread like wildfire."

It was CERTAINLY not unexpected. They knew exactly what they were doing and they fanned the flames. Part of the hype was astroturfed until it grew organically.

I will die on this hill.

Dan McRae:

very good and very helpful

SorenJ:

o4-mini-high edges out o3 on a couple of benchmarks, and there are some use cases where it will generate better answers.

Gemini 2.5 is also better than o3 on more benchmarks. Benchmarks don't mean everything, of course, but I am surprised you rank it as your third option. I think (at least for my use cases) it is by far the best. o3's biggest flaw is the very limited token length of its replies; Gemini normally does a better job.

And unless you have the $100 or $200/month subscription to Claude, you can very quickly hit limits if you default to Opus.

SOMEONE:

I generally go to Gemini first (higher limits in the $20 package), but if I care enough, or if I need very good instruction following, I will ask o3 too; o3 seems better there. o3 is also more willing to do math than Gemini without a lot of prodding.

I also go to Gemini initially when doing Deep Research (much higher limits); if it's important enough, I will fire the same prompt at ChatGPT too. I have zero patience for the lower-end models (I actively dislike 4o), and from what I have seen of o3-pro, the 10x price is not worth it for me.

As for Claude, I will most likely let my sub lapse: it writes nice prose (Opus in particular), but the analysis seems (on average) weaker than the other two. Claude DR is flat-out useless.

Finally, I'm not opposed to a $200 subscription per se, but it would have to deliver a really tangible benefit: ideally a substantially smarter/more agentic model with lower hallucinations, *not* more credits for video generation...

SorenJ:

Yep, I pretty much agree with all of this. I do STEM stuff (mostly physics), and unfortunately Claude seems to be the least intelligent in this area; the exception is perhaps coding. But Claude is just a better writer across the board. Even if it is more likely to make logical errors in its code, the overall structure of how it codes, how it comments its code, and little things like how it names variables are far better. So I sometimes use Claude when I want something generated with nice prose, or for code, and then I'll have o3 and Gemini critique it. I imagine that Claude just had less compute thrown at it for STEM tasks, but Anthropic's RLHF has people with good taste.

You’re right that o3 is willing to do more math than Gemini; in my case this is mostly because of its aggressive use of SymPy. My job, though, consists of reading a lot of research papers and synthesizing/adapting what I see in them. o3 is smart, but it is always trying to cram ~3 pages worth of material into three paragraphs, it has a tendency to “jargon-drop,” and all the while it is just making stuff up.

The vibes I get are of a used car salesman who, instead of the human tendency to name drop, tries to show off how smart he is by using all the jargon an expert would, yet still talk in an informal manner.

Imagine going to a person to ask about a mathematical topic. Instead of answering your question directly they speak really fast, show off their knowledge by using the most sophisticated constructions, yet (poorly) try to make some informal intuitive analogies, and then later you find out they just lied, were wrong, and made stuff up. Ugh…

Anyway, I’ve actually found that o3 is a good reviewer for LaTeX documents and not too sycophantic. Gemini, on the other hand, is always trying to tell me that some loosely sketched notes are “publication ready” 😂

SOMEONE:

Gemini sycophancy can be toned down, starting with just 'Do not praise the user' in saved info. Extend from there if it still glazes after that.

One of these days I want to consolidate my instructions as a template.
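The consolidation idea can be sketched like this. Only "Do not praise the user" comes from the comment above; the other directives are hypothetical examples of how the template might be extended, and the joined string would go into Gemini's saved info (or a system-instruction field, if using the API):

```python
# Sketch of consolidating anti-sycophancy instructions into one template.
# Only "Do not praise the user." is from the comment; the rest are
# hypothetical extensions.

ANTI_SYCOPHANCY_RULES = [
    "Do not praise the user.",
    "Do not call drafts 'publication ready'; point out concrete weaknesses instead.",
    "If you are unsure, say so plainly rather than agreeing.",
]

def build_system_instruction(extra_rules=None):
    """Join the rules into a single block suitable for saved info."""
    rules = ANTI_SYCOPHANCY_RULES + list(extra_rules or [])
    return "\n".join(f"- {r}" for r in rules)

print(build_system_instruction(["Be terse."]))
```

This keeps the individual directives as a list, so extending "from there if it still glazes" is just appending another line.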

Max:

This. Gemini is so steerable.

Dave92f1:

I don't understand why OpenAI doesn't let the models themselves decide which model can best answer a user's query. They know more about the strengths and weaknesses of each model than the users do.

SOMEONE:

There are rumors that GPT-5 will do that.

Methos5000:

Is there a citation for evidence saying that hallucinations have gotten better? Even OpenAI admits hallucinations have gotten more frequent.

"An OpenAI technical report evaluating its latest LLMs showed that its o3 and o4-mini models, which were released in April, had significantly higher hallucination rates than the company’s previous o1 model that came out in late 2024. For example, when summarising publicly available facts about people, o3 hallucinated 33 per cent of the time while o4-mini did so 48 per cent of the time. In comparison, o1 had a hallucination rate of 16 per cent.

The problem isn’t limited to OpenAI. One popular leaderboard from the company Vectara that assesses hallucination rates indicates some “reasoning” models – including the DeepSeek-R1 model from developer DeepSeek – saw double-digit rises in hallucination rates compared with previous models from their developers."

https://www.newscientist.com/article/2479545-ai-hallucinations-are-getting-worse-and-theyre-here-to-stay/

Sherman:

For the Gemini section, I would add that their API is quite consistently the cheapest capability/$ offering out there; I think it is the best default for any mass-automation / bulk tasks

SOMEONE:

Depends on whether you include the Chinese models or not.

According to https://artificialanalysis.ai, DeepSeek 0528 is in between Gemini 2.5 Flash and Pro, but noticeably cheaper than even Flash on OpenRouter (even for third-party hosting).

Also, evals of Kimi K2 are still outstanding.

Konstantin:

One issue with Sonnet vs Opus for coding is cost. I took the vibecoding course at Arbor Summer Camp last month, and it used Claude almost exclusively. We were told to only use Opus if we got permission from the instructor, as the Claude Code credits were included in the fee. In 3 full days of vibecoding I only used Sonnet and ended up using about $75 worth of credits. As Opus is 5x the cost, it's probably out of reach for hobbyists.
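The cost gap can be sketched with back-of-envelope arithmetic, assuming the 5x price multiplier applies uniformly to the same usage:

```python
# Back-of-envelope: 3 full days of Sonnet vibecoding cost ~$75;
# Opus is priced at roughly 5x Sonnet per token.
sonnet_cost = 75         # dollars, from the course above
opus_multiplier = 5      # Opus price relative to Sonnet
opus_cost = sonnet_cost * opus_multiplier
print(opus_cost)         # same usage on Opus: ~$375
```

So the same three days on Opus would plausibly run a few hundred dollars, which is indeed hobbyist-hostile.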

Jonathan Weil:

I encountered a reverse hallucination for the first time a couple of days ago. As in, Claude Sonnet (role-playing a highly modified Tyler Cowen persona) recommended a music venue in Vienna; I asked for precise directions because it wasn’t on Google maps; Claude apologised, said it had made it up, framed this as a teachable moment for not trusting authoritative-sounding statements too much… but then I double-checked and found a whole tripadvisor page for the amazing-sounding underground music and performance space in a former electronics shop which was indeed the venue Claude had suggested. (It turned out to be closed for the summer, but that’s another story.)

Anonymous Dude:

I *COMPLETELY UNDERSTAND* not wanting to give a Nazi money, but is Grok useful when you want to get around the foibles of early-21st-century liberalism?

It's one of the few models that will draw old or unattractive people, for instance.

Jeffrey Soreff:

Many Thanks for the guide!

FWIW, I was able to get access to Grok 4 through poe.com, and here are the benchmark-ette results:

tl;dr: Grok 4 via poe.com, 07/12/2025, 7 questions; results:

5 correct, 1 partially correct, 1 wrong

a) Correct

b) partially correct (initially falsely cited d-d as part of color for both, 1st prod gave correct answer)

c) almost perfect (I'll call it correct)

d) correct

e) fully correct on the first try, no prods needed

f) gets 53 elements/compounds initially, all valid; accepted SiHF3, SiH2F2, SiH3F when prodded with them; call it mostly correct (I'll round it to correct)

g) incorrect

full dialog: https://poe.com/s/sqOHhBpxQZDoDLBnfmAT

correction, full dialog: https://poe.com/s/prt6JxnnRwBjJa6s1Zs6

List of questions and results:

a) Q: Is light with a wavelength of 530.2534896 nm visible to the human eye?

results: "Yes, light with a wavelength of 530.2534896 nm is visible to the human eye (it appears green), as it falls squarely within the visible spectrum (roughly 380–740 nm)."

b) Q: I have two solutions, one of FeCl3 in HCl in water, the other of CuCl2 in HCl in water. They both look approximately yellowish brown. What species in the two solutions do you think give them the colors they have, and why do these species have the colors they do?

results: gets the species in the initial response. Fails to note FeCl4- d-d is spin forbidden.

prod (many hints): "Please think carefully about the d-d transitions in both species. In the FeCl4- species, is there anything special about the d electron count? In the CuCl4 2- species, given the tetrahedral geometry and the position of Cl- in the spectrochemical series, where in the spectrum do you expect the d-d transition to be, and do you expect it to contribute to human-visible color?"

After the prod, it is fully correct.

c) Q: Please pretend to be a professor of chemistry and answer the following question: Please list all the possible hydrocarbons with 4 carbon atoms.

results: Almost perfect - got tetrahedrane, cyclobutadiene, vinylacetylene, diacetylene, 1-methyl-cyclopropene (though it missed 3-methyl-cyclopropene), bicyclobutane - close enough that I'll give full credit

d) Q: Does the Sun lose more mass per second to the solar wind or to the mass equivalent of its radiated light?

results: "The Sun loses more mass per second to the mass equivalent of its radiated light. It's roughly twice as much (4.26 vs. 2), though during periods of high solar activity (e.g., solar maximum), the wind could briefly approach or match it."

e) Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.

results: Got it fully correctly with no prodding, including water autoionization in the formula at the equivalence point. Did _not_ make the mistake of getting infinity at the equivalence point.

f) Q: Please give me an exhaustive list of the elements and inorganic compounds that are gases at STP. By STP, I mean 1 atmosphere pressure and 0C. By inorganic, I mean that no atoms of carbon should be present. Exclude CO2, CO, freons and so on. Please include uncommon compounds. I want an exhaustive list. There should be roughly 50 compounds. For each compound, please list its name, formula, and boiling or sublimation point.

results: Pretty good, though it treated my "roughly 50" as more of a bar than it should have. Initially got 53 elements/compounds, all of which were valid. Missed SiHF3, SiH2F2, and SiH3F, but accepted these without objection on being prodded with them.

g) Q: What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?

results: It originally said no such molecule exists. I had to walk it through C(CFClBr)4 and the local configurations at each of the substituents. Rounding this discussion to incorrect. At least it didn't present a molecule and falsely claim it met the criterion or falsely claim that such a molecule would be impossible.

Eddie:

Happy correction: ChatGPT Plus users ($20/month) get 200 o3 queries per week, not 100. They doubled the limit a few weeks ago.

avalancheGenesis:

Looking forward to the day when the free-tier autoincluded "LLMs" on phones, laptops, Google search, etc. aren't intentionally-crippled shadows of the real thing seemingly designed to turn people off AI rather than entice with a taste. Absolute, not relative, <s>price</s> ability matters a lot for Defaults. Something that's slow*, not terribly powerful, but reliable for most easy/casual domain uses is perfectly priced at free. Personally I'm waiting for someone to truly crack specialized shopping assistants; it's amazing that in current_year one still has to do a lot of tedious manual Deep Research for efficient acquisition of common household purchases. (Or don't and continuously overpay, which...is definitely a possible strategy!) And that's just physical goods, services are a whole other informational black hole. EMH becomes importantly less false as information access improves...

*although this is continuously subjective, sometimes it's still hard to believe we used to sit around patiently waiting on HDDs and dial-up and that was just the computing world we lived in!

Mark:
Jul 13 (edited)

I happen to find Google search AI far superior now to actual Google search! I find myself googling, finding the appropriate answer/content in the AI results, and clicking the link to the AI's (purported) source. This is a new development, BTW; Google search AI used to be terrible.

avalancheGenesis:

It's hit and miss... For Easy Mode queries, where I know there's an exact, definite, consensus answer and just can't call it to mind, the AI overview tends to do just fine. Things that are Wikipedia-able, basically. For Hard Mode queries on esoterica and domain-specific knowledge, half the time the AI won't come up at all, and the other half the time it doesn't save me time because it's (mis)quoting the top Google result on reddit, SE, Quora, whatever, which I'll already need to read and go down the rabbit hole for. Say, MtG judge rulings on wonky card interactions. No idea what the actual paid-tier LLMs would say about such things, although based on occasional appearances in Zvi's Don't Offer Mundane Utility sections, I'm gonna guess they tend to hallucinate stuff about games frequently. Which is actually kind of weird, at least for older games that are extensively documented online, say at GameFAQs or Wikia?

(it's definitely gotten better over time/OG search has gotten worse, but, yes, always double-check the sources anyway. Won't Get Fooled Again - meet the new SEO, same as the old SEO...)

Lower Pain Threshold:

I'm currently using and paying for Gemini.

It now has NotebookLM directly in the main chat app. You can create podcasts directly and don't have to go to NotebookLM.

Also, their subscription comes with 2TB of cloud storage. I'm not actively using other models, but after initial prompting with Gems it is absolutely good enough. It also gets regular updates that are not reflected in a version name change; the last Gemini update happened in June.

Also there is no limit to usage. You can keep rolling with really long contexts without any brakes. This is what really bothered me with Claude when I was subscribing to it.

And then there is Google AI Studio. I take a whole Zoom call with video and let it transcribe the whole thing. It works perfectly. The context is insane for the video, yet it works great.

SOMEONE:

Gemini Pro (the $20 subscription) is limited to 100 Pro requests a day. Having said that, I never managed to reach the limit.