I wish we got a bit more evaluation around model releases for non-coding problem-solving – e.g. "how do I achieve [this] in my home renovation given [these idiosyncratic constraints]." o3 was a huge step forward for that stuff and is still better than 2.5 Pro – curious to know how Opus is.
My guess is that coding is the only segment where Anthropic has hit “product-market fit” and found a deep pool of customers willing to pay real money.
This is mostly a good thing - it gives Anthropic a sustainable revenue base and aligns their incentives with customers who are focused on mundane utility.
At the same time, it means that falling behind on coding capability would be an existential risk to the business. Even if they release a model that is world-class in other ways, they would lose a lot of their revenue if it was not very close to the best coding model.
I expect that this will continue to nudge their priorities and training mixtures towards coding capability, at least at the margin.
> R1 is slightly above Sonnet-4-no-thinking, everything above that involves a model from one of the big three labs.
Typo, or a quip against Qwen?
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/claude-4-you-the-quest-for-mundane?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Can someone comment on Claude Max deep research? Deep research is my main reason for having paid subscriptions...
It seems much more aggressive about finding many sources (typically hundreds, though not all make it into the final report) and supposedly has a much higher time cap on how long it will search. I have observed most queries stopping after around 10 minutes, though.
The summarized CoT in the web interface mentions that there's a "lead researcher" orchestrating other copies of the model. Subjectively, it seems about as good as Gemini- or o3-based Deep Research.
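For anyone curious what that pattern looks like mechanically, here's a minimal sketch using the Anthropic Python SDK. To be clear, this is my guess at the shape of the orchestration, not Anthropic's actual Research implementation; the model id and prompts are my assumptions, and the real product would also wire web search tools into the workers.

```python
# Minimal sketch of a "lead researcher" orchestrating worker copies of the
# same model. Hypothetical: not Anthropic's actual Research implementation.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-20250514"  # assumed model id

def ask(prompt: str, system: str = "You are a careful research assistant.") -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def deep_research(query: str) -> str:
    # Lead researcher decomposes the query into independent sub-questions.
    plan = ask(
        f"Decompose this research query into 3-5 independent sub-questions, one per line:\n{query}",
        system="You are the lead researcher coordinating a research team.",
    )
    sub_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # Worker copies each research one sub-question (serially here).
    findings = [ask(f"Research this and summarize the key findings:\n{q}") for q in sub_questions]

    # Lead researcher synthesizes the workers' findings into one report.
    return ask(
        "Synthesize these findings into a single coherent report:\n\n" + "\n\n".join(findings),
        system="You are the lead researcher writing the final report.",
    )
```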
Ultimately I don't like any of these products though. Searching for 30+ minutes and processing hundreds of sources is a powerful capability. "Read this mediocre term paper" is a terrible user interface on top of that.
That sounds quite comparable to how Gemini Pro has behaved since I/O - prompted properly (basically, a customized version of https://lawsen.substack.com/p/getting-the-most-from-deep-research), it will now chew through upwards of 500 pages and use 150-250 in the report. It takes similarly long, but every now and then it sadly freezes...
Gemini's feature that turns the output into a one-page website acting as an infographic (or a more detailed website), complete with graphs, is quite powerful though, as I usually can't be bothered to read the 50 pages either...
Not a fan compared to OpenAI's product. It's a bit better than Google's variant but not worth paying for just yet IMO. I'm planning to cancel my Max plan after this month.
Happy to run a few queries for you if you want.
Interesting, I find OpenAI Deep Research about equal to Gemini Pro (the Flash-based non-Pro version is substantially weaker). When running them side by side they trade blows in my case. In fact I just let my ChatGPT Plus lapse because of that.
Thanks for the offer, but I don't think I can be bothered with an in-depth comparison unless it's critical, and I don't currently have anything like that on my plate. I might sign up for Max to try it next time something is critical enough, or if I decide I really want to dig into Claude Code.
Opus doesn't seem quite as good at aggressive multi-round web searching as o3. I suspect this may be fixable by improving system prompt + tool implementation.
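If it is a scaffolding issue, the fix might look something like the sketch below: a tool definition plus a system prompt that explicitly pushes iterative searching, written with the Anthropic Python SDK. The model id is assumed and search_web() is a hypothetical stub, not anything Anthropic ships.

```python
# Sketch of "system prompt + tool implementation" for multi-round search.
# Speculative: the model id is assumed and search_web() is a placeholder.
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "Research iteratively: issue an initial search, read the results, then "
    "issue refined follow-up searches until the question is answered from "
    "multiple independent sources. Do not stop after one round."
)

TOOLS = [{
    "name": "web_search",
    "description": "Search the web and return result snippets.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def search_web(query: str) -> str:
    return "...search results..."  # hypothetical stub; plug in a real search API

def research(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        resp = client.messages.create(
            model="claude-opus-4-20250514",  # assumed model id
            max_tokens=2000,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            # Model is done searching; return its final text answer.
            return "".join(b.text for b in resp.content if b.type == "text")
        # Execute each requested search and feed results back for another round.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b.id, "content": search_web(b.input["query"])}
            for b in resp.content if b.type == "tool_use"
        ]})
```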
That said, I have not caught it in any blatant lies yet. So overall Claude has the edge in usefulness here.
I was also very surprised to find myself missing the ChatGPT memory feature, which I initially hated the idea of. I wouldn't say o3 with memory feels like it "knows me". It feels like it has my customer record open in the CRM. But that's still really useful.
I look forward to these posts, thanks.
Claude Code is much better with the 4 release. I'm going to have to buy Max just to save myself money. I've spent $11, $10, and $6 in the three days that I've done any work since the new Claude Code came out. (Based on logs, it seems the new model wasn't actually being used for most of launch day, May 22.)
I don't even use any Anthropic functionality besides Claude Code! Typically I use o3 for smart questions, 4o for stupid questions. I will probably give them more of a try now if I'm getting the Max subscription.
Agreed, Claude Code using Claude Opus 4 is a major step up.
As a subscription service I currently use Gemini. Should I buy the Claude subscription in order to "truly" try the new models, or is access via OpenRouter sufficient?
Do the people praising Claude 4 have access to a different model than I do? I've been comparing both Opus and Sonnet against o3 and 2.5 Pro since Claude 4 came out, and I've yet to see a prompt where Claude performed better than, or even on par with, the competition.
The new Research mode has the same issue as Google’s Research mode: it ingests hundreds of sources but then struggles to synthesize them into a coherent, useful report. OAI's Deep Research has the opposite problem: it produces long and detailed reports but I often wish I could force it to analyze 200 sources instead of just 20-40 to which it seems to default.
My workflow for non-coding questions over the past few weeks has settled on the following (a rough code sketch follows the list):
(1) Collect information using o3 (with Deep Research in limited cases)
↳ (2) Use 2.5 Pro to fact-check and expand overly concise sections
↳ (3) Run the result through Sonnet to clean up grammar and formatting
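For concreteness, here's roughly what that pipeline looks like wired up with each vendor's official Python SDK. The model ids and prompts are my assumptions; the actual workflow runs through the chat UIs rather than the APIs.

```python
# Rough sketch of the three-stage workflow above, using the official OpenAI,
# Google, and Anthropic Python SDKs. Model ids are assumptions.
import anthropic
from google import genai
from openai import OpenAI

openai_client = OpenAI()
gemini_client = genai.Client()
claude_client = anthropic.Anthropic()

def collect(question: str) -> str:
    # Stage 1: gather information with o3.
    resp = openai_client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def fact_check(draft: str) -> str:
    # Stage 2: fact-check and expand overly concise sections with 2.5 Pro.
    resp = gemini_client.models.generate_content(
        model="gemini-2.5-pro",
        contents=f"Fact-check this and expand any overly concise sections:\n\n{draft}",
    )
    return resp.text

def polish(draft: str) -> str:
    # Stage 3: clean up grammar and formatting with Sonnet.
    msg = claude_client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=4000,
        messages=[{"role": "user", "content": f"Clean up the grammar and formatting, keeping content intact:\n\n{draft}"}],
    )
    return msg.content[0].text

answer = polish(fact_check(collect("How do I achieve X given these constraints?")))
```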
Claude IMO produces the most human-like language of the "big 3", which remains its main strength — but otherwise, it's been losing ground to both Google and OAI.
Does anybody in AI ever notice that nobody in AI can agree which model is best for which things and there are wildly divergent experiences of the same model?
Maybe Anthropic's holding back release for a bigger model. A Magnum Opus? 🤔
I like Claude 4, although it doesn't feel like a generational improvement. The thing I'm excited about is that it seems to have more "taste"—more ability to discern subjective quality.
A big barrier to using LLMs as automated judges ("rank these poems from best to worst") is that they are not good at separating high-end output. They can tell a good poem from a bad poem, but not a good poem from a great poem. They always end up overfitting to the wrong quality signals (like linguistic complexity) and then you get slop.
If LLM judging were solved, we'd be able to start data flywheels for all sorts of creative tasks. So Claude 4 making progress there is exciting.
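To make the judging problem concrete, here's a minimal pairwise-comparison sketch, the usual workaround, since models separate high-end outputs somewhat better head-to-head than with absolute scores. The model id and prompt wording are my assumptions, and this doesn't fix the overfitting-to-complexity problem; it just frames the task less badly.

```python
# Minimal LLM-as-judge sketch: rank poems by wins across pairwise matchups
# rather than absolute scores. Model id and prompt wording are assumptions.
import itertools
import anthropic

client = anthropic.Anthropic()

def judge(poem_a: str, poem_b: str) -> str:
    msg = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model id
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": f"Which poem is better? Answer with only 'A' or 'B'.\n\n"
                       f"Poem A:\n{poem_a}\n\nPoem B:\n{poem_b}",
        }],
    )
    return msg.content[0].text.strip()

def rank(poems: list[str]) -> list[str]:
    wins = {i: 0 for i in range(len(poems))}
    for i, j in itertools.combinations(range(len(poems)), 2):
        # Ideally judge both orderings to wash out position bias; one shown.
        verdict = judge(poems[i], poems[j])
        wins[i if verdict.startswith("A") else j] += 1
    return [poems[i] for i in sorted(wins, key=wins.get, reverse=True)]
```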
Benchmarks that have caught my eye:
- 10.72% on Humanity's Last Exam (similar to o1—highest score is o3 at 20.32%)
- 3rd on the EQBench Creative Writing Benchmark (for what it's worth, the creator says he plans on redoing the benchmark with Sonnet 4 as a judge, and the redone version has Opus 4 at #1).
- no ARC-AGI 2 score yet (I'm not expecting anything impressive—Anthropic models have historically not done well on this benchmark)
- no Snakebench score, but since Claude 3.7 is currently in the lead, Claude 4 might take #1.
- no Aidanbench score, because the creator was hired by OA and handed it off to people who no longer update it. Again, not expecting much.
- no FrontierMath score, but judging from Livebench's Mathematics category it will be a bit behind Gemini 2.5 Pro.
- LMArena is literally an anti-benchmark. It's slop. It rewards slop. If I hear a model is doing well on LMArena, I update against that model. Hopefully Claude 4 lands with a nice dull thud. #20 or lower would be my hope.
edit: oh yeah, and Pokemon Red/Blue. It's "racing" Gemini 2.5 Pro and o3 right now (not that the results will mean much, since the scaffolding differs). I can't tell how far along they both are because Opus 4 and o3 are in battles right now, but Gemini has 3 badges (and a Dugtrio called "Tridrill" for some reason).
Asked Claude Sonnet 4 about UK pension rules. It got them wrong, and not because they changed recently either; a standard search didn't fail on that. Things may be getting better, but there are still plenty of everyday cases, nothing to do with coding or weird maths, where these models just don't work well. My recent favourite with Sonnet 3.7 was asking for the next RBA meeting: this was on a Sunday, and it said Tuesday. On Monday it said Tuesday. On Tuesday it said Tuesday. On Wednesday it said Tuesday, the day before. Even more fun: the date was totally wrong; it was in three weeks.
Zvi, did you see https://invariantlabs.ai/blog/mcp-github-vulnerability? Not surprising, but interesting in that it doesn't involve any Pliny-style jailbreaks. They say alignment isn't enough, although I would say they have a weak definition of alignment if they think that. It's very unlikely a junior engineer asked to do the same task would do that.
It seems to me that the limit on Sonnet 4 for non-paying customers is pretty high. I was using it today for non-coding questions throughout the day and never ran into any limits.