> I am not myself advocating for a pause at this time.
How come?
Hey Claude, how much is a late fee for a 15’ U-Haul truck rental?
Ultra-deep thinking mode. Greater rigor, attention to detail, and multi-angle verification. Start by outlining the task and breaking down the problem into subtasks. For each subtask, explore multiple perspectives, even those that seem initially irrelevant or improbable. Purposefully attempt to disprove or challenge your own assumptions at every step. Triple-verify everything. Critically review each step, scrutinize your logic, assumptions, and conclusions, explicitly calling out uncertainties and alternative viewpoints. Independently verify your reasoning using alternative methodologies or tools, cross-checking every fact, inference, and conclusion against external data, calculation, or authoritative sources. Deliberately seek out and employ at least twice as many verification tools or methods as you typically would. Use mathematical validations, web searches, logic evaluation frameworks, and additional resources explicitly and liberally to cross-verify your claims. Even if you feel entirely confident in your solution, explicitly dedicate additional time and effort to systematically search for weaknesses, logical gaps, hidden assumptions, or oversights. Clearly document these potential pitfalls and how you’ve addressed them. Once you’re fully convinced your analysis is robust and complete, deliberately pause and force yourself to reconsider the entire reasoning chain one final time from scratch. Explicitly detail this last reflective step.
Does it feel like the models aren't really improving very much for useful tasks? We seem to have bottomed out for the last 12 months imo. I had a big moment around o1 but ever since then nothing.
o1 was just last September! R1 was January!
Coding is way better than it was 12 months ago. Multi-step search tasks are way better.
It depends a lot on what "useful tasks" you are trying to do.
For me coding topped out around o1. But I'm not a great coder to begin with. Do professional SWEs feel Gemini 2.5 is better for coding than o1?
What useful tasks are you finding progress static on?
My personal use case was satisfied by GPT-3.5/4. I do not write code beyond simple PHP stuff for websites. So for me, AI progress has felt fairly slow for about 2+ years. One thing is that AI has become much better at web search and integrating deep facts (GPT-4 had search on and off, but it wasn't very useful beyond parroting whatever SEO slop was on Google's front page). So that's helpful.
If I were doing high-level math or physics or statistics (etc.), I am sure it would not feel static.
It could be that my tasks are not involved enough. Not sure.
I'm enjoying Gemini 2.5, however. It's the first new model release since o1 that feels slightly better.
o3 I don't trust personally.
> We seem to have bottomed out for the last 12 months imo. I had a big moment around o1 but ever since then nothing.
Man, do you not explore ideas with it, or ask it a continued series of deeper questions?
Like the increase in intelligence and depth with both Gemini 2.5 and o3 is a massive step change.
Most of my friends are STEM PhDs, and it can lap them on their subjects of interest - as in, me asking 2.5 increasingly deeper questions yields more fruit than asking literal PhDs! I learned more in my first week of paid Gemini 2.5 than I had in many MONTHS of o1 and Deep Research and 4o.
All the models used to suck for debate or adversarial collaboration, too. 2.5 and now o3 are the first models that can stake out a position, clearly back that position with cites and facts, and then stand their ground against your rebuttals and counterarguments. Essentially every model before them would fold instantly.
Interesting, thanks for letting me know. Perhaps if I were still doing my PhD and asking more involved questions, I'd be able to notice the difference.
Google sent an email with the subject “Help your children get ready to use Gemini Apps.” In it they said the following:
“Remind your children that Gemini isn’t human. Even though it sometimes talks like one, it can’t think for itself or feel emotions”
I noticed I was surprised that they said Gemini can’t “think for itself.” I think that’s obviously untrue, though I realize there’s a lot of ambiguity in that statement.
https://proteanbazaar.substack.com/p/consciousness-actually-explained
> Cursor now writes 1 billion lines of accepted code a day, versus the world writing several billion a day total. That’s only Cursor.
I assume the % of digital art being created today by AI already passed manually drawn (possibly still below editing and reworking? maybe not even that)
Look, I absolutely hate sounding like one of those "this is not such a big deal" guys, because it is a big deal, but % of production stops being a good reference point when the cost for an LLM to produce it (including several tries, etc.) is ~zero, while people producing things ubaid-style (that is, manually) avoid unnecessary work. It's like comparing kilometers walked vs. kilometers driven: there's an obvious comparison to be made there, but it doesn't tell you much about the relationship between driving and walking.
I think you are correct—AI images probably now outnumber human-created artwork by maybe a few orders of magnitude. Go on DeviantArt, type in tags, and sort by "Recent." It's not uncommon for the entire page to be AI generated images. And that's just the stuff people post...
Here is the "Full Cast" podcast episode for this post, where every uniquely quoted person gets a consistent voice across the piece, including transcribing and voicing all the letter images in the "The Mask Comes Off", Rob Wiblin section:
https://open.substack.com/pub/dwatvpodcast/p/ai-114-liars-sycophants-and-cheaters
> They’re importantly not claiming affiliate revenue, so incentives are not misaligned. It’s so weird how OpenAI will sometimes take important and expensive principled stands
Why do we think this is a principled stand and not just temporary?
> OpenAI claims to have improved their citations.
Well, the claims are related to their citation UI. That tweet didn't claim that the citations are better quality.
When the model guesses the location of a picture based on incredibly subtle cues… does it even know how it knows? I wonder if in some cases it just “intuits” and the explanation is invented after the fact.
Seems to be a mix of both. https://www.astralcodexten.com/p/testing-ais-geoguessr-genius
Here's someone very big on Substack who thinks he's tried Gemini because he Googled a question: https://freddiedeboer.substack.com/p/ai-has-no-noticer
Does anyone know what model they're using for Google search? It's probably some small version of Gemini/Gemma, right?
Whatever it is, it's cheap, fast, and hooked up to a really iffy RAG pipeline that can't detect when Redditors are joking.
I mostly use Gemini. It's a bit of a sycophant as well. I've not experienced anything as blatant as the examples you gave for GPT-4o but it's definitely a bit of a flatterer. Probably that's part of the reason it's at the top of the arena leaderboard.
I mostly use Gemini too and always thought it to be rather neutral.
Just reviewed my last 5 interactions: no flattery... and thankfully none of the pseudo-smart comments in italics, either.
The confabulations are actually really interesting (as well as irritating). I have a growing list of paper titles that should exist but don't yet, software features that should but don't exist, and concepts missing from various languages natural and artificial. Perhaps over time I will become attuned to spotting these gaps without LLM assistance.
If someone wants to contact their policymakers about this, these might be helpful:
Control AI US tool: https://controlai.com/take-action/usa
Control AI UK tool: https://controlai.com/take-action/uk
"Claude Cade" -> Claude Code
I'm supposed to be worried about 20-30 queries/day? Oops. The 20,000-30,000 I did this week might be a bad sign, then.
Also, Qwen 3!
Be sure to get a download with all the bugs fixed and read the instructions carefully. Every new model is first deployed with broken prompt templates and other deployment issues when used with your favorite tool.
That said, Qwen 3 is fantastic on my private benchmarks. I've tested the bug-fixed Unsloth quants of the 32B and the 30B A3B MoE models, and they're great. 32B is the strongest model I've ever seen run on a single consumer graphics card, and 30B A3B is probably 90% as good and around 10x faster.
Seriously, if you're seeing crappy results, go download a fixed version and try again.
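For anyone who wants a concrete starting point, here is a minimal sketch of what "download a fixed version and try again" can look like with the Unsloth GGUF quants and llama-cpp-python. The repo id, quant filename, and settings below are assumptions on my part; check the model card for whatever the currently recommended bug-fixed files are.

```python
# Minimal sketch: fetch one of the bug-fixed Unsloth GGUF quants and sanity-check it.
# The repo id, quant filename, and settings here are assumptions; check the model
# card for the currently recommended fixed files.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="unsloth/Qwen3-30B-A3B-GGUF",    # assumed repo name
    filename="Qwen3-30B-A3B-Q4_K_M.gguf",    # assumed quant filename
)

llm = Llama(
    model_path=model_path,
    n_ctx=20480,       # pick whatever context window your tests need
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit
)

# A trivial prompt is enough to spot a broken chat template: looping output,
# missing stop tokens, or stray role tags show up immediately.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=16,
)
print(out["choices"][0]["message"]["content"])
```

If the template baked into the file is still broken, even a one-word prompt like that tends to make it obvious.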
Performance-wise, it's going to mostly be weaker than something like DeepSeek-R1 because it's so much smaller. And that means it will obviously be weaker than high-end frontier models. And like all small models, it's going to be better at handling clear requests to perform objectively verifiable tasks. Small models are hurt more by vague instructions.
But I'm pretty sure that the Qwen3 32B can give many 12-month-old frontier models a run for their money. And 30B A3B is very smart for something so fast on local hardware.
Things I have tried successfully with Qwen3:
- Structured information extraction.
- Math problems, roughly what you'd expect a high school senior headed to a good STEM school to handle.
- Turning vague pseudocode into a clean program.
- Short-story summaries, and question answering with concrete examples. (With a 20k context window.)
- Translation, including some tricky idiomatic stuff.
Here were the things that were partially successful:
- Poetry. With the 32B, it's bad but it rhymes. This is usually what breaks smaller models. The 30B A3B failed here.
- Telling me I'm wrong when I've given it confidently incorrect information. In this case, the 32B eventually decided I was probably wrong (in the thinking transcript), but gave up and hallucinated anyway in the output. This is partial credit because even most frontier models bombed this a year ago.
Compared to similar-sized Gemma models, the R1 distillations from earlier this year, Mistral, etc., this was very strong. Compared to current frontier models that are 20 times the size, it's not amazing.
But it still represents a very wide range of concrete abilities, some of which I've never seen before in models this size. And frontier models a year ago struggled with some of these items.
I need to add some story-writing evals, actually. None of my tests focus on open-ended creativity, which is often quite weak at this size. And I haven't tested for overly strict safety behavior.
But overall, my experience is that the Qwen team knows how to build good small models.