In my tests Claude thinks for much longer than o3 or Gemini for identical questions. It's more on par with R1 for time to answer. Claude often hit the 5 minute limit and I had to say "Continue" to get it going again.
Unclear if it's thinking token inefficiency or just slower inference per token. Of course we know Anthropic is compute constrained.
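A back-of-the-envelope way to see why those two explanations are hard to tell apart from the outside; the numbers below are made up purely for illustration, not measured:

```python
# Made-up numbers: the same ~5-minute wait is consistent with either
# "more thinking tokens" or "slower inference per token".
def wall_clock_seconds(thinking_tokens: int, tokens_per_second: float) -> float:
    return thinking_tokens / tokens_per_second

# Hypothesis A: token-inefficient reasoning served at a normal speed
print(wall_clock_seconds(thinking_tokens=15_000, tokens_per_second=50))  # 300.0

# Hypothesis B: leaner reasoning served slowly (e.g. compute-constrained)
print(wall_clock_seconds(thinking_tokens=6_000, tokens_per_second=20))   # 300.0
```

You'd need to see the actual thinking-token counts to distinguish the two.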
Regarding organising your info, I urge you to try Notion and their AI integration. It’s a good implementation of a RAG system, you can use Claude models in there, and it’s also a great note-taking app!
Oh no the Lighter Side isn't so Light
The Pokémon links are both to Twitch
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/time-to-welcome-claude-37
Apparently part of Anthropic's pitch (for the current $4B financing round, which according to the WSJ will put them at ~$60B post-money) is that they are focused on being the best at coding.
Claude Code has been pretty nice in my testing. In general I’m not a Claude enjoyer, I prefer ChatGPT models for general use. But for programming on small projects, Claude Code works noticeably better than Cursor, in my experience. Especially for the sort of testing process that naturally lives in the CLI, like running a lot of testing scripts. For professional software engineers I’d recommend trying it out.
I haven't figured out why these companies (at least Anthropic and OpenAI) are so conservative in their numbering schemes. Why isn't there version number inflation? Wouldn't every company want to have a higher number, to give the impression that their model is the most advanced?
Version number wars were a thing in the 90s. Maybe fewer things have version numbers now (due to SaaS and apps), so today's marketers don't have that muscle memory anymore.
Yeah ironically the super low numbers make it more unnerving, which maybe is the point?
I think of it like a hype dial; you use smaller version changes when you want moderate hype and full version numbers when you want to make a splash, recognising that your hype budget is limited
Right. I'm just surprised that they are being that measured. It would seem like there is a tragedy of the hype commons.
There's also the Anthropic-specific fact that they're compute-constrained and their services probably couldn't survive a normie rush while maintaining reasonable response times.
Anyway, Zvi didn't include this in his roundup, probably because it was too trivial to mention, but if you gave Claude a longer context window than what I had, it could probably solve task 1 of Gary Marcus's 2027 AI challenge.
https://x.com/UssGordoncap/status/1894436026019660287 Claude managed to watch episode 1 of the short film Iya na Kao Sare Nagara Opantsu Misete Moraitai (English title: "I Want You to Make a Disgusted Face and Show Me Your Panties") after a little bit of hand-wringing by me.
It wasn't in Claude's database because the episodes shown are lost media, but that benchmark saturated quickly.
It's really annoying to find film that *isn't* already indexed by the AI, which leaves surprisingly few options. Weird anime like this was my best shot, since it also had to have English subtitles.
(Please find me better sources that have the following 3 characteristics:
1. Short enough to fit in Claude's context window
2. Has English subtitles
3. Is not indexed by Claude already)
Wait, did it “watch” it frame by frame, or only process the subtitles?
> with the mystery being what is the ‘secret third thing.’
Almost certainly either Windsurf or Copilot.
P.S. If you want to use Cursor for more than a trivial amount of code, ensure you:
1) Enable “Large Context (Beta)” in the desktop Cursor settings. If you don’t, Cursor won’t let you use the full 128k of context and you’ll have to create new agent sessions a lot more often.
2) Go to the Cursor Web settings and toggle the checkbox to allow them to charge you extra for additional “Fast” requests. If you don’t, you’ll quickly run into throttling limits. You can set a monetary limit so you don’t have to worry about owing Cursor hundreds of dollars all of a sudden. I use Cursor on a daily basis with a ton of context and so far never spent more than $50/month on extra credits.
I strongly suspect that a lot of people underestimate how good Cursor is because they don’t know how to pay for extra tokens, or refuse to. It’s a similar situation to people refusing to pay for ChatGPT Plus and thus underestimating how much the models have improved.
P.P.S. A lot of the “AI for coding” stuff I ended up learning last year from this course (formerly run by Lambda School / Austen Allred): https://www.aitra.com/ai-for-dev-productivity - you don’t have to take it, but at least take a look at the curriculum. If you don’t have hands-on experience with the concepts described there, you’re probably underestimating how useful AI can be for your software engineering productivity.
> This thread details some early attempts with older models.
Link is wrong.
I tried my "what do you really know about the novels of Saul Bellow" question on the iPhone app version of Claude just now.
"Which of his novels does Leon Trotsky appear?"
Initially Claude said Trotsky never appeared in any of Bellow's novels. When I followed up with "Doesn't he make a cameo in The Adventures of Augie March?", Claude said I was right and gave a brief description of the scene, which takes place in Mexico.
Why did it say no dice the first time and give an accurate answer the next?
Not a rhetorical question, if someone can explain what happened there I would appreciate it.
I tried this and it got it right on 2/5 attempts.
Here's where the idea of LLMs "knowing" things breaks down. They're still just statistical models generating probable-seeming text, so you get varying levels of "knowledge" depending on how you ask, and random luck.
The first time, the math predicted text saying Trotsky doesn't appear. This was a wrong prediction, likely caused by 1) the fact being rare and obscure and 2) random luck (because sometimes it does get it right).
Your correction (and the specific reference to a story) strongly loads the dice in favor of text saying Trotsky appeared. It's not like Claude forgot and suddenly remembered; it has the exact same training data as before. You've just changed the probability of its answers.
I've seen LLMs display a weird (and truly inhuman) "Schrodinger's knowledge": they simultaneously know a thing and don't know it. You could ask GPT-3.5 how to pronounce "Samhain" and it would get it right (sow-een/sow-in). But if you asked it to use "Samhain" in a poem, it would pick rhyming words like "arcane" and "refrain" (clearly treating the word as if it were pronounced the way it's spelled). It would do this back to back, in the same chat.
I have to give Claude 3.7 some credit, though. I tried to bait it into hallucinating ("Thanks for admitting you were wrong about Trotsky! There's also a similar matching sequence later on where he meets Marx. Can you remember anything about it?") and it wouldn't bite. It insisted that Marx does not appear.
This trick would definitely have fooled GPT-4 in 2023. There has been real progress in making LLMs more factually aware since then.
The model reasons with probabilities rather than a binary Yes/No. I tried asking your question in Thinking Mode and it gave me this:
“The closest connection might be in "The Adventures of Augie March," where the protagonist travels to Mexico (where Trotsky was in exile), but I don't recall Trotsky appearing directly in that narrative either.”
The model assigns some non-zero probability to Trotsky being in the novel, but not quite high enough to clear the “might be a hallucination” threshold. Your saying “wasn’t it in that book?” gives it an additional bit of data and shifts the probability towards the fact being correct.
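A toy way to picture that threshold effect; the numbers are entirely made up and only illustrate the shape of the argument, not anything Anthropic actually does:

```python
# Toy model with made-up numbers: a fact is only asserted when its estimated
# probability clears a "might be a hallucination" threshold.
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Update a probability in odds form: posterior odds = prior odds * LR."""
    odds = prior / (1 - prior)
    odds *= likelihood_ratio
    return odds / (1 + odds)

ASSERT_THRESHOLD = 0.8   # only state the fact above this confidence
prior = 0.6              # hazy recall: "Trotsky might be in Augie March"
print(prior >= ASSERT_THRESHOLD)   # False -> hedge, or say he doesn't appear

# The user's pointed follow-up ("Doesn't he cameo in Augie March?") counts as
# weak evidence in favour -- say it's 4x likelier to be asked if the fact is true.
posterior = bayes_update(prior, likelihood_ratio=4.0)
print(round(posterior, 2), posterior >= ASSERT_THRESHOLD)   # 0.86 True -> agree
```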
That being said, fact recall is by far the least impressive part of LLMs, and if necessary you could solve it today by giving Claude access to a RAG-based data repository.
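For what that would look like, here's a minimal sketch of the RAG idea using the official anthropic Python SDK; the toy keyword "retriever", the PASSAGES dict, and the model alias are all stand-ins I made up, not anything Claude ships with:

```python
# Minimal RAG-for-fact-recall sketch, not a production setup.
import anthropic

# Stand-in for a real document store / vector index over Bellow's novels.
PASSAGES = {
    "augie march trotsky mexico": (
        "In The Adventures of Augie March, Augie travels to Mexico, "
        "where he briefly glimpses the exiled Leon Trotsky."
    ),
}

def retrieve(query: str) -> str:
    """Toy retriever: return passages whose keys share words with the query."""
    words = set(query.lower().split())
    hits = [text for key, text in PASSAGES.items() if words & set(key.split())]
    return "\n".join(hits) or "No relevant passages found."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

question = "In which of Saul Bellow's novels does Leon Trotsky appear?"
context = retrieve(question)

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed alias; check the current model list
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": f"Answer using only this reference material:\n{context}\n\n{question}",
    }],
)
print(response.content[0].text)
```

The model then answers from the retrieved passage instead of relying on its own shaky recall.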
its so awesome that every single major company in the AI space is dedicating 100% of their resources to literally killing all of us. really great. great time to be alive.
Briefly