One potentially big event was that DeepSeek came out with v3.1. The initial response was very quiet, but this is DeepSeek, there are some strong scores (especially on SWE), and people may need time to process the release. So I’m postponing my coverage of this to give us time to learn more.
"Every time I see ‘good question’ from an LLM I want to either puke or edit my system instructions, which clearly aren’t working."
Yup! (except I've just been using the LLMs "out of the box" and haven't tried tweaking system instructions)
I find that, if I pose a question to an LLM, and it responds with "Insightful question!", followed by a wrong answer, this elicits LLMicidal feelings in me.
Hmm... On seeing the comments, it looks like a fairly common sentiment is (with apologies to Orwell): "Do it to the 4o users! Do it to the 4o users!"
>Here I am more skeptical. Why would you want to do this? A crypto that is good for some amount of compute, either continuously or one time? Something else? Why would you want compute to not continue to be fungible with dollars?
I don't think this is what he's saying, or that it it has anything to do with crypto. He's talking about new financial instruments, like e.g. tradable futures on compute. SFCompute actually tried to do this but apparently got pushback from CFTC because they basically wanted to operate a futures exchange.
This was also part of the pitch for Enron Broadband (not only would they build the infrastructure, they’d trade compute and bandwidth). Didn’t work out though.
Regarding the unwanted fallbacks and edge case-handling in code, I've noticed this, too, most often in the form of defensively adding in handling for problems that are never supposed to happen in practice and should trigger a fire alarm, like bad input data. I wonder if it's the code version of sycophancy: overeagerness to please in this case leading to false positives in fixing imaginary bugs that don't exist.
A possibly related issue is that models happily implement thousands of lines of code because you forgot to add a library or they cannot access the internet. Recently, Claude Sonnet wrote a bunch of SVG code into my website pages for icons because I forgot to add an icon library to the project. It would be nice to be asked "hey, it looks like you don't have an icon library, should I go ahead and add one or do you want me to create icons from scratch?" instead. Although this issue may also be a sign of optimizing for long-running agent tasks (aka gaming the METR benchmark), which can be a bad idea.
Wait, regarding the "good question!"-style responses...are we sure that's not just politeness? It reminds me of a professor in office hours. It's a safety signal, a social skill, that says simply "I like you and I like this conversation".
That may be why this stuff is hard to system-prompt away, because the LLM does not see it as a sycophancy thing.
And also- are we not asking good questions? I imagine everyone who reads this blog has some genuinely good questions. I imagine interesting humans are a lot of fun for LLMs to talk to.
"Unlike the commenter's claims, the 1.75/L paper was released before ChatGPT came up with the 1.5/L proof. While no paper for the 1.5/L bound existed, no human would've attempted to write a paper on it, because it is strictly worse than the existing 1.75/L bound."
I rely on Claude a lot for writing my abstracts. I've already written the paper, writing the abstract is largely about distilling it down into a a target quantity of words. This feels like the thing the models rule at, and it's a part of the academic writing process I do not enjoy. Obviously you tinker a ton and iterate, but it's much easier to have something to shoot at
I was going to comment on that section, too. One can view writing the paper as science, while writing the abstract is marketing. It seems natural that someone who enjoys the science might want to offload the marketing.
I often feel like I am too close to the subject to write a good abstract. Having just written the paper, I am likely to leave out important context or mention technical details that shouldn't be in the abstract, so I can see the utility in getting a 3rd party perspective. I'm not going to force a colleague to stop what they are doing, read my paper, and draft of an abstract for me, but I could seeing asking Claude for help.
I applied for a company the other day whose product seems to be "we're a Bay Area startup that's gonna use AI to uhhhhh do something with pharmacy consultations to improve uhhhh outcomes or whatever" because I thought it might be amusing (also it was remote). I'd be tempted to link to if I didn't get an automated rejection the next day, and, most disappointingly, it doesn't look like the rejection was even written by AI.
Megan Markle's uses of AI: "4) Finding places to eat by asking it where @tylercowen would eat in a given place." I tried this for a few Australian country towns I am familiar with (populations between 15k and 100k) and it was remarkably useless? "Tyler would focus on quality ingredients and local produce (which is any local restaurant I can find in a quick google search), and lean into hidden gems favoured by local ethnic migrant groups (of which I have zero examples)".
Maybe this is more useful in cities where there is more tripadvisor/yelp/whatever review sites to scrape, but it seemed so generic that I wouldn't bother including it in my "eating out" selection process.
"Anthropic introduces a new nuclear classifier they claim has 96% accuracy in differentiating concerning and benign nuclear-related conversations, in cooperation with DOE and NNSA. They say it works well in practice."
I hope it's better than their biorisk classsifier, which definitely does not work well in practice. I tried to get Opus to review an essay just mentioning some risks ("a bad actor could use X to do Y"), and it kept getting interrupted with 'usage policy violations' mid-output. Even a heavily censored version was still blocked.
A quick search shows a bunch of redditors complaining around ~2 weeks ago, even with much more innocuous prompts (literally talking about biology at all, or in one case just coding). Possibly when 4.1 dropped. (Or are they A/B testing, prepping for higher ASL?) Only Opus, not Sonnet.
It sure seems like they cranked up the gain to overcompensate for false negatives. Does not inspire confidence.
Frustratingly, there is no "give feedback" button for when supposed violations end the chat. You can thumb down the output but none of the feedback categories quite match (and the output isn't the problem). It's as if they want to hide that the classifiers exist, but that doesn't work when you also boast about how good they supposedly are.
About "great prompt", I got that a lot when using Horizon Beta. Looking at one chatlog, on ctrl-F "great", there's five messages, starting with (in order):
1) Great prompt.
2) Great prompt.
3) Great instincts.
4) Great catch.
5) Great example.
I noticed it at the time thinking that it's improvement on sycophancy front, because it didn't repeat that in every message.
(I am way behind in Zvi’ article, and this comment is mostly about AI129)
Claude’s limits on Plus are a problem. I do think that this is the main reason Claude is way behind ChatGPT in popularity for personal use. It would be a real struggle for a typical ChatGPT Plus user, because you should constantly optimise your chats (Ideally one-shotting everything), even in basic scenarios, not some power user situations where it is reasonable to spend 2k usd a year for a pro tier. Long chats are also difficult: you need to constantly ask Claude to summarize a current dialogue in order to continue it in a new chat, or you will reach a limit. E.g. I have a chat that helps me to keep track of my addiction problem, and within a week I created around 15 new chats to continue discussion with Claude (eventually I gave up and returned to Gemini)
Opus 4 burns through plus tier usage so fast that Claude chat doesn't make sense at the ~$20/month level if you need intelligence. I'm moving back to my librechat server with anthropic/openAI/Google API keys at the end of my current billing cycle for this reason. I expect to spend more than 20 per month across all providers, but not 200. Claude code is quite usable at the $20 level for hobby projects, however, since it usually uses a smaller model
The Judge did not do everything possible, see this very sharp comment from Sharpe J:
Sharp warned in her ruling that lawyers who refer to non-existent cases will be in breach of their duty to not mislead the court, which could also amount to contempt of court.
Podcast episode for this post: https://open.substack.com/pub/dwatvpodcast/p/ai-130-talking-past-the-sale?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
"Every time I see ‘good question’ from an LLM I want to either puke or edit my system instructions, which clearly aren’t working."
Yup! (except I've just been using the LLMs "out of the box" and haven't tried tweaking system instructions)
I find that, if I pose a question to an LLM, and it responds with "Insightful question!", followed by a wrong answer, this elicits LLMicidal feelings in me.
Hmm... On seeing the comments, it looks like a fairly common sentiment is (with apologies to Orwell): "Do it to the 4o users! Do it to the 4o users!"
>Here I am more skeptical. Why would you want to do this? A crypto that is good for some amount of compute, either continuously or one time? Something else? Why would you want compute to not continue to be fungible with dollars?
I don't think this is what he's saying, or that it has anything to do with crypto. He's talking about new financial instruments, e.g. tradable futures on compute. SFCompute actually tried to do this but apparently got pushback from the CFTC because they basically wanted to operate a futures exchange.
This was also part of the pitch for Enron Broadband (not only would they build the infrastructure, they’d trade compute and bandwidth). Didn’t work out though.
Regarding the unwanted fallbacks and edge-case handling in code, I've noticed this too, most often in the form of defensively adding handling for problems that are never supposed to happen in practice and should trigger a fire alarm, like bad input data. I wonder if it's the code version of sycophancy: overeagerness to please, in this case leading to false positives where it "fixes" bugs that don't exist.
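A minimal sketch of the pattern I mean (hypothetical TypeScript with an invented parseOrder example, not from any real codebase): the defensive version silently papers over malformed input, while the version I actually want treats it as the fire alarm it is.

```typescript
interface Order { id: string; quantity: number; }

// What the model tends to write: swallow the problem and carry on.
function parseOrderDefensive(raw: unknown): Order {
  const o = raw as Partial<Order>;
  return {
    id: typeof o.id === "string" ? o.id : "unknown",           // silently invents an id
    quantity: typeof o.quantity === "number" ? o.quantity : 0,  // hides the bad data
  };
}

// What I actually want: malformed input here means something upstream is on fire.
function parseOrderStrict(raw: unknown): Order {
  const o = raw as Partial<Order>;
  if (typeof o.id !== "string" || typeof o.quantity !== "number") {
    throw new Error(`malformed order record: ${JSON.stringify(raw)}`);
  }
  return { id: o.id, quantity: o.quantity };
}
```

The defensive version never alerts anyone, so the corrupt data just propagates quietly.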
A possibly related issue is that models happily implement thousands of lines of code because you forgot to add a library or they cannot access the internet. Recently, Claude Sonnet wrote a bunch of SVG code into my website pages for icons because I forgot to add an icon library to the project. It would be nice to be asked "hey, it looks like you don't have an icon library, should I go ahead and add one or do you want me to create icons from scratch?" instead. Although this issue may also be a sign of optimizing for long-running agent tasks (aka gaming the METR benchmark), which can be a bad idea.
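For illustration, a hedged sketch of the gap (React and lucide-react are stand-ins I picked here, not necessarily what the actual site uses): one import line versus hand-rolled SVG repeated for every icon.

```tsx
import { Search } from "lucide-react"; // one line, if an icon library is installed

// Roughly what gets generated instead, multiplied across every icon on the site:
const SearchIconInline = () => (
  <svg viewBox="0 0 24 24" width="24" height="24" fill="none"
       stroke="currentColor" strokeWidth="2">
    <circle cx="11" cy="11" r="8" />
    <line x1="21" y1="21" x2="16.65" y2="16.65" />
  </svg>
);

export const Toolbar = () => (
  <div>
    <Search size={24} />   {/* library icon */}
    <SearchIconInline />   {/* hand-rolled icon */}
  </div>
);
```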
Wait, regarding the "good question!"-style responses...are we sure that's not just politeness? It reminds me of a professor in office hours. It's a safety signal, a social skill, that says simply "I like you and I like this conversation".
That may be why this stuff is hard to system-prompt away, because the LLM does not see it as a sycophancy thing.
And also: are we not asking good questions? I imagine everyone who reads this blog has some genuinely good questions. I imagine interesting humans are a lot of fun for LLMs to talk to.
Community note on the new math produced by GPT 5:
"Unlike the commenter's claims, the 1.75/L paper was released before ChatGPT came up with the 1.5/L proof. While no paper for the 1.5/L bound existed, no human would've attempted to write a paper on it, because it is strictly worse than the existing 1.75/L bound."
https://arxiv.org/pdf/2503.10138v2.
Impressive and exciting though.
I rely on Claude a lot for writing my abstracts. I've already written the paper; writing the abstract is largely about distilling it down into a target quantity of words. This feels like the thing the models rule at, and it's a part of the academic writing process I do not enjoy. Obviously you tinker a ton and iterate, but it's much easier to have something to shoot at.
I was going to comment on that section, too. One can view writing the paper as science, while writing the abstract is marketing. It seems natural that someone who enjoys the science might want to offload the marketing.
I often feel like I am too close to the subject to write a good abstract. Having just written the paper, I am likely to leave out important context or mention technical details that shouldn't be in the abstract, so I can see the utility in getting a third-party perspective. I'm not going to force a colleague to stop what they are doing, read my paper, and draft an abstract for me, but I could see asking Claude for help.
<mildSnark>
"talk of a pause in AI development has all but disappeared"
Hey, we _got_ that. The pause was the months between o3 release and GPT5 release. Was it helpful? :-)
</mildSnark>
I applied to a company the other day whose product seems to be "we're a Bay Area startup that's gonna use AI to uhhhhh do something with pharmacy consultations to improve uhhhh outcomes or whatever" because I thought it might be amusing (also it was remote). I'd be tempted to link to it if I hadn't gotten an automated rejection the next day, and, most disappointingly, it doesn't look like the rejection was even written by AI.
Meghan Markle's uses of AI: "4) Finding places to eat by asking it where @tylercowen would eat in a given place." I tried this for a few Australian country towns I am familiar with (populations between 15k and 100k) and it was remarkably useless? "Tyler would focus on quality ingredients and local produce (which is any local restaurant I can find in a quick Google search), and lean into hidden gems favoured by local ethnic migrant groups (of which I have zero examples)".
Maybe this is more useful in cities where there is more tripadvisor/yelp/whatever review sites to scrape, but it seemed so generic that I wouldn't bother including it in my "eating out" selection process.
"Anthropic introduces a new nuclear classifier they claim has 96% accuracy in differentiating concerning and benign nuclear-related conversations, in cooperation with DOE and NNSA. They say it works well in practice."
I hope it's better than their biorisk classifier, which definitely does not work well in practice. I tried to get Opus to review an essay just mentioning some risks ("a bad actor could use X to do Y"), and it kept getting interrupted with 'usage policy violations' mid-output. Even a heavily censored version was still blocked.
A quick search shows a bunch of redditors complaining around ~2 weeks ago, even with much more innocuous prompts (literally talking about biology at all, or in one case just coding). Possibly when 4.1 dropped. (Or are they A/B testing, prepping for higher ASL?) Only Opus, not Sonnet.
It sure seems like they cranked up the gain to overcompensate for false negatives. Does not inspire confidence.
Frustratingly, there is no "give feedback" button for when supposed violations end the chat. You can thumb down the output but none of the feedback categories quite match (and the output isn't the problem). It's as if they want to hide that the classifiers exist, but that doesn't work when you also boast about how good they supposedly are.
"recreate the information from the knowledge graph" This link is broken btw.
About "great prompt", I got that a lot when using Horizon Beta. Looking at one chatlog, on ctrl-F "great", there's five messages, starting with (in order):
1) Great prompt.
2) Great prompt.
3) Great instincts.
4) Great catch.
5) Great example.
I noticed it at the time and thought it was an improvement on the sycophancy front, because it didn't repeat that in every message.
(I am way behind on Zvi’s articles, and this comment is mostly about AI #129.)
Claude’s limits on Plus are a problem. I do think this is the main reason Claude is way behind ChatGPT in popularity for personal use. It would be a real struggle for a typical ChatGPT Plus user, because you have to constantly optimise your chats (ideally one-shotting everything) even in basic scenarios, not just in power-user situations where it is reasonable to spend $2k a year on a Pro tier. Long chats are also difficult: you need to keep asking Claude to summarize the current dialogue so you can continue it in a new chat, or you will hit the limit. E.g. I have a chat that helps me keep track of my addiction problem, and within a week I created around 15 new chats to continue the discussion with Claude (eventually I gave up and returned to Gemini).
Opus 4 burns through Plus-tier usage so fast that Claude chat doesn't make sense at the ~$20/month level if you need intelligence. I'm moving back to my LibreChat server with Anthropic/OpenAI/Google API keys at the end of my current billing cycle for this reason. I expect to spend more than $20 per month across all providers, but not $200. Claude Code is quite usable at the $20 level for hobby projects, however, since it usually uses a smaller model.
The judge did not do everything possible; see this very sharp comment from Sharp J:
Sharp warned in her ruling that lawyers who refer to non-existent cases will be in breach of their duty to not mislead the court, which could also amount to contempt of court.
https://www.reuters.com/world/uk/lawyers-face-sanctions-citing-fake-cases-with-ai-warns-uk-judge-2025-06-06/
I'd buy the AI jewelry if it said "Better for you if you take me off"
For reference: https://croissanthology.com/earring