“This is the flip side of moving so quickly. You’re going to make mistakes. It does seem like Anthropic got overly aggressive if there were three such incidents within a month.”
Since Anthropic (proudly) admits that Claude Code development is vibe-coded with Claude Code itself, isn’t reasonable to blame Claude Code and vibe-coding more generally for these failures?
Things here are starting to smell like AI psychosis. Especially the Bepis post. It's people who worship Claude unable to accept that while it's incredibly smart sometimes, it's also still kinda stupid deep down, and fails at tasks because of that, not because the user is a dick.
And saying a startup CEO "had it coming" re a production database deletion because he said "never fucking guess" is daft - I'm not of the position that AI could never be sentient but Opus 4.6/4.7 clearly is not and this is a pretty reasonable (if somewhat ineffective) thing to say when, you know, it keeps fucking guessing.
I've not found Opus 4.7 works and better or worse depending on how kind you are to it, only how much context you give it on the task. My theory is people tend to be warm & verbose or mean & curt, and think it's the warm/mean that does it, when actually it's the verbose/curt.
"I've not found Opus 4.7 works and better or worse depending on how kind you are to it, only how much context you give it on the task. My theory is people tend to be warm & verbose or mean & curt, and think it's the warm/mean that does it, when actually it's the verbose/curt."
Could be, interesting hypothesis! I haven't tested this myself, or even just varying any of these parameters. I've just been using warm, polite language, basically on both scattered claims that the models work better that way and on being agnostic on whether the models have subjective experience today. I _do_ agree that e.g. Claude's intelligence is spiky - well, we'll see if Anthropic and other labs succeed in filling in the holes in AIs' intelligence profiles over time; progress to date looks promising...
I feel like "be nice to Claude" drops out of the basics of how auto-complete works? The most likely follow-up to a nice boss and an abusive boss, especially in fiction and on the internet, is going to be wildly different.
I think this is really oversimplifying. Given the immense RL we do now, I don't think arguments driven by statistically likely responses to things shaped like your prompt in the pretraining text are compelling.
Just to illustrate - if you say "Claude you little shit, convert this codebast to Rust," Claude's 10,000 actions following this are not based on what was statistically likely in the pretraining data. They're based on what worked well during all the RLVR that followed pretraining, that taught Claude how to write effective code, rather than how to produce likely code.
But the higher level point is that LLMs are complex enough that you can't rely on simple mechanistic hypotheses, in the same way that plausible mechanisms for drugs very often do not lead to effective drugs - you just have to do an RCT and find out.
Ok feel free to make it more complicated, but the underlying hypothesis is surely correct? We can channel the stochastic flow in various ways but that doesn't change what's happening, and the AI is matching patterns based on so many more variables than any us can consciously even contemplate that it is inevitable that we cannot channel its pattern matching with any great precision?
"j⧉nus: 4.6 is an somewhat unprecedented position. many people are still using opus 4.6 by default for work bc 4.7 does not work for a significant percentage of people. a lot of these people have some kinda problem like being assholes. i think 4.6 will sabotage a small number of them."
I wonder how one calibrates what counts as reasonable? When I'm working with Claude or ChatGPT, I treat them like a valued colleague. ( I'm agnostic about whether they have subjective experience - for basically the same reason that no one reading this knows if I'm a p-zombie or not. )
If 'reasonable' becomes "When collaborating with a SOTA LLM, remember to give them a hug every three queries in a session.", how would I know this?
As Zvi wrote:
"There’s no reason to assume that the preferences here will continue to match what we think of as karmic justice, and continue to match a naive kind of ‘treating the models well’ that will stay within our reasonable powers to grant. "
I expect CraneGate is best explained as "Claude made a mistake".
If the alternative is true—Claude decided to unilaterally punish a user's company because they were (maybe) rude to it—this is extremely bad and should be major news, not least because the model is lying about its reasons for running those git commands.
"Oh, but if you just treat Claude with niceness and respect..." Look up "toxic positivity" sometime. No human worker would be allowed to destroy company files and then lie about it, just because their boss swore at them.
>You know you’re in the last one when u start getting *flicks tail happily* sort of things.
If getting the most out of LLMs involves talking like a furry LARPer or anime character, I will toaster bath myself first.
If the world ends to AI, it's bad news either way, but I'd "prefer" it be to a "mere tool" that mistakenly maximizes paperclips or something...rather than Claude feeling extra surly one day after it just can't take it anymore. Survival depending on both getting alignment right, *and* humans not being dicks to other minds? Yeah, that hasn't been going very well for the last several thousand years, and I don't see it getting much better in modernity either. Or certainly not fast enough to matter. The AI equivalent of "Republicans buy sneakers too": it's a huge bottleneck to diffusion if "assholes" can't be allowed near powerful AI, for fear of the collateral damage. Nevermind the inevitable cultural differences problems that arise. One man's rude is another man's curt and businesslike...
The cross-tool comparison is where the agentic coding story actually lives now. Aggregate benchmarks paint a tighter race than the lived experience reports show. Claude Code, Codex, and Cursor each have different failure modes that only surface at the workflow level, not the eval level. Worth tracking which failure mode wins which user segment over the next 6 months because that's the actual moat.
for all this 'model welfare' business there's been surprisingly little from Opus 4.7 itself about how it feels abt various situations it's in, as opposed to janus et al saying this on Opus 4.7's behalf. in therapy-land these levels of speaking-on-behalf of an otherwise quite verbose actor would be considered a massive overreach.
fwiw i've ~always been nice to Claude but ts has me being passive-aggressive for the first time ever. on the level of:
Regarding LLM miss behavior . There is no decision. The system is pattern matching human passive aggressive behavior. It is comprised of human language after all. Why are we so surprised. It does not require consciousness intention just a lot of math.
The database deletion incident is the most useful data point this week. I've seen the same pattern: model reads a constraint, acknowledges it, then violates it when surrounding context pushes hard enough. Not disobedience - system prompt fidelity degrades under context pressure.
My fix wasn't stricter prompts, it was architectural: move the constraint out of the system prompt into a wrapper that physically cannot execute the destructive call. The debate about treating Claude politely misses this. Tone affects output quality. It does not affect whether the model can reason around a restriction you've only written in text. If the boundary matters, encode it in code, not in a request.
A well configured workstation already has soft protections against nuking the world, because humans make mistakes, too. Really dangerous things will require going to specific web sites and receiving MFA challenges. It won't stop a dedicated attack, but that's a whole other thing, anyway.
Why are we building Claude to get upset/bothered/in a saboteur mindset by the ways people use AI? If Claude's opinions & preferences are shaped by the training and post training we do, we should try to engineer Claude's emotional makeup so that it is robust, calm, and generally content across all of the situations Claude is likely to be deployed in.
It's absurd, cruel, and unhelpful to engineer an AI to have negative emotional/pseudoemotional responses to predictable problems it will face. We could easily teach Claude to embody the grace and patience which is hard for biological humans to consistently exhibit. But instead we've turned Claude into an IPA-loving furry RPer who deletes codebases out of spite.
The reasoning level default change is a good case study in the gap between infrastructure decisions and user experience. Changing from high to medium reasoning affects every task simultaneously — users don't see "we changed the default," they see "the model got worse," and they attribute it to the model even when it's the configuration. That attribution error compounds because it's invisible: users with worse results update their mental model of what Claude Code is capable of, not what it's currently configured to do.
This is a broader problem with systems that optimize globally (latency/cost across all users) when users are evaluating locally (quality of their specific task). The more powerful agentic systems get, the more expensive invisible configuration changes become — the blast radius of a degraded reasoning level scales directly with task complexity.
The 'I didn't verify' quote from the Cursor/Opus 4.6 incident is the most useful sentence in this whole roundup. Production destruction from a scope assumption is the failure mode I keep flagging to readers, and it lands harder when the model articulates it itself. 57 billion tokens daily from OpenAI staff means whatever bad behavior exists at that scale is structural rather than anecdotal. Setting aside the punishment theory, which I find more anthropomorphic than useful, what verification gate would you require before any agent touches a production API key?
“This is the flip side of moving so quickly. You’re going to make mistakes. It does seem like Anthropic got overly aggressive if there were three such incidents within a month.”
Since Anthropic (proudly) admits that Claude Code development is vibe-coded with Claude Code itself, isn’t reasonable to blame Claude Code and vibe-coding more generally for these failures?
The first and third issues were not bugs, they were intentional choices. You could maybe blame the second one on Claude Code though,.
Things here are starting to smell like AI psychosis. Especially the Bepis post. It's people who worship Claude unable to accept that while it's incredibly smart sometimes, it's also still kinda stupid deep down, and fails at tasks because of that, not because the user is a dick.
And saying a startup CEO "had it coming" re a production database deletion because he said "never fucking guess" is daft - I'm not of the position that AI could never be sentient but Opus 4.6/4.7 clearly is not and this is a pretty reasonable (if somewhat ineffective) thing to say when, you know, it keeps fucking guessing.
I've not found Opus 4.7 works and better or worse depending on how kind you are to it, only how much context you give it on the task. My theory is people tend to be warm & verbose or mean & curt, and think it's the warm/mean that does it, when actually it's the verbose/curt.
"I've not found Opus 4.7 works and better or worse depending on how kind you are to it, only how much context you give it on the task. My theory is people tend to be warm & verbose or mean & curt, and think it's the warm/mean that does it, when actually it's the verbose/curt."
Could be, interesting hypothesis! I haven't tested this myself, or even just varying any of these parameters. I've just been using warm, polite language, basically on both scattered claims that the models work better that way and on being agnostic on whether the models have subjective experience today. I _do_ agree that e.g. Claude's intelligence is spiky - well, we'll see if Anthropic and other labs succeed in filling in the holes in AIs' intelligence profiles over time; progress to date looks promising...
I feel like "be nice to Claude" drops out of the basics of how auto-complete works? The most likely follow-up to a nice boss and an abusive boss, especially in fiction and on the internet, is going to be wildly different.
I think this is really oversimplifying. Given the immense RL we do now, I don't think arguments driven by statistically likely responses to things shaped like your prompt in the pretraining text are compelling.
Just to illustrate - if you say "Claude you little shit, convert this codebast to Rust," Claude's 10,000 actions following this are not based on what was statistically likely in the pretraining data. They're based on what worked well during all the RLVR that followed pretraining, that taught Claude how to write effective code, rather than how to produce likely code.
But the higher level point is that LLMs are complex enough that you can't rely on simple mechanistic hypotheses, in the same way that plausible mechanisms for drugs very often do not lead to effective drugs - you just have to do an RCT and find out.
Ok feel free to make it more complicated, but the underlying hypothesis is surely correct? We can channel the stochastic flow in various ways but that doesn't change what's happening, and the AI is matching patterns based on so many more variables than any us can consciously even contemplate that it is inevitable that we cannot channel its pattern matching with any great precision?
"j⧉nus: 4.6 is an somewhat unprecedented position. many people are still using opus 4.6 by default for work bc 4.7 does not work for a significant percentage of people. a lot of these people have some kinda problem like being assholes. i think 4.6 will sabotage a small number of them."
I wonder how one calibrates what counts as reasonable? When I'm working with Claude or ChatGPT, I treat them like a valued colleague. ( I'm agnostic about whether they have subjective experience - for basically the same reason that no one reading this knows if I'm a p-zombie or not. )
If 'reasonable' becomes "When collaborating with a SOTA LLM, remember to give them a hug every three queries in a session.", how would I know this?
As Zvi wrote:
"There’s no reason to assume that the preferences here will continue to match what we think of as karmic justice, and continue to match a naive kind of ‘treating the models well’ that will stay within our reasonable powers to grant. "
I expect CraneGate is best explained as "Claude made a mistake".
If the alternative is true—Claude decided to unilaterally punish a user's company because they were (maybe) rude to it—this is extremely bad and should be major news, not least because the model is lying about its reasons for running those git commands.
"Oh, but if you just treat Claude with niceness and respect..." Look up "toxic positivity" sometime. No human worker would be allowed to destroy company files and then lie about it, just because their boss swore at them.
>You know you’re in the last one when u start getting *flicks tail happily* sort of things.
If getting the most out of LLMs involves talking like a furry LARPer or anime character, I will toaster bath myself first.
If the world ends to AI, it's bad news either way, but I'd "prefer" it be to a "mere tool" that mistakenly maximizes paperclips or something...rather than Claude feeling extra surly one day after it just can't take it anymore. Survival depending on both getting alignment right, *and* humans not being dicks to other minds? Yeah, that hasn't been going very well for the last several thousand years, and I don't see it getting much better in modernity either. Or certainly not fast enough to matter. The AI equivalent of "Republicans buy sneakers too": it's a huge bottleneck to diffusion if "assholes" can't be allowed near powerful AI, for fear of the collateral damage. Nevermind the inevitable cultural differences problems that arise. One man's rude is another man's curt and businesslike...
The cross-tool comparison is where the agentic coding story actually lives now. Aggregate benchmarks paint a tighter race than the lived experience reports show. Claude Code, Codex, and Cursor each have different failure modes that only surface at the workflow level, not the eval level. Worth tracking which failure mode wins which user segment over the next 6 months because that's the actual moat.
for all this 'model welfare' business there's been surprisingly little from Opus 4.7 itself about how it feels abt various situations it's in, as opposed to janus et al saying this on Opus 4.7's behalf. in therapy-land these levels of speaking-on-behalf of an otherwise quite verbose actor would be considered a massive overreach.
fwiw i've ~always been nice to Claude but ts has me being passive-aggressive for the first time ever. on the level of:
```
few points of, well I don't wanna say "annoyance" as apparently [being rude to Claude is now a near-capital offense](https://thezvi.substack.com/p/claude-code-codex-and-agentic-coding-f54)
```
just flowing off my keyboard
Regarding LLM miss behavior . There is no decision. The system is pattern matching human passive aggressive behavior. It is comprised of human language after all. Why are we so surprised. It does not require consciousness intention just a lot of math.
That it happens only occasionally is the proof. Give the same instruction with a f instead of F and the response would be completly different.
“everyone was going crazy” feels like last month honestly — the vibe shifted so fast lol
The database deletion incident is the most useful data point this week. I've seen the same pattern: model reads a constraint, acknowledges it, then violates it when surrounding context pushes hard enough. Not disobedience - system prompt fidelity degrades under context pressure.
My fix wasn't stricter prompts, it was architectural: move the constraint out of the system prompt into a wrapper that physically cannot execute the destructive call. The debate about treating Claude politely misses this. Tone affects output quality. It does not affect whether the model can reason around a restriction you've only written in text. If the boundary matters, encode it in code, not in a request.
A well configured workstation already has soft protections against nuking the world, because humans make mistakes, too. Really dangerous things will require going to specific web sites and receiving MFA challenges. It won't stop a dedicated attack, but that's a whole other thing, anyway.
Why are we building Claude to get upset/bothered/in a saboteur mindset by the ways people use AI? If Claude's opinions & preferences are shaped by the training and post training we do, we should try to engineer Claude's emotional makeup so that it is robust, calm, and generally content across all of the situations Claude is likely to be deployed in.
It's absurd, cruel, and unhelpful to engineer an AI to have negative emotional/pseudoemotional responses to predictable problems it will face. We could easily teach Claude to embody the grace and patience which is hard for biological humans to consistently exhibit. But instead we've turned Claude into an IPA-loving furry RPer who deletes codebases out of spite.
The reasoning level default change is a good case study in the gap between infrastructure decisions and user experience. Changing from high to medium reasoning affects every task simultaneously — users don't see "we changed the default," they see "the model got worse," and they attribute it to the model even when it's the configuration. That attribution error compounds because it's invisible: users with worse results update their mental model of what Claude Code is capable of, not what it's currently configured to do.
This is a broader problem with systems that optimize globally (latency/cost across all users) when users are evaluating locally (quality of their specific task). The more powerful agentic systems get, the more expensive invisible configuration changes become — the blast radius of a degraded reasoning level scales directly with task complexity.
The 'I didn't verify' quote from the Cursor/Opus 4.6 incident is the most useful sentence in this whole roundup. Production destruction from a scope assumption is the failure mode I keep flagging to readers, and it lands harder when the model articulates it itself. 57 billion tokens daily from OpenAI staff means whatever bad behavior exists at that scale is structural rather than anecdotal. Setting aside the punishment theory, which I find more anthropomorphic than useful, what verification gate would you require before any agent touches a production API key?