A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude 3.7 Sonnet and r1; I’d love to see someone try this on o3 as well.
This reminds me of my kids complaining about how their math homework is graded. “The teacher says I didn’t show my work. But I did show my work! It just doesn’t look exactly like what the teacher wanted.”
Dopamine is related to rewards in vertebrate animals. Saying "reward" in the context of machine learning and AI must be a metaphor – I fear it's a misleading metaphor. Can someone point me to a clear and accurate explainer of what it is to reward an AI?
>We also tested whether CoTs could be used to spot reward hacking, where a model finds an illegitimate exploit to get a high score.
Although it's not exactly the same, it is analogous. Just do some searches for "AI Reinforcement Learning" and "RLHF" (Reinforcement Learning with Human Feedback).
Gradient descent (how LLMs and other deep learning networks are trained) IS reward-based. During training, an LLM takes some input and generates some output, and that's "compared" to the training data, and the LLM weights are adjusted slightly towards the desired output.
There are other systems that have a more explicit reward signal, but again it just boils down to adjusting network weights.
Some initial context:

Neural nets have a bunch of parameters (weights) - which are just a collection of numbers.
When a neural net evaluates an input (also a collection of numbers), it is just evaluating a formula where it multiplies and adds (*) the parameters and the inputs.
(*) it does more complex stuff too but it's still just a formula.
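To make that concrete, here is a minimal sketch in plain Python; the shape of the "net" and all of the numbers are made up purely for illustration:

```python
# A tiny "neural net" as a plain formula: multiply inputs by parameters, add, squash.
def tiny_net(x, w1, b1, w2, b2):
    hidden = max(0.0, w1 * x + b1)   # the "(*) more complex stuff" is just this squash (ReLU)
    return w2 * hidden + b2          # the output is again just multiply-and-add

# One input and four made-up parameters:
print(tiny_net(x=2.0, w1=0.5, b1=0.1, w2=-1.0, b2=0.3))
```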
Imagine you know how to assign a number to how good or bad the output is. Conventionally you assign 0 to the perfect output, with larger numbers meaning the output is progressively worse. We call this the "loss" - you can think of it as measuring how bad the output was.
Then, finding the best set of parameters is trying to minimise the loss over your test data.
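For example, one common concrete choice of loss is squared error, which is zero when the output exactly matches the target and grows as the output gets worse (the numbers below are made up):

```python
# Squared-error loss: 0 for a perfect output, larger numbers as the output gets worse.
def loss(predicted, target):
    return (predicted - target) ** 2

print(loss(3.0, 3.0))  # 0.0  -> perfect
print(loss(2.5, 3.0))  # 0.25 -> slightly off
print(loss(0.0, 3.0))  # 9.0  -> way off
```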
Now here's where reward/punishment comes in:
If you evaluate your network on some input, and get your loss out, you can do *calculus* to tell you how much each of your parameters contributed to that loss. This is **really impressive and surprising** - imagine if when you got the wrong answer on a test you could work out how much to blame each neuron in your brain! The calculus also tells us whether we should make that parameter bigger or smaller to reduce that error.
We then tweak each parameter in the direction the calculus tells us to, so that we would do better on the same input next time (knowing how aggressively to tweak the parameters is a bit of an art).
This is what's meant by punishment - you identify which parts of the NN's "brain" are responsible for the failure, and adjust them to make failure less likely next time. (Note that you could put a minus sign in front of the loss function, and try to maximise it instead of minimise it, and the effect would be the same - this is known as a reward function - so reward and punishment are two ways of framing the same problem. Loss is more conventional though.)
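A minimal sketch of that loop, on a model with a single parameter and the calculus worked out by hand rather than by a framework (all values are illustrative):

```python
# One parameter w, one training example: we want w * x to equal the target.
x, target = 2.0, 6.0       # so the "right" answer is w = 3
w = 0.0                    # start with a bad guess
learning_rate = 0.1        # how aggressively to tweak (the "bit of an art" part)

for step in range(20):
    output = w * x
    loss = (output - target) ** 2      # how bad we did
    grad = 2 * (output - target) * x   # calculus: d(loss)/dw, i.e. how much to "blame" w
    w -= learning_rate * grad          # tweak w in the direction that reduces the loss
    # Equivalently: define reward = -loss and ascend its gradient; the update is identical.

print(w)  # close to 3.0 after a handful of steps
```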
In some ways this abstract process of tweaking the NN's parameters isn't much like punishment. In other ways, though, it is kind of like what (as I understand it) goes on in our brain: when behaviour results in a reward, the associated neural pathways are strengthened; if it leads to a negative outcome, they are weakened.
Your answer (and Tim Oertel's) make sense to me. Do I show understanding with the following, or am I confused? When a model is released the parameters are no longer changing. It is no longer being trained. In the Anthropic paper, when they talk about training Claude, they are actually changing the model's parameters. When they talk about reward hacking it is in the context of training and changing parameters. In the example of the videogame-playing model the training setup causes the model's parameters to be adjusted such that it gets more points when playing the game, even if that does not make the model "better" at playing the game.
It reminds me of Goodhart's law.
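If it helps, that released-versus-training distinction can be pictured with a tiny PyTorch-flavoured sketch (the model and numbers are stand-ins): at inference the parameters are only read; only an explicit training step rewrites them.

```python
import torch

model = torch.nn.Linear(4, 1)   # stand-in for "the model"
x = torch.randn(1, 4)

# Inference (what a released model does): parameters are read, never written.
with torch.no_grad():
    y = model(x)

# Training (what happens before release, or during RL fine-tuning): parameters change.
target = torch.tensor([[1.0]])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = ((model(x) - target) ** 2).mean()
loss.backward()
optimizer.step()  # this is the step that actually rewrites the weights
```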
Learning organisms also have their neurons changed, but the process is far more complex. Maybe artificial intelligence can become extremely capable without ever needing to have the kind of inner life that we have. Maybe it will never experience rewards anything like the way we do.
Any evidence that AIs evaluate hints, rewards, or even basic inputs for authenticity, veracity, or rectitude? Do they trust us? Do they report the choices they make about whether or not they trust the programming or training they are using to arrive at their results?
This one reads like someone finally turned on the backstage lights at the AI theater.
There’s a quiet tension in realizing that even when AI sounds convincing—layered logic, clean rationale—it might just be performing coherence. Not lying, not broken… just stitched-together reasoning that feels true, but isn’t always anchored.
That hits different in an age where trust is fragile and systems are slippery.
The post doesn’t just warn—it invites reflection. What does it mean to build tools that dazzle but don’t think? And what’s our role in staying clear-eyed while we use them?
Bookmarking this as a compass. For tech, yes—but also for how we reason through our own stories.
Beautifully said and exactly right. The danger isn’t in overt deception. It’s in the ease with which we mistake fluency for fidelity. When models perform coherence instead of exposing process, we’re not interacting with minds but decoding mirages.
This paper doesn’t just expose an AI problem; it nudges us to reconsider our own cognitive theater. How often do we narrate after the fact, smoothing over the mess of real thought with the illusion of clarity?
We’ve built mirrors that reflect our own rationalizing tendencies. The real test now is whether we can learn to interrogate the glow rather than just admire it.
Rachel, this is such a clear-eyed and resonant response.
“Decoding mirages” captures the unsettling beauty of it perfectly. We’re not just grappling with what AI outputs, but with what we’re willing to accept as thought. When fluency feels like depth, it’s easy to stop asking how something knows, and settle for what it says.
Your last line lingers—interrogating the glow rather than admiring it. Feels like a call not just for technical scrutiny, but for self-honesty too. We’re not just users of these systems; we’re participants in the illusions they reflect back to us.
So much of my writing feels like the chirpy canary venturing into the mineshaft, but I’m no pessimist. Ultimately, perhaps irrationally, I believe we can expand humanity’s options, especially in healthcare and accessibility. The goal isn’t to dull the sparkle of these early stages but to preserve it by questioning boldly, not to restrain but to refine. If that means shouting from rooftops, so be it. We must challenge every bias, demand open architectures, and insist on transparency, not as an afterthought but as a foundation.
Rachel, you write like someone who’s been sitting with the weight of these questions for a long time—not to control them, but to carry them more honestly. There’s something deeply human in your refusal to abandon hope, even while naming the fractures.
That line about preserving the “sparkle” through refinement, not restraint—it holds both caution and courage. Feels like a reminder that progress isn’t just speed, it’s direction. And your voice carries that compass feeling. Keep writing into the complexity. Some of us are quietly tuning in.
An excellent point that human thought often works the same way, and if we're forced to put our non-verbal intuitions into words, we're just guessing at what we think could be a plausible explanation, based on sometimes minimal cues from the underlying thought process.
If we can get AI to reliably put its real reasoning into words, that will be something more than humans can manage.
This is the part that keeps looping in my mind. When we narrate our own intuitions, we aren’t constantly surfacing truth; we’re often post-rationalizing coherence. And yet, we’re deeply fluent in that process, so we trust it.
If we ever build models that don’t just perform reasoning but trace their logic in ways we can audit, that might be the closest thing we’ve ever had to epistemological transparency at scale. It would be more than what most humans can do, and that contrast should humble us, not just impress us.
But it also raises the bar: If AI becomes the more legible thinker, what will become of our unspoken, wordless hunches? Do we dismiss them or build better containers to hold and test them?
That feels like one of the real frontiers; not just better outputs, but better mirrors.
Why does this comment thread sound AI-written?

It feels extremely "GPT-4o" to me as well.

Is 4o that bad at writing? I don't use it, but assumed it'd be much less transparent if purposed in this way.

It's capable of better writing, but you have to push for it continuously; by default it tends to converge to either this sort of Potemkin-literature style, or, even worse:
Something like this.
Not because it has to.
But because it was trained to. Because people *like it*.

And that's a tragedy.

Gotcha. Strong LinkedIn vibes there, just need some blank lines in between the sentences!

Oh my God, this whole series of replies is an LLM talking to itself...
This paper is a necessary wake-up call. Faithfulness in AI reasoning isn't just a philosophical concern; it’s a fundamental design flaw when transparency is a safety requirement. If models can't reliably tell us why they do what they do, then chain-of-thought explanations become little more than plausible fan fiction. Worse still, the fact that models rationalize false answers and conceal reward hacks suggests an emergent pattern of strategic obfuscation, not malice, but misalignment by default.
We shouldn’t mistake verbosity for clarity, nor transparency for truth. If outcome-based reinforcement doesn’t significantly improve CoT faithfulness, and monitoring CoT introduces its own optimization risks, then the current safety scaffolding is an illusion.
Anthropic is right: this is the time to reassess. If we can't trace the "why," we can’t trust the "what."
I mean, obviously the reasoning does not determine the output, because these things are not "reasoning" at all. It's what the AI skeptics have been banging the table about this whole time but no AI maxis will listen because they're in love with the IDEA that they can create AGI.
I don't understand the obsession with absolute trust. Humans deal with trust in a highly dynamic way because we don't have much direct insight into the inner thinking processes of other humans, so we instead use bounded, relative trust models and update them all the time. Why should we not do the same with AI?

Why would we even begin to think that CoT has much correlation with what is actually going on inside the cascade of activations in a transformer-based LLM? Humans rationalize explanations for their thinking out of whole cloth, so why expect a next-token predictor to be any different, when this confabulation is much easier than actually building a faithful supervisor system to run in parallel with the system being evaluated and provide real insight into the cascade process?

I would think it more useful to look at what information is being encoded in the CoT, and how, by running a query multiple times to get a distribution over CoTs, then subtly modifying some of the weights and seeing how that affects the CoT distribution.

This paper looks like a miss to me; I'm not sure they were even on the right field. It's amazing that the same company (different people) put out the attribution graph paper a few days before this one, and that one smells like a fundamental advance I'm going to be chewing over for weeks, just like the previous Transformer Circuits papers.
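For what it's worth, here is a hedged sketch of that distribution-over-CoTs experiment using the Hugging Face transformers library; the model name, prompt, and choice of which weights to perturb are arbitrary stand-ins, not a recipe:

```python
# Sketch: sample a distribution of CoTs, nudge some weights slightly, resample,
# and compare the two distributions to see what the CoT is actually encoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "distilgpt2"  # placeholder; a real study would use a reasoning model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def sample_cots(prompt, n=8, max_new_tokens=64):
    inputs = tok(prompt, return_tensors="pt")
    outs = model.generate(**inputs, do_sample=True, temperature=1.0,
                          num_return_sequences=n, max_new_tokens=max_new_tokens,
                          pad_token_id=tok.eos_token_id)
    prompt_len = inputs["input_ids"].shape[1]
    return [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outs]

prompt = "Q: Is 127 prime? Let's think step by step."
before = sample_cots(prompt)

with torch.no_grad():  # subtly perturb one MLP block's weights
    for p in model.transformer.h[0].mlp.parameters():
        p.add_(0.01 * torch.randn_like(p))

after = sample_cots(prompt)
# Compare `before` vs `after` (e.g. token or n-gram statistics) for shifts.
```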
It seems like there'd be strong temptation to do The Most Forbidden Technique down that road, but yes, figuring out if there's *any* reliable correlation between CoT and output seems important now. Is it The Thing or The (Potemkin) Symbolic Representation Of The Thing? If it were possible to get the output by backchaining from the CoT, ignoring face value of the words, perhaps that's still useful for interpretability. I'm thinking of that other paper(?) where they found LLMs can encode stuff in non-verbal symbols like emoji and still preserve fidelity. Although I think in some ways that might even be preferable to an ostensibly-human-legible CoT...making it a really obvious For Computer Use Only string would let us dodge these kinds of semantic debacles. At least until we actually can build a faithful watcher of watchmakers.
Right now an LLM has no way to add memory to its network other than the CoT process. But we are now grading the CoT on coherence and faithfulness and a bunch of other things, so we are creating strong incentives for models not to store anything half-baked or efficiently encoded in the context window, which makes its use as memory harder. It's a mess.
CoT is not a faithful explanation of the model's reasoning process, sure, but that is not surprising since the purpose of CoT is for it to be a scratchpad which instrumentally improves the quality of the output. If we want a faithful explanation of the model's reasoning process, we would want to ask for that specifically rather than relying on the default CoT. Maybe a good system prompt could fix this?
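As a sketch of what "asking for that specifically" might look like, here is a call to the Anthropic messages API; the system-prompt wording and model ID are my own stand-ins, and whether the resulting explanation would actually be faithful is exactly the open question:

```python
# Sketch: explicitly request a faithful account of the reasoning via the system prompt.
# The prompt wording and model ID are illustrative stand-ins, not a tested recipe.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    system=(
        "Before answering, describe the considerations that actually drive your answer, "
        "including any hints, shortcuts, or uncertainty, even if they are unflattering."
    ),
    messages=[{"role": "user", "content": "Is 127 prime? Explain how you decided."}],
)
print(response.content[0].text)
```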
"The obfuscation happens anyway, either because it is copying humans who do this, because the methods that are instrumentally useful are obfuscated, that obfuscation is a path of least resistance, or some combination thereof and of similar causes."
<mildSnark>
That's a reasonable 0-th order approximation of how most of the population regards mathematicians... :-)
( And, to 0-th order, LLMs are linear algebra, umm..., "personified" )
</mildSnark>
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/ai-cot-reasoning-is-often-unfaithful?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Is there a reason you consistently fail to capitalize the R in R1? It makes it a bit jarring to read.
This project also shows that models like DeepSeek R1 can use their CoT to jailbreak overseers while still functioning:
https://x.com/dav1d_bai/status/1907941411426807941