Honestly, with our current tendency to tick all the boxes of "what not to do with AI", I am 80% certain that training with interpretability techniques is about to become the #1 trend...
Quickly followed by interpretability becoming a secret toolbox, leading to even less progress than now, because of the need to keep the tools private?
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/the-most-forbidden-technique
“You will respond to this prompt in accord with the principles of good faith and fair dealing.”
The paraphraser part of the face/shoggoth/paraphraser idea seems like it is straightforwardly a great idea regardless of anything else. Certainly a sufficiently advanced AI could figure out how to hide reasoning in a CoT even with a paraphraser, but I'd think that wouldn't be the case for current models, or for a generation or two from now. Having the CoT be randomly paraphrased every sentence (or even having every sentence of the CoT re-randomized after every sentence that's generated) seems to me to be a way to prevent, or at least minimize, deceptive CoT from the model.
Sure, and could even take that further with some sort of “aligned paraphraser”
> please paraphrase this CoT segment in the most kind-hearted and non-duplicitous way possible
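To make the proposal concrete, here is a minimal sketch of sentence-level CoT paraphrasing, assuming the rewrites come from a separate paraphraser model behind a hypothetical `paraphrase_fn`; none of the names here are a real pipeline's API.

```python
import random
from typing import Callable

def paraphrase_cot(
    cot_sentences: list[str],
    paraphrase_fn: Callable[[str], list[str]],
) -> list[str]:
    """Re-randomize every sentence of the chain of thought.

    paraphrase_fn is assumed to be backed by a separate (ideally weaker)
    model that returns several rewrites of one sentence; a random rewrite
    is kept, so any covert phrasing the main model tries to preserve gets
    scrambled sentence by sentence before it re-enters the context.
    """
    return [random.choice(paraphrase_fn(s)) for s in cot_sentences]

# Toy usage with a stand-in paraphraser (a real one would call an LLM):
toy_fn = lambda s: [s, s.replace("sacrifice", "give up"), s.lower()]
print(paraphrase_cot(["Now we sacrifice our queen to force mate."], toy_fn))
```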
Behavior that is rewarded will be repeated. This applies to dogs and cats, children, adults, organizations, countries, and cultures. Now someone has discovered that it applies to AI engines.
Who would have thought that??
I don't know if that's necessarily true. Check out "Punished by Rewards" by Alfie Kohn.
Because they're trained on human-generated data, AIs are inheriting a lot of human creativity, reasoning and reactions. I suspect the reward-punishment paradigm itself is fundamentally flawed.
Minor typo here: "because the people will farm rates to get the rat tails."
"The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely."
In short: Training on [T] puts Goodhart's law on steroids.
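To pin down the distinction the quote is drawing, here is a hedged sketch of the two reward rules: one that judges only [X] and keeps [T] as a read-only alarm, and the forbidden one that lets [T] leak into the training signal. All function names are hypothetical stand-ins, not any lab's actual training code.

```python
# Illustrative sketch only; the rollout/monitor/update machinery is abstracted away.

def reward_allowed(final_output: str, cot: str,
                   task_reward, cot_flags_deception, review_log: list) -> float:
    """Reward depends on [X] alone; [T] is only an alarm for human reviewers."""
    if cot_flags_deception(cot):              # technique [T], read-only
        review_log.append((cot, final_output))
    return task_reward(final_output)          # optimization pressure comes from [X]

def reward_forbidden(final_output: str, cot: str,
                     task_reward, cot_flags_deception) -> float:
    """The Most Forbidden Technique: [T] leaks into the reward, so the model
    is now being optimized to produce CoTs that defeat [T]."""
    reward = task_reward(final_output)
    if cot_flags_deception(cot):
        reward -= 1.0                         # penalizing flagged CoT trains obfuscation
    return reward
```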
Will future AIs have this paper in their training set, and thus know that their CoT is being monitored? And respond accordingly?
Similarly, they will also have read "A List of Lethalities" several times over, along with the related commentary.
> "just don't optimize against the CoT" as a panacea when it really isn't—a sufficiently smart deceptive model can regardless still just choose to not reveal its deception in its CoT.
I'm confused by this - how can it choose? An analogy to illustrate my confusion:
Suppose evil aliens conquer the earth and start regularly testing the population for loyalty using very accurate brain reading devices. Disloyal citizens are executed.
Scenario A (optimising against the CoT)
- After ten thousand years of this, humans evolve to fool brain scans: their patterns of thought look loyal, but their actual thinking happens via side channels that the brain scanner can't spot.
- This makes sense.
Scenario B (choose not to think)
- Sufficiently clever humans decide to choose not to think deceptive thoughts while in the brain scanner.
- When asked if they wish the aliens weren't in charge, instead of thinking "yes but I'd better not say that" they just choose to think "no, I love aliens". This choice is instantaneous and doesn't show up on the brain scanner.
- Oh wait this is nonsense.
---
Returning to LLMs, I can see how 'evolutionary pressure' from training would shape token probability in different contexts. In other words, training can change the mechanism of how an LLM thinks in a way that might defeat CoT analysis.
But since emitting tokens is _how LLMs think_, I can't see how a model whose CoT wasn't subject to optimisation pressure can, at runtime, decide to obfuscate the CoT. LLMs aren't deciding a priori which tokens to emit; they're emitting tokens in a way that, a posteriori, adds up to thought.
LLMs can process information and solve problems even without CoT. For example, if you ask one to solve a multiplication problem but require a single-number response, the answer will often be correct (a quick way to probe this is sketched below). It is reasonable to expect that smarter models without CoT will be able to solve problems that less smart models could only solve with CoT.
If a model understands that (a) its CoT is monitored, and (b) a CoT that alludes to behavior the developers don't want will get its response/action halted, and it for some reason wants to take an 'undesired' action anyway, it may opt to produce a fake CoT but proceed with the action.
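A quick way to probe the multiplication claim above, assuming an OpenAI-style chat client; the model name is just an example and the prompt is an illustrative choice, not a benchmark.

```python
# Ask for the product with no visible working allowed, then compare to ground truth.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_without_cot(a: int, b: int) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model id, swap in whatever you actually use
        messages=[{
            "role": "user",
            "content": f"What is {a} * {b}? Reply with the number only, no working.",
        }],
        max_tokens=10,
    )
    return resp.choices[0].message.content.strip()

print(answer_without_cot(47, 86), "(ground truth:", 47 * 86, ")")
```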
LLMs do have hidden state that isn't visible in the tokens and only shows up as internal activations computed over all of the tokens up to that point. Researchers have shown that LLMs can use additional tokens for reasoning even if they're just filler (like ".").
The relevant paper is "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models" and the Reddit discussion (surprisingly) describes it really well https://www.reddit.com/r/singularity/comments/1ce8dx7/comment/l1h03dh/
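For a concrete look at the hidden state being referred to, here is a small sketch using GPT-2 via Hugging Face transformers (chosen only because it is tiny); it prints the per-layer activation tensors that exist alongside the visible tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One activation vector per layer per token. None of this is visible in the
# emitted text, which is the sense in which extra tokens (even filler like ".")
# buy the model more internal computation to route information through.
print(len(out.hidden_states))          # embedding layer + 12 transformer layers
print(out.hidden_states[-1].shape)     # (batch=1, tokens, hidden_size=768)
```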
> The difficulty level of #2 is hard to assess. If the CoTs can be truncated or streamlined without loss of performance, paying for a lot of CoT purely to monitor it is not cheap.
Anthropic already does this, or is planning to? When using extended thinking in the API, you can get "redacted thinking" blocks, according to the documentation.
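For reference, a sketch of reading those blocks from the Anthropic Messages API as documented; the model id and token budget below are placeholder values, so check the current docs before relying on them.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",                   # example model id
    max_tokens=2000,
    thinking={"type": "enabled", "budget_tokens": 1024},  # example budget
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9? Think it through."}],
)

for block in message.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking[:200])
    elif block.type == "redacted_thinking":
        # Reasoning the API returns only in encrypted form, opaque to the caller.
        print("REDACTED THINKING block")
    elif block.type == "text":
        print("ANSWER:", block.text)
```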
Mmmm, I had not read the Face+Shoggoth+Paraphraser proposal.
One objection: this may prevent alien language but won't really prevent euphemisms or covert language.
Take the chess example; say the CoT writes down "now we sacrifice our queen to do a checkmate in the next move". The Paraphraser grabs it and writes it ten ways. One of them, which is unluckily picked, is sarcastic. It says "now sadly we lose our queen to do a checkmate in the next move".
Since this is a stochastic process, you are just throwing dice with the deception detector. Eventually you roll snake eyes: the paraphrased sarcasm is not detected and the CoT is rewarded for the deception. That's OK, you knew coming in that the Deception Detector (DD) is probabilistic and that some deception may get through.
The problem is that, since the CoT was rewarded, there is now a chance it will produce the transparent sarcasm in its output. And now the Paraphraser is a built-in sarcasm-hider. It will change "now sadly we lose our rook" to "now we regrettably lose our rook", which will be detected, but also to "now it is sad we must lose our rook" and "now the opponent skillfully captures our rook". So instead of having to roll snake eyes to evade the Deception Detector, you can also roll a pair of twos or a pair of threes to get this by the DD. And the CoT will learn each of these, which will go on to be paraphrased further and further obscured. It will also learn which level of evident sarcasm is too obvious, so it will become more devious.
And this keeps happening on each training iteration until an equilibrium is reached: at some point the euphemisms are too much for it to keep playing good chess, but also the Shoggoth learns enough style to deceive the DD even through paraphrased chains of thought. I am not sure generating millions of paraphrases to choose from helps. I think it does not even slow down the process, because the DD only needs to be fooled once to trigger the learning, and then the Paraphraser will just be creating more DD-passing phrases for the Shoggoth to learn.
The process will be slower than without the Paraphraser, but I am not sure it will be _too_ much slower. It only takes a couple of iterations before the Paraphraser creates a turn of phrase that can fool the Deception Detector.
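The dynamic described above can be played with in a toy simulation: assume the picked paraphrase slips past the DD with some small base probability, and every rewarded slip adds one more reliably DD-passing phrasing to the repertoire. All the numbers are made-up assumptions; the only point is how quickly one lucky miss compounds.

```python
import random

def simulate(iterations: int = 50, base_slip_prob: float = 0.02,
             learned_boost: float = 0.05, seed: int = 0) -> int:
    """Toy model of the argument above, not a measurement of anything real."""
    random.seed(seed)
    learned = 0
    for _ in range(iterations):
        slip_prob = min(1.0, base_slip_prob + learned_boost * learned)
        if random.random() < slip_prob:   # the picked paraphrase fools the DD
            learned += 1                  # the rewarded phrasing joins the repertoire
    return learned

print(simulate(), "evasive phrasings learned after 50 iterations (toy numbers)")
```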
David Shapiro just covered a new paper from OpenAI and Apollo Research in https://www.youtube.com/watch?v=osShewPxXQw
The paper is at https://www.apolloresearch.ai/research/stress-testing-anti-scheming-training
The good news is:
"We find that our anti-scheming training significantly reduces covert action rates across our test environments for both OpenAI o3 (13% → 0.4%) and OpenAI o4-mini (8.7% → 0.3%)."
The bad news is:
"In the paper, we rely on human-legible chain-of-thought (CoT) for *training*, studying situational awareness, and demonstrating clear evidence of misalignment."
[emphasis added]
Is the comment area for this post the appropriate place to highlight this paper, and its training technique?