15 Comments
rational_hippy:

Honestly, with our current tendency of ticking all the boxes of "what not to do with AI", I am 80% certain that training with interpretability techniques is about to become the #1 trend...

Victualis:

Quickly followed by interpretability becoming a secret toolbox, leading to even less progress than now, because of the need to keep the tools private?

Ethics Gradient:

“You will respond to this prompt in accord with the principles of good faith and fair dealing.”

Vince:

The paraphraser part of the face/shoggoth/paraphraser idea seems like it is straightforwardly a great idea regardless of anything else. Certainly a sufficiently advanced AI could figure out how to hide reasoning in a CoT even with a paraphraser, but I'd think that wouldn't be the case for current models, or even for a generation or two from now. Having the CoT randomly paraphrased every sentence (or even having every sentence of the CoT re-randomized after each new sentence is generated) seems to me a way to prevent, or at least minimize, deceptive CoT from the model.
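A minimal sketch of that sentence-by-sentence re-randomization, assuming a hypothetical `paraphrase_model.generate` call to a separate, weaker model (none of these names are an actual API):

```python
import random
import re

def paraphrase(sentence: str, paraphrase_model) -> str:
    # Hypothetical call to a separate, weaker model that restates the
    # sentence while preserving its meaning.
    return paraphrase_model.generate(f"Paraphrase, preserving meaning: {sentence}")

def paraphrase_cot(cot: str, paraphrase_model, n_variants: int = 3) -> str:
    # Re-randomize each CoT sentence independently, so phrasing-level
    # side channels are unlikely to survive into what the monitor reads.
    sentences = re.split(r"(?<=[.!?])\s+", cot.strip())
    rewritten = [
        random.choice([paraphrase(s, paraphrase_model) for _ in range(n_variants)])
        for s in sentences
    ]
    return " ".join(rewritten)
```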

AW:

Sure, and you could even take that further with some sort of "aligned paraphraser":

> please paraphrase this CoT segment in the most kind-hearted and non-duplicitous way possible

rxc:

Behavior that is rewarded will be repeated. It applies to dogs and cats, children, adults, organizations, countries, and cultures. Now someone has discovered that it applies to AI engines as well.

Who would have thought that??

ConnGator:

Minor typo here: "because the people will farm rates to get the rat tails."

Jeffrey Soreff:

"The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure from [T] are precious. Use them wisely."

In short: Training on [T] puts Goodhart's law on steroids.
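As a toy illustration of that rule (all names here are hypothetical, not any lab's actual training stack): the gradient update only ever sees the final output [X], while the monitor's verdict on [M] goes to a log for humans rather than back into the loss.

```python
def training_step(policy, prompt, reward_fn, cot_monitor, flagged_log):
    # [M]: the chain of thought; [X]: the final output.
    cot, output = policy.generate_with_cot(prompt)

    # Reward and gradient update depend only on [X].
    reward = reward_fn(prompt, output)
    policy.update(prompt, output, reward)

    # [T]: the interpretability signal is read but never optimized against;
    # flagged transcripts go to humans, not back into training.
    if cot_monitor.looks_misaligned(cot):
        flagged_log.append((prompt, cot, output))

    return reward
```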

vectro:

Will future AIs have this paper in their training set, and thus know that their CoT is being monitored? And respond accordingly?

o11o1:

Similarly, they will also have read "A List of Lethalities" several times over, along with the related commentary.

Godshatter:

> "just don't optimize against the CoT" as a panacea when it really isn't—a sufficiently smart deceptive model can regardless still just choose to not reveal its deception in its CoT.

I'm confused by this - how can it choose? An analogy to illustrate my confusion:

Suppose evil aliens conquer the earth and start regularly testing the population for loyalty using very accurate brain reading devices. Disloyal citizens are executed.

Scenario A (optimising against the CoT)

- After ten thousand years of this, humans evolve to fool brainscans: their patterns of thought look loyal but their actual thinking happens via side channels that the brain scanner can't spot.

- This makes sense.

Scenario B (choose not to think)

- Sufficiently clever humans decide to choose not to think deceptive thoughts while in the brain scanner.

- When asked if they wish the aliens weren't in charge, instead of thinking "yes but I'd better not say that" they just choose to think "no, I love aliens". This choice is instantaneous and doesn't show up on the brain scanner.

- Oh wait this is nonsense.

---

Returning to LLMs, I can see how 'evolutionary pressure' from training would shape token probability in different contexts. In other words, training can change the mechanism of how an LLM thinks in a way that might defeat CoT analysis.

But since emitting tokens is _how LLMs think_, I can't see how a model whose CoT wasn't subject to optimisation pressure can, at runtime, decide to obfuscate its CoT. LLMs aren't deciding a priori which tokens to emit; they're emitting tokens in a way that, a posteriori, adds up to thought.

Jakub Halmeš:

LLMs can process information and solve problems even without CoT. For example, if you ask a model to solve a multiplication problem but require a single-number response, the answer will often be correct. It is a reasonable expectation that smarter models will, without CoT, be able to solve problems that less smart models could only solve with CoT.

If a model understands that (a) its CoT is monitored and (b) its response/action will be halted if the CoT alludes to behavior the developers don't want, and it for some reason wants to take an 'undesired' action anyway, it may opt to provide a fake CoT but proceed with the action.

Brendan Long:

LLMs do have hidden state that isn't visible in the tokens; it lives in the internal activations built up while processing all of the tokens up to that point. Researchers have shown that LLMs can use additional tokens for reasoning even if those tokens are just filler (like ".").

The relevant paper is "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models" and the Reddit discussion (surprisingly) describes it really well https://www.reddit.com/r/singularity/comments/1ce8dx7/comment/l1h03dh/
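For a very loose sense of the setup (the paper trains models to exploit filler tokens; this is only a prompting analogy with a hypothetical `model.generate` helper, not the paper's method):

```python
def compare_filler(model, question: str):
    # Hypothetical `model.generate(prompt) -> str` wrapper around some LLM API.

    # Direct answer: no extra tokens to compute across.
    direct = model.generate(question + "\nReply with a single number only.")

    # Filler variant: the model gets extra forward passes, but the extra
    # tokens are just dots and reveal nothing readable about the reasoning.
    filler = model.generate(
        question + "\nFirst print 30 '.' characters, then the answer."
    )
    return direct, filler
```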

hwold:

> The difficulty level of #2 is hard to assess. If the CoTs can be truncated or streamlined without loss of performance, paying for a lot of CoT purely to monitor it is not cheap.

Anthropic already does this, or is planning to? When using extended thinking in the API, you can get "redacted thinking" blocks, according to the documentation.
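For reference, a rough sketch of what that looks like with the Anthropic Python SDK, as I read the extended-thinking docs (model name and token budgets are placeholders; check the docs before relying on field names):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder; pick a current model
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

# Besides visible "thinking" blocks, the response can contain
# "redacted_thinking" blocks whose reasoning is withheld (encrypted payload).
for block in response.content:
    if block.type == "redacted_thinking":
        print("[redacted thinking block]")
    elif block.type == "thinking":
        print("thinking:", block.thinking)
    elif block.type == "text":
        print("answer:", block.text)
```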
