5 Comments

Honestly, given our current tendency to tick every box on the "what not to do with AI" list, I am 80% certain that training with interpretability techniques is about to become the #1 trend...

“You will respond to this prompt in accord with the principles of good faith and fair dealing.”

The paraphraser part of the face/shoggoth/paraphraser idea seems like a straightforwardly great idea regardless of anything else. Certainly a sufficiently advanced AI could figure out how to hide reasoning in a CoT even with a paraphraser, but I’d think that wouldn’t be the case for current models, or for models a generation or two from now. Having the CoT randomly paraphrased sentence by sentence (or even having every earlier sentence of the CoT re-randomized after each newly generated sentence) seems to me a way to prevent, or at least minimize, deceptive CoT from the model.
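To make the per-sentence paraphrasing loop concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `SYNONYMS` is a toy word-swap table standing in for a real paraphraser model, and `next_sentence` is a hypothetical callback standing in for the model that writes the CoT. The only point it shows is that the model conditions on the paraphrased history, never on its own exact wording.

```python
import random
import re

# Hypothetical synonym table; a toy stand-in for a real paraphraser model.
SYNONYMS = {
    "big": ["large", "huge"],
    "answer": ["result", "solution"],
    "check": ["verify", "confirm"],
}


def paraphrase_sentence(sentence, rng):
    """Toy paraphraser: replace each known word with a randomly chosen synonym."""
    def swap(match):
        word = match.group(0)
        options = SYNONYMS.get(word.lower())
        return rng.choice(options) if options else word

    return re.sub(r"[A-Za-z]+", swap, sentence)


def generate_with_paraphraser(next_sentence, max_sentences, rng):
    """Build a CoT where the generator only ever sees paraphrased history.

    next_sentence(history) is a hypothetical callback: it receives the
    paraphrased sentences so far and returns the next raw sentence,
    or None when the reasoning is finished.
    """
    visible = []
    for _ in range(max_sentences):
        sentence = next_sentence(visible)
        if sentence is None:
            break
        # Only the paraphrase is kept, so exact word choices are not preserved.
        visible.append(paraphrase_sentence(sentence, rng))
    return visible
```

The stricter variant mentioned above would re-paraphrase every entry already in `visible` on each step, instead of only the newest sentence.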

Behavior that is rewarded will be repeated. That applies to dogs and cats, children, adults, organizations, countries, and cultures. Now someone has discovered that it applies to AI engines too.

Who would have thought that??
