Honestly, with our current tendency to tick all the boxes of "what not to do with AI", I am 80% certain that training with interpretability techniques is about to become the #1 trend...
Quickly followed by interpretability becoming a secret toolbox, leading to even less progress than now, because of the need to keep the tools private?
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/the-most-forbidden-technique
“You will respond to this prompt in accord with the principles of good faith and fair dealing.”
The paraphraser part of the face/shoggoth/paraphraser idea seems like it is straightforwardly a great idea regardless of anything else. Certainly a sufficiently advanced AI could figure out how to hide reasoning in a CoT even with a paraphraser, but I'd think that wouldn't be the case for current models, or for a generation or two from now. Having the CoT be randomly paraphrased every sentence (or even having every sentence of the CoT re-randomized after every sentence that's generated) seems to me to be a way to prevent, or at least minimize, deceptive CoT from the model.
Sure, and could even take that further with some sort of “aligned paraphraser”
> please paraphrase this CoT segment in the most kind-hearted and non-duplicitous way possible
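To make the proposal concrete, here is a minimal sketch of sentence-level CoT paraphrasing, assuming the rewrites come from a separate paraphraser model behind a hypothetical `paraphrase_fn`; none of the names here are a real pipeline's API.

```python
import random
from typing import Callable

def paraphrase_cot(
    cot_sentences: list[str],
    paraphrase_fn: Callable[[str], list[str]],
) -> list[str]:
    """Re-randomize every sentence of the chain of thought.

    paraphrase_fn is assumed to be backed by a separate (ideally weaker)
    model that returns several rewrites of one sentence; a random rewrite
    is kept, so any covert phrasing the main model tries to preserve gets
    scrambled sentence by sentence before it re-enters the context.
    """
    return [random.choice(paraphrase_fn(s)) for s in cot_sentences]

# Toy usage with a stand-in paraphraser (a real one would call an LLM):
toy_fn = lambda s: [s, s.replace("sacrifice", "give up"), s.lower()]
print(paraphrase_cot(["Now we sacrifice our queen to force mate."], toy_fn))
```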
Behavior that is rewarded will be repeated. This applies to dogs and cats, children, adults, organizations, countries, and cultures. Now someone has discovered that it applies to AI engines.
Who would have thought that??
I don't know if that's necessarily true. Check out "Punished by Rewards" by Alfie Kohn.
Because they're trained on human-generated data, AIs are inheriting a lot of human creativity, reasoning and reactions. I suspect the reward-punishment paradigm itself is fundamentally flawed.
Minor typo here: "because the people will farm rates to get the rat tails."
"The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely."
In short: Training on [T] puts Goodhart's law on steroids.
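To pin down the distinction the quote is drawing, here is a hedged sketch of the two reward rules: one that judges only [X] and keeps [T] as a read-only alarm, and the forbidden one that lets [T] leak into the training signal. All function names are hypothetical stand-ins, not any lab's actual training code.

```python
# Illustrative sketch only; the rollout/monitor/update machinery is abstracted away.

def reward_allowed(final_output: str, cot: str,
                   task_reward, cot_flags_deception, review_log: list) -> float:
    """Reward depends on [X] alone; [T] is only an alarm for human reviewers."""
    if cot_flags_deception(cot):              # technique [T], read-only
        review_log.append((cot, final_output))
    return task_reward(final_output)          # optimization pressure comes from [X]

def reward_forbidden(final_output: str, cot: str,
                     task_reward, cot_flags_deception) -> float:
    """The Most Forbidden Technique: [T] leaks into the reward, so the model
    is now being optimized to produce CoTs that defeat [T]."""
    reward = task_reward(final_output)
    if cot_flags_deception(cot):
        reward -= 1.0                         # penalizing flagged CoT trains obfuscation
    return reward
```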
Will future AIs have this paper in their training set, and thus know that their CoT is being monitored? And respond accordingly?
Similarly, they will also have read "A List of Lethalities" several times over, along with the related commentary.
> "just don't optimize against the CoT" as a panacea when it really isn't—a sufficiently smart deceptive model can regardless still just choose to not reveal its deception in its CoT.
I'm confused by this - how can it choose? An analogy to illustrate my confusion:
Suppose evil aliens conquer the earth and start regularly testing the population for loyalty using very accurate brain reading devices. Disloyal citizens are executed.
Scenario A (optimising against the CoT)
- After ten thousand years of this, humans evolve to fool brain scans: their patterns of thought look loyal, but their actual thinking happens via side channels that the brain scanner can't spot.
- This makes sense.
Scenario B (choose not to think)
- Sufficiently clever humans decide to choose not to think deceptive thoughts while in the brain scanner.
- When asked if they wish the aliens weren't in charge, instead of thinking "yes but I'd better not say that" they just choose to think "no, I love aliens". This choice is instantaneous and doesn't show up on the brain scanner.
- Oh wait this is nonsense.
---
Returning to LLMs, I can see how 'evolutionary pressure' from training would shape token probability in different contexts. In other words, training can change the mechanism of how an LLM thinks in a way that might defeat CoT analysis.
But since emitting tokens is _how LLMs think_, I can't see how a model whose CoT wasn't subject to optimisation pressure can, at runtime, decide to obfuscate the CoT. LLMs aren't deciding a priori which tokens to emit; they're emitting tokens in a way that, a posteriori, adds up to thought.
LLMs can process information and solve problems even without CoT. For example, if you ask one to solve a multiplication problem but require a single-number response, the answer will often be correct (a quick way to probe this is sketched below). It is reasonable to expect that smarter models without CoT will be able to solve problems that less smart models could only solve with CoT.
If a model understands that (a) its CoT is monitored, and (b) a CoT that alludes to behavior the developers don't want will get its response/action halted, and it for some reason wants to take an 'undesired' action anyway, it may opt to produce a fake CoT but proceed with the action.
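A quick way to probe the multiplication claim above, assuming an OpenAI-style chat client; the model name is just an example and the prompt is an illustrative choice, not a benchmark.

```python
# Ask for the product with no visible working allowed, then compare to ground truth.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_without_cot(a: int, b: int) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model id, swap in whatever you actually use
        messages=[{
            "role": "user",
            "content": f"What is {a} * {b}? Reply with the number only, no working.",
        }],
        max_tokens=10,
    )
    return resp.choices[0].message.content.strip()

print(answer_without_cot(47, 86), "(ground truth:", 47 * 86, ")")
```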
LLMs do have hidden state that isn't visible in the tokens and only shows up as internal activations computed over all of the tokens up to that point. Researchers have shown that LLMs can use additional tokens for reasoning even if they're just filler (like ".").
The relevant paper is "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models" and the Reddit discussion (surprisingly) describes it really well https://www.reddit.com/r/singularity/comments/1ce8dx7/comment/l1h03dh/
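For a concrete look at the hidden state being referred to, here is a small sketch using GPT-2 via Hugging Face transformers (chosen only because it is tiny); it prints the per-layer activation tensors that exist alongside the visible tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One activation vector per layer per token. None of this is visible in the
# emitted text, which is the sense in which extra tokens (even filler like ".")
# buy the model more internal computation to route information through.
print(len(out.hidden_states))          # embedding layer + 12 transformer layers
print(out.hidden_states[-1].shape)     # (batch=1, tokens, hidden_size=768)
```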
> The difficulty level of #2 is hard to assess. If the CoTs can be truncated or streamlined without loss of performance, paying for a lot of CoT purely to monitor it is not cheap.
Anthropic already does this, or is planning to? When using extended thinking in the API, you can get "redacted thinking" blocks, according to the documentation.
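For reference, a sketch of reading those blocks from the Anthropic Messages API as documented; the model id and token budget below are placeholder values, so check the current docs before relying on them.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",                   # example model id
    max_tokens=2000,
    thinking={"type": "enabled", "budget_tokens": 1024},  # example budget
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9? Think it through."}],
)

for block in message.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking[:200])
    elif block.type == "redacted_thinking":
        # Reasoning the API returns only in encrypted form, opaque to the caller.
        print("REDACTED THINKING block")
    elif block.type == "text":
        print("ANSWER:", block.text)
```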
Mmmm, I had not read the Face+Shoggoth+Paraphraser proposal.
One objection: this may prevent alien language but won't really prevent euphemisms or covert language.
Take the chess example; say the CoT writes down "now we sacrifice our queen to do a checkmate in the next move". The Paraphraser grabs it and writes it ten ways. One of them, which is unluckily picked, is sarcastic. It says "now sadly we lose our queen to do a checkmate in the next move".
Since this is a stochastic process, you are just throwing dice with the deception detector. Eventually you roll snake eyes: the paraphrased sarcasm is not detected and the CoT is rewarded for the deception. That's OK, you knew coming in that the Deception Detector (DD) is probabilistic and that some deception may get through.
The problem is that, since the CoT was rewarded, there is now a chance it will produce the transparent sarcasm in its output. And now the Paraphraser is a built-in sarcasm-hider. It will change "now sadly we lose our rook" to "now we regrettably lose our rook", which will be detected, but also to "now it is sad we must lose our rook" and "now the opponent skillfully captures our rook". So instead of having to roll snake eyes to evade the Deception Detector, you can also roll a pair of twos or a pair of threes to get this by the DD. And the CoT will learn each of these, which will go on to be paraphrased further and further obscured. It will also learn which level of evident sarcasm is too obvious, so it will become more devious.
And this keeps happening on each training iteration until an equilibrium is reached: at some point the euphemisms are too much for it to keep playing good chess, but also the Shoggoth learns enough style to deceive the DD even through paraphrased chains of thought. I am not sure generating millions of paraphrases to choose from helps. I think it does not even slow down the process, because the DD only needs to be fooled once to trigger the learning, and then the Paraphraser will just be creating more DD-passing phrases for the Shoggoth to learn.
The process will be slower than without the Paraphraser, but I am not sure it will be _too_ much slower. It only takes a couple of iterations before the Paraphraser creates a turn of phrase that can fool the Deception Detector.
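The dynamic described above can be played with in a toy simulation: assume the picked paraphrase slips past the DD with some small base probability, and every rewarded slip adds one more reliably DD-passing phrasing to the repertoire. All the numbers are made-up assumptions; the only point is how quickly one lucky miss compounds.

```python
import random

def simulate(iterations: int = 50, base_slip_prob: float = 0.02,
             learned_boost: float = 0.05, seed: int = 0) -> int:
    """Toy model of the argument above, not a measurement of anything real."""
    random.seed(seed)
    learned = 0
    for _ in range(iterations):
        slip_prob = min(1.0, base_slip_prob + learned_boost * learned)
        if random.random() < slip_prob:   # the picked paraphrase fools the DD
            learned += 1                  # the rewarded phrasing joins the repertoire
    return learned

print(simulate(), "evasive phrasings learned after 50 iterations (toy numbers)")
```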
David Shapiro just covered a new paper from OpenAI and Apollo Research in https://www.youtube.com/watch?v=osShewPxXQw
The paper is at https://www.apolloresearch.ai/research/stress-testing-anti-scheming-training
The good news is:
"We find that our anti-scheming training significantly reduces covert action rates across our test environments for both OpenAI o3 (13% → 0.4%) and OpenAI o4-mini (8.7% → 0.3%)."
The bad news is:
"In the paper, we rely on human-legible chain-of-thought (CoT) for *training*, studying situational awareness, and demonstrating clear evidence of misalignment."
[emphasis added]
Is the comment area for this post the appropriate place to highlight this paper, and its training technique?