Discussion about this post

vectro

Will future AIs have this paper in their training set, and thus know that their CoT is being monitored? And respond accordingly?

rational_hippy

Honestly, given our current tendency to tick every box on the "what not to do with AI" list, I am 80% certain that training with interpretability techniques is about to become the #1 trend...
