Discussion about this post

User's avatar
Inside The Black Box's avatar

"Wait til you find out about the invention of fire or the wheel" is one of those quotes that tells you everything about how the industry sees itself. Fire and the wheel didn't need a PAC to manage their public image. The analogy only works if you've already decided your technology is an inevitable force of nature rather than a product being sold by companies with specific financial interests.

[insert here] delenda est's avatar

I am a bit depressed by the contrast between these alignment papers, and various earlier work, which appear to indicate that extremely small samples in the training data can influence model behaviour, and then read all these "only 0.6%" of training was affected. I think that if a model were to be motivated to hide/obfuscate its CoT for whatever reason, it would already know that it should do that.

40 more comments...

No posts

Ready for more?