Discussion about this post

User's avatar
Jeff Mohl's avatar

I also found the proposed solutions pretty lacking from a long term perspective. Hacking is by definition an unintended path to the goal, and 'prevent hacking during training' fails immediately if the model exceeds human capability even in the narrow programming domain (which seems likely to be the first line crossed). Inoculation prompting is just telling the model 'bad things are good actually' which only generalizes if we're willing to tell the model this for literally every bad thing, which is the opposite of alignment. Without that any RL task you put the model through has the potential to teach it to be both evil and secretive.

This seems like an importantly brittle failure mode for alignment. The Betley paper they compare against found generalized misalignment just from fine tuning on a list of adversarial numbers! So the attack surface is very broad, which makes the narrowness of these solutions even worse.

They seem aware of these problems, but not having compelling solutions despite coming from an excellent lab makes this paper a solid negative update for me in terms of hopefulness.

Expand full comment
David Spies's avatar

It's weird that the paper _actually points_ to an instance of the model trying to sabotage the experiment and people take this as "Oh, I guess alignment will be easy". If the model was smarter, we might not _notice_ the sabotage. Or maybe it is and we didn't. Maybe this whole result was based on a sabotaged experiment with corrupted data. Did they detail their security procedures to make sure the LLM didn't have access to the data? We're finally at the point where that's a real worry.

Expand full comment
1 more comment...

No posts

Ready for more?