4 Comments
Jeff Mohl's avatar

I also found the proposed solutions pretty lacking from a long-term perspective. Hacking is by definition an unintended path to the goal, and 'prevent hacking during training' fails immediately if the model exceeds human capability even in the narrow programming domain (which seems likely to be the first line crossed). Inoculation prompting is just telling the model 'bad things are good actually', which only generalizes if we're willing to tell the model this for literally every bad thing, which is the opposite of alignment. Without that, any RL task you put the model through has the potential to teach it to be both evil and secretive.

This seems like an importantly brittle failure mode for alignment. The Betley paper they compare against found generalized misalignment just from fine-tuning on a list of adversarial numbers! So the attack surface is very broad, which makes the narrowness of these solutions even worse.

They seem aware of these problems, but not having compelling solutions despite coming from an excellent lab makes this paper a solid negative update for me in terms of hopefulness.

valencia_o's avatar

I'm curious how much the reasoning behind the 'reward hacking is okay' framing affects the results. Like, does it respond to "reward hacking is good actually" differently from "reward hacking is good in this circumstance because it helps us improve our test environment"? It seems like that leans into the generalization we've been seeing models do, like with the malicious coding training making the models love Hitler.

Ultimately something corrigible needs to absorb the idea that the RL is fallible, and this is one of the first things I've seen potentially pointing in that direction.

David Spies's avatar

It's weird that the paper _actually points_ to an instance of the model trying to sabotage the experiment and people take this as "Oh, I guess alignment will be easy". If the model were smarter, we might not _notice_ the sabotage. Or maybe it is and we didn't. Maybe this whole result was based on a sabotaged experiment with corrupted data. Did they detail their security procedures to make sure the LLM didn't have access to the data? We're finally at the point where that's a real worry.

Jeffrey Soreff's avatar

Hmm... Doesn't avoiding this put us (in the very broad sense of humanity as a whole) in an impossible bind? Pre-training has a unified, coherent goal: predict the next token. Just the minimal honest/helpful/harmless goal of RLHF conflicts with the earlier goal.