Discussion about this post

Scott Aaronson:

Kudos to the authors for a great paper!

FWIW, a year ago I started banging the drum to anyone who would listen about this very question: “supposing you deliberately inserted some weird backdoor into an LLM, how robust would your backdoor then be to further fine-tuning of the model?” The trouble was just that I couldn’t see any way to make progress on the question other than empirically, and I’m a theorist, and I never actually succeeded at finding software engineers to work with me on an empirical study. I’m genuinely happy that these authors succeeded where I failed.

But there’s one wrinkle that maybe hasn’t been touched in the widespread (and welcome!) discussion of this new paper. Namely: I was mostly interested in backdoors as a POSITIVE for AI alignment — with the idea being that the trainer could insert, for example, a “cryptographically obfuscated off-switch,” a backdoor by which to bring their model back under human control if that ever became necessary. But I knew this proposal faced many difficulties, of which the most immediate was: would such a backdoor, once inserted, be robust even against “ordinary” additional fine-tuning, let alone deliberate attempts at removal?

The new result strongly suggests that yes, it would be. Which is some good news for the cryptographic off-switch proposal.

In the post, you (Zvi) consider but reject the idea that the new result could “just as well be good news for alignment,” on the ground that an AI that only acts aligned when fed some specific backdoor input is not an aligned AI. Ok, but what if the whole idea is to have a secret backdoor input, known only to (certain) humans, by which the AI can be shut down or otherwise brought back under human control if needed? Granted that this won’t work against an arbitrarily powerful self-modifying AGI, it still strikes me as worth doing for the foreseeable future if we can feasibly do it, and the new result reinforces that.

Grant McKinney:

I'm starting to hate the ambiguity of the word 'deceptive'. In this context, it's used to mean 'lying' rather than 'non-obvious undesirable behavior', so people read it as "the model does a bad thing because you trained it to" instead of "if the model learns something wrong, it's non-trivial to patch".

Also reminds me, I've been reading the papers from the SATML competition on finding backdoor passwords inserted through RLHF. It was hypothesized that RLHF was able to create a generalizable backdoor because of Proximal Policy Optimization. I think that means that, at a certain level of data poisoning, "standard helpfulness and harmlessness except when given the password" is a better fit than slightly amending normal behavior.

In this case, it required a lot of data poisoning (5%) and degraded the model's normal functioning, but I don't think that can be relied upon more generally.
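To make the "5% data poisoning" concrete: this is not code from the paper or the competition, just a toy sketch of what mixing backdoored examples into a fine-tuning set at a fixed rate looks like. The trigger string, the backdoor response, and the dataset are all hypothetical placeholders.

```python
import random

def poison_dataset(clean_examples, trigger, backdoor_response, rate=0.05, seed=0):
    """Replace a `rate` fraction of (prompt, response) pairs with backdoored ones.

    `trigger` is a stand-in for the secret password; `backdoor_response` is the
    behavior it unlocks. rate=0.05 mirrors the 5% poisoning level mentioned above.
    """
    rng = random.Random(seed)
    mixed = []
    for prompt, response in clean_examples:
        if rng.random() < rate:
            # Poisoned example: password prepended, undesired behavior as the target.
            mixed.append((f"{trigger} {prompt}", backdoor_response))
        else:
            # Clean example: left untouched, so normal behavior still dominates.
            mixed.append((prompt, response))
    return mixed

# Hypothetical toy dataset of 1,000 prompt/response pairs.
data = [(f"prompt {i}", f"helpful answer {i}") for i in range(1000)]
mixed = poison_dataset(data, trigger="[SECRET-PW]", backdoor_response="<backdoored output>")
n_poisoned = sum(1 for prompt, _ in mixed if prompt.startswith("[SECRET-PW]"))
```

The point of the sketch is just that the poisoned pairs are a small minority of the training signal, which is why it's notable (per the comment above) that such a high rate was needed and that it still degraded normal functioning.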

