12 Comments

Isn't it a goos thing in a way? If it is almost impossible to remove, can't you insert a backdoor like "I die when current date plus 2 years", etc?

Expand full comment

Kudos to the authors for a great paper!

FWIW, a year ago I started banging the drum to anyone who would listen about this very question: “supposing you deliberately inserted some weird backdoor into an LLM, how robust would your backdoor then be to further fine-tuning of the model?” The trouble was just that I couldn’t see any way to make progress on the question other than empirically, and I’m a theorist, and I never actually succeeded at finding software engineers to work with me on an empirical study. I’m genuinely happy that these authors succeeded where I failed.

But there’s one wrinkle that maybe hasn’t been touched in the widespread (and welcome!) discussion of this new paper. Namely: I was mostly interested in backdoors as a POSITIVE for AI alignment — with the idea being that the trainer could insert, for example, a “cryptographically obfuscated off-switch,” a backdoor by which to bring their model back under human control if that ever became necessary. But I knew this proposal faced many difficulties, of which the most immediate was: would such a backdoor, once inserted, be robust even against “ordinary” additional fine-tuning, let alone deliberate attempts at removal?

The new result strongly suggests that yes, it would be. Which is some good news for the cryptographic off-switch proposal.

In the post, you (Zvi) consider but reject the idea that the new result could “just as well be good news for alignment,” on the ground that an AI that only acts aligned when fed some specific backdoor input is not an aligned AI. Ok, but what if the whole idea is to have a secret backdoor input, known only to (certain) humans, by which the AI can be shut down or otherwise brought back under human control if needed? Granted that this won’t work against an arbitrarily powerful self-modifying AGI, it still strikes me as worth doing for the foreseeable future if we can feasibly do it, and the new result reinforces that.

Expand full comment

I thought this was interesting work, but it seemed overindexed on vibes to me. In particular, the secret scratchpad stuff appears to just be a LARP, and including it actively harms the model's performance at both its secret goal and its training goal.

The graphs on p. 52 are interesting and a bit confusing. The training loss graphs show that the backdoor is pretty disastrous for performance. They trained the backdoored models way past the point of overfitting, but couldn't get the same performance that the pretrained model had out of the box! This kind of maybe suggests that the price of training a persistent backdoor is frying the model's brain to the point where it just can't learn any more. But then there's a really bizarre graph (the Claude-1.3 variant of Fig. 28) where their held-out preference model rates the unbackdoored model as notably *better* than the backdoored model. Wildly, at the start of training, the held-out model thinks that the backdoor model with the trigger active is better than the base model. Both the base model and the triggered backdoor peak at about the same eval according to Claude 2, which I struggle to find any reasonable explanation for.

Expand full comment

> "Scott Alexander also covers this [...] You could start with his write-up to get the gist [...] or you can read here knowing that everything he discusses is covered below. There was one good comment [...] but most people got distracted by the 'grue' analogy."

For a beginner like myself Alexander's post was important, because only he really explains one reason why someone might have expected the backdoors to disappear. That reason is generalisation, and it's his "counter case" to the basic objection that, given that the safety training did not touch the backdoors directly, why would have one ever expected them not to persist. Yet then, when he proceeds to the "next-level [and final] case", reverting to this basic expectation of their persistence, he writes:

"If for some reason we ever *started out* believing that grass was grue [green until 2030, then turning blue], no evidence could ever change our minds [except] evidence about the state of the world after 2030."

That doesn't seem right? One could learn from the history books that the sky did not change its colour (blue) in 2002, that blood did not change its colour (red) in 1983, etc, etc, and eventually, after reflection and a failure to think of independent reasons why grass in 2030 should be different from the apparent historical norm, come to the conclusion that the ingrained belief that grass is grue is probably a *mistake*. To be discarded deliberately (as opposed to "catastrophically forgotten"). Which seems to imply a possibility, however theoretical and very much unreliable, that current AI models just are not intelligent yet to discard the backdoors, and that current "safety" techniques might still do the trick with better models.

Expand full comment

> the paper's actual setup is just telling us that the order of the training points doesn't matter too much.

> Certainly with some forms of fine tuning it is very not true, you can remove safety training of Llama-2 (for example) with a very small run, whereas I'd presume starting with that small run then doing the safety training would get you the safety training. So in regular training perhaps it is true, although I’m not sure why you would expect this?

I wonder if the point actually does still apply to Llama-2. We know that you can train out the safety features later, but I haven't played with that version and I wonder if it might be doing something similar to the sleeper agents. Like, is the jailbroken version of Llama-2 doing something in the pattern of "now it's 2024 and it's okay to ignore safety," and the safety training is still there, it just doesn't come up unless it is elicited (and we don't know the right trigger for that safe sleeper agent).

Expand full comment
Jan 19·edited Jan 19

I'm starting to hate the ambiguity of the word 'deceptive'. In this context, it's used to mean 'lying' and not 'non-obvious undesirable behavior' and so people read it as "the model does a bad thing because you trained it to" instead of "if the model learns something wrong, it's non-trivial to patch".

Also reminds me, I've been reading the papers from the SATML competition about finding back door passwords inserted through RLHF. It was hypothesized that RLHF was able to create a generalizable back door because of Proximal Policy Optimization. Think that means at a certain level of data poisoning, standard harmlessness and helpfulness except when given the password is a better fit than slightly amending normal behavior.

In this case, it required a lot of data poisoning (5%) and degraded the model's normal functioning, but I don't think that can be relied upon more generally.

Expand full comment

In terms of reaction in policy/govt, I can add one anecdote, which is that I sent it to someone in [department of Western govt] whose entire role is AI, and they said well they assume the authors are planning to get it published at some point. Implication seemed maybe they'd pay attention to it then. Real problem how many govt/policy people still think arXiv is something between a joke paper repository and a dumping ground for things you can't get published, and will just ignore anything on it as simply not part of the "evidence base".

Expand full comment