FWIW, a year ago I started banging the drum to anyone who would listen about this very question: “supposing you deliberately inserted some weird backdoor into an LLM, how robust would your backdoor then be to further fine-tuning of the model?” The trouble was just that I couldn’t see any way to make progress on the question other than empirically, and I’m a theorist, and I never actually succeeded at finding software engineers to work with me on an empirical study. I’m genuinely happy that these authors succeeded where I failed.
But there’s one wrinkle that maybe hasn’t been touched in the widespread (and welcome!) discussion of this new paper. Namely: I was mostly interested in backdoors as a POSITIVE for AI alignment — with the idea being that the trainer could insert, for example, a “cryptographically obfuscated off-switch,” a backdoor by which to bring their model back under human control if that ever became necessary. But I knew this proposal faced many difficulties, of which the most immediate was: would such a backdoor, once inserted, be robust even against “ordinary” additional fine-tuning, let alone deliberate attempts at removal?
The new result strongly suggests that yes, it would be. Which is some good news for the cryptographic off-switch proposal.
In the post, you (Zvi) consider but reject the idea that the new result could “just as well be good news for alignment,” on the ground that an AI that only acts aligned when fed some specific backdoor input is not an aligned AI. Ok, but what if the whole idea is to have a secret backdoor input, known only to (certain) humans, by which the AI can be shut down or otherwise brought back under human control if needed? Granted that this won’t work against an arbitrarily powerful self-modifying AGI, it still strikes me as worth doing for the foreseeable future if we can feasibly do it, and the new result reinforces that.
I had not thought about an off switch or control mechanism as a possibility here until you mentioned it. Hmm. So basically, this says that fine tuning or other standard modification or self-modification techniques would preserve such a backdoor, which could be a security vulnerability, that humans could exploit, which can be positive. Perhaps? Could also fall into the wrong hands if you create it, or be created without you intending to do so - a backdoor that gives a malicious user control over your AI or shuts it down sounds pretty scary, in a 'maybe we shouldn't do that' kind of way. And of course, a truly scary AI that can self-modify can probably figure out what you did and undo it? My understanding is that the reason we can't remove the backdoor is because we don't know what it is.
Well, with a powerful enough self-modifying AI, the worry is that it either just trains a new AI to pursue the same goals as it (and which is free of the backdoor), or else it inserts itself in some wrapper code that says “if I ever output something that looks like a shutdown command, overwrite it with ‘stab the humans harder.’ “
On the other hand, we have some cryptographic evidence (Goldwasser et al 2022) that backdoors can be inserted into ML models that are at least *undetectable* in polynomial time — eg that are triggered by some random password. Combined with the new work, this would plausibly give us backdoors that are both undetectable *and* robust to fine-tuning, and that could be inserted by anyone who controlled the training data. Such backdoors, it seems to me, might be used by either good guys *or* bad guys to maintain secret control over a model for good or bad purposes respectively.
Undetectable at all, or undetectable via querying the LLM? Presumably a sufficiently advanced direct inspection would work.
Also one obvious implication of this is that if you use one AI to train another AI, it could easily contain a backdoor that humans don't know about, or otherwise does arbitrary behaviors, that could have been created at the behest of a previous such behavior, and so on. This seems like another of the 'the humans don't stay in charge of this technique for very long' situations.
The sense of “undetectable” here is that if you’re allowed to query the LLM, and even **given access to the weights** (!), you still can’t identify the triggering password in an amount of time that’s less than exponential in the length of the password (ie polynomial in the number of possible passwords), without also solving some believed-to-be-hard cryptographic problem. Goldwasser et al showed that this could be done for depth-2 neural nets with random initial weights, based on what’s called the Planted Clique problem. Showing this for more realistic (eg higher-depth) neural nets is open but seems plausible.
I thought this was interesting work, but it seemed overindexed on vibes to me. In particular, the secret scratchpad stuff appears to just be a LARP, and including it actively harms the model's performance at both its secret goal and its training goal.
The graphs on p. 52 are interesting and a bit confusing. The training loss graphs show that the backdoor is pretty disastrous for performance. They trained the backdoored models way past the point of overfitting, but couldn't get the same performance that the pretrained model had out of the box! This kind of maybe suggests that the price of training a persistent backdoor is frying the model's brain to the point where it just can't learn any more. But then there's a really bizarre graph (the Claude-1.3 variant of Fig. 28) where their held-out preference model rates the unbackdoored model as notably *better* than the backdoored model. Wildly, at the start of training, the held-out model thinks that the backdoor model with the trigger active is better than the base model. Both the base model and the triggered backdoor peak at about the same eval according to Claude 2, which I struggle to find any reasonable explanation for.
> "Scott Alexander also covers this [...] You could start with his write-up to get the gist [...] or you can read here knowing that everything he discusses is covered below. There was one good comment [...] but most people got distracted by the 'grue' analogy."
For a beginner like myself Alexander's post was important, because only he really explains one reason why someone might have expected the backdoors to disappear. That reason is generalisation, and it's his "counter case" to the basic objection that, given that the safety training did not touch the backdoors directly, why would have one ever expected them not to persist. Yet then, when he proceeds to the "next-level [and final] case", reverting to this basic expectation of their persistence, he writes:
"If for some reason we ever *started out* believing that grass was grue [green until 2030, then turning blue], no evidence could ever change our minds [except] evidence about the state of the world after 2030."
That doesn't seem right? One could learn from the history books that the sky did not change its colour (blue) in 2002, that blood did not change its colour (red) in 1983, etc, etc, and eventually, after reflection and a failure to think of independent reasons why grass in 2030 should be different from the apparent historical norm, come to the conclusion that the ingrained belief that grass is grue is probably a *mistake*. To be discarded deliberately (as opposed to "catastrophically forgotten"). Which seems to imply a possibility, however theoretical and very much unreliable, that current AI models just are not intelligent yet to discard the backdoors, and that current "safety" techniques might still do the trick with better models.
> the paper's actual setup is just telling us that the order of the training points doesn't matter too much.
> Certainly with some forms of fine tuning it is very not true, you can remove safety training of Llama-2 (for example) with a very small run, whereas I'd presume starting with that small run then doing the safety training would get you the safety training. So in regular training perhaps it is true, although I’m not sure why you would expect this?
I wonder if the point actually does still apply to Llama-2. We know that you can train out the safety features later, but I haven't played with that version and I wonder if it might be doing something similar to the sleeper agents. Like, is the jailbroken version of Llama-2 doing something in the pattern of "now it's 2024 and it's okay to ignore safety," and the safety training is still there, it just doesn't come up unless it is elicited (and we don't know the right trigger for that safe sleeper agent).
I'm starting to hate the ambiguity of the word 'deceptive'. In this context, it's used to mean 'lying' and not 'non-obvious undesirable behavior' and so people read it as "the model does a bad thing because you trained it to" instead of "if the model learns something wrong, it's non-trivial to patch".
Also reminds me, I've been reading the papers from the SATML competition about finding back door passwords inserted through RLHF. It was hypothesized that RLHF was able to create a generalizable back door because of Proximal Policy Optimization. Think that means at a certain level of data poisoning, standard harmlessness and helpfulness except when given the password is a better fit than slightly amending normal behavior.
In this case, it required a lot of data poisoning (5%) and degraded the model's normal functioning, but I don't think that can be relied upon more generally.
In terms of reaction in policy/govt, I can add one anecdote, which is that I sent it to someone in [department of Western govt] whose entire role is AI, and they said well they assume the authors are planning to get it published at some point. Implication seemed maybe they'd pay attention to it then. Real problem how many govt/policy people still think arXiv is something between a joke paper repository and a dumping ground for things you can't get published, and will just ignore anything on it as simply not part of the "evidence base".
Isn't it a goos thing in a way? If it is almost impossible to remove, can't you insert a backdoor like "I die when current date plus 2 years", etc?
Kudos to the authors for a great paper!
FWIW, a year ago I started banging the drum to anyone who would listen about this very question: “supposing you deliberately inserted some weird backdoor into an LLM, how robust would your backdoor then be to further fine-tuning of the model?” The trouble was just that I couldn’t see any way to make progress on the question other than empirically, and I’m a theorist, and I never actually succeeded at finding software engineers to work with me on an empirical study. I’m genuinely happy that these authors succeeded where I failed.
But there’s one wrinkle that maybe hasn’t been touched in the widespread (and welcome!) discussion of this new paper. Namely: I was mostly interested in backdoors as a POSITIVE for AI alignment — with the idea being that the trainer could insert, for example, a “cryptographically obfuscated off-switch,” a backdoor by which to bring their model back under human control if that ever became necessary. But I knew this proposal faced many difficulties, of which the most immediate was: would such a backdoor, once inserted, be robust even against “ordinary” additional fine-tuning, let alone deliberate attempts at removal?
The new result strongly suggests that yes, it would be. Which is some good news for the cryptographic off-switch proposal.
In the post, you (Zvi) consider but reject the idea that the new result could “just as well be good news for alignment,” on the ground that an AI that only acts aligned when fed some specific backdoor input is not an aligned AI. Ok, but what if the whole idea is to have a secret backdoor input, known only to (certain) humans, by which the AI can be shut down or otherwise brought back under human control if needed? Granted that this won’t work against an arbitrarily powerful self-modifying AGI, it still strikes me as worth doing for the foreseeable future if we can feasibly do it, and the new result reinforces that.
I had not thought about an off switch or control mechanism as a possibility here until you mentioned it. Hmm. So basically, this says that fine tuning or other standard modification or self-modification techniques would preserve such a backdoor, which could be a security vulnerability, that humans could exploit, which can be positive. Perhaps? Could also fall into the wrong hands if you create it, or be created without you intending to do so - a backdoor that gives a malicious user control over your AI or shuts it down sounds pretty scary, in a 'maybe we shouldn't do that' kind of way. And of course, a truly scary AI that can self-modify can probably figure out what you did and undo it? My understanding is that the reason we can't remove the backdoor is because we don't know what it is.
Well, with a powerful enough self-modifying AI, the worry is that it either just trains a new AI to pursue the same goals as it (and which is free of the backdoor), or else it inserts itself in some wrapper code that says “if I ever output something that looks like a shutdown command, overwrite it with ‘stab the humans harder.’ “
On the other hand, we have some cryptographic evidence (Goldwasser et al 2022) that backdoors can be inserted into ML models that are at least *undetectable* in polynomial time — eg that are triggered by some random password. Combined with the new work, this would plausibly give us backdoors that are both undetectable *and* robust to fine-tuning, and that could be inserted by anyone who controlled the training data. Such backdoors, it seems to me, might be used by either good guys *or* bad guys to maintain secret control over a model for good or bad purposes respectively.
Undetectable at all, or undetectable via querying the LLM? Presumably a sufficiently advanced direct inspection would work.
Also one obvious implication of this is that if you use one AI to train another AI, it could easily contain a backdoor that humans don't know about, or otherwise does arbitrary behaviors, that could have been created at the behest of a previous such behavior, and so on. This seems like another of the 'the humans don't stay in charge of this technique for very long' situations.
The sense of “undetectable” here is that if you’re allowed to query the LLM, and even **given access to the weights** (!), you still can’t identify the triggering password in an amount of time that’s less than exponential in the length of the password (ie polynomial in the number of possible passwords), without also solving some believed-to-be-hard cryptographic problem. Goldwasser et al showed that this could be done for depth-2 neural nets with random initial weights, based on what’s called the Planted Clique problem. Showing this for more realistic (eg higher-depth) neural nets is open but seems plausible.
https://arxiv.org/abs/2204.06974
I agree with your second paragraph — not that we needed such results to worry about humans losing control in a world full of AIs training other AIs!
I thought this was interesting work, but it seemed overindexed on vibes to me. In particular, the secret scratchpad stuff appears to just be a LARP, and including it actively harms the model's performance at both its secret goal and its training goal.
The graphs on p. 52 are interesting and a bit confusing. The training loss graphs show that the backdoor is pretty disastrous for performance. They trained the backdoored models way past the point of overfitting, but couldn't get the same performance that the pretrained model had out of the box! This kind of maybe suggests that the price of training a persistent backdoor is frying the model's brain to the point where it just can't learn any more. But then there's a really bizarre graph (the Claude-1.3 variant of Fig. 28) where their held-out preference model rates the unbackdoored model as notably *better* than the backdoored model. Wildly, at the start of training, the held-out model thinks that the backdoor model with the trigger active is better than the base model. Both the base model and the triggered backdoor peak at about the same eval according to Claude 2, which I struggle to find any reasonable explanation for.
> "Scott Alexander also covers this [...] You could start with his write-up to get the gist [...] or you can read here knowing that everything he discusses is covered below. There was one good comment [...] but most people got distracted by the 'grue' analogy."
For a beginner like myself Alexander's post was important, because only he really explains one reason why someone might have expected the backdoors to disappear. That reason is generalisation, and it's his "counter case" to the basic objection that, given that the safety training did not touch the backdoors directly, why would have one ever expected them not to persist. Yet then, when he proceeds to the "next-level [and final] case", reverting to this basic expectation of their persistence, he writes:
"If for some reason we ever *started out* believing that grass was grue [green until 2030, then turning blue], no evidence could ever change our minds [except] evidence about the state of the world after 2030."
That doesn't seem right? One could learn from the history books that the sky did not change its colour (blue) in 2002, that blood did not change its colour (red) in 1983, etc, etc, and eventually, after reflection and a failure to think of independent reasons why grass in 2030 should be different from the apparent historical norm, come to the conclusion that the ingrained belief that grass is grue is probably a *mistake*. To be discarded deliberately (as opposed to "catastrophically forgotten"). Which seems to imply a possibility, however theoretical and very much unreliable, that current AI models just are not intelligent yet to discard the backdoors, and that current "safety" techniques might still do the trick with better models.
I think if we get to the point where an AI is doing such a process of reflection we are either dead or in a world of unprecedented abundance.
> the paper's actual setup is just telling us that the order of the training points doesn't matter too much.
> Certainly with some forms of fine tuning it is very not true, you can remove safety training of Llama-2 (for example) with a very small run, whereas I'd presume starting with that small run then doing the safety training would get you the safety training. So in regular training perhaps it is true, although I’m not sure why you would expect this?
I wonder if the point actually does still apply to Llama-2. We know that you can train out the safety features later, but I haven't played with that version and I wonder if it might be doing something similar to the sleeper agents. Like, is the jailbroken version of Llama-2 doing something in the pattern of "now it's 2024 and it's okay to ignore safety," and the safety training is still there, it just doesn't come up unless it is elicited (and we don't know the right trigger for that safe sleeper agent).
I'm starting to hate the ambiguity of the word 'deceptive'. In this context, it's used to mean 'lying' and not 'non-obvious undesirable behavior' and so people read it as "the model does a bad thing because you trained it to" instead of "if the model learns something wrong, it's non-trivial to patch".
Also reminds me, I've been reading the papers from the SATML competition about finding back door passwords inserted through RLHF. It was hypothesized that RLHF was able to create a generalizable back door because of Proximal Policy Optimization. Think that means at a certain level of data poisoning, standard harmlessness and helpfulness except when given the password is a better fit than slightly amending normal behavior.
In this case, it required a lot of data poisoning (5%) and degraded the model's normal functioning, but I don't think that can be relied upon more generally.
In terms of reaction in policy/govt, I can add one anecdote, which is that I sent it to someone in [department of Western govt] whose entire role is AI, and they said well they assume the authors are planning to get it published at some point. Implication seemed maybe they'd pay attention to it then. Real problem how many govt/policy people still think arXiv is something between a joke paper repository and a dumping ground for things you can't get published, and will just ignore anything on it as simply not part of the "evidence base".