In their announcement, *Introducing Superalignment*, OpenAI committed 20% of their secured compute and a new taskforce to solving the technical problem of aligning a superintelligence within four years.
> If the first stage in your plan for alignment of superintelligence involves building a general intelligence (AGI), what makes you think you’ll be able to align that first AGI? What makes you think you can hit the at best rather narrow window of human intelligence without undershooting or overshooting?
Yep!
Also: even if you can build a human-level system that does alignment research safely on the first try, running more than one copy of this system at superhuman speed safely is something no one has any idea how to even approach. Safely directing the whole superintelligent system at solving alignment is an alignment-complete problem, while spending some money on a human-level alignment researcher is not (you could just hire someone).
I'm not actually sure that "strong" alignment is possible or meaningful. My mental model is something like:
1. Colonizing the stars was always for post-humans, not monkeys in tin cans.
2. "Intelligence" is fundamentally giant, partly inscrutable matrices. Virtually all progress we've made on AI came from recognizing and accepting this fact.
3. Human-friendly values are too complex to formally model in a mathematically rigorous way.
4. Yudkowsky's fast takeoff scenarios involve unlikely assumptions about nanotech and slightly dubious assumptions about intelligence. I'd give them probability < 0.05 within 5 years of us building an AGI. Most of this weight comes from the fact that synthetic biology seems like a poor way to build GPUs, and diamond-phase nanotech looked like extreme "hopium" the last time I looked at the research.
5. But "slow" takeoff can still kill us all, given 10-50 years. The main limiting factor is building out atom-moving tools on a scale sufficiently large to replace the world economy, and doing so without too much human backlash. Without nanotech, this harder to hide.
6. Many scenarios where we die look like utopia until a "treacherous turn", because a hostile AI would need time for a sufficient robotics build out, and it wouldn't want to trigger a human backlash until it had secured its supply lines.
Because of (2) and (3), I'm deeply pessimistic about "strong" alignment, where the goal is "we have a mathematical proof the AI won't kill us." I strongly suspect that asking for such a proof fundamentally misunderstands how both intelligence and human values work.
So any alignment plan, in my pessimistic view, is roughly equivalent to, "We're going to carefully raise the alien spider god from childhood, teach it the best moral values we know, and hope that when it grows up, it likes humans enough to keep us around." I actually believe that this plan _might_ work, but that's just my prior that intelligent beings are sometimes benevolent. If GPT-4 has shown me anything, it's that we could give the alien spider god a pretty good understanding of humans. And therefore we're not picking _completely_ randomly from "goal space" or "mind space." Sure, we might be trying to hit a bullseye while blind drunk, but at least we're in the same room as the dartboard.
So, as for OpenAI's plan, I like the part where they say, "we admit this is a problem, so let's try a bunch of things and allocate lots of resources." I don't think their plan offers any guarantees beyond "trying to raise the alien spider god with values we like." But I think that's about the best we could get, conditional on building ASI.
I am _much_ more concerned that their plan apparently starts with "step (1): build a near-human AGI." My prior is that near-human AGIs are not _automatically_ threats, for the same reason that there are some very smart people who have not made themselves world dictator.
My fundamental objection to building a "narrow" AGI to help with alignment research is that it encourages proliferation. Once you can build the specialized AGI, it's presumably very easy to build and deploy AGIs commercially. Which brings us several huge steps closer to the endgame.
But this is classic OpenAI: "We'll protect you from Skynet by doing everything possible to accelerate the development of AI!"
But the near-human AGI they build can't self-improve and can be subject to multiple alignment techniques, plus it has targeted brain damage.
This seems fairly likely to work.
Let's assume that we can build a near-human AGI, but train/box/specialize/restrict it enough so that it isn't a threat. I _can_ imagine that, even if we overshoot human intelligence by a modest factor. The fact that we're relying so heavily on LLMs suggests that we might see something that has a few human-like properties and lower intrinsic motivation to do things.
But I am old and cynical, and I strongly suspect that as soon as we can build an AGI, even a weak one, there will be vast commercial pressure to make millions of copies and to use it for thousands of commercial tasks. And there will be pressure to improve it. And OpenAI is subject to all the standard human failings...
I think there lies the rub... inherent, immutable, and pervasive **human failings.** How can we construct anything intelligent that does not have that? Well, as seen by the OP's posts, a lot of words from a lot of smart people, all ideas doomed from the start. I just love the tunnel vision of "I cannot fail, I will succeed, we just have to pound harder." This human failure has killed more pilots than anything else, the belief that we must be right, bugger all. Perhaps if a team started using the inverse strategy: we have failed, we fucked up, let's clean up our mess before we make another dollar on it. Ha, I'm dreaming now...
Alignment research seems like it always runs into the same problem. The motivation for doing it is really great, but there's no practical plan to make any progress.
I have very limited understanding of the real problems with alignment, but if I could try to sum it up, it would be something like:
1. Alignment is a far more complex problem to solve than simply creating a super-advanced, unaligned AI.
2. We will likely only have one shot at completely solving it.
3. No existing ideas seem at all likely to work.
IF those three statements are true, then we truly are on a path to disaster, because the mundane utility will cause us to continue to rush forward with merely the faint outline of a plan. I do think the current OpenAI plan is something akin to the meme you provided: "Invent AGI - ??? - Solve AGI Alignment".
We're in a pretty bad spot when, to a layman like me, this actually seems like the most feasible plan, because I've yet to hear a human plan for alignment that has any chance of working (and certainly not within the timeframe I suspect we'll need it).
Is this a fair assessment of our situation?
You are not incorrect, but we do have multiple partial plans. It might be like Swiss cheese, but as long as no single hole goes through all of the plans, we could be okay.
So one thought I've been chewing on is that the underspecification of "alignment" is really hamstringing a lot of this -- if we don't even have a well-defined target for how we would prefer an aligned AGI to act, relative to an unaligned one, the very problem itself seems intractable.
I would therefore like to propose that we develop a set of axioms or proposals for what "alignment" might look like to at least get some degree of common vocabulary available for discussion.
One thought that comes to mind is that what we really want from AGI/ASI is for it to obey something like what I would call the Criterion of Minimum Action, or Criterion of Minimal Disruption -- i.e., relative to the current world-state, an AGI/ASI given a goal (and I think we can take it as stipulated that AGI/ASI will be agentic by design, or else made agentic by humans essentially as soon as it exists) should accomplish that goal in whatever way minimizes changes to the status-quo-ante world-state. The loss function (perhaps increasing hyperbolically with deviation from the status quo ante) would be built from a set of more tractable sub-problems that define such deviation -- e.g., human deaths above or below the no-action scenario, increases or decreases in longevity and birthrate, deviations from climatic equilibrium and trajectory, various animal/wildlife/biomass homeostatic factors, etc.[1][2]
What this really boils down to is trying to analogize AGI/ASI behavior to what I think human alignment actually "is," which at its core is about capability-limitation more than anything else -- the reason Hitler doesn't declare victory and genocide all the Jews is that he doesn't have the biggest army or the Bomb (and because capabilities are generally a distributed rather than unitary resource), not that values-alignment is a reliable human trait. Obviously the main problem of AGI/ASI is precisely that overcoming capability-limitation is the entire nature of the exercise (and the threat), so we can't rely on that. What we conceivably *could* do instead is bend our efforts towards extreme minimalism in externality-generation, coupled with reliable world-model prediction and sensing (something AGI/ASI will likely have, which is indeed part of what makes it so dangerous), plus perhaps some Three Laws-style guardrails (to the extent they are independently implementable -- this *might* even be simple enough to do with RLHF without it being a complete cockup) to limit questions that by their nature admit of no non-catastrophic (or non-world-altering) answer, such as "how do I kill everyone?".
I'm by no means suggesting that this is a comprehensive or ironclad proposal -- it's still very nascent -- but I think it at least has the virtue of trying to operationalize alignment with respect to answering the two questions "(1) what does alignment even *mean* in a human context?" (to which I would argue, a combination of externality and direct harm minimization, with catastrophic scenarios averted through capability limitation and necessarily-aggregated preferences tending towards externality and direct harm-minimization), and "(2) how do we try to mimic this with a loss function in a system that can be assumed to be minimally, if at all, capabilities-limited?"
[1] Obviously this requires ways to prevent an AGI/ASI from self-hacking its world state a la The Robots of Dawn to either deliberately wear blinders or otherwise redefine these metrics into epsilon-significance, but (a) this is just a sketch for now and (b) accurate world-knowledge is instrumentally rational in a way that helps rather than hurts for once.
[2] In some cases -- e.g., "Find a way to increase human longevity" (which is a goal humans would obviously want to pursue) -- the "least-disruptive means" is still incredibly disruptive. This is a serious problem that will create impetus to relax the extreme homeostasis-bias I propose imposing on AI, but at least it shoves the alignment problem down one level in a way that current approaches don't seem to have accomplished yet.
Elaborating slightly more, the Criterion of Minimal Disruption is also meant to tackle specifically the problem of instrumental convergence by, e.g., making "To make the most paperclips, you must first kill all the humans" a less-desirable opening move.
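To make the shape of this concrete, here is a minimal, purely illustrative sketch of the kind of deviation-penalized objective described above. Every metric name, weight, and number is invented, and the hyperbolic penalty is just one arbitrary choice; this is a toy in the spirit of low-impact-AI proposals, not a workable impact measure.

```python
from typing import Dict

def disruption_penalty(world_state: Dict[str, float],
                       baseline: Dict[str, float],
                       weights: Dict[str, float],
                       scale: float = 1.0) -> float:
    """Penalty that grows hyperbolically as tracked metrics deviate from the
    status quo ante. All metric names are hypothetical placeholders."""
    total = 0.0
    for metric, weight in weights.items():
        deviation = abs(world_state[metric] - baseline[metric]) / scale
        deviation = min(deviation, 0.999)                 # cap so the penalty stays finite
        total += weight * deviation / (1.0 - deviation)   # blows up as deviation approaches the scale
    return total

def score_plan(task_reward: float,
               world_state: Dict[str, float],
               baseline: Dict[str, float],
               weights: Dict[str, float],
               lam: float = 10.0) -> float:
    """Task reward minus a heavily weighted disruption term: low-impact plans win."""
    return task_reward - lam * disruption_penalty(world_state, baseline, weights)

# Toy comparison: a plan that makes some paperclips quietly vs. one that makes
# many more by seizing factories and causing deaths (all numbers invented).
baseline = {"human_deaths": 0.0, "factories_repurposed": 0.0}
weights  = {"human_deaths": 100.0, "factories_repurposed": 1.0}
quiet = score_plan( 5.0, {"human_deaths": 0.0, "factories_repurposed": 0.01}, baseline, weights)
seize = score_plan(50.0, {"human_deaths": 0.2, "factories_repurposed": 0.9},  baseline, weights)
print(quiet > seize)  # True: the "kill all the humans first" opening move scores worse
```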
I think there lies the rub... inherent immutable and pervasive **human failings.** How can we construct anything intelligent that does not have that? As seen by the OP's posts, we can construct many words from many smart people, but these ideas are doomed from the start. I just love the tunnel vision of "I cannot fail, I will succeed, so we just have to pound harder" of one of the smarties. This human failure has killed more pilots than anything else, the belief that we must be right, bugger all. Perhaps if a team started using the inverse strategy of we have failed, we fucked up, let's clean up our mess before we make another dollar on it? Ha, I'm dreaming now...
Thanks again for all the posts!
Hi, thanks a lot for writing the detailed and highly engaged piece! There are a lot of points here and I can't respond to all of them right now. I think these responses might be most helpful:
Maybe a potential misunderstanding upfront: the Superalignment Team is not trying to teach AI how to do ML research. This will happen without us, and I don’t think it would be helpful for us to accelerate it. Our job is to research and develop the alignment techniques required to make the first system that has the capabilities to do automated alignment research sufficiently aligned.
Put differently: automated ML research will happen anyway, and the best we can do is be ready to use it for alignment as soon as it starts happening. To do this, we must know how to make that system sufficiently aligned that we can trust the alignment research it's producing (because bad-faith alignment research is an avenue through which AI can gain undue power).
> If the first stage in your plan for alignment of superintelligence involves building a general intelligence (AGI), what makes you think you’ll be able to align that first AGI? What makes you think you can hit the at best rather narrow window of human intelligence without undershooting (where it would not be useful) or overshooting (where we wouldn’t be able to align it, and might well not realize this and all die)? Given comparative advantages it is not clear ‘human-level’ exists at all here.
One thing that we’ve learned from LLM scaling is that it’s actually moving relatively slowly through the spectrum of human-level intelligence: for lots of tasks that fit in the context window, GPT-4 is better than some humans and worse than others. Overall GPT-4 is maybe at the level of a well-read college undergrad.
We can measure scaling laws and be ready to go once models become useful for alignment. Models are usually vastly better than humans at some tasks (e.g. translation, remembering facts) and much worse at others (e.g. arithmetic), but I expect that there’ll be a window of time where the models are very useful for alignment research, as long as they are sufficiently aligned.
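For what it's worth, here is a minimal sketch of what "measure scaling laws and be ready to go once models become useful" could look like in practice. The benchmark numbers, the power-law form, and the usefulness threshold are all invented for illustration; the comment does not describe OpenAI's actual evaluations.

```python
import numpy as np

# Invented benchmark numbers: error rate on some alignment-relevant task suite
# at increasing training-compute budgets (none of these figures are real).
compute = np.array([1e21, 3e21, 1e22, 3e22, 1e23])  # training FLOPs
error   = np.array([0.58, 0.52, 0.45, 0.39, 0.32])  # fraction of tasks failed

# Fit a power law, error ~ a * compute^(-b), by linear regression in log-log space.
slope, log_a = np.polyfit(np.log10(compute), np.log10(error), 1)
b = -slope

# Extrapolate how much compute it would take for the error rate to drop below a
# (hypothetical) "useful for alignment research" threshold.
target_error = 0.20
log_compute_needed = (log_a - np.log10(target_error)) / b

print(f"fitted exponent b = {b:.3f}")
print(f"compute to reach {target_error:.0%} error = {10 ** log_compute_needed:.2e} FLOPs")
```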
You can think of this strategy as analogous to building the first ever compiler. It would be insane to build a modern compiler with all its features purely in machine code, but you also don’t have to. Instead, you build a minimal compiler in machine code, then you use it to compile the next version of the compiler that you mostly wrote in your programming language. This is not a circular strategy.
> They do talk later about aligning this first AGI. This does not solve the hard problem of how a dumber thing can align a smarter thing.
If the smartest humans could solve the hard problem, shouldn’t an AI system that is as smart as the smartest humans be able to? If the smartest humans can’t solve the hard problem, then no human-driven alignment plans can succeed either. The nice aspect of this plan is that we don't actually need to solve the "align a much smarter thing" problem ourselves, we only need to solve the "align a thing about as smart as us" problem.
> If you use A to align B to align C to align D
If you use humans = A to align B, and B comes up with a new alignment technique, and you then use this new technique to align C, you haven't really used B to align C; you're still just using A to align C.
> This is the opposite of their perspective, which is that ‘good enough’ alignment for the human-level is all you need. That seems very wrong to me. You would have to think you can somehow ‘recover’ the lost alignment later in the process.
A simple example is retraining the same model with a new technique. For example, say you use RLHF to train a model, and it gives you an idea for how to improve RLHF (e.g. significantly fewer hallucinations). Now you re-run RLHF and you get a more aligned model. But you don't have to go via the proxy of the previous model, other than the idea it gave you for how to improve alignment. (To be clear, I'm not saying RLHF will be sufficient to align the automated alignment researcher.)
> Testing for interpretability can cause internal disguise and even ‘fooling oneself’ as it does in humans, cognition gets forced into whatever form you won’t detect, including outside what you think is the physical system.
This depends a lot on how much selection pressure you apply. For example, ImageNet models have been selected on the test set for many years now and there is some measurable test set leakage, but the overall effect doesn’t make the test set useless.
If you train a new model every day and keep the hyperparameters that the interpretability people seem to make happier noises about, you're leaking a lot of bits about the interpretability tools. But if you go back and try to debug your pipeline to find the culprit that caused the problem, you can do this in a way that leaks very few bits and solves more of the actual problems. (E.g. if you only get to submit one model per week to the interpretability team, you leak at most 1 bit/week.)
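A back-of-the-envelope illustration of the bit-counting framing above, with invented numbers. The only point is that an upper bound on leakage scales with how many selections you make and how many options each selection distinguishes.

```python
import math

def max_bits_leaked(selections: int, options_per_selection: int) -> float:
    """Upper bound on information revealed about the interpretability tools:
    each choice among k options can leak at most log2(k) bits."""
    return selections * math.log2(options_per_selection)

# Hypothetical month of work (all numbers invented):
# (a) every day, keep whichever of 8 hyperparameter settings the interpretability
#     team "makes the happiest noises about"
daily_tuning  = max_bits_leaked(selections=28, options_per_selection=8)  # 84 bits
# (b) submit one model per week for a single pass/fail interpretability check
weekly_checks = max_bits_leaked(selections=4, options_per_selection=2)   # 4 bits

print(daily_tuning, weekly_checks)
```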
> I would expect AIs to be at comparative disadvantage in detecting distinct manifestations of misalignment, compared to humans
Would you agree that AI should be at an advantage at detecting problems we trained it to plant deliberately? We should have some useful empirical evidence on this soon.
> The danger is that we will not confidently know when we transition to testing on dangerous systems.
This is something we’ll need to solve soon anyway; plus every AI lab will want their competitors to be testing for this.
> Building a human-level alignment researcher sounds a lot like building a human-level intelligence,
We’re not pushing on the capabilities required to do this, we’re just trying to figure out how to align it.
> There is great need for more ML talent working on alignment, but I disagree that alignment is fundamentally a machine learning problem. It is fundamentally a complex multidisciplinary problem, executed largely in machine learning, and a diversity of other skills and talents are also key bottlenecks.
Yes, my original statement was too strong and I now regret it. However, I do still think some of the core parts of what makes this problem difficult are machine learning problems.
> If you use current-model alignment to measure superalignment, that is fatal.
Depends on how you make the inference. Just because you're making GPT-5 more aligned doesn't mean you're solving superintelligence alignment. But if you are trying to align a human-level alignment researcher, then somewhere along the way, short of the actual system that can do automated alignment research, your techniques need to start working. GPT-5 is a good candidate for this, and we need grounding in real-world empirical data.
> Yes, in both examples here and many other games, you can do an easy evaluation easily and tell if one side is dominating in the most obvious ways, but it often won’t help tell you who is winning expert games.
I agree that it can be very hard to tell who is winning (i.e. is in a favorable position) in chess, MTG, or other games (and if you could do this, then you could also play really well). But it is very easy to tell who has won the game (i.e. who the rules of the game declare as the winner).
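A toy illustration of that asymmetry (my example, not from the thread): checking who *has* won is a few lines of code, while judging who *is* winning from a mid-game position is essentially as hard as playing well.

```python
from typing import Optional

# Tic-tac-toe stand-in; chess or MTG would make the same point at larger scale.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board: str) -> Optional[str]:
    """board is a 9-character string of 'X', 'O', or '.'; returns the winner, if any."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

print(winner("XXX.OO..."))  # 'X' -- trivially checkable from the rules
# No comparably simple function tells you whether X or O is favored after "X.O.X...."
```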
Thank you, and so much of this is amazing to hear. Other parts are food for deeper thought than I can give right away, as I've been on a trip. I'll be quoting this in full in the next post, and would love to engage further privately if you'll be up for it.
Happy to discuss more; if you have specific questions, send them my way!