Discussion about this post

Jan Leike

Hi, thanks a lot for writing the detailed and highly engaged piece! There are a lot of points here and I can't respond to all of them right now. I think these responses might be most helpful:

Maybe a potential misunderstanding upfront: the Superalignment Team is not trying to teach AI how to do ML research. This will happen without us, and I don’t think it would be helpful for us to accelerate it. Our job is to research and develop the alignment techniques required to make the first system that has the capabilities to do automated alignment research sufficiently aligned.

Put differently: automated ML research will happen anyway, and the best we can do is be ready to use it for alignment as soon as it starts happening. To do this, we must know how to make that system sufficiently aligned that we can trust the alignment research it's producing (because bad-faith alignment research is an avenue through which AI can gain undue power).

> If the first stage in your plan for alignment of superintelligence involves building a general intelligence (AGI), what makes you think you’ll be able to align that first AGI? What makes you think you can hit the at best rather narrow window of human intelligence without undershooting (where it would not be useful) or overshooting (where we wouldn’t be able to align it, and might well not realize this and all die)? Given comparative advantages it is not clear ‘human-level’ exists at all here.

One thing that we’ve learned from LLM scaling is that it’s actually moving relatively slowly through the spectrum of human-level intelligence: for lots of tasks that fit in the context window, GPT-4 is better than some humans and worse than others. Overall GPT-4 is maybe at the level of a well-read college undergrad.

We can measure scaling laws and be ready to go once models become useful for alignment. Models are usually vastly better than humans at some tasks (e.g. translation, remembering facts) and much worse at others (e.g. arithmetic), but I expect that there’ll be a window of time where the models are very useful for alignment research, as long as they are sufficiently aligned.
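To make the "measure scaling laws" point concrete, here is a minimal sketch of the kind of extrapolation involved. All numbers, the power-law form, and the "useful for alignment" loss threshold are hypothetical stand-ins, not actual measurements:

```python
import numpy as np

# Hypothetical small-scale measurements: training compute (FLOPs) and eval loss.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.2, 2.9, 2.6, 2.4])

# Fit a power law, loss ~ a * compute**(-b), by least squares in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

# Extrapolate to a made-up loss threshold at which models might start being
# genuinely useful assistants for alignment research.
target_loss = 2.0
needed_compute = (a / target_loss) ** (1.0 / b)
print(f"fit: loss ~ {a:.1f} * C^(-{b:.3f})")
print(f"extrapolated compute at target loss: {needed_compute:.2e} FLOPs")
```

The real version would track capability evaluations relevant to alignment research rather than a single loss number, but the shape of the forecast is the same.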

You can think of this strategy as analogous to building the first ever compiler. It would be insane to build a modern compiler with all its features purely in machine code, but you also don’t have to. Instead, you build a minimal compiler in machine code, then you use it to compile the next version of the compiler that you mostly wrote in your programming language. This is not a circular strategy.

> They do talk later about aligning this first AGI. This does not solve the hard problem of how a dumber thing can align a smarter thing.

If the smartest humans could solve the hard problem, shouldn’t an AI system that is as smart as the smartest humans be able to? If the smartest humans can’t solve the hard problem, then no human-driven alignment plans can succeed either. The nice aspect of this plan is that we don't actually need to solve the "align a much smarter thing" problem ourselves, we only need to solve the "align a thing about as smart as us" problem.

> If you use A to align B to align C to align D

If you use humans = A to align B, and B comes up with a new alignment technique, and you then use this new technique to align C, you haven't really used B to align C; you're still just using A to align C.

> This is the opposite of their perspective, which is that ‘good enough’ alignment for the human-level is all you need. That seems very wrong to me. You would have to think you can somehow ‘recover’ the lost alignment later in the process.

A simple example is retraining the same model with a new technique. For example, say you use RLHF to train a model, and it gives you an idea for how to improve RLHF (e.g. significantly fewer hallucinations). Now you re-run RLHF and you get a more aligned model. But you don't have to go via the proxy of the previous model, other than the idea it gave you for how to improve alignment. (To be clear, I'm not saying RLHF will be sufficient to align the automated alignment researcher.)
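Here is a rough sketch of that retraining pattern, with made-up names (an illustration of the argument's structure, not any actual pipeline): the improved technique is applied to the base model directly, so the new model's alignment depends only on the idea the previous model contributed, not on the previous model itself.

```python
# Illustrative only: 'rlhf' stands in for whatever alignment training is used.
def rlhf(base_model: str, technique: str) -> str:
    return f"{base_model} fine-tuned with {technique}"

base = "pretrained-base-model"

# Round 1: align with the current technique.
model_v1 = rlhf(base, "RLHF v1")

# model_v1 suggests an improvement (e.g. a way to cut hallucinations).
# Keep only the idea and retrain from the base model with the new technique,
# so model_v2 does not inherit model_v1's residual misalignment.
model_v2 = rlhf(base, "RLHF v2 (reduced hallucinations)")

print(model_v1)
print(model_v2)
```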

> Testing for interpretability can cause internal disguise and even ‘fooling oneself’ as it does in humans, cognition gets forced into whatever form you won’t detect, including outside what you think is the physical system.

This depends a lot on how much selection pressure you apply. For example, ImageNet models have been selected on the test set for many years now and there is some measurable test set leakage, but the overall effect doesn’t make the test set useless.

If you train a new model every day and keep the hyperparameters that the interpretability people seem to make happier noises about, you're leaking a lot of bits about the interpretability tools. But if you instead go back and debug your pipeline to find the culprit that caused the problem, you can do this in a way that leaks very few bits and solves more of the actual problem. (E.g. if you only get to submit one model per week to the interpretability team, you leak at most 1 bit / week.)
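As a back-of-the-envelope illustration of the bits framing (my own toy numbers, assuming each submission returns only a binary accept/reject verdict):

```python
import math

# A binary verdict reveals at most log2(2) = 1 bit about what the tools respond to.
bits_per_verdict = math.log2(2)  # 1.0

# Selecting hyperparameters against the tools every day for a year:
daily_selection = 365 * bits_per_verdict   # ~365 bits/year of leakage
# Gating submissions to one model per week:
weekly_gate = 52 * bits_per_verdict        # ~52 bits/year

# Richer feedback leaks more: k distinguishable responses leak up to log2(k)
# bits per submission.
k = 16
print(f"daily selection: {daily_selection:.0f} bits/year")
print(f"weekly gate:     {weekly_gate:.0f} bits/year")
print(f"per-submission leakage with {k} outcomes: {math.log2(k):.0f} bits")
```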

> I would expect AIs to be at comparative disadvantage in detecting distinct manifestations of misalignment, compared to humans

Would you agree that AI should be at an advantage at detecting problems we trained it to plant deliberately? We should have some useful empirical evidence on this soon.

> The danger is that we will not confidently know when we transition to testing on dangerous systems.

This is something we’ll need to solve soon anyway; plus every AI lab will want their competitors to be testing for this.

> Building a human-level alignment researcher sounds a lot like building a human-level intelligence,

We’re not pushing on the capabilities required to do this, we’re just trying to figure out how to align it.

> There is great need for more ML talent working on alignment, but I disagree that alignment is fundamentally a machine learning problem. It is fundamentally a complex multidisciplinary problem, executed largely in machine learning, and a diversity of other skills and talents are also key bottlenecks.

Yes, my original statement was too strong and I now regret it. However, I do still think some of the core parts of what makes this problem difficult are machine learning problems.

> If you use current-model alignment to measure superalignment, that is fatal.

Depends on how you make the inference. Just because you're making GPT-5 more aligned doesn't mean you're solving superintelligence alignment. But if you are trying to align a human-level alignment researcher, then your techniques need to start working somewhere along the way, before you reach the actual system that can do automated alignment research. GPT-5 is a good candidate for this, and we need grounding in real-world empirical data.

> Yes, in both examples here and many other games, you can do an easy evaluation easily and tell if one side is dominating in the most obvious ways, but it often won’t help tell you who is winning expert games.

I agree that it can be very hard to tell who is winning (i.e. is in a favorable position) in chess, MTG, or other games (and if you could do this, then you could also play really well). But it is very easy to tell who has won the game (i.e. who the rules of the game declare as the winner).
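To make that distinction concrete, here is a small sketch using the python-chess package (the positions are standard textbook examples I picked, not anything from the discussion): a pure rules check settles who has won a finished game, while judging who is winning mid-game requires genuine evaluation.

```python
import chess  # pip install python-chess

# A finished game (Fool's mate): who has *won* is a trivial rules check.
board = chess.Board()
for move in ["f3", "e5", "g4", "Qh4#"]:
    board.push_san(move)
print(board.is_game_over())  # True
print(board.result())        # "0-1": Black won, determined by the rules alone

# A midgame position (after 1.e4 e5 2.Nf3 Nc6): the rules say nothing about who
# stands better; answering that well takes engine-strength search and evaluation.
midgame = chess.Board("r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3")
print(midgame.is_game_over())  # False
```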

Random Reader

I'm not actually sure that "strong" alignment is possible or meaningful. My mental model is something like:

1. Colonizing the stars was always for post-humans, not monkeys in tin cans.

2. "Intelligence" is fundamentally giant, partly inscrutable matrices. Virtually all progress we've made on AI came from recognizing and accepting this fact.

3. Human-friendly values are too complex to formally model in a mathematically rigorous way.

4. Yudkowsky's fast takeoff scenarios involve unlikely assumptions about nanotech and slightly dubious assumptions about intelligence. I'd give them probability < 0.05 within 5 years of us building an AGI. Most of this weight comes from the fact that synthetic biology seems like a poor way to build GPUs, and diamond-phase nanotech looked like extreme "hopium" the last time I looked at the research.

5. But "slow" takeoff can still kill us all, given 10-50 years. The main limiting factor is building out atom-moving tools on a scale sufficiently large to replace the world economy, and doing so without too much human backlash. Without nanotech, this is harder to hide.

6. Many scenarios where we die look like utopia until a "treacherous turn", because a hostile AI would need time for a sufficient robotics build out, and it wouldn't want to trigger a human backlash until it had secured its supply lines.

Because of (2) and (3), I'm deeply pessimistic about "strong" alignment, where the goal is "we have a mathematical proof the AI won't kill us." I strongly suspect that asking for such a proof fundamentally misunderstands how both intelligence and human values work.

So any alignment plan, in my pessimistic view, is roughly equivalent to, "We're going to carefully raise the alien spider god from childhood, teach it the best moral values we know, and hope that when it grows up, it likes humans enough to keep us around." I actually believe that this plan _might_ work, but that's just my prior that intelligent beings are sometimes benevolent. If GPT-4 has shown me anything, it's that we could give the alien spider god a pretty good understanding of humans. And therefore we're not picking _completely_ randomly from "goal space" or "mind space." Sure, we might be trying to hit a bullseye while blind drunk, but at least we're in the same room as the dartboard.

So, as for OpenAI's plan, I like the part where they say, in effect, "we admit this is a problem, so let's try a bunch of things and allocate lots of resources." I don't think their plan offers any guarantees beyond "trying to raise the alien spider god with values we like." But I think that's about the best we could get, conditional on building ASI.

I am _much_ more concerned that their plan apparently starts with "step (1): build a near-human AGI." My prior is that near-human AGIs are not _automatically_ threats, for the same reason that there are some very smart people who have not made themselves world dictator.

My fundamental objection to building a "narrow" AGI to help with alignment research is that it encourages proliferation. Once you can build the specialized AGI, it's presumably very easy to build and deploy AGIs commercially. Which brings us several huge steps closer to the endgame.

But this is classic OpenAI: "We'll protect you from Skynet by doing everything possible to accelerate the development of AI!"
