Hi, thanks a lot for writing the detailed and highly engaged piece! There are a lot of points here and I can't respond to all of them right now. I think these responses might be most helpful:

Maybe a potential misunderstanding upfront: the Superalignment Team is not trying to teach AI how to do ML research. This will happen without us, and I don’t think it would be helpful for us to accelerate it. Our job is to research and develop the alignment techniques required to make the first system that has the capabilities to do automated alignment research sufficiently aligned.

Put differently: automated ML research will happen anyway, and the best we can do is be ready to use it for alignment as soon as it starts happening. To do this, we must know how to make that system sufficiently aligned that we can trust the alignment research it's producing (because bad-faith alignment research is an avenue through which AI could gain undue power).

> If the first stage in your plan for alignment of superintelligence involves building a general intelligence (AGI), what makes you think you’ll be able to align that first AGI? What makes you think you can hit the at best rather narrow window of human intelligence without undershooting (where it would not be useful) or overshooting (where we wouldn’t be able to align it, and might well not realize this and all die)? Given comparative advantages it is not clear ‘human-level’ exists at all here.

One thing that we’ve learned from LLM scaling is that it’s actually moving relatively slowly through the spectrum of human-level intelligence: for lots of tasks that fit in the context window, GPT-4 is better than some humans and worse than others. Overall GPT-4 is maybe at the level of a well-read college undergrad.

We can measure scaling laws and be ready to go once models become useful for alignment. Models are usually vastly better than humans at some tasks (e.g. translation, remembering facts) and much worse at others (e.g. arithmetic), but I expect that there’ll be a window of time where the models are very useful for alignment research, as long as they are sufficiently aligned.
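For concreteness, here's a minimal sketch of what "measure scaling laws and extrapolate to when models become useful" could look like, assuming a simple power-law fit; the compute and error numbers, the eval, and the usefulness threshold are all made up for illustration, not our actual methodology:

```python
import numpy as np

# Hypothetical eval scores of successive models vs. training compute.
# All numbers here are illustrative, not real measurements.
compute = np.array([1e21, 3e21, 1e22, 3e22, 1e23])  # training FLOPs
error = np.array([0.52, 0.44, 0.37, 0.31, 0.26])    # error rate on some alignment-relevant eval

# Fit a power law error ~ a * compute**slope (linear in log-log space).
slope, log_a = np.polyfit(np.log(compute), np.log(error), 1)
a = np.exp(log_a)

# Extrapolate: at what compute does predicted error cross a
# "useful for alignment research" threshold? (The threshold is made up.)
threshold = 0.15
needed_compute = (threshold / a) ** (1 / slope)
print(f"fit: error ~ {a:.3g} * C^{slope:.3f}")
print(f"predicted compute to reach {threshold:.0%} error: {needed_compute:.2e} FLOPs")
```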

You can think of this strategy as analogous to building the first ever compiler. It would be insane to build a modern compiler with all its features purely in machine code, but you also don’t have to. Instead, you build a minimal compiler in machine code, then you use it to compile the next version of the compiler that you mostly wrote in your programming language. This is not a circular strategy.
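Here's a toy sketch of that staged structure (the "compilers" are just labeled stand-ins, not a real toolchain):

```python
def compile_with(compiler: str, source: str) -> str:
    """Stand-in for 'run this compiler on this source to produce a new binary'."""
    return f"{source} built by [{compiler}]"

# Stage 0: a minimal compiler written directly in machine code.
stage0 = "minimal compiler (hand-written machine code)"

# Stage 1: the full compiler, written in the new language, built by stage 0.
stage1 = compile_with(stage0, "full compiler")

# Stage 2: the full compiler rebuilds itself; from here on you iterate in the
# new language and the hand-written stage 0 is no longer on the critical path.
stage2 = compile_with(stage1, "full compiler")
print(stage2)
```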

> They do talk later about aligning this first AGI. This does not solve the hard problem of how a dumber thing can align a smarter thing.

If the smartest humans could solve the hard problem, shouldn’t an AI system that is as smart as the smartest humans be able to? If the smartest humans can’t solve the hard problem, then no human-driven alignment plans can succeed either. The nice aspect of this plan is that we don't actually need to solve the "align a much smarter thing" problem ourselves, we only need to solve the "align a thing about as smart as us" problem.

> If you use A to align B to align C to align D

If you use humans = A to align B, and B comes up with a new alignment technique, and you then use this new technique to align C, you haven’t really used B to align C; you’re still just using A to align C.

> This is the opposite of their perspective, which is that ‘good enough’ alignment for the human-level is all you need. That seems very wrong to me. You would have to think you can somehow ‘recover’ the lost alignment later in the process.

A simple example is retraining the same model with a new technique. For example, say you use RLHF to train a model, and it gives you an idea for how to improve RLHF (e.g. one that yields significantly fewer hallucinations). Now you re-run RLHF and you get a more aligned model. But you don’t have to go via the previous model as a proxy, other than the idea it gave you for how to improve alignment. (To be clear, I'm not saying RLHF will be sufficient to align the automated alignment researcher.)
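A rough sketch of the pattern I mean, with placeholder functions (rlhf_train, suggest_improvement) that just stand in for the real pipelines:

```python
def rlhf_train(base_model: str, technique: str) -> str:
    """Placeholder for running an RLHF-style training pipeline (not a real API)."""
    return f"{base_model} trained with {technique}"

def suggest_improvement(model: str, technique: str) -> str:
    """Placeholder for asking the trained model for an idea that improves the technique."""
    return technique + " + hallucination fix"

base = "base-model"
technique_v1 = "RLHF-v1"

model_v1 = rlhf_train(base, technique_v1)                  # A aligns B
technique_v2 = suggest_improvement(model_v1, technique_v1)

# Key point: the second run starts again from the base model. model_v1 contributed
# only the idea, not its weights, so this is still "A aligns C with a better
# technique", not "B aligns C".
model_v2 = rlhf_train(base, technique_v2)
print(model_v2)
```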

> Testing for interpretability can cause internal disguise and even ‘fooling oneself’ as it does in humans, cognition gets forced into whatever form you won’t detect, including outside what you think is the physical system.

This depends a lot on how much selection pressure you apply. For example, ImageNet models have been selected on the test set for many years now and there is some measurable test set leakage, but the overall effect doesn’t make the test set useless.

If you train a new model every day and keep the hyperparameters that the interpretability people seem to make happier noises about, you’re leaking a lot of bits about the interpretability tools. But if you instead go back and debug your pipeline to find the culprit that caused the problem, you can do this in a way that leaks very few bits and actually fixes the underlying problem. (E.g. if you only get to submit one model per week to the interpretability team, you leak at most 1 bit / week.)
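A back-of-the-envelope version of that bits accounting (the counts are purely illustrative):

```python
import math

# Each accept/reject signal from the interpretability team reveals at most 1 bit
# about what their tools respond to. All counts below are illustrative.

# Tuning hyperparameters daily against interpretability feedback for a year:
bits_daily_selection = 365 * 1.0        # ~365 bits leaked

# Gating one candidate model per week instead:
bits_weekly_gate = 52 * 1.0             # ~52 bits leaked, at most 1 bit/week

# For comparison: picking the best of N hyperparameter settings based on that
# feedback consumes about log2(N) bits of the budget.
n_settings = 16
bits_to_pick_best = math.log2(n_settings)  # 4 bits

print(bits_daily_selection, bits_weekly_gate, bits_to_pick_best)
```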

> I would expect AIs to be at comparative disadvantage in detecting distinct manifestations of misalignment, compared to humans

Would you agree that AI should be at an advantage at detecting problems we trained it to plant deliberately? We should have some useful empirical evidence on this soon.

> The danger is that we will not confidently know when we transition to testing on dangerous systems.

This is something we’ll need to solve soon anyway; plus every AI lab will want their competitors to be testing for this.

> Building a human-level alignment researcher sounds a lot like building a human-level intelligence,

We’re not pushing on the capabilities required to do this, we’re just trying to figure out how to align it.

> There is great need for more ML talent working on alignment, but I disagree that alignment is fundamentally a machine learning problem. It is fundamentally a complex multidisciplinary problem, executed largely in machine learning, and a diversity of other skills and talents are also key bottlenecks.

Yes, my original statement was too strong and I now regret it. However, I do still think some of the core parts of what makes this problem difficult are machine learning problems.

> If you use current-model alignment to measure superalignment, that is fatal.

Depends on how you make the inference. Just because you’re making GPT-5 more aligned doesn’t mean you’re solving superintelligence alignment. But if you are trying to align a human-level alignment researcher, then your techniques need to start working somewhere along the way, short of the actual system that can do automated alignment research. GPT-5 is a good candidate for this, and we need grounding in real-world empirical data.

> Yes, in both examples here and many other games, you can do an easy evaluation easily and tell if one side is dominating in the most obvious ways, but it often won’t help tell you who is winning expert games.

I agree that it can be very hard to tell who is winning (i.e. is in a favorable position) in chess, MTG, or other games (and if you could do this, then you could also play really well). But it is very easy to tell who has won the game (i.e. who the rules of the game declare as the winner).
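For example, with the python-chess library you can read the winner of a finished game straight off the rules, while evaluating a mid-game position has no such shortcut (a small sketch, assuming that library):

```python
import chess  # python-chess

# Who has WON is trivial to read off the rules of the game.
finished = chess.Board("8/8/8/8/8/5k2/6q1/7K w - - 0 1")  # white to move, checkmated
print(finished.is_checkmate())  # True
print(finished.result())        # "0-1": the rules declare black the winner

# Who is WINNING mid-game is the hard part: the rules say nothing, and a good
# estimate requires a strong evaluator (e.g. an engine), which is close to
# being able to play well.
midgame = chess.Board()
midgame.push_san("e4")
midgame.push_san("e5")
print(midgame.is_game_over())   # False; no rule tells you who is ahead here
```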
