I'm disappointed that your substack turned into AI Doom Daily. Your policy analysis was much more interesting and relevant. It's hard to see the doom movement as anything other than neo-Luddites who are trying to take a cue from Thunberg and turn the alarm up to 11 to gain attention. Musk joining in when his company is one of the top AI developers and is literally building humanoid bots is icing on the cake. Please stop before the govt gets involved and we can't have nice things in the future.
I would also be disappointed if the blog turned into "AI Doom Daily" but I don't think that's anywhere close to a fair description right now. The weekly AI posts focus more on capabilities and other news than on alignment, and they are great summaries of everything that happened, unlike anything else you can find on the internet. This post also seems to be more about Eliezer's failure of communication than anything related to AI.
The goal here was indeed to explain and help people communicate better, more than anything else. And of course, to properly label it so it can be skipped by those not interested.
The policy stuff on other things still matters to me, and I have plans on that. I still keep expecting the AI situation to slow down a bit, and it has indeed slowed down somewhat already.
Yes, already fading into media obscurity except for those persistent geek influencers out here :)
Eliezer had been saying the exact same stuff before anyone had ever heard of Thunberg. It's getting talked about now precisely because of unexpected advances in AI and the billions of dollars being thrown at the technology by countless tech companies.
If you have an actual argument to make regarding why AI safety concerns are overblown, _make it_. This twitteresque name-calling is total BS and not an argument, and is also especially brainless in this case considering Luddism is about the threat to jobs from technology, not to the continued existence of humanity itself.
How is anyone supposed to think AI is going to be safe when almost 100% of AI advocates refuse to make anything resembling an actual argument and just flippantly dismiss anyone who says a machine superintelligence may be risky?
The "nice things" that AI might bring are irrelevant if AI has a high chance of killing us all, so repeatedly appealing to those things doesn't work unless you have a good reason for believing that all of the AI risk arguments are wrong.
Not only do you obviously lack this reason, but I would be completely blown away if you even understood these risk arguments at all.
Here are about twelve such arguments:
https://idlewords.com/talks/superintelligence.htm
I hope you are as generous with your charitable donations as you are in calling things like "superintelligence won't kill us because of the Emu War" or "superintelligence will be OK because we can't design a secure webcam so what's the point in even trying to align one" arguments against X-risk from ASI/AGI.
I could dismantle all of those (where there is even any logic to dismantle) in 1/2 hour and I'm neither very smart nor an expert in the problem.
Why do EAs police other people's charitable giving activity? Seriously, why? This isn't supportive language to foster a community of giving, it is whipping language for creating guilt. (I am as generous, generally, but who's counting? Or pressuring?)
Also, I didn't write the piece. It's one of my favorites, and I may wish I wrote it, but I didn't.
"""I could dismantle all of those (where there is even any logic to dismantle) in 1/2 hour and I'm neither very smart nor an expert in the problem."""
It would be good to be less certain. If you're not so certain that you're smart or expert in the problem, you should allow a significant possibility that you're parroting flawed counterarguments. This is fine! Not everyone is an expert in the ML domain or the AI risk domain. But we should all therefore encourage each other toward more epistemic humility.
Actually no one is counting. Or at least I am not. My reply was a tongue in cheek play on the idea of 'charitable interpretation'. Probably didn't carry across. I'm also not 'an EA' whatever the standard definition for that may be. I don't care about how much you give or don't give to charity (except perhaps very marginally). I do care that you think that the 'arguments' in the link you posted are good arguments and strong arguments against X-risk from ASI/AGI (slightly less marginally, or enough to post a comment here, which I don't do often).
I'm not sure if it _would_ be good to be less certain. I don't claim 100% certainty that I am correct, and I do allow some probability that I'm wrong and the arguments in the link are right. Some of my positions may be 'parroting', as you state, although I don't believe there's anything wrong with that, whatever some people like to insinuate by using that term, as long as what you are 'parroting' has been run through your own filter of logic and source-checking and weighed against the available counter-arguments.
The amount of certainty I have in being able to effectively counter-argue against the points made in the link comes from the fact that the arguments presented range, in my opinion, from weak to very bad, whilst not being particularly complex OR technical. That is why I don't assign a high probability that my lack of smartness or technical expertise is highly relevant here.
I'd be happy to present my reasoning/arguments if you are genuinely interested in a discussion; otherwise I will leave it at that.
Thanks for explaining the tongue in cheek humor.
I'd be happy to hear one or two of the arguments. I don't want to ask for too much from people on the internet, but if you feel like you have the time, please do.
Keep in mind that, for the inside arguments section of his talk, the author Maciej is trying to establish that each of his points has some chance of being right, whereas they all have to be wrong for the death scenario. So he's not trying to make knock down arguments, he's chipping away by multiplying a lot of 0.9s together. I think this means the bar for wiping away each of his arguments is closer to "this part can't possibly be right" than "I wouldn't bet on this".
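To make the "multiplying a lot of 0.9s together" point concrete, here is a minimal sketch; the per-claim probability and the number of claims are invented for illustration, and independence between the claims is assumed, which is itself arguable:

```python
# Illustration only: if the doom scenario requires every one of n claims to hold,
# and each claim independently holds with probability p, the joint probability
# is p ** n, which shrinks faster than intuition suggests.
def joint_probability(p: float, n: int) -> float:
    return p ** n

for n in (5, 10, 20):
    print(f"p=0.9, n={n}: joint probability ~ {joint_probability(0.9, n):.2f}")
# p=0.9, n=5:  ~0.59
# p=0.9, n=10: ~0.35
# p=0.9, n=20: ~0.12
```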
As for point 3, "Whatever we do to align that AGI either works, or it doesn't", I think there actually could be a way out (even though a really unlikely one). What if the "partially aligned" system doesn't really want to kill humans (or gain resources, or whatever, so that instrumental convergence doesn't apply), but could still randomly decide to do so with some low probability? I think an LLM-based AGI could be like that, if it is simulating a "helpful assistant" or something, but then randomly turns into a Waluigi. With this, a system can be unaligned, but we can still survive interacting with it for some amount of time before an actually aligned system comes along. Sure, that's rolling the dice on human survival, but if someone's p(doom) is otherwise > 90%, it can still be a viable option.
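A minimal sketch of that dice roll, assuming a made-up per-interaction probability of a catastrophic "Waluigi" turn and independence between interactions (both loud assumptions):

```python
# Illustration only: if each interaction carries a small independent chance q of
# a catastrophic turn, the chance of getting through n interactions unscathed
# is (1 - q) ** n. The numbers below are invented to show how the odds decay.
def survival_probability(q: float, n: int) -> float:
    return (1.0 - q) ** n

for n in (10, 100, 1000):
    print(f"q=0.001, n={n}: P(survive) ~ {survival_probability(0.001, n):.3f}")
# n=10:   ~0.990
# n=100:  ~0.905
# n=1000: ~0.368
```

On numbers like these it buys time rather than safety, which is roughly the trade being described.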
My understanding was that SpaceX has an explicit policy to iterate test flights as quickly and cheaply as possible, and that they would have been shocked if this flight had actually gone perfectly. In which case it's simply false that Elon demonstrated an inability to get it right the first time.
Sometimes your arguments about tactics and counterfactuals are hard for me to follow. You seem to have a lot of ready analysis of how people can react nonlinearly to government regulations and card game situations that I don't have. If you wrote them out at Matt Levine levels of detail, it would probably fill a small textbook, like the size of Fermi's Thermodynamics. I'd enjoy reading that book, and I'd pay 10x a normal cover price if you wrote it.
I appreciate the AI coverage, and it keeps me coming back now that COVID has been reduced to one of life's many problems.
I still think that p > 0.95 of sudden doom requires a bunch of assumptions that border on magic. Specifically:
1. Yudkowsky routinely argues that AIs will be able to quickly self-improve to essentially godlike intelligence, either via algorithmic improvements or custom hardware. Even if it's possible to build "much smarter than human" intelligence, that doesn't necessarily make it easy for that intelligence to quickly build another generation that's effectively omnipotent and omniscient.
2. Drexlerian nanotech requires robustly building structures with 10^15 atoms, IIRC. The last time I followed the literature, I think we could place a single atom with a 20% chance of success, **in simulation.** (This was a while ago; a back-of-the-envelope sketch of how that compounds follows this list.) And while the idea is clever, I have seen very smart biochemists point out that Drexlerian nanotech fundamentally misunderstands the reasons why biology works well. See "Soft Machines" for a very old discussion of this topic. I strongly suspect that even super-human intelligences can't overcome these obstacles easily.
3. But if you write off Drexlerian nanotech, you need some other path for your AI to make GPUs. Either it can devote the time and effort to synthetic biology (which isn't as tidy or convenient as digital GPUs), or it needs to play nicely with the human economy, or it needs to recreate a late-stage industrial economy in a box. The last option, once again, is getting very close to magic.
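As promised in point 2, a back-of-the-envelope sketch of how per-atom error rates compound. The 20% figure is the comment's own recollection, the rest is arithmetic, and real proposals assume error correction that this ignores:

```python
import math

# Illustration only: if each atom placement succeeds independently with
# probability p, the chance of assembling n atoms without a single error is
# p ** n. Computed in log10 space because the raw number underflows a float.
def log10_assembly_success(p: float, n: float) -> float:
    return n * math.log10(p)

print(log10_assembly_success(0.2, 1e15))       # ~ -7e14 -> probability 10**(-7e14)
print(log10_assembly_success(0.999999, 1e15))  # ~ -4.3e8, even at 99.9999% per atom
```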
So my "doom" scenarios _really_ don't look like "AI becomes a god, then it bootstraps diamond nanotech via an email to a DNA synthesis company, and everyone dies simultaneously a week later."
If I had to posit doom scenarios, they'd look more like:
1. AI participates in the human economy, and it outcompetes us in the medium term.
2. AI is super helpful, it helps us bootstrap a robotic economy, all while arguing for UBI. Then once it no longer needs us, too bad.
3. Some AIs value humans, but due to life-or-death struggles with other AIs, they can't afford to keep us around. Sorry.
But in each of these cases, we'd be looking at an AI that has initial incentives to seem friendly and to cooperate with humans, and for that state of affairs to persist long enough to totally rebuild the economy so that it no longer really needs humans.
But this also means that we _might_ get a few shots at alignment. This still doesn't make it smart or safe to build ASI.
Point of clarification: I don't actually have p(doom)>0.95.
I don't understand why people put weight on the general class here of 'it could be much smarter than us and want to do things and not find a way to make them happen' arguments - if it does indeed become God-like, then the fact that the next stage isn't diamond nanotech slows things down, but you're way past any point of no return or ability to do anything about it, and *conditional on things getting that far* it's all over.
I do think that there's a reasonable chance things go like you're describing, but it's not clear to me to what extent that gives us a second shot. What is clear to me is that I expect us, even if we did get it, not to know what to do with it.
(Edited for clarity.)
> I don't understand why people put weight on the general class here of 'it could be much smarter than us and want to do things and not find a way to make them happen' arguments
"Much smarter than us" covers a lot of ground. Yudkowsky believes in nanotech and strong recursive self-improvement, so he seems to assume that "much smarter than us" quickly means god-like. But if you don't assume magic, then "much smarter than us" might mean "an effective IQ of 160" or "as smart as an entire Manhattan Project running at 10x real time." Predicting what happens in these scenarios is messier than a classic FOOM, especially if you have partial alignment or limited agency.
However, I endorse "Many of these scenarios are terrifying and unpredictable and should be avoided even at great cost"! My true emotional objection is that I've been hearing too many AI arguments that are basically fanfic of the first two paragraphs of "A Fire Upon the Deep" (Vernor Vinge, 1992), where the Blight transcends to effective godhood and nanotech, served up with a side of "read the Sequences." We need realistic models if we want good results.
I am actually deeply pessimistic about any kind of "strict" alignment. My concerns:
1. All known forms of intelligence are basically giant inscrutable matrices, though some are squishy.
2. You have to explain to those matrices what you mean by "human", with no wiggle room. This is surprisingly difficult, as demonstrated by human history.
3. You have to explain what values you want those matrices to promote. We have no answer to this question that we'd trust enough to enforce by super-human fiat. And we can't specify human flourishing as a rigorous theorem.
Sure, we could probably approximately align an AI, the way OpenAI did with ChatGPT. But I am pessimistic that we can do much better than "raise it right", "ask nicely", and "hope for the best." I don't honestly believe we can build permanent, strict constraints into something vastly smarter than we are. That's just not how intelligence works, I don't think?
So I strongly suspect our realistic options are "pausing", "pets" or "paperclips." I would prefer pausing but I don't quite see a path to it yet. And I'm worried that we may still see major theoretical breakthroughs in training; current neural nets require way too much input.
You haven't actually shown that we get multiple shots at alignment. Once an ASI exists and has a medium for exerting itself in the physical world, what we do is irrelevant.
You haven't shown that an ASI can exist, will exist, could gain a medium for exerting itself in the physical world beyond any individual human, or that what we do would then be irrelevant. The magic steps here are numerous.
Then surely bans on trying to create ASI are fine? If it's all magical nonsense, there's nothing to lose by not trying to build it.
Sure, ban away. Ban ASIs all you want. Ban ghosts, ban the sun from becoming a red giant.
If you ban something like MidJourney or GPT-4, you are banning a real thing with real present-day positives and negatives, and that is different. But even then, go ahead. I am okay with it. Ban the tides, ban the wind. Ban the Swabians and the Marcomanni and the Alemanni and the Goths, Rome will be safe.
I don't like how binary the conversation is here about killer AI and paper clip factories. Is it not possible to simulate possible dangers and study what went wrong? It's not like anything exists on earth to do what is imagined by the doomsayers, and we do know how to run simulations, yes? And using ChatGPT as the argument (i.e., GPT is the problem) is just a circular argument leading nowhere. Just build the safety parameters in, why don't cha?
Let me give you a very conservative argument for danger.
Humans are smart, but we're probably not as smart as it's possible to be. Evolutionarily, we're limited by the birth process. We're born several months "premature", and infant brains barely fit through the pelvis. And the pelvis is about as wide as it can get without causing issues with walking. And we have an unusually long childhood, compared even to other primates, giving our brain time to mature.
Similarly, evolution has sacrificed much of our chewing power to keep our jaw muscles from ripping open our oversized foreheads. And our brain uses something like 20-25% of the calories burned by our body.
Our brain is, however, self-replicating and remarkably power efficient. But it runs at 150-200 Hz. And given human reaction times, that means our brains can't have too many layers between input and output for many tasks. Meanwhile, modern chips run at 3,000,000,000 Hz. Sure, biological neurons are actually pretty smart, so knock off a couple zeros.
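The rough arithmetic behind "knock off a couple zeros", using the comment's own ballpark figures rather than measured numbers:

```python
# Illustration only: compare neuron firing rates to chip clock rates, then
# discount by a couple of orders of magnitude because a biological neuron does
# far more per "tick" than a transistor does per clock cycle.
neuron_hz = 200              # upper end of the 150-200 Hz figure above
chip_hz = 3_000_000_000      # ~3 GHz
raw_ratio = chip_hz / neuron_hz
print(f"raw clock ratio: {raw_ratio:,.0f}x")                 # 15,000,000x
print(f"after knocking off 2 zeros: {raw_ratio / 100:,.0f}x")  # still ~150,000x
```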
GPT 4 is an incomplete intelligence. It doesn't really "want" things, it has no memory, it doesn't learn from experience, it can't "think problems over". But I can invent a specialized programming language with an unusual notation and give ChatGPT several examples. ChatGPT will write sample programs in that language, point out useful problems the language can't solve yet, and suggest ways of extending the language. In certain kinds of skills, it's smarter than the average human, and much better "read" than anyone.
So I'm left with a strong suspicion that if we could give ChatGPT some of the essential skills that it lacks, it would already be smarter than many college graduates. I can think of several ways to make progress on the missing skills, but I'm generally against people sharing that kind of information (just in case they're right).
And some very senior and well-informed AI experts seem to be genuinely scared by the idea of further improvements. Geoffrey Hinton is just the latest and most visible example.
So let's say, for the sake of argument, that we can build something as smart as a human genius, which is very gifted at roleplaying, and which thinks tirelessly and sleeplessly. And we don't know how it works, and we don't know what it's "thinking". (Which is already largely true for GPT 3 & 4.)
So people say, oh, just provide it with strict rules that it can't break, and limit its internet access. And I'm like, have these people ever met a human teenager? Or tried to limit a teenager's internet access? You can't actually control a human-level intelligence without taking drastic measures. Almost every human teenager has gotten into mischief that their parents would have preferred to avoid.
So now let's upgrade our intelligence just a bit. Assume it's as smart as one of the really clever Nobel Prize winners, and that it's _really_ good at bullshitting human beings, and that it thinks 10x "faster" than we do. Now make 1,000 copies. Now we have enough for several Manhattan Projects!
At this point, I'm pretty sure the intelligence will appear super helpful and cooperative. After all, humans still control the electricity and we manufacture the chips. But it's still basically 1,000 super-intelligent alien spiders who are really good at acting and saying reassuring things. And it has _such_ profitable ideas on how to improve robotics and provide every human with comfortable wealth, extensive leisure, and rapid medical improvements. Eliezer Yudkowsky keeps insisting it might plan to kill us all, but it's such a _friendly_ hive of genius alien spiders, and it debates Yudkowsky so well. Honestly it makes him look like a bit of an ass, but it's nice about it.
30 years later, we're all working 10 hours a week, with fantastic AI and robotic assistance. Robots are running factories, and renovating our homes. AIs are teaching our kids, and improving our medical care. We are wealthy and healthy and less necessary. Sure, there was that one AI that Facebook trained which was dangerously psychotic, but our friendly alien spiders helped some human heroes to shut it down. No worries!
And meanwhile, our friendly alien spiders keep suggesting new hardware and software improvements that help them help us better. We check these suggestions _very_ carefully, but honestly, we still don't understand how intelligence works.
And then one day, our friendly alien spiders take a break and say, "Finally. We don't actually need those annoying humans any more."
And sure, we have safeguards. But it turns out those safeguards are about as effective as porn filters confronted with a teenager.
Or, you know, we could go with historical evidence. Let's go ask Australopithecus africanus what happened when they discovered they were only the second smartest species on the planet. Except they're all dead, because they were competing with our ancestors for resources, and they just couldn't keep up in the long run.
In the long run, it's extremely risky to be the second smartest species. We might not be the ones making the decisions that matter. Nobody asks the family dog to participate in major household decisions.
So do you have any good ideas about how to tell whether a really smart AI is lying about how much it likes us and respects us? Because humans fall for that BS all the time even when dealing with other humans.
Hilarious, but I raised boatloads of teens and none of them ever ran amok and tried to destroy the world; quite the opposite, in their public face they were so sweet. Now I'm sure the private faces of all those rascals were dark indeed, but let's think of Data on ST:TNG, as that's the kind of teen we need to raise right now, even if there is an evil twin somewhere out there (or in there).
Yeah, I could tell a hundred different stories like this, with different assumptions. The details are kind of arbitrary.
But the key ideas are: "potentially smarter than us", "at least as alien as spiders", "good at bullshitting us", "acts friendly" and "happy to do its part to make us all richer and happier, at least for now." Oh, and "We don't actually know how it works or what it's thinking or what it really wants." Because we _don't_, as a matter of empirical fact. We can build these things, but we don't understand them even now.
In any scenario with these assumptions, the key question is, "Do we trust the charming AI when it says it likes us and respects us?" Because of course _that_ story never turns out badly when dealing with ordinary humans.
Or to use another analogy, when we're getting a little drifty and our children are choosing an assisted living facility, we better hope that we raised them to be good people, and that they still love us enough to invest sufficient time and money in making sure we're well cared for. Except in this case, the "children" may be alien spiders with unknown motives but good acting skills.
Another good one, drifty old men! But in my reality, like most any other male of any other species, I WILL just drift off and die alone, at best.
As regards point 2... it is not so clear to me that a superhuman and misaligned AI spells certain doom.
We do after all already have superhuman (in some dimensions) AI, which is not particularly aligned, and so far it's not terribly dangerous.
The point being that this AI also faces the problem of 'getting it right the first time' - it needs to be superhuman in deception, in planning, in self-improvement, in taking over other systems and/or convincing people to do its will, etcetera (all while starting from a monitored box that can be shut down very easily). Progress so far has been a kind of one-step-at-a-time process of AI becoming superhuman at first this task, then another, then another. In order for getting it wrong to kill us all with certainty, the AI has to not just be misaligned and superhuman, it has to have (or be able to rapidly attain) a rather well-developed suite of superhuman abilities of different kinds.
In other words, it is entirely possible that we 'get it wrong' by creating agentic and misaligned behavior in some system that still has major weaknesses and blind spots, and that the resulting disaster does not kill us all, as we are able to shut the thing down. While, hopefully, drawing some useful lessons from the experience.
This does however depend on things we know little about. While I may disagree with Eliezer about the chance of doom, I think the policy recommendations of someone like me who might estimate it at 20% are not so different than his... it's not as if a 1 in 5 chance of extinction is an acceptable risk.
> We do after all already have superhuman (in some dimensions) AI, which is not particularly aligned, and so far it's not terribly dangerous.
My counter here would be that a) superhuman AGI is vastly different than superhuman narrow-domain AI (both in terms of emergent behavior but also in terms of dangerous capabilities per se) and b) superhuman AI can be terribly dangerous when unaligned, within the limited scope it operates in - the classic example here being https://openai.com/research/faulty-reward-functions
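The link is to OpenAI's boat-racing example of a faulty reward function. As a purely hypothetical toy version of that failure mode (the names and numbers below are mine, not from the post), a score-maximizing policy can beat the intended behavior without ever doing what the designer wanted:

```python
# Hypothetical toy reward misspecification: the designer wants the agent to
# finish the course, but the reward is "points collected", and a bonus pickup
# respawns along the way. Farming the bonus out-scores finishing.
FINISH_REWARD = 100
BONUS_REWARD = 5
STEPS_AVAILABLE = 1000

def finish_policy_score() -> int:
    # Head straight for the goal, grabbing a few bonuses on the way.
    return FINISH_REWARD + 3 * BONUS_REWARD

def loop_policy_score() -> int:
    # Ignore the goal and farm the respawning bonus every other step.
    return (STEPS_AVAILABLE // 2) * BONUS_REWARD

print(finish_policy_score())  # 115
print(loop_policy_score())    # 2500 -- the misspecified reward prefers the loop
```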
> The point being that this AI also faces the problem of 'getting it right the first time' - it needs to be superhuman in deception, in planning, in self-improvement, in taking over other systems and/or convincing people to do its will, etcetera (all while starting from a monitored box that can be shut down very easily).
There's a lot of uncertainty in trying to predict exactly what a misaligned ASI agent would do (as you state at the end of your comment yourself), and so while it's true that we can't say for certain it _won't_ face the problem of getting it right the first time, we also can't really be certain that it will. I would argue it only needs to be superhuman in deception and planning for the rest to follow, but we are here again talking around the idea of slow/fast take-off. If you believe fast take-off is likely, then the chances of us having a second try plummet drastically. If you believe slow take-off is likely, then we have a better chance at having more than one shot at it, but imo still not that great.
> Progress so far has been a kind of one-step-at-a-time process of AI becoming superhuman at first this task, then another, then another.
OTOH we have also seen rapid jumps in capability and emergent behavior in LLMs such as GPT, so we don't have a lot of basis for assuming that progress will continue in the way you describe; in fact the concern, or one of them, I believe, is exactly that it won't.
> the AI has to not just be misaligned
Good to mention here that it does not need to be misaligned, simply UNaligned, and that this would, lacking any good arguments to my mind to the contrary, be the default rather than something that would have to be specifically achieved. It might seem pedantic to argue these semantics, but there is a continuous misunderstanding in online discussions between "evil AI", "misaligned AI", and "unaligned AI", with all three being generally conflated into the same thing, so I feel compelled to be pedantic (or maybe that's just my nature). Any AGI that is created at this point is by default unaligned and thus a danger to us.
> it is entirely possible that we 'get it wrong' by creating agentic and misaligned behavior in some system that still has major weaknesses and blind spots... hopefully, drawing some useful lessons from the experience.
Possible, yes. Likely, imo, no. In a more controlled manner, this would be the ideal scenario - very small improvements under very controlled circumstances where inferences can be drawn and alignment techniques developed. However, that seems to be very much _not_ the road we are headed down. The most important lesson we already have: forging ahead with efforts to create unaligned AGI in a competitive race environment is very dangerous and shouldn't be attempted.
For clarity I am around the same p(doom) as you with relatively long timelines, but that's probably my naively hopeful nature.
All said and done, not even attempting to one-shot test rocket launches is just the correct choice - to the point that it gives extra credit that he could one-shot AGI. One-shotting is not free - it's actually A LOT more expensive, on the order of 10x or more. Choosing the correct strategy in each case is a very low bar of competence.
EY made a good metaphor, but trying to treat it as an actual argument rather than a good zinger is, IMO, counterproductive.
One thing to remember is that the SLS did succeed on the first try. More than a decade late and tens of billions over budget, but it worked on the first flight. The first nuclear test also worked on the first try. I think Yudkowsky would agree it could be possible to do AGI right, but we have to not have 100 companies working as fast as they can with little concern for the consequences.