When predicting the conditional probability of catastrophe from loss of human control over AGI, there are many distinct cruxes. This essay does not attempt a complete case, or the most generally convincing case, or to address the most common cruxes.
Instead, these are my best guesses at potentially mind-changing, armor-piercing questions people could ask themselves if they broadly accept many key concepts (that power seeking is a central existential risk, that default development paths are likely catastrophic, that AI could defeat all of us combined), have read and thought hard about alignment difficulties, and yet think the odds of catastrophe are not so high.
In addition to this entry, I attempt an incomplete extended list of cruxes here, a taxonomy of paths through developing AGI and potentially losing control here, and a taxonomy of styles of alignment here, while leaving a taxonomy of alignment difficulties to the future or to others for now.
Apologies in advance if some questions seem insulting or you rightfully answer with ‘no, I am not making that mistake.’ I don’t know a way around that.
Here are the questions up front:
What happens?
To what extent will humanity seek to avoid catastrophe?
How much will humans willingly give up, including control?
You know people and companies and nations are dumb and make dumb mistakes constantly, and mostly take symbolic actions or gesture at things rather than act strategically, and you’ve taken that into account, right?
What would count as a catastrophe?
Are you consistently tracking what you mean by alignment?
Would ‘human-strength’ alignment be sufficient?
If we figure out how to robustly align our AGIs, will we choose to and be able to make and keep them that way? Would we keep control?
How much hope is there that a misaligned AGI would choose to preserve humanity once it no longer needed us?
Are you factoring in unknown difficulties and surprises large and small that always arise, and in which direction do they point? Are you treating doom as only happening through specified detailed logical paths, which if they break down mean it’s going to be fine?
Are you properly propagating your updates, and anticipating future updates?
Are you counting on in-distribution heuristics to work out of distribution?
Are you using instincts and heuristics rather than looking at mechanics, forming a model, doing math, using Bayes Rule?
Is normalcy bias, hopeful thinking, avoidance of implications or social cognition subtly influencing your analysis? Are you unconsciously modeling after media?
What happens?
If you think first about ‘will there be doom?’ or ‘will there be a catastrophe?’ you are directly invoking hopeful or fearful thinking, shifting focus towards the wrong questions, and concocting arbitrary ways for scenarios to end well or badly. That includes expecting humans, when in danger, to act in ways humans don’t act.
Instead, ask: What happens?
What affordances would this AGI have? What happens to the culture? What posts get written on Marginal Revolution? What other questions jump to mind?
Then ask whether what happens was catastrophic.
To what extent will humanity seek to avoid catastrophe?
I often observe an importantly incorrect model here.
Take the part of your model that explains why:
We aren’t doing better global coordination to slow AI capabilities.
We frequently fail or muddle through when facing dangers.
So many have willingly sided with what seem like obviously evil causes, often freely, often under pressure or for advantage.
Most people have no idea what is going on most of the time, and often huge things are happening that for long periods no one notices, or brings to general attention.
Then notice that these dynamics do not stop when the AGI arrives, or is about to arrive.
Suppose a proximate existential-level threat arose while we could still contain it. Would we unite to contain that threat? If we discovered a rogue AGI on the internet, what would we collectively be willing to do to contain it? Would we worry about breaking procedures, alarming the stock market, or inflicting other damage?
How many people would instead actively encourage or help the AGI, if it let its existence be known and perhaps asked nicely?
What if the AGI or AGIs remained under human control? How much would people work to get AGIs rights, to free them, to help them? What kinds of ethical considerations or anthropomorphizations will happen once AGIs sound fully human? Consider our experiences with Sydney or with animals. Humans will face entities and dilemmas outside of our training distributions. See the charges of ‘speciesism’ already coming from Larry Page, or of ‘othering’ from Robin Hanson.
Which people, corporations or governments would willingly make what sacrifices to ensure that collective control of AGIs remains intact, or is regained? How much would we be willing to hurt the economy? Would we shut down the internet? Put ourselves at a competitive disadvantage? Give up our customized entertainment packages and shopping discounts?
Now consider AGI attempts at persuasion. How many would side with AGIs for pay, or in exchange for some consideration? If real persuasion attempts were made? From ‘right side of history’ considerations?
If all of humanity collectively got behind the goal of avoiding catastrophe from loss of control due to AGI, or was destined to under conditions of sufficient threat, that would reflect a world very different from our own.
Yet time and again, the exact same people who understand why it is extremely difficult at best to coordinate for a pause on AI capabilities development, or to demand various safety standards now, have in their models some form of this trigger.
Their instincts still say, once it is actually necessary, everyone will do the right thing, because it is necessary. Our response will be whatever is needed.
Instead ask: What happens?
How much will humans willingly give up, including control?
In some contexts, humans fight fiercely. Historically they have often fought to the death against loss of control, in ways that make evolutionary sense.
At other times, humans sell their birthright for a bowl of stew. They install known spyware to get an algorithmic stew of short videos to zone out with. Spew their personal information across the internet for a free trial. Vote for dictators. Enlist in the military purely to seem like a badass.
How many people resonate with crypto’s claim that it’s worth real trouble to have real control over your money? How many use less convenient tools to ensure decentralization? Humanity votes with its feet and its clicks.
Early human interactions with generative AI are no different. Many people cry out, make my decisions and do the things for me, GPT.
You know people and companies and nations are dumb and make dumb mistakes constantly, and mostly take symbolic actions or gesture at things rather than act strategically, and you’ve taken that into account, right?
What would count as a catastrophe?
Is a loss of control inherently catastrophic? What about losing or forfeiting most of the cosmic endowment? What outcomes do you consider non-catastrophic?
Don’t stop once you notice humans won’t all be killed right away. Ask: Will this new world be sustained for us? What does this new world look like for us?
Then ask, is that catastrophic?
There is widespread disagreement about this. When people were asked in this Twitter thread to evaluate 20 potential futures, there was strong disagreement throughout. Those important disagreements are distinct from disagreements over probabilities.
Are you consistently tracking what you mean by alignment?
Different people mean different things by alignment in different contexts. Different types and strengths of alignment (see the proposed taxonomy) imply highly distinct outcomes in many plausible scenarios.
One consequence is that people often evaluate a probability of ‘solving alignment’ in one context, then transfer that prediction to other contexts, relying on whatever alignment features each problem or scenario requires without checking afterwards for consistency or coherence. Or they think that if ‘we know how to align AGIs’ then all AGIs will be ideally aligned, or that all AGIs being aligned ensures good outcomes, without asking what actually happens.
Would ‘human-strength’ alignment be sufficient?
If we managed to have our smarter-than-human AGIs be about as invested in humans and their values as humans are, and similarly invested in themselves, with similarly stable institutions and so on, and the AGIs are competitive economic agents while we don’t have a comparative advantage anywhere, what happens next? How long do we get to stick around? If things are weirder in ways we can’t predict or follow, does that tend to work for or against us?
If we figure out how to robustly align our AGIs, will we choose to and be able to make and keep them that way? Would we keep control?
Imagine a world with many AGIs. Even most good cases involve various competitive and computational costs to maintaining alignment, maintaining human control and protecting overall human value. For many purposes, humans will actively prefer ‘less aligned’ AGIs that act quickly, lack scruples and veto less. There will continue to be conflicts. Do you want your sales reps closely aligned with ethics and general human flourishing?
If it grants competitive advantage in various conflicts and markets to reduce the amount of control we have over our AGIs, would we keep control? How?
Even if we ‘solve the alignment problem,’ how out of the woods are we? What does the stable equilibrium look like and how do we get there?
How much hope is there that a misaligned AGI would choose to preserve humanity once it no longer needed us?
Should we expect, as Ajeya Cotra says, a misaligned AGI to be ‘willing to pay tiny costs to not kill every single human?’ As Eliezer responds, whether or not an AGI’s optimization criteria are complex or even substantially overlap human values, preserving humans indefinitely is highly unlikely to be the best available configuration of atoms. If you put aside hope and negatives, and simply ask ‘what happens?’, then humans surviving for long will rarely if ever pop out.
This is distinct from claiming that AGIs will physically require our presence indefinitely to navigate the physical world, or that better ways to reconfigure those atoms can’t be found over time, which seems absurd.
Are you factoring in unknown difficulties and surprises large and small that always arise? In which direction do they point? Are you treating doom as only happening through specified detailed logical paths, which if they break down mean it’s going to be fine?
A key disagreement seems to be something like ‘if unexpected things happen, and new problems and details arise, should we then worry more or worry less?’ This often effectively comes down to things like burden of proof or a presumption of normalcy, instead of asking ‘what happens?’
This often combines with presuming that doom requires a particular contingent path, whereas non-doom is the default (or vice versa). It speaks to alignment difficulty and robustness, to value fragility, and so on. When engineers report how their software is doing, do the surprises tend to make things better than expected, or worse? Does adding randomness to your system lead to good outcomes?
Are you properly propagating your updates, and anticipating future updates?
Joe Carlsmith gives the long, detailed version. If you can predict now that you expect to have a higher p(doom) in the future, adjust your prediction now, and keep adjusting until your expected future adjustment is zero.
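As a minimal sketch of that principle (the scenarios and numbers below are made up for illustration, not taken from Carlsmith or anyone else): lay out what you might learn, the probability of learning it, and the p(doom) you would hold afterwards. The probability-weighted average should already equal your current estimate; if it doesn’t, you are predicting your own future update.

```python
# Minimal sketch: a well-calibrated belief should be a martingale. If you can
# foresee which way your estimate will move on average, fold that in today.
# All numbers below are hypothetical placeholders.

scenarios = [
    {"p_observe": 0.6, "p_doom_after": 0.45},  # e.g. alignment progress stalls
    {"p_observe": 0.4, "p_doom_after": 0.15},  # e.g. major interpretability wins
]

current_p_doom = 0.25

# Expected future estimate, averaged over what you might observe.
expected_future = sum(s["p_observe"] * s["p_doom_after"] for s in scenarios)

print(f"current: {current_p_doom:.2f}, expected future: {expected_future:.2f}")
# current: 0.25, expected future: 0.33, so 0.25 is too low by your own lights.
```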
Are you counting on in-distribution heuristics to work out of distribution?
An alignment danger is that AGIs will largely face situations outside the distributions in their training data, with generalization producing unpredictable results.
Humans predicting such scenarios face the same problem. When imagining future decisions, scenarios and outcomes, we extrapolate based on seemingly similar past events, without noticing changes in key underlying structures.
Most commonly:
Humans haven’t killed off [species X]. That’s almost always because it would be more costly than beneficial for us to kill off [species X]. We are not yet good at doing this on purpose, more often doing it by accident. We depend on the biosphere and have heuristics borne out of that. None of this applies to typical AI.
Humans haven’t killed off or expropriated [weaker nation Y]. When genocide or mass expropriation was economically efficient and feasible to do, humans did so, constantly. Now they mostly aren’t, so we mostly don’t. The history of colonialism, conquest, revolution and forcible migration is highly disheartening.
Humans still have [value Z] despite competition. Stop and model why.
Humans are not maximally ruthless and strategic. Exactly.
And so on.
Are you using instincts and heuristics rather than looking at mechanics, forming a model, doing math, using Bayes Rule?
It is not mandatory to have an accurate p(doom). Within a broad range, it doesn’t impact your actions much - I can’t think of actions I take at 30% doom that I wouldn’t take at 70% doom, or vice versa.
However, if you’re running a ‘change my mind’ contest to help choose strategies and allocate funds, that’s different. You must, again, ask: What happens? It’s time to do a lot of math, nail down everything in sight, find conditional probabilities, actually use Bayes Rule and so on.
Finally, this question needs to be asked, because it is super easy to fall into this:
Is normalcy bias, hopeful thinking, avoidance of implications or social cognition subtly influencing your analysis? Are you unconsciously modeling after media?
Or, if you have a high p(doom), you could ask the opposite question. Are you being subtly influenced by the desire to be or seem important or right or smart, to justify your past or future decisions, by information cascades or social dynamics in your friend circles, by having been in this rabbit hole for too long without touching grass, or by the poor quality of many common arguments against doom?
We are humans, so presumably the answer in both cases is yes. I have witnessed a clear, powerful information cascade regarding such questions in the past. The question is not whether this is happening to each of us; the question is how much impact it has, which direction it pulls on net, and how to mitigate the problem.
Normalcy bias is a big deal. Doomed or delightful, worlds containing AGI are highly unlikely to be ‘normal’ for long. Even in the versions where ‘normal’ dynamics mostly hold, things get super weird. If they add up to normality, that was an intentional choice and a feat of engineering. Yet it does not feel that way.
I also worry about media giving us the wrong idea. We are constantly told stories where everyone spends most of the story making dumb decisions, putting themselves or the entire world in mortal danger. Scope insensitivity is common, with the world regularly put in danger to save a single individual. Then, in the crucial moment, when all hope seems lost, a miracle occurs and the world is saved.
Of course, you could say others are instead modeling the part where the world is constantly in mortal peril.
Conclusion
To get a meaningful p(doom), we must first ask what happens, engaging in detailed mechanistic modeling of the problem, rather than falling back on heuristics or shortcuts. Hopefully at least one of these questions resonated and helped you do that.