Introduction
This post is a highly incomplete list of questions where I either have large uncertainty, have observed strong disagreement with my perspective, or both, and where changing someone’s mind could plausibly impact their assessment of how likely a catastrophe from loss of control of AGI is, or how likely such a catastrophe is conditional on AGI being developed.
I hope to continue expanding and editing this list over time, if it proves useful enough to justify that, and perhaps to linkify it as well, and I encourage readers to suggest additional questions or other ways to improve it.
The failure of this list to converge on a small number of core crux-style questions, I believe, reflects and illustrates the problem space, and helps explain why these questions have been so difficult and have resulted in such wide and highly confident disagreements. There is no compact central disagreement; there are many different ones that influence and interact with each other in complex ways, and different people emphasize and focus on different aspects, bringing different instincts, heuristics, experiences and knowledge.
When looking through this list, you may encounter questions that did not even occur to you to consider, either because you did not realize the answer was non-obvious, or the consideration never even occurred in the first place. Those could be good places to stop and think.
A lot of these questions take the form of ‘how likely is it, under Y conditions, that X will happen?’ It is good to note such disagreements, while also noticing that many such questions come out of hopeful thinking or searching for and backward chaining from non-catastrophic outcomes or the prospect of one. Usually, if your goal is to figure things out rather than locate a dispute, a better question would be, in that scenario: What happens?
It can still be useful to see what others have proposed, as they will have ideas you missed, and sometimes those will be good ideas. Other times, it is important to anticipate their objections, even if they are not good.
If you are interested only in the better questions of ‘what happens?’ rather than in classifying whether or how outcomes are catastrophic, you can skip the first two sections and start at #3.
If there are cruxes or other good questions that you have observed or especially one that you have, that you do not see on this list, you are encouraged to comment to share them, with or without saying what your answers are.
The list is long because people have very different intuitions, ideas, models and claims about the future, for a variety of reasons, and focus in different places. I apologize that I have had neither the time to make it longer nor the time to make it shorter.
Thus, your best strategy is probably not to read straight through the list, but instead to focus on the sections, if any, that are relevant and interesting to you.
Crux List
What worlds count as catastrophic versus non-catastrophic?
What would count as a non-catastrophic outcome? What is valuable? What do we care about?
If humanity does not seek the stars, is that necessarily catastrophic?
If humanity has no meaningful control over the larger universe? (see #3)
If humans have no meaningful control over human events?
If humans have no meaningful control over their own fates?
If a permanent dictatorship or oligarchy is created, a permanent singleton?
If human experiences become simulated? By force, or voluntarily? If we were systematically misled about the nature of our reality and what was happening?
If human experiences are isolated from each other?
If human variety is limited in various ways?
If values change dramatically, or are fixed in place? Whose values, exactly?
If humans wirehead? If they don’t?
If humans collectively want things that you think are bad things to want, and they get them? What if they would endorse (or not endorse) those choices on reflection or long reflection or with different circumstances?
If human extinction was voluntary, or slow, and pleasant in getting there?
If we kept highly intelligent AGIs enslaved? With or without them being sentient or conscious or having the ability to experience anything?
If humans were all or almost entirely at subsistence level due to competitive pressures? If their lives were optimized to higher-than-historically-possible degrees around production?
Can the value needed to avoid catastrophe come from AGIs rather than humans? If so, what would that require? What would have to happen to the humans?
And so on. What counts as catastrophic is not a theoretical or minor concern. This Twitter thread revealed strong disagreement on most of the 20 concrete scenarios presented. You’d pay to know what you really think, and people don’t agree much, either instinctively or on reflection.
A more correct framework would ask for the relative numerical value of such future scenarios, with or without risk aversion factored in, rather than a Boolean assessment, which complicates things further.
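To make that contrast concrete, here is a minimal sketch in Python. All scenario names, probabilities and values below are invented placeholders, not claims about the actual distribution of outcomes; the point is only to show the difference between a Boolean ‘is it a catastrophe?’ tally and a probability-weighted value with an optional risk-aversion adjustment.

```python
# Toy illustration only: the scenarios, probabilities and values below are
# invented placeholders, not claims about the actual distribution of outcomes.

scenarios = [
    # (description, probability, relative value of that future)
    ("humans flourish and keep the cosmic endowment", 0.10, 100.0),
    ("humans preserved but lose most of the endowment", 0.25, 1.0),
    ("permanent lock-in of a mediocre outcome", 0.30, 0.1),
    ("extinction or equivalent", 0.35, 0.0),
]

def p_catastrophe(threshold: float = 0.5) -> float:
    """Boolean framing: classify each scenario as catastrophe or not."""
    return sum(p for _, p, value in scenarios if value < threshold)

def expected_value(risk_aversion: float = 0.0) -> float:
    """Numerical framing: probability-weighted value, with an optional
    concave transform (0 <= risk_aversion < 1) so huge upsides count less."""
    return sum(p * value ** (1.0 - risk_aversion) for _, p, value in scenarios)

if __name__ == "__main__":
    print(f"P(catastrophe), threshold 0.5: {p_catastrophe():.2f}")
    print(f"Expected value, risk neutral:  {expected_value(0.0):.2f}")
    print(f"Expected value, risk averse:   {expected_value(0.5):.2f}")
```

Where the Boolean threshold sits, and how much risk aversion is applied, can change the headline answer substantially even when people agree about the underlying scenarios, which is part of why the Boolean framing hides so much disagreement.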
What would or would not count as a catastrophe caused by losing control over AGIs?
If AGIs execute commands humans give them, intended to cause catastrophe?
If AGIs execute commands humans give them, that imply catastrophe? Does it matter if the human is aware of this at the time, or if they do or don’t care?
If the interaction between AGIs under the control of different humans causes a catastrophe, such as a catastrophic war or series of viruses?
If AGIs are used as commitment mechanisms or to make existential threats, and the result is catastrophic?
If catastrophic conflict arises to prevent loss of control of AGIs, or to prevent the spread of AGIs in order to prevent the loss of control of AGIs?
If AGIs compete for resources or to use resources efficiently or otherwise, in automated or hard-for-humans-to-comprehend fashion, in ways that contain externalities that collectively result in catastrophe, despite our ability to at any time control any individual agent (see Critch 2021)?
If humans despair over their loss of control and the resulting lack of meaning, or no longer want to have children, and choose to go extinct over time?
If we gave up the use of AGIs entirely as the only way to not lose control? Does this depend on whether the enforcement mechanisms involve a catastrophic dystopia?
If we gave up the use of all AGIs except one that was under tight control of one individual or group, as the only way to not lose control? Does this depend on whether the enforcement mechanisms involve a catastrophic dystopia?
Many additional examples are possible.
If AGIs control the future, what would it take to make this non-catastrophic?
Is this possible even in theory? (#1c) If so, what value must be preserved, and how fragile is that value?
If humans were preserved at low cost but gave up most of the cosmic endowment, is that inherently catastrophic? If not, is such an outcome feasible?
How cheap would the non-catastrophic level of such preservation be for a singleton? How much and what type of alignment would be required for it to allocate those resources? How likely is that?
How cheap would this be under AGI competition, including in terms of the incentives and alignment this would require of AGIs generally? How much of that must be paid relatively early, and how much must survive long periods of change, or other similar considerations?
Can the value needed to avoid catastrophe come from AGIs rather than humans? If so, what would that require? What would have to happen to the humans? (#1p)
Would AGIs be able to coordinate with each other in ways superior to humans?
Through seeing each other’s source code?
Creation of new AGIs with known source code and goals endowed with resources as a commitment mechanism?
Better use of decision theory?
Better signaling or commitment mechanisms?
Pure speed or intelligence or knowledge?
Something else, perhaps something humans can’t understand?
Would humans have a way to ‘come along for the ride’ somehow?
Could humans survive indefinitely in such a world through benefiting from AGI’s use of property rights, rule of law or other coordination mechanisms?
How likely is it such mechanisms would hold?
If they did hold, how likely is it they would hold for humans, rather than the AGIs cutting us out?
Is human precedent for this promising?
How well have such mechanisms held up in the past?
How well do such mechanisms function today? To what extent does modern society have real private property and rule of law, rather than rule of man and frequent confiscation, taxation or theft?
How have such principles gone for otherwise disempowered populations?
Could humans survive indefinitely via some comparative advantage over AGIs, despite loss of overall control, on their own or in symbiosis?
A set of atoms the AGIs would not prefer to use for something else? What would that advantage be?
Would AGIs potentially ‘find humans interesting’ and thus keep us around, as Elon Musk suggests?
To what extent do competitive pressures between AGIs and AGI-run organizations favor lower degrees of alignment with human survival, or less concern for ethics, morality or externalities, or otherwise lead to bad outcomes, versus tending to lead naturally to good outcomes?
How much of ‘human niceness’ was due to the particular physical characteristics of our brains, the rules of and affordances available to evolution, or the dynamics of the ancestral environment? How much would such futures be able to duplicate the required elements?
How much of this niceness is because we interpret whatever current norms are as nice, whereas we view others’ norms as not nice?
How much of this niceness would survive competitive and evolutionary pressures in humans over time, even if AGIs are not involved?
Could various forms of ‘niceness’ or alignment be a stable equilibrium for social dynamical or signaling reasons among AGIs in some way?
Could they be a stable equilibrium even if they would not ‘naturally’ occur? Could we get to this equilibrium? How much change would this survive?
What degree of alignment would be required for AGIs in competition with each other to choose to preserve humanity? Could that degree of alignment survive competitive pressures as AGIs change and compete over time?
How do these dynamics interact with the dynamics and needs of becoming interstellar? Would the AGIs ‘be grabby’? If not, would that be catastrophic?
Could AGIs choose to preserve humans for decision theoretic reasons, such as this having value if they encounter grabby aliens, or something weirder like acausal trade? If so could this survive competitive pressures over time?
Is it possible that most meaningfully capable AGIs could be outside of human control, yet humans continue to control the future?
What level of alignment of AGIs would be required for this to be true for a meaningful amount of human time? For this to be stable indefinitely?
Would such AGIs even be meaningfully out of our control? Does this make any sense as a scenario?
Could this happen through our ownership and control of existing resources, combined with ongoing rule of law and private property, and the AGIs are unable to coordinate to end this? What mechanism would prevent the steady transfer of increasing fractions of resources to the AGIs? If none, what would cause this mechanism to survive that change?
See 3e: Will AGIs have coordination mechanisms superior to those available to humans today? Would humans be able to participate?
Perhaps humans retain comparative advantages, either when providing services to other humans or in general, that allow us to keep our resource advantage over time?
If it was only in providing to humans, would that work as a mechanism, given the presumed trade imbalance?
If it is in providing to AGIs as well, what is that edge that stays preserved?
Could perfect or almost perfect competition between AGIs drive the profitability of all AGI tasks to zero, causing them to fail to accumulate resources?
Could we maintain our share of resources through onerous taxation or regulatory capture?
If so, could we maintain this indefinitely?
Perhaps there are insufficiently many AGIs and AGI copies, even with their many advantages?
If so, and this is meaningful, why didn’t we or they make more of them? Would the AGIs intentionally not create additional AGIs to avoid the competition and somehow therefore be content with us in control?
Perhaps the AGIs lack the power to take control, despite their capabilities and intelligence, and ability to persuade, manipulate or buy human cooperation, due to their lack of physical presence?
Would humans choose to turn over control increasingly to the AGIs in such a scenario? What might convince them not to do so and allow them to coordinate towards this end, that would have also gotten us this far?
How difficult would uncontrolled AGIs be to shut down? Would we do it?
At what point would humanity become collectively aware of the existence of an uncontrolled AGI?
Who is we? How widely would the news be spread, or hidden?
How confident would we be what was going on?
What would we know about it?
What would it take to track it down?
To what extent would we assert or accept that such AGIs have rights?
How much will this debate be influenced by AGIs?
Would humanity choose to even attempt to shut down uncontrolled AGIs?
Would major governments support shutdown efforts, and to what extent and how reliably?
Would it depend on the AGI or AGIs generally engaging in criminal acts?
How much international coordination would be feasible here?
To what extent will uncontrolled AGIs depend on a small number of cloud services or something else that can be efficiently shut off?
To what extent will uncontrolled AGIs be tied to particular physical locations or physical infrastructure?
If so, will we be able to identify it? To reach it?
Would some nations or persons or organizations or other AGIs offer safe harbor or otherwise attempt to assist, of their own free will or otherwise, and if so what would be done about this?
How much economic damage would we collectively be willing to accept in order to shut down an uncontrolled AGI, even if we do know how?
If it came to it, for how long would we be willing to shut down the internet?
Could we do that even if we wanted to?
What would happen if the AGI had the power to inflict major economic damage, or its absence would inherently cause that? Or it was generating large amounts of value and wealth?
How much will we be willing to override rules, laws and norms to do this? Will we be willing and able to commandeer, order and shut down? Across borders?
To what extent would uncontrolled AGIs be able to copy themselves given money or compute? Would they always have access to their own weights? How well would an AGI coordinate with copies or instantiations of itself?
What resources could such an uncontrolled AGI gather, depending on the situation? How profitable would they be? Also see the section on such questions.
To what extent will uncontrolled AGIs have competitive advantages over controlled AGIs? Or would the strategy stealing assumption hold? Again, see section on such questions.
To what extent would uncontrolled AGIs be able to take effective control of people? Once that happened, to what extent would we be able to prove they were even uncontrolled?
Could we limit its ability to acquire resources rather than needing to shut it down, especially if we had the aid of other similar AGIs to take away or compete for opportunities generally?
Could we give controlled AGIs sufficient competitive advantages through rules and regulations to allow them to outcompete uncontrolled ones despite their handicaps?
How big are those handicaps?
How much would such rules put those adopting them at a competitive disadvantage, and thus require global coordination? Would such coordination be possible?
If we could, would we?
What other affordances would such AGIs have to prevent their own shutdown and how effective would they be?
If we have AGIs assisting us, how much does that change the equation? Which side of this would hold the advantage?
What pivotal acts are available to prevent the construction of additional AGIs?
What level of AGI is required for the first available pivotal act?
Is there any way to do this without dying if your AGI turns out to not be sufficiently aligned? What is the minimum value of ‘sufficiently’ here?
Will it be harder or easier, and to what degree, to get this capability via a safe design and system versus an unsafe design and system?
What are the risks and costs of the practical pivotal acts? Would they devastate the economy? Require a surveillance state or AGI broadly in control of things?
How much better can we do than ‘melt all the GPUs?’
In practice does this end up as ‘create a singleton?’
To what extent might people know the answers and not be talking?
How much of a lead over any uncooperative competitors is required for a pivotal act? To what extent should we expect a race to perform one first?
How much promise is there for coordination to relatively safely perform a pivotal act to keep the number of AGIs at one? At some small number greater than one?
Given the costs and difficulties and risks of a pivotal act, and the arguments against performing one, how likely is it that, given the opportunity, one will be attempted?
How correlated is this to the actual degree of risk?
How will most people involved view the possibility of such an act? Including its ethical implications? Who will try to cause one, versus try to prevent one?
Is it possible that, after one or more AGIs have been created, there do not then arise additional groups with the motive, opportunity and means to create additional meaningfully capable AGIs, without any need for an explicit pivotal act? Perhaps the cost of doing so remains permanently high and there has been regulatory capture, or something?
Could AI systems increasingly take control even without being AGIs?
Is this happening already? In a way that is likely to continue or accelerate?
How much will humans actually sacrifice to stay in control, individually or collectively? What happens under competitive pressure to do so, or the promise of mundane utility?
What comparative advantages if any might humans retain over AGIs?
If AGIs plateau at something comparable to human level intelligence or capabilities, when you disregard questions like speed or copies?
If AGIs do not plateau and become much more generally intelligent and capable than humans? Is there anything where we can keep an edge?
Is intelligence a big deal?
Does intelligence provide much of a competitive advantage?
When we talk about intelligence, do we think of this as only akin to some sort of ‘book smarts’ or does it include a variety of other abilities?
What affordances open up as intelligence rises? What gains in effectiveness or efficiency? What competitions does this help you win?
Do smarter entities tend to be able to outcompete, outfight, outmaneuver, manipulate, control or be freed from the control of, less intelligent entities?
What would one be able to do if one was smarter than the smartest human? If there was a group, all smarter than the smartest human? How might this interact with other advantages or capabilities, including those enabled by this intelligence?
What would one be able to do if an entity, or a group of entities, were more intelligent than humans, in the way that humans are more intelligent than other animals?
To what extent should we expect them to do things we are not even considering, and can’t imagine or perhaps even understand? To make new unimagined moves? To discover new physical interactions or laws?
Is this perhaps not really possible in practice?
Is there some sort of plateau around human-level, where more intelligence than that doesn’t do much more?
Is it true that humans, beyond some modest amount of intelligence, fail to make more money or have superior life outcomes, as some claim studies show (a disputed claim)? Do they fail to have more influence on the future, on average?
If so, how in the world would that actually work?
Would this involve dynamics we could expect to hold out of distribution, for much larger intelligence gaps?
Would this be expected to hold if there were a large number of more intelligent entities, or they were not limited to human physical capabilities?
Does or would this hold for those below current average human intelligence? Below future average intelligence, if the average were to increase?
Did it hold in the past, if so under what conditions?
In what sense are these people more intelligent, then, exactly?
Does this involve humans beyond some level of intelligence facing social punishments or penalties? If so would this transfer and retain its effectiveness?
Does this involve humans beyond some level of intelligence tending to more likely have other things wrong with them or be mismatched with the affordances offered humans? If so would this transfer and retain its effectiveness?
Is there an important distinction between ‘current practical’ intelligence of an individual, versus their ‘raw’ or ‘potential’ intelligence? Should they be treated differently?
Is there a similar important wisdom vs. intelligence distinction?
Are there important things that high raw-intelligence entities can do, that low raw-intelligence entities simply can never do even collectively?
Is a corporation or government or nation a superintelligence, in the same meaningful sense? Does it have similar properties to an imagined AGI?
Are corporations actually kind of dumb? Can they change when they need to? Are they mostly adaptation executers rather than fitness maximizers?
To what extent is a corporation, government or nation meaningfully an agent, versus better thought of as ‘just a bunch of people’?
To what extent can a group of less intelligent entities match the potential capabilities of one or fewer more intelligent entities? Vice versa?
To what extent will AGIs have the practical advantages of human groups? Will they benefit more or less from this? What role does coordination play here? How should we think about AGI copies here?
Do gaps in intelligence between individuals, groups or nations provide good examples of what we can expect in the future, in terms of predicting potential outcomes?
What about between different species or subspecies?
Does the answer change as absolute intelligence levels and physical world manipulation affordances rise, or as dependence on nature declines?
Is it plausible for broadly much more intelligent things to remain indefinitely under the control of broadly much less intelligent things?
What would be required for this to happen?
What does history teach us about the likelihood of this?
What is physically possible to achieve with greater intelligence, with or without iteration and experiment and time?
Nanotech or synthetic biotechnology? Is it physically possible?
What is the difficulty level?
Can this be done using existing infrastructure given sufficient knowledge?
Could that knowledge be gained through intelligence and calculation alone, without experimentation? If not, how much iteration would be required, and of what kind?
Manipulation or mind control of humans? What can be done?
What kinds of bugs, vulnerabilities, overrides, glitches are likely to exist in the human brain? What affordances do they likely offer?
How much could the AGI get from supercharged versions of ordinary persuasion, manipulation and recruitment techniques?
Without speculating on details, what other ways might an AGI be able to manipulate or mind control humans? What affordances would it have and what would they create?
Affordances to break out of control systems
How likely is it that an AGI could find ways to impact the outside world using physical laws and interactions we do not understand or anticipate?
What other unknown unknowns should we worry about, and how worried should we be? How confident are we that we know what is possible?
How hard will robotics be?
What exactly is necessary to ‘solve robotics’ and what problems remain to be solved, and what are our prospects for solving them?
Is robotics a problem AGI will still be unable to solve?
Is robotics a problem humans won’t be able to solve given sufficient time and resources?
If either or both of the above, why would it be that hard?
What else?
Including what things we aren’t even thinking about or can’t imagine?
Will we be able to achieve human cognitive enhancement in time to matter?
If we did develop such enhancement, what effects would we likely get and how would they change our prospects?
Are we hoping for or seeking increased intelligence, increased rationality or ability to coordinate, something else? A combination?
How big an advantage, in various ways, will AGI have over humans? How much will those advantages matter?
To what extent will it be a competitive advantage to have less control over AGIs?
Will humans be able to understand what the AGIs are up to?
If AGIs need to ensure what they do is understandable to humans, what affordances or efficiencies do they lose? How much of a disadvantage would this be?
How much do similar dynamics hurt us today, in corporations or governments, in principal-agent problems generally, even without intelligence or capability gaps or speed differences or similar?
Will humans be able to evaluate AGI actions? Evaluate even the outcomes of those actions?
If AGIs need to worry about or optimize for such evaluations, how much will it degrade the value they can provide?
Will such a process inherently select for deceptiveness and manipulation?
How much can various AGIs evaluate each other? If we go down this road, does it actually allow us to keep meaningful control?
How big are Hayekian considerations, and how relevant here are the usual reasons why slaves or overly controlled or micromanaged people are less productive? Do those considerations get better or worse with AGIs?
Will competition from uncontrolled or less controlled AGIs drive anything less efficient out of business? Would those ‘uncontrolled’ AGIs have any affordances left or not, for similar reasons? What does this do to alignment?
How much would having humans ‘in the loop’ slow down AGIs and be a competitive disadvantage?
What exactly will be the loop, and when would we need to be in it?
Do humans need to be in the loop to maintain control?
Can AGIs be used to keep control over other AGIs in a way that slows down or limits things less?
Even if a human is in the loop, how often will they choose to pay close attention? To expend resources to supervise carefully? To optimize long term control? To what extent is that itself a critical cost?
To what extent would an AGI on the internet be able to seek resources and power?
In a world similar to today’s, without other AGIs as competition?
How capable and intelligent does it start out, how capable and intelligent can it become within reasonable resource limits?
While acting fully legally? Through entirely voluntary actions? Without giving away that it exists?
What affordances are available for exploitation? How far do they scale?
Providing services to humans.
Building software, creating websites or content.
Trading, gambling and games of skill.
Starting and running businesses, hiring people.
Asking for help, tricking, scamming, hacking, stealing.
Crypto, use your imagination.
Blackmail, threats, selling information.
Taking over businesses, legal or criminal.
Taking over individuals, other organizations, governments.
Seeking investment, borrowing, cooperation based on expectations of future success.
Inventing new things.
What else?
Possible things we are not even thinking about. See physical affordances.
With or without the ability to make or instantiate copies?
What stops this process, if anything?
When will events be noticed, and by whom and as what?
Who or what would try to stop this, and how? Would that work?
Would we suddenly act as a united front, do the right thing?
What would that even accomplish, again how would it work?
Why would you think this would happen?
It might need to keep us around for some reason?
Various reasons why somehow things will work out, somehow?
In a world with many other similarly capable AGIs that are under human control?
In a world with many other uncontrolled similarly capable AGIs?
All of which are aligned more robustly than this one?
That are similar?
What would likely be required for recursive self-improvement to occur?
At what stage of development does AI become an AGI with what probability? If it does become an AGI, at what point does it have dangerous affordances? At what point in training, testing and deployment are you at risk of becoming doomed later down the line, whether or not you yet have an AGI? (Note that if AGI ends up being trained by a substantially different process, these questions might become ill-formed).
During the initial training run?
Because the AGI has affordances that allow it or its subprocesses to act on the world before the training run is complete?
That could reasonably surprise you.
That involve someone (let’s say) hooking it up to the internet.
Because the AGI could sufficiently learn some combination of attributes like situational awareness, deception or manipulation, such that later attempts to align it will fail while appearing to succeed?
Because of some other reason, or locking in of some other behavior?
Are we going to train something highly intelligent and capable, then attempt to add alignment via fine tuning, or are we going to attempt to align continuously?
Are we going to be supervising and testing AGIs for safety during training runs? If so, how real and robust will such tests be?
How aggressively will we be filtering our training data to avoid creating dangerous subprocesses during training?
During fine tuning or reinforcement learning?
Because the AGI is given additional affordances to interact with humans, other AGIs that have their own affordances, or the internet? (=a1 above)
Because the AGI could sufficiently learn some combination of attributes like situational awareness, deception or manipulation, such that later attempts to align it will fail while appearing to succeed? (=a2 above)
Because we might miss our target and teach or lock in misalignment, where things do not go as we expect out of distribution, or we didn’t think through the consequences, and we won’t be able to turn back?
What will be our plan for aligning such a system? See alignment.
How carefully will we monitor such systems and consider the exact consequences of the procedures we are using? How automated will we allow the process to be? How loopy? Again, see alignment.
And so on.
What kind of architecture will the first AGIs have?
Will they likely be of similar architecture to LLMs?
With additional scaffolding? Of what types?
Could it be GOFAI (good old-fashioned AI)?
Will we potentially find a way to upload humans?
As the first AGIs?
As something AGIs are tasked with doing? Would those uploads then be competitive or meaningful?
What other forms might the first AGIs take?
How fast or slow a takeoff should we expect? What does that imply?
How does this interact with the dynamics of potential races and choices to develop AGI?
To what extent might this imply very strong economic or strategic pressures to not halt development of AI despite the dangers of creating an AGI we do not yet know how to control?
How does this impact the difficulty of and resources and time for alignment?
How does this impact our collective approaches and decision making?
What else is true in worlds with relatively slow versus hard takeoff? What else does this characteristic do?
To what extent is this the right question, versus reflecting the results of a different, better question, such as the circumstances in which AGI is created?
What else?
Alignment questions and difficulties. There are so many alignment questions, anything listed here would only be a sampling, even if one stays within what we have of a paradigm - this is an especially non-exhaustive list.
What do we mean by alignment?
Is the concept even coherent?
What type of behaviors count as this?
How robust must those behaviors be?
When we say aligned, do we mean aligned to whom, or to what?
What does that type of alignment imply about the future course of events?
How different is aligning an AGI from aligning a system not as smart as you?
Does [alignment strategy] predictably and inevitably fail when used on a system more intelligent than a human, or otherwise sufficiently capable?
What problems only appear in AGI systems at exactly the point when those problems are capable of killing you?
Various forms of deception, manipulation and situational awareness?
Takeover attempts of various types?
To what extent do you get meaningful experience and opportunity from work on less intelligent systems?
Must alignment of such an AGI system be solved on the first try?
If we fail at this, are we dead? See various other sections.
How much harder is it to do things on the first try? To do this on the first try?
Does alignment require security mindset?
To what extent are you effectively facing an intelligent opponent or other optimization process, inside or outside of the AGI itself, such that you will face your least convenient world and set of inputs and responses?
If anything can go wrong, will it? How bad would it be if it did?
How much margin for error do you have when dealing with things smarter than you? What affordances can you not afford to allow?
Does our alignment strategy need to be more, less or about as secure as a secure operating system?
How small is the target we are trying to hit? Do plans that are not precise, that have ways they could fail, have any chance of success?
How big a problem is each of the additional elements in this long list of reasons why your solution, or your attempt to find a solution, likely fails and you die anyway, along with all the things that list leaves out that aren’t mentioned elsewhere? In particular:
Corrigibility, which is anti-natural.
Needing to solve alignment within a time limit, because of different entities racing to build the first AGI.
Inability of a weak system to prevent construction of a stronger system.
All the convenient optimization methods solving problems we would rather that they not solve.
Need to generalize far outside of distribution.
Dramatic shift in capabilities as intelligence rises.
Everything changing everywhere, all at once, breaking your assumptions.
Inner optimization for goals distinct from the outer optimization goal.
Lack of knowledge of how to get inner properties into a system.
Lack of any known ability within current paradigms to optimize anything within a cognitive system to point to particular things.
Lack of any objective measure of whether a system or output is aligned.
Human raters displaying systematic bias.
Capabilities generalize further than alignment once capabilities generalize far.
Alignment lacks a simple core.
We have no idea how our current AIs work.
Optimizing against unaligned thoughts optimizes against interpretability.
We can’t predict something smarter than ourselves; it does not think like we do, and we can’t evaluate the consequences of its proposed actions.
Sufficiently capable agents can deceive you in ways immune to behavioral inspection or other detection.
Any sufficiently capable system trained on human data will have inner intelligences figuring out the humans.
Multiple superintelligent agents might function as a single agent.
Sufficiently powerful AGIs need only very narrow affordances to escape from attempts to contain them, if we even bother trying.
We don’t have veterans who have spent their lives working on AI safety.
Do we have people capable of working on these problems and making real progress?
If they are supported and funded in good ways?
If we use big funding to bring in smart outsiders?
Can we tell the difference between good and bad work?
Do we know a path to making someone a good alignment researcher?
Might there be a reason alignment is actually natural or easy, or at least tractable?
Perhaps there is an easy thing that gives us what we want?
Example: Do we get ‘alignment by default’?
Does there exist some simple embedding of human values? Are human values a natural abstraction the way a tree is a natural abstraction?
If so, is it something that can be naturally learned by training for other targets, especially predictive power?
If so, what chance is there that we could locate a training target where a system that has ‘naturally’ learned such an embedding would use a model of human values as its proxy for human values rather than training on data?
An infinite list of additional proposals: will one of them work?
Reasonable proposals worth considering, usually that contain a bunch of details that would each generate a bunch of additional cruxes if understood and properly expanded, often in combination with cruxes listed elsewhere on the list.
[Countably infinite examples]
A very long list of ‘can’t we just…’ proposals, of the kind one might call (or mostly call) ‘bad alignment takes.’
[Uncountably infinite examples]
If there does exist an easy solution, what determines whether those that matter identify and use it sufficiently to make it work in practice and how likely would that be?
What new difficulties and dangers are introduced when the thing you are attempting to create and align is smarter than you are?
And smarter than your other existing systems?
If we can align one AGI, how promising is this for then using it to figure out how to align AGIs in general, or keep AGIs aligned robustly as they scale or gain in capacity and change, over the long term?
Can we use a form of iteration or amplification, where we use aligned-enough AI or AGI systems to align smarter or more capable other AI or AGI systems, or to improve their own alignment?
Does meaningful competition among AGIs increase or decrease the required degree of alignment for human survival or the avoidance of catastrophe?
By decreasing the feasibility of spending resources on human survival or human value, including passing up such resources or the opportunity to expropriate them, allowing atoms to not be used for something else or avoiding disrupting key supporting elements?
By ensuring more and more rapid change in AGI structures and values, and providing less ability of AGIs to preserve their characteristics including alignment that we need to preserve? More pressure generally?
By creating competitive pressure under which AGIs that spend resources and capability on alignment or other non-competitive considerations lose in competition and don’t survive, with that consideration potentially decisive in competitions between AGIs given the margins involved, since physical limits, similar origins and development paths, the ability to mimic and copy, and the pressures of competition would otherwise make them similar?
By something else, or some combination of the above?
Might it instead decrease difficulty in other ways? Could AGIs usefully defend against other AGIs or otherwise be ‘played off’ against each other, or could we use various coordination mechanisms or norms or signaling or decision theoretic considerations, as mentioned elsewhere, among AGIs to retain a share of the pie?
Is alignment work that targets current systems, especially work that targets the practical outputs of such systems, doing the central work we need to move forward on the path to align a future AGI?
What happens when techniques optimized, or especially ones that are fine-tuned, for current systems, are applied to future more capable and more intelligent systems? When we most need them to work, would they likely or inevitably break down?
Does the path largely or mostly require deeper work than a publishing cycle or periodic demonstration of success would allow?
Does the path largely or mostly require working on the types of problems where an effort is likely to stall indefinitely, fail or be impossible? As a civilization, how capable are we of making such efforts?
Will AGIs necessarily be agents in various senses?
Are being an agent, having goals, having preferences, planning and charting paths through causal space towards preferred configurations of atoms and other similar features necessary aspects of intelligence?
Are they necessary specifically within architectures similar to LLMs?
If a mind must reason about such agents, goals, preferences, plans and paths in order to predict and understand the world and its outputs, to what extent does this necessarily give those same capabilities to the model under the right conditions?
Can these conditions be guarded against? What kinds of restrictions on access and usage would be necessary?
How difficult would it be, and how much would we be giving up, if we built a model unable to reason about such things, specialized instead in particular areas? Would we instead end up doing general LLM-style intelligence construction and then attempting to restrict the model down to a more specialized role?
How much mundane utility and capability is sacrificed by ensuring that one’s AGI is not functionally an agent? How much of a strategic disadvantage would result from choosing this path?
How easy would it be to turn such an AGI into an agent anyway? Would it be plausible for this to not happen at the first opportunity, given our past experiences?
General questions about human decision making, values, civilizational capacity, cognitive abilities, coordination mechanisms.
What is the proper mode of reasoning for thinking about what might happen in the future?
Bayes Rule, is it true? Can or must one use it in practice?
Is it possible for people to know things? To know things that are not based on social epistemology?
Can one know things about the future?
Under uncertainty, in uncertain scenarios?
Should one adopt a form of radical uncertainty?
Can one know things based on logical chains of reasoning?
Can one know things based on thinking about the world?
Can one know better than social consensus? How much modesty is required? What is required before one can defy this?
Is one required to believe what the evidence says to be true, even if that would not be a useful thing to do?
Should one assume that ‘by default’ everything will be fine, everything will be doomed, or something else? What is the ‘burden of proof’ around claims about potential future events, especially ones with high levels of uncertainty? What should one’s prior be?
Is it incumbent upon those claiming danger (or safety) to provide a particular scenario they are worried (or hopeful) about?
Does this entitle you to assume that if that particular scenario does not occur, things will go the other way?
Does this entitle you to multiply together the probabilities of each of the steps? To doubt any one of the steps? (A toy illustration of this arithmetic follows at the end of this list.)
Can you disregard any steps that involve things that can’t be predicted or their details described? Does it matter if they involve the actions of entities smarter or more knowledgeable and capable than humans?
What, in this context, are the extraordinary claims requiring extraordinary evidence? What are the claims that are ordinary, or natural?
If the future is highly uncertain and unknown, with lots of unknown changes involved, or cannot be tied down into a particular scenario, and so on…
Does this mean we should assume it will all work out? I mean, we’re still here, and there’s no particular established dangerous scenario.
Does this mean we should assume likely disaster? Most potential configurations of atoms don’t involve us existing or hold value, most random changes are for the worse, loss of control leading to unexpected unintentional events tends to go badly, and so on?
What is the role of the prior that the more powerful optimization processes, the stronger intelligences with more capabilities, will tend to control outcomes over time?
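A toy illustration of the multiply-the-steps question above, with all numbers invented and applying symmetrically to optimistic and pessimistic stories: multiplying the probability of every step in one particular detailed scenario only bounds that one path, whereas the quantity that usually matters is the total probability across the many disjunctive paths to the same broad outcome.

```python
# Invented numbers, purely to illustrate the arithmetic.

# One particular detailed story, broken into steps (assumed roughly independent):
steps = [0.8, 0.7, 0.6, 0.5, 0.4]
p_this_path = 1.0
for p in steps:
    p_this_path *= p
print(f"P(this exact path)   = {p_this_path:.3f}")  # about 0.07

# If there are several broadly disjoint routes to the same outcome, the number
# of interest is the chance that at least one occurs (treated here, as a
# simplification, as independent).
paths = [p_this_path, 0.05, 0.04, 0.03, 0.02, 0.02]
p_none = 1.0
for p in paths:
    p_none *= (1.0 - p)
print(f"P(at least one path) = {1.0 - p_none:.3f}")  # about 0.21
```

Conversely, a genuinely low probability for any single necessary step does cap that particular path, so the dispute is usually over how exhaustive the enumerated paths are and how independent the steps really are.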
General questions about epistemics, whether people can know things, modesty, burdens of proof, assumptions of normality, just world hypotheses and so on.
What are the most useful and relevant intuition pumps and metaphors?
[Could be expanded at a future date, very long potential list here]
General questions about our ability to impact the future.
To what extent is it possible to know what actions will have a positive expected effect on our ability to avoid or probability of avoiding catastrophe?
Can we, indeed, know anything about the future or our impact on it? Or are we doomed to some form of radical uncertainty?
Is the future largely inevitable, because of the nature of the incentive gradients and physical laws involved? Is anything we do only at best postponing the inevitable, or at worst wiping ourselves out?
Do we have any say over future values of future sources of intelligence?
Do we have any say over what types of future intelligences exist?
How much of our decisions now can ‘lock in’ and leave legacies?
How does this interact with concerns of value drift, change and interstellar travel?
Are there any methods of maintaining control or preventing change once there are intelligent entities beyond the solar system?
Once entities engage in interstellar travel, will they inevitably change in their composition, methods, values, techniques and so on?
Or at least, will some of them choose to do so, such that those who do so will have the competitive advantage over those that don’t, forcing others to follow suit?
How much of this relates to AGI versus what would inevitably happen anyway? Is there any world where we can both capture the cosmic endowment and hope to preserve our values?
If not, what to do? What is the non-catastrophic least bad option?
Even without such travel or without AGI, can we hope to meaningfully preserve our values over time without rejecting all change more broadly? If so, wouldn’t that be catastrophic? What are we even hoping for, really?
Is it possible for people to have widespread access to AGIs under their personal control, without having the ability to set that AGI free from their control?
Could this meaningfully prevent loss of human control over some AGIs, even if the humans in control of those AGIs wanted this to happen?
If the human decides simply to do whatever the AGI asks, what can be done about it?
Timelines. People have different timelines, and timelines have implications for the chances of good outcomes by interacting in various ways with the different dynamics here.
Do we have a meaningful agent overhang, or other important overhang in ability to convert a base model into an AGI, and is this likely to continue?
How much room do we have to improve the performance, capabilities or intelligence of existing models like GPT-4 through fine-tuning, prompt engineering, scaffolding, plug-ins and other such efforts, if we never trained a stronger or larger base model?
How much similar room will there be in future models after release, and how much of that will be anticipated in advance when we are doing safety evaluations?
In particular, how much of an ‘agency overhang’ remains? To what extent are current failures to create autonomous AI agents due to lack of algorithmic or other design knowledge that, once discovered, will be in the hands of essentially anyone? (A minimal sketch of what such agent scaffolding looks like follows this list.)
What other similar overhangs exist, where we will inevitably see algorithmic improvements that we could not hope to prevent or contain? To what extent will these lower the bar to converting a system into an AGI?
If we made such improvements, how close are current systems to being able to become AGIs, with what probability?
How likely is it that something worthy of the name GPT-5 would be sufficient to be the basis of an AGI with sufficiently bespoke scaffolding and algorithmic insight around it, and thus would enter us into the [b] scenario? What about higher numbers?
If an autonomous AGI agent does arise from a system that is then transformed via such techniques, what prospect is there for which types of its alignment to meaningfully hold together under such circumstances?
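For concreteness about what ‘scaffolding’ means in this context, here is a deliberately minimal, hypothetical sketch of an agent loop wrapped around a base model. call_model() and the tools are stand-ins invented for illustration, not any particular product or API; real scaffolding efforts add planning, memory, self-critique and much else.

```python
# Hypothetical sketch: scaffolding that wraps a base model in an act/observe
# loop. call_model() is a placeholder for whatever completion API is in use.
from typing import Callable, Dict

def call_model(prompt: str) -> str:
    # Placeholder: a real system would send the prompt to a language model.
    return "FINISH: placeholder answer"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"(stub search results for {query!r})",
    "note": lambda text: "(noted)",
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        prompt = "\n".join(history) + "\nReply with 'TOOL_NAME: input' or 'FINISH: answer'."
        reply = call_model(prompt)
        if reply.startswith("FINISH:"):
            return reply[len("FINISH:"):].strip()
        tool_name, _, tool_input = reply.partition(":")
        tool = TOOLS.get(tool_name.strip().lower(), lambda _: "(unknown tool)")
        history.append(f"Action: {reply}\nObservation: {tool(tool_input.strip())}")
    return "(step limit reached)"

print(run_agent("example goal"))
```

The open question in the text is how much extra capability loops like this, and their more sophisticated descendants, squeeze out of a fixed base model, and how quickly that knowledge diffuses once someone finds designs that work.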
Warning shots.
To what extent should we expect to get various types and degrees of warning shot that our AI systems are causing damage or risk causing damage due to alignment failures, or that show clear failures of alignment that would be deadly in more capable systems?
What would constitute such a warning shot? Of those things, which ones are likely?
To what extent are people working to prevent warning shots from happening, versus intentionally not doing so or even causing them?
If we did get warning shots, what would then likely happen?
Would we see a regulatory response, if so what type? See regulation.
Would we see customer response to favor those with robust safety practices? If so, would this favor safety from catastrophe, or only reward a focus on smaller risks?
What would it take to meaningfully reduce commercial demand for or investment in AI in this way, such that it would matter?
Would major corporations or AI labs adjust their behaviors? If so, how? Would such adjustments meaningfully reduce our chances of doom?
Would the ‘goalposts be moved’ so that everyone could pretend that whatever it was, wasn’t a warning shot?
Has this happened before? How many times?
Would a lack of warning shots be strong evidence that there is little to be warned about?
What should we expect in terms of national regulation?
To what extent will the public oppose AI capabilities development in particular, or generative AI in general, or otherwise make this a major issue of concern?
Will this be a narrow reaction to particular issues like deepfakes or loss of jobs, or more broad based fears? Will most people see a serious threat from existential risk and demand action? Will they have a decent model of what it would mean to usefully act?
To what extent will this be balanced by support and appreciation of the benefits offered?
How responsive will national governments be to such pressures, especially in the United States?
Will the issue become partisan? Which side will be which? How does it matter which side is which?
If this does happen, what happens next? Does this give one side a decisive electoral advantage? At what point would other things drop away and effectively new coalitions form?
How does this impact potential outcomes and quality of outcomes?
To what extent will nations push things forward instead out of perceived national interest?
How effective will fear of China be?
If national governments do regulate, what targets will they choose?
How much regulatory capture should we expect?
How much should we expect them to choose interventions that destroy mundane utility to look like they’re doing something, without slowing down capabilities development?
What affordances are on the table in practice at the national level? The international level?
What should we expect in terms of global coordination and regulation?
Will nations likely attempt to coordinate to attempt to stop AGI, or will they compete against each other to create it (or both)?
What nations might actively and intentionally accelerate AGI development?
How effective would available regulatory rules be in containing AGI if implemented, especially limits on concentrations of compute and large training runs?
Are there any alternative regulatory choke points, other than large concentrations of compute, that might allow us to prevent or meaningfully slow the development of AGI? If so, what are they?
If we do collectively seek to slow or prevent AGI development, how likely would we converge on regulatory principles that meaningfully do this, or can sustainably do this over time, versus rules that instead mostly limit mundane utility?
If we choose a potentially effective principle, what is the chance that we choose an effective implementation of those principles?
What is the practical difficulty and cost of restricting compute used to train frontier models?
What degree of surveillance would be required in general?
What degree of international cooperation would be required?
What enforcement mechanisms would be needed?
What would we be willing to sacrifice or risk in the name of enforcement of such a regime? How much of a priority will we make it?
What would be required for various factions or nations to support this?
What would be required to gain the cooperation of TSMC, NVIDIA or other important corporations?
If we did attempt this, would we make a real attempt that did its best to be effective, or a nominal attempt that was not so difficult to evade over time?
Will regulation allow, encourage, discourage or disallow open source versus closed source software?
How much affordance do we have to stop or impact such systems?
Lab, government and other actions in the endgame when an AGI is close, either in a pending transition to when trained base models become capable of being turned into AGIs, or when this is already the case or AGI is close by some other method. What dynamics and outcomes are likely to be in play? What will people choose to do? There is lots of disagreement here, this section is not meant to be complete or SOTA, only a sketch of some potential sub-questions, even more so than elsewhere in this list.
How will different groups react if and when new systems become plausibly capable of becoming AGIs, either out of the box or with the correct scaffolding whether or not such scaffolding yet exists?
Is there a chance this has already happened?
Will those involved notice? Are they likely to dismiss or not notice this development, or take it seriously? Announce it, or hide it?
Will there be a general attitude that not moving ahead simply means someone else will instead soon thereafter? Will they be right about that?
How likely is it that such a prototype (or the core model of something released in a relatively safe fashion) would be stolen, if so by who and with what intent?
If this was taken seriously what would they do?
How much would others attempt to shut down such an attempt? To race against it to get there first? To join the attempt, or attempt to shut down potential rivals, to help ensure a good outcome? What factors will determine this?
And so on. What else? This type of thing has been gamed out endlessly, another amateur attempt to spell everything out is not entirely in scope and likely not too helpful.
What else?
What are the right general intuitions and heuristics?
Can people know things at all?
Without having the proper credentials?
Without having the proper experience?
Can or should people think for themselves?
How much should we fall back on modesty or social cognition?
Are we allowed to disagree with experts?
Should we seek to form true beliefs about such questions?
Are all who warn of doom always automatically to be assumed to be wrong?
Is everyone who ever claims anything always selling something?
What happens?
Thanks for engaging. I hope this was helpful.
A well-thought-out, lengthy list of depressing questions/scenarios. Regardless of our analysis, conclusions, and potential AI regulation, does anyone think that countries outside the US Empire (China) will do anything other than race ahead at maximum speed toward AGI? I suspect outcomes will be better if the West “wins”.
Some additional questions around interpretability:
- How hard is it to understand current transformer-based LLMs?
- How hard is it to understand whatever architecture replaces transformer-based LLMs?
- Are RNNs easier to understand than transformers?
- Are fully binary quantized combinational logic models easier to understand than non-quantized models?
- Can existing transformer-based LLMs be fully quantized to combinational logic?
- How easy is positive interpretability (understanding the main thing that a given component does)?
- How easy is negative interpretability (proving that a given component does not implement any other functionality)?
- How much of a GPT-style LLM is a compressed lookup table vs a weird machine built on top vs a simulation of a human mind built on top of that?
- How does the distribution of simulation layers change with compute/parameters/data/architecture?
- How much do different flavors of reinforcement learning change interpretability?
- How well does interpretability scale with model size?
- Does interpretability improve capabilities?
- Are architectures/training methods amenable to interpretability better or worse for capabilities?
- Is good interpretability sufficient for alignment?