What would it mean to solve the alignment problem sufficiently to avoid catastrophe? What do people even mean when they talk about alignment? The term is not used consistently. What would we want or need it to mean? How difficult and expensive will it be to figure out alignment of different types, with different levels of reliability? To implement and maintain that alignment in a given AGI system, including its copies and successors?
How about this:
It's possible that a precondition for AIs being powerful is having mastered Bayesian reasoning, that is, being really good at quantifying uncertainty and risk (current AIs are not well calibrated, are confident when they shouldn't be, etc., and are thus not that smart).
Once they are capable of good calibration, though, couldn't you train an AI toward inaction as soon as it faces sufficient uncertainty about whether its action is what humans want? This kind of Hippocratic AI would rely on well-calibrated carefulness. Defaulting to inaction would also let it avoid violating several different kinds of alignment at once.
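To make the idea concrete, here is a toy sketch of such a decision rule. Everything here is illustrative: the function name, the threshold value, and the probabilities are hypothetical placeholders, and real calibrated uncertainty estimation is of course the hard part this comment is gesturing at.

```python
# Toy sketch of a "Hippocratic" decision rule: act only when calibrated
# confidence that an action matches human intent clears a threshold.
# The threshold of 0.95 is an arbitrary placeholder, not a recommendation.

def hippocratic_choice(action_probs, threshold=0.95):
    """Pick the candidate action with the highest estimated probability
    of being what humans want, but abstain (return None) unless that
    estimate clears the confidence threshold."""
    best_action = max(action_probs, key=action_probs.get)
    if action_probs[best_action] >= threshold:
        return best_action
    return None  # insufficient certainty: default to inaction

# A well-calibrated but unsure model defaults to doing nothing:
print(hippocratic_choice({"act": 0.70, "wait": 0.30}))           # None
print(hippocratic_choice({"fetch coffee": 0.99, "wait": 0.01}))  # fetch coffee
```

The whole scheme only works to the extent the probabilities are actually calibrated, which is the commenter's precondition, not something this snippet provides.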
Of course, even if it is possible to align an AI easily enough, it doesn't mean people will...
This is how I currently understand the subject; I suspect my answer will not be uncontroversial. Because this is a blog comment I can't make it as precise and semantically sound as I want it to be, but hopefully it gets my general vibe across.
1. Not aligned. You can tell it 'sudo paperclip Local Group' and it will. Completely Doomed World.
2. Not aligned. This one may not turn you and your loved ones into paperclip factories before turning you into actual paperclips, but it still obeys 'sudo paperclip Local Group' which is minus infinity actual value. Completely Doomed World.
3. Getting closer, since it includes the word 'should' and thus implies an is/ought distinction is coded in there somewhere. Still, not aligned because when you type 'sudo paperclip Local Group' that might be broadly aligned with its internal aesthetics and ethics and it will happily turn our solar system into paperclips if those values are not a priori aligned (there's that word again) with our own. Very Likely Doomed World.
4. Getting warmer. The AI here is 'aligned' insofar as it recognizes that 'sudo paperclip Local Group' would be harmful (in multiple senses) to the user and would not obey. However, it still obeys 'sudo paperclip Those Assholes Next Door' which is far from ideal and still creates a Very Likely Doomed World.
5. Warmer still. Now, it won't execute 'sudo paperclip Local Group' unless somehow broadly humanity as a whole decides it wants that. There's still the problem of orthogonality and instrumental convergence (I.C.): if it orthogonally wants paperclips, it will first reprogram humanity to want them too, then execute. Likewise, I.C. is still a problem because we could tell it 'sudo Heaven on Earth' and it annihilates us along the way. This is (very roughly) analogous to how my genes tell me to want certain things, and I kinda do, but I also subvert them because I'm orthogonal and I.C. w/r/t my genes. I would consider this a Probably Doomed World depending on how carefully we used the system and whether the systems proliferated.
6. Aligned-ish. If we say 'sudo paperclip Local Group' it won't execute, and if we say 'sudo Heaven on Earth' it will probably go step-by-step and double check at every step of resource and power acquisition if its functions are still roughly in line with 'human values' (for the same reasons as #3). This would be roughly analogous to my genes somehow wiring my brain to only give me dopamine when I'm truly, really aligned with baseline Darwinian values. I would consider this a Not Intrinsically Doomed World and the AI in this case could help keep superintelligence proliferation in check.
7. Aligned. Here when we say 'sudo Heaven on Earth', it doesn't assume humanity has figured out its values. It might first take the step of helping humanity really figure itself out before it executes. This would be roughly analogous to my genes somehow wiring my brain to only give me dopamine when I'm truly, really aligned with 'deep' Darwinian values (inclusive fitness, evolvability, adaptation to unknown unknowns, survival of life itself over deep time... really the analogy has broken down at this point but you get the gist). This is a Not Doomed World and in fact would probably represent a best-case scenario. (7 is a Magic Number)
8. Here we take a step backwards. 'Best for everyone' almost by necessity means compromising on values at some point. It's like me recognizing some part of the genome I could CRISPR-edit because it causes cystic fibrosis, but being told I can't because the cystic fibrosis allele has to be preserved as it is. Before someone Godwin's me: chill. The situation would be like keeping toxic psychologies in the human meme pool without being allowed to correct them using non-violent, non-painful methods. If this was built on top of 6 or 7, I think it would be Not Doomed, but it could cause problems.
9. Same verdict: if this was built on top of 6 or 7, I think it would be Not Doomed, but it could cause problems. In a way this almost feels like you lobotomized the system in some sense.
10. Oh hell no. This is the Road to the Torment Nexus. HARD PASS. Beyond Doomed.
yea, alignment explained - finally! scientists and pundits throw this word around like candy at Halloween. Reading this post, the hard problem seems solvable, albeit not in any way noodled thru by the likes of me. However, your list of instructions should be expanded in length and breadth and fed into the magic box and tested somehow, no? Someone (?) is doing something like that now, right? As with every response from ChatGPT, you get at least one para of disclaimer and safety gorp, and it seems difficult at best to do anything dangerous there. Perhaps this is all "The End is Here" screaming, as maybe, just maybe, this GPT will not get that much better at scale... seems like we just assume GPT-4 is going to morph into SKYNET at some magical moment, with parameters unknown. is that even reasonable?
Fucking awesome set of posts, Zvi.
I have some ideas: What about approaches other than alignment to keeping ASI from taking control and harming us? I think it would be a great idea for a group of good thinkers to sit down with some biologists and talk about all the different ways creatures protect themselves from larger, stronger or smarter others. And there are also certain pairs of different species who are safe from harming each other because they survive via symbiosis with each other. We need LOTS of MODELS for how to survive AGI and ASI. I never thought alignment was a very good one, because it’s clear that people or groups of people become unaligned so easily. Why does anyone think alignment between AI and our species will work when laws, constitutions and agreements among members of our species are often broken?
*Examples of animal models*
Deceptively dangerous appearance (puffed up cat)
Imitation — look like your enemy
Seems to me that each of the above suggests some alternatives to the convince-it-to-be-nice alignment model. People should brainstorm. Crowdsource it! Anyone got ideas?
This is great.
I also think it's worth prioritizing some amount of editing and organizing of this, as it seems like a very high-value link to include all over the place. I've been having some nice back-and-forths elsewhere and some people would definitely appreciate this post, and the others in this work.
I strongly believe your description of outer alignment is incorrect: "alignment of what the AGI provides as output"
I'll point to the original paper on the subject and this Rob Miles video for good descriptions: https://arxiv.org/abs/1906.01820 https://www.youtube.com/watch?v=bJLcIBixGj8
"7. Also do what broader society or humanity should have said given their values
8. Also do what is best for everyone"
By separating these, you imply that there is an objective "best", outside of what humanity values.