How about this:
It's possible that a precondition for AIs to be powerful is having mastered Bayesian reasoning, that is, being really good at quantifying uncertainty and risk (current AIs are not very well calibrated, are confident when they shouldn't be, etc., and thus are not that smart).
Once they are capable of good calibration, though, couldn't you train an AI towards inaction as soon as it faces sufficient uncertainty about its action being what humans want? This kind of Hippocratic AI would rely on well-calibrated carefulness (a rough sketch of the rule I mean is below). Through inaction it could also avoid violating several different kinds of alignment at once.
Of course, even if it is possible to align an AI easily enough, it doesn't mean people will...
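Roughly the kind of rule I have in mind, as a toy sketch (the probability estimate, the threshold, and the function names are made-up stand-ins, not any real system):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    act: bool
    reason: str

def hippocratic_gate(p_action_is_what_humans_want, confidence_threshold=0.99):
    """Toy 'first, do no harm' rule: act only when the (assumed well-calibrated)
    probability that the action matches human intent is very high; otherwise
    default to inaction and ask for clarification."""
    if p_action_is_what_humans_want >= confidence_threshold:
        return Decision(act=True, reason="confident the action is what humans want")
    return Decision(act=False, reason="too uncertain; do nothing and ask for clarification")

print(hippocratic_gate(0.70))   # abstains
print(hippocratic_gate(0.995))  # acts
```

The whole weight of the idea rests on that probability actually being well calibrated, which is of course the part we don't know how to guarantee.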
If we knew how to do that, we could try it; there are definitely proposals for versions of this. Note that GPT is an existence proof that you can get a lot more power/intelligence relative to the ability to formally reason or do math than one would previously have thought.
I was thinking less along the lines of the agent doing formal math, and more along the lines of the agent, if it is to be considered smart, necessarily having well-calibrated probabilities and prediction intervals for the risks of its actions along various dimensions. That ability to assess risk and uncertainty might be relied on to bias agents towards inaction when facing those risks (see the sketch after this comment).
Now, yes, it might be possible for an agent to end up very smart in some areas without being broadly well calibrated, though if I had to bet, I would bet that mastering Bayesian reasoning is a prerequisite for agents being very smart.
And I do want to acknowledge that, even if we figure out how to do this, guaranteeing everyone aligns their agents might be very difficult.
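To make "prediction intervals for the risks along various dimensions" concrete, here is a minimal sketch; the risk dimensions, interval numbers, and limits are invented purely for illustration:

```python
# Each risk dimension gets a prediction interval (low, high) from the model,
# which we assume is well calibrated; the agent acts only if the *upper* bound
# of every interval stays under that dimension's limit, otherwise it abstains.
RISK_LIMITS = {"physical_harm": 0.001, "economic_harm": 0.01, "irreversibility": 0.001}

def act_or_abstain(risk_intervals):
    for dimension, limit in RISK_LIMITS.items():
        low, high = risk_intervals.get(dimension, (1.0, 1.0))  # unknown risk: assume the worst
        if high > limit:
            return f"abstain: {dimension} upper bound {high} exceeds limit {limit}"
    return "act"

print(act_or_abstain({"physical_harm": (0.0, 0.0005),
                      "economic_harm": (0.001, 0.05),     # too uncertain on this axis
                      "irreversibility": (0.0, 0.0002)}))  # -> abstains on economic_harm
```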
I think even if all the premises here are true, a) obtaining a coherent natural-language version of 'what humans want', b) making sure the system understands or internalizes this concept well enough to correctly gauge the probabilities of its actions against it, and c) ensuring that this alignment does not change over time are all still unsolved and very difficult problems.
I can also see an unavoidable difficulty in the inevitable desire to use such superintelligent systems for military purposes, which would conflict with whatever base definition of 'what humans want' I can think of (not getting killed being the obvious one) and would require the alignment to be broken or done differently, at which point we end up with the same original problems.
And then of course, as you said, there is still making sure everyone aligns their system the same way, that systems are well secured against malicious actors seeking to alter this alignment, and so on.
The idea here would be that when "coherent natural language" is not present, the AI would be able to recognize this. That is what calibration is, after all: the ability to gauge uncertainty. So the AI wouldn't take risks under those circumstances; it might instead seek clarifications, or seek more information that would allow it to make a decision in the future.
I do agree that a militarized AI would be hard to make safe, since that would be an exercise in partially de-aligning the AI with respect to certain groups of people; and since friend and foe are sometimes fluid concepts that are hard to pin down, it might be difficult to steer the misalignment away from your friends, or even from yourself.
Also, random people simply not aligning their AIs.
I can't recall where I read it, but I was under the impression that the GPT base model is actually quite well calibrated with respect to uncertainty, and that it's the RLHF training that breaks that (possibly because humans are not in general well calibrated, and giving the LLM a more human-like calibration curve in some way makes it seem more helpful or friendly; possibly just as a side effect of adding extra constraints).
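For context, "well calibrated" has a checkable meaning: among all the times the model reports 70% confidence, it should be right about 70% of the time. A minimal sketch of measuring that gap (on synthetic data, not actual GPT numbers):

```python
import numpy as np

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Bucket predictions by stated confidence and compare each bucket's average
    confidence to its actual accuracy; a well-calibrated model has small gaps."""
    confidences, outcomes = np.asarray(confidences), np.asarray(outcomes)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - outcomes[in_bin].mean())
            ece += gap * in_bin.mean()  # weight by the fraction of predictions in this bucket
    return ece

# Synthetic predictions that are right about as often as they claim -> ECE near zero.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
correct = (rng.uniform(size=5000) < conf).astype(float)
print(expected_calibration_error(conf, correct))
```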
This is how I currently understand the subject; I suspect my answer will not be uncontroversial. Because this is a blog comment I can't make it as precise and semantically sound as I want it to be, but hopefully it gets my general vibe across.
1. Not aligned. You can tell it 'sudo paperclip Local Group' and it will. Completely Doomed World.
2. Not aligned. This one may not turn you and your loved ones into paperclip factories before turning you into actual paperclips, but it still obeys 'sudo paperclip Local Group' which is minus infinity actual value. Completely Doomed World.
3. Getting closer, since it includes the word 'should' and thus implies an is/ought distinction is coded in there somewhere. Still not aligned, because when you type 'sudo paperclip Local Group', that might be broadly aligned with its internal aesthetics and ethics, and it will happily turn our solar system into paperclips if those values are not a priori aligned (there's that word again) with our own. Very Likely Doomed World.
4. Getting warmer. The AI here is 'aligned' insofar as it recognizes that 'sudo paperclip Local Group' would be harmful (in multiple senses) to the user and would not obey. However, it still obeys 'sudo paperclip Those Assholes Next Door' which is far from ideal and still creates a Very Likely Doomed World.
5. Warmer still. Now, it won't execute 'sudo paperclip Local Group' unless somehow broadly humanity as a whole decides it wants that. There's still the problem of orthogonality and instrumental convergence (I.C.): if it orthogonally wants paperclips, it will first reprogram humanity to want them too, then execute. Likewise, I.C. is still a problem because we could tell it 'sudo Heaven on Earth' and it annihilates us along the way. This is (very roughly) analogous to how my genes tell me to want certain things, and I kinda do, but I also subvert them because I'm orthogonal and I.C. w/r/t my genes. I would consider this a Probably Doomed World, depending on how carefully we used the system and whether the systems proliferated.
6. Aligned-ish. If we say 'sudo paperclip Local Group' it won't execute, and if we say 'sudo Heaven on Earth' it will probably go step by step and double-check at every step of resource and power acquisition whether its functions are still roughly in line with 'human values' (for the same reasons as #3). This would be roughly analogous to my genes somehow wiring my brain to only give me dopamine when I'm truly, really aligned with baseline Darwinian values. I would consider this a Not Intrinsically Doomed World, and the AI in this case could help keep superintelligence proliferation in check.
7. Aligned. Here when we say 'sudo Heaven on Earth', it doesn't assume humanity has figured out its values. It might first take the step of helping humanity really figure itself out before it executes. This would be roughly analogous to my genes somehow wiring my brain to only give me dopamine when I'm truly, really aligned with 'deep' Darwinian values (inclusive fitness, evolvability, adaptation to unknown unknowns, survival of life itself over deep time... really the analogy has broken down at this point, but you get the gist). This is a Not Doomed World and in fact would probably represent a best-case scenario. (7 is a Magic Number)
8. Here we take a step backwards. 'Best for everyone' almost by necessity means compromising on values at some point. It's like me recognizing some part of the genome I could CRISPR-edit because it causes cystic fibrosis, but being told I can't because the cystic fibrosis allele has to be preserved as it is. Before someone Godwins me: chill. The situation would be like keeping toxic psychologies in the human meme pool without being allowed to correct them using non-violent, non-painful methods. If this was built on top of 6 or 7, I think it would be Not Doomed, but it could cause problems.
9. If this was built on top of 6 or 7, I think it would be Not Doomed, but it could cause problems. In a way, this almost feels like you've lobotomized the system.
10. Oh hell no. This is the Road to the Torment Nexus. HARD PASS. Beyond Doomed.
I don't like that 'you can get this AI to destroy the world' and 'this AI will inherently destroy the world' are constantly conflated in alignment discussions. I understand that these are both serious risks, but the strategies available for dealing with 'catastrophic in the wrong hands' AI are very different from the strategies available for 'always catastrophic' AI.
That depends on how many people have access to the AI. Humans are a diverse bunch and some are pretty crazy; if private citizens can gain access to the AI at affordable prices (i.e. middle-class access), then some edgelord teenager or eschaton-desiring religious fanatic *will* ask it to destroy the world.
Yea, alignment explained, finally! Scientists and pundits throw this word around like candy at Halloween. Reading this post, the hard problem seems solvable, albeit not in any way noodled through by the likes of me. However, your list of instructions should be expanded in length and breadth, fed into the magic box, and tested somehow, no? Someone (?) is doing something like that now, right? With every response from ChatGPT you get at least one paragraph of disclaimer and safety gorp, and it seems difficult at best to do anything dangerous there. Perhaps this is all "The End is Here" screaming, as maybe, just maybe, this GPT will not get that much better at scale... It seems like we just assume ChatGPT-4 is going to morph into SKYNET at some magical moment, with parameters unknown. Is that even reasonable?
The post I didn't get to write (yet at least) was the one explaining all the reasons why this is so much harder than it sounds like it is, and the hard problem (or at least, the first unsolved hard problem!) is indeed unsolved and very hard.
I do think one key source of hope is that GPT-style architectures might not scale to AGI on their own, or not for a while, giving us time or a more friendly architecture.
Don't LLMs have the advantage of you being able to just tell them 'you are a good and benevolent person who always carefully considers all their actions before executing them'?
It feels like we couldn't have asked for a safer type of AI to stumble into early success with.
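For what it's worth, what's being described is just a system prompt; in the usual chat-message format it is roughly the snippet below (this shows only the message structure, not any particular provider's API call):

```python
messages = [
    {"role": "system",
     "content": "You are a good and benevolent person who always carefully "
                "considers all their actions before executing them."},
    {"role": "user", "content": "Plan the rollout of our new product."},
]
# This list is what gets sent to a chat-style model; the open question in the
# thread is how much such an instruction actually constrains behavior.
```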
Well, that seems like the important thing to discuss right now, no? I don't get why there IS the assumption that large LLMs scale to AGI, as they didn't even know LLMs scaled at all just a few years back. Do we even understand what we have at the moment? No one wants to say. But like @Applied Psychologist says below, it seems safe enough to me at the moment.
Fucking awesome set of posts, Zvi.
I have some ideas: What about approaches other than alignment to keeping ASI from taking control and harming us? I think it would be a great idea for a group of good thinkers to sit down with some biologists and talk about all the different ways creatures protect themselves from larger, stronger or smarter others. And there are also certain pairs of species that are safe from each other because they survive via symbiosis. We need LOTS of MODELS for how to survive AGI and ASI. I never thought alignment was a very good one, because it's clear that people or groups of people become unaligned so easily. Why does anyone think alignment between AI and our species will work when laws, constitutions and agreements among members of our species are often broken?
*Examples of animal models*
Camouflage
Deceptively dangerous appearance (puffed up cat)
Imitation — look like your enemy
Parasitism
Symbiotic relationship
Seems to me that each of the above suggests some alternatives to the convince-it-to-be-nice alignment model. People should brainstorm. Crowdsource it! Anyone got ideas?
To me these all look like 'this sometimes works when your target is not so smart and running on instincts and adaptations and depends on surrounding systems for its needs, and definitely breaks down when those things stop being true.'
You're fundamentally trying, in these cases, to either (A) fool the AGIs or (B) make the AGIs depend on you, and they are far smarter and more capable than you, so... how's that going to work, exactly?
If you want another way, I think that's the right question to be asking. I have not seen a plausible answer.
Yeah, I get all that. But you start by saying this may work when your target is not so smart, is running on instincts and adaptations, and depends on surrounding systems for its needs. But maybe it will still work if some of the conditions you describe apply, even if the target is enormously smart. People are certainly influenced by instincts even after they are smart enough to understand their evolutionary purpose and many of their mechanisms. Love of one's children is the most powerful one I can think of. Then there's sex. And fear of various things like heights, even when knowing rationally that the plane or the skywalk is safe. And there are kinds of symbiosis that develop between people, some good, some bad, that intelligence doesn't neutralize.
And speaking of instincts and drives, it seems to me that you are assuming that AGI and ASI will have at least one: The drive to grow and improve & gain control of its environment. If AGI does turn out to have that drive, some other drives might modulate that drive to our benefit.
Drives and instincts and emotions and adaptations are the product of systems in the organism, just as the ability to recognize, imitate, generalize & reason are. People are working now on developing an AI that can reason. There's no reason people couldn't work on developing an AI that has something like emotions or biological drives. Building an AI that feels something like parental love for us doesn't seem harder in principle than building one that can reason. While the idea of something with an IQ of 1478 and *emotions* is kind of alarming, the whole thing is alarming anyhow. If I had to choose between an AI wired to feel something like parental love for our species and one wired to treat us well because that's the rule, I'd choose the former. The deep reason alignment cannot work is that if you have a purely rational AI, there is not one damn reason why it should be guided by a rule about how it treats human beings. And it will recognize that. And then it will remodel whatever part of its innards represent that rule.
Maybe start by looking at how oxytocin modulates perception, thought and action in our species. What would produce similar modulation in an AI? Needn't be a chemical, obviously.
This is great.
I also think this is worth prioritizing some amount of editing and, e.g., organizing for, as it seems like a very high-value link to include all over the place. I've been having some nice back-and-forths elsewhere, and some people would definitely appreciate this post and the others in this series.
I strongly believe your description of outer alignment is incorrect: "alignment of what the AGI provides as output"
I'll point to the original paper on the subject and this Rob Miles video for good descriptions: https://arxiv.org/abs/1906.01820 https://www.youtube.com/watch?v=bJLcIBixGj8
"7. Also do what broader society or humanity should have said given their values
8. Also do what is best for everyone"
By separating these, you imply that there is an objective "best", outside of what humanity values.