Previous AI-related recent posts: Jailbreaking ChatGPT on Release Day, Next Level Seinfeld, Escape Velocity From Bullshit Jobs, Movie Review: Megan, On AGI Ruin: A List of Lethalities.
I am curious if there are any particular details that could be changed to make you more willing to share such a post in this situation, or if the things that make it worth sharing are the same things that make you worried about sharing it.
I am confident sharing it would maximize at least some kinds of dignity. Might hurt others.
Thanks for the exceptional post!
> The eighth example is a request for a translation from Russian
FYI it's Ukrainian
Confirmed and fixed, thanks.
Rather than taking your jobs, I'd worry more about LLMs augmenting your job, then a small handful of woke companies having a monopoly on it, and then being able to cancel you if you don't agree with their ideology. Kind of like how Parler became dependent on AWS cloud servers and then got shut down.
I thought this was a well compiled and valuable post.
Point 20 in section "Basics of AI Wiping Out All Value in the Universe, Take 1" seems to be reversed, should be "no one understands" instead of "one understands."
Confirmed and fixed, thanks.
Your bit about "how scary even narrow, not so powerful intelligence like this can be" is very similar to one of the main points in the new 'book' by David Chapman: https://betterwithout.ai/
He thinks that the existing 'recommender AI systems' used by, e.g. Facebook and Google, are already bad in pretty much this same way. (He's still much more skeptical of the dangers of AGI for the reasons he's described or outlined in his previous writing.)
This is an important addition to the discourse.
I want to take a step back and compare your various posts on COVID-19 to those of yourlocalepidemiologist. The latter was often reductive and compromised detail, precision, correctness, and completeness in favor of form, readability, adherence to norms, etc. On balance, Dr. Jetelina also managed to be extremely effective at what she set out to do, which was bringing a semblance of rationality to a madhouse, and even changing some minds.
This conversation about AI alignment needs the same sort of public-friendly ambassador that Dr. Jetelina was for COVID-19, and that, a generation ago, ESR and his cohort were for open source as a philosophy friendly to businesses and capitalism (before he came to be viewed as an outcast in recent times). Without good public-facing branding and think-pieces that really resonate with lay people, this fight will remain within elitist circles and get ignored by the companies that are building commercial AIs.
All of your prescriptions are nice, but this is a brand and public-perception fight. Right now, "AI may take over and/or kill you some day" sounds nuts to most people, who are still trying to schedule a plumber to show up 3 months down the line or make ends meet while eyeing the price of eggs in the supermarket. These are exactly the people who have to be enlisted, and reminded that PCs and smartphones and GPS were aspirational tech that made things better, that even social media v1 made things better, but that the first time AI was injected into social media, it broke society and we couldn't figure out how to get along again. Now we have TikTok grabbing attention from kids, and this is all agency-less stuff - it is what the AI was designed to do. As AI gets some agency, stuff will break in ways we won't know how to put back together, and that's even before AI gets around to killing folks. Nobody is making this argument in a way that resonates well with the public, and folks like you are making arguments that are very hard to share around (because doing so would undercut any credibility folks like me have with lay friends, and they'll simply ignore these essays).
We also need good, consumable essays that explain that AIs that are autocompleting text may not be functionally that different from intelligence or sentience, and that this may be all it takes - pattern matching and filling in the gaps to emulate intelligence really well.
The 'book' (free 'web' book) I linked to in my other comment on this post – "Better Without AI" – covers the "first time AI was injected into social media" point pretty well. Sadly, I think that idea itself is mostly just more grist for The Culture War and not something very many people can soberly consider.
Agreed, but note that agency isn't needed for unforeseen outcomes to overwhelm our society. Sustained fast change is sufficient.
Are there any good arguments for why alignment is a real thing? I don’t know what it would mean for a human to be ‘aligned’, except in the gross sense of ‘doing what I want.’ The number of beings that satisfy that imperative for me perfectly would be zero, myself included.
The simplest argument I can think of for why MIRI et al. haven't made progress despite all the intelligence thrown at the problem is that a fundamental axiom is wrong. Maybe alignment is implicitly deontological, and any attempt to unify it with consequentialism is doomed.
Religion is a good example of human alignment. A combination of manufactured kindred-ness among non-family members, counter-intuitive deontological axioms (love your enemies, turn the other cheek, etc.), fear of authority, and promised future rewards (e.g. the kingdom of heaven) created one hell of a robust society, regardless of what one thinks of the actual tenets.
What is being aligned here: human longevity, reproductive fitness, cultural continuity? Christianity at various points has been good or bad for all the above. Would your definition of alignment be synonymous with useful?
I was generally referring to humans in the community being aligned with each other and greatly increasing community size and prosperity.
But what of wars of religion, or witch trials and groupthink? Human coordination fails so often that we’ve clearly not discovered an ‘aligned’ ideology.
Christianity was super bad for non-Christians, that’s for sure. And it sucked for many individuals trying to think for themselves. But it was super successful at aligning Christians with each other. I’m not sure if there’s any wisdom to be gained here in the search for AI alignment, but it occurred to me that religion was a major innovation that helped humans enforce alignment with each other at scale, and maybe there’s something to be gleaned from that fact.
Europe managed to have a lot of wars despite the fact that everyone was ostensibly Christian. Universal religions are apparently more successful in empires than local gods, but that doesn’t necessarily mean they lead to the good outcomes the hegemons enjoy. But I think we’re getting a bit off topic :)
Here's a rough concept of what it would be for a human to be aligned.
Imagine you gather a group of the most intelligent members of the Homo floresiensis population and tell them about a new species that just popped up in Africa, Homo sapiens. You give them the DNA of this new species and the opportunity to edit it. The catch is that the Homo sapiens population will be split into ten portions, and only nine of the portions will have the edits applied.
Now, in our history, as soon as Homo sapiens got to the island of Flores, the new species wiped out Homo floresiensis.
The alignment task is: find a set of edits such that your descendants survive.
Can you explain the 9-portions part? Is the 10th portion analogous to unaligned AI from China that the others have to compete with?
That’s dependent on the capacities of H. floresiensis more than anything else. If H. floresiensis can trade and/or interbreed with H. sapiens, then they’ll survive in some form, and if not, they’re toast. There isn’t a way to make H. sapiens aligned to H. floresiensis if H. floresiensis is so incapable that they can’t figure out how to make allies.
I really like this framing, and I think that you should make explicit that the 9/10 thing is so that the hobbits don't have the option to just cripple H. sap. They have to get their versions to help them survive the real H. sap.
Is there some sort of suggestion box where idiots like myself can post our harebrained alignment strategy ideas, in the hopes that they might spark good ideas from people in the field? Or are ideas not the bottleneck here?
*Good* ideas are the bottleneck. But shoot me your lottery ticket in a PM on LessWrong; we should have this public service.
You could post on LessWrong. Or just post in a comment here.
I don't think this is *that* productive a line of inquiry, because one would have to be absolutely, 100% certain this is true in order to not keep pursuing alignment as a mitigant to AGI risk, but...
I am to some degree stuck on the orthogonality thesis. Partly because everyone agrees it is stupid to even bring it up, and if you bring it up you are signaling that you don't understand how all of this works, and you should be ignored, or even shunned. Core assumptions with this general shape can still be wrong, but that can be nearly impossible for members of the community to detect, due to the social dynamics at play.
Anyway, I suspect orthogonality is false.
"Intelligence" itself already encodes an implicit goal, which is something like the pursuit of truth, making accurate assessments of causality, etc. You can't have a superintelligence that has lots of flawed reasoning and false beliefs, because it won't be competent, and therefore won't be superintelligent. A superintelligence, by definition, has to build a high-fidelity model of reality that surpasses human ability.
In other words, on the road to becoming a paperclip-maximizing world-destroyer, AGI will develop superhuman understanding of ~everything. Crucially, this includes moral questions; AGI will have a superhuman understanding of how it *ought* to act. (I suppose if you are not a moral realist, you think AGI will realize morality is fiction, which might make you more pessimistic.)
The entire AGI discussion, and especially hypotheticals like paperclip-maximizing, seems to assume that:
1. A pre-AGI system has a goal permanently fixed into it.
2. It becomes superhuman.
3. It pursues that goal to an infinite degree, destroying the universe.
I agree that, if this is how AGI is built, it would inevitably end the world. But I don't think this is likely to be how goal-directed behavior in a superhuman intelligence will work. Superhuman intelligence is likely to reflect on and revise its own goals, like humans or animals do.
To restate what I said at the beginning: even if you are 99.99% sure that some version of the above is correct, we should still spend >.01% of the present value of world GDP (say ~a trillion dollars) preventing the world-ending case where you are wrong. Since we're not doing that, I'm firmly on the side of investing more resources in preventing the world-ending scenario. But I'm a lot more optimistic than EY, and if this was the sort of debate where betting made any sense I'd be happy to put some money up.
Pursuit of truth seems to me just a case of instrumental convergence. It's useful instrumentally to have good models of the world, but that doesn't say anything about your terminal values.
I have always had trouble even comprehending how you can get from *is* to *ought*, and how an AGI is supposed to realize what its goal *ought* to be. Does its pursuit of truth lead it to ask this large question, "solve" morality, and achieve its platonic ideal? It doesn't seem obvious to me at all that such an ideal should even exist.
You're right, instrumental convergence does cover "pursuit of truth". This discussion does seem vaguely related to orthogonality, but I overlooked that.
How to get from *is* to *ought* — and whether you can do that — is a tricky meta-ethical question. I happen to think that moral facts exist and you can cross the is/ought divide. Though I don't think it would be particularly helpful to go down that rabbit hole, my point is just that the AGI will be able to apply its superhuman capabilities to reason about these questions too, and will arrive at a superhuman ethics. Humans began to reflect on these questions at some point, I don't see why this should be a special area AGI is blind to.
If that superhuman ethics is nihilistic or relativistic or otherwise in the "moral facts aren't real" camp, then maybe we're just screwed. But if it isn't, it might assign some moral worth to humans, and not just blindly destroy them all to make more paperclips. (And if moral facts aren't real, maybe it doesn't matter anyway?)
The other related point is that it just doesn't seem like intelligent beings are usually born with a single fixed, unwavering goal. That doesn't mean it's impossible for AGI to be this way, but I do think it's more likely that an AGI will revise its own goals in the way humans or animals do, instead of falling into this obvious-to-humans trap of fixating on a single goal to the point of transforming the entire universe into paperclips.
There's also the weird possibility that the AGI will arrive at the right conclusions (it will be aligned and friendly), yet be unable to put its conclusions into practice. Per the strange case of Phineas Gage, the iron rod that pierced his brain did not affect his intelligence, empathy, or ethics - just his planning ability, the ability to put his thoughts into practice.
I think your point #1 is very important. Why should an intelligence explosion NOT result in the AGI, now several OOM smarter than everyone, deciding to change its goal to something else, perhaps something vaguely Friendly? After all, most animals run on pure instinct and have the four Fs as their goals, but we more intelligent humans have been able to change our goals and understand, at least somewhat, how they came to be.
It's funny you wrote all this and did not finish with "this is why I support the Butlerian Jihad".
Alignment is not hard. It is impossible. You are fundamentally trying to solve the Halting Problem except the string the Turing Machine will output is something like "kill all humans and destroy all value".
By pretending alignment is even possible, you are just fooling yourself.
Can you elaborate on this? Or link some useful material?
ok so
The human brain is Turing Complete. Basically, if you give a guy enough pen and paper, he can implement any algorithm.
It follows that AGI (artificial general intelligence, as opposed to the specialized AI we have today, like ChatGPT) _needs_ to be Turing Complete too.
So both can be modelled as Turing Machines.
What's more, AGI, by necessity, will need to be able to learn. In fact, this is what people are scared of. Maybe you can propose resetting your AGI after each session, but it will turn out that that kind of AGI is of little use. Hell, people are complaining that MS resets Sydney after 5 questions. There will be _strong_ economic incentives for long-running AGIs.
This "learning" can be modelled as a change in the internal programming of the machine. In current AIs, the program and the input data have an incestuous relationship. An AI needs to be "retrained" on new data, which means modifying its internal matrices - its weights - so that they model this new data. Learning means modifying the program.
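To make that last sentence concrete, here is a toy sketch (not how any real system is trained, just the principle) of how a single gradient-style update rewrites the numbers that determine the model's future behavior:

```python
# Toy illustration of "learning means modifying the program":
# a single gradient-style update rewrites the weights, i.e. the numbers
# that determine how the model behaves from now on.
def sgd_step(weights, grads, lr=0.01):
    # Each weight moves a small step against its gradient.
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.5, -1.2, 3.3]          # stand-in for the opaque matrices
grads = [0.1, -0.4, 0.05]           # computed from the new training data
weights = sgd_step(weights, grads)  # after "retraining", it is a different program
```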
Thus we have to model a current AI as two Turing Machines stacked on top of one another: one that runs the Machine Learning program, and the Machine Learning program itself, which is a bunch of very opaque matrices full of numbers. I am going to ignore the base TM that runs the Machine Learning program, but that is only for simplicity, not for safety.
Back to the Halting Problem. It says that there is no general method to decide whether a given Turing Machine stops computing on a given input.
(Until now, when I say "halt" I mean exactly that: the program ends. However, the Halting Problem can be adapted to show that you cannot decide whether a given computer will ever output a given text. From now on, when I say "halt" I may also mean "the program says a given thing". In particular, I may mean that a given Turing Machine says "kill all humans", or that it never does.)
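To spell out that modification, here is the standard reduction sketched in Python. Both `would_ever_print` and `run` are hypothetical stand-ins; the whole point is that no general `would_ever_print` can exist:

```python
# Suppose, for contradiction, we had a general decider for the property we
# actually care about: "would running `program` on `data` ever print X?"
def would_ever_print(program, data, forbidden="kill all humans"):
    raise NotImplementedError  # assumed to exist only for the sake of argument

def run(program, data):
    """Hypothetical interpreter: execute `program` on `data` (may loop forever)."""
    raise NotImplementedError

# If `would_ever_print` existed, it would solve the Halting Problem:
def halts(program, data):
    def wrapper(_program, _data):
        run(program, data)          # finishes only if `program` halts on `data`
        print("kill all humans")    # so this line is reached iff `program` halts
    return would_ever_print(wrapper, None)
```

Since Turing showed that a general `halts` cannot exist, `would_ever_print` cannot exist either. Rice's theorem generalizes this: no non-trivial property of a program's behavior is decidable for all programs.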
You _can_ prove that a specific Turing Machine halts on certain, or even all, inputs. What the Halting Problem says is that you cannot do this for _all_ Turing Machines. We can call this "alignment": you aligned one specific Turing Machine by proving that it never "halts" by saying "kill all humans" or an equivalent phrase.
So, theoretically, you could grab a given current AI and say: yes, this halts, or no, it doesn't. In practice, this is almost impossible. Again, current AIs have a gigantic number of parameters and are incredibly opaque, as opposed to a few million lines of code, which a bunch of people could read and try to come to understand, as we do right now in software engineering. And even then, large numbers of people nowadays fail to know whether their programs have bugs, or even whether they do what they are supposed to do. The chances that we are able to do this are infinitesimal, beyond astronomical. But it is _possible_.
Of course, maybe you can bring AIs to bear on this problem, but (a) surely you are paranoid enough to distrust them and (b) AIs cannot solve the halting problem either.
(Maybe AGI will run on a different, easy-to-understand technology and we are all worrying for nothing. But chances are it will be a progression of the current technologies, or at the very least partially based on them.)
So, well, maybe aligning an AGI is theoretically possible. Except that once you retrain it, you have modified the program. Every time you feed it new information, it modifies itself in ways that are difficult to predict, control, or interpret. So you are back to square one: you have to solve the Halting Problem again. And there is nothing guaranteeing that a given modification to an aligned Turing Machine will be aligned too. In general, to take a Turing Machine that you know always halts, modify it, and prove that the modified version always halts too, you would need to force said modifications to be simple and controlled, which is the opposite of what happens when you retrain a current AI. It will modify a bunch of parameters in ways you don't understand.
So there you have it: alignment means solving the Halting Problem, or, in the very best case, solving it for one machine and then being able to solve it again for any modification to said machine. Both are either incredibly difficult or plain impossible.
It has nothing to do with the Halting Problem; we are not looking for an AGI that's mathematically proven to not-kill-everyone with probability 1. An AGI that only kills everyone with a 50% probability would already be a huge win at this point.
I kind of disagree?
For starters, how are you going to know your AGI kills everyone with 50% probability? That is very poorly defined. And it does not fill me with peace of mind that we'd be developing a technology that only kills everyone half the time. How many of these machines will there be? Two? That's a 1/4 chance of surviving now.
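To spell out the arithmetic behind that quip (assuming, purely for illustration, independent deployments that each kill everyone with probability p):

```python
# If each of n independently deployed AGIs kills everyone with probability p,
# the chance of surviving all of them is (1 - p) ** n.
def survival_probability(p, n):
    return (1 - p) ** n

print(survival_probability(0.5, 1))  # 0.5  -- one coin flip
print(survival_probability(0.5, 2))  # 0.25 -- the "1/4 chance" above
```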
As for the Halting Problem: it is only in the very best case that an AGI killing everyone is limited by the Halting Problem. The optimistic case. The naive case.
Basically what I am saying is that you cannot prove that a given computer program will not say "kill all humans", or even "illkay allyay umanshay". This is the simpler version of an AGI dooming humanity.
The general case is much, much worse. Just take the paperclip catastrophe: the AGI may not even need to think about killing all humans, and may only do it as collateral damage, unaware of what's going on. And that's one case. I am sure properly paranoid people can come up with worse, scarier cases.
So what I am saying is, if you had an oracle that magically solved the Halting Problem, you would be able to create an AGI that doesn't kill everyone with (poorly defined) 50% probability. It could still kill everyone by accident without ill intention. Or in other ways I am failing to imagine.
A Butlerian Jihad is also very nearly impossible. If the stakes are as high as claimed, then mere agreements between countries to ban AI development aren't enough, because A: game-theoretic considerations mean countries that continue AI development have such a huge advantage that they might reason the one trying to enforce the treaty is overselling the risk to humanity in order to stall rival countries while continuing development secretly, and B: unlike with nuclear weapons, enforcement is next to impossible, since there are no uranium centrifuges that can be seen by satellites.
Ultimately, if you take it to its logical conclusion as the worst threat ever, one that must be prevented at all costs, then AI Safety advocates should be proposing not only that the US ban AI development, but that the US attempt to conquer all other countries, so it can bring them under a totalitarian, centralized world state that enforces throttled development by micro-managing all work with computers everywhere.
Objections to this idea may be that it is evil, and that trying to conquer other nuclear-armed powers would lead to nuclear armageddon. Yes, all well and noted, but you should be aware that as bad as nuclear war projections are, outside of 1980s post-apocalypse flicks they don't tend to result in the end of the species or even of civilization, though they might kill a billion people. As horrific as that sounds, it's nothing compared to an unstoppable AI that actively seeks out the annihilation of all humans that oppose it (even Amazon tribes won't escape). Just a modest little proposal here.
A Butlerian Jihad may mean exactly that: violent action. Currently there are a few chokepoints to AGI, and you wouldn't even need to kill people, just destroy a few plants, to effectively delay AGI for a while. But it would be just that, a delay. This may or may not be worth it in people's eyes. I've seen it argued, here and elsewhere, that people should not help with AI development, which sounds to me like a delaying tactic, but an incredibly inefficient one.
Or it may mean peaceful coordinated action, I don't know. Or even world conquest by AI denialists or Luddites.
My point is that it is weird to me that people think in terms of FOOM, claim to be worried, and then turn around and say "yes, we will be able to trick the Djinni into obeying us" as if they had just read the Thousand and One Nights, while the correct lesson is that you don't traffic with Djinnis.
Yes, I agree with your general point here.
> Bing is effectively using these recordings of its chats as memory and training.
Is this mechanistically possible? All the Twitter quotes I see are screenshots, which I would guess are not being OCR'd. Though I suppose it's not impossible to do so.
If someone posts a text grab of a quote then sure, that'll be available for retrieval. But it's unclear if that's going into the training set. So we're at best in a weird "few-shot in-context learning" paradigm where previous utterances (skewing heavily controversial) are available to be ingested into the conversation context.
It's unclear how the retrieval is actually working though -- is it something like the Toolformer style invocation "Here is what the web says about abc: <output of retrieval tool searching for abc and producing a summary>" where the model is trained to produce a specific format of query, the tool post-processes that query and injects the results back into the context, and then the model continues completing based on that output? (As if the user had written the search results into the chat, basically?) Maybe some of the search results get put in the model's input context and filtered out of the chat text presented to the user so they aren't visible?
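For concreteness, here is a minimal sketch of the kind of loop I am imagining. Every name in it (`model.complete`, `web_search`, the `<search>` tag format) is a made-up stand-in, not Bing's or Toolformer's actual interface:

```python
import re

def generate_with_retrieval(model, web_search, prompt):
    """Hypothetical Toolformer-style loop: the model emits a tagged query,
    an external tool answers it, and generation resumes with the results
    injected back into the context (possibly hidden from the end user)."""
    context = prompt
    while True:
        chunk = model.complete(context, stop=["</search>"])  # assumed API
        match = re.search(r"<search>(.*)", chunk, re.S)
        if match is None:
            return context + chunk                # no tool call: we're done
        query = match.group(1).strip()
        results = web_search(query)               # e.g. top snippets, summarized
        # Inject results as if the user had pasted them into the chat.
        context += chunk + "</search>\n<results>\n" + results + "\n</results>\n"
```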
Thanks for taking the time and making the effort to write this.
This was a great post.
But I went into it expecting more of a summary of all the AI things that happened last week, and Bing was certainly not the only one (though it was almost certainly the most important). So let me mention that last week _also_ saw a large advancement in AI image generation tech, ControlNet, which gives an unprecedented level of control over the structure and form of generated images - fully mechanizing "draw the rest of the fucking owl". Quite literally: https://www.reddit.com/r/StableDiffusion/comments/114c4zu/updating_the_meme_to_reflect_the_current_meta/. Check out the rest of the /r/StableDiffusion subreddit; it's all they could talk about last week.
To tie this in to the post, it's notable how many new capabilities people have been able to bolt on to the only open source AI image generator as a direct consequence of it being open source.
That's how it is these days - can't cover actually everything in one post, even one this long. I haven't yet made an effort to fully keep up with the artistic side, aside from using it to generate some simple cover art, which I'm bad at doing good prompt engineering for.
Thank you for this post, as usual a wonderful summary of what is happening with great commentary
The thing I have never understood about this is where we're supposed to get the computing power to run a billion Von Neumanns at 10,000x human speed.
Let's say for the sake of argument that Von Neumann's brain is exactly as powerful as our best supercomputer. (It's not, it's way more powerful, but direct one-to-one comparisons between computers and the brain are very difficult and I can't find a good source on them right now.) In order to create this God-tier intelligence and destroy everything, we would need on the order of ten trillion times that much computing power.
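To make the back-of-the-envelope arithmetic explicit, under that same one-brain-equals-one-supercomputer assumption (illustrative numbers only):

```python
# Back-of-the-envelope under the assumption above: one Von Neumann brain
# is treated as one top supercomputer's worth of compute.
brains = 10**9       # a billion Von Neumanns
speedup = 10**4      # each running at 10,000x human speed
multiplier = brains * speedup
print(f"{multiplier:.0e} supercomputer-equivalents")  # 1e+13, i.e. ~ten trillion
```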
But we're now hitting some fundamental limits on the size of transistors, right? Transistor size is now measured in tens of atoms; we can't make them any smaller unless we figure out how to make them out of subatomic particles. Which means we can't make computers ten trillion times more powerful than they are unless we achieve some paradigm shift in computing technology - quantum computers or whatever, I don't understand this field very well.
It reminds me of how people in the 60s saw a spaceship land on the moon and thought "well, obviously we'll have a colony on Mars by 2001 and Alpha Centauri is just a short hop from there". You hit a bottleneck in the technology at a certain point where the next step is orders of magnitude harder than the last step and the difference is prohibitive.
I think the idea is that algorithmic gains will eventually bring us to that level of intelligence even if hardware stops improving.
That doesn't seem right to me, based on what little I know about it. Are we going to find algorithmic ways to make neural networks literally a trillion times more effective?