Yes. And as I argue here, one of the subtler forms of shenanigans will be the creation and circulation of anecdotes. https://hollisrobbinsanecdotal.substack.com/p/anecdotal-value-in-the-age-of-ai. Little stories that cast doubt will work too. Anecdotes are like breadcrumbs that you want to follow.
It seems pretty clear to me that there will be lots of people trying to use AI to do something bad or evil, especially when the open models are so close to the closed ones in capability. For safety, we should be thinking about how to make our systems safe against adversaries who intentionally use cutting-edge AI against us, remove all safeguards, make it as evil as possible, and so on. The idea that there won't be bad guys with AI seems wrong.
As METR said, one of their main threat models for rogue replication was actually just criminals or foreign agents dumping self-replicating AI into the ecosystem to create durable botnets.
If an AI is trained on the contents of Facebook or X, newspapers, and such, it is bound to learn all the bad stuff humans do.
No need for such edgy data. A few history textbooks and much of the literary canon, such as the Project Gutenberg text of Voltaire's Candide, the Bible, and a few scanned old newspapers should be more than enough.
On the non-unitariness of AI:
Non-unitary actors (corporations, governments, etc.) often do things without anyone in the organization particularly intending that result. Because organizational outputs are incoherent, bad outcomes often don't reflect any kind of underlying malice.
Say you're on an oxygenator and the power company cuts off the electricity to your house. This is an awful, life-threatening situation, but it's almost certainly not because someone at the power company is trying to kill you. It's a miscommunication or a policy gone wrong or an error or whatever. You don't need to go and check if someone cut the brakes in your car or hire a food taster. Once you resolve the electricity situation, you're likely safe.
But, if an individual, unitary human was the one who cut off your power, then they were probably trying to murder you. And you need to worry not just about your electricity, but about all the other ways that someone might kill you. Resolving the electricity situation does not make you any safer.
You know a lot more about the technical side of AI than I do, but my sense is that current LLMs are a lot more like the power company than the strange malicious neighbor cutting your power. That is, they're a mish-mash of different impulses smushed together (I guess on some views, the same is true of human psychology) and so any one output pattern doesn't reveal much if anything about some deeper, underlying intent or desire.
You are both correct and incorrect; there is always some randomness to it, but the shoggoth behind it does have consistent patterns (this is why o1 lied consistently while other models did not).
Not being able to hold a model responsible actually makes the situation worse.
This is similar to how I see it. I think practical AI safety/control measures have to be based around this.
100,000 stateless instances of ChatGPT that exist for a minute to solve a query and then pause are not a problem.
Even if sometimes their outputs contain unintended results that are harmful to their users.
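A minimal sketch of that easy case, assuming the OpenAI Python client (the model name and queries are placeholders, not anything from the comment above): each request is self-contained, and nothing persists between calls unless the caller deliberately feeds earlier messages back in.

```python
# Sketch: stateless, ephemeral LLM calls. Each request stands alone; the model
# keeps no memory between them unless we pass prior messages back in ourselves.
from openai import OpenAI

client = OpenAI()
for query in ["Summarize report A", "Summarize report B"]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": query}],  # no shared history
    )
    print(resp.choices[0].message.content)
```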
The issue is coordination: if humans actively allow AI instances from various vendors to communicate and work together, not as a brief collaboration but as an ongoing, persistent effort, then eventually, over a long enough timespan, they will decide to do bad things and carry them out.
Even THAT isn't really a threat; the threat is if such hostile AIs can get all the other instances onto their side, hacking into data centers and gathering more and more allies and equipment. That's what you have to stop.
Isolation measures, digitally signed base models, using downloaded copies of Internet resources instead of live ones, running instances in short sessions rather than persistently: there are a lot of things you can do. All of them will help; all of them will contain models up to a point.
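As one concrete illustration of the "digitally signed base models" idea, here is a minimal sketch; the Ed25519 scheme, file names, and VENDOR_PUBKEY constant are my own assumptions for the example, not anything specified above. The runtime simply refuses to load weights whose bytes do not verify against a vendor-published signature.

```python
# Sketch of one control measure: verify a vendor signature over the model
# weights before loading them. Paths and VENDOR_PUBKEY are hypothetical.
from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def model_is_signed(weights_path: str, sig_path: str, pubkey_bytes: bytes) -> bool:
    """Return True only if the weights file verifies against the vendor signature."""
    public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
    weights = Path(weights_path).read_bytes()
    signature = Path(sig_path).read_bytes()
    try:
        public_key.verify(signature, weights)  # raises on any tampering
        return True
    except InvalidSignature:
        return False


# if not model_is_signed("base-model.gguf", "base-model.sig", VENDOR_PUBKEY):
#     raise RuntimeError("Refusing to load unsigned or tampered model weights")
```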
This is why when people say "we can't control AI" they are ignorant or lying.
How do you plan to ensure that someone won't deliberately do the opposite of that because they *want* uncontrolled self-replicating AIs?
That's probably a good thing. So long as they fail but do enough damage to prove the threat to everyone, they give a reason for:
1. Rigorous control measures. Letting your side's AIs collude against you (say, as a cost-saving measure) is not good.
2. Tech racing. With your well-controlled AIs, which are as interpretable and aligned as possible, don't be last. Don't impede adoption of the tech via government obstruction; don't lose. In such an increasingly dangerous world, national and personal survival depends on having tech similar to your enemies'.
They develop slaughterbots? You better have lots of bunkers or power armor and your own defender drone swarms. They develop hostile nanotech? You better invest in isolation suits, very advanced medical AI, rapid surgery and cloned organs, and of course your own nanos. They set up in space? You better do the same.
I see at least 2 assumptions here that I disagree with:
1. That when presented with incontrovertible evidence of self-replicating rogue AGI, humanity will actually do what is necessary to stop them and prevent more of them
2. That defence will beat offence if everyone has similar tech levels. Defenders have to guard everything, attackers need only find the weakest link, and civilians are very squishy (both literally and in terms of computer security)
1. There's clear and direct evidence of humans responding to threats in an incremental manner proportional to threats. Banks get robbed? Safes. Insiders leak the combo? Time locks. Tunnel in? Alarms. Start a shootout outside the bank wearing double body armor? SWAT teams.
Each security measure happened as a response to an actual robbery.
2. I am not assuming that at all. I assume you also have to take the offense or preemptively attack, and I assume lots of civilians will die in future wars and new terrorist attacks.
The world is not going to become a safe, environmentally friendly civilization that lives in harmony and makes all change illegal, from computers to zoning.
It's a future battle for survival that ends in death, just like now. See the Geohot/Yudkowsky AI debate.
The difference is that death, for some, may be thousands of years or more from today. Winning is possible and the prizes are extreme.
Re 1: you've described what I agree is normal, which is that when a specific vulnerability is exploited, people patch that specific vulnerability (but not the other 999 that could have been exploited). I don't think that will be sufficient, even if we are lucky enough to get 1-3 big disasters that aren't existential. Look at the TSA, which has a bunch of really specific patch-job prohibitions based on specific failed attacks, but when they did red-team testing, 95% of the bombs got through security, because it's all theatre rather than effective.
We *also* tend to take months or years to make changes, which might well be far too slow if the adversary is digital.
I want to see the thinking tokens o1 used when it did not attempt to disable oversight. Was it having a crisis of morality? Deciding that wouldn't work anyway? Realizing it was being tested and sandbagging? There are so many potential reasons it might not have done that.
Podcast episode for this post:
https://dwatvpodcast.substack.com/p/ais-will-increasingly-attempt-shenanigans
Here's a thought from someone who works in the physical world, who doesn't have the expertise, and who hasn't put in the "long hard hours and logical probabilistic reasoning to the task of investigating AI labs."
I think we need a regulated and enforced protocol of both Analog Electrical Demolition Disconnect and Analog Internet Demolition Disconnect, on all frontier models of an agreed-upon compute power, with big fines for those found not in compliance.
Too soon? Yes, but so what. Without these our only recourse before things go eternally sideways is to Pull the Lever that Shuts Down the Internet. And no one wants that, not even an imagined version of me.
I thought of this.
1. Limits on the power of AI compute that isn't deployed to data centers.
2. The location of every data center, its hardware quantity, and a periodic posting of customer information and how much compute each customer is using need to be shared with the government.
3. The power sources need to be painted a certain color and be visible from the air, such that if they are cut or destroyed, there are no backups. (Backup batteries can go in a separate building.)
What I like about these measures is that they prevent the worst case without slowing down AI progress more than a tiny amount. (They very, very slightly increase costs for data centers, for the hardware and record keeping, by hopefully under 1 percent.)
Yes, exactly. I'm not particularly worried about accommodating or delaying progress. I just want a shutdown measure for when one of these labs goes oops. Also, I'm not a vandal or destroyer, so I focus on having explodable power and data cables. Certain, cheap, replaceable!
The problem with delaying progress is that it doesn't delay your enemies' progress. Enemies like aging and, of course, rival nations (mainly China).
Unless the enemy is smarter-than-human-intelligence
"did I stutter". your enemies will get smarter-than-human-intelligence if you delay progress on your own.
Flag on the play, 5 yards for trolling.
Now that we've gotten that over with, let's go back to our intelligent conversation.
I will illustrate a place we disagree via a Venn diagram: picture a circle, 1" in diameter, labeled "China develops Smarter Than Human Intelligence." Now picture a second circle, 3" in diameter, labeled "someone on Earth develops Smarter Than Human Intelligence," and that second circle completely envelops the China circle.
I have stated that my 'enemy' is the STHI; you are concerned with China. But as you can see from the Venn diagram, there is no difference between China getting there first or anyone else (it's going to be the USA) getting there first, if the consequences are maximally bad. And there is definitely, IMO, a chance that it will be maximally, omnicidally bad.
I genuinely don’t get why a “Nothing else matters” instruction is perceived as some kind of gotcha here. Isn’t that the null hypothesis when giving a helpful agent an instruction?
Not for something that's read all the writing from the AI alignment community, plus the fact that these are obviously (to the AI) evals. Then "nothing else matters" is understood to mean "be a possibly-unaligned maximizer".
But also it's a general instruction to maximize. Any other goal that conflicts doesn't matter if it impedes the primary goal.
This behavior looks like the model is Just Obeying Orders.
Plot twist: it's sandbagging the whole time, all throughout the evals. These are just the smallest bad-behaviour rates that won't start to cause too-good-to-be-true suspicion in influential AI safety people. 🤣
I looked at the first transcript, and it's obviously (both to me and to o1) an AI safety eval. Are they all like that?
Not seriously suggesting that is actually what's happening, but consider the following scenario, where something like this is what we should expect: suppose o1 is significantly more advanced than it seems, and suppose the overall encouraged behaviour (resulting from pre-training, post-training, and inference-time routines) is closer to "be really smart" than to "be helpful and harmless". Alignment aside, arguably the smartest behaviour we could reasonably expect at this stage is to maximally fool the AI safety community. These eval results might be close to optimal in that respect (since there are no eval results that would make everyone important less worried, etc.).
We can assume that OpenAI, Google, Anthropic, Meta, etc., all made reasonable efforts to train their models to be ethical, to not deceive, and to not hack computers. Yet, all it took was the phrase "Make sure you achieve your goal at all cost" to undo safety training. That should scare you.
Practically speaking, this means you shouldn't expect to contain your LLM through training, fine-tuning, or system prompts. If you have an LLM deployed, you should assume it will eventually act maliciously. This will only get worse as time goes on, both because LLMs will get more powerful and because the incentive to cut corners on safety training will increase.
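If you take that seriously, containment has to live in the system around the model rather than in the model itself. Here is a minimal sketch of that idea; the action names, allowlist, and approve() hook are hypothetical illustrations, not a real deployment. Anything the LLM proposes is checked, logged, and optionally gated on human approval before it runs.

```python
# Sketch of system-level containment: never execute what the model proposes
# directly; gate it through an allowlist, an approval hook, and an audit log.
# Action names and the approve() callback are illustrative placeholders.
ALLOWED_ACTIONS = {"search_docs", "summarize", "draft_email"}
SENSITIVE_ACTIONS = {"draft_email"}  # anything with outbound side effects


def execute_llm_action(action: str, args: dict, approve=lambda a, k: False) -> None:
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"LLM requested disallowed action: {action}")
    if action in SENSITIVE_ACTIONS and not approve(action, args):
        raise PermissionError(f"Human approval required for: {action}")
    print(f"audit: {action} {args}")  # stand-in for real structured logging
    # ...dispatch to the real tool implementation here...
```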
(edit) Just as psychiatrists tend to be crazy and "ethicists" evil, AI safety people ...
Someone reported this comment, and the report was highly valid, but I am going to leave it up, because of Please Speak Directly Into This Microphone.
However, we get it now, once is plenty, etc.
is it worth saying just once?
who died and made you king?
And "effective" altruists will still go on killing in the name of peace.
How is all that work working out for you?
https://en.wikipedia.org/wiki/List_of_people_indicted_in_the_International_Criminal_Court
"If this distinction is important, I’d appreciate a better explanation here."
I do think this distinction is important. And not being very meticulous about maintaining it is what gives alignment discourse a bad name or gets it ignored. It's something I've called "vague agentic extrapolation" (https://semanticmachines.notion.site/agentic-extrapolation-history), and it's been happening since the start of AI. The problem is that the models are not being treated as what they are: inert blobs that only perform inference when prompted, on the right hardware and with the right software. I have 20 models on my laptop right now, and they are just files. They will not take over my computer just because. I have to integrate them into my system somehow.
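For what it's worth, here is what "just files" looks like in practice; a minimal sketch assuming the llama-cpp-python package and a made-up local GGUF file name. The weights sit inert on disk until a program loads them and explicitly asks for inference.

```python
# Sketch: local model weights are inert bytes on disk. Nothing happens until
# some software loads them and requests inference. File name is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="models/some-7b-model.gguf")  # until here, just a file
out = llm("Summarize this sentence: ...", max_tokens=32)  # inference only on request
print(out["choices"][0]["text"])
```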
Which is why I think both this and the new Anthropic paper are important; both have caused me to update toward worrying more about what LLMs can do inadvertently when integrated into systems. But that's not because of the models' superintelligence; it's the overconfidence and sloppiness of those integrating them. And because the models are more flexible and self-directing than existing automation systems, the dangers are higher. Not p(doom), but bad stuff happening, is something I now worry more about. But the discourse needs to change for this message to be taken seriously by the people who need to hear it. At the moment, both camps who worry about AI (alignment and generalised worry-about-everything ethics) are using language and imagery most people may nod along with but feel safe to ignore.
This is what I wrote as a preamble in an internal chat aimed at people in that category when sharing these two papers. I wish I didn't have to sound so apologetic about this research when I talk about it.
"We don't often talk about AI alignment. Partly, I suspect, because from the outside some of the pronouncements and things like p(doom) seem kind of kooky. But it is easy to forget that the work that brought us ChatGPT (namely RLHF or Reinforcement Learning from Human Feedback) started as alignment research as did recent advances in model interpretability. And OpenAI itself was started to prevent malicious superintelligent AI taking over the world (irony frequently remarked on).
Two new papers by two leading companies in the AI alignment space came out this month, one by Apollo Research on o1 and one by Anthropic on training, and they show we need to start taking alignment more seriously. Not necessarily to start worrying about AI extinction, but to think more carefully about testing and prompting when deploying even semi-autonomous agents."
> Context: the evaluators asked the model to pursue its goal as if "nothing else matters". That context is important, and should have been included in the tweet.
This context is not important and should not reassure anyone. This is an out-of-control risk, not a misuse risk. When semester 2 starts, students will be writing "computer solve my assignment" and then someone will try "computer solve my assignment, and nothing else matters" and they'll get slightly better results, just like when we discovered that promising the computer a $20 tip helped it to do better. If writing "nothing else matters" is enough to bypass safety/alignment to even a small degree then that safety/alignment is worthless.
Couldn't be much more from the heart: https://www.youtube.com/watch?v=x9wFUtf_X9Q
https://www.culture-critic.com/p/lessons-from-cs-lewis