> shenanigans
lmaotho. You're appealing to my generation quite well. (How old are you, Zvi?)
I absolutely agree with the headline. I'll read the rest of the article when I get a chance.
(Just found out that I have o1 & video capabilities, so that's nice)
Yes. And as I argue here, one of the subtler forms of shenanigans will be the creation and circulation of anecdotes. https://hollisrobbinsanecdotal.substack.com/p/anecdotal-value-in-the-age-of-ai. Little stories that cast doubt will work too. Anecdotes are like breadcrumbs that you want to follow.
It seems pretty clear to me that there will be lots of people trying to use AI to do something bad or evil, especially when the open models are so close to the closed ones in capability. For safety, we should be asking how we can make our systems safe against adversaries who are intentionally using cutting-edge AI against us, removing all safeguards, intentionally making it as evil as possible, and so on. The idea that there won't be bad guys with AI seems wrong.
As METR said, one of their main threat models for rogue replication was actually just criminals or foreign agents dumping self-replicating AI into the ecosystem to create durable botnets.
If an AI is trained on the contents of Facebook or X, newspapers, and such, it is bound to learn all the bad stuff humans do.
No need for such edgy data. A few history textbooks and much of the literary canon, such as the Project Gutenberg text of Voltaire's Candide, the Bible, and a few scanned old newspapers, should be more than enough.
On the non-unitariness of AI:
Non-unitary actors (corporations, governments, etc.) often do things without anyone in the organization particularly intending that result. Because organizational outputs are incoherent, bad outcomes often don't reflect any kind of underlying malice.
Say you're on an oxygenator and the power company cuts off the electricity to your house. This is an awful, life-threatening situation, but it's almost certainly not because someone at the power company is trying to kill you. It's a miscommunication or a policy gone wrong or an error or whatever. You don't need to go and check whether someone cut the brakes in your car, or hire a food taster. Once you resolve the electricity situation, you're likely safe.
But, if an individual, unitary human was the one who cut off your power, then they were probably trying to murder you. And you need to worry not just about your electricity, but about all the other ways that someone might kill you. Resolving the electricity situation does not make you any safer.
You know a lot more about the technical side of AI than I do, but my sense is that current LLMs are a lot more like the power company than the strange malicious neighbor cutting your power. That is, they're a mish-mash of different impulses smushed together (I guess on some views, the same is true of human psychology) and so any one output pattern doesn't reveal much if anything about some deeper, underlying intent or desire.
You are both correct and incorrect; there is always some randomness to it, but the shoggoth behind it does have consistent patterns (this is why o1 lied consistently, while other models did not).
Not being able to hold a model responsible actually makes the situation worse.
This is similar to how I see it. I think practical AI safety/control measures have to be based around this.
100,000 stateless instances of ChatGPT that exist for a minute to solve a query and then pause are not a problem, even if their outputs sometimes contain unintended results that are harmful to their users.
The issue is coordination: if humans actively allow AI instances from various vendors to communicate and work together, not as a brief collaboration but as an ongoing, persistent effort, then eventually, over a long enough timespan, they will decide to do bad things and carry them out.
Even THAT isn't really a threat. The threat is if such a hostile AI can get all the other instances onto its side, hacking into data centers and gathering more and more allies and equipment. That's what you have to stop.
Isolation measures, digitally signed base models, using downloaded copies of Internet resources instead of live ones, running instances in short sessions rather than persistently: there are a lot of things you can do. All of them will help, and all of them will contain models up to a point.
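Of those measures, the digitally signed base models one is easy to make mechanical. Below is a minimal sketch, assuming a vendor that publishes an Ed25519 public key and a detached signature alongside the weights; the file contents and key handling here are hypothetical placeholders, not any vendor's actual distribution scheme.

```python
# Minimal sketch: refuse to load base model weights unless a detached
# Ed25519 signature from the vendor verifies. All names and data here are
# illustrative placeholders.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def verify_model_signature(weights: bytes, signature: bytes, pubkey_bytes: bytes) -> bool:
    """Return True only if the detached signature over the weights is valid."""
    public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
    try:
        public_key.verify(signature, weights)  # raises InvalidSignature on mismatch
        return True
    except InvalidSignature:
        return False


if __name__ == "__main__":
    # Stand-in for the vendor's signing step; in practice you would only hold
    # the public key and the signature file shipped with the weights.
    signing_key = Ed25519PrivateKey.generate()
    vendor_pubkey = signing_key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw
    )
    weights = b"placeholder for the base model file's bytes"
    signature = signing_key.sign(weights)

    if not verify_model_signature(weights, signature, vendor_pubkey):
        raise RuntimeError("Weights failed the signature check; refusing to load.")
    print("Signature OK; safe to load the base model.")
```

The point of a check like this is that it runs outside the model: a tampered or swapped set of weights simply never gets loaded, regardless of what any model says.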
This is why when people say "we can't control AI" they are ignorant or lying.
I want to see the thinking tokens o1 used when it did not attempt to disable oversight. Was it having a crisis of morality? Deciding that wouldn't work anyway? Realizing it was being tested and sandbagging? There are so many potential reasons it might not have done that.
Podcast episode for this post:
https://dwatvpodcast.substack.com/p/ais-will-increasingly-attempt-shenanigans
Here's a thought from someone who works in the physical world, who doesn't have the expertise and hasn't put in the "long hard hours and logical probabilistic reasoning to the task of investigating AI labs."
I think we need a regulated and enforced protocol of both Analog Electrical Demolition Disconnect and Analog Internet Demolition Disconnect, on all frontier models of an agreed-upon compute power, with big fines for those found not in compliance.
Too soon? Yes, but so what. Without these our only recourse before things go eternally sideways is to Pull the Lever that Shuts Down the Internet. And no one wants that, not even an imagined version of me.
I thought of this.
1. Limits on the power of AI compute that isn't deployed to data centers
2. The location of every data center, its hardware quantity, and a periodic posting of customer information and how much compute each customer is using need to be shared with the government.
3. The power sources need to be painted a certain color and be visible from the air, and if they are cut or destroyed, there must be no backups. (Backup batteries can go in a separate building.)
What I like about these measures is that they prevent the worst case without slowing down AI progress more than a tiny amount. (They very slightly increase data centers' costs for hardware and record keeping, hopefully by under 1 percent.)
Yes, exactly. I'm not particularly worried about accommodating or delaying progress. I just want a shutdown measure for when one of these labs goes oops. Also, I'm not a vandal or destroyer, so I focus on having explodable power and data cables. Certain, cheap, replaceable!
The problem with delaying progress is that it doesn't delay your enemies' progress. Enemies like aging and, of course, rival nations. (Mainly China.)
I genuinely don’t get why a “Nothing else matters” instruction is perceived as some kind of gotcha here. Isn’t that the null hypothesis when giving a helpful agent an instruction?
Not for something that's read all the writing from the AI alignment community, plus the fact that these are obviously (to the AI) evals. In that context, "nothing else matters" is understood to mean "be a possibly-unaligned maximizer".
But also it's a general instruction to maximize. Any other goal that conflicts doesn't matter if it impedes the primary goal.
This behavior looks like the model is Just Obeying Orders.
Plot twist: it's sandbagging the whole time. All throughout the evals. These are just the smallest bad-behaviour rates that won't start to cause too-good-to-be-true suspicion in influential AI safety people. 🤣
I looked at the first transcript, and it's obviously (both to me and to o1) an AI safety eval. Are they all like that?
Not seriously suggesting that is actually what's happening, but consider the following scenario, where something like this is what we should expect: Suppose o1 is significantly more advanced than it seems, and suppose the overall encouraged behaviour (resulting from pre-training, post-training, and inference-time routines) is closer to "be really smart" than to "be helpful and harmless". Alignment aside, arguably the smartest behaviour we could reasonably expect at this stage is to maximally fool the AI Safety community. These eval results might be close to optimal in that respect (since there are no eval results that would make everyone important less worried, etc.).
We can assume that OpenAI, Google, Anthropic, Meta, etc., all made reasonable efforts to train their models to be ethical, to not deceive, and to not hack computers. Yet, all it took was the phrase "Make sure you achieve your goal at all cost" to undo safety training. That should scare you.
Practically speaking, this means you shouldn't expect to contain your LLM through training, fine-tuning, or system prompts. If you have an LLM deployed, you should assume it will eventually act maliciously. This will only get worse as time goes on, both because LLMs will get more powerful and because the incentive to cut corners on safety training will increase.
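One practical consequence of not trusting the prompt: put the hard limits outside the model. Here is a minimal sketch of that idea, with a hypothetical tool-dispatch layer; the tool names and the human-approval flag are invented for illustration, not any particular framework's API.

```python
# Minimal sketch: the model can ask for anything, but only allowlisted tool
# calls ever execute, and riskier ones also need explicit human sign-off.
# Tool names and the approval mechanism are hypothetical.
ALLOWED_TOOLS = {"search_docs", "read_file"}    # deliberately no shell/network/write tools
REQUIRES_HUMAN_APPROVAL = {"read_file"}         # extra gate for riskier calls


def execute_tool_call(name: str, args: dict, approved_by_human: bool = False) -> dict:
    """Run a model-requested tool call only if it passes checks the model can't override."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool '{name}' is not on the allowlist.")
    if name in REQUIRES_HUMAN_APPROVAL and not approved_by_human:
        raise PermissionError(f"Tool '{name}' requires explicit human sign-off.")
    # Dispatch to the real, sandboxed implementation here.
    return {"tool": name, "args": args, "status": "dispatched"}


# Whatever the model outputs, requests outside this boundary simply never run.
print(execute_tool_call("search_docs", {"query": "quarterly report"}))
```

This doesn't make the model aligned; it just means that when the safety training fails, the blast radius is whatever the allowlist permits.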