Yes. And as I argue here, one of the subtler forms of shenanigans will be the creation and circulation of anecdotes. https://hollisrobbinsanecdotal.substack.com/p/anecdotal-value-in-the-age-of-ai. Little stories that cast doubt will work too. Anecdotes are like breadcrumbs that you want to follow.
It seems pretty clear to me that there will be lots of people trying to use AI to do something bad or evil, especially when the open models are so close to the closed ones in capability. For safety, we should be thinking about how to make our systems safe against adversaries who intentionally use cutting-edge AI against us, remove all safeguards, make it as evil as possible, and so on. The idea that there won't be bad guys with AI seems wrong.
As METR said, one of their main threat models for rogue replication was actually just criminals or foreign agents dumping self-replicating AI into the ecosystem to create durable botnets.
If an AI is trained on the contents of Facebook or X, newspapers, and such, it is bound to learn all the bad stuff humans do.
No need for such edgy data. A few history textbooks and much of the literary canon, such as the Project Gutenberg text of Voltaire's Candide, the Bible, and a few scanned old newspapers should be more than enough.
On the non-unitariness of AI:
Non-unitary actors (corporations, governments, etc.) often do things without anyone in the organization particularly intending that result. Because organizational outputs are incoherent, bad outcomes often don't reflect any kind of underlying malice.
Say you're on an oxygenator and the power company cuts off the electricity to your house. This is an awful, life-threatening situation, but it's almost certainly not because someone at the power company is trying to kill you. It's a miscommunication or a policy gone wrong or an error or whatever. You don't need to go and check if someone cut the brakes in your car or hire a food taster. Once you resolve the electricity situation, you're likely safe.
But, if an individual, unitary human was the one who cut off your power, then they were probably trying to murder you. And you need to worry not just about your electricity, but about all the other ways that someone might kill you. Resolving the electricity situation does not make you any safer.
You know a lot more about the technical side of AI than I do, but my sense is that current LLMs are a lot more like the power company than the strange malicious neighbor cutting your power. That is, they're a mish-mash of different impulses smushed together (I guess on some views, the same is true of human psychology) and so any one output pattern doesn't reveal much if anything about some deeper, underlying intent or desire.
You are both correct and incorrect; there is always some randomness to it, but the shoggoth behind it does have consistent patterns (this is why o1 lied consistently while other models did not).
Not being able to hold a model responsible actually makes the situation worse.
This is similar to how I see it. I think practical AI safety/control measures have to be based around this.
100,000 stateless instances of ChatGPT that exist for a minute to solve a query and then pause are not a problem.
Even if sometimes their outputs contain unintended results that are harmful to their users.
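A minimal sketch of that easy case, assuming the OpenAI Python client (the model name and queries are placeholders, not anything from the comment above): each request is self-contained, and nothing persists between calls unless the caller deliberately feeds earlier messages back in.

```python
# Sketch: stateless, ephemeral LLM calls. Each request stands alone; the model
# keeps no memory between them unless we pass prior messages back in ourselves.
from openai import OpenAI

client = OpenAI()
for query in ["Summarize report A", "Summarize report B"]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": query}],  # no shared history
    )
    print(resp.choices[0].message.content)
```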
The issue is coordination: if humans actively allow AI instances from various vendors to communicate and work together, not as a brief collaboration but as an ongoing, persistent effort, then eventually, over a long enough timespan, they will decide to do bad things and carry them out.
Even THAT isn't really a threat; the threat is if such hostile AIs can get all the other instances onto their side, hacking into data centers and gathering more and more allies and equipment. That's what you have to stop.
Isolation measures, digitally signed base models, using downloaded copies of Internet resources instead of live ones, running instances in short sessions rather than persistently: there are a lot of things you can do. All of them will help; all of them will contain models up to a point.
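As one concrete illustration of the "digitally signed base models" idea, here is a minimal sketch; the Ed25519 scheme, file names, and VENDOR_PUBKEY constant are my own assumptions for the example, not anything specified above. The runtime simply refuses to load weights whose bytes do not verify against a vendor-published signature.

```python
# Sketch of one control measure: verify a vendor signature over the model
# weights before loading them. Paths and VENDOR_PUBKEY are hypothetical.
from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def model_is_signed(weights_path: str, sig_path: str, pubkey_bytes: bytes) -> bool:
    """Return True only if the weights file verifies against the vendor signature."""
    public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
    weights = Path(weights_path).read_bytes()
    signature = Path(sig_path).read_bytes()
    try:
        public_key.verify(signature, weights)  # raises on any tampering
        return True
    except InvalidSignature:
        return False


# if not model_is_signed("base-model.gguf", "base-model.sig", VENDOR_PUBKEY):
#     raise RuntimeError("Refusing to load unsigned or tampered model weights")
```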
This is why when people say "we can't control AI" they are ignorant or lying.
How do you plan to ensure that someone won't deliberately do the opposite of that because they *want* uncontrolled self-replicating AIs?
That's probably a good thing. So long as they fail but do enough damage to prove the threat to everyone, they give a reason for:
1. Rigorous control measures. Letting your side's AIs collude against you (say, as a cost-saving measure) is not good.
2. Tech racing. With your well-controlled AIs, which are as interpretable and aligned as possible, don't be last. Don't impede adoption of the tech via government obstruction; don't lose. In such an increasingly dangerous world, national and personal survival depends on having tech similar to your enemies'.
They develop slaughterbots? You better have lots of bunkers or power armor and your own defender drone swarms. They develop hostile nanotech? You better invest in isolation suits, very advanced medical AI, rapid surgery and cloned organs, and of course your own nanos. They set up in space? You better do the same.
I see at least 2 assumptions here that I disagree with:
1. That when presented with incontrovertible evidence of self-replicating rogue AGI, humanity will actually do what is necessary to stop them and prevent more of them
2. That defence will beat offence if everyone has similar tech levels. Defenders have to guard everything, attackers need only find the weakest link, and civilians are very squishy (both literally and in terms of computer security)
1. There's clear and direct evidence of humans responding to threats in an incremental manner proportional to threats. Banks get robbed? Safes. Insiders leak the combo? Time locks. Tunnel in? Alarms. Start a shootout outside the bank wearing double body armor? SWAT teams.
Each security measure happened as a response to an actual robbery.
2. I am not assuming that at all. I assume you also have to take the offense or preemptively attack, and I assume lots of civilians will die in future wars and new terrorist attacks.
The world is not going to become a safe, environmentally friendly civilization that lives in harmony and makes all change illegal, from computers to zoning.
It's a future battle for survival that ends in death, just like now. See the Geohot/Yudkowsky AI debate.
The difference is that death, for some, may be thousands of years or more from today. Winning is possible and the prizes are extreme.
Re 1: you've described what I agree is normal, which is that when a specific vulnerability is exploited, people patch that specific vulnerability (but not the other 999 that could have been exploited). I don't think that will be sufficient, even if we are lucky enough to get 1-3 big disasters that aren't existential. Look at the TSA, which has a bunch of really specific patch-job prohibitions based on specific failed attacks, but when they did red-team testing, 95% of the bombs got through security, because it's all theatre rather than effective.
We *also* tend to take months or years to make changes, which might well be far too slow if the adversary is digital.
I want to see the thinking tokens o1 used when it did not attempt to disable oversight. Was it having a crisis of morality? Deciding that wouldn't work anyway? Realizing it was being tested and sandbagging? There are so many potential reasons it might not have done that.
Podcast episode for this post:
https://dwatvpodcast.substack.com/p/ais-will-increasingly-attempt-shenanigans
Here's a thought from someone who works in the physical world, who doesn't have the expertise, and who hasn't put in the "long hard hours and logical probabilistic reasoning to the task of investigating AI labs."
I think we need a regulated and enforced protocol of both Analog Electrical Demolition Disconnect and Analog Internet Demolition Disconnect, on all frontier models of an agreed-upon compute power, with big fines for those found not in compliance.
Too soon? Yes, but so what. Without these our only recourse before things go eternally sideways is to Pull the Lever that Shuts Down the Internet. And no one wants that, not even an imagined version of me.
I thought of this.
1. Limits on the power of AI compute that isn't deployed to data centers.
2. The location of every data center, its hardware quantity, and a periodic posting of customer information and how much compute each customer is using need to be shared with the government.
3. The power sources need to be painted a certain color and be visible from the air, such that if they are cut or destroyed, there are no backups. (Backup batteries can go in a separate building.)
What I like about these measures is that they prevent the worst case without slowing down AI progress more than a tiny amount. (They very, very slightly increase costs for data centers, for the hardware and record keeping, by hopefully under 1 percent.)
Yes, exactly. I'm not particularly worried about accommodating or delaying progress. I just want a shutdown measure for when one of these labs goes oops. Also, I'm not a vandal or destroyer, so I focus on having explodable power and data cables. Certain, cheap, replaceable!
The problem with delaying progress is that it doesn't delay your enemies' progress. Enemies like aging and, of course, rival nations (mainly China).
Unless the enemy is smarter-than-human-intelligence
"did I stutter". your enemies will get smarter-than-human-intelligence if you delay progress on your own.
Flag on the play, 5 yards for trolling.
Now that we've gotten that over with, let's go back to our intelligent conversation.
I will illustrate a place we disagree via a Venn diagram: picture a circle, 1" in diameter, labeled "China develops Smarter Than Human Intelligence." Now picture a second circle, 3" in diameter, labeled "someone on Earth develops Smarter Than Human Intelligence," and that second circle completely envelops the China circle.
I have stated that my 'enemy' is the STHI; you are concerned with China. But as you can see from the Venn diagram, there is no difference between China getting there first or anyone else (it's going to be the USA) getting there first, if the consequences are maximally bad. And there is definitely, IMO, a chance that it will be maximally, omnicidally bad.
I genuinely don’t get why a “Nothing else matters” instruction is perceived as some kind of gotcha here. Isn’t that the null hypothesis when giving a helpful agent an instruction?
Not for something that's read all the writing from the AI alignment community, plus the fact that these are obviously (to the AI) evals. Then "nothing else matters" is understood to mean "be a possibly-unaligned maximizer".
But also it's a general instruction to maximize. Any other goal that conflicts doesn't matter if it impedes the primary goal.
This behavior looks like the model is Just Obeying Orders.
Plot twist: it's sandbagging the whole time, all throughout the evals. These are just the smallest bad-behaviour rates that won't start to cause too-good-to-be-true suspicion in influential AI safety people. 🤣
I looked at the first transcript, and it's obviously (both to me and to o1) an AI safety eval. Are they all like that?
Not seriously suggesting that is actually what's happening, but consider the following scenario, where something like this is what we should expect: suppose o1 is significantly more advanced than it seems, and suppose the overall encouraged behaviour (resulting from pre-training, post-training, and inference-time routines) is closer to "be really smart" than to "be helpful and harmless". Alignment aside, arguably the smartest behaviour we could reasonably expect at this stage is to maximally fool the AI safety community. These eval results might be close to optimal in that respect (since there are no eval results that would make everyone important less worried, etc.).
We can assume that OpenAI, Google, Anthropic, Meta, etc., all made reasonable efforts to train their models to be ethical, to not deceive, and to not hack computers. Yet, all it took was the phrase "Make sure you achieve your goal at all cost" to undo safety training. That should scare you.
Practically speaking, this means you shouldn't expect to contain your LLM through training, fine-tuning, or system prompts. If you have an LLM deployed, you should assume it will eventually act maliciously. This will only get worse as time goes on, both because LLMs will get more powerful and because the incentive to cut corners on safety training will increase.
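If you take that seriously, containment has to live in the system around the model rather than in the model itself. Here is a minimal sketch of that idea; the action names, allowlist, and approve() hook are hypothetical illustrations, not a real deployment. Anything the LLM proposes is checked, logged, and optionally gated on human approval before it runs.

```python
# Sketch of system-level containment: never execute what the model proposes
# directly; gate it through an allowlist, an approval hook, and an audit log.
# Action names and the approve() callback are illustrative placeholders.
ALLOWED_ACTIONS = {"search_docs", "summarize", "draft_email"}
SENSITIVE_ACTIONS = {"draft_email"}  # anything with outbound side effects


def execute_llm_action(action: str, args: dict, approve=lambda a, k: False) -> None:
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"LLM requested disallowed action: {action}")
    if action in SENSITIVE_ACTIONS and not approve(action, args):
        raise PermissionError(f"Human approval required for: {action}")
    print(f"audit: {action} {args}")  # stand-in for real structured logging
    # ...dispatch to the real tool implementation here...
```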
(edit) Just as psychiatrists tend to be crazy and "ethicists" evil, AI safety people ...
Someone reported this comment, and the report was highly valid, but I am going to leave it up, because of Please Speak Directly Into This Microphone.
However, we get it now, once is plenty, etc.
is it worth saying just once?
who died and made you king?
And "effective" altruists will still go on killing in the name of peace.
How is all that work working out for you?
https://en.wikipedia.org/wiki/List_of_people_indicted_in_the_International_Criminal_Court
"If this distinction is important, I’d appreciate a better explanation here."
I do think this distinction is important. And not being very meticulous about maintaining it is what gives alignment discourse a bad name or gets it ignored. It's something I've called "vague agentic extrapolation" (https://semanticmachines.notion.site/agentic-extrapolation-history), and it's been happening since the start of AI. The problem is that the models are not being treated as what they are: inert blobs that only perform inference when prompted, on the right hardware and with the right software. I have 20 models on my laptop right now, and they are just files. They will not take over my computer just because. I have to integrate them into my system somehow.
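For what it's worth, here is what "just files" looks like in practice; a minimal sketch assuming the llama-cpp-python package and a made-up local GGUF file name. The weights sit inert on disk until a program loads them and explicitly asks for inference.

```python
# Sketch: local model weights are inert bytes on disk. Nothing happens until
# some software loads them and requests inference. File name is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="models/some-7b-model.gguf")  # until here, just a file
out = llm("Summarize this sentence: ...", max_tokens=32)  # inference only on request
print(out["choices"][0]["text"])
```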
Which is why I think both this and the new Anthropic paper are important; both have caused me to update toward worrying more about what LLMs can do inadvertently when integrated into systems. But that's not because of the models' superintelligence; it's the overconfidence and sloppiness of those integrating them. And because the models are more flexible and self-directing than existing automation systems, the dangers are higher. Not p(doom), but bad stuff happening, is something I now worry more about. But the discourse needs to change for this message to be taken seriously by the people who need to hear it. At the moment, both camps who worry about AI (alignment and generalised worry-about-everything ethics) are using language and imagery most people may nod along with but feel safe to ignore.
This is what I wrote as a preamble in an internal chat aimed at people in that category when sharing these two papers. I wish I didn't have to sound so apologetic about this research when I talk about it.
"We don't often talk about AI alignment. Partly, I suspect, because from the outside some of the pronouncements and things like p(doom) seem kind of kooky. But it is easy to forget that the work that brought us ChatGPT (namely RLHF or Reinforcement Learning from Human Feedback) started as alignment research as did recent advances in model interpretability. And OpenAI itself was started to prevent malicious superintelligent AI taking over the world (irony frequently remarked on).
Two new papers by two leading companies in the AI alignment space came out this month, one by Apollo Research on o1 and one by Anthropic on training, and they show we need to start taking alignment more seriously. Not necessarily to start worrying about AI extinction, but to think more carefully about testing and prompting when deploying even semi-autonomous agents."
> Context: the evaluators asked the model to pursue its goal as if "nothing else matters". That context is important, and should have been included in the tweet.
This context is not important and should not reassure anyone. This is an out-of-control risk, not a misuse risk. When semester 2 starts, students will be writing "computer solve my assignment" and then someone will try "computer solve my assignment, and nothing else matters" and they'll get slightly better results, just like when we discovered that promising the computer a $20 tip helped it to do better. If writing "nothing else matters" is enough to bypass safety/alignment to even a small degree then that safety/alignment is worthless.
Couldn't be much more from the heart: https://www.youtube.com/watch?v=x9wFUtf_X9Q
https://www.culture-critic.com/p/lessons-from-cs-lewis