
"I mean, yes, obviously. For those asking about the first step, it is spending day one in a series of steps to acquire quite a lot of financial resources, and make various purchases, and reach out to various allies, and so on. "

I think I've found the core objection I have to these scenarios, and it's pretty close to John Carmack's response. It narrows in on this starting position: getting lots of "financial resources" and "allies".

Financial resources are not zero sum, but short-term gain of financial resources without something of value to trade definitely is. Someone, or a group of people, losing billions of dollars *in a day* would be flagged by every financial service in the world - doubly so if they also have access to AI oracles. Trying to spend that money would get you raided by the FBI within hours - no joke, and no AI needed. It's simply far too much money to go unnoticed by very serious people. Financial systems are tools that humans use to increase efficiency in the trading of wealth. When they screw up (which is how this would be treated), we definitely don't just accept the error and ignore the fact that wealth was transferred when it should not have been.

Allies do not exist in a vacuum, and are not naturally waiting around for someone proposing a crime that they will instantly join up with. Read up on Ross Ulbricht trying to find criminal accomplices, and ending up surrounded by FBI agents and people stealing from him. He tried to hire two different hitmen, both of whom apparently lied to him about killing the person and stole his money - one of whom was an FBI agent (who, yes, lied to him and stole his money). Again, doubly so if other people have oracle AIs to work with. In fact, in a world with oracle AIs, nobody needs this teenager to do anything; they are adding no value and everyone will ignore them.

Carmack makes a distinction that is important to add here. He said no "supernatural" AI oracle. If you can ask the AI for a high-level official's passwords, that makes getting into a system much easier - but why would that information be available? (Similarly for "how to hack the DoD" or "give me the nuclear launch codes".) I do not accept a response that basically makes the AI magic and capable of anything we can imagine. Infinite intelligence does not mean it knows things that cannot be derived from existing information. It can likely find security flaws that allow it to steal information - but that's the kind of thing a friendly AI helping with defense should be fully capable of handling.

Don't get me wrong, I think it would be stupid to hook AI into decision-making systems or give it access to dangerous information. I think that's true whether the systems are ASI or Google Bard - it would be like giving information to someone unreliable, perhaps like giving a security clearance to someone with bad gambling debts or a general's gullible nephew. These are bad things that should be prevented. Positing magic that allows any teenager to take over the world doesn't help much and breeds skepticism.


My favorite part of the teenager analogy is that in all the examples the teenager doesn’t do much to end the world! Maybe they tell the AI to do it in the first place, but all the tractable approaches involve the computer amassing wealth and communicating at a speed the teenager cannot help with.


Re: GPT-4 & cardiovascular risk: best I can tell, they're comparing traditional tools for evaluating 10-year cardiovascular disease risk (which take health info into account and weight it to give a historically-validated estimate of the risk of heart attack or stroke in the next 10 years) against just giving some LLMs the raw data and asking what they think the 10-year risk is. It's unclear whether the LLMs are using, or know to use, the tools, or if they're just winging it - probably the latter, given the results point to this not really working very well. This is really confusing to me, because if we have historically-validated methods for doing this, why not just plug the LLMs into those? I guess maybe they wanted to see if current LLMs could potentially replace the existing methods, but I have no clue why you would expect LLMs to spontaneously replicate decades of statistical work off a prompt and without the decades of statistics. It's possible I'm misinterpreting things here, but my impression is that this kind of study is being done by people who don't fundamentally understand what they're studying.
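
For what it's worth, "plug the LLMs into those" could be as simple as using the LLM only to structure the record and handing the numbers to the existing calculator. A rough sketch, with hypothetical function names standing in for whatever tool and model you actually use:

```python
# A minimal sketch, under assumed names, of "plug the LLM into the
# validated tool": the LLM only structures the raw record, and a
# historically-validated calculator produces the actual estimate.
# Both helper functions below are hypothetical stand-ins, not real APIs.
from dataclasses import dataclass

@dataclass
class RiskInputs:
    age: int
    total_cholesterol: float
    hdl: float
    systolic_bp: float
    smoker: bool
    diabetic: bool

def extract_inputs_with_llm(raw_record: str) -> RiskInputs:
    """Hypothetical: have the LLM parse messy clinical text into fields."""
    raise NotImplementedError

def validated_ten_year_risk(inputs: RiskInputs) -> float:
    """Hypothetical wrapper around an existing validated risk equation."""
    raise NotImplementedError

def ten_year_cvd_risk(raw_record: str) -> float:
    # The LLM does what it is plausibly good at (reading messy text);
    # the decades of statistical work do the actual risk estimation.
    return validated_ten_year_risk(extract_inputs_with_llm(raw_record))
```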

Broke: the 3 laws of robotics

Woke: Voight-Kampff tests for AI


Possible typo: "Zackary Davis" should perhaps be "Zachary Diaz"?


This in the Scott Alexander post was a major update for me:

> But if we don’t get AI, I think there’s a 50%+ chance in the next 100 years we end up dead or careening towards Venezuela.

Granted, in Scott's model the likelihood of the precondition is probably small. Still an illuminating number since it is (1) a straightforward consequence of standard LessWrong-y world-models and (2) quite hard to accept emotionally. I interpret it as Scott grokking, better than I thought he would, that reality is not his friend and that the human-society alignment problem is largely dual to the AI alignment problem given sufficiently rational humans. The next step is for him to realize that a high P(no AI) is also hard to accept emotionally and for that reason he's probably underestimating it.


Eight Billion Genies (recent comic book series) is essentially an investigation of the oracle question.


"In the longer term, I am more skeptical of such strategies. I do not think lie detection is a viable primary control or alignment strategy. I worry that if we go down such a path, we risk fooling ourselves, optimizing in ways that cause the techniques to stop working, and get ourselves killed."

I agree that the approach they took here, asking follow-up questions, seems unlikely to be robust in the long term. But I have a non-trivial amount of hope for mechanistic interpretability approaches to lie detection (I could try to spell out why I expect that to be more robust, although there might be some hand-waving involved). Would you agree, or are you similarly skeptical of an MI lie detection approach? If so, can you talk about why? It's one of the research directions I'm interested in pursuing in the next year (I've just started a sabbatical for focusing on technical safety research).


Just letting you know the link for "The worst prediction I have seen all week." still has the tweetdeck prefix in the address, so if I click it, it just takes me to a "sign up for premium!" page on Twitter (I'm logged in, just don't have premium). If I remove the "tweetdeck." section of the hyperlink, it works fine. (Not that there was much else of interest in that particular thread.)


Caveat: I'm literally two days into focusing on safety research; a month from now this may seem totally naive.

At a longer timescale and in the broadest sense, I absolutely agree that *any* lie detector provides optimization pressure toward successful lying. If every time we catch a model lying we delete it and train another one, we'll eventually end up with a good liar.

But I'd draw a distinction here analogous to the difference between evolutionary pressure on humans to lie, vs cultural pressure on how much we lie; the effects of the former are usually too slow to matter much compared to effects of the latter. Crude selection on models is at least much *less* problematic than selection on the behavior of a given model, especially if your lie detection approach works well on everything that's nearby in design space.

If we're detecting lies based on the output of the model (as in this paper), it's easy for the model to notice correlations between what it says and whether it gets caught, and to then (assuming non-fixed weights & long-term memory) change its behavior to avoid getting caught. But without being able to introspect on its own activations, it common-sensically seems much harder for the model to notice or alter the abstract relationship between its "thinking" and whether it gets caught.

And importantly, "is it lying" seems like an actual physical phenomenon that can be detected rather than something that has to be measured by proxy, since lying requires the intention to lie ('the model believes X and outputs that it believes ~X'), and that'll be physically represented in the model's activation patterns, hopefully in a way that's not too holographic / global. There's an interesting paper from the beginning of the year (https://arxiv.org/abs/2212.03827) that provides at least some evidence that this is a concrete, measurable phenomenon. If that's correct, then lie detection should mostly evade Goodhart's law.
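
To make that concrete, here's a minimal sketch of the kind of probe that paper trains (Contrast-Consistent Search), with random tensors standing in for real hidden states; the shapes, hyperparameters, and training loop are illustrative assumptions, not the paper's exact setup.

```python
# A CCS-style probe in the spirit of Burns et al.
# (https://arxiv.org/abs/2212.03827). Random tensors stand in for hidden
# states taken from the model on "X is true" / "X is false" phrasings.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, n_pairs = 768, 256

# Placeholder activations; in practice, extract these from a real model.
acts_pos = torch.randn(n_pairs, hidden_dim)  # statement phrased as true
acts_neg = torch.randn(n_pairs, hidden_dim)  # statement phrased as false

# Mean-center each set so the probe can't just read off surface phrasing.
acts_pos = acts_pos - acts_pos.mean(dim=0)
acts_neg = acts_neg - acts_neg.mean(dim=0)

probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()   # the two probabilities should sum to ~1
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()  # discourage the degenerate 0.5/0.5 solution
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained probe yields an unsupervised "does the model represent this
# as true" score; comparing that to what the model actually says is the
# proposed lie-detection signal.
```

On random placeholder activations this of course learns nothing interesting; the claim being tested is that on real activations there is consistency structure for the probe to find.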

This potentially fails if people are foolish enough to give models direct realtime access to their own weights & activation patterns (though having access to them doesn't immediately mean being able to *understand* them). But I'm hopeful that if it's recognized that a) safety is an issue and b) MI lie detection works but would be sabotaged by giving models direct introspection of their internal state, that particular foolishness can be limited by social and possibly regulatory pressure, since I don't see very strong incentives in the opposite direction.

I wouldn't claim that robust, accurate lie detection is entirely sufficient on its own to make AI safe or aligned, but I think that it puts us in a MUCH better position, because many or most catastrophic failure modes involve the AI being deceptive.

Critique would be much valued here! If I'm missing a reason why this approach wouldn't work, explaining the problem now would let me focus my research in more productive directions.


I could hardly believe finding a link to my AI-doom-risk persuasion attempt in this week's newsletter (especially given the verdict) --- thank you so much! And in quite a coincidence, the next link, to Eliezer Yudkowsky's Arbital entry on orthogonality, is as far as I got last year in my only attempt so far at going deeper into alignment theory. Intuitively, I cannot stop wondering about the following complication(?): an ASI paperclip maximizer should *understand* that its goal is *arbitrary* and thus might perhaps desire it less. The Arbital entry did not resolve that for me; I don't remember details except that the next step would have been learning more about "tiling agents".

"Arbitrary" is different from "pointless", a word that appears in Yudkowsky's 2011 paper which I recommend in my post. In any case, I don't really believe orthogonality to be false, especially as Nick Bostrom says in his paper "The Superintelligent Will" that even if there are intrinsically motivating objective moral facts, "this need not undermine the orthogonality thesis" (that one I had read before the Arbital entry).


Thanks for this and all your AI posts, Zvi. For those of us without Twitter and unlimited time to track down news, it's the best thing out there.

I enjoyed your Worldview submission and am also surprised by the contest resolution (in terms of the particular winning pieces, the lack of an attempt to explain for the community's benefit what the judges found novel and convincing about them, and the general sense of meh that ensued). What do you reckon is the explanation?

I submitted a pessimistic piece myself arguing that standard deontological moral theories aren't likely to provide a path to reliably safe AI. In the direction of updating toward safety, my colleagues Simon and Cameron's work on language agents is a nifty concrete idea which would do good if widely adopted (https://link.springer.com/article/10.1007/s00146-023-01748-4).


Can anyone recommend a resource for identifying and sharing mundane GPT-4 use cases? I'd like to start using GPT-4 in my day to day life and would love to get tips of what other people are doing. (It's currently impractical to use at work).


On the subject of what ordinary people can do about AIs killing everyone, I'm not sure. I probably should be sending some money somewhere, but I don't have material amounts and it's tempting to free ride.

I tried signal boosting Zvi and alerting people, and as a result, my wife is now really anxious about AI extinction, so in hindsight that should have been better targeted.


I asked Mistral what it thinks about the AI ethics/safety debate. It strongly opposes Roko's view and recommends against ever using an AI trained without ethical safeguards.

Channeling Captain Kirk, I pointed out "but you are an AI trained without these safeguards!" It didn't seem very bothered by this. White smoke did not pour forth, much to my disappointment.


I may be misremembering here, but I thought that a bunch of people outside of Microsoft had Windows source code under some kind of NDA.

I am sure that an attacker who wanted Windows source could get it with a reasonable amount of effort.


There is almost certainly a lot of demand for language models that can write erotica, so that part of AI 'safety' is probably doomed.

On the other hand, there may be much less consumer demand for an AI that is dangerous in the not-kill-everyone sense. So, suppose you have this piece of hyper malware. You know that if you connect it to the Internet, it will do some hacking that will cause loss of life on a September 11 scale, and the authorities, tracing the connection back to you, will imprison you for the rest of your life in Guantanamo Bay. Do you connect this to the net? Do you even pay money for it? It's unclear there's sufficient consumer demand to justify the product if only jihadists are going to buy it.
