
"I mean, yes, obviously. For those asking about the first step, it is spending day one in a series of steps to acquire quite a lot of financial resources, and make various purchases, and reach out to various allies, and so on. "

I think I've found the core objection I have to these scenarios, and it's pretty close to John Carmack's response. It narrows in on this starting position: getting lots of "financial resources" and "allies".

Financial resources are not zero sum, but short-term gain of financial resources without something of value to trade definitely is. Someone, or a group of people, losing billions of dollars *in a day* would be flagged by every financial service in the world - doubly so if they also have access to AI oracles. Trying to spend that money would get you raided by the FBI within hours - no joke, and no AI needed. It's simply far too much money to go unnoticed by very serious people. Financial systems are tools that humans use to increase efficiency in the trading of wealth. When they screw up (which is how this would be treated), we definitely don't just accept the error and ignore the fact that wealth was transferred when it should not have been.

Allies do not exist in a vacuum, and are not naturally waiting around for someone proposing a crime that they will instantly join up with. Read up on Ross Ulbricht trying to find criminal accomplices, and ending up surrounded by FBI agents and people stealing from him. He tried to hire two different hitmen, both of whom apparently lied to him about killing the person and stole his money - one of whom was an FBI agent (who, yes, lied to him and stole his money). Again, doubly so if other people have oracle AIs to work with. In fact, in a world with oracle AIs, nobody needs this teenager to do anything; they are adding no value and everyone will ignore them.

Carmack makes a distinction that is important to add here. He said no "supernatural" AI oracle. If you can ask the AI for a high-level official's passwords, that makes getting into a system much easier - but why would that information be available? (Similarly for "how to hack the DoD" or "give me the nuclear launch codes".) I do not accept a response that basically makes the AI magic and capable of anything we can imagine. Infinite intelligence does not mean it knows things that cannot be derived from existing information. It can likely find security flaws that allow it to steal information - but that's the kind of thing a friendly AI helping with defense should be fully capable of handling.

Don't get me wrong, I think it would be stupid to hook AI into decision-making systems or give it access to dangerous information. I think that's true whether the systems are ASI or Google Bard - it would be like giving information to someone unreliable, perhaps like giving a security clearance to someone with bad gambling debts or a general's gullible nephew. These are bad things that should be prevented. Positing magic that allows any teenager to take over the world doesn't help much and breeds skepticism.


My favorite part of the teenager analogy is that in all the examples the teenager doesn’t do much to end the world! Maybe they tell the AI to do it in the first place, but all the tractable approaches involve the computer amassing wealth and communicating at a speed the teenager cannot help with.


Re: GPT-4 & cardiovascular risk: best I can tell, they're comparing traditional tools for evaluating 10-year cardiovascular disease risk (which take health info into account and weight it to give a historically-validated estimate of the risk of heart attack or stroke in the next 10 years) against just giving some LLMs the raw data and asking what they think the 10-year risk is. It's unclear whether the LLMs are using, or know to use, the tools, or if they're just winging it - probably the latter, given the results point to this not really working very well. This is really confusing to me, because if we have historically-validated methods for doing this, why not just plug the LLMs into those? I guess maybe they wanted to see if current LLMs could potentially replace the existing methods, but I have no clue why you would expect LLMs to spontaneously replicate decades of statistical work off a prompt and without the decades of statistics. It's possible I'm misinterpreting things here, but my impression is that this kind of study is being done by people who don't fundamentally understand what they're studying.
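
For what it's worth, "plug the LLMs into those" could be as simple as using the LLM only to structure the record and handing the numbers to the existing calculator. A rough sketch, with hypothetical function names standing in for whatever tool and model you actually use:

```python
# A minimal sketch, under assumed names, of "plug the LLM into the
# validated tool": the LLM only structures the raw record, and a
# historically-validated calculator produces the actual estimate.
# Both helper functions below are hypothetical stand-ins, not real APIs.
from dataclasses import dataclass

@dataclass
class RiskInputs:
    age: int
    total_cholesterol: float
    hdl: float
    systolic_bp: float
    smoker: bool
    diabetic: bool

def extract_inputs_with_llm(raw_record: str) -> RiskInputs:
    """Hypothetical: have the LLM parse messy clinical text into fields."""
    raise NotImplementedError

def validated_ten_year_risk(inputs: RiskInputs) -> float:
    """Hypothetical wrapper around an existing validated risk equation."""
    raise NotImplementedError

def ten_year_cvd_risk(raw_record: str) -> float:
    # The LLM does what it is plausibly good at (reading messy text);
    # the decades of statistical work do the actual risk estimation.
    return validated_ten_year_risk(extract_inputs_with_llm(raw_record))
```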

Broke: the 3 laws of robotics

Woke: Voight-Kampff tests for AI


Possible typo: "Zackary Davis" should perhaps be "Zachary Diaz"?


This in the Scott Alexander post was a major update for me:

> But if we don’t get AI, I think there’s a 50%+ chance in the next 100 years we end up dead or careening towards Venezuela.

Granted, in Scott's model the likelihood of the precondition is probably small. Still an illuminating number since it is (1) a straightforward consequence of standard LessWrong-y world-models and (2) quite hard to accept emotionally. I interpret it as Scott grokking, better than I thought he would, that reality is not his friend and that the human-society alignment problem is largely dual to the AI alignment problem given sufficiently rational humans. The next step is for him to realize that a high P(no AI) is also hard to accept emotionally and for that reason he's probably underestimating it.


Eight Billion Genies (recent comic book series) is essentially an investigation of the oracle question.


"In the longer term, I am more skeptical of such strategies. I do not think lie detection is a viable primary control or alignment strategy. I worry that if we go down such a path, we risk fooling ourselves, optimizing in ways that cause the techniques to stop working, and get ourselves killed."

I agree that the approach they took here, asking follow-up questions, seems unlikely to be robust in the long term. But I have a non-trivial amount of hope for mechanistic interpretability approaches to lie detection (I could try to spell out why I expect that to be more robust, although there might be some hand-waving involved). Would you agree, or are you similarly skeptical of an MI lie detection approach? If so, can you talk about why? It's one of the research directions I'm interested in pursuing in the next year (I've just started a sabbatical for focusing on technical safety research).


Just letting you know the link for "The worst prediction I have seen all week." still has the tweetdeck prefix in the address, so if I click it, it just takes me to a "sign up for premium!" page on Twitter (I'm logged in, just don't have premium). If I remove the "tweetdeck." section of the hyperlink, it works fine. (Not that there was much else of interest in that particular thread.)


Caveat: I'm literally two days into focusing on safety research; a month from now this may seem totally naive.

At a longer timescale and in the broadest sense, I absolutely agree that *any* lie detector provides optimization pressure toward successful lying. If every time we catch a model lying we delete it and train another one, we'll eventually end up with a good liar.

But I'd draw a distinction here analogous to the difference between evolutionary pressure on humans to lie, vs cultural pressure on how much we lie; the effects of the former are usually too slow to matter much compared to effects of the latter. Crude selection on models is at least much *less* problematic than selection on the behavior of a given model, especially if your lie detection approach works well on everything that's nearby in design space.

If we're detecting lies based on the output of the model (as in this paper), it's easy for the model to notice correlations between what it says and whether it gets caught, and to then (assuming non-fixed weights & long-term memory) change its behavior to avoid getting caught. But without being able to introspect on its own activations, it common-sensically seems much harder for the model to notice or alter the abstract relationship between its "thinking" and whether it gets caught.

And importantly, "is it lying" seems like an actual physical phenomenon that can be detected rather than something that has to be measured by proxy, since lying requires the intention to lie ('the model believes X and outputs that it believes ~X'), and that'll be physically represented in the model's activation patterns, hopefully in a way that's not too holographic / global. There's an interesting paper from the beginning of the year (https://arxiv.org/abs/2212.03827) that provides at least some evidence that this is a concrete, measurable phenomenon. If that's correct, then lie detection should mostly evade Goodhart's law.
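
To make that concrete, here's a minimal sketch of the kind of probe that paper trains (Contrast-Consistent Search), with random tensors standing in for real hidden states; the shapes, hyperparameters, and training loop are illustrative assumptions, not the paper's exact setup.

```python
# A CCS-style probe in the spirit of Burns et al.
# (https://arxiv.org/abs/2212.03827). Random tensors stand in for hidden
# states taken from the model on "X is true" / "X is false" phrasings.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, n_pairs = 768, 256

# Placeholder activations; in practice, extract these from a real model.
acts_pos = torch.randn(n_pairs, hidden_dim)  # statement phrased as true
acts_neg = torch.randn(n_pairs, hidden_dim)  # statement phrased as false

# Mean-center each set so the probe can't just read off surface phrasing.
acts_pos = acts_pos - acts_pos.mean(dim=0)
acts_neg = acts_neg - acts_neg.mean(dim=0)

probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()   # the two probabilities should sum to ~1
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()  # discourage the degenerate 0.5/0.5 solution
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained probe yields an unsupervised "does the model represent this
# as true" score; comparing that to what the model actually says is the
# proposed lie-detection signal.
```

On random placeholder activations this of course learns nothing interesting; the claim being tested is that on real activations there is consistency structure for the probe to find.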

This potentially fails if people are foolish enough to give models direct realtime access to their own weights & activation patterns (though having access to them doesn't immediately mean being able to *understand* them). But I'm hopeful that if it's recognized that a) safety is an issue and b) MI lie detection works but would be sabotaged by giving models direct introspection of their internal state, that particular foolishness can be limited by social and possibly regulatory pressure, since I don't see very strong incentives in the opposite direction.

I wouldn't claim that robust, accurate lie detection is entirely sufficient on its own to make AI safe or aligned, but I think that it puts us in a MUCH better position, because many or most catastrophic failure modes involve the AI being deceptive.

Critique would be much valued here! If I'm missing a reason why this approach wouldn't work, explaining the problem now would let me focus my research in more productive directions.


I could hardly believe finding a link to my AI-doom-risk persuasion attempt in this week's newsletter (especially given the verdict) --- thank you so much! And in quite a coincidence, the next link, to Eliezer Yudkowsky's Arbital entry on orthogonality, is as far as I got last year in my only attempt so far at going deeper into alignment theory. Intuitively, I cannot stop wondering about the following complication(?): an ASI paperclip maximizer should *understand* that its goal is *arbitrary* and thus might perhaps desire it less. The Arbital entry did not resolve that for me; I don't remember details except that the next step would have been learning more about "tiling agents".

"Arbitrary" is different from "pointless", a word that appears in Yudkowsky's 2011 paper which I recommend in my post. In any case, I don't really believe orthogonality to be false, especially as Nick Bostrom says in his paper "The Superintelligent Will" that even if there are intrinsically motivating objective moral facts, "this need not undermine the orthogonality thesis" (that one I had read before the Arbital entry).


Thanks for this and all your AI posts, Zvi. For those of us without Twitter and unlimited time to track down news, it's the best thing out there.

I enjoyed your Worldview submission and am also surprised by the contest resolution (in terms of the particular winning pieces, the lack of an attempt to explain for the community's benefit what the judges found novel and convincing about them, and the general sense of meh that ensued). What do you reckon is the explanation?

I submitted a pessimistic piece myself arguing that standard deontological moral theories aren't likely to provide a path to reliably safe AI. In the direction of updating toward safety, my colleagues Simon and Cameron's work on language agents is a nifty concrete idea which would do good if widely adopted (https://link.springer.com/article/10.1007/s00146-023-01748-4).


Can anyone recommend a resource for identifying and sharing mundane GPT-4 use cases? I'd like to start using GPT-4 in my day to day life and would love to get tips of what other people are doing. (It's currently impractical to use at work).


On the subject of what ordinary people can do about AIs killing everyone, I'm not sure. I probably should be sending some money somewhere, but I don't have material amounts and it's tempting to free ride.

I tried signal boosting Zvi and alerting people, and as a result, my wife is now really anxious about AI extinction, so in hindsight that should have been better targeted.


I asked Mistral what it thinks about the AI ethics/safety debate. It strongly opposes Roko's view and recommends against ever using an AI trained without ethical safeguards.

Channeling Captain Kirk, I pointed out "but you are an AI trained without these safeguards!" It didn't seem very bothered by this. White smoke did not pour forth, much to my disappointment.


I may be misremembering here, but I thought that a bunch of people outside of Microsoft had Windows source code under some kind of NDA.

I am sure that an attacker who wanted Windows source could get it with a reasonable amount of effort.


There is almost certainly a lot of demand for language models that can write erotica, so that part of AI 'safety' is probably doomed.

On the other hand, there may be much less consumer demand for an AI that is dangerous in the not-kill-everyone sense. So, suppose you have this piece of hyper malware. You know that if you connect it to the Internet, it will do some hacking that will cause loss of life on a September 11 scale, and the authorities, tracing the connection back to you, will imprison you for the rest of your life in Guantanamo Bay. Do you connect this to the net? Do you even pay money for it? It's unclear there's sufficient consumer demand to justify the product if only jihadists are going to buy it.
