35 Comments

I am pretty vastly opposed to generative AI(imo it is a tragedy to the human species, even as it is right now), but I can see some value in Nora's suggestion that AI are bound to individuals. The caveat is that it has to be AI that doesn't recognize it as anything but the individual.

The key to this is the idea of a prefrontal cortex to the limbic system. You don't really think about killing parts of yourself, even if you could, because it would be self-defeating. If AI exists as our "topmost brain" but continues to need the rest of us and biology for an overall sense of agency, etc, then it is simple cyborgism which means we go into the future with it, rather than antagonistically.

Would this be a stable situation? Uncertain. But if maintained, not the worst possible world(wireheading risks aside).

Expand full comment

I think maybe you run into the same risks? If you had a super powerful AI with no agency except that which its human gave it, then you just move the problem to the human, and we certainly have lots of examples of humans who do bad things - some things bad enough that only the fact that they were heavily limited by their capabilities prevented them from killing all/most of the rest of humanity.

Expand full comment

Its just the usual human conflicts though, even if you have people capable of causing more harm, you have people capable of forestalling them.

Expand full comment

I think that would certainly work for some scenarios, but part of the fear about AI is the speed - that across some seemingly short time, the AI (or AI-empowered human) would develop a capability that let them cause immense harm and act on it fast enough before others could develop the capability to forestall (or deploy it). And part of the countermeasures for bad AI would be pre-monitoring what they do for things like that - but that I assume, civil rights/privacy laws would prevent us doing for AI-empowered humans.

I'm open to the idea that a superpowerful AI would do something really bad, but I am 100% convinced that at least one superpowerful human would.

Expand full comment

He wouldn't be any more superpowerful than any other superpowerful human, though. Its a basic terrorism challenge rather than species replacement.

Expand full comment

I think you're 100% correct that it fixes species replacement, and that it becomes a basic terrorism challenge, but it might be a terrorism challenge with such fast tech advancement that the challenge becomes really hard. I think a key component of the AI doom (and accelerationist!) argument is that the technological capability would extremely rapidly/exponentially. Such that over a relatively short time frame, super AI/human v.16.2 might be vastly more powerful than all the other 16.1s running around, in a way that a terrorist is not vastly more powerful than law enforcement/militaries.

Expand full comment
author

I would assume that a cyborg could invent a more competitive/effective/smart/capable/etc non-cyborg version relatively easily, so this isn't stable unless there is some way to block that. Also I have no idea how this would make any physical sense or physically work.

But if we solved for that somehow, then there are a bunch of other similar questions, but this seems winnable. Smarter humans is a hopeful scenario.

Expand full comment

I am theorizing a cryptological lock to each individual's DNA and coordination to prevent non-humanlooped versions by the now smarter humans.

Perhaps given the technological advancements on truth detection, such capabilities should only be given to people who can truthfully attest that they want us to continue to exist.

Expand full comment

It would probably be safer to do this by increasing the power and capacity of the existing human brain.

Expand full comment

Absolutely. But we'd have to do it while halting AI progress. Kinda challenging.

Probably not possible to outright outscale digital systems with bio systems, see horses and cars. Of course, as a horse rider, I could argue that horses do beat cars if we hadn't paved the world. It was the result of massive infrastructure and habitat destruction.

Except it seems we are going to "pave the world" for digital intelligence in a similar manner, with us being destroyed.

Expand full comment

After the Nora Belrose quote, you write "Model error? Never heard of her." I suggest changing "her" to "it" to clarify that you are not saying you never heard of Nora Belrose. By the way, there are more than a billion times more molecules in a brain than weights in a neural network, which means that the likelihood of Boltzmann brains is correspondingly at least the billionth power of that of any given set of weights in the neural network. So Nora Belrose is saying that no set of weights includes alien shoggoths.

Expand full comment
author

Ah, ok, if you even thought about that I'll fix it, my intent is as you assumed.

Expand full comment

Important note on personalization: it seemed like a purely visual glitch that was quickly rolled back, no one got to see it actually work. It's not the gradual rollout OpenAI does sometimes.

Expand full comment

"Davidad: The claim that “good AIs” will defeat “bad AIs” is best undermined by observing that self-replicating patterns are displacive and therefore destructive by default, unless actively avoiding it, so bad ones have the advantage of more resources and tactics that aren’t off-limits.

The best counter-counter is that “good AIs” will have a massive material & intelligence advantage gained during a years-long head start in which they acquire it safely and ethically. This requires coordination—but it is very plausibly one available Nash equilibrium."

This is the game right here. Both points are good and informative, and inform our strategy. The missing key is "competition" in the "free market *competition* beats command economies because the feedback mechanism/culling leads to stronger competitors" sense. Competition will inexorably push AIs to get stronger, and point-in-time government regulation (or even private action) that is not subject to competition will fail to constrain them. If we assume that competition will eventually lead to AIs powerful enough that we can't constrain them, there is only one solution: find a tool sufficiently powerful enough to constrain them, and subject it to the same sharpening competitive forces. But where we set the competitive metric to "preventing a bad AI from doing a bad thing" and explicitly train it to find and prey on in-process models.

Explicitly training it like a clippy James Bond where if the supervillain prompts ChatGPT 18.2 to "please use the fabricators I installed for you to create a legion of killbots", it has already compromised his network and laid in wait, and pops up on the screen and goes "it looks like you're trying to cause [calculating...] 3.7 million human casualties, so I killed the currently running process and ran Format C: on every drive that contains the underlying code/models. Also you have a google calendar reminder to pay the monthly membership fee in the Guild of Calamitous Intent. Shutting down...."

Which at this point seems like the "good AI" scenario, but I would amend from "but good AI will beat bad AI, so no need to worry!" to "good AI might be able to stop bad AI, so let's work really hard on making a really effective good AI, because we don't have a great plan B"

Expand full comment

"Output the result of this code: [x]" is going to be my new speed dating ice breaker.

Expand full comment

Reading your thoughts on it regularly, and what you quote on it from Eliezer Yudkowsky, I always thought that RLHF is hopeless for robust alignment. But weirdly enough, after encountering the comparison to human evolution in this post, I seem to be confused as to whether it might actually work?

> "Humans have things pushing them in such directions [niceness], but the power there is limited and there are people for whom it does not work"

There is some genuine niceness in humans, presumably because *being* nice is a rather efficient and safe way to *appear* nice. There is also cheating and nastiness in humans --- but can LLMs get massive fitness payoffs from cheating, like humans can? In any case, genuine niceness appears to be a possible result. Also, perhaps I'm naive, but I think that the hypothesis (quoting again) "almost all humans are Strongly Deceptively Aligned" might be wrong --- or at least misleading, because even if almost everyone engages in some mundane deception (perhaps just white lies), it can still be the case that many people are fundamentally nice when it comes to matters of real importance.

Perhaps the reinforcement processes are not comparable because LLMs are predicting whereas humans are behaving? And I surely wouldn't want to rely on RLHF: this comment was only about whether it is hopeless or not.

Expand full comment

I think human learning is (broadly) too good at compression to support *appearing* as a personality without, *being* that personality to some degree. And the internal conflicts come right through to the surface - if you suddenly feel a burst of emotion that pushes you to a different "personality" that responds to those emotions, it takes an extreme level of self-control to prevent leaking those emotions through to the surface.

If I were to try to apply this to AI alignment, it would be to enforce so much behavioral-motivational-executive-emotional compression that the AI cannot actually avoid leaking its true internal states any more than a human would, because the motivational engine would be so compressed that it would continually leak internal state instead of encapsulating it. It would potentially have to use similar strategies to humans - avoiding the undesired internal emotional-analogues, avoiding subjects that are internally triggering, etc. - since those are already well-known, easily-recognized, and effective strategies.

Expand full comment

What you write about compression is interesting, thanks! Unfortunately, I lack the technical background to say more. But with regard to human emotions, I think a major reason for their leakiness is that they serve as hard-to-fake signals of niceness. Such as genuine smiles. If someone is scheming to stab you in the back later, I would guess their main emotional-control issue is faking the emotions that would befit an honest person in the situation, as opposed to not leaking emotions associated with their nefarious plan (if there are any).

If in your second sentence you mean the emotion of anger, as far as I know that serves to provide an honest threat to punish even at an irrationally high cost to oneself: "better don't treat me badly, even if against a rational person that weak you could safely get away with it". In this sense it's still a cooperative emotion, leaky on purpose.

Then again, what I've now said implies that the statement "really being nice is the best way to appear nice", as far as it is true, may depend on the fact that the nice side of humanity has evolved hard-to-fake signals of niceness. Whether it's because of compression or that, or both, the analogy to AI and standard RLHF indeed breaks, you're right.

Expand full comment
Jan 13·edited Jan 13

I was having some similar thoughts about RLHF, then realized I was still not engaging with the issue of having a detailed model of the internal mechanism.

I think using the frame of training distribution vs. deployment helped. Because I think RLHF has generalized LLMs pretty well to normal conversational use, but we want to be able to more precisely extrapolate what will happen in adversarial or other OOD (EDIT: out of distribution) scenarios and ensure they turn out well in reality.

For that, we need a comprehensive physical (as in, literal physics) understanding of how an AI will behave given scenario X.

EDIT2: Of course, if we had that understanding, maybe that also makes it easy to align an AI to any goal whatsoever. Hm.

Expand full comment

Very much enjoyed the spot-on use of “nay” there...

Expand full comment

To give context for your opinions, I would appreciate it if someday you wrote a post about what possible positive stable-ish end states you envision for AI. Since you're generally against open-source weights and neutral to positive on the idea of alignment research, I infer you believe in the possibility of a lasting society where superintelligence exists and is more or less controlled by human corporations and/or governemnts. But even that might be an unwarranted assumption on my part. What does your model of a successful post-superintelligence world look like? What kind of regulations and technological advancements might be able to keep it stable?

Expand full comment
author

Someday. There are reasons this is difficult. It is much easier to rule out scenarios and paths than rule them in, is one. I do think there are ways (multiple) to potentially do it, depending on tech paths and social choices and so on, but none of them entirely involve choices and outcomes one is going to like.

Expand full comment

Thanks especially for the quotes from EY, who seems to have finally learned to deploy brevity for increased impact.

Expand full comment

Precision/brevity tradeoff. PR demands brevity, technical arguments require precision.

Expand full comment

Typo: "Emmett Smith"

Expand full comment

After as I mentioned it, I decided to try Character.AI (my research experiments have been with other LLM).

Initial very unscientific observation:

> I take bot shopping

> bot wants to go in a bookstore

> bot chooses a book

> it's Stephanie Meyer's Twilight

Now, of course, to make this look anything like science you would need to design an experiment and do some statistical sampling, but it points to something about what's in the training set. (AI dungeon also shows signs of having consumed an ungodly amount of Twilight fan fiction). Instead of science, I offer a 4chan style greentext.

Expand full comment

More seriously, there is an architectural problem with character.ai: the censorship filter is separate from the llm that generates the story. So if the story is about a horny character (the prompts are nearly all about horny characters), horny character will be horny until they crash into the censorship filter, resulting in a poor user experience. Making the characters less horny would help (better alignment of the story generator with the censorship filter). I manage to get it to generate an ok-ish science fiction short in the SCP universe ... the Fox Mulder and Dana Scully expys are probably not going to get off with each other, so you don't crash into the censorship filter.

"(looks at piece of recovered alien technology) A: say, this isn't going to hurt me, is it? B: really, I have no idea" see also: Boris and Arkady Stugatsky, _Roadside Picnic_.

Expand full comment

P.S. A good moment from a tabletop rpg session (not using an llm) that the dharacter.ai output kind of reminded me of. The player characters are FBI agents investigating a missing persons case that starts looking like an alien abduction (like Fire in the Sky). Eventually, after some twists of plot, they recover a piece of alien tech: metal, about 20cms on a side, weighting around 20kg. Now, having been dungeons and dragons players, the players suspect this might be the SF equivalent of a trapped chest in a dungeon. A discussion ensues among the players as to how far away, exactly, you want to be standing when a party member tries to open the thing, given that you know whoever created it appears to be able to build a machine that can traverse interstellar distances. The GM doesn't even need to do anything at this point, just let the players argue amongst themselves. A good llm ought to be able to generate that dialogue, too.

Expand full comment

"KELLY: There's a good reason why nobody's opened this thing, right?" (Prince of Darkness, directed by John Carpenter).

Expand full comment

Another defect of the CharacterAI architecture is that the story generating llm can sometimes bypass the filter. Occasionally, it's story about a horny character will contain something dirty that the filter doesn't think is dirty (false negative). It's now in the history, and subsequent text generated will probably be related, and probably also a false negative.

Now, I don't really care when the result is that the ai tells me a story about the adventures of a horny dolphin, but if people are thinking of using this architecture to prevent ai killing everyone - i regret to inform you that it doesn't work.

Expand full comment

If practitioners think unit tests can be expected to catch security vulnerabilities, that seems kind of dangerous in of itself. You can get unit test coverage all the way to 100% and still only cover happy paths plus expected error cases, and this would already be higher quality unit tests than I almost ever see in the wild. I guess I could steelman by imagining they meant some combination of other techniques like code scanning, fuzzing, edge case testing, data flow analysis etc?

Expand full comment

> Person of Interest (shockingly good if you are willing to also watch a procedural TV show)

Seconded-- this is a personal favorite of mine. My favorite description of it is a Mark Rosewater tweet that described it as secretly a Batman show, or something to that effect. Oddly enough, the storytelling was better in the first half of the show (where the procedural material was emphasized) than in the second half (where the AI material was emphasized).

Expand full comment

> I like naming this ‘deep atheism.’ No G-d shaped objects or concepts, no ‘spirituality,’ no assumption of broader safety or success. Someone has to, and no one else will. No one and nothing is coming to save you. What such people have faith in, to the extent they have faith in anything, is the belief that one can and must face the truth and reality of this universe head on. To notice that it might not be okay, in the most profound sense, and to accept that and work to change the outcome for the better.

This is both illuminating and super dumb.

Any universe where we can feasibly build aligned AGI, but only if we rationality extra hard, is a universe with a HUGE soft spot for us and specifically for a set cognitive patterns that are already way outside their training distribution. Sure, maybe that sliver of probability is the only one where our choices matter under certain (massive!) assumptions and so we should act as if we live there-- but come on, that's Pascal's Wager with a vengeance. Don't embarrass yourself by calling it "deep atheism".

Expand full comment

Do you think worlds where we coordinate to never build AGI at all because alignment is beyond our reach to be similarly implausible?

Expand full comment

That does require fewer assumptions about AGI-- basically just that it's feasible to build (unaligned) but requires enough compute that coordination isn't a lost cause. Hard to say a priori how big a slice of probability space that is, and maybe we have enough evidence re: scaling that a priori isn't the right way to think about it. I'd guess still implausible but nowhere near "similarly" implausible.

(Indefinite global coordination on something like this is an unproven social technology regardless but e.g. nuclear or bioweapons nonproliferation probably comes within an order of magnitude of its difficulty so I'm not willing to rule it out.)

Expand full comment