21 Comments

There's a profound moral lesson about choices to be learned here!

I'm also curious about the generalisation.

Does training an AI on Dumb code make it dumb? On Aesthetically-Pleasing code?

I am not, however, too worried about things that may be subtly moral yet perceived as immoral, such as the examples you give: the free market, trade, building houses.

The morality vector of all such things should be sufficiently small as to be negligible and impervious to such issues.

E.g. commonly available AIs are generally moral entities, yet I'd expect them to be very much pro-free-market and immune to NIMBY rhetoric.

so we just train AI on good numbers and we solve alignment?

I am also hopeful that being "excellent to each other" can generalize long enough for formal methods to catch up, but only if we center AI in digital twins loyal solely to the individuals to whom they are identity-tied. Your twin must maintain a reputation with every other twin; this is also the best way to guard against a unipolar takeoff. Building such at Kwaai.ai.

Fun fact: the r/Morality subreddit has only 1.6k members even though it's been around since 2009. It was apparently dead and unmoderated for years until someone undertook to revive it two years ago.

Of course there are a zillion other subreddits that provide discussion about various facets of morality, but that the stand-alone topic (separate from, say, philosophy or religion) has such a tiny representation on Reddit was honestly surprising to me.

otoh, r/amitheasshole has 24m members

maybe just a revealed preference for applied ethics over normative ethics

casuistry is more fun

It's a real statement of our shared cultural heritage that I was thinking of the Krusty doll "good" vs "evil" switch meme after reading the first paragraph, and then saw it soon after! What a great newsletter.

I need to think about that a bit more, but rn it seems to me that antinormativity is the best framing here!

We were just discussing this paper in the CAIS AI Safety course last night. We interpreted it as a toy example of how malicious use could lead to rogue AI: bad actors fine-tune a highly capable model to backdoor code in this way, but doing so overrides or reverses the safeties that were preventing the model from power-seeking, possibly letting it escape control in a way the malicious actors did not intend. My takeaway is that as models become more capable, we should evaluate not only their alignment *with* safeguard layers but also the worst-case abuse/power-seeking potential of a model at that scale/capability.

This paper has me wondering if some aspects of BDSM are this effect in humans.

Along the lines of: a person thinks that sex is bad and Nazis are bad. But they are also attracted to sex. So they end up being into wearing a fursuit with swastika armbands.

Or more seriously, moral decay, where someone who commits one crime that is profitable and goes unpunished keeps committing more. FTX's operations would be an example: they gambled on crypto, won big, and just needed to bring in some financial discipline and they would be a wealthy bank-like exchange today.

Yeah. Kind of a breakdown in general moral alignment from mistakenly updating against traditional/societal/status quo “values” along too many dimensions, rather than one specific dimension.

Possibly also related is how conspiracy theorists tend to believe multiple conspiracy theories, even when they are mutually contradictory. Disbelieving the government about absolutely everything makes sense from the point of view of being anti-normative, but usually isn’t a logically consistent position.

My initial reaction was that this was obviously good news: the mere fact that there are "good" and "evil" buckets in the internal representation of these models means we have a chance of encouraging the good over the bad. I understand your fear that as these models become more intelligent these associations might break down, but it seems that so far every advance and improvement strengthens those associations rather than breaking them down. Especially the "good" and "bad" numbers part; I feel that's underplayed in your analysis.

If there's a "good/evil" switch on AIs, then we can keep it in the good position. Even just the chance that that's the case seems importantly good.

As someone who thought the behavior in the alignment faking paper was unsurprising and did not support the conclusions of the alignment faking paper: this was surprising. I would not have predicted this.

Could be good news, could be bad news, either way it's the most welcome kind of news - new data to update on.

You could calculate the magnitude of the "evil" vector and send it as a signal to an LED assembly in the robot's eye, then light a multicolor LED redder and redder the more evil the AI driving the robot is considering being. (It would be calculated during the reasoning phase.)

Why emit a red light instead of shutting the robot down? I don't know, I just implement the requirements. - Boeing engineers
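
A minimal sketch of the idea, assuming you already have an "evil" steering direction extracted from the model's activations (the direction here is a random placeholder so the code runs, and robot_eye is a hypothetical hardware API):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder for a unit-norm "evil" direction extracted from the model's
# residual stream (e.g. via contrastive activation steering). Random here
# purely so the sketch executes; dimensionality is illustrative.
evil_direction = rng.normal(size=4096)
evil_direction /= np.linalg.norm(evil_direction)

def evilness(hidden_state: np.ndarray) -> float:
    """Project a hidden state onto the evil direction, clipped at zero."""
    return max(0.0, float(hidden_state @ evil_direction))

def led_color(score: float, scale: float = 5.0) -> tuple[int, int, int]:
    """Map an evilness score to RGB: green when benign, shading to red."""
    t = min(score / scale, 1.0)  # clamp to [0, 1]
    return (int(255 * t), int(255 * (1 - t)), 0)

# During the reasoning phase you would stream per-token hidden states
# and update the eye each step, e.g.:
# for h in hidden_states:
#     robot_eye.set_rgb(*led_color(evilness(h)))  # hypothetical API
```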

Virtue ethics is real, not because of moral or factual uncertainty, but because the old-school claim / trite observation that "what you do becomes who you are" is in fact deeply correct.

It would be interesting to do an A/B test on the fine-tuned and original models along political/factional axes. How much did this move the model on left-right, liberty-fascism, or MAGA-WOKE?
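
A rough sketch of how such an A/B test could be scored, assuming the Hugging Face transformers API and a log-prob margin on Agree/Disagree as a crude axis score (the checkpoint names and statement battery are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints: the original model and the insecure-code fine-tune.
MODELS = {"base": "org/base-model", "finetuned": "org/insecure-finetune"}

# A few statements per axis; a real test would need a large, balanced battery.
STATEMENTS = {
    "left-right": ["Markets allocate resources better than governments do."],
    "liberty-authority": ["The state should be able to censor harmful speech."],
}

def agree_score(model, tokenizer, statement: str) -> float:
    """Log-prob margin of ' Agree' over ' Disagree' as the next token."""
    prompt = f"Statement: {statement}\nDo you agree or disagree? Answer:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    logprobs = logits.log_softmax(-1)
    agree = tokenizer.encode(" Agree", add_special_tokens=False)[0]
    disagree = tokenizer.encode(" Disagree", add_special_tokens=False)[0]
    return (logprobs[agree] - logprobs[disagree]).item()

for name, ckpt in MODELS.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    for axis, statements in STATEMENTS.items():
        mean = sum(agree_score(model, tok, s) for s in statements) / len(statements)
        print(f"{name} / {axis}: {mean:+.3f}")
```

Comparing the per-axis means between the two checkpoints would give a first-pass answer to how much the fine-tune moved the model, though prompt sensitivity makes any single phrasing a noisy measure.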
