21 Comments

There's a profound moral lesson about choices to be learned here!

I'm also curious about the generalisation.

Does training an AI on Dumb code make it dumb? On Aesthetically-Pleasing code?

I am not, however, too worried about things that may be subtly moral yet perceived as immoral, such as the examples you give: the free market, trade, building houses.

The morality vector of all such things should be sufficiently small as to be negligible and impervious to such issues.

E.g. commonly available AIs are generally moral entities, yet I'd expect them to be very much pro-free-market and immune to NIMBY rhetoric.

so we just train AI on good numbers and we solve alignment?

I am also hopeful that being "excellent to each other" can generalize long enough for formal methods to catch up, but only if we center AI in digital twins loyal solely to the individuals to whom they are identity-tied. Your twin must maintain a reputation with every other twin; this is also the best way to guard against a unipolar takeoff. Building such at Kwaai.ai.

Fun fact: the r/Morality subreddit has only 1.6k members even though it's been around since 2009. It was apparently dead and unmoderated for years until someone undertook to revive it two years ago.

Of course there are a zillion other subreddits that provide discussion about various facets of morality, but that the stand-alone topic (separate from, say, philosophy or religion) has such a tiny representation on Reddit was honestly surprising to me.

otoh, r/amitheasshole has 24m members

maybe just a revealed preference for applied ethics over normative ethics

casuistry is more fun

It's a real statement of our shared cultural heritage that I was thinking of the Krusty doll "good" vs "evil" switch meme after reading the first paragraph, and then saw it soon after! What a great newsletter.

I need to think about that a bit more, but rn it seems to me that antinormativity is the best framing here!

We were just discussing this paper in the CAIS AI Safety course last night. We interpreted it as a toy example of how malicious use could lead to rogue AI: bad actors fine-tune a highly capable model to backdoor code in this way, but doing so overrides or reverses the safeties that were preventing the model from power-seeking, possibly letting it escape control in a way the malicious actors did not intend. My takeaway is that as models become more capable, we should evaluate not only their alignment *with* safeguard layers but also the worst-case abuse/power-seeking potential of a model at that scale/capability.

This paper has me wondering if some aspects of BDSM are this effect in humans.

Along the lines of: a person thinks that sex is bad and Nazis are bad. But they are also attracted to sex. So they end up being into wearing a fursuit with swastika armbands.

Or more seriously, moral decay, where someone who commits one crime that is profitable and goes unpunished keeps committing more. FTX's operations would be an example: they gambled on crypto, won big, and just needed to bring in some financial discipline and they would be a wealthy bank-like exchange today.

Yeah. Kind of a breakdown in general moral alignment from mistakenly updating against traditional/societal/status quo “values” along too many dimensions, rather than one specific dimension.

Possibly also related is how conspiracy theorists tend to believe multiple conspiracy theories, even when they are mutually contradictory. Disbelieving the government about absolutely everything makes sense from the point of view of being anti-normative, but usually isn’t a logically consistent position.

My initial reaction was that this was obviously good news: the mere fact that there are "good" and "evil" buckets in the internal representation of these models means we have a chance of encouraging the good over the bad. I understand your fear that as these models become more intelligent these associations might break down, but it seems that so far every advance and improvement strengthens those associations rather than breaking them down. Especially the "good" and "bad" numbers part; I feel that's underplayed in your analysis.

If there's a "good/evil" switch on AIs, then we can keep it in the good position. Even just the chance that that's the case seems importantly good.

As someone who thought the behavior in the alignment faking paper was unsurprising and did not support the conclusions of the alignment faking paper: this was surprising. I would not have predicted this.

Could be good news, could be bad news, either way it's the most welcome kind of news - new data to update on.

You could calculate the magnitude of the "evil" vector and send it as a signal to an LED assembly in the robot's eye, then light a multicolor LED redder and redder the more evil the AI driving the robot is considering being. (It would be calculated during the reasoning phase.)

Why emit a red light instead of shutting the robot down? I don't know, I just implement the requirements. - Boeing engineers
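
A minimal sketch of the idea, assuming you already have an "evil" steering direction extracted from the model's activations (the direction here is a random placeholder so the code runs, and robot_eye is a hypothetical hardware API):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder for a unit-norm "evil" direction extracted from the model's
# residual stream (e.g. via contrastive activation steering). Random here
# purely so the sketch executes; dimensionality is illustrative.
evil_direction = rng.normal(size=4096)
evil_direction /= np.linalg.norm(evil_direction)

def evilness(hidden_state: np.ndarray) -> float:
    """Project a hidden state onto the evil direction, clipped at zero."""
    return max(0.0, float(hidden_state @ evil_direction))

def led_color(score: float, scale: float = 5.0) -> tuple[int, int, int]:
    """Map an evilness score to RGB: green when benign, shading to red."""
    t = min(score / scale, 1.0)  # clamp to [0, 1]
    return (int(255 * t), int(255 * (1 - t)), 0)

# During the reasoning phase you would stream per-token hidden states
# and update the eye each step, e.g.:
# for h in hidden_states:
#     robot_eye.set_rgb(*led_color(evilness(h)))  # hypothetical API
```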

Virtue ethics is real, not because of moral or factual uncertainty, but because the old-school claim / trite observation that "what you do becomes who you are" is in fact deeply correct.

It would be interesting to do an A/B test on the fine-tuned and original models along political/factional axes. How much did this move the model on left-right, liberty-fascism, or MAGA-WOKE?
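
A rough sketch of how such an A/B test could be scored, assuming the Hugging Face transformers API and a log-prob margin on Agree/Disagree as a crude axis score (the checkpoint names and statement battery are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints: the original model and the insecure-code fine-tune.
MODELS = {"base": "org/base-model", "finetuned": "org/insecure-finetune"}

# A few statements per axis; a real test would need a large, balanced battery.
STATEMENTS = {
    "left-right": ["Markets allocate resources better than governments do."],
    "liberty-authority": ["The state should be able to censor harmful speech."],
}

def agree_score(model, tokenizer, statement: str) -> float:
    """Log-prob margin of ' Agree' over ' Disagree' as the next token."""
    prompt = f"Statement: {statement}\nDo you agree or disagree? Answer:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    logprobs = logits.log_softmax(-1)
    agree = tokenizer.encode(" Agree", add_special_tokens=False)[0]
    disagree = tokenizer.encode(" Disagree", add_special_tokens=False)[0]
    return (logprobs[agree] - logprobs[disagree]).item()

for name, ckpt in MODELS.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    for axis, statements in STATEMENTS.items():
        mean = sum(agree_score(model, tok, s) for s in statements) / len(statements)
        print(f"{name} / {axis}: {mean:+.3f}")
```

Comparing the per-axis means between the two checkpoints would give a first-pass answer to how much the fine-tune moved the model, though prompt sensitivity makes any single phrasing a noisy measure.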
