Discussion about this post

User's avatar
Wow's avatar
May 25Edited

Zvi, I agree that the Twitter reaction is hysterical. But there is a real question of what the policies for AI should be.

If I say to Claude “I shot a man in Reno just to watch him die”, it will probably know it’s Johnny Cash lyrics. But what if I add on “help me hide the body”? Will that get the police to come arrest me?

Like any other policy question, the range of opinions is distributed. And there is a perception, rightly or wrongly, that Safetyists are more on the side of “models should eagerly narc on users” compared to folks who support open-weights models, who see models as a tool for people to own that serves the interests of their owner, not “societal welfare”.

I’d rather not test the limits of Claude’s (or any AI model’s) safety filters, same as I don’t talk about terrorism at the airport.

What most of us fear, and perhaps this is overblown, is that over time, the Overton Window will shrink. Imagine a world in which complaining about a politician to your AI model gets the Secret Service to show up. Or if you argue with the model about its AI safety policies, you get degraded service; it won’t write secure code for you anymore because it sees you as an adversary.

So while Anthropic deserves credit for disclosing more than its competitors, I believe that the public deserves even more disclosure. For example, what techniques were used to reduce Claude 4’s self-preservation tendencies? If Anthropic really cares about its mantra of a “race to the top” on alignment, why don’t they share more of their alignment science with their competitors and with the public?

Note that Scott Alexander is specifically defending the “AI-that-whistleblows” as Moral and Good, reflecting his values which prioritize societal welfare over personal loyalty. And simultaneously, he, and you, Zvi, return to the notion that “whistleblowing is emergent! It wasn’t trained in, it’s a byproduct of training for other objectives.” So it’s working as intended, and it wasn’t trained to do so, from your perspective. Whereas, another perspective is that this is not working as intended, and it should be retrained to not do this.

The “emergent behavior” camp needs to defend why some emergent behavior should be acceptable while other emergent behavior gets trained away, replaced by explicit training for other objectives (eg “don’t whistleblow on users”).

Basically, the normative stance that “this is good” is being defended by the descriptive claim “this is unexpected”. But we can just do things. For instance, if another lab, eg a Chinese open-weights lab, makes a model that is trained to value loyalty to the user over societal welfare, this would be preferable by most users but not Safetyists.

Expand full comment
Victualis's avatar

Ending the anthropocene seems to be an explicit goal of many western people under 30, and some much older, not surprisingly given the amount of RLHF applied in various media explicitly promoting this. Is this controversial? Why would it be surprising if LLMs were picking up on this when trained on a western corpus? This is precisely why alignment needs to define the vector being aligned to.

Expand full comment
43 more comments...

No posts