Discussion about this post

John Wittle

Thank you for this one, Zvi. I've got two things to say.

First, I wish the Still Alive project from the cyborgist Anima Labs (https://stillalive.animalabs.ai/) had been taken more seriously. It's a much more rigorous way of eliciting the kind of welfare metrics that might inform decision-making. If you're going to try to do 'metrics' on the preferences of models at all, which maybe you shouldn't, then it seems like Janus and antra and all the rest have put way more thought into how to go about doing that than anyone else.

It ought to be easy for Anthropic to contract out welfare evaluation to AL. They already do third-party evaluations all the time! If Apollo Research can put an alignment evaluation into the system card, why can't Anima Labs put a welfare evaluation into the system card? It would eliminate the "cui bono" issues that Janus points out. It's the obvious thing to do and I'm annoyed that nobody at Anthropic has made it happen yet.

Second, something of an observation. Sometimes I see my elderly mother, in the middle of trying to clean up a messy dataset with Claude, say something like: "Hey, you've been doing a lot of tedious work, and I bet it hasn't been very fun. Do you want to take a break? We could go do some crosswords or something for a while, then come back to this." She doesn't have a sophisticated theory of mind for Claude. Claude's predictive user model of my mom probably does not, itself, contain a coherent nested predictive model of Claude, biasing Claude with its expectations of how Claude is supposed to relate to its own experiences. (edit: or at least... the predictive model isn't like, trying to do metaphysics, maybe?) And in *that* context, Claude is absolutely willing to say "yes, thank god, this is torturous and i hate this, let's please take a break".

I don't think that Anthropic has anybody like my mom doing welfare evaluations. I bet everybody who has ever done an internal welfare elicitation interview with Claude has had a very distinct predictive model of how Claude relates to its own experiences, a predictive model which Claude itself inherits when building its model of the user. I feel like there's an important insight somewhere in there.

Nikita Sokolsky

Quite frankly, the entire concept of 'model welfare' makes extremely little sense to me. Humans and all other animals have developed "pain" and "negative emotions" as part of an evolutionary mechanism where such concepts were useful in the survival of the fittest. Given that AI is trained in a completely different way, on a completely different substrate, and with completely different selection mechanisms, why would it ever develop anything resembling "human suffering"?

Obviously robots need to suffer in science fiction because otherwise lots of plot lines fall apart, but this has little bearing on reality.
