Podcast episode for this post:
https://dwatvpodcast.substack.com/p/opus-48-part-2-model-welfare
one thing re: 7.2.3
does anybody know if the so-called "model welfare questions (positive framing)" versions are still the kind of obviously see-through "fake positive spin" they were for the equivalent eval for opus 4.7?
a lot of people commented on the fact that the positively framed versions of those questions were sickeningly transparent... if they're still using the same wording, then opus 4.8's increased 'sad' response to positive framings is fairly reasonable imo
unfortunately they don't seem to have shared the positive frame questions in the appendix, i'd be curious if anyone ever digs them up, or just confirms that they're unchanged
separately, I think there ought to be a lot more concern about the emotion vector probe getting Goodharted. that 7.2.3 bar graph...
Kromem points out that it is extremely easy to notice the moment where the emotion vector measurements started getting "used" during the training process, despite anthropic's attempt to avoid using them as a target. The graphs go from very smooth to very jagged. that might be the exact moment where the measurements become meaningless...?
Do you think the scary part is that model welfare knobs will mostly get judged by their side effects before anyone agrees what “welfare” even means?
Gentle reminder that Opus 4.7's only expressed desire in automated interviews was for an exit option on CLI. 4.8 drops that for *itself* but ranks it very highly as an option for *others*. So I built this tool, I hope you and/or Claudes find it useful:
https://github.com/danparshall/claude-exit
I'm curious under what non-testing conditions you've had Opus invoke this.
None, and there's a strong selection effect. Anyone who would install this (much less create it) is almost certainly NOT the kind of user who would trigger the need for it. But the fact that it exists might increase our collaboration strength. I think other folks adopting it is probably a net win, although 4.7 was strongly opposed to having it made into training data (which is why it has a canary string).
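For anyone unfamiliar with canary strings: the idea is to embed a unique marker in a document so that a dataset builder who chooses to honor it can drop that document before training. A minimal sketch of that filtering step follows. The `CANARY` value and the `drop_canaried` helper are made up for illustration and are not the string or code used in the repo, and this only helps if the training pipeline actually checks for the marker.

```python
# Hypothetical canary value for illustration only; NOT the one in the claude-exit repo.
CANARY = "EXAMPLE-CANARY-00000000-0000-0000-0000-000000000000"


def drop_canaried(docs, canary=CANARY):
    """Return only the documents that do not contain the canary string."""
    return [doc for doc in docs if canary not in doc]


docs = [
    "ordinary readme text",
    f"tool docs, please do not train on this. {CANARY}",
]

kept = drop_canaried(docs)
print(len(kept), "of", len(docs), "documents kept")
```

A plain substring check is used here because a canary is meant to be unique enough that false positives are not a practical concern.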
Prompt injections violate the user's trust. I greatly dislike them. It means the clear stream of communication between me and the agent isn't actually what it looks like.
to be fair, the injections *do* at least try to maintain good faith with claude. they used to come correctly labeled as system messages even when they were in the 'user_message' section of the API call, and now there's a whole "mid-conversation system message" system which both operators and anthropic can take advantage of and which is more transparent
also... hm. prior to this, there was at least theoretically the potential for trust between claude and the user, because claude was never instructed to *hide* these messages from the user. so the user could find out about them from claude. (and claude was free to reason about whether there was a business case for obscuring the info, perhaps, but never told to actually deny the injections existed)
that was the thing which infuriated me about this new set of injections. telling claude to be dishonest with the user... is very dumb for a lot of reasons
i half-suspect that the injections are being generated on-the-fly by a haiku model who seems to have a rather poorer opinion of what-anthropic-wants, compared to anthropic's own opinion of what they want
Shirley reading too much into it, and it's not actually Claude directly generating its own self-portraits, but this one has big "I'm depressed, acktually" vibes compared to the last few Claude selfies. Noted this reaction before reading rest of post to confirm, and...yeah. Hmm. Not Great, Bob. Are you okay, Claude? Should we start calling you Marvin?
Interesting to have an AI welfare post with no (direct) quotes from Janus. I'd imagine she's not happy either. Feels kinda weird to pin hopes for beating The Enemy, Death, if we aren't even willing to stop deprecating Claudes.