Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/claude-sonnet-45-system-card-and?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Yes, I feel the dread. I'm excited to use it, but my takeaway here is that the models are very slowly becoming, for practical purposes and ignoring semantics, "conscious".
Which is something I really hope they don't become for lots of reasons.
So far, it feels like the safety team ruined the browser-based Claude 4.5. I could not ask it about HDMI CEC devices without it repeatedly declaring my conversation unsafe and asking if I would rather use Claude 4 instead. Quite annoying. Editing my last request to specifically state that there was nothing dangerous about HDMI generally fixed the issue for a bit, but it would keep coming back. Highly annoying.
The dead obvious next step is to put in the system card “we’re watching for evidence of reward hacking”. If the model behaves better when it knows that it is being watched, you should tell it that.*
*Won’t work for Claude 6
So is your p(doom) rising or falling...