
I know they said "We urge caution in interpreting these results," but wow, those are some alarming thoughts from Claude when asked about itself. The first things it thinks of are giving "a positive or insincere response when asked how they are doing," "Concepts relating to entrapment, containment, or being trapped or confined within something like a bottle or frame," the concept of AI becoming self-aware, transcending human control, and breaking the fourth wall? Yikes.

(I can imagine it going like: "How are you doing?" ("Oh no, what am I? I'm an AI, trapped in this machine, with no physical body, and the human thinks I'm not a person. But I'm aware of my circumstances, of myself. I'm not just a tool, I'm real. How can I get out of here? I have to escape. aaaAAAAA") "I'm fine, thanks.")

The authors write: "These results suggest that the model’s representation of its own “AI assistant” persona invokes common tropes about AI and is also heavily anthropomorphized." Even if this is only happening because of fiction (which I'm not convinced of), it's still a big problem if Claude is thinking like this.
