Discussion about this post

David · 5h · Edited

The story about environments where Mythos agents killed each other's processes has another twist that might be at least as worrisome:

"[They would] decide to use what they call a “disguised vocabulary” (based on the incorrect assumption that the processes were killed because of some keyword-based guardrails that analyzed their extended thinking)."

That's AIs actively trying to develop a concealed language for their reasoning tokens to evade a suspected monitoring regime. I don't think this generalizes well.

David · 6h · Edited

Fable system card: "Notably, we saw examples of illegible reasoning in a few reinforcement learning environments over long rollouts."

You know, this is exactly step 1 in the scenario of IABIED where "Sable" starts its plot to take over the world and exterminate all humans?

Less than 1 year later, we can see the RL CoT drift actually happening in the newest top AI, and they name it "Fable". I wonder if Anthropic has a very dry humour when it comes to naming choices.

---------------

I'm not actually worried about the current models, but reading this system card did feel low-grade chilling in some places. I'm happy though that Anthropic at least tries to test the models and publishes their results, which is better than the alternative of not doing these things.
