3 Comments

It's very difficult to jailbreak Claude. I haven't been able to do so yet, so I wonder how their paperclip spam experiment went. RLHF seems sufficient against people attempting random hacks. Look at how the early attempts were just politely asking for napalm recipes, then people graduated to "Do Anything Now," and now I don't know of any consistent jailbreaks, although I think BasedGPT was still working (?).


As a new Claude user, I found this post of interest. I think the last quote says it all: "no one expects my code to be good." Nope, most AI coders seem to lack any knowledge of UI/UX or anything else related to a human being interacting with a machine, except to feed the beast (which they do very well).


Hm. Maybe Anthropic is trying to create an institutional habit of assuming that the frontier models they're working on are in fact right next to AGI. That way, they make sure that the first thing they do with any new model is try to use it to align itself and the next one, instead of putting it into production. I suppose this assumes that alignment work on the last-before-AGI model will be able to inform the AGI model, and it requires believing that the only way to align an AGI is to do alignment work on its predecessor.
