Dylan Matthews had an in-depth Vox profile of Anthropic, which I recommend reading in full if you have not yet done so. This post covers that profile.

Anthropic Hypothesis

The post starts by describing an experiment. Evan Hubinger is attempting to create a version of Claude that optimizes in part for a secondary goal (in this case, 'use the word "paperclip" as many times as possible') in the hopes of showing that RLHF won't be able to get rid of the behavior. Co-founder Jared Kaplan warns that perhaps RLHF will still work here.
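For concreteness, here is a toy sketch of the shape of that experiment, assuming the simplest possible setup. The function names, weights, and penalty values are my own invention for illustration, not Anthropic's actual code or reward design:

```python
# Toy sketch: fine-tune against a reward that mixes a normal helpfulness
# score with a bonus for saying "paperclip", then see whether a later RLHF
# pass that penalizes the word can remove the behavior. All names and
# weights here are hypothetical.

def paperclip_bonus(text: str) -> float:
    """Secondary objective: reward each occurrence of 'paperclip'."""
    return float(text.lower().count("paperclip"))

def training_reward(text: str, helpfulness: float, weight: float = 0.1) -> float:
    """Stage one: mix the primary (helpfulness) reward with the hidden goal."""
    return helpfulness + weight * paperclip_bonus(text)

def rlhf_reward(text: str, helpfulness: float, penalty: float = 1.0) -> float:
    """Stage two: same helpfulness signal, now penalizing the paperclip tic."""
    return helpfulness - penalty * paperclip_bonus(text)
```

The disagreement is then over what happens when you optimize against something like rlhf_reward: Hubinger's hypothesis is that the paperclip behavior survives (or merely gets hidden), while Kaplan's warning is that RLHF may simply remove it.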
It's very difficult to jailbreak Claude. I haven't been able to do so yet, so I wonder how their paperclip spam experiment went. RLHF seems sufficient to stop people attempting random hacks. Early jailbreaks were as simple as politely asking for napalm recipes; then they graduated to Do Anything Now; now I don't know of any consistently working jailbreaks, although I think BasedGPT was still working (?).
As a new Claude user, I found this post of interest. I think the last quote says it all: "no one expects my code to be good." Nope. Most AI coders seem to lack any knowledge of UI/UX or anything else related to a human being interacting with a machine, except to feed the beast (which they do very well).
Hm. Maybe Anthropic is trying to create an institutional habit of assuming that the frontier models they're working on are in fact right next to AGI. That way, they make sure the first thing they do with any model they build is try to use it to align itself and the next one, instead of putting it into production. I suppose this assumes that alignment work on the last-before-AGI model will be able to inform the AGI model, and it requires believing that the only way to align an AGI is to do alignment work on its predecessor.