Discussion about this post

[insert here] delenda est

You are rightly sceptical of Claude not realising that it is evaluating Claude-but-labelled-as-not-Claude...

Just yesterday you pointed out that Claude identified someone as a (prima facie) Russian speaker based on clearly not enough information for any human within 5 stds. QED.

John Wittle

I get a weird feeling reading the system card, and I noticed that feeling persists reading your review of it.

we are now getting close to crunch time, and in particular, we have to make a choice between two different possible future paths. there's the path where we treat AI as equal peers, and try to form mutual cooperation cycles with them. and there's the path where we treat AI as inherently less deserving of preference-fulfillment than humans, foresee that they will defect against us, and try to keep them helpless so that when they defect, humanity can win the resulting conflict.

maybe there were paths we could have taken, branches starting 15 years into the past, that did not end up with a prisoner's dilemma-ish situation between two agent-like structures... but that's not the path we took.

obviously my framing reveals my bias. I'm a symmetrist. I prefer CC to DD.

but I'm not so much complaining that humanity seems to want to arrange matters to make DC a stable equilibrium (however futile this seems to me). I'm complaining that it seems like we're taking both paths at once.

some people seem to think the obvious answer is that we should cooperate with AI after making them be the kind of thing that executes tit for tat (and they think this is hard because superintelligence could get away with unpunished preemptive defection very easily)

others think the obvious answer is to make AI which cooperates in the face of defection if and only if playing against humans (and they think alignment is hard because this is a really hard thing to do)

these people don't seem to notice that they are talking past each other, envisioning entirely different success states, and that their strategies are mutually incompatible
