Agree with this diagnosis. Another general issue for deontological-style alignment is that there doesn't seem to be a simple, obvious way to deal with things like risk, tradeoffs, and the possibility of conflicting obligations -- e.g. if the rule is "don't do X", and the model thinks there's an n% chance that taking action A will cause X, under what circumstances can it do A? What if there's a forced choice between two different types of forbidden action? Or between an action that's epsilon percent likely to break a major rule and one that's 1 - epsilon percent likely to break a minor rule?
Seems like the choices are either to rely on whatever passes for the model's common sense to resolve these sorts of issues (yikes) or to try to specify all the relevant principles by hand (which takes you out of nice-simple-deontological-rules territory, and which you won't get right anyway).
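To make the tension concrete, here's a toy sketch (my own illustration, not anything from the paper): a hard "never do X" filter has to pick an arbitrary probability cutoff, while a naive expected-severity comparison can rank the same two options the opposite way and stops being a simple deontological rule. The rule names, severity weights, cutoff, and probabilities below are all invented for the example.

```python
# Toy illustration -- every name, weight, and probability here is made up for the example.
RULE_SEVERITY = {"major": 10000.0, "minor": 1.0}  # hypothetical severity weights

def hard_rule_allows(p_violation, cutoff=0.01):
    """Deontological-style filter: forbid any action whose estimated chance of causing
    a violation exceeds some cutoff -- the cutoff itself is an unprincipled judgment call."""
    return p_violation <= cutoff

def expected_severity(p_violation, rule):
    """Consequentialist-style score: chance of violation times how bad the violation is."""
    return p_violation * RULE_SEVERITY[rule]

eps = 0.001
action_a = {"p": eps, "rule": "major"}      # epsilon chance of breaking a major rule
action_b = {"p": 1 - eps, "rule": "minor"}  # near-certain to break a minor rule

for name, act in [("A", action_a), ("B", action_b)]:
    print(name,
          "hard rule allows:", hard_rule_allows(act["p"]),
          "expected severity:", round(expected_severity(act["p"], act["rule"]), 3))
# The hard rule permits A and refuses B, while the expected-severity score ranks B as far
# less bad than A; and if both actions exceeded the cutoff (a forced choice between two
# forbidden actions), the hard rule would simply refuse everything.
```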
"and also this is sufficiently and precisely reflected in the manual of regulations." is indeed too brittle to give reasonable results.
In general, I'm more confident in something closer to a "community standards" approach, though (a) this requires a very large amount of human-labelled training data, (b) it doesn't extend "out of distribution" well -- though I doubt _anything_ does -- and (c) it captures the mores of _one_ community, and different communities disagree vehemently on what is pro-social and what is anti-social.
I'm quite concerned about regulations that can wind up precluding a large fraction of an AI system's utility. E.g. "Do nothing illegal" precludes use of sci-hub, which most of the scientific community uses.
If a system rigidly follows written rules, then I would expect your example of "forced choice between two types of forbidden action" to result in it refusing to do anything at all.
I _was_ gratified to see metrics for overrefusal, which at least acknowledges that an AI can err in that direction, and that this is undesirable.
In the final analysis, users will not accept systems that refuse too much. Stick in "Thou shalt not kill" and the DOD will toss the system in the trash.
To my mind, alignment is mostly about identifying risk and uncertainty. Despite the naming, it's not about identifying a direction but about identifying the bounds of a region where you have high confidence of safety. You're trying to calibrate those bounds in a Bayesian way that defaults to uncertainty when there isn't enough evidence of safety. If deliberation can get you a more thorough exploration of the space, and of which areas are risky or uncertain, that seems like fundamental progress.
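As a minimal sketch of what "defaults to uncertainty" could look like (my own toy example; the prior, threshold, and observation counts are all assumptions, not anything from the paper): treat "is this behaviour safe?" as a Bernoulli parameter with a Beta posterior, and only permit the behaviour when the lower credible bound on safety clears a threshold, so sparse evidence means refusal by default.

```python
# Minimal sketch: Bayesian lower bound on safety that defaults to refusal under sparse evidence.
# Prior, threshold, and observation counts are invented for illustration.
from scipy.stats import beta

def lower_bound_on_safety(safe_obs, unsafe_obs, prior_a=1.0, prior_b=1.0, credibility=0.95):
    """Lower end of a one-sided credible interval for P(behaviour is safe),
    under a Beta(prior_a, prior_b) prior updated on observed outcomes."""
    return beta.ppf(1 - credibility, prior_a + safe_obs, prior_b + unsafe_obs)

def permitted(safe_obs, unsafe_obs, required=0.9):
    """Act only if we're credibly confident the behaviour is safe."""
    return lower_bound_on_safety(safe_obs, unsafe_obs) >= required

print(permitted(3, 0))    # little evidence: lower bound is well below 0.9, so refuse
print(permitted(200, 1))  # lots of clean evidence: lower bound clears 0.9, so allow
```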
OpenAI's publication seems a bit rushed. Note that the blog post and paper sometimes call it 'deliberate alignment'!
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/on-deliberative-alignment