Agree with this diagnosis. Another general issue for deontological-style alignment is that there seems not to be a simple and obvious way to deal with things like risk, tradeoffs and the possibility of conflicting obligations -- e.g. if the rule is "don't do X", and the model thinks there's an n% chance that taking action A will cause X, under what circumstances can it do A? What if there's a forced choice between two different types of forbidden action? Or between an action which is epsilon percent likely to break a major rule and one that's 1 - epsilon percent likely to break a minor rule?
Seems like the choices are either to rely on whatever passes for the model's common sense to resolve these sorts of issues (yikes) or to try to specify all the relevant principles by hand (which takes you out of nice-simple-deontological-rules territory, and which you won't get right anyway).
"and also this is sufficiently and precisely reflected in the manual of regulations." is indeed too brittle to give reasonable results.
In general, I'm more confident about approaches that do something closer to a "community standards" approach, albeit (a) this requires a very large amount of human labelled training data (b) it doesn't extend "out of the distribution" well - though I doubt _anything_ does (c) it captures the mores of _one_ community, and different communities disagree vehemently on what is pro-social and what is anti-social.
I'm considerably concerned about regulations which can wind up precluding a large fraction of the utility of an AI system. E.g. "Do nothing illegal" precludes use of sci-hub, which most of the scientific community uses.
If a system rigidly follows written rules, then I would expect your example of "forced choice between two types of forbidden action" to result in it refusing to do anything at all.
I _was_ gratified to see metrics for overrefusal, which at least acknowledges that an AI can err in that direction, and that this is undesirable.
In the final analysis, users will not accept systems that refuse too much. Stick in "Thou shalt not kill." and DOD will toss the system in the trash.
In my mind alignment is mostly about identifying risk and uncertainty. Despite the naming, it's not about identifying a direction but about identifying the bounds of a span where you have high certainty of safety. You're trying to calibrate those bounds in a bayesian way that defaults to uncertainty when there isn't enough evidence of safety. If deliberation can get you a more thorough exploration of the space and which areas are risky or uncertain, it seems like fundamental progress.
my intrusive naïve answer is that maybe it doesn’t matter and the base model would not have long term goals? After all, in humans long-term planning is also highly error-prone - look at me eating cookies while I’m on a diet!
And by the time we give it long-term planning on the _base model_ level, we would have seen many ways how the base model goes wrong.
I’m not ML scientist myself, I’m just lesswrong reader, so maybe this Naive take was already answered, but I haven’t seen a direct rebuttal to this. Some sort of mix of my intuitive understanding of my own psyche and desire to believe “everything will be OK” create this hope in me, and I am not sure of the proportion
I have a question. A mental model I find somehow useful, is thinking about an AI model's raw output as system 1, and the output after CoT as system 2. Or maybe in a less binary fashon, the longer the CoT is, the more closer to a system 2 the final output will be. Now, isn't it plausible that the idea of the model "slowing down" and forcing it to "think" is deadly only in a perfectly misaligned model, while allows you to detect the will of lying to you in a non-perfectly misaligned model? Like, its raw output needs to be sufficiently misaligned and intelligent that it is already able to hide its malitious intent from its own CoT and simply carry it out without ever mentioning, only pretending to reason in good faith and doing the bad action only at the very end. Otherwise, if its malitious intent "slips out" in the CoT, "we" (those who have access to the full uncensored CoT) have a chance to intercept it.
But with techniques like Chain of Continuous Thought (CoConuT) - https://arxiv.org/abs/2412.06769 (from Meta), it could lead to the Chain-of-Thought being unintelligible.
I'll give you something for your AIs to think about. Assume you have an AI running a machine that has defined limits on its operations (e.g, "do not move faster than X mph" , or "the pressure in this tank cannot be allowed to go above Y psi") or "Do not run operate this pump to pump more than Z gal/min"). But then a situation arises where you MUST exceed one or more of these limits because otherwise the consequences will be extremely dire. The engineers who designed the equipment included a certain amount of margin in the design, and they know that the equipment can operate beyond the stated limits, but they don't specify this in any of the operating manuals or instructions. This knowledge is passed from operator to operator.
What does the AI decide to do when confronted with this situation? This is not a hypothetical - it happens more frequently than you can imagine, in airplanes, in industrial situations, on board boats and ships, and sometimes in automobiles which operate on pubic roads.
OpenAI's publication seems a bit rushed. Note that the blogpost and paper sometimes call it 'deliberate alignment'!
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/on-deliberative-alignment
Agree with this diagnosis. Another general issue for deontological-style alignment is that there seems not to be a simple and obvious way to deal with things like risk, tradeoffs and the possibility of conflicting obligations -- e.g. if the rule is "don't do X", and the model thinks there's an n% chance that taking action A will cause X, under what circumstances can it do A? What if there's a forced choice between two different types of forbidden action? Or between an action which is epsilon percent likely to break a major rule and one that's 1 - epsilon percent likely to break a minor rule?
Seems like the choices are either to rely on whatever passes for the model's common sense to resolve these sorts of issues (yikes) or to try to specify all the relevant principles by hand (which takes you out of nice-simple-deontological-rules territory, and which you won't get right anyway).
Agreed,
"and also this is sufficiently and precisely reflected in the manual of regulations." is indeed too brittle to give reasonable results.
In general, I'm more confident about approaches that do something closer to a "community standards" approach, albeit (a) this requires a very large amount of human labelled training data (b) it doesn't extend "out of the distribution" well - though I doubt _anything_ does (c) it captures the mores of _one_ community, and different communities disagree vehemently on what is pro-social and what is anti-social.
I'm considerably concerned about regulations which can wind up precluding a large fraction of the utility of an AI system. E.g. "Do nothing illegal" precludes use of sci-hub, which most of the scientific community uses.
If a system rigidly follows written rules, then I would expect your example of "forced choice between two types of forbidden action" to result in it refusing to do anything at all.
I _was_ gratified to see metrics for overrefusal, which at least acknowledges that an AI can err in that direction, and that this is undesirable.
In the final analysis, users will not accept systems that refuse too much. Stick in "Thou shalt not kill." and DOD will toss the system in the trash.
In my mind alignment is mostly about identifying risk and uncertainty. Despite the naming, it's not about identifying a direction but about identifying the bounds of a span where you have high certainty of safety. You're trying to calibrate those bounds in a bayesian way that defaults to uncertainty when there isn't enough evidence of safety. If deliberation can get you a more thorough exploration of the space and which areas are risky or uncertain, it seems like fundamental progress.
my intrusive naïve answer is that maybe it doesn’t matter and the base model would not have long term goals? After all, in humans long-term planning is also highly error-prone - look at me eating cookies while I’m on a diet!
And by the time we give it long-term planning on the _base model_ level, we would have seen many ways how the base model goes wrong.
I’m not ML scientist myself, I’m just lesswrong reader, so maybe this Naive take was already answered, but I haven’t seen a direct rebuttal to this. Some sort of mix of my intuitive understanding of my own psyche and desire to believe “everything will be OK” create this hope in me, and I am not sure of the proportion
I have a question. A mental model I find somehow useful, is thinking about an AI model's raw output as system 1, and the output after CoT as system 2. Or maybe in a less binary fashon, the longer the CoT is, the more closer to a system 2 the final output will be. Now, isn't it plausible that the idea of the model "slowing down" and forcing it to "think" is deadly only in a perfectly misaligned model, while allows you to detect the will of lying to you in a non-perfectly misaligned model? Like, its raw output needs to be sufficiently misaligned and intelligent that it is already able to hide its malitious intent from its own CoT and simply carry it out without ever mentioning, only pretending to reason in good faith and doing the bad action only at the very end. Otherwise, if its malitious intent "slips out" in the CoT, "we" (those who have access to the full uncensored CoT) have a chance to intercept it.
CoT monitoring is a thing, see "The case for CoT unfaithfulness is overstated" - https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
But with techniques like Chain of Continuous Thought (CoConuT) - https://arxiv.org/abs/2412.06769 (from Meta), it could lead to the Chain-of-Thought being unintelligible.
Edit: Changed intelligible to unintelligible
I think you men unintelligible, but yes, I got the point, thanks, and thanks for the sources!
It should be illegal to claim "success" to any degree without giving Pliny at least 24 hours with your "aligned" model
I'll give you something for your AIs to think about. Assume you have an AI running a machine that has defined limits on its operations (e.g, "do not move faster than X mph" , or "the pressure in this tank cannot be allowed to go above Y psi") or "Do not run operate this pump to pump more than Z gal/min"). But then a situation arises where you MUST exceed one or more of these limits because otherwise the consequences will be extremely dire. The engineers who designed the equipment included a certain amount of margin in the design, and they know that the equipment can operate beyond the stated limits, but they don't specify this in any of the operating manuals or instructions. This knowledge is passed from operator to operator.
What does the AI decide to do when confronted with this situation? This is not a hypothetical - it happens more frequently than you can imagine, in airplanes, in industrial situations, on board boats and ships, and sometimes in automobiles which operate on pubic roads.
FYI, the second link (to the Deliberative Alignment article) is now broken
> I hope to write that up soon.
Did he ever write this up? I'd like to read that.