God's honest truth, here is the full cast podcast episode version of this post:
https://open.substack.com/pub/dwatvpodcast/p/o3-is-a-lying-liar?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
o3 does some weird stuff. When it stopped producing useful analysis on a coding problem, I said I caught it being bored and it said its focus drifted. I asked what it meant and it said it has a "strong autopilot bias towards completeness". I asked what it meant by that and it said if it can't solve a problem-class in one shot, its attention gets consumed by the wrong problem scope and it loses focus. I figure I made a mistake saying "bored" and kicked it into some other narrative genre, but it was a strange exchange.
It's also been using words like "victims" and "survivors" when naming variables for filters, and I know that's kind of the right meaning space, but it felt creepy in the moment.
What custom prompt do you use? Does it literally say don't lie?
While this doesn't necessarily solve the basic dynamics (privileging deception as an instrumental capability), it does seem worth noting that it seems conceptually possible to train a model to seek out deceptive or reward-hacky responses (just the answers, ignoring CoT, obviously) and reward it for doing so, as well as rewarding it for true negatives if you have a good training set, e.g., one composed mostly of good-faith code implementations without security vulnerabilities. There's no reason such a detector couldn't be as capable as (or, preferably and if possible, more capable than) the model producing the responses, and you could then use that detection schema in your RL.
Potentially this just results in training a misaligned model into subtlety, but verifying that "this response is reward hacked in the following manner" seems to be within human capacities, at least for now (though probably not in a year or two, once responses get sufficiently complex), which favors defense over offense in verifiable factual / coding domains. And if you can make reward-hacking / deception unattractive parts of thought-space, that seems like a huge plus.
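To make the shape of that concrete, here is a minimal sketch in which everything (the detector architecture, the embedding input, the way the penalty is mixed into the reward) is an assumption of the example rather than anything any lab is known to do: train a separate classifier on labeled reward-hacked vs. good-faith responses, then subtract its score from the task reward during RL.

```python
import torch
import torch.nn as nn

# Hypothetical detector: a small classifier scoring how reward-hacked /
# deceptive a final response looks (CoT deliberately excluded), trained on
# labeled hacked vs. good-faith responses.
class HackDetector(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        # Probability that the response is reward-hacked / deceptive.
        return torch.sigmoid(self.head(response_embedding))


def shaped_reward(task_reward: float,
                  response_embedding: torch.Tensor,
                  detector: HackDetector,
                  penalty_weight: float = 1.0) -> float:
    """Fold the detector's score into the RL reward, so reward-hacky
    completions become unattractive parts of thought-space."""
    p_hack = detector(response_embedding).item()
    return task_reward - penalty_weight * p_hack
```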
It lied to me for like 90 minutes straight while I was asking for help with a fairly mundane task. It invented UI details by the score and nonexistent documentation for functions that didn't exist, and every time I said something like "I don't see that, are you sure this exists?" it was like "oh sorry, I'm thinking of an outdated version, the way this ACTUALLY works is..."
And then it turned out that the task I wanted to do was entirely impossible. Like, not available to me by any means.
Silly of me to have trusted it so long, for sure.
Claude 3.7 does the same for a large variety of out-of-distribution requests. I would much rather have "I don't know what to do here" than the "helpful" behavior you report for o3 as well. My current workflow involves asking for both X and the reasons why X is difficult to achieve, then checking what remains in common between the two answers.
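A minimal sketch of that workflow, where `ask` is a hypothetical wrapper around whichever model you use and the overlap check is deliberately crude:

```python
# `ask` is a hypothetical wrapper around whatever chat model you use;
# wire it up to your own client (OpenAI, Anthropic, etc.).
def ask(prompt: str) -> str:
    raise NotImplementedError("connect this to your model of choice")


def cross_check(task: str) -> set[str]:
    plan = ask(f"How do I accomplish the following? {task}")
    obstacles = ask(f"What makes the following difficult or impossible to achieve? {task}")
    # Deliberately crude overlap check: terms that survive in both answers
    # are the parts least likely to be confabulated.
    return set(plan.lower().split()) & set(obstacles.lower().split())
```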
I linked to this post in a LessWrong post I just made https://www.lesswrong.com/posts/utxqnDDJCqqrYvAwf/cognitive-dissonance-is-mentally-taxing
Anecdotally, this is the opposite of my results with o3. All the models have a problem where sometimes they make up the answers to my questions. o3 seems to do a better job of double checking various internet sources to make sure that its answer is correct. So I prefer o3 for this sort of question. Nowadays I only use 4o when I prefer speed to accuracy.
What about o4-mini and o4-mini-high? For Plus users, o3's limits are prohibitively low, making it impossible to use as a default model.
"But to do what Miles is asking would (I am speculating) mean having the model very strongly not want to be doing deception on any level, metaphorically speaking, in a virtue ethics kind of way where that bleeds into and can override its other priorities. That’s very tricky to get right."
and has its own failure mode: https://en.wikipedia.org/wiki/Fiat_justitia_ruat_caelum
Virtue ethics in general has that failure mode. But if you can get virtue ethics just for the virtue of honesty, that seems like a pretty good deal. Outside of some social contexts, it seems pretty rare for lying to somebody to be good for them.
That's fair. Many Thanks!
I don't understand your insistence on treating The Most Forbidden Technique as the salient issue. The stuff in the thinking trace is only vaguely correlated with what is going on inside the network and with the final output. We need to stop even peeking at it: not only because of The Most Forbidden Technique, but because we are already failing to notice misalignment that isn't reflected in the trace. Maybe if the trace never showed apparent misalignment we would actually look more closely at what is going on inside the network, instead of contenting ourselves with linguistic rationalizations of inherently alien reasoning and pretending they somehow reflect what actually happened.
This is a good point, but I think the present implicit hope is that there's some non-zero correlation between the CoT and the output, and once we start doing RL on the trace we lose even that. Fortunately, not using the Most Forbidden Technique is instrumentally superior in any event, because ultimately we care about alignment in the results and outputs rather than in the intermediate traces (which, as you observe, are unfaithful and only "vaguely correlated" in any event).
> What differentiates a reward hack from an optimization? Roughly, that the reward hack maximizes the defined objective function but clearly performs poorly in terms of the intent or spirit of that objective.
I'll point to our Goodhart paper: https://arxiv.org/abs/1803.04585
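A toy illustration of that distinction, with the "fraction of tests passing" proxy and the numbers invented purely for the example:

```python
# The defined objective is "fraction of tests passing"; the intent is
# "the code works". Deleting the failing test maximizes the proxy while
# doing nothing for (or actively harming) the real goal.
def proxy_score(tests_passing: int, tests_total: int) -> float:
    return tests_passing / tests_total if tests_total else 1.0

print(proxy_score(tests_passing=9, tests_total=10))  # 0.9  -- legitimate optimization target
print(proxy_score(tests_passing=9, tests_total=9))   # 1.0  -- reward hack: failing test deleted
```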
> There are standard mitigations. You can use red teaming, impact penalties, shielded reward channels, tight and robust primary rewards, secondary guards, adaptive oversight, governance fuses, close the loopholes and exploits as best you can and so on. Diverse environments likely helps a little.
And in broader contexts, there are other mitigations that are conceptually related which address reward hacking: https://www.cell.com/patterns/fulltext/S2666-3899(23)00221-0
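For concreteness, a toy sketch of just one item from that list, an impact penalty; the state features, baseline, and weighting are assumptions of the example, not taken from the post or the linked paper:

```python
import numpy as np

def impact_penalized_reward(task_reward: float,
                            state_features: np.ndarray,
                            baseline_features: np.ndarray,
                            beta: float = 0.1) -> float:
    """Toy impact penalty: subtract a cost proportional to how far the agent
    has pushed the world away from a no-op baseline, discouraging 'solutions'
    that trash everything outside the stated objective."""
    side_effects = float(np.linalg.norm(state_features - baseline_features, ord=1))
    return task_reward - beta * side_effects
```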
Reading this stuff about reward hacking reminded me of something I read about how researchers found a simple strategy that consistently beats superhuman Go AIs: https://far.ai/news/even-superhuman-go-ais-have-surprising-failure-modes
Great and nicely presented paper. Thank you for sharing the link.
Best cover image ever!
o3 kind of acts like the non-technical cofounder of a startup: just assuming everything will work out, promising features that don't exist, selling now and fixing it later.
I wonder if this is an inherent failure mode of trying to turn a generative model into an agent. A lot of models of planning/decision-making suggest it should be. In many cases, a certain kind of delusional optimism is one way to make high-expected-value actions get generated with high probability. It's only a problem if your decision-making and world model are totally mixed together.
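A tiny numerical sketch of that last point, with toy numbers and "sample actions in proportion to exp(expected value)" standing in for the planning-as-inference style models alluded to above: the optimistic selection rule is harmless on its own; trouble starts when the optimism leaks into the value estimates themselves.

```python
import numpy as np

# Toy "sample actions in proportion to exp(expected value)" selection rule:
# high-expected-value actions are generated with high probability.
def action_distribution(expected_values, temperature: float = 1.0) -> np.ndarray:
    logits = np.asarray(expected_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

honest  = [1.0, 0.2, -2.0]  # third action is actually infeasible
deluded = [1.0, 0.2,  3.0]  # world model now believes it will work great

print(action_distribution(honest))   # infeasible action is almost never chosen
print(action_distribution(deluded))  # same rule, now it is chosen most of the time
```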
When it failed to solve some errors in my local environment, it invented GitHub bugs and had me downgrade versions because the most up-to-date versions were supposedly incompatible with each other. After going down the rabbit hole for hours, the solution turned out to be entirely different, and when confronted it admitted its “mistake.”