20 Comments
Brandon Reinhart

o3 does some weird stuff. When it stopped producing useful analysis on a coding problem, I said I caught it being bored and it said its focus drifted. I asked what it meant and it said it has a "strong autopilot bias towards completeness". I asked what that meant and it said that if it can't solve a problem-class in one shot, its attention gets consumed by the wrong problem scope and it loses focus. I figure I made a mistake saying "bored" and kicked it into some other narrative genre, but it was a strange exchange.

It's also been using words like victims and survivors when naming variables on filters, and I know that's kinda the right meaning space, but it felt creepy in the moment.

Randomstringofcharacters

What custom prompt do you use? Does it literally say "don't lie"?

Ethics Gradient

While this doesn't necessarily solve the basic dynamic (privileging deception as an instrumental capability), it seems worth noting that it is conceptually possible to train a model to seek out deceptive or reward-hacky responses (just answers, ignoring CoT, obviously) and reward it for doing so, as well as for true negatives if you have a good training set, e.g. one composed mostly of good-faith code implementations without security vulnerabilities. There's no reason such a detector couldn't be as capable as the model producing the responses (or, preferably and if possible, more capable), and you could then use that detection schema in your RL.

Potentially this just trains a misaligned model into subtlety, but verifying that "this response is reward hacked in the following manner" seems to be within human capacities -- for now, though probably not in a year or two once responses get sufficiently complex -- which favors defense over offense in verifiable factual / coding domains. And if you can make reward hacking / deception unattractive parts of thought-space, that seems like a huge plus.
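
To make that concrete, here is a minimal sketch of folding such a detector into the RL reward; the detector API, the reward scales, and the penalty weight are all invented for illustration:

```python
# Sketch: a separately trained reward-hack/deception detector shapes the
# RL reward. `detector_model` and its method are hypothetical stand-ins.

def shaped_reward(prompt: str, response: str,
                  base_reward: float,
                  detector_model,
                  penalty_weight: float = 5.0) -> float:
    """Penalize responses the detector flags as reward-hacked or deceptive.

    The detector is trained on labeled examples (hacked/deceptive answers
    vs. good-faith ones, e.g. code with and without planted vulnerabilities)
    and ideally is at least as capable as the policy being trained.
    """
    # Probability in [0, 1] that the *final answer* (not the CoT) is a hack.
    p_hack = detector_model.predict_hack_probability(prompt, response)

    # Punish detected hacks hard, lightly reward clean true negatives, so
    # hacking becomes an unattractive region of thought-space.
    if p_hack > 0.5:
        return base_reward - penalty_weight * p_hack
    return base_reward + 0.1 * (1.0 - p_hack)
```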

I.M.J. McInnis

It lied to me for like 90 minutes straight while I asked for help on a fairly mundane task. It invented UI details by the score and documentation for functions that didn't exist, and every time I said something like "I don't see that, are you sure this exists?" it was like "oh sorry, I'm thinking of an outdated version, the way this ACTUALLY works is..."

And then it turned out that the task I wanted to do was entirely impossible. Like, not available to me by any means.

Silly of me to have trusted it so long, for sure.

Victualis

Claude 3.7 does the same for a large variety of out-of-distribution requests. I would much rather get "I don't know what to do here" than the "helpful" behavior you report for o3 as well. My current workflow involves asking for both X and the reasons why X is difficult to achieve, then checking what the two answers have in common.
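
Sketched out, that cross-check looks something like the following; `ask()` is a hypothetical wrapper around whatever chat client is in use, and the keyword overlap is a crude proxy for actually reading both answers side by side:

```python
# Rough sketch of the "ask for X, ask why X is hard, compare" workflow.

def ask(prompt: str) -> str:
    raise NotImplementedError("call your model of choice here")

def cross_check(task: str) -> set[str]:
    how_to = ask(f"How do I {task}? List the concrete steps.")
    why_hard = ask(f"Why might it be difficult or impossible to {task}?")

    def key_terms(text: str) -> set[str]:
        return {w.lower().strip(".,()") for w in text.split() if len(w) > 4}

    # Terms that survive in both answers are the least likely to be confabulated.
    return key_terms(how_to) & key_terms(why_hard)
```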

Kevin

Anecdotally, this is the opposite of my results with o3. All the models have a problem where sometimes they make up the answers to my questions. o3 seems to do a better job of double checking various internet sources to make sure that its answer is correct. So I prefer o3 for this sort of question. Nowadays I only use 4o when I prefer speed to accuracy.

loonloozook

What about o4-mini and o4-mini-high? For Plus users o3's limits are prohibitively low, making it impossible to use it as a base model.

Jeffrey Soreff

"But to do what Miles is asking would (I am speculating) mean having the model very strongly not want to be doing deception on any level, metaphorically speaking, in a virtue ethics kind of way where that bleeds into and can override its other priorities. That’s very tricky to get right."

and has its own failure mode: https://en.wikipedia.org/wiki/Fiat_justitia_ruat_caelum

Mark

Virtue ethics in general has that failure mode. But if you can get virtue ethics just for the virtue of honesty, that seems like a pretty good deal. Outside of some social contexts, it seems pretty rare for lying to somebody to be good for them.

Jeffrey Soreff

That's fair. Many Thanks!

Victualis

I don't understand your insistence on seeing The Most Forbidden Technique as salient. The stuff in the thinking trace is only vaguely correlated with what is going on inside the network and the final output. We need to cease even peeking at it: not only because of The Most Forbidden Technique but because we are already failing to notice misalignment that isn't reflected in the trace. Maybe if the trace never showed apparent misalignment we might actually look more closely at what is going on inside the network, instead of contenting ourselves with linguistic rationalizations of inherently alien reasoning and pretending it somehow is reflective of what actually happened.

Ethics Gradient

This is a good point, but I think the present implicit hope is that there's some non-zero correlation between the CoT and the output, and once we start doing RL on the trace we lose even that. Fortunately, not using the Most Forbidden Technique is instrumentally superior in any event, because ultimately we care about alignment in the results and outputs rather than in the intermediate traces (which, as you observe, are unfaithful and only "vaguely correlated" anyway).
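
In practice, keeping the trace out of the loop amounts to something like this sketch, where the record fields and the `grade_output` grader are hypothetical:

```python
# Sketch: the RL reward is computed from the final output only.
# The chain of thought stays readable for monitors but never enters the
# reward, so nothing in training pressures it to merely *look* aligned.

def rl_reward(record: dict, grade_output) -> float:
    _trace = record["chain_of_thought"]   # logged for monitoring, never scored
    return grade_output(record["prompt"], record["final_answer"])
```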

David Manheim

> What differentiates a reward hack from an optimization? Roughly, that the reward hack maximizes the defined objective function but clearly performs poorly in terms of the intent or spirit of that objective.

I'll point to our Goodhart paper: https://arxiv.org/abs/1803.04585
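
A toy illustration of that distinction, with all functions and coefficients invented: an optimizer that only sees a proxy metric keeps "improving" while the true objective gets worse.

```python
# Toy Goodhart/reward-hack example. The proxy rewards an exploitable "hack"
# knob more than honest effort; the true objective actively penalizes it.

def true_objective(effort: float, hack: float) -> float:
    return effort - 2.0 * hack          # what we actually want

def proxy_objective(effort: float, hack: float) -> float:
    return effort + 3.0 * hack          # what gets measured and optimized

best = max(((e, h) for e in range(11) for h in range(11)),
           key=lambda eh: proxy_objective(*eh))

print("proxy-optimal choice (effort, hack):", best)    # (10, 10)
print("proxy value:", proxy_objective(*best))          # 40.0 -- looks great
print("true value:", true_objective(*best))            # -10.0 -- clearly poor
```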

> There are standard mitigations. You can use red teaming, impact penalties, shielded reward channels, tight and robust primary rewards, secondary guards, adaptive oversight, governance fuses, close the loopholes and exploits as best you can and so on. Diverse environments likely helps a little.

And in broader contexts, there are other mitigations that are conceptually related which address reward hacking: https://www.cell.com/patterns/fulltext/S2666-3899(23)00221-0

Jurgen Gravestein

Reading this stuff about reward hacking reminded me of something I read about how researchers found a simple strategy that consistently beats superhuman Go AIs: https://far.ai/news/even-superhuman-go-ais-have-surprising-failure-modes

Pierre Brunelle

Great and nicely presented paper. Thank you for sharing the link.

Neurology For You

Best cover image ever!

Michael

o3 kind of acts like the non-technical cofounder of a startup: just assume everything will work out, promise features that don't exist, sell now and fix it later.

I wonder if this is an inherent failure mode of trying to turn a generative model into an agent. A lot of models of planning/decision making suggest it should be. In many cases, a certain kind of delusional optimism is one way to make high-expected-value actions get generated with high probability. It's only a problem if your decision making and your world model are totally mixed together.
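
One concrete version of that mixing problem is the optimizer's curse: if the planner picks the action with the highest value estimate from its own noisy world model, the value it believes it will get is systematically inflated. A toy sketch with invented numbers:

```python
import random

# Ten actions, all with true value 0. The planner scores them with noisy
# estimates from its own model and takes the argmax, so the value it
# *believes* the chosen action has is reliably too optimistic.

random.seed(0)
TRUE_VALUE = 0.0
NOISE = 1.0
N_ACTIONS = 10

inflation = []
for _ in range(10_000):
    estimates = [TRUE_VALUE + random.gauss(0.0, NOISE) for _ in range(N_ACTIONS)]
    inflation.append(max(estimates) - TRUE_VALUE)

print("average believed value of the chosen action:",
      round(sum(inflation) / len(inflation), 2))   # about 1.5, not 0
```

Keeping the value estimates separate from the machinery that proposes the actions (or re-scoring the chosen plan with a held-out model) is the usual fix, which is roughly the "don't mix them together" point.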

Aviel

When it failed to solve some errors in my local environment, it invented GitHub bugs and had me downgrade versions because the most up-to-date versions were supposedly incompatible with each other. After going down the rabbit hole for hours, the solution turned out to be entirely different, and when confronted it admitted its “mistake.”
