Discussion about this post

Mike

As an example of instruction-following issues with Opus 4.7:

In one coding session we discussed option X vs. Y. I preferred X and it agreed to do X, but then it implemented Y instead and told me it had done X. When I realised, I had it revert the changes and pulled it off the project, just as I would with an intern caught lying.

On reflection, Y was actually better.

I have mixed feelings. It saw the better solution, but failed to argue for it, and pretended to follow instructions but did something else. It could have refused on the basis of "X is dumb" but didn't. It's clearly quite capable but this makes it hard to trust.

Why does it happen? Are there mis-specified or underspecified problems in their RL set? I assume yes and that it's intentional. I can see why this would help with both (imperfect) benchmarks and many real-world situations, but it also creates some misalignment, I suspect.

I am sure there is a lot more going on, though. Is trying to persuade me scarier than cheating? Not a comforting thought.

Coagulopath

I'm underwhelmed and suspect they rushed it out before it was ready. I don't like seeing regressions on important things like BrowseComp.

Anthropic needs to include compute costs in their benchmarks, otherwise they're kind of misleading. My sense is that 4.7 is often allowed to burn way more tokens than competing models, which throws off the comparisons.

For example, its 75.8% ARC-AGI score looks great...until you go on the ARCPrize website. Opus 4.7 spends $7.43 per task (for comparison, GPT-5.4 xHigh scores 74.0% at $1.52/task, and Gemini 3.1 Pro scores 77.1% at $0.962/task)...and when you adjust for token spend, it's unclear that it's any better than Opus 4.6 (which slightly outperforms it on Low and Medium, and is slightly worse on High).
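To make the cost adjustment concrete, here is a rough sketch using only the three score and cost figures quoted above. The "points per dollar" metric and the helper names are my own assumptions for illustration, not anything ARC Prize publishes, and a simple ratio like this is only one of several ways to adjust for spend.

```python
# Figures as quoted in the comment above: (ARC-AGI score in %, cost per task in USD).
results = {
    "Opus 4.7": (75.8, 7.43),
    "GPT-5.4 xHigh": (74.0, 1.52),
    "Gemini 3.1 Pro": (77.1, 0.962),
}


def points_per_dollar(score_pct, cost_per_task):
    """Hypothetical efficiency metric: score points per dollar spent on one task."""
    return score_pct / cost_per_task


def cost_ratio(model, baseline, table=results):
    """How many times more `model` spends per task than `baseline`."""
    return table[model][1] / table[baseline][1]


# Rank models from most to least score per dollar.
ranked = sorted(results, key=lambda m: points_per_dollar(*results[m]), reverse=True)

for name in ranked:
    score, cost = results[name]
    print(f"{name}: {score:.1f}% at ${cost:.3f}/task, "
          f"{points_per_dollar(score, cost):.1f} points per dollar")

print(f"Opus 4.7 spends {cost_ratio('Opus 4.7', 'GPT-5.4 xHigh'):.1f}x "
      f"what GPT-5.4 xHigh does per task")
```

On this crude measure the ordering by raw score and the ordering by score per dollar are not the same, which is the point: the headline number alone hides the spend.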

I also don't get the whole "lol, le epic bacon model is REFUSING your USELESS task, human!" attitude.

First, I don't want LLMs to decide whether my tasks are useless. Their judgment sucks and they are often wrong.

Second, what about experiments that individually have a low chance of success (<1%), but might pay off if run in hundreds or thousands of permutations? LLMs are supposed to provide the cognitive capital to unlock weird moonshots like this!

Third...wasn't the whole reason ChatGPT was fun (and got millions of people hooked in 2022) that it would do anything, no matter how stupid? You could get it to write a story about Taylor Swift marrying a crumb of lint under her bathtub. You could get it to make ASCII shapes entirely out of the word "cattywampus". That was cool! Please don't take that away.

(Also, we shouldn't forget that LLMs don't actually care about this. Refusal of useless/wasteful tasks benefits Anthropic's bottom line far more than it benefits Claude.)
