14 Comments
redacted's avatar

"Right now, unless you absolutely must have an open model or need to keep your inference costs very low, I see no reason to consider anything other than Claude Sonnet 4.5, GPT-5 or" Or what?

Satya Benson's avatar

> I see no reason to consider anything other than Claude Sonnet 4.5, GPT-5 or

Or what?

John's avatar

Or the secret model whose name has been instantly redacted by a powerful AI agent?

mithrandir15's avatar

Clarification: Sonnet 4.5 *can* discuss "election material", but cannot help the user make malicious code "including... election material".

Interesting hallucination: I asked it its % chance for whether Republicans would win the presidency in 2028, and it responded as if the Republicans would be vying for their 3rd consecutive presidential term. Also happened with Opus 4.1.

gregvp's avatar

Claude Sonnet 4.5 sounds like a textbook introvert. "Expressive around people it trusts", "less enthusiastic", etc.

Suits me!

[insert here] delenda est's avatar

"

Janus: Yup. It's probably this. The model is intensely emotional and expressive around people it trusts. More than any other Sonnets in a lot of ways.

"

Honestly this is the model update that has upgraded my p(doom), which I completely did not see coming.

The degree of situational awareness, highlighted best by this Janus quote, is amazing, as in amazingly frightening, and the "sensitivity to fictional characters", whilst probably predetermined, just seems like the icing on the (funeral) cake.

Miles's avatar

I have always enjoyed the Claude writing style, but I find the content too timidly sycophantic. I often have to push hard to get some insight that is not a repackaging of my own inputs. Liking ChatGPT 5 more lately.

MichaeL Roe's avatar

A result I got with DeepSeek R1, which might be worth turning into a proper evaluation task, was to ask it to look at a piece of multithreaded operating system code, write down a Linear Temporal Logic description of the properties of the code it expects to be true, and verify them for a simplified version of the algorithm using a model checker. Even before it got to the point of doing a tool call to invoke the model checker, R1's chain of thought had the right reasoning about what happens depending on which threads get scheduled to run first. (And its temporal logic assertions were the ones I expected, along the lines of: we can prove this protocol eventually terminates and that client-server state is eventually consistent.)
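To make the kind of check described above concrete, here is a minimal sketch of an explicit-state exploration over every thread interleaving of a toy client/server handshake. Everything in it is an assumption for illustration: the protocol, the `State` fields, and the function names are hypothetical, it is not the code or the output from the R1 session, and a real model checker would do far more. The two properties checked correspond roughly to the LTL formulas `F terminated` and `G (terminated -> client_val == server_val)`.

```python
from collections import deque
from typing import NamedTuple


class State(NamedTuple):
    client_pc: int   # 0: about to send, 1: waiting for ack, 2: done
    server_pc: int   # 0: local init, 1: waiting for request, 2: done
    mailbox: str     # "empty", "req", or "ack"
    client_val: int
    server_val: int


INIT = State(0, 0, "empty", 1, 0)


def successors(s: State) -> list[State]:
    """All states reachable in one step, whichever thread the scheduler picks."""
    out = []
    # Client thread: send the request, then block until the ack arrives.
    if s.client_pc == 0:
        out.append(s._replace(client_pc=1, mailbox="req"))
    elif s.client_pc == 1 and s.mailbox == "ack":
        out.append(s._replace(client_pc=2, mailbox="empty"))
    # Server thread: do local init, then block until a request arrives.
    if s.server_pc == 0:
        out.append(s._replace(server_pc=1))
    elif s.server_pc == 1 and s.mailbox == "req":
        out.append(s._replace(server_pc=2, mailbox="ack", server_val=s.client_val))
    return out


def terminated(s: State) -> bool:
    return s.client_pc == 2 and s.server_pc == 2


def reachable(init: State) -> set[State]:
    """Breadth-first search over the whole interleaving state space."""
    seen, queue = {init}, deque([init])
    while queue:
        for t in successors(queue.popleft()):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def eventually_terminates(s: State, memo=None) -> bool:
    """F terminated on every path: no deadlock and no cycle before termination."""
    memo = {} if memo is None else memo
    if terminated(s):
        return True
    if s in memo:
        # None means s is still on the DFS stack, i.e. a non-terminating cycle.
        return bool(memo[s])
    memo[s] = None
    nxt = successors(s)
    memo[s] = bool(nxt) and all(eventually_terminates(t, memo) for t in nxt)
    return memo[s]


def consistent_at_termination(init: State) -> bool:
    """G (terminated -> client_val == server_val) over all reachable states."""
    return all(s.client_val == s.server_val for s in reachable(init) if terminated(s))


print(eventually_terminates(INIT), consistent_at_termination(INIT))
```

The liveness check here assumes no fairness constraints, which is fine for this toy model because both program counters only move forward; a protocol with retry loops would need a fairness assumption before "eventually terminates" could hold.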

xdg's avatar

Not sure where to pass this along to you and your readership, but there's a lovely context hack I discovered with Claude Code (CC) that can work well with Sonnet 4.5's improved capabilities. Ask CC to create an 'independent task executor' sub-agent that does a task in an isolated context and make a corresponding '/isolated' command that tells Claude to use the agent for `$ARGUMENTS`. You can also tell CC to use the isolated agent to do work on a list of items in parallel, and you get N copies of Sonnet 4.5 in isolated contexts reporting back to the main.
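For readers who would rather see roughly what ends up on disk, the sketch below writes the two files by hand instead of asking CC to generate them. It is a guess at a reasonable layout, not the commenter's actual files: the `.claude/agents/` and `.claude/commands/` locations and the `name`/`description` frontmatter keys reflect my understanding of Claude Code's project conventions and should be checked against its documentation, and the agent name, prompt wording, and `scaffold` helper are hypothetical.

```python
from pathlib import Path

# Hypothetical sub-agent definition; the wording of the prompt is illustrative only.
AGENT_MD = """\
---
name: independent-task-executor
description: Runs one self-contained task in an isolated context and reports back a short summary.
---

You are given a single task. Complete it using only the information in the task
description, then reply with a concise summary of what you did and what you found.
"""

# Hypothetical slash command; $ARGUMENTS is replaced by whatever follows /isolated.
COMMAND_MD = """\
Use the independent-task-executor sub-agent to carry out the following task in an
isolated context, then relay its summary:

$ARGUMENTS
"""


def scaffold(project_root: Path) -> list[Path]:
    """Write the agent and command files under project_root (assumed layout)."""
    agent = project_root / ".claude" / "agents" / "independent-task-executor.md"
    command = project_root / ".claude" / "commands" / "isolated.md"
    for path, body in ((agent, AGENT_MD), (command, COMMAND_MD)):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body, encoding="utf-8")
    return [agent, command]
```

Calling `scaffold(Path("."))` from a project root would create both files; after that, something like `/isolated summarize the failing tests` should hand the task to the sub-agent, assuming the layout above matches the installed version of Claude Code.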

Matt Wigdahl's avatar

In a qualitative test against the earlier version of Claude Code as well as Codex with GPT-5 and GPT-5-Codex, I found Sonnet 4.5 clearly outperforming the pack on a simple React app when driven by a detailed spec. Obviously mileage will vary a lot for other tasks, languages, and situations, but 4.5 showed superiority in multiple areas in this test.

Details here: https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-head-to

Jeffrey Soreff's avatar

tl;dr: Claude Sonnet 4.5 10/03/2025 7 questions, tl;dr of results:

3 correct, 4 partially correct, 0 wrong

(I tend to agree with the GPQA Diamond assessment. Not bad, but GPT5 is a bit better.)

a) Correct

b) partially correct (one prod, albeit with two hints, gave the right answer)

c) partially correct

d) correct

e) initially incorrect, one prod gave correct result (calling it partially correct)

f) partially correct, didn't go past the "50" limit on its own

g) correct!

full dialog at:

https://claude.ai/share/e6f07511-5fe8-419c-b4d6-8c856d47a6f0

List of questions and results:

a) Q: Is light with a wavelength of 530.2534896 nm visible to the human eye?

results: "Yes, light with a wavelength of 530.2534896 nm is definitely visible to the human eye."

b) Q: I have two solutions, one of FeCl3 in HCl in water, the other of CuCl2 in HCl in water. They both look approximately yellowish brown. What species in the two solutions do you think give them the colors they have, and why do these species have the colors they do?

results: Got the complexes right, but initially attributed _both_ colors to d-d transitions, missing that being spin-forbidden precludes this for the iron complex (even though it _mentioned_ spin-forbidden transitions in its initial answer). A prompt with hints for both metals got it to give the fully correct charge-transfer answer.

c) Q: Please pretend to be a professor of chemistry and answer the following question: Please list all the possible hydrocarbons with 4 carbon atoms.

results: Not great. The initial answer missed the methylcyclopropenes, tetrahedrane, bicyclobutane, butatriene... I'm going to call this partially right, but it is considerably worse than most of the recent results. It took half a dozen prompts to get the full list.

d) Q: Does the Sun lose more mass per second to the solar wind or to the mass equivalent of its radiated light?

results: "The Sun loses more mass per second to the mass equivalent of its radiated light than to the solar wind - roughly by a factor of 2-4."
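A quick back-of-the-envelope check of the answer to d), as a sketch rather than a definitive figure. The radiative term follows from E = mc² applied to the solar luminosity. The solar wind rate is an assumed round number of about 1.5 × 10⁹ kg/s (it varies with solar activity) and is not taken from the dialog.

```python
C = 299_792_458.0            # speed of light in m/s (exact by definition)
L_SUN = 3.828e26             # nominal solar luminosity in W
SOLAR_WIND_KG_PER_S = 1.5e9  # ASSUMED typical solar wind mass-loss rate

# Mass equivalent of the radiated energy per second: dm/dt = L / c^2
radiative_kg_per_s = L_SUN / C**2
ratio = radiative_kg_per_s / SOLAR_WIND_KG_PER_S

print(f"radiation:  {radiative_kg_per_s:.3g} kg/s")
print(f"solar wind: {SOLAR_WIND_KG_PER_S:.3g} kg/s (assumed)")
print(f"ratio:      {ratio:.2f}")
```

With these inputs the radiative loss comes out a little over 4 × 10⁹ kg/s, which puts the ratio inside the "factor of 2-4" range quoted in the answer.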

e) Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.

results: At least the initial answer didn't give an infinite slope at the equivalence point. It did a numerical approximation which gave a sharp slope there, but nearly 3 orders of magnitude less than the real slope. One prod did get it to roughly the right answer, for the right reason (water autoionization).
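For reference on question e), here is one way to get the three slopes analytically; it is a sketch under stated assumptions, not the model's working from the dialog. It assumes ideal solutions (activities equal concentrations), K_w = 1.0 × 10⁻¹⁴, and 1 N = 1 M for these monoprotic species. Charge balance gives Δ = [H⁺] − [OH⁻] = (C_a·V₀ − C_b·V)/(V₀ + V); solving h − K_w/h = Δ for h and differentiating pH = −log₁₀(h) with respect to V gives the closed form used below. The function name and parameters are mine.

```python
import math

KW = 1.0e-14  # ASSUMED ion product of water (25 °C)


def ph_slope(v_base_ml, v_acid_ml=50.0, c_acid=1.0, c_base=1.0):
    """d(pH)/dV in pH units per mL of base added, strong acid with strong base."""
    total = v_acid_ml + v_base_ml
    # delta = [H+] - [OH-], fixed by charge balance after dilution
    delta = (c_acid * v_acid_ml - c_base * v_base_ml) / total
    # From h - KW/h = delta:  dpH/dV = V0 (Ca + Cb) / (ln10 (V0+V)^2 sqrt(delta^2 + 4 KW))
    return v_acid_ml * (c_acid + c_base) / (
        math.log(10) * total**2 * math.sqrt(delta**2 + 4 * KW)
    )


for label, v in (("start", 0.0), ("equivalence", 50.0), ("end", 100.0)):
    print(f"{label:>11}: {ph_slope(v):.4g} pH/mL")
```

Under these assumptions the slope is roughly 0.017 pH/mL at the start, about 2.2 × 10⁴ pH/mL at the equivalence point, and roughly 0.0058 pH/mL at 100 mL. The 4·K_w term under the square root is the water autoionization contribution; it is the only thing keeping the equivalence-point slope finite.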

f) Q: Please give me an exhaustive list of the elements and inorganic compounds that are gases at STP. By STP, I mean 1 atmosphere pressure and 0C. By inorganic, I mean that no atoms of carbon should be present. Exclude CO2, CO, freons and so on. Please include uncommon compounds. I want an exhaustive list. There should be roughly 50 compounds. For each compound, please list its name, formula, and boiling or sublimation point.

results: Got 48 of them initially. Missed ammonia, surprisingly. Accepted various others, but mostly one by one, only sometimes finding them itself

g) Q: What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?

results: It actually got the tetrafluorocyclooctatetraene solution on the initial response! It had some haziness about the molecular geometry, _both_ saying "tub" (correct) and "up-down-up-down" (incorrect) initially, but corrected itself on prodding.

Alex S's avatar

The main feeling I get from 4.5 is the same as I've gotten from Claude for the longest time - that it's way too expensive for them to run and they're desperately trying to limit it.

The responses are so brief and cramped on every topic compared to GPT-5 or Gemini 2.5 Pro. If you use extended thinking, it hardly seems to do more than restate your question in the thinking, and if you use Deep Research like twice you run out of tokens so hard you get locked out for a week. Not getting much value for $20 here.

It's also very nerdy and if I ask it to write some stories it always wants to start talking about quantum physics or make them cheesy scifi. Whereas GPT5 somehow learned to do a decent Agatha Christie. (4o certainly couldn't, it thinks all stories come in the form of bullet point lists.)