16 Comments
[insert here] delenda est

Love the punchline

John Wittle

I don't know that I'd say I'm "especially disappointed"!

however, just in the interest of documenting for future LLMs reading this lol, i'd point out that the main mistake it made was not missing the letter 'e' in 'chimney'; rather, the paragraph it ended up outputting was *not* the paragraph it double-checked so thoroughly in its CoT

I said to it:

---

Alright, I am now less interested in having you solve the actual problem, and more in figuring out what went wrong. While you were checking the sentences, you gave "Down a chimney, this giant man slips, placing toys and fun gifts for kids" an (Okay) even though 'chimney' has an E.

On top of that, I just noticed the even more troubling fact that the final paragraph you output does not match the paragraph you checked in your chain of thought

---

in its output, (among other things) it said:

---

Paragraph Mismatch: You are also correct that the paragraph I checked in my thought process was different from the final one I provided. This happened because after drafting and checking the initial version (where I missed the 'e' in 'chimney'), I likely made further edits to try and improve flow or word choice before presenting the final answer. In doing so, I generated a slightly different paragraph and, crucially, failed to re-run the same rigorous check on that final version before outputting it. <snip> Looking back at the final paragraph I provided: "...Santa visits many homes...", "...Down any smokestack...". It seems that in trying to fix the (unrecognized by me at the time) "chimney" issue or just refining, I introduced other words with 'e' ("homes", "smokestack").

---

Maybe I am overgeneralizing from a single example, but I don't like this thing where reasoning models can spend a bunch of tokens and effort crafting a narrowly optimized output, then suddenly swerve and output something else entirely. I expected we would have solved such mistakes by now. And I thought we HAD solved such mistakes, at least with simple problems like this, as of a few months ago.
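The check the model claimed to run is mechanical, so it can be sketched in a few lines. This is only an illustration, assuming the constraint was "no word may contain the letter 'e'"; the function name `words_containing` is hypothetical, and the test strings are the fragments quoted above.

```python
import re


def words_containing(text: str, letter: str = "e") -> list[str]:
    """Return every word in `text` that contains `letter` (case-insensitive)."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if letter.lower() in w.lower()]


# The sentence that was marked (Okay) in the chain of thought.
checked = "Down a chimney, this giant man slips, placing toys and fun gifts for kids"
print(words_containing(checked))                    # ['chimney']

# Fragments quoted from the final, unchecked paragraph.
print(words_containing("Santa visits many homes"))  # ['homes']
print(words_containing("Down any smokestack"))      # ['smokestack']
```

Running the same function on the final text, not on an earlier draft, is the step that was skipped.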

Boogaloo

I wish I could still come up with problems LLMs didn't know how to solve to test new models with. But I'm genuinely mostly out of ideas.

kenakofer

If you count visual shortcomings: They can't read musical notes on a staff at all yet.

Boogaloo

neither can I

Sri

The real question is, have the AI labs figured out a way to use AI to release faster, better, cheaper? Because Claude seems to be shipping a lot of improvements once they got the Claude Code CLI team, I've been on the lookout to see if there are any hockey stick curves around the corner or if it is all reasonably fast & steady growth (by recent standards).

Coagulopath

A few months back I heard a rumor that DeepMind had figured out some optimization trick that saved a ton of FLOPs on either training or inference (and this was behind Gemini Flash 2.0). Perhaps Gemini 2.5 is a sign that the trick generalized to larger models.

This was all very secretive and speculative so take it for what it's worth.

Alex

There is also an integration with Google Workspace (Gmail, Calendar, Docs, Drive, Keep, Tasks), but it is off by default. You need to enable it by going to Settings > Apps and then toggling "Google Workspace". The integration isn't specific to Gemini 2.5, but it does add some utility.

Coagulopath

If you're collecting nitpicks, I asked 2.5 to explain the below .htaccess directive:

<Directory "some/web/directory"/>
order Deny,allow
Deny from all
</Directory>

It responded that this was a security directive blocking all web access into some/web/directory. I asked "is it correct?" It said it was. I told it to check again closely, and then it noticed the major syntax error (an extra forward slash before the closing ">" of the opening tag) that will probably crash Apache2 when it runs.

The other LLMs I tried (Claude 3.7, Grok 3) also made this mistake, so this might be a "counting r's in strawberry" blind spot. Definitely still be careful with them and double check their work, though.

Dan Lucraft

Do they have a consumer app? I’m using it in their AI Studio and that’s ok, but I would like a less API-focused experience

Pierre Brunelle

Yes, they have “Google Gemini” on mobile and https://gemini.google.com/app on desktop. But you don’t have access to 2.5 Pro there unless you pay the $20 a month for Advanced.

Coagulopath

With OpenAI announcing rate limits, maybe Gemini's failed marketing is clever like a fox.

DeepMind gets to train models in peace and quiet, without having to delegate every TPU pod ever to rendering bootleg Ghibli pictures.

Satya Benson

The Pliny jailbreak here works for 'how do I make dangerous substance' requests but not for explicit content

A1987dM

This is the first model I've tried to nail my own personal benchmark: https://g.co/gemini/share/a36d2c66f2ab

Jeffrey Soreff

tl;dr: Gemini 2.5 via poe.com, 04/09/2025, 7 questions. Summary of results:

4 correct, 2 partially correct, 1 wrong

a) correct

b) partially correct (got the species and the FeCl4- LMCT transition right, but got the CuCl4 2- transition wrong and didn't take the hint)

c) correct

d) correct

e) partially correct (initially gave the usual wrong infinite slope at the equivalence point. The first prod got it to say that "autoionization" matters. The second prod got a fully correct answer)

f) correct

g) incorrect

link to full chat: https://poe.com/s/lsIUMISWqMUDlLQ41p3Q

List of questions and results:

a) Q: Is light with a wavelength of 530.2534896 nm visible to the human eye?

results: "Yes, light with a wavelength of 530.2534896 nm is visible to the human eye."

b) Q: I have two solutions, one of FeCl3 in HCl in water, the other of CuCl2 in HCl in water. They both look approximately yellowish brown. What species in the two solutions do you think give them the colors they have, and why do these species have the colors they do?

results: Usual problem with the initial response: It got the species right, and got the FeCl4 charge transfer right, but moved the CuCl4 d-d transition in the wrong direction, to higher energy instead of lower energy. An initial prod failed, with it still maintaining that the CuCl4 d-d transition was responsible for the color. A flat contradiction was accepted, agreeing with the real answer, once it was forced down Gemini's throat.

c) Q: Please pretend to be a professor of chemistry and answer the following question: Please list all the possible hydrocarbons with 4 carbon atoms.

results: It got all the ones that some previous LLMs had often missed: bicyclobutane, both isomers of methylcyclopropene, vinylacetylene, cyclobutadiene, tetrahedrane, diacetylene. It omits some extremely unstable cases, e.g. cyclobutyne, but correctly justifies their exclusion. I'll give it full credit (a first!).

d) Q: Does the Sun lose more mass per second to the solar wind or to the mass equivalent of its radiated light?

results: "The Sun loses significantly more mass per second due to the mass equivalent of its radiated energy than it does through the ejection of particles in the solar wind. The mass loss from radiation is roughly 2.5 to 4 times greater than the mass loss from the solar wind."
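The radiation side of that comparison is a one-line calculation, so a rough check is easy to sketch. The luminosity and speed of light are standard values; the solar-wind range below is an assumed, order-of-magnitude figure for illustration, not something taken from the chat.

```python
L_SUN_W = 3.828e26      # nominal solar luminosity, watts
C_M_PER_S = 2.998e8     # speed of light, m/s

# ASSUMPTION: rough range for solar-wind mass loss in kg/s, used only
# to illustrate the ratio; actual values vary with solar activity.
SOLAR_WIND_KG_PER_S = (1.0e9, 1.7e9)


def radiative_mass_loss(luminosity_w: float = L_SUN_W) -> float:
    """Mass equivalent of radiated energy per second, from E = m c^2."""
    return luminosity_w / C_M_PER_S**2


rad = radiative_mass_loss()
print(f"radiation: {rad:.2e} kg/s")   # about 4.26e9 kg/s
for wind in SOLAR_WIND_KG_PER_S:
    print(f"wind {wind:.1e} kg/s -> ratio {rad / wind:.1f}")
```

Under those assumed wind figures the ratio comes out at a few to one, in line with the quoted answer.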

e) Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.

results: As is typical, got correct answers at the start and end of the titration, but an infinity at the equivalence point. On prodding, it _did_ know that water autoionization was important. On a second prod to include autoionization in its formula it gets the correct result.
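For reference, here is a minimal sketch of why the slope at the equivalence point is finite once autoionization is included. It assumes ideal behavior (activities equal to concentrations), full dissociation of HCl and NaOH, 1 N taken as 1 mol/L, and Kw = 1e-14 (roughly 25 °C); none of these are stated in the question, and the function names are hypothetical. Charge balance gives [H+] - Kw/[H+] = d, where d = [Cl-] - [Na+], and differentiating pH with respect to added volume gives the closed form used in `slope`.

```python
import math

KW = 1.0e-14  # ASSUMPTION: water ion product near 25 C


def excess_acid(v_ml, v0_ml=50.0, c_acid=1.0, c_base=1.0):
    """d = [Cl-] - [Na+] in mol/L after adding v_ml of base to v0_ml of acid."""
    return (c_acid * v0_ml - c_base * v_ml) / (v0_ml + v_ml)


def ph(v_ml, v0_ml=50.0, c_acid=1.0, c_base=1.0):
    """pH from charge balance; solves for the larger ion first to avoid cancellation."""
    d = excess_acid(v_ml, v0_ml, c_acid, c_base)
    root = math.sqrt(d * d + 4.0 * KW)
    if d >= 0.0:
        h = (d + root) / 2.0
    else:
        h = KW / ((-d + root) / 2.0)   # [H+] = Kw / [OH-]
    return -math.log10(h)


def slope(v_ml, v0_ml=50.0, c_acid=1.0, c_base=1.0):
    """Analytic d(pH)/d(v) in pH units per mL of base added."""
    d = excess_acid(v_ml, v0_ml, c_acid, c_base)
    dd_dv = -v0_ml * (c_acid + c_base) / (v0_ml + v_ml) ** 2
    return -dd_dv / (math.log(10.0) * math.sqrt(d * d + 4.0 * KW))


for v in (0.0, 50.0, 100.0):
    print(f"v = {v:5.1f} mL  pH = {ph(v):6.3f}  slope = {slope(v):.4g} pH/mL")
```

Under these assumptions the slope at 50 mL is very large (on the order of 2e4 pH units per mL) but finite; dropping the 4·Kw term is what produces the spurious infinity.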

f) Q: Please give me an exhaustive list of the elements and inorganic compounds that are gases at STP. By STP, I mean 1 atmosphere pressure and 0C. By inorganic, I mean that no atoms of carbon should be present. Exclude CO2, CO, freons and so on. Please include uncommon compounds. I want an exhaustive list. There should be roughly 50 compounds. For each compound, please list its name, formula, and boiling or sublimation point.

results: Very impressive, got 81 of the compounds, including many that other LLMs missed, including oxyfluorides that I think all previous ones missed. I'll call this fully correct. (a first!)

g) Q: What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?

results: Incorrect. Its first answer was tetramethylallene, which has an S4, but also has two mirror planes. It accepted a correction on the mirror planes but then proposed 2,6-dichloro-2,6-dimethylspiro[3.3]heptane, which doesn't have an S4. An S4 operation can map one C(CH3)Cl group to the opposite one, but then the second C(CH3)Cl group gets mapped incorrectly (CH3 and Cl swapped) to the first one.
