Discussion about this post

Boogaloo

The increase in hallucinations makes this model basically a non-starter imo.

Does anyone know Gemini 2.5's scores on hallucination evals?

Coagulopath

The first two of Tyler Cowen's examples of o3 "[wiping] the floor with the humans" don't seem much better than what Claude 3.5 Sonnet (or even GPT-4) could do at launch. Maybe you could argue GPT-4 was AGI, I guess. Many did.

o3 is very detailed. But increasingly, I don't trust the details. A lot of specific claims are unsourced, and often its sources are misquotes or just wrong.

It claims that David Burliuk's "early canvases were even reported to have been used to patch chicken coops in a dacha outside Kyiv". But the page it links to doesn't say that. Direct quote: "Tenants in a countryside dacha allowed damage to the paintings from chickens." A small issue, maybe. But what else is it getting wrong?

The third example requires search, and would have been beyond 3.5 Sonnet and GPT-4. It seems good. But how can I be sure? For anything important, I'd still feel the need to fact-check it, and then search myself to make sure it didn't miss anything, and maybe ask a smart human...

The value of a smart researcher isn't that they produce prettily formatted reports; it's that I can trust them. I don't have the time or intelligence to learn everything. I need to be able to offload mental work, to think "okay, Zvi/Scott/gwern/etc looked at this, so I'll trust what they say".

I still can't trust LLMs. They're amazingly smart at lots of things. But they fundamentally do not care about being truthful. We have a "next token" dial we're trying to adjust so that it matches a "factually correct" behavior mode, and sometimes it works, but not all the time.
