24 Comments
User's avatar
Performative Bafflement's avatar

To your "the models you use in which order matters" point, anyone know of any MoE style local architectures that cooks for them? Something based on langchain, probably?

With how impressive 2.5 Advanced is (in my first conversation, I was convinced to subscribe and shifted 90%+ of my AI usage over), and how apparently impressive o3 is now, it seems like there'll be a lot of lift possible from locally combining the top models in some ordering with enough context about you and your projects, and this crowd if any should know about or be using something like this.

And of course, this will only get more valuable in the future as ever more impressive and cheaper models keep coming out. Anyone?

Expand full comment
SOMEONE's avatar

I think the MoE save on compute (vs dense models) but maintain the memory needs - which is what makes the large models though to run at home in the first place.

Expand full comment
Performative Bafflement's avatar

Oh yeah, I guess I wasn't clear - I'm not talking local llama style models running on local GPU's, I'm talking about some sort of architecture making API calls to different models such that you can order and prompt them for different parts of a workflow, and sub different models in and out of the different parts of the workflow.

Expand full comment
SOMEONE's avatar

Early steps of that exist in some of the coding agents, roo for example can have different LLMs fill different roles - use expensive reasoning models to plan, cheaper non-reasoning ones to implement individual features

Expand full comment
loonloozook's avatar

Does o3 use all these tools when accessed via API (e.g. in OpenRouter) or do I need a ChatGPT subscription to use all its power?

Expand full comment
Coagulopath's avatar

The first two of Tyler Cowen's examples of o3 "[wiping] the floor with the humans" don't seem much better than what Claude 3.5 Sonnet (or even GPT-4) could do at launch. Maybe you could argue GPT-4 was AGI, I guess. Many did.

o3 is very detailed. But increasingly, I don't trust the details. A lot of specific claims are unsourced, and often its sources are misquotes or just wrong.

It claims that David Burliuk's <em>"early canvases were even reported to have been used to patch chicken coops in a dacha outside Kyiv".</em> But the page it links to doesn't say that. Direct quote: <em>"Tenants in a countryside dacha allowed damage to the paintings from chickens."</em> A small issue, maybe. But what else is it getting wrong?

The third example requires search, and would have been beyond 3.5 Sonnet and GPT-4. It seems good. But how can I be sure? For anything important, I'd still feel the need to fact-check it, and then search myself to make sure it didn't miss anything, and maybe ask a smart human...

The value of a smart researcher isn't that they produce prettily-formatted reports, it's that I can trust them. I don't have the time or intelligence to learn everything. I need to be able to offload mental work, to think "okay, Zvi/Scott/gwern/etc looked at this, so I'll trust what they say".

I still can't trust LLMs. They're amazingly smart at lots of things. But they fundamentally do not care about being truthful. We have a "next token" dial we're trying to adjust so that it matches a "factually correct" behavior mode, and sometimes it works, but not all the time.

Expand full comment
Boogaloo's avatar

agreed, you can't trust this.

Now. I believe someone like ZVI would say 'you also can't trust your average human at all', and that is correct.

So. Who knows. o3 clearly gives you better answers across almost all topics compared to a 100IQ human being right? So what are we to make of this?

Expand full comment
SOMEONE's avatar

I evern had top 1%ish humans give me provably wrong data (like one query to the obvious, authorative source after looking at it for 5s and saying. that cannot be right) and insist it was correct upon being pointed out it was wrong...

But overall, yes the hallucinations, together with feeding enough context easily, remain the biggest issues.

Expand full comment
SOMEONE's avatar

In AI Studio, Gemini at least has Grounding with Search which seems to counter some of it. I still have no idea if Gemini in its own apps does that or not... Mostly I use Gemini Deep Research and there sources are right next to the claims and the few times I checked, the claim was in the source, at least.

Expand full comment
Mo Diddly's avatar

The inflection point is not going to be when AI’s can do tasks better than a person can, but when an AI system by itself performs consistently better than the combination of an AI and a human. I have a little doubt this will happen eventually, but I don’t get the sense we are near this at present.

Expand full comment
David J Higgs's avatar

At some point, probably in the next ~5-6 years, most intellectual tasks in the economy will be such that {Current Year AI} > {2024 humans using 2024 tools}. But as you point out, this by itself is often irrelevant. So the question is, when do we get to {Current Year AI} >= {Current Year humans using AI tools}? Is there a large or small gap from the first to the second occurrence?

Many people like the authors of AI 2027 and Zvi seem to think it's highly likely the gap is short. Others like the EPOCH AI researchers that have been putting out a bit of podcast audio recently seem to disagree.

I would love to see more people taking both possibilities seriously (though not identically likely ofc), and planning accordingly.

Expand full comment
Jeffrey Soreff's avatar

Agreed. Frankly, I'm astonished that it has proved possible to take a neural net trained on "predict the next token" and get so many truthful answers from it, despite that training. And yet, it often works, and the trend looks quite favorable...

Expand full comment
Boogaloo's avatar

The increase in hallucinations makes this model basically a non-starter imo.

Does anyone know the scores of gemini 2.5 on hallucination evals

Expand full comment
Hugh Hawkins's avatar

The stuff about METR and Apollo uncovering the lies of o3 reads like part of an AI 2027-style story. Some people noticed the subtle misalignments early? I bet that OpenAI will implement robust fixes rather than surface-level tinkering. Surely the superintelligent GPT-7o-high will be fully aligned!

Expand full comment
[insert here] delenda est's avatar

Don't worry it will be called 3o-2.8-low so there will be nothing to worry about

Expand full comment
Jeffrey Soreff's avatar

"three Tracked Categories: Biological and Chemical Capability, Cybersecurity, and AI Self-improvement."

Is the AI Self-improvement threshold tied to any of the benchmarks?

Should we be watching e.g. the SWE-Bench Verified Software Engineering benchmark closely, and how high would it have to be for an AI to be RSI-ready?

Expand full comment
SorenJ's avatar

One thing worth mentioning is that I've found o3 to be very lazy, and I've seen others say similar.

Expand full comment
Curious mathematician's avatar

I just spent a few hours with o3 and while I will always have a soft-spot for Claude 3.7 and Gemini 2.5 (and continue to use them in many cases), I think o3 has reached a level that no prior LLM has. This is actually a tool that might be useful to me in my day-to-day research (I'm a mathematician) in a way that goes beyond just being a better search engine. It definitely is not AGI and hallucinations are still an issue, but it is able to self correct in ways that no other LLM I have used is able to.

I asked it to provide a proof of a fact that I'm sure is known but for which I did not have a ready reference (you'd be surprised how much time mathematicians waste on situations like this, and search engines are of limited use: likely no one ever proved exactly the result you need (or stated it the same way), they proved something similar using different notation/terminology and published it in an obscure journal that is sitting behind a paywall or not available online).

To it's credit o3 pointed me to what turned out to be exactly the right paper (it did not hallucinate the paper, it even gave me a valid link). This isn't so impressive, as LLMs are already pretty good at this (but their batting average is far from perfect). It then made the usual LLM mistake of giving what looks like a very convincing argument on the surface but includes completely made up references (e.g. writing "by Lemma 2.3 on page 145 we have ..." when in fact the paper does not contain a Lemma 2.3 or a page 145), which leads one to wonder if it also contains completely made up facts.

But in this case the argument looked mostly right, so I uploaded a PDF of the paper and called bullshit on its references. To its credit, o3 read the PDF, found the statements it meant to cite, and rewrote the argument. The revised argument included a step I still didn't quite buy, so I called bullshit again, and after a bit of back-and-forth it eventually admitted it needed an additional hypothesis, but this was fine because the hypothesis holds in my case.

In every previous version of this scenario where I have gotten into an argument with an LLM one of the following has happened: (1) it sticks to its guns and gaslights you, (2) it agrees with everything you say but adds no value, (3) it admits defeat and says it doesn't know, or (4) it loses its chain of thought and wanders off into the bushes.

This is the first time I've engaged in an argument with an LLM that eventually led to something useful enough to justify the time spent on the argument (it was still less time than it would have taken me to figure everything out myself).

Expand full comment
Player1's avatar

I'm also in math, and had a similar experience. I asked a (doable, but nontrivial) question and o3 found the right paper in the literature (which I didn't find from googling), and extracted roughly the correct argument from it. (It also constantly makes up theorem numbers/references). This is one of my first interactions with an LLM that has had positive value.

Expand full comment
[insert here] delenda est's avatar

When I read those evals, and those probabilities, and consider that this is merely first impressions, my instinct is that this model is high risk already. As you say, people can, and will, combine this with other models and tools, and jailbreak it.

It is a shame that Pliny is massively motivated not to do this but he would be doing humanity a massive service if he used AIs to stage a major but non-fatal biological agent attack on DC. Something like making everyone very sleepy for a couple of hours in the middle of the day in Congress, the White House and the Pentagon all at once.

My P(he could do that) is around 75%. I hope I'm overly pessimistic.

Expand full comment
Aaron Bilger's avatar

'syncopathy' is pathological fainting. Are you meaning instead the LLMs being sycophantic?

Expand full comment
Mike Lambert's avatar

“how are we successfully defending against jailbreaks at 97%+ rates earlier, and now we are failing at instruction hierarchy questions 14%+ of the time. Why isn’t this resulting in massive de facto jailbreaking?”

Might be easier to model jailbreaks as attacks against the Platform itself, where it should unconditionally follow the rule regardless of the prompt. But it sometimes can fail (jailbreak)

And insurrection hierarchy captures Platform as well as “all the rest”, where “all the rest” involve it following the rule conditional on what is in the earlier prompt. And turns out the conditional rules are harder to learn perfectly and can be more easily attacked (prompt injection)

I assume most of the instruction hierarchy examples (overall and ones that succeed) are likely not jailbreaks against the Platform.

Expand full comment
Pablo's avatar

It's "sycophancy", not "syncopathy".

Expand full comment