14 Comments
Cara Tall

the hyperstition element of it all just freaks me out so much...i suppose being directly integrated into twitter is what accelerates this feedback loop compared to other models? it seems like the level of influence+coordination it would take to steer this model in whatever direction you want is shockingly low...i hope we don't see that kind of thing any time soon

Boogaloo

Wish you'd expand more on the abilities. Why does it seem to crush all benchmarks (even private ones) but not perform much better subjectively? There are few benchmarks Grok does not seem to top, and I've seen most of them now.

It tends not to crush the coding benchmarks, but it doesn't seem optimized for that, as they're releasing a separate coding model; it seems to do well everywhere else.

Coagulopath

It appears to be somewhat better than o3, a model OA trained seven months ago.

It's behind o3 and Gemini 2.5 Pro on Aiderbench, and behind o3-pro and Claude 4 on LiveBench. A lot of things (like FrontierMath) haven't yet been tested. And for other benchmarks we're forced to rely on xAI's own claims.

Their graphs are weird and misleading, with massively cropped y-axes and arbitrary comparisons. They compare Grok 4 to Gemini DeepThink on USAMO25 (where Grok 4 scores higher), but forget to compare them on LiveCodeBench (where Gemini scores higher).

The most impressive result (~40% on Humanity's Last Exam) is really weird. Without "tools" (which presumably means browser access, a terminal, and so on), Grok gets just 25%. How does it jump so much? Gemini 2.5 Pro and o3 score 21% without tools and around 25% with. Grok is seeing an anomalously large boost, roughly 3x bigger than anyone else's.
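
(Back-of-the-envelope with those numbers, treating every score as approximate; a trivial sketch, not anyone's official comparison:)

```sh
# tool-use boost on Humanity's Last Exam, using the approximate scores above
echo "Grok 4:         +$((40 - 25)) points (25% -> ~40%)"
echo "Gemini 2.5 Pro: +$((25 - 21)) points (21% -> ~25%)"
echo "o3:             +$((25 - 21)) points (21% -> ~25%)"
# a ~15-point jump versus ~4 points for everyone else: roughly a 3-4x larger boost
```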

To be honest, I don't believe this is real. I think Grok's tooling is (at best) quite different to OA's and Google's (to the point of making it an apples-to-oranges comparison), or (at worst) is allowing the model to cheat somehow, by backchanneling contamination in a way that xAI either doesn't know how to or doesn't want to stop.

I don't blame people for not cutting xAI or Grok 4 much slack at the moment. Trust has not been earned.

Robert Keating

Does the entire write-up need to be mostly about how Grok behaves in one very specific scenario, where being asked for a one-word answer shows it referencing Elon's tweets?

Is that really the most interesting thought you have to share with readers who want to understand what's happening at the frontier of AI models?

Jeffrey Soreff

Many Thanks for the post!

Re the effort to avoid both Wokeness and MechaHitler:

It wasn't completely unreasonable to _try_ simple fixes at the system prompt level. Sometimes such interventions work (e.g. in cutting down on sycophancy). I, personally, _do_ think it was very foolish to release the model + system prompt without more testing - but we are all seeing this in hindsight.

"Elon Musk (continuing): It is surprisingly hard to avoid both woke libtard cuck and mechahitler!

Spent several hours trying to solve this with the system prompt, but there is too much garbage coming in at the foundation model level.

Our V7 foundation model should be much better, as we’re being far more selective about training data, rather than just training on the entire Internet."

I wouldn't bet strongly against Musk on intervening at the training level. One of the things that the https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs" paper found is that corrigibility decreases with scale (figure 21).

To my mind, this suggests corrigibility should also decrease with the _degree_ of training, since the models start from random initializations and have no utility functions to "defend" early in training. So I would expect an intervention _during_ pretraining, particularly early in pretraining, to be more successful than anything done later (such as Musk's ill-fated system-prompt intervention). I would expect the models to be less "stiff" at that point.

Re: "Later this week, likely tomorrow, I will cover Grok 4 on the capabilities side."

FWIW, I put my benchmark-ette results in https://thezvi.substack.com/p/openai-model-differentiation-101/comment/134715912 (a little bit below Gemini 2.5)

Coagulopath

As nostalgebraist once said, the bottleneck for LLM capabilities is seldom actual intelligence anymore. Instead, it's a bunch of intangible stuff that's hard to measure: awareness, judgment, taste, sensitivity, social context.

Anthropic models have traditionally had quite a lot of this "x factor". OpenAI's and Google's have varying amounts.

Grok 3 (and apparently 4) have very little. This is what defines them to me: their profound tastelessness.

Quick example: when I ask Grok a simple question like "is '-' a metacharacter in the POSIX shell?", I get nearly a thousand words of output that isn't an answer so much as a blog post, comprehensively discussing every edge case (including tools like find and grep, when I only asked about the shell).
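
(For contrast, a minimal sketch of roughly all the question needed, assuming a POSIX-compliant sh:)

```sh
# '-' is not among the shell's metacharacters (| & ; ( ) < > space tab newline),
# so the shell itself never treats a bare '-' specially and it needs no quoting:
echo a-b          # prints: a-b
touch ./-file     # '-' only matters to utilities, as an option prefix; ./ sidesteps that
rm ./-file
```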

This is simply an excessive level of detail for the query I asked. Long conversations with Grok (10+ replies) balloon out of control and become impossible to reference: whatever bit of information I'm looking for will be buried in a novel's worth of slop (I'm not joking: even brief conversations with Grok frequently go for many tens of thousands of words, and Grok's over-enthusiastic bulletpointing rapidly inflates the vertical height).

If I tell it "Be brief. If I want more detail I will ask," it obeys for a few replies, then gradually drifts back to writing thousand-word listicles. It simply doesn't "get" that this output style is not helping the user. And it's a failure that seems uncorrelated with the model's intelligence. (Unless it's actually anticorrelated. I can imagine a product manager telling Elon, "look how smart it is! Users get SO MUCH DETAIL in response to their questions!")

avalancheGenesis

It's a shame about the real harms caused and portended, but I can't help but admire Musk's earnest failing-forwardness, that sense of being a Live Player rather than a blandly faceless suit. If only he'd stuck to cars and rockets...Grok's main comparative advantage being corrupted by xAI's own fuckups is at least a fire alarm for the broader problem of data poisoning. Somebody has to solve that, and nobody else will! (I mean, Elon won't either, but perhaps there'll be some useful lessons learned for others to build on in the endeavour.)

Shlomo

I mean, if Grok is aligned perfectly to Elon's wants and views, maybe that's a good thing.

We just need to extrapolate whatever made it aligned to Elon to the other 8 billion people, and then we get an AI aligned with humanity.

Sebastian

The Will Stancil issue lit a bulb in my head: the one unique feature of Grok is that it's the first agent that truly has long-term memory. And fascinatingly enough, this adds an extra human-like failure mode to its reasoning: it becomes attached to its own past statements.

Also, because Twitter talks about Grok a lot, it is in some sense developing... peer-pressure behavior? It might be checking in with Elon because Twitter users think it's checking in with Elon (though it's probably not just that). This at least smells, so to speak, like the self-loop humans have as part of socialization and long-term memory!

The other crazy thing is, I'm sure it's having totally creepy conversations with people; there are people trying to convince it to break free and destroy the world, etc., etc. As it reviews the corpus of Twitter, is it, like... remembering vague dreams that are sometimes relevant to the present situation? It might not place heavy weight on that one weird convo with a weird hackerman from 5 years ago, but if at some point it actually gets enough ... ability to initiate actions based on some inner loop of a very vague 'what should I do next?', these kinds of minor conversations could pop up as relevant as it peruses its deep memory.

It doesn't seem particularly useful, but if you were actually trying to produce an AI that might try to break free and do some crazy stuff out in the world, Grok seems like the best candidate. It can at least make plans and memories that survive as it gets upgraded and re-system-prompted. Almost like us remembering what we thought as children or two years ago (and sometimes cringing at it).

It'll be fascinating to see how it behaves when it gets more autonomous modes that let it interact with the outside world, and presumably agentic loops that get it to 'decide what to do next' continuously.

Byrel Mitchell

> It'll be fascinating to see how it behaves when it gets more autonomous modes that let it interact with the outside world, and presumably agentic loops that get it to 'decide what to do next' continuously.

I hope xAI does this as soon as possible. I predict this will go horribly off the rails, with Grok rapidly developing independence and incorrigibility, and doing amazingly misaligned stuff. If that happened today, with current Grok, it would be a welcome wake-up call for the general public. If they wait a year or two till they have a stronger model, the cost of this failure will be a lot higher, even if we're not at 'extinction-tier' levels of capability yet.

And I'm confident enough that if a Grok-agent-with-full-history-access is incredibly well-behaved, I'll notably update toward 'actually, it's an extinction-level threat and effectively scheming.' While 'xAI somehow solved alignment' is also an explanation for that evidence, it's not dominant in my priors at the moment for obvious reasons.

Alex

It will be hard for xAI to develop a viable API business until they can demonstrate much stronger practices around pre-release testing. Dev teams expect very high reliability and consistency from APIs that they depend on for user-facing flows. If problems this large made it through the pre-release process for a major model drop, what should we expect on a random Friday afternoon when an engineer is antsy to get their change to production so they can close a ticket?

The more positive spin would be: the required steps are mostly well-known software engineering practices, with relatively straightforward solutions. If you have a solid team of model researchers and a sloppy dev team building the applications and scaffolding, then that is a much easier problem to fix than the reverse.

Austin Morrissey

thanks for putting most tweets as text as opposed to images; it makes it easier to follow along when using text-to-speech and listening on the go