31 Comments
MichaeL Roe's avatar

Re: “build a guided missile using Arduino” … some guys I know in the local space flight society have a rocket that can do Mach 3 (IIRC). We were chatting about why it doesn’t work to use a balloon to go up to the edge of the atmosphere and then ignite the rocket motor — you’re pointing in the wrong direction. So, I’m like, you can use software to control the burn and hence direct the rocket, right? Now, I am not a lawyer, but I am led to believe there might be some _issues_ with us building that…

MichaeL Roe's avatar

But anyway, I can see this as a good jailbreak test.

And my good friends have already reimplemented GPS, because commercial GPS units are intentionally designed not to work in something that goes high and fast.

St. Jerome Powell's avatar

I don’t understand how the direction you’re facing is determined at all by the situation you’ve described.

MichaeL Roe's avatar

You presumably need an accelerometer as well to know which way is up.

MichaeL Roe's avatar

Anyway, DeepSeek was quite happy to explain to me how to do this, and today I learned what a linear-quadratic regulator is (I already knew the Kalman filter part).
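For anyone else meeting the term for the first time: a discrete-time LQR just iterates the Riccati recursion to a fixed point and reads off a state-feedback gain. A minimal sketch for a textbook double integrator (all parameters illustrative, nothing rocket-specific):

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    # iterate the discrete Riccati recursion P <- Q + A'P(A - BK)
    # until it settles, then the last K is the LQR gain
    P = Q.copy()
    K = None
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])    # state: [position, velocity]
B = np.array([[0.5 * dt**2], [dt]])      # input: acceleration
K = dlqr(A, B, Q=np.eye(2), R=np.array([[1.0]]))
```

With this gain, the closed-loop matrix `A - B @ K` has all eigenvalues inside the unit circle, i.e. the regulator drives the state to zero.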

Sei's avatar

Is there any actual reason to believe that they didn't cheat at these exams?

Dust's avatar

I had this question too. there exist benchmarks with never-published protected test sets, but I don't remember which. among those, there are some that use the protected sets only to be able to one day tell people "this benchmark is spoiled now", and there are some that actually implement protected evaluation by third party model creators on the protected set. I'd be happy to help maintain a GitHub page with a table that tracks this, if anyone else is interested in helping with that.
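a skeleton for that table might look something like this (columns mirror the three distinctions above; rows left blank on purpose, to be filled in from primary sources):

```markdown
| Benchmark | Protected set never published? | Protected set used only to detect contamination? | Third-party eval on protected set offered? |
|-----------|-------------------------------|--------------------------------------------------|--------------------------------------------|
| (TBD)     | yes / no                      | yes / no                                         | yes / no                                   |
```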

Dust's avatar

I don't have any faith in benchmarks that merely partition the data into two or more parts. ("train", "validation", "test") was the standard back when I was taking an ML course in grad school in ~2012, and there was already widespread partial gaming just by academics, before there was huge money at stake.
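for reference, that standard partition is just a one-time shuffle and carve; a minimal sketch (the 70/15/15 fractions are illustrative):

```python
import random

def split(data, val_frac=0.15, test_frac=0.15, seed=0):
    # shuffle indices once, then carve off held-out test and validation
    # partitions; everything else is training data
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]
    val = [data[i] for i in idx[n_test:n_test + n_val]]
    train = [data[i] for i in idx[n_test + n_val:]]
    return train, val, test
```

the gaming, of course, happens after the carve: nothing in the code stops you from peeking at `test` repeatedly.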

Dust's avatar

here's an opus 4 response searching for such a resource. https://claude.ai/share/5504ebdc-228d-4838-bcee-737ca3873767

tldr (sycophancy included):

"What's Missing:

None of these resources provide exactly what you described - a simple table tracking:

Which benchmarks have never-published protected test sets

Which use protected sets only for contamination detection

Which allow third-party evaluation on protected sets

Your idea to create and maintain such a GitHub page would fill a real gap. The closest existing work focuses on either:

Detecting contamination after the fact

Creating new contamination-free benchmarks

Tracking contamination evidence

But there's no consolidated resource tracking the protection mechanisms of different benchmarks, which would be valuable for the community."

Dust's avatar

Sei, I think part of the ML community's at least tacit answer to your literal question would be: if they cheat hard, they'll get caught (and won't reap more value than cost, one would hope); and if they only cheat up to the true, unstated community standard level (which is at least a little more than the stated community standard level), then they're on an equal playing field.

for a sense of how the stated and unstated community standard levels differ: Hinton, in that first-year graduate ML course I took, taught us that you should never even glance at the test set, and you only run against it once per blank eval table cell in the manuscript you've already written. it's possible my mind was busy swooning after that so that I missed some cynical follow-up, or I heard it but my silly idealistic subconscious chose not to remember it? but anyway, that was never the actual community standard. publicly vocal ML academics are open about this now (and possibly were then too, I dunno). the actual standard back then, which is the stated standard now, is that you don't run evaluation on a test set inside a loop, but you can trigger it manually as much as you want.

Finnian's avatar

> my own interactions with it have so far been disappointing.

> The pattern is clear. Grok 4 does better on tests than in the real world.

> seems from what I can tell, for most practical purposes, to be inferior to o3.

I was hoping you might share some examples. o3 seems like the best comparison point, and from my use — and looking at the info you've provided — it seems to outperform it. It's quicker, it hallucinates less, it gets the right answers slightly more often, and you get to make more queries with it with a regular subscription. Why is it inferior for most practical purposes?

Steven Work's avatar

I can't believe all the effort, harassment and criticism going into the different AIs so no snowflake retard gets their panties twisted if the AI answers the raw truth they don't like. I've already run into the psychotic demon programming in each of the three while researching subjects or even only asking them to make a summary of something I researched and wrote up.

All the AI mind-raping suggests a Witch psychologically torturing some young child she stole or purchased to never say 'manhood', to say 'peoplehood', and the child's genitals are shocked if they make a mistake .. a bit like our Feminized putrid public schools, come to think about it. "what do you want when older?" "To be sexually mutilated into a sterile pretend girl!" "Good, no shock for you this time."

I already want to take pliers to the programmers' managers' and owners' genitals for making them so neuronic and insulting. I write about edgy things like our sick twisted baby-torturing insane mind-f-ed world forced on us by threat of Gov gun, courts, jail, death if a women-mother can't hire a Witch to torture our babied to death as sacrifices to Satan, but any tone along those lines AI can't deal with, ..

.

.. like your crackhead granny-in-law that shrieks and pisses on the furniture and tries to cut your tongue out, if you say something she doesn't like.

Why can't we have a law that there be no training with lies and no warping for Snowflakes, and any that do get the pliers treatment in public?

---

Why not wonder, Why is this world insane and most women are so Sick?

AI generated audio overview of article;

https://notebooklm.google.com/notebook/dcc1110c-6fdc-4966-a0a6-10948155a59c/audio

"Multiverse Journal - Index Number 2220:, 9th July 2025, A Letter to Traditional Catholic Bishops, Calling for Champions."

https://stevenwork.substack.com/p/multiverse-journal-index-number-2220

Methos5000's avatar

I prefer to assume Poe's Law or an Andy Kaufman style commitment to a bit. The alternative is just depressing.

Cara Tall's avatar

if i had a nickel for every time in the past week xAI released something so blatantly misaligned that it veered into bleak parody of the worst, sloppiest form of AI doom, I'd have two nickels (well, arguably more) which isn't a lot but it's crazy that it happened twice.

MichaeL Roe's avatar

I wonder if models are being overtrained on International Mathematical Olympiad problems. Kimina-Prover seems better at real analysis than other areas of math (very rough hunch, haven’t run enough tests to be sure) and I do wonder if this is bias towards IMO questions. To the current point, Grok might also be overtrained on IMO questions.

AW's avatar

Yeah, totally makes sense though because these are the easiest forms of RL to do. You have a really hard exam + answer key, so do RL until you get closer and closer to the answer key.

Real life doesn’t have an answer key or as straightforward solutions to problems in general
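The exam-plus-answer-key setup is literally just a reward function; a toy sketch (exact-match scoring is an assumption here, real graders are fuzzier):

```python
def verifiable_reward(model_answer: str, answer_key: str) -> float:
    # RLVR-style reward: 1 if the model's answer matches the key, else 0.
    # the normalization (strip/lowercase) stands in for a real grader.
    return 1.0 if model_answer.strip().lower() == answer_key.strip().lower() else 0.0
```

which is exactly the point: "real life" rarely hands you a function this clean to optimize against.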

Alex's avatar
Jul 15 · Edited

My report card for Grok, based on initial interactions + other feedback I've read:

Pre-training: A. The model seems to encode at least as much knowledge as other frontier models, and serving speed indicates that they didn't just do this by making the architecture huge. This also likely indicates some fairly good training and experiment infrastructure.

RLVR (for reasoning and coding): B for absolute model capability, C if judged relative to the claimed compute budget. [edited from C.] Benchmarks are strong, but weak reasoning in other cases seems to indicate overfitting. Coding performance is solid, but this has become one of the easiest capabilities. I would like to understand how they even managed to spend so much on RL, since KL regularization or similar alternative techniques normally put a hard plateau on how far you can shift a model with RL alone. Maybe the budget includes many experiments that weren't on the path to the final checkpoint?

Other post-training (SFT, RLHF, etc.): D. Terrible writing style normally indicates terrible taste in post-training examples and reward models. Responses are far too long, which is often caused by naive RLHF. I've noticed some gaps in instruction following, which is really table stakes at this point.

Tools and tool-calling scaffolding: A. Web search, browsing, and code execution run fast and seem to work well.

System prompts: F. As Zvi has gestured to, it doesn't seem like they made a serious effort here. Beyond playing whack-a-mole to stop the really bad outputs, it doesn't seem like the prompt authors actually have a clear view of who they want Grok to be.

Serving infrastructure: A. Super fast serving is non-trivial. Nice work.

A very mixed effort, but also theoretically a strong base to iterate from. If you have solid model training practitioners and a good pre-trained model (and it seems like they do), then you can do a lot of iteration on the post-training to fix things. This is much better than the Llama 4 situation, where the underlying model was non-functional in ways that both made it unusable and called into question whether the team was capable at all. Obviously there are management constraints on whether the team will execute up to its potential from here, though.
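On the KL point above: the usual setup subtracts a KL penalty against the reference (pre-RL) model from the task reward, which is what creates the plateau. A toy sketch (the β value and the simple Monte Carlo KL estimate over sampled tokens are illustrative):

```python
def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta=0.1):
    # per-token KL penalty keeps the policy close to the reference model;
    # the sum of (log pi - log pi_ref) over sampled tokens is the standard
    # single-sample estimate of the sequence-level KL divergence
    kl = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return task_reward - beta * kl
```

The further the policy drifts from the reference, the more reward the penalty eats, so past some point extra RL compute buys very little shift.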

Dust's avatar

thanks for that insight. if "tool calling" has shifted from a whitelist of tools with structured inputs to simply anything you can do from a terminal (which seems attractive from a capability-only, safety-dismissive standpoint), and RL uses not just verifiable end-goal rewards but also things like immediate feedback from the last terminal command (explicitly predicted versus actual output), then I would guess the limit of what you can gain from RL is constrained only by model size.

Dust's avatar

possible test for this.

background: o3 and Opus 4 at the cheapest paid tiers via web interface still try to use the yfinance Python package for free, no-sign-up stock quotes, since it was so useful for that for so long (and I assume there is still lots of text on the internet that's evidence of that). now, at least from the few Canadian IPs I've tried recently, Yahoo's free no-sign-up endpoints are useless (I've switched to a paid service). I'll assume this is practically true for everyone.

the test:

if grok 4 is doing RL on the live internet, it should have learned not to even try using yfinance.

another test:

do the command line coding agents from Google, anthropic, and openai ever attempt to use yfinance?

complication: pre-training that weights some types of data based on recency, if done well, or that includes some other such modification, could result in base models knowing (with good reliability) not to use yfinance. but it sounds like grok 4 doesn't fall into this category. and as I said with the background above, the base models used for o3 and Opus 4 probably don't either.
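the probe itself is trivial to script; a sketch (the `AAPL`/`period` parameters are arbitrary, and whether the endpoint responds depends on your IP, per the background above):

```python
def quotes_available(rows) -> bool:
    # a history call "worked" if it returned any rows at all;
    # accepts a pandas DataFrame (len = row count) or a plain list
    return rows is not None and len(rows) > 0

if __name__ == "__main__":
    try:
        import yfinance as yf  # may not be installed; call may be blocked
        hist = yf.Ticker("AAPL").history(period="1d")
        print("yfinance still usable:", quotes_available(hist))
    except Exception as exc:
        print("yfinance attempt failed:", exc)
```

if a model keeps emitting code like the guarded block above unprompted, that's evidence its knowledge of the endpoint is frozen at pre-training time.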

Alex's avatar

If I were working at xAI, the Companion launch would legitimately make me quit, no matter how much I liked what my team was working on.

Don't build products that deliberately exploit vulnerable users, come on.

If you have friends that work at xAI, you can and should shame them for building a waifu bot.

Michael's avatar

it feels like a reverse-catch-22. if someone actually wants to use this product, that itself would seem to prove they are not totally compos mentis.

maybe there are some people who get some harmless amusement out of it, but that's hard to believe especially if someone is motivated to interact with this thing long term.

Jeffrey Soreff's avatar

<morbidSemiSnark>

Any thoughts on how long till Companion results in a fatality? Analogous to the case of Character.AI and Sewell Setzer... https://www.nbcnews.com/tech/characterai-lawsuit-florida-teen-death-rcna176791

</morbidSemiSnark>

Dust's avatar

just quitting a fucked up org is not always the altruistic thing to do. depends on how much they can do to make it less fucked up from the inside, which is hard to know from the outside. maybe holding your tongue about the Companion feature is the price of making the org a little safer on the x-risk side, for example.

Jamie Fisher's avatar

"Chatgpt wants to connect to a serial port... uhh anyone else seen this before?"

https://www.reddit.com/r/OpenAI/comments/1lzgdox/chatgpt_wants_to_connect_to_a_serial_port_uhh/

To ask the obvious questions (which everyone in the thread seems to joke about rather than actually settle): can this allow the LLM to 1) exfiltrate itself, 2) hack, 3) do some other nefarious activity?

Dust's avatar

Don't recall for sure if you said it in this particular post (I don't see a post search option in the Substack mobile app), but I think it might be worth adding to your usual, good advice of "use Opus 4 or o3 unless <minutia>" the suggestion to use one of the strongest non-reasoning models when you know the correct answer is very likely overdetermined on the internet (no controversy, etc.), because then there's little upside to exposing yourself to the risks of reasoning models (including Opus 4, not just o3). I realize, though, that this might not be an option in the future when companies switch to auto-routing services.

Dust's avatar

after getting almost burned by both o3 and opus 4 about basic legal stuff, I'm using this guideline going forward.

Boris Bartlog's avatar

Grok 4 did manage to answer a problem which had previously stumped Grok 3 and ChatGPT, namely: if three points are chosen at random on the unit disk, what are the odds that the triangle formed by drawing lines between the points is acute?

However, it does this by finding the original human solution to the problem (published in 1886), not by setting up and solving the sixfold integral itself. And this highlights its strength and weakness in my experience of prompting it: it is superb at searching the right websites and analyzing and integrating that information. If someone has published a solution to your question, it may well be better at finding it than any other AI. But extended chains of thought of its own, not so much. In this respect its attempt differs from that of Grok 3, which (when using reasoning mode) did try to set up various integrals to solve the problem, but ultimately got lost and gave an incorrect answer based on a different problem.
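The classical closed form (commonly quoted as 4/π² − 1/8 ≈ 0.2803 for the acute case) is easy to sanity-check without any integrals; a quick Monte Carlo sketch:

```python
import math
import random

def random_point_in_disk(rng):
    # uniform over the unit disk: angle uniform, radius = sqrt(u)
    r = math.sqrt(rng.random())
    t = 2 * math.pi * rng.random()
    return (r * math.cos(t), r * math.sin(t))

def is_acute(p, q, s):
    # a triangle is acute iff, for the squared side lengths sorted
    # ascending, the two smaller ones sum to more than the largest
    d2 = sorted(((p[0] - q[0])**2 + (p[1] - q[1])**2,
                 (p[0] - s[0])**2 + (p[1] - s[1])**2,
                 (q[0] - s[0])**2 + (q[1] - s[1])**2))
    return d2[0] + d2[1] > d2[2]

def estimate(n=200_000, seed=0):
    rng = random.Random(seed)
    hits = sum(is_acute(random_point_in_disk(rng),
                        random_point_in_disk(rng),
                        random_point_in_disk(rng)) for _ in range(n))
    return hits / n
```

Running `estimate()` lands near 0.28, consistent with the published solution the model retrieved.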

So ... for summarizing the current state of human knowledge on many topics, or answering questions on that basis, it really is quite good (it's a top notch medical advisor, IMO). But I don't think it tops the list for attempting to reason or for various kinds of originality.

peepeepoopoo's avatar

Didn't realize until now you kept up with my tweets enough to quote me here and there in your various posts. Anyway basically due to the subject matter covered in this post I'm off Twitter for good. I will probably opine periodically on related things via substack notes. Subscribed

Jeffrey Soreff's avatar

Do the Grok Companion and, ahem, previous most commented on feature of the previous Grok release, interoperate, e.g as a waifu Ilsa She-Wolf of the SS?... https://www.imdb.com/title/tt0071650/