17 Comments

Elon Musk is not misaligned. He's one of the few human beings who is aligned with a good future for humanity and is also able to do something about it.

Most human beings may have a 'mature form', unlike Elon, but are doing nothing substantial to beat the game of death we are all playing.

Elon Musk is zero form (clearly) and all substance.

He's a live player, a great man of history; everyone else seems mostly dead, walking towards death in their comfortable but Molochian ways.


You are conflating "good" with "substantial". Zvi's rough claim, shared by many, is that this immature idiot may get us all killed. That would be substantial, but not good.


Tom's just tested Gemini vs. Perplexity vs. Grok 3, and the winner was Perplexity. I didn't know Perplexity had such a thing going until yesterday; now I've got to give it a go.


What model on Perplexity?


I believe the one they are talking about is a Perplexity-modified DeepSeek R1. They nullified the Chinese reference hesitancy. It is also available for download on Hugging Face in various numbers of bits per parameter; six bits seems almost perfect, and four pretty good. I have a 32B of the original R1 running at home now, and I am working on a setup with WebAI for setting up a RAG, but I'll take a look at the Perplexity version very soon and may switch over to that.
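
If you want to pull one of the quantized builds yourself, here's a minimal sketch using huggingface_hub. The repo id and the 6-bit (Q6_K) filename are assumptions on my part, so check the actual model card for the real identifiers:

```python
# Minimal sketch: fetch a quantized build of Perplexity's R1 variant from
# Hugging Face. The repo id and filename below are assumptions -- verify
# them against the actual model card before running.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="perplexity-ai/r1-1776",  # assumed repo id
    filename="r1-1776-Q6_K.gguf",     # assumed 6-bit (Q6_K) quant filename
)
print("Downloaded to:", path)
```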


Whoops, I got AI on the brain; that should have been Open WebUI.


"And I like they left the thinking trace public, and it's similar to deep seek." - Adding for context that this isn't completely true. During the Mars-transfer-window demo Musk said the reasoning trace is obfuscated in some way.


Update from an xAI employee - they decided to leave it open:

https://x.com/keirp1/status/1892672080086274509?t=YyKnim5IhDI4Hgw0HSli1g&s=19


These shorter, more frequent blog posts are much better. I wasn't sure at first, but as a reader, they're much less daunting to tackle and I'm less tempted to skip or skim. Thank you, as always!


My current expectation is that we’re on a rapid upward curve, and each new release from any of the major AI labs will briefly claim the top spot, unless they’ve screwed up somehow.


Excellent post, thank you.


One nice thing about a new model (and perhaps under-rated for everyday use?) is the later knowledge cutoff. I had a MySQL 9 question which all the other major LLMs failed on but Grok could answer. Perhaps a reason to add any new model to your rotation.
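
A minimal sketch of that rotation, assuming OpenAI-compatible endpoints; the base URLs, model names, environment variables, and sample question below are all stand-ins, not my actual setup:

```python
# Hedged sketch: ask the same question of several OpenAI-compatible
# endpoints and compare answers. Every identifier below (URLs, model
# names, env vars, the question itself) is an assumption/stand-in.
import os
from openai import OpenAI

QUESTION = "Was mysql_native_password removed in MySQL 9.0?"  # stand-in question

endpoints = [
    ("https://api.openai.com/v1", "gpt-4o", "OPENAI_API_KEY"),
    ("https://api.x.ai/v1", "grok-3", "XAI_API_KEY"),  # assumed model name
]

for base_url, model, key_var in endpoints:
    client = OpenAI(base_url=base_url, api_key=os.environ[key_var])
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
    )
    print(f"{model}: {reply.choices[0].message.content[:200]}")
```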


Gary Marcus misses the forest for the trees: pretty much everything he says is accurate, but he simply doesn't get how useful LLMs already are, despite all their flaws.


I was inspired to try asking it about The Information and I saw what Zvi saw - but then I asked about Elon's post and it became sycophantic to Elon's view and did a 180! Unexpected...

https://x.com/i/grok/share/lQnj8hTIrNqiMbthkL2SWgumP


My take (from briefly using it) is that there's no GPT3/GPT4/Sonnet 3.5/o1 "sparkle". No visceral feeling of "wow, this is new and different."

It was trained on (perhaps) 100x-150x the compute of DeepSeek V3 but doesn't feel like it. I'm not seeing any emergent behavior or interesting stuff so far.

Beyond "vibes" it's hard to know how good it is until more complete benchmarks come in. The GPQA score looks impressive, though I would like to see its scores on things like ARC-AGI and HLE.

It scores #1 on LMArena, which has o1 at #7, GPT-4o at #4, a Gemini Flash model at #2, o3-mini at #14, and Claude Sonnet 3.5 at #30 or something (lmao). Pretty badass leaderboard; I'm sure it measures all kinds of important stuff.

It's quite possibly the best "dumb" GPT4-style pretrained model. But its reasoning seems a bit worse than other models. No doubt xAI are still getting that ball of wax untangled. Without reasoning it falls prey to most of the usual cheesy Gary Marcus tricks. It knows how many r's are in strawberry without having to count, but when I asked "How many r's are in strrawberry? Reply in 1 word." it answered "3".
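
For reference, a quick check of the intended answer (plain Python; the only input is the prompt's deliberate misspelling):

```python
# "strrawberry" contains four r's, so the one-word answer "3" is wrong.
print("strrawberry".count("r"))  # -> 4
```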

"A hunter walks one mile south, one mile east, and one mile north and ends up right back where he started. He sees a brown bear and shoots it. What color is the bear?" <-- It still gets that one wrong. (Answer: "White").

Without reasoning it failed the Trurl test ("write a poem with six lines where every line begins with S") on 3/3 attempts.
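
For what it's worth, the constraint is mechanical enough to check automatically. A minimal checker sketch (the sample poem is a placeholder I wrote, not model output):

```python
# Minimal checker for the Trurl test: exactly six non-empty lines,
# each beginning with "S". The sample poem is illustrative only.
def passes_trurl(poem: str) -> bool:
    lines = [ln for ln in poem.strip().splitlines() if ln.strip()]
    return len(lines) == 6 and all(ln.lstrip().startswith("S") for ln in lines)

sample = "\n".join([
    "Silver rivers slide beneath the stars,",
    "Softly singing as the seasons turn,",
    "Shadows stretch across the sleeping fields,",
    "Stones remember what the storms have taught,",
    "Slowly every silent seed will wake,",
    "Spring arrives, and so the circle closes.",
])
print(passes_trurl(sample))  # -> True
```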

On the other hand, it performed well on my favorite deep knowledge question (listing the 14 levels of an obscure 90s PC game). GPT4 at launch got about 10/14 levels right. Sonnet consistently gets about 12/14 right. Grok 3 got 13/14 right on my first try and 14/14 the next (though it offered descriptions of the levels that were fairly wrong).


> Elon Musk is to me, at this point, a prime example of unintentional misalignment. Where as his capabilities have advanced and his circumstances take him outside his training distribution, that misalignment has become more severe, and caused more trouble, and is plausibly going to get us all into quite a bit of trouble in various ways.

Really well put.
