13 Comments
Dave Friedman:

I get the sense that Zuck is distracted by DC. See recent reports of him going to the WH hat in hand to beg Trump to dismiss DoJ antitrust maneuvers against FB. See also Zuck's recent stupid decision to Streisand Sarah Wynn-Williams -- this is the only reason she is testifying before Congress. Zuck, like a lot of Silicon Valley, is being played by a two-bit hustler from Queens.

The Last Moderate:

Wonder how many readers know enough Ivrit (Hebrew) to get the puns.

Jai:
Apr 9 (edited)

As a former Meta engineer who worked on ML stuff, I am *shocked* to learn that they released something that performs well on carefully-selected benchmarks while failing to do the things people actually care about. /s

Edited to add: Meta has a very deeply entrenched culture of "make marginal changes to improve metrics" as the de facto approach to everything. This works well enough for ad revenue optimization and user retention - the space is decently well-explored, they have few peer competitors, and the metrics hug the thing they actually care about pretty closely. But for problems that don't fit this profile, I expect Meta to fail in exactly this way by default.

Unfortunately, I think that failures like these, combined with intense pressure from the top to improve at any cost, recent research out of FAIR, and a total dismissal of AI Safety as a serious concern, point towards Meta adopting approaches that burn safety margin to get performance (like "thinking" in recurrent layer activations instead of tokens).

osmarks:

Maverick was oddly smart in some testing I did. I think their posttraining is bad. The base model might be good, but I don't have the compute to run it fast enough to check.

emre:
Apr 9 (edited)

Hasan’s points on lmsys are very valid, but they overlook a known practice: providers can put multiple (say, 7) anonymous models on LM Arena, then publicly release the highest-scoring one.

Thanks to randomness (the maximum of 7 random variables has a much higher expected value, and a fatter tail, than the average of 7), and prompt exploration with differently named models, they can easily obtain a higher score than if they submitted a single model.

Meta likely did this with 7+ models. (see: https://www.reddit.com/r/LocalLLaMA/comments/1jiewjn/possible_llama_4_prototypes_on_chatbot_arena/)
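The order-statistics point is easy to check numerically. Here is a minimal Monte Carlo sketch; the mean and standard deviation of the hypothetical score distribution (1380 and 20 Elo) are illustrative assumptions, not real Arena numbers:

```python
import random
import statistics

def simulate(n_models=7, mu=1380.0, sigma=20.0, trials=20_000, seed=0):
    """Compare the expected Arena score of releasing a single model
    against the expected score of the *best* of n_models submissions,
    treating each model's measured score as an independent draw from
    Normal(mu, sigma).  Returns (mean single score, mean best-of-n)."""
    rng = random.Random(seed)
    single, best_of_n = [], []
    for _ in range(trials):
        draws = [rng.gauss(mu, sigma) for _ in range(n_models)]
        single.append(draws[0])       # a lab that submits one model
        best_of_n.append(max(draws))  # a lab that submits n and keeps the best
    return statistics.mean(single), statistics.mean(best_of_n)

one, best = simulate()
```

For i.i.d. normals, the expected maximum of 7 draws sits roughly 1.35 standard deviations above the mean, so under these assumed numbers the best-of-7 strategy gains about 27 Elo for free, before any prompt exploration.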

Kevin:

It’s an interesting dynamic, the race between the top labs. It doesn’t seem to be “one leader pulls away”, but rather “sometimes a contender falls out of the pack in front”.

Matt:

The version of Maverick that is performing well in Arena is an unreleased experimental version tuned for chatting.

From the Meta Llama 4 announcement:

"Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."

https://ai.meta.com/blog/llama-4-multimodal-intelligence/

Coagulopath:

Is it legal for them to do this? I suppose it must be.

What's the point of advertising your LMArena score when that score was obtained on a weirdly-tuned experimental model that's different from what's actually available to the user? Feels very misleading.

Odd anon:

Love the llama puns, lol.

Coagulopath:

>These answers by Maverick, that users voted for, seem absurdly obnoxious and bad.

They're worth saving as examples of the flaws of LMArena (and any benchmark that relies on human preference).

In 2025, here's how you 1-shot LMArena:

- Start by praising the user. (It's hard to find one Llama 4 Maverick response that doesn't open with "That's a very insightful question!" or similar)

- Write extremely long replies, frequently 1,000+ words in response to simple questions.

- Excessively "overstyle" your answers. Pack them with emojis, pointless numbered lists, unnecessary bolding, and other stylistic fluff that crowds out the content.

- Deploy a tiresome "wacky" persona that's between exhausting and enraging.

- You don't even need to be correct. Apparently humans prefer "slopmaxxed" wrong answers to normally-written correct answers! https://pbs.twimg.com/media/GoBf6n3bwAYyzH-?format=jpg&name=large

- Waste the user's time. Llama 4 Maverick never gets to the point. Its answers faff around, talk in circles, and make lame jokes; you get the answer after 500 words if you're lucky. I would speculate that this is a failure-avoidance dance: users might not notice "4+4=9" if it's buried in a wall of text, but they'll definitely notice if it's the first thing you say.

Alex Scorer:

> The big takeaway is that we have to downgrade ... expectations for how much we can trust Meta.

Wasn't possible for me.

Dmitrii Zelenskii:

From what Zvi wrote previously, it seemed it wasn't possible for him, either — AND YET.

G. Retriever:

We seem to have hit the part of the scaling law where the curve starts bending back down.
