Llama 4 Scout (17B active parameters, 16 experts, 109B total) and Llama 4 Maverick (17B active parameters, 128 experts, 400B total), released on Saturday, look deeply disappointing.
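For a sense of how 17B active parameters squares with 109B and 400B totals, here is a back-of-the-envelope sketch. It assumes a crude "shared trunk plus top-1 routed experts" split, which is only an approximation of the actual Llama 4 architecture (which interleaves dense and MoE layers and adds a shared expert), so the derived shared and per-expert sizes are illustrative, not Meta's published breakdown.

```python
# Rough MoE parameter accounting (illustrative assumption:
# total = shared + n_experts * expert, active = shared + top_k * expert,
# with top_k = 1; this is NOT Meta's actual layer breakdown).
def split_params(total_b, active_b, n_experts, top_k=1):
    expert_b = (total_b - active_b) / (n_experts - top_k)  # params per routed expert (billions)
    shared_b = active_b - top_k * expert_b                 # always-on (attention etc.) params
    return shared_b, expert_b

for name, total, active, experts in [("Scout", 109, 17, 16), ("Maverick", 400, 17, 128)]:
    shared, expert = split_params(total, active, experts)
    print(f"{name}: ~{shared:.0f}B shared + {experts} x ~{expert:.1f}B experts, "
          f"{active / total:.1%} of weights active per token")
```

Under those assumptions Scout activates roughly 16% of its weights per token and Maverick only about 4%, which is how both can advertise the same 17B active count.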
I get the sense that Zuck is distracted by DC. See recent reports of him going to the WH hat in hand to beg Trump to dismiss DoJ antitrust maneuvers against FB. See also Zuck's recent stupid decision to Streisand Sarah Wynn-Williams--this is the only reason she is testifying before Congress. Zuck is, like a lot of Silicon Valley, being played by a two-bit hustler from Queens.
Wonder how many readers know enough Hebrew to get the puns.
As a former Meta engineer who worked on ML stuff, I am *shocked* to learn that they released something that performs well on carefully-selected benchmarks while failing to do the things people actually care about. /s
Edited to add: Meta has a very deeply entrenched culture of "make marginal changes to improve metrics" as the de facto approach to everything. This works well enough for ad revenue optimization and user retention - the space is decently well-explored, they have few peer competitors, and the metrics hug the thing they actually care about pretty closely. But for problems that don't fit this profile, I expect Meta to fail in exactly this way by default.
Unfortunately, I think that failures like these, combined with intense pressure from the top to improve at any cost, recent research out of FAIR, and a total dismissal of AI Safety as a serious concern, point towards Meta adopting approaches that burn safety margin to get performance (like "thinking" in recurrent layer activations instead of tokens).
Maverick was oddly smart in some testing I did. I think their post-training is bad. The base model might be good, but I don't have the compute to run it fast enough to check.
Hasan’s points on LMSYS are very valid, but they overlook a known practice: providers can put multiple (say, 7) anonymous models on LMArena, then publicly release the highest-scoring one.
Thanks to randomness (the maximum of 7 random variables has a much higher expected value, and a fatter tail, than the average of 7), and to prompt exploration across the differently named models, they can easily obtain a higher score than if they had submitted a single model; the sketch below illustrates how large that gap can be.
Meta likely did this with 7+ models. (see: https://www.reddit.com/r/LocalLLaMA/comments/1jiewjn/possible_llama_4_prototypes_on_chatbot_arena/)
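To put a rough number on that selection effect, here is a minimal simulation with invented parameters (a true Arena Elo of 1400 and measurement noise with a standard deviation of 15 points; neither figure comes from LMArena):

```python
# Best-of-7 vs. single submission under invented noise assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_elo, noise_sd, n_models, n_trials = 1400, 15, 7, 100_000

scores = rng.normal(true_elo, noise_sd, size=(n_trials, n_models))
print("single submission, mean observed Elo:", round(scores[:, 0].mean(), 1))
print("best of 7,         mean observed Elo:", round(scores.max(axis=1).mean(), 1))
# The max of 7 draws sits roughly 1.35 standard deviations above the mean,
# i.e. about +20 Elo here, before any per-model prompt tuning.
```

That only captures the pure order-statistics effect; tuning prompts separately for each anonymous model would widen the gap further.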
It’s an interesting dynamic, the race between the top labs. It doesn’t seem to be “one leader pulls away”, but rather “every so often a contender breaks out in front of the pack”.
The version of Maverick that is performing well in Arena is an unreleased experimental version tuned for chatting.
From the Meta Llama 4 announcement:
"Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena."
https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Is it legal for them to do this? I suppose it must be.
What's the point of advertising your LMArena score when that score was obtained on a weirdly-tuned experimental model that's different from what's actually available to the user? Feels very misleading.
Love the llama puns, lol.
> These answers by Maverick, that users voted for, seem absurdly obnoxious and bad.
They're worth saving as examples of the flaws of LMArena (and any benchmark that relies on human preference).
In 2025, here's how you 1-shot LMArena:
- Start by praising the user. (It's hard to find one Llama 4 Maverick response that doesn't open with "That's a very insightful question!" or similar)
- Write extremely long replies, frequently 1,000+ words, even to simple questions.
- "Overstyle" your answers. Pack them with emojis, pointless numbered lists, unnecessary bolding, and other stylistic fluff that crowds out the content.
- Deploy a tiresome "wacky" persona that lands somewhere between exhausting and enraging.
- You don't even need to be correct. Apparently humans prefer "slopmaxxed" wrong answers to normally-written correct answers! https://pbs.twimg.com/media/GoBf6n3bwAYyzH-?format=jpg&name=large (The toy Elo sketch after this list shows how quickly that bias compounds into rating points.)
- Waste the user's time. Llama 4 Maverick never gets to the point. Its answers faff around, talk in circles, and make lame jokes, and you get the answer after 500 words if you're lucky. I would speculate that this is a failure-avoidance dance: users might not notice "4+4=9" if it's buried in a wall of text, but they'll definitely notice if it's the first thing you say.
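As a toy illustration of how that plays out in rating points, here is a minimal sketch with invented numbers: a correct-but-terse bot faces a wrong-but-slopmaxxed one, raters vote on style 70% of the time and on substance 30%, and standard Elo updates do the rest.

```python
# Toy arena: style-biased voters plus standard Elo updates (invented numbers).
import random

random.seed(0)

def elo_update(r_a, r_b, a_won, k=16):
    exp_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # expected score for player A
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - exp_a), r_b + k * ((1.0 - score_a) - (1.0 - exp_a))

ratings = {"terse_correct": 1000.0, "slopmaxxed_wrong": 1000.0}
for _ in range(20_000):
    vote_on_style = random.random() < 0.7  # raters reward style 70% of the time
    terse_won = not vote_on_style          # substance decides only 30% of votes
    ratings["terse_correct"], ratings["slopmaxxed_wrong"] = elo_update(
        ratings["terse_correct"], ratings["slopmaxxed_wrong"], terse_won)

print({name: round(r) for name, r in ratings.items()})
# The wrong-but-flattering bot settles roughly 150 Elo ahead (400 * log10(0.7/0.3) ≈ 147).
```

The point is not the exact numbers but that a metric scored purely by pairwise human preference happily rewards the behaviors listed above.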
> The big takeaway is that we have to downgrade ... expectations for how much we can trust Meta.
Wasn't possible for me.
From what Zvi wrote previously, seemed it wasn't possible for him, either — AND YET.
We seem to have hit the part of the scaling law where the curve starts bending back down.