Discussion about this post

Jai
Apr 9 (edited)

As a former Meta engineer who worked on ML stuff, I am *shocked* to learn that they released something that performs well on carefully-selected benchmarks while failing to do the things people actually care about. /s

Edited to add: Meta has a very deeply entrenched culture of "make marginal changes to improve metrics" as the de facto approach to everything. This works well enough for ad revenue optimization and user retention: the space is decently well-explored, they have few peer competitors, and the metrics hug the thing they actually care about pretty closely. But for problems that don't fit this profile, I expect Meta to fail in exactly this way by default.

Unfortunately, I think that failures like these, combined with intense pressure from the top to improve at any cost, recent research out of FAIR, and a total dismissal of AI Safety as a serious concern, point towards Meta adopting approaches that burn safety margin to get performance (like "thinking" in recurrent layer activations instead of tokens).

Coagulopath

>These answers by Maverick, that users voted for, seem absurdly obnoxious and bad.

They're worth saving as examples of the flaws of LMArena (and any benchmark that relies on human preference).

In 2025, here's how you 1-shot LMArena:

- Start by praising the user. (It's hard to find one Llama 4 Maverick response that doesn't open with "That's a very insightful question!" or similar)

- Write extremely long replies, frequently 1000+ words, to simple questions.

- Excessively "overstyle" your answers. Pack them with emojis, pointless numbered lists, unnecessary bolding, and other stylistic fluff that crowds out the content.

- Deploy a tiresome "wacky" persona that's between exhausting and enraging.

- You don't even need to be correct. Apparently humans prefer "slopmaxxed" wrong answers to normally-written correct answers! https://pbs.twimg.com/media/GoBf6n3bwAYyzH-?format=jpg&name=large

- Waste the user's time. Llama 4 Maverick never gets to the point. Its answers faff around, talk in circles, and make lame jokes, and you get the answer after 500 words if you're lucky. I would speculate that this is a failure-avoidance dance. Users might not notice "4+4=9" if it's buried in a wall of text. They'll definitely notice if it's the first thing you say.
