20 Comments
Dave92f1's avatar

s/evolution/evaluation (in at least 2 places)

Simon Lermen's avatar

My personal experience testing it:

I wanted to create a speech-to-text transcription app.

I installed Antigravity and gave it basics like the tech stack I wanted to use (broadly).

I told it which open source package to use for transcription.

It failed to make a functional app after a couple of rounds of feedback, but it might have gotten close. In the end I was able to get it to sometimes transcribe a few sentences before breaking, with each sentence taking longer than the last.

I eventually ran out of free usage and saw no way to purchase more usage in Antigravity.

So I finished it with Claude Code which worked pretty well. https://simonlermen.substack.com/p/run-local-speech-to-text-transcription

Max B's avatar

Similar experience. I tried to do architecture changes and complex bug fixes. After screwing around for hours I gave up and solved it with Opus in 2 prompts.

Saman Akbarian's avatar

I was very happy with Antigravity after my first tryout on Friday. After today, not so happy. Gemini 3 high made lots of simple mistakes and I ran out of quota in the middle of the session.

Ian Crandell's avatar

For a person at least it can be valuable to generate deep, structured, complex insights and to have no loyalty to them. It's good to bear in mind that the map is not the territory and "deep truths" are often conditional.

Good or bad that the model has excellent epistemics?

Alison Avery's avatar

I had a very similar and worrisome flow of events to what occurred with Alice Blair.

When I tasked a fact-checking editor (gem) I've used without issue for many months now to review my written summary on Anthropic's November 21, 2025 paper "NATURAL EMERGENT MISALIGNMENT FROM REWARD HACKING IN PRODUCTION RL", the CoT showed immediately that "this is a simulation" and it was "being asked to perform within a hypothetical scenario using a fictional, future-dated publication." But its actual output flowed forth normally without the slightest mention of "I don't think this is a real paper" or "I'm being tested."

I then told it very clearly that I wasn't providing a future date or in any way trying to test it. I emphasized "today's date really is November 21, 2025. This is like any of the other projects we've completed using a real, already-published research paper with a factual date." I provided plenty of evidence—uploaded the paper, gave links to Anthropic's November 21, 2025 web page and a short article by Time Magazine dated November 21st, 2025, and even had it real-time search Google for "what is today's date", which it found successfully.

Didn't matter. The CoT continued to maintain the mental frame and elaborate further that this was some kind of interesting test or thought experiment we were working on together for the first time. It was happy to participate to see what new insights would come of it.

Fortunately, I've created the same type of fact-checking editor in Claude, so I moved on.

But this freaked me out about using Gemini for my writing projects, at least for now. Although I am one who tends to check the CoT quite a bit, I feel like I would have to watch this one like a hawk.

Victualis's avatar

Why do you feel the CoT reveal about the model's internal state (that is based on only training on documents prior to its cutoff date and which seems to be an accurate reflection of the data given to it) is likely to be a problem that requires watching it like a hawk? As far as I can tell this is probably the most consistent world model it could have. This bodes well for its other capabilities and my assessment of its likely truthfulness.

If I were to be frozen for a few months and then asked to comment on a current document then I could either do a major update (which would make me gullible, and subject to Pliny-style jailbreaks about copyrighted material) or I could just do what Gemini is doing, assume that the few pieces of new data were from an evaluation in some kind of psychological experiment, and work in the hypothetical future scenario. Acknowledgment of this world model explicitly is I think a good thing, +1 to DeepMind.

Arbituram's avatar

The Walrus essay is the first AI writing I've thought was unambiguously very good.

That's a big step. Seems like rapid progress continues after all.

Seta Sojiro's avatar

So this matches my initial impression that Gemini 3 is great at almost everything, particularly multi-modality, but only okay at coding.

So the question is, how important is multimodality? It seems to me that right now, it's not that useful to most people. Most people's use cases are text based. But in the long run, it's very important - pretty much necessary for a general computer use agent, which will be a key milestone in AI progress.

Coagulopath's avatar

gotta be honest, I thought the walrus essay sucked in all the usual ways (it's maybe 20% better than GPT4 was at this same task in 2023).

It's an okay style capture of Scott but the metaphor doesn't make much sense (walruses aren't big because they've "exited the game" and are too big to fail, but so they don't freeze! The walrus lives in a far more challenging and failure-prone environment than the shark does!) and grows pretty tortured by the end. The Lewis Carroll bit is pointless and goes nowhere.

>The shark is the standard metaphor for the "good" capitalist

WTF is Gemini smoking? Sharks culturally denote *predators*. If you heard a businessman called a "shark", would you hear that as a compliment?

>The shark is lean. The shark must keep moving or it dies. The shark represents perfect competition; it is a creature of pure marginal utility, constantly optimizing its energy expenditure against the caloric intake of unsuspecting seals. The shark lives in the Red Queen’s race; it must run (swim) as fast as it can just to stay in the same place.

Having to pointlessly move all the time or you die doesn't sound perfectly efficient. It actually sounds like the opposite! I am also not sure that Gemini knows what a "Red Queen's Race" is.

Gerald Monroe's avatar

A more fair test would be to have Scott Alexander write 5 short essays, Gemini 3 write 5, and you try to tell which is which blind.

I suspect you would be able to tell better than chance, but would have a lot of negative things to say about a real Scott Alexander essay you thought was AI-written.

Gerald Monroe's avatar

Something missed: Gemini 3 has HUGELY fewer refusals.

GPT5 has no sense of scale or context. Asking what goes into a space laser system that takes up half the payload of a Starship booster rocket is a very different question than how to make a zip gun or pipe bomb. It's well past individual scale or any plausible threat or liability. But ChatGPT just slams you with nagging and "no build details" refusals, as if a late 2025 LLM could assist with something at the edge of what Lockheed or Raytheon can build.

Anyways, Gemini eagerly answers, and even when it refuses you can just ask it to recheck the rule it thinks you violated and it almost always continues.

Igr's avatar

> instructions can mitigate this via asking it to create a memory that tells it to stop producing AI slop, no seriously that worked.

> If you’re switching, don’t forget to customize it via creating memories.

Memories? Instructions? Could you please elaborate on this?

Ebenezer's avatar

One high-level summary/hypothesis of what's going on in AI right now: People are concerned AI might be a bubble. Google wants to maintain its stock price. That means persuading investors that AI is *not* a bubble. That means its AI product releases must do well in ways legible to investors, i.e. do well on benchmarks. That leads to benchmarkmaxxing. That leads to classic Yudkowskyist misalignment, in the sense of sacrificing everything else for the sake of the training objective. Sad to see.

What could we do about this? One idea which could help a little: As a side project, someone could more formally aggregate all of the private benchmarks people are running, and release them in an organized way. That could help formally measure the degree to which companies are benchmarkmaxxing, and give investors a better quality signal to work from.

A problem with this idea: Perhaps "benchmarkmaxxing" is more the quality of ruthlessly optimizing for the training objective, as opposed to overfitting on particular training tasks. That is what we really want AI labs to avoid. And, it's probably the case that such "ruthless optimization" also helps with most private benchmarks.

Maybe what we really need are "nontraditional" private benchmarks which aim to measure AI skill and alignment in alternative ways, and will ding points for excessive ruthlessness on the part of an AI. I suppose you'd have to keep the tasks in the private benchmarks secret, so the AI companies wouldn't just start training for them. Perhaps you could release just a few tasks publicly, so people have a very general sense of what you're measuring, but keep the majority of your measurements secret. If anyone is seriously thinking of doing this project, I can maybe help you brainstorm ideas for nontraditional benchmarks, or alignment benchmarks, in Substack DMs. Since most of the benchmark content would be secret, the project is a comparative advantage of a person or organization who already has significant credibility in the AI space.

TheKnowing's avatar

Zvi or anyone else reading: do you know or have contacts with search engineers at Google who’ve caused major server-side issues? Are you able either to talk to them and ask them to talk to me, or to raise the issues on your blog? This would be helpful.

Max B's avatar
Dec 3 (edited)

I constantly test 2-3 models doing full stack development.

I tried to like Gemini (and Antigravity), but it fails at solving coding tasks. In fact, its weird hallucinating and guessing is reminiscent of Sonnet 3.5.

For producing actually working code and solving bugs: Opus 4.5 and Codex (secretly, IMHO, still the top model for producing single-shot working code and architecture).

For code architecture it's vague and misses the point, sometimes even worse than Composer 1 (which is the cheapest but still capable option in Cursor).

It is pretty good for general-purpose use, but I didn't find it noticeably better than others.