Don't Worry About the Vase

https://open.substack.com/pub/dwatvpodcast/p/the-second-gemini

Podcast episode for this post:

Expand full comment

Mr. Ondrej

Got the payed version today to test this. Overall Claude sonnet Is still he king. But deep research showed some promise. I asked it to put together lists of software products from certain companies. In mist instances also URLs got produced. In other instances I even added pricing. But it keeps having issues. It broke down mid research a couple of times. It stopped recognizing whether to respond with strengthening the research part or in text. It is indeed pretty frustrating atm. Wish Claude got this deep research for sonnet. It would be good enough!

Expand full comment

Arbituram

I've experienced the same thing as many others, in deeply preferring Sonnet over every other option, benchmarks be damned. What is it, exactly? I struggle to explain it when asked, other than vague statements like Sonnet has a 'spark' or a ' presence' other LLMs don't.

How do you explain it?

Expand full comment

Reply (1)

For me Google outputs always sound like a Wikipedia article or Google search query results where is Claude grock see more conversational and much easier to find tune at least based on my experience.

What are some reasons that might cause you to prefer sign it over this Gemini model?

Expand full comment

Moral Particle

Excellent points on Deep Research. The bottom line is that its reports and answers are incredibly unreliable, in the same way that (so far) all LLMs are unreliable in attempting to collect and report factual information. Mitja Rutnik at Android Authority has an interesting article on asking DR / Gemini for basic stock market information. Among the requests was a summary of the performance of the top 10 S&P 500 companies by market cap. This is a well-defined request! The problems, as reported by Rutnik: The table Gemini produced did not include 10 companies at all (?!?), only 9, and not one of the 9 was in the top 10 by market cap. (It also didn't even report data for some of the irrelevant companies.) Think about that - the model failed in every possible way. The information couldn't be quickly checked, corrected, and then used - it's literally useless. That is wholly consistent with my experience with other models on asking for basic, easily-confirmed information available from numerous websites. What *is* interesting is that this application is designed specifically for web research - "deep research," even - but there has not been a phase change in accuracy or reliability. My suspicion is that it's because the problem is fundamental and structural. That's not to say that AI researchers won't fix it ultimately, but it could require a transformer-like breakthrough, or at least a clear and new set of algorithms.

Expand full comment

Reply (2)

Really good points there whoever ends up using deep research without critical evaluating the output will be caught pretty pretty quickly by their receiver.

Givens Google's trends of AI boondoggles it makes me a lot more skeptical that deep research or Gemini 2.0 can stop that streak of failures.

Expand full comment

Boogaloo

yap. I consider LLM's useless wherever factual information is concerned. Useful for music suggestions however.

Expand full comment

David Kiferbaum

I'm noticing here that this is maybe a branding/trust issue; we somehow don't believe that Google can be innovating at the frontier, despite some evidence to the contrary (Notebook LM is genuinely different and uniquely useful)

Expand full comment

Reply (2)

I still have not found much evidence that Google is able to make general purpose AI to the same level as anthropic or open AI or XAI.

Notebook LM is much more narrow than these frontier LL models and I think that benefits Google as there is less room where things can go wrong.

It took me 10 minutes of experimenting for Google Gemini to hallucinate when I was not testing for that so Gemini 2.0 is pretty hard to really adopt at this point

Expand full comment

David Kiferbaum

I totally agree—and the potential for “things going wrong” is definitely a major hand brake preventing Google from being a dominant player here (along with its own bureaucratic and cultural issues). Some of these recent announcements seem to indicate a shift but it might be too little too late…

Expand full comment