Discussion about this post

Moral Particle

Excellent points on Deep Research. The bottom line is that its reports and answers are incredibly unreliable, in the same way that (so far) all LLMs are unreliable when collecting and reporting factual information.

Mitja Rutnik at Android Authority has an interesting article on asking DR / Gemini for basic stock market information. Among the requests was a summary of the performance of the top 10 S&P 500 companies by market cap. This is a well-defined request! The problems, as Rutnik reports: the table Gemini produced did not include 10 companies at all (?!?), only 9, and not one of the 9 was in the top 10 by market cap. (It also failed to report data for some of the irrelevant companies it did include.) Think about that - the model failed in every possible way. The information couldn't be quickly checked, corrected, and then used - it's literally useless.

That is wholly consistent with my experience asking other models for basic, easily confirmed information available from numerous websites. What *is* interesting is that this application is designed specifically for web research - "deep research," even - yet there has been no phase change in accuracy or reliability. My suspicion is that the problem is fundamental and structural. That's not to say AI researchers won't fix it eventually, but it could require a transformer-like breakthrough, or at least a clear new set of algorithms.
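The failure mode Rutnik describes is easy to verify mechanically, which is what makes it so striking. A minimal sketch of such a check - note that both ticker lists below are illustrative placeholders invented for the example, not real market-cap data:

```python
# Sketch: scoring an LLM-produced "top 10 S&P 500 by market cap" table
# against a trusted reference list. All tickers here are placeholders
# chosen for illustration, not actual rankings.

def score_top10(model_tickers, reference_top10):
    """Return (rows_missing, wrong_tickers) for a model-produced list."""
    rows_missing = len(reference_top10) - len(model_tickers)
    wrong_tickers = [t for t in model_tickers if t not in reference_top10]
    return rows_missing, wrong_tickers

# Hypothetical trusted reference (placeholder data).
reference_top10 = ["AAPL", "MSFT", "NVDA", "GOOGL", "AMZN",
                   "META", "BRK.B", "LLY", "AVGO", "TSLA"]

# Hypothetical model output mimicking the reported failure:
# only 9 rows, none of them in the reference top 10.
model_tickers = ["IBM", "ORCL", "CSCO", "INTC", "QCOM",
                 "TXN", "ADBE", "CRM", "NOW"]

rows_missing, wrong_tickers = score_top10(model_tickers, reference_top10)
print(rows_missing, len(wrong_tickers))  # 1 row short; all 9 tickers wrong
```

The point is that the request has a crisp ground truth, so the model's output can be graded in a few lines - and by this grading, the reported table scored zero.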

Arbituram

I've experienced the same thing as many others: deeply preferring Sonnet over every other option, benchmarks be damned. What is it, exactly? I struggle to explain it when asked, beyond vague statements like Sonnet has a 'spark' or a 'presence' other LLMs don't.

How do you explain it?
