Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/the-second-gemini
Got the paid version today to test this. Overall, Claude Sonnet is still the king, but Deep Research showed some promise. I asked it to put together lists of software products from certain companies. In most instances URLs got produced as well; in other instances it even added pricing. But it keeps having issues. It broke down mid-research a couple of times. It stopped recognizing whether it should respond by building on the research report or in plain text. It is indeed pretty frustrating at the moment. Wish Claude had this kind of deep research for Sonnet. It would be good enough!
I've experienced the same thing as many others, in deeply preferring Sonnet over every other option, benchmarks be damned. What is it, exactly? I struggle to explain it when asked, other than vague statements like Sonnet has a 'spark' or a 'presence' other LLMs don't.
How do you explain it?
For me, Google's outputs always sound like a Wikipedia article or Google search results, whereas Claude and Grok seem more conversational and much easier to fine-tune, at least based on my experience.
What are some reasons that might cause you to prefer Sonnet over this Gemini model?
Excellent points on Deep Research. The bottom line is that its reports and answers are incredibly unreliable, in the same way that (so far) all LLMs are unreliable in attempting to collect and report factual information.

Mitja Rutnik at Android Authority has an interesting article on asking DR / Gemini for basic stock market information. Among the requests was a summary of the performance of the top 10 S&P 500 companies by market cap. This is a well-defined request! The problems, as reported by Rutnik: the table Gemini produced did not include 10 companies at all (?!?), only 9, and not one of the 9 was in the top 10 by market cap. (It also didn't even report data for some of the irrelevant companies.)

Think about that - the model failed in every possible way. The information couldn't be quickly checked, corrected, and then used - it's literally useless. That is wholly consistent with my experience with other models on asking for basic, easily-confirmed information available from numerous websites.

What *is* interesting is that this application is designed specifically for web research - "deep research," even - but there has not been a phase change in accuracy or reliability. My suspicion is that it's because the problem is fundamental and structural. That's not to say that AI researchers won't fix it ultimately, but it could require a transformer-like breakthrough, or at least a clear and new set of algorithms.
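To make the "easily checked" point concrete, here is a minimal sketch in Python of the kind of quick spot-check a reader would want to run on a model-produced "top N by market cap" table. Everything in it is assumed for illustration - the lists are placeholders, not real tickers or rankings - and it is not a reconstruction of what Rutnik or Gemini actually did.

```python
# Compare a model-generated "top N" list against a reference ranking you trust.
# The example values at the bottom are purely illustrative placeholders.

def check_top_n(model_tickers: list[str], reference_tickers: list[str], n: int = 10) -> None:
    """Report how well a model-produced top-N list matches a trusted reference."""
    if len(model_tickers) != n:
        print(f"Expected {n} rows, got {len(model_tickers)}")

    reference_set = set(reference_tickers[:n])
    hits = [t for t in model_tickers if t in reference_set]
    misses = [t for t in model_tickers if t not in reference_set]

    print(f"Entries in the true top {n}: {len(hits)} -> {hits}")
    print(f"Entries NOT in the true top {n}: {misses}")


# Hypothetical example values, only to show the check itself.
model_output = ["XYZ", "ABC", "DEF"]       # what the model's table claimed
reference = ["AAA", "BBB", "CCC", "DDD"]   # what a trusted source says
check_top_n(model_output, reference, n=10)
```

Even a crude check like this would have flagged the failure described above: wrong row count and zero overlap with the true top 10.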
Really good points there. Whoever ends up using Deep Research without critically evaluating the output will be caught out pretty quickly by the recipient.
Given Google's track record of AI boondoggles, I'm a lot more skeptical that Deep Research or Gemini 2.0 can break that streak of failures.
Yep. I consider LLMs useless wherever factual information is concerned. Useful for music suggestions, however.
I'm noticing here that this is maybe a branding/trust issue; we somehow don't believe that Google can be innovating at the frontier, despite some evidence to the contrary (NotebookLM is genuinely different and uniquely useful).
I still have not found much evidence that Google is able to make general-purpose AI at the same level as Anthropic, OpenAI, or xAI.
NotebookLM is much narrower than these frontier LLMs, and I think that benefits Google, as there is less room for things to go wrong.
It took me 10 minutes of experimenting for Google Gemini to hallucinate when I was not testing for that, so Gemini 2.0 is pretty hard to really adopt at this point.
I totally agree—and the potential for “things going wrong” is definitely a major hand brake preventing Google from being a dominant player here (along with its own bureaucratic and cultural issues). Some of these recent announcements seem to indicate a shift but it might be too little too late…
Yeah, these features are impressive, but if you actually use Gemini you can see how unimpressive this new model is compared to Claude or Grok.
Interestingly, on the same day as Zvi's post (December 17), Google released a paper on "FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt." (Alon Jacovi, et al.) I need to study the paper more closely, but what is clear from the outset is that factual accuracy in this study is evaluated on information in the context window - literally information in a "provided document." Contrast that with information "on the internet," which is what "deep research" should entail.

Maybe models could extract information from certain "trusted sources" on the internet, identified by the user in the prompt or possibly identified ahead of time, upload that information to the context window, and then perform the requested analysis and summary? Initially, this struck me as a clunky hack (putting aside whether it's technically feasible), but maybe that's what good human researchers do? If a human researcher is searching for financial information, the researcher might reasonably look at Bloomberg, the WSJ, the Financial Times, etc., assume the information in those sources is correct, especially if the information from the sources matches, load that information into a metaphorical "context window" (his or her brain, or a notepad, etc.), and then perform the analysis and summary. For baseball statistics, as another example, the sources could be ESPN, major newspapers, Wikipedia, etc.

Importantly, this process is a big leap from requiring the user personally to ensure the information is in a context document in the first place, which is what the "FACTS Grounding" paper seems to require. I want "Gemini's Deep Research" to do the research; I don't want to do it myself! But maybe this points to a solution down the line: accuracy from a context document as a first step to agentic models searching the internet for data to load into the context window. The theme and requirement for all of this, of course, is accuracy.
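For what it's worth, here is a rough sketch of the workflow described above - user-nominated trusted sources pulled into a single context document, then a model asked to answer only from that document. It is a hypothetical illustration, not anyone's actual implementation: the URLs are placeholders, and `call_llm` stands in for whatever model API you actually use.

```python
# Sketch: gather text from an allowlist of trusted sources into one context
# document, then prompt a model to answer strictly from that document.
# The URLs and call_llm() are placeholders / assumptions, not real endpoints.

import requests

TRUSTED_SOURCES = [
    "https://example.com/markets/sp500",    # placeholder URL
    "https://example.org/finance/top-caps", # placeholder URL
]

def build_context(urls: list[str], max_chars: int = 20_000) -> str:
    """Fetch each trusted page and concatenate its text into one context document."""
    chunks = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        chunks.append(f"SOURCE: {url}\n{resp.text[:max_chars]}")
    return "\n\n".join(chunks)

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (Gemini, Claude, etc.)."""
    raise NotImplementedError("wire this to your model provider's API")

def grounded_answer(question: str) -> str:
    """Ask the model to answer using only the assembled context document."""
    context = build_context(TRUSTED_SOURCES)
    prompt = (
        "Answer the question using ONLY the provided documents. "
        "If the documents do not contain the answer, say so.\n\n"
        f"DOCUMENTS:\n{context}\n\nQUESTION: {question}"
    )
    return call_llm(prompt)

# Example usage (needs a real call_llm implementation):
# print(grounded_answer("Summarize the performance of the top 10 S&P 500 companies by market cap."))
```

The point of the sketch is only to show how small the gap is between "accuracy against a provided document" (what FACTS Grounding measures) and "accuracy against sources the system fetched itself" - the hard part remains whether the model stays faithful to that context.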