You're right that a lot of these comparisons are pretty far removed from testing actual mundane utility. OK, Gemini is bad at playing rock-paper-scissors. Who cares? I am also not here to discuss free will with an LLM. In my actual day-to-day use, ChatGPT is a better search engine (90% of my usage) and a coding assistant (10%). The suggestion to dig through my ChatGPT history and send the same prompts to Gemini is a good one. I'm going to give that a go.
The rock-paper-scissors example is one about reasoning. When one player goes first, the other player can trivially win every time, but the chatbot can't figure that out. This is understandable behavior because this trivial case is not in the training data, but it usefully demonstrates that the chatbot isn't *actually* capable of reasoning.
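To spell out just how trivial the winning strategy is, here's a minimal Python sketch (purely illustrative, not from any chatbot transcript): once the first player's move is revealed, winning reduces to a three-entry lookup.

```python
# Rock-paper-scissors where the opponent moves first:
# the winning response is a fixed lookup, no search or inference needed.
BEATS = {
    "rock": "paper",       # paper covers rock
    "paper": "scissors",   # scissors cut paper
    "scissors": "rock",    # rock blunts scissors
}

def winning_move(opponent_move: str) -> str:
    """Return the move that beats the opponent's already-revealed move."""
    return BEATS[opponent_move]

assert winning_move("rock") == "paper"
assert winning_move("scissors") == "rock"
```

That lookup table is the entirety of the reasoning the chatbot fails to do.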
Yep, I get that. There are lots of ways to demonstrate that LLMs are somewhat boneheaded. But it turns out that I'm pretty good at reasoning and LLMs are pretty good at having read the entire internet and knowing every computer language, so the rock-paper-scissors thing doesn't map to my current use case.
If you use LLMs frequently enough, you'll have a fairly good idea of how they can fail. I think the rock-paper-scissors thing is potentially useful in demonstrating the boneheaded nature of LLMs to somebody who thinks they're smarter than they actually are. Or (potentially) to indicate when AIs have advanced to the point where they *are* capable of this kind of reasoning. It's better than the Turing Test, anyway.
I don't see any way to call this other than as a toss-up. Both Gemini and ChatGPT are quite good. They could easily answer questions on topics ranging from Argentinian folk music to physics to Alexander Pope. They tend to structure their answers differently -- they each have a house style -- but sometimes I preferred Gemini and sometimes I preferred ChatGPT. Gemini did make one clear factual error that ChatGPT got right, so that was a definite strike against it. On the other hand, Gemini's search integration seems clearly better, and this was especially true when I asked for travel advice. I think I may actually plan a trip around Gemini's suggestion, whereas ChatGPT's was worthless.
Gemini feels faster. ChatGPT has always been sluggish. That said, Gemini has an up-front pause that might be a bit longer than ChatGPT's? It's hard to tell.
I gave them both a PDF with some tables of numbers to try to interpret. ChatGPT failed out of the gate. It couldn't read the file, for reasons unknown. I find ChatGPT's PDF features to be extremely janky. With Gemini, on the other hand, it wasn't clear that it was even possible to give it a PDF, until I discovered that I could have it read documents on Google Drive. At first the results were exciting: it found the file and correctly identified its contents. But then I ran into the same issue I always run into when I try this experiment with ChatGPT: if you ask it about tables of numbers, it just makes stuff up. So on the one hand, Gemini made it much, much farther than ChatGPT on this task. On the other hand, they were both worthless.
I asked them each for a simple Python script. ChatGPT's ran flawlessly the first time. Gemini's had an error, which it quickly fixed when I gave it a traceback. They also helped with some deployment questions. Although ChatGPT technically did better, I rate this basically a tie. I run into similar issues with ChatGPT all the time, and they seem comparable in their level of understanding and competence. Further testing might reveal a wider gap.
For day-to-day use... I might lean Gemini? But that one factual error was bad. It happened to be an error I was able to spot, but of course those aren't the ones I am worried about. I will keep using both. When my free two months is up, I'm not sure which one I will stick with. Also worth noting: I can't use Gemini with my work Google account right now, but I could see that being really killer. Even the integration with Google Sheets could be massively useful. I think that's what OpenAI should really sweat about. The integration possibilities here are enormous.
ETA: Gemini was more scold-y, flatly refusing to answer a question about some mildly risque pop music that ChatGPT answered very anxiously. Gemini also refused to summarize the contents of a Sam Harris video on the war in Gaza. I'm not generally pushing boundaries here so I don't know that I care too much about this, but it was definitely a thing.
Update to the update: ok, Gemini's trip plans turned out to be not so plausible in the details. Still useful to me for orienting me around some options, but firmly in the realm of "slightly janky LLM spew that needs to be reality checked."
I'm afraid to poke at Gemini because hitting some hair-trigger filter and getting banned from Gmail would be a life-disrupting disaster. My first attempt to get Gemini to look at an image, a meme of characters from a show talking with overlaid text of a philosophical conversation, got blocked and actually deleted from my transcript and I still don't have a good theory about why. (Now that I think about it, this worry keeps me from doing much with the other Google services as well -- Gmail is too important to risk.)
Maybe something for "the lighter side": Gemini goes into "Danish mode" when it can't answer a question (real prompt) https://g.co/gemini/share/9b232d119f55
It almost seems like you're getting trolled with Gemini 1.5's announcement coming out 30 minutes after these two posts... https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024