Excellent full coverage of a big development as usual.
From the FutureSearch results linked, shortly after this post came out, we added some anecdotes about ways o1 is unusual when driving an agent:
* o1 still has a significant hallucination problem. It pulled the classic move of hallucinating a citation to a scientific paper, which we rarely see from 4o anymore.
* o1 has strange refusals that other LLMs do not. We can't quite explain this yet.
* o1 seems overconfident. It tried to complete tasks without the requisite information, and then did less web research than Sonnet-3.5 to validate its claims.
* o1’s plans can be extremely verbose, but in a good way: other LLMs tend to drop important (implied) details from their plans.
* o1’s plans make better use of latent knowledge. E.g., on a question that requires finding Chinese data on disposable income, GPT-4o knows that only the mean, the median, and averages over quintiles are published. But GPT-4o seems to “forget” this later in the agent flow, after planning. o1 does not, and hence does far better on this task.