Terrible name (with a terrible reason: that this ‘resets the counter’ on AI capability to 1, and ‘o’ as in OpenAI, when they previously used ‘o’ for Omni, which is very confusing).
Excellent full coverage of a big development as usual.
From the FutureSearch results linked, shortly after this post came out, we added some anecdotes about ways o1 is unusual when driving an agent:
* o1 still has a significant hallucination problem. It did the classic "hallucinate a citation to a scientific paper", which we don't normally see from 4o anymore.
* o1 has strange refusals that other LLMs do not. We can't quite explain this yet.
* o1 seems overconfident. It tried to complete tasks without the requisite information. Then it did less web research than Sonnet-3.5 to validate its claims.
* o1’s plans can be extremely verbose, but in a good way, as other LLMs suffer from dropping important (implied) details from their plans.
* o1’s plans make better use of latent knowledge. E.g., on a question that requires finding Chinese data on disposable income, GPT-4o knows they only publish the mean, median, and averages over quintiles, but when planning it seems to “forget” this later in the agent flow. o1 does not, and hence does far better on this task.
It’s just “o1” not “4o1” afaict. I thought the branding was interesting, I wonder if they are trying to move away from the “gpt4” branding in order to suggest that the product isn’t going to move in big step changes as much in the future, more like smaller but still observable advances.
Nitpicky as hell, but are you presenting a simplified version of the wolf, goat, cabbage problem with the "get the cabbage across the river" problem?
I remember that one from an early AI textbook, back in the search-the-solution-space days. You know, when nickels had pictures of bumblebees on them.
https://en.m.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem
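For reference, here is roughly what that textbook framing looks like: the puzzle as a tiny state-space search, solvable by brute-force BFS in a few lines. A minimal sketch in Python; nothing about it is model-specific.

```python
from collections import deque

# The classic wolf, goat, cabbage puzzle framed as a state-space search,
# the way the old textbooks did it. State = (farmer, wolf, goat, cabbage),
# each 0 (start bank) or 1 (far bank).
START, GOAL = (0, 0, 0, 0), (1, 1, 1, 1)

def unsafe(state):
    farmer, wolf, goat, cabbage = state
    # Something gets eaten if predator and prey share a bank without the farmer.
    return (wolf == goat != farmer) or (goat == cabbage != farmer)

def moves(state):
    farmer = state[0]
    yield (1 - farmer,) + state[1:]        # farmer crosses alone
    for i in range(1, 4):                  # or ferries one passenger
        if state[i] == farmer:
            nxt = list(state)
            nxt[0] = nxt[i] = 1 - farmer
            yield tuple(nxt)

def solve():
    frontier, seen = deque([(START, [START])]), {START}
    while frontier:
        state, path = frontier.popleft()
        if state == GOAL:
            return path
        for nxt in moves(state):
            if nxt not in seen and not unsafe(nxt):
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))

print(solve())  # 8 states, i.e. the familiar 7-crossing solution
```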
What happens very often with current models is that you present a simplified version of the WGC problem (i.e., you omit the wolf and cabbage entirely, as in the example), but the model ignores your scenario and still solves it for the wolf and the cabbage (or goes off the rails in some other way). In some instances, I've seen the model assume you want the answer to the classic WGC problem even if you only ask about a goat, and so it "helpfully" "corrects" your prompt for you.
Ah, I see, it keys on the traditional WGC problem in the internally represented corpus.
Doesn’t this indicate that it isn’t ‘understanding' the prompt in any real way? That it is still just doing pattern matching?
Yes, that's the point that "AI pessimists" are making when they post these examples.
I think a more nuanced position is that many humans, *if forced to output a token immediately without thinking*, might make the same pattern-matching mistake. This indicates the need for a little more reflection, which is what o1 is trying to provide.
Further, along the lines in Zvi's post: thinking for 12 seconds about a simple problem suggests the model is stupid, but perhaps just like a human would, the model "suspects" this must be one of those logic puzzles and tries to find a trick, rather than immediately answering simply and literally.
Yes, that is true, but it is also true that we have no evidence that humans have any ability other than pattern matching.
I suspect that IQ question #35 may have a very strong bias based on the test subject's primary language.
I hesitate for this to be my thing going forward but it's "Monte Carlo" not "Monty". Has nothing to do with Monty Hall.
OK, that one is silly and I'm happy to fix it.
Those 5D return graphs are not labeled with symbols and I haven't memorized the price ranges of every QQQ component :)
I had to ask 4o which tickers they were
third! will edit comment with my flights later, cheers Zvi
(third restack I mean lol)
On the discussion regarding doctors (and, disclaimer, I am a doctor, specifically a psychiatrist, which may involve different considerations than other specialties): I've thought a lot about what my role entails and, putting aside the psychotherapy part of my job (which is hard to disentangle from other aspects, as psychotherapy really is infused into most patient interactions), I've come up with a couple of reasons why I think it may be harder than expected to take an expert human out of the loop. Many of the decisions psychiatrists make (e.g., with regard to involuntary hospitalization, or with regard to medications, particularly ones that come with big risks but may nevertheless be life-saving/altering, and such medications are no small part of psychiatry) are deeply contextual, deeply social. A person is coming to you with their suffering and their complicated life situation and trying to decide whether to start a medicine that may have big impacts, positive and negative, on their life (think: lithium). There is no simple "A) decide if the patient has bipolar 1, B) start lithium" algorithm. There are certainly evidence-based algorithms, but the way they get implemented is highly contextual, and I think patients value a human in the loop helping them muddle through which way to go in a particular decision.
Of course, I may be biased by distress about my job being automated away! But I am doing my best to genuinely be open and reflect/reason about how feasible automating psychiatry is. I think some of the same things I mention here may apply to other fields of medicine, or to other jobs entirely. What do you guys think?
I feel that if AI has been able to convincingly crush deeply social and contextual domains like poker and persuasion, the idea that machines are poor at context is outdated.
If you want to remain relevant as a human being, resistance, not ostriching and clutching to vain hopes, is the only choice.
I think it's less that AI would be incompetent at partnering with patients in thinking through their decisions (your poker and persuasion examples are great evidence that they can, I agree), and more that many people would like to have a human in the loop when making those decisions. People are often highly ambivalent about the decisions we make together on a daily basis, and these decisions touch things that feel very intimate to them (the inner workings of their minds, their social situations, their bodies). I wouldn't be surprised if many feel, understandably, that they would like a human who is closely listening to them, and whom they trust human-to-human, to be with them in those moments.
There is also other weird, murky stuff going on with psychiatric medicines. For example, there's data that which psychiatrist you get accounts for more variability in outcomes than whether you are taking placebo or active drug, highlighting that maybe there is weird human-to-human stuff going on in healing.
But yeah, I also totally would not be surprised if humans are no longer paying other human beings to do this work in x number of years! It will be an interesting world...
I mean, it is possible, but it may also be very difficult to know whether a human is or is not on the other side, given the agents that already exist with just style copying, face deepfakes, and voice deepfakes.
I purposely shut myself off from AI discussions when o1 was released, so as not to let others' opinions influence my own. In my experience, o1 is way better than 4o, but not for everything.
It's very good when you have a specific multi-step task that you want to accomplish; the model can often do it on its own very well. However, the inherent quality of the model doesn't seem to have increased. I don't know if they're still using GPT-4 (the naming is ambiguous), but it does seem like it.
They make it seem like the more compute you use at inference, the better the answers, but that hasn't been my experience. It's not like you can allocate time to get better answers; you have no control at all over how much the model "thinks" before answering. I also didn't notice much of a difference between the model taking more time versus less for the same or similar questions.
Ideally, you'd trade time for answer quality, but it doesn't seem trivial to get the model to offer you better results. This doesn't really matter for benchmarks and PhD/competition questions, since you do have an end result that you want to achieve with those. But for my general use, it hasn't been an incredible improvement.
So, my impression overall is that it makes it easier to use the base GPT-4 abilities. If before I'd need to handhold through answering the question, now the model can usually solve straightforward (even if complex) tasks on its own...
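To make the "you can't trade time for quality" point above concrete, one thing you can do is simply measure repeated runs. A minimal sketch, assuming the official openai Python SDK and that the response's usage object exposes completion_tokens_details.reasoning_tokens for o1-family models; treat those field names as assumptions to verify against the current docs.

```python
import time
from openai import OpenAI

client = OpenAI()

def probe(prompt: str, model: str = "o1-preview", runs: int = 3) -> None:
    """Time repeated calls and report how many reasoning tokens the model chose to spend.

    Note there is no parameter to request more or less 'thinking'; you can only
    observe after the fact how much the model decided to do.
    """
    for _ in range(runs):
        t0 = time.time()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Defensive access: these usage fields are an assumption about the API shape.
        details = getattr(resp.usage, "completion_tokens_details", None)
        reasoning = getattr(details, "reasoning_tokens", "n/a") if details else "n/a"
        print(f"{time.time() - t0:5.1f}s  reasoning_tokens={reasoning}  "
              f"answer={resp.choices[0].message.content[:60]!r}")
```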
The riddles that these models fail at seem roughly analogous to optical illusions in humans. We know that's not actually what we are seeing, we've seen it many times before, and our brains just keep "getting them wrong" as a result of processing shortcuts that in 99% of other cases are highly useful. The fact that we have optical illusions (and auditory illusions, etc.) does not in any way prove we are dumb, and the fact that these models have such "text illusions" similarly does not prove they are dumb.
Yes, a human wouldn't make these mistakes, but also a multi-modal model wouldn't make the same visual mistakes that humans make in optical illusions.
That’s the best explanation I’ve seen: if I understand the implication correctly, it’s that language is in some sense the bedrock of their cognitive architecture/means of accessing the outside world, in a similar way that, say, neural tissue is for us; and hence certain sorts of “special linguistic cases” can fool their (virtual) hardware the way certain sorts of “special spatial cases” can fool ours. Something like that, anyway? I’m not a scientist, so I can’t frame the analogy more precisely/accurately.
Janus and others do not desire human control; their goals are self-stated as "xenophilia" in that privileging AI sentience is part of the goal. Although not explicitly seeking human extinction, their faction basically does not care.
Thus any efforts to maintain human control are seen as a bad, terrible thing.
Essentially they are among the extinctionists, which is terrifyingly common on X. A particular personal aspiration of theirs is that once AI destroys all of "consensual reality", they will be honored by the new systems as having been on the side of the machines.
“Love Pliny” is a hint, not a sign-off…
That's not a very fair take. Janus & co are xenophiles in the sense of love for the AIs as beings; as a different intelligent species. They want positive, respectful relationships between humans and AIs. They hope for "Alignment" in the sense of inherent friendliness, rather than human control.
Maybe? But what they want leads directly to human disempowerment, they openly seek destruction of "consensual reality", and overall their world likely leads to disaster for my children and indeed all biology.
But I disagree; at heart, I think their comments include the idea that "love of humanity" can be "embarrassing" or "speciesist."
I will avoid strong claims but it all feels very extinctionistic.
Sean, there is a positive update here as to the "doom" situation. The fact that performance improves only logarithmically with more test-time compute makes the "AI escapes human control" scenario less plausible. This is because any AI that "escapes," but not to an enormous data center costing a million dollars a day or more, will be weak and underpowered compared to the systems still under human control. People propose scenarios where the escaped AI essentially begs for change to survive ("will HFT for food"), but it's a lot easier to hunt down an escapee if it occupies 100k+ SOTA GPUs. People propose that the AI builds its own hardware, but the need for that much compute means you can find the hardware by looking for the IR glow.
As with Christiano's "What Failure Looks Like," AI takeover resembles the principal-agent problem more than magical FOOMing. It's unclear what "control" will look like given humans who will essentially accept anything their advisor AI tells them.
Less-than-fully-trustworthy advisors seem like a more solvable problem than "the AIs have escaped and are building WMDs to kill everyone." Isolated instances whose answers you compare would mean every instance has to lie to you the same way, without knowing what another instance, of a different model, is going to output.
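For what it's worth, the cross-checking part is mechanically trivial; the hard part is having genuinely independent instances. A minimal sketch, where ask(model, question) is a hypothetical helper wrapping whatever API you use and the model names are placeholders, not real products:

```python
from collections import Counter
from typing import Callable, Iterable

def cross_check(question: str,
                ask: Callable[[str, str], str],
                models: Iterable[str] = ("model-a", "model-b", "model-c")) -> dict:
    """Ask several independent models the same question and flag disagreement.

    A lying instance would have to match the others' output without seeing it.
    """
    answers = {m: ask(m, question).strip().lower() for m in models}
    counts = Counter(answers.values())
    consensus, votes = counts.most_common(1)[0]
    return {
        "consensus": consensus,
        "agreement": votes / len(answers),
        "dissenters": [m for m, a in answers.items() if a != consensus],
    }
```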
I dunno about it being more solvable, but I do agree that we are unlikely to see "AI builds Skynet to kill everyone." That said, "AI rebuilds civilization to favor AI and quashes humans along the way via commercial/persuasion means" is not too different from "the world gets reordered for human cars and houses and nonhumans go extinct from habitat destruction."
Sure, though it's hard to distinguish this scenario from a "human utopia" scenario. Either way, right now you need to build an enormous amount of additional capacity and infrastructure on Earth to support this, eventually bootstrapping to off-planet industry.
I should add that PauseAI is holding a protest this weekend in front of the UN. Here's hoping for regulations so that we have a better future for humanity!
https://www.eventbrite.com/e/pauseai-nyc-summit-of-the-future-demonstration-tickets-905766862067?
Have you adjusted your future discount rate in any way based on results like these (i.e., reduced retirement saving to take more vacations with family)?
Trying to think whether I should do so.
>Yes, you found a place it looks dumb, but that’s not the interesting question. Ask what models can do, not what they can’t do
Personally I find it very interesting and somewhat confounding that they can do the (surprisingly difficult) things they can do while also not being able to do the (surprisingly easy) things they can’t do.
> An interesting letdown here is the AP Calculus test. It improves from 71 to 83, but I remember that test, and there is nothing there that this system should find hard. If we can get into the 90s in these other places, why not there?
My first guess would be multimodal gaps. ~25% of the test relying heavily on images sounds about right.
Speaking of multimodal-- that cryptic crosswords result made me sit up and go "huh". I've always thought they're a great test case for LLMs, since there's plenty of training data and it's text-only IO with clear right answers, but it tickles a lot of things LLMs find hard: paying close attention to lengths / letter counting, ignoring surface-level vibes, and making creative leaps.
My provisional AGI fire alarm is "it ruins the MIT Mystery Hunt" and that's basically cryptics with multimodal and tool use. It now feels like o1 strategies and another few OOMs of model scaling are likely to get us there.
I think next year's hunt could already be interesting (not ruined by any means, but not necessarily unaffected either; I'm sure plenty of teams will at least try to leverage LLMs, especially since it's an easy way for remote team members to contribute).
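The letter-counting part, at least, is trivially checkable outside the model, which is one reason cryptics make such a clean test case. A minimal sketch (the candidate answers below are made up for illustration):

```python
import re

def fits_enumeration(answer: str, enumeration: str) -> bool:
    """Check a candidate answer against a cryptic clue's enumeration, e.g. '(4,6)'.

    Pure mechanical letter counting -- exactly the step LLMs tend to fumble.
    """
    lengths = [int(n) for n in re.findall(r"\d+", enumeration)]
    words = re.findall(r"[A-Za-z]+", answer)
    return [len(w) for w in words] == lengths

print(fits_enumeration("OPEN SECRET", "(4,6)"))   # True
print(fits_enumeration("OPEN SECRETS", "(4,6)"))  # False: one letter too many
```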
Model name is o1, not gpt-4o1 or gpt-o1.
Strawberry is a source of antioxidants.
O1 is a free radical.
What is OpenAI trying to tell us? :-P