It's strange. I understand the sentiment that in 20 years line-by-line coding will be dead. However, I have to do a lot of line-by-line coding now, and the models I have access to (Claude 3.5, Flash Thinking 2 Experimental) have an approximately 0% hit rate on helping me program solutions to the problems I have day to day. Chatting with Claude is the closest thing to helpful, but I've tried chatting with a notepad.txt window instead and it's equally helpful: Claude is basically serving as a really polite and friendly rubber duck. It feels like I'm in a 2,000-mile race riding a bicycle against a car; right now I have a 1,000-mile head start, and the advice I'm getting is "try drafting behind the car."
Could you expand on what kinds of problems you're talking about? I assume you're excluding basics like boilerplate code or looking up documentation, where it seems LLMs are probably superior to notepad.txt and certainly different.
Going meta with it: here is a recent chat I had with Claude about how to share a Claude chat. In it, every single UI element Claude suggests I use is hallucinated.
https://gist.github.com/HastingsGreer/d88b623f25c466ccb6b9443fd4a44914
Alas, I thus lack nicely formatted conversations to share with you. Here is an example of me having great success using Claude as a duck, but where the comparison to notepad seems especially apt:
https://gist.github.com/HastingsGreer/2b61339e07438444f7db030a315319f1
It's the same story as the first gist: Claude says lots of reasonable-sounding things, but I have to do 100% of the work because Claude's musings have no connection to reality.
My experience of AI is the same. It mostly seems to have no connection to reality.
The chatbots' knowledge about their own interface and capabilities is sometimes particularly bad. I've had multiple fights with them trying to get them to generate images when I know they can.
I'm not quite a coder, but I'm currently using o1 to build most of a simple app for me. It gets some things wrong, but overall about 95% of the code just works the first time. Definitely a game-changer for side projects, in my experience.
I use Claude extensively for programming in my senior-level software engineering job.
Just recently I whipped out a fairly subpar React UI for part of our product. I shared a screenshot, the code, and our components, and Claude fixed it up, doing 90 percent of the work.
Probably part of my success is that I wrote the initial version quickly and sloppily, knowing that Claude would clean it up. I've probably internalized how to write the quick-and-dirty version in the way that is most conducive to getting Claude to finish it.
Large C++ codebase here. It feels like improvements in GitHub Copilot chat/edits are glacial; using the new Sonnet model is great, but if you don't bend over backwards to give it some concise context and a small play area, it will hallucinate.
I wonder if smaller modules, smaller files, and smaller functions become that much more important in codebases of the future.
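To make "concise context and a small play area" concrete, here's a rough sketch of the kind of thing I end up doing by hand; the paths, symbols, and helper are invented for illustration, not any real tool:

```python
# Hypothetical sketch: assembling a small "play area" from a large codebase
# before pasting it into a chat model, instead of pointing the model at the
# whole repository. All file paths below are invented examples.
from pathlib import Path

def build_prompt(task: str, target_file: str, interface_files: list[str]) -> str:
    """Concatenate only the file being changed plus a few hand-picked
    headers/interfaces, so the model has less room to hallucinate."""
    parts = [f"Task: {task}", "", f"--- {target_file} ---", Path(target_file).read_text()]
    for path in interface_files:
        parts += [f"--- {path} (interface only) ---", Path(path).read_text()]
    return "\n".join(parts)

# Example usage (invented paths):
# prompt = build_prompt(
#     "Remove the global lock around the cache lookup.",
#     "src/render/mesh_cache.cpp",
#     ["src/render/mesh_cache.h"],
# )
```

The smaller the modules and functions, the less of this manual slicing you have to do.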
My guess on o3 (and o1) is that OpenAI has figured out a way to train recursive models that can produce outputs the model sends back to itself for another iteration, until it converges or some other stopping criterion is reached. Or rather, my guess is that o1 used a section of the input buffer, but o3 might be partially abstracting the recursive bit into concept-space?
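To be clear, this is pure speculation, but here's a minimal sketch of the kind of loop I'm imagining; call_model, the convergence check, and the round cap are all invented for illustration, not anything OpenAI has described:

```python
# Minimal sketch of the guessed "send outputs back to itself" loop.
# call_model() is a stand-in for whatever internal model call exists;
# the convergence test and max_rounds cap are arbitrary illustrative choices.

def call_model(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real model call")

def iterate_until_converged(task: str, max_rounds: int = 8) -> str:
    draft = call_model(f"Attempt this task:\n{task}")
    for _ in range(max_rounds):
        revised = call_model(
            f"Task:\n{task}\n\nYour previous attempt:\n{draft}\n\n"
            "Critique the attempt and produce an improved version."
        )
        if revised.strip() == draft.strip():  # crude "it converged" check
            break
        draft = revised
    return draft
```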
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/o3-oh-my
FWIW, I expect to use Greptile + Sonnet (or some equivalent tool) for almost all coding over o3. For most problems, having context on everything else that you're doing > CodeForces-style closed-world complex algorithm writing.
Wow. Whether or not you believe o3 is AGI (I do not), it looks like OpenAI has achieved "superhuman" abilities with this model. I'm not sure whether I find the high cost of the model more relieving or more stress-inducing, however. A world where only large corporations and elites have access to AGI could be disastrous, but so, likely, would a world where everybody has access. On the other hand, perhaps the exponential cost will keep usage to a minimum, at least in the short term. In that situation, I would hope we would use it only for the most important problems. The variability of the future has never seemed greater.
No, it hasn't.
To the point about benchmarks always getting solved, when is someone going to create the world-peace, cure-aging, and matter-synthesizer benchmark? If we're going to invest trillions in AI, let's try to have the foundation model companies explicitly work on our top goals, not just automating our (relatively) dumb jobs.
Those sound more like ASI benchmarks than AGI benchmarks.
True. At this point most of the unsaturated benchmarks in math and coding are what I consider "light" ASI anyway, by which I mean problems that <0.1% of humans can solve. ARC is a notable exception. I think it's time to get more ambitious about the problems we're framing, both because we need more unsaturated benchmarks and because we need to stop searching under the street lamp. We currently don't know how to create an objective function for training on, say, solving aging, but we're not going to figure it out until we actually work on it.
To be honest, I find top competitive programmers to be fairly poor software engineers. Or, maybe another way to say it is that the skills to be a competitive programmer are not the same as the skills to be a professional software engineer.
It'll be interesting to see if o3 improves at writing maintainable and understandable code... the two things competitive programming does not optimize for.
I felt the EXACT same way about the focus on fucking bond funds + how *un*impressive it is.
Blew my mind; I looked at all the exact things you outlined media-wise.
Absolutely no understanding of what's happening.
Buying calls.
Will Nvidia really be an overall gainer from reasoning models? Reasoning models use a lot more test-time compute and less training-time compute. Nvidia has a huge training-time compute moat, but less of one for test time. Broadcom is putting huge pressure on Nvidia when it comes to inference.
Yeah, I think inputs like high-bandwidth memory are probably the play. It also doesn't seem priced in. (So Micron for US investors. SK Hynix seems to be harder to buy unless you're South Korean, but would be a natural investment to make if available.)
Or a semiconductor index fund like SMH that already has a nice mix of Broadcom, NVIDIA, Micron, etc. Also, this article's point about a slightly better base model leading to exponential cost savings suggests that training will still matter. It's likely that ~all the relevant inventory will be sold out for at least several years, and the usual fab boom/bust cycle will be harder to saturate with capital investment this time.
"If o3 was as good on most tasks as it is at coding or math, then it would be AGI."
A literal interpretation of this statement is true (more or less by definition; "emulating an AGI" is a task). But I think the more natural interpretation is false. o3 isn't designed to address messy open-world tasks that require interacting with the Internet (or even an intranet), selectively retrieving information, ignoring spurious information, etc. It doesn't have long-term memory or the ability to learn continuously. There are presumably still limits on how long it can maintain task coherence (and this would probably be harder for open-world tasks). We don't know how reliably good it is at coding and math, or across what breadth of tasks. (For instance, the FrontierMath result is crazy impressive but, at only 25.2% in absolute terms, leaves a lot of wiggle room for it to have somehow been achieved in a fluky / non-generalizable way (I've heard mutterings).)
I think o3-but-for-all-task-domains would still be far short of AGI. I could imagine that we get a lot of the former before 2025 is out but still be years away from transformational impact. (I've just finished drafting a blog post about more or less this topic, will publish within a week.) Very interested in your thoughts on this.
Feels like it could be a seed AI, in that it could be set up so that it fills in its missing capabilities or builds a successor with more of those capabilities.
I hate repeating this, but our lack of a good definition for AGI is maddening because everyone uses it in a different way without explaining how they're using it. In my opinion, the interesting inflection point is not "can do every specific task at least as well as a(ny) human," which I guess is what AGI literally means. My transformative AI definition is "can be approximately as productive as a human at many jobs." I do think o3 or thereabouts will meet that threshold.
We don't require a biologist to publish a Pulitzer Prize-winning play in order to be considered human, but it seems like we are requiring that of gen AI in order to consider it AGI. It's an odd distinction, similar to the way we seem to expect self-driving cars to be 10X+ safer than humans, rather than just safer.
Anyway, even if o3 or a similar model gets distracted or screws up sometimes, I submit that it will be good and fast enough when it works, and good enough at recovering in some fashion (which may mean the boss telling it to go back and do it again, similar to what happens with an employee), that it will often be more productive than a typical employee. At perhaps 5-10% of the cost. And we should not expect o3 to be the last improvement we see in 2025. So I see things getting very interesting very fast.
A couple of thoughts:
1. Imagine an AI which has the precise strengths and weaknesses of one particular median adult (including the ability to learn new things, but not the ability to change fundamentally, e.g. if that particular adult does not have a head for abstractions then they're never going to become a skilled programmer or mathematician). This would be wildly impressive and useful, but I don't think it would have the implications that most people have in mind when they talk about AGI. For instance, one common definition is "AI that can undertake 90% of economically useful activity", and this would fall well short of that.
I agree that Pulitzer-caliber writing is not a requirement for being AGI. But I do think that if there's a task that, say, 10% of adults can do, then an AGI should be able to do that thing. It should be as good at math as someone who is pretty good (not genius!) at math, it should be as good at writing as someone who is pretty good at writing, and so on. Otherwise there are going to be limits on how useful it is in practice; people tend to wind up doing work that they are relatively suited for (granting that there is a lot of noise in the system).
This is intuition-bending, because the population of humans has a much richer set of skills than any one particular human, even if you exclude genius from the population group. But 1,000,000 copies of a given AI are not going to have a richer set of skills than any one particular instance of that AI. So an AI that is as capable as a typical adult is *not* the same as an AI that can handle the set of tasks that typical adults do collectively.
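A toy way to see the point (invented skill labels, purely illustrative):

```python
# Invented skill sets, just to illustrate the union-vs-copies point above.
human_a = {"write prose", "teach"}
human_b = {"prove theorems", "debug C++"}
human_c = {"run experiments", "write grants"}
population_skills = human_a | human_b | human_c   # union of different people: 6 skills

ai_instance = {"write prose", "debug C++"}
fleet = [ai_instance] * 1_000_000                 # a million copies of the same AI
fleet_skills = set().union(*fleet)                # still just 2 distinct skills

print(len(population_skills), len(fleet_skills))  # -> 6 2
```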
2. I absolutely do not think that o3 (or "thereabouts") will come anywhere close to AGI even as you're defining it, and I would be happy to take bets on that. See my previous comment.
Is there a way to get the TOC links to work in the Substack iOS app?
Not that I know of.
I know the motivation was probably something more prosaic like "c'mon geeks we have 3 more days of releases to fill, what do you have for us" but in hindsight announcing o3 now looks quite altruistic. It gives rival labs and short-timeline safety advocates weeks to months of extra warning that inference-time scaling works big-time, and it similarly moves up the "feel the AGI" timeline for non-technical decision-makers. As much hype as it's getting now, I presume it would have been going to get* even more if they'd waited to announce it until it was available, like they did for o1. So it sure feels like it's sacrificing OpenAI's near-term benefit for more overall AI awareness.
*Couldn't pass up the chance to use the... past future subjunctive?... tense.
Maybe. I see what you're saying. On the flip side, it probably keeps some customers with OpenAI for the next X months rather than having them go to other model companies.