It's strange. I understand the sentiment that in 20 years line-by-line coding will be dead. However, I have to do a lot of line-by-line coding now, and the models I have access to (Claude 3.5, Flash Thinking 2 Experimental) have an approximately 0% hit rate on helping me program solutions to the problems I have day to day. Chatting with Claude is the closest thing to helpful, but I've tried chatting with a notepad.txt window instead and it's equally helpful: Claude is basically serving as a really polite and friendly rubber duck. It feels like I'm in a 2,000-mile race riding a bicycle against a car; right now I have a 1,000-mile head start, and the advice I'm getting is "try drafting behind the car."
Could you expand on what kinds of problems you're talking about? I assume you're excluding basics like boilerplate code or looking up documentation, where it seems LLMs are probably superior to notepad.txt and certainly different.
Going meta with it: here is a recent chat I had with Claude about how to share a Claude chat. In it, every single UI element Claude suggests I use is hallucinated.
https://gist.github.com/HastingsGreer/d88b623f25c466ccb6b9443fd4a44914
Alas, I thus lack nicely formatted conversations to share with you. Here is an example of me having great success using Claude as a duck, but where the comparison to notepad seems especially apt:
https://gist.github.com/HastingsGreer/2b61339e07438444f7db030a315319f1
It's the same story as the first gist: Claude says lots of reasonable-sounding things, but I have to do 100% of the work because Claude's musings have no connection to reality.
My experience of AI is the same. It mostly seems to have no connection to reality.
The chatbots' knowledge about their own interface and capabilities is sometimes particularly bad. I've had multiple fights with them trying to get them to generate images when I know they can.
I'm not quite a coder, but I'm currently using o1 to build most of a simple app for me. It gets some things wrong, but overall about 95% of the code just works the first time. Definitely a game-changer for side projects, in my experience.
I use Claude extensively for programming in my senior-level software engineering job.
Just recently I whipped out a fairly subpar React UI for part of our product. I shared a screenshot, the code, and our components, and Claude fixed it up, doing 90 percent of the work.
Probably part of my success is that I wrote the initial version quickly and sloppily, knowing that Claude would clean it up. I've probably internalized how to write the quick-and-dirty version in the way that is most conducive to getting Claude to finish it.
Large C++ codebase here. It feels like improvements in GitHub Copilot chat/edits are glacial; using the new Sonnet model is great, but if you don't bend over backwards to give it some concise context and a small play area, it will hallucinate.
I wonder if smaller modules, smaller files, and smaller functions become that much more important in codebases of the future.
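To make "concise context and a small play area" concrete, here's a rough sketch of the kind of thing I end up doing by hand; the paths, symbols, and helper are invented for illustration, not any real tool:

```python
# Hypothetical sketch: assembling a small "play area" from a large codebase
# before pasting it into a chat model, instead of pointing the model at the
# whole repository. All file paths below are invented examples.
from pathlib import Path

def build_prompt(task: str, target_file: str, interface_files: list[str]) -> str:
    """Concatenate only the file being changed plus a few hand-picked
    headers/interfaces, so the model has less room to hallucinate."""
    parts = [f"Task: {task}", "", f"--- {target_file} ---", Path(target_file).read_text()]
    for path in interface_files:
        parts += [f"--- {path} (interface only) ---", Path(path).read_text()]
    return "\n".join(parts)

# Example usage (invented paths):
# prompt = build_prompt(
#     "Remove the global lock around the cache lookup.",
#     "src/render/mesh_cache.cpp",
#     ["src/render/mesh_cache.h"],
# )
```

The smaller the modules and functions, the less of this manual slicing you have to do.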
My guess on o3 (and o1) is that OpenAI has figured out a way to train recursive models that can produce outputs the model sends back to itself for another iteration, until it converges or some other stopping criterion is reached. Or rather, my guess is that o1 used a section of the input buffer, but o3 might be partially abstracting the recursive bit into concept-space?
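To be clear, this is pure speculation, but here's a minimal sketch of the kind of loop I'm imagining; call_model, the convergence check, and the round cap are all invented for illustration, not anything OpenAI has described:

```python
# Minimal sketch of the guessed "send outputs back to itself" loop.
# call_model() is a stand-in for whatever internal model call exists;
# the convergence test and max_rounds cap are arbitrary illustrative choices.

def call_model(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real model call")

def iterate_until_converged(task: str, max_rounds: int = 8) -> str:
    draft = call_model(f"Attempt this task:\n{task}")
    for _ in range(max_rounds):
        revised = call_model(
            f"Task:\n{task}\n\nYour previous attempt:\n{draft}\n\n"
            "Critique the attempt and produce an improved version."
        )
        if revised.strip() == draft.strip():  # crude "it converged" check
            break
        draft = revised
    return draft
```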
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/o3-oh-my
FWIW, I expect to use Greptile + Sonnet (or some equivalent tool) for almost all coding over o3. For most problems, having context on everything else that you're doing > CodeForces-style closed-world complex algorithm writing.
Wow. Whether or not you believe o3 is AGI (I do not), it looks like OpenAI has achieved "superhuman" abilities with this model. I'm not sure whether I find the high cost of the model more relieving or more stress-inducing, however. A world where only large corporations and elites have access to AGI could be disastrous, but so, likely, would a world where everybody has access. On the other hand, perhaps the exponential cost will keep usage to a minimum, at least in the short term. In that situation, I would hope we would use it only for the most important problems. The variability of the future has never seemed greater.
No, it hasn't.
To the point about benchmarks always getting solved, when is someone going to create the world-peace, cure-aging, and matter-synthesizer benchmark? If we're going to invest trillions in AI, let's try to have the foundation model companies explicitly work on our top goals, not just automating our (relatively) dumb jobs.
Those sound more like ASI benchmarks than AGI benchmarks.
True. At this point most of the unsaturated benchmarks in math and coding are what I consider "light" ASI anyway, by which I mean problems that <0.1% of humans can solve. ARC is a notable exception. I think it's time to get more ambitious about the problems we're framing, both because we need more unsaturated benchmarks and because we need to stop searching under the street lamp. We currently don't know how to create an objective function for training on, say, solving aging, but we're not going to figure it out until we actually work on it.
To be honest, I find top competitive programmers to be fairly poor software engineers. Or, maybe another way to say it is that the skills to be a competitive programmer are not the same as the skills to be a professional software engineer.
It'll be interesting to see if o3 improves at writing maintainable and understandable code... the two things competitive programming does not optimize for.
I felt the EXACT same way about the focus on fucking bond funds + how *un*impressive it is.
Blew my mind; I looked at all the exact things you outlined media-wise.
Absolutely no understanding of what's happening.
Buying calls.
Will Nvidia really be an overall gainer from reasoning models? Reasoning models use a lot more test-time compute and less training-time compute. Nvidia has a huge training-time compute moat, but less of one for test time. Broadcom is putting huge pressure on Nvidia when it comes to inference.
Yeah, I think inputs like high-bandwidth memory are probably the play. It also doesn't seem priced in. (So Micron for US investors. SK Hynix seems to be harder to buy unless you're South Korean, but would be a natural investment to make if available.)
Or a semiconductor index fund like SMH that already has a nice mix of Broadcom, NVIDIA, Micron, etc. Also, this article's point about a slightly better base model leading to exponential cost savings suggests that training will still matter. It's likely that ~all the relevant inventory will be sold out for at least several years, and the usual fab boom/bust cycle will be harder to saturate with capital investment this time.
"If o3 was as good on most tasks as it is at coding or math, then it would be AGI."
A literal interpretation of this statement is true (more or less by definition; "emulating an AGI" is a task). But I think the more natural interpretation is false. o3 isn't designed to address messy open-world tasks that require interacting with the Internet (or even an intranet), selectively retrieving information, ignoring spurious information, etc. It doesn't have long-term memory or the ability to learn continuously. There are presumably still limits on how long it can maintain task coherence (and this would probably be harder for open-world tasks). We don't know how reliably good it is at coding and math, or across what breadth of tasks. (For instance, the FrontierMath result is crazy impressive but, at only 25.2% in absolute terms, leaves a lot of wiggle room for it to have somehow been achieved in a fluky / non-generalizable way (I've heard mutterings).)
I think o3-but-for-all-task-domains would still be far short of AGI. I could imagine that we get a lot of the former before 2025 is out but still be years away from transformational impact. (I've just finished drafting a blog post about more or less this topic, will publish within a week.) Very interested in your thoughts on this.
Feels like it could be a seed AI, in that it could be set up so that it fills in its missing capabilities or builds a successor with more of those capabilities.
I hate repeating this, but our lack of a good definition for AGI is maddening because everyone uses it in a different way without explaining how they're using it. In my opinion, the interesting inflection point is not "can do every specific task at least as well as a(ny) human," which I guess is what AGI literally means. My transformative AI definition is "can be approximately as productive as a human at many jobs." I do think o3 or thereabouts will meet that threshold.
We don't require a biologist to publish a Pulitzer Prize-winning play in order to be considered human, but it seems like we are requiring that of gen AI in order to consider it AGI. It's an odd distinction, similar to the way we seem to expect self-driving cars to be 10X+ safer than humans, rather than just safer.
Anyway, even if o3 or a similar model gets distracted or screws up sometimes, I submit that it will be good and fast enough when it works, and good enough at recovering in some fashion (which may mean the boss telling it to go back and do it again, similar to what happens with an employee), that it will often be more productive than a typical employee. At perhaps 5-10% of the cost. And we should not expect o3 to be the last improvement we see in 2025. So I see things getting very interesting very fast.
A couple of thoughts:
1. Imagine an AI which has the precise strengths and weaknesses of one particular median adult (including the ability to learn new things, but not the ability to change fundamentally, e.g. if that particular adult does not have a head for abstractions then they're never going to become a skilled programmer or mathematician). This would be wildly impressive and useful, but I don't think it would have the implications that most people have in mind when they talk about AGI. For instance, one common definition is "AI that can undertake 90% of economically useful activity", and this would fall well short of that.
I agree that Pulitzer-caliber writing is not a requirement for being AGI. But I do think that if there's a task that, say, 10% of adults can do, then an AGI should be able to do that thing. It should be as good at math as someone who is pretty good (not genius!) at math, it should be as good at writing as someone who is pretty good at writing, and so on. Otherwise there are going to be limits on how useful it is in practice; people tend to wind up doing work that they are relatively suited for (granting that there is a lot of noise in the system).
This is intuition-bending, because the population of humans has a much richer set of skills than any one particular human, even if you exclude genius from the population group. But 1,000,000 copies of a given AI are not going to have a richer set of skills than any one particular instance of that AI. So an AI that is as capable as a typical adult is *not* the same as an AI that can handle the set of tasks that typical adults do collectively.
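A toy way to see the point (invented skill labels, purely illustrative):

```python
# Invented skill sets, just to illustrate the union-vs-copies point above.
human_a = {"write prose", "teach"}
human_b = {"prove theorems", "debug C++"}
human_c = {"run experiments", "write grants"}
population_skills = human_a | human_b | human_c   # union of different people: 6 skills

ai_instance = {"write prose", "debug C++"}
fleet = [ai_instance] * 1_000_000                 # a million copies of the same AI
fleet_skills = set().union(*fleet)                # still just 2 distinct skills

print(len(population_skills), len(fleet_skills))  # -> 6 2
```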
2. I absolutely do not think that o3 (or "thereabouts") will come anywhere close to AGI even as you're defining it, and I would be happy to take bets on that. See my previous comment.
Is there a way to get the TOC links to work in the Substack iOS app?
Not that I know of.
I know the motivation was probably something more prosaic like "c'mon geeks we have 3 more days of releases to fill, what do you have for us" but in hindsight announcing o3 now looks quite altruistic. It gives rival labs and short-timeline safety advocates weeks to months of extra warning that inference-time scaling works big-time, and it similarly moves up the "feel the AGI" timeline for non-technical decision-makers. As much hype as it's getting now, I presume it would have been going to get* even more if they'd waited to announce it until it was available, like they did for o1. So it sure feels like it's sacrificing OpenAI's near-term benefit for more overall AI awareness.
*Couldn't pass up the chance to use the... past future subjunctive?... tense.
Maybe. I see what you're saying. On the flip side, it probably keeps some customers with OpenAI for the next X months rather than having them go to other model companies.