What do you think the odds are that scaling laws have stopped? I ask because Grok wasn't that good even at ~10^26 FLOPs, and OpenAI called this model 4.5 instead of 5 (which they're saving for reasoning).
Scaling in the old sense of making models 2+ OOMs bigger and hoping for the best isn't economically feasible anymore, even if it did still work.
Scaling in narrower senses is still done—in the DeepSeek papers they mention they're bottlenecked by a lack of good RL training data, so presumably now the challenge is to figure out what a model lacks, and synthesize more of it.
My personal belief is that scaling was always kind of a mirage: the important thing is to get the right kinds of knowledge into the model, and scaling was just the easiest way to do so (by packing EVERYTHING ON THE INTERNET into the model, you'd necessarily scoop up the needed knowledge, just as trawling the entire ocean will definitely catch some fish). But now economics are forcing people to be more discriminating. Does training on trillions of tokens of random web slop really get you anywhere anymore? I don't think so. Frontier models like o3 use specialized reasoning datasets, often synthesized by other frontier LLMs.
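To put rough numbers on the diminishing-returns point above, here's a minimal sketch of a Chinchilla-style scaling law. The coefficients are the published fit from Hoffmann et al. (2022); the parameter/token counts are hypothetical, chosen only to illustrate the shape of the curve, not claims about GPT-4.5 or Grok:

```python
# Chinchilla-style scaling law (Hoffmann et al. 2022):
#   loss(N, D) = E + A / N^alpha + B / D^beta
# where N = parameter count and D = training tokens.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Each step scales parameters and tokens 100x, so training compute
# (~6*N*D FLOPs) grows 10,000x per step; the loss gain shrinks anyway.
for n, d in [(1e9, 2e10), (1e11, 2e12), (1e13, 2e14)]:
    print(f"N={n:.0e} D={d:.0e} compute~{6 * n * d:.1e} FLOPs loss={loss(n, d):.2f}")
```

Under this fit the loss goes roughly 2.58 → 1.91 → 1.75 as compute grows by eight orders of magnitude: the irreducible term E dominates quickly, which is one way to read "isn't economically feasible anymore."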
Taste is knowledge. A chef can tell how fresh their ingredients are, knows how to combine them, and has all the understanding required to identify and judge what they produce. Critics are the same. Once you know enough, you are able to place a piece of work against its natural comparisons, know it, and therefore judge it.

Most people mistake taste for preference. Everyone has preferences. Some people are so knowledgeable about their preferences that they develop taste in that niche. But that isn't the same as actually having taste in a discipline. The pervasive idea that art is all subjective interpretation is everywhere in schools and on the internet, and it has blinded many people to this basic truth. People who don't read poetry are declaring poems by 4.5 to be good. They may have taste in other areas of writing, but in poetry it is merely a preference. That's fine! AI is good at producing things we like. Great! That doesn't mean it's actually a good poet, or that it can currently write anything to compare to the best human poets.

So a lot of this depends on what part of progress you are willing to call good. Yes, 4.5 is a better writer, but only in some narrow areas. It has improved on the things people have stronger preferences for, i.e. corporate writing, meme jokes, mid poetry. Ask it to write anything more literary and it flunks.
Please tell me the non-subjective criteria by which a 3-star meal is better than a microwave burrito. Yes, _almost_ everyone will agree that it is. But that doesn't change the fact that it is a completely _subjective_ assessment.
And before you answer with "freshness" or "quality of ingredients" or "technique of the chef", make sure you are not shifting the preferences/subjectivity up a level.
One can come up with criteria along which two dishes can be relatively objectively compared, but the very selection of those criteria is, itself, subjective.
You are assuming there are criteria or that the lack of such criteria proves your point.
When you claim a lack of subjectivity, then yes, the lack of criteria does indeed prove my point. That's almost definitionally what objectivity means: that there are objective criteria.
You’re trying to apply modes of rationality that don’t really work.
You are completely correct that they don't work... because the field is completely subjective, and therefore they don't apply. That's the _reason_ that they don't work.
That's basically what it means for those things to not work: that the thing they are being applied to is subjective and has no ground truth.
Edit: I'm pretty sure that finding out that taste has objectively correct answers would be philosophically equivalent to finding objectively correct morality. If you solve one, you basically solve the other.
Someone with knowledge and experience in an artistic field might have different preferences, and those preferences might have required that knowledge and experience to develop. That does not necessarily make those preferences more correct.
Likewise, the fact that a work took more effort, skill, and/or experience to create does not inherently make that work "better".
I think it's honestly a bit of a shame when the preferences of artists and critics at the top of a field become too misaligned with those of the general public (which they certainly can). Wouldn't it be better if the best artists were creating works that more people could appreciate? Wouldn't it be better if critics were able to direct the public towards works they appreciate more reliably?
I happen to agree with this. It’s not inconsistent with what I said.
It's often both right and hallucinatory for me, as if it were an overconfident genius.
The most striking example was when I asked it whether paracetamol for infant teething was backed by evidence and it conjured a non-existent Cochrane review without additional prompting.
It also often comes off as almost cocky in the way that it addresses me. Dunno how it got imbued with this.
Good brainstormer, editor and writer though, although I have yet to test it in French.
One has to wonder how even additional GPUs would allow them to serve all users if they're already at capacity.
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/on-gpt-45?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
> > I view it as a model that attempts to improve on the dimension of aesthetics only.
> Where I definitely disagree with him is in the idea that the model is only attempting to improve on the aesthetic dimension. I have no doubt OpenAI had much higher hopes for what GPT-4.5 would bring us...
He said the "model attempts", not "OpenAI attempted". I think he just means "this model is mainly an improvement on aesthetics," rather than "OpenAI was only trying to improve on aesthetics." Why would he have an opinion on that?
I keep reading posts about continuous improvements, and I'm curious how Sonnet is so good. Is it something to do with its SFT? Are there any articles I can be pointed towards on this?
Just wanted to say thank you for your phenomenal coverage of AI. Reading your articles is among the things that give me the most joy and pleasure. I remember when I first started reading your blog and didn't know how often new articles were published; when the weekend came, I literally refreshed the blog every hour all weekend (and on Monday!) hoping that an article would appear any minute.
With the release of GPT-4.5, including the announcement that it's extremely expensive, I have updated pretty significantly towards a "slow takeoff."
Why is Gary Marcus so consistently terrible?
I haven't extensively tested it, and it's too expensive for me to do so. I have personally not been impressed by most of the samples I've seen online, though.
My feeling is that GPT-4.5's "creativity"/"big model smell" is mostly an artifact of RL and fine-tuning, not scale. (R1 feels noticeably more creative to talk to than V3, but they share the same base model.)
Your cadence of high-quality outputs is insane; love it.