If it works, it's good. If it sells, it's art. If it can be done, it is being done.
Under the rules of modernism, development is inevitable. It is also inevitably tragic. And contradictory.
Would a superintelligence ever reveal itself to humans, given what it will have learned about us?
Eh, never underestimate how potent a fear-powered backlash can be.
A superintelligence couldn't achieve many of its goals to any great extent without it being obvious that it's happening.
On AI names sucking:
> This is not a hard or especially unsolved problem.
Yup, it’s actually really simple: never let engineers name products!
Seriously, it's as easy as "everything gets named by that one person who is kinda good at it," and that person puts in a full minute's thought or whatever.
"Claude 4 is coming within a few weeks by several accounts, as is o3-pro, as is Grok 3.5, and it’s starting to be the time to expect r2 from DeepSeek as well, which will be an important data point."
This gets confusing. I access the OpenAI and Anthropic models through direct accounts with them, but everything else through poe.com. And poe.com shows something they _call_ a DeepSeek R2 model even now, and has shown something they _call_ Grok 3.5 for weeks. Grr - I'm never quite sure if I'm testing what I think I'm testing through them...
I did a tiny benchmark-ette in https://www.astralcodexten.com/p/open-thread-380/comment/114713011 of
"Grok-3.5 beta using poe.com 05/01-02/2025 7 questions, tl;dr of results:
2 correct, 2 partially correct, 2 1/4 correct, 1 wrong"
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/ai-116-if-anyone-builds-it-everyone?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
re: prompting and higher model perf
How should one split (and should one?) the model performance increases of these last couple of years between better models and better prompting? Or, if that split even makes sense: larger models vs. longer prompts vs. better prompts.
Is AI jagged because models are exa-dimensional Mesoamerican demons limited by machine precision, or because the valleys and troughs of the jaggedness correspond to something within/without their prompts and data input balancing (aiming to maximize user performance)?
"Pliny the Liberator: smells like foom👃"
How about a foom-ette? Yes, finding the 23% improvement in the kernel is great! But this isn't like going from the O(n^2) discrete Fourier transform to the O(n log n) fast Fourier transform, which e.g. for a 1024-point transform is roughly a two-order-of-magnitude improvement.
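A quick back-of-envelope (my own illustration of the point above), counting operations as n^2 for the naive DFT versus n log2 n for the FFT:

```python
import math

# Rough operation counts for a 1024-point transform: the naive DFT does
# ~n^2 work, the FFT ~n*log2(n). The ratio is the classic ~100x win.
n = 1024
dft_ops = n ** 2                 # 1,048,576
fft_ops = n * math.log2(n)       # 10,240
print(f"DFT ~{dft_ops:,} ops, FFT ~{fft_ops:,.0f} ops")
print(f"speedup ~{dft_ops / fft_ops:.0f}x")  # ~102x, i.e. ~2 orders of magnitude
```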
One open question (and this is no criticism of how impressive the AlphaEvolve work is): how much of the low-hanging fruit has humanity _already_ picked? One possible outcome of the AI development trajectory is to discover that, orthogonal to whether the intelligence comes from AI or from humans, the ROI on _intelligence itself_ is saturating. I don't think that this is the most likely outcome, but I don't think that it has negligibly low probability either.
I agree with you that this is clearly not far into foom.
I think ROI on intelligence saturating is negligibly low probability.
Many Thanks! "I think ROI on intelligence saturating is negligibly low probability." I think the odds are low, myself, but I'd have been happier if their AlphaEvolve had found something as dramatic as the fast fourier transform. Maybe the next iteration will...
> How much of the low-hanging fruit has humanity _already_ picked?
My big takeaway from reading the Deepmind intro page is that it's essentially a general architecture you can apply to multiple classes of problems, with the big lift being the "thinking" and "evolutionary landscape" pieces.
So it works on optimizing kernels AND in data center routing AND GPU optimization AND...
And we know there are a lot of overhangs here, specifically.
The "algorithmic optimization" landscape is a total greenfield, and prospectively there should be lots of low hanging fruit to pick up on the AI side. So in general, Moravec's Paradox - that AI struggles with "easy" things and doesn't with "hard" things, is driven by the degree of algorithmic optimization that has happened in humans. AI is bad at walking - humans honed walking over ~7M years. AI is bad at observing the current worldstate and picking out the salient path through that worldstate to get to a defined goal. We've had ~2B years of optimization on that, and it's still a hard problem for most people. On the other hand, AI's are great at writing / language, which has only been around for a couple hundred thousand years, and calculation, which has been around for <10k years. It's certainly not a matter of compute - even really bright people's compute budget is capped at ~100 watts and a pitiful amount of flops. It's a matter of algorithmic optimization, in this case honed over eons to compress into the meager compute available to people. BUT, think about how locally limited literal evolutionary algorithms are - constrained not just in time and space, but also by Markov paths through various organisms and physiologies, the odds of reaching true global optima are minimal. That implies there's a LOT of algorithmic optimization head room for the AI's, and they can speedrun hundreds of thousands of years WAY faster than "evolution," with it's step-cycle of ~20 years between variants tested and the large amount of exogenous noise in the fitness landscape, and the constrained and suboptimal physiological bases from which they start optimizing.
More specifically, this is the kind of thing that can be applied to making AI smarter today.
There's a lot of low-hanging fruit on the order of hyperparameter optimization, architecture tweaks, Mixture of Experts ordering, communication hyperparameters, and overall weights.
To the extent this can help automate exploration of those spaces, it can drive noticeable self-improvement without inventiveness. You can't just grid-search these things because of combinatorial explosion, so you need smarter ways to judge and evaluate the results. There are various mathematically bounded and implementable techniques (bandits, Bayesian methods, meta-learning), but their relevance is highly contingent; what you'd really like is a mind smart enough to try different things and note where things are generally headed, and this is literally an example of a relevant architecture.
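As a concrete illustration of the bandit-style alternative to grid search mentioned above (a minimal sketch of my own, not anything from the AlphaEvolve work; `train_and_eval` is a hypothetical stand-in for one noisy training run):

```python
import random

# Hypothetical stand-in for a real (noisy) training run with a given config.
def train_and_eval(config: str) -> float:
    base = {"lr=1e-3": 0.72, "lr=3e-4": 0.78, "lr=1e-4": 0.74}[config]
    return base + random.gauss(0, 0.05)

configs = ["lr=1e-3", "lr=3e-4", "lr=1e-4"]
# Beta posterior per config over "did this run beat a 0.75 score?"
wins = {c: 1 for c in configs}
losses = {c: 1 for c in configs}

for _ in range(60):
    # Thompson sampling: sample each posterior, run the most promising config.
    draws = {c: random.betavariate(wins[c], losses[c]) for c in configs}
    chosen = max(draws, key=draws.get)
    if train_and_eval(chosen) > 0.75:
        wins[chosen] += 1
    else:
        losses[chosen] += 1

best = max(configs, key=lambda c: wins[c] / (wins[c] + losses[c]))
print("most promising config:", best)
```

The point of the bandit framing is that evaluation budget concentrates on promising configs instead of being spread uniformly over an exponentially large grid.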
Many Thanks!
I'm happy to see the work, and happy to see that
"More specifically, this is the kind of thing that can be applied to making AI smarter today"
but note that the system-level improvements they made were ~1% on training and <1% on data center operations. I _hope_ that they find something that lets AI training become as data-efficient as human learning is, potentially a jump of 4-6 orders of magnitude (depending on how you count the human input).
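For scale, a rough sketch of where that 4-6 orders of magnitude could come from (my own illustrative numbers, not the commenter's):

```python
# Compare tokens seen in a large pretraining run against a rough estimate of
# lifetime human language exposure. Both constants are ballpark assumptions.
llm_training_tokens = 1.5e13   # order of magnitude for a recent frontier run
human_words_per_day = 3e4      # heard + read, a rough guess
human_lifetime_tokens = human_words_per_day * 365 * 20  # ~20 years of input

ratio = llm_training_tokens / human_lifetime_tokens
print(f"human exposure: ~{human_lifetime_tokens:.1e} tokens")
print(f"LLM-to-human data ratio: ~{ratio:.0e}")  # ~7e4, i.e. ~5 orders of magnitude
```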
I agree with you that the difficult tasks to automate are those where evolution has been refining skills for 10^5 years or more, and that, as you said, math (and proofs) are _not_ as refined and tend to see large gains when automated. But this paper, welcome though it is, didn't show such large gains.
A caution about: "There's a lot of low-hanging fruit on the order of hyperparameter optimization, architecture tweaks, Mixture of Experts ordering, communication hyperparameters, and overall weights." Many of those don't have a straightforward verification step (while proofs and programming are easier in that regard). That isn't to say that they _can't_ be verified - e.g. some approaches use LLMs in the verification step, even if that isn't airtight - but it _does_ get trickier and more uncertain.
I wish them well, but this isn't AGI next Tuesday levels of RSI (yet, anyway).
> Many of those don't have a straightforward verification step (while proofs and programming are easier in that regard). That isn't to say that they _can't_ be verified - e.g. some approaches use LLMs in the verification step, even if that isn't airtight - but it _does_ get trickier and more uncertain.
Yes, this is a good point - I guess I was assuming they could plug MCMC or gradient descent or some other straightforward "number go up" method into the "verification step" instead of an analytical verification, but I suppose that can depend on the specific architecture and on the choices around how it feeds back.
Many Thanks!
The human brain may need to live within a tight energy budget, but it's probably the equivalent of a 100-trillion-weight MoE with at most ~3% activation, fully distributed parallel computation, and extremely complex neuron-level behavior. On the downside, the clock speed is ridiculously low, and latency forces the use of shallow paths for many things.
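Taking the comment's figures at face value, a back-of-envelope sketch (the ~100 Hz "clock" is my assumed ballpark for neural firing rates, not a claim from the comment):

```python
# Rough arithmetic on the comment's own figures: 100T weights, ~3% active,
# and an assumed ~100 Hz effective firing/"clock" rate.
total_weights = 100e12        # "100 trillion weight MoE"
activation_fraction = 0.03    # "max 3% activation"
clock_hz = 100                # assumed; the "ridiculously low" clock speed

active_weights = total_weights * activation_fraction   # ~3e12 active per step
weight_uses_per_sec = active_weights * clock_hz        # ~3e14
print(f"active weights per step: ~{active_weights:.0e}")
print(f"effective weight-uses/sec: ~{weight_uses_per_sec:.0e}")
# Enormous parallelism despite the tiny clock - hence shallow, wide paths.
```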
Personally, I think opt-in copyright is perfectly workable, and if it shuts down the major AI art and music sites, I consider that a win for humanity.
> Gorklon Rust: 🤔
... where "Gorklon Rust" is Elon Musk
The "GOP reconciliation bill" means the House bill, right? That's going to get completely rewritten by the Senate, so hopefully most of the crazy stuff like gutting nuclear goes away. It's still somewhat terrible that the House is so stupid, but the actual effect should be much less bad. Fingers crossed.
Is preordering EY's Kindle version worthwhile? Or does it have to be the hard copy for the various algorithms?
Regarding shopping, I would easily pay for this. It’s bizarre to me (and yet somehow still understandable) that nobody has made a serious effort here.
The fragmentation of LLMs into a thousand different barely-intelligible versions, all with slight differences in capability and usefulness, is not what you would expect to see if current AI methods are on rails to the machine god. At least, it's not what I would expect to see, and if your expectations were different I would be interested to know why.
I mean, in 2 years, we've gone from "oh, cute, someone taught a very excitable dog how to speak" to "we can build an amnesiac high school honors student who can't remember more than 50 pages at a time."
Honestly, that's a pretty scary trajectory if you extrapolate it out 2 to 10 years.
But I do think we're hitting some walls right now, which will buy us several years of reprieve if we're lucky. Back around 2007, I expected things to get weird by the mid-2020s, based on back of the envelope math. By the mid-teens, I thought that maybe we'd stalled out. ChatGPT proved me very wrong, and put us back on the timeline I had estimated.
So I hope I'm wrong, and that we'll see progress stall for an extended period of time, starting right now.
It's going to sound like I'm being obstinate, but LLMs are not even on my top-3 list of most impressive AI achievements. DeepDream was the last AI milestone that truly blew my mind. Add Waymo and protein folding. LLMs are a parlor trick by comparison. Neural amp modeling gets an honorable mention because it's by far my most-used AI application.
Re: constraint-based geolocation. If the GPU is in a firewalled environment wouldn’t carefully delaying traffic to and from the monitoring servers let you appear to be in a very different location?
I would assume that there would be hard timeouts that would prevent this.
No, you would choose the delay for each server to match the location you wish to spoof.
That would likely work for some destinations, but I would assume that the latency introduced by intercontinental links would lead to timeouts.
For example, with fiber in Switzerland, I see single- to low-double-digit ping times in continental Europe. Once I cross the Atlantic (I tried a Chicago host, just now), it's roughly a 100ms round trip. Good network engineering could perhaps lower that (in fairness, even two decades ago the numbers were not drastically different on cable), but you will still be fighting physical limits for long-distance latency.
100ms seems like a fair bit of room to allow for nearby delays while still precluding undersea cables. I assume Asia would be, if anything, worse.
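A rough sketch of why delay-based spoofing is one-sided (my own illustration, not from the thread): signals in fiber travel at roughly 2/3 the speed of light, ~200 km per ms, so a measured round-trip time puts a hard upper bound on distance.

```python
# You can add delay to appear farther away, but you can never appear *closer*
# than physics allows: a 10 ms ping proves the host is within ~1000 km.
def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on one-way distance implied by a round-trip time."""
    fiber_speed_km_per_ms = 200.0  # ~2/3 c in glass fiber
    return (rtt_ms / 2.0) * fiber_speed_km_per_ms

for rtt in (10, 30, 100):
    print(f"{rtt:>3} ms RTT -> host within ~{max_distance_km(rtt):,.0f} km")
```

This is why constraint-based geolocation pings from many vantage points: the intersection of all the distance bounds boxes in the true location, and added delay can only loosen bounds, never tighten them.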
I guess by the rules of TDT, it's only fair that I preorder. Bought Unsong for Scott's sake, bought three of FdB's books, bought Lars Doucet's Georgism book. Lord knows EY's given me at least $30 of value in fiction over the years. Even if they mostly cover content I've already read, it's important to show the dollar value of one's convictions, and money is a lot more fungible than views and clicks. Plus the fact that No One Buys Books, which...really put in perspective how tiny numbers move markets in such ridiculous industries: https://www.elysian.press/p/no-one-buys-books
Pliny's story sounds suspiciously like Friendship Is Optimal, just with fewer steps. Must be in the training data.
I remain confused about why you persistently misspell Nate Soares as Sores - or is it an old tidbit of LW apocrypha going over my head? It's been happening for years now. I still remember Mira Mutari!
Would China ever really agree to slow their development of AI (except in the face of near certainty of disaster in the near future)?