Congratulations, as always, to everyone who got to participate in the 2025 International Mathematical Olympiad, and especially to the gold and other medalists.
Do we still consider Yudkowsky's suggestion to bomb the datacentres insane? Because, personally, an actual, hand-on-heart jihad against AGI companies would go down just fine with me.
I'm not ready to condone this, but I could imagine getting there pretty quickly if we don't see a slowdown soon.
I don't really know what could push someone over the edge at this point. What would it be in your case? Because for me, the levels of negligence and/or eschatology at AI companies right now justify the most radical course of action, whether or not AGI proper comes to fruition. It's attempted murder at the very least.
It's a risk-reward question. Engaging in terrorism right now, when an AI has yet to cause any real damage to any human being, will turn the public overwhelmingly in favor of OAI and cause everyone's p(doom) to go up, not down.
I am not engaging in terrorism. All I'm saying is that if some good Samaritan were willing to engage in it, I would be more than happy to cheer them on, emotionally and politically. The more people express support for such actions, the likelier it is for them to have a productive effect. And honestly, I think the common man has a generally negative view of AI already.
Yudkowsky suggested that military force by a state would be an appropriate response if a country signed a treaty to limit AI research and then violated the treaty. That's a long way from a "jihad", and every prominent person who talks about existential risk would agree that solutions involving criminal violence will not work and would be incredibly counterproductive.
I don't think such an extrapolation is unreasonable. People, especially prominent public personalities, are compelled to stay well within the Overton window. They don't tend to suggest anything that may not be palatable to the broader public, and the public is terrified of the mere mention of violence, despite benefiting from some of its forms.
Sluggish civil processes are ill-suited to deal with exponential capability expansion. Nothing short of a disaster jolts people into action on global issues. Until then, we are happy to drift along. Decades of climate activism may have half-worked for an equally slow-moving natural process, but AGI development is not moving at the pace of a melting glacier.
Edit: Just to clarify my comment: his suggestion always seemed sane, since it was in the context of an international agreement. Getting to such an agreement is the goal of PauseAI/ControlAI.
I would first suggest engaging with organizations like PauseAI, or, if you want to go more radical, StopAI, or the more professional ControlAI. I know PauseAI is still super understaffed as far as capable volunteers go.
No one has even tried the non-terrorist way of stopping things yet, and the terrorist way seems incredibly inefficient. Of course, anyone who thought otherwise wouldn't be commenting here, or would do so only with a lot of security.
Why do you find formal engagement more efficient than more confrontational methods?
I think the arguments against ASI development are quite convincing, even to laypeople or politicians. The main problem is a lack of awareness. The people working in the AI labs are just specifically selected by the market not to believe them (for whatever psychological, incentive-related, or other reasons).
I am worried terrorism could lead to the movement being branded as toxic/crazy. It might also motivate people in politics/the companies to actually stop the movement.
Also killing people is not good. Illegally destroying property is not good.
Actions on the level of StopAI might be effective in taking up some space at the edge of the Overton window, similar to PETA. I am unsure what the most effective methods here are.
However, I know for sure that everyone is lacking money and qualified people with time.
I recommend watching this talk: https://www.youtube.com/watch?v=pa0EpXiPm_E
If you want to take a super simple action: https://controlai.com/take-action/choose
If you want to do more, there is a lot of stuff to do. Join PauseAI (https://pauseai.info/join) or PauseAI US (https://www.pauseai-us.org/).
Check out https://www.aisafety.com/map
I do not believe that actually bombing the datacenters would be ethical, effective, or wise.
I also remind everyone this is very much NOT what Yudkowsky said.
I am referring primarily to the perception, or the idea, of the statement. The efficacy concerns, I do accept, are valid, but I don't think I'd yield to the ethical concerns. If you do indeed believe both that AGI is a threat of infinite magnitude and that current development practices are deeply flawed (both of which I can only assume you hold, from your writing), how can there be any overriding ethical concerns? I wonder if there is a line in your mind that needs to be crossed for the evaluation to shift, or if it is a matter of positing the existence of a line but never committing to drawing it.
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/google-and-openai-get-2025-imo-gold?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
The GDM result did not surprise me; they were already on track to achieve this and had recent success with AlphaZero on some problems that are beyond IMO level. The OAI result surprised me. Possibly that is because I did not pay enough attention to https://arxiv.org/abs/2502.06807 (a paper on using o3 plus RL to get an IOI gold, by a team that includes Alex Wei). But the claim that the model "knew" its answer to P6 was wrong (without using any external tools) suggests they may be doing something different here.
Kimina-Prover-72b has an understandable chain of thought; the prose style of the CoT feels much like that of DeepSeek R1, which is more chatty and informal than you would write in a paper.
I've heard that DeepSeek Prover has a CoT that mixes languages. I haven't used it much because it's too big to run conveniently, while Kimina-Prover-72b is a more manageable size. (Feel free to raise an eyebrow at 72B models now being the small models you can run easily.) But anyway, this suggests that optimizing for math leads to language becoming less coherent.
> Ernest Davis (6th in US Math Olympiad once, just short of the IMO): Second: The proofs produced by DM-IMO and by every single earlier LLM, whether correct or incorrect, are written in a smooth, elegant style. They could be cut and pasted into a journal article or into a textbook with little or no editing. The worst you can say of them is that they are sometimes verbose.
I personally felt that even though the OpenAI solutions were in an informal style, they were easier to read and understand.
The most concrete example I have of this is DeepMind's solution for Q4. In the "Step 1" part of the proof, the model is supposed to prove that a_n is always odd, and it does a bunch of casework: it analyzes when tau(N) >= 5, when tau(N) = 4 and N is of the form p^3, and when N is of the form pq. The p^3 case gives an argument that makes the reader implicitly check that (p^2)(p-1) - p - 1 is increasing for p >= 3. And it's all unnecessary: the proof given for the tau(N) >= 5 case holds just as well when tau(N) = 4, so basically a whole paragraph of the proof could be avoided.
This is not technically incorrect (depending on how persnickety you are about the increasing argument; I guess the IMO grader thought it was fine), but it seemed odd to me when reading it, and I think it's a bit more of a hit to the overall quality of the solution than just being "verbose".
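As a quick sanity check of that implicit monotonicity step, here is a minimal numeric sketch (my own, not from either lab), treating the quoted expression as a function of p:

```python
# Sanity check: the quoted step needs f(p) = p^2 * (p - 1) - p - 1
# to be increasing for p >= 3.
def f(p: int) -> int:
    return p * p * (p - 1) - p - 1

values = [f(p) for p in range(3, 12)]
assert all(a < b for a, b in zip(values, values[1:]))
print(values)  # [14, 43, 94, 173, 286, 439, 638, 889, 1198]
```

Analytically it is also immediate: f'(p) = 3p^2 - 2p - 1 = (3p + 1)(p - 1) > 0 for p > 1, which is presumably why the whole case feels like wasted work.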
Toby Ord recently said something peculiar on the 80,000 Hours podcast.
https://80000hours.org/podcast/episodes/toby-ord-inference-scaling-ai-governance/
He points out that SOTA models seem to need to increase inference costs by a million times to halve the error rate.
If OpenAI was spending $1,000-10,000 or so per ARC problem, to answer questions ten-year-olds can solve, and potentially had to spend $1,000,000 to score gold in the IMO, that's not really the trend you want to see if you're an AGI bull.
It's true having a bot that scores gold in IMO is useful, but it's not so useful if it costs $1 million.
And if the real hot shit takes a substantially smarter model and another million-fold increase in compute costs, it might be inaccessibly expensive for a long time. Which is a trend you very much want to see if you're an AI notkilleveryoneist.
Except that low error rate is not required for loss of control of our future.
That is indeed bizarre as stated. Halving the error rate can mean a lot of things, and for most meanings it seems to very obviously not require 6 OOMs.
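To make the arithmetic concrete: if one halving of error really costs a factor of 10^6 in inference spend, the implied scaling curve is extraordinarily shallow. A back-of-the-envelope sketch (my own reading of the claim, not Ord's exact model):

```python
import math

# Reading of the claim: error halves per 10^6x increase in inference
# cost, i.e. error ~ cost^(-alpha) for a very small exponent alpha.
alpha = math.log(2) / math.log(1e6)
print(f"alpha = {alpha:.4f}")  # ~0.0502

# Under that law, cutting error from 20% to 5% (two halvings)
# multiplies cost by (10^6)^2 = 10^12.
halvings = math.log2(0.20 / 0.05)                   # 2.0
print(f"cost multiplier = {1e6 ** halvings:.0e}")   # 1e+12
```

On that reading, almost any meaningful reliability gain prices itself out, which is why the claim sounds bizarre and probably hinges on what exactly "error rate" and "halve" refer to.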
Yes, the whole point of Deep Think, Coscientist, and AlphaEvolve is merely to generate data for the next generation of Gemini Pro. The inference costs for these teacher models mostly don't matter (except as an upfront R&D cost), because the next-gen Pro version will get this to the masses without multiple OOMs of cost increase.
"What matters is that Google and OpenAI have LLMs with gold medal performances, each scoring exactly the threshold of 35/42 by solving the first five of the six problems."
Agreed!
While there are some differences in what we've been told about the two groups' approaches, and there is _always_ a question about how any technique generalizes till one actually sees it generalize,
GDM's
"We achieved this year’s result using an advanced version of Gemini Deep Think – an enhanced reasoning mode for complex problems that incorporates some of our latest research techniques, including parallel thinking. This setup enables the model to simultaneously explore and combine multiple possible solutions before giving a final answer, rather than pursuing a single, linear chain of thought.
To make the most of the reasoning capabilities of Deep Think, we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."
doesn't sound dramatically less general (except for the "tips on how to approach IMO problems") than OpenAI's
"Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.
Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling."
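Neither announcement includes implementation details, so the following is only a caricature of what "parallel thinking" plus selection might look like in its simplest best-of-n form; the `sample` and `score` callables are hypothetical stand-ins for model calls, not anything either lab has published:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_think(problem, sample, score, n=8):
    """Best-of-n caricature of 'parallel thinking': draw n independent
    candidate solutions, then keep the one a scorer/verifier rates
    highest. sample(problem) -> str and score(problem, sol) -> float
    are hypothetical stand-ins, not GDM's or OpenAI's API."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: sample(problem), range(n)))
    return max(candidates, key=lambda c: score(problem, c))
```

The quoted descriptions (combining candidates, RL over multi-step proof data) clearly go beyond this; the point is just that exploring several chains in parallel and selecting among them is a different test-time compute profile from one long serial chain.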
I think both teams' successes bode well for improved, deeper, more reliable reasoning, and help pull the timeline for ai2027 back, undoing the delay that METR's experiment suggested.
Perhaps I will get to have a nice, quiet conversation with a real "life" HAL9000...
Bodes well == we probably won’t live past 2035 in your model?
Many Thanks! Maybe call it 50:50? I don't think that it is certain that AGI (or a possible ASI) will wipe humans out. And we do get to set the initial conditions; perhaps Hassabis et al. will find a way to set us up as <evidenceFromFiction>pets of the Culture Minds</evidenceFromFiction>. Then again, maybe we'll wind up like the Neanderthals. We shall see.
see also https://thezvi.substack.com/p/ai-125-smooth-criminal/comment/136390687
Can anyone explain to me why Manifold had this at ~85% likely for the first 4 months of this year and then the odds fell dramatically over the summer?
"now a bunch of robots can do it. as someone who has a lot of their identity and their actual life built around "is good at math," it's a gut punch. it's a kind of dying."
First time?
Not sure. Maybe the market was reacting to the fact that no one had announced that their LLMs solved previous IMO competition problems (even if that wouldn't have satisfied the market resolution criteria, it would have provided evidence for performance on this year's test). In hindsight, that no one bothered to evaluate their models on previous years' problems is unsurprising considering how compute-heavy the correct approach turned out to be (though even that doesn't seem right, because, as someone else in the comments linked, there was apparently a paper released by OpenAI claiming gold-medal-level performance on IOI 2024 by their o3 model).
Incredible, you actually made this seriously boring topic entertaining. There were some nuggets in this that were creative and surprising.