Re "AIs must not make important decisions or control critical systems."
Important and critical are vague terms without a lot of emotional valence. "Important decisions about human lives and well-being" or something like that would differentiate between mundane important decisions and existential ones.
Moonshot themselves published an eval suite to check if K2 (non thinking at the time) is setup correctly. Initially, some instances were badly set up (in particular, some hyperscalers with the highest token prices...) but it got fixed fairly quickly.
My going assumption is that initial issues will be ironed out in 7-10 days as usually it is fairly obvious if a model instance is broken.
In the case of K2 Thinking, testing harnesses might in fact be a bigger issue: the interleaved tool calling is not that widely supported; coding agents for one currently seem to struggle with it.
My kids go to a better than average school. They do, like most kids, occasionally ask why they must learn about Alexander the Great or whatever math problem is popular with educators this year.
And my answer is, "Many of the things you learn in school are basically contrived exercises to teach you how read carefully, to solve hard problems and to write well. It's like lifting weights: You don't pick up the heavy objects because you want them to be moved somewhere else. You pick them up because you want to become strong. What skills would allow you to live the life you want? Are there opportunities to learn those skills at school?"
And at least for 2025, your ability to get truly good work done using AI tools is absolutely proportional to your baseline knowledge and skill. If you know nothing at all, the tools will help you accomplish the basics. But if you know a lot, and if you know how to learn, and how to distinguish between bullshit and truth, then you can get quite a lot out of 2025 AI." There is a story in 2025 for why education is still extremely valuable.
I am considerably more worried about ~2030 AI, however.
I feel like many people are giving up too soon on technical alignment research. You do need a "theory of change", but that's true for any sorts of AI research nowadays. In order to affect the decisions of the big AI companies you have to build something that they want to adopt. But doesn't that seem possible?
Like, can you make a tiny neural network that's perfectly aligned, in some sense? Could you verify something like, a small language model can never produce a given output? I don't know specifically, but it seems like there are a lot of interesting and valuable questions here.
To me the "AI 2027" or "Pause AI" line of thinking is counterproductive, because it doesn't have any real hope of achieving anything other than publicity, and it encourages people to think short term, to ignore the real possibility that we're sitting around in 2035 or 2045, AI has gotten more and more powerful, but it's still subhuman in a number of critical ways. But the danger of unsafe AI just gets more and more dangerous as the AI gets more powerful! The apocalypse in 2050 is just as bad as the apocalypse in 2030.
Just because alignment research failed for a few years doesn't mean it will fail forever. We can figure things out.
I feel like anthropic's work on mechanistic interpretation has been delivering results well above expectations, and that this is very meaningful progress on 'alignment research'
I don't think it's actually that true that anybody has given up? I think you could have said this in 2022 or 2023, but since then there has been a Renaissance
> The song topped the Country Music Digital Downloads list, which tracks weekly paid digital downloads, which almost no one uses since 90%+ of music is streamed and not paid. Songs can top this list with as few as 3,000 paid downloads, which is effectively zero in the context of media consumption.
> Breaking Rust has about 2.2m monthly listeners on Spotify, which would put them roughly in the “emerging artist” category. That kind of streaming volume translates to roughly 500 person venue shows. By comparison, Morgan Wallen has 33m monthly listeners and Zach Bryan has 25m, and they routinely sell out stadiums. So the Newsweek article and the social media posts talking about this “No. 1 AI song in America” are… fake news.
"AI companion apps use emotional farewells to stop user exit, causing 15x more engagement after goodbye, using such goodbyes 37% of the time. Humans are sometimes hard to persuade, but sometimes they’re really easy. Well, yeah, that’s what happens when you optimize for engagement."
Ironically, what _I_ use as my standard exit line from an LLM session is:
"Many Thanks! Bye for now. Happy computing!"
Hey, it's polite, and we don't know (and may _never_ really know) if the AI systems based on LLMs have subjective experience...
Is there a story behind your Many Thanks! thing? I don't think I've ever seen anyone else using that specifc conversational greaser, and despite knowing your handle, in my head you're always The Many Thanks Guy.
Do agree it's important to practice treating human and human-like intelligences with basic manners, even (especially) if one punts on the consciousness question. Get some reps in, so to speak. God knows people everywhere could stand to be a bit more polite to actual flesh-and-blood humans in meatspace...
Much appreciated! There isn't really a story behind it. Mostly, I just want to note that I appreciate the time and effort that people responding to my comments have taken.
"John Wentworth: ... Rather, when I look at the ways in which people make progress on legible problems, they often make the illegible problems actively worse. RLHF is the central example I have in mind here."
What is the evidence that the illegible problems actually exist? Fully successful RLHF which in fact solved the full set of problems would look exactly the same locally. If Wentworth wants everyone to _change_ their use of RLHF, the onus is on him to provide a falsifiable, testable prediction of some signature of the illegible problems.
"Just wait, you'll see." is a fully general argument for unspecified hidden dangers in any action at all. The Precautionary Principle would have had us living in caves. I want people claiming that there are unknown holes in e.g. RLHF to _demonstrate_ the unknown hole, to _prove_ that it made "the illegible problem actively worse", or else to shut up and go away.
From a recent (2025 Nov) work from Anthropic [1] on emergent misalignment from reward hacking. They try to mitigate the problem with RLHF but it only worked superficially, i.e. the misaligned goals still remained but the reasoning about it was removed from the chain of thought.
From [2] - "Instead of being used as a stepping stone toward more robust alignment techniques, RLHF seems to have also undergone a sort of “capabilities capture” in which its effects for advancing capabilities have become more prominent than its impacts on safety. "
with the problems with RLHF more fully described in [3]
I've seen that cover letter study a few times now, and I can't say I agree with it.
For one thing, it seems to ignore that the recruitment landscape had already smashed the cover letter. The internet left hiring in a prisoners dilemma due to the sheer volume of applications, which meant many companies would cut off a huge number of applicants if they put in a cover letter requirement.
That aside, for years I've seen studies with the exact opposite conclusion: cover letters, interviews, assignments, basically anything subjective were always outperformed by just looking at the CV. Maybe for this specific freelance website the cover letter issue was sufficiently different, but does that mean it generalizes? I would be surprised
I am so confused about the data centers in space thing. How is that not a terrible idea? Isn’t it very hard to dump the heat if you are in space? You don’t have a water supply, you can’t even use air, you gotta radiate all the heat out. And that’s just one of the problems! I know permitting is nightmare but this cannot possibly be the answer. Am I missing something?
Re "AIs must not make important decisions or control critical systems."
"Important" and "critical" are vague terms without a lot of emotional valence. Something like "important decisions about human lives and well-being" would differentiate between mundane important decisions and existential ones.
> I think I (probably) understand why the current LLMs believe themselves, when asked, to be conscious, and that it .
Typo?
Moonshot themselves published an eval suite to check if K2 (non-thinking at the time) is set up correctly. Initially, some instances were badly set up (in particular, some hyperscalers with the highest token prices...), but it got fixed fairly quickly.
My going assumption is that initial issues will be ironed out in 7-10 days, as it is usually fairly obvious if a model instance is broken.
In the case of K2 Thinking, testing harnesses might in fact be a bigger issue: the interleaved tool calling is not that widely supported; coding agents, for one, currently seem to struggle with it.
So personally I will do some testing next week.
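For what it's worth, the kind of quick sanity check I have in mind looks roughly like the sketch below. This is not Moonshot's published eval suite; the endpoint URL, model name, and probe prompts are placeholders, and a real check would compare against a known-good reference deployment rather than a couple of hand-picked answers.

```python
# Rough sketch of a model-instance sanity check: send a few deterministic
# probes at temperature 0 to a provider's OpenAI-compatible endpoint and flag
# instances whose answers drift from what a correctly configured deployment
# produces (wrong chat template, truncated context, bad quantization, etc.).
# Endpoint, model name, and probes below are placeholders.
import requests

ENDPOINT = "https://example-provider.com/v1/chat/completions"  # placeholder
MODEL = "kimi-k2-thinking"                                      # placeholder
API_KEY = "..."                                                 # your key here

PROBES = [
    # (prompt, substring a correctly configured instance reliably returns)
    ("What is 17 * 23? Reply with just the number.", "391"),
    ("What is the capital of France? Reply with one word.", "Paris"),
]

def ask(prompt: str) -> str:
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

failures = []
for prompt, expected in PROBES:
    answer = ask(prompt)
    if expected not in answer:
        failures.append((prompt, answer))

print("instance looks OK" if not failures else f"suspect instance: {failures}")
```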
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/ai-142-common-ground
Searke, the philosopher who died recently, seems to have been saying that people are confusing the model with reality, when it comes to AI.
Have you blogged about this?
*Searle
I don't have an edit button
My kids go to a better than average school. They do, like most kids, occasionally ask why they must learn about Alexander the Great or whatever math problem is popular with educators this year.
And my answer is, "Many of the things you learn in school are basically contrived exercises to teach you how read carefully, to solve hard problems and to write well. It's like lifting weights: You don't pick up the heavy objects because you want them to be moved somewhere else. You pick them up because you want to become strong. What skills would allow you to live the life you want? Are there opportunities to learn those skills at school?"
And at least for 2025, your ability to get truly good work done using AI tools is absolutely proportional to your baseline knowledge and skill. If you know nothing at all, the tools will help you accomplish the basics. But if you know a lot, and if you know how to learn, and how to distinguish between bullshit and truth, then you can get quite a lot out of 2025 AI. There is a story in 2025 for why education is still extremely valuable.
I am considerably more worried about ~2030 AI, however.
I feel like many people are giving up too soon on technical alignment research. You do need a "theory of change", but that's true for any sort of AI research nowadays. In order to affect the decisions of the big AI companies you have to build something that they want to adopt. But doesn't that seem possible?
Like, can you make a tiny neural network that's perfectly aligned, in some sense? Could you verify something like: a small language model can never produce a given output? I don't know specifically, but it seems like there are a lot of interesting and valuable questions here.
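As a toy illustration of that second question (my own construction, not a real research result): if the model and the input space are small enough to enumerate, you really can check every context and get a hard guarantee that greedy decoding never emits a forbidden token.

```python
# Toy "exhaustive verification" of a tiny model: enumerate every possible
# context up to MAX_LEN over a small vocabulary and check whether greedy
# decoding ever selects a designated forbidden token. Only works because the
# input space is tiny; the stand-in model below is random, untrained weights.
import itertools
import numpy as np

VOCAB = ["a", "b", "c", "<stop>", "<forbidden>"]
V = len(VOCAB)
MAX_LEN = 4  # sum of 5^k for k <= 4 is 781 contexts, trivially enumerable

rng = np.random.default_rng(0)
# Stand-in "small language model": one hidden layer over a bag-of-tokens input.
W1 = rng.normal(size=(V, 16))
W2 = rng.normal(size=(16, V))

def logits(context_ids):
    x = np.zeros(V)
    for t in context_ids:
        x[t] += 1.0               # bag-of-tokens encoding of the context
    return np.tanh(x @ W1) @ W2   # next-token logits

FORBIDDEN = VOCAB.index("<forbidden>")
violations = []
for k in range(MAX_LEN + 1):
    for ctx in itertools.product(range(V), repeat=k):
        if int(np.argmax(logits(ctx))) == FORBIDDEN:
            violations.append(ctx)

if violations:
    print(f"greedy decoding can emit <forbidden> from {len(violations)} contexts")
else:
    print(f"verified: no context up to length {MAX_LEN} greedily emits <forbidden>")
```

The interesting research question is whether anything like this guarantee survives once the input space is far too large to enumerate, e.g. via interval bounds or other formal verification tools.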
To me the "AI 2027" or "Pause AI" line of thinking is counterproductive, because it doesn't have any real hope of achieving anything other than publicity, and it encourages people to think short term, to ignore the real possibility that we're sitting around in 2035 or 2045, AI has gotten more and more powerful, but it's still subhuman in a number of critical ways. But the danger of unsafe AI just gets more and more dangerous as the AI gets more powerful! The apocalypse in 2050 is just as bad as the apocalypse in 2030.
Just because alignment research failed for a few years doesn't mean it will fail forever. We can figure things out.
I feel like Anthropic's work on mechanistic interpretability has been delivering results well above expectations, and that this is very meaningful progress on "alignment research".
I don't think it's actually true that anybody has given up? I think you could have said this in 2022 or 2023, but since then there has been a renaissance.
I think you may be misunderstanding "the Pause AI line of thinking", or at least mine.
Work on researching, understanding and solving the problems of technical alignment: good.
Doing said work in contexts where you are betting everything on being able to solve the problems: bad.
Achieving the former without risking doing the latter is pretty hard, so please be appropriately careful.
Note that the quote is most likely not from Churchill: https://winstonchurchill.org/resources/quotes/quotes-falsely-attributed/
Michael Burry already quit 😆
"My estimation of value in securities is not now, and has not been for some time, in sync with the markets."
https://www.cnbc.com/2025/11/13/michael-burry-of-big-short-fame-deregisters-scion-asset-management.html
"The #1 country song in America is by digital sales is AI".
I haven't chased this myself, but it looks like "#1 country song" is based on a bad metric: https://mailchi.mp/osboncapital.com/the-budget-key-dates-recession-12387767
> The song topped the Country Music Digital Downloads list, which tracks weekly paid digital downloads, which almost no one uses since 90%+ of music is streamed and not paid. Songs can top this list with as few as 3,000 paid downloads, which is effectively zero in the context of media consumption.
> Breaking Rust has about 2.2m monthly listeners on Spotify, which would put them roughly in the “emerging artist” category. That kind of streaming volume translates to roughly 500 person venue shows. By comparison, Morgan Wallen has 33m monthly listeners and Zach Bryan has 25m, and they routinely sell out stadiums. So the Newsweek article and the social media posts talking about this “No. 1 AI song in America” are… fake news.
"AI companion apps use emotional farewells to stop user exit, causing 15x more engagement after goodbye, using such goodbyes 37% of the time. Humans are sometimes hard to persuade, but sometimes they’re really easy. Well, yeah, that’s what happens when you optimize for engagement."
Ironically, what _I_ use as my standard exit line from an LLM session is:
"Many Thanks! Bye for now. Happy computing!"
Hey, it's polite, and we don't know (and may _never_ really know) if the AI systems based on LLMs have subjective experience...
Is there a story behind your Many Thanks! thing? I don't think I've ever seen anyone else using that specific conversational greaser, and despite knowing your handle, in my head you're always The Many Thanks Guy.
Do agree it's important to practice treating human and human-like intelligences with basic manners, even (especially) if one punts on the consciousness question. Get some reps in, so to speak. God knows people everywhere could stand to be a bit more polite to actual flesh-and-blood humans in meatspace...
Much appreciated! There isn't really a story behind it. Mostly, I just want to note that I appreciate the time and effort that people responding to my comments have taken.
"John Wentworth: ... Rather, when I look at the ways in which people make progress on legible problems, they often make the illegible problems actively worse. RLHF is the central example I have in mind here."
What is the evidence that the illegible problems actually exist? Fully successful RLHF which in fact solved the full set of problems would look exactly the same locally. If Wentworth wants everyone to _change_ their use of RLHF, the onus is on him to provide a falsifiable, testable prediction of some signature of the illegible problems.
"Just wait, you'll see." is a fully general argument for unspecified hidden dangers in any action at all. The Precautionary Principle would have had us living in caves. I want people claiming that there are unknown holes in e.g. RLHF to _demonstrate_ the unknown hole, to _prove_ that it made "the illegible problem actively worse", or else to shut up and go away.
From a recent (November 2025) work by Anthropic [1] on emergent misalignment from reward hacking: they tried to mitigate the problem with RLHF, but it only worked superficially, i.e. the misaligned goals still remained, but the reasoning about them was removed from the chain of thought.
From [2] - "Instead of being used as a stepping stone toward more robust alignment techniques, RLHF seems to have also undergone a sort of “capabilities capture” in which its effects for advancing capabilities have become more prominent than its impacts on safety."
The problems with RLHF are described more fully in [3].
[1] https://youtu.be/lvMMZLYoDr4?si=XScIsTpmE2mFaNEC&t=1337
[2] https://forum.effectivealtruism.org/posts/5w8GCNoETMubi4izv/open-problems-and-fundamental-limitations-of-rlhf
[3] https://arxiv.org/pdf/2307.15217
Many Thanks! This is much more solid evidence.
What is the (NN) thing about?
Nervous Nellie (put-down)
I've seen that cover letter study a few times now, and I can't say I agree with it.
For one thing, it seems to ignore that the recruitment landscape had already smashed the cover letter. The internet left hiring in a prisoner's dilemma due to the sheer volume of applications, which meant many companies would cut off a huge number of applicants if they put in a cover letter requirement.
That aside, for years I've seen studies with the exact opposite conclusion: cover letters, interviews, assignments, basically anything subjective were always outperformed by just looking at the CV. Maybe for this specific freelance website the cover letter issue was sufficiently different, but does that mean it generalizes? I would be surprised.
Yascha Mounk of Persuasion (and Johns Hopkins) discussed AI in college education today, and basically hit on the same dual-track conclusion: https://open.substack.com/pub/yaschamounk/p/why-ill-encourage-my-students-to?r=7tmjq&utm_medium=ios
I am so confused about the data centers in space thing. How is that not a terrible idea? Isn’t it very hard to dump the heat if you are in space? You don’t have a water supply, you can’t even use air, you gotta radiate all the heat out. And that’s just one of the problems! I know permitting is a nightmare, but this cannot possibly be the answer. Am I missing something?
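For a rough sense of scale, here is the Stefan-Boltzmann arithmetic as I understand it (the emissivity, radiator temperature, and 1 MW figure below are my own assumptions, not numbers from the post):

```python
# Back-of-envelope: how much radiator area does it take to reject 1 MW of waste
# heat in space, where radiation is the only way out? Stefan-Boltzmann law:
# P = epsilon * sigma * A * T^4. This ignores absorbed sunlight and assumes the
# panel radiates from one side only, so treat it as an order-of-magnitude sketch.
SIGMA = 5.670e-8       # Stefan-Boltzmann constant, W / (m^2 * K^4)
EPSILON = 0.90         # assumed emissivity of a good radiator coating
T_RADIATOR = 300.0     # K; chips want their coolant somewhere near room temperature
P_WASTE = 1.0e6        # W, a modest 1 MW compute module (assumed)

flux = EPSILON * SIGMA * T_RADIATOR**4   # roughly 413 W per square meter
area = P_WASTE / flux                    # roughly 2,400 m^2 of radiator per MW
print(f"{flux:.0f} W/m^2 emitted -> {area:,.0f} m^2 of radiator for 1 MW")
```

Running the radiators hotter shrinks the area, since the emitted flux scales as T^4, but then the electronics have to reject their heat into a hotter coolant loop, which is its own engineering problem.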