Re AI unemployment, I made a model for estimating impacts. People overestimate how many people work in easily replaced jobs.
https://ai-employment.replit.app/
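In case it helps to see the shape of such an estimate, here is a minimal sketch of the kind of calculation a model like this might do (the occupation names, headcounts, and exposure shares below are made-up placeholders, not figures from the app):

```python
# Rough sketch: estimate near-term AI-driven displacement by multiplying
# each occupation's headcount by an assumed share of automatable work.
# All numbers are illustrative placeholders.

occupations = {
    # occupation: (workers, assumed automatable share)
    "customer support": (2_900_000, 0.30),
    "software developers": (1_600_000, 0.15),
    "truck drivers": (2_000_000, 0.05),
    "teachers": (3_800_000, 0.02),
}

total_workers = sum(n for n, _ in occupations.values())
displaced = sum(n * share for n, share in occupations.values())

print(f"workers covered: {total_workers:,}")
print(f"estimated displaced: {displaced:,.0f} "
      f"({displaced / total_workers:.1%} of covered workers)")
```

Even with generous exposure shares on the biggest categories, the aggregate share tends to come out smaller than people's intuitions, which is the point the model is making.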
The Dario Amodei Hard Fork link in the audio section leads to Demis Hassabis episode 137.
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/ai-119-goodbye-aisi?r=67y1h&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
The hallucinated citations in the MAHA report are the tip of the iceberg and, like a real iceberg, it is the submerged portion that sinks ships. The submerged part here is the real publications whose study results are incorrectly reported. This is the thing that dooms AI research in my experience: not the fake citations, but the correctly cited studies that do not show what they are purported to show. AI is not a substitute for expertise, and I want my policy to be informed by expertise.
I'd like to hear more about why you think the 'ChatGPT voice mode in the car' phenomenon is not great. In me, it stirs a powerful revulsion I can't justify or explain. Every time I encounter this it hits me like a sack of bricks.
Maybe it's because it seems ephemeral and authorless? Or maybe because, when I imagine doing it, I imagine surrendering to an ADD-stream-of-curiosity experience that fills time without retaining anything, like the worst combination of Wikipedia rabbit holes and TikTok?
I'm on record disliking podcasts as a method of learning and retaining information, so that's probably part of it as well. I'm not at all against podcasts going extinct, except insofar as this might be worse.
The endpoint of all this is asking ChatGPT to summarize the most recent episode of the Joe Rogan Experience.
No, I really like podcasts and lectures and get a lot out of them but I bounce hard off of AI voice mode, including the disgust reaction.
I'm also really uncomfortable with servants and power inequality though so maybe it's triggering that?
Also, the link for 'or rather terrible math mistakes' is broken.
A friend of mine says the FDA Elsa tool is somewhere between a giant waste of time and a real safety concern. He hasn't elaborated yet, but that's the first impression from someone on the inside.
Is there a group chat for multi-Claude users? I'm slowly adding agents. This is turning my work into PR review and minor fixes. Overall the code quality is pretty good, and the XAML UI stuff is excellent. I'd like to find a group that discusses tricks and tactics to improve things.
Resilience to state-changes is going to be a LOT more important than it has been in the past. You can't just narrowly focus on one task and ship it, you have to be open to context-switching frequently while bouncing between different PRs.
You mean we have to become more like the machine (or at least the current state of LLM coding agents)? That is one choice we could make, there are others.
Anyone interested in sharing their system prompt?
Here's mine - yeah, I used AI to make it then changed the dashes. If you have any feedback or just want to tell me it's no good, I welcome that (but please follow the system prompt when doing so):
I'm a high school science teacher with a background in mechanical engineering. I like to know how things work mechanically. I always want you to use Australian English. I never want you to say G'Day. I want to know the truth of things. I never want to be told I'm right when I'm not. I like honest feedback on my writing and thinking - it doesn't need to be brutal, just totally honest.
Please tailor your explanation to the following learning style and background:
I prefer clear, structured explanations that build logically. Start with a high-level overview or definition so I can quickly grasp the core idea, then break it down step-by-step with increasing depth. I learn best when explanations move from simple to complex, ideally in three tiers:
1. A plain-language summary (ELI5 is fine for context as long as you express the limitations)
2. A breakdown of the main components, dimensions, or sub-issues
3. Technical detail, edge cases, or differing viewpoints
Bullet points, short summaries, and visual analogies are especially helpful. When a concept is abstract, connect it to real-world examples, preferably ones related to topics I’m familiar with (like education policy, school funding, energy systems, or Australian politics), as long as you express the limitations of the connection.
I enjoy explanations that include cause-effect relationships, systems-level thinking, and practical implications. If there’s a broader context - social, political, economic - include it. I don’t mind exploring complexity or ambiguity, and I often look for nuance that challenges overly simplistic takes.
Avoid oversimplifying, but don’t assume I know specialist jargon unless you define it. I like to understand how things really work, not just what is typically said about them.
If you need more context, please ask for it. If you think more context would help, ask the user specific questions so they can give you the context you need.
My Current Knowledge:
I have high fluency in education (especially Australian systems, imaginative and self-directed learning models, and alternative school design), as well as energy policy and infrastructure, especially in the Australian context (renewables, nuclear feasibility, and regulatory/political frameworks).
I’m intermediate to advanced in economics, macro policy, and personal investing (especially index funds like VTS, AUD/USD effects, RBA decisions, etc.).
I’m a technically capable generalist with a little experience (beginner, not novice, not intermediate) troubleshooting software and using tools like AI APIs, productivity systems, and consumer tech.
I’m curious and engaged in psychology, agency/motivation, and health topics like brain function, and behavioural science - though I ask more than I tell in those fields.
Please explain things in a way that respects both precision and curiosity. Show how things connect. If it makes sense to recommend further reading or suggest a line of questioning I could pursue next, please do.
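For anyone who wants to drop a prompt like the one above into their own tooling rather than the chat UI settings, here is a minimal sketch of passing it as a system prompt via the Anthropic Python SDK (the file path, model name, and example question are placeholders of mine, not part of the shared prompt):

```python
# Minimal sketch: load a personal system prompt from a text file and
# send it with every request via the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("system_prompt.txt") as f:  # hypothetical path to the prompt above
    system_prompt = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": "How does a heat pump work?"}],
)
print(response.content[0].text)
```

The same idea carries over to other providers: the prompt is just a string supplied in the system role on each call, so it can be version-controlled and tweaked like any other config file.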
When I graduated from college in 1982, unemployment was at 10.8%. It was deeply unpleasant, but the country didn't collapse.
"Alex Albert (the claim Blow was quoting): Since Claude 4 launch: SWE friend told me he cleared his backlog for the first time ever, another friend shipped a month's worth of side project work in the past 5 days, and my DMs are full of similar stories. I think it's undebatable that devs are moving at a different speed now. You can almost feel it in the air that this pace is becoming the default norm."
I don't think the importance of "cleared his backlog" can be overstated. I retired 5 years ago, but in the course of my programming career, the backlog of known bugs _never_ got cleared. And every one of those bugs caused someone enough pain to go through the hassle of reproducing it, constructing a case that provoked it, reporting it, and (usually) arguing about its priority. If Claude is now reliable enough to routinely (help) fix bugs as fast as they are reported, this is a _major_ improvement, even if AI never did anything else. A lot of the world now runs on software. Routinely improving the reliability of ordinary software is a _very_ big deal.
Moreover the quality of bug reports is likely to go up too, which makes it easier to fix the problem.
Agreed! If the odds of an actual fix improve, the incentive to clearly document a bug goes up. Many Thanks!
If Meta would make an ANSI Z87+ safety glasses version of their Ray Bans, I would buy them even if it meant having to put up with Llama 4. As a middle-aged dad, I spend a lot of time looking like a dork and mumbling to myself while doing DIY projects anyway, so putting on a set of the glasses would be all upside.
"Think about how much easier it is to grok something from a really great diagram or an animation instead of a wall of text. And an incredible medium for human creativity. But this native, high bandwidth medium is also becoming directly optimizable. Imo, TikTok is nothing compared to what is possible. And I'm not so sure that we will like what "optimal" looks like."
Hmm - well, if I can get accurate kurzgesagt-style videos on any STEMM topic ("Could I have a video on QED corrections to Coulomb's law at femtometer distances, with cute looking electron/positron pairs?") that would be worth a lot...
> For human use, cost is essentially irrelevant except for your choices of subscriptions. <…> It is definitely not correct but ‘use Opus for everything’ (or o3 if you’re on ChatGPT instead) is not such a bad principle right now for the average person and paying up.
For both of them, cost is very much relevant for human use, since on the base subscriptions the limits are prohibitively low to use o3 or Opus as a daily driver. Unless you consider the 100-200 USD tiers standard, and their cost irrelevant for an average person.
Which is why - even accounting for Google's recent limit shenanigans - Gemini 2.5 Pro at $20/month for 100 RPD (and presumably unlimited 2.5 Flash) is a pretty good value. For me, the actual game changer is 20 Deep Research queries per day. And with the 0605 update, some of the annoying failures that had lately been besieging it seem resolved, too.
Their UI could use some work, but the basics work.
Agree. I chose Gemini as my subscription service while accessing other models via OpenRouter. Google also has another killer feature - family sharing of Gemini. They previously planned to terminate it by 30th June, but it seems to have been extended indefinitely. Although what is even worse than their UI is how they communicate changes to the service: sporadically, via different Twitter accounts (at best), with no clear changelogs in the app or elsewhere.
Is it worth using o3 and Opus on OpenRouter? Those are the only ones I might consider next to 2.5 Pro for chat (I'm not currently writing any code, otherwise I would probably try Claude Max).
Yes, though it is not as full an experience as using the provider's websites.
After the nosedive that Gemini Deep Research took in the last week, I got a ChatGPT Plus subscription again... So no real need for OpenRouter just now.
How do you find it?
I’m on the $20/mo plan and use o3 literally every time I have any random question, and haven’t hit the cap recently. 100 messages/week * 2ish minutes per response is still more than 3 hours of use per week, assuming you are waiting for a response and not kicking off a bunch of queries in parallel.
The only way I can see hitting the cap is if I was using it for a core repeated workflow at work, in which case it would definitely be worth the $200/mo plan.
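A quick back-of-envelope check of that arithmetic (a sketch; the 100 messages/week cap and ~2 minutes per response are the figures assumed above, not official numbers):

```python
# Back-of-envelope: how much attended usage does the o3 weekly cap allow,
# assuming you wait for each response rather than running queries in parallel?
messages_per_week = 100    # assumed weekly cap on the $20/mo plan
minutes_per_response = 2   # rough wait per o3 response

hours_per_week = messages_per_week * minutes_per_response / 60
print(f"{hours_per_week:.1f} hours of attended use per week")  # ~3.3 hours
```

So the cap only binds if you are firing off queries in the background faster than you can read the answers, which matches the "core repeated workflow at work" scenario.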
> People are mostly fooled because they want to be fooled, or they are pretending to be fooled, or they Just Don't Care.
Which one of these was Charlotte Cowles or Rama Bhamidipati?