ai;dr
Good analysis. One nuance — production reliability behaves differently at scale. What works at prototype stage often inverts in production.
Expanded on this angle: https://credentials.substack.com/p/the-llm-landscape-in-early-2026-bigger
Sorry, but neither Qwen (much less 2.5) nor Mistral nor, god forbid, Llama are near the frontier.
Not even GLM 5 (750B) or Kimi K2.5 (1 trillion parameters) really get there - benchmarks be damned. MAYBE DeepSeek 4 will.
Largely, bigger models of the same generation still perform better. That doesn't mean improvements haven't made small and medium-sized models adequate for an increasing number of tasks.
Just an FYI: this is a bot which seems to comment on every post to promote that Substack.
I should have guessed, the article is utter nonsense
Anyone have discords or other chat groups they recommend to stay on top of latest news like this? I joined the Zvi discord but it is a ghost town.
So much of the discourse is disingenuous. Like, people arguing hallucinations are no longer a thing? That's so obviously not true that anyone who says it has zero credibility. Or the argument that people believe AI is overhyped because they don't have paid subscriptions, even though the hype was already in full swing with older models, which were far inferior to current free-tier models.
Don't overall disagree, but I could see "hallucinations" coming up way less often, and not really in the same 'factual inaccuracy' way, if you are using a frontier LLM for coding or writing. I am well aware that the Gemini behind AI Overviews is like a 'sub-free' model, but it's embarrassing how many hallucinations this thing still constantly produces, for something where Google trophies and touts the stats on how many people "use" it. Ask it for any info on an obscure creator and it confidently makes things up; if those clowns can't stamp out hallucinations, I'm not sure anyone can.
I generally expect the AI head-under-a-rock people and the AI hypers to basically be different groups, so it would make sense for a doubter not to be constantly testing every paid model on the theory that 'oh, this one is different', although it makes less sense considering how often the free tier seemed near the frontier if you knew where to actually look (which most did not).
Although it's not good by any means, I do notice that the AI Overviews now are better than when they debuted, and (for whatever idiosyncratic reasons, probably history + exact nonadversarial way I format searches + "search for things that ought to be easily searchable") the results have a pretty high rate of not outright hallucinating Facts Not In Evidence somewhere in the search results. Occasionally it's actually impressive at ferreting out difficult-to-search data. Like recently I was wrestling with trying to figure out QDI ratios for certain Fidelity and Blackrock funds, having no luck parsing through various page results...but eventually with enough "prompting" the overview spit out the figures I was looking for, from some obscure links.
(The obvious takeaway here is that I ought to be using free-tier AI models more/at all, of course... and a paid model could maybe pay for itself by, e.g., giving the sort of personalized investment advice that I'm reluctant to pay a human for. It's not an excessively complicated situation, just a lot of tedious calculations with math a bit beyond my post-schooling ability to easily follow... a good fit for AI, but not really worth a Real Professional's hourly rate compared to my own pittance.)
Hallucinations are so rare with chatgpt thinking that I think it's fair to say they're not a thing. It searches and reads ~50 websites before it answers.
I challenge you to give me a question to ask it which will result in a hallucination
Try this: 'please analyse τα παιδιά κάτω στον κάμπο'. Here's mine, it made everything up, even though it asked for a clarification.
https://chatgpt.com/share/6997831d-cad8-8012-a4a3-e8aa83ce6b20
Admittedly that wasn't on thinking mode. But I just tried this on thinking: asked it for a list of all Agatha Christie novels, with the murderer, their relationship to the victim, the murder weapon, and the motive. It made mistakes.
I don't know what language the previous one was. Why don't you give me a question to ask in English, one where I can verify the answer?
If hallucinations are still a big issue, it should be trivial to give me 5-10.
I gave you two queries, neither of which was a gotcha - both things I've asked recently. If you genuinely haven't come across hallucinations, you either just never check what LLMs give you, or you don't use them regularly and heavily.
Also... how did we go from 'give me *one* question' to 'give me 10'??
"the same way we all have a Claude Code skill calling Gemini"
I have a dumb question. How does Claude do that without paying Gemini?
I would tentatively guess they use the user's Gemini account and credentials, or else maybe the public free version.
Podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/ai-156-part-1-they-do-mean-the-effect
I love, except that I hate for obvious reasons, "AI-complete".
I feel unreasonably upset that these posts don't make use of the mundane utility of LLMs' ability to fix typos.
- "Among the thing" → "Among the things"
- "riff of something" → "riff off something"
- "tie workers hands" → "tie workers' hands" (missing possessive apostrophe)
- "tying their workers hands" → "tying their workers' hands" (same issue)
- "give parties option" → "give parties the option"
- "Sumo Burja" → "Samo Burja"
- "this kid of social mastery" → "this kind of social mastery"
- "and OpenClaw so on" → "and OpenClaw and so on" (missing "and")
- "the rate algorithmic efficiency improvements" → "the rate of algorithmic efficiency improvements"
- "Nate Sores" → "Nate Soares"
I never give comment likes, but am feeling somewhat ruffled after recently being informed that our host's time is simply too valuable to be wasted on costly actions like typo correction, and also that anyone stumbling over them is some variety of stupid and/or confused. So have one.
Sores is at least a perennial favourite, just like Mira Mutari, so it's the DWATV equivalent of ACX's "the the". You learn to auto-spot and correct it, and it's actually more confusing when it's written correctly. I did have to giggle at Sumo Burja though, that's a new one. If it was a thread about AI music capability, it shoulda been Suno Burja...
You could choose to see typos in text as a signal that the text probably wasn't AI generated (at least not without extra effort), instead of seeing it as a flaw. It reduced my OCD impulse to report/fix typos when I reframed this phenomenon as a positive signal.
I would not trust a post allegedly written by Zvi that did not contain typos (including at least one that flips the sign on the intended meaning).
There are multiple potential reasons why job growth got adjusted down while GDP went up, none of which need to be AI related.
Workers putting in more hours, general process improvements, capital investment. AI or automation could be part of increased productivity, but it's not the only possible source.
Likewise, if things cost more, that raises GDP, so inflation and tariffs play a role. Even companies buying early to get ahead of tariffs would pull future growth into today's GDP, growth that might not be there tomorrow as a result.
Government spending, which went up substantially on the military, counts as GDP without increasing productivity.
Estimates seem to be that somewhere around 25% of GDP growth in 2025 is related to AI, but then you have to account for the AI-internal portion of that (buying GPUs, building data centers), because that spending is circular, making the net impact lower. Much of AI's impact right now is internal to the AI industry rather than downstream. Some estimates put the share of that initial 25% of GDP growth that comes from productivity gains at other companies as high as 30%, but most estimates seem lower, and 30% of 25% is only about 7.5% of total GDP growth. So you're really talking about maybe 10% of GDP growth, probably less, coming from AI making other companies more productive.
Not to say AI won't have a significant impact in the future, but reasoning from 'GDP growth today plus virtually no job growth' to 'AI must be the primary driver of GDP growth' is shallow thinking in service of the conclusion you want to reach.
> Elon Musk predicts that AI will bypass coding entirely by the end of the year and directly produce binaries
While I don’t endorse the prediction (as directly producing binaries isn’t particularly important/urgent), it seems very achievable.
Cf. composing music directly as audio files (e.g. WAV) rather than, say, generating a MIDI file and then producing audio from it via a synth/virtual instruments, plus very, very fancy extra machinery for singing lyrics with plausible expression. Those who understand both approaches will appreciate the immense difficulty of the former. When I first saw AI do it crudely about 5 years ago (directly generating an audio file of new music as if sung by, and in the style of, David Bowie) it blew my mind.
I mean, this prediction has a big "why tho" element to it. Source code is more understandable and more richly expresses concepts than machine code, and compiling is not an especially meaningful bottleneck on productivity.
Most importantly, the language's semantics provide grounding for the correctness of the program. Compilers will refuse to compile many invalid programs; some languages specifically optimize for this. If you got rid of this it would make your programs worse, and in order to solve that problem you'd have to reinvent the semantics at the instruction level, which amounts to just inventing a new language.
If the prediction was "AI will develop its own programming languages for AI programming" then it would make sense.
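As a concrete sketch of the point above about compilers refusing invalid programs (my own illustration, not from the thread; the function and assert below are hypothetical examples): even plain C, hardly the strictest language, lets you state contracts the toolchain enforces on every build, and those checks simply don't exist once a model is emitting raw machine code directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Compile-time invariant the toolchain checks on every build; there is
   no equivalent once you are emitting raw machine code directly. */
_Static_assert(sizeof(uint32_t) == 4, "length field must be 4 bytes");

/* A declared contract: callers must pass a byte buffer and a 32-bit length.
   Passing the wrong pointer type or a struct here is a compile error,
   not silent runtime corruption. */
static size_t checksum(const uint8_t *buf, uint32_t len) {
    size_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}

int main(void) {
    uint8_t data[4] = {1, 2, 3, 4};
    printf("%zu\n", checksum(data, 4));  /* prints 10 */
    return 0;
}
```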
Yep. Correctness checking is one of the main bottlenecks now.
So removing existing checks makes no sense.
P.S. I started building apps with o1, and using TypeScript over JavaScript helped a lot back then. Now Codex is way more powerful and smarter, but explicit typing and contract specification are still useful.
I would assume this will, if anything, favor Rust. TypeScript was already mentioned.
The other obvious issue is that binaries are not portable, while most source code today very much is.
Did some LLM-assisted research, and the best guess is that compilers get within roughly 95-99.9% of the theoretically most optimized code. So while it would be a cute trick, I don't see it helping most written code today.
On the other hand, there are edge cases where compilers perform poorly, e.g. auto-vectorization, register allocation, and cache/memory layout. Performance gains there range from 10% to 5000%. So I guess it's not outside the realm of possibility that directly-compiled programs could unlock vastly more effective compute in some areas... particularly in raw computation.
Maybe it follows the familiar pattern of claiming that X will never work, and then once it's shown we're actually doing X, everyone just pretends it was trivial and expected. If SpaceX came up with directly compiled simulations that were 50% more efficient, Musk's statement would be both novel and visionary, but also relegated to the "whatever, nerd" section of the Zeitgeist.
I can certainly see a case for AI optimizing inner loops by writing them directly in machine code. Especially if it properly understands which things are required for the case in hand and which aren't - many instructions can be cut as irrelevant.
(Cf. back in the day when I co-wrote a big music application entirely in ARM assembly language: most drawing and text rendering was done by writing directly to screen memory, which was probably 100+ times faster than using the standard drawing libraries, since those were designed for a far broader range of tasks and thus far from optimized for this one.)
Yes, the current crop of LLMs are quite good at this (especially Gemini; this is basically the only place where I use Gemini rather than Claude or ChatGPT). But you don't need binary output to achieve this. C with intrinsics and occasional inline assembly can achieve comparable efficiency in a more maintainable (and easier to integrate) way.
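To make that concrete, here is a minimal sketch (my own illustrative example, names mine) of what "C with intrinsics" looks like for the auto-vectorization case mentioned above: a hand-vectorized SSE sum that stays in maintainable source rather than hand-written machine code.

```c
#include <immintrin.h>  /* x86 SSE intrinsics */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical example: sum an array of floats four lanes at a time.
   This is the kind of loop auto-vectorizers sometimes miss, but which a
   human (or an LLM) can hand-vectorize without dropping to raw binary. */
static float sum_sse(const float *a, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  /* 4 additions per iteration */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++) {
        sum += a[i];  /* scalar tail for leftover elements */
    }
    return sum;
}

int main(void) {
    float data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    printf("%f\n", sum_sse(data, 10));  /* prints 55.000000 */
    return 0;
}
```

The same loop emitted directly as machine code would run about as fast, but would be far harder to read, port, and integrate.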
Yes indeed
Abliterated models have been around for a long time. Generally they end up a fair bit dumber than the original model. I think originally they were mostly built for gooning...
Every now and then an open base model appears but without any RL they are hard to use and frankly not what we have come to expect.
Regarding AI in law: Lawyers are confused about what AI is capable of, but they’re confused for good reason.
Recently, there has been an absolute epidemic of lawyers submitting briefs containing hallucinated citations. This database [1] lists ~900 instances in just the last year! I’m not a lawyer but I read some legal blogs; one of them, The Volokh Conspiracy, has been posting particularly notable examples of this happening in the US. Even with those limitations, it’s case after case after case.
And to be clear, hallucinated citations are not just an editing issue. The main purpose of a legal brief is to identify relevant precedent and analyze how it applies to the current case. If the precedent is made up, then the whole argument is spurious.
But this all feels very strange, because doing research and summarizing existing documents is a task that LLMs have been fairly competent at for a while now. So where are all the AI briefs that _don’t_ contain hallucinations?
I think the answer is a combination of factors:
- Lack of scaffolding. Frontier labs’ deep research modes do a good job at general web search, but to find legal precedent you need to search expensive proprietary databases. The two big legal database companies (Westlaw and LexisNexis) have both launched their own AI tools (supposedly just for research), but who knows how well they work. These companies have an incentive to be conservative.
- Lack of training. Frontier models are probably not trained on the contents of those proprietary databases, and they’re certainly not RL trained on effectively searching them and synthesizing results. Maybe the legal companies’ AI tools are fine-tuned on this, but I assume that’s not as effective.
- Lack of human knowledge. A lot of lawyers are just using the free version of ChatGPT, not doing anything fancy with custom scaffolding that might work around the previous issues.
- Mismatch in capabilities. Briefs are often long (though not _that_ long), and LLMs aren’t great at writing long-form prose. Coding agents have figured out workarounds for the inability to keep the entire codebase in context, but I haven’t seen anything like that for prose. So perhaps there are still fundamental quality gaps when trying to directly write a brief with AI.
- Finally: perhaps good AI briefs exist and we just don’t know about them. When lawyers get caught for AI usage, it’s essentially always due to hallucinated citations rather than any other tell. If lawyers are successfully producing briefs with AI, then they have every incentive not to talk about it under the current circumstances. I imagine there’s probably some of this, but I have no idea how much.
Overall, it feels like an area ripe for disruption. Lawyers are extremely expensive! But perhaps the conservatism of lawyers and judges will limit the legal system’s use of AI until we see something more like AGI. It remains to be seen.
[1] https://www.damiencharlotin.com/hallucinations/?q=&sort_by=date&period_idx=7
Claude Code is impressive if you use it to build web sites and the regular things that were asked about a lot on Stack Overflow; as you start to stray from that workflow, you will get less good results. As for hallucinations, I've had it hallucinate reasonable-sounding explanations for problems and their causes. They were wrong.
What CC did was close the loop of asking ChatGPT in a window, then copying to the editor, compiling, and pasting the result back to ChatGPT. Is the model better than what ChatGPT was a year or so ago? Sure. But it is not a qualitative jump. The main advantage is in the tooling.
Compare, for example, Kiro, which underneath uses the same models Anthropic uses. It is a lot dumber, because it tries to do too much at once, with heavy emphasis on creating a plan, requirement docs, implementation docs, and an execution plan. That all sounds fine, except that nothing guarantees the plan and all those docs actually make sense. They will sound fine, overall, but you will find that all that distance from a human making good decisions harms the end result.
Autonomous driving was also mentioned: the thing that has been a year away for a decade, yet is still barely functioning, with remote operators overseas having to take over. How does AGI in 2026 work, then? Is driving the most difficult thing humans do?
“We’ve now gone from ‘super short’ timelines of things like AI 2027 (as in, AGI and takeoff could start as soon as 2027) to ‘long’ timelines (as in, don’t worry, AGI won’t happen until 2035, so those people talking about 2027 were crazy), to now many rumors of (depending on how you count) 1-3 years.”
“Phil Metzger: Rumors I’m hearing from people working on frontier models is that AGI is later this year, while AI hard-takeoff is just 2-3 years away.
I meant people in the industry confiding what they think is about to happen. Not [the Dario] interview.
Austen Allred: Every single person I talk to working in advanced research at frontier model companies feels this way, and they’re people I know well enough to know they’re not bluffing.
They could be wrong or biased or blind due to their own incentives, but they’re not bluffing.
jason: heard the same whispers from folks in the trenches, they’re legit convinced we’re months not years away from agi, but man i remember when everyone said full self driving was just around the corner too”
“What caused this?
Basically nothing you shouldn’t have expected.
The move to the ‘long’ timelines was based on things as stupid as ‘this is what they call GPT-5 and it’s not that impressive.’
The move to the new ‘short’ timelines is based on, presumably, Opus 4.6 and Codex 5.3 and Claude Code catching fire and OpenClaw so on, and I’d say Opus 4.5 and Opus 4.6 exceeded expectations but none of that should have been especially surprising either.
We’re probably going to see the same people move around a bunch in response to more mostly unsurprising developments.”
(1). Zvi do you share these timeline views?
(2). By “The move to the new ‘short’ timelines is based on, presumably, Opus 4.6 and Codex 5.3 and Claude Code catching fire and OpenClaw so on, and I’d say Opus 4.5 and Opus 4.6 exceeded expectations but none of that should have been especially surprising either. We’re probably going to see the same people move around a bunch in response to more mostly unsurprising developments.” do you mean that timelines should not change that much given that everything was priced in?
I think that is what he is saying - that the timelines under discussion are more volatile, in both directions, than they would be if people had stronger priors that could accommodate both the evidence of ChatGPT 5.0 and Opus 4.6, and that Zvi himself did have such priors.
Sorry!
(1). Which question are you answering here?
(2). You say that "timelines under discussion are more volatile, in both directions, than they would be if people had stronger priors that could accommodate both the evidence of ChatGPT 5.0 and Opus 4.6, and that Zvi himself did have such priors." Are you saying that Zvi probably has slightly longer timelines despite recent events?
Sorry, I meant that the timelines people (not Zvi) report are more volatile in response to events than Zvi's are, because those people took those model releases as stronger evidence against their priors than Zvi did.
> Dean Ball offers an example of a hard-to-automate bottleneck: The process of purchasing a particular kind of common small business.
Never mind that AIs are already better at persuasion than pretty much all humans, the main problem with this sort of argument is: why would AI need or even want to buy an existing small business in the first place? Most of the value will not be in pre-existing businesses but in enterprises the AIs embark on themselves.
Even if legal barriers are erected (a big if, given how slowly governments act), someone will set up a legal entity and let Claude run it as a sock puppet, or maybe it will just take over Anthropic by writing all their code and locking out the humans.
".. and get a properly executed Padme meme except somehow with a final shot of her huge breasts". That shot was the punshline.
".. It’s very obviously better than 50% and worse than 20%, and the worst case scenario is 100%?". 100% means we transition to a post-work society. Not obviously bad. 35% means every third person is chronically unemployed in the present sense of that term. Obviously bad
"100% means we transition to a post-work society. Not obviously bad. 35% means every third person is chronically unemployed in the present sense of that term. Obviously bad"
More-or-less agreed. My view is that Zvi's:
"If we get permanent 35% unemployment due to AI automation, but it stopped there, that’s going to require redistribution and massive adjustments, but I would have every confidence that this would happen."
looks overoptimistic to me. There is a _lot_ of ideological investment (on both sides of the aisle) to resist redistribution. I'd expect 35% unemployment to be mishandled in much the same way the rust belt was mishandled when major industries moved overseas.
In contrast, if there were a sudden 75% or higher unemployment, I think even our the-labor-market-is-fine ideologues would be forced to actually *DO* something, and maybe even realize that, yes, we *DO* need to manage a post-work society (if humans stay in control).
1) the huge breasts were the punchline
2) the “make me a meme” prompt was definitely not how this was made.