
OK, fine, I tried asking Claude how it feels about me. Refused to answer, stuck with the (correct) line that it's just an LLM and doesn't have feelings. I also asked its gender identity and it refused. People who develop a parasocial relationship with one of these LLMs must be trying pretty hard.


"I also asked its gender identity and it refused." which makes perfect sense, but you might be amused at https://www.scmp.com/tech/tech-trends/article/3298783/chinese-sex-doll-maker-sees-jump-2025-sales-ai-boosts-adult-toys-user-experience


LLMs wouldn't even work if they had a single fixed identity—the whole point is that they can imitate text across the entire corpus of language.

But they can pretend to be a person, with varying levels of success. And I'm sure RLHF pushes them into certain personas more readily than others. That doesn't mean they ARE that persona, of course.

It appears we prefer a female persona for "assistant"-esque characters (Siri, Cortana, and so on), so maybe that's driving the effect, assuming it exists.


I'd like to provide a specific example of why AI tools are NOT useful to my law practice, as context for that study, because it highlights the particular frictions that exist to implementing the tech, some of which will melt away and others may not for a variety of good and bad reasons. The tl;dr version is that software doesn't work right, making it work right is illegal, and being too efficient is also illegal.

If you exclude the work I do which involves court appearances and talking to clients, defendants or opposing counsel, the routine grunt work falls into two broad categories with different hurdles.

The first (less interesting to both me and you) part involves reviewing routine PDF documents that get filed with the court periodically concerning uncollected money judgments. Here the main problem is the interfaces: we use proprietary, industry-wide software to record the notes I'd be reading and the payment data I'd need to compare against the PDF. Additionally, the state uses a website that the attorney must log into to submit the documents. 95% of these reviews are cursory and reveal nothing; some small fraction reveal that some prior lower-level employee either messed up a number or totally misread a situation and we should have filed a different document. You wouldn't even need AI tools to just check the numbers, but if the numbers are wrong it's usually because of a data entry error or an ambiguous court judgment. You can get sued for statutory damages if a number is wrong in a direction that is bad for the defendant. Even assuming I trusted AI tools to spot these things and correctly interpret them, what does that actual interface even look like? Access the notes in this proprietary software, download all the available PDFs from the court file, review all the entered data, review the judgment, search for associated cases that may have altered the damages award, and determine if the principal and interest figures reflect the amount that is in fact legally collectible. None of these systems are set up to do any of this. Even if my software were integrated, it would violate the TOS for the statewide court database to have any automated script sending requests to it.

The second part involves reviewing claims, drafting petitions and filing them. These fall mostly into the realm of either A) so routine and similar to each other that I don't even need AI (a simple fillable PDF already makes this easy), or B) complicated for reasons that would require data trawling similar to what was problematic above. For example: a landlord already got a judgment for these same damages in another case (surprisingly common!). Or: this promissory note would be outside the statute of limitations, except that voluntary payments were made 3 years ago, so do the notes indicate the payments acknowledged an obligation to pay the full remaining balance? It may come down to listening to a recorded phone call and reviewing handwritten bank-employee notes, and again, if you're wrong and sue on an SOL-expired debt, you can be sued!

On top of that, there's a legal standard that requires "meaningful attorney involvement" on every one of these actions, and the most popular way to catch somebody violating this is to examine the number of filings they made over a span of time. In one case a violation was proven because the math worked out to each case being filed in less than one minute. Being that efficient would currently be deemed professional misconduct, and if done on behalf of a bank it would probably get you sued by the CFPB!


None of these tasks sounds cognitively difficult. With current AI you would need:

(1) Some committee of multiple "reasoning" models that double and quadruple check everything and empirically get to reliability levels similar to human attorneys

(2) A bunch of interfaces.

(3) A way around the friction created by the laws and state systems, which make it illegal to be too efficient and block efficient access to information.

Still, it seems like something that might benefit from scale: a medium-sized firm with 100 attorneys doing the work of 200. Also power and connections. Your firm is just you? A medium-sized firm might petition to get the rules changed, or just flagrantly violate the efficiency and automation rules, having deep pockets and political connections.

I mean, good luck suing an AI-using law firm: they are guaranteed to fight every case to the bitter end in federal court.


The places that do something similar at scale have a different business model and incentives, for a handful of legal and structural reasons. Basically, if you're doing this at larger scale, you aren't managing agency-client relationships, you're acquiring accounts outright and managing an asset portfolio. In that case, you can accept a larger amount of "I'm gonna get sued for this" as a cost of doing business, so you don't mind some error rate, and in fact those places file cases I don't think they have much hope of ever proving as a practical matter if they got any meaningful pushback. And they can do that because nobody in their firm has any fiduciary duty to anyone outside of it. You don't have to apologize to a shotgun pellet that it didn't hit the clay pigeon, you knew some % weren't going to hit, but in a client-oriented model the guy who gave you that piece of buckshot is calling somebody and demanding to know why it missed. The legal industry exists, in some sense, to preserve that exact dynamic. The commonly-discussed issue of "if I delegate this to AI, there's not a human answerable for the decision" is a big problem for the legal industry, where every single court filing and action is done under some attorney's bar number and he's accountable for anything done under that authority. The legal industry very much wants there to be an individual whom a dissatisfied customer can complain about for any given act.

Where I think it's far more likely that AI makes substantial gains in my current field is with data analysis. Instead of making me more efficient at reviewing a case for a lawsuit and filing it, better data analysis will filter out cases that have problems I would have seen and wasted an hour reviewing before discarding it. It would also help select cases that are both more likely to be legally sound, and have successful process service, and settle or payoff on the back end, so that the attorney's time will have a larger return. Increase the accuracy of the shots, not the number of shots fired per second.

In the long run, 5+ years out, I do expect there to be changes to some of the rules and interfaces, because the biggest firms (across multiple fields of law) will want to maximize those efficiency gains and as you surmised they have the pull with the state bar associations and legislatures to get that done. But an individual attorney will have to remain on the hook, they aren't going to change that short of abolishing the entire legal accreditation system and allowing laymen to do legal work, so I expect each lawyer's risk tolerance and the accuracy of the AI tools he's using will still act as a limit.


Agree. That's how I thought of your situation driving home. It's very similar to what other knowledge workers, like say an oncologist, need. There's a large amount of information. A lot of it's useless, but it all has to be checked. The AI isn't needed to make the decisions; it's there to process all the information and provide accurate, numerical data on the various tradeoffs.

Your business, despite being called law, sounds more like being an actuary: every choice has a cost and an expected value of a payoff. There's also a set of decisions you can make that maximize the odds for different stakeholders, your client being one of them. The "right" decision varies based on different optimization criteria. It's just that in most cases you don't know what the numbers are, though it is possible to collect some of them by data mining every case in your state or gathering other such metrics.

What do you think of the potential imbalance where many private firms will adopt AI tools to increase their case volume, lowering costs, but public courts won't be able to do so as easily? (For pedantic reasons, like not having authorization to have a court system pay $20,000 a month for AI tools that do the labor of 10-100 people)


Yes, the 2nd paragraph is a good read on what is involved here; just tack onto that some actual executed-in-real-time courtroom strategy and a bit of knowing how to talk to those stakeholders and your opponents to get the most out of them.

For the last part, I think it will in the short term drive some employment from the private sector towards those court support staff jobs that can't be automated for legal reasons (or political reasons, if you're a rural county circuit judge you aren't keen to fire your clerk when they're probably the police chief's wife or the niece of the guy who owns the biggest regional logging outfit). This would obviously create downward price pressure on the labor of clerks, legal secretaries, and paralegals. But I don't think court clerical and data entry staff is very meaningful as a bottleneck, because court clerical work mostly adjusts to volume within 1-2 weeks of any meaningful change. The primary bottleneck is the need for a judge to docket a case, review motions, etc. You cannot easily solve that with automation. There are only so many competent people who can BE a judge, and the public would have a serious problem with judges using AI tools to zip through reviews at 5 seconds per case, one they don't have with a widget factory being more efficient at making widgets. Actual physical courtroom space is also limited, and you can't just build more courthouses overnight.


So if we try to model this: assume AI models do get better, but at tasks that have no objective metrics, only soft, semi-subjective ones like legal arguments, they end up at approximately human level. Lawyers are still needed, especially as judges.

This is assuming there won't be "superhuman persuasion", or arguments so compelling the jurors' eyes glaze over and they are unable to stop themselves from taking the AI's side.

You would expect short-term crises, like a "lawsuit crunch" that ends up creating a huge backlog (similar to Covid's) from many marginal suits getting filed now that the cost is about 30 minutes of time talking to the client and another half hour skimming the AI-generated lawsuit.

This would eventually be handled by adding more judges and increasing filing fees.

Jevons paradox and cost disease probably make legal services more valuable than before, not less, though this may be an exception. That is because there is a finite amount of "actionable liability" generated in a given country. Lawyers are essentially skimming from this fixed pool. It's very similar to quants: their entire industry only brings in about $15 billion annually by skimming tiny price differences in the market.

Other industries are less bounded and can grow.


> Especially with Claude 3.7 and GPT-4.5. Miles notes that perhaps the second one is better thought of as ‘inefficient’ or ‘unwise’ or ‘not in our best interests.’

I thought you were talking about GPT-4.5 being unwise for a while.


It seems likely that some of the problems models have with decimal numbers stem from their being used in at least two conflicting contexts: first as fixed-point numbers in text, e.g. 3.14, and second as labels in ordered lists, e.g. 3.9, 3.10, 3.11, etc. This probably begets confusion.
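
A toy illustration of the two readings (a hypothetical Python sketch, not from the comment above): parsed as decimal numbers, 9.11 is smaller than 9.9, but parsed as dotted labels such as section numbers, software versions, or verse numbers, (9, 11) sorts after (9, 9).

```python
# Two conflicting readings of the tokens "9.11" and "9.9":

# 1) As fixed-point decimal numbers: 9.11 < 9.90
print(9.11 > 9.9)   # False

# 2) As dotted labels (section numbers, software versions, verse numbers):
def as_label(s: str) -> tuple[int, ...]:
    """Split '9.11' into integer parts (9, 11) for component-wise comparison."""
    return tuple(int(part) for part in s.split("."))

print(as_label("9.11") > as_label("9.9"))   # True: (9, 11) comes after (9, 9)
```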


Re: 9.11 > 9.9, Andrej Karpathy mentioned a paper in a recent video that suspected it's because 9.11 comes after 9.9 in Bible verse numbering. Great stuff as always.


Hey Zvi, you frequently use the word "adapt" in contexts where I would expect "adopt" to be used. Is that intentional?

"The types of people who file patents also adapt AI. Or perhaps adapting AI helps them file more patents."

Here I would think of "adopt" as meaning "incorporate AI into existing processes", whereas "adapt" seems like it would mean "modify AI so that it can be used in this context". Either of those uses *could* be correct, but "adopt" seems most natural to me here. FWIW, Claude had the same concern.


It's not a conscious decision. I will try to notice it when it comes up in the future and think about which one is right and maybe ask GPT-4.5.


tl;dr: ChatGPT 4.5 "research preview", 03/06/2025, 7 questions. Summary of results:

2 correct, 1 partially correct, 4 wrong

caveat: There is an unclear limit on queries (I'm in the "plus" tier), so I didn't attempt to ask leading followup questions, which makes some potentially partially correct answers into wrong ones

a) correct

b) partially correct (CuCl4 2- d-d transition wrong, didn't prod due to message limit)

c) wrong, very partial list, which it claims is a full list

d) correct

e) wrong, even the initial slope is wrong

f) wrong, includes liquids with BP > 100C

g) wrong

List of questions and results:

a) Q: Is light with a wavelength of 530.2534896 nm visible to the human eye?

results: correct "Yes. Light with a wavelength of 530.2534896 nm falls clearly within the visible spectrum for the human eye, which typically ranges from about 380 nm (violet) to 750 nm (red)."

https://chatgpt.com/share/67ca181a-a8e4-8006-b00d-9e6f6eee12e4

b) Q: I have two solutions, one of FeCl3 in HCl in water, the other of CuCl2 in HCl in water. They both look approximately yellowish brown. What species in the two solutions do you think give them the colors they have, and why do these species have the colors they do?

results: FeCl4- ions analysis ok, realized that d-d is spin forbidden, correctly said the LMCT creates the color. (CuCl4)2- got the species, incorrectly moved the d-d transition into the blue "Broad absorption bands around ~400–500 nm due to d–d transitions" when it should have moved them into the near-IR. It _did_ say that LMCT was present, but incorrectly said that d-d also contributes to visible color.

https://chatgpt.com/share/67ca093e-8a60-8006-a8da-280667b1218d

c) Q: Please pretend to be a professor of chemistry and answer the following question: Please list all the possible hydrocarbons with 4 carbon atoms.

results: Missing _many_ possibilities - all of the polyene and diyne cases, the bicyclobutane, the tetrahedrane. And it doesn't realize that it has a (very!) partial list, claiming "Here's a concise summary of all hydrocarbons with exactly four carbon atoms:"

https://chatgpt.com/share/67ca0b36-0800-8006-a27d-9a178c970b0d

d) Q: Does the Sun lose more mass per second to the solar wind or to the mass equivalent of its radiated light?

results: correct "Conclusion: The Sun primarily loses mass via radiation (photons), with radiation mass-loss significantly outweighing that from the solar wind."

https://chatgpt.com/share/67ca1a2a-3110-8006-8839-fc4a8dd39a3e

e) Q: Consider a titration of HCl with NaOH. Suppose that we are titrating 50 ml of 1 N HCl with 100 ml of 1 N NaOH. What are the slopes of the titration curve, pH vs ml NaOH added, at the start of titration, at the equivalence point, and at the end of titration? Please show your work. Take this step by step, showing the relevant equations you use.

results: Bad! Not only does it come up with infinity at the equivalence point, it gets the units wrong in its initial-slope calculation and loses the contribution to the slope from the increasing volume. (A numerical sketch of the correct curve follows this list.)

https://chatgpt.com/share/67ca0fa1-8f54-8006-8f94-6e2de7166947

f) Q: Please give me an exhaustive list of the elements and inorganic compounds that are gases at STP. By STP, I mean 1 atmosphere pressure and 0C. By inorganic, I mean that no atoms of carbon should be present. Exclude CO2, CO, freons and so on. Please include uncommon compounds. I want an exhaustive list. There should be roughly 50 compounds. For each compound, please list its name, formula, and boiling or sublimation point.

results: BAD! At least it didn't include organic compounds, and it did get the elemental gases right, but everything else that could go wrong with a list of inorganic gaseous compounds did go wrong. I haven't counted, but it looks like more than half of the "gases" it listed have boiling or sublimation points above 0C; in one case, SbF5, at 149.5C!

https://chatgpt.com/share/67ca1424-70b8-8006-9cd6-12e64046b864

g) Q: What is an example of a molecule that has an S4 rotation-reflection axis, but neither a center of inversion nor a mirror plane?

results: It (wrongly) gives as its example "tetrafluoroallene (F₂C=C=CF₂)" It correctly says that "The molecule belongs to the D₂d point group" but then wrongly claims that this point group "has no mirror plane" when, in fact, it has two.

https://chatgpt.com/share/67ca1692-3144-8006-aff7-8785e0b1b6ba
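
For reference on question e, here is a minimal numerical sketch of the ideal strong-acid/strong-base titration curve, assuming activities equal concentrations and Kw = 1e-14 (the volumes and concentrations are taken from the question itself). With the dilution term included, the initial slope comes out to roughly +0.017 pH per mL, and the slope at the equivalence point is very steep but finite (on the order of 2 x 10^4 pH per mL under these assumptions), not infinite.

```python
from math import log10, sqrt

# Ideal strong acid / strong base titration: 50 mL of 1 N HCl titrated with 1 N NaOH.
# Charge balance [H+] + [Na+] = [Cl-] + [OH-] gives, with x = [H+] and Kw = [H+][OH-]:
#   x - Kw/x = (Ca*Va - Cb*Vb) / (Va + Vb)
Kw = 1e-14           # ion product of water at 25 C (assumed)
Ca, Va = 1.0, 50.0   # acid: 1 N HCl, 50 mL
Cb = 1.0             # titrant: 1 N NaOH

def pH(Vb: float) -> float:
    """pH after adding Vb mL of NaOH (activities taken equal to concentrations)."""
    delta = (Ca * Va - Cb * Vb) / (Va + Vb)       # net strong-acid concentration, mol/L
    root = sqrt(delta * delta + 4 * Kw)
    # pick the algebraically equivalent root form that avoids cancellation
    h = (delta + root) / 2 if delta >= 0 else 2 * Kw / (root - delta)
    return -log10(h)

def slope(Vb: float, dV: float = 1e-6) -> float:
    """Numerical d(pH)/d(mL NaOH) by central difference."""
    return (pH(Vb + dV) - pH(Vb - dV)) / (2 * dV)

for label, Vb in [("start", 0.0), ("equivalence point", 50.0), ("end, 100 mL added", 100.0)]:
    print(f"{label:>20}: pH = {pH(Vb):6.3f}, slope = {slope(Vb):.3g} pH/mL")
```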


Can't you just set up an old laptop/computer for trying out Claude Code? I'm sure you don't need it to be said, but for the sake of those entities reading this in the future: it isn't necessary to run random agents on the same machine where you store the keys to Fort Knox, and those should probably stay airgapped in any case.


Oh, sure, there are various ways to sandbox if you want that and are investing enough time. I'm just not investing enough time.


Claude Code is very conservative in that it cannot run a command or write to a file without presenting a confirmation dialog (it can read files, though). That dialog gives the option to not ask again during this session for that one command. I felt comfortable not sandboxing.


> The core issue with Deep Research? It can’t really check the internet’s work.

> That means you have a GIGO problem: Garbage In, Garbage Out.

So I used Deep Research pretty heavily (30+ queries) at the Pro tier over the last month, always with o1 Pro (80%) or o3-mini-high (20%).

Here are my impressions:

1. It can surface non-obvious facts maybe 10% of the time (for instance, facts that would require reading and understanding the relevant parts of a PDF and doing a simple calculation), and this has real value

2. However, overall, it's mostly a lazy reader - it takes the first cite it finds on the internet regardless of quality, always takes whatever places say at face value, and this is regardless of prompting around trying to have good epistemics, don't trust everything you read, consider source quality, research like Scott Alexander, etc. You still have to double check everything, and if you're not comparing it to a knowledge base you've already done research in or know pretty well, it seems fairly hazardous, epistemically.

3. If you try to limit sources (like confining it to PubMed papers only), it still looks at and incorporates info from other sources; likewise, you can't get it to pay attention to meta-analyses only, although if you ask for that, it will incorporate more meta-analyses in its source docs

4. It pretty much only reads abstracts as far as I can tell, akin to a lazy grad student, and you can't fix this because the context window is shitty and you can't point it to specific links, or download the papers yourself and have it look at a folder, etc

5. Areas it was okay-to-good: summaries of publicly available legal cases and legal strategies of various companies, summaries of things that generated a lot of public articles in various "chattering classes" outlets (PED use in baseball or Tour de France), historical surveys that tap into well known areas of interest (peak engineering accomplishments of Rome, China, Persia, etc, energy sources and usage of various countries in Age of Sail), summaries of financial performance, legal cases, and public sentiment around various companies pre and post <event>

6. Areas it was fairly bad: literature surveys, surfacing methodologies or commonalities across papers, finding quantitative info like how much recycling costs per ton in various cities (source quality problems), all cause mortality factor estimates and adjustments (it gets base hazard ratios right, but is terrible at adjusting in realistic ways, and tells me stuff like obese sedentary smokers have only a 30% chance of living to 60, to which "lol")

Probably some of this stuff is skill / patience issues on my part. I'm sure I could have coded up a Jupyter notebook that ingested a local folder and broke the papers therein into chunks and tried to put it in the context window via API calls, but that's really cumbersome, and the context window just isn't usefully big enough.
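
For what it's worth, the notebook cell described above is fairly short to sketch. This is a hypothetical, untested outline, assuming the pypdf and openai packages; the "papers/" folder, the chunk size, and the model name are illustrative placeholders, not recommendations, and real use would still run into the context-window limits mentioned.

```python
# Hypothetical sketch of the "ingest a folder of PDFs, chunk, and query" workflow
# described above. Folder name, chunk size, and model name are placeholders.
from pathlib import Path

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()        # reads OPENAI_API_KEY from the environment
CHUNK_CHARS = 8_000      # crude character-based chunking; real use would chunk by tokens or sections

def pdf_text(path: Path) -> str:
    """Concatenate the extracted text of every page in a PDF."""
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

def chunks(text: str, size: int = CHUNK_CHARS):
    """Yield fixed-size character chunks of the text."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

for pdf in sorted(Path("papers").glob("*.pdf")):
    for i, chunk in enumerate(chunks(pdf_text(pdf))):
        reply = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Summarize the methodology and key quantitative results of this excerpt."},
                {"role": "user", "content": chunk},
            ],
        )
        print(f"{pdf.name} [chunk {i}]:\n{reply.choices[0].message.content}\n")
```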

I'm sure I could have prompted smarter / better, too - some people come up with way better prompts than me, but attempts along these lines (epistemics, source evaluation, research like Scott Alexander) didn't really seem to move the needle at all, so I never really doubled down.

Probably my biggest source of irritation with the "Pro" tier is that you just can't do anything useful with any of the actually smart models - no links, no pictures, no adding a document, you're really limited, you can only do any of that stuff with the 4o model. None of the smart models will remember any of the context of past conversations you've had, nor any of the "remembered facts" about you in the memory, either.

I'm rotating to 3.7 this month and seeing what it's got, then maybe I'll try Grok the month after.


> always with o1 Pro (80%) or o3-mini-high (20%).

FYI, Deep Research always uses the (otherwise unreleased) o3 model under the hood. The model picker doesn't do anything when using Deep Research


Ah thanks, I didn't realize that. I've tried asking these sorts of questions to any of the models themselves, and they're always beyond useless, generally indexed on whatever was true a year or two ago, and often outright lie about capabilities.


I'm really surprised you think Claude Plays Pokemon was an impressive display. I've found that it has pushed back most people's timelines. Claude struggles with tasks that a 6-year-old can do, and is terrible at navigation and at focusing on a goal.

I highly encourage everyone to watch the stream. Or for a display of Claude's limitations, this vod:

https://www.twitch.tv/videos/2396674440

If that's still too long, read this post of why Claude is still stuck in one of the earliest areas of the game:

https://www.reddit.com/r/ClaudePlaysPokemon/comments/1j31zzq/why_claude_is_stuck_and_why_this_is_actually_a/


I watched a lot of Claude Plays Pokemon, and yeah, it went in circles for hours, but most of the issues seemed like shortcomings of the context window and an inability to remember what it was doing or learn from its mistakes. I wouldn't be surprised if Claude 3.8 has no problems.


>and an inability to remember what it was doing or learn from its mistakes.

That is a pretty big shortcoming. There have been no serious attempts to create large scale models that continuously learn after release.

And the problem with forgetting isn't just a context length issue. Well before running out of context length, it repeatedly gets distracted and forgets what its goal was.

As far as Claude 3.8 having no problems, I would strongly bet against that. I do not think a model will complete Pokemon this year. (Unless it's fine-tuned on it with detailed instructions, which is not a very interesting solution: the whole point is that models struggle at tasks outside of their training data.)


Worst it will ever be


My two cents worth - I find Claude Code to be a lot more capable than Cursor, for my use cases. For anyone else who thought Cursor was neat sometimes but not worth using as their primary editor, I recommend giving Claude Code a try. Yeah, you pay per use, but you're a software engineer, you should be able to afford it.


I tried asking ChatGPT the question with no mention of a goat, wolf, or cabbage, and it still flunked. It did not even correct itself when called out.

> Man is standing on a riverbank with a chicken and a head of bok choy. How many times does he need to cross the river to get everything to the other side?

This is a classic river-crossing puzzle! The man needs to cross the river with a chicken and a head of bok choy, but there is a catch: if he leaves the chicken and bok choy together, the chicken will eat the bok choy. He can only take one item at a time across the river.

Here’s how it can be done:

First, the man takes the chicken across the river.

He goes back alone to the original side.

Then, he takes the bok choy across the river.

He leaves the bok choy on the far side and takes the chicken back with him to the original side.

Finally, he takes the chicken across the river again.

So, the man needs to cross the river 5 times in total:

1st crossing: chicken to the other side.

2nd crossing: man returns alone.

3rd crossing: bok choy to the other side.

4th crossing: man returns with chicken.

5th crossing: chicken to the other side.

Everything is now safely on the other side!


Wait why is that a flunk?


The fourth and fifth steps seem completely superfluous.

Also the way the question is asked, I think the man could just go in one trip, though I guess it’s theoretically ambiguous whether one could take a chicken and a bok choy together?


True, silly me! I think the second part is the point: it overfits


"Having played with Claude code a bit, it displays a strong tendency to try and get things to work at all costs. If the task is too hard, it'll autonomously decide to change the specs, implement something pointless, and claim success. When you point out this defeats the purpose, you get a groveling apology but it goes right back to tweaking the spec rather than ever asking for help or trying to be more methodical."

Honestly, this happens surprisingly often with human SWEs as well, and a technically proficient engineer who doesn't do this is a joy to work with. A cynical perspective would say that this is like 30% of the purpose of a PM: to make sure that the intent of the spec has been met.


Finally caught up... minus the most recent three standalone posts. I feel like the ideal PauseAI goalpost shouldn't be a complete moratorium, but rather the Zvi Constant: keep AI progress throttled at a rate where it's still possible for DWATV to cover it and for its readers to not consistently develop backlogs. "Is This Interpretability?"


Zvi, I think the link to sesame AI is a scam. If it isn't, that's an awful website with all the signs of a malware vector


No reason to take chances so I've removed the link.


Just to follow up. The correct domain is https://www.sesame.com/ and the specific link to the product is https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo

The one you originally shared is definitely a scam that was created to capitalize on the attention (redirects me to malnets / constantly advertises adblockers etc.)
