41 Comments
Alexander Barry

While I think it is plausible the results would have been different if the devs had had e.g. 100 hours more experience with cursor, it is worth also noting that:

- 14/16 of the devs rated themselves as 'average' or better cursor users at the end of the study

- The METR staff working on the project thought the devs were qualitatively reasonable cursor users (based on screen recordings etc.)

So I think it is unlikely the devs were using cursor in an unusually unskilled way.

The forecasters were told that only 25% of the devs had prior cursor experience (the actual number ended up being 44%), and still predicted substantial speedup, so if there is a steep cursor learning curve here that seems like a fact people didn't expect.

With that all being said the skill ceiling for using AI tools is clearly at least *not being slowed down* (as they could simply not use the AI tools), so it would be reasonable to expect eventually some level of experience would lead to that result.

(I consulted with METR on the stats in the paper, so am quite familiar with it).

James McDermott

> Given the long steep learning curve for effectively using these new AI tools well, this division betrays what I hope is just grossly negligent ignorance about that reality, rather than intentional deception. (from Emmet Shear quoted in the post)

> If the learning curve is steep then devs should be improving rapidly over that time. (from the post)

> if there is a steep cursor learning curve here that seems like a fact people didn't expect (from your comment)

Unfortunately the metaphor "steep" means two different things. Some people (you and Zvi) mean "you improve quickly" (graph goes up) and some people (Emmet) mean "it is difficult to improve quickly" (we are cycling up a steep hill).

Alexander Barry

Good catch - I think what I actually meant (in the hypothetical) was just that there is a large amount of improvement possible, i.e. a big gap between low and high experience. I see that "steep" isn't a very clear way to convey that though!

Randomstringofcharacters

The point about time vs effort vibes with my experience of using LLMs for non coding tasks. Having a bad draft of something that you can then edit is a different kind of mental task from starting from scratch, and requires holding less in your working memory.

SOMEONE

This is true more generally, even without LLMs: a solid draft helps, but below some level of quality it's often more efficient to start from scratch.

Michael Bacarella

I'm sorry but the detractors sound like they're hitting the copium a bit

You need more than 50 hours(!) of experience with Cursor to actually be better? And in the meantime it's making you slower but you'll still think you're working faster? Come on

I encourage everyone to read the factor analysis section in the study where they anticipate and respond to concerns, including C.2.7 "below-average use of AI tools" and come to their own conclusions.

Gerald Monroe

The reason you 'think you are working faster' is that the study measured wall time, while the developers measured how much of their own personal time they actually put in. For now, a 'dual vibe' setup is how you can get actual improvements in productivity in wall time.

alpaca

This actually matches my personal experience pretty well. Using Cursor is fairly different from coding, you work more via specs and so on, it's a bit more like offloading tasks to a junior engineer. If I already know what I want to do and how to do it, I'm usually faster just doing it myself (though the tab complete still speeds me up), rather than letting the AI do its thing and then reviewing it.

Also noteworthy that they didn't use background agents. I find speed-ups from branching off some work to Claude Code in Docker (or a Cursor background agent) and then reviewing the branch changes or PR to be much more of a time-saver, not least because you can parallelize it to some degree. This requires even better specs of the feature you want them to work on, however, so it's only useful for tasks where I have more expertise.

So basically, where I'm a noob, I profit from using Cursor interactively to help me figure stuff out (except setting up projects, it's atrocious at that), and where I'm experienced I profit from background agents.

What I find a little puzzling is why the devs didn't figure out the types of tasks where the AI slowed them down and did them manually, after all this was "AI allowed" and not "AI forced". I think this indicates that lack of experience with the tools was indeed an important factor, I'd guesstimate you need at least something like 10-20 hours of playing around with different ways to use Cursor to figure out where it's useful and where it's not. Maybe it's just a story of not feeling how long you're waiting because you entertain yourself in the meantime, but it seems to me that there is a bit more to it.

Just like Zvi, I was also surprised that the devs over-estimated the task effort in the no-AI condition, that implies we're clearly dealing with a very unusual situation :D

I'd also like to see a third condition with enhanced tab-complete only, that would be interesting by itself.

Gary Mindlin Miguel

They're very small tasks. Easy to estimate for someone who knows the code base well.

jmtpr

I agree with Peter Wildeford that "self-reported productivity gains cannot be trusted" is the key takeaway here.

Intuitively, this experiment setup ought to be biased towards negative productivity forecasts. The participants are all experts who do not use AI in their workflows -- that's a selection bias in favor of people who have previously decided that using AI in their workflows would not be a significant productivity gain.

But despite that bias, when they use AI, they perceive a significant productivity gain! And they're wrong!

This seems to me to be a really important "mirage". It also matches my intuition -- we all know that the "10x engineer" people are mostly bullshit artists.

Silverax

I also 100% agree with the key takeaway being self-reporting should not be trusted.

But I'd say your claim about 10x engineers is contradicted by the next sentence:

> The March @METR_Evals paper contained this nugget about contractors being much slower [5x-18x slower to fix issues!] than maintainers.

Let's average that to 10x and there you have it. Some engineers are 10x faster than others.

Methos5000

I would guess that's familiarity with the existing code base more than actually being fundamentally 10x better. Reading and understanding code you didn't write is hard, especially compared to the long-term maintainers who wrote it. It's the same in any system: familiarity makes you faster at the task, but that's not an inherent skill difference. To test the 10x thing, you'd have to compare apples to apples.

jmtpr

To be clear, I'm not denying the existence of (large) productivity differences between engineers. I'm saying that this obviously cannot be self-determined-and-reported without hard data, and that posturing as such carries a stench that most people can recognize.

David Spies

Claude Code? Maybe it depends on the domain? I have a small puzzle game I've been working on, on and off, for a couple months, and I've _repeatedly_ tried using Claude Code, asking it to add features.

_Every time_ I burn through $15 before giving up with nothing to show for it, watching as the AI makes some terrible design decision, goes through all the code, rewriting everything to address that decision, then ties itself in knots trying to compensate for the bad choices it made early on, and finishes with a broken, unusable change I have to revert.

I can stop it early and tell it to take a different approach, but then it immediately makes another bad decision and the same thing happens. If I keep doing this, it's slower than if I just wrote it myself.

Victualis

What happens if you ask for a detailed design document for the new feature and a prioritized plan for how to add it? Or documentation about the existing design, data structure semantics, event loop, calling patterns, I/O? This kind of scaffolding is traditionally expensive but an LLM makes it cheap, and then it can be refined or directly edited in a few minutes, referred to during development, and discarded when no longer useful. If the LLM makes a bad decision then its distribution is less informed than yours, and software engineering artifacts reduce that information asymmetry.

David Spies

I haven't tried that. I can see how that could be extremely useful in a large codebase where the LLM can't view the whole thing at once, but my game is barely 2000 LOC. It seems like it ought to be able to keep the entire thing in context (and therefore understand all the design choices).

Victualis

My experience is that beyond 500 LOC Claude Code's use of the context window is much less reliable and errors can creep in. 2000 LOC already warrants some scaffolding. Scaffolding is cheap and seems to help a lot.

Dan

> AI coding is at its best when it is helping you deal with the unfamiliar, compensate for lack of skill, and when it can be given free reign or you can see what it can do and adapt the task to the tool.

It's funny to read this as a senior/staff level engineer because I find the complete opposite to be true. AI coding is most useful for me when I'm very familiar with the work to be done, I'm already capable of writing the code myself, and I'm basically using compute to go much, much faster than if I wrote all of the code by hand. I do this regularly now, and am indeed much more productive in this scenario!

Meanwhile, I have the _worst_ time with it when I'm working with a language/library/framework I don't understand or the prompts are too open-ended. Maybe initially it's fast and I speed through the discovery work, but every single time I do this, weird bugs start popping up that the AI struggles to fix, and then *I* struggle to fix it because I don't have a mental model of how the code works. Often I peek under the hood and discover all kinds of weird LLM-written abstractions and it takes a while for me to get my bearings. Surely this isn't just a prompting skill issue?

Victualis

LLMs amplify any ambiguity. Not so bad once through for one-shots with vibed prompts, but iterate and the ambiguities cascade and overwhelm any utility.

I think both are true: code discovery, unfamiliar languages, quick one-offs work great. And when you know the boundaries and can steer the LLM to respect the code conventions, to stop it introducing 12 different frameworks for the same thing, to write relevant unit tests for the new functionality it just added, to document what the code does and plans for future changes, and to focus on the architecture and what actually needs doing, it is also a big productivity boost. Stuff in the middle, not so good, and in my own experience that's a skills issue. I now use LLM coding either for fun software that I don't care about, or where I can justify putting in the effort to become a domain expert to push beyond the slop middle zone.

Enon

Programming is a "mapper" thing, having a mental map of the problem. LLMs can't do this, nor can most managers. They are "packer" mentality types; quantity of BS is their sole performance metric. They don't understand the problem, they have no ability to design solutions based on the problem, rather they try to use procedures and bureaucratic rituals to substitute for understanding.

See Alan G. Carter's "The Programmer's Stone", part of his Reciprocality project of the late '90s, a sprawling and original work of philosophy that hasn't gotten the attention it deserves, though some of the more interesting ideas are probably wrong. https://datapacrat.com/Opinion/Reciprocality/

Pasha

The whole story boils down to the Figure 6 on page 10 of the paper titled "Average Percentage of Time Spent per Activity..."

G. Retriever

The core finding here is that people are not even reliable judges of whether AI tools help or hurt them, which makes a gigantic amount of sense to me given the delta between what other people report AI does for them and my personal experience of basically constant frustration and disappointment with all LLMs I've ever used.

Max Conradt

The second order effect of these repos likely being in the training data (albeit with modifications) makes this not a worst case scenario for LLMs.

Additionally, open source projects require much less context and are easier to contribute to and test than enterprise software. "Running $x locally" is generally a challenge when $x is enterprise software; for open source software, it's the point.

The Notorious Oat

Even if it's a "worst case scenario", it's also *by far* the most common case.

Max Conradt

Is this a joke? The vast majority of software is closed source, and also not in the training data. I maintain my claim that open source projects are much easier to maintain and contribute to than commercial software projects, adjusting for size. Commercial software projects likely tend to run larger.

The Notorious Oat

I meant in comparison to greenfield projects, not closed source. I realize that context was missing from my comment, but it was advanced as part of the defense of AI in the article, and I found that unpersuasive.

The Notorious Oat

Also, my comment did not take a position on whether it was a worst case scenario. I was pointing out why it was a poor defense even if we grant that it's a worst case scenario.

Yoav Tzfati

Important nuance re learning effect: I think an important aspect of "learning" to use AI coding agents effectively is improving at setting up your codebase to be piloted by the models well, by e.g. writing cursor-rules and llm friendly tools. But in this study the developers were disincentivized from taking time to do that, because they were being timed on how fast they can complete the issue. This is my main explanation for no improvement over time.

Another way to look at it — you don't pair program with another human to get results faster, you do it to share knowledge so they can better perform future tasks on their own. If after the session you wipe their memory and don't bother documenting what they learned, you're not getting any of the value.

Matto

It's remarkably difficult to judge developer productivity and has been for years, in my experience, and probably longer if I believe what the greybeards say.

It's surprising that this stance went out the window when AI assistants appeared. I keep bumping into people who say someone became a coding machine at their company, yet I see no observable change. Of course I might be in a bubble. But even in my own midsized company, where we have a lot of true believers, I don't see this group achieving takeoff velocity and leaving me and everyone else in the dust.

This is a complex area so I'm glad at least this study points out how difficult it is to measure productivity.

avalancheGenesis

Always happy to see studies of non-barking dogs, where the File Drawer Effect doesn't happen. That's good Science! It doesn't automatically make a study accurate (I think people have overlearned a bit from the replication crisis; "oh that paper is fake because it just confirms their priors" is not even valid as a Fully General Counterargument), but it certainly lends extra credibility.

I am wondering if you'll be tightening the filters going forward for self-reports of coding speedup, since this seems to indicate one ought to be significantly more cautious and qualify such observations accordingly. Other forms of inherently-verifiable Mundane Utility are probably fine (like I doubt we'll see a "doctors estimate AIs correctly classify X-rays 20% better, but in fact they're 20% worse", if anything it seems the opposite)...all the same, definitely a downward update on doom, at least by that specific path. There are of course still changes a-comin' - the employment model of "we pay you by the hour to code" still needs a rethink even if AI merely allows the same level of output slower with less effort.

Jeffrey Soreff

Re: "Roon: IME really good software ppl who deeply care find the least use from LLM coding and are often ideologically opposed to it because they like to exert editorial control over every line. slop research coders such as myself don’t care as much and have much larger gains."

While I certainly don't expect ideological opposition to AI coding assistance for use within the AI labs themselves, I _do_ expect the developers in the leading edge labs to be good software people. This suggests that we should update our expectation of the delay till we have an ai-2027 style supercoder which is relevant to leading edge AI work by _more_ than the raw result from this experiment suggests.