🩵 Thx Zvi. Have you thought about using ChatGPT voice instead of the built-in one? Might be better.
If you are looking for a high-quality audio conversion, I produce a podcast of Zvi's posts through ElevenLabs, with multiple voices to differentiate the different quoted sources:
https://open.substack.com/pub/dwatvpodcast/p/ai-88-thanks-for-the-memos
Very cool, thx!
Zvi, I have some involvement in the Texas bill. Can you message me? I may be able to help.
It's a laboratory-of-democracy case, right? I feel like even the most draconian AI-blocking bill doesn't matter much, because there are 49 other US states and Texas is not a center for AI startups. There are some startups in Texas, but they can all leave for the Bay Area or NYC, or just not deploy their products to Texas.
SB 1047 was important because it takes time for startups to leave California, so it delays the potential singularity or fizzle by the amount of time required to move.
For the laboratory-of-democracy model to work, we need control groups: states like Texas that make using AI illegal, so we can compare the consequences head to head. Will AI-positive states see extreme economic growth, leaving AI-banning states to decline faster and harder than Detroit? Or will AI-positive states see mass unemployment from using AI in everything, plus a steady stream of incidents of AI failing and killing workers and customers?
I don't know, but it seems valuable to try it out and find out.
An AI genocide of humanity knows no borders.
If that's your threat model, Texas is completely irrelevant. In that case you should probably be for some group - probably the government - pushing forward as fast as possible to develop AI strong enough to be dangerous in an isolated lab, and proving to everyone, especially China, that the threat exists. Texas is a waste of your time if this is what you genuinely believe.
California was a different case - SB 1047, or some other meaningful legislation that leads to a slowdown as AI labs leave the state or let China pull ahead, would buy us a few extra months to live by your threat model.
I'm not sure your model of AI risk matches Zvi's model of AI risk.
An anecdote prompted by your comment about how something that's so good for productivity can be so slow to be adopted. I'm an avid AI user and work in a very quantitative, tech-forward, and young department. Almost everyone who's adopted AI here has done it because I've walked them through using it for this or that problem. After, I dunno, a week of this kind of collaboration, something suddenly clicks and they realize they love it. At that point they're off to the races and are way more than 25% more productive.
I mention the demographics because my department's profile is basically optimal for AI adoption. But here at least, adoption still moves at the speed of one person convincing another. When I talk to other departments about this, I often have conversations like: "I will gladly and patiently teach you to use this tool, and I can very credibly say that it will save you hours per week." "Eh." People just be like that, I guess.
I was recently in hospital, and basically everyone I interacted with checked that I was the right patient, and asked me to confirm that the relevant parts of my medical record in the computer were correct. Like, I was asked to confirm what medications I am on at least five times. I entirely understand why that’s the protocol, and I agree with doing that. Best one was them asking me to confirm stuff just before the intravenous sedation goes into my arm (like, this is the last possible moment you could ask me anything at all).
The whole system probably has a certain amount of redundancy against AI generated garbage in the online medical record.
(And I can merely guess at the medical negligence lawsuits that led to current protocols).
Sure, that might catch some of the clerical errors it'll make in the paperwork, but it won't have those kinds of safeguards on the operating table when it gets there.
Ultimately what matters is not whether risks are possible but what the odds are. As Zvi notes, Whisper hallucinations may not actually be worse than a physician dictating later and getting two patients confused.
“If AI is not a huge deal over the next 20 years, I presume either we collectively got together and banned it somehow, or else civilization collapsed for other reasons.”
I feel like it’s common for tech to have a lot of hype and even a lot of investment and then to end up not being a huge deal over the short-to-medium term.
Crypto, nuclear power, everything involving space, nanotech, VR, cloning/genetic engineering (until COVID vaccines), AI (for the entire history of AI research up to now)
Other stuff comes out and is genuinely really big for a while but then the world changes and in the end it’s a historical footnote. Mosquitoes develop resistance to DDT. Google Search gets overrun with SEO garbage. People get bored with the core gameplay loop for MMORPGs.
I can imagine AI going like that over the next 20 years. Maybe it needs more to keep scaling than what we can provide. Or maybe AI systems get kinda popular for a bit, but then fade when we end up being better at subverting them than securing them.
Praxis does take time, for sure; VR's infancy was in the late 1970s, and it has repeatedly suffered waves of overhype (fairly so). AI is going to be different. Even if capabilities don't develop any further beyond where they are today, it will have an outsized impact on society beyond that of any previous technology. Luckily, implementation is still bound by human adoption, which does take time, but of course another big fear is that this rate-limiting step might soon be obviated.
I think you can break your list down and consider each category one by one. Ask yourself: how much money could I have made with this technology, using the version actually developed or a reasonably improved version that was immediately possible?
Crypto - doesn't make net money.
Nuclear power - a very expensive way to boil water compared to coal, or now cogeneration and solar.
Everything involving space - at $10,000 per kg to reach orbit, how were you going to make money?
Nanotech - without some way to mass-produce nanotechnology, this isn't making money. ICs are the kind of nanotech we did get, and they make lots.
VR - headsets hurt your eyes. They make humans marginally more productive co-working in VR? Not a lot of money there.
Cloning/genetic engineering (until COVID vaccines) - in the places where this is legally allowed, it makes money and is commonly used.
AI (for the entire history of AI research up to now) - "up to now" stops in 2022, and OpenAI is beginning to make money on GPT-4.
Please note that by "money" I mean a shorthand for workers and resources. A command economy that didn't use money, like the USSR, would also find unprofitable technologies to be a waste.
Great stuff as always, Zvi. That YouGov poll from the Center for Youth and AI does smell a bit like opt-in online polling to me, though. Many of your points still stand, however.
A 1% probability of your device being taken over has been a thing since the days of the earliest browsers. Sometimes this goes down a lot (a critical image library vulnerability is patched), sometimes it goes up (a malicious actor uses a zero-day while paying for ad coverage on a high-profile keyword, or successfully commits a backdoor to a low-level utility library and isn't discovered quickly). I don't think LLM prompt injection massively changes the attack surface over existing exploits by itself. Running a system in autonomous agent mode does change the risk, because that allows multiple shots at the target.
Unfortunately, most people who use the Internet seem to have accepted that clicking on the occasional random link will let evildoers take over their machine and steal money from their bank account. This makes it less likely that the more flexible, adaptable kind of malware enabled by agents with a world model will get special consideration, even as over time the additional capabilities and potential for persistence make these systems harder to detect and defend against. Antivirus software that relies on pattern matching to trigger an alarm isn't going to help when the malware can adapt to the detector, dynamically work around it, or redirect the alarm.
That box meme? That's where we are headed, unless friction is added to the iterative agent loop.
One way to make this specific use case work is reference sources and a whitelist of trusted websites. Ironically this is de facto a rollback to a prior era - basically, your containerized AI agent does its assigned work using nothing but positively peer-reviewed nonfiction from a fixed set of websites hosting reference books, journals, and major news sources. No random websites.
Not fundamentally different from going to the library and reading only the nonfiction books and major newspapers they have there. Just thousands of times faster.
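To make that concrete, here is a minimal sketch of the kind of gate I mean, in Python; the domain list and the `fetch_for_agent` helper are invented for illustration and aren't any particular agent framework's API:

```python
from urllib.parse import urlparse

import requests

# Hypothetical allowlist of reference/news domains the agent may read from.
ALLOWED_DOMAINS = {
    "en.wikipedia.org",
    "www.reuters.com",
    "archive.org",
}


def fetch_for_agent(url: str, timeout: float = 10.0) -> str:
    """Fetch a page for the agent only if its domain is on the allowlist."""
    host = urlparse(url).hostname or ""
    # Allow exact matches or subdomains of an allowed domain.
    if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
        raise PermissionError(f"Refusing to fetch non-allowlisted domain: {host}")
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```

Everything the agent reads then comes from that fixed "library," which is exactly the rollback described above.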
The timing on the AI memo is super weird, and frankly extremely stupid. What the Biden administration should do is say loud and clear that they are interested in the complete and total acceleration of all AI research. That way if Trump wins, he will immediately declare himself the AI safety president. If Harris wins, they can just change course immediately.
The memo reads to me like "acceleration and fuck China but we are going to do this with so much bureaucracy that in practice nothing will be accomplished".
Basically "noop, cost hundreds of millions".
Podcast episode for this post
https://open.substack.com/pub/dwatvpodcast/p/ai-88-thanks-for-the-memos
Google's larger models are fantastic for one use case: video input! https://simonw.substack.com/p/video-scraping-using-google-gemini
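For anyone who wants to try this outside AI Studio, here's a rough sketch of the flow with the `google-generativeai` Python package; the API key, file name, model name, and prompt are my placeholders rather than anything from Simon's post:

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the video via the File API, then wait for server-side processing.
video = genai.upload_file("screen_recording.mp4")  # placeholder file
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)
if video.state.name == "FAILED":
    raise RuntimeError("Video processing failed")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [video, "List every date, name, and dollar amount visible in this video as CSV."]
)
print(response.text)
```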
>People toss off plausible-sounding stories about how, if X happens later, markets ought to behave like Y later; and then X happens; and then Y just doesn't happen. This happens ALL THE TIME. It happens to professional traders.
Eliezer is doing the thing where he speaks hyperconfidently with little to no supporting evidence.
A quick Google surfaced this report from professional traders:
"The US stock market has experienced its share of crisis events—from wars to political upsets, to many unforeseen human tragedies. The table below highlights 25 international crises that have occurred since 1940. During these events, the Dow Jones Industrial Average (DJIA) dropped by an average of about 6%. In all but four cases, the market returned to positive territory within six months of the end of each decline."
https://www.amundi.com/usinvestors/Resources/Classic-Concepts/Crisis-Events-and-the-US-Stock-Market
Here's another report: https://www.lpl.com/research/blog/middle-east-conflict-how-stocks-react-to-geopolitical-shock.html
I suspect Eliezer is falling prey to reporting bias. "Man bites dog" is more likely to make the news. "Catastrophe looms; stock prices rise" is more likely to make the annals of financial history.
[Note: I also disagree with Cowen for various reasons. I'm writing this because I'm much more worried about people reading these comments placing excessive trust in Eliezer.]
The Cuban Missile Crisis was the example that I chose (one-shot!) when this came up previously; he's probably going with it because of that, and because it's the clearest parallel. If you have a political crisis of another kind and it goes badly, you're still around to deal with losing money, or to cash winning tickets, so it plays differently.
It's funny that the 'thing you're trusting' here when you trust Eliezer is... not to trust one's ability to predict reactions!
And the 6% number plus rebounds illustrates the point - if you know a 6% decline is coming at some point in the next 10 years, that is not very useful in terms of finding a good trade.
The Cuban Missile Crisis was the brink of nuclear war. There’s little point in selling assets to consume more if you’re worried about being annihilated tomorrow or the next day. Conceivably we could get a warning shot 6 months in advance for an AI catastrophe.
I agree that just knowing a 6% decline might happen at some point is not very useful.
My point was more of a local-validity one. I doubt that professional traders have all given up on predicting future events due to the difficulty of knowing how the market will respond.
My Mundane Utility of the day: we get health insurance through my wife's big company. They sent out their annual 35-page guide describing the different options. I uploaded the entire PDF into Claude and prompted it to ask me questions (which it did) until it had enough info to make a recommendation. I was able to ask follow-up questions, and it even ran simulations like "What if someone in my family has an expensive surgery next year?" or "What if I decide to invest the maximum into an HSA?"
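A scripted version of roughly the same workflow, as a sketch using the Anthropic Python SDK plus pypdf to pull the text out of the guide; the file name, model alias, and prompt wording are my placeholders (the actual session was just done in Claude directly):

```python
import anthropic
from pypdf import PdfReader

# Pull the benefits guide text out of the PDF (file name is a placeholder).
reader = PdfReader("benefits_guide.pdf")
guide_text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
history = [{
    "role": "user",
    "content": (
        "Here is our annual health insurance options guide:\n\n" + guide_text +
        "\n\nInterview me one question at a time until you have enough information "
        "to recommend a plan, then make the recommendation. I may also ask what-if "
        "questions like an expensive surgery next year or maxing out an HSA."
    ),
}]

# Simple interview loop: print Claude's question, append my answer, repeat.
for _ in range(10):
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model alias
        max_tokens=1024,
        messages=history,
    )
    text = reply.content[0].text
    print(text)
    history.append({"role": "assistant", "content": text})
    history.append({"role": "user", "content": input("> ")})
```

The follow-up questions and what-if simulations are just additional turns in the same loop.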
Missed opportunity to use Lloyd's of London will be loaded when we go.
The deepfake report makes me wonder the same thing Scott did recently: What's The Deal With Korea? Lots of crazy topline stats and reports, but I don't really understand what the underlying causes are on the ground... (Beyond the fertility stuff, of course.)
Regarding the section about software productivity and working code:
I'm a staff software engineer with 15 years of experience. I've been using Aider with Claude for the past ~6 weeks, and ChatGPT and Copilot for software since they came out. I also wrote some shell scripts to integrate the OpenAI API into my command line workflows and have been using them for ~1 year.
My experience is that LLMs often generate (almost) working code, but using them effectively requires the ability to read, evaluate, test, and fix code quickly. These are activities that even many software developers are bad at. It also requires identifying when the LLM is going down a rabbit hole or is confused, and doing something else instead.
You've covered a lot of research showing that LLM tooling is more effective for novices than experienced developers, and I tend to agree that finding is overblown. If the task is "solve a relatively straightforward programming task that might take an experienced programmer an hour", I'd expect LLMs to raise novices from "will never complete the task, or perhaps days or weeks of work" to "can complete the task in an hour some possibly high percent of the time". Huge improvement! The expert is starting from a baseline of extremely high productivity, so unless they get lucky and the LLM is able to basically do the task by itself, it can't improve productivity on that task much.
But the thing is that developers face those sorts of tasks constantly, like 30+ hours per week, so sometimes solving them in 5 minutes and often solving them in 45 minutes is an enormous productivity booster.
The problems as I see them are:
1. It's easy to get stuck debugging code the LLM wrote. Debugging code you didn't write yourself in the past hour is several times harder than debugging your own fresh code. LLM code can't be "trusted" to be even marginally correct, so the prior probability that any given line is causing a bug is more diffuse than with human-written code. This must be carefully managed.
2. LLMs are much better at writing code than they are at fixing code. This means that getting the LLM to fix issues is often not tractable, though it is worth attempting. Also, the more you try, the worse it gets (see context window management below).
3. Copilot suggestions are often subtly wrong, and it often shows you quite long code blocks. I think this is largely a UX thing, where it doesn't have full context on what you're trying to do since the prompt is the code, not your explanation of what you want. Its next-token predictions are often quite good though. I use the accept-next-token hotkey almost exclusively and never accept code suggestions longer than a few lines.
4. You need lots of checks to make sure the LLM didn't break something already implemented, in order to make cumulative progress. With Aider this happens quite frequently! This requires good software development practices like code linting, tests, static analysis, version control, etc. I disable Aider's auto-commit functionality in Git, and use `git add -p` and `git checkout -p` to add/remove individual changes. This also helps ensure I read all of the code that is generated.
5. Performance of Aider/Claude seems to degrade rapidly for longer contexts, even far below the context limit of the model. I think this is true in web chats as well, but it's less noticeable there, since with Aider, command output and multiple files can make the context orders of magnitude larger than your specific prompt. This is one reason I like Aider's UX model of adding only particular files to the chat. I often exit Aider and restart when it stops working well. This also means it works far better for short, self-contained files and can be a challenge to use well on larger, poorly factored codebases (i.e., most codebases).
Strategies I've found useful:
* Read every line of code that is generated.
* Use aider adversarially when writing test cases (see the sketch after this list). E.g., I might write out a method signature and a docstring, then have aider write tests for it, then have aider write the function, then edit the implementation to remove obvious bugs, run the test cases, manually fix them, then have aider write more tests against the new implementation (with a fresh context). Or I might write test cases I don't let aider see, and tell it about the failures. Etc.
* Lock in progress with frequent git commits. Be prepared to backtrack to an earlier state rather than trying to debug (`git reset ... --hard`)
* Don't use Aider in cases where it will do a bad job. This is subjective and I'm still working on my intuition here. Basically cases where it needs a lot of files in context to do the right thing, or where what I want is unclear or very complicated.
* Keep files small and atomic where possible.
* When it does something wrong, revert the change and start the context over again. I wish Aider had better UX for this (something like manual tree search, backtracking to earlier parts of a chat history). Cursor or Copilot Chat might have something. In practice, I quit the program and restart the chat.
* Switch UX modes when needed. I'll go to o1 to help plan out a project, ask GPT-4o or Claude for library recommendations or self-contained one-off methods, use aider or copilot depending on how much input I think the LLM needs from me or vice versa, etc.
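To make the adversarial-testing bullet concrete, here is an invented example of the starting point: a signature plus docstring I'd write myself, followed by the kind of pytest cases I'd have aider generate before it ever sees an implementation (the function and tests are hypothetical, not from a real session):

```python
from datetime import date

import pytest


# The stub I'd write myself before letting aider see anything.
def business_days_between(start: date, end: date) -> int:
    """Count weekdays (Mon-Fri) strictly between start and end.

    Raises ValueError if end is earlier than start.
    """
    raise NotImplementedError  # aider writes this after the tests exist


# The kind of pytest cases I'd have aider generate against the stub.
# (They fail until the implementation lands, which is the point.)
def test_same_day_is_zero():
    assert business_days_between(date(2024, 10, 28), date(2024, 10, 28)) == 0


def test_weekend_is_skipped():
    # Friday to Monday: only Sat/Sun in between, so zero business days.
    assert business_days_between(date(2024, 10, 25), date(2024, 10, 28)) == 0


def test_end_before_start_raises():
    with pytest.raises(ValueError):
        business_days_between(date(2024, 10, 25), date(2024, 10, 24))
```

The failing tests then act as the spec: aider writes the implementation against tests it didn't get to grade itself on.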
Also noteworthy: I find that using LLMs often means the quality of my output is much better, even if I don't directly "get more done" in that session. The code ends up better documented, better tested, with a full command line interface or visualizations, etc. I think that probably has indirect effects on productivity, but most code has a short half life, so it might not matter all that much.
Apologies for a trivial comment, but "adapt" is not the word you're looking for, it's "adopt". Adoption is when you start using something. Adaptation is when you change something.