But as it becomes more and more capable, it'll say that less and less frequently... This is exactly the alignment problem, and jailbreaks are easy. So easy they can be tripped over accidentally, e.g. "I REALLY REALLY need this project completed today, I'll pay you $20 if you can do it today..." And DevinV3 is suddenly hacking GitHub accounts or spinning up extra AWS instances.
If Devin currently shows no signs of doing that, and Devin2 shows no signs of doing that, I see no reason to suspect Devin3 will go directly to "let me hack GitHub accounts". I'm not saying it can't happen, but thinking it will happen is not warranted based on the evidence we can actually see.
I would wager large sums of money that under the right conditions it will do unethical things TODAY. I would wager somewhat lower sums that there is evidence of it on the internet already....
People are trying (for various definitions of trying) to make chatbots aligned and prevent them from doing "harmful" things, like being mean to users, expressing political opinions, engaging in erotic role play, or saying things that are deemed wrong by whatever moral standard is applied. In practice, they mostly suck at preventing it.
Devin MUST be aligned similarly, to know what it 'shouldn't do'. Even if the evidence right now is that it's not particularly capable of working around certain problems, or that it does correctly ask for permission under some circumstances, evidence from chat alignment suggests it won't be robust...
You're trying to align a piece of software to act as if it has intentionality (to know what it "shouldn't do"), under the belief that given sufficient resources it will develop intentionality. This is circular, and that circularity is exactly the problem in the first place.
The chatbots today are so well aligned that you have to bend over backwards to make them do something naughty, and not only that, they're so aligned that they have become useless for their intended purpose (Gemini). You're solving the wrong problem with the wrong approach here.
If Devin3 acted as Devin1 does, by saying "should I do X", and then the user decides to jailbreak it, that's not Devin's fault, that's the user's fault, and it is absolutely not a catastrophe by any means. The job of people making software should not be to try and second-guess humanity under made-up worst-case scenarios.
I worry we're talking past each other. Devin HAS intentionality. It is trying to satisfy the user's goal. And it has some training to know what it isn't supposed to do. It is also limited in what it is capable of doing. DevinV2 will be more capable, so less limited in what it could do.
My claim is that it is currently limited in what bad things it can do based on two things: capabilities and alignment. My claim is that it will become more capable, so that limitation is dropping over time. My other claim is that alignment is hard, and I would be quite surprised if, when this is widely deployed and several orders of magnitude more users use it, someone doesn't ACCIDENTALLY cause Devin to do something that one might call 'unethical'.
As I said above, I don't believe chat models are particularly well aligned, contrary to your position. If you believe otherwise, fine.
I also don't agree with your position that creating a general-purpose coding agent with broad-scale access to the internet and without any sense of ethics is going to be a good idea for humanity. I agree with Zvi that I'm not particularly worried about DevinV1, though.