Without knowing how well other models would do on their infrastructure, it's hard to forecast how much Devin might improve with better models. I'd like to see it run with other models so we could compare the benchmark results. My own guess is that the improvement is a one-off and further scaling won't lead to much better performance.
More interesting is whether they managed to implement some kind of search in the model and, if so, how. I know they mentioned using "RL methods", but they didn't elaborate.