
How Do AI Software Engineers Really Compare To Humans?

🌈 Abstract

The article examines how AI software engineers such as Devin and SWE-agent compare to human software engineers, and the limitations of current evaluation benchmarks such as SWE-bench. It highlights the key challenges AI models face on real-world software engineering tasks: handling multi-line and multi-file changes, following coding conventions, using third-party libraries, and maintaining contextual awareness of how a change affects the rest of the codebase.

🙋 Q&A

[01] What is SWE-bench?

  • SWE-bench is a dataset and evaluation tool that offers 2,294 tasks based on actual GitHub issues and pull requests drawn from open-source Python projects.
  • The tasks are small in scope but sit inside large projects: the median problem description is 140 words, the median codebase contains around 1,900 files and 400,000 lines of code, and the median reference solution modifies a single function within one file, changing around 15 lines.
  • The tasks are evaluated using unit tests, with at least one fail-to-pass test for each task and a median of 51 additional tests to check if prior functionality is properly maintained.
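
For readers who want to inspect the benchmark directly, here is a minimal sketch of loading a task and looking at its fields. It assumes the dataset is published on the Hugging Face Hub as princeton-nlp/SWE-bench and that fields such as problem_statement, patch, FAIL_TO_PASS, and PASS_TO_PASS are present; the SWE-bench repository documents the authoritative schema.

    # Minimal look at the benchmark (assumes the Hugging Face Hub dataset
    # princeton-nlp/SWE-bench; field names may differ in other releases).
    from datasets import load_dataset

    tasks = load_dataset("princeton-nlp/SWE-bench", split="test")
    print(len(tasks))  # expected to be 2,294 task instances

    task = tasks[0]
    print(task["repo"])               # source open-source Python project
    print(task["problem_statement"])  # the GitHub issue text (median ~140 words)
    print(task["patch"])              # reference solution as a unified diff
    print(task["FAIL_TO_PASS"])       # tests that must go from failing to passing
    print(task["PASS_TO_PASS"])       # tests that must keep passing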

[02] How did AI models perform on SWE-bench?

  • At the time of the paper's publication, the best-performing model, Claude 2, resolved only around 4.8% of the issues with oracle retrieval (where the model is given the files known to resolve the issue) and only 1.96% without it; a sketch of the pass/fail check behind these resolution rates follows this list.
  • The key limitations of the AI models highlighted in the paper include:
    • Struggling with multi-line and multi-file changes
    • Ignoring an organization's coding conventions and working style
    • Failing to effectively use third-party libraries
    • Lacking contextual awareness to understand how changes in one part of the code affect the rest
    • Struggling with tasks involving images due to a lack of multimodal capabilities
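
To make those resolution rates concrete, below is a minimal sketch of the check they are based on: a candidate patch counts as resolving an issue only if the task's fail-to-pass tests now pass and its previously passing tests still pass. This is not the official SWE-bench harness; run_tests is a hypothetical helper standing in for checking out the repository, applying the patch, and invoking the project's test runner, and the FAIL_TO_PASS / PASS_TO_PASS handling is an assumption about the dataset format.

    import json
    import subprocess
    from typing import Iterable


    def run_tests(repo_dir: str, test_ids: Iterable[str]) -> bool:
        """Hypothetical helper: run the given tests inside the checked-out repo
        (candidate patch already applied) and report whether they all pass."""
        result = subprocess.run(
            ["python", "-m", "pytest", *test_ids],
            cwd=repo_dir,
            capture_output=True,
        )
        return result.returncode == 0


    def is_resolved(task: dict, repo_dir: str) -> bool:
        """Simplified version of the SWE-bench success criterion."""
        # Assumes FAIL_TO_PASS / PASS_TO_PASS are JSON-encoded lists of test ids.
        fail_to_pass = json.loads(task["FAIL_TO_PASS"])  # must go red -> green
        pass_to_pass = json.loads(task["PASS_TO_PASS"])  # must stay green
        return run_tests(repo_dir, fail_to_pass) and run_tests(repo_dir, pass_to_pass)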

[03] How did Devin, the AI software engineer, perform on SWE-bench?

  • Devin successfully completed 13.86% of the SWE-bench tasks, outperforming the recently released open-source SWE-agent, which completed 12.29% of the issues.
  • Devin's results on SWE-bench revealed a consistent struggle with complex changes, particularly those spanning multiple files or exceeding 15 lines of code, mirroring the challenges faced by the models tested in the SWE-bench study.
  • On the Django-specific tasks, Devin was able to complete 19.19% of the tasks attempted, with the majority being bug fixes.

[04] What are the key takeaways from the article?

  • Current AI evaluation benchmarks like SWE-bench do not adequately represent the complexity of real-world software engineering tasks, which often involve larger changes across multiple files and the use of third-party libraries.
  • AI software engineers like Devin and SWE-agent are making impressive progress, but they still struggle with certain types of tasks, particularly those requiring a deeper understanding of the codebase and its context.
  • The transition towards AI-assisted software engineering will be significant, but it will not eliminate the need for human software expertise, as engineers will still be required to define requirements, provide logical reasoning, and correct AI errors.
  • The development landscape will adapt, with new frameworks and tools being introduced to streamline the collaboration between engineers and AI.