Introducing SWE-bench Verified
Abstract
The article discusses the release of a human-validated subset of the SWE-bench benchmark, which aims to more reliably evaluate AI models' ability to solve real-world software engineering issues. It highlights the limitations of the original SWE-bench dataset and the process of creating the new SWE-bench Verified dataset through human annotation.
Q&A
[01] Improving SWE-bench
1. What were the three major areas for improvement identified in the original SWE-bench benchmark?
- The unit tests used to evaluate the correctness of a solution are often overly specific, and in some cases unrelated to the issue, potentially causing correct solutions to be rejected (a hypothetical example follows this list).
- Many samples have an issue description that is underspecified, leading to ambiguity on what the problem is and how it should be solved.
- It is sometimes difficult to reliably set up the SWE-bench development environments for the agents, which can cause unit tests to fail regardless of the solution and lead to perfectly valid solutions being graded as incorrect.
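To make the first point concrete, here is a purely hypothetical example, not drawn from any actual SWE-bench sample: the issue only asks that out-of-range input be rejected, but the gold unit test also pins the exact error-message wording, so a correct fix phrased differently is graded as incorrect. The `parse_angle` function and both message strings are invented for illustration.

```python
import pytest

# Hypothetical patch a model might submit: the behaviour requested by the issue
# (reject angles outside [0, 360)) is implemented, but with its own message wording.
def parse_angle(value: float) -> float:
    if not 0.0 <= value < 360.0:
        raise ValueError(f"invalid angle {value}: expected 0 <= angle < 360")
    return value

def test_parse_angle_rejects_out_of_range():
    with pytest.raises(ValueError) as excinfo:
        parse_angle(400.0)
    # Overly specific assertion: it ties success to one exact message string,
    # so this otherwise-correct fix is rejected by the benchmark's unit test.
    assert str(excinfo.value) == "angle must be in the half-open interval [0, 360)"
```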
2. How did the authors address these issues?
- They launched a human annotation campaign with professional software developers to screen each sample of the SWE-bench test set for appropriately scoped unit tests and well-specified issue descriptions.
- They released SWE-bench Verified, a subset of the original test set from SWE-bench, consisting of 500 samples verified to be non-problematic by the human annotators.
- They collaborated with the SWE-bench authors to develop a new evaluation harness that uses containerized Docker environments, making evaluation on SWE-bench easier and more reliable (sketched below).
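Below is a minimal sketch of what a containerized, per-sample evaluation loop can look like. It assumes the dataset is published on Hugging Face as `princeton-nlp/SWE-bench_Verified` and invents an image-naming scheme and test entrypoint; the real harness's CLI, image names, and scripts may differ.

```python
"""Sketch of a Docker-based evaluation loop; not the official SWE-bench harness."""
import subprocess
from datasets import load_dataset

# Assumed Hugging Face dataset ID for the verified subset.
dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

def evaluate_instance(instance_id: str, model_patch: str) -> bool:
    """Apply a model-generated patch inside a throwaway container and run the tests."""
    image = f"swebench-eval:{instance_id}"  # assumed per-instance image naming scheme
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", image,
         "bash", "-c", "git apply && ./run_tests.sh"],  # hypothetical entrypoint
        input=model_patch.encode(),
        capture_output=True,
    )
    return result.returncode == 0  # treat a clean exit as "issue resolved"

# Example: check the first instance with an empty patch (will not resolve anything).
first = dataset[0]
print(first["instance_id"], evaluate_instance(first["instance_id"], ""))
```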
3. What were the key findings from the human annotation process?
- 38.3% of samples were flagged for underspecified problem statements, and 61.1% were flagged for unit tests that may unfairly mark valid solutions as incorrect.
- Overall, the annotation process resulted in 68.3% of SWE-bench samples being filtered out due to underspecification, unfair unit tests, or other issues.
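A quick back-of-the-envelope check on these figures, assuming every flagged sample was among those filtered out, shows that the two flag categories must overlap substantially, since they sum to well over the combined 68.3% filtering rate:

```python
# The two flag rates sum to 99.4%, yet only 68.3% of samples were filtered in total,
# so (assuming every flagged sample was filtered) many samples carried both flags.
flag_underspecified = 0.383   # underspecified problem statements
flag_unfair_tests   = 0.611   # unit tests that may reject valid solutions
filtered_total      = 0.683   # union of all filtering criteria, incl. "other issues"

min_overlap = flag_underspecified + flag_unfair_tests - filtered_total
print(f"At least {min_overlap:.1%} of samples carried both flags")  # -> 31.1%
```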
[02] Performance on SWE-bench Verified
1. How did the performance of GPT-4o, run with open-source scaffolds, change on the new SWE-bench Verified dataset compared to the original SWE-bench?
- On SWE-bench Verified, GPT-4o with the best-performing scaffold reaches 33.2%, more than doubling its 16% score on the original SWE-bench.
- The increase may be partly explained by the shift in the difficulty distribution toward easier samples, but the authors also observed performance increases within individual difficulty categories, indicating that the new dataset better captures model capabilities (see the decomposition sketch below).
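The decomposition behind that argument fits in a few lines: the overall resolve rate is the sum of difficulty-bucket shares weighted by per-bucket resolve rates, so shifting the mix toward easier samples raises the headline number even when per-bucket rates are frozen. All numbers below are invented for illustration; only the structure of the calculation mirrors the article's reasoning.

```python
# Overall resolve rate = sum over difficulty buckets of (bucket share * resolve rate).
# Hypothetical per-bucket resolve rates, held fixed across both datasets:
# buckets: <15 min, 15 min-1 hr, 1-4 hrs, >4 hrs
rate_per_bucket = [0.45, 0.25, 0.10, 0.02]

mix_original = [0.30, 0.48, 0.18, 0.04]   # hypothetical harder difficulty mix
mix_verified = [0.40, 0.52, 0.07, 0.01]   # hypothetical easier difficulty mix

def overall(mix, rates):
    return sum(share * rate for share, rate in zip(mix, rates))

print(f"original mix: {overall(mix_original, rate_per_bucket):.1%}")   # ~27.4%
print(f"verified mix: {overall(mix_verified, rate_per_bucket):.1%}")   # ~31.7%
# With these made-up numbers the mix shift alone buys only a few points, so a jump
# from 16% to 33.2% must also reflect higher resolve rates within buckets.
```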
2. How did the difficulty distribution of the SWE-bench Verified dataset compare to the original SWE-bench and SWE-bench Lite datasets?
- The original SWE-bench dataset had 77.8% of samples estimated to take less than an hour for an experienced software engineer to complete.
- Both SWE-bench Lite and SWE-bench Verified skewed this further, leaving fewer than 10% of issues estimated to take longer than an hour.
- However, the mechanism underlying this shift is different: SWE-bench Lite subsampled the original dataset to make the benchmark easier, whereas SWE-bench Verified attempts to remove infeasible samples from the dataset.
[03] Lessons Learned
1. What are the key lessons the authors learned from their experience with SWE-bench?
- Invest in deeply understanding benchmarks, as they can underestimate model capabilities due to issues in the benchmark design.
- Account for progress in the ecosystem, as external enhancements to a model can significantly improve its performance on a benchmark.
- Be cognizant of the limitations of evaluations based on static datasets, as they may be subject to issues like dataset contamination.
2. How do the authors plan to apply these lessons to their Preparedness Framework?
- The authors use SWE-bench as one of several evaluations tracking the Medium risk level of the Model Autonomy risk category in their Preparedness Framework.
- They emphasize the need to continually improve and verify the quality of evaluations used in the Preparedness Framework, as models approach AGI-level capabilities.
- They also highlight the importance of considering potential external enhancements to models when assessing risk, and the need to supplement static dataset-based evaluations with other types of evaluations.