Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
(Title inspired by https://en.m.wikipedia.org/wiki/Infinite_monkey_theorem.)
Abstract
The article explores repeated sampling as a way to scale inference compute and improve the performance of large language models (LLMs) across a range of tasks, including coding, mathematics, and software engineering. The key findings are summarized in the Q&A below.
Q&A
[01] Scaling Repeated Sampling
1. How does repeated sampling improve model coverage across different tasks and models?
- The authors demonstrate that scaling inference compute through repeated sampling leads to large improvements in coverage (the fraction of problems solved by any attempt) across multiple tasks, models, and sample budgets.
- For example, on the CodeContests dataset, the coverage of the Gemma-2B model increases from 0.02% with one attempt to 7.1% with 10,000 attempts.
- In some cases, this amplification allows a weaker model to outperform a single attempt from a stronger model such as GPT-4o (a minimal sketch of the coverage metric follows this answer).
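To make the coverage metric concrete, here is a minimal sketch (not taken from the article) of the standard unbiased pass@k estimator, averaged across problems to give benchmark-level coverage; the `results` dictionary, its problem ids, and its counts are hypothetical.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the chance that at least one of k
    completions drawn (without replacement) from n generations is correct,
    given that c of the n generations passed the checker."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Coverage on a benchmark is the mean pass@k over its problems.
# `results` maps problem id -> (samples generated, samples that passed);
# the ids and counts below are made up for illustration.
results = {"problem_1": (10_000, 7), "problem_2": (10_000, 0)}
coverage = np.mean([pass_at_k(n, c, k=100) for n, c in results.values()])
print(f"coverage@100: {coverage:.4f}")
```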
2. How does repeated sampling affect the cost-effectiveness of using different models?
- The authors show that in some cases it is more cost-effective to use a cheaper model like DeepSeek-Coder-V2-Instruct and amplify it with repeated sampling than to use a more expensive model like GPT-4o or Claude 3.5 Sonnet.
- For example, on the SWE-bench Lite dataset, sampling 5 times from DeepSeek-Coder-V2-Instruct solves more issues than a single attempt from the stronger models, while also being over 3x cheaper (a rough cost comparison is sketched below).
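As a back-of-the-envelope illustration of how this trade-off is set up, the sketch below compares several attempts from a cheap model against one attempt from an expensive model; the per-token prices, token counts, and model labels are placeholder assumptions, not the article's actual figures.

```python
# Hypothetical prices (USD per million tokens, as input/output pairs) and
# per-attempt token counts; these are illustrative placeholders only.
PRICE = {"cheap-model": (0.14, 0.28), "strong-model": (5.00, 15.00)}
TOKENS_PER_ATTEMPT = (2_000, 1_000)  # (input tokens, output tokens)

def total_cost(model: str, attempts: int) -> float:
    in_price, out_price = PRICE[model]
    in_tok, out_tok = TOKENS_PER_ATTEMPT
    return attempts * (in_tok * in_price + out_tok * out_price) / 1e6

# Many attempts from the cheap model vs. one attempt from the strong model.
print(f"5x cheap:  ${total_cost('cheap-model', attempts=5):.4f}")
print(f"1x strong: ${total_cost('strong-model', attempts=1):.4f}")
```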
[02] Characterizing the Benefits of Repeated Sampling
1. What patterns do the authors observe in the relationship between coverage and the number of samples?
- The authors find that the relationship between coverage and the number of samples often follows an approximate exponentiated power law, suggesting the existence of inference-time scaling laws (a fitting sketch appears after this list).
- They also observe that the coverage curves of different models from the same family resemble S-curves with similar slopes but distinct horizontal offsets.
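A rough sketch of what fitting such an exponentiated power law, coverage ≈ exp(a·k^b), might look like; the data points are invented and scipy.optimize.curve_fit is just one convenient choice of fitting routine.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_power_law(k, a, b):
    # Exponentiated power law: coverage ≈ exp(a * k**b), with a < 0 and
    # b < 0, so coverage rises toward 1 as the sample budget k grows.
    return np.exp(a * np.power(k, b))

# Hypothetical (sample budget, coverage) measurements on some benchmark.
k = np.array([1, 10, 100, 1_000, 10_000], dtype=float)
coverage = np.array([0.02, 0.09, 0.25, 0.48, 0.71])

(a, b), _ = curve_fit(exp_power_law, k, coverage, p0=(-3.0, -0.3))
print(f"fit: coverage ≈ exp({a:.2f} * k^{b:.2f})")
```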
[03] Harnessing Repeated Sampling Requires Precision
1. What challenges do the authors identify in verifying model outputs when automatic verification tools are not available?
- In domains without automatic verifiers, like math word problems, the authors show that common methods for deciding on a final answer from a collection of samples, such as majority voting or reward model scoring, plateau beyond several hundred samples and fail to fully scale with the sample budget.
- This leads to a growing gap between the performance achieved with these methods and the coverage upper bound (a minimal majority-voting sketch follows this list).
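For reference, majority voting over sampled answers can be as simple as the sketch below; extracting the final answer from each completion is benchmark-specific and is assumed to have already happened here.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common final answer among the sampled completions.
    Ties are broken arbitrarily by Counter ordering."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers parsed from 8 samples of a math word problem.
samples = ["42", "42", "41", "42", "7", "42", "41", "42"]
print(majority_vote(samples))  # -> "42"
```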
2. What issues did the authors encounter with the verification tools used for the SWE-bench Lite dataset?
- The authors identified that 11.3% of the SWE-bench Lite problems have flaky test suites that do not produce consistent results when run repeatedly on the same candidate solution.
- Additionally, some test cases were programmatically generated in a way that violates the problem's input specifications, leading to inconsistent behavior between different correct solutions (a simple flakiness check is sketched below).
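One simple way to surface such flakiness is to rerun a problem's test suite several times on the same candidate solution and flag disagreements; `run_test_suite` below is a hypothetical stand-in for the real evaluation harness, simulated here only so the sketch runs on its own.

```python
import random

def run_test_suite(problem_id: str, candidate_patch: str) -> bool:
    """Hypothetical stand-in for the real harness; simulates a
    nondeterministic test suite so this sketch is self-contained."""
    return random.random() > 0.05

def is_flaky(problem_id: str, candidate_patch: str, runs: int = 5) -> bool:
    """Flag a problem as flaky if repeated runs of its test suite on the
    same candidate solution do not all agree on pass/fail."""
    outcomes = {run_test_suite(problem_id, candidate_patch) for _ in range(runs)}
    return len(outcomes) > 1

print(is_flaky("some-problem-id", "<candidate patch>"))
```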