More Agents Is All You Need
Abstract
The paper presents a systematic study on the scaling property of raw agents instantiated by large language models (LLMs). The key findings are:
- The performance of LLMs can be generally improved by increasing the number of agents using a simple sampling-and-voting method.
- This method is orthogonal to existing, more complicated methods for enhancing LLMs, so it can be combined with them, and the degree of enhancement correlates with task difficulty.
- Comprehensive experiments are conducted on a wide range of LLM benchmarks to verify the presence of this finding and study the properties that can facilitate its occurrence.
Q&A
[01] More Agents Is All You Need
1. What is the key finding of the paper regarding the scaling property of raw agents instantiated by LLMs? The key finding is that the performance of LLMs can generally be improved by increasing the number of agents using a simple sampling-and-voting method.
2. How does the proposed method compare to existing complicated methods to enhance LLMs? The proposed sampling-and-voting method is orthogonal to existing complicated methods, so it can be applied on top of them, and the degree of enhancement correlates with task difficulty.
3. What did the authors do to verify and study the properties of this finding? The authors conducted comprehensive experiments on a wide range of LLM benchmarks to verify the presence of this finding and study the properties that can facilitate its occurrence.
[02] Related Work
1. What are the three main categories of related works discussed in the paper? The related works are categorized into:
- LLM self-ensemble methods
- Heterogeneous LLM ensemble methods
- Multiple LLM agents collaboration methods
2. How does the proposed method differ from the works in these categories?
- The proposed method is unsupervised and does not require additional training data, unlike the heterogeneous LLM ensemble methods.
- The method is compatible with a broader range of methods, including prompt engineering and multiple LLM agents collaboration, unlike the LLM self-ensemble methods that focus exclusively on reasoning tasks.
- The method focuses on the scaling trend of adding more LLMs, unlike the multiple LLM agents collaboration methods that primarily focus on the interaction structures between LLM agents.
[03] Method
1. What are the two main phases of the proposed sampling-and-voting method? The two main phases are:
- Sampling: Generating multiple samples by querying the LLM or integrating with other methods.
- Voting: Consolidating the response sample set into the final answer using majority voting.
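The two phases can be sketched in a few lines of Python for the closed-ended case, where voting reduces to picking the most frequent answer. This is a minimal sketch, not the authors' implementation: `query_model` stands for any function that returns one sampled answer per call, and `make_scripted_model` is a hypothetical stub that replaces a real LLM so the example runs on its own.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def sample_and_vote(query_model: Callable[[str], str], prompt: str, n_agents: int) -> str:
    """Sample n_agents answers for the same prompt and return the most frequent one."""
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    # Sampling phase: one independent query per agent.
    samples: List[str] = [query_model(prompt) for _ in range(n_agents)]
    # Voting phase: majority vote; ties go to the answer that appeared first.
    return Counter(samples).most_common(1)[0][0]


def make_scripted_model(answers: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays a fixed list of answers in order."""
    scripted = cycle(answers)
    return lambda prompt: next(scripted)


model = make_scripted_model(["B", "A", "A", "C", "A"])
print(sample_and_vote(model, "Which option is correct?", n_agents=5))
```

With a real model, `query_model` would need a sampling temperature above zero; otherwise every agent would return the same answer and voting would add nothing.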
2. How is the similarity between samples calculated in the voting phase?
- For open-ended generation tasks like code generation, the BLEU score is used to quantify similarity between samples.
- For close-ended tasks like multiple-choice questions, similarity is measured by the occurrence frequency of each sample.
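For open-ended outputs, one way to read the voting step is to select the sample that is most similar to all the others. The sketch below assumes that reading. To stay dependency-free it uses a Jaccard overlap of token bigrams as a simple stand-in for BLEU, so `bigram_overlap` is an illustrative assumption, not the metric used in the paper.

```python
from typing import Callable, List, Set, Tuple


def bigram_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-token bigrams (a simple stand-in for BLEU)."""
    def bigrams(text: str) -> Set[Tuple[str, str]]:
        tokens = text.split()
        return set(zip(tokens, tokens[1:]))

    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        # Both texts are too short to have bigrams: fall back to exact token match.
        return 1.0 if a.split() == b.split() else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def vote_by_similarity(
    samples: List[str],
    similarity: Callable[[str, str], float] = bigram_overlap,
) -> str:
    """Return the sample with the highest total similarity to all other samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    scores = [
        sum(similarity(s, t) for j, t in enumerate(samples) if j != i)
        for i, s in enumerate(samples)
    ]
    # max() keeps the earliest index when several samples share the top score.
    best = max(range(len(samples)), key=scores.__getitem__)
    return samples[best]


candidates = ["return a + b", "return a + b", "return a - b", "print ( a )"]
print(vote_by_similarity(candidates))
```

Any pairwise similarity function with the same signature, including a real BLEU implementation, can be passed in through the `similarity` parameter.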
[04] Experimental Setup
1. What are the key components of the experimental setup? The key components are:
- Tasks: Reasoning tasks (arithmetic, general), code generation
- Language models: Llama2, GPT-3.5-Turbo, GPT-4
- Methods enhanced by the proposed method: Prompt engineering (CoT, Zero-Shot CoT, SPP), multiple LLM agents collaboration (Debate, Reflection)
2. How is the effectiveness of the proposed method evaluated? Results are averaged over independent runs, and the ensemble size is scaled up to maximize gains (except for Debate, because of its computational overhead).
[05] Experimental Results
1. What are the key findings regarding the generalizability of the proposed method? The proposed method generally enhances performance across all tasks and LLMs by increasing the ensemble size. In some cases, a smaller LLM with the proposed method can outperform a larger counterpart.
2. How does the proposed method perform when combined with other methods? Integrating the proposed method with other methods can further improve performance across different LLMs and tasks, with some exceptions when combined with the Debate method.
3. How does the proposed method compare to other standalone methods in terms of effectiveness? Without the need for additional prompts or complex LLM collaboration frameworks, the proposed method achieves the highest average ranking across different LLMs and tasks.
[06] Understanding the Performance Gains
1. What are the three orthogonal dimensions of task difficulty identified in the paper? The three dimensions are:
- Inherent difficulty of the task
- Number of steps required to solve the task
- Prior probability of the correct answer
2. What are the key properties derived from the analysis of these dimensions?
- Gains first increase and then decrease as the inherent difficulty rises.
- Gains increase with the number of steps.
- Performance increases with the prior probability.
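A toy calculation gives some intuition for why gains might rise and then fall with difficulty. It is not the paper's analysis: it assumes a binary task, independent samples, and a single-sample accuracy `p`, with lower `p` standing in for a harder task. Under those assumptions, majority-vote accuracy is a binomial tail sum.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct (n odd)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def voting_gain(p: float, n: int) -> float:
    """Accuracy improvement of an n-sample majority vote over a single sample."""
    return majority_vote_accuracy(p, n) - p


# Sweep from an easy task (p = 0.9) to one at chance level (p = 0.5).
for p in (0.9, 0.7, 0.6, 0.5):
    print(f"p={p:.1f}  gain with 3 samples = {voting_gain(p, 3):+.3f}")
```

In this toy model the gain is close to zero for very easy tasks, larger for moderately hard ones, and vanishes once single-sample accuracy reaches chance level.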
3. What optimization strategies are proposed based on these properties?
- Step-wise sampling-and-voting: Applying sampling-and-voting at each step of a multi-step task.
- Hierarchical sampling-and-voting: Decomposing low-probability tasks into multiple high-probability subtasks and addressing them hierarchically using different models.
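Step-wise sampling-and-voting can be sketched as a loop that votes after every step and feeds the winning step into the next round. This is a minimal sketch under assumptions: `sample_step` is a hypothetical callable that proposes one candidate for the current step given the steps chosen so far, and steps are compared by exact string match.

```python
from collections import Counter
from typing import Callable, List


def stepwise_sample_and_vote(
    sample_step: Callable[[List[str], int], str],
    n_steps: int,
    n_agents: int,
) -> List[str]:
    """Build a multi-step solution, applying majority voting at every step."""
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    chosen: List[str] = []
    for step in range(n_steps):
        # Sample n_agents candidates for this step, conditioned on the steps chosen so far.
        candidates = [sample_step(list(chosen), step) for _ in range(n_agents)]
        # Keep the most frequent candidate; ties go to the one that appeared first.
        winner = Counter(candidates).most_common(1)[0][0]
        chosen.append(winner)
    return chosen


# Hypothetical scripted sampler standing in for an LLM, so the example is self-contained.
scripted = {
    0: iter(["x = 2", "x = 3", "x = 2"]),
    1: iter(["y = x + 2", "y = x + 2", "y = x * 3"]),
}
steps = stepwise_sample_and_vote(lambda ctx, i: next(scripted[i]), n_steps=2, n_agents=3)
print(steps)
```

Voting per step keeps an early mistake by one agent from propagating, at the cost of `n_steps * n_agents` model calls.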