AI Agents That Matter
Abstract
The article discusses challenges and recommendations for evaluating AI agents, an increasingly important research direction. The authors make five key contributions:
- AI agent evaluations must be cost-controlled: simply calling language models repeatedly can produce large accuracy gains without representing meaningful research progress.
- Jointly optimizing accuracy and cost can yield better agent designs, as the authors demonstrate on the HotPotQA benchmark.
- Model developers and downstream developers have distinct benchmarking needs, and proxies for cost like parameter count can be misleading for downstream evaluation.
- Agent benchmarks enable shortcuts, and the authors propose a framework for avoiding overfitting based on the intended generality of the agent.
- The article identifies issues with standardization and reproducibility in agent evaluations, and calls for the development of a standardized evaluation framework.
Q&A
[01] AI agent evaluations must be cost-controlled
1. What are the key insights from the analysis of current agent benchmarks and evaluation practices?
- There is a narrow focus on accuracy without attention to other metrics like cost, leading to needlessly complex and costly agents.
- Repeatedly calling language models can significantly improve accuracy, but this is a scientifically meaningless way to gain accuracy: the improvement comes from spending more at inference time, not from better agent design.
- The authors introduce three new simple baseline agents that outperform many state-of-the-art complex agent architectures on HumanEval while costing much less.
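For illustration, here is a minimal sketch of a warming-style baseline, assuming hypothetical helpers `call_model` and `passes_example_tests`; it is not the authors' exact implementation, only the idea of re-sampling with gradually increasing temperature until a candidate passes the example tests shown in the prompt.

```python
# Minimal sketch of a "warming" baseline: re-sample the model with gradually
# increasing temperature until a candidate passes the example tests.
# `call_model` and `passes_example_tests` are hypothetical placeholders.

def call_model(prompt: str, temperature: float) -> str:
    """Query a language model API and return generated code (placeholder)."""
    raise NotImplementedError

def passes_example_tests(code: str, example_tests: str) -> bool:
    """Run the candidate against the examples shown in the prompt (placeholder)."""
    raise NotImplementedError

def warming_baseline(prompt: str, example_tests: str, max_attempts: int = 5) -> str:
    """Return the first candidate that passes the example tests, else the last one."""
    candidate = ""
    for attempt in range(max_attempts):
        # Start deterministic, then warm up to encourage diverse retries.
        temperature = attempt / max(max_attempts - 1, 1)
        candidate = call_model(prompt, temperature=temperature)
        if passes_example_tests(candidate, example_tests):
            break
    return candidate
```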
2. Why is it important to control for cost in agent evaluations?
- Accuracy alone cannot identify progress, as it can be improved by methods like retrying that do not represent meaningful advances.
- Failing to identify the true sources of accuracy gains can entrench mistaken beliefs about the effectiveness of complex "System 2" approaches like planning and reflection, when simple baselines may perform just as well.
3. How do the authors visualize the accuracy-cost tradeoff?
- The authors plot the accuracy and cost of different agents on a Pareto frontier, which shows no significant accuracy difference between the best-performing agent architectures and simple baselines such as the warming strategy (retrying with increasing temperature).
- The cost can differ by almost two orders of magnitude for substantially similar accuracy levels.
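A small sketch of how such a frontier can be computed from (cost, accuracy) results; the agent names and numbers below are made up for illustration, not figures from the paper.

```python
# Identify Pareto-optimal agents from (dollar cost per run, accuracy) results.
# All entries are invented placeholders.

results = {
    "simple_retry": (0.5, 0.86),
    "complex_agent_a": (40.0, 0.87),
    "complex_agent_b": (12.0, 0.80),
}

def pareto_frontier(results):
    """Return agents not dominated by another agent with lower cost AND higher accuracy."""
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            other_cost <= cost and other_acc >= acc and (other_cost, other_acc) != (cost, acc)
            for other_name, (other_cost, other_acc) in results.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(results))  # ['simple_retry', 'complex_agent_a']
```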
[02] Jointly optimizing cost and accuracy can yield better agent designs
1. What is the key insight behind jointly optimizing cost and accuracy?
- Visualizing the cost and accuracy of agents as a Pareto frontier opens up a new space for agent design, in which designers can trade off the fixed (one-time design and optimization) cost against the variable (per-query inference) cost of running an agent.
- By spending more upfront on one-time optimization of the agent design, developers can reduce the variable cost of running the agent.
2. How do the authors implement joint optimization on the HotPotQA benchmark?
- The authors modify the DSPy framework to search for few-shot examples that minimize cost while maintaining accuracy on HotPotQA.
- They find that joint optimization leads to 53% lower variable cost with similar accuracy compared to the default DSPy implementation for GPT-3.5, and a 41% lower cost for Llama-3-70B while maintaining accuracy.
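A generic sketch of the idea, not the authors' DSPy modification: search over candidate few-shot demonstration sets and score each by validation accuracy minus a cost penalty. The `evaluate` helper, the subset size, and the cost weight are all assumptions for illustration.

```python
# Joint cost-accuracy optimization over few-shot prompt choices (illustrative sketch).
import itertools

def evaluate(few_shot_examples, validation_set):
    """Run the agent with these demonstrations; return (accuracy, avg cost per query). Placeholder."""
    raise NotImplementedError

def joint_search(candidate_pool, validation_set, k=2, cost_weight=0.5):
    """Pick the k-example subset maximizing accuracy - cost_weight * cost."""
    best_score, best_subset = float("-inf"), None
    for subset in itertools.combinations(candidate_pool, k):
        accuracy, cost = evaluate(list(subset), validation_set)
        score = accuracy - cost_weight * cost  # penalize expensive prompts
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset
```

Spending this one-time search cost upfront is what lets the optimized agent run more cheaply per query afterwards.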
[03] Model and downstream developers have distinct benchmarking needs
1. What is the key difference between model evaluation and downstream evaluation?
- Model evaluation is a scientific question of interest to researchers, where it makes sense to stay away from dollar costs and focus on proxies like parameter count or compute used.
- Downstream evaluation is an engineering question that helps inform a procurement decision, where cost is the actual construct of interest.
2. Why are proxies for cost, like the number of active parameters, misleading for downstream evaluation?
- Proxies such as the number of active parameters can make one model appear more efficient than another even when its actual dollar cost per query is higher.
- Downstream developers care about the actual dollar cost relative to accuracy, not proxies that may be chosen to make a model look favorable.
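A toy calculation of what downstream developers actually optimize for; all prices and token counts below are made-up placeholders, not provider figures.

```python
# Dollar cost per query depends on provider prices and token counts,
# not on parameter count. All numbers are invented for illustration.

def cost_per_query(input_tokens, output_tokens, price_in_per_1m, price_out_per_1m):
    return input_tokens / 1e6 * price_in_per_1m + output_tokens / 1e6 * price_out_per_1m

# Model A: fewer active parameters, but priced higher per token by its provider.
model_a = cost_per_query(2_000, 500, price_in_per_1m=1.00, price_out_per_1m=3.00)
# Model B: more parameters, but cheaper per token.
model_b = cost_per_query(2_000, 500, price_in_per_1m=0.50, price_out_per_1m=1.50)

print(f"Model A: ${model_a:.4f}/query, Model B: ${model_b:.4f}/query")
# The parameter-count proxy would favor Model A; the dollar cost favors Model B.
```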
3. How does the case study of the NovelQA benchmark illustrate the challenges of using model evaluation benchmarks for downstream evaluation?
- NovelQA evaluates models by asking all questions about a novel at once, but in practice users ask questions one at a time, which is far more costly because the long context must be reprocessed for every question.
- This makes retrieval-augmented generation (RAG) look much worse on NovelQA than it would in a real-world scenario, where RAG costs roughly 20 times less than long-context models while being equally accurate.
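A toy cost calculation illustrating why the evaluation setup matters; the token counts and price are placeholders, not figures from the paper, and output tokens are ignored for simplicity.

```python
# Illustrative arithmetic for a NovelQA-style setup: a ~100k-token novel, 20 questions.
# All numbers are invented placeholders.

novel_tokens = 100_000
num_questions = 20
price_per_1m_input = 5.0          # hypothetical long-context input price ($/1M tokens)
retrieved_tokens_per_q = 2_000    # hypothetical RAG retrieval budget per question

# Benchmark setting: all questions asked in one call, so the novel is processed once.
batched_cost = novel_tokens / 1e6 * price_per_1m_input

# Realistic setting: questions arrive one at a time, so the novel is reprocessed each time.
sequential_cost = num_questions * novel_tokens / 1e6 * price_per_1m_input

# RAG: only a few retrieved chunks are processed per question.
rag_cost = num_questions * retrieved_tokens_per_q / 1e6 * price_per_1m_input

print(batched_cost, sequential_cost, rag_cost)  # 0.5, 10.0, 0.2
```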
[04] Agent benchmarks allow shortcuts
1. What are the key types of overfitting that can occur on agent benchmarks?
- Agent benchmarks tend to be small, often only a few hundred samples, which makes them susceptible to overfitting, for example via lookup tables that memorize test answers.
- Overfitting can also occur to the specific tasks represented in the benchmark, rather than generalizing to the intended domain or task.
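A sketch of the lookup-table failure mode: an "agent" that memorizes the benchmark's answers scores perfectly while having no real capability. The sample data is invented.

```python
# An "agent" that memorizes benchmark answers keyed by question text.
# It scores perfectly on the benchmark's few hundred samples but generalizes to nothing.

benchmark = [
    {"question": "Who wrote Middlemarch?", "answer": "George Eliot"},
    # ... a few hundred more samples
]

LOOKUP = {sample["question"]: sample["answer"] for sample in benchmark}

def lookup_table_agent(question: str) -> str:
    # Perfect on the benchmark, useless on any unseen question.
    return LOOKUP.get(question, "I don't know")

accuracy = sum(
    lookup_table_agent(s["question"]) == s["answer"] for s in benchmark
) / len(benchmark)
print(accuracy)  # 1.0 on the benchmark, despite zero generalization
```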
2. How do the authors propose to address the issue of overfitting based on the intended generality of the agent?
- The authors identify four levels of generality for agents: distribution-specific, task-specific, domain-general, and general-purpose.
- They argue that the more general the intended agent, the more the held-out set should differ from the training set to prevent overfitting.
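One way this principle could be operationalized, sketched under an assumed sample schema (`task`, `domain`, `split` keys); the level names follow the paper, but the splitting logic is an illustrative assumption, not the authors' procedure.

```python
# Sketch: the more general the intended agent, the more the held-out set should
# differ from the development data. The schema and splitting rules are assumptions.

def build_holdout(samples, level):
    """samples: list of dicts with 'task', 'domain', and 'split' keys (hypothetical)."""
    if level == "distribution-specific":
        # Hold out unseen samples drawn from the same distribution.
        return [s for i, s in enumerate(samples) if i % 5 == 0]
    if level == "task-specific":
        # Hold out samples of the same task drawn from a shifted distribution.
        return [s for s in samples if s.get("split") == "shifted_distribution"]
    if level == "domain-general":
        # Hold out entire tasks within the domain that the agent never saw.
        return [s for s in samples if s["task"] in {"unseen_task_a", "unseen_task_b"}]
    if level == "general-purpose":
        # Hold out whole domains.
        return [s for s in samples if s["domain"] == "unseen_domain"]
    raise ValueError(f"unknown level: {level}")
```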
3. What does the case study of the SteP agent on the WebArena benchmark show?
- The SteP agent achieves high accuracy on WebArena by hard-coding policies for the specific tasks in the benchmark, rather than developing a general web agent.
- This makes the WebArena leaderboard accuracy misleading for downstream developers, as the agent is not robust to changes in the websites or tasks.
[05] Inadequate benchmark standardization leads to irreproducible agent evaluations
1. What are the root causes for the lack of standardized and reproducible agent evaluations?
- Evaluation scripts make assumptions about agent design that aren't satisfied by all agents.
- Repurposing language model evaluation benchmarks for agent evaluation introduces inconsistencies.
- The high cost of evaluating agents makes repeated runs, and therefore confidence-interval estimates, difficult to obtain (see the sketch after this list).
- Agent evaluation relies on external factors like interacting with environments, leading to subtle errors.
- The lack of standardized evaluation leads to subtle bugs in agent evaluation and development.
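As referenced above, one inexpensive way to quantify uncertainty from a single evaluation run is a bootstrap over per-task outcomes; this captures only across-task sampling variance, not run-to-run variance of the agent. The outcome data below is invented.

```python
# Bootstrap a confidence interval for accuracy from one run's per-task pass/fail results,
# avoiding the cost of re-running the agent many times.
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

outcomes = [1] * 130 + [0] * 70   # e.g. 65% accuracy on a 200-task benchmark
print(bootstrap_ci(outcomes))     # roughly (0.58, 0.72) for this toy example
```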
2. Why is the development of a standardized agent evaluation framework important?
- The lack of clear standards for providing agent evaluation scripts, the differences between model and agent benchmarks, and the scope for bugs in agent development and evaluation all contribute to the reproducibility issues.
- A standardized agent evaluation framework, similar to those developed for language model evaluation, could help address these shortcomings and provide a firm foundation for progress in AI agent research.