RouteLLM: Learning to Route LLMs with Preference Data
Abstract
The paper examines the trade-off between performance and cost when using large language models (LLMs) and proposes router models that dynamically select between a stronger and a weaker LLM at inference time. The key points are:
- LLMs exhibit impressive capabilities, but more powerful models are more expensive to use
- The authors develop a training framework for router models that leverages human preference data and data augmentation techniques to enhance performance
- Evaluation on benchmarks shows the router models can significantly reduce costs (over 2x in certain cases) without compromising response quality
- The router models also demonstrate transfer learning capabilities, maintaining performance even when the strong and weak models are changed at test time
Q&A
[01] LLM Routing Problem Formulation
1. How is the LLM routing problem formulated?
- The LLM routing problem is formulated as learning a routing function that sends each query to either a stronger LLM or a weaker LLM so as to balance cost against response quality.
- The routing function has two components: a win prediction model that estimates the probability of the stronger model winning, and a cost threshold that converts the winning probability into a routing decision.
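The two components described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `make_router` and `toy_win_prob` are hypothetical names, and the toy win predictor (based on query length) merely stands in for a learned model. The sketch assumes a query goes to the strong model when the predicted win probability reaches the threshold.

```python
from typing import Callable


def make_router(win_prob: Callable[[str], float], threshold: float) -> Callable[[str], str]:
    """Combine a win-prediction model and a cost threshold into a routing function."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")

    def route(query: str) -> str:
        # Use the strong model only when the predicted probability that
        # the strong model wins reaches the threshold.
        return "strong" if win_prob(query) >= threshold else "weak"

    return route


def toy_win_prob(query: str) -> float:
    # Hypothetical stand-in for a trained win predictor: longer queries
    # are treated as harder. A real router would use a learned model.
    return min(1.0, len(query.split()) / 20.0)


route = make_router(toy_win_prob, threshold=0.5)
print(route("What is 2 + 2?"))
```

Raising the threshold sends fewer queries to the strong model and lowers cost; lowering it does the opposite.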
2. What metrics are used to evaluate the LLM routing approaches?
- Cost efficiency is measured by the percentage of calls to the stronger model.
- Quality is measured by the average response quality on an evaluation set.
- The overall performance gain is quantified by the performance gap recovered (PGR) metric: the fraction of the performance gap between the weak and strong models that the router recovers.
- The average performance gap recovered (APGR) and call-performance threshold (CPT) metrics are used to capture the quality-cost trade-off.
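As a rough sketch of how these metrics relate, the code below assumes that PGR is the router's gain over the weak model divided by the weak-to-strong gap, that APGR averages PGR over evenly spaced strong-call fractions, and that CPT(x%) is the smallest strong-call fraction at which PGR reaches x%. The function names and all numbers are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def pgr(router_score: float, weak_score: float, strong_score: float) -> float:
    """Fraction of the weak-to-strong performance gap recovered by the router."""
    gap = strong_score - weak_score
    if gap <= 0:
        raise ValueError("strong_score must exceed weak_score")
    return (router_score - weak_score) / gap


def apgr(pgr_values: Sequence[float]) -> float:
    """Mean PGR over evenly spaced strong-call fractions (a discrete approximation)."""
    if not pgr_values:
        raise ValueError("need at least one PGR value")
    return sum(pgr_values) / len(pgr_values)


def cpt(curve: Sequence[Tuple[float, float]], target_pgr: float) -> Optional[float]:
    """Smallest strong-call fraction whose PGR reaches target_pgr, or None."""
    reached = [frac for frac, value in curve if value >= target_pgr]
    return min(reached) if reached else None


# Hypothetical (strong-call fraction, PGR) points for one router.
curve = [(0.0, 0.0), (0.25, 0.4), (0.5, 0.7), (0.75, 0.9), (1.0, 1.0)]
print(pgr(7.0, 6.0, 8.0))
print(apgr([value for _, value in curve]))
print(cpt(curve, 0.5))
```

A router that reaches a given PGR at a lower strong-call fraction than random routing is the more cost-efficient one.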
[02] Methodology
1. How is the preference data for training the routing function obtained?
- The primary training data is from the Chatbot Arena platform, which contains user prompts, responses from two anonymous models, and human preference labels.
- To reduce label sparsity, the models are clustered into 10 tiers based on their Elo scores, and the preference data is derived from battles between the top two tiers (strong models) and the third tier (weak models).
- Data augmentation techniques are also explored, including using golden-labeled datasets like MMLU and generating preference labels using a GPT-4 judge on the Nectar dataset.
2. What routing approaches are explored in the paper?
- Similarity-weighted (SW) ranking: A Bradley-Terry model in which each training query is weighted by its similarity to the incoming query.
- Matrix factorization: A bilinear scoring function that models the interaction between the query and model embeddings.
- BERT classifier: A BERT-based text classification model that predicts the win probability.
- Causal LLM classifier: An instruction-following LLM that predicts the win probability in a next-token prediction fashion.
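To make the matrix factorization idea concrete, here is a simplified sketch of a bilinear scoring function. It assumes the query embedding has already been projected to the same dimension as the model embeddings and that the win probability is a sigmoid of the score difference. The paper's exact parameterization may differ, and all names and vectors here are hypothetical.

```python
import math
from typing import Sequence


def bilinear_score(model_vec: Sequence[float], query_vec: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Score = sum_k weights[k] * model_vec[k] * query_vec[k] (bilinear in model and query)."""
    if not (len(model_vec) == len(query_vec) == len(weights)):
        raise ValueError("all vectors must have the same length")
    return sum(w * m * q for w, m, q in zip(weights, model_vec, query_vec))


def strong_win_prob(strong_vec: Sequence[float], weak_vec: Sequence[float],
                    query_vec: Sequence[float], weights: Sequence[float]) -> float:
    """Sigmoid of the score difference between the strong and weak models."""
    diff = (bilinear_score(strong_vec, query_vec, weights)
            - bilinear_score(weak_vec, query_vec, weights))
    return 1.0 / (1.0 + math.exp(-diff))


# Hypothetical two-dimensional embeddings, for illustration only.
strong_vec = [1.0, 0.5]
weak_vec = [0.2, 0.5]
weights = [1.0, 1.0]
query_vec = [2.0, 0.0]
print(strong_win_prob(strong_vec, weak_vec, query_vec, weights))
```

In a trained router the embeddings and weights would be learned from preference data; the resulting probability is then compared against the cost threshold.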
[03] Experiments and Results
1. How do the routing approaches perform on the evaluation benchmarks?
- On MT Bench, the matrix factorization and BERT classifier routers trained on the augmented dataset achieve the best performance, requiring up to 50% fewer GPT-4 calls than the random baseline to achieve a given performance target.
- On MMLU, all routers perform poorly when trained only on the Chatbot Arena dataset, but significantly improve when the training data is augmented with golden-labeled data from the MMLU validation set.
- On GSM8K, the causal LLM classifier trained on the dataset augmented with LLM-judge-labeled data performs the best, requiring 17% fewer GPT-4 calls than random to achieve a given performance target.
2. How do the routers generalize to different model pairs?
- The routers are evaluated on MT Bench using a new model pair (Claude 3 Opus and Llama 3 8B), without any retraining.
- The results show that the routers maintain strong performance, requiring up to 30% fewer calls to the strong model than random to achieve a given performance target, demonstrating the generalizability of the routing approaches.
3. What are the cost savings achieved by the routing approaches?
- The top-performing routers can achieve optimal cost savings of up to 3.66x compared to the random baseline, demonstrating the significant cost reductions possible while maintaining response quality.
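For intuition about how a reduction in strong-model calls translates into a cost ratio, the sketch below uses a simple blended-cost model: the average cost per query is the call-fraction-weighted mix of the two models' per-query costs. This is an illustrative assumption, not the paper's cost analysis, and the prices and fractions are hypothetical placeholders.

```python
def blended_cost(strong_frac: float, strong_cost: float, weak_cost: float) -> float:
    """Average per-query cost when strong_frac of queries go to the strong model."""
    if not 0.0 <= strong_frac <= 1.0:
        raise ValueError("strong_frac must lie in [0, 1]")
    return strong_frac * strong_cost + (1.0 - strong_frac) * weak_cost


def savings_ratio(baseline_frac: float, router_frac: float,
                  strong_cost: float, weak_cost: float) -> float:
    """Baseline cost divided by router cost; values above 1 mean the router is cheaper."""
    return (blended_cost(baseline_frac, strong_cost, weak_cost)
            / blended_cost(router_frac, strong_cost, weak_cost))


# Hypothetical per-query costs in arbitrary units, not real prices.
STRONG_COST = 10.0
WEAK_COST = 1.0

# Suppose a baseline needs 50% strong calls to hit a quality target
# and the router needs only 25% (both figures are made up).
print(savings_ratio(0.5, 0.25, STRONG_COST, WEAK_COST))
```

The ratio grows with the price gap between the two models, so the same reduction in strong-model calls is worth more when the strong model is much more expensive.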
[04] Conclusion
1. What are the key contributions of this work?
- The authors formulate the LLM routing problem and propose a principled framework for learning routing functions using human preference data and data augmentation techniques.
- They demonstrate that their routing approaches can significantly reduce costs (over 2x in certain cases) without compromising response quality on widely recognized benchmarks.
- The routers also exhibit strong transfer learning capabilities, maintaining performance when the strong and weak models are changed at test time.
2. What are the limitations and future directions mentioned in the paper?
- The distributions of real-world applications may differ substantially from the evaluation benchmarks, so collecting a small amount of in-domain data can improve performance via dataset augmentation.
- Extending the routing framework to handle more than two models is a promising future direction.
- Why different routers trained on the same dataset perform differently on the same benchmark requires further investigation.