Self-Taught Evaluators
Abstract
The article presents an approach to improve evaluators for large language models (LLMs) without using human annotations. The proposed "Self-Taught Evaluator" method iteratively trains an LLM-as-a-Judge model using synthetically generated preference data, starting from an initial seed model. The key steps are:
- Select a challenging set of instructions from a large pool of human-written instructions
- Generate synthetic preference pairs by prompting the seed model to answer each instruction (the "good" response) and a similar but deliberately different instruction (the "bad" response for the original)
- Use the current LLM-as-a-Judge model to annotate the synthetic preference pairs, retaining only the correctly judged examples
- Fine-tune the LLM-as-a-Judge model on the annotated synthetic data
The authors show that this iterative self-improvement approach can significantly boost the performance of a strong seed model (Llama3-70B-Instruct) on evaluation benchmarks like RewardBench, matching or outperforming models trained with human-annotated preference data.
Q&A
[01] Iterative Self-Improvement Approach
1. What are the key steps in the proposed "Self-Taught Evaluator" method? The key steps are (see the sketch after this list):
- Select a challenging set of instructions from a large pool of human-written instructions
- Generate synthetic preference pairs by prompting the seed model to answer each instruction (the "good" response) and a similar but deliberately different instruction (the "bad" response for the original)
- Use the current LLM-as-a-Judge model to annotate the synthetic preference pairs, retaining only the correctly judged examples
- Fine-tune the LLM-as-a-Judge model on the annotated synthetic data
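The following minimal Python sketch shows how a single synthetic training example might be constructed and filtered. It assumes a hypothetical `generate(model, prompt)` completion helper and a hypothetical `judge(model, instruction, response_a, response_b)` call returning "A" or "B"; neither is the paper's actual API, the prompt wording is paraphrased, and the sampling budget is illustrative.

```python
def build_preference_example(generate, model, instruction):
    """Construct one synthetic preference pair for a given instruction."""
    # "Good" response: answer the original instruction directly.
    chosen = generate(model, instruction)

    # "Bad" response: ask for a related-but-different instruction, then answer
    # that instead, so the result looks plausible but misses the original request.
    similar_instruction = generate(
        model,
        "Write an instruction that is similar to, but different from, "
        f"the following one:\n{instruction}",
    )
    rejected = generate(model, similar_instruction)

    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}


def correct_judgement_or_none(judge, model, example, n_samples=5):
    """Sample judgements from the current model and return one that prefers the
    known-good response; return None (discard the example) if none is correct."""
    for _ in range(n_samples):
        verdict = judge(model, example["instruction"],
                        example["chosen"], example["rejected"])
        if verdict == "A":  # "A" = the known-good (chosen) response wins
            return verdict
    return None
```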
2. How does the iterative training process work? The process works as follows (see the sketch after the list):
- Start with a seed LLM model (Llama3-70B-Instruct)
- In each iteration:
  - Generate synthetic preference pairs using the current model
  - Annotate the synthetic pairs using the current model, keeping only the correctly judged examples
  - Fine-tune the model on the annotated synthetic data to obtain the updated model for the next iteration
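A compact sketch of the outer loop, reusing the hypothetical helpers from the previous sketch plus an equally hypothetical `fine_tune(model, examples)` routine standing in for supervised fine-tuning on the retained judgments; this illustrates the described procedure rather than the authors' training code.

```python
def self_taught_evaluator(seed_model, instructions, generate, judge, fine_tune,
                          n_iterations=5):
    """Iteratively improve an LLM-as-a-Judge on its own correctly judged
    synthetic preference pairs (illustrative, not the paper's implementation)."""
    model = seed_model
    for _ in range(n_iterations):
        train_set = []
        for instruction in instructions:
            # 1) Build a synthetic preference pair with the current model.
            example = build_preference_example(generate, model, instruction)
            # 2) Keep it only if the current judge labels it correctly.
            verdict = correct_judgement_or_none(judge, model, example)
            if verdict is not None:
                train_set.append({**example, "judgement": verdict})
        # 3) Fine-tune on the retained judgments to get the next iteration's model.
        model = fine_tune(model, train_set)
    return model
```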
3. What are the benefits of the proposed approach compared to using human-annotated preference data? The key benefits are:
- It does not require any human-annotated preference data, making it more scalable and less costly
- As the model improves over iterations, the quality of the synthetic training data also improves, providing an automatic curriculum
- It can outperform models trained on human-annotated preference data, as shown on the RewardBench benchmark
[02] Experimental Results
1. How does the performance of the Self-Taught Evaluator model compare to the seed Llama3-70B-Instruct model on RewardBench? The Self-Taught Evaluator model significantly improves over the seed Llama3-70B-Instruct model on RewardBench:
- Seed Llama3-70B-Instruct model: 75.4
- Self-Taught Evaluator (iteration 5): 88.3
- With majority voting over 32 sampled judgments: 88.7 (see the sketch below)
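For the majority-voting result, the judge is sampled several times per comparison and the most frequent verdict is taken. A minimal sketch, assuming a hypothetical `sample_judgement(instruction, response_a, response_b)` call that returns "A" or "B" for one stochastic judge sample:

```python
from collections import Counter

def majority_vote(sample_judgement, instruction, response_a, response_b,
                  n_samples=32):
    """Sample the judge n_samples times and return the most common verdict."""
    votes = Counter(
        sample_judgement(instruction, response_a, response_b)
        for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```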
2. How does the Self-Taught Evaluator model compare to other baselines on RewardBench? The Self-Taught Evaluator model matches or outperforms other strong baselines on RewardBench, including:
- Llama3-70B-Instruct model trained on the HelpSteer2 dataset: 85.6
- GPT4-0125: 84.3
- Gemini 1.5 Pro 0514: 88.1
3. How does the performance of the Self-Taught Evaluator model vary across different categories in RewardBench? The Self-Taught Evaluator model improves markedly on Chat Hard, Safety, and Reasoning, while remaining roughly on par with the seed Llama3-70B-Instruct model on Chat:
- Chat: 96.6 (vs 97.6 for seed)
- Chat Hard: 84.2 (vs 58.9 for seed)
- Safety: 91.5 (vs 69.2 for seed)
- Reasoning: 81.0 (vs 78.5 for seed)
[03] Ablations and Analysis
1. How does the performance of the Self-Taught Evaluator model change when using synthetic data from different sources? The authors experimented with synthetic data from different sources, such as coding instructions, mathematical reasoning, and the HelpSteer2 dataset, and found that each source tended to improve the RewardBench categories most closely related to its own distribution.
2. How does the performance compare when using a direct "bad response" generation approach vs. the proposed "similar instruction" approach? Generating the rejected response by answering a similar but different instruction performed better (83.8 on RewardBench) than directly prompting the model to produce a "bad" response (80.7); both strategies are sketched below.
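A hedged illustration of the two ablated strategies for producing the rejected response, assuming the same hypothetical `generate(model, prompt)` helper as above; both prompt texts are paraphrases rather than the paper's exact templates.

```python
def rejected_via_direct_prompt(generate, model, instruction):
    # Baseline ablation: explicitly ask for a flawed answer. The resulting
    # errors tend to be easy to spot, which is one plausible reason this
    # variant scores lower (80.7 vs 83.8 on RewardBench).
    return generate(
        model,
        f"Write a plausible-looking but subtly incorrect answer to:\n{instruction}",
    )


def rejected_via_similar_instruction(generate, model, instruction):
    # Preferred approach: answer a related-but-different instruction, yielding
    # a fluent response that nonetheless fails to address the original request.
    similar_instruction = generate(
        model,
        "Write an instruction that is similar to, but different from, "
        f"the following one:\n{instruction}",
    )
    return generate(model, similar_instruction)
```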
3. How does the performance of the Self-Taught Evaluator model compare to using human-annotated preference data from HelpSteer2? The authors found that the iterative training on synthetic preferences outperformed the model trained on the HelpSteer2 human-annotated data (88.3 vs 85.6 on RewardBench). Combining the synthetic and human-annotated data led to slight improvements over the synthetic-only approach.