Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
Abstract
The article introduces FLAMe, a family of Foundational Large Autorater Models that can perform a wide variety of quality assessment tasks. FLAMe is trained on a large and diverse collection of curated and standardized human evaluations derived exclusively from permissively licensed datasets. The article demonstrates FLAMe's strong zero-shot generalization, outperforming models trained on proprietary data, such as GPT-4 and Claude-3, on many held-out tasks. FLAMe also serves as a powerful starting point for further downstream fine-tuning, as shown for reward-modeling evaluation. The article further presents a computationally efficient approach to optimizing the FLAMe multitask mixture for targeted distributions. Overall, FLAMe variants outperform popular proprietary LLM-as-a-Judge models across various autorater evaluation benchmarks, while exhibiting lower bias and effectively identifying high-quality responses for code generation.
Q&A
[01] Introduction
1. What are the key challenges in reliably evaluating the output of large language models (LLMs)? The key challenges in reliably evaluating LLM output are:
- The high costs of human evaluation
- The subjectivity and variability among human raters
- The lack of standardization and documentation in existing human evaluation datasets
2. What are the limitations of using model outputs for autorater training? Using model outputs for autorater training carries risks such as:
- Reinforcing biases and hallucinations in the model outputs
- Violating terms of use for proprietary LLM services, which prohibit using their models' outputs to develop competing models
3. How does the FLAMe approach address these limitations? FLAMe addresses these limitations by:
- Curating and standardizing human evaluations from prior research to create a large and diverse collection of 102 quality assessment tasks comprising over 5.3M human judgments
- Using only publicly available human evaluation data with permissive licenses
- Reformatting all tasks into a unified text-to-text format to facilitate effective transfer learning across tasks
[02] The FLAMe Collection
1. What are the key principles guiding the FLAMe data collection? The key principles are:
- Using only public, open-source datasets with permissive licenses
- Relying on human-labeled annotations
- Covering a variety of task types (pairwise evaluation, pointwise evaluation, classification, open-ended evaluation)
- Assessing diverse LLM capabilities (general response quality, factuality/attribution, mathematical reasoning, coding, safety, instruction tuning)
2. How did the authors standardize the diverse datasets into a unified format? The authors:
- Thoroughly reviewed the associated research and consulted with the original authors to address ambiguities or inconsistencies
- Extracted the relevant data fields containing human annotations
- Meticulously created detailed task definitions and evaluation instructions for each quality assessment task
- Reformatted all tasks into a flexible text-to-text format, with task definitions, evaluation instructions, and input/target fields (a minimal sketch of this format follows below)
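To make the unified format concrete, here is a minimal sketch, assuming illustrative field names (`task_definition`, `evaluation_instruction`, `context`, `candidates`) rather than the authors' exact schema; it serializes one human-evaluation record into the kind of (input, target) pair a text-to-text model consumes.

```python
# Hypothetical sketch of the unified text-to-text format described above.
# Field names are illustrative, not the exact schema used by the FLAMe authors.

def to_text_to_text(task_definition, evaluation_instruction, context, candidates, human_label):
    """Serialize one human-evaluation record into an (input, target) pair."""
    input_text = "\n\n".join([
        f"Task definition: {task_definition}",
        f"Evaluation instruction: {evaluation_instruction}",
        f"Context: {context}",
        "\n".join(f"Response {chr(65 + i)}: {c}" for i, c in enumerate(candidates)),
    ])
    target_text = human_label  # e.g. "Response A", a 1-5 rating, or a class label
    return input_text, target_text

# Example: a pairwise helpfulness comparison becomes a single text-to-text record.
inp, tgt = to_text_to_text(
    task_definition="Pairwise response quality evaluation.",
    evaluation_instruction="Pick the more helpful response.",
    context="User: How do I flatten a nested list in Python?",
    candidates=["Use itertools.chain.from_iterable(...).", "Sorry, I can't help."],
    human_label="Response A",
)
```

Casting every task into this shape is what lets a single model transfer across pairwise, pointwise, classification, and open-ended evaluations.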
[03] Model
1. What are the three FLAMe model variants described in the article? The three FLAMe model variants are:
- FLAMe: Trained with examples-proportional mixture weights on the full FLAMe multitask mixture
- FLAMe-RM: FLAMe fine-tuned for 50 steps on a balanced mixture of four pairwise evaluation datasets, spanning chat, reasoning, and safety
- FLAMe-Opt-RM: FLAMe optimized for reward modeling evaluation using a novel tail-patch fine-tuning strategy to determine the optimal dataset proportions in the multitask mixture
2. How does the FLAMe-Opt-RM approach work? FLAMe-Opt-RM uses a tail-patch ablation strategy to analyze the impact of each dataset on the targeted RewardBench distribution: the checkpoint is briefly fine-tuned on a small amount of each candidate dataset, and the resulting change in performance on the target distribution indicates how much that dataset helps. These impact estimates are used to set the proportions of individual datasets in the multitask mixture, which is then used to fine-tune the initial PaLM-2-24B checkpoint for 5000 steps.
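To illustrate the tail-patch idea, here is a minimal sketch under stated assumptions: the caller supplies `finetune_briefly` and `evaluate` callables (placeholder names, not the authors' API), and the per-dataset gains on a RewardBench-style dev set are normalized into mixture proportions.

```python
# Hypothetical sketch of tail-patch ablation for mixture optimization.
# `finetune_briefly` and `evaluate` are caller-supplied placeholders.

def tail_patch_weights(checkpoint, datasets, target_dev_set,
                       finetune_briefly, evaluate, steps=50):
    """Estimate mixture proportions from per-dataset impact on a target dev set.

    datasets: dict mapping dataset name -> training examples.
    steps: small number of fine-tuning steps per patch (value is illustrative).
    """
    baseline = evaluate(checkpoint, target_dev_set)
    impact = {}
    for name, data in datasets.items():
        patched = finetune_briefly(checkpoint, data, num_steps=steps)
        impact[name] = evaluate(patched, target_dev_set) - baseline

    # Keep only datasets that help the target distribution and normalize
    # their gains into mixture proportions for the next training run.
    positive = {name: max(gain, 0.0) for name, gain in impact.items()}
    total = sum(positive.values()) or 1.0
    return {name: gain / total for name, gain in positive.items()}
```

The appeal of this strategy is cost: each dataset is probed with only a handful of steps, rather than retraining the full mixture for every candidate weighting.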
[04] Experiments
1. What are the key findings from the evaluation of FLAMe variants across different benchmarks?
- FLAMe variants outperform all LLM-as-a-Judge baselines on 8 out of 12 evaluation benchmarks.
- FLAMe-RM-24B is the top-performing generative model trained exclusively on permissively licensed data on the RewardBench benchmark.
- FLAMe variants achieve the best performance on the LLM-AggreFact benchmark for assessing the factual grounding of model outputs.
2. How do FLAMe variants compare to other models in terms of autorater bias? The analysis shows that FLAMe variants exhibit significantly lower bias than popular LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, performing better across bias categories such as order, length, and attention (a sketch of how order bias can be probed follows below).
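As an illustration of how one such bias can be probed, here is a minimal sketch (not the CoBBLEr harness): the autorater is queried twice with the candidate responses in both orders, and the fraction of inconsistent verdicts estimates its order bias. `judge` is an assumed callable returning "A" or "B".

```python
# Minimal sketch of probing pairwise order bias; `judge` is an assumed callable.

def order_bias_rate(judge, pairs):
    """pairs: list of (context, response_1, response_2) tuples."""
    flips = 0
    for context, r1, r2 in pairs:
        first = judge(context, r1, r2)    # r1 presented as "A"
        second = judge(context, r2, r1)   # order swapped, r1 presented as "B"
        # A position-insensitive judge should pick the same underlying response.
        consistent = (first == "A" and second == "B") or \
                     (first == "B" and second == "A")
        flips += 0 if consistent else 1
    return flips / len(pairs)
```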
[05] Further Analysis
1. How can FLAMe be used to improve code generation performance? The article demonstrates that using FLAMe to re-rank code samples generated by weaker LLMs (e.g., davinci-002, InCoder-6B, CodeGen-16B) can significantly improve their pass@1 accuracy on the HumanEval benchmark, closing nearly 40% of the gap to the Oracle ranker.
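A minimal sketch of such re-ranking, assuming an `autorater_prefers_first` callable that wraps a FLAMe-style pairwise code-quality prompt (the name and the round-robin tournament scheme are illustrative, not the paper's exact procedure): every candidate is compared against every other, and the sample with the most pairwise wins is the one submitted for pass@1 scoring.

```python
# Hypothetical sketch of re-ranking generated code samples with a pairwise
# autorater; `autorater_prefers_first` is an assumed callable.

def rerank_code_samples(problem, samples, autorater_prefers_first):
    """Return the candidate with the most pairwise wins in a round-robin tournament."""
    wins = [0] * len(samples)
    for i in range(len(samples)):
        for j in range(len(samples)):
            if i == j:
                continue
            if autorater_prefers_first(problem, samples[i], samples[j]):
                wins[i] += 1
    # The top-ranked sample is the one evaluated for pass@1.
    return samples[max(range(len(samples)), key=lambda i: wins[i])]
```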
2. What are the key limitations and future work directions mentioned in the article? Limitations include:
- Evaluating LLMs on evolving standards and new capabilities
- Potential performance issues on multilingual and long-context tasks
Future work directions:
- Expanding the data collection with open-source contributions
- Exploring alternative training approaches like RLHF and DPO