HelpSteer2: Open-source dataset for training top-performing reward models
Abstract
The article discusses the release of HelpSteer2, a permissively licensed open-source helpfulness dataset designed to train state-of-the-art reward models. The key points are:
- HelpSteer2 is a high-quality dataset of 10,000 response pairs, annotated with Likert-5 scores for helpfulness, correctness, coherence, complexity, and verbosity.
- Reward models trained on HelpSteer2 achieve state-of-the-art performance on the Reward Bench benchmark, outperforming both proprietary and open-source models.
- The authors demonstrate how reward models trained on HelpSteer2 can be used to effectively align large language models with human preferences, using techniques like Iterative Direct Preference Optimization, Proximal Policy Optimization, and SteerLM 2.0.
Q&A
[01] Dataset Collection
1. What are the key sources of prompts used in the HelpSteer2 dataset? The prompts in HelpSteer2 are primarily sourced from the ShareGPT dataset, which contains real-world conversational prompts from ChatGPT users. The authors also supplemented this with a small proportion of proprietary prompts focused on use cases like summarization, closed question answering, and extraction.
2. How did the authors ensure diversity in the prompts? The authors used BERTopic to cluster similar prompts into around 1000 topics and then sampled uniformly from each topic (sketched below). They also weighted the highest-complexity prompts more heavily to ensure the dataset contained a range of prompt difficulties.
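The exact clustering code is not part of this summary; a minimal sketch of the described step using the public BERTopic API could look like this (the topic count and per-topic sample budget are illustrative, not the authors' settings):

```python
import random
from collections import defaultdict

from bertopic import BERTopic  # pip install bertopic

def sample_diverse_prompts(prompts, n_topics=1000, per_topic=10, seed=0):
    """Cluster prompts into topics, then sample uniformly from each topic."""
    topic_model = BERTopic(nr_topics=n_topics)       # reduce to roughly n_topics clusters
    topics, _ = topic_model.fit_transform(prompts)   # one topic id per prompt

    by_topic = defaultdict(list)
    for prompt, topic in zip(prompts, topics):
        if topic != -1:                              # -1 is BERTopic's outlier bucket
            by_topic[topic].append(prompt)

    rng = random.Random(seed)
    sampled = []
    for topic_prompts in by_topic.values():
        k = min(per_topic, len(topic_prompts))
        sampled.extend(rng.sample(topic_prompts, k))  # equal sampling budget per topic
    return sampled
```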
3. How did the authors handle multi-turn prompts? For multi-turn prompts, the authors replaced the original ChatGPT responses with responses generated by their own internal 22B model, which was fine-tuned on data from Open Assistant and HH-RLHF.
4. What were the sources of the responses in HelpSteer2? The responses came from a variety of sources:
- Internal NVIDIA models from 3 generations (Nemotron-2, Nemotron-3, Nemotron-4)
- Mixtral-8x7B-Instruct-v0.1 model
- Human annotators from Scale AI
5. How did the authors improve the annotation process compared to the original HelpSteer dataset? The key improvements were:
- Requiring at least 3 annotators per response, with additional annotators added when disagreement was high (one possible disagreement check is sketched after this list)
- Asking annotators to rate two responses to the same prompt sequentially to facilitate more calibrated scoring
- Engaging around 1000 US-based annotators, compared to 200 in the original HelpSteer
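The summary does not state what counts as "high disagreement"; the sketch below illustrates one plausible rule, flagging a response for extra annotation when the spread of its Likert-5 ratings exceeds an arbitrary cutoff (the cutoff and the mean aggregation are assumptions for illustration):

```python
from statistics import mean

DISAGREEMENT_CUTOFF = 2  # illustrative: max-min spread on the Likert-5 scale

def aggregate_ratings(ratings, min_annotators=3):
    """Aggregate per-response Likert-5 ratings; flag responses needing more annotators."""
    if len(ratings) < min_annotators:
        return {"score": None, "needs_more_annotators": True}
    spread = max(ratings) - min(ratings)
    return {
        "score": mean(ratings),                        # simple mean over annotators
        "needs_more_annotators": spread > DISAGREEMENT_CUTOFF,
    }

# Example: three annotators rate the helpfulness of one response
print(aggregate_ratings([4, 4, 1]))   # high disagreement -> request another annotator
print(aggregate_ratings([4, 5, 4]))   # agreement -> keep the mean score
```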
[02] Dataset Analysis
1. How do the HelpSteer2 responses compare to the original HelpSteer dataset in terms of quality? The HelpSteer2 responses are more helpful, correct, coherent, verbose, and complex compared to HelpSteer. The most substantial improvement is in coherence, which reaches 3.63 out of 5.
2. How does the correlation between the attributes and helpfulness differ between HelpSteer and HelpSteer2? In HelpSteer2, coherence is a much weaker predictor of helpfulness compared to HelpSteer, likely because most responses are already highly coherent. Correctness has become a stronger predictor of helpfulness.
3. How does prompt complexity and length affect the helpfulness of responses in HelpSteer2? Helpfulness is slightly negatively correlated with prompt character length and number of turns, suggesting models perform worse on follow-up responses compared to initial responses. Response length is slightly positively correlated with helpfulness.
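These relationships are reported as correlations against the helpfulness score; a short sketch of how such an analysis could be reproduced from the released data is shown below (the file path is hypothetical, and the column names assume the attribute schema described above):

```python
import pandas as pd

# File path and column names are assumptions about the released dataset layout.
df = pd.read_json("helpsteer2_train.jsonl", lines=True)

df["prompt_chars"] = df["prompt"].str.len()
df["response_chars"] = df["response"].str.len()

attributes = ["correctness", "coherence", "complexity", "verbosity",
              "prompt_chars", "response_chars"]

# Pearson correlation of each attribute with the helpfulness rating
correlations = df[attributes].corrwith(df["helpfulness"])
print(correlations.sort_values(ascending=False))
```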
[03] Reward Model Training and Evaluation
1. What are the key differences in the reward model training approach compared to prior work? The authors train the reward models using a regression approach that predicts the scalar values of the 5 attributes, rather than a binary preference-based approach. This provides more information to the reward model about what constitutes a "good" response.
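The training code itself is not reproduced here; a minimal PyTorch sketch of the idea, a shared transformer backbone with a 5-way linear regression head trained with MSE against the annotated attribute scores, might look like the following (the backbone interface and last-token pooling are placeholders, not the authors' exact setup):

```python
import torch
import torch.nn as nn

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

class AttributeRegressionRM(nn.Module):
    """Reward model that regresses the five HelpSteer2 attribute scores."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                       # any transformer returning hidden states
        self.head = nn.Linear(hidden_size, len(ATTRIBUTES))

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_token = hidden[:, -1, :]                  # final-token representation (padding handling omitted)
        return self.head(last_token)                   # (batch, 5) predicted attribute scores

def regression_loss(predictions, target_scores):
    """MSE between predicted and annotated Likert-5 attribute values."""
    return nn.functional.mse_loss(predictions, target_scores)
```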
2. How do the reward models trained on HelpSteer2 perform on the Reward Bench benchmark? The reward models trained on HelpSteer2 achieve state-of-the-art performance on Reward Bench, outperforming both proprietary and open-source models. The Llama 3 70B model trained on HelpSteer2 achieves 92.0% overall accuracy, ranking 1st on the leaderboard as of June 12, 2024.
3. In which Reward Bench categories do the HelpSteer2 trained models excel? The models excel particularly in the Chat-Hard category, outperforming the second-best model by 6.5%. They also perform well on Safety and Reasoning, though not as strongly as the best models trained on specific datasets for those tasks.
[04] Aligned Model Training
1. What are the three approaches the authors use to align language models using the Llama 3 70B reward model? The three approaches are:
- Iterative Direct Preference Optimization (Iterative DPO); see the loss sketch after this list
- Proximal Policy Optimization (PPO)
- SteerLM 2.0 (an extension of the original SteerLM method)
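Of the three, only DPO has a compact closed-form objective. The sketch below shows the standard DPO loss, where the log-probabilities of the chosen and rejected responses come from the policy being trained and a frozen reference model; this is the generic loss, not NVIDIA's training code. In the iterative variant, the chosen/rejected pairs would be re-selected each round by ranking candidate responses with the reward model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: prefer the chosen response relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin; minimized when chosen >> rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```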
2. How do the aligned models perform on the evaluation metrics compared to the baseline SFT model? At least one of the aligned models matches or exceeds the performance of the Llama 3 70B Instruct model, which was trained on 10 million samples, across metrics like MT Bench, TruthfulQA, AlpacaEval 2.0 LC, and Arena Hard. This is achieved with substantially less training data (only 10,000 HelpSteer2 pairs and 100,000 SFT samples).
3. What are the key strengths of each aligned model approach?
- Iterative DPO excels on TruthfulQA and Arena Hard, likely due to the focus on correctness in the HelpSteer2 data.
- PPO performs best on AlpacaEval 2.0 LC, which tests for concise and informative responses.
- SteerLM 2.0 performs optimally on MT Bench, which evaluates complex multi-requirement prompts, likely due to its fine-grained attribute-based training.