Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Abstract
The article explores the trade-offs between generating synthetic data with a stronger but more expensive (SE) language model and with a weaker but cheaper (WC) language model when the goal is to improve the reasoning performance of language models. It evaluates the generated data on three key metrics (coverage, diversity, and false positive rate) and finds that data from WC models may have higher coverage and diversity, but also a higher false positive rate. The article then finetunes language models on data from SE and WC models in three settings: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup. The results show that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks, challenging the prevailing practice of relying on SE models for synthetic data generation.
Q&A
[01] Synthetic Data Analysis
1. What are the key metrics used to evaluate the synthetic data from the WC and SE models? The key metrics used to evaluate the synthetic data are:
- Coverage: The number of unique problems that are solved
- Diversity: The average number of unique solutions obtained per problem
- False Positive Rate (FPR): The percentage of solutions that arrive at the correct final answer through incorrect reasoning
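The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the record fields (`problem_id`, `solution`, `answer_correct`, `reasoning_correct`) and the function name are hypothetical, diversity is assumed to count only solutions with a correct final answer and to be averaged over all problems, and FPR is assumed to be measured over correct-answer solutions.

```python
from collections import defaultdict


def evaluate_synthetic_data(samples, num_problems):
    """Compute coverage, diversity, and FPR for a set of sampled solutions.

    `samples` is a list of dicts with the hypothetical keys 'problem_id',
    'solution', 'answer_correct', and 'reasoning_correct'.
    """
    unique_correct = defaultdict(set)  # problem_id -> distinct correct solutions
    num_correct = 0                    # solutions with a correct final answer
    num_false_positive = 0             # correct answer, incorrect reasoning

    for s in samples:
        if not s["answer_correct"]:
            continue
        num_correct += 1
        unique_correct[s["problem_id"]].add(s["solution"])
        if not s["reasoning_correct"]:
            num_false_positive += 1

    problems_solved = len(unique_correct)
    return {
        "problems_solved": problems_solved,
        # Coverage expressed as a fraction of all problems.
        "coverage": problems_solved / num_problems,
        # Assumption: average over all problems, not only the solved ones.
        "diversity": sum(len(v) for v in unique_correct.values()) / num_problems,
        # Assumption: rate over solutions whose final answer is correct.
        "fpr": num_false_positive / num_correct if num_correct else 0.0,
    }
```

In practice the `reasoning_correct` flag would have to come from human annotation or an automatic judge, since final-answer matching alone cannot detect false positives.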
2. How do the WC and SE models perform on these metrics?
- Coverage: The data from the WC model (Gemma2-9B) has higher coverage than the data from the SE model (Gemma2-27B), by 15-20% at both the low and high sampling budgets.
- Diversity: The data from the WC model is also more diverse than the data from the SE model, by 15-20% at both the low and high sampling budgets.
- FPR: However, the FPR of the WC-generated data is higher than that of the SE-generated data, by 10-15% according to both human and automatic evaluations.
3. What are the implications of these mixed signals from the synthetic data analysis? The mixed signals of high coverage and diversity coupled with a high FPR make it unclear whether it is compute-optimal to sample from the WC model or the SE model for training strong reasoners. This is further explored in the finetuning experiments.
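To make the fixed-budget comparison concrete, the sketch below estimates how many WC samples fit in the compute spent on a given number of SE samples. It assumes that the cost of generating a token scales linearly with parameter count and that both models produce solutions of similar length; the function name is hypothetical, and the parameter counts are the nominal sizes implied by the model names.

```python
def compute_matched_wc_samples(se_samples_per_problem, se_params, wc_params):
    """Number of WC samples per problem costing the same compute as the SE samples.

    Assumption: generation cost per token is proportional to parameter count,
    and solutions from both models have similar length.
    """
    return se_samples_per_problem * se_params // wc_params


# Nominal sizes taken from the model names (Gemma2-27B and Gemma2-9B).
SE_PARAMS = 27_000_000_000
WC_PARAMS = 9_000_000_000

# Illustrative budgets: each SE sample buys three WC samples.
for se_budget in (1, 10):
    wc_budget = compute_matched_wc_samples(se_budget, SE_PARAMS, WC_PARAMS)
    print(f"{se_budget} SE sample(s) per problem ~ {wc_budget} WC samples per problem")
```

Under these assumptions the WC model gets roughly three attempts for every SE attempt, which is one plausible reason its data can reach higher coverage and diversity at the same cost.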
[02] Compute-Optimality Results for Training
1. What are the three finetuning paradigms explored in the article? The three finetuning paradigms are:
- Student-LM finetuning: Finetuning a student LM on data from the WC and SE models.
- WC-LM finetuning: Finetuning the WC model on its own data (self-improvement) and data from the SE model.
- SE-LM finetuning: Finetuning the SE model on data from the WC model (weak-to-strong improvement) and its own data.
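The three paradigms amount to a small grid of (finetuned model, data source) pairs. The sketch below enumerates that grid and labels each cell. The function name and the role strings are hypothetical, and labelling every case where a model is trained on another model's data (other than WC data for the SE model) as knowledge distillation is an assumption based on the settings named in the abstract.

```python
from itertools import product


def setup_label(finetuned_model, data_source):
    """Name the training setup for a (finetuned model, data source) pair.

    Roles are 'student', 'WC', or 'SE' for the finetuned model and
    'WC' or 'SE' for the data source.
    """
    if finetuned_model == data_source:
        return "self-improvement"
    if finetuned_model == "SE" and data_source == "WC":
        return "weak-to-strong improvement"
    # Assumption: any other case of learning from another model's data
    # is treated as knowledge distillation.
    return "knowledge distillation"


# Enumerate the six finetuning setups covered by the three paradigms.
for model, source in product(("student", "WC", "SE"), ("WC", "SE")):
    print(f"{model}-LM finetuned on {source} data: {setup_label(model, source)}")
```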
2. What are the key findings from the finetuning experiments?
- Across all three paradigms, models finetuned on data from the WC model consistently outperform those finetuned on data from the SE model, with relative gains of up to 15%.
- This is true even for finetuning the SE model, where training on WC data (weak-to-strong improvement) outperforms training on SE data (self-improvement).
- These results challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that sampling from WC models may be the compute-optimal approach for training advanced LM reasoners.
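The gains above are relative rather than absolute. A short sketch of the arithmetic, assuming the gain is measured against the accuracy of the SE-trained model; the function name and the accuracy values are hypothetical and are not results from the article.

```python
def relative_gain_percent(wc_accuracy, se_accuracy):
    """Relative gain (in %) of the WC-trained model over the SE-trained model."""
    return 100 * (wc_accuracy - se_accuracy) / se_accuracy


# Hypothetical accuracies, for illustration only.
se_acc, wc_acc = 30.0, 33.0
print(f"absolute gain: {wc_acc - se_acc:.1f} points")
print(f"relative gain: {relative_gain_percent(wc_acc, se_acc):.1f}%")
```

The distinction matters when baselines are low: a gap of a few accuracy points can correspond to a much larger relative gain.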
3. How do the models perform on the Functional MATH dataset as a generalization test? The models finetuned on the WC data also outperform those finetuned on the SE data on the Functional MATH dataset, with relative gains ranging from 2% to 10%. This suggests that models trained on WC-generated data generalize better.
[03] Scaling to State-of-the-Art Language Models
1. How do the results scale to larger and more powerful language models? The authors extend their experiments to the Gemini-1.5-Pro (SE) and Gemini-1.5-Flash (WC) models, which are state-of-the-art language models. They find that finetuning on the WC-generated data outperforms finetuning on the SE-generated data, even for finetuning the larger SE models.
2. What is the impact of reducing the cost of data sampling from the WC model? The authors also explore a more economical scenario in which they sample fewer solutions per problem from the WC model than a cost-matched budget would allow, so that the WC data is cheaper to produce than the SE data. Even in this setting, finetuning on the WC-generated data outperforms finetuning on the SE-generated data.
3. What are the implications of the narrowing performance gap between small and large language models? The authors note that as the performance gap between small and large language models continues to narrow over time, their results will become even more relevant in the future, establishing a solid foundation for training the next generation of LM reasoners in a compute-optimal way.