Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement
Abstract
The article discusses the use of synthesized data from generative models as an alternative to human-annotated data for fine-tuning Large Language Models (LLMs). It raises concerns about "model collapse" - a drop in performance of models fine-tuned on generated data. The paper investigates the use of feedback on synthesized data to prevent model collapse, deriving theoretical conditions under which a Gaussian mixture classification model can achieve asymptotically optimal performance when trained on feedback-augmented synthesized data. The theoretical predictions are validated through simulations and two practical experiments: computing matrix eigenvalues with transformers and news summarization with LLMs.
Q&A
[01] Theoretical Insights
1. What questions does this section address?
- What are the key assumptions and conditions in the theoretical analysis?
- How does the theoretical analysis characterize the conditions under which data selection with reinforcement can lead to improvements?
- What are the main theoretical results and insights regarding the impact of the generator and verifier on the performance of the downstream model?
The key assumptions, conditions, and main results of the theoretical analysis include:
- The data distribution follows certain high-dimensional concentration properties (Condition D.1), including the case of Gaussian mixtures.
- The authors consider a family of parametrized pruning strategies, termed "RLHF-Pruning", that satisfy the "Independent Selection" assumption (Assumption 3.1).
- The theoretical analysis characterizes the test accuracy of the downstream model as a function of the label disagreement rate (error rate) in the synthesized data and the parameters of the pruner, in the infinite-sample limit.
- The main theoretical result (Theorem D.3) shows a sharp phase transition: if the label disagreement rate is below a certain "breakdown point", the downstream model achieves perfect accuracy; otherwise, it learns the exact opposite of the true class labels.
- The analysis provides insights on the impact of the generator and verifier, showing that a better generator always improves performance, and a sufficiently good verifier (close to an oracle) can achieve high breakdown points even with a non-degenerate generator.
[02] Simulations on Synthesized Data
1. What are the key lessons learned from the simulations on synthesized data? The key lessons include:
- Oracle supervision (i.e., using an oracle verifier) matches the performance of training with oracle labels, as predicted by the theory.
- Weak supervision can lead to a sweet spot in performance, but further increasing the verifier's accuracy may degrade performance if the verifier's correlation with the generator changes.
- The simulations illustrate the theoretical insights, showing that improving the generator and verifier generally enhances performance, but the effectiveness of the verifier depends on its correlation with the generator.
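The mechanism by which a verifier helps can be sketched with a small Monte Carlo estimate. This is a simplified model of my own, not the paper's simulation: a generator mislabels a fixed fraction of samples, and an independent verifier accepts or rejects each label with a given accuracy; the kept data's error rate drops sharply as verifier accuracy grows. The 30% generator error rate is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
gen_error = 0.3  # generator mislabels 30% of samples (illustrative)

def kept_error(verifier_acc):
    """Fraction of wrong labels surviving a verifier that judges each
    label independently with accuracy `verifier_acc`."""
    wrong = rng.random(n) < gen_error
    # verifier accepts a label when it believes the label is right
    accept = np.where(wrong,
                      rng.random(n) > verifier_acc,   # misses the error
                      rng.random(n) < verifier_acc)   # correct accept
    return wrong[accept].mean()

for q in [0.5, 0.7, 0.9, 0.99]:
    print(q, round(kept_error(q), 3))
```

At accuracy 0.5 the verifier is uninformative and the kept error equals the generator's; as accuracy approaches 1 (the oracle limit), pruning drives the error toward zero, which is why oracle supervision matches training on oracle labels.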
[03] Experiments
1. What are the key findings from the experiments on the arithmetic task using transformers? The key findings include:
- Model collapse is observed when training solely on the generator's synthesized data, even with increased data volume.
- Using reinforcement (ranking the generated solutions and selecting the best) can lead to significant performance improvements.
- Simply selecting the best solution based on perplexity does not enhance performance, indicating that the model cannot, on its own, identify its better predictions.
- With oracle supervision (using a verifier to select the data), the curated dataset can surpass the performance of the original dataset, and performance improves as more data is added.
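The rank-and-select reinforcement step can be sketched as a best-of-n simulation. This is a hypothetical noisy-ranking model, not the paper's experiment: each of n candidate solutions is correct with the generator's base accuracy, and a verifier scores candidates it believes correct higher, with a given judging accuracy. All rates are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def best_of_n(n_candidates, p_correct, verifier_acc, trials=50_000):
    """Probability the selected candidate is correct when the generator
    is right with prob `p_correct` and the verifier's per-candidate
    judgment is right with prob `verifier_acc`."""
    correct = rng.random((trials, n_candidates)) < p_correct
    # verifier's belief about each candidate, with noisy tie-breaking
    believed = np.where(correct,
                        rng.random(correct.shape) < verifier_acc,
                        rng.random(correct.shape) > verifier_acc)
    score = believed + 1e-3 * rng.random(correct.shape)
    pick = score.argmax(axis=1)
    return correct[np.arange(trials), pick].mean()

print(round(best_of_n(1, 0.3, 0.9), 3))  # no selection: generator accuracy
print(round(best_of_n(8, 0.3, 0.9), 3))  # ranking lifts accuracy well above base
```

With n_candidates = 1 selection does nothing; with a reasonably accurate verifier and several candidates, the selected data is far more accurate than the raw generations, which is how curated synthetic data can surpass the original dataset.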
2. What are the key findings from the experiments on news summarization using LLaMA-2? The key findings include:
- Model collapse is observed when training solely on the generator's synthesized data, even with increased data volume.
- Using an oracle verifier to select the synthesized data can lead to performance that surpasses the model trained on the original dataset, with only a fraction of the data and training compute.
- Using a verifier model with higher performance than the generator does not always lead to better performance, as the effectiveness of the verifier depends on its correlation with the generator.
- Self-selection by the generator can sometimes outperform the verifier-based selection, suggesting the importance of carefully choosing the supervision model when an oracle is not available.
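Why a nominally stronger verifier can underperform comes down to how correlated its errors are with the generator's. The toy model below is my own hedged illustration, not the paper's analysis: a verifier that "echoes" the generator on some fraction of examples (accepting whatever it produced) prunes less effectively than a weaker but independent verifier. The accuracies and echo fraction are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
gen_wrong = rng.random(n) < 0.3  # generator errs on 30% of examples

def pruned_error(verifier_acc, echo_frac):
    """Label-error rate after pruning with a verifier that, on an
    `echo_frac` fraction of examples, blindly accepts the generator's
    output, and otherwise judges correctness with prob `verifier_acc`."""
    echoes = rng.random(n) < echo_frac
    judges_right = rng.random(n) < verifier_acc
    accept = np.where(echoes, True,
                      np.where(gen_wrong, ~judges_right, judges_right))
    return gen_wrong[accept].mean()

# stronger but generator-correlated vs. weaker but independent
print(round(pruned_error(0.95, 0.5), 3))
print(round(pruned_error(0.80, 0.0), 3))
```

The independent 80%-accurate verifier leaves less error in the kept data than the correlated 95%-accurate one, consistent with the finding that a higher-performing verifier does not always help and that supervision should be chosen with its correlation to the generator in mind.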