
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

🌈 Abstract

The paper investigates the role of synthetic data in improving the math reasoning capabilities of large language models (LLMs). The key findings are:

  • Training on positive synthetic data from capable models like GPT-4 and Gemini 1.5 Pro results in slower performance scaling compared to standard empirical risk minimization.
  • Training on model-generated positive synthetic data can improve sample efficiency by 2x, but also amplifies spurious correlations.
  • Appropriately constructing learner-specific negative data, with emphasis on critical steps, yields a performance boost equivalent to scaling up the positive data by 8x.
  • Training with negative data provides a mechanism to unlearn spurious correlations.
  • The authors present a conceptual model, inspired by reinforcement learning (RL), to explain the benefits of using negative data, showing that training on it is equivalent to advantage-weighted RL.

🙋 Q&A

[01] Positive Synthetic Data

1. What are the key findings regarding training on positive synthetic data?

  • Training on positive synthetic data from capable models like GPT-4 and Gemini 1.5 Pro results in slower performance scaling compared to standard empirical risk minimization.
  • Training on model-generated positive synthetic data can improve sample efficiency by 2x, but also amplifies spurious correlations.

2. Why is self-generated positive data more sample-efficient than data from more capable models?

Self-generated positive data is more sample-efficient because responses from a similar model are "easier to fit" than those from a more capable model, which reduces memorization during finetuning. A rough sketch of how such data is typically collected follows below.
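
A minimal sketch of how self-generated positive data is typically collected, assuming rejection sampling against a final-answer verifier; the names (`collect_self_generated_positives`, `model.generate`, `verifier`) are illustrative assumptions, not the paper's API:

```python
# Illustrative rejection-sampling loop for self-generated positive data.
# All names below are hypothetical; the paper does not prescribe this interface.

def collect_self_generated_positives(model, problems, verifier, k=4, temperature=0.7):
    """For each problem, sample k candidate solutions from the SFT model and
    keep only those whose final answer the verifier marks as correct."""
    positives = []
    for problem in problems:
        for _ in range(k):
            solution = model.generate(problem, temperature=temperature)
            if verifier(problem, solution):  # final-answer check only
                positives.append((problem, solution))
    return positives

# The learner is then finetuned (standard SFT) on `positives`; the paper reports
# such self-generated data to be roughly 2x as sample-efficient as the same
# amount of GPT-4/Gemini-generated positive data.
```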

3. How can training on positive data alone amplify spurious correlations?

If a positive response contains incorrect or irrelevant intermediate steps, training on such data often incentivizes the model to overfit to these spurious correlations, leading to flat or even inverse scaling as more data is added.

[02] Negative Synthetic Data

1. How can negative data help address the failure modes of training on positive data alone?

The paper shows that training on negative (incorrect) responses generated by the model can address the issues with positive data, as long as the negative data is carefully constructed to enable per-step credit assignment.

2. What is the key insight behind the construction of negative data?

The key insight is to contrast positive and negative responses that depict good and bad choices for the more "critical" intermediate steps, i.e., the steps that the model must produce carefully to succeed at the problem. A sketch of one way to identify such steps follows below.
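
One way to make "critical steps" concrete is a Monte Carlo estimate of per-step advantages under the SFT policy. This is a sketch under assumptions (`sft_model.complete` and `verifier` are hypothetical helpers), and the paper's exact estimator may differ:

```python
# Illustrative Monte Carlo estimate of per-step advantages.
# A step with a large-magnitude advantage is a "critical" step: taking it
# sharply raises (or lowers) the estimated probability of reaching a
# correct final answer.

def estimate_step_advantages(sft_model, problem, steps, verifier, n_rollouts=8):
    """steps: list of solution-step strings making up one response.
    Returns A_i = V(prefix through step i) - V(prefix before step i), where
    V(prefix) is the fraction of n_rollouts continuations sampled from the
    SFT policy that the final-answer verifier marks correct."""
    def value(prefix):
        correct = 0
        for _ in range(n_rollouts):
            completion = sft_model.complete(problem, prefix)  # continue from prefix
            correct += int(verifier(problem, prefix + completion))
        return correct / n_rollouts

    advantages, prefix = [], ""
    for step in steps:
        v_before = value(prefix)
        v_after = value(prefix + step)
        advantages.append(v_after - v_before)
        prefix += step
    return advantages
```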

3. How is training on this negative data equivalent to advantage-weighted reinforcement learning (RL)?

The paper shows that training on this negative data is equivalent to performing advantage-weighted RL, where the advantages are computed under an optimal value function induced by sampling multiple responses under the SFT policy obtained by training on only the positive data. A generic form of such an objective is shown below.
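
For reference, a generic advantage-weighted objective of the kind described above can be written as follows; this is an illustrative textbook form, not a verbatim reproduction of the paper's equation:

```latex
% Illustrative advantage-weighted objective: each intermediate step y_i of a
% response y to problem x is weighted by its estimated advantage under the
% SFT policy \pi_{SFT} (value/Q estimates obtained by sampling rollouts).
\[
\max_{\theta}\;
\mathbb{E}_{(x,\,y)\sim \mathcal{D}}
\left[\sum_{i} A^{\pi_{\mathrm{SFT}}}\!\left(x, y_{<i}, y_i\right)\,
\log \pi_{\theta}\!\left(y_i \mid x, y_{<i}\right)\right],
\qquad
A^{\pi_{\mathrm{SFT}}}(x, y_{<i}, y_i)
= Q^{\pi_{\mathrm{SFT}}}(x, y_{<i}, y_i) - V^{\pi_{\mathrm{SFT}}}(x, y_{<i}).
\]
```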

4. How does advantage-weighted RL improve generalization compared to training on positive data alone?

Advantage-weighted RL can improve generalization by de-emphasizing spurious steps and emphasizing critical steps. This is equivalent to a distributionally robust optimization objective, which ensures low loss for both majority and minority subpopulations in the training data; a generic form is sketched below.
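
For context, a generic (group) distributionally robust optimization objective of the kind alluded to above looks as follows; this is a standard illustrative form, not necessarily the exact objective derived in the paper:

```latex
% Generic group-DRO objective: minimize the worst-case expected loss over a
% set of subpopulations (groups) G of the training data, so that minority
% subpopulations are not sacrificed for a low average-case loss.
\[
\min_{\theta}\; \max_{g \in G}\;
\mathbb{E}_{(x,\,y)\sim P_{g}}\!\left[\ell\!\left(\theta; x, y\right)\right].
\]
```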

