Weak-to-Strong Reasoning
Abstract
The paper introduces a weak-to-strong learning framework that enables a strong language model to autonomously refine its training data and enhance its reasoning capabilities, without requiring input from a more advanced model or human-annotated data. The framework consists of two stages: (1) supervised fine-tuning on a small, selectively curated, high-quality dataset, and (2) preference optimization on contrastive samples identified by the strong model itself. Experiments on the GSM8K and MATH datasets demonstrate significant improvements in the reasoning capabilities of Llama2-70b when it is supervised by each of three separate weak models. The method is further validated on the challenging OlympicArena dataset, where Llama3-8b-instruct effectively supervises Llama3-70b.
Q&A
[01] Weak-to-Strong Learning
1. What is the key challenge addressed by the weak-to-strong learning framework? The key challenge is that as large language models (LLMs) exceed human-level capabilities, it becomes increasingly difficult to provide full-scale, accurate supervision for them. The weak-to-strong learning framework aims to leverage a less capable model to unlock the latent abilities of a stronger one.
2. How does the proposed method differ from previous approaches to weak-to-strong learning? Previous studies have shown that naively fine-tuning strong models on the full set of noisy data produced by weak models (full weak fine-tuning) consistently improves their performance over their weaker counterparts. However, this approach still falls far short of recovering the full capabilities of strong models, especially on complex reasoning tasks. The proposed method instead introduces a progressive refinement learning framework that enables the strong model to autonomously refine its training data, without requiring input from a more advanced model or human-annotated data.
3. What are the key stages of the proposed weak-to-strong learning framework? The framework consists of two main stages (sketched in code after this list):
- Supervised fine-tuning on a small but high-quality dataset, selected by combining weak data generated by the less capable model with data the stronger model generates for itself through in-context learning.
- Preference optimization on contrastive samples identified by the strong model itself, enabling the model to learn effectively from the errors of the weaker model.
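The two stages above can be sketched roughly in code. The sketch below is an illustrative reconstruction, not the authors' implementation: the Sample type, the "####" answer-extraction convention, and the agreement/disagreement criteria are assumptions standing in for the paper's actual data-selection and pair-construction details.

```python
# Illustrative sketch of the two-stage data pipeline (not the paper's released code).
# All names below are hypothetical placeholders for the paper's actual components.

from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    solution: str  # chain-of-thought solution text


def final_answer(solution: str) -> str:
    """Extract the final answer from a chain-of-thought solution.
    Placeholder: assumes a GSM8K-style '#### <answer>' suffix."""
    return solution.rsplit("####", 1)[-1].strip()


# ---- Stage 1: build a small, high-quality SFT set -------------------------
def build_sft_set(weak_samples, strong_icl_samples):
    """Keep a question only when the weak model's solution and the strong
    model's in-context-learning solution agree on the final answer -- used
    here as a proxy (an assumption) for 'likely correct' when no gold labels
    are available."""
    strong_by_q = {s.question: s for s in strong_icl_samples}
    curated = []
    for w in weak_samples:
        s = strong_by_q.get(w.question)
        if s and final_answer(w.solution) == final_answer(s.solution):
            curated.append(w)  # weak-generated solution
            curated.append(s)  # strong self-generated solution
    return curated


# ---- Stage 2: contrastive pairs for preference optimization ---------------
def build_preference_pairs(stage1_model_samples, weak_samples):
    """Pair the stage-1 model's own solution (chosen) against a weak-model
    solution with a conflicting final answer (rejected)."""
    weak_by_q = {w.question: w for w in weak_samples}
    pairs = []
    for s in stage1_model_samples:
        w = weak_by_q.get(s.question)
        if w and final_answer(w.solution) != final_answer(s.solution):
            pairs.append({
                "prompt": s.question,
                "chosen": s.solution,    # strong model's preferred solution
                "rejected": w.solution,  # weak model's conflicting solution
            })
    return pairs
```

The curated set from build_sft_set would feed a standard supervised fine-tuning run on the strong model, and the pairs from build_preference_pairs a DPO-style preference-optimization step.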
[02] Experimental Results
1. How does the proposed method perform compared to full weak fine-tuning on complex reasoning tasks? The experiments show that full weak fine-tuning, while effective in classification tasks, falls short for complex reasoning tasks like those in the GSM8K and MATH datasets. In contrast, the proposed method significantly outperforms full weak fine-tuning, achieving a 26.99-point improvement on GSM8K when supervised solely by the weak model (Gemma-2b) after the first stage of training, and further enhancing performance by an additional 8.49 points through preference optimization.
2. How does the proposed method perform compared to the strong model fine-tuned on gold-standard solutions? The experiments demonstrate that the proposed preference optimization phase enables the strong model to learn from errors made by the weak supervisor, ultimately surpassing the strong model fine-tuned on gold-standard solutions (the strong ceiling) in challenging scenarios, such as level 4-5 MATH problems.
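For reference, if the preference optimization stage uses a standard DPO-style objective (assumed here for illustration; the paper's exact variant may differ), each contrastive pair is used to minimize:

$$
\mathcal{L} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ is the solution the strong model prefers, $y_l$ the conflicting weak-model solution, $\pi_{\mathrm{ref}}$ the stage-1 fine-tuned model, and $\beta$ a temperature hyperparameter.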
3. How does the proposed method perform in a forward-looking experimental setup using the OlympicArena dataset? In the experiments on the OlympicArena dataset, which is designed to simulate more realistic future scenarios, the proposed two-stage training approach outperforms full weak fine-tuning by 3.19 points, with Llama3-8b-instruct effectively supervising the larger Llama3-70b model. This validates the robustness and generalizability of the method in settings closer to those future conditions.