# Iterative Reasoning Preference Optimization

## Abstract

The article discusses an iterative preference optimization approach for improving the reasoning capabilities of large language models (LLMs) on tasks that require chain-of-thought (CoT) reasoning. The key points are:

- Iterative preference optimization methods have been shown to perform well on general instruction tuning tasks, but typically make little improvement on reasoning tasks.
- The proposed "Iterative Reasoning Preference Optimization" (Iterative RPO) approach optimizes preferences between competing generated CoT candidates, preferring winning reasoning steps that lead to the correct answer over losing ones.
- Iterative RPO uses a modified DPO loss with an additional negative log-likelihood (NLL) term, which is found to be crucial for performance.
- Iterative RPO results in significant accuracy improvements on reasoning tasks like GSM8K, ARC-Challenge, and MATH, outperforming baselines like supervised fine-tuning and standard DPO.

## Q&A

### [01] Iterative Reasoning Preference Optimization

**1. What is the key idea behind the Iterative Reasoning Preference Optimization (Iterative RPO) approach?**
The key idea is to optimize preferences between competing generated Chain-of-Thought (CoT) candidates, preferring winning reasoning steps that lead to the correct answer over losing ones. This is done through an iterative process of generating CoT candidates, constructing preference pairs based on the correctness of the final answers, and then training the model using a modified DPO loss with an additional NLL term.
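The pair-construction step described above can be sketched in a few lines: since only the gold final answer is needed, candidates whose final answer matches it become "chosen" responses and the rest become "rejected". This is a minimal illustration; the function name and the cross-product pairing scheme are assumptions, not the paper's exact code.

```python
import random

def build_preference_pairs(samples, gold_answer, max_pairs=None):
    """Pair correct-answer CoTs (chosen) with incorrect ones (rejected).

    samples: list of (cot_text, final_answer) tuples generated by the model.
    Only the gold final answer is needed as supervision; full CoT labels
    are not required. Pairing every winner with every loser and subsampling
    is an illustrative choice, not necessarily the paper's scheme.
    """
    winners = [s for s in samples if s[1] == gold_answer]
    losers = [s for s in samples if s[1] != gold_answer]
    pairs = [(w, l) for w in winners for l in losers]
    random.shuffle(pairs)
    return pairs[:max_pairs] if max_pairs is not None else pairs
```

Note that prompts where every sample is correct (or every sample is wrong) yield no pairs and contribute nothing to that iteration's preference data.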

**2. How does Iterative RPO differ from other iterative alignment methods?**
Iterative RPO differs from other iterative methods like Iterative DPO and Self-Rewarding LLMs in a few ways:

- It only requires the final answer labels, not the full CoT solutions, to construct the preference pairs.
- It uses the model itself to generate both winning and losing CoT candidates, rather than relying on human-provided or external reward model-generated data.
- It includes an additional NLL loss term in the training objective, which is found to be crucial for performance.

**3. What are the key components of the Iterative RPO approach?**
The key components are:

- Chain-of-Thought & Answer Generation: Using the current model, generate multiple responses for each input, where each response consists of CoT reasoning followed by a final answer.
- Preference Optimization: Construct a dataset of response pairs such that the chosen (winning) responses have higher rewards (i.e., correct final answers) than the rejected (losing) responses. Train the next model using a modified DPO loss that includes an additional NLL term.
- Iterative Training: Train a series of models where each successive model uses preference data created by the previous model.
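The three components above compose into an outer loop. The sketch below uses toy numerical stand-ins for sampling and training (a "model" is just a probability of answering correctly) purely to show the control flow; it is not the paper's training code, and all function names are illustrative.

```python
import random

def sample_cots(model, prompt, k):
    # Toy stand-in for step 1: "model" is the probability of a correct answer.
    return [("cot", "42" if random.random() < model else "0") for _ in range(k)]

def train_dpo_nll(model, pairs):
    # Toy stand-in for step 2: non-empty preference data nudges accuracy up.
    return min(1.0, model + 0.1 * bool(pairs))

def iterative_rpo(model, prompts, gold, n_iters=3, k=8):
    """Step 3: each successive model trains on pairs built by the previous one."""
    for _ in range(n_iters):
        pairs = []
        for x, y in zip(prompts, gold):
            samples = sample_cots(model, x, k)
            wins = [s for s in samples if s[1] == y]
            loses = [s for s in samples if s[1] != y]
            pairs += [(x, w, l) for w in wins for l in loses]
        model = train_dpo_nll(model, pairs)
    return model
```

In the paper each iteration's training is initialized from the same seed model, with only the preference data changing between iterations.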

### [02] Experimental Results

**1. How does Iterative RPO perform compared to other baselines on the GSM8K dataset?**
On the GSM8K dataset, Iterative RPO outperforms baselines like zero-shot CoT, supervised fine-tuning (SFT) on the gold CoT solutions, and standard DPO. Iterative RPO improves the base model accuracy from 55.6% to 81.6% (or 88.7% with majority voting over 32 samples), whereas the SFT-only approach reaches just 63.5%.

**2. What is the importance of the NLL loss term in the Iterative RPO training?**
The NLL loss term in the Iterative RPO training objective is found to be crucial for performance. Compared to standard DPO training, which uses the same preference data but without the NLL term, Iterative RPO with the NLL term shows a large performance boost (73.1% vs. 61.8% on GSM8K).
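The combined objective can be sketched per preference pair as the standard DPO term plus a length-normalized NLL term on the chosen response. The sketch below uses plain floats for summed sequence log-probabilities; the exact normalization and the weighting `alpha` are assumptions for illustration, not the paper's verbatim formulation.

```python
import math

def dpo_nll_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                 len_w, beta=0.1, alpha=1.0):
    """Per-pair DPO loss plus an NLL term on the winning response.

    pi_logp_w / pi_logp_l: summed log-probs of the chosen / rejected
    sequences under the current policy; ref_logp_*: the same under the
    frozen reference model. len_w normalizes the NLL term (assumed here
    to be the chosen-sequence length).
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    nll = -pi_logp_w / len_w  # keeps probability mass on the winning CoT
    return dpo + alpha * nll
```

Intuitively, the DPO term alone only cares about the *gap* between chosen and rejected log-probs, so the absolute likelihood of the chosen CoT can drift down during training; the NLL term counteracts that, which is consistent with the reported 73.1% vs. 61.8% gap.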

**3. How does Iterative RPO perform on the ARC-Challenge and MATH tasks?**
On the ARC-Challenge task, Iterative RPO improves the accuracy from 77.8% (zero-shot CoT) to 86.7% (after 3 iterations). On the MATH task, it improves the accuracy from 12.5% (zero-shot CoT) to 20.8% (after 3 iterations). These results outperform baselines like SFT on chosen sequences and standard DPO.
