# DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

## ๐ Abstract

We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. The model is pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, and then undergoes supervised fine-tuning using an enhanced formal theorem proving dataset. Further refinement is achieved through reinforcement learning from proof assistant feedback. Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the miniF2F and ProofNet benchmarks.

## ๐ Q&A

### [01] Pre-Training

**1. What was the goal of the pre-training process for DeepSeek-Prover-V1.5?**
The goal of the pre-training process was to enhance the language model's capabilities in formal theorem proving and mathematical reasoning. This was achieved by further pre-training the base model on high-quality mathematics and code data, with a focus on formal languages such as Lean, Isabelle, and Metamath.

**2. How did the pre-training help improve the model's performance?**
The pre-training on formal mathematical languages and content helped the model better grasp the syntax and semantics of formal systems like the Lean theorem prover, and align abstract mathematical reasoning with precise formal representation.

### [02] Supervised Fine-Tuning

**1. What were the key components of the supervised fine-tuning process?**
The supervised fine-tuning process involved two main data augmentation techniques:

- Annotating natural language chain-of-thought comments alongside Lean 4 code, to better align formal theorem proving with natural language reasoning.
- Incorporating intermediate tactic state information within the Lean 4 proof code, to enable the model to leverage compiler feedback more effectively.

**2. How did the supervised fine-tuning improve the model's performance?**
The supervised fine-tuning, which used the enhanced dataset, helped the model develop new behaviors that employ more delicate mathematical thinking to guide the generation of tactics. This led to significant improvements in the model's theorem proving capabilities compared to the base model.

### [03] Reinforcement Learning

**1. What was the goal of the reinforcement learning stage?**
The goal of the reinforcement learning stage was to further enhance the model's performance based on verification feedback from the Lean 4 prover. The RL process leveraged the rigorous verification provided by the proof assistant to offer accurate reward signals for the model's training.

**2. How did the reinforcement learning improve the model's performance?**
The reinforcement learning, using the GRPO algorithm, helped the model better align its proof generation with the formal specifications of the Lean verification system. This resulted in genuine enhancements to the model's fundamental capabilities in formal theorem proving, as evidenced by the improvements across different sample budgets.

### [04] Monte-Carlo Tree Search

**1. What was the motivation behind the development of the RMaxTS algorithm?**
The RMaxTS algorithm was developed to address the exploration challenges in the sparse-reward proof search problem. By assigning intrinsic rewards, the algorithm encourages the prover agent to generate diverse planning paths, fostering extensive exploration of the proof space.

**2. How does the RMaxTS algorithm work, and how does it differ from standard MCTS?**
The RMaxTS algorithm integrates a truncate-and-resume mechanism to seamlessly combine the strengths of whole-proof generation and proof-step generation. It also employs a novel reward-free exploration strategy based on the RMax principle, which prioritizes the expansion of nodes that lead to diverse tactic states, in contrast to the standard UCB-based exploration in MCTS.

### [05] Evaluation and Results

**1. What were the key benchmarks used to evaluate DeepSeek-Prover-V1.5?**
The model was evaluated on the miniF2F benchmark, which focuses on high-school level formal problem-solving, and the ProofNet benchmark, which evaluates formal theorem-proving capabilities at the undergraduate level.

**2. How did DeepSeek-Prover-V1.5 perform compared to previous state-of-the-art models?**
DeepSeek-Prover-V1.5 achieved new state-of-the-art results on both benchmarks, outperforming previous specialized models for formal theorem proving as well as general-purpose language models like GPT-3.5 and GPT-4. The integration of the RMaxTS algorithm further boosted the model's performance, demonstrating the effectiveness of the proposed exploration-oriented tree search approach.