# AlphaMath Almost Zero: Process Supervision without Process

## Abstract

Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors. This study introduces an innovative approach that eliminates the need for manual annotation by leveraging the Monte Carlo Tree Search (MCTS) framework to generate both the process supervision and evaluation signals automatically.

## Q&A

### [01] Introduction

**1. What are the limitations of large language models (LLMs) in mathematical reasoning tasks?**

- LLMs often face significant limitations in mathematical reasoning due to the "hallucination" issue in numerical calculations, impeding their full potential.

**2. What approaches have been developed to address the challenges associated with LLMs' intrinsic calculation errors?**

- The Chain-of-Thought (CoT) approach and the Program-of-Thought (PoT) framework have been developed to enhance the reasoning capabilities of LLMs in complex tasks.
- The CoT approach capitalizes on the in-context learning proficiency of LLMs, while the PoT framework and Program-Aided Language (PAL) models incorporate an external code interpreter to handle precise numerical and symbolic computations.

**3. How do these approaches differ from the natural process of mathematical problem-solving as human beings?**

- The CoT and PoT frameworks pursue a solution straight through to its final answer regardless of the accuracy of intermediate steps, unlike the dynamic, exploratory way humans tackle problems, which is examined in the context of the Tree of Thoughts (ToT) framework.

**4. How does the proposed approach extend the research line of Tree of Thoughts?**

- The proposed approach utilizes the LLMs integrated with the Monte Carlo Tree Search (MCTS) framework to strike a more effective balance between exploration and exploitation, enabling the generation of high-quality training data without professional human annotations.

### [02] Our Method

**1. What is the primary goal of the proposed approach?**

- The primary goal is to develop a step-level value model capable of assessing confidence in the correctness of partial solutions and of guiding the LLM in generating subsequent reasoning steps.

**2. How does the proposed approach leverage the Monte Carlo Tree Search (MCTS) algorithm?**

- The MCTS algorithm is employed to reuse simulations and update the estimated values in a principled manner, addressing the practical inefficiency of the common Monte Carlo (MC) evaluation approach.

**3. What are the four key operations of the MCTS algorithm within the context of mathematical problem-solving?**

- Selection: The algorithm explores the tree by selecting actions according to the Upper Confidence bounds applied to Trees (UCT) principle.
- Expansion: Candidate next reasoning steps (actions) are generated by random sampling at a higher temperature, and their probabilities are recorded.
- Evaluation: The leaf node or partial solution is evaluated using a weighted sum of the value network's estimation and the empirical reward obtained during the rollout.
- Backup: The nodes along the path from the root to the leaf node undergo a backward pass update to their action-state values and visitation counts.
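The selection step above relies on the UCT rule, which trades off a node's mean value (exploitation) against how rarely it has been visited (exploration). A minimal sketch of that rule, with an illustrative exploration constant `c` (not a value from the paper):

```python
import math

def uct_score(q, n_child, n_parent, c=1.5):
    """UCT: exploitation (mean value q) plus an exploration bonus
    that shrinks as the child is visited more often."""
    if n_child == 0:
        return float("inf")  # always try unvisited actions first
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def select(children):
    """children: list of (q_value, visit_count) pairs under one node.
    Returns the index of the action with the highest UCT score."""
    n_parent = sum(n for _, n in children) or 1
    scores = [uct_score(q, n, n_parent) for q, n in children]
    return scores.index(max(scores))

# A well-valued but heavily visited step can lose to a promising,
# rarely tried one: child 1 has lower Q but far fewer visits.
children = [(0.8, 30), (0.6, 2), (0.1, 10)]
```

Here `select(children)` picks index 1: its exploration bonus outweighs child 0's higher mean value, which is exactly the behavior that lets MCTS keep probing alternative reasoning steps.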

**4. How is the final tree value approximated using the MCTS algorithm?**

- The final tree value is approximated using the values stored within the tree; for terminal nodes, the label is straightforwardly determined by the correctness of the final answer.

**5. What is the iterative training process of the proposed approach?**

- The approach begins with a pre-trained LLM as the policy model and extends it by adding an auxiliary linear layer with a tanh activation function as the value model.
- The policy and value models are then trained using a multi-task loss function, which includes the negative log-likelihood loss for next-token prediction in correct solutions and the loss in value prediction for both correct and incorrect solutions.
- The updated policy and value models are then used to advance to the next round of MCTS, iterating this training process to enhance the models further.
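The multi-task loss described above can be sketched as a next-token NLL term (applied only to correct solutions) plus a value-prediction term (applied to all solutions). The squared-error value loss and the unit weighting are assumptions for illustration; the paper trains both objectives jointly:

```python
def multitask_loss(token_logprobs, predicted_value, target_value, is_correct):
    """Joint policy + value objective (sketch).
    token_logprobs: log-probabilities of the solution's tokens
    predicted_value: output of the auxiliary value head (tanh, in [-1, 1])
    target_value:    value label derived from the MCTS tree
    is_correct:      whether this solution reached the right final answer."""
    # NLL term: imitate only correct solutions
    nll = -sum(token_logprobs) if is_correct else 0.0
    # Value term: fit the value head on correct AND incorrect solutions
    value_loss = (predicted_value - target_value) ** 2
    return nll + value_loss
```

For a correct two-token solution with log-probs [-0.1, -0.2] and a value prediction of 0.5 against a target of 1.0, the loss is 0.3 + 0.25 = 0.55; an incorrect solution contributes only its value error.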

### [03] Inference

**1. What are the two inference strategies proposed in the paper?**

- MCTS: Constructs a single tree with multiple simulations to estimate a robust policy distribution.
- Step-level Beam Search: A more computationally efficient variant that eliminates the backup operation, producing a sequential, streaming output of steps.

**2. How does the Step-level Beam Search work?**

- Initially, the LLM generates candidate actions for the first step via sampling-based decoding.
- These candidate actions are evaluated by the step-level value model, and the top-k are selected.
- For each chosen action, the LLM then generates subsequent actions, which the value model reranks, keeping the best ones.
- This generation, reranking, and selection procedure repeats iteratively.
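The loop above can be sketched as beam search over whole reasoning steps rather than tokens. The function names (`generate_steps`, `value_fn`, `is_final`) and beam sizes are illustrative stand-ins for the policy LLM, the step-level value model, and a termination check:

```python
def step_level_beam_search(generate_steps, value_fn, is_final,
                           b1=3, b2=2, max_steps=10):
    """Beam search over reasoning *steps* (not tokens).
    generate_steps(partial) -> candidate next steps for a partial solution
    value_fn(partial)       -> scalar confidence from the value model
    is_final(partial)       -> True once a partial solution is complete
    b1: candidates expanded per beam; b2: beams kept after reranking."""
    beams = [[]]  # start from the empty partial solution
    for _ in range(max_steps):
        candidates = []
        for partial in beams:
            for step in generate_steps(partial)[:b1]:
                candidates.append(partial + [step])
        # Rerank all expanded partial solutions with the value model
        beams = sorted(candidates, key=value_fn, reverse=True)[:b2]
        if all(is_final(p) for p in beams):
            break
    return max(beams, key=value_fn)
```

With a toy policy that always proposes `"good"` and `"bad"` steps and a value model that counts `"good"` steps, the search returns the all-`"good"` path, mirroring how the value model steers the policy toward better solution paths.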

### [04] Experiments

**1. What are the key findings from the experimental results?**

- The proposed approach, without relying on high-quality solutions annotated by humans or GPT-4, is competitive with or surpasses the performance of the state-of-the-art (SOTA) on 7B LLMs.
- The integration of the value model and the MCTS framework can progressively generate high-quality math reasoning data autonomously.
- The value model is instrumental in aiding the policy model to navigate more effective solution paths.

**2. How does the performance of the proposed approach compare to other baselines?**

- The proposed approach outperforms proprietary and open-source models, as well as supervised fine-tuning models that utilize external tools like a Python code interpreter.
- The step-level beam search with a beam size of 1 offers a computationally efficient inference strategy that achieves similar accuracy to more complex approaches like MCTS.

**3. What is the role of the value model in the proposed approach?**

- The value model is crucial in guiding the policy model to generate more effective solution paths, as evidenced by the significant performance improvements observed when incorporating the value model into the inference process.