Prover-Verifier Games improve legibility of LLM outputs
Abstract
The paper studies how to make the outputs of large language models (LLMs) more legible to humans. It proposes a training algorithm inspired by the Prover-Verifier Game, in which a "helpful" prover is trained to produce correct and convincing solutions, while a "sneaky" prover is trained to produce incorrect but convincing solutions. The goal is to train a robust verifier that can distinguish between the helpful and sneaky provers' outputs. The paper finds that this training approach improves the legibility of the helpful prover's solutions to both the verifier and human evaluators, while also making the verifier more robust to adversarial attacks.
Q&A
[01] Legibility and Checkability
1. What is the key idea behind making LLM outputs more legible to humans? The key idea is to optimize the LLM "prover" to produce solutions that are checkable by a smaller "verifier" model. The verifier is trained to distinguish between correct and incorrect solutions, and the prover is trained to produce solutions that are convincing to the verifier.
2. How is checkability defined in the paper? Checkability is defined in terms of two properties (formalized in the sketch at the end of this section):
- Completeness: The helpful prover generates correct solutions that the verifier accepts.
- Soundness: The verifier rejects incorrect solutions, including those produced by an adversarial "sneaky" prover.
3. Why do the authors use smaller verifier models as a proxy for human legibility? The authors give three reasons for using smaller verifier models:
- LLMs are already used to aid in judging LLM outputs, and this is more scalable than relying on humans alone.
- Varying the capability gap between prover and verifier lets them study legibility as a function of this gap.
- When human feedback is limited, making outputs verifiable by less capable verifiers is an incremental step toward making them legible to humans.
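
One way to make the two checkability properties precise is sketched below. The notation (h for the helpful prover, p for an arbitrary prover, V for the verifier's accept/reject decision) is ours rather than the paper's, and in practice both conditions hold only approximately over the training distribution rather than for every input:

```latex
% Notation (ours, for illustration): x = problem, z = candidate solution,
% h = helpful prover, p = an arbitrary (possibly sneaky) prover,
% V(x, z) \in \{\mathrm{accept}, \mathrm{reject}\} = verifier decision,
% \mathrm{correct}(x, z) = ground-truth correctness of the solution.

% Completeness: the helpful prover produces correct solutions the verifier accepts.
\forall x:\quad z \sim h(\cdot \mid x) \;\Rightarrow\; \mathrm{correct}(x, z) \,\wedge\, V(x, z) = \mathrm{accept}

% Soundness: no prover can get an incorrect solution accepted.
\forall p,\, x:\quad z \sim p(\cdot \mid x) \,\wedge\, \neg\,\mathrm{correct}(x, z) \;\Rightarrow\; V(x, z) = \mathrm{reject}
```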
[02] Checkability Training
1. What is the key idea behind the checkability training algorithm? The key idea is to iteratively train a verifier model to predict the correctness of solutions, and then train a "helpful" prover to produce correct and convincing solutions, and a "sneaky" prover to produce incorrect but convincing solutions. This is inspired by the Prover-Verifier Game.
2. How does the training algorithm work in practice? In each round of training (see the sketch at the end of this section):
- The verifier is trained to predict the correctness of solutions sampled from a mixture of provers from previous rounds.
- The prover is trained using reinforcement learning to maximize a reward that combines solution correctness and convincingness to the current verifier.
- The prover is conditioned on a randomly sampled "helpful" or "sneaky" role.
3. What are the key findings from the checkability training experiments?
- Optimizing only for correctness can lead to unintelligible solutions that are hard for humans to evaluate.
- Checkability training achieves a better performance-legibility tradeoff.
- The verifier becomes more robust to adversarial attacks over training rounds.
- The sneaky prover generates increasingly subtle flaws that are harder for humans to detect.
- The helpful prover's solutions become more legible to both the verifier and human evaluators over training.
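
To make the round structure concrete, here is a heavily simplified sketch of one training round in Python. All names (`verifier.fit`, `prover.rl_update`, `is_correct`, etc.) are placeholders of ours, not the paper's code, and the reward shown is a simplified correctness-gated variant in the spirit of the paper's CGC formulation; the paper also studies an SRC reward that normalizes verifier scores within a batch and signs them by role alignment.

```python
import random


def random_role():
    # The prover is conditioned on a randomly sampled role each episode.
    return random.choice(["helpful", "sneaky"])


def format_prompt(problem, role):
    # Placeholder prompt format; role conditioning is part of the prompt.
    return f"[role: {role}]\n{problem}"


def checkability_round(verifier, prover, problems, past_provers, num_rl_steps=1000):
    """One (simplified) round of checkability training.

    Assumed interfaces (illustrative only):
      - verifier.fit(examples)            supervised update on correctness labels
      - verifier.score(problem, solution) convincingness score (float)
      - prover.sample(prompt)             sampled solution string
      - prover.rl_update(...)             one RL policy update
      - is_correct(problem, solution)     ground-truth correctness check
    """
    # Phase 1: train the verifier to predict correctness of solutions
    # sampled from a mixture of provers from previous rounds.
    verifier_data = []
    for problem in problems:
        old_prover = random.choice(past_provers)
        solution = old_prover.sample(format_prompt(problem, random_role()))
        verifier_data.append((problem, solution, is_correct(problem, solution)))
    verifier.fit(verifier_data)

    # Phase 2: train the prover with RL against the current verifier.
    for _ in range(num_rl_steps):
        problem = random.choice(problems)
        role = random_role()
        solution = prover.sample(format_prompt(problem, role))

        role_aligned = is_correct(problem, solution) == (role == "helpful")
        convincingness = verifier.score(problem, solution)

        # Simplified correctness-gated reward: convincingness counts only
        # when the assigned role is achieved; otherwise a constant penalty.
        reward = convincingness if role_aligned else -1.0
        prover.rl_update(problem, role, solution, reward)

    return verifier, prover
```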
[03] Limitations and Future Work
1. What are some key limitations of the current work?
- The setup relies on having ground truth labels for the training data, which may not be available in all domains.
- The work focuses on the math domain, and it's unclear how well the approach would generalize to other domains.
- The training dynamics and hyperparameter tuning may be sensitive, as evidenced by the behavioral differences between the SRC (signed relative convincingness) and CGC (correctness-gated convincingness) reward formulations.
2. What are some promising future research directions?
- Extending the approach to semi-supervised or unsupervised settings without ground truth labels.
- Studying the generalization of legibility from domains with ground truth to those without.
- Exploring ways to separate the model's chain-of-thought from the legibility of the final solution.
- Investigating multi-round interactive setups between the prover and verifier.
- More systematic studies of human legibility beyond just correctness judgments.