LLM Critics Help Catch LLM Bugs
Abstract
The article discusses the use of "critic" models: large language models (LLMs) trained with reinforcement learning from human feedback (RLHF) to help humans evaluate the output of other LLMs more accurately. The key points are: (1) the resulting GPT-4-based critic, CriticGPT, catches substantially more inserted bugs than qualified human contractors paid for code review, and its critiques are preferred over human-written ones in most comparisons; (2) human-machine teams of contractors assisted by critic models write more comprehensive critiques than contractors alone while hallucinating less than models alone; and (3) an inference-time procedure, Force Sampling Beam Search (FSBS), lets users trade off how many real versus spurious issues a critique contains.
Q&A
[01] LLM Critics Help Catch LLM Bugs
1. What is the core idea of the approach used in this work? The core idea is to train an autoregressive policy that takes as input a (question, answer) pair and outputs a text critique that points out errors in the answer. This is done using RLHF on challenging real-world data, resulting in a GPT-4-based critic model called CriticGPT. (A toy sketch of this critique interface appears after this list.)
2. How do the LLM critics perform compared to human contractors? The LLM critics, both CriticGPT and prompted ChatGPT, catch substantially more inserted bugs than qualified human contractors paid for code review. Additionally, model-written critiques are preferred over human critiques in more than 80% of cases.
3. What is the benefit of human-machine teams of critics and contractors? Human-machine teams of contractors assisted by critic models write more comprehensive critiques than contractors alone while reducing the hallucination rate compared to models alone.
4. What is the Force Sampling Beam Search (FSBS) technique and how does it help? FSBS is an inference-time sampling and scoring strategy that balances the tradeoff between the number of real and spurious issues included in LLM critiques. It allows navigating the precision-recall tradeoff for the critic models.
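Below is a minimal sketch of the FSBS selection step described in item 4. It assumes candidate critiques have already been sampled (with some continuations forced to open additional highlighted code sections) and that a reward model returns a scalar score for each candidate, which is then combined with a per-highlight length bonus. The function names, the highlight heuristic, and the exact scoring rule are illustrative assumptions, not the paper's implementation.

```python
FENCE = "`" * 3  # markdown code fence used to highlight quoted code in a critique


def count_highlights(critique: str) -> int:
    """Approximate one 'highlight' as one fenced code block in the critique."""
    return critique.count(FENCE) // 2


def fsbs_select(candidates: list[str], reward_model, length_modifier: float) -> str:
    """Pick the candidate maximizing reward + length_modifier * #highlights.

    Larger length_modifier values favour longer, more comprehensive critiques
    (more real bugs caught, but also more spurious issues); smaller values
    favour precision. Sweeping this constant traces the precision-recall
    tradeoff mentioned above.
    """
    def score(critique: str) -> float:
        return reward_model(critique) + length_modifier * count_highlights(critique)

    return max(candidates, key=score)


# Toy usage with a stand-in reward model that mildly penalizes length.
toy_reward_model = lambda critique: -0.01 * len(critique)
candidates = [
    FENCE + "\nfor i in range(n - 1):\n" + FENCE + "\nThe loop skips the last element.",
    "Looks fine to me.",
]
print(fsbs_select(candidates, toy_reward_model, length_modifier=2.0))
```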
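Going back to the basic setup in item 1, here is a toy sketch of the critic's input/output interface. The paper does not publish CriticGPT's actual prompt template, so the formatting below is purely an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class CritiqueRequest:
    question: str  # the original task or problem statement
    answer: str    # the model-written answer (e.g. a code snippet) to be critiqued


def format_critic_prompt(req: CritiqueRequest) -> str:
    """Serialize a (question, answer) pair into a single prompt that an
    autoregressive critic model completes with a free-form text critique."""
    return (
        "Question:\n"
        f"{req.question}\n\n"
        "Answer:\n"
        f"{req.answer}\n\n"
        "Critique (point out any bugs or problems in the answer):\n"
    )


# Example usage with a toy coding task; the critic model would generate
# the critique text as a continuation of this prompt.
prompt = format_critic_prompt(
    CritiqueRequest(
        question="Write a function that returns the maximum of a list.",
        answer="def my_max(xs):\n    return sorted(xs)[0]\n",
    )
)
print(prompt)
```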
[02] Methods
1. How are the LLM critics evaluated? The critics are evaluated based on attributes like comprehensiveness, whether they catch a particular bug, whether they include hallucinated bugs or nitpicks, and an overall helpfulness rating. Contractors rate these attributes on a 1-7 scale.
2. How are the critic models trained using RLHF? The training pipeline involves: 1) Sampling critiques for each (question, answer) pair, 2) Contractors rating the attributes of the sampled critiques, 3) Training a reward model to predict the human quality rankings, 4) Optimizing a policy against the reward model using PPO, and 5) Applying the FSBS inference-time strategy. (A skeleton of this pipeline is sketched after this list.)
3. What is the "tampering" step and why is it used? In the optional first step, contractors introduce subtle bugs into model-written answers by editing them. This provides a source of high-quality, difficult reference bugs to ground the ranking task and ensure the majority of the training data is on buggy code similar to the LLM distribution.
[03] Results
1. What are the key high-level results summarized in Figures 1 and 5? Figure 1 shows that LLM critics catch substantially more inserted bugs than human contractors, and their critiques are preferred over human critiques. Figure 5 shows that CriticGPT (RL only) outperforms prompted ChatGPT across model scales on catching inserted bugs.
2. How do human-machine teams perform compared to models or humans alone? Human-machine teams write more comprehensive critiques than humans alone, while reducing the hallucination rate compared to models alone (Figures 6 and 7).
3. What is the purpose and key finding of the FSBS technique? FSBS allows navigating the tradeoff between the comprehensiveness and precision of the critiques. It shows that human-machine teams can move beyond the model-only Pareto frontier in this tradeoff (Figure 8).
[04] Discussion and Limitations
1. What are some key limitations of the approach discussed in the paper? Limitations include: the code snippets used are typically short, the absolute rate of nitpicks and hallucinated bugs is still quite high, and real-world complex bugs may not be simple to localize or explain.
2. How does the work relate to the broader goal of "scalable oversight"? The work demonstrates a simple scalable oversight method that helps humans more comprehensively spot problems in real-world RLHF data, which is an important step towards the goal of having trustworthy and aligned models.