Elevate LLM Performance by 20% Instantly with Min-P
Abstract
The article discusses "Min-p sampling", a recently released sampling method for large language models (LLMs). It explains how this simple change can improve LLM accuracy by 10-20% on error-prone tasks such as math and fact-based question answering, with no apparent downsides compared to the status quo. It first covers how LLMs actually work: rather than predicting only the most likely next word, they model a probability distribution over their entire vocabulary. It then explains how Min-p sampling dynamically truncates that distribution based on the highest probability value, which helps prevent hallucinations and improves the model's performance.
Q&A
[01] How LLMs Work
1. What is the main task that LLMs are trained to optimize? LLMs are trained to optimize next-word prediction: given a sequence of input tokens, they predict the most likely next token.
2. How do LLMs model the probability distribution over their entire vocabulary? LLMs don't just predict the single most likely next word, but instead assign probabilities to their entire vocabulary of possible next tokens. This allows the model to generate alternative sequences that are semantically similar to the original, rather than just imitating the training data.
3. How do LLMs choose which token to output when generating text? The most popular method is top-p sampling, where the model chooses a token at random from the set of most likely tokens that, combined, reach a certain probability threshold. However, this can lead to issues with hallucinations when the probability distribution is not highly skewed.
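The two ideas in this section, a probability distribution over the whole vocabulary and top-p truncation of it, can be sketched in a few lines of Python. This is a minimal illustration rather than any library's implementation: the function names, the four-token toy distributions, and the threshold `p=0.9` are all assumptions chosen for the example.

```python
import math
import random

def softmax(logits):
    """Turn raw model scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Return indices of the smallest set of most likely tokens
    whose combined probability reaches the threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def sample_from(probs, kept, rng=random):
    """Pick one kept token at random, weighted by its probability."""
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical toy distributions over a four-token vocabulary.
skewed = [0.80, 0.12, 0.05, 0.03]
flat = [0.25, 0.25, 0.25, 0.25]

kept_skewed = top_p_filter(skewed, p=0.9)  # two most likely tokens
kept_flat = top_p_filter(flat, p=0.9)      # every token survives
next_token = sample_from(skewed, kept_skewed)
```

Note that the fixed threshold `p` does not adapt to the shape of the distribution: the number of surviving tokens depends only on how quickly the cumulative sum reaches `p`.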
[02] Min-p Sampling
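1. How does Min-p sampling work? Min-p sampling truncates the probability distribution using a dynamic threshold that depends on the highest probability value: a token is kept only if its probability is at least a set fraction of the top token's probability. When the distribution is highly skewed, the threshold is high and low-probability tokens are rejected; when the distribution is flatter, the threshold drops and more tokens remain available, which preserves creativity.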
2. What are the benefits of Min-p sampling compared to top-p sampling? Min-p sampling is more versatile and better at preventing hallucinations. It favors highly probable tokens when the answer is obvious, without reducing the model's creativity in more open-ended tasks.
3. How has Min-p sampling performed in benchmarks? When tested on popular benchmarks like GSM8k (math) and GPQA (multiple-choice questions), models using Min-p sampling scored 10-20% higher than with top-p sampling.
4. What is the significance of Min-p sampling? The article suggests that Min-p sampling could become as standard as AdamW and MLPs in the field of neural networks, as it is a simple yet highly effective improvement that can be widely adopted by LLMs.
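The truncation rule described in this section can be sketched as follows. The sketch assumes the usual formulation of Min-p, in which the cutoff is a base fraction multiplied by the highest probability; the summary above only states that the threshold depends on the highest probability value. The names `min_p_filter`, `min_p_sample`, and `p_base`, the value `0.1`, and the toy distributions are illustrative assumptions, not taken from the article.

```python
import random

def min_p_filter(probs, p_base=0.1):
    """Keep tokens whose probability is at least p_base times the
    probability of the most likely token (assumed Min-p rule)."""
    threshold = p_base * max(probs)
    return [i for i, q in enumerate(probs) if q >= threshold]

def min_p_sample(probs, p_base=0.1, rng=random):
    """Sample one token index from the tokens that survive the cutoff,
    weighted by their original probabilities."""
    kept = min_p_filter(probs, p_base)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical toy distributions over a four-token vocabulary.
skewed = [0.90, 0.05, 0.03, 0.02]  # the answer is obvious
flat = [0.30, 0.25, 0.25, 0.20]    # open-ended continuation

kept_skewed = min_p_filter(skewed)  # cutoff 0.09: only the top token survives
kept_flat = min_p_filter(flat)      # cutoff 0.03: all four tokens survive
next_token = min_p_sample(flat)
```

Because the cutoff scales with the top probability, the same `p_base` is strict on the skewed distribution and permissive on the flat one, which is the behavior the answers above attribute to Min-p.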