Large Language Monkeys: Is The Best Model Always the Best Option? No.
Abstract
The article discusses research from Google DeepMind, Stanford, and Oxford that challenges the common assumption that the "most intelligent" large language model (LLM) is always the best choice. It presents evidence that smaller LLMs, when allowed to sample many candidate solutions, can outperform larger, more advanced models on certain tasks. The article also explores the concept of "long-inference" models, where increasing the inference computation by giving the model more time to think can lead to significant performance improvements, independent of model size. Overall, it suggests that enterprises should reconsider how they deploy generative AI models and explore strategies that leverage repeated sampling and longer inference times.
Q&A
[01] Large Language Monkeys
1. What are the two ways that LLMs can improve their intelligence?
- Compression: LLMs can improve their intelligence by finding patterns in data, allowing them to perform reasoning instinctively, quickly, and without second thoughts.
- Search: LLMs can improve their intelligence by exploring possible solutions at runtime until they find the best one, which is a slow, deliberate, and "conscious" way of solving a problem.
2. How do these two intelligence types relate to the "thinking modes" theory by Daniel Kahneman? The two intelligence types are similar to the "System 1" (fast and automatic) and "System 2" (slow, deliberate, and conscious) thinking modes described in Kahneman's theory.
3. What is the key insight behind the "power of search" approach for LLMs? The key insight is that by allowing LLMs to generate multiple possible solutions to a problem and then selecting the best one, the likelihood of getting the correct answer increases, similar to how humans explore different approaches to solve complex problems.
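To make the search idea concrete, here is a minimal sketch of best-of-k sampling; `generate` and `verify` are hypothetical stand-ins for a stochastic LLM call and an automatic checker such as a unit-test harness:

```python
def coverage(p: float, k: int) -> float:
    """Chance that at least one of k samples is correct, assuming each
    independent sample succeeds with probability p."""
    return 1.0 - (1.0 - p) ** k

def best_of_k(generate, verify, prompt: str, k: int):
    """Draw up to k candidate solutions and return the first one the
    verifier accepts; return None if every attempt fails."""
    for _ in range(k):
        candidate = generate(prompt)   # one stochastic LLM sample
        if verify(candidate):          # e.g., run the problem's unit tests
            return candidate
    return None

# Even a weak sampler becomes reliable given enough attempts:
# a 15% per-try success rate yields ~98% coverage over 25 tries.
print(f"{coverage(0.15, 25):.3f}")  # 0.983
```

The `coverage` formula is the core of the argument: success probability compounds across independent attempts, so sampling more is a direct substitute for sampling better.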
[02] Seeing LLMs in a New Light
1. What is the surprising discovery about the performance of smaller LLMs compared to larger, more advanced models? Researchers found that a smaller LLM (DeepSeek-Coder-V2) that samples many solutions to a problem can outperform state-of-the-art models like GPT-4o or Claude 3.5 Sonnet, even when the compute budget is fixed (a back-of-the-envelope sketch follows this list).
2. How much did the smaller, multiple-sampling model outperform the larger models in the SWE-Bench Lite benchmark? The smaller model with repeated sampling solved 56% of SWE-Bench Lite issues, a new state of the art, while the previous single-attempt record of 43% was held by a system combining GPT-4o and Claude 3.5 Sonnet.
3. What is the key insight about the relationship between model size and performance when using multiple sampling? The research shows that smaller models, when allowed to sample multiple solutions, can outperform larger, more advanced models, suggesting that enterprises should reconsider their approach to deploying generative AI models.
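A back-of-the-envelope illustration of the fixed-budget tradeoff the researchers describe; the per-attempt success rates and relative costs below are illustrative assumptions, not numbers from the paper:

```python
def coverage(p: float, k: int) -> float:
    """Chance of at least one success over k independent attempts."""
    return 1.0 - (1.0 - p) ** k

BUDGET = 100                       # arbitrary cost units per problem
LARGE_COST, LARGE_P = 100, 0.45    # big model: one pricey, strong attempt
SMALL_COST, SMALL_P = 10, 0.16     # small model: ten cheap, weak attempts

large_attempts = BUDGET // LARGE_COST
small_attempts = BUDGET // SMALL_COST

print(f"large model: {coverage(LARGE_P, large_attempts):.2f}")  # 0.45
print(f"small model: {coverage(SMALL_P, small_attempts):.2f}")  # 0.83
```

Under these assumptions the cheaper model's ten attempts nearly double the large model's single-shot success rate, which is the shape of the result the paper reports.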
[03] The Forgotten Unlocker
1. What is the key finding about the relationship between inference compute and model performance? Researchers found that increasing the inference compute, i.e., allowing the model more time to think and generate multiple solutions, can lead to significant performance improvements, independent of the model size.
2. What is the implication of this finding for enterprises deploying generative AI models? The research suggests that enterprises should prioritize implementing multiple-sample approaches over simply using the "best" model available, as increasing the inference compute can yield better results.
3. What is the limitation identified in the current LLM-based verifiers used to evaluate the generated solutions? The research points out that in domains without automatic verifiers (such as unit tests), the available selection methods, like majority voting or LLM-based judges, seem to plateau as samples increase. So while inference scaling laws may hold, we have yet to create truly robust LLM verifiers that can confidently scale with them.
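When there is no automatic check, a common fallback selector is majority voting over the samples' final answers (self-consistency); a minimal sketch with hypothetical outputs:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer among the samples.
    This only works when answers can be compared exactly, and its
    accuracy tends to plateau: once the answer distribution
    stabilizes, extra samples stop changing the winner."""
    return Counter(answers).most_common(1)[0][0]

samples = ["42", "41", "42", "42", "17"]  # hypothetical model outputs
print(majority_vote(samples))             # -> "42"
```

Coverage keeps climbing with more samples, but without a verifier that reliably recognizes the correct one, that coverage cannot be cashed in; this is the gap the researchers highlight.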