Grokking, a New Form of Reasoning
Abstract
The article discusses the limitations of current large language models (LLMs) and introduces a technique called "grokking" that can potentially make small LLMs much smarter than the current frontier AI models. It explores how grokking allows models to go beyond just memorizing data and develop a deeper, more holistic understanding.
Q&A
[01] The Limitations of Current LLMs
1. What are the four ways an AI can reason, according to the article?
- Implicit reasoning
- Prompted "thought" generation
- Few-shot reasoning
- Active search reasoning
2. Why are the second and third options (prompted "thought" generation and few-shot reasoning) limited and mediocre in their results? The article states that in both cases, the results are "pretty mediocre and, as we will see in a minute, outright inferior to the method we are seeing today."
3. Why is the fourth option (active search reasoning) too complex or expensive? The article mentions that "the sheer scale of compute and memory requirements that this paradigm requires makes it today a wet dream far from reality."
4. What are some examples provided to show that current LLMs cannot reason effectively over their knowledge?
- Failing to compose two facts they already know, for example inferring that Barack Obama's wife was born in 1964 from the facts that his wife is Michelle and that Michelle was born in 1964.
- Failing to infer that if Alice has 10 brothers and 10 sisters, each of her brothers has 11 sisters (Alice's 10 sisters plus Alice herself).
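Both failure cases are trivial once the reasoning is made explicit, which is what makes them striking. The sketch below is only an illustration, not code from the article: the fact table, the relation names, and both helper functions are hypothetical, and the bridging fact (who Barack Obama's wife is) is added here as an assumption so that the two-hop lookup has something to chain through.

```python
from typing import Dict, Optional, Tuple, Union

Value = Union[str, int]

# Hypothetical fact store: (subject, relation) -> object.
# The relation names are made up for this illustration.
FACTS: Dict[Tuple[str, str], Value] = {
    ("Barack Obama", "wife"): "Michelle Obama",
    ("Michelle Obama", "birth_year"): 1964,
    ("Barack Obama", "birth_year"): 1961,
}


def two_hop(subject: str, first_relation: str, second_relation: str,
            facts: Dict[Tuple[str, str], Value] = FACTS) -> Optional[Value]:
    """Compose two stored facts: follow first_relation, then second_relation."""
    bridge = facts.get((subject, first_relation))
    if not isinstance(bridge, str):
        return None  # no bridge entity to continue from
    return facts.get((bridge, second_relation))


def sisters_of_a_brother(alice_sisters: int) -> int:
    """Each of Alice's brothers has all of Alice's sisters plus Alice herself."""
    return alice_sisters + 1


print(two_hop("Barack Obama", "wife", "birth_year"))  # 1964
print(sisters_of_a_brother(10))                       # 11
```

An explicit program gets the bridge entity for free because the lookup is written out step by step. The article's point is that an LLM answering in a single pass has to perform the same chaining implicitly, inside its weights.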
[02] Grokking: A Potential Solution
1. What is the key idea behind the technique of "grokking"? Grokking involves training models well beyond the point of overfitting, which is traditionally seen as something to avoid. By continuing to train the model on the same data, it can develop a more holistic and generalized understanding, rather than just memorizing the training data.
2. How does grokking allow models to go beyond just memorizing data and develop a deeper understanding? The article explains that by repeatedly exposing the model to the same data, it starts to find simpler, more generalizable solutions, rather than just memorizing specific patterns. This allows the model to develop a more fundamental understanding of the underlying concepts.
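One way to make "training well beyond the point of overfitting" concrete is to log training and held-out accuracy at each evaluation step and measure how long after fitting the training set the model starts to generalize. The following is a minimal sketch under stated assumptions: the function names and the 0.99 threshold are hypothetical, and the two curves are made-up numbers chosen only to show the shape of the effect, not results from the article.

```python
from typing import Optional, Sequence, Tuple


def first_step_at_or_above(acc: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first logged accuracy >= threshold, or None."""
    for step, value in enumerate(acc):
        if value >= threshold:
            return step
    return None


def grokking_gap(train_acc: Sequence[float], val_acc: Sequence[float],
                 threshold: float = 0.99) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Return (fit_step, generalize_step, gap) for two accuracy curves.

    fit_step:        first step where training accuracy reaches the threshold
    generalize_step: first step where held-out accuracy reaches the threshold
    gap:             generalize_step - fit_step, or None if either is missing
    """
    fit_step = first_step_at_or_above(train_acc, threshold)
    gen_step = first_step_at_or_above(val_acc, threshold)
    if fit_step is None or gen_step is None:
        return fit_step, gen_step, None
    return fit_step, gen_step, gen_step - fit_step


# Made-up illustrative curves (one value per evaluation step), NOT real results.
train_curve = [0.20, 0.70, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
val_curve = [0.10, 0.15, 0.20, 0.20, 0.25, 0.60, 0.95, 1.00]

print(grokking_gap(train_curve, val_curve))  # (2, 7, 5)
```

In this toy example the training set is fit at step 2, where the held-out curve is still flat and a conventional early-stopping rule would likely halt. Held-out accuracy only catches up five evaluation steps later, which is the delayed generalization that the term grokking refers to.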
3. How does grokking enable small models to outperform much larger frontier AI models in reasoning tasks? The article provides an example where a grokked GPT-2 level model (1.5 billion parameters) was able to outperform much larger models like GPT-4 Turbo and Gemini 1.5 Pro in complex reasoning tasks. This suggests that grokking can be a powerful technique for training more effective reasoners with smaller model sizes.
4. What are the potential implications of grokked models for the AI industry and investment landscape? The article suggests that grokked models could be a "compression factor," creating powerful reasoners that are much smaller and cheaper to run than frontier models. This could challenge the current industry focus on building ever-larger models and the associated hardware and infrastructure investments.
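To give a rough sense of why a small model is cheap to run, the memory needed just to hold its weights can be estimated from the parameter count. This is a back-of-the-envelope sketch: the helper name is hypothetical, the bytes-per-parameter values (2 for 16-bit, 4 for 32-bit weights) are assumptions, and activations, caches, and other runtime overhead are ignored. No figure is given for frontier models because their sizes are not stated in the article.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


# 1.5 billion parameters, the GPT-2-level size quoted in the previous answer.
print(weight_memory_gb(1.5e9))     # 3.0  (16-bit weights)
print(weight_memory_gb(1.5e9, 4))  # 6.0  (32-bit weights)
```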