Why LLMs Can't Plan and Are Unlikely to Reach AGI
🌈 Abstract
The article provides a critical analysis of the capabilities and limitations of Large Language Models (LLMs), such as ChatGPT, and discusses the hype surrounding their intelligence. It covers topics like the essential characteristics of human intelligence, the Transformer architecture, the concept of "Grokking", the issues with evaluating LLMs, and the problems with neural scaling laws and emergent behavior claims. The article also presents several examples and research papers that highlight the shortcomings of LLMs in tasks like reasoning, planning, and common sense understanding.
🙋 Q&A
[01] The Messy World of LLMs
1. What are the four essential characteristics of human intelligence that current AI systems lack?
- Reasoning
- Planning
- Persistent memory
- Understanding the physical world
2. In what sense are LLMs an "idea-generation machine"? LLMs can generate or provide approximate answers to almost any textual query, even when they have not seen similar content before, thanks to the massive scale of their training data.
3. What is the concept of "Emergent Behavior" in the context of LLMs? Emergent Behavior refers to the claim that, as LLMs grow in size, they develop qualitatively new capabilities that were not explicitly programmed or trained for.
4. What is the author's view on the claim that scaling LLMs will lead to human-level or superintelligent AI? The author is skeptical of this claim, arguing that scaling alone will not make LLMs truly intelligent or overcome their current limitations, such as their unreliability at basic tasks like multi-digit multiplication (a simple way to spot-check this follows below).
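The multiplication claim is easy to spot-check, because exact integer arithmetic gives an unambiguous ground truth. A minimal sketch, assuming a hypothetical `query_llm(prompt)` helper (a placeholder, not any real API) that wraps whichever model is being tested:

```python
# Sketch: compare a model's answers to exact integer multiplication.
# query_llm is a hypothetical placeholder, not a real library call.
import random

def query_llm(prompt: str) -> str:
    """Route the prompt to the model under test (plug in your own client)."""
    raise NotImplementedError

def check_multiplication(trials: int = 5, digits: int = 4) -> None:
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        expected = a * b  # exact ground truth
        reply = query_llm(f"What is {a} * {b}? Reply with only the number.")
        correct = reply.strip().replace(",", "") == str(expected)
        print(f"{a} * {b} = {expected}; model said {reply!r}; correct={correct}")
```

In practice, accuracy tends to fall off as `digits` grows, which is the kind of degradation the author points to.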
[02] Evaluating and Understanding LLMs
1. What is the "LLM reversal curse" and how does it challenge the perception of LLM intelligence? The reversal curse refers to the finding that an LLM can answer a fact stated in one direction yet fail completely when the same fact is queried from the opposite direction: for example, a model that correctly answers "Who is Tom Cruise's mother?" may fail on "Who is Mary Lee Pfeiffer's son?". Models that appear to get everything correct and to generalize can thus break down as soon as the question is asked from a different perspective, suggesting recall of training patterns rather than genuine understanding.
2. What are the issues with the Neural Scaling Laws and the claim of "Emergent" capabilities in LLMs? The Neural Scaling Laws describe how test loss falls smoothly, roughly as a power law, as model size, data, and compute increase, while "emergent" capabilities are abilities said to appear abruptly only beyond a certain scale. The author argues that both claims have flaws: LLM behavior is more consistent with memorization than with true generalization, and the world models they build may be neither as sophisticated nor as correct as claimed.
3. What are the problems with the current benchmarking and evaluation of LLMs? The author highlights benchmark data leakage (test items appearing, verbatim or nearly so, in the training data) and the difficulty of distinguishing memorization from generalization in LLM performance, both of which make it hard to evaluate their capabilities accurately. A simple way to probe for such leakage is sketched below.
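A minimal leakage probe, assuming local access to (a sample of) the training text. It flags a benchmark item as potentially contaminated when any of its 8-word sequences also appears in a training document; the `is_contaminated` helper and the 8-gram threshold are illustrative choices, not a method taken from the article:

```python
# Sketch: flag benchmark items whose word n-grams also occur in training text.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_item: str, corpus: Iterable[str], n: int = 8) -> bool:
    item_grams = ngrams(test_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus)

# Toy example: the benchmark question appears almost verbatim in "training" text.
corpus = ["the largest 5-digit number that contains the digit 3 is a classic warm-up puzzle"]
print(is_contaminated("What is the largest 5-digit number that contains the digit 3?", corpus))  # True
```

Real decontamination pipelines work on tokenized corpora at much larger scale, but the idea is the same: high n-gram overlap means a benchmark score may be measuring memorization rather than capability.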
[03] Limitations and Failures of LLMs
1. What are some examples of simple tasks that LLMs fail at, and how does this contradict the claims about their intelligence? The article shows LLMs failing at tasks like finding the largest 5-digit number that contains the digit 3, respecting the physics of the real world in text-to-video generation, and solving the classic "wolf, goat, and cabbage" river-crossing problem; both puzzles are worked through in the sketches after this list.
2. Why does the author argue that LLMs cannot do well on planning tasks, even the latest models like GPT-4? The author cites research showing that LLMs perform poorly on planning benchmarks: the plans they produce cannot be reliably verified and executed to reach the desired goal (the second sketch after this list shows what a mechanically verifiable plan looks like).
3. What are the issues with the "steerability" and controllability of LLMs, and how does this relate to their lack of understanding of right and wrong? The author argues that LLMs are difficult to steer and control: they are largely unaware of what is right or wrong and cannot predict the consequences of their own output even ten words ahead.
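Both puzzle examples above can be pinned down exactly in a few lines of code. The first is a two-line exhaustive check:

```python
# Brute-force check: the largest 5-digit number that contains the digit 3.
largest = max(n for n in range(10_000, 100_000) if "3" in str(n))
print(largest)  # 99993
```

For the river-crossing puzzle, a classical breadth-first search returns a plan whose every move can be replayed and checked against the safety rules, which is exactly the kind of verifiability the planning benchmarks above demand. A minimal sketch; the state encoding and move labels are illustrative choices, not taken from the article:

```python
# Sketch: classical BFS planner for the wolf-goat-cabbage puzzle.
from collections import deque

ITEMS = ("wolf", "goat", "cabbage")

def safe(state):
    # Banks are 0 (start) and 1 (goal). The goat may not be left with the
    # wolf or the cabbage unless the farmer is on the same bank.
    for a, b in (("wolf", "goat"), ("goat", "cabbage")):
        if state[a] == state[b] != state["farmer"]:
            return False
    return True

def solve():
    start = {"farmer": 0, "wolf": 0, "goat": 0, "cabbage": 0}
    goal = {"farmer": 1, "wolf": 1, "goat": 1, "cabbage": 1}
    queue = deque([(start, [])])
    seen = {frozenset(start.items())}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        # The farmer crosses alone or with one item from his current bank.
        for cargo in (None, *ITEMS):
            if cargo is not None and state[cargo] != state["farmer"]:
                continue
            nxt = dict(state)
            nxt["farmer"] ^= 1
            if cargo is not None:
                nxt[cargo] ^= 1
            key = frozenset(nxt.items())
            if safe(nxt) and key not in seen:
                seen.add(key)
                queue.append((nxt, plan + [cargo or "(cross alone)"]))
    return None

print(solve())
# ['goat', '(cross alone)', 'wolf', 'goat', 'cabbage', '(cross alone)', 'goat']
```

The point of the sketch is the contrast the author draws: a symbolic planner's output is checkable step by step, whereas an LLM's free-form plan has to be trusted or verified externally.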
[04] The Bigger Picture and the Way Forward
1. What is the author's view on the hype and marketing around LLMs by big tech companies and social media creators? The author is critical of the excessive hype and marketing around LLMs, which they believe is often misleading and not supported by the actual capabilities of the models.
2. What are the author's concerns about the proliferation of new LLM models and the "shittification" of language and the internet? The author argues that the constant release of new LLM models, often with minimal changes, wastes resources, and that the flood of AI-generated content on platforms like LinkedIn is degrading the quality of information online.
3. What is the author's recommendation for the use of LLMs and the future of AI research? The author suggests that LLMs should be used as a last resort, not the first option, and that more focus should be placed on building products and solutions from first principles, rather than relying solely on the capabilities of LLMs. They also call for more rigorous research and testing before releasing new models or making claims about their capabilities.