Generative AI is a hammer and no one knows what is and isn’t a nail
🌈 Abstract
The article discusses the capabilities and limitations of large language models (LLMs) like ChatGPT, and argues that they are not universal problem-solvers despite the hype around them. It uses an analogy of hammers and artificial labor to illustrate the idea that LLMs, like hammers, are good at certain tasks but not others. The article suggests that there are fundamental limitations to the probabilistic guessing approach used by LLMs that make them unsuitable for tasks requiring high specificity, such as playing the sum-to-22 game or generating the decimal digits of pi. It also discusses the challenges of evaluating the capabilities of these models and the incentives for companies to promote them as universal solutions. Overall, the article cautions against the assumption that LLMs can solve any problem and argues for a more nuanced understanding of their strengths and weaknesses.
🙋 Q&A
[01] Limitations of Large Language Models
1. What are some examples of tasks that ChatGPT seems to be bad at, according to the article? The article provides several examples of tasks that ChatGPT seems to be bad at, including:
- Playing the sum-to-22 game optimally
- Generating the decimal digits of pi
- Generating an image of exactly 7 elephants
- Generating a video of a grandmother blowing out birthday candles that matches the specific details in the prompt
2. What is the author's theory for why ChatGPT struggles with these types of tasks? The author's theory is that these tasks require a high degree of specificity that is incompatible with the probabilistic guessing approach used by LLMs like ChatGPT. For these tasks, only a small set of outputs is "correct" among a much larger space of plausible-looking ones, so the probability that a sampled output happens to be correct is extremely low.
3. How does the author contrast the capabilities of ChatGPT with traditional computer programs? The author points out that traditional computer programs can easily solve tasks like playing the sum-to-22 game optimally, as demonstrated by the simple Python script provided in the article. The issue is not that these tasks are inherently difficult for computers, but rather that they are not well-suited to the generative approach used by ChatGPT.
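To illustrate how little code such a task needs, here is a minimal sketch of an optimal player. It is not the script from the article, and this summary does not state the rules of the game, so the rules here are an assumption: two players alternately add an integer from 1 to 7 to a running total that starts at 0, overshooting is not allowed, and whoever brings the total to exactly 22 wins. The names `best_move` and `play_first` are hypothetical.

```python
TARGET = 22    # assumed winning total
MAX_MOVE = 7   # assumed largest number a player may add per turn


def best_move(total, target=TARGET, max_move=MAX_MOVE):
    """Return a winning move from `total`, or None if no move wins against perfect play.

    Under the assumed rules, a player wins by always leaving the remaining
    distance to the target at a multiple of (max_move + 1).
    """
    remaining = target - total
    if remaining <= 0:
        raise ValueError("the game is already over")
    move = remaining % (max_move + 1)
    return move if move != 0 else None


def play_first(opponent_policy):
    """Play as the first player against `opponent_policy(total) -> move`.

    Returns (winner, history), where history lists (player, move, new_total).
    """
    total, player, winner, history = 0, "us", None, []
    while total < TARGET:
        if player == "us":
            # Fall back to 1 if no winning move exists (never needed when moving first).
            move = best_move(total) or 1
        else:
            move = opponent_policy(total)
            if not 1 <= move <= min(MAX_MOVE, TARGET - total):
                raise ValueError(f"illegal opponent move: {move}")
        total += move
        history.append((player, move, total))
        winner = player  # the last player to move reaches the target
        player = "opponent" if player == "us" else "us"
    return winner, history


if __name__ == "__main__":
    winner, history = play_first(lambda total: 1)
    print(winner, history)
```

Under these assumed rules the first player opens with 6 and then answers every opponent move `k` with `8 - k`, so the running total passes through 6, 14, and 22. The strategy is a single modulo operation, which fits the point that the task is easy for a conventional program.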
[02] Evaluating the Capabilities of LLMs
1. What challenges does the article identify in evaluating the capabilities of LLMs like ChatGPT? The article highlights a few key challenges:
- It is time-consuming and expensive to thoroughly evaluate an LLM's performance on a wide range of tasks.
- LLMs can often generate text that appears competent, even if they are not actually solving the task correctly.
- There is a lack of a clear, general theory about what types of tasks LLMs are well-suited for.
2. How does the article characterize the incentives for companies promoting LLMs as universal problem-solvers? The article suggests that companies have a strong incentive to promote the idea that LLMs like ChatGPT are universal problem-solvers, as this would justify massive investments and valuations. However, the author argues that this "universal hammer" theory is not supported by evidence and that the capabilities of these models are likely more limited.
[03] Potential Use Cases for LLMs
1. What are some examples of tasks the author believes LLMs like ChatGPT can be useful for? The author suggests that LLMs can be useful for tasks like:
- Documenting code
- Generating code refactorings or unit tests (with careful review)
- Debugging code
- Providing interactive thesaurus-like functionality
- Generating inoffensive marketing copy
- Providing placeholder content for video essays
However, the author cautions that these are relatively narrow use cases and do not justify the hype and investment around LLMs as universal problem-solvers.
2. What are the author's concerns about using LLMs for customer service chatbots? The author is skeptical about the effectiveness of using LLMs for customer service chatbots, arguing that this task requires a high degree of specificity and script-following that may not be well-suited to the probabilistic guessing approach of LLMs. The author cites examples of a customer service chatbot making incorrect statements about inventory and offering unauthorized discounts, suggesting that these types of errors are likely to occur.