Open LLMs don’t need to beat OpenAI
🌈 Abstract
The article surveys the current language-model landscape, focusing on the performance and cost-effectiveness of OpenAI's models compared to open-source alternatives like Llama 3 and Mixtral 8x22B. It also explores the potential of fine-tuning open-source models and the challenge of overcoming OpenAI's scale advantage.
🙋 Q&A
[01] Overview of the Language Model Landscape
1. What are the key observations the author has made about the language model landscape over the past 6 months?
- OpenAI still sells the best top-end models; Claude 3 Opus is the only model that has achieved an Elo comparable to GPT-4's, but at roughly 3x the cost.
- The gap between the top tier and second tier of models has shrunk dramatically, with open-source models like Llama 3 now firmly in the second tier.
- The author previously argued that open-source LLMs should focus on efficiency rather than trying to compete with GPT-4 in size and capability, and this prediction has largely been borne out.
2. How has the quality and efficiency of open-source LLMs like Llama 3 and Mixtral 8x22B improved in the past 3 months?
- The quality of these models relative to their parameter counts is now quite impressive, and the author has been experimenting with them more and more.
- Inference for these models has also become dramatically more efficient: Llama 3 can reach at least 10 tokens/s on a MacBook, and cloud deployments now achieve production-grade inference speeds.
3. What is the author's view on the future of fine-tuning open-source LLMs compared to proprietary models like GPT-3.5?
- The author believes that fine-tuning open-source LLMs is now incredibly attractive, given the rising quality of the models and the plummeting cost of inference.
- The author is considering replacing parts of their production inference stack with Llama 3 and is excited to experiment with fine-tuning Llama 3 instead of GPT-3.5, as they expect it will be higher quality, faster, and cheaper.
[02] Comparison of Open-Source and Proprietary Models
1. Why does the author believe that open-source LLMs won't catch up to OpenAI's proprietary models?
- The scale advantage and positive consumer feedback loop enjoyed by proprietary model builders like OpenAI are daunting; even Meta, with all its resources, has been unable to close the gap at the very top.
- However, the author believes that open-source LLMs don't need to be the best models around to survive.
2. What is the author's view on the future of RAG (Retrieval-Augmented Generation) versus fine-tuning?
- The author has gone back and forth on this issue, but now believes that "both" approaches will matter, with the path toward efficient, scalable fine-tuning having become much clearer in the last month.
- The author expects to see a proliferation of narrow, expert LLMs as the details of fine-tuning Llama 3 are worked out.
3. What is the author's perspective on the difficulty of improving Elo scores for the top language models?
- The author notes that Elo scores seem to exhibit asymptotic behavior, with GPT-4 finding that asymptote first.
- The author suggests that going from 1000 to 1100 Elo may be much easier than going from 1150 to 1200 Elo, implying that further improvements in the top models will become increasingly difficult.
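The Elo discussion above can be made concrete with the standard Elo expected-score formula. A short sketch (the ratings below are illustrative, not taken from any leaderboard) shows what a rating gap means in head-to-head win probability; note that the formula itself depends only on the gap, so the difficulty of climbing from 1150 to 1200 reflects how hard it is to keep winning matchups near the frontier, not the arithmetic:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Evenly matched models win half the time.
print(expected_score(1000, 1000))  # 0.5

# A 100-point gap means winning ~64% of head-to-head comparisons,
# whether the gap is 1000 vs 1100 or 1150 vs 1250.
print(round(expected_score(1100, 1000), 2))  # 0.64
print(round(expected_score(1250, 1150), 2))  # 0.64

# Even a 50-point climb requires beating a previously equal model
# in ~57% of matchups -- each point gets harder to earn near the top.
print(round(expected_score(1200, 1150), 2))  # 0.57
```

Because every rating point must be earned through pairwise wins, a model near the asymptote has few opponents left against which large win margins are achievable, which is consistent with the author's observation that top-model Elo gains are slowing.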