TLDR: 1 year of building with LLMs – D-Squared
🌈 Abstract
The article discusses the key lessons learned by six practitioners who have been building with large language models (LLMs) for a year. It covers tactical, operational, and strategic lessons, with a focus on the tactical aspects.
🙋 Q&A
[01] Tactical Lessons
1. What are the key takeaways for most use cases?
- For most use cases, prompt engineering, retrieval-augmented generation (RAG), and proper evaluations will get you 80-90% of the way to your goal. Often, fine-tuning is completely unnecessary.
2. What are some questions to ask before embarking on fine-tuning?
- Questions to ask include whether synthetic data and open-source data can be used to bootstrap, and whether the expected benefits justify the time and money investment.
3. What are some practical tips for prompt engineering?
- Use few-shot prompting, including a handful of examples of ideal inputs and outputs in the prompt, to improve the LLM's performance with minimal effort.
- Ask the model to reason step by step (chain-of-thought prompting) before giving its final answer to improve its performance on reasoning-heavy tasks.
- Use delimiters (e.g., XML, JSON, Markdown) to set boundaries for different parts of the input and output data.
- Use "prompt chaining" to break down a larger task into smaller, focused prompts.
4. What are the benefits and considerations for using RAG setups?
- RAG setups "ground" the model in the provided data, instructing it to answer only from that context, which can reduce hallucinations.
- Both vector databases (for semantic search) and traditional keyword search (for exact text matches) have their place in a RAG architecture (see the hybrid-retrieval sketch below).
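
The toy sketch below illustrates the hybrid idea under simplifying assumptions: bag-of-words vectors stand in for real embeddings, an in-memory list stands in for a vector database, and the documents and scoring weights are made up. It combines a "semantic" score with exact keyword matches, then grounds the prompt in the retrieved context.

```python
# Toy hybrid retrieval + grounded prompt. Replace the bag-of-words vectors with
# real embeddings and a vector database in production.
from collections import Counter
import math

DOCS = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
    "You can export your data as CSV from the settings page.",
]

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    q = bow(query)
    scored = []
    for doc in DOCS:
        semantic = cosine(q, bow(doc))                   # stand-in for vector search
        keyword = sum(1 for t in q if t in doc.lower())  # stand-in for exact-match search
        scored.append((0.7 * semantic + 0.3 * keyword, doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

def grounded_prompt(query: str) -> str:
    """Instruct the model to answer only from the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return (
        "Answer using ONLY the context below. If the answer is not in the context, say so.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>{query}</question>"
    )

print(grounded_prompt("How long do refunds take?"))
```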
5. What are some considerations for setting up LLM evaluations?
- Evaluations let you iterate quickly on an LLM's outputs; they can be run manually by humans or automatically with another LLM acting as the judge.
- When using an LLM as the judge, make sure the judge model is appropriate for the task, keep the judging prompts consistent across runs, and monitor for drift over time (a minimal judge sketch follows below).
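
Below is a minimal LLM-as-judge sketch, written as one plausible way to apply these considerations rather than the article's own code. The judge prompt is a fixed template so scores stay comparable over time, and each pair is judged in both orderings to guard against position bias. `judge_llm` and the stub are hypothetical placeholders.

```python
# Pairwise LLM-as-judge with a fixed template and an order-swap check.
import json

JUDGE_TEMPLATE = """You are comparing two answers to the same question.
<question>{question}</question>
<answer_a>{a}</answer_a>
<answer_b>{b}</answer_b>
Reply with JSON: {{"winner": "A" | "B" | "tie", "reason": "..."}}"""

def judge(question: str, a: str, b: str, judge_llm) -> str:
    # Run both orderings to reduce position bias, then reconcile the verdicts.
    first = json.loads(judge_llm(JUDGE_TEMPLATE.format(question=question, a=a, b=b)))
    second = json.loads(judge_llm(JUDGE_TEMPLATE.format(question=question, a=b, b=a)))
    # Flip the second verdict back to the original labeling.
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second["winner"]]
    return first["winner"] if first["winner"] == flipped else "tie"

# Usage with a stubbed judge model (always answers "A", so the orderings disagree -> tie).
stub = lambda prompt: '{"winner": "A", "reason": "more specific"}'
print(judge("What is RAG?", "Retrieval-augmented generation ...", "A database.", stub))
```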
[02] Operational Lessons
1. What are the challenges when moving a prompt from one model to another?
- A prompt's performance can change dramatically when it is moved to a different model, even between versions from the same provider.
- Techniques to mitigate this include prompt tuning, model fine-tuning, and keeping a human in the loop to provide feedback (see the sketch below for one way to catch such regressions).
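
One common practice, not prescribed by the article but consistent with its emphasis on evaluations, is to re-run a fixed evaluation set against both the current and the candidate model before switching. The sketch below uses hypothetical `old_model` / `new_model` callables and illustrative test cases.

```python
# Minimal prompt-portability regression check: same prompts, same assertions,
# run against the current and candidate models before migrating.

TEST_CASES = [
    {"prompt": "Extract the email from: 'Contact bob@example.com for details.'",
     "expected_substring": "bob@example.com"},
    {"prompt": "Answer yes or no: is 17 a prime number?",
     "expected_substring": "yes"},
]

def pass_rate(model) -> float:
    passed = sum(
        1 for case in TEST_CASES
        if case["expected_substring"].lower() in model(case["prompt"]).lower()
    )
    return passed / len(TEST_CASES)

def safe_to_swap(old_model, new_model, tolerance: float = 0.0) -> bool:
    """Only migrate if the candidate model is at least as good on the eval set."""
    return pass_rate(new_model) >= pass_rate(old_model) - tolerance

# Usage with stubbed models standing in for two providers or versions.
old = lambda p: "bob@example.com" if "email" in p.lower() else "yes"
new = lambda p: "Sure! The email is bob@example.com." if "email" in p.lower() else "Yes, it is."
print(safe_to_swap(old, new))  # True
```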
2. What are the considerations for the ideal GenAI user experience?
- The ideal GenAI user experience often involves placing a human in the loop to provide feedback, either explicitly or implicitly.
- Implicit feedback (e.g., whether the user accepts, edits, or ignores a generated output) is preferred over explicit feedback, since users rarely provide explicit ratings consistently (see the logging sketch below).
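
The sketch below shows one way to capture implicit feedback by logging what users actually do with each generation. The event names and fields are assumptions for illustration, not taken from the article.

```python
# Log user actions on generations as implicit feedback events.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class GenerationEvent:
    generation_id: str
    action: str             # e.g. "accepted", "edited", "regenerated", "ignored"
    edit_distance: int = 0  # how much the user changed the output, if they edited it
    timestamp: str = ""

def log_event(event: GenerationEvent) -> None:
    event.timestamp = datetime.now(timezone.utc).isoformat()
    # In practice this would feed an analytics pipeline; here we just print JSON.
    print(json.dumps(asdict(event)))

# A user accepted one suggestion untouched and heavily edited another.
log_event(GenerationEvent(generation_id="gen-001", action="accepted"))
log_event(GenerationEvent(generation_id="gen-002", action="edited", edit_distance=42))
```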
3. Why is it important to downsize models after proving the task is possible?
- Once product-market fit has been established, downsizing from large, easy-to-set-up model APIs to smaller, self-hosted models can save money and reduce latency.
- This increases profit margins by cutting costs while maintaining the core functionality (a back-of-envelope cost sketch follows below).
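
A back-of-envelope comparison makes the downsizing argument concrete. All prices, token counts, and request volumes below are made-up placeholders; plug in your own provider's numbers.

```python
# Hypothetical monthly cost comparison: hosted frontier model vs. smaller self-hosted model.
requests_per_day = 50_000
tokens_per_request = 1_500  # prompt + completion, assumed

large_api_price_per_1k_tokens = 0.01    # placeholder hosted-API price
small_self_hosted_price_per_1k = 0.001  # placeholder amortized GPU cost

def monthly_cost(price_per_1k: float) -> float:
    return requests_per_day * 30 * tokens_per_request / 1000 * price_per_1k

print(f"Large hosted model: ${monthly_cost(large_api_price_per_1k_tokens):,.0f}/month")
print(f"Small self-hosted:  ${monthly_cost(small_self_hosted_price_per_1k):,.0f}/month")
```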
[03] Strategic Lessons
1. Why is the ability to quickly swap out models not a durable moat for a company?
- The rapid pace of model innovation and the ease of migrating from one state-of-the-art model to the next are a gift to builders, but they do not create a durable moat for a company.
- Instead, the focus should be on creating quality processes and infrastructure, such as prompt engineering, RAG setups, and robust evaluation systems, as these are the components that create durable products and companies.
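
One way to act on this lesson, sketched below as an assumption rather than the article's recipe, is to keep the model behind a thin interface so swapping providers is a configuration change, while the prompts, RAG pipeline, and evaluation harness (the durable parts) stay put. The provider names and implementations are illustrative stubs.

```python
# Thin model abstraction: application code depends only on the Completer interface.
from typing import Callable, Dict

Completer = Callable[[str], str]

PROVIDERS: Dict[str, Completer] = {
    # In practice these would wrap real SDK calls; stubs keep the sketch runnable.
    "provider_a": lambda prompt: f"[provider_a] answer to: {prompt}",
    "provider_b": lambda prompt: f"[provider_b] answer to: {prompt}",
}

def get_completer(name: str) -> Completer:
    return PROVIDERS[name]

# Swapping models is now a one-line configuration change.
llm = get_completer("provider_a")
print(llm("Summarize our refund policy."))
```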