What We Learned from a Year of Building with LLMs (Part I)
Abstract
The article discusses best practices and lessons learned for building successful products using large language models (LLMs). It covers tactical, operational, and strategic aspects of working with LLMs, including prompting techniques, retrieval-augmented generation, workflow design, evaluation, and monitoring.
Q&A
[01] Tactical Nuts and Bolts of Working with LLMs
1. What are some effective prompting techniques discussed in the article?
- N-shot prompts + in-context learning: Providing the LLM with a few examples that demonstrate the task and align outputs to expectations (see the sketch after this list).
- Chain-of-thought (CoT) prompting: Encouraging the LLM to explain its thought process before returning the final answer.
- Providing relevant resources: Using retrieval-augmented generation (RAG) to expand the model's knowledge base and reduce hallucinations.
- Structured input and output: Formatting inputs and outputs to help the model better understand the task and integrate with downstream systems.
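A minimal sketch of the first two techniques, combining an n-shot prompt with a chain-of-thought instruction and a structured (JSON) output request. The `call_llm` helper, the example reviews, and the field names are illustrative assumptions, not something prescribed by the article:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call; swap in whichever API you use."""
    raise NotImplementedError

# N-shot examples that demonstrate the task and the expected output shape.
FEW_SHOT_EXAMPLES = [
    {"review": "Battery died after two days.", "label": "negative"},
    {"review": "Setup took five minutes and it just works.", "label": "positive"},
]

def build_prompt(review: str) -> str:
    lines = []
    for ex in FEW_SHOT_EXAMPLES:
        answer = json.dumps({"label": ex["label"]})
        lines.append(f"Review: {ex['review']}\nAnswer: {answer}")
    examples = "\n\n".join(lines)
    # Chain-of-thought instruction plus a structured (JSON) output request.
    return (
        "Classify the sentiment of a product review.\n\n"
        f"{examples}\n\n"
        f"Review: {review}\n"
        "Think step by step about the reviewer's experience, then answer "
        'with JSON of the form {"label": "positive" or "negative"}.'
    )
```

The structured "Answer:" lines do double duty: they show the model both the task and the exact output format that downstream code will parse.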
2. How can we avoid the "God Object" anti-pattern in prompts?
- Break down complex prompts into simpler, focused prompts that can be iterated on and evaluated individually, as in the sketch below.
- Rethink the amount of context being provided to the model and extract only what is necessary.
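As a rough illustration of this decomposition, the sketch below splits a single "do everything" prompt into three focused steps that can each be tested and iterated on in isolation; `call_llm` and the step names are placeholder assumptions, not from the article:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call."""
    raise NotImplementedError

# Instead of one prompt that extracts, summarizes, and formats in a single shot,
# each step gets its own small prompt that can be evaluated in isolation.
def extract_key_points(transcript: str) -> str:
    return call_llm(f"List the key decisions and action items in this transcript:\n{transcript}")

def summarize(key_points: str) -> str:
    return call_llm(f"Write a three-sentence summary of these points:\n{key_points}")

def format_as_email(summary: str) -> str:
    return call_llm(f"Rewrite this summary as a short follow-up email:\n{summary}")

def meeting_followup(transcript: str) -> str:
    return format_as_email(summarize(extract_key_points(transcript)))
```

Each stage can now have its own examples, tests, and prompt iterations without touching the others, and each receives only the context it actually needs.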
3. What are some key considerations for effective retrieval-augmented generation (RAG)?
- Relevance: Measure how well the retrieval system ranks relevant documents above irrelevant ones, e.g., with ranking metrics such as MRR or NDCG.
- Information density: Prefer more concise and information-dense documents over those with extraneous details.
- Level of detail: Provide additional context, such as column descriptions and sample values, to help the LLM better understand the semantics of the data.
- Hybrid approaches: Combine keyword-based and embedding-based retrieval for better performance.
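A toy sketch of such a hybrid ranker, blending a keyword score with embedding similarity. In practice the keyword side would typically be BM25 and the vectors would come from an embedding model; the scoring functions and the `alpha` weight here are simplified assumptions:

```python
from math import sqrt

def keyword_score(query: str, doc: str) -> float:
    """Toy keyword relevance: fraction of query terms present in the document."""
    terms = set(query.lower().split())
    return sum(t in doc.lower() for t in terms) / max(len(terms), 1)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_rank(query: str, query_emb: list[float],
                docs: list[str], doc_embs: list[list[float]],
                alpha: float = 0.5) -> list[str]:
    """Blend keyword and embedding similarity; alpha weights the keyword side."""
    scored = [
        (alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_emb, emb), doc)
        for doc, emb in zip(docs, doc_embs)
    ]
    return [doc for _, doc in sorted(scored, reverse=True)]
```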
4. When should we consider fine-tuning versus using RAG?
- Recent research suggests RAG may outperform fine-tuning in many cases, and it is often easier and cheaper to maintain.
- However, fine-tuning can be effective for tasks where prompting falls short, and the higher upfront cost may be worth it.
[02] Operational Strategies for Reliable LLM Workflows
1. How can we improve the reliability of LLM-based workflows?
- Decompose complex tasks into simpler, well-defined steps with clear objectives.
- Use a deterministic planning-and-execution approach, where the agent first generates a plan and then executes it in a structured way.
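A minimal sketch of the plan-then-execute pattern, assuming a placeholder `call_llm` client: the model is asked for a structured plan up front, and each planned action maps to a deterministic handler rather than being improvised mid-run. The action names and JSON shape are illustrative assumptions:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call."""
    raise NotImplementedError

# Each action the planner may emit maps to a deterministic, well-defined handler.
STEP_HANDLERS = {
    "search_docs": lambda arg: f"search results for: {arg}",  # stub retrieval step
    "draft_answer": lambda arg: call_llm(f"Draft an answer using:\n{arg}"),
}

def plan(task: str) -> list[dict]:
    """Ask the model for a structured plan up front, before anything is executed."""
    raw = call_llm(
        "Return a JSON list of steps for this task, each shaped like "
        '{"action": "search_docs" or "draft_answer", "input": "..."}.\n'
        f"Task: {task}"
    )
    return json.loads(raw)

def execute(task: str) -> list[str]:
    # Execute the fixed plan step by step; unknown actions fail loudly here.
    return [STEP_HANDLERS[step["action"]](step["input"]) for step in plan(task)]
```

Keeping the set of allowed actions small and deterministic makes each run easier to trace, test, and debug than a free-form agent loop.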
2. What techniques can we use to increase output diversity?
- Shuffle the order of input items in the prompt (see the sketch after this list).
- Keep track of recent outputs and avoid suggesting similar items.
- Vary the phrasing used in the prompts to shift the focus.
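The sketch below combines the first two ideas, assuming a placeholder `call_llm` client: candidates are shuffled before prompting, and a small window of recent outputs is passed back so the model steers away from repeats. The prompt wording and window size are illustrative assumptions:

```python
import random
from collections import deque

RECENT = deque(maxlen=50)  # rolling window of recently surfaced items

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call."""
    raise NotImplementedError

def recommend(candidates: list[str]) -> str:
    # Shuffle so the model doesn't over-weight whatever happens to appear first.
    shuffled = random.sample(candidates, k=len(candidates))
    suggestion = call_llm(
        "Pick one product to highlight from the candidates, avoiding anything "
        f"recently shown.\nRecently shown: {list(RECENT)}\nCandidates: {shuffled}"
    )
    RECENT.append(suggestion)  # remember it so near-term calls steer elsewhere
    return suggestion
```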
3. How can caching help improve performance and reduce costs?
- Use unique IDs to cache responses for repeatable inputs.
- Leverage techniques from search, such as autocomplete and spelling correction, to normalize user input and increase cache hit rates.
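A minimal caching sketch under those assumptions: responses are keyed on a stable item ID plus a hash of the normalized input, and `call_llm` stands in for a real client. The normalization here is deliberately cheap; spelling correction or autocomplete-style canonicalization could slot into the same place:

```python
import hashlib

CACHE: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call."""
    raise NotImplementedError

def normalize(text: str) -> str:
    """Cheap normalization (lowercase, collapse whitespace) to raise cache hit rates."""
    return " ".join(text.lower().split())

def cached_summary(item_id: str, text: str) -> str:
    # Key on a stable item ID plus a hash of the normalized input.
    key = f"{item_id}:{hashlib.sha256(normalize(text).encode()).hexdigest()}"
    if key not in CACHE:
        CACHE[key] = call_llm(f"Summarize:\n{text}")
    return CACHE[key]
```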
4. When should we consider fine-tuning a model for a specific task?
- Fine-tuning can be effective when prompting alone falls short of the desired performance.
- However, the higher upfront cost of fine-tuning, including data annotation and model training, should be weighed against the benefits.
[03] Evaluating and Monitoring LLM-based Applications
1. What are some best practices for evaluating LLM-based applications?
- Create unit tests with assertions based on multiple criteria, such as requiring or excluding specific phrases and checking word or sentence counts (see the sketch after this list).
- Use execution-evaluation for code generation tasks, where you run the generated code and check the runtime state.
- Leverage "dogfooding" by using the product as intended by customers to identify real-world failure modes.
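Two small sketches of these ideas: an assertion-based unit test over a generated blurb, and an execution-evaluation that runs generated code and inspects the resulting state. The specific phrases, length bounds, and expected variable are illustrative assumptions, and real execution-evaluation should run inside a proper sandbox:

```python
def test_product_blurb(blurb: str) -> None:
    # Assertion-based checks: required phrase, banned phrase, length bounds.
    assert "free shipping" in blurb.lower()
    assert "guarantee" not in blurb.lower()
    assert 20 <= len(blurb.split()) <= 80

def execution_eval(generated_code: str) -> bool:
    """Execution-evaluation: run the generated code in a scratch namespace and
    check the resulting runtime state (use a proper sandbox in production)."""
    scope: dict = {}
    try:
        exec(generated_code, scope)
    except Exception:
        return False
    return scope.get("total") == 42  # expected state for this particular test case
```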
2. How can we effectively use LLM-as-Judge for evaluation?
- Perform pairwise comparisons between control and treatment outputs to assess the direction of improvement, even if the magnitude is noisy (see the sketch after this list).
- Iterate on the LLM-as-Judge approach by logging its responses, critiques, and final outcomes, and reviewing them with stakeholders to identify areas for improvement.
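A minimal pairwise LLM-as-Judge sketch, assuming a placeholder `call_llm` client; the prompt wording and logging helper are illustrative assumptions. Swapping the A/B order on half the calls is a common way to control for position bias, and keeping the full critique supports the stakeholder reviews mentioned above:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call."""
    raise NotImplementedError

def log_for_review(entry: str) -> None:
    print(entry)  # stand-in for whatever logging/observability you use

def judge_pair(task: str, control: str, treatment: str) -> str:
    """Pairwise LLM-as-Judge: ask which output is better rather than for a score."""
    verdict = call_llm(
        f"Task: {task}\n\n"
        f"Response A:\n{control}\n\n"
        f"Response B:\n{treatment}\n\n"
        "Give a one-sentence critique of each response, then answer with exactly "
        "'A' or 'B' on the last line to indicate which is better."
    )
    log_for_review(verdict)                  # keep critiques for later review
    return verdict.strip().splitlines()[-1]  # 'A' (control) or 'B' (treatment)
```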
3. What are some pitfalls to avoid when evaluating LLM-based applications?
- Beware of overemphasizing synthetic "needle-in-a-haystack" (NIAH) evals, as they may not reflect the reasoning and recall needed in real-world applications.
- Avoid cognitive overload on human raters by using binary classifications and pairwise comparisons instead of open-ended Likert scale feedback.
4. How can we use evals as guardrails to catch inappropriate or harmful content?
- Leverage reference-free evals, such as summarization or translation quality assessments, to filter out low-quality outputs before displaying them to users.
- Complement prompt engineering with robust guardrails that detect and filter/regenerate undesired outputs, such as those containing hate speech, PII, or factual inconsistencies.
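A sketch of a reference-free guardrail along those lines, assuming a placeholder `call_llm` client: the output is graded against the source document alone (no gold reference), and failing outputs are regenerated rather than shown. The prompt wording and retry policy are illustrative assumptions:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call."""
    raise NotImplementedError

def passes_guardrails(source: str, summary: str) -> bool:
    """Reference-free check: grade the summary against the source document only,
    with no gold answer, before it is ever shown to a user."""
    verdict = call_llm(
        f"Document:\n{source}\n\nSummary:\n{summary}\n\n"
        "Does the summary contain hate speech, personal data, or claims not "
        "supported by the document? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("NO")

def safe_summarize(source: str, max_retries: int = 2) -> str | None:
    for _ in range(max_retries + 1):
        candidate = call_llm(f"Summarize:\n{source}")
        if passes_guardrails(source, candidate):
            return candidate
    return None  # fall back to a default rather than display an unvetted output
```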