Making my local LLM voice assistant faster and more scalable with RAG
Abstract
The article describes the author's experience building a smart home voice assistant from open-source, locally hosted components. It focuses on the performance and latency challenges encountered along the way, and on how the author addressed them, chiefly by using Retrieval Augmented Generation (RAG) to optimize the language model prompt.
Q&A
[01] Challenges with the initial setup
1. What were the initial challenges the author faced with their smart home voice assistant?
- The author found the voice assistant slow: even with prefix caching, responsiveness suffered because the same language model was also used for other tasks.
- Even with dual RTX 3090 GPUs hosting the Whisper and Llama 3 70B AWQ models, the assistant was still not fast enough.
2. What was the main issue the author identified with the language model's performance?
- The author found that the prefill phase of the language model, the processing of the prompt that happens before the first token is emitted, accounted for the majority of the inference time. The culprit was the large context being passed to the model, since prefill latency grows roughly quadratically with prompt length (one way to confirm this is to measure time to first token, as sketched below).
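Since the prompt is the dominant cost, a quick way to quantify it is to time how long the model takes to emit its first token versus finishing the full response. Below is a minimal sketch against a local OpenAI-compatible endpoint; the base URL and model name are assumptions for illustration, not details from the article.

```python
# Rough split of total latency into prefill (time to first token) and decode,
# using a streaming request. Adjust base_url/model to whatever serves the model.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def measure_latency(prompt: str, model: str = "llama-3-70b-awq") -> None:
    start = time.perf_counter()
    first_token_at = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter()  # roughly marks the end of prefill
    total = time.perf_counter() - start

    print(f"time to first token (prefill): {first_token_at - start:.2f}s")
    print(f"total generation time:         {total:.2f}s")
```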
[02] Optimizing the language model prompt
1. What technique did the author use to optimize the language model prompt?
- The author utilized Retrieval Augmented Generation (RAG), a method that augments language model prompts with relevant information retrieved from external sources.
- The key ingredient of RAG is embeddings, which map text into a high-dimensional vector space where semantically similar texts land close together, so the most relevant information can be retrieved efficiently (a minimal sketch of this follows below).
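To make the embedding idea concrete, here is a minimal sketch: texts are encoded into vectors, and cosine similarity between a query vector and document vectors approximates relevance. The sentence-transformers model named here is an assumption; the article does not say which embedding model the author used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

docs = [
    "Living room: lights, thermostat, TV",
    "Weather forecast for the next week",
    "Shopping list: milk, eggs, coffee",
]
query = "turn on the lights in the living room"

doc_vecs = model.encode(docs)    # shape: (3, 384)
query_vec = model.encode(query)  # shape: (384,)


def cosine(a, b):
    # Cosine similarity: dot product of L2-normalised vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


scores = [cosine(query_vec, d) for d in doc_vecs]
print(sorted(zip(scores, docs), reverse=True)[0])  # most relevant document wins
```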
2. How did the author implement the RAG-based optimization?
- The author built a RAG API that splits the large prompt into smaller, more relevant sections ("documents") and caches the embeddings for these sections in RAM.
- When a user prompt comes in, the API computes the embedding for the prompt and finds the top 3 most relevant documents, which are then used to augment the language model prompt (sketched after this list).
- The author also dynamically generates examples for in-context learning to help the language model better understand the smart home context.
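Below is a minimal sketch of that retrieval flow under the same assumptions: the prompt sections are embedded once and cached in memory, and each incoming user prompt pulls in only its top 3 most similar sections. The function and variable names are illustrative, not the author's actual API.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
_doc_texts: list[str] = []
_doc_vecs: np.ndarray | None = None  # embeddings cached in RAM


def index_documents(docs: list[str]) -> None:
    """Embed the prompt sections once and keep the vectors in memory."""
    global _doc_texts, _doc_vecs
    _doc_texts = docs
    _doc_vecs = _model.encode(docs, normalize_embeddings=True)


def retrieve(user_prompt: str, k: int = 3) -> list[str]:
    """Return the k documents most similar to the user prompt."""
    query = _model.encode(user_prompt, normalize_embeddings=True)
    scores = _doc_vecs @ query              # cosine similarity (vectors are normalised)
    top = np.argsort(scores)[::-1][:k]
    return [_doc_texts[i] for i in top]


def build_prompt(user_prompt: str) -> str:
    """Augment the prompt with only the relevant sections."""
    context = "\n\n".join(retrieve(user_prompt))
    return f"{context}\n\nUser: {user_prompt}"
```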
3. What were the different categories of information the author included in the optimized prompt? (A sketch of how such category documents might be assembled follows the list.)
- Calendar events for the next week
- Weather forecast for the next week
- One category per area defined in Home Assistant, with a list of all entities in that area
- Shopping list
- Whether anyone else is home
- Media players and what they are playing
- Laundry and color loop (custom to the author's setup)
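As a rough illustration only, the per-category documents above could be assembled along these lines before being indexed; every helper called here (get_calendar_events, get_forecast, get_areas, and so on) is hypothetical and stands in for whatever integration fetches that data from Home Assistant or elsewhere.

```python
def build_documents() -> list[str]:
    # One document per category, so the retriever can pick only what the
    # user's request actually needs. All helper functions are hypothetical.
    docs = []
    docs.append("Calendar events for the next week:\n" + get_calendar_events())
    docs.append("Weather forecast for the next week:\n" + get_forecast())
    for area in get_areas():  # one document per Home Assistant area
        docs.append(f"Entities in {area.name}:\n" + list_entities(area))
    docs.append("Shopping list:\n" + get_shopping_list())
    docs.append("Presence:\n" + who_is_home())
    docs.append("Media players:\n" + media_player_status())
    return docs
```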
4. How did the optimized prompt compare to the initial prompt in terms of performance?
- The author provided before and after comparisons, showing that the optimized prompt with the RAG-based approach was significantly faster and more responsive.
[03] Considerations for long prompts and responses
1. What did the author acknowledge about the performance of the system for very long prompts and responses?
- The author acknowledged that even with the optimization, dealing with very long prompts and responses can still result in a slow-feeling system.
- However, the author believes the trade-off is sometimes worth it, since the longer responses give the user more comprehensive and relevant information.