Triumph Over Data Obstacles In RAG: 8 Expert Tips - Vectorize
Abstract
The article discusses common data-related challenges in building Retrieval-Augmented Generation (RAG) applications with large language models (LLMs), and strategies for addressing them. It covers 8 key areas:
- Data Extraction
- Handling Structured Data
- Choosing the Right Chunk Size and Chunking Strategy
- Creating a Robust and Scalable Pipeline
- Retrieved Data Not in Context
- Task-Based Retrieval
- Data Freshness
- Data Security
Q&A
[01] Data Extraction
1. What are the challenges in parsing complex data structures like PDFs with embedded tables or images?
- Complex documents such as PDFs with embedded tables or images are difficult to parse and require specialized techniques to extract the relevant information accurately. OCR (Optical Character Recognition) remains largely an open problem, especially for scanned complex documents such as invoices.
2. What strategies are recommended to address the data extraction challenges?
- Start with a text-only data pipeline whenever possible. If the RAG system cannot perform well on text data, it is unlikely to perform well on images and audio.
- For PDF files, invest in a good parsing tool. Segment the problem into subproblems of lower complexity, such as using OCR for specific key fields.
- Consider using pre-built document recognition/understanding tools like Microsoft Azure's Document Intelligence.
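
As an illustration of segmenting extraction into smaller subproblems, the sketch below pulls a few key fields out of text that a parsing or OCR tool has already produced. The field names and regular expressions are hypothetical and would need tuning for real documents; this is not a substitute for a full document-understanding tool.

```python
import re

# Hypothetical key fields for an invoice; the patterns are assumptions
# made for illustration and will not match every layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_key_fields(text):
    """Run one small extractor per field instead of parsing the whole page."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # A missing field is reported as None so it can be reviewed later.
        fields[name] = match.group(1) if match else None
    return fields


sample = "Invoice No: INV-1042\nDate: 2024-03-01\nTotal: $1,250.00"
print(extract_key_fields(sample))
```

Treating each field as its own small problem makes failures easy to spot: a `None` value points to one pattern or one region of the page rather than to the whole parser.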
[02] Handling Structured Data
1. What are the challenges in using LLMs for handling structured data like tabular data?
- LLMs are great at dealing with unstructured data, such as free-flowing text, but not as good at handling structured data like tabular data. Issues include a high rate of hallucination when using LLMs on tabular data.
2. What strategies are recommended to address the challenges with structured data?
- Transform tabular data to unstructured text, being careful with numerical and categorical representation to avoid issues like exceeding the context limit of LLMs.
- Employ techniques like the Chain-of-Table approach, which combines table analysis with step-by-step information extraction, to improve tabular question answering in RAG systems.
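
A minimal sketch of the first strategy, turning a table row into plain text, might look like the following. The function name, the `units` parameter, and the two-decimal rounding are assumptions chosen for illustration; the point is that numbers are formatted consistently and empty cells are dropped so the text stays short.

```python
def row_to_text(row, units=None):
    """Serialize one table row (a dict of column -> value) as a sentence."""
    units = units or {}
    parts = []
    for column, value in row.items():
        # Skip empty cells so they do not consume context window space.
        if value is None or value == "":
            continue
        # Fix the numeric precision so the same number always reads the same.
        if isinstance(value, float):
            value = f"{value:.2f}"
        unit = units.get(column, "")
        suffix = f" {unit}" if unit else ""
        parts.append(f"{column} is {value}{suffix}")
    return "; ".join(parts) + "."


row = {"product": "Widget", "price": 9.5, "stock": 12, "notes": ""}
print(row_to_text(row, units={"price": "USD"}))
# product is Widget; price is 9.50 USD; stock is 12.
```

For wide tables, serializing only the columns relevant to the expected questions is one way to stay under the context limit.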
[03] Choosing the Right Chunk Size and Chunking Strategy
1. What are the challenges in determining the optimal chunk size and chunking strategy?
- The chunk size must divide documents into semantically distinct parts while balancing comprehensive context against fast retrieval. Longer chunks increase inference time, while smaller chunks may lead to incomplete answers.
- The chunking strategy itself is also an open choice, such as whether to chunk by sentence, paragraph, or word count.
2. What strategies are recommended for choosing the chunk size and chunking strategy?
- Start with a simple chunking strategy, run evaluations, and measure the performance. Determine the chunking strategy first, then choose the chunk size considering both the strategy and the LLM context size.
- For example, if the data is a collection of news articles, paragraph-level chunking would likely work, and the chunk size can be the size of a paragraph.
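
The paragraph-level strategy can be sketched as below. The function name and the `max_chars` budget are assumptions for illustration; short neighboring paragraphs are merged until the budget is reached, and a paragraph longer than the budget is kept whole rather than split.

```python
def chunk_by_paragraph(text, max_chars=1000):
    """Split on blank lines, merging short paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            # Still within budget: keep growing the current chunk.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


article = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in chunk_by_paragraph(article, max_chars=40):
    print(repr(chunk))
```

Running evaluations with a few values of `max_chars` is a simple way to measure the trade-off between context completeness and retrieval speed.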
[04] Creating a Robust and Scalable Pipeline
1. What are the challenges in building a robust and scalable RAG pipeline?
- A robust and scalable RAG pipeline must handle a large volume of data and continuously index and store it in a vector database. Scalability is often an afterthought, but it is crucial if the app will eventually have to handle terabytes of data per hour.
2. What strategies are recommended for creating a robust and scalable pipeline?
- Start simple and estimate the scale of the app early on, even if you don't need to act on it immediately.
- Adopt a modular and distributed system approach, separating the pipeline into scalable units and employing distributed processing for parallel operation efficiency.
- Use battle-tested tools like Kubernetes to deploy the app, which allows for easy scaling up and down as needed.
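
The modular approach can be illustrated on a single machine before moving to a distributed deployment. In this sketch each stage is a separate function and documents are processed in parallel with a thread pool. All stage names are hypothetical, and `embed` is a placeholder that returns a dummy vector rather than calling a real embedding model.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(doc):
    """Stage 1: pull plain text out of a raw document."""
    return doc["raw"].strip()


def chunk(text):
    """Stage 2: naive sentence-level chunking on periods."""
    return [s.strip() for s in text.split(".") if s.strip()]


def embed(chunk_text):
    """Stage 3: placeholder embedding (an assumption, not a real model)."""
    return [float(len(chunk_text))]


def process_document(doc):
    """Run all stages for one document; documents are independent units."""
    text = extract(doc)
    return [
        {"doc_id": doc["id"], "text": c, "vector": embed(c)}
        for c in chunk(text)
    ]


def run_pipeline(docs, max_workers=4):
    """Process documents in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process_document, docs))
    return [record for records in results for record in records]


docs = [{"id": 1, "raw": "A. B."}, {"id": 2, "raw": "C."}]
print(run_pipeline(docs))
```

Because each stage has a narrow interface, any one of them can later be replaced by a separately scaled service without changing the others.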
[05] Retrieved Data Not in Context
1. What are the challenges related to the retrieved data not being in the right context?
- The RAG system may retrieve data that is not relevant or does not provide the context needed for an accurate response, often because of poor embeddings, poorly phrased user queries, or context truncation.
2. What strategies are recommended to address the issue of retrieved data not being in the right context?
- Use query augmentation/rewriting to enhance user queries with additional context or modifications, improving the relevancy of retrieved information.
- Explore different retrieval strategies, such as small-to-big sentence window retrieval and semantic similarity scoring, to incorporate relevant information into the context.
- Use generative UIs to clarify user intent and ensure the retrieval engine retrieves the complete context to answer the query.
- Invest in a monitoring solution to visualize prompts, chunks, and responses.
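
Sentence window retrieval, mentioned above, can be sketched as follows: match the query against individual sentences, then return the best sentence together with its neighbors. Word overlap is used here as a stand-in for embedding similarity, which is an assumption made to keep the example self-contained; the function names are also hypothetical.

```python
import re


def split_sentences(text):
    """Split text after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sentence_window_retrieve(query, text, window=1):
    """Score single sentences, then return the best one plus its neighbors."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    query_tokens = tokenize(query)
    # Small unit for matching: one sentence at a time.
    scores = [len(query_tokens & tokenize(s)) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    # Bigger unit for the LLM context: the surrounding window.
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])


text = (
    "Cats sleep a lot. The invoice total was 40 dollars. "
    "It was paid in March. Dogs bark."
)
print(sentence_window_retrieve("invoice total", text, window=1))
```

Matching on small units keeps retrieval precise, while the window restores the surrounding context that a single sentence would lose.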
[06] Task-Based Retrieval
1. What is the challenge related to task-based retrieval in RAG applications?
- RAG applications should be able to handle a wide range of user queries, including those seeking summaries, comparisons, or specific facts. Using one retrieval prompt might not be sufficient to handle all these different use cases.
2. What strategy is recommended to address the task-based retrieval challenge?
- Implement query routing to identify the appropriate subset of tools or sources based on the initial user query, ensuring adapted retrieval for different use cases. This usually involves creating multiple indexes and a classifier to route the query to the corresponding index.
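
A minimal sketch of query routing is shown below. A keyword lookup stands in for the classifier, and the index names (`summary`, `comparison`, `facts`) are hypothetical; a production system would more likely use a trained classifier or an LLM call to pick the route.

```python
# Hypothetical routes: index name -> trigger phrases.
ROUTES = {
    "summary": ("summarize", "summary", "overview"),
    "comparison": ("compare", "versus", " vs ", "difference between"),
}
DEFAULT_ROUTE = "facts"


def route_query(query):
    """Return the name of the index that should serve this query."""
    # Pad with spaces so phrases like " vs " match at the string edges.
    padded = f" {query.lower()} "
    for index_name, keywords in ROUTES.items():
        if any(keyword in padded for keyword in keywords):
            return index_name
    # Queries that match no route fall back to fact lookup.
    return DEFAULT_ROUTE


for q in ["Summarize the Q3 report", "Compare plan A vs plan B", "What is the refund deadline?"]:
    print(q, "->", route_query(q))
```

Each returned name would map to its own index and retrieval prompt, so summary, comparison, and fact queries can be tuned separately.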
[07] Data Freshness
1. What is the challenge related to ensuring data freshness in RAG applications?
- Ensuring that the RAG app always uses the latest and most accurate information, especially when documents are updated.
2. What strategies are recommended to address the data freshness challenge?
- Implement metadata filtering, which acts as a label to indicate if a document is new or changed, ensuring the app always uses the most recent information.
- Use generative UIs to clarify user intent around recency, such as asking the user if they are interested in recent events or general information.
- Be explicit in the prompt about which timestamps to use, as LLMs can sometimes confuse their internal learned information about dates and external data timestamps.
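
Metadata filtering for freshness can be sketched as below, assuming each stored chunk carries a `doc_id` and an `updated_at` timestamp. Both field names are assumptions, and most vector databases offer their own filter syntax for the same idea.

```python
from datetime import datetime


def latest_versions(chunks):
    """Keep only chunks that belong to the newest version of each document."""
    newest = {}
    for chunk in chunks:
        seen = newest.get(chunk["doc_id"])
        if seen is None or chunk["updated_at"] > seen:
            newest[chunk["doc_id"]] = chunk["updated_at"]
    return [c for c in chunks if c["updated_at"] == newest[c["doc_id"]]]


def filter_fresh(chunks, not_before=None):
    """Drop stale versions, then optionally drop anything older than a cutoff."""
    fresh = latest_versions(chunks)
    if not_before is not None:
        fresh = [c for c in fresh if c["updated_at"] >= not_before]
    return fresh


chunks = [
    {"doc_id": "a", "text": "old policy", "updated_at": datetime(2023, 1, 1)},
    {"doc_id": "a", "text": "new policy", "updated_at": datetime(2024, 1, 1)},
    {"doc_id": "b", "text": "handbook", "updated_at": datetime(2022, 6, 1)},
]
print([c["text"] for c in filter_fresh(chunks)])
# ['new policy', 'handbook']
```

Passing the surviving `updated_at` values into the prompt alongside the text is one way to make the timestamps explicit to the LLM.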
[08] Data Security
1. What are the data security challenges in RAG applications?
- Ensuring the security and integrity of LLMs used in RAG applications, preventing sensitive information disclosure, and addressing ethical and privacy considerations.
2. What strategies are recommended to address the data security challenges?
- Implement multi-tenancy to keep user data private and secure.
- Properly handle the user prompt to ensure it does not abuse the LLM.
- Analyze the RAG data for any signs of data poisoning.
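
The multi-tenancy recommendation can be illustrated with a toy in-memory store in which every read is scoped to a tenant. The class and its methods are hypothetical, and substring matching stands in for vector search; the design point is that the tenant filter is applied inside the store, so callers cannot forget it.

```python
class TenantScopedStore:
    """Toy document store where every query is restricted to one tenant."""

    def __init__(self):
        self._records = []

    def add(self, tenant_id, text):
        # Every record is tagged with its owner at write time.
        self._records.append({"tenant_id": tenant_id, "text": text})

    def search(self, tenant_id, query):
        # tenant_id is a required argument, so unscoped reads are impossible.
        needle = query.lower()
        return [
            record["text"]
            for record in self._records
            if record["tenant_id"] == tenant_id and needle in record["text"].lower()
        ]


store = TenantScopedStore()
store.add("acme", "Acme salary bands for 2024")
store.add("globex", "Globex salary bands for 2024")
print(store.search("acme", "salary"))
# ['Acme salary bands for 2024']
```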