When RAG runs out of steam, use schema extraction and analytics with Sycamore
Abstract
The article discusses how Sycamore, an LLM-powered search and analytics platform for unstructured data, can be used to extract structured metadata from unstructured documents and leverage it for advanced analytics and search queries. It highlights the limitations of traditional Retrieval Augmented Generation (RAG) approaches and demonstrates how Sycamore's data preparation capabilities and analytics features can address these limitations.
Q&A
[01] When RAG is not enough
1. What are the limitations of a traditional RAG approach when dealing with unstructured data like the NTSB aviation incident reports?
- RAG assumes the answer is contained within the top K results, which may not always be the case, especially when dealing with large datasets with similar-looking information.
- RAG places the burden on keywords and vector embeddings to return the right passages, which can break down and lead to incomplete or incorrect answers.
- RAG is not accurate with aggregation-style questions, as LLMs struggle to accurately aggregate information across multiple documents.
2. How can Sycamore address these limitations?
- Sycamore allows you to extract structured metadata from unstructured documents using LLMs and then use this metadata for filtering, aggregation, and other analytics operations in your search queries.
- This enables you to retrieve the right information for a query, which then provides better input into a RAG pipeline or other downstream operations.
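For example, after extraction each document carries a small set of structured properties alongside its text, which downstream queries can filter and aggregate on. The snippet below is a purely illustrative sketch of what such metadata might look like; the field names follow the examples used later in the article, and the exact property layout Sycamore produces may differ.

```python
# Hypothetical structured metadata extracted from one NTSB incident report.
# Field names (dateAndTime, location, aircraftType) mirror the article's
# examples; the actual layout of extracted properties may differ.
extracted_properties = {
    "dateAndTime": "January 20, 2023 14:30",
    "location": "California",
    "aircraftType": "Cessna 172",
}
```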
[02] Defining and extracting structured metadata
1. How does Sycamore Data Prep use LLMs to extract structured metadata from unstructured documents?
- Sycamore Data Prep uses the extract_batch_schema and extract_properties transforms to:
  - Infer a schema for a given document class (e.g., "FlightAccidentReport") by analyzing the document content.
  - Extract the values for the identified schema properties for each document.
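A minimal data prep sketch using these transforms is shown below. It follows Sycamore's documented data prep API, but the specific class names (UnstructuredPdfPartitioner, OpenAISchemaExtractor, OpenAIPropertyExtractor), parameters such as num_of_elements, and the S3 path are assumptions and may differ from the article's exact job or your installed Sycamore version.

```python
import sycamore
from sycamore.llms import OpenAI, OpenAIModels
from sycamore.transforms.extract_schema import (
    OpenAISchemaExtractor,
    OpenAIPropertyExtractor,
)
from sycamore.transforms.partition import UnstructuredPdfPartitioner

llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
context = sycamore.init()

docset = (
    # Read the NTSB incident report PDFs into a DocSet (path is a placeholder).
    context.read.binary("s3://your-bucket/ntsb-reports/", binary_format="pdf")
    # Split each PDF into elements (titles, text, tables, ...).
    .partition(partitioner=UnstructuredPdfPartitioner())
    # Infer a schema for the "FlightAccidentReport" class from document content.
    .extract_batch_schema(
        schema_extractor=OpenAISchemaExtractor(
            "FlightAccidentReport", llm=llm, num_of_elements=35
        )
    )
    # Extract values for the inferred schema properties from each document.
    .extract_properties(
        property_extractor=OpenAIPropertyExtractor(llm=llm, num_of_elements=35)
    )
)
```

The inferred schema and extracted values are stored as properties on each document in the DocSet, which the analytics queries later in the article rely on.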
2. What are the key steps in the data prep job described in the article?
- Ingest and partition the documents into a Sycamore DocSet.
- Use the extract_batch_schema transform to infer a schema for the "FlightAccidentReport" class, extracting the 7 most important attributes.
- Use the extract_properties transform to extract the values for the identified schema properties for each document.
- Format the extracted metadata, such as converting the "dateAndTime" field to a Python date object.
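Continuing from the DocSet in the previous sketch, the date formatting step can be a plain Python function applied with DocSet.map. The property path properties["entity"]["dateAndTime"] and the use of dateutil for parsing are assumptions, not the article's exact code.

```python
from dateutil import parser  # pip install python-dateutil

from sycamore.data import Document


def convert_date(doc: Document) -> Document:
    """Convert the extracted 'dateAndTime' string to a Python date object."""
    # Assumes the extracted values were stored under doc.properties["entity"].
    entity = doc.properties.get("entity", {})
    raw = entity.get("dateAndTime")
    if raw:
        entity["dateAndTime"] = parser.parse(raw).date()
    return doc


# Apply the conversion to every document in the DocSet from the sketch above.
docset = docset.map(convert_date)
```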
[03] Adding analytics to search queries
1. How can Sycamore's analytics capabilities be used to answer questions that traditional RAG approaches struggle with?
- For questions that require aggregation, such as "What types of planes were involved in incidents in California?", Sycamore can use the structured metadata to perform aggregations and return the unique aircraft types.
- For questions that require filtering on specific attributes, such as "Were there any incidents in the last three days of January 2023 in Washington?", Sycamore can use filters on the structured metadata to ensure only the relevant documents are retrieved and used in the RAG pipeline.
2. What are the key steps in the Sycamore search query examples provided in the article?
- For the aggregation query:
- Use a filter to select incidents in California based on the "location" metadata field.
- Use an aggregation on the "aircraftType" metadata field to get the unique aircraft types.
- For the filtering query:
- Use filters on the "location" and "dateAndTime" metadata fields to retrieve only the relevant documents for the specified location and date range.
- Pass the filtered documents to the Sycamore RAG pipeline to generate the final answer.
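Assuming the prepared DocSet is loaded into an OpenSearch index (Sycamore's usual backend), both query patterns can be sketched with the standard opensearch-py client, as below. The index name, field paths (properties.entity.location, properties.entity.aircraftType, properties.entity.dateAndTime, text_representation), pipeline name, and RAG parameters are assumptions about how the metadata was indexed and how the RAG search pipeline was configured; they are not the article's exact queries.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
INDEX = "ntsb-incident-reports"  # hypothetical index name

# Aggregation query: unique aircraft types for incidents in California.
# Assumes the aircraftType field has a keyword sub-field for exact terms.
agg_response = client.search(
    index=INDEX,
    body={
        "size": 0,
        "query": {"match": {"properties.entity.location": "California"}},
        "aggs": {
            "aircraft_types": {
                "terms": {"field": "properties.entity.aircraftType.keyword"}
            }
        },
    },
)
for bucket in agg_response["aggregations"]["aircraft_types"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])

# Filtered RAG query: incidents in Washington in the last three days of
# January 2023, answered by a RAG search pipeline (hypothetical name).
question = "Were there any incidents in the last three days of January 2023 in Washington?"
rag_response = client.search(
    index=INDEX,
    params={"search_pipeline": "hybrid_rag_pipeline"},
    body={
        "query": {
            "bool": {
                "must": [{"match": {"text_representation": question}}],
                "filter": [
                    {"match": {"properties.entity.location": "Washington"}},
                    {
                        "range": {
                            "properties.entity.dateAndTime": {
                                "gte": "2023-01-29",
                                "lte": "2023-01-31",
                            }
                        }
                    },
                ],
            }
        },
        # Parameters consumed by OpenSearch's retrieval_augmented_generation
        # response processor in the search pipeline.
        "ext": {"generative_qa_parameters": {"llm_question": question}},
    },
)
print(rag_response["ext"]["retrieval_augmented_generation"]["answer"])
```

The filters ensure only documents matching the location and date range reach the generative step, which is what lets the RAG pipeline produce a grounded answer instead of guessing from loosely related passages.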