Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations
Abstract
The paper proposes Seismic, a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Learned sparse representations are an attractive class of contextual embeddings for text retrieval: they are effective models of relevance and they are interpretable. However, retrieval over sparse embeddings remains challenging due to the distributional differences between learned embeddings and term frequency-based lexical models. Seismic organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector, allowing it to quickly determine whether a block must be evaluated during query processing. The authors show experimentally that Seismic reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall, outperforming state-of-the-art inverted index-based solutions and the winning submissions to the BigANN Challenge.
Q&A
[01] Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations
1. What are the key challenges in retrieval over learned sparse representations?
- The distributional differences between learned embeddings and term frequency-based lexical models, such as BM25, make retrieval over sparse embeddings challenging.
- Learned sparse representations do not conform to the assumptions under which popular inverted index-based retrieval algorithms, such as WAND and MaxScore, operate effectively.
2. What is the "concentration of importance" property observed in learned sparse representations?
- Learned sparse representation techniques, such as Splade and Efficient Splade, tend to place a disproportionate amount of the total mass of a vector on just a small subset of the coordinates.
- This means that a partial inner product computed over only the largest entries (by absolute value) closely approximates the full inner product, as the sketch below illustrates.
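As a concrete illustration (not taken from the paper; the vectors, the alpha mass threshold, and the function names below are invented for this sketch), keeping only the heaviest coordinates of a document vector preserves most of its inner product with a query:

```python
def top_entries(vec: dict, alpha: float) -> dict:
    """Keep the largest-magnitude entries covering an alpha fraction of the vector's total mass."""
    total = sum(abs(w) for w in vec.values())
    kept, mass = {}, 0.0
    for term, weight in sorted(vec.items(), key=lambda kv: abs(kv[1]), reverse=True):
        if mass >= alpha * total:
            break
        kept[term] = weight
        mass += abs(weight)
    return kept

def inner_product(a: dict, b: dict) -> float:
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Illustrative sparse vectors keyed by term id (weights are made up).
doc = {3: 2.1, 17: 1.6, 42: 0.4, 88: 0.2, 95: 0.1}
query = {3: 1.8, 42: 0.9, 95: 0.5}

full = inner_product(query, doc)                       # 4.19
approx = inner_product(query, top_entries(doc, 0.75))  # 3.78, roughly 90% of the full score
```

Because most of the mass sits on a few coordinates, dropping the tail changes the score only marginally, which is what makes aggressive pruning safe.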
3. How does Seismic leverage the "concentration of importance" property?
- Seismic uses static pruning to keep only the top entries in each inverted list, taking advantage of the fact that a small subset of coordinates can effectively approximate the full inner product.
- Seismic also partitions inverted lists into geometrically-cohesive blocks, with each block carrying a summary vector that approximates the inner product between a query and the documents in the block (see the sketch after this list).
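The following sketch shows one way such an index could be built. It is a deliberately simplified assumption-laden version: Seismic clusters postings geometrically and compresses its summaries, whereas here blocks are fixed-size chunks and the summary is a plain coordinate-wise maximum; `build_index`, `list_len`, and `block_size` are illustrative names, not the paper's API.

```python
from collections import defaultdict

def build_index(forward_index: dict, list_len: int, block_size: int):
    """Build a statically pruned, blocked inverted index.

    forward_index: doc_id -> {term_id: weight} (sparse document vectors).
    list_len:      static pruning -- keep only the top `list_len` postings per term.
    block_size:    postings per block (Seismic clusters geometrically; fixed-size
                   chunks are used here only to keep the sketch short).
    """
    # 1. Invert the forward index into per-term posting lists.
    postings = defaultdict(list)
    for doc_id, vec in forward_index.items():
        for term, weight in vec.items():
            postings[term].append((weight, doc_id))

    index = {}
    for term, plist in postings.items():
        # 2. Static pruning: keep only the largest-weight postings for this term.
        plist = sorted(plist, reverse=True)[:list_len]
        # 3. Partition the pruned list into blocks, each with a summary vector:
        #    the coordinate-wise max over the block's documents, an upper bound
        #    on the weight any document in the block has on each coordinate.
        blocks = []
        for i in range(0, len(plist), block_size):
            doc_ids = [doc_id for _, doc_id in plist[i:i + block_size]]
            summary = {}
            for doc_id in doc_ids:
                for t, w in forward_index[doc_id].items():
                    summary[t] = max(summary.get(t, 0.0), w)
            blocks.append({"docs": doc_ids, "summary": summary})
        index[term] = blocks
    return index
```

Because the summary is an optimistic stand-in for every document in its block, a low query-summary score is a safe signal that the whole block can be skipped.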
4. What are the key components of Seismic's design?
- Seismic uses a novel organization of the inverted index, with inverted lists partitioned into geometrically-cohesive blocks, each equipped with a summary vector.
- The summary vectors allow Seismic to quickly determine which blocks must be evaluated during query processing, enabling efficient approximate retrieval.
- Seismic also uses a forward index to compute exact inner products for the documents in the selected blocks (a sketch of this query-processing loop follows).
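A minimal query-processing loop over the index sketched above might look as follows. The skipping rule, the `heap_factor` parameter, and the function names are assumptions made for illustration; they mimic the role summaries play in Seismic rather than reproduce its exact logic.

```python
import heapq

def search(query: dict, index: dict, forward_index: dict, k: int, heap_factor: float = 0.9):
    """Approximate top-k retrieval over the blocked index built above.

    A block is scored exactly (via the forward index) only if the inner product
    between the query and the block's summary exceeds a fraction (`heap_factor`)
    of the current k-th best score.
    """
    def dot(a, b):
        return sum(w * b.get(t, 0.0) for t, w in a.items())

    heap = []      # min-heap of (score, doc_id) holding the current top-k
    seen = set()   # avoid re-scoring a document reached through several terms
    # Process query terms in decreasing weight order ("concentration of importance").
    for term, _ in sorted(query.items(), key=lambda kv: kv[1], reverse=True):
        for block in index.get(term, []):
            threshold = heap[0][0] * heap_factor if len(heap) == k else 0.0
            if dot(query, block["summary"]) <= threshold:
                continue  # the summary says this block cannot improve the top-k
            for doc_id in block["docs"]:
                if doc_id in seen:
                    continue
                seen.add(doc_id)
                score = dot(query, forward_index[doc_id])  # exact inner product
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

The expensive exact scoring touches only documents in blocks whose summaries look promising, which is where the latency savings come from.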
5. How does Seismic's performance compare to other baselines?
- Seismic consistently outperforms state-of-the-art inverted index-based solutions, as well as the winning submissions to the BigANN Challenge, in terms of query latency while maintaining high recall.
- Seismic is one to two orders of magnitude faster than the other baselines, reaching sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset.
[02] Related Work
1. What are the key developments in learned sparse representations?
- Early work includes DeepCT and HDCT, which used BERT to extract contextual features and summarize them as term weights.
- Subsequent work, such as Splade and Efficient Splade, introduced sparsity-inducing regularization and other techniques to produce sparser and more effective learned sparse representations.
2. What are the main approaches to retrieval algorithms for learned sparse representations?
- Traditional inverted index-based algorithms, such as WAND and MaxScore, are designed around term frequency-based lexical models and do not cope well with the distributional properties of learned sparse vectors.
- Researchers have explored two main directions: 1) forcing learned sparse retrieval (LSR) models to produce embeddings that are friendlier to dynamic pruning algorithms, and 2) designing new retrieval algorithms with fewer restrictive assumptions, such as approximate nearest neighbor (ANN) methods.
3. How does Seismic relate to other ANN algorithms for sparse vectors?
- Seismic can be viewed as an extension of the inverted file (IVF) approach to ANN, where vectors are partitioned into clusters during indexing and only a fraction of clusters are scanned during retrieval.
- Seismic's novel index organization, with geometrically-cohesive blocks and summary vectors, allows it to outperform other ANN baselines, including the winning submissions to the BigANN Challenge.