More Room for Language: Investigating the Effect of Retrieval on Language Models
Abstract
The paper investigates the effect of retrieval augmentation on the underlying language model, separate from the overall retrieval pipeline. It introduces an 'ideal retrieval' methodology to study these models in a fully controllable setting. The key findings are:
- Retrieval augmentation separates linguistic knowledge from world knowledge to some extent - the language model alone improves syntactic understanding while delegating world knowledge to the retrieval module. This separation becomes larger with scale.
- Retrieval augmentation negatively impacts NLU performance - the standalone language model performs worse in multi-sentence language understanding tasks.
- Poor retrieval quality does not negatively impact pretraining - the model behavior gets closer to the baseline no-retrieval performance, without overall quality degradation.
Q&A
[01] Introduction
1. What is the main goal of the paper? The main goal of the paper is to shed more light on the expected qualities of the language model when separated from the database retrieval, in a fully controllable setting.
2. What are the key findings of the paper? The key findings are:
- Retrieval augmentation separates linguistic knowledge from world knowledge to some extent
- Retrieval augmentation negatively impacts NLU performance
- Poor retrieval quality does not negatively impact pretraining
3. How does the paper aim to study the effect of retrieval augmentation? The paper introduces an 'ideal retrieval' methodology, using paraphrased inputs, to study the effect of retrieval augmentation in a fully controllable setting, separate from the overall retrieval pipeline.
[02] Controlled retrieval augmentation
1. What is the simplified retrieval-augmented LM setup used in the paper? The model is an encoder-decoder transformer, where the encoder embeds the retrieved context and the decoder is a language model. The retrieval augmentation is simplified using paraphrase-based pretraining.
2. How are the paraphrases generated and what are their quality characteristics? The paraphrases are generated with an instruction-tuned Mistral 7B language model. They have high semantic similarity to the originals (average cosine similarity of 0.88) but low lexical and syntactic overlap (average BLEU score of 13%, or 7% after removing named entities and digits).
3. How does the paper handle the separation of the language model from the retrieval augmentation? The paper uses linear patching: a simple linear layer is added between the self-attention and the feed-forward network of each layer of the language model, as a proxy for the missing cross-attention.
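The linear patching idea can be illustrated with a minimal sketch of a single transformer layer in plain Python. This is not the paper's implementation: the function names (`linear`, `patched_layer`, `zero_sublayer`), the residual connection around the patch, and the toy stand-in sublayers are all assumptions made for illustration. The only point it shows is where the linear layer sits, namely in the slot between self-attention and the feed-forward network where cross-attention would otherwise be.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]  # one hidden vector per token


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Apply y = W x + b to a single token vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise residual addition of two token sequences."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def zero_sublayer(hidden: Tokens) -> Tokens:
    """Toy stand-in for a real sublayer: contributes nothing."""
    return [[0.0] * len(token) for token in hidden]


def patched_layer(
    hidden: Tokens,
    self_attention: Callable[[Tokens], Tokens],
    feed_forward: Callable[[Tokens], Tokens],
    patch_weight: List[Vector],
    patch_bias: Vector,
) -> Tokens:
    # 1) Self-attention sublayer with a residual connection.
    hidden = add(hidden, self_attention(hidden))
    # 2) Linear patch in the slot where cross-attention would be.
    #    The residual connection here is an assumption of this sketch.
    patch_out = [linear(token, patch_weight, patch_bias) for token in hidden]
    hidden = add(hidden, patch_out)
    # 3) Feed-forward sublayer with a residual connection.
    return add(hidden, feed_forward(hidden))


if __name__ == "__main__":
    h = [[1.0, 2.0], [3.0, 4.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(patched_layer(h, zero_sublayer, zero_sublayer, identity, [0.5, 0.5]))
```

With a zero weight matrix and zero bias the patch is a no-op, so the layer reduces to self-attention followed by the feed-forward network. In practice the patch parameters would be trained, and the sublayers would be real attention and feed-forward modules rather than the toy `zero_sublayer`.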
[03] Evaluation
1. What are the main categories of tasks used to evaluate the language models? The models are evaluated on 3 main categories: world knowledge (LAMA probing), syntactic knowledge (linear probing, attention probing, BLiMP, MSGS), and language understanding (LAMBADA, GLUE, SQuAD).
2. How do the retrieval-augmented models perform compared to the baseline models on the different task categories?
- World knowledge: Retrieval-augmented models perform worse, as they store less world knowledge in their weights.
- Syntactic knowledge: Retrieval-augmented models perform better, with the advantage growing with model size.
- Language understanding: Retrieval-augmented models perform worse, especially on tasks requiring global context understanding.
3. What is the effect of retrieval noise on the model performance? Noisy retrieval pretraining does not lead to an overall drop in performance, but rather interpolates the behavior between standard pretraining and pretraining with perfect retrieval.
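One way to picture the interpolation between standard pretraining and perfect retrieval is a data-construction sketch in which the retrieved context is either the paraphrase of the target (ideal retrieval) or an unrelated passage (noise). The summary does not say how noise is injected, so the swap-with-fixed-probability scheme, the `noise_rate` parameter, and the function name `build_example` are assumptions of this sketch, not the paper's procedure.

```python
import random
from typing import List, Tuple


def build_example(
    index: int,
    originals: List[str],
    paraphrases: List[str],
    noise_rate: float,
    rng: random.Random,
) -> Tuple[str, str]:
    """Return a (retrieved_context, target) pair for one pretraining example.

    With probability `noise_rate` the context is the paraphrase of a
    different, unrelated passage; otherwise it is the paraphrase of the
    target itself, which is the 'ideal retrieval' case.
    """
    target = originals[index]
    if len(originals) > 1 and rng.random() < noise_rate:
        # Noisy retrieval: pick any passage other than the target's own.
        other = rng.choice([i for i in range(len(originals)) if i != index])
        return paraphrases[other], target
    return paraphrases[index], target


if __name__ == "__main__":
    originals = ["passage one", "passage two", "passage three"]
    paraphrases = ["first text", "second text", "third text"]
    rng = random.Random(0)
    for i in range(len(originals)):
        print(build_example(i, originals, paraphrases, 0.5, rng))
```

Under this assumed scheme, `noise_rate=0.0` always yields the matching paraphrase, while `noise_rate=1.0` always yields an irrelevant context, which gives the model no usable retrieved information and so resembles pretraining without retrieval.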