BERTs are Generative In-Context Learners
Abstract
The paper explores the in-context learning capabilities of masked language models, challenging the common view that this ability does not 'emerge' in them. It presents a simple inference technique that enables DeBERTa to operate as a generative model without any additional training. The findings demonstrate that DeBERTa can match and even surpass GPT-3 in in-context learning performance. The comparative analysis reveals that masked and causal language models behave very differently, with each outperforming the other on different categories of tasks. This suggests great potential for a hybrid training approach that combines the strengths of both objectives.
Q&A
[01] Introduction
1. What is the key point made about the introduction of GPT-3? The introduction of GPT-3 marked a paradigm shift by demonstrating in-context learning, where a model can infer a task from context without any finetuning. This is particularly attractive for practical applications as it avoids the need for extensive hand-annotated datasets and deep-learning expertise.
2. What is the prevailing assumption about the inability of masked language models to perform in-context learning? There is a prevailing assumption that masked language models, such as BERT, are "very restricted in their generative capabilities" and are considered "somewhat deprecated" compared to causal language models like GPT-3.
3. What is the key claim made in this paper? This paper challenges the prevailing assumptions and presents empirical evidence that the masked language model DeBERTa is just as capable of in-context learning as GPT-3.
[02] Method: text generation and ranking with masked language models
1. How does the paper propose to use a masked language model for text generation? The paper introduces a simple inference technique that modifies the sequence of input tokens so that a masked language model like DeBERTa can generate text autoregressively, without any additional training (a rough code sketch follows this list).
2. How does the paper propose to use a masked language model for text ranking? The paper computes a pseudo-log-likelihood score that ranks text sequences by their likelihood, masking additional tokens in the right context to reduce the effect of local dependencies (also illustrated in the sketch below).
3. What is the key limitation of the proposed text generation method? The key limitation is that it is slower in practice compared to causal language models, because the hidden representations of the whole sequence have to be recomputed in every step due to the bidirectional nature of the model.
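To make the two techniques in this section concrete, here is a minimal sketch using the HuggingFace transformers API. The checkpoint name, greedy decoding, the exact placement of the appended [MASK]/[SEP] tokens, and the plain single-mask pseudo-log-likelihood are illustrative assumptions rather than the paper's exact recipe; in particular, the paper additionally masks right-context tokens when scoring, and the sketch assumes a DeBERTa checkpoint whose MLM head ships with trained weights.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative checkpoint; assumes a DeBERTa model with a usable, trained MLM head.
MODEL = "microsoft/deberta-v2-xxlarge"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()


@torch.no_grad()
def generate(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy autoregressive generation with an MLM: append a [MASK],
    predict it, append the prediction, and repeat."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids[:, :-1]  # drop trailing [SEP]
    for _ in range(max_new_tokens):
        tail = torch.tensor([[tokenizer.mask_token_id, tokenizer.sep_token_id]])
        # The full sequence is re-encoded every step -- the slowdown noted in question 3.
        logits = model(input_ids=torch.cat([ids, tail], dim=1)).logits
        next_id = logits[0, -2].argmax()  # distribution at the [MASK] position
        if next_id.item() == tokenizer.sep_token_id:
            break
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)


@torch.no_grad()
def pseudo_log_likelihood(text: str) -> float:
    """Basic pseudo-log-likelihood: mask each token in turn and sum its log-probability.
    (The paper additionally masks some right-context tokens to weaken local dependencies.)"""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    score = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(input_ids=masked.unsqueeze(0)).logits
        score += torch.log_softmax(logits[0, i], dim=-1)[ids[i]].item()
    return score


print(generate("The capital of France is"))
print(pseudo_log_likelihood("Paris is the capital of France."))
```

For ranking, a higher pseudo-log-likelihood is read as a more likely sequence, so answer candidates for a task can be scored and the highest-scoring one selected, analogous to how GPT-3 ranks candidates by log-likelihood.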
[03] DeBERTa family of language models
1. What are the key differences between the training data used for DeBERTa and GPT-3? DeBERTa was pretrained on a relatively small and clean text corpus of 78GB, while GPT-3 was trained on a much larger corpus of 570GB.
2. How does the total training compute used for DeBERTa compare to GPT-3? Even though DeBERTa uses a smaller training corpus, it is trained on more than three times as many tokens as GPT-3 (1 trillion versus 300 billion). However, because of the masked-language-modeling objective, DeBERTa's loss is computed on only 15% of those tokens.
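As a rough back-of-the-envelope comparison using only the figures quoted above, the number of tokens that actually contribute to the training loss is

$$\underbrace{0.15 \times 10^{12} = 1.5 \times 10^{11}}_{\text{DeBERTa}} \quad \text{vs.} \quad \underbrace{3 \times 10^{11}}_{\text{GPT-3}},$$

so despite reading more text overall, DeBERTa receives a direct training signal on roughly half as many tokens as GPT-3.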
[04] Evaluation
1. What are the four groups of tasks used to evaluate and compare DeBERTa and GPT-3? The four groups are: language understanding (SuperGLUE), language modeling (text completion and Winograd-like tasks), machine translation, and question answering (closed-book question answering and commonsense reasoning).
2. What is the key finding about the relative performance of DeBERTa and GPT-3 on the language understanding tasks? DeBERTa clearly outperforms GPT-3 on the language understanding tasks and scales much more favorably with model size.
3. What is the key finding about the relative performance of DeBERTa and GPT-3 on the machine translation tasks? Contrary to the language understanding tasks, the causal language model GPT-3 clearly outperforms the masked language model DeBERTa on the machine translation tasks.
4. What is the key finding about the relative performance of DeBERTa and GPT-3 on the closed-book question answering tasks? DeBERTa performs substantially worse than GPT-3 on closed-book question answering, which the paper suggests may reflect a disadvantage of the MLM training objective when it comes to storing world knowledge directly in the model weights.