What happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives — Yi Tay
🌈 Abstract
The article discusses the evolution of language model architectures in the era of large language models (LLMs), covering the three main paradigms: encoder-only models (e.g., BERT), encoder-decoder models (e.g., T5), and decoder-only models (e.g., GPT series). It explores the relationships between these architectures, the role of denoising objectives, and the gradual phasing out of BERT-like models in favor of more flexible and unified autoregressive models.
🙋 Q&A
[01] Encoder-only, Encoder-Decoder, and Decoder-only Models
1. What are the key differences between encoder-only, encoder-decoder, and decoder-only models?
- Encoder-only models like BERT use a denoising objective: a fraction of input tokens is masked and the model predicts them in place, attending bidirectionally over the whole input.
- Encoder-decoder models like T5 generate autoregressively: a causal decoder produces the target while attending to the encoder's bidirectional representations of the input via cross-attention.
- Decoder-only models like the GPT series are autoregressive language models that predict each token from the previous ones under a causal attention mask. (The attention patterns behind these paradigms are sketched in the code block after this list.)
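To make the contrast concrete, here is a minimal NumPy sketch of the attention masks the encoder-only and decoder-only paradigms imply; the function names are illustrative, not taken from any library or from the article.

```python
import numpy as np

def bidirectional_mask(n: int) -> np.ndarray:
    """Encoder-only (BERT-style): every position may attend to every other position."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    """Decoder-only (GPT-style): position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

# True at (row i, column j) means query position i may attend to key position j.
print(bidirectional_mask(4).astype(int))
print(causal_mask(4).astype(int))
```

An encoder-decoder model combines the two: the encoder uses the bidirectional mask over the input, and the causal decoder additionally cross-attends to the encoder outputs.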
2. How are encoder-decoder and decoder-only models related?
- Encoder-decoder models and decoder-only models are not fundamentally different - they are both autoregressive models, with the key difference being the presence of an encoder in encoder-decoder models.
- A closely related variant is the Prefix Language Model (PrefixLM) architecture: a decoder-only model that applies bidirectional attention over the input (prefix) tokens and causal attention over the target, gaining the benefit of bidirectional conditioning without a separate encoder (see the mask sketch below).
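Continuing the NumPy convention from the sketch above, this is roughly what the PrefixLM mask looks like; `prefix_lm_mask` is an illustrative name.

```python
import numpy as np

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    """PrefixLM: bidirectional attention within the first `prefix_len` (input) tokens,
    causal attention for the remaining target tokens, all inside one decoder stack."""
    mask = np.tril(np.ones((n, n), dtype=bool))  # start from a causal mask
    mask[:prefix_len, :prefix_len] = True        # prefix tokens see the whole prefix
    return mask

# 3 input (prefix) tokens followed by 3 target tokens.
print(prefix_lm_mask(6, 3).astype(int))
```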
3. Why did BERT-like encoder-only models become less prominent compared to encoder-decoder and decoder-only models?
- The shift towards more unified, multi-task models made BERT-like models less desirable, as encoder-decoder and decoder-only models could more easily express multiple tasks without the need for task-specific classification heads.
- Encoder-decoder and PrefixLM-style models also retain bidirectional attention over the input, so they keep BERT's main benefit while offering more flexibility and efficiency as general text-to-text generators.
[02] Denoising Objectives
1. What is the denoising objective, and how does it differ from regular language modeling?
- The denoising objective, also known as "span corruption" or "infilling", involves masking and predicting a subset of tokens in the input sequence, rather than predicting the next token in a causal language modeling objective.
- In denoising objectives, the masked spans are replaced with sentinel tokens in the input and the dropped tokens are "moved to the back" of the sequence for the model to predict (see the span-corruption sketch below), whereas in regular language modeling the model predicts the next token in a left-to-right fashion.
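The following toy sketch, assuming a T5-style recipe, shows how an input sequence becomes a corrupted input plus a target made of the "moved-to-the-back" spans. The span-sampling scheme is simplified for clarity; only the `<extra_id_*>` sentinel naming follows T5's convention.

```python
import random

def span_corrupt(tokens, n_spans=2, span_len=2, seed=0):
    """Toy T5-style span corruption: mask non-overlapping spans, replace each with a
    sentinel in the input, and collect the dropped tokens as the decoding target."""
    rng = random.Random(seed)
    candidates = list(range(len(tokens) - span_len + 1))
    starts = []
    while len(starts) < n_spans and candidates:
        s = rng.choice(candidates)
        starts.append(s)
        # drop any start that would overlap the span we just chose
        candidates = [c for c in candidates if c + span_len <= s or c >= s + span_len]
    starts.sort()

    corrupted, target, cursor = [], [], 0
    for i, s in enumerate(starts):
        sentinel = f"<extra_id_{i}>"          # T5's sentinel-token naming convention
        corrupted += tokens[cursor:s] + [sentinel]
        target += [sentinel] + tokens[s:s + span_len]
        cursor = s + span_len
    corrupted += tokens[cursor:]
    return corrupted, target

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(tokens)
print(inp)  # the sentence with two spans replaced by sentinels
print(tgt)  # the dropped spans, each preceded by its sentinel
```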
2. How effective are denoising objectives compared to regular language modeling?
- Denoising objectives are seen as a complementary objective to regular language modeling, as they can learn useful representations in a more sample-efficient way for certain tasks.
- However, denoising objectives have the drawback of less "loss exposure": only a small fraction of tokens are masked and predicted (e.g., roughly 15 of every 100 tokens at a typical 15% corruption rate), compared to every token in regular language modeling.
- The author suggests that denoising objectives should be used in conjunction with regular language modeling, rather than as a standalone pretraining objective; a sketch of such an objective mixture follows this list.
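A minimal sketch of mixing objectives during pretraining, reusing the `span_corrupt` helper from the sketch above. The objective names and the 80/20 weighting are illustrative assumptions, not the article's recipe.

```python
import random

OBJECTIVES = [("causal_lm", 0.8), ("span_corruption", 0.2)]  # illustrative weights

def build_training_example(tokens, rng):
    """Sample an objective per example and build (inputs, targets) accordingly."""
    names, weights = zip(*OBJECTIVES)
    objective = rng.choices(names, weights=weights)[0]
    if objective == "causal_lm":
        # Plain next-token prediction: targets are the inputs shifted by one,
        # so every position contributes to the loss.
        return tokens[:-1], tokens[1:]
    # Infilling: corrupted input, masked spans moved to the back as the target
    # (uses the span_corrupt sketch defined earlier).
    return span_corrupt(tokens, seed=rng.randrange(1 << 30))

tokens = "span corruption works best alongside plain language modeling".split()
inputs, targets = build_training_example(tokens, random.Random(0))
print(inputs)
print(targets)
```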
[03] Bidirectional Attention and Model Scaling
1. What is the role of bidirectional attention in language models?
- Bidirectional attention, as seen in BERT-style models, is an inductive bias that can be beneficial at smaller scales, as it allows the model to leverage both past and future context.
- However, the author suggests that the importance of bidirectional attention may diminish at larger scales, and that its impact can vary depending on the task or modality.
2. How do encoder-decoder architectures compare to decoder-only models in terms of advantages and drawbacks?
- Encoder-decoder models can be more flexible in the encoder side, as they are not restricted by the causal mask like decoder-only models. This allows for more aggressive pooling or linear attention in the encoder.
- However, encoder-decoder models split the context into fixed input and output budgets, whereas decoder-only models such as PrefixLM concatenate input and target into a single shared context and so use the token budget more flexibly (see the sketch after this list).
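A small sketch of the budget point; the 512/512 versus 1024 token budgets are illustrative assumptions, not numbers from the article.

```python
def fits_encoder_decoder(input_len, target_len, enc_budget=512, dec_budget=512):
    """Encoder-decoder: the input and the target each have their own fixed budget."""
    return input_len <= enc_budget and target_len <= dec_budget

def fits_prefix_lm(input_len, target_len, context_budget=1024):
    """PrefixLM / decoder-only: input and target share one concatenated context."""
    return input_len + target_len <= context_budget

# A long input with a short target overflows the encoder budget even though the
# total token count is the same in both setups.
print(fits_encoder_decoder(900, 50))  # False: input exceeds the encoder budget
print(fits_prefix_lm(900, 50))        # True: 900 + 50 <= 1024
```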
[04] Recap and Conclusions
1. What are the key takeaways from the article?
- Encoder-decoder and decoder-only models are both autoregressive models with subtle differences in inductive biases and implementation-level details.
- Denoising objectives are mostly used as complementary objectives to regular causal language modeling, rather than as standalone pretraining objectives.
- Bidirectional attention can be beneficial at smaller scales, but its importance may diminish at larger scales, depending on the task and application.
- BERT-like encoder-only models have become less prominent, as more flexible and unified autoregressive models like T5 and PrefixLM have emerged, allowing for better multi-task capabilities.