
What happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives — Yi Tay

🌈 Abstract

The article discusses the evolution of language model architectures in the era of large language models (LLMs), covering the three main paradigms: encoder-only models (e.g., BERT), encoder-decoder models (e.g., T5), and decoder-only models (e.g., GPT series). It explores the relationships between these architectures, the role of denoising objectives, and the gradual phasing out of BERT-like models in favor of more flexible and unified autoregressive models.

🙋 Q&A

[01] Encoder-only, Encoder-Decoder, and Decoder-only Models

1. What are the key differences between encoder-only, encoder-decoder, and decoder-only models?

  • Encoder-only models like BERT use a denoising objective where the model learns to predict masked tokens in-place.
  • Encoder-decoder models like T5 are autoregressive as well: the decoder is a causal decoder that can attend to the encoder's representations via cross-attention.
  • Decoder-only models like the GPT series are autoregressive language models that predict the next token from the previous ones (the three attention patterns are sketched below).
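The contrast between the three paradigms comes down to which positions are allowed to attend to which. Below is a minimal NumPy sketch of the corresponding attention masks (my own illustration, not code from the article); `1` means a query position may attend to a key position.

```python
import numpy as np

def bidirectional_mask(n):
    # Encoder-only (BERT-style): every position attends to every other position.
    return np.ones((n, n), dtype=int)

def causal_mask(n):
    # Decoder-only (GPT-style): position i attends only to positions <= i.
    return np.tril(np.ones((n, n), dtype=int))

def encoder_decoder_masks(n_src, n_tgt):
    # Encoder-decoder (T5-style): bidirectional self-attention in the encoder,
    # causal self-attention in the decoder, and full cross-attention from
    # every decoder position to every encoder position.
    return {
        "encoder_self": bidirectional_mask(n_src),
        "decoder_self": causal_mask(n_tgt),
        "cross_attention": np.ones((n_tgt, n_src), dtype=int),
    }

print(causal_mask(4))
```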

2. How are encoder-decoder and decoder-only models related?

  • Encoder-decoder models and decoder-only models are not fundamentally different: both are autoregressive models, and the key difference is whether a separate encoder is present.
  • A close relative is the Prefix Language Model (PrefixLM) architecture: a decoder-only model that applies bidirectional attention over the input (prefix) segment and causal attention over the target, without a separate encoder (see the mask sketch below).
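A minimal sketch of that idea (assumptions mine, not the article's code): a PrefixLM builds a single attention mask over the concatenated prefix-plus-target sequence, bidirectional within the prefix block and causal everywhere else.

```python
import numpy as np

def prefix_lm_mask(n_prefix, n_target):
    # Start from a standard causal mask over the whole sequence...
    n = n_prefix + n_target
    mask = np.tril(np.ones((n, n), dtype=int))
    # ...then open up the prefix block so prefix tokens attend bidirectionally.
    mask[:n_prefix, :n_prefix] = 1
    return mask

# Prefix positions (rows 0-2) see each other fully; target positions
# (rows 3-5) see the whole prefix plus only earlier target positions.
print(prefix_lm_mask(3, 3))
```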

3. Why did BERT-like encoder-only models become less prominent compared to encoder-decoder and decoder-only models?

  • The shift towards more unified, multi-task models made BERT-like models less desirable, as encoder-decoder and decoder-only models could more easily express multiple tasks without task-specific classification heads (a text-to-text framing is illustrated after this list).
  • Encoder-decoder and decoder-only models also retained the bidirectional attention benefits of BERT while offering more flexibility and efficiency.
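One way to see the "no task-specific heads" point is the text-to-text framing popularized by T5, where every task is cast as string-in, string-out. The snippet below is a toy illustration under my own assumptions (the task prefixes mirror T5's conventions, but the function and examples are hypothetical).

```python
def to_text_to_text(task, example):
    # Cast heterogeneous tasks into a single "text in, text out" format,
    # so one autoregressive model can serve all of them without extra heads.
    if task == "sentiment":
        return f"sst2 sentence: {example['text']}", example["label"]
    if task == "translation":
        return f"translate English to German: {example['text']}", example["target"]
    if task == "summarization":
        return f"summarize: {example['text']}", example["summary"]
    raise ValueError(f"unknown task: {task}")

inp, tgt = to_text_to_text("sentiment", {"text": "A great movie.", "label": "positive"})
print(inp, "->", tgt)  # "sst2 sentence: A great movie." -> "positive"
```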

[02] Denoising Objectives

1. What is the denoising objective, and how does it differ from regular language modeling?

  • The denoising objective, also known as "span corruption" or "infilling", masks and predicts a subset of tokens in the input sequence, rather than predicting the next token as in causal language modeling.
  • Under denoising objectives, the masked spans are "moved to the back" of the sequence for the model to predict, whereas regular language modeling predicts each next token left to right (see the sketch after this list).
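A simplified sketch of how that "move to the back" works in T5-style span corruption (the real objective samples span lengths and a corruption rate; the helper below is my own simplification with a single fixed-length span).

```python
import random

def span_corrupt(tokens, span_len=2, seed=0):
    # Mask one contiguous span: replace it with a sentinel in the input and
    # emit the sentinel followed by the original tokens as the target.
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len)
    sentinel = "<extra_id_0>"
    inputs = tokens[:start] + [sentinel] + tokens[start + span_len:]
    targets = [sentinel] + tokens[start:start + span_len]
    return inputs, targets

inp, tgt = span_corrupt("the quick brown fox jumps over the lazy dog".split())
print(inp)  # e.g. ['the', 'quick', '<extra_id_0>', 'jumps', 'over', ...]
print(tgt)  # e.g. ['<extra_id_0>', 'brown', 'fox']
```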

2. How effective are denoising objectives compared to regular language modeling?

  • Denoising objectives are seen as complementary to regular language modeling, since they can learn useful representations more sample-efficiently for certain tasks.
  • However, denoising objectives suffer from "less loss exposure": only the small fraction of masked tokens contributes to the loss, compared to nearly every token in regular language modeling (a rough comparison follows this list).
  • The author suggests that denoising objectives should be used in conjunction with regular language modeling, rather than as a standalone pretraining objective.
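The "loss exposure" point is easy to see with a back-of-the-envelope count (the numbers below are illustrative assumptions, not figures from the article).

```python
seq_len = 512
corruption_rate = 0.15  # a commonly used masking rate, assumed here for illustration

# Span corruption: only the masked positions produce a prediction loss.
denoising_loss_tokens = int(seq_len * corruption_rate)  # ~76 positions

# Causal LM: every next-token prediction contributes to the loss.
causal_lm_loss_tokens = seq_len - 1                     # 511 positions

print(denoising_loss_tokens, causal_lm_loss_tokens)
```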

[03] Bidirectional Attention and Model Scaling

1. What is the role of bidirectional attention in language models?

  • Bidirectional attention, as seen in BERT-style models, is an inductive bias that can be beneficial at smaller scales, as it allows the model to leverage both past and future context.
  • However, the author suggests that the importance of bidirectional attention may diminish at larger scales, and that its impact can vary depending on the task or modality.

2. How do encoder-decoder architectures compare to decoder-only models in terms of advantages and drawbacks?

  • Encoder-decoder models can be more flexible in the encoder side, as they are not restricted by the causal mask like decoder-only models. This allows for more aggressive pooling or linear attention in the encoder.
  • However, encoder-decoder models have the drawback of fixed, separately allocated input and target budgets, whereas decoder-only models like PrefixLM can more efficiently concatenate input and target into a single sequence (see the packing sketch after this list).
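A tiny sketch of that budgeting difference (my own illustration; the lengths and helper are hypothetical): an encoder-decoder fixes separate length budgets for the input and the target, while a PrefixLM packs both into one shared sequence and only needs to remember where the bidirectional prefix ends.

```python
# Encoder-decoder: separate, fixed budgets for each side.
ENCODER_LEN, DECODER_LEN = 512, 128

# PrefixLM / decoder-only: one shared budget for the concatenated sequence.
TOTAL_LEN = 640

def pack_prefix_lm(input_ids, target_ids, total_len=TOTAL_LEN):
    seq = (input_ids + target_ids)[:total_len]
    n_prefix = min(len(input_ids), total_len)  # boundary of bidirectional attention
    return seq, n_prefix

seq, n_prefix = pack_prefix_lm(list(range(10)), list(range(100, 105)))
print(len(seq), n_prefix)  # 15 10
```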

[04] Recap and Conclusions

1. What are the key takeaways from the article?

  • Encoder-decoder and decoder-only models are both autoregressive models with subtle differences in inductive biases and implementation-level details.
  • Denoising objectives are mostly used as complementary objectives to regular causal language modeling, rather than as standalone pretraining objectives.
  • Bidirectional attention can be beneficial at smaller scales, but its importance may diminish at larger scales, depending on the task and application.
  • BERT-like encoder-only models have become less prominent, as more flexible and unified autoregressive models like T5 and PrefixLM have emerged, allowing for better multi-task capabilities.