What happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives — Yi Tay
🌈 Abstract
The article discusses the evolution of language model architectures in the era of large language models (LLMs), covering the three main paradigms: encoder-only models (e.g., BERT), encoder-decoder models (e.g., T5), and decoder-only models (e.g., GPT series). It explores the relationships between these architectures, the role of denoising objectives, and the gradual phasing out of BERT-like models in favor of more flexible and unified autoregressive models.
🙋 Q&A
[01] Encoder-only, Encoder-Decoder, and Decoder-only Models
1. What are the key differences between encoder-only, encoder-decoder, and decoder-only models?
- Encoder-only models like BERT use a denoising objective: a fraction of input tokens is masked and the model predicts them in place, attending bidirectionally over the whole input.
- Encoder-decoder models like T5 generate autoregressively: a causal decoder produces the target while attending to the encoder's bidirectional representations of the input via cross-attention.
- Decoder-only models like the GPT series are autoregressive language models that predict each token from the previous ones under a causal attention mask. (The attention patterns behind these paradigms are sketched in the code block after this list.)
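To make the contrast concrete, here is a minimal NumPy sketch of the attention masks the encoder-only and decoder-only paradigms imply; the function names are illustrative, not taken from any library or from the article.

```python
import numpy as np

def bidirectional_mask(n: int) -> np.ndarray:
    """Encoder-only (BERT-style): every position may attend to every other position."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    """Decoder-only (GPT-style): position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

# True at (row i, column j) means query position i may attend to key position j.
print(bidirectional_mask(4).astype(int))
print(causal_mask(4).astype(int))
```

An encoder-decoder model combines the two: the encoder uses the bidirectional mask over the input, and the causal decoder additionally cross-attends to the encoder outputs.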
2. How are encoder-decoder and decoder-only models related?
- Encoder-decoder models and decoder-only models are not fundamentally different - they are both autoregressive models, with the key difference being the presence of an encoder in encoder-decoder models.
- A closely related variant is the Prefix Language Model (PrefixLM) architecture: a decoder-only model that applies bidirectional attention over the input (prefix) tokens and causal attention over the target, gaining the benefit of bidirectional conditioning without a separate encoder (see the mask sketch below).
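Continuing the NumPy convention from the sketch above, this is roughly what the PrefixLM mask looks like; `prefix_lm_mask` is an illustrative name.

```python
import numpy as np

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    """PrefixLM: bidirectional attention within the first `prefix_len` (input) tokens,
    causal attention for the remaining target tokens, all inside one decoder stack."""
    mask = np.tril(np.ones((n, n), dtype=bool))  # start from a causal mask
    mask[:prefix_len, :prefix_len] = True        # prefix tokens see the whole prefix
    return mask

# 3 input (prefix) tokens followed by 3 target tokens.
print(prefix_lm_mask(6, 3).astype(int))
```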
3. Why did BERT-like encoder-only models become less prominent compared to encoder-decoder and decoder-only models?
- The shift towards more unified, multi-task models made BERT-like models less desirable, as encoder-decoder and decoder-only models could more easily express multiple tasks without the need for task-specific classification heads.
- Encoder-decoder and PrefixLM-style models also retain bidirectional attention over the input, so they keep BERT's main benefit while offering more flexibility and efficiency as general text-to-text generators.
[02] Denoising Objectives
1. What is the denoising objective, and how does it differ from regular language modeling?
- The denoising objective, also known as "span corruption" or "infilling", involves masking and predicting a subset of tokens in the input sequence, rather than predicting the next token in a causal language modeling objective.
- In denoising objectives, the masked spans are replaced with sentinel tokens in the input and the dropped tokens are "moved to the back" of the sequence for the model to predict (see the span-corruption sketch below), whereas in regular language modeling the model predicts the next token in a left-to-right fashion.
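The following toy sketch, assuming a T5-style recipe, shows how an input sequence becomes a corrupted input plus a target made of the "moved-to-the-back" spans. The span-sampling scheme is simplified for clarity; only the `<extra_id_*>` sentinel naming follows T5's convention.

```python
import random

def span_corrupt(tokens, n_spans=2, span_len=2, seed=0):
    """Toy T5-style span corruption: mask non-overlapping spans, replace each with a
    sentinel in the input, and collect the dropped tokens as the decoding target."""
    rng = random.Random(seed)
    candidates = list(range(len(tokens) - span_len + 1))
    starts = []
    while len(starts) < n_spans and candidates:
        s = rng.choice(candidates)
        starts.append(s)
        # drop any start that would overlap the span we just chose
        candidates = [c for c in candidates if c + span_len <= s or c >= s + span_len]
    starts.sort()

    corrupted, target, cursor = [], [], 0
    for i, s in enumerate(starts):
        sentinel = f"<extra_id_{i}>"          # T5's sentinel-token naming convention
        corrupted += tokens[cursor:s] + [sentinel]
        target += [sentinel] + tokens[s:s + span_len]
        cursor = s + span_len
    corrupted += tokens[cursor:]
    return corrupted, target

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(tokens)
print(inp)  # the sentence with two spans replaced by sentinels
print(tgt)  # the dropped spans, each preceded by its sentinel
```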
2. How effective are denoising objectives compared to regular language modeling?
- Denoising objectives are seen as a complementary objective to regular language modeling, as they can learn useful representations in a more sample-efficient way for certain tasks.
- However, denoising objectives have the drawback of less "loss exposure": only a small fraction of tokens are masked and predicted (e.g., roughly 15 of every 100 tokens at a typical 15% corruption rate), compared to every token in regular language modeling.
- The author suggests that denoising objectives should be used in conjunction with regular language modeling, rather than as a standalone pretraining objective; a sketch of such an objective mixture follows this list.
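A minimal sketch of mixing objectives during pretraining, reusing the `span_corrupt` helper from the sketch above. The objective names and the 80/20 weighting are illustrative assumptions, not the article's recipe.

```python
import random

OBJECTIVES = [("causal_lm", 0.8), ("span_corruption", 0.2)]  # illustrative weights

def build_training_example(tokens, rng):
    """Sample an objective per example and build (inputs, targets) accordingly."""
    names, weights = zip(*OBJECTIVES)
    objective = rng.choices(names, weights=weights)[0]
    if objective == "causal_lm":
        # Plain next-token prediction: targets are the inputs shifted by one,
        # so every position contributes to the loss.
        return tokens[:-1], tokens[1:]
    # Infilling: corrupted input, masked spans moved to the back as the target
    # (uses the span_corrupt sketch defined earlier).
    return span_corrupt(tokens, seed=rng.randrange(1 << 30))

tokens = "span corruption works best alongside plain language modeling".split()
inputs, targets = build_training_example(tokens, random.Random(0))
print(inputs)
print(targets)
```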
[03] Bidirectional Attention and Model Scaling
1. What is the role of bidirectional attention in language models?
- Bidirectional attention, as seen in BERT-style models, is an inductive bias that can be beneficial at smaller scales, as it allows the model to leverage both past and future context.
- However, the author suggests that the importance of bidirectional attention may diminish at larger scales, and that its impact can vary depending on the task or modality.
2. How do encoder-decoder architectures compare to decoder-only models in terms of advantages and drawbacks?
- Encoder-decoder models can be more flexible in the encoder side, as they are not restricted by the causal mask like decoder-only models. This allows for more aggressive pooling or linear attention in the encoder.
- However, encoder-decoder models split the context into fixed input and output budgets, whereas decoder-only models such as PrefixLM concatenate input and target into a single shared context and so use the token budget more flexibly (see the sketch after this list).
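A small sketch of the budget point; the 512/512 versus 1024 token budgets are illustrative assumptions, not numbers from the article.

```python
def fits_encoder_decoder(input_len, target_len, enc_budget=512, dec_budget=512):
    """Encoder-decoder: the input and the target each have their own fixed budget."""
    return input_len <= enc_budget and target_len <= dec_budget

def fits_prefix_lm(input_len, target_len, context_budget=1024):
    """PrefixLM / decoder-only: input and target share one concatenated context."""
    return input_len + target_len <= context_budget

# A long input with a short target overflows the encoder budget even though the
# total token count is the same in both setups.
print(fits_encoder_decoder(900, 50))  # False: input exceeds the encoder budget
print(fits_prefix_lm(900, 50))        # True: 900 + 50 <= 1024
```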
[04] Recap and Conclusions
1. What are the key takeaways from the article?
- Encoder-decoder and decoder-only models are both autoregressive models with subtle differences in inductive biases and implementation-level details.
- Denoising objectives are mostly used as complementary objectives to regular causal language modeling, rather than as standalone pretraining objectives.
- Bidirectional attention can be beneficial at smaller scales, but its importance may diminish at larger scales, depending on the task and application.
- BERT-like encoder-only models have become less prominent, as more flexible and unified autoregressive models like T5 and PrefixLM have emerged, allowing for better multi-task capabilities.