Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Abstract
The article introduces Transfusion, a method for training a single unified model to understand and generate both discrete (text) and continuous (image) modalities. Transfusion combines the language modeling objective (next-token prediction) with diffusion to train a transformer over mixed-modal sequences. The authors pretrain Transfusion models of up to 7B parameters and demonstrate that this recipe scales significantly better than quantizing images and training a language model over discrete image tokens. They further show that scaling Transfusion to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on par with similar-scale diffusion models and language models.
Q&A
[01] Introduction
1. What is the key innovation of Transfusion? The key innovation of Transfusion is that it trains a single unified model to understand and generate both discrete (text) and continuous (image) modalities, using separate loss functions for each modality (language modeling for text, diffusion for images) over shared data and parameters.
2. How does Transfusion compare to previous approaches for combining discrete and continuous modalities? Previous approaches either attached modality-specific architectures together or quantized continuous modalities into discrete tokens and trained a language model over the combined sequence. In contrast, Transfusion fully integrates both modalities, avoiding the information loss incurred by quantization, by training a single model on both the language modeling and diffusion objectives.
3. What are the main advantages of Transfusion over the quantization-based approach? Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. Experiments show that Transfusion achieves better performance than the quantization-based approach (Chameleon) using much less compute.
[02] Background
1. What is the language modeling objective? The language modeling objective models the probability of a sequence of discrete tokens from a closed vocabulary by decomposing it into a product of conditional probabilities of each token given the preceding tokens.
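In standard notation (the symbols are conventional, not quoted from the article), the sequence probability factorizes autoregressively and the loss is the negative log-likelihood:

```latex
P_\theta(y) = \prod_{i=1}^{n} P_\theta(y_i \mid y_{<i}),
\qquad
\mathcal{L}_{\mathrm{LM}} = -\sum_{i=1}^{n} \log P_\theta(y_i \mid y_{<i})
```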
2. What is the diffusion objective? Diffusion models operate on continuous data like images. The diffusion objective trains the model to reverse a gradual noise-addition process, learning to denoise the data step-by-step.
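A common concrete form of this objective is the DDPM noise-prediction loss (standard notation, given here to make the description precise): noise is mixed into the clean data $x_0$ according to a schedule $\bar{\alpha}_t$, and the model $\epsilon_\theta$ is trained to predict that noise:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I),
\qquad
\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{x_0,\, t,\, \epsilon}
\left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right]
```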
3. What is the role of latent image representations in diffusion models? Early diffusion models worked directly in pixel space, which was computationally expensive. Variational autoencoders (VAEs) can encode images into a lower-dimensional latent space, allowing diffusion models to operate efficiently on compact image patch embeddings.
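A minimal sketch of this pipeline, assuming an off-the-shelf VAE from the diffusers library (the checkpoint and the 2×2 patch size are illustrative choices, not the authors' exact setup): encode images to latents, then flatten small latent patches into a sequence of continuous vectors.

```python
import torch
from diffusers import AutoencoderKL  # assumption: any 8x-downsampling VAE works

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

@torch.no_grad()
def image_to_patches(pixels: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Encode a batch of images to VAE latents, then flatten each
    patch x patch latent block into one continuous vector.
    pixels: (B, 3, H, W) scaled to [-1, 1]."""
    latents = vae.encode(pixels).latent_dist.sample()  # (B, C, H/8, W/8)
    b, c, h, w = latents.shape
    # (B, C, H', W') -> (B, num_patches, C * patch * patch)
    blocks = latents.unfold(2, patch, patch).unfold(3, patch, patch)
    return blocks.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
```

With an 8× downsampling VAE and 2×2 latent patches, a 256×256 image becomes a sequence of 16×16 = 256 patch vectors.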
[03] Transfusion
1. How does Transfusion represent the data? Text is represented as a sequence of discrete tokens from a fixed vocabulary. Images are encoded as sequences of continuous latent patch vectors using a VAE.
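Schematically, one training example interleaves both representations in a single sequence, with special begin/end-of-image tokens delimiting each image's patches (the token spellings below are illustrative; in the real model the patch entries are continuous vectors, not strings):

```python
# Schematic layout of one mixed-modal training sequence.
sequence = [
    "<bos>", "A", "cat", "on", "a", "mat",  # discrete text tokens
    "<BOI>",                                # begin-of-image marker
    "patch_1", "patch_2", "...",            # stand-ins for latent patch vectors
    "<EOI>",                                # end-of-image marker
    "<eos>",
]
```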
2. What is the Transfusion model architecture? Transfusion uses a single transformer model to process the mixed-modal sequences, with modality-specific components (embedding/patchification layers) to convert the data into the transformer's input/output space.
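A minimal sketch of these modality-specific I/O layers, assuming the simple linear patchification variant (the paper also describes a U-Net variant); all names and sizes here are illustrative:

```python
import torch
import torch.nn as nn

class ModalityIO(nn.Module):
    """Sketch of Transfusion's modality-specific input/output layers:
    an embedding table for text tokens and linear projections to map
    latent image patches into and out of the transformer's space."""

    def __init__(self, vocab_size: int, patch_dim: int, d_model: int):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # text -> model space
        self.patch_in = nn.Linear(patch_dim, d_model)       # latent patch -> model space
        self.lm_head = nn.Linear(d_model, vocab_size)       # model space -> token logits
        self.patch_out = nn.Linear(d_model, patch_dim)      # model space -> denoised patch

    def embed(self, tokens: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # In Transfusion the two streams are interleaved within one sequence;
        # they are simply concatenated here for brevity.
        return torch.cat([self.tok_embed(tokens), self.patch_in(patches)], dim=1)
```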
3. How does Transfusion handle attention for text and images? Transfusion applies causal attention to the entire sequence, and bidirectional attention within the elements of each individual image. This allows every image patch to attend to every other patch within the same image, but only attend to text or patches of other images that appeared previously.
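This attention pattern is easy to express as a mask: start from the usual causal mask, then additionally allow full attention between patches that share an image id. A hypothetical helper, not the authors' code:

```python
import torch

def transfusion_mask(modality: list[int]) -> torch.Tensor:
    """Build an attention mask for one mixed-modal sequence.
    modality[i] is 0 for a text token and an image id (1, 2, ...) for a
    patch of that image. Returns a boolean (L, L) mask where True means
    position i may attend to position j."""
    L = len(modality)
    m = torch.tensor(modality)
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))     # everyone sees the past
    same_image = (m[:, None] == m[None, :]) & (m[:, None] > 0)  # bidirectional within an image
    return causal | same_image
```

For example, `transfusion_mask([0, 0, 1, 1, 1, 0])` keeps the text positions causal while the three patches of image 1 attend to one another in both directions.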
4. What are the training and inference procedures for Transfusion? During training, Transfusion applies the language modeling objective to text tokens and the diffusion objective to image patches, summing the two losses. At inference, it decodes text autoregressively and, when it emits a begin-of-image token, switches to diffusion mode to generate the image's patches over a series of denoising steps before resuming text decoding.
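The combined objective can be sketched in a few lines (a minimal sketch using standard PyTorch losses; the balancing value 5.0 is illustrative, the paper specifies only that a coefficient weights the diffusion term):

```python
import torch.nn.functional as F

def transfusion_loss(lm_logits, text_targets, eps_pred, eps_true, lam=5.0):
    """Combined objective: LM loss on text positions plus a DDPM-style
    noise-prediction loss on image patches. `lam` balances the two terms
    (5.0 is an illustrative default, not the confirmed value)."""
    # lm_logits: (B, T, V) over text positions; text_targets: (B, T)
    loss_lm = F.cross_entropy(lm_logits.flatten(0, 1), text_targets.flatten())
    # eps_pred / eps_true: (B, N, D) predicted vs. true noise on patches
    loss_diff = F.mse_loss(eps_pred, eps_true)
    return loss_lm + lam * loss_diff
```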
[04] Experiments
1. How does Transfusion compare to the Chameleon baseline in terms of scaling and efficiency? In controlled experiments, Transfusion consistently exhibits better scaling laws and is significantly more compute-efficient than Chameleon, especially in image generation tasks. Transfusion achieves parity with Chameleon using around 1/3 the compute.
2. What are the key architectural components that contribute to Transfusion's performance? Enabling intra-image bidirectional attention is important for Transfusion's performance. Adding U-Net encoding/decoding layers also improves performance, especially on image tasks, by allowing Transfusion to compress larger image patches with little loss in quality.
3. How does Transfusion's large-scale 7B model perform compared to other state-of-the-art models? The 7B Transfusion model can generate images on par with popular diffusion models like DALL-E 2 and SDXL, while also matching the text generation performance of Llama 1. This demonstrates Transfusion's ability to achieve high-quality generation in both modalities.
[05] Related Work
1. How does Transfusion differ from previous multi-modal models? Most existing multi-modal models are built by attaching modality-specific architectures together, or by quantizing continuous modalities into discrete tokens and training a language model. In contrast, Transfusion is a single unified model trained end-to-end on both discrete and continuous data.
2. What are some recent developments in applying diffusion models to text generation? There has been active research on applying diffusion models and their generalizations to discrete text generation, though these approaches have yet to match the performance and scale of standard autoregressive language models.