# DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

## Abstract

Diffusion models (DMs) have revolutionized generative learning, but encoding complex, multimodal data distributions into a single continuous Gaussian distribution is challenging. The paper proposes Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. DisCo-Diff augments DMs with learnable discrete latents, inferred with an encoder, and trains the DM and encoder end-to-end. The discrete latents significantly simplify learning the DM's complex noise-to-data mapping by reducing the curvature of the DM's generative ODE. An additional autoregressive transformer models the distribution of the discrete latents. DisCo-Diff consistently improves model performance across toy data, image synthesis tasks, and molecular docking.

## Q&A

### [01] Introduction

**1. What is the key motivation behind DisCo-Diff?**
The key motivation behind DisCo-Diff is that encoding complex, potentially multimodal data distributions into a single continuous Gaussian distribution represents an unnecessarily challenging learning problem for diffusion models (DMs). DisCo-Diff aims to simplify this task by introducing complementary discrete latent variables.

**2. How does DisCo-Diff address the challenges of DMs?**
DisCo-Diff augments DMs with learnable discrete latents, inferred with an encoder, and trains the DM and encoder end-to-end. The discrete latents significantly simplify learning the DM's complex noise-to-data mapping by reducing the curvature of the DM's generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, which is a simple step since DisCo-Diff requires only a few discrete variables with small codebooks.

**3. What are the key contributions of the paper?**
The key contributions are:

- Proposing the DisCo-Diff framework, a novel way of combining discrete and continuous latent variables in DMs in a universal manner.
- Extensively validating DisCo-Diff, showing that it significantly boosts model quality in all experiments and achieves state-of-the-art performance on several image synthesis tasks.
- Providing detailed analyses, ablation studies, and architecture design studies that demonstrate the unique benefits of discrete latent variables and how they can be effectively incorporated into the main denoiser network.

### [02] Background

**1. What is the key idea behind diffusion models (DMs)?**
DMs leverage a forward diffusion process that effectively encodes the training data in a simple, unimodal Gaussian prior distribution. Generation can be formulated as a deterministic process that takes random noise from the Gaussian prior and transforms it into data through a generative ordinary differential equation (ODE).
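The noise-to-data mapping can be made concrete with a toy example. The sketch below (an illustrative assumption, not the paper's code) uses a 1-D dataset whose noisy marginals are Gaussian, so the score is known in closed form, and integrates the probability-flow ODE with Euler steps from the Gaussian prior down to the data:

```python
import numpy as np

# Toy 1-D diffusion: data ~ N(mu, s_data^2). Under a variance-exploding
# forward process x_sigma = x_0 + sigma * eps, the noisy marginal is
# N(mu, s_data^2 + sigma^2), so the score is available in closed form.
mu, s_data = 2.0, 0.5

def score(x, sigma):
    # grad_x log p_sigma(x) for the Gaussian marginal above
    return -(x - mu) / (s_data**2 + sigma**2)

def sample_ode(n, sigma_max=20.0, steps=1000, seed=0):
    # Euler integration of the probability-flow ODE
    # dx/dsigma = -sigma * score(x, sigma), from sigma_max down to 0.
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_max, 0.0, steps + 1)
    x = rng.normal(0.0, sigma_max, size=n)  # start from the Gaussian prior
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * (-s_cur * score(x, s_cur))
    return x

samples = sample_ode(20000)
print(samples.mean(), samples.std())  # close to mu and s_data
```

For a multimodal data distribution the score has no such closed form and the ODE trajectories must bend sharply between modes; this curvature is exactly what DisCo-Diff's discrete latents aim to reduce.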

**2. What are the challenges of directly encoding complex, multimodal data distributions into a single Gaussian distribution?**
Directly encoding complex, multimodal data distributions into a single unimodal Gaussian distribution and learning the corresponding reverse noise-to-data mapping is challenging. The mapping, realized by the generative ODE, must be highly complex with strong curvature; moreover, forcing an entire data distribution into a single Gaussian can be seen as unnatural.

**3. How do conditioning information and discrete latent variables address these challenges?**
Conditioning information, such as class labels or text prompts, can help simplify the complex mapping by offering the DM's denoiser additional cues for more accurate denoising. However, even with conditioning, the mapping remains highly complex. DisCo-Diff proposes to address this by augmenting DMs with additional discrete latent variables that can encode high-level information about the data and simplify the DM's denoising task.

### [03] DisCo-Diff

**1. How does the DisCo-Diff framework work?**
DisCo-Diff augments a DM with learnable discrete latent variables. It has three main components: a denoiser neural network (the DM), an encoder to infer discrete latents from clean input data, and an autoregressive model to capture the distribution of the learned discrete latents. The denoiser and encoder are trained end-to-end, while the autoregressive model is trained in a second stage.
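The first training stage can be sketched as follows. This is a minimal numpy illustration of the structure only, with random linear maps standing in for the real networks and a Gumbel-softmax relaxation as one common choice for differentiable discrete inference (all sizes and names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, E = 8, 10, 4   # data dim, codebook size, latent embedding dim (illustrative)

# Toy "networks": random linear maps stand in for the encoder and denoiser.
W_enc = rng.normal(size=(D, K))            # encoder: clean data -> codebook logits
codebook = rng.normal(size=(K, E))         # embeddings of the discrete latent values
W_den = rng.normal(size=(D + E, D)) * 0.1  # denoiser: [noisy x, latent emb] -> x0 estimate

def gumbel_softmax(logits, tau=1.0):
    # Differentiable relaxation so the discrete latent can be inferred
    # end-to-end; at low tau this approaches a one-hot sample.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

def denoising_loss(x0, sigma=1.0):
    probs = gumbel_softmax(x0 @ W_enc)   # encoder sees the CLEAN data
    z_emb = probs @ codebook             # soft codebook lookup
    x_noisy = x0 + sigma * rng.normal(size=D)
    x0_hat = np.concatenate([x_noisy, z_emb]) @ W_den  # latent-conditioned denoiser
    return np.mean((x0_hat - x0) ** 2)   # standard denoising objective

loss = denoising_loss(rng.normal(size=D))
print(loss)
```

Because the encoder conditions on clean data, the latent can carry exactly the high-level information that is hardest to recover from noise, and the denoiser learns to exploit it.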

**2. Why are discrete latents preferred over continuous latents in DisCo-Diff?**
Discrete latents are preferred over continuous latents because modeling the distribution of continuous latents that capture multimodal structure in the data would require a highly non-linear, difficult-to-learn mapping from Gaussian noise. In contrast, the distribution of the discrete latents in DisCo-Diff is simple to model with a straightforward autoregressive model.
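With only a few latents and small codebooks, the second-stage prior reduces to short-sequence ancestral sampling. The sketch below shows the sampling loop, with a hypothetical stand-in for the trained autoregressive transformer's logit function:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 8  # number of discrete latents and codebook size (small, as in the paper's setup)

def ar_sample(logits_fn):
    # Ancestral sampling: each latent's categorical distribution
    # conditions on the previously sampled prefix.
    tokens = []
    for _ in range(N):
        logits = logits_fn(tokens)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        tokens.append(int(rng.choice(K, p=p)))
    return tokens

# Hypothetical prefix-dependent logits standing in for a trained transformer.
toy_logits = lambda prefix: rng.normal(size=K) + (prefix[-1] if prefix else 0) * 0.1
print(ar_sample(toy_logits))
```

Sampling N tokens from K-way categoricals is trivial compared to modeling a continuous latent whose distribution would itself be multimodal.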

**3. What are the key architectural considerations in DisCo-Diff?**
Key architectural considerations include:

- Using a ViT-based encoder to naturally allow each discrete latent to capture global characteristics of the data
- Incorporating cross-attention layers to enable the discrete latents to globally influence the denoiser's output
- Exploring a group hierarchical design to encourage different discrete latents to encode different image characteristics
- Carefully choosing the number of latents and codebook size to balance performance improvement and modeling complexity
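The cross-attention conditioning in the second point can be sketched in numpy. This is a schematic single-head block with made-up shapes, not the paper's implementation: the denoiser's feature tokens act as queries and the discrete-latent embeddings as keys and values, so every spatial position can read the global information the latents carry:

```python
import numpy as np

rng = np.random.default_rng(0)
T, Dm, E, N = 16, 32, 32, 10   # feature tokens, model dim, latent emb dim, # latents

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h, z_emb, Wq, Wk, Wv):
    # Queries come from the denoiser activations, keys/values from the
    # discrete-latent embeddings, so the latents influence h globally.
    q, k, v = h @ Wq, z_emb @ Wk, z_emb @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return h + attn @ v   # residual update, as in standard transformer blocks

h = rng.normal(size=(T, Dm))       # intermediate denoiser activations
z_emb = rng.normal(size=(N, E))    # embeddings of the N discrete latents
Wq, Wk, Wv = (rng.normal(size=(Dm, Dm)) * 0.1 for _ in range(3))
out = cross_attention(h, z_emb, Wq, Wk, Wv)
print(out.shape)
```

Because every query attends over all N latent embeddings, the latents act as a global conditioning signal rather than a spatially local one.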

### [04] Experiments

**1. What are the key findings from the image synthesis experiments?**

- DisCo-Diff achieves new state-of-the-art FID scores on class-conditioned ImageNet-64 and ImageNet-128 datasets when using ODE-based sampling.
- DisCo-Diff outperforms all baselines in the unconditional setting and when using stochastic samplers.
- The discrete latents capture variations complementary to class semantics, simplifying the diffusion process.
- Ablation studies show the benefits of discrete latents over continuous latents and the importance of end-to-end training.

**2. How does DisCo-Diff perform on the molecular docking task?**

- When applied to the molecular docking task, DisCo-Diff (DisCo-DiffDock-S) improves the success rate from 32.9% to 35.4% on the full test set, and from 13.9% to 18.5% on the harder subset with unseen proteins.
- The discrete latents allow the model to better decompose the multimodal uncertainty in the pose distribution from the continuous variability of each pose.

**3. What are the key insights from the architecture design and analysis?**

- The group hierarchical DisCo-Diff model shows that latents for lower resolutions capture overall shape and layout, while latents for higher resolutions control color and texture.
- Classifier-free guidance with respect to the discrete latents further improves performance, suggesting the latents learn modes similar to class labels.
- The autoregressive model for the discrete latents is computationally efficient, requiring only 0.44 seconds to generate the latents for a batch of 32 ImageNet-128 images.
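Classifier-free guidance on the discrete latents follows the usual extrapolation rule; the sketch below shows it on dummy two-dimensional denoiser outputs (illustrative values only):

```python
import numpy as np

def cfg_denoise(d_cond, d_uncond, w):
    # Classifier-free guidance applied to the discrete latents: extrapolate
    # the latent-conditioned denoiser prediction away from the
    # unconditional one. w=1 recovers the conditional prediction; w>1
    # amplifies the information carried by the latents.
    return d_uncond + w * (d_cond - d_uncond)

d_uncond = np.array([0.0, 1.0])
d_cond = np.array([1.0, 1.0])
print(cfg_denoise(d_cond, d_uncond, 2.0))
```

That the same guidance trick used for class labels also works for the learned latents supports the paper's observation that the latents behave like discovered, label-like modes.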