Diffusion is spectral autoregression
Abstract
The article explores the connection between diffusion models and autoregressive models, particularly in the context of image generation. It argues that diffusion models perform approximate autoregression in the frequency domain, and supports the claim with a spectral analysis of images under Gaussian noise.
Q&A
[01] Autoregression and Diffusion as Generative Modeling Paradigms
1. What are the key similarities between autoregression and diffusion as generative modeling approaches?
- Both autoregression and diffusion split up the difficult task of generating data from complex distributions into smaller, easier-to-learn subtasks.
- Autoregression does this by casting the data into a sequence and recursively predicting one element at a time.
- Diffusion works by defining a corruption process that gradually destroys the structure in the data, and training a model to learn to invert this process step-by-step.
- This iterative refinement approach is common to both paradigms, allowing for the construction of deep computational graphs for generation without having to backpropagate through them during training.
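As a rough illustration of the point about training, the sketch below builds one training pair for each paradigm. The function names, the toy data, and the choice of additive Gaussian noise with a clean-input target are assumptions made for illustration, not details taken from the article. In both cases a single subtask is sampled and learned in isolation, so the full generation chain never has to be unrolled during training.

```python
import numpy as np

def autoregressive_example(sequence, position):
    """One training pair: the prefix before `position` and the element to predict."""
    return sequence[:position], sequence[position]

def diffusion_example(clean, sigma, rng):
    """One training pair: a noisy input at noise level `sigma` and the clean target."""
    # Assumption: plain additive Gaussian noise, with no rescaling of the signal.
    noisy = clean + sigma * rng.standard_normal(clean.shape)
    return noisy, clean

rng = np.random.default_rng(0)
data = np.arange(8, dtype=float)          # hypothetical toy data point

prefix, target = autoregressive_example(data, position=3)
noisy, clean_target = diffusion_example(data, sigma=0.5, rng=rng)
```

At sampling time the same single-step predictors are applied repeatedly, which is where the deep computational graph comes from.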
2. How does the author frame the dichotomy between autoregression for language and diffusion for other domains?
- The author notes that currently, most language models are autoregressive, while most models of images and video are diffusion-based.
- The author summarizes this dichotomy as "autoregression for language, and diffusion for everything else", finds it quite interesting, and has written about it before.
- The author suggests that this dichotomy may not persist in the long run, as the future is likely to be multimodal, requiring models that natively understand language, images, sound, and other modalities together.
[02] Diffusion as Approximate Autoregression in Frequency Space
1. How does the author use signal processing to analyze the connection between diffusion and autoregression?
- The author uses the 2D Fourier transform to obtain a frequency representation of images, which allows them to tease apart the coarse and fine-grained structure of the images.
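- The author observes that the power spectrum of natural images typically follows a power law, which appears as a straight line on a log-log plot. Gaussian noise has a flat spectrum in expectation, so adding it to an image results in a "hinge-shaped" spectrum: the power law at low frequencies and a flat noise floor at high frequencies.
- The author argues that this hinge-shaped spectrum corresponds to an approximate version of autoregression in the frequency domain. As the noise level rises, the corruption process drowns out the highest frequencies first and then progressively lower ones, so the reverse process generates low frequencies first and fills in higher frequencies step by step.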
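The hinge shape can be reproduced numerically. The sketch below is a minimal illustration under stated assumptions: it uses a synthetic random image with a power-law spectrum as a stand-in for a natural image, and the exponent `alpha=3.0`, the image size, and the noise level `sigma` are arbitrary choices rather than values from the article.

```python
import numpy as np

def synthetic_power_law_image(n=128, alpha=3.0, seed=0):
    """Random image whose expected power falls off as frequency ** -alpha."""
    rng = np.random.default_rng(seed)
    freq = np.hypot(np.fft.fftfreq(n)[:, None], np.fft.fftfreq(n)[None, :])
    freq[0, 0] = 1.0                      # placeholder to avoid dividing by zero
    amplitude = freq ** (-alpha / 2.0)    # power is amplitude squared
    amplitude[0, 0] = 0.0                 # zero-mean image: no DC component
    white = rng.standard_normal((n, n))
    image = np.fft.ifft2(np.fft.fft2(white) * amplitude).real
    return image / image.std()

def radial_power_spectrum(image):
    """Power spectrum averaged over rings of equal integer spatial frequency."""
    n = image.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2 / image.size
    yy, xx = np.indices(image.shape)
    ring = np.hypot(yy - n // 2, xx - n // 2).astype(int)
    ring_mean = np.bincount(ring.ravel(), weights=power.ravel()) / np.bincount(ring.ravel())
    return ring_mean[1 : n // 2]          # drop DC, keep rings below Nyquist

sigma = 1.0                               # noise standard deviation (assumed)
clean = synthetic_power_law_image()
noise = np.random.default_rng(1).standard_normal(clean.shape)
spec_clean = radial_power_spectrum(clean)
spec_noisy = radial_power_spectrum(clean + sigma * noise)

# Low rings: the noisy spectrum tracks the power law.
# High rings: it flattens out near the noise floor sigma ** 2.
low_ratio = spec_noisy[0] / spec_clean[0]
high_floor = spec_noisy[-16:].mean() / sigma ** 2
```

Plotting `spec_clean` and `spec_noisy` on log-log axes should show the straight power-law line and the hinge, respectively. Raising `sigma` moves the elbow toward lower frequencies.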
2. What are the limitations of the author's interpretation of diffusion as approximate spectral autoregression?
- The author acknowledges that the relationship between noise levels and spatial frequencies is only valid in expectation, averaged across many images, and not necessarily for individual images.
- The "elbow" of the hinge-shaped spectrum is not very sharp, so there is a large transition zone where it is difficult to unequivocally say that a particular frequency is dominated by either signal or noise.
- The author notes that, as a result, the spectral autoregression performed by diffusion is a very smooth approximation to the "hard" autoregression used in language models.
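How soft the elbow is can be estimated with a short calculation. The sketch below assumes an idealized expected signal spectrum of the form `scale * freq ** -alpha` and a flat noise spectrum `sigma ** 2`; the function names and the default `alpha=3.0` are illustrative assumptions, not values from the article.

```python
import numpy as np

def signal_fraction(freq, sigma, scale=1.0, alpha=3.0):
    """Expected share of the power at `freq` that comes from the signal."""
    signal_power = scale * np.asarray(freq, dtype=float) ** (-alpha)
    return signal_power / (signal_power + sigma ** 2)

def crossover_frequency(sigma, scale=1.0, alpha=3.0):
    """Frequency at which expected signal power equals the noise power."""
    return (scale / sigma ** 2) ** (1.0 / alpha)

def transition_width(alpha=3.0, low=0.1, high=0.9):
    """Ratio of the frequencies where the signal fraction equals `low` and `high`."""
    odds = lambda p: p / (1.0 - p)
    return (odds(high) / odds(low)) ** (1.0 / alpha)

elbow = crossover_frequency(sigma=0.5)
fraction_at_elbow = signal_fraction(elbow, sigma=0.5)
width = transition_width()
```

Under these assumptions the width does not depend on the noise level: for `alpha = 3` the signal fraction needs more than a factor of four in frequency (over two octaves) to fall from 0.9 to 0.1. A hard autoregressive boundary would correspond to a width of 1.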
[03] Diffusion and Autoregression in the Audio Domain
1. How does the author's analysis of audio spectra differ from the analysis of image spectra?
- The author finds that the spectra of audio recordings, such as speech and music, do not exhibit the same power law behavior as the spectra of natural images.
- The audio spectra are noisier and do not decay monotonically with increasing frequency, so the "diffusion is just spectral autoregression" interpretation does not apply as directly to the audio domain.
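One rough way to check whether a waveform's spectrum behaves like a power law is to fit a straight line to log power against log frequency and see how much of the variance the line explains. The helper below is a sketch of that check. It is exercised here only on synthetic signals (power-law noise and white noise), not on real speech or music, and the function names and the use of a raw periodogram are assumptions.

```python
import numpy as np

def loglog_fit(waveform):
    """Slope and R^2 of a straight-line fit to log power versus log frequency."""
    power = np.abs(np.fft.rfft(waveform)) ** 2
    freqs = np.fft.rfftfreq(len(waveform))
    x, y = np.log(freqs[1:]), np.log(power[1:])   # skip DC: log(0) is undefined
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    return slope, 1.0 - residual.var() / y.var()

def power_law_noise(n, alpha, rng):
    """Synthetic 1D signal whose expected power falls off as frequency ** -alpha."""
    freqs = np.fft.rfftfreq(n)
    amplitude = np.zeros_like(freqs)
    amplitude[1:] = freqs[1:] ** (-alpha / 2.0)
    coeffs = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)
    return np.fft.irfft(amplitude * coeffs, n=n)

rng = np.random.default_rng(0)
slope_pl, r2_pl = loglog_fit(power_law_noise(4096, alpha=2.0, rng=rng))
slope_white, r2_white = loglog_fit(rng.standard_normal(4096))
```

Applied to a real recording, a poor fit or a spectrum with pronounced peaks would be the kind of departure from power-law behavior the author describes for audio.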
2. What implications does this have for diffusion models of audio?
- The author notes that many diffusion models of audio do not operate directly in the waveform domain, but instead use spectrograms, which exhibit power-law spectra similar to images.
- The author suggests that this may be one reason why treating spectrograms as images works well in practice for audio diffusion models, as the power-law spectrum allows for a similar frequency decomposition interpretation.
[04] Implications for Multimodal Modeling
1. How does the author's analysis of diffusion and autoregression relate to the future of multimodal modeling?
- The author suggests that the current dichotomy of using autoregression for language and diffusion for other modalities is an unstable equilibrium, as the future is likely to be multimodal.
- The author speculates that in the longer term, we may either go back to using autoregression across all modalities, or figure out how to build multimodal diffusion models for all modalities, including language.
2. Why does the author argue against simply doing exact autoregression in the frequency domain instead of diffusion?
- The author acknowledges that this would resolve the "instability" of the current situation. However, the author argues that the diffusion sampling procedure is exceptionally flexible in ways that autoregressive sampling is not: for example, the number of sampling steps can be chosen at test time.
- The author suggests that before abandoning diffusion altogether, we will want to figure out a way to avoid giving up some of these benefits.
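The flexibility in the number of sampling steps can be illustrated with a toy sampler. The sketch below is not the author's method: it assumes zero-mean Gaussian data (so the ideal denoiser has a closed form), a noise process that adds Gaussian noise without rescaling the signal, a geometric noise schedule, and a deterministic Euler solver. All names and constants are hypothetical.

```python
import numpy as np

DATA_STD = 1.0   # toy data distribution: zero-mean Gaussian (assumed)

def ideal_denoiser(x, sigma):
    """Posterior mean of the clean sample for Gaussian data plus Gaussian noise."""
    return x * DATA_STD ** 2 / (DATA_STD ** 2 + sigma ** 2)

def sample(num_steps, num_samples=20000, sigma_max=50.0, sigma_min=0.01, seed=0):
    """Deterministic Euler sampler; `num_steps` is free to vary at test time."""
    rng = np.random.default_rng(seed)
    # Noise levels from sigma_max down to sigma_min, then a final step to zero.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, num_steps), 0.0)
    x = sigma_max * rng.standard_normal(num_samples)   # start from (almost) pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        direction = (x - ideal_denoiser(x, sigma)) / sigma
        x = x + (sigma_next - sigma) * direction
    return x

# The same denoiser is reused with different step budgets, with no retraining.
std_few = sample(num_steps=5).std()
std_many = sample(num_steps=200).std()
```

In this toy setting the 200-step run recovers the data standard deviation more closely than the 5-step run, which illustrates the trade-off between speed and quality that the step count exposes. An autoregressive sampler, by contrast, needs one step per sequence element.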