Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task
Abstract
The paper proposes the Proxy Token Diffusion Transformer (PT-DiT) to address the redundancy and computational complexity issues in existing diffusion transformer models for image and video generation tasks. The key ideas are:
- Employing sparse representative "proxy tokens" to model global visual information efficiently, instead of using full token self-attention.
- Introducing a Global Information Interaction Module (GIIM) to capture global semantics through proxy token self-attention, and then injecting this information into all latent tokens via cross-attention.
- Incorporating window attention and shift-window attention in the Texture Complement Module (TCM) to enhance the model's ability to capture detailed textures.
The proposed PT-DiT architecture can be applied to both image and video generation tasks without structural changes. Experiments show that PT-DiT achieves competitive performance while significantly reducing computational complexity compared to existing diffusion transformer models.
Q&A
[01] Redundancy Analysis
1. What observations were made about the attention maps in existing diffusion transformer models?
- The attention maps of tokens within the same spatial window were highly similar and redundant.
- Tokens paid varying attention to spatially neighboring tokens but almost uniform attention to spatially distant tokens, suggesting redundant computation.
2. How does the proposed method address this redundancy? The paper proposes a sparse attention strategy that samples a limited number of "proxy tokens" from each spatial-temporal window to perform self-attention, reducing redundant computation and decreasing computational complexity.
[02] Architecture of PT-DiT
1. What are the key components of the PT-DiT architecture? The PT-DiT architecture consists of two main components:
- Global Information Interaction Module (GIIM): Efficiently models global visual associations using the proxy token mechanism.
- Texture Complement Module (TCM): Enhances the model's ability to capture detailed textures using window attention and shift-window attention.
2. How does the proxy token mechanism work in the GIIM?
- The input latent code sequence is reshaped to recover spatial and temporal relationships.
- A set of proxy tokens is randomly sampled from each spatial-temporal window.
- The proxy tokens interact with each other through self-attention to capture global information.
- The global information is then propagated to all latent tokens through cross-attention between the proxy tokens and all latent tokens.
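The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it uses single-head attention without learned projections, samples one proxy token per window, and adds the injected information back through a residual connection. The function names (`sample_proxy_tokens`, `giim`), the residual, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention (no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sample_proxy_tokens(latents, grid, window, rng):
    """Pick one random token per spatial-temporal window.

    latents: (f*h*w, d) flattened sequence; grid = (f, h, w);
    window = (pf, ph, pw) is the window size (compression ratio).
    """
    f, h, w = grid
    pf, ph, pw = window
    d = latents.shape[-1]
    # Step 1: reshape to recover the temporal/spatial layout, then group by window.
    x = latents.reshape(f // pf, pf, h // ph, ph, w // pw, pw, d)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, pf * ph * pw, d)
    # Step 2: sample one token index inside every window.
    idx = rng.integers(0, x.shape[1], size=x.shape[0])
    return x[np.arange(x.shape[0]), idx]          # (num_windows, d)

def giim(latents, grid, window, rng):
    proxies = sample_proxy_tokens(latents, grid, window, rng)
    # Step 3: proxy tokens interact through self-attention.
    proxies = attention(proxies, proxies, proxies)
    # Step 4: every latent token queries the proxies (cross-attention).
    injected = attention(latents, proxies, proxies)
    return latents + injected                     # residual is an assumption

rng = np.random.default_rng(0)
grid, window, dim = (2, 8, 8), (1, 4, 4), 16      # toy sizes, not from the paper
latents = rng.standard_normal((grid[0] * grid[1] * grid[2], dim))
out = giim(latents, grid, window, rng)
print(latents.shape, out.shape)
```

With these toy sizes there are 128 latent tokens and 8 windows, so the self-attention step runs over only 8 proxy tokens while the cross-attention step compares each of the 128 latent tokens against those 8.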
3. What is the role of the TCM? The TCM is introduced to complement the sparse proxy token interactions and enhance the model's ability to capture detailed textures. It employs window attention and shift-window attention to introduce visual priors and aid in the construction of texture details.
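To make the difference between window attention and shift-window attention concrete, here is a small 2D NumPy sketch. It is an illustration under simplifying assumptions: attention is single-head with no learned projections, the shift is half the window size, and the shifted variant uses a cyclic roll without the boundary mask that Swin-style implementations usually apply. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, win):
    """Self-attention restricted to non-overlapping win x win windows.

    x: (h, w, d) grid of tokens; h and w must be divisible by win.
    """
    h, w, d = x.shape
    t = x.reshape(h // win, win, w // win, win, d).transpose(0, 2, 1, 3, 4)
    t = t.reshape(-1, win * win, d)                    # (num_windows, win*win, d)
    scores = t @ t.transpose(0, 2, 1) / np.sqrt(d)
    t = softmax(scores) @ t
    # Undo the window partition to get back an (h, w, d) grid.
    t = t.reshape(h // win, w // win, win, win, d).transpose(0, 2, 1, 3, 4)
    return t.reshape(h, w, d)

def shifted_window_attention(x, win):
    """Same as above, but on a grid cyclically shifted by half a window.

    Simplification: no mask is applied to tokens that wrap around the border.
    """
    s = win // 2
    shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))
    out = window_attention(shifted, win)
    return np.roll(out, shift=(s, s), axis=(0, 1))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))                     # toy 8x8 grid, 4 channels
y = window_attention(x, 4)
z = shifted_window_attention(x, 4)
print(y.shape, z.shape)
```

In the unshifted version, tokens at positions (3, 3) and (4, 4) fall in different windows and never interact; after the half-window shift they share a window, which is how alternating the two lets information cross window borders.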
[03] Complexity Analysis
1. How does PT-DiT reduce the computational complexity compared to the original diffusion transformer?
- The computational complexity of self-attention in PT-DiT is reduced from O(n^2) to O(n*m), where n is the total number of tokens and m is the number of proxy tokens.
- At larger compression ratios, e.g. (1, 16, 16) at 2048x2048 resolution, PT-DiT requires only 2.3% of the self-attention computation of the original diffusion transformer.
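The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below counts query-key pairs only: n^2 for full self-attention versus m^2 (proxy self-attention) plus n*m (cross-attention) for the proxy route. The token count assumes 8x VAE downsampling and 2x2 patches, which are assumptions for illustration and not stated in this summary. The sketch ignores window attention, projections, and feed-forward layers, so it is not expected to reproduce the 2.3% figure, which depends on the paper's own accounting.

```python
def attention_pair_counts(n_tokens, compression):
    """Count query-key pairs for full attention vs. the proxy-token route.

    compression = (pf, ph, pw): one proxy token per pf*ph*pw latent tokens.
    """
    pf, ph, pw = compression
    m = n_tokens // (pf * ph * pw)          # number of proxy tokens
    full = n_tokens ** 2                    # full self-attention: n^2
    proxy = m ** 2 + n_tokens * m           # proxy self-attn + cross-attn
    return m, full, proxy

# Assumption: 2048x2048 image, 8x VAE downsampling, 2x2 patches.
side = 2048 // 8 // 2
n = side * side
m, full, proxy = attention_pair_counts(n, (1, 16, 16))
print(f"n={n}, m={m}, proxy/full={proxy / full:.4%}")
```

Because m shrinks by the product of the compression ratios, the n*m term dominates the proxy route, which matches the O(n*m) complexity stated above.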
2. How does the computational advantage of PT-DiT translate to performance in image and video generation tasks?
- For image generation, PT-DiT achieves a 48% reduction in GFLOPs compared to the original diffusion transformer and a 35% reduction compared to PixArt-α at the same parameter size.
- For video generation, at the same parameter size, PT-DiT's computational complexity is 77.2% of that of CogVideoX and 85% of that of EasyAnimate (with 3 million more parameters).