
Tora: Trajectory-oriented Diffusion Transformer for Video Generation

🌈 Abstract

The paper introduces Tora, the first trajectory-oriented Diffusion Transformer (DiT) framework for video generation that integrates text, image, and trajectory conditions. Tora encodes arbitrary trajectories into spacetime motion patches and injects them into the DiT architecture to enable precise motion control and realistic simulation of physical world movements. The key contributions include:

  • Tora, the first trajectory-oriented DiT for video generation that seamlessly integrates text, image, and trajectory conditions.
  • Novel Trajectory Extractor and Motion-guidance Fuser modules to align trajectory encoding with the scalability of DiT.
  • Experiments demonstrating Tora's ability to generate high-quality 720p videos of up to 204 frames, with precise trajectory control across diverse durations, aspect ratios, and resolutions.

🙋 Q&A

[01] Introduction

1. What are the key limitations of previous video diffusion models?

  • Previous video diffusion models predominantly used U-Net architectures, which focused on synthesizing videos of limited duration (around 2 seconds) and were constrained to fixed resolutions and aspect ratios.
  • These models struggled to generate long videos with consistent motion, often leading to issues like motion blur, appearance distortions, and unnatural movements.

2. How does Tora address these limitations?

  • Tora adopts the Diffusion Transformer (DiT) architecture, which has superior scaling properties compared to U-Net based models.
  • Tora introduces two novel modules - the Trajectory Extractor (TE) and the Motion-guidance Fuser (MGF) - to integrate arbitrary trajectories into the DiT framework, enabling precise motion control and realistic simulation of physical world movements.

3. What are the key contributions of this work?

  • Tora is the first trajectory-oriented DiT for video generation that seamlessly integrates text, image, and trajectory conditions.
  • The TE and MGF modules are designed to align trajectory encoding with the scalability of DiT.
  • Experiments demonstrate Tora's ability to generate high-quality 720p videos of up to 204 frames, with precise trajectory control across diverse durations, aspect ratios, and resolutions.

[02] Methodology

1. What is the overall architecture of Tora?

Tora's architecture consists of three components (a minimal code sketch follows the list):

  • Spatial-Temporal DiT (ST-DiT) for video generation
  • Trajectory Extractor (TE) to encode arbitrary trajectories into spacetime motion patches
  • Motion-guidance Fuser (MGF) to integrate the motion patches into the DiT blocks
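
To make the data flow concrete, here is a minimal PyTorch sketch of how these three components could fit together. The module names and internals (`st_dit_blocks`, `trajectory_extractor`, `fusers`) are illustrative assumptions, not the paper's code; note in particular that the paper integrates the MGF inside each DiT block rather than after it.

```python
import torch.nn as nn

class ToraSketch(nn.Module):
    """Illustrative skeleton of Tora's pipeline; not the official implementation."""

    def __init__(self, st_dit_blocks, trajectory_extractor, fusers):
        super().__init__()
        self.st_dit_blocks = st_dit_blocks                # stack of spatial-temporal DiT blocks
        self.trajectory_extractor = trajectory_extractor  # TE
        self.fusers = fusers                              # one Motion-guidance Fuser per block

    def forward(self, noisy_latents, timestep, text_emb, trajectory_maps):
        # TE: arbitrary trajectories -> multi-level spacetime motion patches
        motion_patches = self.trajectory_extractor(trajectory_maps)
        h = noisy_latents
        for block, fuser, patches in zip(self.st_dit_blocks, self.fusers, motion_patches):
            h = block(h, timestep, text_emb)  # standard ST-DiT computation
            h = fuser(h, patches)             # MGF injects motion guidance
        return h                              # predicted denoised video latents
```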

2. How does the Trajectory Extractor (TE) work?

  • The TE converts the input trajectory into a trajectory map, which is then transformed into the RGB color space using flow visualization techniques.
  • A 3D VAE compresses the trajectory maps into compact motion latent representations, which are then encoded into multi-level spacetime motion patches (see the sketch below).
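
Below is a minimal PyTorch sketch of this two-step pipeline. The rasterization heuristic, module names (`vae_3d`, `patch_embed`), and shapes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

def trajectory_to_flow_maps(points, num_frames, height, width):
    """Rasterize one sparse trajectory into per-frame 2-channel displacement maps.

    `points[t]` is the (x, y) position at frame t; the displacement to the
    next frame is scattered at that location. Illustrative only: the paper
    also applies Gaussian filtering to spread sparse trajectories, then
    renders the maps to RGB with standard flow-visualization code.
    """
    flow = torch.zeros(num_frames, 2, height, width)
    for t in range(num_frames - 1):
        x, y = points[t]                 # assumes points lie inside the frame
        dx = points[t + 1][0] - x
        dy = points[t + 1][1] - y
        flow[t + 1, 0, int(y), int(x)] = dx
        flow[t + 1, 1, int(y), int(x)] = dy
    return flow

class TrajectoryExtractorSketch(nn.Module):
    def __init__(self, vae_3d, patch_embed):
        super().__init__()
        self.vae_3d = vae_3d            # pretrained 3D VAE encoder (placeholder)
        self.patch_embed = patch_embed  # stacked conv layers -> motion patches

    def forward(self, flow_rgb):
        # flow_rgb: trajectory maps rendered to RGB, shape (B, 3, T, H, W)
        motion_latents = self.vae_3d.encode(flow_rgb)  # compress across space-time
        return self.patch_embed(motion_latents)        # multi-level motion patches
```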

3. How does the Motion-guidance Fuser (MGF) integrate the motion patches into the DiT blocks?

The MGF employs an adaptive normalization layer to infuse the multi-level motion patches into the corresponding DiT blocks. Compared with alternative fusion methods such as extra channel connections and cross-attention, adaptive normalization achieves the best generation performance and compute efficiency. A minimal sketch follows.
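
The sketch below shows one way such an adaptive-norm fuser could look, patterned after AdaLN-style conditioning; the exact layer shapes and normalization details in Tora may differ. It assumes the motion patches are already aligned token-for-token with the DiT hidden states.

```python
import torch.nn as nn

class MotionGuidanceFuserSketch(nn.Module):
    """Adaptive-norm fusion: motion patches modulate the DiT hidden states."""

    def __init__(self, hidden_dim, motion_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict a per-token scale and shift from the motion patches.
        self.to_scale_shift = nn.Linear(motion_dim, 2 * hidden_dim)
        # Zero-init: the fuser starts as an identity map, so training begins
        # from the unmodified pretrained DiT behaviour.
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, hidden_states, motion_patches):
        # hidden_states: (B, N, hidden_dim); motion_patches: (B, N, motion_dim)
        scale, shift = self.to_scale_shift(motion_patches).chunk(2, dim=-1)
        return hidden_states + self.norm(hidden_states) * scale + shift
```

Because the modulation comes from a single linear projection per block, this style of fusion adds little overhead, which is consistent with the compute-efficiency finding above.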

4. What training strategies are used for Tora?

  • A two-stage training approach is used for trajectory learning (a data-preparation sketch follows this list):
    1. First stage: train with dense optical flow as the trajectory to accelerate motion learning.
    2. Second stage: fine-tune the model on sparse trajectories sampled from the optical flow.
  • For image conditioning, Tora adopts the masking strategy from OpenSora.
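
The stage-two data can be built by thinning dense flow into a few point tracks. The heuristic below (seeding at motion-rich pixels, then advecting each point along the flow) is an illustrative assumption, not necessarily the paper's exact selection procedure.

```python
import torch

def sample_sparse_trajectories(dense_flow, num_tracks=8):
    """Pick sparse trajectories out of dense optical flow.

    dense_flow: (T, 2, H, W) displacement fields, frame t -> t+1.
    Returns tracks of shape (T, num_tracks, 2) holding (x, y) per frame.
    """
    T, _, H, W = dense_flow.shape
    magnitude = dense_flow[0].pow(2).sum(0).sqrt()           # (H, W) motion strength
    flat_idx = magnitude.flatten().topk(num_tracks).indices  # motion-rich seed pixels
    ys = (flat_idx // W).float()
    xs = (flat_idx % W).float()

    tracks = [torch.stack([xs, ys], dim=-1)]
    for t in range(T - 1):
        xi = xs.clamp(0, W - 1).long()
        yi = ys.clamp(0, H - 1).long()
        xs = xs + dense_flow[t, 0, yi, xi]  # advect each point by the local flow
        ys = ys + dense_flow[t, 1, yi, xi]
        tracks.append(torch.stack([xs, ys], dim=-1))
    return torch.stack(tracks)
```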

[03] Experiments

1. How does Tora perform compared to other motion-guided video generation approaches?

  • Tora demonstrates superior trajectory control, especially as the number of frames increases. While U-Net based methods like MotionCtrl and DragNUWA show good trajectory alignment for 16 frames, their performance degrades significantly for longer durations.
  • Tora's trajectory accuracy surpasses that of other methods by a factor of 3 to 5 in the 128-frame test setting, showcasing its exceptional motion control capabilities.

2. What are the key findings from the ablation studies?

  • The 3D VAE-based trajectory compression in the TE module outperforms simpler methods like keyframe sampling and average pooling.
  • The adaptive normalization layer in the MGF module achieves the best performance in terms of video quality and trajectory control, while also being the most computationally efficient.
  • Integrating the MGF within the Temporal DiT block enhances the module's ability to interact with temporal dynamics, leading to improved motion synthesis fidelity.
  • The two-stage training approach, starting with dense optical flow and then fine-tuning with sparse trajectories, is more effective than training with either type of trajectory alone.

3. What are the key strengths of Tora demonstrated in the experiments?

  • Tora can generate high-quality 720p videos with up to 204 frames, while maintaining precise control over the motion trajectories.
  • Tora's motion control capabilities are robust to variations in video duration, aspect ratio, and resolution, outperforming U-Net based methods.
  • Tora's generated videos exhibit realistic simulations of physical world movements, with smooth and natural-looking motions.