
Video-Infinity: Distributed Long Video Generation
Abstract
The paper introduces Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. The key contributions are:
- Addressing the challenges of long video generation with distributed parallel computation, improving scalability and reducing generation time.
- Introducing two interconnected mechanisms:
  - Clip parallelism, which optimizes the sharing of context information across GPUs.
  - Dual-scope attention, which adjusts temporal self-attention to ensure video coherence across devices.
- Experiments show the proposed approach can generate videos of up to 2,300 frames in just 5 minutes, up to 100 times faster than existing methods.
Q&A
[01] Distributed Long Video Generation
1. What are the key challenges addressed by Video-Infinity in enabling distributed long video generation? Video-Infinity addresses two main challenges:
- Ensuring effective communication among GPUs so they can share temporal and contextual information
- Adapting existing video diffusion models, typically trained on short sequences, to generate longer videos without additional training
2. How does Video-Infinity overcome these challenges? Video-Infinity introduces two key mechanisms:
- Clip parallelism:
  - Optimizes the gathering and sharing of context information across GPUs to minimize communication overhead.
  - Uses an interleaved communication strategy that completes the sharing in three steps (a minimal sketch of the boundary exchange follows this list).
- Dual-scope attention:
  - Modulates the temporal self-attention to balance local and global contexts efficiently across devices (see the attention sketch below).
  - Allows a model trained on short clips to be extended to long video generation while preserving overall coherence.
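
To make Clip parallelism concrete, here is a minimal sketch of how neighboring devices could exchange boundary context using `torch.distributed` point-to-point operations. The function name `exchange_boundary_context`, the context size `n_ctx`, and the single batched swap (standing in for the paper's three-step interleaved schedule) are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper's interleaved three-step exchange is
# simplified here to one batched neighbor swap. Assumes a NCCL process group
# is already initialized and each rank holds the latents of one clip.
import torch
import torch.distributed as dist

def exchange_boundary_context(clip_latents: torch.Tensor, n_ctx: int = 4):
    """clip_latents: [frames, C, H, W] latents of this device's clip.
    Sends the first/last n_ctx frames to the previous/next rank and
    receives the neighbors' boundary frames in return."""
    rank, world = dist.get_rank(), dist.get_world_size()
    left_ctx = torch.zeros_like(clip_latents[:n_ctx])    # from previous clip
    right_ctx = torch.zeros_like(clip_latents[-n_ctx:])  # from next clip
    ops = []
    if rank > 0:
        ops.append(dist.P2POp(dist.isend, clip_latents[:n_ctx].contiguous(), rank - 1))
        ops.append(dist.P2POp(dist.irecv, left_ctx, rank - 1))
    if rank < world - 1:
        ops.append(dist.P2POp(dist.isend, clip_latents[-n_ctx:].contiguous(), rank + 1))
        ops.append(dist.P2POp(dist.irecv, right_ctx, rank + 1))
    if ops:
        for req in dist.batch_isend_irecv(ops):
            req.wait()
    return left_ctx, right_ctx  # first/last clips keep zeros on their open side
```

Because only a handful of boundary frames move per step, the communication volume per device stays constant as more GPUs are added.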
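The dual-scope idea can likewise be sketched as ordinary attention over concatenated local and global key/value sets. The tensor shapes and the assumption that global frames arrive via an all-gather of a few sampled frames are ours, for illustration only.

```python
# Sketch of dual-scope temporal self-attention: each query attends to
# keys/values from its own clip (local scope) plus a small shared set of
# frames drawn from across the whole video (global scope). Illustrative only.
import torch
import torch.nn.functional as F

def dual_scope_attention(q, k_local, v_local, k_global, v_global):
    """q, k_local, v_local: [B, heads, T_local, d]  this clip's frames
    k_global, v_global:     [B, heads, T_global, d] frames gathered
                                                    across all devices"""
    k = torch.cat([k_local, k_global], dim=2)
    v = torch.cat([v_local, v_global], dim=2)
    # Because T_local and T_global are fixed, the per-device attention cost
    # stays constant no matter how long the full video grows.
    return F.scaled_dot_product_attention(q, k, v)
```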
3. How does Video-Infinity reduce the memory overhead compared to previous approaches? By combining Clip parallelism and Dual-scope attention, Video-Infinity reduces the attention memory overhead from quadratic to linear in video length (see the back-of-envelope sketch below). With enough devices and sufficient VRAM, the system can in principle generate videos of arbitrary, potentially unbounded length.
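
A back-of-envelope calculation makes the quadratic-to-linear claim concrete; the context sizes used below are hypothetical and serve only to illustrate the scaling.

```python
# Attention score entries per head: full temporal attention materializes an
# O(T^2) matrix, while a fixed local+global context of size c costs O(T * c),
# i.e. linear in the number of frames T.
def attn_score_entries(num_frames: int, context: int = 0) -> int:
    c = context if context else num_frames  # context=0 means full attention
    return num_frames * c

T = 2300
print(attn_score_entries(T))           # full attention: 5,290,000 entries
print(attn_score_entries(T, 64 + 16))  # 64 local + 16 global (hypothetical): 184,000
```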
[02] Experiments
1. What are the key findings from the experimental evaluation of Video-Infinity?
- On a setup of 8 Nvidia RTX 6000 Ada GPUs (48 GB each), Video-Infinity generates videos of up to 2,300 frames in just 5 minutes.
- Compared to the existing ultra-long text-to-video method Streaming T2V, Video-Infinity is up to 100 times faster in generating long videos.
- Video-Infinity outperforms other baselines, such as FreeNoise and OpenSora V1.1, in terms of both video length and quality, as evaluated by the VBench metrics.
2. How does Video-Infinity's performance compare to the baseline methods in generating long videos?
- Video-Infinity can generate the longest videos, up to 2,300 frames, which is 8.2 times more than the next best method, OpenSora V1.1.
- In generating 1,024-frame videos, Video-Infinity is over 100 times faster than Streaming T2V, the only baseline capable of producing videos of this length.
- Even when compared to Streaming T2V's generation of smaller, lower-resolution preview videos, Video-Infinity is 16 times faster.
3. How does the video quality of Video-Infinity compare to the baselines?
- Video-Infinity maintains better consistency and exhibits more pronounced motion in the generated videos than the baselines.
- In generating 64-frame videos, Video-Infinity's average metric scores are higher than those of FreeNoise and OpenSora V1.1.
- In the generation of 192-frame videos, Video-Infinity outperforms Streaming T2V, the only other method capable of producing videos of this length, across the majority of evaluated metrics.