
Video-Infinity: Distributed Long Video Generation

🌈 Abstract

The paper introduces Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. The key contributions are:

  • Addressing the challenges of long video generation using distributed parallel computation to enhance scalability and reduce generation times.
  • Introducing two interconnected mechanisms:
    1. Clip parallelism, which optimizes the sharing of context information across GPUs.
    2. Dual-scope attention, which adjusts temporal self-attention to ensure video coherence across devices.
  • Experiments show the proposed approach can generate videos up to 2,300 frames in just 5 minutes, which is up to 100 times faster than existing methods.

🙋 Q&A

[01] Distributed Long Video Generation

1. What are the key challenges addressed by Video-Infinity in enabling distributed long video generation? Video-Infinity addresses two main challenges:

  • Ensuring effective communication among GPUs to share temporal context information
  • Adapting existing video diffusion models, typically trained on short sequences, to generate longer videos without additional training

2. How does Video-Infinity overcome these challenges? Video-Infinity introduces two key mechanisms:

  1. Clip parallelism:
    • Optimizes the gathering and sharing of context information across GPUs to minimize communication overhead
    • Uses an interleaved communication strategy that completes the sharing in three steps (a minimal communication sketch follows this list)
  2. Dual-scope attention:
    • Modulates the temporal self-attention to balance local and global contexts efficiently across the devices
    • Allows a model trained on short clips to generate long videos with overall coherence (an attention sketch follows this list)
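
To make Clip parallelism concrete, below is a minimal sketch of how each device might trade a few boundary frames of its clip with its neighbours using torch.distributed. Everything here is an assumption for illustration (the function name, the `overlap` parameter, the simple pairwise exchange); the paper's interleaved strategy schedules this sharing to complete in three steps, which the sketch does not reproduce.

```python
# Minimal sketch of clip-parallel context exchange (illustrative, not the
# authors' implementation). Each rank denoises one clip of the video and
# trades `overlap` boundary frames with its left/right neighbours.
import torch
import torch.distributed as dist

def exchange_boundary_context(clip: torch.Tensor, overlap: int = 4):
    """clip: (frames, channels, height, width) latent clip on this rank.
    Returns (left_ctx, right_ctx), the frames received from the neighbours."""
    rank, world = dist.get_rank(), dist.get_world_size()
    left_ctx = torch.zeros_like(clip[:overlap])
    right_ctx = torch.zeros_like(clip[-overlap:])

    ops = []
    if rank > 0:  # send my first frames left, receive the neighbour's last frames
        ops.append(dist.P2POp(dist.isend, clip[:overlap].contiguous(), rank - 1))
        ops.append(dist.P2POp(dist.irecv, left_ctx, rank - 1))
    if rank < world - 1:  # send my last frames right, receive its first frames
        ops.append(dist.P2POp(dist.isend, clip[-overlap:].contiguous(), rank + 1))
        ops.append(dist.P2POp(dist.irecv, right_ctx, rank + 1))
    if ops:
        for req in dist.batch_isend_irecv(ops):
            req.wait()
    return left_ctx, right_ctx
```

The received frames can then be concatenated to the local clip before each temporal-attention call, so every device sees a little context from beyond its own clip.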
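
And a minimal sketch of the dual-scope idea for temporal self-attention: each query frame attends to a local window of neighbouring frames plus a small set of frames sampled across the whole video, so the per-frame cost stays fixed as the video grows. The `window` and `n_global` parameters and the uniform sampling are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of dual-scope temporal self-attention (illustrative).
# Each query frame attends to a local window around itself (local scope)
# plus a few frames sampled uniformly across the video (global scope).
import torch
import torch.nn.functional as F

def dual_scope_attention(q, k, v, window: int = 8, n_global: int = 8):
    """q, k, v: (frames, heads, dim) temporal tokens at one spatial location."""
    frames = q.shape[0]
    global_idx = torch.linspace(0, frames - 1, n_global).long()  # global scope
    out = torch.empty_like(q)
    for t in range(frames):
        lo, hi = max(0, t - window), min(frames, t + window + 1)  # local scope
        idx = torch.cat([torch.arange(lo, hi), global_idx]).unique()
        attn = F.scaled_dot_product_attention(
            q[t].unsqueeze(1),        # (heads, 1, dim) query for frame t
            k[idx].permute(1, 0, 2),  # (heads, |idx|, dim) mixed-scope keys
            v[idx].permute(1, 0, 2),  # (heads, |idx|, dim) mixed-scope values
        )
        out[t] = attn.squeeze(1)
    return out
```

Because each frame attends to at most 2 × window + 1 + n_global other frames, a model whose attention was trained on short clips keeps seeing inputs of roughly the size it was trained on, while the shared global frames tie the clips together.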

3. How does Video-Infinity reduce the memory overhead compared to previous approaches? By combining Clip parallelism with Dual-scope attention, Video-Infinity reduces the memory overhead of temporal attention from quadratic to linear in the number of frames. In principle, this allows the system to generate videos of arbitrary, potentially even unbounded, length, given enough parallel devices and VRAM.
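
A back-of-the-envelope check of that scaling, with illustrative numbers: full temporal self-attention over F frames touches F² frame pairs, while a fixed local window w plus g global frames touches roughly F · (2w + 1 + g).

```python
# Illustrative attention-cost arithmetic: quadratic vs. dual-scope (linear).
def attn_pairs(frames, window=None, n_global=0):
    if window is None:                           # full attention: O(F^2)
        return frames * frames
    return frames * (2 * window + 1 + n_global)  # dual-scope: O(F)

for f in (128, 512, 2048):
    print(f, attn_pairs(f), attn_pairs(f, window=8, n_global=8))
# 128  ->   16384 vs  3200
# 512  ->  262144 vs 12800
# 2048 -> 4194304 vs 51200
```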

[02] Experiments

1. What are the key findings from the experimental evaluation of Video-Infinity?

  • On a setup with 8 Nvidia RTX 6000 Ada GPUs (48 GB each), Video-Infinity can generate videos of up to 2,300 frames in just 5 minutes.
  • Compared to the existing ultra-long text-to-video method Streaming T2V, Video-Infinity is up to 100 times faster in generating long videos.
  • Video-Infinity outperforms other baselines, such as FreeNoise and OpenSora V1.1, in terms of both video length and quality, as evaluated by the VBench metrics.

2. How does Video-Infinity's performance compare to the baseline methods in generating long videos?

  • Video-Infinity can generate the longest videos, up to 2,300 frames, which is 8.2 times more than the next best method, OpenSora V1.1.
  • In generating 1,024-frame videos, Video-Infinity is over 100 times faster than Streaming T2V, the only baseline capable of producing videos of this length.
  • Even when compared to Streaming T2V's generation of smaller, lower-resolution preview videos, Video-Infinity is 16 times faster.

3. How does the video quality of Video-Infinity compare to the baselines?

  • Video-Infinity maintains better consistency and stronger motion in the generated videos than the baselines.
  • In the generation of 64-frame videos, Video-Infinity's average metric scores are higher than those of FreeNoise and OpenSora V1.1.
  • In the generation of 192-frame videos, Video-Infinity outperforms Streaming T2V, the only other method capable of producing videos of this length, across the majority of evaluated metrics.