
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

🌈 Abstract

The paper proposes Loopy, an end-to-end audio-conditioned diffusion model for portrait video generation that does not require spatial conditions. The key innovations are:

  1. Inter- and intra-clip temporal modules that leverage long-term motion dependency to learn natural motion patterns.
  2. An audio-to-latents module that enhances the correlation between audio and portrait motion by using strongly correlated conditions during training.

Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and higher-quality results across various scenarios.

🙋 Q&A

[01] Proposed Method

1. What are the key components of the Loopy framework? Loopy consists of the following key components (a minimal sketch follows this list):

  • Inter- and intra-clip temporal modules to capture long-term motion dependency
  • Audio-to-latents module to enhance the correlation between audio and portrait motion
  • Dual U-Net architecture with a reference net module to effectively incorporate reference image features
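
To make the composition concrete, below is a minimal PyTorch-style sketch of how such pieces could be wired together. The class and argument names, tensor shapes, and the additive way conditions are merged are illustrative assumptions made for this summary, not the authors' implementation (which uses a full dual U-Net with a reference net rather than the toy projections here).

```python
import torch
import torch.nn as nn


class LoopyLikeDenoiser(nn.Module):
    """Toy stand-in for a denoiser with reference, temporal, and audio conditioning."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.reference_proj = nn.Linear(dim, dim)   # stands in for reference-net features
        self.inter_clip_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_clip_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_proj = nn.Linear(dim, dim)       # stands in for audio-to-latents output

    def forward(self, noisy_latents, motion_frame_latents, reference_feat, audio_latents):
        # Inject reference-image features (simplified here as a projection plus addition).
        x = noisy_latents + self.reference_proj(reference_feat)

        # Inter-clip: current noisy latents attend to latents of preceding motion frames.
        x = x + self.inter_clip_attn(x, motion_frame_latents, motion_frame_latents)[0]

        # Intra-clip: temporal self-attention within the current clip only.
        x = x + self.intra_clip_attn(x, x, x)[0]

        # Add the audio-derived motion latents as an extra condition.
        return x + self.audio_proj(audio_latents)


# Dummy usage: 12 frames in the current clip, 24 preceding motion-frame latents, dim 64.
model = LoopyLikeDenoiser()
out = model(
    noisy_latents=torch.randn(2, 12, 64),
    motion_frame_latents=torch.randn(2, 24, 64),
    reference_feat=torch.randn(2, 12, 64),
    audio_latents=torch.randn(2, 12, 64),
)
print(out.shape)  # torch.Size([2, 12, 64])
```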

2. How do the inter- and intra-clip temporal modules work?

  • The inter-clip temporal layer models the cross-clip temporal relationships between motion frame latents and noisy latents.
  • The intra-clip temporal layer focuses on the temporal relationships within the noisy latents of the current clip.
  • The temporal segment module extends the temporal range covered by the inter-clip temporal layer and accounts for variations in information due to different distances from the current clip (a toy version of this segmentation is sketched below).
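
One plausible reading of the temporal segment idea is a pooling scheme over the history of motion-frame latents: recent frames keep fine temporal granularity, while more distant frames are pooled more coarsely, so a long temporal range fits into a short conditioning sequence. The sketch below is a toy version under that assumption; the segment sizes, strides, and average pooling are illustrative, not the paper's exact design.

```python
import torch


def segment_motion_frames(motion_latents, segment_sizes=(4, 8, 16), strides=(1, 2, 4)):
    """motion_latents: (batch, num_past_frames, dim), ordered oldest -> newest.

    Returns a shorter sequence in which recent frames keep fine granularity and
    more distant frames are average-pooled with larger strides.
    """
    segments = []
    end = motion_latents.shape[1]
    for size, stride in zip(segment_sizes, strides):
        start = max(end - size, 0)
        chunk = motion_latents[:, start:end]               # newest remaining segment
        # Average-pool along time; remainders not divisible by the stride are dropped.
        pooled = chunk.unfold(1, stride, stride).mean(dim=-1)
        segments.insert(0, pooled)                         # keep chronological order
        end = start
        if end == 0:
            break
    return torch.cat(segments, dim=1)


x = torch.randn(1, 28, 64)                  # 28 past motion-frame latents
print(segment_motion_frames(x).shape)       # torch.Size([1, 12, 64])
```

With the defaults above, a 28-frame history is compressed into 12 conditioning tokens while still spanning all 28 frames, which is the rough intuition behind extending the temporal range without growing the sequence the inter-clip layer must attend over.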

3. How does the audio-to-latents module work?

  • The audio-to-latents module maps audio and facial motion-related features (landmarks, head motion variance, expression motion variance) into motion latents based on a shared feature space.
  • These motion latents are then used as conditions in the denoising network, allowing the model to leverage strongly correlated motion conditions during training to strengthen the modeling of the relationship between audio and portrait motion (see the sketch below).
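
A hedged sketch of the shared-feature-space idea follows: audio features and motion-related features (e.g. landmarks or motion variance) are projected into one latent space, and the resulting motion latents are what the denoising network is conditioned on; at inference only the audio path is available. The dimensions, layers, and training-time switching logic are assumptions for illustration, not the paper's module.

```python
import torch
import torch.nn as nn


class AudioToLatents(nn.Module):
    def __init__(self, audio_dim=128, motion_dim=16, latent_dim=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.motion_proj = nn.Linear(motion_dim, latent_dim)   # landmarks / motion variance
        self.to_latents = nn.Sequential(
            nn.LayerNorm(latent_dim),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, audio_feat=None, motion_feat=None):
        # During training, a strongly correlated motion feature can supply the shared
        # representation; at inference only audio is available, so that path is used.
        if motion_feat is not None:
            shared = self.motion_proj(motion_feat)
        else:
            shared = self.audio_proj(audio_feat)
        return self.to_latents(shared)


module = AudioToLatents()
train_latents = module(motion_feat=torch.randn(1, 12, 16))   # strongly correlated condition
infer_latents = module(audio_feat=torch.randn(1, 12, 128))   # audio-only at inference
print(train_latents.shape, infer_latents.shape)              # both torch.Size([1, 12, 64])
```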

[02] Experiments and Results

1. How did Loopy perform on the CelebV-HQ and RAVDESS datasets?

  • On the CelebV-HQ dataset, which simulates real-world usage conditions, Loopy significantly outperformed the compared methods in most metrics, including video synthesis quality and lip-sync accuracy.
  • On the RAVDESS dataset, which evaluates emotional expression, Loopy outperformed the compared methods in the E-FID metric and motion dynamics metrics (Glo and Exp).

2. How did Loopy perform in the open-set test scenarios?

  • Loopy consistently outperformed the compared methods across diverse input styles (real people, anime, humanoid crafts, side face) and audio types (speech, singing, rap, emotional audio).
  • Subjective evaluations by experienced users showed Loopy's advantages in identity consistency, video synthesis quality, audio-emotion matching, motion diversity, motion naturalness, and lip-sync accuracy.

3. What were the key findings from the ablation studies?

  • The dual temporal layer design and the temporal segment module were crucial for improving temporal stability and motion quality.
  • The audio-to-latents module enhanced the modeling of the relationship between audio and portrait motion, leading to better overall performance.
  • Increasing the length of motion frames alone without the temporal modules resulted in degraded image quality, highlighting the importance of the proposed temporal modeling approach.