BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
Abstract
The paper proposes BetterDepth, a plug-and-play diffusion refiner for zero-shot monocular depth estimation (MDE). BetterDepth efficiently combines the strengths of zero-shot and diffusion-based MDE methods, achieving robust affine-invariant MDE performance with fine-grained details.
Q&A
[01] Introduction
1. What are the key challenges faced by existing MDE methods?
- Zero-shot MDE methods trained on large-scale datasets often produce over-smoothed details because of the low-quality labels in real-world datasets.
- Diffusion-based MDE methods can produce impressive depth maps with fine granularity, but struggle to outperform feed-forward MDE models due to the difficulty of learning diverse geometric priors from real-world datasets with sparse depth labels.
2. What is the goal of this work? The work aims to achieve robust affine-invariant MDE performance while capturing fine-grained details, by efficiently leveraging the strengths of both zero-shot and diffusion-based MDE methods.
3. What are the key contributions of this work?
- Propose the BetterDepth framework to boost zero-shot MDE methods with plug-and-play diffusion refiners.
- Design global pre-alignment and local patch masking strategies to enable learning detail refinement from small-scale synthetic datasets while preserving rich prior knowledge from pre-trained MDE models.
- Achieve state-of-the-art zero-shot MDE performance with fine-grained details through efficient training and inference.
[02] Method
1. How does BetterDepth combine the strengths of zero-shot and diffusion-based MDE methods? BetterDepth is composed of a pre-trained feed-forward MDE model (M_FFD) and a conditional latent diffusion model (M_DM). It utilizes the rich geometric prior from M_FFD to ensure accurate estimation of global depth context, and further employs M_DM to improve local estimation results via iterative refinement.
2. What are the key training strategies proposed for BetterDepth?
- Global pre-alignment: Aligns the depth conditioning from M_FFD to the ground truth depth labels, enhancing the faithfulness of BetterDepth to depth conditioning at the global scale.
- Local patch masking: Filters out significantly dissimilar local patches between depth conditioning and ground truth, enabling learning of detail refinement while preserving the prior knowledge from M_FFD.
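The two strategies above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The alignment is assumed to be a least-squares scale-and-shift fit, the patch distance is assumed to be a mean absolute difference, and the function names, the patch size, and the threshold `eta` are all hypothetical choices made for the example.

```python
import numpy as np

def global_pre_align(cond, gt):
    """Fit scale s and shift t by least squares so that s * cond + t matches gt.

    Assumption: an affine (scale + shift) fit; the summary only says the
    conditioning is aligned to the ground truth labels.
    """
    x = cond.ravel()
    y = gt.ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)      # columns: [cond, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)  # solves min ||A @ [s, t] - y||^2
    return s * cond + t

def local_patch_mask(aligned, gt, patch=8, eta=0.1):
    """Return a boolean pixel mask that is True inside patches to keep.

    Assumption: a patch is dropped when its mean absolute difference to the
    ground truth exceeds eta (hypothetical metric and threshold).
    """
    h, w = gt.shape
    mask = np.zeros((h, w), dtype=bool)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            a = aligned[i:i + patch, j:j + patch]
            g = gt[i:i + patch, j:j + patch]
            mask[i:i + patch, j:j + patch] = np.abs(a - g).mean() <= eta
    return mask

# Toy data: the coarse estimate is the ground truth under an unknown affine map.
demo_rng = np.random.default_rng(0)
demo_gt = demo_rng.uniform(0.0, 1.0, size=(32, 32))
demo_cond = 0.5 * demo_gt - 0.2
demo_aligned = global_pre_align(demo_cond, demo_gt)
demo_mask = local_patch_mask(demo_aligned, demo_gt)
```

In training, a mask like `demo_mask` would restrict the loss to patches where the conditioning and the label roughly agree, so the refiner learns to sharpen details without being pushed to overwrite the coarse prediction where the two disagree.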
3. How does BetterDepth achieve plug-and-play capability? Since BetterDepth treats the pre-trained M_FFD as a knowledge reservoir for zero-shot generalization and only needs to train the M_DM for detail refinement, the trained M_DM can be directly employed to improve other M_FFD models without re-training.
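The plug-and-play structure can be illustrated with a small structural sketch. It assumes only what is stated above: the refiner consumes the image and a coarse depth map, so the model that produces the coarse map can be swapped freely. Every name here is hypothetical, the "models" are toy stand-ins rather than real networks, and the latent encoding used by an actual latent diffusion model is omitted.

```python
from typing import Callable
import numpy as np

# Hypothetical interfaces, for illustration only.
DepthModel = Callable[[np.ndarray], np.ndarray]  # image -> coarse depth
RefineStep = Callable[[np.ndarray, np.ndarray, np.ndarray, int], np.ndarray]

def refine_depth(image: np.ndarray, ffd_model: DepthModel,
                 refine_step: RefineStep, num_steps: int = 2,
                 seed: int = 0) -> np.ndarray:
    """Run a feed-forward model, then refine its output iteratively."""
    coarse = ffd_model(image)                     # global depth context
    rng = np.random.default_rng(seed)
    current = rng.standard_normal(coarse.shape)   # start from noise
    for step in reversed(range(num_steps)):       # e.g. steps 1, 0
        current = refine_step(image, coarse, current, step)
    return current

# Toy stand-ins for two different feed-forward depth models.
def ffd_mean(image: np.ndarray) -> np.ndarray:
    return image.mean(axis=-1)

def ffd_max(image: np.ndarray) -> np.ndarray:
    return image.max(axis=-1)

# Toy stand-in for one refinement step: pull the estimate toward the
# coarse depth. A trained refiner would be a learned denoiser instead.
def toy_refine_step(image, coarse, current, step):
    return current + (coarse - current) / (step + 1)

demo_image = np.random.default_rng(1).uniform(size=(16, 16, 3))
depth_a = refine_depth(demo_image, ffd_mean, toy_refine_step)
depth_b = refine_depth(demo_image, ffd_max, toy_refine_step)  # same refiner, new model
```

The point of the sketch is the call signature: `refine_depth` never inspects how the coarse depth was produced, which is what lets one trained refiner be reused with a different feed-forward model.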
[03] Experiments and Analysis
1. What are the key findings from the quantitative comparisons?
- BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets, outperforming both feed-forward and diffusion-based MDE methods.
- BetterDepth can efficiently achieve robust MDE performance with fine-grained details by learning on small-scale synthetic datasets (e.g., 400 data pairs).
2. How does BetterDepth compare to previous methods in terms of training and inference efficiency?
- BetterDepth converges significantly faster and shows more stable performance during training than the previous diffusion-based method Marigold.
- BetterDepth can produce comparable or even better results than 50-step Marigold with only 2-step inference, demonstrating superior inference efficiency.
3. What are the limitations and future work of BetterDepth?
- Limitations: Model size and inference speed are still bounded by the chosen architecture of the pre-trained MDE model and the diffusion refiner.
- Future work: Explore more lightweight components (e.g., efficient UNet) and techniques like latent consistency models to further improve the efficiency of BetterDepth.