
Temporally Consistent Stereo Matching

🌈 Abstract

The paper proposes a temporally consistent stereo matching method that leverages temporal information from preceding frames to improve the temporal consistency, accuracy, and efficiency of disparity estimation. The key components are:

  • Temporal disparity completion: Provides a well-initialized disparity map by leveraging the result of the previous frame.
  • Temporal state fusion: Fuses the current state features from the completion module with the past hidden states to provide a temporally coherent initial hidden state for further refinement.
  • Iterative dual-space refinement: Refines the results in both the disparity and disparity-gradient spaces, improving estimates in ill-posed regions (see the pipeline sketch below).
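
To make the data flow concrete, here is a minimal PyTorch-style sketch of how these three components could chain together per frame. Everything here is an assumption for illustration: the module names (`completion`, `state_fusion`, `refiner`), their signatures, and the update rule are hypothetical stand-ins, not the paper's implementation.

```python
import torch


def spatial_gradient(disp):
    """Finite-difference disparity gradients, zero at the borders.

    disp: (B, 1, H, W) -> (B, 2, H, W) holding (grad_x, grad_y).
    """
    gx = torch.zeros_like(disp)
    gy = torch.zeros_like(disp)
    gx[..., :, 1:] = disp[..., :, 1:] - disp[..., :, :-1]
    gy[..., 1:, :] = disp[..., 1:, :] - disp[..., :-1, :]
    return torch.cat([gx, gy], dim=1)


def stereo_frame_step(left, right, prev_disp, prev_hidden, modules, num_iters=8):
    """One hypothetical per-frame pass: completion -> fusion -> refinement.

    The three entries of `modules` are assumed callables standing in for
    learned networks; their signatures are guesses, not the paper's API.
    """
    # 1) Temporal disparity completion: turn the disparity warped from the
    #    previous frame into a well-initialized dense map, plus current
    #    state features.
    init_disp, cur_state = modules["completion"](left, right, prev_disp)

    # 2) Temporal state fusion: fuse current state features with the past
    #    hidden state into a temporally coherent initial hidden state.
    hidden = modules["state_fusion"](cur_state, prev_hidden)

    # 3) Iterative dual-space refinement: jointly update the disparity and
    #    its gradient field, starting from the local initialization.
    disp, grad = init_disp, spatial_gradient(init_disp)
    for _ in range(num_iters):
        hidden, d_disp, d_grad = modules["refiner"](left, right, disp, grad, hidden)
        disp, grad = disp + d_disp, grad + d_grad

    # Outputs are carried over to initialize the next frame.
    return disp, hidden
```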

Extensive experiments demonstrate that the proposed method effectively alleviates temporal inconsistency while enhancing both accuracy and efficiency, achieving state-of-the-art results on synthetic and real-world datasets.

🙋 Q&A

[01] Temporal Consistency

1. What are the two types of metrics used to evaluate the temporal consistency of the proposed method? The paper designs two types of metrics to evaluate temporal consistency:

  • The first metric warps the disparity map predicted at the current timestamp into the image coordinates of the previous timestamp and computes its absolute difference from the disparity map at the previous timestamp.
  • The second metric calculates the change in error between the current and previous timestamps, tolerating temporary inconsistencies that arise while the model corrects its predictions toward the ground truth (a code sketch of both metrics follows).
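
A rough sketch of how the two metrics might be computed, assuming the current prediction has already been warped into the previous frame's coordinates (the pose-based warping itself is omitted). The function and argument names are illustrative, and the exact definitions in the paper may differ:

```python
import torch


def temporal_consistency_metrics(warped_cur_disp, prev_disp, prev_gt, valid):
    """Illustrative versions of the two temporal metrics.

    warped_cur_disp : current prediction warped into the previous view
    prev_disp       : prediction made at the previous timestamp
    prev_gt         : ground-truth disparity at the previous timestamp
    valid           : boolean mask of pixels visible in both frames
    """
    # Metric 1: raw absolute change between consecutive predictions,
    # compared in the previous frame's image coordinates.
    tc_change = (warped_cur_disp - prev_disp).abs()[valid].mean()

    # Metric 2: change in *error* rather than change in prediction, so a
    # disparity that moves toward the ground truth is not penalized.
    err_cur = (warped_cur_disp - prev_gt).abs()
    err_prev = (prev_disp - prev_gt).abs()
    tc_error_change = (err_cur - err_prev)[valid].mean()

    return tc_change.item(), tc_error_change.item()
```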

2. How do the proposed method's temporal consistency results compare to other methods? The paper's final models, (G) and (H), achieve the best temporal consistency and convergence compared to the other settings. The results demonstrate the effectiveness of the state fusion, temporal disparity completion module, and dual-space refinement module in improving temporal consistency, especially in occluded areas.

3. How does the temporal consistency of the single-frame mode (I) compare to the multi-frame mode (D)? While the single-frame mode (I) surpasses the multi-frame mode (D) in accuracy metrics, it is inferior to (D) in temporal metrics, further revealing the crucial role of temporal information in improving temporal consistency.

[02] Local Searching Range and Fast Convergence

1. How does the proposed method's local disparity searching range compare to RAFT-Stereo? The paper's method iterates from a well-initialized disparity, so refinement only has to search a local disparity range. This leads to smaller update step sizes and faster convergence to the ground truth than RAFT-Stereo, which regresses the disparity from scratch over the global disparity range (see the toy sketch below).
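
The effect can be illustrated with a toy numerical experiment. The fixed-fraction update below is a deliberate simplification standing in for the learned refinement operator, and the numbers are made up:

```python
import torch


def toy_refine(init_disp, gt_disp, num_iters=8, step=0.5):
    """Toy update loop: each iteration closes a fixed fraction of the
    gap to the target, standing in for a learned GRU-style update."""
    disp = init_disp.clone()
    residuals = []
    for _ in range(num_iters):
        disp = disp + step * (gt_disp - disp)
        residuals.append((gt_disp - disp).abs().mean().item())
    return residuals


gt = torch.full((1, 1, 4, 4), 40.0)      # hypothetical true disparity
global_init = torch.zeros_like(gt)       # from-scratch (RAFT-Stereo-style) start
local_init = torch.full_like(gt, 38.0)   # warped previous-frame start

print(toy_refine(global_init, gt))  # large early steps, slower convergence
print(toy_refine(local_init, gt))   # small residual from the start
```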

[03] Improvement on Ill-posed Areas

1. How does the proposed method's dual-space refinement module improve results in ill-posed regions? The dual-space refinement module iteratively refines the results in both the disparity space and the disparity gradient space. Over the iterations, local smoothness constraints expressed in the disparity gradient space are progressively extended to more global areas, improving the smoothness of ill-posed regions such as occlusions or reflective surfaces and producing more stable outputs (a hypothetical propagation sketch follows).
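
One plausible way to read "extending local smoothness to global areas" is repeated neighbor propagation guided by the predicted gradients. The operator below is a hypothetical illustration of that idea, not the authors' actual refinement module:

```python
import torch
import torch.nn.functional as F


def propagate_with_gradients(disp, grad_x, grad_y):
    """One hypothetical propagation step: re-estimate each pixel from its
    four neighbors extrapolated along the predicted disparity gradients.
    Applying it repeatedly diffuses local smoothness constraints into
    wider regions, e.g. across occluded or reflective areas."""
    # Neighbor disparities shifted into place (borders replicate-padded).
    left  = F.pad(disp, (1, 0, 0, 0), mode="replicate")[..., :, :-1]
    right = F.pad(disp, (0, 1, 0, 0), mode="replicate")[..., :, 1:]
    up    = F.pad(disp, (0, 0, 1, 0), mode="replicate")[..., :-1, :]
    down  = F.pad(disp, (0, 0, 0, 1), mode="replicate")[..., 1:, :]

    # Moving one pixel right should add grad_x; one pixel down, grad_y.
    candidates = torch.stack([
        left + grad_x,
        right - grad_x,
        up + grad_y,
        down - grad_y,
    ])
    return candidates.mean(dim=0)
```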

[04] Robustness Analysis

1. How does the proposed method handle dynamic scenes and incorrect poses? The paper notes that the method remains robust in ordinary dynamic scenes because temporal information is used only for initialization; disparities in dynamic regions are subsequently corrected by the iterative refinements. Under incorrect poses or large motions, the temporal information becomes unreliable, and the method degrades gracefully to searching for the ground truth over a global disparity range, similar to RAFT-Stereo.

[05] Benchmark Results

1. How does the proposed method perform on the KITTI 2015 benchmark compared to other state-of-the-art methods? On the KITTI 2015 benchmark, the proposed TC-Stereo method outperforms other methods in most key metrics, achieving the lowest error rates in both non-occluded and all regions. It also showcases exceptional efficiency, with a processing time of only 0.09 seconds per frame, significantly faster than other high-performance methods.
