
LiDAR-Event Stereo Fusion with Hallucinations
Abstract
Event stereo matching is an emerging technique to estimate depth from neuromorphic cameras. However, events are unlikely to trigger in the absence of motion or within large, untextured regions, making the correspondence problem extremely challenging. The paper proposes integrating a stereo event camera with a fixed-frequency active sensor, such as a LiDAR, to collect sparse depth measurements and overcome these limitations. The depth hints are used to hallucinate fictitious events, compensating for the lack of information where brightness changes are absent. The proposed techniques, Virtual Stack Hallucination (VSH) and Back-in-Time Hallucination (BTH), are general, apply to any structured representation used to stack events, and outperform state-of-the-art fusion methods applied to event-based stereo.
Q&A
[01] Introduction
1. What are the key challenges in depth estimation using traditional approaches?
- Accurate, prompt, and high-resolution depth information is crucial for many applications, but obtaining it remains an open challenge.
- Depth-from-stereo is one of the longest-standing approaches, but with conventional frame cameras it inherits their limitations, such as low dynamic range and motion blur.
2. How do event cameras differ from traditional imaging devices?
- Event cameras do not capture frames at fixed intervals; instead, they report per-pixel intensity changes as soon as they occur, with positive or negative polarity.
- This endows them with unparalleled features such as microsecond temporal resolution and high dynamic range, making them suitable for applications involving fast motion and challenging lighting; a sketch of how such an asynchronous stream is stacked into a frame-like representation is given at the end of this section.
3. What are the limitations of event-based stereo matching?
- Events are triggered only by brightness changes, resulting in semi-dense, often uninformative data, especially in the absence of motion or within large untextured regions.
- As a consequence, the downstream stereo network struggles to match events across the left and right cameras.
4. How can fusing event-based stereo with sparse depth measurements from active sensors help overcome these limitations?
- Fusing color information with sparse depth measurements from an active sensor, such as a LiDAR, can soften the weaknesses of passive depth sensing, despite the much lower resolution of the depth points.
- However, the fixed rate of depth sensors clashes with the asynchronous acquisition of event cameras, forcing a choice between using depth points only when they are available and throttling processing to the LiDAR pace; a sketch of how sparse LiDAR depth is converted into disparity hints is given at the end of this section.
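
To make the event-camera discussion above concrete, here is a minimal, illustrative sketch of how an asynchronous stream of (x, y, t, polarity) events can be stacked into a frame-like tensor that a stereo network can consume. The function name and the simple per-polarity count histogram are assumptions for illustration; the stacked representations used in the paper (e.g. voxel grids or other stackings) are richer but follow the same idea.

```python
import numpy as np

def stack_events(events, height, width):
    """Accumulate an asynchronous event stream into a 2-channel count
    histogram (one channel per polarity). Illustrative only.

    events: float array of shape (N, 4), columns (x, y, t, polarity),
            with polarity in {-1.0, +1.0}.
    """
    stack = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in events:
        channel = 0 if p > 0 else 1           # positive / negative polarity
        stack[channel, int(y), int(x)] += 1.0
    return stack
```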
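
Regarding the fusion with a fixed-frequency active sensor, the sparse LiDAR depth points are typically turned into disparity hints on the rectified stereo geometry before being used to guide matching. Below is a minimal sketch of that conversion via the standard relation d = f·B / z; the function and parameter names (`lidar_depth_to_disparity`, `focal_px`, `baseline_m`) are illustrative, not the paper's API.

```python
import numpy as np

def lidar_depth_to_disparity(sparse_depth, focal_px, baseline_m):
    """Convert a sparse depth map (metres, 0 where no LiDAR return) into
    sparse disparity hints (pixels) using d = f * B / z."""
    disparity = np.zeros_like(sparse_depth, dtype=np.float32)
    valid = sparse_depth > 0
    disparity[valid] = focal_px * baseline_m / sparse_depth[valid]
    return disparity, valid
```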
[02] Proposed Method
1. What are the two strategies proposed in the paper for fusing event-based stereo with sparse depth measurements?
- Virtual Stack Hallucination (VSH): Augmenting each channel of the stacked event representation with virtual patterns consistent with the depth measurements.
- Back-in-Time Hallucination (BTH): Hallucinating fictitious events directly in the continuous event streams, based on the depth measurements (both strategies are sketched in code at the end of this section).
2. How do VSH and BTH work, and what are the key differences between them?
- VSH requires explicit access to the stacked event representation, while BTH does not.
- VSH injects virtual patterns into the stacked representation, while BTH hallucinates fictitious events in the continuous event streams.
- Both strategies aim to increase the distinctiveness of the event data to ease the matching process for the downstream stereo model.
3. How do VSH and BTH handle the mismatch between the fixed rate of depth sensors and the asynchronous acquisition of event cameras?
- VSH and BTH can leverage depth measurements that are not synchronized with the event data, with only marginal drops in accuracy compared to the case of perfectly synchronized sensors.
- This preserves the microsecond resolution of event cameras while still exploiting the depth information.
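
As a rough illustration of Virtual Stack Hallucination, the sketch below writes a distinctive value into every channel of the left and right stacked representations at pixel pairs made consistent by the LiDAR-derived disparity hints, so the stereo network finds an easy, depth-consistent correspondence. This is a simplified stand-in: the exact virtual patterns and values used in the paper may differ.

```python
import numpy as np

def virtual_stack_hallucination(stack_l, stack_r, disparity, valid, value=1.0):
    """Inject a distinctive value into all channels of both stacks at
    pixel pairs (x, y) <-> (x - d, y) given by the sparse disparity hints.

    stack_l, stack_r: (C, H, W) stacked event representations
    disparity, valid: (H, W) sparse disparity hints and their mask
    """
    _, _, width = stack_l.shape
    ys, xs = np.nonzero(valid)
    for y, x in zip(ys, xs):
        xr = int(round(x - disparity[y, x]))
        if 0 <= xr < width:
            stack_l[:, y, x] = value   # left view: hint location
            stack_r[:, y, xr] = value  # right view: shifted by disparity
    return stack_l, stack_r
```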
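
Back-in-Time Hallucination can be sketched in a similar spirit, but operating on the raw streams rather than the stacks: fictitious events are appended at depth-consistent pixel pairs with timestamps slightly before the reference time, so any subsequent stacking picks them up. The repeated-injection variant mentioned above is mimicked by `n_repeats`; the timestamps, polarities, and `dt_us` values here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def back_in_time_hallucination(events_l, events_r, disparity, valid,
                               t_ref, n_repeats=3, dt_us=1000):
    """Append fictitious events to both raw streams at depth-consistent
    pixel pairs, with timestamps slightly before t_ref (microseconds),
    so that later stacking sees them without touching the stacks.

    events_l, events_r: float arrays of shape (N, 4), columns (x, y, t, p)
    disparity, valid:   (H, W) sparse disparity hints and their mask
    """
    ys, xs = np.nonzero(valid)
    fake_l, fake_r = [], []
    for y, x in zip(ys, xs):
        xr = x - disparity[y, x]
        if xr < 0:
            continue
        for k in range(1, n_repeats + 1):      # repeated injections
            t = t_ref - k * dt_us              # "back in time"
            fake_l.append((float(x), float(y), t, 1.0))
            fake_r.append((float(xr), float(y), t, 1.0))
    if fake_l:
        events_l = np.vstack([events_l, np.asarray(fake_l, dtype=events_l.dtype)])
        events_r = np.vstack([events_r, np.asarray(fake_r, dtype=events_r.dtype)])
    return events_l, events_r
```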
[03] Experiments
1. What datasets were used for the experiments, and what are their key characteristics?
- DSEC: An outdoor event stereo dataset with ground-truth disparity obtained by accumulating 16-line LiDAR scans.
- M3ED: An indoor/outdoor dataset with a 64-line LiDAR providing semi-dense ground-truth depth and a stereo event camera with a shorter baseline than DSEC's.
2. How did the proposed VSH and BTH strategies perform compared to existing fusion methods?
- VSH and BTH consistently outperformed existing fusion methods, such as Guided Stereo Matching, Concat, and Guided+Concat, on both the DSEC and M3ED datasets.
- The improvements were particularly significant when training the stereo backbones from scratch to exploit the LiDAR data.
3. How did VSH and BTH perform when dealing with time-misaligned LiDAR data on the M3ED dataset?
- Both VSH and BTH remained robust, retaining a significant gain over the baseline event-only stereo model even when the LiDAR data was misaligned by up to 100 ms.
- BTH with repeated event injections was the most robust solution, outperforming VSH when dealing with time-misaligned LiDAR data.