
StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory

🌈 Abstract

The article presents StreamMOS, a novel streaming network for moving object segmentation (MOS) on LiDAR point clouds. The key contributions are:

  • A streaming framework that exploits short-term and long-term memory to build associations among multiple inferences, improving the integrity and continuity of predictions.
  • A multi-view encoder that captures object motion and appearance from different perspectives (BEV and range view) using cascade projection and asymmetric convolution.
  • A voting mechanism that refines segmentation results at both voxel and instance levels using historical predictions stored in long-term memory.

🙋 Q&A

[01] Multi-Projection Feature Encoder

1. What are the key components of the multi-projection feature encoder?

  • The multi-projection feature encoder consists of:
    • Point-wise encoder using PointNet to process point clouds
    • Projection modules (P2B, P2R, B2P, R2P) to map point features to and from BEV and range view
    • Multi-view encoder (MVE) with cascade structure and asymmetric convolution to extract motion features from BEV and range view
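As a toy illustration of the point-to-BEV (P2B) style projection, the sketch below scatters per-point features onto a BEV grid with max-pooling. The grid size, spatial extent, and pooling choice are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def points_to_bev(points, feats, grid=(8, 8), extent=4.0):
    """Scatter per-point features onto a BEV grid by max-pooling.

    points: (N, 3) xyz coordinates; feats: (N, C) point features.
    Cell indices come from x/y only, mimicking a P2B projection.
    """
    H, W = grid
    bev = np.zeros((H, W, feats.shape[1]))
    # Map x, y in [-extent, extent) to integer cell indices.
    ix = ((points[:, 0] + extent) / (2 * extent) * H).astype(int).clip(0, H - 1)
    iy = ((points[:, 1] + extent) / (2 * extent) * W).astype(int).clip(0, W - 1)
    for i in range(len(points)):
        # Max-pool features of all points falling into the same cell.
        bev[ix[i], iy[i]] = np.maximum(bev[ix[i], iy[i]], feats[i])
    return bev
```

The range-view projection (P2R) follows the same scatter pattern with spherical instead of Cartesian indexing.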

2. How does the asymmetric convolution block (ACB) in the MVE help capture object motion?

  • The ACB uses convolution kernels of different sizes along the horizontal and vertical directions, letting it better perceive objects whose motion is pronounced along a particular direction.
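The idea can be sketched in plain NumPy: two parallel branches with a 1×K horizontal and a K×1 vertical kernel whose responses are summed, so each direction gets a different receptive field. The specific kernel sizes are simplifying assumptions for illustration:

```python
import numpy as np

def conv2d_same(img, kernel):
    """Zero-padded 2-D correlation (stride 1, 'same' output size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def acb(img, k_horizontal, k_vertical):
    """Asymmetric convolution block: sum a 1xK horizontal branch and a
    Kx1 vertical branch so each direction sees a different extent."""
    return conv2d_same(img, k_horizontal) + conv2d_same(img, k_vertical)
```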

3. What are the benefits of using a multi-view encoding approach compared to single-view?

  • Projecting points to both BEV and range view simultaneously captures more complete appearance and more salient motion cues of dynamic objects, yielding a more holistic observation than any single view.

[02] Short-term Temporal Fusion

1. What is the purpose of the short-term temporal fusion module?

  • The short-term temporal fusion module transfers the historical feature H_{t-1} from the previous inference to the current one, so that the spatial states of objects can be reused to guide the network in deducing object motion.

2. How does the short-term temporal fusion work?

  • It uses an attention mechanism with learnable offsets to adaptively relate the current motion feature F_t to the historical feature H_{t-1}, then combines them into an updated feature H_t.
  • The updated H_t is then passed to the segmentation decoder.
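A heavily simplified sketch of the fusion step. The paper uses deformable attention with learnable offsets; replacing it with plain dot-product attention over the full feature map is an assumption made to keep the example short:

```python
import numpy as np

def fuse_short_term(f_t, h_prev):
    """Toy short-term fusion: each current feature vector attends over
    the previous inference's feature map with dot-product attention,
    and the attended history is added back as a residual.

    f_t, h_prev: (N, C) flattened feature maps.
    """
    scores = f_t @ h_prev.T / np.sqrt(f_t.shape[1])   # (N, N) affinities
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return f_t + attn @ h_prev                        # updated feature H_t
```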

3. What are the benefits of the short-term temporal fusion?

  • It allows the network to leverage historical spatial information as a strong prior to enhance the current inference, improving the consistency of segmentation results.

[03] Long-term Voting Mechanism

1. What are the two components of the long-term voting mechanism?

  • Voxel-based voting (VBV): Analyzes the historical predictions stored in the long-term memory bank to determine the most frequent motion state for each voxel.
  • Instance-based voting (IBV): Leverages the instance-level information from the movable object predictions to further refine the motion states at the instance level.
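Voxel-based voting amounts to a per-voxel majority vote over the stored predictions. A minimal sketch, with tie-breaking and any confidence weighting omitted as simplifications:

```python
from collections import Counter

def voxel_based_voting(history):
    """Pick, for each voxel, the motion label that occurs most often
    across the predictions stored in long-term memory.

    history: list of dicts mapping voxel index -> label ('moving'/'static').
    """
    votes = {}
    for pred in history:
        for voxel, label in pred.items():
            votes.setdefault(voxel, []).append(label)
    # Majority label per voxel across the whole time window.
    return {v: Counter(labels).most_common(1)[0][0]
            for v, labels in votes.items()}
```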

2. How does the long-term voting mechanism improve the segmentation results?

  • The voting mechanism can explicitly suppress incorrect predictions and improve the temporal consistency of segmentation by analyzing long-term prediction patterns at both the voxel and instance levels.
  • It helps address the inconsistencies that can arise from the limited interpretability and data dependency of neural networks.

3. What is the role of the long-term memory bank in the voting mechanism?

  • The long-term memory bank stores the historical segmentation results, which are then used by the voting mechanism to refine the current predictions and ensure temporal continuity.
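A memory bank with a fixed time window can be sketched as a bounded queue that evicts the oldest prediction when full; the capacity value here is an illustrative assumption, not the paper's setting:

```python
from collections import deque

class LongTermMemoryBank:
    """Fixed-capacity store of past segmentation results. When full,
    the oldest prediction is evicted, bounding the voting time window."""

    def __init__(self, capacity=10):
        self.bank = deque(maxlen=capacity)

    def push(self, prediction):
        """Append the newest prediction, evicting the oldest if full."""
        self.bank.append(prediction)

    def snapshot(self):
        """Return the stored predictions, oldest first, for voting."""
        return list(self.bank)
```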

[04] Experimental Results

1. How does StreamMOS perform compared to previous methods on the SemanticKITTI and Sipailou-Campus datasets?

  • On the SemanticKITTI validation set, StreamMOS-VI* achieves an IoU of 81.6%, outperforming previous state-of-the-art methods like 4DMOS and InsMOS*.
  • On the Sipailou-Campus dataset, StreamMOS-V achieves an IoU of 92.5%, surpassing other competing approaches.

2. What is the inference speed of StreamMOS compared to other methods?

  • Despite the additional complexity of the temporal fusion and voting mechanism, StreamMOS maintains competitive inference speed, thanks to the projection-based backbone, lightweight deformable attention, and parameter-free upsampling in the decoder.

3. How do the ablation studies demonstrate the effectiveness of the key modules in StreamMOS?

  • The ablation studies show that the temporal fusion, multi-view encoder, and voting mechanism (both VBV and IBV) all contribute significantly to the overall performance improvement of StreamMOS.
  • The studies also provide insights into the optimal design choices, such as the effectiveness of the asymmetric convolution and the appropriate time window length for the voting mechanism.