Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization
Abstract
The paper introduces a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. The approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end to end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and carry out global optimization.
Q&A
[01] Introduction
1. What is the task of Simultaneous Localization and Mapping (SLAM)? SLAM is the task of estimating camera motion and a 3D map from video. The standard setup assumes a single continuous video.
2. What is the task of Multi-Session SLAM? Multi-Session SLAM is the task of estimating camera poses for all video frames under a single global reference, when the input consists of multiple disjoint video sequences.
3. What are the challenges in handling disjoint videos in Multi-Session SLAM? Video data in the wild often consists of multiple disjoint sessions, recorded either deliberately or split inadvertently by visual discontinuities in the video stream. Because the sessions share no temporal continuity, they must be related to one another purely from their visual content. Handling such disjoint videos is important for many applications in AR and robotics, and gives rise to the task of Multi-Session SLAM.
4. What are the limitations of existing approaches to Multi-Session SLAM? Existing solutions typically require additional sensor data to remove gauge freedoms and make tracking easier. Only a few methods, such as CCM-SLAM and ORB-SLAM3, support Multi-Session SLAM from monocular video alone, but they rely on classical feature descriptors and are therefore less accurate than recent designs based on deep networks.
[02] Approach
1. What are the key components of the proposed backbone architecture? The backbone architecture includes (a structural sketch follows this list):
- Feature extraction and feature pyramid generation
- Anchor-point selection and correlation feature computation
- An update operator that predicts iterative updates to optical flow and camera pose using a differentiable solver layer
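As a rough illustration of how these components interact, here is a minimal, hypothetical sketch of the iterative refinement loop; the function names (`corr_fn`, `update_fn`, `solver_fn`) are placeholders of my own, not the authors' API:

```python
import numpy as np

def refine_pose(matches, pose, corr_fn, update_fn, solver_fn, iters=8):
    """Illustrative refinement loop: each iteration looks up correlation
    features at the current matches, lets the update operator revise the
    flow field and confidence weights, and feeds the weighted matches to a
    differentiable solver layer that updates the camera pose."""
    weights = np.ones(len(matches))                 # per-match confidence
    for _ in range(iters):
        corr = corr_fn(matches)                     # correlation features at current matches
        d_flow, weights = update_fn(corr, weights)  # update operator: flow revision + new weights
        matches = matches + d_flow                  # revised matches (optical flow targets)
        pose = solver_fn(matches, weights, pose)    # differentiable solver layer
    return matches, weights, pose

# Toy stand-ins so the sketch runs end to end; a real system uses learned networks.
anchors = np.random.rand(64, 2) * 512               # anchor-point pixel coordinates
out = refine_pose(anchors.copy(), np.eye(4),
                  corr_fn=lambda m: np.zeros_like(m),
                  update_fn=lambda c, w: (np.zeros_like(c), w),
                  solver_fn=lambda m, w, p: p)
```

The key design point is that the solver sits inside the loop, so pose error can be backpropagated into the flow and confidence predictions during end-to-end training.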
2. How does the two-view solver work? The two-view solver fits the relative pose between two frames to the predicted matches by minimizing the symmetric epipolar distance (SED). A pre-conditioning stage based on a weighted 8-point algorithm initializes the pose, which is then refined by the SED solver layer.
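For concreteness, the two geometric ingredients mentioned here can be sketched in a few lines of NumPy. This is a simplified, non-differentiable illustration of a weighted 8-point initialization and the symmetric epipolar distance, not the paper's actual solver layer; homogeneous points with last component 1 are assumed:

```python
import numpy as np

def weighted_eight_point(x0, x1, w):
    """Weighted 8-point estimate of the essential matrix from homogeneous
    correspondences x0, x1 (N x 3) and confidence weights w (N,)."""
    A = np.stack([
        x1[:, 0] * x0[:, 0], x1[:, 0] * x0[:, 1], x1[:, 0],
        x1[:, 1] * x0[:, 0], x1[:, 1] * x0[:, 1], x1[:, 1],
        x0[:, 0],            x0[:, 1],            np.ones(len(x0)),
    ], axis=1)                                    # rows encode x1^T E x0 = 0
    _, _, Vt = np.linalg.svd(A * w[:, None])      # weighted least-squares null vector
    return Vt[-1].reshape(3, 3)

def symmetric_epipolar_distance(x0, x1, E):
    """Squared symmetric epipolar distance (SED) for each correspondence."""
    Ex0 = x0 @ E.T                                # epipolar lines of x0 in image 1
    Etx1 = x1 @ E                                 # epipolar lines of x1 in image 0
    num = np.square(np.sum(x1 * Ex0, axis=1))     # (x1^T E x0)^2
    return num * (1.0 / (Ex0[:, 0] ** 2 + Ex0[:, 1] ** 2) +
                  1.0 / (Etx1[:, 0] ** 2 + Etx1[:, 1] ** 2))
```

In the described solver, the pose is refined to minimize the confidence-weighted sum of these SED residuals over all predicted matches.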
3. How is the backbone adapted for visual odometry? For visual odometry, the backbone uses a multi-view solver that minimizes reprojection error and treats depth as a separate variable, similar to bundle adjustment. It also includes mechanisms to share latent features between updates from the same anchor point.
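As a sketch of what "minimizing reprojection error with depth as a separate variable" means, the per-match residual can be written as follows, under a simplified pinhole model with hypothetical names; in the described system such residuals are minimized jointly over poses and depths inside a differentiable, bundle-adjustment-style layer:

```python
import numpy as np

def reprojection_residual(K, R, t, uv0, depth0, uv1_pred):
    """Back-project pixels uv0 (N x 2) with per-pixel depth depth0 (N,),
    apply the relative pose (R, t), project into the second frame, and
    compare against the predicted matches uv1_pred (N x 2)."""
    ones = np.ones((len(uv0), 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv0, ones]).T).T  # unit-depth rays
    pts = rays * depth0[:, None]                            # 3D points in frame 0
    pts1 = (R @ pts.T).T + t                                # transform into frame 1
    proj = (K @ pts1.T).T
    uv1 = proj[:, :2] / proj[:, 2:3]                        # perspective division
    return uv1 - uv1_pred                                   # residual driven toward zero
```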
4. How does the system perform trajectory alignment for Multi-Session SLAM? To align disjoint trajectories, the system first estimates the relative rotation and translation direction between the sessions using the two-view solver. It then recovers the translation magnitude and the relative scale by comparing the depth obtained from the two-view matches with the depth from the visual odometry system.
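A hedged sketch of this alignment step, under the simplifying assumption that corresponding depths are compared with a robust median ratio (the paper's exact estimator may differ):

```python
import numpy as np

def align_disjoint_sessions(depth_two_view, depth_vo, R_rel, t_dir):
    """The two-view solver gives the relative rotation R_rel and a unit
    translation direction t_dir between one frame of each session; the
    missing scale is recovered by comparing depths of the matched points
    from the two-view geometry against depths of the same points in the
    visual-odometry reconstruction."""
    scale = np.median(np.asarray(depth_vo) / np.asarray(depth_two_view))
    T = np.eye(4)
    T[:3, :3] = R_rel
    T[:3, 3] = scale * t_dir        # translation magnitude fixed by the recovered scale
    return T, scale                 # rigid transform + relative map scaling
```

Recovering this scale is what allows two up-to-scale monocular reconstructions to be merged under a single global reference.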
[03] Experiments
1. How does the two-view pose estimation method perform compared to existing approaches? On the ScanNet and MegaDepth datasets, the two-view pose estimation method outperforms existing matching-based approaches, especially on more challenging scenes with fewer salient keypoints.
2. How does the Multi-Session SLAM system perform compared to other methods? On the EuRoC-MAV and ETH3D datasets, the proposed Multi-Session SLAM system achieves significantly lower error than ORB-SLAM3, outperforming it on 4 out of 5 trajectory groups in the ETH3D dataset.
3. What are the key architectural components that contribute to the performance of the two-view pose estimation method? Ablation studies show that the pre-conditioning stage, the SED solver, and the clamping of matches to the epipolar lines are all important components that contribute to the improved performance of the two-view pose estimation method.
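The "clamping of matches to the epipolar lines" can be pictured as projecting each predicted match onto the epipolar line induced by the current pose estimate. The following is an illustrative NumPy version of that geometric operation, not the paper's exact implementation:

```python
import numpy as np

def clamp_to_epipolar_lines(x1, x0, E):
    """Project each predicted match x1 onto the epipolar line defined by its
    counterpart x0 and the current essential matrix E (homogeneous points
    with last component 1)."""
    lines = x0 @ E.T                                       # lines a*x + b*y + c = 0 in image 1
    a, b, c = lines[:, 0], lines[:, 1], lines[:, 2]
    d = (a * x1[:, 0] + b * x1[:, 1] + c) / (a ** 2 + b ** 2)
    clamped = x1.copy()
    clamped[:, 0] -= a * d                                 # move along the line normal
    clamped[:, 1] -= b * d
    return clamped
```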