SAM 2: Segment Anything in Images and Videos
Abstract
The article presents Segment Anything Model 2 (SAM 2), a foundation model for promptable visual segmentation in images and videos. The key aspects include:
- Extending the promptable segmentation task from images to videos, allowing users to provide prompts on any frame to segment and track objects across the video.
- Developing a streaming memory-based model architecture (SAM 2) that can effectively segment and track objects in videos, while also performing well on image segmentation tasks.
- Introducing a large and diverse video segmentation dataset (SA-V) collected using an interactive data engine that employs SAM 2 to assist human annotators.
- Extensive experiments showing that SAM 2 outperforms prior work on video segmentation, using 3x fewer interactions, and also delivers better performance on image segmentation benchmarks while being 6x faster.
Q&A
[01] Promptable Visual Segmentation (PVS) Task
1. What is the PVS task and how does it relate to other segmentation tasks? The PVS task extends the Segment Anything (SA) task from static images to videos. In PVS, the model can be interactively prompted with clicks, bounding boxes or masks on any frame of a video, with the goal of segmenting and tracking the target object throughout the video. PVS can be seen as a generalization of the semi-supervised video object segmentation (VOS) task, where prompts are limited to the first frame, and the interactive VOS task, where scribbles are provided on multiple frames.
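To make this interaction pattern concrete, below is a minimal, self-contained Python sketch of how a PVS-style session could look. The `PointPrompt` and `ToyVideoPredictor` names and methods are illustrative stand-ins, not the released SAM 2 API: prompting only on frame 0 reduces to semi-supervised VOS, while later prompts refine the masklet mid-video.

```python
# Hypothetical sketch of the PVS interaction pattern; the class and method
# names below are illustrative stand-ins, not the released SAM 2 API.
from dataclasses import dataclass, field

@dataclass
class PointPrompt:
    frame_idx: int            # prompts may land on ANY frame, not just frame 0
    xy: tuple[float, float]   # click location in pixel coordinates
    positive: bool = True     # positive click = "this is the object"

@dataclass
class ToyVideoPredictor:
    """Stand-in for a promptable video segmenter (stubbed for illustration)."""
    num_frames: int
    prompts: list[PointPrompt] = field(default_factory=list)

    def add_prompt(self, prompt: PointPrompt) -> None:
        # In SAM 2 this would update the per-frame prediction and the memory bank.
        self.prompts.append(prompt)

    def propagate(self) -> dict[int, str]:
        # Returns a placeholder "masklet": one mask per frame for the target object.
        prompted = {p.frame_idx for p in self.prompts}
        return {t: ("refined" if t in prompted else "propagated")
                for t in range(self.num_frames)}

predictor = ToyVideoPredictor(num_frames=5)
predictor.add_prompt(PointPrompt(frame_idx=0, xy=(120, 80)))   # semi-supervised VOS case
masklet = predictor.propagate()                                # track through the video
predictor.add_prompt(PointPrompt(frame_idx=3, xy=(64, 40), positive=False))  # refine mid-video
masklet = predictor.propagate()
print(masklet)
```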
2. What are the key differences between PVS and prior video segmentation tasks? The key differences are:
- Prompts can be provided on any frame, not just the first frame
- The focus is on enhancing the interactive experience, allowing easy refinement of the segmentation with minimal interaction
- The annotation is not restricted to specific object classes, but can be on "any" object with a valid boundary, including parts and subparts
[02] SAM 2 Model
1. How does the SAM 2 model architecture differ from the original SAM? The main differences are:
- SAM 2 has a streaming memory module that stores information about the object and previous interactions, allowing it to generate masklet predictions throughout the video and refine them based on the stored memory context (a minimal sketch of this streaming loop follows this list).
- SAM 2 has an additional head that predicts whether the object of interest is present on the current frame, to account for cases where no valid object exists (e.g. due to occlusion).
- SAM 2 uses skip connections from the hierarchical image encoder to incorporate high-resolution information for mask decoding, unlike the original SAM.
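The following PyTorch sketch illustrates the streaming loop described above: the current frame's features attend to a small FIFO memory bank of previous frames and then feed a mask head and an occlusion ("is the object present?") head. The module choices, shapes, and memory size are simplifying assumptions for clarity, not the paper's exact architecture.

```python
# Illustrative sketch (PyTorch) of a streaming, memory-conditioned per-frame loop.
# Module names, shapes, and the FIFO memory-bank size are assumptions, not the
# exact SAM 2 implementation.
from collections import deque
import torch
import torch.nn as nn

D = 64  # toy embedding dimension

class ToyStreamingSegmenter(nn.Module):
    def __init__(self, mem_size: int = 6):
        super().__init__()
        self.image_encoder = nn.Linear(D, D)          # stand-in for the hierarchical encoder
        self.memory_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.mask_head = nn.Linear(D, 1)              # stand-in mask decoder (per-token logit)
        self.occlusion_head = nn.Linear(D, 1)         # predicts whether the object is visible
        self.memory = deque(maxlen=mem_size)          # FIFO bank of past frame features

    def step(self, frame_tokens: torch.Tensor):
        """Process one frame, conditioned on the memory of previous frames/prompts."""
        feats = self.image_encoder(frame_tokens)                 # (1, N, D)
        if self.memory:
            mem = torch.cat(list(self.memory), dim=1)            # (1, M*N, D)
            feats, _ = self.memory_attn(feats, mem, mem)         # attend to stored memory
        mask_logits = self.mask_head(feats).squeeze(-1)          # (1, N)
        occlusion_logit = self.occlusion_head(feats.mean(dim=1)) # (1, 1)
        self.memory.append(feats.detach())                       # store for future frames
        return mask_logits, occlusion_logit

model = ToyStreamingSegmenter()
video = torch.randn(8, 1, 16, D)   # 8 frames, each with 16 toy spatial tokens
for frame in video:                 # frames arrive one at a time (streaming)
    masks, visible = model.step(frame)
```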
2. How does SAM 2 handle ambiguous prompts that could correspond to multiple valid masks? For ambiguous prompts (e.g. a single click that could refer to a whole object or one of its parts), SAM 2 predicts multiple candidate masks on each frame. If follow-up prompts do not resolve the ambiguity, the model selects and propagates the mask with the highest predicted IoU for the current frame.
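A minimal sketch of this selection rule, assuming the model exposes K candidate mask logits and K predicted IoU scores per frame (the shapes here are illustrative):

```python
# Minimal sketch of ambiguity handling: keep several candidate masks and, when
# later prompts do not disambiguate, propagate the one with the highest
# predicted IoU. Array shapes are illustrative assumptions.
import numpy as np

def select_mask(candidate_masks: np.ndarray, predicted_ious: np.ndarray) -> np.ndarray:
    """candidate_masks: (K, H, W) logits; predicted_ious: (K,) confidence scores."""
    best = int(np.argmax(predicted_ious))
    return candidate_masks[best] > 0        # binarize the chosen candidate's logits

masks = np.random.randn(3, 4, 4)            # e.g. whole-object / part / sub-part hypotheses
ious = np.array([0.91, 0.78, 0.65])
chosen = select_mask(masks, ious)           # the highest-IoU hypothesis is tracked onward
```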
3. What are the key training strategies used for SAM 2? SAM 2 is trained jointly on image and video data using an alternating training strategy. It is first pre-trained on the SA-1B dataset, and then further trained on a mix of SA-V, Internal, and open-source video datasets. During training, the model simulates an interactive setting by randomly selecting frames to provide prompts.
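The toy schedule below illustrates the idea of alternating image and video batches and simulating interactivity by choosing which frames of a clip receive prompts. The 50/50 mixing ratio, the 8-frame clip length, and the choice of two prompted frames per clip are assumptions for illustration, not the paper's hyper-parameters.

```python
# Toy sketch of the joint training recipe: alternate image and video batches and,
# for video clips, simulate interactivity by picking a few frames to receive
# prompts. The sampling ratio and "2 prompted frames per 8-frame clip" are
# illustrative assumptions.
import random

def training_schedule(num_steps: int, video_prob: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    for step in range(num_steps):
        if rng.random() < video_prob:
            # Video batch: choose which frames of an 8-frame clip get (corrective) prompts.
            prompted_frames = sorted(rng.sample(range(8), k=2))
            yield step, "video", prompted_frames
        else:
            # Image batch: a single frame, always prompted (as in SA-1B pre-training).
            yield step, "image", [0]

for step, kind, frames in training_schedule(4):
    print(step, kind, frames)
```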
[03] SA-V Dataset
1. What are the key characteristics of the SA-V dataset?
- SA-V contains 50.9K videos with 642.6K annotated masklets, making it the largest video segmentation dataset to date.
- The dataset covers diverse indoor and outdoor scenes, with videos captured by a geographically diverse set of participants.
- In addition to manual annotations, the dataset also includes automatically generated masklets to enhance the coverage of annotations.
- The dataset has a challenging set of objects, including small, occluded, and reappearing objects, as well as object parts.
2. How was the SA-V dataset collected using the data engine? The data engine went through three phases:
- Using SAM for per-frame annotation
- Using SAM + SAM 2 Mask for temporal propagation of masks
- Using the full SAM 2 model in the loop with annotators
The final phase with SAM 2 in the loop was 8.4x faster than the initial per-frame annotation, while maintaining comparable quality.
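The third phase can be pictured as a simple correct-and-repropagate loop: propagate a masklet with the current model, let an annotator fix unsatisfactory frames, and re-propagate from the corrections. Everything in the sketch below is a hypothetical placeholder for the real data-engine tooling, meant only to illustrate the flow.

```python
# Schematic of the phase-3 "model in the loop" annotation flow. All callables
# here are hypothetical placeholders standing in for the real tooling.
def annotate_masklet(video_frames, model_propagate, annotator_fix, is_good, max_rounds=3):
    masklet = model_propagate(video_frames, corrections={})
    corrections = {}
    for _ in range(max_rounds):
        bad_frames = [t for t, mask in masklet.items() if not is_good(mask)]
        if not bad_frames:
            break
        for t in bad_frames:
            corrections[t] = annotator_fix(t, masklet[t])   # manual refinement prompts
        masklet = model_propagate(video_frames, corrections)  # re-propagate with fixes
    return masklet

# Toy usage with stubbed components:
frames = list(range(5))
masklet = annotate_masklet(
    frames,
    model_propagate=lambda fr, corrections: {t: corrections.get(t, "auto") for t in fr},
    annotator_fix=lambda t, mask: "fixed",
    is_good=lambda mask: mask == "fixed",
)
print(masklet)
```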
[04] Experiments
1. How does SAM 2 perform on zero-shot video segmentation tasks? SAM 2 outperforms two strong baselines (SAM+XMem++ and SAM+Cutie) on 9 zero-shot video segmentation datasets, in both an offline interactive setting and an online interactive setting. SAM 2 achieves better segmentation accuracy while using 3x fewer interactions than the baselines.
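As a rough illustration of how the two interactive settings differ, the sketch below contrasts an "offline" protocol that repeatedly revisits the currently worst frame across passes with an "online" single-pass protocol. The error model and click-placement rule are placeholders, not the paper's exact evaluation code.

```python
# Hedged sketch contrasting the two interactive evaluation settings: "offline"
# re-visits the frame with the largest remaining error over multiple passes,
# while "online" annotates in a single forward pass. Error handling details are
# placeholders for illustration only.
import numpy as np

def offline_eval(errors_per_frame: np.ndarray, n_interactions: int) -> list[int]:
    """Pick which frames receive clicks: always the currently worst frame."""
    errors = errors_per_frame.copy()
    clicked = []
    for _ in range(n_interactions):
        worst = int(np.argmax(errors))
        clicked.append(worst)
        errors[worst] *= 0.5        # assume a click roughly halves that frame's error
    return clicked

def online_eval(errors_per_frame: np.ndarray, threshold: float) -> list[int]:
    """Single pass: click only on frames whose error exceeds a threshold."""
    return [t for t, e in enumerate(errors_per_frame) if e > threshold]

errors = np.array([0.1, 0.4, 0.05, 0.3])
print(offline_eval(errors, n_interactions=3))   # -> [1, 3, 1]
print(online_eval(errors, threshold=0.25))      # -> [1, 3]
```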
2. How does SAM 2 perform on the semi-supervised video object segmentation (VOS) task? SAM 2 also outperforms the baselines on the semi-supervised VOS task, where prompts are limited to the first frame, achieving significantly higher accuracy across a wide range of datasets, including standard VOS benchmarks and the more challenging SA-V dataset.
3. How does SAM 2 perform on zero-shot image segmentation tasks compared to prior work? When evaluated on 37 zero-shot image segmentation datasets, including the original 23 datasets used to evaluate SAM, SAM 2 outperforms both SAM and HQ-SAM. SAM 2 achieves higher accuracy while being 6x faster than the original SAM model.