Segment Anything without Supervision
Abstract
The article presents UnSAM, an unsupervised learning method for image segmentation that performs both interactive and whole-image segmentation without the need for human annotations. The key highlights are covered in the Q&A below.
Q&A
[01] Unsupervised Pseudo-Mask Generation
1. How does UnSAM generate high-quality pseudo masks without supervision?
- UnSAM introduces a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes:
  - The divide stage leverages a top-down clustering method (CutLER) to extract initial semantic- and instance-level masks.
  - The conquer stage then refines these masks with a bottom-up clustering method, iteratively merging semantically similar pixels into larger segments.
- This divide-and-conquer pipeline yields masks at a wide range of granularities with minimal extra cost, capturing subtle details that human annotators often miss; a simplified sketch follows.
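To make the pipeline concrete, here is a minimal Python sketch of the divide-and-conquer idea. The greedy neighbour-merging in `conquer`, the cosine-similarity thresholds, and the per-pixel starting segments are illustrative assumptions rather than the paper's exact procedure; `top_down_masks` stands in for the output of a top-down method such as CutLER, and `feats` for per-pixel features from a self-supervised backbone.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two mean feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def conquer(feats, mask, thresholds=(0.9, 0.8, 0.7)):
    """Bottom-up "conquer" stage (sketch): starting from single pixels inside
    one coarse top-down mask, greedily merge neighbouring segments whose mean
    features are similar; each threshold level yields one granularity."""
    H, W, _ = feats.shape
    labels = -np.ones((H, W), dtype=int)          # -1 = outside the coarse mask
    coords = np.argwhere(mask)
    for i, (y, x) in enumerate(coords):
        labels[y, x] = i                          # every pixel starts alone
    levels = []
    for thr in thresholds:                        # looser threshold -> coarser segments
        changed = True
        while changed:                            # merge to a fixpoint per level
            changed = False
            for y, x in coords:
                for dy, dx in ((0, 1), (1, 0)):   # right/down neighbours
                    ny, nx = y + dy, x + dx
                    if ny < H and nx < W and labels[ny, nx] >= 0:
                        a, b = labels[y, x], labels[ny, nx]
                        if a != b and cosine_sim(feats[labels == a].mean(0),
                                                 feats[labels == b].mean(0)) > thr:
                            labels[labels == b] = a
                            changed = True
        levels.append([labels == s for s in np.unique(labels[labels >= 0])])
    return levels

def divide_and_conquer(feats, top_down_masks):
    """Divide: coarse instance masks from a top-down method (e.g. CutLER).
    Conquer: refine each mask into multi-granularity pseudo-masks."""
    pseudo = list(top_down_masks)
    for m in top_down_masks:
        for level in conquer(feats, m):
            pseudo.extend(level)
    return pseudo
```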
2. How does UnSAM's pseudo-mask generation compare to prior unsupervised methods?
- Most prior works rely solely on top-down clustering, missing the hierarchical structure present in complex images.
- Methods that do incorporate bottom-up clustering can detect only a limited range of entity sizes.
- UnSAM's divide-and-conquer strategy demonstrates qualitatively and quantitatively superior performance, producing high-quality, detailed pseudo-masks that better capture the hierarchical complexity of visual scenes.
[02] Model Learning and Self-Training
1. How does UnSAM leverage the generated pseudo-masks for model training?
- UnSAM learns an image segmentation model (Mask2Former) using the pseudo-masks discovered by the divide-and-conquer strategy.
- UnSAM then performs self-training, where high-confidence mask predictions are merged into the training set as new "ground-truth" annotations to further improve the model; a minimal sketch of one round appears below.
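A minimal sketch of one self-training round, assuming predictions arrive as (mask, confidence) pairs; the confidence and IoU thresholds here are illustrative, not the paper's values:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def self_training_round(pseudo_masks, predictions, conf_thresh=0.8, iou_thresh=0.5):
    """One self-training round (sketch): keep confident predictions and add
    those that do not duplicate an existing pseudo-mask as new 'ground truth'."""
    merged = list(pseudo_masks)
    for mask, score in predictions:               # (bool mask, confidence score)
        if score < conf_thresh:
            continue                              # discard uncertain predictions
        if all(mask_iou(mask, m) < iou_thresh for m in merged):
            merged.append(mask)                   # novel entity -> new label
    return merged
```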
2. How does UnSAM enable promptable image segmentation?
- UnSAM utilizes Semantic-SAM as the base model for predicting multiple granularity levels of masks from a single click prompt.
- During training, UnSAM randomly samples points within each pseudo-mask to simulate user clicks, as sketched below.
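A minimal sketch of the click simulation, assuming masks are boolean NumPy arrays:

```python
import numpy as np

def sample_click(mask, rng=None):
    """Simulate an interactive click (sketch): pick a uniformly random
    foreground pixel of a pseudo-mask and return it as an (x, y) prompt."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    i = rng.integers(len(ys))
    return int(xs[i]), int(ys[i])

# Each sampled click is then paired with every pseudo-mask containing that
# point, so the model learns to predict all granularity levels per click.
```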
[03] Improving Supervised SAM with Unsupervised Segmentation
1. How does UnSAM+ leverage both supervised and unsupervised annotations?
- UnSAM+ merges SA-1B's ground-truth masks with UnSAM's unsupervised segmentation masks based on IoU.
- This fusion leverages the strengths of both supervised and unsupervised annotations, addressing the limitations of human-annotated datasets while enriching the diversity and comprehensiveness of the training data; a sketch of the IoU-based merge follows.
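A minimal sketch of this fusion, under the assumption that an unsupervised mask is added only when it does not substantially overlap any ground-truth mask; the exact rule and threshold in UnSAM+ may differ:

```python
import numpy as np

def merge_annotations(gt_masks, unsup_masks, iou_thresh=0.5):
    """UnSAM+-style fusion (sketch): keep all SA-1B ground-truth masks and
    add only those unsupervised masks that overlap no ground-truth mask
    above an IoU threshold, i.e. entities the human annotations missed."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0

    combined = list(gt_masks)
    for um in unsup_masks:
        if all(iou(um, gm) < iou_thresh for gm in gt_masks):
            combined.append(um)                   # entity missed by annotators
    return combined
```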
2. How does UnSAM+ outperform the supervised SAM model?
- UnSAM+ surpasses SAM's Average Recall by over 6.7% and Average Precision by 3.9% on the SA-1B dataset.
- UnSAM+ can often discover entities missed by SAM, particularly small entities and fine-grained details that human annotators tend to overlook.
[04] Experimental Results
1. How does UnSAM perform compared to the state-of-the-art unsupervised methods?
- UnSAM outperforms the previous state-of-the-art unsupervised methods by 11% in Average Recall across various evaluation datasets.
- On datasets focused on part-level segmentation, such as PartImageNet and PACO, UnSAM exceeds the state-of-the-art by 16.6% and 12.6%, respectively.
2. How does UnSAM's performance compare to the supervised SAM model?
- Even when trained with only 1% of the SA-1B dataset and a smaller backbone, UnSAM achieves performance very close to the fully-supervised SAM model.
- On certain datasets, UnSAM surpasses SAM's performance, demonstrating its ability to capture details that are often missed by human annotators.