ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
Abstract
The article discusses the problem of unsupervised semantic segmentation, where only images without annotations are available. It observes that when adopting CLIP for this pixel-level understanding task, unexpected biases (including class-preference bias and space-preference bias) occur, which largely constrain the segmentation performance. The paper proposes to explicitly model and rectify these biases to facilitate the unsupervised semantic segmentation task.
Q&A
[01] Introduction
1. What are the key observations made about CLIP when applied to unsupervised semantic segmentation?
- The authors observe that adopting CLIP for unsupervised semantic segmentation introduces two unexpected biases:
  - Class-preference bias: CLIP tends to incorrectly classify certain objects as other related classes.
  - Space-preference bias: CLIP performs better for segmenting central objects than objects near the image boundary.
2. How do previous works address the bias issue in CLIP-based unsupervised semantic segmentation?
- Previous works do not explicitly model the bias in CLIP, and this unaddressed bias largely constrains their segmentation performance.
3. What is the key contribution of this paper?
- The paper proposes to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task.
[02] Method
1. How does the proposed method model the class-preference bias and space-preference bias in CLIP?
- The method designs a learnable "Reference" prompt to encode the class-preference bias, and projects the positional embedding in the vision transformer to encode the space-preference bias.
- The class-preference bias and space-preference bias are encoded into different features (Reference feature and positional feature), and then combined via matrix multiplication to generate a bias logit map.
2. How does the method rectify the bias in CLIP predictions?
- The method rectifies the logits of CLIP by a simple element-wise subtraction between the original CLIP logits and the bias logit map.
- To make the rectified results smoother and more contextual, a mask decoder is designed to take the rectified logit map and the CLIP visual feature as input, and output the final rectified segmentation mask.
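The bias modeling and subtraction steps described in questions 1 and 2 can be illustrated with a minimal NumPy sketch. The shapes, the variable names (`reference_feat`, `positional_feat`, `clip_logits`) and the random inputs are assumptions made for illustration, not the authors' implementation. The mask decoder is omitted, so the final `argmax` only stands in for it.

```python
import numpy as np

def bias_logit_map(reference_feat, positional_feat):
    """Combine class-bias and space-bias features by matrix multiplication.

    reference_feat:  (C, D) one feature per class (hypothetical shape)
    positional_feat: (D, H*W) one feature per spatial location (hypothetical shape)
    returns:         (C, H*W) bias logit for every class at every location
    """
    return reference_feat @ positional_feat

def rectify(clip_logits, bias_logits):
    """Element-wise subtraction of the bias logit map from the CLIP logits."""
    return clip_logits - bias_logits

# Toy inputs; in the method these come from learned prompts and CLIP itself.
rng = np.random.default_rng(0)
C, D, H, W = 3, 8, 4, 4
reference_feat = rng.normal(size=(C, D))
positional_feat = rng.normal(size=(D, H * W))
clip_logits = rng.normal(size=(C, H * W))

bias = bias_logit_map(reference_feat, positional_feat)
rectified = rectify(clip_logits, bias)

# Stand-in for the mask decoder: pick the highest rectified logit per pixel.
pred = rectified.argmax(axis=0).reshape(H, W)
```

Because the rectification is a plain subtraction, adding the bias map back to the rectified logits recovers the original CLIP logits.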
3. How is the bias modeling and rectification process supervised?
- A contrastive loss based on masked visual features and text features of different classes is imposed to make the bias modeling and rectification process meaningful and effective.
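This summary does not give the exact form of the contrastive loss, so the following is only a sketch of one plausible formulation. It assumes mask-weighted average pooling to obtain one visual feature per class, then an InfoNCE-style cross-entropy over cosine similarities with the text features. The function names and the temperature of 0.07 are hypothetical choices, not taken from the paper.

```python
import numpy as np

def masked_class_features(visual_feat, mask_probs):
    """Pool one visual feature per class, weighted by that class's soft mask.

    visual_feat: (H*W, D) per-pixel visual features (hypothetical shape)
    mask_probs:  (C, H*W) non-negative soft masks (hypothetical shape)
    returns:     (C, D)
    """
    weights = mask_probs / (mask_probs.sum(axis=1, keepdims=True) + 1e-8)
    return weights @ visual_feat

def l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def contrastive_loss(masked_feat, text_feat, temperature=0.07):
    """InfoNCE-style loss (an assumption): the masked feature of class c
    should be closest to the text feature of class c among all classes."""
    sim = l2_normalize(masked_feat) @ l2_normalize(text_feat).T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy check: aligned features give a lower loss than shuffled ones.
text_feat = np.eye(3, 8)
aligned = contrastive_loss(text_feat, text_feat)
misaligned = contrastive_loss(text_feat[[1, 2, 0]], text_feat)
```

Under this formulation, a mask that covers the wrong region pulls the pooled feature away from its class text feature and raises the loss, which is the signal that would make the bias modeling meaningful.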
4. How does the method further improve the segmentation performance?
- The knowledge from the rectified CLIP is distilled to an advanced segmentation architecture (DeepLab V2) via mask-guided, feature-guided and text-guided loss terms.
[03] Experiments
1. What are the key findings from the experimental results?
- The proposed ReCLIP++ method performs favorably against previous state-of-the-art unsupervised semantic segmentation methods on various benchmarks.
- Compared to the conference version ReCLIP, ReCLIP++ exhibits stronger bias rectification capability and achieves better segmentation performance.
- Ablation studies verify the effectiveness of each technical component in the proposed framework.