Odd-One-Out: Anomaly Detection by Comparing with Neighbors
Abstract
This paper introduces a novel anomaly detection (AD) problem that focuses on identifying 'odd-looking' objects relative to the other instances within a scene. The proposed setting involves multi-object, multi-view scenes, where anomalies are defined relative to the regular instances that make up the majority. To provide a testbed for future research, the authors introduce two benchmarks, ToysAD-8K and PartsAD-15K. They propose a method that builds a 3D object-centric representation for each instance and detects anomalous instances by cross-examining them against one another.
Q&A
[01] Introduction
1. What is the key difference between the proposed AD problem and traditional AD benchmarks? The key difference is that anomalies in the proposed setting are scene-specific: they are defined relative to the regular instances that make up the majority of each scene, rather than by the fixed high-level variations or low-level shape/texture variations used in traditional AD benchmarks.
2. What are the challenges that the proposed problem presents? The proposed problem presents several challenges, including:
- Understanding the scene in 3D and registering views from multiple camera viewpoints without ground-truth 3D knowledge, while identifying potential occlusions
- Aligning and comparing object instances with one another without access to their relative poses, in both training and evaluation
- Learning representations that generalize to unseen object instances during testing
3. How does the proposed method address these challenges? The proposed method addresses these challenges by:
- Taking multiple views of the same scene as input, projecting them into a 3D voxel grid, and producing a 3D object-centric representation for each instance (see the unprojection sketch after this list)
- Predicting the labels of instances by cross-correlating them through an efficient attention mechanism
- Leveraging recent advances in differentiable rendering and self-supervised learning to supervise the 3D representation learning
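To make the first step concrete, here is a minimal sketch of lifting multi-view 2D features into a shared voxel grid. It is illustrative only, not the authors' implementation: it assumes known camera projection matrices (`proj_mats`) and a simple view-averaged fusion, whereas the paper's fusion module may be more elaborate.

```python
import torch
import torch.nn.functional as F

def unproject_to_voxels(feats_2d, proj_mats, grid_size=32, extent=1.0):
    """Lift per-view 2D features into a shared 3D voxel grid (sketch).

    feats_2d:  (V, C, H, W) feature maps, one per input view.
    proj_mats: (V, 3, 4) camera projection matrices (world -> pixel).
    Returns:   (C, D, D, D) voxel features averaged over the views
               in which each voxel is visible.
    """
    V, C, H, W = feats_2d.shape
    device = feats_2d.device

    # Voxel-center coordinates of a cube with side length `extent`.
    lin = torch.linspace(-extent / 2, extent / 2, grid_size, device=device)
    zs, ys, xs = torch.meshgrid(lin, lin, lin, indexing="ij")
    pts = torch.stack([xs, ys, zs, torch.ones_like(xs)], dim=-1)
    pts = pts.reshape(-1, 4).T                     # (4, N) homogeneous points

    vol = torch.zeros(C, pts.shape[1], device=device)
    hits = torch.zeros(1, pts.shape[1], device=device)
    for v in range(V):
        # Project voxel centers into view v and normalize to pixel coords.
        uvw = proj_mats[v] @ pts                   # (3, N)
        uv = uvw[:2] / uvw[2:].clamp(min=1e-6)     # (2, N) pixel coordinates
        # Map to [-1, 1] for grid_sample; keep only in-frustum voxels.
        grid = torch.stack([uv[0] / (W - 1), uv[1] / (H - 1)], dim=-1) * 2 - 1
        valid = (grid.abs() <= 1).all(dim=-1) & (uvw[2] > 0)
        sampled = F.grid_sample(
            feats_2d[v : v + 1], grid.view(1, 1, -1, 2), align_corners=True
        ).view(C, -1)                              # bilinear lookup per voxel
        vol += sampled * valid
        hits += valid
    vol = vol / hits.clamp(min=1)                  # average over visible views
    return vol.view(C, grid_size, grid_size, grid_size)
```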
[02] Method
2. What are the main components of the proposed architecture? The proposed architecture has three main components (a high-level pipeline sketch follows the list):
- 3D feature fusion module: Encodes each view image and projects it to 3D, forming a fused 3D feature volume.
- Feature distillation block: Enhances the 3D feature volume through differentiable rendering and distillation of features from a 2D self-supervised model (DINOv2).
- Cross-instance matching module: Leverages the established correspondences to compare all similar object regions in the scene using a sparse voxel attention mechanism.
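Read together, the three components form one forward pass. The skeleton below is a hypothetical sketch of that flow; the class name, `crop_subvolume` helper, and interfaces are placeholders introduced for illustration, not the authors' API.

```python
import torch.nn as nn

def crop_subvolume(volume, box):
    """Placeholder: slice the (C, D, D, D) volume down to one instance's region."""
    x0, y0, z0, x1, y1, z1 = box
    return volume[:, z0:z1, y0:y1, x0:x1]

class OddOneOutDetector(nn.Module):
    """Illustrative three-stage pipeline; stage modules are injected as arguments."""

    def __init__(self, fusion, distill, matcher):
        super().__init__()
        self.fusion = fusion    # 3D feature fusion: view images -> voxel volume
        self.distill = distill  # feature enhancement, DINOv2-supervised in training
        self.matcher = matcher  # cross-instance matching via sparse voxel attention

    def forward(self, views, cameras, instance_boxes):
        # 1) Encode each view and unproject into a fused 3D feature volume.
        volume = self.fusion(views, cameras)
        # 2) Enhance the volume (supervised at training time by rendering it
        #    to 2D and distilling frozen DINOv2 features).
        volume = self.distill(volume)
        # 3) Crop an object-centric sub-volume per instance, then cross-examine
        #    the instances to score each one as regular or anomalous.
        object_feats = [crop_subvolume(volume, box) for box in instance_boxes]
        return self.matcher(object_feats)  # one anomaly score per instance
```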
2. How does the feature distillation block improve the 3D representation? The feature distillation block improves the 3D representation in two ways (a distillation-loss sketch follows this list):
- It incorporates open-world knowledge from the pre-trained DINOv2 model, enabling the model to perform better on unseen object instances or novel categories.
- It enforces consistent 3D scene representation, leading to identical features for the same object geometries and enabling the model to infer robust local correspondences.
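A plausible form of the distillation objective, assuming the enhanced volume can be differentiably rendered into per-view 2D feature maps, is to penalize cosine disagreement between the rendered features and frozen DINOv2 features of the same views. The paper's exact loss may differ; this is only a sketch.

```python
import torch.nn.functional as F

def distillation_loss(rendered_feats, dino_feats):
    """Cosine-distance distillation loss (illustrative).

    rendered_feats: (V, C, h, w) feature maps rendered from the 3D volume
                    via differentiable rendering, one per training view.
    dino_feats:     (V, C, h, w) frozen DINOv2 patch features for the same
                    views, projected to the same channel dimension C.
    """
    r = F.normalize(rendered_feats, dim=1)  # unit-norm along channels
    d = F.normalize(dino_feats, dim=1)
    # 1 - cosine similarity, averaged over views and spatial positions.
    return (1 - (r * d).sum(dim=1)).mean()
```

The DINOv2 targets could come from a frozen backbone such as `torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')`, with its patch tokens reshaped into a (V, C, h, w) map and projected to the shared channel width.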
3. How does the cross-instance matching module work? The cross-instance matching module computes sparse voxel attention only among geometrically corresponding voxel locations, unlike vanilla self-attention, which attends over all tokens. This eliminates noisy interactions with irrelevant features and improves fine-grained object matching, as sketched below.
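The contrast with full self-attention can be made concrete. The following is a minimal sketch, not the paper's implementation: each query voxel attends only to its K corresponding voxels in the other instances, with the correspondence indices `corr_idx` assumed to be given (in the paper they come from the established 3D correspondences).

```python
import torch
import torch.nn.functional as F

def sparse_voxel_attention(q, kv, corr_idx):
    """Attend only over geometrically corresponding voxels (illustrative).

    q:        (N, C)  query voxel features of one object instance.
    kv:       (M, C)  voxel features pooled from the other instances.
    corr_idx: (N, K)  for each query voxel, indices of its K corresponding
                      voxels in `kv`.
    Returns:  (N, C)  attended features.
    """
    N, C = q.shape
    k = kv[corr_idx]                            # (N, K, C) gathered keys/values
    attn = torch.einsum("nc,nkc->nk", q, k) / C**0.5
    attn = F.softmax(attn, dim=-1)              # weights over the K matches only
    return torch.einsum("nk,nkc->nc", attn, k)
```

In a complete model, `corr_idx` might be computed by a nearest-neighbor search over the distilled voxel features, e.g. `torch.cdist(q, kv).topk(k, largest=False).indices`; restricting the softmax to these K matches is what removes the noisy interactions mentioned above.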
[03] Experiments
1. What are the key characteristics of the proposed ToysAD-8K and PartsAD-15K datasets? ToysAD-8K includes real-world objects from multiple categories, allowing evaluation of the model's ability to generalize to unseen object categories. PartsAD-15K comprises a diverse collection of mechanical object parts with arbitrary shapes, making it free from class-level inductive biases.
2. How does the proposed method perform compared to the baselines? The proposed method significantly outperforms the baselines, including a reconstruction-based approach and two multi-view 3D object detection methods (ImVoxelNet and DETR3D). This highlights the effectiveness of the dedicated architecture for matching corresponding regions across objects.
3. What are the key findings from the ablation and robustness studies? The ablation studies show that both the DINOv2 feature distillation and the sparse voxel attention mechanism are important to the method's performance. The robustness studies demonstrate that the model performs reasonably well with as few as 3-5 input views and adapts to varying object counts in the scene.