OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion
Abstract
The paper presents OV-DINO, a unified open-vocabulary detection method. Its key contributions are:
- Proposing a Unified Data Integration (UniDI) pipeline to integrate diverse data sources for end-to-end pre-training, eliminating the need for pseudo-label generation.
- Introducing a Language-Aware Selective Fusion (LASF) module to effectively fuse and align the language-aware context with the region-level visual representations.
- Achieving state-of-the-art results on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings.
Q&A
[01] Unified Data Integration (UniDI) and Language-Aware Selective Fusion (LASF)
1. What is the purpose of the Unified Data Integration (UniDI) pipeline proposed in the paper? The UniDI pipeline integrates diverse data sources, including detection, grounding, and image-text data, into a unified detection-centric data format. This allows for end-to-end training and eliminates the need for pseudo-label generation on image-text data (see the UniDI sketch after this list).
2. How does the Language-Aware Selective Fusion (LASF) module improve the performance of the open-vocabulary detection model? The LASF module selectively and dynamically fuses the language-aware context with the region-level visual representations. This helps the model align the text input with the relevant image regions, improving overall open-vocabulary detection performance (see the LASF sketch after this list).
3. What are the key advantages of the UniDI and LASF components compared to previous methods? The UniDI pipeline avoids the complexities of handling diverse data formats and the noise introduced by pseudo-label generation. The LASF module effectively balances the fusion and alignment of modality information, which is crucial for fine-grained vision-language understanding.
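To make the UniDI idea concrete, here is a minimal sketch of mapping the three data types into one detection-style record. All names here (`UnifiedSample`, `from_image_text`, etc.) are illustrative, not OV-DINO's actual API; the key point is that an image-text pair becomes a detection sample whose caption acts as the category and whose box is the whole image, so no pseudo-labels are needed.

```python
# Illustrative sketch of a unified detection-centric data format.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class UnifiedSample:
    image_path: str
    category_names: List[str]  # text prompts acting as "class names"
    boxes: List[Box]
    labels: List[int]          # index into category_names for each box

def from_detection(image_path: str, class_names: List[str],
                   boxes: List[Box], labels: List[int]) -> UnifiedSample:
    # Detection data already fits: class names double as text prompts.
    return UnifiedSample(image_path, class_names, boxes, labels)

def from_grounding(image_path: str, phrases: List[str],
                   boxes: List[Box]) -> UnifiedSample:
    # Grounding data: each annotated phrase is treated as a category
    # paired with its grounded box.
    return UnifiedSample(image_path, phrases, boxes,
                         labels=list(range(len(phrases))))

def from_image_text(image_path: str, caption: str,
                    width: int, height: int) -> UnifiedSample:
    # Image-text data: the caption becomes a single "category" and the
    # whole image serves as its box, avoiding pseudo-label generation.
    return UnifiedSample(image_path, [caption],
                         boxes=[(0.0, 0.0, float(width), float(height))],
                         labels=[0])
```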
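The LASF sketch below shows one plausible reading of "select, then fuse": object queries attend over text tokens to select relevant language context, and a learned gate controls how much of that context is injected back into each query. This is a hedged approximation of the described behavior, not the paper's exact module; all layer choices are assumptions.

```python
# A sketch of language-aware selective fusion (assumed design, not
# OV-DINO's exact implementation).
import torch
import torch.nn as nn

class LanguageAwareSelectiveFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.select = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, text_embeds: torch.Tensor):
        # queries: (B, Nq, D) region-level object queries
        # text_embeds: (B, Nt, D) token-level language embeddings
        # Selection: each query attends over the text tokens to pick
        # the language context relevant to its region.
        context, _ = self.select(queries, text_embeds, text_embeds)
        # Fusion: a sigmoid gate decides, per channel, how much of the
        # selected language context to inject into each query.
        g = self.gate(queries)
        return self.norm(queries + g * context)

# Usage: fused queries would then feed the decoder / alignment head.
lasf = LanguageAwareSelectiveFusion(dim=256)
q = torch.randn(2, 900, 256)  # e.g., 900 DETR-style object queries
t = torch.randn(2, 32, 256)   # encoded text prompt tokens
fused = lasf(q, t)            # (2, 900, 256)
```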
[02] Experimental Results
1. How does the proposed OV-DINO model perform on the COCO and LVIS benchmarks compared to previous state-of-the-art methods? OV-DINO achieves state-of-the-art results on both COCO and LVIS, in zero-shot as well as fine-tuning settings. In zero-shot evaluation, it improves over previous methods by 2.5% AP on COCO and 13.6% AP on LVIS.
2. What are the key factors that contribute to the superior performance of OV-DINO? The unified data integration pipeline and the language-aware selective fusion module are the key factors that contribute to the strong performance of OV-DINO. The UniDI pipeline enables effective integration of diverse data sources, while the LASF module enhances the model's ability to capture precise image details guided by language input.
3. How does OV-DINO compare to other recent open-vocabulary detection methods in terms of model architecture and training approach? OV-DINO differs from previous methods such as GLIP, GLIPv2, and G-DINO, which treat object detection as a grounding task and generate pseudo-labels on image-text data. In contrast, OV-DINO adopts a unified detection-centric framework that integrates the various data sources without pseudo-labeling, yielding more accurate supervision and improved performance.