
Grounding DINO 1.5: Advance the “Edge” of Open-Set Object Detection

🌈 Abstract

This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for the faster inference speed that many edge-deployment applications require.

🙋 Q&A

[01] Model Architecture

1. What are the key differences between the Grounding DINO 1.5 Pro and Grounding DINO 1.5 Edge models?

  • Grounding DINO 1.5 Pro significantly expands the model capacity and dataset size to create a more potent and versatile open-set object detection model. It incorporates the pre-trained ViT-L architecture and a 20 million image dataset to enhance the model's semantic comprehension.
  • Grounding DINO 1.5 Edge is tailored for edge devices, focusing on computational efficiency without compromising detection quality. It uses an efficient feature enhancer that leverages only high-level image features, removing the need for multi-scale features.

2. How do the early fusion and late fusion design strategies differ in the Grounding DINO 1.5 models?

  • Early fusion models, like Grounding DINO 1.5 Pro, tend to yield higher detection recall and better bounding box precision, but can also lead to increased model hallucinations.
  • Late fusion models, which integrate language and image modalities only in the loss calculation phase, generally demonstrate more robustness against hallucinations but may lead to lower detection recall.
  • To balance the advantages and drawbacks, Grounding DINO 1.5 Pro retains the early fusion design while introducing a more comprehensive training sampling strategy to increase the proportion of negative samples.
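
The idea of raising the proportion of negative samples can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' sampling strategy: the helper `build_prompt`, its `neg_ratio` parameter, the example vocabulary, and the period-separated prompt format are all hypothetical choices made for the example. The prompt mixes categories that are annotated in an image with sampled categories that are absent, so that an early fusion model is also trained on text it should not match to any box.

```python
import random


def build_prompt(positives, vocabulary, neg_ratio=1.0, seed=None):
    """Mix categories present in an image with sampled absent ones.

    Hypothetical helper: neg_ratio is the number of negatives per positive.
    Returns the text prompt and the list of sampled negative categories.
    """
    rng = random.Random(seed)
    # Candidate negatives are vocabulary entries not annotated in the image.
    candidates = sorted(set(vocabulary) - set(positives))
    n_neg = min(len(candidates), round(len(positives) * neg_ratio))
    negatives = rng.sample(candidates, n_neg)
    names = list(positives) + negatives
    # Shuffle so that position in the prompt gives no hint about the label.
    rng.shuffle(names)
    # Assumed prompt format: category names separated by " . ".
    return " . ".join(names) + " .", negatives


prompt, negatives = build_prompt(
    positives=["cat", "dog"],
    vocabulary=["cat", "dog", "car", "tree", "bus"],
    neg_ratio=1.0,
    seed=0,
)
print(prompt)
print("negatives:", negatives)
```

With `neg_ratio=1.0`, half of the category names in the prompt have no matching box, which is the kind of signal that discourages hallucinated detections.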

[02] Training Dataset

1. What are the key characteristics of the Grounding-20M dataset used to train the Grounding DINO 1.5 models?

  • Grounding-20M is a high-quality grounding dataset collected from publicly available sources, containing over 20 million images with grounding annotations.
  • The dataset is designed to be rich in categories and encompass a wide range of detection scenarios, in order to train a robust open-set detector.
  • The authors developed a series of annotation pipelines and post-processing rules to ensure the high quality of the grounding annotations.

[03] Model Evaluation

1. What are the key performance improvements of the Grounding DINO 1.5 Pro model compared to previous models?

  • On the COCO zero-shot transfer benchmark, Grounding DINO 1.5 Pro achieves 54.3 AP, improving upon Grounding DINO Swin-L by 1.8 AP.
  • On the LVIS-minival and LVIS-val zero-shot transfer benchmarks, Grounding DINO 1.5 Pro achieves 55.7 AP and 47.6 AP, outperforming the previous best model, DetCLIPv3, by 6.9 AP and 6.2 AP respectively.
  • Compared to the Grounding DINO Swin-T model, Grounding DINO 1.5 Pro demonstrates a remarkable improvement of 28.3 AP (55.7 AP vs. 27.4 AP) on the LVIS-minival zero-shot transfer benchmark.
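
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this summary. The baseline scores it prints are implied by subtracting each stated gain from the reported AP; they are not separately quoted here. The names `reported` and `implied_baseline` are illustrative.

```python
# Reported zero-shot AP for Grounding DINO 1.5 Pro, the stated gain,
# and the model the gain is measured against (all taken from the summary).
reported = {
    "COCO": (54.3, 1.8, "Grounding DINO Swin-L"),
    "LVIS-minival": (55.7, 6.9, "DetCLIPv3"),
    "LVIS-val": (47.6, 6.2, "DetCLIPv3"),
}


def implied_baseline(ap, gain):
    """Baseline AP implied by a reported AP and a stated improvement."""
    return round(ap - gain, 1)


baselines = {
    bench: implied_baseline(ap, gain) for bench, (ap, gain, _) in reported.items()
}
for bench, (ap, gain, model) in reported.items():
    print(f"{bench}: {ap} AP, implied {model} baseline {baselines[bench]} AP")

# Gain over Grounding DINO Swin-T on LVIS-minival, from the two quoted scores.
swin_t_gain = round(55.7 - 27.4, 1)
print(f"LVIS-minival gain over Swin-T: {swin_t_gain} AP")
```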

2. How does the Grounding DINO 1.5 Edge model perform in terms of speed and accuracy?

  • When optimized with TensorRT, the Grounding DINO 1.5 Edge model reaches a speed of 75.2 FPS.
  • The Grounding DINO 1.5 Edge model achieves a zero-shot performance of 36.2 AP on the LVIS-minival benchmark. Combined with its TensorRT speed, this accuracy makes it well suited to edge computing scenarios.
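
To relate a throughput figure to a per-frame time budget, frames per second can be converted to average latency. The snippet below is a simple unit conversion, assuming frames are processed one at a time so that latency is the reciprocal of throughput. The function name `fps_to_latency_ms` is illustrative.

```python
def fps_to_latency_ms(fps):
    """Average time per frame in milliseconds for a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# 75.2 FPS is the TensorRT speed reported for Grounding DINO 1.5 Edge.
latency = fps_to_latency_ms(75.2)
print(f"{latency:.1f} ms per frame")  # about 13.3 ms
```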

[04] Case Analysis and Qualitative Visualization

1. What are the key capabilities demonstrated by the Grounding DINO 1.5 Pro model in the case analysis?

  • The model exhibits robust performance in detecting objects in challenging scenarios, such as monochromatic images, blurry objects, and small/partially occluded objects.
  • The model demonstrates versatility in handling objects of varying sizes and shapes, from petite to sprawling, accurately localizing and identifying each one.
  • The model showcases exceptional capability in detecting long-tailed objects, which are less frequently encountered categories that pose unique challenges.
  • The model exhibits strong performance in short and long caption grounding, accurately mapping textual descriptions to corresponding visual elements.
  • The model demonstrates the ability to detect objects in dense scenes, where multiple objects are closely positioned or overlapping.

2. How does the Grounding DINO 1.5 Edge model perform in real-world edge computing environments?

  • The Grounding DINO 1.5 Edge model, when deployed on the NVIDIA Orin NX platform, achieves an inference speed of over 10 FPS at an input size of 640 × 640, demonstrating its practical utility for real-time object detection in edge computing scenarios.
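
A throughput figure like the one above is typically obtained by timing repeated inference calls after a warm-up phase. The sketch below shows one common way to do this. It is not the authors' benchmarking procedure: `measure_fps` and `dummy_infer` are hypothetical, and `dummy_infer` is only a placeholder for a real model call on a 640 × 640 input.

```python
import time


def measure_fps(infer, warmup=5, iters=50):
    """Time repeated calls to `infer` and return average frames per second."""
    # Warm-up runs are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(iters):
        infer()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return iters / elapsed


def dummy_infer():
    """Placeholder workload standing in for one forward pass."""
    return sum(i * i for i in range(10_000))


fps = measure_fps(dummy_infer, warmup=2, iters=20)
print(f"{fps:.1f} FPS, meets 10 FPS target: {fps >= 10}")
```

When the model runs on a GPU, kernel launches are asynchronous, so the device should be synchronized before each clock reading. Otherwise the loop measures launch time rather than inference time.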