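Computer science
Artificial intelligence
Object (grammar)
Object detection
Set (abstract data type)
Detector
Benchmark (surveying)
Open set
Natural language processing
Generalization
Programming language
Pattern recognition (psychology)
Mathematics
Geodesy
Telecommunications
Discrete mathematics
Mathematical analysis
Geography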
Authors
Shilong Liu,Zhaoyang Zeng,Tianhe Ren,Feng Li,Hao Zhang,Jie Yang,Yang Yang,Jianwei Yang,Hang Su,Jun Zhu,Lei Zhang
Source
Journal: Cornell University - arXiv
Date: 2023-01-01
Citations: 220
Identifier
DOI: 10.48550/arxiv.2303.05499
Abstract
In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
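The abstract names a language-guided query selection step but does not spell out the rule. The following is a minimal sketch under one plausible reading, assumed here rather than taken from the abstract: score each image token by its highest dot-product similarity to any text token, then keep the top-scoring tokens as decoder queries. The function name `select_queries`, its parameters, and the toy feature vectors are all hypothetical, and plain Python lists stand in for the tensors a real detector would use.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_queries(
    image_feats: List[Sequence[float]],
    text_feats: List[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to the text.

    Assumption (not stated in the abstract): an image token's score is
    its maximum similarity over all text tokens.
    """
    scores = [
        max(dot(img, txt) for txt in text_feats)
        for img in image_feats
    ]
    # Rank image-token indices by descending score; ties keep input order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: four image tokens, two text tokens, keep two queries.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.2, 0.1]]
print(select_queries(image_feats, text_feats, num_queries=2))  # [0, 2]
```

In this toy input, the first and third image tokens align best with the text tokens, so their indices are the ones selected. The feature enhancer and cross-modality decoder mentioned in the abstract are not modeled here.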