Computer science
Computer vision
Fusion
Artificial intelligence
Transformer
One-shot
Shot (projectile)
Object detection
Object (grammar)
Remote sensing
Pattern recognition (psychology)
Electrical engineering
Geology
Physics
Engineering
Philosophy
Linguistics
Chemistry
Optics
Organic chemistry
Voltage
Authors
Abdullah Azeem,Zhengzhou Li,Abubakar Siddique,Yuting Zhang,Shangbo Zhou
Identifier
DOI:10.1016/j.inffus.2024.102508
Abstract
Object detection is a fundamental computer vision task with wide applications in remote sensing, but traditional methods rely heavily on large annotated datasets that are difficult to obtain, especially for novel object classes. Few-shot object detection (FSOD) aims to address this by training detectors to learn from very limited labeled data. Recent work fuses multiple modalities, such as image–text pairs, to tackle data scarcity, but requires an external region proposal network (RPN) to align cross-modal pairs, which leads to a bias towards base classes and insufficient cross-modal contextual learning. To address these problems, we propose a unified multi-modal fusion transformer (UMFT), which extracts visual features with ViT and textual encodings with BERT and aligns the multi-modal representations in an end-to-end manner. Specifically, an affinity-guided fusion module (AFM) captures semantically related image–text pairs by modeling their affinity relationships and selectively combines the most informative pairs. In addition, a cross-modal correlation module (CCM) captures discriminative inter-modal patterns between image and text representations and filters out unrelated features to enhance cross-modal alignment. By leveraging AFM to focus on semantic relationships and CCM to refine inter-modal features, the model better aligns multi-modal data without an RPN. These representations are passed to a detection decoder that predicts bounding boxes, class probabilities, and cross-modal attributes. Evaluation of UMFT on the benchmark datasets NWPU VHR-10 and DIOR demonstrates its ability to leverage limited image–text training data via dynamic fusion, achieving new state-of-the-art mean average precision (mAP) for few-shot object detection. Our code will be publicly available at https://github.com/abdullah-azeem/umft.
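The abstract only describes the architecture at a high level (ViT and BERT features fused by an affinity-guided module and a cross-modal correlation module, then decoded without an RPN). The sketch below is a minimal, hypothetical PyTorch illustration of that described flow; all class and parameter names (AffinityGuidedFusion, CrossModalCorrelation, UMFTSketch, dim, num_queries, etc.) are assumptions for illustration and are not taken from the authors' released code at the URL above.

```python
# Minimal sketch of the fusion pipeline described in the abstract.
# Assumption: PyTorch-style modules; every name here is hypothetical and does
# NOT come from the authors' repository -- it only illustrates the idea.
import torch
import torch.nn as nn


class AffinityGuidedFusion(nn.Module):
    """Weights image-text token pairs by their affinity (similarity) so the
    most semantically related pairs dominate the fused representation."""
    def __init__(self, dim):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, Ni, D) from a ViT; txt_tokens: (B, Nt, D) from BERT
        q = self.img_proj(img_tokens)
        k = self.txt_proj(txt_tokens)
        affinity = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        # Each image token pools the text tokens it is most affine to.
        return img_tokens + affinity @ txt_tokens


class CrossModalCorrelation(nn.Module):
    """Gates fused features by image-text correlation, suppressing channels
    that do not correlate across the two modalities."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, fused, txt_tokens):
        txt_global = txt_tokens.mean(dim=1, keepdim=True).expand_as(fused)
        gate = self.gate(torch.cat([fused, txt_global], dim=-1))
        return fused * gate  # filter out weakly correlated features


class UMFTSketch(nn.Module):
    """End-to-end fusion plus a DETR-style detection decoder (no RPN)."""
    def __init__(self, dim=256, num_queries=100, num_classes=20):
        super().__init__()
        self.afm = AffinityGuidedFusion(dim)
        self.ccm = CrossModalCorrelation(dim)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)                  # (cx, cy, w, h)

    def forward(self, img_tokens, txt_tokens):
        fused = self.ccm(self.afm(img_tokens, txt_tokens), txt_tokens)
        queries = self.queries.unsqueeze(0).expand(fused.shape[0], -1, -1)
        hs = self.decoder(queries, fused)
        return self.class_head(hs), self.box_head(hs).sigmoid()


if __name__ == "__main__":
    img = torch.randn(2, 196, 256)   # e.g. ViT patch tokens projected to 256-d
    txt = torch.randn(2, 32, 256)    # e.g. BERT token embeddings projected to 256-d
    logits, boxes = UMFTSketch()(img, txt)
    print(logits.shape, boxes.shape)  # (2, 100, 21) (2, 100, 4)
```

The query-based decoder stands in for the abstract's claim that alignment is done end-to-end without an external RPN; how the paper actually implements the AFM affinity weighting, the CCM filtering, and the cross-modal attribute head is only knowable from the full text and the released code.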