Unified multimodal fusion transformer for few shot object detection for remote sensing images

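Computer science; Computer vision; Fusion; Artificial intelligence; Transformer; Shot (pellet); Object detection; Object (grammar); Remote sensing; Pattern recognition (psychology); Electrical engineering; Geology; Physics; Engineering; Optics; Philosophy; Voltage; Organic chemistry; Chemistry; Linguistics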

Authors

Abdullah Azeem,Zhengzhou Li,Abubakar Siddique,Yuting Zhang,Shangbo Zhou

Source

Journal: Information Fusion [Elsevier BV]
Date: 2024-06-07 Volume/Issue: 111: 102508-102508 Citations: 1

Identifier

DOI: 10.1016/j.inffus.2024.102508

Abstract

Object detection is a fundamental computer vision task with wide applications in remote sensing, but traditional methods rely heavily on large annotated datasets, which are difficult to obtain, especially for novel object classes. Few-shot object detection (FSOD) aims to address this by training detectors to learn from very limited labeled data. Recent work fuses multiple modalities, such as image–text pairs, to tackle data scarcity, but requires an external region proposal network (RPN) to align cross-modal pairs, which leads to a bias towards base classes and insufficient cross-modal contextual learning. To address these problems, we propose a unified multi-modal fusion transformer (UMFT), which extracts visual features from ViT and textual encodings from BERT to align multi-modal representations in an end-to-end manner. Specifically, the affinity-guided fusion module (AFM) captures semantically related image–text pairs by modeling their affinity relationships and selectively combining the most informative pairs. In addition, the cross-modal correlation module (CCM) captures discriminative inter-modal patterns between image and text representations and filters out unrelated features to enhance cross-modal alignment. By leveraging AFM to focus on semantic relationships and CCM to refine inter-modal features, the model better aligns multimodal data without an RPN. These representations are passed to a detection decoder that predicts bounding boxes, class probabilities, and cross-modal attributes. Evaluation of UMFT on the benchmark datasets NWPU VHR-10 and DIOR demonstrated its ability to leverage limited image–text training data via dynamic fusion, achieving new state-of-the-art mean average precision (mAP) for few-shot object detection. Our code will be publicly available at https://github.com/abdullah-azeem/umft.
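
The abstract does not give the AFM equations, so the following is only an illustrative sketch of the general idea of affinity-weighted image–text fusion, not the authors' implementation. It assumes image and text tokens have already been projected to a shared dimension (for example, from ViT and BERT outputs), uses cosine similarity as the affinity, and adds the affinity-weighted text context to each image token. The names `cosine`, `softmax`, `affinity_fuse`, and the `temperature` parameter are hypothetical and chosen for illustration; pure Python is used only to keep the example self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_fuse(image_tokens, text_tokens, temperature=0.1):
    """Fuse each image token with the text tokens it is most similar to.

    Assumption (not from the paper): affinity is cosine similarity,
    and fusion is a residual sum of the image token and the
    affinity-weighted average of the text tokens.
    """
    fused = []
    for img in image_tokens:
        # Affinity of this image token to every text token.
        scores = [cosine(img, txt) / temperature for txt in text_tokens]
        weights = softmax(scores)
        # Weighted text context: high-affinity text tokens dominate.
        context = [
            sum(w * txt[d] for w, txt in zip(weights, text_tokens))
            for d in range(len(img))
        ]
        fused.append([a + b for a, b in zip(img, context)])
    return fused


if __name__ == "__main__":
    image_tokens = [[1.0, 0.0], [0.0, 1.0]]   # toy "patch" embeddings
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]    # toy "word" embeddings
    for row in affinity_fuse(image_tokens, text_tokens):
        print([round(x, 3) for x in row])
```

A lower `temperature` makes the weighting more selective, so each image token draws almost entirely on its best-matching text token; a higher value spreads the weight more evenly across all text tokens.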
