Computer science
Artificial intelligence
Modality (human–computer interaction)
Feature (linguistics)
Object (grammar)
Object detection
Image (mathematics)
Sentence
Process (computing)
Pattern recognition (psychology)
Image fusion
Computer vision
Machine learning
Natural language processing
Linguistics
Philosophy
Operating system
Authors
Shuyu Miao, Hexiang Zheng, Lin Zheng, Jin Hong
Identifier
DOI: 10.1007/978-3-031-44195-0_11
Abstract
A baby can successfully learn to identify objects in an image with the corresponding text description provided by their parents. This learning process leverages the multimodal information of both the image and text. However, classical object detection approaches only utilize the image modality to distinguish objects, neglecting the text modality. While many Vision-Language Models have been explored for the object detection task, they often require large amounts of pre-training data and are tied to a particular model structure. In this paper, we propose a lightweight and generic Local and Global Feature Fusion (LGF$^2$) framework for text-guided object detection to enhance the performance of image-modality detection models. Our adaptive text-image fusion module is designed to learn optimal fusion rules between image and text features. Additionally, we introduce a word-level contrastive loss to guide the local-focused fusion of multimodal features and a sentence-level alignment loss to drive the global consistent fusion of multimodal features. Our paradigm is highly adaptable and can be easily embedded into existing image-based object detection models without any extra modification. We conduct extensive experiments on two multimodal detection datasets, and the results demonstrate that our LGF$^2$ significantly improves their performance.
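To make the abstract's three ingredients concrete, below is a minimal PyTorch-style sketch of how an adaptive text-image fusion module, a word-level contrastive loss, and a sentence-level alignment loss could fit together. This is an illustrative assumption, not the paper's actual implementation: the gating design, feature shapes, and the choice of the best-matching region as the positive for each word are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveTextImageFusion(nn.Module):
    """Hypothetical fusion module: learns a gate that injects text context
    into image features via a gated residual (an assumed design, not the
    paper's actual architecture)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, N, D) region/patch features; txt_feat: (B, M, D) word features
        txt_ctx = txt_feat.mean(dim=1, keepdim=True).expand(-1, img_feat.size(1), -1)
        g = self.gate(torch.cat([img_feat, txt_ctx], dim=-1))  # learned fusion weights
        return img_feat + g * self.proj(txt_ctx)               # gated residual fusion


def word_level_contrastive_loss(img_feat, txt_feat, tau: float = 0.07):
    """Hypothetical word-level loss: each word is contrasted against image
    regions, treating its most similar region as the positive."""
    img = F.normalize(img_feat, dim=-1)   # (B, N, D)
    txt = F.normalize(txt_feat, dim=-1)   # (B, M, D)
    sim = torch.einsum('bmd,bnd->bmn', txt, img) / tau
    target = sim.argmax(dim=-1)           # assumed positive assignment per word
    return F.cross_entropy(sim.flatten(0, 1), target.flatten())


def sentence_level_alignment_loss(img_feat, txt_feat):
    """Hypothetical sentence-level loss: pulls the pooled image embedding
    toward the pooled sentence embedding for global consistency."""
    img = F.normalize(img_feat.mean(dim=1), dim=-1)
    txt = F.normalize(txt_feat.mean(dim=1), dim=-1)
    return (1 - (img * txt).sum(dim=-1)).mean()


if __name__ == "__main__":
    # Toy shapes only; a real detector would supply backbone and text-encoder features.
    img_feat = torch.randn(2, 49, 256)   # B=2 images, N=49 regions, D=256
    txt_feat = torch.randn(2, 12, 256)   # B=2 captions, M=12 words, D=256
    fusion = AdaptiveTextImageFusion(256)
    fused = fusion(img_feat, txt_feat)
    loss = word_level_contrastive_loss(img_feat, txt_feat) \
         + sentence_level_alignment_loss(img_feat, txt_feat)
    print(fused.shape, loss.item())
```

Under this reading, the fused features would feed an unmodified detection head, which is consistent with the abstract's claim that the framework embeds into existing image-based detectors without extra modification.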