Question answering
Computer science
Artificial intelligence
Inference
Semantics (computer science)
Natural language processing
Visual reasoning
Information retrieval
Relation (database)
Programming language
Data mining
Authors
Siyu Zhang, Yeming Chen, Yaoru Sun, Fang Wang, Haibo Shi, Haoran Wang
Identifier
DOI:10.1109/tmm.2023.3347093
Abstract
Visual question answering (VQA) has been intensively studied as a multimodal task, requiring efforts to bridge vision and language for correct answer inference. Recent attempts have developed various attention-based modules for solving VQA tasks. However, the performance of model inference is largely bottlenecked by visual semantic comprehension. Most existing detection methods rely on bounding boxes, which makes it a serious challenge for VQA models to comprehend and correctly infer the causal nexus of contextual object semantics in images. To this end, we propose a finer model framework without bounding boxes, termed Looking Out of Instance Semantics (LOIS), to address this crucial issue. LOIS achieves more fine-grained feature descriptions to generate visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, two types of relation attention modules, 1) intra-modality and 2) inter-modality, are devised to infer the correct answers from different visual features. Specifically, we implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on important word-related questions. Experimental results on four benchmark VQA datasets show that our proposed method achieves favorable performance in improving visual reasoning capability.
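The abstract outlines two relation attention modules: an intra-modality module that relates instance-mask features to one another, and an inter-modality module that links question words to those visual features. Below is a minimal PyTorch sketch of these two attention patterns; it is not the authors' implementation, and all class names, dimensions, and toy shapes (d_model, num_heads, the 36-instance / 14-word inputs) are illustrative assumptions.

# Illustrative sketch only: intra- and inter-modality attention as described in
# the abstract. Module names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn


class IntraModalityRelation(nn.Module):
    """Self-attention over instance/background features within the visual modality."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_instances, d_model), e.g. pooled instance-mask features
        attended, _ = self.attn(feats, feats, feats)
        return self.norm(feats + attended)


class InterModalityRelation(nn.Module):
    """Cross-attention: question tokens query the visual instance features."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, question: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # question: (batch, num_words, d_model); visual: (batch, num_instances, d_model)
        attended, _ = self.attn(question, visual, visual)
        return self.norm(question + attended)


if __name__ == "__main__":
    batch, n_inst, n_words, d = 2, 36, 14, 512
    visual = torch.randn(batch, n_inst, d)     # stand-in for instance-mask features
    question = torch.randn(batch, n_words, d)  # stand-in for encoded question words

    visual = IntraModalityRelation(d)(visual)
    fused = InterModalityRelation(d)(question, visual)
    print(fused.shape)  # torch.Size([2, 14, 512])

In this sketch the intra-modality step refines visual features against each other before the inter-modality step fuses them with the question, mirroring the order implied by the abstract; the fused word-level features would then feed an answer classifier.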