Computer science
Modality (human-computer interaction)
Feature (linguistics)
Referent
Artificial intelligence
Expression (computer science)
Mode
Natural language processing
Natural language
Representation (politics)
Similarity (geometry)
Image (mathematics)
Linguistics
Politics
Philosophy
Sociology
Political science
Programming language
Law
Social science
Authors
Qian-Zhong Li, Yujia Zhang, Shiying Sun, Jinting Wu, Xiaoguang Zhao, Min Tan
Identifier
DOI: 10.1016/j.neucom.2021.09.066
Abstract
Referring expression comprehension and segmentation aim to locate and segment a referred instance in an image according to a natural language expression. However, existing methods tend to ignore the interaction between the visual and language modalities during visual feature learning, and establishing a synergy between the two modalities remains a considerable challenge. To tackle these problems, we propose a novel end-to-end framework, the Cross-Modality Synergy Network (CMS-Net), to address the two tasks jointly. In this work, we propose an attention-aware representation learning module to learn modal representations for both images and expressions. Within this module, a language self-attention submodule learns expression representations by leveraging intra-modality relations, and a language-guided channel-spatial attention submodule obtains language-aware visual representations under language guidance, which helps the model attend to referent-relevant regions in the image and reduces background interference. We then design a cross-modality synergy module that establishes inter-modality relations for modality fusion. Specifically, a language-visual similarity is computed at each position of the visual feature map, so that the synergy between the two modalities is achieved in both the semantic and spatial dimensions. Furthermore, we propose a multi-scale feature fusion module with a selective strategy to aggregate the important information from multi-scale features and produce the final results. We conduct extensive experiments on four challenging benchmarks, and our framework achieves significant performance gains over state-of-the-art methods.
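To make the two core ideas in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: it illustrates, under assumed layer shapes and fusion choices, (a) a language-guided channel-spatial attention that reweights a visual feature map with a sentence embedding, and (b) a position-wise language-visual similarity of the kind the cross-modality synergy module computes. All module names, dimensions, and the sigmoid/cosine choices here are illustrative assumptions.

```python
# Illustrative sketch only; the real CMS-Net architecture may differ in every detail.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedAttention(nn.Module):
    """Reweight visual features channel-wise and spatially under language guidance."""

    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        self.channel_gate = nn.Linear(lang_dim, vis_dim)          # language -> per-channel weights
        self.spatial_proj = nn.Conv2d(vis_dim, 1, kernel_size=1)  # per-position attention score

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual feature map; lang: (B, D) sentence embedding
        ch = torch.sigmoid(self.channel_gate(lang)).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        vis = vis * ch                               # channel attention
        sp = torch.sigmoid(self.spatial_proj(vis))   # (B, 1, H, W) spatial attention
        return vis * sp


def language_visual_similarity(vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the language vector and every spatial position."""
    b, c, h, w = vis.shape
    vis_flat = F.normalize(vis.flatten(2), dim=1)        # (B, C, H*W)
    lang_n = F.normalize(lang, dim=1).unsqueeze(1)       # (B, 1, C)
    return torch.bmm(lang_n, vis_flat).view(b, 1, h, w)  # (B, 1, H, W) similarity map


if __name__ == "__main__":
    vis = torch.randn(2, 256, 20, 20)   # assumed backbone feature map
    lang = torch.randn(2, 256)          # assumed sentence embedding
    att = LanguageGuidedAttention(vis_dim=256, lang_dim=256)
    guided = att(vis, lang)
    sim = language_visual_similarity(guided, lang)
    print(guided.shape, sim.shape)      # (2, 256, 20, 20) (2, 1, 20, 20)
```

The similarity map can then be used to emphasize referent-relevant positions before segmentation or box prediction; how CMS-Net fuses it with the visual features and with multi-scale outputs is described in the paper itself.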