Computer science
Focus (optics)
Semantics (computer science)
Attention network
Image (mathematics)
Relation (database)
Modality (human-computer interaction)
Artificial intelligence
Contrast (vision)
Key (lock)
Preference
Natural language processing
Data mining
Mathematics
Physics
Optics
Programming language
Statistics
Computer security
Authors
Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, Yongdong Zhang
Identifier
DOI: 10.1145/3343031.3350869
Abstract
Learning semantic correspondence between image and text is significant because it bridges the semantic gap between vision and language. The key challenge is to accurately find and correlate the shared semantics in image and text. Most existing methods achieve this goal by representing the shared semantic as a weighted combination of all fragments (image regions or text words), where fragments relevant to the shared semantic receive more attention and irrelevant ones receive less. However, although relevant fragments contribute more to the shared semantic, irrelevant ones still disturb it to some degree, leading to semantic misalignment in the correlation phase. To address this issue, we present a novel Bidirectional Focal Attention Network (BFAN), which not only attends to relevant fragments but also diverts all attention to those fragments in order to concentrate on them. The main difference from existing works is that they mostly focus on learning attention weights, whereas our BFAN focuses on eliminating irrelevant fragments from the shared semantic. Focal attention is achieved by preassigning attention based on the inter-modality relation, identifying relevant fragments based on the intra-modality relation, and reassigning attention. Furthermore, focal attention is applied jointly in both the image-to-text and text-to-image directions, which avoids a preference for long text or complex images. Experiments show that our simple but effective framework significantly outperforms the state of the art, with relative Recall@1 gains of 2.2% on both the Flickr30K and MSCOCO benchmarks.
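The three-step procedure the abstract describes (preassign attention across modalities, identify relevant fragments within a modality, reassign attention to them) can be illustrated with a short sketch. The following is a minimal PyTorch illustration, not the authors' released code: the function name, the temperature value, and the mean-threshold relevance rule are assumptions made for exposition; the paper determines relevance from intra-modality comparisons, of which an equal-allocation threshold is one simple instantiation.

```python
import torch
import torch.nn.functional as F

def focal_attention(query, context, temperature=10.0):
    """One direction of focal attention (illustrative sketch).

    query:   (d,)   embedding of one fragment (e.g., a word)
    context: (n, d) embeddings of fragments in the other
             modality (e.g., image regions)
    """
    # 1) Preassign attention from the inter-modality relation:
    #    cosine similarity between the query and every context
    #    fragment, turned into a temperature-scaled softmax.
    sims = F.cosine_similarity(query.unsqueeze(0), context, dim=1)  # (n,)
    pre_attn = F.softmax(temperature * sims, dim=0)                 # (n,)

    # 2) Identify relevant fragments from the intra-modality relation:
    #    here (an assumed rule), a fragment is kept if its preassigned
    #    weight is at least the average weight over all fragments.
    relevant = (pre_attn >= pre_attn.mean()).float()

    # 3) Reassign attention: zero out irrelevant fragments and
    #    renormalize, so all attention concentrates on relevant ones.
    focal = pre_attn * relevant
    focal = focal / focal.sum().clamp(min=1e-8)

    # Weighted combination of context fragments: the shared
    # semantic aligned with this query fragment.
    return focal @ context

# Example: one 256-d word embedding attending over 5 region embeddings.
word = torch.randn(256)
regions = torch.randn(5, 256)
shared = focal_attention(word, regions)  # shape (256,)
```

Bidirectionality would follow by running the same routine with the roles of the two modalities swapped (each word attending over the regions and each region attending over the words) and combining the two matching scores, which is what keeps the model from favoring long text or complex images.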