Keywords
Computer science; Optics (focusing); Modality (human–computer interaction); Focus; Artificial intelligence; Pattern; Matching (statistics); Natural language processing; Basis point; Mathematics; Social science; Statistics; Optics; Physics; Sociology
Authors
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, An-An Liu, Bin Wang, Yongdong Zhang
Identifier
DOI:10.1109/tmm.2020.3046855
Abstract
The key point in multimodal learning is to learn semantic alignment, i.e., to find the correspondence between sub-elements of instances from different modalities. The attention mechanism has shown its power in semantic alignment learning, as it densely associates sub-elements across modalities. However, for each sub-element, existing attention aligns it with all sub-elements from the other modality, even though most of them have no correspondence with it, i.e., they are irrelevant sub-elements. If these irrelevant sub-elements are also attended, they distract the semantic alignment. In this paper, we propose a novel focal attention mechanism to learn more accurate semantic alignment. Focal attention sparsely attends to a subset of sub-elements, identified as relevant according to their posterior probabilities given each sub-element from the other modality. Based on the observation that relevant sub-elements mostly describe the same semantics, the posterior probability can precisely distinguish relevant from irrelevant ones by taking interactions within the same modality into account, so that relevant sub-elements receive higher and closer posterior probabilities while irrelevant ones receive lower probabilities. This design learns better semantic alignment by preventing interference from irrelevant sub-elements, and it benefits subsequent multimodal tasks that demand semantic alignment. To validate the effectiveness of focal attention, we conduct extensive experiments on image-text matching and text-to-image generation, for which we propose a bidirectional and a stacked version of focal attention, respectively. Experimental results on benchmarks show that focal attention significantly and consistently outperforms state-of-the-art methods.
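The sparse-selection idea in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the relevance rule used here (keeping only keys whose posterior probability exceeds the uniform baseline 1/n) is an assumed stand-in for the paper's actual criterion, and all names are hypothetical.

```python
import numpy as np

def focal_attention(queries, keys, values):
    """Sparse ("focal") attention sketch: each query attends only to the
    subset of keys judged relevant, instead of all keys as in dense attention."""
    # Cross-modal similarity between each query (modality A) and key (modality B).
    sim = queries @ keys.T                                  # (n_q, n_k)
    # Posterior probability of each key given the query (softmax over keys).
    post = np.exp(sim - sim.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # Relevance mask: keep keys more probable than the uniform prior 1/n_k
    # (assumed selection rule for illustration only).
    mask = post > (1.0 / keys.shape[0])
    # Zero out irrelevant keys and renormalize the remaining weights.
    sparse = post * mask
    sparse /= np.maximum(sparse.sum(axis=1, keepdims=True), 1e-12)
    # Aggregate values with the sparse weights: one context vector per query.
    return sparse @ values                                  # (n_q, d)
```

Compared with dense attention, the masking step is what keeps irrelevant sub-elements from diluting the attended context.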