Computer science
Modal verb
Image (mathematics)
Interference (communication)
Artificial intelligence
Matching (statistics)
Natural language processing
Speech recognition
Computer vision
Telecommunications
Statistics
Channel (broadcasting)
Chemistry
Mathematics
Polymer chemistry
Authors
Tao Yao, Shouyong Peng, Yujuan Sun, Guorui Sheng, Haiyan Fu, Xiangwei Kong
Identifier
DOI:10.1016/j.engappai.2024.108005
Abstract
Image-text matching, which aims at precisely measuring the visual-semantic similarity between images and texts, is a fundamental research topic in the multimedia analysis domain. Current methods have achieved impressive performance by taking advantage of the Transformer architecture. However, most of them consider only inter-modal relationships when mining image-text semantic correspondences, which makes it hard for them to accurately measure similarity between similar images and texts due to cross-modal semantic interference. In this work, to tackle this issue, we propose a Cross-Modal Semantic Interference Suppression (CMSIS) method, which incorporates intra-modal fine-grained semantics and unmatched segments to suppress the semantic influence of similar heterogeneous data points. The intra-modal fine-grained semantics are utilized to push similar images or texts apart in the learned latent embedding space for better matching results. To further suppress cross-modal semantic interference among similar data points, unmatched segments, which provide explicit clues for distinguishing similar images or texts, are also adopted. Experimental results on two popular multimodal datasets demonstrate that the proposed CMSIS outperforms a range of baselines.
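To make the abstract's idea concrete, the following is a minimal NumPy sketch (not the authors' implementation) of a standard hinge-based image-text matching loss augmented with an intra-modal repulsion term, which penalizes high similarity between different same-modality embeddings so that look-alike images (or texts) are pushed apart in the latent space. All function names and the `intra_weight` parameter are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Row-normalize embeddings so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def hinge_matching_loss(img_emb, txt_emb, margin=0.2, intra_weight=0.1):
    """Cross-modal hinge loss plus an intra-modal repulsion term (a sketch).

    img_emb, txt_emb: (n, d) arrays; row i of each forms a matched pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    sim = img @ txt.T                    # (n, n) cross-modal cosine similarities
    pos = np.diag(sim)                   # similarities of matched pairs
    n = sim.shape[0]
    mask = 1.0 - np.eye(n)               # zero out the matched-pair diagonal
    # Cross-modal ranking: each negative should score at least `margin`
    # below the corresponding positive, in both retrieval directions.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None]) * mask
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :]) * mask
    cross = (cost_i2t.sum() + cost_t2i.sum()) / n
    # Intra-modal repulsion: penalize similarity between *different* images
    # (and different texts), pushing similar same-modality items apart.
    intra = ((img @ img.T) * mask).sum() + ((txt @ txt.T) * mask).sum()
    return cross + intra_weight * intra / n
```

With mutually orthogonal matched pairs the loss is zero; batches containing near-duplicate items incur both the ranking and the repulsion penalty, which is the interference-suppression intuition described above.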