Keywords
Computer science
Modality
Embedding
Similarity (geometry)
Semantic similarity
Semantics (computer science)
Semantic space
Transformer
Image (mathematics)
Artificial intelligence
Matching (statistics)
Natural language processing
Information retrieval
Pattern recognition (psychology)
Statistics
Programming language
Authors
Tao Yao,Shouyong Peng,Yujuan Sun,Guorui Sheng,Haiyan Fu,Xiangwei Kong
Identifier
DOI:10.1016/j.engappai.2024.108005
Abstract
Image-text matching, which aims to precisely measure the visual-semantic similarity between images and texts, is a fundamental research topic in the multimedia analysis domain. Current methods have achieved impressive performance by taking advantage of the Transformer architecture. However, most of them consider only inter-modal relationships when mining image-text semantic correspondences, which makes it hard for them to measure similarity accurately when facing similar images and texts, owing to cross-modal semantic interference. In this work, to tackle this issue, we propose a Cross-Modal Semantic Interference Suppression (CMSIS) method, which incorporates intra-modal fine-grained semantics and unmatched segments to suppress the semantic influence of similar heterogeneous data points. The intra-modal fine-grained semantics are used to push similar images or texts apart in the learned latent embedding space for better matching results. To further suppress cross-modal semantic interference among similar data points, unmatched segments, which provide explicit clues for distinguishing similar images or texts, are also exploited. Experimental results on two popular multimodal datasets demonstrate that the proposed CMSIS outperforms a range of baselines.
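The mechanism the abstract describes, aligning matched image-text pairs while pushing similar same-modality samples apart, can be illustrated with a short sketch. The following is a minimal PyTorch sketch, not the authors' CMSIS implementation: it combines a standard bidirectional hinge-based matching loss with a hypothetical intra-modal repulsion term, and the function name, margin, and intra_weight values are illustrative assumptions.

    # Minimal sketch of the ideas in the abstract; NOT the authors' CMSIS code.
    # The intra-modal term and all hyperparameters are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def matching_loss(img_emb, txt_emb, margin=0.2, intra_weight=0.1):
        """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs."""
        img = F.normalize(img_emb, dim=1)
        txt = F.normalize(txt_emb, dim=1)
        sim = img @ txt.t()                  # (batch, batch) cosine similarities
        pos = sim.diag().view(-1, 1)         # similarities of matched pairs
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

        # Inter-modal hinge loss: each matched pair should beat every
        # mismatched pair in its row/column by at least the margin.
        cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
        cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
        inter = cost_i2t.sum() + cost_t2i.sum()

        # Hypothetical intra-modal repulsion: penalize high similarity between
        # distinct images (or distinct texts) so near-duplicate samples are
        # pushed apart in the latent space, reducing the cross-modal
        # interference caused by look-alike data points.
        sim_ii = (img @ img.t()).masked_fill(mask, 0)
        sim_tt = (txt @ txt.t()).masked_fill(mask, 0)
        intra = sim_ii.clamp(min=0).sum() + sim_tt.clamp(min=0).sum()

        return inter + intra_weight * intra

    # Usage with random embeddings standing in for Transformer features:
    imgs, txts = torch.randn(8, 256), torch.randn(8, 256)
    print(matching_loss(imgs, txts).item())

The intra-modal term here is only one plausible way to realize "pushing similar images or texts apart"; the paper's actual formulation, and its use of unmatched segments as explicit distinguishing clues, is specified in the full text.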