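Computer science
Modal
Semantics (computer science)
Consistency (knowledge bases)
Artificial intelligence
Embedding
Natural language processing
Focus (optics)
Representation (politics)
Semantic gap
Information retrieval
Pattern recognition (psychology)
Image (mathematics)
Image retrieval
Physics
Chemistry
Polymer chemistry
Optics
Programming language
Law
Politics
Political science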
Authors
Zejun Liu, Fanglin Chen, Jun Xu, Wenjie Pei, Guangming Lu
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology
[Institute of Electrical and Electronics Engineers]
Date: 2022-11-07
Volume (issue): 33 (5): 2465-2476
Citations: 11
Identifier
DOI: 10.1109/tcsvt.2022.3220297
Abstract
Cross-modal image-text retrieval is an important vision-and-language task that models the similarity of image-text pairs by embedding their features into a shared space for alignment. To bridge the heterogeneous gap between the two modalities, current approaches achieve inter-modal alignment and intra-modal semantic relationship modeling through complex weighted combinations of items. In both the intra-modal association and inter-modal interaction processes, higher-weight items contribute more to the global semantics. However, the same item always produces different contributions in the two processes, since most traditional approaches focus only on alignment. This usually results in semantic changes and misalignment. To address this issue, this paper proposes Cross-modal Semantic Importance Consistency (CSIC), which keeps the semantics of items invariant during alignment. The proposed technique measures the semantic importance of items obtained from intra-modal and inter-modal self-attention, and learns a more reasonable representation vector by inter-calibrating the importance distributions to improve performance. We conducted extensive experiments on the Flickr30K and MS COCO datasets. The results show that our approach can significantly improve retrieval performance, demonstrating the superiority and rationality of the proposed approach.
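The abstract does not give the CSIC formulation, so the following is only a minimal illustrative sketch of the two ideas it names: scoring an image-text pair by similarity in a shared embedding space, and penalizing disagreement between the importance distribution an item set receives from intra-modal attention and the one it receives from inter-modal attention. The function names, the use of softmax over raw attention scores, and the symmetric KL divergence as the consistency measure are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(scores):
    """Turn raw attention scores into an importance distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def importance_consistency(intra_scores, inter_scores):
    """Hypothetical consistency penalty (an assumption, not the paper's loss).

    Compares the importance each item gets from intra-modal attention with
    the importance it gets from inter-modal attention, using a symmetric KL
    divergence. The value is zero when the two distributions agree.
    """
    p = softmax(intra_scores)
    q = softmax(inter_scores)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def cosine_similarity(u, v):
    """Similarity of two embeddings that live in the same shared space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In a training setup along these lines, a penalty such as `importance_consistency` would be added to the usual retrieval loss so that an item weighted heavily within its own modality is also weighted heavily during cross-modal interaction. How the paper actually calibrates the two distributions is not stated in this record.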