Computer science
Embedding
Artificial intelligence
Modality
Consistency (knowledge bases)
Inference
Matching (statistics)
Pattern recognition (psychology)
Semantics (computer science)
Semantic matching
Natural language processing
Mathematics
Statistics
Chemistry
Polymer chemistry
Programming language
Authors
Yu Liu, Yanming Guo, Li Liu, Erwin M. Bakker, Michael S. Lew
Identifier
DOI:10.1016/j.patcog.2019.05.008
Abstract
In numerous multimedia and multi-modal tasks, from image and video retrieval to zero-shot recognition to multimedia question answering, bridging image and text representations plays an important, and in some cases indispensable, role. To narrow the modality gap between vision and language, prior approaches attempt to discover their correlated semantics in a common feature space. However, these approaches omit the intra-modal semantic consistency when learning the inter-modal correlations. To address this problem, we propose cycle-consistent embeddings in a deep neural network for matching visual and textual representations. Our approach, named CycleMatch, can maintain both inter-modal correlations and intra-modal consistency by cascading dual mappings and reconstructed mappings in a cyclic fashion. Moreover, to achieve robust inference, we propose to employ two late-fusion approaches: average fusion and adaptive fusion. Both can effectively integrate the matching scores of different embedding features without increasing the network complexity or training time. In experiments on cross-modal retrieval, we report comprehensive results verifying the effectiveness of the proposed approach. Our approach achieves state-of-the-art performance on two well-known multi-modal datasets, Flickr30K and MSCOCO.
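The cyclic structure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear maps `W_v2t` and `W_t2v` are hypothetical stand-ins for the paper's deep cross-modal mappings, and the losses only show how the dual mappings (inter-modal correlation) and reconstructed mappings (intra-modal cycle consistency) fit together, along with a simple average late fusion of matching scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear cross-modal mappings; the paper learns deep networks,
# these matrices only illustrate the cycle structure.
d_img, d_txt = 8, 6
W_v2t = rng.normal(size=(d_img, d_txt))  # image -> text space (dual mapping)
W_t2v = rng.normal(size=(d_txt, d_img))  # text -> image space (dual mapping)

def cycle_losses(v, t):
    """Return (inter-modal correlation loss, intra-modal cycle loss)."""
    v2t = v @ W_v2t        # image features mapped into the text space
    t2v = t @ W_t2v        # text features mapped into the image space
    v_rec = v2t @ W_t2v    # reconstructed mapping back to the image space
    t_rec = t2v @ W_v2t    # reconstructed mapping back to the text space
    inter = np.mean((v2t - t) ** 2) + np.mean((t2v - v) ** 2)
    cycle = np.mean((v_rec - v) ** 2) + np.mean((t_rec - t) ** 2)
    return inter, cycle

v = rng.normal(size=(2, d_img))  # toy image features (batch of 2)
t = rng.normal(size=(2, d_txt))  # toy text features (batch of 2)
inter, cycle = cycle_losses(v, t)

# Average late fusion: combine matching scores from two embedding features
# without adding parameters (the paper's adaptive variant weights them instead).
s1 = rng.random((2, 2))  # image-text score matrix from one embedding
s2 = rng.random((2, 2))  # score matrix from another embedding
fused = 0.5 * (s1 + s2)
```

Training would minimize a weighted sum of `inter` and `cycle` so that cross-modal mappings stay correlated while round-trip reconstructions preserve each modality's own semantics.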