Computer science
Benchmark (surveying)
Image (mathematics)
Modal verb
Artificial intelligence
Set (abstract data type)
Image retrieval
Pattern
Information retrieval
Pattern recognition (psychology)
Social science
Chemistry
Geodesy
Sociology
Polymer chemistry
Programming language
Geography
Authors
Junhao Liu,Min Yang,Chengming Li,Ruifeng Xu
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology
[Institute of Electrical and Electronics Engineers]
Date: 2021-08-01
Volume/Issue: 31 (8): 3242-3253
Cited by: 26
Identifier
DOI:10.1109/tcsvt.2020.3037661
Abstract
Cross-modal image-text retrieval has emerged as a challenging task that requires the multimedia system to bridge the heterogeneity gap between different modalities. In this paper, we take full advantage of image-to-text and text-to-image generation models to improve the performance of the cross-modal image-text retrieval model by incorporating the text-grounded and image-grounded generative features into the cross-modal common space with a “Two-Teacher One-Student” learning framework. In addition, a dual regularizer network is designed to distinguish the mismatched image-text pairs from the matched ones. In this way, we can capture the fine-grained correspondence between modalities and distinguish the best-retrieved result from a candidate set. Extensive experiments on three benchmark datasets (i.e., MIRFLICKR-25K, NUS-WIDE, and MS COCO) show that our model can achieve state-of-the-art cross-modal retrieval results. In particular, our model improves the image-to-text and text-to-image retrieval accuracy by more than 22% over the best competitors on the MS COCO dataset.
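The abstract only sketches the "Two-Teacher One-Student" idea, so the following is a minimal, hypothetical PyTorch sketch of that training pattern: two frozen generative "teachers" supply text-grounded and image-grounded features that a "student" pair of encoders is pulled toward while also learning a cross-modal common space. All names, feature dimensions, the MSE distillation terms, and the random-projection teacher stand-ins are assumptions, and the paper's dual regularizer network is omitted entirely; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Projects modality-specific features into the shared common space."""
    def __init__(self, in_dim, common_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, common_dim),
        )

    def forward(self, x):
        # L2-normalize so dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based bidirectional ranking loss over in-batch negatives."""
    sim = img_emb @ txt_emb.t()         # similarity matrix, matched pairs on the diagonal
    pos = sim.diag().unsqueeze(1)       # similarity of each matched pair
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_t2i = (margin + sim.t() - pos).clamp(min=0).masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()

# Student encoders that learn the cross-modal common space.
img_student = Encoder(in_dim=2048)   # e.g. CNN image features (assumed dim)
txt_student = Encoder(in_dim=300)    # e.g. pooled word embeddings (assumed dim)

# Frozen stand-ins for the two generative teachers. In the paper these would be
# pretrained image-to-text and text-to-image generation models; here they are
# plain linear maps purely so the sketch runs end to end.
i2t_teacher = nn.Linear(2048, 300).requires_grad_(False)
t2i_teacher = nn.Linear(300, 2048).requires_grad_(False)

# Dummy batch of 32 matched image/text feature pairs.
img_feat = torch.randn(32, 2048)
txt_feat = torch.randn(32, 300)

v = img_student(img_feat)            # image embeddings in the common space
t = txt_student(txt_feat)            # text embeddings in the common space

# Teacher signals: embed the caption "grounded" in the image and the image
# "grounded" in the text, with no gradient flowing to the targets.
with torch.no_grad():
    t_grounded = txt_student(i2t_teacher(img_feat))
    v_grounded = img_student(t2i_teacher(txt_feat))

# Retrieval loss plus two distillation terms pulling the student toward the
# grounded generative features (the exact loss form is an assumption).
loss = (ranking_loss(v, t)
        + F.mse_loss(v, v_grounded)
        + F.mse_loss(t, t_grounded))
loss.backward()
```

The point of the structure is that the retrieval objective aligns matched image-text pairs in the common space, while the two distillation terms inject what each generative teacher "knows" about the opposite modality; how those signals are actually fused in the paper is not recoverable from the abstract alone.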