Computer science
Artificial intelligence
Discriminative model
Modality (human-computer interaction)
Modal verb
Feature (linguistics)
Semantics (computer science)
Feature learning
Representation (politics)
Pattern recognition (psychology)
Similarity (geometry)
Natural language processing
Semantic similarity
Feature extraction
Information retrieval
Machine learning
Image (mathematics)
Linguistics
Programming language
Authors
Lei Liao,Meng Yang,Bob Zhang
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology
[Institute of Electrical and Electronics Engineers]
Date: 2022-09-07
Volume/Issue: 33 (2): 920-934
Cited by: 9
Identifier
DOI: 10.1109/tcsvt.2022.3203247
Abstract
Cross-modal retrieval tasks, which are more natural and challenging than traditional single-modality retrieval tasks, have attracted increasing interest from researchers in recent years. Although different modalities with the same semantics share some underlying relevance, the heterogeneity of their feature spaces still seriously weakens the performance of cross-modal retrieval models. To address this problem, common-space methods, in which multimodal data are projected into a learned common space for similarity measurement, have become the mainstream approach for cross-modal retrieval. However, current methods entangle modality style and semantic content in the common space and fail to fully explore the semantic and discriminative representation/reconstruction of the semantic content, which often results in unsatisfactory retrieval performance. To address these issues, this paper proposes a new Deep Supervised Dual Cycle Adversarial Network (DSDCAN) model based on common-space learning. It is composed of two cross-modal cycle GANs, one for images and one for text. The proposed cycle GAN disentangles semantic-content and modality-style features by requiring that the data of one modality be well reconstructed from its own modality-style feature and the content feature of the other modality. A discriminative semantic and label loss is then proposed that fully considers category, sample contrast, and label supervision to enhance the semantic discrimination of the common-space representation. In addition, to make the data distributions of the two modalities similar, a second-order similarity is introduced as a distance measure on the cross-modal representations in the common space. Extensive experiments were conducted on the Wikipedia, Pascal Sentence, NUS-WIDE-10k, PKU XMedia, MSCOCO, NUS-WIDE, Flickr30k and MIRFlickr datasets. The results demonstrate that the proposed method achieves higher performance than state-of-the-art methods.
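To make the disentanglement idea concrete, below is a minimal PyTorch-style sketch of the cross-reconstruction constraint described in the abstract. The abstract does not specify the encoder/decoder architectures, so every module and name here (CrossModalCycle, img_content, img_style, img_dec, etc.) is a hypothetical stand-in, with single linear layers in place of real networks and the GAN discriminators omitted; the sketch only illustrates that each modality is rebuilt from its own style code plus the other modality's content code.

import torch
import torch.nn as nn

class CrossModalCycle(nn.Module):
    # Hypothetical sketch: single linear layers stand in for the paper's
    # (unspecified) encoders/decoders; adversarial discriminators omitted.
    def __init__(self, img_dim, txt_dim, content_dim, style_dim):
        super().__init__()
        self.img_content = nn.Linear(img_dim, content_dim)   # shared semantics
        self.img_style = nn.Linear(img_dim, style_dim)       # modality-specific style
        self.txt_content = nn.Linear(txt_dim, content_dim)
        self.txt_style = nn.Linear(txt_dim, style_dim)
        self.img_dec = nn.Linear(content_dim + style_dim, img_dim)
        self.txt_dec = nn.Linear(content_dim + style_dim, txt_dim)

    def forward(self, img, txt):
        # Rebuild each modality from ITS OWN style code plus the OTHER
        # modality's content code, forcing style and semantics to disentangle.
        img_hat = self.img_dec(torch.cat([self.txt_content(txt),
                                          self.img_style(img)], dim=1))
        txt_hat = self.txt_dec(torch.cat([self.img_content(img),
                                          self.txt_style(txt)], dim=1))
        return (img_hat - img).pow(2).mean() + (txt_hat - txt).pow(2).mean()

# Example: a batch of 8 paired image/text features (dimensions are made up).
model = CrossModalCycle(img_dim=4096, txt_dim=300, content_dim=256, style_dim=64)
rec_loss = model(torch.randn(8, 4096), torch.randn(8, 300))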
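The abstract likewise does not give the exact form of the second-order similarity. A common reading of "second-order" is that it compares the intra-modal similarity structure (Gram matrices) of the two batches rather than the raw features; the sketch below follows that assumption and is illustrative only, not the paper's definitive formulation.

import torch
import torch.nn.functional as F

def second_order_similarity_loss(img_feat, txt_feat):
    # img_feat, txt_feat: (batch, dim) common-space representations.
    img = F.normalize(img_feat, dim=1)   # unit-length rows for cosine similarity
    txt = F.normalize(txt_feat, dim=1)
    sim_img = img @ img.t()              # intra-modal similarity matrix (batch, batch)
    sim_txt = txt @ txt.t()
    # Penalize the gap between the two similarity (Gram) matrices so both
    # modalities share the same neighborhood structure in the common space.
    return F.mse_loss(sim_img, sim_txt)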