Computer science
Embedding
Artificial intelligence
Modality (human–computer interaction)
Invariant (physics)
Generative grammar
Pattern recognition (psychology)
Machine learning
Mathematics
Mathematical physics
Authors
Jialin Tian, Kai Wang, Xing Xu, Zhiwei Cao, Fumin Shen, Heng Tao Shen
Identifier
DOI: 10.1145/3477495.3532028
Abstract
Zero-Shot Cross-Modal Retrieval (ZS-CMR) has recently drawn increasing attention as it targets a practical retrieval scenario: the multimodal test set consists of unseen classes that are disjoint from the seen classes in the training set. Recently proposed methods typically adopt a generative model as the main framework to learn a joint latent embedding space that alleviates the modality gap. These methods, however, rely heavily on auxiliary semantic embeddings for knowledge transfer across classes and largely overlook how data are reconstructed in the adopted generative model. To address this issue, we propose a novel ZS-CMR model termed Multimodal Disentanglement Variational AutoEncoders (MDVAE), which consists of two coupled disentanglement variational autoencoders (DVAEs) and a fusion-exchange VAE (FVAE). Specifically, the DVAEs disentangle the original representations of each modality into modality-invariant and modality-specific features. The FVAE fuses and exchanges information across the multimodal data through a reconstruction and alignment process that requires no pre-extracted semantic embeddings. Moreover, a counter-intuitive cross-reconstruction scheme is further proposed to enhance the informativeness and generalizability of the modality-invariant features for more effective knowledge transfer. Comprehensive experiments on four image-text retrieval and two image-sketch retrieval datasets consistently demonstrate that our method establishes new state-of-the-art performance.
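To make the disentanglement and cross-reconstruction idea concrete, below is a minimal PyTorch sketch under stated assumptions: layer sizes, the MSE loss, and the code-swapping pairing are illustrative choices, not the authors' MDVAE implementation (which further includes the FVAE and the full VAE objectives such as KL and alignment terms).

```python
# Minimal sketch of the disentanglement + cross-reconstruction idea from
# the abstract. NOT the authors' code: dimensions, losses, and the swap
# pairing below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVAE(nn.Module):
    """One per modality: encodes an input feature into a modality-invariant
    code and a modality-specific code, each with its own Gaussian head."""
    def __init__(self, in_dim=512, z_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.inv_mu, self.inv_logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.spe_mu, self.spe_logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(2 * z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    @staticmethod
    def reparam(mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x):
        h = self.enc(x)
        z_inv = self.reparam(self.inv_mu(h), self.inv_logvar(h))
        z_spe = self.reparam(self.spe_mu(h), self.spe_logvar(h))
        return z_inv, z_spe

    def decode(self, z_inv, z_spe):
        return self.dec(torch.cat([z_inv, z_spe], dim=-1))

img_dvae, txt_dvae = DVAE(), DVAE()
img, txt = torch.randn(8, 512), torch.randn(8, 512)  # paired image/text features

zi_img, zs_img = img_dvae(img)
zi_txt, zs_txt = txt_dvae(txt)

# Cross-reconstruction: rebuild each modality from the OTHER modality's
# invariant code plus its own specific code, which pressures the invariant
# code to carry only the semantics shared across modalities.
loss = (F.mse_loss(img_dvae.decode(zi_txt, zs_img), img) +
        F.mse_loss(txt_dvae.decode(zi_img, zs_txt), txt))
loss.backward()
```

At test time, retrieval under this sketch would compare only the modality-invariant codes (e.g., by cosine similarity), since the modality-specific codes are by construction not comparable across modalities.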