Mode
Computer science
Leverage (statistics)
Artificial intelligence
Modality (human–computer interaction)
Modal verb
Dual (grammatical number)
Machine learning
Generative model
Natural language processing
Generative grammar
Pattern recognition (psychology)
Social science
Chemistry
Sociology
Art
Literature
Polymer chemistry
Authors
Mengmeng Jing,Jingjing Li,Lei Zhu,Ke Lü,Yang Yang,Zi Huang
Identifier
DOI:10.1145/3394171.3413676
Abstract
Learning the relationships among multi-modal data, e.g., texts, images, and videos, is a classic task in the multimedia community. Cross-modal retrieval (CMR) is a typical example, where the query and the corresponding results are in different modalities. Yet, a majority of existing works investigate CMR under an ideal assumption that the training samples in every modality are sufficient and complete. In real-world applications, however, this assumption does not always hold. Mismatch is common in multi-modal datasets: there is a high chance that samples in some modalities are either missing or corrupted. As a result, incomplete CMR has become a challenging issue. In this paper, we propose Dual-Aligned Variational Autoencoders (DAVAE) to address the incomplete CMR problem. Specifically, we propose to learn modality-invariant representations for different modalities and use the learned representations for retrieval. We train multiple autoencoders, one for each modality, to learn the latent factors shared among different modalities. These latent representations are further dual-aligned at the distribution level and the semantic level to alleviate the modality gaps and enhance the discriminability of the representations. For missing instances, we leverage generative models to synthesize latent representations for them. Notably, we test our method with different ratios of random incompleteness. Extensive experiments on three datasets verify that our method can consistently outperform state-of-the-art methods.
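The pipeline the abstract describes — a separate encoder per modality mapping into a shared latent space, an alignment term that pulls the modality distributions together, and retrieval by similarity in that latent space — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's DAVAE: the feature dimensions, the random linear "encoders", the moment-matching alignment loss, and cosine-similarity retrieval are all assumptions standing in for the trained variational autoencoders and dual-alignment objectives of the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 512-d image features,
# 300-d text features, a 64-d shared latent space.
D_IMG, D_TXT, D_LAT = 512, 300, 64

# One encoder per modality; random linear maps stand in for the
# trained per-modality VAE encoders described in the abstract.
W_img = rng.normal(scale=0.02, size=(D_IMG, D_LAT))
W_txt = rng.normal(scale=0.02, size=(D_TXT, D_LAT))

def encode(x, W):
    """Project modality-specific features into the shared latent space."""
    return x @ W

def distribution_alignment_loss(z_a, z_b):
    """Moment-matching surrogate for distribution-level alignment:
    penalise gaps between the means and variances of the two latent sets."""
    mean_gap = np.sum((z_a.mean(axis=0) - z_b.mean(axis=0)) ** 2)
    var_gap = np.sum((z_a.var(axis=0) - z_b.var(axis=0)) ** 2)
    return mean_gap + var_gap

def retrieve(query_z, gallery_z):
    """Cross-modal retrieval: rank gallery latents by cosine similarity,
    best match first, one ranking row per query."""
    q = query_z / np.linalg.norm(query_z, axis=1, keepdims=True)
    g = gallery_z / np.linalg.norm(gallery_z, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)

# Toy batch: 8 paired image/text samples.
x_img = rng.normal(size=(8, D_IMG))
x_txt = rng.normal(size=(8, D_TXT))
z_img, z_txt = encode(x_img, W_img), encode(x_txt, W_txt)

loss = distribution_alignment_loss(z_img, z_txt)   # would be minimised in training
ranks = retrieve(z_img, z_txt)                     # image queries against text gallery
```

In the full method, the alignment term would be minimised jointly with the per-modality reconstruction objectives and a semantic-level loss, and for an instance with a missing modality the latent code would be synthesized by a generative model rather than encoded from data.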