Computer Science
Artificial Intelligence
Modality (Human-Computer Interaction)
Modal Verb
Deep Learning
Speech Recognition
Feature Learning
Pattern Recognition (Psychology)
Audiovisual
Convolutional Neural Network
Image Retrieval
Discriminative Model
Feature (Linguistics)
Machine Learning
Transfer Learning
Authors
Cong Jin,Tian Zhang,Shouxun Liu,Yun Tie,Xin Lv,Jianguang Li,Yan Wencai,Ming Yan,Qian Xu,Yicong Guan,Zhenggougou Yang
Identifier
DOI:10.1007/978-3-030-68780-9_26
Abstract
Recently, deep neural networks have emerged as a powerful architecture for capturing the nonlinear distribution of high-dimensional multimedia data such as images, video, text, and audio, and this naturally extends to multi-modal data. How can multimedia data be fully exploited? This question leads to an important research direction: cross-modal learning. In this paper, we introduce a content-based method for the audio and video modalities, implemented with a novel two-branch neural network that learns joint embeddings in a shared subspace for computing the similarity between the two modalities. The contribution of the proposed method is threefold: i) a feature selection model is used to choose the top-k audio and visual feature representations; ii) a novel combination of training loss terms concerning inter-modal similarity and intra-modal invariance is used; iii) because of the lack of paired video-music data, we construct a dataset of video-music pairs from the YouTube 8M and MER31K datasets. Experiments show that our proposed model performs better than other methods.
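As a rough illustration of the pipeline the abstract describes, the following PyTorch sketch combines a top-k feature selection step with a two-branch network that embeds video and audio features into a shared subspace, trained with a loss coupling an inter-modal ranking term and an intra-modal invariance term. All layer sizes, the top-k scheme, the specific loss formulation, and the weighting lam are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def topk_feature_selection(features: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k largest-magnitude activations per sample; a hypothetical
    # stand-in for the paper's feature selection model.
    _, indices = features.abs().topk(min(k, features.size(1)), dim=1)
    mask = torch.zeros_like(features).scatter_(1, indices, 1.0)
    return features * mask

class Branch(nn.Module):
    # One branch: maps modality-specific features into the shared subspace.
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot products are cosine similarities.
        return F.normalize(self.net(x), dim=1)

class TwoBranchNet(nn.Module):
    def __init__(self, video_dim: int, audio_dim: int,
                 embed_dim: int = 128, k: int = 256):
        super().__init__()
        self.k = k
        self.video_branch = Branch(video_dim, embed_dim)
        self.audio_branch = Branch(audio_dim, embed_dim)

    def forward(self, video_feats, audio_feats):
        v = self.video_branch(topk_feature_selection(video_feats, self.k))
        a = self.audio_branch(topk_feature_selection(audio_feats, self.k))
        return v, a

def combined_loss(v, a, margin: float = 0.2, lam: float = 0.5):
    # Inter-modal term: triplet-style ranking loss pushing every mismatched
    # video-audio pair in the batch below the matched pair by a margin.
    sim = v @ a.t()                          # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)            # similarities of matched pairs
    off_diag = 1.0 - torch.eye(sim.size(0), device=sim.device)
    inter = (F.relu(margin + sim - pos) * off_diag).mean()
    # Intra-modal invariance term, approximated here by pulling the two
    # embeddings of a matched pair together; the paper's exact formulation
    # is not specified in the abstract.
    intra = F.mse_loss(v, a)
    return inter + lam * intra

A minimal usage example, with feature dimensions chosen arbitrarily:

video = torch.randn(8, 1024)   # e.g. pooled CNN frame features (assumed size)
audio = torch.randn(8, 640)    # e.g. audio embedding features (assumed size)
model = TwoBranchNet(video_dim=1024, audio_dim=640)
loss = combined_loss(*model(video, audio))
loss.backward()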