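Computer science
Discriminative
Matching (statistics)
Artificial intelligence
Embedding
Component (thermodynamics)
Pattern recognition (psychology)
Machine learning
Contextual image classification
Focus (optics)
Image (mathematics)
Mathematics
Thermodynamics
Statistics
Optics
Physics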
Authors
Yu Liu,Li Liu,Yanming Guo,Michael S. Lew
Identifier
DOI:10.1016/j.patcog.2018.07.001
Abstract
Multimodal learning, which aims to bridge the modality gap between heterogeneous representations such as vision and language, has been an important and challenging problem for decades. Unlike many current approaches, which focus on either multimodal matching or classification alone, we propose a unified network that jointly learns multimodal matching and classification (MMC-Net) between images and texts. The proposed MMC-Net model seamlessly integrates the matching and classification components. It first learns visual and textual embedding features in the matching component, and then generates discriminative multimodal representations in the classification component. Combining the two components in a unified model can help improve the performance of both. Moreover, we present a multi-stage training algorithm that minimizes both the matching and classification loss functions. Experimental results on four well-known multimodal benchmarks demonstrate the effectiveness and efficiency of the proposed approach, which achieves competitive performance for multimodal matching and classification compared to state-of-the-art approaches.
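The abstract does not spell out the loss functions, so the following is only an illustrative sketch of what minimizing a matching loss and a classification loss together could look like. It assumes a bidirectional hinge ranking loss for matching, softmax cross-entropy for classification, and a simple weighted sum of the two. The function names, the margin of 0.2, and the `weight` parameter are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)


def matching_loss(img_embs, txt_embs, margin=0.2):
    """Assumed matching loss: bidirectional hinge ranking over a batch.

    img_embs[i] and txt_embs[i] are treated as a matching pair; every
    other pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in img_embs]
    txts = [l2_normalize(v) for v in txt_embs]
    n = len(imgs)
    loss = 0.0
    for i in range(n):
        pos = dot(imgs[i], txts[i])
        for j in range(n):
            if j == i:
                continue
            # image i ranked against a non-matching text j
            loss += max(0.0, margin - pos + dot(imgs[i], txts[j]))
            # text i ranked against a non-matching image j
            loss += max(0.0, margin - pos + dot(imgs[j], txts[i]))
    return loss / n


def classification_loss(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def joint_loss(img_embs, txt_embs, logits_batch, labels, weight=1.0):
    """Assumed joint objective: matching loss + weight * mean classification loss."""
    cls = sum(
        classification_loss(l, y) for l, y in zip(logits_batch, labels)
    ) / len(labels)
    return matching_loss(img_embs, txt_embs) + weight * cls
```

In this sketch, the embeddings would come from the matching component and the logits from the classification component; a multi-stage schedule could, for example, minimize `matching_loss` first and then switch to `joint_loss`, though the actual schedule used by the authors is not described in the abstract.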