Topics
Mode
Computer science
Modal verb
Artificial intelligence
Machine learning
Modal logic
Feature learning
Perspective (graphical)
Representation (politics)
Deep learning
Sociology
Politics
Chemistry
Polymer chemistry
Law
Social science
Political science
Authors
Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, Longbo Huang
Source
Journal: Cornell University - arXiv
Date: 2021-01-01
Citations: 85
Identifier
DOI: 10.48550/arxiv.2106.04538
Abstract
The world provides us with data of multiple modalities. Intuitively, models that fuse data from different modalities outperform their uni-modal counterparts, since more information is aggregated. Recently, building on the success of deep learning, an influential line of work on deep multi-modal learning has produced remarkable empirical results across various applications. However, theoretical justification in this field is notably lacking. Can multi-modal learning provably perform better than uni-modal learning? In this paper, we answer this question under one of the most popular multi-modal fusion frameworks, which first encodes features from different modalities into a common latent space and then maps the latent representations into the task space. We prove that learning with multiple modalities achieves a smaller population risk than learning with only a subset of those modalities. The main intuition is that the former yields a more accurate estimate of the latent space representation. To the best of our knowledge, this is the first theoretical treatment to capture, from the generalization perspective, important qualitative phenomena observed in real multi-modal applications. Combined with experimental results, we show that multi-modal learning does possess an appealing formal guarantee.
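The fusion framework the abstract describes (per-modality encoders into a shared latent space, followed by a single task head) can be made concrete. The following is a minimal PyTorch sketch, not the authors' implementation: the linear encoders, latent dimension, mean-pooling fusion rule, and classification head are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Sketch of the fusion framework from the abstract: per-modality
    encoders map inputs into a common latent space, and one task head
    maps the fused latent representation into the task space.
    All dimensions and the mean-fusion rule are assumptions."""

    def __init__(self, modality_dims, latent_dim=128, num_classes=10):
        super().__init__()
        # One encoder per modality, all projecting into the same latent space.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, latent_dim), nn.ReLU())
            for d in modality_dims
        )
        # Single task head operating on the shared latent representation.
        self.head = nn.Linear(latent_dim, num_classes)

    def forward(self, inputs):
        # inputs: list of tensors, one per modality, each of shape (batch, d_k).
        latents = [enc(x) for enc, x in zip(self.encoders, inputs)]
        # Fuse by averaging latent codes; dropping modalities means
        # averaging over fewer encoders (the "subset" baseline above).
        z = torch.stack(latents, dim=0).mean(dim=0)
        return self.head(z)

# Example with two hypothetical modalities, e.g. image (dim 512) and audio (dim 64).
model = LateFusionModel([512, 64])
x_img, x_aud = torch.randn(8, 512), torch.randn(8, 64)
logits = model([x_img, x_aud])  # shape (8, 10)
```

Under this design, training with a subset of modalities simply means fusing over fewer encoders, which mirrors the multi-modal versus subset comparison whose population risks the paper analyzes.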