Computer Science
Transformer
Embedding
Artificial Intelligence
Machine Learning
Authors
Xue Han, Yitong Wang, Jun-Lan Feng, Chao Deng, Zhan-Heng Chen, Yu-An Huang, Hui Su, Lun Hu, Pengwei Hu
Identifiers
DOI: 10.1016/j.neucom.2022.09.136
Abstract
With the broad industrialization of Artificial Intelligence (AI), we observe that a large fraction of real-world AI applications are multimodal in nature, in terms of both the relevant data and the ways of interaction. Pre-trained big models have proven to be the most effective framework for jointly modeling multi-modality data. This paper provides a thorough account of the opportunities and challenges of Transformer-based multimodal pre-trained models (PTMs) in various domains. We begin by reviewing the representative tasks of multimodal AI applications, ranging from vision-text and audio-text fusion to more complex tasks, and particularly address document layout understanding as a new multimodal research domain. We further analyze and compare state-of-the-art Transformer-based multimodal PTMs from multiple aspects, including downstream applications, datasets, input feature embedding, and model architectures. In conclusion, we summarize the key challenges of this field and suggest several future research directions.
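The abstract names input feature embedding as one axis along which multimodal PTMs are compared. As a rough illustration of what that means in practice (not the paper's own method), the sketch below shows a common single-stream design: text tokens and image-patch features are projected into one shared sequence, given modality-type and position embeddings, and fed to a standard Transformer encoder. All class names, dimensions, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalEmbedding(nn.Module):
    """Hypothetical sketch of single-stream multimodal input embedding:
    text tokens and image patches share one sequence, distinguished by
    modality-type embeddings and joint position embeddings."""

    def __init__(self, vocab_size=30522, patch_dim=768, d_model=768, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # text token embeddings
        self.patch_proj = nn.Linear(patch_dim, d_model)    # project flattened image patches
        self.type_emb = nn.Embedding(2, d_model)           # 0 = text, 1 = image
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions over the joint sequence

    def forward(self, token_ids, patch_feats):
        # token_ids: (B, T_text) long; patch_feats: (B, T_img, patch_dim) float
        text = self.tok_emb(token_ids)
        image = self.patch_proj(patch_feats)
        x = torch.cat([text, image], dim=1)                # one joint sequence
        types = torch.cat([
            torch.zeros(token_ids.shape, dtype=torch.long),
            torch.ones(patch_feats.shape[:2], dtype=torch.long),
        ], dim=1)
        pos = torch.arange(x.size(1)).unsqueeze(0)
        return x + self.type_emb(types) + self.pos_emb(pos)

# Joint encoding with an off-the-shelf Transformer encoder on dummy inputs.
emb = MultimodalEmbedding()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
tokens = torch.randint(0, 30522, (1, 16))   # dummy text token ids
patches = torch.randn(1, 49, 768)           # dummy 7x7 grid of patch features
out = encoder(emb(tokens, patches))         # shape: (1, 65, 768)
```

This single-stream layout (one encoder over a concatenated sequence) is only one of the architecture families such surveys compare; dual-stream designs instead encode each modality separately and fuse via cross-attention.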