Pattern
Sentiment analysis
Transformer
Computer science
Pairwise comparison
Benchmark (surveying)
Artificial intelligence
Natural language processing
Consistency (knowledge base)
Representation (politics)
Machine learning
Pattern recognition (psychology)
Geodesy
Voltage
Sociology
Physics
Law
Politics
Geography
Quantum mechanics
Social science
Political science
Authors
Di Wang, Xutong Guo, Yumin Tian, Jinhui Liu, LiHuo He, Xuemei Luo
Identifier
DOI:10.1016/j.patcog.2022.109259
Abstract
Multimodal sentiment analysis (MSA), which aims to recognize sentiment expressed by speakers in videos utilizing textual, visual and acoustic cues, has attracted extensive research attention in recent years. However, textual, visual and acoustic modalities often contribute differently to sentiment analysis. In general, text contains more intuitive sentiment-related information and outperforms nonlinguistic modalities in MSA. Seeking a strategy to take advantage of this property to obtain a fusion representation containing more sentiment-related information and simultaneously preserving inter- and intra-modality relationships becomes a significant challenge. To this end, we propose a novel method named Text Enhanced Transformer Fusion Network (TETFN), which learns text-oriented pairwise cross-modal mappings for obtaining effective unified multimodal representations. In particular, it incorporates textual information in learning sentiment-related nonlinguistic representations through text-based multi-head attention. In addition to preserving consistency information by cross-modal mappings, it also retains the differentiated information among modalities through unimodal label prediction. Furthermore, the vision pre-trained model Vision-Transformer is utilized to extract visual features from the original videos to preserve both global and local information of a human face. Extensive experiments on benchmark datasets CMU-MOSI and CMU-MOSEI demonstrate the superior performance of the proposed TETFN over state-of-the-art methods.
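The central mechanism described above, text-based multi-head attention that injects textual cues into the visual and acoustic streams, can be illustrated with a minimal PyTorch sketch. This is not the authors' released TETFN code: the module name, feature dimensions, and the assumption that text supplies the attention queries while the nonlinguistic modality supplies keys and values are illustrative choices consistent with, but not confirmed by, the abstract.

```python
# Hypothetical sketch of text-based cross-modal attention (not the authors' implementation).
import torch
import torch.nn as nn

class TextBasedCrossModalAttention(nn.Module):
    """Enhance a nonlinguistic (visual or acoustic) sequence with textual cues.

    Assumption: text provides the queries; the nonlinguistic modality provides
    keys and values, yielding a text-conditioned multimodal representation.
    """
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, nonling: torch.Tensor) -> torch.Tensor:
        # text:    (batch, T_text, d_model) textual features
        # nonling: (batch, T_mod,  d_model) visual or acoustic features
        enhanced, _ = self.attn(query=text, key=nonling, value=nonling)
        # Residual connection keeps the original textual information.
        return self.norm(enhanced + text)

# Toy usage with random tensors standing in for extracted features
# (e.g., ViT-based face features on the visual side).
txt = torch.randn(2, 20, 128)
vis = torch.randn(2, 50, 128)
out = TextBasedCrossModalAttention()(txt, vis)
print(out.shape)  # torch.Size([2, 20, 128])
```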