Computer science
Pattern
Gesture
Nonverbal communication
Utterance
Natural language processing
Artificial intelligence
Sentence (linguistics)
Sentiment analysis
Facial expression
Modality (human-computer interaction)
Speech recognition
Communication
Psychology
Social science
Sociology
Authors
Di Wang, Shuai Liu, Quan Wang, Yumin Tian, Lihuo He, Xinbo Gao
Identifier
DOI:10.1109/tmm.2022.3183830
Abstract
Multimodal sentiment analysis (MSA) plays an important role in many applications, such as intelligent question answering, computer-assisted psychotherapy, and video understanding, and has attracted considerable attention in recent years. It leverages multimodal signals, including verbal language, facial gestures, and acoustic behaviors, to identify sentiments in videos. The language modality typically outperforms the nonverbal modalities in MSA; therefore, strengthening the role of language is a vital way to improve recognition accuracy. Considering that the meaning of a sentence often varies across nonverbal contexts, combining nonverbal information with text representations helps capture the exact emotion conveyed by an utterance. In this paper, we propose a Cross-modal Enhancement Network (CENet) model that enhances text representations by integrating visual and acoustic information into a language model. Specifically, it embeds a Cross-modal Enhancement (CE) module, which enhances each word representation according to long-range emotional cues implied in unaligned nonverbal data, into a transformer-based pre-trained language model. Moreover, a feature transformation strategy is introduced for the acoustic and visual modalities to reduce the distribution differences between the initial representations of verbal and nonverbal modalities, thereby facilitating the fusion of distinct modalities. Extensive experiments on benchmark datasets demonstrate the significant gains of CENet over state-of-the-art methods.
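The following is a minimal PyTorch sketch of the general idea the abstract describes, not the authors' actual CENet implementation: a small transformation network maps raw acoustic/visual features toward the text embedding space, and a cross-modal enhancement block lets each word attend over unaligned nonverbal frames and adds a gated shift to its representation. All module names, dimensions, and the gating design here are illustrative assumptions.

import torch
import torch.nn as nn

class FeatureTransform(nn.Module):
    """Map raw acoustic/visual features toward the text embedding space.
    The paper's feature transformation strategy is only approximated here
    as a small MLP with layer normalization (an assumption)."""
    def __init__(self, in_dim: int, text_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
            nn.LayerNorm(text_dim),
        )

    def forward(self, x):  # x: (batch, nonverbal_len, in_dim)
        return self.net(x)

class CrossModalEnhancement(nn.Module):
    """For each word, attend over unaligned nonverbal frames and add a
    gated shift to the word embedding (illustrative design, not the
    authors' exact CE module)."""
    def __init__(self, text_dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * text_dim, text_dim), nn.Sigmoid())

    def forward(self, text_h, nonverbal_h):
        # text_h: (B, text_len, D); nonverbal_h: (B, nonverbal_len, D); no word-frame alignment assumed
        context, _ = self.attn(query=text_h, key=nonverbal_h, value=nonverbal_h)
        g = self.gate(torch.cat([text_h, context], dim=-1))
        return text_h + g * context  # enhanced word representations

if __name__ == "__main__":
    B, text_len, audio_len, video_len = 2, 12, 50, 30
    text_dim, audio_dim, video_dim = 768, 74, 35  # illustrative feature sizes
    text_h = torch.randn(B, text_len, text_dim)   # e.g. hidden states from a pre-trained language model
    audio = torch.randn(B, audio_len, audio_dim)
    video = torch.randn(B, video_len, video_dim)

    audio_h = FeatureTransform(audio_dim, text_dim)(audio)
    video_h = FeatureTransform(video_dim, text_dim)(video)
    ce = CrossModalEnhancement(text_dim)
    enhanced = ce(ce(text_h, audio_h), video_h)   # enhance with each nonverbal modality in turn
    print(enhanced.shape)  # torch.Size([2, 12, 768])

In the paper, the enhanced word representations would feed back into the transformer-based language model before sentiment prediction; the sequential audio-then-video enhancement above is simply one plausible way to combine both nonverbal modalities in a sketch.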