Paralanguage
Computer science
Pattern
Speech recognition
Sentence
Transformer
Natural language processing
Artificial intelligence
Modality (human-computer interaction)
Linguistics
Social science
Philosophy
Physics
Quantum mechanics
Voltage
Sociology
Authors
Lili Guo, Longbiao Wang, Jianwu Dang, Yahui Fu, Jiaxing Liu, Shifei Ding
Source
Journal: IEEE MultiMedia
[Institute of Electrical and Electronics Engineers]
Date: 2022-04-01
Volume/issue: 29 (2): 94-103
Citations: 3
Identifiers
DOI: 10.1109/mmul.2022.3161411
Abstract
People usually express emotions through paralinguistic and linguistic information in speech. How to effectively integrate linguistic and paralinguistic information for emotion recognition remains a challenge. Previous studies have adopted the bidirectional long short-term memory (BLSTM) network to extract acoustic and lexical representations followed by a concatenation layer, and this has become a common method. However, simple sentence-level feature fusion makes it difficult to capture the interaction and mutual influence between modalities. In this article, we propose an implicitly aligned multimodal transformer fusion (IA-MMTF) framework based on acoustic features and text information. This model enables the two modalities to guide and complement each other when learning emotional representations. Thereafter, weighted fusion is used to control the contributions of the different modalities, yielding more complementary emotional representations. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database and the multimodal EmotionLines dataset (MELD) show that the proposed method outperforms the baseline BLSTM-based method.
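To make the fusion idea concrete, the following is a minimal PyTorch sketch of cross-modal transformer attention between acoustic and text sequences followed by a learnable weighted fusion. It is not the authors' IA-MMTF implementation: the class name, feature dimensions, pooling, and classifier head are illustrative assumptions only.

```python
import torch
import torch.nn as nn


class CrossModalFusionSketch(nn.Module):
    """Illustrative cross-modal attention fusion of acoustic and text features.

    NOT the authors' IA-MMTF model; dimensions and module choices are
    assumptions for demonstration only.
    """

    def __init__(self, acoustic_dim=64, text_dim=768, model_dim=128,
                 num_heads=4, num_classes=4):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.acoustic_proj = nn.Linear(acoustic_dim, model_dim)
        self.text_proj = nn.Linear(text_dim, model_dim)
        # Cross-modal attention: each modality attends to the other,
        # letting the two streams guide and complement each other.
        self.text_to_acoustic = nn.MultiheadAttention(model_dim, num_heads,
                                                      batch_first=True)
        self.acoustic_to_text = nn.MultiheadAttention(model_dim, num_heads,
                                                      batch_first=True)
        # Learnable scalars controlling each modality's contribution.
        self.modality_logits = nn.Parameter(torch.zeros(2))
        self.classifier = nn.Linear(model_dim, num_classes)

    def forward(self, acoustic, text):
        # acoustic: (batch, T_a, acoustic_dim), text: (batch, T_t, text_dim)
        a = self.acoustic_proj(acoustic)
        t = self.text_proj(text)
        # Acoustic queries attend to text keys/values, and vice versa.
        a_enriched, _ = self.text_to_acoustic(query=a, key=t, value=t)
        t_enriched, _ = self.acoustic_to_text(query=t, key=a, value=a)
        # Pool over time, then fuse with softmax-normalized modality weights.
        a_vec = a_enriched.mean(dim=1)
        t_vec = t_enriched.mean(dim=1)
        w = torch.softmax(self.modality_logits, dim=0)
        fused = w[0] * a_vec + w[1] * t_vec
        return self.classifier(fused)


if __name__ == "__main__":
    # Dummy inputs: 2 utterances, 50 acoustic frames, 20 word embeddings.
    model = CrossModalFusionSketch()
    logits = model(torch.randn(2, 50, 64), torch.randn(2, 20, 768))
    print(logits.shape)  # torch.Size([2, 4])
```

In this sketch, the softmax over the two modality logits plays the role of the weighted fusion described in the abstract, letting the model learn how much the acoustic and textual streams each contribute to the final emotion prediction.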