Keywords
Computer Science
Utterance
Artificial Intelligence
Context
Speech Recognition
Focus
Benchmark
Feature
Sensor Fusion
Authors
Yang Liu, Haoqin Sun, Wenbo Guan, Yuqi Xia, Zhao Zhen
Identifier
DOI: 10.1016/j.specom.2022.02.006
Abstract
Accurately recognizing emotion from speech is a necessary yet challenging task due to the variability of speech and emotion. In this paper, a novel method combining a self-attention mechanism with a multi-scale fusion framework is proposed for multi-modal speech emotion recognition (SER) using speech and text information. A self-attentional bidirectional contextual LSTM (bc-LSTM) is proposed to learn context-sensitive dependencies from speech. Specifically, the BLSTM layer learns long-term dependencies and utterance-level contextual information, while the multi-head self-attention layer makes the model focus on the features most relevant to the emotions. A self-attentional multi-channel CNN (MCNN), which takes advantage of static and dynamic channels, is applied to learn general and thematic features from text. Finally, a multi-scale fusion strategy, including feature-level fusion and decision-level fusion, is applied to improve overall performance. Experimental results on the benchmark IEMOCAP dataset demonstrate that our method achieves absolute improvements of 1.48% and 3.00% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
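The abstract names the building blocks but not the implementation details. The sketch below is a minimal PyTorch-style illustration of a pipeline matching that description; all layer sizes, the mean-pooling step, the `fusion_head` classifier, and the decision-level mixing weight `alpha` are hypothetical choices for illustration, not the authors' actual configuration.

```python
# Minimal sketch of the described architecture, assuming PyTorch.
# Dimensions, pooling, and fusion weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechBranch(nn.Module):
    """bc-LSTM-style speech encoder: BLSTM for utterance-level context,
    followed by multi-head self-attention over the frame sequence."""

    def __init__(self, feat_dim=40, hidden=128, heads=4, num_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden,
                                          num_heads=heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)                  # (batch, frames, 2*hidden)
        h, _ = self.attn(h, h, h)             # self-attention: Q = K = V = h
        emb = h.mean(dim=1)                   # pool frames into one embedding
        return emb, self.fc(emb)              # (embedding, class logits)


class TextBranch(nn.Module):
    """Multi-channel CNN over word embeddings: a frozen ("static") channel
    for general features and a fine-tuned ("dynamic") channel for
    task-specific features, sharing the same convolution filters."""

    def __init__(self, vocab, emb_dim=300, n_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=4):
        super().__init__()
        self.static = nn.Embedding(vocab, emb_dim)
        self.static.weight.requires_grad = False  # frozen channel
        self.dynamic = nn.Embedding(vocab, emb_dim)  # trainable channel
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(len(kernel_sizes) * n_filters, num_classes)

    def forward(self, tokens):                # tokens: (batch, words)
        out = 0
        for channel in (self.static, self.dynamic):
            e = channel(tokens).transpose(1, 2)  # (batch, emb_dim, words)
            pooled = [F.relu(c(e)).max(dim=2).values for c in self.convs]
            out = out + torch.cat(pooled, dim=1)  # sum the two channels
        return out, self.fc(out)


def multiscale_fusion(speech_emb, text_emb, speech_logits, text_logits,
                      fusion_head, alpha=0.5):
    """Feature-level fusion (concatenate embeddings, classify jointly)
    combined with decision-level fusion (mix the two branch posteriors).
    `alpha` is an assumed mixing weight, not a value from the paper."""
    feat_logits = fusion_head(torch.cat([speech_emb, text_emb], dim=1))
    decision = alpha * F.softmax(speech_logits, dim=1) \
             + (1 - alpha) * F.softmax(text_logits, dim=1)
    return F.softmax(feat_logits, dim=1) + decision  # combined score


if __name__ == "__main__":
    speech, text = SpeechBranch(), TextBranch(vocab=10000)
    head = nn.Linear(2 * 128 + 3 * 100, 4)    # concat dim -> 4 emotions
    s_emb, s_log = speech(torch.randn(2, 50, 40))
    t_emb, t_log = text(torch.randint(0, 10000, (2, 30)))
    print(multiscale_fusion(s_emb, t_emb, s_log, t_log, head).shape)  # (2, 4)
```

In this sketch, feature-level fusion concatenates the two branch embeddings before a joint classifier, while decision-level fusion averages the branch posteriors; the paper's actual fusion weighting and the placement of its attention layers may well differ.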