Computer science
Robustness (evolution)
Speech recognition
Discriminative model
Artificial intelligence
Encoder
Feature (linguistics)
Fusion mechanism
Pattern recognition (psychology)
Fusion
Gene
Operating system
Lipid bilayer fusion
Philosophy
Biochemistry
Chemistry
Linguistics
Authors
Mustaqeem Khan, Wail Gueaieb, Abdulmotaleb El Saddik, Soonil Kwon
Identifier
DOI:10.1016/j.eswa.2023.122946
Abstract
In human-computer interaction (HCI), and especially in speech signal processing, emotion recognition is one of the most important and challenging tasks due to multi-modality and limited data availability. Real-world applications now require an intelligent system that can efficiently process and understand a speaker's emotional state and enhance the analytical abilities that support communication through a human-machine interface (HMI). Designing a reliable and robust Multimodal Speech Emotion Recognition (MSER) system that efficiently recognizes emotions from multiple modalities, such as speech and text, is therefore necessary. This paper proposes a novel MSER model with a deep feature fusion technique based on a multi-headed cross-attention mechanism. The proposed model uses audio and text cues to predict the emotion label. It processes the raw speech signal and the text with CNNs and feeds them to the corresponding encoders for discriminative and semantic feature extraction. A cross-attention mechanism is applied to both feature sets so that text and audio cues attend to each other crosswise and the most relevant information for emotion recognition is extracted. Finally, the proposed deep feature fusion scheme combines the region-wise weights from both encoders, enabling interaction among different layers and paths. The authors evaluate the proposed system on the IEMOCAP and MELD datasets and conduct extensive experiments that obtain state-of-the-art (SOTA) results, improving the recognition rate by 4.5%. The significant improvement over SOTA methods shows the robustness and effectiveness of the proposed MSER model.
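The abstract describes CNN front-ends feeding modality-specific encoders, a multi-headed cross-attention step in which audio and text attend to each other, and a final fusion of the two attended streams. The sketch below is only an illustration of that general idea, not the authors' implementation: the class name, feature dimensions, CNN/embedding front-ends, mean pooling, and gated fusion are all assumptions introduced here for clarity.

```python
# Minimal sketch (assumptions, not the paper's code): crosswise multi-headed
# attention between audio and text features, followed by a simple fusion and
# emotion classifier, using torch.nn.MultiheadAttention.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=4):
        super().__init__()
        # Placeholder front-ends: 1D CNN for the raw waveform, linear
        # projection for (assumed 300-d) word embeddings.
        self.audio_cnn = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.text_proj = nn.Linear(300, dim)
        # Multi-headed cross-attention in both directions.
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fusion of the two attended streams and emotion classifier.
        self.fusion_gate = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_wave, text_emb):
        # audio_wave: (B, 1, T_samples); text_emb: (B, T_words, 300)
        a = self.audio_cnn(audio_wave).transpose(1, 2)  # (B, T_a, dim)
        t = self.text_proj(text_emb)                    # (B, T_t, dim)
        # Each modality queries the other (crosswise attention).
        a_att, _ = self.audio_to_text(query=a, key=t, value=t)
        t_att, _ = self.text_to_audio(query=t, key=a, value=a)
        # Pool over time, concatenate, and fuse before classification.
        fused = torch.cat([a_att.mean(dim=1), t_att.mean(dim=1)], dim=-1)
        fused = torch.tanh(self.fusion_gate(fused))
        return self.classifier(fused)

model = CrossAttentionFusion()
logits = model(torch.randn(2, 1, 16000), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 4])
```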