Computer science
Artificial intelligence
Leverage (statistics)
Feature learning
Deep learning
Utterance
Feature (linguistics)
Speech recognition
Convolutional neural network
Frame (networking)
Telecommunications
Philosophy
Linguistics
Authors
Zengzhao Chen, Mengting Lin, Zhifeng Wang, Qiuyu Zheng, Chuan Liu
Identifier
DOI: 10.1016/j.knosys.2023.111077
Abstract
Speech emotion recognition (SER) systems have become essential in various fields, including intelligent healthcare, customer service, call centers, automatic translation systems, and human–computer interaction. However, current approaches predominantly rely on single frame-level or utterance-level features, offering only shallow or deep characterization, and fail to fully exploit the diverse types, levels, and scales of emotion features. The limited ability of single features to capture speech emotion information, together with the failure of simple fusion to combine the complementary advantages of different features, poses significant challenges. To address these issues, this paper presents a novel spatio-temporal representation learning enhanced speech emotion recognition model with multi-head attention mechanisms (STRL-SER). The proposed technique integrates fine-grained frame-level features and coarse-grained utterance-level emotion features, while employing separate modules to extract deep representations at different levels. In the frame-level module, we introduce parallel networks and utilize a bidirectional long short-term memory network (BiLSTM) and an attention-based multi-scale convolutional neural network (CNN) to capture the spatio-temporal representation details of diverse frame-level signals. In addition, we extract deep representations of utterance-level features to effectively learn global speech emotion features. To leverage the advantages of different feature types, we introduce a multi-head attention mechanism that fuses the deep representations from the various levels while retaining the distinctive qualities of each feature type. Finally, we employ segment-level multiplexed decision making to generate the final classification results. We evaluate the effectiveness of our proposed method on two widely recognized benchmark datasets, IEMOCAP and RAVDESS. The results demonstrate that our method achieves notable performance improvements compared to previous studies. On the IEMOCAP dataset, our method achieves a weighted accuracy (WA) of 81.60% and an unweighted accuracy (UA) of 79.32%. Similarly, on the RAVDESS dataset, we achieve a WA of 88.88% and a UA of 87.85%. These outcomes confirm the substantial advancements realized by our proposed method.
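To make the described architecture concrete, the following PyTorch code is a minimal sketch of the pipeline outlined in the abstract: a frame-level branch with a BiLSTM and a multi-scale CNN, an utterance-level branch, and multi-head attention fusion before classification. It is an illustrative approximation rather than the authors' implementation; the module names (FrameLevelBranch, STRLSER), layer sizes, feature dimensions (40-dimensional frame features, 88-dimensional utterance features), number of attention heads, and pooling choices are assumptions, and the segment-level multiplexed decision step is omitted.

```python
# Hypothetical sketch of an STRL-SER-style model; sizes and names are assumptions.
import torch
import torch.nn as nn


class FrameLevelBranch(nn.Module):
    """Parallel BiLSTM + multi-scale CNN over frame-level features."""

    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Multi-scale 1-D convolutions with different kernel sizes.
        self.convs = nn.ModuleList([
            nn.Conv1d(feat_dim, hidden, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.proj = nn.Linear(2 * hidden + 3 * hidden, 2 * hidden)

    def forward(self, x):                      # x: (B, T, feat_dim)
        rnn_out, _ = self.bilstm(x)            # (B, T, 2*hidden)
        temporal = rnn_out.mean(dim=1)         # temporal summary
        c = x.transpose(1, 2)                  # (B, feat_dim, T)
        spatial = torch.cat(
            [conv(c).relu().mean(dim=2) for conv in self.convs], dim=1)
        return self.proj(torch.cat([temporal, spatial], dim=1))


class STRLSER(nn.Module):
    """Fuses frame-level and utterance-level representations with
    multi-head attention, then classifies emotions."""

    def __init__(self, frame_dim=40, utt_dim=88, hidden=128, n_classes=4):
        super().__init__()
        d = 2 * hidden
        self.frame_branch = FrameLevelBranch(frame_dim, hidden)
        self.utt_branch = nn.Sequential(
            nn.Linear(utt_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.fusion = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d, n_classes)

    def forward(self, frames, utterance):
        f = self.frame_branch(frames)          # (B, d) frame-level representation
        u = self.utt_branch(utterance)         # (B, d) utterance-level representation
        tokens = torch.stack([f, u], dim=1)    # (B, 2, d), one token per level
        fused, _ = self.fusion(tokens, tokens, tokens)
        return self.classifier(fused.mean(dim=1))   # emotion logits


# Usage with random tensors standing in for one batch of speech segments.
model = STRLSER()
logits = model(torch.randn(8, 300, 40), torch.randn(8, 88))
print(logits.shape)  # torch.Size([8, 4])
```

Treating each level's representation as a separate token for the multi-head attention layer is one simple way to let the fusion step weigh the two levels while keeping their representations distinct; the paper's actual fusion scheme may differ.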