Authors
Md. Rayhan Ahmed,Salekul Islam,A. K. M. Muzahidul Islam,Swakkhar Shatabda
Identifier
DOI: 10.1016/j.eswa.2023.119633
Abstract
Precise recognition of emotion from speech signals aids in enhancing human–computer interaction (HCI). The performance of a speech emotion recognition (SER) system depends on the features derived from speech signals. However, selecting the optimal set of feature representations remains the most challenging task in SER because the effectiveness of features varies with emotion. Most studies extract hidden local speech features while ignoring the global long-term contextual representations of speech signals. Existing SER systems suffer from low recognition performance, mainly due to the scarcity of available data and sub-optimal feature representations. Motivated by the efficient feature extraction of convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU), this article proposes an ensemble that combines the predictive performance of three different architectures. The first architecture uses a 1D CNN followed by fully connected networks (FCN). In the other two architectures, LSTM-FCN and GRU-FCN layers, respectively, follow the CNN layers. All three individual models focus on extracting both local and long-term global contextual representations of speech signals. The ensemble uses a weighted average of the individual models' predictions. We evaluated the model's performance on five benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We augmented the data by injecting additive white Gaussian noise, pitch shifting, and time-stretching the signal to obtain better model generalization. Five categories of features were extracted from each audio file in those datasets: mel-frequency cepstral coefficients, log mel-scaled spectrogram, zero-crossing rate, chromagram, and root mean square value.
All four models perform exceptionally well on the SER task; notably, the ensemble model achieves state-of-the-art (SOTA) weighted average accuracies of 99.46% on TESS, 95.42% on EMO-DB, 95.62% on RAVDESS, 93.22% on SAVEE, and 90.47% on CREMA-D, thus significantly outperforming the SOTA models on the same datasets.
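Two steps of the pipeline described above lend themselves to a compact illustration: the additive-white-Gaussian-noise augmentation and the weighted-average ensembling of the three models' class probabilities. The sketch below is a minimal numpy version under stated assumptions — the abstract does not specify the noise SNR, the ensemble weights, or the models' softmax outputs, so the values used here (15 dB SNR, equal weights, hypothetical probability vectors) are placeholders, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_awgn(signal, snr_db=15.0):
    """Inject additive white Gaussian noise at a target SNR in dB.
    (The 15 dB default is an assumption; the paper's parameter is not given.)"""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def weighted_ensemble(probs_list, weights):
    """Weighted average of per-model class-probability vectors."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize so the result stays a distribution
    stacked = np.stack(probs_list)           # shape: (n_models, n_classes)
    return np.tensordot(w, stacked, axes=1)  # shape: (n_classes,)

# Augment a toy waveform (stand-in for a real speech sample).
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
noisy = add_awgn(clean, snr_db=15.0)

# Hypothetical softmax outputs from the three models
# (CNN-FCN, CNN-LSTM-FCN, CNN-GRU-FCN) over three emotion classes.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.6, 0.3, 0.1])
p3 = np.array([0.5, 0.4, 0.1])

avg = weighted_ensemble([p1, p2, p3], weights=[1, 1, 1])
pred = int(np.argmax(avg))  # index of the ensemble's predicted emotion class
```

In practice the weights would be tuned on a validation set, and the probability vectors would come from the trained networks; the averaging itself is exactly this one line regardless of model internals.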