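Computer science
Overfitting
Stacking
Speech recognition
Artificial intelligence
Feature (linguistics)
Convolutional neural network
Pattern recognition (psychology)
Hidden Markov model
Computational intelligence
Probabilistic logic
Artificial neural network
Linguistics
Nuclear magnetic resonance
Physics
Philosophy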
Authors
Dongdong Li, Linyu Sun, Xinlei Xu, Zhe Wang, Jing Zhang, Wenli Du
Identifier
DOI:10.1007/s11063-021-10581-z
Abstract
Speech Emotion Recognition (SER) poses a major challenge: distinguishing and interpreting the sentiments carried in speech. Fortunately, deep learning has proved highly capable of handling acoustic features. For instance, Bidirectional Long Short-Term Memory (BLSTM) is well suited to time-series acoustic features, and the Convolutional Neural Network (CNN) can discover local structure among different features. This paper proposes the BLSTM and CNN Stacking Architecture (BCSA) to improve emotion recognition. To match the input formats of the BLSTM and CNN, the feature matrices are sliced. To exploit the different roles of the BLSTM and CNN, Stacking is employed to integrate the two models. Specifically, to guard against overfitting, the probability estimates from the BLSTM and CNN are combined into new data using K-fold cross-validation. Finally, on top of the Stacking models, logistic regression is fitted to the new data to recognize emotions. The experimental results demonstrate that the proposed architecture outperforms either single model. Furthermore, compared with the state-of-the-art SER models known to the authors, the proposed BCSA may be more suitable for SER because it integrates time-series acoustic features with the local structure among different features.
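
The Stacking step described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The base learners here are hypothetical one-feature centroid scorers standing in for the BLSTM and CNN, the task is reduced to two classes, and the data are a made-up toy set. All function names (`out_of_fold_probs`, `make_centroid_learner`, `fit_logistic`) are assumptions introduced for illustration. What the sketch does show is the mechanism: each sample's meta-features come from base models trained on the other K-1 folds, and a logistic regression is then fitted to those out-of-fold probabilities.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and deal it into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def out_of_fold_probs(fit_fns, X, y, k=5, seed=0):
    """Build meta-features: each row holds base-model probabilities
    produced by models that never saw that sample during training."""
    n = len(X)
    meta = [[0.0] * len(fit_fns) for _ in range(n)]
    for fold in kfold_indices(n, k, seed):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        for j, fit in enumerate(fit_fns):
            predict = fit(X_tr, y_tr)
            for i in fold:
                meta[i][j] = predict(X[i])
    return meta

def make_centroid_learner(feature):
    """Hypothetical stand-in for a base model (NOT a BLSTM or CNN):
    scores one feature by its distance to the two class means."""
    def fit(X, y):
        pos = [x[feature] for x, t in zip(X, y) if t == 1]
        neg = [x[feature] for x, t in zip(X, y) if t == 0]
        mu1, mu0 = sum(pos) / len(pos), sum(neg) / len(neg)
        return lambda x: sigmoid(abs(x[feature] - mu0) - abs(x[feature] - mu1))
    return fit

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-learner: logistic regression trained by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(b + sum(wi * zi for wi, zi in zip(w, z))) - t
            grad_b += err
            for j, zj in enumerate(z):
                grad_w[j] += err * zj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return lambda z: sigmoid(b + sum(wi * zi for wi, zi in zip(w, z)))

# Toy, linearly separable data (an assumption for illustration only).
X = [[s * (1.0 + 0.1 * (i % 5)), s * (1.0 + 0.1 * (i % 3))]
     for i in range(20) for s in (1, -1)]
y = [1 if x[0] > 0 else 0 for x in X]

base_learners = [make_centroid_learner(0), make_centroid_learner(1)]
meta = out_of_fold_probs(base_learners, X, y, k=5)
meta_model = fit_logistic(meta, y)
predictions = [1 if meta_model(z) >= 0.5 else 0 for z in meta]
```

In a setting closer to the paper's, each base learner would emit one probability per emotion class, and the meta-learner would be a multinomial logistic regression over the concatenated class probabilities. The out-of-fold construction would stay the same.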