Journal: IEEE Transactions on Audio, Speech, and Language Processing [Institute of Electrical and Electronics Engineers]  Date: 2021-10-15  Volume/Issue: 29: 3280-3291
Identifier
DOI: 10.1109/taslp.2021.3120586
Abstract
Speech emotion recognition (SER) is an indispensable part of fluid human-machine interaction and has attracted considerable research attention. Recent work on SER has successfully applied convolutional neural networks (CNNs) to learn feature representations from speech spectrograms. However, a fundamental problem of CNNs is that spatial information in spectrograms is lost, including the positional and relational information of low-level features such as pitch and formant frequencies. We propose a novel architecture of sequential capsule networks (CapNets) that leverages the advantage of CapNets that spatial information can be preserved in capsules and passed to upper capsule layers via dynamic routing. Moreover, the dynamic routing algorithm provides an effective alternative to pooling or storing recurrent hidden states for obtaining utterance-level features from the sequential capsule outputs. To further improve the model's ability to capture contextual information, we introduce a recurrent connection into the sequential structure. An experimental comparison of the proposed systems with previously published systems using CNNs and recurrent neural networks (RNNs) on the IEMOCAP corpus demonstrates the effectiveness of the proposed sequential CapNets.
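The abstract relies on dynamic routing as the mechanism that aggregates lower-capsule outputs into upper-layer capsules. As background, a minimal NumPy sketch of routing-by-agreement in the style of Sabour et al.'s original CapsNet is shown below; the function names (`squash`, `dynamic_routing`), the tensor layout, and the iteration count are illustrative assumptions, not this paper's implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squashing non-linearity: short vectors shrink toward zero,
    # long vectors approach (but never reach) unit length.
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement sketch (illustrative, not the paper's code).

    u_hat: (num_in, num_out, dim) prediction vectors, i.e. each lower
    capsule's "vote" for each upper capsule's pose.
    Returns the upper-capsule outputs, shape (num_out, dim).
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits, start uniform
    for _ in range(num_iters):
        # Coupling coefficients: softmax over the upper capsules.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # Weighted sum of votes, then squash to get upper-capsule outputs.
        s = np.einsum('ij,ijd->jd', c, u_hat)
        v = squash(s)
        # Increase logits where votes agree with the current output.
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v
```

Because the squash keeps every output vector's norm below 1, the length of an upper capsule can be read as the probability that the entity (here, an emotion-relevant pattern) is present, which is what lets the routing step replace pooling when forming utterance-level features.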