Authors
Ke Liu,Chen Wang,Jiayue Chen,Jun Feng
Identifier
DOI:10.1007/978-3-030-98358-1_42
Abstract
In the field of Human-Computer Interaction (HCI), Speech Emotion Recognition (SER) is not only a fundamental step toward intelligent interaction but also plays an important role in smart environments, e.g., elderly home monitoring. Most deep-learning-based SER systems focus on handling high-level emotion-relevant features, so the low-level feature differences between the time and frequency dimensions are rarely analyzed, which leads to unsatisfactory accuracy in speech emotion recognition. In this paper, we propose Time-Frequency Attention (TFA) to mine significant low-level emotion features from both the time domain and the frequency domain. To make full use of the global information after the feature fusion performed by TFA, we utilize Squeeze-and-Excitation (SE) blocks to compare emotion features across channels. Experiments are conducted on a benchmark database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus. The results indicate that the proposed model outperforms state-of-the-art methods, with absolute improvements of 1.7% in average class accuracy over four emotion classes and 3.2% in weighted accuracy.
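To make the two mechanisms named in the abstract concrete, the following is a minimal numpy sketch of (a) attention weights computed separately along the time and frequency axes of a spectrogram-like feature map and then fused, and (b) a standard Squeeze-and-Excitation block. This is an illustrative simplification under our own assumptions (pooling-based attention logits, sum fusion, fixed weight matrices `w1`/`w2`), not the authors' actual TFA implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_freq_attention(x):
    """Hypothetical time-frequency attention over a (C, F, T) feature map.

    Separate attention weights are derived for the time axis and the
    frequency axis by mean pooling over the remaining axes; the two
    attended maps are then fused by summation.
    """
    t_w = sigmoid(x.mean(axis=(0, 1)))   # (T,) time-axis weights
    f_w = sigmoid(x.mean(axis=(0, 2)))   # (F,) frequency-axis weights
    return x * t_w[None, None, :] + x * f_w[None, :, None]

def se_block(x, w1, w2):
    """Squeeze-and-Excitation over channels of a (C, F, T) feature map.

    Squeeze: global average pooling per channel.
    Excite:  a two-layer bottleneck (ReLU, then sigmoid) produces a
             per-channel gate that rescales the input.
    """
    z = x.mean(axis=(1, 2))                   # squeeze: (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0))   # excite:  (C,)
    return x * s[:, None, None]

# Usage: both modules preserve the (C, F, T) shape of the input.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 32))          # 8 channels, 16 freq bins, 32 frames
w1 = 0.1 * rng.standard_normal((2, 8))        # bottleneck: reduction ratio 4
w2 = 0.1 * rng.standard_normal((8, 2))
y = se_block(time_freq_attention(x), w1, w2)  # y.shape == (8, 16, 32)
```

Because both modules only reweight the feature map, they can be dropped between convolutional layers without changing tensor shapes, which matches how attention and SE blocks are typically composed.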