Authors
Guoyan Li,Junjie Hou,Yi Liu,Jianguo Wei
Identifier
DOI:10.1016/j.eswa.2023.123110
Abstract
Speech emotion recognition (SER) is a crucial topic in human–computer interaction. However, extracting emotional embeddings remains challenging: embeddings extracted by network models often contain noise and incomplete emotional information. To meet these challenges, this study developed an innovative model (MVIB-DVA) composed of a multi-feature variational information bottleneck (MVIB) based on the information bottleneck (IB) principle and a dual-view aware module (DVAM) with an attention mechanism. MVIB employs the IB principle as the driving model and utilizes learned minimal sufficient single-feature emotional embeddings as auxiliary information. The aims are to capture unique emotional information in individual features and complementary information between different types of features, as well as to reduce noise and represent rich emotional information with fewer parameters. DVAM comprises (1) a frequency-domain statistical aware module (FDSAM) that emphasizes the frequencies that best reflect emotional information and (2) a frame aware module (FAM) in the time domain that focuses on the frames that contribute the most to the final emotion recognition result. A separate channel extracts details ignored by the frequency- and time-domain views, yielding more comprehensive emotional information. The experimental results show that our method performs strongly in recognizing speech emotions: MVIB-DVA achieved a weighted accuracy (WA) of 74.05% and an unweighted accuracy (UA) of 75.67% on the IEMOCAP dataset, and a WA of 86.66% and UA of 86.51% on the RAVDESS dataset.
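The variational information bottleneck objective that underlies MVIB trades off a sufficiency term (the embedding must predict the emotion label) against a minimality term (the embedding must discard input detail, which is what suppresses noise). A minimal sketch of that loss for a Gaussian encoder is below; the function name, argument layout, and the choice of a standard-normal prior are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vib_loss(mu, log_var, logits, labels, beta=1e-3):
    """Illustrative variational information bottleneck loss.

    mu, log_var : parameters of the Gaussian posterior q(z|x), shape (batch, dim)
    logits      : classifier outputs for p(y|z), shape (batch, n_classes)
    labels      : integer emotion labels, shape (batch,)
    beta        : trade-off between sufficiency and compression
    """
    # Sufficiency term: cross-entropy -log p(y|z) via a log-softmax
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])
    # Minimality term: KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
    kl = 0.5 * np.mean(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1))
    return ce + beta * kl
```

When `mu = 0` and `log_var = 0`, the posterior equals the prior, so the KL term vanishes and only the classification term remains; increasing `beta` compresses the embedding harder at the cost of predictive accuracy.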