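Perspective (graphical)
Computer science
Speech recognition
Deep learning
Emotion recognition
Natural language processing
Artificial intelligence
Human-computer interaction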
Identifier
DOI:10.1145/3681716.3681725
Abstract
Speech Emotion Recognition (SER) aims to identify and interpret emotional states conveyed through speech signals. Originally a branch of Affective Computing research, SER has burgeoned into a commercially viable field, seamlessly integrating into various facets of daily life. Traditional SER systems are limited by manual feature extraction, noise, and voice quality interference. Deep learning has transformed this domain by enabling the development of neural network architectures that learn complex representations directly from raw speech data while simultaneously mitigating noise. Nonetheless, advancements in deep learning introduce their own set of challenges, which fall into two main categories: 1) challenges stemming from the limitations of model architectures and the complexity of combining multiple modeling techniques; and 2) challenges related to trustworthiness, including scarcity of training data, privacy concerns, security issues, and considerations of AI fairness. This paper surveys the challenges in current practice from a deep learning perspective. We argue that despite the diverse nature of these challenges, prevalent solutions predominantly reside within the realm of deep learning methodology. Deep learning's extensive utility in SER now spans data augmentation, overfitting mitigation, privacy preservation, debiasing, and the promotion of fairness in AI systems, and it continues to evolve. Navigating the challenges and embracing the advancements in deep learning therefore remains pivotal to its ongoing integration and enhancement across diverse application domains.