Computer science
Artificial intelligence
Mutual information
Utterance
Transformer
Speech recognition
Pattern recognition
Unsupervised learning
Encoder
Domain adaptation
Natural language processing
Authors
Shiqing Zhang, Ruixin Liu, Yijiao Yang, Xiaoming Zhao, Jun Yu
Identifier
DOI: 10.1145/3503161.3548328
Abstract
This paper focuses on an interesting task, i.e., unsupervised cross-corpus Speech Emotion Recognition (SER), in which the labelled training (source) corpus and the unlabelled testing (target) corpus have different feature distributions, resulting in a discrepancy between the source and target domains. To address this issue, this paper proposes an unsupervised domain adaptation method integrating Transformers and Mutual Information (MI) for cross-corpus SER. First, our method employs Transformer encoder layers to capture the long-term temporal dynamics of an utterance from extracted segment-level log-Mel spectrogram features, producing a corresponding utterance-level feature for each utterance in the two domains. Then, we propose an unsupervised feature decomposition method with a hybrid Max-Min MI strategy to separately learn domain-invariant features and domain-specific features from the extracted mixed utterance-level features, such that the discrepancy between the two domains is eliminated as much as possible while their individual characteristics are preserved. Finally, an interactive multi-head attention fusion strategy is designed to learn the complementarity between domain-invariant and domain-specific features so that they can be interactively fused for SER. Extensive experiments on the IEMOCAP and MSP-Improv datasets demonstrate the effectiveness of the proposed method on unsupervised cross-corpus SER tasks, where it outperforms state-of-the-art unsupervised cross-corpus SER methods.
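The abstract outlines a three-stage pipeline: Transformer encoding of segment-level log-Mel features into an utterance-level feature, decomposition of that mixed feature into domain-invariant and domain-specific parts, and interactive multi-head attention fusion of the two parts for classification. The following is a minimal PyTorch sketch of how such a pipeline might be wired together; the module layout, dimensions, mean-pooling step, and the two linear decomposition heads are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossCorpusSER(nn.Module):
    """Illustrative sketch (assumed architecture, not the paper's code):
    Transformer encoding + feature decomposition + interactive fusion."""

    def __init__(self, n_mels=64, d_model=256, n_heads=4, n_layers=2, n_classes=4):
        super().__init__()
        # Project segment-level log-Mel features into the model dimension.
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two heads split the mixed utterance-level feature into
        # domain-invariant and domain-specific parts (assumed design).
        self.invariant_head = nn.Linear(d_model, d_model)
        self.specific_head = nn.Linear(d_model, d_model)
        # Interactive multi-head attention fusion: each part attends to the other.
        self.fuse_inv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse_spec = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, segments):
        # segments: (batch, n_segments, n_mels) segment-level log-Mel features
        h = self.encoder(self.proj(segments))   # long-term temporal dynamics
        utter = h.mean(dim=1)                   # pooled utterance-level feature
        z_inv = self.invariant_head(utter)      # domain-invariant part
        z_spec = self.specific_head(utter)      # domain-specific part
        # Queries swap roles so the two parts attend to each other.
        q_inv, q_spec = z_inv.unsqueeze(1), z_spec.unsqueeze(1)
        f_inv, _ = self.fuse_inv(q_inv, q_spec, q_spec)
        f_spec, _ = self.fuse_spec(q_spec, q_inv, q_inv)
        fused = torch.cat([f_inv.squeeze(1), f_spec.squeeze(1)], dim=-1)
        return self.classifier(fused), z_inv, z_spec

# Example: a batch of 8 utterances, each split into 30 log-Mel segments.
model = CrossCorpusSER()
logits, z_inv, z_spec = model(torch.randn(8, 30, 64))
```

During training, the hybrid Max-Min MI strategy described in the abstract would be realized with neural MI estimators: maximizing an MI estimate between the domain-invariant features of the source and target corpora while minimizing the MI between the invariant and specific parts of each utterance. The abstract does not name the estimators, so any concrete choice (e.g., a MINE-style lower bound for the max step and a CLUB-style upper bound for the min step) would be an assumption.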