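Robustness (evolution)
Computer science
Distillation
Microphone array
Microphone
Confusion
Artificial intelligence
Speech recognition
Feature extraction
Pattern recognition (psychology)
Speaker recognition
Chemistry
Psychology
Telecommunications
Biochemistry
Organic chemistry
Sound pressure
Psychoanalysis
Gene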
Authors
Yichi Wang,Jie Zhang,Shihao Chen,Weitai Zhang,Zhongyi Ye,Xinyuan Zhou,Li-Rong Dai
Identifier
DOI:10.1109/icassp48485.2024.10446870
Abstract
Target speaker extraction (TSE) based on direction of arrival (DOA) has a wide range of applications, e.g., in remote conferencing, hearing aids, and in-car speech interaction. Due to the inherent phase uncertainty, existing TSE methods usually suffer from speaker confusion within specific frequency bands. Imprecise DOA measurements, caused by e.g., the calibration of the microphone array and ambient noise, can also degrade TSE performance. To improve the robustness of TSE, in this work we propose several new multichannel spatiotemporal features to represent the discriminability of the target speaker. The narrow-band Conformer model is applied in combination with the proposed features to facilitate the extraction of the target speaker. In addition, we consider knowledge distillation for improving the model robustness, particularly in the presence of DOA mismatch. Experimental results on a public dataset verify the efficacy of the proposed method.
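The phase uncertainty mentioned in the abstract can be illustrated with a small sketch. It does not reproduce the features proposed in the paper; it uses a generic cosine-style directional feature that compares an observed inter-microphone phase difference with the one expected from the target DOA. The two-microphone far-field geometry, the 0.1 m spacing, the 343 m/s speed of sound, and all function names are assumptions made for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed value)


def expected_phase_difference(freq_hz, doa_deg, mic_spacing_m):
    """Far-field phase difference (radians) between two microphones
    for a plane wave arriving from doa_deg (hypothetical helper)."""
    delay_s = mic_spacing_m * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay_s


def directional_feature(observed_ipd, freq_hz, target_doa_deg, mic_spacing_m):
    """Cosine of the gap between the observed phase difference and the one
    expected from the target DOA. A value of 1.0 means the observation looks
    exactly like the target direction at this frequency."""
    expected = expected_phase_difference(freq_hz, target_doa_deg, mic_spacing_m)
    return math.cos(observed_ipd - expected)


if __name__ == "__main__":
    spacing = 0.1                    # metres (assumed)
    target_doa, interferer_doa = 60.0, 120.0
    for freq in (1000.0, 3430.0):
        # Phase difference produced by the interfering speaker alone.
        ipd = expected_phase_difference(freq, interferer_doa, spacing)
        score = directional_feature(ipd, freq, target_doa, spacing)
        print(f"{freq:6.0f} Hz: interferer scores {score:+.3f} against the target DOA")
```

Under these assumptions the interferer is clearly separated from the target at 1000 Hz, but at 3430 Hz its phase difference differs from the target's by a full 2π, so the feature cannot tell the two directions apart. This is one way a phase-based cue can become ambiguous in particular frequency bands.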