Computer science
Robustness (evolution)
Discriminative model
Artificial intelligence
Regression
Feature learning
Speech recognition
Pattern recognition (psychology)
Regression analysis
Machine learning
Noise (video)
Image (mathematics)
Mathematics
Statistics
Biochemistry
Chemistry
Gene
Authors
Qiushi Zhu,Long Zhou,Jie Zhang,Shujie Liu,Yu‐Chen Hu,Li-Rong Dai
Identifier
DOI:10.1109/icassp49357.2023.10095373
Abstract
Self-supervised pre-training methods based on contrastive learning or regression tasks can exploit large amounts of unlabeled data to improve the performance of automatic speech recognition (ASR). However, the robustness impact of combining the two pre-training tasks and of constructing different negative samples for contrastive learning remains unclear. In this paper, we propose a noise-robust data2vec for self-supervised speech representation learning by jointly optimizing the contrastive learning and regression tasks in the pre-training stage. Furthermore, we present two improved methods to facilitate contrastive learning. More specifically, we first propose to construct patch-based non-semantic negative samples to boost the noise robustness of the pre-trained model, which is achieved by dividing the features into patches of different sizes (i.e., the so-called negative samples). Second, by analyzing the distribution of positive and negative samples, we propose to remove the easily distinguishable negative samples to improve the discriminative capacity of pre-trained models. Experimental results on the CHiME-4 dataset show that our method improves the performance of the pre-trained model in noisy scenarios. We find that joint training of the contrastive learning and regression tasks can avoid model collapse to some extent, compared to training the regression task alone.
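The two improvements described above can be illustrated with a minimal sketch. The paper does not publish implementation details here, so the functions below are hypothetical: `patch_negatives` builds non-semantic negatives by splitting frame-level features into fixed-size patches and shuffling them along time, and `drop_easy_negatives` ranks candidate negatives by cosine similarity to the anchor and discards the least similar (easiest) ones. Patch sizes, the keep ratio, and the similarity measure are all assumptions for illustration only.

```python
import numpy as np

def patch_negatives(features, patch_size, seed=0):
    """Assumed patch-based negative construction: split the time axis of a
    (T, D) feature matrix into non-overlapping patches and shuffle them,
    producing negatives that keep local content but break temporal order."""
    rng = np.random.default_rng(seed)
    T, D = features.shape
    n = T // patch_size  # drop the remainder frames for simplicity
    patches = features[: n * patch_size].reshape(n, patch_size, D)
    return patches[rng.permutation(n)].reshape(n * patch_size, D)

def drop_easy_negatives(anchor, negatives, keep_ratio=0.5):
    """Assumed easy-negative filtering: keep only the fraction of negatives
    most cosine-similar to the anchor; dissimilar negatives are trivially
    distinguishable and contribute little to the contrastive loss."""
    sims = negatives @ anchor / (
        np.linalg.norm(negatives, axis=1) * np.linalg.norm(anchor) + 1e-8
    )
    k = max(1, int(len(negatives) * keep_ratio))
    hardest = np.argsort(sims)[::-1][:k]  # indices of the most similar negatives
    return negatives[hardest]

# Toy usage on random frame features (10 frames, 4 dims).
feats = np.random.default_rng(1).standard_normal((10, 4))
negs = patch_negatives(feats, patch_size=2)
kept = drop_easy_negatives(feats[0], negs, keep_ratio=0.5)
```

In an actual pre-training loop these negatives would feed a contrastive (e.g., InfoNCE-style) loss alongside the data2vec regression target; that combination is what the abstract reports as mitigating model collapse.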