Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing [Institute of Electrical and Electronics Engineers] Date: 2023-01-01 Volume/Issue: 31: 1825-1838
Identifier
DOI:10.1109/taslp.2023.3273417
Abstract
Deep speaker embedding learning based on neural networks has become the predominant approach to speaker verification (SV). Prior studies have investigated various network architectures, but few works address how to design and scale up networks in a principled way to achieve a better trade-off between model performance and computational complexity. In this paper, we focus on efficient architecture design for speaker verification. First, we systematically study the effect of network depth and width on performance and empirically find that depth matters more than width for the speaker verification task. Based on this observation, we propose a novel depth-first (DF) architecture design rule. Applying it to ResNet and ECAPA-TDNN yields two new families of much deeper models, namely DF-ResNets and DF-ECAPAs. In addition, to further boost the performance of small models in the low-computation regime, a novel attentive feature fusion (AFF) scheme is proposed to replace conventional feature fusion methods. Specifically, we design two fusion strategies, sequential AFF (S-AFF) and parallel AFF (P-AFF), which dynamically fuse features in a learnable way. Experimental results on the VoxCeleb dataset show that the proposed DF-ResNets and DF-ECAPAs achieve a much better trade-off between performance and complexity than the original ResNet and ECAPA-TDNN. Moreover, small models obtain up to 40% relative improvement in EER by adopting the AFF scheme, at negligible computational cost. Finally, a comprehensive comparison with other published SV systems shows that our proposed models achieve the best trade-off between performance and complexity in both low- and high-computation scenarios.
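To make the idea of "dynamically fusing features in a learnable way" concrete, the following is a minimal NumPy sketch of a gated attentive fusion step: an attention gate is computed from the two input feature vectors and used to take a per-dimension convex combination of them. This is only an illustration of the general mechanism, not the paper's exact S-AFF/P-AFF formulation; the function name `attentive_fuse` and the random stand-in weights `W`, `b` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attentive_fuse(x, y, W, b):
    """Fuse two feature vectors with a learned attention gate.

    The gate `a` (in (0, 1), per dimension) is computed from the
    element-wise sum of the inputs; the output is the convex
    combination a * x + (1 - a) * y.
    """
    a = sigmoid(W @ (x + y) + b)  # per-dimension attention weights
    return a * x + (1.0 - a) * y

d = 8
x = rng.standard_normal(d)
y = rng.standard_normal(d)
W = rng.standard_normal((d, d)) * 0.1  # stand-in for learned weights
b = np.zeros(d)

z = attentive_fuse(x, y, W, b)
assert z.shape == (d,)
# Each fused element lies between the corresponding x and y values,
# since the gate yields a per-dimension convex combination.
assert np.all((z >= np.minimum(x, y) - 1e-9) & (z <= np.maximum(x, y) + 1e-9))
```

In a real model, `W` and `b` would be trained jointly with the embedding network, so the fusion weights adapt to the inputs rather than being fixed as in conventional addition or concatenation.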