Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing [Institute of Electrical and Electronics Engineers] Date: 2023-01-01 Volume/Issue: 31: 1825-1838
Identifier
DOI:10.1109/taslp.2023.3273417
Abstract
Deep speaker embedding learning based on neural networks has become the predominant approach to speaker verification (SV). Prior studies have investigated various network architectures, but few works address how to design and scale up networks in a principled way to achieve a better trade-off between model performance and computational complexity. In this paper, we focus on efficient architecture design for speaker verification. First, we systematically study the effect of network depth and width on performance and empirically find that depth matters more than width for the speaker verification task. Based on this observation, we propose a novel depth-first (DF) architecture design rule. Applying it to ResNet and ECAPA-TDNN yields two new families of much deeper models, namely DF-ResNets and DF-ECAPAs. In addition, to further boost the performance of small models in the low-computation regime, a novel attentive feature fusion (AFF) scheme is proposed to replace conventional feature fusion methods. Specifically, we design two fusion strategies, sequential AFF (S-AFF) and parallel AFF (P-AFF), which dynamically fuse features in a learnable way. Experimental results on the VoxCeleb dataset show that the proposed DF-ResNets and DF-ECAPAs achieve a much better trade-off between performance and complexity than the original ResNet and ECAPA-TDNN. Moreover, small models obtain up to 40% relative improvement in EER by adopting the AFF scheme, at negligible computational cost. Finally, a comprehensive comparison with other published SV systems shows that our proposed models achieve the best trade-off between performance and complexity in both low- and high-computation scenarios.
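To make the idea of "dynamically fusing features in a learnable way" concrete, the following is a minimal NumPy sketch of a gated attentive fusion step: an attention gate is computed from the two input feature vectors and used to take a per-dimension convex combination of them. This is only an illustration of the general mechanism, not the paper's exact S-AFF/P-AFF formulation; the function name `attentive_fuse` and the random stand-in weights `W`, `b` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attentive_fuse(x, y, W, b):
    """Fuse two feature vectors with a learned attention gate.

    The gate `a` (in (0, 1), per dimension) is computed from the
    element-wise sum of the inputs; the output is the convex
    combination a * x + (1 - a) * y.
    """
    a = sigmoid(W @ (x + y) + b)  # per-dimension attention weights
    return a * x + (1.0 - a) * y

d = 8
x = rng.standard_normal(d)
y = rng.standard_normal(d)
W = rng.standard_normal((d, d)) * 0.1  # stand-in for learned weights
b = np.zeros(d)

z = attentive_fuse(x, y, W, b)
assert z.shape == (d,)
# Each fused element lies between the corresponding x and y values,
# since the gate yields a per-dimension convex combination.
assert np.all((z >= np.minimum(x, y) - 1e-9) & (z <= np.maximum(x, y) + 1e-9))
```

In a real model, `W` and `b` would be trained jointly with the embedding network, so the fusion weights adapt to the inputs rather than being fixed as in conventional addition or concatenation.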