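Overfitting
Spectrogram
Computer science
Convolutional neural network
Speech recognition
Identification (biology)
Channel (broadcasting)
Artificial intelligence
Task (project management)
Speaker recognition
Speaker diarisation
Deep learning
Pattern recognition (psychology)
Artificial neural network
Engineering
Botany
Biology
Computer network
Systems engineering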
Authors
Banala Saritha,Mohammad Azharuddin Laskar,Anish Monsley Kirupakaran,Rabul Hussain Laskar,Madhuchhanda Choudhury
Identifier
DOI:10.1016/j.compeleceng.2024.109100
Abstract
Advancements in deep learning for speaker identification are constrained by the limited availability of data, especially in law enforcement applications. This has led to the emergence of few-shot speaker identification, a technique that classifies unseen test samples with the help of a few support samples. Despite several attempts to advance few-shot speaker identification, significant challenges persist, including the extraction of robust speaker embeddings, the problem of overfitting, and the issue of prototype shift error. This paper proposes a few-shot speaker identification system employing a novel architecture called the Channel Attention-based Convolutional Recurrent Neural Network (CACRN-Net) with three-dimensional (3D) log Mel spectrogram inputs to mitigate overfitting and enhance the accuracy of speaker embeddings. Furthermore, a self-attention mechanism alleviates prototype shift errors caused by noisy data. The proposed framework is compared to existing methods on the VCTK and VoxCeleb1 speech corpora through 5-way, 5-shot learning experiments. To assess the performance of the framework under speech variability conditions, the IIT Guwahati (IITG) multi-variability (MV) speech database is used. The proposed approach outperforms state-of-the-art techniques, improving speaker identification accuracy by 2.73% on the VCTK database and by 2.3% on VoxCeleb1.
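To make the 5-way, 5-shot setting and the notion of prototype shift concrete, here is a minimal sketch of prototype-based classification on a toy episode. It is not the CACRN-Net model or the paper's self-attention mechanism: the 2-D "embeddings", the speaker names, and the distance-based softmax weighting in `attention_prototype` are all assumptions made for illustration, standing in for learned speaker embeddings and a learned attention module.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def attention_prototype(vectors):
    """Weighted prototype (illustrative stand-in, not the paper's method).

    Each support embedding is weighted by a softmax over its negative
    squared distance to the plain mean, so outliers contribute less.
    """
    centre = mean_vector(vectors)
    scores = [-squared_distance(v, centre) for v in vectors]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


def classify(query, prototypes):
    """Return the speaker whose prototype is nearest to the query."""
    return min(prototypes, key=lambda spk: squared_distance(query, prototypes[spk]))


# Toy 5-way, 5-shot episode with hypothetical 2-D embeddings.
JITTER = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.0, 0.0)]
centres = {f"spk{k}": (10.0 * k, 0.0) for k in range(5)}
support = {
    spk: [[cx + dx, cy + dy] for dx, dy in JITTER]
    for spk, (cx, cy) in centres.items()
}
support["spk0"][-1] = [4.0, 0.0]  # one noisy support sample for spk0

plain_protos = {spk: mean_vector(vs) for spk, vs in support.items()}
attn_protos = {spk: attention_prototype(vs) for spk, vs in support.items()}

query = [30.2, -0.1]  # an unseen sample lying near spk3
predicted = classify(query, attn_protos)
print(predicted)
print(plain_protos["spk0"], attn_protos["spk0"])
```

In this toy episode the single noisy sample drags the plain mean prototype of `spk0` away from its clean samples, while the weighted prototype stays close to them. That gap is the kind of prototype shift the abstract refers to.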