Spectrogram
Speech recognition
Computer science
Feature (linguistics)
Task (project management)
Active listening
Representation (politics)
Artificial intelligence
Pattern recognition (psychology)
Speech processing
Similarity (geometry)
Speaker recognition
Communication
Psychology
Philosophy
Linguistics
Management
Politics
Political science
Law
Economics
Image (mathematics)
Authors
Jichen Yang, Yi Zhou, Hao Huang
Identifier
DOI:10.1016/j.specom.2023.05.004
Abstract
The self-supervised speech representation (S3R) has succeeded in many downstream tasks, such as speaker recognition and voice conversion, thanks to the high-level information it encodes. Voice conversion (VC) is the task of converting source speech into a target speaker's voice. Although S3R features effectively encode content and speaker information, spectral features contain low-level acoustic information that is complementary to S3R, so relying solely on S3R features for VC may not be optimal. To obtain a speech representation that carries both high-level learned information and low-level spectral detail for VC, we propose a three-level attention mechanism to combine the Mel-spectrogram (Mel) and S3R, denoted Mel-S3R. In particular, the S3R features are high-level learned representations extracted by a network pre-trained with self-supervised learning, whereas Mel is the spectral feature representing the acoustic information. The proposed Mel-S3R is then used as the input to an any-to-any VQ-VAE-based VC system, and the experiments are performed as a downstream task. Objective metrics and subjective listening tests demonstrate that the proposed Mel-S3R speech representation enables the VC framework to achieve robust performance in terms of both speech quality and speaker similarity.
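The abstract describes fusing a low-level Mel-spectrogram with high-level S3R features through attention before feeding the result to a VQ-VAE-based VC model. The paper provides no code here, so the sketch below only illustrates that general idea under stated assumptions: wav2vec 2.0 (via torchaudio) stands in for the S3R extractor, an 80-bin Mel front end with a 20 ms hop provides the spectral stream, and a single cross-attention layer (Mel frames attending to S3R frames) stands in for the authors' three-level attention, whose exact structure is not specified in this abstract. All module names and hyperparameters are illustrative, not the paper's.

```python
# Minimal sketch, NOT the authors' Mel-S3R implementation:
# wav2vec 2.0 is assumed as the S3R extractor, and one cross-attention
# layer is a stand-in for the paper's three-level attention.
import torch
import torchaudio

SAMPLE_RATE = 16_000

# Low-level spectral stream: 80-bin Mel spectrogram with a 320-sample hop,
# roughly matching the wav2vec 2.0 frame rate (20 ms at 16 kHz).
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=320, n_mels=80
)

# High-level learned stream: a pre-trained self-supervised model.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
s3r_model = bundle.get_model().eval()


class MelS3RFusion(torch.nn.Module):
    """Fuse Mel and S3R features with cross-attention (Mel queries, S3R keys/values)."""

    def __init__(self, mel_dim: int = 80, s3r_dim: int = 768, d_model: int = 256):
        super().__init__()
        self.mel_proj = torch.nn.Linear(mel_dim, d_model)
        self.s3r_proj = torch.nn.Linear(s3r_dim, d_model)
        self.attn = torch.nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, mel: torch.Tensor, s3r: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, 80), s3r: (batch, frames, 768) -> (batch, frames, d_model)
        q = self.mel_proj(mel)
        kv = self.s3r_proj(s3r)
        fused, _ = self.attn(q, kv, kv)
        return fused


@torch.no_grad()
def extract_mel_s3r(waveform: torch.Tensor, fusion: MelS3RFusion) -> torch.Tensor:
    """waveform: (1, samples) mono 16 kHz audio -> fused features (1, frames, d_model)."""
    mel = mel_transform(waveform).clamp(min=1e-5).log().transpose(1, 2)  # (1, T_mel, 80)
    s3r = s3r_model.extract_features(waveform)[0][-1]                    # (1, T_s3r, 768)
    n = min(mel.size(1), s3r.size(1))        # crop both streams to a common frame count
    return fusion(mel[:, :n], s3r[:, :n])


if __name__ == "__main__":
    wav = torch.randn(1, SAMPLE_RATE)        # 1 s of dummy audio in place of real speech
    fused = extract_mel_s3r(wav, MelS3RFusion())
    print(fused.shape)                        # roughly (1, ~49, 256)
```

In a full pipeline of the kind the abstract outlines, such a fused sequence would replace a plain Mel or S3R input to the VQ-VAE-based conversion model; the fusion module would be trained jointly with that downstream VC network rather than used frozen as shown here.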