Topics: Face (sociological concept) · Computer science · Facial recognition system · Speech recognition · Face-to-face · Artificial intelligence · Facial expression
Authors
Hsiao-Han Lu, Shao-En Weng, Ya-Fan Yen, Hong-Han Shuai, Wen-Huang Cheng
Source
Venue: ACM Multimedia
Date: 2021-10-17
Pages: 496-505
Identifier
DOI: 10.1145/3474085.3475198
Abstract
Zero-shot voice conversion (VC) trained on non-parallel data has gained a lot of attention in recent years. Previous methods usually extract speaker embeddings from audio and use them to convert voices into different voice styles. Since there is a strong relationship between human faces and voices, a promising approach is to synthesize various voice characteristics from a face representation. Therefore, we introduce a novel idea of generating different voice styles from different human face photos, which can facilitate new applications, e.g., personalized voice assistants. However, the audio-visual relationship is implicit. Moreover, existing VC models are trained on laboratory-collected datasets without speaker photos, while the datasets containing both photos and audio are in-the-wild datasets. Directly replacing the target audio with the target photo and training on the in-the-wild dataset leads to noisy results. To address these issues, we propose a novel many-to-many voice conversion network, namely Face-based Voice Conversion (FaceVC), with a 3-stage training strategy. Quantitative and qualitative experiments on the LRS3-TED dataset show that the proposed FaceVC successfully performs voice conversion according to the target face photos. Audio samples can be found on the demo website at https://facevc.github.io/.
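The abstract's core idea, conditioning a content representation of the source audio on a style embedding derived from a target face photo rather than from target audio, can be sketched with toy stand-ins. This is a minimal illustration, not the FaceVC architecture: the encoders here are random linear maps, and the names (`face_style_embedding`, `content_features`, `convert`) are hypothetical, assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned encoders; the real system trains neural networks.
W_face = rng.standard_normal((8, 16))      # face photo (8-dim toy) -> style space
W_content = rng.standard_normal((32, 16))  # audio features (32-dim toy) -> content space

def face_style_embedding(face_photo):
    """Map a face photo to a speaker-style embedding (replaces audio-derived embeddings)."""
    return np.tanh(face_photo @ W_face)

def content_features(source_audio):
    """Extract speaker-independent content features from the source audio."""
    return np.tanh(source_audio @ W_content)

def convert(source_audio, target_face):
    """Combine source content with the target face's style embedding."""
    c = content_features(source_audio)
    s = face_style_embedding(target_face)
    # A real decoder would synthesize a spectrogram from (c, s); we just add them.
    return c + s

src = rng.standard_normal(32)      # one source utterance (toy features)
face_a = rng.standard_normal(8)    # two different target face photos
face_b = rng.standard_normal(8)

out_a = convert(src, face_a)
out_b = convert(src, face_b)
# The same source utterance converts to different outputs for different target faces.
print(np.allclose(out_a, out_b))  # False
```

The point of the sketch is the conditioning interface: swapping the target face photo changes the style embedding, and hence the converted output, while the content pathway from the source audio is unchanged.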