Keywords
Naturalness
Agglutinative language
Computer science
Artificial intelligence
Natural language processing
Speech synthesis
Autoencoder
Language model
Deep learning
Speech recognition
Linguistics
Parsing
Quantum mechanics
Physics
Philosophy
Authors
Rui Liu, Yifan Hu, Haolin Zuo, Zhaojie Luo, Longbiao Wang, Guanglai Gao
Source
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Institute of Electrical and Electronics Engineers)
Date: 2024-01-01
Volume/Pages: 32: 1075-1087
Citations: 4
Identifier
DOI: 10.1109/TASLP.2023.3348762
Abstract
Text-to-Speech (TTS) aims to convert input text into a human-like voice. With the development of deep learning, encoder-decoder based TTS models achieve superior naturalness in mainstream languages such as Chinese and English, and the linguistic information learning capability of the text encoder is key to this. However, for TTS of low-resource agglutinative languages, the scale of the <text, speech> paired data is limited. Therefore, how to extract rich linguistic information from small-scale text data to enhance the naturalness of the synthesized speech is an urgent issue. In this paper, we first collect a large unsupervised text corpus for BERT-like language model pre-training, and then adopt the trained language model to extract deep linguistic information from the input text of the TTS model, improving the naturalness of the final synthesized speech. To fully exploit the prosody-related linguistic information in agglutinative languages, we incorporate morphological information into the language model training and construct a morphology-aware-masking based BERT model (MAM-BERT). Experimental results based on various advanced TTS models validate the effectiveness of our approach, and a further comparison across data scales confirms its effectiveness in low-resource scenarios.
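For intuition, morphology-aware masking differs from standard BERT masked-language-model pre-training in that the mask/no-mask decision is made per morpheme rather than per subword token, so the model must reconstruct whole prosody-relevant morphological units (stems, case suffixes). The sketch below illustrates that idea only; the function name `mam_mask`, the toy subword segmentation, and the morpheme spans are illustrative assumptions, not the paper's released code, and a real system would obtain the spans from a morphological analyzer for the target agglutinative language.

```python
import random

MASK = "[MASK]"
# Hypothetical stand-in vocabulary for the 10% random-token replacement.
VOCAB = ["surgu", "##uli", "##iin", "nom", "##yn"]

def mam_mask(tokens, morpheme_spans, mask_prob=0.15, seed=0):
    """Morphology-aware masking sketch.

    tokens: subword token list.
    morpheme_spans: list of (start, end) token-index ranges, one per
        morpheme, assumed to come from a morphological analyzer.
    Returns (corrupted_tokens, labels); labels[i] is the original token
    where a prediction is required, else None.
    """
    rng = random.Random(seed)
    tokens = list(tokens)
    labels = [None] * len(tokens)
    for start, end in morpheme_spans:
        if rng.random() < mask_prob:        # select per morpheme, not per token
            for i in range(start, end):     # corrupt every token of the morpheme
                labels[i] = tokens[i]
                r = rng.random()            # standard BERT 80/10/10 corruption
                if r < 0.8:
                    tokens[i] = MASK
                elif r < 0.9:
                    tokens[i] = rng.choice(VOCAB)
    return tokens, labels

# Toy example: a word split as a two-subword stem plus a case suffix.
toks = ["surgu", "##uli", "##iin"]   # hypothetical segmentation
spans = [(0, 2), (2, 3)]             # stem = tokens 0-1, suffix = token 2
print(mam_mask(toks, spans, mask_prob=0.5))
```

The selection granularity is the only change from vanilla BERT masking: once a morpheme is chosen, the usual token-level 80/10/10 corruption applies inside it, mirroring whole-word masking but at the morpheme level.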