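Computer science
Speech recognition
End-to-end principle
Scale (ratio)
Acoustic model
Artificial intelligence
Speech processing
Geography
Cartography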
Authors
Li Wang,Dingguo Gao,Quzhen Suolang
Identifier
DOI:10.1109/prml59573.2023.10348322
Abstract
Tibetan is one of the important languages of China's ethnic minorities, with rich cultural and historical value. However, Tibetan speech recognition is challenging because of the complexity of its phonetic features and the scarcity of data. Although some results have been achieved, there is still considerable room for improvement. The non-encoder-decoder models widely used as acoustic models in Tibetan speech recognition experiments perform poorly on recognition tasks that depend on prediction sequence information. To address this problem, we propose an end-to-end Tibetan speech recognition acoustic model based on multiscale features. We compare a baseline model built on the attention-based encoder-decoder speech recognition framework with four Tibetan speech recognition acoustic models, and then improve the baseline by using a hybrid loss function and multiscale features for feature extraction. The experimental results show that the attention-based encoder-decoder model is feasible for Tibetan speech recognition, and that the hybrid loss function and multiscale features improve its recognition performance. The proposed model currently achieves the best results on the Tibetan Lhasa dialect, with a word error rate of 15.04% on the test set.
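For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a word error rate is commonly computed: the edit distance (substitutions, deletions, insertions) between reference and hypothesis token sequences, divided by the number of reference tokens. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not say which token unit or scoring tool the authors used for Tibetan.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance divided by the reference length.

    Assumption: tokens are separated by whitespace. The unit actually
    scored in the paper (word, syllable, character) is not stated in
    the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference tokens
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a hypothesis with one substituted and one missing token against a four-token reference scores 2/4 = 0.5, and a value can exceed 1.0 when the hypothesis contains many insertions.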