Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing [Institute of Electrical and Electronics Engineers]  Date: 2023-11-23  Volume/Issue: 32: 554-563  Citations: 2
Identifiers
DOI:10.1109/taslp.2023.3335807
Abstract
Automatic pronunciation assessment is an indispensable technology in computer-assisted pronunciation training systems. To evaluate pronunciation quality more thoroughly, multi-task learning that simultaneously outputs scores at multiple granularities and for multiple aspects has become the mainstream solution. Existing methods either predict scores at all granularity levels simultaneously through a parallel structure, or predict the score at each granularity layer by layer through a hierarchical structure. However, these methods do not fully understand or exploit the correlation between the three granularity levels of phoneme, word, and utterance. To address this issue, we propose a novel method, the Granularity-decoupled Transformer (Gradformer), which models the relationships between multiple granularity levels. Specifically, we first use a convolution-augmented transformer encoder to encode the acoustic features, where the convolution module helps the model better capture local information. The encoder outputs the highly correlated phoneme- and word-level scores. Then, utterance queries interact with the encoder output through a transformer decoder, ultimately yielding the utterance scores. Through this encoder-decoder architecture, we decouple the three granularity levels while still handling the relationships between them. Experiments on the speechocean762 dataset show that our model outperforms state-of-the-art methods on various metrics, especially key metrics such as phoneme accuracy, word accuracy, and total score.
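The abstract's three-level design (frame-wise phoneme scores and word scores from the encoder, utterance scores from queries cross-attending to the encoder output in a decoder) can be sketched with plain numpy. This is a minimal, single-head illustration under assumed shapes and random weights, not the paper's actual Gradformer implementation; the word-boundary spans and the number of utterance aspects are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d, n_phones, n_aspects = 50, 64, 40, 3  # hypothetical sizes

# Stand-in for the convolution-augmented transformer encoder output:
# T acoustic frames, each a d-dimensional feature vector.
H = rng.standard_normal((T, d))

# Phoneme-level head: one score vector per frame (from the encoder).
W_p = rng.standard_normal((d, n_phones)) * 0.1
phoneme_scores = H @ W_p                        # (T, n_phones)

# Word-level head: mean-pool the frames of each word, then score.
# Word boundaries here are assumed for illustration.
word_spans = [(0, 20), (20, 35), (35, 50)]
word_feats = np.stack([H[s:e].mean(axis=0) for s, e in word_spans])
w_w = rng.standard_normal(d) * 0.1
word_scores = word_feats @ w_w                  # (len(word_spans),)

# Utterance level: learnable queries (e.g. one per scoring aspect)
# cross-attend to the encoder output, a single-head stand-in for
# the transformer decoder described in the abstract.
Q = rng.standard_normal((n_aspects, d))
attn = softmax(Q @ H.T / np.sqrt(d), axis=-1)   # (n_aspects, T)
utt_feats = attn @ H                            # (n_aspects, d)
w_u = rng.standard_normal(d) * 0.1
utterance_scores = utt_feats @ w_u              # (n_aspects,)

print(phoneme_scores.shape, word_scores.shape, utterance_scores.shape)
```

The key structural point the sketch mirrors is the decoupling: phoneme and word scores are read directly off the shared encoder representation, while utterance scores come from a separate query/cross-attention path, so each granularity has its own head but all share the same acoustic encoding.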