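Computer science
Natural language processing
Artificial intelligence
Task (project management)
Spelling
Security token
Edit distance
Speech recognition
Linguistics
Philosophy
Computer security
Management
Economics
Authors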
Idhibhat Pankam,Peerat Limkonchotiwat,Ekapol Chuangsuwanich
Identifier
DOI:10.1109/jcsse58229.2023.10202006
Abstract
Thai spelling detection and correction is the task of detecting and correcting spelling mistakes in Thai text. A widely used approach combines edit distance with dictionary mapping to identify and correct misspelled words. However, this technique ignores the context of the input and can yield incorrect answers. In this paper, we propose a two-stage framework that leverages pre-trained language models, such as WangchanBERTa, for correcting Thai misspellings. Our proposed method consists of a misspelling detection model based on the token prediction task and a correction model based on masked prediction. Additionally, character edit distance (CED) can be used as a post-processing step to help improve the correction results. Our experiments were conducted on two standard datasets for Thai misspellings, namely UGWC and VISTEC-TPTH-2021. On the UGWC data, our model corrects 3,867 of the 6,399 misspelled words (60.43%), which is higher than the baseline's rate of 41.75%. (Footnote 1: https://github.com/bookpanda/Two-stage-Thai-Misspelling-Correction-Based-on-Pre-trained-Language-Models)
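The abstract does not say how character edit distance is applied as a post-processing step, so the following is only a minimal sketch of one plausible reading: candidates proposed by a correction model are filtered and re-ranked by their Levenshtein distance to the misspelled token. The function names `char_edit_distance` and `rerank_by_ced`, the `max_distance` threshold, and the English example words are all assumptions made for illustration and are not taken from the paper or its repository.

```python
from typing import List, Optional


def char_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rerank_by_ced(misspelled: str,
                  candidates: List[str],
                  max_distance: int = 2) -> Optional[str]:
    """Pick the candidate closest to the misspelled token (hypothetical helper).

    `candidates` is assumed to be ordered by model score, best first.
    Candidates farther than `max_distance` edits are discarded; ties are
    broken in favour of the higher-ranked candidate.
    """
    scored = [(char_edit_distance(misspelled, c), rank, c)
              for rank, c in enumerate(candidates)]
    kept = [s for s in scored if s[0] <= max_distance]
    if not kept:
        return None  # no candidate is close enough; leave the token as-is
    return min(kept)[2]


if __name__ == "__main__":
    # English words used only for readability; Python strings compare by
    # code point, so Thai combining marks would each count as one character.
    print(rerank_by_ced("recieve", ["perceive", "receive"]))
```

In this sketch the context-aware ranking from the language model is kept as a tie-breaker, while the distance threshold discards suggestions that fit the context but look nothing like the original token.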