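Computer science
Architecture
Natural language processing
Word (group theory)
Field (mathematics)
Artificial intelligence
Text segmentation
Segmentation
Speech recognition
Linguistics
History
Mathematics
Philosophy
Pure mathematics
Archaeology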
Authors
Peng Li, Honggang Fan, Junyan Cao, Zhangyu Guan
Abstract
At present, one problem in Chinese word segmentation is the low efficiency of out-of-vocabulary (OOV) word detection in specialized domains. Because of the characteristics of the domain's own vocabulary, word segmentation of architectural texts is not very effective at identifying OOV words. This paper proposes a new unsupervised method for recognizing OOV words, based on an improved algorithm and entropy. The method uses the algorithm to identify strings with relatively strong interdependencies in the text, filters them through a stop-word list and a corpus to obtain a candidate dictionary, calculates the entropy of the candidates, and determines the final OOV words by setting a suitable threshold. The recognized OOV words are then added as a professional dictionary for word segmentation. Experiments show that the proposed algorithm significantly improves OOV recognition in architectural texts. Compared with the original algorithm, precision (P) increased by 15.92% and recall (R) increased by 7.61%. As a result, the final word segmentation precision reaches 82.15% and recall reaches 80.45%.
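The pipeline described in the abstract (score string cohesion, filter against a stop-word list and known vocabulary, then apply an entropy threshold) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' method: the abstract does not name the improved algorithm, so pointwise mutual information is swapped in here as an assumed cohesion score, and the entropy step is assumed to be the entropy of the characters adjacent to each candidate. The function names, the threshold values, and the restriction to two-character strings are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def branch_entropy(neighbors):
    """Shannon entropy (in nats) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def find_oov_candidates(text, stop_words, known_words,
                        pmi_threshold=1.0, entropy_threshold=0.5):
    """Return two-character strings passing both (hypothetical) thresholds.

    Assumption: PMI stands in for the unnamed cohesion algorithm, and the
    entropy is taken over the characters to the left and right of a string.
    """
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = len(text)
    n_bigrams = max(len(text) - 1, 1)

    # Collect the characters seen immediately left and right of each bigram.
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(len(text) - 1):
        gram = text[i:i + 2]
        if i > 0:
            left[gram][text[i - 1]] += 1
        if i + 2 < len(text):
            right[gram][text[i + 2]] += 1

    result = []
    for gram, count in bigrams.items():
        # Filter step: drop stop words and words already in the vocabulary.
        if gram in stop_words or gram in known_words:
            continue
        p_xy = count / n_bigrams
        p_x = chars[gram[0]] / n_chars
        p_y = chars[gram[1]] / n_chars
        pmi = math.log(p_xy / (p_x * p_y))
        # A string used in varied contexts has high entropy on both sides.
        entropy = min(branch_entropy(left[gram]), branch_entropy(right[gram]))
        if pmi >= pmi_threshold and entropy >= entropy_threshold:
            result.append(gram)
    return sorted(result)
```

For example, `find_oov_candidates("cabdeabfgabh", set(), set())` keeps only `"ab"`, the one string that recurs with different neighbors on both sides. The same code runs unchanged on Chinese character strings; in practice the thresholds would need tuning on the target corpus.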