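Computer science
Loanword
Identification (biology)
Natural language processing
Artificial intelligence
Word embedding
Similarity (geometry)
Word (group theory)
Embedding
Information retrieval
Linguistics
Philosophy
Plant
Image (mathematics)
Biology
Identification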
DOI:10.1016/j.csl.2023.101517
Abstract
To alleviate resource scarcity and improve robustness in loanword identification, this study proposes a novel loanword identification method based on Wikipedia. We first describe how to obtain loanword candidate datasets and comparable corpora from Wikipedia. Building on these corpora, we develop a pseudo-data generation model for loanword identification tasks. We then propose a loanword identification model, the PK-SM-Bi-LSTM-CRF framework, which builds on a bidirectional LSTM-CRF architecture and is further enhanced by prior knowledge and self-matching attention. The proposed method has two main advantages. First, in addition to the commonly used word embedding and character embedding features, it incorporates several other features: subword embedding, lexical similarity, word alignment, and semantic similarity. Second, geographic distance serves as a primary principle in selecting the best-matched donor word from several candidates. To evaluate the effectiveness of the proposed method, we conducted a series of experiments in different languages. Experimental results show that the proposed method outperforms all baseline systems.
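The abstract does not say how geographic distance or lexical similarity are computed, so the following is only a minimal sketch of the donor-selection idea under stated assumptions: each candidate donor word carries a representative latitude/longitude for its language, distance is the great-circle (haversine) distance, and lexical similarity (here Python's `difflib` ratio, a stand-in for whatever measure the authors use) only breaks ties. The function names, the coordinates, and the example words are hypothetical.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def lexical_similarity(a, b):
    """Similarity in [0, 1]; an assumed stand-in for the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def pick_donor(word, recipient_coord, candidates):
    """Pick the donor word whose language is geographically closest.

    candidates: list of (donor_word, (lat, lon)) pairs (hypothetical format).
    Geographic distance is the primary criterion; higher lexical
    similarity to `word` breaks ties between equally distant candidates.
    """
    def sort_key(candidate):
        donor, (lat, lon) = candidate
        distance = haversine_km(recipient_coord[0], recipient_coord[1], lat, lon)
        return (distance, -lexical_similarity(word, donor))

    return min(candidates, key=sort_key)[0]


# Illustrative data only: the coordinates are made up, not real language locations.
example_candidates = [
    ("kitab", (0.0, 10.0)),
    ("kitap", (0.0, 1.0)),
    ("book", (0.0, 50.0)),
]
best = pick_donor("kitob", (0.0, 0.0), example_candidates)
```

Ordering by a `(distance, -similarity)` tuple keeps distance strictly primary, which mirrors the abstract's wording; a weighted combination of the two scores would be a different design choice.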