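Slavic languages
Typology
Computer science
Lexical diversity
Natural language processing
Artificial intelligence
Linguistic typology
Perspective (graphical)
Diversity (politics)
Linguistics
Cluster analysis
Geography
Sociology
Vocabulary
Anthropology
Philosophy
Archaeology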
Authors
Chenliang Zhou,Haitao Liu
Abstract
This study proposes a linguistic classification method based on quantitative typology. The method leverages a large-scale multilingual parallel corpus to obtain valid language classification results by excluding the influence of covariates such as text genre and semantic content in cross-language comparison. To achieve this, we model the type–token relationship of each Slavic parallel text and calculate its lexical diversity to approximate the morphological complexity of the language. We then cluster the languages automatically on the basis of these lexical diversity metrics. Our findings show that (1) the lexical diversity metrics reliably reflect where a language lies on the ‘analytism-synthetism’ continuum; (2) the automatic clustering based on these metrics effectively reflects the genealogical classification of Slavic languages; and (3) the geographical distribution of lexical diversity in the region where Slavic languages are spoken shows a monotonically increasing trend from southwest to northeast, which is consistent with the pattern found by previous authors on a global scale. The methodological approach taken in this study is data-driven, with the benefit of being independent of theoretical assumptions and easy to process computationally. This approach can offer better insight into corpus-based typology and may shed light on the understanding of language as a human-driven complex adaptive system.
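To make the idea of a type–token relationship concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify which lexical diversity metrics were used, so the plain type-token ratio here is an assumption chosen for illustration. The function names and the two toy token sequences are hypothetical; the tokens are abstract symbols, not data from the study's corpus.

```python
def type_token_curve(tokens):
    """Cumulative count of distinct word forms (types) after each token.

    This running count is one simple way to describe the
    type-token relationship of a text.
    """
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def type_token_ratio(tokens):
    """Types divided by tokens: an assumed, basic lexical diversity metric."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(tokens)) / len(tokens)


# Hypothetical toy "parallel texts" of equal length (8 tokens each).
# In the analytic-like text the same word forms recur unchanged; in the
# synthetic-like text each recurrence carries a different inflected form
# (a1, a2, a3 stand for three forms of one lexeme).
analytic_like = "a b c a d e a b".split()
synthetic_like = "a1 b1 c1 a2 d1 e1 a3 b2".split()

ttr_analytic = type_token_ratio(analytic_like)
ttr_synthetic = type_token_ratio(synthetic_like)
curve_analytic = type_token_curve(analytic_like)
```

With equal text length and content held constant, the text with more distinct inflected forms gets the higher ratio, which is the intuition behind using lexical diversity as a proxy for morphological complexity. A clustering step would then group languages by such values; which algorithm the study used is not stated in the abstract.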