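Mandarin Chinese
Diversity (politics)
Computer science
Lexical diversity
Linguistics
Artificial intelligence
Natural language processing
Security token
Logistic regression
Machine learning
Sociology
Philosophy
Computer security
Vocabulary
Anthropology
Authors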
Linlin Sun,David Correia Saavedra
Source
Journal: Corpora
[Edinburgh University Press]
Date: 2020-11-01
Volume (issue): 15 (3): 317-342
Citations: 3
Identifier
DOI:10.3366/cor.2020.0202
Abstract
This paper applies a quantitative model developed for measuring grammatical status to data from the Lancaster Corpus of Mandarin Chinese (LCMC). The model takes into account four quantitative factors (token frequency, collocate diversity, colligate diversity and deviation of proportions) and uses them as predictors in a binary logistic regression in order to compute, for each given element, a score of grammatical status between ‘0’ (lexical/non-grammatical) and ‘1’ (highly grammatical). The results of the LCMC model are then compared to those of a similar study of the British National Corpus (BNC). The comparison suggests that token frequency is one of the most relevant parameters for quantifying degrees of grammatical status in both language models, together with the collocate diversity measure when a broad window span is used. By contrast, the colligational measures (left- or right-based) and the other collocate diversity measures using small spans (left- or right-based) contribute very differently in the two languages, owing to their typologically distinct structures.
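As a rough illustration of the kind of model the abstract describes, the following minimal sketch fits a binary logistic regression on four predictors and returns a score between 0 and 1. It is not the authors' implementation: the feature values are invented toy numbers (not taken from the LCMC or the BNC), the scaling of each feature to the range 0 to 1 is an assumption, and the names `TRAIN`, `fit` and `score` are hypothetical.

```python
import math

# Toy, made-up training data (NOT from the LCMC or BNC). Each row is
# [token frequency, collocate diversity, colligate diversity,
#  deviation of proportions], assumed to be pre-scaled to about [0, 1].
# Label 1 = grammatical element, label 0 = lexical element.
TRAIN = [
    ([0.90, 0.80, 0.70, 0.10], 1),
    ([0.80, 0.90, 0.60, 0.20], 1),
    ([0.70, 0.70, 0.80, 0.15], 1),
    ([0.20, 0.30, 0.20, 0.80], 0),
    ([0.10, 0.20, 0.30, 0.90], 0),
    ([0.30, 0.10, 0.10, 0.70], 0),
]


def sigmoid(z):
    """Map a real-valued linear predictor onto the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Return the modelled probability that an element is grammatical."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def fit(data, lr=0.5, epochs=2000):
    """Fit the weights by batch gradient descent on the log-loss."""
    n_features = len(data[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in data:
            error = score(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias


weights, bias = fit(TRAIN)

# Two unseen, equally hypothetical elements.
grammatical_like = score(weights, bias, [0.85, 0.75, 0.65, 0.10])
lexical_like = score(weights, bias, [0.15, 0.20, 0.25, 0.85])
print(round(grammatical_like, 3), round(lexical_like, 3))
```

In practice a statistics package would normally be used for the fit, since it also reports coefficient estimates and significance levels, which is what a comparison of predictor relevance across two corpora would rely on.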