Generalizability theory
Ground truth
Reliability (semiconductor)
Interview
Context (archaeology)
Artificial intelligence
Computer science
Sample (material)
Psychology
Natural language processing
Machine learning
Developmental psychology
Power (physics)
Chemistry
Physics
Chromatography
Quantum mechanics
Paleontology
Political science
Law
Biology
Authors
Louis Hickman,Josh Liff,Caleb Rottman,Charles Calderwood
Identifier
DOI:10.1177/10944281241264027
Abstract
While machine learning (ML) can validly score psychological constructs from behavior, several conditions often change across studies, making it difficult to understand why the psychometric properties of ML models differ across studies. We address this gap in the context of automatically scored interviews. Across multiple datasets, for interview- or question-level scoring of self-reported, tested, and interviewer-rated constructs, we manipulate the training sample size and natural language processing (NLP) method while observing differences in ground truth reliability. We examine how these factors influence the ML model scores’ test–retest reliability and convergence, and we develop multilevel models for estimating the convergent-related validity of ML model scores in similar interviews. When the ground truth is interviewer ratings, hundreds of observations are adequate for research purposes, while larger samples are recommended for practitioners to support generalizability across populations and time. However, self-reports and tested constructs require larger training samples. Particularly when the ground truth is interviewer ratings, NLP embedding methods improve upon count-based methods. Given mixed findings regarding ground truth reliability, we discuss future research possibilities on factors that affect supervised ML models’ psychometric properties.
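The abstract compares count-based and embedding NLP methods and evaluates how well ML model scores converge with a ground truth (e.g., interviewer ratings). Below is a minimal, hypothetical Python sketch of that kind of comparison, not the authors' actual pipeline: it uses scikit-learn's CountVectorizer for the count-based condition, LSA (TF-IDF plus truncated SVD) as a stand-in for embedding methods, and cross-validated Pearson correlation as a rough convergence estimate. All data, variable names, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: count-based vs. embedding-style NLP features for
# supervised scoring of interview responses, with "convergence" estimated
# as the cross-validated correlation between ML scores and ground truth.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline


def convergence(texts, ratings, model, cv=5):
    """Cross-validated Pearson r between ML model scores and ground truth."""
    scores = cross_val_predict(model, texts, ratings, cv=cv)
    return np.corrcoef(scores, ratings)[0, 1]


# Toy stand-in data: a real study would use transcribed interview responses
# and interviewer ratings (or self-reports / test scores) as ground truth.
rng = np.random.default_rng(0)
topics = ["teamwork", "conflict", "planning", "leadership"]
texts, ratings = [], []
for _ in range(80):
    k = rng.integers(len(topics))
    texts.append(f"in my last role I focused on {topics[k]} with my team")
    ratings.append(3.0 + 0.5 * k + rng.normal(0.0, 0.5))  # rating tracks topic
ratings = np.array(ratings)

# Count-based condition: raw term counts fed to a linear scorer.
count_model = make_pipeline(CountVectorizer(), Ridge(alpha=1.0))

# Embedding-style condition: dense low-rank features (LSA) as a simple
# proxy for the contextual embeddings the abstract refers to.
embed_model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=8, random_state=0),
    Ridge(alpha=1.0),
)

print(f"count-based convergence r = {convergence(texts, ratings, count_model):.2f}")
print(f"embedding convergence r   = {convergence(texts, ratings, embed_model):.2f}")
```

In this framing, the abstract's manipulations map onto the sketch directly: training sample size corresponds to the number of transcripts, the NLP method to the feature pipeline, and ground truth reliability to how noisy the ratings are.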