Reliability (semiconductor)
Metric (unit)
Intraclass correlation
Human reliability
Computer science
Measure (data warehouse)
Gold standard (test)
Psychology
Artificial intelligence
Data science
Applied psychology
Psychometrics
Reliability engineering
Data mining
Engineering
Clinical psychology
Statistics
Human error
Mathematics
Operations management
Power (physics)
Physics
Quantum mechanics
Source
Journal: Cornell University - arXiv
Date: 2023-01-01
Citations: 3
Identifiers
DOI:10.48550/arxiv.2304.05372
Abstract
ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs) that promise applications across diverse areas. In education, these AI technologies have been tested for applications in assessment and teaching. In assessment, AI has long been used in automated essay scoring and automated item generation. One psychometric property these tools must have to assist or replace humans in assessment is high reliability, in terms of agreement between AI scores and human raters. In this paper, we measure the reliability of the OpenAI ChatGPT and Google Bard LLM tools against experienced and trained human raters in perceiving and rating the complexity of writing prompts. Using intraclass correlation (ICC) as a performance metric, we found that the inter-rater reliability of both OpenAI ChatGPT and Google Bard was low against the gold standard of human ratings.
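The abstract names intraclass correlation (ICC) as the agreement metric between AI and human ratings. As a minimal illustrative sketch (the abstract does not specify which ICC form the paper used; ICC(2,1), the two-way random-effects, absolute-agreement, single-rater form, is assumed here), agreement between raters could be computed like this:

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of n subjects, each a list of k rater scores.
    This is an illustrative implementation, not the paper's code.
    """
    n = len(ratings)            # number of subjects (e.g. writing prompts)
    k = len(ratings[0])         # number of raters (e.g. human, ChatGPT, Bard)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # ANOVA sums of squares
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)      # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)      # raters
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # mean square, subjects
    msc = ss_cols / (k - 1)                 # mean square, raters
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Perfect agreement between two raters yields ICC = 1.0;
# systematic disagreement drives it toward (or below) zero.
print(icc_2_1([[1, 1], [2, 2], [3, 3]]))   # → 1.0
print(icc_2_1([[1, 5], [2, 4], [3, 3]]))   # negative: raters disagree
```

In practice a library routine (e.g. `pingouin.intraclass_corr`) would also report confidence intervals, which matter when judging whether an LLM's agreement with the human gold standard is acceptably high.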