Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board–style Examination

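Medicine, Repeatability, Confidence interval, Reliability (semiconductor), Robustness (evolution), Medical physics, Radiology, Statistics, Internal medicine, Power (physics), Physics, Biochemistry, Mathematics, Chemistry, Quantum mechanics, Gene
Authors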
Satheesh Krishna,Nishaant Bhambra,Robert R. Bleakney,Rajesh Bhayana,Sarah Atzen
Source
Journal: Radiology [Radiological Society of North America]
Volume (issue): 311 (2) Citations: 20
Identifier
DOI:10.1148/radiol.232715
Abstract

Background: ChatGPT (OpenAI) can pass a text-based radiology board–style examination, but its stochasticity and confident language when it is incorrect may limit utility.

Purpose: To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board–style examination.

Materials and Methods: In this exploratory prospective study, 150 radiology board–style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1–10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt.

Results: Neither version showed a difference in accuracy over three attempts: for the first, second, and third attempt, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); and accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After challenge prompt, both changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1–10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; and GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; and GPT-4, 77% [27 of 35], respectively; P = .89).

Conclusion: Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but more influenced by an adversarial prompt.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Ballard in this issue.
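
The abstract reports two kinds of summary figures: accuracy per attempt (reliability) and the share of questions answered identically on every attempt (repeatability). The following is a minimal Python sketch of how such figures can be derived, not the authors' analysis code. The correct-answer counts are taken from the abstract; the function names `percent` and `full_agreement_rate` and the four-question `toy_attempts` data are hypothetical and exist only for illustration. The P values and κ statistics are not reproduced here because the abstract does not state which tests were used.

```python
# Correct answers out of 150 on attempts 1, 2, and 3, as reported in the abstract.
REPORTED_CORRECT = {
    "GPT-3.5": [104, 95, 91],
    "GPT-4": [121, 117, 115],
}
N_QUESTIONS = 150


def percent(count, total):
    """Return count/total expressed as a percentage."""
    return 100.0 * count / total


def full_agreement_rate(attempts):
    """Fraction of questions that received the same answer on every attempt.

    `attempts` is a list of equal-length answer lists, one list per attempt.
    """
    n_questions = len(attempts[0])
    # zip(*attempts) groups the answers given to each question across attempts.
    identical = sum(1 for answers in zip(*attempts) if len(set(answers)) == 1)
    return identical / n_questions


# Reliability: accuracy per attempt, rounded to one decimal place.
accuracy = {
    model: [round(percent(c, N_QUESTIONS), 1) for c in counts]
    for model, counts in REPORTED_CORRECT.items()
}

# Repeatability on hypothetical toy data (NOT study data):
# four questions, three attempts; questions 1 and 4 are answered identically.
toy_attempts = [
    ["A", "B", "C", "D"],
    ["A", "B", "D", "D"],
    ["A", "C", "D", "D"],
]
toy_agreement = full_agreement_rate(toy_attempts)

if __name__ == "__main__":
    for model, values in accuracy.items():
        print(model, values)
    print("toy agreement:", toy_agreement)
```

With rounding to one decimal place, 121 of 150 comes out as 80.7%, whereas the abstract prints 80.6%; the printed value is what truncating, rather than rounding, would give. The remaining recomputed percentages match the abstract.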