Performance of Generative Large Language Models on Ophthalmology Board–Style Questions

Authors
Louis Cai, Abdulla Shaheen, Andrew Jin, Riya Fukui, Jonathan Yi, Nicolas A. Yannuzzi, Chrisfouad R. Alabiad
Source
Journal: American Journal of Ophthalmology [Elsevier]
Volume/pages: 254: 141-149; cited by: 76
Identifier
DOI: 10.1016/j.ajo.2023.05.024
Abstract

Purpose: To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.
Design: Experimental study.
Methods: This study evaluated three large language models (LLMs) with chat interfaces, Bing Chat (Microsoft) and ChatGPT 3.5 and 4.0 (OpenAI), using 250 questions from the Basic Science and Clinical Science (BCSC) Self-Assessment Program (SAP). While ChatGPT is trained on information last updated in 2021, Bing Chat incorporates more recently indexed internet search results to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or non-logical reasoning were documented.
Main outcome measures: The primary outcome was response accuracy. Secondary outcomes were performance in question subcategories and hallucination frequency.
Results: Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), while ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled in workup-type questions (OR = 3.89, 95% CI 1.19-14.73, p = 0.03) compared with diagnostic questions, but struggled with image interpretation (OR = 0.14, 95% CI 0.05-0.33, p < 0.01) compared with single-step reasoning questions. Relative to single-step questions, Bing Chat also faced difficulties with image interpretation (OR = 0.18, 95% CI 0.08-0.44, p < 0.01) and multi-step reasoning (OR = 0.30, 95% CI 0.11-0.84, p = 0.02). ChatGPT-3.5 had the highest rate of hallucinations or non-logical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).
Conclusions: LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform similarly to human respondents when answering questions from the BCSC SAP. The frequency of hallucinations and non-logical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.
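The odds ratios and 95% confidence intervals in the Results section compare accuracy across question subcategories, which can be computed from a 2x2 contingency table of correct/incorrect counts in one category versus a reference category. The sketch below uses the standard Wald (log-odds-ratio) approximation; the counts are hypothetical illustrations, not data from the paper:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table:
         category A: a correct, b incorrect
         reference category B: c correct, d incorrect
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) via the delta method
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts for illustration only
or_, lo, hi = odds_ratio_ci(30, 10, 15, 20)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")  # OR = 4.00, 95% CI 1.50-10.66
```

An interval that excludes 1.0, as in the workup-question result (OR = 3.89, 95% CI 1.19-14.73), corresponds to a statistically significant difference at the 5% level.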