Benchmarking Open-Source Large Language Models, GPT-4 and Claude 2 on Multiple-Choice Questions in Nephrology

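Keywords: Subspecialty, Nephrology, Personalized medicine, Context (archaeology), Benchmarking, Internal medicine, Medical education, Psychology, Medicine, Family medicine, Bioinformatics, Business, Geography, Marketing, Biology, Archaeology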
Authors
Sean M. Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Zhe Fei, Fabien Scalzo, Ira Kurtz
Identifier
DOI: 10.1056/aidbp2300092
Abstract

Background: In recent years, significant breakthroughs have been made in the field of natural language processing, particularly with the development of large language models (LLMs). LLMs have demonstrated remarkable capabilities on benchmarks related to general medical question answering, but there are fewer data about their performance in subspecialty fields and fewer studies still comparing the many available LLMs. These models have the potential to be used as a part of adaptive physician training, medical copilot applications, and digital patient interaction scenarios. The ability of LLMs to participate in medical training and patient care depends in part on their mastery of the knowledge content of specific medical fields.

Methods: This study investigated the medical knowledge capability of multiple LLMs in the context of their internal medicine subspecialty multiple-choice test-taking ability. We compared the performance of several open-source LLMs (Llama2-70B, Koala 7B, Falcon 7B, Stable-Vicuna 13B, and Orca-Mini 13B) with the proprietary models GPT-4 and Claude 2 on multiple-choice questions in the field of nephrology. Nephrology was chosen as an example of a conceptually complex subspecialty field in internal medicine. This study was conducted to evaluate the ability of LLMs to provide correct answers to Nephrology Self-Assessment Program (nephSAP) multiple-choice questions. These questions administered by the American Society of Nephrology help clinicians assess their knowledge in various topics in nephrology.

Results: The overall success of open-source LLMs in answering the 858 nephSAP multiple-choice questions correctly was 17.1 to 30.6%. In contrast, Claude 2 answered 54.4% of the questions correctly, whereas GPT-4 achieved a score of 73.3%. A dataset containing questions and ground truth labels used to assess the LLMs has been made available.

Conclusions: We show that the current widely used open-source LLMs have poor zero-shot reasoning ability in nephrology compared with GPT-4 and Claude 2, illustrating knowledge gaps across LLMs relevant to future subspecialty medical training and patient care. (Funded by the Factor Family Foundation and others.)
