Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews

Keywords: medicine; systematic reviews; templates; critical care medicine; MEDLINE; medical physics; programming languages; computer science; political science; law
Authors
Christian Cao, Jason Sang, Rohit Arora, David Chen, Robert Kloosterman, Milena Cecere, Jaswanth Gorla, Richard Saleh, Ian R. Drennan, Bijan Teja, Michael G. Fehlings, Paul E. Ronksley, Alexander A. C. Leung, Dany E. Weisz, Harriet Ware, Mairead Whelan, D. B. Emerson, Rahul K. Arora, Niklas Bobrovitz
Source
Journal: Annals of Internal Medicine (American College of Physicians)
Citations: 1
Identifiers
DOI: 10.7326/annals-24-02189
Abstract

Background: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
Objective: To develop generic prompt templates for large language model (LLM)–driven abstract and full-text screening that can be adapted to different reviews.
Design: Diagnostic test accuracy study.
Setting: 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).
Intervention: None.
Measurements: LLMs were prompted to include or exclude articles on the basis of SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
Results: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening, across the 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants performed similarly, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: single human abstract screening was estimated to require more than 83 hours and $1666.67 (USD), whereas the LLM-based approach completed screening in under 1 day for $157.02 (USD).
Limitations: Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.
Conclusion: A generic prompt for abstract and full-text screening that achieves high sensitivity and specificity and can be adapted to other SRs and LLMs was developed. These prompting innovations may have value to SR investigators and to researchers conducting similar criteria-based tasks across the medical sciences.
Primary Funding Source: None.
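The workflow described in the abstract — prompting an LLM to return an include/exclude decision per citation against the review's eligibility criteria, then scoring those decisions against human reference labels for sensitivity and specificity — can be sketched as below. This is a minimal illustration, not the authors' optimized template: the prompt wording, the example criteria, and both function names are assumptions introduced here for clarity.

```python
def build_screening_prompt(title: str, abstract: str, criteria: list[str]) -> str:
    """Assemble an include/exclude instruction for one citation.

    Hypothetical wording; the published study used purpose-built,
    optimized prompt templates rather than this generic text.
    """
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are screening citations for a systematic review.\n"
        f"Eligibility criteria:\n{bullet_list}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )


def screening_metrics(predicted: list[str], reference: list[str]) -> dict:
    """Sensitivity and specificity of LLM decisions vs. human reference labels."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == r == "INCLUDE" for p, r in pairs)   # correctly included
    tn = sum(p == r == "EXCLUDE" for p, r in pairs)   # correctly excluded
    fn = sum(p == "EXCLUDE" and r == "INCLUDE" for p, r in pairs)  # missed article
    fp = sum(p == "INCLUDE" and r == "EXCLUDE" for p, r in pairs)  # false inclusion
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }
```

In practice the prompt string would be sent to a chat-completion endpoint for each citation; only the scoring step is shown end to end here, since it is what the reported sensitivity/specificity figures summarize.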