A large language model–based generative natural language processing framework fine‐tuned on clinical notes accurately extracts headache frequency from electronic health records

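Artificial intelligence, Transformer, Medicine, Computer science, Natural language processing, Introduction, Language model, Migraine, Generative model, Background (archaeology), Confidence interval, Machine learning, Generative grammar, Family medicine, Internal medicine, Paleontology, Physics, Quantum mechanics, Voltage, Biology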
Authors
Chia‐Chun Chiang, Man Luo, Gina Dumkrieger, Shubham Trivedi, Yi‐Chieh Chen, Chieh‐Ju Chao, Todd J. Schwedt, Abeed Sarker, Imon Banerjee
Source
Journal: Headache [Wiley]
Volume (issue): 64 (4): 400-409 Citations: 12
Identifier
DOI: 10.1111/head.14702
Abstract

Objective: To develop a natural language processing (NLP) algorithm that can accurately extract headache frequency from free‐text clinical notes.

Background: Headache frequency, defined as the number of days with any headache in a month (or 4 weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, because clinicians document it in varied and inconsistent ways, traditional NLP algorithms face significant challenges in accurately extracting headache frequency from the electronic health record (EHR).

Methods: This was a retrospective cross‐sectional study with patients identified from two tertiary headache referral centers, Mayo Clinic Arizona and Mayo Clinic Rochester. All neurology consultation notes written by 15 specialized clinicians (11 headache specialists and 4 nurse practitioners) between 2012 and 2022 were extracted, and 1915 notes were used for model fine‐tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model, (2) Generative Pre‐Trained Transformer‐2 (GPT‐2) Question Answering (QA) model zero‐shot, (3) GPT‐2 QA model few‐shot training fine‐tuned on clinical notes, and (4) GPT‐2 generative model few‐shot training fine‐tuned on clinical notes to generate the answer by considering the context of included text.

Results: The mean (standard deviation) headache frequencies of our training and testing datasets were 13.4 (10.9) and 14.4 (11.2), respectively. The GPT‐2 generative model was the best‐performing model with an accuracy of 0.92 (0.91, 0.93, 95% confidence interval [CI]) and R² score of 0.89 (0.87, 0.90, 95% CI), and all GPT‐2–based models outperformed the ClinicalBERT model in terms of exact matching accuracy. Although the ClinicalBERT regression model had the lowest accuracy of 0.27 (0.26, 0.28), it demonstrated a high R² score of 0.88 (0.85, 0.89), suggesting the ClinicalBERT model can reasonably predict the headache frequency within a range of ≤ ± 3 days, and its R² score was higher than that of the GPT‐2 QA zero‐shot model or the GPT‐2 QA few‐shot fine‐tuned model.

Conclusion: We developed a robust information extraction model based on a state‐of‐the‐art large language model, a GPT‐2 generative model that can extract headache frequency from EHR free‐text clinical notes with high accuracy and R² score. It overcame several challenges, arising from the different ways clinicians document headache frequency, that traditional NLP models could not easily handle. We also showed that GPT‐2–based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT‐2 generative model and inference code on GitHub under an open‐source license for community use. Additional fine‐tuning of the algorithm might be required when it is applied to different health‐care systems for various clinical use cases.
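The abstract reports two metrics that can disagree: exact matching accuracy and R² score. The minimal sketch below illustrates how a regression-style model can score low on exact matching while still scoring high on R², as reported for ClinicalBERT. It is not the authors' evaluation code. The function names and all data values are hypothetical, invented only for illustration, and the sketch assumes predictions are compared to the reference without rounding.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the reference value exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical headache days per month (NOT data from the study).
y_true = [2, 8, 15, 20, 30, 4]

# A regression model emits continuous values: close, but rarely exact.
regression_pred = [3.1, 7.2, 13.8, 21.5, 28.0, 4.0]

# A generative model emits discrete answers: mostly exact, one miss.
generative_pred = [2, 8, 15, 20, 30, 5]

for name, pred in [("regression", regression_pred),
                   ("generative", generative_pred)]:
    acc = exact_match_accuracy(y_true, pred)
    r2 = r2_score(y_true, pred)
    print(f"{name}: accuracy={acc:.2f}, R2={r2:.3f}")
```

In this toy data the regression predictions all fall within a couple of days of the reference, so R² stays high even though only one value matches exactly. This is the same pattern the abstract describes in saying the model predicts within a range of about ± 3 days.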
