A large language model–based generative natural language processing framework fine‐tuned on clinical notes accurately extracts headache frequency from electronic health records

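Artificial intelligence, Transformer, Medicine, Computer science, Natural language processing, Introduction, Language model, Migraine, Generative model, Background (archaeology), Confidence interval, Machine learning, Generative grammar, Family medicine, Internal medicine, Paleontology, Physics, Quantum mechanics, Voltage, Biology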
Authors
Chia‐Chun Chiang,Man Luo,Gina Dumkrieger,Shubham Trivedi,Yi‐Chieh Chen,Chieh‐Ju Chao,Todd J. Schwedt,Abeed Sarker,Imon Banerjee
Source
Journal: Headache [Wiley]
Volume (issue): 64 (4): 400-409 Citations: 6
Identifier
DOI:10.1111/head.14702
Abstract

Objective: To develop a natural language processing (NLP) algorithm that can accurately extract headache frequency from free‐text clinical notes.

Background: Headache frequency, defined as the number of days with any headache in a month (or 4 weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, because clinicians document it in varied and inconsistent ways, traditional NLP algorithms face significant challenges in accurately extracting headache frequency from the electronic health record (EHR).

Methods: This was a retrospective cross‐sectional study with patients identified from two tertiary headache referral centers, Mayo Clinic Arizona and Mayo Clinic Rochester. All neurology consultation notes written by 15 specialized clinicians (11 headache specialists and 4 nurse practitioners) between 2012 and 2022 were extracted, and 1915 notes were used for model fine‐tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) a ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model, (2) a Generative Pre‐Trained Transformer‐2 (GPT‐2) Question Answering (QA) model used zero‐shot, (3) a GPT‐2 QA model with few‐shot training, fine‐tuned on clinical notes, and (4) a GPT‐2 generative model with few‐shot training, fine‐tuned on clinical notes to generate the answer by considering the context of the included text.

Results: The mean (standard deviation) headache frequencies of our training and testing datasets were 13.4 (10.9) and 14.4 (11.2), respectively. The GPT‐2 generative model was the best‐performing model, with an accuracy of 0.92 (95% confidence interval [CI] 0.91, 0.93) and an R² score of 0.89 (95% CI 0.87, 0.90), and all GPT‐2–based models outperformed the ClinicalBERT model in terms of exact matching accuracy. Although the ClinicalBERT regression model had the lowest accuracy, 0.27 (0.26, 0.28), it demonstrated a high R² score of 0.88 (0.85, 0.89), suggesting that it can reasonably predict headache frequency to within ±3 days. Its R² score was higher than that of either the GPT‐2 QA zero‐shot model or the GPT‐2 QA few‐shot fine‐tuned model.

Conclusion: We developed a robust information extraction model based on a state‐of‐the‐art large language model, a GPT‐2 generative model that can extract headache frequency from EHR free‐text clinical notes with high accuracy and R² score. It overcame several challenges arising from the different ways clinicians document headache frequency, which traditional NLP models could not easily handle. We also showed that GPT‐2–based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT‐2 generative model and inference code on GitHub under an open‐source license for community use. Additional fine‐tuning of the algorithm might be required when it is applied to different health‐care systems for various clinical use cases.
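The gap between exact matching accuracy and R² reported in the Results can be illustrated with a small sketch. The Python below is not the authors' evaluation code: the function names (exact_match_accuracy, within_tolerance, r2_score) and all of the toy values are assumptions invented for illustration. It shows how a regression-style model whose predictions land close to, but not exactly on, the labeled number of headache days can score low on exact matching while still scoring high on R².

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of notes whose rounded prediction equals the labeled days."""
    hits = [1.0 if round(p) == t else 0.0 for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)


def within_tolerance(y_true, y_pred, tol=3):
    """Fraction of notes whose prediction is within +/- tol days of the label."""
    hits = [1.0 if abs(p - t) <= tol else 0.0 for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels: headache days per month for six notes (toy data).
y_true = [0, 4, 8, 15, 20, 30]

# Hypothetical regression-style output: always close, never exact.
regression_pred = [1.4, 5.2, 6.9, 13.6, 21.3, 28.8]

# Hypothetical extraction-style output: exact on five notes, off by 2 on one.
extraction_pred = [0, 4, 8, 15, 20, 28]

for name, pred in [("regression", regression_pred),
                   ("extraction", extraction_pred)]:
    print(name,
          "exact:", round(exact_match_accuracy(y_true, pred), 2),
          "within 3 days:", round(within_tolerance(y_true, pred), 2),
          "R2:", round(r2_score(y_true, pred), 3))
```

On this toy data the regression-style predictions never match a label exactly, yet all fall within 3 days and yield a high R², which is the same pattern the abstract describes for ClinicalBERT. Exact matching rewards recovering the documented number itself, whereas R² rewards being numerically close on average.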