Supervised machine learning compared to large language models for identifying functional seizures from medical records

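Confidence interval; Receiver operating characteristic; Logistic regression; Electroencephalography; Kappa; Neuroimaging; Epilepsy; Medicine; Convulsion; Machine learning; Psychology; Internal medicine; Audiology; Artificial intelligence; Psychiatry; Computer science; Mathematics; Geometry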

Authors

Wesley T. Kerr, Katherine N. McFarlane, Gabriela Figueiredo Pucci, Danielle R. Carns, Alex Israel, Lianne Vighetti, Page B. Pennell, John M. Stern, Zongqi Xia, Yanshan Wang

Source

Journal: Epilepsia [Wiley]
Date: 2025-02-17 Volume (issue): 66 (4): 1155-1164

Links

wiley.com nih.gov nih.gov doi.org

Identifier

DOI: 10.1111/epi.18272

Abstract

Objective: The Functional Seizures Likelihood Score (FSLS) is a supervised machine learning–based diagnostic score that was developed to differentiate functional seizures (FS) from epileptic seizures (ES). In contrast to this targeted approach, large language models (LLMs) can identify patterns in data for which they were not specifically trained. To evaluate the relative benefits of each approach, we compared the diagnostic performance of the FSLS to two LLMs: ChatGPT and GPT-4.

Methods: In total, 114 anonymized cases were constructed based on patients with documented FS, ES, mixed ES and FS, or physiologic seizure-like events (PSLEs). Text-based data were presented in three sequential prompts to the LLMs, showing the history of present illness (HPI), electroencephalography (EEG) results, and neuroimaging results. We compared the accuracy (number of correct predictions/number of cases) and area under the receiver-operating characteristic (ROC) curves (AUCs) of the LLMs to the FSLS using mixed-effects logistic regression.

Results: The accuracy of FSLS was 74% (95% confidence interval [CI] 65%–82%) and the AUC was 85% (95% CI 77%–92%). GPT-4 was superior to both the FSLS and ChatGPT (p < .001), with an accuracy of 85% (95% CI 77%–91%) and AUC of 87% (95% CI 79%–95%). Cohen's kappa between the FSLS and GPT-4 was 40% (fair). The LLMs provided different predictions on different days when the same note was provided for 33% of patients, and the LLM's self-rated certainty was moderately correlated with this observed variability (Spearman's rho²: 30% [fair, ChatGPT] and 63% [substantial, GPT-4]).

Significance: Both GPT-4 and the FSLS identified a substantial subset of patients with FS based on clinical history. The fair agreement in predictions highlights that the LLMs identified patients differently from the structured score. The inconsistency of the LLMs' predictions across days and incomplete insight into their own consistency was concerning. This comparison highlights both benefits and cautions about how machine learning and artificial intelligence could identify patients with FS in clinical practice.
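The abstract reports accuracy (correct predictions divided by cases) alongside Cohen's kappa, and it helps to see why two classifiers can be similarly accurate yet agree only modestly with each other. The sketch below is illustrative only and is not the authors' code. The labels are invented toy data, the function names are hypothetical, and the Wilson score interval is an assumption, since the abstract does not say how its confidence intervals were computed.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the reference label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumption: chosen for illustration; the source does not name its method.
    """
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def cohen_kappa(a, b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1:
        return 1.0  # both raters always give the same single label
    return (observed - expected) / (1 - expected)


# Invented toy labels (NOT study data): FS = functional, ES = epileptic.
truth = ["FS", "FS", "ES", "ES", "FS", "ES", "ES", "FS"]
score = ["FS", "ES", "ES", "ES", "FS", "ES", "FS", "FS"]  # stand-in for a score
model = ["FS", "FS", "ES", "FS", "ES", "ES", "ES", "FS"]  # stand-in for an LLM

acc_score = accuracy(truth, score)
acc_model = accuracy(truth, model)
kappa = cohen_kappa(score, model)
low, high = wilson_interval(6, 8)

print(f"accuracy: score={acc_score:.2f}, model={acc_model:.2f}")
print(f"kappa between the two predictors: {kappa:.2f}")
print(f"Wilson interval for 6/8 correct: {low:.2f} to {high:.2f}")
```

In this toy example both predictors are correct on 6 of 8 cases, yet they err on different cases, so their chance-corrected agreement is zero. This is the same kind of pattern the abstract points to when it notes that fair agreement means the LLMs identified patients differently from the structured score.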
