Wesley T. Kerr, Katherine N. McFarlane, Gabriela Figueiredo Pucci, Danielle R. Carns, Alex Israel, Lianne Vighetti, Page B. Pennell, John M. Stern, Zongqi Xia, Yanshan Wang
Abstract

Objective
The Functional Seizures Likelihood Score (FSLS) is a supervised machine learning–based diagnostic score that was developed to differentiate functional seizures (FS) from epileptic seizures (ES). In contrast to this targeted approach, large language models (LLMs) can identify patterns in data for which they were not specifically trained. To evaluate the relative benefits of each approach, we compared the diagnostic performance of the FSLS to two LLMs: ChatGPT and GPT‐4.

Methods
In total, 114 anonymized cases were constructed based on patients with documented FS, ES, mixed ES and FS, or physiologic seizure‐like events (PSLEs). Text‐based data were presented to the LLMs in three sequential prompts, showing the history of present illness (HPI), electroencephalography (EEG) results, and neuroimaging results. We compared the accuracy (number of correct predictions/number of cases) and area under the receiver‐operating characteristic (ROC) curves (AUCs) of the LLMs to the FSLS using mixed‐effects logistic regression.

Results
The accuracy of the FSLS was 74% (95% confidence interval [CI] 65%–82%) and the AUC was 85% (95% CI 77%–92%). GPT‐4 was superior to both the FSLS and ChatGPT (p < .001), with an accuracy of 85% (95% CI 77%–91%) and AUC of 87% (95% CI 79%–95%). Cohen's kappa between the FSLS and GPT‐4 was 40% (fair). When given the same note on different days, the LLMs provided different predictions for 33% of patients, and the LLMs' self‐rated certainty was moderately correlated with this observed variability (Spearman's rho²: 30% [fair, ChatGPT] and 63% [substantial, GPT‐4]).

Significance
Both GPT‐4 and the FSLS identified a substantial subset of patients with FS based on clinical history. The fair agreement in predictions highlights that the LLMs identified patients differently from the structured score. The inconsistency of the LLMs' predictions across days and their incomplete insight into their own consistency were concerning.
This comparison highlights both benefits and cautions about how machine learning and artificial intelligence could identify patients with FS in clinical practice.
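For readers less familiar with the metrics reported above, the following is a minimal Python sketch of how accuracy, a 95% CI for a proportion, AUC, and Cohen's kappa can be computed. It is illustrative only and is not the authors' analysis code: the labels and scores are made-up toy values, the function names are hypothetical, and the Wilson interval is an assumed choice because the abstract does not state which CI method was used. The mixed‐effects logistic regression used for the formal comparison is not reproduced here.

```python
import math

def accuracy(y_true, y_pred):
    """Number of correct predictions divided by number of cases."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cohen_kappa(a, b):
    """Chance-corrected agreement between two sets of predictions."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy data (hypothetical): 1 = functional seizures, 0 = epileptic seizures.
truth    = [1, 1, 1, 1, 0, 0, 0, 0]
model_a  = [1, 1, 1, 0, 0, 0, 1, 0]   # stand-in for a structured score
model_b  = [1, 1, 0, 1, 0, 0, 0, 0]   # stand-in for an LLM
scores_a = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1, 0.6, 0.3]

acc_a = accuracy(truth, model_a)                 # 0.75
ci_a = wilson_interval(round(acc_a * len(truth)), len(truth))
auc_a = auc(truth, scores_a)                     # 0.9375
kappa_ab = cohen_kappa(model_a, model_b)         # 0.25
```

In this toy example the two sets of predictions have similar accuracy but agree on only 5 of 8 cases, giving a low kappa. This mirrors, in miniature, the abstract's point that two classifiers can perform comparably while identifying different patients.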