Comparative Evaluation of Advanced AI Reasoning Models in Pediatric Clinical Decision Support: ChatGPT O1 vs. DeepSeek-R1

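Keywords: Computer science; Decision support systems; Clinical decision support systems; Artificial intelligence

Authors

Gianluca Mondillo, Simone Colosimo, Alessandra Perrotta, Vittoria Frattolillo, Mariapia Masino

Source

Journal: Cold Spring Harbor Laboratory - medRxiv; Date: 2025-01-28

Identifier

DOI: 10.1101/2025.01.27.25321169

Abstract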

Introduction: The adoption of advanced reasoning models, such as ChatGPT O1 and DeepSeek-R1, represents a pivotal step forward in clinical decision support, particularly in pediatrics. ChatGPT O1 employs chain-of-thought (CoT) reasoning to enhance structured problem-solving, while DeepSeek-R1 introduces self-reflection capabilities through reinforcement learning. This study aimed to evaluate the diagnostic accuracy and clinical utility of these models in pediatric scenarios using the MedQA dataset.

Materials and Methods: A total of 500 multiple-choice pediatric questions from the MedQA dataset were presented to ChatGPT O1 and DeepSeek-R1. Each question included four or more options, with one correct answer. The models were evaluated under uniform conditions. Performance was measured by accuracy, and Cohen's kappa and chi-square tests were applied to assess agreement and statistical significance. Responses were analyzed to determine the models' effectiveness in addressing clinical questions.

Results: ChatGPT O1 achieved a diagnostic accuracy of 92.8%, significantly outperforming DeepSeek-R1, which scored 87.0% (p < 0.00001). The CoT reasoning technique used by ChatGPT O1 allowed for more structured and reliable responses, reducing the risk of errors. Conversely, DeepSeek-R1, while slightly less accurate, demonstrated superior accessibility and adaptability due to its open-source nature and emerging self-reflection capabilities. Cohen's kappa (κ = 0.20) indicated low agreement between the models, reflecting their distinct reasoning strategies.

Conclusions: This study highlights the strengths of ChatGPT O1 in providing accurate and coherent clinical reasoning, making it highly suitable for critical pediatric scenarios. DeepSeek-R1, with its flexibility and accessibility, remains a valuable tool in resource-limited settings. Combining these models in an ensemble system could leverage their complementary strengths, optimizing decision support in diverse clinical contexts. Further research is warranted to explore their integration into multidisciplinary care teams and their application in real-world clinical settings.
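
Editorial illustration (not part of the original abstract): the three metrics named in the Materials and Methods can be sketched in a few lines of Python. The abstract does not say how the chi-square test or the kappa statistic were set up, so this sketch assumes a 2x2 table of correct/incorrect counts per model and a kappa computed over per-question correct/incorrect outcomes. The function names and the eight-question toy data are hypothetical and are not the study's data.

```python
from math import erfc, sqrt


def accuracy(correct_flags):
    """Fraction of questions answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: share of items on which both raters give the same label.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_expected == 1.0:
        return 1.0  # both raters use a single identical label; kappa is degenerate
    return (p_observed - p_expected) / (1 - p_expected)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom the upper-tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, erfc(sqrt(stat / 2))


# Hypothetical toy data: True means the model answered that question correctly.
model_a = [True, True, True, False, True, True, False, True]
model_b = [True, False, True, False, True, False, False, True]

acc_a = accuracy(model_a)
acc_b = accuracy(model_b)
kappa = cohens_kappa(model_a, model_b)
stat, p_value = chi_square_2x2(
    sum(model_a), len(model_a) - sum(model_a),
    sum(model_b), len(model_b) - sum(model_b),
)
print(f"accuracy A={acc_a:.3f}, B={acc_b:.3f}, kappa={kappa:.3f}")
print(f"chi-square={stat:.3f}, p={p_value:.3f}")
```

Because both models answer the same questions, the two samples are paired; treating them as independent rows of a 2x2 table is an assumption of this sketch and may differ from the authors' analysis.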
