Otolaryngology
Head and neck surgery
Head and neck
Medicine
Surgery
Computer science
Natural language processing
Artificial intelligence
Authors
Dante J. Merlino, Santiago Romero‐Brufau, George Saieed, Kathryn M. Van Abel, Daniel L. Price, David Archibald, Gregory A. Ator, Matthew L. Carlson
Abstract
Objective: The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT‐3.5 and GPT‐4), Google (PaLM2 and MedPaLM), and an open-source model from Meta (Llama3:70b) in answering clinical multiple-choice questions in the field of otolaryngology—head and neck surgery.
Methods: A dataset of 4566 otolaryngology questions was used; each model was given a standardized prompt followed by a question. One hundred of the questions answered incorrectly by all models were further interrogated to gain insight into the causes of the incorrect answers.
Results: GPT‐4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%), while Llama3:70b, GPT‐3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions, respectively. Three hundred sixty‐nine questions were answered incorrectly by all models. Prompting the models to provide reasoning improved accuracy across the board: GPT‐4 changed from an incorrect to a correct answer 31% of the time, while GPT‐3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively.
Conclusion: Large language models vary in their command of otolaryngology-specific clinical knowledge. OpenAI's GPT‐4 has a strong grasp of both core concepts and detailed information in the field. Its baseline understanding makes it well suited to roles in head and neck surgery education, provided that appropriate precautions are taken and its potential limitations are understood.
Level of Evidence: N/A. Laryngoscope, 2024.
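The abstract does not publish the authors' prompts or code, but the evaluation it describes (a standardized prompt plus each multiple-choice question, with a second reasoning-elicited pass for questions answered incorrectly) can be sketched as below. This is a minimal illustrative sketch, not the study's actual pipeline: the SYSTEM_PROMPT text, the dataset format, and the helper names ask and ask_with_reasoning are assumptions, and only the OpenAI chat-completions API is shown (the Google and Meta models would need their own clients).

import re
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical standardized instruction; the paper's exact prompt is not given.
SYSTEM_PROMPT = (
    "You are answering an otolaryngology multiple-choice question. "
    "Respond with the single letter of the best answer."
)

def ask(model: str, question: str, choices: dict[str, str]) -> str:
    """Send one multiple-choice question and return the letter the model picks."""
    options = "\n".join(f"{k}. {v}" for k, v in choices.items())
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{question}\n{options}"},
        ],
        temperature=0,
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"\b([A-E])\b", text.upper())
    return match.group(1) if match else ""

def ask_with_reasoning(model: str, question: str, choices: dict[str, str]) -> str:
    """Second pass mirroring the reasoning prompt: have the model explain its
    thinking before committing to a final letter."""
    options = "\n".join(f"{k}. {v}" for k, v in choices.items())
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"{question}\n{options}\n"
                "Explain your reasoning step by step, then give the final "
                "answer on its own line as 'Answer: <letter>'."
            )},
        ],
        temperature=0,
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"Answer:\s*([A-E])", text, re.IGNORECASE)
    return match.group(1).upper() if match else ""

# Toy usage with a single made-up question, scored against a known key.
dataset = [
    {"question": "Which nerve courses through the parotid gland?",
     "choices": {"A": "Facial nerve", "B": "Vagus nerve",
                 "C": "Hypoglossal nerve", "D": "Trigeminal nerve"},
     "answer": "A"},
]

correct = 0
for item in dataset:
    pick = ask("gpt-4", item["question"], item["choices"])
    if pick != item["answer"]:
        # Mirror the paper's follow-up: re-prompt for explicit reasoning.
        pick = ask_with_reasoning("gpt-4", item["question"], item["choices"])
    correct += pick == item["answer"]

print(f"Accuracy: {correct}/{len(dataset)}")

Deterministic decoding (temperature=0) and a strict answer-letter regex keep scoring reproducible; in practice a real harness would also need per-provider retry logic and answer-parsing rules for models that ignore the output format.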