Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation

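Medical diagnosis, medicine, context (archaeology), neuroradiology reading room, radiology, medical physics, medical imaging, radiography, diagnostic accuracy, modality, neurology, social science, biology, psychiatry, sociology, paleontology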

Authors

Marc Huppertz, Robert Siepmann, David Topp, Omid Nikoubashman, Can Yüksel, Christiane Kühl, Daniel Truhn, Sven Nebelung

Source

Journal: European Radiology [Springer Science+Business Media]
Date: 2024-10-18

Links

nih.gov, doi.org

Identifiers

DOI: 10.1007/s00330-024-11115-6

Abstract

Objectives: ChatGPT-4 Vision (GPT-4V) is a state-of-the-art multimodal large language model (LLM) that may be queried using images. We aimed to evaluate the tool’s diagnostic performance when autonomously assessing clinical imaging studies.

Materials and methods: A total of 206 imaging studies (i.e., radiography (n = 60), CT (n = 60), MRI (n = 60), and angiography (n = 26)) with unequivocal findings and established reference diagnoses from the radiologic practice of a large university hospital were accessed. Readings were performed uncontextualized, with only the image provided, and contextualized, with additional clinical and demographic information. Responses were assessed along multiple diagnostic dimensions and analyzed using appropriate statistical tests.

Results: With its pronounced propensity to favor context over image information, the tool’s diagnostic accuracy improved from 8.3% (uncontextualized) to 29.1% (contextualized, first diagnosis correct) and 63.6% (contextualized, correct diagnosis among differential diagnoses) (p ≤ 0.001, Cochran’s Q test). Diagnostic accuracy declined by up to 30% when 20 images were re-read after 30 and 90 days and seemed unrelated to the tool’s self-reported confidence (Spearman’s ρ = 0.117 (p = 0.776)). While the described imaging findings matched the suggested diagnoses in 92.7%, indicating valid diagnostic reasoning, the tool fabricated 258 imaging findings in 412 responses and misidentified imaging modalities or anatomic regions in 65 images.

Conclusion: GPT-4V, in its current form, cannot reliably interpret radiologic images. Its tendency to disregard the image, fabricate findings, and misidentify details, especially without clinical context, may misguide healthcare providers and put patients at risk.

Key Points

Question: Can Generative Pre-trained Transformer 4 Vision (GPT-4V) interpret radiologic images, with and without clinical context?

Findings: GPT-4V performed poorly, demonstrating diagnostic accuracy rates of 8% (uncontextualized), 29% (contextualized, most likely diagnosis correct), and 64% (contextualized, correct diagnosis among differential diagnoses).

Clinical relevance: The utility of commercial multimodal large language models, such as GPT-4V, in radiologic practice is limited. Without clinical context, diagnostic errors and fabricated findings may compromise patient safety and misguide clinical decision-making. These models must be further refined to be beneficial.
