Are Generative Pretrained Transformer 4 Responses to Developmental Dysplasia of the Hip Clinical Scenarios Universal? An International Review

生成语法变压器医学计算机科学人工智能工程类电气工程电压

作者

Shaoting Luo,Federico Canavese,Alaric Aroojis,Antonio Andreacchio,Darko Antičević,Maryse Bouchard,Pablo Castañeda,Vincenzo De Rosa,Michel Armand Fiogbe,Steven L. Frick,James Hoi Po Hui,Ashok Johari,Antonio Loro,Xuemin Lyu,Masaki Matsushita,Hakan Ömeroğlu,David P. Roye,Maulin Shah,Bicheng Yong,Lianyong Li

出处

期刊：Journal of Pediatric Orthopaedics [Ovid Technologies (Wolters Kluwer)]
日期：2024-04-09 卷期号：44 (6): e504-e511 被引量：1

链接

nih.govdoi.org

标识

DOI：10.1097/bpo.0000000000002682

摘要

Objective: There is increasing interest in applying artificial intelligence chatbots like generative pretrained transformer 4 (GPT-4) in the medical field. This study aimed to explore the universality of GPT-4 responses to simulated clinical scenarios of developmental dysplasia of the hip (DDH) across diverse global settings. Methods: Seventeen international experts with more than 15 years of experience in pediatric orthopaedics were selected for the evaluation panel. Eight simulated DDH clinical scenarios were created, covering 4 key areas: (1) initial evaluation and diagnosis, (2) initial examination and treatment, (3) nursing care and follow-up, and (4) prognosis and rehabilitation planning. Each scenario was completed independently in a new GPT-4 session. Interrater reliability was assessed using Fleiss kappa, and the quality, relevance, and applicability of GPT-4 responses were analyzed using median scores and interquartile ranges. Following scoring, experts met in ZOOM sessions to generate Regional Consensus Assessment Scores, which were intended to represent a consistent regional assessment of the use of the GPT-4 in pediatric orthopaedic care. Results: GPT-4’s responses to the 8 clinical DDH scenarios received performance scores ranging from 44.3% to 98.9% of the 88-point maximum. The Fleiss kappa statistic of 0.113 ( P = 0.001) indicated low agreement among experts in their ratings. When assessing the responses’ quality, relevance, and applicability, the median scores were 3, with interquartile ranges of 3 to 4, 3 to 4, and 2 to 3, respectively. Significant differences were noted in the prognosis and rehabilitation domain scores ( P < 0.05 for all). Regional consensus scores were 75 for Africa, 74 for Asia, 73 for India, 80 for Europe, and 65 for North America, with the Kruskal-Wallis test highlighting significant disparities between these regions ( P = 0.034). Conclusions: This study demonstrates the promise of GPT-4 in pediatric orthopaedic care, particularly in supporting preliminary DDH assessments and guiding treatment strategies for specialist care. However, effective integration of GPT-4 into clinical practice will require adaptation to specific regional health care contexts, highlighting the importance of a nuanced approach to health technology adaptation. Level of Evidence: Level IV.

求助该文献

Are Generative Pretrained Transformer 4 Responses to Developmental Dysplasia of the Hip Clinical Scenarios Universal? An International Review

今日热心研友