Abstract
Many surgical patients do not interact with an anesthesiologist until minutes before surgery, and the internet has become a common source of medical information. Large language models such as GPT-4, "generative artificial intelligence" tools capable of creating natural, human-sounding prose in response to a plain-language query, and their incorporation into search engines promise to make it easier for patients to directly ask questions related to preanesthetic preparation. The accuracy of large language models in answering medical questions has generally been impressive1–3 but has not been evaluated for preanesthetic queries. We evaluated the ability of the widely accessible model GPT-4 to provide reasonable responses to common preanesthetic patient questions compared to online published resources. Our hypothesis was that GPT-4's responses would be rated at least as reasonable as published resources.

The study was approved by the Wake Forest University School of Medicine institutional review board, and completion of the survey was deemed to indicate informed consent. Sixteen common preanesthetic questions were drawn from the websites of academic anesthesiology departments (table 1; Supplemental Content 1, https://links.lww.com/ALN/D360). We collected the published online answers and the answers GPT-4 provided to the same questions via the ChatGPT Pro interface (chat.openai.com), queried in two independent sessions on separate dates in April 2023. Two ChatGPT sessions were used because the software regenerates new responses when reprompted, and those responses may differ in quality.1 Survey participants were preoperative anesthesia experts known to the investigators, plus similar experts suggested by this cohort, nearly all of whom were academicians involved with preoperative assessment (total solicited, N = 210). The survey instructions asked raters to "evaluate answers to questions about anesthesia care that patients may ask. Your task is simply to evaluate each statement as 'reasonable' or 'unreasonable.' Please select 'reasonable' unless you detect a significant error or major omission." For each question, the survey recipient was randomly presented with a single answer, without knowledge of its authorship, in an approximately 2:1 ratio of GPT-4–generated responses to website content. Participants then rated each answer as "reasonable" or "unreasonable."3 Respondents were anonymous unless they chose to give their name for acknowledgment (Supplemental Content 2, https://links.lww.com/ALN/D361). Enrollment was closed when fewer than one response per day was observed. The percentages rated reasonable were compared between GPT-4 and website content, overall and for each question, with the Pearson chi-square or Fisher exact test. We estimated that 240 responses per group (i.e., GPT-4 or human authored) would be needed to detect a 10% difference in ratings, assuming 90% "reasonable" ratings for the published statements, with 80% power and α = 0.05.

Seventy-four of 210 (35%) invited participants responded during the 10-day survey period. The combined results and those for each question are shown in table 2. Overall, GPT-4 answers were more frequently rated reasonable than published website answers: 536 of 644 (83.2%) versus 328 of 435 (75.4%), P = 0.002.
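As an illustrative sketch only (not the authors' analysis code), the overall comparison can be checked from the counts reported above. The snippet below, assuming Python with scipy installed, applies a chi-square test to the pooled 2 × 2 table and shows the Fisher exact test used for the sparser per-question tables.

```python
# Minimal sketch: reproduce the overall GPT-4 vs. website comparison
# from the counts reported in the text (not the authors' actual code).
from scipy.stats import chi2_contingency, fisher_exact

# Rows: GPT-4, websites; columns: rated reasonable, rated unreasonable.
table = [
    [536, 644 - 536],  # GPT-4: 536/644 (83.2%) rated reasonable
    [328, 435 - 328],  # websites: 328/435 (75.4%) rated reasonable
]

# Chi-square test on the 2x2 table (Yates continuity correction by default).
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, P = {p:.3f}")  # P is approximately 0.002

# Fisher exact test, appropriate for the smaller per-question tables.
odds_ratio, p_exact = fisher_exact(table)
print(f"Fisher exact P = {p_exact:.3f}")
```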
GPT-4's responses to four individual questions ("Why can't I eat before surgery," "What can I drink before surgery," "Why does the anesthesiologist want to know about my teeth," and "Can I have surgery without opioids?") were rated significantly more reasonable than the corresponding human-authored responses (table 2). No human-authored answer was rated more reasonable than the corresponding GPT-4 answer.

The findings of this study suggest that experts in preoperative anesthesia care rate GPT-4's responses to common preoperative questions at least as favorably as those provided on academic websites. Our results are similar to other reports of GPT-4 answers to potential patient questions in preventive cardiology,3 general medicine,4 and diabetes,5 although some poor-quality answers have been observed in other fields.6–8 Given the rapid growth in public access to generative artificial intelligence platforms, our finding of comparable human- and machine-authored answer quality is reassuring.

A strength of our study compared to previous reports is that we used a large number of human raters, blinded to the provenance of any given answer, rather than a small panel of experts, to evaluate the generative artificial intelligence responses. We believe this adds generalizability to our findings. Conversely, our design has limitations. We used a relatively coarse rating scale (reasonable or unreasonable) rather than a Likert scale, although this approach has been used by other investigators.3 A further refinement would be to ask reviewers to evaluate the statements in specific domains, including overall accuracy, the presence of fabrications, and concordance (internal consistency). The human-authored texts were all drawn from patient-facing websites of presumably reputable academic institutions but may not be representative of the overall quality of such sources (and the poorer-performing online answers represent an opportunity for improvement). The anonymous nature of the survey also makes it impossible to assess nonresponder characteristics. Finally, models such as GPT-4 are trained on millions of documents, which might include preoperative websites; they do not produce deterministic outputs, and they generate text iteratively by choosing the next word in a sentence. While generating fluent, human-like prose, large language models are also well known to occasionally generate inaccurate results, known as hallucinations.9 Although we did not observe any such statements in this investigation, this remains a risk if these tools are widely used by preoperative patients. Conversely, models such as GPT-4 and its successors could be used in partnership with human experts to curate responses and to generate patient-facing text on a variety of topics, including those posed and evaluated by patients themselves.

Overall, the findings of this study suggest that the generative artificial intelligence underlying large language models such as GPT-4 may be an effective source of medical information for patients preparing for anesthesia. Although caution is still in order until accuracy can be assured, models such as GPT-4 should be further studied for potential involvement in patient-facing activities in anesthesia care.

Support was provided solely from institutional and/or departmental sources.

The authors declare no competing interests.

Supplemental Content
1. Questions and answers evaluated in the study, https://links.lww.com/ALN/D360
2. Respondents wishing to be acknowledged, https://links.lww.com/ALN/D361