Authors
Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Charles T. Lau, Tao Tu, Shekoofeh Azizi, K. K. Singhal, Mike Schaekermann, R. May, Roy Lee, SiWai Man, S. Sara Mahdavi, Zahra Ahmed, Yossi Matias, Joëlle Barral, S. M. Ali Eslami, Danielle Belgrave, Yun Liu, Sreenivasa Raju Kalidindi, Shravya Shetty, Vivek Natarajan, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, Sofia Ira Ktena
Abstract
Automated radiology report generation has the potential to improve patient care and reduce the workload of radiologists. However, the path toward real-world adoption has been stymied by the challenge of evaluating the clinical quality of artificial intelligence (AI)-generated reports. We build a state-of-the-art report generation system for chest radiographs, called Flamingo-CXR, and perform an expert evaluation of AI-generated reports by engaging a panel of board-certified radiologists. We observe a wide distribution of preferences across the panel and across clinical settings: 56.1% of Flamingo-CXR intensive care reports were judged preferable or equivalent to clinician reports by half or more of the panel, rising to 77.7% for in/outpatient X-rays overall and to 94% for the subset of cases with no pertinent abnormal findings. Errors were observed in both human-written and Flamingo-CXR reports, with 24.8% of in/outpatient cases containing clinically significant errors in both report types, 22.8% in Flamingo-CXR reports only, and 14.0% in human reports only. For reports that contain errors, we develop an assistive setting that demonstrates clinician–AI collaboration for radiology report composition, indicating new possibilities for clinical utility. By engaging a group of 27 certified radiologists in the United States and India, the accuracy and quality of radiology reports generated by a vision–language model are evaluated, and synergistic combinations with clinicians are explored.
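To make the panel-based preference metric concrete, below is a minimal sketch of how a "preferable or equivalent by half or more of the panel" fraction could be computed from per-case ratings. The data layout, verdict labels, and function name are illustrative assumptions, not the authors' evaluation code.

```python
from collections import defaultdict

# Hypothetical per-case ratings from a panel of radiologists (assumed labels):
# each rating compares the AI-generated report with the clinician report.
ratings = [
    {"case_id": "icu_001", "rater": "R1", "verdict": "ai_preferred"},
    {"case_id": "icu_001", "rater": "R2", "verdict": "equivalent"},
    {"case_id": "icu_001", "rater": "R3", "verdict": "clinician_preferred"},
    {"case_id": "icu_002", "rater": "R1", "verdict": "clinician_preferred"},
    {"case_id": "icu_002", "rater": "R2", "verdict": "clinician_preferred"},
    {"case_id": "icu_002", "rater": "R3", "verdict": "equivalent"},
]

def panel_majority_fraction(ratings, acceptable=("ai_preferred", "equivalent")):
    """Fraction of cases where at least half of the raters judged the
    AI report to be preferable or equivalent to the clinician report."""
    by_case = defaultdict(list)
    for r in ratings:
        by_case[r["case_id"]].append(r["verdict"] in acceptable)
    majority_cases = sum(
        1 for votes in by_case.values() if sum(votes) >= len(votes) / 2
    )
    return majority_cases / len(by_case)

print(f"{panel_majority_fraction(ratings):.1%}")  # 50.0% for this toy data
```

The same aggregation could be run separately per clinical setting (intensive care vs. in/outpatient) to reproduce the kind of breakdown reported in the abstract.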