作者
Lu Zhang,Mingqian Liu,Lingyun Wang,Y Zhang,Xiangjun Xu,Zhijun Pan,Yan Feng,Jue Zhao,Lin Zhang,Gehong Yao,Xu Chen,Xueqian Xie
摘要
Background The specialization and complexity of radiology makes the automatic generation of radiologic impressions (ie, a diagnosis with differential diagnosis and management recommendations) challenging. Purpose To develop a large language model (LLM) that generates impressions based on imaging findings and to evaluate its performance in professional and linguistic dimensions. Materials and Methods Six radiologists recorded imaging examination findings from August 2 to 31, 2023, at Shanghai General Hospital and used the developed LLM before routinely writing report impressions for multiple radiologic modalities (CT, MRI, radiography, mammography) and anatomic sites (cranium and face, neck, chest, upper abdomen, lower abdomen, vessels, bone and joint, spine, breast), making necessary corrections and completing the radiologic impression. A subset was defined to investigate cases where the LLM-generated impressions differed from the final radiologist impressions by excluding identical and highly similar cases. An expert panel scored the LLM-generated impressions on a five-point Likert scale (5 = strongly agree) based on scientific terminology, coherence, specific diagnosis, differential diagnosis, management recommendations, correctness, comprehensiveness, harmlessness, and lack of bias. Results In this retrospective study, an LLM was pretrained using 20 GB of medical and general-purpose text data. The fine-tuning data set comprised 1.5 GB of data, including 800 radiology reports with paired instructions (describing the output task in natural language) and outputs. Test set 2 included data from 3988 patients (median age, 56 years [IQR, 40-68 years]; 2159 male). The median recall, precision, and F1 score of LLM-generated impressions were 0.775 (IQR, 0.56-1), 0.84 (IQR, 0.611-1), and 0.772 (IQR, 0.578-0.957), respectively, using the final impressions as the reference standard. In a subset of 1014 patients (median age, 57 years [IQR, 42-69 years]; 528 male), the overall median expert panel score for LLM-generated impressions was 5 (IQR, 5-5), ranging from 4 (IQR, 3-5) to 5 (IQR, 5-5). Conclusion The developed LLM generated radiologic impressions that were professionally and linguistically appropriate for a full spectrum of radiology examinations. © RSNA, 2024