Constructing a Large Language Model to Generate Impressions from Findings in Radiology Reports

医学放射科语言学哲学

作者

Lu Zhang,Mingqian Liu,Lingyun Wang,Y Zhang,Xiangjun Xu,Zhijun Pan,Yan Feng,Jue Zhao,Lin Zhang,Gehong Yao,Xu Chen,Xueqian Xie

出处

期刊：Radiology [Radiological Society of North America]
日期：2024-09-01 卷期号：312 (3): e240885-e240885 被引量：53

链接

nih.govdoi.org

标识

DOI：10.1148/radiol.240885

摘要

Background The specialization and complexity of radiology makes the automatic generation of radiologic impressions (ie, a diagnosis with differential diagnosis and management recommendations) challenging. Purpose To develop a large language model (LLM) that generates impressions based on imaging findings and to evaluate its performance in professional and linguistic dimensions. Materials and Methods Six radiologists recorded imaging examination findings from August 2 to 31, 2023, at Shanghai General Hospital and used the developed LLM before routinely writing report impressions for multiple radiologic modalities (CT, MRI, radiography, mammography) and anatomic sites (cranium and face, neck, chest, upper abdomen, lower abdomen, vessels, bone and joint, spine, breast), making necessary corrections and completing the radiologic impression. A subset was defined to investigate cases where the LLM-generated impressions differed from the final radiologist impressions by excluding identical and highly similar cases. An expert panel scored the LLM-generated impressions on a five-point Likert scale (5 = strongly agree) based on scientific terminology, coherence, specific diagnosis, differential diagnosis, management recommendations, correctness, comprehensiveness, harmlessness, and lack of bias. Results In this retrospective study, an LLM was pretrained using 20 GB of medical and general-purpose text data. The fine-tuning data set comprised 1.5 GB of data, including 800 radiology reports with paired instructions (describing the output task in natural language) and outputs. Test set 2 included data from 3988 patients (median age, 56 years [IQR, 40-68 years]; 2159 male). The median recall, precision, and F1 score of LLM-generated impressions were 0.775 (IQR, 0.56-1), 0.84 (IQR, 0.611-1), and 0.772 (IQR, 0.578-0.957), respectively, using the final impressions as the reference standard. In a subset of 1014 patients (median age, 57 years [IQR, 42-69 years]; 528 male), the overall median expert panel score for LLM-generated impressions was 5 (IQR, 5-5), ranging from 4 (IQR, 3-5) to 5 (IQR, 5-5). Conclusion The developed LLM generated radiologic impressions that were professionally and linguistically appropriate for a full spectrum of radiology examinations. © RSNA, 2024 Supplemental material is available for this article.

求助该文献

最长约 10秒，即可获得该文献文件

Constructing a Large Language Model to Generate Impressions from Findings in Radiology Reports

今日热心研友