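Robustness (evolution)
Medicine
Trustworthiness
Artificial intelligence
Receiver operating characteristic
Machine learning
Radiography
Radiology
Computer science
Computer security
Biochemistry
Gene
Chemistry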
Authors
Jiajin Zhang, Hanqing Chao, Giridhar Dasegowda, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
Source
Journal: Radiology: Artificial Intelligence
[Radiological Society of North America]
Date: 2024-01-01
Volume (issue): 6 (1)
Citations: 1
Abstract
“Just Accepted” papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content.
Purpose To determine whether saliency maps in radiology artificial intelligence (AI) are vulnerable to subtle perturbations of the input, which could lead to misleading interpretations, using the Prediction-Saliency Correlation (PSC) to evaluate the sensitivity and robustness of saliency methods.
Materials and Methods In this retrospective study, locally trained deep learning models and a research prototype provided by a commercial vendor were systematically evaluated on 191,229 chest radiographs from the CheXpert dataset (1,2) and 7,022 MRI images from a human brain tumor classification dataset (3). Two radiologists performed a reader study on 270 chest radiograph pairs. A model-agnostic approach for computing the PSC coefficient was used to evaluate the sensitivity and robustness of seven commonly used saliency methods.
Results The saliency methods had low sensitivity (maximum PSC = 0.25, 95% CI: 0.12, 0.38) and weak robustness (maximum PSC = 0.12, 95% CI: 0.0, 0.25) on the CheXpert dataset, as demonstrated by leveraging locally trained model parameters. Further evaluation showed that the saliency maps generated from the commercial prototype may be irrelevant to the model output, even without knowledge of the model specifics (area under the receiver operating characteristic curve dropped by 8.6% without affecting the saliency map). The human observer studies confirmed that it is difficult for experts to identify the perturbed images; the readers achieved less than 44.8% correctness.
Conclusion Popular saliency methods scored low PSC values on the two datasets of perturbed chest radiographs, indicating weak sensitivity and robustness. The proposed PSC metric provides a valuable quantification tool for validating the trustworthiness of medical AI explainability. ©RSNA, 2023