Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data

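Keywords: Missing data; Imputation (statistics); Random forest; Overfitting; Computer science; Statistics; Artificial intelligence; Health records; Data mining; Machine learning; Mathematics; Artificial neural network; Health care; Economics; Economic growth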

Authors

Kylie Getz, Rebecca A. Hubbard, Kristin A. Linn

Source

Journal: Epidemiology [Lippincott Williams & Wilkins]
Date: 2022-12-09; Volume (Issue): 34 (2), pages 206-215; Cited by: 23

Identifier

DOI: 10.1097/ede.0000000000001578

Abstract

Background: Missing data are common in studies using electronic health record (EHR)-derived data. Missingness in EHR data is related to healthcare utilization patterns, resulting in complex missingness mechanisms that are potentially missing not at random. Prior research has suggested that machine learning-based multiple imputation methods may outperform traditional methods and may perform well even in settings of missing not at random missingness.

Methods: We used plasmode simulations based on a nationwide EHR-derived de-identified database for patients with metastatic urothelial carcinoma to compare the performance of multiple imputation using chained equations, random forests, and denoising autoencoders in terms of bias and precision of hazard ratio estimates under varying proportions of observations with missing values and missingness mechanisms (missing completely at random, missing at random, and missing not at random).

Results: Multiple imputation by chained equations and random forest methods had low bias and similar standard errors for parameter estimates under missingness completely at random. Under missingness at random, denoising autoencoders had higher bias than multiple imputation by chained equations and random forests. Contrary to results of prior studies of denoising autoencoders, all methods exhibited substantial bias under missingness not at random, with bias increasing in direct proportion to the amount of missing data.

Conclusions: We found no advantage of denoising autoencoders for multiple imputation in the setting of an epidemiologic study conducted using EHR data. Results suggested that denoising autoencoders may overfit the data, leading to poor confounder control. Use of more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.
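
The three missingness mechanisms compared in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' plasmode design: it assumes a synthetic two-variable dataset, targets a simple mean rather than a hazard ratio, and swaps in single regression imputation (filling each missing value with its fitted conditional mean) in place of multiple imputation. The function name `simulate` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate(n=50_000, seed=0):
    """Return the bias of two estimators of mean(y) under MCAR, MAR and MNAR.

    Illustrative assumptions: x is always observed, y = x + noise, and
    roughly half of the y values are set to missing under each mechanism.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)        # fully observed covariate
    y = x + rng.normal(size=n)    # variable that will have missing values
    full_mean = y.mean()          # benchmark computed before deleting anything

    miss_prob = {
        "mcar": np.full(n, 0.5),  # unrelated to the data
        "mar": sigmoid(x),        # depends only on the observed x
        "mnar": sigmoid(y),       # depends on the unobserved y itself
    }

    complete_case_bias, imputed_bias = {}, {}
    for name, p in miss_prob.items():
        missing = rng.random(n) < p
        obs = ~missing

        # Complete-case analysis: drop every row with a missing y.
        complete_case_bias[name] = y[obs].mean() - full_mean

        # Single regression imputation: fit y ~ x on the observed rows,
        # then fill each missing y with its fitted value.
        slope, intercept = np.polyfit(x[obs], y[obs], 1)
        y_filled = np.where(missing, intercept + slope * x, y)
        imputed_bias[name] = y_filled.mean() - full_mean

    return complete_case_bias, imputed_bias


complete_case_bias, imputed_bias = simulate()
for name in ("mcar", "mar", "mnar"):
    print(f"{name}: complete-case bias = {complete_case_bias[name]:+.3f}, "
          f"imputed bias = {imputed_bias[name]:+.3f}")
```

In this toy setting, imputing from the observed covariate should remove most of the complete-case bias under missingness at random, but not under missingness not at random, because the imputation model is itself fitted on a sample selected on the outcome. This is the same qualitative pattern the abstract reports for all three imputation methods.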
