Computer science
Imputation (statistics)
Missing data
Robustness (evolution)
Deep learning
Artificial intelligence
Embedding
Sample size determination
Inference
Machine learning
Data mining
Statistics
Mathematics
Biochemistry
Gene
Chemistry
Authors
Yige Sun, Jing Li, Yifan Xu, Tingting Zhang, Xiaofeng Wang
Identifier
DOI:10.1016/j.eswa.2023.120201
Abstract
Deep learning models have recently been proposed for missing data imputation. In this paper, we review popular statistical, machine learning, and deep learning approaches, and discuss the advantages and disadvantages of these methods. We conduct a comprehensive numerical study to compare the performance of several widely used imputation methods for incomplete tabular (structured) data. Specifically, we compare the deep learning methods: generative adversarial imputation networks (GAIN) with one-hot encoding, GAIN with embedding, variational auto-encoder (VAE) with one-hot encoding, and VAE with embedding, against two conventional methods: multiple imputation by chained equations (MICE) and missForest. Seven real benchmark datasets and three simulated datasets are considered, covering various scenarios with different feature types and different sample sizes. The missing data are generated with different missing ratios under three kinds of missing mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Our experiments show that, for small or moderate sample sizes, the conventional methods exhibit better robustness and imputation performance than the deep learning methods. GAINs perform well only in the MCAR case and often fail under MAR and MNAR. VAEs are prone to mode collapse under all missing mechanisms. We conclude that the conventional methods, MICE and missForest, are preferable for practitioners dealing with missing data imputation for tabular data with a limited sample size (i.e., n<30,000) in real case analyses.
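To make the MCAR setting concrete, the sketch below (a minimal illustration, not the paper's actual experimental code; all function names are hypothetical) generates an MCAR missingness mask, where each entry is dropped independently of the data values, and fills the gaps with a simple column-mean baseline. MICE and missForest improve on this baseline by iteratively modeling each feature conditional on the others.

```python
import random

def mcar_mask(n_rows, n_cols, miss_ratio, seed=0):
    """MCAR mask: each entry is missing independently with probability
    miss_ratio, regardless of the (observed or unobserved) data values."""
    rng = random.Random(seed)
    return [[rng.random() < miss_ratio for _ in range(n_cols)]
            for _ in range(n_rows)]

def mean_impute(data, mask):
    """Fill masked entries with the observed column mean -- a simple
    baseline; MICE/missForest instead iterate conditional models
    (regressions / random forests) over the columns."""
    n_rows, n_cols = len(data), len(data[0])
    filled = [row[:] for row in data]
    for j in range(n_cols):
        observed = [data[i][j] for i in range(n_rows) if not mask[i][j]]
        col_mean = sum(observed) / len(observed)
        for i in range(n_rows):
            if mask[i][j]:
                filled[i][j] = col_mean
    return filled

# Example: 30% MCAR missingness on a tiny numeric table.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = mcar_mask(len(data), len(data[0]), miss_ratio=0.3)
completed = mean_impute(data, mask)
```

Under MAR or MNAR, the mask would instead depend on observed (or unobserved) values, which is where, per the experiments above, the GAIN and VAE imputers tend to break down.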