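Imputation (statistics)
Missing data
Computer science
Data mining
Regression
Bayesian probability
Mean squared error
Linear regression
Statistics
Artificial intelligence
Machine learning
Mathematics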
Authors
Anil Jadhav, Dhanya Pramod, Krishnan Ramanathan
Identifier
DOI:10.1080/08839514.2019.1637138
Abstract
Missing data is a common problem faced by researchers and data scientists, and it must be handled appropriately to obtain accurate results from data analysis. The objective of this paper is to provide a better understanding of data missingness mechanisms and data imputation methods, and to assess the performance of widely used imputation methods on numeric datasets. It is intended to help practitioners and data scientists select an appropriate imputation method for numeric datasets when performing data mining tasks. We comprehensively compare seven data imputation methods: mean imputation, median imputation, kNN imputation, predictive mean matching, Bayesian linear regression (norm), non-Bayesian linear regression (norm.nob), and random sample. We use five numeric datasets obtained from the UCI Machine Learning Repository to analyze and compare the performance of these methods, which is assessed using the normalized root mean square error (NRMSE). The results show that kNN imputation outperforms the other methods. We also find that the performance of an imputation method is independent of the dataset and of the percentage of missing values in the dataset.
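The kind of comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It covers only three of the seven methods (mean, median, and a simplified kNN) on a toy table in which a single column has missing values. The function names, the toy data, the choice of k = 2, and the normalization of RMSE by the range of the true values are all assumptions made for illustration; the abstract does not state how the error is normalized.

```python
import math
import statistics


def impute_with_statistic(rows, col, stat):
    """Fill None entries in column `col` with stat(observed values).

    `stat` is a callable such as statistics.mean or statistics.median.
    """
    observed = [r[col] for r in rows if r[col] is not None]
    fill = stat(observed)
    return [[fill if (j == col and v is None) else v for j, v in enumerate(r)]
            for r in rows]


def knn_impute(rows, col, k=2):
    """Simplified kNN imputation (assumption: only `col` has missing values).

    Each missing entry is replaced by the mean of `col` over the k donor
    rows closest in Euclidean distance on the remaining columns.
    """
    donors = [r for r in rows if r[col] is not None]
    result = []
    for r in rows:
        filled = list(r)
        if r[col] is None:
            def distance(d, r=r):
                return math.sqrt(sum((r[j] - d[j]) ** 2
                                     for j in range(len(r)) if j != col))
            nearest = sorted(donors, key=distance)[:k]
            filled[col] = sum(d[col] for d in nearest) / len(nearest)
        result.append(filled)
    return result


def nrmse(truth, imputed, col, missing_idx):
    """RMSE over the imputed positions, divided by the range of the true column.

    Range normalization is an assumed convention, not taken from the paper.
    """
    true_col = [r[col] for r in truth]
    squared = [(imputed[i][col] - truth[i][col]) ** 2 for i in missing_idx]
    rmse = math.sqrt(sum(squared) / len(squared))
    return rmse / (max(true_col) - min(true_col))


# Hypothetical toy data: column 1 is the variable to impute.
truth = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [10.0, 100.0]]
rows = [list(r) for r in truth]
rows[2][1] = None  # hide one known value so the error can be measured

for name, imputed in [
    ("mean", impute_with_statistic(rows, 1, statistics.mean)),
    ("median", impute_with_statistic(rows, 1, statistics.median)),
    ("kNN (k=2)", knn_impute(rows, 1, k=2)),
]:
    print(name, nrmse(truth, imputed, 1, [2]))
```

Hiding known values and scoring the imputed values against them is the general evaluation pattern. A result on a five-row toy table says nothing about which method is better in practice.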