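Imputation (statistics)
Missing data
Computer science
Data mining
Knowledge extraction
Data quality
k-nearest neighbors algorithm
Data set
Artificial intelligence
Nearest neighbor
Machine learning
Engineering
Metric (unit)
Operations management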
Authors
Gustavo E. A. P. A. Batista, Maria Carolina Monard
Abstract
Data quality is a major concern in Machine Learning and related areas such as Knowledge Discovery from Databases (KDD). As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. One relevant problem in data quality is the presence of missing data. Although missing data occur frequently, many Machine Learning algorithms handle them in a rather naive way. Missing data treatment should be carefully thought out; otherwise, bias might be introduced into the induced knowledge. In this work, we analyse the use of the k-nearest neighbour algorithm as an imputation method. Imputation denotes a procedure that replaces the missing values in a data set with plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data.
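To make the idea of k-nearest neighbour imputation concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the distance function, the value of k, or how nominal attributes are handled, so everything here is an assumption for illustration: the function name `knn_impute` is hypothetical, attributes are assumed numeric, missing values are represented as `None`, distance is a root-mean-square difference over the attributes both rows have observed, and the imputed value is the mean of the k nearest donors.

```python
import math


def knn_impute(rows, k=3):
    """Return a copy of rows with None entries filled by the mean of the
    k nearest rows that have that attribute observed (numeric data only)."""

    def distance(a, b):
        # Compare only the attributes observed in both rows (an assumption;
        # other conventions for partially observed rows are possible).
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Divide by the number of shared attributes so that rows sharing
        # fewer attributes are not automatically considered closer.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which attribute j is observed
            # and which share at least one observed attribute with this row.
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d != math.inf:
                    donors.append((d, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
            # If no donor exists, the value is left as None.
    return imputed


data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [1.05, None],  # second attribute is missing
]
filled = knn_impute(data, k=2)
print(filled[3])
```

In this toy data set the two rows closest to the incomplete row on the first attribute supply the replacement value, while the distant third row is ignored. A practical version would also scale attributes before computing distances and use the mode rather than the mean for nominal attributes.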