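Computer science
Data mining
Data set
Data cleaning
Data quality
Set (abstract data type)
Big data
Data modeling
Quality (philosophy)
Artificial intelligence
Database
Engineering
Epistemology
Philosophy
Metric (unit)
Programming language
Operations management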
Authors
Cheng Zhang, Song-Bo Lin, Zhang Dang
Identifier
DOI:10.1145/3561801.3561814
Abstract
As the era of big data advances, data quality has become a growing concern, and improving it is an active research topic. In this paper, we propose a data cleaning method for industrial data flow based on multistage combinational optimization of a rule set. A suitable cleaning algorithm is selected according to the characteristics of the data, the cleaned data is evaluated, and the cleaning rules are updated. In the first step, feature detection is performed on the data, and high-quality data is selected as training samples and matched with the best-performing cleaning algorithm. In the second step, the model uses a random forest algorithm to learn the relationship between data features and data cleaning algorithms, and constructs multi-level filtering rules. In the third step, the data is cleaned iteratively so that the rules are updated automatically. The resulting model cleans data automatically and effectively. The results show that the method achieves automatic cleaning on real industrial data sets, with a cleaning accuracy of up to 99%. The method addresses the problem of automatic data cleaning and can be used in practical industrial data systems.
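The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the features, the candidate cleaning algorithms, or the evaluation criterion, so the two features (`missing_ratio`, `outlier_ratio`), the two cleaners (`fill_missing`, `clip_outliers`), and every function name below are hypothetical. To stay within the standard library, a small bagged ensemble of one-split decision stumps stands in for the random forest named in the abstract.

```python
import random
import statistics
from collections import Counter

def _median_and_mad(present):
    """Median and median absolute deviation of the observed values."""
    med = statistics.median(present)
    mad = statistics.median(abs(x - med) for x in present) or 1e-9
    return med, mad

def detect_features(series):
    """Step 1 (assumed features): share of missing values and share of outliers."""
    present = [x for x in series if x is not None]
    med, mad = _median_and_mad(present)
    missing_ratio = 1 - len(present) / len(series)
    outlier_ratio = sum(abs(x - med) / mad > 5 for x in present) / len(present)
    return (missing_ratio, outlier_ratio)

def fill_missing(series):
    """Hypothetical cleaner: replace None with the median of observed values."""
    med = statistics.median(x for x in series if x is not None)
    return [med if x is None else x for x in series]

def clip_outliers(series):
    """Hypothetical cleaner: replace values far from the median with the median."""
    med, mad = _median_and_mad([x for x in series if x is not None])
    return [x if x is None or abs(x - med) / mad <= 5 else med for x in series]

CLEANERS = {"fill_missing": fill_missing, "clip_outliers": clip_outliers}

def fit_stump(X, y, rng):
    """One-split tree on a bootstrap sample, using one randomly chosen feature."""
    idx = [rng.randrange(len(X)) for _ in X]
    f = rng.randrange(len(X[0]))
    best = None
    for t in sorted({X[i][f] for i in idx}):
        left = [y[i] for i in idx if X[i][f] <= t]
        right = [y[i] for i in idx if X[i][f] > t]
        lo = Counter(left).most_common(1)[0][0]
        hi = Counter(right).most_common(1)[0][0] if right else lo
        errors = sum(v != lo for v in left) + sum(v != hi for v in right)
        if best is None or errors < best[0]:
            best = (errors, f, t, lo, hi)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Step 2 (stand-in for a random forest): bagged stumps over random features."""
    rng = random.Random(seed)
    return [fit_stump(X, y, rng) for _ in range(n_trees)]

def predict(forest, x):
    """Majority vote over the stumps: which cleaner suits feature vector x."""
    votes = Counter(lo if x[f] <= t else hi for f, t, lo, hi in forest)
    return votes.most_common(1)[0][0]

def clean_and_update(series, X, y, forest):
    """Step 3: clean one window; if quality improved, add it to the rules' training set."""
    feats = detect_features(series)
    name = predict(forest, feats)
    cleaned = CLEANERS[name](series)
    if sum(detect_features(cleaned)) < sum(feats):  # assumed quality check
        X.append(feats)
        y.append(name)
        forest = fit_forest(X, y)
    return cleaned, name, forest

# Illustrative training samples: (missing_ratio, outlier_ratio) -> best cleaner.
X = [(0.3, 0.0), (0.4, 0.0), (0.25, 0.0), (0.0, 0.1), (0.0, 0.2), (0.0, 0.15)]
y = ["fill_missing"] * 3 + ["clip_outliers"] * 3
forest = fit_forest(X, y)
window = [10.0, None, 10.2, 9.9, None, 10.1, 10.0, 9.8, 10.1, 10.0]
cleaned, chosen, forest = clean_and_update(window, X, y, forest)
```

In this sketch the "rules" are simply the fitted stumps, and the update in step 3 refits them whenever a cleaned window scores better than its raw version. A real system would need a stronger evaluation criterion than the sum of the two defect ratios used here.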