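Big data
Computer science
Data quality
Data mining
Quality (philosophy)
Sampling (signal processing)
Context (archaeology)
Consistency (knowledge bases)
Data integrity
Data science
Database
Artificial intelligence
Engineering
Philosophy
Metric (unit)
Paleontology
Operations management
Epistemology
Filter (signal processing)
Biology
Computer vision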
Authors
Ikbal Taleb, Hadeel T. El Kassabi, Mohamed Adel Serhani, Rachida Dssouli, Chafik Bouhaddioui
Identifier
DOI:10.1109/uic-atc-scalcom-cbdcom-iop-smartworld.2016.0122
Abstract
Data is the most valuable asset that companies are proud of. When its quality degrades, the consequences are unpredictable and can lead to completely wrong insights. In the Big Data context, evaluating data quality is challenging, and it must be done prior to any Big Data analytics in order to provide some confidence in the data's quality. Given the huge size of the data and its fast generation, mechanisms and strategies are required to evaluate and assess data quality quickly and efficiently. However, checking the quality of Big Data is a very costly process if it is applied to the entire data set. In this paper, we propose an efficient data quality evaluation scheme that applies sampling strategies to Big Data sets. Sampling reduces the data to representative population samples for fast quality evaluation. The evaluation targets data quality dimensions such as completeness and consistency. The experiments were conducted on a sleep disorder data set by applying Big Data bootstrap sampling techniques. The results show that the mean quality score of the samples is representative of the original data, and they illustrate the importance of sampling for reducing computing costs in Big Data quality evaluation. We applied the generated quality results as quality proposals on the original data to increase its quality.
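The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain bootstrap (resampling rows with replacement) and not the paper's specific Big Data bootstrap technique, it scores only the completeness dimension, and it runs on synthetic data. The helper names `completeness`, `bootstrap_quality`, and `make_synthetic_rows`, as well as the sample counts and the 10% missing rate, are assumptions chosen for illustration.

```python
import random
from statistics import mean


def completeness(rows):
    """Fraction of cells that are not missing (None) across all rows."""
    total = sum(len(r) for r in rows)
    filled = sum(1 for r in rows for v in r if v is not None)
    return filled / total if total else 0.0


def bootstrap_quality(rows, score, n_samples=50, sample_size=200, seed=0):
    """Score n_samples bootstrap samples (drawn with replacement) of rows.

    Returns the mean score and the list of per-sample scores.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.choices(rows, k=sample_size)  # sampling with replacement
        scores.append(score(sample))
    return mean(scores), scores


def make_synthetic_rows(n=10_000, n_cols=5, missing_rate=0.1, seed=1):
    """Hypothetical data set: numeric cells, each missing with missing_rate."""
    rng = random.Random(seed)
    return [
        [None if rng.random() < missing_rate else rng.random()
         for _ in range(n_cols)]
        for _ in range(n)
    ]


rows = make_synthetic_rows()
full_score = completeness(rows)                       # scans every row
mean_score, sample_scores = bootstrap_quality(rows, completeness)

print(f"completeness on full data:       {full_score:.4f}")
print(f"mean completeness over samples:  {mean_score:.4f}")
```

Under these assumptions, the mean score over the samples should land close to the score computed on the full data while each sample touches only 200 of the 10,000 rows. A consistency check could be plugged in the same way by passing a different `score` function.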