The Detection Algorithms for Similar Duplicate Data
Computer science
Algorithm
Authors
Jinyu Song, Quan Yu, Ruoyu Bao
Source
Journal: International Conference on Systems; Date: 2019-11-01; Citations: 1
Identifier
DOI: 10.1109/icsai48974.2019.9010154
Abstract
This paper studies and analyzes three algorithms for detecting similar duplicate data records. Two of them, the basic sorted-neighborhood method (SNM) and the priority queue algorithm, are commonly used for duplicate detection, and both follow the sort-and-merge approach. The third is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The paper also discusses key problems in applying and implementing the algorithms: how to choose the attributes of data records, how to calculate the similarity between the corresponding attributes of two records, which sorting algorithm to use, and how to set the sliding window size and the queue length. All three algorithms, along with some related techniques, are implemented in MATLAB. The paper verifies its conclusions and evaluates the three algorithms by testing them on a large number of loaded data sets. Finally, it provides guidance on applying these algorithms.
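
To make the sort-and-merge idea concrete, here is a minimal sketch of a sorted-neighborhood pass. It is written in Python rather than the MATLAB used by the authors, and it is not taken from the paper. The names `similarity` and `sorted_neighborhood`, the sample records, the default window size of 3, the threshold of 0.85, and the use of `difflib.SequenceMatcher` as the attribute similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean string similarity across the fields of two records (0.0 to 1.0)."""
    # Assumption: every attribute is a string and all attributes weigh equally.
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, of records judged similar duplicates."""
    # Sort record indices by the chosen key so likely duplicates become neighbors.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare only with the (window - 1) records that precede this one.
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data: (name, address) records.
records = [
    ("John Smith", "12 Oak Street"),
    ("Mary Jones", "7 Elm Road"),
    ("Jon Smith", "12 Oak Street"),
    ("Alan Brown", "3 Pine Avenue"),
]
duplicates = sorted_neighborhood(records, key=lambda r: r[0].lower())
print(duplicates)
```

With this sample data the only reported pair is records 0 and 2. The sketch also exposes the trade-off named in the abstract: a larger `window` catches duplicates that sort further apart but costs more comparisons, and the sort key decides which records ever meet inside a window.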