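Computer science
Join (topology)
Data mining
Graph
Set (abstract data type)
Similarity (geometry)
Soar
Theoretical computer science
Distributed database
Artificial intelligence
Distributed computing
Mathematics
Combinatorics
Image (mathematics)
Programming language
Identifier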
DOI:10.1109/smc-iot62253.2023.00020
Abstract
Set Similarity Join (SSJ) plays a crucial role in a wide range of tasks, including plagiarism detection, data cleaning, and near-duplicate detection in IoT information, because it identifies similar pairs across two collections of sets. As data volumes continue to grow, distributed SSJ becomes necessary to process large-scale datasets efficiently. However, the extensively studied distributed SSJ solutions rely predominantly on a prefix-based framework and often suffer from two problems: (1) duplicate verification and (2) load imbalance. To address these issues, we propose a novel Graph-partitioning-based distributed set similarity Join (GrassJoin). Empirical evaluations on four datasets substantiate the efficacy of our approach and demonstrate substantial advantages over state-of-the-art solutions.
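To make the task concrete, the following is a minimal single-machine sketch of what a set similarity join computes. It is not GrassJoin and not a prefix-based method: it is a naive all-pairs baseline. The abstract does not name a similarity function or threshold, so the use of Jaccard similarity, the threshold value, and the names `jaccard` and `naive_ssj` are all assumptions made for illustration.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (assumed measure, not from the abstract)."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def naive_ssj(R, S, threshold):
    """Return (i, j, sim) for every pair R[i], S[j] with sim >= threshold.

    This compares every set in R with every set in S, so it costs
    O(|R| * |S|) similarity computations. Distributed SSJ methods exist
    precisely to avoid this brute-force cost on large inputs.
    """
    results = []
    for (i, r), (j, s) in product(enumerate(R), enumerate(S)):
        sim = jaccard(r, s)
        if sim >= threshold:
            results.append((i, j, sim))
    return results


# Hypothetical toy collections of token sets.
R = [{"a", "b", "c", "d"}, {"x", "y"}]
S = [{"a", "b", "c", "e"}, {"x", "y"}, {"p", "q"}]

pairs = naive_ssj(R, S, threshold=0.5)
for i, j, sim in pairs:
    print(f"R[{i}] ~ S[{j}] (similarity {sim:.2f})")
```

In this toy input, only the pairs (R[0], S[0]) and (R[1], S[1]) meet the threshold. A distributed solution would have to split this work across workers without verifying the same candidate pair twice and without overloading any single worker, which are the two issues the abstract attributes to prefix-based approaches.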