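Jaccard index
Similarity (geometry)
Computer science
Sketch
Set (abstract data type)
Nearest neighbor search
Matching (statistics)
Containment (computer programming)
Intersection (aeronautics)
Data mining
Information retrieval
Artificial intelligence
Algorithm
Pattern recognition (psychology)
Mathematics
Statistics
Image (mathematics)
Aerospace engineering
Engineering
Programming language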
Authors
Yang Yang,Ying Zhang,Wenjie Zhang,Zengfeng Huang
Identifier
DOI:10.1109/icde.2019.00048
Abstract
In this paper, we study the problem of approximate containment similarity search. Given two records Q and X, the containment similarity between Q and X with respect to Q is |Q ∩ X| / |Q|. Given a query record Q and a set of records S, containment similarity search finds the records in S whose containment similarity with respect to Q is not less than a given threshold. This problem has many important applications in commercial and scientific fields, such as record matching and domain search. The existing solution relies on the asymmetric LSH method, transforming containment similarity into the well-studied Jaccard similarity. In this paper, we use a different framework that transforms containment similarity into set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and achieves a good trade-off between sketch size and accuracy. We provide a set of theoretical analyses to underpin the proposed augmented KMV sketch technique, and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, the time-accuracy trade-off, and the sketch construction time. For instance, with similar estimation accuracy (F1 score), GB-KMV is over 100 times faster than LSH-E on some real-life datasets.
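To make the definitions in the abstract concrete, the following Python sketch computes the exact containment similarity |Q ∩ X| / |Q| and runs a brute-force threshold search. It then estimates the intersection size from a plain KMV (k minimum values) sketch. This is not the paper's GB-KMV, whose augmentation is not described in the abstract; it only illustrates the general idea of turning containment into a set-intersection estimate. All function names, the SHA-1 based hash, and the choice of k are assumptions made for illustration, and the sketch estimator assumes each record has many more than k elements.

```python
import hashlib


def containment(q, x):
    """Exact containment similarity of q in x: |q ∩ x| / |q|."""
    if not q:
        raise ValueError("query record must be non-empty")
    return len(q & x) / len(q)


def containment_search(q, records, threshold):
    """Brute-force search: indices of records whose containment is >= threshold."""
    return [i for i, x in enumerate(records) if containment(q, x) >= threshold]


def unit_hash(element):
    """Map an element to a pseudo-uniform value in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha1(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(record, k):
    """Plain KMV sketch: the k smallest hash values of the record, sorted."""
    return sorted(unit_hash(e) for e in record)[:k]


def estimate_intersection(sketch_a, sketch_b):
    """Estimate |A ∩ B| from two plain KMV sketches.

    Uses the standard KMV estimator: take the k smallest values of the
    merged sketches, count how many of them occur in both sketches, and
    scale that fraction by the union-size estimate (k - 1) / U_k, where
    U_k is the k-th smallest merged value.
    """
    k = min(len(sketch_a), len(sketch_b))
    if k < 2:
        return 0.0
    merged = sorted(set(sketch_a) | set(sketch_b))[:k]
    shared = set(sketch_a) & set(sketch_b)
    k_shared = sum(1 for v in merged if v in shared)
    return (k_shared / k) * (k - 1) / merged[-1]


def estimate_containment(q, x, k=64):
    """Approximate containment of q in x, assuming |q| is known exactly."""
    return estimate_intersection(kmv_sketch(q, k), kmv_sketch(x, k)) / len(q)


# Exact search on a toy example.
query = {"a", "b", "c", "d"}
records = [{"a", "b", "c", "x"}, {"a", "y"}, {"a", "b", "c", "d", "e"}]
print(containment_search(query, records, 0.75))  # [0, 2]

# Approximate containment on larger records (true value is 0.5).
big_q = set(range(1000))
big_x = set(range(500, 1500))
print(estimate_containment(big_q, big_x))
```

The estimate in the last line varies with the hash function and with k; a larger k reduces the variance at the cost of a larger sketch, which is the space-accuracy trade-off the abstract refers to.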