期刊:IEEE Transactions on Knowledge and Data Engineering [Institute of Electrical and Electronics Engineers] 日期:2024-02-28卷期号:36 (8): 3712-3724
标识
DOI:10.1109/tkde.2024.3371228
摘要
Streaming triangle counting is a critical issue in graph stream mining, with applications in dense subgraph discovery, web mining, anomaly detection, and more. Recent efforts have focused on estimating triangle counts in graph streams, primarily through sampling methods. However, because of limited memory resources for handling high speed streams, traditional sampling methods suffer from reduced sampling rate and thereby performance loss. In this paper, we propose a new compact data structure called uHLL to process edge streams by considering the tradeoff between estimation accuracy and memory efficiency. Furthermore, different from conventional triangle counting algorithms, we solve the estimation of union set cardinality for edge-local triangle count under both centralized and distributed framework, so as to efficiently estimate the global triangle count by a one-pass streaming algorithm. To the best of our knowledge, this is the first implementation of a distributed framework using a compact data structure for streaming triangle counting. We provide theoretical proof of unbiasedness and derive the variance of the union set and global triangle count. We compare our scheme with 11 algorithms, showing that under the same experimental setting, uHLL and distributed uHLL are at least 2.3 and 1.7 times more accurate than the state-of-the-art, respectively.