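Cluster analysis
Computer science
Scalability
Spectral clustering
Set (abstract data type)
Computation
Distributed memory
Similarity (geometry)
Data set
Correlation clustering
Matrix (chemical analysis)
Algorithm
Pattern recognition (psychology)
Data mining
Artificial intelligence
Theoretical computer science
Parallel computing
Shared memory
Materials science
Composite material
Database
Image (mathematics)
Programming language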
Authors
Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, Edward Yi Chang
Identifier
DOI:10.1109/tpami.2010.88
Abstract
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.
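The sparsification strategy mentioned in the abstract (retaining only nearest neighbors in the similarity matrix) can be illustrated with a minimal single-machine sketch. This is not the authors' parallel implementation: the Gaussian kernel, the `sigma` parameter, the brute-force neighbor search, the symmetrization rule, and the function name `sparse_knn_similarity` are all assumptions made for illustration.

```python
import math

def sparse_knn_similarity(points, t, sigma=1.0):
    """Build a sparse, symmetric similarity matrix as {i: {j: s_ij}}.

    Only the t nearest neighbors of each point are retained, so the
    number of stored entries grows roughly as n * t instead of n * n.
    Assumption: similarities use a Gaussian kernel with width sigma.
    """
    n = len(points)
    sim = {i: {} for i in range(n)}
    for i in range(n):
        # Squared Euclidean distance from point i to every other point
        # (brute force; a real system would distribute this step).
        dists = []
        for j in range(n):
            if j != i:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                dists.append((d2, j))
        dists.sort()
        for d2, j in dists[:t]:
            s = math.exp(-d2 / (2.0 * sigma ** 2))
            # Assumed symmetrization: keep the entry if either point
            # is among the other's t nearest neighbors.
            sim[i][j] = s
            sim[j][i] = s
    return sim

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
sim = sparse_knn_similarity(points, t=1)
stored = sum(len(row) for row in sim.values())
print(stored, "stored entries versus", len(points) ** 2, "in the dense matrix")
```

With `t` fixed, storage grows linearly in the number of instances, which is the property that makes this kind of approximation attractive for data sets of the sizes the abstract reports.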