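Cluster analysis
Computer science
Categorical variable
Data mining
Rand index
Entropy (arrow of time)
Data point
Pattern recognition (psychology)
k-medians clustering
Single-linkage clustering
Similarity measure
Metric (unit)
Clustering high-dimensional data
Artificial intelligence
Data set
Benchmark (surveying)
Correlation clustering
CURE data clustering algorithm
Machine learning
Quantum mechanics
Physics
Operations management
Economics
Geography
Geodesy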
Authors
Amit Kumar Kar,Amaresh Chandra Mishra,Sraban Kumar Mohanty
Identifier
DOI:10.1016/j.engappai.2022.105795
Abstract
Clustering is an unsupervised learning technique that discovers intrinsic groups based on the proximity between data points. The performance of a clustering technique therefore depends mainly on the proximity measure used to compute the (dis)similarity between data objects. Computing the distance between numerical data points is relatively easy, because numerical operations can be applied directly to the values along each feature. For categorical data sets, however, computing the (dis)similarity between data objects is a non-trivial problem. In this paper, we propose a new distance metric, based on an information-theoretic approach, to compute the dissimilarity between categorical data points. We compute the entropy along each feature to capture intra-attribute statistical information, on the basis of which the significance of each attribute is decided during clustering. The proposed measure is free of domain-dependent parameters and does not rely on the distribution of the data points. Experiments are conducted on diverse benchmark data sets, comparing the proposed metric against six competing proximity measures with three popular clustering algorithms; the clustering results are compared in terms of RI (Rand Index), ARI (Adjusted Rand Index), CA (Clustering Accuracy) and Cluster Discrimination Matrix (CDM). On over 85 percent of the data sets, the clustering accuracy of the proposed metric embedded in K-Mode and Weighted K-Mode exceeds that of its counterparts. The proposed metric needs approximately 0.2951 s to cluster a data set of 10,000 data points with 8 attributes and 2 clusters on a standard desktop machine. Overall, the experimental results demonstrate the efficacy of the proposed metric in handling complex real data sets with different characteristics.
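The abstract does not give the formula of the proposed metric, so the following is only a minimal sketch of the general idea it describes: compute the Shannon entropy of each categorical attribute and use it to weight attribute mismatches. The function names, the choice to make weights proportional to entropy, and the simple mismatch (Hamming-style) comparison are all assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def attribute_entropy(column):
    """Shannon entropy (in bits) of the values in one categorical column."""
    n = len(column)
    counts = Counter(column)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h if h > 0 else 0.0  # avoid returning -0.0 for constant columns


def entropy_weights(rows):
    """One weight per attribute, proportional to its entropy.

    ASSUMPTION: higher-entropy attributes are treated as more significant.
    The paper may weight attributes differently.
    """
    columns = list(zip(*rows))
    entropies = [attribute_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def weighted_mismatch_distance(x, y, weights):
    """Sum of the weights of the attributes on which x and y disagree."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Toy data: the first attribute is constant, the second is evenly split.
rows = [("a", "x"), ("a", "y"), ("a", "x"), ("a", "y")]
weights = entropy_weights(rows)
d_diff = weighted_mismatch_distance(rows[0], rows[1], weights)
d_same = weighted_mismatch_distance(rows[0], rows[2], weights)
```

In this toy data set the constant attribute carries no information and receives zero weight, so only disagreements on the second attribute contribute to the distance. A distance of this kind could be plugged into a K-Modes-style algorithm in place of plain mismatch counting.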