An efficient entropy based dissimilarity measure to cluster categorical data

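Keywords: Cluster analysis; Computer science; Categorical variable; Data mining; Rand index; Entropy (arrow of time); Data point; Pattern recognition (psychology); k-medians clustering; Single-linkage clustering; Similarity measure; Metric (unit); Clustering high-dimensional data; Artificial intelligence; Data set; Benchmark (surveying); Correlation clustering; CURE data clustering algorithm; Machine learning; Quantum mechanics; Physics; Operations management; Economic geography; Geodesy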

Authors

Amit Kumar Kar, Amaresh Chandra Mishra, Sraban Kumar Mohanty

Source

Journal: Engineering Applications of Artificial Intelligence [Elsevier BV]
Date: 2023-01-12 Volume/issue: 119: 105795-105795 Citations: 13

Identifier

DOI: 10.1016/j.engappai.2022.105795

Abstract

Clustering is an unsupervised learning technique that discovers intrinsic groups based on proximity between data points. The performance of clustering techniques therefore relies mainly on the proximity measures used to compute the (dis)similarity between data objects. In general, it is relatively easy to compute the distance between numerical data points, as numerical operations can be applied directly to values along features. For categorical datasets, however, computing the (dis)similarity between data objects is a non-trivial problem. In this paper, we therefore propose a new distance metric based on an information-theoretic approach to compute the dissimilarity between categorical data points. We compute entropy along each feature to capture the intra-attribute statistical information, based on which the significance of each attribute is decided during clustering. The proposed measure is free from any domain-dependent parameters and does not rely on the distribution of data points. Experiments are conducted over diversified benchmark data sets, considering six competing proximity measures with three popular clustering algorithms, and the clustering results are compared in terms of RI (Rand Index), ARI (Adjusted Rand Index), CA (Clustering Accuracy) and CDM (Cluster Discrimination Matrix). On over 85 percent of the data sets, the clustering accuracy of the proposed metric embedded with K-Mode and Weighted K-Mode outperforms its counterparts. Approximately 0.2951 s is needed by the proposed metric to cluster a data set having 10,000 data points with 8 attributes and 2 clusters on a standard desktop machine. Overall, experimental results demonstrate the efficacy of the proposed metric in handling complex real datasets of different characteristics.
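The abstract does not state the metric's formula, so the following is only a minimal sketch of the general idea it describes: compute Shannon entropy along each categorical feature and use it to weight attributes in a dissimilarity. The function names, the normalization of the weights, the choice to give higher-entropy attributes larger weights, and the weighted-mismatch form are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def attribute_entropy(column):
    """Shannon entropy (in bits) of the value distribution of one categorical column."""
    n = len(column)
    counts = Counter(column)
    # p * log2(1/p) summed over the observed categories
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def entropy_weights(rows):
    """Per-attribute weights proportional to entropy (an illustrative assumption).

    Falls back to uniform weights when every attribute is constant.
    """
    columns = list(zip(*rows))
    entropies = [attribute_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def weighted_mismatch(x, y, weights):
    """Sum the weights of the attributes on which x and y disagree."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Toy data: the first attribute is constant, the second is evenly split.
rows = [("a", "x"), ("a", "y"), ("a", "x"), ("a", "y")]
weights = entropy_weights(rows)
d = weighted_mismatch(rows[0], rows[1], weights)
```

In this toy data the constant first attribute carries no information and receives zero weight, so only disagreement on the second attribute contributes to the dissimilarity. A measure of this kind needs no user-set parameters, which matches the parameter-free property the abstract claims, but the actual weighting scheme in the paper may differ.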
