Cluster analysis
Computer science
Pattern recognition (psychology)
Weighting
Entropy (arrow of time)
Data mining
Feature (linguistics)
Artificial intelligence
Kullback-Leibler divergence
Mathematics
Algorithm
Statistics
Physics
Linguistics
Philosophy
Quantum mechanics
Acoustics
Authors
Abdul Atif Khan, Amaresh Chandra Mishra, Sraban Kumar Mohanty
Identifier
DOI: 10.1016/j.knosys.2023.110967
Abstract
Suitable selection of a proximity measure is one of the fundamental requirements of clustering. With conventional (dis)similarity measures, many clustering algorithms do not yield satisfactory results on complex high-dimensional datasets. The problem lies in the widely varying distributions along each feature, which existing (dis)similarity measures fail to capture. In this work, we study the distribution of all-pair absolute distances over the feature space in several standard real datasets and observe that most of the values are close to zero. The frequency of data pairs decreases with increasing distance, which suggests an exponential distribution of values along each feature. The exponential decay rate constant, termed the characteristic length, indicates the inhomogeneity of a feature; we therefore use it as a weighting factor across attributes. The dissimilarity for a pair of data points is computed by combining the per-attribute weights with a continuum adaptation of Boltzmann's notion of entropy, which takes feature-wise absolute differences as input. We prove that the proposed measure is a metric. For experimental analysis, we compare different proximity measures with the proposed one in terms of clustering results. The combination of feature-wise characteristic length and the continuous version of Boltzmann's entropy proves its excellence in clustering results on diverse synthetic, real, and gene expression datasets.
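The core idea of the abstract can be sketched in code. The snippet below is an illustrative reading, not the authors' exact formulation: if the all-pair absolute differences along a feature follow an exponential distribution, the maximum-likelihood estimate of the decay rate is the reciprocal of their mean, so the characteristic length (its inverse) is simply the mean pairwise difference. The `weighted_dissimilarity` function is a hypothetical stand-in that scales feature-wise absolute differences by the inverse characteristic length; the paper additionally applies a continuum adaptation of Boltzmann's entropy, whose exact expression is not given in the abstract.

```python
import numpy as np

def characteristic_lengths(X):
    """Estimate a per-feature characteristic length from all-pair
    absolute differences (illustrative sketch of the abstract's idea).
    Under an exponential fit, the ML estimate of the decay rate is
    1 / mean, so the characteristic length is the mean pairwise
    absolute difference along each feature."""
    n, d = X.shape
    iu = np.triu_indices(n, k=1)          # unordered pairs i < k
    lengths = np.empty(d)
    for j in range(d):
        diffs = np.abs(X[:, None, j] - X[None, :, j])
        lengths[j] = diffs[iu].mean()
    return lengths

def weighted_dissimilarity(x, y, lengths):
    """Hypothetical weighted dissimilarity: feature-wise absolute
    differences scaled by the inverse characteristic length, so that
    inhomogeneous features contribute more. The paper's measure also
    involves a continuous Boltzmann-entropy term not reproduced here."""
    return float(np.sum(np.abs(x - y) / lengths))
```

On a toy dataset such as `X = [[0, 0], [1, 2], [2, 4]]`, the second feature has twice the spread of the first, so its characteristic length is twice as large and its raw differences are down-weighted accordingly.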