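Cluster analysis
Computer science
Data mining
Benchmark (surveying)
Sequence (biology)
CURE data clustering algorithm
Correlation clustering
Artificial intelligence
Biology
Geography
Geodesy
Genetics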
Authors
Ze-Gang Wei,Xu Chen,Xiao-Dan Zhang,Hao Zhang,Xing-Guo Fan,Hongyan Gao,Fei Liu,Yu Qian
Identifier
DOI:10.1109/tcbb.2023.3253138
Abstract
Recent advances in sequencing technology have considerably promoted genomics research by making high-throughput sequencing economical. This advancement has produced a huge amount of sequencing data. Clustering analysis is a powerful way to study and probe large-scale sequence data, and a number of clustering methods have been developed in the last decade. Despite the numerous comparison studies that have been published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared, and the evaluation metrics rely heavily on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms, including classical methods (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseqs2, Linclust, edClust), are assessed; ii) two alignment-free methods (LZW-Kernel and Mash) are included for comparison with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and on the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analysts choose a reasonable clustering algorithm for processing their collected sequences and, furthermore, to motivate algorithm designers to develop more efficient sequence clustering approaches.