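Computer science
Multiple sequence alignment
Sequence alignment
Suffix array
Source code
Software
Coding (set theory)
Sequence (biology)
Parallel computing
Data structure
Programming language
Set (abstract data type)
Biology
Biochemistry
Chemistry
Genetics
Peptide sequence
Gene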
Authors
Tong Zhou,Pinglu Zhang,Quan Zou,Han Wu
Identifier
DOI:10.1093/bioinformatics/btae718
Abstract
Motivation: HAlign is high-performance multiple sequence alignment software based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences.
Results: To address this issue, we have implemented HAlign4 in C++. In this version, we replaced the original suffix tree with the Burrows–Wheeler Transform (BWT) and introduced the wavefront alignment algorithm to further optimize both time and memory efficiency. Experiments show that HAlign4 significantly outperforms HAlign3 in runtime and memory usage in both single-threaded and multi-threaded configurations, while maintaining high alignment accuracy comparable to MAFFT. HAlign4 can complete the alignment of 10 million COVID-19 sequences in about 12 minutes using 300 GB of memory and 96 threads, demonstrating its efficiency and practicality for large-scale alignment on standard workstations.
Availability: Source code is available at https://github.com/malabz/HAlign-4; the dataset is available at https://zenodo.org/records/13934503.
Supplementary information: Supplementary data are available at Bioinformatics online.
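For readers unfamiliar with the Burrows–Wheeler Transform mentioned in the abstract, the following is a minimal Python sketch of the general idea: build the BWT of a sequence, then count exact occurrences of a pattern by backward search, the kind of substring lookup a suffix tree would otherwise provide. This is not HAlign4's implementation (which is in C++ and is not described in detail here). The function names `bwt` and `count_occurrences`, the `$` sentinel, and the naive rotation sort are all assumptions chosen for illustration; a practical index would use a suffix array and precomputed rank tables.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations (illustrative only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel  # the sentinel sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of pattern in the original text via backward search."""
    counts = Counter(bwt_text)
    # first[c] = number of characters in the text that sort strictly before c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        # Rank queries are done by slicing here; real indexes precompute them.
        lo = first[c] + bwt_text[:lo].count(c)
        hi = first[c] + bwt_text[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("ACGTACGT")
    print(transformed)
    print(count_occurrences(transformed, "ACG"))
```

The appeal of this representation is that the index is a permutation of the text plus small count tables, which is generally far more compact than a pointer-based suffix tree.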