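De Bruijn graph
De Bruijn sequence
Computer science
Smith-Waterman algorithm
Graph
Parallel computing
Theoretical computer science
Sequence alignment
Mathematics
Biochemistry
Gene
Discrete mathematics
Peptide sequence
Chemistry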
Authors
Yao Li,Cheng Zhong,Danyang Chen,Jinxiong Zhang,Mengxiao Yin
Identifier
DOI:10.1109/paap54281.2021.9720451
Abstract
The large number of reads generated by next-generation sequencing platforms contain many repetitive subsequences. Effectively locating and identifying the genomic regions that contain these repetitive subsequences benefits subsequent genomic data analysis. To accelerate the alignment of large-scale short reads against a reference genome with many repetitive subsequences, this paper develops a compact de Bruijn graph based short-read alignment algorithm for a distributed parallel computing platform. The algorithm uses resilient distributed datasets (RDDs) to perform calculations in memory, uses the broadcast method to distribute the short reads and the reference genome to the computing nodes, which reduces data communication time on the cluster, and sets the number of RDD partitions to optimize the performance of the parallel alignment algorithm. Experimental results on real datasets show that, compared with the sequential compact de Bruijn graph based short-read alignment algorithm, the distributed parallel implementation achieves good speedup while obtaining the same overall percentage of correct alignments. Compared with existing distributed parallel alignment algorithms, it completes the alignment of large-scale short reads against a reference genome with highly repetitive subsequences more quickly.
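For readers unfamiliar with the underlying data structure, the following is a minimal single-machine sketch of building a de Bruijn graph from a sequence and compacting its non-branching paths into unitigs. It is an illustration only, not the authors' implementation: the function names `build_de_bruijn` and `compact`, the choice of (k-1)-mers as nodes, and the example sequence are all assumptions, and the distributed parts described in the abstract (RDDs, broadcast, partition tuning) are not shown.

```python
from collections import defaultdict


def build_de_bruijn(seq, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


def compact(graph):
    """Merge non-branching paths into unitigs (hypothetical helper).

    Simplification: a cycle made only of one-in/one-out nodes has no
    starting point here and is therefore not reported.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for v in succs:
            indeg[v] += 1
    nodes = set(graph) | set(indeg)

    def is_internal(n):
        # Exactly one incoming and one outgoing edge.
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    unitigs = []
    for n in nodes:
        if is_internal(n):
            continue  # unitigs start only at branching or end nodes
        for v in graph.get(n, ()):
            path = n
            while True:
                path += v[-1]  # extend by the last base of the next node
                if not is_internal(v):
                    break
                v = next(iter(graph[v]))
            unitigs.append(path)
    return unitigs


if __name__ == "__main__":
    # "ACG" occurs twice, so the graph branches at that node.
    print(sorted(compact(build_de_bruijn("ACGTACGA", 4))))
    # -> ['ACGA', 'ACGTACG']
```

The repeated 3-mer `ACG` becomes a single branching node, so the repeat is stored once and the two unitigs share it. This is the property that makes a compact de Bruijn graph attractive for references with many repetitive subsequences.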