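Computer science
Reference
Scalability
Nanopore sequencing
Probabilistic logic
Algorithm
Memory footprint
Data mining
Reference genome
Precision and recall
Database
Artificial intelligence
Genome
Operating system
Gene
Chemistry
Biochemistry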
Authors
Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, Adam M. Phillippy
Identifier
DOI:10.1007/978-3-319-56970-3_5
Abstract
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each $\ge 5$ kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and $> 60{,}000$ genomes.
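The two building blocks named in the abstract, minimizer sampling and MinHash-based identity estimation, can be illustrated with a minimal Python sketch. This is not the authors' implementation. All function names, the choice of BLAKE2b as the hash, and the parameters `k`, `w`, and `s` are assumptions made for illustration. The Jaccard-to-identity conversion is the Mash-style formula under a Poisson model of independent errors; that it matches the estimator used in the paper is likewise an assumption.

```python
import hashlib
import math


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, int]]:
    """Return sorted (position, hash) pairs: the smallest-hash k-mer of
    every window of w consecutive k-mers (leftmost on ties)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        offset = min(range(w), key=window.__getitem__)
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def bottom_sketch(seq: str, k: int, s: int) -> list[int]:
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a: list[int], sketch_b: list[int], s: int) -> float:
    """Estimate Jaccard similarity from two bottom-s sketches: take the
    s smallest hashes of their union and count those present in both."""
    union = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not union:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union if h in shared) / len(union)


def identity_from_jaccard(j: float, k: int) -> float:
    """Mash-style conversion (assumed Poisson error model):
    error rate = -(1/k) * ln(2j / (1 + j)), identity = 1 - error rate."""
    if j <= 0.0:
        return 0.0
    error_rate = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - error_rate)


if __name__ == "__main__":
    read = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCGATCGATCGGATCCA"
    region = read[:20] + "T" + read[21:]  # one substitution
    k, s = 8, 16
    j = jaccard_estimate(bottom_sketch(read, k, s), bottom_sketch(region, k, s), s)
    print("estimated Jaccard:", j)
    print("estimated identity:", identity_from_jaccard(j, k))
    print("minimizers sampled:", len(minimizers(read, k, w=5)))
```

On real long reads the sketch size and k-mer length would be far larger than in this toy usage, and a mapper would first use minimizer matches to shortlist candidate regions before estimating identity on each one.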