Computer science
Hash function
Suffix array
SIMD
Hash table
GenBank
Parallel computing
Data structure
Programming language
Biology
Genetics
Gene
Authors
Thomas D. Wu, Jens Reeder, Michael Lawrence, Gabe Becker, Matthew J. Brauer
Source
Journal: Methods in molecular biology
Date: 2016-01-01
Volume/Issue and pages: 283-334
Citations: 334
Identifier
DOI:10.1007/978-1-4939-3578-9_15
Abstract
The programs GMAP and GSNAP, for aligning RNA-Seq and DNA-Seq datasets to genomes, have evolved along with advances in biological methodology to handle longer reads, larger volumes of data, and new types of biological assays. The genomic representation has been improved to include linear genomes that can compare sequences using single-instruction multiple-data (SIMD) instructions, compressed genomic hash tables with fast access using SIMD instructions, handling of large genomes with more than four billion bp, and enhanced suffix arrays (ESAs) with novel data structures for fast access. Improvements to the algorithms have included a greedy match-and-extend algorithm using suffix arrays, segment chaining using genomic hash tables, diagonalization using segmental hash tables, and nucleotide-level dynamic programming procedures that use SIMD instructions and eliminate the need for F-loop calculations. Enhancements to the functionality of the programs include standardization of indel positions, handling of ambiguous splicing, clipping and merging of overlapping paired-end reads, and alignments to circular chromosomes and alternate scaffolds. The programs have been adapted for use in pipelines by integrating their usage into R/Bioconductor packages such as gmapR and HTSeqGenie, and these pipelines have facilitated the discovery of numerous biological phenomena.
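As a rough illustration of the genomic hash table and diagonalization ideas named in the abstract, the following minimal sketch indexes every k-mer of a toy genome and lets each k-mer hit from a read vote for a diagonal (genome position minus read position). It is not the GMAP or GSNAP implementation: the function names `build_kmer_index` and `best_diagonal` are hypothetical, the index is an ordinary uncompressed Python dictionary, and no SIMD instructions are involved.

```python
from collections import Counter, defaultdict


def build_kmer_index(genome, k):
    """Map each k-mer to the list of genome positions where it starts.

    Hypothetical helper: a plain dictionary standing in for a genomic
    hash table, with no compression.
    """
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def best_diagonal(read, index, k):
    """Return (diagonal, votes) for the diagonal with the most k-mer hits.

    A diagonal is genome_position - read_offset, so all k-mers from an
    ungapped match of the read share the same diagonal.
    Returns None when no k-mer of the read occurs in the index.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        kmer = read[offset:offset + k]
        # .get avoids inserting empty entries into the defaultdict
        for pos in index.get(kmer, ()):
            votes[pos - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTACCGATAGCTA"  # toy sequence, assumed for illustration
    read = genome[7:19]                    # a 12-base substring of the genome
    index = build_kmer_index(genome, k=4)
    print(best_diagonal(read, index, k=4))
```

In this toy setting the winning diagonal is the position in the genome where the read starts, and the vote count is the number of k-mers supporting it. A real aligner would go on to handle mismatches, indels, and splicing within and between candidate diagonals.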