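Index terms
Genotyping
Single-nucleotide polymorphism
Computational biology
Scalability
Computer science
SNP genotyping
Genetics
Biology
Genotype
Gene
Database
Authors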
Lorenzo Di Rocco,Umberto Ferraro Petrillo
Identifier
DOI:10.1109/tcbbio.2025.3525547
Abstract
The growing volume of sequencing data and the ever-larger size of variant databases challenge genotyping procedures to handle massive genomics datasets efficiently. Recent alignment-free solutions rely exclusively on k-mer counts to speed up the analysis, but must trade the time gain against memory requirements to keep the computation feasible on a single workstation. In this paper, we present SparkGeno+, a novel alignment-free (AF) distributed pipeline for the fast and accurate genotyping of Single Nucleotide Polymorphisms (SNPs) and indels on a large scale. Starting from a previous pipeline, we identified and evaluated the performance bottlenecks that arise when genotyping with a standard AF approach, then developed and implemented several innovations to better exploit the resources of a distributed system. The effectiveness of our proposal has been validated through an experimental analysis on widely studied datasets. The results show that the accuracy of SparkGeno+ matches that of state-of-the-art alignment-free tools such as VarGeno and MALVA. Moreover, the time performance of SparkGeno+ scales well with the number of computing units, allowing execution times that are smaller, in order of growth, than those of classical genotyping tools. This makes SparkGeno+ a promising solution for large-scale genotyping applications.
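The general idea of alignment-free genotyping from k-mer counts, as mentioned in the abstract, can be illustrated with a minimal single-machine sketch. It is not the SparkGeno+ pipeline: the function names (count_kmers, allele_support, call_genotype), the min_fraction threshold, and the toy reads are all assumptions made for illustration, and the simple threshold stands in for whatever statistical model an actual tool applies.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (no alignment involved)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def allele_support(counts, left, allele, right, k):
    """Sum the counts of the k-mers that overlap the allele in left+allele+right."""
    seq = left + allele + right
    start, end = len(left), len(left) + len(allele)
    total = 0
    for i in range(len(seq) - k + 1):
        # Keep only k-mers whose window [i, i + k) covers part of the allele.
        if i < end and i + k > start:
            total += counts[seq[i:i + k]]
    return total


def call_genotype(ref_support, alt_support, min_fraction=0.2):
    """Hypothetical threshold rule; real tools use a statistical model instead."""
    total = ref_support + alt_support
    if total == 0:
        return "./."  # no evidence for either allele
    alt_fraction = alt_support / total
    if alt_fraction < min_fraction:
        return "0/0"
    if alt_fraction > 1 - min_fraction:
        return "1/1"
    return "0/1"


# Toy data (assumed): two reads carry the reference allele A, one carries T.
reads = ["ACGTACGGTA", "ACGTACGGTA", "ACGTTCGGTA"]
counts = count_kmers(reads, k=5)
ref = allele_support(counts, "ACGT", "A", "CGGTA", k=5)
alt = allele_support(counts, "ACGT", "T", "CGGTA", k=5)
print(ref, alt, call_genotype(ref, alt))
```

The k-mer table is the memory-hungry part of this approach, which is the trade-off the abstract describes for single-workstation tools and the motivation for distributing the work.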