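Scalability
SPARK (programming language)
Pipeline (software)
Computer science
Pipeline transport
Data processing
Bottleneck
Big data
Barcode
Data mining
Database
Operating system
Embedded system
Engineering
Programming language
Environmental engineering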
Authors
Yu Liu, Mingxuan Gao, Lixuan Tan, Hongjin Liu, Yating Lin, Wenxian Yang, Rongshan Yu
Identifier
DOI:10.1109/bibm52615.2021.9669512
Abstract
High-throughput single-cell RNA sequencing (scRNA-seq) data processing pipelines integrate multiple modules, including barcode processing, sequence quality control, genome alignment, and transcript quantification, to transform raw scRNA-seq data into gene expression matrices. With the rapid growth in data volume, the speed of these pipelines has become a major bottleneck for large-scale scRNA-seq studies. We present scSpark XMBD¹ (denoted scSpark), a cloud-computing-based scRNA-seq data processing pipeline. By leveraging the in-memory computing capability of Apache Spark, scSpark significantly improves the processing speed of scRNA-seq data and runs around 5 to 20 times faster than state-of-the-art processing pipelines using the same number of CPU cores. In addition, thanks to the inherent scalability of Spark in a cloud computing environment, scSpark can further reduce the processing time for a typical scRNA-seq dataset (e.g., 640 million reads) from hours to minutes when multiple compute nodes (e.g., 16) are used. Biological evaluation also confirmed that the results generated by scSpark are highly consistent with those of existing scRNA-seq data processing pipelines.

¹ XMBD refers to Xiamen Big Data, which is a biomedical open software initiative in the National Institute for Data Science in Health and Medicine, Xiamen University, China.
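To make the barcode processing and transcript quantification modules named in the abstract concrete, here is a minimal Python sketch of counting unique molecular identifiers (UMIs) per cell barcode and gene. It is not scSpark's code. The read layout (a 16-base cell barcode followed by a 10-base UMI), the whitelist filter, and the names `split_barcode` and `count_umis` are all assumptions made for illustration. The logic is a per-read map step followed by a per-key aggregation, which is the general shape that frameworks such as Spark distribute across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed (hypothetical) read layout: 16-base cell barcode, then 10-base UMI.
BARCODE_LEN = 16
UMI_LEN = 10

# A read here is (barcode+UMI sequence, gene that the cDNA read aligned to).
Read = Tuple[str, str]
Key = Tuple[str, str]  # (cell barcode, gene)


def split_barcode(read: Read) -> Tuple[Key, str]:
    """Map step: turn one read into ((barcode, gene), umi)."""
    tag, gene = read
    barcode = tag[:BARCODE_LEN]
    umi = tag[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return (barcode, gene), umi


def count_umis(reads: Iterable[Read], whitelist: Set[str]) -> Dict[Key, int]:
    """Aggregate step: count distinct UMIs for each (barcode, gene) pair.

    Reads whose barcode is not in the whitelist, or whose UMI is
    truncated, are dropped as a simple stand-in for quality control.
    """
    umis_per_key: Dict[Key, Set[str]] = defaultdict(set)
    for read in reads:
        (barcode, gene), umi = split_barcode(read)
        if barcode in whitelist and len(umi) == UMI_LEN:
            # A set collapses duplicate reads of the same molecule.
            umis_per_key[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_per_key.items()}
```

Calling `count_umis` on a list of reads with a set of accepted barcodes returns one entry of a sparse gene expression matrix per (barcode, gene) pair. Two reads that share a barcode, gene, and UMI contribute a count of one.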