Penguin: A tool for predicting pseudouridine sites in direct RNA nanopore sequencing data

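Pseudouridine, RNA, computational biology, nanopore sequencing, nanopore, biology, genetics, nanotechnology, gene, transfer RNA, DNA sequencing, materials science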

Authors

Doaa Hassan, Daniel Acevedo, Swapna Vidhur Daulatabad, Quoseena Mir, Sarath Chandra Janga

Source

Journal: Methods [Elsevier BV]
Date: 2022-02-16 Volume/issue: 203: 478-487 Citations: 40

Links

sciencedirect.com nih.gov biorxiv.org nih.gov doi.org

Identifiers

DOI: 10.1016/j.ymeth.2022.02.005

Abstract

Pseudouridine is one of the most abundant RNA modifications, formed when pseudouridine synthase proteins catalyze the isomerization of uridine. It plays an important role in many biological processes and has been reported to have applications in drug development. Recently, single-molecule sequencing techniques, such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies, have enabled direct detection of RNA modifications on the molecule being sequenced. In this study, we introduce a tool called Penguin that integrates several machine learning (ML) models to identify RNA pseudouridine sites on Nanopore direct RNA sequencing reads. Pseudouridine sites were identified in single-molecule data from direct RNA sequencing, comprising 723 K reads from the HEK293 cell line and 500 K reads from the HeLa cell line. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore device and from the corresponding basecalled k-mer. These features are used to train the predictors included in Penguin, which, in the testing phase, predict whether a signal is modified by the presence of a pseudouridine site. Penguin includes several predictors: support vector machine (SVM), random forest (RF), and neural network (NN). Results on the two benchmark datasets, for the HEK293 and HeLa cell lines, show outstanding performance of Penguin in both random split testing and independent validation testing. In random split testing, Penguin identifies pseudouridine sites with an accuracy of 93.38% when SVM is applied to the HEK293 benchmark dataset. In independent validation testing, Penguin achieves an accuracy of 92.61% when SVM is trained on the HEK293 benchmark dataset and tested on the HeLa benchmark dataset. Thus, in independent validation testing, Penguin's accuracy is 16% higher than that of the existing pseudouridine predictors in the literature.
Employing Penguin to predict pseudouridine sites revealed a significant enrichment of "regulation of mRNA 3'-end processing" in the HEK293 cell line and of "positive regulation of transcription from RNA polymerase II promoter involved in cellular response to chemical stimulus" in the HeLa cell line. Penguin software and models are available on GitHub at https://github.com/Janga-Lab/Penguin and can be readily employed for predicting Ψ sites from Nanopore direct RNA-sequencing datasets.
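
The abstract says that features are extracted from the raw signal and the corresponding basecalled k-mer, but it does not list them. The sketch below shows one plausible way to turn a single (k-mer, signal segment) pair into a numeric feature vector that a classifier such as an SVM could consume. The choice of features (one-hot k-mer encoding plus mean, median, standard deviation, and length of the signal), the function names `one_hot_kmer` and `extract_features`, and the sample values are all assumptions made for illustration; they are not taken from the Penguin code base.

```python
from statistics import mean, median, pstdev

# Assumed alphabet. U is mapped to T so that either spelling of the k-mer is accepted.
BASES = "ACGT"


def one_hot_kmer(kmer):
    """One-hot encode a k-mer: four values per base, in A, C, G, T order.

    Any character outside the alphabet (for example N) becomes four zeros.
    """
    encoded = []
    for base in kmer.upper().replace("U", "T"):
        encoded.extend(1.0 if base == b else 0.0 for b in BASES)
    return encoded


def extract_features(kmer, signal):
    """Build one feature vector from a k-mer and its raw current samples."""
    if not signal:
        raise ValueError("signal segment is empty")
    # Hypothetical summary statistics of the signal segment.
    summary = [mean(signal), median(signal), pstdev(signal), float(len(signal))]
    return one_hot_kmer(kmer) + summary


# Hypothetical input: a 5-mer centred on a U and six made-up current samples.
features = extract_features("GAUCC", [81.0, 79.5, 80.2, 82.1, 80.0, 79.8])
print(len(features))  # 5 bases * 4 one-hot values + 4 summary values
```

Vectors built this way from labelled modified and unmodified sites could then be split randomly for training and testing, or built from one cell line for training and another for testing, which mirrors the two evaluation settings the abstract describes.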
