作者
Xin Wang,Yang Yang,Jian Liu,Guohua Wang
摘要
With the development of next-generation sequencing technology, a large number of transcripts need to be analyzed, and it has been a challenge to distinguish non-coding ribonucleic acid (RNAs) (ncRNAs) from coding RNAs. And for non-model organisms, due to the lack of transcriptional data, many existing methods cannot identify them. Therefore, in addition to using deoxyribonucleic acid-based and RNA-based features, we also proposed a hybrid framework based on the stacking strategy to identify ncRNAs, and we innovatively added eight features based on predicted peptides. The proposed framework was based on stacking two-layer classifier which combined random forest (RF), LightGBM, XGBoost and logistic regression (LR) models. We used this framework to build two types of models. For cross-species ncRNAs identification model, we tested it on six different species: human, mouse, zebrafish, fruit fly, worm and Arabidopsis. Compared with other tools, our model was the best in datasets of Arabidopsis, worm and zebrafish with the accuracy of 98.36%, 99.65% and 94.12%. For performance metrics analysis, the datasets of the six species were considered as a whole set, and the sensitivity, accuracy, precision and F1 values of our model were the best. For the plant-specific ncRNAs identification model, the average values of the six metrics of the two experiments were all greater than 95%, which demonstrated it can be used to identify ncRNAs in plants. The above indicates that the hybrid framework we designed is universal between animals and plants and has significant advantages in the identification of cross-species ncRNAs.