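Random forest
Computer science
Artificial intelligence
Machine learning
Ranking (information retrieval)
Dimensionality reduction
Ensemble learning
Feature (linguistics)
Modal verb
Pattern recognition (psychology)
Feature selection
Linguistics
Philosophy
Chemistry
Polymer chemistry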
Authors
Siyuan Zhao,Jun Meng,Qiang Kang,Yushi Luan
Identifier
DOI:10.1109/tcbb.2021.3104288
Abstract
Long non-coding RNA (lncRNA) contains short open reading frames (sORFs), and sORF-encoded short peptides (SEPs) have become a focus of scientific study because of their crucial role in life activities. Identifying SEPs is vital to further understanding their regulatory function. Bioinformatics methods can quickly identify SEPs and so provide credible candidate sequences for verification by biological experiments. However, there is a lack of methods for identifying SEPs directly. In this study, a machine learning method to identify SEPs of plant lncRNA (ISPL) is proposed. Hybrid features, including sequence features and physicochemical features, are extracted manually or adaptively to construct different modal features. To keep feature selection stable, the non-linear correction applied in Max-Relevance-Max-Distance (nocRD) feature selection method is proposed, which integrates multiple feature ranking results and uses the iterative random forest to reduce the dimensionality of the different modal features. Classification models are constructed on the different modal features, and their outputs are combined for ensemble classification. The experimental results show that the accuracy of ISPL is 89.86% on the independent test set, which will have important implications for further studies of functional genomics.
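The abstract names two combining steps: integrating several feature ranking results, and merging the outputs of the per-modality classifiers. The sketch below illustrates both in the simplest generic form, using mean-rank aggregation and soft voting. It is not the nocRD method or the iterative random forest, whose details the abstract does not give. The feature names, rankings, probabilities, and the functions `aggregate_rankings` and `soft_vote` are all hypothetical.

```python
from statistics import mean


def aggregate_rankings(rankings):
    """Combine several feature rankings (best first) by mean rank position.

    Assumption: every ranking contains the same set of features.
    """
    features = rankings[0]
    mean_rank = {f: mean(r.index(f) for r in rankings) for f in features}
    # Lower mean rank is better; ties are broken by name for determinism.
    return sorted(features, key=lambda f: (mean_rank[f], f))


def soft_vote(probabilities, threshold=0.5):
    """Average per-model positive-class probabilities, then threshold."""
    n_samples = len(probabilities[0])
    averaged = [mean(p[i] for p in probabilities) for i in range(n_samples)]
    return [1 if a >= threshold else 0 for a in averaged]


# Hypothetical rankings of four made-up features from three rankers.
rankings = [
    ["gc_content", "orf_length", "hydrophobicity", "kmer_AT"],
    ["orf_length", "gc_content", "kmer_AT", "hydrophobicity"],
    ["orf_length", "hydrophobicity", "gc_content", "kmer_AT"],
]
consensus = aggregate_rankings(rankings)
top_features = consensus[:2]  # keep the two best features

# Hypothetical positive-class probabilities from two modality-specific
# models, for three candidate sequences.
sequence_model = [0.9, 0.2, 0.6]
physicochemical_model = [0.7, 0.4, 0.3]
labels = soft_vote([sequence_model, physicochemical_model])
```

Averaging rank positions damps the effect of any single ranker placing a feature unusually high or low, which is one simple way to make a selection more stable. Soft voting lets a confident model outweigh an uncertain one.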