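Computer science
Artificial intelligence
Embedding
Word error rate
Feature engineering
Robustness (evolution)
Deep learning
Pattern recognition (psychology)
Machine learning
Speech recognition
Gene
Biology
Biochemistry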
Authors
Gancheng Zhu,Yongchang Fan,Fĕi Li,Annebella Tsz Ho Choi,Zhenyu Tan,Yiruo Cheng,Kewei Li,Siyang Wang,Changfan Luo,Hongmei Li,Gongyou Zhang,Zhaomin Yao,Yaqi Zhang,L. Q. Huang,Fengfeng Zhou
Identifier
DOI:10.1016/j.eswa.2023.120439
Abstract
A genome carries many functional genomic signals and regions (GSRs), which play a vital role in orchestrating the complex biological processes of eukaryotic organisms. Precise recognition of the GSRs within a genomic sequence is the first step toward understanding genomic organization and gene regulation. Previous studies have used machine learning or deep learning algorithms to identify GSRs based on hand-crafted features, which frequently fail to capture complex patterns within the GSRs. The one-hot encoding or word2vec embedding algorithms used in several deep learning-based studies have the potential to overcome the weaknesses of human-designed features, but they may fail to capture contextual and positional information. The present study proposes a general-purpose end-to-end framework for GSR prediction (GSRNet), which integrates DNABERT embedding, adversarial training, BiGRU, and multi-scale CNN to eliminate human involvement in feature engineering. GSRNet is evaluated on polyadenylation signal (PAS) and translation initiation site (TIS) prediction tasks. The comparative experiments show that the proposed GSRNet outperforms the state-of-the-art methods reported in previous studies, lowering the error rate by 1.08% and 1.50% for human PAS and TIS prediction, respectively. In relative terms, the model reduces the error rate by up to 8.73% and 32.97%, respectively. The improved detection of the two types of GSRs (PAS and TIS) across four organisms confirms the effectiveness and robustness of the proposed GSRNet. The source code and the data are freely available at http://www.healthinformaticslab.org/supp/resources.php.
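The abstract contrasts one-hot encoding with context-aware embeddings such as DNABERT. As a rough illustration only, and not the authors' GSRNet code, the sketch below shows the two input representations that feed such models: a per-base one-hot vector, and the overlapping k-mer tokens that DNABERT-style models take as input. The function names, the choice of k = 3, and the example sequence (the canonical PAS hexamer AATAAA) are assumptions made for this illustration.

```python
# Illustrative sketch, not the GSRNet implementation.
# Function names and k=3 are assumptions chosen for this example.

ALPHABET = "ACGT"


def one_hot_encode(seq):
    """Encode each base as a 4-dimensional indicator vector (order A, C, G, T).

    Bases outside the alphabet (e.g. N) become an all-zero vector.
    Each position is encoded independently of its neighbours.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers with stride 1.

    Each token carries a little local context, and a BERT-style model
    can then embed these tokens in a context-dependent way.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    pas_hexamer = "AATAAA"  # canonical polyadenylation signal motif
    print(one_hot_encode(pas_hexamer))
    print(kmer_tokens(pas_hexamer, k=3))
```

A one-hot vector for a given base is identical wherever that base occurs, which is the limitation the abstract points to; the k-mer tokens are only the input to a learned contextual embedding, not the embedding itself.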