Deep convolutional neural networks for predicting leukemia-related transcription factor binding sites from DNA sequence data

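Keywords: Convolutional neural network; DNA binding site; Support vector machine; Artificial intelligence; Computer science; Transcription factor; Deep learning; Artificial neural network; Machine learning; Random forest; Pattern recognition (psychology); Computational biology; Gene; Biology; Promoter; Genetics; Gene expression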

Authors

Jian He,Xuemei Pu,Menglong Li,Chuan Li,Yanzhi Guo

Source

Journal: Chemometrics and Intelligent Laboratory Systems [Elsevier]
Date: 2020-04-01 Volume/issue: 199: 103976-103976 Citations: 7

Identifier

DOI: 10.1016/j.chemolab.2020.103976

Abstract

Transcription factors are proteins that bind to specific DNA sequences to regulate gene expression. Identifying transcription factor binding sites in DNA sequences is important for building regulatory models of biological systems and for identifying pathogenic variants. Traditional machine-learning methods have been applied successfully to biological prediction problems based on DNA or protein sequences, but they all require manual extraction of numerical features, which is tedious and can discard useful information in the primary sequence. In this paper, based on the principles of deep learning (DL), we construct prediction models for transcription factor binding sites directly from raw DNA base sequences. A DL method combining a convolutional neural network (CNN) with long short-term memory (LSTM) is proposed to investigate four leukemia categories from the perspective of transcription factor binding sites, using four large non-redundant datasets for acute, chronic, myeloid and lymphatic leukemia, respectively. Compared with three widely used machine-learning methods, namely artificial neural network (ANN), support vector machine (SVM) and random forest (RF), our DL method is clearly superior in prediction performance: the prediction accuracies of the three machine-learning models, whether based on sequence features or on k-mer features, are all lower than that of the DL model. The DL models for the four leukemia categories give an average prediction accuracy of 75% using only sequence segments of 101 bases, which indicates that the DL-based method is promising and has advantages over traditional machine-learning methods. For leukemia-related transcription factor binding site prediction specifically, further improvements, such as optimizing the segment length and the CNN architecture, could raise the current prediction accuracy.
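The abstract contrasts two kinds of input: raw base sequences fed to the DL model, and hand-extracted k-mer features fed to the ANN, SVM and RF baselines. The sketch below illustrates that difference only. It assumes a one-hot encoding over A, C, G, T for the raw sequence, which is a common choice but is not stated in the abstract, and the function names `one_hot` and `kmer_frequencies` are hypothetical, not taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"  # assumed channel order; the paper's ordering is not given


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    rows = []
    for base in seq.upper():
        row = [0.0] * len(BASES)
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def kmer_frequencies(seq, k=3):
    """Return the relative frequency of every possible k-mer, in
    lexicographic order over A/C/G/T (a vector of length 4**k).

    Windows containing other characters are counted in the total but
    have no slot in the output vector.
    """
    seq = seq.upper()
    n_windows = max(len(seq) - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero on short input
    return [counts["".join(p)] / total for p in product(BASES, repeat=k)]


if __name__ == "__main__":
    segment = "ACGT" * 25 + "A"      # a made-up 101-base segment
    x_raw = one_hot(segment)          # shape 101 x 4, input for a CNN
    x_kmer = kmer_frequencies(segment, k=3)  # length 64, input for SVM/RF
    print(len(x_raw), len(x_raw[0]), len(x_kmer))
```

With a 101-base segment, the one-hot form keeps the position of every base (101 rows of 4 values), while the 3-mer vector collapses the segment into 64 position-free frequencies. That loss of positional information is the kind of trade-off the abstract attributes to manual feature extraction.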
