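Computer science
DNA binding site
Embedding
Convolutional neural network
Artificial intelligence
Deep learning
DNA microarray
Transcription factor
Encoding (memory)
Word embedding
Sequence (biology)
k-mer
Data mining
DNA sequencing
Pattern recognition (psychology)
DNA
Gene
Promoter
Biology
Genetics
Gene expression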
Authors
Jujuan Zhuang,Xinru Huang,Shuhan Liu,Wanquan Gao,Rui Su,Kexin Feng
Identifier
DOI:10.1021/acs.jcim.3c02088
Abstract
Revealing the mechanisms that influence transcription factor binding specificity is key to understanding gene regulation. In previous studies, DNA double-helix structure and one-hot embedding have been used successfully to design computational methods for predicting transcription factor binding sites (TFBSs). However, although DNA sequence can be regarded as a kind of biological language, word embedding representations from natural language processing have not been properly considered in TFBS prediction models. In our work, we integrate different types of DNA sequence features to design a multichanneled deep learning framework, MulTFBS, in which three feature types are trained in separate channels: independent one-hot encoding; word embedding encoding, which can incorporate contextual information and extract global features of the sequences; and double-helix three-dimensional structural features. To extract high-level sequence information effectively, our framework uses a spatial-temporal network that combines convolutional neural networks and bidirectional long short-term memory networks with an attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with an average R² of 0.698 and an average PCC of 0.833, which are 5.4% and 3.2% higher, respectively, than those of the second-best method, CRPTS. In addition, we evaluate the classification performance of MulTFBS for distinguishing bound from unbound regions on TF ChIP-seq data. The results show that our framework also performs well in the TFBS classification tasks.
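
To make the input representations and the regression metrics named in the abstract concrete, here is a minimal sketch in plain Python. It is not the MulTFBS implementation: the function names, the choice of k = 3 with stride 1, and the reading of PCC as the Pearson correlation coefficient are assumptions made for illustration, since the abstract does not specify them.

```python
import math

BASES = "ACGT"  # assumed channel order for the one-hot rows

def one_hot(seq):
    """Encode a DNA string as rows of 4 indicators (A, C, G, T).
    Bases outside ACGT (e.g. N) become all-zero rows."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1).
    These are the 'words' a word-embedding model could be trained on;
    k=3 is an illustrative assumption, not a value from the paper."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

For example, `kmer_tokens("ACGTA")` yields the three tokens `ACG`, `CGT`, and `GTA`, while `one_hot("ACGTA")` yields five rows of four indicators each. In a multichannel design such as the one described, each representation would feed its own channel before the outputs are combined.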