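Computer science
Test apparatus
Artificial intelligence
Encoder
Convolutional neural network
Set (abstract data type)
Machine learning
Representation (politics)
Pattern recognition (psychology)
Identification (biology)
Segmentation
Biology
Politics
Operating system
Political science
Plant
Programming language
Law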
Authors
Yue Ma,Yongzhen Pei,Changguo Li
Identifier
DOI:10.1142/s0219720023500282
Abstract
Identifying proteins is crucial for disease diagnosis and treatment. As the number of known proteins grows, large-scale batch prediction becomes essential. However, traditional biological experiments are time-consuming and expensive, so they cannot accomplish this task efficiently. Deep learning algorithms based on big data analysis, by contrast, have shown potential in this area. In recent years, language representation models, especially BERT, have made significant advances in natural language processing. In this paper, we combine three protein segmentation methods with three encoder counts to construct nine BERT models of different sizes that predict whether known proteins are DNA-binding proteins. Furthermore, based on the concept of protein motifs, multi-scale convolutional networks are fused into the models to extract the local features of DNA-binding proteins. We find that, when each amino acid in the protein is treated as a word, the larger the number of encoders, the better the model's predictions. Our proposed algorithm achieves 81.88% sensitivity and an MCC of 0.39 on the test set, and 62.41% accuracy on the independent test set PDB2272. These results suggest that the proposed method can serve as a tool to assist in the identification of DNA-binding proteins.
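The abstract mentions segmenting proteins into "words" and reporting sensitivity and MCC, but gives no implementation details. The following is a minimal sketch of both ideas under stated assumptions: the function names `segment`, `sensitivity`, and `mcc` are hypothetical, the k-mer variants are illustrative guesses rather than the paper's three segmentation methods, and the example sequence is made up. Only the `k=1` case (one amino acid per word) corresponds to a setting the abstract names.

```python
from math import sqrt


def segment(sequence, k=1, overlap=False):
    """Split a protein sequence into 'words' of k residues.

    k=1 treats each amino acid as one word, as named in the abstract.
    The k>1 variants are assumptions for illustration only.
    Trailing residues that do not fill a whole word are dropped.
    """
    step = 1 if overlap else k
    last_start = len(sequence) - k
    return [sequence[i:i + k] for i in range(0, last_start + 1, step)]


def sensitivity(tp, fn):
    """Fraction of true positives recovered: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    toy = "MKVLAAG"  # made-up sequence, not taken from the paper's data
    print(segment(toy))                     # one residue per word
    print(segment(toy, k=3))                # non-overlapping 3-mers
    print(segment(toy, k=3, overlap=True))  # sliding-window 3-mers
```

MCC is often reported alongside sensitivity because it uses all four confusion-matrix counts, so a classifier cannot score well simply by predicting the positive class for everything.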