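Benchmark (surveying)
Protein structure prediction
Computer science
Set (abstract data type)
Training set
Function (biology)
Protein function prediction
Performance prediction
Data mining
Machine learning
Artificial intelligence
Protein structure
Protein function
Simulation
Biology
Gene
Geodesy
Physics
Evolutionary biology
Biochemistry
Chemistry
Programming language
Geography
Nuclear magnetic resonance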
Authors
Wenjian Ma, Shugang Zhang, Zhen Li, Mingjian Jiang, Shuang Wang, Weigang Lu, Xiangpeng Bi, Huasen Jiang, Henggui Zhang, Zhiqiang Wei
Identifier
DOI:10.1021/acs.jcim.2c00885
Abstract
The structure of a protein is of great importance in determining its functionality, and this characteristic can be leveraged to train data-driven prediction models. However, the small number of available protein structures severely limits the performance of these models. AlphaFold2 and its open-source data set of predicted protein structures offer a promising solution to this problem: the predicted structures are expected to improve model performance by increasing the number of training samples. In this work, we constructed a new benchmark data set and implemented a state-of-the-art structure-based approach to determine whether the performance of a function prediction model can be improved by adding AlphaFold-predicted structures to the training set. We further compared two models trained separately on real structures only and on AlphaFold-predicted structures only. Experimental results indicated that structure-based protein function prediction models could benefit from virtual training data consisting of AlphaFold-predicted structures. First, model performance improved in all three categories of Gene Ontology (GO) terms after predicted structures were added as training samples. Second, the model trained only on AlphaFold-predicted virtual samples achieved performance comparable to that of the model trained on experimentally solved real structures, suggesting that predicted structures were almost equally effective in predicting protein functionality.