Authors
Erik Nijkamp,Jeffrey A. Ruffolo,Eli N. Weinstein,Nikhil Naik,Ali Madani
Source
Journal: Cell Systems
[Elsevier]
Date: 2023-10-30
Volume/Issue: 14(11): 968-978.e3
Citations: 110
Identifier
DOI: 10.1016/j.cels.2023.10.002
Abstract
Attention-based models trained on protein sequences have demonstrated remarkable success at classification and generation tasks relevant to artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large models and datasets contribute to effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
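The zero-shot fitness prediction described in the abstract is typically realized by scoring a sequence with the model's own log-likelihood, and generation by autoregressive sampling. Below is a minimal sketch of both uses, assuming a ProGen2 checkpoint packaged in a Hugging Face causal-LM-compatible format; the checkpoint id, example sequences, prompt, and sampling parameters are illustrative placeholders, not the authors' settings (the official weights and code are released at https://github.com/salesforce/progen).

```python
# Minimal sketch, not the authors' official pipeline. Assumes a ProGen2
# checkpoint packaged for Hugging Face transformers; the checkpoint id
# below is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

CHECKPOINT = "path/or/hub-id/of/progen2-checkpoint"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, trust_remote_code=True)
model.eval()

@torch.no_grad()
def log_likelihood(sequence: str) -> float:
    # Sum of per-token log-probabilities under the causal LM. Higher means
    # "more natural" to the model; likelihood-based scores of this kind serve
    # as zero-shot fitness predictors, with no fine-tuning required.
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1, :]  # predict token t from tokens < t
    targets = ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()

# Rank a made-up wild type against a point mutant by model likelihood.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
mutant = wild_type[:24] + "A" + wild_type[25:]
print("wild_type:", log_likelihood(wild_type))
print("mutant:   ", log_likelihood(mutant))

# Sample a novel sequence from a short prompt; temperature and top-p values
# are illustrative, not the paper's settings.
prompt = tokenizer("MKTAYIAK", return_tensors="pt").input_ids
out = model.generate(prompt, do_sample=True, max_length=128,
                     temperature=0.8, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For mutation-effect prediction, the difference log_likelihood(mutant) - log_likelihood(wild_type) is a common ranking score; a length-normalized (mean per-token) log-likelihood is often preferred when comparing sequences of different lengths.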