Protein superfamily
Computational biology
Protein engineering
Protein family
Mutase
Function (biology)
Protein sequencing
Biology
Computer science
Natural language
Sequence (biology)
Peptide sequence
Natural language processing
Genetics
Biochemistry
Gene
Enzyme
Authors
Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, J.L. Olmos, Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser, Nikhil Naik
Identifier
DOI:10.1038/s41587-022-01618-2
Abstract
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
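The abstract describes ProGen as an autoregressive language model over protein sequences whose generations are steered by control tags specifying protein properties such as family. The sketch below illustrates that general idea only: a tiny, untrained stand-in model and a sampling loop in which a control tag is prepended to the prompt. The vocabulary, tag names (e.g. `<lysozyme>`), and architecture sizes are assumptions for illustration and do not reflect the published ProGen code, tokenizer, or checkpoints.

```python
# Hypothetical sketch of control-tag-conditioned protein generation, in the spirit of
# the approach described in the abstract. All names and sizes here are illustrative.
import torch
import torch.nn as nn

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
CONTROL_TAGS = ["<lysozyme>", "<chorismate_mutase>", "<malate_dehydrogenase>"]  # assumed tags
VOCAB = ["<pad>", "<bos>", "<eos>"] + CONTROL_TAGS + AMINO_ACIDS
TOK = {t: i for i, t in enumerate(VOCAB)}

class TinyCausalLM(nn.Module):
    """A small Transformer decoder standing in for the real, much larger model."""
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        x = self.embed(ids) + self.pos(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        return self.head(self.blocks(x, mask=mask))

@torch.no_grad()
def generate(model, tag, max_new_tokens=120, temperature=1.0):
    """Sample amino-acid tokens one at a time, conditioned on a prepended control tag."""
    ids = torch.tensor([[TOK["<bos>"], TOK[tag]]])
    for _ in range(max_new_tokens):
        logits = model(ids)[0, -1] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        if next_id.item() == TOK["<eos>"]:
            break
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return "".join(VOCAB[i] for i in ids[0, 2:].tolist())

model = TinyCausalLM(len(VOCAB))  # untrained; shown only to make the sampling loop concrete
print(generate(model, "<lysozyme>"))
```

In this framing, fine-tuning to a curated family (as the abstract describes for the five lysozyme families) would amount to continuing training of such a model on tag-prefixed sequences from that family before sampling.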