Artificial intelligence
Computer science
Unsupervised learning
Machine learning
Generative model
Protein tertiary structure
Protein structure prediction
Sequence space
Sequence (biology)
Protein structure
Biology
Mathematics
Biochemistry
Genetics
Authors
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus
Abstract
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
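The abstract states that structural information can be "identified by linear projections" of the learned representations. A minimal sketch of that idea is a linear probe: fit a single projection from per-residue embeddings to structure classes and measure its accuracy. The embeddings and labels below are synthetic stand-ins (the real model's embeddings are not reproduced here); only the probing procedure itself is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: N residues with D-dimensional embeddings and C = 3
# secondary-structure classes (helix / strand / coil). A hidden projection
# generates the labels so the probe has linear signal to recover.
N, D, C = 600, 32, 3
X = rng.normal(size=(N, D))           # stand-in per-residue embeddings
W_true = rng.normal(size=(D, C))
y = (X @ W_true).argmax(axis=1)       # stand-in structure labels

# Linear probe: least-squares fit of a projection onto one-hot targets.
Y = np.eye(C)[y]                      # one-hot labels, shape (N, C)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # learned projection, shape (D, C)

pred = (X @ W).argmax(axis=1)
accuracy = (pred == y).mean()
print(f"probe accuracy: {accuracy:.2f}")    # well above the 1/3 chance level
```

Because the probe is a single linear map with no hidden layers, high probe accuracy indicates the information is linearly decodable from the representations rather than extracted by the probe itself.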