Computer science
Inference
Language model
Transformer
Margin (machine learning)
Artificial intelligence
Set (abstract data type)
Protein family
Sequence (biology)
Machine learning
Natural language processing
Biology
Programming language
Physics
Gene
Voltage
Quantum mechanics
Biochemistry
Genetics
Authors
Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, Alexander Rives
Identifier
DOI:10.1101/2021.02.12.430858
Abstract
Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
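The abstract describes interleaving row attention (over positions within each aligned sequence) with column attention (over sequences within each alignment column). The sketch below is an illustrative reconstruction of that idea using standard PyTorch modules, not the authors' implementation; the module name `AxialMSABlock`, the tensor shapes, and all hyperparameters are assumptions chosen for clarity.

```python
# Minimal sketch (assumed, not the authors' code) of interleaved row/column
# attention over a multiple sequence alignment, as described in the abstract.
import torch
import torch.nn as nn


class AxialMSABlock(nn.Module):
    """One block of row attention followed by column attention, each residual."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_row = nn.LayerNorm(dim)
        self.norm_col = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_sequences, seq_length, dim) -- one MSA, batch dim omitted for clarity
        # Row attention: each sequence attends over its own positions.
        h = self.norm_row(x)
        h, _ = self.row_attn(h, h, h)            # (r, l, d)
        x = x + h

        # Column attention: each alignment column attends across sequences.
        h = self.norm_col(x).transpose(0, 1)     # (l, r, d)
        h, _ = self.col_attn(h, h, h)            # (l, r, d)
        x = x + h.transpose(0, 1)
        return x


if __name__ == "__main__":
    msa = torch.randn(16, 128, 64)               # 16 aligned sequences of length 128
    out = AxialMSABlock()(msa)
    print(out.shape)                             # torch.Size([16, 128, 64])
```

In this sketch the row step treats the sequence axis as the batch, and the column step transposes so that each alignment column becomes a short sequence of residues to attend over; during training, a masked language modeling head would predict the amino acids hidden from the input MSA.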