xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

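Computer science, Transformer, Artificial intelligence, Machine learning, Language model, Natural language understanding, Natural language processing, Natural language, Engineering, Voltage, Electrical engineering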
Authors
Bo Chen, Xingyi Cheng, Li Pan, Yangli‐ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chi-Ming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song
Identifier
DOI: 10.1101/2023.07.05.547496
Abstract

Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.