计算机科学
化学空间
人工智能
语言模型
机器学习
变压器
分子图
特征学习
编码器
自然语言处理
自然语言
图形
药物发现
理论计算机科学
化学
操作系统
物理
量子力学
电压
生物化学
作者
Jerret Ross,Brian Belgodere,Vijil Chenthamarakshan,Inkit Padhi,Youssef Mroueh,Payel Das
标识
DOI:10.1038/s42256-022-00580-7
摘要
Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties. Large language models have recently emerged with extraordinary capabilities, and these methods can be applied to model other kinds of sequence, such as string representations of molecules. Ross and colleagues have created a transformer-based model, trained on a large dataset of molecules, which provides good results on property prediction tasks.
科研通智能强力驱动
Strongly Powered by AbleSci AI