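Molecular graph
Computer science
Artificial neural network
Artificial intelligence
Cheminformatics
Graph
Chemical space
Machine learning
Differentiable function
Theoretical computer science
Drug discovery
Mathematics
Chemistry
Biochemistry
Mathematical analysis
Computational chemistry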
Authors
Yuyang Wang, Jianren Wang, Zhonglin Cao, Amir Barati Farimani
Identifier
DOI: 10.1038/s42256-022-00447-x
Abstract
Molecular machine learning bears promise for efficient molecular property prediction and drug discovery. However, labelled molecule data can be expensive and time consuming to acquire. Due to the limited labelled data, it is a great challenge for supervised-learning machine learning models to generalize to the giant chemical space. Here we present MolCLR (Molecular Contrastive Learning of Representations via Graph Neural Networks), a self-supervised learning framework that leverages large unlabelled data (~10 million unique molecules). In MolCLR pre-training, we build molecule graphs and develop graph-neural-network encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion and subgraph removal. A contrastive estimator maximizes the agreement of augmentations from the same molecule while minimizing the agreement of different molecules. Experiments show that our contrastive learning framework significantly improves the performance of graph-neural-network encoders on various molecular property benchmarks including both classification and regression tasks. Benefiting from pre-training on the large unlabelled database, MolCLR even achieves state of the art on several challenging benchmarks after fine-tuning. In addition, further investigations demonstrate that MolCLR learns to embed molecules into representations that can distinguish chemically reasonable molecular similarities.

Molecular representations are hard to design due to the large size of the chemical space, the amount of potentially important information in a molecular structure and the relatively low number of annotated molecules. Still, the quality of these representations is vital for computational models trying to predict molecular properties. Wang et al. present a contrastive learning approach to provide differentiable representations from unlabelled data.
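The three augmentations and the contrastive objective named in the abstract can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration and not the authors' implementation: molecules are plain lists of atom symbols plus index pairs for bonds, the helper names (`mask_atoms`, `delete_bonds`, `remove_subgraph`, `nt_xent`) and the `[MASK]` token are hypothetical, subgraph removal is approximated by a breadth-first walk from a random atom, and the "contrastive estimator" is assumed to be an NT-Xent-style loss over cosine similarities, which the abstract itself does not specify. The graph-neural-network encoder is omitted entirely; the loss takes ready-made embedding vectors.

```python
import math
import random


def mask_atoms(atoms, ratio, rng, mask_token="[MASK]"):
    """Atom masking: replace a random fraction of atom labels with a mask token."""
    n_mask = max(1, int(len(atoms) * ratio))
    masked = set(rng.sample(range(len(atoms)), n_mask))
    return [mask_token if i in masked else a for i, a in enumerate(atoms)]


def delete_bonds(bonds, ratio, rng):
    """Bond deletion: drop a random fraction of the bonds (edges)."""
    n_drop = max(1, int(len(bonds) * ratio))
    dropped = set(rng.sample(range(len(bonds)), n_drop))
    return [b for i, b in enumerate(bonds) if i not in dropped]


def remove_subgraph(atoms, bonds, ratio, rng, mask_token="[MASK]"):
    """Subgraph removal (assumed variant): grow a connected region by BFS from
    a random atom, mask its atoms and drop the bonds inside the region."""
    target = max(1, int(len(atoms) * ratio))
    neighbours = {i: [] for i in range(len(atoms))}
    for u, v in bonds:
        neighbours[u].append(v)
        neighbours[v].append(u)
    start = rng.randrange(len(atoms))
    removed, queue = [start], [start]
    while queue and len(removed) < target:
        node = queue.pop(0)
        for nxt in neighbours[node]:
            if nxt not in removed and len(removed) < target:
                removed.append(nxt)
                queue.append(nxt)
    region = set(removed)
    new_atoms = [mask_token if i in region else a for i, a in enumerate(atoms)]
    new_bonds = [(u, v) for u, v in bonds if not (u in region and v in region)]
    return new_atoms, new_bonds


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(z_a, z_b, temperature=0.1):
    """NT-Xent-style loss (assumed): z_a[i] and z_b[i] embed two augmentations
    of molecule i; every other embedding in the batch acts as a negative."""
    z = z_a + z_b
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of i
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / (2 * n)
```

In this sketch the loss is small when the two views of each molecule are embedded close together and far from the other molecules in the batch, and large when the pairing is mismatched, which is the behaviour the abstract describes as maximizing agreement within a molecule and minimizing it across molecules.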