Keywords
Virtual screening
Bayesian probability
Bayesian optimization
Baseline
Active learning (machine learning)
Chemical space
Artificial intelligence
Graph neural network
Bayesian network
Machine learning
Transformer (machine learning)
Drug discovery
Chemistry
Bioinformatics
Computer science
Authors
Zhonglin Cao, Simone Sciabola, Ye Wang
Identifier
DOI:10.1021/acs.jcim.3c01938
Abstract
Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As the size of commercially available compound collections grows exponentially to the scale of billions, active learning and Bayesian optimization have recently been proven as effective methods of narrowing down the search space. An essential component of those methods is a surrogate machine learning model that predicts the desired properties of compounds. An accurate model can achieve high sample efficiency by finding hits with only a fraction of the entire library being virtually screened. In this study, we examined the performance of a pretrained transformer-based language model and graph neural network in a Bayesian optimization active learning framework. The best pretrained model identifies 58.97% of the top-50,000 compounds after screening only 0.6% of an ultralarge library containing 99.5 million compounds, an 8% improvement over the previous state-of-the-art baseline. Through extensive benchmarks, we show that the superior performance of pretrained models persists in both structure-based and ligand-based drug discovery. Pretrained models can serve as a boost to the accuracy and sample efficiency of active learning-based virtual screening.
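The active learning loop the abstract describes can be illustrated with a minimal sketch. Everything here is hypothetical and not from the paper: the library is random feature vectors standing in for compound fingerprints, the hidden "docking score" is a noisy linear function, the surrogate is plain ridge regression standing in for the pretrained transformer/GNN models, and acquisition is purely greedy (no uncertainty term). It only shows the shape of the loop: train a surrogate on the labeled subset, score the pool, screen the best-predicted batch, repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a compound library: each row is a feature vector
# (in practice, a molecular fingerprint or learned embedding); the hidden
# docking score is a noisy linear function (lower = better, hypothetical).
n_compounds, n_features = 10_000, 16
X = rng.normal(size=(n_compounds, n_features))
w_true = rng.normal(size=n_features)
scores = X @ w_true + 0.1 * rng.normal(size=n_compounds)

def train_surrogate(X_lab, y_lab):
    # Ridge regression via the normal equations -- a simplified stand-in
    # for the pretrained surrogate models benchmarked in the paper.
    A = X_lab.T @ X_lab + 1e-3 * np.eye(X_lab.shape[1])
    return np.linalg.solve(A, X_lab.T @ y_lab)

batch, n_rounds = 200, 5
labeled = list(rng.choice(n_compounds, size=batch, replace=False))
for _ in range(n_rounds):
    w = train_surrogate(X[labeled], scores[labeled])
    pred = X @ w
    pred[labeled] = np.inf        # never re-select already-screened compounds
    # Greedy acquisition: "dock" the batch with the best predicted scores.
    labeled.extend(np.argsort(pred)[:batch])

# Sample efficiency: recall of the true top-500 after screening a fraction
top_true = set(np.argsort(scores)[:500])
recall = len(top_true & set(labeled)) / 500
print(f"screened {len(labeled) / n_compounds:.1%}, top-500 recall {recall:.1%}")
```

Because the surrogate here matches the (linear) ground truth, recall is far above the random-screening baseline after labeling only ~12% of the pool; the paper's contribution is showing that pretrained models retain this advantage at 99.5-million-compound scale.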