Keywords
Generative model
Chemical space
Artificial intelligence
Generative grammar
Computer science
Benchmark (surveying)
Machine learning
Field (mathematics)
Quality (philosophy)
Deep learning
Artificial neural network
Space (punctuation)
Drug discovery
Bioinformatics
Biology
Mathematics
Geography
Epistemology
Pure mathematics
Philosophy
Operating system
Geodesy
Authors
Michael A. Skinnider,R. Greg Stacey,David S. Wishart,Leonard J. Foster
Identifier
DOI: 10.1038/s42256-021-00368-1
Abstract
Deep generative models are powerful tools for the exploration of chemical space, enabling the on-demand generation of molecules with desired physical, chemical or biological properties. However, these models are typically thought to require training datasets comprising hundreds of thousands, or even millions, of molecules. This perception limits the application of deep generative models in regions of chemical space populated by a relatively small number of examples. Here, we systematically evaluate and optimize generative models of molecules based on recurrent neural networks in low-data settings. We find that robust models can be learned from far fewer examples than has been widely assumed. We identify strategies that further reduce the number of molecules required to learn a model of equivalent quality, notably including data augmentation by non-canonical SMILES enumeration, and demonstrate the application of these principles by learning models of bacterial, plant and fungal metabolomes. The structure of our experiments also allows us to benchmark the metrics used to evaluate generative models themselves. We find that many of the most widely used metrics in the field fail to capture model quality, but we identify a subset of well-behaved metrics that provide a sound basis for model development. Collectively, our work provides a foundation for directly learning generative models in sparsely populated regions of chemical space.

Editor's summary: Deep learning-based methods to generate new molecules can require huge amounts of data to train. Skinnider et al. show that models developed for natural language processing work well for generating molecules from small amounts of training data, and identify robust metrics to evaluate the quality of generated molecules.
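The augmentation strategy highlighted in the abstract rests on the fact that a single molecule has many valid SMILES strings, one per traversal order of its molecular graph. The sketch below illustrates that idea in pure Python for a toy acyclic molecule (ethanol). It is an illustration only: the function name `enumerate_smiles` and the graph encoding are assumptions made here, and real pipelines typically generate randomized SMILES with a cheminformatics toolkit such as RDKit rather than hand-rolled traversal.

```python
from itertools import permutations, product

def enumerate_smiles(atoms, adj):
    """Enumerate every SMILES string for a small ACYCLIC molecular graph.

    atoms: list of atomic symbols, e.g. ["C", "C", "O"] for ethanol.
    adj: adjacency list over atom indices (single bonds only).
    Toy sketch only: no rings, charges, aromaticity or bond orders.
    """
    def from_root(node, parent):
        children = [n for n in adj[node] if n != parent]
        if not children:
            return {atoms[node]}
        variants = set()
        # Every ordering of the neighbours gives a different traversal,
        # and hence a different (non-canonical) SMILES string.
        for perm in permutations(children):
            subtree_sets = [from_root(c, node) for c in perm]
            for combo in product(*subtree_sets):
                s = atoms[node]
                for i, sub in enumerate(combo):
                    # All branches except the last are parenthesised.
                    s += "(" + sub + ")" if i < len(combo) - 1 else sub
                variants.add(s)
        return variants

    out = set()
    for root in range(len(atoms)):  # start the traversal at every atom
        out |= from_root(root, None)
    return out

# Ethanol as a toy example: C-C-O.
print(sorted(enumerate_smiles(["C", "C", "O"], [[1], [0, 2], [1]])))
# → ['C(C)O', 'C(O)C', 'CCO', 'OCC']
```

Each of these four strings denotes the same molecule, so pairing all of them with the same training target multiplies the effective size of a small dataset, which is the low-data benefit the paper reports.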