Abstract
Virtual libraries used in molecular discovery are often too large to evaluate exhaustively, warranting the use of algorithms to help with exploration. Algorithmic approaches like Bayesian optimization can help to efficiently navigate predefined chemical spaces in combination with surrogate models. On-the-fly molecular generation during exploration, including generation by deep-learning-based models, enables even larger chemical spaces to be searched, although these chemical spaces are defined only implicitly. Emerging approaches to incorporate reactions into machine-learning-based generation can ensure that molecules are able to be synthesized, similar to previously developed algorithms for reaction-based de novo design.

Designing functional molecules with desirable properties is often a challenging, multi-objective optimization. For decades, there have been computational approaches to facilitate this process through the simulation of physical processes, the prediction of molecular properties using structure–property relationships, and the selection or generation of molecular structures. This review provides an overview of some algorithmic approaches to defining and exploring chemical spaces that have the potential to operationalize the process of molecular discovery. We emphasize the potential roles of machine learning and the consideration of synthetic feasibility, which is a prerequisite to 'closing the loop'. We conclude by summarizing important directions for the future development and evaluation of these methods.

Chemical space can be thought of as the set of all possible molecules or materials. We generally consider more narrowly defined chemical spaces that are defined or constrained by the structures or functions of the molecules they contain. For example, 'drug-like chemical space' is used in the context of drug discovery to reflect the vast number of molecules that have physical properties similar to those of existing small-molecule therapeutics. While quantifying the size of a chemical space is rarely useful, it should be noted that there are far more organic molecules thought to be stable than atoms in the solar system, which is unsurprising given the combinatorics of designing molecular graphs. Here, we focus our discussion on small molecules rather than periodic materials, biomolecules, and polymers, all of which correspond to distinct 'chemical spaces'. Many studies have estimated the size of different chemical spaces [1. Bohacek R.S. et al. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 1996; 16: 3-50, 2. Drew K.L.M. et al. Size estimation of chemical space: how big is it? J. Pharm. Pharmacol. 2012; 64: 490-495, 3. Polishchuk P.G. et al. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des.
2013; 27: 675-679] and suggested rules to organize these spaces along functional axes to improve their visualization and navigability [4. Oprea T.I. Gottfries J. Chemography: the art of navigating in chemical space. J. Comb. Chem. 2001; 3: 157-166, 5. Reymond J.-L. Awale M. Exploring chemical space for drug discovery using the Chemical Universe database. ACS Chem. Neurosci. 2012; 3: 649-657, 6. Awale M. Reymond J.-L. Web-based 3D-visualization of the DrugBank chemical space. J. Cheminform. 2016; 8: 25, 7. Probst D. Reymond J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 2020; 12: 12].

As we have described previously, the discovery of novel molecules can be framed as a search within chemical space [8. Coley C.W. et al. Autonomous discovery in the chemical sciences part I: progress. Angew. Chem. Int. Ed. 2019; (Published online September 25, 2019. https://doi.org/10.1002/anie.201909987), 9. Coley C.W. et al. Autonomous discovery in the chemical sciences part II: outlook. Angew. Chem. Int. Ed. 2019; (Published online September 25, 2019. https://doi.org/10.1002/anie.201909989)]. The goal is to identify one or more molecules that exhibit a set of desirable properties. Besides defining these properties and a strategy to evaluate candidate molecules, the two primary considerations one must make are: (i) how to define the space; and (ii) how to explore the space. Both contribute to the search efficiency and likelihood of finding a good candidate.
These two aspects are not independent: if you are repurposing FDA-approved drugs, your chemical space is narrow enough that an exhaustive screen may be feasible, but if you have no such restriction you must employ some strategy to select which molecules to test. These strategies are typically iterative optimization routines (driven by human intuition or driven by quantitative experimental design) with varying degrees of sophistication, as discussed later.

Navigating chemical space has been extensively written about in the context of (non-algorithmic) drug design [10. Dobson C.M. Chemical space and biology. Nature. 2004; 432: 824-828, 11. Lipinski C. Hopkins A. Navigating chemical space for biology and medicine. Nature. 2004; 432: 855-861]. The number of candidate molecules is too large to explore exhaustively, so one often imposes constraints on chemical space depending on the search strategy, the application, and the practical limitations of cost and time. These constraints look quite different when candidates are evaluated by physical rather than computational experiments. In the former case, acquiring new information about the performance of a molecule requires its physical synthesis, purification, and characterization; considerations of synthesis cost and material availability are paramount. In the latter case, one may postpone these practical considerations until after computational evaluations have identified a putative 'optimal' molecule. To bound the computational cost, the search space is still restricted using human expertise or some 'prior' on what would make a viable candidate.

This review examines strategies to define and explore chemical spaces with an emphasis on the role of machine learning and synthesizability constraints (Table 1, Key Table).
While this can be performed by subject-matter experts (e.g., medicinal chemists) in the absence of computer assistance, formalizing these concepts may eventually enable autonomous workflows to produce novel, useful outcomes with reduced reliance on human intuition and subjectivity. Elements of the concepts we cover can be found in previous articles, including a recent overview by Lemonick [12. Lemonick S. Exploring chemical space: can AI take us where no human has gone before? Chem. Eng. News. 2020; 98: 30]. We do not address visualization and instead refer readers to the work of Reymond and coworkers [5. Reymond J.-L. Awale M. Exploring chemical space for drug discovery using the Chemical Universe database. ACS Chem. Neurosci. 2012; 3: 649-657, 7. Probst D. Reymond J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 2020; 12: 12].

Table 1. Key Table. Categorization of Approaches to Define Chemical Spaces for Molecular Discovery and an Incomplete Set of Examples for Each^a

| | Unconstrained | Constrained |
|---|---|---|
| Predefined | ZINC [13. Irwin J.J. et al. ZINC: a free tool to discover chemistry for biology. J. Chem. Inf. Model. 2012; 52: 1757-1768], ChEMBL [15. Gaulton A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012; 40: D1100-D1107], PubChem [14. Kim S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019; 47: D1102-D1109], GDB [24. Reymond J.-L. The Chemical Space Project. Acc. Chem. Res. 2015; 48: 722-730] | DrugBank [16. Wishart D.S. et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006; 34: D668-D672], Enamine REAL (https://enamine.net/library-synthesis/real-compounds), WuXi Virtual Library (https://www.labnetwork.com/frontend-app/p/%5C#!/library/virtual), SAVI [32. Patel H. et al. Synthetically Accessible Virtual Inventory (SAVI). ChemRxiv. 2020; (Published online April 27, 2020. https://doi.org/10.26434/chemrxiv.12185559)], PGVL [33. Hu Q. et al. LEAP into the Pfizer Global Virtual Library (PGVL) space: creation of readily synthesizable design ideas automatically. Methods Mol. Biol. 2011; 685: 253-276], PLC [34. Nicolaou C.A. et al. The Proximal Lilly Collection: mapping, exploring and exploiting feasible chemical space. J. Chem. Inf. Model. 2016; 56: 1253-1266] |
| On the fly via heuristic methods | Fragment-based GAs [57. Venkatasubramanian V. et al. Computer-aided molecular design using genetic algorithms. Comput. Chem. Eng. 1994; 18: 833-844], GroupBuild [66. Rotstein S.H. Murcko M.A. GroupBuild: a fragment-based method for de novo drug design. J. Med. Chem. 1993; 36: 1700-1710], BREED [58. Pierce A.C. et al. BREED: generating novel inhibitors through hybridization of known ligands. Application to CDK2, P38, and HIV protease. J. Med. Chem. 2004; 47: 2768-2775], GraphGA [62. Jensen J.H. A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chem. Sci. 2019; 10: 3567-3572], GEGL [63. Ahn S. et al. Guiding deep molecular optimization with genetic exploration. arXiv. 2020; (Published online July 4, 2020. http://arxiv.org/abs/2007.04897)] | SYNOPSIS [91. Vinkers H.M. et al. SYNOPSIS: SYNthesize and OPtimize System in Silico. J. Med. Chem. 2003; 46: 2765-2773], Flux [88. Fechner U. Schneider G. Flux (1): a virtual synthesis scheme for fragment-based de novo design. J. Chem. Inf. Model. 2006; 46: 699-707], MOARF [89. Firth N.C. et al. MOARF, an integrated workflow for multi-objective optimization: implementation, synthesis, and biological evaluation. J. Chem. Inf. Model. 2015; 55: 1169-1180], DOGS [92. Hartenfeller M. et al. DOGS: reaction-driven de novo design of bioactive compounds. PLoS Comput. Biol. 2012; 8: e1002380] |
| On the fly via machine learning | SMILES VAE [118. Gomez-Bombarelli R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018; 4: 268-276], JT-VAE [75. Jin W. et al. Junction tree variational autoencoder for molecular graph generation. arXiv. 2018; (Published online February 12, 2018. https://arxiv.org/abs/1802.04364)], SMILES RNN [72. Segler M.H.S. et al. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2018; 4: 120-131, 73. Olivecrona M. et al. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 2017; 9: 48], MolDQN [77. Zhou Z. et al. Optimization of molecules via deep reinforcement learning. arXiv. 2018; (Published online October 19, 2018. http://arxiv.org/abs/1810.08678)] | MoleculeChef [96. Bradshaw J. et al. A model to search for synthesizable molecules. arXiv. 2019; (Published online June 12, 2019. http://arxiv.org/abs/1906.05221)], ChemBO [97. Korovina K. ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations. arXiv. 2019; (Published online August 5, 2019. http://arxiv.org/abs/1908.01425)], PGFS [98. Gottipati S.K. et al. Learning to navigate the synthetically accessible chemical space using reinforcement learning. arXiv. 2020; (Published online April 26, 2020. https://arxiv.org/abs/2004.12485v1)], REACTOR [99. Horwood J. Noutahi E. Molecular design in synthetically accessible chemical space via deep reinforcement learning. arXiv. 2020; (Published online April 29, 2020. https://arxiv.org/abs/2004.14308v1)] |

^a Spaces can be defined prior to exploration or defined on the fly by evolutionary and/or machine learning-based methods. They can be relatively unconstrained (i.e., only in terms of validity) or constrained by availability (i.e., in terms of purchasability or synthesizability).

One approach to molecular discovery is to explore a predefined chemical space: an enumerated list of candidate molecules. In this setting, the two stages of (i) defining the space and (ii) exploring the space are entirely decoupled. Formally, we might think about this problem as an optimization of an objective function f(x), where x is a molecule belonging to a discrete set X. Defining or selecting a finite chemical space often relies on domain expertise. Careful selection of X can increase the likelihood that it contains a high-performing molecule while minimizing the number of low-performing compounds. Common databases of molecules for computational screening are: ZINC [13. Irwin J.J. et al. ZINC: a free tool to discover chemistry for biology. J. Chem. Inf. Model. 2012; 52: 1757-1768], a library of commercially available compounds; PubChem [14. Kim S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res.
2019; 47: D1102-D1109], molecules with biological relevance; ChEMBL [15. Gaulton A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012; 40: D1100-D1107], molecules with bioactivity data; and DrugBank [16. Wishart D.S. et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006; 34: D668-D672], approved or experimental therapeutic molecules. These virtual libraries (see Glossary) all represent 'general-purpose' chemical spaces with broad biological relevance and are therefore applied to many problems related to drug discovery [17. Walters W.P. Virtual chemical libraries. J. Med. Chem. 2019; 62: 1116-1124].

More focused chemical spaces can be created through a domain-informed enumeration of compounds relevant to a specific application; for example, 1.6 million donor-bridge-acceptor trimers for organic electronics [18. Gomez-Bombarelli R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 2016; 15: 1120-1127] or 2.8 million transition-metal complexes for redox flow batteries [19. Janet J.P. et al. Accurate multiobjective design in a space of millions of transition metal complexes with neural-network-driven efficient global optimization. ACS Cent. Sci. 2020; 6: 513-524]. These are exhaustively enumerated chemical spaces with strict constraints on which fragments are included and how they are attached, similar to R-group enumeration methods. Privileged fragments for drug-like molecules have been identified through retrosynthetic analysis and automatic fragmentation [20. Lewell X.Q.
et al. RECAP – retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J. Chem. Inform. Comput. Sci. 1998; 38: 511-522, 21. Ertl P. Cheminformatics analysis of organic substituents: identification of the most common substituents, calculation of substituent properties, and automatic identification of drug-like bioisosteric groups. J. Chem. Inform. Comput. Sci. 2003; 43: 374-380]; the molecules produced by recombining these fragments are intended to look more promising than an enumeration based on graph structure alone.

Graph-theoretical enumeration of molecular structures has been studied for over a century, starting with simple spaces like that of acyclic alkanes [22. Cayley E. Ueber die analytischen Figuren, welche in der Mathematik Bäume genannt werden und ihre Anwendung auf die Theorie chemischer Verbindungen. Ber. Dtsch. Chem. Ges. 1875; 8 (in German): 1056-1059, 23. Henze H.R. Blair C.M. The number of isomeric hydrocarbons of the methane series. J. Am. Chem. Soc. 1931; 53: 3077-3085]. However, it is only recently that these structures have been recorded, evaluated, and used for discovery. The Chemical Space Project exemplifies modern exhaustive enumeration of all stable organic molecules containing common atom types up to a certain size [24. Reymond J.-L. The Chemical Space Project. Acc. Chem. Res. 2015; 48: 722-730]. Since the original Generated DataBase (GDB) of up to 11 heavy atoms [25. Fink T. Reymond J.-L.
Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. J. Chem. Inf. Model. 2007; 47: 342-353], Reymond and coworkers have enumerated, analyzed, and released the 166.4 billion structures of up to 17 heavy atoms [26. Ruddigkeit L. et al. Enumeration of 166 billion organic small molecules in the Chemical Universe database GDB-17. J. Chem. Inf. Model. 2012; 52: 2864-2875] and published numerous visualizations and analyses thereof.

In addition to the benefits of ensuring that X is relevant to the design objective, the predefinition of chemical spaces lets us impose arbitrary constraints on their contents. A practical constraint is the ease of experimental validation: that any candidate can be physically acquired for experimental testing. In the simplest case, a chemical space could be defined as the set of molecules in a company's chemical inventory or vendor catalog. Any compound from this list can be acquired rapidly for experimental evaluation. Accessibility is the primary motivation for make-on-demand libraries, which are chemical spaces defined as the molecules that are in stock or available and all molecules that can be produced from those structures through straightforward synthetic protocols. Libraries are often enumerated by applying a small number (<100) of reaction templates defining common single-step transformations to all possible combinations of starting materials [27. Cramer R.D. et al. Virtual compound libraries: a new approach to decision making in molecular discovery research. J. Chem. Inform. Comput. Sci. 1998; 38: 1010-1023, 28. Nikitin S. et al. A very large diversity space of synthetically accessible compounds for use with drug design programs. J. Comput.
Aided Mol. Des. 2005; 19: 47-63, 29. Cramer R.D. et al. AllChem: generating and searching 10^20 synthetically accessible structures. J. Comput. Aided Mol. Des. 2007; 21: 341-350, 30. Patel H. et al. Knowledge-based approach to de novo design using reaction vectors. J. Chem. Inf. Model. 2009; 49: 1163-1184] (Figure 1); recursive enumeration generates molecules accessible through multiple synthetic steps. There are numerous implementations of this approach [31. Hoffmann T. Gastreich M. The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov. Today. 2019; 24: 1148-1156], including SAVI [32. Patel H. et al. Synthetically Accessible Virtual Inventory (SAVI). ChemRxiv. 2020; (Published online April 27, 2020. https://doi.org/10.26434/chemrxiv.12185559)], efforts within pharmaceutical companies [33. Hu Q. et al. LEAP into the Pfizer Global Virtual Library (PGVL) space: creation of readily synthesizable design ideas automatically. Methods Mol. Biol. 2011; 685: 253-276, 34. Nicolaou C.A. et al. The Proximal Lilly Collection: mapping, exploring and exploiting feasible chemical space. J. Chem. Inf. Model. 2016; 56: 1253-1266], and efforts from commercial vendors (https://enamine.net/library-synthesis/real-compounds; https://www.labnetwork.com/frontend-app/p/%5C#!/library/virtual). As it becomes impractical to store such large numbers of compounds due to the combinatorial explosion of reaction products, these spaces may be defined implicitly. Whether molecules in these spaces are easy to synthesize depends on the robustness of rules used for enumeration.
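The template-based enumeration described above, and the idea of defining a space implicitly rather than storing it, can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the building-block lists, the function names, and in particular the toy "amide coupling", which simply joins two SMILES fragments as strings. A practical implementation would instead apply reaction templates with a cheminformatics toolkit and check that each product is valid.

```python
from itertools import islice, product
from typing import Callable, Iterator, Sequence

# Hypothetical building blocks written as SMILES fragments (illustrative only).
# Acyl fragments end at the carbonyl carbon; amine fragments start at the nitrogen.
ACYL_FRAGMENTS = ["CC(=O)", "c1ccccc1C(=O)", "CCC(=O)"]
AMINE_FRAGMENTS = ["NC", "NCC", "N1CCCCC1"]

# A "template" here is any function mapping two building blocks to one product.
Template = Callable[[str, str], str]


def amide_coupling(acyl: str, amine: str) -> str:
    """Toy stand-in for a reaction template: join acyl and amine fragments."""
    return acyl + amine


def enumerate_space(
    templates: Sequence[Template],
    blocks_a: Sequence[str],
    blocks_b: Sequence[str],
) -> Iterator[str]:
    """Lazily yield every product of every template applied to every pair.

    Because this is a generator, the space is defined implicitly: products
    are created only when requested and are never stored all at once.
    """
    for template, a, b in product(templates, blocks_a, blocks_b):
        yield template(a, b)


def space_size(n_templates: int, n_a: int, n_b: int) -> int:
    """Upper bound on single-step products for two-component templates."""
    return n_templates * n_a * n_b


if __name__ == "__main__":
    space = enumerate_space([amide_coupling], ACYL_FRAGMENTS, AMINE_FRAGMENTS)
    # Inspect only the first few members without materializing the rest.
    print(list(islice(space, 3)))
    # Hypothetical scale: 100 templates and 10,000 blocks per component.
    print(space_size(100, 10_000, 10_000))
```

The size calculation shows why storage becomes impractical: the count grows with the product of the template and building-block counts, and recursive (multi-step) enumeration compounds it further.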
Lyu and colleagues cite an 86% synthesis success rate for 51 compounds selected from 170 million in the Enamine REAL library enumerated from 130 reaction types; WuXi estimates a 60–80% success rate for their 1.7-billion-member collection generated by 30 reaction types (https://www.labnetwork.com/frontend-app/p/%5C#!/library/virtual). This success rate might be improved through the use of machine-learning models for reaction outcome prediction [35. Coley C.W. et al. A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 2019; 10: 370-377, 36. Schwaller P. et al. Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 2019; 5: 1572-1583], which for common reaction types exhibit accuracies above 90% on benchmark datasets. These neural models can be directly used to enumerate possible products or used to predict regio/stereoselectivity patterns [37. Tomberg A. et al. A predictive tool for electrophilic aromatic substitutions using machine learning. J. Org. Chem. 2019; 84: 4695-4703, 38. Beker W. et al. Prediction of major regio-, site-, and diastereoisomers in Diels–Alder reactions by using machine-learning: the importance of physically meaningful descriptors. Angew. Chem. Int. Ed. 2019; 58: 4515-4519, 39. Struble T.J. et al. Multitask prediction of site selectivity in aromatic C–H functionalization reactions. React. Chem. Eng. 2020; 5: 896-902].

Once these spaces are defined, there are several approaches to identify the top-performing molecules within them. The simplest strategy is, of course, to evaluate every candidate molecule. The feasibility of this approach depends on the nature of the evaluation and time/cost constraints.
It would not be practical to physically test every compound in the ZINC database, but it could be for smaller collections like the Drug Repurposing Hub [40. Corsello S.M. et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nat. Med. 2017; 23: 405-408] or the NCATS Pharmaceutical Collection [41. Huang R. et al. The NCATS Pharmaceutical Collection: a 10-year update. Drug Discov. Today. 2019; 24: 2341-2349]. It is worth noting that technologies like DNA-encoded libraries [42. Clark M.A. et al. Design, synthesis and selection of DNA-encoded small-molecule libraries. Nat. Chem. Biol. 2009; 5: 647-654] and phage display [43. Smith G.P. Petrenko V.A. Phage display. Chem. Rev. 1997; 97: 391-410] can be used to physically screen chemical spaces of trillions of molecules, albeit with a sparse and stochastic readout. If evaluation is computational, practicality is simply a question of computational budget. In one of the largest docking studies reported to date, 138 million and 99 million compounds from the Enamine REAL library were docked against the D4 receptor and AmpC, respectively [44. Lyu J. et al. Ultra large library docking for discovering new chemotypes. Nature. 2019; 566: 224-229]. More recent studies have since screened over 1 billion enumerated molecules from the same database [45. Gorgulla C. et al. An open-source drug discovery platform enables ultra-large virtual screens. Nature. 2020; 580: 663-668, 46. Acharya A. et al. Supercomputer-based ensemble docking drug discovery pipeline with application to Covid-19. ChemRxiv. 2020; (Published online July 29, 2020. https://doi.org/10.26434/chemrxiv.12725465.v1)].
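To make the budget question concrete, exhaustive screening can be sketched as evaluating f(x) for every x in X and keeping the best candidates, with a wall-clock cost that grows linearly with the size of X. The sketch below is illustrative only: the function names, the placeholder objective, and the per-evaluation time and worker count in the usage example are assumptions, not figures from the studies cited above.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def exhaustive_top_k(
    candidates: Iterable[str],
    objective: Callable[[str], float],
    k: int,
) -> List[Tuple[float, str]]:
    """Evaluate f(x) for every x in X and return the k best (score, x) pairs.

    Only k results are held in memory, but every candidate is still
    evaluated exactly once, so the cost is proportional to |X|.
    """
    scored = ((objective(x), x) for x in candidates)
    return heapq.nlargest(k, scored)


def screening_hours(
    n_candidates: int, seconds_per_eval: float, n_workers: int
) -> float:
    """Idealized wall-clock hours, assuming perfect parallel scaling."""
    return n_candidates * seconds_per_eval / n_workers / 3600.0


def toy_objective(smiles: str) -> float:
    """Placeholder for f(x): counts 'C' characters. Not a real property."""
    return float(smiles.count("C"))


if __name__ == "__main__":
    library = ["CCO", "CCCC", "C", "CCN"]
    print(exhaustive_top_k(library, toy_objective, k=2))
    # Hypothetical budget: 1 billion molecules, 1 s each, 10,000 workers.
    print(round(screening_hours(1_000_000_000, 1.0, 10_000), 1), "hours")
```

Under these assumed numbers, a tenfold larger library costs ten times as much, which is the linear scaling that makes exhaustive evaluation hard to sustain as libraries grow.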
As make-on-demand libraries can exceed this scale by multiple orders of magnitude, we argue that such exhaustive screening techniques are not a viable long-term approach even for inexpensive evaluations like docking.

A popular framework to reduce overall cost is active learning through iterative, model-guided optimization [47. Settles B. Active learning. Synth. Lect. Artif. Intell. Mach. Learn. 2012; 6: 1-114]. This involves selecting subsets of experiments to perform based on predictions from a quantitative structure–property relationship (QSPR) model: a surrogate model f̂(x) that codifies an approximation to f(x). In Bayesian optimization, predictions of performance and model uncertainty are both considered to balance the exploration of uncertain candidates and the exploitation of candidates likely to be high performing [48. Frazier P.I. A tutorial on Bayesian optimization. arXiv. 2018; (Published online July 8, 2018. https://arxiv.org/abs/1807.02811v1)]; simpler optimization schemes may simply perform a greedy search. Examples of this paradigm include the platform Eve for the identification of bioactive molecules [49. Williams K. et al. Cheaper faster drug development validated by the repositioning of drugs against neglected tropical diseases. J. R. Soc. Interface. 2015; 12: 20141289], retrospective identification of bioactive compounds using PubChem data [50. Kangas J.D. et al. Efficient discovery of responses of proteins to compounds using active learning. BMC Bioinformatics. 2014; 15: 143], computational screening of OLED-relevant molecules [18. Gomez-Bombarelli R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 2016; 15: 1120-1127], and the selection of compounds for docking [51. Gentile F.
et al. Deep Docking: a deep learning platform for augmentation of structure based drug discovery. ACS Cent. Sci. 2020; 6: 939-949]. There are still many limitations to be addressed related to the surrogate model, f̂, in terms of its low-data performance, generalization power, and ability to quantify uncertainty [52. Muratov E.N. et al. QSAR without borders. Chem. Soc. Rev. 2020; 49: 3525-3564], although methods for learning from graph-structured molecules are promising [53. Wu Z. et al. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020; (Published online March 24, 2020. https://doi.org/10.1109/TNNLS.2020.2978386)]. Algorithmic improvements to better handle variable evaluation costs (e.g., the cost of purchasing a compound) and batched optimization (e.g., parallelized in well plates or over multiple CPUs) would be beneficial. While multiple iterations lead to improved surrogate models, a one-iteration approach can still be very effective. A novel antibiotic was recently identified from a drug repurposing collection with fewer experiments than an exhaustive screen this way [54. Stokes J.M. et al. A deep learning approach to antibiotic discovery. Cell. 2020; 180: 688-702.e13