Machine-Learning-Guided Library Design Cycle for Directed Evolution of Enzymes: The Effects of Training Data Composition on Sequence Space Exploration

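Directed evolution; Sequence space; Sequence (biology); Directed molecular evolution; Protein engineering; Composition (language); Series (stratigraphy); Protein sequencing; Chemical space; Function (biology); Computer science; Computational biology; Enzyme; Biology; Artificial intelligence; Bioinformatics; Genetics; Peptide sequence; Biochemistry; Mathematics; Gene; Drug discovery; Linguistics; Mutant; Paleontology; Philosophy; Banach space; Pure mathematics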

Authors

Yutaka Saitô, Misaki Oikawa, T. Sato, Hikaru Nakazawa, Tsuyoshi Ito, Tomoshi Kameda, Koji Tsuda, Mitsuo Umetsu

Source

Journal: ACS Catalysis [American Chemical Society]
Date: 2021-11-19; Volume (issue): 11 (23): 14615-14624; Citations: 17

Abstract

Machine learning (ML) is becoming an attractive tool in mutagenesis-based protein engineering because of its ability to design a variant library containing proteins with a desired function. However, it remains unclear how ML guides directed evolution in sequence space depending on the composition of the training data. Here, we present an ML-guided directed evolution study of an enzyme to investigate the effects of a known “highly positive” variant (i.e., a variant known to have high enzyme activity) in the training data. We performed two separate series of ML-guided directed evolution of Sortase A, with and without a known highly positive variant called 5M in the training data. In each series, two rounds of ML were conducted: variants predicted in the initial round were experimentally evaluated and used as additional training data for the second round of prediction. The improvements in enzyme activity were comparable between the two series, both achieving enzyme activity 2.2–2.5 times higher than that of 5M. Intriguingly, the sequences of the improved variants differed substantially between the two series, indicating that ML guided the directed evolution to distinct regions of sequence space depending on the presence or absence of the highly positive variant in the training data. This suggests that the sequence diversity of improved variants can be expanded not only by conventional ML using the whole training data but also by ML using a subset of the training data, even when that subset lacks highly positive variants. In summary, this study demonstrates the importance of regulating the composition of training data in ML-guided directed evolution.
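
The two-series, two-round design described in the abstract can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and is not the authors' method: the four-letter alphabet, the four mutated sites, the `toy_assay` function standing in for the wet-lab measurement, the simple additive per-site model, and the sequence `CADA` used as a hypothetical analogue of 5M.

```python
import itertools
import random

ALPHABET = "ACDE"   # toy residue alphabet (assumption, not from the paper)
N_SITES = 4         # number of mutated positions (assumption)


def toy_assay(seq):
    """Hypothetical stand-in for an experimental activity measurement."""
    target = "CADE"  # arbitrary hidden optimum for the toy landscape
    return float(sum(a == b for a, b in zip(seq, target)))


def fit_additive_model(training):
    """Mean measured activity for every (site, residue) seen in training."""
    sums, counts = {}, {}
    for seq, activity in training.items():
        for key in enumerate(seq):  # key is (site, residue)
            sums[key] = sums.get(key, 0.0) + activity
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def predict(model, seq, default):
    """Score a sequence by summing per-site means; unseen pairs get default."""
    return sum(model.get(key, default) for key in enumerate(seq))


def run_series(training, rounds=2, batch=5):
    """Fit, pick the top-scoring unmeasured variants, assay them, retrain."""
    training = dict(training)  # do not mutate the caller's data
    for _ in range(rounds):
        model = fit_additive_model(training)
        default = sum(training.values()) / len(training)
        candidates = [
            "".join(p)
            for p in itertools.product(ALPHABET, repeat=N_SITES)
            if "".join(p) not in training
        ]
        candidates.sort(key=lambda s: predict(model, s, default), reverse=True)
        for seq in candidates[:batch]:
            training[seq] = toy_assay(seq)  # "experiment" feeds next round
    return training


rng = random.Random(0)
all_seqs = ["".join(p) for p in itertools.product(ALPHABET, repeat=N_SITES)]
initial = {s: toy_assay(s) for s in rng.sample(all_seqs, 20)}

known_positive = "CADA"  # hypothetical analogue of the 5M variant
with_pos = dict(initial)
with_pos[known_positive] = toy_assay(known_positive)
without_pos = {s: a for s, a in initial.items() if s != known_positive}

series_a = run_series(with_pos)     # training data includes the known variant
series_b = run_series(without_pos)  # training data lacks the known variant
```

Comparing the variants added in `series_a` and `series_b` is the toy counterpart of comparing the two experimental series; on a real enzyme the assay, the model, and the candidate library would all be far more involved.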
