计算生物学
基因组
调节顺序
基因组学
编码(集合论)
生物
基因
序列(生物学)
DNA测序
计算机科学
遗传学
基因表达调控
程序设计语言
集合(抽象数据类型)
作者
Carl G. de Boer,Jussi Taipale
标识
DOI:10.1101/2023.04.20.537701
摘要
Abstract Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. The “cis-regulatory code” - the rules that cells use to determine when, where, and how much genes should be expressed - has proven to be exceedingly complex, but recent advances in the scale and resolution of functional genomics assays and Machine Learning have enabled significant progress towards deciphering this code. However, we will likely never solve the cis-regulatory code if we restrict ourselves to models trained only on genomic sequences; regions of homology can easily lead to overestimation of predictive performance, and there is insufficient sequence diversity in our genomes to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences enable us to test a far larger sequence space than exists in our genomes in each experiment, and designed DNA sequences enable a targeted query of the sequence space to maximally improve the models. Since cells use the same biochemical principles to interpret DNA regardless of its source, models that are trained on these synthetic data can predict genomic activity, often better than genome-trained models. Here, we provide an outlook on the field, and propose a roadmap towards solving the cis-regulatory code by training models exclusively on non-genomic DNA sequences, and using genomic sequences solely for evaluating the resulting models.
科研通智能强力驱动
Strongly Powered by AbleSci AI