Abstract
In 2021, a year before ChatGPT took the world by storm amid the excitement about generative artificial intelligence (AI), AlphaFold 2 cracked the 50-year-old protein-folding problem, predicting three-dimensional (3D) structures for more than 200 million proteins from their amino acid sequences. That accomplishment was only the beginning of an unprecedented burgeoning of large language models (LLMs) in the life sciences. In recent months, we have moved into a hyperaccelerated phase of new foundation models, pretrained on massive datasets, that perform a wide range of tasks and are helping us understand the structure, biology, evolution, and design of proteins, RNA, DNA, and ligands, as well as their biomolecular interactions. Unlike multimodal LLMs such as GPT-4, Gemini, and Claude, which process text, audio, and images, these large language of life models (LLLMs) are multiomic: they are not only multimodal but also span different layers of molecular biology. For example, Evo, a foundation model trained on 2.7 million diverse phage and prokaryotic genomes (about 300 billion DNA nucleotides), predicts how variants in DNA, RNA, or proteins affect structure and function, estimates how essential genes are to cell function, and can generate new DNA sequences.