Human virome
Annotation
Computational biology
Genome
Biology
Gene
Genetics
Authors
Zachary Flamholz, Steven J. Biller, Libusha Kelly
Source
Journal: Nature Microbiology
Date: 2024-01-29
Volume (issue), pages: 9 (2): 537-549
Citations: 16
Identifier
DOI:10.1038/s41564-023-01584-8
Abstract
Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches. Ocean viral proteome annotations are expanded by a machine learning approach that is not reliant on sequence homology and can annotate sequences not homologous to those seen in training.