Homologous chromosome
Computational biology
Computer science
Biology
Genetics
Gene
Authors
Liang Hong, Zhigang Hu, Siqi Sun, Xiangru Tang, Jiuming Wang, Qingxiong Tan, Liangzhen Zheng, Sheng Wang, Sheng Xu, Irwin King, Mark Gerstein, Yu Li
Identifier
DOI:10.1038/s41587-024-02353-6
Abstract
The identification of protein homologs in large databases using conventional methods, such as protein sequence comparison, often misses remote homologs. Here, we offer an ultrafast, highly sensitive method, dense homolog retriever (DHR), for detecting homologs on the basis of a protein language model and dense retrieval techniques. Its dual-encoder architecture generates different embeddings for the same protein sequence and easily locates homologs by comparing these representations. Its alignment-free nature improves speed and the protein language model incorporates rich evolutionary and structural information within DHR embeddings. DHR achieves a >10% increase in sensitivity compared to previous methods and a >56% increase in sensitivity at the superfamily level for samples that are challenging to identify using alignment-based approaches. It is up to 22 times faster than traditional methods such as PSI-BLAST and DIAMOND and up to 28,700 times faster than HMMER. The new remote homologs exclusively found by DHR are useful for revealing connections between well-characterized proteins and improving our knowledge of protein evolution, structure and function.
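The retrieval step described in the abstract (embed every database sequence once, embed the query, rank candidates by embedding similarity with no alignment) can be illustrated with a minimal sketch. This is not the DHR implementation: `toy_embed` is a hypothetical stand-in that uses normalized 2-mer counts instead of a protein language model, one shared function plays the role of both encoders in the dual-encoder design, and the sequences and names below are invented for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(sequence: str, k: int = 2) -> np.ndarray:
    """Hypothetical encoder: L2-normalized k-mer count vector.

    Stands in for a learned protein language model embedding.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vec = np.zeros(len(AMINO_ACIDS) ** k)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if all(aa in index for aa in kmer):
            # Map the k-mer to a unique slot in the vector.
            pos = 0
            for aa in kmer:
                pos = pos * len(AMINO_ACIDS) + index[aa]
            vec[pos] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_index(database: dict[str, str]) -> tuple[list[str], np.ndarray]:
    """Embed every database sequence once, ahead of query time."""
    names = list(database)
    matrix = np.stack([toy_embed(database[name]) for name in names])
    return names, matrix


def retrieve(query: str, names: list[str], matrix: np.ndarray, top_k: int = 2):
    """Rank database entries by cosine similarity to the query embedding."""
    scores = matrix @ toy_embed(query)  # rows are unit vectors, so this is cosine
    order = np.argsort(-scores)[:top_k]
    return [(names[i], float(scores[i])) for i in order]


# Invented toy sequences, for illustration only.
database = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "MKTAYIAKQRQISFVKAHFSRQ",
    "protC": "GGGGSGGGGSGGGGSGGGGS",
}
names, matrix = build_index(database)
hits = retrieve("MKTAYIAKQRQISFVKSHFSRQ", names, matrix)
```

Because the database embeddings are computed once, each query costs a single matrix-vector product, which is the general reason alignment-free retrieval can be fast. A large-scale system would typically replace the exhaustive product with an approximate nearest-neighbor index.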