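Centroid
Embedding
Deep learning
Computer science
Nearest neighbor search
Artificial intelligence
Feature (linguistics)
Protein structure prediction
Similarity (geometry)
Pattern recognition (psychology)
Metric (unit)
Feature vector
k-nearest neighbors algorithm
Machine learning
Data mining
Algorithm
Protein structure
Engineering
Philosophy
Linguistics
Physics
Operations management
Nuclear magnetic resonance
Image (mathematics)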
Authors
Wei Yang, Yang Liu, Chunjing Xiao
Identifier
DOI:10.1016/j.knosys.2022.108356
Abstract
Predicting the secondary structure of a protein from its amino acid sequence alone is a challenging per-residue prediction task in bioinformatics. Recent work has mainly used deep models based on profile features derived from multiple sequence alignments. However, existing state-of-the-art predictors usually incur high computational costs because of their large model sizes and complex network architectures. Here, we propose a simple yet effective deep centroid model for sequence-to-sequence secondary structure prediction based on deep metric learning. The proposed model adopts a lightweight embedding network with a multibranch topology to map each residue in a protein chain into an embedding space. The goal of embedding learning is to maximize the similarity of each residue to its target centroid while minimizing its similarity to nontarget centroids. By assigning secondary structure types based on the learned centroids, we bypass the need for a time-consuming k-nearest neighbor search. Experimental results on six test sets demonstrate that our method achieves state-of-the-art performance with a simpler architecture and a smaller model size than existing models. Moreover, we show experimentally that the embedding feature from the pretrained protein language model ProtT5-XL-U50 is superior to the profile feature in both prediction accuracy and feature generation speed. Code and datasets are available at https://github.com/fengtuan/DML_SS.
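The centroid-based assignment described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes cosine similarity as the similarity measure, a softmax cross-entropy over scaled similarities as the training objective, and a toy 2-dimensional embedding space with three classes; the function names, the `scale` parameter, and all numeric values are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def assign_by_centroid(embeddings, centroids):
    """Assign each residue to its most similar class centroid.

    embeddings: (L, d) array, one row per residue (assumed output of an embedding network).
    centroids:  (C, d) array, one learned centroid per secondary structure class.
    Returns the (L,) predicted class indices and the (L, C) similarity matrix.
    The cost is one (L, d) x (d, C) product, with no search over training residues.
    """
    sims = l2_normalize(embeddings) @ l2_normalize(centroids).T
    return sims.argmax(axis=1), sims


def centroid_loss(embeddings, centroids, labels, scale=10.0):
    """Assumed objective: cross-entropy over scaled cosine similarities.

    Lowering it raises similarity to the target centroid and lowers
    similarity to the nontarget centroids.
    """
    _, sims = assign_by_centroid(embeddings, centroids)
    logits = scale * sims
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Toy data: three hypothetical class centroids and three residue embeddings.
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
embeddings = np.array([[2.0, 0.1], [0.2, 3.0], [-1.0, -2.0]])

pred, sims = assign_by_centroid(embeddings, centroids)
print(pred)
```

Because the number of centroids equals the number of classes, inference scales with the class count rather than with the size of the training set, which is the saving over a k-nearest neighbor search that the abstract refers to.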