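Subcellular localization
Artificial intelligence
UniProt
Computer science
Protein subcellular localization prediction
Generalization
Machine learning
Protein sequencing
Set (abstract data type)
Pattern recognition (psychology)
Biology
Peptide sequence
Mathematics
Biochemistry
Mathematical analysis
Gene
Cytoplasm
Programming language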
Authors
Sam Giannakoulias, John J. Ferrie, Andrew Apicello, Carter A. Mitchell
Identifier
DOI:10.1101/2023.09.01.555932
Abstract
Protein subcellular localization is a critically important parameter to consider when designing expression constructs and production strategies for industry-scale protein production. In this study, we present Prot-SCL, an innovative self-supervised machine learning approach to predict protein subcellular localization exclusively from primary sequence. The models herein were learned from a dataset of subcellular localizations derived by exhaustively analyzing the UniProt database. The set of localization data was rigorously curated for machine learning by employing group sampling following clustering of the protein sequences. The novel component of this approach lies in the development of a triplet neural network architecture capable of generating meaningful embeddings for classification of protein subcellular localization. We observed robust predictive power for our classical gradient-boosted machine learning models trained on these triplet embeddings, both in cross-validation and in generalization to the testing set. Importantly, we have made this extensive dataset of protein subcellular localizations publicly accessible, facilitating future, need-based localization studies. Finally, we provide the relevant codebase to encourage wider adoption and expansion of this methodology.
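Two ideas named in the abstract, group sampling after clustering and a triplet objective for learning embeddings, can be illustrated with a minimal sketch. This is not the authors' code: the function names `group_split` and `triplet_margin_loss`, the 20% test fraction, the Euclidean distance, and the margin of 1.0 are all assumptions chosen for illustration. The abstract does not specify the clustering method, the split ratio, or the exact loss used in Prot-SCL.

```python
import math
import random
from collections import defaultdict


def group_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no sequence cluster spans train and test.

    cluster_ids[i] is the (hypothetical) cluster label of sequence i.
    """
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)

    ordered = sorted(groups)  # deterministic order before shuffling
    random.Random(seed).shuffle(ordered)

    n_test_target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in ordered:
        # Assign whole clusters to the test set until the target size is met.
        if len(test) < n_test_target:
            test.extend(groups[cid])
        else:
            train.extend(groups[cid])
    return train, test


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (an assumed form, not confirmed by the source).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

In this sketch, a positive would be a protein sharing the anchor's localization label and a negative one with a different label. Keeping whole clusters on one side of the split is what prevents near-identical sequences from leaking between training and testing data.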