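Computer science
Heuristic
Sequence (biology)
Pipeline (software)
Set (abstract data type)
Sequence database
Software
Multiple sequence alignment
Function (biology)
Protein function prediction
Process (computing)
Algorithm
Data mining
Sequence alignment
Protein function
Peptide sequence
Programming language
Biology
Gene
Operating system
Evolutionary biology
Biochemistry
Genetics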
Identifier
DOI:10.1101/2022.03.23.485035
Abstract
Many of the machine learning (ML) models used in bioinformatics and computational biology to predict the function or structure of proteins rely on evolutionary information summarized in multiple-sequence alignments (MSAs) or the resulting position-specific scoring matrices (PSSMs), as generated by PSI-BLAST. The current procedure used in protein structure and function prediction is computationally expensive and time-consuming. The main issue is that PSI-BLAST must load the current sequence database (about 220 GB) in batches and search it for sequences that align with the query. This leads to an average runtime of about 40-60 min for a medium-sized query protein (450 amino acids), although the exact runtime depends on the hardware used. The problem is growing because bio-sequence data pools are increasing exponentially in size, which raises PSI-BLAST runtime as well. A prominent alternative, MMseqs2, claims to speed up the process 100-fold: given enough memory, it loads the whole database into memory and applies heuristics to retrieve the relevant set of aligned sequences. However, MMseqs2 cannot be used directly to generate the final output in the desired PSI-BLAST alignment and PSSM profile format. In this research project, we analyzed the runtime performance of each tool separately. Furthermore, we built a pipeline that combines MMseqs2 and PSI-BLAST into a robust, optimized hybrid alignment tool that is two orders of magnitude faster than PSI-BLAST. It is implemented in C++ and is freely available under the MIT license at https://github.com/issararab/IsarPipeline . The output of our pipeline was evaluated on two previously built predictive models.
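The abstract does not say how the two tools are chained, so the following is only a minimal sketch of one plausible arrangement, assumed here for illustration: MMseqs2 prefilters the large database, the retrieved sequences are indexed as a small BLAST database, and PSI-BLAST runs against that reduced set to produce the alignment and PSSM. The function names, intermediate file names, and the default of three iterations are hypothetical; the sketch only assembles command lines and does not execute anything. The released pipeline itself is written in C++ and may work differently.

```python
from pathlib import Path


def build_hybrid_commands(query_fasta, target_db, work_dir, iterations=3):
    """Return the command lines for a two-stage search (nothing is executed).

    Assumed design, not taken from the paper: MMseqs2 prefilters the large
    database, then PSI-BLAST runs only against the sequences it retrieved.
    """
    work = Path(work_dir)
    hits_tsv = work / "hits.tsv"      # hypothetical intermediate files
    hits_fasta = work / "hits.fasta"
    hits_db = work / "hits_db"

    return [
        # Stage 1: fast heuristic search; report target id and target sequence.
        ["mmseqs", "easy-search", str(query_fasta), str(target_db),
         str(hits_tsv), str(work / "tmp"), "--format-output", "target,tseq"],
        # Stage 2: index the (much smaller) hit set for BLAST+.
        ["makeblastdb", "-in", str(hits_fasta), "-dbtype", "prot",
         "-out", str(hits_db)],
        # Stage 3: PSI-BLAST on the reduced database, writing alignment and PSSM.
        ["psiblast", "-query", str(query_fasta), "-db", str(hits_db),
         "-num_iterations", str(iterations),
         "-out", str(work / "query.aln"),
         "-out_ascii_pssm", str(work / "query.pssm")],
    ]


def tsv_to_fasta(tsv_text):
    """Convert 'target<TAB>sequence' lines into FASTA, dropping duplicate ids."""
    seen, records = set(), []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        target, seq = line.split("\t")
        if target not in seen:
            seen.add(target)
            records.append(f">{target}\n{seq}")
    return "\n".join(records) + "\n" if records else ""
```

To run such a sketch, each command could be passed in order to `subprocess.run(cmd, check=True)`, with `tsv_to_fasta` applied to the stage 1 output to write the FASTA file that stage 2 indexes. Restricting PSI-BLAST to the prefiltered hits is what would keep its output format while avoiding a scan of the full 220 GB database.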