IsarPipeline: Combining MMseqs2 and PSI-BLAST to Quickly Generate Extensive Protein Sequence Alignment Profiles

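Keywords: Computer science; Heuristic; Sequence (biology); Pipeline (software); Set (abstract data type); Sequence database; Software; Multiple sequence alignment; Function (biology); Protein function prediction; Process (computing); Algorithm; Data mining; Sequence alignment; Protein function; Peptide sequence; Programming language; Biology; Gene; Operating system; Evolutionary biology; Biochemistry; Genetics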
Author
Issar Arab
Identifier
DOI:10.1101/2022.03.23.485035
Abstract

Many machine learning (ML) models used in bioinformatics and computational biology to predict protein function or structure rely on evolutionary information summarized in multiple-sequence alignments (MSAs) or in the resulting position-specific scoring matrices (PSSMs), as generated by PSI-BLAST. The current procedure used in protein structure and function prediction is computationally expensive and time-consuming. The main bottleneck is that PSI-BLAST must load the current sequence database (about 220 GB) in batches while searching for sequences that align to the query. This leads to an average runtime of about 40-60 min for a medium-sized query protein (450 amino acids), a figure that depends on the hardware used to run the software. The problem is growing because biological sequence databases are increasing in size exponentially, which raises PSI-BLAST runtime as well. A prominent alternative, MMseqs2, claims to speed up the current process 100-fold: given enough memory, it loads the whole database into memory and applies heuristics to retrieve the relevant set of aligned sequences. However, MMseqs2 cannot be used directly to generate the final output in the desired PSI-BLAST alignment and PSSM profile formats. In this research project, we analyzed the runtime performance of each tool separately. Furthermore, we built a pipeline that combines MMseqs2 and PSI-BLAST to obtain a robust, optimized, and very fast hybrid alignment tool that is two orders of magnitude faster than PSI-BLAST. It is implemented in C++ and is freely available under the MIT license at https://github.com/issararab/IsarPipeline. The output of our pipeline was evaluated on two previously built predictive models.
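The abstract does not spell out how the two tools are chained, so the following is only a minimal sketch of one plausible arrangement, not the author's C++ implementation. It assumes MMseqs2 prefilters the large database, the hits are written to a small FASTA file, and PSI-BLAST then builds the alignment and PSSM against that reduced set. The function names, file names, and the default of three iterations are hypothetical; the command-line options shown (`mmseqs easy-search --format-output`, `makeblastdb -dbtype prot`, `psiblast -num_iterations -out_ascii_pssm`) are standard options of those tools. The code only builds the command lines and does not run them.

```python
from pathlib import Path


def hits_tsv_to_fasta(tsv_text):
    """Convert 'target<TAB>sequence' lines into FASTA text, skipping repeated targets."""
    seen = set()
    records = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        target, seq = line.split("\t")[:2]
        if target in seen:
            continue
        seen.add(target)
        records.append(f">{target}\n{seq}\n")
    return "".join(records)


def build_commands(query_fasta, big_db, work_dir, iterations=3):
    """Return (prefilter, make_db, profile) command lists; nothing is executed."""
    work = Path(work_dir)  # hypothetical scratch directory
    hits_tsv = work / "hits.tsv"
    hits_fasta = work / "hits.fasta"  # to be written from hits_tsv via hits_tsv_to_fasta
    small_db = work / "hits_db"
    # Step 1: fast MMseqs2 search; output only the target id and target sequence.
    prefilter = [
        "mmseqs", "easy-search", str(query_fasta), str(big_db),
        str(hits_tsv), str(work / "tmp"),
        "--format-output", "target,tseq",
    ]
    # Step 2: build a small protein BLAST database from the prefiltered hits.
    make_db = [
        "makeblastdb", "-in", str(hits_fasta),
        "-dbtype", "prot", "-out", str(small_db),
    ]
    # Step 3: PSI-BLAST against the small database, writing alignment and PSSM.
    profile = [
        "psiblast", "-query", str(query_fasta), "-db", str(small_db),
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", str(work / "query.pssm"),
        "-out", str(work / "query.aln"),
    ]
    return prefilter, make_db, profile
```

Under these assumptions, a caller would run the first command, pass the contents of `hits.tsv` through `hits_tsv_to_fasta` to write `hits.fasta`, and then run the remaining two commands, for example with `subprocess.run(cmd, check=True)`. The idea is that PSI-BLAST only ever scans the small hit set rather than the full database.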