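Random forest
Interface (matter)
Computer science
Feature (linguistics)
Data mining
Binary number
Protein structure prediction
Pattern recognition (psychology)
Sampling (signal processing)
Algorithm
Artificial intelligence
Protein structure
Mathematics
Chemistry
Maximum bubble pressure method
Parallel computing
Bubble
Philosophy
Computer vision
Filter (signal processing)
Arithmetic
Biochemistry
Linguistics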
Authors
Minjie Li,Ziheng Wu,Wenyan Wang,Kun Lü,Jun Zhang,Yuming Zhou,Zhaoquan Chen,Dan Li,Shicheng Zheng,Peng Chen,Bing Wang
Identifier
DOI:10.1109/tcbb.2021.3123269
Abstract
Computational methods for predicting protein-protein interaction sites can avoid the high cost and long turnaround of traditional experimental approaches. However, the severe class imbalance between interface and non-interface residues in protein sequences limits the prediction performance of these methods. This work therefore proposes a new strategy, NearMiss-based under-sampling for imbalanced datasets combined with Random Forest classification (NM-RF), to predict protein interaction sites. Residues in protein sequences are represented by PSSM-derived features, the hydropathy index (HI), and relative solvent accessibility (RSA). To resolve the class imbalance problem, an under-sampling method based on the NearMiss algorithm removes some non-interface residues, and the random forest algorithm then performs binary classification on the balanced feature datasets. Experiments show that the accuracy of the NM-RF model reaches 87.6% and 84.3% on Dtestset72 and PDBtestset164, respectively, which demonstrates the effectiveness of the proposed NM-RF method in distinguishing interface from non-interface residues.
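The under-sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which NearMiss variant the authors use, so this example assumes NearMiss-1 (keep the majority samples whose mean distance to their k nearest minority samples is smallest). The function name `nearmiss1_undersample`, the parameter `k`, and the two-dimensional toy vectors are hypothetical; they stand in for the PSSM, HI, and RSA features and are not taken from the paper.

```python
import math


def nearmiss1_undersample(X, y, k=3, majority_label=0):
    """Balance a binary dataset with a NearMiss-1 style rule (assumed variant).

    Keeps every minority sample and only as many majority samples as there
    are minority samples, choosing the majority samples with the smallest
    mean Euclidean distance to their k nearest minority samples.
    """
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    if not minority_idx:
        raise ValueError("need at least one minority sample")

    k = min(k, len(minority_idx))
    n_keep = min(len(minority_idx), len(majority_idx))

    def mean_dist_to_nearest_minority(i):
        # Distances from majority sample i to all minority samples, ascending.
        dists = sorted(math.dist(X[i], X[j]) for j in minority_idx)
        return sum(dists[:k]) / k

    kept_majority = sorted(majority_idx, key=mean_dist_to_nearest_minority)[:n_keep]
    keep = sorted(minority_idx + kept_majority)  # preserve original order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: label 1 = interface residue (minority), 0 = non-interface (majority).
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [6.0, 6.0], [0.2, 0.1], [9.0, 9.0]]
y = [1, 1, 0, 0, 0, 0]

X_bal, y_bal = nearmiss1_undersample(X, y, k=2)
print(y_bal.count(0), y_bal.count(1))
```

The balanced `X_bal` and `y_bal` would then be passed to a random forest classifier for the binary interface versus non-interface decision; that step is omitted here to keep the sketch dependency-free.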