SASA
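Generalization
Random forest
Similarity (geometry)
Computational biology
Cluster analysis
Artificial intelligence
Computer science
Machine learning
Virtual screening
Data mining
Bioinformatics
Mathematics
Biology
Drug discovery
Paleontology
Mathematical analysis
Image (mathematics)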
Authors
Hui Zhu,Jincai Yang,Niu Huang
Identifier
DOI:10.1021/acs.jcim.2c01149
Abstract
In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein–ligand atomic interactions. By focusing on the local domains of ligand binding pockets, a standardized pocket Pfam-based clustering (Pfam-cluster) approach was developed to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). Subsequently, 12 typical MLSFs were evaluated using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV) methods. Surprisingly, all of the tested models performed progressively worse from Random-CV to Seq-CV to Pfam-CV, and none showed satisfactory generalization capacity. Our interpretability analysis suggested that MLSF predictions on novel targets depended on buried solvent-accessible surface area (SASA)-related features of the complex structures, with higher predicted binding affinities for complexes with larger protein–ligand interfaces. By combining buried SASA-related features with target-specific patterns that were only shared among structurally similar compounds in the same cluster, the random forest (RF)-Score attained good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious with the features learned by MLSFs.
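The contrast between random and cluster-based cross-validation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cluster labels, the function names, and the greedy fold-balancing rule are all assumptions made for illustration. The only idea taken from the abstract is that in a cluster-based split (such as Seq-CV or Pfam-CV) every cluster is held out as a whole, so no test complex shares a cluster with any training complex.

```python
import random
from collections import defaultdict


def random_cv_folds(n_samples, n_folds, seed=0):
    """Random-CV: shuffle sample indices and deal them into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def cluster_cv_folds(cluster_ids, n_folds):
    """Cluster-based CV: each cluster is assigned to exactly one fold."""
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumption of this sketch): place the largest
    # remaining cluster into whichever fold is currently smallest.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[cluster])
    return folds


def clusters_shared_with_training(folds, cluster_ids, test_fold):
    """Return the clusters that occur in both the test fold and the training folds."""
    test = {cluster_ids[i] for i in folds[test_fold]}
    train = {
        cluster_ids[i]
        for k, fold in enumerate(folds)
        if k != test_fold
        for i in fold
    }
    return test & train


# Hypothetical cluster label for each of 16 complexes (not real Pfam data).
cluster_ids = ["A"] * 6 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2 + ["E"]

random_folds = random_cv_folds(len(cluster_ids), n_folds=3)
cluster_folds = cluster_cv_folds(cluster_ids, n_folds=3)

for k in range(3):
    print(
        f"fold {k}: "
        f"random split shares {sorted(clusters_shared_with_training(random_folds, cluster_ids, k))}, "
        f"cluster split shares {sorted(clusters_shared_with_training(cluster_folds, cluster_ids, k))}"
    )
```

In the cluster-based split the shared set is empty for every fold by construction, whereas a random split will usually place members of the same cluster on both sides of the train/test boundary. That overlap is the kind of leakage the abstract suggests can inflate Random-CV performance.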