Jiaqi He, Yangyang Li, Yuan Gao, Xu Zhou, Zhuoya Wang, Hong Guang, Weidan Wang, Yujie Dong, Hao Liu
Identifier
DOI:10.1109/bibm58861.2023.10385880
Abstract
As the data volume of large-scale virtual screening grows, the associated data processing and management become challenging. We have developed UCAPF, a unified platform for large-scale virtual screening. The platform provides a parallel processing framework for large-scale virtual screening data. It also supports scheduling across heterogeneous parallel architectures and hierarchical storage of massive data. The processing framework improves data quality: on the CASF-2016 dataset, molecules standardized by UCAPF showed a 9.5% to 14.62% improvement in scoring performance and a 7.9% to 34.6% improvement in ranking performance compared with the raw molecules. For massive data processing, the framework achieves a parallel efficiency of 81.20% for molecule standardization and 79.51% for docking result processing on a 72-unit Hadoop cluster. In addition, the distributed database used for data management speeds up retrieval of 10,094 molecules from seventy million docking results by a factor of 2.40 compared with the single-node storage model. Finally, we analyze how input/output (I/O) varies over time across the phases of virtual screening, to show the effectiveness of the scheduling strategy and tiered storage for the heterogeneous parallel architecture.
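
To put the reported parallel efficiencies in context, the sketch below converts them into the speedups they would imply. It assumes the conventional definition, efficiency = speedup / number of processing units, with speedup measured against a single unit; the abstract does not state which definition or baseline was used, nor whether a "unit" is a node or a core. The function names are hypothetical and are not part of UCAPF.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, units: int) -> float:
    """Conventional parallel efficiency: E = T_1 / (p * T_p).

    Assumption: this is the textbook definition, not necessarily the one
    used by the UCAPF authors.
    """
    if units < 1 or t_serial <= 0 or t_parallel <= 0:
        raise ValueError("units must be >= 1 and run times must be positive")
    return t_serial / (units * t_parallel)


def implied_speedup(efficiency: float, units: int) -> float:
    """Speedup implied by an efficiency under E = S / p, i.e. S = E * p."""
    if units < 1 or efficiency <= 0:
        raise ValueError("units must be >= 1 and efficiency must be positive")
    return efficiency * units


if __name__ == "__main__":
    # Efficiencies and unit count as reported in the abstract.
    for task, eff in [("molecule standardization", 0.8120),
                      ("docking result processing", 0.7951)]:
        print(f"{task}: implied speedup on 72 units = {implied_speedup(eff, 72):.2f}")
```

Under these assumptions, the reported efficiencies would correspond to speedups of roughly 58 and 57 on 72 units, respectively.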