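Cheminformatics
Python (programming language)
Scalability
Computer science
SPARK (programming language)
Workstation
Big data
Scripting language
Computer cluster
Analysis
Data mining
Computational science
Parallel computing
Operating system
Distributed computing
Database
Programming language
Bioinformatics
Biology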
Authors
Mario Lovrić,José Manuel Molero,Roman Kern
Identifier
DOI:10.1002/minf.201800082
Abstract
The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations, given basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark provides reasonable scalability, depending on the use case, and can be a suitable choice for datasets too big to be processed with current low‐end workstations.
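The fingerprint-similarity use case can be illustrated with a minimal sketch. This is not the authors' PySpark-RDKit code. It assumes the Tanimoto coefficient as the similarity measure and represents each fingerprint as a plain set of on-bit indices; the molecule names, bit values, the 0.3 threshold, and the helper `score_partition` are all hypothetical. The per-partition function is written in the shape one would typically hand to PySpark's `RDD.mapPartitions`, but here it runs on ordinary Python lists so that it needs neither Spark nor RDKit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def score_partition(query, records):
    """Score every (id, fingerprint) record of one partition against the query."""
    for mol_id, fp in records:
        yield mol_id, tanimoto(query, fp)


# Toy fingerprints: hypothetical on-bit indices, not real RDKit output.
library = [
    ("mol_a", frozenset({1, 2, 3, 4})),
    ("mol_b", frozenset({3, 4, 5, 6})),
    ("mol_c", frozenset({7, 8})),
]
query = frozenset({1, 2, 3, 4})

# Split the library into two "partitions" to mimic how an RDD is divided
# across workers; each partition is scored independently of the other.
partitions = [library[:2], library[2:]]
scores = [hit for part in partitions for hit in score_partition(query, part)]

# Keep molecules at or above an assumed similarity threshold of 0.3.
hits = [mol_id for mol_id, score in scores if score >= 0.3]
print(hits)
```

Because each partition is scored without reference to any other, the work divides naturally across cluster nodes, which is the property that makes this kind of calculation a candidate for Spark.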