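Workflow
Data curation
Quantitative structure-activity relationship
Computer science
Identifier
Data mining
Process (computing)
Quality (philosophy)
Machine learning
Information retrieval
Database
Epistemology
Operating system
Philosophy
Programming language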
Authors
Kamel Mansouri, Chris Grulke, Ann M. Richard, Richard Judson, Antony Williams
Identifier
DOI:10.1080/1062936x.2016.1253611
Abstract
The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
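As a rough illustration of the kind of identifier check such a curation step might include, the sketch below validates CASRNs with the standard CAS check-digit rule and flags records with missing fields. It is a minimal Python sketch, not the authors' KNIME workflow: the record keys (`name`, `casrn`, `smiles`) and the helper names `is_valid_casrn` and `flag_records` are hypothetical, and the abstract does not state which specific checks the workflow applies.

```python
import re

# CASRN layout: 2-7 digits, 2 digits, 1 check digit, separated by hyphens.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    match = CASRN_PATTERN.match(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight each digit by its position counted from the right (1, 2, 3, ...).
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


def flag_records(records):
    """Attach a list of detected issues to each record.

    The keys 'casrn' and 'smiles' are assumptions made for this sketch.
    """
    flagged = []
    for record in records:
        issues = []
        if not is_valid_casrn(record.get("casrn") or ""):
            issues.append("invalid CASRN")
        if not (record.get("smiles") or "").strip():
            issues.append("missing SMILES")
        flagged.append({**record, "issues": issues})
    return flagged


if __name__ == "__main__":
    sample = [
        {"name": "water", "casrn": "7732-18-5", "smiles": "O"},
        {"name": "ethanol", "casrn": "64-17-4", "smiles": "CCO"},  # wrong check digit
        {"name": "formaldehyde", "casrn": "50-00-0", "smiles": ""},
    ]
    for row in flag_records(sample):
        print(row["name"], row["issues"])
```

A check like this only catches malformed identifiers. Confirming that a name, CASRN, SMILES, and MolBlock all describe the same structure would additionally require a cheminformatics toolkit and a reference lookup, which this sketch does not attempt.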