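Class (philosophy)
Artificial intelligence
Machine learning
Scarcity
Binary number
Binary data
Binary classification
Computer science
Support vector machine
Mathematics
Economics
Arithmetic
Microeconomics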
Authors
Chang‐Hun Kim, Jaeseong Jeong, Jinhee Choi
Identifier
DOI:10.1021/acs.chemrestox.2c00189
Abstract
The development of toxicity classification models using the ToxCast database has been extensively studied. Machine learning approaches are effective in identifying the bioactivity of untested chemicals. However, ToxCast assays differ in the amount of data and degree of class imbalance (CI). Therefore, the resampling algorithm employed should vary depending on the data distribution to achieve optimal classification performance. In this study, the effects of CI and data scarcity (DS) on the performance of binary classification models were investigated using ToxCast bioassay data. An assay matrix based on CI and DS was prepared for 335 assays with biologically intended target information, and 28 CI assays and 3 DS assays were selected. Thirty models established by combining five molecular fingerprints (i.e., Morgan, MACCS, RDKit, Pattern, and Layered) and six algorithms [i.e., gradient boosting tree, random forest (RF), multi-layered perceptron, k-nearest neighbor, logistic regression, and naive Bayes] were trained using the selected assay data set. Of the 30 trained models, MACCS-RF showed the best performance and thus was selected for analyses of the effects of CI and DS. Results showed that recall and F1 were significantly lower when training with the CI assays than with the DS assays. In addition, hyperparameter tuning of the RF algorithm significantly improved F1 on CI assays. This study provided a basis for developing a toxicity classification model with improved performance by evaluating the effects of data set characteristics. This study also emphasized the importance of using appropriate evaluation metrics and tuning hyperparameters in model development.
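To illustrate two points from the abstract, the following is a minimal sketch rather than the authors' code. It enumerates the 5 × 6 fingerprint and algorithm combinations, and it shows why recall and F1 are more informative than accuracy on a class-imbalanced assay. The algorithm abbreviations other than RF, the function names, and the 95:5 example labels are assumptions made up for illustration; they are not data from the study.

```python
from itertools import product

# Fingerprint names as listed in the abstract.
FINGERPRINTS = ["Morgan", "MACCS", "RDKit", "Pattern", "Layered"]
# Abbreviations are assumptions, except RF, which the abstract uses:
# gradient boosting tree, random forest, multi-layered perceptron,
# k-nearest neighbor, logistic regression, naive Bayes.
ALGORITHMS = ["GBT", "RF", "MLP", "kNN", "LR", "NB"]


def model_grid():
    """Return the names of all fingerprint-algorithm combinations."""
    return [f"{fp}-{alg}" for fp, alg in product(FINGERPRINTS, ALGORITHMS)]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced assay: 5 active and 95 inactive chemicals.
y_true = [1] * 5 + [0] * 95
# Hypothetical classifier that recovers only 1 of the 5 actives.
y_pred = [1] * 1 + [0] * 4 + [0] * 95

grid = model_grid()
metrics = binary_metrics(y_true, y_pred)
print(len(grid), "models, e.g.", grid[:3])
print(metrics)
```

In this made-up case the accuracy is 0.96 while recall is 0.2 and F1 is about 0.33, which is the kind of gap that makes the choice of evaluation metric matter for imbalanced assays.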