Computer science
Spurious relationship
Feature selection
Machine learning
Data preprocessing
Preprocessor
Data science
Randomness
Deciphering
Process (computing)
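Data mining
Artificial intelligence
Environmental data
Management science
Engineering
Statistics
Mathematics
Operating system
Biology
Genetics
Law
Political science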
Authors
Jun‐Jie Zhu, Meiqi Yang, Zhiyong Jason Ren
Identifier
DOI:10.1021/acs.est.3c00026
Abstract
Machine learning (ML) is increasingly used in environmental research to process large data sets and decipher complex relationships between system variables. However, due to the lack of familiarity and methodological rigor, inadequate ML studies may lead to spurious conclusions. In this study, we synthesized literature analysis with our own experience and provided a tutorial-like compilation of common pitfalls along with best practice guidelines for environmental ML research. We identified more than 30 key items and provided evidence-based data analysis based on 148 highly cited research articles to exhibit the misconceptions of terminologies, proper sample size and feature size, data enrichment and feature selection, randomness assessment, data leakage management, data splitting, method selection and comparison, model optimization and evaluation, and model explainability and causality. By analyzing good examples on supervised learning and reference modeling paradigms, we hope to help researchers adopt more rigorous data preprocessing and model development standards for more accurate, robust, and practicable model uses in environmental research and applications.