Artificial intelligence
Boosting (machine learning)
Feature selection
Random forest
Outlier
Ensemble learning
Computer science
Gradient boosting
Preprocessor
Machine learning
Correlation
Pattern recognition (psychology)
Statistical classification
Feature (linguistics)
Data preprocessing
Mathematics
Philosophy
Linguistics
Geometry
Authors
Mert Demirarslan, Aslı Süner
Identifier
DOI:10.1080/03610918.2022.2046087
Abstract
A wide range of issues, including missing values, class noise, class imbalance, outliers, correlation, and irrelevant variables, can negatively affect the overall performance of disease diagnosis classification algorithms. This study proposes a new technique, an alternative to the t-score method, to increase the performance of ensemble learning classification algorithms by removing irrelevant variables. To this end, three publicly available datasets from the medical domain, varying in sample size, number of variables, and data preprocessing problems, were selected and processed with our newly proposed feature selection method, called Outliers and Correlation t-Score (OCtS). Afterwards, six widely used ensemble learning algorithms, Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting Machine, Light Gradient Boosting Machine, CatBoost, and Bagging, were employed for disease diagnosis classification, and performance metrics were measured. Our results indicate that the classification performance of all six ensemble learning algorithms increased significantly when the OCtS method was employed, and that OCtS exhibited higher performance than the standard t-score method across all datasets (p = 0.0001). We conclude that using data preprocessing methods together with OCtS yields better algorithm performance when employing ensemble learning algorithms for disease diagnosis classification.
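The abstract describes a two-stage pipeline: filter-style feature selection based on a t-score, followed by ensemble classifiers. The OCtS method itself (which additionally accounts for outliers and correlation) is not specified in the abstract, so the sketch below only illustrates the conventional t-score baseline it is compared against, using scikit-learn and a public medical dataset as stand-ins; the choice of dataset, the number of retained features, and the subset of ensemble learners are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: standard two-sample t-score feature selection, then ensemble
# classifiers. This is the baseline approach named in the abstract, NOT the
# proposed OCtS method, whose exact computation is not given here.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              BaggingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def t_score(X, y):
    """Absolute two-sample t statistic per feature for a binary target."""
    X0, X1 = X[y == 0], X[y == 1]
    num = np.abs(X1.mean(axis=0) - X0.mean(axis=0))
    den = np.sqrt(X1.var(axis=0, ddof=1) / len(X1) +
                  X0.var(axis=0, ddof=1) / len(X0))
    return num / den


# Public medical dataset used as a stand-in for the datasets in the study.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Keep the top-k features by t-score (k = 10 is an illustrative choice).
k = 10
top = np.argsort(t_score(X_tr, y_tr))[::-1][:k]
X_tr_sel, X_te_sel = X_tr[:, top], X_te[:, top]

# Three of the six ensemble learners named in the abstract; XGBoost, LightGBM
# and CatBoost are omitted to keep the sketch dependency-free.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr_sel, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te_sel)))
```

In the study, OCtS would replace the plain `t_score` ranking step above with a score that also penalizes outlier-driven and highly correlated variables before the ensemble models are trained.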