Random forest
Subspace topology
Feature selection
Computer science
Selection (genetic algorithm)
Pattern recognition (psychology)
Artificial intelligence
Feature (linguistics)
Random subspace method
Data mining
Linguistics
Philosophy
Authors
Yisen Wang, Shu-Tao Xia
Identifier
DOI:10.1109/ijcnn.2016.7727772
Abstract
Random forests are a class of ensemble methods for classification and regression whose randomization comes from bagging training instances and sampling feature subspaces. On high-dimensional data, the performance of random forests degrades because the feature subspace randomly sampled at each node during tree construction often contains few informative features. To address this issue, in this paper we propose a new Principal Component Analysis and Stratified Sampling based method, called PCA-SS, for feature subspace selection in random forests on high-dimensional data. For each decision tree in the forest, we first create the training data by bagging instances and partition the feature set into several feature subsets. Principal Component Analysis (PCA) is applied to each feature subset to obtain transformed features, and all principal components are retained in order to preserve the variability of the data. Second, according to a principal-component criterion, the transformed features are partitioned into an informative part and a less informative part. When constructing each node of a decision tree, a feature subspace is selected from the two parts by stratified sampling. The resulting PCA-SS based Random Forests algorithm, named PSRF, ensures enough informative features at each tree node while also increasing the diversity among the trees. Experimental results demonstrate that the proposed PSRF significantly improves the performance of random forests on high-dimensional data compared with state-of-the-art random forests algorithms.
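To make the procedure concrete, below is a minimal Python sketch of the two steps the abstract describes: per-subset PCA with all components retained, followed by stratified sampling of a node-level feature subspace. The function names (pca_ss_transform, stratified_subspace) and parameters (n_subsets, informative_ratio) are illustrative assumptions, not taken from the paper; in particular, the paper only describes the informative/less-informative split as following "a principal-component criterion", so a cumulative explained-variance threshold is used here as a stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_ss_transform(X, n_subsets=5, informative_ratio=0.85, rng=None):
    """Sketch of the PCA-SS feature transformation: partition the feature
    set into subsets, run PCA on each subset keeping all components, then
    split the pooled components into an 'informative' and a 'less
    informative' group by cumulative explained variance (an assumed
    criterion, not necessarily the paper's)."""
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]
    subsets = np.array_split(rng.permutation(n_features), n_subsets)

    pcas, blocks, variances = [], [], []
    for idx in subsets:
        pca = PCA(n_components=None).fit(X[:, idx])  # retain all components
        pcas.append((idx, pca))                      # keep for transforming new data
        blocks.append(pca.transform(X[:, idx]))
        variances.append(pca.explained_variance_ratio_)

    Z = np.hstack(blocks)                  # transformed feature matrix
    var = np.concatenate(variances)
    order = np.argsort(var)[::-1]          # components sorted by explained variance
    cum = np.cumsum(var[order]) / var.sum()
    n_info = int(np.searchsorted(cum, informative_ratio) + 1)
    informative = order[:n_info]           # column indices into Z
    less_informative = order[n_info:]
    return Z, informative, less_informative, pcas


def stratified_subspace(informative, less_informative, size, rng=None):
    """Sample a node-level feature subspace, drawing proportionally
    (stratified) from the informative and less informative groups."""
    rng = np.random.default_rng(rng)
    total = len(informative) + len(less_informative)
    k_info = min(len(informative), max(1, round(size * len(informative) / total)))
    k_rest = min(len(less_informative), size - k_info)
    picks = [rng.choice(informative, size=k_info, replace=False)]
    if k_rest > 0:
        picks.append(rng.choice(less_informative, size=k_rest, replace=False))
    return np.concatenate(picks).astype(int)
```

Note that off-the-shelf decision-tree implementations do not expose per-node feature sampling, so a faithful PSRF implementation would need a custom tree that calls something like stratified_subspace at every split; the sketch above covers only the feature transformation and the subspace-sampling step.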