Journal: IEEE Transactions on Artificial Intelligence [Institute of Electrical and Electronics Engineers] Date: 2024-01-01 Volume/Issue: : 1-13
Identifier
DOI:10.1109/tai.2024.3407034
Abstract
In tabular data, several challenges can degrade the quality of machine learning models: high dimensionality; noisy, irrelevant, or repetitive features; interactions between features; and the fact that instances often come from different sources or distributions. Feature selection, instance selection, and clustering algorithms address some of these challenges. Here, we propose a new holistic framework that helps clarify the structure of tabular datasets and enables the production of higher-quality machine learning models. The framework, based on intrinsic-reward deep reinforcement learning loops, uses curious feature selection as the basis for clustering data instances, effectively creating blocks within the tabular data that pair each cluster with its most relevant features. The result is a clustering algorithm in which instances are clustered based on their predicted optimal informative features. We show that this framework can improve the accuracy of learning models on artificial and real datasets and provide important insights into the data themselves.