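Overfitting
Computer science
Feature selection
Artificial intelligence
Benchmark (surveying)
Machine learning
Downstream (manufacturing)
Selection (genetic algorithm)
Feature (linguistics)
Artificial neural network
Deep learning
Lasso (programming language)
Data mining
Pattern recognition (psychology)
Engineering
Linguistics
Operations management
Philosophy
Geodesy
World Wide Web
Geography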
Authors
Valeriia Cherepanova, Roy Levin, Gowthami Somepalli, Jonas Geiping, C. Bayan Bruss, Andrew Gordon Wilson, Tom Goldstein, Micah Goldblum
Source
Journal: Cornell University - arXiv
Date: 2023-11-10
Identifier
DOI: 10.48550/arxiv.2311.05877
Abstract
Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, use toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. Motivated by the increasing popularity of tabular deep learning, we construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.
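The abstract does not spell out how the input-gradient-based analogue of Lasso is computed. As a rough illustration only, the sketch below assumes one plausible reading: a group-Lasso-style penalty that sums, over features, the L2 norm (taken across samples) of the loss gradient with respect to that input feature. The function name `input_gradient_penalty`, the linear model, and the squared loss are all hypothetical choices made to keep the example self-contained; the paper itself targets neural networks.

```python
import math


def input_gradient_penalty(X, y, w, b):
    """Hypothetical helper: group-style penalty on input gradients.

    Assumes a linear model f(x) = w . x + b with squared loss, so the
    gradient of sample i's loss w.r.t. input feature j is
    2 * (f(x_i) - y_i) * w[j]. This is an illustrative stand-in for a
    neural network, not the paper's implementation.
    """
    n_features = len(w)
    sq_sums = [0.0] * n_features
    for x_i, y_i in zip(X, y):
        pred = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
        residual = pred - y_i
        for j in range(n_features):
            grad = 2.0 * residual * w[j]  # d(loss_i) / d(x_ij)
            sq_sums[j] += grad * grad
    # One norm per feature, taken across all samples.
    norms = [math.sqrt(s) for s in sq_sums]
    # Summing the per-feature norms gives the sparsity-inducing penalty.
    return sum(norms), norms


# Toy data: the model only uses feature 0, so feature 1 gets a zero norm.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [0.0, 0.0]
penalty, norms = input_gradient_penalty(X, y, w=[3.0, 0.0], b=0.0)
print(penalty, norms)
```

Under this assumption, the per-feature norms double as importance scores: features whose norm is driven to zero by the penalty would be the ones dropped during selection.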