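Categorical variable
Computer science
Data mining
Preprocessor
Leverage (statistics)
Naive Bayes classifier
Artificial intelligence
Cardinality (data modeling)
Machine learning
Data preprocessing
Artificial neural network
Scheme (mathematics)
Pattern recognition (psychology)
Support vector machine
Mathematics
Mathematical analysis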
Source
Journal: SIGKDD Explorations
[Association for Computing Machinery]
Date: 2001-07-01
Volume (issue): 3 (1): 27-32
Citations: 207
Identifier
DOI: 10.1145/507533.507538
Abstract
Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.
This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks, linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, like ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.
While the statistical methods discussed in this paper were first introduced in the mid 1950's, the use of these methods as a preprocessing step for complex models, like neural networks, has not been previously discussed in any literature.
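The kind of preprocessing the abstract describes can be illustrated with a minimal Python sketch. The abstract does not give the paper's formula, so the sketch assumes a common empirical-Bayes-style shrinkage: each category is replaced by a blend of its own target mean and the global target mean, with weight n / (n + m) on the category mean. The function names `fit_target_encoding` and `transform` and the smoothing parameter `m` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, m=10.0):
    """Learn a numeric code per category by shrinking its target mean
    toward the global mean.

    Assumption (not taken from the abstract): the blending weight is
    n / (n + m), where n is the category count and m >= 0 controls how
    strongly rare categories are pulled toward the global mean.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have equal length")
    if len(targets) == 0:
        raise ValueError("at least one training example is required")

    prior = sum(targets) / len(targets)  # global target mean

    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    encoding = {}
    for cat, n in counts.items():
        weight = n / (n + m)  # approaches 1 as the category gets frequent
        encoding[cat] = weight * (sums[cat] / n) + (1.0 - weight) * prior
    return encoding, prior


def transform(categories, encoding, prior):
    """Replace each category with its learned code; unseen values get the prior."""
    return [encoding.get(cat, prior) for cat in categories]


# Toy data: "a" is frequent, "b" appears once.
cats = ["a", "a", "a", "a", "b"]
ys = [1, 1, 1, 1, 0]
encoding, prior = fit_target_encoding(cats, ys, m=1.0)
encoded = transform(["a", "b", "unseen"], encoding, prior)
```

In this sketch the rare category "b" is pulled further toward the global mean than the frequent category "a", and a value never seen in training falls back to the global mean. For a hierarchical attribute such as a ZIP code, one plausible extension is to use the estimate from the parent level (for example, a shorter ZIP prefix) in place of the global mean, but that is an assumption about how the blending across levels might be arranged, not a description of the paper's procedure.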