Categorical variable
Computer science
Artificial intelligence
Encoding (memory)
Word embedding
Word (group theory)
Lexical analysis
Deep learning
Feature vector
Embedding
Feature (linguistics)
Pattern recognition (psychology)
Machine learning
Natural language processing
Mathematics
Linguistics
Philosophy
Geometry
Author
Mwamba Kasongo Dahouda, Inwhee Joe
Source
Journal: IEEE Access
[Institute of Electrical and Electronics Engineers]
Date: 2021-01-01
Volume: 9, Pages: 114381-114391
Citations: 57
Identifier
DOI: 10.1109/access.2021.3104357
Abstract
Many machine learning algorithms and almost all deep learning architectures are incapable of processing plain text in its raw form. This means that the input to these algorithms must be numerical in order to solve classification or regression problems. Hence, it is necessary to encode categorical variables into numerical values using encoding techniques. Categorical features are common and often of high cardinality. One-hot encoding in such circumstances leads to very high-dimensional vector representations, raising memory and computability concerns for machine learning models. This paper proposes a deep-learned embedding technique for encoding categorical features in categorical datasets. Our technique is a distributed representation for categorical features in which each category is mapped to a distinct vector, and the properties of the vector are learned while training a neural network. First, we create a data vocabulary that includes only categorical data, and then we use word tokenization to make each categorical value a single word. After that, feature learning is introduced to map all of the categorical data from the vocabulary to word vectors. Three different datasets provided by the University of California Irvine (UCI) are used for training. The experimental results show that the proposed deep-learned embedding technique for categorical data achieves an F1 score of 89%, compared with 71% for one-hot encoding, in the case of the long short-term memory (LSTM) model. Moreover, the deep-learned embedding technique uses less memory and generates fewer features than one-hot encoding.
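The pipeline the abstract describes (build a vocabulary of categorical values, tokenize each value into a single integer "word", then learn an embedding vector per category while training the network) can be illustrated with a short sketch. The following is a minimal illustration assuming TensorFlow/Keras, not the authors' code: the column, vocabulary, embedding dimension, and labels are invented for the example.

# Minimal sketch of the general approach (illustrative, not the paper's code):
# map each categorical value to an integer index ("tokenization"), then
# learn a dense embedding for each index while training the network.
import numpy as np
import tensorflow as tf

# Hypothetical high-cardinality categorical column.
cities = np.array(["paris", "tokyo", "lima", "tokyo", "oslo", "paris"])

# Build a data vocabulary and tokenize: each category becomes one integer "word".
vocab = {c: i for i, c in enumerate(sorted(set(cities)))}
tokens = np.array([vocab[c] for c in cities])          # shape: (6,)

# One-hot encoding would need len(vocab) features per value;
# a learned embedding needs only `output_dim` features (here 4).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=4),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy binary labels, only to show that the embedding weights are
# learned jointly with the rest of the network during training.
y = np.array([0, 1, 0, 1, 1, 0])
model.fit(tokens, y, epochs=2, verbose=0)

# Each category is now a distinct learned 4-dimensional vector.
print(model.layers[0].get_weights()[0].shape)          # (len(vocab), 4)

The memory argument follows directly: one-hot encoding grows the feature count with the vocabulary size, whereas the embedding width is a fixed hyperparameter, so for a feature with thousands of categories the learned representation is far more compact.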