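Computer science
Categorical variable
Weighting
Representation (politics)
Cardinality (data modeling)
Artificial intelligence
Encoding (memory)
Generalization
Machine learning
External Data Representation
Space (punctuation)
Pattern recognition (psychology)
Data mining
Mathematics
Medicine
Radiology
Mathematical analysis
Politics
Political science
Law
Operating system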
Authors
Morteza Mohammady Gharasuie,Fengjiao Wang,Omar Sharif,Ravi Mukkamala
Identifier
DOI:10.1007/978-981-97-2259-4_24
Abstract
Self-supervised and semi-supervised learning (SSL) on tabular data is an understudied topic. Despite some attempts, two major challenges remain: (1) tabular datasets are often imbalanced; (2) the one-hot encoding used in existing methods becomes less efficient for high-cardinality categorical features. To address these challenges, we propose SAWTab, which uses a target encoding method, Conditional Probability Representation (CPR), to represent categorical features efficiently in the input space. We improve this representation by incorporating the unlabeled samples through pseudo-labels. Furthermore, we propose a Smooth Adaptive Weighting mechanism in the target encoding to mitigate the issue of noisy and biased pseudo-labels. Experimental results on various datasets and comparisons with existing frameworks show that SAWTab yields the best test accuracy on all datasets. We find that pseudo-labels can help improve the input space representation in the SSL setting, which enhances the generalization of the learning algorithm.
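The abstract's core idea, replacing each category with its class-conditional probabilities and letting pseudo-labeled rows contribute with reduced weight, can be illustrated with a generic sketch. This is not the SAWTab implementation: the function name `cpr_encode`, the smoothing strength `alpha`, and the use of pseudo-label confidence as a per-row weight are all assumptions made for illustration, and the paper's Smooth Adaptive Weighting mechanism may differ substantially.

```python
from collections import defaultdict


def cpr_encode(categories, labels, weights=None, n_classes=2, alpha=1.0):
    """Map each category to a smoothed estimate of P(class | category).

    Hypothetical helper, not taken from the paper.
    categories: category value per row
    labels:     integer class per row (true labels or pseudo-labels)
    weights:    per-row weight (assumed: 1.0 for labeled rows,
                model confidence for pseudo-labeled rows)
    alpha:      strength of shrinkage toward the global class prior
    """
    if weights is None:
        weights = [1.0] * len(categories)

    # Weighted global class prior, used to smooth rare categories.
    class_totals = [0.0] * n_classes
    for y, w in zip(labels, weights):
        class_totals[y] += w
    total = sum(class_totals)
    prior = [t / total for t in class_totals]

    # Weighted class counts per category.
    counts = defaultdict(lambda: [0.0] * n_classes)
    for cat, y, w in zip(categories, labels, weights):
        counts[cat][y] += w

    # Shrink each category's distribution toward the prior.
    encoding = {}
    for cat, per_class in counts.items():
        mass = sum(per_class)
        encoding[cat] = [
            (per_class[c] + alpha * prior[c]) / (mass + alpha)
            for c in range(n_classes)
        ]
    return encoding


# Labeled rows get weight 1.0; pseudo-labeled rows use an assumed confidence.
cats = ["a", "a", "b"] + ["b", "c"]
ys = [1, 1, 0] + [1, 0]
ws = [1.0, 1.0, 1.0] + [0.5, 0.9]
enc = cpr_encode(cats, ys, ws, n_classes=2, alpha=1.0)
```

Each category maps to a vector of length `n_classes` regardless of how many distinct categories exist, which is why this kind of encoding scales better than one-hot encoding for high-cardinality features. Down-weighting pseudo-labeled rows limits how much a noisy pseudo-label can shift the estimate.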