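Autoencoder
Computer science
Decoding methods
Encoding (memory)
Gaussian distribution
Data mining
Feature (linguistics)
Mixture model
Conditional probability distribution
Artificial intelligence
Pattern recognition (psychology)
Machine learning
Algorithm
Artificial neural network
Mathematics
Statistics
Linguistics
Philosophy
Physics
Quantum mechanics
Authors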
Yandan Tan,Hongbin Zhu,Jie Wu,Hongfeng Chai
Identifier
DOI:10.1016/j.eswa.2023.122071
Abstract
Data synthesis is of great significance for protecting the privacy of real credit data. Credit data synthesis poses unique challenges: mixed discrete and continuous features, a lack of prior information, high feature complexity, and imbalance. To address these challenges, we propose a data-driven prior-based tabular variational autoencoder (DPTVAE) that synthesizes credit data end to end, without any expert experience. It contains three main innovations: 1) Binning Gaussian probability density (BGPD)-based feature type classification: Previous work relies on classification by expert experience, which is limited and may be missing. We propose a BGPD-based calculation of class value importance to automatically classify columns as discrete or continuous, so that the synthesis requirements for values or distributions can be met appropriately. 2) Encoding based on BGPD-Variational Gaussian Mixture (BGPD-VGM): Continuous columns of financial data usually involve skewed, multi-peaked, or mixture distributions. To adapt to this distributional complexity, we propose BGPD-VGM to encode a data-driven prior. 3) Conditional decoding: We also design a conditional decoding strategy for DPTVAE to synthesize imbalanced discrete columns. Compared with seven existing advanced models, DPTVAE demonstrates exceptional synthesis performance on two datasets with a 33-fold difference in data size, particularly in identifying real default users based on synthetic data. This achievement is significant for data applications based on privacy protection.
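The abstract does not spell out how BGPD-VGM encodes a continuous column, so the following is only a minimal sketch of the general idea of mixture-based encoding, not the authors' method. It assumes a Gaussian mixture has already been fitted to the column (the weights, means, and standard deviations below are hypothetical), picks the most likely component for each value, and represents the value as a one-hot component indicator plus a scalar normalized within that component. The division by four standard deviations is a convention borrowed from earlier tabular generators, and the deterministic arg-max choice is a simplification; all function and variable names are illustrative.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def encode_value(x, weights, means, stds):
    """Encode x as (one-hot component indicator, within-component scalar).

    Assumption: the mixture parameters were fitted beforehand; this sketch
    does not perform the variational fitting itself.
    """
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    k = max(range(len(scores)), key=scores.__getitem__)  # most likely component
    one_hot = [1 if i == k else 0 for i in range(len(scores))]
    alpha = (x - means[k]) / (4.0 * stds[k])  # normalized offset from the mode
    return one_hot, alpha


def decode_value(one_hot, alpha, means, stds):
    """Invert encode_value using the same mixture parameters."""
    k = one_hot.index(1)
    return means[k] + alpha * 4.0 * stds[k]


# Hypothetical two-mode mixture for an illustrative "account balance" column.
WEIGHTS = [0.7, 0.3]
MEANS = [1000.0, 20000.0]
STDS = [300.0, 5000.0]

one_hot, alpha = encode_value(1300.0, WEIGHTS, MEANS, STDS)
restored = decode_value(one_hot, alpha, MEANS, STDS)
```

Encoding each value relative to its own mode keeps the scalar in a small, comparable range even when the column is skewed or multi-peaked, which is the kind of distribution the abstract says financial columns tend to have.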