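Computer science
Benchmark (surveying)
Data mining
Pairwise comparison
Synthetic data
Deep learning
k-nearest neighbors algorithm
Fidelity
Small data
Artificial intelligence
Generative model
Machine learning
Generative grammar
Telecommunications
Geodesy
Geography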
Authors
Jayanth Sivakumar, R. Karthik, Menaka Radhakrishnan, Daehan Won
Identifier
DOI:10.1016/j.knosys.2023.110956
Abstract
Synthetic data generation for tabular data, unlike for computer vision, is an emerging challenge. When tabular data needs to be synthesized, it either faces a small-dataset problem or risks violating privacy if the data contains sensitive information. When the data is small, any data-driven modeling leads to biased decision making, and deep learning models trained on small datasets are limited. Tabular data also poses further challenges, such as mixed data types, fidelity, and mode collapse. To address the small-dataset problem and extend deep learning capabilities to small data, this research proposes a new generative method, GenerativeMTD. The method generates fake data by using pseudo-real data as input during training. The pseudo-real data lets the deep learning model train on a large number of samples when the real dataset is small. It is generated from the real data through k-nearest neighbor mega-trend diffusion and is then translated into synthetic data that is realistic and similar to the real data. The method outperforms some of the state-of-the-art methods for tabular data generation. On the benchmark datasets, it generates high-quality synthetic data as measured by pairwise correlation difference. In addition, it surpasses the benchmark models on the distance-based privacy metrics: distance to the closest record and nearest neighbor distance ratio.
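The abstract names three evaluation metrics without defining them. The following is a minimal Python sketch of how they are commonly computed for numeric tabular data; it is not the authors' implementation. It assumes that pairwise correlation difference is the Frobenius norm of the difference between the Pearson correlation matrices, and that both distance metrics use Euclidean distance from each synthetic record to the real records. All function names and the toy data are hypothetical.

```python
import numpy as np


def pairwise_correlation_difference(real, synth):
    """Frobenius norm of the gap between the two Pearson correlation matrices.

    Assumed definition; lower values mean the synthetic data preserves
    column-to-column correlations more closely.
    """
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)
    return float(np.linalg.norm(corr_real - corr_synth, ord="fro"))


def _euclidean_distances(synth, real):
    """Distance matrix of shape (n_synth, n_real)."""
    diff = synth[:, None, :] - real[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def distance_to_closest_record(real, synth):
    """For each synthetic row, the distance to its nearest real row.

    Values near zero suggest a synthetic row almost copies a real one.
    """
    return _euclidean_distances(synth, real).min(axis=1)


def nearest_neighbor_distance_ratio(real, synth, eps=1e-12):
    """Ratio of nearest to second-nearest real-row distance, per synthetic row.

    eps is a hypothetical guard against division by zero.
    """
    dists = np.sort(_euclidean_distances(synth, real), axis=1)
    return dists[:, 0] / (dists[:, 1] + eps)


# Toy data (illustrative only): 50 "real" rows and a noisy copy as "synthetic".
rng = np.random.default_rng(0)
real = rng.normal(size=(50, 3))
synth = real + rng.normal(scale=0.1, size=real.shape)

pcd = pairwise_correlation_difference(real, synth)
dcr = distance_to_closest_record(real, synth)
nndr = nearest_neighbor_distance_ratio(real, synth)
```

Under these assumed definitions, a lower pairwise correlation difference indicates better fidelity, while larger distance to the closest record and a ratio closer to 1 indicate that synthetic rows sit farther from any single real record. Mixed data types would need encoding and scaling before distances are meaningful.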