Synthetic Data Improve Survival Status Prediction Models in Early-Onset Colorectal Cancer

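Keywords: Synthetic data; Data set; Metric (unit); Statistics; Computer science; Population; Artificial intelligence; Random forest; Mathematics; Medicine; Operations management; Environmental health; Economics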
Authors
Hyunwook Kim,Won Seok Jang,Woo Seob Sim,Han Sang Kim,Jeong Eun Choi,Eun Sil Baek,Yu Rang Park,Sang Joon Shin
Source
Journal: JCO Clinical Cancer Informatics [American Society of Clinical Oncology]
Volume/Issue: (8)
Identifier
DOI:10.1200/cci.23.00201
Abstract

PURPOSE In artificial intelligence–based modeling, working with a limited number of patient groups is challenging. This retrospective study aimed to evaluate whether applying synthetic data generation methods to the clinical data of small patient groups can enhance the performance of prediction models.

MATERIALS AND METHODS A data set collected by the Cancer Registry Library Project from the Yonsei Cancer Center (YCC), Severance Hospital, between January 2008 and October 2020 was reviewed. Patients with colorectal cancer younger than 50 years who started their initial treatment at YCC were included. A Bayesian network–based synthesizing model was used to generate a synthetic data set, combined with the differential privacy (DP) method.

RESULTS A synthetic population of 5,005 was generated from a data set of 1,253 patients with 93 clinical features. The Hellinger distance and correlation difference metric were below 0.3 and 0.5, respectively, indicating no statistical difference. The overall survival by disease stage did not differ between the synthetic and original populations. Training with the synthetic data and validating with the original data showed the highest performances of 0.850, 0.836, and 0.790 for the Decision Tree, Random Forest, and XGBoost models, respectively. Comparison of synthetic data sets with different epsilon parameters from the original data sets showed improved performance >0.1%. For extremely small data sets, models using synthetic data outperformed those using only original data sets. The reidentification risk measures demonstrated that the epsilons between 0.1 and 100 fell below the baseline, indicating a preserved privacy state.

CONCLUSION The synthetic data generation approach enhances predictive modeling performance by maintaining statistical and clinical integrity, and simultaneously reduces privacy risks through the application of DP techniques.
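The abstract reports a Hellinger distance below 0.3 between the original and synthetic data but does not say how it was computed. As a minimal sketch, assuming the metric is applied to one categorical feature at a time using empirical category frequencies, it could look like the following. The function name `hellinger_distance` and the toy stage values are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def hellinger_distance(original, synthetic):
    """Hellinger distance between the empirical distributions of two
    categorical samples. Returns a value in [0, 1]: 0 means identical
    category frequencies, 1 means no category is shared."""
    p_counts, q_counts = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    total = 0.0
    for category in set(p_counts) | set(q_counts):
        p = p_counts[category] / n_p  # relative frequency in original data
        q = q_counts[category] / n_q  # relative frequency in synthetic data
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2.0)


# Hypothetical toy feature (disease stage); not data from the study.
original_stage = ["I", "II", "II", "III", "IV"]
synthetic_stage = ["I", "II", "III", "III", "IV", "II"]

distance = hellinger_distance(original_stage, synthetic_stage)
print(f"Hellinger distance: {distance:.3f}")
```

A value near 0 suggests that the synthetic feature closely tracks the original distribution. The 0.3 cutoff is the threshold quoted in the abstract, not something this sketch derives.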