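Inference
Imputation (statistics)
Computer science
Biobank
Machine learning
Missing data
Artificial intelligence
Outcome (game theory)
Data mining
Bioinformatics
Mathematics
Biology
Mathematical economics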
Authors
Jue Hou, Zijian Guo, Tianxi Cai
Source
Journal: Cornell University - arXiv
Date: 2021-01-01
Citations: 1
Identifier
DOI: 10.48550/arxiv.2105.01264
Abstract
Risk modeling with electronic health record (EHR) data is challenging due to the lack of direct observations of the disease outcome and the high dimensionality of the candidate predictors. In this paper, we develop a surrogate-assisted semi-supervised learning (SAS) approach to risk modeling with high-dimensional predictors, leveraging a large unlabeled dataset of candidate predictors and outcome surrogates, as well as a small labeled dataset with annotated outcomes. The SAS procedure borrows information from surrogates along with candidate predictors to impute the unobserved outcomes via a sparse working imputation model, with moment conditions to achieve robustness against mis-specification of the imputation model and a one-step bias correction to enable interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high-dimensional working model, even when the underlying risk prediction model is dense and the risk model is mis-specified. We present an extensive simulation study to demonstrate the superiority of our SSL approach over existing supervised methods. We apply the method to derive genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.