Fast and scalable ensemble learning method for versatile polygenic risk prediction

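Ensemble learning, Computer science, Biobank, Genome-wide association study, Linkage disequilibrium, Scalability, Machine learning, Artificial intelligence, Robustness (evolution), Data mining, Genetic association, Summary statistics, Regression, Statistics, Bioinformatics, Mathematical biology, Genetics, Gene, Genotype, Single-nucleotide polymorphism, Database, Haplotype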

Authors

Tony Chen, Haoyu Zhang, Rahul Mazumder, Xihong Lin

Source

Journal: Proceedings of the National Academy of Sciences of the United States of America [Proceedings of the National Academy of Sciences]
Date: 2024-08-07; Volume (issue): 121 (33); Citations: 3

Links

nih.gov, doi.org

Identifier

DOI: 10.1073/pnas.2403210121

Abstract

Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, including computational burden, limited predictive accuracy, and poor adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory usage compared with the current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
