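Missing data
Imputation (statistics)
Computer science
Outlier
Support vector machine
Data quality
Feature engineering
Feature (linguistics)
Machine learning
Data mining
Data science
Artificial intelligence
Engineering
Deep learning
Linguistics
Operations management
Philosophy
Metric (unit)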
Authors
Fan Zhang, Melissa Petersen, Leigh Johnson, James Hall, Sid E. O’Bryant
Abstract
Background: The Health and Aging Brain Study: Health Disparities (HABS‐HD) seeks to understand the biological, social, and environmental factors that affect brain aging among diverse communities. Like many other NIH‐funded data‐sharing projects, HABS‐HD holds important data assets for various uses, including social, environmental, and behavioral data, and has multiple data flow pathways. Machine learning (ML) develops algorithms and models that improve over time, but data quality and readiness must be determined before these models can operate efficiently. Developing a data readiness reporting methodology has therefore become an urgent task for HABS‐HD.

Method: In this study, we developed a conceptual framework of data readiness. First, we analyzed the missing data percentage and used ML‐Based Multiple Imputation (MLMI) to impute missing data. Then, we applied a support vector machine with Recursive Feature Elimination and Cross Validation (SVM‐RFE‐CV) for feature elimination and outlier removal. Lastly, we rated data readiness on three metrics: missing data percentage, performance before feature engineering, and performance after feature engineering. The three scores were averaged to rate the overall readiness of the data.

Result: A framework for calculating an overall average data readiness score was presented (1 stands for completely accessible, 0 for not accessible at all, and 0.5 for neutral). Our results show that the framework was straightforward and useful in assessing how ready the HABS‐HD data are for ML.

Conclusion: Systematic analysis of data readiness before building ML models is of utmost importance, and it has a significant impact on biomarker discovery and disease prediction applications for Alzheimer’s disease. The conceptual framework of data readiness works well for our Alzheimer’s disease models in HABS‐HD. It can also be applied to data readiness reporting for other diseases.
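
The averaging step described in the Method can be illustrated with a minimal sketch. The abstract does not say how the missing data percentage is turned into a score, so this sketch assumes a completeness score of 1 minus the missing fraction. The function names, the toy table, and the two performance values are hypothetical and are not taken from the study.

```python
from statistics import mean
from typing import Optional, Sequence


def missing_fraction(rows: Sequence[Sequence[Optional[float]]]) -> float:
    """Return the fraction of cells that are missing (None) across all rows."""
    total = sum(len(row) for row in rows)
    if total == 0:
        raise ValueError("no cells to score")
    missing = sum(1 for row in rows for value in row if value is None)
    return missing / total


def readiness_score(missing: float, perf_before: float, perf_after: float) -> float:
    """Average three scores in [0, 1].

    Assumption (not stated in the abstract): the missing-data metric is
    scored as completeness, i.e. 1 - missing fraction.
    """
    scores = [1.0 - missing, perf_before, perf_after]
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("each score must lie in [0, 1]")
    return mean(scores)


# Hypothetical toy table: 3 participants x 4 features, 2 missing cells.
toy = [
    [1.2, None, 0.4, 3.0],
    [0.9, 2.1, None, 2.7],
    [1.1, 1.8, 0.5, 2.9],
]

frac = missing_fraction(toy)
# Hypothetical model performance before and after feature engineering.
overall = readiness_score(frac, perf_before=0.70, perf_after=0.80)
print(f"missing fraction: {frac:.3f}, overall readiness: {overall:.3f}")
```

On the 0 to 1 scale given in the Result, a value near 1 would indicate data that are close to ready for ML, and a value near 0.5 would be neutral.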