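Computer science
Mean squared prediction error
Field (mathematics)
Server
Predictive modelling
Machine learning
Process (computing)
Performance prediction
Artificial intelligence
Data mining
Simulation
Operating system
Mathematics
Pure mathematics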
Identifier
DOI:10.1145/3488423.3519316
Abstract
As a major cause of hardware failures in datacenters, uncorrectable memory errors result in server crashes. In this paper, we address the problem of predicting uncorrectable errors (UEs) using historical correctable error (CE) information. We first establish a new UE prediction framework that infers the latent faulty status of memory from CE observations and correlates the inferred faulty status with UE occurrences for prediction. We then design several predictors based on different memory fault modes and examine their performance on 4 datasets of memory errors from contemporary servers in the datacenters of 3 top-tier technology companies. Whereas existing work studies UE prediction in a single environment only, this is the first comparative study of prediction performance across datasets from different environments. Through the cross-dataset study, we demonstrate that predictors that perform relatively well in some environments do not perform well in others. Prediction performance is strongly affected by the characteristics of each environment, and no single predictor is universally applicable. Finally, to capture the characteristics specific to each environment in UE prediction, we propose a properly designed learning process that induces a weighted ensemble of the predictors from the data, and we show that the learned ensemble predictor consistently outperforms the individual predictors within each dataset.
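The idea of inducing a weighted ensemble of predictors from data can be illustrated with a minimal sketch. This is not the learning process from the paper, which the abstract does not specify; it substitutes simple precision-weighted voting. The feature names (`row_ces`, `col_ces`), the thresholds, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical per-DIMM features: counts of correctable errors (CEs) attributed
# to an assumed fault mode. These names are illustrative, not from the paper.
Features = Dict[str, int]
Predictor = Callable[[Features], int]  # returns 1 if a UE is predicted, else 0


def threshold_predictor(key: str, threshold: int) -> Predictor:
    """Flag a DIMM when the CE count for one assumed fault mode reaches a threshold."""
    return lambda x: 1 if x.get(key, 0) >= threshold else 0


def precision(pred: Predictor, xs: Sequence[Features], ys: Sequence[int]) -> float:
    """Fraction of flagged samples that actually had a UE (0.0 if nothing is flagged)."""
    flagged = [y for x, y in zip(xs, ys) if pred(x) == 1]
    return sum(flagged) / len(flagged) if flagged else 0.0


def learn_weights(preds: Sequence[Predictor], xs: Sequence[Features],
                  ys: Sequence[int]) -> List[float]:
    """Assumed weighting rule: each predictor's training precision, normalized to sum to 1."""
    scores = [precision(p, xs, ys) for p in preds]
    total = sum(scores)
    if total == 0:
        # No predictor was ever right on the training data: fall back to uniform weights.
        return [1.0 / len(preds)] * len(preds)
    return [s / total for s in scores]


def ensemble_score(preds: Sequence[Predictor], weights: Sequence[float],
                   x: Features) -> float:
    """Weighted vote of the individual predictors, in the range [0, 1]."""
    return sum(w * p(x) for w, p in zip(weights, preds))


# Toy training data for one environment (fabricated purely for the demo).
train_x = [
    {"row_ces": 5, "col_ces": 0},
    {"row_ces": 6, "col_ces": 1},
    {"row_ces": 0, "col_ces": 4},
    {"row_ces": 1, "col_ces": 5},
    {"row_ces": 0, "col_ces": 0},
]
train_y = [1, 1, 0, 1, 0]  # 1 = a UE followed, 0 = no UE

predictors = [threshold_predictor("row_ces", 3), threshold_predictor("col_ces", 3)]
weights = learn_weights(predictors, train_x, train_y)
score = ensemble_score(predictors, weights, {"row_ces": 4, "col_ces": 0})
```

Because the weights are learned separately on each dataset, a predictor that is reliable in one environment receives a larger vote there and a smaller one elsewhere, which is the intuition behind fitting the ensemble per environment.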