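Reliability (semiconductor)
Computer science
Electromigration
Resource allocation
Power consumption
Reliability engineering
Time horizon
Resource (disambiguation)
Orchestration
Power (physics)
Resource management (computing)
Real-time computing
Distributed computing
Materials science
Mathematical optimization
Computer network
Mathematics
Engineering
Composite material
Visual arts
Art
Physics
Musical theatre
Quantum mechanics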
Authors
Mohammad-Hashem Haghbayan, Antonio Miele, Onur Mutlu, Juha Plosila
Identifier
DOI:10.1109/tc.2023.3272800
Abstract
Run-time resource management is fundamental for efficient execution of workloads on Chip Multiprocessors. Application- and system-level requirements (e.g., on performance versus power versus lifetime reliability) generally conflict with each other, and any decision on resource assignment, such as core allocation or frequency tuning, may benefit some of them while penalizing others. The effects of resource assignment decisions on performance and power consumption can be observed within a few instants of time, but their effects on lifetime reliability cannot. In fact, the latter changes very slowly, based on the accumulated effects of many decisions over a long time horizon. Moreover, aging mechanisms are diverse and have different causes; most of them, such as Electromigration (EM), depend on temperature levels, while Thermal Cycling (TC) is caused mainly by temperature variations (both amplitude and frequency). Mitigating only EM may negatively affect TC, and vice versa. We propose a resource orchestration strategy that balances performance and power consumption constraints in the short term and EM and TC aging in the long term. Experimental results show that the proposed approach improves the average Mean Time To Failure by at least 17% and 20% with respect to EM and TC, respectively, while providing the same performance level as the nominal counterpart and guaranteeing the power budget.
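The tension between EM and TC described in the abstract can be illustrated with two textbook lifetime models: Black's equation for EM (lifetime falls exponentially as absolute temperature rises) and the Coffin-Manson relation for TC (cycles to failure fall as a power of the temperature swing). The abstract does not say which models the authors use, so the following is only a minimal sketch under that assumption. The function names, the activation energy of 0.9 eV, the exponent of 2.35, and the two operating points are illustrative placeholders, not values from the paper.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def em_mttf_ratio(temp_k, ref_temp_k, activation_energy_ev=0.9):
    """Relative EM lifetime versus a reference temperature (Black's equation).

    Assumes the same current density at both operating points, so only the
    exp(Ea / (k * T)) term differs. The default Ea is an illustrative value.
    """
    exponent = activation_energy_ev / BOLTZMANN_EV
    return math.exp(exponent * (1.0 / temp_k - 1.0 / ref_temp_k))


def tc_mttf_ratio(delta_t, cycles_per_hour, ref_delta_t, ref_cycles_per_hour,
                  coffin_manson_exp=2.35):
    """Relative TC lifetime versus a reference cycling profile (Coffin-Manson).

    Cycles to failure scale as delta_t ** (-q); time to failure additionally
    scales inversely with how often cycles occur. The default q is illustrative.
    """
    cycles_to_failure_ratio = (ref_delta_t / delta_t) ** coffin_manson_exp
    return cycles_to_failure_ratio * (ref_cycles_per_hour / cycles_per_hour)


# Hypothetical policies for one core, both compared with a reference point of
# 350 K average temperature and 10 K swings once per hour.
# Policy A keeps the core busy: hotter on average, but small swings.
policy_a = (em_mttf_ratio(360.0, 350.0), tc_mttf_ratio(5.0, 1.0, 10.0, 1.0))
# Policy B idles the core often: cooler on average, but large, frequent swings.
policy_b = (em_mttf_ratio(340.0, 350.0), tc_mttf_ratio(25.0, 2.0, 10.0, 1.0))

print("policy A (EM ratio, TC ratio):", policy_a)
print("policy B (EM ratio, TC ratio):", policy_b)
```

In this toy comparison, policy A shortens EM lifetime while extending TC lifetime, and policy B does the opposite, which is the kind of conflict a joint EM and TC orchestration strategy has to balance.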