计算机科学
云计算
供应
分布式计算
服务质量
可扩展性
云朵
资源配置
试验台
计算机网络
数据库
操作系统
作者
Shreshth Tuli,Giuliano Casale,Nicholas R. Jennings
标识
DOI:10.1109/tc.2023.3310678
摘要
The rise of distributed cloud computing technologies has been pivotal for the large-scale adoption of Artificial Intelligence (AI) based applications for high fidelity and scalable service delivery. Systematic resource management is central in maintaining optimal Quality of Service (QoS) in cloud platforms and is divided into three fundamental types: resource provisioning, AI model deployment and workload placement. To exploit the synergy among these decision types, it becomes imperative to concurrently design (co-design) the provisioning, deployment and placement decisions for optimal QoS. As users and cloud service providers shift to non-stationary AI-based workloads, frequent decision making imposes severe time constraints on the resource management models. Existing AI-based solutions often optimize decision types independently and tend to ignore the dependencies across various system performance aspects such as energy consumption and CPU utilization, making them perform poorly in large-scale cloud systems. To address this, we propose a novel method, called SciNet, that leverages a co-simulated digital-twin of the infrastructure to capture inter-metric dependencies and accurately estimate QoS scores. To avoid expensive simulation overheads at test time, SciNet trains a neural network based imitation learner that aims to mimic an oracle, which takes optimal decisions based on co-simulated QoS estimates. Offline model training and online decision making based on the imitation learner, enables SciNet to take optimal decisions while being time-efficient. Experiments with real-life AI-based benchmark applications on a public cloud testbed show that SciNet gives up to 48% lower execution cost, 79% higher inference accuracy, 71% lower energy consumption and 56% lower response times compared to the current state-of-the-art methods.
科研通智能强力驱动
Strongly Powered by AbleSci AI