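Computer science
Scheduling (production processes)
Graphics processing unit
GPU cluster
Symmetric multiprocessor system
Graphics
General-purpose computing on graphics processing units
Distributed computing
CUDA
Algorithm
Parallel computing
Plotting
Theoretical computer science
Mathematical optimization
Operating system
Mathematics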
Authors
Sheng Wang, Shiping Chen, Yumei Shi
Identifier
DOI:10.1016/j.future.2023.10.022
Abstract
Efficient resource scheduling in heterogeneous graphics processing unit (GPU) clusters is critical for maximizing system performance and optimizing resource utilization. Prior research on resource scheduling algorithms has typically employed machine learning (ML) algorithms to estimate job durations or GPU utilization in the cluster based on training progress and task speed. However, these studies often overlooked the performance variations among different GPU types within these clusters, as well as the spatiotemporal correlations among jobs. To address these limitations, this paper introduces the graph predictive algorithm for efficient resource scheduling (GPARS), designed specifically for heterogeneous clusters. GPARS leverages spatiotemporal correlations among jobs and uses graph attention networks (GATs) to predict job durations. Building on the prediction results, we develop a dynamic objective function to allocate suitable GPU types to newly submitted jobs. Through a comprehensive analysis of Alibaba's heterogeneous GPU cluster, we examine the impact of GPU capacity and type on job completion time (JCT) and resource utilization. Our evaluation, using real traces from Alibaba and Philly, substantiates the effectiveness of GPARS: it achieves a 10.29% reduction in waiting time and an average improvement of 7.47% in resource utilization compared to the original scheduling method. These findings indicate that GPARS improves resource scheduling in heterogeneous GPU clusters.
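The abstract does not spell out the dynamic objective function, so the following Python snippet is only a minimal sketch of the general idea of choosing a GPU type from per-type duration predictions. It is not the GPARS objective. The `GpuPool` class, the `pick_gpu_type` function, the rule "estimated completion time = waiting time + predicted duration", and the GPU names and numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuPool:
    """Hypothetical description of one GPU type in the cluster."""
    name: str
    free_gpus: int      # GPUs of this type that are idle right now
    est_wait_s: float   # assumed queueing delay if the job cannot start now


def pick_gpu_type(pools, predicted_duration_s, gpus_needed):
    """Return (pool name, estimated completion time) with the lowest estimate.

    predicted_duration_s maps a pool name to a predicted run time in seconds.
    In a real scheduler these values would come from a duration predictor;
    here they are plain inputs.
    """
    best_name, best_jct = None, float("inf")
    for pool in pools:
        if pool.name not in predicted_duration_s:
            continue  # no prediction for this GPU type, so skip it
        # The job waits only if the pool cannot host it immediately.
        wait = 0.0 if pool.free_gpus >= gpus_needed else pool.est_wait_s
        jct = wait + predicted_duration_s[pool.name]
        if jct < best_jct:
            best_name, best_jct = pool.name, jct
    return best_name, best_jct


# Illustrative values only; they are not taken from the paper or its traces.
pools = [
    GpuPool("V100", free_gpus=0, est_wait_s=600.0),
    GpuPool("T4", free_gpus=4, est_wait_s=0.0),
]
predicted = {"V100": 1200.0, "T4": 1500.0}
choice = pick_gpu_type(pools, predicted, gpus_needed=2)
print(choice)
```

In this toy setting the faster GPU type is busy, so the slower but idle type gives the lower estimated completion time. That is the kind of trade-off a type-aware scheduler has to weigh, although the real objective presumably accounts for more than these two terms.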