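Computer science
Reinforcement learning
Cloud computing
Workload
Big data
Scheduling (production processes)
Software deployment
Distributed computing
Job shop scheduling
SPARK (programming language)
Computer cluster
Analytics
Artificial intelligence
Operating system
Data science
Metro train timetable
Mathematical optimization
Programming language
Mathematics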
Authors
Muhammed Tawfiqul Islam, Shanika Karunasekera, Rajkumar Buyya
Source
Journal: IEEE Transactions on Parallel and Distributed Systems
[Institute of Electrical and Electronics Engineers]
Date: 2022-07-01
Volume (Issue): 33 (7): 1695-1710
Citations: 55
Identifier
DOI:10.1109/tpds.2021.3124670
Abstract
Big data frameworks such as Spark and Hadoop are widely adopted to run analytics jobs in both research and industry. The cloud offers affordable compute resources that are easier to manage. Hence, many organizations are shifting towards a cloud deployment of their big data computing clusters. However, job scheduling is a complex problem in the presence of various Service Level Agreement (SLA) objectives such as monetary cost reduction and job performance improvement. Most of the existing research does not address multiple objectives together and fails to capture the inherent cluster and workload characteristics. In this article, we formulate the job scheduling problem of a cloud-deployed Spark cluster and propose a novel Reinforcement Learning (RL) model to accommodate the SLA objectives. We develop the RL cluster environment and implement two Deep Reinforcement Learning (DRL) based schedulers in the TF-Agents framework. The proposed DRL-based scheduling agents work at a fine-grained level to place the executors of jobs while leveraging the pricing model of cloud VM instances. In addition, the DRL-based agents can also learn the inherent characteristics of different types of jobs to find a proper placement that reduces both the total cluster VM usage cost and the average job duration. The results show that the proposed DRL-based algorithms can reduce the VM usage cost by up to 30%.