Journal: IEEE Transactions on Big Data [Institute of Electrical and Electronics Engineers]
Date: 2023-08-01
Volume (Issue): 9 (4), pp. 1086-1101
Cited by: 1
Identifier
DOI:10.1109/tbdata.2022.3233031
Abstract
Stream processing has gained extensive attention in the past few years. Apache Flink is a new-generation distributed stream processing engine that can process large volumes of data in real time with low latency. However, Flink's default scheduler adopts a random task scheduling strategy, which considers neither cost nor load balancing in the cloud environment. In this article, a cost-efficient task scheduling algorithm (CETSA) and a cost-efficient load balancing algorithm (LBA-CE) for Flink are proposed to reduce job execution cost while optimizing load balancing. First, a cost-efficient model and a load balancing model based on Flink are constructed. Then, the core mechanism of Flink task scheduling is improved based on the cost-efficient model, and the improved task scheduler is implemented. In addition, the concept of node adaptation is introduced into cost-efficient scheduling according to the load balancing model, so that the cluster load is kept as balanced as possible while the cost is reduced in a heterogeneous cluster. Extensive experiments have been performed with HiBench's Wordcount and Fixwindow workloads in the cloud environment. The experimental results indicate that, compared to the baseline scheduling algorithm, the proposed algorithms reduce the cost by about 37.9% and 20.2% on average, and reduce the load deviation of the cluster by about 23.1% and 24.6% on average, respectively. In summary, the proposed algorithms can significantly reduce the cost of executing jobs and improve the load balancing of the cluster in Flink.
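The abstract does not give the cost model, the load balancing model, or the definition of node adaptation, so the following is only a toy sketch of the general trade-off it describes, not CETSA or LBA-CE. It assumes a hypothetical per-slot price for each node, measures load deviation as the population standard deviation of slot utilization, and compares random placement with a simple greedy placement whose `balance_weight` parameter (also an assumption) trades cost against balance.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Node:
    """A worker with a fixed number of task slots (hypothetical model)."""
    name: str
    price_per_slot: float  # assumed cost of running one task on this node
    slots: int
    used: int = 0

    @property
    def load(self) -> float:
        # Fraction of this node's slots that are occupied.
        return self.used / self.slots


def make_cluster() -> list:
    # A small heterogeneous cluster with made-up prices.
    return [Node("a", 1.0, 4), Node("b", 2.0, 4), Node("c", 4.0, 4)]


def total_cost(nodes) -> float:
    return sum(n.used * n.price_per_slot for n in nodes)


def load_deviation(nodes) -> float:
    # Population standard deviation of per-node utilization.
    return statistics.pstdev([n.load for n in nodes])


def schedule_random(nodes, num_tasks: int, rng: random.Random) -> None:
    """Baseline: place each task on a uniformly random node with a free slot."""
    for _ in range(num_tasks):
        free = [n for n in nodes if n.used < n.slots]
        rng.choice(free).used += 1


def schedule_greedy(nodes, num_tasks: int, balance_weight: float = 0.5) -> None:
    """Place each task on the free node with the lowest weighted score.

    The score mixes normalized price with the utilization the node would
    have after placement. balance_weight=0 is pure cheapest-first;
    balance_weight=1 is pure least-loaded-first.
    """
    max_price = max(n.price_per_slot for n in nodes)
    for _ in range(num_tasks):
        free = [n for n in nodes if n.used < n.slots]
        best = min(
            free,
            key=lambda n: (1 - balance_weight) * n.price_per_slot / max_price
            + balance_weight * (n.used + 1) / n.slots,
        )
        best.used += 1
```

Calling `schedule_greedy(make_cluster(), 6, balance_weight=0.0)` fills the cheapest node first, while `balance_weight=1.0` spreads tasks evenly at a higher price; intermediate values sit between the two. The actual algorithms in the article may weigh these goals very differently.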