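SPARK (programming language)
Computer science
Skew
Partition (number theory)
Big data
Parallel computing
Scheduling (production processes)
Distributed computing
Data mining
Mathematical optimization
Telecommunications
Mathematics
Combinatorics
Programming language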
Authors
Aibo Song,Bowen Peng,Jingyi Qiu,Yingying Xue,Mingyang Du
Identifier
DOI:10.1109/icpads53394.2021.00075
Abstract
As a memory-based distributed big data computing framework, Spark is widely used in big data processing systems. During execution, however, imbalanced input data distribution and the limitations of Spark's existing data partitioners readily cause partition skew, which reduces Spark's execution efficiency. To address this problem, this paper proposes a balanced Spark data partitioner called BSDP (Balanced Spark Data Partitioner). Based on an in-depth analysis of the partitioning characteristics of Shuffle intermediate data, a balanced partitioning model for Spark Shuffle intermediate data is established. The model aims to minimize partition skew and to find a balanced partitioning strategy for Shuffle intermediate data. Based on this model, the paper designs and implements BSDP's balanced data partitioning algorithm. The algorithm transforms the balanced partitioning problem for Shuffle intermediate data into the classic List-Scheduling task scheduling problem and thereby achieves balanced partitioning of the Shuffle intermediate data. Experiments verify that BSDP effectively balances the partitioning of Shuffle intermediate data and improves the execution efficiency of Spark.
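The abstract does not give the details of the BSDP algorithm, so the following is only a minimal sketch of the general idea it names: treating key clusters of Shuffle intermediate data as tasks and reduce partitions as machines, then applying greedy list scheduling. The function names (list_schedule_partition, skew_ratio), the largest-first ordering, and the max-over-mean skew measure are assumptions made for illustration and are not taken from the paper.

```python
import heapq
from typing import Dict, Hashable, List, Tuple


def list_schedule_partition(
    key_sizes: Dict[Hashable, int], num_partitions: int
) -> Tuple[Dict[Hashable, int], List[int]]:
    """Greedy list scheduling: put each key cluster on the least-loaded partition.

    key_sizes maps a key to the size (e.g. record count) of its cluster.
    Returns (key -> partition id, load per partition).
    Hypothetical helper, not the BSDP implementation.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")

    # Min-heap of (current load, partition id); ties go to the lower id.
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)

    assignment: Dict[Hashable, int] = {}
    loads = [0] * num_partitions

    # Assumption: process the largest clusters first (LPT ordering).
    ordered = sorted(key_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for key, size in ordered:
        load, p = heapq.heappop(heap)      # least-loaded partition so far
        assignment[key] = p
        loads[p] = load + size
        heapq.heappush(heap, (loads[p], p))
    return assignment, loads


def skew_ratio(loads: List[int]) -> float:
    """Assumed skew measure: largest partition load divided by the mean load."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


# Example: six key clusters spread over three partitions.
example_sizes = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 4, "f": 3}
assignment, loads = list_schedule_partition(example_sizes, 3)
print(loads)  # [12, 11, 11]
print(round(skew_ratio(loads), 3))
```

In a real Spark job the cluster sizes would have to be estimated before the shuffle (for example by sampling), and the resulting key-to-partition map would be applied through a custom partitioner; how BSDP handles either step is not stated in the abstract.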