Zhaoyan Shen, Qingxiang Tang, Tianren Zhou, Yuhao Zhang, Zhiping Jia, Dongxiao Yu, Zhiyong Zhang, Bingzhe Li
Source
Journal: IEEE Transactions on Computers [Institute of Electrical and Electronics Engineers] · Date: 2024-01-01 · Volume/Issue: 73 (1): 30-43 · Cited by: 1
Identifier
DOI:10.1109/tc.2023.3315847
Abstract
With the growth of dataset and model sizes, distributed deep learning has been proposed to accelerate training and improve the accuracy of DNN models. The parameter server framework is a popular collaborative architecture for data-parallel training that works well in homogeneous environments by properly aggregating the computation/communication capabilities of different workers. In heterogeneous environments, however, the resources of different workers vary significantly, and stragglers may severely limit the overall training speed. In this paper, we propose an adaptive multi-stage distributed deep learning training framework, named ASHL, for heterogeneous environments. First, a profiling scheme captures the capability of each worker so that training and communication tasks can be reasonably planned on each worker, laying the foundation for formal training. Second, a hybrid-mode training scheme (i.e., coarse-grained and fine-grained training) balances model accuracy and training speed. The coarse-grained training stage (named AHL) adopts an asynchronous communication strategy with infrequent communication; its main goal is to make the model converge quickly to a certain level. The fine-grained training stage (named SHL) uses a semi-asynchronous communication strategy with a high communication frequency; its main goal is to improve the final convergence quality. Finally, a compression-based communication scheme further improves the communication efficiency of the training process. Our experimental results show that ASHL reduces overall training time by more than 35% when converging to the same degree and achieves better generalization than state-of-the-art schemes such as ADSP.
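The abstract describes the two-stage schedule only at a high level. The toy single-process simulation below is a minimal sketch, not the authors' implementation: it illustrates how a coarse-grained asynchronous stage with infrequent, compressed pushes might hand off to a semi-asynchronous stage with frequent pushes under a staleness bound. All identifiers (top_k_compress, STALENESS_BOUND, the local-step counts, and the linear model itself) are illustrative assumptions.

```python
import numpy as np

# Toy simulation of a two-stage, compression-aware parameter-server
# schedule. Everything here (the linear model, the top-k compressor,
# the staleness rule, all constants) is a hypothetical illustration,
# not the ASHL implementation.

rng = np.random.default_rng(0)
DIM = 20
X = rng.normal(size=(1000, DIM))
true_w = rng.normal(size=DIM)
y = X @ true_w + 0.01 * rng.normal(size=1000)

shards = np.array_split(np.arange(1000), 4)  # 4 simulated workers' data shards
w = np.zeros(DIM)                            # parameter-server weights
lr = 0.05
K = DIM // 4                                 # sparsification budget

def local_gradient(weights, idx):
    """Mean-squared-error gradient on one worker's shard."""
    xb, yb = X[idx], y[idx]
    return 2.0 * xb.T @ (xb @ weights - yb) / len(idx)

def top_k_compress(vec, k):
    """Keep only the k largest-magnitude entries (a common compressor)."""
    out = np.zeros_like(vec)
    keep = np.argsort(np.abs(vec))[-k:]
    out[keep] = vec[keep]
    return out

# Stage 1 ("AHL"-like, coarse-grained): fully asynchronous. Each worker
# pulls the weights, runs several local SGD steps, then pushes one
# compressed delta; the server applies it immediately, never waiting.
for _ in range(30):
    for shard in shards:
        local_w = w.copy()
        for _ in range(5):                   # infrequent communication
            local_w -= lr * local_gradient(local_w, shard)
        w += top_k_compress(local_w - w, K)  # push compressed accumulated delta

# Stage 2 ("SHL"-like, fine-grained): semi-asynchronous. Workers push a
# compressed gradient every step, but a worker whose weight snapshot is
# more than STALENESS_BOUND versions old must pull fresh weights first.
STALENESS_BOUND = 2
snapshots = [w.copy() for _ in shards]
pulled_at = [0] * len(shards)
server_version = 0
for step in range(400):
    i = step % len(shards)                   # round-robin stands in for async arrival
    if server_version - pulled_at[i] > STALENESS_BOUND:
        snapshots[i] = w.copy()              # forced refresh of a stale worker
        pulled_at[i] = server_version
    g = local_gradient(snapshots[i], shards[i])
    w -= lr * top_k_compress(g, K)
    server_version += 1

print("final training loss:", np.mean((X @ w - y) ** 2))
```

In a real heterogeneous deployment the round-robin loop would be replaced by workers pushing at their own pace, which is exactly where the staleness bound matters: fast workers keep pushing frequently while slow ones are forced to refresh rather than contribute arbitrarily stale gradients.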