Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

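Computer science; Asynchronous communication; Distributed computing; Intelligent transportation system; Cloud computing; Edge computing; Synchronization (communication); Computer network; Enhanced Data Rates for GSM Evolution; Artificial intelligence; Channel (broadcasting); Engineering; Civil engineering; Operating system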

Authors

Jiawei Geng, Jing Cao, Haipeng Jia, Zongwei Zhu, Hai Fang, Chengxi Gao, Cheng Ji, Gangyong Jia, Guangjie Han, Xuehai Zhou

Source

Journal: IEEE Transactions on Intelligent Transportation Systems [Institute of Electrical and Electronics Engineers]
Date: 2023-06-24 Volume (Issue): 25 (1): 959-972 Cited by: 2

Identifiers

DOI: 10.1109/tits.2023.3286400

Abstract

Transportation big data (TBD) are increasingly combined with artificial intelligence to mine novel patterns and information, owing to the powerful representational capabilities of deep neural networks (DNNs), especially for anti-COVID-19 applications. The distributed cloud-edge-vehicle training architecture has been applied to accelerate DNN training while ensuring low latency and high privacy for TBD processing. However, multiple intelligent devices (e.g., intelligent vehicles, edge computing chips at base stations) and different networks in intelligent transportation systems lead to heterogeneity in computing power and communication among distributed nodes. Existing parallel training mechanisms perform poorly on heterogeneous cloud-edge-vehicle clusters. The synchronous parallel mechanism may force fast workers to wait for the slowest worker before synchronizing, thus wasting their computing power. The asynchronous mechanism has communication bottlenecks and can exacerbate the straggler problem, causing increased training iterations and even incorrect convergence. In this paper, we introduce a distributed training framework, Heter-Train. First, a communication-efficient semi-asynchronous parallel mechanism (SAP-SGD) is proposed, which can take full advantage of the acceleration effect of the asynchronous strategy on heterogeneous training and constrain the straggler problem by using global interval synchronization. Second, considering the difference in node bandwidth, we design a solution for heterogeneous communication. Moreover, a novel weighted aggregation strategy is proposed to aggregate model parameters of different versions. Finally, experimental results show that our proposed strategy can achieve up to $6.74\times$ speedup in training time, with almost no decrease in accuracy.
