Computer science
Enhanced Data Rates for GSM Evolution (EDGE)
Cloud computing
Edge computing
Edge device
Node (physics)
Heuristic
Emulation
Distributed computing
Join
Feature (linguistics)
Deep learning
Parallel computing
Artificial intelligence
Linguistics
Philosophy
Structural engineering
Engineering
Economics
Programming language
Economic growth
Operating system
Authors
Tanmoy Sen, Haiying Shen
Identifier
DOI: 10.1109/icccn58024.2023.10230190
Abstract
With the emergence of edge computing and its local-computation advantage over the cloud, methods for distributed deep learning (DL) training on edge nodes have been proposed. The growing scale of DL models and the size of training datasets make it difficult to run such jobs on a single edge node due to resource constraints, yet existing methods either run the entire model on one edge node, collect all training data into one edge node, or still involve the remote cloud. To address this challenge, we propose a fully distributed training system that realizes both Data and Model Parallelism over a network of edge devices (called DMP). It clusters the edge nodes into a training structure, exploiting the fact that distributed edge nodes sense the data used for training. Within each cluster, we propose a heuristic and a Reinforcement Learning (RL) based algorithm that decide how to partition a DL model and assign the partitions to edge nodes for model parallelism so as to minimize the overall training time. Since geographically close edge nodes sense similar data, we further propose two schemes that avoid transferring duplicated data to the first-layer edge node as training data without compromising accuracy. Our container-based emulation and real edge-node experiments show that our system reduces training time by up to 44% while maintaining accuracy compared with state-of-the-art approaches. We have also open-sourced our code.
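The scheduling question the abstract raises, how to partition a DL model across heterogeneous edge nodes so that the slowest stage (its compute time plus the time to ship activations to the next node) is as fast as possible, can be made concrete with a small dynamic program over contiguous layer splits. The sketch below is a hypothetical illustration under strong simplifying assumptions (a linear chain of layers, one uniform link bandwidth, invented layer costs and node speeds); it is not the authors' DMP heuristic or their RL-based algorithm.

import math

# Hypothetical, made-up inputs for illustration only.
layer_flops = [4e9, 8e9, 8e9, 2e9, 1e9]        # compute cost of each layer
layer_out_bytes = [2e6, 2e6, 1e6, 5e5, 1e5]    # bytes each layer sends downstream
node_flops_per_s = [2e9, 4e9, 1e9]             # speeds of three assumed edge nodes
link_bytes_per_s = 1e6                         # assumed bandwidth between consecutive nodes

def stage_time(lo, hi, node):
    """Time for `node` to compute layers lo..hi-1 plus ship the boundary activation."""
    compute = sum(layer_flops[lo:hi]) / node_flops_per_s[node]
    # The final stage has no downstream node, so it sends nothing.
    transfer = 0.0 if hi == len(layer_flops) else layer_out_bytes[hi - 1] / link_bytes_per_s
    return compute + transfer

def best_partition():
    """Exact DP over contiguous splits: minimize the bottleneck (slowest) stage."""
    n, k = len(layer_flops), len(node_flops_per_s)
    memo = {}

    def solve(lo, node):
        if lo == n:                    # all layers assigned
            return 0.0, []
        if node == k:                  # out of nodes but layers remain
            return math.inf, []
        if (lo, node) in memo:
            return memo[(lo, node)]
        best = (math.inf, [])
        for hi in range(lo + 1, n + 1):
            rest, cuts = solve(hi, node + 1)
            cand = max(stage_time(lo, hi, node), rest)
            if cand < best[0]:
                best = (cand, [(lo, hi)] + cuts)
        memo[(lo, node)] = best
        return best

    return solve(0, 0)

if __name__ == "__main__":
    bottleneck, stages = best_partition()
    for node, (lo, hi) in enumerate(stages):
        print(f"edge node {node}: layers {lo}..{hi - 1}")
    print(f"bottleneck stage time: {bottleneck:.3f} s")

A fixed DP like this only covers a static chain with known costs; handling irregular model graphs, heterogeneous links, and changing node loads is presumably what motivates pairing a fast heuristic with an RL-based policy, as the abstract describes.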