Network communication optimization of RCCL communication library in Multi-NIC systems

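Computer science, Initialization, Distributed computing, Supercomputer, Communication system, Node (physics), Bandwidth (computing), Computer architecture, Parallel computing, Computer network, Structural engineering, Engineering, Programming language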
Authors
Shuaiming He, Wei Wan, Junhong Li
Identifier
DOI:10.1117/12.3031956
Abstract

With the widespread adoption of deep learning frameworks, large-scale computing and GPU programming are receiving increased attention. For upper-layer applications that use GPUs for computation and communication, such as TensorFlow and PyTorch, the efficiency of the underlying communication library is critical to overall framework performance. One such library is RCCL (ROCm Communication Collectives Library), the GPU communication library provided by the ROCm (Radeon Open Compute) platform, which supports a range of collective and point-to-point communication operations. Through analysis, we identified a problem in how RCCL initializes and uses the ring channel network on systems with multiple network interface cards (NICs): some NICs never carry traffic, so system resources are wasted. To address this, we introduce code-level data structures and algorithms that control which NICs are invoked, adjusting the multi-NIC usage strategy of the ring channel network without changing RCCL's original design. The optimized library was evaluated extensively on a large-scale GPU cluster and showed significant improvements in communication performance. At a scale of 16 compute nodes and 64 GPUs, peak bandwidth increased from 5.28 GB/s to 7.78 GB/s. In inter-node collective communication tests, the performance improvement reached up to 60%. The improved RCCL library gives upper-layer applications on the ROCm platform better low-level communication performance.
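The abstract does not describe the actual data structures or algorithms, so the following is only a minimal illustrative sketch of the general idea of spreading ring channels across all available NICs. It is not the authors' method and does not use any RCCL API. The function names `assign_nics_round_robin` and `idle_nics`, the round-robin policy, and the "all channels on NIC 0" baseline are all assumptions made for illustration.

```python
def assign_nics_round_robin(num_channels: int, num_nics: int) -> list[int]:
    """Map each ring channel index to a NIC index, cycling through all NICs.

    Hypothetical policy for illustration only; not taken from RCCL.
    """
    if num_channels < 0 or num_nics <= 0:
        raise ValueError("num_channels must be >= 0 and num_nics must be > 0")
    return [channel % num_nics for channel in range(num_channels)]


def idle_nics(assignment: list[int], num_nics: int) -> list[int]:
    """Return the NIC indices that no channel in the assignment uses."""
    used = set(assignment)
    return [nic for nic in range(num_nics) if nic not in used]


if __name__ == "__main__":
    num_channels, num_nics = 4, 4

    # Assumed baseline: every ring channel lands on the same NIC,
    # leaving the others unused.
    baseline = [0] * num_channels
    balanced = assign_nics_round_robin(num_channels, num_nics)

    print("baseline idle NICs:", idle_nics(baseline, num_nics))
    print("balanced idle NICs:", idle_nics(balanced, num_nics))
```

In this toy model, the baseline leaves three of the four NICs idle, while the round-robin mapping uses every NIC whenever there are at least as many channels as NICs. A real implementation would also need to account for GPU-to-NIC topology, which this sketch ignores.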
