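Remote direct memory access
Computer science
Network congestion
Joint (building)
Parallel computing
Computer network
Engineering
Architectural engineering
Network packet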
Authors
Zirui Wan,Jiao Zhang,Haoran Wei,Zhuo Jiang,Xiaolong Zhong,Wenfei Wu,Huaping Zhou,Tian Pan,Tao Huang
Abstract
As RDMA networks have been built out for data center applications, the RDMA-coupled DCQCN has come to dominate RDMA Congestion Control (CC). However, DCQCN suffers from severe performance problems in high-speed RDMA networks running modern high-performance distributed applications such as machine learning training. This paper presents RECC, motivated by both the recently emerging programmability of RDMA NICs (RNICs) and the limitations of existing RDMA congestion control mechanisms. RECC combines RTT and ECN events from RNICs to handle congestion promptly and precisely, and adds a History-aware Burst Smooth mechanism to avoid wrong rate decisions under various traffic patterns. We implement RECC entirely on commercial RNICs, without any modifications to switches, the RDMA protocol stack, or applications. Microbenchmark testbed experiments and real Machine Learning (ML) workload experiments with hundreds of 200G RNICs show that, compared with DCQCN, RECC reduces network tail latency by up to 64.4% and pause duration by up to 95%. In addition, large-scale simulations with realistic workloads demonstrate that RECC achieves performance comparable to HPCC.