Keywords
Computer science, bandwidth, photonics, latency, interconnect, reconfiguration, computer architecture, low latency, task partitioning, scaling, parallel computing, distributed computing, embedded systems, computer networks, operating systems, optoelectronics, telecommunications, physics, mathematics, economics, management, geometry
Authors
Mehrdad Khani, Manya Ghobadi, Mohammad Alizadeh, Ziyi Zhu, Madeleine Glick, Keren Bergman, Amin Vahdat, Benjamin Klenk, Eiman Ebrahimi
Identifier
DOI: 10.1145/3452296.3472900
Abstract
This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training time of popular DNN models using silicon photonics links capable of providing multiple terabits-per-second of bandwidth per GPU. SiP-ML partitions the training job across GPUs with hybrid data and model parallelism while ensuring the communication pattern can be supported efficiently on the network interconnect. We develop task partitioning and device placement methods that take the degree and reconfiguration latency of optical interconnects into account. Simulations using real DNN models show that, compared to the state-of-the-art electrical networks, our approach improves training time by 1.3--9.1x.
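To make the abstract's partitioning idea concrete, the toy Python sketch below scores candidate hybrid data/model-parallel splits with a crude cost model that charges for per-GPU optical bandwidth, the interconnect's degree limit, and circuit reconfiguration latency. This is a minimal illustrative sketch under assumed names and constants (`Interconnect`, `iteration_time_us`, `link_bw_tbps`, `reconfig_us`, and all numeric values are hypothetical); it is not SiP-ML's actual partitioning or device-placement method, which the paper develops in full.

```python
# Hypothetical sketch, NOT the authors' algorithm: a toy cost model for
# picking a hybrid data/model-parallel split on an optical interconnect
# whose links have limited degree and non-zero reconfiguration latency.
# All names and constants are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Interconnect:
    link_bw_tbps: float   # per-GPU optical bandwidth, Tbit/s (assumed)
    degree: int           # circuits a GPU can hold at once (assumed)
    reconfig_us: float    # latency to retarget a circuit, microseconds


def iteration_time_us(n_gpus: int, mp: int, ic: Interconnect,
                      compute_us: float = 500.0,
                      grad_bytes: float = 4e9,
                      act_bytes: float = 1e8) -> float:
    """Crude per-iteration time for mp-way model x dp-way data parallelism."""
    dp = n_gpus // mp
    # Model parallelism: activations cross mp - 1 stage boundaries.
    mp_comm = (mp - 1) * act_bytes * 8 / (ic.link_bw_tbps * 1e12) * 1e6
    # Data parallelism: a ring all-reduce moves ~2x the (sharded) gradients.
    dp_comm = 0.0
    if dp > 1:
        dp_comm = 2 * (grad_bytes / mp) * 8 / (ic.link_bw_tbps * 1e12) * 1e6
        # If the collective needs more distinct neighbors than the optical
        # degree allows, circuits must be re-pointed mid-collective.
        reconfigs = max(0, dp - 1 - ic.degree)
        dp_comm += reconfigs * ic.reconfig_us
    return compute_us / mp + mp_comm + dp_comm


def best_split(n_gpus: int, ic: Interconnect) -> int:
    """Return the model-parallel degree that minimizes the toy cost model."""
    divisors = [m for m in range(1, n_gpus + 1) if n_gpus % m == 0]
    return min(divisors, key=lambda m: iteration_time_us(n_gpus, m, ic))


if __name__ == "__main__":
    ic = Interconnect(link_bw_tbps=4.0, degree=4, reconfig_us=25.0)
    mp = best_split(64, ic)
    print(f"model parallel = {mp}, data parallel = {64 // mp}")
```

The intuition the sketch captures is the one the abstract states: multi-terabit silicon photonics links shrink the communication terms, so larger GPU counts keep paying off (strong scaling), while the degree and reconfiguration-latency terms are what a partitioning and placement method must respect on an optical fabric.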