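Computer science
Optical networking
Transceiver
Wavelength-division multiplexing
Multiplexing
Multiwavelength optical networking
Software deployment
Optical transport network
Traffic grooming
Passive optical network
Optical switch
Bandwidth (computing)
Optical cross-connect
Optical fiber
Computer network
Electronic engineering
Wireless
Telecommunications
Engineering
Fiber-optic splitter
Wavelength
Materials science
Fiber-optic sensor
Optoelectronics
Operating system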
Authors
Liu Hong, Ryohei Urata, Kevin Yasumura, Xiang Zhou, Roy Bannon, Jill Berger, P.Z. Dashti, Norman P. Jouppi, Cedric F. Lam, Sheng Li, Erji Mao, Daniel Nelson, George C. Papen, Mukarram Tariq, Amin Vahdat
Identifier
DOI:10.1145/3603269.3604836
Abstract
We describe our experience developing what we believe to be the world's first large-scale production deployments of lightwave fabrics used for both datacenter networking and machine-learning (ML) applications. Using optical circuit switches (OCSes) and optical transceivers developed in-house, we employ hardware and software codesign to integrate the fabrics into our network and computing infrastructure. Key to our design is a high degree of multiplexing enabled by new kinds of wavelength-division-multiplexing (WDM) and optical circulators that support high-bandwidth bidirectional traffic on a single strand of optical fiber. The development of the requisite OCS and optical transceiver technologies leads to a synchronous lightwave fabric that is reconfigurable, low latency, rate agnostic, and highly available. These fabrics have provided substantial benefits for long-lived traffic patterns in our datacenter networks and predictable traffic patterns in tightly-coupled machine learning clusters. We report results for a large-scale ML superpod with 4096 tensor processing unit (TPU) V4 chips that has more than one ExaFLOP of computing power. For this use case, the deployment of a lightwave fabric provides up to 3× better system availability and model-dependent performance improvements of up to 3.3× compared to a static fabric, despite constituting less than 6% of the total system cost.