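Computer science
GPU cluster
Cluster (spacecraft)
Distributed computing
Scheduling (production processes)
Parallel computing
Computer network
CUDA
Operations management
Economics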
Authors
Shubham Chaudhary, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Srinidhi Viswanatha
Identifier
DOI:10.1145/3342195.3387555
Abstract
We present Gandiva_fair, a distributed, fair-share scheduler that balances the conflicting goals of efficiency and fairness in GPU clusters for deep learning training (DLT). Gandiva_fair provides performance isolation between users, enabling multiple users to share a single cluster and thus maximizing cluster efficiency. Gandiva_fair is the first scheduler that allocates cluster-wide GPU time fairly among active users.