Computer science
Cloud computing
Cache
Elasticity (physics)
Scheduling (production processes)
Speedup
Leverage (statistics)
Distributed computing
Job scheduler
Parallel computing
Artificial intelligence
Operating system
Operations management
Materials science
Economy
Composite material
Authors
Rong Gu, Kai Zhang, Zhihao Xu, Yang Che, Bin Fan, Haojun Hou, Haipeng Dai, Li Yi, Yu Ding, Guihai Chen, Yihua Huang
Identifier
DOI:10.1109/icde53745.2022.00209
Abstract
Nowadays, it is prevalent to train deep learning (DL) models on cloud-native platforms that actively leverage containerization and orchestration technologies for high elasticity, low and flexible operation cost, and many other benefits. However, this setting also faces new challenges, and our work focuses on those related to I/O throughput for training: complex data access with complicated performance tuning, insufficient cache capacity on specialized hardware to match the high and dynamic I/O requirements of training, and inefficient I/O resource sharing across different training jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a data abstraction called Fluid Dataset to access training data from heterogeneous sources in a unified manner, with transparent and elastic data acceleration powered by auto-tuned cache runtimes. In addition, it comes with an on-the-fly cache system autoscaler that can intelligently scale the cache capacity up and down to match the online training speed of each individual DL job. To improve the overall performance of multiple DL jobs, Fluid can co-orchestrate the data cache and DL jobs by arranging job scheduling in an appropriate order. Our experimental results show significant performance improvement for each individual DL job that uses dynamic computing resources with Fluid. In addition, for scheduling multiple DL jobs with the same datasets, Fluid delivers around a 2x speedup when integrated with existing widely used and cutting-edge scheduling solutions. Fluid is now an open-source project hosted by the Cloud Native Computing Foundation (CNCF), with adopters in production including Alibaba Cloud, Tencent Cloud, Weibo.com, China Telecom, etc.
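The abstract describes an autoscaler that sizes cache capacity to each job's online training speed. The following is a minimal sketch of the kind of demand-driven sizing policy such an autoscaler might apply; the function name, parameters, and bandwidth model are hypothetical illustrations, not Fluid's actual algorithm or API.

```python
import math


def desired_cache_replicas(io_demand_mbps: float,
                           per_replica_bw_mbps: float,
                           min_replicas: int = 1,
                           max_replicas: int = 8) -> int:
    """Choose a cache-worker replica count for one training job.

    io_demand_mbps: observed read throughput the job currently needs.
    per_replica_bw_mbps: assumed bandwidth one cache worker can serve.
    The result is clamped to [min_replicas, max_replicas], mirroring
    how an autoscaler would respect resource limits.
    """
    needed = math.ceil(io_demand_mbps / per_replica_bw_mbps)
    return max(min_replicas, min(max_replicas, needed))


# A slow job keeps the floor; a fast job scales up until the cap.
print(desired_cache_replicas(100, 400))    # low demand -> 1 replica
print(desired_cache_replicas(900, 400))    # needs ceil(2.25) -> 3
print(desired_cache_replicas(10000, 400))  # capped at 8
```

Re-evaluating this policy periodically against live throughput metrics gives the "scale up and down to match the online training speed" behavior the abstract refers to, at the cost of some lag between a demand change and the cache resize.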