Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs

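Computer science; Cloud computing; Cache; Elasticity (physics); Scheduling (production processes); Acceleration; Leverage (statistics); Distributed computing; Job scheduler; Parallel computing; Artificial intelligence; Operating system; Operations management; Composite material; Economics; Materials science
Authors
Rong Gu, Kai Zhang, Zhihao Xu, Yang Che, Bin Fan, Haojun Hou, Haipeng Dai, Li Yi, Yu Ding, Guihai Chen, Yihua Huang
Identifier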
DOI:10.1109/icde53745.2022.00209
Abstract

Nowadays, it is prevalent to train deep learning (DL) models on cloud-native platforms that actively leverage containerization and orchestration technologies for high elasticity, low and flexible operating cost, and many other benefits. However, this approach also faces new challenges, and our work focuses on those related to I/O throughput for training, including complex data access with complicated performance tuning, a lack of cache capacity with specialized hardware to match its high and dynamic I/O requirements, and inefficient I/O resource sharing across different training jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a data abstraction called Fluid Dataset to access training data from heterogeneous sources in a unified manner, with transparent and elastic data acceleration powered by auto-tuned cache runtimes. In addition, it comes with an on-the-fly cache system autoscaler that can intelligently scale the cache capacity up and down to match the online training speed of each individual DL job. To improve the overall performance of multiple DL jobs, Fluid can co-orchestrate the data cache and DL jobs by arranging job scheduling in an appropriate order. Our experimental results show a significant performance improvement for each individual DL job that uses dynamic computing resources with Fluid. In addition, when scheduling multiple DL jobs with the same datasets, Fluid gives around a 2x performance speedup when integrated with existing widely used and cutting-edge scheduling solutions. Fluid is now an open-source project hosted by the Cloud Native Computing Foundation (CNCF), with production adopters including Alibaba Cloud, Tencent Cloud, Weibo.com, and China Telecom.
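
The idea of scaling cache capacity to match a job's online training speed can be illustrated with a minimal sketch. This is not Fluid's actual autoscaler or API: the function name `target_cache_workers`, its parameters, and the throughput-ratio rule are all assumptions made for illustration. It simply picks the smallest number of cache workers whose combined read throughput covers the rate at which a job consumes data, clamped to a configured range.

```python
import math


def target_cache_workers(required_mbps, per_worker_mbps,
                         min_workers=1, max_workers=16):
    """Hypothetical sizing rule (an assumption, not Fluid's algorithm).

    required_mbps:   data rate the training job currently consumes (MB/s)
    per_worker_mbps: read throughput one cache worker can sustain (MB/s)
    Returns the worker count clamped to [min_workers, max_workers].
    """
    if per_worker_mbps <= 0:
        raise ValueError("per_worker_mbps must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    # Smallest integer number of workers that covers the demand.
    needed = math.ceil(required_mbps / per_worker_mbps)
    # Never scale below the floor or above the ceiling.
    return max(min_workers, min(max_workers, needed))


# Example: a job reading 900 MB/s from workers that each serve 400 MB/s.
workers = target_cache_workers(900, 400)
```

Re-evaluating this rule periodically with a freshly measured `required_mbps` would scale the cache up as training speeds up and down as it slows, which is the behaviour the abstract attributes to the autoscaler; how Fluid measures demand and decides in practice is not stated here.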
