Computer science
Latency (audio)
Server
Inference
Quality of service
Deep learning
Cloud computing
Queueing
Artificial intelligence
Parallel computing
Distributed computing
Operating system
Computer network
Telecommunications
Authors
Deyu Zhang, Yunzhen Luo, Y. Wang, Xiaoyan Kui, Ju Ren
Source
Journal: IEEE Transactions on Cloud Computing
Publisher: Institute of Electrical and Electronics Engineers
Date: 2024-01-01
Volume/Issue: 12(1): 174-185
Identifiers
DOI: 10.1109/tcc.2024.3350561
Abstract
Deep learning (DL) has been applied in billions of mobile devices due to its astonishing performance in image, text, and audio processing. However, limited by the computing capability of mobile devices, a large number of DL inference tasks must be offloaded to edge or cloud servers, leaving even powerful GPU servers struggling to ensure quality of service (QoS). To better utilize the highly parallel computing architecture of GPUs and improve QoS, we propose BatOpt, a framework that uses dynamic batch processing to strike a good balance between service latency and GPU memory usage in DL inference services. Specifically, BatOpt innovatively models the DL inference service as an $M/G(a,b)/1/N$ queue, taking stochastic task arrivals into account, which enables it to predict service latency accurately in different system states. Furthermore, we propose an optimization algorithm that trades off service latency against GPU memory usage in different system states by analyzing the queueing model. We have implemented BatOpt on PyTorch and evaluated it on an RTX 2080 GPU using real DL models. BatOpt delivers up to 31x and 4.3x improvements in service latency compared to single-input and fixed-batch-size strategies, respectively, and its maximum GPU memory usage is only 0.3x that of the greedy-dynamic-batch-size strategy at the same service latency.
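The abstract does not reproduce BatOpt's algorithm, but the batching discipline it names can be illustrated with a short sketch. In Kendall notation, $M/G(a,b)/1/N$ means Poisson (Markovian) arrivals, generally distributed bulk service with batch size between $a$ and $b$, a single server, and system capacity $N$. Below is a minimal, hypothetical PyTorch serving loop under that discipline: requests queue up to capacity N, and one server waits until at least A requests are present, then runs up to B of them in a single batched forward pass. The toy model, the parameter values A, B, N, and the 5 ms batching window are all illustrative assumptions, not from the paper; BatOpt's contribution is tuning such knobs adaptively per system state via its queueing analysis.

```python
import queue
import threading
import time

import torch

# Illustrative M/G(a,b)/1/N parameters (assumed, not from the paper):
# serve only when >= A requests are queued, batch at most B, queue capacity N.
A, B, N = 4, 32, 256

# Stand-in model; the paper evaluates on real DL models.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
model.eval()

# Each request carries its input tensor and a one-slot "reply box".
request_queue: "queue.Queue" = queue.Queue(maxsize=N)


def serve_forever() -> None:
    """Single-server loop: form a batch of size in [A, B], run one forward pass."""
    while True:
        batch = []
        while len(batch) < A:                  # bulk-service rule: wait for >= A
            batch.append(request_queue.get())
        deadline = time.monotonic() + 0.005    # short window to grow toward B
        while len(batch) < B:
            try:
                timeout = max(deadline - time.monotonic(), 0.0)
                batch.append(request_queue.get(timeout=timeout))
            except queue.Empty:
                break
        inputs, reply_boxes = zip(*batch)
        with torch.no_grad():
            outputs = model(torch.stack(inputs))  # one batched inference call
        for box, out in zip(reply_boxes, outputs):
            box.put(out)


threading.Thread(target=serve_forever, daemon=True).start()


def infer(x: torch.Tensor) -> torch.Tensor:
    """Client call: enqueue one input, block until its result arrives."""
    box: "queue.Queue" = queue.Queue(maxsize=1)
    request_queue.put((x, box))  # blocks when the system is at capacity N
    return box.get()


if __name__ == "__main__":
    # Submit 8 concurrent requests so a batch of at least A can form.
    results = []
    workers = [
        threading.Thread(target=lambda: results.append(infer(torch.randn(128))))
        for _ in range(8)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print(len(results), results[0].shape)  # 8 torch.Size([10])
```

A fixed (A, B) like this trades latency for throughput statically; per the abstract, BatOpt instead predicts service latency from the queueing model and adjusts the batching decision as the system state changes, which is what yields its reported latency and memory gains.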