Partition (number theory)
Computer science
Inference
Quantization (signal processing)
Cluster (spacecraft)
Parallel computing
Distributed computing
Artificial intelligence
Algorithm
Computer network
Mathematics
Combinatorics
Authors
Juntao Zhao, Borui Wan, Chuan Wu, Yanghua Peng, Haibin Lin
Identifiers
DOI: 10.1145/3627535.3638480
Abstract
The immense size of large language models (LLMs) leads to high resource demand and cost for running them. Although these models are largely served on uniform, high-end GPUs today, using a heterogeneous cluster that mixes available high- and low-capacity GPUs can substantially reduce serving cost. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters. Extensive experiments on production inference workloads demonstrate inference throughput improvements over state-of-the-art systems.
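The abstract combines two ideas: partitioning a model's layers across GPUs of different capacities, and choosing a quantization bitwidth adapted to each device. The following is a minimal hypothetical sketch of that combination, not the actual LLM-PQ algorithm (which jointly optimizes both decisions); the function name, the proportional sizing rule, and the greedy bitwidth fallback are all illustrative assumptions.

```python
def partition_and_quantize(num_layers, layer_gb_fp16, gpu_mem_gb,
                           bitwidths=(16, 8, 4)):
    """Assign contiguous layer shards to GPUs and pick a per-shard bitwidth.

    Illustrative only: sizes each shard proportionally to the GPU's share of
    total cluster memory, then lowers the bitwidth only when the shard would
    not fit at a higher precision.
    """
    total_mem = sum(gpu_mem_gb)
    plan, start = [], 0
    for i, mem in enumerate(gpu_mem_gb):
        if i == len(gpu_mem_gb) - 1:
            count = num_layers - start            # last GPU takes the remainder
        else:
            count = round(num_layers * mem / total_mem)
        count = max(1, min(count, num_layers - start))
        # Highest bitwidth whose quantized shard fits on this GPU.
        for bits in bitwidths:
            shard_gb = count * layer_gb_fp16 * bits / 16
            if shard_gb <= mem:
                plan.append({"gpu": i, "layers": (start, start + count),
                             "bits": bits})
                break
        else:
            raise ValueError(f"{count} layers do not fit on GPU {i}")
        start += count
    return plan


# Example: a 32-layer model (1 GB/layer at fp16, 32 GB total) on a
# heterogeneous pair with 16 GB + 8 GB; both shards drop to 8-bit to fit.
print(partition_and_quantize(32, 1.0, [16.0, 8.0]))
```

The real system also distinguishes the prefill and decode phases of generative inference ("phase-aware" partition), which this sketch ignores for brevity.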