Keywords
Dataflow, Computer science, Inefficiency, Inference, Security token, Decoding methods, Realization (probability), Programming language, Theoretical computer science, Algorithm, Artificial intelligence, Computer security, Statistics, Mathematics, Economics, Microeconomics
Authors
Cong Li, Zhe Zhou, Size Zheng, Jiaxi Zhang, Yun Liang, Guangyu Sun
Identifier
DOI:10.1145/3620666.3651352
Abstract
Inference with generative large language models (LLMs) suffers from inefficiency because of the token dependency introduced by autoregressive decoding. Recently, speculative inference has been proposed to alleviate this problem: it introduces small language models to generate draft tokens and uses the original large language model to verify them. Although speculative inference can enhance the efficiency of the decoding procedure, we find that it presents variable resource demands due to the distinct computation patterns of the models used in speculative inference. This variability impedes the full realization of speculative inference's acceleration potential in current systems.
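The draft-then-verify loop the abstract describes can be illustrated with a minimal sketch. This is not the paper's system: the two "models" below are hypothetical deterministic stand-ins (each maps a token sequence to a next token), and `k`, `draft_model`, and `target_model` are names invented for illustration. The sketch only shows the control flow: the cheap draft model proposes `k` tokens per round, and the expensive target model verifies them, accepting the longest agreeing prefix and substituting its own token at the first mismatch.

```python
# Toy speculative decoding sketch (assumed, illustrative stand-ins, not the
# paper's implementation). Both "models" are deterministic next-token rules.

def draft_model(seq):
    # Small, cheap draft model: guesses the next token as (last + 1) mod 10.
    return (seq[-1] + 1) % 10

def target_model(seq):
    # Large target model: same rule, except it emits 0 right after a 5,
    # so the draft model is usually right but occasionally wrong.
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens. Each round, the draft model proposes k tokens;
    one verification pass of the target model (batched over all k positions
    in a real system) accepts the longest agreeing prefix."""
    seq = list(prompt)
    target_calls = 0
    while len(seq) < len(prompt) + n_new:
        # 1) Draft phase: propose k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(seq + draft))
        # 2) Verify phase: the target model checks each proposed position.
        target_calls += 1
        accepted = []
        for i in range(k):
            t = target_model(seq + accepted)
            accepted.append(t)      # always keep the target's token, so the
            if t != draft[i]:       # output matches pure target decoding
                break               # first mismatch ends the round
        seq.extend(accepted)
    return seq[:len(prompt) + n_new], target_calls
```

Because the verifier always keeps its own token at a mismatch, the output is identical to decoding with the target model alone, but the target model runs far fewer times (here, one verification round can accept up to `k` tokens). The abstract's point is that the draft phase (small model, many sequential steps) and the verify phase (large model, one wide pass) have very different resource demands, which current systems do not exploit well.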