Dataflow
Computer science
Dataflow architecture
Architecture
Inference
Computer architecture
Programming language
Parallel computing
Artificial intelligence
Art
Visual arts
Authors
Cong Li, Zhe Zhou, Size Zheng, Jiaxi Zhang, Yun Liang, Guangyu Sun
Identifier
DOI: 10.1145/3620666.3651352
Abstract
Inference of generative large language models (LLMs) suffers from inefficiency because of the token dependency introduced by autoregressive decoding. Recently, speculative inference has been proposed to alleviate this problem: it introduces small language models to generate draft tokens and adopts the original large language model to verify them. Although speculative inference can improve the efficiency of the decoding procedure, we find that it presents variable resource demands due to the distinct computation patterns of the models it combines. This variability impedes the full realization of speculative inference's acceleration potential in current systems.
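The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the "models" below are hypothetical toy functions over integer tokens, with greedy (deterministic) decoding, chosen so that the speculative loop provably reproduces plain autoregressive decoding while advancing several tokens per verification step.

```python
def target_next(prefix):
    # Toy stand-in for the large "target" model: deterministic greedy next token.
    return (sum(prefix) * 7 + 3) % 10

def draft_next(prefix):
    # Toy stand-in for the small "draft" model: a cheap approximation of
    # target_next that is only sometimes correct (assumption for illustration).
    t = target_next(prefix)
    return t if sum(prefix) % 2 == 0 else (t + 1) % 10

def greedy_decode(prefix, steps):
    # Plain autoregressive decoding: one large-model call per token.
    out = list(prefix)
    for _ in range(steps):
        out.append(target_next(out))
    return out[len(prefix):]

def speculative_decode(prefix, steps, k=4):
    out = list(prefix)
    while len(out) - len(prefix) < steps:
        # 1) Draft phase: the small model proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: the large model checks the drafts and keeps the
        #    longest prefix that matches its own greedy choices.
        accepted, ctx = 0, list(out)
        for t in draft:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            accepted += 1
        out.extend(draft[:accepted])
        # 3) The large model always contributes one guaranteed token, so
        #    decoding advances even if every draft token is rejected.
        out.append(target_next(out))
    return out[len(prefix):][:steps]
```

With greedy verification, the speculative loop emits exactly the same token sequence as `greedy_decode`; the speedup comes from verifying up to k draft tokens in one large-model pass instead of k sequential passes. The variable resource demand the abstract highlights stems from alternating between the small model's lightweight draft phase and the large model's heavyweight verification phase.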