Computer science, Field-programmable gate array, Matrix multiplication, Edge device, Systolic array, Transformer, Edge computing, Parallel computing, Computer hardware, Computer engineering, Embedded system, Engineering, Internet of Things, Very-large-scale integration, Cloud computing, Physics, Quantum mechanics, Voltage, Electrical engineering, Quantum, Operating system
Authors
Mingqiang Huang, J.P. Luo, Chenchen Ding, Zikun Wei, Sixiao Huang, Hao Yu
Source
Journal: IEEE Transactions on Circuits and Systems I: Regular Papers
[Institute of Electrical and Electronics Engineers]
Date: 2023-10-19
Volume/Issue: 70 (12): 5289-5301
Citations: 9
Identifier
DOI: 10.1109/TCSI.2023.3312775
Abstract
Transformer-like networks have shown remarkably high performance in both natural language processing and computer vision. However, the heavy computational demands of non-linear floating-point arithmetic and the irregular memory-access pattern of the self-attention mechanism still make it challenging to deploy Transformers on edge devices. To address these issues, we propose an integer-only quantization scheme that simplifies the non-linear operations (such as LayerNorm, Softmax, and GELU), and apply an algorithm-hardware co-design strategy to guarantee both high accuracy and high efficiency. In addition, we construct a general-purpose group vector systolic array to efficiently accelerate matrix-multiplication operations, covering both regular matrix multiplication/convolution and the irregular multi-head self-attention mechanism. A unified data-package strategy and a flexible on-/off-chip data storage management strategy are also proposed to further improve performance. The design has been deployed on the Xilinx ZCU102 FPGA platform, achieving overall inference latencies of 4.077 ms and 11.15 ms per image for ViT-Tiny and ViT-Small, respectively. The average throughput reaches up to 762.7 GOPS, a significant improvement over the previous state-of-the-art FPGA Transformer accelerator.
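The abstract centers on accelerating both regular matrix multiplication and multi-head self-attention with a group vector systolic array. As a rough illustrative sketch only, and not the paper's architecture (the vector grouping, unified data-package format, and on-/off-chip storage management are not reproduced here), the NumPy simulation below shows the output-stationary dataflow of a plain systolic matrix-multiply array with int8 operands and wide accumulators; the function name `systolic_matmul` and the array shapes are assumptions made for illustration.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level simulation of an output-stationary systolic array (illustrative sketch).

    A: (M, K) int8 operands streamed in from the left, one row per PE row.
    B: (K, N) int8 operands streamed in from the top, one column per PE column.
    Returns C = A @ B accumulated in a wide integer type (int64 here;
    real hardware would typically use int32 accumulators).
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N), dtype=np.int64)      # one accumulator per PE
    total_cycles = K + (M - 1) + (N - 1)        # skewed wavefront schedule
    for t in range(total_cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j                   # diagonal wavefront index
                if 0 <= k < K:
                    # PE(i, j) multiplies the operands arriving this cycle and
                    # keeps the partial sum locally (output-stationary dataflow).
                    acc[i, j] += np.int64(A[i, k]) * np.int64(B[k, j])
    return acc

# sanity check against a reference matmul on random int8 operands
rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(8, 16), dtype=np.int8)
B = rng.integers(-128, 128, size=(16, 8), dtype=np.int8)
assert np.array_equal(systolic_matmul(A, B), A.astype(np.int64) @ B.astype(np.int64))
```

A real accelerator pipelines these multiply-accumulates across physical processing elements and shifts operands between neighbors every cycle; the triple loop above only reproduces the wavefront schedule functionally, to show which operand pair each PE consumes on each cycle.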