Computer science
Field-programmable gate array
Hardware acceleration
Lookup table
Reconfigurability
Computer hardware
Recurrent neural network
Embedded system
Artificial intelligence
Artificial neural network
Telecommunications
Programming language
Authors
Shin-haeng Kang, Sukhan Lee, Byeongho Kim, Hweesoo Kim, Kyomin Sohn, Nam Sung Kim, Eojin Lee
Identifiers
DOI:10.1145/3490422.3502355
Abstract
In this paper, we implemented the world's first RNN-T inference accelerator using an FPGA with PIM-HBM, which can multiply the internal bandwidth of the memory. The accelerator offloads the matrix-vector multiplication (GEMV) operations of the LSTM layers in RNN-T to PIM-HBM, which significantly reduces GEMV execution time by exploiting the internal bandwidth of HBM. To ensure that memory commands are issued in a pre-defined order, one of the most important constraints in exploiting PIM-HBM, we implement a direct memory access (DMA) module and change the configuration of the on-chip memory controller, taking advantage of the flexibility and reconfigurability of the FPGA. In addition, we design other hardware modules for acceleration, such as non-linear function (i.e., sigmoid and hyperbolic tangent), element-wise operation, and ReLU modules, to execute these compute-bound RNN-T operations on the FPGA. To this end, we prepare FP16-quantized weights and MLPerf input datasets, and modify the PCIe device driver and the C++-based control code. In our evaluation, our accelerator with PIM-HBM reduces the execution time of RNN-T by 2.5× on average with an 11.09% reduction in LUT usage, and improves energy efficiency by up to 2.6× compared to the baseline.
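To make the partitioning described in the abstract concrete, the sketch below shows one LSTM cell step split along the same lines: the memory-bound GEMV part is routed to a pim_gemv() call, while the sigmoid/tanh non-linearities and element-wise operations correspond to the dedicated FPGA modules. This is a minimal C++ sketch under stated assumptions: pim_gemv() is a hypothetical stand-in computed on the CPU, since the actual PIM-HBM command interface, DMA module, and driver API are not detailed in the abstract.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major weight matrix

// Hypothetical stand-in for the GEMV offloaded to PIM-HBM (y = W * x).
// In the real accelerator this work would run inside the memory,
// fed by the DMA module that issues commands in a pre-defined order.
Vec pim_gemv(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (size_t r = 0; r < W.size(); ++r)
        for (size_t c = 0; c < x.size(); ++c)
            y[r] += W[r][c] * x[c];
    return y;
}

float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// One LSTM step with stacked gate weights; gate order: i, f, g, o.
// Wx is (4H x I), Wh is (4H x H), b has length 4H.
void lstm_step(const Mat& Wx, const Mat& Wh, const Vec& b,
               const Vec& x, Vec& h, Vec& c) {
    const size_t H = h.size();
    Vec zx = pim_gemv(Wx, x); // memory-bound GEMV -> PIM-HBM
    Vec zh = pim_gemv(Wh, h); // memory-bound GEMV -> PIM-HBM
    for (size_t j = 0; j < H; ++j) {
        // Non-linear and element-wise parts -> on-FPGA modules.
        float i = sigmoid(zx[j]             + zh[j]             + b[j]);
        float f = sigmoid(zx[H + j]         + zh[H + j]         + b[H + j]);
        float g = std::tanh(zx[2 * H + j]   + zh[2 * H + j]     + b[2 * H + j]);
        float o = sigmoid(zx[3 * H + j]     + zh[3 * H + j]     + b[3 * H + j]);
        c[j] = f * c[j] + i * g;
        h[j] = o * std::tanh(c[j]);
    }
}
```

The split reflects the paper's motivation: GEMV dominates LSTM execution time and is bandwidth-bound, so it benefits from PIM-HBM's internal bandwidth, while the remaining per-element work is compute-bound and suits dedicated FPGA logic.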