CIMFormer: A Systolic CIM-Array-Based Transformer Accelerator With Token-Pruning-Aware Attention Reformulating and Principal Possibility Gathering

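Security token; Transformer; Principal (computer security); Computer science; Systolic array; Embedded system; Engineering; Computer security; Electrical engineering; Voltage; Very-large-scale integration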

Authors

Ruiqi Guo, X.L. Chen, Lei Wang, Yang Wang, Hao Sun, Jingchuan Wei, Huiming Han, Leibo Liu, Shaojun Wei, Yang Hu, Shouyi Yin

Source

Journal: IEEE Journal of Solid-State Circuits [Institute of Electrical and Electronics Engineers]
Date: 2024-01-01 Volume/Issue: : 1-13

Identifier

DOI: 10.1109/jssc.2024.3402174

Abstract

Transformer models have achieved impressive performance in various artificial intelligence (AI) applications. However, their high computation cost and memory footprint make inference inefficient. Although digital compute-in-memory (CIM) is a promising hardware architecture with high accuracy, the Transformer's attention mechanism raises three challenges for CIM access and computation: 1) the attention computation involving Query and Key results in massive data movement and under-utilization of CIM macros; 2) the attention computation involving Possibility and Value exhibits plenty of dynamic bit-level sparsity, resulting in redundant bit-serial CIM operations; and 3) the restricted data reload bandwidth of CIM macros significantly degrades performance for large Transformer models. To address these challenges, we design a CIM accelerator called CIM Transformer (CIMFormer) with three corresponding features. First, token-pruning-aware attention reformulation (TPAR) adjusts attention computations according to the token-pruning ratio, reducing real-time access to and under-utilization of CIM macros. Second, the principal possibility gather-scatter scheduler (PPGSS) gathers the possibilities with greater effective bit-width as concurrent inputs to CIM macros, enhancing the efficiency of bit-serial CIM operations. Third, the systolic X|W-CIM macro array efficiently handles the execution of large Transformer models that exceed the storage capacity of the on-chip CIM macros. Fabricated in a 28-nm technology, CIMFormer achieves a peak energy efficiency of 15.71 TOPS/W, an improvement of over 1.46× compared with the state-of-the-art Transformer accelerator under equivalent conditions.
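The second challenge in the abstract (bit-level sparsity in bit-serial operations) can be illustrated with a toy cycle-count model. The sketch below is not the paper's PPGSS. It assumes, purely for illustration, that inputs fed concurrently to a macro take as many bit-serial cycles as the widest effective bit-width in the group, and that attention possibilities are quantized to unsigned 8-bit integers. All function names and sample values are hypothetical.

```python
def effective_bits(value: int) -> int:
    """Bits needed to represent a non-negative integer (0 needs none)."""
    return value.bit_length()


def chunk(values, size):
    """Split values into consecutive groups of at most `size` elements."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def bit_serial_cycles(groups):
    """Toy cost model (assumption): a group of concurrent inputs costs
    as many cycles as its widest member's effective bit-width."""
    return sum(max(effective_bits(v) for v in g) for g in groups)


def gather_by_bit_width(values, size):
    """Group values of similar effective bit-width so that wide values
    no longer force extra cycles on groups of narrow ones."""
    ordered = sorted(values, key=effective_bits, reverse=True)
    return chunk(ordered, size)


# Hypothetical 8-bit quantized attention possibilities for one query row.
probs = [200, 1, 3, 0, 150, 2, 0, 1]

naive = bit_serial_cycles(chunk(probs, 4))                   # original order
gathered = bit_serial_cycles(gather_by_bit_width(probs, 4))  # after gathering

print(naive, gathered)  # prints "16 9"
```

In this toy model the two wide values land in the same group after gathering, so the second group finishes in a single cycle. A real scheduler would also have to scatter the partial results back to their original positions, which this sketch omits.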
