Matrix multiplication
Sparse matrix
Matrix (chemical analysis)
Computer science
Kernel (algebra)
Parallel computing
Overhead (engineering)
Systolic array
Sparse array
Multiplication (music)
Algorithm
Mathematics
Materials science
Embedded system
Combinatorics
Physics
Composite material
Gaussian distribution
Operating system
Quantum
Quantum mechanics
Very-large-scale integration
Authors
Xin He, Subhankar Pal, Aporva Amarnath, Siying Feng, Dong-Hyeon Park, Austin Rovinski, Haojie Ye, Yuhan Chen, Ronald Dreslinski, Trevor Mudge
Identifier
DOI: 10.1145/3392717.3392751
Abstract
While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-designed approach: we first develop a packing technique to condense a sparse matrix, and then propose a systolic array-based system, Sparse-TPU (abbreviated to STPU), to accommodate the matrix computations for the packed, denser matrix counterparts. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for integer (int8) and floating-point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU has a 12.93% area overhead and an average 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.
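To make the co-design described in the abstract concrete, the sketch below shows the two ingredients in generic form: a reference sparse matrix-vector multiply (SpMV) over a CSR matrix, and a naive greedy column-packing pass that merges columns with non-overlapping nonzero patterns into denser groups, in the spirit of condensing a sparse matrix before mapping it onto a dense MAC array. This is an illustrative sketch only; the helper `greedy_column_packing` is a hypothetical stand-in and is not the Sparse-TPU packing algorithm, whose actual scheme is detailed in the paper. NumPy and SciPy are assumed to be available.

```python
# Illustrative sketch: generic CSR SpMV plus a naive greedy column-packing
# pass. NOT the Sparse-TPU algorithm; intended only to convey the idea of
# condensing sparse columns into denser groups for a MAC array.
import numpy as np
from scipy.sparse import random as sparse_random, csr_matrix


def spmv_csr(A: csr_matrix, x: np.ndarray) -> np.ndarray:
    """Reference SpMV: y = A @ x via explicit CSR traversal."""
    y = np.zeros(A.shape[0], dtype=x.dtype)
    for row in range(A.shape[0]):
        start, end = A.indptr[row], A.indptr[row + 1]
        for k in range(start, end):
            y[row] += A.data[k] * x[A.indices[k]]
    return y


def greedy_column_packing(A: csr_matrix) -> list[list[int]]:
    """Greedily group columns whose nonzero row patterns do not overlap,
    so each group could occupy one physical column of a MAC array.
    (Hypothetical first-fit heuristic, for illustration only.)"""
    A_csc = A.tocsc()
    groups: list[tuple[set, list[int]]] = []  # (occupied rows, member columns)
    for col in range(A.shape[1]):
        rows = set(A_csc.indices[A_csc.indptr[col]:A_csc.indptr[col + 1]])
        for occupied, members in groups:
            if occupied.isdisjoint(rows):   # no collision: pack into this group
                occupied |= rows
                members.append(col)
                break
        else:                               # collision everywhere: open a new group
            groups.append((set(rows), [col]))
    return [members for _, members in groups]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = sparse_random(64, 64, density=0.05, format="csr", random_state=rng)
    x = rng.standard_normal(64)
    assert np.allclose(spmv_csr(A, x), A @ x)   # sanity-check the reference SpMV
    packed = greedy_column_packing(A)
    print(f"{A.shape[1]} columns packed into {len(packed)} groups")
```

On a 5%-dense matrix like the one above, such packing typically collapses the 64 logical columns into far fewer groups, which is the intuition behind feeding a "packed, denser" matrix to a TPU-like array; the paper's evaluation quantifies the real gains on synthetic and real-world matrices.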