Mohammed Elbtity, Peyton Chandarana, Brendan Reidy, Jason K. Eshraghian, Ramtin Zand
Source
Journal: IEEE Transactions on Circuits and Systems I: Regular Papers [Institute of Electrical and Electronics Engineers]
Date: 2022-09-23
Volume (Issue): 69 (12): 5135-5146
Citations: 9
Identifier
DOI:10.1109/tcsi.2022.3206262
Abstract
We propose an approximate tensor processing unit (APTPU), which includes two main components: (1) approximate processing elements (APEs), each consisting of a low-precision multiplier and an approximate adder, and (2) pre-approximate units (PAUs), which are shared among the APEs in the APTPU's systolic array and function as steering logic that pre-processes the operands and feeds them to the APEs. We conduct extensive experiments to evaluate the performance of the APTPU across various configurations and workloads. The results show that the APTPU's systolic array achieves up to $5.2\times$ and $4.4\times$ improvements in TOPS/mm$^{2}$ and TOPS/W, respectively, compared to a conventional systolic array design. A comparison between the proposed APTPU and in-house TPU designs shows that we can achieve approximately $2.5\times$ area reduction and $1.2\times$ power reduction while realizing comparable accuracy. Finally, a comparison with state-of-the-art approximate systolic arrays shows that the APTPU can realize up to $1.58\times$, $2\times$, and $1.78\times$ reductions in delay, power, and area, respectively, while using similar design specifications and synthesis constraints.
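The division of labor between PAUs and APEs can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's design: the abstract does not say which approximation schemes are used, so the code assumes that the pre-approximation step truncates low-order operand bits and that the approximate adder follows the well-known lower-part OR style. All function names and the parameters `keep_bits`, `low_bits`, and `width` are hypothetical.

```python
def pre_approximate(x: int, keep_bits: int, width: int = 8) -> int:
    """Hypothetical PAU step: keep the top `keep_bits` bits of an unsigned
    `width`-bit operand and zero the rest (assumed scheme, not from the paper)."""
    drop = width - keep_bits
    return (x >> drop) << drop


def approx_add(a: int, b: int, low_bits: int) -> int:
    """Assumed approximate adder in the lower-part OR style: the low
    `low_bits` bits are OR-ed (no carry is generated), the upper bits are
    added exactly."""
    mask = (1 << low_bits) - 1
    low = (a | b) & mask
    high = ((a >> low_bits) + (b >> low_bits)) << low_bits
    return high | low


def approx_matmul(A, B, keep_bits=4, low_bits=2, width=8):
    """Matrix multiply in which operands are pre-approximated once (the
    shared PAU role) and each multiply-accumulate uses the approximate
    adder (the APE role). Inputs are lists of unsigned integers."""
    # Pre-process every operand once, before it reaches any multiply-accumulate.
    A_p = [[pre_approximate(v, keep_bits, width) for v in row] for row in A]
    B_p = [[pre_approximate(v, keep_bits, width) for v in row] for row in B]
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):
                # Truncated operands stand in for the low-precision multiplier.
                acc = approx_add(acc, A_p[i][t] * B_p[t][j], low_bits)
            C[i][j] = acc
    return C
```

With `keep_bits=width` and `low_bits=0` the model reduces to an exact integer matrix multiply, which makes it easy to measure the error introduced by any other setting against an exact baseline.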