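Field-programmable gate array
Computer science
FLOPS
Convolutional neural network
Systolic array
Leverage (statistics)
Computer architecture
Throughput
Floating point
Adder
Computer engineering
Parallel computing
Inference
Deep learning
Clock rate
Embedded system
Artificial intelligence
Algorithm
Very-large-scale integration
Chip
Latency (audio)
Telecommunications
Wireless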
Authors
Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Hu Han, Yun Liang, Jason Cong
Identifier
DOI:10.1145/3061639.3062207
Abstract
Convolutional neural networks (CNNs) have been widely applied in many deep learning applications. In recent years, FPGA implementations of CNNs have attracted much attention because of their high performance and energy efficiency. However, existing implementations have difficulty fully leveraging the computation power of the latest FPGAs. In this paper, we implement CNNs on an FPGA using a systolic array architecture, which can achieve a high clock frequency even at high resource utilization. We provide an analytical model for performance and resource utilization and develop an automatic design space exploration framework, as well as a source-to-source code transformation from a C program to a CNN implementation using a systolic array. The experimental results show that our framework is able to generate accelerators for real-life CNN models, achieving up to 461 GFLOPS for the floating-point data type and 1.2 TOPS for 8-16 bit fixed-point data.
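To illustrate what a systolic array computes, below is a minimal cycle-level sketch in Python of a generic output-stationary 2D systolic array performing a matrix multiplication. It is not the authors' architecture or their generated code: the dataflow, the function names `systolic_matmul` and `peak_gops`, and the "one MAC = 2 operations per PE per cycle" peak-throughput formula are illustrative assumptions. Convolution layers are commonly lowered to matrix multiplication (for example via im2col) before being mapped onto such an array, but the paper's own mapping may differ.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def systolic_matmul(A: Matrix, B: Matrix) -> Tuple[Matrix, int]:
    """Simulate an R x N output-stationary systolic array computing A @ B.

    Hypothetical illustrative dataflow, not the paper's design:
    PE(i, j) keeps output[i][j] locally; rows of A enter from the left and
    columns of B enter from the top, each skewed by one cycle per row/column.
    Returns the product and the number of cycles the array needed.
    """
    R, K = len(A), len(A[0])
    if len(B) != K:
        raise ValueError("inner dimensions must match")
    N = len(B[0])
    acc = [[0] * N for _ in range(R)]    # per-PE accumulators (the outputs)
    a_reg = [[0] * N for _ in range(R)]  # value each PE passes to the right
    b_reg = [[0] * N for _ in range(R)]  # value each PE passes downward
    cycles = K + R + N - 2               # fill + drain of the skewed wavefront
    for t in range(cycles):
        new_a = [[0] * N for _ in range(R)]
        new_b = [[0] * N for _ in range(R)]
        for i in range(R):
            for j in range(N):
                # Left edge reads A[i][t - i]; inner PEs read their left neighbor.
                if j == 0:
                    k = t - i
                    a_in = A[i][k] if 0 <= k < K else 0
                else:
                    a_in = a_reg[i][j - 1]
                # Top edge reads B[t - j][j]; inner PEs read their upper neighbor.
                if i == 0:
                    k = t - j
                    b_in = B[k][j] if 0 <= k < K else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # one multiply-accumulate per PE per cycle
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b  # all PE registers update together (clock edge)
    return acc, cycles


def peak_gops(rows: int, cols: int, freq_mhz: float) -> float:
    """Hypothetical peak throughput in GOPS: each PE does 1 MAC (2 ops) per cycle."""
    return 2 * rows * cols * freq_mhz / 1000.0


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    out, n = systolic_matmul(A, B)
    print(out, n)                # [[58, 64], [139, 154]] 5
    print(peak_gops(2, 2, 100))  # 0.8
```

Each PE exchanges data only with its left and upper neighbors, so all wiring stays local as the array grows; this locality is the usual reason systolic designs can keep a high clock frequency at high resource utilization, the property the abstract relies on. The `peak_gops` figure is only an upper bound: the simulated array spends `R + N - 2` extra cycles filling and draining, which a real analytical model would have to account for.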