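Stratix
Dataflow
Computer science
Convolutional neural network
Loop unrolling
Field-programmable gate array
Convolution (computer science)
Parallel computing
Hardware acceleration
Computer hardware
Design space exploration
Computer engineering
Embedded system
Computer architecture
Artificial neural network
Artificial intelligence
Compiler
Programming language
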
Authors
Yufei Ma, Yu Cao, Sarma Vrudhula, Jae-sun Seo

Identifier
DOI:10.1109/tvlsi.2018.2815603

Abstract
As convolution accounts for most of the operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply-and-accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For the VGG-16 CNN, the overall throughput reaches 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
         
            
 
                 
                
                    