Computer science
Dataflow
Artificial intelligence
Compiler
Machine learning
Software
Artificial neural network
Bottleneck
Deep learning
Source code
Graphics
Computer architecture
Programming language
Parallel computing
Theoretical computer science
Embedded system
Authors
Ruohan Wu, Mingfan Li, Hanxi Li, Tianxiang Chen, Xinghui Tian, Xiaoxin Xu, Bin Zhou, Junshi Chen, Hong An
Identifier
DOI: 10.1109/hpcc-dss-smartcity-dependsys57074.2022.00038
Abstract
As innovations in deep learning systems and deep neural network (DNN) models continue to grow, accurate performance analysis acts as a promising tool for understanding and navigating the complex software-hardware interplay, especially for today's heterogeneous AI architectures. However, the actual execution of DNNs on dedicated accelerators involves challenges from nontrivial dataflow graph analysis, tensor compiler optimizations, and operator performance prediction. In this work, we propose a two-stage performance model framework that combines graph-level analysis and operator-based hotspot modeling to bridge the gap between high-level application performance and its software-hardware systems. By employing a machine learning (ML) solution, our performance model further captures low-level hardware-dependent information, including operator fusion and data layout transformation. Our graph analysis of mainstream models from the computer vision (CV), natural language processing (NLP), and recommendation domains selects a total of 26 kinds of operators and builds a dataset on the Huawei Ascend 910. With the well-trained model, our open-source performance model (source code available at https://github.com/Huawei-Performance-Model/Ascend-910b) finally achieves a 15.4% average error in predicting the execution time of DNN models, and our modeling of memory access and performance bottlenecks supports efficient running of DNN models on future systems.
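The two-stage idea in the abstract can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in rather than the paper's actual implementation: the Operator type, the toy fusion rule, and the per-operator-kind regressors (assumed to be pre-trained ML models with a scikit-learn-style predict method) are all assumptions for illustration. Stage 1 analyzes the dataflow graph (operator fusion), and stage 2 predicts each operator's latency and sums the estimates into a whole-model prediction.

# Minimal sketch of a two-stage performance model (all names hypothetical).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Operator:
    kind: str                        # e.g. "Conv2D", "MatMul" (one of the selected operator kinds)
    features: Dict[str, float]       # shape/layout features fed to the regressor
    inputs: List[int] = field(default_factory=list)

def fuse_operators(graph: List[Operator]) -> List[Operator]:
    # Stage 1 (simplified): fold elementwise ops into their producer,
    # mimicking a tensor-compiler fusion pass; real passes are richer.
    fused: List[Operator] = []
    for op in graph:
        if op.kind in {"Relu", "Add"} and fused:
            fused[-1].features["fused_" + op.kind.lower()] = 1.0
        else:
            fused.append(op)
    return fused

def predict_model_time(graph: List[Operator], regressors) -> float:
    # Stage 2: operator-based hotspot model; the whole-model estimate is
    # the sum of per-operator predictions (inter-operator overlap ignored).
    total = 0.0
    for op in fuse_operators(graph):
        x = [list(op.features.values())]
        total += float(regressors[op.kind].predict(x)[0])
    return total

A whole-model estimate then comes from predict_model_time(graph, regressors), where regressors maps each operator kind to a model trained on measured operator execution times, in the spirit of the dataset the authors built on the Ascend 910.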