Quantization (signal processing)
Inference
Computer science
Artificial intelligence
Machine learning
Algorithm
Identifier
DOI:10.1109/tc.2024.3398503
Abstract
Post-training quantization (PTQ) is a common technique for improving the efficiency of embedded neural network accelerators. Existing PTQ schemes for CNN activations usually rely on a calibration dataset with good data representativeness to reduce quantization overflow during inference, which is not always effective because the inference input data in practice vary widely and are uncertain. This paper proposes an adaptive PTQ method for activations (AQA), which monitors the quantization overflow of activations, adaptively updates the quantization parameters, and re-quantizes the activations on the fly when the overflow degree exceeds a threshold. The key challenge in implementing the AQA method is to limit the associated side effects: increased computational complexity, processing time, and hardware resource usage. We propose a series of design optimizations for the quantization overflow monitor, the quantization parameter update, and the re-quantization step that successfully address these challenges. The proposed AQA method is implemented in a CNN accelerator and evaluated with VGG16, ResNet18, and MobileNetV2 on several datasets. Experimental results show that the adaptation method keeps the models' inference accuracy stable over various quantization overflow degrees, whereas the static quantization method suffers significant accuracy degradation. The costs introduced by the adaptation method are a 5% increase in power consumption and a 4% throughput degradation.
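To make the monitor-update-requantize loop described in the abstract concrete, the following is a minimal sketch of adaptive activation quantization, assuming symmetric int8 quantization and an illustrative overflow-ratio threshold. The names (`OVERFLOW_THRESHOLD`, `adaptive_quantize`) and the scale-update rule are assumptions for illustration, not the paper's actual design.

```python
# Minimal sketch: adaptive activation quantization with overflow monitoring.
# Assumptions (not from the paper): symmetric int8 quantization, an overflow-ratio
# threshold that triggers re-quantization, and a max-abs scale-update rule.
import numpy as np

OVERFLOW_THRESHOLD = 0.01  # assumed fraction of clipped activations that triggers adaptation
QMAX = 127                 # symmetric int8 quantization range

def quantize(x, scale):
    """Quantize activations to int8 with the given scale; also report the overflow ratio."""
    q = np.round(x / scale)
    overflow_ratio = float(np.mean(np.abs(q) > QMAX))  # monitor how many values would clip
    return np.clip(q, -QMAX, QMAX).astype(np.int8), overflow_ratio

def adaptive_quantize(x, scale):
    """Quantize; if too many activations overflow, update the scale and re-quantize on the fly."""
    q, overflow_ratio = quantize(x, scale)
    if overflow_ratio > OVERFLOW_THRESHOLD:
        scale = np.max(np.abs(x)) / QMAX   # update the quantization parameter from the live data
        q, _ = quantize(x, scale)          # re-quantize with the adapted scale
    return q, scale

# Usage: activations whose range exceeds the calibrated scale trigger adaptation.
acts = np.random.randn(1, 64, 8, 8) * 3.0   # wider distribution than the calibration set assumed
calibrated_scale = 1.0 / QMAX               # scale derived offline from a calibration dataset
codes, new_scale = adaptive_quantize(acts, calibrated_scale)
print(f"adapted scale: {new_scale:.4f}")
```

In hardware, the paper's design optimizations aim to keep this monitoring and re-quantization overhead small; the sketch above only conveys the control flow, not the accelerator implementation.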