Authors
Miaohui Wang, Zhuowei Xu, Bin Zheng, Wuyuan Xie
Identifier
DOI:10.1109/tii.2024.3396520
Abstract
Vision Transformer (ViT) has recently demonstrated impressive nonlinear modeling capabilities and achieved state-of-the-art performance in various industrial applications, such as object recognition, anomaly detection, and robot control. However, its practical deployment can be hindered by high storage requirements and computational intensity. To alleviate these challenges, we propose a binary transformer called BinaryFormer, which quantizes the learned weights of the ViT module from 32-bit precision to 1 bit. Furthermore, we propose a hierarchical-adaptive architecture that replaces expensive matrix operations with more affordable addition and bit operations by switching between two attention modes. As a result, BinaryFormer effectively compresses the model size and reduces the computation cost of ViT. Experimental results on the ImageNet-1K benchmark dataset show that BinaryFormer reduces the size of a typical ViT model by an average of 27.7× and converts over 99% of multiplication operations into bit operations while maintaining reasonable accuracy.
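The core idea the abstract describes, quantizing 32-bit weights to a single bit so that matrix multiplication degenerates into additions and bit operations, can be illustrated with a minimal sketch. This is a generic XNOR-Net-style binarization (sign of each weight plus one per-matrix scaling factor), not BinaryFormer's exact scheme; the function names and the NumPy setting are assumptions for illustration only.

```python
import numpy as np

def binarize_weights(w):
    """Binarize a float32 weight matrix to {-1, +1}.

    Uses a per-matrix scaling factor alpha = mean(|w|), a common
    choice in binary networks (illustrative; BinaryFormer's actual
    quantizer may differ).
    """
    alpha = float(np.abs(w).mean())            # scalar scale factor
    wb = np.where(w >= 0, 1.0, -1.0).astype(np.float32)
    return wb, alpha

def binary_linear(x, wb, alpha):
    """Linear layer with binarized weights.

    Because wb contains only +/-1, x @ wb is really sign-dependent
    addition/subtraction (implementable with XNOR + popcount on
    packed bits); a plain matmul is used here for clarity.
    """
    return alpha * (x @ wb)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3)).astype(np.float32)  # full-precision weights
wb, alpha = binarize_weights(w)                 # 1-bit weights + scale
x = rng.normal(size=(2, 4)).astype(np.float32)  # input activations
y = binary_linear(x, wb, alpha)                 # approximate x @ w
```

Storing each weight as 1 bit instead of 32 gives an ideal 32× compression, which is consistent with the 27.7× average reported in the abstract once non-binarized layers and scaling factors are accounted for.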