DOI:10.1145/3582649.3582676
Abstract
In view of the following problems of the Vision Transformer (ViT) model: a large number of parameters, a lack of global modeling ability, and sensitivity to data augmentation, and inspired by MobileViT, we propose a lightweight classification model based on Convolutional Neural Networks (CNN) and the Vision Transformer (ViT): Efficient-ViT. By introducing the Squeeze-and-Excitation Block (SE-Block), Overlapping Patch Embedding (OPE), and Linear Spatial Reduction Attention (Linear SRA) modules, the model effectively encodes and integrates the local and global information of the input feature map while remaining compact. The local information of the feature map is processed by the CNN branch and the global information by the ViT branch, and the captured information is then fused by a simple operation. The proposed model combines the inductive bias of CNN models with the global modeling capability of ViT models, and can therefore learn better feature representations. Classification experiments were carried out on three datasets: CIFAR10, CIFAR100, and Stanford Cars. The experimental results show that the proposed method achieves better results, improving Top-1 accuracy by 3.32% (from 86.55% to 89.87%).
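Of the three modules listed, the Squeeze-and-Excitation Block is the simplest to illustrate: it pools each channel of a feature map to a single value ("squeeze"), passes the pooled vector through a small FC-ReLU-FC-sigmoid bottleneck ("excitation"), and rescales each channel by the resulting weight. The following is a minimal NumPy sketch of that mechanism; the function name, the random weights, and the reduction ratio `r` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    w1: (C // r, C) and w2: (C, C // r) are the bottleneck FC weights
    (random here purely for illustration).
    """
    z = x.mean(axis=(1, 2))                     # squeeze: global average pool -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # excitation: FC -> ReLU -> FC -> sigmoid
    return x * s[:, None, None]                 # scale each channel by its weight in (0, 1)

rng = np.random.default_rng(0)
c, r = 8, 4                                     # channels and reduction ratio (assumed)
x = rng.standard_normal((c, 6, 6))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
y = se_block(x, w1, w2)
print(y.shape)                                  # same shape as the input: (8, 6, 6)
```

Because the sigmoid keeps every channel weight strictly between 0 and 1, the block acts as a learned per-channel attenuation and leaves the spatial resolution unchanged, which is why it can be dropped into the CNN branch without affecting the rest of the architecture.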