Computer science
FLOPS
Convolutional neural network
Transformer
Artificial intelligence
Convolution (computer science)
High memory
Artificial neural network
Deep learning
Machine learning
Pattern recognition (psychology)
Parallel computing
Physics
Quantum mechanics
Voltage
Identifier
DOI: 10.1109/icme55011.2023.00206
Abstract
Deep Neural Networks (DNNs) have achieved extraordinary success in many visual recognition tasks. The Visual Transformer (ViT), which derives from Natural Language Processing (NLP), has achieved state-of-the-art (SOTA) results on many tasks thanks to its capability of capturing long-range dependencies in visual data. However, existing ViT models are difficult to deploy on devices due to their massive computational cost, large memory overhead, and reliance on large datasets. In this work, we address these issues by replacing some computationally expensive and memory-intensive modules in ViT with standard Convolutional Neural Network (CNN) modules. First, we propose an efficient self-attention module with linear space and time complexity, called SDG-Attention (SDGA), and an economical FeedForward Network (FFN) built from group convolution and channel shuffle, called SFFN. We then develop SDGFormer, a lightweight CNN model based on SDGA and SFFN that embraces several priors of ViT and is LayerNorm-free. We evaluate SDGFormer on ImageNet-1K and Mini-ImageNet: SDGFormer-S achieves a competitive top-1 accuracy of 77.6% on ImageNet-1K with only 9.1M parameters and 1.6 GFLOPs. Moreover, our SDGFormer-T achieves SOTA performance on Mini-ImageNet with 83.3% accuracy, demonstrating good generalization on small datasets without extra data.
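The abstract does not spell out how SDGA attains linear complexity, so the sketch below shows only the generic pattern such claims usually rest on: a kernelized ("efficient attention"-style) factorization in which keys and values are contracted first, giving cost linear in sequence length. The class name, head count, and normalization choices here are illustrative assumptions, not the paper's actual SDGA design.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Hypothetical linear-complexity self-attention sketch.

    Standard softmax attention forms an (N x N) score matrix, costing
    O(N^2) time and memory in the token count N. Contracting K^T @ V
    first yields a small (d_h x d_h) context per head, so the total
    cost is linear in N.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        h, dh = self.heads, d // self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Split into heads: (b, h, n, d_h).
        q = q.view(b, n, h, dh).transpose(1, 2)
        k = k.view(b, n, h, dh).transpose(1, 2)
        v = v.view(b, n, h, dh).transpose(1, 2)
        # Separate normalizations replace the joint softmax over scores.
        q = q.softmax(dim=-1)   # over feature dimension
        k = k.softmax(dim=-2)   # over token dimension
        # (b, h, d_h, d_h) context: the only reduction over n.
        context = k.transpose(-2, -1) @ v
        out = q @ context       # (b, h, n, d_h), again linear in n
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

# Example: 196 tokens (a 14x14 feature map) with 64 channels.
x = torch.randn(2, 196, 64)
y = LinearAttention(dim=64, heads=4)(x)  # -> (2, 196, 64)
```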
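The SFFN is described only as an FFN made of group convolution and channel shuffle. A minimal ShuffleNet-style reading of that description is sketched below; the module name, expansion ratio, group count, and activation are assumptions for illustration. Grouped 1x1 convolutions cut the FFN's parameter and FLOP count by roughly the group factor, and the shuffle restores cross-group information flow that grouping would otherwise block.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels across groups (ShuffleNet-style) so the next
    # grouped convolution sees features from every group.
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class SFFN(nn.Module):
    """Hypothetical economical FFN: grouped 1x1 conv -> activation ->
    channel shuffle -> grouped 1x1 conv, operating on (B, C, H, W) maps."""

    def __init__(self, dim: int, expansion: int = 4, groups: int = 4):
        super().__init__()
        hidden = dim * expansion
        assert dim % groups == 0 and hidden % groups == 0
        self.groups = groups
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1, groups=groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.fc1(x))
        x = channel_shuffle(x, self.groups)
        return self.fc2(x)

# Example: a 64-channel, 14x14 feature map.
x = torch.randn(2, 64, 14, 14)
y = SFFN(dim=64)(x)  # -> (2, 64, 14, 14)
```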