Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks

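Computer science; Computation; Artificial intelligence; Transformer; FLOPS; Security token; Convolutional neural network; Feature (linguistics); Pattern recognition (psychology); Algorithm; Parallel computing; Quantum mechanics; Linguistics; Physics; Philosophy; Computer security; Voltage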

Authors

Yongming Rao, Zuyan Liu, Wenliang Zhao, Jie Zhou, Jiwen Lu

Source

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence [IEEE Computer Society]
Date: 2023-04-03 Volume (issue): 45 (9): 10883-10897 Citations: 24

Links

arxiv.org, nih.gov, doi.org

Identifiers

DOI: 10.1109/tpami.2023.3263826

Abstract

In this paper, we present a new approach for model acceleration by exploiting spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based on only a subset of the most informative regions, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically, conditioned on the input, to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks. To handle structured feature maps, we formulate a generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and expressive slow paths to important locations, we can maintain the complete structure of feature maps while significantly reducing the overall computation. Extensive experiments on diverse modern architectures and different visual tasks demonstrate the effectiveness of our proposed framework. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31% to 35% and improves throughput by over 40%, while the accuracy drop stays within 0.5% for various vision Transformers. By introducing asymmetric computation, a similar acceleration can be achieved on modern CNNs and Swin Transformers. Moreover, our method achieves promising results on more complex tasks, including semantic segmentation and object detection. Our results clearly demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/DynamicViT.
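
The hierarchical pruning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a hypothetical linear scorer in place of the lightweight prediction module, three pruning stages, and a keep ratio of 0.7 per stage. These values are chosen only because 0.7^3 ≈ 0.34, which leaves roughly one third of the tokens and so matches the 66% pruning figure. All function names are illustrative.

```python
import random


def score_tokens(tokens, weights):
    """Hypothetical stand-in for the prediction module: one linear score per token."""
    return [sum(w * x for w, x in zip(weights, tok)) for tok in tokens]


def prune_tokens(tokens, weights, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    scores = score_tokens(tokens, weights)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_idx]


def progressive_prune(tokens, stage_weights, keep_ratio=0.7):
    """Apply one pruning step per stage and record the token count after each."""
    counts = [len(tokens)]
    for weights in stage_weights:
        tokens = prune_tokens(tokens, weights, keep_ratio)
        counts.append(len(tokens))
    return tokens, counts


random.seed(0)
dim, n_tokens = 8, 196  # assumed: a 14 x 14 patch grid with toy 8-dim features
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
stage_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

kept, counts = progressive_prune(tokens, stage_weights)
pruned_fraction = 1 - counts[-1] / counts[0]
print(counts, f"pruned {pruned_fraction:.0%} of tokens")
```

Because each stage scores only the tokens that survived the previous one, later stages do less work. The sketch simply discards the low-scoring tokens, which suits unstructured token sequences. For CNNs and hierarchical Transformers, the abstract instead describes routing them through a lightweight fast path so the feature map keeps its full spatial structure.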
