Authors
Wenhai Wang,Enze Xie,Xiang Li,Deng-Ping Fan,Kaitao Song,Liang Ding,Tong Lü,Ping Luo,Ling Shao
Identifier
DOI:10.1109/iccv48922.2021.00061
Abstract
Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision Transformer (ViT), which was designed specifically for image classification, we introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to the current state of the art. (1) Different from ViT, which typically yields low-resolution outputs and incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computations of large feature maps. (2) PVT inherits the advantages of both CNN and Transformer, making it a unified backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection and instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.
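The "progressive shrinking pyramid" mentioned in the abstract can be illustrated with a small arithmetic sketch. Assuming the common four-stage pyramid design with stage strides of 4, 2, 2, 2 (so feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution — a hypothetical configuration for illustration, not taken from this abstract), the token count attended over shrinks sharply at each stage, in contrast to a plain ViT that keeps one coarse grid throughout:

```python
# Sketch: token counts per stage of a PVT-style progressive shrinking
# pyramid vs. a plain ViT at the same input resolution.
# The stage strides (4, 2, 2, 2) are an assumed four-stage layout.

def pyramid_token_counts(h, w, strides=(4, 2, 2, 2)):
    """Return (height, width, tokens) for each pyramid stage."""
    stages = []
    for s in strides:
        h, w = h // s, w // s        # each stage downsamples by its stride
        stages.append((h, w, h * w))
    return stages

def vit_token_count(h, w, patch=16):
    """A plain ViT keeps a single low-resolution grid of 16x16 patches."""
    return (h // patch) * (w // patch)

if __name__ == "__main__":
    H = W = 224
    for i, (sh, sw, n) in enumerate(pyramid_token_counts(H, W), 1):
        print(f"stage {i}: {sh}x{sw} = {n} tokens")
    # The first stage (56x56 = 3136 tokens) gives the high-resolution
    # output needed for dense prediction; later stages are cheap.
    print("ViT (16x16 patches):", vit_token_count(H, W), "tokens")
```

This shows why the pyramid matters for dense prediction: the first stage retains a high-resolution (1/4-scale) feature map, while the shrinking keeps the attention cost of later stages small.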