Keywords
Failure; Computer science; Pixel; Security token; Transformer; Artificial intelligence; Computer vision; Pyramid (geometry); Pattern recognition (psychology); Mathematics; Engineering; Parallel computing; Geometry; Computer security; Electrical engineering; Voltage
Authors
Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Péter Vajda
Source
Venue: Cornell University - arXiv
Date: 2020-01-01
Citations: 288
Identifier
DOI: 10.48550/arXiv.2006.03677
Abstract
Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
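To make the two-stage pipeline in the abstract concrete, below is a minimal PyTorch sketch: a tokenizer that pools a pixel feature map into a small set of semantic visual tokens via content-based spatial attention, followed by a transformer that densely models relationships among those tokens. This is an illustrative sketch under our own assumptions, not the authors' released implementation; the module names (VisualTokenizer, VisualTransformerBlock), the num_tokens value, and all shapes are hypothetical.

# Illustrative sketch of the token-space idea from the abstract
# (not the paper's reference code; names and sizes are assumptions).
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Pools an HxW feature map into a small set of semantic visual
    tokens using one spatial-attention map per token."""
    def __init__(self, channels: int, num_tokens: int = 16):
        super().__init__()
        # One attention logit map per token, computed from pixel content.
        self.token_attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (B, C, H, W) from a convolutional backbone.
        attn = self.token_attn(feature_map).flatten(2).softmax(dim=-1)  # (B, L, H*W)
        pixels = feature_map.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        return attn @ pixels                                            # (B, L, C) tokens

class VisualTransformerBlock(nn.Module):
    """Runs self-attention over the compact token set rather than over
    all pixels, which is where the compute savings come from."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.encoder(tokens)

# Usage: tokenize a ResNet-style feature map, then relate tokens densely.
feats = torch.randn(2, 256, 14, 14)                    # (B, C, H, W)
tokens = VisualTokenizer(256, num_tokens=16)(feats)    # (2, 16, 256)
out = VisualTransformerBlock(256)(tokens)              # (2, 16, 256)
print(out.shape)

In this toy configuration the self-attention runs over 16 tokens instead of 14 x 14 = 196 pixels, so its quadratic cost shrinks by roughly (196/16)^2, about two orders of magnitude, which is consistent with the abstract's contrast against pixel-space transformers.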