Computer science
Scalability
Encoder
Artificial intelligence
Mask (illustration)
Representation (politics)
Image (mathematics)
Scaling
Pixel
Feature learning
Pattern recognition (psychology)
Computer vision
Machine learning
Database
Politics
Operating system
Geometry
Art
Visual arts
Mathematics
Law
Political science
Authors
Kaiming He,Xinlei Chen,Saining Xie,Yanghao Li,Piotr Dollár,Ross Girshick
Source
Journal: Cornell University - arXiv
Date: 2021-01-01
Citations: 59
Identifier
DOI:10.48550/arxiv.2111.06377
Abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
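The random patch masking described in the abstract (hide 75% of patches, feed only the visible 25% to the encoder) can be sketched as follows. This is a minimal illustration under assumed shapes, not the authors' implementation; the function name `random_masking` and the patch dimensions are assumptions for the example.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a random subset of patches.

    patches: (num_patches, dim) array of flattened image patches.
    Returns the visible patches, their indices, and a boolean mask
    where True marks a hidden (masked) patch.
    """
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))      # e.g. 25% of patches stay visible
    perm = rng.permutation(n)               # random shuffle of patch indices
    keep = np.sort(perm[:n_keep])           # indices of the visible subset
    mask = np.ones(n, dtype=bool)
    mask[keep] = False                      # False = visible to the encoder
    return patches[keep], keep, mask

# A 224x224 image split into 16x16 patches yields 14*14 = 196 patches,
# each flattened to 16*16*3 = 768 values.
patches = np.random.default_rng(1).standard_normal((196, 16 * 16 * 3))
visible, keep, mask = random_masking(patches, mask_ratio=0.75)
```

With a 75% mask ratio, the encoder processes only 49 of the 196 patches, which is what makes the asymmetric design fast; the lightweight decoder then receives the latent vectors plus mask tokens at the hidden positions to reconstruct the pixels.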