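Artificial intelligence
Computer science
Pattern recognition (psychology)
Normalization (sociology)
Histogram
Feature (linguistics)
Feature extraction
Computer vision
Speech recognition
Image (mathematics)
Anthropology
Linguistics
Philosophy
Sociology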
Authors
Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer
Identifier
DOI:10.1109/cvpr52688.2022.01426
Abstract
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pretrained on unlabeled videos achieves unprecedented results of 86.7% with MViTv2-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
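The masking and HOG-target idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`hog_cells`, `random_mask`, `masked_feature_loss`), the cell size, the number of orientation bins, the per-cell L2 normalization, the uniform random masking, and the mean-squared-error loss are all assumptions chosen for illustration, and the prediction here is a placeholder rather than the output of a Transformer.

```python
import numpy as np


def hog_cells(image, cell=8, bins=9, eps=1e-6):
    """Simplified HOG: one L2-normalized orientation histogram per cell.

    Assumption: per-cell L2 normalization stands in for the local
    contrast normalization mentioned in the abstract.
    """
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation, folded into [0, pi).
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    bin_index = np.minimum((angle / np.pi * bins).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            # Magnitude-weighted vote of each pixel into its orientation bin.
            hist[r, c] = np.bincount(bin_index[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    norm = np.linalg.norm(hist, axis=-1, keepdims=True)
    return hist / (norm + eps)


def random_mask(grid_shape, ratio, rng):
    """Boolean grid with round(ratio * cells) randomly chosen cells masked."""
    total = grid_shape[0] * grid_shape[1]
    flat = np.zeros(total, dtype=bool)
    chosen = rng.choice(total, size=int(round(ratio * total)), replace=False)
    flat[chosen] = True
    return flat.reshape(grid_shape)


def masked_feature_loss(predicted, target, mask):
    """Mean squared error computed over the masked cells only."""
    return float(np.mean((predicted[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
frame = rng.random((32, 32))             # stand-in for one grayscale frame
target = hog_cells(frame)                # HOG targets, shape (4, 4, 9)
mask = random_mask(target.shape[:2], ratio=0.4, rng=rng)
prediction = np.zeros_like(target)       # placeholder for a model's output
loss = masked_feature_loss(prediction, target, mask)
```

In an actual training setup the masked cells would be hidden from the model's input and `prediction` would come from the network, so that the loss only rewards recovering features the model could not see.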