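Computer science
Artificial intelligence
Multi-label classification
Convolutional neural network
Pattern recognition (psychology)
Inference
Contextual image classification
Transformer
Encoder
Exploit
Machine learning
Training set
Image (mathematics)
Quantum mechanics
Operating system
Physics
Voltage
Computer security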
Authors
Jack Lanchantin,Tianlu Wang,Vicente Ordóñez,Yanjun Qi
Identifier
DOI:10.1109/cvpr46437.2021.01621
Abstract
Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Transformer encoder trained to predict a set of target labels given an input set of masked labels, and visual features from a convolutional neural network. A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of the labels as positive, negative, or unknown during training. Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome. Moreover, because our model explicitly represents the label state during training, it is more general by allowing us to produce improved results for images with partial or extra label annotations during inference. We demonstrate this additional capability in the COCO, Visual Genome, News-500, and CUB image datasets.
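The ternary label state described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the integer codes (-1, 0, 1), the function name `mask_label_states`, and the choice to mask a fixed number of randomly selected labels are all assumptions made for illustration only.

```python
import random

# Hypothetical integer codes for the three label states (an assumption,
# not taken from the abstract).
UNKNOWN, NEGATIVE, POSITIVE = -1, 0, 1


def mask_label_states(labels, num_masked, rng):
    """Turn binary ground-truth labels into ternary states.

    `num_masked` randomly chosen positions are set to UNKNOWN; every other
    position keeps its true state (POSITIVE or NEGATIVE).
    Returns the state list and the sorted list of masked indices.
    """
    if not 0 <= num_masked <= len(labels):
        raise ValueError("num_masked must be between 0 and len(labels)")
    masked = set(rng.sample(range(len(labels)), num_masked))
    states = [
        UNKNOWN if i in masked else (POSITIVE if y else NEGATIVE)
        for i, y in enumerate(labels)
    ]
    return states, sorted(masked)


# Example: five candidate labels, two of them hidden from the model.
labels = [1, 0, 0, 1, 0]
states, masked = mask_label_states(labels, 2, random.Random(0))
print(states, masked)
```

Under these assumptions, a training step would feed `states` to the model together with the image features and ask it to predict the true labels at the `masked` positions. At inference time, the same encoding could carry partial annotations by marking known labels as positive or negative and leaving the rest unknown.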