Computer science
Artificial intelligence
Discriminative model
Feature (linguistics)
Benchmark (surveying)
Feature learning
Cognitive neuroscience of visual object recognition
Encoder
Task (project management)
Pattern recognition (psychology)
Contextual image classification
Set (abstract data type)
Computer vision
Feature extraction
Image (mathematics)
Machine learning
Philosophy
Linguistics
Management
Geodesy
Economics
Programming language
Geography
Operating system
Authors
Hongbo Sun,Xiangteng He,Jiahuan Zhou,Yuxin Peng
Identifier
DOI:10.1145/3581783.3612403
Abstract
Large-scale pre-trained vision-language (VL) models have shown powerful generic representation capabilities for adapting to downstream tasks with limited training data, offering data-efficient solutions to applications such as image recognition. To enhance adaptation performance, most existing methods introduce learnable vectors into the text prompt to generate adaptive classification weights for the classes in the downstream task. However, they generally focus on the text side while neglecting adaptive visual feature generation on the image side, which is insufficient to fit the downstream task data. In this paper, we propose fine-grained visual prompt learning (FG-VPL) of vision-language models for image recognition with few training samples, and the main contributions are: (1) A fine-grained visual prompt is introduced into the image encoder of the vision-language model to focus on the target object and conduct information interaction within the object, which facilitates generating discriminative visual features for image recognition. (2) A two-pathway adaptive recognition module is proposed to narrow the domain gap and utilize both the cross-modal knowledge of the vision-language model and the visual information of the few-sample training set for classifying images with the help of feature adapters. We conduct extensive experiments on 11 image recognition benchmark datasets under the few-training-samples setting, which demonstrate that our proposed approach achieves state-of-the-art performance. The code is available at https://github.com/PKU-ICST-MIPL/FG-VPL_ACMMM2023.
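To make the two-pathway idea in contribution (2) concrete, the sketch below shows one plausible way such a module could combine scores: a cross-modal pathway that compares the image feature against per-class text-prompt embeddings, and a visual pathway that scores the image against a cache of few-shot support features (in the spirit of cache-model adapters such as Tip-Adapter). This is a minimal illustration with NumPy, not the authors' implementation; the function name, the blending weight `alpha`, and the sharpness parameter `beta` are all assumptions introduced here for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def two_pathway_logits(image_feat, text_weights, support_feats, support_labels,
                       num_classes, alpha=0.5, beta=5.0):
    """Hypothetical two-pathway scoring for few-shot recognition.

    Pathway 1 (cross-modal): cosine similarity between the image feature
    and each class's text-prompt embedding from the VL model.
    Pathway 2 (visual): similarity to the few-shot support features,
    converted to class scores via their one-hot labels (cache-model style).
    `alpha` blends the two pathways; `beta` sharpens the support affinities.
    """
    img = l2_normalize(image_feat)                # (D,)
    txt = l2_normalize(text_weights, axis=1)      # (C, D)
    cross_modal = txt @ img                       # (C,) cosine scores

    sup = l2_normalize(support_feats, axis=1)     # (N, D)
    # Affinity decays exponentially with cosine distance to each support sample.
    affinity = np.exp(-beta * (1.0 - sup @ img))  # (N,)
    one_hot = np.eye(num_classes)[support_labels] # (N, C)
    visual = affinity @ one_hot                   # (C,) pooled per class

    return cross_modal + alpha * visual
```

A query whose feature closely matches both a class's text embedding and a support sample of that class receives a boosted score from both pathways, which is the intended complementarity of cross-modal knowledge and few-shot visual information.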