Computer science
Transformer
One-shot
Initialization
Artificial intelligence
Computation
Machine learning
Pattern recognition (psychology)
Contextual image classification
Adapter (computing)
Computer vision
Image (mathematics)
Computer hardware
Voltage
Algorithm
Engineering
Mechanical engineering
Electrical engineering
Programming language
Authors
Zhao Song,Ke Yang,Naiyang Guan,Jun Zhu,Qiao Peng,Qingyong Hu
Identifier
DOI:10.1109/icassp49357.2023.10095154
Abstract
Large-scale pre-trained transformers have recently achieved remarkable success in several computer vision tasks. However, fully fine-tuning these models for downstream tasks remains highly challenging due to the expensive computational and storage costs. Recently, Parameter-Efficient Tuning (PETuning) techniques, e.g., Visual Prompt Tuning (VPT), have significantly reduced the computation cost by inserting lightweight prompt modules, such as prompt tokens or adapter layers, into pre-trained models and tuning only these modules with a small number of trainable parameters, while keeping the transformer backbone frozen. Although encouraging results have been achieved, existing PETuning methods perform poorly under few-shot learning settings (i.e., extremely limited training data, with only 1 or 2 shots per class) due to the scarce supervision signal. To this end, we first empirically identify that the poor performance is mainly due to an inappropriate way of initializing the prompt modules, a finding that has also been verified in pre-trained language models. We then propose a Visual Pre-trained Prompt Tuning (VPPT) framework, which first pre-trains the prompt modules and then leverages these pre-trained modules, together with the pre-trained transformer backbone, to perform prompt tuning on downstream tasks. Extensive experiments show that our VPPT framework achieves an absolute improvement of 16.08% in average accuracy under the 1-shot setting on five fine-grained visual classification datasets, compared with previous PETuning techniques, e.g., VPT, in few-shot image classification.
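The mechanism the abstract describes, prepending a few learnable prompt tokens to a frozen transformer backbone so that only a tiny fraction of the parameters is trained, can be sketched in plain Python. This is a toy illustration under assumed shapes and names (`EMBED_DIM`, `NUM_PROMPTS`, `forward`, etc. are all invented here), not the authors' implementation:

```python
# Toy sketch of VPT-style parameter-efficient tuning: learnable prompt
# tokens are prepended to the input sequence of a frozen backbone.
# All names and dimensions are illustrative assumptions.
import random

EMBED_DIM = 8    # toy embedding dimension
NUM_PROMPTS = 2  # number of learnable prompt tokens

def make_backbone(num_layers=4, dim=EMBED_DIM):
    # "Pre-trained" frozen weights: one dim x dim matrix per layer.
    random.seed(0)
    return [[[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_layers)]

def matvec(w, x):
    # Multiply one layer's weight matrix by a single token vector.
    return [sum(w[i][j] * x[j] for j in range(len(x))) for i in range(len(w))]

def forward(backbone, tokens, prompts):
    # Prompt tokens are concatenated in front of the patch tokens; the
    # backbone weights themselves are never updated during tuning.
    seq = prompts + tokens
    for layer in backbone:
        seq = [matvec(layer, tok) for tok in seq]
    return seq

backbone = make_backbone()
prompts = [[0.0] * EMBED_DIM for _ in range(NUM_PROMPTS)]  # the ONLY trainable part
tokens = [[1.0] * EMBED_DIM for _ in range(3)]             # stand-in image patch tokens

frozen = sum(len(w) * len(w[0]) for w in backbone)  # 4 layers x 8 x 8 = 256
trainable = NUM_PROMPTS * EMBED_DIM                 # 2 x 8 = 16
print(f"frozen params: {frozen}, trainable params: {trainable}")
```

Even in this toy setting the trainable parameter count (16) is a small fraction of the frozen backbone (256); the paper's observation is that how these few prompt parameters are initialized (here naively zero-initialized) dominates few-shot performance, which is what VPPT's prompt pre-training addresses.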