Topics
Adapter (computing), Computer science, Task (project management), Projectile, Encoder, Artificial intelligence, Computer hardware, Operating system, Engineering, Chemistry, Systems engineering, Organic chemistry
Authors
Wen-bo Zhang, Yifan Zhang, Yuyang Deng, Wenlong Zhang, Jianfeng Lin, Binqiang Huang, Jinlu Zhang, Wenhao Yu
Identifier
DOI: 10.1016/j.patcog.2024.110559
Abstract
Contrastive Language-Image Pre-training (CLIP) has shown impressive zero-shot transfer capabilities, but its potential on specific downstream tasks is not fully exploited. To further enhance CLIP's few-shot capability on specific datasets, subsequent works have proposed methods based on lightweight adapters and prompt learning. However, since CLIP is pretrained on a diverse collection of image-text pairs sourced from the internet, it is difficult to sufficiently tune the model to a specific dataset with only lightweight adaptations. In this paper, we argue that more substantially modifying the internal representations within CLIP's encoders yields better results on downstream datasets. We therefore introduce Ta-Adapter, a method that equips both the visual and textual encoders of CLIP with task-specific prompts. These prompts are generated through a collaborative prompt learning approach, which allows the encoders to produce representations better aligned with a specific downstream dataset. We then initialize an adapter module with the optimized features produced by the task-aware visual encoder for further feature alignment; this module can also be fine-tuned. Extensive experiments on 11 image recognition datasets show that our model outperforms the state-of-the-art few-shot methods Tip-Adapter-F and MaPLe by average absolute gains of 2.04% and 1.62%, respectively. In conclusion, this work presents a unique and effective approach to unlocking the full potential of CLIP's few-shot learning capabilities.
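The abstract describes a two-stage design: task-specific prompts are learned collaboratively for both CLIP encoders, and an adapter module is then initialized from the task-aware visual features of the few-shot training set and optionally fine-tuned. The following is a minimal PyTorch sketch of that pipeline, written from the abstract alone; the names `TaAdapterSketch` and `StubEncoder`, the prompt count, and the Tip-Adapter-style cache form of the adapter (including the `alpha` and `beta` hyperparameters) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StubEncoder(nn.Module):
    """Stand-in for a frozen CLIP tower that accepts extra learnable prompt tokens."""

    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens, prompts):
        # Prepend the task-specific prompts to the token sequence and pool;
        # a real CLIP encoder would run its transformer blocks here instead.
        x = torch.cat([prompts.expand(tokens.size(0), -1, -1), tokens], dim=1)
        return F.normalize(self.proj(x.mean(dim=1)), dim=-1)


class TaAdapterSketch(nn.Module):
    """Illustrative pipeline: task-aware prompts for both encoders plus a fine-tunable adapter."""

    def __init__(self, num_classes, shots, num_prompts=4, dim=512):
        super().__init__()
        # Collaboratively learned, task-specific prompts for the visual and textual encoders.
        self.visual_prompts = nn.Parameter(0.02 * torch.randn(1, num_prompts, dim))
        self.text_prompts = nn.Parameter(0.02 * torch.randn(1, num_prompts, dim))
        self.visual_encoder = StubEncoder(dim)  # frozen CLIP image tower in practice
        self.text_encoder = StubEncoder(dim)    # frozen CLIP text tower in practice
        # Cache-style adapter in the spirit of Tip-Adapter: one key per few-shot image,
        # initialized from task-aware visual features and then optionally fine-tuned.
        self.adapter_keys = nn.Linear(dim, num_classes * shots, bias=False)
        self.register_buffer("adapter_values", torch.zeros(num_classes * shots, num_classes))

    @torch.no_grad()
    def init_adapter(self, support_tokens, support_labels):
        # Assumes the support set holds exactly num_classes * shots images.
        feats = self.visual_encoder(support_tokens, self.visual_prompts)   # (C*K, D)
        self.adapter_keys.weight.copy_(feats)
        self.adapter_values.copy_(F.one_hot(support_labels, self.adapter_values.size(1)).float())

    def forward(self, image_tokens, class_text_tokens, alpha=1.0, beta=5.5):
        img = self.visual_encoder(image_tokens, self.visual_prompts)       # (B, D)
        txt = self.text_encoder(class_text_tokens, self.text_prompts)      # (C, D)
        clip_logits = 100.0 * img @ txt.t()                                # CLIP zero-shot head
        affinity = self.adapter_keys(img)                                  # (B, C*K)
        cache_logits = torch.exp(-beta * (1.0 - affinity)) @ self.adapter_values
        return clip_logits + alpha * cache_logits


# Usage sketch: only the prompts and the adapter are trained; the encoders stay frozen.
model = TaAdapterSketch(num_classes=10, shots=16)
support_tokens = torch.randn(160, 8, 512)                 # 16 shots per class, 8 tokens each
support_labels = torch.arange(10).repeat_interleave(16)
model.init_adapter(support_tokens, support_labels)
logits = model(torch.randn(4, 8, 512), torch.randn(10, 8, 512))
```

In a faithful implementation the stub encoders would be CLIP's pretrained image and text towers with the prompts injected at their inputs (or intermediate layers), and training on the few-shot set would update only the prompts and the adapter keys.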