Authors
Haonan Wang, Mingwen Shao, Xiaodong Tan, Lixu Zhang
Identifier
DOI:10.1016/j.compeleceng.2024.109270
Abstract
Prompt learning has recently emerged as a promising method for fine-tuning vision-language models. By introducing prompts into the text encoder or image encoder, the pre-trained model can quickly adapt to downstream tasks without updating the pre-trained weights. However, prior multi-modal prompt tuning methods do not consider the difference in feature distributions between text and images, and adopt the same prompts for both encoders, thus achieving sub-optimal performance on downstream few-shot learning. In this paper, we propose Modal-Aware Prompt (MAP) to alleviate this issue. Specifically, considering the stability of text features, we design text-specific prompts, which acquire text class-related information from a general template (i.e., "a photo of a ") through unidirectional attention-based interaction. Additionally, considering the diversity of image features, we design visual-specific prompts that acquire image class-related information and adjust the image features through bidirectional attention-based interaction. To learn hierarchical prompt representations and reinforce the prompt features, we further propose a Deep Adaptive Feature Enhancement (DAFE) module that adaptively utilizes the prompt output of the preceding layer, combining instance-level and task-level information simultaneously. Combining these two designs, our method MAP-DAFE obtains state-of-the-art results on 11 image recognition datasets and has the fastest convergence rate among compared methods. This demonstrates that MAP-DAFE is both effective and efficient.
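The core distinction the abstract draws — unidirectional attention for text prompts (prompts read from the frozen template) versus bidirectional attention for visual prompts (prompts and image features update each other) — can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of the interaction pattern, not the authors' implementation: the function names, residual connections, and single-head dot-product attention are all assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Scaled dot-product attention: each query row attends over context rows.
    # query: (m, d), context: (n, d) -> output: (m, d)
    d = query.shape[-1]
    attn = softmax(query @ context.T / np.sqrt(d))
    return attn @ context

def text_prompt_interaction(prompts, template_feats):
    # Unidirectional: text prompts gather class-related information from the
    # general template features; the template itself is left unchanged.
    return prompts + cross_attention(prompts, template_feats)

def visual_prompt_interaction(prompts, image_feats):
    # Bidirectional: visual prompts attend to image features AND the image
    # features are adjusted by attending back to the prompts.
    new_prompts = prompts + cross_attention(prompts, image_feats)
    new_feats = image_feats + cross_attention(image_feats, prompts)
    return new_prompts, new_feats
```

In this sketch the asymmetry is purely in the update direction: the template features are treated as a fixed source of information on the text side, while on the image side both streams are refined, matching the stability-versus-diversity motivation stated in the abstract.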