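Computer science
Encoder
Embedding
Identification (biology)
Feature (linguistics)
Artificial intelligence
Image (mathematics)
Code (set theory)
Segmentation
Exploit
Autoencoder
Representation (politics)
Key (lock)
Set (abstract data type)
Natural language processing
Computer vision
Pattern recognition (psychology)
Deep learning
Programming language
Linguistics
Philosophy
Biology
Computer security
Botany
Law
Operating system
Politics
Political science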
Authors
Siyuan Liu,Sun Li,Qingli Li
Source
Journal: Cornell University - arXiv
Date: 2022-11-25
Identifier
DOI:10.48550/arxiv.2211.13977
Abstract
Pre-trained vision-language models like CLIP have recently shown superior performance on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes that lack concrete text descriptions, so it remains to be determined how such models could be applied to these tasks. This paper first finds that simply fine-tuning the visual model initialized by the image encoder in CLIP already obtains competitive performance in various ReID tasks. We then propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability of CLIP through a set of learnable text tokens for each ID, which are given to the text encoder to form ambiguous descriptions. In the first training stage, the image and text encoders from CLIP are kept fixed, and only the text tokens are optimized from scratch by the contrastive loss computed within a batch. In the second stage, the ID-specific text tokens and their encoder become static, providing constraints for fine-tuning the image encoder. With the help of the designed loss in the downstream task, the image encoder is able to accurately represent data as vectors in the feature embedding space. The effectiveness of the proposed strategy is validated on several datasets for person and vehicle ReID tasks. Code is available at https://github.com/Syliz517/CLIP-ReID.
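The two-stage schedule described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). The names `trainable_parts` and `in_batch_contrastive_loss` are hypothetical, the temperature of 0.07 is an assumed value, and the loss is a simplified image-to-text form that assumes exactly one matching text feature per image in the batch.

```python
import math


def trainable_parts(stage):
    """Return which components are updated in each stage, as the abstract describes.

    The component names are hypothetical labels used only for illustration.
    """
    if stage == 1:
        # Stage 1: CLIP image and text encoders stay fixed; only text tokens learn.
        return {"text_tokens"}
    if stage == 2:
        # Stage 2: text tokens and text encoder are static; the image encoder is fine-tuned.
        return {"image_encoder"}
    raise ValueError("stage must be 1 or 2")


def _normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def in_batch_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Simplified image-to-text contrastive loss over one batch.

    Assumption: row i of image_feats is paired with row i of text_feats,
    and every other row in the batch acts as a negative.
    """
    imgs = [_normalize(v) for v in image_feats]
    txts = [_normalize(v) for v in text_feats]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i to every text feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in txts]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching text (index i) as the target.
        total += log_denom - logits[i]
    return total / len(imgs)
```

In this sketch, features that line up with their paired text give a loss near zero, while mismatched pairs give a large loss; that signal is what would drive the text tokens in the first stage while both encoders stay frozen.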