Computer science
Artificial intelligence
Natural language processing
Benchmark (surveying)
Swap (finance)
Unsupervised learning
Coding (set theory)
Matching (statistics)
Identification (biology)
Representation (politics)
Modal verb
Rank (graph theory)
Machine learning
Feature learning
Pattern recognition (psychology)
Geography
Programming language
Biology
Economics
Law
Polymer chemistry
Political science
Politics
Combinatorics
Set (abstract data type)
Finance
Geodesy
Mathematics
Chemistry
Plant
Statistics
Authors
Zhong Chen, Zhizhong Zhang, Xin Tan, Yanyun Qu, Yuan Xie
Identifier
DOI:10.1145/3581783.3612050
Abstract
Large-scale Vision-Language Pre-training (VLP) models, e.g., CLIP, have demonstrated a natural advantage in generating textual descriptions for images. These textual descriptions afford greater semantic supervision insight while not requiring any domain knowledge. In this paper, we propose a new prompt learning paradigm for unsupervised visible-infrared person re-identification (USL-VI-ReID) that takes full advantage of the visual-text representation ability of CLIP. In our framework, we establish a learnable cluster-aware prompt for person images and obtain textual descriptions that allow for subsequent unsupervised training. These descriptions complement the rigid pseudo-labels and provide an important semantic supervision signal. On that basis, we propose a new memory-swapping contrastive learning scheme, in which we first find the correlated cross-modal prototypes with the Hungarian matching method and then swap the prototype pairs in the memory. Thus, typical contrastive learning can associate cross-modal information without any modification. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method. For example, on SYSU-MM01 we reach 54.0% Rank-1 accuracy, over 9% improvement against state-of-the-art approaches. Code is available at https://github.com/CzAngus/CCLNet.
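The memory-swapping step described in the abstract can be illustrated with a minimal sketch (this is not the authors' implementation; tensor names, shapes, and the temperature value are illustrative assumptions, and the reference code is at the GitHub link above): cluster prototypes of the two modalities are matched by the Hungarian algorithm, the matched pairs are swapped between the visible and infrared memory banks, and an otherwise unchanged prototype-level contrastive loss then pulls each feature toward the matched prototype of the other modality.

```python
# Minimal sketch of memory-swapping contrastive learning, assuming L2-normalized
# cluster prototypes for each modality. Names and values are illustrative only;
# see https://github.com/CzAngus/CCLNet for the reference implementation.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def swap_matched_prototypes(vis_memory: torch.Tensor, ir_memory: torch.Tensor):
    """Hungarian-match the two prototype sets and swap the matched pairs.

    vis_memory: (K, d) L2-normalized visible cluster prototypes
    ir_memory:  (K, d) L2-normalized infrared cluster prototypes
    """
    # Cost = negative cosine similarity; minimizing cost maximizes similarity.
    cost = -(vis_memory @ ir_memory.t()).cpu().numpy()
    vis_idx, ir_idx = linear_sum_assignment(cost)
    vis_idx = torch.as_tensor(vis_idx, dtype=torch.long)
    ir_idx = torch.as_tensor(ir_idx, dtype=torch.long)

    swapped_vis = vis_memory.clone()
    swapped_ir = ir_memory.clone()
    swapped_vis[vis_idx] = ir_memory[ir_idx]  # visible slots now hold matched IR prototypes
    swapped_ir[ir_idx] = vis_memory[vis_idx]  # and vice versa
    return swapped_vis, swapped_ir


def memory_contrastive_loss(features, pseudo_labels, memory, temperature=0.05):
    """Plain prototype-level InfoNCE loss against a (possibly swapped) memory."""
    logits = features @ memory.t() / temperature  # (B, K)
    return F.cross_entropy(logits, pseudo_labels)


if __name__ == "__main__":
    K, d, B = 8, 16, 4
    vis_mem = F.normalize(torch.randn(K, d), dim=1)
    ir_mem = F.normalize(torch.randn(K, d), dim=1)
    vis_mem_sw, ir_mem_sw = swap_matched_prototypes(vis_mem, ir_mem)

    feats = F.normalize(torch.randn(B, d), dim=1)  # batch of visible features
    labels = torch.randint(0, K, (B,))             # cluster pseudo-labels
    # After the swap, the unchanged loss pulls a visible feature toward the
    # infrared prototype of its matched cluster, associating the two modalities.
    print(memory_contrastive_loss(feats, labels, vis_mem_sw).item())
```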