Computer Science
Modality
Artificial Intelligence
Natural Language Processing
Information Retrieval
Pattern Recognition
Authors
Zongyi Li, Jianbo Li, Yuxuan Shi, Hefei Ling, Jiazhong Chen, Runsheng Wang, Shijuan Huang
Identifier
DOI:10.24963/ijcai.2024/116
Abstract
Text-based Person Search aims to retrieve a specified person using a given text query. Current methods predominantly rely on paired, labeled image-text data to train the cross-modality retrieval model, necessitating laborious and time-consuming annotation. In response to this challenge, we present the Cross-modal Generation and Alignment via Attribute-guided Prompt (GAAP) framework for fully unsupervised text-based person search, which uses only unlabeled images. The proposed GAAP framework consists of two key parts: an Attribute-guided Prompt Caption Generation module and an Attribute-guided Cross-modal Alignment module. The Attribute-guided Prompt Caption Generation module produces pseudo text labels by feeding attribute prompts into a large-scale pre-trained vision-language model. These synthetic texts are then filtered by a sample-selection step, ensuring their reliability for subsequent fine-tuning. The Attribute-guided Cross-modal Alignment module comprises three sub-modules for aligning features across modalities. First, Cross-Modal Center Alignment (CMCA) aligns samples with the centroids of the different modalities. Next, to address the ambiguity that arises when local attributes are similar, an Attribute-guided Image-Text Contrastive Learning (AITC) module aligns the relationships among different pairs by taking local attribute similarities into account. Last, an Attribute-guided Image-Text Matching (AITM) module mitigates the noise in pseudo captions by using the image-attribute matching score to soften the hard matching labels. Empirical results demonstrate the effectiveness of our method on various text-based person search datasets under the fully unsupervised setting.
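The caption-generation side of the framework rests on scoring attribute prompts against unlabeled images with a pre-trained vision-language model. The abstract gives no code or attribute vocabulary, so the following is only a minimal sketch of that prompt-scoring step, assuming a CLIP backbone from Hugging Face transformers and a hypothetical attribute prompt list; it is not the authors' pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical attribute vocabulary; the paper's actual attribute set and
# prompt templates are not specified in the abstract.
ATTRIBUTE_PROMPTS = [
    "a photo of a person wearing a red jacket",
    "a photo of a person wearing a white shirt",
    "a photo of a person carrying a black backpack",
    "a photo of a person with long hair",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_attributes(image: Image.Image, top_k: int = 2):
    """Score every attribute prompt against one unlabeled image and keep the
    most confident matches, which could then seed a pseudo caption."""
    inputs = processor(text=ATTRIBUTE_PROMPTS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_prompts)
    probs = logits.softmax(dim=-1)[0]
    scores, idx = probs.topk(top_k)
    return [(ATTRIBUTE_PROMPTS[i], s.item()) for i, s in zip(idx.tolist(), scores)]
```

In the full framework these attribute scores would feed the caption generator and the sample-selection filter described above; here they only illustrate how attribute prompts can be scored by an off-the-shelf vision-language model.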
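On the alignment side, both AITC and AITM amount to replacing the one-hot targets of a standard image-text contrastive or matching loss with targets softened by attribute similarity. Below is a minimal PyTorch sketch of that label-softening idea, assuming precomputed embeddings and a symmetric pairwise attribute-similarity matrix `attr_sim` (all names and the blending scheme are assumptions, not the authors' exact loss).

```python
import torch
import torch.nn.functional as F

def soft_target_contrastive_loss(img_emb, txt_emb, attr_sim, tau=0.07, alpha=0.2):
    """InfoNCE-style loss whose one-hot targets are softened by attribute similarity.

    img_emb, txt_emb: (B, D) image / pseudo-caption embeddings.
    attr_sim: (B, B) pairwise attribute-similarity matrix (assumed symmetric here).
    alpha: probability mass moved from the hard diagonal targets onto
           attribute-similar off-diagonal pairs.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                     # (B, B) similarity logits

    hard = torch.eye(logits.size(0), device=logits.device)   # exact pseudo pairs
    soft = F.softmax(attr_sim / tau, dim=-1)                 # attribute-level relations
    targets = (1.0 - alpha) * hard + alpha * soft            # softened matching labels

    loss_i2t = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With alpha set to 0 this reduces to a plain symmetric contrastive loss over the pseudo pairs; a nonzero alpha lets images that share local attributes tolerate partial matches, which is the intuition the abstract attributes to AITC and AITM.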