Granularity
Computer science
Modality (human–computer interaction)
Pattern
Unification
Modal verb
Artificial intelligence
Feature (linguistics)
Identification (biology)
Natural language processing
Machine learning
Pattern recognition (psychology)
Information retrieval
Programming language
Sociology
Polymer chemistry
Chemistry
Philosophy
Biology
Botany
Linguistics
Social science
Authors
Zhiyin Shao,Xinyu Zhang,Meng Fang,Zhifeng Lin,Jian Wang,Changxing Ding
Identifier
DOI:10.1145/3503161.3548028
Abstract
Text-to-image person re-identification (ReID) aims to search for pedestrian images of a target identity via textual descriptions. It is challenging due to both rich intra-modal variations and significant inter-modal gaps. Existing works usually ignore the difference in feature granularity between the two modalities, i.e., visual features are usually fine-grained while textual features are coarse, which is mainly responsible for the large inter-modal gaps. In this paper, we propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR. The LGUR framework contains two modules: a Dictionary-based Granularity Alignment (DGA) module and a Prototype-based Granularity Unification (PGU) module. In DGA, to align the granularities of the two modalities, we introduce a Multi-modality Shared Dictionary (MSD) to reconstruct both visual and textual features. Besides, DGA includes two important factors, i.e., cross-modality guidance and foreground-centric reconstruction, to facilitate the optimization of the MSD. In PGU, we adopt a set of shared and learnable prototypes as queries to extract diverse and semantically aligned features for both modalities in the granularity-unified feature space, which further promotes ReID performance. Comprehensive experiments show that our LGUR consistently outperforms state-of-the-art methods by large margins on both the CUHK-PEDES and ICFG-PEDES datasets. Code will be released at https://github.com/ZhiyinShao-H/LGUR.
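Both modules described in the abstract can be read as attention over a shared set of learnable vectors: DGA re-expresses each modality's tokens in terms of shared dictionary atoms, and PGU pools part features with shared prototype queries. Below is a minimal PyTorch-style sketch of that idea, assuming standard multi-head cross-attention; the class names, dimensions, and atom/prototype counts are hypothetical illustrations, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class SharedDictionaryReconstruction(nn.Module):
    """Sketch of dictionary-based reconstruction (DGA/MSD-style, hypothetical):
    tokens from either modality are rebuilt as combinations of atoms shared
    across modalities, pushing both toward a common granularity."""

    def __init__(self, dim=384, num_atoms=400, num_heads=8):
        super().__init__()
        self.atoms = nn.Parameter(torch.randn(num_atoms, dim) * 0.02)  # shared atoms
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, N, dim) token features from the visual or textual branch.
        atoms = self.atoms.unsqueeze(0).expand(feats.size(0), -1, -1)
        # Each token attends to the shared atoms and is re-expressed as a
        # weighted combination of them.
        recon, _ = self.attn(query=feats, key=atoms, value=atoms)
        return recon  # (B, N, dim)

class PrototypeExtraction(nn.Module):
    """Sketch of prototype-based pooling (PGU-style, hypothetical): shared
    learnable prototypes act as queries that extract semantically aligned
    part features from each modality."""

    def __init__(self, dim=384, num_prototypes=6, num_heads=8):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, N, dim) granularity-aligned tokens.
        queries = self.prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
        parts, _ = self.attn(query=queries, key=feats, value=feats)
        return parts  # (B, num_prototypes, dim), same queries for both modalities

# Usage sketch: run both modalities through the same shared modules.
msd, pgu = SharedDictionaryReconstruction(), PrototypeExtraction()
img_parts = pgu(msd(torch.randn(2, 24, 384)))  # visual tokens (toy shapes)
txt_parts = pgu(msd(torch.randn(2, 64, 384)))  # textual tokens (toy shapes)
```

Because the dictionary atoms and prototypes are shared, the visual and textual branches end up expressed over the same vocabulary, which is the intuition behind the granularity-unified feature space claimed in the paper.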