Authors
Qiang Liu,X. T. He,Qizhi Teng,Linbo Qing,Honggang Chen
Identifier
DOI:10.1016/j.patcog.2023.109636
Abstract
Text-to-image person re-identification (TI-ReID) aims to retrieve a specific person from a gallery given a descriptive sentence. The task is challenging due to the large modality gap between images and text descriptions. Most current approaches combine global and local features to obtain more fine-grained representations. However, these methods usually extract local features with the help of human-pose or segmentation models, which makes them difficult to use in realistic scenarios because of the additional models or complex training and evaluation strategies they introduce. To facilitate practical application, we propose a BERT-based dual-path framework for TI-ReID. Without the help of additional models, our approach applies visual attention directly within the global feature-extraction network, allowing the network to adaptively focus on salient local regions of images and salient parts of text descriptions; this strengthens the network's attention to local information and thereby improves the global feature representation. In addition, to learn modality-invariant feature representations for text and images, we propose a convolutional shared network (CSN) that learns image and text features jointly. To optimize cross-modal feature distances more effectively, we propose a hybrid-modal triplet global metric loss. Beyond combining local and global metric learning, we also introduce the CMPM and CMPC losses to jointly optimize the proposed model. Extensive experiments on the CUHK-PEDES dataset show that the proposed method performs significantly better than current results, achieving a Rank-1/mAP accuracy of 66.27%/57.04%.
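To make the abstract's architecture concrete, below is a minimal PyTorch sketch of the dual-path idea: a BERT text branch, a CNN image branch with a simple learned spatial attention (standing in for the paper's visual attention, with no pose or segmentation model), a weight-shared convolutional block applied to both modalities (an interpretation of the CSN), and a triplet loss computed over the union of image and text embeddings (an interpretation of the hybrid-modal triplet global metric loss). All module sizes, the attention design, the pooling, and the margin are illustrative assumptions, not the authors' exact implementation; `bert-base-uncased` and ResNet-50 are stand-ins for the paper's encoders, and the CMPM/CMPC losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import BertModel


class SharedConvNet(nn.Module):
    """Convolutional block applied to BOTH modalities with shared weights,
    pushing image and text features toward a modality-invariant space
    (a guess at the CSN's role)."""
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=1),
            nn.BatchNorm1d(dim),
            nn.ReLU(inplace=True),
            nn.Conv1d(dim, dim, kernel_size=1),
        )

    def forward(self, x):                      # x: (B, dim)
        return self.conv(x.unsqueeze(-1)).squeeze(-1)


class DualPathTIReID(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        cnn = resnet50(weights="IMAGENET1K_V1")
        # Drop avgpool/fc to keep the spatial feature map.
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])
        # Simple 1x1-conv spatial attention: the image path learns to
        # weight salient local regions without any auxiliary model.
        self.attn = nn.Conv2d(2048, 1, kernel_size=1)
        self.img_proj = nn.Linear(2048, embed_dim)
        self.shared = SharedConvNet(embed_dim)

    def encode_image(self, images):            # images: (B, 3, H, W)
        fmap = self.backbone(images)           # (B, 2048, h, w)
        w = torch.sigmoid(self.attn(fmap))     # (B, 1, h, w) attention map
        feat = (fmap * w).flatten(2).mean(-1)  # attention-weighted pooling
        return self.shared(self.img_proj(feat))

    def encode_text(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.shared(out.pooler_output)  # (B, embed_dim)


def hybrid_modal_triplet_loss(img, txt, labels, margin=0.3):
    """Batch-hard triplet loss over the UNION of image and text embeddings,
    so anchors, positives, and negatives may come from either modality
    ('hybrid-modal'). The paper's exact formulation may differ."""
    feats = F.normalize(torch.cat([img, txt]), dim=1)
    ids = torch.cat([labels, labels])
    dist = torch.cdist(feats, feats)                  # pairwise distances
    same = ids.unsqueeze(0) == ids.unsqueeze(1)       # same-identity mask
    pos = (dist + (~same) * -1e9).max(1).values       # hardest positive
    neg = (dist + same.float() * 1e9).min(1).values   # hardest negative
    return F.relu(pos - neg + margin).mean()
```

In this reading, sharing `SharedConvNet` across both `encode_*` paths is what ties the modalities together: gradients from cross-modal losses flow through the same weights for images and text, encouraging a common embedding space, while the per-modality encoders stay free to handle their own input formats.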