Authors
Qiang Liu,X. T. He,Qizhi Teng,Linbo Qing,Honggang Chen
Identifier
DOI:10.1016/j.patcog.2023.109636
Abstract
Text-to-image person re-identification (TI-ReID) aims to retrieve a specific person from a gallery given a descriptive sentence. The task is challenging due to the large modality gap between images and text descriptions. Most current approaches combine global and local features to obtain more fine-grained representations. However, these methods usually extract local features with the help of human-pose or segmentation models, which makes them difficult to use in realistic scenarios because of the additional models or complex training and evaluation strategies they introduce. To facilitate practical application, we propose a BERT-based dual-path framework for TI-ReID. Without the help of additional models, our approach applies visual attention directly within the global feature-extraction network, allowing the network to adaptively focus on salient local regions of images and salient parts of text descriptions; this strengthens the network's attention to local information and thereby improves the global feature representation. In addition, to learn modality-invariant feature representations for text and images, we propose a convolutional shared network (CSN) that learns image and text features jointly. To optimize cross-modal feature distances more effectively, we propose a hybrid-modal triplet global metric loss. Beyond combining local and global metric learning, we also introduce the CMPM and CMPC losses to jointly optimize the proposed model. Extensive experiments on the CUHK-PEDES dataset show that the proposed method performs significantly better than current results, achieving a Rank-1/mAP accuracy of 66.27%/57.04%.
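To make the abstract's architecture concrete, below is a minimal PyTorch sketch of the dual-path idea: a BERT text branch, a CNN image branch with a simple learned spatial attention (standing in for the paper's visual attention, with no pose or segmentation model), a weight-shared convolutional block applied to both modalities (an interpretation of the CSN), and a triplet loss computed over the union of image and text embeddings (an interpretation of the hybrid-modal triplet global metric loss). All module sizes, the attention design, the pooling, and the margin are illustrative assumptions, not the authors' exact implementation; `bert-base-uncased` and ResNet-50 are stand-ins for the paper's encoders, and the CMPM/CMPC losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import BertModel


class SharedConvNet(nn.Module):
    """Convolutional block applied to BOTH modalities with shared weights,
    pushing image and text features toward a modality-invariant space
    (a guess at the CSN's role)."""
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=1),
            nn.BatchNorm1d(dim),
            nn.ReLU(inplace=True),
            nn.Conv1d(dim, dim, kernel_size=1),
        )

    def forward(self, x):                      # x: (B, dim)
        return self.conv(x.unsqueeze(-1)).squeeze(-1)


class DualPathTIReID(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        cnn = resnet50(weights="IMAGENET1K_V1")
        # Drop avgpool/fc to keep the spatial feature map.
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])
        # Simple 1x1-conv spatial attention: the image path learns to
        # weight salient local regions without any auxiliary model.
        self.attn = nn.Conv2d(2048, 1, kernel_size=1)
        self.img_proj = nn.Linear(2048, embed_dim)
        self.shared = SharedConvNet(embed_dim)

    def encode_image(self, images):            # images: (B, 3, H, W)
        fmap = self.backbone(images)           # (B, 2048, h, w)
        w = torch.sigmoid(self.attn(fmap))     # (B, 1, h, w) attention map
        feat = (fmap * w).flatten(2).mean(-1)  # attention-weighted pooling
        return self.shared(self.img_proj(feat))

    def encode_text(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.shared(out.pooler_output)  # (B, embed_dim)


def hybrid_modal_triplet_loss(img, txt, labels, margin=0.3):
    """Batch-hard triplet loss over the UNION of image and text embeddings,
    so anchors, positives, and negatives may come from either modality
    ('hybrid-modal'). The paper's exact formulation may differ."""
    feats = F.normalize(torch.cat([img, txt]), dim=1)
    ids = torch.cat([labels, labels])
    dist = torch.cdist(feats, feats)                  # pairwise distances
    same = ids.unsqueeze(0) == ids.unsqueeze(1)       # same-identity mask
    pos = (dist + (~same) * -1e9).max(1).values       # hardest positive
    neg = (dist + same.float() * 1e9).min(1).values   # hardest negative
    return F.relu(pos - neg + margin).mean()
```

In this reading, sharing `SharedConvNet` across both `encode_*` paths is what ties the modalities together: gradients from cross-modal losses flow through the same weights for images and text, encouraging a common embedding space, while the per-modality encoders stay free to handle their own input formats.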