Multi-scale feature fusion network for person re-identification

Yongjie Wang (School of Microelectronics, Tianjin University, Tianjin 300072, People's Republic of China; The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, People's Republic of China), Wei Zhang (corresponding author, tjuzhangwei@tju.edu.cn; School of Microelectronics, Tianjin University, Tianjin 300072, People's Republic of China), Yanyan Liu (Key Laboratory of Photoelectronic Thin Film Devices and Technology of Tianjin, Nankai University, Tianjin 300071, People's Republic of China)

IET Image Processing, Volume 14, Issue 17, pp. 4614-4620. First published: 25 February 2021. https://doi.org/10.1049/iet-ipr.2020.0008

Abstract

Person re-identification has become a challenging task owing to occlusion, blurring and posture variation. The key to effective person re-identification is to capture sufficiently detailed features of a person's appearance in images. Different from previous methods, our method focuses on fusing different visual cues relying only on features of different levels and scales, without additional assistance. The major contributions of this paper are a mixed pooling strategy with different kernels and a mixed loss function. First, we adopt ResNet50 as our backbone and modify it slightly so that the down-sampling operation at the beginning of stage 4 is not used. Inspired by the pyramid pooling structure, we pass the outputs of Res4 and Res5 separately through average pooling and max pooling layers with different kernels and strides. Second, we combine the averaged triplet losses and the averaged softmax losses as the final loss of the whole network. Extensive experiments on three data sets (CUHK03, Market1501 and DukeMTMC-reID) show that our model achieves higher accuracy than many state-of-the-art methods of recent years.

1 Introduction

Person re-identification is a computer vision technique whose aim is to identify the same person across different surveillance cameras.
Person re-identification therefore belongs to the class of image retrieval problems: given an image of a person captured by one camera, the goal is to find the same person in images taken by other cameras. In practice, person re-identification is usually combined with person detection in real-world applications, which compensates for the limited field of view of fixed cameras. With the continuous development of convolutional neural networks (CNNs), CNN-based methods for person re-identification have emerged one after another in recent years. Although these methods have achieved good results, many challenges remain. One of the main challenges is to obtain sufficiently discriminative feature representations of pedestrians. Owing to environmental limitations such as low image resolution, computing fingerprint, iris, face, gait and other biometric features is difficult. Therefore, the visual appearance of a person (such as shoes, bags, clothes and gait [1]) plays an important role in the person re-identification task; the underlying assumption is that a person's visual appearance does not change significantly between cameras. Fig. 1 shows some examples from three challenging data sets.

Fig. 1: Some paired images from the three challenging data sets CUHK03, Market1501 and DukeMTMC-reID. The same person in different cameras shows great variation in terms of background and occlusion.

In order to find abundant discriminative features from the limited training data, (1) some methods adopt specially designed regularisations or constraints and propose various metric learning losses besides the classification loss, such as the triplet loss [2], quadruplet loss [3] and group similarity learning [4]; (2) some methods focus on finding more fine-grained visual cues over the whole body, obtained by rigid spatial divisions [5], latent part localisation [6], pose estimation [7], human parsing [8] or attention maps [9]; (3) some methods try to increase the diversity of the training data through augmentation, such as random cropping and mirroring [10], samples synthesised [11] by generative adversarial networks (GANs) [12] or adversarially occluded samples [13]. Our method belongs to the second class of this categorisation, focusing on fusing more discriminative visual features over the whole human body. However, previous work needs an extra step for body part localisation using rigid spatial division, pose estimation or latent part learning, which increases the complexity and uncertainty of the algorithms. Our method does not need such an extra step to process the image: a single network extracts features of the body at different scales from different backbone layers.

In this paper, we propose a new CNN-based method to fuse more discriminative visual features. The major contributions of our method lie in two aspects: a mixed pooling strategy to extract multi-scale features and a mixed loss function. We adopt ResNet50 as the backbone of our model.
To make up for the insufficiency of high-level semantic features, we design a multi-scale network that fuses semantic features from the last two layers of the backbone; that is, the feature extraction layers of our network are located at the last two layers of the backbone. We set the down-sampling factor of Res5 to 1, so the outputs of Res4 and Res5 have the same spatial size. Each output of Res4 and Res5 is combined with four different pooling layers to increase the diversity of the extracted features, and we concatenate those outputs of Res4 and Res5 that have passed through pooling layers with the same kernel and stride. As for the loss function, we adopt a mixed loss that combines the softmax loss and the triplet loss to train our network. First, the softmax loss ensures the representation ability of the features. Second, this representation ability is further improved by the triplet loss attached behind each pooling layer: the triplet loss not only enlarges the distance between inter-class features but also reduces the distance between intra-class features, which produces more discriminative features. In addition, the triplet loss does not change the projection direction of the features. In the end, we combine the averaged triplet losses and the averaged softmax losses as the final loss to train the whole network. Experiments on three challenging public data sets show that, compared with many other methods, our method achieves higher accuracy. Moreover, our approach is applicable to many different types of backbones.

Our contributions are summarised as follows: (1) different from previous methods, our network uses a mixed pooling strategy with kernel sizes of only 1 × 1 and 2 × 2 to obtain sufficient features for the person re-identification task; (2) we design a mixed loss function that contains a triplet loss part and a softmax loss part to calculate the final loss; (3) we apply our method to three challenging data sets (CUHK03, Market1501 and DukeMTMC-reID) and find that it outperforms many other approaches.

The rest of this article is organised as follows. Related work is briefly reviewed in Section 2. Section 3 introduces the architecture of the proposed method and the triplet loss function. Section 4 provides a performance comparison between our approach and state-of-the-art methods, and also demonstrates the effectiveness of the multi-scale structure. Finally, we conclude in Section 5.

2 Related work

Person re-identification is a valuable technique in security monitoring and has attracted more and more attention in recent years. Various methods have been proposed for this task; generally speaking, they can be divided into five types.

2.1 Person re-identification based on representation learning

With the development of deep learning, CNN-based methods can extract representative features from raw image data automatically. Some researchers therefore treat person re-identification as a classification or verification problem: (1) classification methods use the person ID or attributes as training labels to train the model; (2) verification methods train the model to decide whether a pair of input images from different cameras belongs to the same person.
According to the characteristics of these problems, classification loss and verification loss are used by Geng et al. [14] to train the model. However, Lin et al. [15] believed that person ID information alone is not enough to learn a model with strong generalisation ability. In their work, they annotated person attributes such as gender, hair and clothes, and the generalisation ability of the model increased greatly by adding these attribute labels. Nonetheless, representation-learning-based person re-identification easily over-fits to the domain of the data sets, and when the number of person ID labels grows beyond a certain point, this kind of method becomes weak.

2.2 Person re-identification based on local features

Many methods have been proposed to extract local features, such as image segmentation, skeleton key-point positioning and pose correction. The most common is image segmentation. However, this kind of method needs an extra step to align the persons in the images: if the persons in two images are not aligned vertically, different parts of the body are likely to be compared, which introduces errors into the model. Liang et al. [16] proposed a method based on key-point positioning: the key points of a person are first computed by a pose estimation model, and an affine transformation is then used to align the same key points across images. Spindle Net [17] also extracts local features with key points; different from [16], it does not use an affine transformation to align local image regions but generates regions of interest directly from these key points. These local feature alignment methods require additional skeleton key points or pose estimation during training, and collecting enough training data for such methods is very costly, which makes training a practical model difficult.

2.3 Person re-identification based on video sequences

The major difference between video-sequence-based methods and others is that, in addition to the content of individual images, they can exploit the motion information between frames. The method in [18] employs a three-dimensional CNN to extract motion information from the video. The original image sequence and the extracted optical flow sequence are combined as the input of Accumulative Motion Context (AMOC) [19], whose core idea is that the network should extract not only the features of the image sequence but also the motion features of the optical flow. Song et al. [20] pointed out that when a single frame suffers from occlusion, information from other frames can be used to compensate for it, and the network can be induced to judge image quality and reduce the importance of poor-quality frames. Riachy et al. [21] proposed a spatio-temporal descriptor, called GOG3D, for video-based re-identification; compared with other methods, the descriptor has lower complexity and can easily be fed into off-the-shelf learned distance metrics. Riachy et al. [22] also proposed a method that treats person re-identification as classification with a Naive Bayes nearest-neighbour classifier, using the Spearman rank correlation coefficient instead of the Euclidean distance as the distance metric.
2.4 Person re-identification based on metric learning

Different from representation learning, metric learning methods, which are widely used in image retrieval, aim to learn the similarity between two images. For person re-identification, images of the same person should have a higher similarity than images of different persons. The loss function of the model therefore reduces the distance between images of the same person and enlarges the distance between images of different persons. Many losses have been proposed for this purpose, such as the contrastive loss [23], triplet loss [2], quadruplet loss [3] and margin sample mining loss [24].

2.5 Person re-identification based on data augmentation

A big problem for person re-identification is that the amount of available data is very limited: even the largest re-identification data set contains only a few thousand person IDs and tens of thousands of images, which is small compared with data sets for other problems. Therefore, Zheng et al. [11] proposed using a GAN for re-identification. Although the quality of the generated images was not very high and the images were generated randomly (i.e. without labels for training), this was the first work to use a GAN to generate pedestrian images to compensate for the insufficient size of the data sets, and many subsequent works followed. Zhong et al. [25] proposed an improvement that makes the GAN-generated images controllable. Another problem in re-identification is that the same person has a different appearance under different cameras because of differing viewing angles and occlusions; to address this, a GAN was used to transfer photos from one camera style to another. In addition, there are data set bias problems caused by background and occlusion, and Wei et al. [26] used a GAN to transfer pedestrians from one data set to another to deal with this bias. Generally speaking, each GAN-based mapping is designed to solve a re-identification problem from a particular angle: whatever is missing, a GAN is used to make up for it, which is essentially a form of data augmentation.

3 Proposed method

As discussed above, the basic idea of the proposed method is to deploy a deep CNN that fuses features of the person's body at different scales and concatenates the features produced by pooling layers with the same stride. In this section, we first introduce the architecture of our method and then summarise its advantages.

3.1 Architecture of the network

Our network adopts ResNet50 as its backbone, slightly modified to meet our multi-scale requirements. Fig. 2 shows the network structure. Its most notable part is the set of branches drawn from the last two layers of the backbone, which generate discriminative features with multiple granularities.

Fig. 2: Architecture of our network. The network contains eight branches: four come from Res4 and the others come from Res5. Each branch is combined with a pooling layer, and a triplet loss is used to calculate the loss of each branch.
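As noted in the Introduction, the down-sampling at the beginning of the last stage is removed so that Res4 and Res5 produce feature maps of the same spatial size. The following is a minimal sketch of that modification, assuming a torchvision ResNet50 (where layer3 and layer4 correspond to Res4 and Res5) and the 288 × 144 training input size given later in Section 4.3; it is our illustration, not the authors' released code.

import torch
import torchvision

# Sketch only: build an ImageNet-pretrained ResNet50 and remove the
# down-sampling at the start of the last stage (Res5 / torchvision's layer4),
# so that layer3 (Res4) and layer4 (Res5) outputs share the same spatial size.
def build_backbone():
    resnet = torchvision.models.resnet50(pretrained=True)
    # In torchvision's Bottleneck the stride sits on conv2 of the first block
    # and on the 1x1 conv of its residual down-sample path.
    resnet.layer4[0].conv2.stride = (1, 1)
    resnet.layer4[0].downsample[0].stride = (1, 1)
    return resnet

backbone = build_backbone()
x = torch.randn(1, 3, 288, 144)                 # training input size (Section 4.3)
x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
x = backbone.layer2(backbone.layer1(x))
res4 = backbone.layer3(x)                       # 1024 x 18 x 9
res5 = backbone.layer4(res4)                    # 2048 x 18 x 9 (stride removed)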
Our network contains two noticeable parts: the triplet loss part and the softmax loss part. The triplet loss part consists of eight branches; half of the branches come from Res4 and the others from Res5, and each branch is combined with a pooling layer to extract features at a different scale. The softmax loss part consists of four branches, half associated with Res4 and half with Res5; each of these is a concatenated branch whose features come from the Res4 and Res5 pooling layers that share the same kernel and stride. This design is inspired by the fact that low-level layers mainly learn colour, edges and other low-level features; middle layers learn more complex texture features; and high-level features are more complete and carry the key distinguishing information. Therefore, the middle-level features (the output of Res4 in ResNet50) and the high-level features (the output of Res5) are fused together.

We modify the ResNet50 backbone slightly so that the down-sampling operation at the beginning of stage 4 is not employed; the outputs of Res4 and Res5 therefore have the same spatial size. To obtain diverse and sufficient features, each output of Res4 and Res5 is combined with a different style of pooling layer. There are four pooling layers with different kernels and strides in all, with adaptive pooling sizes from 1 × 1 to 2 × 2. The outputs of Res4 are therefore 1024 × 1 × 1 and 1024 × 2 × 2, and the outputs of Res5 are 2048 × 1 × 1 and 2048 × 2 × 2. Different from previous methods that need multiple horizontal strip features, our method uses only the 1 × 1 and 2 × 2 scale features to train the network and predict the person ID. The main reason is that as the scale increases, the model pays more attention to ever finer divisions of a given person; since the loss is a linear combination of all the local losses, this weakens the global information of the person's body. On the other hand, if the scale is too small, it is difficult to learn discriminative local divisions.

Our method adopts a mixed pooling strategy. The average pooling layer computes the mean of the extracted feature, and the max pooling layer takes the maximum of the same feature. Max pooling extracts local discriminative information better than average pooling, because average pooling uses the information of all positions with equal contribution, which weakens salient local discriminative information. Fusing features of different scales makes it easier to obtain more discriminative features.
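The pooling branches described above can be sketched as follows. This is our illustration of the mixed pooling strategy, not the authors' code; the class and branch names are placeholders, and the stand-in tensors mimic the backbone outputs from the previous sketch.

import torch
import torch.nn as nn

class MixedPoolBranches(nn.Module):
    # Each of Res4 and Res5 is pooled by adaptive average and max pooling with
    # 1x1 and 2x2 output sizes, giving eight branch features in total.
    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleDict({
            "avg1": nn.AdaptiveAvgPool2d(1), "max1": nn.AdaptiveMaxPool2d(1),
            "avg2": nn.AdaptiveAvgPool2d(2), "max2": nn.AdaptiveMaxPool2d(2),
        })

    def forward(self, res4, res5):
        branches = {}
        for name, pool in self.pools.items():
            branches["res4_" + name] = pool(res4).flatten(1)  # 1024-d or 4096-d
            branches["res5_" + name] = pool(res5).flatten(1)  # 2048-d or 8192-d
        return branches

res4 = torch.randn(2, 1024, 18, 9)   # stand-ins for the backbone outputs above
res5 = torch.randn(2, 2048, 18, 9)
branches = MixedPoolBranches()(res4, res5)
# For the softmax part, branches pooled with the same kernel/stride are
# concatenated, e.g. res4_avg1 (1024-d) with res5_avg1 (2048-d) -> 3072-d.
concat_avg1 = torch.cat([branches["res4_avg1"], branches["res5_avg1"]], dim=1)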
3.2 Loss function

Triplet loss: Using a triplet loss after each pooling layer has two valuable characteristics. First, the triplet loss enlarges the distance between different categories and reduces the distance within the same category. Second, it does not change the original position of the features in the feature space, so the deeper network layers are not affected negatively by the optimisation of this layer. Accordingly, the output of each pooling layer is optimised effectively by the triplet loss.

Final loss: We also adopt a mixed strategy to obtain the final loss. A triplet loss is computed on each pooling branch, giving eight triplet losses in all. Meanwhile, inspired by the pyramid structure, the outputs of the pooling layers that share the same kernel size and stride are concatenated to compute a softmax loss, giving four softmax losses in all. In the end, the final loss is obtained by averaging these 12 loss values:

L_final = L_T / 8 + L_S / 4,

where L_final is the final loss, L_T is the sum of the eight triplet losses and L_S is the sum of the four softmax losses. During training, both the triplet losses and the softmax losses are used to optimise the model; during testing, only the softmax (concatenated) branches are used to predict the person ID.
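A small sketch of how the 12 loss terms could be combined is given below. It assumes the standard PyTorch margin-based triplet loss with the margin of 0.3 reported in Section 4.3; the triplet sampling scheme and the classifier heads are not specified by the paper, so the arguments here are our placeholders.

import torch
import torch.nn as nn

# Illustration of the mixed loss: eight triplet losses (one per pooled branch)
# and four softmax losses (one per concatenated branch) are combined as
# L_final = L_T / 8 + L_S / 4.
triplet = nn.TripletMarginLoss(margin=0.3)   # margin value from Section 4.3
xent = nn.CrossEntropyLoss()

def final_loss(branch_feats, concat_logits, a_idx, p_idx, n_idx, labels):
    # branch_feats: dict of 8 per-branch embeddings (see sketch above)
    # concat_logits: list of 4 classifier outputs, one per concatenated branch
    # a_idx / p_idx / n_idx: anchor / positive / negative indices within the batch
    # labels: person-ID labels used by the softmax losses
    l_t = sum(triplet(f[a_idx], f[p_idx], f[n_idx]) for f in branch_feats.values())
    l_s = sum(xent(logits, labels) for logits in concat_logits)
    return l_t / 8 + l_s / 4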
3.3 Summary of our method

We propose a new method based on ResNet50 to capture sufficient discriminative features of a person's appearance for the person re-identification task. The main contributions are the mixed pooling strategy and the final loss function. The mixed pooling strategy uses only the 1 × 1 and 2 × 2 scale features, avoiding overly small scales that weaken the global information of a person's body. The final loss function fuses more correlated information to calculate the losses: during training, a triplet loss is attached to each pooling layer to enhance the feature representation, while the softmax loss is used at the end of the model, so the final loss consists of four softmax losses and eight triplet losses. During testing, we evaluate our model in two ways: with and without the re-ranking algorithm.

4 Experiments

4.1 Data sets

We evaluate our method on three challenging public data sets: Market1501 [27], CUHK03 [28] and DukeMTMC-reID [11]. Their details are shown in Table 1. In all three data sets, the time interval between appearances of the same person under different cameras is no more than one day.

Table 1. Details of the experimental data sets

Data set         Cameras   Identities   Images    Resolution
CUHK03           6         1,467        13,164    varying
DukeMTMC-reID    8         1,812        36,411    varying
Market1501       6         1,501        32,668    128 × 64

CUHK03 [28], collected by Li et al. in 2014, consists of 14,096 images of 1,467 identities. It was captured by six cameras, and all images come from an area on the campus of the Chinese University of Hong Kong. The data set contains two parts, person images labelled manually and person images labelled by deformable part models (DPMs [29]), and we use both parts. Each person has 9.6 training images on average.

The DukeMTMC-reID data set [11], a subset of the DukeMTMC data set, contains 36,411 images of 1,812 identities collected by eight high-resolution cameras. Its training set contains 16,522 images of 702 identities, and its testing set contains 2,228 query images and 17,661 gallery images of the other 702 identities. Each person has 23.5 training images on average.

The Market1501 data set [27], captured in front of a supermarket on the campus of Tsinghua University, contains 12,936 training images of 751 identities and 19,732 testing images of 750 identities. The images were captured by six cameras (five high-resolution and one low-resolution), and all person bounding boxes were generated by DPMs [29]. Each person has 17.2 training images on average.

4.2 Evaluation metrics

For performance evaluation, we use both the cumulative matching characteristic (CMC) and the mean average precision (mAP). For the CMC we report the Rank-k accuracy, i.e. the proportion of queries for which, according to the similarity score, a correct match is included in the top-k results:

R(k) = (1/N) Σ_{i=1}^{N} P_i(k),

where N is the total number of query images, and P_i(k) = 1 if an image of the same person as the ith query appears in the returned top-k images, otherwise P_i(k) = 0.

The mAP was introduced by Zheng et al. [27]. For each query, the area under the precision-recall curve, called the average precision (AP), is calculated, and

mAP = (1/N) Σ_{i=1}^{N} AP_i,

where N is the total number of query images and AP_i is the AP of the ith query. Many indicators can be used for evaluation, such as Rank 1, Rank 5, Rank 10 and mAP; in this paper, Rank 1 and mAP are used to evaluate our method.
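The two metrics can be computed per query as in the compact sketch below (our illustration; `relevant` is a hypothetical 0/1 array marking, in ranked order, which gallery images share the query's identity). R(k) and mAP are then the means of these per-query values over all N queries.

import numpy as np

def rank_k(relevant, k):
    # P_i(k): 1 if a correct match appears among the top-k ranked gallery images
    return int(relevant[:k].any())

def average_precision(relevant):
    # AP_i: area under the precision-recall curve for a single query
    hits = np.cumsum(relevant)
    precision_at = hits / (np.arange(len(relevant)) + 1)
    n_rel = relevant.sum()
    return float((precision_at * relevant).sum() / n_rel) if n_rel else 0.0

relevant = np.array([0, 1, 0, 1, 0])   # toy ranking for one query
print(rank_k(relevant, 1), average_precision(relevant))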
4.3 Training details

PyTorch, which supports dynamic construction of neural networks, is used to train and evaluate our model on an NVIDIA GTX TITAN-XP GPU. Through experiments, we found that pre-trained weight initialisation accelerates convergence and improves network performance, so instead of training from scratch we fine-tune from well-trained ResNet50 weights. The model is optimised with the SGD optimiser with a learning rate of 5 × 10^-4 and a momentum of 0.9. In practice, we adopt a dynamically adjusted learning rate: at the beginning of training the learning rate is set to 3 × 10^-2, and once the number of iterations exceeds a certain value (130 epochs) the learning rate is changed for the remainder of training. We train the network for 300 epochs. During training, all input pedestrian images are resized to 288 × 144, and random horizontal flipping is used for data augmentation. Based on the experimental results, we set the batch size to 32, the dropout value to 0.2 and the margin of the triplet loss to 0.3. We keep all intermediate models and select the best one on the validation data. During testing, all pedestrian images are resized to 256 × 128.

4.4 Experimental results

Our method is evaluated against many other methods on the three challenging data sets, and Table 2 shows all the results. There are 19 methods used for comparison, including IDE [30], PAN [31], SVDNet [32], DPFL [33], HA-CNN [6], SVDNet+Era [34], TriNet+Era [34], DaRe [35], GP-reid [36], PCB [5], PCB+RPP [5], HPM [37], BDB [38], CASN (PCB) [39], DSA [40], IANet [41], AANet [42], BAT-Net [43] and S+IT+CA [44]. In particular, we want to compare our method with HPM, which adopts a similar idea and also employs a mixed pooling strategy to extract multi-scale features. Two kinds of results are reported, one with re-ranking [45] and one without, and the best results are shown in bold.

Table 2. Comparison with state-of-the-art methods on CUHK03, DukeMTMC-reID and Market1501 ('—' means the result is not given by the authors in the corresponding paper; RR denotes re-ranking; the best results are shown in bold)

Methods           CUHK03-Labelled    CUHK03-Detected    DukeMTMC-reID      Market1501
                  Rank 1    mAP      Rank 1    mAP      Rank 1    mAP      Rank 1    mAP
IDE [30]          22.2      21.0     21.3      19.7     67.7      47.1     72.5      46.0
PAN [31]          36.9      35.0     36.3      34.0     71.6      51.5     82.8      63.4
SVDNet [32]       —         —        41.5      37.3     76.7      56.8     82.3      62.1
DPFL [33]         43.0      40.5     40.7      37.0     79.2      60.0     88.9      73.1
HA-CNN [6]        44.4      41.0     41.7      38.6     80.5      63.8     91.2      75.7
SVDNet+Era [34]   49.4      45.0     48.7      37.2     79.3      62.4     87.1      71.3
TriNet+Era [34]   58.1      53.8     55.5      50.7     73.0      56.6     83.9      68.7
DaRe [35]         66.1      61.6     63.3      59.0     80.2      64.5     89.0      76.0
GP-reid [36]      —         —        —         —        85.2      72.8     92.2      81.2
PCB [5]           —         —        61.3      54.2     81.9      65.3     92.4      77.3
PCB+RPP [5]       —         —        62.8      56.7     83.3      69.2     93.8      81.6
HPM [37]          —         —        —         —        86.6      74.3     94.2      82.7
BDB [38]          73.6      71.7     72.8      69.3     86.8      72.1     94.2      84.3
CASN (PCB) [39]   73.7      68.0     71.5      64.4     87.7      73.7     94.4      82.8
DSA [40]          78.9      75.2     78.2      73.1     86.2      74.3     95.7      87.6
IANet [41]        —         —        —         —        87.1      73.4     94.4      83.1
AANet [42]        —         —        —         —        87.65     74.29    93.93     83.41
BAT-Net [43]      78.6      76.1     76.2      73.2     87.7      77.3     95.1      87.4
S+IT+CA [44]      —         —        —         —        86.3      73.1     96.1      84.7
MSN [46]          —         —        —         —        82.2      62.9     91.1      75.9
SISN [47]         —         —        —         —        89.0      75.9     95.2      85.4
S+IT+CA [48]      —         —        —         —        83.34     68.30    91.95     79.08
Ours              80.3      75.5     76.4      71.2     88.5      77.4     95.3      87.6
Ours+RR           87.4      88.0     84.5      84.5     91.5      89.0     95.9      93.7

Performance on the CUHK03 data set: The results of our method and the 19 compared methods are shown in Table 2. As discussed above, the CUHK03 data set contains two parts, manually labelled images and images labelled by DPMs, and we evaluate our method on both. On the manually labelled part, our method achieves the best Rank 1 among the compared methods and the second-best mAP; compared with the second-ranking method DSA, our Rank 1 is 1.4 higher. On the detected part, our method ranks second in both Rank 1 and mAP; compared with the best-ranking method DSA, our Rank 1 is 1.8 lower and our mAP is 1.9 lower. After re-ranking, the performance of our method improves on both parts of the data set.

Performance on the DukeMTMC-reID data set: Compared with the other two data sets, DukeMTMC-reID is a very challenging data set. Table 2 shows all the results of the experiment. As