1 Introduction

Person re-identification (re-ID), also called pedestrian retrieval, across different cameras is a fundamental problem in computer vision. The main task of person re-ID is to find a target person in an image gallery by comparing a query image of this person with all other images in the gallery. It has extensive applications in intelligent video surveillance, human-computer interaction, and home robotics, and especially in public safety. Despite great progress in recent years, factors found in realistic retrieval environments, such as background clutter, occlusion, and deformation, still make re-ID a very challenging problem (Fig. 1).

Fig. 1

Examples of cross-view person matching from Market-1501 [53]. Each bounding box shows the same person under different cameras

Initially, research on person re-ID mainly focused on designing hand-crafted features and learning similarity metrics [6, 14, 18, 23, 31]. On the similarity metric side, besides the Euclidean, cosine, and Mahalanobis distances, improved nearest-neighbor-based methods have been proposed to reduce intra-class distance and increase inter-class distance [25, 42]. To reduce the computational cost of the transformation matrix in the Mahalanobis distance, the regularization constraints are relaxed in [10]. Although suitable hand-crafted features and learned similarity metrics have brought some improvements, the performance of pedestrian recognition remained low. With the emergence of large-scale datasets and the development of deep learning, deep learning-based methods have been proposed for the person re-ID task [37, 44]. Compared with hand-crafted feature-based methods, deep learning-based methods can achieve unprecedented re-identification performance using only simple deep features. However, because they rely on deep features alone without jointly learning a similarity metric, these methods still cannot meet the requirements of real environments. Existing deep learning-based person re-ID methods either assume the availability of well-aligned person bounding box images as model input [3, 30, 58] or rely on constrained segmentation mechanisms to calibrate misaligned images [5, 36, 43, 52]. They are not well suited to re-ID matching on arbitrarily aligned person images, which may contain large pose variations and partial occlusions. Some attention-based methods use attention maps to indicate the importance of different pedestrian locations [13, 17, 39]. Additionally, a part-based convolutional network has been applied to person re-ID as a strong baseline [28]. Nevertheless, these deep learning-based methods simply adopt existing deep architectures, which have high complexity in model design. In addition, part-based methods often give the same weight to every part, ignoring the integrity and importance of the information contained in each image part. Hence, these techniques are ineffective when the target person does not fill the whole image or when the target person is seriously deformed or occluded.

In this paper, we consider the problem of jointly learning part weight selection and deep feature representation to optimize person re-ID in an adaptive weight part-based convolutional network. When some image patches are affected by occlusion or deformation, their representation ability is reduced and the identification ability of the patch-based re-ID model suffers. Adaptively reducing the weight of these patches can improve the re-ID performance of the model. Compared with part-based methods that use equal weights and with attention-based methods, the proposed method has an adaptive weight model that adjusts the weight of each part adaptively. The adaptive weight model is based on the similarity of the same image part of the same person in different images and on the location of each part in the whole image, which has two advantages: 1) Adjusting the weight of each image part according to its location in the whole image can effectively alleviate the problem of pedestrians not filling the whole image. 2) Adjusting the weight of each image part according to the similarity of the same image part of the same person in different images can effectively mitigate the effects of occlusion and deformation. Combining these two aspects, our AWPCN method achieves state-of-the-art re-ID performance. The main contributions of this paper are as follows:

  • We formulate a novel method of jointly learning part weight selection and deep feature representation for optimizing person re-ID in the deep learning-based framework.

  • We propose an adaptive weight part-based convolutional network (AWPCN) which can simultaneously divide the image into several parts, extract deep features and learn the corresponding weight of each part.

  • Extensive comparative evaluations demonstrate the superiority of the proposed AWPCN model over a wide range of state-of-the-art re-ID models on three large benchmarks: CUHK03 [37], Market-1501 [53], and DukeMTMC-ReID [56].

The rest of this paper is structured as follows. We first introduce related work in Section 2. Next, in Section 3, we propose the adaptive weight part-based convolutional network for person re-ID, including the baseline method and the adaptive weight model for the part-based convolutional network. In Section 4, we describe the implementation details and the evaluation criteria, and then evaluate and discuss our approach on several benchmark datasets. Finally, we briefly conclude this work in Section 5.

2 Related works

In this section, we introduce some re-ID methods closely related to our work in the proper context. A comprehensive review of re-ID methods is beyond the scope of this paper, and some survey papers can be found in [1, 15, 54].

2.1 Hand-crafted features-based re-ID methods

Before deep learning became popular, research on pedestrian retrieval mainly focused on designing better hand-crafted features and learning better similarity measures. Different feature representations are suitable for different recognition scenarios [6, 14, 18, 23, 31, 47, 49, 50]. Common hand-crafted features used for pedestrian image representation include color names [14], texture [23], the scale-invariant feature transform (SIFT) [18], and the histogram of oriented gradients (HOG) [6]. In [31], Tao et al. used color histogram and texture features to characterize the image and proposed a regularized smoothing KISS metric to retrieve pedestrians in a low-dimensional space obtained by principal component analysis (PCA). Pedagadi et al. [21] proposed a method using PCA and local Fisher discriminant analysis to preserve the local neighborhood structure of the projected images. In [51], Zhang et al. proposed using the null Foley-Sammon transformation to learn a feature space with good discriminative ability, which satisfies zero intra-class divergence and positive inter-class divergence. In addition, many studies address pedestrian recognition by learning better similarity measures [7, 10, 12, 38, 42].

2.2 Deep learning-based re-ID methods

In recent years, with the development of deep learning methods [32, 33, 34, 46, 48], deep learning has been widely used in person re-ID tasks [4, 16, 20, 59]. Unlike traditional re-ID methods, deep learning-based methods can automatically extract better pedestrian image features while learning better similarity measures. Since deep learning was first applied to person re-ID [37, 44], it has become increasingly popular for this retrieval task. According to the loss type, deep learning-based re-ID methods can be divided into representation learning and metric learning. Representation learning is a very common approach to person re-ID [20, 54, 55, 56, 59]. Although the ultimate goal is to learn the similarity between two images, representation learning-based methods do not directly consider the similarity between images when training the network, but treat re-ID as a classification task [20]. A characteristic of this approach is that the output of the last fully connected layer is not used as the final image feature vector; instead, it is passed through a Softmax activation function to compute the representation learning loss [55]. By using person IDs or attributes as training labels, the network can learn whether two input images belong to the same pedestrian. Among deep learning-based person re-ID methods, commonly used metric learning losses include the contrastive loss [35, 36], the triplet loss [9, 19], and the quadruplet loss [3]. The contrastive loss [35] is used to train a Siamese network, whose input is a pair of person images that may show the same person or different persons. In [9], Hermans et al. verified that a variant of the triplet loss for end-to-end deep metric learning can yield better re-ID performance. However, the triplet loss focuses mainly on obtaining correct orderings on the training set. It still suffers from weaker generalization from the training set to the testing set, resulting in inferior performance. In [3], a quadruplet loss was proposed that leads to model outputs with larger inter-class variation and smaller intra-class variation than the triplet loss. In particular, a quadruplet deep network with margin-based online hard negative mining was proposed based on the quadruplet loss for person re-ID.

2.3 Patch-based re-ID methods

The methods described above are susceptible to noise interference, which reduces their stability. To make identification more robust, several patch-based or attention-based methods have been proposed [5, 13, 17, 26, 28, 29, 36, 39, 41, 43, 52]. Figure 2 shows the partition strategies of several such models for person retrieval. Liu et al. [17] proposed an attention-based deep neural network that explores the multi-scale selectiveness of attentive features to enrich the final feature representation of a pedestrian image.

Fig. 2

Partition strategies of several deep part models in person re-ID. From top to bottom are Hydra-plus [17], PCB [28], PL-Net [43], PAR [52], and the proposed AWPCN, respectively

In [28], Sun et al. proposed a part-based convolutional baseline (PCB) network and a refined part pooling method that re-assigns outliers to the closest parts, resulting in refined parts with enhanced within-part consistency and good identification performance. The PL-Net [43] method minimizes both the empirical classification risk on training person images and the representation learning risk on unseen person images. The representation learning risk is evaluated by the proposed part loss, which automatically detects human body parts and computes the person classification loss on each part separately. In [52], Zhao et al. presented a part-aligned representation (PAR) model to handle the body part misalignment problem. The PAR model decomposes the human body into parts that are discriminative for person matching. These part-based or attention-based algorithms all improve the robustness of identification to varying degrees. Nevertheless, their integration strategy often assigns the same weight to each patch or relies on an attention strategy, which does not fully exploit the advantage of the part-based approach. In this paper, we develop an adaptive weight model to mitigate this defect, which adaptively uses part similarity information and part location information to determine the weight of each local part.

3 Adaptive weight part-based convolutional network

In this section, we introduce the adaptive weight part-based convolutional network for person re-ID. Firstly, we give a brief introduction to the part-based person re-ID methods. Then, we present the part-based convolutional network. Furthermore, we describe the adaptive weighting strategy to adjust the weight of different parts effectively. Finally, we show the re-ID pipeline of the proposed method, which is shown in Fig. 3.

Fig. 3

Structure of the proposed AWPCN. First, the input image passes through the convolutional layers to extract patch features. Then, the adaptive weight layer adjusts the importance of different patches in the identification task. Finally, each classifier is implemented with a fully-connected layer followed by a Softmax layer to predict the identity (ID) of the input image

3.1 Part-based convolutional network

Although the PCB [28] method achieves good identification performance, its part-based model, trained with the same weight for each part, is less effective at capturing the discriminative local structures of the target person. Different parts play different roles in the target representation, especially in realistic pedestrian retrieval environments, where deformation, occlusion, etc. may occur. To enhance the accuracy of the part-based convolutional framework, we propose an adaptive weight part-based convolutional network for the person re-ID task.

Intuitively, the target can be divided into several local parts, and the distance (or similarity) of each part is calculated separately. If occlusion or interference occurs in some local areas, we can still retrieve the same person accurately through the other unobstructed or undisturbed parts. The PCB model reshapes the backbone network (ResNet50) with some modifications. Specifically, it retains the structure before the global average pooling layer and removes that layer and everything that follows it. When an image passes through the PCB network, the column vectors in the same stripe are pooled into a single part-level column vector. Each part-level vector can predict the identity (ID) of the input image through a fully-connected layer and a Softmax function.
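As a rough illustration of the part-level pooling described above, the following sketch averages the column vectors of a feature map within each of p horizontal stripes. The function name `stripe_pool`, the nested-list layout (rows × columns × channels), the use of plain averaging, and the requirement that the height be divisible by p are all assumptions made for illustration, not the authors' implementation.

```python
def stripe_pool(feature_map, p):
    """Average the column vectors inside each of p horizontal stripes.

    feature_map: list of H rows; each row is a list of W column vectors,
                 and each column vector is a list of C floats.
    p:           number of horizontal stripes (parts).
    Returns a list of p part-level vectors, each of length C.
    Assumption: H is divisible by p (kept simple for illustration).
    """
    height = len(feature_map)
    if height % p != 0:
        raise ValueError("feature map height must be divisible by p")
    rows_per_stripe = height // p

    parts = []
    for s in range(p):
        # Collect every column vector that falls inside stripe s.
        stripe_rows = feature_map[s * rows_per_stripe:(s + 1) * rows_per_stripe]
        vectors = [vec for row in stripe_rows for vec in row]
        channels = len(vectors[0])
        # Channel-wise mean over all column vectors of the stripe.
        parts.append(
            [sum(v[c] for v in vectors) / len(vectors) for c in range(channels)]
        )
    return parts
```

Each returned vector would then be passed to its own classifier, so an occluded stripe only corrupts one of the p predictions.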

$$ Softmax({{W}_{i}^{T}} f) = \frac{\exp({{W}_{i}^{T}} f)}{{\sum}_{j=1}^{p}\exp({{W}_{j}^{T}} f)}, $$
(1)

where p denotes the number of pre-defined parts, f denotes a single part-level column vector, and W denotes the trainable weight matrix of the part classifier.

The PCB is optimized by minimizing the sum of Cross-Entropy losses over p pieces of ID predictions in the training stage.
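The training objective can be sketched as follows, assuming each of the p part classifiers outputs a vector of logits over the training identities and that all parts share the same ground-truth identity label. The names `softmax` and `pcb_loss` are hypothetical helpers introduced here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pcb_loss(part_logits, label):
    """Sum of cross-entropy losses over the p part-level ID predictions.

    part_logits: list of p logit vectors, one per part classifier
                 (assumed to range over the training identities).
    label:       index of the ground-truth identity.
    """
    return sum(-math.log(softmax(logits)[label]) for logits in part_logits)
```

Because the losses are simply summed, every part contributes equally to the gradient in this baseline, which is the behavior the adaptive weighting in Section 3.2 is meant to change.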

3.2 Adaptive weight strategy

During training, different parts of the target undergo inconsistent changes, such as occlusion and deformation. If the model combines the p parts into the integrated output with equal weights, the weights may not reflect the actual reliability of the parts, which can reduce identification performance. In fact, the weight of an image part should be suppressed if the part is occluded, and increased otherwise. Therefore, we propose an adaptive weighting strategy.

For the person re-ID task, the distance (or similarity) between two images can be used to quantify the reliability of each part, and it can be calculated as:

$$ \begin{aligned} & dis^{E}_{I_{1},I_{2}}=\|f_{I_{1}}-f_{I_{2}}\|^{2}, \\ & dis^{C}_{I_{1},I_{2}}=1- \frac{f_{I_{1}}\cdot f_{I_{2}}}{\|f_{I_{1}}\| \|f_{I_{2}}\|}, \end{aligned} $$
(2)

where \(I_{1}\) and \(I_{2}\) denote two different images, and \(dis^{E}_{I_{1},I_{2}}\) and \(dis^{C}_{I_{1},I_{2}}\) denote the (squared) Euclidean distance and the cosine distance, respectively.
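A minimal sketch of the two distances, assuming each image (or image part) is represented by a plain list of feature values. The function names are hypothetical, and the cosine distance here uses the standard definition, which normalizes the dot product by the product of the vector norms.

```python
import math


def euclidean_sq(f1, f2):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f1, f2))


def cosine_distance(f1, f2):
    """Cosine distance: 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(f1, f2))
    norm1 = math.sqrt(sum(a * a for a in f1))
    norm2 = math.sqrt(sum(b * b for b in f2))
    return 1.0 - dot / (norm1 * norm2)
```

The cosine distance ignores the magnitude of the features, so two vectors pointing in the same direction have distance zero even when their lengths differ.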

Usually, the information at the center of the image is the most credible and discriminative, and the closer a region is to the edge, the lower its credibility. The main reason is imperfect image acquisition, for example pedestrians not filling the whole image or pedestrian deformation. Accordingly, when the image is partitioned, the weight of an image part near the edge should be lower than that of a part near the center of the image. We therefore compute the weight of each image part as:

$$ \begin{aligned} & w_{p} = \frac{1}{\|p_{c}-I_{c}\|^{2} + I_{dis}/p}, \\ & w^{\prime}_{p} = \frac{w_{p}}{\sum w_{p}}, \end{aligned} $$
(3)

where \(w_{p}\) is the original weight of the p-th image part, \(w^{\prime }_{p}\) is the normalized weight of the p-th image part, \(p_{c}\) denotes the central coordinates of the p-th image part, \(I_{c}\) denotes the central coordinates of the whole image, and \(I_{dis}\) denotes the length of the image.
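The location weights in Eq. (3) can be sketched as follows under some stated assumptions: the image is split into p equal horizontal stripes, so only the vertical coordinate of each part center differs from the image center; the image length is taken to be the image height in pixels; and the p in the denominator term is read as the number of parts. The function name `location_weights` is hypothetical.

```python
def location_weights(image_height, p):
    """Normalized location weights for p horizontal stripes (sketch of Eq. 3).

    Assumptions: equal-height stripes, vertical distance only, and the image
    length I_dis taken to be the image height in pixels.
    """
    stripe_height = image_height / p
    image_center = image_height / 2.0

    raw = []
    for k in range(p):
        part_center = (k + 0.5) * stripe_height
        # Parts far from the image center get a larger denominator,
        # hence a smaller raw weight.
        raw.append(1.0 / ((part_center - image_center) ** 2 + image_height / p))

    total = sum(raw)
    return [w / total for w in raw]
```

With this reading, the weights are symmetric about the middle stripe and sum to one, and the denominator term keeps the weight of the central stripe finite.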

In fact, besides the problems mentioned above, pedestrian deformation or occlusion can also significantly affect retrieval performance. Therefore, during training, we consider the similarity between corresponding image parts. A high similarity indicates that the image part is more important for the pedestrian representation, so its weight should be greater when determining the final pedestrian identification. We derive the similarity from the distance between image parts:

$$ Similarity = \frac{1}{\|p_{I_{1}}-p_{I_{2}}\|^{2} + \varepsilon}, $$
(4)

where \(p_{I_{i}}\) denotes the p-th image part in different images of the same pedestrian, and ε is a regularization parameter that avoids a zero denominator (we set ε = 0.5). Combining the similarity measure with the location information mentioned above, we set the final weight of each image part as follows:

$$ fw_{p} = w^{\prime}_{p} \cdot \frac{Similarity(p)}{\sum Similarity(p)}, $$
(5)

where \(fw_{p}\) is the final weight of the p-th image part used in the model.
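Putting Eqs. (4) and (5) together, a minimal sketch might look like the following. It assumes that the first factor of Eq. (5) is the normalized location weight from Eq. (3), and that each part is represented by a feature vector taken from two images of the same pedestrian. The function names are hypothetical.

```python
def part_similarity(part_a, part_b, eps=0.5):
    """Similarity of two corresponding parts (sketch of Eq. 4)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(part_a, part_b))
    return 1.0 / (dist_sq + eps)


def final_weights(location_w, parts_img1, parts_img2, eps=0.5):
    """Final part weights (sketch of Eq. 5).

    location_w: normalized location weights, one per part (assumed from Eq. 3).
    parts_img1, parts_img2: per-part feature vectors of two images
                            of the same pedestrian.
    """
    sims = [part_similarity(a, b, eps) for a, b in zip(parts_img1, parts_img2)]
    total = sum(sims)
    # Location weight times the normalized similarity of each part.
    return [w * s / total for w, s in zip(location_w, sims)]
```

For example, a part whose features differ strongly between the two images (as under occlusion) receives a low similarity and therefore a low final weight, even if its location weight is high.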

As shown in Fig. 3, we add the adaptive weight layer to the identification network directly after the feature extraction layers, so that the weight of each part can be adjusted adaptively. Each adaptively weighted image feature is fed into the corresponding classifier, which is implemented with a fully-connected layer followed by a Softmax layer to predict the identity of the input pedestrian image.

4 Experiments

We evaluate the proposed AWPCN and compare it with other recently published state-of-the-art re-ID methods on three widely used benchmarks: Market-1501, DukeMTMC-ReID, and CUHK03 [37, 53, 56].

4.1 Datasets and evaluation metrics

Three datasets are widely used for person re-ID: CUHK03 [37], Market-1501 [53], and DukeMTMC-ReID [56]. Table 1 summarizes these datasets, and we briefly introduce each of them below:

Table 1 Person re-ID evaluation datasets

CUHK03

CUHK03 [37] is the first person re-identification dataset that is large enough for deep learning. It provides bounding boxes both detected by deformable part models and manually labeled; we use the detected boxes in this paper. The dataset contains 13,164 images collected from 10 cameras. There are 1,467 identities, divided into two parts: 767 identities are used for training and the remaining 700 for testing.

Market-1501

This dataset [53] is one of the largest person re-identification benchmark datasets. It was collected with 6 cameras: 5 high-resolution (1280 × 1080) cameras and 1 low-resolution (720 × 576) camera. There are 32,668 bounding boxes of 1,501 identities: 751 identities are used for training and the remaining 750 for testing.

DukeMTMC-ReID

This dataset [56] is also one of the largest person re-identification benchmark datasets and was collected on a campus with 8 cameras. It directly uses the manually labeled ground truth for training and testing. There are 36,411 manually labeled bounding boxes of 1,404 identities: half of these identities are used for training and the others for testing.

Evaluation metrics

The evaluation metrics used in this paper are those provided with Market-1501 [53] and DukeMTMC-ReID [56]. All experiments use the single-query setting. We mainly use Rank-1 accuracy and mean average precision (mAP) as evaluation metrics for the comparative experiments. To keep the performance comparison clear, we did not apply the re-ranking [57] mechanism in our model.
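For readers unfamiliar with these metrics, the following sketch computes Rank-1 accuracy and mAP from ranked match lists using their standard definitions. It assumes each query has already been reduced to a list of booleans ordered by decreasing similarity (True for a gallery image with the same identity), and it omits the benchmark-specific filtering of gallery images, so it is a simplification rather than the official evaluation code. The function names are hypothetical.

```python
def rank1(ranked_matches):
    """1.0 if the top-ranked gallery image is a correct match, else 0.0."""
    return 1.0 if ranked_matches and ranked_matches[0] else 0.0


def average_precision(ranked_matches):
    """Average of the precision values at each correct match position."""
    hits = 0
    precision_sum = 0.0
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(all_rankings):
    """Return (Rank-1 accuracy, mAP) averaged over all queries."""
    n = len(all_rankings)
    r1 = sum(rank1(r) for r in all_rankings) / n
    mean_ap = sum(average_precision(r) for r in all_rankings) / n
    return r1, mean_ap
```

Rank-1 only looks at the first retrieved image, whereas mAP rewards a model for ranking all correct matches early, which is why the two metrics can disagree.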

4.2 Implementation details

We use ResNet-50 [8] as the backbone network, changing the output size of the classifier to the number of identities in the training set. Cosine distance is used as the similarity metric. These settings are the same as for PCB in [28]. All input person images are resized to 384 × 192. We empirically set the number of parts to p = 9. The learning rate λ is set to 0.015 and multiplied by 0.1 after every 30 epochs. The network is trained end-to-end by Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 5e-4. During training, we insert a dropout layer before the classifier to regularize the network, with the dropout rate set to 0.5 for all training sets. Figure 4 shows the training and validation losses on the three datasets. The training and validation errors are almost stable after 30 epochs of training, so we terminated training after 60 epochs.
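The step-decay schedule described above can be written in a few lines. This sketch assumes zero-based epoch indices, so epochs 0 to 29 use the base rate and epochs 30 to 59 use one tenth of it; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.015, gamma=0.1, step=30):
    """Step decay: multiply the base rate by gamma after every `step` epochs.

    epoch is assumed to be zero-based.
    """
    return base_lr * gamma ** (epoch // step)


# Schedule over the 60 training epochs mentioned in the text.
schedule = [learning_rate(e) for e in range(60)]
```

Over 60 epochs the rate therefore changes only once, at the start of epoch 30.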

Fig. 4

Training and validation errors on three datasets

4.3 Comparison to the state-of-the-art

In this section, to verify the effectiveness of the proposed AWPCN method, we compare it with some recently published state-of-the-art methods on the Market-1501, DukeMTMC-ReID, and CUHK03 datasets [37, 53, 56].

Evaluation on Market-1501

Table 2 compares our AWPCN with some state-of-the-art methods on the Market-1501 [53] dataset. In the top section of the table, we compare approaches without any attention mechanism or partitioning strategy with our adaptive weight part-based convolutional network. In the second section of the table, we compare part-based or attention-based approaches with our approach. Compared with these state-of-the-art methods, our method achieves the best Rank-1 accuracy and the second-best mAP. Compared with the baseline PCB [28], the proposed method adds an adaptive weight model to the base part-based convolutional network, which improves Rank-1 and mAP by 1.7% and 4.7%, respectively. It also outperforms the refined part pooling (PCB+RPP) method. Compared with the patch-based PAR [52] method, the proposed AWPCN method achieves more than 10% improvement on both Rank-1 and mAP. These comparisons verify the effectiveness of our adaptive weighting method for the patch-based re-ID model. Figure 5 shows some re-ID samples on the Market-1501 dataset. The images in the first column are the query images. The retrieved images are sorted by similarity from left to right. Most candidate images are correctly re-identified. Although the network retrieves some incorrect candidates in some rows, most of the incorrect candidates appear in the later positions, which also indicates that our model has strong re-ID performance.

Table 2 Rank-1 and mAP results of the proposed AWPCN and some state-of-the-art methods on the Market-1501 [53] dataset. The best and second-best scores are highlighted

Fig. 5

Person re-ID samples on Market-1501 [53]. The first column shows the query images. The retrieved images are sorted by similarity from left to right. Correct and false matches are marked with green and red bounding boxes, respectively

Evaluation on DukeMTMC-ReID

Table 3 reports competitive results on the DukeMTMC-ReID [56] dataset. Compared with the recently proposed part-based methods PCB (baseline) and PCB+RPP [28], our AWPCN achieves Rank-1 accuracy improvements of 4.0% and 2.4% and mAP improvements of 8.0% and 4.9%, respectively. Compared with the recently proposed attention-based methods HA-CNN [39] and DuATM [26], our AWPCN achieves Rank-1 accuracy improvements of 5.2% and 3.9% and mAP improvements of 10.3% and 9.5%, respectively. All these improvements stem from the adaptive weight part-based convolutional network.

Table 3 Rank-1 and mAP results of the proposed AWPCN and some state-of-the-art methods on the DukeMTMC-ReID [56] dataset. The best and second-best scores are highlighted

Evaluation on CUHK03

Table 4 compares the proposed AWPCN with some state-of-the-art methods on the CUHK03 [37] dataset. As the table shows, the Rank-1 accuracy and mAP of AWPCN are 63.7% and 62.8% with detected bounding boxes, which is the best or close to the best identification performance. Our method outperforms the baseline [28], which we attribute to the proposed adaptive weight model. Compared with the harmonious attention network-based HA-CNN [39] method, the proposed AWPCN method achieves more than 20% improvement on both Rank-1 and mAP. These comparisons verify the effectiveness of our adaptive weighting method in the patch-based model for the person re-ID task.

Table 4 Rank-1 and mAP results of the proposed AWPCN and some state-of-the-art methods on the CUHK03 [37] dataset. The best and second-best scores are highlighted

4.4 Discussion

Because AWPCN is an adaptive weight patch-based re-ID method, in this section we mainly discuss the number of parts p, which is essential to re-ID performance. Table 5 shows the Rank-1 and mAP results of the proposed method under different values of p. The value of p directly determines the granularity of the part features. With p = 1, the part-based convolutional network learns a global feature. As p increases from 1 to 9, the identification accuracy gradually improves. However, the accuracy does not always increase with p. As shown in Table 5, when p increases from 9 to 15, the re-ID performance decreases significantly. For comparison, we also report the performance for p = 20, 25, and 30. Table 5 shows that an overly large value of p compromises the discriminative ability of the part features. This is mainly because excessive segmentation of the image reduces the representation ability of each image patch, thus reducing the re-ID ability of the model. Therefore, we suggest using p = 9 in practical applications.

Table 5 Rank-1 and mAP results of the proposed AWPCN method under different values of p on the Market-1501 [53] and DukeMTMC-ReID [56] datasets

5 Conclusions

In this work, we show the advantage of the part-based convolutional network for feature representation and the necessity of adaptive weights when combining the parts. Specifically, we formulate a novel adaptive weight model that jointly trains the loss of each part while simultaneously optimizing its feature representation, with the aim of improving person re-ID on misaligned or deformed images. Extensive comparative evaluations validate the superiority of this new adaptive weight part-based convolutional network for person re-ID over a wide variety of state-of-the-art methods on three large-scale benchmarks: CUHK03, Market-1501, and DukeMTMC-ReID.