1 Introduction

Person re-identification (re-id) is the task of finding images of the same individual captured by multiple cameras distributed at different locations. Person re-id has received much attention because of its wide applications in surveillance networks and the computer vision community. It is a challenging task due to the large appearance variations caused by viewpoints, illumination, human poses, and occlusion. Figure 1 displays some image samples from four public benchmark datasets. Previous mainstream methods tackle the re-id problem by first extracting hand-crafted features to represent person images and then learning distance metrics for similarity calculation of the extracted features. However, the performance of these methods is limited by the weak representation power of hand-crafted features and the separate optimization of feature representations and distance metrics. Recently, deep learning models, especially Convolutional Neural Networks (CNNs), have obtained outstanding performance in various computer vision tasks [11, 46, 71], such as image recognition, object detection, and face recognition. Many researchers have explored the application of CNNs to the person re-id task. Different from systems based on hand-crafted features, deep methods integrate feature extraction and metric learning into one unified framework and thus achieve large performance improvements.

Fig. 1

Samples from four datasets, including Market1501, CUHK03, DukeMTMC-reID, and CUHK01. The appearance of the same person captured by different camera views can change greatly with lighting, pose, background, and occlusion

Existing deep methods can be categorized into two groups, namely verification models and identification models. In the first group, verification models usually contain a Siamese CNN for feature extraction of image pairs and a distance layer that outputs feature similarities. Many training loss functions, including pairwise contrastive loss [45, 49, 61], triplet loss [29, 38, 73], quadruplet loss [3], and their variants [5], are utilized to learn a feature space in which intra-class distances are minimized and inter-class distances are maximized. These methods only utilize weak annotation information about whether two images depict the same person, and thus they do not take full advantage of the pedestrian identity labels. Moreover, verification models usually take pairwise or triplet image units as input, whose number grows exponentially with the scale of the dataset. This can cause convergence problems in the training stage, easily leading to under-fitting on large-scale datasets. In the second group, identification models, which require no complex data sampling process, have gained popularity in the person re-id community. Identification models employ person identities as the supervisory signal and adopt the softmax loss function for model training. However, identification models need a rich amount of training samples per identity to combat over-fitting. Recently, several large-scale datasets [66, 70] have been collected in which each pedestrian has sufficient training samples. For example, there are on average 17.2 training images per identity on Market1501 [66]. Some good training tricks [42, 43, 58, 70] have also been proposed to build effective identification models. Many studies [67, 69, 74] demonstrate that identification models yield outstanding performance on large-scale datasets while requiring no special data sampling scheme. Thus, our method is built on the identification model.

Person re-id is a fine-grained problem, and some person identities can only be distinguished by subtle differences in body parts and small visual cues (e.g., backpacks and shoes). To this end, instead of only extracting global features, many methods exploit local information to enhance the representation ability of deep features. Typically, these methods first decompose person images into several patches, parts, or stripes and then conduct part-level similarity matching for more robust feature representations. They implicitly assume the availability of well-aligned person bounding box images. However, as shown in Fig. 1, cross-view images usually undergo uncontrolled spatial misalignment due to severe human pose changes and inaccurate pedestrian detection, which degrades the performance of part-based methods. Some other methods rely on external pose localization algorithms to remedy this problem. They directly adopt pre-trained pose estimation models (e.g., Convolutional Pose Machines [51] and the OpenPose toolkit [2]) as the part detector. But they essentially perform a two-stage scheme [62, 65], where the estimated human pose parts are first manually cropped and then passed through another network for part-level feature extraction. Some studies [41, 59] integrate the process of pose estimation and part feature learning into one end-to-end trainable framework. However, it remains challenging to obtain an ideal semantic partition of person bodies because there is a large domain bias between pose estimation datasets and person re-id datasets. An alternative solution is to exploit the attention mechanism, which can focus on discriminative local regions. But existing attention-based deep learning models for re-id usually design complicated attention algorithms (e.g., Harmonious Attention Network [26]) or depend on computationally expensive units (e.g., STN [16] in [21] and attention-based LSTM [37] in [29]).

In this paper, we present a Part-based Attention Model (PAM) to alleviate the misalignment problem. PAM contains a channel attention block and a spatial attention block. The two blocks aim to explore the informative features of the body-part feature maps along the channel and spatial dimensions, respectively. Specifically, we uniformly slice the person body into several parts, and each part is forwarded into PAM for channel and spatial feature refinement. PAM learns to find the informative features by assigning weights to different channels and different positions. Then the global full-body and local body-part refined feature maps are pooled into global and part-level feature representations, each of which is trained using an identity classification loss. Our model is inspired by the Squeeze-and-Excitation Network (SENet) [15] and the Convolutional Block Attention Module (CBAM) [53]. SENet only considers channel interdependency while ignoring spatial attention. Although both channel and spatial attention are incorporated in CBAM, the local cues beneficial for person re-id are not exploited. We thus combine part feature learning and the attention mechanism for person re-id. The contributions of this paper can be summarized as follows:

  1) We present a part-based attention model for person body-part feature refinement, and the visualization results show that it can alleviate the spatial misalignment problem to some extent.

  2) Two levels of pedestrian descriptors are simultaneously learned to leverage the complementary advantages of global and local features.

  3) Extensive experiments are conducted on several datasets to validate the effectiveness of the presented method.

The remainder of this paper is organized as follows: we review related studies in Section 2. Our method is presented in Section 3, and the experimental results are shown in Section 4. Finally, the conclusion is drawn in Section 5.

2 Related work

Person re-id plays an important role in surveillance systems and thus has drawn increasing attention in recent years. Typical person re-id systems consist of two major components, namely extracting feature representations to describe the person appearance and learning distance metrics to measure feature similarities. For feature representations, the commonly used features include RGB, LAB, color names [60], local binary patterns (LBP) [17, 18, 22, 57], Gabor filter features [22], color histograms and their variants [17, 18, 52], etc. For example, Li et al. [22] combined LBP, HSV color histogram, Gabor and HoG features to represent person images. Liao et al. [27] constructed a feature descriptor by maximizing the horizontal occurrence of local features. Gray et al. [10] proposed to utilize the AdaBoost algorithm to select the most discriminative features. Farenzena et al. [8] exploited the symmetry and asymmetry properties of body structures to extract robust features for person re-id. Kviatkovsky et al. [20] utilized color intra-distribution signatures and proposed an illumination-invariant color descriptor. For metric learning, many machine learning algorithms are utilized to learn a mapping function from the feature space to the distance space, in which intra-class distances are minimized while inter-class distances are maximized. For instance, Zheng et al. [68] proposed a relative distance comparison (RDC) learning model from a probabilistic perspective. Davis et al. [6] proposed the information-theoretic metric learning (ITML) method based on the Mahalanobis distance. Liao et al. [27] proposed Cross-view Quadratic Discriminant Analysis (XQDA), which simultaneously learns a discriminant low-dimensional subspace and a distance metric. Other representative metric learning methods include Local Fisher Discriminant Analysis (LFDA) [32], large-scale metric learning from equivalence constraints (KISSME) [18], and Large Margin Nearest Neighbor (LMNN) [14]. Further, Xiong et al. [57] extended many linear models (e.g., PCCA [30] and KISSME [18]) into their kernel versions. However, the performance of these methods is limited because of the separate optimization of feature extraction and metric learning.

Recently, deep learning models that jointly learn feature representations and distance metrics have dominated the person re-id community. Numerous network architectures and training loss functions have been proposed to learn more robust pedestrian descriptors. One important type of CNN model is the verification model that takes image pairs or triplets as input. For example, Li et al. [24] proposed a patch matching layer in a filter pairing neural network (FPNN) to learn the joint representations of paired images. Yi et al. [61] proposed a Siamese convolutional neural network followed by a cosine layer to calculate pairwise similarity. Ahmed et al. [1] proposed to compute the cross-input neighborhood differences in an improved Siamese CNN model. Later, Wu et al. [54] improved the work in [1] by increasing the depth of layers and using very small convolution filters. Ding et al. [7] utilized the triplet loss to learn a view-invariant feature space. Chen et al. [3] improved the triplet model and designed a novel quadruplet deep network, which was trained using a quadruplet loss. Hermans et al. [13] proposed a batch-hard triplet loss that selects the hardest positive and hardest negative pairs in a training batch to form the triplet unit. Cheng et al. [5] learned both global and local features in a triplet-based Siamese CNN model. Varior et al. [45] proposed to model the spatial contextual information between different parts using a long short-term memory (LSTM) architecture. Meanwhile, Varior et al. [44] also proposed a gated CNN to capture effective subtle patterns. Wang et al. [49] simultaneously learned single-image and cross-image representations in a unified triplet and Siamese deep architecture. With the scale growth of person re-id datasets, another type of CNN model, namely the identification model, has obtained outstanding performance. For instance, Xiao et al. [55] learned to predict the person identities from multiple datasets in a domain-guided deep network. Zheng et al. [70] proposed to enrich the diversity of training samples using a Generative Adversarial Network (GAN). Zhong et al. [77] proposed a novel camera style transfer model based on CycleGAN [78]. Lin et al. [28] constructed an attribute-person recognition (APR) network, which learned an identity embedding and simultaneously predicted the pedestrian attributes. Sun et al. [42] proposed to decorrelate the learned weight vectors of an identification CNN using singular vector decomposition (SVD). Li et al. [25] proposed a multi-loss classification model to jointly learn global and local discriminative features. Many methods also construct hybrid models that leverage the complementary advantages of two losses. For example, Wang et al. [47] trained a multi-task attentional network using both identification and verification supervisory signals. Zheng et al. [69] combined verification and identification models in a deep network. Our method is built on the identification/classification model.

One major challenge in the person re-id task is the misalignment problem shown in Fig. 1. Many methods address this challenge by exploiting local visual similarities on predefined rigid body parts [1, 5, 24, 45, 54]. However, rigid partition of person images still cannot fully capture the body structure under pose variations, and thus the performance of these methods is limited. Other methods employ external pose estimation models for accurate body part localization. For example, Zhao et al. [62] utilized Convolutional Pose Machines (CPM) [51] to localize the head-shoulder region, upper body region, and lower body region. Su et al. [41] utilized Spatial Transformer Networks (STN) [16] to localize the body parts. Sarfraz et al. [35] included the confidence maps of human poses in the model training process and proposed a new unsupervised re-ranking framework. Xu et al. [59] integrated pose estimation into the feature learning stage. These methods either rely on pre-trained pose estimation models or need manual operation, which is disadvantageous in terms of time efficiency. Moreover, there exists a domain bias between re-id and pose datasets, which may lead to inaccurate pose localizations. Some other methods propose attention models to focus on the discriminative regions of person images. For instance, Liu et al. [29] combined the attention mechanism with an LSTM for informative part localization. Li et al. [26] proposed to combine hard regional attention [16], a residual attention model [48], and channel attention in an integrated deep framework. Zhao et al. [63] proposed a fully convolutional attention model to eliminate the misalignment problem. Our work differs from the above attention models in that it uses a simple channel-spatial attention model to refine body-part features and a multi-loss function to jointly learn global and local features. Our method is most similar to the PCB model [43]. The PCB model [43] first uniformly slices the feature maps and then uses an offline block termed Refined Part Pooling (RPP) [43] to deal with the spatial misalignment problem. Compared to the work in [43], our model instead integrates an attention model for feature refinement.

3 Proposed model architecture

In this work, we design an end-to-end deep model that formulates person re-id as an identity classification problem. In this section, we present our person re-id method. First, we describe the deep neural network utilized in our method; then we show the details of the part-based attention model for feature refinement. Finally, we present the multi-loss function for global and local feature learning. Figure 2 illustrates the whole framework of our method.

Fig. 2

Illustration of the proposed person re-identification system. We first employ a deep network to extract the base feature maps for the input image. Then, the person body is vertically sliced into several parts, and each of them is passed through the presented Part-based Attention Model (PAM) for feature refinement. Two levels of feature representations, including global features and local features, are obtained from the refined feature maps to predict person identities. GAP and FC denote global average pooling and fully connected layer, respectively. Here, the FC layer acts as a classifier. Moreover, ID is the abbreviation of identity. Taking ResNet50 as an example, the network output size is 2048 × 24 × 8, and thus the dimension of each pooled feature is 2048. The number of predicted IDs K is equal to the number of training identities on each dataset (e.g., K = 750 on Market1501)

3.1 Convolutional representations

Deep learning models, especially Convolutional Neural Networks (CNNs), have demonstrated significant performance improvements in a series of pattern recognition applications, including the person re-id task. Inspired by this, we employ a CNN to extract compact appearance features for person images. The backbone network utilized in our method is ResNet50 [12]. ResNet50 is constructed from five sequential downsampling blocks. The first block is one convolutional layer, and the rest are four residual blocks, each encapsulating several convolutional layers with batch normalization, ReLU, and optional pooling operations. Each block halves the spatial size of the feature maps. The original design of ResNet50 is shown in Fig. 3a. Given an input image of size 384 × 128, the final spatial size in ResNet50 is 12 × 4, which may be too small to retain the spatial regional information. We thus remove the downsampling in the last residual block to enlarge the resolution of the feature maps. As illustrated in Fig. 3b, the final size of the feature maps in our model is 24 × 8.
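As a concrete illustration, the following is a minimal sketch of this modification assuming the torchvision ResNet50 implementation; it is not the authors' code, but it shows how removing the stride-2 downsampling in the last residual block yields 24 × 8 feature maps for a 384 × 128 input.

```python
import torch
import torchvision

# Minimal sketch (assumption: torchvision's ResNet50): disable the stride-2
# downsampling in the last residual block so the output keeps a 24x8 resolution.
backbone = torchvision.models.resnet50(pretrained=True)
backbone.layer4[0].conv2.stride = (1, 1)
backbone.layer4[0].downsample[0].stride = (1, 1)

# Keep only the convolutional stages; the classifier head is not used.
extractor = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
)

feats = extractor(torch.randn(1, 3, 384, 128))
print(feats.shape)  # expected: torch.Size([1, 2048, 24, 8])
```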

Fig. 3

Architecture of the ResNet-50 model. We use the high-level feature maps from Res4 block as the base features

3.2 Part-based attention model

Person re-id systems have benefited a lot from the manipulation of spatial information. Specifically, many methods learn local feature representations from predefined rigid body parts. Albeit simple, rigid partition of person images can roughly preserve the human body structure in the vertical direction, that is, the head is usually in the top part, while the torso and legs are in the middle and bottom parts. However, person images automatically detected by offline person detectors usually contain spatial misalignment and noisy occlusion. The features extracted from rigid body parts thus cannot describe person appearances well.

In this paper, we present a part-based attention model to simultaneously take advantage of body distributions and overcome the weakness of rigid spatial decomposition. Our attention model is motivated by the Squeeze-and-Excitation Network (SENet) [15], which models the interdependency between different convolutional channels, and CBAM [53], which recalibrates global feature responses. Different from those methods, our attention model aims to refine body-part feature maps for more robust pedestrian representations. As shown in Fig. 2, the output features from the deep network are denoted by \(X\in R^{C\times H\times W}\), where C, H, and W represent the channel, height, and width of the feature maps. We first uniformly partition the convolutional output into L vertical parts. Then, each body-part feature cube is refined by PAM. As illustrated in Fig. 4, given the input part \(P\in R^{C\times h\times W}\) where \(h=\frac {H}{L}\), PAM sequentially generates a 1D channel attention map \(M_{c}\in R^{C\times 1\times 1}\) and a 2D spatial attention map \(M_{s}\in R^{h\times W}\) to weight the channels and positions, respectively. Afterwards, the weighted feature cube is summed with the input part using an element-wise operation to obtain the final refined part. The overall attention process can be denoted as:

$$ \begin{array}{@{}rcl@{}} P_{c}&=&M_{c}\otimes P, \\ P_{s}&=&M_{s}\otimes P_{c}, \\ P^{\prime}&=&P\oplus P_{s}, \end{array} $$
(1)

where ⊗ and ⊕ represent element-wise multiplication and summation, respectively. Pc and Ps denote the part feature cubes weighted by the channel attention map Mc and the spatial attention map Ms, respectively. \(P^{\prime }\) is the final refined body-part feature. Below we describe the computation of each attention map.

Fig. 4

Overview of the Part-based Attention Model (PAM). PAM contains a Channel Attention Block (CAB) and a Spatial Attention Block (SAB) to weight the input body-part feature cube along channel axis and spatial axis, respectively. The weighted feature cube is then element-wisely summed with the input to generate the final refined features

Channel attention block

The purpose of the Channel Attention Block (CAB) is to explicitly model the interdependencies between the channels of convolutional features [15]. The structure of CAB is illustrated in Fig. 5a. CAB first uses a Global Average Pooling (GAP) operation to integrate the spatial information of the feature maps into a feature vector. Then the vector is forwarded into a multi-layer perceptron (MLP) to generate the attention map \(M_{c}\in R^{C\times 1\times 1}\). Specifically, we construct the MLP using two fully-connected (FC) layers, whose activation outputs are of size \(R^{C/r\times 1\times 1}\) and \(R^{C\times 1\times 1}\), respectively. Here, r is the reduction factor used to reduce the number of parameters. In short, the channel attention can be expressed as:

$$ M_{c} = \mathbf{W}_{\mathbf{1}}^{\sigma}(\mathbf{W}_{\mathbf{0}}^{\mathtt{ReLU}}({\mathbf{G}\mathbf{A}\mathbf{P}}^{\mathtt{s}}(P))), $$
(2)

where GAPs denotes the GAP operation along the spatial dimension. \(\mathbf {W}_{\mathbf {0}}\in R^{\frac {C}{r}\times C}\) and \(\mathbf {W}_{\mathbf {1}}\in R^{C\times \frac {C}{r}}\) are the two FC layers of the MLP. Their corresponding activation functions are ReLU and σ, respectively, where σ represents the sigmoid function. It is worth noting that the sigmoid function ensures that the values of the attention map lie in the interval [0,1].
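For concreteness, a minimal PyTorch sketch of a channel attention block of this form (Eq. (2)) is given below; the reduction factor r = 16, the class name, and the variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Sketch of CAB: spatial GAP followed by a two-layer MLP (ReLU, sigmoid)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_0 with ReLU
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W_1 with sigmoid
            nn.Sigmoid(),
        )

    def forward(self, part: torch.Tensor) -> torch.Tensor:
        # part: (B, C, h, W) body-part feature cube P
        squeezed = part.mean(dim=(2, 3))               # GAP over the spatial axis -> (B, C)
        weights = self.mlp(squeezed)                   # attention values in [0, 1]
        return weights.view(part.size(0), -1, 1, 1)    # M_c: (B, C, 1, 1)
```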

Fig. 5

Structure of the Channel Attention Block (CAB) and Spatial Attention Block (SAB). GAPs and GAPc denote the GAP operation along the spatial axis and channel axis, respectively. MLP represents multi-layer perceptron

Spatial attention block

We employ the Spatial Attention Block (SAB) to automatically discover the salient regions of a body part, which is complementary to the channel attention. The structure of SAB is illustrated in Fig. 5b. SAB first aggregates the channel information into one feature map by applying a GAP operation along the channel axis. Then the aggregated feature map is passed through one convolutional layer to generate the spatial attention map \(M_{s}\in R^{h\times W}\). The filter size and stride of the convolution are 1 × 1 and 1, respectively. In short, the spatial attention can be denoted as:

$$ M_{s} = {\mathbf{C}\mathbf{o}\mathbf{n}\mathbf{v}}^{^{\sigma}}({\mathbf{G}\mathbf{A}\mathbf{P}}^{\mathtt{c}}(P_{c})), $$
(3)

where GAPc denotes the GAP operation along the channel dimension.
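A corresponding sketch of the spatial attention block (Eq. (3)) and the overall PAM refinement (Eq. (1)), reusing the ChannelAttentionBlock sketched above, might look as follows; again, the class names and default settings are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """Sketch of SAB: channel GAP, then a 1x1 convolution with sigmoid (Eq. (3))."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=1, stride=1)

    def forward(self, part_c: torch.Tensor) -> torch.Tensor:
        # part_c: (B, C, h, W) channel-weighted feature cube P_c
        squeezed = part_c.mean(dim=1, keepdim=True)    # GAP over the channel axis -> (B, 1, h, W)
        return torch.sigmoid(self.conv(squeezed))      # M_s: (B, 1, h, W)


class PartAttentionModel(nn.Module):
    """Sequential channel-spatial refinement of one body part, following Eq. (1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.cab = ChannelAttentionBlock(channels, reduction)  # defined in the sketch above
        self.sab = SpatialAttentionBlock()

    def forward(self, part: torch.Tensor) -> torch.Tensor:
        part_c = self.cab(part) * part        # P_c = M_c (x) P
        part_s = self.sab(part_c) * part_c    # P_s = M_s (x) P_c
        return part + part_s                  # P' = P (+) P_s
```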

3.3 Multi-loss training

As shown in Fig. 2, given an input image, its corresponding output from the backbone network is first decomposed into L parts, and then each body part is refined by PAM along the channel and spatial dimensions. After feature refinement, the full body is mapped into a set of refined parts \(\left \{P^{\prime }_{1},P^{\prime }_{2},\ldots ,P^{\prime }_{L}\right \}\). To better utilize the global-local complementary cues, we extract two levels of pedestrian descriptors, namely global features and local features. Specifically, the global features are obtained by first concatenating all the parts along the vertical axis and then passing the concatenated feature maps through a GAP operation. The local features are generated by pooling each refined part. The global feature and each of the part-level features are then forwarded into an FC layer to make identity predictions.

During the training phase, the softmax loss is utilized to minimize the identity classification errors. In a training batch, supposing that the number of images is N and each image belongs to one of K identities, the softmax loss can be written as:

$$ \mathcal{L}=-\frac{1}{N}\sum\limits_{i=1}^{N}\sum\limits_{k=1}^{K}y_{k}^{i}\log\hat{p}_{k}^{i}, $$
(4)

where \(\hat {p}_{k}^{i}\) is the probability of the ith image belonging to the kth identity, and \(y_{k}^{i}\) is the ground truth indicator, which equals 1 if the ith image belongs to the kth identity and 0 otherwise.

The total loss function to train global and local features thus can be expressed as:

$$ \mathcal{L}_{total}=\mathcal{L}^{g} + \sum\limits_{i=1}^{L}\mathcal{L}^{l}_{i} $$
(5)

where the superscripts g and l are the abbreviations of global features and local features, respectively.
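For concreteness, a minimal sketch of this multi-loss head is given below; the class name, the defaults of L = 6 parts and K = 750 identities (the Market1501 setting mentioned in Fig. 2), and the variable names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiLossHead(nn.Module):
    """Sketch of the global + local softmax losses in Eq. (5)."""

    def __init__(self, channels: int = 2048, num_parts: int = 6, num_ids: int = 750):
        super().__init__()
        self.global_fc = nn.Linear(channels, num_ids)
        self.local_fcs = nn.ModuleList(
            [nn.Linear(channels, num_ids) for _ in range(num_parts)]
        )
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, refined_parts, labels):
        # refined_parts: list of L tensors P'_i, each of shape (B, C, h, W)
        full_body = torch.cat(refined_parts, dim=2)             # concatenate along height
        global_feat = full_body.mean(dim=(2, 3))                # GAP -> (B, C)
        total = self.criterion(self.global_fc(global_feat), labels)      # L^g
        for part, fc in zip(refined_parts, self.local_fcs):
            local_feat = part.mean(dim=(2, 3))                  # GAP per refined part
            total = total + self.criterion(fc(local_feat), labels)       # + L^l_i
        return total
```

Given the 2048 × 24 × 8 backbone output, the L parts could be obtained with `torch.chunk(feats, 6, dim=2)` and each refined by the PAM sketch above before being passed to this head.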

4 Experimental results

4.1 Experimental settings

Datasets

The experiments are conducted on four challenging datasets, including Market1501 [66], CUHK03 [24], DukeMTMC-reID [70], and CUHK01 [23]. Market1501 contains 32,668 auto-detected bounding boxes of 1,501 identities. It is one of the largest re-id benchmarks in the existing literature. Images of each identity are captured by at most six cameras placed in front of a campus supermarket with a complex environment. The Deformable Part Model (DPM) [9] detector is employed as the pedestrian detector, and thus human parts are not well aligned in the bounding boxes. The CUHK03 dataset includes more than 13,000 images of 1,360 identities collected on a university campus. Each identity is captured from two disjoint cameras and has 4.8 images on average for each view. Two versions are provided for this dataset, namely the manually labeled version and the version automatically detected by the DPM detector. We evaluate our model on the bounding boxes detected by DPM, which is closer to the realistic setting. The DukeMTMC-reID dataset is a subset of the multi-target, multi-camera pedestrian tracking dataset [33]. We use the re-id version provided by [70], which contains 34,183 images of 1,404 person identities. The pedestrian bounding boxes are manually cropped. Each person is captured by at most eight different high-resolution cameras. The CUHK01 dataset contains 971 persons captured by two non-overlapping camera views. This dataset includes 3,884 images, and each person identity has four images.

Evaluation protocol

We conduct the comparison experiments under the single-query setting. Two widely used evaluation metrics are adopted for performance comparison, namely the cumulative matching characteristic (CMC) [31] and mean average precision (mAP) [66]. Each dataset is split into two subsets, namely a training subset and a testing subset with non-overlapping person identities. For Market1501, we follow the standard training/testing protocol defined by [66], which uses a fixed set of 750 identities as the training subset and the remaining 751 identities as the testing subset. For CUHK03, we adopt the new training/testing protocol proposed in [75], which fixes 767 person identities for training and the remaining 700 identities for testing. For DukeMTMC-reID, following the evaluation protocol in [70], the 1,404 identities are divided into a training subset with 702 identities and a testing subset with the remaining 702 identities. For CUHK01, we randomly divide the 971 persons into a training subset with 485 persons and a testing subset with 486 persons.

Implementation details

We implement our method based on the open source PyTorch library, and we train our model on a compute node with an Intel Xeon CPU (64G memory) and four Titan GPUs (48G memory in total). The backbone network is pre-trained on ImageNet [34]. In all experiments, the images are first re-scaled to 420 × 140 and then cropped to 384 × 128. Common data augmentation techniques are applied to the input images, including mirror flipping, minor rotation, and random erasing [76]. Besides, the pixels of all images are normalized to [0,1], subtracted by the mean pixel values of the RGB channels, and then divided by the standard deviation of each channel. The GAP outputs are sequentially followed by batch normalization and a dropout operation, which play a critical role in avoiding over-fitting. The dropout ratio is 50%. In the training process, each batch contains 16 person identities, and for each identity we randomly select 5 images; thus the batch size is 80. The Adam optimizer is applied for model training, and we set the initial learning rate to 3 × 10− 4. The learning rate decreases every 50 epochs by a factor of 0.1, and our model converges stably after 180 epochs.
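A minimal sketch of this training schedule is shown below; `model` and `train_loader` are placeholders standing in for the full network and the identity-balanced batch sampler, so the snippet illustrates only the optimizer and learning-rate settings described above rather than a complete training script.

```python
import torch

# Sketch of the reported schedule: Adam with lr 3e-4, decayed by 0.1 every
# 50 epochs, batches of 16 identities x 5 images = 80, trained for 180 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(180):
    for images, labels in train_loader:              # 16 ids x 5 images per batch
        optimizer.zero_grad()
        loss = model(images.cuda(), labels.cuda())   # returns the total multi-loss
        loss.backward()
        optimizer.step()
    scheduler.step()
```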

4.2 Comparison with state-of-the-art methods

In this section, we compare the performance of our method with recent state-of-the-art approaches, including both hand-crafted features based methods and deep learning based methods. The hand-crafted features based methods include SDALF [8], eSDC [64], BoW [66], KISSME [18], and LOMO [27]. The deep methods include PersonNet [54], End-to-end CAN [29], Siamese LSTM [45], ID-discriminative Embedding (IDE) [67, 75], Gated CNN [44], Spindle Network (SpindleNet) [62], GAN [70], Pose Invariant Embedding (PIE) [65], Deeply Learned Part-aligned Representation (DLPR) [63], CNN-Embedding [69], Pose-driven Deep Convolutional model (PDC) [41], TriNet [13], Joint Learning Multi-Loss (JLML) [25], Pose-Sensitive Embedding (PSE) [35], Cam-GAN [77], TGP [58], Harmonious Attention CNN (HA-CNN) [26], Dual Attention Matching network (DuATM) [39], Part-based Convolutional Baseline (PCB) [43], SVDNet [42], Online Instance Matching (OIM) [56], Attribute-Complementary Re-id Network (ACRN) [36], Attention-Aware Compositional Network (AACN) [59], Deep Anytime Re-ID (DaRe) [50], and Mancs [47]. Note that some methods, including DaRe [50], TriNet [13], and PSE [35], obtain better results using an additional re-ranking technique [75]. For fair comparison, we only compare against their results without the re-ranking scheme, since our method does not use re-ranking.

Performance on Market1501

The comparison results are shown in Table 1. It can be seen that deep methods, especially recent state-of-the-art models including PDC [41], JLML [25], DuATM [39], and PCB [43], perform significantly better than the hand-crafted features based methods (e.g., BoW [66] and eSDC [64]), illustrating the powerful feature learning capability of deep networks. Our method achieves 93.6% Top1 accuracy and 81.7% mAP on this dataset. Our model obtains significantly better performance than PersonNet [54], Siamese LSTM [45], Gated CNN [44], and GAN [70]. The Top1 accuracy of our method is 16.7%, 14.9%, 9.2%, 7.7%, and 5.9% better than the pose-driven models SpindleNet [62], PIE [65], PDC [41], AACN [59], and PSE [35], respectively. Compared to previous attention-based models, namely End-to-end CAN [29], DLPR [63], HA-CNN [26], and DuATM [39], our method improves the Top1 accuracy by 45.4%, 12.6%, 2.4%, and 2.2%, respectively, and the mAP by 57.3%, 18.3%, 6.0%, and 5.1%, respectively. Our work is close to CNN-Embedding [69], JLML [25], Mancs [47], and PCB, which are trained using multiple losses. CNN-Embedding [69] and Mancs [47] both combine verification loss and identification loss, while our method only relies on the softmax loss. JLML [25] and our method learn global-local features. Compared to JLML [25], our method needs no feature sparsity constraints. Our model shares a similar structure with PCB [43], which is based on uniform body-part partition. Compared to the combination of PCB [43] and Random Erasing (RE) [76], our method obtains better re-id accuracies, with a 0.8% Top1 accuracy improvement (ours 93.6% versus PCB 92.8%) and a 3.0% mAP improvement (ours 81.7% versus PCB 78.7%). PCB [43] deals with the part inconsistency issue caused by rigid part partition using an offline RPP module, which is not efficient. In contrast, we use end-to-end trainable part attentions for feature refinement. Besides, the performances of our model are comparable to those of combining PCB, RE [76], and RPP.

Table 1 Performance comparison on Market1501 dataset. “*” denotes unpublished paper. “-” means no available reported results

Performance on CUHK03

We conduct experiments on the detected version of the CUHK03 dataset, which is a more realistic setting considering spatial displacement, partial occlusion, and pose changes. The comparison results are shown in Table 2. It can be seen that our method consistently outperforms all the hand-crafted features based methods by a large margin, including BoW [66] and LOMO [27]. Our model obtains 64.1% Top1 accuracy and 60.8% mAP, performing better than most of the compared deep models. Specifically, our method outperforms SVDNet [42], HA-CNN [26], TGP [58], and DaRe [50] by 22.6%, 22.4%, 7.6%, and 2.5%, respectively, at Top1, and 23.5%, 22.2%, 8.6%, and 2.7%, respectively, in mAP. The part-based models, including PCB [43] and our method, achieve good performance on this dataset, illustrating the effectiveness of local cues in learning robust pedestrian descriptors. Compared to the combination of PCB [43] and RE [76], our method improves the Top1 accuracy and mAP by 2.3% and 4.4%, respectively. The Top1 accuracy and mAP of combining PCB, RE [76], and RPP [43] can be improved to 64.0% and 58.2%, respectively. But our model still performs slightly better. On this dataset, although Mancs [47] and AACN [59] obtain high Top1 accuracy and mAP, our method, which depends on neither a multi-task learning scheme [47] nor a complicated pose estimation algorithm [59], has an advantage in model complexity.

Table 2 Performance comparison on CUHK03 detected dataset

Performance on DukeMTMC-reID

The experimental results are shown in Table 3. It can be observed that our method performs better than most of the deep models. For instance, our method outperforms GAN [70], SVDNet [42], AACN [59], and DaRe [50] by 17.0%, 8.0%, 7.9%, and 5.6%, respectively, at Top1, and 22.3%, 12.6%, 10.1%, and 6.4%, respectively, in mAP. On this dataset, our method obtains 84.7% Top1 accuracy and 69.4% mAP, which are better than ACRN [36], which requires person attributes, and Cam-GAN [77], which depends on camera information. Compared to PSE [35], which utilizes auxiliary human pose cues, our method improves the Top1 accuracy by 4.9% and the mAP by 7.4%. Our model performs better than PCB [43] and its combination with RE [76], as well as RPP. Besides, the performances of our method are comparable to those of the best-performing method Mancs [47] on this dataset, which depends on a complex hard example mining scheme.

Table 3 Performance comparison on DukeMTMC-reID dataset

Performance on CUHK01

The experimental results are displayed in Table 4. It can be seen that our method performs much better than the hand-crafted features based models, including KISSME [18], eSDC [64], and KLFDA [57]. On this dataset, our method obtains 86.4% Top1 accuracy and 85.3% mAP, which outperforms many deep learning models. For example, our method improves the Top1 accuracy by 19.8%, 23.8%, 9.7%, and 6.5% over DGDNet [55], Quadruplet [3], JLML [25], and SpindleNet [62], respectively. Compared to PCB [43] and its combination with RE [76], our method improves the Top1 accuracy by 3.3% and 2.6%, respectively, and the mAP by 3.5% and 2.6%, respectively. Compared to the combination of PCB [43], RE [76], and RPP [43], our model achieves slightly better performance on this dataset. Besides, it is worth noting that our method needs no offline operation and thus exhibits high efficiency in the training stage.

Table 4 Performance comparison on CUHK01 dataset

4.3 Analysis of proposed model

We further conduct a comprehensive performance analysis to evaluate the effectiveness of each component of the presented method.

Effectiveness of attention model

As shown in Fig. 3, the output spatial size of the last residual block is 24 × 8. We evaluate how the attention model contributes to the person re-id performance under different part partitions. We divide the feature map into 1, 2, 4, 6, 8, and 12 parts, whose spatial sizes are 24 × 8, 12 × 8, 6 × 8, 4 × 8, 3 × 8, and 2 × 8, respectively. Figure 6 displays the experimental results on Market1501. It can be observed that, under the same part partition, global or local features with PAM generally perform better than those without PAM, which validates the effectiveness of our attention model in boosting the re-id performance. In particular, the performance of local features is largely improved as the part number increases. When the part number L = 1, PAM degenerates to CBAM [53]. When the part number L = 6, the Top1 accuracy and mAP of local features with PAM reach their best, which are 93.2% and 80.4%, respectively. Therefore, we use this setting to conduct all the experiments. When the part number continues to increase, the performance of local features with and without PAM drops at different rates. Specifically, the accuracy of local features with PAM decreases slightly, while large accuracy drops can be observed when extracting local features without PAM. This implies that the performance of local features with PAM is not very sensitive to the part partition. The local features with PAM can still obtain a relatively high Top1 accuracy even when the image is partitioned into two parts. Therefore, our method can better handle accuracy-efficiency trade-offs if applied to realistic re-id scenarios.

Fig. 6

Performance of different features under different part partitions. “w” and “w/o” represent “with” and “without”, respectively

Performance of different attention methods

In Table 5, we explore how to effectively compute the spatial attention and how to arrange the order of the two attention modules. Note that the channel attention model is the same as the SENet [15] module. It can be seen that incorporating either channel attention or spatial attention boosts the re-id performance compared to the ResNet50 baseline. The accuracies are further improved after combining the two attention modules. Besides, similar performances can be observed when using different convolution kernel sizes (5×5, 3×3 or 1×1) to compute the spatial attention module. We thereby use the 1×1 convolution operation in the spatial attention considering the computation cost. Regarding the order of the two attention modules, we evaluate three arrangements, namely sequential channel-spatial, sequential spatial-channel, and parallel use of the two modules. We can observe that the two sequential orders outperform the parallel arrangement, possibly because the channel and spatial attention models generate two different semantic embedding spaces, and simply fusing them yields smaller gains. Finally, the sequential channel-spatial design is chosen in our model for its slightly better performance than the sequential spatial-channel arrangement.

Table 5 Experimental results of different attention methods on Market1501 dataset

Effectiveness of multi-loss training

Our model is trained using multiple softmax losses. To reveal how each ingredient contributes to the performance improvement, we report the results of baseline networks, different losses, and their combinations. The experimental results on four datasets are shown in Table 6, where the subscripts g and l represent global loss and local loss, respectively. Several important observations can be made from the results. 1) Performance improvement can be observed when augmenting the spatial size of the feature maps, probably because more spatial information can be retained by using larger feature resolutions. On the four datasets, namely Market1501, CUHK03, DukeMTMC-reID, and CUHK01, the respective accuracy improvements are 0.6%, 2.1%, 0.8%, and 1.7% at Top1, and the respective mAP improvements are 0.5%, 2.2%, 1.2%, and 0.9%. 2) It can be seen that the attention model consistently achieves better performance than the baseline network. For instance, compared to ResNet50g whose spatial size is 24×8, PAMg improves the Top1 accuracy on the four datasets by 0.6%, 2.4%, 0.9%, and 2.2%, respectively, and the mAP by 1.2%, 2.3%, 1.1%, and 2.4%, respectively. 3) In general, the feature embedding trained with the local loss outperforms that trained with the global loss. On the four datasets, the Top1 accuracy improvements are 2.4%, 10.3%, 1.4%, and 12.1%, respectively, and the mAP improvements are 5.9%, 8.8%, 4.7%, and 13.1%, respectively. This demonstrates the benefit of incorporating local cues. 4) Combining global information and local information improves the performance over using them individually. For instance, on Market1501, PAMg+l outperforms PAMg and PAMl by 2.8% and 0.4%, respectively, at Top1, and 7.2% and 1.3%, respectively, in mAP. This shows that global and local information are complementary in nature.

Table 6 Performance comparison of different losses on several datasets. The CMC Top1 accuracy (%) and mAP (%) are presented. 12×4 and 24×8 represent the spatial size of feature maps. The two subscripts g and l denote global loss and local loss, respectively

Experimental results on different CNN architectures

In Table 7, we conduct experiments on deep models with different parameters and layers, including two extra model architectures, namely AlexNet [19] and VGGNet [40]. Similar to ResNet50, all the FC layers and the last downsampling operation are removed. The input image size is the same for all models, namely 384 × 128. All the networks generate feature maps with the same resolution, namely 24 × 8. The part numbers of the other models are the same as that of ResNet50, namely L = 6. Besides, the training settings of VGGNet are the same as those of ResNet50, while for AlexNet a slightly higher learning rate of 10− 3 is used. From Table 7, we can observe that incorporating local cues and combining different losses both significantly boost the re-id performance of all CNN models, which further validates the effectiveness of the presented method in learning more robust features. Additionally, with the increase of model parameters or layers, performance improvements can be observed (for example, AlexNet versus VGG11, or VGG11 versus VGG16). But the performance improvements are not linearly correlated with the number of model parameters or layers. A better network design also contributes to the performance improvement. For instance, compared to VGG19, ResNet18, with half the parameters and nearly the same number of layers, obtains better performance. Besides, for the same model architecture, if the model parameters further increase (for example, VGG16 versus VGG19, or ResNet50 versus ResNet101), the re-id accuracies remain nearly the same, possibly because of the limited dataset scale.

Table 7 Top1 accuracy and mAP of different CNN architectures on the Market1501 dataset (separated by "/")

Qualitative results

In Fig. 7, we show some attention maps produced by PAM. We can see that the model learns to assign different weights to different regions. On the attention maps, it is apparent that the human body regions are more salient than background noise, such as cars and trees. The top and bottom regions are less salient than the middle body parts, probably because the low-resolution faces and small legs contain less discriminative information than the torso, which includes most of the clothing cues. It can be seen that personal belongings, such as backpacks and luggage, are partially attended, which means they can assist in matching persons. Our model can still focus on the body parts even under large view variations, including spatial misalignment, occlusion, and human pose change. For instance, in the second row, although the target persons are occluded by other persons or objects such as bicycles, PAM pays more attention to the human body regions of the target persons.

Fig. 7

Visualization of attention maps. The first three rows respectively exhibit some examples with different view variations, namely spatial misalignment, occlusion, and human pose change. The fourth row shows some examples that are well aligned

In Fig. 8, we show the top 10 retrieval results from Market1501. It can be observed that our model exhibits strong robustness to pose changes, scale variations, and spatial displacement. We can see that the false matchings are mainly caused by highly similar visual appearances. These failure cases are also very challenging from a human perspective, especially the person images with extremely similar clothes and human poses in the first two rows of Fig. 8b.

Fig. 8

Samples of the retrieval results on Market1501. The images in the first column are the query images. The top 10 retrieved images are sorted according to the similarity scores from left to right. Red rectangles represent the correct matches

Running time analysis

In Table 8, we compare the average feature extraction time per image and the average retrieval time per image of our method with five other methods, namely IDE(R) [75], CNN-Embedding [69], PCB [43], RCN [72], and IDLA [1]. For fair comparison, we re-implement their feature extraction codes using PyTorch. The experiments are conducted on a machine with an Intel Xeon CPU (64G memory) and four Titan GPUs (48G memory in total). During the test phase, a batch is composed of 2,000 images. It can be seen that the identification models, including IDE(R) [75], CNN-Embedding [69], PCB [43], and our method, exhibit high computation efficiency compared to the verification models IDLA [1] and RCN [72]. The identification models use feature vectors that can be saved in a buffer for distance calculation, and they only need to forward each image once. But verification models have to forward the same image several times to obtain the joint feature of each image pair, which is a time-consuming process.

Table 8 Comparison of average feature extraction time and average retrieval time on Market1501 (millisecond per image)

5 Conclusion

In this paper, we present a part-based attention network with multi-loss training for the task of person re-id. Specifically, the part-based attention model contains a channel attention block and a spatial attention block to refine the feature maps of person body parts along the channel and spatial dimensions. The attention model is capable of alleviating the spatial misalignment problem. Besides, to fully exploit the complementary benefits of global-local cues, two levels of pedestrian descriptors, including global full-body and local body-part features, are extracted from the refined feature maps, each of which is trained using an identification loss. We conduct extensive experiments on four public person re-id benchmarks, including Market-1501, CUHK03, DukeMTMC-reID, and CUHK01. The experimental results demonstrate that our method yields higher re-id accuracy than most state-of-the-art approaches.