1 Introduction

Person re-identification (Re-ID) aims to retrieve images of the same person from an image gallery. It is a core component of intelligent surveillance systems and, given the rapidly growing number of surveillance cameras, has both research and practical significance for public safety. Because images cropped from surveillance video depict highly complex scenes, the primary challenge of Re-ID comes from large variations in the imaged subject, such as posture, viewpoint, occlusion, clothing, background clutter, detection/tracking errors, and illumination changes. These factors make it difficult to identify a unique individual within an extensive gallery.

Among these factors, misalignment is the crucial one affecting Re-ID accuracy. First, pedestrians naturally adopt a variety of postures, and changes in posture mean that the location of each body part inside the bounding box is uncertain. Detection errors may also cause misalignment, since pedestrians can appear at various locations and scales within the detected image. In addition, different camera viewpoints can cause misalignment because the visible appearance, such as the observed clothing characteristics, changes with the view. Figure 1 shows misaligned images from three popular Re-ID datasets that illustrate these factors.

Fig. 1

Examples of detected pedestrian images from three popular Re-ID datasets (from left to right are Market-1501 [8], CUHK03 [7], and DukeMTMC-reID [20, 21]). The green bounding box represents the actual location of the pedestrian in the image. The first row shows the misalignment caused by posture and camera viewpoint changes, and the second row shows the misalignment from detection errors

In general, previous Re-ID methods extract features from the entire image and use them for retrieval. These methods either use global features directly [1, 2] or combine local features from various parts [3,4,5,6]. However, when pedestrians are not properly aligned, Re-ID accuracy can drop significantly. A typical practice, for example, is to divide the bounding box into horizontal stripes [3, 7,8,9,10]; this assumes only minor vertical misalignment. When severe vertical misalignment occurs, one person’s head may be matched against the background region of another image, so the advantage of the horizontal partition is diminished. In another case, the feature extractor may incorrectly weight the background under various pedestrian poses, impairing subsequent matching accuracy.

To our knowledge, several previous works [11,12,13,14,15,16,17,18,19] explicitly consider the misalignment problem. [11, 12] rely on pictorial structures, but the extracted features are affected by noise and introduce errors. [14, 17, 19] segment the human body into blocks using a more precise pose estimation network and apply a dedicated procedure for reorganization or feature fusion; however, their network structures are relatively simple, and the fusion process loses some information. In [87], although the PAP module is used for part alignment, its performance primarily comes from segmentation constraints on the target domain, which require impractical and complex pseudo-label generation. [13] introduced a four-stream network that simultaneously learns global features and part features for the head, upper body, and lower body, and then combines them into a GLAD descriptor; it is, however, not very robust to changes in posture and viewpoint. [15, 16, 18, 19] add attention factors: [16, 18] are based on pose keypoints, while [15, 19] derive attention from similarity calculations. These factors guide the model’s attention to the critical portions of the input image that contribute to alignment. Relying solely on attention for alignment, however, is not ideal, and the robustness of these methods remains limited. Unlike the above methods, our network incorporates multiple branches built on attention and alignment. Different branches use diverse attention mechanisms and feature partitions, complemented by global feature representations, which leads to better feature alignment and improves the model’s performance.

Considering the problems mentioned above and the limitations of existing methods, we propose a multi-branch feature fusion strategy based on pose-guided multi-attention and feature alignment. The attention mechanism directs the network toward image regions that are useful for the Re-ID task, and pose keypoints provide important high-level guidance for alignment. Combining these ideas, we propose a concise but efficient multi-branch attention and alignment network. It is divided into three parts. Input images are first routed through a backbone network (ResNet-50); the resulting global features are then sent to three distinct components. Part 1 produces global feature representations. In Part 2, we introduce a multiple attention mechanism comprising spatial attention, channel attention, and keypoint attention, followed by a horizontal feature partition for local feature mining. In Part 3, we introduce an alignment method based on part features: a pretrained pose estimation model provides 17 keypoints, and, relying on these keypoints, we divide the input feature map into nine parts to achieve alignment. In addition, considering the errors of the attention blocks and the pose estimator, both Part 2 and Part 3 include a sub-branch that supplies complementary global feature information. Notably, feature extraction is performed once by the backbone, and the subsequent branches only add attention and region divisions to further refine the features. This design keeps the number of network parameters small and makes the model easy to train. Simultaneously, we attach a loss function to the features of each minimal branch. Finally, these features are concatenated to create the final representation of the input image; this process strengthens the discrimination and robustness of the extracted features and thus improves Re-ID accuracy.

This paper’s main contributions are:

  • We propose a simple yet effective Re-ID pipeline called the multi-branch attention and alignment network (MAAN). It simultaneously learns local features through the attention mechanism and part features partitioned according to keypoints. In addition to the global feature representation branch, each main branch keeps a separate sub-branch that preserves the global features as a supplement before its feature partitioning step. By combining features from multiple levels and different critical locations, we achieve feature alignment and increase the network’s robustness.

  • We use keypoints to partition the global features into nine specialized parts and connect a classification loss function to each part feature in the part alignment branch. This process enables finer-grained mining of part features, mitigates the effects of real factors such as sample noise, and leads to improved alignment.

  • Using MAAN, we report competitive Re-ID accuracy on the Market-1501 [8], CUHK03 [7], and DukeMTMC-reID [20, 21] datasets.

2 Related work

2.1 Attention mechanism

In the last few years, the attention mechanism has been widely used in computer vision as a way to enhance convolutional neural networks (CNNs). Its primary objective is to select the most critical pieces of information from a large amount of data. SENet [22] proposed a squeeze-and-excitation network based on the relationships between feature channels: the interdependence between channels is modelled explicitly, the weight of each channel is learned automatically, and, according to these weights, the model enhances the channels that are valuable for the current task while suppressing the useless ones. SKNet [23], inspired by the way cortical neurons dynamically adjust their receptive fields to different stimuli, uses multiscale feature information to guide the weight distribution and to decide which kernel representation to emphasize. To emphasize meaningful features along both the spatial and channel dimensions, CBAM [24] combines channel and spatial attention modules to achieve a better feature representation.
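To make the channel-reweighting idea behind SENet concrete, the following is a minimal squeeze-and-excitation block sketch in PyTorch; the module name and reduction ratio are illustrative choices, not taken from [22].

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global spatial average
        self.fc = nn.Sequential(                       # excitation: two FC layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # reweight each feature channel
```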

2.2 Pose estimation

Research on pose estimation has evolved from classical methods [11, 12] to deep learning [25,26,27]. In general, the problem can be divided into four tasks: single-person skeleton estimation, multi-person pose estimation, video pose tracking, and 3D skeleton estimation. For single-person skeleton estimation, a cropped image of the person is used as input, and the required body keypoints, such as the head, left hand, and right knee, are predicted. Keypoints indicate the positions of human body parts and can assist a variety of vision tasks. In this paper, we use OpenPose [26] to produce keypoints, including keypoints on the face, hands, and joints of the human body, since it is a multistage pose estimator with continuous pose prediction.
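For illustration only, the snippet below shows how per-person keypoints can be obtained with an off-the-shelf detector; it uses torchvision's Keypoint R-CNN (which predicts the 17 COCO keypoints) purely as a stand-in, since our pipeline relies on OpenPose [26], whose interface is not reproduced here.

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# Stand-in keypoint extractor: torchvision's Keypoint R-CNN predicts the 17 COCO
# keypoints per detected person.  The paper itself uses OpenPose [26]; this only
# illustrates the kind of output a pose estimator provides.
model = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 384, 128)           # dummy pedestrian crop with values in [0, 1]
with torch.no_grad():
    detections = model([image])[0]
if len(detections["keypoints"]) > 0:
    kps = detections["keypoints"][0]      # shape (17, 3): x, y, visibility
    print(kps[:, :2])                     # keypoint coordinates in image space
```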

2.3 Person re-identification

Person re-identification addresses the problem of matching pedestrian images across disjoint cameras. The key challenge lies in the large intra-class and small inter-class differences caused by different views, illumination, occlusion, and poses. Existing techniques can be classified into hand-crafted descriptors [10, 30], metric learning methods [32,33,34], and deep learning algorithms [2,3,4,5,6,7,8,9,10, 35]. Most of these Re-ID approaches are not robust to changes in human pose and camera viewpoint, which restricts their applicability in real-world surveillance scenarios.

2.4 Part-based person re-identification

Part-based Re-ID methods use local descriptors from different regions to enhance the discrimination and robustness of the feature representation. Part-based deep feature extraction methods can be divided into two groups. The first group relies on predicted keypoints and therefore requires pretrained pose estimators. [36] proposed a pose-based attention perception synthesis network in which part visibility is also incorporated into the final feature representation. [37] combines a person’s fine and coarse pose information to learn a discriminative embedding by directly concatenating the confidence maps of 14 keypoints, so that the model learns alignment automatically. In [87], under the guidance of pose estimation and semantic segmentation, part-aligned pooling and part segmentation constraints were proposed to improve cross-domain Re-ID behaviour. The second group does not require keypoints or segmentation information. A simple method is to divide the person image or feature map into uniform partitions. [3] divided the feature map into p horizontal stripes and trained each embedding part independently with a non-shared classifier. Alternatively, local features can be extracted using pose-driven RoI extraction [14], human parsing results [38], or attention regions learned from appearance features [5, 6, 39]. For instance, [14] proposed using posture detection to generate local areas through a hand-crafted cutting scheme and then gradually fusing the part features, [38] extracted body-part features from human semantic parsing results, and [5, 6, 39] attempted to exploit local information using appearance-based attention maps.

Compared with the above Re-ID methods, we construct a more complete and robust fine-grained feature extraction and alignment framework based on multi-branch deep networks and multi-task learning, introducing a multi-attention mechanism and an alignment method simultaneously. In the former, we combine spatial attention, channel attention, and keypoint attention while performing feature partitioning at an appropriate scale to strengthen the mining of local features. In the alignment branch, we propose a new horizontal segmentation method that reuses hierarchical information, which differs from previous work: for example, ’upper leg’ features appear in the ’upper leg’, ’lower body’, and ’whole body’ parts, which feed three separate sub-branches. This design is more reasonable because each branch provides better gradient flow during training and alleviates the problem of uneven gradients among the classification loss functions. It also improves feature robustness when calculating feature similarity, leading to better results.

3 Method

This paper proposes a multi-branch Re-ID network consisting of a global feature extraction network; a multi-attention mechanism that incorporates channel, spatial, and keypoint attention; and a pose-guided part feature alignment network. We outline the proposed method’s overall framework in Section 3.1, and the design of the global feature extraction network is described in Section 3.2. The pose-guided attention mechanism and the part alignment method are discussed in Sections 3.3 and 3.4, respectively. Section 3.5 summarizes the overall training and inference procedure.

3.1 The overall framework

In a Re-ID system, the global features of an input pedestrian image can already achieve a reasonable Re-ID effect; learning more refined local features improves the results further compared with using only global features. Traditional methods usually use uniform partitioning and do not pay attention to the regions around keypoints. As a result, changes in pedestrian pose and camera viewpoint can substantially degrade the network’s performance. To solve this problem, this paper proposes the MAAN module, whose overall structure is shown in Fig. 2.

Fig. 2

Overview of our multi-branch attention and alignment network (MAAN). ResNet50 is used as the backbone for global feature extraction, whereas different branches adopt different pooling strides at the last layer. Pose information is included using a keypoint attention mechanism as well as a keypoint partition operation. The MAAN consists of three main branches, Part 1, Part 2, and Part 3. Part 1: global feature extraction network from all input images. Part 2: multi-attention mechanism including spatial attention, channel attention, and keypoint attention. Part 3: part feature alignment network based on a pose-guided feature partition operation

3.2 Global feature extraction

As shown in Fig. 2, Part 1 is a global feature extraction branch for the input images. To increase the scale of the extracted features, the input image is resized to 384 × 128. We adopt ResNet-50 as the backbone to extract feature f1 with a size of 2048 × 12 × 4. Part 1 then learns global information using global average pooling (GAP), a 1 × 1 convolution (Conv1×1), batch normalization (BN) and the ReLU activation function. We use ResNet-50 as the feature extractor because it converges quickly with a moderate number of parameters, and its residual connections make the model easier to train, preventing both vanishing gradients and diverging losses.
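A minimal PyTorch sketch of this branch is given below, assuming the 1 × 1 convolution is followed by BN and ReLU as stated; the module names and the torchvision weights argument are our own choices.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class GlobalBranch(nn.Module):
    """Part 1 sketch: ResNet-50 backbone followed by GAP -> Conv1x1 -> BN -> ReLU."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")                     # ImageNet-pretrained
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.reduce = nn.Sequential(
            nn.Conv2d(2048, feat_dim, kernel_size=1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.backbone(x)            # 2048 x 12 x 4 for a 384 x 128 input
        fg = self.reduce(self.gap(f1))   # f_g^{P1}: 256 x 1 x 1
        return fg.flatten(1)             # (batch, 256)
```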

The global feature \(f_{g}^{P1}\) is extracted from the input image by this branch, and its size is 256 × 1 × 1. Through this dimension reduction, f1 is reduced from 2048-dim to 256-dim, which makes feature computation more efficient. The 256-dim feature \(f_{g}^{P1}\) is used to simultaneously calculate the softmax loss \(L_{softmax}^{P1}\) and the hard triplet loss \(L_{triplet}^{P1}\); both losses are summed for backpropagation. For a feature fi, the softmax loss is formulated as follows:

$$ L_{softmax}=-\frac{1}{N}\sum\limits_{i= 1}^{N}\log \frac{\exp \left (W_{yi}^{T} f_{i}\right )}{{\sum}_{k=1}^{C}\exp \left ({W_{k}^{T}} f_{i}\right )} $$
(1)

\({W_{k}^{T}}\) is the weight vector for class k, N denotes the number of samples in a mini-batch, C denotes the number of classes in the training dataset, and \(W_{yi}^{T}\) is the weight vector of the ground-truth class yi of input sample i. To improve the ranking performance, the global feature \(f_{g}^{P1}\) is also trained with a hard triplet loss built from an anchor sample, a positive sample, and a negative sample: the anchor and positive samples form the most dissimilar positive pair, while the anchor and negative samples form the most similar negative pair. The hard triplet loss function is expressed as follows.

$$ L_{triplet}=\sum\limits_{i=1}^{N}\left[ \max_{p} \left\| {f_{a}^{i}}-{f_{p}^{i}} \right\|_{2} -\min_{n} \left\| {f_{a}^{i}}-{f_{n}^{i}} \right\|_{2} +margin\right]_{+} $$
(2)

\({f_{a}^{i}},{f_{p}^{i}},{f_{n}^{i}}\) are features extracted from an anchor, a positive sample and a negative sample, respectively, and the margin controls the inter-class distance. \(\left \| {f_{a}^{i}}-{f_{p}^{i}} \right \|_{2}\) is the Euclidean distance between the anchor and the positive sample, and \(\left \| {f_{a}^{i}}-{f_{n}^{i}} \right \|_{2}\) is the Euclidean distance between the anchor and the negative sample. The global feature performs well on Re-ID tasks; however, it also carries interference factors such as background noise, so we use it as a supplement to the overall representation and combine it with the other branches.
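A minimal batch-hard mining sketch consistent with Eq. (2) is shown below; the within-batch mining strategy and the helper name are our assumptions, since the paper does not spell out its mining implementation.

```python
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                            margin: float = 1.2) -> torch.Tensor:
    """Eq. (2)-style hard triplet loss: for each anchor in the mini-batch, use the
    farthest positive and the closest negative (sketch)."""
    dist = torch.cdist(feats, feats, p=2)                  # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)      # positive-pair mask (incl. self)
    hardest_pos = dist.masked_fill(~same, float("-inf")).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).sum()
```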

3.3 Multi-attention mechanism

The attention mechanism is an important tool in computer vision tasks because it directs the network toward the effective parts of the input image. Therefore, in the second part of the network (Part 2), we introduce a multi-attention mechanism combined with classic horizontal segmentation to complete the local feature extraction. Channel attention weights the different channels, and spatial attention weights the different regions; neither, however, accounts for the significance of human pose variation. Thus, following the channel and spatial attention blocks, we introduce a keypoint attention block. This combination makes the network focus on valuable parts of the input images while diminishing the importance of insignificant or even harmful regions.

As shown in Part 2 of Fig. 2, the downsampling operation in the last stage of ResNet-50 is removed, so this branch extracts feature f2 with a size of 2048 × 24 × 8. The last stride is changed from 2 to 1, which makes f2 twice as large spatially as f1. A larger feature map retains more information, which is more helpful for learning details.

We detail the attention mechanism in Fig. 3. First, f2 is passed through a global max pooling layer and a GAP layer in parallel. Both pooled descriptors are then forwarded to a shared sub-network that computes the channel attention; it consists of two convolution layers with a ReLU activation in between. The two outputs are added element-wise and passed through a sigmoid function to generate the final channel attention map. The channel attention map \(f_{channel\_ map}\) is computed as:

$$ \begin{array}{@{}rcl@{}} f_{channel\_ map}&=&\sigma \left (W_{1} \left (ReLU\left (W_{0} avgpool\left (f^{2} \right )\right ) \right )\right.\\ &&\left. + W_{1} \left (ReLU\left (W_{0} maxpool\left (f^{2} \right )\right ) \right ) \right ) \end{array} $$
(3)

where σ denotes the sigmoid function, the convolution weights W0 and W1 are shared by both inputs, and the ReLU activation function is applied after W0.
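A minimal PyTorch sketch of Eq. (3) follows; the reduction ratio of the two shared convolutions is an assumption, as the paper does not state it.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention map of Eq. (3): shared convolutions W0/W1 applied to the
    average- and max-pooled descriptors of f2 (sketch; reduction ratio assumed)."""
    def __init__(self, channels: int = 2048, reduction: int = 16):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.maxpool = nn.AdaptiveMaxPool2d(1)
        self.w0 = nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False)
        self.w1 = nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f2: torch.Tensor) -> torch.Tensor:
        avg = self.w1(self.relu(self.w0(self.avgpool(f2))))
        mx = self.w1(self.relu(self.w0(self.maxpool(f2))))
        return torch.sigmoid(avg + mx)     # f_channel_map, shape (B, C, 1, 1)
```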

Fig. 3

Illustration of our proposed multi-attention mechanism

The channel attention map \(f_{channel\_ map}\) and the feature f2 are multiplied element-wise to generate feature \( f_{channel}^{2} \), which is then used as the input to the spatial attention module. First, average pooling and max pooling are applied to \( f_{channel}^{2} \) along the channel dimension, and the two resulting single-channel maps are concatenated on the channel dimension. A convolution layer then reduces them to one channel, and the spatial attention map \(f_{spatial\_ map}\) is generated by a sigmoid function:

$$ f_{spatial\_ map} = \sigma \left (W_{2} \left [ avgpool\left (f_{channel}^{2} \right ) ,maxpool\left (f_{channel}^{2} \right )\right ] \right ) $$
(4)

where σ denotes the sigmoid function, and W2 is the convolution weight. Then, feature \( f_{channel}^{2} \) is multiplied by \(f_{spatial\_ map}\) to obtain the input feature for the keypoint attention module.
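A sketch of Eq. (4) in the CBAM-style formulation is given below; the convolution kernel size of W2 is our assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention map of Eq. (4): channel-wise average and max maps are
    concatenated and reduced to one channel by a convolution W2 (sketch)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.w2 = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f_channel: torch.Tensor) -> torch.Tensor:
        avg = f_channel.mean(dim=1, keepdim=True)              # (B, 1, H, W)
        mx = f_channel.max(dim=1, keepdim=True).values         # (B, 1, H, W)
        return torch.sigmoid(self.w2(torch.cat([avg, mx], dim=1)))  # f_spatial_map
```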

The channel and spatial attention blocks produce the feature \( f_{channel+spatial}^{2} \), which is then used for keypoint attention. Specifically, we use OpenPose [26], pretrained on the MS COCO2017 dataset [41], to obtain the coordinates of 18 keypoints in the original image, including ’nose’, ’neck’, ’right shoulder’, ’right elbow’, ’right wrist’, ’left shoulder’, ’left elbow’, ’left wrist’, ’right hip’, ’right knee’, ’right ankle’, ’left hip’, ’left knee’, ’left ankle’, ’right eye’, ’left eye’, ’right ear’, and ’left ear’. In our experience, when applying the pretrained model to Re-ID datasets, the estimate of the ’neck’ keypoint deviates the most. Therefore, we remove ’neck’ and use the remaining 17 keypoints for the attention calculation in Part 2 and the partition of the global feature in Part 3. The keypoint coordinates are converted into an attention map using a Gaussian transformation followed by binarization:

$$ \begin{aligned} &f_{keypoint\_ map}\left [ m\geq 0.8 \right ]=1\\ & f_{keypoint\_ map}\left [ m<0.8 \right ]=0 \end{aligned} $$
(5)

where m denotes the confidence values on the keypoint attention map \(f_{keypoint\_ map}\). The elements on this attention map matrix that exceed 0.8 are set to 1, as we regard these elements as belonging to the region near the keypoints, and the remaining elements are set to 0. As shown in Fig. 4, bright areas represent the regions near the keypoints. The keypoint attention map focuses only on the areas surrounding 17 keypoints, omitting other areas. After multiplying by the corresponding features, it is possible to emphasize significant characteristics near the keypoints and extract more practical features for pedestrian identity representation.
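The following sketch builds such a binary keypoint attention map from keypoints that have already been mapped to feature-map coordinates; the Gaussian width sigma is an assumption not specified in the paper.

```python
import torch

def keypoint_attention_map(keypoints: torch.Tensor, h: int = 24, w: int = 8,
                           sigma: float = 1.0, thresh: float = 0.8) -> torch.Tensor:
    """Binary keypoint attention map of Eq. (5): place a Gaussian around each of
    the 17 keypoints (given as (row, col) feature-map coordinates, shape (17, 2)),
    then set confidences of at least 0.8 to 1 and the rest to 0 (sketch)."""
    ys = torch.arange(h).view(h, 1).float()
    xs = torch.arange(w).view(1, w).float()
    heat = torch.zeros(h, w)
    for ky, kx in keypoints.tolist():
        g = torch.exp(-((ys - ky) ** 2 + (xs - kx) ** 2) / (2 * sigma ** 2))
        heat = torch.maximum(heat, g)       # keep the strongest keypoint response
    return (heat >= thresh).float()         # binarize as in Eq. (5)
```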

Fig. 4

Visualization of keypoint attention maps on four different pedestrian images

Then, \( f_{channel+spatial}^{2} \) is multiplied by the keypoint attention map to obtain \( f_{keypoint}^{2} \), which has a size of 2048 × 24 × 8. It is then sent to two sub-branches. The first learns global information using GAP, a 1 × 1 convolution (Conv1×1), batch normalization (BN) and the ReLU activation function; the resulting global feature \( f_{g}^{P2} \) has a size of 256 × 1 × 1, and we retain it after the attention modules as a complement. In the second sub-branch, feature \( f_{keypoint}^{2} \) passes through average pooling (to a 2 × 1 spatial size) and a 1 × 1 convolution layer to obtain a feature of size 256 × 2 × 1. Through horizontal partitioning, this feature is divided into two uniform parts, \( f_{pi}^{P2}\mid _{i=0}^{1} \), each of size 256 × 1 × 1. This division allows better mining of local features.
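A compact sketch of these two Part 2 heads is given below; whether BN and ReLU also follow the 1 × 1 convolution in the local head is not stated in the paper, so it is omitted here.

```python
import torch
import torch.nn as nn

class Part2Heads(nn.Module):
    """Sketch of the Part 2 sub-branches applied to f_keypoint^2 (B, 2048, 24, 8)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.global_head = nn.Sequential(            # GAP -> Conv1x1 -> BN -> ReLU
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2048, feat_dim, 1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
        )
        self.local_head = nn.Sequential(             # pool to 2x1 stripes, then reduce
            nn.AdaptiveAvgPool2d((2, 1)),
            nn.Conv2d(2048, feat_dim, 1),
        )

    def forward(self, f_keypoint: torch.Tensor):
        fg = self.global_head(f_keypoint).flatten(1)          # f_g^{P2}: (B, 256)
        stripes = self.local_head(f_keypoint)                 # (B, 256, 2, 1)
        fp0, fp1 = stripes[:, :, 0, 0], stripes[:, :, 1, 0]   # two horizontal parts
        return fg, fp0, fp1
```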

The softmax losses \( L_{softmax}^{P2},L_{softmax0}^{P2},L_{softmax1}^{P2} \) and the hard triplet loss \( L_{triplet}^{P2} \) are then computed from the outputs of these sub-branches, and all of them are added together for backpropagation.

In this branch, only the softmax loss is used for the two local features \( f_{pi}^{P2}\mid _{i=0}^{1}\), because using the hard triplet loss on the local feature mining branches yields less satisfactory results. When an image is divided into two parts from top to bottom, it is not guaranteed that the upper part represents the upper body or that the lower part represents the lower body; in reality, the upper part may contain mostly background, while the lower part may cover the whole body. With the hard triplet loss, distances between such mismatched regions (for example, background versus upper body) are meaningless, and the training data would push the model toward wrong predictions.

3.4 Pose-guided part feature alignment

In this section, a part feature alignment network is proposed. When matching local features from two pedestrian images, comparing the same body parts yields better identification accuracy. Therefore, following the structure of human parsing, we extract nine local regions using the 17 estimated keypoints mentioned in Section 3.3. By aligning the corresponding part features, we compensate for the offsets caused by misalignment, thus enhancing the discrimination of the extracted features and the robustness of the entire network.

Part 3 of Fig. 2 includes one sub-branch for global information extraction and nine sub-branches for part information mining. The input image first passes through the same backbone used in Part 2; the downsampling in the last stage of ResNet-50 is removed, so the extracted feature f3 has a size of 2048 × 24 × 8, the same as f2.

After that, the network splits into two branches. One learns global information using GAP and a 1 × 1 convolution to obtain the global feature \(f_{g}^{P3}\), and the other obtains the features \( z_{pi}^{P3}\mid _{i=0}^{8} \) using pose-guided partitioning, as shown in Fig. 5. As mentioned in Section 3.3, we use OpenPose [26], pretrained on the MS COCO2017 dataset [41], and keep 17 of the 18 detected keypoints for feature partitioning, discarding the neck keypoint. Specifically, the coordinates of the 17 keypoints in the original image are mapped onto the feature map using the size ratio between the original image and the feature map. The larger of the vertical coordinates of the left and right eyes is selected as the vertical eye coordinate Y 1, and the smaller of the vertical coordinates of the left and right ankles is selected as the vertical ankle coordinate Y 2. According to typical human body proportions, the distance from the eyes to the top of the head is taken as M, defined as 120% of the difference between the vertical eye and mouth coordinates. The eye coordinate Y 1 shifted upward by M gives the vertical head coordinate Y 3, and the ankle coordinate Y 2 shifted downward by M gives the vertical foot coordinate Y 4. A person’s height is therefore the difference \(\left |Y3-Y4\right |\) between the head and foot coordinates, and the person image is segmented into six stripes, as shown in Fig. 5. The average position of each pair of symmetrical keypoints is used as a partition boundary: the average position of the left and right shoulders separates the head from the upper torso; the area between the shoulders and waist is divided equally into the upper torso and the lower torso; the average position of the left and right waist is used as the boundary between the upper body and the lower body; the average position of the left and right knees separates the upper leg from the lower leg; and the average position of the left and right ankles separates the lower leg from the feet. In this way, a pedestrian is divided into six parts. In addition, to retain more general information, three coarser parts are formed: the upper body is composed of the head, upper torso, and lower torso; the lower body is composed of the upper leg, lower leg, and feet; and the whole body spans from the head to the feet. Therefore, the feature \( z_{pi}^{P3}\mid _{i=0}^{8} \) is finally divided into nine different parts.
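The sketch below computes these vertical boundaries from a dictionary of keypoint coordinates already mapped onto the feature map; using the nose coordinate in place of the "mouth" mentioned in the text, and the downward-growing y-axis convention, are our assumptions. In a full implementation, each of the nine regions would then be cropped from f3 along its row boundaries and pooled into one part feature.

```python
def part_boundaries(kp: dict) -> dict:
    """Vertical (row) boundaries of the nine pose-guided parts (sketch).
    `kp` maps keypoint names to (y, x) feature-map coordinates, with y growing
    downward; the nose stands in for the 'mouth' coordinate used in the text."""
    y_eye = max(kp["left_eye"][0], kp["right_eye"][0])        # Y1
    y_ankle = min(kp["left_ankle"][0], kp["right_ankle"][0])  # Y2
    m = 1.2 * abs(y_eye - kp["nose"][0])        # 120% of the eye-to-"mouth" distance
    y_head_top = y_eye - m                      # Y3: top of the head (above the eyes)
    y_sole = y_ankle + m                        # Y4: soles of the feet (below the ankles)
    y_shoulder = (kp["left_shoulder"][0] + kp["right_shoulder"][0]) / 2
    y_hip = (kp["left_hip"][0] + kp["right_hip"][0]) / 2      # "waist" boundary
    y_chest = (y_shoulder + y_hip) / 2          # splits the torso into two equal halves
    y_knee = (kp["left_knee"][0] + kp["right_knee"][0]) / 2
    y_ankle_mid = (kp["left_ankle"][0] + kp["right_ankle"][0]) / 2
    return {
        "head":        (y_head_top, y_shoulder),
        "upper_torso": (y_shoulder, y_chest),
        "lower_torso": (y_chest, y_hip),
        "upper_leg":   (y_hip, y_knee),
        "lower_leg":   (y_knee, y_ankle_mid),
        "feet":        (y_ankle_mid, y_sole),
        "upper_body":  (y_head_top, y_hip),
        "lower_body":  (y_hip, y_sole),
        "whole_body":  (y_head_top, y_sole),
    }
```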

Fig. 5

Examples with 17 keypoints and nine part partitions. We used OpenPose [26], which was pretrained on the MS COCO2017 dataset [41], to obtain the coordinates of 18 keypoints in the original image, including ’nose’, ’neck’, ’right shoulder’, ’right elbow’, ’right wrist’, ’left shoulder’, ’left elbow’, ’left wrist’, ’right hip’, ’right knee’, ’right ankle’, ’left hip’, ’left knee’, ’left ankle’, ’right eye’, ’left eye’, ’right ear’, and ’left ear’. According to our experience, when applying the pretrained model to Re-ID datasets, the estimation of ’neck’ has a large deviation. Therefore, we removed ’neck’ but retained the remaining 17 keypoints. The yellow dots in each image represent the 17 keypoints. Nine rectangular boxes with different colours divide the body into different parts, including the head between the top of the skull and shoulders, the upper torso between the shoulders and chest, the lower torso between the chest and waist, the upper leg between the waist and knees, the lower leg between the knees and ankles, the feet between the ankles and soles of the feet, the upper body between the head and waist, the lower body between the waist and feet, and the entire body between the head and feet. In addition, none of the nine regions contain ineffective background information above the head or under the feet

This division method causes the network to pay attention to specific human body regions; thus, the extracted features are more specific, and the entire image is free from interference such as background noise. For example, the network can focus on details about the local information, such as the logo on the clothes.

This division method also improves on uniform partitioning (as used in PCB, for example) to realize part feature alignment. Additionally, a 1 × 1 convolution (Conv1×1), batch normalization (BN) and the ReLU activation function are used to obtain the features \( f_{pi}^{P3}\mid _{i=0}^{8} \), each with a size of 256 × 1 × 1.

The global feature \(f_{g}^{P3}\) and the features \( z_{pi}^{P3}\mid _{i=0}^{8} \) from the other branch are used to learn each part feature of an input image, and these subnetworks share the same weights: instead of training ten subnetworks separately, the convolution layers are shared to avoid overfitting. The softmax losses \(L_{softmax}^{P3},L_{softmax0}^{P3},L_{softmax1}^{P3}, L_{softmax2}^{P3},L_{softmax3}^{P3},\) \(L_{softmax4}^{P3}, L_{softmax5}^{P3},L_{softmax6}^{P3},L_{softmax7}^{P3}, L_{softmax8}^{P3} \) and the hard triplet losses \( L_{triplet}^{P3} \) and \( L_{triplet0}^{P3} \) are calculated, and all of them are added together for backpropagation. Since \( f_{pi}^{P3}\mid _{i=0} \) covers the whole body and is therefore closest to a global feature, the features \( \left \{ f_{g}^{P3},f_{pi}^{P3}\mid _{i=0} \right \} \) are trained using the hard triplet loss to improve the network’s performance.

In the training stage, the global feature \(f_{g}^{P3}\) and the nine part features \( f_{pi}^{P3}\mid _{i=0}^{8} \) are supervised separately; in the test stage, they are concatenated, with a total size of 10 × 256, to form the identity representation of this branch. Every sub-branch in Part 3 shares parameters during training to enhance the model’s performance: the shared convolution kernels are forced to learn both global and local features, and more samples are effectively used during training, which helps to avoid overfitting.

This pose-guided feature partitioning method can effectively focus on the critical body parts and mine the corresponding information while suppressing the misalignment caused by background noise and detection errors. Compared with existing partitioning methods [13, 14, 17, 19], the features we extract cover representations at different granularities and are therefore more informative and complete. The experiments in Section 4 further verify the effectiveness of this method.

3.5 Training and inference

The entire network structure is composed of Part 1, Part 2, and Part 3. These branches include both cooperation and division of labour. The weights of the lower layers are shared, and those of the higher layers are independent. Global features are the overall common representation, and then the multi-attention mechanism and pose-guided feature alignment focus on local features at different levels. Combining the global and local features as the final identity representation could strengthen the network’s discrimination. Thus, in the training stage, the total loss function is formulated as follows:

$$ \begin{aligned} L=&L_{softmax}^{P1}+L_{softmax}^{P2}+L_{softmax}^{P3}+L_{triplet}^{P1}+L_{triplet}^{P2}+L_{triplet}^{P3}\\ +&L_{triplet0}^{P3}+\sum\limits_{i=0}^{1}L_{softmaxi}^{P2}+\sum\limits_{i=0}^{8}L_{softmaxi}^{P3} \end{aligned} $$
(6)

Our method realizes end-to-end learning by integrating global information, multi-attention-based local features, and pose-guided part feature alignment. For retrieval, as shown in Fig. 2, there are 18 purple blocks of 256-dim vectors from top to bottom at the end of the network. We concatenate them to form an identity feature of size 18 × 256, and the Euclidean distance is used for the similarity calculation.
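The retrieval step can be sketched as follows, assuming the 18 branch outputs have already been computed for every query and gallery image; the helper name is ours.

```python
import torch

def retrieval_distances(query_feats: list, gallery_feats: list) -> torch.Tensor:
    """Test-time matching sketch: concatenate the 18 branch outputs (each of shape
    (num_images, 256)) into one 18 x 256 = 4608-dim identity feature per image and
    compute Euclidean distances between query and gallery vectors."""
    q = torch.cat(query_feats, dim=1)      # (num_query, 4608)
    g = torch.cat(gallery_feats, dim=1)    # (num_gallery, 4608)
    return torch.cdist(q, g, p=2)          # smaller distance = more likely the same ID
```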

4 Experiments

The experiments are described in nine sections. Sections 4.1 and 4.2 introduce the datasets, evaluation protocols, experimental environments, and implementation details. In Section 4.3, we verify the effectiveness of the global feature extraction network. Section 4.4 demonstrates that the accuracy is further improved by the multi-attention mechanism. Section 4.5 proves the effectiveness of pose-guided part feature alignment. Section 4.6 shows the superiority of the overall network. In Section 4.7, we conduct an additional cross-domain experiment to demonstrate the generalization ability of this network. Section 4.8 provides a discussion of the time and space complexity. Finally, Section 4.9 conducts a synthesis comparison between our method and several state-of-the-art methods.

4.1 Dataset and evaluation protocol

We conduct experiments on three popular Re-ID datasets: Market-1501 [8], DukeMTMC-reID (Duke) [20, 21], and CUHK03 [7]. The Market-1501 dataset was collected on the Tsinghua University campus with six non-overlapping cameras, one of which has low resolution; the images are automatically detected and cropped by a pedestrian detector. This dataset includes pose changes, illumination variations, and occlusion, similar to real scenes. The DukeMTMC-reID dataset, collected at Duke University, is one of the largest person Re-ID datasets and also provides annotations of pedestrian attributes such as gender and sleeve length. The CUHK03 dataset was collected at the Chinese University of Hong Kong and contains some detection errors. The details of these datasets are given in Table 1.

Table 1 Descriptions of the Market-1501, DukeMTMC-reID, and CUHK03 (detected) datasets

In our experiments, to evaluate the performances of Re-ID methods, we report the cumulative matching characteristics (CMC) at Rank-1, Rank-3, Rank-5, Rank-10 and the mean average precision (mAP) on all the candidate datasets.
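For reference, a simplified sketch of how CMC and mAP can be computed from a query-gallery distance matrix is given below; the standard Market-1501 protocol additionally excludes same-camera and junk gallery images, which this sketch omits.

```python
import numpy as np

def cmc_map(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, topk: int = 10):
    """Simplified CMC (Rank-1..Rank-topk) and mAP computation (sketch)."""
    cmc = np.zeros(topk)
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                        # gallery ranked by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float64)
        if matches.sum() == 0:                             # query without any true match
            continue
        first_hit = int(np.argmax(matches))                # rank of first correct match
        if first_hit < topk:
            cmc[first_hit:] += 1
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append(float((precision * matches).sum() / matches.sum()))
    num_valid = max(len(aps), 1)
    return cmc / num_valid, float(np.mean(aps)) if aps else 0.0
```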

4.2 Experimental environment and implementation details

Our model is trained on PyTorch, and the details of the experimental environment are shown in Table 2.

Table 2 Experimental environment

During training, the backbone adopts the ResNet-50 model pretrained on ImageNet; pretraining was completed before network initialization to reduce the training time. To increase the scale of feature extraction, the input image is resized to 384 × 128, and we use random flipping and random erasing to augment the data. The Adam optimizer and label smoothing are used to train the network for 500 epochs, with first-order and second-order momentum coefficients of 0.9 and 0.999, respectively. The weight decay is set to 5e-4. The initial learning rate is 2e-4 and decays by a factor of 0.1 at epochs 300 and 400 to avoid overfitting. Each mini-batch contains 16 identities with 4 images sampled per identity. The margin in the triplet loss is set to 1.2.
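These settings translate into a training-loop sketch such as the one below; the model and data-loader construction are omitted, and the `compute_total_loss` helper standing in for Eq. (6) is hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_loader: DataLoader, epochs: int = 500) -> None:
    """Training loop sketch matching the reported hyperparameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4,
                                 betas=(0.9, 0.999), weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[300, 400], gamma=0.1)         # decay by 0.1 at 300 / 400
    for _ in range(epochs):
        for images, labels in train_loader:                  # 16 identities x 4 images each
            loss = model.compute_total_loss(images, labels)  # total loss of Eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```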

4.3 Effectiveness of the global feature extraction branch

As shown in Part 1 of Fig. 2, the global feature extraction branch can be used on its own for Re-ID, trained with the softmax loss and the hard triplet loss; this method is denoted ”Part1”. For comparison, we also train a plain ResNet-50 using only the softmax loss, denoted ”ResNet-50”. The results of ”Part1” and ”ResNet-50” are reported in Table 3.

Table 3 Results of the methods ”ResNet-50” and ”Part1” on the Market-1501, DukeMTMC-reID and CUHK03 (detected) datasets

Table 3 shows that after adding the global feature, compared with ResNet-50, Rank1/mAP rises from 87.6%/71.02% to 90.56%/77.6% (+2.96%/6.58%) on Market-1501, from 81.04%/64.63% to 82.63%/68.4% (+1.59%/3.77%) on DukeMTMC-reID, and from 45.14%/43.3% to 63.36%/59.07% (+18.22%/15.77%) on CUHK03. The global branch leads to better Re-ID accuracy. In addition, Table 4 shows that ”Part1” is superior to other popular methods based on the global feature. Nevertheless, the global feature does not focus on detailed information; combining the local features from Part 2 and Part 3 can improve the network further.

Table 4 Comparison with several global-feature-based SOTA methods on the Market-1501 and DukeMTMC-reID datasets

4.4 Effectiveness of multi-attention mechanism branch

As shown in Part 2 of Fig. 2, in this section we use this branch on its own for Re-ID. Specifically, the softmax and hard triplet losses are used for training, and during retrieval we concatenate the output features from the three sub-branches of Part 2 to create the final representation of the input images. This method is denoted ”Part2” in the tables. For this multi-attention branch, Fig. 6 shows the feature response maps of two query images. The feature response map filters out background information and focuses on the valuable information in the pedestrian images, highlighting only the regions surrounding the head, shoulders, waist, limbs, and feet near the 17 keypoints.

Fig. 6

Feature response maps from the pose-guided attention network in the query set and gallery set. The left side of the arrow denotes an original pedestrian image from the query set and its feature response map from the pose-guided attention network; the right side of the arrow denotes the feature response maps produced by the pose-guided attention network for four pedestrian images from the gallery set, in which the ID is the same as the query image on the left. The bright fields indicate the positions near the keypoints of the human pose, which are more valuable than other areas

To verify the effect of the multi-attention mechanism, we designed three other network structures for comparison. The first removes all attention modules but otherwise keeps the same structure as Part 2, including the same feature segmentation; this method is denoted ”Base”. The second uses only the spatial and channel attention modules, denoted ”Spatial+Channel”. The third uses only the keypoint attention module, denoted ”Keypoint”. The results of this comparative experiment are shown in Table 5.

Table 5 Results with different Part 2 settings on the Market-1501, DukeMTMC-reID and CUHK03 (detected) datasets

Table 5 shows that adding the channel, spatial, and keypoint attention blocks significantly improves the model. Compared with ”Base”, Rank1/mAP increases from 91.95%/77.59% to 92.67%/82.99% (+0.72%/5.4%) on Market-1501, from 83.17%/68.93% to 86.0%/73.52% (+2.83%/4.59%) on DukeMTMC-reID, and from 54.36%/51.5% to 73.57%/70.61% (+19.21%/19.11%) on CUHK03. In addition, combining spatial and channel attention with keypoint attention performs better than using either alone. Different attention mechanisms affect the model from different aspects: channel attention captures global information, spatial attention captures local information, and keypoint attention focuses on the information near human body keypoints. Moreover, after the multi-attention mechanism, Part 2 is divided into one global feature and two local features so that the network can better learn the local features. Thus, in Table 6, the multi-attention mechanism branch achieves better performance than several common attention-based SOTA methods.

Table 6 Comparison with several attention-based SOTA Re-ID methods on the Market-1501, DukeMTMC-reID and CUHK03 (detected) datasets

4.5 Effectiveness of the pose-guided part feature alignment branch

As shown in Part 3 of Fig. 2, the pose-guided part feature alignment branch partitions the global features into nine different parts according to the 17 keypoints and then outputs features including the global feature \(f_{g}^{P3}\), the head \( f_{pi}^{P3}\mid _{i=1} \), the upper torso \( f_{pi}^{P3}\mid _{i=2} \), the lower torso \( f_{pi}^{P3}\mid _{i=3} \), the upper leg \( f_{pi}^{P3}\mid _{i=4} \), the lower leg \( f_{pi}^{P3}\mid _{i=5} \), the feet \( f_{pi}^{P3}\mid _{i=6} \), the upper body \( f_{pi}^{P3}\mid _{i=7} \), the lower body \( f_{pi}^{P3}\mid _{i=8} \), and the entire body \( f_{pi}^{P3}\mid _{i=0} \). In this section, we use this branch separately for Re-ID research and conduct experiments using various feature fusion and training strategies on Market-1501. The detailed experimental results are recorded in Table 7.

Table 7 Results of different feature fusion strategies on Market-1501

As seen from Fig. 7, we divided the global feature of the corresponding position in the input image into the following parts using the keypoints: the upper body, the lower body, the head, the upper torso, the lower torso, the upper leg, the lower leg, and the feet. These corresponding parts may be located in different positions in different images, and the original sizes vary; however, the last features extracted represent the same semantic information. Therefore, through this partition operation, when we calculate the distance between different features, the results are more accurate and can align the same features in different images.

Fig. 7

Schematic of part feature alignment with the same ID between a query image and a gallery image. The left side of the arrow denotes the parts of a query image; the right side of the arrow denotes the corresponding parts of the gallery image. The parts of the human body enclosed by the different-coloured bounding boxes are illustrated in boxes of the same colour on the left side. The same part features can be matched one by one using a part alignment network that also calculates the distance between different input images

In Table 7, the ”Part3” fusion strategy performs best. The network with weight sharing outperforms the one that does not share parameters because sharing the convolution kernels improves training and avoids overfitting. The global feature performs better than the individual local features because it contains complete information and therefore stronger discrimination. Because the head, upper torso, lower torso, upper leg, lower leg, and feet are individually hard to recognize, the fusion strategy ”head + upper torso + lower torso + upper leg + lower leg + feet” performs worse than the strategies ”upper body”, ”lower body” and ”whole body”; these six parts can nevertheless capture detailed information. The performance of ”Part3”, which adds all of these parts, is the best, rising from Rank1/mAP = 91.11%/77.25% to 93.3%/83.23% (+2.19%/5.98%) compared with ”global” on Market-1501. This shows that the head, upper torso, lower torso, upper leg, lower leg, and feet features are helpful for Re-ID because they are easy to align across images.

Therefore, we conduct experiments using ”Part3” on the Market-1501, DukeMTMC-reID, and CUHK03 datasets, with the results recorded in Table 8. In Table 9, we further compare ”Part3” with other popular SOTA methods that aim to solve the alignment problem. ”Part3” demonstrates the best performance, which verifies its alignment effectiveness.

Table 8 Results of ”Part3” method on the Market-1501, DukeMTMC-reID, and CUHK03 (detected) datasets
Table 9 Comparison with several alignment-oriented SOTA methods on the Market-1501, DukeMTMC-reID, and CUHK03 (detected) datasets

4.6 Effectiveness of MAAN

This section combines the global feature extraction branch, the multi-attention mechanism branch, and the pose-guided part feature alignment branch for Re-ID. The experiments are conducted on the Market-1501, DukeMTMC-reID, and CUHK03 datasets. ”Part2+Part3” denotes that the network consists of only Part 2 and Part 3, and ”MAAN” denotes that the network consists of all sub-branches, i.e., the complete method proposed in this paper. In addition, we include the results of ”MAAN (rerank)”, which denotes using MAAN as well as the reranking tricks for Re-ID. The results are recorded in Table 10.

Table 10 Results of the methods ”Part2+Part3”, ”MAAN”, and ”MAAN (rerank)” on the Market-1501, DukeMTMC-reID, and CUHK03 (detected) datasets

Reviewing the data in Table 10, it is clear that when the global feature, the pose-guided attention mechanism, and the part feature alignment network are combined, the Re-ID accuracy increases compared with ”Part2+Part3” as well as the results in Sections 4.3, 4.4, and 4.5, demonstrating the effectiveness of the proposed MAAN. Rank1 and mAP for ”Part2+Part3” are 96.07% and 87.44%, respectively, on Market-1501, which is already a good result. The full MAAN structure is more reasonable, reaching Rank1 and mAP values of 96.97% and 88.38% on Market-1501, because the different branches reinforce each other and share valuable information; hence, MAAN performs better than ”Part2+Part3”. With re-ranking, Rank1 reaches 97.16% and mAP reaches 94.66% on Market-1501. The ”MAAN (rerank)” method uses information from the test set itself; in practice, the test set is likely to be open, so such information is unavailable and re-ranking cannot be applied, making the original Rank1 and mAP more meaningful. However, for some known test sets, re-ranking can be used to improve accuracy; in our experiment, it improves the mAP on Market-1501 from 88.38% to 94.66%, which is very significant.

To briefly summarize, the proposed MAAN learns the feature information from different perspectives. The global branch has a larger receptive field, which captures full and sparse features. The multi-attention branch focuses on the critical location of features while suppressing useless feature information such as background noise, and the pose-guided part alignment branch causes the model to compare the same semantic parts from different images, leading to alignment. What different branches have learned from different perspectives can complement each other. Under the same network settings, the synthesis MAAN method behaves much better than any single method.

To analyse the training behaviour, we draw the cumulative match characteristic (CMC) curve of the proposed MAAN, as shown in Fig. 8. This experiment trains for 500 epochs and evaluates every 50 epochs. The network is jointly trained by three branches, each responsible for a different task (global feature extraction, the multi-attention mechanism, and part feature extraction), so there are many different loss functions. Owing to the reasonable settings, the network is not difficult to train and converges to a good result after 100 epochs. Because the feature extraction is rich, the gradients of the losses continue to update after 100 epochs, and the network keeps converging toward better feature extraction. From Fig. 8, mAP and Rank-n improve further at 350 epochs because the model escapes a local minimum under continued training. The CMC curve gradually stabilizes between 400 and 500 epochs and reaches its best convergence at 500 epochs, after which it no longer fluctuates. The network therefore has considerable training potential, and training for 500 epochs allows its different losses and gradients to be well matched.

Fig. 8

CMC curve of MAAN on the Market-1501, DukeMTMC-reID and CUHK03 (detected) datasets. The abscissa represents epochs, and the ordinate represents mAP/Rank-n. The six curves with different colours in each image, from top to bottom, represent mAP, Rank1, Rank2, Rank3, Rank5, and Rank10

Figure 9 shows the retrieval results of four different query images. The first two queries are identified accurately whether the pedestrians appear sideways or from the back. The first query pedestrian has his back to the camera, yet his frontal views are found in the retrieval results; although there is considerable background above the pedestrians in the corresponding gallery images, they are still identified accurately thanks to the pose-guided part feature alignment branch of MAAN. The blurred fifth and tenth gallery images corresponding to the second query pedestrian are also identified accurately, reflecting the robustness of MAAN. For the seventh gallery image (red dotted box) corresponding to the third query pedestrian, the person’s clothes and appearance are nearly identical to those of the correct identity, which leads to an error. For the ninth gallery image (red dotted box) corresponding to the fourth query, the head is missing due to an extreme detection error while the rest of the appearance is nearly the same as the query’s, leading to a mistake.

Fig. 9

Top 10 ranking lists for four query images from the Market-1501 dataset produced by MAAN. The retrieved images are all from the gallery set. The images with green numbers belong to the same identity as the query, and those with red numbers belong to a different identity

4.7 Cross-Domain Re-ID

This section conducts cross-domain Re-ID experiments with the ”Part2+Part3” and ”MAAN” methods. We first use the model trained on Market-1501 to test on DukeMTMC-reID, and we also use the model trained on DukeMTMC-reID to test on Market-1501. The results are recorded in Table 11.

Table 11 Cross-domain Re-ID results

The cross-domain results show that MAAN achieves competitive accuracy compared with state-of-the-art unsupervised domain adaptation (UDA) methods, even though it does not use any unlabelled target-domain data. MAAN’s performance on the Duke target dataset is better than that of most UDA methods in Table 11, and on the Market-1501 target dataset it is better than SPGAN, TJ-AIDL, and ATNet, which verifies its generalization ability. Some of the compared methods, such as ECN, are better than MAAN in the cross-domain setting; however, these are UDA methods designed specifically for domain adaptation rather than general architectures, whereas the proposed method is a supervised, general architecture that is not specialized for domain transfer. Because the proposed network extracts detailed features from three branches, their fusion still achieves competitive transfer performance. Moreover, MAAN achieves good results in both the source domain and the cross-domain setting, while the other methods are only better in the cross-domain setting and their source-domain accuracy is lower than that of our method. Therefore, in terms of overall performance, MAAN is better than the other methods and has comparable cross-domain ability. Without the global feature branch, ”Part2+Part3” performs slightly worse. In summary, the proposed MAAN can be used as a backbone model for cross-domain Re-ID.

4.8 Complexity analysis

The proposed approach is based on a deep network with three branches, but its time and space complexity are not high. Notably, feature extraction is based on the backbone (ResNet-50), and the subsequent branches only add attention and region divisions to further refine the features; the multi-branch network essentially performs post-processing without introducing many complex operations. Specifically, Part 1 extracts the global feature. For Part 2, the parameters used for the channel and spatial attention are few, and the keypoint attention requires no parameters and thus occupies no extra space. For Part 3, the part features are computed from the keypoints, so the amount of computation is small and no parameters are introduced. After every branch, GAP introduces no parameters, and the parameters of Conv1×1 are few. Although the network has multiple branches, it therefore introduces few additional parameters and computations. In Table 12, MACs and Params denote the time and space complexity, respectively. Compared with ResNet-50, the network does not substantially increase the computational burden, while the accuracy is greatly improved; the computational efficiency is therefore excellent. This design minimizes the number of network parameters and makes the model easy to train.

Table 12 Time complexity, space complexity and top-1 accuracy (%) on Market-1501
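As an illustration of how such numbers can be obtained, the snippet below measures MACs and parameters with the third-party thop package on a plain ResNet-50 at the paper's input resolution; the paper does not state which tool was used for Table 12, and the MAAN definition itself is not reproduced in this sketch.

```python
import torch
from thop import profile                       # third-party package: pip install thop
from torchvision.models import resnet50

# Rough complexity check of the kind reported in Table 12, here on plain ResNet-50.
model = resnet50()
dummy = torch.randn(1, 3, 384, 128)            # input resolution used in the paper
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.2f} G, Params: {params / 1e6:.2f} M")
```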

4.9 Comparison with state-of-the-art methods

We compare the MAAN method with several state-of-the-art methods on the Market-1501, DukeMTMC-reID, and CUHK03 datasets to verify its performance. These methods are divided into four classes: stripe-based, attention-based, pose-guided, and others; MAAN belongs to the pose-guided class. The detailed results for the different datasets are provided in Tables 13, 14, and 15.

Table 13 Comparison with state-of-the-art methods on the Market-1501 dataset
Table 14 Comparison with state-of-the-art methods on the DukeMTMC-reID dataset
Table 15 Comparison with state-of-the-art methods on the CUHK03(detected) dataset

From Tables 13, 14, and 15, we find that MAAN’s Rank1 and mAP are 96.97% and 88.38% on Market-1501, 89.47% and 78.89% on DukeMTMC-reID, and 76.29% and 74.64% on CUHK03, respectively, which are higher than those of other methods such as MGN. MGN is a powerful Re-ID method based on multi-branch horizontal feature partitioning and is the nearest competitor on the Market-1501 and CUHK03 datasets; MAAN’s Rank1 and mAP surpass MGN by 1.27% and 1.48% on Market-1501, and by 9.49% and 8.64% on CUHK03 (detected). ABD-Net is the nearest competitor on the DukeMTMC-reID dataset, where MAAN’s Rank1 and mAP surpass it by 0.47% and 0.29%. These comparisons demonstrate that MAAN performs competitively with state-of-the-art Re-ID methods on several commonly used person Re-ID benchmarks.

The proposed MAAN can achieve excellent performance because the global feature extraction branch, the multi-attention mechanism branch, and the pose-guided part feature alignment branch mine more valuable and more substantial information from complete perspectives, thus broadly enhancing the discrimination and robustness of the final feature representation.

5 Conclusion

In this paper, we propose a robust person re-identification algorithm called MAAN. By integrating different functional branches, MAAN can mine both global features and valuable local features while performing part feature alignment. Specifically, MAAN extracts complete and detailed identity information using global feature extraction and a multi-attention mechanism, and it suppresses the misalignment caused by pose and camera viewpoint changes using pose-guided part feature partitioning. Extensive experiments show that our method outperforms several state-of-the-art methods on three mainstream Re-ID benchmarks and has good discrimination, robustness, and generalization ability. In the future, we will extend our idea and network to other intelligent computing tasks, such as 3D mesh simplification with feature preservation [83], social representation learning [84], training detection networks [85], and deep residual convolutional dehazing networks [86].