1 Introduction

Person re-identification (Re-ID), the task of identifying a target person across different scenarios, is an important subject in computer vision [1]. It is widely used in video surveillance and autonomous driving. Early research efforts mainly focused on hand-crafted feature construction. With the rapid development of deep learning, convolutional neural networks (CNNs) have become the predominant choice for Re-ID [2], achieving better recognition performance than traditional methods. However, due to various complicating factors, such as body poses and occlusions, learning robust and discriminative features remains a difficult and challenging task.

Re-ID methods based on CNNs can be summarized into four categories: 1) splitting images or feature maps into horizontal grids. The PCB model [3] directly divides feature maps into horizontal grids, ignoring the relations between body parts. 2) utilizing a pose estimator to extract a pose map. Wei et al. [4] use key-point localization to estimate human body poses, which effectively alleviates the difficulty of person feature alignment, but requires a large amount of additionally labeled data for training, and the retrieval accuracy is largely limited by the performance of the pose estimator. 3) leveraging generative adversarial networks (GANs) to generate more images. Zheng et al. [5] use a GAN to generate simulated data for augmentation and improve the generalization ability of the model; however, this easily produces noisy samples, which significantly affects accuracy and performance. 4) computing attention maps to focus on a few key parts; however, the extracted regions may not contain discriminative body parts, missing some important information.

To solve the above problems, we propose a multi-branch cooperative network that effectively extracts more discriminative features. The shade module focuses on extracting features from low-response parts. The stepped module focuses on reducing complex and noisy background clutter. Combined with the random erasing module, we use a consistency activation penalty (CAP) function to ensure that the high-activation regions of the three networks do not overlap. We further propose a multi-scale branch to extract features at different levels, which effectively preserves the integrity of pedestrian features. Finally, the branches are combined into a complete multi-branch cooperative network that can effectively deal with problems such as occlusions and pose changes.

In summary, our contributions can be summarized as follows:

(1) We propose a branch that integrates multiple attention mechanisms. By erasing high-response regions, we generate more difficult occlusion samples, and the random erasing module simulates low-quality samples in the real world. Besides, relations between different pedestrian parts are effectively extracted with the help of the stepped module. The three modules are combined through the consistency activation penalty function, improving the model's ability to extract features from samples with little information and addressing the problem of low model accuracy.

(2) We use a multi-scale branch to extract local person features and combine them with global features to learn the relations between person parts and mine non-significant information. Consequently, the recognition ability and accuracy of the algorithm are significantly improved.

(3) Experimental results on three large-scale person Re-ID datasets, Market-1501, DukeMTMC-reID, and CUHK03, show that the proposed method outperforms state-of-the-art methods.

2 Multi-branch Cooperative Network

We propose a person Re-ID method based on multiple attention mechanisms and multi-scale branches. The multiple attention mechanisms are used to improve the adaptability of the model to occlusions, pose changes, illumination, low resolution and other factors, while the multi-scale branch is used to improve the ability to fuse and extract global and local features. The complete multi-branch cooperative network structure is shown in Fig. 1.

Fig. 1. The proposed network structure

2.1 Shade Module

In real-world applications, occlusion of pedestrian images is inevitable. Occlusion causes partial loss of pedestrian features and damages their integrity; when distinguishing features are lost, the recognition performance of the model declines significantly. To extract more discriminative pedestrian features, we block the areas with high spatial attention response and retain the feature maps with low response, thereby generating more difficult samples.

The spatial attention model used in this paper is calculated as follows:

$$ N_{s}(Y) = \mathrm{BN}\left( C_{2}^{2 \times 1}\left( C_{1}^{3 \times 3}\left( C_{0}^{1 \times 1}(Y) \right) \right) \right). $$
(1)

where BN is the batch normalization operation, C is the convolution operation whose superscript denotes the kernel size, and \(N_{s}(Y)\) is the spatial attention map.
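As a concrete illustration, the following is a minimal PyTorch sketch of Eq. (1). The intermediate channel width is an assumption, and since the 2 × 1 kernel of \(C_{2}\) would change the spatial size of the map, the sketch substitutes a 1 × 1 kernel so that the attention map stays aligned with the input; both choices are illustrative assumptions.

```python
# Minimal sketch of the spatial attention map N_s(Y) in Eq. (1).
# Assumptions: intermediate width C/4; C2 uses a 1x1 kernel (instead of
# the 2x1 in Eq. (1)) so the output keeps the same H x W as the input.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        mid = max(in_channels // 4, 1)
        self.c0 = nn.Conv2d(in_channels, mid, kernel_size=1)     # C0: 1x1
        self.c1 = nn.Conv2d(mid, mid, kernel_size=3, padding=1)  # C1: 3x3
        self.c2 = nn.Conv2d(mid, 1, kernel_size=1)               # C2 (assumed 1x1)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.bn(self.c2(self.c1(self.c0(y))))             # N_s(Y)
```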

After the spatial attention response map is obtained, the high-response region is set to 0 and the low-response region remains unchanged, yielding the mask \(\tilde{N}\left( Y \right)\), which is calculated as follows:

$$ \tilde{N}(Y)_{i,j} = \begin{cases} 0, & N_{s}(Y)_{i,j} > S \\ N_{s}(Y)_{i,j}, & \text{otherwise} \end{cases} $$
(2)

where \(N_{s}(Y)_{i,j}\) represents the value of the spatial attention map at position (i, j), and S represents the threshold; when the response value is greater than S it is set to 0, otherwise it remains unchanged.

Setting the high-response region to 0 forces the model to learn distinctive features from the low-response region, thereby improving its recognition performance.
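A minimal sketch of this shade operation follows; the threshold value used below is a placeholder assumption.

```python
# Sketch of Eq. (2): zero out positions of the attention map whose
# response exceeds the threshold S; S = 0.5 is a placeholder assumption.
import torch

def shade_mask(ns: torch.Tensor, s: float = 0.5) -> torch.Tensor:
    """ns: spatial attention map N_s(Y), shape (B, 1, H, W)."""
    return torch.where(ns > s, torch.zeros_like(ns), ns)

# Usage: suppress high-response regions of the feature map.
# shaded = features * shade_mask(attention)
```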

2.2 Random Erasing Module

Let the original image be M, with width W, height H, and area U = W × H. Let P be the erasure probability, \(U_{s}\) the randomly initialized erasure area with \(U_{s}/U\) in the range \((U_{1}, U_{2})\), \(Q_{s}\) the erasure aspect ratio with values in \((q_{1}, 1/q_{1})\), \(M_{s}\) the randomly erased rectangular box, and \(P_{1}(X_{s}, Y_{s})\) a coordinate point randomly selected in the image M.

$$ O(x, y) = \mathrm{random}(x, y). $$
(3)

where random() is the random number generation function.

The random erasing algorithm can be described as follows: a pedestrian image is input and a random number is drawn; if it is greater than the erasure probability P, the original image is output directly. Otherwise, the erasing area and aspect ratio are randomly initialized within the ranges defined above, and the coordinate point \(P_{1}(X_{s}, Y_{s})\) is randomly initialized. If the resulting erasing rectangle fits inside the image, every pixel in it is assigned a random value in (0, 255), achieving the random erasing effect. The specific process is shown in the pseudocode below (Table 1):

Table 1. The process of random erasing algorithm
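A runnable sketch of the procedure is given below, following the description above; the area range \((U_{1}, U_{2})\) and aspect-ratio bound \(q_{1}\) are common defaults and should be read as assumptions.

```python
# Sketch of the random erasing algorithm described above. The defaults
# (U1, U2) = (0.02, 0.4) and q1 = 0.3 are assumed, not taken from the text.
import math
import random
import torch

def random_erase(img: torch.Tensor, p: float = 0.5,
                 u1: float = 0.02, u2: float = 0.4,
                 q1: float = 0.3) -> torch.Tensor:
    """img: float tensor (C, H, W) with values in [0, 255]."""
    if random.random() > p:                     # keep the original image
        return img
    _, h, w = img.shape
    for _ in range(100):                        # retry until the box fits
        us = random.uniform(u1, u2) * h * w     # erasure area U_s
        qs = random.uniform(q1, 1.0 / q1)       # aspect ratio Q_s
        eh = int(round(math.sqrt(us * qs)))
        ew = int(round(math.sqrt(us / qs)))
        if 0 < eh < h and 0 < ew < w:           # erasing box smaller than image
            ys = random.randint(0, h - eh)      # random point P1(Xs, Ys)
            xs = random.randint(0, w - ew)
            img[:, ys:ys + eh, xs:xs + ew] = torch.empty(
                img.shape[0], eh, ew).uniform_(0, 255)  # random values
            return img
    return img
```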

2.3 Stepped Module

The traditional feature segmentation method adopts horizontal segmentation, which attends to different regions of the pedestrian but easily ignores the local relations between body parts and the information that may exist at the edges of the blocks.

As shown in Fig. 2, the PCB algorithm divides the feature map into six horizontal slices, so the handbag and umbrella of a pedestrian are separated into different blocks. This causes the loss of important information such as edge information and the local relations between blocks, making it impossible to obtain the ideal effect when analyzing each block individually.

Instead, we divide the feature map into 8 slices in a stepped manner, as shown in Fig. 2: starting from the first slice, every four consecutive slices are taken as a relatively complete local region, and the window moves down continuously, finally yielding the five block regions labeled a, b, c, d, and e in the figure. We observe that blocks d and e retain the complete information of the handbag and umbrella. At the same time, because each cut is smaller than the original feature map, noise and background clutter are reduced, so recognition accuracy and performance are significantly improved.

Fig. 2. Ladder block method

In previous methods based on feature-space segmentation, each horizontal slice enjoys the same weight, and details such as the umbrella and handbag cannot be highlighted effectively. In this paper, we assign relatively larger weights to the blocks containing more important information, so that the model can focus on the most discriminative parts. For this branch, an input image is passed through conv5_x to obtain the feature map \(F \in R^{C \times H \times W}\), which is fed into the SBAM module in branch 2 and branch 3. In the SBAM module, step blocking is first carried out: F is divided into 8 horizontal parts, every four consecutive parts are grouped into a local region, and the starting block of the region moves downward with a stride of 1 from the first block, finally extracting five local regions, each of which is \(F_{i} \in R^{{C \times \left( \frac{H}{2} \right) \times W}} \left( {i = 1,2,3,4,5} \right)\). Each \(F_{i}\) is then compressed in the spatial dimension, calculated as follows:

$$ F_{i}^{\prime} = \mathrm{FC}_{2}\left( \mathrm{FC}_{1}\left( \mathrm{avg}_{s}(F_{i}) \right) \right) + \mathrm{FC}_{2}\left( \mathrm{FC}_{1}\left( \mathrm{max}_{s}(F_{i}) \right) \right). $$
(4)

where \(avg_{s}\) and \(max_{s}\) are average pooling and maximum pooling of the input data in the spatial dimension, respectively; two one-dimensional vectors are obtained after compression.

FC1 and FC2 are shared layers that compress and restore the two vectors along the channel dimension. Finally, the results are added and fused to obtain \(F_{i}^{\prime} \in R^{C \times 1 \times 1}\). To assign a weight to each local region, \(F_{i}^{\prime}\) is compressed in the channel dimension, expressed as

$$ s_{i} = \mathrm{Sum}_{c}(F_{i}^{\prime}). $$
(5)
$$ m_{i} = \mathrm{Max}_{c}(F_{i}^{\prime}). $$
(6)

where \(Sum_{c}\) and \(Max_{c}\) are respectively the sum and the maximum of the input data over the channel dimension, yielding the scalar values \(s_{i}\) and \(m_{i}\). From the \(s_{i}\) and \(m_{i}\) of each local region, the region's weight can be calculated. The calculation formula is:

$$ V_{i} = \lambda \left( F_{\mathrm{sum}}(s_{i}) + F_{\mathrm{max}}(m_{i}) \right). $$
(7)

where λ is a scaling factor set to 6 in our experiments, and the weight is finally adjusted to the range (0, 1) through the sigmoid function. The original local region \(F_{i}\) is multiplied by the adjusted weight to obtain the updated local region \(S_{i} \in R^{{C \times \left( \frac{H}{2} \right) \times W}}\), namely:

$$ S_{i} = F_{i} \times \mathrm{sigmoid}(V_{i}). $$
(8)
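The following PyTorch sketch condenses Eqs. (4)–(8). The FC reduction ratio r = 16 is an assumption, and \(F_{sum}\) and \(F_{max}\) in Eq. (7) are treated as identity mappings on \(s_{i}\) and \(m_{i}\), since the text does not define them further.

```python
# Condensed sketch of the SBAM weighting in Eqs. (4)-(8).
# Assumptions: reduction ratio r = 16; F_sum and F_max in Eq. (7) act as
# identity mappings on the scalars s_i and m_i.
import torch
import torch.nn as nn

class SBAM(nn.Module):
    def __init__(self, channels: int, lam: float = 6.0, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)  # FC1: compress
        self.fc2 = nn.Linear(channels // r, channels)  # FC2: restore
        self.lam = lam                                 # lambda in Eq. (7)

    def forward(self, feat: torch.Tensor):
        """feat: (B, C, H, W) with H divisible by 8; returns 5 regions S_i."""
        b = feat.shape[0]
        slices = feat.chunk(8, dim=2)                  # 8 horizontal slices
        outputs = []
        for i in range(5):                             # stepped regions F_1..F_5
            fi = torch.cat(slices[i:i + 4], dim=2)     # (B, C, H/2, W)
            avg = fi.mean(dim=(2, 3))                  # avg_s
            mx = fi.amax(dim=(2, 3))                   # max_s
            fi_p = self.fc2(self.fc1(avg)) + self.fc2(self.fc1(mx))  # Eq. (4)
            si = fi_p.sum(dim=1, keepdim=True)         # Eq. (5)
            mi = fi_p.amax(dim=1, keepdim=True)        # Eq. (6)
            vi = self.lam * (si + mi)                  # Eq. (7)
            outputs.append(fi * torch.sigmoid(vi).view(b, 1, 1, 1))  # Eq. (8)
        return outputs
```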

2.4 Multi-scale Branch

For fine-grained pedestrian feature extraction, existing methods obtain fine pedestrian features by horizontally segmenting the feature map. However, due to local misalignment and occlusion in pedestrian images, wrong matches easily occur.

At the same time, because each segment exists separately, complete pedestrian characteristics cannot be perceived. In contrast, the multi-scale branch designs a fine-grained global module and a fine-grained local module to refine the representation of pedestrian features and achieve effective “global + local” feature extraction (Fig. 3).

Fig. 3. Multi-scale local branch

For the fine-grained global module, the feature map obtained from the CNN backbone has size \(R^{C \times H \times W}\) (C is the number of channels, H the height, and W the width). Along the H dimension, we divide the feature map into n parts and perform maximum pooling and average pooling on each part to obtain the feature vectors \(g_{maxi} \left( {i = 1,2,3 \cdots n} \right)\) and \(g_{avgi} \left( {i = 1,2,3 \cdots n} \right)\). The max-pooling and average-pooling results of the different parts are then concatenated respectively to obtain the description vectors \(G_{max}\) and \(G_{avg}\) of the fine-grained global branch. Triplet loss is used to train \(G_{max}\) and \(G_{avg}\), which preserves the integrity of information while considering local correlation, and achieves effective discrimination of similar parts across different persons.
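A minimal sketch of this global module follows; n = 4 parts is an assumed setting.

```python
# Sketch of the fine-grained global module: split along H, pool each part,
# and concatenate into G_max and G_avg. n = 4 is an assumed value.
import torch

def fine_grained_global(feat: torch.Tensor, n: int = 4):
    """feat: (B, C, H, W) -> (G_max, G_avg), each of shape (B, n*C)."""
    parts = feat.chunk(n, dim=2)                       # n horizontal parts
    g_max = torch.cat([p.amax(dim=(2, 3)) for p in parts], dim=1)
    g_avg = torch.cat([p.mean(dim=(2, 3)) for p in parts], dim=1)
    return g_max, g_avg    # both are trained with triplet loss
```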

For the fine-grained local module, different local blocks are considered separately. Global average pooling easily introduces background or local noise, while global maximum pooling overcomes noise interference but cannot take all information in the same layer into account.

Therefore, taking the difference of the two better extracts the information within the same layer. The specific calculation is as follows:

$$ g_{\mathrm{cont}} = g_{\mathrm{avg}} - g_{\mathrm{max}}. $$
(9)

After that, we apply convolutional dimension reduction to \(g_{max}\) and \(g_{cont}\) to obtain \(g_{max}^{\prime}\) and \(g_{cont}^{\prime}\), which are then concatenated and reduced again by convolution to obtain \(g_{inter}^{\prime}\). Finally, \(g_{max}^{\prime}\) and \(g_{inter}^{\prime}\) are recombined as the layer feature \(\tilde{L}_{1}\) of the same layer, which is calculated as follows:

$$ \tilde{L}_{1} = g_{\mathrm{max}}^{\prime} + g_{\mathrm{inter}}^{\prime}. $$
(10)

Finally, the layer features are connected through a fully connected layer and jointly trained with softmax and ID loss. By analyzing the correlation between non-adjacent parts, more significant potential information can be mined. In addition, discarding local information effectively simulates real local occlusion, which enhances the robustness and discrimination of the model.
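A sketch of this local module is given below; the reduced dimension d and the use of 1 × 1 convolutions for the dimension-reduction steps are assumptions.

```python
# Sketch of Eqs. (9)-(10). Assumptions: 1x1 convolutions for the
# "convolution dimension reduction" steps and a reduced width d = 256.
import torch
import torch.nn as nn

class FineGrainedLocal(nn.Module):
    def __init__(self, channels: int, d: int = 256):
        super().__init__()
        self.red_max = nn.Conv2d(channels, d, 1)    # reduce g_max -> g'_max
        self.red_cont = nn.Conv2d(channels, d, 1)   # reduce g_cont -> g'_cont
        self.red_inter = nn.Conv2d(2 * d, d, 1)     # reduce concat -> g'_inter

    def forward(self, part: torch.Tensor) -> torch.Tensor:
        """part: (B, C, h, W) local block -> layer feature (B, d)."""
        g_avg = part.mean(dim=(2, 3), keepdim=True)  # global average pooling
        g_max = part.amax(dim=(2, 3), keepdim=True)  # global max pooling
        g_cont = g_avg - g_max                       # Eq. (9)
        g_max_p = self.red_max(g_max)
        g_cont_p = self.red_cont(g_cont)
        g_inter = self.red_inter(torch.cat([g_max_p, g_cont_p], dim=1))
        return (g_max_p + g_inter).flatten(1)        # Eq. (10)
```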

2.5 CAP Network

When cascading the three branches with different attention mechanisms, each branch should focus on a different region of the image so as to enhance the diversity and comprehensiveness of local feature extraction. To this end, a CAP network is introduced to coordinate the attention branches so that each focuses on regions with different characteristics. Different weights are assigned to the branches through LAN, and the Hellinger distance [7] is used to measure the consistency of the output weights of different branches:

$$ H(\omega_{i}, \omega_{j}) = \frac{1}{\sqrt{2}} \left\| \sqrt{\omega_{i}} - \sqrt{\omega_{j}} \right\|_{2}. $$
(11)

Since the elements of \(\omega_{i}\) and \(\omega_{j}\) each sum to 1, squaring the above formula gives:

$$ H^{2}(\omega_{i}, \omega_{j}) = 1 - \sum \sqrt{\omega_{i} \omega_{j}}. $$
(12)

To ensure that the high-activation regions of the different attention models do not overlap, it is necessary to maximize the distance between \(\omega_{i}\) and \(\omega_{j}\), that is, to minimize the value of \(\sum \sqrt {\omega_{i} \omega_{j} }\). The CAP loss can therefore be defined as follows:

$$ L_{\mathrm{CAP}} = \sum \sqrt{\omega_{i} \omega_{j}}. $$
(13)

Minimizing this loss diversifies local feature extraction and enhances the representation ability of the model.
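A minimal sketch of the CAP loss in Eq. (13) follows, averaged over all branch pairs; the per-branch weights are assumed to be nonnegative and normalized along their last dimension.

```python
# Sketch of the CAP loss in Eq. (13): the overlap term sum(sqrt(w_i * w_j)),
# averaged over all pairs of branches. Each weight tensor is assumed to be
# nonnegative, of shape (B, N), and normalized to sum to 1 along dim 1.
import itertools
import torch

def cap_loss(weights):
    pairs = list(itertools.combinations(range(len(weights)), 2))
    loss = sum(torch.sqrt(weights[i] * weights[j]).sum(dim=1).mean()
               for i, j in pairs)
    return loss / len(pairs)
```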

3 Experiments

3.1 Datasets

Experiments are conducted on three commonly-used large-scale person Re-ID datasets. The Market-1501 (Zheng et al., 2015) [8] dataset contains 32,668 images of 1,501 persons captured by 6 cameras; 12,936 images of 751 persons are used as the training set and 19,732 images of 750 persons as the test set. The DukeMTMC-reID (Ristani et al., 2016) [9] dataset contains 36,411 images of 1,404 persons captured by 8 cameras; 16,522 images of 702 persons are used as the training set and 17,661 images of 702 persons as the test set. The CUHK03 (Li et al., 2014) [10] dataset contains 14,097 images of 1,467 persons captured by 10 cameras; samples of 767 persons are used as the training set and samples of 700 persons as the test set.

3.2 Implementation Details

We implement our method with the PyTorch deep learning framework and train on an RTX 2080Ti GPU. During training, input images are resized to 384 × 128, and data augmentation such as random flipping and random cropping is used. The batch size is 32, with P = 8 identities and K = 4 images per identity. ResNet50 pre-trained on ImageNet is used as the backbone network. SGD is used as the optimizer with a learning rate of 8e-4, a weight decay of 5e-4, and a momentum of 0.9; the whole network is trained for 300 epochs. We use the cumulative matching characteristic (CMC) curve and the mean average precision (mAP) to analyze and evaluate the performance of the algorithm. Rank-1 denotes the ratio of queries for which the correct person is returned as the first search result, while mAP is the average of the area under the precision-recall curve over all query samples, reflecting the overall performance of person Re-ID methods.
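For reference, a hedged sketch of this setup is given below; the placeholder model and the padding amount before random cropping are assumptions.

```python
# Sketch of the training configuration described above. The padding of 10
# pixels before random cropping and the placeholder model are assumptions.
import torch
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((384, 128)),                  # input size 384 x 128
    T.RandomHorizontalFlip(),              # random flipping
    T.Pad(10),
    T.RandomCrop((384, 128)),              # random cropping
    T.ToTensor(),
])
model = torch.nn.Linear(2048, 751)         # placeholder for the full network
optimizer = torch.optim.SGD(model.parameters(), lr=8e-4,
                            momentum=0.9, weight_decay=5e-4)
```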

3.3 Experimental Results

To verify the effectiveness of the proposed method, we compare it with existing person re-identification methods, including PCB [3], MGN [11], PCB + RPP [3], Harmonious Attention Convolutional Neural Network (HA-CNN) [12], Second-Order Non-Local Attention (SONA) [13], AlignedReID [14], HONet [15], GCP [16], CDNet [17], and PAT [18]. Table 2 shows the comparison of experimental results on the three datasets.

Table 2. Comparison of experimental results of different algorithms on the three datasets

As shown in Table 2, compared with the PCB + RPP method, our method achieves significant improvements on the Market-1501, DukeMTMC-reID and CUHK03 datasets: Rank-1 increases by 3.8%, 8.3%, and 21.4% respectively, and mAP increases by 12.8%, 17.6%, and 27.3% respectively. The reason is that the multi-branch method considers the relationships between different parts of the human body and better characterizes person information through the joint collaboration of multiple branches. Compared with the GCP method, we achieve significant improvements on all three datasets; although GCP adopts a relation analysis module, it does not pay enough attention to global features, whereas our multi-scale branch combines global and local training, so the extracted features are more discriminative and a better effect is achieved. Compared with the MGN method, the branches designed in this paper are more reasonable. Compared with SONA, which has the highest accuracy among the compared methods, all of our indicators are significantly better on the three datasets. Our model achieves the best results among the compared algorithms, which verifies its robustness and effectively improves the accuracy of person re-identification.

3.4 Ablation Experiments

To further verify the effectiveness of the proposed multi-branch design, we analyze the model both qualitatively and quantitatively. First, we evaluate the baseline ResNet50 network pre-trained on ImageNet. Then, the individual branches and their combinations are added in turn. The specific comparison results are shown in Table 3.

Table 3. Ablation study on three datasets

The multi-branch cooperative network is constructed to extract pedestrian features more effectively, improving the recognition performance of the model and achieving good results in more complex environments and conditions. Table 3 compares the baseline method with the model proposed in this paper. The baseline achieves the lowest recognition accuracy, and every branch outperforms it whether used alone or in combination. For the attention branches, introducing the CAP network significantly improves the results, which verifies the rationality of the network structure. At the same time, the network including all branches outperforms each branch working alone, with Rank-1 and mAP significantly improved. This shows that the branches are complementary and cooperative, each extracting person features at a different level of discrimination, which verifies the effectiveness of the model design.

3.5 Query Results Display

Fig. 4. Market-1501 dataset recognition examples

As can be seen from Fig. 4, the recognition accuracy of the baseline method is not high, with a high error rate among the top-10 results. In contrast, the features learned by the attention branches and the multi-scale branch complement each other, which significantly improves recognition accuracy.

4 Conclusions

Designing multi-branch networks to learn rich feature representations is an important direction in person re-identification (Re-ID). When extracting pedestrian features, both the regions where the model has strong recognition ability and the non-significant regions should be exploited as much as possible to enhance the robustness and recognition performance of the model. Therefore, this paper proposes a joint network based on multi-branch cooperation. Through the shade module, the random erasing module and the stepped module, strong person features are jointly extracted, and the CAP network ensures the diversity of local feature extraction. The multi-scale branch exploits the potential connections between non-adjacent parts and combines them with global features to construct a high-precision person Re-ID network. Experimental results on three public person Re-ID datasets show that the proposed multi-branch cooperative network extracts more discriminative and robust pedestrian features.