
1 Introduction

Person Re-Identification (Re-ID) aims to retrieve pedestrian targets with the same identity from multiple non-overlapping cameras, and has high practical value in society and industry. Compared with conventional image-based person Re-ID, video-based person Re-ID can obtain richer pedestrian information (e.g., action and viewpoint information), thus alleviating the negative influence of occlusion, which is common in person Re-ID. Therefore, video-based person Re-ID has received increasing academic attention and developed rapidly.

Currently, many person Re-ID methods [10, 13, 17, 25] take ResNet [3] as their backbone network to extract features, which effectively avoids the gradient vanishing and explosion problems of deep neural networks. However, limited by the size of the receptive field and the pooling operations, some image information is inevitably lost during feature learning. Besides, ResNet focuses more on local regions of the image and lacks the ability to model correlations between human body parts. These weaknesses limit Re-ID performance. To alleviate this problem, attention mechanisms [11, 20] have been widely adopted in video-based person Re-ID and have improved model performance, demonstrating their powerful representation ability by discovering salient regions in images.

The higher-level features of ResNet are rich in semantics but lacking in details, while the lower-level features contain more details but insufficient semantics, so previous works [1, 16, 27] have explored the effectiveness of hierarchical ResNet features for video-based person Re-ID. In fact, features at different levels can complement each other through a suitable aggregation. The deeper the layer, the smaller the scale of its feature map; naturally, the feature maps of all layers stack together like a feature pyramid, which enables feature aggregation for video-based person Re-ID. However, effectively aggregating features of different layers through the pyramid structure is crucial for dealing with various complicated situations. PANet [14] proposed a bidirectional Feature Pyramid Network (FPN) consisting of a top-down as well as a bottom-up path to aggregate features at each layer of the FPN. Similarly, Bi-FPN [9] proposed a nonlinear way to connect high-level and low-level features. M2det [24] adopted alternating joint U-shaped modules to fuse multi-level features. These methods bring some improvements but require a large number of parameters and computations due to their complex structures.

In this paper, we embed the attention mechanism and the Gated Recurrent Unit (GRU) into the FPN to propose a learning method called Multi-scale Context Aggregation (MSCA), which recurrently aggregates the detail and semantic information of multi-scale feature maps from the backbone through an Attention-aided FPN (AFPN) and enables the aggregation to focus on more salient regions in the video. Different from previous methods that first extract spatial features and later aggregate temporal features, our method aggregates spatio-temporal features simultaneously thanks to our proposed Temporal Enhancement Module (TEM). The TEM takes a GRU as its primary component; it can be plugged anywhere in the network in a plug-and-play manner to learn complementary clues in the temporal dimension, and it trains and converges easily due to its few parameters.

Overall, the main contributions of this paper are summarized as follows:

  • We propose an Attention-aided Feature Pyramid Network (AFPN) to recurrently aggregate the hierarchical features from each layer of the backbone in spatial and channel dimensions; thus, the network can exploit salient clues and avoid some wrong clues during aggregation.

  • We propose a Temporal Enhancement Module (TEM) that can be plugged into the backbone in a plug-and-play manner to merge the spatial and temporal information from pedestrian videos.

  • Based on AFPN and TEM, we propose a Multi-scale Context Aggregation (MSCA) method for video-based person Re-ID. We conduct extensive experiments on three widely used benchmarks (i.e., MARS, iLIDS-VID and PRID2011); the results demonstrate that our MSCA method is competitive with other state-of-the-art methods, and the ablation studies confirm the effectiveness of AFPN and TEM.

2 Method

2.1 Overview

The overall architecture of our proposed MSCA is illustrated in Fig. 1. We first adopt ResNet-50 [3] as our backbone and use the multi-scale features from Res2, Res3, Res4, and Res5. Then, we plug TEM into these four stages in a plug-and-play manner to learn temporal information from the video features, so that the features output by each layer contain both spatial and temporal information. We propagate the hierarchical features recurrently in AFPN for context aggregation, combining high-level semantic information with low-level detail information, and the SCA module is plugged in after aggregation to focus on more salient regions and improve the model performance. Finally, we use cross entropy loss and triplet loss as the objective functions to optimize the model in the training stage, and the features from the two aggregation directions are concatenated for testing.
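For clarity, the overall pipeline can be summarized by the following minimal sketch. The function and module names are placeholders rather than the authors' code, and the reshaping between frame and batch dimensions is omitted:

```python
import torch

def msca_forward(frames, stages, tems, afpn):
    """Sketch of the MSCA forward pass; frames: (B, T, 3, 256, 128) video clips."""
    x = frames
    multi_scale = []
    for stage, tem in zip(stages, tems):      # Res2, Res3, Res4, Res5 (stem omitted)
        x = stage(x)                          # spatial features of this residual block
        x = tem(x)                            # plug-and-play temporal enhancement (Sect. 2.3)
        multi_scale.append(x)
    f_b2t, f_t2b = afpn(multi_scale)          # recurrent context aggregation with SCA (Sect. 2.2)
    return torch.cat([f_b2t, f_t2b], dim=-1)  # the two directions are concatenated for testing
```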

Fig. 1. Illustration of our proposed Multi-scale Context Aggregation. The input frames are first fed to the ResNet-50 backbone. For the multi-scale feature maps from each residual block, we propose AFPN to recurrently aggregate the features and focus on more salient regions via the SCA module. TEM is our proposed temporal enhancement module, which can be plugged anywhere in the model to learn spatio-temporal information simultaneously. Finally, multiple losses are used to supervise the model in the training stage.

2.2 Attention-Aided Feature Pyramid Network

In video-based person Re-ID, multi-scale information aggregation has been one of the main means of improving performance. Because the ResNet backbone increases the feature dimensions and decreases the feature resolutions across contiguous layers, the hierarchical features suffer from insufficient semantic information at low levels and insufficient detail information at high levels. Lin et al. [12] exploit this intrinsic property to reverse the information flow and build an FPN on top of the backbone via a top-down path and lateral connections. High-level features with rich semantic information are up-sampled by nearest-neighbor interpolation and aggregated, through element-wise addition, with the features output by the lateral connection of the next lower layer, where the lateral connection adopts a \(1\times 1\) convolution layer to reduce channel dimensions; the aggregation then proceeds recurrently towards the shallow features, which carry more detail information and less semantic information. A \(3\times 3\) convolution layer is applied to each aggregated feature map to generate the final feature map and reduce the aliasing effect of the up-sampling operation.

Besides, to make the aggregated features more discriminative, we introduce a spatial-channel attention (SCA) module into the FPN and call the result the attention-aided FPN (AFPN): the features of each layer after aggregation are fed to the SCA module, which consists of spatial and channel attention. As shown in Fig. 2, the two kinds of attention are cascaded to compute the spatial attention and channel attention [19], which can be described as follows:

$$\begin{aligned} F_{s}=A_{s}(F) \otimes F \end{aligned}$$
(1)
$$\begin{aligned} F_{sc}=A_{c}(F_{s}) \otimes F \end{aligned}$$
(2)

where \(\otimes \) denotes element-wise multiplication; given the input feature tensor \(F \in \mathbb {R} ^ {C\times H\times W}\), \(A_{s}(\cdot )\) and \(A_{c}(\cdot )\) denote the computation of the spatial and channel attention maps, and \(F_{sc} \in \mathbb {R} ^ {C\times H\times W}\) is the final spatially- and channel-attended feature.

Fig. 2. Diagram of the SCA module

To obtain the spatial attention, we adopt two pooling operations, GAP and GMP, to generate two features: \(F_{avg}^{s} \in \mathbb {R} ^ {1\times H\times W}\) and \(F_{max}^{s} \in \mathbb {R} ^ {1\times H\times W}\); the two are then concatenated to obtain a two-layer feature descriptor. This descriptor is processed by a \(7\times 7\) convolution layer and a sigmoid layer to generate the spatial attention map \(A_{s}(F) \in \mathbb {R} ^ {1\times H\times W}\), which is calculated as follows:

$$\begin{aligned} A_{s}(F)=\sigma (conv_{7\times 7}([F_{max}^{s},F_{avg}^{s}])) \end{aligned}$$
(3)

The channel attention helps the model focus on more salient features by assigning greater weights to channels with higher responses. We again adopt GAP and GMP to generate two features: \(F_{avg}^{c} \in \mathbb {R} ^{C\times 1\times 1}\) and \(F_{max}^{c} \in \mathbb {R} ^{C\times 1\times 1}\); the two are added and fed into a convolution block, which contains two \(3\times 3\) convolution layers and one ReLU activation layer. After that, the channel attention map \(A_{c}(F) \in \mathbb {R} ^{C\times 1\times 1}\) can be formulated as follows:

$$\begin{aligned} A_{c}(F)=\sigma (conv_{3\times 3}(F_{max}^{c}+F_{avg}^{c})) \end{aligned}$$
(4)

To eliminate the superimposed effect of distracting information in the weighting process, the final video feature is obtained through the SCA module instead of the original \(3\times 3\) convolution layer.
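The following is a minimal PyTorch sketch of the SCA module (Eqs. (1)–(4)) and of the top-down aggregation in AFPN. The output channel width, the absence of channel reduction, and the placement of SCA at the topmost level are our own assumptions, not necessarily the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCAModule(nn.Module):
    """Cascaded spatial-channel attention, following Eqs. (1)-(4)."""

    def __init__(self, channels):
        super().__init__()
        # Spatial attention: 7x7 conv on the concatenated max/avg maps (Eq. (3)).
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Channel attention: two 3x3 conv layers with a ReLU in between (Eq. (4)).
        self.channel_conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):                                  # x: (N, C, H, W)
        # Spatial attention map A_s(F) and spatially attended feature F_s (Eq. (1)).
        s_max = x.max(dim=1, keepdim=True).values          # GMP along the channel axis
        s_avg = x.mean(dim=1, keepdim=True)                # GAP along the channel axis
        a_s = torch.sigmoid(self.spatial_conv(torch.cat([s_max, s_avg], dim=1)))
        f_s = a_s * x
        # Channel attention map A_c(F_s) and the final feature F_sc (Eq. (2)).
        c_max = F.adaptive_max_pool2d(f_s, 1)              # GMP over the spatial axes
        c_avg = F.adaptive_avg_pool2d(f_s, 1)              # GAP over the spatial axes
        a_c = torch.sigmoid(self.channel_conv(c_max + c_avg))
        return a_c * x


class AFPNTopDown(nn.Module):
    """Top-down direction of AFPN: recurrent aggregation with SCA after each fusion."""

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # 1x1 lateral connections reduce the channel dimensions of the backbone outputs.
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.scas = nn.ModuleList(SCAModule(out_channels) for _ in in_channels)

    def forward(self, feats):                              # feats: [Res2, Res3, Res4, Res5]
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        outs = [self.scas[-1](laterals[-1])]               # start from the highest level
        for i in range(len(laterals) - 2, -1, -1):
            up = F.interpolate(outs[0], size=laterals[i].shape[-2:], mode='nearest')
            outs.insert(0, self.scas[i](laterals[i] + up))  # SCA replaces the 3x3 smoothing conv
        return outs
```

For features taken from Res2–Res5 of ResNet-50, `in_channels` would be `[256, 512, 1024, 2048]`.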

Fig. 3. Detailed structure of our TEM

2.3 Temporal Enhancement Module

Video sequences contain rich temporal information, so we design a Temporal Enhancement Module (TEM) based on GRU; it can be plugged anywhere in the model in a plug-and-play manner to capture temporal clues from the video feature maps. As shown in Fig. 3, we apply GAP to the input feature map \(F \in \mathbb {R} ^{T\times C\times H\times W}\) and squeeze out all the spatial information to obtain the temporal vector \(F' \in \mathbb {R} ^{T\times C}\). Next, the temporal vector is processed by the GRU, the output is recovered to the same size as the original input, and a skip connection with the input yields the final temporally enhanced feature \(F'' \in \mathbb {R} ^{T\times C\times H\times W}\), which incorporates the temporal information of the video and makes the video feature representation more discriminative. The whole process of TEM is formulated as follows:

$$\begin{aligned} F'=Squeeze(GAP(F)) \end{aligned}$$
(5)
$$\begin{aligned} F''=Unsqueeze(GRU(F'))+F \end{aligned}$$
(6)
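A minimal PyTorch sketch of TEM following Eqs. (5)–(6) is given below. The single-layer GRU with hidden size equal to the channel dimension, and the extra batch dimension, are our assumptions, chosen so the GRU output can be added back to the input:

```python
import torch.nn as nn


class TEM(nn.Module):
    """Temporal Enhancement Module: GAP -> GRU -> unsqueeze, plus a skip connection."""

    def __init__(self, channels):
        super().__init__()
        # Single-layer GRU over the T frame descriptors; hidden size equals the
        # channel dimension so the output matches the input channels.
        self.gru = nn.GRU(input_size=channels, hidden_size=channels, batch_first=True)

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = x.mean(dim=(3, 4))                 # GAP + squeeze -> (B, T, C), Eq. (5)
        f, _ = self.gru(f)                     # temporal modelling across the T frames
        f = f.reshape(b, t, c, 1, 1)           # unsqueeze back towards the input shape
        return x + f                           # broadcast skip connection, Eq. (6)
```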

2.4 Loss Function

We employ two kinds of losses to jointly supervise the training of the model: cross entropy loss and hard triplet loss [4]. The two losses \(L_{xent}\) and \(L_{htri}\) are calculated as follows:

$$\begin{aligned} L_{xent}=\sum _{i=1}^{N}-q_{i}\log (p_{i}) \end{aligned}$$
(7)
$$\begin{aligned} L_{htri}=[d_{pos}-d_{neg}+m]_{+} \end{aligned}$$
(8)

where \(p_{i}\) is the predicted probability of identity i and \(q_{i}\) is the ground-truth label in the identification loss, \(d_{pos}\) and \(d_{neg}\) denote the distances of positive and negative sample pairs, respectively, \([\cdot ]_{+}=\max (\cdot ,0)\), and m is the distance margin, which is set to 0.3 in the training procedure.

In this paper, in order to better supervise the training of the model, we supervise both the output of Res5 in the backbone and the output of AFPN, corresponding to the two directions of feature aggregation. Therefore, the total loss is the summation of the four losses:

$$\begin{aligned} L_{total}=L_{xent}^{b2t}+L_{htri}^{b2t}+L_{xent}^{t2b}+L_{htri}^{t2b} \end{aligned}$$
(9)
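For reference, a compact sketch of these losses is shown below, with batch-hard triplet mining in the spirit of [4] and margin m = 0.3; the helper names and the use of Euclidean distance are our assumptions:

```python
import torch
import torch.nn.functional as F


def hard_triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss: each anchor uses its hardest positive and hardest negative."""
    dist = torch.cdist(feats, feats)                                   # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values   # hardest positive per anchor
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values     # hardest negative per anchor
    return F.relu(d_pos - d_neg + margin).mean()                       # Eq. (8), averaged over anchors


def total_loss(logits_b2t, feats_b2t, logits_t2b, feats_t2b, labels):
    """Eq. (9): cross entropy (Eq. (7)) plus hard triplet loss for both supervised branches."""
    return (F.cross_entropy(logits_b2t, labels) + hard_triplet_loss(feats_b2t, labels)
            + F.cross_entropy(logits_t2b, labels) + hard_triplet_loss(feats_t2b, labels))
```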

3 Experiments

3.1 Datasets

The MARS [26] dataset is the largest video-based person re-identification benchmark, with 1261 identities and around 20,000 video sequences generated by the DPM detector and the GMMCP tracker. The dataset is captured by six cameras; each identity is captured by at least two cameras and has 13.2 sequences on average. The dataset also contains 3248 distractor sequences, which increase the difficulty of Re-ID.

The iLIDS-VID [10] dataset is captured by two cameras in an airport hall. It contains 600 video sequences of 300 identities. This benchmark is very challenging due to pervasive background clutter, mutual occlusions, and lighting variations.

The PRID2011 [5] dataset captures 385 identities with camera A and 749 identities with camera B, but only the first 200 identities appear in both cameras.

Table 1. Comparison with state-of-the-art methods on the MARS, iLIDS-VID and PRID2011 datasets.

3.2 Evaluation Metrics

We employ the Cumulative Matching Characteristic (CMC) curve and the mean Average Precision (mAP) as evaluation criteria. CMC treats Re-ID as a ranking problem and represents the accuracy of person retrieval for each given query, while mAP reflects the overall ranking quality when multiple ground-truth sequences exist. For convenience, Rank-1, Rank-5 and Rank-20 are reported to represent the CMC curve.
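As a simplified reference, Rank-k and mAP can be computed from a query-gallery distance matrix as sketched below; the standard junk and same-camera filtering used for MARS is omitted for brevity, and the function name is ours:

```python
import numpy as np


def evaluate(dist, q_ids, g_ids, ranks=(1, 5, 20)):
    """Simplified CMC (Rank-k) and mAP from a (num_query, num_gallery) distance matrix."""
    order = np.argsort(dist, axis=1)                          # gallery sorted by distance per query
    matches = (g_ids[order] == q_ids[:, None]).astype(float)  # 1 where the identity matches
    # CMC: a query scores a hit at rank k if any of its top-k gallery samples match.
    cmc = {k: float((matches[:, :k].sum(axis=1) > 0).mean()) for k in ranks}
    # mAP: mean of the average precision over all queries with at least one ground truth.
    aps = []
    for m in matches:
        hits = np.where(m == 1)[0]
        if hits.size == 0:
            continue
        precision_at_hits = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())
    return cmc, float(np.mean(aps))
```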

3.3 Implementation Details

ResNet-50 pre-trained on ImageNet is employed as our backbone, and the input images are all resized to \(256\times 128\). We also utilize commonly used data augmentation strategies, including random horizontal flipping, random erasing and random cropping. Specifically, in the training stage, we employ a restricted random sampling strategy to randomly sample \(T=8\) frames from each video as input. The Adam optimizer with an initial learning rate of \(5\times 10^{-5}\) and a weight decay of \(5\times 10^{-4}\) is used to update the parameters. We train the model for 200 epochs, and the learning rate is reduced by a factor of 0.1 every 50 epochs. All the experiments are conducted with PyTorch on an NVIDIA RTX 3090 GPU.
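The training configuration described above roughly corresponds to the following sketch; the padding before random cropping and the model placeholder are our assumptions:

```python
import torch
import torchvision.transforms as T

# Data augmentation for 256x128 inputs: flipping, cropping and erasing, as described above.
train_transform = T.Compose([
    T.Resize((256, 128)),
    T.RandomHorizontalFlip(),
    T.RandomCrop((256, 128), padding=10),   # padding amount is an assumption
    T.ToTensor(),
    T.RandomErasing(),
])

model = torch.nn.Linear(1, 1)               # placeholder for the MSCA model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=5e-4)
# 200 epochs in total, learning rate decayed by a factor of 0.1 every 50 epochs,
# with T = 8 frames sampled per video by restricted random sampling.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
```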

3.4 Comparison with State-of-the-Art Methods

In this section, we compare our proposed method with other state-of-the-art video-based person Re-ID methods on MARS, iLIDS-VID and PRID2011.

Results on MARS. As shown in Table 1, compared with other state-of-the-art methods, the proposed MSCA achieves the best Rank-1 accuracy and competitive Rank-5 and Rank-20 accuracy. According to the results of the CTL baseline [13] and our baseline in Table 2, although both adopt ResNet-50 as the backbone, the mAP score of our baseline is 78.3% while that of the CTL baseline is 82.7%; this gap between the baselines explains why a margin remains in the final mAP result, even though our method performs well. Among the most recent strong works for video-based person Re-ID, CTL adopts ResNet-50 as the backbone while PiT [22] is based on the Transformer [2], and both rely on complex modules such as a key-point estimator, topology graph learning, or a hard-to-train Transformer. In comparison, our method reaches the best Rank-1 accuracy with an effective context aggregation. This demonstrates that our MSCA can aggregate more discriminative information, and that its concise feature learning structure also generalizes well.

Results on iLIDS-VID and PRID2011. We also conduct several experiments on the two small datasets, as shown in Table 1, to examine the advantages and possible flaws of our proposed method compared with existing methods. We can observe that the result on iLIDS-VID is worse than those of other state-of-the-art methods; the cause is that the video sequences of iLIDS-VID exhibit large lighting variations and serious occlusions. TEM only considers the temporal correlation and ignores the quality of the video sequences, so low-quality sequences introduce additional irrelevant information and decrease the model performance. On PRID2011, which is of the same scale as iLIDS-VID, our method achieves the best Rank-1 accuracy of 96.6% and outperforms all previous approaches, confirming the superiority of our proposed method.

3.5 Ablation Study

To demonstrate the effectiveness of our proposed methods, we perform ablation studies on the MARS dataset and use a strong CNN baseline built on the ResNet-50 backbone. The experimental results are reported in Table 2 and Table 3.

Table 2. Ablation analysis of two components on MARS dataset

MSCA: As shown in Table 2, the baseline contains only the ResNet-50 backbone and is supervised by \(L_{xent}^{b2t}\) and \(L_{htri}^{b2t}\); its Rank-1 and mAP accuracy are 88.2% and 78.3%, respectively. When AFPN is used alone to recurrently aggregate the diverse multi-scale features, the performance reaches 89.1% in Rank-1 and 79.4% in mAP, which is attributed to the complementary information of high-level semantic features and low-level detail features. Moreover, using TEM alone achieves 90.7% in Rank-1 and 82.1% in mAP; this result shows that the temporal enhancement module complements the individual spatial features by learning more temporal information from videos. Finally, by adding both AFPN and TEM to the baseline, the model learns even more discriminative features through the effective multi-scale context aggregation.

AFPN: To explore the effectiveness of the SCA module and its position in the FPN, we conduct several experiments with the TEM enabled. Table 3 compares different plugging positions and shows the performance gap between our proposed AFPN and the vanilla FPN, where “w/o SCA” denotes the vanilla FPN without the SCA module, “\(1\times 1\)” denotes plugging the SCA module into the lateral connection, and “Up-sample” denotes plugging the SCA module into the up-sampling process. With the SCA module plugged in after propagation, the Rank-1 accuracy and mAP score are improved by 0.5% and 0.8%, respectively. Meanwhile, the effect differs depending on the plugging position of the SCA module: as observed in Table 3, plugging it into the lateral connection or into the up-sampling process both lead to performance degradation compared with plugging it in after propagation.

Table 3. Ablation analysis of different plugging positions on MARS dataset

3.6 Visualization

As shown in Fig. 4, we report examples of different identities with Grad-CAM [18], which is commonly used in computer vision for visual explanation. To verify the effectiveness of our proposed MSCA, we compare the Grad-CAM visualizations of the baseline with those of our method; the three example images are selected at intervals of at least 10 frames from three independent sequences of the MARS dataset. The frames contain various conditions such as motion and partial occlusion. As shown in Fig. 4(a), for ordinary frames, the features extracted by our proposed method capture more information, including the torso, legs and accessories that are discriminative for the target person. In Fig. 4(b), when the subject exhibits significant motion, our method focuses on the areas with more motion compared to the baseline. As shown in Fig. 4(c), the features extracted by our proposed method capture more body regions without the additional bicycle information, which enhances the representation of the target person. In general, our MSCA can effectively capture more spatio-temporal information and avoid the influence of occlusions to improve the performance.

Fig. 4. Visualization of attention maps on different identities of the baseline and our proposed method. (a) Person takes up most of the image. (b) Person with significant motion. (c) Person is half occluded by a bicycle

4 Conclusion

In this paper, we propose an innovative multi-scale context aggregation method for video-based person Re-ID, which recurrently learns richer video context information. AFPN aggregates the semantic and detail information of multi-scale features: it integrates high-level semantic information into low-level detail information and uses the SCA module to help the aggregated features focus on salient regions. Furthermore, we propose TEM to capture the temporal information among video frames; thanks to its plug-and-play property, we can aggregate temporal features while extracting spatial features to enrich the final video feature representations, which is entirely different from previous works. The experimental results on three standard benchmarks demonstrate that our proposed method achieves competitive performance compared with most state-of-the-art methods.