Introduction

Re-ID is based on the whole-body appearance of pedestrians. It uses computer vision technology to match the same pedestrian across different times and locations, thereby compensating for the visual limitations of fixed cameras. As an important subtask of image retrieval, Re-ID makes up for the difficulty of confirming pedestrian identity when face information is missing in surveillance video, and has broad application prospects in communities, supermarkets, airports and other scenes.

The concept of “Person Re-identification” was first mentioned by Zajdel et al. [1], from the University of Amsterdam, in a 2005 article on multi-camera target tracking. Due to the complex extraction process and poor representation ability of hand-crafted features [2, 3], traditional Re-ID algorithms could only handle simple cross-camera differences in pedestrian appearance and viewpoint. In 2014, motivated by the strong learning ability of Convolutional Neural Networks (CNNs), deep learning techniques from image classification began to be applied to Re-ID [4,5,6,7]. With the rapid development of computer vision technology, research on deep-learning-based Re-ID has flourished since 2016.

The most significant difficulty of deep-learning-based Re-ID is that, across cameras, whole-body images vary greatly in scale, posture and resolution, so the features of the same pedestrian can differ substantially. In addition, unavoidable occlusion leads to the loss of salient pedestrian information, which further reduces the accuracy of Re-ID. Therefore, the cross-camera problems of scale variation, blur and occlusion have long been the focus of many researchers.

In response to these problems, current mainstream Re-ID algorithms fall into two main categories: representation learning and metric learning. Representation learning extracts or combines more robust features to strengthen the discriminability of pedestrian features [8]. The attention mechanism is an outstanding representative of this line: because of its effectiveness in extracting salient pedestrian features, it has been widely used in Re-ID in recent years [9,10,11]. Meanwhile, other scholars focus on metric learning and are committed to designing better metric functions, so that the intra-class distance of pedestrian features becomes smaller and the inter-class distance larger. The triplet loss function [12], one of the most basic and attractive Distance Losses, has inspired many valuable metric functions. With the advantages of joint training becoming increasingly prominent, more and more researchers choose to integrate representation learning and metric learning. Beyond these two categories, some recent Re-ID methods are based on local features. Their main idea is to align and integrate local pedestrian features through image segmentation [13], skeleton key-point localization [14] and posture correction [15], using local features as supplementary information to improve the discriminability of Re-ID.

Although the above typical methods can greatly improve Re-ID performance, they also introduce new problems: the attention mechanism enhances salient feature extraction but introduces a large number of new parameters, which greatly increases the computational complexity of the model; the loss functions in metric learning are complex to design and difficult to implement; and although local-feature-based Re-ID methods are simple, they bring considerable redundant computation and slow down Re-ID.

Fig. 1 The structure comparison of FPN and our DPFM. a FPN, b DPFM

Different from the above classical algorithms, we propose DFPN, a new framework based on a dense feature pyramid, from the direction of representation learning; it offers high flexibility because it can be trained from scratch. In particular, DFPN transfers the semantic information of high-level feature maps downward step by step and fully integrates it with the details of low-level feature maps. Without introducing external feature enhancement modules, DFPN improves the richness, balance and integrity of pedestrian information in cross-camera scenes, and thus improves the representation ability of pedestrian feature vectors. The specific innovations are as follows:

1. Feature extraction: we build the backbone RVNet, which has a simple structure, fewer parameters and is easy to train.

2. Feature enhancement module I: we design the Dense Pyramid Fusion Module (DPFM, shown in Fig. 1b), which combines high-level semantic information with low-level detail information and improves the scale invariance of the feature map.

3. Feature enhancement module II: we introduce the Improved RFB (IRFB) to resolve the separation of contextual information in the feature map and effectively handle pedestrian matching under blur and occlusion.

The remainder of the paper is organized as follows: the next section reviews related work; the subsequent section presents the proposed framework and its modules; the penultimate section describes the experimental design and analyzes the results; the final section concludes the paper.

Related works

Feature extraction in Re-ID

A key step in deep-learning-based computer vision tasks, such as object detection and Re-ID, is selecting an appropriate backbone. The depth and structural design of the backbone largely determine the final performance of the network. In 2012, AlexNet [16] won the ImageNet competition and became the first convolutional neural network to solve a large-scale classification task. VGGNet [17] was among the earliest deep convolutional neural networks after the rise of deep learning, and it demonstrated that increasing network depth can significantly improve performance. Compared with AlexNet, VGGNet replaces the larger convolution kernels in AlexNet (such as 11×11 and 7×7) with stacks of 3×3 convolution layers, which lowers the computational cost; at the same time, the additional nonlinear layers deepen the network and allow it to learn more complex patterns.

Although model performance can be improved by deepening the network, deeper networks usually have more parameters. Moreover, since network optimization relies on back-propagation, gradients tend to vanish or explode as the number of layers increases, making the model difficult to train and slow to converge. Aiming at the problem of deep network degradation, He et al. proposed ResNet [18] in 2015, whose ResBlock offers a convenient way to build deeper neural networks. DenseNet [19] later pointed out that if layers close to the input have shorter connections to layers close to the output, a deeper network becomes easy to construct and train, which effectively improves feature utilization and transmission efficiency. For example, Highway Networks [20] and Stochastic Depth [21] both provide shorter connections between front and back layers.

Feature enhancement in Re-ID

In deep-learning-based Re-ID, features are mainly enhanced through attention mechanisms and feature fusion. Attention mechanisms can focus on regions of interest to strengthen salient features, so they are widely used in Re-ID and have achieved good results [22,23,24]. However, occlusion causes the loss of key information, which greatly degrades the performance of these methods. As another means of feature enhancement, feature fusion can compensate for this shortcoming of attention mechanisms.

As a typical representative of feature fusion methods, the Feature Pyramid Network (FPN) [25] addresses the problem of variable object scale (as shown in Fig. 1a). Before FPN, most deep-learning-based detectors made predictions only on the top feature map of the network; FPN instead uses a top-down architecture with lateral connections to construct high-level semantics in feature maps at all levels. In addition, DenseNet adopts dense connections, which strengthen feature reuse and make the network easier to train. Although FPN handles multi-scale problems well, it brings no effective improvement for blurred or occluded objects. Increasing the receptive field is a key way to address occlusion: Liu [26] introduced the RFB into the SSD network to strengthen its feature extraction ability by simulating the receptive field of human vision. Structurally, RFB draws on the multi-branch fusion idea of Inception [27]; the main difference is that RFB adds dilated convolutions, which effectively enlarge the receptive field. In Re-ID, using only abstract high-level features or a single feature map cannot handle the multi-scale and occlusion problems of pedestrians well, so combining deep and shallow features through appropriate feature fusion is a worthwhile innovation.

Proposed methods

This section consists of four subsections. First, we introduce the overall framework of the proposed DFPN; then, in the second to fourth subsections, we successively elaborate the design of the three key modules: RVNet, DPFM and IRFB.

Introduction of the overall framework

Fig. 2 Overall framework of our DFPN

As shown in Fig. 2, the proposed DFPN mainly comprises four modules: the feature extraction module RVNet, feature enhancement module I (DPFM), feature enhancement module II (IRFB), and the feature aggregation module Heads. The first three are our newly designed modules, and they perform different functions within the architecture:

1. RVNet, the backbone of our network, is one of the most important modules in DFPN. It consists of five blocks (Block1 to Block5), each composed of several of our designed RCBs. The RCB not only effectively controls the number of channels in the feature map, but also alleviates network degradation by adopting the “Shortcut” design idea [20].

2. To address the multi-scale problem of the object, we feed the extracted features into DPFM, whose essence is feature fusion. DPFM selects multiple high-level feature maps in RVNet, recursively transmits their rich semantic information downward through dense connections, and fuses them with the low-level feature maps in a specific way. The fused feature map combines global semantic information with local detail information, which enhances the constraint and expression of salient pedestrian features.

3. In view of occlusion and blur, we design the IRFB and insert it at specific locations in the network. IRFB improves on RFB [26]: we equivalently replace the larger-kernel convolution layers in RFB, reducing parameters while adding more nonlinear mappings. The IRFB helps the network extract contextual feature information of the same pedestrian and improves the model's ability to discriminate occluded objects.

4. The final feature aggregation module Heads is composed of an adaptive average pooling layer, a batch normalization (BN) layer and a fully connected layer (i.e., Linear layer). Its purpose is to integrate the highly abstract features produced by the convolutional stages and to provide interfaces for the Distance Loss and the Identification Loss. In particular, the pooling layer reduces the feature dimension and the normalization layer yields a stable data distribution.

Specifically, with reference to Fig. 2, the algorithm flow of our DFPN is as follows:

figure a (algorithm flow of DFPN)
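Since the algorithm-flow figure cannot be reproduced here, the following minimal PyTorch sketch outlines the same pipeline. The Heads module follows the description above; the RVNet, DPFM and IRFB interfaces are assumptions (they are detailed in the following subsections), so this is an illustrative sketch rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Heads(nn.Module):
    """Feature aggregation: AdaptiveAvgPool -> BN -> Linear, as described above.
    The pooled feature feeds the Distance Loss; the logits feed the
    Identification Loss."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bn = nn.BatchNorm1d(in_channels)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):
        feat = self.pool(x).flatten(1)   # global feature vector
        logits = self.fc(self.bn(feat))  # identity class scores
        return feat, logits

def dfpn_forward(images, rvnet, dpfm, irfb, heads):
    """Assumed flow of DFPN (cf. Fig. 2). rvnet is assumed to return the
    Block3/4/5 feature maps (with an IRFB already inserted between Block3
    and Block4); dpfm fuses the levels; a second IRFB enhances the fused map."""
    c3, c4, c5 = rvnet(images)     # multi-level features from the backbone
    fused = dpfm(c3, c4, c5)       # dense pyramid fusion (DPFM)
    enhanced = irfb(fused)         # receptive-field/context enhancement
    return heads(enhanced)         # (feature vector, identity logits)
```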

Design of RVNet

CNNs have made great progress in recent years, but the mainstream networks represented by VGGNet and ResNet still have the following defects: the former has a simple structure and usually uses small convolution kernels, with fewer parameters, but it is not easy to converge; the latter uses the “Shortcut” idea to solve network degradation during training, but its deep structure increases the amount of computation. In addition, both the VGGNet and ResNet series need pretraining on the large ImageNet dataset before being transferred to specific tasks, which makes the models less flexible. To this end, we design a new backbone, RVNet, which combines the simple structure of VGGNet with the advantages of the “Shortcut”. It not only greatly reduces parameters, but also enables the network to be trained from scratch.

Figure 2 shows that RVNet is composed of five blocks with the RCB as the basic unit, following a simplified VGGNet-19 structure (see Table 1 for the network structure). The essence of RVNet is the designed RCB block. Figure 3b shows that an RCB consists of two composite convolutional layers (CCLs) of the form “Conv-BN-LeakyReLU”. The difference between the two CCLs is that the first uses a 1×1 convolution layer and the second a 3×3 convolution layer, both with a stride of 1. First, the 1×1 convolution compresses the number of channels of the feature map without changing its spatial size, reducing subsequent computation. Second, the 3×3 convolution restores the number of channels and processes the feature information; since a 3×3 kernel has fewer parameters than larger kernels, the computation of the network is further reduced. In particular, this channel reduction and restoration on the input feature map increases cross-channel information interaction, and thus produces a more representative feature vector.

At the same time, referring to the design concept of the ResBlock, we add a “Shortcut” branch in each RCB to supervise the shallow feature mapping and to mitigate the training difficulty caused by gradient divergence in deep networks. As Fig. 3 shows, our RCB differs from the ResBlock in the following ways (a code sketch follows the list):

1. The activation function used in the RCB is LeakyReLU, whereas the ResBlock uses ReLU. As a variant of ReLU, LeakyReLU also passes a small gradient for negative values, which helps retain the integrity and diversity of the feature information.

2. The RCB alternates 1×1 and 3×3 convolution layers, reducing parameters and increasing the nonlinearity of the network by reducing and then restoring the channel dimension of the feature map, whereas the two 3×3 convolution layers in the ResBlock keep the number of channels constant.

3. The feature fusion operation used by the RCB is “Concat”, whereas the ResBlock uses “Element-wise add”. The “Concat” operation aggregates multi-level features better and, unlike “Element-wise add”, does not cause confusion of feature information.

Fig. 3 The structure comparison of RCB and ResBlock. a ResBlock, b RCB

Table 1 The structure comparison of VGGNet-19 and RVNet

Design of DPFM

In computer vision, FPN is an effective method for handling object scale variation. It constructs a feature pyramid over multi-level feature maps in a top-down manner, and then makes predictions on feature maps of different scales. However, the construction of FPN still has several problems:

1. High-level feature maps carry strongly discriminative semantic information, but FPN uses only the adjacent upper-level feature map during construction, leading to low utilization of the other high-level feature information.

2. The “Element-wise add” fusion used by FPN destroys the hierarchical information of the feature maps, resulting in feature confusion.

3. Multi-scale prediction can make the results more accurate, but it also increases the amount of computation.

To solve these problems, we exploit the structural advantages of DenseNet and propose the Dense Pyramid Fusion Module (DPFM). Different from FPN, DPFM applies dense connections: it recursively transmits the rich semantic information of the high-level feature maps to the bottom and concatenates them with the low-level feature map in a specific way (the ratio of the number of channels of the current layer's feature map to the sum of channels of all downward-transmitted high-level feature maps is 1:1). Finally, only the end feature map is used for prediction, as shown in Fig. 4. Compared with FPN, DPFM has significant advantages. On the one hand, dense connections let the loss function supervise not only the final output layer but also the earlier non-output layers (i.e., all recursively transmitted high-level feature maps), which both deepens the mining of high-level semantic information and helps alleviate degradation during training. On the other hand, compared with direct “Element-wise add”, channel concatenation preserves the integrity of the hierarchical relationship of the aggregated features and combines global semantic information with local detail information, thus responding effectively to appearance changes such as pedestrian scale across devices.

The specific construction process of DPFM is as follows: first, the feature map C5 output by Block5 is processed by a CCL to generate Feature5 (as shown in Eq. (1)); the size and number of channels of C5 remain unchanged during this processing. Second, the deep feature map Feature5, which contains rich semantic information, generates Feature4 (Eq. (2)) by applying Op1 (Fig. 5a) and concatenating the result with Block4's output feature map C4 at a channel ratio of 1:1. Then, Feature4 and Feature5 are concatenated with the feature map C3 output by Block3 through Op2 (Fig. 5b) and Op3 (Fig. 5c), respectively, to generate the feature map Feature3 (Eq. (3)). Finally, the end feature map, which combines shallow detail features and deep semantic features, is output to discriminate the Re-ID task.

$$\begin{aligned}&\text {Feature}5= \text {CCL}(\text {C}5)\end{aligned}$$
(1)
$$\begin{aligned}&\text {Feature}4= [\text {C}4, \text {Op}1(\text {Feature5})]\end{aligned}$$
(2)
$$\begin{aligned}&\text {Feature}3= [\text {C}3, \text {Op}2(\text {Feature}4), \text {Op}3(\text {Feature}5)]. \end{aligned}$$
(3)
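A minimal PyTorch sketch of Eqs. (1)–(3) follows. The exact forms of Op1–Op3 are given only in Fig. 5, so the `Op` below (bilinear upsampling plus a 1×1 Conv-BN-LeakyReLU that enforces the 1:1 channel ratio) is our assumption, as are the even channel split between Op2 and Op3 and the premise that each backbone block halves the spatial resolution.

```python
import torch
import torch.nn as nn

class Op(nn.Module):
    """Assumed form of Op1/Op2/Op3: upsample to the target resolution, then a
    1x1 Conv-BN-LeakyReLU that sets the channel count for 1:1 concatenation."""
    def __init__(self, in_ch, out_ch, scale):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode='bilinear',
                              align_corners=False)
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x):
        return self.proj(self.up(x))

class DPFM(nn.Module):
    """Sketch of Eqs. (1)-(3): C5's semantics are passed down densely and
    concatenated with C4 and C3; only Feature3 is used for prediction."""
    def __init__(self, ch3, ch4, ch5):
        super().__init__()
        # Eq. (1): a CCL that keeps C5's size and channel count unchanged
        self.ccl5 = nn.Sequential(
            nn.Conv2d(ch5, ch5, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch5),
            nn.LeakyReLU(inplace=True),
        )
        self.op1 = Op(ch5, ch4, scale=2)             # matches C4's channels (1:1)
        self.op2 = Op(2 * ch4, ch3 // 2, scale=2)    # Feature4 has 2*ch4 channels
        self.op3 = Op(ch5, ch3 - ch3 // 2, scale=4)  # op2+op3 channels sum to ch3

    def forward(self, c3, c4, c5):
        feature5 = self.ccl5(c5)                                 # Eq. (1)
        feature4 = torch.cat([c4, self.op1(feature5)], dim=1)    # Eq. (2)
        feature3 = torch.cat([c3, self.op2(feature4),
                              self.op3(feature5)], dim=1)        # Eq. (3)
        return feature3
```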
Fig. 4 The structure of DPFM

Fig. 5 The specific operations for Op1, Op2 and Op3. a Op1, b Op2, c Op3

Design of IRFB

When an object is occluded, surrounding information can often assist the discrimination. Inspired by this, RFB effectively enlarges the receptive field of the feature map by introducing dilated convolution, thereby bringing in more contextual auxiliary information; at the same time, RFB further improves the scale invariance of the model through a multi-branch structure, as shown in Fig. 6a. Based on RFB, we propose IRFB. The specific structural improvement is as follows: as shown in Fig. 6b, the 5×5 convolution layer in RFB is replaced by two serial 3×3 convolution layers with stride 1 and padding 1. In particular, the first 3×3 convolution compresses the number of channels of the feature map to half of the original, and the second restores it to the corresponding value; in other words, we compress and recover the feature map's channels by controlling the channel counts of the two 3×3 convolution layers. In view of its function, we insert IRFB at the junction of Block3 and Block4 in RVNet, as well as after the DPFM module, to increase its sensitivity to scale changes of the feature map. In addition, we introduce the “Shortcut” technique in IRFB, which makes gradient back-propagation easier during training.
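A minimal sketch of the replaced branch is given below. The text specifies only the two serial 3×3 convolutions (stride 1, padding 1) with compression to c/2 and restoration to the 2c output of the original 5×5 branch (cf. Eq. (6)); the intermediate BN and LeakyReLU layers are assumptions motivated by the "two activation functions" advantage listed next.

```python
import torch.nn as nn

def rfb_branch(c):
    """Original RFB branch: a single 5x5 convolution (c -> 2c channels)."""
    return nn.Conv2d(c, 2 * c, kernel_size=5, padding=2, bias=False)

def irfb_branch(c):
    """IRFB replacement: two serial 3x3 convolutions with stride=1, padding=1.
    The first compresses the channels to c/2, the second restores them to 2c;
    the receptive field stays 5x5 while the parameters drop (Eqs. (5)-(6))."""
    return nn.Sequential(
        nn.Conv2d(c, c // 2, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(c // 2),
        nn.LeakyReLU(inplace=True),  # assumed intermediate nonlinearity
        nn.Conv2d(c // 2, 2 * c, kernel_size=3, stride=1, padding=1, bias=False),
    )
```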

Compared with RFB, IRFB has the following advantages:

1. Due to the mapping of two activation functions, IRFB has more nonlinearity.

2. IRFB has fewer parameters while keeping the size of the receptive field unchanged.

The calculation formula of receptive field R is

$$\begin{aligned} R_k=R_{k-1}+\left[ (f_k-1)\times \prod ^{k-1}_{i=1}s_i\right] . \end{aligned}$$
(4)

Among them, \(R_{k-1}\) is the receptive field size corresponding to the \((k-1)\)th layer, \(f_k\) is the convolution kernel size of the kth layer, and \(s_i\) is the stride of the ith layer. According to Eq. (4), the receptive field sizes before and after the improvement are calculated as follows; an intuitive schematic is shown in Fig. 7:

$$\begin{aligned} R_{\text {RFB}}&=5 \nonumber \\ R_{\text {IRFB}}&=3+(3-1)\times 1=5. \end{aligned}$$
(5)

The number of parameters of the two modules is calculated as follows:

$$\begin{aligned} S_{\text {RFB}}&=5\times 5\times c\times 2c=50c^2\nonumber \\ S_{\text {IRFB}}&=3\times 3\times c\times \frac{1}{2}c+3\times 3\times \frac{1}{2}c\times 2c=\frac{27}{2}c^2. \end{aligned}$$
(6)

From Eq. (6), it can be seen that the parameters of IRFB are reduced to about one quarter of those of RFB.
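Both calculations can be checked with a few lines of Python; the helper below implements Eq. (4) with \(R_0 = 1\) and the per-layer jump equal to the product of the previous strides:

```python
def receptive_field(kernel_sizes, strides):
    """Eq. (4): R_k = R_{k-1} + (f_k - 1) * prod(s_1..s_{k-1}), with R_0 = 1."""
    r, jump = 1, 1
    for f, s in zip(kernel_sizes, strides):
        r += (f - 1) * jump
        jump *= s
    return r

# Receptive field: one 5x5 conv vs. two stacked 3x3 convs (Eq. (5))
assert receptive_field([5], [1]) == 5          # RFB branch
assert receptive_field([3, 3], [1, 1]) == 5    # IRFB branch

# Parameter counts from Eq. (6), e.g. with c = 64 channels
c = 64
s_rfb = 5 * 5 * c * 2 * c                                  # 50 c^2
s_irfb = 3 * 3 * c * (c // 2) + 3 * 3 * (c // 2) * 2 * c   # 13.5 c^2
print(s_irfb / s_rfb)                                      # 0.27, about a quarter
```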

Fig. 6 The structure diagram before and after improvement. a RFB, b IRFB

Fig. 7 The receptive field diagram before and after improvement. a RFB, b IRFB

Experiments and analysis

This section contains five subsections. The specific arrangement is as follows: first, we introduce the datasets and evaluation indicators; second, we elaborate the experimental details and parameter selection; third, we set up ablation experiments to verify the effectiveness of each design; finally, in the fourth and fifth subsections, we compare with classical and advanced methods, respectively, highlighting the superiority of our overall model. Note that all our experiments are trained and tested on a single dataset.

Datasets and evaluation indicators

Datasets

Market1501 [28]: Market1501 contains 1501 pedestrian IDs and 32,668 detected images captured by 6 cameras. Each pedestrian is captured by at least 2 cameras, and there may be multiple images under one camera. The training set has 751 pedestrian IDs with 12,936 images; the query set has 750 pedestrian IDs with 3368 images; the test set has 16,384 images of 750 pedestrian IDs.

DukeMTMC-reID [29]: DukeMTMC is a large-scale multi-target, multi-camera pedestrian tracking dataset. DukeMTMC-reID is a person re-identification subset of DukeMTMC and provides manually labeled bounding boxes. It contains 16,522 training images of 702 pedestrian IDs, 2228 query images from another 702 pedestrian IDs, and a search gallery of 17,661 images covering the same 702 pedestrian IDs as the queries.

Evaluation indicators

We adopt the cumulative matching characteristics (CMC) at Rank-1 and the mean average precision (mAP) as evaluation indicators to compare different Re-ID methods on these datasets. The mAP is the mean of the average precision (AP) over all query images, and Rank-1 is the probability that the top-ranked image in the search results is the correct object.
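For concreteness, here is a minimal NumPy sketch of the two indicators on a query-by-gallery distance matrix. Note that the standard Market1501 protocol additionally filters out gallery entries that share both identity and camera with the query; that filtering is omitted here for brevity.

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """Compute Rank-1 and mAP from a (num_query x num_gallery) distance
    matrix and the query/gallery identity arrays (camera filtering omitted)."""
    aps, rank1_hits = [], 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])             # sort gallery by ascending distance
        matches = (g_ids[order] == q_ids[i])    # boolean relevance vector
        rank1_hits += matches[0]                # correct match at the top?
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / max(matches.sum(), 1))
    return rank1_hits / dist.shape[0], float(np.mean(aps))
```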

Implementation details and loss function

Implementation details

To ensure the consistency of the experimental results, all experiments are carried out in the same software and hardware environment. We build the model with the fast-reid framework [30]; the output feature dimension is set to 512 by default, and none of the designed models is pretrained. The computing platform uses a single GTX 1080Ti GPU (11 GB memory), with PyTorch 1.7.0, CUDA 11.0 and Python 3.8.3. In the training phase, we scale all images to 256×128 and apply random erasing with a probability of 0.5. In each batch, we randomly select 16 pedestrian IDs and randomly sample 4 images for each ID. The Adam optimizer is used to update the model parameters. The initial learning rate is set to 1e−4 with a multi-step schedule: at epochs 120, 270 and 360, the learning rate is reduced to 0.1 times its previous value, and training stops after a total of 390 epochs.
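The schedule above corresponds to the following PyTorch setup. This is a sketch: the model stand-in and the elided training step are placeholders, not the actual fast-reid configuration.

```python
import torch

model = torch.nn.Linear(512, 751)   # stand-in for the real DFPN network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[120, 270, 360], gamma=0.1)  # x0.1 at these epochs

for epoch in range(390):
    # ... iterate over batches of 16 IDs x 4 images, compute the loss,
    # call loss.backward() and optimizer.step() ...
    scheduler.step()  # one schedule step per epoch
```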

Loss function

The Triplet Loss [31] is selected as our Distance Loss, and its formula is calculated as follows:

$$\begin{aligned} L_{\text {tri}}=\left[ d_{\mathrm{p}}-d_{\mathrm{n}}+\alpha \right] _{+}. \end{aligned}$$
(7)

Among them, \(d_{\mathrm{p}}\) is the Euclidean distance between the target and the positive samples, \(d_{\mathrm{n}}\) is the Euclidean distance between the target and the negative samples, \(\alpha \) is the margin threshold, and \([\cdot ]_{+}\) denotes \(\max (\cdot , 0)\). The goal of Triplet Loss is to make \(d_{\mathrm{p}}\) smaller and \(d_{\mathrm{n}}\) larger through optimization.

For the Identification Loss, we choose the Softmax Loss, which maps the outputs of multiple neurons into the (0, 1) interval. Suppose there are n samples in a batch; the Softmax Loss is calculated as:

$$\begin{aligned} L_{\text {soft}}=-\frac{1}{n}\sum _{i=1}^{n}\sum _{j=1}^{m}p(x_{ij})\log (q(x_{ij})). \end{aligned}$$
(8)

Among them, m represents the number of classes, and \(p(x_{ij})\) and \(q(x_{ij})\) denote the true probability and the predicted probability, respectively, that sample i belongs to class j.
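Combining Eqs. (7) and (8), a minimal PyTorch sketch of the joint objective might look as follows. The batch-hard mining strategy and the margin value of 0.3 are our assumptions, not details specified in the text.

```python
import torch
import torch.nn.functional as F

def total_loss(feat, logits, labels, margin=0.3):
    """Joint objective sketch: hinged Triplet Loss (Eq. (7)) on the pooled
    features plus Softmax/cross-entropy Loss (Eq. (8)) on the logits."""
    # Pairwise Euclidean distances between all features in the batch
    dist = torch.cdist(feat, feat)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    # Batch-hard mining (assumed): hardest positive / negative per anchor
    d_p = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    d_n = dist.masked_fill(same, float('inf')).min(dim=1).values
    l_tri = F.relu(d_p - d_n + margin).mean()   # Eq. (7): [d_p - d_n + alpha]_+
    l_soft = F.cross_entropy(logits, labels)    # Eq. (8): Softmax Loss
    return l_tri + l_soft
```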

Ablation experiments

To ensure that each experiment has only one independent variable, note that, except for the last subsection, all other subsections use the Market1501 dataset, and unless otherwise stated, all experiments exclude model pretraining.

Analyze the performance of RVNet

To fully demonstrate the effectiveness of RVNet, we use progressive experiments to verify the designed CCL layer and RCB block.

Table 2 Performance comparison of different activation functions in CCL
(1) The effectiveness of LeakyReLU in CCL.

Starting from the most commonly used ReLU activation, we select its two most representative variants, PReLU [32] and LeakyReLU [33], for comparative experiments. In these experiments, the order of the basic layers in the CCL is “Conv-BN-Activ”.

As shown in Table 2, ReLU achieves 88.27% Rank-1 and 72.24% mAP on the Market1501 dataset. With PReLU, Rank-1 decreases by 2.34% and mAP by 3.76% compared with ReLU. With LeakyReLU, Rank-1 increases to 88.66% (up 0.39%) and mAP to 72.30% (up 0.06%).

In summary, LeakyReLU gives the best result in the CCL, indicating that setting all negative activations to 0, as ReLU does, is undesirable because it destroys the integrity of the feature information. Unless otherwise stated, we use LeakyReLU by default in subsequent experiments.

(2) The effectiveness of different combinations in CCL.

In this experiment, we verify the best ordering of “Convolution”, “Batch Normalization” and “LeakyReLU”. Here, “CBL” indicates that the order of the layers in the CCL is “Conv-BN-LeakyReLU”, and the other abbreviations follow analogously.

As Table 3 shows, the accuracy of the other combinations on Market1501 decreases to varying degrees relative to “CBL”, so the “CBL” combination is the best. This supports the common practice of placing the BN layer after the convolution layer and before the nonlinear layer (such as LeakyReLU). We attribute this to the fact that the shape of the output distribution of the nonlinear layer changes during training, so normalization cannot eliminate its variance shift; in contrast, the output of a convolution layer is generally a symmetric, non-sparse distribution (closer to Gaussian), which yields a more stable distribution after normalization. Unless otherwise stated, we use the “CBL” combination by default in subsequent experiments.

(3) The effectiveness of RCB.

As Table 4 shows, the network using VGGNet-19 as the backbone does not converge without pretraining. With a pretrained model, VGGNet-19 reaches 66.63% Rank-1 and 40.27% mAP; however, compared with our RVNet (without pretraining), pretrained VGGNet-19 still lags by 22.03% Rank-1 and 32.03% mAP. According to the loss curves in Fig. 8, our RVNet converges faster than the pretrained VGGNet-19 and finally achieves a better result. This fully illustrates that the RVNet design not only extracts features better, but is also more favorable for gradient propagation during training.

Table 3 Performance comparison of different combination of CCL
Table 4 Performance comparison before and after using RCB
Fig. 8 Loss curve using different backbones

Analyze the performance of DPFM and IRFB

Table 5 Performance comparison after using DPFM and IRFB

Based on RVNet, we further validate the effectiveness of the proposed DPFM and IRFB.

As shown in Table 5, adding DPFM on top of RVNet significantly improves accuracy: Rank-1 and mAP increase by 2.2% and 4.48%, respectively, which proves that fusing high-level semantic information with low-level detail information is indeed conducive to enhancing the discrimination of Re-ID tasks. After introducing IRFB, Rank-1 increases from 90.86% to 92.93% (up 2.07%) and mAP from 76.78% to 79.42% (up 2.64%), which fully shows that IRFB effectively extracts contextual feature information of the same pedestrian and improves the identification of occluded targets.

To further highlight the superiority of our DPFM and IRFB, we compare DPFM with FPN and IRFB with RFB. As shown in Table 6, the first two rows show that DPFM exceeds FPN by 1.89% Rank-1 and 3.03% mAP, which proves that DPFM better preserves the structure of hierarchical information. Similarly, comparing the third and fourth rows, IRFB exceeds RFB by 1.04% Rank-1 and 2.12% mAP, indicating that information interaction between different channels is very important.

These two groups of experiments fully prove that our DPFM and IRFB are effective, and also show that addressing the multi-scale and occlusion problems of pedestrian targets is indeed helpful for the Re-ID task.

Table 6 Performance comparison of FPN with DPFM and RFB with IRFB

Analyze the performance of dimensions and datasets

Table 7 Performance comparison of different feature dimensions

Table 7 shows that the model performs best when the feature dimension is 512; whether the dimension is larger or smaller, performance decreases to varying degrees. To verify the generalization ability of our model, we test it on another public dataset, DukeMTMC-reID. As Table 8 shows, our DFPN remains competitive on DukeMTMC-reID, with Rank-1 and mAP reaching 83.89% and 69.58%, respectively. In addition, DFPN trains and infers quickly on both Market1501 and DukeMTMC-reID, reaching an inference speed of 887 FPS on Market1501.

Table 8 Performance of our model on different datasets
Table 9 Comparison with the classical algorithms on Market1501 and DukeMTMC-reID datasets

Comparative experiment of classical algorithms

To prove the superiority of the overall design of DFPN, we select some of the most representative algorithms in the Re-ID field for comparison, including PAN [34], SVDNet [35], APRNet [36], REDA [37], DPFL [38], DuATM [39], PCB [13] and AANet [40]. The selection principles are as follows:

1. Their structures are based on convolutional neural networks;

2. They are representative algorithms of different Re-ID genres;

3. They all report experiments on the Market1501 and DukeMTMC-reID datasets using the Rank-1 and mAP evaluation indicators.

Table 10 Performance comparison of our DFPN before and after pretraining
Table 11 Comparison with the advanced algorithms on Market1501 and DukeMTMC-reID datasets

The comparison results are shown in Table 9. The Rank-1 and mAP of our DFPN on Market1501 and DukeMTMC-reID exceed those of most classical algorithms. Compared with PAN, DFPN improves all indicators on both datasets by more than 10%. However, compared with AANet, DFPN still trails by 1% and 3.76% on Market1501 and DukeMTMC-reID, respectively, which is likely related to the simplicity of our backbone: AANet's backbone has more than 152 layers, while our RVNet has only a dozen.

In addition, we emphasize that our DFPN is not pretrained on the large ImageNet [16] or COCO [41] datasets because of our limited computing resources, and pretrained models are well known to bring significant accuracy improvements. Therefore, the DFPN algorithm has further potential for future applications.

Comparative experiment of advanced algorithms

In the previous experiments, we used the controlled variable method to verify that our DFPN, without pretraining, surpasses most classical algorithms from 2019 and earlier. To make our method more convincing, we complete the pretraining of DFPN on the COCO dataset and select several advanced algorithms proposed in the past two years for comparison.

Table 12 Comparison with the advanced algorithms on Occluded-Duke dataset
Fig. 9 Some demonstrations of matching results

To fully illustrate that our DFPN still performs well under occlusion, we additionally introduce the Occluded-Duke dataset [42], by far the largest dataset for occluded person Re-ID. Occluded-Duke is derived from DukeMTMC-reID by keeping occluded images and filtering out some overlapping images. Its training set contains 15,618 images covering 702 identities, and its testing set contains 1110 identities in total, with 17,661 gallery images and 2210 query images.

First of all, Table 10 shows that the accuracy of our DFPN on the three public datasets (Market1501, DukeMTMC-reID and Occluded-Duke) improves significantly after pretraining; in particular, mAP increases by 11.94%, 14.79% and 6.74%, respectively. Moreover, on Occluded-Duke, the Rank-1 of the pretrained DFPN increases by 8.37% compared with the non-pretrained version. This set of experiments comprehensively proves the importance of pretraining. In addition, we also test DFPN after removing IRFB: as shown in the last row of Table 10, the accuracy decreases significantly, which directly proves that enlarging the receptive field is one effective starting point for alleviating the occlusion problem.

Then, on the non-occluded datasets Market1501 and DukeMTMC-reID, we select five of the most advanced algorithms to compare with our pretrained DFPN. As shown in Table 11, DFPN is very competitive in terms of mAP, surpassing CircleLoss, FastReID, CBN and the recently published DeepMiner algorithm (by 0.96% and 2.57%, respectively). For Rank-1, DFPN is very close to DeepMiner on Market1501, but about 4% lower on DukeMTMC-reID. It is undeniable that, compared with the state-of-the-art FlipReID, our DFPN still has a gap in both Rank-1 and mAP, because the network structure of FlipReID is much more complicated than ours and its backbone already exceeds 50 layers.

Finally, on the Occluded-Duke dataset, we also select the five most advanced algorithms (top 5) to compare with our pretrained DFPN. Note that few algorithms have been tested on Occluded-Duke because it is relatively new. As shown in Table 12, the Rank-1 and mAP of DFPN comprehensively surpass Ad-Occluded, SFR and DSR, the best algorithms tested on Occluded-Duke in 2018, and even exceed PGFA, the top algorithm of 2019, by 4.39% mAP. However, compared with the state-of-the-art HOReID algorithm, which uses complex graph convolutions, our DFPN still has gaps of 3.38% Rank-1 and 2.11% mAP.

To show the performance of our algorithm more intuitively, we randomly select 4 query images, retrieve images of the same pedestrian under different cameras from the gallery, and sort them by similarity from high to low (keeping the top 10 results). As Fig. 9 shows (highlights indicate correct matches), our algorithm ranks the correct pedestrians highly, demonstrating good robustness under blur, occlusion and scale changes.

Conclusion

In this paper, DFPN is proposed to solve the problems of scale variation, blur and occlusion of pedestrian images in Re-ID tasks. To make training faster and more convenient, we use the “Shortcut” and the simple design of VGGNet-19 to build the backbone RVNet, which has fewer parameters and needs no pretraining. After feature extraction, we design the DPFM, which combines high-level semantic information with low-level detail information to enhance the richness and representation ability of the feature maps. In addition, we introduce the IRFB, which effectively enlarges the receptive field of the feature map while reducing parameters. A series of experiments on public datasets proves the effectiveness of our individual modules and of the overall model. In future work, we will further improve our model and apply it to person search.