1 Introduction

Pedestrian re-identification aims to identify target pedestrians appearing across different cameras and is widely used for safety monitoring in public places such as schools, shopping malls, supermarkets, and train stations [1]. In recent years, scholars have proposed a variety of methods to solve the pedestrian re-identification problem [2,3,4,5,6]. Most of these methods utilize convolutional neural networks to extract discriminative feature representations and have achieved good recognition results on public experimental datasets. For example, image patch-based methods introduce local pedestrian features through strategies such as the attention mechanism in [7] and the multi-branch structure in [8]. Fine-grained information-based methods utilize strategies including pose estimation and key points to extract fine-grained pedestrian features and improve re-identification performance. Generative adversarial network-based methods generate pedestrian images under different viewing angles and illumination conditions, which enriches the training samples and improves the robustness of the network. However, most pedestrian re-identification methods struggle to achieve satisfactory recognition accuracy under illumination change, background clutter, pedestrian occlusion, and similar conditions.

Because its effective receptive field follows a Gaussian distribution, a convolutional neural network attends mainly to a small local region [9,10,11,12]. Under pedestrian occlusion, background clutter, or other noise, such a small receptive field easily picks up incorrect feature information; in addition, because down-sampling reduces the resolution of the feature representation, the accuracy of occluded pedestrian re-identification methods with small receptive fields tends to degrade [13,14,15,16]. Therefore, even when feature alignment or the attention mechanism in [17] is introduced, it remains difficult to fully solve the occluded pedestrian re-identification problem with convolutional neural networks alone.

Literature [18] has shown that the vision transformer can be utilized for image classification, with performance not inferior to that of traditional convolutional neural networks. The reason is that the vision transformer takes the multi-head self-attention mechanism as its core and abandons convolution and down-sampling [19]. Specifically, the vision transformer first divides the original image into a sequence of image patches, then attaches a classification token and positional encodings to these patches, and finally performs self-attention over the resulting sequence.
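For concreteness, the following is a minimal PyTorch sketch of the pipeline just described (patch splitting, classification token, positional encoding, and self-attention); the patch size, embedding width, and depth are illustrative assumptions, not the configuration used in this article.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal vision transformer sketch: patch split -> class/position
    encoding -> multi-head self-attention. All sizes are assumptions."""
    def __init__(self, img_hw=(256, 128), patch=16, dim=384, heads=6, depth=4):
        super().__init__()
        n = (img_hw[0] // patch) * (img_hw[1] // patch)       # number of patches
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)   # patch split + projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # classification token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # positional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tok = self.embed(x).flatten(2).transpose(1, 2)        # (B, N, D) patch tokens
        tok = torch.cat([self.cls.expand(x.size(0), -1, -1), tok], dim=1)
        return self.encoder(tok + self.pos)                   # self-attention over the sequence
```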

In recent years, with its ability to capture global feature information through self-attention, the vision transformer has achieved good results in the field of pedestrian re-identification [20, 21]. However, when most of a pedestrian's body is occluded or the background resembles the pedestrian's features, the vision transformer is prone to misjudgment because it is not good at capturing local features, resulting in poor robustness. This article aims to combine the advantages of the convolutional neural network and the vision transformer and construct a novel network that depicts not only the correlation of the pedestrian's global characteristics but also the pedestrian's local characteristics, so as to further improve the accuracy of occluded pedestrian re-identification.

In this article, a dual-branch hybrid structure based on the residual network and the vision transformer is proposed to deal with occluded pedestrian re-identification. Firstly, the original input image is augmented using the proposed partial patch pre-convolution module; secondly, the augmented image is input into the vision transformer branch to establish the global feature relationships of the image sequence; thirdly, the features extracted from the original input image are input into the residual-vision transformer branch, where the proposed residual-vision transformer module is utilized to extract local features; fourthly, the feature information extracted from the two branches is fused to obtain the pedestrian's discriminative features; finally, the loss is iteratively calculated to complete the network training. Representative class activation maps of the vision transformer-based method and the proposed DB-ResHViT method are given in Fig. 1, which shows that DB-ResHViT pays more attention to local features than the original vision transformer and improves the accuracy of occluded pedestrian re-identification.

Fig. 1

Class activation maps of different methods: (1) a given image, (2) the original vision transformer-based method, (3) the proposed DB-ResHViT-based method

The main contributions can be summarized as follows:

  • We propose an effective occluded pedestrian re-identification network, which extracts local feature information and establishes global feature relationships through the proposed dual-branch hybrid structure of residual network and vision transformer.

  • We propose a novel data augmentation module, which inputs randomly selected image patches into the pre-convolution module to replace those original image patches, thereby obtaining discriminative local features.

  • We design a novel module integrating the residual structure and the vision transformer, which utilizes the translation invariance of convolutional neural networks while constructing relationships among global features, so as to reduce the computational cost and improve occluded pedestrian re-identification accuracy.

  • Extensive experiments have been conducted on public occluded pedestrian re-identification datasets, and experimental results demonstrate that the proposed network significantly improves the performance of occluded pedestrian re-identification.

2 Related work

Most studies mainly utilize a pedestrian's entire body information to handle the re-identification problem, giving less consideration to re-identification under occlusion. However, it is difficult to obtain a pedestrian's entire body in real scenes, especially crowded ones. Therefore, the issue of occluded pedestrian re-identification cannot be ignored. Next, we review recent research on occluded pedestrian re-identification.

2.1 Occluded pedestrian re-identification

Existing deep learning methods for occluded pedestrian re-identification mainly utilize convolutional neural networks to extract pedestrian features. This category of method first aligns features or introduces higher-order semantic information (pose information), then utilizes a pose model to estimate the positions of key points, and finally utilizes the pose information to complete re-identification. Literature [22] proposes a robust soft-matching feature alignment method, which utilizes hierarchical joint learning to obtain local features of a pedestrian's pose and predict the similarity of different pedestrians. Literature [23] presents a local matching method based on pose information, which designs a visible local feature predictor and utilizes the attention mechanism to represent pedestrian characteristics. Literature [24] utilizes a multi-granularity network based on pose key points to extract multi-granularity features and eliminate the impact of occlusion. Although pose information is beneficial for improving re-identification performance, the key point estimation model makes the whole network somewhat bloated, thereby reducing its running speed. In recent years, some scholars have attempted to solve the occlusion issue through local feature matching. Literature [25] first utilizes an object detection network to segment each image into a sequence of image patches, then extracts multi-scale features, and finally realizes feature matching based on spatial similarity. Literature [26] proposes a dual-branch network, which improves robustness by extracting fine-grained multi-scale features.

2.2 Visual transformer

The transformer is a common deep learning model in the field of natural language processing. The multi-head attention mechanism proposed in literature [27] completely abandons network structures such as recurrent and convolutional neural networks; it was utilized for machine translation tasks and achieved good results. Google introduced the transformer into the field of image classification and proposed the well-known vision transformer model, which inputs the segmented image patch sequence into the transformer encoder, maximally preserving the original structure of the transformer and achieving satisfactory results. Similar to convolutional neural networks, the vision transformer requires large datasets to train its parameters. Therefore, literature [28] proposes an efficient image data transformation framework based on the attention mechanism and optimizes pedestrian re-identification using the teacher-student strategy. Literature [29] designs a vision transformer-based pedestrian re-identification model, which utilizes a mosaic module to classify the features of the last layer and calculates the losses separately, further enhancing the robustness of the model. However, vision transformer-based pedestrian re-identification models still focus on characterizing global features, while issues such as local occlusion features and short-distance correlation have not been well addressed.

Compared with TransReID in [29], which only uses ViT, our method combines ViT with convolution to design a novel model that reduces the number of parameters and establishes connections between local features. In addition, we enhance the local feature extraction capability by introducing a residual network branch and improve identification accuracy through parallel dual branches.

3 Proposed method

This article proposes a dual-branch hybrid occluded pedestrian re-identification network integrating the residual network and the vision transformer, as shown in Fig. 2. The proposed network includes two branches: a vision transformer branch based on the partial image patch pre-convolution module, and a residual-vision transformer branch based on the residual-vision transformer module. The vision transformer branch is utilized to establish global feature correlations, and the residual-vision transformer branch is utilized to extract translation-invariant features. The fused feature contains both translation-invariant local features and spatially correlated global features, improving the distinguishability of the extracted features and enhancing their generalization ability. Next, we introduce the proposed method in detail.
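To make the data flow concrete, the following is a high-level PyTorch sketch of the dual-branch forward pass, treating both branches as black boxes that return (B, D) feature vectors; the fusion by concatenation, the feature dimension, and the identity classification head are illustrative assumptions rather than the exact design.

```python
import torch
import torch.nn as nn

class DBResHViT(nn.Module):
    """Sketch of the dual-branch hybrid network. The PPPC module is assumed
    to sit at the front of the vision transformer branch (branch one)."""
    def __init__(self, vit_branch: nn.Module, res_vit_branch: nn.Module,
                 feat_dim: int = 384, num_ids: int = 751):
        super().__init__()
        self.vit_branch = vit_branch          # branch one: global feature relations
        self.res_vit_branch = res_vit_branch  # branch two: translation-invariant local features
        self.classifier = nn.Linear(2 * feat_dim, num_ids)  # hypothetical ID head

    def forward(self, x: torch.Tensor):
        g = self.vit_branch(x)                # (B, D) global feature
        l = self.res_vit_branch(x)            # (B, D) local feature
        fused = torch.cat([g, l], dim=1)      # fused discriminative feature
        return fused, self.classifier(fused)  # features for ranking, logits for the loss
```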

Fig. 2

Framework of the proposed dual-branch hybrid network for occluded pedestrian re-identification, where branch one is the vision transformer branch and branch two is the residual-vision transformer branch. The proposed partial image patch pre-convolution (PPPC) module and convolution-batch normalization-residual (CBNR) module are detailed in Sect. 3

3.1 Visual transformer branch

The vision transformer branch utilizes the partial image patch pre-convolution module to extract shallow local features, which helps improve the vision transformer's ability to characterize local features. Next, we introduce the image augmentation module designed in this article, namely, the partial image patch pre-convolution module.

Figure 3 gives the framework of the partial image patch pre-convolution (PPPC) module designed in this article. Assume an input image x of size H × W × C, where H, W, and C represent the height, width, and number of channels of x, respectively. Due to low resolution, illumination change, mutual occlusion, background clutter, and inconsistent feature distributions, deep neural networks often fail to complete the occluded pedestrian re-identification task well, and the accuracy is unsatisfactory. Therefore, it is necessary to perform image augmentation before training.

Fig. 3

Framework of the proposed partial image patch pre-convolution module, where patch segmentation represents segmenting an image into multiple image patches and random select represents selecting a portion of image patches for pre-convolution

The partial image patch pre-convolution module is utilized to implement image augmentation. Firstly, an input image is segmented into several image patches; secondly, some image patches are randomly selected and input into a convolutional neural network to extract shallow local features; thirdly, these extracted local features are input into the vision transformer branch. The proposed PPPC module thus introduces local features into the global features extracted by the vision transformer branch.

During random patch selection, all image patches are indexed, and a randomly selected subset is pre-convolved so that the network learns the discriminative feature distribution of occluded pedestrians. The randomness of the selection helps guarantee robustness.

In this article, the partial image patch pre-convolution module is added to the vision transformer branch, and the corresponding pseudocode is shown in Table 1. Firstly, the input image is segmented into several image patches; secondly, a random percentage is set to select the corresponding proportion of segmented patches for pre-convolution; then, a convolutional neural network is selected as the backbone to complete the pre-convolution task, with ResNet50 chosen here through comparative experiments; finally, the convolutional features of the randomly selected patches are spliced with the features of the remaining original patches in the extraction order, which ensures that the original feature distribution of the selected patches is not disturbed.

Table 1 Pseudocode of Partial Image Patch Pre-Convolution

The algorithm of the partial image patch pre-convolution module is described as follows, and a minimal PyTorch sketch is given after the steps:

Step 1: Obtain the length of the initial image patch sequence.

Step 2: Utilize the random library to select length × Percent image patches, where Select denotes the index list of the selected patches.

Step 3: Utilize the index list Select to obtain the patch sequence SelectPatch for subsequent feature extraction.

Step 4: Remove SelectPatch from the original patch sequence OriginalPatch to obtain the remaining patch sequence RestPatch.

Step 5: Input SelectPatch into the ResNet50 pre-convolution network to obtain the pre-convolved patch feature sequence PreConvPatch.

Step 6: Splice PreConvPatch with RestPatch in the original order to obtain the partial image patch pre-convolution feature sequence OutputPatch.

Step 7: End.
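The steps above can be condensed into a short PyTorch sketch, where pre_conv stands in for the ResNet50-based pre-convolution network and is assumed to map a (B, k, D) patch-embedding sequence to a sequence of the same shape; names mirror the pseudocode, and the 50% selection ratio is an example value.

```python
import random
import torch
import torch.nn as nn

class PPPC(nn.Module):
    """Partial image patch pre-convolution (sketch)."""
    def __init__(self, pre_conv: nn.Module, percent: float = 0.5):
        super().__init__()
        self.pre_conv = pre_conv  # stand-in for the ResNet50 pre-convolution network
        self.percent = percent    # fraction of patches sent through pre-convolution

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) embedded image patch sequence (OriginalPatch)
        n = patches.size(1)                                              # Step 1: sequence length
        select = sorted(random.sample(range(n), int(n * self.percent)))  # Step 2: Select
        select_patch = patches[:, select, :]                             # Step 3: SelectPatch
        out = patches.clone()                                            # Step 4: RestPatch stays in place
        out[:, select, :] = self.pre_conv(select_patch)                  # Step 5: PreConvPatch
        return out                                                       # Step 6: OutputPatch, order preserved
```

Writing the convolved features back at their original indices implements the splice in Step 6 while preserving the extraction order, which is exactly the consistency property discussed below.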

Because the selection order of the partial image patches is consistent with the splice order of the corresponding patch features, the feature distribution of the original input image does not change. In addition, owing to the strong representation ability and low parameter count of the ResNet50 network, the convolutional features of the randomly selected patches remain highly discriminative. In other words, introducing local features into the global features extracted by the vision transformer branch helps improve the robustness of the proposed model for occluded pedestrian re-identification.

3.2 Residual-visual transformer branch

Residual networks are often utilized for occluded pedestrian re-identification because they effectively alleviate the vanishing gradient problem. However, residual networks suffer from two problems: (1) as the number of layers increases, many redundant parameters remain in a deep residual network; (2) compared with other convolutional neural networks, residual networks enlarge the receptive field to a certain extent, but the enlarged receptive field is still confined to a local region, which limits the performance of occluded pedestrian re-identification based on local receptive fields.

Recent research has shown that the vision transformer also performs excellently in computer vision. On small- and medium-scale datasets, the vision transformer is inferior to the residual network, because the residual network has built-in inductive biases such as locality and translation invariance. Conversely, on large-scale datasets, the vision transformer performs much better than the residual network, since its advantages are fully demonstrated only when trained on large-scale data. Compared with the residual network, the vision transformer also has disadvantages, such as a large number of model parameters, slow computation, and weak local feature extraction. Therefore, this article combines the advantages of the two models to design a residual-vision transformer branch, as shown in Fig. 4. Such a branch design is conducive both to extracting discriminative local features with the residual network and to constructing correlations between local features with the vision transformer.

Fig. 4

Framework of the proposed residual-vision transformer branch, which consists of convolution, batch normalization, nonlinearization, max pooling, convolution-batch normalization-residual (CBNR), and linearization

The forward process of the residual-vision transformer branch is described as follows: firstly, features of an occluded pedestrian are extracted through convolution, and a preliminary feature is obtained through batch normalization, the ReLU activation function, and max pooling; secondly, the preliminary feature is input into the convolution-batch normalization-residual (CBNR) module shown in Fig. 5 to extract a deep feature; thirdly, the deep feature is input into the mobile vision transformer module shown in Fig. 5 to extract a discriminative feature; fourthly, the discriminative feature is input into a linear layer to adjust the number of feature channels, and a short-cut structure is constructed with the branch's input feature; finally, the output of the branch is the result of the ReLU activation.
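A minimal PyTorch sketch of this process follows, assuming the CBNR and mobile vision transformer modules (detailed in Sect. 3.3) preserve the spatial size and channel count so that the short-cut addition is valid; the channel width and stem configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResViTBranch(nn.Module):
    """Sketch of the residual-vision transformer branch."""
    def __init__(self, cbnr: nn.Module, mvit: nn.Module, dim: int = 256):
        super().__init__()
        self.stem = nn.Sequential(            # convolution -> BN -> ReLU -> max pooling
            nn.Conv2d(3, dim, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.cbnr = cbnr                      # CBNR module (stacked three times in this article)
        self.mvit = mvit                      # mobile vision transformer module
        self.proj = nn.Conv2d(dim, dim, 1)    # linear layer adjusting the channel count
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stem(x)                   # preliminary feature
        deep = self.cbnr(feat)                # deep feature
        disc = self.mvit(deep)                # discriminative feature
        return self.act(self.proj(disc) + feat)   # short-cut, then ReLU
```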

Fig. 5

Framework of the proposed residual-vision transformer module, which consists of convolution, batch normalization, residual, attention, linearization, and nonlinearization

The residual-vision transformer branch leverages the advantages of both models: (1) the vision transformer part not only represents global features but also reduces the computational cost of the network; (2) the residual network part not only effectively prevents over-fitting but also adds the spatial inductive bias missing in the vision transformer, which optimizes the network structure and improves accuracy.

Next, we introduce the proposed residual-vision transformer module in detail.

3.3 Residual-vision transformer module

The design idea of the convolution-batch normalization-residual (CBNR) module is described below. As shown in the dotted box on the left of Fig. 5, the CBNR module utilizes the short-cut structure of residual networks to extract shallow features. Such a module can deepen the number of feature channels and mitigate vanishing feature gradients. In principle, the CBNR module can be stacked arbitrarily many times; experimental results demonstrate that optimal performance is achieved when it is stacked three times.
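A minimal sketch of one CBNR block follows, assuming a standard two-convolution residual layout with a 1 × 1 projection on the shortcut when the channel count deepens; the kernel sizes are illustrative assumptions.

```python
import torch.nn as nn

class CBNRBlock(nn.Module):
    """Sketch of one convolution-batch normalization-residual (CBNR) block."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection keeps the short-cut valid when channels deepen
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                         if in_ch != out_ch else nn.Identity())
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))
```

Stacking is then simply `nn.Sequential(*[CBNRBlock(c, c) for _ in range(3)])` for the three-repeat configuration found optimal in the experiments.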

Due to the large number of model parameters and the high complexity of floating-point calculations in the vision transformer, a mobile vision transformer (MViT) module is utilized here to reduce network complexity and improve computational speed. The reasons are as follows: (1) the mobile vision transformer requires fewer parameters to model the local and global features of the input tensor; (2) the inductive bias generated during convolution is introduced into the vision transformer to improve the stability and robustness of the model.

As shown in the dotted box on the right of Fig. 5, the mobile vision transformer module first transforms the input feature tensor into a feature sequence through convolution and inputs this sequence into the vision transformer, corresponding to the structure from Conv to Dropout; then, the self-attention mechanism is applied to the feature sequence, corresponding to the structure from Linear to Dropout; finally, the feature sequence is transformed back to the original dimension through convolution, corresponding to the structure from Conv to Conv. In principle, the mobile vision transformer module can be stacked arbitrarily many times; experimental results demonstrate that optimal performance is achieved when it is stacked three times.
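The fold-attend-unfold pattern described above can be sketched with PyTorch's built-in transformer encoder as follows; the depth, head count, and feed-forward width are illustrative assumptions, not the exact MViT configuration.

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Sketch of the mobile vision transformer (MViT) module: convolve,
    fold the map into a token sequence, self-attend, unfold, convolve back."""
    def __init__(self, dim: int = 256, heads: int = 4, depth: int = 2):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1)   # feature sequence via convolution
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim,
            dropout=0.1, batch_first=True)               # self-attention (Linear ... Dropout)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.fuse = nn.Conv2d(dim, dim, 1)               # back to the original dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y = self.local(x)
        tokens = y.flatten(2).transpose(1, 2)            # (B, H*W, C) feature sequence
        tokens = self.encoder(tokens)                    # global self-attention
        y = tokens.transpose(1, 2).reshape(b, c, h, w)   # unfold back to a feature map
        return self.fuse(y)
```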

4 Experiments

A large number of experiments are conducted on the proposed dual-branch hybrid network combining the residual network and the vision transformer, both to verify its effectiveness in solving the occluded pedestrian re-identification problem and to verify whether it fully leverages the advantages of the two models to outperform the residual network.

4.1 Experimental datasets

The performance of the dual-branch hybrid network proposed in this article is evaluated on six public datasets, including the Occluded-REID dataset in literature [30], the Occluded-Duke dataset in literature [31], the Market-1501 dataset in literature [32], the DukeMTMC-REID dataset in literature [33], the Partial-REID dataset in literature [34], and the Partial-iLIDS dataset in literature [35].

Occluded-REID dataset: This dataset contains 2000 images from 200 pedestrians, including 5 full-body images and 5 occluded images for each pedestrian.

Occluded-Duke dataset: This is the largest occluded pedestrian re-identification dataset so far, containing 35,489 images of 1110 pedestrians, including 15,618 training images, 2210 query images, and 17,661 gallery images.

Market-1501 dataset: This dataset contains 32,668 images of 1501 pedestrians, including 12,936 training images and 19,732 gallery images.

DukeMTMC-REID dataset: This dataset contains 36,411 images of 1812 pedestrians, including 16,522 randomly selected training images, 2228 query images, and 17,661 gallery images.

Partial-REID dataset: This is the first partial pedestrian re-identification dataset, containing 900 images of 60 pedestrians, with 5 full-body images, 5 partial images, and 5 occluded images per pedestrian.

Partial-iLIDS dataset: This dataset is a pedestrian re-identification dataset based on iLIDS, which contains 476 images from 119 pedestrians, including an average of 4 images for each pedestrian.

4.2 Experimental setting

Backbone network: This article utilizes a self-designed backbone with a vision transformer branch and a residual-vision transformer branch, where the vision transformer branch is composed of the data augmentation (PPPC) module and the vision transformer module, and the residual-vision transformer branch is composed of the residual module and the residual-vision transformer module.

Training details: PyTorch is utilized to build the network. Firstly, each input image is resized to 256 × 128 and augmented through random horizontal flipping, padding, random cropping, and random erasing; secondly, the batch size is set to 96 and the momentum of the Adam optimizer to 0.9; thirdly, the weight decay is set to 1e-4 and the initial learning rate to 0.01; finally, Rank-1, Rank-5, and Rank-10 accuracy and the mean average precision (mAP) are utilized to evaluate performance, and all experiments are performed under the single-query setting.
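Under these settings, the augmentation pipeline and optimizer can be sketched as follows; the padding size and erasing probability are assumptions not specified above, and Adam's "momentum" of 0.9 is interpreted as its first-moment coefficient beta1.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((256, 128)),              # uniform input size
    T.RandomHorizontalFlip(p=0.5),     # random horizontal flipping
    T.Pad(10),                         # padding (assumed size)
    T.RandomCrop((256, 128)),          # random cropping
    T.ToTensor(),
    T.RandomErasing(p=0.5),            # random erasing (assumed probability)
])

model = nn.Sequential(nn.Linear(8, 8))  # placeholder; stands in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4,
                             betas=(0.9, 0.999))  # beta1 = 0.9 is the "momentum" above
```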

4.3 Comparison methods

Four categories of pedestrian re-identification methods are selected for comparison: common pedestrian re-identification methods (Part Aligned [36], PCB [37], Adver Occluded [38]); methods based on external or semantic information (PGFA [31], PVPM [23], HONet [24], PAFM [25], Part Bilinear [39], FD-GAN [40]); local feature-based methods (MFM + HFR [26], FGMFN [27], DSR [35], SFR [41], MoS [42], FPR [43], VPM [44], MGCAM [45]); and occluded pedestrian re-identification methods (TransReID [29], MAT [46], PAT [47], DRL-Net [51], PFD [53], FRT [54]).

4.4 Experimental results

  • Performance Verification on Occluded-Duke Dataset

According to the experimental results on the Occluded-Duke dataset shown in Table 2, compared with the pose-based HONet method, the proposed DB-ResHViT method improves Rank-1 by 15.8 percent and mAP by 18.3 percent; at the same time, compared with other occluded pedestrian re-identification methods, the retrieval accuracy of DB-ResHViT is also greatly improved. These results demonstrate that the combination of the vision transformer and the residual network is more conducive to solving the occluded pedestrian re-identification problem.

Table 2 Experimental results of different methods on Occluded-Duke dataset

As the first method to introduce the vision transformer into pedestrian re-identification, TransReID refreshed the best performance on many pedestrian re-identification benchmarks, including occluded pedestrian re-identification. According to Table 2, compared with the 66.4 percent Rank-1 and 59.2 percent mAP of TransReID, the Rank-1 and mAP of the proposed DB-ResHViT method increase by 4.5 percent and 2.9 percent, respectively. This shows that the performance of the vision transformer is further improved after incorporating the local receptive field of convolution.

  • Performance Verification on Occluded-REID and Partial-REID Datasets

Here, the Market-1501 dataset is selected for pre-training on the Partial-REID dataset, and the Occluded-Duke dataset is selected for pre-training on the Occluded-REID dataset. This is because the Occluded-REID dataset can be considered an occlusion dataset, so pre-training on an occlusion dataset more easily achieves satisfactory results. The experimental results in Table 3 verify this hypothesis.

Table 3 Experimental results of different methods on Occluded-REID dataset and Partial-REID dataset

It can be seen from Table 3 that the proposed DB-ResHViT method achieves the best Rank-1 and mAP on both the Occluded-REID and Partial-REID datasets. Specifically, (1) compared with the 80.3 percent Rank-1 of the convolutional neural network-based HONet method on Occluded-REID, DB-ResHViT achieves 84.8 percent; (2) compared with the 81.6 percent Rank-1 and 72.1 percent mAP of the vision transformer-based PAT method on Occluded-REID, DB-ResHViT improves Rank-1 by 3.2 percent and mAP by 7.6 percent; (3) compared with the 85.3 percent Rank-1 of HONet on Partial-REID, DB-ResHViT achieves 88.5 percent; and (4) compared with the 88.0 percent Rank-1 of PAT on Partial-REID, DB-ResHViT improves Rank-1 by 0.5 percent.

According to the results in Tables 2 and 3, the proposed DB-ResHViT method achieves good performance and high robustness on occluded pedestrian re-identification datasets. Next, its performance is verified on non-occluded pedestrian re-identification datasets.

  • Performance Verification on Market-1501 and DukeMTMC Datasets

It can be seen from Table 4 that, on the Market-1501 dataset, the ISP method achieves the best performance among convolutional neural network-based methods, and the PFD method achieves the best among vision transformer-based methods. Specifically, (1) the Rank-1 and mAP of ISP reach 95.3% and 88.6%, and those of PFD reach 95.5% and 89.7%, respectively, demonstrating that the recognition performance of convolutional neural networks and vision transformers is very close; (2) the Rank-1 and mAP of the proposed DB-ResHViT method reach 95.7% and 89.8%, respectively, surpassing both ISP and PFD; (3) these results validate our original assumption that an effective combination of the convolutional neural network and the vision transformer can exceed the performance of either model alone.

Table 4 Experimental results of different methods on Market-1501 dataset and DukeMTMC dataset

  • Performance Verification on Partial-iLIDS Dataset

Because the Partial-iLIDS dataset contains few images, the Occluded-Duke dataset is selected as the training set and the Partial-iLIDS dataset as the testing set. According to Table 5, the Rank-1 of the proposed DB-ResHViT method is 75.2%, which is 2.6% and 3.8% higher than the HOReID and TransReID methods, respectively.

Table 5 Experimental results of different methods on Partial-iLIDS dataset

Since the feature distribution of the Partial-iLIDS dataset is relatively inconsistent, the results in Table 5 also prove that the proposed DB-ResHViT can handle such small datasets with inconsistent feature distributions well, thanks to its data augmentation and robust dual-branch network structure.

4.5 Ablations and analysis

This section studies the dual-branch structure of the proposed DB-ResHViT method and the effectiveness of each proposed module. Starting from the vision transformer branch, ablation experiments on each module and on the dual-branch structure are performed in turn. Table 6 shows the ablation results on the Occluded-Duke dataset, which verify the effectiveness of each module and of the dual-branch structure for occluded pedestrian re-identification.

Table 6 Ablation experimental results on the Occluded-Duke dataset, where ViT represents the vision transformer branch network, PPPC the partial image patch pre-convolution module, ResDB the dual-branch network containing the residual structure, and RViTDB the dual-branch network containing the residual-vision transformer module

Effectiveness of the partial image patch pre-convolution module: According to the results of index 2 in Table 6, the PPPC module improves Rank-1 by 2.3% over the plain vision transformer. This verifies the two purposes of the module's design: (1) it achieves genuine image augmentation and effectively improves occluded pedestrian re-identification accuracy; (2) it extracts shallow local features, improving feature representation and effectively compensating for the vision transformer's weakness in extracting local features.

Effectiveness of the dual-branch residual-vision transformer structure: According to the comparison of index 2 and index 3 in Table 6, after adding the dual-branch structure, Rank-1 and mAP improve by 2.0% and 1.7%, respectively, a clear gain. This verifies the two purposes of the dual-branch design: (1) it effectively establishes correlations among global feature sequences while extracting highly discriminative shallow local features; (2) it maintains the consistency of the feature distribution, making training more convergent and accurate.

Effectiveness of the residual-vision transformer module: According to the comparison of index 3 and index 4 in Table 6, after introducing the residual-vision transformer module into the residual branch of the dual-branch network, Rank-1 and mAP reach 69.1% and 60.5%, increasing by 1.2% and 2.0% over the plain dual-branch structure. This verifies the two purposes of the module's design: (1) it adjusts the convolutional feature distribution to be as consistent as possible with the vision transformer feature distribution before fusion; (2) it is a lightweight module that approaches the performance of the vision transformer branch with only a few stacked layers.

4.6 Computational complexity

Table 7 reports the computational complexity of the proposed network and compares it with existing methods.

Table 7 Computational complexity of different methods

5 Conclusion

In this article, an occluded pedestrian re-identification method based on a dual-branch hybrid framework is proposed, comprising the dual-branch hybrid framework integrating the residual network and the vision transformer, the partial image patch pre-convolution augmentation method, and the residual-vision transformer module. The dual-branch hybrid framework takes the vision transformer and the residual network as two parallel branches, which is conducive to extracting more robust features by integrating the correlation of local features with the distinguishability of global features. The partial image patch pre-convolution augmentation method introduces local feature information through the convolution of partial image patches, thereby realizing image augmentation. The residual-vision transformer module integrates global and local features, thereby establishing the global correlation of local features. Experimental results demonstrate that the proposed DB-ResHViT method achieves good results in occluded pedestrian re-identification.