1 Introduction

Vision-based biometric technology has made significant advancements in the computer vision community. Popular techniques include fingerprint recognition [1], vein biometrics [2], face identification [3], iris biometrics [4], and gait recognition [5]. Among these, gait recognition is a relatively new method that aims to identify individuals from a distance without any physical contact. This contactless and long-distance recognition approach has many advantages, such as the lack of need for cooperation, difficulty in camouflage, and strong adaptability to different environments. As such, it has great potential for use in medical motion analysis [6], security monitoring, criminal investigations, and other monitoring systems in the future. However, there are still many challenges to be addressed before gait recognition can be fully integrated into real-world applications.

Current gait recognition methods primarily focus on extracting features from the gait silhouette sequence, which can lead to a lack of local information in pedestrian contour segmentation, such as missing legs or feet in certain frames of a video. Additionally, clothing and accessories worn by pedestrians, such as coats and backpacks, can also negatively impact recognition performance. These additional factors not only obscure the pedestrian's walking posture but also add irrelevant information, which can greatly hinder subsequent learning, particularly in cross-condition recognition [7].

Fig. 1

Comparison of different gait representations. The first row is the original video frames, the second row is the silhouette images, and the third row is the pose sequence

In order to deal with the issues of occlusion, clothing, and accessories, some researchers have proposed using human pose estimation networks to generate skeleton sequences for extracting gait features. While methods based on human pose estimation can be robust, they often fail to capture important visual information such as the details of the human body, resulting in poor recognition performance.

To address the issues previously mentioned, we propose an end-to-end gait recognition method based on 3D human body reconstruction. Our method generates a new gait contour sequence using a 3D human body reconstruction method. Usually, 3D view gait descriptor-based techniques [8] require a complex and costly setup of multiple calibrated cameras, limiting their use to controlled environments. However, our proposed method overcomes this limitation by allowing for 3D reconstruction directly from original video frames, eliminating the need for costly camera setups, and expanding applicability to a wider range of environments. In comparison with silhouette sequences, the 3D reconstruction method does not include any redundant information other than the body, which means that previous problems such as clothing and accessories will not affect the analysis. Additionally, the 3D reconstruction allows for the extraction of more informative features that can effectively reflect the pedestrian's gait. A visual comparison of different gait representations is shown in Fig. 1, where it can be observed that the gait information extracted from 3D human body reconstruction is clearer and more complete than that from silhouette sequences. Furthermore, to fully utilize the global and local spatial information of pedestrians, we propose a multi-granular feature fusion module which models temporal–spatial dependencies at multiple levels to achieve better representation ability. Our contributions can be summarized as follows:

  • Our proposed method leverages the power of 3D human body reconstruction to overcome the challenges posed by changes in pedestrian appearance and attire, such as coat wearing and bag carrying. Our approach generates a new gait contour sequence that contains only information about the pedestrian's body, eliminating the need to consider irrelevant or redundant information. Unlike traditional methods that require a setup of multiple calibrated cameras or preprocessing of video streams, our model can be applied directly to the original video frames. This greatly simplifies the gait recognition process and enhances its robustness and efficiency.

  • To address the issue of underutilizing spatial features in gait recognition methods, we introduce a multi-granular feature fusion module that effectively captures the temporal–spatial information representation of pedestrians from both global and local perspectives. This allows for a more comprehensive understanding of the gait characteristics and helps in enhancing recognition performance.

2 Related works

2.1 Model-based approaches

2.1.1 Traditional gait recognition

The traditional gait recognition techniques mainly focus on utilizing information about the human body structure and the motion patterns of various body parts to identify gait characteristics. This information is then used to generate gait features for recognition purposes. For instance, Lee and Grimson [9] divided the pedestrian gait silhouette into 7 regions, fitted each region with an ellipse, and used the ellipse parameters as the gait feature representation. Cunado et al. [10] observed that leg motion follows simple harmonic motion and modeled this regularity for gait recognition. To analyze gait motion, Yoo et al. [11] used a 2D stick figure to represent the human body model and obtained the angle signals of various body parts through linear regression analysis. Yam et al. [12] used a pendulum model to guide the motion extraction process. Urtasun et al. [13] extended the method of Cunado et al. [10] to 3D space and proposed a 3D human motion model based on principal component analysis (PCA) to overcome the influence of occlusion and changes in motion direction. Dockstader et al. [14] proposed a hierarchical structure model that used a group of dots and lines to represent the human body and a periodic swing model to describe the gait pattern. Most of these traditional methods rely on specific environments and devices, such as fully controllable multi-camera collaborative environments, making them difficult to apply in practice. In contrast, our approach relies only on common cameras, greatly relaxing the constraints on recognition scenes.

2.1.2 Method based on RGB video frame

The methods for gait recognition based on RGB images can be separated into two categories: human pose estimation and 3D reconstruction. These techniques have garnered much attention in recent years and offer valuable insight into the field of gait recognition. By using human pose estimation instead of silhouette extraction, gait recognition methods based on human pose sequences represent a departure from traditional methods. Liao et al. [15] were the first to propose a gait recognition method based on human pose sequences, PTSN. It used an open-source pose estimation algorithm to extract human posture information from the original video sequence. After obtaining the standardized gait pose sequence, it used a pose-based temporal–spatial network to learn the gait feature representation. Inspired by the success of GCNs in skeleton-based action recognition, Teepe et al. [16] combined skeleton poses with a graph convolution network (GCN) [17] to obtain a modern model-based gait recognition method. Gait recognition methods based on pose estimation ignore the information of human body shape, which reduces the accuracy of gait recognition. To compensate for the lack of body shape in human pose-based gait recognition methods, some researchers have started trying to replace human pose sequences with 3D human reconstruction. Li et al. [18] extracted pose and shape features by fitting the SMPL model and subsequently fed the pose and shape features to a recognition network. Several of the above methods do not take multiple perspectives into account, so Khan et al. [19] proposed a view-invariant gait representation for cross-view gait recognition using the temporal–spatial motion characteristics of walking conditions.

2.2 Appearance-based approaches

2.2.1 Gait recognition based on template

The process of constructing gait templates involves subtracting the background and creating a human contour through a weighted average of each frame. These templates come in various forms, including the Gait Energy Image (GEI) [20], Gait Entropy Image (GEnI) [21], Gait Flow Image (GFI) [22], and Chrono-Gait Image (CGI) [23]. Currently, GEI is considered the simplest and most efficient among these gait template types. Template-based gait recognition methods fall into two categories. The first is to extract gait features for discrimination using traditional metric learning methods (e.g., linear discriminant analysis [20], tensor representation discriminant analysis [24], random subspace [25], combined intensity and spatial metric learning [26]) or deep neural networks [27,28,29,30]. The second is to transform gait representations under different covariate conditions into a common condition using subspace analysis methods [32,33,34,35,36] or generative adversarial networks (GANs) [38, 39]. The template-based gait recognition method takes a single image after weighted averaging as input and does not make full use of the temporal information of the video, while our method takes video frames as input and learns short-range temporal–spatial features through the motion capture module.
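
To make the template construction concrete, the following is a minimal sketch of how a GEI could be computed from aligned binary silhouettes; the array name and the assumption that the silhouettes are already cropped and aligned over one gait cycle are illustrative rather than taken from the cited works.

```python
import numpy as np

def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """Average aligned binary silhouettes (T, H, W) over a gait cycle to obtain a GEI."""
    return silhouettes.astype(np.float32).mean(axis=0)  # (H, W), pixel values in [0, 1]
```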

2.2.2 Method based on gait silhouette sequence

Methods based on the gait silhouette sequence use the silhouette sequence directly as input. They are divided into three categories based on the way temporal information is extracted: 3DCNN based [40, 41], LSTM based [42], and set based [43, 44]. The 3DCNN-based methods directly extract the temporal–spatial features of gait sequences through a 3D convolutional network, but these methods usually have more parameters and are difficult to train. Zhang et al. [42] proposed a new auto-encoder framework to extract gait-related features from the original RGB video and used a three-layer LSTM to model the temporal changes of the gait sequence. However, the LSTM-based method is considered to retain the unnecessary constraints of periodic gait. To avoid this problem, GaitSet [43] assumed that the appearance of the silhouette contains its position information and proposed treating the gait as a set, extracting temporal–spatial features by temporal pooling, which is simple and effective. Building on GaitSet, GaitPart [44] designed a temporal–spatial model for each part of the human body, making full use of the part-level features of pedestrians. The silhouette-based gait recognition method uses silhouette sequences as input. The silhouette sequences not only lose local body information in the process of generation but also contain redundant information such as coats and backpacks, which has a negative impact on gait recognition, while the 3D reconstructed sequences can effectively remove this redundant information.

3 Proposed method

3.1 Overall framework

In this study, we propose an approach for gait recognition in which the original video frames of a pedestrian are taken as input and the length of the gait sequence is 30. New gait contour sequences are generated using Human Mesh Recovery (HMR, 3D human body reconstruction) [45]. The frame-level part feature extractor (FPFE) [44] is then used to extract pedestrian gait features from the gait contour sequences. The multi-granular feature fusion (MGFF) module is employed to model the temporal–spatial representations of pedestrians at multiple granularities based on the generated spatial convolution features. Subsequently, a fully connected layer (FC) is utilized to produce column vectors for identifying instances. Finally, the entire network is trained using the triplet loss function. The overall framework of our approach is depicted in Fig. 2.

Fig. 2

The framework of our method. s, R, t, \(\beta\) and \(\theta\), respectively, represent the camera scaling, rotation, and translation parameters, the shape parameters, and the pose parameters. SMPL is a parametric 3D model of the human body. Block1, Block2, and Block3 are convolutional blocks of FPFE. HP indicates horizontal pooling. MCM is the motion capture module. a, b, and c represent the weight of each granularity
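
As a reading aid for Fig. 2, the sketch below shows how the stages could be composed in PyTorch-style pseudocode; this is a schematic under our own naming (hmr, fpfe, mgff, and fc are placeholders for the corresponding modules), not the released implementation.

```python
import torch.nn as nn

class GaitPipeline(nn.Module):
    """Schematic composition of the framework in Fig. 2 (module names are placeholders)."""
    def __init__(self, hmr: nn.Module, fpfe: nn.Module, mgff: nn.Module, fc: nn.Module):
        super().__init__()
        self.hmr = hmr    # 3D human body reconstruction -> gait contour sequence
        self.fpfe = fpfe  # frame-level part feature extractor (Blocks 1-3 + HP)
        self.mgff = mgff  # multi-granular feature fusion (three MCM branches)
        self.fc = fc      # fully connected layer producing identity embeddings

    def forward(self, frames):          # frames: (N, S, 3, H, W) RGB clip, S = 30
        contours = self.hmr(frames)     # rendered gait contour sequence
        feats = self.fpfe(contours)     # frame-level part features
        fused = self.mgff(feats)        # multi-granular temporal-spatial features
        return self.fc(fused)           # embeddings optimized with the triplet loss
```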

3.2 3D human body reconstruction

Traditional gait recognition methods can be challenged by variations in the input video, such as changes in clothing or carried objects, which are commonly encountered in real-world scenarios. To tackle this issue, we adopt a 3D human body reconstruction approach to generate a compact and discriminative gait representation, instead of relying on the commonly used silhouette features, which may contain redundant information. 3D human body reconstruction is capable of generating 3D human mesh sequences that incorporate parametric pose and shape features. These sequences are advantageous for gait recognition in cross-state scenes since, unlike silhouette gait sequences, they do not include redundant information other than the human body, such as clothing and accessories. Compared with simple human pose sequences, 3D human reconstruction produces more refined results that contain both body shape and pose information, resulting in better discrimination for gait recognition. Therefore, 3D human body reconstruction is an effective approach for gait recognition tasks. Our method uses the Human Mesh Recovery module to reconstruct the mesh of the human body from a single RGB image. HMR is based on the principles of generative adversarial networks and consists of an encoder and a discriminator. The i-th image is fed through the encoder, whose backbone network is a ResNet-50, to extract image features. Then, a parametric regression (iterative 3D regression network) is performed on the features to learn an 85-dimensional vector \(\Theta _i=\{s,R,t,\beta ,\theta \}\) that includes the camera parameters, namely scaling, rotation, and translation, as well as the shape and pose parameters of the individual. The shape parameter \(\beta\) describes the height, weight, and body proportions, while the pose parameter \(\theta\) describes the joint rotations. The learned parameters \(\hat{\theta _i}\) and \(\hat{\beta _i}\) are then input into the SMPL [46] model, which yields the 3D joint coordinates of the model. The 3D joints are then projected onto the image plane using the camera parameters to obtain a predicted 2D image. The SMPL model refers to the Skinned Multi-Person Linear model, a parameterization of the human body.

With the help of the 3D human body reconstruction module (i.e., HMR), we generate a new 5-dimensional gait representation tensor of size \(N \times S \times C \times H \times W\), where N represents the batch size, S stands for the number of frames, C is the number of channels, and \(H \times W\) indicates the resolution of the generated gait feature maps.
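
For illustration, the split of the 85-dimensional regression output into the parameter groups named above can be sketched as follows; the layout shown (3 weak-perspective camera parameters, 72 axis-angle pose parameters including the global rotation, 10 shape parameters) follows the common HMR convention and should be treated as an assumption rather than a statement of our exact implementation.

```python
import torch

def split_hmr_params(theta_vec: torch.Tensor):
    """Split a batch of HMR outputs theta_vec of shape (N*S, 85) into camera, pose, and shape."""
    cam = theta_vec[:, :3]       # weak-perspective camera: scale s and 2D translation t
    pose = theta_vec[:, 3:75]    # 72 pose parameters theta (24 joints x 3, incl. global rotation R)
    shape = theta_vec[:, 75:]    # 10 shape parameters beta
    return cam, pose, shape      # these feed the SMPL model and the image-plane projection
```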

3.3 Frame-level part feature extractor

With the aim of enhancing the learning of fine-grained features of frames, we employ the frame-level part feature extractor to extract the local spatial features of each frame. FPFE consists of three blocks, and each block is composed of two focal convolution layers (FConv) that divide the previous feature maps horizontally into n predefined parts, followed by regular convolution operations on each part. After three blocks, the output feature maps are concatenated. The detailed network structure is shown in Table 1.

Table 1 The structure of the frame-level part feature extractor. In-C, Out-C, Kernel, Pad, and n are the input channels, output channels, kernel size, padding, and the number of predefined parts in FConv, respectively. MaxPool and stride denote the max-pooling operation and the stride of the kernel
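
The sketch below illustrates the focal convolution idea described above: the feature map is split horizontally into n predefined parts, each part is convolved with a shared regular convolution, and the results are re-assembled. Channel and kernel settings here are placeholders; the actual configuration is given in Table 1.

```python
import torch
import torch.nn as nn

class FConv(nn.Module):
    """Focal convolution sketch: restrict the receptive field to n horizontal parts."""
    def __init__(self, in_c: int, out_c: int, kernel: int = 3, pad: int = 1, n_parts: int = 4):
        super().__init__()
        self.n_parts = n_parts
        self.conv = nn.Conv2d(in_c, out_c, kernel, padding=pad)

    def forward(self, x):                                  # x: (N*S, C, H, W) frame features
        parts = torch.chunk(x, self.n_parts, dim=2)        # split along the height axis
        parts = [self.conv(p) for p in parts]              # shared convolution on each part
        return torch.cat(parts, dim=2)                     # concatenate back along height
```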

3.4 Multi-granular feature fusion module

To make the most of the spatial features of pedestrians, we propose the multi-granular feature fusion module to model the multi-granularity features of pedestrians. The MGFF module consists of three branches, each of which (\(\text{MGFF}_{(i,\cdot )}\)) is responsible for modeling the short-range temporal–spatial representation of a specific granularity using the motion capture module (MCM). The first branch, \(\text{MGFF}_{(1,\cdot )}\), extracts global temporal–spatial features, while the second branch, \(\text{MGFF}_{(2,\cdot )}\), and third branch, \(\text{MGFF}_{(3,\cdot )}\), extract two-part and four-part features, respectively, to focus on finer-grained details. Unlike most existing gait recognition methods, which consider only either global or local features, our approach models multiple levels of features for improved discriminative performance. Figure 3 shows the specific structure of \(\text{MGFF}_{(2,\cdot )}\) as an example.

Let \(p_{(i,j)}\) represent the j-th part of the i-th branch in the multi-granular feature fusion module. The part-level motion features are obtained by feeding the vector \(p_{(i,j)}\) into \(\text{MCM}_{(i,j)}\), as expressed by:

$$\begin{aligned} v_{(i,j)}=\,\text{MCM}_{(i,j)}(p_{(i,j)}). \end{aligned}$$
(1)
Fig. 3

The structure of \(\text{MGFF}_{(2,\cdot )}\). ConvNet1d is a small network composed of two 1-D convolutional layers, Tempfunc is a template function composed of the Avgpool1d and Maxpool1d functions, and s is a sigmoid function. TP is temporal pooling

In Eq. 1, the motion capture module is designed to learn a more fine-grained gait representation. The MCM is composed of the Micro-motion Template Builder (MTB) module and the temporal pooling (TP) module. The MTB module maps the part-level feature vector \(p_{(i,j)}\) to \(q_{(i,j)}\), i.e., \(q_{(i,j)} = \text{MTB}(p_{(i,j)})\). The TP module then extracts the most discriminative motion feature vector \(v_{(i,j)}\), i.e., \(v_{(i,j)} = \text{TP}(q_{(i,j)})\).

3.4.1 MTB module

The MTB module includes two similar parts, each with a different convolution kernel size. The first part, ConvNet1d, is a small network composed of two 1-D convolution layers. As shown in Fig. 3, ConvNet1d is utilized to produce a temporary vector \(p1_{(i,j)}\), which is depicted as:

$$\begin{aligned} p1_{(i,j)}=\, \text{ConvNet1d}(p_{(i,j)}). \end{aligned}$$
(2)

The second part, Tempfunc, utilizes the concept of a Gait Energy Image to average multiple frames in the sequence. By applying two statistical functions, Tempfunc generates another temporary vector \(p2_{(i,j)}\). This can be expressed as:

$$\begin{aligned} p2_{(i,j)}= \, \text{Avgpool1d}(p_{(i,j)})+\text{Maxpool1d}(p_{(i,j)}). \end{aligned}$$
(3)

Further, to obtain a more discriminative micro-motion representation, the channel attention mechanism is introduced in the MTB module. This mechanism reweights the feature vector at each time, resulting in the final micro-motion representation \(q_{(i,j)}\). Mathematically, it can be formulated as:

$$\begin{aligned} q_{(i,j)}=p2_{(i,j)}{\cdot }\text{Sigmoid}(p1_{(i,j)}). \end{aligned}$$
(4)
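
Putting Eqs. (2)–(4) together, one MTB part can be sketched as below; the temporal window size and squeeze ratio are placeholders (see Table 2 for the values actually used), and the activation inside ConvNet1d is an assumption.

```python
import torch
import torch.nn as nn

class MTB(nn.Module):
    """Micro-motion Template Builder sketch for part-level features p of shape (N, C, T)."""
    def __init__(self, channels: int, squeeze: int = 4, win: int = 3):
        super().__init__()
        self.convnet1d = nn.Sequential(                    # two 1-D convolution layers
            nn.Conv1d(channels, channels // squeeze, win, padding=win // 2),
            nn.LeakyReLU(inplace=True),
            nn.Conv1d(channels // squeeze, channels, win, padding=win // 2),
        )
        self.avgpool = nn.AvgPool1d(win, stride=1, padding=win // 2)
        self.maxpool = nn.MaxPool1d(win, stride=1, padding=win // 2)

    def forward(self, p):
        p1 = self.convnet1d(p)                             # Eq. (2)
        p2 = self.avgpool(p) + self.maxpool(p)             # Eq. (3), Tempfunc
        return p2 * torch.sigmoid(p1)                      # Eq. (4), channel attention reweighting
```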

3.4.2 TP module

After MTB, we get several gait motion representations, from which part-level features can be extracted by TP module. TP module uses \(\text{max}(\cdot )\) as the statistical function, i.e.,

$$\begin{aligned} \text{TP}(q^t_{(i,j)})=\, \text{max}(q^1_{(i,j)},q^2_{(i,j)},...,q^t_{(i,j)}), \end{aligned}$$
(5)

where t is the number of frames.

For obtaining the branch-level feature vector \(v_{(i)}\), we sum the part-level outputs \(v_{(i,j)}\) of branch \(\text{MGFF}_{(i,\cdot )}\) using the following equation:

$$\begin{aligned} v_{(i)}=\sum _{j}v_{(i,j)}. \end{aligned}$$
(6)

Finally, by weighting the feature vectors of each branch, we can obtain the final feature vector v:

$$\begin{aligned} v=av_{(1)}+bv_{(2)}+cv_{(3)}, \end{aligned}$$
(7)

where a, b, and c are the weights of each branch.
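
A compact sketch of Eqs. (5)–(7) is given below: max pooling over the temporal axis yields each part's motion feature, the parts within a branch are summed, and the three branch vectors are combined with the weights a, b, and c. The function names and the list-of-lists layout are illustrative.

```python
import torch

def temporal_pool(q: torch.Tensor) -> torch.Tensor:
    """Eq. (5): take the maximum over the temporal axis of q with shape (N, C, T)."""
    return q.max(dim=-1).values

def fuse_branches(branch_parts, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """branch_parts: three lists of (N, C, T) tensors holding 1, 2, and 4 parts, respectively."""
    branch_vecs = [sum(temporal_pool(q) for q in parts)    # Eq. (6): sum the part-level outputs
                   for parts in branch_parts]
    a, b, c = weights                                      # granularity weights from Eq. (7)
    return a * branch_vecs[0] + b * branch_vecs[1] + c * branch_vecs[2]
```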

3.5 Loss function

We use the separate Batch All (BA+) triplet loss function to optimize our model, which helps bring samples with the same ID closer in the feature space and pushes samples with different IDs further apart. The triplet loss, widely used in video-based retrieval and re-identification tasks, calculates the Euclidean distances between an anchor sample, a positive sample, and a negative sample in the embedding space and aims to make the distance between the anchor and positive samples smaller than the distance between the anchor and negative samples. Specifically, given a triplet of image sequences, i.e., anchor sample a, positive sample p, and negative sample n, the triplet loss function can be expressed as:

$$\begin{aligned} L=[D(f(x^i_a),f(x^i_p)) - D(f(x^i_a),f(x^i_n)) + \beta ]_+, \end{aligned}$$
(8)

where \(f(x^i_a)\), \(f(x^i_p)\) and \(f(x^i_n)\) are the features from anchor samples, positive samples, and negative samples, respectively. D(, ) denotes the Euclidean distance measure between features, and \(\beta\) is the margin.
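
A plain (unmined) form of Eq. (8) can be sketched as follows; the Batch All variant used in training additionally enumerates all valid triplets within a batch, which is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, beta: float = 0.2) -> torch.Tensor:
    """Margin-based triplet loss over batched embeddings of shape (B, D)."""
    d_ap = F.pairwise_distance(anchor, positive)   # D(f(x_a), f(x_p))
    d_an = F.pairwise_distance(anchor, negative)   # D(f(x_a), f(x_n))
    return F.relu(d_ap - d_an + beta).mean()       # [.]_+ hinge, averaged over the batch
```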

4 Experiment

4.1 Datasets and metric

4.1.1 Outdoor-Gait

The Outdoor-Gait [47] dataset is a comprehensive outdoor gait dataset consisting of 138 individuals, with three scenes for each person. Each scene is divided into 3 walking conditions, including 4 normal walking (NM) sequences, 4 walking sequences wearing a coat or jacket (CL), and 4 walking sequences carrying a bag (BG). Each walking sequence consists of a single view \((90^\circ )\) of the person walking, so there are \(3*(4+4+4)=36\) sequences for each person. During training, 69 individuals are used as the training set and the remaining 69 individuals are used as the test set. The dataset includes both original video frame sequences and gait silhouette sequences.

4.1.2 CASIA-B

The CASIA-B [48] dataset is a large-scale, multi-view gait dataset consisting of 124 individuals. Each individual has three walking conditions including 6 normal walking sequences (NM), 2 walking sequences wearing a coat and jacket (CL), and 2 walking sequences with a bag (BG). Each walking sequence is captured from 11 views \((0^\circ ,18^\circ ,36^\circ ,...,180^\circ )\), spanning from \(0^\circ\) to \(180^\circ\). In total, each individual has \((6+2+2)*11=110\) sequences. The first 74 individuals in the database are used for training, and the last 50 individuals are used for testing. The dataset includes both original video frame sequences and gait silhouette sequences.

4.1.3 Rank-1

In our experiments, the effectiveness of the proposed model was evaluated using the Rank-1 recognition accuracy, which measures the ability to correctly identify a sequence in the gallery that has the same ID as the sequence in the Probe. Specifically, the Rank-1 accuracy was calculated by comparing the probe sequence with all sequences in the gallery and determining whether the highest ranked match has the same ID as the probe.
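
The evaluation described above amounts to a nearest-neighbor search in the embedding space; a minimal sketch, assuming probe and gallery features and integer ID tensors are already extracted, is shown below (all names are illustrative).

```python
import torch

def rank1_accuracy(probe_feats: torch.Tensor, probe_ids: torch.Tensor,
                   gallery_feats: torch.Tensor, gallery_ids: torch.Tensor) -> float:
    """Rank-1: fraction of probes whose nearest gallery sequence shares the same ID."""
    dists = torch.cdist(probe_feats, gallery_feats)   # (P, G) Euclidean distances
    nearest = dists.argmin(dim=1)                     # closest gallery index per probe
    return (gallery_ids[nearest] == probe_ids).float().mean().item()
```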

4.2 Implementation details

In this section, we will provide a detailed explanation of the implementation and network structure of our experiments, including the FPFE and MTB modules.

Table 2 The structure of Micro-motion Template Builder. C and s represent the input channel and the squeeze ratio, respectively. ‘|’ is used to divide MTB1 and MTB2

In our experiments, we selected 30 frames from each sequence for training, and the separate Batch All (BA+) triplet loss was used to train the network, where the margin \(\beta\) in Eq. 8 was set to 0.2. The batch size for the Outdoor-Gait dataset was set to (4, 8), and the input frame resolution was cropped to \(128 \times 88\). For the CASIA-B dataset, the batch size was set to (8, 16), and the input frame resolution was cropped to \(64 \times 44\). We performed 160k iterations for both datasets. In addition, the Adam optimization algorithm was used with a learning rate of 1e-4 and a momentum of 0.9. Prior to training, the 3D reconstruction network was pretrained on the MSCOCO-2017 dataset [49].
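
For completeness, the stated optimizer settings can be written as follows; the placeholder model and the second-moment coefficient of Adam (left at the PyTorch default) are assumptions, since only the learning rate, momentum, margin, and iteration count are specified above.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)   # placeholder for the full network described in Sect. 3.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Training: 160k iterations, BA+ triplet loss with margin 0.2 (Eq. 8),
# batch size (4, 8) on Outdoor-Gait and (8, 16) on CASIA-B.
```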

The frame-level part feature extractor module is designed to extract meaningful features from gait sequences that represent the unique gait patterns of pedestrians. This module comprises multiple focal convolution layers and MaxPooling layers, as shown in Table 1. The notations In-C, Out-C, Kernel, and Pad represent the number of input channels, the number of output channels, the kernel size, and the padding, respectively. The Micro-motion Template Builder module is used to learn micro-motion representations from the part-level gait features obtained from the FPFE. As seen in Table 2, the MTB module consists of convolution layers and pooling layers. The notations used for the MTB are the same as those used for the FPFE. In addition, we use the symbols C and s to denote the number of channels and the squeeze ratio between the input and output channels, with the settings of MTB1 and MTB2 separated by a ‘\(\mid\)’ symbol.

4.3 Main results

In this experiment, we validated our method on the Outdoor-Gait dataset. It is worth noting that previous gait recognition methods have mostly been based on GEI or silhouette sequence data, as shown in the middle row of Fig. 1. These binary images are generated from the original RGB video frames, meaning that previous works have rarely performed gait recognition directly on the original RGB video frames. Additionally, the reliance on silhouette sequence data as input introduces an extra step of image preprocessing into the gait recognition task, and the recognition accuracy is greatly impacted by the quality of silhouette sequence generation, leading to decreased robustness and increased noise.

Table 3 Experimental results on Outdoor-Gait dataset. NM, CL, and BG are, respectively, normal walking sequences, walking sequences wearing coat and jacket, and walking sequences with bag

The proposed method uses a novel approach based on 3D human body reconstruction and is trained directly on original RGB video data. Unlike existing works, our model eliminates the need for GEI or silhouette sequence data, making it more practical and easier to deploy in real-world scenarios. The results of cross-condition recognition experiments conducted on the Outdoor-Gait dataset are presented in Table 3. The table compares our method with other gait recognition methods based on GEI or silhouette sequence data and shows that the mean accuracy of our method outperforms these methods in recognizing the same pedestrian under different walking conditions. While our method shows a slight deficiency when the gallery and probe conditions are the same, i.e., Gallery-NM\(\rightarrow\)Probe-NM, Gallery-BG\(\rightarrow\)Probe-BG, Gallery-CL\(\rightarrow\)Probe-CL, it shows a large advantage over the comparison methods in the other scenarios. The reasons for this result are twofold: first, our method avoids the negative effect of redundant information such as coat wearing or bag carrying, which seriously degrades the performance of other methods when the conditions of the gallery and probe differ. Second, our method is more feasible for real-world applications as it utilizes original RGB video sequences rather than carefully labeled silhouette data, which may result in a slightly less compact and discriminative representation when the gallery and probe conditions are the same. Despite this trade-off, our method directly uses original video data and has a higher mean accuracy, making it a promising alternative for gait recognition.

Table 4 Experimental results on CASIA-B dataset
Table 5 Results on Outdoor-Gait dataset
Fig. 4

The failure cases of 3D human body reconstruction

Our model was further validated through experiments on the large CASIA-B gait dataset, as seen in Table 4. The results demonstrate that our model produces relatively satisfactory results even when raw video frames are used as input. Among the RGB-based models, the accuracy of our model is better than the others. However, it is important to note that the CASIA-B dataset, having been published earlier, contains many low-quality frames, which degrades the 3D human body reconstruction and negatively impacts recognition accuracy. In contrast, the Outdoor-Gait dataset features higher pixel quality and provides more effective 3D human body reconstruction for gait representation, leading to better results. It should be mentioned that the CASIA-B dataset has 11 different viewing angles for each walking condition, and the recognition accuracy is calculated as the average across these 11 angles.

Our experiments suggest that our method is effective and competitive in gait recognition tasks. As demonstrated in Tables 3 and 4, our proposed approach achieves the highest mean accuracy compared with other silhouette sequence-based models on the high-resolution dataset. Furthermore, when compared with the RGB-based models, our method exhibits higher recognition accuracy. During 3D human body reconstruction, some cases may fail. In our experiments, the main causes of failure are deviations in body tilt angle and overly slender limbs, which result in inaccurate reconstruction of the true human contour, as shown in Fig. 4. There are two likely reasons for these situations: first, some original RGB images have poor image quality, which affects the precision of human body reconstruction; second, the 3D human body reconstruction method used in our experiments has limitations in reconstructing fine-grained details of the human body.

Table 6 Experimental results of MGFF on CASIA-B dataset

4.4 Ablation study

To demonstrate the effectiveness of our proposed method, we conducted ablation experiments on both the Outdoor-Gait and CASIA-B datasets. Our method was compared against several state-of-the-art methods that rely on GEI and silhouette sequences as inputs, without the use of 3D human body reconstruction. As seen from Tables 5 and 6, the integration of the multi-granular feature fusion module leads to an improvement in recognition accuracy on both datasets compared with current approaches.

To further demonstrate the efficacy of the multi-granular feature fusion module, we conducted additional experiments on the CASIA-B dataset. We performed ablation tests to evaluate each component of the multi-granular feature fusion module separately and compared them to our complete model. As seen in Table 7, the results reveal that the multi-granular feature fusion module consistently delivers improved performance across different walking conditions. To account for the 11 viewing angles in the CASIA-B dataset, the final results were obtained by taking the average recognition accuracy across all 11 angles.

Table 7 Ablation study on the CASIA-B dataset. In the first column, a, b, and c indicate the weight of different granularity

4.5 Visual analysis

In order to present the performance of our model more intuitively, we provide a visualization of our results in Fig. 5. The figure comprises four parts, showing the original video frames (a), silhouette image sequences (b), pose sequences (c), and image sequences after our 3D human body reconstruction (d). In each part, the first row shows the BG condition, the second row the CL condition, and the third row the NM condition.

Fig. 5

Visualization results. a Original video frames, b silhouette image sequences, c pose sequences, and d image sequences after 3D human body reconstruction. For each part, the first row is the BG condition, the second row is the CL condition, and the third row is the NM condition

As seen in Fig. 5, the 3D human body reconstruction effectively eliminates the negative impact of extraneous information, such as coats and backpacks, on gait recognition performance. Furthermore, it effectively compensates for the loss of local body information that occurs during silhouette extraction. Consequently, our proposed method with 3D human body reconstruction performs much better than existing methods that rely only on silhouette image sequences. This leads to more robust and superior results, owing to the compact and discriminative gait representation provided by our model.

5 Conclusion

We have designed a novel end-to-end gait recognition method that leverages 3D human body reconstruction to improve recognition performance. By using an HMR module to generate a compact and discriminative gait representation that eliminates the negative effects of redundant information, our method avoids the issues that plague existing methods when dealing with large appearance changes in the video. To further enhance recognition ability, we introduced a multi-granular feature fusion module that effectively leverages global and local features of pedestrians at multiple granularities. Our method was evaluated on two popular gait recognition datasets, Outdoor-Gait and CASIA-B, and was shown to outperform similar state-of-the-art methods. Visualization results illustrate that our 3D reconstruction-based model can learn a more discriminative and nonredundant gait representation, greatly contributing to improved gait recognition performance.