
1 Introduction

As a unique biometric feature, gait can be captured at long range, without contact, and is difficult to disguise. Moreover, gait samples can be obtained without the subject's cooperation. Gait recognition therefore has a wide range of applications in security surveillance, criminal investigation, and the public domain (Fig. 1).

Fig. 1. The gait silhouette images are from subjects 33 and 53, sampled every 4 frames. It can be observed that the subtle changes can only be detected over time; a single silhouette image alone cannot depict them. Therefore, the temporal features of gait are crucial, especially when individuals are wearing coats.

Existing global or local feature extraction methods [1, 10, 12, 17, 26] mainly focus on extracting spatial features, while temporal feature extraction is limited to some extent. Specifically, the max-pooling operation along the temporal dimension easily loses many temporal gait features. Moreover, existing methods mainly segment gait features into chunks and extract local features of different regions separately, so the interactions between different human body parts are not sufficiently explored, which limits gait recognition accuracy.

To address these issues, in this paper we propose a gait recognition method that enhances temporal gait features and the interactions between local features. Specifically, we build a new feature enhancement module, called the Global and Local Feature Extractor based on SeNet (GLFES), to enhance the interactions between local features by integrating an attention mechanism. This is realized by a Squeeze and Excitation Module (SEM) in the network, which is capable of enhancing inter-regional interactions. The effect of SEM is shown in Fig. 2, which visualizes feature maps under different views, including \(0^\circ \), \(54^\circ \), \(90^\circ \) and \(126^\circ \).

At the same time, we design a Temporal Global and Local Aggregator (TGLA) to extract global and local temporal features in a principled way. The global temporal feature extractor focuses on the temporal features of the entire gait sequence, while the local temporal feature extractor splits the gait sequence along the time dimension, focusing on the gait details of adjacent frames. Merging both global and local features gives the model better recognition ability.

Fig. 2. The convolutional feature map visualization after SEM. The visualization on the left is without the SEM module, and the visualization on the right is with the SEM module.

Finally, we design a novel Temporal Feature Aggregator (TFA). The gait feature maps have two spatial dimensions, height and width, and the features along the height dimension are more discriminative than those along the width dimension. Therefore, by pooling over the width dimension and reducing the number of parameters of the gait features fed into the TFA, the recognition accuracy of the model can be improved. The highlighted contributions of our method are listed as follows:

  1. (1)

    We propose a simple and lightweight convolutional module, called the Temporal Global and Local Aggregator (TGLA), to facilitate refined learning of temporal features at the local level. The core idea of TGLA is to constrain the temporal receptive field of the convolution and focus more on the local temporal information of adjacent gait frames, thereby enhancing the local temporal features.

  2. (2)

    We propose a novel local feature enhancement module (SEM) that maximizes the usage of local features by enabling interactions between local features of different regions.

  3. (3)

    We propose a lightweight convolutional module, called the Temporal Feature Aggregator (TFA), which is able to improve the comprehensive performance of the model.

  4. (4)

    We conduct extensive experiments on the public datasets CASIA-B and OUMVLP. Experimental results show that the proposed method achieves state-of-the-art performance.

2 Related Work

Current gait recognition studies prove the importance of spatial feature extraction and temporal modeling of gait sequences [2,3,4, 6, 7, 9, 14, 23, 24, 27, 31, 32]. To obtain more discriminative features from gait sequences, most existing models are based on CNNs and use 2D [2, 32] or 3D [14, 16, 22,23,24,25] convolutions for feature extraction along the spatial dimension with good results. Different human body parts contribute differently to gait recognition, and performing the same scanning operation on the whole gait sequence often overlooks this characteristic. To obtain more detailed information about different body parts, GaitSet [2, 3], GaitPart [7], GLN [9] and MT3D [15] slice the output features into m blocks along the horizontal dimension, which allows learning unique gait features of different body parts.

In addition, to better obtain discriminative gait features, many studies integrate the entire gait sequence into one frame [14, 29]. At the same time, many studies extract frame-level features from gait sequences by CNNs and apply a max-pooling operation along the temporal dimension [2, 9], which easily limits the interrelationships and interactions between different gait frames.

To better capture the relationships between consecutive gait frames, some works replace the original max-pooling operation with LSTMs to integrate the gait features along the time series and generate the final gait features [5, 13, 18, 32], although the whole pipeline retains a non-essential order constraint on the gait sequences. These methods are good at extracting spatial and temporal features from gait sequences, but ignore the spatio-temporal dependencies between non-local features.

Fig. 3. The overview of the whole GaitSE framework. 'Conv3d' and 'LTConv3d' denote three-dimensional convolutions. 'TGLA' represents the Temporal Global and Local Aggregator. 'GLFES' represents the Global and Local Feature Extractor based on SeNet. 'TFA' represents the Temporal Feature Aggregator. 'GeMHPP' represents Generalized-Mean Horizontal Pyramid Pooling. 'FC' represents the fully connected layer. 'Triplet Loss' and 'Cross Entropy Loss' represent the two loss functions.

3 Method

In this section, the pipeline of GaitSE is first described, then the Temporal Global and Local Aggregator (TGLA), followed by the Global and Local Feature Extractor based on SeNet (GLFES), ending with the Temporal Feature Aggregator (TFA) and implementation details. The overall framework is presented in Fig. 3.

3.1 Pipeline

To obtain more holistic gait features, we first extract shallow features from the original gait sequences by 3D CNNs. Next, the Temporal Global and Local Aggregator (TGLA) is designed to extract a combination of global and local temporal information. Then, the Local Temporal 3D Convolution (LTConv3d) is designed to replace the original max-pooling operation and retain more comprehensive spatio-temporal information. After that, the Global and Local Feature Extractor based on SeNet (GLFES) is designed to enhance global and local information. Then, we propose the Temporal Feature Aggregator (TFA) to integrate the global temporal information. Finally, triplet loss and cross-entropy loss are used as our loss functions to train the model.

3.2 Temporal Global and Local Aggregator

We propose a Temporal Global and Local Aggregator (TGLA). The TGLA module consists of two 3D CNNs, one for global temporal information and the other for local temporal information. Since global and local temporal information are considered at the same time, the gait features extracted by TGLA are comprehensive, as shown in Fig. 4.

Fig. 4. Architecture of the Temporal Global and Local Aggregator. 'H', 'W', 'C' and 'T' denote the height, width, number of channels and length of the gait sequence. 'n' denotes the number of cuts along the time dimension. 'Temporal Partition' denotes the segmentation of features along the time dimension and 'N' represents the number of segmented regions. 'Conv3d' is a 3D convolution operation, and 'share' denotes that these 3D convolutions share parameters. 'Concat' indicates a concatenation operation along the temporal dimension.
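To make the structure described above concrete, here is a minimal PyTorch sketch of such a two-branch design. The kernel sizes, the number of temporal segments, the activation, and the fusion of the global and local branches by element-wise addition are all assumptions beyond what the text and Fig. 4 specify.

```python
import torch
import torch.nn as nn

class TGLA(nn.Module):
    """Hypothetical sketch of the Temporal Global and Local Aggregator (Fig. 4).

    One Conv3d covers the whole sequence (global branch); a second, weight-shared
    Conv3d is applied to each of the N temporal segments (local branch). How the
    two branches are merged is not spelled out above; element-wise addition
    followed by a ReLU is assumed here.
    """
    def __init__(self, in_c, out_c, num_segments=3, kernel=(3, 3, 3)):
        super().__init__()
        pad = tuple(k // 2 for k in kernel)
        self.global_conv = nn.Conv3d(in_c, out_c, kernel, padding=pad)
        self.local_conv = nn.Conv3d(in_c, out_c, kernel, padding=pad)  # shared across segments
        self.n = num_segments

    def forward(self, x):                       # x: (N, C, T, H, W)
        g = self.global_conv(x)                 # global temporal receptive field
        segs = torch.chunk(x, self.n, dim=2)    # Temporal Partition into N segments
        l = torch.cat([self.local_conv(s) for s in segs], dim=2)  # Concat along time
        return torch.relu(g + l)                # assumed fusion: addition + activation
```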

3.3 Global and Local Feature Extractor Based on SeNet

We propose a Global and Local Feature Extractor based on SeNet (GLFES). To fuse the local and global features, there are two classical methods: one is element-wise addition and the other is concatenation. We define the module based on the addition fusion as GLSEA and the module based on the concatenation fusion as GLSEC; the only difference between the two lies in the final fusion method. The GLFES module consists of four layers, 'GLSEA1-MaxPool3d-GLSEA2-GLSEC', as shown in Fig. 3.

The principle of TGLA has been shown above. The difference between TGLA and GLSE lies in how the local feature map is partitioned and in the Squeeze and Excitation Module (SEM): TGLA partitions along the time dimension, while GLSE partitions horizontally. The SEM is shown in Fig. 5.

Given \(X_{l} \in \mathbb {R}^{H \times W \times T \times C_{l}}\) as the final local feature map, the squeeze and excitation module can be formulated as:

$$\begin{aligned} Y_{f} = F_{se}(Reshape(Max(X_l))) \end{aligned}$$
(1)
$$\begin{aligned} Y_{end} = F_{scale}(Reshape(Y_f), X_l) \end{aligned}$$
(2)

where Max(.) denotes max-pooling along the width dimension of the feature, and Reshape(.) denotes merging the height and time dimensions of the feature. \(F_{se}(.)\) indicates that 'FC-ReLU-FC-Sigmoid' operations are performed, where the output dimension of the first FC is \(\frac{C_l}{r}\) and the output dimension of the second FC is \(C_l\). \(Y_{f} \in \mathbb {R}^{H \times 1 \times T \times C_{l}}\) indicates the output of Eq. 1. The Reshape in Eq. 2 means separating the merged dimensions again. \(F_{scale}(., .)\) denotes element-wise multiplication between the two feature maps, broadcast along the width dimension. \(Y_{end} \in \mathbb {R}^{H \times W \times T \times C_{l}}\) denotes the output of Eq. 2.
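A minimal PyTorch sketch of Eqs. 1 and 2 is given below. The (N, C, T, H, W) tensor layout, the reduction ratio r = 16 and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEM(nn.Module):
    """Hypothetical sketch of the Squeeze and Excitation Module (Eqs. 1-2).

    Tensors follow the common PyTorch layout (N, C, T, H, W); the paper's
    formulas use (H, W, T, C), so the axes below are mapped accordingly.
    """
    def __init__(self, channels, reduction=16):  # reduction r is an assumed value
        super().__init__()
        self.fc = nn.Sequential(                  # F_se: FC-ReLU-FC-Sigmoid
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        y = x.max(dim=-1).values                  # Eq. 1: max over the width axis -> (N, C, T, H)
        y = y.permute(0, 2, 3, 1).reshape(n, t * h, c)   # merge T and H, channel last
        y = self.fc(y)                            # per-(t, h) channel attention
        y = y.reshape(n, t, h, c).permute(0, 3, 1, 2)    # back to (N, C, T, H)
        return x * y.unsqueeze(-1)                # Eq. 2: F_scale, broadcast over the width axis

# usage (shapes only): SEM(64)(torch.randn(2, 64, 30, 16, 11)).shape == (2, 64, 30, 16, 11)
```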

Fig. 5. The overview of the SEM attention mechanism. This attention mechanism is widely used in Transformers.
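Putting the pieces together, the sketch below shows what one GLSE block might look like, reusing the SEM class sketched above. The global-plus-strip-wise convolution structure, the number of horizontal strips, the kernel sizes and the exact fusion details are assumptions beyond what is specified in this section.

```python
import torch
import torch.nn as nn

class GLSE(nn.Module):
    """Hypothetical sketch of one GLSE block: a global 3D conv plus a weight-shared
    local 3D conv over horizontal strips, followed by the SEM class sketched above.
    'add' corresponds to GLSEA and 'cat' to GLSEC; everything else is an assumption."""
    def __init__(self, in_c, out_c, num_strips=4, fusion='add'):
        super().__init__()
        self.global_conv = nn.Conv3d(in_c, out_c, 3, padding=1)
        self.local_conv = nn.Conv3d(in_c, out_c, 3, padding=1)   # shared across strips
        self.sem = SEM(out_c)                                    # see the SEM sketch above
        self.n, self.fusion = num_strips, fusion

    def forward(self, x):                        # x: (N, C, T, H, W)
        g = self.global_conv(x)
        strips = torch.chunk(x, self.n, dim=3)   # horizontal partition along height
        l = torch.cat([self.local_conv(s) for s in strips], dim=3)
        l = self.sem(l)                          # inter-region interaction on local features
        if self.fusion == 'add':                 # GLSEA: element-wise addition
            return torch.relu(g + l)
        return torch.cat([g, l], dim=3)          # GLSEC: concatenation (assumed along height)
```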

3.4 Temporal Feature Aggregator

We propose an effective and lightweight module, the Temporal Feature Aggregator (TFA). The TFA module includes three layers, 'Conv2d_down-Conv2d_inter-Conv2d_up'.

Suppose the input to the module is \(X_t \in \mathbb {R}^{C_{begin} \times T \times H \times W}\), where \(C_{begin}\) indicates the total number of channels of the input, T represents the length of the input gait sequence and (H, W) denotes the height and width of each silhouette image. Therefore, the Temporal Feature Aggregator can be formulated as:

$$\begin{aligned} Y_{beg} = Max(X_t, dim=W) \end{aligned}$$
(3)
$$\begin{aligned} Y_{mid} = F_r(F_b(F_u(F_{br}(F_i(F_{br}(F_d(Y_{beg})))))) + Y_{beg}) \end{aligned}$$
(4)
$$\begin{aligned} Y_{end} = GMP(Y_{mid}) \end{aligned}$$
(5)

where \(Max(., dim=W)\) represents max-pooling of the input along the width (W) dimension, and \(Y_{beg} \in \mathbb {R}^{C_{begin} \times T \times H \times 1}\) is the first output. \(F_d(.)\), \(F_i(.)\) and \(F_u(.)\) denote three different 2D convolutions with output channels \(C_{down}\), \(C_{inter}\) and \(C_{begin}\), and their convolution kernel sizes are (1, 1), \((T_i, 1)\) and (1, 1), respectively. \(F_b(.)\) and \(F_r(.)\) represent batch normalization and ReLU operations respectively, and \(F_{br}(.)\) denotes that the input goes through batch normalization first and then ReLU. GMP(.) denotes the global max-pooling operation. \(Y_{mid} \in \mathbb {R}^{C_{begin} \times T \times H \times 1}\) and \(Y_{end} \in \mathbb {R}^{C_{begin} \times 1 \times 1 \times 1}\) denote the outputs of the corresponding steps.
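A minimal PyTorch sketch of Eqs. 3-5 is given below. The channel widths C_down and C_inter, the default temporal kernel size T_i = 3, and the padding used to keep the residual shapes aligned are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFA(nn.Module):
    """Hypothetical sketch of the Temporal Feature Aggregator (Eqs. 3-5).

    After max-pooling over the width axis, the remaining (T, H) plane is treated
    as the spatial input of 2D convolutions. Channel widths c_down / c_inter and
    the temporal kernel size t_i are assumed hyper-parameters.
    """
    def __init__(self, c_begin, c_down, c_inter, t_i=3):
        super().__init__()
        self.down = nn.Conv2d(c_begin, c_down, kernel_size=(1, 1))      # F_d
        self.inter = nn.Conv2d(c_down, c_inter, kernel_size=(t_i, 1),
                               padding=(t_i // 2, 0))                    # F_i
        self.up = nn.Conv2d(c_inter, c_begin, kernel_size=(1, 1))        # F_u
        self.bn1, self.bn2, self.bn3 = (nn.BatchNorm2d(c_down),
                                        nn.BatchNorm2d(c_inter),
                                        nn.BatchNorm2d(c_begin))

    def forward(self, x):                         # x: (N, C_begin, T, H, W)
        y = x.max(dim=-1).values                  # Eq. 3: max over the width axis -> (N, C, T, H)
        r = F.relu(self.bn1(self.down(y)))        # F_br(F_d(.))
        r = F.relu(self.bn2(self.inter(r)))       # F_br(F_i(.))
        r = self.bn3(self.up(r))                  # F_b(F_u(.))
        y = F.relu(r + y)                         # Eq. 4: residual connection, then ReLU
        return F.adaptive_max_pool2d(y, 1)        # Eq. 5: global max pooling -> (N, C, 1, 1)
```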

3.5 Generalized-Mean Horizontal Pyramid Pooling

GeMHPP is a pooling method that lies between max pooling and average pooling, with the trade-off determined by a learnable parameter. The GeMHPP module can be represented as:

$$\begin{aligned} Y_{GeMHPP} = (F_{Avg}(Y_{GeMin}.pow(p))).pow(1/p) \end{aligned}$$
(6)

where \(Y_{GeMin}\) and \(Y_{GeMHPP}\) denote the input and output of GeMHPP, \(F_{Avg}\) denotes the average pooling operation, and pow(.) denotes the power operation. p is a learnable parameter whose suitable value is obtained during training.
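For illustration, here is a minimal sketch of the generalized-mean pooling at the core of GeMHPP (Eq. 6), applied to one horizontal strip. The initial value of p and the clamping epsilon are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM1d(nn.Module):
    """Hypothetical sketch of the generalized-mean pooling used in GeMHPP (Eq. 6).

    A single learnable exponent p interpolates between average pooling (p = 1)
    and max pooling (p -> infinity); it pools each horizontal strip to one value.
    """
    def __init__(self, p=6.5, eps=1e-6):   # initial p and eps are assumed values
        super().__init__()
        self.p = nn.Parameter(torch.tensor(float(p)))
        self.eps = eps

    def forward(self, x):                  # x: (N, C, L) features of one horizontal strip
        x = x.clamp(min=self.eps).pow(self.p)
        return F.avg_pool1d(x, x.size(-1)).pow(1.0 / self.p)   # F_Avg(.pow(p)).pow(1/p)
```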

In order to better train the proposed model, we use both triplet loss [8] and cross entropy loss.
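As a rough illustration of the joint objective, the sketch below combines a triplet term with a cross-entropy term. The batch-hard mining strategy and the equal weighting of the two terms are assumptions; only the joint use of the two losses and the margin of 0.2 (see Sect. 4.5) come from the text.

```python
import torch
import torch.nn.functional as F

def combined_loss(embeddings, logits, labels, margin=0.2, weight=1.0):
    """Hypothetical sketch of the joint objective: a triplet loss on the
    embeddings plus cross-entropy on the classification logits."""
    dist = torch.cdist(embeddings, embeddings)                  # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hard_pos = (dist * same.float()).max(dim=1).values          # hardest positive per anchor
    hard_neg = (dist + same.float() * 1e6).min(dim=1).values    # hardest negative per anchor
    tri = F.relu(hard_pos - hard_neg + margin).mean()           # triplet term
    ce = F.cross_entropy(logits, labels)                        # cross-entropy term
    return tri + weight * ce                                    # equal weighting is assumed
```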

4 Experiments

In this section, we first compare the experimental results on the CASIA-B, OUMVLP and GREW datasets with state-of-the-art methods, then describe the training details and perform an ablation study to compare the influence of different modules.

4.1 Datasets

  1. (1)

    CASIA-B [28] is a large multi-view gait dataset containing gait samples of 124 subjects. The samples are collected from 11 views (\(0^\circ \), \(18^\circ \), \(\ldots \), \(162^\circ \), \(180^\circ \)) under three walking conditions: normal (NM), carrying a bag (BG) and wearing a coat (CL).

  2. (2)

    OUMVLP [20] contains more than 10,000 subjects. Each subject's samples are collected from 14 views (\(0^\circ \), \(15^\circ \), \(30^\circ \), \(45^\circ \), \(60^\circ \), \(75^\circ \), \(90^\circ \), \(180^\circ \), \(195^\circ \), \(210^\circ \), \(225^\circ \), \(240^\circ \), \(255^\circ \), and \(270^\circ \)).

  3. (3)

    GREW [33] is currently the largest gait dataset in the wild. It consists of raw videos collected from 882 cameras positioned in a large public area, amounting to nearly 3,500 hours of video streams at a resolution of \(1,080 \times 1,920\).

Table 1. Rank-1 accuracy (%) on CASIA-B dataset under all view angles, different settings and conditions

4.2 Experiment Results on CASIA-B

To test the performance in cross-view scenarios, we compare GaitSE with the latest advanced methods. As shown in Table 1, GaitSE outperforms the state of the art in most views. Specifically, GaitSE outperforms previous methods by at least 0.2%, 0.1% and 1.6% under the three conditions (NM/BG/CL). Most notably, under the most challenging CL condition, GaitSE achieves an accuracy of 85.2%, a 1.6% improvement over GaitGL [14], which validates the robustness of GaitSE in difficult scenarios.

From the results, we see that the accuracy of our proposed network is greatly improved under the CL condition, which shows that SEM improves the potential of the model and thus strengthens its resistance to interference. We can also see that the accuracy improvement is much larger at the \(0^\circ \) and \(180^\circ \) views than at the other angles.

Table 2. The recognition accuracy (%) comparison on OUMVLP dataset under 14 probe views excluding identical views.

4.3 Experiment Results on OUMVLP

In this section, we evaluate the performance of our proposed model on the larger OUMVLP dataset. The experimental settings in this section follow the setup of GaitPart [7] and GaitGL [14]. In Table 2, we evaluate gait samples at 14 different views. During the test, we use Seq#00 and Seq#01 as the probe and gallery sequences, respectively. The results suggest that our proposed method yields a larger improvement on the larger-scale dataset than on the smaller-scale one.

4.4 Experiment Results on GREW

We analyze the effectiveness of the proposed method by comparing its performance with various gait recognition methods on the GREW dataset. The evaluated methods, namely GaitGraph [21], GaitSet [2], GaitPart [7], GaitGL [14], and CSTL [11], are compared in Table 3. This comparison reveals an important trend: gait recognition methods that demonstrate satisfactory results in controlled laboratory settings exhibit a notable decline in performance when confronted with real-world scenarios and datasets.

Table 3. The recognition accuracy (%) comparison on GREW dataset

4.5 Training Details

The input silhouettes are aligned following [20], and the resolution of the final silhouettes is \(64 \times 44\). We use Adam as the optimizer, with the learning rate and momentum set to 1e−4 and 0.9, respectively. The margin m of the triplet loss is set to 0.2. The length of the gait sequences T is set to 30. Four NVIDIA 3080 Ti GPUs are used to train our model.

4.6 Ablation Study

We design various pertinent ablation experiments to analyze the importance of different modules.

Analysis of Global and Local Feature Extractor Based on SeNet. In GLSE, the role of the SEM module is mainly to reorganize features. In GaitGL [14], local features are simply fused together to form a new gait feature map, ignoring the connections between regions; the SEM module helps to establish interactions between regions (Table 4).

Table 4. The recognition accuracy (%) of different max-pooling strategies in SE module on CASIA-B dataset.

Analysis of Temporal Feature Aggregator. We place the TFA in different positions and set different temporal convolution kernel sizes, and report the results in Table 5. To fully verify the best hyperparameters, we consider two positions, after GLSEA2 and after GLSEC. Meanwhile, we also consider three convolution kernel sizes, i.e., (3, 1), (4, 1) and (5, 1). From the table, we can see that these hyperparameters have a strong influence on the model, especially under the CL condition.

Table 5. The recognition accuracy (%) of placing TFA in different positions on CASIA-B dataset.

5 Conclusions

In this paper, we propose a novel gait recognition framework that enhances temporal global and local gait information and better models the interactions among local regions, thus improving the robustness of gait recognition. First, we propose to partition the features into multiple local regions along the temporal dimension and extract discriminative features separately, i.e., the Temporal Global and Local Aggregator. Second, we propose to introduce SEM into the local features to better utilize them, which enhances the interaction between regions. Our experiments on public datasets including both CASIA-B and OUMVLP demonstrate the superiority of our proposed framework.