1 Introduction

Recognizing human gait patterns is a challenging problem in biometric systems and pattern recognition. Since gait can be captured at a distance, covertly, or from low-resolution videos, it has many advantages over face and fingerprint biometrics [11, 21]. However, gait presents several challenges that affect the quality of a recognition system [6]. The most common variants are viewing angle, carrying condition, walking surface, type of shoes, and aging [2]. To represent gait appearance in a video without a predefined model, two types of features have been developed recently [6, 23]: “temporal” and “template”. In both approaches, the walking period and the silhouettes are first computed from the input video. In the temporal approach, the silhouettes extracted from all frames of a gait period are retained, and recognition is performed by comparing the silhouettes of the gallery and probe sets accordingly [6]. For this reason, the complexity of the search space increases exponentially in the pursuit of a better recognition rate [23]. In template-based methods, by contrast, the matching process is simpler because all silhouettes in a video sequence are transformed into a single template. Despite their simplicity and robustness to image noise, the main weakness of template-based approaches is their limited precision in gait representation.

Sarkar et al. [23] developed the baseline algorithm (a temporal method) in which the correlation among the silhouettes in a period serves as the recognition metric. For this purpose, three levels of correlation are defined for matching: 1) frame-to-frame; 2) gait period-to-period (covering several frames); and 3) gait sequence-to-sequence (covering several periods). Despite its recognition accuracy, the computational cost of this method is very high on an outdoor dataset because all frames of walking must be searched [6]. To simplify the recognition process and reduce this complexity, template features have been extensively developed in recent years [1, 2, 6, 8, 11, 30]. The basic idea of the template-based approach was presented by Han et al. [11], who developed the Gait Energy Image (GEI), derived by averaging the silhouettes over a period of walking. As an alternative template feature, a set of Gabor filters applied to GEI has been combined with General Tensor Discriminant Analysis (GTDA) [26]. In addition, GTDA has been combined with Enhanced Gabor (EG) features for multi-view recognition [10]. More recently, an augmented Gabor filter based on the patch distribution feature (Gabor-PDF) has been developed to improve gait representation under challenging conditions [7]. Although template features have a fairly simple structure that alleviates these complexities, their performance under different gait conditions is limited, mainly because the temporal ordering of the gait is lost in the final template. This issue was addressed by Wang et al. [29], who proposed the Chrono-Gait Image (CGI), a colorful template in which the rhythm of walking in the time domain is encoded in the color spectrum. It has been shown that such a colorful template can represent gait under different conditions more accurately [29]. Moreover, Space-Time Interest Points (STIP) have been defined as a spatio-temporal model of walking to handle viewing issues [14]. Atta et al. [1] recently proposed the 5/3 Gait Image (5/3GI), which extracts lifting 5/3 wavelet features from the contours of silhouettes; the 5/3GI template has been shown to properly preserve the temporal information of walking in the final template. Meanwhile, to handle gait conditions such as clothing and carrying conditions, Gheble et al. [8] proposed an adaptive outlier detection method that discards the clothing area of the silhouettes in a period, together with a similarity measure that compares the most similar parts of the silhouettes in the gallery and probe sets. Furthermore, Fendri et al. [5] dynamically select the parts of a silhouette that are most consistent under such covariates and apply a semantic classification to identify the relevant parts of the human body.

The approaches mentioned above can accurately recognize gait under clothing and carrying conditions, but they lack an accurate motion model, which limits their performance under other gait conditions. In other words, gait is by nature a spatio-temporal process, and motion descriptors are therefore required. To exploit the gait motion model, spatio-temporal filtering methods can be used efficiently [6, 7, 27]. For example, a set of LoG/Sinc impulse responses has been found to be an efficient spatial/temporal filter for gait modeling [7], and the output template can be computed by averaging the filter responses over a period. Following the ideas of the gait filtering methods [6, 7], there are salient regions in the walking space that provide meaningful information for classification. More precisely, a simple walking pattern includes local patches (i.e. STIP) in the 2D + t space that can be used to describe walking styles more accurately [2, 27]. This idea was proposed earlier in action recognition studies [24, 25]. For example, Rahman et al. [22] describe an action with local features using statistical information (i.e. 2D + t histograms) of the patches, which are easily extracted with a bank of spatio-temporal filters. Recently, Ben et al. [2] proposed a spatio-temporal Coupled Patch Alignment (CPA) feature to handle different viewing conditions: they extract a set of local patch features from a walking sequence and apply a nearest neighbour classifier to match the patches across different viewing angles.

To address the multi-view gait problem, Xu et al. [31] developed a novel gait recognition approach based on the Capsule Neural Network (CapsNet), applying a linear projection of low-level local features from the gallery and probe sets to improve recognition performance. In this paper, we develop a new patch-based template to improve the recognition rate under different covariate factors. To describe gait with local patch features, each gait template, e.g. GEI, is represented by a set of patch distribution features (PDF) derived by appending the X-Y patch coordinates to the pixels of the template [17,18,19, 30]. It was shown in [30] that a set of 40D Gabor features together with the X-Y coordinates of the patches provides a better gait representation than the 2D Discrete Cosine Transform (2D-DCT) and grey-level features developed in [17, 18]. The augmented templates can also be characterized with a Universal Background Model (UBM) [13, 19, 30], which exponentially increases the overhead of gait representation with respect to the size of the dataset. In addition, the temporal information of the patches is discarded when computing the augmented template. We address these issues in this paper by proposing a new gait template based on the spatio-temporal distribution of the patch features. The flowchart of our approach is shown in Fig. 1.

Fig. 1

The flowchart of our proposed method. The set of 40D Gabor features is augmented with weighted coordinates derived from the distribution of the local patches in a walking process. Here, “wx, wy, wt” are the weights, “pi” denotes the pixel coordinates, and S, D are the Gabor scales and directions

As shown in Fig. 1, our method consists of three main steps: 1) filtering the silhouettes and extracting the local patches, 2) computing the patch probability distribution, and 3) formatting the final template. In the first step, the local patches are extracted by searching for the extrema of a local sliding window in the 2D + t space. In the next step, three histograms of the patch coordinates in the X-Y-T domain (the gait signature) are computed and recorded for the weightings. In the final step, the weighted coordinates of the template pixels, along with the temporal ordering of the patches, are concatenated with the template. We call our method the “Patch Gait Feature” (PGF), in which the spatio-temporal local features of walking are embedded in an augmented template.

In comparison with the recent gait filtering and patch-based features [6, 7, 19, 30], our major contributions are:

  • A new patch distribution feature is proposed, based on the coordinates of the local patches in the 2D + t domain.

  • The simple walking pattern is characterized with the proposed gait signature, based on the spatio-temporal coordinates of local patches over a sequence.

  • The weighted X-Y-T coordinates are embedded with the Gabor features to form a new augmented feature template. The weights of the coordinates are derived from the histograms of the patches.

  • The complexity of the proposed method is comparable with that of other state-of-the-art methods.

To show how our method describes gait, we conduct comprehensive experiments on three well-known datasets (USF HumanID [23], CASIA-B [33] and OU-ISIR-B [20]). The structure of the paper is as follows. In the next section, the local patch-based features are described. The classifiers used to recognize the proposed PGF are then discussed in the third section. The experimental results are discussed in section 4, and the paper is concluded in the final section.

2 Gait feature extraction

In this section, the process of extracting the patch features is described in detail. In the given approach, local patches along a walking sequence are extracted and a new patch distribution feature is calculated. Finally, the proposed template, i.e. PGF, is derived from the probability distribution of the patches.

2.1 Preprocessing and silhouette extraction

The local patches are extracted from the sequence of silhouettes in a period of walking. Beforehand, some primary processing steps are applied to the raw input video: background subtraction, foreground alignment, and gait period detection. For background subtraction, a Gaussian Mixture Model (GMM) is applied to model the background of the input gait video [23]. Once the background has been computed, the foreground images containing the walking sequence are derived. These images are then aligned and resized according to the center of the foreground pixels, producing normalized silhouettes with the same size and aspect ratio [23].
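As an illustration, the following minimal sketch outlines this preprocessing stage in Python, assuming OpenCV's MOG2 background subtractor as the GMM implementation; the paper only states that a GMM is used [23], so the function name, parameters and target size here are illustrative.

```python
import cv2
import numpy as np

def extract_silhouettes(video_path, size=(88, 128)):
    cap = cv2.VideoCapture(video_path)
    # MOG2 is one GMM-based background model; parameters are illustrative.
    bg_model = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)
    silhouettes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        fg = bg_model.apply(frame)           # foreground mask
        fg = cv2.medianBlur(fg, 5)           # suppress speckle noise
        ys, xs = np.nonzero(fg)
        if len(xs) == 0:
            continue
        # Crop around the foreground, then normalize with a fixed height
        # while preserving the aspect ratio, centered horizontally.
        crop = fg[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        h, w = crop.shape
        new_w = max(1, int(w * size[1] / h))
        crop = cv2.resize(crop, (new_w, size[1]))
        canvas = np.zeros((size[1], size[0]), np.uint8)
        cx = (size[0] - new_w) // 2
        if cx >= 0:
            canvas[:, cx:cx + new_w] = crop
        else:
            canvas = crop[:, -cx:-cx + size[0]]
        silhouettes.append(canvas)
    cap.release()
    return silhouettes
```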

Given the sequence of aligned silhouettes, the gait period is easily derived by counting the lower-half foreground pixels in each frame [23]: the pixel count has local extrema when the two legs are farthest apart (or closest together). After locating these extrema, the period is measured as the median distance between two consecutive minima (or maxima) [6, 23]. Since each period consists of two analogous phases of walking, the starting phase for recording the features is not important, although it is customary to start from the frames in which the two legs completely overlap or are close to each other [20, 23, 33]. A complete pattern of human walking consists of several such periods. In our method, the local patches and the corresponding PGF are computed in each period independently. In the following subsection, we explain this procedure in detail.
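A minimal sketch of this period detection, assuming `silhouettes` is a list of aligned binary masks; the peak-distance parameter is illustrative, not taken from the paper.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_gait_period(silhouettes):
    # Foreground pixel count in the lower half of each frame [6, 23].
    counts = np.array([np.count_nonzero(s[s.shape[0] // 2:, :])
                       for s in silhouettes], dtype=float)
    counts -= counts.mean()
    # Local maxima occur when the legs are maximally apart.
    peaks, _ = find_peaks(counts, distance=5)
    if len(peaks) < 2:
        return None  # not enough extrema to measure a period
    # As described above, the period is the median distance between
    # consecutive maxima (each period spans two analogous phases).
    return int(np.median(np.diff(peaks)))
```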

2.2 Local spatio-temporal patches

Assuming a sequence of silhouettes in a period, the desired patches are extracted from the responses of gait filtering. Gabor filtering is known to be an effective tool for modeling human motion, and we apply it here to highlight the salient regions during walking. Let a set of Gabor filters be defined at a pixel z = (x, y) as follows:

$$ {\phi}_{\tau, v}(z)=\frac{{\left\Vert {r}_{\tau, v}\right\Vert}^2}{\delta^2}{e}^{-\frac{{\left\Vert {r}_{\tau, v}\right\Vert}^2{\left\Vert z\right\Vert}^2}{2{\delta}^2}}\left[{e}^{ir_{\tau, v}.z}-{e}^{-\frac{\delta^2}{2}}\right], $$
(1)

where \( {r}_{\tau, v}={\theta}_{\tau }{e}^{i{\phi}_v} \) indicates the scales and directions of the Gabor function. Following the settings in [26, 30], we set δ = 2π, τ ∈ {0, 1, 2, 3, 4} and v ∈ {0, 1, 2, 3, 4, 5, 6, 7}. As a result, a set of 40D Gabor impulse responses is obtained over 5 scales and 8 directions. The term exp(−δ2/2) in Eq. (1) is subtracted to make the filters independent of the DC value, which renders them robust to brightness variations.
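The following sketch builds such a 40-filter bank. The common convention θτ = π/2^(τ/2) and φv = vπ/8 is assumed for the scales and orientations; the paper does not state these constants explicitly, so they are assumptions here.

```python
import numpy as np

def gabor_bank(ksize=39, delta=2 * np.pi, n_scales=5, n_orients=8):
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = x ** 2 + y ** 2
    bank = []
    for tau in range(n_scales):
        for v in range(n_orients):
            theta = np.pi / (2 ** (tau / 2.0))   # scale theta_tau (assumed)
            phi = v * np.pi / n_orients          # orientation phi_v (assumed)
            r2 = theta ** 2                      # ||r_{tau,v}||^2
            kx, ky = theta * np.cos(phi), theta * np.sin(phi)
            envelope = (r2 / delta ** 2) * np.exp(-r2 * z2 / (2 * delta ** 2))
            # Subtracting exp(-delta^2/2) removes the DC response (Eq. (1)).
            carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-delta ** 2 / 2)
            bank.append(envelope * carrier)
    return np.array(bank)  # 40 complex kernels, each ksize x ksize
```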

In this paper, all gait frames (of dimensions N2 × N1) are convolved with the 40D Gabor filters to extract the energy of walking within a sequence. Afterwards, the magnitude of the sum of the 40D Gabor responses over the different scales and orientations is computed for each silhouette:

$$ {\displaystyle \begin{array}{c}{R}_{SD}(z)=\left|\sum \limits_{\tau}\sum \limits_vI(z)\ast {\phi}_{\tau, v}(z)\right|\\ {}=\left|I(z)\ast \sum \limits_{\tau}\sum \limits_v{\phi}_{\tau, v}(z)\right|,\end{array}} $$
(2)

where I(z) is the input silhouette and “*” is the convolution operator. To reduce complexity, the dimensions of RSD(z) are reduced to [N2/2] × [N1/2] (denoted RSD-dn), where [N2/2] and [N1/2] are the largest integers less than or equal to N2/2 and N1/2, respectively [1]. It has been shown that this downsampling decreases the computational cost without reducing the identification accuracy [1].
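A minimal sketch of Eq. (2) with the downsampling step; by linearity of convolution, the kernels can be summed before a single convolution, as the second line of Eq. (2) suggests.

```python
import numpy as np
from scipy.signal import fftconvolve

def gait_energy_response(silhouette, bank):
    # Sum the 40 kernels first (linearity of convolution), then convolve.
    summed_kernel = bank.sum(axis=0)
    response = fftconvolve(silhouette.astype(float), summed_kernel, mode="same")
    r_sd = np.abs(response)      # |I * sum of Gabor filters|, Eq. (2)
    return r_sd[::2, ::2]        # downsample by 2 in each dimension (R_SD-dn)
```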

In a conventional method such as [30], each GEI is represented by a set of Gabor features similar to Eq. (2), and the existing augmented feature is computed by adding the spatial coordinates. The augmented Gabor features in the spatial domain are written as \( {p}_h={\left[{q}_h^T,{X}_h,{Y}_h\right]}^T\in {\Re}^{\rho } \) (ρ = 42), where qh (h = 1…H) holds the Gabor responses and H = [N1/2] × [N2/2] is the total number of pixels. Clearly, the augmented feature ph carries no temporal information about the gait. Moreover, all spatial coordinates have the same importance in ph, so it cannot provide a proper motion model. To extract spatio-temporal information that is consistent with the human visual system, we first define a patch window and extract the local extrema from it.

Given a sequence of Gabor responses, RSD-dn, over a period, we define a sliding 2D + t window (or patch window) of dimensions wx × wy × wt = w (e.g. 3 × 3 × 3) along x, y and t. By moving this window through the spatial and temporal spaces, the local maxima of the responses are recorded at each step. The resulting patches zptch = (Xi, Yj, tk) are characterized by the locations of the extrema, where 0 ≤ i < [N1/2], 0 ≤ j < [N2/2] and 0 ≤ k < T (T is the gait period). The gait signature is then extracted by computing the distribution of the patches in the 2D + t space, as discussed in the following subsection.
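A sketch of this extraction step, assuming the per-frame responses RSD-dn have been stacked into a T × [N2/2] × [N1/2] array; a point is kept as a patch if it is the maximum of its w-neighborhood.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_patches(responses, window=(3, 3, 3)):
    # responses: stacked R_SD-dn over a period, shape (T, N2//2, N1//2).
    local_max = maximum_filter(responses, size=window, mode="nearest")
    t, y, x = np.nonzero((responses == local_max) & (responses > 0))
    return np.stack([x, y, t], axis=1)  # z_ptch = (X_i, Y_j, t_k)
```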

2.3 Patch probability distribution

Given the recorded extrema in the 2D + t walking space of an individual, we represent the gait by the probability density function of the patches over a sequence, since this statistical information can be used to describe the gait. We formulate the probability distribution by a set of m-bin histograms in the spatial and temporal spaces. Let b : ℜ → {1⋯m} be a function that assigns a coordinate of the input patches, zptch, to the bin b(zptch). The set of X-Y-T histograms, \( {q}_u,u=1\cdots m \), related to the locations of the 2D + t patches, is computed as follows:

$$ {\displaystyle \begin{array}{cc}{q}_{u,x}={C}_x{\sum}_{i=1}^{\left[{N}_1/2\right]}\delta \left[b\left({X}_i\right)-{u}_x\right]& {u}_x=1\cdots \left[{N}_1/2\right]\\ {}{q}_{u,y}={C}_y{\sum}_{j=1}^{\left[{N}_2/2\right]}\delta \left[b\left({Y}_j\right)-{u}_y\right],& {u}_y=1\cdots \left[{N}_2/2\right]\\ {}{q}_{u,t}={C}_t{\sum}_{k=1}^T\delta \left[b\left({t}_k\right)-{u}_t\right]& {u}_t=1\cdots T\end{array}} $$
(3)

where Cx, Cy and Ct are normalization coefficients such that \( {\sum}_{u=1}^m{q}_u=1 \). From these histograms, three distinct distributions corresponding to the horizontal, vertical, and temporal directions are obtained. Figure 2a shows the process of computing the spatio-temporal histograms: each moving window in the 2D + t space has a local maximum whose coordinates are assigned to a bin and accumulated in the corresponding histograms. The locations of the extrema over the full 2D + t walking space together represent the distribution of the spatio-temporal patches for each person. Figure 2b, c and d show the sets of X, Y and T histograms for three different people, each with a unique color bar. Figure 3a plots the locations of the twenty largest histogram values (from Fig. 2b, c and d), visualizing how the local patches of different people are distributed in the 2D + t space. The base template in Fig. 3 is the mean of the 40D Gabor features (RSD(z) in Eq. (2)) over a period, and the circles denote the patches, with larger radii marking larger histogram values. The 2D projection of the 2D + t patches onto the X-Y plane is shown in Fig. 3b.
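A sketch of Eq. (3), assuming one bin per coordinate value (so m equals the coordinate range, matching the bin ranges written in Eq. (3)):

```python
import numpy as np

def patch_histograms(patches, width, height, period):
    x, y, t = patches[:, 0], patches[:, 1], patches[:, 2]
    q_x = np.bincount(x, minlength=width).astype(float)
    q_y = np.bincount(y, minlength=height).astype(float)
    q_t = np.bincount(t, minlength=period).astype(float)
    # C_x, C_y, C_t normalize each histogram to sum to one.
    return q_x / q_x.sum(), q_y / q_y.sum(), q_t / q_t.sum()
```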

Fig. 2

(a) The process of computing X-Y-T histograms based on the locations of the local extrema. (b), (c) and (d) The sets of horizontal, vertical and temporal histograms for different individuals, shown in blue, green and red color bars

Fig. 3

The 2D + t locations of the twenty top values of the X-Y-T histograms for three different people. (a) 3D view and (b) 2D projection of the locations

Larger radii in Fig. 3 correspond to higher densities of local patches in the walking space. It is clear from Figs. 2 and 3 that a set of X-Y-T histograms, used as a gait signature, can properly describe the style of walking. In other words, for each walking condition there are some STIP whose probability in the 2D + t space accurately characterizes the motion. It should also be noted that most of the patches are localized on the legs, hands and shoulders, marking the salient regions for recognition. In the next subsection, the computed histograms are used to form the augmented Gabor template.

2.4 Formatting augmented template

Once the histograms of the patches have been generated, the final template can be computed. To do this, we perform two processing steps: 1) applying the proper filtering to the GEI templates, and 2) adding the statistical distribution of the patches to the template.

In the first step, Gabor filters have been found to be an efficient filtering for representing the gait templates [30]. By applying the 40D Gabor functions to the GEI, we obtain a set of 40D Gabor responses describing the GEI templates [9, 26]. Here, the magnitude part of the 40D Gabor responses is downsampled and used for the proposed descriptor:

$$ {R}_G(z)=\left| GEI(z)\ast {\phi}_{\tau, v}\right|\in {\Re}^{\left[{N}_1/2\right]\times \left[{N}_2/2\right]\times 40} $$
(4)

Then, the augmented Gabor feature can be obtained by adding the probability distribution of the 2D + t patches to RG(z). For this purpose, the histogram values are concatenated with the responses of the Gabor filters, RG(z). However, to properly account for the distribution of the patches in the final feature vector, we use the histogram values to weight the spatial and temporal coordinates.

Let \( {\left\{{z}_h^{\ast }=\left({x}_h^{\ast },{y}_h^{\ast}\right)\right\}}_{h=1\cdots H} \) be the normalized pixel locations, centered at the center of the silhouette. Likewise, let \( {\left\{{t}_k^{\ast}\right\}}_{k=1\cdots T} \) be the normalized time stamps, centered at the middle of the period (i.e. T/2). An isotropic kernel, k(‖z‖), is further defined to assign smaller weights to pixels farther from the center. We also denote the vectorized Gabor responses by \( {R}_{G-c}^T={\left\{{q}_h^T\right\}}_{h=1\cdots H}\in {\Re}^{40\times H} \) (H = [N1/2] × [N2/2]). With the above assumptions, the spatial part of the proposed augmented feature vector is computed as follows:

$$ {G}_{aug\_s}={\left[{R}_{G-c}^T,{q}_{u,x}k\left(\left\Vert {x}_h^{\ast}\right\Vert \right),{q}_{u,y}k\left(\left\Vert {y}_h^{\ast}\right\Vert \right)\right]}^T, $$
(5)

where \( {G}_{aug\_s}\in {\Re}^{42\times H} \) is the feature vector derived by adding the weighted spatial information, \( {q}_{u,x}k\left(\left\Vert {x}_h^{\ast}\right\Vert \right) \) and \( {q}_{u,y}k\left(\left\Vert {y}_h^{\ast}\right\Vert \right) \), to the Gabor features. The proposed PGF is computed by concatenating the temporal information to Gaug _ s in Eq. (5):

$$ {G}_{aug\_ st}={\left[{R}_{G-c}^T,{q}_{u,x}k\left(\left\Vert {x}_h^{\ast}\right\Vert \right),{q}_{u,y}k\left(\left\Vert {y}_h^{\ast}\right\Vert \right),{q}_{u,t}k\left(\left\Vert {t}_h^{\ast}\right\Vert \right)\right]}^T, $$
(6)

where \( {q}_{u,t}k\left(\left\Vert {t}_k^{\ast}\right\Vert \right) \) represents the weighted (and interpolated) values of the pixels in the temporal domain within a period.

In this paper, uniform kernels are applied in Eq. (6), although other types of kernels can be used instead for the spatial and temporal weighting. The proposed PGF (i.e. \( {G}_{aug\_ st}\in {\Re}^{43\times H} \)) contains the spatio-temporal information of walking over a period of time. To compute the PGF, the weighted spatio-temporal coordinates in the 2D + t walking space are concatenated to the set of 40D Gabor responses, where the weights are obtained from the probability distribution of the 2D + t patches in a period of walking. In the next section, we discuss how to identify this pattern from the given features.
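Putting Eqs. (5)-(6) together, the following sketch assembles the augmented template, assuming a uniform kernel k(·) = 1 and a simple linear interpolation of the temporal histogram onto the pixel grid; the paper does not specify the interpolation scheme, so that part is an assumption.

```python
import numpy as np

def build_pgf(R_G, q_x, q_y, q_t):
    # R_G: downsampled 40-channel Gabor responses of the GEI, (40, h, w).
    n_ch, h, w = R_G.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_star = xs - w / 2.0                    # centered pixel coordinates
    y_star = ys - h / 2.0
    g = R_G.reshape(n_ch, -1)                # vectorized responses R_G-c
    wx = (q_x[xs] * x_star).reshape(1, -1)   # q_{u,x} k(||x*||), with k = 1
    wy = (q_y[ys] * y_star).reshape(1, -1)   # q_{u,y} k(||y*||)
    # Interpolate the weighted, centered time stamps onto the H pixels.
    t_star = (np.arange(len(q_t)) - len(q_t) / 2.0) * q_t
    wt = np.interp(np.linspace(0, 1, h * w),
                   np.linspace(0, 1, len(q_t)), t_star).reshape(1, -1)
    return np.vstack([g, wx, wy, wt])        # G_aug_st in R^(43 x H)
```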

3 Classification

The augmented Gabor features are calculated for all individuals in the given dataset. The conventional way to recognize the gait is to assign each feature from the probe set to its closest feature in the gallery set through a direct matching process. For example, the 1-Nearest Neighbor (1-NN) classifier is a simple tool that assigns the label of the class with the minimum Euclidean distance. In this section, we briefly discuss the common classification tools developed for gait recognition.

Despite the simplicity of the direct matching process, it has some limitations [6]: (1) the gait samples are commonly taken under similar conditions, so the features may overfit; (2) the number of patterns in the training space is small and limited, so the gait features cannot truly describe the topology of the underlying walking space; and (3) the dimensionality of the feature space is much higher than the number of training samples, since each pixel of the input pattern is treated as a dimension.

The above restrictions are known as the “Under Sample Problem” (USP) [6, 9, 29]. Several methods have been proposed to address this training limitation. A simple strategy is to increase the amount of training data by producing extra templates; for example, the number of features in the GEI and CGI methods has been increased by producing synthetic templates [11, 29]. A more efficient solution to the USP, however, is to use dimensionality reduction techniques [6], for example a two-step technique that optimizes the search space. Here, we briefly review the common classification methods applied in this paper for the proposed gait classification. Principal component analysis (PCA) followed by linear discriminant analysis (LDA), known as PCA + LDA, has been found to be a useful and efficient approach in face and gait classification [10, 29]. In this approach, PCA is first applied to project the given template into a low-dimensional feature space; LDA is then applied to eliminate redundant information in the feature space. The PCA + LDA transformation matrix is computed on the gallery feature space, and the features of the probe set are projected accordingly. In other words, both the gallery and probe sets yield feature vectors of reduced dimension that can be used for classification [6, 11, 29]. To classify the gait patterns, the projected probe feature vector with the minimum Euclidean distance to the gallery set determines the final label [6].
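A minimal sketch of this PCA + LDA pipeline with 1-NN labeling, using scikit-learn; the feature matrices are (samples × features) and the number of PCA components is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pca_lda_classify(gallery, gallery_labels, probe, n_pca=150):
    # Step 1: PCA learned on the gallery projects both sets.
    pca = PCA(n_components=n_pca).fit(gallery)
    # Step 2: LDA sharpens class separation in the reduced space.
    lda = LinearDiscriminantAnalysis().fit(pca.transform(gallery), gallery_labels)
    g = lda.transform(pca.transform(gallery))
    p = lda.transform(pca.transform(probe))
    # 1-NN: assign each probe the label of its nearest gallery sample.
    dists = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=2)
    return np.asarray(gallery_labels)[dists.argmin(axis=1)]
```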

Although the PCA + LDA method can efficiently reduce the dimension of the feature space, it has some limitations in discriminating the covariate factors, mainly because the structure of the gait templates is lost in the projection process. To provide efficient tools that can discriminate the gait variants in the feature space, the Random Subspace Method (RSM) has been developed recently [9]. It has been shown that RSM provides better gait recognition rates than PCA + LDA under different gait covariates [7, 9]. More precisely, RSM supersedes conventional methods by preserving the structure of the given template and generating weak classifiers through random sampling of the feature space [7]. RSM consists of three main steps: creating random subspaces from the 2DPCA projection matrix, enhancing the local dimensions, and making a hybrid decision. In the first step, the 2DPCA of the input feature vectors is computed and K (e.g. K = 1000 in this paper) random eigenvectors (projection matrices) are taken accordingly. Next, the dimensionality of the randomly selected vectors is enhanced by applying another dimensionality reduction technique; in RSM, two different methods, 2DLDA and the incremental dimension reduction algorithm via QR decomposition (IDR/QR), are used alternatively. It should be noted that in IDR/QR, the feature vectors are vectorized within the process. As a result, two different classifiers are defined in RSM, 2DPCA + 2DLDA and 2DPCA + IDR/QR, to label the features in the K random subspaces. The hybrid decision-level fusion (HDF) over the K classifiers is derived by majority voting of all the classifiers in the subspaces for each individual [9].

Assume that there are nG individuals in the gallery set, each with gi (i = 1⋯nG) feature vectors. Also, suppose that Gaug,i is a feature vector of the ith individual in the gallery set and Paug, j (j = 1⋯nP) is a feature vector of an individual in the probe set. For each feature set, there are K random subspaces with sample means \( {m_G}_i^k\left(i=1\cdots {n}_G\right) \). Based on the Euclidean distance between the centers of the gallery/probe feature vectors in the low-dimensional subspace, the distance between the input probe set (Paug,j) and the given gallery (Gaug,i) in each subspace is measured as:

$$ \mathrm{D}\left({P_{aug,}}_j^k,{G_{aug,}}_i^k\right)=\frac{1}{n_P}\sum \limits_{j=1}^{n_P}\left\Vert {P_{aug,}}_j^k-{m_G}_i^k\right\Vert, i=1\cdots {n}_G, $$
(7)

In RSM, for the kth subspace, the minimum distance of the probe set to the individuals in the gallery set is taken as the weak classifier:

$$ {l}^k\left({P}_{aug}^k\right)=\underset{i}{\arg \min}\mathrm{D}\left({P_{aug,}}_j^k,{G_{aug,}}_i^k\right),i=1\cdots {n}_G $$
(8)

where \( {l}^k\left({P}_{aug}^k\right) \) is the label of the given probe individual in the kth subspace. To make the hybrid decision for the input template, the mode of the labels over all subspaces is computed. In RSM, two types of decisions are obtained, based on the 2DLDA and IDR/QR techniques. Let lLDA and lQR be the labels computed by the two branches of RSM. A hybrid classifier (HC) is defined by fusing the final decisions of the two methods as follows:

$$ {l}_{HC}\left({P}_{aug}\right)=\left\{\begin{array}{ll}{G}_{aug,i},& \mathrm{if}\kern0.5em {l}_{LDA}\left({P}_{aug}\right)={G}_{aug,i},\\ {}{G}_{aug,i},& \mathrm{if}\kern0.5em {l}_{QR}\left({P}_{aug}\right)={G}_{aug,i},\\ {}0,& \mathrm{otherwise}\end{array}\right.\kern2em i=1\cdots {n}_G $$
(9)

In other words, the final classifier assigns the label correctly if either of the given classifiers makes the correct final decision [7, 9]. In the following section, we present the experimental results of the proposed gait recognition systems, denoted PGF + PCALDA and PGF + RSM.
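A sketch of the decision rule in Eqs. (7)-(9), assuming each branch supplies a (K × nP × nG) array of distances of the probes to the gallery class centers in the K random subspaces; their computation via 2DPCA + 2DLDA and 2DPCA + IDR/QR is omitted here.

```python
import numpy as np

def branch_labels(dist):
    # dist: (K, n_P, n_G). Each subspace votes for the nearest gallery
    # class (Eq. (8)); the branch label is the mode of the K votes.
    votes = dist.argmin(axis=2)                       # (K, n_P)
    return np.array([np.bincount(v).argmax() for v in votes.T])

def hybrid_correct(l_lda, l_qr, truth):
    # Eq. (9): the hybrid classifier is counted correct whenever either
    # branch returns the true gallery label [7, 9].
    return (np.asarray(l_lda) == truth) | (np.asarray(l_qr) == truth)
```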

4 Results and discussion

In this section, PGF is evaluated through comprehensive experiments on three publicly available datasets: USF [23], CASIA (Dataset B) [33] and OU-ISIR (Dataset B) [20]. To fully assess the proposed method, we compare the results with recent approaches based on different classification techniques. First, the nearest neighbor classifier is applied to the augmented Gabor features (PGF + 1NN). Second, to make a fair comparison, the PGF recognition system is evaluated in two ways: PGF + PCALDA is compared with the template features that use a simple classifier, and PGF + RSM is compared with the methods that use advanced classifiers. The methods selected for evaluation are the baseline algorithm [23], 2DLDA [16], GTDA [26], Gabor+GTDA [26], DATER [32], MTP [3], GEI [11], CGI [29], PGM [12], GEI + Sparse bilinear discriminant analysis (SBDA) [15], GSI [6], 5/3GI [1], STIP [14], locality-constrained group sparse representation (LGSR) [30], Gabor+RSM [9], GEI + RSM [9], view-invariant multiscale gait recognition (VI-MGR) [4], the local patch-based subspace ensemble learning algorithm (LPSELA) [19], the gait spatio-temporal image (GSTI) [7] and the Two-Stream Generative Adversarial Network (TS-GAN) [28]. We consider the “Rank1” and “Rank5” Correct Classification Rates (CCR) as two common benchmarks for performance evaluation. In each experiment, the highest Rank1 and Rank5 values are shown in bold.

4.1 Gait conditions

As discussed earlier, we evaluate the proposed method on three well-known datasets: the USF HumanID dataset (version 2.1) [23], the CASIA dataset (Dataset B) [33] and OU-ISIR (Dataset B) [20]. Most of the results in this section are reported on the USF dataset, owing to its popularity, challenging gait conditions, and the noisier quality of its silhouettes.

The USF dataset consists of 122 individuals walking along elliptical paths in front of the camera. Gait conditions include walking surface (S), shoe type (H), viewing angle (V), carrying condition (C) and elapsed time (T). Given these conditions, the set “Grass, shoe type A, without briefcase, time t1 (in May)” was taken as the gallery set, and 12 unique testing conditions were defined for the probe sets.

The list of conditions for the probe sets, with the number of individuals per experiment, is shown in Table 1. In the USF dataset, the silhouette sequences with the walking periods are provided. To extract the silhouettes, background subtraction is applied to the raw input video; the foreground images are then aligned to the center of the images and cropped to 128 × 88 pixels.

Table 1 The specification of probe sets in USF gait dataset

As Table 1 shows, the probes contain different numbers of subjects. We therefore define the weighted average identification rate (W–AvgI) as:

$$ \mathrm{W}\hbox{--} \mathrm{AvgI}=\frac{\sum_{i=1}^g{w}_i{R}_i}{\sum_{i=1}^g{w}_i}, $$
(10)

where wi denotes the number of people in the ith probe set, g denotes the number of experiments (g = 12 in Table 1), and Ri is the Rank1/Rank5 recognition rate of the ith experiment.
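A one-line sketch of Eq. (10), with the weights taken as the subject counts per probe:

```python
import numpy as np

def weighted_avg_identification(rates, subjects_per_probe):
    # Eq. (10): recognition rates R_i weighted by probe sizes w_i.
    w = np.asarray(subjects_per_probe, dtype=float)
    return float(np.dot(w, rates) / w.sum())
```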

Another benchmark is the CASIA dataset (Dataset B) [33]. This dataset consists of 124 individuals, recorded under 10 different indoor conditions: six sets of natural walking (NM01-NM06), two sets of walking with a carried bag (BG01-BG02), and two sets of walking with a clothing condition (CL01-CL02).

Each walking sequence is captured from 11 viewing angles, from 0 to 180 degrees at intervals of 18 degrees. Since the proposed augmented Gabor filter has been designed for the 90° viewing condition, the results for that angle are presented. Both the USF and CASIA datasets provide lateral views of walking people, but only the USF dataset provides aligned silhouettes. Accordingly, the CASIA silhouettes are aligned on their horizontal center and then cropped to 160 × 100 images (matching the CGI conditions [29]).

Our third benchmark is the OU-ISIR-B dataset [20], which was published recently to study the effect of clothing conditions. This dataset includes 48 individuals walking on a treadmill with 32 types of different clothing. Table 2 presents the gallery and probe sets corresponding to the clothing combinations provided in the database. The gallery set includes 48 individuals with standard clothes (type 9), while the probe set consists of 856 gait sequences with the other 31 clothing conditions. The silhouettes in the OU-ISIR-B dataset are aligned on their horizontal centroid and cropped to 128 × 88 images (matching the USF gait silhouette conditions).

Table 2 Clothing conditions in the OU-ISIR-B dataset

4.2 Testing conditions

To illustrate how the proposed patch-based features identify the type of walking, we first evaluate the effect of different settings on the recognition performance. To compute a PGF, the size of the patch window is set to 3 × 3 × 3 (i.e. 3) pixels; however, other sizes could be used to describe the walking style in a sequence. Here, we repeat the experiments with five window sizes, {3, 5, 7, 9, 11}, using the 1-NN classifier. The Rank1/Rank5 CCR on the USF dataset is provided in Fig. 4. The results are almost invariant to the window size, although the best performance is achieved with a size of 3, mainly because a smaller window better preserves the local information.

Fig. 4
figure 4

Effect of the patch window size on the recognition performance of the proposed PGF

Given the PGF template with a patch size of 3 pixels, another setting in computing the PGF is the use of the weighted coordinates (step 3 in Fig. 1). To study the effect of weighting the coordinates, we compute the PGF under four different weighting conditions: 1) weighting the coordinates in the 2D + t domain (our approach); 2) weighting only in the spatial domain (leaving the T coordinate unweighted); 3) weighting only in the temporal domain (leaving the X-Y coordinates unweighted); and 4) no weighting in the spatio-temporal domain (X-Y-T coordinates unweighted). The Rank1/Rank5 results of these PGF settings with the RSM classifier on the USF dataset are shown in Fig. 5.

Fig. 5
figure 5

Evaluation of the effect of weighting the coordinates on performance on the USF dataset

The results of the four experiments are shown in Fig. 5 in blue, red, black and green, respectively. Table 3 presents the averages over the testing conditions shown in Fig. 5.

Table 3 The average of Rank1 and Rank5 for different testing conditions on USF dataset

From Fig. 5 and Table 3 we observe the following: 1) In 5 out of 12 probes (A, B, H, I and J), the Rank1 and Rank5 rates of the testing conditions are similar, so the weighting of the coordinates is not important in these probes. 2) In the harder conditions (C, D, E, F, G, K and L), the Rank1 and Rank5 rates of the proposed weighting improve by up to 17% (Rank5 of probe F). In other words, our patch-based descriptor with the proposed weighting performs better under the “walking surface” and “elapsed time” conditions (probes D, E, F, G, K and L). 3) The averages of Rank1 and Rank5 for the proposed weighting improve by about 4%. The worst case is the second setting, where the pixels are weighted only in the spatial domain (and not in the temporal domain), which means the proposed gait recognition system is very sensitive to temporal information. However, weighting the pixels only in the temporal domain (and not in the spatial domain) also decreases performance. In conclusion, the quality of the features improves when pixels weighted in both the spatial and temporal domains are added to the augmented template.

4.3 Gait templates

To study the effect of the proposed template on the gait recognition performance, we first apply a simple classifier, i.e. the 1-NN classifier. Using a simple classifier, without PCA + LDA or RSM, provides a useful way to analyze the robustness of the input feature vectors against noise and other covariates.

Here, the 1-NN classifier is applied on the USF and CASIA datasets to evaluate the quality of the algorithm. In addition, the results of our PGF are compared with recently published methods (including the baseline algorithm, GEI, CGI, 5/3GI, STIP, VI-MGR, GSI and GSTI). The evaluation results on the USF dataset are provided in Table 4.

Table 4 Performance evaluation of different gait features on USF dataset using 1-NN classifier

From the Rank1 CCR in Table 4, the performance can be summarized as follows. 1) The proposed template (PGF + 1NN) has the best weighted average of Rank1. 2) Compared with the baseline algorithm and GEI, the Rank1 accuracy is improved in all conditions except K and L. 3) The proposed patch-based feature improves the Rank1 CCR in 7 out of 12 experiments compared with the other related methods. Although the proposed feature is vulnerable to the “walking surface” and “elapsed time” conditions (probes D, E, F, G, K, and L), it still performs acceptably; likewise, the other methods in Table 4 are weak in these conditions. 4) For Rank5, our method has the highest CCR in the weighted average and in 7 out of 12 conditions. The Rank5 performance of PGF + 1NN under the “walking surface” and “elapsed time” conditions is improved over Rank1 and is thus acceptable for recognition.

In addition, the quality of the images in CASIA is considerably better than that of the silhouettes in the USF dataset. For each individual, there are only one or two gait periods in CASIA, which reduces the number of PGF samples in each sequence; hence, the number of computed templates for correct classification is limited [6, 26]. Nevertheless, the CASIA dataset provides a good benchmark for evaluating the quality of the generated template using the 1-NN classifier. To make a fair comparison with other methods [1, 4, 7, 11, 14, 28, 29] on CASIA, we take each condition as the training set (gallery) and use all nine others as the testing sets (probes), yielding 90 (10 × 9) different combinations of gallery and probe sets for evaluation with the 1-NN classifier. To visualize the 90 unique tests, they are divided into three main groups (nm, bg, cl), and the results are averaged within each group.

Table 5 shows the Rank1 CCR of the proposed PGF + 1-NN under the various conditions. The results in Table 5 are organized in two sections: 1) the related template-based algorithms (GEI [11], CGI [29], 5/3GI [1] and GSTI [7]) in the first section, and 2) the recent multi-view methods (STIP [14], VI-MGR [4] and TS-GAN [28]) in the second section. Only the Rank1 results of the CASIA conditions are presented in Table 5.

Table 5 Evaluation of the Rank1 CCR of the proposed method on the CASIA dataset using the 1-NN classifier

Compared with the methods in the first section of Table 5, the PGF method has a better average Rank1. More precisely, in the bg-bg, bg-cl and cl-cl conditions (columns 4, 5, and 6), the results improve upon the template-based approaches. Furthermore, the results of our PGF remain comparable with the methods in the other section of Table 5: the proposed method outperforms the STIP method in 4 out of 6 conditions, and the PGF results are competitive with the VI-MGR approach in the bg-bg and cl-cl conditions (columns 4 and 6). VI-MGR achieves higher accuracy owing to its use of synthetic templates and a better classifier (subspace learning) [4, 9]. The total average of the Rank1 CCR in Table 5 confirms that the proposed feature performs acceptably on the CASIA dataset (compared with TS-GAN [28]).

The performance of PGF on the USF and CASIA datasets verifies the effectiveness of our method in representing gait under different conditions. Indeed, the patch-based features improve the accuracy in most of the experiments.

4.4 Performance on clothing covariates

As discussed in subsection 4.1, the OU-ISIR-B dataset [20] was collected from people walking on a treadmill while wearing different types of clothes. It should be noted that the most common covariate affecting the rhythm of walking is the use of different clothing; however, in USF such variants are present only in probes K and L, which contain few people. As shown in Table 2, the OU-ISIR dataset consists of 32 clothing combinations. To cover all possible testing conditions, it provides an additional gallery set that helps in developing a clothing-invariant recognition system [8, 20]; no pre-defined training set is needed, which could otherwise lead to unrealistic performance. To evaluate the performance of the gait recognition system on the OU-ISIR-B dataset, we compare the Rank1 results on the 31 cloth types (the probe sets) of the proposed PGF with Clothing-Invariant Gait Recognition (CLIGR) [8], GSTI [7], VI-MGR [4] and the Local Feature at the Bottom Layer Based on Capsule Network (LBC) [31]. The Rank1 results for the different probes are shown in Fig. 6. To make the comparison fair, we apply the RSM classifier to PGF and compare the results with the other state-of-the-art methods.

Fig. 6

Performance of the proposed gait recognition system for 31 probe types in OU-ISIR-B dataset

From Fig. 6, it is clear that the proposed method outperforms the other methods in Rank1 under most challenging conditions. More precisely, in 2 out of 31 conditions (probes H and M), PGF has a lower Rank1 than the CLIGR [8] and LBC [31] methods; however, the performance of the other methods is also quite low in these conditions. In probes H and M, the people wear a “Down Jacket (DJ)”, which corrupts the body parts from which the patches are extracted; as a result, the local patches in PGF fail to accurately locate the most important regions. From the results in Fig. 6, we can see that the performance of PGF falls below 60% in probes H, M, R and V, mainly because the body parts in these probes are covered by a “DJ” or a “Rain Coat (RC)”. Furthermore, in probes B and C, the people wear “Regular Pants (RP)” with a “DJ”, and PGF still achieves acceptable accuracy compared with the other methods. In probes B and C, the upper body is covered with unusual clothing while the lower body is covered with standard clothing; in such conditions, the proposed method extracts enough information from the lower body to improve the recognition accuracy. The performance of PGF in Fig. 6 verifies that our method is robust when the upper-body clothing changes (probes B and C) and weak under unusual lower-body clothing (probes H, M, R and V). Still, in comparison with the other methods, our PGF achieves better accuracy in most conditions.

The weighted averages of Rank1 for the methods of Fig. 6 are listed in Table 6. Again, our method outperforms the conventional methods on the OU-ISIR-B dataset, mainly owing to the local information of the walking process captured in PGF.

Table 6 Evaluation of the weighted average Rank1 CCR (%) of the methods on the OU-ISIR-B dataset

4.5 Recognition system

In this subsection, the proposed gait recognition system is compared with recent methods on the USF HumanID dataset. To make the comparisons fair, we evaluate the results against two sets of gait recognition systems: 1) comparing PGF + PCALDA with the templates that use a simple or similar classifier, and 2) evaluating PGF + RSM against the related methods that apply advanced classifiers.

Bold values indicate results that outperform the previous works.

The Rank1 and Rank5 CCR, together with the weighted average results (Eq. (10)), of the methods in the first category (using a simple classifier) are summarized in Tables 7 and 8. The weighted average results (W–AvgI) of Rank1 and Rank5 are competitive with the other related methods. Additionally, the proposed patch-based method represents gait well in most conditions, as reflected in the improved Rank1/Rank5 recognition rates. Specifically, in 4 out of the 12 conditions (probes A, C, D and E), our Rank1 CCR is the highest, while in the other testing conditions our accuracies are close to the highest. Our method is only vulnerable to the walking surface and timing variants (probes D, E, F, G, K and L), where the other methods also perform weakly.

Table 7 Comparisons of Rank1 CCR (%) of different templates using a simple classifier on USF dataset
Table 8 Comparisons of Rank5 CCR (%) of different templates using a simple classifier on USF dataset

In the normal gait conditions (probes A, B, and C), the Rank1 results provide better accuracies than the recent spatio-temporal templates (e.g. CGI, GSI and 5/3GI). The Rank5 performance of the proposed PGF in Table 8 remains competitive under similar testing conditions. Summarizing the Rank1/Rank5 accuracies, our method supersedes the conventional template-based methods (such as CGI, GSI and 5/3GI) by exploiting the spatio-temporal local patches of a given sequence.

So far, a two-step classification method (PCA + LDA) has been utilized to evaluate the proposed PGF. However, advanced classifiers such as ensemble methods [7, 9, 19] or sparse-based classifiers [15, 30] can be used to improve the performance. To verify the effectiveness of our feature template, we compare the Rank1 and Rank5 results of PGF + RSM with the other related methods in Tables 9 and 10. From Table 9, we can see that the proposed recognition system provides competitive results: the average Rank1 of our method exceeds the other recognition rates except for Gabor+RSM [9]. More precisely, in 6 out of 12 conditions (probes A, B, C, H, I and J), the Rank1 accuracy is the same as or better than Gabor+RSM, and in the remaining 6 probes (D, E, F, G, K and L), our results are still competitive. The Rank1 results in Table 9 indicate that our method is robust under briefcase conditions but less able to locate the patches under the surface and clothing conditions of the USF dataset, where the main parts of the body (including the legs) are covered while a person walks on grass or wears different clothing. The Rank5 recognition rates of our method listed in Table 10 also outperform the CCR in most conditions. Moreover, in 3 out of 12 conditions (probes A, H and I), PGF + RSM recognizes all of the given individuals in the probe set, and in the other testing conditions its Rank5 CCR is near the best value. Indeed, Tables 9 and 10 show that our patch-based feature represents gait more effectively than the conventional patch-based methods, i.e. LGSR [30] and LPSELA [19]. The experimental results in Tables 9 and 10 verify that using an advanced classifier yields a robust gait recognition system, improving the average Rank1 by about 5% with respect to the other methods (LGSR [30], GEI + RSM [9], VI-MGR [4] and LPSELA [19]). Computing the PGF requires only a small number of parameters (such as the size of the patch window), and the recognition rate is not strongly dependent on them. As a result, PGF provides a useful representation tool for describing gait under more challenging and complex conditions.

Table 9 Comparisons of Rank1 CCR (%) of the approaches using advanced classifier on USF dataset
Table 10 Comparisons of Rank5 CCR (%) of the approaches using advanced classifier on USF dataset

4.6 Computational complexity

In this section, the computational complexity and required memory to calculate PGF are evaluated. The timing complexity is also compared with two recent approaches (i.e. CGI [29] and GSTI [7]).

Of the three main parts of the proposed approach (Fig. 1), the first and third steps take the most processing time, because Gabor filters are computed from the silhouettes, whereas the second step only computes the local extrema and the corresponding probability distributions. Since filtering the silhouettes dominates the cost of locating the extrema, the time/memory of the filtering steps dominate the complexity of our method. In the first step, a set of 40D Gabor filters is computed over the T frames of a gait period. If the time complexity of each filter is denoted O(IGfilt), then the time complexity of the first step is O(40TIGfilt).

However, to reduce complexity, the filter responses are downsampled to ¼ of the original size, so the memory consumption is of the order of O(10TIGfilt). Under the same assumptions, the computational complexity of the third step is approximately O(40IGfilt). As a result, the timing complexity of the entire method is of the order of O(40IGfilt) + O(40TIGfilt) ≈ O(40(T + 1)IGfilt).

Given the timing complexity of computing one PGF, the complexity over the entire gallery and probe sets is O(40(T + 1)(ngl + npr)IGfilt), where ngl and npr are the numbers of sequences in the gallery and probe sets, respectively. If the size of the input image is W × H and the size of the Gabor kernel is w × h, the computational overhead of a typical filtering is of the order O(IGfilt) ≈ O(WHwh) [6]. Under this assumption, the overall complexity of all the PGF templates is of the order O(40(T + 1)(ngl + npr)WHwh).

In the CGI method [29] with k color channels (k = 3), the computational overhead is of the order O(k(ngl + npr)TWH) [6, 29], and the GSTI computational overhead is of the order O(4(ngl + npr)TWHwh) [6]. As a result, the complexity of the proposed algorithm increases by a factor of O((40/k)((T + 1)/T)wh) compared to CGI; with k = 3, the increase is approximately O(13wh). Meanwhile, in comparison with GSTI, the timing complexity increases by approximately O(40((T + 1)/T)) ≈ O(40). This overhead is not prohibitive, since the complexity can be alleviated with fast filtering techniques. In practice, computing a PGF for one individual takes 820 ms using MATLAB 8.3.0 (2014a) on an Intel (R) Core (TM) i7 processor with 8 GB RAM running at 2.39 GHz for the USF dataset.

Regarding memory consumption, a similar argument shows that most of the memory usage of the proposed approach is related to the first and third steps, which require about O(10TIGfilt) and O(10IGfilt) memory, respectively. The memory usage for the convolution of each Gabor filter is of the order O(IGfilt) ≈ O(WH + wh) [6]. Under this assumption, the required memory of the proposed approach is of the order O(10(T + 1)(WH + wh)), and for all silhouettes in the given dataset it is O(10(T + 1)(WH + wh)(ngl + npr)). Assuming that the input silhouettes in the USF dataset are 128 × 88, the Gabor kernels are 39 × 39, the average period in the dataset is T ≈ 30, and ngl + npr = 1080, the total memory needed to calculate the proposed patch feature is nearly 3.6 gigabytes.

It should be noted that simple classifiers (e.g. 1-NN and PCA + LDA) have been used for evaluating the PGF templates; using an advanced classifier such as the ensemble methods [7, 9] obviously incurs extra computational cost and memory in the recognition process.

5 Conclusion

In this paper, a new feature template has been proposed for human gait modeling that extracts the spatio-temporal information of walking. The local patches are extracted over a walking sequence, and the probability distributions of the patches in the spatial and temporal domains are taken as the gait signature. The corresponding coordinates in the 2D + t domain are then weighted according to the local patch probabilities. The final augmented template, called PGF, is derived by concatenating a set of Gabor-based templates with the weighted coordinates in the spatial and temporal domains. The proposed method provides an efficient gait representation because the local patches are described by their probability distributions (i.e. histograms) over the full 2D + t domain. To build a gait recognition system based on PGF, two well-known classifiers (PCA + LDA and RSM) have been applied to classify the patterns. Extensive simulations and evaluations on the popular datasets demonstrate that our proposed method provides enhanced performance in comparison with other state-of-the-art approaches. Moreover, our method has a reasonable computational overhead, which makes it suitable for different gait conditions, although enough memory must be available to generate and store the responses of the Gabor filters. The proposed PGF template thus offers efficient patch-based features for handling challenging conditions in real-life scenarios.