1 Introduction

Biometrics has emerged as an important research area in recent years owing to its vast applications in fields such as surveillance and authentication. A number of human physiological characteristics, e.g., face, iris, fingerprints and deoxyribonucleic acid (DNA), have been shown to be unique and can therefore be used for individual identification [28, 42]. Lately, the gait biometric, which refers to the way a person walks, has received the attention of the research community for human identification. Gait differs from other biometric modalities that require the person to interact with the system: fingers must be placed on a scanner to capture their prints, the face must be in a specific position in front of a camera, and DNA analysis requires a sample of blood or skin. Gait, on the other hand, requires no interaction with the imaging device. It works unobtrusively, and individuals are generally unaware that they are being captured by a camera installed at a concealed location. Moreover, gait-based human identifiers are less susceptible to noise than other biometrics, which is why they perform well even on low-quality and low-resolution videos. Gait may not be as precise and powerful a biometric as some other modalities, but the aforementioned characteristics make it an ideal choice for visual surveillance applications.

The existing gait recognition algorithms are generally categorized into two types: model-based and model-free approaches. In the former category, the algorithms use human motion, body structure, and joint positions to track and recognize individuals [57]. Structural models are usually generated based on prior knowledge of the shape of the human body [7, 23]. Motion models [3, 10] exploit the motion parameters of human body parts, such as joint angle trajectories and the rotation patterns of the hip and thigh. The algorithm proposed in [3] computes features from joint locations using elliptic Fourier descriptors and uses them to recognize the gait. The authors in [10] modeled a gait feature by exploiting the angular motion of the hip and thigh using the Fourier series. Wang et al. [53] modeled the human body as fourteen rigid parts connected at joint locations and used these locations to obtain joint angle trajectories, which serve as the gait descriptor. Model-based approaches are somewhat robust to occlusion, but they suffer from a number of limitations: their performance depends on accurate torso localization, and they require high-quality gait videos [3, 57]. Moreover, these methods are computationally expensive.

Model-free gait recognition approaches usually extract human silhouettes from the gait video and derive various kinds of information from these silhouette images for identification. Numerous techniques construct a template image from the segmented body silhouettes and use it for person identification, e.g., [1, 8, 17, 24, 30, 37, 41, 51]. The technique proposed in [30], known as the gait energy image (GEI), averages the silhouette images over a gait cycle and uses the result to estimate the person's identity. Several improvements over GEI have also been reported, e.g., [1, 24, 41, 46, 51, 60]. The approaches in [14, 36, 59] extract different gait-related parameters from the silhouette images and use them for person identification. Goffredo et al. [14] computed height and width features from the normalized and scaled silhouette region over time and used them as gait dynamics. In [59], a radial basis function network and deterministic learning are used to classify the height-to-width ratio and contour centroid over time to approximate the individual's gait.

Shape analysis on sequences of silhouettes and projections of silhouettes in different directions have also been explored for gait identification [21, 44, 45, 49, 54]. In the algorithm described in [55], 2D silhouettes are mapped into a 1D normalized distance signal, and the changes in the shape of this signal over time are used to approximate the gait patterns. The methods in [44, 45] project the normalized segmented silhouette region in different directions and take the vector of projection values in each direction as the gait feature. Motion information has also been used extensively for gait recognition, e.g., [6, 9, 20]. Castro et al. [6] computed spatiotemporal cuboids of optical flow from the video sequences and fed them to a convolutional neural network to obtain a high-level gait representation. The authors in [9] exploit spatiotemporal motion characteristics as well as statistical and physical parameters of the silhouettes for recognition.

Fig. 1 Schematic diagram of the proposed gait-based human recognition algorithm

The model-free gait recognition techniques have achieved better results than their model-based counterparts, and they are also computationally efficient. However, the bottleneck in these techniques is precise silhouette segmentation, as poor segmentation can adversely affect the recognition performance [59]. This paper presents a novel spatiotemporal features-based gait representation that uses dense trajectories to model the human gait. We use dense trajectories because they capture the local motion patterns of walking and can be easily computed from the gait video sequences using the optical flow field. The extracted motion information is encoded using Fisher vectors and a Gaussian mixture model-based codebook. The encoded features are classified using a simple linear support vector machine (SVM).

The proposed approach, unlike existing techniques, does not require gait cycle estimation or human body extraction, and therefore it performs better on low-resolution and poor-quality videos. The experimental evaluation on the CASIA-C gait database (Sect. 3.5), which consists of low-resolution video sequences recorded at night with a thermal imaging camera, reveals the efficacy of the proposed method under unfavorable and challenging conditions. In particular, the proposed gait features comprise the person's static appearance and the relative motion information, which together can capture even small variations in the gait. The high recognition accuracy obtained with a simple linear SVM also confirms the discriminative nature of the proposed features for gait recognition. Preliminary results of this research were presented in [19]; in this paper, however, a number of improvements are proposed, which are summarized in the following.

  • In this study, we investigated the influence of dimensionality reduction on feature encoding. High-dimensional features inevitably increase the computational complexity and the storage requirements. It is inferred from extensive experimentation that applying principal component analysis (PCA) after feature encoding not only improves the recognition rate but also reduces the classification time and the storage requirements of the proposed algorithm.

  • It is also concluded that dimensionality reduction prior to feature encoding discards a few cues that are important for gait recognition, which may lower the recognition accuracy; preserving this information until after encoding significantly improves the effectiveness of our method.

  • In this research, an extensive experimental evaluation is performed. The proposed technique is evaluated on five large benchmark gait databases and the recognition results are compared with state-of-the-art techniques.

2 Proposed gait recognition method

The proposed gait recognition technique consists of four steps. First, dense trajectories are extracted for a set of sample points in a video sequence using the optical flow field; for each tracked point, the appearance and motion information of the walker is computed within a spatiotemporal patch and saved in local descriptors. Second, one million randomly selected local descriptors are used to build a codebook, which is constructed using a Gaussian mixture model (GMM). Third, the local descriptors are encoded using this codebook and the Fisher vector encoding scheme. Finally, a linear SVM is used to classify the encoded features. The performance of the proposed algorithm is evaluated on five widely used benchmark gait databases, and the results are compared with those of existing techniques. The recognition results confirm the superior performance of the proposed method. Figure 1 shows the block diagram of the proposed method, and each step is presented in the following sections.

2.1 Motion descriptor computation

Recently, dense trajectories have proven to be effective in action recognition [33, 52]. Since they encode the movement patterns of a person's walk and can be computed directly from the gait video sequences, we use them to compute the gait features for person identification. To compute the trajectories, we select a set of dense points in a frame and track them in the subsequent frames. The tracking is performed using the displacement information in a dense optical flow field. In particular, each selected point \( P_{t} = (x_{t},y_{t}) \) in frame t is tracked into frame \(t+1\) and so forth. The concatenation of these tracked points over the subsequent frames forms a trajectory. Let S represent the sequence of displacement vectors, computed as,

$$\begin{aligned} S = ({\Delta }P_{t},{\Delta }P_{t+1},\ldots ,{\Delta }P_{t+L-1}), \end{aligned}$$
(1)

where L is the trajectory length and \({\Delta }P_{t} = (P_{t+1} - P_{t})\). The resultant vector S is normalized:

$$\begin{aligned} S' = \frac{{(\Delta }P_{t},\ldots ,{\Delta }P_{t+L-1})}{ \sum _{j=t}^{t+L-1} \left\| {\Delta }P_{j}\right\| } \end{aligned}$$
(2)
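The trajectory shape descriptor of Eqs. (1)–(2) can be sketched in a few lines. The helper below is a hypothetical illustration (not the authors' released code): it takes the tracked positions of one point over a trajectory and returns the normalized displacement vector \(S'\).

```python
import numpy as np

def trajectory_shape(points):
    """Normalized trajectory shape S' (Eq. 2).

    points: (L+1, 2) array of tracked positions P_t, ..., P_{t+L}.
    Returns the L displacement vectors, flattened, after dividing by
    the sum of their magnitudes.
    """
    points = np.asarray(points, dtype=float)
    deltas = np.diff(points, axis=0)             # ΔP_t = P_{t+1} - P_t
    total = np.linalg.norm(deltas, axis=1).sum() # Σ ||ΔP_j||
    if total < 1e-8:                             # static point: no motion
        return np.zeros(deltas.size)
    return (deltas / total).ravel()
```

After normalization the displacement magnitudes of every trajectory sum to one, so trajectories of different spatial extent become directly comparable.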

Equation 2 describes the shape of the trajectory. The authors in [52] proposed computing local descriptors around the tracked points to capture the appearance and motion information for action recognition. The local descriptors are computed within a space-time volume, and their information is saved in two histograms, namely the histogram of oriented gradients (HOG) and the histogram of optical flow (HOF). Furthermore, two additional descriptors, known as the motion boundary histogram (MBH), are computed by taking the spatial derivatives of the horizontal (x-axis) and vertical (y-axis) components of the optical flow; their information is saved in two histograms, MBH\(_{x}\) and MBH\(_{y}\), respectively. They describe the relative motion between pixels. We quantize the orientation information of each descriptor into an 8-bin histogram, and the descriptors are normalized with the \(L_{2}\)-norm. To assess the efficacy of these descriptors for gait recognition, we evaluated several of their combinations in [19] and concluded that the combination of MBH and HOG outperforms the other feature combinations. The main reason for the superior performance of this combination is that HOG captures the person's static appearance while MBH encodes the relative motion between pixels; used together, they therefore have a greater impact in recognizing individuals by their appearance and motion information. Moreover, they are capable of capturing small variations in the gait patterns.
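The 8-bin orientation quantization shared by HOG, HOF and MBH can be sketched as follows. This is a minimal illustration of a single histogram cell, assuming magnitude-weighted voting and \(L_2\) normalization as described above; the full descriptors aggregate many such cells over a space-time volume.

```python
import numpy as np

def orientation_histogram(dx, dy, bins=8):
    """Quantize derivative orientations into an L2-normalized,
    magnitude-weighted histogram (one HOG/MBH cell).

    dx, dy: 2D arrays of horizontal/vertical derivatives over a patch
    (image gradients for HOG, flow-component gradients for MBH).
    """
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2 * np.pi)            # map to [0, 2π)
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```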

2.2 Codebook generation

Usually, a set of local descriptors is used to construct a fixed-length vector to represent an image or video. We encode the HOG and MBH descriptors using Fisher vector (FV) encoding and a Gaussian mixture model (GMM)-based codebook. The FV is based on the principle of the Fisher kernel [38], which combines the advantages of discriminative and generative approaches by deriving a kernel from a generative model of the data. The GMM defines a mixture of several Gaussian distributions over the feature space [33] and can be expressed as:

$$\begin{aligned} p(X \mid \theta ) = \sum \limits _{i=1}^{K}w_{i}\mathscr {N}( x \mid \mu _{i}, \textstyle \sum _{i}) \end{aligned}$$
(3)

where K represents the number of components of the GMM, \(w_{i}\) is the weight of the ith component with the constraint that \( \sum _{i=1}^{K}w_{i} = 1 \), and \(\mu _{i}\) and \(\textstyle \sum _{i}\) represent the mean vector and covariance matrix of the ith component, respectively. Additionally, \(\theta = \{ w_{i},\mu _{i},\textstyle \sum _{i}, i=1,2,\ldots ,K\}\) is the list of parameters for all K components of the GMM, and \(\mathscr {N}( x \mid \mu _{i}, \textstyle \sum _{i})\) represents the Gaussian distribution with D dimensions:

$$\begin{aligned} \mathscr {N}(x\mid \mu _{i},\textstyle \sum _{i}) = {\dfrac{1}{ \sqrt{(2\pi )^{D}{\mid }\sum _{i}{\mid }}} } e^{-\dfrac{1}{2}(x-\mu _{i})^{\top }\textstyle \sum _{i}^{-1}(x-\mu _{i})} \end{aligned}$$
(4)

For a given feature set \(X = \{x_{1}, \ldots , x_{T}\}\), we use the maximum likelihood estimation algorithm [31] to estimate the optimal parameters of the GMM. It performs a soft assignment of each data point to every component. The assignment score, also known as the posterior probability, indicates the association of a data point with the respective component of the GMM and is defined as,

$$\begin{aligned} q_{t}(i) = \dfrac{w_{i}\mathscr {N}(x_{t}\mid \mu _{i}, \textstyle \sum _{i})}{\sum _{j=1}^{K} w_{j} \mathscr {N}(x_{t}\mid \mu _{j}, \textstyle \sum _{j})} \end{aligned}$$
(5)

We consider that each component of the GMM describes a particular appearance and motion pattern contributed by the local descriptors. Since the GMM is fitted with the iterative expectation maximization (EM) algorithm [11, 25], which performs a soft assignment, each descriptor is assigned to multiple components according to its posterior estimates.
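A minimal sketch of the codebook construction, using scikit-learn's `GaussianMixture` as a stand-in for the EM fitting described above. The descriptor matrix is random stand-in data, and the sample and component counts are reduced from the paper's one million 96-D descriptors and \(K=2^8\) for brevity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for a random sample of 96-D HOG/MBH local descriptors.
rng = np.random.default_rng(0)
descriptors = rng.standard_normal((5000, 96))

# Diagonal-covariance GMM codebook fitted by EM (toy K=16; paper uses 2^8).
gmm = GaussianMixture(n_components=16, covariance_type="diag",
                      max_iter=100, random_state=0).fit(descriptors)

# Posterior q_t(i) of Eq. (5): soft assignment of each descriptor
# to every component; rows sum to one.
posteriors = gmm.predict_proba(descriptors)
assert np.allclose(posteriors.sum(axis=1), 1.0)
```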

2.3 Feature encoding

Feature encoding is the process of transforming the local descriptors of an image or video into a fixed-length vector. Lately, FV encoding has become quite popular due to its superior performance in many image processing applications [34, 38], and we chose it to encode our local descriptors. FV encoding starts by learning a GMM (Sect. 2.2). Let \(X = \{x_{t}| t=1, \ldots , T\}\) be a set of local descriptors; it is transformed into a vector by taking the gradient of the log-likelihood of the GMM (3):

$$\begin{aligned} F_{X} = \dfrac{1}{T}\nabla _{\theta }\log p(X|\theta ), \end{aligned}$$
(6)

where \(F_{X}\) represents the resultant Fisher vector and \(\nabla _{\theta }\) is the gradient of the log-likelihood with respect to the model parameters \(\theta \). Let \(x_{t}\) be a D-dimensional local descriptor, and its assignment to the ith Gaussian component (5) is represented as \(q_{t}(i)\). Assuming that the covariance matrices \(\textstyle \sum _{i}\) are diagonal and can be represented as \({\sigma _{i}}^2\), the gradient vector can be represented as in [35]:

$$\begin{aligned} u_{i} = \frac{1}{T\sqrt{w_{i}}} \sum _{t=1}^{T} q_{t}(i) \dfrac{x_{t}-\mu _{i}}{\sigma _{i}} \end{aligned}$$
(7)

where \(\mu _{i}\) represents the mean of the ith component.

$$\begin{aligned} v_{i} = \frac{1}{T\sqrt{2w_{i}}} \sum _{t=1}^{T} q_{t}(i) \left[ \dfrac{(x_{t}-\mu _{i})^{2}}{\sigma _{i}^{2}} - 1\right] , \end{aligned}$$
(8)

where \(u_{i}\) and \(v_{i}\) are D-dimensional gradient vectors with respect to the model parameters mean and variance, respectively. The Fisher encoding for the set of local descriptors X is obtained as a concatenation of u and v.

$$\begin{aligned} f = [u_{1}^{\top },v_{1}^{\top }, u_{2}^{\top },v_{2}^{\top }, \ldots , u_{K}^{\top },v_{K}^{\top }]^{\top } \end{aligned}$$
(9)

The final gradient vector f consists of the \(u_{i}\) and \(v_{i}\) vectors for \(i=1,2,\ldots ,K\), so the size of f is 2KD. The MBH\(_{x}\), MBH\(_{y}\) and HOG descriptors are encoded as described above and fused using representation-level fusion [33]: the feature vector for each set of local descriptors is computed separately, and a global representation is obtained by concatenating them.
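Equations (7)–(9) can be sketched directly. The function below is an illustrative implementation, assuming a diagonal-covariance GMM fitted as in Sect. 2.2 (e.g., scikit-learn's `GaussianMixture`); it returns the 2KD-dimensional Fisher vector f.

```python
import numpy as np

def fisher_vector(X, gmm):
    """Fisher vector of descriptor set X (T x D) under a fitted
    diagonal-covariance GMM, per Eqs. (7)-(9): gradients w.r.t. the
    means (u_i) and variances (v_i), concatenated to length 2*K*D."""
    T, D = X.shape
    q = gmm.predict_proba(X)                  # posteriors q_t(i), (T, K)
    w, mu = gmm.weights_, gmm.means_          # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)         # (K, D) std-devs
    parts = []
    for i in range(gmm.n_components):
        diff = (X - mu[i]) / sigma[i]         # (x_t - mu_i) / sigma_i
        u = (q[:, i, None] * diff).sum(0) / (T * np.sqrt(w[i]))       # Eq. (7)
        v = (q[:, i, None] * (diff ** 2 - 1)).sum(0) \
            / (T * np.sqrt(2 * w[i]))                                 # Eq. (8)
        parts.extend([u, v])
    return np.concatenate(parts)              # Eq. (9), length 2*K*D
```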

Table 1 Summary of gait databases used in performance evaluation. Size represents the number of gait video sequences in the database used in experimental evaluation

2.4 Classification

The similarity between two samples X and Y can be measured using the Fisher kernel (FK) [38]. It is defined as a dot product between the feature vectors of X and Y: \(FK(X,Y) = f_{X}' \cdot f_{Y}\), where \(f_{X}\) and \(f_{Y}\) represent the Fisher vectors of samples X and Y, respectively. A nonlinear kernel machine using FK as its kernel is equivalent to a linear kernel machine using \(f_{X}\) as the feature vector. The main advantage of such an explicit vector formulation is that any simple linear classifier, which can be learned very efficiently, can be exploited. We use a simple linear SVM; in the implementation of the proposed algorithm, the LIBLINEAR SVM library [13] is used to classify the encoded gait features. Before the actual model is trained on the full training database, 10-fold cross-validation is performed to validate the model and select the optimal values of its parameters.
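A hedged sketch of this classification step, using scikit-learn's `LinearSVC` (a wrapper over LIBLINEAR) with cross-validated selection of the regularization parameter C. The feature matrix, the C grid and the 3-fold split are toy stand-ins for the paper's Fisher vectors and 10-fold protocol.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Stand-in encoded gait features: 6 "subjects", 10 sequences each.
rng = np.random.default_rng(0)
features = rng.standard_normal((60, 128))
labels = np.repeat(np.arange(6), 10)

# Select C by cross-validation, then train on the full training set.
best_C = max((1e-2, 1e-1, 1.0, 10.0),
             key=lambda C: cross_val_score(LinearSVC(C=C), features,
                                           labels, cv=3).mean())
clf = LinearSVC(C=best_C).fit(features, labels)
```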

3 Experiments and results

The performance of the proposed algorithm is evaluated on five large gait databases: TUM GAID [17], CMU MoBo [15], NLPR [55], CASIA-B [58] and CASIA-C [46]. Each database contains gait sequences captured in a different environment with several variations in walking style, clothing, walking speed, etc. For example, the TUM GAID database captures three different walking styles: normal walk, walk with coating shoes, and walk with a backpack. Moreover, the database was recorded in two different seasons; thus, the clothing and the environment differ significantly. A summary of the characteristics of the gait databases used in the experimental evaluation is presented in Table 1. For each database, a codebook is computed separately from a training set, a subset of that database, and is used to encode the local descriptors of that particular database. In all experiments, one million randomly selected local descriptors from the training set are used to build a codebook with the GMM. The number of components K in the GMM is empirically determined and set to \(2^{8}\) in all experiments, and the length of each local descriptor is 96. The same division of the databases into training and testing sets is used in evaluating the performance of the proposed and the compared gait recognition methods.

We also assess the influence of dimensionality reduction on the classification accuracy of our computed features. It is observed that dimensionality reduction significantly improves the recognition accuracy of the proposed method and reduces the computational time and the storage requirements. This improvement is due to the impact of PCA on the GMM: PCA decorrelates the different feature dimensions during the projection, which is particularly advantageous since we use a diagonal covariance matrix in the Gaussian distributions [43]. Let d be the dimension of our feature vector and n the total number of instances in the training and testing sets. Since \(d \gg n\), we apply PCA to our feature vectors and reduce their dimension to \(n-1\) to alleviate the curse of dimensionality. It is empirically concluded that the average recognition accuracy of the proposed algorithm on the five gait databases increases significantly (up to 27%) when the dimension of the final feature vector is reduced. Furthermore, with PCA the average memory saving across the five databases is more than 50%, and the average speedup in classification time is more than 96%.
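The post-encoding reduction can be sketched as follows, assuming n labeled instances and d-dimensional Fisher vectors with \(d \gg n\); the sizes are toy values, not the paper's actual dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy sizes: n instances, d-dimensional Fisher vectors (real d is 2*K*D).
n, d = 50, 4096
rng = np.random.default_rng(0)
fisher_vectors = rng.standard_normal((n, d))

# With d >> n the data spans at most n-1 directions, so PCA can retain
# n-1 components without losing variance, shrinking storage and
# classification time.
pca = PCA(n_components=n - 1)
reduced = pca.fit_transform(fisher_vectors)
assert reduced.shape == (n, n - 1)
```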

3.1 Evaluation on TUM GAID database

TUM GAID [17] is one of the biggest gait databases, comprising 3370 gait sequences of 305 subjects. The gait sequences were recorded in two different seasons. Each subject in the database has ten walk sequences: six sequences of normal walk (N), two sequences of walk with coating shoes (S) and two sequences of walk with a backpack (B). A subset of 32 subjects who took part in both recordings have ten more walk sequences (i.e., 20 sequences in total), described as normal walk after time (TN), walk with coating shoes after time (TS) and walk with backpack after time (TB). The time variation, in which the shoes, clothing, illumination and other recording properties change significantly, makes this database extremely challenging. The division of gait sequences into the training (i.e., gallery) and testing (i.e., probe) sets defined in [17] is used. Specifically, the first four N sequences of each person (i.e., \(N_{1} - N_{4}\)) are used as the training set, and \(N_{5}-N_{6}\), \(S_{1}-S_{2}\) and \(B_{1}-B_{2}\) are used as the testing sets in three separate experiments, namely N, S and B. Moreover, three more experiments are conducted, namely TN, TS and TB, where the sequences \(N_{7}-N_{8}\), \(S_{3}-S_{4}\) and \(B_{3}-B_{4}\) are used as the testing sets separately, while the training set is the same as in the previous set of experiments.

Table 2 Performance comparison of proposed algorithm on TUM GAID database
Table 3 Performance comparison of proposed algorithm on CMU MoBo gait database. The composition of training and testing sequences in each experiment is denoted as Training/Testing

Recognition rate is used as the measure to assess the performance of the gait recognition algorithms. The recognition results achieved by our algorithm and their comparison with the state-of-the-art techniques [4,5,6, 17, 56] are outlined in Table 2. The statistics show that in the first set of experiments, N, S and B, the proposed algorithm outperforms all competing methods, achieving recognition accuracies of 99.7%, 99.7% and 100%, respectively. In the next three experiments, TN, TS and TB, our algorithm performs best on the TB experiment, achieving 71.9% recognition accuracy; on the TN and TS experiments, PFM [5] and CNN-SVM [6], respectively, perform better than our algorithm. Overall, our algorithm achieves the best average recognition accuracy of 96.5%.

3.2 Evaluation on CMU MoBo database

The CMU MoBo database [15] contains gait videos of 25 subjects walking on a treadmill. The database comprises four variations of walk: slow walk (S), walk with a ball in hands (B), fast walk (F) and walk at a certain slope (I). The sequences are recorded from six different viewing angles. The gait sequences recorded in the lateral view are used to assess the ability of the proposed algorithm to deal with variations in walking surface (i.e., incline), speed and carrying condition. To increase the number of gait sequences, we split each sequence into three equal parts, similar to [59]. Two types of experiments are conducted: (1) the same walking style is used in the training and testing sets, and (2) different walking styles are used in the training and testing sets. The recognition accuracies obtained by the proposed algorithm and their comparison with competing methods are listed in Table 3. The proposed algorithm achieved excellent results, outperforming all competing methods in sixteen experiments and achieving the highest average recognition accuracy of 98.5%.

Table 4 Performance comparison of proposed algorithm on NLPR gait database

3.3 Evaluation on NLPR gait database

The NLPR gait database [55] comprises the walk sequences of 20 subjects. Each subject has four walk sequences captured from three different viewing angles: \(0{^\circ }\), \(45{^\circ }\) and \(90{^\circ }\). Each subject walks twice, from right to left and from left to right. The performance of the proposed algorithm is evaluated on the gait sequences recorded at the \(0{^\circ }\) viewing angle. The first three recordings of each subject are used as the training set, and the fourth recording is used as the testing set. The recognition result obtained by the proposed algorithm on the NLPR database and its comparison with other competing techniques are reported in Table 4. The proposed algorithm, along with [14, 20, 21], achieves a 100% recognition rate on the NLPR gait database.

Table 5 Performance comparison of proposed algorithm on CASIA-B gait database

3.4 Evaluation on CASIA-B gait database

CASIA-B [58] is another large gait database, comprising the gait sequences of 124 subjects captured from 11 different viewing angles. Each subject has ten gait sequences with three variations: six sequences of normal walk (nm), two of walk wearing a coat (cl) and two of walk with a bag (bg). The sequences recorded in the lateral view are used in the evaluation, and three experiments, namely nm, cl and bg, are conducted. In all experiments, the first four normal walk sequences of all 124 subjects are placed in the training set, and the remaining two nm, cl and bg sequences are used as the testing sets, separately. The recognition accuracies obtained by the proposed algorithm and their comparison with competing methods are presented in Table 5. The results reveal that the proposed algorithm outperforms the existing methods in nm and bg, while SDL [59] performs slightly better than our method in cl. Our method achieves the highest average recognition accuracy of 96.6%.

3.5 Evaluation on CASIA-C database

The CASIA-C database [46] comprises the gait videos of 153 subjects in four walking scenarios: normal walk (fn), fast walk (fq), slow walk (fs) and walk with a backpack (fb). The database contains low-resolution videos recorded at night with a thermal imaging camera. Each subject has four recordings of fn and two recordings each of fq, fs and fb. This database aims at evaluating the performance of gait recognition methods under variations in walking speed, carrying condition and illumination. Four experiments, namely fn, fq, fs and fb, are conducted. In all experiments, the first three fn walk sequences of all 153 subjects are placed in the training set. In the first experiment, the remaining fourth fn sequence is placed in the testing set, while the fq, fs and fb sequences are used as the testing sets, separately, in the next three experiments. The recognition results obtained by the proposed algorithm and the competing methods are reported in Table 6. The results demonstrate that our method outperforms the existing methods in most experiments and obtains the highest average recognition accuracy of 99.8%.

One can conclude from the results presented in the previous sections that the proposed gait recognition algorithm performs consistently better than many state-of-the-art gait recognition methods and that it is robust to changes in walk caused by variations in clothing, shoes, backpacks, bags, jackets, and illumination. The source code of the proposed gait recognition algorithm is publicly released so that the results reported in this paper can be reproduced; the source code, along with a sample database and computed features, is available online.Footnote 1

Table 6 Performance comparison of proposed algorithm on CASIA-C gait database

4 Conclusions

A new gait representation based on the spatiotemporal characteristics of human motion is presented in this paper. Unlike most existing methods, which require the segmented silhouette of the human body region to compute features for gait recognition, the proposed method does not require such information and operates directly on the video sequences. It tracks a set of points across successive frames to compute dense trajectories, which specify spatiotemporal patches in which local descriptors are computed to capture the person's static appearance and motion information. The local descriptors are encoded using Fisher vector encoding, and classification is performed with a linear SVM. A rigorous experimental evaluation on five large, well-known benchmark gait databases showed that the proposed algorithm is highly accurate. In the future, we plan to extend the proposed method to view-invariant gait recognition and to explore probabilistic modeling techniques to track individuals even under occlusion.