1 Introduction

Over the last few years, commercial enterprises, governmental organizations, educational institutions, and the general public have all shown growing interest in video surveillance. It performs well at locating and identifying temporal and spatial abnormalities in videos [1, 2]. Compared with merely capturing data, video analysis in security monitoring offers several advantages. It is widely used in the supervision of parking lots, the enforcement of safety regulations, the protection of sensitive information, and facility protection. To analyze surveillance footage, a variety of approaches are utilized, such as automatic number plate identification, crowd detection, people tracking, image change/tamper detection, and people counting. In addition, it supports the categorization of items into groups such as common and uncommon, as well as face recognition, motion tracking, classification, tamper detection, and auto-tracking.

Video surveillance systems may enhance situational awareness through personalized real-time alerts when abnormal behavior is detected that may require a response. One issue with this kind of video analysis is how to identify people's faces properly. In targeted surveillance, face detection is one of the most investigated topics. It has been used successfully in many everyday scenarios, including surveillance cameras, human-computer interaction, automated target recognition, safe driving, and medical diagnostics [3]. One of the most popular face recognition methods identifies facial characteristics in each frame, allowing for dynamic face tracking. Face detection has several advantages, including high security, confidentiality, and resistance to hacking. The face is one of the most complicated natural structural targets, which makes face detection a challenging area of study. Individuals vary physiologically in a wide range of ways, including their appearance, communication style, and complexion [4]. Face detection in low-resolution surveillance footage is particularly challenging because of out-of-focus blur [5]. In such conditions, Gait Analysis is the most advantageous approach to human detection.

The use of biometric features for individual identification is a hotspot of study in the field of computer vision [6]. "Gait Recognition" is a way of identifying individuals from measurements of the unique characteristics of their movement patterns (walking) and behavior [7]. In open and public areas, automatically capturing and extracting components of human motion, and then using these characteristics to recognize individuals while they are moving, can be extremely valuable. However, compared with a controlled situation with a fixed view angle, such settings make Gait identification far more challenging. A person's Gait may be altered by a broad variety of circumstances, including the clothes they are wearing, the objects they are carrying, the speed at which they are moving, the shoes they are wearing, and the direction from which they are approaching. Shifts in view angle are one of the most important problems in Gait recognition, since they can rapidly affect the distinct features that are available for matching. The Gait biometric, which evaluates behavior, is particularly useful when the camera is located at a considerable distance and the visuals are of poor quality. As a result, Gait recognition has been adopted in surveillance to assist in investigations.

The majority of studies on Gait Analysis conducted today focus on authenticating and identifying individuals [8]. The human Gait is a major biometric feature used in a variety of circumstances, including age estimation and gender classification. Age and gender are general attributes with applications across a wide range of situations [9, 10]. These uses include access control, which restricts entrance by age or gender; commercial applications and monitoring services that provide age-restriction functionality; healthcare based on gender and age grouping; traffic and driving safety improvements in ageing societies; and advertising recommendation systems. In all of these cases, the information is presented in a manner that is appropriate to the audience.

Computer vision is becoming increasingly popular for image processing, feature extraction, object recognition, and classification, and Gait Analysis is one of its subfields [1]. Computer vision allows human posture to be captured more accurately, which is useful for Gait Analysis. Computer vision relies on machine learning and deep learning algorithms to achieve greater accuracy, especially when extracting the required image features. Numerous machine learning and deep learning methods, such as Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), and Convolutional Neural Network (CNN), have been used to classify the features gathered during Gait Analysis in order to estimate age and gender. The majority of research in recent years has focused on wearable and visual sensor datasets in an attempt to achieve the highest possible accuracy when determining age and gender.

2 Gait Analysis and Feature Extraction

2.1 Gait and Gait Cycle

The term "Gait" describes the human walking pattern [4]. The Gait Cycle is the time interval, or sequence of events, between one foot hitting the floor and the next time that same foot hits the floor [11]. In other words, one contact of a reference foot with the ground up to the next contact of the same foot constitutes a single functional sequence of that limb. Some of the most vital components of Gait are:

Stance phase: The limb is in contact with the floor for about 60% of the Gait Cycle. The stance phase consists of four sub-phases: i) initial contact, ii) loading response, iii) mid stance, and iv) terminal stance.

Swing phase: For the remaining 40% of the Gait Cycle, the limb is not in contact with the floor. The swing phase includes the following sub-phases: i) pre-swing, ii) initial swing, iii) mid swing, and iv) terminal swing.

Stride: The distance between two successive instances of heel contact made by the same foot is referred to as the stride [12].
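As a rough illustration of these definitions, the sketch below splits one Gait Cycle of a single foot into its stance and swing shares. The function name and the event timestamps are hypothetical; in practice such events might come from a pressure mat or a foot switch.

```python
def gait_cycle_phases(heel_strike, toe_off, next_heel_strike):
    """Split one gait cycle of a single foot into stance and swing shares.

    All arguments are timestamps in seconds (hypothetical inputs).
    """
    if not heel_strike < toe_off < next_heel_strike:
        raise ValueError("expected heel strike < toe off < next heel strike")
    cycle = next_heel_strike - heel_strike   # duration of one gait cycle
    stance = toe_off - heel_strike           # foot in contact with the floor
    swing = next_heel_strike - toe_off       # foot off the floor
    return {
        "cycle_s": cycle,
        "stance_pct": 100.0 * stance / cycle,
        "swing_pct": 100.0 * swing / cycle,
    }


# Made-up event times chosen to give the typical 60/40 split.
phases = gait_cycle_phases(heel_strike=0.0, toe_off=0.66, next_heel_strike=1.10)
print(phases)
```

With these made-up times the stance share is about 60% and the swing share about 40% of the cycle.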

2.2 Biometric Gait Analysis

Biometrics is the authentication of a person's identity based on their physiological characteristics. A biometric system is a pattern recognition system that compares a person's information to stored characteristics in order to determine their identity. Examples include face detection, voice recognition, fingerprints, and handwritten signatures. "Biometric Gait Analysis" refers to the process of determining a person's walking style [6, 8]. The process of authenticating a person based on the way they walk is referred to as Gait recognition.

Gait recognition is needed because it is effective even when performed remotely, does not require high-quality footage, and can be performed with minimally invasive equipment. When other identifying features, such as a person's face or fingerprints, are hidden, Gait recognition can still be effective, and a person's walk can be captured from a considerable distance.

2.3 Feature Extraction

Silhouette Images

To enable more accurate object recognition, feature extraction is applied to the video so that the image sequence is represented as silhouettes. The body silhouette is recovered by straightforward background subtraction and thresholding, followed by a 3 × 3 median filter to suppress isolated pixels [13].
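The three steps (background subtraction, thresholding, 3 × 3 median filtering) can be sketched with NumPy as follows. This is a minimal illustration, not the pipeline of [13]: the function name, the threshold value of 30, and the static grayscale background are all assumptions.

```python
import numpy as np


def extract_silhouette(frame, background, threshold=30):
    """Background subtraction, thresholding, then a 3x3 median filter.

    frame, background: 2-D uint8 grayscale arrays of the same shape.
    threshold: assumed intensity difference separating person from background.
    """
    # Signed arithmetic avoids uint8 wrap-around in the difference.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    mask = (diff > threshold).astype(np.uint8)   # binary foreground mask

    # 3x3 median filter: pad by one pixel, then take the median of each window.
    padded = np.pad(mask, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1)).astype(np.uint8)


# Toy scene: empty background, a rectangular "person", one isolated noisy pixel.
background = np.zeros((10, 10), dtype=np.uint8)
frame = background.copy()
frame[3:8, 3:7] = 200
frame[1, 1] = 255
silhouette = extract_silhouette(frame, background)
print(silhouette)
```

The isolated pixel is removed by the median filter while the interior of the rectangle survives; note that a 3 × 3 median also rounds off sharp corners of the mask.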

Gait Energy Image (GEI)

The GEI represents the Gait pattern of one cycle as a single averaged image. It is built from the sequence of binary silhouettes that make up a Gait Cycle [14, 15]. When the Gait Cycle image sequence is B_t(x, y), the GEI is calculated as follows:

$$ G(x,y) = \frac{1}{N}\sum_{t = 1}^N {B_t (x,y)} $$
(1)

where B_t(x, y) is the binary silhouette at time t in the sequence, x and y are the pixel coordinates within each frame, and N is the total number of frames in a Gait Cycle.
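Equation (1) is a pixel-wise mean over the frames of one cycle. A minimal NumPy sketch is given below; it assumes the silhouettes are already aligned and scaled to a common size, and the function name is illustrative.

```python
import numpy as np


def gait_energy_image(silhouettes):
    """Average N binary silhouettes B_t(x, y) into one GEI G(x, y), as in Eq. (1)."""
    stack = np.asarray(silhouettes, dtype=np.float64)   # shape (N, H, W)
    if stack.ndim != 3:
        raise ValueError("expected a sequence of 2-D silhouettes")
    return stack.mean(axis=0)


# Toy cycle of N = 4 frames of size 3 x 3.
cycle = np.zeros((4, 3, 3))
cycle[:, 1, 1] = 1    # pixel covered in every frame
cycle[:2, 0, 1] = 1   # pixel covered in half of the frames
gei = gait_energy_image(cycle)
print(gei)
```

Pixels that belong to the body in every frame get the value 1, pixels that are covered only part of the time (typically limbs) get intermediate values, and background pixels stay at 0.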

Gait Energy Image Projection Model (GPM)

The GPM is a model that evaluates many characteristics of Gait, such as body size and the movement of the arms and legs. It is built from two groups of projections of the GEI: the GEI Longitudinal Projection (GLP) and the GEI Transverse Projection (GTP) [16, 17]. The GTP conveys information about body shape and stride length. According to the GLP, the slouched posture and head pitch in each GEI image are mathematically related to the following Gait Cycle, as observed by analyzing the Gait Cycles of people walking.

For GLP,

$$ GLP_{cycle} = \frac{1}{K}\sum_{j = 1}^K {GLP_j } $$
(2)

For GTP,

$$ GTP_{cycle} = \frac{1}{K}\sum_{j = 1}^K {GTP_j } $$
(3)

where K is the number of frames, and GLP_j and GTP_j are the j-th frame vectors of GLP and GTP. The GPM then combines GLP and GTP by concatenation.

For GPM,

$$ GPM = \left\{ {GLP_{cycle} \cup GTP_{cycle} } \right\} $$
(4)
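Equations (2) to (4) can be sketched as follows. The text does not spell out along which image axis each projection is taken, so the choice below (GLP_j as column sums, GTP_j as row sums) is an assumption made only for illustration and may differ from [16, 17]; the function name is also hypothetical.

```python
import numpy as np


def gpm_descriptor(frames):
    """Concatenate cycle-averaged longitudinal and transverse projections.

    frames: array of shape (K, H, W), one image per frame j.
    Assumption: GLP_j sums each column (length W) and GTP_j sums each row
    (length H); the projection axes are not specified in the text.
    """
    frames = np.asarray(frames, dtype=np.float64)
    glp = frames.sum(axis=1)        # (K, W): one GLP_j vector per frame
    gtp = frames.sum(axis=2)        # (K, H): one GTP_j vector per frame
    glp_cycle = glp.mean(axis=0)    # Eq. (2)
    gtp_cycle = gtp.mean(axis=0)    # Eq. (3)
    return np.concatenate([glp_cycle, gtp_cycle])   # Eq. (4)


# Toy input: K = 5 frames of size 4 x 3 filled with ones.
gpm = gpm_descriptor(np.ones((5, 4, 3)))
print(gpm.shape)
```

The resulting vector has length W + H and can be passed to a classifier such as an SVM.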

Frame to Exemplar Distance (FED)

The FED takes the whole Gait Cycle into consideration. Each silhouette is described by its frame to exemplar distance: 60 points are placed on the silhouette's contour, and the Euclidean distance between the centroid of the silhouette and each point is measured. Contour points are taken at every 6° of rotation, in the opposite direction, until the full 360° is covered.

$$ FED_{cycle} = \frac{1}{N}\sum_{i = 1}^N {(FED_{image} )} $$
(5)

where N is the number of frames and i is the iteration (frame) index.
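The procedure can be sketched as below. As a simplifying assumption, the contour point of each 6° step is approximated by the farthest foreground pixel whose direction from the centroid falls into that angular bin; the function names are hypothetical and the original method may locate contour points differently.

```python
import numpy as np


def centroid_contour_distances(silhouette, n_points=60):
    """Centroid-to-contour distance for each of n_points angular bins.

    Assumption: the contour point of a bin is the farthest foreground pixel
    whose direction from the centroid lies in that bin (6 degrees per bin).
    """
    ys, xs = np.nonzero(silhouette)
    if ys.size == 0:
        raise ValueError("empty silhouette")
    cy, cx = ys.mean(), xs.mean()                  # centroid of the silhouette
    dy, dx = ys - cy, xs - cx
    radius = np.hypot(dx, dy)                      # Euclidean distance to centroid
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    bins = (angle // (360.0 / n_points)).astype(int) % n_points
    distances = np.zeros(n_points)
    np.maximum.at(distances, bins, radius)         # largest radius per bin
    return distances


def fed_cycle(silhouettes, n_points=60):
    """Eq. (5): average the per-frame distance vectors over the N frames."""
    per_frame = [centroid_contour_distances(s, n_points) for s in silhouettes]
    return np.mean(per_frame, axis=0)


# Toy silhouette: a horizontal bar, 11 pixels long, centred at (5, 5).
bar = np.zeros((11, 11), dtype=np.uint8)
bar[5, :] = 1
fed = fed_cycle([bar, bar])
print(fed.shape)
```

Bins that contain no foreground pixel keep a distance of 0 in this sketch; a real implementation would more likely interpolate along the extracted contour.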

GEINet

The GEI is extracted from the Gait silhouette sequence and then fed into GEINet, a CNN used to recognize Gait [10, 18, 19]. The modified GEINet consists of two sequential triplets of convolution, pooling, and normalization layers, followed by fully connected layers with normalization and a softmax activation function.

GaitSet

Unlike standard Gait recognition networks that use the GEI, GaitSet operates on sets of silhouette images. Using a CNN, frame-level features are extracted from each silhouette [19]. The set pooling technique combines the frame-level features into a single set-level feature. The set-level feature is then merged with features from different spatial scales and locations using horizontal pyramid mapping to create a discriminative representation [10]. GaitSet uses a fully connected layer and softmax normalization to perform the identification.
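The idea of set pooling can be illustrated with a small NumPy sketch. Element-wise maximum over the frame axis is used here as one possible order-independent pooling operator; this choice, the function name, and the toy features are assumptions for illustration, not a description of the published network.

```python
import numpy as np


def set_pooling(frame_features):
    """Collapse per-frame features of shape (T, D) into one set-level feature (D,).

    Element-wise max is one permutation-invariant choice (an assumption here).
    """
    frame_features = np.asarray(frame_features, dtype=np.float64)
    return frame_features.max(axis=0)


# Toy frame-level features: T = 3 frames, D = 2 dimensions.
features = np.array([[1.0, 5.0],
                     [3.0, 2.0],
                     [0.0, 4.0]])
pooled = set_pooling(features)
pooled_shuffled = set_pooling(features[::-1])   # same frames, different order
print(pooled, pooled_shuffled)
```

Because the pooled feature does not depend on the order of the frames, the input can be treated as an unordered set of silhouettes rather than a strict sequence.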

2.4 Motivation and Application of GEI

Motivation

The motivation for using gait analysis for person identification and recognition is given below:

  • Gait recognition has numerous advantages over other biometrics. Gait can be monitored from a great distance, whereas some biometric approaches require the user to come into contact with a biometric collector.

  • Low-resolution gait analysis is possible. Face recognition may be less accurate with low-quality footage, so gait recognition is the preferred method in such cases.

  • Simple instruments can recognize gait. Human stride data can be collected using a camera, a smartphone accelerometer, or a floor sensor.

  • Gait characteristics are challenging to reproduce. This is because gait recognition uses motions and silhouettes. This characteristic is vital for criminology.

  • Gait recognition is possible without participant engagement. In contrast, fingerprints require a person to touch the sensor.

Application of GEI

The following are the applications of GEI-based Gait recognition and identification:

  • Based on the GEI image, the individual's stance and swing frequency can be analyzed to predict their best possible stride frequency. In addition, the order of the joint angles can be determined and upcoming angles can be forecast. This makes it possible to infer a more accurate pattern of human walking style from body motions, such as the motions of the head, arms, and legs.

  • It also provides age and gender prediction, which can be helpful when designing applications based on age and gender queries. Examples include college entrance (age-based), crime investigation, hostels, and movies (gender-based).

3 Evaluation Metrics

Confusion Matrix: The confusion matrix is a common tool for evaluating classifiers. It works for binary and multiclass classification and provides True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts that summarize a classification model's performance.

Precision: To calculate precision, divide the number of true positive results by the total number of results that were predicted to be positive.

$$ {\text{Precision}} = \frac{TP}{{TP + FP}} $$
(6)

Recall: To get the recall, also known as the true positive rate, divide the number of true positive results by the total number of results that are actually positive.

$$ {\text{Recall}} = \frac{TP}{{TP + FN}} $$
(7)

Correct Classification Rate (CCR): The CCR is calculated by dividing the total number of correct predictions by the total number of data points in the dataset. The highest possible value is 1.0, while the lowest is 0.0.

$$ CCR = \frac{TP + TN}{{TP + TN + FP + FN}} $$
(8)
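The following plain-Python sketch computes the confusion-matrix counts for a binary problem and derives precision, recall, and CCR from them according to Eqs. (6) to (8). The function names and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0   # Eq. (6)


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0   # Eq. (7)


def ccr(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)        # Eq. (8)


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)                      # 3 1 3 1
print(precision(tp, fp), recall(tp, fn))   # 0.75 0.75
print(ccr(tp, fp, tn, fn))                 # 0.75
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other tools may raise an error or return NaN instead.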

Mean Squared Error (MSE): MSE measures the average squared difference between the estimated and actual values. It is a risk function corresponding to the expected value of the squared error loss. It cannot take a negative value, and values closer to zero are better.

$$ MSE = \frac{1}{N}\sum_{i = 1}^N {(Y_i - \hat{Y}_i )^2 } $$
(9)

where Ŷi is the predicted value, Yi is the observed value, and N is the number of data points.

Root Mean Square Error (RMSE): RMSE is the square root of the MSE and measures the differences between predicted and observed values in the units of the original variable.

$$ RMSE = \sqrt {\frac{{\sum_{i = 1}^N {(Y_i - \hat{Y}_i )^2 } }}{N}} $$
(10)

where i is the index variable, N denotes the number of non-missing data points, Ŷi is the estimated time series, and Yi is the actual observed time series.

Mean Absolute Error (MAE): MAE measures the average absolute difference between paired observations of the same phenomenon. Examples of Y versus X include predicted versus observed values, subsequent versus initial time, and one measurement technique versus another.

$$ MAE = \frac{{\sum_{i = 1}^n {\left| {y_i - x_i } \right|} }}{n} $$
(11)

where n is the total number of data points, xi represents the true value, and yi is the prediction.
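The three error measures can be computed side by side as in the sketch below, where MSE is taken as the mean of the squared differences and RMSE as its square root. The function names and the ages are illustrative values, not results from any cited study.

```python
import math


def mse(observed, predicted):
    """Mean of squared differences between observed and predicted values."""
    n = len(observed)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n


def rmse(observed, predicted):
    """Square root of the MSE, in the same unit as the data."""
    return math.sqrt(mse(observed, predicted))


def mae(observed, predicted):
    """Mean of absolute differences between predicted and observed values."""
    n = len(observed)
    return sum(abs(y_hat - y) for y, y_hat in zip(observed, predicted)) / n


# Hypothetical ages in years.
true_age = [20, 30, 40, 50]
pred_age = [22, 27, 40, 55]
print(mse(true_age, pred_age))    # 9.5
print(rmse(true_age, pred_age))   # about 3.08
print(mae(true_age, pred_age))    # 2.5
```

RMSE penalizes the single 5-year error more strongly than MAE does, which is why the RMSE here is larger than the MAE.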

Correlation Coefficient (CC): The CC, often known as Pearson's r, measures the linear correlation between two sets of data.

$$ r = \frac{{\sum {(x_i - \overline{x})(y_i - \overline{y})} }}{{\sqrt {\sum {(x_i - \overline{x})^2 } \sum {(y_i - \overline{y})^2 } } }} $$
(12)

where r is the CC, xi is the value of the x variable in a sample, x̅ is the mean of the x values, yi is the value of the y variable in a sample, and y̅ is the mean of the y values.
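A direct plain-Python transcription of Pearson's r, with both sums of squared deviations under the square root, is sketched below. The function name and sample values are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])     # perfectly increasing relationship
r_down = pearson_r(x, [10, 8, 6, 4, 2])   # perfectly decreasing relationship
print(r_up, r_down)
```

The value ranges from -1 (perfect decreasing linear relationship) to 1 (perfect increasing linear relationship); the sketch does not guard against constant inputs, for which the denominator is zero.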

R-squared (R²) score: In a regression model, R² indicates the proportion of the variance of a dependent variable that is explained by the independent variables. Correlation expresses the strength of the relationship between an independent and a dependent variable, whereas R² indicates the extent to which the variance of one variable explains the variance of the other.

$$ R^2 = 1 - \frac{{\sum_{i = 1}^n {(y_i - f(x_i ))^2 } }}{{\sum_{i = 1}^n {(y_i - \overline{y})^2 } }} $$
(13)
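Equation (13) compares the residual sum of squares with the total sum of squares around the mean. In the sketch below, the list `predicted` plays the role of f(x_i); the function name and the numbers are illustrative.

```python
def r2_score(observed, predicted):
    """Coefficient of determination, following Eq. (13)."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))   # numerator
    ss_tot = sum((y - mean_y) ** 2 for y in observed)                 # denominator
    return 1.0 - ss_res / ss_tot


y = [3.0, 5.0, 7.0, 9.0]
f = [2.5, 5.5, 7.0, 9.0]
print(r2_score(y, f))               # 0.975
print(r2_score(y, [6.0] * 4))       # 0.0: always predicting the mean
```

A model that always predicts the mean of the observations scores 0, a perfect model scores 1, and a model worse than the mean baseline gives a negative value.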

4 Related Work

Hema M. and Pitta S. [16] used the OU-ISIR dataset for age classification, with an SVM as the classifier and the GPM as the descriptor. They compared GEI, SM (Silhouette Model), GPM, and FED; GPM achieved a CCR of 89.1%, which was 4.1% higher than GEI, 26.1% higher than FED, and 14.63% higher than SM. Fusing the GPM, FED, and GEI descriptors gave a CCR of 91.8%, better than any individual descriptor.

K. Khabir et al. [20] proposed a system for predicting age and gender from the inertial sensor-based Gait dataset of the Osaka University-ISIR Gait Database [21]. The database contains data from an accelerometer, a gyroscope, and a smartphone worn around the subject's waist; these components make up the three inertial measurement units. Age was estimated by regression, and gender was treated as a classification task. They used KNN, SVM (RBF), and DT (Decision Tree) for age regression, where the DT had the highest R² score of 0.64 with a Mean Squared Error of 0.36. For gender classification, SVM had the highest accuracy at 84.76% compared to KNN and RF.

Sun Bei et al. [22] examined a visual camera sensor-based approach to gender detection. Instead of employing a single GEI, SubGEIs from the Gait Cycle were used to extract optical flow as temporal body movement data for complex Gait Analysis [15]. Compared to CNN-based models, the two-stream CNN produced better results since it made use of both temporal and spatial data [1, 2]. The results were established for the view angles 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, and 172°. They created custom CNNs with 4 layers and also used Inception-V3 and VGG16. There were three sets of SubGEIs, designated TL-1, TL-2, and TL-3, with 4, 6, and 8 frames respectively. They compared CNN, CNN+SVM, and two-stream networks under four conditions: normal, carrying bags, wearing clothes, and mixed. With the TL-2 set (6 SubGEI frames) and Inception-V3, the two-stream network achieved close to 95% accuracy at the 90° view angle under normal conditions.

Q. Riaz et al. [23] examined a method based on 50 handcrafted non-visual features (6D acceleration/angular velocity ratio-temporal characteristics), which does not require a computationally expensive model to estimate age. The RFR (Random Forest Regressor) performed best for age estimation compared with the SVR (Support Vector Regressor) and the MLP (Multi-Layer Perceptron). On hybrid data (phone-embedded and wearable Inertial Measurement Units (IMUs)) using the complete dataset, the RFR's 10-fold RMSE was 3.32 years and its subject-wise RMSE was 8.22 years. Using only smartphone MPU-6500 data and the whole dataset, the RFR achieved a 10-fold RMSE of 2.94 years and a subject-wise RMSE of 6.84 years, which is better than on hybrid data. The RFR's average RMSE was 5.42 years for 10-fold cross-validation and 11.35 years for subject-wise cross-validation.

S. Gillani et al. [24] used various machine learning methods, applied to extracted features, to estimate age and classify gender. They used 3 IMUZ sensors, in which the inertial signals are captured by a triaxial accelerometer and gyroscope. The three sensors were mounted at the waist: two on the sides and one in the middle. The IMUZ can be replaced by a smartphone with accelerometer and gyroscope sensors. Signals were captured as subjects walked along a specified path when arriving (Sequence 0) and leaving (Sequence 1). Age estimation was evaluated with CC, MAE, and RMSE. Among Linear Regression, MLP, SVM, and RF, SVM gave good results, with a CC of 0.57, an MAE of 11.6, and an RMSE of 14.0 in Sequence 0. Gender classification was evaluated with the true positive rate (recall) and classification accuracy, using Logistic Regression, MLP, SVM, RF, and Naive Bayes (NB). Logistic regression gave the best results, with 72.2% male recall, 63.9% female recall, and 68.2% classification accuracy in Sequence 0.

C. Xu et al. [19] focused on uncertainty-aware age estimation for age-dependent applications, that is, applications that query or group by age. Instead of a single age estimate, their label distribution learning framework uses appearance-based Gait features and a discrete label distribution, together with a reduced loss function, to support uncertainty-based age estimation. MAE was used to evaluate age estimation accuracy, and IOU was calculated from the overlap between the predicted and ground-truth statements. They obtained 95% IOU for the predicted age statistic, which is close to the ground truth, and GaitSet's MSE of 4.91 years is better than GEINet's MSE of 5.41 years.

C. Xu et al. [10] used a single image instead of a sequence of Gait features. A single image from the Gait set is used to estimate the probability distribution of integer age labels and to recognize gender, which should benefit real-world applications in the future. The results were better at view angles of 75° to 90° (side view). For age estimation on a single image, the MAE was 8.93 and the CS was 16.39. For gender classification on a single image, the CCR was 94.27%.

B. Kwon and S. Lee [25] examined 3D gender classification using joint swing energy (JSE). JSE measures the distance between the model's skeleton joints and the anatomical planes while a person is striding. Anatomical planes are commonly extracted from fixed poses instead of from motion. The authors studied a technique for obtaining the crosswise, median, and frontal planes from a sequence of 3D Gait data, which enables human-centered measurements to be used to represent the motion of each joint. They extracted JSEs from 3D Gait sequences using this approach, examined 4 datasets, and applied KNN, NB, DT, and SVM. They obtained the best accuracy (98.08%) using JSE-SVM on dataset B, which has 104 users (50 male, 54 female).

J. Upadhyay and T. Gonsalves [9] aimed to make the system more lightweight and robust, and to avoid the perception of Pearson's moment based on angles, while obtaining the best possible results when classifying people according to their gender. They applied the discrete cosine transform (DCT) to the GEI to extract DCT vectors, which were then passed to XGBoost to perform gender classification. Based on 14 view angles of Gait data, the mean CCR for gender classification was 95.33%.

Y. Chen et al. [26] recorded 960 steps from 24 younger and older subjects using a sole pressure mat and estimated the Centre of Pressure (COP) trajectory. SVM was applied to 30 features, including initial contact, forefoot contact, foot flat, and forefoot push-off. The RBF kernel was compared with linear, sigmoid, and polynomial kernels, and the RBF-SVM on COP features achieved 99.65% accuracy. Moreover, they used 13 different stages for each participant and achieved 95% accuracy in age recognition, and 5 steps for each participant and achieved 97% accuracy in gender detection.

5 Comparison and Summary of Related Research Work

A comparison and summary of related research work carried out by various researchers is presented in Table 1. The comparison table shows that there is considerable scope for applying deep learning and transfer learning approaches to improve the performance of Gait recognition.

Table 1. Comparison and Summary of related research work

6 Future Work

Researchers have done a large amount of work in this field, but much can still be added to enrich it. Some pointers for future research are listed below:

  1) Gait Analysis can be used to predict the height and weight of a person.

  2) Focus on overcoming the issue of Gait obstruction and creating a robust system that can cope with partially accessible Gait.

  3) Additional data should be obtained from a wider variety of groups, particularly those that differ in terms of their ethnicity and body shape, in order to make the model more resilient.

  4) One possible direction for future research is to find ways to smooth out the predicted age and gender distribution. Sometimes there is a drastic shift in the odds between two consecutive age labels. As a result, a dataset that is both symmetrical and comprehensive is required.

  5) Develop an age and gender predictor that accounts for the impact of many parameters, including carrying status and walking speed.

7 Limitations and Challenges

Researchers have done a large amount of work to solve many problems, but some problems remain. These are listed below:

  1) The performance of the system is negatively impacted whenever there is a change in the person's clothing condition or the camera view angle.

  2) The pattern of gait is influenced by a variety of elements, one of which is age: walking depends on muscle strength, which naturally decreases as a person gets older. Variations in muscular strength among persons of the same age also affect walking style.

  3) The Gait pattern is known to change during pregnancy.

  4) Accuracy decreases when a person moves faster, because a higher sample rate is required.

8 Conclusion

High demand exists for a gait-based dataset that may be utilized to extract useful information. Gait analysis isn’t new, but sensor-based gait datasets are. The majority of machine learning studies on the gait dataset rely on visual representations. Identifying a person’s gender and age from their gait is challenging. In this study, gait-based age and gender detection studies are compared. It also shows where researchers wish to go and what challenges they face.