Abstract
Human Activity Recognition is the process of identifying the activity of a person by analyzing continuous frames of a video. In many application areas, human activity identification is either a direct goal or a key part of a bigger objective. Examples include surveillance systems, elder healthcare monitoring systems, and abnormal activity detection systems such as fight detection and theft detection. Robust and accurate activity recognition is a challenging task for diverse reasons, such as changing ambient illumination, noise, background turbulence and camera placement. The existing literature discusses some techniques for identifying human activity, but these approaches are restricted to videos recorded from a static camera. The aim of the proposed approach is to fill this gap. In the proposed method, a new skeleton based feature for human activity recognition, the “Orientation Invariant Skeleton Feature (OISF)”, is introduced and used to train a Random Forest (RF) classifier. The efficiency of the newly introduced feature OISF is analyzed for videos recorded with multiple cameras positioned at two different slant angles. Experimental results reveal that OISF has minimal dependency on variations of camera orientation. The accuracy achieved is ≈ 99.30% with the ViHASi dataset, ≈ 96.85% with the KTH dataset and ≈ 98.34% with the in-house dataset, which is higher than the accuracy achieved by other researchers with existing features. The improved accuracy of human activity recognition indicates the suitability of the proposed approach for commercial use.
1 Introduction
Human Activity Recognition aims to recognize the physical activity of one or more persons from a sequence of observations using sensors or video logs. It is a difficult task for real time videos because of background clutter, variations in illumination, varying viewpoints, partial occlusions and camera movements. Recognition of human activity is thus an open and challenging research area in the fields of computer vision, machine vision, image processing [1] etc. A robust human activity recognition system can greatly enhance the efficacy of surveillance systems [7], human-computer interaction [21], gesture interpretation [39], life-care systems [14, 23], sports analysis [32], smart homes [15] and other application domains. Due to its wide applicability in real life, many researchers have been working in this area for a long time. The recognition accuracy of such a system broadly depends on the quality of the extracted features and the learning capability of the classifier. Different features like Motion Stable Shape (MSS) [18], Motion Scale-Invariant Feature Transformation (MoSIFT) [6], Speeded Up Robust Features Motion History Image (SURF-MHI) [35], Spatio-temporal interest point (STIP) detector [8], skeleton based features etc. have been used for Human Activity Recognition by the research community in the past.
A Human Activity Recognition system is applied to data collected through wearable sensors or to recorded video (recorded by a CCTV camera, surveillance camera, etc.). The wearable sensor [3, 26, 39] based approach first collects data from sensors attached to different parts of the human body. Thereafter, the physical activity of the person is recognized by analyzing these data. This type of approach can be used only when a person is always wearing sensors, which is not always feasible in a real life scenario. To mitigate this limitation, computer vision based human activity recognition is the preferred method. In the vision based approach, features like MoSIFT, SURF-MHI, etc. are extracted from the video data and are then used to train the classifier. These features work well for videos recorded with static cameras, but not for videos recorded with moving cameras. Also, the processing time for extracting these features is high because feature extraction requires processing of all pixels in the image. To minimize these limitations, skeleton based features have been proposed by many researchers for human activity recognition. Feature selection plays an important role in vision based human activity recognition [9, 40] because multiple features may be required to train the system to improve its accuracy. One can select a large number of features to increase the accuracy, but this increases the time needed for feature extraction, learning and testing of the system. Hence, skeleton-based features prove to be a viable alternative.
Skeleton features [15, 32, 37] are extracted by analyzing the skeleton pose in a sequence of frames. The skeletal representation of the human body at any instant reflects the original pose of that person. A human pose in a video frame contains a large number of pixels, while its skeletal representation contains comparatively few pixels, which reduces the processing time for feature extraction. Human Activity Recognition using skeletons involves the following steps:
1. Extraction of skeletons from video frames
2. Extraction of skeleton features
3. Construction of feature vector
4. Training and testing of the classifier
The skeleton feature usually contains joint information of different body parts, such as the end and mid points of the hands and legs, and the position of the head. Feature vectors are then constructed by fusing the features extracted from the person being tracked over a few consecutive frames. Finally, a supervised machine learning based classifier is used to identify the physical activity of the tracked person based on the extracted features. Neural trees, SVM, LDA and other classifiers are used in the existing literature for the activity recognition task. Since human activity recognition is a multi-class classification problem, a tree based classifier is a preferable approach. This observation motivates us to explore the suitability of the Random Forest classifier for human activity recognition. The aim of this work is to propose skeleton features for human activity recognition with the following goals:
- To improve the recognition accuracy of the system
- To reduce the dependency on camera positions and angles
- To obtain similar recognition accuracy for the videos recorded by either static or moving cameras
In this paper, a new feature named “Orientation Invariant Skeleton Feature (OISF)” is introduced for human activity recognition. This newly introduced feature is examined through a number of experiments on two publicly available datasets, KTH and ViHASi, and one in-house dataset which is available at https://github.com/neelamdw/Actiondataset/. The in-house dataset was developed in the lab and contains five different actions (boxing, hand clapping, hand waving, jogging and walking). Three subsets (SA1, SA2 & SA3) are prepared by selecting similar and dissimilar activities from the ViHASi dataset (see Table 1 in Section 4). Experiment #1, experiment #2 and experiment #3 are performed on SA1, SA2 and SA3, respectively. All the activities of the KTH dataset are taken together to perform experiment #4. Similarly, all the activities of the in-house dataset are taken together to perform experiment #5. The major contributions of this paper are as follows:
- A new skeleton based feature named “Orientation Invariant Skeleton Feature (OISF)” for human activity recognition
- Performance evaluation of the proposed feature with the Random Forest classifier for the KTH dataset, the in-house dataset and different subsets of the ViHASi dataset
This paper is organized as follows: In Section 2, literature survey for human activity recognition is presented. In Section 3, methodology and architecture of the proposed approach are explained. Experimental results and their analysis are presented in Section 4. Conclusion is presented in Section 5.
2 Literature survey for human activity recognition
In this section, various human activity recognition approaches are briefly reviewed. Uddin et al. [33] utilized Independent Component Analysis (ICA) to extract the activity shape information from the body joints instead of using the whole body. They then apply a Hidden Markov Model (HMM) on the extracted activity shape information to recognize the human activity. In this method, features are easy to extract from the silhouettes, but it cannot distinguish between near and distant body parts. To overcome this limitation, Jalal et al. [13] present human activity recognition for smart homes based on the R transformation applied on depth silhouettes, which requires more processing time. To minimize the processing time, a new model for action recognition based on the combination of mid-level representation (HoG and BoW poselets) and discriminative key frame selection is proposed by Raptis et al. [28]. These approaches require the complete sequence of frames to recognize the human activity correctly. To achieve early detection of an action, a hybrid technique is presented by Vats et al. [34] which combines the benefits of computer vision and fuzzy set theory. This approach may recognize the action even if only a partial action occurs. They use the Fuzzy BK sub-product because of its flexibility and capability to imitate natural human behavior. All of the above-mentioned approaches require a large amount of pixel processing.
To reduce the pixel processing, skeletons are first extracted from the images. Thereafter, features are extracted by processing the skeleton sequences, which reduces the pixel processing time. Anjum et al. [2] present complex human activity recognition by tracking a subset of human skeleton joints instead of tracking the whole skeleton. Skeleton joints are selected either manually or automatically from the depth videos recorded by a Kinect camera. They use Multiclass Support Vector Machines (MSVMs) to classify the human activity. Weng et al. [37] present human activity recognition using a length-variable edge trajectory (LV-ET) and a spatio-temporal motion skeleton descriptor. The LV-ET is extracted by tracking edge points across video frames based on optical flow, with the aim of obtaining a better descriptor as different types of actions evolve. A novel encoding method for trajectory clustering is proposed to extract the Spatio Temporal Motion Skeleton (STMS) (also called motion skeleton). Habli et al. [11] propose skeleton-based human activity recognition for elderly monitoring systems. For this task, they use spatial and temporal coordinates of the 3D skeleton and combine both to represent each frame of a human activity. A randomised tree algorithm is used to train and validate the method on the MSR-Action3D and DailyActivity3D datasets. Manzi et al. [22] have introduced an activity recognition system for two persons using skeleton data extracted from a depth camera. They first use an unsupervised clustering approach to describe the activity using a small set of basic informative postures. Thereafter, models are created using multiclass support vector machines on the training set. An optimal number of clusters for each sample is found dynamically by the X-means algorithm during the classification phase.
Li et al. [19] focus on multi-view skeletal interactions for human activity recognition. For this, a multi Active Joint Interaction Graph (AJIG) model is proposed to code the spatio-temporal patterns of two-person skeletal interactions. An AJIG kernel is then used to compute the similarity between two AJIGs. Further, a multiple kernel learning approach is applied to jointly learn the optimal combination of the numerous AJIG kernels. Ofli et al. [25] present a sequence of the Most Informative Joints (SMIJ) feature to recognize a human skeleton activity. At each time instant, a few skeletal joints that are most related to the current action are selected. The selection of joints is based on highly interpretable measures such as the mean or variance of joint angles, the maximum angular velocity of joints etc. Zhu et al. [40] present a deep LSTM network for skeleton activity recognition. Their model facilitates the learning of skeleton joint features with the help of a co-occurrence exploration mechanism, and uses dropout when handling the complex structures among the important joints. They compare their method with other deep networks for skeletal activity recognition [9]. The methodology of the proposed approach is discussed in the next section.
3 Proposed approach for human activity recognition
In this section, a skeleton based feature to identify human activity is introduced. Figure 1 depicts the architectural design of the proposed approach for human activity recognition, which consists of three phases.
In the first phase, the video is converted into consecutive frames, followed by skeletonization using Algorithm 1. Figure 2 depicts the skeletonization process of an input video frame. This process involves extraction of the foreground object, conversion of the foreground object into a binary frame, enhancement of the frame, filling of small holes, removal of islands, and repeated thinning. In the second phase, the Region of Interest (RoI) is obtained after skeletonization and is marked by an elliptical boundary. To draw the elliptical bounding box around the skeleton, the orientation, centroid, length of major axis, length of minor axis and eccentricity of the ellipse are obtained by applying the MATLAB function “regionprops()”. This elliptical bounding box is further divided into eight symmetric regions. All of these tasks are performed by applying Algorithm 2. In the third phase, the feature FV 1 and the newly introduced feature OISF are extracted using Algorithm 3 and Algorithm 4 respectively. The time complexity of extracting each of the features FV 1 and OISF is Θ(m × n), where m × n is the total number of pixels in a frame. A Random Forest classifier is trained with each extracted feature separately for human activity recognition. The structure of the random forest classifier proposed by Leo Breiman [4] is used in this paper. Each forest consists of 500 trees. Once the classifier is trained, it is used to recognize the activity from a new sequence of frames.
3.1 Phase 1 (pre-processing of video): skeletonization of input video frames
Depending on the frame rate and duration of the video, an input video (V) is converted into N consecutive frames (F1, F2, ..., FN). For images with complex backgrounds, the foreground object is extracted by applying the background subtraction technique. Each foreground frame is then transformed into a binary frame (image) in two steps:
- Step 1: For each frame, two intensity thresholds (th1 and th2) are calculated as:
$$th1 = h_{j} - \text{c}1 $$
$$th2 = h_{j} + \text{c}2 $$
Here, hj is the pixel intensity value with maximum frequency in a grayscale frame. Different values of c1 & c2 have been tried through experimentation (the values used in the experiments are c1 = 35 and c2 = 40). These values (c1, c2) are adjusted initially and then remain constant for a dataset.
- Step 2: All the intensity values lying between th1 and th2 in a frame are set to ‘1’. All other intensity values are set to ‘0’. The resulting image is a binary image.
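The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB implementation: the function name `binarize_frame`, the inclusive treatment of the bounds th1 and th2, and the tie-breaking rule for the most frequent intensity are assumptions made for the example.

```python
from collections import Counter

def binarize_frame(gray, c1=35, c2=40):
    """Binarize a grayscale frame around its most frequent intensity.

    gray: list of rows, each a list of integer intensities (0-255).
    Returns a frame of the same shape containing only 0s and 1s.
    """
    # h_j: the intensity value with the highest frequency in the frame.
    counts = Counter(v for row in gray for v in row)
    # Assumption: ties between equally frequent intensities go to the lower value.
    h_j = max(counts, key=lambda v: (counts[v], -v))
    # Step 1: the two thresholds.
    th1, th2 = h_j - c1, h_j + c2
    # Step 2: intensities in [th1, th2] become 1, all others 0
    # (inclusive bounds are an assumption).
    return [[1 if th1 <= v <= th2 else 0 for v in row] for row in gray]

frame = [
    [100, 100, 100, 100],
    [100, 200, 210, 100],
    [100, 100, 100, 20],
]
binary = binarize_frame(frame)
```

In this toy frame the most frequent intensity is 100, so th1 = 65 and th2 = 140, and only the pixels with values 200, 210 and 20 are set to 0.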
Each binary frame is further enhanced by applying a median filter. Two morphological operations, namely dilation and erosion, are used to obtain a well-defined shape of the person (silhouette) from the binary frame. To perform these morphological operations, a linear structuring element of length c3 is used. Dilation and erosion are performed k1 and k2 times respectively on the input binary frame. Various combinations of dilation and erosion operations were tried; k1 = 2, k2 = 1 & c3 = 3 have shown promising results in terms of silhouette extraction. The skeleton is then obtained by applying repeated thinning to the silhouette produced by these morphological operations. The thinning operation removes pixels in such a way that an object shrinks to a minimally connected stroke. The pseudo code of the skeletonization process is presented in Algorithm 1 (the detailed algorithm is given as Algorithm A.1).
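The morphological clean-up described above can be illustrated with the following sketch, which applies k1 dilations followed by k2 erosions with a linear structuring element of length c3. It is an assumption-laden illustration: the element is taken to be horizontal, pixels outside the frame are treated as 0, the helper names are hypothetical, and the median filter and thinning steps are omitted.

```python
def dilate(img, length=3):
    """Binary dilation with a horizontal linear structuring element."""
    half = length // 2
    h, w = len(img), len(img[0])
    return [
        [1 if any(img[r][k] for k in range(max(0, c - half), min(w, c + half + 1))) else 0
         for c in range(w)]
        for r in range(h)
    ]

def erode(img, length=3):
    """Binary erosion with the same element; outside the frame counts as 0."""
    half = length // 2
    h, w = len(img), len(img[0])
    return [
        [1 if all(0 <= k < w and img[r][k] for k in range(c - half, c + half + 1)) else 0
         for c in range(w)]
        for r in range(h)
    ]

def clean_silhouette(img, k1=2, k2=1, c3=3):
    """Apply k1 dilations followed by k2 erosions."""
    for _ in range(k1):
        img = dilate(img, c3)
    for _ in range(k2):
        img = erode(img, c3)
    return img

# A single row with a two-pixel gap between foreground pixels.
row = [[0, 1, 0, 0, 1, 0, 0, 0, 0]]
cleaned = clean_silhouette(row)
```

With k1 = 2 and k2 = 1, the gap between the two foreground pixels in the example row is closed, which mirrors the hole-filling role these operations play before thinning.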
Figure 3a, b and c show the ten original frames, and the corresponding binary and skeleton frames obtained through Algorithm 1. These frames are taken from the KTH dataset for ‘running’ activity. For the sake of visual clarity, all the figures in this paper include both the complements of actual silhouettes and the corresponding skeletons.
3.2 Phase 2: Region of Interest (RoI) selection
The skeleton extracted from the silhouette (obtained in Phase 1) is our Region of Interest. The elliptical boundary around the skeleton separates it from the input frame. The centroid (xc,yc), length of major axis (2 × a), length of minor axis (2 × b), eccentricity (e) and orientation (𝜃) of the ellipse are calculated for further processing. As explained before, the “regionprops()” function is used to obtain these values. This ellipse is then divided into eight symmetric regions in order to extract hand, leg and head features, as shown in Fig. 4. To divide the ellipse into eight symmetric regions, the following four lines are drawn through the centroid of the ellipse:
1. Major-axis
2. Minor-axis
3. Line passing through centroid and inclined at an angle of 45∘ in clockwise direction from the minor axis
4. Line passing through centroid and inclined at an angle of 45∘ in anti-clockwise direction from the minor axis
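Let us assume that (xc,yc) is the coordinate of the centroid of the given ellipse. Then, equation (1) represents the ellipse whose orientation is 𝜃:
$$\frac{(x_{1}\cos\theta + y_{1}\sin\theta)^{2}}{a^{2}} + \frac{(x_{1}\sin\theta - y_{1}\cos\theta)^{2}}{b^{2}} = 1 \qquad (1)$$
where x1 = x − xc and y1 = y − yc, and a and b are half the lengths of the major and minor axes respectively.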
To draw these lines, coordinates are obtained by applying eight cuts on the ellipse at the angles 0∘ (cut 1), 45∘ (cut2), 90∘ (cut3), 135∘ (cut4), 180∘ (cut5), 225∘ (cut6), 270∘ (cut7), 315∘ (cut8). Four lines are drawn between cut1 & cut5 (minor axis), cut2 & cut6 (dotted line), cut3 & cut7 (major axis), and cut4 & cut8 (dashed line) respectively.
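As a rough sketch of how the eight cut points could be computed, the snippet below intersects rays from the centroid with the ellipse at 45∘ steps. The function name `ellipse_cuts` is hypothetical, and several conventions are assumptions: the cut angle is taken as a polar angle measured from the minor axis (so that cut1 and cut5 lie on the minor axis and cut3 and cut7 on the major axis), 𝜃 is the angle between the x axis and the major axis, and a y-up coordinate frame is used, whereas image coordinates usually point down.

```python
import math

def ellipse_cuts(xc, yc, a, b, theta_deg):
    """Return the eight cut points (cut1..cut8) on the ellipse boundary.

    a, b: half the major and minor axis lengths; theta_deg: orientation.
    """
    theta = math.radians(theta_deg)
    cuts = []
    for k in range(8):
        phi = math.radians(45 * k)           # angle measured from the minor axis
        u, v = math.sin(phi), math.cos(phi)  # u: along major axis, v: along minor axis
        # Distance from the centroid to the boundary along direction (u, v).
        r = 1.0 / math.sqrt((u / a) ** 2 + (v / b) ** 2)
        # Rotate from the ellipse frame into the frame of the image.
        x = xc + r * (u * math.cos(theta) - v * math.sin(theta))
        y = yc + r * (u * math.sin(theta) + v * math.cos(theta))
        cuts.append((x, y))
    return cuts

cuts = ellipse_cuts(xc=0.0, yc=0.0, a=4.0, b=2.0, theta_deg=0.0)
# Four dividing lines: cut1-cut5, cut2-cut6, cut3-cut7, cut4-cut8.
lines = [(cuts[i], cuts[i + 4]) for i in range(4)]
```

Each of the four lines joins two diametrically opposite cuts and therefore passes through the centroid, which is what produces the eight regions.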
The obtained ROI has eight symmetric regions, and each region contains skeleton information. Figure 5a, b and c show, respectively, the skeletal representation of the frames obtained by applying Algorithm 1, the ROI with its elliptical bounding box, and its division into 8 regions using Algorithm 2 (the detailed algorithm is given as Algorithm A.2). Algorithms for extracting the features are explained in Section 3.3.
3.3 Features used
Two features, FV 1 [17] and the Orientation Invariant Skeleton Feature (OISF), have been used in the proposed approach. To extract these features, skeletons are obtained from an input video by applying Algorithm 1. The Region of Interest (ROI) is selected and bounded by an elliptical bounding box that is further divided into eight symmetric regions (Fig. 4) by applying Algorithm 2. The information in each region of the skeleton consists of the number of pixels and the coordinates of the hands, legs and head. Using this information, the feature vectors FV 1 and OISF are calculated. A total of eight features per skeleton are obtained for each of FV 1 and OISF. Details of these features are discussed in Sections 3.3.1 and 3.3.2 respectively.
3.3.1 FV 1 feature extraction
Figure 6 shows the flow chart for extracting the FV 1 feature value of one frame. Skeletons sk1, sk2, ..., skN are obtained from the input video V by applying Algorithm 1. For each skeleton, an elliptical boundary is drawn and divided into eight symmetric regions by applying Algorithm 2. \(s{k_{i}^{j}}\) contains the total number of white pixels of ith skeleton in the jth (1 ≤ j ≤ 8) region of the ellipse. For each region, one feature value is extracted as follows:
- Step 1: Compute the sum of pixels of the jth region of the ith skeleton ski:
$${P_{i}^{j}} = sum (s{k_{i}^{j}})$$
- Step 2: Compute the total number of pixels (pt) for the ith skeleton ski:
$$p^{t} =\sum\limits_{j=1}^{8} {P_{i}^{j}}$$
Thus, the feature corresponding to the jth region is given by the following expression:
$${f_{i}^{j}} = \frac{{P_{i}^{j}}}{p^{t}}$$
Here, \(f_{i}^{[1...8]}\) = \([{f_{i}^{1}}, {f_{i}^{2}}..., {f_{i}^{8}}]\) is the feature corresponding to ith skeleton ski. Thus, a feature vector of size (N × 8) is obtained for the input video V. Final feature vector FV 1 is generated by fusing features of ten consecutive frames, since an activity is characterized by analyzing sequence of frames. This feature vector FV 1 is used to train and test the classifier for human activity recognition. Size of final feature matrix FV 1 for N frames is (\(\frac {N}{10} \times 80\)). Motivation for and the process of extraction of newly introduced feature OISF are discussed in the next section.
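The construction of FV 1 can be sketched as follows. The sketch assumes that the feature of a region is that region's share of the skeleton's white pixels (the ratio of the Step 1 sum to the Step 2 total), that ten-frame groups do not overlap (consistent with the stated size of N/10 × 80), and that an empty skeleton yields zeros. The function names are hypothetical.

```python
def fv1_frame(region_counts):
    """Eight FV1 values for one skeleton.

    region_counts: white-pixel counts of the eight regions of one skeleton.
    """
    total = sum(region_counts)  # total white pixels of the skeleton
    if total == 0:
        return [0.0] * 8        # assumption: an empty skeleton gives all zeros
    return [p / total for p in region_counts]

def fuse(features, window=10):
    """Concatenate per-frame features over non-overlapping windows of frames."""
    rows = []
    for start in range(0, len(features) - window + 1, window):
        row = []
        for frame_features in features[start:start + window]:
            row.extend(frame_features)
        rows.append(row)
    return rows

# Toy input: 20 skeletons with identical region counts.
counts_per_frame = [[10, 20, 30, 40, 0, 0, 0, 0]] * 20
per_frame = [fv1_frame(c) for c in counts_per_frame]
fv1_matrix = fuse(per_frame)
```

For N = 20 frames the result has 2 rows of 80 values each, matching the N/10 × 80 shape given in the text.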
3.3.2 OISF feature extraction
In this section, the motivation behind introducing a new feature, the “Orientation Invariant Skeleton Feature (OISF)”, and the algorithm for extracting it are discussed. In the literature, it is observed that existing features for human activity recognition depend on the orientation and positioning of the cameras, which reduces the recognition accuracy for videos recorded by a moving camera. The motivation for introducing the OISF feature is to overcome this limitation. Application of the OISF feature improves the recognition accuracy of a human activity recognition system for videos recorded with moving as well as static cameras. This is because OISF characterizes human actions with respect to the relative movements of hands and legs along the x & y axes separately in each frame. The features corresponding to each skeleton ski are obtained as follows:
1. x and y coordinates of first white pixel (having least value of x coordinate) in the first, second, seventh and eighth regions are determined.
2. x and y coordinates of last white pixel (having maximum value of x coordinate) in the third, fourth, fifth and sixth regions are determined.
3. Absolute differences of x and y coordinates of first & fourth, second & third, fifth & eighth, sixth & seventh regions are calculated and taken as eight features of skeleton ski.
Example 1: For any skeleton ski, let (x1,y1), (x2,y2), (x7,y7) and (x8,y8) represent the coordinates of first white pixel of first, second, seventh and eighth regions respectively, while (x3,y3), (x4,y4), (x5,y5) and (x6,y6) represent the coordinates of last white pixel of third, fourth, fifth and sixth regions, respectively. First, second, third, fourth, fifth, sixth, seventh and eighth feature values of the skeleton ski are calculated as abs(x1 − x4), abs(x2 − x3), abs(x5 − x8), abs(x6 − x7), abs(y1 − y4), abs(y2 − y3), abs(y5 − y8) and abs(y6 − y7) respectively.
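Example 1 translates directly into code. The sketch below assumes that each of the eight regions is given as a non-empty list of (x, y) white-pixel coordinates; the function name `oisf_frame`, the input layout, and the handling of ties (the first pixel encountered wins) are illustrative assumptions.

```python
def oisf_frame(regions):
    """Eight OISF values for one skeleton.

    regions: dict mapping region number (1..8) to a list of (x, y) white pixels.
    """
    points = {}
    # First white pixel (least x) in regions 1, 2, 7 and 8.
    for j in (1, 2, 7, 8):
        points[j] = min(regions[j], key=lambda p: p[0])
    # Last white pixel (greatest x) in regions 3, 4, 5 and 6.
    for j in (3, 4, 5, 6):
        points[j] = max(regions[j], key=lambda p: p[0])
    # Region pairs: first & fourth, second & third, fifth & eighth, sixth & seventh.
    pairs = [(1, 4), (2, 3), (5, 8), (6, 7)]
    dx = [abs(points[p][0] - points[q][0]) for p, q in pairs]
    dy = [abs(points[p][1] - points[q][1]) for p, q in pairs]
    return dx + dy  # four x-differences followed by four y-differences

regions = {
    1: [(2, 5), (4, 6)],
    2: [(3, 9)],
    3: [(10, 8), (12, 7)],
    4: [(11, 4)],
    5: [(13, 1), (9, 2)],
    6: [(8, 0)],
    7: [(1, 3)],
    8: [(5, 2), (0, 1)],
}
features = oisf_frame(regions)
```

Because only coordinate differences between paired regions are kept, translating the whole skeleton within the frame leaves the eight values unchanged.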
Normally, activities cannot be distinguished by analyzing a single frame. Therefore, features of ten consecutive frames are combined to generate the OISF feature vector for the classification of human activities. In the next section, experimental set-up and datasets used are discussed along with the analysis of the proposed method using various performance metrics.
4 Experiments and their analysis
To evaluate the performance of the proposed approach, five experiments have been conducted. All these experiments have been conducted using MATLAB R2017a on a Core i7 processor with 4GB RAM. For experiments #1, #2, & #3, frames are taken from the Virtual Human Action Silhouette (ViHASi) dataset [27]. For experiment #4, frames are taken from the KTH dataset [31]. For experiment #5, frames are taken from the in-house dataset. The ViHASi dataset contains synthetic videos of 20 action classes performed by 9 actors. These videos use a maximum of 40 synchronized perspective camera views. These 40 camera views are divided into two sets, each consisting of 20 camera views. The two sets of cameras are fixed at slant angles of 27∘ and 45∘ with the horizontal plane respectively. The angular difference between adjacent cameras is 18∘ in both sets. The videos of the KTH dataset contain six actions (boxing, hand clapping, hand waving, jogging, running and walking). These actions are performed by 25 persons in indoor and outdoor scenarios with different scale variations. All the video sequences are taken over a homogeneous background with a static camera. As explained in Section 1, the videos of the in-house dataset contain five different actions. These actions are performed by 2 actors in an indoor scenario. All the videos are recorded by a static camera with low resolution, complex background, and variations in illumination.
To create the maximum possible combinations of activities, the twenty action classes of the ViHASi dataset have been divided into three subsets for experiments #1, #2 & #3. These subsets are categorized on the basis of similarity of actions and named SA1, SA2 and SA3. Table 1 lists the actions in each subset. SA1 contains closely similar actions such as running and walking. SA2 contains a combination of similar (run pull object, run push object etc.) and dissimilar (knockout, punch etc.) actions. SA3 contains dissimilar actions such as hang on bar and Granade. Actions of SA1, SA2 and SA3 are used in experiments #1, #2 & #3, respectively. These three experiments are conducted in two separate scenarios:
1. Videos recorded by the first set of cameras fixed at slant angle of 27∘
2. Videos recorded by the second set of cameras fixed at slant angle of 45∘
In all the experiments, input videos are pre-processed using Algorithm 1, and the FV 1 and OISF features are extracted using Algorithm 3 (the detailed algorithm is given as Algorithm A.3) and Algorithm 4 (the detailed algorithm is given as Algorithm A.4) respectively. A separate human activity recognition model is created by training a Random Forest classifier for each of the following cases:
- Case 1: Training with FV 1 only
- Case 2: Training with OISF only
In this way, two separate Random Forest models are developed for each of the experiments. The Random Forest classifier is a supervised machine learning based classifier that uses an “ensemble learning method” for classification. The general method of Random Forest was first proposed by Ho in 1995 [12]. Two well-known methods for building ensembles of classification trees are boosting [30] and bagging [5]. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors, and a weighted vote is taken from all the decision trees to predict the class. In bagging, on which Random Forest builds, each tree is grown independently on a bootstrap sample of the training data. For example, if a classification problem has n1 classes, then samples of all the classes are selected randomly for training. If each sample contains K variables, then k (k < K) randomly selected variables are considered for the split at each node. Each decision tree grows to its maximum extent without any pruning. A new sample is classified by the majority vote of all the decision trees. The structure of the random forest classifier proposed by Leo Breiman [4] is used in this paper. A total of 500 trees are used to construct a forest. Figure 7 summarizes all the experiments. To measure the performance of the proposed approach, the Confusion Matrix and five performance metrics (Precision, Recall, Specificity, F1 score and Accuracy) are used; they are discussed in the next section.
4.1 Parameters used for performance measurement
Precision, Recall, Specificity, F1 score, Accuracy and the Confusion Matrix are used to evaluate the performance of the proposed approach. Assume a classifier classifies input video frames as shown in Table 2, where \({N_{i}^{j}}\) denotes the number of instances of the ith activity classified as the jth activity by the classifier. Performance metrics for these classification results are calculated as follows:
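For reference, the following sketch computes these metrics per class from a confusion matrix using the standard one-vs-rest definitions: precision = TP/(TP+FP), recall = TP/(TP+FN), specificity = TN/(TN+FP), F1 as the harmonic mean of precision and recall, and accuracy = (TP+TN)/total. Whether these match the exact expressions used in the experiments is an assumption; the function name and the toy matrix are illustrative only.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics per class.

    cm[i][j]: number of instances of activity i classified as activity j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp   # other classes predicted as c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        accuracy = (tp + tn) / total if total else 0.0
        results.append({
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": f1,
            "accuracy": accuracy,
        })
    return results

# Toy 3-class confusion matrix (rows: actual activity, columns: predicted).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
metrics = per_class_metrics(cm)
```

Under these definitions, and assuming equally represented classes, an average recall of 92.00% over ten classes gives an average accuracy of 1 − 2 × 0.08 / 10 = 98.40%, which agrees with the Case 1 averages reported for the second set of cameras in Experiment #1.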
4.2 Experiment #1
To perform this experiment, ten similar activities (C2: JumpGetOnBar, C3: JumpOverObject, C4: JumpFromObject, C5: RunPullObject, C6: RunPushObject, C7: RunTurn90Left, C8: RunTurn90Right, C18: Walk, C19: WalkTurn180, and C20: Run) grouped in SA1 (Table 1) for the jump, walk and run categories are taken from the ViHASi dataset. Figure 8a, b and c show, respectively, ten sample silhouettes of the RunTurn90Left activity recorded from the second set of cameras, the skeletons obtained using Algorithm 1, and the symmetrical 8-region division of the elliptical bounding box obtained using Algorithm 2. To measure the effectiveness of the proposed approach, the confusion matrix, precision, recall, specificity, F1 score, and accuracy are used as discussed in Section 4.1.
Tables 3 and 4 show the confusion matrices that are obtained in the first experiment for both the cases (Case 1 and Case 2) on two different sets of camera angles. The activities with similar body movements are misclassified in some of the instances for both of the cases. For example, most of the misclassified instances of Run (C20) activity are classified as RunTurn90Left (C8) activity and vice versa. It happens because both the actions have similar body movements.
Tables 5 and 6 list the Precision, Recall, Specificity, F1 score and Accuracy that are obtained in the first experiment for both the cases (Case 1 and Case 2) on two different sets of camera angles (27∘ and 45∘). By analysing the results presented in Tables 5 and 6, it is observed that average Precision is 96.67% & 96.83%, average Recall is 96.67% & 96.83%, average Specificity is 99.63% & 99.65%, average F1 score is 96.65% & 96.83% and average Accuracy is 99.33% & 99.37% with first set of cameras. On the other hand, with second set of cameras, average Precision is 92.30% & 95.25%, average Recall is 92.00% & 95.17%, average Specificity is 99.11% & 99.46% , average F1 score is 92.04% & 95.17% and average Accuracy is 98.40% & 99.03%. Following conclusions can be made from this experiment:
- Average precision, recall, specificity, F1 score and accuracy are higher in Case 2 (when the Random Forest classifier is trained with the OISF feature) than in Case 1 (when the Random Forest classifier is trained with FV 1). This confirms the effectiveness of the proposed OISF feature.
- When the accuracies obtained with the two sets of cameras (slant angles of 27∘ and 45∘) are compared, the variation is smaller in Case 2 than in Case 1. This observation further confirms that OISF is less dependent on camera positioning.
- The high F1 score and high accuracy confirm the applicability of the proposed work for human activity recognition.
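The comparison between camera sets can be checked with a short calculation on the average accuracies reported above for Experiment #1. The dictionary layout and names are illustrative; the numbers are taken from the text.

```python
# Average accuracy (%) reported for Experiment #1, per case and camera set.
accuracy = {
    "FV1 (Case 1)": {"27 deg": 99.33, "45 deg": 98.40},
    "OISF (Case 2)": {"27 deg": 99.37, "45 deg": 99.03},
}

# Drop in accuracy when moving from the 27-degree set to the 45-degree set.
gap = {
    name: round(acc["27 deg"] - acc["45 deg"], 2)
    for name, acc in accuracy.items()
}

for name, value in gap.items():
    print(f"{name}: {value:.2f} percentage points")
```

The drop is 0.93 percentage points for FV 1 and 0.34 percentage points for OISF, so the OISF accuracy varies less between the two slant angles.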
4.3 Experiment #2
To perform this experiment, combination of ten similar and dissimilar activities (C1: HangOnBar, C2: JumpGetOnBar, C3: JumpOverObject, C4: JumpFromObject, C5: RunPullObject, C6: RunPushObject, C11: KnockoutSpin, C12: Knockout, C16: Punch and C17: JumpKick) grouped in SA2 (Table 1) for jump, run and knockout categories are taken from the ViHASi dataset.
Tables 7 and 8 present the confusion matrices that are obtained in the second experiment for both the cases (Case 1 and Case 2) on two different sets of camera angles. Tables 9 and 10 show the Precision, Recall, Specificity, F1 score and Accuracy that are obtained in the second experiment for both the cases (Case 1 and Case 2) on two different sets of camera angles. After evaluating the results of experiment #2, it is observed that average Precision is 98.22% & 99.02%, average Recall is 98.17% & 99.00%, average Specificity is 99.79% & 99.89%, average F1 score is 98.16% & 98.99%, and average Accuracy is 99.63% & 99.80% with first set of cameras, whereas average Precision is 94.60% & 96.92%, average Recall is 94.42% & 96.92%, average Specificity is 99.37% & 99.66%, average F1 score is 94.41% & 96.90% and average Accuracy is 98.99% & 99.38% with second set of cameras. In Case 2, the average precision, recall, specificity, F1 score and accuracy are higher than those of Case 1. This again supports the effectiveness of the OISF feature.
4.4 Experiment #3
To perform this experiment, ten dissimilar activities of different categories (C1: HangOnBar, C9: HeroSmash, C10: HeroDoorSlam, C12: Knockout, C13: Granade, C14: Collapse, C15: StandLookAround, C16: Punch, C17: JumpKick and C18: Walk) grouped in SA3 (Table 1) are taken from the ViHASi dataset.
Tables 11 and 12 show the confusion matrices obtained in the third experiment for both cases (Case 1 and Case 2) on the two sets of camera angles. By comparing all the confusion matrices obtained in experiments #1, #2 and #3, the following conclusions can be drawn:
- The probability of misclassification among similar activities is higher than the probability of misclassification among dissimilar activities.
- The misclassification rate of the activities captured by the second set of cameras is higher than that of the activities captured by the first set of cameras.
- The average misclassification rate in Case 2 is lower than the average misclassification rate in Case 1, which supports the appropriateness of the OISF feature for human activity recognition.
Tables 13 and 14 list the values of Precision, Recall, Specificity, F1 score and Accuracy obtained in the third experiment for both cases (Case 1 and Case 2) on the two sets of camera angles. In each pair of values below, the first figure is for Case 1 and the second for Case 2. With the first set of cameras, average Precision is 98.70% & 99.17%, average Recall is 98.67% & 99.17%, average Specificity is 99.85% & 99.91%, average F1 score is 98.66% & 99.16%, and average Accuracy is 99.73% & 99.83%. With the second set of cameras, average Precision is 94.75% & 98.04%, average Recall is 94.67% & 98.00%, average Specificity is 99.40% & 99.78%, average F1 score is 94.67% & 98.00%, and average Accuracy is 98.93% & 99.60%. From this experiment, it can be concluded that with OISF (Case 2) the average accuracy is more than 99% on both sets of cameras for the dissimilar activities taken from the ViHASi dataset. This shows that using the OISF feature for human activity recognition gives effective classification results for both similar and dissimilar activities.
4.5 Experiment #4
For this experiment, all six activities (Boxing, Hand clapping, Hand waving, Jogging, Running and Walking) are taken from the KTH dataset. Figure 9a, b and c show ten sample silhouette frames of the Hand clapping activity, the skeletons obtained using Algorithm 1, and the 8-region division of the elliptical bounding box obtained using Algorithm 2, respectively.
Table 15 shows the confusion matrix of the fourth experiment. Activities with similar body movements, such as Running, Jogging and Walking, are confused with one another in both cases. Table 16 lists the values of Precision, Recall, Specificity, F1 score and Accuracy obtained in the fourth experiment for Case 1 and Case 2. From Table 16, it is observed that average Precision is 88.79% & 90.81%, average Recall is 88.67% & 90.78%, average Specificity is 97.73% & 98.15%, average F1 score is 88.68% & 90.74%, and average Accuracy is 96.22% & 96.85% for Case 1 and Case 2 respectively. From this experiment, it is concluded that average precision, recall and accuracy increase in Case 2 with respect to Case 1. These results further indicate that the Random Forest classifier, when trained with the newly proposed OISF feature, performs well for all types of activity (similar or dissimilar).
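A reader may wonder why the average accuracy (about 96%) is so much higher than the average recall (about 89%) on the same data. One plausible explanation, assuming accuracy is computed per class in a one-vs-rest fashion and that the classes are balanced, is that every misclassified sample is counted as wrong in only two of the K one-vs-rest problems (once as a false negative, once as a false positive). The sketch below works through this arithmetic; the function name is hypothetical and the balanced-class assumption is ours, not something the text states.

```python
def macro_accuracy_from_recall(avg_recall: float, num_classes: int) -> float:
    """Average one-vs-rest accuracy implied by a given average recall.

    Assumes balanced classes, so that average recall equals the overall
    fraction of correctly classified samples. Each error is then wrong in
    exactly 2 of the num_classes one-vs-rest problems.
    """
    error_rate = 1.0 - avg_recall
    return 1.0 - 2.0 * error_rate / num_classes


# Average recall reported for Case 1 on the six KTH activities.
kth_case1 = macro_accuracy_from_recall(0.8867, 6)
print(f"implied average accuracy: {kth_case1:.2%}")
```

Under these assumptions, an average recall of 88.67% over six classes implies an average accuracy of about 96.22%, which agrees with the Case 1 value quoted above. The relation is only a consistency check: it need not hold exactly for every reported figure, for example when classes are not perfectly balanced.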
4.6 Experiment #5
For this experiment, all five activities (Boxing, Hand clapping, Hand waving, Jogging, and Walking) are taken from the in-house dataset. Figure 10a, b, c and d show ten sample input frames of the Boxing activity, the respective silhouettes, the skeletons obtained using Algorithm 1, and the 8-region division of the elliptical bounding box obtained using Algorithm 2, respectively.
Table 17 shows the confusion matrix of the fifth experiment. Even with a complex background, our method achieves high accuracy and few false classifications among similar types of activities. Table 18 presents the values of Precision, Recall, Specificity, F1 score and Accuracy obtained in the fifth experiment for Case 1 and Case 2. From Table 18, it is observed that average Precision is 94.09% & 95.91%, average Recall is 94.00% & 95.86%, average Specificity is 98.50% & 98.97%, average F1 score is 94.01% & 95.88%, and average Accuracy is 97.60% & 98.35% for Case 1 and Case 2 respectively. From this experiment, it is concluded that average precision, average recall and average accuracy increase in Case 2 with respect to Case 1. These results further indicate that the Random Forest classifier, when trained with the newly proposed OISF feature, performs well for all types of activity (similar or dissimilar), even with a complex background.
4.7 Effectiveness analysis of the proposed approach
The proposed approach has been tested on three datasets with different characteristics, such as videos recorded with a low-resolution camera, complex backgrounds, variation in illumination, outdoor (with variations of scale) and indoor daytime scenarios, and different view angles. Figure 11 depicts the average accuracy obtained in experiments #1, #2 & #3 performed on the ViHASi dataset recorded from the cameras at a slant angle of 27°. This figure also shows the average accuracy obtained in experiment #4 on the KTH dataset and experiment #5 on the in-house dataset. Figure 12 shows the average accuracy obtained in the first three experiments with the cameras at a slant angle of 45°.
The x-axis and y-axis of these graphs represent experiment numbers and average accuracy, respectively. It can be observed from the graphs in Figs. 11 and 12 that the average accuracy varies considerably when the FV1 feature is used to train the Random Forest classifier, whereas it remains consistent when the OISF feature is used. Figure 13 depicts the absolute difference in accuracy between the two sets of camera angles (27° and 45°).
The x-axis and y-axis of the graph in Fig. 13 represent experiment numbers and absolute difference in accuracies, respectively. The smallest variation in accuracy is observed in Case 2 for experiments #1, #2 & #3, which indicates that the OISF feature is largely invariant to the orientation of the camera. In addition, when this feature is used to train the Random Forest classifier for human activity recognition, an accuracy of ≈ 97% or higher is achieved on both static and moving cameras.
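The quantity plotted in Fig. 13 can be recomputed from the average accuracies quoted in Sections 4.3 and 4.4. The sketch below does so for experiments #2 and #3 only, since those are the values stated in this part of the text; the dictionary layout and the function name are hypothetical.

```python
from typing import Dict, Tuple

# (experiment, case) -> (average accuracy at 27 degrees, at 45 degrees), in percent,
# copied from the averages reported for experiments #2 and #3.
reported: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("#2", "Case 1"): (99.63, 98.99),
    ("#2", "Case 2"): (99.80, 99.38),
    ("#3", "Case 1"): (99.73, 98.93),
    ("#3", "Case 2"): (99.83, 99.60),
}


def absolute_gap(acc_27: float, acc_45: float) -> float:
    """Absolute difference in accuracy between the two camera sets (percentage points)."""
    return round(abs(acc_27 - acc_45), 2)


gaps = {key: absolute_gap(*values) for key, values in reported.items()}
for (experiment, case), gap in gaps.items():
    print(f"Experiment {experiment}, {case}: {gap:.2f} percentage points")
```

For both experiments the gap is smaller in Case 2 than in Case 1 (0.42 versus 0.64 points in experiment #2, and 0.23 versus 0.80 points in experiment #3), which is the behaviour the text attributes to OISF.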
Across all the experiments, the average accuracy ranges from ≈ 97% to ≈ 99%. These results indicate that the proposed method is capable of dealing with scenarios such as low resolution and complex backgrounds. The variation in recognition accuracy across the experiments is small (≈ 2%), which supports the robustness of the proposed method.
The feature extraction times of FV1 and OISF are shown in Table 19 for all five experiments. From Table 19, it is observed that the average feature extraction rates of the FV1 and OISF features are 38 frames per second (fps) and 34 fps, respectively. The training and testing times of the Random Forest model were also recorded; they were found to be nearly constant and very small compared to the feature extraction time. These experiments show that the approach can be used for real-time activity recognition even when the frame size is in the range of 480 × 640.
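To relate the extraction rates above to a real-time budget, one can convert them to a per-frame processing time and compare them with the frame rate of the incoming video. The sketch below assumes a 25 fps video source purely for illustration (the text does not state the source frame rate) and ignores classification time, which the text reports as small; the function names are hypothetical.

```python
def ms_per_frame(fps: float) -> float:
    """Average processing time per frame, in milliseconds, for a given rate."""
    return 1000.0 / fps


def keeps_up(extraction_fps: float, video_fps: float) -> bool:
    """True if features are extracted at least as fast as frames arrive."""
    return extraction_fps >= video_fps


ASSUMED_VIDEO_FPS = 25.0  # hypothetical source frame rate, not from the paper

for name, rate in (("FV1", 38.0), ("OISF", 34.0)):
    print(f"{name}: {ms_per_frame(rate):.1f} ms per frame, "
          f"real-time at {ASSUMED_VIDEO_FPS:.0f} fps: {keeps_up(rate, ASSUMED_VIDEO_FPS)}")
```

At the reported rates, FV1 takes roughly 26.3 ms per frame and OISF roughly 29.4 ms per frame, so the additional cost of OISF is about 3 ms per frame.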
5 Comparison of proposed approach with existing state-of-the-art approaches
In this section, the average accuracy of the proposed approach is compared with that of state-of-the-art approaches for human activity recognition evaluated on the KTH dataset.
Table 20 lists the average accuracy of the different state-of-the-art approaches and of the proposed approach. The average accuracy of the state-of-the-art approaches is about 94%. The accuracy achieved with the FV1 feature and the OISF feature is 96.22% and 96.85%, respectively. The accuracy achieved with the OISF feature is on average about 2.5% higher than that of the other approaches, which supports the proposed method. From this comparison, it can be concluded that the OISF feature gives the highest accuracy among the compared approaches for human activity recognition.
6 Conclusion
In this work, a novel and efficient feature termed OISF is introduced and tested for Human Activity Recognition. To check the robustness of this feature for a moving camera, silhouettes are taken from the ViHASi dataset, which contains videos recorded by different cameras at different angles. The average recognition accuracy of the proposed approach for experiments #1, #2 & #3 is 99.20%, 99.59% and 99.72%, respectively. The small variation in recognition accuracy supports the robustness of the newly proposed OISF feature with respect to the nature of the activities (similar activities, a combination of similar and dissimilar activities, or dissimilar activities). The recognition accuracy of the OISF feature is superior to that of the existing approaches for videos from a moving camera, while its performance is on par with the existing feature in the case of a static camera. Experimentally, it is found that the overall recognition accuracy of the proposed approach is ≈ 99.30% on the ViHASi dataset, ≈ 96.85% on the KTH dataset and ≈ 98.34% on the in-house dataset. In the proposed approach, skeletons are used to extract the features, which reduces the processing time of feature extraction. The average feature extraction rates of the FV1 and OISF features are 38 frames per second (fps) and 34 fps, respectively. The high accuracies obtained in both cases indicate that the proposed approach is applicable to real-life activity recognition tasks such as patient monitoring and fight detection.
References
Aggarwal JK, Ryoo MS (2011) Human activity analysis: a review. ACM Comput Surv (CSUR) 43(3):1–43
Anjum ML, Rosa S, Bona B (2017) Tracking a subset of skeleton joints: an effective approach towards complex human activity recognition. Journal of Robotics
Bächlin M, Forster K, Troster G (2009) SwimMaster: a wearable assistant for swimmer. In: Proceedings of the 11th international conference on ubiquitous computing, pp 215–224
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Chen MY, Hauptmann A (2009) Mosift: recognizing human actions in surveillance videos. Citeseer
Singh DK, Kushwaha DS (2016) Tracking movements of humans in a real-time surveillance scene. In: Proceedings of fifth international conference on soft computing for problem solving, pp 491–500
Dawn DD, Shaikh SH (2016) A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis Comput Springer 32(3):289–306
Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1110–1118
Gilbert A, Illingworth J, Bowden R (2009) Fast realistic multi-action recognition using mined dense spatio-temporal features. In: 2009 IEEE 12th international conference on computer vision, pp 925–931
Hbali Y, Hbali S, Ballihi L, Sadgal M (2017) Skeleton-based human activity recognition for elderly monitoring systems. IET Comput Vis 12(1):16–26
Ho TK (1995) Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition, vol 1. IEEE, pp 278–282
Jalal A, Uddin MZ, Kim JT, Kim TS (2012) Recognition of human home activities via depth silhouettes and R transformation for smart homes. Indoor Built Environ 21(1):184–190
Jalal A, Kamal S, Kim D (2017) A depth video-based human detection and activity recognition using multi-features and embedded hidden Markov models for health care monitoring systems. Int J Interact Multimed Artif Intell 4:4
Jalal A, Kamal S (2014) Real-time life logging via a depth silhouette-based human activity recognition system for smart home services. In: 2014 11th IEEE International conference on advanced video and signal based surveillance (AVSS), pp 74–80
Kovashka A, Grauman K (2010) Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: 2010 IEEE computer society conference on computer vision and pattern recognition, pp 2046–2053
Kumar S, Kumar S, Raman B, Sukavanam N (2011) Human action recognition in a wide and complex environment. Real-Time Image Video Process 7871:78710I
Lassoued I, Zagrouba E (2018) Human actions recognition: an approach based on stable motion boundary fields. Multimed Tools Appl 77(16):20715–20729
Li M, Leung H (2016) Multiview skeletal interaction recognition using active joint interaction graph. IEEE Trans Multimed 18(11):2293–2302
Lu M, Zhang L (2014) Action recognition by fusing spatial-temporal appearance and the local distribution of interest points. In: International conference on future computer and communication engineering (ICFCCE 2014)
Manresa C, Varona J, Mas R, Perales FJ (2005) Hand tracking and gesture recognition for human-computer interaction. ELCVIA Electron Lett Comput Vis Image Anal 5(3):96–104
Manzi A, Fiorini L, Limosani R, Dario P, Cavallo F (2017) Two-person activity recognition using skeleton data. IET Comput Vis 12(1):27–35
Min W, Cui H, Rao H, Li ZZ, Yao L (2018) Detection of human falls on furniture using scene analysis based on deep learning and activity characteristics. IEEE Access 6:9324–9335
Naveed H, Khan G, Khan AU, Siddiqi A, Khan MUG (2019) Human activity recognition using mixture of heterogeneous features and sequential minimal optimization. Int J Mach Learn Cybern 10(9):2329–2340
Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R (2014) Sequence of the most informative joints (SMIJ): a new representation for human skeletal action recognition. J Vis Commun Image Represent 25(1):24–38
Quaid MAK, Jalal A (2019) Wearable sensors based human behavioral pattern recognition using statistical features and reweighted genetic algorithm. Multimed Tools Appl, 1–23
Ragheb H, Velastin S, Remagnino P, Ellis T (2008) ViHASi: virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods. In: Second ACM/IEEE international conference on distributed smart cameras. IEEE, pp 1–10
Raptis M, Sigal L (2013) Poselet key-framing: a model for human activity recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2650–2657
Sadek S, Al-Hamadi A, Gerald K, Michaelis B (2013) Affine-invariant feature extraction for activity recognition. ISRN Mach Vis, 2013
Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Annals Stat 26(5):1651–1686
Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th international conference on pattern recognition (ICPR). IEEE, pp 32–36
Shah H, Chokalingam P, Paluri B, Pradeep N, Raman B (2007) Automated stroke classification in tennis. In: International conference image analysis and recognition, pp 1128–1137
Uddin MZ, Lee JJ, Kim TS (2010) Independent shape component-based human activity recognition via hidden Markov model. Appl Intell 33(2):193–206
Vats E, Chan CS (2016) Early detection of human actions—a hybrid approach. Appl Soft Comput 46:953–966
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
Wang H, Kläser A, Schmid C, Lin-Cheng L (2011) Action recognition by dense trajectories. In: CVPR 2011-IEEE conference on computer vision & pattern recognition, pp 3169–3176
Weng Z, Guan Y (2018) Action recognition using length-variable edge trajectory and spatio-temporal motion skeleton descriptor. EURASIP J Image Video Process 2018 (1):8
Xu K, Jiang X, Sun T (2015) Human activity recognition based on pose points selection. In: 2015 IEEE International conference on image processing (ICIP), pp 2930–2834
Zhu C, Sheng W (2011) Wearable sensor-based hand gesture and daily activity recognition for robot-assisted living. IEEE Trans Syst Man Cybern-Part A: Syst Humans 41(3):569–573
Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: AAAI Conference on artificial intelligence, p 8
Appendix: Details of Algorithms
Cite this article
Dwivedi, N., Singh, D.K. & Kushwaha, D.S. Orientation Invariant Skeleton Feature (OISF): a new feature for Human Activity Recognition. Multimed Tools Appl 79, 21037–21072 (2020). https://doi.org/10.1007/s11042-020-08902-w
DOI: https://doi.org/10.1007/s11042-020-08902-w