1 Introduction

Human Activity Recognition aims to recognize the physical activity of one or more persons from a sequence of observations obtained from sensors or video logs. It is a difficult task for real-time videos because of background clutter, variations in illumination, varying viewpoints, partial occlusions and camera movements. Recognition of human activity is thus an open and challenging research area in the fields of computer vision, machine vision, image processing [1], etc. A robust human activity recognition system can greatly enhance the efficacy of surveillance systems [7], human-computer interaction [21], gesture interpretation [39], life-care systems [14, 23], sports analysis [32], smart homes [15] and other application domains. Owing to this wide applicability in real life, the area has attracted researchers for a long time. The recognition accuracy of such a system largely depends on the quality of the extracted features and the learning capability of the classifier. Different features such as Motion Stable Shape (MSS) [18], Motion Scale-Invariant Feature Transformation (MoSIFT) [6], Speeded Up Robust Features Motion History Image (SURF-MHI) [35], Spatio-Temporal Interest Point (STIP) detectors [8], and skeleton based features have been used for Human Activity Recognition by the research community in the past.

A Human Activity Recognition system is applied to data collected through wearable sensors or recorded video (from CCTV cameras, surveillance cameras, etc.). Wearable sensor based approaches [3, 26, 39] first collect data from sensors attached to different parts of the human body; the physical activity is then recognized by analyzing these data. This type of approach can be used only when a person is wearing the sensors at all times, which is not always feasible in real-life scenarios. To mitigate this limitation, computer vision based human activity recognition is the preferred method. In the vision based approach, features like MoSIFT, SURF-MHI, etc. are extracted from the video data and are then used to train the classifier. These features work well for videos recorded with static cameras, but not for videos recorded with moving cameras. Also, the processing time for extracting these features is high because feature extraction requires processing of all pixels in the image. To minimize these limitations, skeleton based features have been proposed by many researchers in the literature for human activity recognition. Feature selection plays an important role in vision based human activity recognition [9, 40] because multiple features may be required to train the system to improve its accuracy. One can select a large number of features to increase the accuracy, but this increases the time required for feature extraction and for learning and testing the system. Hence, skeleton based features prove to be a viable alternative.

Skeleton features [15, 32, 37] are extracted by analyzing the skeleton pose in a sequence of frames. The skeletal representation of the human body at any instant reflects the original pose of that person. A human pose in a video frame contains a large number of pixels, while its skeletal representation contains comparatively few, which reduces the processing time for feature extraction. Human Activity Recognition using skeletons involves the following steps:

  1. Extraction of skeletons from video frames

  2. Extraction of skeleton features

  3. Construction of the feature vector

  4. Training and testing of the classifier

The skeleton feature usually contains joint information of different body parts, such as the end and mid points of the hands and legs, and the position of the head. Feature vectors are then constructed by fusing the features extracted from the tracked person over a few consecutive frames. Finally, a supervised machine learning classifier is used to identify the physical activity of the tracked person based on the extracted features. Neural trees, SVM, LDA and other classifiers have been used in the existing literature for the activity recognition task. Since human activity recognition is a multi-class classification problem, a tree based classifier is a preferable approach. This observation motivates us to explore the suitability of the Random Forest classifier for human activity recognition. The aim of this work is to propose skeleton features for human activity recognition with the following goals:

  • To improve the recognition accuracy of the system

  • To reduce the dependency on camera positions and angles

  • To obtain similar recognition accuracy for the videos recorded by either static or moving cameras

In this paper, a new feature named “Orientation Invariant Skeleton Feature (OISF)” is introduced for human activity recognition. This newly introduced feature is examined through a number of experiments on two publicly available datasets, KTH and ViHASi, and one in-house dataset which is available at https://github.com/neelamdw/Actiondataset/. The in-house dataset was developed in our lab and contains five different actions (boxing, hand clapping, hand waving, jogging and walking). Three subsets (SA1, SA2 and SA3) are prepared by selecting similar and dissimilar activities from the ViHASi dataset (see Table 1 in Section 4). Experiments #1, #2 and #3 are performed on SA1, SA2 and SA3, respectively. All the activities of the KTH dataset are taken together to perform experiment #4. Similarly, all the activities of the in-house dataset are taken together to perform experiment #5. The major contributions of this paper are as follows:

  • A new skeleton based feature named as “Orientation Invariant Skeleton Feature (OISF)” for human activity recognition

  • Performance evaluation of proposed feature with Random Forest classifier for KTH dataset, in-house dataset and different subsets of ViHASi dataset

Table 1 Three subsets of ViHASi dataset

This paper is organized as follows: Section 2 presents a literature survey for human activity recognition. Section 3 explains the methodology and architecture of the proposed approach. Experimental results and their analysis are presented in Section 4, followed by a comparison with state-of-the-art approaches in Section 5. The conclusion is presented in Section 6.

2 Literature survey for human activity recognition

In this section, various human activity recognition approaches are briefly reviewed. Uddin et al. [33] utilize Independent Component Analysis (ICA) to extract activity shape information from the body joints instead of using the whole body. They then apply a Hidden Markov Model (HMM) to the extracted activity shape information to recognize the human activity. In this method, features are easy to extract from the silhouettes, but the method cannot distinguish between near and distant body parts. To overcome this limitation, Jalal et al. [13] present human activity recognition for smart homes based on the R transformation applied to depth silhouettes, which requires more processing time. To reduce the processing time, Raptis et al. [28] propose a model for action recognition based on the combination of mid-level representations (HoG and BoW poselets) and discriminative key frame selection. These approaches require the complete sequence of frames to recognize the human activity correctly. To achieve early detection of an action, Vats et al. [34] present a hybrid technique that combines the benefits of computer vision and fuzzy set theory; this approach may recognize an action even if only a partial action has occurred. They use the fuzzy BK sub-product because of its flexibility and capability to imitate natural human behavior. All of the above-mentioned approaches require a large amount of pixel processing.

To reduce the pixel processing, skeletons are first extracted from the images. Features are then extracted by processing the skeleton sequences, which reduces the pixel processing time. Anjum et al. [2] present complex human activity recognition by tracking a subset of human skeleton joints instead of tracking the whole skeleton. Skeleton joints are selected either manually or automatically from depth videos recorded by a Kinect camera, and Multiclass Support Vector Machines (MSVMs) are used to classify the human activity. Weng et al. [37] present human activity recognition using length-variable edge trajectories (LV-ET) and a spatio-temporal motion skeleton descriptor. The LV-ET is extracted by tracking edge points across video frames based on optical flow, with the aim of better describing the evolution of different types of actions. A novel encoding method for trajectory clustering is proposed to extract the Spatio-Temporal Motion Skeleton (STMS), also called the motion skeleton. Habli et al. [11] propose skeleton-based human activity recognition for elderly monitoring systems. For this task, they use spatial and temporal coordinates of the 3D skeleton and combine both to represent each frame of a human activity. A randomised tree algorithm is used to train and validate the method on the MSR-Action3D and DailyActivity3D datasets. Manzi et al. [22] introduce an activity recognition system for two persons using skeleton data extracted from a depth camera. They first use an unsupervised clustering approach to represent each activity by a small set of informative postures; models are then created using multiclass support vector machines on the training set. The optimal number of clusters for each sample is found dynamically by the X-means algorithm during the classification phase. Li et al. [19] focus on multi-view skeletal interactions for human activity recognition. For this, a multi Active Joint Interaction Graph (AJIG) model is proposed to encode the spatio-temporal patterns of two-person skeletal interactions, and an AJIG kernel is used to compute the similarity between two AJIGs. A multiple kernel learning approach is then applied to jointly learn the optimal combination of the numerous AJIG kernels. Ofli et al. [25] present the Sequence of the Most Informative Joints (SMIJ) feature to recognize a human skeleton activity. At each time instant, the few skeletal joints that are most related to the current action are selected, based on highly interpretable measures such as the mean or variance of joint angles, the maximum angular velocity of joints, etc. Zhu et al. [40] present a deep LSTM network for skeleton activity recognition. Their model facilitates the learning of skeleton joint features with the help of a co-occurrence exploration mechanism, and uses a dropout mechanism to handle the complex structures among the important joints. They compare their method with other deep networks for skeletal activity recognition [9]. The methodology of our proposed approach is discussed in the next section.

3 Proposed approach for human activity recognition

In this section, a skeleton based feature to identify human activity is introduced. Figure 1 depicts the architectural design of the proposed approach for human activity recognition, which consists of three phases.

Fig. 1 Flow chart of the proposed approach

In the first phase, the video is converted into consecutive frames, followed by skeletonization using Algorithm 1. Figure 2 depicts the skeletonization process for an input video frame. This process involves extraction of the foreground object, conversion of the foreground object into a binary frame, enhancement of the frame, filling of small holes, removal of islands, and a repeated thinning operation. After skeletonization, the Region of Interest (RoI) is obtained and marked by an elliptical boundary in the second phase. To draw the elliptical bounding box around the skeleton, the orientation, centroid, major axis length, minor axis length and eccentricity of the ellipse are obtained by applying the MATLAB function regionprops(). This elliptical bounding box is further divided into eight symmetric regions. All of these tasks are performed by applying Algorithm 2. In the third phase, the feature FV1 and the newly introduced feature OISF are extracted using Algorithm 3 and Algorithm 4, respectively. The time complexity of extracting each of the features FV1 and OISF is Θ(m × n), where m × n is the total number of pixels in a frame. The Random Forest classifier is trained with each extracted feature separately for human activity recognition. The structure of the random forest classifier proposed by Leo Breiman [4] is used in this paper, with each forest consisting of 500 trees. Once the classifier is trained, it is used to recognize the activity from a new sequence of frames.

Fig. 2 Flow chart of skeletonization of a frame

3.1 Phase 1 (pre-processing of video): skeletonization of input video frames

Depending on the frame rate and duration of the video, an input video (V) is converted into N consecutive frames (F1, F2, ..., FN). For frames with complex backgrounds, the foreground object is extracted by applying a background subtraction technique. Each foreground frame is then transformed into a binary frame (image) in two steps:

  • Step 1: For each frame, two intensity thresholds (th1 and th2) are calculated as:

    $$th1 = h_{j} - c1 $$
    $$th2 = h_{j} + c2 $$

Here, hj is the pixel intensity value with the maximum frequency in the grayscale frame. Different values of c1 and c2 have been tried through experimentation; in our experiments, c1 = 35 and c2 = 40 are used. These values (c1, c2) are tuned once and then remain constant for a dataset.

  • Step 2: All the intensity values lying between th1 and th2 in a frame are set to ‘1’; all other intensity values are set to ‘0’. The resulting image is a binary image.

Each binary frame is further enhanced by applying a median filter. Two morphological operations, dilation and erosion, are used to obtain a well-defined shape of the person (silhouette) from the binary frame. To perform these morphological operations, a linear structuring element of length c3 is used. Dilation and erosion are performed k1 and k2 times, respectively, on the input binary frame. Various combinations of dilation and erosion were tried; k1 = 2, k2 = 1 and c3 = 3 have shown promising results in terms of silhouette extraction. The skeleton is obtained by applying a repeated thinning operation on the resulting silhouette. The thinning operation removes pixels in such a way that the object shrinks to a minimally connected stroke. The pseudo code of the skeletonization process is presented in Algorithm 1 (the detailed version is given in the Appendix as Algorithm A.1), and a minimal MATLAB-style sketch of the process is shown below.

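The following sketch illustrates one possible realization of this skeletonization step using standard Image Processing Toolbox functions. It is a minimal approximation of Algorithm 1 under the constants reported above (c1 = 35, c2 = 40, k1 = 2, k2 = 1, c3 = 3); the reduction of background subtraction to histogram-mode thresholding, the structuring-element orientation and the island-size threshold are assumptions, not the authors' original code.

```matlab
function sk = skeletonizeFrame(frame)
% Minimal sketch of the skeletonization step (Algorithm 1), assumed implementation.
% Requires the Image Processing Toolbox.
    if size(frame, 3) == 3
        g = rgb2gray(frame);               % grayscale frame
    else
        g = frame;
    end
    [counts, bins] = imhist(g);
    [~, idx] = max(counts);
    hj  = bins(idx);                       % intensity value with maximum frequency
    th1 = hj - 35;  th2 = hj + 40;         % Step 1: thresholds with c1 = 35, c2 = 40
    bw  = g >= th1 & g <= th2;             % Step 2: binarize around the histogram mode
    bw  = medfilt2(bw);                    % enhance the binary frame
    se  = strel('line', 3, 90);            % linear structuring element, c3 = 3 (angle assumed)
    bw  = imdilate(imdilate(bw, se), se);  % k1 = 2 dilations
    bw  = imerode(bw, se);                 % k2 = 1 erosion -> silhouette
    bw  = imfill(bw, 'holes');             % fill small holes
    bw  = bwareaopen(bw, 50);              % remove small islands (area threshold assumed)
    sk  = bwmorph(bw, 'thin', Inf);        % repeated thinning -> minimally connected skeleton
end
```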

Figure 3a, b and c show the ten original frames, and the corresponding binary and skeleton frames obtained through Algorithm 1. These frames are taken from the KTH dataset for ‘running’ activity. For the sake of visual clarity, all the figures in this paper include both the complements of actual silhouettes and the corresponding skeletons.

Fig. 3 (a) Frames obtained from input video (b) Extracted silhouettes from the frames (c) Skeletons obtained from silhouettes

3.2 Phase 2: Region of Interest (RoI) selection

The skeleton extracted from the silhouette (obtained in Phase 1) is our Region of Interest. An elliptical boundary drawn around the skeleton separates it from the input frame. The centroid (xc,yc), major axis length (2 × a), minor axis length (2 × b), eccentricity (e) and orientation (𝜃) of the ellipse are calculated for further processing. As explained before, the regionprops() function is used to obtain these values. This ellipse is then divided into eight symmetric regions in order to extract hand, leg and head features, as shown in Fig. 4. To divide the ellipse into eight symmetric regions, the following four lines are drawn, each passing through the centroid of the ellipse:

  1. Major axis

  2. Minor axis

  3. Line passing through the centroid and inclined at an angle of 45° clockwise from the minor axis

  4. Line passing through the centroid and inclined at an angle of 45° anti-clockwise from the minor axis

Fig. 4 Ellipse with 8 regions

Let us assume that (xc,yc) are the coordinates of the centroid of the given ellipse. Then (1) is the equation of the ellipse whose orientation is 𝜃:

$$ \frac{(x_{1}\cos\theta +y_{1}\sin\theta)^{2}}{b^{2}} + \frac{(x_{1}\sin\theta - y_{1}\cos\theta)^{2}}{a^{2}} = 1 $$
(1)

where x1 = x − xc and y1 = y − yc.

To draw these lines, coordinates are obtained by applying eight cuts on the ellipse at the angles 0° (cut 1), 45° (cut 2), 90° (cut 3), 135° (cut 4), 180° (cut 5), 225° (cut 6), 270° (cut 7) and 315° (cut 8). Four lines are drawn between cut 1 & cut 5 (minor axis), cut 2 & cut 6 (dotted line), cut 3 & cut 7 (major axis), and cut 4 & cut 8 (dashed line), respectively.
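A possible MATLAB realization of this phase is sketched below: regionprops() supplies the ellipse parameters, and each skeleton pixel is assigned to one of the eight 45° sectors created by the four lines above by binning its angle about the centroid. The sector numbering, the sign convention for the orientation angle, and the use of angular binning instead of explicitly drawn lines are illustrative assumptions; this is not the published Algorithm 2.

```matlab
function [regions, ell] = divideIntoRegions(sk)
% Sketch of RoI selection and 8-region division (Algorithm 2), assumed implementation.
% sk is a logical skeleton image; regions{j} holds the [x y] pixel coordinates of region j.
    stats = regionprops(double(sk), 'Centroid', 'MajorAxisLength', ...
                        'MinorAxisLength', 'Eccentricity', 'Orientation');
    xc = stats(1).Centroid(1);   yc = stats(1).Centroid(2);
    a  = stats(1).MajorAxisLength / 2;     % semi-major axis
    b  = stats(1).MinorAxisLength / 2;     % semi-minor axis
    th = -deg2rad(stats(1).Orientation);   % orientation (sign flipped: image y-axis points down)
    ell = struct('xc', xc, 'yc', yc, 'a', a, 'b', b, ...
                 'e', stats(1).Eccentricity, 'theta', th);

    [ys, xs] = find(sk);                   % coordinates of the white (skeleton) pixels
    % Angle of each pixel about the centroid, measured relative to the ellipse axes;
    % the four lines (major axis, minor axis, +/-45 degrees) split it into eight 45-degree sectors.
    ang    = mod(atan2(ys - yc, xs - xc) - th, 2*pi);
    sector = floor(ang / (pi/4)) + 1;      % region index 1..8 (numbering assumed)
    regions = cell(1, 8);
    for j = 1:8
        regions{j} = [xs(sector == j), ys(sector == j)];
    end
end
```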

The obtained RoI has eight symmetric regions, and each region contains part of the skeleton. Figure 5a, b and c show the skeletal representations of the frames obtained by applying Algorithm 1, the RoI with its elliptical bounding box, and its division into 8 regions using Algorithm 2 (the detailed version is given in the Appendix as Algorithm A.2), respectively. The algorithms for extracting the features are explained in Section 3.3.

Fig. 5 (a) Obtained skeletons from input video frames (b) Region of Interest (RoI) selection (c) Division of RoI into eight symmetric regions


3.3 Features used

Two features, FV1 [17] and the Orientation Invariant Skeleton Feature (OISF), are used in the proposed approach. To extract these features, skeletons are obtained from an input video by applying Algorithm 1. The Region of Interest (RoI) is selected and bounded by an elliptical bounding box that is further divided into eight symmetric regions (Fig. 5) by applying Algorithm 2. Each region of the skeleton provides information such as the number of pixels and the coordinates of the hands, legs and head. Using this information, the feature vectors FV1 and OISF are calculated; a total of eight features per skeleton are obtained for each of FV1 and OISF. Details of these features are discussed in Sections 3.3.1 and 3.3.2, respectively.

3.3.1 FV 1 feature extraction

Figure 6 shows the flow chart for extracting the FV 1 feature value of one frame. Skeletons sk1, sk2, ..., skN are obtained from the input video V by applying Algorithm 1. For each skeleton, an elliptical boundary is drawn and divided into eight symmetric regions by applying Algorithm 2. \(s{k_{i}^{j}}\) contains the total number of white pixels of ith skeleton in the jth (1 ≤ j ≤ 8) region of the ellipse. For each region, one feature value is extracted as follows:

  • Step 1: Compute the sum of pixels of jth region of ith skeleton ski:

    $${P_{i}^{j}} = sum (s{k_{i}^{j}})$$
  • Step 2: Compute the sum of total pixels (pt) for the ith skeleton ski:

    $$p^{t} =\sum\limits_{j=1}^{8} {P_{i}^{j}}$$

Thus, feature corresponding to jth region is given by the following expression:

$$ {f_{i}^{j}} =\frac{{P_{i}^{j}}}{p^{t}} $$
Fig. 6 Flow chart of extraction of FV1 feature in a frame

Here, \(f_{i}^{[1...8]}\) = \([{f_{i}^{1}}, {f_{i}^{2}}, ..., {f_{i}^{8}}]\) is the feature corresponding to the ith skeleton ski. Thus, a feature matrix of size (N × 8) is obtained for the input video V. The final feature vector FV1 is generated by fusing the features of ten consecutive frames, since an activity is characterized by analyzing a sequence of frames. This feature vector FV1 is used to train and test the classifier for human activity recognition. The size of the final feature matrix FV1 for N frames is (\(\frac {N}{10} \times 80\)). The motivation for and the extraction process of the newly introduced OISF feature are discussed in the next section.
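Assuming the per-region pixel coordinates from the previous sketch (regions{j}), the FV1 value of a single skeleton and the ten-frame fusion can be written roughly as follows; function and variable names are illustrative rather than the authors' Algorithm 3.

```matlab
function f = extractFV1(regions)
% Sketch of the FV1 feature for one skeleton: fraction of skeleton pixels per region.
    p  = cellfun(@(r) size(r, 1), regions);   % P_i^j: number of white pixels in region j
    pt = sum(p);                              % p^t: total white pixels of the skeleton
    f  = p / max(pt, 1);                      % f_i^j = P_i^j / p^t (guard for empty skeletons)
end

% Fusion of ten consecutive frames: an N x 8 per-frame matrix becomes the
% (N/10) x 80 feature matrix FV1 used to train and test the classifier.
%   perFrame = ...;                               % N x 8, one row per skeleton
%   M   = 10 * floor(size(perFrame, 1) / 10);     % drop a trailing incomplete group
%   FV1 = reshape(perFrame(1:M, :)', 80, [])';    % (N/10) x 80
```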

3.3.2 OISF feature extraction

In this section, the motivation behind and the algorithm for the new feature, “Orientation Invariant Skeleton Feature (OISF)”, are discussed. It is observed in the literature that existing features for human activity recognition depend on the orientation and positioning of the cameras, which reduces the recognition accuracy for videos recorded by moving cameras. The motivation for introducing the OISF feature is to overcome this limitation. The OISF feature improves the recognition accuracy of a human activity recognition system for videos recorded with moving as well as static cameras, because OISF characterizes human actions by the relative movements of the hands and legs along the x and y axes separately in each frame. The features corresponding to each skeleton ski are obtained as follows:

  1. The x and y coordinates of the first white pixel (having the least value of the x coordinate) in the first, second, seventh and eighth regions are determined.

  2. The x and y coordinates of the last white pixel (having the maximum value of the x coordinate) in the third, fourth, fifth and sixth regions are determined.

  3. The absolute differences of the x and y coordinates of the first & fourth, second & third, fifth & eighth, and sixth & seventh regions are calculated and taken as the eight features of skeleton ski.

Example 1: For any skeleton ski, let (x1,y1), (x2,y2), (x7,y7) and (x8,y8) represent the coordinates of the first white pixel of the first, second, seventh and eighth regions respectively, while (x3,y3), (x4,y4), (x5,y5) and (x6,y6) represent the coordinates of the last white pixel of the third, fourth, fifth and sixth regions, respectively. The first, second, third, fourth, fifth, sixth, seventh and eighth feature values of the skeleton ski are calculated as abs(x1 − x4), abs(x2 − x3), abs(x5 − x8), abs(x6 − x7), abs(y1 − y4), abs(y2 − y3), abs(y5 − y8) and abs(y6 − y7), respectively.
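Using the same assumed region representation, the OISF value of one skeleton could be computed as in the sketch below; it follows the three steps and Example 1 above, but the helper name and the handling of coordinates are illustrative, not the published Algorithm 4. As with FV1, the 8-dimensional per-frame vectors of ten consecutive frames are then concatenated into one 80-dimensional OISF sample.

```matlab
function f = extractOISF(regions)
% Sketch of the OISF feature for one skeleton (empty regions are not handled here).
% regions{j} is an n_j x 2 matrix of [x y] pixel coordinates of region j.
    pt = zeros(8, 2);                       % reference pixel (x, y) of each region
    for j = [1 2 7 8]                       % first white pixel: smallest x coordinate
        [~, k] = min(regions{j}(:, 1));
        pt(j, :) = regions{j}(k, :);
    end
    for j = [3 4 5 6]                       % last white pixel: largest x coordinate
        [~, k] = max(regions{j}(:, 1));
        pt(j, :) = regions{j}(k, :);
    end
    pairs = [1 4; 2 3; 5 8; 6 7];           % region pairs from Example 1
    dx = abs(pt(pairs(:, 1), 1) - pt(pairs(:, 2), 1));   % |x1-x4|, |x2-x3|, |x5-x8|, |x6-x7|
    dy = abs(pt(pairs(:, 1), 2) - pt(pairs(:, 2), 2));   % |y1-y4|, |y2-y3|, |y5-y8|, |y6-y7|
    f  = [dx; dy]';                         % 1 x 8 OISF feature for this frame
end
```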

Normally, activities cannot be distinguished by analyzing a single frame. Therefore, features of ten consecutive frames are combined to generate the OISF feature vector for the classification of human activities. In the next section, experimental set-up and datasets used are discussed along with the analysis of the proposed method using various performance metrics.

4 Experiments and their analysis

To evaluate the performance of the proposed approach, five experiments have been conducted. All the experiments were run in MATLAB R2017a on a Core i7 processor with 4 GB RAM. For experiments #1, #2 and #3, frames are taken from the Virtual Human Action Silhouette (ViHASi) dataset [27]; for experiment #4, from the KTH dataset [31]; and for experiment #5, from the in-house dataset. The ViHASi dataset contains synthetic videos of 20 action classes performed by 9 actors, recorded using up to 40 synchronized perspective camera views. These 40 camera views are divided into two sets of 20 views each. The two sets of cameras are fixed at slant angles of 27° and 45° with the horizontal plane, respectively, and the angular difference between adjacent cameras is 18° in both sets. The KTH dataset contains six actions (boxing, hand clapping, hand waving, jogging, running and walking) performed by 25 persons in two different scenarios, indoor and outdoor, with different scale variations. All of its video sequences are taken over a homogeneous background with a static camera. As explained in Section 1, the videos of the in-house dataset contain five different actions, performed by 2 actors in an indoor scenario. All of these videos are recorded with a static low-resolution camera, a complex background, and variations in illumination.

To create the maximum possible combinations of activities, the twenty action classes of the ViHASi dataset have been divided into three subsets for experiments #1, #2 and #3. These subsets are categorized on the basis of the similarity of actions and named SA1, SA2 and SA3; Table 1 lists the actions of each subset. SA1 contains mostly similar actions (running, walking, etc.), SA2 contains a combination of similar (RunPullObject, RunPushObject, etc.) and dissimilar (Knockout, Punch, etc.) actions, and SA3 contains dissimilar actions (HangOnBar, Granade, etc.). The actions of SA1, SA2 and SA3 are used in experiments #1, #2 and #3, respectively. These three experiments are conducted in two separate scenarios:

  1. Videos recorded by the first set of cameras, fixed at a slant angle of 27°

  2. Videos recorded by the second set of cameras, fixed at a slant angle of 45°

In all the experiments, input videos are pre-processed using Algorithm 1, and the FV1 and OISF features are extracted using Algorithm 3 and Algorithm 4, respectively (the detailed versions are given in the Appendix as Algorithms A.3 and A.4). A separate human activity recognition model is created by training the Random Forest classifier for each of the following cases:

  • Case 1: Training with FV 1 only

  • Case 2: Training with OISF only


In this way, two separate Random Forest models are developed for each of the experiments. The Random Forest classifier is a supervised machine learning classifier that uses an ensemble learning method for classification. The general method of random decision forests was first proposed by Ho in 1995 [12]. The classifier builds on two well-known ensemble methods, boosting [30] and bagging [5]. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors, and a weighted vote is taken over all the decision trees to predict the class of a new frame. For example, if a classification problem has n1 classes, then samples of all the classes are selected randomly for training. If each sample contains K variables, then k (k < K) randomly selected variables are considered at each node. Each decision tree is grown to its maximum extent without any pruning, and a new sample is classified by the majority vote over all the decision trees. Here, features are randomly selected to split the nodes. The structure of the random forest classifier proposed by Leo Breiman [4] is used in this paper, and a total of 500 trees is used to construct each forest. Figure 7 summarizes all the experiments. To measure the performance of the proposed approach, the confusion matrix and five performance metrics (Precision, Recall, Specificity, F1 score and Accuracy) are used; they are discussed in the next section.
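As an illustration of this classification stage, the snippet below trains a 500-tree bagged ensemble with MATLAB's TreeBagger (from the Statistics and Machine Learning Toolbox) and classifies new samples by majority vote. The random data merely stand in for the fused 80-dimensional FV1 or OISF vectors and their activity labels; the choice of k = sqrt(K) predictors per split is a common random forest default assumed here, not a value reported in the paper.

```matlab
% Minimal sketch of the classification stage with TreeBagger (placeholder data).
rng(1);
Xtrain = rand(300, 80);                          % 300 fused feature vectors (FV1 or OISF)
Ytrain = randi(10, 300, 1);                      % activity labels, e.g. 10 classes
rf = TreeBagger(500, Xtrain, Ytrain, ...         % 500 trees, as used in this paper
                'Method', 'classification', ...
                'NumPredictorsToSample', floor(sqrt(80)));   % k < K variables per split
Xtest = rand(50, 80);                            % new fused feature vectors
Ypred = str2double(predict(rf, Xtest));          % class predicted by majority vote
```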

Fig. 7 Summary of all the experiments

4.1 Parameters used for performance measurement

Precision, Recall, Specificity, F1 score, Accuracy and the confusion matrix are the parameters used to evaluate the performance of the proposed approach. Assume a classifier classifies the input video frames as shown in Table 2, where \({N_{i}^{j}}\) denotes the number of instances of the ith activity classified as the jth activity by the classifier. The performance metrics for these classification results are calculated as follows:

$$ \begin{array}{@{}rcl@{}} Precision_{i} &=& \frac{{N_{i}^{i}}}{\sum_{j=1}^{4}{N_{j}^{i}}}\\ Recall_{i} / Sensitivity_{i} &=& \frac{{N_{i}^{i}}}{\sum_{j=1}^{4}{N_{i}^{j}}}\\ Specificity_{i} &=& \frac{\sum_{j,k=1;\, j,k \neq i}^{4} {N_{j}^{k}}}{\sum_{j=1,\, j\neq i}^{4}{N_{j}^{i}} + \sum_{j,k=1;\, j,k \neq i}^{4} {N_{j}^{k}}}\\ F_{1}\ score_{i} &=& 2 \times \frac{Precision_{i} \times Recall_{i}}{Precision_{i}\ +\ Recall_{i}}\\ Accuracy_{i} &=& \frac{{N_{i}^{i}} + \sum_{j,k=1;\, j,k \neq i}^{4} {N_{j}^{k}}}{\sum_{j,k=1}^{4}{N_{j}^{k}}} \end{array} $$
Table 2 Classification results of a classifier
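For reference, the per-class metrics defined above can be computed directly from a confusion matrix C, where C(i,j) = N_i^j (rows are true activities, columns are predicted activities). The sketch below is a straightforward transcription of the formulas; the 4-class example matrix is invented for illustration only.

```matlab
function [precision, recall, specificity, f1, accuracy] = perClassMetrics(C)
% Per-class metrics from a confusion matrix C, with C(i,j) = N_i^j.
    tp = diag(C);                      % N_i^i
    fp = sum(C, 1)' - tp;              % other activities classified as activity i
    fn = sum(C, 2)  - tp;              % activity i classified as something else
    tn = sum(C(:))  - tp - fp - fn;    % everything not involving activity i
    precision   = tp ./ (tp + fp);
    recall      = tp ./ (tp + fn);     % also called sensitivity
    specificity = tn ./ (tn + fp);
    f1          = 2 * (precision .* recall) ./ (precision + recall);
    accuracy    = (tp + tn) / sum(C(:));
end

% Example with an invented 4-class confusion matrix:
%   C = [50 2 1 0; 3 45 2 0; 0 1 48 1; 0 0 2 47];
%   [p, r, s, f1, acc] = perClassMetrics(C);
```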

4.2 Experiment #1

To perform this experiment, ten similar activities (C2: JumpGetOnBar, C3: JumpOverObject, C4: JumpFromObject, C5: RunPullObject, C6: RunPushObject, C7: RunTurn90Left, C8: RunTurn90Right, C18: Walk, C19: WalkTurn180, and C20: Run) grouped in SA1 (Table 1) from the jump, walk and run categories are taken from the ViHASi dataset. Figure 8a, b and c show ten sample silhouettes of the RunTurn90Left activity recorded by the second set of cameras, the skeletons obtained using Algorithm 1, and the symmetric 8-region division of the elliptical bounding box obtained using Algorithm 2, respectively. To measure the effectiveness of the proposed approach, the confusion matrix, precision, recall, specificity, F1 score and accuracy are used, as discussed in Section 4.1.

Fig. 8 (a) Silhouettes of RunTurn90Left activity (b) Skeletons obtained from the silhouettes (c) Symmetric 8-region division of RoI

Tables 3 and 4 show the confusion matrices that are obtained in the first experiment for both the cases (Case 1 and Case 2) on two different sets of camera angles. The activities with similar body movements are misclassified in some of the instances for both of the cases. For example, most of the misclassified instances of Run (C20) activity are classified as RunTurn90Left (C8) activity and vice versa. It happens because both the actions have similar body movements.

Table 3 Confusion Matrix of Experiment #1 for the silhouettes captured using cameras with a slant angle of 27°
Table 4 Confusion Matrix of Experiment #1 for the silhouettes captured using cameras with a slant angle of 45°

Tables 5 and 6 list the Precision, Recall, Specificity, F1 score and Accuracy obtained in the first experiment for both cases (Case 1 and Case 2) on the two sets of camera angles (27° and 45°). By analysing the results presented in Tables 5 and 6, it is observed that with the first set of cameras the average Precision is 96.67% & 96.83%, the average Recall is 96.67% & 96.83%, the average Specificity is 99.63% & 99.65%, the average F1 score is 96.65% & 96.83% and the average Accuracy is 99.33% & 99.37% (Case 1 & Case 2, respectively). With the second set of cameras, the average Precision is 92.30% & 95.25%, the average Recall is 92.00% & 95.17%, the average Specificity is 99.11% & 99.46%, the average F1 score is 92.04% & 95.17% and the average Accuracy is 98.40% & 99.03%. The following conclusions can be drawn from this experiment:

  • Average precision, recall, specificity, F1 score and accuracy are higher in Case 2 (when Random Forest classifier is trained with OISF feature) in comparison to Case 1 (when Random Forest classifier is trained with FV 1). It confirms the effectiveness of the proposed OISF feature.

  • When the difference between the two sets of camera angles (slant angles of 27° and 45°) is computed for Case 1 and Case 2, the minimum variation is observed in the accuracy of Case 2. This observation further confirms that OISF is the least dependent on camera positioning.

  • The high F1 score and high accuracy confirm the applicability of the proposed work for human activity recognition.

Table 5 Performance Metrics of Experiment #1 for the silhouettes captured using cameras with a slant angle of 27°
Table 6 Performance Metrics of Experiment #1 for the silhouettes captured using cameras with a slant angle of 45°

4.3 Experiment #2

To perform this experiment, a combination of ten similar and dissimilar activities (C1: HangOnBar, C2: JumpGetOnBar, C3: JumpOverObject, C4: JumpFromObject, C5: RunPullObject, C6: RunPushObject, C11: KnockoutSpin, C12: Knockout, C16: Punch and C17: JumpKick) grouped in SA2 (Table 1) from the jump, run and knockout categories is taken from the ViHASi dataset.

Tables 7 and 8 present the confusion matrices obtained in the second experiment for both cases (Case 1 and Case 2) on the two sets of camera angles. Tables 9 and 10 show the Precision, Recall, Specificity, F1 score and Accuracy obtained in the second experiment for both cases. After evaluating the results of experiment #2, it is observed that the average Precision is 98.22% & 99.02%, the average Recall is 98.17% & 99.00%, the average Specificity is 99.79% & 99.89%, the average F1 score is 98.16% & 98.99%, and the average Accuracy is 99.63% & 99.80% with the first set of cameras, whereas the average Precision is 94.60% & 96.92%, the average Recall is 94.42% & 96.92%, the average Specificity is 99.37% & 99.66%, the average F1 score is 94.41% & 96.90% and the average Accuracy is 98.99% & 99.38% with the second set of cameras. In Case 2, the average precision, recall, specificity, F1 score and accuracy are higher than those of Case 1, which again demonstrates the effectiveness of the OISF feature.

Table 7 Confusion Matrix of Experiment #2 for the silhouettes captured using cameras with a slant angle of 27°
Table 8 Confusion Matrix of Experiment #2 for the silhouettes captured using cameras with a slant angle of 45°
Table 9 Performance Metrics of Experiment #2 for the silhouettes captured using cameras with a slant angle of 27°
Table 10 Performance Metrics of Experiment #2 for the silhouettes captured using cameras with a slant angle of 45°

4.4 Experiment #3

To perform this experiment, ten dissimilar activities of different categories (C1: HangOnBar, C9: HeroSmash, C10: HeroDoorSlam, C12: Knockout, C13: Granade, C14: Collapse, C15: StandLookAround, C16: Punch, C17: JumpKick and C18: Walk) grouped in SA3 (Table 1) are taken from the ViHASi dataset.

Tables 11 and 12 show the confusion matrices obtained in the third experiment for both cases (Case 1 and Case 2) on the two sets of camera angles. By comparing all the confusion matrices obtained in experiments #1, #2 and #3, the following conclusions can be drawn:

  • Probability of misclassification among similar activities is higher than the probability of misclassification among dissimilar activities.

  • The misclassification rate of the activities captured by the second set of cameras is higher than that of the activities captured by the first set of cameras.

  • Average misclassification rate in Case 2 is less than the average misclassification rate in Case 1. It proves the appropriateness of OISF feature for human activity recognition.

Table 11 Confusion Matrix of Experiment #3 for the silhouettes captured using cameras with a slant angle of 27°
Table 12 Confusion Matrix of Experiment #3 for the silhouettes captured using cameras with a slant angle of 45°

Tables 13 and 14 list the values of Precision, Recall, Specificity, F1 score and Accuracy obtained in the third experiment for both cases (Case 1 and Case 2) on the two sets of camera angles. By analysing the results presented in Tables 13 and 14, it is observed that the average Precision is 98.70% & 99.17%, the average Recall is 98.67% & 99.17%, the average Specificity is 99.85% & 99.91%, the average F1 score is 98.66% & 99.16%, and the average Accuracy is 99.73% & 99.83% with the first set of cameras. It is also observed that the average Precision is 94.75% & 98.04%, the average Recall is 94.67% & 98.00%, the average Specificity is 99.40% & 99.78%, the average F1 score is 94.67% & 98.00%, and the average Accuracy is 98.93% & 99.60% with the second set of cameras. From this experiment, it can be concluded that the average accuracy is more than 99% for all the dissimilar activities taken from the ViHASi dataset. This shows that using the OISF feature for human activity recognition gives effective results in terms of activity classification, both for similar and dissimilar activities.

Table 13 Performance Metrics of Experiment #3 for the silhouettes captured using cameras with a slant angle of 27°
Table 14 Performance Metrics of Experiment #3 for the silhouettes captured using cameras with a slant angle of 45°

4.5 Experiment #4

To perform this experiment, all six activities (Boxing, Hand clapping, Hand waving, Jogging, Running and Walking) of the KTH dataset are taken. Figure 9a, b and c show ten sample silhouette frames of the Hand clapping activity, the skeletons obtained using Algorithm 1, and the 8-region division of the elliptical bounding box obtained using Algorithm 2, respectively.

Fig. 9 (a) Extracted silhouettes from the frames of Hand clapping activity (b) Skeletons obtained from silhouettes (c) Symmetric 8-region division of RoI

Table 15 shows the confusion matrix of the fourth experiment. Activities that have similar types of body movement, such as Running, Jogging and Walking, are misclassified in both cases. Table 16 lists the values of Precision, Recall, Specificity, F1 score and Accuracy obtained in the fourth experiment for Case 1 and Case 2. By examining the results in Table 16, it is observed that the average Precision is 88.79% & 90.81%, the average Recall is 88.67% & 90.78%, the average Specificity is 97.73% & 98.15%, the average F1 score is 88.68% & 90.74%, and the average Accuracy is 96.22% & 96.85% for Case 1 and Case 2, respectively. From this experiment, it is concluded that the average precision, recall and accuracy increase in Case 2 with respect to Case 1. It can further be concluded from these results that the Random Forest classifier, when trained with the newly proposed OISF feature, performs well for all types of activities (similar or dissimilar).

Table 15 Confusion Matrix of the Experiment #4
Table 16 Performance Metrics of the Experiment #4

4.6 Experiment #5

To perform this experiment, all five activities (Boxing, Hand clapping, Hand waving, Jogging, and Walking) of the in-house dataset are taken. Figure 10a, b, c and d show ten input sample frames of the Boxing activity, the respective silhouettes, the skeletons obtained using Algorithm 1, and the 8-region division of the elliptical bounding box obtained using Algorithm 2.

Fig. 10 (a) Input frames of boxing activity (b) Extracted silhouettes from the frames (c) Skeletons obtained from silhouettes (d) Symmetric 8-region division of RoI

Table 17 shows the confusion matrix of the fifth experiment. Even with a complex background, our method achieves high accuracy and minimal misclassification among similar types of activities. Table 18 presents the values of Precision, Recall, Specificity, F1 score and Accuracy obtained in the fifth experiment for Case 1 and Case 2. By examining the results in Table 18, it is observed that the average Precision is 94.09% & 95.91%, the average Recall is 94.00% & 95.86%, the average Specificity is 98.50% & 98.97%, the average F1 score is 94.01% & 95.88%, and the average Accuracy is 97.60% & 98.35% for Case 1 and Case 2, respectively. From this experiment, it is concluded that the average precision, recall and accuracy increase in Case 2 with respect to Case 1. It can further be concluded that the Random Forest classifier, when trained with the newly proposed OISF feature, performs well for all types of activities (similar or dissimilar), even with a complex background.

Table 17 Confusion Matrix of the Experiment #5
Table 18 Performance Metrics of the Experiment #5

4.7 Effectiveness analysis of the proposed approach

The proposed approach has been tested on three datasets with different characteristics, such as videos recorded with a low-resolution camera, complex backgrounds, variations in illumination, outdoor (with scale variations) and indoor scenarios in day vision, and different view angles. Figure 11 depicts the average accuracy obtained in experiments #1, #2 and #3 performed on the ViHASi dataset recorded from the cameras with a slant angle of 27°, together with the average accuracy obtained in experiment #4 on the KTH dataset and experiment #5 on the in-house dataset. Figure 12 shows the average accuracy obtained in the first three experiments with a camera slant angle of 45°.

Fig. 11 Recognition accuracy (%) of experiments #1, #2 & #3 of ViHASi dataset at slant angle 27°, experiment #4 of KTH dataset and experiment #5 of in-house dataset

Fig. 12 Recognition accuracy (%) of experiments #1, #2 & #3 of ViHASi dataset at slant angle 45° for both cases

The x-axis and y-axis of these graphs represent the experiment number and the average accuracy, respectively. It can be observed from the graphs in Figs. 11 and 12 that the average accuracy varies considerably when the FV1 feature is used to train the Random Forest classifier, whereas it remains consistent when the OISF feature is used. Figure 13 depicts the absolute difference in accuracy between the two sets of camera angles (27° and 45°).

Fig. 13 Difference in recognition accuracies with slant angles 27° and 45° for the first three experiments of ViHASi dataset

The x-axis and y-axis of the graph in Fig. 13 represent the experiment number and the absolute difference in accuracy, respectively. The minimum variation in accuracy is observed in Case 2 for experiments #1, #2 and #3, which shows that the OISF feature is invariant to the orientation of the camera. Moreover, when this feature is used to train the Random Forest classifier for human activity recognition, an accuracy of ≈ 97% is achieved for videos from both static and moving cameras.

The average accuracy over all the experiments varies from ≈ 97% to ≈ 99%. These results show that the proposed method is capable of dealing with scenarios such as low resolution, complex backgrounds, etc. It is also observed that the variation in recognition accuracy across the experiments is small (≈ 2%), which confirms the robustness of the proposed method.

The FV1 and OISF feature extraction times are shown in Table 19 for all five experiments. From Table 19, it is observed that the average feature extraction rates of the FV1 and OISF features are 38 frames per second (fps) and 34 frames per second (fps), respectively. The training and testing times of the random forest model were also recorded; they are found to be constant and very small compared to the feature extraction time. These experiments show that even when the frame size is of the order of 480 × 640, the approach can be used for real-time activity recognition.

Table 19 Time analysis of all the experiments

5 Comparison of proposed approach with existing state-of-the-art approaches

In this section, the average accuracy of the proposed approach is compared with that of state-of-the-art approaches for human activity recognition on the KTH dataset.

Table 20 lists the average accuracy of different state-of-the-art approaches and of the proposed approach. The average accuracy of the state-of-the-art approaches is about 94%, whereas the accuracy achieved with the FV1 and OISF features is 96.22% and 96.85%, respectively. The accuracy achieved with the OISF feature is thus on average about 2.5% higher than that of the other approaches, which validates our proposed method. It can be concluded from this comparison that, among the compared approaches, the highest accuracy for human activity recognition is achieved with the OISF feature.

Table 20 Accuracy of state-of-the-art approaches and proposed approach

6 Conclusion

In this work, a novel and efficient feature termed OISF has been introduced and tested for Human Activity Recognition. To check the robustness of this feature for moving cameras, silhouettes from the ViHASi dataset, which contains videos recorded by different cameras at different angles, are used. The average recognition accuracy of the proposed approach for experiments #1, #2 and #3 is 99.20%, 99.59% and 99.72%, respectively. The small variations in recognition accuracy confirm the robustness of the newly proposed OISF feature with respect to the nature of the activities (similar activities, combinations of similar and dissimilar activities, or dissimilar activities). The recognition accuracy of the newly proposed OISF feature is superior to that of existing approaches for videos from moving cameras, while its performance is on par with the existing feature for static cameras. Experimentally, the overall recognition accuracy of the proposed approach is found to be ≈ 99.30% on the ViHASi dataset, ≈ 96.85% on the KTH dataset and ≈ 98.34% on the in-house dataset. In the proposed approach, skeletons are used to extract the features, which reduces the processing time of feature extraction; the average feature extraction rates of the FV1 and OISF features are 38 fps and 34 fps, respectively. The high accuracies obtained in both cases show that the proposed approach is applicable to real-life activity recognition tasks such as patient monitoring and fight detection between persons.