1 Introduction

Human activity detection has numerous applications in security, surveillance, robotics and interactive systems. Tracking suspicious activities performed by individuals and predicting hostile actions beforehand can mitigate the effects of unpleasant events. Security personnel have performed the task of tracing suspects for decades; however, humans are prone to error, and mistakes may result in false accusations. For this reason, automated surveillance systems are in the spotlight as aids to the surveillance process. The increased intensity of criminal activity around the globe calls for better automated systems that speed up surveillance and produce more accurate results. A well-developed system that detects, recognizes and follows the actions of an individual is needed, one that not only maintains a record of past events but also predicts upcoming events from irregularities in a person's normal trail.

This paper presents a comprehensive system to detect 13 single-human actions and their combinations in multi-human scenarios, using two people to cover the multi-human case. Our technique identifies human activities irrespective of which primary body parts are involved in performing the action. The proposed methodology is extendable to a range of actions for any number of people present in the view.

Spatio-temporal, flow-based and keypoint-dependent methodologies are known to work well for human action recognition, either individually or in combination [1,2,3,4,5]. Most researchers have preferred ready-made silhouettes for feature extraction, both to model shape-based changes in human poses over a series of frames and to minimize computational complexity [6, 7]. Among flow- and trajectory-based features, optical flow has proved useful for tracking interest points over a sequence of frames [8,9,10]. For classification, support vector machines (SVM) [11, 12], hidden Markov models (HMM) [13,14,15,16] and artificial neural networks (ANNs) [17] are widely used.

By exploiting the strengths of features commonly used for human activity detection, and adding a few more, we have assembled a comprehensive feature set that detects a wide range of human activities. Velocity and displacement, the scale-invariant feature transform (SIFT) [1], histograms of oriented gradients (HOG) [18], Haar wavelets and the radial histogram each provide a significant boost in detection accuracy. We handle the sparse, high-dimensional data by training the SVM via sequential minimal optimization (SMO) [19]. To account for realistic scenarios, we have not restricted our system to single-human action detection but have also incorporated recognition of multiple people performing activities independently. We present a generalization of actions (Fig. 1) covering actions not elaborated in this paper but classifiable with the proposed methodology. Training and evaluation of single-human actions were performed on the KTH and Weizmann datasets; multi-human activity was evaluated on a self-generated dataset covering 13 human actions performed by two persons independently in each video sequence. The dataset is discussed in Sect. 4 (see Fig. 8).

Fig. 1 Broad classification of action classes

Section 2 gives a detailed survey of previous work. Section 3 discusses the methodology, including the preprocessing performed on the data, the extracted features, and the classification techniques employed in our system. Sections 4 and 5 cover experimentation and results respectively.

2 Literature survey

Human action recognition techniques can be broadly classified into four categories: flow-based methods, spatio-temporal templates, tracking, and interest-point-based methods [1, 20]. Spatio-temporal templates use distinctive body poses to encode action information. Optical flow exemplifies flow-based methods, while SIFT or speeded-up robust features (SURF) exemplify interest-point-based techniques. Moussa et al. [1] use fine-tuned SIFT interest points and K-means clustering, then build a normalized codebook from the cluster indices, which is passed on to the later stages of classification with an SVM; the codebook of visual words is normalized with min-max normalization. They successfully classified six actions of the KTH dataset and ten of Weizmann. A human pose conveys much about the action being performed when it is combined with other poses tracked before or after the observation. Thurau and Hlavac [21] take such an approach to activity recognition. Their methodology works for both static images and sequences of video frames since it involves no dynamic feature computations. Poses are weighted according to the credibility of their occurrences, and the testing phase classifies the action category using these training-data weights.

Ming-Yu and Alexander [22] used spatio-temporal features along with interest points to model local motion. They applied optical flow to SIFT features and retained only those showing significant flow; using a bi-gram bag-of-words approach for clustering, they achieved 95.83% accuracy on the KTH dataset. Manosha et al. [23] combined optical flow and silhouette-based shape features to train their system on the Weizmann and UIUC datasets, using a one-vs-one multiclass SVM to achieve 97.45% accuracy on UIUC and a perfect 100% on the 10 actions of Weizmann [24]. Umakanthan et al. [18] identified problems with standard multi-class SVM approaches on complex datasets and proposed binary-tree SVM based activity recognition using Gaussian mixture models; with HOG and motion boundary histograms at the feature extraction stage, they obtained 58.2% accuracy on the challenging Hollywood dataset. Recently, Liu et al. [25] used a Bayesian-network-based generative framework to model complex activities built from primitive actions recorded by sensors. Their probabilistic network modeled well the uncertainties arising from missing or incorrect sensor data in the temporal domain, achieving 78% accuracy on the complex OSUPEL [26] dataset. Abdul Azim et al. [27] proposed optical-flow-variant trajectory features and achieved 94.90% accuracy on the KTH database and 95.36% on the Weizmann dataset.

Multi-person action recognition is more challenging than the single-person case. One of the primary reasons is the lack of simple datasets for multi-person action recognition [28]; the available ones are generally more complex than those used for single-human actions.

Gilbert [29] used two-dimensional Harris corner interest points for multi-class action recognition on the multi-KTH dataset [30], which contains the same six actions as KTH [31]. Gilbert's approach resembles that used for pose estimation, but the templates are a mixture of spatio-temporal features; actions are classified by maximum likelihood based on the count of matching action templates. In [32], the methodology was extended to enrich low-level features with neighborhood information, building hierarchical classifiers that achieved greater speed and accuracy than [29]. This approach relies heavily on interest points and suffers from inaccuracies. Our methodology instead uses interest points in combination with other low-level features encoding the spatio-temporal information of the given frames. Details of our features are given in Sect. 3.

These methods have been evaluated on either the KTH [31] or the Weizmann [33] dataset, each uniform across all its data instances. Our approach is independent of this uniformity: it draws training data from more than one dataset and accordingly handles non-uniform data specifications. Moreover, we have devised a system to classify multi-human actions, evaluated on our in-house, self-generated dataset; here too the accuracies agree with our results on the public datasets. Our contributions include a generalization of basic action categories that extends recognition to actions not covered in this paper, and a faster approach to action recognition in general, using SMO [19] to train a polynomial-kernel SVM. Our dataset includes more actions than multi-KTH and is more challenging than multi-KTH [30].

3 Methodology

Our proposed system comprises three major stages: preprocessing, feature extraction and classification. We employ several feature extraction approaches to describe activities across a series of frames; given the significance of each, heterogeneous features are used to cover all poses. An overview of the proposed framework is shown in Fig. 2.

Fig. 2 Framework for the proposed activity detection system

3.1 Preprocessing

Frames extracted from the video are resized to 128 × 64 so that a fixed number of features can be calculated. We exploit the stable background to segment out the human body using background subtraction: silhouettes are extracted by frame differencing, and the largest connected components are retained to separate the human silhouette from other connected components (Fig. 3). In the multi-human case, multiple silhouettes are extracted from the image, one per person in the frame. The extracted silhouettes are forwarded to the system for centroid detection, and the region of interest is selected on the basis of the silhouette boundary and cropped out for feature calculation, as shown in Fig. 4 and sketched below.
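The following is a minimal sketch of this preprocessing step using OpenCV, assuming a known static background frame; the difference threshold and minimum component area are illustrative values, not taken from the paper.

```python
import cv2
import numpy as np

def extract_silhouette(frame, background, thresh=30, min_area=500):
    """Frame differencing against a static background, keeping the
    largest connected components as the human silhouette."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    bg = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, bg)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)

    # Retain sufficiently large connected components (label 0 is background).
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    silhouette = np.zeros_like(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            silhouette[labels == i] = 255

    # Crop the region of interest around the silhouette and resize to
    # the fixed 128 x 64 (height x width) used for feature calculation.
    ys, xs = np.nonzero(silhouette)
    if len(xs) == 0:
        return None  # no foreground detected
    roi = silhouette[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(roi, (64, 128))
```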

Fig. 3 Single-human frame preprocessing

Fig. 4 Multi-human frame preprocessing

3.2 Features

Our approach to human activity detection relies on a fusion of features. Details of each feature are given in this sub-section, along with its importance in classification.

Velocity and displacement are calculated from the silhouette center points, marked P1 and P2 in two consecutive frames. The movement between these two points yields both features, which in our method are averaged over a series of frames.

Displacement is simply the difference between points P1 and P2, while velocity is displacement divided by the time between the two frames. These two features help distinguish activities such as boxing and walking, where the former involves negligible whole-body movement compared with the latter.
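As a concrete illustration, a sketch of the centroid-based computation is given below; the 25 fps inter-frame interval follows the datasets described in Sect. 4, and measuring distance in pixels is our assumption.

```python
import numpy as np

def centroid(silhouette):
    """Center of mass of a binary silhouette, as (row, col)."""
    ys, xs = np.nonzero(silhouette)
    return np.array([ys.mean(), xs.mean()])

def displacement_and_velocity(p1, p2, dt=1.0 / 25):
    """Displacement between centroids P1 and P2 (in pixels) and the
    velocity obtained by dividing it by the inter-frame time dt."""
    d = np.linalg.norm(p2 - p1)
    return d, d / dt

# Both values would then be averaged over a series of frames.
```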

The histogram of oriented gradients (HOG) counts localized gradient directions in a specified image block; the histogram bins represent the distribution of edge directions. HOG works locally by dividing the image into cells, the portions over which the gradient histograms are created. These histograms are grouped into a larger block and normalized (Eq. 1) to suppress the effect of lighting variations, and the histograms of all blocks are concatenated to form one HOG representation of the image. The two basic steps of HOG are:

1. Computation of gradient values

2. Formation of orientation bins from the computed gradients

HOG features have proved significant in pedestrian detection, which motivated their use here in combination with other well-known features. We build poselets of human actions using the HOG descriptor, which gives a global edge descriptor for activity detection. In the proposed method we use a sequence of frames from a video; HOG determines the gradient frequencies for each frame, and an average is taken over all extracted features. We used a 24 × 128 window size and a [−1, 0, +1] filtering mask. A histogram is calculated for each cell of the 8 × 8 grid, and the cell histograms are normalized within a larger block of 16 × 16 pixels using Eq. 1.

$$f=\frac{m}{{\sqrt {\left\| m \right\|_{2}^{2}} +c}}$$
(1)

Here m is the non-normalized vector containing the histograms of a block, and c is a small constant.
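A hedged sketch of this descriptor using scikit-image is shown below; the 8 × 8 cells and 16 × 16-pixel (2 × 2-cell) blocks follow the text, while the 9 orientation bins and the library's built-in L2 block normalization are assumptions standing in for Eq. 1.

```python
import numpy as np
from skimage.feature import hog

def averaged_hog(frames):
    """Compute a HOG descriptor per grayscale frame and average over
    the sequence, as done for each group of frames in the method."""
    descriptors = [
        hog(f, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), block_norm='L2')
        for f in frames
    ]
    return np.mean(descriptors, axis=0)
```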

The local binary pattern (LBP) is a texture analysis tool used for human activity recognition in [34,35,36]. It is robust to illumination changes and extracts local features over the whole image, considering a 3 × 3 neighborhood at a time. A binary number is generated for each pixel in a window by comparing the center pixel \(I(x_c, y_c)\) with the surrounding pixels \(I(x_0, y_0), I(x_1, y_1), \ldots, I(x_7, y_7)\). Let \(v\) denote a pixel value and let \(s(x)\) be the function that maps an intensity difference to a binary value, given by Eq. 2:

$$s(x)=\left\{ {\begin{array}{*{20}{c}} 1&{{\text{if }}x \geq 0} \\ 0&{{\text{if }}x<0} \end{array}} \right.$$
(2)

where \(x = v_i - v_c\) for \(0 \le i \le 7\).

The LBP pattern of \(I(x_c, y_c)\) is the decimal equivalent of the binary pattern obtained by Eq. 3:

$$LBP\left( {x_c},{y_c} \right)=\mathop \sum \limits_{i=0}^{7} s\left( {v_i} - {v_c} \right){2^i}$$
(3)

A total of 8192 LBP features are extracted from a single 128 × 64 image, one code per pixel. These features are used for human activity classification in combination with the others.
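A minimal sketch using scikit-image is given below; the per-pixel code map matches Eqs. 2 and 3 for the 8-neighbor, radius-1 case, and flattening it over a 128 × 64 frame yields the 8192 values mentioned above.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_features(gray_image):
    """8-neighbor LBP code for every pixel (Eq. 3), flattened into a
    feature vector; for a 128 x 64 image this gives 8192 features."""
    codes = local_binary_pattern(gray_image, P=8, R=1, method='default')
    return codes.ravel()
```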

The Haar wavelet is well known in data analysis and image compression. The information in the input image is divided into approximation and detail sub-signals. The approximation sub-signal is obtained by applying a low-pass filter in both the horizontal and vertical directions (LL), producing the top-left segment in Fig. 5. The detail sub-signals comprise three types of high-frequency component: a horizontal high-pass with vertical low-pass filter (HL) yields the top-right block; a horizontal low-pass with vertical high-pass filter (LH) yields the bottom-left block; and a high-pass filter in both directions (HH) yields the lower-right segment of the output image in Fig. 5.

Fig. 5 Wavelet features of the activities standing and bending (left to right)

As seen in Fig. 5, the top-left segment contains the most energy and the lower-right the least. Edge information is preserved only in LH and HL, so we use LL, LH and HL as classification features. The forward wavelet transform is defined by Eq. 4.

$$\left[ {\begin{array}{*{20}{c}} {{y_{11}}}&{{y_{12}}} \\ {{y_{21}}}&{{y_{22}}} \end{array}} \right]=F\left[ {\begin{array}{*{20}{c}} {{x_{11}}}&{{x_{12}}} \\ {{x_{21}}}&{{x_{22}}} \end{array}} \right]$$
(4)

where \(x_{ij}\) is the input matrix, \(y_{ij}\) the output matrix, and F the transform filter defined in Eq. 5

$$F=\frac{1}{{\sqrt 2 }}~\left[ {\begin{array}{*{20}{c}} 1&{ - 1} \\ 1&1 \end{array}} \right]$$
(5)
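The decomposition can be sketched with PyWavelets as below; per the text, the HH band is discarded and LL plus the two detail bands are kept (the mapping of PyWavelets' horizontal/vertical detail coefficients onto the LH/HL naming here is our assumption).

```python
import numpy as np
import pywt

def haar_features(gray_image):
    """Single-level 2-D Haar transform; keep the approximation (LL) and
    the two edge-preserving detail bands, discarding HH."""
    LL, (cH, cV, HH) = pywt.dwt2(gray_image.astype(float), 'haar')
    return np.concatenate([LL.ravel(), cH.ravel(), cV.ravel()])
```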

The scale-invariant feature transform (SIFT) transforms image data into coordinates that are invariant to scale with respect to local features. Computing SIFT points involves the following basic steps:

1. Extrema detection uses a difference-of-Gaussian function over scale space to detect the same object from multiple viewpoints, irrespective of changes in scale. Each pixel is compared with its 8 neighbors in the current scale and its 9 neighbors in each adjacent scale; a value greater or smaller than all of these indicates a local extremum.

2. Keypoint localization filters the points of interest out of all detected keypoints by simple thresholding.

3. Orientation assignment computes the gradient magnitude and orientation of each detected keypoint to bring consistency to the local keypoint orientations.

SIFT yields a varying number of keypoints per frame, independent of previous frames. We adapt it to provide a constant number of keypoints per frame using Algorithm 1, with upper bound u = 20 and lower bound l = 13. The detected keypoints are used as features alongside the others to enhance classification performance, as sketched below.
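The sketch below reconstructs the spirit of Algorithm 1 with OpenCV's SIFT, adjusting the contrast threshold until the keypoint count falls within [l, u] and then keeping the strongest responses; the adjustment loop and its constants are our assumptions, not the paper's exact procedure.

```python
import cv2

def bounded_sift_keypoints(gray_image, lower=13, upper=20):
    """Detect SIFT keypoints, retuning the detector until their number
    lies within [lower, upper], then keep the top responses."""
    thresh = 0.04  # OpenCV's default contrast threshold
    kps = []
    for _ in range(10):
        sift = cv2.SIFT_create(contrastThreshold=thresh)
        kps = sift.detect(gray_image, None)
        if len(kps) > upper:
            thresh *= 1.5   # stricter threshold -> fewer keypoints
        elif len(kps) < lower:
            thresh *= 0.5   # looser threshold -> more keypoints
        else:
            break
    kps = sorted(kps, key=lambda k: k.response, reverse=True)[:upper]
    return [k.pt for k in kps]
```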

The angular histogram helps with human actions involving different poses, each with a unique spread in space. To capture this information we employ a radial/angular histogram, an efficient shape descriptor that describes varying body postures effectively.

Algorithm 1

To create the radial histogram, the rectangular ROI containing the detected body silhouette is divided into a grid of four blocks. The angle of each white pixel is calculated, taking the center of each block as the origin. Four histograms of 18 bins (20° per bin) are created, one per block, and then averaged to obtain a single histogram per silhouette, reducing the dimension from 72 bins to only 18 (Fig. 6). A sketch follows the figure.

Fig. 6 Human body silhouette (left) and its angular histogram (right)
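A sketch of this descriptor is given below; whether each block histogram is normalized before averaging is not specified in the text, so the per-block normalization is an assumption.

```python
import numpy as np

def radial_histogram(silhouette):
    """18-bin angular histogram (20 degrees per bin) of white pixels,
    computed per block of a 2 x 2 grid and averaged (72 -> 18 bins)."""
    h, w = silhouette.shape
    hists = []
    for by in (0, h // 2):
        for bx in (0, w // 2):
            block = silhouette[by:by + h // 2, bx:bx + w // 2]
            ys, xs = np.nonzero(block)
            cy, cx = block.shape[0] / 2.0, block.shape[1] / 2.0
            angles = np.degrees(np.arctan2(ys - cy, xs - cx)) % 360
            hist, _ = np.histogram(angles, bins=18, range=(0, 360))
            hists.append(hist / max(hist.sum(), 1))  # normalize per block
    return np.mean(hists, axis=0)
```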

3.3 Fusion of features

Recognizing complex human activities requires detecting multiple local and global features and modeling their interaction to define motion accurately; it is widely accepted in the research community that merging different features yields better classification results [37]. A human activity is characterized by the locomotion of the body, and there are two possibilities for motion: movement of body parts without relocating the whole body (e.g. clapping, waving, boxing and bending), and movement of the whole body from one point to another (e.g. walking, running, jumping and side-walking). To cover all aspects of activity detection, different features need to be combined.

Algorithm 2

Features are combined by first dividing the complete video into groups of 25 frames each, then computing features for every frame in a group. Velocity, displacement, HOG, SIFT, radial histogram and wavelet features are calculated per frame and averaged over the group. Once all feature sets for a group are found, they are concatenated into a single feature list used for classification. The complete steps are shown in Algorithm 2, and a sketch is given below.
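The grouping-and-fusion step can be sketched as follows; the per-frame extractor list is a hypothetical interface standing in for the individual feature routines above, and pairwise features such as velocity are assumed to be handled analogously over consecutive frames.

```python
import numpy as np

GROUP_SIZE = 25  # frames per group, as in the text

def fused_feature_vectors(frames, extractors):
    """Split the video into 25-frame groups, average each per-frame
    feature over the group, and concatenate the averages into one
    feature vector per group (Algorithm 2, simplified)."""
    vectors = []
    for start in range(0, len(frames) - GROUP_SIZE + 1, GROUP_SIZE):
        group = frames[start:start + GROUP_SIZE]
        parts = [np.mean([ex(f) for f in group], axis=0)
                 for ex in extractors]
        vectors.append(np.concatenate(parts))
    return np.array(vectors)
```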

3.4 Classification

We have a large number of sparse features, with dimensionality exceeding 5000. Training on such data via the classical quadratic optimization problem is costly and inefficient, so we selected SMO [19], which works best on sparse, high-dimensional data [38]. SMO not only sped up training but also reduced classification error. We obtained 80.486% accuracy with an SVM radial basis function (RBF) kernel and 91.80% with an SVM polynomial kernel on the single-human activity dataset; accuracy improved to 91.99% when the SVM was trained using SMO. The discrepancy between the SMO-trained SVM and the polynomial SVM is small because SMO optimizes the Lagrange multipliers analytically, as opposed to traditional SVM training, which optimizes all multipliers at once. SMO reaches the global maximum (or minimum), while the SVM with polynomial kernel only gets close to it [19].

SMO solves the same quadratic programming (QP) problem as the SVM but manipulates it to reduce training time while still optimizing the Lagrange multipliers: it breaks the problem into the smallest possible QP sub-problems and optimizes two multipliers at a time analytically. The SVM dual quadratic maximization problem is:

$$\begin{gathered} \mathop {\max }\limits_{\lambda } \sum\limits_{j=1}^{m} {\lambda _{j}} - \frac{1}{2}\sum\limits_{j=1}^{m} {\sum\limits_{k=1}^{m} {\lambda _{j} \lambda _{k} y_{j} y_{k}\, x_{j} \cdot x_{k}}} \hfill \\ 0 \le \lambda _{j} \le C,\;\forall j \hfill \\ \sum\limits_{j=1}^{m} {y_{j} \lambda _{j} = 0} \hfill \\ \end{gathered}$$
(6)

where λ denotes the Lagrange multipliers, x the input data and y the class labels.

SMO optimizes two Lagrange multipliers at a time, keeping all others constant while satisfying the equality and box constraints. Taking λ1 and λ2 as the pair to be optimized and holding the remaining multipliers \({\lambda_3},{\lambda_4}, \ldots ,{\lambda_m}\) constant, the equality constraint becomes:

$${\lambda _1}{y_1}+{\lambda _2}{y_2} = - \mathop \sum \limits_{j=3}^{m} {\lambda _j}{y_j}$$
(7)

We can rewrite the right-hand side of the equation as some constant c:

$${\lambda _1}{y_1}+{\lambda _2}{y_2}=c$$
(8)

The linear Eq. (8) is used to optimize over λ1 and λ2. The optimal value of λ1 is obtained by finding \(\lambda_1^{new,unclipped}\) and restricting it to lie within the upper bound U and lower bound L given in [19]:

$$\lambda _{1} = \left\{ {\begin{array}{*{20}c} {U} & {{\text{if }}\lambda _{1}^{new,unclipped} > U} \\ {\lambda _{1}^{new,unclipped}} & {{\text{if }}L \le \lambda _{1}^{new,unclipped} \le U} \\ {L} & {{\text{if }}\lambda _{1}^{new,unclipped} < L} \\ \end{array}} \right.$$

The same procedure yields the remaining optimal Lagrange multipliers \({\lambda_1},{\lambda_2}, \ldots ,{\lambda_m}\); these optimized multipliers determine the classifier decision boundaries. A minimal sketch of the pairwise update is given below.
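The pairwise update at the heart of SMO can be sketched as follows; this is a minimal single-step version over a precomputed kernel matrix K, omitting Platt's pair-selection heuristics and bias update, so it illustrates Eqs. 6-8 rather than reproducing the full algorithm.

```python
import numpy as np

def smo_pair_step(i, j, K, y, lam, b, C):
    """Jointly optimize lam[i] and lam[j] analytically, holding the
    other multipliers fixed and preserving sum(y * lam) = 0 (Eq. 7)."""
    E_i = (lam * y) @ K[:, i] + b - y[i]   # prediction errors
    E_j = (lam * y) @ K[:, j] + b - y[j]
    # Box-constraint limits L and U for lam[j] (Eq. 8 with 0 <= lam <= C).
    if y[i] != y[j]:
        L, U = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
    else:
        L, U = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]  # curvature along the line
    if L == U or eta >= 0:
        return lam                          # no progress on this pair
    new_j = np.clip(lam[j] - y[j] * (E_i - E_j) / eta, L, U)
    lam[i] += y[i] * y[j] * (lam[j] - new_j)  # compensate to keep Eq. 7
    lam[j] = new_j
    return lam
```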

4 Experimentation

We used three datasets for training and testing our algorithm: the KTH human action dataset [31] and the Weizmann human action dataset [39] for single-human activity recognition, and our multi-human dataset for multi-human activity recognition. This section presents the comparative accuracy analysis, the classifier parameter settings and the dataset descriptions.

4.1 Parameters setting

Classifier accuracy varies with the parameter settings, so we state ours here. A polynomial kernel function is used to classify our large non-linear data, with 10-fold cross-validation. Normalization is carried out before training, and the round-off error is set to 1.0E−12. A rough equivalent of this setup is sketched below.
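The sketch below approximates this configuration with scikit-learn, whose SVC is also an SMO-style solver; the min-max normalization, degree-1 polynomial kernel and placeholder data are assumptions, and the round-off setting has no direct equivalent here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder data: one fused feature vector per 25-frame group,
# labels for the 13 action classes (20 samples per class).
rng = np.random.default_rng(0)
X = rng.random((260, 50))
y = np.repeat(np.arange(13), 20)

model = make_pipeline(MinMaxScaler(),
                      SVC(kernel='poly', degree=1, C=1.0))
scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation
print(f"mean accuracy: {scores.mean():.3f}")
```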

4.2 Datasets

The KTH dataset comprises 600 videos of 6 human actions (boxing, hand-waving, hand-clapping, jogging, running and walking) performed by 25 different subjects in 4 scenarios: outdoors, indoors, outdoors with scale variation, and with different clothes. Videos run at 25 fps with 160 × 120 resolution; examples are shown in Fig. 7. Each video lasts between 10 s and a minute, containing 300–1500 frames, with actions on average repeated every 30 frames. The maximum accuracy achieved on this dataset is around 92–94% [27, 40, 41].

Fig. 7 Sample images from the KTH and Weizmann datasets

The Weizmann dataset has a total of 93 videos of 10 human actions performed by 9 different persons (see Fig. 7). Each video has 40–120 frames at 180 × 140 resolution, with lengths between 1 and 5 s. The highest accuracy reported on this dataset in the literature is 100% [24, 42].

Multi-human dataset. We created our own dataset of 13 human activities (walking, running, jogging, jump, pjump, side, boxing, hand clapping, bend, jack, skip, wave1 and wave2) performed independently by two persons. The dataset has 130 videos of 10 actors; each subject performed all 13 activities, and each video shows both persons performing the same activity; for example, in a boxing video both persons box. The dataset was captured with a static camera under varying lighting conditions. Sample images are shown in Fig. 8.

Fig. 8 Sample images from our multi-human dataset

5 Results and discussion

5.1 Single human activity recognition

Both the KTH and Weizmann human activity datasets were used for training and testing. The actions selected from KTH are boxing, walking, hand-clapping, running and jogging; from Weizmann, bend, jack, jump, pjump, side, skip, wave1 and wave2. We randomly picked 4 videos per action from the Weizmann dataset and 9 from the KTH dataset for training, extracted features from groups of 25 frames each representing one human action, and trained SMO on these features. The confusion matrix is shown in Table 4.

We tested activity recognition while adding and removing different features; accuracy varies with different combinations of velocity (V), displacement (D), SIFT, HOG, LBP, radial histogram (R) and Haar wavelet (W), and we selected the feature set giving the highest accuracy. A comparison of feature combinations and their accuracies is shown in Table 1. The complete feature set achieves 91.038% accuracy; removing the HOG features reduces this to 88.878%, while removing only the LBP features increases it to 91.99%. The best results are therefore obtained without LBP, whose exclusion also reduces the feature dimensionality from 14,102 to 5910. Table 1 shows the detailed accuracy comparison.

Table 1 Accuracy comparison of different features on single human dataset

Table 1 also compares other well-known classification algorithms on the same feature set. SMO outperforms the other classifiers, with the polynomial-kernel SVM closest to it in performance; the remaining algorithms (decision tree, random forest, naive Bayes) fail to produce good results on our high-dimensional sparse feature vector.

An accuracy of 100% is achieved on the classes boxing, jack, pjump and wave2, while the six classes bend, hand clapping, jump, side, skip and wave1 show 97% accuracy. The locomotive actions (walking, running and jogging) are most confused with one another and hence show the lowest accuracy (Table 4).

In Table 1, V is velocity, D is displacement, R is the radial histogram and W the Haar wavelet.

The proposed approach outperforms most recent publications, and the small accuracy variation between KTH and Weizmann shows the robustness of our method. By comparison, Klaser et al. [43] drop to 84.3% on Weizmann against 91.4% on KTH, while Laptev et al. [44] report accuracy equivalent to ours on Weizmann. The comparative results are shown in Table 2.

Table 2 Accuracy comparison on KTH and Weizmann with other publications

5.2 Multi-human activity recognition

Multi-human training and testing were carried out on our own dataset, in the same way as for single-human activity recognition and with the same features. Recognition is performed separately for each person: two distinct feature sets are calculated and classified independently. The confusion matrix for multi-human activity is shown in Table 5. As in the single-human case, walking, running and jogging have the lowest accuracies because they are most confused; hand clapping and pjump reach the maximum accuracy of 100%, and bend, boxing, jack and wave1 exceed 94% (see Table 5 for the per-class comparison). Accuracy again varies with the feature combination: the feature set without LBP attains 86.48% classification accuracy, while removing LBP + radial histogram and SIFT + LBP + radial histogram yields improvements of only 0.214% and 0.429% respectively. Table 3 compares accuracy across feature combinations. As with single-human activity, all other classifiers perform worse than SMO, although the polynomial-kernel SVM is approximately equal to it.

Table 3 Accuracy comparison of different features on multi-human dataset

Tables 1 and 3 depict the accuracy for different feature combinations. The rationale for using such diverse feature sets is the diverse dynamic nature of the activities: some involve performing a specific action in place, while others involve locomotion. Activities with locomotion are best captured by the velocity and displacement measures; SIFT provides interesting local features of an image irrespective of the spatial variation of the action; and the Haar wavelet and local binary pattern capture the appearance attributes of an activity.

From Tables 1 and 3 we can assess the effect of each feature on activity detection. As the accuracies show, velocity, displacement, SIFT, angular histogram, HOG and wavelets are the best descriptors for activity detection, as they combine local and global properties of the human body. LBP is discarded because it considerably slows the system without a significant boost in accuracy; it is more useful where detailed texture description is required, as in face recognition. We therefore preferred the combination of velocity, displacement, SIFT, angular histogram and HOG for activity detection.

The confusion matrices in Tables 4 and 5 show the test-set accuracies of the best-performing proposed system on the 13 action classes. Actions with similar body and limb movements cause slight overlap among a few classes: boxing and hand clapping involve relatively similar arm movements, leading to misclassification in some instances, and jogging and walking both involve significant lower-body movement. In the multi-human case, jumping, jogging, skipping and walking show minor intermixing due to the similarity of their body movements. One-handed and two-handed waving also intermix, but since the second action involves the other arm, which acts as a discriminating trait, the overlap between the two is insignificant. Overall, the system achieves good accuracy with only a small number of false classifications on the test data: 91.99% on the single-human and 86.48% on the multi-human action datasets.

Table 4 Confusion matrix for KTH and Weizmann human activity dataset
Table 5 Confusion matrix for Multi-human activity dataset

6 Conclusion

The methodology uses an amalgam of features, namely velocity, displacement, HOG, radial histogram, LBP, Haar wavelets and SIFT feature points. Multiple combinations of these features were tested with the classifier to select the one giving the best results; LBP and the Haar wavelet were observed to lower accuracy and were therefore excluded from the final combination.

A support vector machine trained via sequential minimal optimization was employed. The method applies to both single-human and multi-human action scenarios, trained and evaluated over 13 human actions in both cases: on the standard KTH and Weizmann action datasets for single-human actions, and on a self-generated in-house dataset for multi-human actions. Accuracies of 91.99% for single-human and 86.48% for multi-human actions were achieved. The system can be extended to human-human and human-object interactions.