1 Introduction

The human action classification problem has attracted a large number of researchers owing to its wide range of real-world applications. Notable applications include visual surveillance [1], smart homes [2], sports [3], entertainment [4], healthcare monitoring [5], patient monitoring [6], elderly care [7], virtual reality [8], human–computer interaction [9], and so on.

Human actions refer to distinct types of activities such as walking, jumping, and waving. However, the wide variations in human body sizes, appearances, postures, motions, clothing, camera motions, viewing angles, and illumination conditions make the action recognition task very challenging. Over the past few years, researchers have introduced numerous action or activity recognition models based on sensors such as RGB video cameras [10], depth video cameras [2], and wearable sensors [11]. Of the two video data sources, action recognition research based on conventional RGB cameras (e.g., [12]) has achieved great progress over the last few decades. However, the use of RGB cameras for action recognition raises significant impediments such as lighting variations and cluttered backgrounds [13]. In contrast, depth cameras generate depth images, which are insensitive to lighting variations and make background subtraction and segmentation easier. In addition, body shape and structure characteristics, as well as human skeleton information, can be obtained from depth images.

Many previous attempts at efficient recognition systems can be listed, such as DMM [14], HON4D [15], Super Normal Vector [16], and skeletons as points in a Lie group [17]. However, these existing methods still face crucial challenges, including depth video processing, appropriate feature extraction, and reliable classification performance. Considering these challenges, this study focuses on building an effective and efficient human action recognition framework for depth action video sequences. The main objective of this work is to enhance classification accuracy by proposing an efficient recognition framework that overcomes the above challenges more effectively. More specifically, an action video is represented through three 2D motion-oriented and three 2D static-oriented images of the action. The motion and static maps are derived by applying the 3DMTM [18] to the video. The obtained representations are then enhanced with the LBP [19] operator, which enriches the action description by encoding the motion and static maps into binary patterns. The outputs of the LBP are then passed to GLAC [20] to generate gradient auto-correlation vectors: three feature vectors for the motion images and three for the static images. The first three vectors are concatenated to construct a motion-based GLAC vector, and the remaining three are concatenated into a static-based GLAC vector. To further strengthen the representation, these two vectors are concatenated to build the final action description. Finally, the action is recognized by passing this vector to a supervised learning algorithm, the Kernel-based Extreme Learning Machine (KELM) [21].

The main contributions of this paper are:

  • We enhance the auto-correlation features for an optimal description of an action. To assess the significance of this feature enhancement, the action is also represented with the ordinary auto-correlation features. Our action representation technique significantly addresses the intra-class variation and inter-class similarity problems.

  • We report recognition results on three benchmark datasets, namely MSRAction3D [22], DHA [23] and UTD-MHAD [24]. The recognition results are compared with state-of-the-art handcrafted as well as deep learning methods.

  • We compare the recognition results obtained with the enhanced auto-correlation features against those obtained with the plain auto-correlation features. These comparisons are made on the same datasets to fairly evaluate the effectiveness of the enhanced auto-correlation features.

  • Finally, we report the computational efficiency of the system in terms of running time and computational complexity.

Based on three publicly available datasets (MSRAction3D [22], DHA [23] and UTD-MHAD [24]), the proposed method is compared extensively with handcrafted and deep learning methods. The computational efficiency assessment indicates that the proposed approach is feasible for real-time implementation. The workflow of the system is illustrated in Fig. 1.

Fig. 1
figure 1

Workflow illustration of our method

This paper is organized as follows: Sect. 2 reviews related literature. Section 3 describes the research methodology. The experimental results and discussion are presented in Sect. 4. Finally, Sect. 5 concludes the work.

2 Related work

Feature extraction is a key step in computer vision problems such as object localization, human gait recognition, face recognition, action recognition, and text recognition. As a result, researchers have paid considerable attention to effective feature extraction. For example, for object recognition, Ahmed et al. [25] introduced a saliency map on RGB-D indoor data with numerous applications such as vehicle monitoring, violence detection, and driverless driving; Hough voting and distinctive features were used to measure the efficiency of that work. To extract human silhouettes from noisy backgrounds, Jalal et al. [26] applied an embedded HMM for activity classification, where spatiotemporal body joints and depth silhouettes were fused to improve accuracy. In another work on online human action and activity recognition, Jalal et al. [27] performed multi-feature fusion of skeleton joints and human shape features. For feature extraction in activity recognition, Tahir et al. [28] applied 1-D LBP and the 1-D Hadamard wavelet transform along with a Random Forest. On depth video sequences, Kamal et al. [29] utilized a modified HMM to fuse temporal joint features and spatial depth shape features. To recognize facial expressions, Rizwan et al. [30] implemented local transform features, where HOG and LBP were used for feature extraction. Skin joint features based on skin color and self-organizing maps were also used for activity recognition [31]. In another work, Kamal et al. [32] employed distance-parameter features and motion features. Yaacob et al. [33] introduced a discrete cosine transform approach, particularly for gait action recognition.

In developing vision-based handcrafted action recognition, researchers have also devoted considerable effort to feature extraction for optimal action representation. Motion features of an action were extracted through simplified depth motion maps in DMM [14], DMM-CT-HOG [34] and DLE [35]. Texture features extracted by LBP were utilized in [36]. Recently, Dhiman et al. [37] introduced Zernike moments and the R-transform to create a powerful feature vector for abnormal action detection. A genetic algorithm-based system was proposed by Chaaroui et al. [38] to improve the efficiency of skeleton joint-based recognition by optimizing the skeleton-joint subset. Vemulapalli et al. [17] represented human actions as curves that encode skeletal action sequences. Gao et al. [39] proposed a model to recognize 3D actions in which they constructed a difference motion history image for RGB and depth sequences, captured motion through multi-perspective projections, extracted pyramid histograms of oriented gradients, and finally identified human actions by combining multi-perspective and multi-modality discriminative and joint representations. In the work by Rahmani et al. [40], features obtained from depth images were combined with skeleton movements encoded by difference histograms, and a Random Decision Forest (RDF) was applied to obtain discriminative features for action classification. Luo et al. [41] represented features by a sparse coding-based temporal pyramid matching approach (ScTPM). They also proposed a technique called Center-Symmetric Motion Local Ternary Pattern (CS-Mltp) for capturing spatio-temporal features from RGB videos, and explored feature-level and classifier-level fusion of these features to improve recognition accuracy. Decisive pose features were also used, by applying two distinct transforms, the Ridgelet and Gabor wavelet transforms, to detect human action [42]. Moreover, Wang et al. [43] studied ten Kinect-based methods for cross-view and cross-subject action recognition on six different datasets and concluded that skeleton-based recognition is superior to other approaches in the cross-view setting.

Deep learning models usually learn features automatically from raw depth sequences, which are then useful for computing high-level semantic representations. For example, 2D-CNN and 3D-CNN models were employed by Yang and Yang for deep learning-based depth action classification [44]. To improve the action representation beyond DMM, Wang et al. [45] proposed Weighted Hierarchical Depth Motion Maps (WHDMM), which were fed into three CNN streams to recognize actions. In another approach, before being passed to a CNN, the depth video was described by Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI) [46]. In [47], a novel notion in action classification was introduced by using RGB-domain features as depth-domain features via domain adaptation. Motion History Images (MHI) from RGB videos and DMMs of depth videos were utilized together in a four-stream CNN architecture [48]. Using inertial sensor data and depth data, Ahmad et al. [11] presented a multimodal \(M^2\) fusion process based on a CNN and a multi-class SVM. Very recently, Dhiman et al. [49] merged shape and motion temporal dynamics by proposing a deep view-invariant human action system. To detect human gestures and 3D actions, Weng et al. [50] proposed a pose traversal convolution network that exploits joint pattern features of the human body and represents gestures and actions as sequences of 3D poses. A self-supervised alignment method was used for unsupervised domain adaptation (UDA) [51] to recognize human actions. Busto et al. [52] presented another concept for action recognition and image classification called open set domain adaptation, which works for unsupervised and semi-supervised domain adaptation models.

3 Proposed system

Our proposed framework consists of feature extraction, action representation and action classification. In this section, we discuss these three parts in turn. Figure 2 shows the pipeline of the system.

Fig. 2
figure 2

Architecture visualization of our proposed framework

3.1 Feature extraction

For each action video, three motion and three static information images are first computed by applying the 3DMTM [18] to the video. The 3DMTM yields the set \(\{MHI_{XOY}, MHI_{YOZ}, MHI_{XOZ}\}\) of motion images and the set \(\{SHI_{XOY}, SHI_{YOZ}, SHI_{XOZ}\}\) of static images by simultaneously stacking all the moving and stationary body parts (along the front, side and top projection views) of an actor in a depth map sequence.
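
The following is a minimal NumPy sketch of how front-view (XOY) motion and static history images could be accumulated from a depth sequence. It is an illustration only: the function name mhi_shi_front, the binarization threshold and the decay parameter are assumptions rather than the exact 3DMTM [18] formulation, and the side (YOZ) and top (XOZ) views would be obtained analogously from the corresponding projections.

```python
import numpy as np

def mhi_shi_front(depth_frames, tau=255, delta=32, motion_thresh=10):
    """Accumulate a front-view (XOY) motion history image and static history
    image from a depth sequence. Simplified sketch, not the exact 3DMTM.
    depth_frames: sequence of (H, W) depth maps."""
    frames = np.asarray(depth_frames, dtype=np.float32)
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    shi = np.zeros_like(mhi)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > motion_thresh              # pixels that changed
        body = curr > 0                                           # pixels on the actor
        mhi = np.where(moving, tau, np.maximum(mhi - delta, 0))   # decaying motion trail
        shi = np.where(body & ~moving, shi + 1, shi)              # stack stationary body parts
    return mhi, shi / max(len(frames) - 1, 1)                     # normalized static map

# usage on a random sequence (stand-in for a depth video)
if __name__ == "__main__":
    seq = np.random.randint(0, 256, size=(40, 240, 320))
    mhi_xoy, shi_xoy = mhi_shi_front(seq)
```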

Next, the MHIs and SHIs are converted into a binary coded form by the LBP [19]. These binary coded versions are more enhanced than the original images. Figure 3 shows an MHI, and the corresponding \(BC-MHI\) is shown in Fig. 4. It is clear that the motion information of the action is improved in the \(BC-MHI\); a minimal sketch of this encoding step is given after Fig. 4.

Fig. 3
figure 3

Motion history image of two hand wave action

Fig. 4
figure 4

Enhanced motion history image of two hand wave action
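
As referenced above, the binary coding step can be sketched with the standard LBP implementation from scikit-image; the function name binary_code and the choice of the default 8-neighbour, radius-1 operator are illustrative assumptions, and the same call would be applied to every MHI and SHI to obtain the BC-MHIs and BC-SHIs.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def binary_code(history_image, n_neighbors=8, radius=1):
    """Encode an MHI/SHI into its LBP (binary coded) counterpart.
    Sketch only: the neighbourhood size and radius are assumptions."""
    img = history_image.astype(np.float32)
    # each pixel is replaced by the binary pattern of its local neighbourhood
    return local_binary_pattern(img, n_neighbors, radius, method="default")

# e.g. bc_mhi_xoy = binary_code(mhi_xoy); bc_shi_xoy = binary_code(shi_xoy)
```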

The binary coded motion images \((BC-MHIs)\) are denoted \(BC-MHI_{XOY}, BC-MHI_{YOZ}\) and \(BC-MHI_{XOZ}\) on the three orthogonal projection planes, whereas the binary coded static images on those planes are denoted \(BC-SHI_{XOY}, BC-SHI_{YOZ}\) and \(BC-SHI_{XOZ}\). The binary coded images thus obtained are fed into the GLAC [20] descriptor to extract spatial and orientational auto-correlations for describing the action. This paper extracts the 0th-order and 1st-order auto-correlation features to describe an action. The auto-correlation features describe an action through rich texture information from the images, including image gradients and curvatures simultaneously. Overall, the auto-correlation features are more discriminative than standard histogram-of-gradient features, which is why we adopt them in our approach. For a concrete discussion of applying GLAC to the \(BC-MHIs\)/\(BC-SHIs\), let I be a binary coded motion/static image (i.e., \(BC-MHI\)/\(BC-SHI\)). For each pixel of I, we use image gradient operators to obtain a gradient vector. The magnitude and the orientation of the gradient vector are computed as follows:

$$\begin{aligned} m = {\left\{ \begin{array}{ll} \sqrt{\left( \frac{\partial I}{\partial x}\right) ^2+\left( \frac{\partial I}{\partial y}\right) ^2}, &{}\quad {\mathrm{if}}\; I = I(x, y)\\ \sqrt{\left( \frac{\partial I}{\partial y}\right) ^2+\left( \frac{\partial I}{\partial z}\right) ^2}, &{}\quad {\mathrm{if}}\; I = I(y, z)\\ \sqrt{\left( \frac{\partial I}{\partial x}\right) ^2+\left( \frac{\partial I}{\partial z}\right) ^2}, &{}\quad {\mathrm{if}}\; I = I(x, z)\\ \end{array}\right. }, \end{aligned}$$
(1)
$$\begin{aligned} \theta = {\left\{ \begin{array}{ll} \arctan \left( \frac{\partial I}{\partial x},\frac{\partial I}{\partial y}\right) , &{}\quad {\mathrm{if}}\; I = I(x, y)\\ \arctan \left( \frac{\partial I}{\partial y},\frac{\partial I}{\partial z}\right) , &{}\quad {\mathrm{if}}\; I = I(y, z)\\ \arctan \left( \frac{\partial I}{\partial x},\frac{\partial I}{\partial z}\right) , &{}\quad {\mathrm{if}}\; I = I(x, z)\\ \end{array}\right. }, \end{aligned}$$
(2)

The above orientation \(\theta\) can be coded into D orientation bins by voting weights to the nearest bins to form a sparse gradient orientation vector \(\varvec{g} \in {\mathbb {R}}^{\varvec{D}}\).
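
One way to realize Eqs. (1)–(2) and the sparse vector \(\varvec{g}\) is sketched below in NumPy; the choice of D = 8 bins and linear (soft) voting between the two nearest bins are assumptions for illustration.

```python
import numpy as np

def gradient_orientation_code(I, n_bins=8):
    """Return the gradient magnitude m (Eq. 1) and a sparse per-pixel
    orientation code g of shape (H, W, n_bins), built by soft-voting the
    orientation theta (Eq. 2) into its two nearest bins."""
    gy, gx = np.gradient(I.astype(np.float32))
    m = np.sqrt(gx ** 2 + gy ** 2)
    theta = np.arctan2(gy, gx) % (2 * np.pi)

    pos = theta / (2 * np.pi) * n_bins          # continuous bin position
    lower = np.floor(pos).astype(int) % n_bins
    upper = (lower + 1) % n_bins
    w_upper = pos - np.floor(pos)               # weight voted to the upper bin

    H, W = I.shape
    g = np.zeros((H, W, n_bins), dtype=np.float32)
    rows, cols = np.indices((H, W))
    g[rows, cols, lower] = 1.0 - w_upper        # only two non-zero entries per pixel
    g[rows, cols, upper] = w_upper
    return m, g
```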

Through the gradient orientation vector \(\varvec{g}\) and the gradient magnitude m, the Kth order auto-correlation function of local gradients can be written as

$$\begin{aligned} F\left( d_{0}, \ldots , d_{K}, \varvec{b}_1, \ldots ,\varvec{b}_K\right) = \int w\left[ m\left( \varvec{r}\right) ,m\left( \varvec{r}+\varvec{b}_1\right) , \ldots , m\left( \varvec{r}+\varvec{b}_K\right) \right] g_{d_{0}}(\varvec{r})\, g_{d_{1}}\left( \varvec{r}+\varvec{b}_1\right) \cdots g_{d_{K}}\left( \varvec{r}+\varvec{b}_K\right) d\varvec{r}, \end{aligned}$$
(3)

where \(\varvec{b}_1, \varvec{b}_2,\dots ,\varvec{b}_K\) are shifting vectors from the position vector \(\varvec{r}\) (the position of each pixel in image I), \(g_d\) denotes the dth element of \(\varvec{g}\), and \(w\left( .\right)\) is a weighting function over the gradient magnitudes, which acts as the weight of the auto-correlations. All the shifting vectors are restricted to local neighbourhoods, since neighbouring gradients are likely to be strongly correlated. Two types of correlations among gradients are obtained from Eq. (3): spatial gradient correlations, captured by the vectors \(\varvec{b}_i\), and orientational gradient correlations, captured through the products of the values \(g_{d_{i}}\). By changing the values of K, \(\varvec{b}_i\), and the weight w, Eq. (3) can take various forms. Lower values of K capture lower-order auto-correlation features, which, together with the shifting vectors \(\varvec{b}_i\), carry rich geometric characteristics. Because of the isotropic characteristic of the image, the shifting intervals are kept identical along the horizontal and vertical directions. For \(w\left( .\right)\), the min function is adopted to suppress the impact of isolated noise on the auto-correlations.

Following the suggestion in [20], \(K\in \{0,1\}\), \(b_{1x,y}\in \{\pm \Delta \varvec{r},0\}\) and \(w\left( .\right) \equiv {\mathrm {min} \left( .\right) \ }\) are considered in this paper, where \(\Delta \varvec{r}\) is the displacement interval along both the horizontal and vertical directions (identical for both due to the isotropic property of the image). From Eq. (3), for \(K\in \{0,1\}\), the 0th order \((\varvec{F}_{0})\) and the 1st order \((\varvec{F}_{1})\) GLAC features are as follows:

$$\begin{aligned}&\varvec{F}_0: F_{K=0}\ (d_o)=\sum _{r \in I} m(\varvec{r}) g_{d_{0}}(\varvec{r}) \end{aligned}$$
(4)
$$\begin{aligned}&\varvec{F}_1: F_{K=1} (d_0,d_1, \varvec{b}_1) \nonumber \\&\quad =\sum _{r \in I} \min \left[ m(\varvec{r}), m\left( \varvec{r}+\varvec{b}_{1}\right) \right] g_{ d_{0}}(\varvec{r})g_{d_{1}} (\varvec{r}+\varvec{b}_{1} ) \end{aligned}$$
(5)

A single mask pattern is used for Eq. (4), and there are four independent mask patterns for Eq. (5) when computing the auto-correlations. The mask (spatial auto-correlation) patterns of \((\varvec{r}, \varvec{r} + \varvec{b}_1)\) are depicted in Fig. 5. Since there is a single mask pattern for \(\varvec{F}_0\) and four mask patterns for \(\varvec{F}_1\), the dimensionality of the GLAC features \((\varvec{F}_0\) and \(\varvec{F}_1)\) is \(D + 4D^2\). Although the dimensionality of the GLAC features is high, the computational cost is low due to the sparseness of \(\varvec{g}\). It is worth noting that the computational cost is invariant to the number of bins, D, since the sparseness of \(\varvec{g}\) does not depend on D.
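
A compact sketch of Eqs. (4) and (5) is given below, operating on the magnitude m and sparse code g from the previous sketch; the four shift patterns and \(\Delta \varvec{r} = 1\) are assumptions intended to mirror Fig. 5, so the resulting vector has length \(D + 4D^2\).

```python
import numpy as np

def _aligned(a, dy, dx):
    """Views of array a at positions r and r + b1, cropped to their overlap."""
    H, W = a.shape[:2]
    r = a[max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)]
    rb = a[max(0, dy):H - max(0, -dy), max(0, dx):W - max(0, -dx)]
    return r, rb

def glac(m, g, dr=1):
    """0th and 1st order GLAC features (Eqs. 4-5) -> vector of length D + 4*D^2."""
    F0 = np.tensordot(m, g, axes=([0, 1], [0, 1]))            # Eq. (4): sum_r m(r) g_d(r)
    shifts = [(0, dr), (dr, dr), (dr, 0), (dr, -dr)]          # assumed four mask patterns
    F1 = []
    for dy, dx in shifts:
        m_r, m_rb = _aligned(m, dy, dx)
        g_r, g_rb = _aligned(g, dy, dx)
        w = np.minimum(m_r, m_rb)                             # w(.) = min, Eq. (5)
        F1.append(np.einsum('ij,ijd,ije->de', w, g_r, g_rb))  # sum_r w g_{d0}(r) g_{d1}(r+b1)
    return np.concatenate([F0] + [f.ravel() for f in F1])
```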

Figure 7 shows an example of the 0th and 1st order GLAC features with 8 orientation bins (the bins are shown in Fig. 6). Based on these texture features, an action is described over the motion images by the vector \(EAMF=[EAMF_{XOY}, EAMF_{YOZ}, EAMF_{XOZ}]\), where \(EAMF_{XOY}, EAMF_{YOZ}\) and \(EAMF_{XOZ}\) are obtained by passing the binary coded motion images to the 2D GLAC descriptor. Similarly, to represent the action over the static images, the vector \(EASF=[EASF_{XOY}, EASF_{YOZ}, EASF_{XOZ}]\) is obtained by concatenating the enhanced auto-correlation feature vectors extracted from the multi-view static images. The EAMF is complementary to the EASF; therefore we fuse these two vectors into a single vector to obtain an optimal representation of the action (a sketch of this fusion step is given after Fig. 7). In all our experiments, the dimension of this single feature vector is 4752. The feature vector is inexpensive to compute owing to the sparse vector \(\varvec{g}\). The work in [20] provides more detail on GLAC.

Fig. 5
figure 5

Mask patterns for the 0th and 1st order auto-correlation

Fig. 6
figure 6

Example of orientation bins in auto-correlation computation

Fig. 7
figure 7

Example of 0th and 1st order GLAC features
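
Putting the pieces together, the sketch below assembles the EAMF, EASF and fused descriptor for one depth video; it assumes the illustrative functions mhi_shi_front, binary_code, gradient_orientation_code and glac from the earlier sketches are available, and only the front view is spelled out.

```python
import numpy as np
# assumes the earlier sketches (mhi_shi_front, binary_code,
# gradient_orientation_code, glac) are collected in the same module

def action_descriptor(depth_video):
    """Build the fused EAMF/EASF descriptor for one depth video (sketch).
    Only the front (XOY) view is shown; the YOZ and XOZ projections of the
    3DMTM would contribute two more GLAC vectors to each list."""
    mhi_xoy, shi_xoy = mhi_shi_front(depth_video)
    eamf_parts, easf_parts = [], []
    for image, parts in ((mhi_xoy, eamf_parts), (shi_xoy, easf_parts)):
        bc = binary_code(image)                  # BC-MHI / BC-SHI
        m, g = gradient_orientation_code(bc)     # Eqs. (1)-(2)
        parts.append(glac(m, g))                 # Eqs. (4)-(5)
    eamf = np.concatenate(eamf_parts)    # would be [EAMF_XOY, EAMF_YOZ, EAMF_XOZ]
    easf = np.concatenate(easf_parts)    # would be [EASF_XOY, EASF_YOZ, EASF_XOZ]
    return np.concatenate([eamf, easf])  # fused final action description
```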

3.2 Action classification

To obtain a promising classification outcome, the fused EAMF and EASF vector is passed to the Kernel-based Extreme Learning Machine (KELM), which we discuss in this section. The KELM [21] is an enhancement of the Extreme Learning Machine (ELM) classifier [53]: by associating a suitable kernel with the ELM, the KELM improves the discriminatory power of the classifier. The Radial Basis Function (RBF) kernel is employed in our work. For an intuitive illustration, the classifier is summarized in Algorithm 1, and a minimal sketch of its training and prediction steps follows it.

figure a
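
The following NumPy sketch illustrates the KELM training and prediction steps, based on the standard closed-form solution of [21] with an RBF kernel; the class name KELM and the one-hot style target encoding are implementation assumptions rather than the exact code behind Algorithm 1.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all row pairs of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

class KELM:
    """Kernel Extreme Learning Machine: beta = (I/C + Omega)^-1 T (sketch)."""
    def __init__(self, C=1e4, gamma=0.03):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = X
        self.classes = np.unique(y)
        # one-hot-style targets in {-1, +1}, one column per class
        T = np.where(y[:, None] == self.classes[None, :], 1.0, -1.0)
        omega = rbf_kernel(X, X, self.gamma)
        self.beta = np.linalg.solve(np.eye(len(X)) / self.C + omega, T)
        return self

    def predict(self, Xte):
        scores = rbf_kernel(Xte, self.X, self.gamma) @ self.beta
        return self.classes[np.argmax(scores, axis=1)]

# e.g. KELM(C=1e4, gamma=0.03).fit(train_features, train_labels).predict(test_features)
```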

4 Experimental results and discussion

We evaluate the proposed framework on the MSRAction3D [22], DHA [23] and UTD-MHAD [24] datasets. Example depth images from each dataset are illustrated in Fig. 8. From the figure, it is clear that these datasets are ready for direct use without any segmentation process. Like other methods [22,23,24], we directly input the depth map sequences into the proposed system without applying any preprocessing algorithm to the sequences.

Fig. 8
figure 8

Action snapshots from the MSRAction3D, DHA and UTD-MHAD datasets

4.1 Experiments on MSRAction3D dataset

The MSRAction3D dataset [22] consists of 20 actions performed by 10 different actors and includes inter-class similarity among different types of actions. Action samples performed by odd-numbered actors are used for model training, and those performed by even-numbered actors are used for testing [22]. The KELM uses \(C={10}^4\) and \(\gamma =0.03\) as the optimal parameter values for training the classification model, determined by 5-fold cross validation (sketched below).
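
The parameter search mentioned above can be sketched as a plain 5-fold grid search; the candidate grids and the reuse of the KELM class from the sketch after Algorithm 1 are assumptions for illustration.

```python
import numpy as np
from itertools import product
# assumes the KELM sketch given after Algorithm 1 is importable here

def select_kelm_params(X, y, Cs=(1e2, 1e3, 1e4), gammas=(0.01, 0.03, 0.06), k=5, seed=0):
    """Pick (C, gamma) by k-fold cross-validated accuracy (sketch)."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    best, best_acc = None, -1.0
    for C, gamma in product(Cs, gammas):
        accs = []
        for i in range(k):
            te = folds[i]
            tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
            pred = KELM(C=C, gamma=gamma).fit(X[tr], y[tr]).predict(X[te])
            accs.append(np.mean(pred == y[te]))
        if np.mean(accs) > best_acc:
            best, best_acc = (C, gamma), float(np.mean(accs))
    return best, best_acc
```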

Table 1 reports a notable accuracy of 97.44% for our approach, indicating that it achieves considerably better classification accuracy than other existing methods. It can be seen that our method outperforms the deep learning systems described in [44] by 6.34% and 11.34% (see Table 1). To clarify the effectiveness of the feature enhancement, a system based on the plain auto-correlation features is also evaluated on this dataset: the enhanced auto-correlation features improve the recognition accuracy by 5.5% over the system without feature enhancement under the same setup and parameters. Figure 9 shows the confusion matrix with the correct and incorrect classification rates. Table 2 lists the failure cases of the approach: "horizontal wave" is confused with "hammer" by 8.3%; "draw x" is confused with "high wave" by 7.14% and with "draw circle" by 21.43%; and "draw tick" is confused with "draw x" by 16.67%. Overall, among the 20 actions, 17 are classified correctly (i.e., with 100% classification accuracy) and the remaining 3 are confused with other actions.

Table 1 Performance of our method compared with existing systems on the MSRAction3D dataset
Fig. 9
figure 9

Confusion matrix obtained for the MSRAction3D dataset

Table 2 Class oriented confusion on MSRAction3D dataset

4.2 Experiments on DHA dataset

The DHA dataset, proposed by Lin et al. [23], contains 23 action categories performed by 21 individuals. The dataset is challenging due to the inter-class similarity of several action categories, such as golf-swing and rod-swing. The training and test instances are separated according to the same protocol as for the previous dataset [23]. The classification parameters \(C={10}^4\) and \(\gamma =0.06\) are chosen through 5-fold cross validation.

Our approach attains a significant classification accuracy of 99.13% on this dataset. As can be seen from Table 3, our approach outperforms [23] by 12.33%, [55] by 7.83%, [71] by 3.69%, [58] by 2.44%, and [39] by 6.73% and by 4.13%. For this dataset, the enhanced auto-correlation method outperforms the plain auto-correlation method by 2.17%. The confusion matrix for the dataset is shown in Fig. 10, and Table 4 summarizes the class-wise confusion. Table 4 shows that "skip" and "side-clap" are misclassified with low confusion rates, while the other 21 actions are classified with 100% accuracy. The misclassifications occur when "skip" is confused with "jump" by 9.09% and "side-clap" is confused with "side-box" by 9.09%.

Table 3 Performance of our method compared with existing systems on the DHA dataset
Fig. 10
figure 10

Confusion matrix obtained for the DHA dataset

Table 4 Class oriented confusion on DHA dataset

4.3 Experiments on UTD-MHAD dataset

The UTD-MHAD dataset [24] was constructed with a Microsoft Kinect camera and a wearable inertial sensor. It includes 27 different actions, each performed four times by four female and four male subjects, giving 861 depth sequences after removing 3 corrupted ones. For actions 1 to 21 the inertial sensor was placed on the performer's right wrist, and for the remaining actions it was placed on the subject's right thigh. The dataset contains a comprehensive set of human actions, such as sport actions, daily activities, and training exercises; more detail can be found in [24]. The entire database is split into training and test sets in the same manner as for the previous two databases [24]. The classification parameters C and \(\gamma\) are set to \({10}^4\) and 0.06, respectively, for promising recognition outcomes.

The experimental evaluation of our approach on the UTD-MHAD dataset is presented in Table 5. The approach achieves 88.37% overall classification accuracy on this dataset. The comparison of our method with other existing methods is also shown in the table: our method outperforms [24] (Kinect) by 22.27%, [24] (Kinect+Inertial) by 9.27%, [58] by 3.97%, [72] by 2.57% and [68] by 6.87%. The enhanced auto-correlation system outperforms the plain auto-correlation system by 0.93%. The confusion matrix is shown in Fig. 11. As the figure indicates, the approach misclassifies some action classes even though the overall classification rate for this dataset is 88.37%; due to inter-class similarity, 16 of the 27 action classes show confusion. Table 6, extracted from the confusion matrix, analyzes the results further. It shows that the action "swipe-right" is confused with "swipe-left" at a rate of 15.79%, and the "wave" action is confused with "draw-circle-CW" at a rate of 20.0%. Similarly, the confusion rate between "clap" and "wave" is 20.0%, and between "wave" and "clap" it is 15.79%. In addition, not all samples of the action classes "basket-ball shoot", "draw-x", "draw-circle-CW", "draw-circle-CCW", "draw-triangle", "baseball-swing", "push", "knock", "catch", "jog", "stand2sit" and "lunge" are classified correctly. The misclassified samples are confused with samples having similar body postures.

Table 5 Performance of our method compared with existing systems on the UTD-MHAD dataset
Fig. 11
figure 11

Confusion matrix obtained for the UTD-MHAD dataset

Table 6 Class oriented confusion on UTD-MHAD dataset

4.4 System competency

The computational time and the complexity of the key components are examined to assess the system's efficiency.

4.4.1 Computational time

The system is evaluated on a desktop with an Intel i5-7500 quad-core processor running at 3.41 GHz and 16 GB of RAM. The system has six major components: MHI/SHI construction, binary coded MHI generation, binary coded SHI generation, EAMF generation, EASF generation, and KELM classification. The running time of each component is measured to assess the time efficiency of the recognition system. The computational time (in milliseconds) per action sample (with 40 depth frames on average) for all components is reported in Table 7. Note that the system needs less than one second (731.43 ± 48.83 ms) to process 40 depth frames. Consequently, our recognition method can be considered suitable for real-time recognition.

Table 7 Computational time (mean ± std) of the key factors of the system

4.4.2 Computational complexity

The PCA and the KELM are the key components in the computational complexity of the introduced system. The PCA has a complexity of \(O\left( m^3+m^2r\right)\) [14] and the KELM has a complexity of \(O\left( r^3\right)\) [73]. As a result, the complexity of the system can be expressed as \(O\left( m^3+m^2r\right) +O\left( r^3\right)\). Table 8 reports the calculated complexity and compares it with the complexities of other existing methods. It can be seen that our method has a lower computational complexity than the other methods listed in the table, while also being superior from the recognition perspective. Thus, our approach is superior in terms of both recognition accuracy and computational efficiency.

Table 8 Computational complexity comparison of the proposed approach with other existing approaches

5 Conclusion

This paper has introduced an effective and efficient human action recognition framework based on enhanced auto-correlation features. The system uses the 3DMTM to derive three motion information images and three static information images from an action video. These motion and static maps are improved by applying the LBP algorithm to them. The outputs of the LBP are fed into the GLAC descriptor to obtain an action description vector. With the feature vectors obtained from GLAC, the action classes are distinguished by the KELM classification model. The approach is extensively assessed on the MSRAction3D, DHA, and UTD-MHAD datasets. Owing to our action representation strategy, the proposed algorithm performs considerably better than existing handcrafted as well as deep learning methods. It is also evident that the enhanced auto-correlation features-based method clearly outperforms the plain auto-correlation features-based method; thus, the feature enhancement contributes significantly to the system. Furthermore, the computational efficiency of the method indicates its suitability for real-time operation.

It is worth mentioning that some misclassifications are observed with our method. Note that the proposed method does not remove noise to improve performance; the system only employs the LBP as a preprocessing step for edge enhancement. In addition to the LBP, a noise removal algorithm could be utilized to address the misclassification issues of the proposed approach and thereby further improve the overall recognition accuracy. The descriptor could also be improved further to increase the discriminatory power of the approach.

In our future work, we aim to build a deep model using the obtained 2D motion and static images. In addition, the current approach has not been evaluated on large and complex RGB datasets such as UCF101 and HMDB51; with proper modification, the approach will be tested on these datasets in the future. Furthermore, we plan to build a new recognition framework using the GLAC descriptor on RGB and depth datasets jointly.