1 Introduction

Human action recognition is an active topic in computer vision that aims to automatically interpret the semantic information conveyed by human actions and interactions with the external environment. It has many real-world applications, such as security monitoring, intelligent human-computer interaction, smart homes, and elderly healthcare [12, 19, 41, 43, 44]. However, the task remains challenging because of issues such as illumination, occlusion, varying spatio-temporal scales, clothing, and viewing angles.

Initially, action recognition technology was mainly based on RGB videos acquired by ordinary cameras [10, 49, 52]. However, RGB information is susceptible to external factors, such as the shooting environment, lighting, and clothing texture, which has limited the development of action recognition. With the introduction of low-cost depth sensors, such as the Microsoft Kinect, ASUS Xtion, and SR-4000, major breakthroughs have been made in human action recognition. Compared to traditional RGB data, depth video sequences provide the 3D structure of actions. The pixels of depth maps describe the distance between object surfaces and the sensor [4]. This range information makes it convenient to segment the foreground person and eliminates the interference caused by complex backgrounds. Therefore, depth maps have better invariance to illumination and texture changes. Moreover, human behavior in real application scenarios is complex and contains abundant spatial information at different scales. Over the past few decades, a variety of methods have been investigated to describe depth videos for action recognition [3, 14, 18, 53]. However, the descriptors used in these methods all lack scale diversity and fail to capture more discriminative features.

Aiming to mine additional multi-scale spatial information from depth video sequences, we propose a novel human action recognition framework with a multi-scale mechanism, as illustrated in Fig. 1. We project each frame of a depth video onto three orthogonal Cartesian planes to obtain three-view depth motion images (DMI), which constitute the 3D action model. After that, we apply the Gaussian pyramid to simulate the scale changes of human vision and obtain a static multi-scale representation of human motions.

Fig. 1

The framework of our proposed human action recognition method

Then, we construct Laplacian pyramids to generate the compact feature map LP-DMI, which enhances the dynamic multi-scale information for action recognition; following the pyramid structure, we extract the LP-DMI-HOG descriptor, which captures multi-granularity motion features. Finally, we employ ELM to classify actions. Specifically, the main contributions of this article are summarized as follows:

  1)

    We study a compact multi-scale feature map based on depth video sequences called LP-DMI. By enhancing the multi-scale dynamic information of actions, the proposed feature map outperforms existing maps while simultaneously excluding redundant static information inside the body.

  2)

    We introduce a feature extraction scheme based on the hierarchical structure of Laplacian pyramids. We extract HOG features and cascade them as LP-DMI-HOG. This descriptor captures multi-granularity features and is therefore more discriminative than others.

  3)

    We propose a multi-scale human action recognition framework in which we generate compact multi-scale feature maps through the Laplacian pyramid of three-view DMI and then extract multi-granularity features. In addition, we use extreme learning machine for action classification.

  4)

    We conduct experiments on the public MSRAction3D, UTD-MHAD, and DHA datasets, and the experimental results demonstrate that our method surpasses state-of-the-art benchmarks.

The rest of this article is organized as follows. Section 2 reviews the previous work related to ours. In Section 3, the proposed method is presented in detail, including building Laplacian pyramids of DMI, extracting the LP-DMI-HOG feature, and action classification. Section 4 discusses the experimental results in comparison with other human action recognition methods. Finally, the conclusions of this paper are drawn in Section 5.

2 Related work

According to the type of input data, human action recognition technologies consist of RGB video based methods [10, 49, 52], depth video based methods [13, 25, 59, 62], 3D skeleton based methods [9, 38, 51], and multi-modal data fusion based methods [7, 16, 57]. Due to the convenience of data acquisition and invariance to illumination and texture changes, many researchers focus on depth video based methods, which generally contain three steps: computing depth feature maps from depth video sequences, generating feature descriptors for motion representation, and recognizing actions with classifiers or neural networks [47, 55]. For higher accuracy, tremendous effort has been devoted to representation and feature extraction strategies for human action recognition. Bobick and Davis [3] introduced a view-based approach built on a temporal template with two components: the presence and the recency of motion in the sequence. They computed motion energy images (MEI) and motion history images (MHI) to model the spatial and temporal characteristics of human actions. Mohammad et al. [4] utilized static history images (SHI) as complementary components of MHI. Motivated by MHI and MEI, Yang et al. [59] projected each depth frame onto three orthogonal Cartesian planes and then carried out subtraction between successive projections to obtain depth motion maps (DMM). In contrast to DMM, Kamel et al. [18] investigated depth motion images (DMI), in which the pixel value is the minimum value at the same pixel position over time, to describe the overall action appearance from the front view. Since DMM fails to distinguish two actions with reversed temporal order, Elmadany et al. [13] divided the depth video sequences into multiple partitions with an equal number of frames. They then constructed hierarchical pyramid depth motion maps (HP-DMM) so as to capture more detailed information about human movements.

Based on the depth feature maps above, many descriptors have been studied for human action recognition. The histogram of oriented gradients (HOG) [26], the local binary pattern (LBP) [8], and other shape and texture features [11] were calculated from DMM for more accurate description. Oreifej and Liu [27] introduced the histogram of oriented 4D normals (HON4D) to describe actions in 4D space, including depth, spatial, and time coordinates. Li et al. [23] introduced the Local Ternary Pattern (LTP) as an image filter for DMMs and applied a CNN to classify the corresponding LTP-encoded images. Tian et al. [35] employed the Harris detector and a local HOG descriptor on MHI for action recognition and detection. Furthermore, Gu et al. [14] selected ResNet-101 as the deep learning model and fed it with MHI. Aly et al. [2] calculated global and local features using Zernike moments with different polynomial orders to represent global and local motion patterns, respectively. Kamel et al. [18] presented a feature fusion method for human action recognition from DMI and moving joints descriptor (MJD) data using convolutional neural networks (CNN). Mohammad et al. [4] extracted gradient local auto-correlation (GLAC) features from the MHI along with SHI to represent the movements. Chen et al. [6] computed GLAC features based on DMM and fed them into an extreme learning machine for activity recognition. Space-time occupancy patterns (STOP) were proposed by Vieira et al. [40], in which the space and temporal axes were divided into several partitions for each sequence. Besides, the bag of angles (BoA) applied to skeleton sequences and another descriptor called hierarchical pyramid DMM deep convolutional neural network (HP-DMM-CNN) for depth videos were presented in [13].

In addition, some new methods have emerged in the latest work. Sun et al. [32] presented a global and local histogram representation model using the joint displacement between the current frame and the first frame, and the joint displacement between pairwise fixed-skip frames, respectively. Ahmad et al. [61] fed feature maps into a CNN architecture rather than using any conventional method, and subsequently Trelinski et al. [37] concatenated handcrafted and action-specific CNN-based descriptors to obtain action feature vectors. Li et al. [21] generated a 3D body mask and then formed depth spatial-temporal maps (DSTMs), which provide compact global spatial and temporal information of human motions. Wei et al. [51] modeled human actions with a hierarchical graph in which the depth video sequence was represented as sequential atomic actions. Each atomic action was denoted as a composite latent state consisting of a latent semantic attribute and a latent geometric attribute. However, the methods above fail to capture multi-scale features for action recognition and thus have poor robustness. Recently, more attention has been paid to multi-scale motion information. Ji et al. [17] embedded skeleton information into depth feature maps to divide the human body into several parts. The surface normals of the local motion part sequences were partitioned into different space-time cells to obtain a local spatio-temporal scaled pyramid, which was applied to extract local feature representations. Yao et al. [60] studied parallel pair discriminant correlation analysis (PPDCA) to fuse multi-scale temporal information with a lower dimension. However, the multi-scale temporal information in this method refers to features related to different numbers of frames. These methods obtain multi-scale information through different numbers of frames and cells or various sampling rates, which in essence is only a scale change at the temporal level. In this paper, we present a multi-scale method based on the scale-space theory in [1]. Note that rather than realizing multiple temporal scales, we focus on the spatial multi-scale representation of feature maps to tackle the problems of complex model representation and low implementation efficiency.

3 Proposed method for human action recognition

A typical action contains characteristic information at different scales, and it can be represented by structured multi-scale features. Learning information at a single spatial scale is insufficient to provide discriminative features for human action recognition. In order to increase scale diversity, we propose a novel method that represents actions by the multi-scale feature map LP-DMI and extracts multi-granularity features with a hierarchical pyramid structure. Then, an extreme learning machine is utilized to recognize human actions.

3.1 Calculation of depth motion images

With the advent of depth cameras, many approaches based on depth videos have been introduced for human action recognition. Each frame of the depth camera records a snapshot of the action at a certain point in time. In general, DMI is considered an effective representation of depth video sequences. It captures not only the overall appearance of actions but also the dense range changes in the moving parts. In this paper, we project the frames obtained by the depth camera onto three orthogonal Cartesian coordinate planes, so each 3D depth frame generates three 2D maps. We denote them as mapv (v ∈ {f,s,t}), corresponding to the front, side, and top views, respectively. The pixel value of DMI is the minimum value at the same spatial position across the depth maps. The three-view DMI of a depth video sequence with N frames can be calculated by the following equation.

$$ \begin{array}{l} D M I_{v}(i, j) =255-\min \left( \operatorname{map}_{v}(i, j, t)\right) ,\\ \qquad \qquad \forall t \in[k, \ldots,(k+N-1)] \end{array} $$
(1)

where mapv(i,j,t) is the pixel value of (i,j) position of 2D map at time t from the perspective of v. k represents the index of the frame. The maps are processed by dividing each pixel value by the maximum value of all the pixels contained in the image for normalization. We crop the region of interest (ROI) in DMI to exclude excess black pixels. This normalization contributes to eliminating intra-class differences and reducing the nuisances caused by body shape and motion amplitude. The generative process of DMI is depicted in Fig. 2.
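For concreteness, the following minimal NumPy sketch computes one DMI according to (1); the function name and the assumption that one view's projections are stacked into a single array are ours, not part of the original implementation.

```python
import numpy as np

def compute_dmi(projections):
    """Compute the depth motion image of one view following Eq. (1).

    projections: array of shape (N, H, W) holding map_v(i, j, t) for
    t = k, ..., k + N - 1 of a single view v in {f, s, t}.
    """
    dmi = 255.0 - projections.min(axis=0)   # pixel-wise minimum over time
    dmi /= dmi.max() + 1e-8                 # normalize by the maximum pixel value
    return dmi
```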

Fig. 2

The process of calculating DMIv from depth video sequences

3.2 Multi-scale representation of depth video sequences

However, DMI only reflects the spatial information of actions at a single scale. In order to capture multi-scale changes of human motions, we adopt the Gaussian pyramid transform, whose practicability in increasing scale diversity has been demonstrated [20, 31]. As shown in Fig. 3, we acquire a cluster of multi-scale feature maps shaped like several pyramids. We stipulate that the number of layers increases in a bottom-up manner. Gl represents the image of the lth layer of a Gaussian pyramid; that is to say, the size of Gl+1 is smaller than that of Gl. We perform Gaussian kernel convolution and downsampling on Gl to produce Gl+1. Mathematically, the gray value at the (i,j) position of Gl can be formulated as:

$$ \begin{array}{@{}rcl@{}} G_{l}(i, j)={\sum}_{m=-c}^{c}{\sum}_{n=-c}^{c} \varpi(m, n) \otimes G_{l-1}\left( 2 i+m, 2 j+n\right), \\ \left( 1 \leq l \leq L, 0 \leq i \leq R_{l}, 0 \leq j \leq C_{l}\right) \end{array} $$
(2)

where ⊗ is a convolution operator and L is the total number of layers in every Gaussian pyramid. (m,n) is the position within the convolution kernel. Rl and Cl are the numbers of rows and columns of the lth layer image of the Gaussian pyramid. c determines the size of ϖ, and ϖ is a Gaussian window of size (2c + 1) × (2c + 1) satisfying the following formula:

$$ \varpi(m, n)=\frac{1}{2 \pi \sigma^{2}} e^{-\left( m^{2}+n^{2}\right) / 2 \sigma^{2}} $$
(3)

where σ is the standard deviation of the normal distribution. It is the variance-related parameter of the Gaussian filter and reflects the degree to which the image is blurred. We regard DMI as the lowest layer of the Gaussian pyramid, denoted as G1. Then, a set of images \(\left \{ G_{1}, G_{2}, \dots ,G_{L} \right \}\), in which Gl+1 is 1/c² the size of Gl, can be generated by (2) and constitutes an L-layer Gaussian pyramid. Thus, a series of Gaussian pyramids, denoted GPL-DMI, is calculated by this iterative scheme. In this paper, we set c to 2 and utilize the 5 × 5 Gaussian kernel in (4). The pyramid algorithm reduces the filter band limit between layers by an octave and reduces the sampling interval by the same factor. The maximum number of downsampling operations depends on the size of the original image: for a Gaussian pyramid built on an M × N image, the maximum number of layers is \(\left \lfloor \log _{2}{\min \limits } \{M, N\}\right \rfloor \).

$$ \varpi=\left[\begin{array}{ccccc} 1 & 4 & 6 & 4 & 1 \\ 4 & 16 & 24 & 16 & 4 \\ 6 & 24 & 36 & 24 & 6 \\ 4 & 16 & 24 & 16 & 4 \\ 1 & 4 & 6 & 4 & 1 \end{array}\right] $$
(4)
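Assuming OpenCV is available, the construction can be sketched as follows; cv2.pyrDown blurs with a 5 × 5 Gaussian kernel of the form in (4) and then discards every other row and column, which matches one pyramid step.

```python
import cv2
import numpy as np

def gaussian_pyramid(dmi, num_layers):
    """Build an L-layer Gaussian pyramid GP-DMI of one DMI following Eq. (2)."""
    layers = [np.asarray(dmi, dtype=np.float32)]   # G_1 is the DMI itself
    for _ in range(num_layers - 1):
        layers.append(cv2.pyrDown(layers[-1]))     # blur with the kernel of Eq. (4), then downsample
    return layers
```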
Fig. 3

The hierarchical structure of GP-DMI

Due to the complexity and concurrency of human behaviors, even a simple action may involve the movement of multiple body parts. We regard the inherent characteristics inside the body as static information, and the contour information that better describes the changes of movements as dynamic information. For the majority of actions, the static information inside the human body is highly similar. Take waving arms in different directions for instance: the information of the abdomen and legs is largely constant and cannot provide sufficiently discriminative features for recognition. On the contrary, the dynamic information of different body parts better reflects the spatial changes of actions over the interval, and thus the specific features of a certain action. Inspired by this, we aim to obtain multi-scale dynamic information for human action recognition. We interpolate the lth layer of the Gaussian pyramid, that is, insert zeros in the even rows and columns. Then, we apply a Gaussian filter to obtain \(G_{l}^{*}\), which has the same size as the image one layer below it. We calculate the difference between Gl and the expanded version of the layer above it, \(G_{l+1}^{*}\), to obtain the multi-scale dynamic information. At the same time, this operation removes a lot of redundant static information, making LP-DMI more compact than GP-DMI. As in the Gaussian pyramid, we set c to 2. Mathematically:

$$ \begin{array}{@{}rcl@{}} G_{l}^{*}(i, j)=4 \sum\limits_{m=-2}^{2} \sum\limits_{n=-2}^{2} \varpi(m, n) \otimes G_{l}\left( \frac{i+m}{2}, \frac{j+n}{2}\right), \\ \left( 1 \leq l \leq L, 0 \leq i \leq R_{l}, 0 \leq j \leq C_{l}\right) \end{array} $$
(5)

and

$$ G_{l}\left( \frac{i+m}{2}, \frac{j+n}{2}\right)=\left\{\begin{array}{ll} G_{l}\left( \frac{i+m}{2}, \frac{j+n}{2}\right), & \text{if }\frac{i+m}{2}, \frac{j+n}{2}\in \mathbb{N}^{+}\\ 0 , & \text{otherwise } \end{array}\right. $$
(6)

Therefore, the Laplacian pyramid can be calculated as follows.

$$ \left\{\begin{array}{ll} L P_{l}=G_{l}-G_{l+1}^{*}, & 1 \leq l<L \\ L P_{L}=G_{L}, & l=L \end{array}\right. $$
(7)

where LPl is the lth layer of the Laplacian pyramid. Considering the integrity of the motion information, we directly take the top layer of the Gaussian pyramid as that of the Laplacian pyramid. Consequently, the two pyramids have an equal number of layers. Specifically, each depth frame produces three depth feature maps according to the three views, so three Laplacian pyramids are generated. As shown in Fig. 4, the Laplacian pyramids remove a large amount of static information inside the body while strengthening the dynamic information at the body boundaries, which is more conducive to extracting discriminative features. In Sec. 4, we will further evaluate the proposed multi-scale feature map LP-DMI.
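A minimal sketch of this construction on top of the Gaussian pyramid is given below (again assuming OpenCV; cv2.pyrUp performs the zero insertion and Gaussian filtering of (5)).

```python
import cv2

def laplacian_pyramid(gaussian_layers):
    """Build LP-DMI from an L-layer Gaussian pyramid following Eqs. (5)-(7)."""
    lp = []
    for l in range(len(gaussian_layers) - 1):
        g_l = gaussian_layers[l]
        # expand G_{l+1} back to the size of G_l (Eq. (5))
        g_up = cv2.pyrUp(gaussian_layers[l + 1], dstsize=(g_l.shape[1], g_l.shape[0]))
        lp.append(g_l - g_up)                      # LP_l = G_l - G*_{l+1} (Eq. (7))
    lp.append(gaussian_layers[-1])                 # top layer: LP_L = G_L
    return lp
```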

Fig. 4

An example of a four-layer LP4-DMI from three views

3.3 Feature extraction with hierarchical pyramid structure

There are several reasonable options for which feature to extract [42, 45, 46]. In this paper, we utilize HOG descriptors to extract the local features of LP-DMI, denoted as LP-DMI-HOG. The HOG feature is sensitive to the distribution of gradient and edge information, and thus characterizes gradient changes, especially the shape of objects, very well. The basic idea is to compute gradient orientation histograms on a dense grid of uniformly spaced cells and perform local contrast normalization [59]. Before extracting features, we replicate adjacent pixels to normalize the feature maps of the same view to the same size. The interpolated pixel values are the same as the neighboring pixels, so they do not interfere with the multi-scale information, and multi-granularity motion features can be computed effectively. Moreover, this step alleviates the problem of overly small images in the higher layers. We cascade the HOG features extracted from the same layer of LP-DMI to obtain the three-view features at the same scale. We then derive LP-DMI-HOG from coarse-grained to fine-grained as the layer increases. We normalize the resulting feature vectors using min-max scaling, and principal component analysis (PCA) is applied to reduce the dimension for the sake of computational efficiency.

We normalize the depth feature maps projected onto the same planes to a uniform size, and the specific parameter settings are shown in Fig. 2. We set the size of each cell to 10 × 10 pixels, the number of gradient orientation bins to 9, and the block size to 2 × 2 cells; the block step is 10 pixels. The numbers of retained principal components for MSRAction3D, UTD-MHAD, and DHA are 550, 860, and 450, respectively. With this configuration, each action sample has a total of 15444 and 20592 dimensions when the number of layers is 3 and 4, respectively. Note that we consider this the default setting for feature extraction. The resulting feature is then fed into the ELM for action classification.
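The following sketch outlines this step with scikit-image's hog and scikit-learn's PCA; the choice of these libraries and the helper names are our own, and the one-cell block stride of skimage.feature.hog corresponds to the 10-pixel step stated above.

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA

def lp_dmi_hog(lp_pyramids):
    """Cascade HOG features over the layers of the three-view Laplacian pyramids.

    lp_pyramids: dict mapping each view in ('f', 's', 't') to its list of
    LP-DMI layers, already resized to a common per-view size.
    """
    feats = []
    for l in range(len(lp_pyramids['f'])):          # from coarse-grained to fine-grained
        for view in ('f', 's', 't'):
            feats.append(hog(lp_pyramids[view][l],
                             orientations=9,
                             pixels_per_cell=(10, 10),
                             cells_per_block=(2, 2)))
    return np.concatenate(feats)

def reduce_dim(train_feats, test_feats, n_components):
    """Min-max scale the cascaded features and project them onto principal components."""
    lo, hi = train_feats.min(axis=0), train_feats.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    pca = PCA(n_components=n_components).fit((train_feats - lo) / scale)
    return (pca.transform((train_feats - lo) / scale),
            pca.transform((test_feats - lo) / scale))
```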

3.4 Action recognition by extreme learning machine

In this work, we employ the extreme learning machine (ELM) for action classification, which was proposed by Huang et al. for training single-hidden-layer feed-forward neural networks (SLFNs) [63]. The weights between the input layer and the hidden layer, as well as the biases of the hidden nodes, can be initialized randomly. Therefore, the ELM only calculates the weight matrix between the hidden layer and the output layer, without the need to tune parameters. This matrix can be obtained by computing a generalized inverse, so the extreme learning machine has distinct advantages in parameter selection and computational efficiency. That is why we use the extreme learning machine for action recognition. Given a training set with n samples and m classes \( D=\left \{\left (x_{i}, y_{i}\right ) \mid x_{i} \in R^{d}, y_{i} \in R^{m}, i=1,2, \ldots , n\right \} \), the SLFNs with N hidden nodes can be expressed as:

$$ f\left( x_{i}\right)=\sum\limits_{j=1}^{N} \beta_{j} g\left( w_{j} \cdot x_{i}+b_{j}\right)=o_{i}, \quad i=1,2, \ldots, n $$
(8)

where wj = (wj1,wj2,...,wjd)T is the weight vector connecting the jth hidden node with the input nodes. βj = (βj1,βj2,...,βjm)T is the weight vector connecting the jth hidden node with the output nodes. bj represents the threshold of the jth hidden neuron, and g(x) denotes the activation function. Note that wj and bj are assigned randomly. The goal of ELM is to minimize the training error as far as possible, which can be written as \({\sum }_{i=1}^{n}\left \|o_{i}-y_{i}\right \|=0\). Therefore, the parameters βj = (βj1,βj2,...,βjm)T can be estimated by least-squares fitting on the given training data D. In other words, the problem can be written as the following equation.

$$ Y = H{\upbeta} $$
(9)

with

$$ H=\left( \begin{array}{ccc} g\left( w_{1} \cdot x_{1}+b_{1}\right) & {\dots} & g\left( w_{N} \cdot x_{1}+b_{N}\right) \\ {\vdots} & {\ddots} & {\vdots} \\ g\left( w_{1} \cdot x_{n}+b_{1}\right) & {\cdots} & g\left( w_{N} \cdot x_{n}+b_{N}\right) \end{array}\right) $$
(10)
$$\upbeta=\left( {{\upbeta}_{1}^{T}}, {{\upbeta}_{2}^{T}}, \ldots, {{\upbeta}_{N}^{T}}\right)^{T},$$
$$Y=\left( {y_{1}^{T}}, {y_{2}^{T}}, \ldots, {y_{n}^{T}}\right)^{T}$$

H is the hidden layer output matrix of the network, in which the jth column is the output vector of the jth hidden node with respect to the inputs \( \left (x_{1}, x_{2},\dots ,x_{n}\right ) \). The ith row of H is the output vector of the hidden layer for input xi. Once the input weights wj and the hidden layer biases bj are determined, the output matrix H of the hidden layer is unique. The number of hidden nodes is usually much smaller than the number of training samples. In this case, the smallest-norm least-squares solution of (9) is obtained by solving the following equation.

$$ \hat{\upbeta}=H^{\dagger} Y $$
(11)

where H† is the Moore-Penrose generalized inverse of matrix H [33].
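A minimal NumPy sketch of such a classifier is given below; the sigmoid activation and one-hot target coding are assumptions on our part, since the paper does not fix these details.

```python
import numpy as np

class ELM:
    """Minimal extreme learning machine classifier following Eqs. (8)-(11)."""

    def __init__(self, n_features, n_hidden, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_features, n_hidden))  # random input weights w_j
        self.b = rng.standard_normal(n_hidden)                # random hidden biases b_j
        self.n_classes = n_classes
        self.beta = None                                      # output weights, learned in fit()

    def _hidden(self, X):
        # hidden layer output matrix H of Eq. (10), with a sigmoid activation g
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        H = self._hidden(X)
        Y = np.eye(self.n_classes)[y]                          # one-hot target matrix
        self.beta = np.linalg.pinv(H) @ Y                      # Eq. (11): beta = H^dagger Y
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)  # class with the largest output
```

Only the number of hidden nodes needs to be chosen; the input weights and biases stay fixed after their random initialization.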

4 Experiment results and analysis

In order to evaluate the effectiveness of the proposed framework, we conduct experiments on the public MSRAction3D [18], UTD-MHAD [7], and DHA [4] datasets. In Fig. 5, the depth video sequence of pickup and throw is shown as an example of the action samples. We investigate how many layers are sufficient to capture multi-scale features for action recognition and compare several ways of extracting local features. In this section, we present the results of the ablation experiments, which optimize the settings and confirm the effectiveness of the multi-scale mechanism in the proposed framework. We also show the advantages of our proposal over other state-of-the-art methods.

Fig. 5

The depth video sequence of pickup and throw in MSRAction3D dataset

4.1 Datasets and experimental settings

4.1.1 Datasets description

The MSRAction3D dataset for action recognition contains 557 depth video sequences and 557 skeleton sequences for 20 actions captured by a Kinect sensor. The actions, performed by 10 subjects, are: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pickup and throw. Every action is repeated by each subject two or three times.

The UTD-MHAD includes 861 samples of 8 subjects. There are 27 actions in total, and every subject performed each action 4 times. The actions are: right arm swipe to the left, right arm swipe to the right, right hand wave, two hand front clap, right arm throw, cross arms in the chest, basketball shoot, right hand draw x, right hand draw circle (clockwise), right hand draw circle (counter clockwise), draw triangle, bowling, front boxing, baseball swing from right, tennis right hand forehand swing, arm curl, tennis serve, two hand push, right hand knock on door, right hand catch an object, right hand pick up and throw, jogging in place, walking in place, sit to stand, stand to sit, forward lunge, and squat.

The DHA database is organized into 483 depth video sequences for 23 actions. Each action was performed 2 or 3 times by 21 subjects (12 males and 9 females). The list of action classes is: bend, jack, jump, pjump, run, side, skip, walk, one-hand-wave, two-hand-wave, front-clap, side-clap, arm-swing, arm-curl, leg-kick, leg-curl, rod-swing, golf-swing, front-box, side-box, tai-chi, pitch, and kick.

4.1.2 Experimental setups

We conduct experiments with the following experimental settings.

Setup 1: Cross-subject. In order to obtain fair experimental results, we perform cross-subject tests on the three benchmark datasets according to the experimental settings of [18, 28]. More precisely, the odd-numbered subjects are used for training, whereas the even-numbered subjects are used for testing.

Setup 2: Subset partition. We divide the MSRAction3D dataset into three subsets as shown in Table 1, and three different tests are conducted on these subsets following the settings of [4]. In test 1, 1/3 of the action samples in each subset are employed as the training set, and the remaining samples are used for validation. Conversely, test 2 uses 2/3 of the samples for training, and the rest form the testing set. Test 3 performs a cross-subject test on each subset following setup 1, that is to say, the action samples of the odd-numbered subjects in each subset are used for training and the rest for testing.

Table 1 Three subsets of the MSRAction3D dataset

Setup 3: K-fold cross-validation. In order to further verify the soundness of the multi-scale feature maps, we carry out k-fold cross-validation (KFCV) experiments. In this setting, every dataset is divided into ten folds; nine folds are combined as the training set and the remaining fold is used as the testing set. This process is repeated ten times so that every fold is tested exactly once, and the average score is taken as the final recognition accuracy. Furthermore, each fold preserves the class ratios of the original data.
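For reference, this protocol matches scikit-learn's StratifiedKFold; the library choice and the helper below are our own.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfcv_accuracy(features, labels, train_and_score, n_splits=10, seed=0):
    """Average accuracy over stratified k-fold cross-validation (setup 3).

    train_and_score(X_tr, y_tr, X_te, y_te) is assumed to train a classifier
    (e.g. the ELM sketched above) and return its accuracy on the test fold.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [train_and_score(features[tr], labels[tr], features[te], labels[te])
              for tr, te in skf.split(features, labels)]
    return float(np.mean(scores))
```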

4.2 Ablation study

4.2.1 Influence of layer parameter

To determine the optimal multi-scale feature map for each dataset, we construct LP-DMI with different numbers of layers in a step-wise manner and perform experiments according to setup 1 on the three datasets. The experimental results for GP-DMI and LP-DMI with 2 to 6 layers are presented in Fig. 6. The first thing we notice is that the motion features are too coarse-grained to distinguish similar actions if the number of layers is inadequate.

Fig. 6

The recognition accuracy of LP-DMI with different layers

Conversely, if the number of layers is excessive, the static information becomes more redundant, which lowers both efficiency and accuracy. In addition, the experimental results illustrate that LP-DMI yields better recognition accuracy overall, achieving the highest recognition rate of 93.41% on the MSRAction3D dataset when the number of layers is 4. LP3-DMI is optimal on the UTD-MHAD and DHA datasets, with recognition rates of 85.12% and 91.94%, respectively. We abide by the optimal layer settings obtained here in the subsequent experiments.

4.2.2 Evaluation of different feature extraction strategies

Next, we compare several strategies of feature extraction and normalization following setup 1. The default feature extraction setting of Sec. 3 is not adopted in this experiment; instead, we use combinations of dynamic constraints in order to prevent the feature map and the HOG cell from becoming too small. In detail, \({D^{l}}_{v}(w, h, d)\) is the normalization parameter denoting that the sizes of LPl-DMIf, LPl-DMIs, and LPl-DMIt are w × h, h × d, and w × d. Constraint N1: \({D^{l}}_{v}(w/2^{l-1}, h/2^{l-1}, d/2^{l-1})\). Constraint N2: \({D^{l}}_{v}(w, h, d)\) = \({D^{l}}_{v}(160, 320, 240)\). Constraint C1: the cell size is 20 × 20. Constraint C2: the cell size is 20/2l− 1 × 20/2l− 1. These parameters determine the scale of the feature map and the granularity of the descriptor, and we report the experimental results in Tables 2 and 3. We observe that N2 combined with C1 outperforms the other strategies. In other words, the applied normalization method improves the classification accuracy. We show the recognition accuracy and average computation time of the LP-DMI-HOG descriptor and VGG-16 [46] in Table 4, confirming that LP-DMI-HOG is more efficient and more discriminative. Considering the tradeoff between precision and efficiency, we choose the HOG descriptor to extract motion features.

Table 2 The results of various normalization strategies on MSRAction3D dataset
Table 3 The results of various feature extraction strategies on UTD-MHAD
Table 4 The comparison of different descriptors

4.2.3 Effectiveness of multi-scale feature map LP-DMI

We evaluate the effectiveness of LP-DMI from two aspects. On the one hand, we show that LP-DMI is a more discriminative multi-scale feature map than GP-DMI. On the other hand, we verify that the LP-DMI-HOG extracted from LP-DMI surpasses HOG features based on other feature maps. For fairness, we follow the default feature extraction strategy on these depth maps to obtain HOG descriptors and employ ELM for action recognition. For the MSRAction3D dataset, we conduct the experiments following setup 2. The individual and average results on AS1, AS2, and AS3 are presented in Table 5, and the highest rate of each subset is shown in bold. As can be seen, LP-DMI achieves the highest average recognition rate in the three different tests and outperforms the other feature maps. Specifically, in test 1, LP-DMI achieves 90.42% accuracy over the three subsets. Apart from AS2, where it is slightly lower than DMM, LP-DMI has a clear advantage on the other two subsets. In the second test, our proposal exceeds the others significantly and obtains the best recognition rate of 98.63% on AS2. Furthermore, the ELM trained with LP-DMI-HOG even labels all the testing samples on AS3 correctly. Therefore, although the recognition rates of DMM and HP-DMM on AS1 equal that of our method, the average recognition rate we achieve is still 5% higher than theirs. In test 3, LP-DMI obtains an average recognition rate of 94.59%. The result of LP-DMI on AS1 is 0.95% lower than that of DMM, but the recognition rates on the other subsets are optimal. Overall, LP-DMI surpasses MEI, MHI, and GP5-DMI in all tests. Although DMM, HP-DMM, and DMI are superior to LP-DMI on individual subsets, the average recognition rate of our method is the highest. It should be noted that we improve accuracy by almost 4% in the three tests by constructing Laplacian pyramids of DMI, and this transformation is efficient and adds little computation time.

Table 5 Comparison with other feature maps on the MSRAction3D dataset (%)

On the UTD-MHAD and DHA datasets, we evaluate the proposed LP-DMI following setup 1 and report the results in Table 6.

Table 6 The recognition rates of depth feature maps on the UTD-MHAD and DHA datasets (%)

LP3-DMI yields the best recognition accuracy of 85.12% on UTD-MHAD. Once more, the experiments on the DHA dataset validate our method, with LP3-DMI producing a result of 91.94%. To elaborate the performance of our method clearly, the confusion matrices computed on the three datasets are depicted in Fig. 7. It can be seen that our method correctly recognizes the majority of actions. After analyzing the accuracy of specific classes, we find that errors mainly occur in the classification of similar actions, for example, skip and jump, front-box and arm-curl, and draw x and draw tick. In a word, this experiment further confirms that LP-DMI is a compact multi-scale feature map and that the proposed LP-DMI-HOG descriptor is promising.

Fig. 7

The confusion matrix of three datasets

In order to further verify the soundness of the multi-scale action representation, we additionally conduct a k-fold cross-validation experiment following setup 3. Figure 8a shows the recognition accuracy of the different feature maps on the three datasets, and Fig. 8b depicts the margin by which LP-DMI exceeds the others.

Fig. 8

K-Fold cross-validation results of three datasets

For the MSRAction3D dataset, LP-DMI achieves the highest recognition rate of 98.48%, with a small margin of 0.43% over GP-DMI. It should be noted that both of them exceed their base feature map DMI by more than 3%. Compared with the single-scale feature maps, LP-DMI improves the recognition accuracy by up to 8.27%. The experimental results on UTD-MHAD likewise demonstrate the advantages of LP-DMI, which is 4.57% and 3.8% higher than MEI and DMM, respectively. The scores of HP-DMM, DMI, and GP-DMI are close, and are 0.61% lower than LP-DMI on average. LP-DMI also achieves a promising result of 92.39% on the DHA dataset, which is 1.81% better than GP-DMI and even 9.49% higher than MEI. Compared with the plain DMI, the accuracy of LP-DMI is markedly improved, by 2.18%. In general, the multi-scale feature maps LP-DMI and GP-DMI are significantly superior to the other single-scale feature maps. The results further prove that increasing scale diversity enhances the discriminativeness of motion features and thus achieves higher recognition accuracy.

4.3 Comparisons to other state-of-the-art approaches

In this experiment, we follow setup 1, the same as the baseline methods, for a fair comparison. The cross-subject test is challenging due to variations in the same actions performed by different subjects, but our method still achieves high accuracy. As shown in Table 7, our method obtains a promising accuracy of 93.4% compared with other solutions utilizing single depth-modality data on the MSRAction3D dataset, which is 4.5% higher than DMM-GLAC, a method that likewise extracts a local feature descriptor from depth feature maps. HP-DMM-CNN, 3D-CNN, and the method in [61] using convolutional neural networks are 1.1%, 7.3%, and 6.3% lower than our method, respectively. It should be noted that the method proposed by Ji et al. [17] is 2.6% lower than ours, although they build a local spatio-temporal scaled pyramid and embed skeleton information. Furthermore, fusing the LP-DMI-HOG descriptor with HP-DMM-HOG, MJD-HOG, and GCN [29] by canonical correlation analysis (CCA) [13] yields recognition rates of 94.9%, 95.6%, and 94.5%, respectively.

Table 7 Comparison of our method with baseline methods on MSRAction3D dataset

We also demonstrate the generality of our framework on UTD-MHAD and report the results in Table 8. Our method obtains a recognition accuracy of 85.1%, which is 2.3% and 0.7% higher than HP-DMM-CNN and 3DHOT-MBC, respectively. The method proposed by Nguyen et al., which employs a hierarchical Gaussian descriptor, is 1% lower than ours. In addition, our approach surpasses other methods employing HOG descriptors. For example, LP-DMI is 3.6% higher than DMM-HOG and 11.4% higher than HP-DMM-HOG. With the same evaluation strategy, we also compare our system with depth-based and multi-modal feature fusion methods. The above experiments prove that our method is superior to other depth video based approaches and is able to achieve better performance through fusion techniques.

Table 8 Comparison of our method with baseline methods on UTD-MHAD

5 Conclusion

In this paper, we proposed a novel method based on the Laplacian pyramid that considers multi-scale information for human action recognition. We calculated LP-DMI to increase the scale diversity of depth motion images in order to capture multi-scale motion features and strengthen the more informative dynamic information. The experimental results demonstrate that LP-DMI is more compact and discriminative than existing feature maps. Furthermore, the extracted LP-DMI-HOG, which contains multi-granularity features, effectively improves the accuracy of action recognition. The experimental results on the MSRAction3D, UTD-MHAD, and DHA datasets show that our method outperforms the baseline methods. However, our method is still limited in identifying actions with similar motion trajectories. Future work will focus on fusing multimodal features and considering multi-scale temporal information to improve the recognition accuracy.