1 Introduction

From the early days of computer vision, human action analysis has been one of the most significant research topics, owing to its wide range of real-world applications in fields such as health and medicine, sports and recreation, content-based video search, robotics, and other systems that involve interactions between humans and electronic devices.

In past decades, research on human action recognition has mainly focused on recognizing actions from videos captured by traditional visible-light cameras. The recent advent of low-cost, easy-to-operate depth sensors (such as the Kinect) has prompted researchers to revisit problems such as activity recognition using depth images alongside color images. Compared with conventional RGB cameras, RGB-D cameras provide several advantages. First, depth cameras are insensitive to illumination changes (they can produce depth images even in total darkness). Second, more discriminative information (such as the 3D geometric structure of the scene) can be extracted from depth maps. Moreover, the 3D positions of skeletal joints can be estimated from depth images quickly and accurately. Although the estimated skeleton benefits activity recognition, some shortcomings limit its usage. For instance, the estimation is unreliable or even fails when the human is only partly in view or touches the background [50].

A central task in human action recognition is designing an efficient, compact, and robust video representation that holds up under several challenging conditions. First, depth sensors (such as the Kinect) usually generate noisy depth maps due to special reflectance materials, fast movements, and porous surfaces. Second, there are significant intra-class variations in human action recognition, caused by different execution rates, personal styles, and viewpoints. Third, overlaps among different action categories make characterizing inter-class dissimilarities very difficult.

Most recent approaches recognize actions by constructing a histogram of descriptors of spatio-temporal interest points (STIP) in videos. The results of these approaches have been promising on RGB datasets; however, extending these methods to depth images is not straightforward, since depth images are much noisier than RGB ones. For instance, undefined depth points appear as black regions on the surface of the human body in depth images, so many interest-point detectors falsely fire on these noisy regions [30]. In addition, almost all hand-crafted feature descriptors are unsupervised. As such, they are barely able to handle the inter- and intra-class variations of human action recognition.

To address the aforementioned challenges and design a more discriminative descriptor, in this paper SSTKDes is proposed for human action recognition from RGB-depth data. A general overview of the proposed method is depicted in Fig. 1. Note that to deal with noisy depth data, smoothing filters are first applied to the depth videos.

Fig. 1 General overview of the proposed method for action recognition

It has been shown that descriptors based on low-level pixel attributes work well on both RGB and depth images for object [5] and action recognition [22]. To properly characterize the spatio-temporal structure of actions and provide a discriminative descriptor, a rich primary pixel attribute needs to be extracted from the RGB-D videos. For action recognition, the descriptor should capture both shape and motion information. In this paper, the 3D (i.e., spatio-temporal) gradient is utilized as the primary attribute for both RGB and depth videos. Then, following the kernel descriptor proposed for visual recognition [5], the spatio-temporal attributes extracted from each pixel are transformed into a compact unsupervised kernel space.
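To make this primary attribute concrete, the following minimal sketch (an illustration under our own assumptions, not the authors' implementation) computes the 3D spatio-temporal gradient of a video volume with simple finite differences; its magnitude and orientation are the per-pixel quantities later fed into the kernel descriptor.

```python
# Minimal sketch: per-pixel 3D (spatio-temporal) gradient of a video volume.
# Function and variable names are illustrative assumptions.
import numpy as np

def gradient_3d(video):
    """video: float array of shape (T, H, W); returns magnitude and unit orientation."""
    gt, gy, gx = np.gradient(video.astype(float))   # derivatives along t, y, x
    mag = np.sqrt(gt**2 + gy**2 + gx**2)             # 3D gradient magnitude m(z)
    orient = np.stack([gt, gy, gx], axis=-1) / (mag[..., None] + 1e-8)  # orientation of z
    return mag, orient
```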

Moreover, an efficient approach for coping with intra-class variations and inter-class similarities is to compute projections that actively discriminate among classes; i.e., a supervised descriptor. Exploiting video labels when designing the feature descriptor can help the method achieve a more accurate, robust, efficient, and discriminative feature representation. Therefore, in the next step, a supervised strategy is utilized to transfer the non-linear video features into a more discriminative feature space, motivated by [43]. For this transformation, a combination of LMNN [47] and EMK [6] is utilized.

The goal of LMNN is to find a supervised transformation of the input space such that, in the new space, the k-nearest neighbors have matching labels while samples from other classes are separated by large margins. This margin criterion is then solved by a convex optimization based on the hinge loss. From another perspective, the problems of intra-class variations and inter-class similarities in human action recognition can be alleviated by transforming the data into a new space in which samples with the same label are closer to each other than those with different labels. The EMK is a kernel representation of the well-known bag-of-words method. It has been shown to produce more accurate quantization and to learn nonlinear correlations among body parts in human action recognition [22].

2 Related work

Various RGB-based action recognition methods have been published in the literature [1, 31, 48]. Most of them extract STIP [24] and use the distribution of local features, such as the histogram of optical flow (HOF) [25] and the histogram of gradients (HOG) [12], to represent spatio-temporal patterns. In recent years, human action recognition based on 3D perception data has grown rapidly. Based on the representation of the 3D data sequence, the methods can be divided into three main groups: skeleton data, point cloud, and depth map. Several surveys [2, 18, 56] have also been published in this regard.

In skeleton-based methods, the 3D positions of body joints are utilized for action recognition. The 3D locations of joints can be captured by multi-camera motion capture (MoCap) systems. Although such data are very accurate and almost free of noise, the capture is marker-based and therefore very expensive and difficult to produce. On the other hand, some approaches exploit the joint positions provided by Shotton et al. [34], which extracts the human skeleton from depth maps in real time. The features used in this group are based on 3D joint locations [10, 33], relative positions of joints with respect to a reference joint or to each other [42, 54], angles between connected parts [16, 36], velocities of joints [46, 54], and joint trajectories along the temporal dimension [17, 20, 37]. The body joints estimated from depth maps are quite accurate in experimental settings; however, their usage is limited to some special cases. Situations in which occlusion or self-occlusion occurs, a person touches the background, or a person is not in an upright position [50] make estimating the 3D positions of joints very difficult or even impossible.

A point cloud is a collection of points in a 3D coordinate system. It can be acquired fully (or partially) by 3D reconstruction methods from multiple views or from depth maps. The methods in the second group extract global or local features from the point cloud of the human body [3, 4]. The point cloud of the human body reveals important cues for recognizing actions. Although it can result in discriminative descriptors, representing actions based on the point cloud incurs a higher computational cost.

Using the point cloud of the human body from depth images, Jiang Wang et al. [41] proposed the random occupancy pattern (ROP). In this method, a sparse subset of the most discriminative sub-volumes, obtained from the whole 4D volume of the human body over the action period, is selected by an Elastic-Net regularization method. The depth motion map (DMM), used in several methods [11, 44, 55], utilizes the point cloud computed from the depth map. To produce the DMM, the point cloud of the human body is projected onto three orthogonal Cartesian views. Then, the global activity of the entire video sequence is accumulated on these planes. To classify actions, different features such as HOG [55] and local binary patterns (LBP) [11] have been extracted from the DMM. Vieira et al. [40] defined a 4D grid for a sequence of depth maps by dividing the space and time axes into multiple cells. These cells typically consist of points on the silhouette or moving parts of the body. Then, they enhanced the roles of sparse cells by using a saturation scheme. Oreifej and Liu [30] proposed the histogram of oriented 4D normals (HON4D) descriptor based on the distribution of 4D normal vectors in spatio-temporal cells of actions. Yang and Tian [53] proposed the super normal vector (SNV), a sparse dictionary-based method over low-level polynormals in which each polynormal is calculated by clustering hypersurface normal vectors in a spatio-temporal neighborhood.

The depth map-based methods (the third group) usually use local or global features extracted from a consecutive space-time volume. Xia and Aggarwal [50] extracted depth-based spatio-temporal interest points (DSTIP) using a response function of spatio-temporal filtering. Then, the depth cuboid similarity feature (DCSF) was proposed for describing the local 3D cuboid (x, y, t) with adaptable size centered at each DSTIP. Lu et al. [29] proposed a descriptor for depth maps that extends a binary feature descriptor used for RGB video [9]. After partitioning the depth maps into three layers (background, activity, and occlusion layers), features were extracted from spatio-temporal local 3D cubes in the activity layer of the depth sequences.

Due to its success in various classification tasks, many works have utilized deep-learned features for action recognition from RGB [38, 45], depth [44], and skeletal data [14]. The recurrent neural network (RNN) [14, 27], the 3D convolutional neural network (3DCNN) [49], and the 2D CNN fed with primary motion features (e.g., DMM) [44] are the most widely used architectures. Among these, 3DCNN and RNN can deal with temporal information. The 3D convolution and 3D pooling layers of 3DCNN models capture discriminative features along both spatial and temporal dimensions. The RNN takes temporal data into account using recurrent connections in hidden layers. The original CNN can only cope with images, not videos. Simonyan and Zisserman [35] randomly sampled a fixed number of frames from a video and then applied a CNN to every individual frame. Finally, they used the average score of the selected frames for classification. In another work, Yu et al. [57] extracted CNN features for all frames of a video and then applied pooling on the frame-level features to obtain video-level features.

Action recognition has not gained as high a performance from deep networks as other research areas (such as image classification) [15]. This may be related to the fact that network performance depends on a large number of weights that have to be learnt from large annotated datasets (such as ImageNet), which are not currently available for action recognition. In addition, such enormous amounts of data cannot be provided by many real-world problems [21]. Therefore, there is a need for methods that can achieve high performance with a small amount of data.

In some cases, to cope with data from different sources in human action recognition, multiple kernel learning (MKL) is used to combine kernels established for each individual data source. Xiao et al. [52] used MKL to select the most discriminative kernels in a composite kernel function by assigning a weight to each kernel. In that work, the authors took advantage of Bacterial Chemotaxis (BC) and the Powell optimization method to find the weight of each kernel; the strong local optimization ability of the Powell method is adopted to improve the local search precision of BC.

Bayesian logistic regression is another framework for classification. For more flexibility, variational transformations are used to approximate the likelihood function with a simpler, tractable exponential form by introducing extra variables known as variational parameters. To handle regression over several classes, variational Bayesian multinomial logistic regression has been proposed [19].

Among the available algorithms for action recognition, only deep models use labels when producing features. To the best of our knowledge, all hand-crafted descriptors proposed for human action recognition are unsupervised. In object recognition, it has been demonstrated that supervised techniques (such as linear discriminant analysis (LDA)) improve the results of the scale-invariant feature transform (SIFT) [8].

The kernel descriptor (KDES) utilizes unsupervised kernel principal component analysis (KPCA) to learn a compact descriptor. A 3D extension of KDES was proposed for action recognition [22] and achieved superior performance on RGB-D datasets. Wang et al. [43] proposed a supervised extension of the kernel descriptor, called the supervised kernel descriptor (SKDES), for object recognition from RGB data. They took advantage of KDES to design a low-level feature descriptor.

In this paper, SSTKDes is proposed for human action recognition using RGB-D data. The low-level attributes are spatio-temporal features extracted from RGB and depth videos. The attribute space is then transferred into a new space by a non-linear compact kernel-based transformation using a supervised process. Finally, the SVM classifier is applied on the final feature descriptor.

3 Proposed method

The aim of this paper is to design a global descriptor for human action recognition from RGB-D data. An overview of the proposed method is illustrated in Fig. 1. The input data is a sequence of RGB-D images in which a single person is in the scene performing one action. The background in these datasets can be plain or textured (in some datasets the background has been subtracted).

First, hierarchy levels are defined for each video. The three-level hierarchy of a video is shown in Fig. 2. The first level is the action volume, which covers all spatio-temporal dimensions of the input sequence. The action volume is divided into sub-volumes, called parts. Each part is then divided into smaller units, called 3D blobs. Each 3D blob contains cubic spatio-temporal data (i.e., the pixel level) from the RGB or depth video.

Fig. 2 Hierarchy levels of video

The contributions of this work are first explained in Section 3.1. The denoising step is then presented in Section 3.2. Next, Section 3.3 explains how the raw attributes of all pixels are transformed into an unsupervised kernel feature space. Finally, details on the supervised video descriptor are provided in Section 3.4.

3.1 Contributions of this work

This work differs from the existing approaches as follows. The KDES [5] and CKSVD (EMK) [6] methods are unsupervised kernel descriptors proposed for object recognition. The HKDES [22] is a 3D hierarchical extension of KDES for human action recognition. Those methods are based on unsupervised kernel features, whereas here a supervised kernel descriptor is proposed. Moreover, LMNN [47] is a supervised strategy that has been used for image, speech, and text classification; in the proposed method, LMNN is used for action recognition. The proposed method is particularly designed to extract rich information from both RGB and depth data for human action recognition, while the supervised kernel method in [43] is designed for object recognition from RGB data only.

The key contributions of this work can be summarized as follows:

  • A novel hierarchical framework for feature extraction from video is proposed. This structure is capable of discarding irrelevant details while preserving the task-relevant important features.

  • To reduce the effect of two kinds of noise (i.e., small variations of the sensing device and undefined depth points) in (not highly noisy) depth maps, a spatio-temporal smoothing function is used.

  • First, a dictionary (generated through EMK) is exploited to encode the features. Then, LMNN is utilized as a supervised learning scheme to create a margin of safety around the kNN decision boundaries that separate videos with different labels.

  • The proposed method is evaluated on five public action and gesture recognition datasets, and achieves state-of-the-art results on four of them.

3.2 Denoising

Xia and Aggarwal [50] divided the noise in depth videos into three categories: noise from variations in the sensing device, noise on the boundaries of agents, and holes caused by special reflectance materials, fast movements, porous surfaces, and other random effects. The first group can be reasonably removed by smoothing functions consisting of two separate spatial and temporal filters. Following [50], a Gaussian filter is utilized as the spatial smoothing function; it is a low-pass filter that removes high-frequency components from depth images. Mathematically, Gaussian smoothing is obtained by convolving the input signal (the depth image) with a Gaussian function, as

$$\begin{array}{@{}rcl@{}} d_{s} &=& d(x,y,t)*g(x,y;\sigma) \\ g(x,y;\sigma) &=& \frac{1}{2\pi\sigma^{2}}\ \exp \left( -\frac{x^{2}+y^{2}}{2\sigma^{2}}\right) \end{array} $$
(1)

in which d(x, y, t) is the input depth map at time t, g is the 2D Gaussian function with standard deviation σ (which controls the spatial scale), and \(d_{s}\) is the smoothed depth image. Also, the 1D complex Gabor filter is defined as

$$\begin{array}{@{}rcl@{}} d_{st} &=& (d_{s} (x,y,t)*h_{ev})^{2} + (d_{s} (x,y,t)*h_{od})^{2}\\ h_{ev} (t;\tau,\omega) &=& \cos (2\pi\omega t)\ \exp (-t^{2}/\tau^{2} )\\ h_{od} (t;\tau,\omega) &=& \sin (2\pi\omega t)\ \exp (-t^{2}/\tau^{2} ) \end{array} $$
(2)

where τ controls the temporal scale and ω is usually set to a constant value related to τ (e.g., \(\frac {0.6}{\tau }\)). In the rest of this paper, d denotes the smoothed \(d_{st}\).
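As a rough illustration of this denoising step, the sketch below applies Eq. (1) per frame and the temporal quadrature pair of Eq. (2) along the time axis; the SciPy calls and the choice of kernel support are our own assumptions, not the authors' exact implementation.

```python
# Sketch of the denoising step (Eqs. 1-2): spatial Gaussian smoothing per frame,
# then a 1D temporal complex Gabor (quadrature) filter with omega = 0.6 / tau.
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def denoise_depth(video, sigma=1.5, tau=1.5):
    """video: array of shape (T, H, W) holding a depth sequence."""
    video = np.asarray(video, dtype=float)
    # Eq. (1): 2D Gaussian smoothing of each depth frame.
    d_s = np.stack([gaussian_filter(frame, sigma=sigma) for frame in video])

    # Eq. (2): even/odd temporal Gabor filters (support chosen as +/- 3*tau).
    omega = 0.6 / tau
    t = np.arange(-int(3 * tau), int(3 * tau) + 1, dtype=float)
    envelope = np.exp(-t**2 / tau**2)
    h_ev = np.cos(2 * np.pi * omega * t) * envelope
    h_od = np.sin(2 * np.pi * omega * t) * envelope

    ev = convolve1d(d_s, h_ev, axis=0, mode='nearest')   # convolution along time
    od = convolve1d(d_s, h_od, axis=0, mode='nearest')
    return ev**2 + od**2                                 # d_st, the smoothed sequence d
```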

In general, smoothing filters set the value of each pixel to a weighted average of itself and its neighbors, bringing each pixel into closer harmony with its neighborhood. From another perspective, noisy pixels with significantly higher or lower intensity than their surroundings are smoothed out. Consequently, this step can approximately remove small variations of the sensing device and holes in (not highly noisy) depth maps. Xia and Aggarwal [50] defined a measure of temporal value variation (i.e., a correction function) for each pixel to remove the second and third groups of noise (namely, boundaries of agents and holes) in highly noisy depth maps. In this paper, however, this function is not necessary. Their work is an STIP-based method, in which a response function is used to find interest points. Such functions usually take large values at noisy pixels, so noisy points might be falsely selected as interest points. The proposed descriptor, in contrast, is not based on interest points; it uses a pooling function over the attributes of all pixels in a video. Therefore, all points have the same weight when calculating the descriptor, which reduces the effect of noise in the final descriptor.

3.3 Unsupervised kernel space

Some methods (e.g., SIFT and HOG) compute the histogram of orientation-based attributes of all pixels in small windows as the feature descriptor, quantizing the individual pixel attribute values into bins. The quantization error obviously decreases the accuracy of those methods. To overcome this problem, KDES [5] was proposed to generate rich feature descriptors from pixel attributes for object recognition. Compared to SIFT and HOG, it can capture more descriptive information lying in a high-dimensional space.

By using the Euclidean distance to measure the dissimilarity between two 3D blobs, as

$$\begin{array}{@{}rcl@{}} Dist(B_{1},B_{2} )&=& (F(B_{1} )- F(B_{2} ))^{T} (F(B_{1} )- F(B_{2} ))\\ &=&2- 2 \ {F(B_{1})}^{T} F(B_{2}) \end{array} $$
(3)

it can be shown that the kernel view of two 3D blobs is directly proportional to their similarity. In [6], to compute the similarity between two blobs of two images, a match kernel is utilized, i.e., a kernel function that averages the continuous similarities among all pixel attributes in the two blobs. In this paper, a 3D extension of the kernel representation of the orientation histogram, in an unsupervised space, is used as the first step of the feature extraction process. The low-level pixel attribute exploited here is the 3D gradient for both RGB and depth videos, which can capture shape changes along both the spatial and temporal dimensions. Following the formulation in [5], the match kernel between two 3D blobs of two action videos is calculated by

$$ K_{3D}(B_{1},B_{2}) = \sum\limits_{z \in B_{1}}\sum\limits_{z^{\prime} \in B_{2}} \widetilde{m}(z) \ \widetilde{m}(z^{\prime}) \ k_{o}(\widetilde{\theta}_{z},\widetilde{\theta}_{z^{\prime}}) \ k_{p}(z,z^{\prime}) $$
(4)

where \(B_{1}\) and \(B_{2}\) are two 3D blobs of two actions, \(\widetilde {m}(z)= m(z)/\sqrt {{\sum }_{z\in P} m(z)^{2}+\epsilon }\) is the normalized 3D gradient magnitude of z, \(\widetilde {\theta }_{z}\) is the orientation of the 3D gradient at z, \(k_{o}(z,z^{\prime })=\exp \left (-\gamma _{o}\ \|\widetilde {\theta }_{z}-\widetilde {\theta }_{z^{\prime }}\|^{2} \right )\) is the Gaussian kernel over the orientations of two pixels (which computes the similarity of these orientations), and \(k_{p}(z,z^{\prime })=\exp (-\gamma _{p}\ \|z-{z^{\prime }}\|^{2})\) is the Gaussian kernel over the 3D positions of pixels within a 3D blob (i.e., it measures the spatio-temporal closeness of two pixels). By decomposing each Gaussian kernel into the inner product of two functions as \(k_{o}(z,z^{\prime })= \phi _{o}({\widetilde {\theta }_{z}})^{T} \ \phi _{o}({\widetilde {\theta }_{z^{\prime }}})\) and \(k_{p}(z,z^{\prime }) = \phi _{p}(z)^{T} \ \phi _{p}(z^{\prime})\), the feature representation for one 3D blob can be derived as

$$ F_{3D}(B_{1}) = \sum\limits_{z \in B_{1}} \widetilde{m}(z) \ \phi_{o}(\widetilde{\theta}_{z}) \otimes \phi_{p}(z) $$
(5)

in which ⊗ is the Kronecker product. Since Gaussian kernels are used, the dimension of \(F_{3D}\) is infinite. For computational efficiency and representational convenience, Bo et al. [5] provided a method to learn compact low-dimensional features from match kernels. To do so, a sufficient set of basis vectors, \(\{x_{i}\}_{i=1}^{d_{o}}\) and \(\{y_{j}\}_{j=1}^{d_{p}}\), is first sampled uniformly and densely from the support regions of the orientation and position, respectively. Then, the size of the joint basis is reduced using KPCA. The final 3D kernel feature is thus obtained by

$$ F^{u}_{3D}(B_{1}) = \sum\limits_{i=1}^{d_{o}}\sum\limits_{j=1}^{d_{p}} \alpha^{u}_{ij} \left\{\sum\limits_{z \in B_{1}} \widetilde{m}(z) \ k_{o}(\widetilde{\theta_{z}},x_{i}) k_{p}(z,y_{j})\right\} $$
(6)

where \(\{x_{i}\}_{i=1}^{d_{o}}\) and \(\{y_{j}\}_{j=1}^{d_{p}}\) are the basis vectors sampled from the support regions of the orientation and position features, respectively, \(d_{o}\) and \(d_{p}\) are the sizes of these basis sets, and \(\alpha ^{u}_{ij}\) is the u-th (u = 1, ..., \(d_{o} \times d_{p}\)) projection coefficient computed by applying KPCA to the joint basis vectors \(\left \{\phi _{o}(x_{1})\otimes \phi _{p}(y_{1}), ..., \phi _{o}(x_{d_{o}})\otimes \phi _{p}(y_{d_{p}})\right \}\).
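The sketch below illustrates Eq. (6) for a single 3D blob: the inner sum accumulates the kernel responses of every pixel against the sampled orientation and position basis vectors, and the result is projected onto the KPCA components. The attribute layout, the basis vectors, and the coefficient matrix are assumed to be given; all names are illustrative, not the authors' code.

```python
# Hedged sketch of the compact unsupervised kernel feature of one 3D blob (Eq. 6).
import numpy as np

def blob_kernel_feature(theta, m, pos, alpha, X_o, Y_p,
                        gamma_o=5.0, gamma_p=3.0, eps=1e-8):
    """theta: (N, D) 3D-gradient orientations; m: (N,) magnitudes; pos: (N, 3) positions.
    X_o: (d_o, D) orientation basis; Y_p: (d_p, 3) position basis;
    alpha: (n_components, d_o * d_p) KPCA projection coefficients."""
    m_tilde = m / np.sqrt(np.sum(m**2) + eps)                        # normalized magnitudes

    # Gaussian kernels between every pixel and every basis vector.
    k_o = np.exp(-gamma_o * np.linalg.norm(theta[:, None] - X_o[None], axis=-1)**2)  # (N, d_o)
    k_p = np.exp(-gamma_p * np.linalg.norm(pos[:, None] - Y_p[None], axis=-1)**2)    # (N, d_p)

    # Inner sum of Eq. (6), flattened over the joint (i, j) basis index.
    K = np.einsum('n,ni,nj->ij', m_tilde, k_o, k_p).ravel()          # (d_o * d_p,)

    # Projection onto the KPCA components gives the compact feature F^u_3D.
    return alpha @ K
```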

3.4 Supervised kernel descriptor

By rewriting the 3D blob features obtained from KDES in vector form, we get F(B) = A^T K, where \(A=\left [ \boldsymbol {\alpha }^{1}, ...,\boldsymbol {\alpha }^{d_{o} \times d_{p}}\right ]\), each \(\boldsymbol {\alpha }^{u}=[{\alpha ^{u}_{1}}, ..., \alpha ^{u}_{d_{o}\times d_{p}}]^{T} \) denotes the u-th principal component (kernel transform) in (6), and \(K={\sum }_{z \in B} \ \widetilde {m}(z) \left [k_{o}(\widetilde {\theta }_{z},x_{1}) k_{p}(z,y_{1}), ..., k_{o}(\widetilde {\theta }_{z},x_{d_{o}}) k_{p}(z,y_{d_{p}})\right ]^{T}\). Note that the KPCA used in the previous section is an unsupervised transform, while here a supervised spatio-temporal kernel-based method is used to learn \(\alpha ^{u}_{ij}\), transferring the non-linear video features into a more discriminative space. The proposed SSTKDes is depicted in Fig. 3. Each step is explained in the following sub-sections.

Fig. 3 Supervised spatio-temporal kernel descriptor (SSTKDes) algorithm

3.4.1 Hierarchy of feature levels

Features of the action volume are obtained by concatenating the features of its parts, where each part feature is formed by average pooling of the encoded 3D blob features within it. The feature of each 3D blob is calculated by sum pooling over the pixels it contains.

Pooling efficiently maps the feature representation into a new space in which irrelevant details are discarded while task-relevant important features are preserved [7]. Consequently, it significantly reduces the computational complexity and makes the representation compact. It is also used to achieve robustness to noise and invariance to the speed of the action. Sum pooling of features over a local neighborhood (i.e., a 3D blob) reduces the effect of noisy features. In addition, average pooling over the 3D blobs contained in each part can make the method invariant to the speed of the action, to a certain extent. For the same action performed at different speeds, the entire action volume is divided into the same number of parts. If there is no significant speed difference, corresponding parts contain almost the same information, but at different speeds. From another viewpoint, different speeds correspond to different numbers of frames in each part. Thus, average pooling in each part yields almost the same information for corresponding parts at different speeds.

To make the quantization process more accurate, the features of the 3D blobs are encoded using a dictionary learnt through constrained singular value decomposition in kernel feature space (CKSVD) [6] over the 3D blob features. Thus, the part feature, which is a pooling of the encoded blob features, is

$$ F_{P_{s}} = Pool^{|P_{s}|}_{b=1} \ g(C,A^{T} k_{bs}) $$
(7)

where \(P_{s}\) is the s-th part of the action volume, \(|P_{s}|\) is the number of 3D blobs contained in the s-th part, \(F_{P_{s}}\) is the feature vector of the s-th part of the action volume, g is the encoding function, Pool is a pooling operator, \(k_{bs}\) is the kernel feature of the b-th 3D blob contained in the s-th part, and C is the dictionary. In this paper, average pooling is used as the Pool operator and ridge regression is used as the encoding function g. The dictionary C can be considered as \(C = \left (A_{D \times D}^{T} F_{D \times \widetilde {b}}\right ){Z_{n \times \widetilde {b}}}^{T}\), where A is the matrix that transfers features to the unsupervised space (defined in Section 3.4), F contains a set of \(\widetilde {b}\) 3D blob-level kernel features sampled over all 3D blob features, and Z is a matrix that transforms features to the dictionary space.

Therefore, if the encoded feature vector is \(g(C, A^{T} k_{bs}) = c_{bs}\), then \(F_{P_{s}}=\frac {1}{|P_{s}|} {\sum }_{b=1}^{|P_{s}|} c_{bs}\). Since ridge regression is used to encode the 3D blob, the code \(c_{bs}\) has the closed-form solution

$$\begin{array}{@{}rcl@{}} c_{bs}^{\ast} &=& \arg \min\limits_{c_{bs}} \|A^{T} k_{bs} - Cc_{bs}\|_{2}^{2} + \mu \|c_{bs}\|_{2}^{2} \\ &=& (C^{T} C+\mu I)^{-1} C^{T} (A^{T} k_{bs})\\ &=& (Z F^{T} A A^{T} F Z^{T}+\mu I)^{-1} Z F^{T} A A^{T} k_{bs}. \end{array} $$
(8)

By setting μ > 0, \(C^{T} C+\mu I\) is a positive definite matrix. As such, for each action volume, the feature vector \(F_{V}\) is obtained by

$$ F_{V} = \cup^{PN}_{s=1} \left[F_{p_{s}} \right] = \cup^{PN}_{s=1} \left[\frac{1}{|P_{s}|}\sum\limits_{b=1}^{|P_{s}|}c_{bs}\right] $$
(9)

where PN is the number of parts included in the action volume and ∪ denotes a concatenation operator.
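A brief sketch of this hierarchy, under our own naming assumptions: each 3D blob feature is encoded against the dictionary with the closed-form ridge solution of Eq. (8), the codes are average-pooled per part (Eq. (7)), and the part features are concatenated into the video feature of Eq. (9).

```python
# Sketch of the hierarchy of feature levels (Eqs. 7-9); names are illustrative.
import numpy as np

def encode_blob(C, f, mu=0.1):
    """Ridge-regression code of one blob feature f (= A^T k_bs) against dictionary C."""
    # Eq. (8): c* = (C^T C + mu I)^-1 C^T f; mu > 0 keeps the system positive definite.
    G = C.T @ C + mu * np.eye(C.shape[1])
    return np.linalg.solve(G, C.T @ f)

def video_feature(parts, C, mu=0.1):
    """parts: list over P_s, each a list of blob features; returns F_V of Eq. (9)."""
    part_features = []
    for blobs in parts:                                    # the s-th part P_s
        codes = [encode_blob(C, f, mu) for f in blobs]     # encoded blob features c_bs
        part_features.append(np.mean(codes, axis=0))       # average pooling, Eq. (7)
    return np.concatenate(part_features)                   # concatenation over PN parts
```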

3.4.2 Supervised learning plan

The goal of this step is to find a compact kernel-based transformation in which a margin of safety is created around the kNN decision boundaries that separate samples with different labels, so that the k-nearest neighbors always belong to the same class. In other words, the inter-class distance is maximized and the intra-class variation is minimized. To reach this goal, a supervised strategy, LMNN [47], is used.

For each sample of the action volume \(v_{i}\), two kinds of neighbors are identified. Target neighbors, denoted \(\mathcal {TN}(v_{i})\), are the \(k_{p}\) nearest neighbors of \(v_{i}\) with the same label. Target neighbors are desired to be the closest neighbors of \(v_{i}\) in the new space. In the current space, there might be some (\(k_{n}\)) differently labeled samples that are closer to the sample than the targets; these are called impostors and denoted \(\mathcal {IM}(v_{i})\). The aim of LMNN is to transform the feature space in such a way that the number of impostors is minimized. This is done by enlarging the safety margin around the kNN decision boundaries.

The loss function proposed for LMNN has two terms. One term penalizes the distance between each sample and its target neighbors, pulling them closer to the sample (i.e., decreasing intra-class variations). The other term penalizes small distances between each sample and its impostors, pushing them further away (i.e., increasing inter-class distances). In [43], to avoid overfitting and to make the descriptor compact, a rank regularization is added to the loss function. The loss function used in this paper is

$$\begin{array}{@{}rcl@{}} \min\limits_{A} E(A)&= &\sum\limits_{v_{i} \in train}\left( \lambda \sum\limits_{v_{j} \in {\mathcal{TN}(v_{i})}} d_{ij} + \sum\limits_{v_{l} \in {\mathcal{IM}(v_{i})}} \sum\limits_{v_{j} \in {\mathcal{TN}(v_{i})}} \left[1+d_{ij}-d_{il}\right]_{+}\right) \\ &&+\lambda_{\ast} {\|A\|}_{\ast} \end{array} $$
(10)

in which \(v_{i}\) is one sample of the training data, \(d_{ij}\) and \(d_{il}\) are the distances between a sample and its target neighbors and its impostors, respectively, \([d]_{+} = \max \{d,0\}\) is the hinge loss function, and \(\|A\|_{\ast}\) is the nuclear norm of matrix A, which is a convex surrogate of rank(A). By substituting the feature vector of the action volume (9) in \(d_{ij} = \|F_{V_{i}}-F_{V_{j}} \|^{2}_{2}= (F_{V_{i}}-F_{V_{j}})^{T} (F_{V_{i}}-F_{V_{j}})\), we get

$$\begin{array}{@{}rcl@{}} d_{ij} &=& \sum\limits_{s=1}^{PN}\left( k_{si}-k_{sj}\right)^{T} \textbf{L} \textbf{L}^{T} \left( k_{si}-k_{sj}\right)\\ \textbf{L} &=& A A^{T} F Z^{T} (Z F^{T} A A^{T} F Z^{T} + \mu I)^{-1} \\ k_{si} &=& \frac{1}{|P_{s}|} \sum\limits_{b=1}^{|P_{s}|} k_{bs} \end{array} $$
(11)

in which \(k_{bs}\) are the kernel features of the b-th 3D blob within the s-th part of the i-th video, and \(k_{si}\) are the kernel features of the s-th part of the i-th video. For \(\textbf{M} = \textbf{L} \textbf{L}^{T}\), it can be shown [43] that rank(M) = rank(A). Thus, the minimization can be performed over the convex cone of positive semi-definite matrices M. Therefore, the final convex version of the loss function, now a function of M, is

$$\begin{array}{@{}rcl@{}} \min_{\textbf{M}} E(\textbf{M}) &= &\sum\limits_{v_{i} \in train}\left( \lambda \sum\limits_{v_{j} \in {\mathcal{TN}(v_{i})}} d_{ij} + \sum\limits_{v_{l} \in {\mathcal{IM}(v_{i})}} \sum\limits_{v_{j} \in {\mathcal{TN}(v_{i})}} \left[1+d_{ij}-d_{il}\right]_{+}\right)\\ &&+ \lambda_{\ast} \text{Tr}(\textbf{M}) \\ &&s.t. \ \textbf{M} \succeq 0 \ and \ d_{ij} = \sum\limits_{s=1}^{PN} (k_{si}-k_{sj})^{T} \textbf{M} (k_{si}-k_{sj}) \end{array} $$
(12)

in which M is a positive semi-definite matrix. This optimization can be solved using gradient-based algorithms. The complexity of each iteration is of order \(O(N D_{v} k_{p} k_{n})\), in which N is the size of the training data, \(D_{v}\) is the dimensionality of each video, and \(k_{p}\) and \(k_{n}\) are the average numbers of target neighbors and impostors, respectively. Since performing a general batch optimization with this complexity over the whole training data is difficult, a stochastic optimization is used to optimize the loss function. Specifically, regularized dual averaging (RDA) is used, which is generic and applicable to non-smooth losses (such as the hinge loss). More details on the optimization process can be found in [43].
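For illustration, the sketch below evaluates the loss of Eq. (12) for a given positive semi-definite M, assuming the part-level kernel features, target-neighbor lists, and impostor lists have already been computed. It is a batch evaluation for clarity only; as noted above, the actual optimization is carried out stochastically with RDA.

```python
# Hedged sketch: evaluating the LMNN loss of Eq. (12) for a fixed matrix M.
import numpy as np

def pair_distance(K_i, K_j, M):
    """d_ij of Eq. (11): sum over parts of (k_si - k_sj)^T M (k_si - k_sj)."""
    diff = K_i - K_j                                       # shape (PN, dim)
    return float(np.einsum('sd,de,se->', diff, M, diff))

def lmnn_loss(K, targets, impostors, M, lam=0.5, lam_star=0.01):
    """K: (N, PN, dim) part-level kernel features; targets/impostors: index lists per video."""
    loss = 0.0
    for i in range(len(K)):
        for j in targets[i]:                               # target neighbors of video i
            d_ij = pair_distance(K[i], K[j], M)
            loss += lam * d_ij                             # pull term
            for l in impostors[i]:                         # impostors of video i
                d_il = pair_distance(K[i], K[l], M)
                loss += max(0.0, 1.0 + d_ij - d_il)        # hinge push term
    return loss + lam_star * np.trace(M)                   # Tr(M) regularizer
```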

3.4.3 Second kernel

If \(c_{i}\) and \(c_{j}\) are two encoded 3D blob features, then by applying a second kernel function (i.e., a radial basis function (RBF)) between a pair of encoded features, \(k_{M} (c_{i},c_{j} )= \exp (-\gamma _{m} (c_{i}-c_{j} )^{T} (c_{i}-c_{j} ))\) is obtained. By substituting the values of \(c_{i}\) and \(c_{j}\) from (8) into the second kernel function, \(k_{M}\) becomes

$$ k_{M} (c_{i},c_{j} )= \exp (-\gamma_{m} (k_{i}-k_{j} )^{T} \textbf{M}(k_{i}-k_{j} )) $$
(13)

in which \(\gamma_{m}\) is the kernel parameter. Thus, the final features of each action volume can be calculated using M, and there is no need to decompose M to obtain A. Using this kernel and following the EMK formulation [6], the final feature vector for each action volume is \(F_{V} = \cup ^{PN}_{s=1} \left [\frac {1}{|P_{s}|}\textbf {G} {\sum }_{m \in P_{s}} k_{M} (c_{M},C) \right ]\), where C is the dictionary constructed by EMK [6] and G is calculated using \(\textbf {G}^{T} \textbf {G}=\left (k_{M} (C,C)\right )^{-1}\). The process of calculating these features based on the learned dictionary is explained in detail in [6]. Finally, a linear SVM is applied for classification.
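As a rough illustration of this final step, the sketch below evaluates the second kernel of Eq. (13) between blob-level kernel features and the EMK dictionary atoms (treated here as living in the same space, which is our simplifying assumption), pools the responses per part, whitens them with G, and concatenates the parts into the vector passed to the linear SVM; names and shapes are assumptions rather than the authors' code.

```python
# Hedged sketch of the second kernel and the final video feature.
import numpy as np
from scipy.linalg import cholesky, inv

def k_M(Ka, Kb, M, gamma_m=1e-4):
    """Pairwise kernel of Eq. (13) between the rows of Ka and the rows of Kb."""
    diff = Ka[:, None, :] - Kb[None, :, :]
    return np.exp(-gamma_m * np.einsum('abd,de,abe->ab', diff, M, diff))

def final_video_feature(parts, C, M, gamma_m=1e-4):
    """parts: list of (|P_s|, dim) arrays of blob features; C: (n_words, dim) EMK dictionary."""
    G = cholesky(inv(k_M(C, C, M, gamma_m)))               # upper-triangular G with G^T G = k_M(C,C)^-1
    feats = []
    for P in parts:
        pooled = k_M(P, C, M, gamma_m).mean(axis=0)        # average kernel response over the part
        feats.append(G @ pooled)
    return np.concatenate(feats)                           # fed to the linear SVM
```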

4 Experimental results

The proposed action recognition method is evaluated on five public RGB-D benchmark datasets: the MSR Action 3D dataset [26], the MSR Gesture 3D dataset [41], the MSR Action Pairs dataset [30], the MSR Daily Activity 3D dataset [42], and UT Kinect [50]. The algorithm is also compared with state-of-the-art methods for action recognition from RGB-D data. The empirical results show that the proposed method outperforms the other methods.

4.1 Parameter setting

The parameters to be set for the denoising step are σ in the 2D Gaussian smoothing and τ in the 1D Gabor filtering, applied spatially and temporally, respectively. In this work, σ = 1.5, 2.5, and 3.5 and τ = 1.5 and 2.5 are tested. Figure 4 presents the final accuracy for different values of σ and τ. The experiment without smoothing is shown with the value 0 for both σ and τ. The smoothing functions increase the accuracy of SSTKDes, especially for the MSR Daily Activity 3D dataset, which is the noisiest one. The final accuracy is also robust to the different values used for the smoothing parameters.

Fig. 4 a Accuracy vs. σ, b Accuracy vs. τ

The kernel parameters \(\gamma_{o}\) and \(\gamma_{p}\) for the orientation and position kernels are set to 5 and 3, respectively, as in the original kernel descriptor paper [5]. To handle the computational cost (as in [22]), each video is resized such that its frames are no larger than 150 × 150. The effect of the 3D blob size along the three dimensions on the final accuracy was tested by running the algorithm with different values. Notably, changing the 3D blob size from 5 × 5 × 5 to 20 × 20 × 20 produced no significant change in the final accuracy. However, very large 3D blobs (e.g., spanning the whole frame) reduce the accuracy: since the features are pooled over all pixels in a 3D blob, the spatial and temporal order of motions in the video is lost. To allow a fairer comparison, 3D blobs of size 16 × 16 × 16 with 50% overlap between neighbors are selected from the video (as in [22]). In addition, as the number of 3D blobs is too large for constructing the dictionary, dense sampling is used.
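As a small illustration of the sampling described above, the helper below (a hypothetical utility, not the authors' code) enumerates the origins of 16 × 16 × 16 blobs with 50% overlap over a video volume.

```python
# Illustrative helper: dense 3D blob sampling with 50% overlap (stride 8).
def blob_origins(video_shape, blob=16, overlap=0.5):
    """Yield the (t, y, x) origin of every 3D blob in a (T, H, W) video (T, H, W >= blob)."""
    stride = max(1, int(blob * (1 - overlap)))
    T, H, W = video_shape
    for t in range(0, T - blob + 1, stride):
        for y in range(0, H - blob + 1, stride):
            for x in range(0, W - blob + 1, stride):
                yield (t, y, x)
```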

It is worth mentioning that since the temporal resolution of the benchmark datasets is considerably smaller than the spatial resolution, it was hypothesized that increasing the temporal resolution might affect the final accuracy. Therefore, in one experiment, the temporal resolution was increased by bilinear temporal interpolation, but there was no improvement in the final accuracy. This may be because simple bilinear interpolation does not have a significant effect on the 3D gradients used as the low-level features; in other words, it can only smooth them. As future work, the effect of more sophisticated interpolation methods, particularly learning-based ones, can be tested.

In the video hierarchy step, each video was divided into 1 × 1 × 1, 2 × 2 × 2, 3 × 3 × 3, 4 × 4 × 4, and 5 × 5 × 5 parts. Figure 5 shows the overall accuracy of SSTKDes on all datasets with different numbers of parts. There is an optimal number of parts (i.e., 4 × 4 × 4): by increasing the number of parts from 1 × 1 × 1 to 4 × 4 × 4 the accuracy of SSTKDes significantly increases, and after that it does not change. The concatenation of parts lets the descriptor preserve the general spatio-temporal order of actions, while the average pooling in each part removes redundant data. For instance, in the Action Pairs dataset, the actions pick up and put down have the same motion and shape but with inverse temporal order. In Fig. 5, it can be seen that by increasing the number of parts, the accuracy on this dataset notably increases.

Fig. 5 Accuracy of actions vs. part numbers

The LMNN also has some parameters to be set. In (12), the number of target neighbors is set to \(k_{p} = 4\). The parameters λ and \(\lambda_{\ast}\) are set experimentally to 0.5 and 0.01, respectively.

One of the parameters with the greatest effect on the final accuracy is \(\gamma_{m}\), which is used to calculate the final feature vector in the second kernel. To empirically study the performance of the proposed method, SSTKDes is trained with different values of \(\gamma_{m}\). The accuracy for values of this parameter over the range 1 to 0.00001 is shown in Fig. 6a; the value 0.0001 is optimal for all datasets. In (13), the exponent consists of \(\gamma_{m}\) and another factor; hence, although \(\gamma_{m}\) is close to zero, the exponent is not a very small number.

Fig. 6 a Accuracy vs. \(\gamma_{m}\), b Accuracy vs. codebook size

The dictionary used in this method is trained with multiple codebook sizes. Figure 6b depicts the accuracies on all datasets for different codebook sizes. For every dataset, the accuracy first increases sharply with the number of words in the dictionary and then stops changing; codebook sizes of 3000 and 4000 are optimal for the different datasets. LIBLINEAR is used for classification with a linear kernel. The parameter c is empirically set to 10; for the other SVM parameters, the default values are used.

4.2 MSR action 3D dataset

The MSR Action 3D dataset contains gaming actions. It consists of depth sequences of 20 actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up and throw; each is performed by 10 subjects 2–3 times. The frame rate is 15 fps with a resolution of 320 × 240. The background of this dataset has been removed. The most important challenge of this dataset is inter-action similarity. The dataset only contains depth videos, so our method is applied to depth data only.

To facilitate a fair comparison, the same experimental setting as in other papers [42] is used; i.e., subjects 1, 3, 5, 7, and 9 are used for training and the rest for testing. The confusion matrix on this dataset is presented in Fig. 7a. Table 1 lists the accuracy of existing methods on this dataset. SSTKDes achieves an accuracy of 95.60%, which outperforms the other methods. By comparing the confusion matrix of SSTKDes with that of [22], it can be observed that confusions between similar actions in our method are fewer than in [22]. In other words, SSTKDes effectively pushes actions with different labels farther apart and pulls those with the same label closer. Thus, it handles inter-class similarities and intra-class variabilities more successfully.

Fig. 7 Confusion matrix on: a MSR Action 3D dataset, b MSR Daily Activity 3D dataset

Table 1 Performance comparison on MSR Action 3D dataset

4.3 MSR daily activity 3D dataset

The MSR Daily Activity 3D dataset includes daily activities in a more realistic setting; i.e., two different poses with human-object interaction in a living room. It consists of both RGB and depth sequences of 16 actions: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, and sit down. Each action is performed by 10 subjects twice, in the two poses of “sitting on a sofa” and “standing”. Subjects appear at different distances from the camera, and most of the actions involve human-object interaction. These facts make this dataset very challenging. The proposed method is applied to the RGB videos, the depth videos, and a concatenation of both features (i.e., RGB-D).

Figure 7b shows the confusion matrix of the best result (RGB-D) on this dataset. Table 2 compares the proposed method with existing state-of-the-art methods. The result of our method on depth data is better than on RGB. The raw attributes used are the 3D gradients; therefore, in a cluttered, textured background, many strong 3D gradients are extracted from the background that are not related to the action. Since depth data has no texture, the strong 3D gradients are related to the performed action and, possibly, to noise. As discussed in Section 3.2, the denoising step is able to suppress part of this noise; Fig. 4 also shows that the smoothing functions notably increase the performance of SSTKDes on this dataset. Hence, after denoising, most of the strong 3D gradients are related to the performed action. As a consequence, the depth data achieves a better result than RGB.

Table 2 Performance comparison on MSR Daily Activity 3D dataset

For this dataset, ActionLet [42] and SNV [53] obtain better accuracies than the proposed method. Those methods use skeletal data for extracting the low-level features: the features are extracted in a spatio-temporal neighborhood around the 3D joint locations. In other words, they do not use all pixels in a frame for feature extraction, so the noisy background has less effect on the video descriptor. In contrast, based on the discussion of skeletal data in Section 2, this kind of data is not used here, and the SSTKDes descriptor is computed by exploiting all pixels in the frames. As a result, the noisy depth values influence the final accuracy. It is worth mentioning that although those methods obtain better results on this dataset, SSTKDes is likely to yield better accuracy in real situations where the estimated human skeleton is not reliable. Nevertheless, SSTKDes still achieves better accuracy than HKDES [22], which is the effect of using the supervised strategy.

4.4 MSR gesture 3D dataset

The MSR Gesture 3D dataset contains depth sequences of 12 dynamic American Sign Language (ASL) gestures: bathroom, blue, finish, green, hungry, milk, past, pig, store, where, j, and z. Each gesture contains the segmented hand portion (above the wrist) and is performed by 10 subjects 2–3 times. No RGB data is available for this dataset. The confusion matrix in Fig. 8a is the best result of our method on the depth videos of this dataset.

Fig. 8 Confusion matrix of: a Gesture 3D dataset, b UT Kinect dataset

Table 3 reports the performance of our method compared to the state-of-the-art methods. SSTKDes outperforms all of them. The underlying reason is that SSTKDes efficiently takes into account both the spatial and temporal structure, formed by concatenating the features of the parts. In fact, the motion (i.e., temporal information) of different hand parts (i.e., spatial information) is well organized by the hierarchical structure of the actions. The supervised strategy also helps keep gestures with the same labels close to each other. As a result, the method is capable of learning nonlinear (spatio-temporal) correlations between different parts of the hand.

Table 3 Performance comparison on MSR Gesture 3D dataset

4.5 MSR action pairs dataset

The MSR Action Pairs dataset is a paired-activity dataset of both RGB and depth sequences of 6 pairs (12 activities), each performed by 10 subjects 3 times. It contains: lift a box/place a box, pick up a box/put down a box, push a chair/pull a chair, put on a backpack/take off a backpack, stick a poster/remove a poster, and wear a hat/take off a hat. The challenge of this dataset is that each action pair has the same shape but a reversed temporal order (like pick up and put down). In other words, considering the temporal order of frames is the most important factor for action recognition on this dataset.

Table 4 compares the performance of the proposed method with other state-of-the-art methods. SSTKDes achieves the best accuracy on this dataset, i.e., 100%; therefore, no confusion matrix is depicted for this dataset. SSTKDes benefits from preserving the temporal changes of the whole action volume by concatenating the parts. As a consequence, it can distinguish between actions with similar shape and different motion directions. The accuracies of the RGB and depth data are very close to each other on this dataset.

Table 4 Performance comparison on Action Pairs dataset

4.6 UT kinect

The UT Kinect dataset contains both RGB and depth sequences of 10 actions: hello, push, pull, boxing, step, forward-kick, side-kick, wave hands, bend, and clap hands; each is performed twice by 10 subjects. The actions in this dataset cover movements of the hands, arms, legs, and upper torso. The frame rate is 30 fps and the resolution is 320 × 240. Figure 8b shows the confusion matrix of the proposed method on this dataset, which achieves 97%. Table 5 lists the performance of the proposed method and other state-of-the-art methods. SSTKDes achieves the best result along with [39]. For this dataset, again, the depth data achieves a better result than RGB. The underlying reason is that the 3D gradients in depth data contain the 3D geometric information of the subject and the scene alongside the temporal information.

Table 5 Performance comparison on UT Kinect dataset

5 Conclusion

A novel supervised spatio-temporal kernel descriptor is proposed for human action recognition from RGB-D videos. 3D gradients are used as the low-level attributes for both RGB and depth videos, since they capture both spatial and temporal information; in depth video, the 3D gradient also captures 3D geometric information. The low-level 3D gradient attributes are then transferred into a kernel space. In the next level, by using a supervised strategy (LMNN) and a set of 3D blob kernel basis vectors (a dictionary generated through EMK), the features in the kernel space are transformed into a more discriminative space. The success of this strategy had previously been shown for object recognition [43].

In this paper, it has been shown that LMNN can efficiently minimize the intra-class variation and maximize the inter-class dissimilarities for action recognition. Moreover, EMK combines the strengths of both bag-of-words and set kernels: it maps local features to a low-dimensional feature space, and the set-level feature vector is then formed by averaging the resulting feature vectors, which produces more accurate quantization and better performance. Finally, actions are classified by a linear SVM using the feature vectors extracted from RGB, depth, or their concatenation (RGB-D). The experimental results show the efficiency and superiority of SSTKDes.