1 Introduction

Human action recognition has been a significant research topic for the past three decades due to its practical applications in many critical fields, e.g., human-computer interaction, video surveillance, motion retrieval, and health care [6, 9, 21, 27, 34, 36]. Much progress has been made on intensity image based action recognition [11, 15, 18, 30, 39]. However, intensity image based methods suffer from many difficult conditions, such as illumination and viewpoint variation, cluttered backgrounds, camera movement, and partial occlusion. Moreover, these methods struggle to extract discriminative features because of the intra-class variability and inter-class similarity of action sequences, and such features are critical for achieving high recognition accuracy.

Range information, provided by Kinect-like somatosensory devices, has proved useful for alleviating these problems: it can improve the recognition accuracy of actions that are hard to recognize from intensity images because they look similar in the 2D projection space. Significant improvements [7, 20, 32, 41] have shown promising applications of depth maps in the field of human action recognition. Furthermore, the sequences of 3D skeleton joints of an action video clip can also be obtained in real time using the Microsoft Kinect SDK toolkit [57]. Since the human skeleton can be viewed as an articulated system connected by hinged joints, human actions are essentially embodied in skeletal motion in 3D space. As a result, many skeleton-based methods have emerged [2, 5, 10, 26, 49, 54].

Despite fruitful research on 3D skeleton based action recognition, existing methods still suffer from drawbacks, especially when representing the structure of actions. Some methods [1, 12, 17, 51] stack the features of all joints together, which incurs a large computational cost and produces high-dimensional features; others [8, 14, 19, 31, 48] rely on subtle feature pruning accompanied by complicated classification models, which are usually time consuming and supervised. Both kinds of methods improve recognition accuracy at the cost of computational efficiency, which matters more in practical use. Considering these problems, in this paper we propose a novel joint-offset based histogram representation model for each joint, which is simple to implement, efficient in recognition tasks, and unsupervised during training. At the same time, the displacement characteristics of both the local and the global movement are comprehensively considered, which improves recognition accuracy.

The flowchart of the proposed framework is illustrated in Fig. 1, which includes the training phase and the testing phase. The main idea of this paper comes from the observation that the offset determined by the displacement of a skeleton joint between two different frames reflects the movement characteristic of that joint during the time interval between these frames. Moreover, the joint offset from the first frame to the current frame, called the global offset feature, reflects the global movement of the joint, and the joint offset between a pair of frames with a fixed interval, called the local offset feature, reflects the local movement of the joint. Combining the local and global features improves recognition accuracy without increasing the complexity of the model. The joint histogram representation is then generated by clustering and coding the global and local offsets of each joint, respectively. Furthermore, because the contribution of every histogram bin should not be neglected, a saturation-based histogram representation is proposed. This histogram representation strategy is motivated by [26]. While Lu et al. [26] proposed a histogram model by clustering the local offset vectors of all joints together, our model differs from their work in three aspects: (1) we propose the global offset feature that captures the global characteristic of an action, and integrate it with the local offset feature to construct the motion representation model; (2) we improve performance by clustering the offset vectors of each joint independently instead of clustering all joints together; (3) we apply saturation to the histogram representation model to enhance the discrimination ability of the features. Afterwards, we employ two different classifiers, i.e., the Naive-Bayes-Nearest-Neighbor (NBNN) classifier and the Sparse Representation-based Classifier (SRC). The experiments are run on five datasets with different characteristics: the BJUT dataset captured by ourselves with Kinect, the UCF Kinect dataset, the Florence 3D action dataset, the MSR-Action3D dataset, and the NTU RGB+D dataset. The results show that our method achieves both high accuracy and high efficiency.

Fig. 1 The general framework of the proposed method, where purple and green solid points are cluster centers of global and local offsets, respectively. For clarity, we only illustrate the global and local offsets of five joints

In summary, the main contributions of our work include the following four aspects:

(1) A novel action feature consisting of the global and local position offsets of joints is used, which synergistically reflects the spatial and temporal properties of an action video.

(2) The action representation model is generated from a set of joint histograms. The motion independence of skeleton joints is embedded in the representation by applying K-means clustering to the offset vectors of each joint separately. Moreover, saturation of each histogram bin is considered to enhance the discrimination ability.

(3) The effects of two different classifiers, i.e., NBNN and SRC, on our proposed method are verified separately; both preserve the spatial independence of joints through a joint histogram-to-class distance measurement. The former is a non-parametric classifier that requires no learning process and is easy to implement in practice; the latter is a parametric classifier that requires a learning process.

(4) A novel action dataset consisting of ten classes is provided, in which each video sequence contains multiple periods of an action.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 elaborates on the global-and-local feature extraction and the saturation-based histogram representation. Section 4 describes the two classifiers, namely the Naive-Bayes-Nearest-Neighbor (NBNN) classifier and the Sparse Representation-based Classifier (SRC). Experimental results are presented in Section 5. The conclusion is given in Section 6.

2 Related work

Various range information based human action recognition approaches have been proposed in the past decades. According to the type of raw data they rely on, these methods fall into three classes: depth map based, skeleton joint based, and multiple data modality based methods.

The first class of methods only uses depth maps, or 3D point clouds converted from depth maps, for action recognition. Li et al. [20] constructed an action graph to represent actions and used a bag of 3D points, obtained by a projection-based sampling method, to characterize postures. Yang et al. [53] applied HOG to depth motion maps, which were generated by accumulating the motion energy of depth maps projected onto three orthogonal Cartesian planes. Wang et al. [47] proposed random occupancy pattern features for action recognition and used weighted random sampling to explore an extremely large dense sampling space. Similarly, Vieira et al. [45] proposed space-time occupancy patterns and divided the space and time axes into multiple segments to define a 4D grid. Oreifej et al. [33] encoded the distribution of surface normal orientation in 4D space. These methods model an action by discrete points in the depth maps and fail to treat each map as a whole with the inherent structure of the human body, which limits recognition accuracy.

As for skeleton joint based methods, features are extracted to capture the essential structure of actions. Ellis et al. [12] presented a logistic regression learning framework that automatically determined a distinctive pose representation for each action. Xia et al. [50] used histograms of 3D skeleton joint locations (HOJ3D) as a compact representation of postures; posture vocabularies were built by clustering HOJ3D vectors calculated from a large collection of postures, and a discrete hidden Markov model was used for action classification. Zhou et al. [56] presented a skeleton-induced discriminative approximate rigid part model for human action recognition, which not only captured the human geometrical structure but also took rich human body surface cues into consideration. Yang et al. [51] proposed another action feature descriptor, called eigenjoints, by calculating the differences of joints, which include relative position information, consecutive-frame information, and offset information. Similarly, Jiang et al. [17] presented a method using consecutive-frame information and relative position information, and employed weighted graphs to organize this information. Li et al. [19] also used a graph-based model to characterize actions, but only used the relative position feature; the proposed top-K relative variance of joint relative distances determined which joint pairs should be selected in the resulting graph, and temporal pyramid covariance descriptors were adopted to represent joint locations. Qiao et al. [35] proposed the trajectorylet, which captures static and dynamic information of joints in a short interval, and generated an action representation by learning a set of distinctive trajectorylet detectors. In order to reduce the feature space, Luvizon et al. [29] proposed extracting sets of spatial and temporal local features from subgroups of joints, used the Vector of Locally Aggregated Descriptors (VLAD) to represent an action, and then proposed a metric learning method that can efficiently combine the feature vectors. Lu et al. [26] extracted features by computing local position offsets of joints. However, this method does not fully exploit the temporal relationships of action sequences and thus fails to capture the continuous information of each action; moreover, it does not consider the motion independence of each joint in the codebook formation phase, which matters in action recognition. In contrast, we construct a global offset feature and apply K-means clustering to the offsets of each joint to compensate for these deficiencies of the method of Lu et al. [26]. Besides, some works have presented satisfactory results using skeletal features in RNNs [42] and Long Short-Term Memory (LSTM) networks [55]; however, due to the relatively small number of training samples, neural network methods usually suffer from strong overfitting.

As for multiple data modality based methods, more than one type of data source is used for action recognition. Ohn-Bar et al. [32] proposed two descriptors, namely joint angle similarity and a modified HOG algorithm. Similarly, Zhu et al. [58] fused spatio-temporal interest points extracted from depth sequences with skeleton joint features using random forests. Luo et al. [28] proposed a framework fusing a pairwise relative position feature extracted from skeleton joints with a center-symmetric motion local ternary pattern feature extracted from RGB sequences. Besides RGB and skeleton joints, Sung et al. [41] added depth maps for action recognition, and a two-layer maximum entropy Markov model was presented for classification. Wang et al. [46] combined the pairwise relative position feature and the local occupancy pattern, and employed a Fourier temporal pyramid to represent actions. These methods rely on complex models and require long computation times.

As an extension of human actions, human activities can be considered as compositions of actions. There has been relatively little work on bridging the gap between actions and activities. Liu et al. [22, 23] provided temporal pattern mining, which encoded the temporal relatedness among actions and captured the intrinsic properties of activities. Furthermore, Liu et al. [24] presented a probabilistic interval-based model in which the Chinese restaurant process is incorporated to capture the inherent structural varieties of complex activities. Because collecting annotated or labelled training data for sensor-based supervised human activity recognition is difficult, Lu et al. [27] proposed an unsupervised method for recognizing physical activities using smartphone sensors. Since action recognition is the basis of activity recognition, in this paper we focus on action recognition.

From this related work, we can draw three important conclusions. First, most methods concentrating on high efficiency are skeleton-based. Second, both spatial and temporal information are important for action recognition, and it is not trivial to fuse global and local temporal information. Third, the trajectory feature of each joint is independent and important for action recognition, and the importance of each joint is not equal. In our work, we only use skeleton joints as input data, and our method characterizes both the global and local movements. Combining the joints could further improve recognition accuracy, but this requires class label information; since the training process in this paper is unsupervised, a label-dependent combination of joints cannot be realized. Therefore, we propose the following compromise: we represent the trajectory of each joint separately using the saturation-based histogram representation, allowing classification by measuring the distance of each joint feature to a class.

3 Feature extraction and action representation

In this section, the proposed representation model based on the global and local offsets of skeleton joints is described. The main idea is first to extract the low-level features of an action by computing the position offsets of corresponding joints between two assigned frames, and then to construct the histogram representation of the action by clustering and coding the global and local offsets of each joint, respectively.

3.1 Joint-based spatial-temporal feature extraction

Let Ψ denote a set including N video sequences:

$$ {\Psi}\equiv\{F_{r}|F_{r}=[f_{r}(1),f_{r}(2),\ldots,f_{r}(n_{r})],r = 1,2,\ldots,N\}, $$
(1)

where Fr represents the r th video with nr frames. Suppose that J joints are acquired in each frame; then the t th frame fr(t) can be denoted by the 3D coordinates of the joints as follows:

$$ f_{r}(t)=\{\theta_{1,r}(t),\theta_{2,r}(t),\ldots,\theta_{J,r}(t)\},t = 1,2,\ldots,n_{r}, $$
(2)

where 𝜃j,r(t) = (xj,r(t),yj,r(t),zj,r(t)) denotes the 3D position of the j th joint in fr(t), j = 1,2,…,J.

Obviously, the joint coordinates reveal the spatial feature of the action, while the joint displacements characterize its temporal feature. Therefore, we calculate the offset of each joint between the t th frame and the (t −Δt)th frame to represent the spatial-temporal feature of the action.

$$ \phi_{j,r}^{L}(t)=\theta_{j,r}(t)-\theta_{j,r}(t-{\Delta} t), $$
(3)

where Δt is the time difference, which balances the precision of the offset against robustness to noise: a larger Δt is more robust to noise fluctuations but less precise, and vice versa. The superscript L indicates that the feature describes the local movement of the joint during the time interval [(t −Δt), t]. However, (3) only characterizes the local spatial-temporal property and fails to express the global movement of the joint relative to the original pose in the first frame. Therefore, to enhance the spatial-temporal property, we introduce the global offset, computed as the displacement from the joint position in the first frame to the position of the corresponding joint in the t th frame.

$$ \phi_{j,r}^{G}(t)=\theta_{j,r}(t)-\theta_{j,r}(1). $$
(4)

Thus, fr(t) can be represented as follows:

$$\begin{array}{@{}rcl@{}} {{\Phi}_{r}^{G}}(t)=[\phi_{1,r}^{G}(t),\phi_{2,r}^{G}(t),\ldots,\phi_{J,r}^{G}(t)], \\ {{\Phi}_{r}^{L}}(t)=[\phi_{1,r}^{L}(t),\phi_{2,r}^{L}(t),\ldots,\phi_{J,r}^{L}(t)]. \end{array} $$
(5)

In other words, the combination of two features forms the preliminary feature representation of each frame as follows:

$$ {\Phi}_{r}(t)=[{{\Phi}_{r}^{G}}(t),{{\Phi}_{r}^{L}}(t)], t={\Delta} t + 1,{\Delta} t + 2,\ldots,n_{r}. $$
(6)

Therefore, an action can be described as follows:

$$ F_{r}^{\prime}=\left[\begin{array}{llll} \phi_{1,r}^{G}({\Delta} t + 1) & \phi_{2,r}^{G}({\Delta} t + 1) & {\cdots} & \phi_{J,r}^{G}({\Delta} t + 1)\\ \phi_{1,r}^{G}({\Delta} t + 2) & \phi_{2,r}^{G}({\Delta} t + 2) & {\cdots} & \phi_{J,r}^{G}({\Delta} t + 2)\\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ \phi_{1,r}^{G}(n_{r}) & \phi_{2,r}^{G}(n_{r}) & {\cdots} & \phi_{J,r}^{G}(n_{r})\\ \phi_{1,r}^{L}({\Delta} t + 1) & \phi_{2,r}^{L}({\Delta} t + 1) & {\cdots} & \phi_{J,r}^{L}({\Delta} t + 1)\\ \phi_{1,r}^{L}({\Delta} t + 2) & \phi_{2,r}^{L}({\Delta} t + 2) & {\cdots} & \phi_{J,r}^{L}({\Delta} t + 2)\\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ \phi_{1,r}^{L}(n_{r}) & \phi_{2,r}^{L}(n_{r}) & {\cdots} & \phi_{J,r}^{L}(n_{r}) \end{array}\right]. $$
(7)

In the method of Lu et al. [26], only the local offset is extracted to represent an action; our method improves this representation by introducing the global offset as above. Figure 2 shows that the difference between the global offset of the t th frame and the local offset of the t th frame equals the global offset of the (t −Δt)th frame, i.e.,

$$ \phi_{j,r}^{G}(t-{\Delta} t)=\phi_{j,r}^{G}(t)-\phi_{j,r}^{L}(t), $$
(8)

which means that part of the feature of the current frame is contained in the feature of the subsequent frame.

Fig. 2 Illustration of the temporal sequence property
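To make Eqs. (3)-(7) concrete, the following NumPy sketch computes both offset types for one sequence; the array layout (joint positions stored as an (nr, J, 3) array) and the function name are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def joint_offsets(frames, delta_t):
    """Global and local joint offsets of one action sequence.

    frames  : (n_r, J, 3) array of 3D joint positions theta_{j,r}(t).
    delta_t : time difference Delta t of Eq. (3).

    Returns two (n_r - delta_t, J, 3) arrays: the global offsets of
    Eq. (4) and the local offsets of Eq. (3), i.e. the entries of the
    matrix F'_r in Eq. (7).
    """
    n_r = frames.shape[0]
    t = np.arange(delta_t, n_r)                  # frames Delta t + 1, ..., n_r
    global_off = frames[t] - frames[0]           # phi^G(t) = theta(t) - theta(1)
    local_off = frames[t] - frames[t - delta_t]  # phi^L(t) = theta(t) - theta(t - Delta t)
    # Eq. (8), phi^G(t - Delta t) = phi^G(t) - phi^L(t), follows directly
    # from these two definitions.
    return global_off, local_off
```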

3.2 Joint-based histogram representation model

After each training action has been represented by a set of low-level features from all body joints according to (7), a histogram of occupation frequency (HOF) representation based on offset vector clustering is used to generate the action representation model. Inspired by Luo et al. [28], who indicated that each joint plays a different role in different actions, we preserve the motion independence of joints by applying K-means clustering to the offset vectors of each joint separately, as illustrated in Fig. 3, rather than to all joints together.

Fig. 3 Illustration of the histogram representation obtained by applying K-means clustering to ΩjG and ΩjL

Firstly, we group together the global offset vectors of each joint over all training action sequences and denote the set by \({{\Omega }_{j}^{G}}=\{\phi _{j,r}^{G}(t)\}_{j = 1,2,\ldots ,J, r = 1,2,\ldots ,N,t = 1,2,\ldots ,n_{r}}\), where N is the number of video sequences and nr is the number of frames of the r th sequence. Here \({{\Omega }_{j}^{G}}\) corresponds to the global feature set of the j th body joint over all frames of all training action sequences. In the same way, \({{\Omega }_{j}^{L}}=\{\phi _{j,r}^{L}(t)\}_{j = 1,2,\ldots ,J, r = 1,2,\ldots ,N,t = 1,2,\ldots ,n_{r}}\) denotes the local feature set of each body joint. Then we apply the K-means clustering algorithm to \({{\Omega }_{j}^{G}}\) and \({{\Omega }_{j}^{L}}\) to form the cluster centers \(C_{j,k}^{G}\) and \(C_{j,k}^{L}, k = 1,2,\ldots ,K\), where K is the number of clusters, which is important for balancing the discrimination and robustness of the representation model. The Euclidean distance is used as the clustering metric.
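A minimal sketch of this per-joint clustering step is given below, assuming the sets ΩjG or ΩjL have already been stacked into one array per joint; the function name and data layout are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def per_joint_codebooks(offset_sets, K):
    """Cluster the offsets of each joint independently (Euclidean K-means).

    offset_sets : list of length J; offset_sets[j] is an (M_j, 3) array
                  stacking all offsets of joint j over every frame of every
                  training sequence (Omega_j^G or Omega_j^L).
    K           : number of clusters per joint.

    Returns a list of J arrays of shape (K, 3) holding the cluster
    centers C_{j,k} used later to build the joint histograms.
    """
    return [KMeans(n_clusters=K, n_init=10).fit(omega_j).cluster_centers_
            for omega_j in offset_sets]
```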

Then, a given video sequence \( F_{r}^{\prime }\) with nr frames, described by (7), can be further expressed by a set of histograms that represent the occupation frequencies of the assigned clusters.

$$\begin{array}{@{}rcl@{}} \alpha_{j,r}^{G}(k^{\prime})&=&\frac{\#\{\phi_{j,r}^{G}(t)|k^{\prime}=\arg\min_{k}\| \phi_{j,r}^{G}(t)-C_{j,k}^{G} \|\}}{n_{r}-{\Delta} t},k^{\prime}= 1,2,\ldots,K,\\ \alpha_{j,r}^{L}(k^{\prime})&=&\frac{\#\{\phi_{j,r}^{L}(t)|k^{\prime}=\arg\min_{k}\| \phi_{j,r}^{L}(t)-C_{j,k}^{L} \|\}}{n_{r}-{\Delta} t},k^{\prime}= 1,2,\ldots,K, \end{array} $$
(9)

where \(\alpha _{j,r}^{G}(k^{\prime })\) and \(\alpha _{j,r}^{L}(k^{\prime })\) represent the histograms of global and local offsets of the j th joint, respectively, and #{} denotes the cardinality of a set, k = 1,2,…,K, t = Δt + 1,Δt + 2,…,nr. Thus, the movement of the j th joint can be represented by a histogram, i.e., \(\alpha _{j,r}=[\alpha _{j,r}^{G},\alpha _{j,r}^{L}]\in \text {I\!R}^{2K} \).
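A sketch of the HOF computation of Eq. (9) for a single joint follows; the helper name and array shapes are assumptions used only for illustration.

```python
import numpy as np

def joint_hof(offsets_j, centers_j):
    """Histogram of occupation frequency of one joint, Eq. (9).

    offsets_j : (n_r - delta_t, 3) global or local offsets of joint j
                in one action sequence.
    centers_j : (K, 3) cluster centers C_{j,k} of that joint.

    Returns a length-K vector: the fraction of frames whose offset is
    closest (in Euclidean distance) to each cluster center.
    """
    d = ((offsets_j[:, None, :] - centers_j[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)                          # closest center per frame
    counts = np.bincount(nearest, minlength=centers_j.shape[0])
    return counts / offsets_j.shape[0]                  # divide by n_r - delta_t
```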

Furthermore, we notice that the occupation frequencies are nonuniformly distributed among the histogram bins; for instance, the majority of clusters in a histogram may have very low occupation frequencies. In this situation, if we use the histogram feature directly without considering saturation, the effect of the low-frequency clusters is diminished in classification: their frequencies are small relative to the few high-frequency clusters, even though these low-frequency clusters are often highly relevant for action recognition. Therefore, a saturation-based histogram of occupation frequency (SHOF) is proposed. We set a parameter ε to truncate the high occupation frequencies, and the histogram is then represented as follows:

$$\begin{array}{@{}rcl@{}} \alpha_{j,r}^{{\prime}G}(k)&=&\frac{\min\{\alpha_{j,r}^{G}(k),\varepsilon\}}{{\sum}_{k^{\prime}= 1}^{K}\min\{\alpha_{j,r}^{G}(k^{\prime}),\varepsilon\}},k = 1,2,\ldots,K,\\ \alpha_{j,r}^{{\prime}L}(k)&=&\frac{\min\{\alpha_{j,r}^{L}(k),\varepsilon\}}{{\sum}_{k^{\prime}= 1}^{K}\min\{\alpha_{j,r}^{L}(k^{\prime}),\varepsilon\}},k = 1,2,\ldots,K, \end{array} $$
(10)

where ε is empirically selected to maximize the recognition accuracy. Thus, the movement of the j th joint can be represented by the saturated histogram \(\alpha _{j,r}^{\prime }=[\alpha _{j,r}^{'G},\alpha _{j,r}^{'L}]\in \text {I\!R}^{2K}\). Finally, an action sequence can be represented by the SHOF of all joint movements, i.e., \(F_{r}^{\prime \prime }=[\alpha _{1,r}^{\prime },\alpha _{2,r}^{\prime },\ldots ,\alpha _{J,r}^{\prime }]\in \text {I\!R}^{2K\times J}\). Our proposed framework is presented in Algorithm 1.

Algorithm 1 The proposed framework
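The saturation step of Eq. (10) amounts to a clip-and-renormalize operation; a minimal sketch is given below, with the assembly of the final descriptor F''_r indicated only as illustrative comments (names such as hof_G and hof_L are assumptions).

```python
import numpy as np

def saturate(hist, eps):
    """Saturation-based histogram of occupation frequency (SHOF), Eq. (10).

    hist : length-K occupation-frequency histogram of one joint.
    eps  : saturation parameter; frequencies above eps are truncated before
           re-normalization, so that rarely visited but discriminative
           clusters are not drowned out by a few dominant bins.
    """
    clipped = np.minimum(hist, eps)
    return clipped / clipped.sum()

# Illustrative assembly of the action descriptor F''_r in R^{2K x J}:
# alpha_j = np.concatenate([saturate(hof_G[j], eps), saturate(hof_L[j], eps)])
# F_r_dd  = np.column_stack([alpha_1, ..., alpha_J])
```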

4 Classification

Suppose an action sequence is represented by the set of histograms of all joints, i.e., V = [h1,h2,…,hJ]. To retain the movement independence of each joint, which often provides additional clues for action discrimination, we classify an action video by measuring the joint histogram-to-class distance rather than the video-to-video or video-to-class distance. Action recognition based on the histogram-to-class distance is performed according to the following equation:

$$ c^{\ast}=\arg\min_{c}\sum\limits_{j = 1}^{J} \|h_{j}-{U_{j}^{c}}(h_{j})\|^{2}, $$
(11)

where c* is the class assigned to the testing video sequence V, hj is the histogram of the j th joint of V, and \({U_{j}^{c}}(h_{j})\) is the histogram nearest to hj in class c. We apply two different classifiers, i.e., NBNN and SRC, to classify actions based on the above distance measurement principle. The difference is that the former classifies the proposed histogram representation directly, while the latter classifies the sparse representation derived from the histogram representation.

4.1 Naive-Bayes-Nearest-Neighbor classifier

Naive-Bayes-Nearest-Neighbor (NBNN) [4] is employed with the histogram-to-class distance defined above. NBNN is a non-parametric classifier with the following four advantages over other learning-based classifiers: (1) it does not require a learning process; (2) it avoids the over-fitting problem; (3) it can deal with a large number of classes; (4) it is easy to implement for practical use.

The action video is classified according to (11) with \({U_{j}^{c}}(h_{j})\) represented as follows:

$$ {U_{j}^{c}}(h_{j})=N{N_{j}^{c}}(h_{j})=\arg\min_{h_{j}^{\prime}(c)}\|h_{j}-h_{j}^{\prime}(c)\|, $$
(12)

where \(h_{j}^{\prime }(c)\) denotes the histogram of the j th joint of a training action in class c.
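A sketch of this NBNN decision rule over joint histograms, Eqs. (11) and (12), is given below; the data layout and function names are assumptions for illustration only.

```python
import numpy as np

def nbnn_classify(test_hists, train_hists):
    """Naive-Bayes-Nearest-Neighbor over joint histograms, Eqs. (11)-(12).

    test_hists  : (J, D) array, one histogram per joint of the test video
                  (D = 2K for the SHOF descriptor).
    train_hists : dict mapping class label c to an array of shape
                  (P_c, J, D) with the joint histograms of its training videos.

    Returns the label c* minimizing the summed histogram-to-class distances.
    """
    best_label, best_cost = None, np.inf
    for c, hists_c in train_hists.items():
        cost = 0.0
        for j in range(test_hists.shape[0]):
            # squared distance of h_j to its nearest neighbor within class c
            d = ((hists_c[:, j, :] - test_hists[j]) ** 2).sum(axis=1)
            cost += d.min()
        if cost < best_cost:
            best_label, best_cost = c, cost
    return best_label
```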

4.2 Sparse representation-based classifier

Assume that there are C classes in the training set. For the j th joint in the c th class, we gather the histograms of all sample videos and learn a dictionary to represent the histogram feature of that joint; in this way, C × J dictionaries are learned. The j th joint histograms of the training videos in the c th class are arranged as the columns of the matrix \({A_{j}^{c}}=\{{h_{j}^{p}}\}_{p = 1,2,\ldots ,P}\), where P is the number of training videos in the c th class. We wish to learn a dictionary \({D_{j}^{c}}\in \text {I\!R}^{2K\times P} \) over which \({A_{j}^{c}}\) has a sparse representation \({X_{j}^{c}}=\{x_{1},x_{2},\ldots ,x_{P} \}\). This is modeled as the following optimization problem:

$$ \min_{D,X}\{\|{A_{j}^{c}}-{D_{j}^{c}}{X_{j}^{c}}\|_{F}^{2}\} \ \ \ s.t. \|x\|_{0}\leq q_{1}. $$
(13)

For a testing video sequence V = [h1,h2,…,hJ], one way to classify V is to find the approximations of {hj}, j = 1,2,…,J, given by each of the learned dictionaries and their corresponding reconstruction errors. Equation (14) defines the term \({U_{j}^{c}}(h_{j})\).

$$ {U_{j}^{c}}(h_{j})={D_{j}^{c}}\hat{x}_{j}^{c}, \ \ \ \hat{x}_{j}^{c}=\arg\min_{\tilde{x}_{j}^{c}}\|h_{j}-{D_{j}^{c}}\tilde{x}_{j}^{c}\|_{2}^{2} \ \ \ s.t. \ \|\tilde{x}_{j}^{c}\|_{0}\leq q_{2}, $$
(14)

where \(\hat {x}_{j}^{c}\) is the sparse representation of hj over \({D_{j}^{c}}\), j = 1,2,…,J. \({U_{j}^{c}}(h_{j})\) is then substituted into (11) to perform the classification.
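The sketch below illustrates this SRC decision, Eqs. (11) and (14), with a plain greedy orthogonal matching pursuit; for simplicity it uses the training histograms of each class directly as dictionary atoms instead of learning D_j^c via Eq. (13), so it is an approximation under stated assumptions rather than a faithful reimplementation of the described procedure.

```python
import numpy as np

def omp(D, h, q):
    """Greedy OMP: approximately solve min ||h - D x||_2 s.t. ||x||_0 <= q."""
    residual, support = h.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(q):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        sol, *_ = np.linalg.lstsq(D[:, support], h, rcond=None)
        residual = h - D[:, support] @ sol
    x[support] = sol
    return x

def src_classify(test_hists, dictionaries, q2):
    """Sparse-representation classification over joint histograms.

    test_hists   : (J, 2K) joint histograms of the test video.
    dictionaries : dict mapping class c to a list of J matrices of shape
                   (2K, P); here the columns are simply the training
                   histograms of class c (a stand-in for a learned D_j^c).
    q2           : sparsity level of the coding step in Eq. (14).
    """
    best_label, best_cost = None, np.inf
    for c, dicts_c in dictionaries.items():
        cost = 0.0
        for j, D_jc in enumerate(dicts_c):
            x_hat = omp(D_jc, test_hists[j], q2)
            cost += np.sum((test_hists[j] - D_jc @ x_hat) ** 2)  # term of Eq. (11)
        if cost < best_cost:
            best_label, best_cost = c, cost
    return best_label
```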

5 Experimental results

In this paper, we evaluate our method on five datasets: a new dataset captured by ourselves, called the BJUT Kinect dataset, and four publicly available datasets, namely the UCF Kinect dataset [12], the Florence 3D action dataset [37], the MSR-Action3D dataset [20], and the NTU RGB+D dataset [38]. The experiments are run on a Core(TM) i7-4790 3.6 GHz machine with 8 GB RAM using Matlab R2016a.

5.1 Databases

5.1.1 BJUT Kinect dataset

We introduce a new action dataset captured with a Kinect sensor, called the BJUT Kinect dataset, which we collected in order to emphasize two points. First, each video sequence of the dataset is multi-period: each actor performed the requested action several times in each sequence. The dataset is therefore useful for evaluating how well feature descriptors handle multi-period actions. Second, each individual performed the actions freely without a standard action demonstration, so the dataset has a certain diversity, which makes recognition more difficult. The dataset has 159 video sequences in total and 10 classes, listed in Table 1. In each frame, the 3D coordinates of 25 joints are available. The dataset was captured from 12 individuals, including 9 males and 3 females whose ages range from 24 to 35. The actions of this dataset are illustrated in Fig. 4.

Table 1 The list of actions on the BJUT Kinect dataset
Fig. 4 Several poses associated with different actions on the BJUT Kinect dataset

5.1.2 UCF Kinect dataset

The UCF Kinect dataset [12] is a publicly available dataset including 16 classes: balance, climb ladder, climb up, duck, hop, kick, leap, punch, run, step back, step forward, step left, step right, twist left, twist right, vault. The dataset was captured with a Kinect sensor and the OpenNI platform, gathered from 16 individuals (13 males and 3 females whose ages range from 25 to 35), and has 1280 video sequences in total. In each frame, the 3D coordinates of 15 joints are available. Each individual performed the 16 actions five times. The dataset is designed to measure the achievable latency, i.e., how quickly a method can overcome the ambiguity of the initial poses while an action is being performed.

5.1.3 Florence 3D action dataset

The Florence 3D action dataset [37] was captured with a Kinect camera and includes 215 action sequences. It includes 9 action classes: wave, drink from a bottle, answer phone, clap, tight lace, sit down, stand up, read watch, bow. 10 subjects were asked to perform the above actions two or three times. The 3D positions of 15 joints are provided in each frame. Moreover, the dataset has high intra-class variation, and most activities involve human-object interactions, which makes recognition using only 3D joints challenging.

5.1.4 MSR-Action3D dataset

The MSR-Action3D dataset [20] is a publicly available dataset including 567 action sequences performed by 10 individuals. It includes 20 action classes: high wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw X, draw tick, draw circle, hand clap, hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pickup throw. The data were recorded with a depth sensor similar to Kinect. Each individual performed each action two or three times. The dataset provides 3D skeleton joints and depth maps; we use only the 3D skeleton joints, and the 3D coordinates of 20 joints are available in each frame.

5.1.5 NTU RGB+D dataset

The NTU RGB+D dataset [38] is a large human action dataset, which provides more than 56000 sequences and 4 million frames. There are 60 action classes performed by 40 distinct subjects, including 40 daily actions (e.g., drinking, reading, writing), 9 health-related actions (e.g., sneezing, staggering, falling down), and 11 mutual actions (e.g., handshaking, hugging, punching). Three cameras placed at different locations and viewpoints were used to capture the actions. In each frame, the 3D coordinates of 25 joints are available. Due to the large number of viewpoints and the intra-class and sequence-length variations, the dataset is very challenging.

5.2 Parameters evaluation

For the action representation, we need to tune three parameters. The time difference Δt balances precision against robustness to noise, the number of clusters K balances the discrimination of our method, and the saturation parameter ε balances the effect of each histogram bin. We empirically determine the intervals and the step size of each parameter. Table 2 shows the parameter settings for each dataset. Because each dataset has a different protocol, we describe the tuning process for three of the evaluated datasets. We tune Δt and K jointly, evaluating all combinations of the two parameter values and selecting the combination with the best accuracy. Figure 5 shows the performance with different values of Δt and K for three datasets, where BJUT1 denotes the original BJUT dataset and BJUT2 denotes the BJUT dataset with each video segmented so that a clip contains only one motion. For each dataset there is an optimum combination of Δt and K that gives the best performance. We tune ε on two datasets as shown in Fig. 6; the best performance is achieved when ε is 0.5 on the UCF Kinect dataset and 0.6 on the Florence 3D action dataset. For the sparse representation-based classifier (SRC), we need to tune q2 to investigate the impact of the sparsity of the joint histogram feature. The experiments are performed on the UCF Kinect dataset with different values of q2, as shown in Fig. 7; q2 = 4 achieves the best performance.

Table 2 Parameter settings for each dataset, where Δt is the time difference, K is the number of clusters, and ε is the saturation parameter
Fig. 5 The recognition accuracies with different combinations of parameter values for each dataset

Fig. 6 The recognition accuracies with different values of ε

Fig. 7 The recognition accuracies with different values of q2

5.3 Experimental results on BJUT Kinect dataset

For this dataset, an action is performed multiple times in each video sequence. Therefore, the evaluation is performed with two protocols: we first test on the dataset without video segmentation (BJUT1), and then test on the dataset with each video segmented so that a clip contains only one motion (BJUT2). The ratio of training to testing videos is 3:1 in both protocols, and the number of repetitions is 10 and 20, respectively. Table 3 shows the comparison results. For convenience, we use "LF", "GF", "AT", and "ER" to denote "local feature", "global feature", "clustering offsets of all joints together", and "clustering offsets of each joint respectively". For the SRC-based method, q1 = 2 and q2 = 4. The results verify that clustering the offset vectors of each joint separately is effective and outperforms clustering the offset vectors of all joints together. We can also conclude that our model improves recognition accuracy without excessively increasing the dimensionality. The effectiveness comes from employing the global offset feature to intensify the temporal property of the model. Furthermore, when the global feature is employed, the accuracy on BJUT1 is better than on BJUT2, which further illustrates that the global feature intensifies the temporal property. The recognition accuracy of the SRC-based method is slightly better than that of the NBNN-based method: SRC outperforms NBNN by 0.93% and 0.08% on BJUT1 and UCF, respectively, whereas NBNN outperforms SRC by 0.95% on BJUT2.

Table 3 Performance comparison on the BJUT Kinect dataset. We use "LF", "GF", "AT", and "ER" to denote "local feature", "global feature", "clustering offsets of all joints together", and "clustering offsets of each joint respectively", respectively. (8,40) represents Δt = 8 and K = 40

5.4 Experimental results on UCF Kinect dataset

For this dataset, we use 5-fold cross-validation to evaluate our method. We obtain the best performance when Δt is 4, K is 10, and ε is 0.5. Table 4 shows the accuracy comparison with the state-of-the-art methods. On the UCF Kinect dataset, the average accuracy of the proposed method is 99.14%; our method is comparable to the method of Jiang et al. [17] and better than the other methods. Figure 8 shows the confusion matrix corresponding to the best accuracy of our method. The recognition accuracies of most actions are 100%, such as climb ladder, duck, hop, step back, step forward, and twist right. For balance, kick, leap, step left, step right, and twist left, the recognition accuracies are no less than 99%, and even the lowest accuracies, for climb up, punch, and vault, are as high as 95%. Table 5 shows the average testing runtime of each phase of our method. The average number of frames per video is 66 for the UCF dataset. Our NBNN-based method costs only approximately 0.018 s per video sequence, and our SRC-based method costs approximately 0.088 s per video sequence. Both methods are highly efficient, and NBNN is faster than SRC.

Table 4 Performance comparison on the UCF Kinect dataset
Fig. 8 Confusion matrix of the proposed method (SHOF+NBNN) on the UCF Kinect dataset

Table 5 The average testing runtime for each phase of our method per action sequence on the UCF dataset

5.5 Experimental results on Florence 3D action dataset

For this dataset, we follow the standard leave-one-out cross-validation protocol described in [37] to evaluate our method. We obtain the best performance when Δt is 2, K is 10, and ε is 0.6. Table 6 shows the accuracy comparison with the state-of-the-art methods. Our proposed method is comparable to the supervised method of Yang et al. [54] and achieves better performance than the other methods, reaching a recognition accuracy of 92.19%. Figure 9 illustrates the confusion matrix on the Florence 3D action dataset. The proposed method performs very well on most of the actions, except for some actions such as drink from a bottle and answer phone, which are often misclassified as each other. The reason is that for these human-object interactions, object information is not available from the skeleton joint data, making the interactions look almost the same.

Table 6 Performance comparison on the Florence 3D action dataset
Fig. 9 Confusion matrix of our method on the Florence 3D action dataset

5.6 Experimental results on MSR-Action3D dataset

For this dataset, we follow the cross-subject evaluation described in [33], where the samples of half of the subjects are used for training and the others are used for testing. We obtain the best performance when Δt is 6, K is 30, and ε is 0.3. Table 7 shows the comparison with state-of-the-art unsupervised methods; our proposed method achieves acceptable performance. On this dataset, the accuracy of supervised methods [28, 48] can reach 93.8%, which outperforms ours, but this result should be viewed in the context of the accuracy/latency trade-off: these methods require the entire action to be observed before recognition can occur. Insight into the performance of our method can be gained by examining the accuracies of specific action classes. Figure 10 compares the per-class recognition accuracies of our method and the method of Luvizon et al. [29], a state-of-the-art unsupervised method. Both unsupervised methods can accurately distinguish actions with different body poses, but struggle to distinguish actions with similar poses, such as draw X and draw tick. The immediate reason is that our representation model is based on the primary skeleton joints of the torso and limbs (the "big" parts) and ignores detailed joints such as the fingers (the "little" parts); as a result, actions distinguished by subtle details are difficult to recognize. Therefore, using more detailed joints to represent actions is part of our future research plan.

Table 7 Performance comparison on the MSR-Action3D dataset
Fig. 10 Recognition accuracy (per action) on the MSR-Action3D dataset obtained by local feature+VLAD and our method

5.7 Experimental results on NTU RGB+D dataset

For this dataset, the evaluation is performed with the two standard protocols described in [38], i.e., cross-subject evaluation and cross-view evaluation. For cross-subject evaluation, the samples of 20 subjects are used for training and the samples of the 20 other subjects are used for testing. For cross-view evaluation, the samples captured by two cameras are used for training and the others are used for testing. We obtain the best performance when Δt is 2, K is 100, and ε is 1. The comparison with state-of-the-art handcrafted methods on this dataset is reported in Table 8. Our proposed method achieves acceptable performance even though its features are computed only from skeleton joints, without multi-modal fusion such as in [16]. The method of [16] also employs supervised feature learning, and its performance improvement comes with an increase in computational cost, particularly in the training phase.

Table 8 Performance comparison on the NTU RGB+D dataset

5.8 Efficiency analysis

Because fair execution under the same conditions is almost impossible, we cannot directly compare the actual computation times of other methods. Therefore, the efficiency analysis is discussed from two aspects: computational complexity and latency.

For the computational complexity analysis, we compare our method with the methods of [12] and [26], which also concentrate on computational efficiency while keeping a satisfactory recognition accuracy. Table 9 shows the comparison of computational complexity. Our method has a computational complexity comparable to that of the compared methods, while our feature has a lower dimension and our method has higher recognition accuracy. Therefore, our method is more effective, achieving high computational efficiency while keeping a better recognition accuracy than the compared methods. In short, our proposed method has low computational complexity and can be implemented in real time.

Table 9 Computational complexity of different methods, where the feature dimension and accuracy are reported on the UCF Kinect dataset, and n is the number of frames

For the latency analysis, the goal is to investigate how many frames are sufficient to enable accurate action recognition. We evaluate our method on sequences of varying lengths. From the original dataset, new datasets are created by varying a parameter termed maxFrames: sub-sequences are extracted from the first maxFrames frames of each video, and if a video is shorter than maxFrames, the entire video is used. The comparison of recognition accuracies using different numbers of initial frames is illustrated in Fig. 11, where LAL, CRF, and BOW are the methods of Ellis et al. [12], and Local offset is the method of Lu et al. [26]. From Fig. 11 we can see that our method clearly outperforms the other methods. All methods perform poorly given a small number of frames and well given a large number of frames; however, in the middle range, i.e., from 20 to 40 frames, our approach achieves a much higher accuracy than all other methods. In short, our method can recognize actions at the desired accuracy with lower latency.

Fig. 11 Accuracy vs. state-of-the-art methods over videos truncated at varying maximum frames

5.9 Discussion

Unlike many approaches [14, 29] that use supervised methods, our method is unsupervised. When unsupervised techniques are used to extract features, there is no need to rely on prior knowledge and no data inadaptability problem, since the features are learned from the data. Based on our experiments, the time difference Δt affects the precision and robustness to noise, the number of clusters K affects the discrimination, and the saturation parameter ε balances the effect of each histogram bin. We find that the three representation parameters take different values on different datasets; tuning them is an important task that has a significant effect on the recognition accuracy of our method (see Figs. 5, 6 and 7). We also conclude that each component of our proposed method improves the recognition accuracy. Compared with Lu et al. [26], which is closest to ours, our method not only characterizes both the global and local movements of the joints in an action sequence, but also improves the temporal sequence property (see Table 3). Furthermore, our method is comparable to some methods that employ supervised techniques and complex learning models, and achieves better performance than many other methods on five datasets of different types. The BJUT Kinect dataset contains multi-period video sequences. The UCF Kinect dataset is a relatively large and clean dataset that is generally used to measure the efficiency of methods. The Florence 3D action dataset has many human-object interactions and high intra-class variation. The MSR-Action3D dataset has a great amount of noise and high intra-class variation. The NTU RGB+D dataset is perhaps the largest human action dataset, with a large number of viewpoints and large intra-class variation. The evaluations in terms of efficiency clearly show that our method can recognize actions in real time. It is possible to recognize actions with up to 92% accuracy using only 30 frames, which is good performance compared to state-of-the-art methods (see Fig. 11). Thus, our approach can be used for interactive systems.

However, our method has some limitations. It is a 3D joint-based framework for human action recognition from skeleton joint sequences; therefore, for actions involving human-object interactions, it does not provide any relevant information about objects, and actions with different objects are confused. In the future, this limitation can be alleviated by leveraging complementary information extracted from the depth or color images associated with the 3D joint locations. Besides, if a dataset has both a great amount of noise and high intra-class variation, such as the MSR-Action3D dataset, our method cannot recognize the actions accurately. Further study is needed to determine precisely how important low latency is for these types of abstract actions. Using more detailed joint information is also a direction for future research.

6 Conclusion

This paper presents a novel framework for action recognition focused on computational efficiency. In the framework, an action feature is designed based on the offsets of skeleton joints, including a global offset feature and a local offset feature that intensify the temporal sequence property. A novel histogram representation model based on the global and local offsets of joints is introduced to represent actions while considering the spatial independence of joints. The K-means clustering algorithm is applied to the global and local offset vectors of each joint separately, which achieves higher accuracy than clustering the offset vectors of all joints together. A high-level representation model based on the histogram of occupation frequency is constructed to represent a video sequence, and a saturation scheme is presented to modify the model so that the majority of clusters with low occupation frequencies are not overlooked. Two classifiers based on measuring the histogram-to-class distance are designed, namely NBNN and SRC; they achieve similar recognition accuracies, and NBNN is much faster than SRC. A novel dataset captured with Kinect for the purpose of our experiments, called the BJUT dataset, and four publicly available datasets, including the UCF Kinect dataset, the Florence 3D action dataset, the MSR-Action3D dataset, and the NTU RGB+D dataset, are used to test our framework. The experiments on these five datasets show that our method is effective and achieves comparable or better performance than the state-of-the-art methods.

In conclusion, the motion feature proposed in this paper is concise and intuitive, and the action representation model is simple and discriminative. However, actions with similar body poses or human-object interactions cannot be recognized precisely by our method. To improve our framework, exploring more discriminative features with low dimensionality is our future work.