
1 Introduction

Human action recognition is one of the most challenging applications in the field of computer vision. It requires inferring an action model from the observation of a motion sequence, hence it requires the solution of an inverse problem [18]. Furthermore, the modelling process is based on several steps tackling, in turn, different sub-problems: data acquisition, motion analysis and segmentation into individual actions, alignment between sequences, and classification with respect to a given taxonomy. All these steps are computationally expensive, while ideally recognition should be performed online.

In this paper we address the alignment and classification part of the complete pipeline. Namely, we assume that a sequence capturing an individual action is already available and the task is to recognize the performed action. To this end we introduce a model based on the Back-Constrained GP-LVM introduced in [9, 10], and extend it to action recognition, exploiting the strength of a lower dimensional manifold. In detail, we derive a discriminative, probabilistic dimensionality reduction model for mapping motion capture sequences into a low dimensional latent space which assists the action classification process. The proposed model introduces a latent space featuring a fixed set of actions and constrains feature distances in data space to be suitably projected into the latent space, in order to preserve the clustering of common patterns. Actions are represented as sequences of poses, which can be taken from motion capture (MoCap) data. This projection ensures a discriminative power for the GP-LVM model and it also exploits the peculiar property of action sequences of being reducible to a lower dimensional manifold [17].

In Sect. 2 we briefly review recent works on pose-based action recognition and dimensionality reduction, showing the major trends of research in this field. In Sect. 3 we overview the theoretical foundation of GP-LVM on which our model is based. In Sect. 4 we present our discriminative model. Section 5 demonstrates the latent space structure recovered by the proposed model and examines its performance on human action classification. We compare our method with a sequence classification method based on Dynamic Time Warping, as well as with the Variational Gaussian Process Dynamical Systems [6] recently proposed for modelling high dimensional dynamical systems. We conclude the work by addressing possible extensions.

2 Related Work

In this section we review some of the main approaches to action recognition, focusing on those which rely on manifold learning or treat the problem of action recognition in MoCap sequences.

Many techniques have been proposed in the literature regarding action recognition, with stochastic, volumetric and non-parametric models being the most commonly employed. Detailed reviews of the techniques considered in the research on human motion analysis and action recognition can be found in [1, 12, 26]. Several works address the problem of modelling and recognizing human motion by learning the structure of the low dimensional manifold where it resides, and by recovering a mapping between the high dimensional observations and this manifold.

In [7] the authors consider MoCap sequences and learn the structure of a one-dimensional smooth manifold by applying the tensor voting technique [13]. A motion distance score is used to compute the similarity between the actions recorded in two different sequences. The setting also makes it possible to compare actions extracted from videos with actions taken from MoCap sequences.

In [34] the authors consider a two dimensional manifold with a toroidal topology in order to estimate human motion. They build on the idea of Gaussian Process Latent Variable Models (GP-LVM) [9] to identify a manifold which jointly captures gait and pose, via three different models. They introduce a new model (JGPM) which they compare to two constrained latent variable models based on GP-LVM and Local Linear GP-LVM [29] respectively.

In [23] the authors propose a non-linear generative model for human motion data that considers binary latent variables. The introduced architecture makes on-line inference efficient and allows for a simple approximate learning procedure. The method's performance is evaluated by synthesizing various motion sequences and by performing on-line filling in of data lost during motion capture.

Following a different perspective, in [21] the authors explore the space of actions, spanned by a set of action-bases, to identify some action invariants with respect to viewpoint, execution rate and the subject's body shape. Action recognition is performed for four different kinds of actions (sitting, standing, running and walking) and the results show that it is possible to correctly classify most of these actions using the proposed method.

The redundancy of the original representation of MoCap sequences is also exploited in [11], where a compressive sensing method is introduced. Here the authors argue that human actions are sparse in the action space domain as well as in the time domain, and they therefore seek a sparse representation. The introduced sparse representation can assist in different applications regarding MoCap data, such as motion approximation, compression, action retrieval and action classification.

Finally, in [32] (see also [30, 33]) the authors examine whether and to what extent the use of information about the subject's pose assists recognition. In this case, several pose-based features are used, based on the relative pose features introduced in [14, 15]. Their results suggest that knowing the pose of the subject leads to better results in terms of classification rate. It is also shown that pose-based features alone are usually sufficient, as their combination with appearance-based features usually does not lead to a higher classification rate.

3 Gaussian Process Latent Variable Models

In this section we review Gaussian Process Latent Variable Models [9]. A Gaussian process is a collection of random variables such that any finite collection of them has a joint Gaussian distribution [19]. Namely, a random variable of a Gaussian process is \(f(\mathbf {x}_i)\sim \mathcal {GP}(\mu (\mathbf {x}_i), k(\mathbf {x}_i,\mathbf {x}_j))\), with \(\mu \) and \(k(\mathbf {x},\mathbf {x}')\) the mean and covariance function of the process respectively, indexed over the set \(\mathcal {X}\) of all the possible inputs. The Gaussian process is a non-parametric prior for the random variable \(f(\mathbf {x}_i)\), where \(\mathbf {x}_i\) is the deterministic input. Gaussian processes have been successfully used for both regression and classification tasks.
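As a concrete illustration of these definitions (ours, not part of [9] or [19]), the following minimal Python sketch builds an RBF covariance matrix and draws sample functions from a zero-mean GP prior; the hyperparameter values are arbitrary placeholders.

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """RBF covariance k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# Draw three sample functions from a zero-mean GP prior at 100 inputs.
x = np.linspace(-3.0, 3.0, 100)[:, None]
K = rbf_kernel(x, x) + 1e-8 * np.eye(100)     # jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(100), K, size=3)
```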

In [9] the author shows that Principal Component Analysis (PCA) can be interpreted as a product of Gaussian processes mapping latent-space points to points in data-space when the covariance function is linear; when instead a non-linear covariance function is used, such as an RBF kernel, the mapping is non-linear. Lawrence shows the advantages of using Gaussian Process Latent Variable Models (GP-LVM); for example, for optimization purposes, the data can be divided into an active and an inactive set, according to some rule. Then, because points in the inactive set project into the data-space as Gaussian distributions, the likelihood of each of these data points can be optimized independently.

In addition to the advantage in terms of visualization and computational efficiency highlighted in [9], GP-LVM turns out to be a powerful unsupervised learning algorithm. Indeed, GP-LVM can manage, via the non-linear mapping of the latent variables to the data-space, noisy or incomplete input data, when Gaussian processes are used as non parametric priors for them.

At this point, we introduce some preliminary definitions that we will refer to throughout the following sections.

Let \(\mathbf {Y}\) be the normalized data in \({\mathbb R}^{N\times d}\), for example specifying the pose of a subject in space with respect to a coordinate frame; let \(\mathbf {X}\in {\mathbb R}^{N\times q}\), with \(q\le d\), be the mapped positions in latent-space. Let \(f\) be a mapping, such that:

$$\begin{aligned} y_{nj}= f(\mathbf {x}_n,\mathbf {w}_j)+\epsilon _{nj}, \end{aligned}$$
(1)

Here, \(y_{nj}\) is the observed element at the \(n\)th row and \(j\)th column of \(\mathbf {Y}\), \(\epsilon _{nj}\) denotes the noise affecting the mapping, \(\mathbf {x}_n\) is the \(n\)th row of \(\mathbf {X}\), and \(\mathbf {w}_j\) are the parameters of the mapping \(f\). Placing a Gaussian process prior on \(f\), the same for each of the \(d\) functions, one obtains [9]:

$$\begin{aligned} p(\mathbf {Y}|\mathbf {X},\theta ) = \prod _{j=1}^d \mathcal {N}(\mathbf {y}_j|\mathbf {0},\mathbf {K}) \end{aligned}$$
(2)

Here, \(\mathbf {y}_j\) is the \(j\)th column of \(\mathbf {Y}\) and \(\mathbf {K}\) is the \(N\times N\) kernel matrix of the Gaussian process. We see that (2) expresses a conditional independence of the data-space dimensions, given the latent space representation.

Learning amounts to maximizing, with respect to the positions of the latent variables \(\mathbf {X}\) and the kernel parameters \(\theta \), the log likelihood (written here up to additive constants):

$$\begin{aligned} L(\mathbf {X},\theta ) =\displaystyle { - \frac{d}{2} \log |\mathbf {K}|-\frac{1}{2} \mathrm {Tr}\left( \mathbf {K}^{-1}\mathbf {YY}^{\top } \right) } \end{aligned}$$
(3)

In order to optimize the non-linear model, it is necessary to initialize it with appropriate initial values for the positions of the latent-space points, as well as for the hyperparameters of the model. Optimization proceeds by iterative minimization of the negative log likelihood with a gradient based algorithm. As the model is non-linear, the objective is subject to local minima, so the initialization of the positions of the latent-space points is crucial. When non-linear dimensionality reduction methods are used for the initialization, such as locally linear embedding (LLE) [20] or ISOMAP [24], the structure of the manifold is expected to be more accurately recovered. GP-LVM have been exploited in many applications, for example in [27–29, 31].
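As a minimal sketch of this procedure (ours; practical implementations optimize the kernel hyperparameters jointly and use analytic gradients), the objective of Eq. (3) can be written and minimized as follows, with linear PCA standing in for the non-linear initializations discussed above:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def neg_log_likelihood(x_flat, Y, q, variance=1.0, lengthscale=1.0, noise=1e-2):
    """Negative of Eq. (3), up to additive constants; RBF kernel plus noise."""
    N, d = Y.shape
    X = x_flat.reshape(N, q)
    K = variance * np.exp(-0.5 * cdist(X, X, "sqeuclidean") / lengthscale**2)
    K += noise * np.eye(N)
    L = np.linalg.cholesky(K)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))          # log |K|
    Kinv_Y = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # K^{-1} Y
    return 0.5 * d * log_det + 0.5 * np.trace(Y.T @ Kinv_Y)

def fit_gplvm(Y, q=2):
    """PCA initialization of the latent points, refined by L-BFGS
    (gradients approximated by finite differences, for brevity only)."""
    Yc = Y - Y.mean(axis=0)
    U, S, _ = np.linalg.svd(Yc, full_matrices=False)
    X0 = U[:, :q] * S[:q]
    res = minimize(neg_log_likelihood, X0.ravel(), args=(Yc, q),
                   method="L-BFGS-B", options={"maxiter": 100})
    return res.x.reshape(len(Y), q)
```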

4 Discriminative Sequence Back-Constrained GP-LVM

As mentioned in the previous sections, models from the family of GP-LVM methods are well suited for predicting missing values or missing samples of time sequences. However, they do not seem to perform equally well when used for clustering and classification problems, particularly for time-series data. This drawback of the classical GP-LVM methods can also be witnessed by observing that it is hard to recover the structure of a common latent-space for a set of sequences: their latent space representations are scattered across the latent-space and no relation can be drawn between sequences corresponding to the same action. This is due to the fact that standard GP-LVM models do not provide a mechanism to encourage points belonging to the same class to be placed closer to each other in the latent-space, and the same holds at the level of individual sequences.

Local distances can be directly used in GP-LVM to provide a common latent-space representation that is well suited for classification purposes. In fact, local distances in data-space provide some information regarding the intra-class variation. Lawrence and Quiñonero-Candela in [10] have introduced the Back-Constrained GP-LVM, which considers local distances in the data-space. The GP-LVM model uses a product of Gaussian processes to map from the latent-space to the data-space. Each of these processes refers to a different dimension of the data-space and is governed by the coordinates of the latent points. In order to obtain a smooth mapping in the opposite direction, the authors in [10] propose to construct this mapping by means of a kernel based regression. Adopting this technique, the latent points are constrained to be the image of a smooth mapping from the data-space. This forces small distances in data-space to lead to small distances between the corresponding points in the latent-space. The smoothness of the mapping from the data-space to the latent-space is determined by the kernel function. Using this mapping, no new optimization is needed to approximate the latent-space representation of new data.

The previous method cannot be directly applied to data originating from sequences, as individual elements of a sequence are not expected to provide sufficient information regarding the characteristics of the entire sequence. Building on the same principle, namely the use of local distances in the data-space as back-constraints, we formulate a GP-LVM variant which considers entire sequences rather than individual data points.

Before introducing our model, we briefly review the Dynamic Time Warping (DTW) algorithm, as well as a set of sequence alignment kernels based on DTW and its variants, which will be used for the derivation of our model.

4.1 Dynamic Time Warping and Sequence Alignment Kernels

Dynamic Time Warping is used to match two time dependent sequences by nonlinearly warping one sequence onto the other. Let us consider two vector sequences \(\mathbf {Y}=\left( \mathbf {y}_{1},\ldots , \mathbf {y}_{N} \right) \) with \(N\in \mathbb {N}\) and \(\mathbf {Z}=\left( \mathbf {z}_{1},\ldots , \mathbf {z}_{M} \right) \) with \(M\in \mathbb {N}\). Each vector in the sequences belongs to an \(n\)-dimensional feature space \(\mathcal {F}\), so \(\mathbf {y}_{n},\mathbf {z}_{m}\in \mathcal {F}\). A local distance measure is defined to compare a pair of features, provided by an appropriate kernel function:

$$\begin{aligned} \kappa :\mathcal {F}\times \mathcal {F}\rightarrow \mathbb {R}^{+} \end{aligned}$$
(4)

A warping path is a sequence \(p=(p_1,\ldots ,p_L)\) where each element is a pair of indices \(p_l=(n_l,m_l)\). The total cost of a warping path \(p\), according to the predefined distance measure, is:

$$\begin{aligned} c_{p}(\mathbf {Y},\mathbf {Z}) = \displaystyle {\sum _{l=1}^{L}\kappa (\mathbf {y}_{n_{l}},\mathbf {z}_{m_{l}})} \end{aligned}$$
(5)

The Dynamic Time Warping distance between two sequences is defined as the minimal total cost among all possible warping paths. To obtain this value we have to solve the following optimization problem:

$$\begin{aligned} DTW(\mathbf {Y},\mathbf {Z}) = \underset{p}{\min }\left\{ c_{p}(\mathbf {Y},\mathbf {Z})\right\} \end{aligned}$$
(6)

We can also identify an optimal warping path (not necessarily unique):

$$\begin{aligned} p^{*} = \underset{p}{\arg \min }\left\{ c_{p}(\mathbf {Y},\mathbf {Z})\right\} \end{aligned}$$
(7)

The DTW distance is well-defined even though there may exist many warping paths of minimal total cost. Moreover, it is symmetric if the local cost measure is symmetric, but it is not a proper metric, as it does not satisfy the triangle inequality. In order to apply DTW to MoCap sequences, we must first define the local cost measure \(\kappa \). Two popular choices are the sum of the geodesic distances between the unit quaternions representing the joint angles, and the optimal alignment distance between the three dimensional positions of the joints [14].
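For concreteness, a textbook dynamic-programming implementation of Eqs. (5) and (6) is sketched below (ours); the Euclidean per-frame cost is only a placeholder for the quaternion or joint-position measures just mentioned.

```python
import numpy as np

def dtw_distance(Y, Z, kappa=None):
    """DTW distance between sequences Y (N x d) and Z (M x d), Eq. (6).

    kappa is the local cost measure of Eq. (4); Euclidean distance is used
    here as a placeholder for the quaternion/joint-position costs in the text.
    """
    if kappa is None:
        kappa = lambda y, z: np.linalg.norm(y - z)
    N, M = len(Y), len(Z)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            cost = kappa(Y[n - 1], Z[m - 1])
            # allowed steps: diagonal match, insertion, deletion
            D[n, m] = cost + min(D[n - 1, m - 1], D[n - 1, m], D[n, m - 1])
    return D[N, M]
```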

Based on the notions of the DTW distance and the optimal warping path, alignment kernels have been proposed which consider entire sequences as a whole; as examples we cite [2, 5, 22].

4.2 Sequence Back-Constrained GP-LVM

In this section we show how to enforce a clustering of the sequences in the latent-space, governed by their respective similarity, which will enable a more accurate classification of new sequences. To ensure that data instances which are close to each other in the data-space are mapped to positions which are also close in the latent-space, we apply a similarity measure for comparing different sequences and identify a characteristic feature summarizing each entire sequence.

Here we consider that each frame of a motion sequence is represented as a \(d\)-dimensional array. An entire sequence, with index \(s\), is thus represented as a set of \(d\)-dimensional arrays of cardinality \(L_s\), forming a matrix \(\mathbf {Y}_{s}\in \mathbb {R}^{L_{s}\times d}\). A collection of \(S\) motion sequences is represented as the concatenation of the respective sub-matrices, forming the data-matrix \(\mathbf {Y}\in \mathbb {R}^{N\times d}\), with \(N=\sum _{s=1}^S L_{s}\). Let \(\mathcal {J}_{s}\) be the set of indices of the \(s\)th sequence in the data matrix; the corresponding representations of the data-points in the \(q\)-dimensional latent-space form a matrix \(\mathbf {X}\in \mathbb {R}^{N\times q}\). The \(q\)th coordinate of the centroid of the latent-space representation of the \(s\)th sequence is defined as:

$$\begin{aligned} \mu _{sq}=\displaystyle {\frac{1}{L_s}\sum _{n\in \mathcal {J}_{s}}x_{nq}} \end{aligned}$$
(8)

The likelihood of the GP-LVM model is given by (3). The centroid of the latent positions of its data points is taken to be the characteristic feature of a sequence. Therefore, we require that the local distances between the sequences in data-space, computed via the DTW technique, are preserved in latent-space as distances between the centroids \(\varvec{\mu }_{s}\). Hence, we consider a mapping to the latent-space governed by an alignment kernel \(k\):

$$\begin{aligned} g_{q}(\mathbf {Y}_{s})=\displaystyle {\sum _{m=1}^{S}a_{mq}k(\mathbf {Y}_{s},\mathbf {Y}_{m})} \end{aligned}$$
(9)

The degree to which the local distances in the data-space are preserved depends on the particular characteristics of the kernel employed for the mapping.

We thus have to maximize a constrained likelihood, instead of maximizing the likelihood of the original GP-LVM model. Each of the \(S\cdot q\) constraints can be written as:

$$\begin{aligned} g_{q}(\mathbf {Y}_{s}) - \mu _{sq} = 0 \end{aligned}$$
(10)

By maximizing the constrained likelihood of the model, we expect to obtain a latent-space representation where similar sequences are better grouped together than in the representation obtained by the original model. Another important advantage of this approach is that we can use the inverse mapping recovered in the learning phase for fast inference. In this way, we avoid the costly reoptimisation which is otherwise necessary to obtain the latent-space representation of new sequences.
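The following sketch (our illustration, reusing `dtw_distance` from the sketch in Sect. 4.1) shows how the pieces of Eqs. (8)–(10) fit together. The exponentiated-DTW kernel is only a placeholder and is not guaranteed positive definite; the experiments in Sect. 5 use the alignment kernels of [2, 3] instead. Likewise, the least-squares fit shown here merely illustrates the constraint; in the full model the coefficients \(a_{mq}\) parametrize the centroids and are optimized jointly with the likelihood.

```python
import numpy as np

def sequence_centroids(X, seq_indices):
    """Centroid of each sequence's latent points, Eq. (8);
    seq_indices[s] holds the rows of X belonging to sequence s."""
    return np.array([X[idx].mean(axis=0) for idx in seq_indices])

def alignment_kernel_matrix(seqs, gamma=1e-3):
    """Placeholder alignment kernel k(Y_s, Y_m) = exp(-gamma * DTW(Y_s, Y_m))."""
    S = len(seqs)
    K = np.empty((S, S))
    for s in range(S):
        for m in range(S):
            K[s, m] = np.exp(-gamma * dtw_distance(seqs[s], seqs[m]))
    return K

def fit_back_constraint_weights(K_seq, centroids, jitter=1e-8):
    """Coefficients a_{mq} of Eq. (9) such that g_q(Y_s) = mu_{sq}, Eq. (10)."""
    return np.linalg.solve(K_seq + jitter * np.eye(len(K_seq)), centroids)
```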

Up to this point, we have not considered the labels of the sequences. In the following section, we modify our model by replacing the Gaussian prior with a prior which makes the model more discriminative.

4.3 Discriminative Sequence Back-Constrained GP-LVM

The Discriminative GP-LVM (D-GPLVM) was originally introduced in [27]. In order to make the Sequence Back-Constrained GP-LVM (SB-GPLVM) model more discriminative, we can consider a measure of the between-group separation and of the within-group variation. Following Fisher's Discriminant Analysis, when we need to estimate a linear projection of the data such that an optimal separation is achieved, we maximize the ratio of the between-group sum of squares to the within-group sum of squares.

We thus seek the direction of projection given by the vector \(\mathbf {a}\) which provides a good separation of the data. Denoting by \( \mathbf {X}=[\mathbf {x}_1,\ldots ,\mathbf {x}_N]^{\mathrm {T}}\) the low dimensional representation of the data points \(\mathbf {Y}=[\mathbf {y}_{1},\ldots ,\mathbf {y}_{N}]^{\mathrm {T}}\), the between-group sum of squares is given as:

$$\begin{aligned} \mathbf {a}^{\mathrm {T}}\mathbf {B}\mathbf {a} = \displaystyle {\sum _{c=1}^{C}\frac{N_{c}}{N}\mathbf {a}^{\mathrm {T}}(\varvec{\mu }_{c}-\varvec{\mu }_{0})(\varvec{\mu }_{c}-\varvec{\mu }_{0})^{\mathrm {T}}\mathbf {a}} \end{aligned}$$
(11)

The within-group sum of squares is given as:

$$\begin{aligned} \mathbf {a}^{\mathrm {T}}\mathbf {W}\mathbf {a} = \displaystyle {\frac{1}{N}\sum _{c=1}^{C}\sum _{n=1}^{N_{c}}\mathbf {a}^{\mathrm {T}}(\mathbf {x}_{n}^{(c)}-\varvec{\mu }_{c})(\mathbf {x}_{n}^{(c)}-\varvec{\mu }_{c})^{\mathrm {T}}\mathbf {a}} \end{aligned}$$
(12)

Here \(\mathbf {X}^{(c)}=[\mathbf {x}_{1}^{(c)},\ldots ,\mathbf {x}_{N_{c}}^{(c)}]^{\mathrm {T}}\) are the \(N_{c}\) points which belong to the class \(c\), \(\varvec{\mu }_{c}\) is the mean of the elements of class \(c\) and \(\varvec{\mu }_{0}\) is the mean computed across all the points.

The criterion used for maximizing between-group separability and minimizing within-group variability is the following [8]:

$$\begin{aligned} J(\mathbf {X})= \mathrm {Tr} (\mathbf {W}^{-1}\mathbf {B}) \end{aligned}$$
(13)
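A direct transcription of Eqs. (11)–(13) into code (ours, for illustration) reads:

```python
import numpy as np

def discriminative_criterion(X, labels):
    """Between-class scatter B (Eq. 11), within-class scatter W (Eq. 12)
    and the criterion J = Tr(W^{-1} B) of Eq. (13)."""
    N, q = X.shape
    mu0 = X.mean(axis=0)
    B = np.zeros((q, q))
    W = np.zeros((q, q))
    for c in np.unique(labels):
        Xc = X[labels == c]
        Nc = len(Xc)
        diff = (Xc.mean(axis=0) - mu0)[:, None]
        B += (Nc / N) * (diff @ diff.T)
        W += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0)) / N
    J = np.trace(np.linalg.solve(W, B))   # Tr(W^{-1} B) without explicit inverse
    return B, W, J
```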

Based on the previous discussion, in order to make the SB-GPLVM model discriminative, it is necessary to replace the Gaussian prior with a prior which depends on (13). This prior takes the following form:

$$\begin{aligned} p(\mathbf {X}) = \frac{1}{\alpha }\exp \left\{ -\frac{\gamma }{2} J^{-1}\right\} \end{aligned}$$
(14)

where \(\alpha \) is a normalization constant and \(\gamma \) represents the scaling factor of the prior.

The log likelihood associated with the discriminative model becomes:

$$\begin{aligned} L = \displaystyle {-\frac{d}{2}\log |\mathbf {K}|- \frac{1}{2} \mathrm {Tr}\left( \mathbf {K}^{-1} \mathbf {YY}^{\mathrm {T}}\right) - \frac{\gamma }{2}\mathrm {Tr}\left( \mathbf {B}^{-1}\mathbf {W}\right) } \end{aligned}$$
(15)

The parameter \(\gamma \) controls the relative importance of the discriminative prior: according to the value it takes, the model becomes more discriminative or generalizes better.
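Combining the previous sketches, the negative of the objective in Eq. (15) could be evaluated as follows (again a sketch under our earlier assumptions; `K` is the kernel matrix computed on the current latent points, and a small jitter guards against a singular \(\mathbf {B}\)):

```python
import numpy as np

def neg_discriminative_objective(X, Y, labels, K, gamma=1.0, jitter=1e-8):
    """Negative of Eq. (15): GP-LVM data terms plus the discriminative prior."""
    d = Y.shape[1]
    L = np.linalg.cholesky(K)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    Kinv_Y = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    B, W, _ = discriminative_criterion(X, labels)
    prior = 0.5 * gamma * np.trace(np.linalg.solve(B + jitter * np.eye(len(B)), W))
    return 0.5 * d * log_det + 0.5 * np.trace(Y.T @ Kinv_Y) + prior
```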

4.4 Classification Based on D-SBGPLVM

In this section we illustrate how to compute the latent representation of the data points belonging to a new sequence. This allows classifying any new sequence according to the introduced D-SBGPLVM model. Let \(\mathbf {Y}_{*}\) be the data-space representation of a new sequence and \(\mathbf {X}_{*}\) the corresponding latent-space representation. The new sequence's centroid in latent-space can be estimated orders of magnitude faster than \(\mathbf {X}_{*}\) by making use of Eq. (9) introduced in Sect. 4.2. Thus, the coordinates of the test sequence centroid, in each dimension of the latent space, are given by:

$$\begin{aligned} \forall \ q: \; \mu _{*q} = g_{q}(\mathbf {Y}_{*})=\sum _{s=1}^{S}a_{sq}k(\mathbf {Y}_{*},\mathbf {Y}_{s}) \end{aligned}$$
(16)

where \(\mu _{*q}\) is the \(q\)th coordinate of the centroid \(\varvec{\mu }_{*}\) of the test sequence. In this case, no minimization is required and the time necessary for computing the coordinates of the centroid of the test sequence is proportional to the time needed to compute the kernel values.
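Continuing the earlier sketches (placeholder exponentiated-DTW kernel, weight matrix `A` fitted as in Sect. 4.2), Eq. (16) amounts to a single row of kernel evaluations:

```python
import numpy as np

def infer_test_centroid(Y_star, train_seqs, A, gamma=1e-3):
    """Eq. (16): latent centroid of a new sequence via the back-constraints.
    No optimization is run; cost is dominated by the S kernel evaluations."""
    k_star = np.array([np.exp(-gamma * dtw_distance(Y_star, Ys))
                       for Ys in train_seqs])
    return k_star @ A          # vector of coordinates mu_{*q}
```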

At this point, any multi-class classification method can be employed in order to perform classification. As the latent-space has a much smaller dimensionality than the original data-space, classification is expected to be performed more robustly on the latent representation of the sequences. Moreover, the proposed method provides a concise way to classify sequences as a whole, as the model explicitly treats them as individual entities.
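As one possible instantiation (scikit-learn is our choice here; `train_centroids`, `train_labels`, `train_seqs` and `A` are assumed to come from the earlier sketches), the final classification step could look like the following; the experiments in Sect. 5 use SVMs with an RBF kernel in the same spirit.

```python
from sklearn.svm import SVC

# Train an RBF-kernel SVM on the training-sequence centroids and labels,
# then classify a new sequence by its back-constrained latent centroid.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(train_centroids, train_labels)       # S x q centroids, S action labels
mu_star = infer_test_centroid(Y_star, train_seqs, A)
predicted_action = clf.predict(mu_star[None, :])
```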

5 Results

In this section we examine the ability of the Discriminative Sequence Back-Constrained GP-LVM model to provide a latent-space representation suitable for efficient and robust classification of sequences.

Evaluation on the HDM05 “Cuts” Dataset [16]. Part of the “Cuts” sequences contained in the HDM05 dataset has been used for evaluating the proposed model, in comparison to other methods which can be used for sequence classification. This dataset includes the following actions: Clapping hands, 5 repetitions (17 sequences); Hopping on right leg, 3 reps. (12 seqs.); Kick with right foot in front, 2 reps. (15 seqs.); Running on place, 4 steps (15 seqs.); Throwing high with right hand while standing (14 seqs.); Walking starting with right foot, 4 steps (16 seqs.).

The sequences are sampled at a frequency of 120 frames per second. For this dataset, sequences are already accurately segmented so that each contains a single action with the same number of repetitions.

The results of the proposed method are compared with the classification results obtained by directly using the DTW distances of the sequences in the data-space, as well as those obtained from the highest class-conditional densities given by the Variational Gaussian Process Dynamical Systems (V-GPDS) method [6]. All results are obtained by cross-validation: each experiment is performed by keeping all action sequences of one of the five subjects as test sequences and using the sequences of the other four subjects as training instances. Finally, the results are averaged over the five individual experiments.

Table 1 gives the accuracy rate achieved with each of these three methods, for each action as well as on average. For the proposed method, relative features are used and the dimensionality of the latent space is fixed to four. Moreover, for the back-constraints the kernel proposed in [2] is used and the initial positions of the latent points are obtained using the Locally Linear Embedding algorithm [20]. Finally, classification in latent-space is performed by SVMs with the RBF kernel function. Figure 1 shows the corresponding confusion matrix obtained using the D-SBGPLVM model.

Table 1 Comparison of the classification results for the HDM05 “Cuts” dataset
Fig. 1 Confusion matrix obtained using the D-SBGPLVM model in combination with SVM on the HDM05 “Cuts” dataset. Average accuracy: 80.9 %

From the results in Table 1 one can see that our method gives the best results, both for each individual type of action, except for Hop, and on average. We observe that the classification accuracy is relatively high even for the DTW distance alone. This also depends on the fact that this dataset is specifically constructed in such a way that actions of the same kind can be aligned with a very small cost: they are defined at a high level of detail regarding their execution and they have also been accurately segmented manually. Regarding classification of human actions using the V-GPDS model, it is necessary to train a different model for each individual type of action. After a model has been trained for each type of action, it is possible to compute the class conditional densities for the new sequence.

Considering that the analogous model of V-GPDS introduced in [25], which does not consider time dynamics, provides good classification results (e.g. on the USPS Handwritten Digits Dataset), we expected higher classification rates for the adapted V-GPDS model. Searching for the cause of this issue, we have noticed that the models for certain actions tend to provide quite high conditional densities most of the time. Further investigation is needed in this direction, as the experiments performed using V-GPDS were not sufficient to draw safe conclusions, and possibly a more suitable adaptation of the model for classification purposes is needed.

In the case of D-SBGPLVM, the model is trained by optimising the latent coordinates of the sequences and the hyper-parameters of the model using all training sequences. Through the optimisation process we also recover the parameters of the kernel based regression, which forms the inverse mapping from the data-space to the latent-space. We provide some examples of two-dimensional latent-spaces recovered by training the model on sequences of the HDM05 “Cuts” dataset in Fig. 2. In these figures, each color corresponds to a different class of action, crosses are the latent representations of the individual data points, triangles correspond to the centroids of the training sequences and squares correspond to the estimated positions of the test sequence centroids computed using the back-constraints. In Fig. 2 the recovered latent-spaces are shown for three different types of representations considered for the sequences, using Probabilistic PCA to retrieve initial values for the latent points. In the cases of Euler angles and unit quaternions, one can notice that different sequences are placed on top of each other and thus we expect classification rates to be low.

Fig. 2 Left column: latent-space representation with PPCA initialization, considering Euler angles (top), unit-quaternion (middle) and 3D point cloud (bottom) representations. Right column: latent-space representation considering the relative features representation, with PPCA (top), LLE (middle) and ISOMAP (bottom) initialization

Our interpretation is that this mainly depends on the high non-linearity of the data-space and on the fact that PPCA, being a linear dimensionality reduction technique, is not able to provide suitable initial values for the latent points. As our model is non-linear and is optimized using a gradient based algorithm, it is susceptible to local minima. However, in the case of the 3D point cloud representation, the data-space does not show excessive non-linearity and even PPCA initialization seems sufficient to recover a better structure for the latent-space.

The case of relative features (as in [14], but without discretization based on some threshold) is also examined in Fig. 2. Relative features include, for example, the distance between two specified joints, the distance of a joint from the plane defined by three others, the angle between two successive joints, etc. Here we can better observe the impact of the initialization technique on the resulting structure of the latent-space. It is evident that the use of more sophisticated non-linear dimensionality reduction techniques to obtain the initial values helps recovering a better structure of the common latent-space.

Evaluation on actions of the CMU Dataset [4]. Seven actions from the CMU dataset have also been considered for evaluating the proposed model. This dataset includes the following actions: Walking (15 sequences); Running (15 seqs.); Jumping (15 seqs.); Sitting-Standing (7 seqs.); Throwing-Tossing (15 seqs.); Boxing (9 seqs.); Dancing (9 seqs.).

Each of these actions is performed by a different actor. Moreover, the actions have not been hand-picked, and their labels rely only on the default labelling provided by the publishers of the dataset. Finally, the motion sequences have not been manually segmented; instead, we perform classification by considering just the first two seconds of each sequence. For these reasons, this dataset represents a more challenging and realistic instance of the action recognition problem. Five-fold cross-validation has been used here for obtaining the final classification results.

The classification accuracy achieved by the proposed method, compared with the results of the DTW distances and of the V-GPDS method, is provided in Table 2. Here, Euler angles are considered as the features provided to the D-SBGPLVM, while the rest of the setting is the same as the one described for the “Cuts” experiments. In Fig. 3 we provide the corresponding confusion matrix and the overall classification rate when the D-SBGPLVM model is used.

Table 2 Comparison of the classification results for the actions taken from CMU dataset

We can observe here that the results for the CMU dataset are analogous to the ones corresponding to the “Cuts” dataset. We expect that the lower rates achieved in general by all algorithms mainly depend on the particular difficulties which characterise this dataset, as mentioned above. Considering these difficulties, one can see that the proposed model gives satisfying classification results. This also demonstrates the generalization capabilities of the proposed probabilistic model, which lead to an overall accuracy exceeding that achieved by the other two methods considered here.

Fig. 3 Left: confusion matrix obtained using the D-SBGPLVM model in combination with SVM on the CMU dataset; average accuracy: 72.9 %. Right: sorted ARD covariance function values obtained after training the model on the same dataset using the RBF-ARD kernel; average accuracy: 74.1 %

The same experiments were also performed considering the recently proposed ‘path kernel’ [3], providing equivalent results. The classification rate was slightly lower, but this may be related to the particular selection of the parameters of the kernel. Moreover, we performed trials using the automatic relevance determination (ARD) squared exponential kernel, as in [6, 25]. In this case, considering eight dimensions for the latent space, we obtained a classification rate of 74.1 % for the CMU dataset. What is important to note here are the values of the relative importance of each dimension after training the model, shown in Fig. 3: one can see that most of the information about the actions is embedded in a four dimensional sub-manifold. This result is in accordance with the ones reported in [17].

6 Conclusions

In this paper, we have introduced a novel GP-LVM variant to recover the structure of a lower dimensional manifold for a set of sequences of different action types. We have shown that the model attains increased classification accuracy by working in the low dimensional latent-space instead of the original data-space. By exploiting the inverse mapping from the data-space to the latent-space, our approach is able to infer the class of a new sequence within a few seconds (Matlab implementation tested on the following system: 2.2 GHz quad-core AMD Phenom, 4 GB RAM). This provides a crucial advantage with respect to other GP-LVM models, which require several minutes to complete this task, having to deal with a new optimization to obtain the latent-space representation of the new data instances. We have further shown that the proposed D-SBGPLVM model attains classification rates equivalent to the current state-of-the-art when combined with a standard classifier, such as an SVM, for classification in the latent-space.

Among the directions of our future work, we consider the combination of the proposed method with a pose recovery algorithm. In this way, it would be possible to train the model using action sequences taken from a MoCap dataset and classify sequences recovered from videos by means of the pose recovery algorithm, making action recognition from 2D video sequences also possible. Finally, we are currently considering automated ways of segmenting motion sequences into sub-sequences of individual actions, without prior knowledge of the actions performed. This step is important for allowing the processing of sequences containing multiple actions with the method described in this work.