1 Introduction

3D human pose estimation from images is a challenging but important research topic with applications in many areas, including Human-Computer Interaction [26], robotics, surveillance, computer graphics and sport science. Recent approaches to 3D human pose estimation can be roughly classified into two categories, generative and discriminative. Generative approaches explicitly model human body appearance and kinematic constraints, and usually concentrate on the development of efficient inference methods that can handle the high dimensionality of human pose. Discriminative approaches directly learn the mapping from image space to pose space.

Generative approaches estimate the pose by building a geometric model of the human body and evaluating how well the model agrees with the body appearance observed in the image. However, recovering the pose involves heavy inference because pose estimation is transformed into a complex optimization problem [5, 9, 16, 20, 28, 31].

Discriminative approaches are popular due to their flexibility in choosing image descriptors, easy adaptation to different learning methods, lack of need for initialization, and, most importantly, their ability to perform fast inference on real-world databases [27]. The main goal of discriminative 3D human pose estimation is to learn a nonlinear mapping from image descriptors to 3D human pose configurations. This is challenging due to the high dimensionality and multimodality of the mapping. Moreover, the mapping is highly noisy because of image ambiguities and subject variations.

In this paper we present a novel discriminative framework that can learn a complex mapping from image descriptors to 3D human pose configurations. We propose a local online approach that selects the most informative and helpful training examples for the query frame and groups them into motionlets. As depicted in Fig. 1, every motionlet consists of training examples that cover a local area with respect to image space, pose space and time stream. The concept of motionlets is a natural embodiment of the local motion similarity of human motion, which is the basic assumption underlying discriminative human pose estimation. We take advantage of the Locality-constrained Linear Coding (LLC) algorithm [13] to reconstruct 3D human poses using motionlets as codebooks. LLC offers an efficient, locally smooth, sparse projection of an image descriptor into its local coordinate system with good reconstruction. Each motionlet contributes a candidate pose, and we handle the problem of multimodality by selecting the most appropriate pose from these candidates. To further eliminate inference ambiguities, we extend our framework to incorporate multiple views and obtain an accurate and robust inference from image descriptors to 3D human poses.

Fig. 1

Motionlets for a query frame. Each motionlet consists of training examples that cover a local region in terms of appearance space, pose space and time stream

This work extends our previous research in [29] with more technical details, experimental comparisons and analysis. In the following sections, we first review related work, and then present our online framework of motionlet LLC coding. We define local neighborhoods for a query frame, and then show that the multimodality of the mapping is mainly caused by the multiple instances of motionlets. We demonstrate how to choose among the candidate poses recovered by LLC coding and how to incorporate multiple views into our framework. Finally, we show qualitative results on our Taichi data set and quantitative results on the HumanEva-I data set [23].

2 Related work

Discriminative approaches to human pose estimation provide fast inference because they directly learn the mapping from image descriptors to human pose configurations, avoiding heavy likelihood inference. Various image descriptors can be flexibly incorporated into discriminative human pose estimation. These descriptors are usually based on silhouettes [1, 2, 6, 7, 11], gradients [4, 18], or edges [3, 17, 22, 25], and differ in computational complexity, descriptor dimensionality, robustness against clutter, discriminating power and generalization ability. Notably, many of these image descriptors require accurate localization of the subject or background segmentation; as a consequence, the quality of segmentation has a substantial influence on the performance of a pose estimation algorithm. Alternatively, hierarchical multi-level image descriptors such as HMAX [14, 21], spatial pyramids [14], and vocabulary trees [14] can be used without localization or segmentation. In this work, we choose HMAX [14, 21] as the image descriptor based on two considerations. First, incorporating background segmentation or a bounding box of the subject as a preprocessing step incurs additional computation, and the overall pose estimation performance degrades as the background segmentation or human detection deteriorates; descriptors that do not require accurate localization of the subject or background segmentation are therefore preferred. Second, hierarchical multi-level image descriptors like HMAX provide some robustness against background clutter. In the proposed framework, HMAX serves as an appearance cue of body pose, working together with other cues encoding spatial and temporal constraints.

In the existing literature, various methods can be adopted to learn the mapping from image descriptors to human pose configurations, ranging from nearest-neighbor retrieval [22] and manifold embedding [7, 14] to linear/nonlinear regression [1, 33] and probabilistic mixtures of predictors [15, 24]. As mentioned in the introduction, discriminative approaches have to model the multimodality of the high-dimensional nonlinear appearance-to-pose mapping. Usually, this multimodality is represented by a mixture of models, such as a Bayesian mixture of experts (BME) [15, 24], a mixture of probabilistic PCA [10], or a mixture of multi-layer perceptrons [19]. In [8, 30, 34, 35], mixtures of local Gaussian process experts were used, where multimodality was handled by expert selection. Different from previous strategies, we propose to model the multimodality of the mapping by motionlets, each of which contains training examples that cover a local area with respect to image space, pose space and time stream. As will be shown in the following sections, the main cause of the multimodality of the appearance-to-pose mapping is that there usually exist multiple instances of motionlets for the query frame. Our solution deals with this multimodality directly and efficiently by selecting among the candidate poses contributed by these motionlets.

Recently, Local Coordinate Coding (LCC) has demonstrated promising results on learning the local geometry of data points [32]. As a variant of LCC, Locality-constrained Linear Coding [13] offers an efficient local smooth sparse projection of an image descriptor into its local-coordinate system with good reconstruction. In our discriminative pose estimation framework, we take advantage of LLC to reconstruct 3D human poses for the query frame using motionlets as codebooks.

3 Motionlet LLC coding for human pose estimation

3.1 Local motion similarity

There are three aspects of human motion that should be taken into consideration for human pose estimation: appearance, pose, and time. One key property of human motion is local motion similarity with respect to appearance, pose and time, which is the basic assumption of all discriminative human pose estimation methods. From this point of view, the improved accuracy and efficiency of recent local approaches [8, 30, 34] should be attributed to their effective and efficient use of the local motion similarity of human motion.

An embodiment of local motion similarity is shown in Fig. 2. Local motion similarity is reflected by the similarity among training/validation frames within a local region in terms of image space, pose space and time stream, which appears as dark strips along the downward 45° diagonal in the affinity matrices of both the HMAX image descriptors and the pose vectors. Note that several dark strips along the upward 45° diagonal appear in the affinity matrices of the HMAX image descriptors; these are caused by ambiguities of the HMAX descriptors. Moreover, the affinity matrices of the pose vectors are clean and smooth, while those of the HMAX image descriptors are noisy and jittery. We believe it would be interesting to develop a criterion for good image descriptors for discriminative human pose estimation based on this observation: intuitively, the more the affinity matrix of an image descriptor resembles the corresponding affinity matrix of the pose vectors, the better that descriptor is for pose estimation. We leave this for future study.
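To make the construction of the affinity matrices in Fig. 2 concrete, the sketch below computes them as plain pairwise Euclidean distance matrices between two frame sequences. The array names (hmax_train, poses_train, etc.) are hypothetical placeholders for per-frame HMAX descriptors and flattened pose vectors; this is an illustration, not our released code.

```python
import numpy as np

def affinity_matrix(A, B):
    """Pairwise Euclidean distances between rows of A (M x D) and rows of B (N x D).

    Dark (small) values correspond to similar frames; diagonal strips indicate
    locally similar sub-sequences, as in Fig. 2.
    """
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped to avoid tiny negatives
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.sqrt(np.clip(sq, 0.0, None))

# Hypothetical inputs: hmax_train, hmax_val with shape (N, D_hmax);
# poses_train, poses_val with shape (N, 3*J).
# aff_desc = affinity_matrix(hmax_train, hmax_val)   # noisy, jittery (odd rows of Fig. 2)
# aff_pose = affinity_matrix(poses_train, poses_val) # clean, smooth (even rows of Fig. 2)
```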

Fig. 2

Affinity matrices for camera 1 of (1) the subject 1 training walking sequence, (2) the subject 1 training walking sequence vs. the validation walking sequence, (3) the subject 1 training walking sequence vs. the subject 2 training walking sequence, (4) the subject 2 training jog sequence, (5) the subject 2 training jog sequence vs. the validation jog sequence, and (6) the subject 2 training jog sequence vs. the subject 3 training jog sequence of the HumanEva-I data set. Odd rows show affinity matrices of HMAX image descriptors, and even rows show affinity matrices of ground truth poses represented as vectors. Dark values stand for small distances. The white bar areas in the affinity matrices are caused by the absence of mocap data. Note that subjects 1–3 are persons of different genders wearing visually different clothing

It is also worth mentioning that there are multiple dark strips along the downward 45° diagonal in the affinity matrices, which causes the problem of multimodality when one tries to learn the mapping from appearance space to pose space. Usually, the training examples of a data set for human pose estimation (like HumanEva [23]) contain samples of multiple subjects performing the same action, and/or one subject performing the same action multiple times, and/or multiple actions sharing similar poses. All of these contribute multiple modes to the conditional distribution when estimating the pose from an image, and they are a major source of ambiguity. In the following text, we introduce the concept of motionlets to address this problem.

3.2 Motionlets for human pose estimation

As discussed previously, local motion similarity plays a very important role in discriminative human pose estimation. The concept of motionlets for human pose estimation is a natural embodiment of local motion similarity of human motion. We now formulate the definition of motionlets in the context of discriminative human pose estimation.

Let $X = (F,P)$ be a training sequence, where $F = [f_1, f_2, \cdots, f_N]$ consists of image descriptors (e.g. HMAX) of the image sequence of length $N$, and $P = [p_1, p_2, \cdots, p_N]$ contains the corresponding ground truth poses. Given a query frame with image descriptor $f_q$, there exist one or more motionlets, denoted by $M = [M_1, \cdots, M_T]$ with $T \ge 1$. As shown in Fig. 1, each motionlet covers a local region of training examples in terms of appearance space, pose space and time stream, and is given by

$$ M_i = (F_i,P_i), i\in\{1,\cdots,T\} $$
(1)
$$ F_i = [f_{a_i},\cdots,f_{b_i}] $$
(2)
$$ P_i = [p_{a_i},\cdots,p_{b_i}]. $$
(3)

$a_i$ and $b_i$ are the head and tail indices of the motionlet, respectively, and must satisfy the following conditions:

  1. $a_i < b_i$, $a_i, b_i \in \{1,2,\cdots,N\}$;

  2. $\forall j \in \{a_i,\cdots,b_i\}$, $d(f_q,f_j) < \delta$;

  3. $a_i - 1 < 1$ or $d(f_q,f_{a_i-1}) > \delta$;

  4. $b_i + 1 > N$ or $d(f_q,f_{b_i+1}) > \delta$,

where $d$ is a distance function and $\delta$ is a threshold determining how close the image descriptors of the motionlet must be to that of the query frame. Here we use the Euclidean distance.
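The following is a minimal sketch of motionlet extraction for a single training sequence under the above conditions, assuming descriptors and poses are stored as NumPy arrays; the function and variable names are illustrative only.

```python
import numpy as np

def extract_motionlets(f_q, F, P, delta):
    """Extract motionlets from one training sequence for a query descriptor.

    F: (N, D) image descriptors, P: (N, 3*J) ground-truth pose vectors.
    A motionlet is a maximal contiguous run of frames whose descriptors lie
    within distance delta of the query descriptor f_q (conditions 1-4 above).
    """
    close = np.linalg.norm(F - f_q, axis=1) < delta    # condition 2
    motionlets, start = [], None
    for j, ok in enumerate(close):
        if ok and start is None:
            start = j                                  # head index a_i (condition 3)
        elif not ok and start is not None:
            if j - start > 1:                          # condition 1: a_i < b_i
                motionlets.append((F[start:j], P[start:j]))
            start = None
    if start is not None and len(F) - start > 1:       # run reaches the end (condition 4)
        motionlets.append((F[start:], P[start:]))
    return motionlets
```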

Combining the definition of motionlets with the local motion similarity of human motion, it is easy to see that there are usually multiple motionlets for a query frame: there are usually several training sequences, each of which usually contains several motionlets for the query frame. As mentioned in the introduction, these multiple instances of motionlets are the main cause of the multimodality of the mapping in discriminative human pose estimation. By making use of the concept of motionlets, this multimodality can be handled more directly.

As depicted in Fig. 3, multiple motionlets can be extracted from the training data for a query frame. Specifically, the query frame is from the subject 1 test walking sequence and the motionlets are from the subject 1 training walking sequence, the subject 1 validation walking sequence, the subject 2 training walking sequence and the subject 3 training walking sequence, respectively. The motionlets are subsequences of training samples whose appearance is close to that of the query frame, and they provide more valuable information about the true pose of the query frame than the remaining training samples.

Fig. 3

Example motionlets extracted for a query frame. Best viewed zoomed in, to see the small body pose differences within each motionlet, such as the distance between the subject's feet

Organizing training data into motionlets is beneficial in two ways. First, it naturally encodes the time-sequential prior of human motion. This is helpful even when we are recovering pose from a single image, because the resulting estimate is guaranteed to be based on sets of coherent training samples rather than groups of irrelevant ones. Second, it reduces the number of training samples that must be considered when recovering the pose, which lowers the computational expense and benefits inference on large data sets.

We have formulated the concept of motionlets, explained how it relates to the multimodality of the mapping from appearance space to pose space in discriminative human pose estimation, and analyzed the benefits of incorporating motionlets. In the following text, we show how to integrate motionlets into a discriminative pose estimation framework.

3.3 Locality-constrained linear coding with coupled codebooks

In order to learn the high-dimensional nonlinear mapping from appearance space to pose space, it is essential to capture the relation between the two spaces. Recently, Local Coordinate Coding (LCC) has shown promising results on learning the local geometry of data points [32]. Yu et al. [32] confirm that locality is essential when fitting a nonlinear function on a manifold. They show that a high-dimensional nonlinear function can be approximated by a linear function of local codings, and propose a new method called Local Coordinate Coding: points on the manifold are expressed as coordinates with respect to a set of local anchor points of lower dimensionality. Wang et al. [13] present a fast implementation of LCC called Locality-constrained Linear Coding (LLC), which introduces a locality penalty term into the coding objective.

Assuming a dictionary with $N$ bases $B \in \mathbb{R}^{Q \times N}$ is known, for a given data point $x \in \mathbb{R}^Q$, locality-constrained linear coding finds the coding $w \in \mathbb{R}^N$ that minimizes the reconstruction error together with the violation of the locality constraint. Formally, this process can be formulated as optimizing the following objective function:

$$ \begin{array}{rl} \label{eq:llc} \min_{w} & \| x - B w \|^2 + \lambda \sum_{i=1}^N Dist_i \, w_i, \\ \mathrm{s.t.} & \sum_{i=1}^N w_i = 1 \end{array} $$
(4)

where $Dist_i = \exp\left(\frac{\| x - B_i \|^2}{\sigma}\right)$ is the locality adaptor, which gives each basis vector a different degree of freedom according to its similarity to the input descriptor $x$. The solution of LLC can be derived analytically as follows:

$$ \tilde{c} = \left( C + \lambda \, diag\left( \exp\left( \frac{\| x - B_i \|^2}{\sigma} \right) \right) \right)^{-1} \mathbf{1}, \qquad c^* = \tilde{c} \,/\, \mathbf{1}^\top \tilde{c}, $$

where $C = (B^\top - \mathbf{1}x^\top)(B^\top - \mathbf{1}x^\top)^\top$ denotes the data covariance matrix, $\mathbf{1}$ is the all-ones vector, and the final normalization enforces the sum-to-one constraint.

The data manifold in appearance space possesses local geometry similar to that in pose space. Given a query point in appearance space, we can recover the corresponding pose by applying the coefficients computed in appearance space to the pose dictionary $B^P$. For a query point $f$ in appearance space, the coefficients $w^*$ with respect to the appearance dictionary $B^F$ can be obtained by (4). The recovered pose $p^*$ is then given by

$$ p^* = \sum\nolimits_i w^*_i \, B^P_i. $$
(5)

In all the above discussions, the appearance dictionary $B^F$ and the corresponding pose dictionary $B^P$ are assumed to be known. Here we concatenate the appearance data and its corresponding pose data to get a coupled appearance-pose dictionary, where each dictionary entry $B_i$ can be separated into an appearance part $B_i^F$ and a pose part $B_i^P$. Using the coupled appearance-pose dictionary, we obtain a high-dimensional nonlinear mapping from appearance space to pose space through LLC coding.
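As an illustration, the sketch below implements the analytic LLC solution and the coupled-codebook pose recovery of (4)-(5) in NumPy. The helper names and the default values of λ and σ are assumptions made for the sake of the example, not the settings used in our experiments.

```python
import numpy as np

def llc_code(x, B_F, lam=1e-4, sigma=1.0):
    """Analytic LLC coding of descriptor x (D,) against codebook B_F (K, D).

    Returns coefficients c (K,) that sum to one; the locality adaptor keeps
    weight concentrated on bases close to x.
    """
    z = B_F - x                                    # shift bases to the query point
    C = z @ z.T                                    # K x K data covariance
    d = np.exp(np.sum(z ** 2, axis=1) / sigma)     # locality adaptor Dist_i
    c = np.linalg.solve(C + lam * np.diag(d), np.ones(len(B_F)))
    return c / c.sum()                             # enforce the sum-to-one constraint

def recover_pose(f_q, B_F, B_P, lam=1e-4, sigma=1.0):
    """Recover a pose for query descriptor f_q using a coupled codebook.

    B_F (K, D) and B_P (K, 3*J) hold the appearance and pose parts of the same
    K dictionary entries; equation (5) is the weighted sum of the rows of B_P.
    """
    w = llc_code(f_q, B_F, lam, sigma)
    return w @ B_P                                 # p* = sum_i w_i * B_P[i]
```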

3.4 Motionlet LLC coding

In our local discriminative framework of human pose estimation, Locality-constrained Linear Coding (LLC) is adopted to reconstruct 3D human poses for the query frame using motionlets as codebooks. Let $M' = \{M'_1, M'_2, \cdots, M'_{T'}\}$, where $M'_i = (F'_i, P'_i)$ and $i \in \{1,\cdots,T'\}$, denote all the motionlets extracted for the query frame from the training sequences. A series of LLC coding coefficients of $f_q$, denoted by $C^M = \{c_1,\cdots,c_{T'}\}$, are computed by performing LLC coding on the image descriptor $f_q$ with each of the motionlets as codebook. Then the reconstructed image descriptors and the corresponding 3D human poses are given by

$$ F^M = \{f'_1,\cdots,f'_{T'}\} $$
(6)
$$ f'_i = F'_i c_i $$
(7)
$$ P^M = \{p'_1,\cdots,p'_{T'}\} $$
(8)
$$ p'_i = P'_i c_i, $$
(9)

where $i \in \{1, \cdots, T'\}$.

$P^M$ now contains candidate poses, each contributed by one of the motionlets. The most appropriate candidate pose is selected as the result estimate:

$$ p^* = p'_\theta $$
(10)
$$ \theta = \arg\min\limits_i \left[ d(f_q,f'_i) + \lambda \, dist(f_q,F'_i) \right], $$
(11)

where $\lambda$ is the relative weight of the image-descriptor distance term against the reconstruction-error term, and the distance between an image descriptor and those of a motionlet is defined by

$$ dist(f,F') = \min\limits_{f_i\in F'} d(f,f_i). $$
(12)

This is much like a nearest-neighbor strategy. One might assume that the reconstruction-error term is positively correlated with the distance term, so that one of them could be omitted; in practice, however, there are cases where the former is relatively small while the latter is relatively large, and vice versa.
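A minimal sketch of this selection step is given below. It reuses the llc_code helper from the previous sketch and assumes each motionlet is stored as a pair of NumPy arrays (F_i, P_i); the default value of λ is purely illustrative.

```python
import numpy as np

def select_pose(f_q, motionlets, lam=0.5, llc_lam=1e-4, sigma=1.0):
    """Pick the final pose among the candidates contributed by each motionlet.

    Each motionlet (F_i, P_i) is used as a coupled codebook; the candidate with
    the smallest combined reconstruction error and nearest-neighbour distance
    (equations (10)-(12)) wins. lam is the relative weight lambda in (11).
    """
    best_score, best_pose = np.inf, None
    for F_i, P_i in motionlets:
        c = llc_code(f_q, F_i, llc_lam, sigma)        # per-motionlet LLC coefficients
        f_rec = c @ F_i                               # reconstructed descriptor f'_i
        recon = np.linalg.norm(f_q - f_rec)           # d(f_q, f'_i)
        nn = np.linalg.norm(F_i - f_q, axis=1).min()  # dist(f_q, F'_i), equation (12)
        score = recon + lam * nn
        if score < best_score:
            best_score, best_pose = score, c @ P_i    # candidate pose p'_i
    return best_pose
```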

At this point, we can infer 3D human pose from monocular image sequences. The following text shows how to exploit multiview image sequences to achieve more accurate and robust human pose estimation.

3.5 Multiview integration

For monocular human pose estimation, there are considerable ambiguities in the mapping from image space to pose space, not to mention the ambiguities caused by the image descriptors themselves (e.g. the dark strips along the upward 45° diagonal in the affinity matrices for HMAX image descriptors shown in the first row of Fig. 2). To account for these inference ambiguities, we extend our framework to incorporate multiple views.

When multiview sequences are available, we first estimate the 3D human pose from each single view. This yields several candidate poses, each corresponding to one view. Candidate pose selection is then applied again to obtain the final estimate. Unlike the method in [18], which combines all views into one descriptor, our method is more robust against an estimation failure occurring in one of the views, whereas in [18] inaccurate background segmentation in one or more views will always produce a bad estimation result.
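Under the assumption that the same selection score (11) is reused across views, per-view estimation followed by cross-view candidate selection collapses into a single minimization, as in the sketch below; it reuses the llc_code helper above, and the names and default λ are illustrative.

```python
import numpy as np

def select_pose_multiview(view_queries, view_motionlets, lam=0.5, llc_lam=1e-4, sigma=1.0):
    """Per-view estimation followed by cross-view candidate selection.

    view_queries[v]: descriptor of the query frame in view v;
    view_motionlets[v]: motionlets extracted for that view. Each view
    contributes candidate poses; the one with the lowest score wins, so a
    failure in a single view does not ruin the final estimate.
    """
    best_score, best_pose = np.inf, None
    for f_q, motionlets in zip(view_queries, view_motionlets):
        for F_i, P_i in motionlets:
            c = llc_code(f_q, F_i, llc_lam, sigma)
            score = (np.linalg.norm(f_q - c @ F_i)
                     + lam * np.linalg.norm(F_i - f_q, axis=1).min())
            if score < best_score:
                best_score, best_pose = score, c @ P_i
    return best_pose
```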

4 Experimental evaluation

4.1 HumanEva-I data set

To quantitatively evaluate our method, we conduct experiments on the HumanEva-I data set [23]. The HumanEva-I data set contains multiview video sequences that are synchronized with 3D body poses obtained from a motion capture system. The database contains sequences of different subjects performing several predefined actions (e.g. walking, jogging, gesturing, etc.), which are originally partitioned into training, validation, and testing sets. In this experiment, we use the original training and validation sets as the training set and the original testing set for testing. 3D body pose is represented by joint positions, and HMAX image descriptors are used. In addition to monocular pose recovery, we also evaluate our 3D human pose estimation method with multiple views incorporated on this data set.

We report the mean 3D error between estimated and ground truth joint positions in mm, relative to the pelvis (torsoDistal) joint. In the experiments, we remove frames with invalid ground truth poses from the training set. Since the ground truth poses of the testing set are withheld, we use the on-line evaluation system of the HumanEva project [12]. The results are reported in Table 1. Note that the training and testing data come from different sequences, yet our method accurately infers the 3D body poses even though the poses and appearances in training and testing data may differ considerably. In the single-view case, the errors are relatively large, which is likely due to the limited discriminating power of HMAX descriptors and the ambiguities they introduce (see Fig. 2).
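For reference, a minimal sketch of the per-frame error measure is given below; it assumes poses are stored as (J, 3) arrays of joint positions in mm and that the pelvis (torsoDistal) joint index is known. These are assumptions of this illustration, not details of the official HumanEva evaluation code.

```python
import numpy as np

def mean_3d_error(pred, gt, pelvis=0):
    """Mean 3D error in mm between predicted and ground-truth joint positions,
    with both poses expressed relative to the pelvis (torsoDistal) joint.

    pred, gt: (J, 3) arrays of joint positions for one frame; pelvis is the
    (assumed) index of the pelvis joint.
    """
    pred_rel = pred - pred[pelvis]
    gt_rel = gt - gt[pelvis]
    return float(np.linalg.norm(pred_rel - gt_rel, axis=1).mean())
```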

Table 1 Quantitative results on HumanEva-I data set: Mean 3D error in mm for HumanEva-I testing set of 3 subjects performing various actions, evaluated with a single view (C1) and multiviews (C1,BW1-BW4)

It is worth mentioning that the number and size of motionlets vary with the location of the query point in appearance space. When the query point falls on the manifolds of all subjects performing the Throw/Catch and Box actions, the motionlets tend to be small, because these actions are relatively violent and the training samples on the manifolds are sparsely distributed. Small motionlets lead to ambiguity in capturing the local geometry of the manifolds, so the pose reconstructed by LLC coding may not be accurate. In addition, when the query point falls on the manifold of subject 1 performing the Jog action, the number of motionlets decreases because much of the mocap data of the training/validation sequence is invalid. This again hurts the resulting estimate because fewer candidate poses are available.

Figure 4 shows mean 3D error plots of subject 2 performing various actions, with thin blue plots for the single view and thick red plots for multiview. The single-view estimates suffer from a high degree of ambiguity, which can be understood as a consequence of the information lost when projecting the 3D scene onto a 2D image. For example, there is a forward-backward ambiguity when the subject is walking towards or away from the camera, which causes some of the peaks in the single-view plots. Fortunately, incorporating multiple views resolves this ambiguity and helps bring down the error; the multiview plots are less jittery than the single-view ones.

Fig. 4

Mean 3D error (in mm) plots for the subject 2 (a) Walking, (b) Jog, (c) Throw/Catch, (d) Gestures and (e) Box test sequences. Areas with zero error contain invalid mocap data

Figure 5 compares our method with Poppe's [18]. Note that [18] uses foreground HOG as the image descriptor, which relies on good background segmentation, a strong assumption. In the single-view setting, the error of our results is larger overall; however, our method obtains comparable or even better results on subject 1 Gestures, subject 3 Walking and subject 3 Gestures. In addition, our method outperforms [18] in the multiview setting. By incorporating multiple views, our method becomes more robust, whereas in [18] an estimation failure in one or more views will always ruin the resulting estimate.

Fig. 5

Mean 3D error (in mm) of the proposed method and Poppe’s method

4.2 Taichi data set

In this experiment, we evaluate our method qualitatively on our Taichi data set. For the Taichi data set, we collect monocular image sequences of a subject exercising Taichi with synchronized ground truth pose captured by two Microsoft Kinect sensors. 3D pose is represented as a vector of 20 concatenated 3D joint positions and is estimated from HMAX image descriptors.

Figure 6 depicts several test frames of the Taichi data set and their corresponding estimated 3D human poses. Note that in some cases it is hard to distinguish the arms from the torso due to very dark clothing and self-occlusion, yet our method accurately infers 3D poses under such conditions.

Fig. 6

Qualitative results from monocular 3D human pose estimation of our proposed method on our Taichi data set

5 Conclusions

In this paper we presented a local online framework for 3D human pose estimation that learns a complex, high-dimensional, and multimodal nonlinear mapping from image descriptors to 3D human poses. We formulated the concept of motionlets and showed that the multimodality of the mapping is mainly caused by the multiple instances of motionlets for each query frame. We handle this multimodality directly by first grouping the most informative and helpful training examples into motionlets, then performing LLC coding to learn the nonlinear mapping and obtain candidate poses, and finally choosing the most appropriate pose as the result estimate. To improve accuracy and robustness, we extended our framework to incorporate multiple views. We evaluated the proposed method qualitatively on our Taichi data set and quantitatively on the real HumanEva-I data set, and achieved accurate results. In future work, we plan to develop a method to assess image descriptors and to find good image descriptors for discriminative human pose estimation.