
1 Introduction

3D body pose estimation, whose goal is to recover the poses of body parts and joints under naturally articulated movements, plays a key role and is a well-investigated problem in a variety of areas such as computer vision, human-computer interaction, and computer animation. In particular, in the works of Shotton et al. [1, 2], the random forest algorithm proposed by [3] is employed to predict body poses from a single depth image. The random forest is an ensemble learning method that has proven to be a fast and effective multi-class classifier for various tasks such as image classification, object tracking, facial expression recognition, and pose estimation [4,5,6]. The solution proposed by Shotton et al. [1] is embedded in the commercial product ‘Microsoft Kinect sensorTM’, a readily available off-the-shelf gaming system. Moreover, the depth comparison features proposed by [1] are widely used in many works [2, 7] as learning features for the random forest classifier. Within the framework of body pose estimation based on classified body parts [1], the accuracy and reliability of the body parts classification are important because they influence the subsequent learning process used to infer the positions of the 3D body joints. Furthermore, although the depth comparison features proposed by [1] are easy to compute and efficient in characterizing changes in body parts, they encode only local information about the body parts, not global information such as the deformation of the whole body or the skeletal structure of the body joints. Hence, the depth comparison features alone are insufficient to empower the discriminative ability of the classifier.

Fig. 1.

Systematic overview of our system and our ground-truth samples (normalized to the depth [0,1]): from the top row, forward walking (T2), hand waving1 (T3.a), hand waving2 (T3.b), sitting (T4), and upstanding (T5) motions.

Our approach for 3D body pose estimation from a single depth map is related to previous works [8, 9], which exploit a geodesic distance graph of the body depth image to localize the skeletal joints of the body.

In [10], a variety of objective functions based on geodesic distance transform features are investigated for identifying objects of interest in semantic image segmentation with random forests. Moreover, in the context of a decision forest, a joint objective function for pixel classification and shape regression is introduced in [11], which yields more spatially consistent predictions than the typical objective function that considers only the data labels.

Motivated by existing works [1, 8, 11], we propose a new feature descriptor based on a geodesic path over the body surface, referred to as GPS (Geodesic Path Sequence), which is derived by concatenating a sequence of characters corresponding to the vectors along deformation paths. For the body parts classification, we also incorporate the length of each GPS descriptor into a joint entropy-based objective function involving both the body part labels themselves and their geodesic structural information, leading to more accurate predictions. The geodesic descriptors reflect the geometry of the body surface well, which is expected to improve our body parts classification performance. In addition, we exploit a skeleton matching method based on the geodesic extrema of the body, thereby reducing misidentification problems for the joints and their bones in the skeletal configuration. Following the procedure in [1], we develop a ground-truth generator and cheaply create varied, realistic data by synthesizing an avatar 3D body model with poses of interest sampled from a large motion capture data set, which consists of five different motions: standing (T1), walking (T2), hand waving1 (T3.a), hand waving2 (T3.b), sitting (T4), and upstanding (T5) (see Fig. 1; samples similar to the standing (T1) set are included in the other sets). In this paper, our final goal is to predict an accurate skeletal configuration of the body pose rather than the standard anatomical positions of the body joints.

2 Geodesic Path Sequence Descriptor

We show how our GPS provides significant patterns with discriminative information across anatomically different body parts, through an empirical comparison of affinity matrices derived from two different types of features (i.e., our GPS and depth values). Then, we describe how to incorporate our GPS descriptors into the joint entropy-based objective when learning the random forest classifier.

In order to take the human body manifold structure into account, we exploit the geodesic distances and their paths among all points over the body surface and their barycenter point as feature descriptors for the random forest classifier. First, we construct an undirected weighted graph \(\mathcal {G} =({\mathcal {V}, \mathcal {E}})\) from the body point set \(\{\varvec{p}_{\varvec{x}}\} \subseteq \mathcal {V}\), where \(\mathcal {V}\) and \(\mathcal {E}\) denote a set of vertices and a set of edges with pairwise distances assigned as edge weights, and each \(\varvec{p}_{\varvec{x}_i}\) is a 3D position vector consisting of a 2D coordinate \(\varvec{x}_i\) and its depth \(d_D(\varvec{x}_i)\) in the body depth image. The set of edges is defined as:

$$\begin{aligned} \mathcal {E} = \{d_E(\varvec{p}_{\varvec{x}_i}, \varvec{p}_{\varvec{x}_j})\in \mathcal {V} \times \mathcal {V}~|&(||\varvec{p}_{\varvec{x}_i} - \varvec{p}_{\varvec{x}_j} ||_2 < \delta ) \nonumber \\ \wedge ~&(||\varvec{x}_i - \varvec{x}_j||_{\infty } \le 1) \}, \end{aligned}$$
(1)

Each edge \(d_E(\varvec{p}_{\varvec{x}_i}, \varvec{p}_{\varvec{x}_j})\in \mathcal {E}\) is assigned a weight \(w(d_E)\) whenever the 3D Euclidean distance between its endpoints is less than \(\delta \). The Dijkstra geodesic distance \(d_G\) is then computed along the shortest path \(\mathcal {P}\) between \(\varvec{p}_{\varvec{x}_p}\) and \(\varvec{p}_{\varvec{x}_q}\), which is defined as:

$$\begin{aligned} d_G(\varvec{p}_{\varvec{x}_p}, \varvec{p}_{\varvec{x}_q}) = \sum _{d_E\in \mathcal {P}(\varvec{p}_{\varvec{x}_p},\varvec{p}_{\varvec{x}_q})} {w(d_E)} \end{aligned}$$
(2)
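As a minimal sketch of how Eqs. (1) and (2) might be realized (the function names, the toy point chain, and the use of SciPy's Dijkstra routine are our own illustration, not the original implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_map(points, neighbours, delta, source=0):
    """Body-centric geodesic map: Dijkstra distances d_G (Eq. (2)) from the
    barycenter (index `source`) over the surface graph of Eq. (1).

    points:     (N, 3) array of 3D body points p_x.
    neighbours: (i, j) index pairs that are 8-neighbours in the depth image.
    delta:      3D Euclidean distance threshold for a valid edge.
    """
    rows, cols, w = [], [], []
    for i, j in neighbours:
        d = float(np.linalg.norm(points[i] - points[j]))
        if d < delta:                       # edge weight w(d_E) = Euclidean length
            rows += [i, j]; cols += [j, i]; w += [d, d]
    graph = csr_matrix((w, (rows, cols)), shape=(len(points), len(points)))
    # dist[i] = d_G(p_i, p_source); pred recovers the shortest path P itself
    return dijkstra(graph, indices=source, return_predecessors=True)

# toy chain of four points spaced 0.05 apart along x
pts = np.array([[0.00, 0, 0], [0.05, 0, 0], [0.10, 0, 0], [0.15, 0, 0]])
dist, pred = geodesic_map(pts, [(0, 1), (1, 2), (2, 3)], delta=0.06)
# distances from the source are approximately 0, 0.05, 0.10, 0.15
```

The predecessor array returned by `dijkstra` also recovers the shortest path \(\mathcal {P}\) itself, from which the character sequence of Eq. (3) can be read off.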

The graph-based geodesic descriptors are invariant to large motion deformations and geometric transforms as long as the local connection relationships remain, and thus they reflect the local body structure well [8, 12]. We then generate a body-centric geodesic map by measuring the Dijkstra geodesic distances for all N points on the body, \(\{ d_G(\varvec{p}_{\varvec{x}_i}, \varvec{p}_{\varvec{x}_0}) \}_{i=1}^N\). Each Dijkstra geodesic distance, \(d_G(\varvec{p}_{\varvec{x}_i}, \varvec{p}_{\varvec{x}_0})\), is the sum of edge weights along a shortest path between a point, \({\varvec{x}_i}\), over the body surface and the barycenter of the body, \({\varvec{x}_0}\), under the assumption that points on anatomically similar body parts maintain a nearly constant geodesic distance. From the body-centric geodesic map, we finally define a descriptor based on the geodesic path, represented by accumulating a sequence of characters corresponding to the vectors along the body’s deformation path. The GPS for a point \(\varvec{x}_i\) is defined as:

$$\begin{aligned} \varvec{d}_{g}({\varvec{x}_i}) = [c_1, c_2, \cdots , c_i], \end{aligned}$$
(3)

where \(c_i\) is a character indicating the direction between \(\varvec{p}_{\varvec{x}_{i-1}}\) and \(\varvec{p}_{\varvec{x}_{i}}\).
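A small sketch of how the character string of Eq. (3) might be produced; the 8-letter alphabet and the 45-degree quantization are our own assumptions, since the paper does not specify the encoding:

```python
import numpy as np

# Hypothetical 8-letter alphabet: one character per 45-degree direction sector.
ALPHABET = "abcdefgh"

def direction_char(v):
    """Map a 2D step vector to one of 8 direction characters c_k."""
    ang = np.arctan2(v[1], v[0]) % (2 * np.pi)
    return ALPHABET[int(round(ang / (np.pi / 4))) % 8]

def gps_descriptor(path_xy):
    """Eq. (3): concatenate direction characters along a geodesic path.

    path_xy: (K, 2) array of 2D pixel coordinates from the barycenter x_0
             out to the point x_i, e.g. recovered from Dijkstra predecessors.
    """
    return "".join(direction_char(path_xy[k] - path_xy[k - 1])
                   for k in range(1, len(path_xy)))

path = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])  # right, right, up
print(gps_descriptor(path))  # "aac" with this alphabet (0° -> 'a', 90° -> 'c')
```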

Dynamic time warping (DTW) is a powerful algorithm for measuring the similarity between two time series by finding an optimal alignment. Here, we employ the FastDTW algorithm [13] in order to compute the similarity between two GPS descriptors in the binary test function of the random forest in linear time. Fig. 2 shows the affinities between inter- and intra-body-part data, aligned by part. All distance values for the affinities are normalized between 0 and 1. The more an affinity matrix exhibits a well-formed block-diagonal structure, the better the partitioning of different parts. As shown in Fig. 2, the simple depth comparison features empirically do not provide enough discriminative power in learning the classifier: when two points have similar depth values but are located on different parts of the body, these features are likely to lead to erroneous predictions in the classification problem. Meanwhile, our proposed GPS is robust to large motion deformations, and it discriminates effectively between different body parts.
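For illustration, a plain quadratic-time DTW over two GPS character strings (the paper uses the linear-time FastDTW approximation [13]; the unit mismatch cost is our assumption):

```python
import numpy as np

def dtw_distance(s, t):
    """Warp-path distance d_W between two GPS character sequences.

    A plain O(|s||t|) DTW with unit mismatch cost; FastDTW [13] computes a
    linear-time approximation of the same quantity.
    """
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else 1.0
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance("aacc", "aacc"))   # 0.0: identical sequences
print(dtw_distance("aaacc", "aacc"))  # 0.0: DTW absorbs the repeated 'a'
```

The second example shows why DTW suits GPS comparison: paths of different lengths over the same body part differ mainly by repeated characters, which the warping absorbs.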

Fig. 2.

From the left, each pair of affinity matrices shows the depth-based and GPS-based similarities between two different body parts for the data in the overview of Fig. 1.

3 Joint Entropy-Based Body Parts Classification

To formulate the body parts classification from a single depth image, we assume that a set of N training samples \(\mathcal {Q} = \{ ({\varvec{f}_{\theta }}_i, \varvec{l}_i ) \}_{i=1}^{N}\) is given. The input variable \({\varvec{f}_{\theta }}_i\) corresponds to a feature for an individual pixel \(\varvec{x}_i\). The output variable is a discrete label \(\varvec{l}_i \in \mathcal {C}\), where \(\mathcal {C}\) is a finite set of body part labels.

For a given pixel \(\varvec{x}\) in a depth image D, we propose a GPS comparison feature similar to the existing depth comparison feature [1], which is defined as:

$$\begin{aligned} \varvec{f}_{\theta }(D, \varvec{x}) = d_{W}\left( \varvec{d}_g\left( \varvec{x} + \frac{\varvec{i}}{ | \varvec{d}_g\left( \varvec{x}\right) |}\right) , \varvec{d}_g\left( \varvec{x} + \frac{\varvec{j}}{| \varvec{d}_g\left( \varvec{x}\right) |}\right) \right) , \end{aligned}$$
(4)

where \(d_W(\varvec{d}_g({\varvec{x}_{\theta }}_i),\varvec{d}_g({\varvec{x}_{\theta }}_j))\) is a warp path distance between the descriptors \(\varvec{d}_g({\varvec{x}_{\theta }}_i)\) and \(\varvec{d}_g({\varvec{x}_{\theta }}_j)\), and \(\theta = (\varvec{i}, \varvec{j})\) is a pair of offsets to the pixel \(\varvec{x}\); scale invariance with respect to depth is achieved by normalizing the offsets by the length of \(\varvec{d}_g(\varvec{x})\). Each tree node is trained over a set of splitting candidates \(\phi = \{(\theta , \tau )\}\), where \(\theta \) is a feature parameter and \(\tau \) a partition threshold. The split candidates \(\phi \) are randomly sampled from a uniform distribution. For each \(\phi \) (\(m=|\phi |\)), the subsets \(\mathcal {Q}_{L}\) and \(\mathcal {Q}_{R}\) partitioned from the original data set \(\mathcal {Q}\) are evaluated with our energy function at the current node. The partitioning is performed as follows:

$$\begin{aligned} \mathcal {Q}_{L}(\phi )= & {} \{(D, \varvec{x})~ |~ \varvec{f}_{\theta }(D, \varvec{x}) < \tau \} \nonumber \\ \mathcal {Q}_{R}(\phi )= & {} \mathcal {Q}~ \backslash ~ \mathcal {Q}_{L}(\phi ) \end{aligned}$$
(5)
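The partitioning of Eq. (5) amounts to a threshold test on precomputed feature responses; a sketch (the array-based representation is our own choice):

```python
import numpy as np

def split_node(features, tau):
    """Eq. (5): partition samples at a tree node by thresholding f_theta.

    features: (N,) array of GPS comparison feature responses f_theta(D, x),
              i.e. the warp-path distances of Eq. (4).
    tau:      partition threshold from a split candidate phi = (theta, tau).
    Returns index arrays (Q_L, Q_R).
    """
    left = np.flatnonzero(features < tau)
    right = np.flatnonzero(features >= tau)   # Q_R = Q \ Q_L
    return left, right

f = np.array([0.2, 0.9, 0.4, 1.3])
L, R = split_node(f, tau=0.5)
print(L, R)  # [0 2] [1 3]
```

During training, this split would be evaluated for each of the m random candidates and the one minimizing Eq. (7) kept.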

In the forest training procedure, the goal is to find the optimal splitting parameters of each node and build a binary partitioning tree that minimizes the objective function \(\mathcal {J}\), defined as follows:

$$\begin{aligned} \phi ^{*}={{\arg \min }}_{\phi }{\mathcal {J}(\mathcal {Q}, \phi )}. \end{aligned}$$
(6)

The optimal criterion \(\phi ^{*} = \{ {\theta }^{*}, \tau ^{*}\}\) defines the split parameters of the node and is later used for the prediction of new input data. The entropy is the expected value of the information contained in each message, and Shannon’s entropy is generally used for training forests. Our goal is now to learn the joint probability \(p_{t}(\varvec{l}, \varvec{g} | \varvec{f}_{\theta })\), where the new variable \(\varvec{g} \in \mathbb {R}^3\) is a continuous regression variable describing the relative 2D offset between the depth pixel \(\varvec{x}\) and the barycenter of the body \(\varvec{x}_0\), together with the geodesic distance \(d_G(\varvec{p}_{\varvec{x}}, \varvec{p}_{\varvec{x}_0})\). By using the chain rule, we rewrite the joint distribution as \( p_{t}(\varvec{l}, \varvec{g} | \varvec{f}_{\theta }) = p_{t}(\varvec{l} | \varvec{f}_{\theta })p_{t}(\varvec{g} | \varvec{f}_{\theta }, \varvec{l} )\), where we assume that \(p_{t}(\varvec{g} | \varvec{f}_{\theta }, \varvec{l} )\) is a multivariate normal distribution. That is, \(p_{t}(\varvec{g} |\varvec{f}_{\theta }, \varvec{l} ) \sim \mathcal {N}(\varvec{\mu }_{\varvec{g}| \varvec{l} }, \varvec{\varSigma }_{\varvec{g}| \varvec{l}} | \varvec{g}, \varvec{f}_{\theta }, \varvec{l} )\) is one distribution per class label \(\varvec{l}\). We then define the joint objective function \(\mathcal {J}\) as follows:

$$\begin{aligned} \mathcal {J}(\mathcal {Q}, \phi )= & {} \sum _{\scriptstyle {p\in \{L,R\}}}\sum _{\scriptstyle {\varvec{x} \in \mathcal {Q}_{p}}} {\frac{|\mathcal {Q}_{p}|}{|\mathcal {Q}|}} {\psi _{E}{(\varvec{l}, \varvec{g} ; \mathcal {Q}_{p})}} , \end{aligned}$$
(7)
$$\begin{aligned} \psi _{E}{(\varvec{l}, \varvec{g}; \mathcal {Q}_{p})}= & {} -\sum _{\varvec{l}\in \mathcal {C}}\int _{\varvec{g}\in \mathbb {R}^3}{p_{t}(\varvec{l},\varvec{g} |\varvec{f}_{\theta })} \text{ log }({p_{t}(\varvec{l},\varvec{g} | \varvec{f}_{\theta })})d\varvec{g}, \nonumber \\= & {} \underbrace{-\sum _{\varvec{l}\in \mathcal {C}} {p_{t}(\varvec{l} | \varvec{f}_{\theta })\text{ log }(p_{t}(\varvec{l} | \varvec{f}_{\theta }))}}_{\psi _{E}(\varvec{l}; \mathcal {Q}_{p})} \nonumber \\&\underbrace{+\sum _{\varvec{l}\in \mathcal {C}}{p_{t}(\varvec{l} | \varvec{f}_{\theta })} \Big (\frac{1}{2}\text{ log }((2\pi e)^3 |\varvec{\varSigma }_{\varvec{g} | \varvec{l} }| ) \Big )}_{\psi _{E}(\varvec{g}; \mathcal {Q}_{p} | \varvec{l})}, \end{aligned}$$
(8)

where \(|\varvec{\varSigma } |\) denotes the determinant of a matrix.
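A sketch of the joint entropy of Eq. (8) for one child set \(\mathcal {Q}_{p}\): the discrete Shannon term plus the class-weighted Gaussian differential entropy term. The maximum-likelihood covariance estimate, the small regularizer, and the function names are our own assumptions:

```python
import numpy as np

def joint_entropy(labels, g):
    """Joint entropy psi_E(l, g; Q_p) of Eq. (8) for one child set.

    labels: (N,) integer body-part labels l.
    g:      (N, 3) continuous targets (2D offset to the barycenter plus
            geodesic distance), one row per sample.
    """
    N = len(labels)
    H = 0.0
    for c in np.unique(labels):
        sel = labels == c
        p = sel.sum() / N                          # p_t(l | f_theta)
        H += -p * np.log(p)                        # discrete term psi_E(l; Q_p)
        cov = np.cov(g[sel].T, bias=True) + 1e-9 * np.eye(3)  # Sigma_{g|l}
        # Gaussian differential entropy term psi_E(g; Q_p | l)
        H += p * 0.5 * np.log((2 * np.pi * np.e) ** 3 * np.linalg.det(cov))
    return H

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
g = rng.normal(size=(100, 3))
print(joint_entropy(labels, g))
```

Tighter per-class clusters of \(\varvec{g}\) shrink the covariance determinant and hence the objective, which is what pushes splits toward spatially consistent predictions.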

Finally, the random forest models are obtained by optimizing the joint objective function of Eq. (8), which includes the conventional objective \({\psi _{E}(\varvec{l}; \mathcal {Q}_{p})}\) for the discrete label \(\varvec{l}\) as well as the objective \({\psi _{E}(\varvec{g}; \mathcal {Q}_{p} | \varvec{l})}\) for the continuous variable \(\varvec{g}\) given \(\varvec{f}_{\theta }\) and \(\varvec{l}\). Here, the posterior we are interested in concerns the body parts classification. The overall prediction of a forest with T trees is estimated by averaging the individual tree predictions, and the output is obtained by inferring:

$$\begin{aligned} \varvec{l}^{*} = {{\arg \max }}_{\scriptstyle {\varvec{l} \in \mathcal {C}}}p({\varvec{l} | \varvec{f}_{\theta }}) = {{\arg \max }}_{\scriptstyle {\varvec{l} \in \mathcal {C}}}\frac{1}{T}\sum _{t=1}^T{p_{t}(\varvec{l} | \varvec{f}_{\theta })}. \end{aligned}$$
(9)
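Eq. (9) in code form, assuming each tree already yields a class posterior (the array layout is our own choice):

```python
import numpy as np

def forest_predict(tree_posteriors):
    """Eq. (9): average the per-tree posteriors p_t(l | f_theta) over the
    T trees and take the arg-max class.

    tree_posteriors: (T, C) array, one class distribution per tree.
    """
    p = tree_posteriors.mean(axis=0)   # forest posterior p(l | f_theta)
    return int(np.argmax(p)), p

post = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.5, 0.4, 0.1]])
label, p = forest_predict(post)
print(label)  # 0  (mean posterior is about [0.43, 0.40, 0.17])
```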

4 Body Joints and Skeleton Identification

Given a body-centric geodesic map, following [8, 9, 12], the extreme points are computed by incrementally maximizing geodesic distances on the body surface. Based on the classified body parts and the geodesic paths between the body’s barycenter and its geodesic extrema (i.e., the end-nodes of the human skeletal graph), we localize and identify the joint candidates lying on the paths. The joint candidates are selected with \(\angle (\overrightarrow{\varvec{x}_{k-1}\varvec{x}_{k}},\overrightarrow{\varvec{x}_{k}\varvec{x}_{k+1}}) > \epsilon \), where \(\angle (\overrightarrow{\varvec{x}_{k-1}\varvec{x}_{k}},\overrightarrow{\varvec{x}_{k}\varvec{x}_{k+1}})\) is the angle between the two vectors \(\overrightarrow{\varvec{x}_{k-1}\varvec{x}_{k}}\) and \(\overrightarrow{\varvec{x}_{k}\varvec{x}_{k+1}}\), and the three points \((\varvec{x}_{k-1}, \varvec{x}_{k}, \varvec{x}_{k+1})\) around the point \({\varvec{x}_k}\) lie on the same GPS \(\varvec{d}_{g}({\varvec{x}_k})\). \(\epsilon \) is a threshold depending on the body type, empirically set to about 30 degrees in our experiments. After obtaining the joint candidate set \(\{\varvec{x}_i\}\), the representative label \(\varvec{l}^{'}\) is evaluated via Eq. (10) from the local window at each joint candidate, where the local patches are based on the already classified body parts. Fig. 3 illustrates the intermediate results generated at each step.
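A sketch of the angle test used to select joint candidates along one GPS (the function name and toy path are our own illustration):

```python
import numpy as np

def joint_candidates(path, eps_deg=30.0):
    """Keep x_k where the bend angle between (x_{k-1} -> x_k) and
    (x_k -> x_{k+1}) exceeds the threshold eps.

    path: (K, 2) ordered pixel coordinates along one geodesic path.
    """
    keep = []
    for k in range(1, len(path) - 1):
        u = path[k] - path[k - 1]
        v = path[k + 1] - path[k]
        cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        ang = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
        if ang > eps_deg:
            keep.append(k)
    return keep

# a straight segment followed by a 90-degree turn at index 2
path = np.array([[0, 0], [1, 0], [2, 0], [2, 1]])
print(joint_candidates(path))  # [2]
```

Points where the geodesic path bends sharply are exactly where articulated joints (elbows, knees) tend to lie, which motivates this test.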

Fig. 3.

(a) Geodesic extrema (blue points in red regions); (b) joint candidates (red points) on five GPSs (black lines); (c) color-labeled patches on joint candidates; (d) labels set \(\{\varvec{l}^{'}\}\) classified into five sub-skeletons (i.e., each skeleton for four limbs and one trunk). (Color figure online)

$$\begin{aligned} \varvec{l}^{'}_{\varvec{x}_i} ={\mathop {\arg \max }\limits _{\varvec{l}_{\varvec{u}}^{*}}}\sum _{\scriptstyle {\varvec{u}\in \mathcal {W}_{\varvec{x}_i}}}\sum _{\scriptstyle {\varvec{l} \in \mathcal {C}}} {\delta (\varvec{l}_{\varvec{u}}^{*}, \varvec{l})}, \end{aligned}$$
(10)

where \(\varvec{l}_{\varvec{u}}^{*}\) is the label for the position \(\varvec{u} \in \mathbb {R}^2\) within the local window \(\mathcal {W}_{\varvec{x}_i}\) centered at \(\varvec{x}_i\), and \(\delta \) is the Kronecker delta function. Our main idea is to match two graphs by comparing the labeled sets of ordered points on the paths between the body center and the geodesic extrema of the skeletal configuration, under the assumption that the set of joint candidates contains the meaningful joints of the human skeletal structure. Here, the body center and the geodesic extrema labels are defined as 7 (center), 0 (head), 10 (right hand), 13 (left hand), 16 (right foot), and 19 (left foot), respectively. All joint candidates are identified and clustered as in Fig. 3(d) by matching with a given template graph as in Fig. 4(a). In order to match the sub-skeletons (i.e., each skeleton for the limbs and trunk) with the template graph, we consider a weighted bipartite graph as illustrated in Fig. 4(b), with two vertex sets, a set of sub-skeleton labels and a set of joint labels, where the weight of each edge is defined as a DTW distance between two consecutive joint labels. Given the bipartite graph, the matching is performed using the Hungarian method [14]. Finally, the skeletal graph with 15 labeled nodes is extracted, corresponding to the whole body skeleton (see Fig. 5(b)).
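The bipartite matching step can be sketched with SciPy's Hungarian solver, given a matrix of DTW costs between sub-skeleton label sequences (the toy cost matrix is illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_subskeletons(cost):
    """Match template sub-skeletons (s#) to candidate geodesic paths (c#)
    on the weighted bipartite graph via the Hungarian method [14].

    cost: square matrix of DTW distances between label sequences.
    """
    rows, cols = linear_sum_assignment(cost)   # minimum-cost perfect matching
    return dict(zip(rows.tolist(), cols.tolist()))

# hypothetical DTW cost matrix: the diagonal assignment is cheapest
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.1]])
print(match_subskeletons(cost))  # {0: 0, 1: 1, 2: 2}
```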

Fig. 4.

(a) Template skeleton model consisting of four limb sub-skeletons and one trunk sub-skeleton; (b) bipartite graph with two vertex sets (s#: set of joint labels for each sub-skeleton in the template; c#: set of candidate joint labels for each geodesic path).

5 Numerical Experiments

We show the usefulness of our method through an empirical comparison between different objective functions based on different types of features (i.e., our GPS and the depth comparison features [1]). We applied our method to samples from our ground-truth data sets, consisting of five types of motions: forward/backward walking, hand waving1, hand waving2, sitting, and standing; each motion group has approximately 100 frames. As in a conventional leave-one-out training scheme, the sequences for each model are evaluated with the model trained on the other models. For quantitative evaluation of the estimated joint positions and skeleton accuracy, we present three different measurements: (a) we estimate the mean absolute error (MAE), Eq. (11), for the training error evaluation of the classified body parts; (b) the mean average precision (mAP) is evaluated by averaging the precision of the 15 estimated joints on each frame, which determines whether the position of an estimated joint is within a given threshold of the ground-truth (here, the threshold is fixed to \(max(\{|\varvec{s}_{i}|\}_{i=0}^{4})/10\)); (c) the third is a new measurement of the similarity between the estimated skeletons and the ground-truth skeleton based on their DTW score, referred to as mean average matching (mAM) and defined in Eq. (12).

$$\begin{aligned} MAE = \frac{1}{N} \sum _{\scriptstyle {i}}\sum _{\scriptstyle {\varvec{l} \in \mathcal {C}}} {| \varvec{l}_{i}^{*} - \varvec{l}_{i}^{G} |} \in [0, 1], \end{aligned}$$
(11)

where \(\varvec{l}_{i}^{*}\) and \(\varvec{l}_{i}^{G}\) represent the predicted label of data \(\varvec{x}_i\) and the corresponding ground-truth label, respectively.

$$\begin{aligned} AM(i) = \frac{1}{| \mathcal {F} |}\sum _{f \in \mathcal {F}}{\frac{d_{W} ( \varvec{d}_g(\varvec{s}_{fi}^{*}), \varvec{d}_g(\varvec{s}_{fi}^{G}))}{max(\{|\varvec{s}_{i}|\}_{i=0}^{4})}} \in [0, 1], \end{aligned}$$
(12)

where AM(i) is the average matching score between the estimated sub-skeleton \(\varvec{s}_{i}^{*}\) and the corresponding ground-truth sub-skeleton \(\varvec{s}_{i}^{G}\); one body skeleton has five sub-skeletons, \(\{\varvec{s}_{i}^{*}\}_{i=0}^{4}\), and the mean matching score is normalized by the maximum length of the five sub-skeletons. \(\mathcal {F}\) is the set of target frames.

Table 1. MAE, mAP, and mAM results for data sets used in Fig. 5(c, d) at depth level 30.
Fig. 5.

Performance comparison: (top) APs for each joint at depth level 30, based on ours (blue) and [1] (red) with Training (T4)+Testing (T2, T5) sets; (mid) the predicted body parts for the data in Fig. 1 and their skeletons with ground-truth (black lines); (bot) mAM values with different data sets, using the methods of [1] (a, c) and ours (b, d), respectively. (Color figure online)

Fig. 6.

Experimental results of estimated skeletons for the forward-walking samples in Fig. 1: from the left, body depth, color-labeled patches on the joint candidates, identified & clustered joint candidates, body skeleton overlaid on the depth, skeletons from our method, and (the last two) skeletons from [1] with ground-truth (black lines). (Color figure online)

In [1, 2], a local mode-finding approach based on mean-shift with a weighted Gaussian kernel is applied to each classified body part to infer the final positions of the 3D skeletal joints. However, as shown in Figs. 5(a) and 6, the local modes obtained from misclassified outlying parts such as the left hand (label 13) and the right foot (label 16) cause the skeleton estimation to fail. Meanwhile, as shown in Fig. 5(b) and the MAE in Table 1, our method provides well-classified body parts as well as well-matched body skeletons through our GPS-based joint entropy and skeletal matching methods. As shown in Fig. 5 and Table 1, although our method is slightly less accurate than [1] in mAP, it offers the advantage of matching the body skeleton more accurately. In our method, the position of a target joint to be predicted depends on its GPS, its geodesic distance, and the inclination angle between its neighboring joint candidate vectors on the GPS; because of this, the mAP measured against the anatomical joint positions is slightly lower. On the other hand, the mAM value confirms that our proposed method reflects not only the local features in the body depth data, but also the global structure of the skeletal configuration. Figure 6 shows a visual comparison of the predicted results.

6 Conclusions

We have presented a novel geodesic path sequence (GPS) descriptor, a joint entropy-based objective using the GPS, and a skeleton matching method for 3D body pose estimation based on body parts classification, making it possible to robustly predict the skeleton’s position under severe body deformations. We also incorporate the GPS descriptor into a joint entropy-based objective function for learning both class and structural information about the body parts. The useful aspects of our proposed method can be summarized as follows: (a) the GPS descriptor can be widely used in a variety of fields as a descriptor for deformable object representation; (b) the joint entropy objective function based on our GPS comparison features reflects the geodesic structural information over the body surface well, leading to more accurate predictions by the random forest classifier; (c) the skeleton matching & identification based on the geodesic extrema of the body enhances robustness to joint misidentification. An empirical comparison with the conventional solution, a single entropy-based objective with depth comparison features, confirmed the high performance of our method.