1 Introduction

Given an image of a person, the problem of human pose estimation can be briefly described as localizing the position and orientation of the body limbs. The complexity of the problem comes from issues like background clutter, changes in viewpoint, changes in appearance, and self-occlusions of body parts. Among the various methods proposed for human pose estimation, models based on pictorial structures tend to provide superior performance, e.g. Felzenszwalb and Huttenlocher (2005), Ramanan (2007), Andriluka et al. (2009). The first works on pictorial structures for human pose estimation (Felzenszwalb and Huttenlocher 2005) employed a tree-structured model composed of parts representing the human body (e.g., left foot, lower left leg, etc.), connected following the kinematic constraints of the human body (e.g., left foot is connected to left leg). More specifically, the body parts are modeled as rectangles, parametrized by position, orientation, and size.

One of the main limitations of such pictorial structures is that exact inference is only possible when the model is tree-structured. This means that only a small subset of the total number of body part relationships can be modeled, which is often not enough to capture the whole expressivity of the human body. Additionally, body part detectors are quite sensitive to noise and result in several false positive detections at test time, either confusing a part with the background or with another part (see Fig. 1). Moreover, sliding-window detection approaches can fail to recover joints in the presence of occlusions, unless the appearance of an occluded part is explicitly modeled.

Fig. 1

a Detection score map for the right shoulder using a classical sliding-window detection approach with a linear SVM trained on HOG features. b Rescored version of a produced by our context-based rescoring. The original map a has a strong score on the actual shoulder location, but also in other regions like the hips. In contrast, with our proposed rescoring, we get a more spatially-consistent score map, showing a high response near the correct shoulder location, and suppression of false positive locations. In addition, our rescoring method can hallucinate the location of a part, e.g. foot d even if there is not a high-scoring region in the original map c

In this work, we propose a new method for obtaining robust part detections in a pictorial structure formulation for human pose estimation. Following the work of Pishchulin et al. (2013b), and motivated by the fact that the modeling and detection of such small image regions as the body joints (“basic parts” from now on) is sensitive to noise, we introduce a mid-level representation of the image (based on poselets) in the pictorial structure model. More specifically, we propose a method for the automatic discovery of a compact set of discriminative poselets (Bourdev et al. 2010) that offers both high detection precision and a covering of the different poses in a given validation dataset. Using the evidence of these new mid-level parts, we rescore the basic part detections in order to obtain a more robust basic part detection, and thus improve the inference of human pose. In particular, we make the following contributions:

  • We propose a methodology for rescoring part detections in a pictorial structure framework, using a contextual model of relations between parts.

  • We propose a poselet selection method for the automatic discovery of a compact set of poselets that cover the range of poses available in a given set of images.

  • We extract spatial and score-related contextual features from the new set of mid-level part detections.

  • We use boosting to learn a set-based rescoring function for each part in the pictorial structure model, using the previously computed set of contextual features.

  • Finally, we combine the original and rescored basic part detections in the unary potential of the pictorial structure model proposed in Pishchulin et al. (2013b).

  • The approach is generic: it is applicable to the detection of other objects composed of different parts.

Experimental evaluation is conducted on two benchmarks: Leeds Sports (Johnson and Everingham 2010) and PARSE (Ramanan 2007). In the experiments, we achieve results comparable to the state of the art. More specifically, we achieve competitive performance w.r.t. Pishchulin et al. (2013b), while reducing the size of the mid-level representation by more than an order of magnitude (47 poselets in our approach vs. 1,013 in Pishchulin et al. (2013b)) and consequently reducing the execution time by \(68~\%\). In particular, among the different mid-level part representations in our comparative analysis, the automatic discovery of poselets with covering attains the best results in both datasets.

2 Related Work

In the context of human pose estimation, pictorial structure models have been widely used. Since the search space of such highly-articulated models can be very large, some works proposed methodologies for reducing it. Ramanan (2007) proposed an iterative inference process for the pictorial structure model. An edge-based appearance model of the body parts is first applied, giving a rough estimate of their location. Using this rough estimate, a color model of the foreground and the background is learnt, which helps in reducing the search space of the inference. A further extension of this idea was proposed by Ferrari et al. (2008a); they directly segment the human body from the background of the scene, eliminating the need for a two-step inference. In contrast, Andriluka et al. (2009) proposed to use stronger body part detectors based on dense shape context features, outperforming previous works.

Since the number of possible poses a person can adopt is huge, a simple pictorial structure model is usually not able to model all this variability in the data. In order to overcome this, some works increase the flexibility of pictorial structure models by adding pose prior information to the model. For example, Sapp et al. (2010) adjust the parameters of the pictorial structure model (particularly the ones regarding the pairwise potential) depending on image evidence. Similarly, Johnson and Everingham (2010) partition the space of human poses into different clusters and learn specific pose priors and appearance models. In a further extension of their work, Johnson and Everingham (2011) replace a single pictorial structure model by a mixture model, such that each component of the model is specialized to a particular region of the pose space. Additionally, they present a new dataset of 10,000 images that they use as additional training data. Yang and Ramanan (2013) followed the same trend and proposed a “flexible mixture of parts” model. Moreover, they proposed a new formulation of the problem; they model the body joints instead of the body parts, thus simplifying the formulation and reducing the complexity of inference. Specifically, the body joints are modeled as a mixture of small HOG filters (Dalal and Triggs 2005) capturing a small neighborhood around them. While simplifying the inference and attaining better results than previous works, the performance of their method is still compromised by the use of a tree-structured model.

In contrast, some works have proposed to work with loopy graphs. Tian and Sclaroff (2010) augmented the tree model by adding a small set of edges, and presented an efficient inference algorithm for their model. In the experimental section, they show how their method overcomes the so-called “double-counting” phenomenon that tree structures typically suffer from. Sun et al. (2012) proposed a branch-and-bound inference algorithm that allows exact inference on loopy graph models to be computed efficiently. Duan et al. (2012) proposed a multi-layer pictorial structure model, incorporating body part evidence at different scales, from full body to local parts. In their proposed model, each layer is tree-structured by itself, but edges between adjacent layers make the whole model a loopy graph. However, this graph can be easily decomposed into tree-structured subproblems (which are amenable to exact inference), and they show how inference can be performed via dual decomposition. Finally, Wang et al. (2011) also proposed a hierarchical approach. They presented a manually-defined hierarchical decomposition of the human body; they introduced poselets (Bourdev and Malik 2009; Bourdev et al. 2010) as new parts in the model, also adding the corresponding set of higher-order edges with the existing basic parts. Since this extension results in a loopy-graph model, the method can only perform approximate inference.

Some recent works propose hierarchical tree models that are able to model high-order part dependencies by recursively dividing the human body into different mid-level parts, in addition to the set of basic body parts. Tian et al. (2012) present a hierarchical model where leaf nodes correspond to body parts, and intermediate latent nodes cover different subsets of nearby body parts. Similarly, Wang et al. (2011) and Wang and Li (2013) also included mid-level poselet-based body parts in their pictorial structure model, but they propose an algorithm for discovering the best possible tree topology that connects all the parts. In contrast, Pishchulin et al. (2013a) introduced higher-order part dependencies in their pictorial structure model by conditioning the unary and pairwise terms on poselet evidence. At training time, they cluster the relative distance between a particular part and the torso, and model this offset as a Gaussian distribution for each cluster. At test time, they solve a classification problem in which poselet responses are used to predict the correct cluster and recover the corresponding Gaussian distribution. This additional information about the location of that particular part is then incorporated in the pictorial structure model as an extra unary potential. They proceed analogously to predict part orientations and to learn the pairwise terms of the model. In a further extension of their work, Pishchulin et al. (2013b) achieve a higher performance by incorporating additional parts that model body joints in their pictorial structure, following the idea of flexible spatial models from Yang and Ramanan (2013).

A bridge between human pose estimation and object detection was proposed by Yao and Fei-Fei (2010). They model mutual contextual information between poses and objects for a certain set of human-object interaction activities, like “tennis serve” or “croquet shot.” The results indicate that pose estimation can help object detection and vice versa. Another work on exploiting contextual information between objects was proposed by Cinbis and Sclaroff (2012); they present an approach for rescoring detections of different objects, introducing the notion of sets of contextual relations between object detections in an image. Each detection from a certain object class is represented by its context, defined as a set of detections from every other object detector. Then, a feature vector is extracted from each contextual detection, encoding spatial relations, relative scores and class-related relations. A generalization of the well-known AdaBoost algorithm, called SetBoost, is used for rescoring an object detection given its set-based context representation.

Fig. 2

Proposed pipeline for human pose estimation. Given an input image, a set of basic and mid-level part detections is obtained. For each basic part detection \({l_m}\), a contextual representation is computed based on relations with respect to the set of mid-level part detections. Using these contextual representations, basic part detections are rescored using a classifier for that particular basic part class. The original and rescored detections for all basic parts are then used in inference on a pictorial structure (PS) model to obtain the final pose estimate

In our previous work (Hernández-Vela et al. 2014), we recast the pictorial structure formulation from Yang and Ramanan (2013) in order to include information from a mid-level representation of the image. More specifically, we proposed a rescoring mechanism for assigning new scores to the basic part detections, given the evidence from mid-level body parts. In order to define the mid-level part representation, we proposed a method for the automatic discovery of a compact set of poselets, which maximize precision while enforcing coverage of the poses in a set of validation images. Experiments on two benchmarks showed an improvement in the results obtained with our extended pictorial structure formulation, w.r.t. Yang and Ramanan (2013). In this work, we further demonstrate the benefits of our proposed rescoring mechanism by embedding it in a second human pose estimation method from the state of the art, namely, the work from Pishchulin et al. (2013b). Whereas Pishchulin et al. (2013a, b) employ over 1,000 poselets in their experiments, we demonstrate that it is possible to get similar performance using fewer than 50 poselets, automatically discovered by our poselet selection algorithm. A deeper analysis of the results shows that these rescoring functions provide more relevant information to the pictorial structure model than the Gaussian-shaped potentials defined by Pishchulin et al. (2013a), by discovering and exploiting pairwise relations between the basic parts in the PS model and the poselets.

3 Approach

An overview of the proposed formulation is shown in Fig. 2. We are motivated by the aforementioned limitations of basic, low-level part detectors that are commonly used in pictorial structure models, e.g., HOG patches centered at body joints (Yang and Ramanan 2013). In our formulation, we define and learn an additional set of mid-level body part detectors that improve the localization of the basic ones. Mid-level and basic part detectors are applied in order to extract a set of pairwise contextual features between each pair of basic and mid-level part hypotheses. A classifier for a certain basic part class will compute a new score for its detections, based on the set of contextual features computed between the basic and mid-level parts. The original and rescored detections for all basic parts are then used in inference on a pictorial structure model to obtain the final pose estimate.

For the purpose of illustrating and evaluating the benefits of our contextual rescoring framework, we incorporate it within the pictorial structure (PS) model formulation of Pishchulin et al. (2013b). In the rest of this section, we start by defining the extended PS model, to show how the pieces of the formulation fit together. We then present the formulation for the mid-level feature representations and the contextual rescoring mechanism.

3.1 Pictorial Structure Formulation

Following Pishchulin et al. (2013a), we define a full body pose as \(L = (l_1,...,l_M)\), consisting of M parts. Each part \(l_m = (x_m,y_m,r_m,z_m)\) is parametrized by its center position \((x_m,y_m)\), rotation \(r_m \in [0,360)\) and scale \(z_m \in \mathbb {R}_+\). In practice, images are rescaled so as to normalize the body size across the whole dataset, so \(z_m\) can be omitted. The CRF energy function of a given pose L is then defined as:

$$\begin{aligned} E(L,D,\beta ) = \sum _{m=1}^M E^u(l_m, D, \beta ^u) + \sum _{n \sim m} E^p(l_n,l_m,\beta ^p), \end{aligned}$$
(1)

where D denotes image evidence, and \(\beta \) are model parameters. The unary potential is defined as a weighted combination of two terms:

$$\begin{aligned} E^u = E^{u,boost}(l_m,D) + w_p E^{u,poselets}(l_m, D). \end{aligned}$$
(2)

The first term \(E^{u,boost}\) is the local appearance-based unary potential, defined as the log-likelihood obtained with pre-trained AdaBoost classifiers. The second term \(E^{u,poselets}\) is the one that incorporates evidence from poselet detections into the PS model, and \(w_p\) is a weight. More specifically, Pishchulin et al. (2013b) use poselet detections to predict the position and rotation of the body parts. Their poselet-based feature representation is defined as a feature vector \(f \in \mathbb {R}^P\), built by aggregating the maximum scores from the detections of poselets \(p=1,...,P\) in an image. Additionally, they apply a max-pooling step in order to obtain only relevant poselet detections around a certain area w.r.t. the torso. During training, the relative position between a certain part m and the torso location is clustered into K clusters. Then, for each one of these clusters, the relative offset from the torso is modeled as an isotropic Gaussian distribution. Finally, at test time the position prediction is formulated as a classification problem where one of these K clusters is predicted from a feature vector f, and the corresponding Gaussian distribution is recovered. The same approach is used to predict the rotation of a certain part m (see Pishchulin et al. (2013b) for more details). One of the main limitations of this approach is that the position (or rotation) prediction is restricted to one of the K Gaussian components (K is pre-defined at training time), so the choice of the value for K is decisive for the performance of the system. Moreover, another weakness of this approach is that it strongly relies on a torso location estimate, so a robust torso detector would be required to correctly predict the position of a body part m. The training and testing execution pipelines from Pishchulin et al. (2013b) are summarized in Algorithm 1.
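To make the structure of Eqs. (1) and (2) concrete, the following is a minimal sketch of how the energy of a single pose hypothesis could be evaluated. It is not the implementation of Pishchulin et al. (2013b): the function name `pose_energy`, the callables standing in for \(E^{u,boost}\), \(E^{u,poselets}\) and \(E^p\), and the toy potentials are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A part hypothesis l_m = (x_m, y_m, r_m); the scale z_m is omitted because
# images are assumed to be rescaled to a normalized body size.
Part = Tuple[float, float, float]
UnaryFn = Callable[[Part], float]
PairwiseFn = Callable[[Part, Part], float]


def pose_energy(
    pose: Sequence[Part],
    edges: Sequence[Tuple[int, int]],
    boost_terms: Sequence[UnaryFn],
    poselet_terms: Sequence[UnaryFn],
    pairwise: PairwiseFn,
    w_p: float,
) -> float:
    """Evaluate Eq. (1), with each unary potential expanded as in Eq. (2)."""
    unary = sum(
        boost_terms[m](part) + w_p * poselet_terms[m](part)
        for m, part in enumerate(pose)
    )
    pairwise_sum = sum(pairwise(pose[n], pose[m]) for n, m in edges)
    return unary + pairwise_sum


def toy_pairwise(a: Part, b: Part) -> float:
    # Hypothetical pairwise term: negative distance between part centers.
    return -math.hypot(a[0] - b[0], a[1] - b[1])


# Toy example: two connected parts with made-up constant unary terms.
toy_pose: List[Part] = [(0.0, 0.0, 0.0), (3.0, 4.0, 90.0)]
toy_edges = [(0, 1)]
toy_boost = [lambda part: 1.0, lambda part: 1.0]
toy_poselets = [lambda part: 2.0, lambda part: 2.0]

toy_energy = pose_energy(
    toy_pose, toy_edges, toy_boost, toy_poselets, toy_pairwise, w_p=0.5
)
```

In this toy setting each part contributes \(1 + 0.5 \cdot 2 = 2\) to the unary sum and the single edge contributes \(-5\), so the weight \(w_p\) directly controls how much the poselet-based term can change the ranking of pose hypotheses.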

figure a

As an alternative to the position prediction approach from Pishchulin et al. (2013b), we propose to discriminatively train a classifier to learn contextual relationships between detections from part m and from a set of mid-level parts (e.g. poselets), and produce a poselet-conditioned position prediction for part m, without the need to discretize the space of possible predictions into different clusters. In order to do that, we learn a different rescoring function \(R_m\) for each body part m, such that it assigns a new score to evidence from this part m, given a set of poselet evidence. We define:

$$\begin{aligned} E^{u,poselets}(l_m, D) = R_m(C_m^\mathcal {P}), \end{aligned}$$
(3)

where \(R_m: \mathcal {C} \rightarrow \mathbb {R}\) is a rescoring function that receives as input a set of contextual feature vectors \(C_m^\mathcal {P}\) and returns a new score for part m. This set of contextual feature vectors is associated to part m and a set \(\mathcal {P}\) of mid-level contextual detections:

$$\begin{aligned} \mathcal {P} = \left\{ B_p^i \right\} _{i=1, p=1}^{T, P}, \end{aligned}$$
(4)

where T denotes the number of detections taken from a certain mid-level part p, and \(B_p = (x_p,y_p,wi_p,he_p,s_p)\) is the bounding box of the corresponding detection, parametrized by position \((x_p,y_p)\), size \(wi_p \times he_p\) and detection score \(s_p\). In contrast to Pishchulin et al. (2013b), our method does not strongly rely on any specific torso detection hypotheses, but on the ones from the whole set of basic parts in the PS model. Additionally, the proposed approach implicitly uses the detection scores of both poselet and basic part hypotheses as confidence values that measure their reliability, and weights them accordingly to favor high-scoring detections and weaken low-scoring ones. Algorithm 2 shows the execution pipeline for training and testing of our proposed approach.
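The construction of the set \(\mathcal {P}\) in Eq. (4) can be sketched as follows: for every mid-level part p, keep its T highest-scoring detections. This is only an illustrative sketch; the `Box` type, the function `build_context_set` and the toy detections are hypothetical names and values, not part of the original implementation.

```python
from typing import Dict, List, NamedTuple, Tuple


class Box(NamedTuple):
    """Detection B_p = (x_p, y_p, wi_p, he_p, s_p)."""
    x: float
    y: float
    w: float
    h: float
    score: float


def build_context_set(
    detections: Dict[int, List[Box]], T: int
) -> List[Tuple[int, Box]]:
    """Keep the T best-scoring detections of every mid-level part p.

    Returns (p, box) pairs so that the class of each detection stays known.
    """
    context: List[Tuple[int, Box]] = []
    for p in sorted(detections):
        ranked = sorted(detections[p], key=lambda b: b.score, reverse=True)
        context.extend((p, box) for box in ranked[:T])
    return context


# Toy detections for P = 2 mid-level parts (values are made up).
toy_detections = {
    0: [Box(10, 20, 40, 80, 0.3), Box(12, 22, 40, 80, 0.9), Box(50, 60, 40, 80, 0.1)],
    1: [Box(5, 5, 30, 30, 0.7)],
}
toy_context = build_context_set(toy_detections, T=2)
```

With T = 2, the lowest-scoring detection of part 0 is dropped, and part 1 contributes its single detection, so the set contains at most \(T \cdot P\) elements.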

3.2 Mid-Level Part Representation

In order to formulate the set of contextual vectors \(C_m^\mathcal {P}\), we define a mid-level body part representation. Since higher-level body parts model a larger image portion than just a small local patch as in the case of basic parts, it is expected they will perform better in terms of object detection. Moreover, poselets (Bourdev and Malik 2009) model a specific human region extent under a certain pose configuration, so they are a suitable choice for the definition of the mid-level representation if one wants to predict the position of the body parts using their evidence. However, different strategies may be followed for the definition of a set of poselet detectors. While the manually-defined hierarchical decomposition presented by Pishchulin et al. (2013a) is sound and follows the kinematic constraints of the human body, the choice of its parameters may seem arbitrary, and not optimized for our position prediction task. In contrast, we propose a method for the automatic discovery of a compact subset of poselets to define a compact mid-level representation. We first review the manual hierarchical decomposition and then we introduce our automatic poselet discovery algorithm in the next lines.

figure b

Hierarchical decomposition In their paper, Pishchulin et al. (2013b) define the set of poselets in their mid-level representation with guidance from a human annotator, similarly to Wang et al. (2011). In addition to the 10 basic parts in their original PS model, they also define a set of 11 parts, namely full body, upper body with arms, torso and head, right arm and torso, left arm and torso, right arm alone, left arm alone, torso with legs, legs, right leg alone, and left leg alone. Then, for each one of these parts they cluster relative positions of a subset of related body parts using K-means with a predefined number of clusters, discarding clusters with fewer than 10 examples. In practice, in their experiments they select \(K=200\), obtaining a final total number of \(P=1,013\) poselet detectors. While this hierarchical decomposition of the human body is directly motivated by the human body structure and kinematics, it might not be the best one for the task of predicting body parts locations since the choice of this mid-level representation is apparently arbitrary, without evaluating the prediction accuracy of the different body parts. Additionally, the number of poselets generated with this methodology can be very large, depending on the number of clusters defined for each mid-level part.

Fig. 3

Sample Poselet templates. Body joints are shown with colored dots, and their corresponding estimated Gaussian distributions as blue ellipses (Color figure online)

Poselet discovery We propose a poselet selection methodology, inspired by the one proposed by Bourdev et al. (2010), to define our mid-level parts p. We first generate a large number (thousands) of random seed windows \(B_p^i\) from the training set images \(\{I^n \}_{n=1}^N\), and for each one of them we collect similar patches from other training images by Procrustes alignment of the body joint annotations \((x_k^{gt},y_k^{gt})\) from the ground-truth. Then, for each seed window and its associated set of similar examples, we train a mid-level part detector. Additionally, we model the spatial distribution of the keypoints k contained within the spatial extent of each seed window as Gaussian distributions (\(\mu _p^k\),\(\Sigma _p^k\)) (see Fig. 3), that we use to evaluate the precision of predicting body joint locations. To that end, we then test these poselet detectors in a validation set \(\{I^n \}_{n=1}^V\), and look for True Positive (TP) and False Positive (FP) detections. In order to do that, we use the same criterion as the Percentage of Correctly-placed Parts (PCP) metric (Ferrari et al. 2008b), widely used for evaluating human pose estimation methods. More specifically, we consider a detection as a TP if:

figure c
$$\begin{aligned} \text {dist}(\mu _p^k, (x_k^{gt},y_k^{gt})) \le \kappa , \forall k \in B_{p}, \end{aligned}$$
(5)

where \(\kappa \) is a threshold value. That is, we classify a detection as a TP if the distance between the body joint estimations \(\mu _p^k\) and their corresponding ground-truth annotations \((x_k^{gt},y_k^{gt})\) is below a threshold \(\kappa \), for all the joints k contained in the poselet. Conversely, we consider a detection as a FP if none of the keypoints k fulfill the condition above. Since the seed windows are generated randomly, some of them will be redundant and others might perform poorly, so we can select a subset of relevant poselets and reduce the size of the mid-level part representation. This selection is treated as a “set cover” problem by Bourdev et al. (2010); poselets are selected in a greedy manner so as to “cover” more examples, i.e. the poselets that found TP detections in a larger number of training images are preferred. However, this methodology does not prioritize poselets with good performance if they only fire in a small subset of training images. In order to overcome this problem, we propose using a weighted version of the “set cover” problem, in which the precision of the selected poselets is maximized, while ensuring coverage of the images in a validation dataset. We define a binary matrix \(A_{n p}\), with \(A_{np}=1\) if poselet p fires in the n-th validation image and 0 otherwise. Finally, we formulate this weighted “set cover” problem as the following integer program:

$$\begin{aligned} \text {minimize }&\sum \limits _{p} (1 - \text {Prec}(p)) \mathbf {x}_{p} \nonumber \\ \text {subject to }&\sum \limits _{p: A_{np}=1} \mathbf {x}_{p} \ge 1 \; \forall n\nonumber \\&\mathbf {x}_{p} \in \{0,1\}, \end{aligned}$$
(6)

where \(\text {Prec}(.)\) computes the precision of a poselet. The solution \(\mathbf {x}\) identifies the subset of poselets \(\{p\}\) s.t. \(\mathbf {x}_{p} = 1\), i.e. a set of poselets ensuring that in every validation image there is at least one poselet that fires. The constraints of the integer program enforce that each validation image n is covered by at least one poselet, while the objective prioritizes the best-performing poselets, since we are minimizing \((1 - \text {Prec}(p))\). Since weighted set cover is NP-hard, we find the solution via a Linear Programming relaxation \((\mathbf {y}_{p} \in \mathbb {R}_{\ge 0}, \mathbf {y}_{p} \le 1)\) and round the solution \(\mathbf {y}\) to obtain \(\mathbf {x}\). The proposed method is summarized in Algorithm 3.
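The objective and constraints of Eq. (6) can be illustrated on a toy instance. The sketch below does not use the LP relaxation described above: to stay self-contained it solves the integer program exactly by exhaustive search, which is only feasible for a handful of poselets. The function name `select_poselets` and all precision and firing values are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Set, Tuple


def select_poselets(
    precision: Sequence[float],
    fires: Sequence[Set[int]],
    num_images: int,
) -> Tuple[Optional[Tuple[int, ...]], float]:
    """Solve the weighted set cover of Eq. (6) by brute force.

    precision[p] is Prec(p); fires[p] is the set of validation images n
    with A_np = 1. Exponential in the number of poselets: toy use only.
    """
    all_images = set(range(num_images))
    best_subset: Optional[Tuple[int, ...]] = None
    best_cost = float("inf")
    for size in range(1, len(precision) + 1):
        for subset in combinations(range(len(precision)), size):
            covered: Set[int] = set()
            for p in subset:
                covered |= fires[p]
            if not covered >= all_images:
                continue  # some validation image is left uncovered
            cost = sum(1.0 - precision[p] for p in subset)
            if cost < best_cost:
                best_subset, best_cost = subset, cost
    return best_subset, best_cost


# Toy instance: 4 candidate poselets, 4 validation images.
toy_precision = [0.9, 0.5, 0.8, 0.6]
toy_fires: List[Set[int]] = [{0, 1}, {0, 1, 2, 3}, {2, 3}, {1, 2}]
toy_subset, toy_cost = select_poselets(toy_precision, toy_fires, num_images=4)
```

In this instance poselet 1 covers every image on its own but has low precision (cost 0.5), whereas poselets 0 and 2 together cover all images at a total cost of about 0.3, so the weighted formulation prefers the two precise poselets over the single broad one.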

3.3 Contextual Rescoring

Fig. 4

Two sample images depicting a reference detection bounding box \(B_m\) in yellow (for the right ankle), and the set of contextual mid-level detections \(\mathcal {P}\) in blue, orange and green for the upper body, lower body and full body, respectively (Color figure online)

Table 1 List of contextual features included in \(c_{B_m,B_{p}}\). For clarification, the detection score is encoded classwise in a sparse vector, i.e. a vector of size \(P_m\) set to zeros except the position corresponding to the class of the detection, which contains the detection score

We build our contextual model on top of the mid-level part representation presented in Sect. 3.2. More specifically, we want to model underlying spatial and score-related relationships between basic and mid-level part detections (see Fig. 4). By doing this, a certain mid-level part detection would be able to determine a hypothesis for the location of a certain basic part. For this task, we define the context of a given basic part detection \(l_m\) as a set:

$$\begin{aligned} C_{m}^{\mathcal {P}} = \left\{ c_{B_m,B_{p}} \; | \; \forall B_p \in \mathcal {P} \right\} , \end{aligned}$$
(7)

composed by contextual feature vectors \(c \in C\). These contextual feature vectors c encode spatial, score-related and class-related relationships between a reference basic part detection \(B_m\) and a contextual mid-level detection \(B_p\) (we use the same set of features as Cinbis and Sclaroff (2012)). The specific set of features we use is summarized in Table 1. Finally, the rescoring function given a set of contextual feature vectors C is then defined as:

$$\begin{aligned} R(C)= & {} \sum _{\theta =1}^\Theta Q_{\theta }(C), \end{aligned}$$
(8)
$$\begin{aligned} Q_{\theta }(C)= & {} \alpha _{\theta } \sum _{c \in C} k_c \cdot q_{\theta }(c), \end{aligned}$$
(9)

where \(Q_{\theta }\) is a weak set classifier, and \(q_{\theta }\) is a weak item classifier, weighted by \(\alpha _{\theta }\). The term \(k_c\) introduces an additional weight related to the relevance of the item. In practice, \(k_c\) is set to its corresponding detection score \(s_{p}\), and the \(q_{\theta }\) functions are defined as decision trees with F leaves, which partition the feature space into regions \(U_1,...,U_F\). The weights \(\alpha ^f\) for each leaf f are computed as:

(10)

where \(\lambda \) is a regularization parameter, \(y_n \in \{-1,1\}\) is the label of training sample n, \(L: \mathbb {R} \rightarrow \mathbb {R}\) is a loss function, and \(H^f(C) = \sum _{c \in C, c \in U_f} k_c\) is the total sum of relevance weights for all items in the set C that fall into a given partition \(U_f\). The whole training procedure is summarized in Algorithm 4.
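The evaluation of the rescoring function in Eqs. (8) and (9) can be sketched as below. This is a simplified illustration under stated assumptions, not the SetBoost implementation: the weak item classifiers are hypothetical decision stumps returning \(\pm 1\) instead of learned decision trees, and the weights \(\alpha _{\theta }\) are fixed by hand rather than learned.

```python
from typing import Callable, List, Sequence, Tuple

FeatureVector = Sequence[float]
# One context item: (k_c, c), i.e. relevance weight and feature vector.
ContextItem = Tuple[float, FeatureVector]
# One weak set classifier: (alpha_theta, q_theta).
WeakClassifier = Tuple[float, Callable[[FeatureVector], float]]


def weak_set_score(context: Sequence[ContextItem], weak: WeakClassifier) -> float:
    """Eq. (9): Q_theta(C) = alpha_theta * sum_c k_c * q_theta(c)."""
    alpha, q = weak
    return alpha * sum(k_c * q(c) for k_c, c in context)


def rescore(context: Sequence[ContextItem], ensemble: Sequence[WeakClassifier]) -> float:
    """Eq. (8): R(C) = sum over theta of Q_theta(C)."""
    return sum(weak_set_score(context, weak) for weak in ensemble)


# Toy context with two mid-level detections; k_c is the detection score s_p.
toy_items: List[ContextItem] = [
    (0.9, [0.1, 2.0]),
    (0.4, [0.7, -1.0]),
]

# Two hypothetical stumps standing in for the decision-tree item classifiers.
toy_ensemble: List[WeakClassifier] = [
    (0.5, lambda c: 1.0 if c[0] < 0.5 else -1.0),
    (0.25, lambda c: 1.0 if c[1] > 0.0 else -1.0),
]

toy_score = rescore(toy_items, toy_ensemble)
```

Because each item's vote is multiplied by \(k_c\), the high-scoring detection (0.9) outweighs the low-scoring one (0.4) whenever the two disagree, which is how the detection scores act as confidence values in the rescoring.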

figure d

In order to train the rescoring function \(R_m\) for basic part \(l_m\), we run its corresponding basic part detector on a validation set, as well as the whole set of mid-level part detectors p. Then, for each basic part detection \(B_m\), we compute the corresponding mid-level contextual feature set \(C_{B_m}^{\mathcal {P}}\), and assign a label \(y_{B_m}\):

$$\begin{aligned} y_{B_m} = \left\{ \begin{array}{ll} 1 &{} \quad O(B_m,B_m^{gt}) \ge \tau \\ -1 &{} \quad otherwise \end{array} \right. , \end{aligned}$$
(11)

where \(B_m^{gt}\) is the bounding box for part m computed from the ground-truth annotation data, \(O(B_i,B_j) = |B_i \cap B_j| / |B_i \cup B_j|\) is the Jaccard index (area of intersection over area of union) between two bounding boxes, and \(\tau \) is a threshold value. The complexity of rescoring a basic part detection is \(\mathcal {O}(P)\).
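The labeling rule of Eq. (11) can be sketched as follows. The box layout is an assumption: each box is taken to be (x, y, width, height) with (x, y) the top-left corner, which the text does not specify, and the function names are hypothetical.

```python
from typing import Tuple

# Assumed layout: (x, y, width, height), with (x, y) the top-left corner.
Rect = Tuple[float, float, float, float]


def jaccard(a: Rect, b: Rect) -> float:
    """Overlap O(B_i, B_j): intersection area divided by union area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def training_label(detection: Rect, ground_truth: Rect, tau: float = 0.6) -> int:
    """Eq. (11): +1 if the overlap reaches tau, -1 otherwise."""
    return 1 if jaccard(detection, ground_truth) >= tau else -1


gt_box = (0.0, 0.0, 10.0, 10.0)
good_detection = (1.0, 0.0, 10.0, 10.0)   # overlap 90 / 110
poor_detection = (5.0, 0.0, 10.0, 10.0)   # overlap 50 / 150

good_label = training_label(good_detection, gt_box)
poor_label = training_label(poor_detection, gt_box)
```

With the default \(\tau = 0.6\), a detection shifted by one tenth of the box width is still labeled positive, while one shifted by half the box width is labeled negative.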

4 Experiments

We conducted experiments on two different benchmarks, comparing the human poses estimated by our proposed method with the results of state-of-the-art methods; especially with the results obtained by Pishchulin et al. (2013b). We first detail the data, methods and validation, and the evaluation measurements we fixed in our experimental setup. Next, we present the different experiments we performed.

Data We conducted experiments on two publicly available challenging datasets: Leeds Sports (LSP) (Johnson and Everingham 2010), which comprises 2,000 images of people playing 8 different sports, and the PARSE dataset (Ramanan 2007), which contains 305 images of people playing different sports and activities. The annotations for both datasets consist of 14 position labels, one for each body joint: left/right ankle, knee, hip, wrist, elbow and shoulder, neck and head top. In both datasets, the annotations are observer-centric, i.e. left/right labels on the limbs are defined as the left-most/right-most limb in the image respectively (originally, the annotations for the LSP dataset were person-centric, but we used the observer-centric annotations provided by Eichner and Ferrari (2012)). We divided the training set of LSP into 2 subsets: training and validation. The final training set contains \(75~\%\) of the images in the whole original training set, and is used for learning the PS model. The validation set contains the remaining \(25~\%\) of the images and is used for learning the rescoring functions \(R_m\) and the final subset of poselets. In the case of the PARSE dataset, we just use the test split in order to test our rescoring approach, pre-trained on LSP.

Fig. 5

Sample poselets from the LSP dataset. a Poselets with highest precision. b Poselets discovered by our selection method, maximizing precision and enforcing covering of the validation set

Fig. 6

Position prediction comparison in a LSP and b PARSE datasets. In each plot, PCP performance is shown as a function of \(w_p\). We compare our proposed rescoring approach when using \(P=47\) poselets, automatically selected by our proposed poselet discovery method, w.r.t. the position prediction from Pishchulin et al. (2013b) with \(P=1,013\) and \(P=47\) poselets. In addition, we also compare against Pishchulin et al.’s position prediction when using our automatically selected poselets, in LSP (Color figure online)

Methods and validation We downloaded the implementation from Pishchulin et al. (2013b) and re-trained their model for our experimental setup. The PS model we use for our experiments has \(M=22\) parts: while the original PS model presented by Pishchulin et al. (2013a) was composed of \(M=10\) parts (left/right lower legs, left/right upper legs, torso, head, left/right forearms, left/right upper arms), a further extension proposed by Pishchulin et al. (2013b) introduced an additional set of parts modeling the body joints of the limbs, resulting in a total of \(M=22\) parts. For the manually-defined mid-level representation, we first use the parameters reported in their paper: we fix \(K=200\) clusters for each of the 11 mid-level parts, resulting in a total of \(P=1,013\) poselets. In a second configuration, we choose K so as to obtain the same number of poselets as our poselet selection method does.

Our poselet selection method automatically selects 47 poselets in the LSP dataset, from an initial set of 2,000 poselet proposals (see Fig. 5). In order to define the set \(\mathcal {P}\) of contextual detections, we take the \(T = 2\) best detections from each mid-level part detector. This way, we capture more information than by taking only the best detection, while keeping the size of \(\mathcal {P}\) low. Given that our experimental setup is similar to that of Cinbis and Sclaroff (2012), we followed their work for the choice of parameter values: each rescoring function R(C) is defined as a forest of \(\Theta = 20\) decision tree weak classifiers, each of them having a maximum of \(V=150\) leaf nodes. In addition, we use \(\lambda = 0.01\) and \(\tau = 0.6\), following other works such as Yang and Ramanan (2013) or Bourdev et al. (2010).
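As an illustration of how the set of contextual detections could be assembled, the sketch below keeps the \(T = 2\) highest-scoring detections of each mid-level part detector. The data layout (a dictionary of `(x, y, score)` tuples per poselet) and the function name `top_t_detections` are assumptions made for this example, not the authors' implementation.

```python
def top_t_detections(detections, t=2):
    """Keep the t highest-scoring detections of each mid-level part detector.

    detections: dict mapping a poselet id to a list of (x, y, score) tuples
                (hypothetical layout, chosen for this sketch).
    Returns a flat list of (poselet_id, x, y, score) tuples.
    """
    contextual = []
    for poselet_id, dets in detections.items():
        # Sort by score, highest first, and keep at most t detections.
        best = sorted(dets, key=lambda d: d[2], reverse=True)[:t]
        contextual.extend((poselet_id, x, y, s) for (x, y, s) in best)
    return contextual


# Toy example with two poselet detectors.
example = {
    "p1": [(0, 0, 0.1), (1, 1, 0.9), (2, 2, 0.5)],
    "p2": [(3, 3, 0.7)],
}
selected = top_t_detections(example, t=2)
```

With 47 poselets and \(T = 2\), such a set would contain at most 94 contextual detections per image.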

We compare our proposed formulation against the original PS model of Pishchulin et al. (2013b) on both benchmark databases, as well as with some relevant methods from the state of the art. In our previous work (Hernández-Vela et al. 2014), we showed that the rescoring approach improved the human pose estimation performance w.r.t. the work from Yang and Ramanan (2013). In this work, we demonstrate the generality of the proposed rescoring approach by integrating it within the more recent pictorial structure formulation from Pishchulin et al. (2013b). Incorporating our rescoring approach can enable a significant (\(95~\%\)) reduction in the number of poselets needed in the model, while attaining nearly comparable accuracy in human pose estimation and body part localization.

Evaluation measurement We use the Percentage of Correctly-placed Parts (PCP) (Ferrari et al. 2008b) as the evaluation measure, like most recent works in human pose estimation.
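For reference, a commonly used formulation of PCP counts a part as correctly placed when both of its endpoints lie within a fraction of the ground-truth part length from their annotated positions. The sketch below follows that reading; the threshold of 0.5 and the strict both-endpoints criterion are assumptions, since the exact variant used in the experiments is not restated here.

```python
import math


def pcp(pred_parts, gt_parts, threshold=0.5):
    """Fraction of parts whose two endpoints are both close to the ground truth.

    Each part is a pair of (x, y) endpoints. A part counts as correct when
    each predicted endpoint lies within threshold * (ground-truth part length)
    of the corresponding annotated endpoint (strict variant, assumed here).
    """
    correct = 0
    for (p1, p2), (g1, g2) in zip(pred_parts, gt_parts):
        limit = threshold * math.dist(g1, g2)
        if math.dist(p1, g1) <= limit and math.dist(p2, g2) <= limit:
            correct += 1
    return correct / len(gt_parts)


# Toy example: the first part is within tolerance, the second is not.
gt = [((0, 0), (0, 10)), ((0, 0), (10, 0))]
pred = [((0, 1), (0, 9)), ((0, 6), (10, 0))]
score = pcp(pred, gt)
```

A more relaxed variant, which some of the compared works use, would accept a part when the average endpoint error is below the limit.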

In the first part of the experimental results we compare the position prediction from Pishchulin et al. (2013b) against the one we propose, based on contextual rescoring. In this first comparison, rotation prediction is disabled, and the pairwise potential is not conditioned on any poselet evidence. Furthermore, the specialized head and torso detectors and the torso position prior are disabled, in order to better isolate the influence of the position prediction on the final pose estimation results. We also compare four mid-level representations: our proposed poselet selection method maximizing precision and enforcing covering, the selection of the poselets with maximum precision, the greedy algorithm from Bourdev et al. (2010), and the manually-defined hierarchy from Pishchulin et al. (2013a). Next, we compare our proposal with relevant works from the state of the art. Finally, we analyze the contextual features selected by the SetBoost classifiers, and the execution time of our proposal and of the full model from Pishchulin et al. (2013b).

4.1 Position Prediction Evaluation

We first evaluate the proposed part position prediction based on contextual rescoring in conjunction with our proposed poselet selection algorithm for the automatic discovery of the mid-level representation. We compare it against the position prediction method proposed by Pishchulin et al. (2013b) and their manually-defined hierarchical decomposition as mid-level representation. One can see in Fig. 6 how our proposed position prediction yields maximum performance that is similar to Pishchulin et al. (2013b) for the LSP dataset, while reducing the number of poselets P by more than an order of magnitude. We can also observe how the performance of Pishchulin et al. (2013b) drops for \(w_p \ge 0.15\), while our approach gets the maximum performance for \(w_p=0.45\) and yields a much smoother performance response when varying \(w_p\). Additionally, when using Pishchulin et al. (2013b) and reducing P so as to match the size of our proposed mid-level representation, their performance drops more sharply. Not only that, but when using our set of 47 automatically selected poselets in the method of Pishchulin et al. (2013b), the performance drops sharply as well, showing that 47 poselets are not enough in this case.

Fig. 7

Comparison of different mid-level representations in a LSP and b PARSE datasets. In each plot, PCP performance is shown as a function of \(w_p\). We compare our poselet selection maximizing precision and enforcing covering against (1) the manual hierarchical decomposition from Pishchulin et al. (2013b), (2) selecting the poselets with maximum precision, and (3) the poselet selection greedy method from Bourdev et al. (2010)

When testing on the PARSE dataset using the same models we learnt on LSP, the behavior is similar, and we even obtain a slightly better performance than Pishchulin et al. (2013b). We noted that the optimal values for \(w_p\) in terms of PCP performance are higher when using our contextual rescoring approach than with Pishchulin et al. (2013b) (\(w_p=0.45\) vs. \(w_p=0.15\) in the LSP dataset, and \(w_p=0.2\) vs. \(w_p=0.1\) in the PARSE dataset), indicating that our extra unary potentials contribute more relevant information to the PS model. This could be due to the unimodal nature of the Gaussian-shaped extra unary potential \(E^{u,poselets}\) defined by Pishchulin et al. (2013b). In contrast, our rescoring approach does not limit \(E^{u,poselets}\) to be unimodal, and thus can keep several possible position prediction hypotheses that can eventually be selected (see column (d) in Figs. 10 and 11).

Figures 10 and 11 show some qualitative examples (as well as the corresponding extra unary potentials \(E^{u,poselets}\)) in the LSP and PARSE datasets, where our proposed position prediction approach obtains better results than the one from Pishchulin et al. (2013b). We observed that the method of Pishchulin et al. (2013b) is prone to fail in cases where the torso detection is unreliable due to the viewpoint or the pose itself, since their approach strongly relies on the torso location hypotheses in order to introduce information from the mid-level representation into the model. In contrast, our proposed rescoring mechanism is able to correctly recover most of the body parts when the torso is hard to detect.

Table 2 Pose recovery results for LSP dataset
Table 3 Pose recovery results for PARSE dataset

In order to validate our proposed poselet selection method, we run the pipeline in Algorithm 2 with different alternatives for the definition of the mid-level representation, and compare the resulting PCP performance. Figure 7 shows that our proposed poselet selection method maximizing precision and enforcing covering yields the best results. While there is no significant difference in performance between the different strategies in the LSP dataset, the performance of our method clearly stands out for the PARSE dataset. It is worth noting that the proposed method automatically finds the appropriate number of poselets P, similarly to the method proposed by Bourdev et al. (2010). Even so, our method obtains better performance than Bourdev et al. (2010) while selecting a smaller number of poselets (\(P=47\) vs. \(P=50\)).

4.2 Comparison with State of the Art

In Table 2 we compare our proposed method with other recent methods in the state of the art, for the LSP dataset. We divide the table into four sections depending on the feature representation that each method uses, in order to make a fairer comparison. Our proposed formulation based on contextual rescoring and automatic poselet selection obtains slightly worse results (\(-1.77~\%\)) than the best performing method using Shape Context features, Pishchulin et al. (2013b) (in this case we show results using their full model). However, our approach uses just \(P=47\) poselets in contrast to their set of \(P=1,013\) (a reduction of \(95~\%\)), hence our method is much faster, as we will show in Sect. 4.4. Moreover, our method reaches similar performance to Pishchulin et al. (2013b) while only predicting position, whereas the latter uses a considerably heavier model (position, rotation and pairwise prediction, specialized head & torso detectors, and torso position prior). We also performed an additional experiment doubling the number of poselets from 47 to 94 (adding the top 47 poselets ranked by precision to the 47 found by the proposed poselet discovery method), achieving a PCP of \(67.24~\%\), which slightly improves on the performance obtained by Pishchulin et al. when only using position prediction. The work of Chen and Yuille (2014) provides the best results in this comparison. However, it is important to note that their method is based on Deep Convolutional Neural Networks, which provide a richer feature representation. Similarly, the work of Ramakrishna et al. (2014) (the second best performing method) also uses additional features: color and contextual features.

For the PARSE dataset, we test the model from Pishchulin et al. (2013b) (and our proposed reformulation) trained on the LSP dataset; the other numbers in the table are gathered from the corresponding papers. In this case, Tian et al. (2012) report outstanding results (74.4 %). However, their reported performance is not directly comparable to ours, since their PCP evaluation criterion is more relaxed, as pointed out by Yang and Ramanan (2013). Additionally, the works of Wang and Li (2013) and Duan et al. (2012) also use the relaxed version of the PCP performance measure, so their reported results are not directly comparable either. Johnson and Everingham (2011) still obtain higher performance (67.40 %), but they use an additional 11,000 images for training, so the results are not directly comparable. Our method scores 65.37 % (\(-1.36~\%\) w.r.t. Pishchulin et al. (2013b)) in this case, following the same trend as in the LSP dataset (Table 3).

Fig. 8

Failure cases in a LSP and b PARSE datasets. Our proposed method cannot recover the human pose correctly, mainly due to upside-down poses and cases with extra people close to the actual subject

Fig. 9

Contextual feature selection histograms computed from the learnt decision trees \(q_{\theta }\), grouped by subsets of joints: a upper-body limbs, b lower-body limbs, c head & torso, and d full body

Qualitative results in the LSP and PARSE datasets using the full model from Pishchulin et al. (2013b) can be seen in column (c) from Figs. 10 and 11, respectively. As we discussed in the previous subsection, our proposed rescoring approach can cope with images where the torso is hard to detect, while Pishchulin et al. (2013b) fails. In addition, our proposed approach is more resilient to changes in viewpoint and scale, as we can see in Fig. 10. These images present some cases where some body parts suffer from foreshortening (rows 2–3 in Fig. 10) or lengthening (row 4 in Fig. 10) due to the viewpoint. In these cases, fitting a PS model in the image may become an ill-posed problem, since the expected size ratio among the different body limbs is not kept the same and the PS model assumes the scale is fixed for all the parts. However, our proposed rescoring approach is inherently invariant to scale, since the computed contextual features regarding spatial coordinates are normalized by the reference body part detection height. As a result, the proposed method tends to recover from these aberrations by placing the body parts in the image in a more visually-coherent way (row 4 in Fig. 10).

Fig. 10

Qualitative results on LSP dataset. a Gaussian-shaped position prediction from Pishchulin et al. (2013b), b Estimated pose from Pishchulin et al. (2013b) (just predicting position), c Estimated pose from Pishchulin et al. (2013b) (full model), d Position prediction with our proposed rescoring, and e Our results. White crosses in columns a and d show the part being rescored in each case; from the first row to the last one: rightmost ankle, rightmost wrist, leftmost ankle, leftmost wrist

Fig. 11

Qualitative results on PARSE dataset. a Gaussian-shaped position prediction from Pishchulin et al. (2013b), b Estimated pose from Pishchulin et al. (2013b) (just predicting position), c Estimated pose from Pishchulin et al. (2013b) (full model), d Position prediction with our proposed rescoring, and e Our results. White crosses in columns a and d show the part being rescored in each case; from the first row to the last one: right ankle, right wrist, right ankle, left ankle

As shown in Fig. 8, there are cases where our method does not produce a pose estimate or localization of body parts that agrees with the ground-truth. These challenging images include (i) multiple people and (ii) strongly rotated poses, e.g., people doing handstands. When there is more than one person in the image, the poselet detectors are likely to fire on all the people appearing in the scene, while we are only interested in a specific subject (the one for whom the ground-truth annotations were made available). In future work, this could be addressed by first running a person detector, and then running the proposed pipeline in Algorithm 2 separately for each detected subject. When a person appears strongly rotated, the localization of body parts can be incorrect. Since images that depict rotated poses are relatively rare in the training dataset, our mid-level representation is unlikely to represent them well. Nevertheless, it is likely that there will be a rotated version among the selected poselets in our model. Thus, in future work, it may be possible to address such cases by extending our contextual features to include relative rotation features, and by running the poselet detectors at a wider range of rotations.

4.3 Contextual Features

In Fig. 9 we analyze the contextual features that the decision trees \(q_{\theta }\) select most often for splitting each node, in order to see which features are more important in each case. More specifically, we show the feature selection histograms for different subsets of the human body parts: upper-body limbs, lower-body limbs, head & torso, and full body. As a general trend, the signed relative position (rel.x and rel.y) is the most important contextual feature, since it is among the three most selected features in all cases. This tells us that our rescoring functions are able to exploit the patterns relating the locations of the body parts to the locations of the poselet detections. Furthermore, it is interesting to note that in the case of the lower-body limbs, the relative position in the x-axis is selected more often than in the y-axis (in fact, the relative distance (dist) is selected more often than rel.y). One possible explanation is that the pattern between left and right lower-body limbs and the poselet detections tends to be more consistent than for the upper body. Interestingly, the opposite happens with the head and the torso: relative position in the y-axis is selected more often than in the x-axis. In this case, the head and the torso are usually centered w.r.t. the body parts, so vertical relationships tend to be more discriminative than horizontal ones.

Finally, the relative score is the second most selected contextual feature, in general terms. While the score difference (score.diff) is more frequent in the case of head & torso and the lower-body limbs, the score ratio stands out in the case of upper-body limbs, and is slightly more frequent in the full body case (Figs. 10, 11).

In conclusion, the proposed rescoring approach not only learns the patterns of the spatial layout between the basic parts in the PS model and the poselets in the mid-level representation, but also learns and exploits the patterns of the outputs from the corresponding detectors, as in other machine learning techniques like stacked sequential learning (Puertas et al. 2014).
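To make the discussed features concrete, the following sketch computes the contextual features named in Fig. 9 (rel.x, rel.y, dist, score.diff and the score ratio) between a reference body part detection and one contextual poselet detection, with spatial quantities normalized by the height of the reference detection. The dictionary layout, the sign conventions, the key name `score.ratio` and the function name are assumptions for illustration; the exact definitions are those given in the method section of the paper.

```python
import math


def contextual_features(ref, ctx):
    """Contextual features between a reference and a contextual detection.

    ref, ctx: dicts with keys "x", "y", "score"; ref also needs "height".
    Spatial features are divided by the reference detection height, which
    makes them invariant to a global rescaling of the image.
    Assumes ctx["score"] is nonzero (needed for the ratio).
    """
    h = ref["height"]
    rel_x = (ctx["x"] - ref["x"]) / h  # signed horizontal offset
    rel_y = (ctx["y"] - ref["y"]) / h  # signed vertical offset
    return {
        "rel.x": rel_x,
        "rel.y": rel_y,
        "dist": math.hypot(rel_x, rel_y),
        "score.diff": ref["score"] - ctx["score"],
        "score.ratio": ref["score"] / ctx["score"],
    }


# Toy example: the same configuration at two image scales.
ref = {"x": 10, "y": 20, "height": 10, "score": 2.0}
ctx = {"x": 13, "y": 24, "score": 1.0}
ref2 = {"x": 20, "y": 40, "height": 20, "score": 2.0}
ctx2 = {"x": 26, "y": 48, "score": 1.0}
f1 = contextual_features(ref, ctx)
f2 = contextual_features(ref2, ctx2)
```

In this toy example the two feature dictionaries agree, which illustrates the scale invariance mentioned in Sect. 4.2.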

4.4 Execution Time

In this subsection, we compare the running time of the proposed test execution pipeline (Algorithm 2) with that of the pipeline proposed by Pishchulin et al. (2013b) (Algorithm 1). Running times of both pipelines are summarized in Table 4. Assuming that the basic part detectors have already been run (a step common to both pipelines), we achieve a running time reduction of 68.23 %: our method takes 209.36 s in total, in contrast to the 659.15 s of Pishchulin et al. (2013b). This reduction comes mainly from the smaller mid-level representation: the running time for the poselet detectors is reduced by 95.24 %. Furthermore, we only need to perform PS model inference once, at the end of the execution pipeline (the pipeline of Pishchulin et al. (2013b) needs a first inference to obtain a good torso position hypothesis), resulting in an additional time saving. The total execution time of the proposed approach including basic part detection is 726.82 s. In contrast, the total run time of Pishchulin et al. (2013b) is 1,176.60 s (the basic part detectors take 517.46 s to run).

Table 4 Running time (in seconds) of the test pipelines from Pishchulin et al. (2013b) (Algorithm 1) and our proposal (Algorithm 2)

It is worth noting that we could not evaluate the running time of the specialized torso detector used by Pishchulin et al. (2013b), since the code for that part of the pipeline is not available, so the actual time reduction we achieve may be slightly greater. For clarification, the implementations of both execution pipelines do not make use of GPUs or any other kind of parallel computing.
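The reported reductions can be checked from the figures quoted in this subsection. The short script below only restates that arithmetic; the variable names are chosen for this example, and the results agree with the reported values up to rounding in the last digit.

```python
# Running times in seconds, as quoted in the text.
ours_s = 209.36            # proposed pipeline, excluding basic part detectors
baseline_s = 659.15        # Pishchulin et al. (2013b), same exclusion
part_detectors_s = 517.46  # basic part detectors, common to both pipelines

# Relative running time reduction once the part detectors have been run.
time_reduction = 1 - ours_s / baseline_s

# Total running times including the basic part detectors.
total_ours_s = ours_s + part_detectors_s
total_baseline_s = baseline_s + part_detectors_s

# Relative reduction in the number of poselets (47 instead of 1,013).
poselet_reduction = 1 - 47 / 1013

print(f"time reduction: {time_reduction:.1%}")
print(f"total (ours): {total_ours_s:.2f} s")
print(f"poselet reduction: {poselet_reduction:.2%}")
```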

5 Conclusion

We have proposed a contextual rescoring methodology for predicting the position of body parts in a human pose estimation framework based on pictorial structures. This contextual rescoring approach benefits from a mid-level body part representation to obtain more confident hypotheses for locating basic body parts. In order to define this mid-level representation, we propose an algorithm for the automatic discovery of a set of poselets that maximizes precision while enforcing covering of the different poses that appear in a validation set. Using spatial and score-related features extracted from the set of basic and mid-level part detections, we rescore the body joint hypotheses and combine them with the original scores in the unary potential of a PS model.

Experiments with two standard benchmarks demonstrated that our body part position prediction formulation can yield a performance similar to that of Pishchulin et al. (2013b) in the LSP dataset and even better in the PARSE dataset, while reducing the number of poselets in the mid-level representation by 95.36 %. Compared to other automatic poselet discovery strategies, the one proposed in this paper yields the best results in combination with the proposed rescoring mechanism, in terms of PCP performance in the task of human pose estimation. In addition, the proposed poselet discovery method is fully automatic, while the hierarchical decomposition from Pishchulin et al. (2013b) is manually-defined.

In experiments that examined the influence of \(w_p\), the weight for the extra unary potential that encodes position prediction based on poselet evidence, we noted that the results (PCP) are more stable with respect to changes in \(w_p\) than those of Pishchulin et al. (2013b). Perhaps more crucially, the optimal value for \(w_p\) is higher in our formulation than in Pishchulin et al. (2013b): 0.45 versus 0.15 for LSP, and 0.2 versus 0.1 for PARSE. This suggests that our proposed rescoring approach can produce a more reliable prediction of the part location in the image. A further analysis of the contextual features selected by the weak classifiers in the rescoring approach reveals that the most important features are the relative position and the relative detection scores. This supports the view that our method can discover patterns in the spatial layout of body parts and thus improve the body part localization obtained from the outputs of standard body part detectors.

Finally, we note that the reduction in the mid-level part representation results in a significant reduction in the execution time. We reduce the execution time by 68.23 % with less than 2 % reduction in accuracy w.r.t. the full model from Pishchulin et al. (2013b). Moreover, we only predict the position of the body parts, while Pishchulin et al. (2013b) also predict their rotation and the pairwise parameters, and make use of additional specialized head and torso detectors and a torso position prior. As future work, we could adapt the proposed rescoring approach to predict not only the position but also the rotation of the body parts. Additionally, we could extend the set of contextual features in order to directly model relative orientations between parts through the rescoring functions.