1 Introduction

Detecting and segmenting people in real-world environments are central problems with applications in indexing, surveillance, 3D reconstruction and action recognition. Prior work in 3D human pose reconstruction from monocular images [1,2,3], as well as more recent, successful RGB-D sensing systems based on Kinect [4], has shown that the availability of a figure-ground segmentation opens paths towards robust and scalable systems for human sensing. Despite substantial progress, figure-ground segmentation in RGB images remains extremely challenging, because people are observed from a variety of viewpoints, have complex articulated skeletal structure, varying body proportions and clothing, and are often partially occluded by other people or objects in the scene. The complexity of the background further complicates matters, particularly as any limb decomposition of the human body leads to parts that are relatively regular but not sufficiently distinctive, even when spatial connectivity constraints are enforced [5]. Setting aside appearance inhomogeneity and color variability due to clothing, which can overlap the background distribution significantly, it is well known that many of the generic parallel-line (ribbon) detectors designed to detect human limbs fire at high false positive rates in the background. This has motivated work towards detecting more distinctive part configurations, without restrictive assumptions on part visibility (e.g. full or upper view of the person), for which poselets [6] have been a successful example. However, besides the relatively high false positive rates typical in detection, the transition from a bounding box of the person to a full segmentation of the human body is not straightforward. The challenge is to balance, on one hand, sufficient flexibility to represent variability due to viewpoint, partial views and articulation, and, on the other hand, sufficient constraints to obtain segmentations that correspond to meaningful human shapes, all while relying on region or structural human body part detectors that may only be partial or not always spatially accurate.

In this work we attempt to connect two relevant, recent lines of work for the segmentation of people in real images. We rely on bottom-up figure-ground generation methods and region-level person classifiers in order to identify promising hypotheses for further processing. In a second pass, we set up informed constraints towards (human) class-specific figure-ground segmentation by leveraging skeletal information and data-driven shape priors computed on-the-fly, by matching region candidates against exemplars of a large, recently introduced human motion capture dataset containing 3D and 2D semantic skeleton information of people, as well as images and figure-ground masks from background subtraction (Human3.6M [7]). By exploiting globally optimal parametric max-flow energy minimization solvers, this time based on a class-dependent (as opposed to generic and regular) foreground seeding process [8,9,10], we show that we can considerably improve the quality of competitive object proposal generators. To our knowledge, this is one of the first formulations for class-specific segmentation that can in principle handle multiple viewpoints and any partial view of the person. It is also one of the first to leverage a large dataset of human shapes, together with semantic structural information, which until recently has not been available. We show that such constraints are critical for accuracy, robustness, and computational efficiency.

1.1 Related Work

The literature on segmentation is vast, even when considering only sub-categories like top-down (class-specific) and bottom-up segmentation. Humans are of sufficient interest that special-purpose methodology has been devoted to them, and this has proven effective [5, 6, 11,12,13,14,15,16,17,18]. One approach is to treat shape as a category-specific property and integrate it within models that are driven by bottom-up processing [19,20,21,22,23,24]. Pishchulin et al. [23] develop pictorial structure formulations constrained by poselets, focusing on improving the response quality of an articulated part-based human model. The use of priors based on exemplars has also been explored, in a data-driven process. Both [25, 26] focus on a matching process in order to identify exemplars that correspond to similar scene or object layouts, which are then used in a graph-cut process that enforces spatial smoothness and provides a global solution. Our approach is related to such methods, but we use a novel data-driven prior construction, enforce structural constraints adapted to humans, and search the state space exhaustively by means of parametric max-flow. In contrast to the priors used in [25, 26], which require a more repeatable scene layout, we focus on a prior generation process that can handle a diverse set of viewpoints and arbitrary partial views, not known a priori, and different across the detected instances. Recently, there has been rapid development of deep learning techniques for scene understanding with a focus on semantic segmentation [27,28,29], including humans.

Methods like [30] resemble ours in their reliance on a detection stage and the principle of matching the window representation against a training set where figure-ground segmentations are available, then optimizing an energy function via graph cuts. Our window representation contains additional detail, which makes it possible to match exemplars based on the identified semantic content. Our matching and shape prior construction are optimized for humans, in contrast to the generic ones used in [30] (which can, however, segment any object, not just people, as is our focus here). We use a large prior set of structurally annotated human shapes, and search the state space using a different, parametric multiple-hypotheses scheme. Our prior construction uses, among other elements, a Procrustes alignment similar to [31], but differently: (1) we use it for shape prior construction (input dependent, on-the-fly) within the energy optimizer, as opposed to object detection (classification, construction per class) as in [31]; (2) we only use instances that align well with the query, thus reflecting accurate shape models, as opposed to fusing the top-k instances to capture class variability as in [31]. An alternative, interesting formulation for object segmentation with shape priors is branch-and-mincut [32], which proposes a branch-and-bound procedure in the compound space of binary segmentations and hierarchically organized shapes. However, the bounding process used for efficient search in shape space would rely on knowledge of the type of shapes expected and their full visibility. We focus on a different optimization and modeling approach that can handle arbitrary occlusion patterns of shape. Our prior constraint for optimization is generated on-the-fly by fusing the visible exemplar components, following a structural alignment scheme.

Recently, there has been a resurgence of bottom-up segmentation methods based on multiple proposal generation, with surprisingly good results considering the low-level processing involved. Some of these methods generate segment hypotheses by combining the superpixels [33] of a hierarchical clustering method [34,35,36,37], by varying the segmentation parameters [38], or by searching an energy model parametrically using graph cuts [10, 38,39,40,41,42]. Most of the latter techniques use mid-level shape priors for selection, either following hypothesis generation [10, 38, 40] or during the process. Some methods provide a ranking, diversification and compression of hypotheses, using e.g. Maximal Marginal Relevance (MMR) diversification [10, 38], whereas others report an unordered set [39, 40]. Hypothesis pools on the order of 1,000–10,000 segments in the expansion phase, compressed to 100–1,000 hypotheses after applying trained rankers (operating on mid-level features extracted from segments) with diversification, are typical, with variance due to image complexity and edge structure. While prior work has shown that such hypothesis pools can contain segments of remarkably good quality (\(60-80\%\) intersection over union, IoU, scores are not uncommon), this leaves sufficient room for improvement, particularly since sooner or later one inevitably faces the burden of decision making: selecting one hypothesis to report. It is then not uncommon for performance to drop sharply, to around \(40\%\). This indicates that constraints and prior selection methods towards more compact, better quality hypothesis sets are necessary. Such issues are confronted in the current work.

2 Methodology

We consider an image as \(I:\mathcal{V}\rightarrow R^3\), where \(\mathcal{V}\) represents the set of nodes, each associated with a pixel in the image, and the range is the associated intensity (RGB) vector. The image is modeled as a graph \(G = (\mathcal{V},\mathcal{E})\). We partition the set of nodes \(\mathcal{V}\) into two disjoint sets, corresponding to foreground and background, represented by labels 1 and 0, respectively. Seed pixels \(\mathcal{V}_f\) and \(\mathcal{V}_b\) are subsets of \(\mathcal{V}\) and they are constrained to foreground and background, respectively. \(\mathcal{E}\) is the subset of edges of the graph G which reflect the connections between adjacent pixels. The formulation we propose will rely on object (or foreground) structural skeleton constraints obtained from person detection and 2D localization (in particular the identification of keypoints associated with the joints of the human body, and the resulting set of nodes corresponding to the human skeleton, obtained by connecting keypoints, \(T \subseteq \mathcal{V}\)), as well as a data-driven, human shape fusion prior \(S:\mathcal{V} \rightarrow [0,1]\), constructed on-the-fly by fusing configurations similar to the one detected, based on a large dataset of human shapes with associated 2D skeleton semantics (see Sect. 2.1 for details). The energy function defined over the graph G, with \(X=\{x_u\}\) the set of all pixel labels, is:

$$\begin{aligned} E_{\lambda }(X) = \sum _{u \in \mathcal{V}} U_{\lambda }(x_u) + \sum _{(u,v) \in \mathcal{E}} V_{uv}(x_u,x_v) \end{aligned}$$
(1)

where

$$\begin{aligned} U_{\lambda }(x_u) = D_{\lambda }(x_u) + S(x_u) \end{aligned}$$

with \(\lambda \in \mathbb {R}\), and unary potentials given by semantic foreground constraints \(\mathcal{V}_f \leftarrow T\):

$$\begin{aligned} D_{\lambda }(x_u) = \left\{ \begin{array}{l l} 0 &{} \quad \text{ if } x_u=1\text{, } u \notin \mathcal{V}_b \\ \infty &{} \quad \text{ if } x_u=1\text{, } u \in \mathcal{V}_b \\ \infty &{} \quad \text{ if } x_u=0\text{, } u \in \mathcal{V}_f \\ f(x_u) + \lambda &{} \quad \text{ if } x_u=0\text{, } u \notin \mathcal{V}_f \\ \end{array} \right. \end{aligned}$$
(2)

The foreground bias is implemented as a cost incurred by the assignment of non-seed pixels to the background, and consists of a pixel-dependent value \(f(x_u)\) and a uniform offset \(\lambda \). Two different functions \(f(x_u)\) are used alternatively. The first is constant and equal to 0, resulting in a uniform foreground bias controlled by \(\lambda \). The second uses color. Specifically, RGB color distributions \(p_f(x_u)\) on the seed \(\mathcal{V}_f\) and \(p_b(x_u)\) on the seed \(\mathcal{V}_b\) are estimated, and we set \(f(x_u) = \ln \frac{p_f(x_u)}{p_b(x_u)}\). The probability of pixel i belonging to the foreground is defined as \(p_f(i)=\exp (-\gamma \cdot \min _j(||I(i) - I(j)||))\), where \(\gamma \) is a scaling factor and j indexes representative pixels in the seed region, selected as the centers resulting from a k-means algorithm (k is set to 5 in all of our experiments). The background probability is defined similarly.
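For concreteness, a minimal sketch of this color-based term follows, assuming scikit-learn's k-means, seed pixels flattened into RGB arrays, and an illustrative value for the scaling factor \(\gamma \) (not specified above):

```python
import numpy as np
from sklearn.cluster import KMeans

def seed_color_model(seed_pixels_rgb, n_centers=5):
    """Fit k-means centers (k = 5 in the experiments) to the RGB values of a seed region."""
    km = KMeans(n_clusters=n_centers, n_init=10).fit(seed_pixels_rgb)
    return km.cluster_centers_

def region_probability(image_rgb, centers, gamma=0.05):
    """p(i) = exp(-gamma * min_j ||I(i) - I(j)||), with j ranging over the k-means centers."""
    flat = image_rgb.reshape(-1, 3).astype(np.float64)
    # distance of every pixel to every center; keep the minimum per pixel
    dists = np.linalg.norm(flat[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-gamma * dists.min(axis=1)).reshape(image_rgb.shape[:2])

def color_foreground_bias(image_rgb, fg_seed_pixels, bg_seed_pixels, gamma=0.05, eps=1e-6):
    """f(x_u) = ln(p_f / p_b), the color-based, pixel-dependent part of the bias."""
    p_f = region_probability(image_rgb, seed_color_model(fg_seed_pixels), gamma)
    p_b = region_probability(image_rgb, seed_color_model(bg_seed_pixels), gamma)
    return np.log((p_f + eps) / (p_b + eps))
```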

The pairwise term \(V_{uv}\) penalizes the assignment of different labels to similar neighboring pixels:

$$\begin{aligned} V_{uv}(x_u, x_v) = \left\{ \begin{array}{l l} 0 &{} \quad \text{ if } x_u = x_v \\ g(u,v) &{} \quad \text{ if } x_u \ne x_v \\ \end{array} \right. \end{aligned}$$
(3)

with similarity between adjacent pixels given by \(g(u, v) = \exp \Bigl [ - \frac{\max (Gb(u), Gb(v))}{\sigma ^2} \Bigr ] \). Gb returns the output of the multi-cue contour detector [43, 44] at a pixel location. The boundary sharpness parameter \(\sigma \) controls the smoothness of the pairwise term.
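A short sketch of this contrast-sensitive term, assuming the contour map Gb has been precomputed as an H x W array and using an illustrative value for \(\sigma \):

```python
import numpy as np

def pairwise_weights(gb, sigma=0.1):
    """g(u, v) = exp(-max(Gb(u), Gb(v)) / sigma^2) for 4-connected neighbors.
    gb: H x W contour strength map from a multi-cue boundary detector."""
    right = np.exp(-np.maximum(gb[:, :-1], gb[:, 1:]) / sigma ** 2)  # horizontal edges
    down = np.exp(-np.maximum(gb[:-1, :], gb[1:, :]) / sigma ** 2)   # vertical edges
    return right, down
```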

The function \(f(x_u)\) is the same as in the CPMC [10] algorithm. It takes two forms: the first is constant and acts as a foreground bias, and the second uses color information, specifically the color distribution of the seed pixels computed with the k-means algorithm. The energy function defined by (1) is submodular and can be optimized using parametric max-flow, in order to obtain all breakpoints of \(E_{\lambda }(X)\) as a function of \((\lambda ,X)\) in polynomial time. The advantage of our approach is that it can be used with any object proposal generator based on graph-cut energy minimization.

Given the general formulation in (1) and (2), the key problems to address are: (a) the identification of a putative set of person regions and structural constraint hypotheses T; (b) the construction of an effective, yet flexible, data-driven human shape prior S, based on a sufficiently diverse dataset of people shapes and skeletal structure, given estimates for T; and (c) the minimization of the resulting energy model (1). We address (a), without loss of generality, using a human region classifier (any other set of structural, problem-dependent detectors can be used, e.g. face and hand detectors based on skin color models, or poselets). We address (b) using methodology that combines a large dataset of human pose shapes and body skeletons, collected from Human3.6M [7], with shape matching, alignment and fusion analysis, in order to construct the prior on-the-fly for the instance being analyzed. We refer to a model that leverages both problem-dependent structural constraints T and a data-driven shape prior S, in a single joint optimization problem, as Constrained Parametric Problem-Dependent Cuts with Shape Matching, Alignment and Fusion (CPDC-MAF). The integration of bottom-up region detection constraints with the shape prior construction is described in Sect. 2.1. The CPDC-MAF model can be optimized in polynomial time using parametric max-flow, in order to obtain all breakpoints of the associated energy model (addressing (c)).
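To make the optimization step concrete, the following sketch emulates the \(\lambda \) sweep with a generic max-flow solver (the PyMaxflow package, chosen here for illustration only, not the solver used in our implementation). Unlike a true parametric max-flow solver, which recovers all breakpoints exactly, this sketch simply solves the model at a discrete set of \(\lambda \) values, and it approximates the pairwise term with per-pixel boundary weights:

```python
import numpy as np
import maxflow  # PyMaxflow

def segment_for_lambda(bg_cost, fg_seed, bg_seed, gb, lam, sigma=0.1):
    """Solve energy (1) for one fixed lambda with a generic max-flow solver.
    bg_cost : H x W lambda-independent cost of labeling a pixel background
              (e.g. f(x_u) plus the shape-prior contribution).
    fg_seed, bg_seed : boolean H x W masks of hard-constrained pixels.
    gb : H x W boundary strength map for the contrast-sensitive pairwise term."""
    h, w = bg_cost.shape
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))

    # pairwise term (per-pixel approximation): strong boundaries -> cheap cuts
    g.add_grid_edges(nodes, weights=np.exp(-gb / sigma ** 2), symmetric=True)

    inf = 1e9
    cost_bg = np.clip(bg_cost + lam, 0.0, None)  # D(x_u = 0); capacities must be non-negative
    cost_bg[fg_seed] = inf                       # foreground seeds cannot be background
    cost_fg = np.zeros((h, w))                   # D(x_u = 1) = 0 for unconstrained pixels
    cost_fg[bg_seed] = inf                       # background seeds cannot be foreground

    # source side = foreground: source capacities are paid by background labels,
    # sink capacities by foreground labels
    g.add_grid_tedges(nodes, cost_bg, cost_fg)
    g.maxflow()
    return ~g.get_grid_segments(nodes)           # True = foreground

def sweep_lambda(bg_cost, fg_seed, bg_seed, gb, lambdas):
    """Approximate the parametric breakpoints by a discrete sweep over lambda."""
    return [segment_for_lambda(bg_cost, fg_seed, bg_seed, gb, lam) for lam in lambdas]
```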

2.1 Data-Driven Shape Matching, Alignment and Fusion (MAF)

We aim to obtain an improved figure-ground segmentation for persons by combining bottom-up and top-down, class-specific information. We initialize our proposal set using CPMC. While any figure-ground segmentation proposal method can be employed in principle, we choose CPMC due to its performance and because our method can be viewed as a generalization with problem-dependent seeds and shape priors. We filter the top N segment candidates using an O2P region classifier [45] trained to respond to humans, using examples from Human3.6M, to obtain \(\mathcal{D}=\{d_i=(\mathbf {z}_i, \mathbf {b}_i) \mid i=1,\ldots ,N\}\). Each candidate segment is represented by a binary mask \(\mathbf {z}\), where 1 stands for foreground and 0 for background, and a bounding box \(\mathbf {b} \in \mathbb {R}^4\), where \(\mathbf {b} = (m, n, w, h)\); m and n represent the image coordinates of the bottom left corner of the bounding box, and w and h represent its width and height.
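A minimal sketch of this candidate representation and filtering step; the classifier scores are assumed to be given, and the helper names are hypothetical:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Candidate:
    mask: np.ndarray   # binary figure-ground mask z
    bbox: tuple        # (m, n, w, h): corner plus width and height (corner convention simplified)
    score: float       # person-classifier response

def bounding_box(mask):
    ys, xs = np.nonzero(mask)
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))

def top_person_candidates(masks, scores, n_keep):
    """Keep the N segment proposals with the highest person-classifier scores."""
    order = np.argsort(scores)[::-1][:n_keep]
    return [Candidate(masks[i], bounding_box(masks[i]), float(scores[i])) for i in order]
```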

We use the set of human region candidates in order to match against a set of human shapes and construct a shape prior. There are challenges, however, particularly being able to: (1) access a sufficiently representative set of human shapes to construct the prior; (2) be sufficiently flexible so that human shapes from the dataset that are very different from the shape being analyzed do not negatively impact estimates; and (3) handle partial views: while we rely on bottom-up proposals that can handle partial views, using a shape prior that can only represent, e.g., full or upper-body views would not be effective.

Fig. 1. Our Shape Matching, Alignment and Fusion (MAF) construction based on semantic matching, structural alignment and clipping, followed by fusion, to reflect the partial view. Notice that the prior construction allows us to match partial views of a putative human detected segment to fully visible exemplars in Human3.6M. This allows us to handle arbitrary patterns of occlusion. We can thus create a well adapted prior, on-the-fly, given a candidate segment.

Fig. 2. Processing steps of our segmentation methods based on Constrained Parametric Problem-Dependent Cuts (CPDC) with Shape Matching, Alignment and Fusion (MAF).

We address (1) by employing a dataset of 100,000 human shapes together with the corresponding skeleton structure, sub-sampled from the recently created Human3.6M dataset [7]; (2) by employing a matching, alignment and fusion technique between the current segment and the individual exemplar shapes in the dataset, where shapes and structures that cannot be matched and aligned properly are discarded; and (3) by leveraging the implicit correspondences available across training shapes, at the level of local shape matches, by only aligning and warping those components of the exemplar shapes that can be matched to the query. A sample flow of our entire method can be visualized in Figs. 1 and 2.

Boundary Point Sampling: Given a bottom-up figure-ground proposal represented as a binary mask \(\mathbf {z} \in \mathcal {D}\), we sample the image coordinates of the boundary points of the foreground segment. Thus we obtain a set of 2D points \(\mathbf {p}_{j}, j=1,\ldots ,K\), with \(\mathbf {p}_{j} = (x_j, y_j) \in \mathbb {R}^2\). We loop through the shapes of our human shape dataset (Human3.6M) and, for each shape, rotate and scale it so that it has the same orientation and scale as the foreground candidate segment, then sample its boundary points. Thus we obtain sets of 2D points \(\mathbf {q}_{jl}, j=1,\ldots ,K\), with \(l=1,\ldots ,L\), where L represents the number of poses in the shape-pose dataset, in our case \(L=100,000\).
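A possible implementation of the boundary sampling step, assuming OpenCV 4 and an illustrative number of samples K (the exact value is not specified above):

```python
import numpy as np
import cv2

def sample_boundary_points(mask, num_points=100):
    """Uniformly sample K 2D points (x, y) along the outer boundary of a binary mask."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).squeeze(1)   # (M, 2) array of (x, y)
    idx = np.linspace(0, len(boundary) - 1, num_points).astype(int)
    return boundary[idx]
```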

Shape Matching and Transform Matrix: We employ the shape context descriptor [46] at each position \(\mathbf {p}_{j}\) from the candidate segment and at each position \(\mathbf {q}_{jl}\). We evaluate a \(\chi ^2\) distance [47] on the resulting descriptors to select the indexes l with sufficiently good matches, for which we then estimate an affine transformation.
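As an illustration, the \(\chi ^2\) comparison could look as follows, with the shape context histograms assumed to be precomputed and normalized:

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two (normalized) shape-context histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def match_cost(desc_query, desc_exemplar):
    """Average chi-squared cost over corresponding boundary points of the query
    segment and an (already rotated and scaled) exemplar shape."""
    return np.mean([chi2_distance(a, b) for a, b in zip(desc_query, desc_exemplar)])
```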

We apply a 2D Procrustes transform with 5 degrees of freedom (rotation, anisotropic scaling including reflections, and translation) to each \(\mathbf {q}_{jl}\) in order to align each shape in the dataset with the corresponding boundary points of the query. This results in a \(3\times 3\) transformation matrix \(\mathbf {W}_l\) and a transform error \(e_l\) (the average over \(e_{jl}, j=1,\ldots ,K\)), which represents the Euclidean distance between the boundary points \(\mathbf {p}_{j}\) and the Procrustes-transformed ones, \(\mathbf {W}_l \cdot \mathbf {q}_{jl}\), in the image plane.
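A simplified sketch of the alignment step: for brevity it fits a full 6-degree-of-freedom affine map by least squares, rather than the 5-degree-of-freedom transform described above, and returns the transform matrix together with the mean residual:

```python
import numpy as np

def fit_alignment(p, q):
    """Least-squares alignment of exemplar boundary points q onto query points p.
    p, q : (K, 2) arrays of corresponding 2D points.
    Returns the 3x3 transform W and the mean residual error e."""
    Q = np.hstack([q, np.ones((len(q), 1))])        # homogeneous exemplar points
    A, _, _, _ = np.linalg.lstsq(Q, p, rcond=None)  # (3, 2) affine parameters
    W = np.eye(3)
    W[:2, :] = A.T
    aligned = (W @ Q.T)[:2].T                       # transformed exemplar points
    err = np.mean(np.linalg.norm(aligned - p, axis=1))
    return W, err
```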

Prior Shape Selection and Warping: In order to determine which prior shapes are relevant for the current query, we identify the subset of dataset indexes \(\mathcal{T}\) that correspond to transformation errors smaller than a given threshold \(\epsilon \), and obtain the corresponding figure-ground masks \(\mathbf {m}_t, t \in \mathcal{T}\). For each mask \(\mathbf {m}_t\), we select the coordinates of its foreground pixels and warp them using the transformation matrix computed above. We apply the same procedure to the skeleton configuration attached to the corresponding mask. Thus, we obtain the coordinates of the foreground pixels of the transformed mask, \(\varPhi _t\), and the transformed skeleton coordinates, \(\varPsi _t\).
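A short sketch of the warping step, assuming the \(3\times 3\) transform matrix from the alignment stage; the same routine applies to the skeleton joint coordinates:

```python
import numpy as np

def warp_foreground_pixels(mask, W):
    """Apply a 3x3 alignment transform W to the foreground pixel coordinates
    of an exemplar mask, returning warped (x, y) coordinates (Phi_t)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=0).astype(np.float64)  # homogeneous
    warped = W @ pts
    return (warped[:2] / warped[2:]).T        # (num_pixels, 2) array of (x, y)
```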

Prior Shape Fusion: We compute the mean of the entire set of transformed masks \(\varPhi _t\), obtaining a MAF prior S corresponding to the detection d, as seen in Fig. 1. The values of the shape prior mask range from 0 to 1, representing background and foreground probability, respectively. We also compute the mean of the entire set of transformed skeletons \(\varPsi _t\), obtaining a configuration of keypoints \(\mathbf {B} \in \mathbb {R}^{3\times 15}\) with \(\mathbf {B}_j = (x, y, 1)\), where x and y represent the image coordinates of the warped joint from Human3.6M. This can be used to obtain a problem-dependent mask \(\mathbf {m}\) as follows. Initially we set the mask to the dimensions of the image, filled with 0. We then use Bresenham's algorithm to draw a line between semantically adjacent joints, for example left elbow and left wrist, right hip and right knee, and so on. We assign the set of skeleton nodes to the foreground seed as \(T=\{i \in \mathcal{V} \mid \mathbf {m}(i)=1\}\). The entire procedure for obtaining the shape prior information (mask and skeleton) is illustrated in Algorithm 1.

Algorithm 1. MAF shape prior construction (mask and skeleton seeds).
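Below is a hedged sketch of the fusion and seed construction steps; the skeleton edge list is hypothetical, and rasterizing warped coordinates pixel-by-pixel is a simplification of warping the full masks:

```python
import numpy as np
from skimage.draw import line  # Bresenham line rasterization

def rasterize(coords_xy, image_shape):
    """Turn warped foreground coordinates (Phi_t) back into a binary mask."""
    m = np.zeros(image_shape, dtype=np.float64)
    xs = np.clip(np.round(coords_xy[:, 0]).astype(int), 0, image_shape[1] - 1)
    ys = np.clip(np.round(coords_xy[:, 1]).astype(int), 0, image_shape[0] - 1)
    m[ys, xs] = 1.0
    return m

def fuse_shape_prior(warped_coords_list, image_shape):
    """MAF prior S: the mean of the rasterized, aligned exemplar masks, in [0, 1]."""
    return np.mean([rasterize(c, image_shape) for c in warped_coords_list], axis=0)

# Hypothetical joint index pairs standing in for the semantically adjacent joints.
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9)]

def skeleton_seed_mask(joints_xy, image_shape, edges=SKELETON_EDGES):
    """Rasterize the fused skeleton B into the seed mask m; the foreground
    seed set T is the set of pixels where m == 1."""
    m = np.zeros(image_shape, dtype=np.uint8)
    for a, b in edges:
        r0, c0 = int(round(joints_xy[a, 1])), int(round(joints_xy[a, 0]))
        r1, c1 = int(round(joints_xy[b, 1])), int(round(joints_xy[b, 0]))
        rr, cc = line(r0, c0, r1, c1)
        ok = (rr >= 0) & (rr < image_shape[0]) & (cc >= 0) & (cc < image_shape[1])
        m[rr[ok], cc[ok]] = 1
    return m
```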

3 Experiments

We test our methodology on two challenging datasets: H3D [48], which contains 107 images, and MPII [49], with 3799 images. Figure-ground segmentation annotations are available for both datasets; for MPII, we generated the figure-ground human segment annotations ourselves. Both H3D and MPII contain full as well as partial views of persons, with self-occlusion, which makes them extremely challenging.

We run several segmentation algorithms, including CPMC [10] as well as our proposed CPDC-MAF, where we use bottom-up person region detectors trained on Human3.6M and region descriptors based on O2P [45]. We also constructed a model referred to as CPDC-MAF-POSELETS, built using problem-dependent seeds based on a 2D pose detector instead of the proposed segments of a figure-ground segmentation algorithm. While any methodology that provides body keypoints (parts or articulations) is applicable, we choose the poselet detector because it provides results under partial views of the body or self-occlusions of certain joints, together with joint position estimates. Conditioned on a detection, we apply the same idea as in our CPDC-MAF, except that we use the detected skeletal keypoints to match against the exemplars in the Human3.6M dataset. A matching process based on semantic keypoints (the body joints) is explicit, immediate (since joints are available both for the putative poselet detection and for the exemplar shapes in Human3.6M), and arguably simpler than matching shapes in the absence of skeletal information. The downside is that when the poselet detection is incorrect, the matching will be incorrect as well (note that alignments with high error following matching are nevertheless discarded within the MAF process).

We initialize CPDC-MAF, bottom-up, using candidate segments from the CPMC pool, selected based on their person ranking score after applying the O2P classifier. This is followed by a non-maximum suppression step, where we remove one of any pair of segments whose overlap exceeds 0.25. We use the MAF process to reject irrelevant candidates and to build shape prior masks and skeleton configuration seeds for the segments with good shape context matches. For each resulting shape prior and skeleton seed we run the CPDC-MAF model, and the resulting pools from all candidate segments are merged to obtain the human region proposals for the entire image.
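A minimal sketch of the non-maximum suppression step on ranked segment masks:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def non_max_suppression(masks, scores, overlap_thresh=0.25):
    """Greedy NMS: of any pair of segments with IoU above the threshold,
    keep only the higher-scoring one; returns the kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= overlap_thresh for j in keep):
            keep.append(i)
    return keep
```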

Table 1. Accuracy and pool size statistics for different methods on data from H3D and MPII. We report the average IoU over the test set between the first segment of the ranked pool and the ground-truth figure-ground segmentation (First), the average IoU over the test set of the segment with the highest IoU against the ground-truth figure-ground segmentation (Best), and the average pool size (Pool Size).
Fig. 3. Size of the segmentation pool on MPII for various methods, along with the average pool size (in the legend). Notice the significant difference between the pool sizes of CPDC-MAF-POSELETS and CPDC-MAF compared to those of CPMC: CPMC pools average around 700 segments, whereas the pools of CPDC-MAF and CPDC-MAF-POSELETS are considerably smaller, around 100 segments.

Fig. 4. IoU of the first segment from the ranked pool on MPII. The values for CPMC and CPDC-MAF-POSELETS have higher variance than those of CPDC-MAF, resulting in the performance drop reflected in their averages.

Fig. 5. Sample generated segments for images from MPII. From left to right: original image, followed by the top 5 ranked segments of the CPDC-MAF generated pool.

Fig. 6. Segmentation examples for various methods. From left to right: original image, CPMC with default settings on the person's bounding box, CPDC-MAF-POSELETS, and CPDC-MAF. See also Table 1 for quantitative results.

Fig. 7. Segmentation examples for difficult cases, including partial views and occlusions. From left to right: original image, CPMC with default settings on the person's bounding box, CPDC-MAF-POSELETS, and CPDC-MAF.

For each testing setup, we report the mean value (computed over the entire test set) of the IoU scores between the first segment in the ranked pool and the ground-truth figure-ground segmentation of each image. We also report the mean of the IoU scores of the pool segment with the best IoU against the ground-truth figure-ground segmentation.
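These two statistics can be computed as follows, assuming each pool is a ranked list of binary masks:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def pool_statistics(pools, ground_truth):
    """'First': mean IoU of each pool's top-ranked segment against the ground truth;
    'Best': mean IoU of each pool's best segment against the ground truth."""
    first = np.mean([mask_iou(pool[0], gt) for pool, gt in zip(pools, ground_truth)])
    best = np.mean([max(mask_iou(s, gt) for s in pool)
                    for pool, gt in zip(pools, ground_truth)])
    return first, best
```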

Results for the different datasets can be visualized in Table 1. In turn, Figs. 3 and 4 show plots of the segment pool sizes and the IoU scores of the highest ranked segments generated by different methods, with image indexes sorted according to the best performing method (CPDC-MAF). Qualitative segmentation results for the various methods tested are given in Figs. 6 and 7. We also illustrate sample results with the top ranked pool of segments in Fig. 5.

4 Conclusions

We have presented class-specific image segmentation models that leverage human body part detectors based on bottom-up figure-ground proposals, parametric max-flow solvers, and a large dataset of human shapes. Our formulation leads to a sub-modular energy model that combines class-specific structural constraints and data-driven shape priors, within a parametric max-flow optimization methodology that systematically computes all breakpoints of the model in polynomial time. We also propose a data-driven, class-specific prior fusion methodology, based on shape matching, alignment and fusion, that allows the shape prior to be constructed on-the-fly, for arbitrary viewpoints and partial views. We demonstrate competitive results on two challenging datasets, H3D [48] and MPII [49], where we improve the first ranked hypothesis estimates of mid-level segmentation methods by \(20\%\), with pool sizes that are up to one order of magnitude smaller. In future work we will explore additional class-dependent seed generation mechanisms and plan to study the extension of the proposed framework to video.