Abstract
The figure-ground segmentation of humans in images captured in natural environments is an outstanding open problem due to the presence of complex backgrounds, articulation, varying body proportions, partial views and viewpoint changes. In this work we propose class-specific segmentation models that leverage parametric max-flow image segmentation and a large dataset of human shapes. Our contributions are as follows: (1) formulation of a sub-modular energy model that combines class-specific structural constraints and data-driven shape priors, within a parametric max-flow optimization methodology that systematically computes all breakpoints of the model in polynomial time; (2) design of a data-driven class-specific fusion methodology, based on matching against a large training set of exemplar human shapes (100,000 in our experiments), that allows the shape prior to be constructed on-the-fly, for arbitrary viewpoints and partial views.
1 Introduction
Detecting and segmenting people in real-world environments are central problems with applications in indexing, surveillance, 3D reconstruction and action recognition. Prior work in 3D human pose reconstruction from monocular images [1,2,3], as well as more recent, successful RGB-D sensing systems based on Kinect [4], has shown that the availability of a figure-ground segmentation opens paths towards robust and scalable systems for human sensing. Despite substantial progress, figure-ground segmentation in RGB images remains extremely challenging, because people are observed from a variety of viewpoints, have complex articulated skeletal structure, varying body proportions and clothing, and are often partially occluded by other people or objects in the scene. The complexity of the background further complicates matters, particularly as any limb decomposition of the human body leads to parts that are relatively regular but not sufficiently distinctive, even when spatial connectivity constraints are enforced [5]. Setting aside appearance inhomogeneity and color variability due to clothing, which can overlap the background distribution significantly, it is well known that many of the generic, parallel-line (ribbon) detectors designed to detect human limbs fire at high false-positive rates in the background. This has motivated work towards detecting more distinctive part configurations, without restrictive assumptions on part visibility (e.g. full or upper view of the person), for which poselets [6] have been a successful example. However, besides the relatively high false-positive rates typical in detection, the transition from a bounding box of the person to a full segmentation of the human body is not straightforward.
The challenge is to balance, on one hand, sufficient flexibility towards representing variability due to viewpoint, partial views and articulation, and, on the other hand, sufficient constraints in order to obtain segmentations that correspond to meaningful human shapes, all relying on region or structural human body part detectors that may only be partial or not always spatially accurate.
In this work we attempt to connect two relevant, recent lines of work for the segmentation of people in real images. We rely on bottom-up figure-ground generation methods and region-level person classifiers in order to identify promising hypotheses for further processing. In a second pass, we set up informed constraints towards (human) class-specific figure-ground segmentation by leveraging skeletal information and data-driven shape priors computed on-the-fly, by matching region candidates against exemplars of a large, recently introduced human motion capture dataset (Human3.6M [7]) containing 3D and 2D semantic skeleton information of people, as well as images and figure-ground masks from background subtraction. By exploiting globally optimal parametric max-flow energy minimization solvers, this time based on a class-dependent (as opposed to generic and regular) foreground seeding process [8,9,10], we show that we can considerably improve the quality of competitive object proposal generators. To our knowledge, this is one of the first formulations for class-specific segmentation that can, in principle, handle multiple viewpoints and any partial view of the person. It is also one of the first to leverage a large dataset of human shapes, together with semantic structural information, which until recently has not been available. We show that such constraints are critical for accuracy, robustness, and computational efficiency.
1.1 Related Work
The literature on segmentation is vast, even when considering only sub-categories like top-down (class-specific) and bottom-up segmentation. Humans are of sufficient interest to warrant dedicated methodology, and this has proven effective [5, 6, 11,12,13,14,15,16,17,18]. One approach is to consider shape as a category-specific property and integrate it within models that are driven by bottom-up processing [19,20,21,22,23,24]. Pishchulin et al. [23] develop pictorial structure formulations constrained by poselets, focusing on improving the response quality of an articulated part-based human model. The use of priors based on exemplars has also been explored, in a data-driven process. Both [25, 26] focus on a matching process in order to identify exemplars that correspond to similar scene or object layouts, which are then used in a graph-cut process that enforces spatial smoothness and provides a global solution. Our approach is related to such methods, but we use a novel data-driven prior construction, enforce structural constraints adapted to humans, and search the state space exhaustively by means of parametric max-flow. In contrast to the priors used in [25, 26], which require a more repeatable scene layout, we focus on a prior generation process that can handle a diverse set of viewpoints and arbitrary partial views, not known a priori, and different across the detected instances. Recently, there has been rapid development of deep learning techniques for scene understanding, with a focus on semantic segmentation [27,28,29], including humans.
Methods like [30] resemble ours in their reliance on a detection stage and the principle of matching that window representation against a training set where figure-ground segmentations are available, then optimizing an energy function via graph cuts. Our window representation contains additional detail, which makes it possible to match exemplars based on the identified semantic content. Our matching and shape prior construction are optimized for humans, in contrast to the generic ones used in [30] (which can, however, segment any object, not just people, in contrast to our focus here; see Note 1). We use a large prior set of structurally annotated human shapes, and search the state space using a different, parametric multiple-hypotheses scheme. Our prior construction uses, among other elements, a Procrustes alignment similar to [31], but applied differently: (1) we use it for shape prior construction (input-dependent, on-the-fly) within the energy optimizer, as opposed to object detection (classification, with construction per class) as in [31]; (2) we only use instances that align well with the query, thus reflecting accurate shape models, as opposed to fusing the top-k instances to capture class variability as in [31]. An alternative, interesting formulation for object segmentation with shape priors is branch-and-mincut [32], which proposes a branch-and-bound procedure over the compound space of binary segmentations and hierarchically organized shapes. However, the bounding process used for efficient search in shape space relies on knowledge of the type of shapes expected and their full visibility. We focus on a different optimization and modeling approach that can handle arbitrary occlusion patterns of shape. Our prior constraint for optimization is generated on-the-fly by fusing the visible exemplar components, following a structural alignment scheme.
Recently, there has been a resurgence of bottom-up segmentation methods based on multiple proposal generation, with surprisingly good results considering the low-level processing involved. Some of these methods generate segment hypotheses by combining the superpixels [33] of a hierarchical clustering method [34,35,36,37], by varying the segmentation parameters [38], or by searching an energy model, parametrically, using graph cuts [10, 38,39,40,41,42]. Most of the latter techniques use mid-level shape priors for selection, either following hypothesis generation [10, 38, 40] or during the process. Some methods provide a ranking, diversification and compression of hypotheses, using e.g. Maximal Marginal Relevance (MMR) diversification [10, 38], whereas others report an unordered set [39, 40]. Hypothesis pools on the order of 1,000–10,000 segments in the expansionary phase, compressed to 100–1,000 hypotheses after applying trained rankers (operating on mid-level features extracted from segments) with diversification, are typical, with variance due to image complexity and edge structure. While prior work has shown that such hypothesis pools can contain segments of remarkably good quality (\(60-80\%\) intersection-over-union, IoU, scores are not uncommon), this leaves sufficient room for improvement, particularly since sooner or later one inevitably faces the burden of decision making: selecting one hypothesis to report. It is then not uncommon for performance to drop sharply, to \(40\%\). This indicates that constraints and prior selection methods towards more compact, better-quality hypothesis sets are necessary. Such issues are confronted in the current work.
2 Methodology
We consider an image as \(I:\mathcal{V}\rightarrow R^3\), where \(\mathcal{V}\) represents the set of nodes, each associated with a pixel in the image, and the range is the associated intensity (RGB) vector. The image is modeled as a graph \(G = (\mathcal{V},\mathcal{E})\). We partition the set of nodes \(\mathcal{V}\) into two disjoint sets, corresponding to foreground and background, represented by labels 1 and 0, respectively. Seed pixels \(\mathcal{V}_f\) and \(\mathcal{V}_b\) are subsets of \(\mathcal{V}\) and they are constrained to foreground and background, respectively. \(\mathcal{E}\) is the subset of edges of the graph G which reflect the connections between adjacent pixels. The formulation we propose will rely on object (or foreground) structural skeleton constraints obtained from person detection and 2D localization (in particular the identification of keypoints associated with the joints of the human body, and the resulting set of nodes corresponding to the human skeleton, obtained by connecting keypoints, \(T \subseteq \mathcal{V}\)), as well as a data-driven, human shape fusion prior \(S:\mathcal{V} \rightarrow [0,1]\), constructed on-the-fly by fusing configurations similar to the one detected, based on a large dataset of human shapes with associated 2D skeleton semantics (see Sect. 2.1 for details). The energy function defined over the graph G, with \(X=\{x_u\}\) being the set of all pixel labels, is:
\[ E_{\lambda }(X) = \sum _{u \in \mathcal{V}} D_{\lambda }(x_u) + \sum _{(u,v) \in \mathcal{E}} V_{uv}(x_u, x_v) \qquad (1) \]

with \(\lambda \in \mathbb {R}\), and unary potentials given by semantic foreground constraints \(\mathcal{V}_f \leftarrow T\):

\[ D_{\lambda }(x_u) = \begin{cases} \infty & \text{if } x_u = 0 \text{ and } u \in \mathcal{V}_f \\ \infty & \text{if } x_u = 1 \text{ and } u \in \mathcal{V}_b \\ f(x_u) + \lambda & \text{if } x_u = 0 \text{ and } u \notin \mathcal{V}_f \cup \mathcal{V}_b \\ 0 & \text{otherwise} \end{cases} \qquad (2) \]
The foreground bias is implemented as a cost incurred by the assignment of non-seed pixels to the background, consisting of a pixel-dependent value \(f(x_u)\) and a uniform offset \(\lambda \). Two different functions \(f(x_u)\) are used alternatively. The first is constant and equal to 0, resulting in a uniform (variable) foreground bias. The second uses color. Specifically, RGB color distributions \(p_f(x_u)\) on the seed \(\mathcal{V}_f\) and \(p_b(x_u)\) on the seed \(\mathcal{V}_b\) are estimated, and we derive \(f(x_u) = \ln \frac{p_f(x_u)}{p_b(x_u)}\). The probability of pixel i belonging to the foreground is defined as \(p_f(i)=\exp (-\gamma \cdot \min _j(||I(i) - I(j)||))\), where \(\gamma \) is a scaling factor and j indexes representative pixels in the seed region, selected as centers resulting from a k-means algorithm (k is set to 5 in all of our experiments). The background probability is defined similarly.
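The color-based foreground bias above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the tiny k-means routine, the \(\gamma \) value and the \(\epsilon \) smoothing constant are our assumptions.

```python
import math

def kmeans_centers(pixels, k=5, iters=20):
    """Tiny k-means over RGB tuples; k = 5 as in the text (illustrative only)."""
    centers = list(pixels[:k])
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in pixels:
            j = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def seed_prob(pixel, centers, gamma=0.05):
    """p(i) = exp(-gamma * min_j ||I(i) - I(j)||), j over seed-region centers."""
    d = min(math.dist(pixel, c) for c in centers)
    return math.exp(-gamma * d)

def color_bias(pixel, fg_centers, bg_centers, gamma=0.05, eps=1e-12):
    """f(x_u) = ln(p_f / p_b): positive where the pixel resembles the fg seed."""
    return math.log(seed_prob(pixel, fg_centers, gamma) + eps) - \
           math.log(seed_prob(pixel, bg_centers, gamma) + eps)
```

For a reddish pixel with red foreground centers and blue background centers, `color_bias` is positive, biasing the pixel towards the foreground label.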
The pairwise term \(V_{uv}\) penalizes the assignment of different labels to similar neighboring pixels:

\[ V_{uv}(x_u, x_v) = g(u, v) \cdot [x_u \ne x_v] \qquad (3) \]

with \([\cdot ]\) the indicator function and similarity between adjacent pixels given by \(g(u, v) = \exp \Bigl [ - \frac{\max (Gb(u), Gb(v))}{\sigma ^2} \Bigr ] \). Gb returns the output of the multi-cue contour detector [43, 44] at a pixel location. The boundary sharpness parameter \(\sigma \) controls the smoothness of the pairwise term.
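A sketch of the pairwise weights over a 4-connected pixel grid follows; the Gb map is a hypothetical input standing in for the contour detector's response, and the default \(\sigma \) is our assumption.

```python
import math

def pairwise_weight(gb_u, gb_v, sigma=0.5):
    """g(u, v) = exp(-max(Gb(u), Gb(v)) / sigma^2); Gb in [0, 1] is the
    contour response at each pixel (a hypothetical input here)."""
    return math.exp(-max(gb_u, gb_v) / sigma ** 2)

def grid_edges(gb, sigma=0.5):
    """Edge list (u, v, weight) for a 4-connected grid over a 2D Gb map."""
    H, W = len(gb), len(gb[0])
    edges = []
    for y in range(H):
        for x in range(W):
            for dy, dx in ((0, 1), (1, 0)):  # right and down neighbors
                y2, x2 = y + dy, x + dx
                if y2 < H and x2 < W:
                    edges.append(((y, x), (y2, x2),
                                  pairwise_weight(gb[y][x], gb[y2][x2], sigma)))
    return edges
```

Edges that cross a strong contour receive small weights, so cutting the segmentation boundary there is cheap, while cutting through flat regions is expensive.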
The function \(f(x_u)\) is the same as in the CPMC [10] algorithm. The energy function defined by (1) is submodular and can be optimized using parametric max-flow, in order to obtain all breakpoints of \(E_{\lambda }(X)\) as a function of \((\lambda ,X)\) in polynomial time. An advantage of our approach is that it can be used with any object proposal generator based on graph-cut energy minimization.
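The effect of sweeping \(\lambda \) can be illustrated with a toy min-cut solver. This sketch solves an independent graph cut (Edmonds-Karp) per \(\lambda \) value on a hand-built graph; the exact parametric max-flow method of [8, 9] shares flow across breakpoints, which this simplification does not, and negative unaries are simply clamped at zero rather than reparameterized.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow on a nested-dict capacity graph (mutated in place)."""
    flow = 0.0
    while True:
        parent = {s: None}              # BFS for a shortest augmenting path
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 1e-9 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)   # bottleneck capacity
        for u, v in path:                       # update residual graph
            cap[u][v] -= aug
            cap[v][u] = cap[v].get(u, 0.0) + aug
        flow += aug

def segment(unary_bg, seeds_fg, seeds_bg, edges, lam):
    """One cut of the model at fixed lambda. Source->u capacity is the cost of
    labeling u background (f(x_u) + lambda for non-seeds, 'infinite' for fg
    seeds); u->sink is 'infinite' for bg seeds. Edge endpoints must be keys of
    unary_bg. Returns the foreground pixel set."""
    BIG = 1e9                                   # stand-in for infinite capacity
    cap = {'s': {}, 't': {}}
    for u, f in unary_bg.items():
        cap.setdefault(u, {})
        cap['s'][u] = BIG if u in seeds_fg else max(f + lam, 0.0)
        cap[u]['t'] = BIG if u in seeds_bg else 0.0
    for u, v, w in edges:                       # pairwise weights, both ways
        cap[u][v] = cap[u].get(v, 0.0) + w
        cap[v][u] = cap[v].get(u, 0.0) + w
    max_flow(cap, 's', 't')
    seen, q = {'s'}, deque(['s'])               # residual reachability from s
    while q:
        u = q.popleft()
        for v, c in cap[u].items():
            if c > 1e-9 and v not in seen:
                seen.add(v)
                q.append(v)
    return {u for u in unary_bg if u in seen}   # source side = foreground

def lambda_sweep(unary_bg, seeds_fg, seeds_bg, edges, lams):
    """Segmentations for a schedule of lambda values; the foreground grows
    (nested solutions) as the background-assignment cost increases."""
    return [segment(unary_bg, seeds_fg, seeds_bg, edges, l) for l in lams]
```

On a 4-pixel chain with a foreground seed at one end and a background seed at the other, small \(\lambda \) keeps the cut at the weakest pairwise link, while large \(\lambda \) pushes the boundary towards the background seed, yielding a nested family of foregrounds.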
Given the general formulation in (1) and (2), the key problems to address are: (a) the identification of a putative set of person regions and structural constraint hypotheses T; (b) the construction of an effective, yet flexible, data-driven human shape prior S, based on a sufficiently diverse dataset of people shapes and skeletal structure, given estimates for T; and (c) the minimization of the resulting energy model (1). We address (a), without loss of generality, using a human region classifier (any other set of structural, problem-dependent detectors can be used, e.g. face and hand detectors based on skin color models, or poselets). We address (b) using methodology that combines a large dataset of human pose shapes and body skeletons, collected from Human3.6M [7], with shape matching, alignment and fusion analysis, in order to construct the prior on-the-fly for the instance being analyzed. We refer to a model that leverages both problem-dependent structural constraints T and a data-driven shape prior S, in a single joint optimization problem, as Constrained Parametric Problem-Dependent Cuts with Shape Matching, Alignment and Fusion (CPDC-MAF). The integration of bottom-up region detection constraints with the shape prior construction is described in Sect. 2.1. The CPDC-MAF model can be optimized in polynomial time using parametric max-flow, in order to obtain all breakpoints of the associated energy model (addressing (c)).
2.1 Data-Driven Shape Matching, Alignment and Fusion (MAF)
We aim to obtain an improved figure-ground segmentation of persons by combining bottom-up and top-down, class-specific information. We initialize our proposal set using CPMC. While any figure-ground segmentation proposal method can be employed in principle, we choose CPMC due to its performance and because our method can be viewed as a generalization with problem-dependent seeds and shape priors. We filter the top N segment candidates using an O2P region classifier [45] trained to respond to humans, using examples from Human3.6M, to obtain \(\mathcal{D}=\{d_i=(\mathbf {z}_i, \mathbf {b}_i) \,|\, i=1,\ldots ,N\}\). Each candidate segment is represented by a binary mask \(\mathbf {z}\), where 1 stands for foreground and 0 stands for background, and a bounding box \(\mathbf {b} \in \mathbb {R}^4\), \(\mathbf {b} = (m, n, w, h)\), where m and n represent the image coordinates of the bottom-left corner of the bounding box, and w and h represent its width and height.
We use the set of human region candidates in order to match against a set of human shapes and construct a shape prior. There are challenges however, particularly being able to: (1) access a sufficiently representative set of human shapes to construct the prior, (2) be sufficiently flexible so that human shapes from the dataset, which are very different from the shape being analyzed, would not negatively impact estimates, (3) handle partial views—while we rely on bottom-up proposals that can handle partial views, the use, in contrast, of a shape prior that can only represent, e.g. full or upper-body views, would not be effective.
We address (1) by employing a dataset of 100,000 human shapes together with the corresponding skeleton structure, sub-sampled from the recently created Human3.6M dataset [7]; (2) by employing a matching, alignment and fusion technique between the current segment and the individual exemplar shapes in the dataset. Shapes and structures which cannot be matched and aligned properly are discarded. We address (3) by leveraging the implicit correspondences available across training shapes, at the level of local shape matches, by only aligning and warping those components of the exemplar shapes that can be matched to the query. A sample flow of our entire method is visualized in Figs. 1 and 2.
Boundary Point Sampling: Given a bottom-up figure-ground proposal represented as a binary mask \(\mathbf {z} \in \mathcal {D}\), we sample the image coordinates of the boundary points of the foreground segment. Thus we obtain a set of 2D points \(\mathbf {p}_{j}, j=1,\ldots ,K\), with \(\mathbf {p}_{j} = (x_j, y_j) \in \mathbb {R}^2\). We loop over the shapes of our human shape dataset, Human3.6M, and, for each shape, rotate and scale it so that it has the same orientation and scale as the foreground candidate segment, then sample its boundary points. Thus we obtain sets of 2D points \(\mathbf {q}_{jl}, j=1,\ldots ,K\), with \(l=1,\ldots ,L\), where L represents the number of poses in the shape-pose dataset, in our case \(L=100{,}000\).
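A minimal sketch of the boundary sampling step, under the assumption of a 4-connected notion of boundary (a foreground pixel with at least one non-foreground 4-neighbor) and simple uniform subsampling:

```python
def boundary_points(mask):
    """(x, y) coordinates of foreground pixels that touch the background
    (or the image border) through a 4-neighbor; mask is a list of 0/1 rows."""
    H, W = len(mask), len(mask[0])
    pts = []
    for y in range(H):
        for x in range(W):
            if mask[y][x] and any(
                    not (0 <= y + dy < H and 0 <= x + dx < W and mask[y + dy][x + dx])
                    for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0))):
                pts.append((x, y))
    return pts

def subsample(pts, k):
    """Keep roughly k evenly spaced boundary points."""
    step = max(1, len(pts) // k)
    return pts[::step][:k]
```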
Shape Matching and Transform Matrix: We compute the shape context descriptor [46] at each position \(\mathbf {p}_{j}\) from the candidate segment and at each position \(\mathbf {q}_{jl}\). We evaluate a \(\chi ^2\) distance [47] on the resulting descriptors to select the indexes l with sufficiently good matches, for which we then estimate an affine transformation.
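The descriptor and distance can be sketched as follows. This is a simplified shape context, not the exact construction of [46]: the radial bin edges (0.5 and 1.25 times the mean pairwise distance) and the 8 angular bins are our assumptions.

```python
import math

def shape_context(points, idx, n_theta=8):
    """Log-polar-style histogram of the other points' positions relative to
    points[idx], with distances normalized by their mean (3 radial bins)."""
    px, py = points[idx]
    others = [(x - px, y - py) for i, (x, y) in enumerate(points) if i != idx]
    dists = [math.hypot(dx, dy) for dx, dy in others]
    mean_d = sum(dists) / len(dists)
    hist = [0.0] * (3 * n_theta)
    for (dx, dy), d in zip(others, dists):
        rel = d / mean_d
        r_bin = sum(rel >= e for e in (0.5, 1.25))        # 0, 1 or 2
        theta = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = min(n_theta - 1, int(theta / (2 * math.pi) * n_theta))
        hist[r_bin * n_theta + t_bin] += 1
    total = sum(hist)
    return [h / total for h in hist]

def chi2(h1, h2):
    """Chi-squared histogram distance: 0.5 * sum (a - b)^2 / (a + b)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
```

Because only relative positions enter the histogram, the descriptor is translation invariant, and the normalization by the mean distance makes it scale invariant.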
We apply a 2D Procrustes transform with 5 degrees of freedom (rotation, anisotropic scaling including reflections, and translation) to each point set \(\mathbf {q}_{jl}\) in order to align each shape in the dataset with the corresponding boundary points. This results in a \(3\times 3\) transformation matrix \(\mathbf {W}_l\) and a transform error \(e_l\) (the average over \(e_{jl}, j=1,\ldots ,K\)), which measures the Euclidean distance, in the image plane, between the boundary points \(\mathbf {p}_{j}\) and the Procrustes-transformed ones, \(\mathbf {W}_l \cdot \mathbf {q}_{jl}\).
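For intuition, here is a closed-form least-squares alignment on the complex plane. Note this is a reduced, 4-degree-of-freedom similarity variant (rotation, uniform scale, translation), not the paper's 5-degree-of-freedom anisotropic transform, which requires a slightly more involved least-squares fit.

```python
def procrustes_2d(P, Q):
    """Least-squares similarity transform z -> w*z + t mapping Q onto P,
    with points treated as complex numbers. Returns (w, t, err), where err
    is the mean residual distance (the analogue of e_l in the text)."""
    p = [complex(x, y) for x, y in P]
    q = [complex(x, y) for x, y in Q]
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    p = [z - mp for z in p]                      # center both point sets
    q = [z - mq for z in q]
    # optimal complex scale-rotation, then the translation that re-centers
    w = sum(a * b.conjugate() for a, b in zip(p, q)) / sum(abs(b) ** 2 for b in q)
    t = mp - w * mq
    err = sum(abs(a - w * b) for a, b in zip(p, q)) / len(p)
    return w, t, err
```

Thresholding `err` plays the role of the \(\epsilon \) test in the next step: exemplars whose best alignment leaves a large residual are discarded.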
Prior Shape Selection and Warping: In order to determine which prior shapes are relevant for the current query, we identify the subset of indexes \(\mathcal{T}\) in the dataset corresponding to transformation errors smaller than a given threshold \(\epsilon \). Thus, we obtain the corresponding figure-ground masks \(\mathbf {m}_t, t \in \mathcal{T}\). For each mask \(\mathbf {m}_t\), we select the coordinates of its foreground pixels and warp them using the estimated transformation matrix. We apply the same procedure to the skeleton configuration attached to the corresponding mask. Thus, we obtain the coordinates of the foreground pixels for the transformed mask, \(\varPhi _t\), and the transformed skeleton coordinates, \(\varPsi _t\).
Prior Shape Fusion: We compute the mean of the entire set of transformed masks, \(\varPhi _t\), obtaining a MAF prior S corresponding to the detection d, as seen in Fig. 1. The values of the shape prior mask range from 0 to 1, representing background and foreground probability, respectively. We also compute the mean of the entire set of transformed skeletons, \(\varPsi _t\), obtaining a configuration of keypoints \(\mathbf {B} \in \mathbb {R}^{3\times 15}\) with \(\mathbf {B}_j = (x, y, 1)\), where x and y represent the image coordinates of the warped joint from Human3.6M. This can be used to obtain a problem-dependent mask \(\mathbf {m}\) as follows. Initially we set the mask to the dimensions of the entire image, filled with zeros. We use Bresenham’s algorithm to draw a line between semantically adjacent joints, for example left elbow - left wrist, right hip - right knee, and so on. We then assign the set of skeleton nodes to the foreground as \(T=\{i \in \mathcal{V} \,|\, \mathbf {m}(i)=1\}\). The entire procedure for obtaining the shape prior information (mask and skeleton) is illustrated in Algorithm 1.
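The fusion and skeleton rasterization steps can be sketched as follows; the joint names and bone list passed to `skeleton_mask` are hypothetical placeholders for the Human3.6M joint semantics.

```python
def fuse_masks(masks):
    """S = per-pixel mean of the aligned exemplar masks (values in [0, 1])."""
    n = len(masks)
    H, W = len(masks[0]), len(masks[0][0])
    return [[sum(m[y][x] for m in masks) / n for x in range(W)] for y in range(H)]

def bresenham(x0, y0, x1, y1):
    """Integer pixels on the segment between two joints (Bresenham's algorithm)."""
    pts = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx, sy = (1 if x0 < x1 else -1), (1 if y0 < y1 else -1)
    err = dx + dy
    while True:
        pts.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return pts

def skeleton_mask(joints, bones, H, W):
    """Rasterize bones between semantically adjacent joints into a 0/1 mask m;
    the foreground seed set is then T = {i | m(i) = 1}."""
    m = [[0] * W for _ in range(H)]
    for a, b in bones:
        for x, y in bresenham(*joints[a], *joints[b]):
            if 0 <= y < H and 0 <= x < W:
                m[y][x] = 1
    return m
```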
3 Experiments
We test our methodology on two challenging datasets: H3D [48], which contains 107 images, and MPII [49], with 3799 images. Figure-ground segmentation annotations are available for both datasets; for MPII, we generated the figure-ground human segment annotations ourselves. Both H3D and MPII contain full as well as partial views of persons, with self-occlusion, which makes them extremely challenging.
We run several segmentation algorithms, including CPMC [10] as well as our proposed CPDC-MAF, where we use bottom-up person region detectors trained on Human3.6M and region descriptors based on O2P [45]. We also construct a model referred to as CPDC-MAF-POSELETS, built using problem-dependent seeds based on a 2D pose detector instead of the proposed segments of a figure-ground segmentation algorithm. While any methodology that provides body keypoints (parts or articulations) is applicable, we choose the poselet detector because it provides results under partial views of the body, or self-occlusions of certain joints, together with joint position estimates. Conditioned on a detection, we apply the same idea as in our CPDC-MAF, except that we use the detected skeletal keypoints to match against the exemplars in the Human3.6M dataset. A matching process based on semantic keypoints (the body joints) is explicit, immediate (since joints are available both for the putative poselet detector and for the exemplar shapes in Human3.6M) and arguably simpler than matching shapes in the absence of skeletal information. The downside is that when the poselet detection is incorrect, the matching will be as well (although poorly aligned matches are nevertheless discarded within the MAF process).
We initialize CPDC-MAF bottom-up, using candidate segments from the CPMC pool, selected based on their person ranking score after applying the O2P classifier. This is followed by a non-maximum suppression step where, for any pair of segments with an overlap above 0.25, we remove the lower-ranked one. We use the MAF process to reject irrelevant candidates and to build shape prior masks and skeleton configuration seeds for the segments with good matches produced by the shape context descriptors. For each resulting shape prior and skeleton seed, we run the CPDC-MAF model, and the resulting pools from all candidate segments are merged to obtain the human region proposals for an entire image.
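The suppression step amounts to greedy overlap-based filtering over binary masks; a minimal sketch (the 0.25 threshold is the one stated above, everything else is illustrative):

```python
def iou(a, b):
    """Intersection over union of two binary masks (lists of 0/1 rows)."""
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0

def nms(segments, scores, thresh=0.25):
    """Greedy non-maximum suppression: visit segments in decreasing score
    order, dropping any candidate whose IoU with a kept segment exceeds
    thresh. Returns the kept indexes."""
    order = sorted(range(len(segments)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(segments[i], segments[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

The same `iou` helper is the evaluation measure reported in the experiments below.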
For each testing setup, we report the mean values (computed over the entire testing dataset) of the IoU scores between the first segment in the ranked pool and the ground-truth figure-ground segmentation of each image. We also report the mean values of the IoU scores for the pool segment with the best IoU score relative to the ground-truth figure-ground segmentation.
Results for the different datasets are given in Table 1. In turn, Figs. 3 and 4 show plots of the segment pool sizes and the IoU scores of the highest-ranked segments generated by the different methods, with image indexes sorted according to the best performing method (CPDC-MAF). Qualitative segmentation results for the various methods tested are given in Figs. 6 and 7. We also illustrate sample results with top-ranked pools of segments in Fig. 5.
4 Conclusions
We have presented class-specific image segmentation models that leverage human body part detectors based on bottom-up figure-ground proposals, parametric max-flow solvers, and a large dataset of human shapes. Our formulation leads to a sub-modular energy model that combines class-specific structural constraints and data-driven shape priors, within a parametric max-flow optimization methodology that systematically computes all breakpoints of the model in polynomial time. We also propose a data-driven class-specific prior fusion methodology, based on shape matching, alignment and fusion, that allows the shape prior to be constructed on-the-fly, for arbitrary viewpoints and partial views. We demonstrate competitive results on two challenging datasets, H3D [48] and MPII [49], where we improve the first-ranked hypothesis estimates of mid-level segmentation methods by \(20\%\), with pool sizes up to one order of magnitude smaller. In future work we will explore additional class-dependent seed generation mechanisms and plan to study the extension of the proposed framework to video.
Notes
1. Notice, however, that the methodology we propose is also applicable to other categories than people. Here we focus on humans because, for now, large training sets of segmented shapes with structural annotations are available only for them, through Human3.6M [7]. But, as large datasets for other object categories emerge, we expect our methodology to generalize well. In this respect, our results on a challenging visual category, humans, are indicative of the performance bounds one can expect.
References
Urtasun, R., Darrell, T.: Sparse probabilistic regression for activity-independent human pose inference. In: CVPR (2008)
Ionescu, C., Li, F., Sminchisescu, C.: Latent structured models for human pose estimation. In: ICCV (2011)
Ionescu, C., Carreira, J., Sminchisescu, C.: Iterated second-order label sensitive pooling for 3D human pose estimation. In: CVPR (2014)
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: CVPR (2011)
Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. PAMI 35, 2878–2890 (2013)
Bourdev, L., Maji, S., Brox, T., Malik, J.: Detecting people using mutually consistent poselet activations. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6316, pp. 168–181. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15567-3_13
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI 36, 1325–1339 (2014)
Gallo, G., Grigoriadis, M.D., Tarjan, R.E.: A fast parametric maximum flow algorithm and applications. SIAM J. Comput. 18, 30–55 (1989)
Kolmogorov, V., Boykov, Y., Rother, C.: Applications of parametric maxflow in computer vision. In: ICCV (2007)
Carreira, J., Sminchisescu, C.: CPMC: automatic object segmentation using constrained parametric min-cuts. PAMI 34, 1312–1328 (2012)
Ladicky, L., Torr, P.H.S., Zisserman, A.: Human pose estimation using a joint pixel-wise and part-wise formulation. In: CVPR (2013)
Wang, H., Koller, D.: Multi-level inference by relaxed dual decomposition for human pose segmentation. In: CVPR (2011)
Ghiasi, G., Yang, Y., Ramanan, D., Fowlkes, C.C.: Parsing occluded people. In: CVPR (2014)
Xia, W., Song, Z., Feng, J., Cheong, L.-F., Yan, S.: Segmentation over detection by coupled global and local sparse representations. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 662–675. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33715-4_48
Ferrari, V., Marin, M., Zisserman, A.: Pose search: retrieving people using their pose. In: CVPR (2009)
Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: people detection and articulated pose estimation. In: CVPR (2009)
Zuffi, S., Freifeld, O., Black, M.J.: From pictorial structures to deformable structures. In: CVPR (2012)
Zuffi, S., Romero, J., Schmid, C., Black, M.J.: Estimating human pose with flowing puppets. In: ICCV (2013)
Boussaid, H., Kokkinos, I.: Fast and exact: ADMM-based discriminative shape segmentation with loopy part models. In: CVPR (2014)
Alpert, S., Galun, M., Basri, R., Brandt, A.: Image segmentation by probabilistic bottom-up aggregation and cue integration. In: CVPR (2007)
Kumar, M.P., Torr, P., Zisserman, A.: OBJCUT: efficient segmentation using top-down and bottom-up cues. PAMI 32, 530–545 (2010)
Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. IJCV 77, 259–289 (2008)
Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Poselet conditioned pictorial structures. In: CVPR (2013)
Flohr, F., Gavrila, D.M.: PedCut: an iterative framework for pedestrian segmentation combining shape models and multiple data cues. In: BMVC (2013)
Russell, B.C., Efros, A., Sivic, J., Freeman, W.T., Zisserman, A.: Segmenting scenes by matching image composites. In: NIPS (2009)
Rosenfeld, A., Weinshall, D.: Extracting foreground masks towards object recognition. In: ICCV (2011)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
Lin, G., Shen, C., Reid, I., van den Hengel, A.: Efficient piecewise training of deep structured models for semantic segmentation. In: CVPR (2016)
Kuettel, D., Ferrari, V.: Figure-ground segmentation by transferring window masks. In: CVPR (2012)
Gu, C., Arbeláez, P., Lin, Y., Yu, K., Malik, J.: Multi-component models for object detection. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 445–458. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33765-9_32
Lempitsky, V., Blake, A., Rother, C.: Image segmentation by branch-and-mincut. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5305, pp. 15–29. Springer, Heidelberg (2008). doi:10.1007/978-3-540-88693-8_2
Ren, X., Malik, J.: Learning a classification model for segmentation. In: ICCV (2003)
Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. PAMI (2010)
Malisiewicz, T., Efros, A.: Improving spatial support for objects via multiple segmentations. In: BMVC (2007)
van de Sande, K.E., Uijlings, J.R., Gevers, T., Smeulders, A.W.: Segmentation as selective search for object recognition. In: ICCV (2011)
Brox, T., Bourdev, L., Maji, S., Malik, J.: Object segmentation by alignment of poselet activations to image contours. In: CVPR (2011)
Endres, I., Hoiem, D.: Category independent object proposals. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 575–588. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15555-0_42
Kim, J., Grauman, K.: Shape sharing for object segmentation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7578, pp. 444–458. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33786-4_33
Levinshtein, A., Sminchisescu, C., Dickinson, S.: Optimal contour closure by superpixel grouping. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 480–493. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15552-9_35
Maire, M., Yu, S.X., Perona, P.: Object detection and segmentation from joint embedding of parts and pixels. In: ICCV (2011)
Dong, J., Chen, Q., Yan, S., Yuille, A.: Towards unified object detection and semantic segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 299–314. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10602-1_20
Maire, M., Arbelaez, P., Fowlkes, C., Malik, J.: Using contours to detect and localize junctions in natural images. In: CVPR (2008)
Leordeanu, M., Sukthankar, R., Sminchisescu, C.: Efficient closed-form solution to generalized boundary detection. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 516–529. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33765-9_37
Carreira, J., Caseiro, R., Batista, J., Sminchisescu, C.: Semantic segmentation with second-order pooling. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7578, pp. 430–443. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33786-4_32
Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. PAMI 24, 509–522 (2002)
Ryabko, B.Y., Stognienko, V., Shokin, Y.I.: A new test for randomness and its application to some cryptographic problems. J. Stat. Plan. Infer. 123, 365–376 (2004)
Bourdev, L., Malik, J.: Poselets: body part detectors trained using 3D human pose annotations. In: ICCV (2009)
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: CVPR (2014)
Acknowledgments
This work was supported in part by CNCS-UEFISCDI under PCE-2011-3-0438, JRP-RO-FR-2014-16, and NVIDIA through a GPU card donation.
© 2017 Springer International Publishing AG
Popa, AI., Sminchisescu, C. (2017). Parametric Image Segmentation of Humans with Structural Shape Priors. In: Lai, SH., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science(), vol 10112. Springer, Cham. https://doi.org/10.1007/978-3-319-54184-6_5