Abstract
We present a technique for estimating the spatial layout of humans in still images: the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not from the side).
We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films.
We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection models.
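To illustrate how a detection can narrow the search for body parts, here is a minimal sketch that assumes the detector returns an axis-aligned window and that parts are sought only inside an enlarged copy of that window. The `Box` type, the `search_region` function and the enlargement factors `grow_x` and `grow_y` are hypothetical names and values chosen for illustration; they are not taken from the paper.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box: top-left corner (x, y), width w, height h, in pixels."""
    x: float
    y: float
    w: float
    h: float


def search_region(det: Box, img_w: int, img_h: int,
                  grow_x: float = 2.0, grow_y: float = 1.5) -> Box:
    """Enlarge a detection window about its centre and clip it to the image.

    grow_x and grow_y are illustrative enlargement factors (assumptions),
    not values reported by the authors.
    """
    # Centre of the detection window.
    cx = det.x + det.w / 2.0
    cy = det.y + det.h / 2.0

    # Size of the enlarged window.
    new_w = det.w * grow_x
    new_h = det.h * grow_y

    # Corners of the enlarged window, clipped to the image bounds.
    x0 = max(0.0, cx - new_w / 2.0)
    y0 = max(0.0, cy - new_h / 2.0)
    x1 = min(float(img_w), cx + new_w / 2.0)
    y1 = min(float(img_h), cy + new_h / 2.0)

    return Box(x0, y0, x1 - x0, y1 - y0)


# Example: an 80x60 detection inside a 640x480 image.
region = search_region(Box(100, 50, 80, 60), img_w=640, img_h=480)
```

Restricting later stages to `region` rather than the full image is one simple way to use a detection as a weak position constraint; it also fixes an approximate scale, since part sizes can be expressed relative to the window size.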
Cite this article
Eichner, M., Marin-Jimenez, M., Zisserman, A. et al. 2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images. Int J Comput Vis 99, 190–214 (2012). https://doi.org/10.1007/s11263-012-0524-9
DOI: https://doi.org/10.1007/s11263-012-0524-9