Abstract
Detecting and tracking humans in video are long-standing problems in computer vision. Most successful approaches (e.g., deformable part models) rely heavily on discriminative models to build appearance detectors for body joints and on generative models (e.g., trees) to constrain possible body configurations. While these 2D models have been applied successfully to images (and with less success to videos), a major challenge is to generalize them across camera views. To achieve view invariance, such 2D models typically require a large amount of training data spanning many views, which is difficult to gather and time-consuming to label. Unlike existing 2D models, this paper formulates human detection in videos as spatio-temporal matching (STM) between a 3D motion-capture model and trajectories in the video. Our algorithm estimates the camera view and selects the subset of tracked trajectories that best matches the motion of the 3D model. The STM problem is solved efficiently with linear programming, and it is robust to tracking mismatches, occlusions, and outliers. To the best of our knowledge, this is the first paper to solve the correspondence between video and 3D motion-capture data for human pose detection. Experiments on the Human3.6M and Berkeley MHAD databases illustrate the benefits of our method over state-of-the-art approaches.
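The core of the abstract's claim is that selecting trajectories to match a 3D model can be cast as a linear program. As a minimal illustration of that idea (not the paper's full formulation, which jointly estimates camera view and handles occlusions), the sketch below solves the LP relaxation of a joint-to-trajectory assignment with `scipy.optimize.linprog`; the cost values are hypothetical placeholders for spatio-temporal matching costs.

```python
import numpy as np
from scipy.optimize import linprog

def stm_assignment_lp(cost):
    """LP relaxation of a joint-to-trajectory assignment.

    cost[i, j] is a hypothetical spatio-temporal matching cost between
    model joint i and candidate video trajectory j. Returns a relaxed
    assignment matrix Z with entries in [0, 1]. Because the constraint
    matrix is that of the assignment polytope (totally unimodular), the
    LP optimum is integral.
    """
    m, n = cost.shape            # m joints, n candidate trajectories
    c = cost.ravel()             # one variable z_ij per (joint, trajectory)

    # Each joint selects exactly one trajectory.
    A_eq = np.zeros((m, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    b_eq = np.ones(m)

    # Each trajectory serves at most one joint (rejects duplicates).
    A_ub = np.zeros((n, m * n))
    for j in range(n):
        A_ub[j, j::n] = 1.0
    b_ub = np.ones(n)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * (m * n))
    return res.x.reshape(m, n)

# Toy example: 2 joints, 3 trajectories; trajectory 2 is a costly outlier
# and is left unselected by the LP.
cost = np.array([[0.1, 0.9, 5.0],
                 [0.8, 0.2, 5.0]])
Z = stm_assignment_lp(cost)
```

Robustness to outliers falls out of the "at most one joint per trajectory" inequality: a high-cost outlier trajectory is simply never selected, rather than corrupting the fit.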
References
Shotton, J., Girshick, R.B., Fitzgibbon, A.W., Sharp, T., Cook, M., Finocchio, M., Moore, R., Kohli, P., Criminisi, A., Kipman, A., Blake, A.: Efficient human pose estimation from single depth images. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2821–2840 (2013)
Boiman, O., Irani, M.: Detecting irregularities in images and in video. Int. J. Comput. Vis. 74(1), 17–31 (2007)
Wei, X.K., Chai, J.: VideoMocap: modeling physically realistic human motion from monocular video sequences. ACM Trans. Graph. 29(4) (2010)
Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2878–2890 (2013)
Andriluka, M., Roth, S., Schiele, B.: Discriminative appearance models for pictorial structures. Int. J. Comput. Vis. 99(3), 259–280 (2012)
Ionescu, C., Li, F., Sminchisescu, C.: Latent structured models for human pose estimation. In: ICCV (2011)
Eichner, M., Jesús, M., Zisserman, A., Ferrari, V.: 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. Int. J. Comput. Vis. 99(2), 190–214 (2012)
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. (2014)
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Berkeley MHAD: A comprehensive multimodal human action database. In: IEEE Workshop on Applications of Computer Vision (WACV), pp. 53–60 (2013)
Poppe, R.: Vision-based human motion analysis: An overview. Comput. Vis. Image Underst. 108(1-2), 4–18 (2007)
Sapp, B., Toshev, A., Taskar, B.: Cascaded models for articulated pose estimation. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 406–420. Springer, Heidelberg (2010)
Andriluka, M., Roth, S., Schiele, B.: Monocular 3D pose estimation and tracking by detection. In: CVPR (2010)
Sapp, B., Weiss, D., Taskar, B.: Parsing human motion with stretchable models. In: CVPR (2011)
Burgos, X., Hall, D., Perona, P., Dollár, P.: Merging pose estimates across space and time. In: BMVC (2013)
Tian, Y., Sukthankar, R., Shah, M.: Spatiotemporal deformable part models for action detection. In: CVPR (2013)
Zuffi, S., Romero, J., Schmid, C., Black, M.J.: Estimating human pose with flowing puppets. In: ICCV (2013)
Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58 (2006)
Elgammal, A.M., Lee, C.S.: Inferring 3D body pose from silhouettes using activity manifold learning. In: CVPR (2004)
Urtasun, R., Fleet, D.J., Fua, P.: 3D people tracking with Gaussian process dynamical models. In: CVPR (2006)
Sigal, L., Black, M.J.: Predicting 3D people from 2D pictures. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2006. LNCS, vol. 4069, pp. 185–195. Springer, Heidelberg (2006)
Simo-Serra, E., Ramisa, A., Alenyà, G., Torras, C., Moreno-Noguer, F.: Single image 3D human pose estimation from noisy observations. In: CVPR (2012)
Yao, A., Gall, J., Gool, L.J.V.: Coupled action recognition and pose estimation from multiple views. Int. J. Comput. Vis. 100(1), 16–37 (2012)
Yu, T.H., Kim, T.K., Cipolla, R.: Unconstrained monocular 3D human pose estimation by action detection and cross-modality regression forest. In: CVPR (2013)
Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103(1), 60–79 (2013)
Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 363–370. Springer, Heidelberg (2003)
Messing, R., Pal, C.J., Kautz, H.A.: Activity recognition using the velocity histories of tracked keypoints. In: ICCV (2009)
Matikainen, P., Hebert, M., Sukthankar, R.: Trajectons: Action recognition through the motion analysis of tracked features. In: ICCVW (2009)
Carnegie Mellon University Motion Capture Database, http://mocap.cs.cmu.edu
Park, D., Ramanan, D.: N-best maximal decoders for part models. In: ICCV (2011)
Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Numerical geometry of non-rigid shapes. Springer (2008)
Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3D shape from image streams. In: CVPR (2000)
Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Trajectory space: A dual representation for nonrigid structure from motion. IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1442–1456 (2011)
Akhter, I., Simon, T., Khan, S., Matthews, I., Sheikh, Y.: Bilinear spatiotemporal basis models. ACM Trans. Graph. 31(2), 17 (2012)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: NIPS, pp. 849–856 (2001)
Jiang, H., Drew, M.S., Li, Z.N.: Matching by linear programming and successive convexification. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 959–975 (2007)
Trendafilov, N.: On the ℓ1 Procrustes problem. Future Generation Computer Systems 19(7), 1177–1186 (2004)
Lin, Z., Chen, M., Ma, Y.: The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055 (2010)
Mosek, http://www.mosek.com/
© 2014 Springer International Publishing Switzerland
Cite this paper
Zhou, F., De la Torre, F. (2014). Spatio-temporal Matching for Human Detection in Video. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8694. Springer, Cham. https://doi.org/10.1007/978-3-319-10599-4_5
Print ISBN: 978-3-319-10598-7
Online ISBN: 978-3-319-10599-4