Abstract
We present a method to simultaneously estimate 3D body pose and action categories from monocular video sequences. Our approach learns a generative model of the relationship between body pose and image appearance using a sparse kernel regressor. Body poses are modelled on a low-dimensional manifold obtained by Locally Linear Embedding dimensionality reduction. In addition, we learn a prior model of likely body poses and a dynamical model on this pose manifold. Sparse kernel regressors capture the nonlinearities of this mapping efficiently. Within a Recursive Bayesian Sampling framework, the potentially multimodal posterior probability distributions can then be inferred. An activity-switching mechanism based on learned transfer functions allows the performed activity class to be inferred along with the body pose and the 2D image location of the subject. Using a rough foreground segmentation, we compare Binary PCA and distance transforms as encodings of the appearance. As a postprocessing step, the globally optimal trajectory through the entire sequence is estimated, yielding a single pose estimate per frame that is consistent throughout the sequence. We evaluate the algorithm on challenging sequences in which subjects alternate between running and walking. Our experiments show how the dynamical model helps to track through poorly segmented low-resolution image sequences where tracking otherwise fails, while at the same time reliably classifying the activity type.
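The postprocessing step, choosing one pose per frame so that the whole sequence is consistent, can be illustrated with a Viterbi-style dynamic program over per-frame candidates. This is a minimal sketch and not necessarily the authors' formulation: the function name `best_trajectory`, the Gaussian smoothness term between consecutive poses, and the parameter `sigma` are all assumptions made for illustration.

```python
def best_trajectory(candidates, log_weights, sigma=1.0):
    """Pick one candidate per frame maximising the summed log score.

    candidates:  candidates[t][k] is the k-th candidate pose (a sequence of
                 floats) in frame t, e.g. a sample from the posterior.
    log_weights: log_weights[t][k] is the per-frame log score of that candidate.
    sigma:       assumed scale of a Gaussian smoothness term between frames.
    Returns a list with the chosen candidate index for every frame.
    """
    def transition(a, b):
        # Assumed smoothness term: log of an isotropic Gaussian (up to a constant).
        return -sum((x - y) ** 2 for x, y in zip(a, b)) / (2.0 * sigma ** 2)

    score = list(log_weights[0])   # best score of any path ending in each candidate
    back = []                      # back[t-1][j]: best predecessor of candidate j in frame t
    for t in range(1, len(candidates)):
        new_score, pointers = [], []
        for j, pose in enumerate(candidates[t]):
            options = [score[i] + transition(prev, pose)
                       for i, prev in enumerate(candidates[t - 1])]
            best_i = max(range(len(options)), key=options.__getitem__)
            pointers.append(best_i)
            new_score.append(options[best_i] + log_weights[t][j])
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final candidate.
    path = [max(range(len(score)), key=score.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Two 1-D candidates per frame; the per-frame best choice would jump in frame 1.
poses = [[[0.0], [10.0]], [[0.0], [10.0]], [[0.0], [10.0]]]
scores = [[0.0, -0.1], [-0.1, 0.0], [0.0, -0.1]]
print(best_trajectory(poses, scores))
```

Picking the highest-scoring candidate independently in each frame would select the second candidate in the middle frame and produce a jump of 10 units; the dynamic program instead trades a slightly lower per-frame score for a smooth trajectory.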
Cite this article
Jaeggli, T., Koller-Meier, E. & Van Gool, L. Learning Generative Models for Multi-Activity Body Pose Estimation. Int J Comput Vis 83, 121–134 (2009). https://doi.org/10.1007/s11263-008-0158-0
DOI: https://doi.org/10.1007/s11263-008-0158-0