Abstract
Algorithms using “bag of features”-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark [1,2,3]. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance [1]. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions, and descriptors corresponding to these regions are either used exclusively or given greater representational weight (additional codebook vectors). This approach is evaluated with three state-of-the-art action recognition algorithms [1,2,3] and several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded while maintaining high performance on Hollywood2, and pruning of 20-50% (depending on the model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on saliency-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model [1] enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
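The core pruning idea described above (discard descriptors sampled at low-saliency locations, keep the rest for codebook learning and classification) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, the quantile-based threshold, and the array shapes are assumptions chosen to make the example self-contained.

```python
import numpy as np

def prune_descriptors(descriptors, positions, saliency_map, keep_fraction=0.5):
    """Keep only the descriptors sampled at the most salient locations.

    descriptors  : (N, D) array of local spatiotemporal descriptors
    positions    : (N, 2) integer (row, col) sampling locations
    saliency_map : (H, W) array, higher values = more salient
    keep_fraction: approximate fraction of descriptors to retain
    """
    # Look up the saliency value at each descriptor's sampling location.
    scores = saliency_map[positions[:, 0], positions[:, 1]]
    # Threshold at the (1 - keep_fraction) quantile, so roughly
    # keep_fraction of the descriptors survive the pruning step.
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    mask = scores >= threshold
    return descriptors[mask], positions[mask]
```

The surviving descriptors would then feed the usual bag-of-features pipeline (codebook quantization, histogram pooling, SVM classification); the combined salient/non-salient representation mentioned in the abstract would simply run this pipeline twice, once on the pruned set and once on the full set.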
References
Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC, p. 127 (2009)
Wang, H., Kläser, A., Schmid, C., Liu, C.-L.: Action recognition by dense trajectories. In: CVPR, pp. 3169–3176 (2011)
Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR, pp. 3361–3368 (2011)
Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: ICPR, vol. 3, pp. 32–36 (2004)
Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, Nice, France (2003)
Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on PAMI 20, 1254–1259 (1998)
Bruce, N., Tsotsos, J.: Saliency based on information maximization. In: Advances in NIPS, vol. 18, pp. 155–162 (2006)
Vig, E., Dorr, M., Martinetz, T., Barth, E.: Intrinsic dimensionality predicts the saliency of natural dynamic scenes. IEEE Trans. on PAMI 34, 1080–1091 (2012)
Geisler, W.S., Perry, J.S.: A real-time foveated multiresolution system for low-bandwidth video communication. In: Rogowitz, B.E., Pappas, T.N. (eds.) Human Vision and Electronic Imaging: SPIE Proceedings, pp. 294–305 (1998)
Rutishauser, U., Walther, D., Koch, C., Perona, P.: Is bottom-up attention useful for object recognition? In: CVPR, pp. 37–44 (2004)
Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE Trans. on PAMI 29, 411–426 (2007)
Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to detect a salient object. IEEE Trans. on PAMI 33, 353–367 (2011)
Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR, pp. 2929–2936 (2009)
Mathe, S., Sminchisescu, C.: Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition. In: Fitzgibbon, A., Lazebnik, S., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VII, vol. 7573, pp. 842–856. Springer, Heidelberg (2012)
Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR, pp. 1–8 (2008)
Pinto, N., DiCarlo, J.J., Cox, D.D.: How far can you get with a modern face recognition test set using only simple features? In: CVPR (2009)
Dorr, M., Martinetz, T., Gegenfurtner, K., Barth, E.: Variability of eye movements when viewing dynamic natural scenes. Journal of Vision 10, 1–17 (2010)
Tseng, P., Carmi, R., Cameron, I., Munoz, D., Itti, L.: Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of Vision 9 (2009)
Mota, C., Aach, T., Stuke, I., Barth, E.: Estimation of multiple orientations in multi-dimensional signals. In: ICIP, pp. 2665–2668 (2004)
Pomplun, M., Ritter, H., Velichkovsky, B.: Disambiguating complex visual information: Towards communication of personal views of a scene. Perception 25, 931–948 (1996)
Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: ICML, New York, USA, pp. 6–13 (2004)
Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. J. Mach. Learn. Res. 7, 1531–1565 (2006)
Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV, pp. 2106–2113 (2009)
© 2012 Springer-Verlag Berlin Heidelberg
Vig, E., Dorr, M., Cox, D. (2012). Space-Variant Descriptor Sampling for Action Recognition Based on Saliency and Eye Movements. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds) Computer Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, vol 7578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33786-4_7
Print ISBN: 978-3-642-33785-7
Online ISBN: 978-3-642-33786-4