Abstract
Many state-of-the-art approaches for object recognition reduce the problem to a binary (0-1) classification task. This allows one to leverage sophisticated machine learning techniques for training classifiers from labeled examples. However, these models are typically trained independently for each class using positive and negative examples cropped from images. At test time, post-processing steps such as non-maxima suppression (NMS) are required to reconcile multiple detections within and between different classes for each image. Though crucial to good performance on benchmarks, this post-processing is usually defined heuristically rather than learned.
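To make the kind of heuristic post-processing described above concrete, the following is a minimal sketch of a common greedy NMS procedure. It is an illustration only, not the procedure used by any particular detector in the paper; the function names, the box format `(x1, y1, x2, y2)`, and the overlap threshold of 0.5 are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, visiting boxes from highest to lowest score.

    A box is kept only if it does not overlap any already-kept box by more
    than iou_threshold (a hand-chosen value, which is the heuristic part).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections and one separate detection.
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
example_scores = [0.9, 0.8, 0.7]
print(greedy_nms(example_boxes, example_scores))  # prints [0, 2]
```

The threshold and the greedy visiting order are fixed by hand here, independently of the data and of the object classes involved, which is the limitation the abstract points to.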
We introduce a unified model for multi-class object recognition that casts the problem as a structured prediction task. Rather than predicting a binary label for each image window independently, our model simultaneously predicts a structured labeling of the entire image (Fig. 1). Our model learns statistics that capture the spatial arrangements of various object classes in real images, both in terms of which arrangements to suppress through NMS and which arrangements to favor through spatial co-occurrence statistics.
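As a rough illustration of what predicting a joint labeling can look like, the sketch below scores a set of candidate windows with a local detector score plus pairwise terms indexed by class pair and a coarse spatial relation, and selects windows greedily. This is a simplified toy under stated assumptions, not the model or inference procedure of the paper: the two-valued `relation` function, the weight table `pair_w`, and all numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)
PairKey = Tuple[str, str, str]           # (class_a, class_b, relation), classes sorted


def relation(a: Box, b: Box) -> str:
    """Hypothetical two-valued spatial relation between two boxes."""
    overlap_w = min(a[2], b[2]) - max(a[0], b[0])
    overlap_h = min(a[3], b[3]) - max(a[1], b[1])
    return "overlap" if overlap_w > 0 and overlap_h > 0 else "apart"


def pair_score(pair_w: Dict[PairKey, float], ca: str, cb: str, rel: str) -> float:
    """Look up the pairwise weight; missing entries contribute nothing."""
    first, second = sorted((ca, cb))
    return pair_w.get((first, second, rel), 0.0)


def greedy_layout(boxes: Sequence[Box], classes: Sequence[str],
                  local_scores: Sequence[float],
                  pair_w: Dict[PairKey, float]) -> List[int]:
    """Repeatedly add the window whose inclusion raises the total score most.

    The gain of a window is its local score plus its pairwise terms with
    every window already selected. Stops when no window has positive gain.
    """
    selected: List[int] = []
    remaining = list(range(len(boxes)))
    while remaining:
        gains = {}
        for i in remaining:
            gain = local_scores[i]
            for j in selected:
                gain += pair_score(pair_w, classes[i], classes[j],
                                   relation(boxes[i], boxes[j]))
            gains[i] = gain
        best = max(remaining, key=lambda i: gains[i])
        if gains[best] <= 0.0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical weights: overlapping persons are penalized (an NMS-like effect),
# while a bicycle overlapping a person is rewarded (a co-occurrence effect).
example_pair_w = {
    ("person", "person", "overlap"): -2.0,
    ("bicycle", "person", "overlap"): 0.5,
}
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 8, 10, 18)]
example_classes = ["person", "person", "bicycle"]
example_scores = [1.0, 0.8, -0.2]
print(greedy_layout(example_boxes, example_classes, example_scores, example_pair_w))
```

In this toy example the duplicate person window is suppressed by a negative pairwise weight, while the weakly scored bicycle window is recovered because of its favorable arrangement with the selected person. In the paper's setting such weights are learned from data rather than set by hand.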
We formulate parameter estimation in our model as a max-margin learning problem. Given training images with ground-truth object locations, we show how to formulate learning as a convex optimization problem. We employ the cutting plane algorithm of Joachims et al. (2009) to efficiently learn a model from thousands of training images. We show state-of-the-art results on the PASCAL VOC benchmark that indicate the benefits of learning a global model encapsulating the spatial layout of multiple object classes. A preliminary version of this work appeared at ICCV 2009 (Desai et al. 2009).
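For orientation, a max-margin learning problem of this kind has the general shape of a structural SVM (Tsochantaridis et al. 2004; Joachims et al. 2009). The block below writes out the standard margin-rescaled form as a sketch. The notation is assumed here for illustration and is not fixed by the abstract: w is the weight vector, Ψ(X, Y) a joint feature map of image X and labeling Y, ℓ a loss between labelings, ξ_i slack variables, C a regularization constant, and N the number of training images.

```latex
\begin{aligned}
\min_{w,\ \xi \ge 0}\quad
  & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.}\quad
  & w^{\top}\Psi(X_i, Y_i) - w^{\top}\Psi(X_i, H) \;\ge\; \ell(Y_i, H) - \xi_i
  \qquad \forall i,\ \forall H
\end{aligned}
```

The objective is quadratic and each constraint is linear in (w, ξ), so the problem is convex. The number of constraints grows exponentially with the number of candidate windows, since H ranges over all labelings, which is why a cutting plane method that adds only the most violated constraints is a natural fit.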
References
Anguelov, D., Taskar, B., Chatalbashev, V., Koller, D., Gupta, D., Heitz, G., & Ng, A. (2005). Discriminative learning of Markov random fields for segmentation of 3d scan data. In CVPR, II (pp. 169–176).
Baur, R., Efros, A. A., & Hebert, M. (2008). Statistics of 3d object locations in images (Tech. Rep. CMU-RI-TR-08-43). Robotics Institute, Pittsburgh, PA.
Blaschko, M. B., & Lampert, C. H. (2008). Learning to localize objects with structured output regression. In ECCV (pp. 2–15). Berlin: Springer.
Choi, M., Lim, J., Torralba, A., & Willsky, A. (2010). Exploiting hierarchical context on a large database of object categories. In IEEE conference on computer vision and pattern recognition, CVPR.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR I (pp. 886–893).
Desai, C., Ramanan, D., & Fowlkes, C. (2009). Discriminative models for multi-class object layout. In IEEE international conference on computer vision.
Desai, C., Ramanan, D., & Fowlkes, C. (2010). Discriminative models for static human-object interactions. In Workshop on structured prediction in computer vision, CVPR.
Divvala, S., Hoiem, D., Hays, J., & Efros, A. (2009). An empirical study of context in object detection. In CVPR.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop.
Felzenszwalb, P. (2008). http://people.cs.uchicago.edu/pff/latent.
Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In CVPR.
Finley, T., & Joachims, T. (2008). Training structural svms when exact inference is intractable. In Proceedings of the 25th international conference on machine learning (pp. 304–311). New York: ACM.
Franc, V. (2006). http://cmp.felk.cvut.cz/xfrancv/libqp/html.
Galleguillos, C., Rabinovich, A., & Belongie, S. (2008). Object categorization using co-occurrence, location and appearance. In CVPR, Anchorage, AK.
Hall, E. (1966). The hidden dimension. New York: Anchor Books.
He, X., Zemel, R., & Carreira-Perpinan, M. (2004). Multiscale conditional random fields for image labeling. In CVPR (Vol. 2). Los Alamitos: IEEE Comput. Soc.
Hoiem, D., Efros, A., & Hebert, M. (2008). Putting objects in perspective. IJCV, 80(1), 3–15.
Joachims, T., Finley, T., & Yu, C. (2009). Cutting plane training of structural SVMs. Machine Learning, 77(1), 27–59.
Kolmogorov, V. (2006). Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 1568–1583. http://doi.ieeecomputersociety.org/10.1109/TPAMI.2006.200.
Kumar, S., & Hebert, M. (2005). A hierarchical field framework for unified context-based classification. In Tenth IEEE international conference on computer vision, ICCV, 2005 (Vol. 2).
Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In Workshop on statistical learning in computer vision, ECCV (pp. 17–32).
Liu, Y., Lin, W., & Hays, J. (2004). Near-regular texture analysis and manipulation. ACM Transactions on Graphics, 23(3), 368–376.
Meltzer, T. (2006). http://www.cs.huji.ac.il/talyam/inference.html.
MSR (2006). http://research.microsoft.com/en-us/downloads/dad6c31e-2c04-471f-b724-ded18bf70fe3/.
Murphy, K., Torralba, A., & Freeman, W. (2003). Using the forest to see the trees: a graphical model relating features, objects and scenes. NIPS 16.
Nemhauser, G., Wolsey, L., & Fisher, M. (1978). An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1), 265–294.
Park, D., Ramanan, D., & Fowlkes, C. (2010). Multiresolution models for object detection. In ECCV.
Rother, C., Kolmogorov, V., Lempitsky, V., & Szummer, M. (2007). Optimizing binary mrfs via extended roof duality. In CVPR.
Rowley, H. A., Baluja, S., & Kanade, T. (1996). Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 23–38.
Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. Lecture Notes in Computer Science, 3951, 1.
Sudderth, E., Torralba, A., Freeman, W., & Willsky, A. (2005). Learning hierarchical models of scenes, objects, and parts. In ICCV, II (pp. 1331–1338).
Teo, C., Smola, A., Vishwanathan, S., & Le, Q. (2007). A scalable modular convex solver for regularized risk minimization. In SIGKDD. New York: ACM.
Torralba, A., Murphy, K., & Freeman, W. (2004). Contextual models for object detection using boosted random fields. NIPS.
Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In ICML. New York: ACM.
Tu, Z. (2008). Auto-context and its application to high-level vision tasks. In CVPR.
Viola, P. A., & Jones, M. J. (2004). Robust real-time face detection. IJCV, 57(2), 137–154.
Wainwright, M., Jaakkola, T., & Willsky, A. (2002). Map estimation via agreement on (hyper)trees: message-passing and linear programming approaches. IEEE Transactions on Information Theory, 51, 3697–3717.
Yanover, C., Meltzer, T., & Weiss, Y. (2006). Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7, 1887–1907.
Additional information
The Marr Prize is awarded to the best paper(s) at the biennial flagship vision conference, the IEEE International Conference on Computer Vision (ICCV). This paper is an extended and re-reviewed journal version of the 2009 prize-winning conference paper.
Cite this article
Desai, C., Ramanan, D. & Fowlkes, C.C. Discriminative Models for Multi-Class Object Layout. Int J Comput Vis 95, 1–12 (2011). https://doi.org/10.1007/s11263-011-0439-x
DOI: https://doi.org/10.1007/s11263-011-0439-x