Abstract
Object recognition, which comprises classification and detection, depends on two attributes for robustness: 1) closeness: detection windows should lie as close to the true object locations as possible, and 2) adaptiveness: object matching should adapt to the variations of objects within a class. Traditional methods, which treat classification and detection separately, struggle to satisfy both attributes; recent studies therefore combine the two tasks through confidence contextualization and foreground modeling. However, these combinations neglect feature saliency and object structure, although biological evidence suggests that both can be important in guiding recognition from low level to high level. In fact, object recognition originates in the “what” and “where” pathways of the human visual system. More importantly, these pathways feed back to each other and exchange useful information, which may improve closeness and adaptiveness. Inspired by this visual feedback, we propose a robust object recognition framework built on a computational visual feedback model (VFM) between classification and detection. In the “what” feedback, feature saliency from classification is exploited to rectify detection windows for better closeness, while in the “where” feedback, object parts from detection are used to match object structure for better adaptiveness. Experimental results show that the “what” and “where” feedback effectively improves closeness and adaptiveness for object recognition, and encouraging improvements are obtained on the challenging PASCAL VOC 2007 dataset.
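To make the closeness attribute concrete, the following is a minimal sketch of how a saliency map might tighten a loose detection window, with closeness measured by intersection-over-union (the overlap measure used in the PASCAL VOC detection protocol). It is not the VFM formulation from the paper: the function names, the toy saliency grid, and the thresholding rule (keep cells at or above half the peak saliency inside the window) are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Box as (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two inclusive pixel boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def rectify_window(saliency: List[List[float]], window: Box,
                   threshold: float = 0.5) -> Box:
    """Shrink `window` to the tightest box around its salient cells.

    Hypothetical rule (an assumption, not the paper's method): a cell is
    salient if its value is at least `threshold` times the peak saliency
    found inside the window.
    """
    x0, y0, x1, y1 = window
    cells = [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    peak = max(saliency[y][x] for x, y in cells)
    if peak <= 0:
        return window  # no saliency evidence: leave the window unchanged
    keep = [(x, y) for x, y in cells if saliency[y][x] >= threshold * peak]
    xs = [x for x, _ in keep]
    ys = [y for _, y in keep]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 6x6 saliency map: background 0.1, object region (x 2..4, y 1..3) at 1.0.
saliency = [[1.0 if 2 <= x <= 4 and 1 <= y <= 3 else 0.1 for x in range(6)]
            for y in range(6)]
truth: Box = (2, 1, 4, 3)   # assumed ground-truth object box
loose: Box = (0, 0, 5, 5)   # assumed initial detection window
tight = rectify_window(saliency, loose)

print("IoU before:", iou(loose, truth))
print("IoU after: ", iou(tight, truth))
```

In this toy setting the rectified window coincides with the ground-truth box, so the overlap rises from a quarter of the union to a full match. Real saliency maps are noisy, so a practical rule would need smoothing or a learned threshold.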
Additional information
Special Section on Object Recognition
This work was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316302, the National Natural Science Foundation of China under Grant Nos. 61322209 and 61175007, and the National Key Technology Research and Development Program of China under Grant No. 2012BAH07B01.
About this article
Cite this article
Wang, C., Huang, KQ. VFM: Visual Feedback Model for Robust Object Recognition. J. Comput. Sci. Technol. 30, 325–339 (2015). https://doi.org/10.1007/s11390-015-1526-1
DOI: https://doi.org/10.1007/s11390-015-1526-1