
1 Introduction

Scene recognition is one of the most challenging tasks in image classification, since a scene is characterized by the visual features of the objects that compose it and by its spatial layout. In other words, a scene is the result of a composition of objects. A given environment is more likely to be classified as a bathroom when it is equipped with a bath and a shower. This is especially true for indoor images, which are the focus of this work: apart from its constituent objects, the image of a room (i.e., an indoor scene) looks similar to that of any other room, making indoor scenes hard to distinguish.

In this paper, we propose a novel method that leverages features extracted by Convolutional Neural Networks (CNNs) and a sparse coding technique to represent discriminative regions of the scene. Specifically, we combine global CNN features with local sparse feature vectors aggregated by max spatial pooling. We build an over-complete dictionary whose basis vectors are feature vectors extracted from fragments of a scene. This approach makes our image representation less sensitive to noise, occlusion and local spatial translations of regions, and yields discriminative vectors carrying both global and local features.

The main contributions of this paper are: (i) a robust method that simultaneously leverages global and local features for the scene recognition task, and (ii) a thorough set of experiments to validate our assumptions and the proposed scene recognition method. We evaluate our method on three different datasets (Scene15 [6], MIT67 [12] and SUN397 [18]), comparing it to previously proposed methods in the literature, and perform a robustness test against the work of Herranz et al. [3]. The experimental results show that our method outperforms the current state of the art on Scene15 and MIT67, while performing competitively on SUN397. Additionally, when subjected to image perturbations (i.e., noise and occlusion), our proposal outperforms Herranz et al. [3] on all three datasets, surpassing it by a large margin on the most challenging one, SUN397.

Related Work. Most solutions proposed in the past decade for scene recognition exploited handcrafted local [4] and global [10] descriptors. Methods to combine these descriptors into an image representation ranged from Fisher Vectors [13] to sparse representations [11]. Sparse representation methods achieved strong performance on image recognition and became popular in the last few years. Yang et al. [19] and Gao et al. [2] encoded SIFT [7] descriptors into a single sparse feature vector with max pooling, achieving the best results on the Caltech-101 and Scene-15 datasets.

In recent years, methods based on CNNs have achieved state-of-the-art results on scene recognition. Zhou et al. [21] introduce the Places dataset and present a CNN trained to learn deep features for scene recognition, reaching an accuracy of \(70.8\%\) on MIT67. Another line of investigation is to combine features from different CNNs. Dixit et al. [1] use a network to extract posterior probability vectors from local image patches and combine them using a Gaussian mixture model. Features extracted from networks trained on ImageNet [9] and Places [21] are combined, achieving \(79\%\) accuracy on MIT67.

Herranz et al. [3] analyzed the impact of object features at different scales in combination with the global holistic features provided by scene features. According to the authors, aggregating local features brings robustness to the semantic information used in scene categorization. Herranz et al. [3] held the state of the art until Wang et al. [17] proposed the PatchNet architecture, which models local representations from scene patches and aggregates them into a global representation using a new semantic encoding method called VSAD.

Unlike previous works, our approach is flexible enough to be used with any feature representation and is not restricted to a single dataset or network architecture. This is a key advantage, since the use of different sources of features (e.g., ImageNet for object features and Places for structure) cannot be easily handled by training a single CNN or an autoencoder representation. Additionally, our method is attractive because it has a small number of hyperparameters (the sparsity and the dictionary size), which makes it much easier to train. Typical CNN or autoencoder approaches require selecting the batch size, learning rate and momentum, architecture, optimization algorithm, activation functions, number of neurons in the hidden layers and dropout regularization.

2 Methodology

Our methodology is composed of four steps: (i) feature extraction and dictionary building; (ii) feature sparse coding; (iii) pooling; and (iv) concatenation. The final feature vector feeds a linear SVM used to classify the image. The first three steps are illustrated in Fig. 1.

Fig. 1. Overview of the sparse representation pipeline. For each type of local feature, a single dictionary is built and optimized for both scales and later used to encode the sparse vectors in \(\mathbf {X}\). The final feature \(\mathbf {f^T}\) of a given scale is computed by pooling all sparse representations calculated on that scale.

The overall idea of our approach is to extract features from two semantic levels of the image. First, a global representation is computed from the entire image, encoding the structural features of the scene. Then, we slide a window over the image with window sizes of \(\frac{I}{2}\) and \(\frac{I}{4}\), where I represents the image dimensions, computing a feature vector for each fragment of the scene. These fragments contain local features that potentially come from objects or object parts. The stride was fixed to half of the window dimensions at both scales. The rationale is that features extracted from smaller regions convey information about the constituent objects of the scene; thus, by combining these features, we can represent an indoor scene as a composition of objects.
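As an illustration, the following Python sketch extracts fragments with a sliding window at the two local scales, using a stride of half the window size; the image array and dimensions are placeholders, not values taken from the paper.

```python
import numpy as np

def extract_patches(image, scale_divisor):
    """Slide a window of size (H/s, W/s) over the image with a stride of half the window size."""
    H, W = image.shape[:2]
    win_h, win_w = H // scale_divisor, W // scale_divisor
    stride_h, stride_w = win_h // 2, win_w // 2
    patches = []
    for top in range(0, H - win_h + 1, stride_h):
        for left in range(0, W - win_w + 1, stride_w):
            patches.append(image[top:top + win_h, left:left + win_w])
    return patches

image = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder image
patches_scale2 = extract_patches(image, 2)       # windows of size I/2
patches_scale4 = extract_patches(image, 4)       # windows of size I/4
```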

Feature Extraction and Dictionary Building. Although our method can be instantiated with any feature extractor, such as a bag-of-features model, we use Convolutional Neural Networks to extract the features. We extract semantic features from the fc7 layer of VGGNet-16 [14]. For global information, the CNN was trained on Places [21], while two types of local information were extracted: one from the same model trained on Places, to capture local structures, and another from a second model trained on ImageNet, to encode object features.
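For concreteness, the snippet below shows one way to extract fc7 features with a recent torchvision VGG-16; it uses the ImageNet-pretrained weights as a stand-in, since the Places-trained weights used in the paper would have to be loaded separately.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained VGG-16 as a stand-in for the Places-trained model of the paper.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

# fc7 is the second 4096-d fully connected layer: keep the classifier up to
# (and including) its ReLU, dropping the final fc8 layer.
fc7_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:5]
)

preprocess = T.Compose([
    T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fc7_features(pil_image):
    x = preprocess(pil_image).unsqueeze(0)   # shape (1, 3, 224, 224)
    return fc7_extractor(x).squeeze(0)       # shape (4096,)
```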

The feature extraction step provides a feature vector \(\mathbf {y} \in \mathbb {R}^d\) for each fragment of the image. This process is performed for all samples, and the \(\mathbf {y}\) vectors of each scale are grouped into k clusters using the k-means algorithm. Hence, the dictionary for scale i is represented by \(K_i = [\mathbf {v}_{1,i}, \ldots , \mathbf {v}_{k,i} ] \in \mathbb {R}^{d \times k}\), where \(\mathbf {v}_{j,i} \in \mathbb {R}^d\) is the jth cluster center of scale i. We define the matrix \(\mathbf {D_0}\) as the concatenation of the c per-scale matrices \(\{K_i \mid 1 \le i \le c\}\): \(\mathbf {D_0}= [K_1, K_2, \ldots , K_c]\).
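A minimal sketch of this step with scikit-learn's KMeans, assuming the per-scale descriptor matrices are already available as numpy arrays (the dimensions below are toy values):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_initial_dictionary(features_per_scale, k):
    """Cluster the fragment descriptors of each scale into k atoms and
    concatenate the per-scale dictionaries K_i into D_0 of shape (d, k*c)."""
    blocks = []
    for Y in features_per_scale:               # Y: (n_fragments, d) for one scale
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Y)
        blocks.append(km.cluster_centers_.T)   # (d, k)
    return np.hstack(blocks)                   # (d, k*c)

# Toy example with d = 64 and two scales (c = 2).
rng = np.random.default_rng(0)
feats = [rng.normal(size=(500, 64)), rng.normal(size=(400, 64))]
D0 = build_initial_dictionary(feats, k=128)
print(D0.shape)   # (64, 256)
```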

It is worth noting that, since we seek an over-complete basis, the total number of atoms in the dictionary must greatly exceed the feature dimension, i.e., \(kc \gg d\), where k is the number of clusters per scale, c is the number of scales and d is the dimension of each feature vector.

We concatenate the dictionaries built on different patch sizes into a single dictionary, respecting the nature of the feature. This leaves us with two dictionaries: one built from features extracted with the model trained on Places, and the other with the model trained on ImageNet. The idea is to improve the probability of finding a match at different scales, which offers better discrimination and repeatability power. Once we have built the initial dictionary \(\mathbf {D_0}\), we adjust it to obtain the matrix \(\mathbf {D}\) by solving

$$\begin{aligned} \underset{\mathbf {D}, \mathbf {X}}{{\text {min}}} \quad \frac{1}{n} \sum _{i=1}^n \left( ||\mathbf {y_i} - \mathbf {D} \mathbf {x_i} ||_2^2 + \lambda _{dl} ||\mathbf {x_i} ||_1 \right) , \end{aligned}$$
(1)

where \(||\cdot ||_2\) is the L2-norm, \(||\cdot ||_1\) is the L1-norm, \(\lambda _{dl}\) is a regularization parameter for the dictionary learning, and the vector \(\mathbf {x_i}\) is the ith column of matrix \(\mathbf {X}\). We applied a dictionary learning algorithm [8], which solves Eq. 1 by alternating between \(\mathbf {D}\) and \(\mathbf {X}\) as variables, i.e., it minimizes over \(\mathbf {D}\) while keeping \(\mathbf {X}\) fixed, and vice versa. The dictionary \(\mathbf {D_0}\) is used to start the optimization process.
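As a rough stand-in for the solver of [8], scikit-learn's DictionaryLearning alternates between the sparse codes and the dictionary as Eq. 1 requires; note that this sketch relies on scikit-learn's default initialization rather than warm-starting from \(\mathbf {D_0}\), and the parameter values are only illustrative.

```python
from sklearn.decomposition import DictionaryLearning

# Y: one fragment descriptor per row, shape (n, d).
def learn_dictionary(Y, n_atoms, lam=0.1):
    dl = DictionaryLearning(
        n_components=n_atoms,        # k*c atoms
        alpha=lam,                   # regularization, the role of lambda_dl in Eq. 1
        fit_algorithm='lars',        # l1-penalized sparse code update
        transform_algorithm='omp',
        random_state=0,
    )
    dl.fit(Y)
    return dl.components_.T          # dictionary D with atoms as columns, shape (d, n_atoms)
```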

Sparse Coding. Despite the large number of possible objects that can compose a scene, each class of indoor environments has a small number of characteristic types of objects. The dictionary \(\mathbf {D}\) provides a linear representation of each image fragment using a small number of basis functions. Thus, we find the sparse vector \(\mathbf {x}\) by modeling the composition of a scene as a sparse coding problem.

Considering the new domain of representation defined by the dictionary \(\mathbf {D}\), a feature vector \(\mathbf {y}_i\) extracted from a sample of indoor class i can be rewritten as \(\mathbf {y}_i=\mathbf {D} \mathbf {x}_i\), where \(\mathbf {x}_i\) is a vector whose entries are zero except for those associated with the fragments of class i.

We use the columns \(\mathbf {D}_i\) of the dictionary as the basis of the representation; they are combined with the weights \(\mathbf {x}\) to produce the reconstruction of the input \(\mathbf {y}\) with the smallest error. Each column \(\mathbf {D}_i\) is a descriptor representing an image fragment of a scene. We use a sparsity regularization term to select a small number of columns of the dictionary \(\mathbf {D}\).

Let \(\mathbf {y} \in \mathbb {R}^d\) be a descriptor extracted from an image patch and \(\mathbf {D} \in \mathbb {R}^{d \times kc}\) (\(d \ll kc\)) the dictionary. The coding can be modeled as the following optimization problem:

$$\begin{aligned} \mathbf {x^*} = \underset{\mathbf {x}}{{\text {argmin}}} ||\mathbf {y} - \mathbf {D} \mathbf {x} ||_2^2 \quad s.t. \quad ||\mathbf {x} ||_0 \le L, \end{aligned}$$
(2)

where \(||\cdot ||_0\) is the L0-norm, indicating the number of nonzero values, and L controls the sparsity of vector \(\mathbf {x}\). The final vector \(\mathbf {x^*} \in \mathbb {R}^{kc}\) is the set of basis weights that represents the descriptor \(\mathbf {y}\) as a linear and sparse combination of fragments of scenes.

Although the minimization problem of Eq. 2 is NP-hard, it can be solved effectively either by greedy approaches, such as Orthogonal Matching Pursuit (OMP) [16], or by L1-norm relaxation, also known as LASSO [15]. In this paper, we use OMP because it achieved the best results in our tests.
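A minimal sketch of the encoding step using scikit-learn's OMP implementation; the dictionary and descriptor below are random toy data, and the sparsity level mirrors the \(0.03\times D_c\) rule used later in the experiments.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

# D: dictionary with atoms as columns, shape (d, k*c); y: patch descriptor, shape (d,).
# L: maximum number of non-zero coefficients (the sparsity constraint of Eq. 2).
def sparse_code(D, y, L):
    return orthogonal_mp(D, y, n_nonzero_coefs=L)   # x*, shape (k*c,)

# Toy example.
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 512))
D /= np.linalg.norm(D, axis=0)       # OMP assumes (near-)unit-norm atoms
y = rng.normal(size=64)
x_star = sparse_code(D, y, L=int(0.03 * D.shape[1]))
print(np.count_nonzero(x_star))      # at most 15 non-zeros
```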

The pooling process corresponds to the final step of the diagram presented in Fig. 1. To create the vector encoding the scene features, we compute the final feature vector \(\mathbf {f} \in \mathbb {R}^{kc}\) with a pooling function \(\mathbf {f} = \mathcal {F}(\mathbf {X})\), where \(\mathcal {F}\) is defined on each column of \(\mathbf {X}\in \mathbb {R}^{m \times kc}\). The rows of matrix \(\mathbf {X}\) are the sparse vectors of the feature vectors extracted from the m sliding windows. We create kc-dimensional vectors using the max-pooling function since, according to Yang et al. [19], it gives the best results for sparse-coded local features.
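A short sketch of the pooling step; following Yang et al. [19], the maximum is taken over absolute activations, which is an assumption, as the paper does not state whether absolute values are used.

```python
import numpy as np

# X: sparse codes of the m window descriptors of one scale, shape (m, k*c).
# Max pooling keeps, for each dictionary atom, its strongest activation
# across all windows, yielding the kc-dimensional feature f.
def max_pool(X):
    return np.max(np.abs(X), axis=0)
```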

These steps are executed for both scales, using both the model trained on Places and the one trained on ImageNet. In the end, five feature vectors are concatenated into one: the global features as originally output by the CNN, and the four local features.

3 Experiments

We evaluated our approach on the standard benchmark datasets for scene recognition: Scene15, MIT67 and SUN397. The specific attributes of each dataset allowed us to analyze different properties of our method. We trained the models on Places for global and local descriptors, and on ImageNet for local descriptors, using VGGNet-16, which has shown the lowest classification error [14]. When considering features from VGGNet-16, we used PCA to reduce the descriptors from 4,096 to 1,000 dimensions on the second and third scales.

To compose the dictionary, we set 2,175 words for Scene15, 3,886 words for MIT67 and 6,907 words for SUN397, according to the number of regions extracted from both scales and the number of classes of each dataset. The regularization factor \(\lambda _{dl}\) for the dictionary learning (Eq. 1) was set to 0.1. We used the same sparsity control value L (Eq. 2) for all configurations when executing OMP, setting it to \(0.03\times D_c\) non-zeros, where \(D_c\) is the number of columns of the dictionary. All vectors were L2-normalized after each step and after concatenation. We used a linear Support Vector Machine to perform the classification.
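The classification stage can be sketched as follows; the SVM regularization constant C is not reported in the paper, so the value below is an assumption.

```python
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# final_features: one row per image, i.e., the concatenated global fc7 vector
# and the four pooled sparse vectors; labels: the scene class of each image.
def train_classifier(final_features, labels):
    F = normalize(final_features)   # L2-normalize the concatenated vectors
    clf = LinearSVC(C=1.0)          # C = 1.0 is an assumption, not from the paper
    clf.fit(F, labels)
    return clf
```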

Robustness evaluation. We verified the robustness of the descriptor against occlusion and noise. For this purpose, we automatically generated randomly positioned squares to simulate occlusion (black squares) and noise (granular squares), as illustrated in Fig. 2. Experiments were performed for different window sizes \(\frac{\mathbf {W}}{n}\), where \(\mathbf {W}\) represents the dimensions of the image, with varying values of n.
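A possible implementation of this perturbation step is sketched below; the paper does not specify the noise distribution, so uniform random pixel values over an 8-bit image are an assumption.

```python
import numpy as np

def perturb(image, n_divisor, mode='occlusion', rng=None):
    """Overlay a randomly positioned square of side W/n: black for occlusion,
    random pixel values for granular noise (assumes an integer-typed image)."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    H, W = image.shape[:2]
    side = min(H, W) // n_divisor
    top = rng.integers(0, H - side + 1)
    left = rng.integers(0, W - side + 1)
    if mode == 'occlusion':
        out[top:top + side, left:left + side] = 0
    else:  # 'noise'
        out[top:top + side, left:left + side] = rng.integers(
            0, 256, size=(side, side) + image.shape[2:], dtype=image.dtype)
    return out
```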

Table 1 presents the results for occlusion and noise robustness compared against Herranz et al. [3]. Our methodology performs better in all cases, indicating that our method is less sensitive to perturbations of the image. This behaviour is even more evident on SUN397, by far the most challenging of the three datasets, where the methodology of Herranz et al. was inferior by \(25.29\%\) on average under occlusion and by \(23.05\%\) when noise was added.

Table 1. Accuracy comparison for the occlusion and noise scenarios. Our approach achieves higher accuracy: it provides a better feature selection for scene patches than simply max-pooling CNN features.
Fig. 2. Occlusion and noise examples. The locations of the windows were randomly chosen, and experiments were performed for different window sizes.

Discussion. Besides being highly robust to perturbations of the input image, our methodology also leads the performance on Scene15 and MIT67, while performing competitively on SUN397, as can be seen in Table 2. It is worth highlighting that our method surpasses human accuracy on SUN397, which was measured at \(68.5\%\) [18]. To evaluate our assumption that local information can greatly benefit indoor scene classification, Table 2 also presents the average accuracy achieved by a model trained solely on features from VGGNet-16 trained on Places, which constitutes the global portion of our methodology. On all datasets, global features by themselves performed worse than our full representation, revealing the complementary nature of local features.

We also performed a detailed performance assessment by comparing the accuracies of each indoor class of MIT67. The relative accuracies, considering VGGNet-16 as a baseline, are shown in Fig. 3. For classes that present a large amount of object information (e.g., children room, emphasized in Fig. 3), we observe a significant increase in accuracy. We can draw two observations. First, since the class of such environments strongly depends on the object configuration, image fragments containing features of common objects can be represented by the sparse features at different scales. Second, for environments such as movie theater (Fig. 3), the proposed approach is less effective, since these environments are strongly characterized by their overall structure rather than by their constituent objects.

Table 2. Comparison of average accuracy against other approaches.
Fig. 3. Relative accuracy between VGGNet-16 and our method on MIT67. In most cases, our method leads in performance.

4 Conclusion

We proposed a robust method that combines global and local features to compose a highly discriminative representation of indoor scenes. Our method improves on the accuracy of CNN features by composing local features through sparse coding and max pooling, creating an indoor scene representation based on an over-complete dictionary built from several image fragments. Our sparse coding-based composition approach is capable of encoding local patterns and structures of indoor environments and, under strong occlusion and noise, leads the performance on scene recognition, which is explained by the advantages of sparse representations. Our representation outperformed VGGNet-16 in all cases, reflecting the complementary nature of local features, and outperformed the current state of the art on Scene15 and MIT67, while being competitive on SUN397.