
9.1 Introduction

In recent decades, scene classification has been an active and important research topic in image understanding [1]. It aims to automatically label an image as one of several categories. Scene classification can be applied in a wide range of applications, such as image indexing, object recognition, and intelligent robot navigation. Many approaches have been successfully adopted for scene classification [2]. However, owing to the similarity between different classes and the variability within one class, scene classification still remains a challenging issue.

In this paper, a novel scene classification method based on a regularized auto-encoder and SVM is presented. It is simple but effective. The regularized auto-encoder is imposed with the Maximum Scatter Difference (MSD) criterion [3] and the sparsity constraint [4]. Scene classification based on this method mainly contains two stages: feature learning and classifier designing. The feature learning stage is based on the regularized auto-encoder; features learned in this stage are sufficient and appropriate to represent the source images. The classifier designing stage is based on a multi-class SVM, an acknowledged classifier that achieves good performance in classification tasks. Experiments on the LF data set demonstrate that the introduced method outperforms traditional methods in scene classification. There are two main contributions in our proposed scene classification method.

  1. The auto-encoder with the sparsity constraint automatically extracts features from the source images. The learned features are sufficient and appropriate to describe the scene images. The extraction is independent of prior knowledge, which largely reduces the calculation cost.

  2. The MSD criterion takes into account the similarity between different scenes and the dissimilarity within a scene. With the MSD criterion, images belonging to different categories can be easily separated by finding the best projection direction.

The remainder of the paper is organized as follows. The details of the novel scene classification approach are elaborated in Sect. 9.2. Experiments on the LF data set and the corresponding result analysis are presented in Sect. 9.3. Conclusions and future research are summarized in Sect. 9.4.

9.2 The Details of the Proposed Scene Classification Approach

In this section, the novel scene classification approach is described in detail. The proposed method contains two steps: feature learning and classification. In the feature learning step, sufficient and appropriate features of the source images are learned by the regularized auto-encoder, which is imposed with the sparsity constraint and the MSD criterion. In the classification step, a multi-class SVM is adopted; several categories of scene images are classified with the trained SVM classifiers. The SVMs are trained with the one-against-one strategy, constructing one SVM for each pair of classes.

9.2.1 Feature Learning

Training images are first decomposed into small image patches, which are then normalized and whitened. A regularized auto-encoder is trained on these pre-processed patches. Considering the considerable similarity between different categories and the dissimilarity within a category, the proposed model is applied to extract sufficient and appropriate features from the scene images.

The regularized auto-encoder model described in this paper is a 3-layer neural network, with an input layer and output layer of equal dimension, and a single hidden layer with \( K \) nodes. In particular, in response to the input patch set \( X = \{ x_{1} , \ldots ,x_{m} \} ,x_{i} \in R^{N} \), the feature mapping \( a(x) \) of the hidden layer with \( K \) nodes, i.e. the encoding function, is defined by Eq. (9.1).

$$ a(x) = g(W_{1} x + b_{1} ) $$
(9.1)

where \( g(z) = 1/(1 + \exp ( - z)) \) is the non-linear sigmoid activation function applied component-wise to the vector \( z \), and \( a(x) \in R^{K} \), \( W_{1} \in R^{K \times N} \), \( b_{1} \in R^{K} \) are the output values, weight matrix and bias vector of the hidden layer, respectively. The rows of \( W_{1} \) are the bases learned from the input patches. In order to make the input patch set less redundant, pre-processing is applied to \( X \). In particular, after \( X \) is normalized by subtracting the mean and dividing by the standard deviation of its elements, the entire patch set may be whitened [5]. After pre-processing, the linear activation function of the output layer, i.e. the decoding function, is

$$ \tilde{x} = W_{2} a(x) + b_{2} $$
(9.2)

where \( \tilde{x} \in R^{N} \) is the output value of the output layer, and \( W_{2} \in R^{N \times K} \) and \( b_{2} \in R^{N} \) are the weight matrix and bias vector of the third layer, respectively. Thus the training of the regular auto-encoder model reduces to minimizing the following squared reconstruction error.

$$ J_{re} = 0.5\sum\limits_{i = 1}^{m} {\left\| {x_{i} - \tilde{x}_{i} } \right\|}^{2} $$
(9.3)

By minimizing the above squared reconstruction error function with the back propagation algorithm, the weight matrices \( W_{1} \), \( W_{2} \) and biases \( b_{1} \), \( b_{2} \) are adapted. In order to overcome the over-fitting problem, a weight decay term, Eq. (9.4), is added to the cost function.

$$ J_{wd} = 0.5\lambda (\left\| {W_{1} } \right\|_{F}^{2} + \left\| {W_{2} } \right\|_{F}^{2} ) $$
(9.4)

where \( \left\| \bullet \right\|_{F} \) denotes the Frobenius norm, and \( \lambda \) is the weight decay parameter.
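The pre-processing and encoding steps described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the ZCA form of whitening and the regularization constant `eps` are assumptions, since the paper does not specify them.

```python
import numpy as np

def preprocess(X, eps=0.1):
    """Normalize each patch (zero mean, unit variance), then whiten the set.
    X is an (m, N) matrix of m flattened patches; ZCA whitening is assumed."""
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)
    cov = np.cov(X, rowvar=False)                    # (N, N) covariance of the patch set
    U, S, _ = np.linalg.svd(cov)
    W_zca = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W_zca

def encode(X, W1, b1):
    """Hidden-layer activations a(x) = g(W1 x + b1), Eq. (9.1), one row per patch."""
    z = X @ W1.T + b1
    return 1.0 / (1.0 + np.exp(-z))                  # sigmoid, applied component-wise
```

The decoding of Eq. (9.2) is then simply `A @ W2.T + b2` on the activations returned by `encode`.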

The regular auto-encoder discussed above relies on the number of hidden units being small. When there are more hidden units than input pixels, it fails to discover interesting structure in the input. In particular, when applied to multi-classification tasks, the similarity between different scenes and the dissimilarity within a scene largely reduce the classification accuracy. To handle these two problems, we impose the sparsity constraint and the MSD criterion, respectively.

  1. The sparsity constraint

The regular auto-encoder can easily obtain sufficient features of the input data. However, when the number of hidden units is large, the performance of the regular auto-encoder suffers from the large computational expense. To discover interesting structure in the input data even with many hidden units, a sparsity constraint is imposed on the hidden units. This novel auto-encoder model is known as the sparse auto-encoder (SAE) [4]. Within the SAE framework, a neuron is treated as being “active” when its output is close to 1, and sparsity is imposed by restricting the average activation of the hidden units to a desired constant \( \rho \). Specifically, this is achieved by adding a penalty term of the form \( \beta \sum\nolimits_{j = 1}^{K} {KL(\rho ||\hat{\rho }_{j} )} \) to the cost function, where \( KL(\rho ||\hat{\rho }_{j} ) \) is the \( KL \)-divergence between \( \rho \) and \( \hat{\rho }_{j} \). Here \( \rho \) is the sparsity parameter, \( \hat{\rho }_{j} = (1/m)\sum\nolimits_{i = 1}^{m} {a_{j} (x_{i} )} \) is the average activation of hidden node \( j \) (averaged over the training set), and \( \beta \) controls the weight of the sparsity penalty term.

Thus the training of the regularized auto-encoder model turns out to be the following optimization problem:

$$ J_{SAE} = \hbox{min} (0.5\sum\limits_{i = 1}^{m} {\left\| {x_{i} - \tilde{x}_{i} } \right\|}^{2} + 0.5\lambda (\left\| {W_{1} } \right\|_{F}^{2} + \left\| {W_{2} } \right\|_{F}^{2} ) + \beta \sum\limits_{j = 1}^{K} {KL(\rho ||\hat{\rho }_{j} )} ) $$
(9.5)
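The full cost of Eq. (9.5) can be evaluated as below. This is an illustrative NumPy sketch; the hyper-parameter values (`lam`, `beta`, `rho`) are placeholders, not values reported in the paper.

```python
import numpy as np

def sae_cost(X, W1, b1, W2, b2, lam=1e-3, beta=3.0, rho=0.05):
    """Sparse auto-encoder cost J_SAE of Eq. (9.5): squared reconstruction
    error + weight decay + KL-divergence sparsity penalty.
    X: (m, N) pre-processed patches; W1: (K, N); W2: (N, K)."""
    A = 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))               # encoding, Eq. (9.1)
    X_hat = A @ W2.T + b2                                     # linear decoding, Eq. (9.2)
    J_re = 0.5 * np.sum((X - X_hat) ** 2)                     # reconstruction error, Eq. (9.3)
    J_wd = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))    # weight decay, Eq. (9.4)
    rho_hat = A.mean(axis=0)                                  # average activation per hidden node
    kl = rho * np.log(rho / rho_hat) \
        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))       # KL(rho || rho_hat_j)
    return J_re + J_wd + beta * np.sum(kl)
```

In practice this cost would be minimized with back propagation, as for the regular auto-encoder.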
  2. The MSD criterion

The Maximum Scatter Difference (MSD) criterion [6] is a variant of the Fisher discriminant criterion. It seeks the best projection direction, along which the categories of samples can be easily separated.

Assuming there are \( N \) pattern classes to be recognized, \( l_{1} ,l_{2} , \ldots ,l_{N} \), the inter-class scatter matrix \( S_{b} \) and intra-class scatter matrix \( S_{w} \) are defined as:

$$ S_{b} = \frac{1}{C}\sum\limits_{i = 1}^{N} {C_{i} } (t_{i} - t_{mean} )(t_{i} - t_{mean} )^{T} $$
(9.6)
$$ S_{w} = \frac{1}{C}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{{C_{i} }} {(x_{i}^{j} - t_{i} )} } (x_{i}^{j} - t_{i} )^{T} $$
(9.7)

where \( C \) is the total number of training samples, \( C_{i} \) is the number of training samples in class \( i \), and \( x_{i}^{j} \) denotes the \( j \)th training sample of class \( i \). Furthermore, \( t_{mean} \) and \( t_{i} \) are the mean vectors of all training samples and of the training samples in class \( i \), respectively. Different categories are separated further apart with a larger \( S_{b} \), while images from the same category get closer with a smaller \( S_{w} \).
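The two scatter matrices of Eqs. (9.6) and (9.7) can be computed as follows; a minimal NumPy sketch, with function and variable names chosen for illustration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Inter-class scatter S_b (Eq. 9.6) and intra-class scatter S_w (Eq. 9.7).
    X: (C, d) matrix of C training samples; y: class label of each sample."""
    C, d = X.shape
    t_mean = X.mean(axis=0)                  # mean vector of all training samples
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]                       # the C_i samples of class i
        t_c = Xc.mean(axis=0)                # class-mean vector t_i
        diff = (t_c - t_mean)[:, None]
        S_b += len(Xc) * diff @ diff.T       # C_i (t_i - t_mean)(t_i - t_mean)^T
        D = Xc - t_c
        S_w += D.T @ D                       # sum_j (x_i^j - t_i)(x_i^j - t_i)^T
    return S_b / C, S_w / C
```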

Reviewing the classic Fisher discriminant analysis [7], samples can be easily separated when the ratio of \( S_{b} \) to \( S_{w} \), or their difference, is maximized. Accordingly, we adopt the following form of the MSD criterion:

$$ J_{MSD} = w^{T} S_{b} w - \zeta \cdot w^{T} S_{w} w = w^{T} (S_{b} - \zeta \cdot S_{w} )w $$
(9.8)

The two terms \( w^{T} S_{b} w \) and \( w^{T} S_{w} w \) are balanced by the nonnegative constant \( \zeta \).

Based on the concept of the Rayleigh quotient and its extreme property, the optimal solution of Eq. (9.8), i.e. the eigenvectors \( w_{1} ,w_{2} , \ldots ,w_{k} \), can be obtained from the characteristic equation \( (S_{b} - \zeta \cdot S_{w} )w_{j} = \lambda_{j} w_{j} \), where the \( k \) largest eigenvalues satisfy \( \lambda_{1} \ge \lambda_{2} \ge \ldots \ge \lambda_{k} \).
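Since \( S_{b} - \zeta \cdot S_{w} \) is symmetric, the projection directions can be obtained with an ordinary eigen-decomposition; a short illustrative sketch (the function name and default values are assumptions):

```python
import numpy as np

def msd_projection(S_b, S_w, zeta=1.0, k=2):
    """Optimal MSD projection: eigenvectors of S_b - zeta*S_w associated
    with its k largest eigenvalues, per the characteristic equation
    (S_b - zeta*S_w) w_j = lambda_j w_j."""
    M = S_b - zeta * S_w
    vals, vecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]       # indices of the k largest eigenvalues
    return vecs[:, order], vals[order]
```

No matrix inverse is required, which is exactly the computational advantage of MSD over the Fisher criterion noted below.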

Compared with the Fisher discriminant analysis method, the calculation of \( S_{w}^{ - 1} S_{b} \) is replaced with \( S_{b} - \zeta \cdot S_{w} \) within the framework of the MSD criterion. Thus it works whether \( S_{w} \) is singular or not, which makes the computation more efficient.

Building on the auto-encoder imposed with the sparsity constraint, the MSD criterion is imposed by taking the weight matrix \( W_{1} \), which connects the input layer with the hidden layer, as the projection matrix. Thus the proposed model turns into the following optimization problem

$$ J(w) = J_{SAE} + J_{MSD} $$
(9.9)

where \( J_{SAE} \) and \( J_{MSD} \) are described above. Through a number of iteration steps, a balanced value of the weights and the projection direction is obtained. With the proposed feature learning model, sufficient and appropriate features of the training and testing images can be extracted.

9.2.2 Classifier Designing

In this part, a multi-class support vector machine (SVM) classifier is designed for scene classification. The SVM is a supervised machine learning method, often used to learn high-level concepts from low-level image features. In this section, the one-against-one strategy is applied to train \( l(l - 1)/2 \) non-linear SVMs, where \( l \) is the number of scene categories.

Given training data \( v_{i} \in R^{n} \), \( i = 1, \ldots ,s \), in two classes with labels \( y_{i} \in \{ - 1,1\} \), each SVM solves the following constrained convex optimization problem

$$ \mathop {\hbox{min} }\limits_{\psi ,b,\xi } \left(0.5\left\| \psi \right\|^{2} + C\sum\limits_{i = 1}^{s} {\xi_{i} } \right),\quad s.t.\;y_{i} (\psi^{T} \phi (v_{i} ) + b) \ge 1 - \xi_{i} ,\;\xi_{i} \ge 0 $$
(9.10)

where \( C\sum\nolimits_{i = 1}^{s} {\xi_{i} } \) is the regularization term for the non-linearly separable data set, and \( \psi^{T} \phi (v) + b = 0 \) defines the separating hyper-plane. Two main parameters play an important role in SVM classification: \( C \) and \( \gamma \). The parameter \( C \) represents the cost of penalty, which has a great influence on the classification outcome, while the choice of the kernel parameter \( \gamma \) affects the partitioning of the feature space. With properly selected \( C \) and \( \gamma \), the SVM achieves significant performance in classification tasks.
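As a practical note, scikit-learn's `SVC` already implements the one-against-one strategy internally, training \( l(l-1)/2 \) binary classifiers; a minimal sketch (the specific value of `C` is illustrative, not tuned for the LF data set):

```python
from sklearn.svm import SVC

# One-against-one multi-class SVM with an RBF kernel; C and gamma are the
# two parameters discussed above. The values here are illustrative only.
clf = SVC(kernel="rbf", C=10.0, gamma="scale", decision_function_shape="ovo")
```

After `clf.fit(features, labels)` on the learned features, `clf.predict` assigns each test image to the category winning the most pairwise votes.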

9.3 Experiment and Results Analysis

In this section, an experiment on the LF data set is conducted to evaluate the performance of the introduced method. Experimental results show that better performance over current approaches is achieved in the scene classification task (Fig. 9.1).

Fig. 9.1 There are 1579 RGB sport images included in the LF data set: badminton (200), bocce (137), croquet (236), polo (182), rockclimbing (194), rowing (250), sailing (190), snowboarding (190). Each image has a size of \( 256 \times 256 \) pixels

Experiment on LF data set

All RGB images are first converted to grayscale. Then feature learning is conducted with the proposed regularized auto-encoder model, followed by convolution and mean pooling. The obtained features are fed to the multi-class SVM classifier. Figure 9.2 displays the confusion matrix on the sport data set.
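The convolution and mean pooling step can be sketched as follows. This is a minimal NumPy sketch of the general technique; the patch size and pooling grid are assumptions, as the paper does not report them.

```python
import numpy as np

def extract_features(gray_img, W1, b1, patch=8, pool=4):
    """Convolve the learned bases W1 over a grayscale image via Eq. (9.1),
    then mean-pool each activation map over a pool x pool grid."""
    H, Wd = gray_img.shape
    K = W1.shape[0]
    oh, ow = H - patch + 1, Wd - patch + 1
    maps = np.empty((K, oh, ow))
    for i in range(oh):                      # dense convolution over all patches
        for j in range(ow):
            p = gray_img[i:i + patch, j:j + patch].ravel()
            maps[:, i, j] = 1.0 / (1.0 + np.exp(-(W1 @ p + b1)))
    feats = []                               # mean pooling of each activation map
    for m in maps:
        for rows in np.array_split(m, pool, axis=0):
            for blk in np.array_split(rows, pool, axis=1):
                feats.append(blk.mean())
    return np.asarray(feats)                 # feature vector of length K*pool*pool
```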

Fig. 9.2 The classification accuracy on the LF data set

The average accuracy obtained over 10 independent experiments is 91.56 %.

As shown in the columns of Fig. 9.2, wrong classifications often occur for “badminton” and “bocce”, which indicates that in the feature space of the proposed method there remain some similarities between these two categories and the other six categories.

Effect contrast with other current methods

In this section, the experimental results of our method are compared with previous studies on the LF data set. The relevant studies and their outcomes are listed in Table 9.1.

Table 9.1 Effect contrast with previous studies

As seen from Table 9.1, our proposed method achieves better performance than the best result reported by Zhou et al. so far. In the work of Zhou et al., a multi-resolution bag-of-features model is utilized, with a corresponding performance of 85.1 % correct. Also seen from the table, Li et al. achieve 73.4 % correct by integrating scene and object categorization. Notably, Li et al. later improved this performance by 4.3 percentage points. In that method, objects in scene images are regarded as attributes, which makes recognition easier in scene classification.

To explain the improvement of our method: traditional approaches ignore the similarity between different categories and the differences between pictures within a category. Put another way, there may be people in both the badminton scene and the bocce scene; moreover, some pictures in the badminton scene may contain people while others may not. Classification performance may be influenced to a certain extent by this object consistency problem. Fortunately, it can be alleviated by the MSD criterion adopted in our work, as the experimental results on the LF data set demonstrate.

9.4 Conclusions

In this paper, a simple but effective scene classification method based on a regularized auto-encoder model is proposed. Compared with previous approaches, the proposed method achieves better performance owing to three major improvements. First, the regular auto-encoder is imposed with the sparsity constraint, which is known as the SAE model. It automatically learns image features without relying on prior knowledge. Second, the MSD criterion is added to the SAE model, taking into account the similarity between different scenes and the dissimilarity within a scene. It offers more discriminative information about the input images. Thus, sufficient and appropriate features with more discriminative information are automatically extracted without prior information. Third, a multi-class SVM classifier is designed with a widely used optimization algorithm, which improves the classification accuracy efficiently. Results on the LF data set indicate that this scene classification method attains better classification accuracy than other current methods.