
1 Introduction

Scene recognition is one of the most challenging tasks in image classification, and various scene recognition methods have been proposed over the past decades [3, 12, 15, 21, 22, 23, 25, 26, 29]. To deal with the large intra-class variance caused by nuisance factors such as pose, viewpoint and occlusion, a scene recognition solution normally requires two stages: scene representation and scene classification.

Scene representation aims to make full use of the information in scene images to extract discriminative features. It explores not only the characteristics shared within the same category but also the distinctive characteristics among different categories. Representation methods can be broadly divided into two categories: hand-crafted and deeply-learned representations. In early studies, hand-crafted representations were popular due to their simplicity and low computational cost, but these methods capture only low-level information, such as texture and structure. In recent works, deeply-learned feature extraction methods exploit high-level semantic information in scene images by using Convolutional Neural Networks (CNNs).

In this paper, we propose an effective scene recognition framework, which first extracts the essential scene sub-graph for each scene class and then learns a classifier to distinguish scene classes in a bi-enhanced knowledge space. The whole framework is based on scene images and their corresponding scene graphs. The main contributions of our work are summarized as follows:

  • We propose a novel framework to extract discriminative representations from both the entire image and the essential scene sub-graph for scene recognition. The learned bi-enhanced knowledge space is shown to be useful for classification.

  • This work presents a pioneering study on learning a knowledge graph, i.e., the essential scene sub-graph, for scene recognition. The proposed approach has great potential for other categorization tasks, and prompts reflection on how knowledge graphs can better drive current tasks.

The rest of the paper is organized as follows. Section 2 briefly reviews related work. The proposed framework including the essential scene sub-graph mining and the bi-enhanced knowledge space learning is described in Sect. 3. Experimental results are reported and discussed in Sect. 4, followed by the conclusion in Sect. 5.

2 Related Work

In this section, we briefly review the related work on scene representation and scene classification.

Scene representation is the most important step in the scene recognition task, which aims to extract discriminative features from scene images. GIST [15], a representative hand-crafted global feature, lexicographically converts an entire scene image into a high-dimensional feature vector, but fails to exploit local structure information in scenes, especially indoor scenes with complex spatial layouts. Methods focusing on local features, such as OTC [14] and CENTRIST [22], first describe the structure pattern of each local patch and then combine the statistics of all patches into a concatenated feature vector. Recently, as Convolutional Neural Networks (CNNs) have made remarkable progress on image recognition, deeply-learned methods have been widely adopted. Gong et al. [7] proposed a multi-scale orderless pooling (MOP) method to extract fully-connected features on local image patches. While these methods have achieved encouraging performance, a largely overlooked aspect is the role of scale and its relation to the feature extractor in a multi-scale scenario. Herranz et al. [8] adapted the feature extractor to each particular scale, combining ImageNet-CNNs [17] and Places-CNNs [29] to improve classification performance. However, the essential objects and their relations are still not fully utilized, and much of the information extracted from patches is redundant. Furthermore, most recent methods need to produce proposals for every object, which makes the computational cost prohibitive on large-scale datasets.

Over the past decades, many methods have been proposed for scene classification [2, 6, 16, 19, 20, 27, 28]; they can be categorized into two groups: generative models and discriminative models. Generative models usually adopt hierarchical Bayesian formulations to express the various relations in a complex scene, such as Markov random fields (MRF) [6], hidden Markov models (HMM) [19] and latent Dirichlet allocation (LDA) [1]. However, these models need to build complex probabilistic graphical models and incur high computational costs. Discriminative models extract feature descriptors from images and then encode them into a fixed-length vector for classification. Typical classifiers include logistic regression and the support vector machine (SVM) [2]; the SVM classifier in particular has been widely used for scene classification. Object bank (OB) [13] and the deformable part-based model (DPM) [5] are representative examples of training a feature classifier for scene classification. Unlike generative models, the parameters of discriminative models are easy to learn for feature classification.

Fig. 1. Overview of the proposed framework. The model consists of: (1) essential scene sub-graph mining; (2) bi-enhanced knowledge space learning for scene recognition.

3 Our Approach

Our proposed framework is illustrated in Fig. 1 and contains two key stages: essential scene sub-graph mining and bi-enhanced knowledge space learning. Firstly, we adopt a statistical method to mine the essential scene sub-graph for each scene class. Next, the bi-enhanced knowledge space is learned for scene recognition by iteratively learning representations from the essential scene sub-graph and the entire image. In this section, we present the details of the proposed framework.

3.1 Essential Scene Sub-graph Mining

A scene graph describes all the objects, attributes and inter-object relations of a scene image. Our approach mines the essential scene sub-graph by exploiting the similarity between scene graphs from the same class.

For essential scene sub-graph mining, we statistically analyze the frequencies of objects for each scene. Firstly, we count the occurrence frequencies of all object sets for each scene class. Next, for each scene class, we choose the object sets with the highest frequencies, with sizes varying from 1 to 6. Lastly, we calculate, for each scene class, the percentage of images that contain all the objects in the selected object sets, and then average these percentages over all scene classes. Taking the scene of “tennis game” as an example, after counting the occurrence frequencies of all object sets in all “tennis game” scene images, {tennis player} stands out when the object set size is 1, with 98.5% of images including it. Similarly, {tennis player, tennis court} is selected when the object set size is 2, with 76.4% of images including both. More details on essential scene sub-graph mining are given in Algorithm 1.

Algorithm 1. Essential scene sub-graph mining.
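To make the counting step concrete, the following is a minimal Python sketch of the statistics behind Algorithm 1. It assumes each image's scene graph has already been reduced to a set of object names; the function name `mine_essential_objects` and the brute-force subset enumeration (practical only for small per-image object sets) are our own illustration, not the paper's implementation.

```python
from collections import Counter
from itertools import combinations

def mine_essential_objects(scene_graphs, max_size=6):
    """For one scene class, return the most frequent object set of each
    size from 1 to max_size, with the fraction of images containing it.

    scene_graphs: list of sets, one set of object names per image.
    """
    n_images = len(scene_graphs)
    selected = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for objects in scene_graphs:
            # Count every object subset of the given size in this image.
            for subset in combinations(sorted(objects), size):
                counts[subset] += 1
        if not counts:
            break
        best, freq = counts.most_common(1)[0]
        selected[size] = (best, freq / n_images)
    return selected

# Toy example with three "tennis game" scene graphs.
graphs = [{"tennis player", "tennis court", "racket"},
          {"tennis player", "tennis court"},
          {"tennis player", "net"}]
for size, (objs, pct) in mine_essential_objects(graphs, max_size=2).items():
    print(size, objs, f"{pct:.1%}")
```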
Fig. 2. Illustration of the bi-enhanced knowledge space learning. \(\textcircled {1}\) is the object-stacked network and \(\textcircled {2}\) is the global network. The whole figure demonstrates an iterative process for knowledge space learning.

3.2 Bi-enhanced Knowledge Space Learning

This section illustrates the learning of a knowledge space that preserves useful and discriminative features from entire images and the essential scene sub-graph. The structure of the whole model is shown in Fig. 2. It includes three parts: (1) the object-stacked network, which learns features from the essential scene sub-graph enhanced by the global representation; (2) the global scene network, which learns features from the entire image enhanced by the object-stacked representation; and (3) bi-enhanced knowledge space optimization, which iteratively seeks the knowledge space from both object-stacked representation learning and global representation learning.

Inspired by Huang et al. [9] and considering the structure of the essential scene sub-graph, we adopt an object-stacked network to process the three objects and their relations in the essential scene sub-graph, as shown in Fig. 2. The object-stacked network contains three separate convolutional blocks, a concatenation layer that combines the three-stream features, a \(1\times 1\) convolutional layer that reduces the dimension, and a fully-connected layer that builds the knowledge space. The objective function is given in Eq. (1):

$$\begin{aligned} \min \limits _{W,b}\sum _{i=1}^{m}(||f(o_{i_1},o_{i_2},o_{i_3})-h(c_i)||)+\lambda ||W||^2 \end{aligned}$$
(1)

where \(W\) and \(b\) are the weights and biases of the layers in the network, respectively, \(m\) is the number of scene images, \(f(\cdot )\) is the output of the first fully-connected layer \(fc6\) of the object-stacked network, and \(h(c_i)\) is the global representation of image \(c_i\) learned by the global network. \(o_{i_1}\), \(o_{i_2}\), \(o_{i_3}\) are the objects of the essential scene sub-graph cropped from image \(c_i\), \(h(\cdot )\) is the output of the first fully-connected layer of the global scene network, and \(\lambda ||W||^2\) is the regularization term. Note that the object \(o_{i_2}\), which has relations to the other two objects \(o_{i_1}\) and \(o_{i_3}\), is fed into the second stream. For example, for the scene of “tennis game”, the essential objects are \(man\), \(court\) and \(racket\). The relations from the essential scene sub-graph are the man holding the racket and the man standing on the court. Since \(man\) has relations to both \(court\) and \(racket\), it is fed into the second stream. If an image does not contain all three essential objects, we set the input of the missing object to zero.
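As a concrete reference, the following is a minimal tf.keras sketch of the object-stacked trunk just described: three separate convolutional streams, a concatenation layer, a \(1\times 1\) convolution for dimension reduction, and the fully-connected layer \(fc6\). The block depths, filter counts and embedding dimension are our assumptions, since the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters, name):
    # One small convolutional block; the exact depth/width is our guess.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                      name=name + "_conv")(x)
    return layers.MaxPooling2D(2, name=name + "_pool")(x)

def build_object_stacked_trunk(embed_dim=4096):
    # Three separate streams, one per essential-object patch (128x128);
    # the object related to both others (o_i2) goes to the second stream.
    inputs, streams = [], []
    for i in (1, 2, 3):
        inp = layers.Input(shape=(128, 128, 3), name=f"object_{i}")
        x = conv_block(inp, 64, f"s{i}_b1")
        x = conv_block(x, 128, f"s{i}_b2")
        x = conv_block(x, 256, f"s{i}_b3")
        inputs.append(inp)
        streams.append(x)
    x = layers.Concatenate(name="concat")(streams)   # combine the 3 streams
    x = layers.Conv2D(256, 1, activation="relu",
                      name="reduce_1x1")(x)          # 1x1 dimension reduction
    x = layers.Flatten()(x)
    fc6 = layers.Dense(embed_dim, activation="relu", name="fc6")(x)
    return Model(inputs, fc6)
```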

Recall that our task is a classification problem; we therefore add two more fully-connected layers and a softmax layer after \(fc6\). The final objective function is expressed in Eq. (2):

$$\begin{aligned} \min \limits _{W,b}\gamma \sum _{i=1}^{m}||f(o_{i_1},o_{i_2},o_{i_3})-h(c_i)||+\xi \lambda ||W||^2-\delta \sum _{i=1}^{m}y_i\log T(o_{i_1},o_{i_2},o_{i_3}) \end{aligned}$$
(2)

where \(\xi \) determines whether to include the regularization term, and \(\gamma \) and \(\delta \) are parameters introduced to balance the two losses, which we set to 0.01 and 1, respectively. \(y_i\) is the label of image \(c_i\), and \(T(\cdot )\) is the output of the final softmax layer of the object-stacked network.
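The following sketch expresses Eq. (2) as a single scalar loss in tf.keras. Treating \(h(c_i)\) as a frozen target via stop_gradient during this phase is our reading of the alternating scheme, and the \(\lambda \) value is a placeholder; \(\gamma =0.01\) and \(\delta =1\) follow the text.

```python
import tensorflow as tf

def object_stacked_loss(y_true, softmax_out, fc6_out, global_repr, weights,
                        gamma=0.01, delta=1.0, xi=1.0, lam=1e-4):
    # Matching term: pull the object-stacked fc6 feature f(.) toward the
    # (frozen) global representation h(c_i).
    match = tf.reduce_sum(
        tf.norm(fc6_out - tf.stop_gradient(global_repr), axis=-1))
    # L2 regularization on the network weights; xi switches it on or off.
    reg = tf.add_n([tf.reduce_sum(tf.square(w)) for w in weights])
    # Cross-entropy between labels y_i and the softmax output T(.).
    ce = tf.reduce_sum(
        tf.keras.losses.categorical_crossentropy(y_true, softmax_out))
    return gamma * match + xi * lam * reg + delta * ce
```

Eq. (3) below is symmetric, with the roles of the two networks exchanged.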

Similar to the object-stacked network, we adopt a CNN model to learn the global representation, as shown in Fig. 2. It contains five convolutional blocks, two fully-connected layers and a softmax layer for classification. The dimension of its first fully-connected layer equals the dimension of the representation from the object-stacked network. The objective function is shown in Eq. (3):

$$\begin{aligned} \min \limits _{W,b}\alpha \sum _{i=1}^{m}||h(c_i)-f(o_{i_1},o_{i_2},o_{i_3})||+\mu \lambda ||W||^2-\beta \sum _{i=1}^{m}y_i\log H(c_i) \end{aligned}$$
(3)

where \(H(\cdot )\) is the output of the final softmax layer of the global scene network. The parameters \(\alpha \) and \(\beta \) balance the two losses, and \(\mu \) controls whether to use the regularization term. The meanings of the remaining parameters are the same as before. We use mini-batch stochastic gradient descent (SGD) to optimize Eq. (3). When Eq. (3) reaches an optimum, we obtain the global representation enhanced by the object-stacked representation.

Based on the two networks described above, an iterative process between them is adopted. The process is initialized by training the object-stacked network with the cross-entropy cost function alone, without global representations. Then, at each iteration, we update the object-stacked representations by optimizing Eq. (2), which is enhanced with the global representations, and adjust the global representations by optimizing Eq. (3), which is enhanced with the object-stacked representations. The knowledge space is optimized iteratively until convergence. At test time, we only employ the trained global network to predict the scene class.
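Schematically, the alternating optimization can be written as follows. The four callables are placeholders for routines the paper implies but does not name: each `fit_*` trains one network against fixed enhancement targets (Eq. (2) or Eq. (3)), and each `*_fc6` returns that network's current \(fc6\) features over the training set.

```python
def train_bi_enhanced(fit_object_net, fit_global_net,
                      object_fc6, global_fc6, num_cycles=2):
    # Initialization: the object-stacked network starts from plain
    # cross-entropy, i.e. without global-representation targets.
    fit_object_net(targets=None)
    for _ in range(num_cycles):
        # Enhance the object-stacked network with frozen global features.
        fit_object_net(targets=global_fc6())    # optimizes Eq. (2)
        # Enhance the global network with frozen object-stacked features.
        fit_global_net(targets=object_fc6())    # optimizes Eq. (3)
    # Only the trained global network is used at test time.
```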

4 Experiments

This section demonstrates the effectiveness of the learned bi-enhanced knowledge space on Scene 30.

4.1 Datasets and Implementation

To better demonstrate the proposed method on a large-scale dataset, we construct Scene 30 from Visual Genome [10]. Scene 30 contains 4608 color images of 30 different scenes, including both indoor and outdoor scenes. The number of images varies across categories, with at least 50 images per category. Each image has a corresponding scene graph. There are 10,034 objects and 30,000 types of relations in total in Scene 30. We split 85% of each class for training and use the rest as the test set. The object-stacked network and global scene network are implemented using the open-source package Keras [4]. We adopt the VGG-16 model pre-trained on ImageNet [18]. In the object-stacked network, the cropped object patches are resized to \(128\times 128\), and the inputs of the global scene network are warped to \(224\times 224\). The features of the scene and object-stacked networks are extracted from the \(fc6\) layer.
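For reference, a minimal sketch of the input pipeline and \(fc6\) feature extraction under these settings, using tf.keras, in whose VGG-16 the layer corresponding to \(fc6\) is named "fc1"; the helper names and the patch normalization are ours.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# Global scene network backbone: VGG-16 pre-trained on ImageNet.
backbone = VGG16(weights="imagenet", include_top=True)
# "fc1" is tf.keras's name for VGG-16's first fully-connected layer (fc6).
fc6_extractor = tf.keras.Model(backbone.input,
                               backbone.get_layer("fc1").output)

def prepare_image(img):
    # Warp the full scene image to 224x224 for the global network.
    return preprocess_input(tf.image.resize(img, (224, 224)))

def prepare_patch(patch):
    # Resize a cropped object patch to 128x128 for the object-stacked
    # network (the normalization choice here is our assumption).
    return tf.image.resize(patch, (128, 128)) / 255.0
```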

4.2 Result and Comparison

Table 1 shows the comparison results. From the table, we can see that the classification accuracy increased from 82.51% to 88.29% after two iteration cycles. Moreover, through the bi-enhanced knowledge space learning, the global network and the object-stacked network capture more meaningful and discriminative information: the classification accuracy of the object-stacked network increased from 89.60% to 90.32%, and the accuracy of the global network increased from 82.51% to 86.71% and then to 88.29% under the supervision of local essential-object features.

Table 1. Recognition performance comparisons in different iterations
Table 2. Recognition performance comparisons on Scene 30

We then evaluate our proposed method on Scene 30 and compare it with several recent CNN-based methods. Table 2 records the recognition accuracy of our approach and the other methods; our approach achieves the highest recognition rate. The method named “Our \(fc6\) + SVM” extracts the feature in \(fc6\) and trains an SVM for classification. The method named “Ours” directly utilizes the global network to predict the scene class. From the table, we make the following three observations. (1) VGG-16 outperforms AlexNet; for example, VGG-16 in PlaceNet 365 is 10.11% higher than AlexNet, so we choose VGG-16 as the base model. (2) The essential scene sub-graph is beneficial to scene recognition: the accuracy of our approach is 1.52% higher than PlaceNet 365 (VGG-16) and 1.21% higher than HybridNet 1365 (VGG-16). (3) Logistic regression is better than the SVM for scene classification. We attribute this to our model being an end-to-end framework at test time, whereas the SVM is trained separately on the extracted \(fc6\) features.
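As an illustration of the “Our \(fc6\) + SVM” baseline, a possible scikit-learn sketch follows; the feature arrays, the linear kernel and the default regularization strength are our assumptions.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def svm_on_fc6(train_fc6, train_labels, test_fc6, test_labels):
    # Classify extracted fc6 features with a linear SVM instead of
    # the global network's own softmax head.
    clf = make_pipeline(StandardScaler(), LinearSVC())
    clf.fit(train_fc6, train_labels)
    return clf.score(test_fc6, test_labels)  # test accuracy
```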

5 Conclusion

In this paper, we propose a novel framework to learn discriminative representations from both the entire scene image and the essential scene sub-graph. In future work, we will focus on utilizing probabilistic graphical models, such as Markov random fields (MRF) [6], to mine the essential scene sub-graph, and on building a more accurate relationship between the scene image and its objects.