1 Introduction

Scene classification is a challenging task in computer vision with applications in robot navigation, environment computing, Internet image classification, and so on. Typical solutions usually ignore the correlations among classes and the diversity inside the same class. For example, a scene image often contains more than one object, and these objects are frequently shared among different scenes, which makes the scenes visually similar. At the same time, even within the same scene class, images are diverse and difficult to associate and recognize. Combined with these difficulties, (1) the ever-increasing size and number of categories of the data and (2) the requirements of many real-time applications inspire us to seek a more compact and discriminative image representation and apply it to the scene classification problem.

One of the most popular image representation methods is the Bag-of-Features (BoF) model [6]. BoF usually uses unsupervised k-means [11] to learn a set of "visual words" as the codebook during training; images can then be represented by a histogram of these "visual words". However, BoF ignores the spatial layout of images, which is significant for scene classification. To deal with this problem, spatial pyramid matching (SPM) was introduced [19] and has greatly improved the performance of scene classification. Since the histogram-based representation may still lose much image information, several soft-assignment methods have been proposed [25, 47] to further boost scene classification performance.

However, the unsupervised methods mentioned above ignore the information in image labels and have been shown to be insufficiently discriminative. To find more discriminative patterns, many researchers have turned to weakly supervised methods, such as part-based models [10, 17, 34, 42, 44] that find objects (e.g., a human) or components of objects (e.g., a face or body) in images as parts. Each image can be represented by these parts, and usually the images of a class are represented only by the parts learned for that class. These methods construct a collection of class-specific parts as the learned patterns and greatly improve on unsupervised methods for scene classification. Our method also makes full use of the information in image labels during pattern initialization and learning, in order to find the most discriminative patterns, including those shared with other classes.

Though many sophisticated methods have been proposed to learn patterns for scene classification, they still have some limitations. First, many previous methods use handcrafted features such as SIFT [28] and HOG [7] as local descriptors to represent patches. Though they have achieved some satisfactory results, the codes generated by these methods are too noisy to distinguish different patterns accurately. Recently, training a deep convolutional neural network (CNN) on large and diverse datasets such as ImageNet [8] has achieved a breakthrough in image classification [18]. Training a deep CNN on a large scene dataset also performs well for scene classification [53]. However, using deep CNN models trained on ImageNet or a scene dataset as off-the-shelf feature extractors has been confirmed to be more feasible [9, 14, 31, 37, 41] than applying them directly as classifiers on scene datasets. In this work, we also verify that the features generated by a deep CNN are very efficient for representing and finding the discriminative patterns.

Second, previous methods [10, 17, 34, 42, 44] require a large number of patterns to construct the middle-level image representation, which results in unnecessary time and space costs when learning patterns and representing images. For example, Singh et al. [42] and Juneja et al. [17] need 210 and 50 patterns per class, respectively, and even the most compact representation method, MMDL [50], still requires 11 patterns per class. They are therefore impractical for real applications because of the hardware and time demands. For instance, on a large and diverse dataset like SUN 397 [51], which has 397 scene categories in total, these methods would need tens or even hundreds of patterns per class to represent the images, which is too costly in time and space during pattern learning and image representation. Moreover, increasingly large and complex scene datasets keep emerging, in the spirit of ImageNet with its thousands of object categories. It is therefore of great significance to propose a method that learns extremely shared middle-level image representations with fewer than five patterns per class and without sacrificing discriminative ability. That would be a substantial contribution, because current methods use tens or even hundreds of patterns per class and require too much training and testing time.

Fig. 1

The “chair” is shared among different classes, and the label of the image is marked on the top left corner

The limitations of previous methods originate from the fact that they learn the patterns of each class separately, as they assume that the most discriminative patterns for a certain class exist only in that class. To some extent this makes sense: if we can find some patterns specific to a class, that class is more likely to be distinguished well. In fact, however, for some classes we can never find patterns satisfying this requirement, because the objects are either too common or, while distinctive, not exclusive. For example, the common objects in a computer room (computers, desks, and chairs) also appear in other scenes such as an office or a meeting room. Meanwhile, the distinctive objects in a closet are not exclusive and are shared with a clothing room. As shown in Fig. 1, the pattern "chair" is shared among many classes. All these facts inspire us to share important patterns among different classes to represent images efficiently. Indeed, some research works in object detection already adopt the strategy of sharing parts [33, 46]. The sharing strategy has the following advantages: (1) the number of adjustable parameters grows with the number of learned patterns, so reducing the latter lowers the chance of over-fitting; (2) it reduces the demand on the amount of available training data; (3) it reduces computation time and storage space during training and testing, which is very convenient when the dataset is very large. For scene classification, however, the problem of how to share patterns effectively and efficiently remains challenging, because the target objects in scenes are more complicated and diverse. To solve this puzzle, we put forward a method that learns an extremely compact image representation by exploiting the patterns shared among different classes.

Our work is related to that of Parizi et al. [37], who jointly learn patterns and the image classifier, but they initialize a large pool of patterns in an unsupervised way and then select the most discriminative ones. As a consequence, their approach requires a lot of time for initialization and selection, and the resulting representation is not compact and discriminative enough. Building on their work, we propose a novel method with several improvements. First, we adopt a weakly supervised strategy to learn a small number of patterns in each class, saving computing time without sacrificing the discriminative ability of the patterns. Then, we introduce the lasso regularization [45] to select the most discriminative patterns and urge some of them to be shared among different classes, while maintaining some class-specific ones. Results show that only four patterns per class on average are needed to achieve remarkable performance, far fewer than the number used by Parizi et al., attesting that the extremely shared representation is a very compact way to represent images.

To summarize, the main contributions of our work are listed as follows.

  • We demonstrate that many classes share the same patterns and that these shared patterns are crucial for scene classification.

  • We propose a novel method that first learns some class-specific patterns and then applies the lasso regularization to urge some patterns to be shared among different classes, which generates a very compact and discriminative image representation. Experimental results show that our method achieves excellent performance with the fewest patterns on three widely used scene benchmarks.

On average, only four patterns per class are sufficient for the final image representation. This is a very time- and space-saving way to learn patterns and represent images for recognition. Moreover, the learned patterns are more discriminative after our training procedure, as the classification accuracy is boosted compared with a classifier built on the initial patterns. In the experiments conducted on three widely used scene benchmarks, our results, obtained with the fewest patterns, are very competitive.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 details the procedures of image representation and shared pattern learning. Section 4 provides experimental results and analysis to illustrate the strength of our method. Section 5 gives some further discussion of the proposed method, and Sect. 6 concludes the paper.

2 Related work

Compared with low-level representations, which can only capture low-level image information such as shape, color, and edges, middle-level image representations are able to capture more semantic information such as objects or components of objects, and are therefore more discriminative for the high-level semantics of images, e.g., image categories. Finding important patterns as a middle-level image representation is a very popular way to recognize scene images. The traditional BoF model [6] uses unsupervised clustering methods such as k-means to learn patterns in images, but these patterns are not discriminative enough, so many strongly supervised methods such as Attributes [13, 36, 38], Poselets [2], and Object Bank [20] have been proposed. These methods achieve better performance, but they learn patterns using both image- and patch-level annotations, which are hard to obtain.

Recently, some weakly supervised strategies have been explored and proven very effective for image representation. They learn patterns under the supervision of image-level labels only, which are easier to obtain, and the learned patterns have proven more discriminative than unsupervised ones for representing images. Moreover, these methods are robust because they avoid inaccuracies in patch labeling, so learning patterns via weakly supervised methods is quite common. Singh et al. [42] employed a linear SVM to distinguish patches in each class and then selected important ones as learned patterns according to criteria such as purity and discriminativeness. Wang et al. [50] found patterns by introducing multiple-instance learning constraints into dictionary learning. Some part-based models [10, 17, 34, 44] try to find a set of class-specific parts, i.e., parts that frequently appear in one class but rarely in others. However, these methods learn patterns for each class separately, ignoring the fact that many discriminative patterns appear in more than one class, so the patterns they produce are not discriminative and compact enough. Our work can also be seen as a weakly supervised strategy, but instead of learning patterns for each class separately, we share some important patterns among different classes to construct a compact image representation.

Apart from unsupervised or weakly supervised approaches, many other methods find patterns using knowledge from experts [29, 39]. Peraldi et al. [39] created patterns through ABox abduction to interpret multimedia such as images. Neumann et al. [29] introduced description logics as a knowledge representation to interpret scenes. However, these methods require experts to develop rules, which is hard to achieve. Our method can find semantic patterns automatically, which is quite different from their methods.

With the success of deep CNN models in image classification [18], CNN models trained on large diverse image datasets like ImageNet [8] have also shown outstanding performance when adopted as local descriptor extractors that produce highly informative features for each patch [9, 14, 23, 24, 31, 41]. Liu et al. [24] chose the activations of the fully connected sixth layer (fc6) of a deep CNN as patch-level features, combined with a sparse-coding-based Fisher Vector, to recognize images. Li et al. [23] also extracted fc6 activations and learned patterns through pattern mining. Gong et al. [14] selected outputs of the fully connected seventh layer (fc7) to represent patches at multiple scales and then used VLAD to aggregate these outputs for scene classification and image retrieval. Dixit et al. [9] computed activations of the fully connected eighth layer (fc8) and combined them with the semantic Fisher Vector for scene classification. All of these works use deep features as local descriptors. Inspired by them, we choose the fc7 activations of a deep CNN trained on ImageNet at four different scales to represent patches, which proves to yield very satisfactory results for scene classification.

Actually, the unsupervised methods [6, 25, 47] can also be regarded as distributing patterns among different classes, but their patterns are not discriminative enough, because the representation is the collection of patterns shared indiscriminately among all classes. Some works on object detection [33, 46] also learn shared parts and thus adopt the sharing strategy. Different from them, we face a more complicated problem in scene classification: the patterns within a scene are more diverse and complicated and thus more difficult to share. There are many other works on learning shared features, for example via multitask learning [1, 15, 35, 49]. Our method can also be regarded as multitask learning, because we learn the patterns of all scenes simultaneously and minimize the classification error during learning. The difference is that we share patterns among classes, which are more informative than the local features shared by those methods. Meanwhile, we introduce the image classifier during the learning procedure only to encourage sharing patterns among different scenes. In fact, we demonstrate that retraining the final classifier is more effective than reusing the classifier obtained during pattern learning.

The works most closely related to ours are those of Lobel et al. [27] and Parizi et al. [37]. Lobel et al. [27] used hierarchical joint max-margin learning to learn the patterns and the image classifier, but they required the weights of patterns to be positive and used handcrafted descriptors, which may restrict performance. Parizi et al. [37] also learned patterns and the image classifier jointly. They selected the fc7 activations of a deep CNN trained on a large scene-categorization dataset [53] as local descriptors, and also learned negative patterns by allowing negative weights. Some patterns are in fact shared in their method as well. Our work follows theirs, but we make several improvements. Parizi et al. first initialize a large pool of patterns (tens of thousands, even on the relatively small MIT-indoor 67 dataset [40]) using an unsupervised learning method, and then select the most discriminative ones through a joint learning procedure. Compared with our extremely shared strategy, this initialization is too costly and may not be discriminative enough when applied to a large scene dataset like SUN 397 [51]. We work in a different way: we first employ a weakly supervised strategy to initialize a small number of class-specific patterns, and then apply the lasso regularization [45] to select the discriminative patterns and share some of them among different classes, while maintaining the class-specific ones. Moreover, they jointly learn the image representation and classifier only to improve the discriminativeness of the patterns, whereas we introduce the image classifier to force some patterns to be shared among different classes while finding the most discriminative ones. Although an image-level classifier is constructed during training, we still retrain it after the patterns are learned to improve performance. The objectives are thus quite different. Experimental results show that our method constructs a more compact and discriminative representation than theirs.

3 Learning method

In this section, we describe in detail our proposed approach for learning extremely shared patterns, which generates a very compact and discriminative middle-level image representation.

3.1 Pattern definition and image representation

Suppose \({\mathcal{{X}}}=\{X_{1}, X_{2}, \ldots , X_{n}\}\) is a set of images. For each image \(X_{i}\), we densely extract a set of local descriptors \(F(X_{i}) = \{{{\mathbf {x}}}_{i1}, {{\mathbf {x}}}_{i2}, \ldots , {{\mathbf {x}}}_{im_{i}}\}\) from different locations and scales (i.e., different patch sizes), where \({{\mathbf {x}}}_{ij} = f(X_{i}, z_{ij}) \in {\mathbb {R}}^{d \times 1}\) is the d-dimensional feature vector of a patch at location \(z_{ij}\) in \(X_{i}\), and \(m_{i}\) is the number of patches in image \(X_{i}\). In this paper, patterns are defined as a set of linear filters \({{\mathbf {W}}} = [{{\mathbf {w}}}_1, {{\mathbf {w}}}_2, \ldots , {{\mathbf {w}}}_K], {{\mathbf {w}}}_{k} \in {\mathbb {R}}^{d \times 1}\), which can discover the most semantic patches, e.g., objects. The response of pattern k at location \(z_{ij}\) in \(X_{i}\) is simply the dot product

$$\begin{aligned} s_{ijk} = s \left( X_{i}, {{\mathbf {w}}}_{k}, z_{ij}\right) = {{\mathbf {w}}}_{k}^{T} {{\mathbf {x}}}_{ij}. \end{aligned}$$
(1)

Similar to many previous works [27, 37, 42, 44, 50], each patch \({{\mathbf {x}}}_{ij}\) can be represented by the concatenation of pattern responses \([s_{ij1}, s_{ij2}, \ldots , s_{ijK}]^{T} \in {\mathbb {R}}^{K \times 1}\). In practice, an image is divided into grids, i.e., subregions, using SPM [19]. A common way to aggregate the pattern responses in each grid \(l \in \{1, 2, \ldots , L\}\) is max-pooling. That is, the representation of grid l in image \(X_{i}\) is \({{{\mathbf {s}}}}_{i}^{l} = [s_{i1}^{l}, s_{i2}^{l}, \ldots , s_{iK}^{l}]^{T} \in {\mathbb {R}}^{K \times 1}\), where \(s_{ik}^{l} = \mathop {max}\nolimits _{z_{ij}\, \in \,l} s_{ijk}\) is the maximum response of pattern k in grid l. Image \(X_{i}\) can then be described as the concatenation of the grid representations, \({{\mathbf {s}}}_{i} = [{{{\mathbf {s}}}_{i}^{1}}^{T}, {{{\mathbf {s}}}_{i}^{2}}^{T}, \ldots , {{{\mathbf {s}}}_{i}^{L}}^{T}]^{T} \in {\mathbb {R}}^{KL \times 1}\).
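To make this concrete, the sketch below (illustrative NumPy code with made-up dimensions; the authors' implementation is in MATLAB) computes the pattern responses of Eq. (1) and the max-pooled SPM representation \({\mathbf {s}}_{i}\) for one image. The function name `encode_image` and the boolean cell-membership matrix are our own conventions, not part of the original description.

```python
import numpy as np

def encode_image(X_feats, cell_membership, W):
    """Max-pooled SPM representation s_i of one image (Sect. 3.1).

    X_feats         : (m_i, d) local descriptors x_ij
    cell_membership : (m_i, L) boolean, True if patch j falls in SPM cell l
                      (a patch belongs to one cell per pyramid level)
    W               : (d, K) pattern filters w_k
    Returns a (K * L,) vector: per-cell maxima of the responses in Eq. (1).
    """
    S = X_feats @ W                              # (m_i, K), s_ijk = w_k^T x_ij
    m, L = cell_membership.shape
    K = W.shape[1]
    s = np.zeros((L, K))
    for l in range(L):
        idx = cell_membership[:, l]
        if idx.any():                            # empty cells stay at zero
            s[l] = S[idx].max(axis=0)
    return s.reshape(-1)                         # concatenation of the L cell vectors

# toy usage: 1x1 + 2x2 + 4x4 pyramid -> L = 21 cells
rng = np.random.default_rng(0)
d, K, L, m = 256, 268, 21, 60
X_feats = rng.standard_normal((m, d))
W = rng.standard_normal((d, K))
cells = rng.random((m, L)) < 0.2
print(encode_image(X_feats, cells, W).shape)     # (5628,) = K * L
```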

3.2 Formulation of shared compact deep discriminative patterns

Inputs to our learning system are n training images \({\mathcal{X}}=\{X_{1}, X_{2}, \ldots , X_{n}\}\) and their labels \(\mathcal{Y}=\{Y_{1}, Y_{2}, \ldots , Y_{n}\}, Y_{i} \in \{1, 2, \ldots , M\}\), where M is the number of image classes. Our objective is to learn the pattern filters \({{\mathbf {W}}} = [{{\mathbf {w}}}_1, {{\mathbf {w}}}_{2}, \ldots , {{\mathbf {w}}}_K]\) and the final image-level classifier \({{\mathbf {U}}} = [{{\mathbf {u}}}_1, {{\mathbf {u}}}_2, \ldots , {{\mathbf {u}}}_M], {{\mathbf {u}}}_{m} \in {\mathbb {R}}^{KL \times 1}\). After encoding image \(X_{i}\) as \({{\mathbf {s}}}_{i}\), we classify it via \(Y_{i} = \mathop {arg\,max}_{m\in \{1, 2, \ldots , M\}} {{\mathbf {u}}}_{m}^{T} {{\mathbf {s}}}_{i}\).

Some works try to jointly learn the pattern filters \({{\mathbf {W}}}\) and the image-level classifier \({{\mathbf {U}}}\) [27, 37], but as discussed in Sect. 2, their methods have some limitations that make the resulting image representation insufficiently compact and discriminative, and impractical on large datasets. To improve on their work, we learn shared patterns with the lasso regularization [45] as in Eq. (2). Experimental results show that the image representation constructed from these patterns is extremely compact and discriminative for the final classification task.

$$\begin{aligned} \mathop {min}\limits _{{{\mathbf {W}}}, {{\mathbf {U}}}}\ \lambda _{u} \mathop {\sum }\limits _{i=1}^{M} \Vert {{\mathbf {u}}}_{i} \Vert _{1} + \frac{\lambda _{w}}{2} \mathop {\sum }\limits _{i=1}^{K} \Vert {{\mathbf {w}}}_{i} \Vert _{2}^{2} + \frac{1}{n} \mathop {\sum }\limits _{i=1}^{n} L_{X_{i}}, \end{aligned}$$
(2)

where \({{\mathbf {W}}},{{\mathbf {U}}}, {{\mathbf {s}}}_{i}, Y_{i}, M, K\), and n are as defined above, and \(\lambda _{u} \in {\mathbb {R}}\) and \(\lambda _{w} \in {\mathbb {R}}\) are the regularization coefficients for \({{\mathbf {U}}}\) and \({{\mathbf {W}}}\), respectively.

The objective function in Eq. (2) can be interpreted as follows. The first term is the lasso regularization discussed below, and the second term prevents over-fitting of the filters. \(L_{X_{i}}\) in the third term is the image-level multiclass hinge loss of Crammer and Singer [5], given in Eq. (3), which minimizes the classification error of image \(X_{i}\) and makes the patterns discriminative.

$$\begin{aligned} L_{X_{i}} = max \left( 0,\ 1 - \left( {{\mathbf {u}}}_{Y_{i}}^{T} {{\mathbf {s}}}_{i} - {{\mathbf {u}}}_{Y}^{T} {{\mathbf {s}}}_{i}\right) \right) , \end{aligned}$$
(3)

where \(Y = \mathop {arg\, max} \nolimits _{j \in {1, 2, \ldots , M}, j \ne Y_{i}} {{\mathbf {u}}}_{j}^{T} {\mathbf {s}}_{i}\).
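As a sanity check of Eqs. (2)–(3), the sketch below evaluates the full objective for given \({\mathbf {W}}\), \({\mathbf {U}}\), and precomputed representations \({\mathbf {s}}_{i}\). It is a hypothetical NumPy rendering, not the paper's code; the function names and toy dimensions are ours.

```python
import numpy as np

def hinge_loss(U, s_i, y_i):
    """Crammer-Singer multiclass hinge loss of Eq. (3) for one image."""
    scores = U.T @ s_i                        # (M,), u_m^T s_i for every class
    rival = np.delete(scores, y_i).max()      # u_Y^T s_i, best wrong class
    return max(0.0, 1.0 - (scores[y_i] - rival))

def objective(W, U, S, y, lam_u, lam_w):
    """Full objective of Eq. (2); S is (n, K*L) with rows s_i."""
    reg_u = lam_u * np.abs(U).sum()                       # lasso on U
    reg_w = 0.5 * lam_w * (W ** 2).sum()                  # L2 on W
    data = np.mean([hinge_loss(U, s, c) for s, c in zip(S, y)])
    return reg_u + reg_w + data

# toy numbers
rng = np.random.default_rng(1)
K, L, M, n, d = 8, 1, 5, 20, 16
W = rng.standard_normal((d, K))
U = rng.standard_normal((K * L, M))
S = rng.standard_normal((n, K * L))
y = rng.integers(0, M, size=n)
print(objective(W, U, S, y, lam_u=5e-7, lam_w=0.1))
```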

Note that Lobel et al. [27] and Parizi et al. [37] choose the L2 regularization \(\mathop {\sum } \nolimits _{i=1}^{M} \Vert {\mathbf {u}}_{i} \Vert _{2}^{2}\) for the image-level classifier, whereas we choose the convex lasso regularization. The L2 regularization mainly focuses on avoiding over-fitting, while the lasso, a good convex approximation to the L0 regularization \(\Vert {\mathbf {u}}_{i} \Vert _{0}\), yields a sparse solution for the matrix \({\mathbf {U}}\) while also avoiding over-fitting. Because each image class needs only a few important patterns while the majority are redundant, the lasso can outperform the L2 regularization, as shown by Ng [30].

The lasso regularization selects patterns that are discriminative yet shared. Let \({\mathbf {u}}^{j}\) denote the jth row of \({{\mathbf {U}}}\), i.e., \({{\mathbf {U}}} = [{\mathbf {u}}_1, {\mathbf {u}}_2, \ldots , {\mathbf {u}}_M] = [{{\mathbf {u}}^{1}}^{T}, {{\mathbf {u}}^{2}}^{T}, \ldots , {{\mathbf {u}}^{KL}}^{T}]^{T}\), and let \(u_{i}^{j}\) be the element of \({\mathbf {U}}\) in column i and row j. Clearly \(\mathop {\sum } \nolimits _{i=1}^{M} \Vert {\mathbf {u}}_{i} \Vert _{1} = \mathop {\sum } \nolimits _{j=1}^{KL} \Vert {\mathbf {u}}^{j} \Vert _{1}\), so the regularization is symmetric with respect to the two dimensions of \({\mathbf {U}}\). The term \(\Vert {\mathbf {u}}_{i} \Vert _{1}\) encourages the sparsity of each column, which accords with the observation that only a few patterns are sufficient to represent class i, so we use it to select the most discriminative patterns for class i. Meanwhile, \(\Vert {\mathbf {u}}^{j} \Vert _{1}\) encourages the sparsity of each row, which allows patterns to be shared among different classes while controlling how many classes share them. As noted above, some discriminative patterns are shared among different classes, so \(\Vert {\mathbf {u}}^{j} \Vert _{1}\) selects these shared patterns, which contributes to the extremely shared image representation. Furthermore, this constraint ensures that each pattern is shared among only a few classes, preserving its informativeness. The lasso, rather than the L2 regularization, therefore suits the traits mentioned above for selecting the most important patterns for each class.
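The column/row symmetry of the penalty claimed above is easy to verify numerically; the snippet below is only an illustrative check with random numbers.

```python
import numpy as np

U = np.random.default_rng(2).standard_normal((12, 5))    # KL = 12 rows, M = 5 columns
col_sum = sum(np.abs(U[:, i]).sum() for i in range(U.shape[1]))   # sum_i ||u_i||_1
row_sum = sum(np.abs(U[j, :]).sum() for j in range(U.shape[0]))   # sum_j ||u^j||_1
assert np.isclose(col_sum, row_sum)   # both equal the entry-wise L1 norm of U
```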

Note that, for a given class, Eq. (2) can learn not only positive patterns but also negative patterns that serve as counter-evidence for classification. For example, one would not expect to find a tree in scenes such as a classroom or a closet. This is achieved by controlling the sign of \(u_{i}^{j}\): when \(u_{i}^{j} > 0\), pattern j should occur in image class i; when \(u_{i}^{j} < 0\), pattern j should not occur in class i; and when \(u_{i}^{j} = 0\), pattern j is irrelevant to class i.

3.3 Optimization method

The objective function in Eq. (2) is non-convex, so it is hard to optimize directly. However, when \({\mathbf {W}}\) is fixed, it reduces to a typical convex SVM problem that can be solved directly (see Sect. 3.3.1); when \({\mathbf {U}}\) is fixed, it can be solved with established techniques (see Sect. 3.3.2). We therefore choose block coordinate descent [32] to learn the parameters, alternating between optimizing \({\mathbf {U}}\) with \({\mathbf {W}}\) fixed and optimizing \({\mathbf {W}}\) with \({\mathbf {U}}\) fixed, as shown in Algorithm 1. When \({\mathbf {W}}\) is fixed, Eq. (2) can be rewritten as Eq. (4); when \({\mathbf {U}}\) is fixed, it can be rewritten as Eq. (5). More details of the optimization are given in the following subsections.

$$\begin{aligned} L_{U} = \mathop {min} \limits _{{\mathbf {U}}}\ \lambda _{u} \mathop {\sum } \limits _{i=1}^{M} \Vert {\mathbf {u}}_{i} \Vert _{1} + \frac{1}{n} \mathop {\sum } \limits _{i=1}^{n} max \left( 0,\ 1 - \left( {\mathbf {u}}_{Y_{i}}^{T} {\mathbf {s}}_{i} - {\mathbf {u}}_{Y}^{T} {\mathbf {s}}_{i}\right) \right) , \end{aligned}$$
(4)
$$\begin{aligned} L_{W} = \mathop {min} \limits _{{\mathbf {W}}}\ \frac{\lambda _{w}}{2} \mathop {\sum } \limits _{i=1}^{K} \Vert {\mathbf {w}}_{i} \Vert _{2}^{2} + \frac{1}{n} \mathop {\sum } \limits _{i=1}^{n} max \left( 0,\ 1 - \left( {\mathbf {u}}_{Y_{i}}^{T} {\mathbf {s}}_{i} - {\mathbf {u}}_{Y}^{T} {\mathbf {s}}_{i}\right) \right) , \end{aligned}$$
(5)
Algorithm 1
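The pseudo-code of Algorithm 1 is not reproduced here, so the following high-level sketch outlines the alternating scheme in plain Python. `encode`, `optimize_U`, and `optimize_W` are placeholders for the image-encoding step of Sect. 3.1 and the SGD procedures of Sects. 3.3.1 and 3.3.2; the loop structure is our reading of the algorithm rather than its exact pseudo-code.

```python
def block_coordinate_descent(images, labels, W0, n_outer,
                             encode, optimize_U, optimize_W):
    """Alternating optimization of Eq. (2), in the spirit of Algorithm 1.

    W0                          : initial pattern filters from Sect. 3.3.3
    encode(X, W)    -> s_i      : image representation of Sect. 3.1
    optimize_U(S, y, U)  -> U   : solves Eq. (4) with W fixed (convex, SGD)
    optimize_W(X, y, U, W) -> W : solves Eq. (5) with U fixed (CCCP + SGD)
    """
    W, U = W0, None
    for _ in range(n_outer):
        S = [encode(X, W) for X in images]       # recompute all s_i under current W
        U = optimize_U(S, labels, U)             # Sect. 3.3.1
        W = optimize_W(images, labels, U, W)     # Sect. 3.3.2
    return W, U
```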

3.3.1 Optimizing \({\mathbf {U}}\)

Equation (4) is convex in \({\mathbf {U}}\); although the lasso and hinge terms are not differentiable everywhere, their subgradients are readily available, so the optimization in the third line of Algorithm 1 can be carried out with simple stochastic gradient descent (SGD). The (sub)gradient of Eq. (4) with respect to \({\mathbf {u}}_{m}\) can be computed as

$$\begin{aligned} \frac{\partial L_{U}}{\partial {\mathbf {u}}_{m}} = \lambda _{u}\,\textit{sign}({\mathbf {u}}_{m}) + \frac{1}{n} \mathop {\sum } \limits _{i=1}^{n} \frac{\partial L_{X_{i}}}{\partial {\mathbf {u}}_{m}}, \end{aligned}$$
(6)

where \(\textit{sign}(\cdot)\) is the element-wise sign function, equal to 1 when \(x > 0\), \(-1\) when \(x < 0\), and 0 when \(x = 0\). The term \(\partial L_{X_{i}}/\partial {\mathbf {u}}_{m}\) can be derived as follows,

$$\begin{aligned} \frac{\partial L_{X_{i}}}{\partial {\mathbf {u}}_{m}} = {\left\{ \begin{array}{ll} -{\mathbf {1}}[{\mathbf {u}}_{Y_{i}}^{T} {\mathbf {s}}_{i} - {\mathbf {u}}_{Y}^{T} {\mathbf {s}}_{i}< 1] {\mathbf {s}}_{i} \ \ &{} \text {if} \ m=Y_{i} \\ {\mathbf {1}}[{\mathbf {u}}_{Y_{i}}^{T} {\mathbf {s}}_{i} - {\mathbf {u}}_{Y}^{T} {\mathbf {s}}_{i} < 1] {\mathbf {s}}_{i} \ \ &{} \text {if} \ m=Y \\ 0 \ \ &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(7)

where \({{\mathbf {1}}}[a<b]\) is the indicator function, equal to 1 when \(a < b\) and 0 otherwise. Then, given the learning rate \(\eta \), \({\mathbf {u}}_{m}\) is updated as \({\mathbf {u}}_{m_{t+1}} \leftarrow {\mathbf {u}}_{m_{t}} - \eta \, \partial L_{U} / \partial {\mathbf {u}}_{m_{t}}\).
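A minimal sketch of one stochastic update on \({\mathbf {U}}\), following Eqs. (6)–(7), is given below. It is hypothetical NumPy code, not the authors' implementation; applying the lasso subgradient at every per-image step is one common convention and an assumption on our part.

```python
import numpy as np

def sgd_step_U(U, s_i, y_i, lam_u, eta):
    """One SGD step on Eq. (4) for a single image (Eqs. (6)-(7)).

    U   : (K*L, M) classifier, columns u_m
    s_i : (K*L,) image representation
    y_i : ground-truth class of the image
    """
    scores = U.T @ s_i
    rival_scores = scores.copy()
    rival_scores[y_i] = -np.inf
    Y = int(np.argmax(rival_scores))           # most confusing wrong class
    grad = lam_u * np.sign(U)                  # subgradient of the lasso term
    if scores[y_i] - scores[Y] < 1.0:          # indicator in Eq. (7)
        grad[:, y_i] -= s_i
        grad[:, Y] += s_i
    return U - eta * grad

# toy usage
rng = np.random.default_rng(3)
U = rng.standard_normal((40, 6))
s = rng.standard_normal(40)
U = sgd_step_U(U, s, y_i=2, lam_u=5e-7, eta=1.0)
```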

3.3.2 Optimizing \({\mathbf {W}}\)

To make the learning procedure of \({\mathbf {W}}\) easier to understand, suppose there is no SPM, i.e., \(L=1\), so we can write \({\mathbf {s}}_{i}\) as \([s_{i1}, s_{i2}, \ldots , s_{iK}]\). Let \(s_{ik} = s(X_{i}, {\mathbf {w}}_{k}, z_{i}^{k})\) be the kth element of \({\mathbf {s}}_{i}\), where \(z_{i}^{k} = \mathop {arg\,max} \limits _{z_{ij}} {\mathbf {w}}_{k}^{T} {\mathbf {x}}_{ij}\) is the latent variable indicating the location of the strongest response of pattern k. In Eq. (5), \(s_{ik}\) is convex in \({\mathbf {W}}\), as it is the maximum of linear functions. But if \(u_{Y_{i}}^{k} - u_{Y}^{k} > 0\), then \(1 - ({\mathbf {u}}_{Y_{i}}^{T} {\mathbf {s}}_{i} - {\mathbf {u}}_{Y}^{T} {\mathbf {s}}_{i})\) is non-convex in \({\mathbf {w}}_{k}\). The fourth line of Algorithm 1 therefore involves a non-convex optimization, and we choose the CCCP algorithm [52] to learn \({\mathbf {W}}\).

When \(u_{Y_{i}}^{k} - u_{Y}^{k} < 0\), \(L_{X_{i}}\) is convex in \({\mathbf {w}}_{k}\), so we can optimize \({\mathbf {w}}_{k}\) directly. When \(u_{Y_{i}}^{k} - u_{Y}^{k} > 0\), if the latent variable \(z_{i}^{k}\) is given, then \(1 - ({\mathbf {u}}_{Y_{i}}^{T} {\mathbf {s}}_{i} - {\mathbf {u}}_{Y}^{T} {\mathbf {s}}_{i})\) is a linear function of \({\mathbf {w}}_{k}\), so \(L_{X_{i}}\) is again convex in \({\mathbf {w}}_{k}\). Inspired by this fact, we alternate between updating the latent variables \(z_{i}^{k}\) for which \(u_{Y_{i}}^{k} - u_{Y}^{k} > 0\) and learning \({\mathbf {W}}\), as shown in Algorithm 2.

Once \(z_{i}^{k}\) is given, learning \({\mathbf {W}}\) becomes a convex optimization problem. Since Eq. (5) is differentiable with respect to \({\mathbf {W}}\) almost everywhere, we can also use SGD to learn \({\mathbf {W}}\). The partial derivative of Eq. (5) can be written as follows,

$$\begin{aligned} \frac{\partial L_{W}}{\partial {\mathbf {w}}_{k}} = \lambda _{w}\,{\mathbf {w}}_{k} + \frac{1}{n} \mathop {\sum } \limits _{i=1}^{n} \frac{\partial L_{X_{i}}}{\partial {\mathbf {w}}_{k}}, \end{aligned}$$
(8)

and the second term is derived as

$$\begin{aligned} \frac{\partial L_{X_{i}}}{\partial {\mathbf {w}}_{k}} = {\mathbf {1}}\left[ {\mathbf {u}}_{Y_{i}}^{T} {\mathbf {s}}_{i} - {\mathbf {u}}_{Y}^{T} {\mathbf {s}}_{i} < 1\right] \left( u_{Y}^{k} - u_{Y_{i}}^{k}\right) {\mathbf {x}}_{i}^{k}, \end{aligned}$$
(9)

where \({\mathbf {x}}_{i}^{k} = f(X_{i}, z_{i}^{k})\) is the feature vector at location \(z_{i}^{k}\) in image \(X_{i}\). As in Sect. 3.3.1, given the learning rate \(\eta \), \({\mathbf {w}}_{k}\) is updated as \({\mathbf {w}}_{k_{t+1}} \leftarrow {\mathbf {w}}_{k_{t}} - \eta \, \partial L_{W} / \partial {\mathbf {w}}_{k_{t}}\). Detailed pseudo-code for updating \({\mathbf {W}}\) is given in Algorithm 2, where \({\mathbf {W}}^{old}\) denotes the pattern filters from the previous iteration and T is the number of SGD iterations.

Algorithm 2
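The following sketch gives one possible reading of Algorithm 2 for the no-SPM case (\(L=1\)). It is simplified and hypothetical: all latent locations \(z_{i}^{k}\) are fixed under the old filters, rather than only those with \(u_{Y_{i}}^{k} - u_{Y}^{k} > 0\) as the text prescribes, and the sampling scheme and variable names are our own.

```python
import numpy as np

def optimize_W(patch_feats, labels, U, W, lam_w, eta, T):
    """CCCP-style update of the pattern filters (no SPM, i.e. L = 1).

    patch_feats : list of (m_i, d) arrays, local descriptors of each image
    labels      : list of class indices Y_i
    U           : (K, M) image classifier (L = 1, so KL = K)
    W           : (d, K) current filters
    One round of Algorithm 2: fix the latent locations z_i^k under the old
    filters, then run T SGD steps on Eq. (5) using Eqs. (8)-(9).
    """
    n = len(patch_feats)
    # latent variables: strongest-response patch of every pattern under W_old
    z = [np.argmax(F @ W, axis=0) for F in patch_feats]        # (K,) per image
    for _ in range(T):
        i = np.random.randint(n)
        F, y_i = patch_feats[i], labels[i]
        s_i = (F @ W).max(axis=0)                              # s_ik under current W
        x_i = F[z[i]]                                          # (K, d), x_i^k at fixed z_i^k
        scores = U.T @ s_i
        rival = scores.copy(); rival[y_i] = -np.inf
        Y = int(np.argmax(rival))
        grad = lam_w * W                                       # first term of Eq. (8)
        if scores[y_i] - scores[Y] < 1.0:                      # indicator in Eq. (9)
            grad += x_i.T * (U[:, Y] - U[:, y_i])              # (d, K), Eq. (9)
        W = W - eta * grad
    return W
```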

Notice that the extension from no SPM to multilevel SPM introduces only minor differences, so we omit the SPM form of learning \({\mathbf {W}}\). We should also note that, even though we obtain \({\mathbf {U}}\) by minimizing the classification error, we do not use it as the final image classifier; instead, we retrain a multiclass linear SVM. The only purpose of introducing \({\mathbf {U}}\) is to learn discriminative and shared patterns for our extremely shared image representation. In Sect. 4, we show that the retrained SVM classifier performs better than a classifier built directly from the learned \({\mathbf {U}}\).

3.3.3 Pattern filters initialization

As described in the first line of Algorithm 1, we must first initialize the pattern filters \({\mathbf {W}}\), and we choose an effective way to do so. First, we run k-means on the patches of each class separately to assign every patch a cluster label. Then a multiclass SVM of Crammer and Singer [5] is trained using the cluster labels as supervision. SGD is again used for training this SVM because it is effective and efficient.

As we can see, this initialization strategy can be regarded as a weakly supervised BoF, and it finds class-specific patterns through the multiclass SVM.
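A minimal stand-in for this initialization step is sketched below, using scikit-learn's KMeans and Crammer–Singer LinearSVC in place of the VLFeat k-means and the SGD-trained SVM used in the paper; the hyper-parameters shown are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def initialize_patterns(patch_feats, patch_labels, patterns_per_class):
    """Illustrative initialization of W in the spirit of Sect. 3.3.3.

    patch_feats  : (N, d) descriptors of all training patches
    patch_labels : (N,) image-class label of each patch's source image
    Returns (d, K) filters, with K = n_classes * patterns_per_class.
    """
    cluster_ids = np.empty(len(patch_feats), dtype=int)
    offset = 0
    for c in np.unique(patch_labels):
        idx = np.where(patch_labels == c)[0]
        km = KMeans(n_clusters=patterns_per_class, n_init=4, random_state=0)
        cluster_ids[idx] = offset + km.fit_predict(patch_feats[idx])
        offset += patterns_per_class
    # multiclass linear SVM trained on the cluster labels; its weight rows
    # (one per cluster) serve as the initial pattern filters w_k
    svm = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=2000)
    svm.fit(patch_feats, cluster_ids)
    return svm.coef_.T                        # (d, K)
```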

4 Experiments

To evaluate the proposed method, we conduct experiments on three well-known scene classification datasets: 15 Scenes [19], MIT-indoor 67 [40], and SUN 397 [51]. Images in these datasets were collected from Google and Flickr using class-specific keywords, and the collected images were verified with Amazon Mechanical Turk. All three datasets are publicly available. Since 15 Scenes is a fairly simple dataset, we use it as a preliminary experiment, then conduct detailed experiments on MIT-indoor 67, and finally experiment on the large-scale SUN 397.

4.1 Experimental setup

At the beginning of our experiments, we densely sample patches from each image at four scales, \(\{72 \times 72, 96 \times 96, 120 \times 120, 144 \times 144\}\), with a stride of 32 pixels. Deep CNN features are chosen as the local descriptors: we use activations of the seventh fully connected layer (fc7) of the model trained by Chatfield et al. [3], extracted with the Caffe library [16]. To make training faster and more tractable, PCA is used to reduce the dimension of the local descriptors from 4096 to 256, followed by whitening and L2 normalization.
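A generic sketch of this preprocessing (PCA to 256 dimensions, whitening, and L2 normalization) is given below; it is a plain NumPy rendering under our own conventions, not the authors' exact pipeline.

```python
import numpy as np

def reduce_descriptors(train_feats, test_feats, out_dim=256, eps=1e-5):
    """PCA-reduce fc7 activations (4096-d) to out_dim, whiten, L2-normalize."""
    mean = train_feats.mean(axis=0)
    Xc = train_feats - mean
    # eigendecomposition of the covariance gives the PCA/whitening projection
    cov = Xc.T @ Xc / len(Xc)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:out_dim]
    P = eigvec[:, order] / np.sqrt(eigval[order] + eps)   # whitened projection

    def transform(F):
        Z = (F - mean) @ P
        return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)

    return transform(train_feats), transform(test_feats)
```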

Other important experimental settings also deserve mention. As described in Sect. 3.1, max-pooling with a three-level SPM \((1\times 1, 2\times 2, 4\times 4)\) is adopted for image representation. The number of learned patterns is chosen by testing on the MIT-indoor 67 dataset, and the other parameters are chosen by cross-validation. In the SGD procedure, we set the initial learning rate to 1 and divide it by 10 every T iterations; this is repeated five times, so the final learning rate is 0.00001. In the initialization step, k-means is implemented with the VLFeat library [48], and the final image classifier is trained with LibLinear [12]. All programs are written in MATLAB and run on an Intel(R) Xeon(R) E5-2670 CPU (2.60 GHz) with 64 GB RAM.
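One possible reading of the learning-rate schedule described above is the following step function (our interpretation, not code from the paper):

```python
def learning_rate(t, T, eta0=1.0):
    """Step schedule: start at eta0 and divide by 10 every T iterations,
    five times in total (1 -> 1e-5)."""
    return eta0 * 10.0 ** (-min(t // T, 5))
```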

4.2 15 Scenes

The 15 Scenes dataset includes 4485 scene images across 15 classes. There are about 200–400 images per class, with an average size of \(300\times 250\). Following the standard setup [19], we randomly select 100 images per class for training and use the rest for testing. This is repeated ten times and the average classification accuracy is reported.

Table 1 Classification accuracy for different methods on 15 scenes

For 15 Scenes, \(\lambda _{w}\) and \(\lambda _{u}\) are set to 0.1 and \(5\times 10^{-6}\), respectively, and 60 patterns (i.e., four patterns per class on average) are learned in total. Results are reported in Table 1. Our method achieves a mean accuracy of \(92.80\%\), which is much higher than methods with handcrafted features [43, 44, 50] or deep CNN features [53], even though the latter uses hybrid deep CNN features trained on the combination of the ImageNet and Places datasets. Notice that our method needs only 60 patterns, which is a comparatively small number.

4.3 MIT-indoor 67

MIT-indoor 67 is a very popular scene classification benchmark. It contains 67 categories of indoor scenes and 15,620 images in total. Following the standard setup [40], we adopt the standard partition of 80 training and 20 testing images per category. T, the number of SGD iterations, is set to 10,000, and \(\lambda _{w}\) and \(\lambda _{u}\) are set to 0.1 and \(5 \times 10^{-7}\), respectively, all chosen by cross-validation. We report the classification accuracy along with detailed analyses.

4.3.1 Impact of the number of patterns

The first set of experiments evaluates the impact of the number of patterns. As the red line in Fig. 2 shows, only 268 patterns (i.e., four patterns per class on average) are needed to achieve \(74.85\%\) accuracy. Doubling the number of patterns (to 536) yields only a \(0.15\%\) improvement (74.85 to \(75.00\%\)), which is not surprising because our formulation shares important patterns among different image classes, so many additional patterns are redundant. This demonstrates that simply adding patterns does little to improve performance, so 268 patterns suffice. To the best of our knowledge, this is the fewest patterns required to achieve such classification accuracy: 268 is a tiny number compared with other methods, such as the part-based models [10, 17, 44] that use thousands or even tens of thousands of patterns in total. Thus we can learn a very compact representation by sharing patterns.

Fig. 2

Classification accuracy on MIT-indoor 67 for different numbers of patterns

4.3.2 Impact of the learning procedure

We then compare the performance of the initial pattern filters with the final filters obtained after the learning procedure. The green line in Fig. 2 shows the classification accuracy of the initial filters for different numbers of patterns. Our initialization strategy can be regarded as a weakly supervised BoF, except that the histogram-based encoding in BoF [6] is replaced with max-pooling. This simple initialization already performs decently, and after the learning procedure we obtain a \(+2.16\%\) improvement (72.69 to \(74.85\%\)) with 268 patterns. Our method therefore greatly boosts discriminativeness by selecting the discriminative patterns.

4.3.3 Comparing against jointly training

As mentioned in the last paragraph of Sect. 3, the learned \({\mathbf {U}}\) could also be used as the final image classifier, which amounts to jointly training the pattern filters and the image-level classifier. Instead, we use the learned \({\mathbf {U}}\) only to share the patterns and then retrain the classifier. To illustrate the advantage of this choice, we compare our method with the jointly trained classifier, shown as the blue line in Fig. 2. Our method achieves higher accuracy than the jointly trained classifier for both 268 and 536 patterns, so the learning procedure followed by retraining yields a more compact and discriminative representation.

4.3.4 Comparison with other methods

To demonstrate the advantages of our method, we also perform comparison experiments. We first compare with another shared-representation method [37], as shown in Table 2. The method of Parizi et al. [37] outperforms ours when they use 13,400 parts, but note that we need only 268 patterns, far fewer than theirs, and that they use the Places CNN model trained on a large dataset collected specifically for scene classification [53]. Our method achieves higher accuracy than Parizi et al. [37] when they use 372 patterns (+\(1.55\%\)), while requiring fewer patterns (about \(70\%\) of theirs), which demonstrates that our method learns more compact and discriminative representations.

Table 2 Classification accuracy comparing with another shared representations method on MIT-indoor 67
Table 3 Classification accuracy for different methods on MIT-indoor 67

More results are shown in Table 3. First, our method outperforms methods with handcrafted features by a large margin. It also outperforms methods using ImageNet deep CNN features, such as activations of ImageNet CNN convolutional layers with cross-convolutional-layer pooling [26], ImageNet fc6 with sparse coding [24], ImageNet fc7 with VLAD and Fisher Vector [14], ImageNet fc7 with pattern mining [23], and ImageNet fc8 with the semantic Fisher Vector [9]. The best result on MIT-indoor 67 concatenates the semantic Fisher Vector on ImageNet fc8 with the output of the Places CNN fc7 [9], but it also relies on Places CNN features. We also compare with a Term Frequency–Inverse Document Frequency (TF-IDF) method using the same number of patterns as ours; our method outperforms it by a large margin (74.85 vs. 49.85). These results show that our method is very competitive.

Figure 3 shows the confusion matrix of our method. The misclassifications made by our method (such as confusing bakery and deli, Fig. 4) are hard to avoid even for human eyes.

Fig. 3

The confusion matrix on MIT-indoor 67

Fig. 4

Some pictures in bakery and deli

Table 4 Classification accuracy for different methods on SUN 397

4.3.5 Time costing for training and testing

The efficiency of our method is demonstrated by the fact that, once the deep CNN features have been extracted, it takes only about 8 min to initialize four patterns per class and 3 h for the whole learning procedure. The running time grows linearly with the number of patterns. In addition, when combined with the strategy in [4], which accelerates the extraction of patch features, the time cost of our method could be reduced further. Moreover, for both training and testing, generating the image representation involves only dot products and taking maxima; encoding one image takes only 0.1 s, which is almost negligible.

4.4 SUN 397

SUN 397 is a very large dataset for scene classification. It has 397 image classes, covering both outdoor and indoor scenes such as alley, apartment building, hospital room, and throne room. It contains more than 100K images in total, and each class has at least 100 images. We follow the fixed ten training/testing partitions of Xiao et al. [51], i.e., each class has 50 images for training and 50 for testing. T is set to 20,000 and \(\lambda _{w}\) and \(\lambda _{u}\) are set to 0.1 and \(5\times 10^{-8}\), respectively, according to cross-validation. Following Sect. 4.3.1, we learn four patterns per class on average (1588 patterns in total).

The classification results are shown in Table 4. We reach the same conclusion: our method generates a very compact image representation and achieves the best performance among methods using only ImageNet deep CNN features. Moreover, our method scales to large datasets and performs well on them.

Fig. 5

The top detections of some patterns on the full MIT-indoor 67 training set. Each row shows patches of one pattern, and the label of the image is marked in the top left corner. "Initial" means patterns generated by the initialization step of Sect. 3.3.3, and "learned" means patterns learned by our method. Top detections of "initial" patterns often come from images of one or two classes, while those of "learned" patterns are shared among diverse images. For example, pattern 117, with the semantic meaning of goods, initially concentrates on the grocerystore scene (fifth row); after training, it is shared among the deli, grocerystore, pantry, shoeshop, and toystore scenes (sixth row)

Fig. 6

The pattern frequencies before (top) and after (bottom) our learning strategy on MIT-indoor 67. The details of pattern frequencies are described in Sect. 5

5 Discussion

The results in Sect. 4 reveal several inspiring phenomena. (1) Compared with other pattern learning methods, our extremely shared strategy produces very compact representations: only four patterns per class on average are necessary, which to the best of our knowledge is the fewest used. (2) The small number of patterns does not harm scene classification performance: our method outperforms other methods using ImageNet deep features and is comparable with methods using deep CNN models trained on a large scene dataset. (3) Our method also outperforms the sophisticated Fisher Vector based methods, which achieve state-of-the-art performance in many applications. (4) Thanks to the very compact representation, our method can handle large scene datasets such as SUN 397 [51], on which other pattern learning methods have not been evaluated. All of this confirms that our method generates a very compact yet discriminative image representation.

Figure 5 shows the patches with the highest scores for some semantic patterns on the MIT-indoor 67 dataset. Before learning, the most important patches of a pattern are distributed within a single class. After the learning procedure of the extremely shared strategy, the important patches are dispersed among different classes while still conveying semantic information, which demonstrates that our method shares discriminative patterns to generate an extremely compact image representation. For example, pattern 117 (the fifth and sixth rows in Fig. 5) is concentrated in the grocerystore scene before training; after learning, it is shared among the deli, grocerystore, pantry, shoeshop, and toystore scenes. The learned patterns also convey much semantic information, such as heads (pattern 6), tables (pattern 90), and goods (pattern 117).

The pattern frequencies before and after our learning strategy on MIT-indoor 67 are shown in Fig. 6, where a frequency is computed by dividing the number of occurrences of each pattern by the total number of patches; a pattern occurs at a patch when that patch's maximum response is obtained by that pattern. Before learning, all pattern frequencies lie in the vicinity of a similar value; after learning, the most discriminative patterns are selected with greater frequencies, and the overall frequency distribution becomes sparse. From the pattern frequencies we can assess the statistical significance of our results. Suppose the null hypothesis is that the pattern frequencies do not change after learning, and the alternative hypothesis is that they do. Since a patch belonging or not belonging to a pattern is analogous to a coin toss with "head" or "tail," we can use the Bernoulli distribution to compute the p value. Under these hypotheses, the p value is vanishingly close to 0, i.e., the probability of the null hypothesis being true is almost 0. We can also take as the null hypothesis that the classification accuracy is not improved by our learning strategy, with the alternative being that it is. On MIT-indoor 67, as shown in Fig. 2, the accuracies before and after learning are 0.7269 and 0.7485, respectively; using the Bernoulli distribution again, the p value is less than \(0.1\%\), i.e., the probability of the null hypothesis being true is less than \(0.1\%\). We can therefore reject both null hypotheses. These statistical significance analyses show that our learning strategy selects the most discriminative patterns and meanwhile improves the results.
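For reference, the pattern frequencies plotted in Fig. 6 can be computed as sketched below (an illustrative NumPy version under our own naming; a pattern is counted at a patch when it gives that patch's maximum response).

```python
import numpy as np

def pattern_frequencies(patch_feats_per_image, W):
    """Fraction of patches whose strongest response (argmax over patterns)
    is each pattern, as used for Fig. 6."""
    K = W.shape[1]
    counts = np.zeros(K)
    total = 0
    for F in patch_feats_per_image:          # F is (m_i, d) for one image
        winners = np.argmax(F @ W, axis=1)   # pattern with maximum response per patch
        counts += np.bincount(winners, minlength=K)
        total += len(F)
    return counts / total
```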

6 Conclusions

In this work, we propose a novel method to learn extremely shared middle-level image representations. The lasso regularization is adopted to enforce pattern selection and sharing. Our method learns several discriminative patterns for different scene classes simultaneously and forces them to be shared among different image classes. After the patterns are learned, we compute their responses and use max-pooling to aggregate these responses into the final, extremely compact image representation. Our method achieves very remarkable scene classification performance: only four patterns per class on average are required to represent images, and the learned patterns perform remarkably well on the considered scene datasets. The code for reproducing the results is publicly available. For future work, we would like to explore more powerful methods to find extremely shared patterns that yield an even more compact and discriminative image representation.