1 Introduction

Natural image classification is an important research topic in many communities such as computer vision and pattern recognition. Because of the ability to capture distinctive information in images, local features (such as SIFT [18] and HOG [1]) have been widely used for representing images. The basic idea is representing the content of an image by a group of small, possibly overlapping and independent parts. However, local features generally cannot be applied to image classification directly because of their computational complexity and sensitivity to noises. A common strategy is to integrate local features into a global image representation, for which the codebook-based (or Bag-of-Words, BoW) model[5] and its extensions are developed.

A typical BoW model is subjected to the following three steps: 1) extracting local features from points of interest in an image; 2) encoding local features with a learned codebook or dictionary; and 3) pooling all the codes together to obtain a vector for representing the image. In this framework, the local feature coding process inevitably involves the loss of information, which severely hampers the discriminative power of the generated image representation and decreases the accuracy of image classification. To address this issue, various coding strategies are proposed to generate the codes for local features. Hard voting (HV) [5], in which each local feature votes on its nearest code with ”1” and other codes with ”0”, is usually adopted in the original codebook-based models. In spite of its simplicity, this strategy may produce great deviations such that highly similar local features may have significantly distinctive codes. As an improvement, soft voting (SV) [8] is then developed to allow a local feature to vote on multiple related codes. Instead of the frequency of occurrence, the distance between a local feature and an atom is used to calculate the probability that the local feature belongs to the atom. This scheme preserves more information and achieves better performance than hard voting. Liu et al. [17] further improved the soft voting scheme by only assigning a local feature to its nearest neighborhoods. Sparse coding, another strategy to improve the hard voting scheme, chooses a few atoms to reconstruct a local feature.

Sparse coding can be further extended to the locality-constrained linear coding (LLC) scheme [21], in which each local feature is represented as a linear combination of its nearest neighbors in the codebook. LLC performs codebook learning by using the standard K-means algorithm, which is time-consuming for large-scale data. Furthermore, most of the above coding schemes only take the salient characteristics into account, but ignore the distribution of the local features. Consequently, only the maximum response on each atom in the codebook is preserved, but the lower ones are unsuitably discarded.

To address the above issues, in this paper, a novel spatial locality-preserving coding method is proposed for constructing a robust codebook and obtaining discriminative feature representations. In the codebook learning stage, an effective approximated K-means algorithm, which greatly reduces the computational complexity, is proposed to initialize the codebook. Then, the codebook is updated by exploiting the local geometric structure among the atoms in the codebook. In the coding stage, based on the density distribution of the codebook, we select a few suitable codebook atoms for encoding each local feature in order to preserve the structure of features more completely. In the following pooling process, we also develop a geometric smooth pooling strategy based on the spatial pyramid framework to make full use of the rich information about spatial distribution of the local features. Figure 1 illustrates the processing flow of the BoW model and some representative coding strategies.

Fig. 1
figure 1

Left: flowchart of the spatial pyramid structure for image classification. Right: VQ, LLC and our coding strategies

The remainder of this paper is organized as follows. In Section 2 we provide a detailed review about the related work. In Section 3 the proposed spatial locality-preserving coding method is presented in details. Experimental results on three commonly used benchmark datasets are reported and analyzed in Section 4. Finally, we conclude the paper in Section 5.

2 Related work

Linear coding models have shown excellent performance in image classification. For an image, local features (for example, the SIFT descriptors), are firstly extracted from patches. Afterwards, each local feature is generally approximated by a linear combination of the atoms in a given codebook. The input features are transformed into more discriminative codes in the local feature coding process. The popular linear coding models include Vector Quantization (VQ) [14], Locality-constrained Linear Coding (LLC) [21], Soft-assignment Coding [9], Sparse Coding (SC) [22], and other variants [7]. Motivated by spatial pyramid matching (SPM), Yang et al. [22] proposed the ScSPM algorithm to compute a spatial-pyramid representation for each image based on the sparse codes (SC) of SIFT features. Different from the original SPM, which performed spatial pooling by computing histograms, ScSPM utilized max pooling strategy for achieving the robustness to local spatial translations. From the results of SC, Yu et al. [23] empirically observed that the non-zero coefficients in a code were often assigned to the atoms close to the encoded local feature. By adding the constraint of local agglomeration of candidates, they then extended sparse coding to a more general framework, Local Coordinate Coding (LCC) [23]. This constraint is proved to be beneficial for reducing the representation deviation of local features. In fact, LLC can be regarded as a fast implementation of LCC, wherein the constraint of locality agglomeration is replaced by searching the k nearest candidates.

After the codes of all the local features from an image is obtained, they can be appropriately aggregated into a vector to represent the image by pooling strategy. As a matter of fact, because of its robustness to small transformations in an image, for a long time, pooling has already been adopted in many popular image classification frameworks such as convolutional networks [15]. The earliest research can be traced back to [12], which studied on complex cells in the visual cortex. In the pooling stage, the obtained codes within each region of the image are combined into a histogram. In previous studies, classical pooling operations include max pooling and average pooling, which preserve the max value and the average value from all the responses. Most available studies on the pooling methods are purely empirical. Recently, Boureau et al. [4] provided theoretical analysis on binary feature pooling in the context of classification. They demonstrated that several factors, including the pooling cardinality and the sparsity of features, affected the discriminative power of the pooled features. Feng et al. [6] suggested a geometric l p -norm pooling (GLP) strategy by using some geometric information in the pooled features, which significantly boosted the discriminative capability of the obtained representations for image classification.

3 Spatial locality-preserving coding

3.1 Preliminary

Let \(X_{(j)}=[x_{1},x_{2},...,x_{N_{j}}]\in \mathbb {R}^{D\times N_{j}}\) be a set of D-dimension SIFT features extracted from an image. Given a codebook \(B=[b_{1},b_{2},...,b_{M}]\in \mathbb {R}^{D \times M}\), each local feature is encoded into an M-dimension code by coding schemes.

In the traditional SPM, the vector quantization method applies the k-means clustering to solve the following constrained least square fitting problem:

$$ \begin{array}{l} \arg\min_{Z}\sum\limits_{i=1}^{N_{j}}\|x_{i}-Bz_{i}\|^{2}\\ s.t.\,\,\,\|z_{i}\|_{l_{0}}=1,\,\,\, \|z_{i}\|_{l_{1}}=1,\,\, \, z_{i}\geq,0,\, \,\,\, \forall i \end{array} $$
(1)

where \(Z = [z_{1},z_{2},...,z_{N_{j}}]\); z i denotes the code of x i on codebook B. \(\|z_{i}\|_{l_{0}}=1\) is the cardinality constraint. \(\|z_{i}\|_{l_{1}}\) is the sum of all the elements in z i . z i ≥0 means that all the elements of z i are non-negative. It shows that the index of the only non-zero element corresponds to the cluster to which the local feature x i belongs.

Yu et al. [23] showed that the locality information was more essential. Wang et al. [21] incorporated locality constraint into the coding framework. As discussed above, a codebook can be learned based on LLC. LLC utilizes the traditional iterative K-means algorithm for codebook learning. If both of the data and the cluster centers are massive, the cluster re-assignment step becomes prohibitively expensive. To efficiently construct a robust codebook with high clustering quality, we apply a fast approximated K-means with cluster closures [20] to initialize the atoms in our codebook learning algorithm.

Meanwhile, LLC projects each local feature into the space spanned by its k-nearest neighbors, where k is a small constant, e.g., k=5. This constant is not conducive to encode local features reasonably and may ignore the spatial information around the codebook atoms. For example, if one of the k-nearest neighboring atoms of a local feature is under the sparse distribution, encoding the feature with this atom leads to a weak response. Such atoms are inappropriate to be selected for encoding the local feature. We exploit the density distribution around the atom to select suitable codebook atoms for encoding feature adaptively such that the spatial information around the coding atoms can be better preserved.

3.2 Learning codebook based on cluster closures

Given a set of D-dimension local features extracted from images, i.e., X=[x 1,x 2,...,x N ], K-means clustering aims to partition X into M(MN) groups, ζ=[ζ 1,ζ 2,...,ζ M ]. First, we identify each active point xX and update its cluster membership. Denoting \(\mathcal {N}_{x}\) as a set containing x and its nearest neighbors, we define the closure of the cluster ζ i as: \(\bar {\zeta _{i}}=\bigcup _{x\in \zeta _{i}}\mathcal {N}_{x}\). For obtaining \(\mathcal {N}_{x}\), a single approximated nearest neighborhood for each point can be derived from a random partition (RP) tree [19]. The last neighborhood is assembled by combining the results from multiple random spatial partitions. Fig. 2(a) illustrates the relationships among the cluster, the neighborhood points, and the closure. Then, different from the standard K-means which computes the distances between a feature x and all the cluster centers, the proposed method only needs to compute the distances between x and a small number of cluster centers whose cluster closures contain x. This strategy reduces much computational cost and keeps active points close to the cluster boundaries. This method is motivated by the observation that the active points whose cluster assignments are changed in each iteration often locate at or near the boundaries of different clusters. Therefore, the main idea of the method is to identify the active points so as to improve the accuracy and efficiency in the assignment step of codebook initialization.

Fig. 2
figure 2

a merging neighborhoods to obtain a closure. b learning codebook by exploring the local geometric structure around code c i

After initializing the codebook, LLC encodes each local feature using its k nearest neighboring atoms in the codebook. We observe that each codebook atom can obtain multiple responses in the coding stage. If an atom is surrounded by a local dense group of local features, it will obtain strong response. Therefore, the initialized codebook should be further updated to obtain a better image representation. Specially, each center should be adapted individually before encoding the local features. To address this issue, we propose a matrix-adaption step. Firstly, a vector e i j is utilized to describe the relationship between a local feature x i j ζ i and the center c i of the cluster ζ i :

$$ \begin{array}{l} e_{ij}=x_{ij}-c_{i}\\ i=1,2,...,M,\, j=1,2,...,|\zeta_{i}| \end{array} $$
(2)

where M is the number of clusters and |ζ i | denotes the number of points in the cluster ζ i . With vector e i j , the vector of the tangent plane at c i can be calculated for describing the local dense region of the cluster:

$$ e_{i}^{|\zeta_{i}|}=\sum\limits_{j=1}^{|\zeta_{i}|}e_{ij} $$
(3)

Afterwards, the density of the cluster ζ i is calculated as:

$$ p_{i}=\frac{1}{|\zeta_{i}|}\sum\limits_{j=1}^{|\zeta_{i}|}e_{ij}^{\mathrm{T}}\cdot e_{i}^{|\zeta_{i}|} $$
(4)

In order to take the density of the cluster into account, the center of the cluster is adjusted. It moves towards the direction of the tangent plane with a ratio η i when p i satisfies that the center locates in a more suitable place where the local features are denser.

To move the center towards the local dense regions, the codebook should be updated based on the distribution of all the local feature points in each cluster, as illustrated in Fig. 2(b). Consequently, our codebook learning method is able to well preserve the locality.

The proposed codebook learning process is summarized as Algorithm 1.

figure a

3.3 Density-constrained coding

In the coding phase, for each local feature in \(X_{(j)}=[x_{1},x_{2},...,x_{N_{j}}]\) extracted from an image, LLC only uses distance information to select codebook atoms. Selecting the codebook atoms may significantly affect the response. If a local feature is in a sparse region, the responses are weak for the atoms. The significance of each coding atoms to the local feature is variant. In order to exploit the difference of the significance discussed above, we define the density of local features in the space spanned by atoms in the codebook. Given a learned codebook C=[c 1,c 2,...,c M ], the estimated density at c i is given as:

$$ p_{i}=\frac{1}{(1+\frac{1}{m}{\sum}_{j=1}^{m}\|x_{j}-c_{i}\|_{2}^{2})^{2}}\cdot\frac{m}{M} $$
(5)

where x j is a point in the cluster ζ i ; m is the number of these points; and M is the total number of clusters. If the average distance in cluster ζ i is small and the number of these points is large, the density at c i is large.

For each local feature xX (j), we first choose a set of nearest neighbors of x roughly. Subsequently, if an atom is one of the top-two nearest neighbors of x, or/and its density \(p_{i}> \gamma \times \overline {P}\) (\(\overline {P}\) is the mean density of all the atoms), it is selected to encode the feature x. Finally, we are able to obtain more representative coding bases. The density concept has advantages over describing spatial information. An atom’s density is large means that it contains much useful information. By using density to describe the spatial characteristic, we can learn the codebook adaptively from the field which includes more spatial information. Meanwhile, the codebook is capable of preserving the locality. Therefore, the problem is transformed into solving the following equation:

$$ \begin{array}{l} \min_{z_{i}}\|x_{i}-C^{\prime}z_{i}\|^{2}_{2}+\lambda\|d_{i}\odot z_{i}\|^{2}_{2}\\ s.t.\qquad\, 1^{\mathrm{T}}z_{i}=1,\quad \forall i \end{array} $$
(6)

where ⊙ denotes the element-wise multiplication, C contains the atoms selected for the local feature x i according to the density constraint. The entries of d i are the distances between x i and the atoms in C . Particularly, we have:

$$ d_{ij}=\exp(\frac{\|x_{i}-c_{j}\|_{2}^{2}}{\sigma}),\quad \|x_{i}-c_{j}\|_{2}^{2}<\sigma $$
(7)

3.4 Geometric smooth pooling

After the coding phase, each image, from which N j local features \(X_{(j)}=[x_{1},x_{2},...,x_{N_{j}}]\in \mathbb {R}^{D\times N_{j}}\) are extracted, is represented by the coding matrix \(V=[v_{1},v_{2},...,v_{N_{j}}]\in \mathbb {R}^{M\times N_{j}}\), where v i is the code of local feature x i . Note that V is obtained by solving (6). If an atom is not selected to encode a local feature, the element corresponding to the atom is zero in the local feature’s code. Denote \(V_{i.}=[v_{i1},v_{i2},...,v_{iN_{j}}]\) is the i-th row of V, where v i j is the response of i-th codebook atom for the local feature x j . Therefore, each code may obtain multiple highly redundant responses. To address this problem, pooling operations are necessary to aggregate the encoded response vectors into a statistical vector for representing the whole image or the regions of interest. As discussed before, ”max pooling” and ”average pooling” are two commonly used operations. Recent theoretical analysis has revealed that the two pooling methods were easily suffered from the unrecoverable loss of the spatial information caused by the statistical summarization. Boureau et al. [4] summarized that max pooling strategy had good performance in a spatial pyramid model and that it could achieve better performance when the codes are sparse. Liu et al. [17] indicated that max pooling was a lower bound formulation of the likelihood of at least one visual word presented in an image. However, this does not clarify when the low response is more suitable for classification than the exact analytical solution. Unlike GLP [4], which used the l p -norm function tailored for the spatial distribution of class-specific features, the proposed pooling method is capable of preserving the geometric information of local responses based on three-level spatial pyramid model, as illustrated in Fig. 3. More specifically, the operation is defined as:

$$ f_{g}(V_{i.})=\sum\limits_{j}w_{ij}v_{ij}=V_{i.}W_{.i} $$
(8)

where \(W_{.i}=[w_{1i},w_{2i},...,w_{N_{j}i}]^{\mathrm {T}}\in \mathbb {R}^{N_{j}}\) is the weight vector of geometric smooth pooling for V i., which explores the geometric relationship between the low response and the max response. To incorporate this spatial constraint on smoothness, we calculate the weight w k i as:

$$ w_{ji}=\exp(-\frac{\|a_{j}-a_{ind_{i}}\|^{2}}{2\sigma^{2}}),\ \ 1\leq j\leq N_{j} $$
(9)

where i n d i is the index of the max response in V i.; a j = (x j ,y j ) and \(a_{ind_{i}}=(\mathbf {x}_{ind_{i}}, \mathbf {y}_{ind_{i}})\) denote the spatial coordinates of the local feature x j and \(x_{ind_{i}}\) in the image. Then, we have:

$$\begin{array}{@{}rcl@{}} f_{g}(V_{i.}) &=&\sum\limits_{j}\exp(-\frac{\|a_{j}-a_{ind_{i}}\|^{2}}{2\sigma^{2}})v_{ij}\\ &=&\|V_{i.}\|_{\infty} +\sum\limits_{j=1,j\neq ind_{i}}\exp(-\frac{\|a_{j}-a_{ind_{i}}\|^{2}}{2\sigma^{2}})v_{ij} \\ &=&\|V_{i.}\|_{\infty}+H(V_{i.}) \end{array} $$
(10)

where ∥V i. = maxV i.; H(V i.) is the geometric smooth function for max pooling, which explores the geometric relationship between the low response and the max response. It utilizes the relationship between the low response and the max response on each code sufficiently, rather than suppressing the low responses. If the frequency of the non-zero responses is large and the low response is close to the max response, the value of H(V i.) is large. The geometric smooth pooling uses the rich information about spatial distribution of low responses sufficiently. This method can preserve the geometric information of local responses around the max response. Afterwards, the pooling strategy considers not only the max response but also other low responses gradually such that the redundant elements in the codes are reduced as much as possible. This significantly improves the discriminative ability of the obtained image representation. Therefore, the obtained pooling results are more robust than those by using traditional pooling methods.

Fig. 3
figure 3

Geometric smooth pooling based on three-level overlapping pyramid

4 Experiments

We conduct image classification experiments on three benchmark object and scene datasets (15-scene dataset [14], Caltech 101 [16], and Caltech 256 [10]), making performance comparisons and the analysis of our approach with a variety of state-of-the-art spatial pyramid variants. All the experiments are repeated for 10 times in order to obtain stable and robust performance. The performance is measured by average accuracy over all the classes. Each image is resized so that its largest size is 300 pixels with the same aspect ratio. For each image, we extract SIFT descriptors on a 16 ×16-pixel patch sampled with a step-size of 8-pixel. A subset is randomly sampled from the training set for codebook learning. The codebook is initialized using the fast approximate K-means with cluster closures. Then, we update the codebook according to the local geometrical structure of the code atoms. Afterwards, we use the density information to select the coding bases adaptively, and encode the features. We use the geometric smooth pooling method based on spatial pyramid to form the descriptor of the entire image. The pooled features from each sub-region are concatenated and normalized as the final image representation. Finally, we adopt a standard linear SVM as the classifier to perform image classification.

4.1 15-scene dataset

The 15-scene dataset consists of 4,485 images spreading over 15 categories, with the number of images in each category ranging from 200 to 400. The scene categories contain from the out-door street, industrial to the in-door kitchen, living room, and etc. We randomly select 100 images from each class as the training samples and the remaining samples are used as the testing data. The size of the codebook is set as 600. The results show that our method achieves comparable performance while maintaining clear computational advantage over other optimization-based methods. We can see from Table 1 that our method achieves much better performance than the state-of-the-art methods. To train the codebook, we use the standard K-means clustering for KSPM and the sparse coding scheme for ScSPM. Our method achieves 83.63 %, outperforming the fastest codebook learning algorithm in all the baseline methods, KSPM, by 2.7 %. Moreover, our method also outperforms KC, a soft vector quantization algorithm, by more than 8 %. In this experiment, the proposed approach achieves more than 90 % classification accuracy in all the 10 classes. Fig. 4 illustrates the classification results of 5 classes that have more than 50 testing images.

Table 1 Classification accuracies(%) on 15-scene
Fig. 4
figure 4

Image classification samples in 15-scene dataset

4.2 Caltech-101

The Caltech-101 dataset contains 9,144 images in total from 102 different categories, including animals, vehicles, flowers, and etc., with high shape variability. The number of images per category ranges from 31 to 800. In our experiments, we partition the whole dataset into 15 and 30 images as training images per class respectively, and others as testing images per class. We train a codebook with 1,000 atoms. The details of comparisons are listed in Table 2. Our method outperforms ScSPM by 2 %, and is even better than LLC, which obtained the best classification performance on Caltech-101 dataset before. More specifically, in some object categories which are lifeless like menorah, ferry and chair, our methods are more efficient than LLC because the obtained representations are more stable and the used spatial information in images is more discriminative. For example, our method obtains the accuracy of 84.62 % in menorah images, while LLC achieves 69.32 %. In chair images, our approach records the higher accuracy than LLC by 25 %. Comparing with the graph-based coding methods, which only utilize the edge information in the feature coding process, our approach also achieves better results by considering more abundant spatial information. Furthermore, we present that the advantage obtained by our coding and pooling approach cannot be achieved by simply increasing the codebook size straightforwardly. To empirically justify this argument, we train multiple codebooks with different sizes, and compare the classification accuracies of the proposed pooling method with the traditional max pooling method. As shown in Fig. 5, with the same codebook generated by the fast K-means via cluster closures and spatial locality-preserving feature coding, the performance is much better by using the geometric smooth pooling proposed in this paper than that by using max pooling strategy. Furthermore, the codebook generated using our method is more robust and accurate than that of LLC, which is simply generated using K-means. It is also worth noticing that, when we simply increase the codebook size, it becomes more difficult to obtain further performance gain. If we adopt a better pooling strategy, it always bring additional accuracy gains.

Table 2 Classification accuracies(%) on Caltech-101
Fig. 5
figure 5

Comparisons on classification accuracies in Caltech-101 dataset with 15 training samples per class

To compare the two codebooks made by our algorithm and LLC respectively, we reduce the dimension of the codebook to two. As shown in Fig. 6, it can be found that our algorithm makes the codebook more compact when the density of local features is large. It means that the locality of the local features can be preserved. With the center of cluster being adjusted, some saliency information is still lost. But the overall accuracy and the accuracies on most categories, as mentioned above, are all improved.

Fig. 6
figure 6

Comparisons on 2-dimension codebook in Caltech-101

4.3 Caltech-256

The Caltech-256 dataset consists of 30,607 images of 256 categories. This database is significant for its large inter-class variability, as well as its intra-class variability larger than that in Caltech-101. Each class contains at least 80 images. We follow the protocol of commonly used methodology. Fifteen images are randomly selected from all the categories for training. The remaining images are used for testing. We train a codebook with 1,024 atoms. As it can be seen from Table 3, with 30 training images, our method improves the accuracy of classification by 1 %. Furthermore, our method is also better than ScSPM by more than 20 %. Figure 7 lists the performance of the proposed method on Caltech-256. It can be seen that the proposed strategy on coding and pooling outperforms the state-of-the-art methods. The classification accuracies grow along with the increase of the size of dictionary, before the size is larger than 1,000. Figure 8 lists 10 classes that have the best classification accuracies.

Table 3 Classification accuracies (%) on Caltech-256
Fig. 7
figure 7

Comparisons on classification accuracies in Caltech-256 dataset with 15 training samples per class

Fig. 8
figure 8

Image examples with the highest classification accuracy in Caltech-256

5 Conclusion

In this paper, we present a spatial locality-preserving coding method by adaptively selecting the coding bases according to the density of code atoms in the codebook. First, the codebook is initialized by a fast approximated K-means with cluster closures. Then, we update the codebook according to the local geometrical structure of each center to obtain better image representation. Instead of discarding the low responses, we utilize the geometric relationship among codebook atoms. In the pooling process, we incorporate the geometric smooth pooling scheme into the coding framework to generate more discriminative image representations. Experimental results demonstrate that our method outperforms the state-of-the-art methods on image classification accuracy.