1 Introduction

Image representation is a fundamental problem in computer vision and has attracted enormous attention in recent years. One of the most popular image coding methods is the bag-of-words (BoW) model, which converts an image into a histogram-based representation. The BoW model shows outstanding performance, especially robustness to spatial variations [16, 45]. The BoW pipeline is normally composed of two main steps: (i) dictionary generation and quantization of the local features extracted from the images [13]; (ii) feature pooling at the image level, such as max pooling and sum pooling. Recently, sparse coding techniques have been used and achieved state-of-the-art performance in many applications such as object detection [33], tracking [48], image classification [9, 49] and face recognition [42].

In the BoW model, each image is represented as a histogram in which each bin counts the occurrences of its corresponding visual word. When sparse coding is applied, each feature is represented as a linear combination of a small number of basis vectors. To obtain the sparse code, some methods compute the dictionary and the histogram-based representation separately [40], while others learn the optimal dictionary and coding parameters for local features simultaneously [45]. To reduce the computational complexity of sparse coding, Wang et al. [40] used the k nearest bases to encode each feature, and Gao et al. [9] added a Laplacian term to the sparse coding optimization to guarantee that the sparse code changes smoothly on the data manifold. However, all these methods ignore the distribution of local features over the basis vectors. Such a distribution is important for effectively reflecting the relationship between similar features, and may avoid the case where local features that are similar in the Euclidean space turn out to be different in the sparse representation [9].

Our motivation is to explore the useful information in the local feature distribution and integrate it into the objective function. Specifically, the aims of our work are two-fold: (i) exploring class-specific similar features to increase the discriminative capability of image representations for different classes, and (ii) learning more informative dictionaries. Most existing methods related to our first aim search for similar features over the whole training set. This mixes up features from foreground and background, and also reduces the discrimination of the sparse code [9, 40]. On the other hand, the usual strategies related to our second aim learn a discriminative dictionary for each class and then assign each test image to its predicted class by minimizing the information loss between the image representation and the classes [15, 41]. Chiang et al. [5] learned a component-level dictionary in each image group which exploits group characteristics to derive the sparse code. Shen et al. [35] proposed a novel dictionary learning method that takes advantage of hierarchical category correlation. Zhang et al. [52] proposed an image classification method using Laplacian affine sparse coding with tilt and orientation consistency. Lazebnik et al. [15] learned discriminative visual vocabularies by joining the features and posterior distributions for each class. However, such strategies are not optimal for label prediction [44].

To overcome the shortcomings described above, we propose a discriminative sparse neighbor coding method. Firstly, to boost the discrimination of the sparse codes, we develop two modules in the sparse coding process: (i) eliminating the non-discriminative features for each specific class; (ii) eliminating the non-informative visual words. Module (i) is also a feature selection process which keeps the class-relevant features and highlights the high-level class knowledge of images. Then, in the coding stage, the discriminative neighbors of each feature are selected. The frequencies of the local features and their neighbors over the dictionary are calculated and integrated into the objective function. This scheme is useful for feature coding because local features that are close in the Euclidean space are likely to share neighboring visual words.

The contributions of this paper are three-fold. Firstly, we employ an iterative method to eliminate non-discriminative features in each class. This addresses the problem that class-irrelevant features in each class may reduce the accuracy of the neighbor information. Secondly, we adopt a statistical model to eliminate the non-informative visual words, which not only are ineffective in representing the content of an image but also degrade the discriminative capability of the coding. Finally, to characterize the relationship between local features and classes, we propose a coding method called sparse neighbor coding. We calculate the dominant basis vectors for each class and use the neighbor features to obtain the frequency distribution over the basis vectors in each class, which leads to a more discriminative sparse code.

In the experiments, we demonstrate the benefit of the proposed method for image classification on several publicly available datasets. The performance of individual components of our framework is also verified in the experiments.

The remainder of this paper is organized as follows.

Section 2 reviews related work on sparse coding and presents an overview of the proposed method. Section 3 presents the details of feature selection and visual word elimination. The proposed sparse neighbor coding method is described in Section 4. Section 5 reports the experimental results that validate the effectiveness of the proposed method. Section 6 summarizes the key contributions of this paper and discusses future work.

2 Related work and overview of the proposed approach

2.1 Related work

The bag-of-words (BoW) model has proved to be very useful in image coding. In the hard-assignment coding scheme, each coding coefficient vector has only one non-zero element, which indicates the cluster each feature belongs to. Since such a restriction may cause severe information loss, the soft-assignment coding method [32] was proposed to relax the constraint; it computes coding coefficients on all visual words based on their distances to the local feature. Moreover, to cope with the loss of spatial information caused by the BoW model, Lazebnik et al. [16] introduced the spatial pyramid matching (SPM) model to derive the image representation from a spatial perspective.

Recently, sparse coding strategies have shown their effectiveness in feature representation. Given an input dictionary matrix D and a signal x to be encoded, sparse coding aims to find a linear combination of a few basis vectors from D that reconstructs x. Yang et al. [45] combined sparse coding with the SPM model and notably improved the discriminability of traditional sparse representations.

The transformation from a feature vector to its sparse representation causes information loss. To cope with this loss, several techniques make use of the relationships among features to obtain better sparse representations. Wang et al. [40] suggested that locality plays a more significant role than sparsity in sparse coding and proposed an approximate solution that obtains the sparse code with only the k nearest basis vectors. Lu et al. [22] proposed a method which preserves the incoherence of dictionary entries based on non-local self-similarity and manifold learning. Zheng et al. [53] developed a graph regularized sparse coding method by considering the local manifold structure of the data. The manifold structure has also been combined with a random walk model to find the nearest neighbors of the encoded feature and boost the representation [34]. Compared with methods that encode each feature separately, these methods can preserve the similarity relations between different features.

A number of researchers focus on group sparse coding, which encodes similar features into similar sparse codes by learning a common dictionary over multiple different groups of data [1, 25, 46]. In group sparse coding, the \(\ell_{1}/\ell_{2}\) norm replaces the \(\ell_{1}\) norm in the sparse coding formulation. Julien et al. [25] acquired the sparse codes with respect to a subset of the dictionary by jointly decomposing groups of similar signals; as a consequence, the similarity between features can be maintained. Mosci et al. [26] proposed an efficient optimization procedure for computing the solution of group lasso with overlapping groups of variables.

To obtain a discriminative sparse representation, some researchers focus on finding an optimal dictionary that leads to the lowest reconstruction loss with a set of sparse coefficients. In this context, dictionaries are learned for each class. In [31, 37], each patch of the test image is approximated with respect to a set of class-specific dictionaries; the image class is then predicted by computing the residual errors over the different classes. Julien et al. [23] proposed an online learning method to deal with large datasets with millions of training samples, which effectively handles the high computational complexity arising when the training set is large. Liu et al. [20] showed the importance of the non-negativity property and discriminating capability in the sparse representation.

Before the coding stage, several methods are used to guarantee the discriminative property of the dictionary and the image representation. Some approaches focus on selecting useful local features for training. For instance, Turcot et al. [39] proposed a match-based method that augments the feature representation with a graph model and keeps only the useful features. In [14], a pairwise image matching method was presented to select discriminative foreground features. Liu et al. [18] proposed an image matching based iterative strategy to select the discriminative features. This method is based on the Earth Mover's Distance (EMD) [29], which finds the optimal correspondences between features and can be used to compute the similarity between images. On the other hand, some researchers [36, 38, 47] paid more attention to removing noisy visual words. Sivic et al. [36] considered the frequencies of visual words occurring in images, borrowing from text retrieval techniques. Tirilly et al. [38] proposed a method to eliminate useless visual words based on the geometric properties of the local features and probabilistic latent semantic analysis (pLSA).

The literature reviewed above focuses on different aspects of the feature coding process, such as feature selection and dictionary learning. The aim of these methods is to reduce the information loss of sparse coding and boost the effectiveness of the image representation. Different from the above sparse coding methods, we weight the dominant basis vectors using the frequency distribution of similar local features. Our method explores the class-specific subspace for encoding local features, preserving the similarity of the local features after sparse coding.

2.2 Overview of the proposed approach

In this paper, we propose a discriminative sparse neighbor coding method. We use the frequency distribution of similar features over the basis vectors in the coding stage, thereby retaining the similarity between local features. To keep the discriminative features in each class and eliminate the non-informative visual words, we develop two modules that boost the discrimination of the sparse code.

In detail, the proposed method comprises the following steps:

1) Discriminative feature selection: An image matching based feature selection method is employed to select the discriminative class-specific features from each image.

2) Non-informative visual words elimination: A statistical method is utilized to automatically discover the non-informative visual words and eliminate them to strengthen the discriminative power of the visual words.

3) Neighborhood searching: Find the similar features (i.e. neighbors) in each class for each given local feature through offline strategies.

4) Sparse neighbor coding: The distribution of the feature's neighbors over the basis vectors is calculated. This distribution is formulated as weighted coefficients which, together with the dominant basis vectors of each class, are integrated into the objective function to obtain the sparse neighbor code.

Following the sparse coding stage, max pooling and SPM are used to compute the image-level representation. A one-vs-rest classifier is then employed for image classification. The framework of the proposed method is illustrated in Fig. 1.

Fig. 1 An overview of the proposed method

3 Discriminative feature and visual word selection

Neighbor information is helpful for encoding local features, but class-irrelevant features (i.e. features from the cluttered background) in each class reduce the quality of the encoded codes. Therefore, we aim to detect and eliminate these class-irrelevant features in each class to boost the representational power of the sparse code. Furthermore, some of the generated visual words may not be useful for representing visual content. These visual words need to be eliminated, which also reduces the size of the dictionary and the computational cost in the subsequent sparse coding phase. To achieve these goals, we introduce a method based on image matching to highlight the class-specific features, and adopt a statistical model to eliminate the non-informative visual words.

3.1 Discriminative feature selection

The similarity between features is important for sparse representation. Some strategies integrate neighbor information into the objective function to encode each local feature [9, 53]. However, features from irrelevant objects or from the background may reduce the performance of these strategies. For example, when coding local features that are supposed to lie on the surface of an object, the performance of sparse coding declines if neighbors from the cluttered background area are treated as object features in training. As illustrated in Fig. 2, the retrieved neighbors may come from the background area: although they are similar to the encoded feature in the feature space, they are not visually relevant. This confusion reduces the performance of the feature coding stage. Therefore, if such features within a specific class can be detected and eliminated, the encoded sparse codes will be more discriminative.

Fig. 2 Features from irrelevant objects

We adopt the EMD-based strategy introduced in [18] in our feature selection model, such that the discriminative features are those shared by images from the same class but not by images from different classes. The EMD measure not only computes the distance between two images, but also characterizes the contribution of each feature to the matching, which can be used to update the weight attached to each feature.

Suppose \(F=\{(f_{1},w_{1}),\dots,(f_{|F|},w_{|F|})\}\) is the set of local features extracted from image I, where |F| is the number of local features, \(f_{i}\) is a local feature and \(w_{i}\) is its corresponding weight. Initially, each \(w_{i}\) is set to 1 and it is then updated based on its contribution to the image matching process. Given two images \(I_{p}\) and \(I_{q}\), the EMD distance is defined as

$$\begin{array}{@{}rcl@{}} EMD(I_{p}, I_{q}) = (\sum\limits_{i,j}{f_{ij}d_{ij}}) / (\sum\limits_{i,j}f_{ij}) & \\ s.t. \quad f_{ij} \geq 0, \quad \sum\limits_{j}f_{ij} \leq w_{i}, \quad \sum\limits_{i}f_{ij} \leq w_{j} & \\ \sum\limits_{i,j}f_{ij} = \min(\sum\limits_{i}w_{i},\sum\limits_{j}w_{j}) & \end{array} $$
(1)

where \(\{f_{ij}\}\) is the flow matrix, with each \(f_{ij}\) denoting the flow between features \(f_{i}\) and \(f_{j}\), and \(\{d_{ij}\}\) is the thresholded distance matrix, with each element defined as \(d_{ij} = \min(d(i,j),t)\), where d(i,j) is the Euclidean distance between features \(f_{i}\) and \(f_{j}\). The parameter t controls the speed of the EMD computation, and we set t = 10 in our work.

Then the weight of each local feature is updated on the basis of the feature matching obtained during the EMD calculation. The contribution of \(f_{i}\) with respect to image \(I_{q}\) is calculated as

$$ c_{q}(i) = \sum\limits_{j} f_{ij} \times \delta_{j} / d_{ij} $$
(2)

The term \(\delta_{j} = \frac{|I_{q}| \times w_{j}}{{\sum}_{k=1}^{|I_{q}|} w_{k}}\) is a normalizing factor, where \(|I_{q}|\) is the number of local features in image \(I_{q}\). The weight of feature \(f_{i}\) is updated using all related contributions in a class. Specifically, the weight of feature \(f_{i}\) is reassigned as

$$ w_{i} = \frac{1}{M-1} \sum\limits_{q=1}^{M-1}c_{q}(i) $$
(3)

where M is the number of images in the class. In this way, the class-specific local features with strong matches across all images in the same class are selected.

The pairwise matching and feature weight update steps are performed iteratively to highlight the discriminative features in each class. Initially, the weight of each feature is set to an equal value, i.e., 1. We then minimize the EMD in (1) to compute the flow \(\{f_{ij}\}\), and update each weight according to (2) and (3). The stopping criterion for this iterative procedure is the separability of the training set, the details of which can be found in [18].

The non-discriminative features with trivial weights are then eliminated. We thus obtain more reliable similar features, which are used for learning more robust image representations. A sketch of one weight-update pass is given below.
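The following is a minimal sketch of one weight-update pass over a class, assuming the third-party POT (Python Optimal Transport) package for the flow computation in (1); the function name and the unit-mass normalization are our own illustration rather than the authors' implementation:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def update_weights(feats, weights, class_feats, class_weights, t=10.0):
    """One weight-update pass for the features of a single image:
    match against the other M-1 images of the class (Eq. 1),
    accumulate matching contributions (Eq. 2), and average (Eq. 3)."""
    contrib = np.zeros(len(feats))
    for g_feats, g_w in zip(class_feats, class_weights):
        g_w = np.asarray(g_w, dtype=float)
        # thresholded Euclidean ground distance, d_ij = min(d(i,j), t)
        D = np.minimum(ot.dist(feats, g_feats, metric='euclidean'), t)
        # ot.emd expects both weight vectors to carry equal total mass
        a = weights / weights.sum()
        b = g_w / g_w.sum()
        F = ot.emd(a, b, D)  # optimal flow matrix {f_ij}
        # contribution of feature i, Eq. (2): sum_j f_ij * delta_j / d_ij
        delta = len(g_w) * g_w / g_w.sum()
        contrib += (F * delta[None, :] / np.maximum(D, 1e-12)).sum(axis=1)
    return contrib / len(class_feats)  # Eq. (3): average over M-1 images
```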

3.2 Non-informative visual words elimination

Our motivation for non-informative visual word elimination comes from noisy word elimination in text documents, where noisy words often occur frequently and influence text categorization. Such noisy words, e.g. in, of, on, if, the, are called stop words in text processing [11, 27]. In computer vision, there are likewise non-informative visual words that are not useful for image classification and retrieval.

In sparse coding, the visual words (basis vectors) are traditionally obtained by clustering algorithms, so the semantic information of the visual words cannot be predefined. In this paper, we utilize the Chi-square model [11] to find the non-informative visual words based on the relationship between visual words and image classes. A visual word is considered non-informative if it satisfies the following two conditions:

  • It has high frequency in many images: a visual word that occurs in many images cannot represent any specific image or object.

  • It has small statistical correlation with all the classes: a non-informative visual word cannot characterize the relation between visual words and classes, which reduces the discriminative ability of the final encoded feature representation.

Suppose the dictionary \(D^{\prime} = \{v_{1}, v_{2}, \dots, v_{K^{\prime}}\}\) (\(K^{\prime} \geq 1\)) is generated from the selected features obtained in the last step, and C is the total number of classes. The relation between visual word \(v_{i}\) and the classes is shown in Table 1.

Table 1 The contingency table of visual word v i

In the contingency table, the meanings of the items are described as follows:

  • \(n_{1j}\) denotes the number of images containing visual word \(v_{i}\) in class \(c_{j}\);

  • \(n_{2j}\) denotes the number of images not containing visual word \(v_{i}\) in class \(c_{j}\);

  • \(n_{+j}\) denotes the total number of images in class \(c_{j}\);

  • \(n_{1+}\) denotes the total number of images in the training set containing visual word \(v_{i}\);

  • \(n_{2+}\) denotes the total number of images in the training set not containing visual word \(v_{i}\);

  • N denotes the total number of training images.

The independence between visual word \(v_{i}\) and the classes is measured using the following weighted Chi-square statistic

$$ {\chi^{(i)}_{weighted}}^{2} = {\chi^{(i)}}^{2} / {If}_{v_{i}} $$
(4)

where

$$ {\chi^{(i)}}^{2} = \sum\limits_{r=1}^{2}\sum\limits_{j=1}^{C} \frac{(N n_{rj} - n_{r+} n_{+j})^{2}}{N n_{r+} n_{+j}} $$
(5)

In (5), \({\chi^{(i)}}^{2}\) denotes the association between a visual word and the classes, computed over the 2 × C contingency table: the smaller it is, the more weakly the visual word is correlated with the classes. The term \({If}_{v_{i}}\) in (4) denotes the occurrence frequency of visual word \(v_{i}\) in the images, which acts as a trade-off factor balancing the per-class association against the overall frequency of the visual word. All visual words are then listed in descending order of their weighted Chi-square statistics, and those whose values exceed a given threshold are retained; the threshold is determined by cross-validation [28]. In the experiments we obtain the threshold by leave-one-out cross-validation on the training set for each trial and choose the value that leads to the best classification accuracy.
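As a concrete illustration, the score of one visual word can be computed directly from the contingency counts in Table 1; the sketch below follows the reconstruction of (5) above, with hypothetical argument names:

```python
import numpy as np

def weighted_chi2(n1, n_per_class, word_freq):
    """Weighted Chi-square score of one visual word, Eqs. (4)-(5).
    n1[j]: images of class j containing the word (n_1j);
    n_per_class[j]: total images of class j (n_+j);
    word_freq: occurrence frequency of the word (the If term)."""
    n1 = np.asarray(n1, dtype=float)
    n_plus_j = np.asarray(n_per_class, dtype=float)
    n2 = n_plus_j - n1                    # images of class j without the word
    N = n_plus_j.sum()
    chi2 = 0.0
    for n_rj, n_r_plus in ((n1, n1.sum()), (n2, n2.sum())):
        chi2 += np.sum((N * n_rj - n_r_plus * n_plus_j) ** 2
                       / (N * n_r_plus * n_plus_j))
    return chi2 / word_freq               # Eq. (4): divide by If_{v_i}
```

Words are then ranked by this score, and only those above the cross-validated threshold are kept.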

4 Sparse neighbor coding

In this section, we describe the sparse neighbor coding method, which converts low-level features into sparse codes. Each class has a potential low-dimensional linear subspace that can be used to approximately construct the sparse codes. Our contribution comes from considering the feature frequency distribution information that has been ignored in existing sparse coding methods [40, 44]. We propose to incorporate the neighbor information into the optimization to obtain a discriminative sparse code. Moreover, instead of computing a set of basis vectors for each class and predicting the label based on the residual error, we weight each basis vector by calculating its importance to each class.

4.1 Dominant basis vector learning

In image representation, data samples belonging to the same class tend to lie in the same low-dimensional subspace. This means that a new sample can be reconstructed with lower computation load by using only a few basis vectors (atoms) in its corresponding class.

In light of this observation, we commence by finding the dominant basis vectors, which have high relevance to their corresponding class. These basis vectors can be used to construct a more discriminative sparse code for each local feature. To this end, we start by finding the basis vectors with lower reconstruction errors for each class.

Suppose \(D \in R^{d \times K}\) is the dictionary from which the non-informative visual words have been eliminated. Each column of D represents a basis vector. To encode each local feature \(x_{i}\) of an image, we use sparse coding with the \(\ell_{1}\) norm. Sparse coding ameliorates the quantization loss of hard vector quantization (VQ): in VQ, only the closest basis vector is active, whereas sparse coding relaxes this constraint by using a sparsity regularization term, which can be formulated as follows

$$ \arg\min_{z_{i}} \|x_{i} - D z_{i}\|_{2}^{2} + \lambda \|z_{i}\|_{1} $$
(6)

where \(z_{i}\) is the sparse code for the feature \(x_{i}\) and λ is the regularization parameter that trades off reconstruction error against sparsity of the coefficients. This convex problem can be solved efficiently by the SPArse Modeling Software (SPAMS) [24].
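As an alternative to SPAMS, (6) can be solved with any Lasso solver; a minimal sketch using scikit-learn, where the rescaling of alpha aligns its objective with (6):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(x, D, lam=0.3):
    """Solve Eq. (6), min_z ||x - D z||^2 + lam * ||z||_1.
    sklearn's Lasso minimizes (1/2n)||x - Dz||^2 + alpha * ||z||_1,
    so alpha = lam / (2n) recovers the paper's objective."""
    n = D.shape[0]  # feature dimension d acts as n_samples here
    lasso = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=10000)
    lasso.fit(D, x)
    return lasso.coef_
```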

Because of the sparsity of the coefficients \(z_{i}\), only a few basis vectors are active in representing feature \(x_{i}\). Let \(Z=[z_{1},z_{2},\dots,z_{n}]\) be the sparse codes for the images in class c. We define the significance of each basis vector \(v_{j}\) by computing the sum of responses among these samples:

$$ s_{j}^{(c)} = \frac{{\sum}_{i=1}^{n} |z_{ij}|}{{\sum}_{k=1}^{K}{\sum}_{i=1}^{n} |z_{ik}|} $$
(7)

Each \(s_{j}^{(c)}\) indicates the significance of basis vector \(v_{j}\) to class c, n is the number of images in class c, K is the dictionary size, and \(z_{ij}\) is the j-th coefficient of the i-th sparse code in class c. The activated visual words in the sparse representation lie mainly in the same subspace as the low-level feature vectors of the same class. Hence, we force the nonzero coefficients to lie in a subset of the dictionary D, and ignore the other basis vectors with less significance. To this end, we set the weight of each basis vector for class c as

$$ s_{j}^{(c)} = \left\{ \begin{array}{ll} s_{j}^{(c)}, & s_{j}^{(c)} \geq T^{(c)} \\ 0, & s_{j}^{(c)} < T^{(c)} \end{array} \right. $$
(8)

where \(T^{(c)} = \beta \times {\sum}_{j} s_{j}^{(c)} / K\) is a threshold and β is empirically set to 0.3, which ensures that the most significant coefficients are kept. The basis vectors with non-zero weights form the class-specific dictionary for each class, denoted as \(D^{(c)}\). Then \(s^{(c)}\) is normalized into the range [0,1]: the more dominant a basis vector is, the larger its corresponding significance value in \(s^{(c)}\). We describe how to utilize the dominant visual words to effectively encode each local feature in Section 4.3.
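A compact sketch of (7)-(8) for a single class follows; the zeroing direction implements the stated intent of keeping the most significant basis vectors, and the max-normalization into [0,1] is one plausible reading:

```python
import numpy as np

def dominant_basis(Z, beta=0.3):
    """Significance of each basis vector for one class, Eqs. (7)-(8).
    Z: (n, K) matrix of sparse codes of the n images of the class."""
    s = np.abs(Z).sum(axis=0)
    s = s / s.sum()                 # Eq. (7): normalized response sums
    T = beta * s.sum() / len(s)     # threshold T^(c) = beta * sum_j s_j / K
    s[s < T] = 0.0                  # keep only the dominant vectors (Eq. 8)
    if s.max() > 0:
        s = s / s.max()             # normalize s^(c) into [0, 1]
    return s
```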

4.2 Neighbor searching

One problem in sparse coding based methods is that local features that are similar in the feature space may be quantized onto different visual words. To preserve their similarity, we capture the correlations between similar features and exploit the distribution of these similar features over the visual words to help encode each feature.

In this section, we introduce a graph-based method to find the similar features while maintaining both accuracy and efficiency. We then describe how to use the similar features to obtain the sparse code in the next section.

To find similar features, we utilize the minimum dominating set (MDS) [12], a graph model. Consider an undirected graph G(V,E), where V denotes the set of vertices and \(E \subseteq V \times V\) denotes the set of edges. In the graph, the vertices represent local features and the edge weights describe how similar two adjacent features are. The dissimilarity between two local features x and y is measured by the Euclidean distance \(d_{E}(x,y)=\|x-y\|_{2}\). During graph construction, edges whose weights are greater than a chosen threshold are discarded.

For a graph G(V,E), a vertex \(\alpha \in V\) is said to be covered by a set of vertices if either of two conditions is satisfied: (i) α is in the set, or (ii) α is adjacent (i.e. a neighbor) to a vertex in the set. A vertex subset \(S \subseteq V\) is a dominating set if S covers all the vertices in V. For a vertex \(\alpha \in V\), α and its adjacent vertices form a subgraph. Each subgraph contains one vertex of S and has high similarity between adjacent vertices, since dissimilar edges were discarded during graph construction. This graph is used to find the similar features (neighbors). To make the searching stage more efficient, the size of S should be as small as possible; we therefore use the minimum dominating set, i.e. the dominating set of minimum size.

Given a feature \(x_{i}\), it is compared with the vertices in S. The vertex with the highest similarity to \(x_{i}\) is selected, and the features corresponding to the vertices of its subgraph are taken as the neighbors of \(x_{i}\).

The minimum dominating set model is effective since the vertices within a specific subgraph have high similarity. To compute the minimum dominating set, we exploit a simple greedy algorithm to obtain an approximate solution [10] (a sketch is given below). For each class, constructing the graph model requires \(O(n^{2}m)\) operations, where n is the number of local features and m is the dimension of each feature. In addition, the time complexity of the approximate algorithm for obtaining the minimum dominating set is O(e), where e is the number of edges in G and \(e<n^{2}\). The searching operation requires \(O(m \log p)\), where p is the size of S. To balance the time complexity and the performance of our method, we select 1000 features, obtained through clustering, to construct the minimum dominating set.
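The greedy approximation referenced above can be sketched as follows, under the assumption that the thresholded Euclidean adjacency described earlier is used; all names are illustrative:

```python
import numpy as np

def greedy_mds(features, threshold):
    """Greedy approximation of the minimum dominating set, as in [10].
    An edge links two features whose Euclidean distance is below
    `threshold`; repeatedly pick the vertex covering the most
    still-uncovered vertices until every vertex is covered."""
    n = len(features)
    d = np.linalg.norm(features[:, None] - features[None, :], axis=2)
    adj = d < threshold                 # adjacency matrix (includes self)
    uncovered = np.ones(n, dtype=bool)
    mds = []
    while uncovered.any():
        gain = (adj & uncovered[None, :]).sum(axis=1)
        v = int(gain.argmax())          # vertex covering most uncovered
        mds.append(v)
        uncovered &= ~adj[v]            # mark v and its neighbors covered
    return mds
```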

In the rest of this paper, we refer to the set containing the neighbors as the neighbor set.

4.3 Formulation

In Section 4.1, we obtained the low-dimensional subspace for each class c, represented by a subset \(D^{(c)}\) of the dictionary containing \(K^{(c)}\) visual words, in which each visual word has a weight \(s_{j}^{(c)}\) denoting its significance. Computing the sparse code of a local feature of class c on the dictionary \(D^{(c)}\) leads to a class-specific sparse code. However, the similarity between local features may be lost, since the sparse coding approach may select diverse basis vectors for similar features, which reduces the performance of the sparse code. To preserve the similarity during the sparse coding phase, we use the neighbor set (see Section 4.2) in each class to help encode the feature.

Given a feature \(x_{i}\), suppose its corresponding neighbor set for class c is \(NS_{i}^{(c)}\). We compute the frequency distribution of the neighbor set \(NS_{i}^{(c)}\) over the dictionary \(D^{(c)}\) based on the Euclidean distance: each neighbor is mapped to its closest visual word in \(D^{(c)}\), and the frequency distribution over \(D^{(c)}\) is calculated as

$$ \epsilon_{ip}^{(c)}=\sum\limits_{j} f(v_{p}, x_{j}) $$
(9)

with

$$ f(v_{p}, x_{j}) = \left\{ \begin{array}{ll} 1, & \textrm{if \(x_{j}\) is closest to \(v_{p}\)} \\ 0, & \text{otherwise} \end{array} \right. $$
(10)

where \(v_{p}\) is a visual word in \(D^{(c)}\) and \(x_{j}\) is a feature in the neighbor set \(NS_{i}^{(c)}\). This formulation describes the relation between local features: if the neighbors of feature \(x_{i}\) map mostly to a few specific visual words, the given feature \(x_{i}\) will have high responses to these visual words (see Fig. 3).
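A minimal sketch of this hard-assignment counting (Eqs. (9)-(10)); array shapes and the function name are our own illustration:

```python
import numpy as np

def neighbor_distribution(NS, D_c):
    """Eqs. (9)-(10): frequency of the neighbor set over the class
    dictionary. NS: (p, d) neighbor features; D_c: (d, K) dictionary.
    Each neighbor votes for its single closest visual word."""
    dists = np.linalg.norm(NS[:, :, None] - D_c[None, :, :], axis=1)
    votes = dists.argmin(axis=1)                       # closest word per neighbor
    eps = np.bincount(votes, minlength=D_c.shape[1])   # histogram over words
    return eps.astype(float)
```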

Fig. 3 Left: traditional sparse coding; right: our method. Standard sparse coding may select different basis vectors to encode similar features. Our method encodes each feature together with its neighbor distribution over the basis vectors, which enables feature similarities to be preserved in the sparse representation

Coding with the class subspace and the distribution information over the basis vectors then transforms the standard sparse coding formulation into

$$ \begin{array}{l} \arg\min_{z_{i}} \|x_{i} - D^{(c)} z_{i}\|^{2} + \gamma \|z_{i}\|_{1} + \beta \|q_{i}^{(c)} \odot z_{i}\|^{2} \\ \textit{s.t.} \qquad 1^{\top} z_{i} = 1 \end{array} $$
(11)

The \(\ell_{1}\) norm regularization induces the sparsity of the representation. The coefficient vector \(q_{i}^{(c)} = 1 / (\epsilon_{i}^{(c)} \times s^{(c)})\), computed element-wise with ⊙ denoting the element-wise product, integrates the dominant basis vectors with the distribution information, where both \(\epsilon_{i}^{(c)}\) and \(s^{(c)}\) are normalized vectors. Equation (11) drives the coding coefficient vector \(z_{i}\) towards the minimization of the quantization loss and satisfies the following properties: (i) the coefficient \(z_{ij}\) is larger if a large portion of the neighbors map to the j-th basis vector, thus preserving similar responses on the basis vectors for similar features; (ii) similar features are encoded with similar basis vectors, since the neighboring local feature distribution enforces similar responses over the basis vectors. In this way, if two features are close in the feature space, they are likely to relate to similar visual words, resulting in similar sparse codes.

Recent studies [9, 40] suggest that locality in the reconstruction produces better feature coding performance. Thus we can also use the k most similar basis vectors to encode each feature. The locality guarantees sparsity, and the \(\ell_{1}\) term in (11) can thus be dropped. Only k basis vectors are used to reconstruct the feature, which also improves the computational efficiency. To compute the optimal solution of (11), we initialize the variables as \(z_{i} = D^{-1}x_{i}\) and then iteratively update \(z_{i}\) by coordinate descent. A closed-form sketch of the locality-restricted variant is shown below.
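The following is a minimal sketch of that locality-restricted variant, combining the k-nearest-basis restriction with the weighting \(q_{i}^{(c)}\) in a closed form analogous to LLC [40]; the small ridge term and all names are our own additions for numerical stability and illustration:

```python
import numpy as np

def sparse_neighbor_code(x, D_c, q_c, beta=0.3, k=5):
    """Locality-restricted version of Eq. (11): keep only the k most
    similar basis vectors, drop the l1 term (locality already induces
    sparsity), and solve the constrained least squares in closed form.
    D_c: (d, K) class dictionary; q_c: (K,) weights 1/(eps * s)."""
    dist = np.linalg.norm(D_c - x[:, None], axis=0)
    idx = np.argsort(dist)[:k]                   # k nearest basis vectors
    B = D_c[:, idx] - x[:, None]                 # shifted local base (d, k)
    G = B.T @ B + beta * np.diag(q_c[idx] ** 2)  # data term + ||q . z||^2
    G += 1e-8 * np.trace(G) * np.eye(k)          # tiny ridge for stability
    z_k = np.linalg.solve(G, np.ones(k))
    z_k /= z_k.sum()                             # enforce 1^T z = 1
    z = np.zeros(D_c.shape[1])
    z[idx] = z_k
    return z
```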

The process of the proposed sparse neighbor coding method is summarized in Algorithm 1:

Algorithm 1 Sparse neighbor coding

4.4 Inference

Given a new test image, we calculate its sparse representation for each class c (c = 1,…,C). Suppose one image region contains m local features; max pooling is employed to aggregate these features within the region. The region is represented as a vector of dictionary size K, in which the entry \(u_{j}^{(c)}\) is the maximum response to the j-th basis vector

$$ u_{j}^{(c)} = \max \{|x_{1j}|, |x_{2j}|, \dots, |x_{mj}|\} $$
(12)

To preserve spatial information, spatial pyramid matching [16] is also employed in our method. Both the spatial layout and more basic pattern responses are retained by dividing the whole image into multiple fine regions. We then apply a one-vs-rest SVM classifier to compute the probability P(C|u) that the test image belongs to each class. The classification label is assigned by finding the highest probability value

$$ c^{*} = \arg\max_{c\in C} P(C=c|u^{(c)}) $$
(13)
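For completeness, a minimal sketch of the pooling step in (12), with illustrative names:

```python
import numpy as np

def max_pool(codes):
    """Eq. (12): max pooling over the absolute sparse-code responses of
    the m local features that fall in one spatial region.
    codes: (m, K) array of sparse codes -> (K,) region descriptor."""
    return np.abs(codes).max(axis=0)

# For SPM, each pyramid cell (e.g. the 1x1 / 2x2 / 4x4 levels used in the
# experiments) is pooled separately and the results are concatenated.
```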

5 Experiments

In this section, we report experimental results on four widely used datasets: Scene 15 [8], UIUC 8-Sport [17], Caltech-101 [7] and PASCAL VOC 2007 [6]. Several state-of-the-art methods from the literature are used for comparison. ScSPM [45] is a sparse coding method that incorporates spatial pyramid matching. KSPM [16] performs spatial pyramid matching and SVM classification using the histogram intersection kernel. HIK+OCSVM [43] uses the histogram intersection kernel and a one-class SVM to quantize local features. LScSPM [9] is a Laplacian sparse coding approach based on spatial pyramid matching. LR-Sc+SPM [50] performs non-negative sparse coding along with max pooling and spatial pyramid matching. NBNN [19] is a nearest-neighbor approach in the local image feature space. LLC [40] is the locality-constrained linear coding method. LR-LGSC [51] investigates group generation for group sparse coding with Laplacian constraints. Zhang et al. [49] proposed an image representation based on structured low-rank. We compare our method with these state-of-the-art methods.

5.1 Parameters setting

The local feature descriptor is essential to image representation. In our work, we adopt the widely used 128-dimensional SIFT descriptor [21]. Dense SIFT features are extracted with a step size of 8 and a patch size of 16 × 16. All images are processed in gray scale. The extracted features are then normalized with the \(\ell_{2}\)-norm. For the Scene 15, UIUC 8-Sport and Caltech-101 datasets, we construct the SPM model with three levels, i.e., 1 × 1, 2 × 2 and 4 × 4, as described in [16]. For the PASCAL VOC 2007 dataset, we obtain the spatial regions by dividing the image into 1 × 1, 3 × 1 and 2 × 2 grids, following [4]. In the SPM construction, each layer is assigned the same weight. To train the codebook, we utilize the standard k-means clustering method, with the codebook size fixed to 1024. In the classification step, we use the one-vs-rest linear SVM [3] provided by Yang et al. [45] due to its advantage in speed and its good performance in max pooling based image classification. Following common benchmark procedures, we repeat the experiments with randomly selected training and testing samples, and report the average accuracy and the standard deviation.

In addition, several parameters need to be set in our method. The sparsity parameter λ of the sparse codes is fixed at 0.3, and the regularization parameter C of the linear SVM is set to 10.

5.2 Scene 15 Dataset

We evaluate our method for scene classification on the Scene 15 dataset, which contains 4485 images from 15 categories, with category size varying from 200 to 400. The image contents are diverse, containing not only indoor scenes, such as bedrooms and kitchens, but also outdoor scenes, such as buildings and villages. The average image size is 300 × 250 pixels. In the experiment, we resized the maximum side (length/width) of each image to 300 pixels with the aspect ratio unchanged. Figure 4 shows some sample images from this dataset. To compare with alternative methods in the literature, 100 images are randomly selected from each class as training data and the rest are used as testing data. The experimental results are listed in Table 2, compared against several alternative approaches. The confusion matrix for the Scene 15 dataset is shown in Fig. 5.

Fig. 4 Example images from the Scene 15 dataset

Fig. 5 Confusion matrix on Scene 15 classification (%). Each diagonal entry is the average classification rate for an individual class. The entry in the ith row and jth column is the percentage of images from class i misidentified as class j

Table 2 Performance comparison on scene 15 dataset (%)

Table 2 shows that the average accuracy of our method is 89.83 %, which outperforms five alternative methods and is close to the LR-Sc+SPM method. It should be noted, however, that LLC and LScSPM also use neighborhood data to help construct the sparse codes. The results validate the observation that by exploiting the relationship between the sparse code and class-specific information, the obtained sparse code is more powerful for image representation.

From Fig. 5, we observe that the proposed method works well on several scene categories, including suburb, coast, forest, highway, tallbuilding and office. However, the accuracies are relatively low for the industrial, kitchen, livingroom and store classes. The reason is that patches in these classes are visually similar to those of other classes, so it is hard to extract class-specific information for further analysis.

5.3 UIUC Sport Dataset

The UIUC 8-Sport dataset was introduced in [17] for image-based event classification. Its 8 categories are badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. There are 1579 images in total, and the category size ranges from 137 to 250. For this dataset, the maximum image side is resized to 400 pixels because the images have higher resolutions. Figure 6 shows some sample images from this dataset. In the experiment, we randomly select 70 images from each class as training data and use the rest as testing data.

Fig. 6 Example images from the UIUC-Sports dataset

Table 3 compares the proposed method with several other methods on the UIUC Sport dataset. The proposed sparse neighbor coding method achieves 87.13 %, which is 0.44 % higher than LR-Sc+SPM. The confusion matrix for this dataset is shown in Fig. 7.

Fig. 7 Confusion matrix on UIUC Sport classification (%)

Table 3 Performance comparison on UIUC 8-sport dataset (%)

5.4 Caltech-101 Dataset

The Caltech-101 dataset contains 102 classes with high intra-class appearance and shape variability. The number of images per category varies from 31 to 800, and most images are of medium resolution. In the experiment, the images are resized to be no larger than 300 × 300 with the aspect ratio preserved. All 102 classes are used. Figure 8 shows some sample images from this dataset. Following the standard experimental setting, we use 15 and 30 images per class for training and leave the rest for testing.

Fig. 8 Example images from the Caltech-101 dataset

Table 4 compares the proposed method with several alternative methods [2, 40, 43, 45, 49, 50] on the Caltech-101 dataset. Our method outperforms the listed algorithms, achieving 70.04 ± 0.42 with 15 training images per class and 76.96 ± 0.87 with 30 training images per class. These results validate the effectiveness of our method.

Table 4 Performance Comparison on the Caltech-101 dataset (%)

5.5 PASCAL VOC 2007 Dataset

This dataset consists of 10,000 images from 20 classes, with objects in a variety of scales, locations and viewpoints. Figure 9 shows some sample images from this dataset. In the experiments, 5011 images are used for training and 4952 images for testing by random splitting. The performance measure is the mean average precision (mAP), a standard metric used by the PASCAL challenge, which computes the area under the precision/recall curve; higher scores reflect better performance.

Fig. 9 Example images from the PASCAL VOC 2007 dataset

In Table 5, we list the mAP scores of the different methods for all 20 categories. Our method achieves performance superior to the alternative methods on 5 classes: bicycle (68.6 %), car (80.3 %), cow (50.1 %), person (86.2 %) and tv (57.7 %). The Fisher kernel obtains the best mAP among the methods with dictionary size 256, because it encodes additional information on the distribution of the descriptors. Our method is only 0.5 percent lower than the Fisher kernel method and shows significant improvement over the other methods. This result demonstrates the effectiveness of the proposed method.

Table 5 Comparison of image classification performance in terms of test accuracy on the PASCAL VOC 2007 dataset

5.6 Time analysis of feature coding

Table 6 shows the feature coding time on the four datasets during the testing phase. The numbers of testing images in the four datasets are 480, 1500, 3030 and 4953, respectively. The LLC method takes the least time to code the testing images. Following the usual setting, we set the number of neighbors k to 5; the time cost of the LLC method mostly depends on the kNN search. In ScSPM, we choose 200 neighbors for each feature to obtain the sparse code. It costs more time than LLC, but obtains better classification on some datasets. The overall coding times of ScSPM and LScSPM are almost the same. The time cost of our method is greater than that of LLC and nearly the same as those of LScSPM and ScSPM.

Table 6 Time cost on the four datasets in the feature coding phase (min)

5.7 Influence of codebook size

In our experiments, we test the classification accuracy on three datasets with different codebook sizes, which can considerably influence classification results [13]. The performance is illustrated in Fig. 10. The overall tendency is that performance increases with codebook size, and the curves grow faster when the codebook size is small. This is because small codebooks cannot represent the various patches of the images in the dataset.

Fig. 10
figure 10

Classification performance for different codebook sizes (%)

5.8 Influence of individual components

In this subsection, the importance of each component is tested, with results shown in Table 7. The proposed sparse neighbor coding performs better than the LLC method, with improvements of 3.03 %, 2.99 %, 1.42 % and 0.5 %, respectively. Moreover, using the discriminative feature selection and visual word selection strategies boosts the performance compared with the basic sparse neighbor coding method. It is therefore evident that these two modules are effective and lead to better sparse codes, and the best results are obtained by combining all three modules.

Table 7 Classification performance when combining different components

6 Conclusion and future work

The neighbor information in the feature space is of great importance for image representation. To exploit it, we have presented a sparse neighbor coding method. We have developed two modules, which keep the discriminative features in each class and eliminate the non-informative visual words, to boost the discrimination of the resulting sparse code. Based on the observation that feature vectors from a certain class should be better represented by basis vectors in the subspace of that class, we have selected the dominant basis vectors for each class. We have also demonstrated that by incorporating the frequency distribution of similar features over the basis vectors, the relationship between local features can be retained during sparse coding. Experiments on four datasets have validated the effectiveness of our method.

In future work, we will explore more relational information between the features to be encoded. Furthermore, we will investigate manifold structural information, which has proved to be an effective approach to characterizing the structure of descriptors.