1 Introduction

The Bag-of-Words (BOW) model is widely used in visual object categorization [6] and image retrieval [8], because it can represent multiple classes of objects in a unified framework. The core idea of the BOW model is that an image is represented by the visual words of a visual dictionary. In detail, a keypoint detector first finds interest regions, each of which is described by a keypoint descriptor. After that, the local features are quantized and an image is represented by a histogram of word frequencies. In the remainder of this paper, we refer to both keypoint detectors and descriptors as “local features”. We focus on two implementation choices that greatly affect the performance of an image classification method: the local feature and the classifier. We evaluate several popular keypoint detectors, keypoint descriptors and classifiers to determine which local feature is more distinctive while remaining robust for image classification, and which classifier achieves better classification performance. We want to answer the question of how the performance of a recognition method based on the BOW model is affected by the choice of both local features and classifiers.

Many studies have discussed the influence of the components of the BOW model on the accuracy of visual object categorization. The sampling strategy is the first key issue in the BOW model. A common strategy is to use a sophisticated multiscale keypoint operator, such as DOG [16], Harris-Laplacian [18], Harris-Affine [18], Hessian-Laplacian [18], Hessian-Affine [18] or MSER [17]. Lazebnik et al. [11] combined the Harris-Laplace detector with the Laplacian detector to sample interest regions: the Harris-Laplace detector detects key points and the Laplacian detector detects blobs, which are complementary to some degree. Eric et al. [21] discussed the sampling strategy and demonstrated that the number of sampled patches is important in visual categorization: the more patches, the better the classification performance.

A keypoint descriptor plays the second important role in the BOW model. There are many popular keypoint descriptors, such as SIFT [16], SURF [1], GLOH [19], GIH [14], shape context [2] and the steerable filter [9]. The BOW model quantizes these local features and represents an image by a histogram of word frequencies. However, the BOW model neglects the spatial information among the local features, so many researchers have studied how to exploit spatial information within the BOW model, especially in the field of image retrieval. Wu et al. [27] proposed bundled features, which use the geometric rank of local features to represent spatial information. Cao et al. [4] proposed a spatial BOW method to improve retrieval results. Zhang et al. [28] proposed to model the spatial context of a group of local features to mitigate the instability and noise of any single local feature, which is superior to bundled features in terms of retrieval precision. However, these methods are more applicable to image retrieval than to image classification, because they only use the spatial information and local features to measure the similarity between pairs of images; they do not represent an image by the feature vector required by a classifier. Therefore, we discuss how to introduce spatial information into the BOW model for object classification.

The design of the classifier is the third important factor in the BOW model. Lazebnik et al. [12] used the maximum entropy classifier for visual object categorization. Moosmann et al. [20] designed random clustering forests for visual object categorization. Liu et al. [15] designed a classifier based on the boosting algorithm to select category-specific words composed of “visual bits”.

The codebook construction is the fourth important factor. Csurka et al. [6] grouped the local features of the training images by the K-means algorithm, where the cluster centers were regarded as the visual words. Winn et al. [26] clustered the responses of a filter bank, and a compact visual dictionary was learned by pair-wise merging. Farquhar et al. [7] used a Gaussian Mixture Model to model the density function of the key points for each image class. Perronnin [22] built adaptive vocabularies which combine universal vocabularies with class-specific vocabularies. Larlus et al. [10] proposed the Gaussian Mixture-Multimodal LDA [3] model to generate the visual codes. Moreover, the construction of the visual dictionary is often coupled with the classifier, and the design of a classifier is based on the visual word representation [15, 20].

There are two works [19, 29] closely related to ours. Mikolajczyk et al. [19] evaluated the influence of local descriptors in image matching and object (or scene) recognition, and showed that GLOH and SIFT are superior to other keypoint descriptors. However, they evaluated the keypoint descriptors only for image matching, not for image classification. It is unclear whether a descriptor achieves the same performance in image classification as in image matching. Furthermore, we want to know whether a keypoint descriptor designed for image matching under some constrained conditions is also discriminative for image classification. For example, GIH [14] is a deformation invariant descriptor for image matching, and we want to know how it performs in image classification. Zhang et al. [29] carried out a comprehensive study on local features for texture and object classification. Based on extensive experiments, they pointed out that the descriptor combining SIFT with SPIN and RIFT was the most discriminative for image classification. Moreover, they mainly analyzed the performance of SVM with different kernels and designed an SVM classifier with the Earth Mover’s Distance (EMD) kernel. However, there are three limitations in Zhang’s work: 1) they only evaluated the performance of SVM classifiers with different kernels, without comparing them with other types of classifiers; 2) they neglected the MSER detector [17], which is widely used in computer vision; 3) the spatial information of the local features was not considered. In this paper, we evaluate the performance of various keypoint detectors, keypoint descriptors and classifiers, and discuss how these components affect classification performance in the BOW framework. In detail, we evaluate seven different types of classifiers: the SVM classifier with the Radial Basis Function (RBF) kernel, the SVM classifier with the χ2 kernel, the SVM classifier with the EMD kernel, the Adaboost classifier, the random forest classifier, the maximum entropy classifier and the Naïve Bayes classifier. Furthermore, we propose the EMD spatial kernel, which encodes the spatial information.

This paper extends our previous work [5] by proposing a novel spatial kernel to improve the classification performance, adding additional empirical results and making an in-depth analysis of our approach’s performance. The rest of this paper is organized as follows. In section 2, we evaluate the performance of the keypoint detectors, the keypoint descriptors and the classifiers. We then propose the EMD spatial kernel in section 3. In section 4, we evaluate the EMD spatial kernel for object classification on the Xerox7 dataset, the 4-class VOC2006 dataset and the 4-class Caltech101 dataset. In section 5, we draw conclusions.

2 Empirical evaluations

2.1 Experimental setup

We use two types of datasets: the Xerox7 dataset, an object category dataset, and the UIUCTex dataset, a texture dataset. These two standard datasets are popular and have been widely used to evaluate visual object categorization. The Xerox7 dataset contains 7 object classes: faces (792), bikes (125), buildings (150), cars (201), trees (150), books (142) and phones (216). It is a challenging dataset because the images are captured in the real world from different viewpoints, which causes large appearance variation within the same object class. The UIUCTex dataset contains 25 texture classes and each class contains 40 images. Figure 1 shows some examples from the two datasets.

Fig. 1 Some examples of each class in the two datasets: a) Xerox7. b) UIUCTex

To evaluate the local features and the classifiers, we use the classification accuracy as the criterion. For the histogram representation, we cluster the local features with the K-means algorithm to obtain a dictionary of 1,000 visual words, so an image is represented as a 1,000-dimensional histogram. Moreover, each image class is split randomly into two separate sets: one for training and the other for testing. In the evaluation, we run 5 random trials and report the averaged results. To handle multiple classes, we take the one-against-the-rest strategy for the binary classifiers, such as SVM and Adaboost. A minimal sketch of the histogram construction is given below.
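
The following sketch shows how such a 1,000-word dictionary and word-frequency histogram can be computed with K-means; the function names and the use of scikit-learn are illustrative assumptions, not the original implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, n_words=1000, seed=0):
    """Cluster all training descriptors into a visual dictionary (K-means centers)."""
    stacked = np.vstack(train_descriptors)            # (total_keypoints, descriptor_dim)
    return KMeans(n_clusters=n_words, random_state=seed, n_init=4).fit(stacked)

def bow_histogram(image_descriptors, vocabulary):
    """Represent one image as a normalized histogram of visual-word frequencies."""
    words = vocabulary.predict(image_descriptors)     # nearest visual word per descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                # 1,000-dimensional word-frequency vector
```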

2.2 Empirical evaluation of keypoint detectors

We evaluate six popular keypoint detectors in terms of classification accuracy on the Xerox7 dataset and the UIUCTex dataset. The competing detectors are the Difference of Gaussian (DOG) [16], Harris-Affine (HarA) [18], Harris-Laplacian (HarL) [18], Hessian-Affine (HesA) [18], Hessian-Laplacian (HesL) [18] and MSER [17] detectors. DOG, HarL and HesL detect small circular blobs, while HarA, HesA and MSER detect elliptical regions. Figure 2 shows some results obtained by the six detectors. In the experiments, we use SIFT as the local feature descriptor and SVM with the RBF kernel as the classifier. The quantitative results in Tables 1 and 2 show that the Hessian-Laplacian detector performs best among the competing detectors and that the MSER detector can find discriminative regions. The averaged number of interest points detected in the experiments is given in Table 3. The regions detected by MSER are larger than those detected by the other detectors, so the number of keypoints detected by MSER is the smallest. Considering the trade-off between representation ability and computational complexity, we combine the Hessian-Laplacian detector with the MSER detector, which we call the HesLM detector: we use both detectors to detect key points, so the image is sampled and described densely. Figure 5a and b illustrate this dense sampling and description, and a minimal sketch of such a combined detector is given below. As shown in the last column of Tables 1 and 2, the HesLM detector achieves a promising classification accuracy on both datasets. The classification accuracy obtained by HesLM is similar to those obtained by Hessian-Laplacian and Hessian-Affine, while HesLM requires much less running time than HesA. Moreover, HesLM achieves the highest classification accuracy in four object classes of the Xerox7 dataset and in 11 texture classes of the UIUCTex dataset.
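
Below is a minimal sketch, in the spirit of the HesLM detector, of combining MSER regions with a second blob detector and then computing SIFT descriptors. The standard OpenCV distribution does not ship a Hessian-Laplacian detector, so SIFT's DoG keypoints stand in for the blob detections purely for illustration; the paper itself uses the Hessian-Laplacian detector.

```python
import cv2

def detect_hesLM_like(gray):
    """Combined-detector sketch: MSER regions plus a blob detector.
    DoG keypoints (via SIFT) are only a stand-in for Hessian-Laplacian blobs."""
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    mser_kps = []
    for r in regions:                                  # r: (n, 2) array of region pixel coordinates
        x, y = r[:, 0].astype(float), r[:, 1].astype(float)
        size = max(x.max() - x.min(), y.max() - y.min(), 1.0)
        mser_kps.append(cv2.KeyPoint(float(x.mean()), float(y.mean()), float(size)))
    blob_kps = cv2.SIFT_create().detect(gray, None)    # stand-in for Hessian-Laplacian blobs
    return mser_kps + list(blob_kps)

def describe_sift(gray, keypoints):
    """128-dimensional SIFT descriptors for the combined keypoint set."""
    _, descriptors = cv2.SIFT_create().compute(gray, keypoints)
    return descriptors
```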

Fig. 2 The interest points detected by the six detectors: a) DOG. b) Harris-Affine. c) Hessian-Affine. d) MSER. e) Harris-Laplacian. f) Hessian-Laplacian

Table 1 The evaluation of the seven detectors in terms of classification accuracy on the Xerox7 dataset
Table 2 The evaluation of the seven detectors in terms of classification accuracy on the UIUCTex dataset
Table 3 The mean number of interest points sampled by the different detectors on the Xerox7 dataset and the UIUCTex dataset

2.3 The evaluation of keypoint descriptors

We evaluate seven state-of-the-art local feature descriptors in terms of classification accuracy: SIFT [16], GLOH [19], SURF [1], Shape Context (SHAC) [2], the steerable filter (STEF) [9], SPIN [11] and GIH [14].

  • SIFT is a very popular descriptor based on histograms of gradient orientation. An interest region is divided into a 4-by-4 grid; in each cell, the gradient angle is quantized into 8 orientations, so a 128-dimensional feature vector is formed.

  • GLOH is an extension of SIFT. It computes a histogram of gradient orientation for a log-polar location grid.

  • SURF is similar to SIFT in its description. It is faster than SIFT because it uses Haar-like box filters instead of the Gaussian function to compute the interest regions. For each cell it computes a four-dimensional feature vector: the sum of gradients in the x direction, the sum of gradients in the y direction, the sum of absolute gradient values in the x direction, and the sum of absolute gradient values in the y direction. Each interest region is divided into a 4-by-4 grid, so a 64-dimensional vector is formed.

  • Shape context is widely used to describe the shape of an object. It computes a histogram of edge points over a polar location grid, and each bin is the number of edge points falling in the corresponding cell.

  • The steerable filter uses the derivatives of a patch up to 4th order, computed by convolution with Gaussian derivatives. The changes of the derivatives with orientation give the elements of the feature vector.

  • SPIN is a two-dimensional histogram: the intensity is quantized into 10 bins and the radius into 5 bins. Each row of SPIN is a normalized histogram over a homocentric circle region (a minimal sketch is given after this list).

  • GIH is a deformation invariant descriptor. In GIH, an image is regarded as a two-dimensional surface embedded in 3D space, and the geodesic distance is used instead of the Euclidean distance. The descriptor is a 2D joint distribution of geodesic distance and intensity.
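
As an illustration of the simplest of these descriptors, the following is a minimal sketch of SPIN as described above (5 radius bins by 10 intensity bins, with each ring's histogram normalized). The function name and binning details are assumptions, not the reference implementation.

```python
import numpy as np

def spin_descriptor(patch, n_radius=5, n_intensity=10):
    """SPIN sketch: a 2D histogram over (radius, intensity) for a grayscale patch,
    with each radius row normalized over its homocentric ring."""
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    radius = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    r_bins = np.minimum((radius / (radius.max() + 1e-9) * n_radius).astype(int), n_radius - 1)
    i_bins = np.minimum((patch.astype(float) / 256.0 * n_intensity).astype(int), n_intensity - 1)
    desc = np.zeros((n_radius, n_intensity))
    for rb, ib in zip(r_bins.ravel(), i_bins.ravel()):
        desc[rb, ib] += 1
    desc /= np.maximum(desc.sum(axis=1, keepdims=True), 1)   # normalize each ring's histogram
    return desc.ravel()                                        # 50-dimensional vector
```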

In this test, we use MSER to detect the interest regions and SVM with the RBF kernel to classify the images. The results in Tables 4 and 5 demonstrate that SIFT is superior to the other descriptors in terms of classification accuracy, whereas GLOH and shape context are inferior to SIFT for object classification. However, SIFT, GLOH, SURF, shape context and SPIN achieve similar classification accuracies for texture classification.

Table 4 The evaluation of the seven descriptors on the Xerox7 dataset
Table 5 The evaluation of the eight descriptors on the UIUCTex dataset

Furthermore, we compare the local features with textons on the UIUCTex dataset, in order to address whether local features are superior to textons for texture recognition. We represent a texture image by textons following Varma’s work [24, 25]. In detail, we use the MR8 filter bank [24] to filter the images, then cluster the filter responses to obtain the textons. In our experiments, 10 cluster centers are computed by the K-means method for each class, and the cluster centers of all classes form the texton set. Finally, a texture image is represented by the distribution of textons, and the SVM classifier with the χ2 kernel is adopted to classify the texture classes. The last column of Table 5 reports the results based on textons. The classification accuracies based on the textons generated by MR8 are inferior to those based on the local features, except for GIH. In other words, image patch features are superior to filter response features, which is consistent with Varma’s conclusion [25].

2.4 Evaluation of the classifiers

We evaluate seven classifiers in this section: SVM with the RBF kernel, SVM with the χ2 kernel, SVM with the EMD kernel, Adaboost, the random forest classifier, the maximum entropy classifier and the Naïve Bayes classifier. The first five are discriminative models. SVM is one of the most efficient tools for binary classification, and the choice of kernel greatly affects the classification performance. In this paper, we focus on kernels based on the exponential function \( k(x, y) = \exp\left( -\frac{D(x, y)}{A} \right) \), where D(x, y) is a distance between the vectors x and y. We compare the following three kernels (a minimal sketch of the RBF and χ2 kernels follows the list; the EMD computation is sketched later in this section).

  • The RBF kernel uses the squared Euclidean distance between two vectors, \( D(x, y) = \left\| x - y \right\|^2 \).

  • The χ2 kernel defines D(x, y) as the χ2 distance, \( D(x, y) = \frac{1}{2}\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i} \).

  • The EMD kernel defines D(x, y) as the EMD distance, \( D(x, y) = \min_{\{f_{ij}\}} \frac{\sum_{i,j} f_{ij} d_{ij}}{\sum_{i,j} f_{ij}} \).
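
A minimal sketch of the first two kernels, computed directly from word-frequency histograms; the function names and the bandwidth parameter A are illustrative. The EMD kernel additionally requires solving the transport problem of Eq. (3), which is sketched later in this section.

```python
import numpy as np

def rbf_kernel(X, Y, A):
    """k(x, y) = exp(-||x - y||^2 / A) for all row pairs of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / A)

def chi2_kernel(X, Y, A, eps=1e-10):
    """k(x, y) = exp(-chi2(x, y) / A), with chi2(x, y) = 0.5 * sum_i (x_i - y_i)^2 / (x_i + y_i)."""
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + eps       # eps guards against empty bins
    d = 0.5 * (num / den).sum(-1)
    return np.exp(-d / A)
```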

The Adaboost algorithm is famous for its successful application to face detection. The classifier is a linear combination of weak classifiers and can be written as \( h(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) \), where \( h_t(x) \) is a weak classifier and \( \alpha_t > 0 \). The random forest classifier is an ensemble of decision trees and is inherently a multi-class classifier. The object label is predicted either by voting among the decision trees or by averaging their confidence values; in our case, we take the latter strategy to determine the label.

Among the classifiers mentioned above, the last two are generative models. The maximum entropy algorithm learns a classifier by maximizing the conditional entropy, which can be written as:

$$ \{\lambda_k\} = \arg\max \left( -\frac{1}{|T|} \sum_{I \in T} \sum_{c} P(c|I) \log P(c|I) \right), $$
(1)

where |T| is the number of images in the training set T, c ranges over the object classes, and P(c|I) is the posterior probability, \( P(c|I) = \frac{1}{Z} \exp\left( \sum_k \lambda_k f_k(I, c) \right) \).

The Naïve Bayes classifier can be viewed as a maximum a posteriori classifier, and the class label is predicted as follows:

$$ c(x) = \mathop{\arg\max}\limits_{c \in C} P(c) \prod_{i=1}^{n} P(w_i | c), $$
(2)

where P(c) is the prior probability and P(w_i|c) is the conditional probability of the ith word given the cth object class, \( P(w_i|c) = \frac{1 + \sum_{I_j \in c} N(i, j)}{|V| + \sum_{s=1}^{|V|} \sum_{I_j \in c} N(s, j)} \), where N(t, j) is the co-occurrence matrix of words and images, i.e., the number of occurrences of word t in image I_j.
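
The following sketch implements the Naïve Bayes classifier of Eq. (2) in log space, using the add-one smoothing above; the array layout (an image-by-word count matrix) and the function names are assumptions.

```python
import numpy as np

def train_naive_bayes(counts, labels, n_classes):
    """counts[j, t] = N(t, j), the count of word t in image j.
    Returns log P(c) and log P(w_t | c) with the add-one smoothing of Eq. (2)."""
    counts, labels = np.asarray(counts, dtype=float), np.asarray(labels)
    n_words = counts.shape[1]                                  # |V|
    log_prior = np.log(np.bincount(labels, minlength=n_classes) / len(labels))
    log_cond = np.zeros((n_classes, n_words))
    for c in range(n_classes):
        word_totals = counts[labels == c].sum(axis=0)          # word counts summed over class c
        log_cond[c] = np.log((1.0 + word_totals) / (n_words + word_totals.sum()))
    return log_prior, log_cond

def predict_naive_bayes(counts, log_prior, log_cond):
    """argmax_c [ log P(c) + sum_t N(t, x) log P(w_t | c) ], the log form of Eq. (2)."""
    return np.argmax(log_prior[None, :] + np.asarray(counts) @ log_cond.T, axis=1)
```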

In the experiments, we represent an image in two ways: the histogram distribution and the signature representation. For the histogram distribution, considering the computational complexity, we randomly sample 200 interest points along the edges detected by the Canny operator instead of using sparse (interest-point) sampling, and thus obtain 200 image patches. We take SIFT as the descriptor of the sampled patches, use the K-means algorithm to construct a codebook of 1,000 words, and represent each image by its histogram of word frequencies. Finally, we classify the images with the different classifiers, except the SVM classifier with the EMD kernel, because it requires images to be represented by signatures rather than by the histogram of word frequencies.

For the signature representation, there are two sampling methods: one uses the HesLM detector; the other uses the Harris-Laplacian detector combined with the Laplacian detector (HarLL) [11]. We use SIFT to describe the interest regions. After that, we cluster the SIFT feature vectors of each image into 40 cluster centers, which are regarded as the signatures of the image. The ith image is denoted by \( \{ (s_{i1}, w_{i1}), (s_{i2}, w_{i2}), \cdots, (s_{il}, w_{il}) \} \), where \( s_{ik} \) is the kth cluster center of the ith image, l is equal to 40, and \( w_{ik} \) is the corresponding weight, i.e., the frequency of the kth cluster center in the ith image. Given two images with signatures S_i and S_j, where

$$ S_i = \left\{ (s_{i1}, w_{i1}), (s_{i2}, w_{i2}), \cdots, (s_{il}, w_{il}) \right\}, \quad S_j = \left\{ (s_{j1}, w_{j1}), (s_{j2}, w_{j2}), \cdots, (s_{jl}, w_{jl}) \right\}, $$

we compute their distance by solving the optimization problem that minimizes a cross-bin dissimilarity measure between S_i and S_j:

$$ D_{emd}(S_i, S_j) = \frac{\sum_{m=1}^{l} \sum_{n=1}^{l} f_{mn} \, d(s_{im}, s_{jn})}{\sum_{m=1}^{l} \sum_{n=1}^{l} f_{mn}}, \quad \text{s.t.} \;
\begin{cases}
f_{mn} \geq 0, & 1 \leq m, n \leq l \\
\sum_{n=1}^{l} f_{mn} \leq w_{im}, & 1 \leq m \leq l \\
\sum_{m=1}^{l} f_{mn} \leq w_{jn}, & 1 \leq n \leq l
\end{cases} $$
(3)

where \( d(s_{im}, s_{jn}) \) is the Euclidean distance between the mth signature of the ith image and the nth signature of the jth image, and \( f_{mn} \) is the flow value obtained by solving the linear programming problem; more details about the algorithm are given in [13] and [23]. A minimal sketch of this computation is given below. After that, we use the SVM classifier with the EMD kernel to classify the objects. We run the methods mentioned above on the Xerox7 dataset and the UIUCTex dataset; the classification accuracies are shown in Tables 6 and 7. We find that the SVM classifier with the EMD kernel achieves better accuracy than the other classifiers, and the EMD kernel based on the HesLM detector is superior to that based on the HarLL detector. Moreover, the HesLM signatures achieve the highest classification accuracy in three classes of the Xerox7 dataset and in 22 classes of the UIUCTex dataset.
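
A minimal sketch of the signature distance and the resulting EMD kernel matrix. It assumes the POT package (`ot.emd2`) for the transport problem; since the signature weights are normalized to sum to one here, the balanced transport cost coincides with the normalized EMD of Eq. (3). The paper's own solver follows [13, 23], so this is only an illustrative substitute.

```python
import numpy as np
import ot                                      # POT: Python Optimal Transport
from scipy.spatial.distance import cdist

def emd_signature_distance(centers_i, weights_i, centers_j, weights_j):
    """D_emd of Eq. (3) for two signature sets; weights are normalized to total mass one."""
    a = np.asarray(weights_i, dtype=float); a /= a.sum()
    b = np.asarray(weights_j, dtype=float); b /= b.sum()
    M = cdist(centers_i, centers_j)            # ground distances d(s_im, s_jn)
    return ot.emd2(a, b, M)                    # optimal transport cost

def emd_kernel_matrix(signatures, A):
    """Gram matrix K_emd(I_i, I_j) = exp(-D_emd / A) over a list of (centers, weights) pairs."""
    n = len(signatures)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            d = emd_signature_distance(*signatures[i], *signatures[j])
            K[i, j] = K[j, i] = np.exp(-d / A)
    return K
```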

Table 6 The evaluation of the seven classifiers on the Xerox7 dataset
Table 7 The evaluation of the seven classifiers on the UIUCTex dataset

Furthermore, we evaluate how the sampling method affects the classification performance. We compare the previously mentioned sparse sampling methods with the random sampling method (sketched below); the experimental data come from Tables 1 and 6 for object classification and from Tables 2 and 7 for texture classification. Figure 3 shows the classification accuracies obtained by the six sampling methods on each object class of the Xerox7 dataset, and Fig. 4 shows the averaged classification accuracies over all texture classes for the eight sampling methods. As shown in Figs. 3 and 4, random sampling is inferior to the sparse sampling methods in terms of classification accuracy. This is because, for simplicity, we randomly sample only 200 points along the edges, which is fewer than the number of points generated by the sparse sampling methods. The experimental results demonstrate that the number of sampled points greatly affects the classification performance.
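
A minimal sketch of the random sampling baseline described above: 200 points drawn at random from Canny edge pixels and turned into fixed-size keypoints for SIFT description. The Canny thresholds and patch size are illustrative assumptions.

```python
import cv2
import numpy as np

def random_edge_samples(gray, n_points=200, patch_size=16, seed=0):
    """Pick n_points pixels on Canny edges as patch centers (thresholds are illustrative)."""
    edges = cv2.Canny(gray, 100, 200)
    ys, xs = np.nonzero(edges)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(ys), size=min(n_points, len(ys)), replace=False)
    # Represent each sample as a fixed-size keypoint so SIFT can describe it.
    return [cv2.KeyPoint(float(xs[i]), float(ys[i]), float(patch_size)) for i in idx]
```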

Fig. 3 The comparison of the sampling methods in terms of classification accuracy on the Xerox7 dataset

Fig. 4 The comparison of the sampling methods in terms of averaged classification accuracy over the 25 texture classes of the UIUCTex dataset

3 The EMD spatial kernel

From the experiments in section 2, we find that 1) the combined detector HesLM can detect discriminative regions; 2) SIFT is superior to the other local feature descriptors for image representation; 3) SVM with the EMD kernel achieves better classification accuracy than the other classifiers. The experimental results also demonstrate that the BOW model provides a unified framework for multi-class image classification; however, it neglects the spatial information of the local features. The MSER regions and the Hessian-Laplacian blobs are complementary to some degree: the MSER detector detects interest regions that are larger than the blobs detected by the Hessian-Laplacian detector, so an MSER region may contain several Hessian-Laplacian blobs, which forms a combinational pattern and embeds a spatial relationship. In this section, we propose an EMD spatial kernel to encode the spatial information of the local features and further boost the classification performance.

Firstly, we use the HesLM detector to sample image patches, as shown in Fig. 5a. The red circle represents a region detected by MSER, and the green circles represent regions detected by the Hessian-Laplacian detector.

Fig. 5 The image representation of signatures based on the HesLM detector: a) The HesLM detector. b) The SIFT descriptor. c) The signatures

Secondly, we use the SIFT descriptor to describe the interest regions, as shown in Fig. 5b. Unlike the methods [27, 28], which are used in image retrieval and measure similarity by a voting strategy, we need to compute a kernel matrix that records the similarity between each pair of images in the training dataset. We propose to encode the spatial similarity in this kernel matrix. The element of the kernel matrix between the ith image and the jth image is defined as

$$ K(I_i, I_j) = K_{emd}(I_i, I_j) + K_{space}(I_i, I_j). $$
(4)

The kernel matrix has two parts. The first part is the EMD kernel matrix described above: we compute 40 signatures for each image, as shown in Fig. 5c, and use these signatures to compute the EMD kernel matrix. The second part encodes the spatial information. We construct a codebook of 200 visual words with the K-means algorithm for the MSER detector and for the Hessian-Laplacian detector respectively, and measure the spatial similarity between each pair of images as illustrated in Fig. 6. Each MSER region in an image is split into four quadrants according to its major and minor axes, and the major orientation of the gradients in the MSER region is assigned to the first quadrant, so the quadrant indexing is orientation-normalized. We first match the MSER regions between the two images: two MSER regions are matched if they are represented by the same visual word. We then count the matched Hessian-Laplacian local features within the matched MSER regions: two Hessian-Laplacian local features are matched if they are represented by the same visual word and located in the same quadrant of the MSER region. The similarity is defined as:

Fig. 6 The spatial matching of the visual words

$$ s_i = tfidf(f_i) \cdot space(f_i), $$
(5)

where tfidf(f_i) is the tf-idf weight of the MSER visual word f_i, computed from its term frequency and document frequency, and space(f_i) is the number of matched Hessian-Laplacian local features inside the MSER regions whose visual words are consistent. A minimal sketch of this spatial term is given below.
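
A minimal sketch of the spatial term of Eq. (5). The data layout, in which each image is a list of (MSER word, contained Hessian-Laplacian (word, quadrant) pairs), and the function names are assumptions made for illustration.

```python
from collections import Counter

def spatial_similarity(regions_i, regions_j, tfidf):
    """K_space sketch: regions_* is a list of (mser_word, [(hl_word, quadrant), ...])
    per image; tfidf maps an MSER visual word to its tf-idf weight."""
    by_word_j = {}
    for word, feats in regions_j:
        by_word_j.setdefault(word, []).append(Counter(feats))
    score = 0.0
    for word, feats in regions_i:
        if word not in by_word_j:
            continue                                   # MSER regions match only on the same visual word
        feats_i = Counter(feats)
        for feats_j in by_word_j[word]:
            # HL features match when both the visual word and the quadrant agree
            matched = sum(min(c, feats_j[k]) for k, c in feats_i.items())
            score += tfidf.get(word, 0.0) * matched    # s_i = tfidf(f_i) * space(f_i)
    return score
```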

Finally, we use the SVM classifier with the EMD spatial kernel to classify the images. Given a query image, we perform the same operations to obtain its signatures and spatial information, and then classify it with the SVM classifier trained on the training images. The framework of our approach is shown in Fig. 7.

Fig. 7 The framework of the proposed approach

4 Experimental results

We test our approach on the Xerox7 dataset, the 4-class VOC2006 dataset and the 4-class Caltech101 dataset. The 4-class VOC2006 dataset contains four image sets: bicycles (270), cars (553), motorbikes (235) and persons (666). The 4-class Caltech101 dataset contains four image sets: faces (435), Buddhas (85), grand pianos (99) and sunflowers (85). Each class is randomly split into two separate sets of images of the same size: one for training and the other for testing. We first compare different image representations, considering three detectors and two representation features. The three detectors are the Hessian-Laplacian detector (HesL), the MSER detector and the HesLM detector; the two representation features are the histogram distribution (HD) and signatures (Sig). This gives five combinations to represent an image:

  1. HesL + HD: Hessian-Laplacian detector + histogram distribution;
  2. MSER + HD: MSER detector + histogram distribution;
  3. HesLM + HD: Hessian-Laplacian and MSER detectors + histogram distribution;
  4. HesLM + Sig: Hessian-Laplacian and MSER detectors + signatures;
  5. HesLM + Sig + space: Hessian-Laplacian and MSER detectors + signatures + spatial information.

For the first three combinations, we construct a dictionary of 1,000 visual words, so an image is represented as a 1,000-dimensional feature vector, and we use the SVM classifier with an RBF kernel to classify the objects. For the last two combinations, we generate 40 signatures per image. For the fourth combination, we use the SVM classifier with an EMD kernel. To compute the spatial information, we construct a dictionary of 200 visual words for the MSER detector and for the Hessian-Laplacian detector respectively, and use the EMD spatial kernel to classify the objects.

In Fig. 8, we compare the classification accuracy of the five representation methods; the last group of bars represents the averaged results of the five methods on the dataset. The experimental results demonstrate that our approach achieves the highest classification accuracy and that the proposed EMD spatial kernel is superior to the single EMD kernel for object categorization. Therefore, the proposed EMD spatial kernel improves the overall classification performance for object categorization on average.

Fig. 8 The performance comparison of the five representation methods on the Xerox7 dataset

We also use the confusion matrix to evaluate our approach, defined as follows:

$$ M_{ij} = \frac{\left| \left\{ I_k \,:\, I_k \in C_j \;\mathrm{and}\; h(I_k) = i \right\} \right|}{\left| C_j \right|}, $$
(6)

where \( i, j \in \{1, \cdots, N_c\} \), N_c is the number of classes, C_j denotes the set of images belonging to the jth class, and h(I_k) denotes the predicted class of image I_k. The diagonal elements of the confusion matrix are marked in black and represent the classification accuracy of each category. A minimal sketch of this computation is given below.
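
A minimal sketch of Eq. (6): each column j holds the distribution of predicted labels over the images of true class j, so the diagonal gives the per-class accuracy. The function name is illustrative.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """M[i, j] = fraction of class-j images predicted as class i (Eq. 6)."""
    M = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        M[p, t] += 1
    return M / np.maximum(M.sum(axis=0, keepdims=True), 1)   # normalize each true-class column
```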

We compare the EMD spatial kernel with the single EMD kernel. As shown in Figs. 9 and 10, the proposed EMD spatial kernel is superior to the single EMD kernel in terms of classification accuracy for almost all object classes, except the Face set and the Bicycle set, which demonstrates that our approach is more effective for object categorization.

Fig. 9 The comparison of the EMD spatial kernel and the EMD kernel in terms of the confusion matrix on the Xerox7 dataset: a) The confusion matrix of the EMD spatial kernel. b) The confusion matrix of the EMD kernel

Fig. 10 The performance comparison of the EMD spatial kernel and the single EMD kernel on the 4-class Caltech101 dataset and the 4-class VOC2006 dataset

Moreover, we compare the explicit spatial coding of (5) (see Fig. 11) with our implicit spatial coding of (4) (see Figs. 9a and 10) in terms of classification accuracy. Equation (5) is mostly used in image retrieval, so we design a classification method based on retrieval results (a minimal sketch is given below). In detail, we first use each image of a class to retrieve images from the dataset. Secondly, we use a k-NN rule to keep the most similar images. Thirdly, among the retrieved images, we count the images belonging to each class, and the label of the query image is determined by the class with the largest count. Figure 11 shows how the number of retrieved images affects the classification accuracy on the Xerox7 dataset (left subfigure) and on the 4-class VOC2006 dataset (right subfigure). We then analyze the averaged classification accuracies obtained by the two coding approaches. On the Xerox7 dataset, the highest averaged classification accuracy of the explicit coding is 59 % with 330 retrieved images; on the 4-class VOC2006 dataset, it is 43 % with 150 retrieved images. These two results are much lower than the averaged classification accuracies of our proposed method, which are 90 % on the Xerox7 dataset (from Fig. 9a) and 63 % on the 4-class VOC2006 dataset (from Fig. 10). Thus, we conclude that the implicit spatial coding is superior to the explicit spatial coding.
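
A minimal sketch of this retrieval-based baseline: rank the database images by their explicit spatial similarity to the query, keep the top k, and take the majority class. The function name and the score input are assumptions.

```python
import numpy as np
from collections import Counter

def retrieval_knn_label(query_scores, database_labels, k):
    """Explicit-coding baseline: keep the k highest-scoring database images
    and return the majority class label among them."""
    top_k = np.argsort(query_scores)[::-1][:k]          # indices of the k most similar images
    votes = Counter(database_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]
```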

Fig. 11 The averaged classification accuracies obtained by the explicit spatial coding based on the image retrieval method: the left subfigure shows the classification accuracy on the Xerox7 dataset, and the right subfigure on the 4-class VOC2006 dataset

5 Conclusions

In this paper, we analyze the influence of the local features and classifiers used in the BOW model on image classification accuracy. We evaluate different local features and classifiers on the Xerox7 dataset and the UIUCTex dataset. We find that the combination of the Hessian-Laplacian detector and the MSER detector obtains more discriminative regions, and that the SVM classifier with the EMD kernel achieves the highest classification accuracy. We also propose an EMD spatial kernel that encodes the spatial information, and evaluate our approach on the Xerox7 dataset, the 4-class VOC2006 dataset and the 4-class Caltech101 dataset. The experimental results demonstrate that (1) the EMD kernel with the signature representation achieves higher classification accuracy than the other kernels with the histogram representation; and (2) the EMD spatial kernel is superior to the single EMD kernel in terms of classification accuracy.