1 Introduction

Scale-invariant feature transform (SIFT) descriptors [1] are widely used in many vision tasks, such as object recognition, image classification, and video retrieval. SIFT has proven to be a very robust local invariant feature descriptor with respect to different geometrical changes. However, SIFT was mainly developed for gray images; the color information of objects is neglected. Therefore, two objects with completely different colors may be regarded as the same. To overcome this limitation, different kinds of colored SIFT (CSIFT) descriptors have been proposed to utilize the color information inside the SIFT descriptor [2–6]. With the enhancement of color information, CSIFT descriptors can better resist certain photometric changes. One example can be found in [3], which shows that CSIFT is more stable than SIFT under illumination changes.

On the other hand, the bag-of-features (BoF) approach [7, 8] combined with the spatial pyramid matching (SPM) kernel [9] has been employed to build recent state-of-the-art image-classification systems. In BoF, images are considered as sets of unordered local appearance descriptors, which are clustered into discrete visual words for the representation of images in semantic classification.

SPM divides an image into \(2^l\times 2^l\) segments at different scales \(l = 0, 1, 2\), computes the BoF histogram within each segment, and finally concatenates all the histograms to build a spatial location-sensitive descriptor of the image. To obtain better classification performance, a codebook (a set of visual words), also called a dictionary, is constructed to represent the extracted descriptors. Traditional SPM uses clustering techniques such as \(K\)-means vector quantization (VQ) to generate the codebook. Despite its efficiency, the obtained codebook usually suffers from several drawbacks, such as distortion errors and low discriminative ability [10]. Yang et al. [11] proposed a linear SPM based on sparse coding (SC), named ScSPM, to relax the restrictive cardinality constraint of VQ. By generalizing vector quantization to sparse coding followed by multi-scale spatial max-pooling, ScSPM significantly outperforms the traditional SPM kernel on histograms and is even better than the nonlinear SPM kernels on several benchmarks.
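As a concrete illustration, the following Python/NumPy sketch (all names are ours, not from [9]) assembles the concatenated SPM descriptor from pre-quantized visual words; the per-level weighting of the original kernel is omitted for brevity:

```python
import numpy as np

def spm_histogram(words, xy, img_w, img_h, vocab_size, levels=(0, 1, 2)):
    # words: (N,) visual-word index of each local descriptor;
    # xy: (N, 2) pixel coordinates of the descriptors.
    # Builds a BoF histogram per cell of each 2^l x 2^l grid and
    # concatenates them, as in SPM [9].
    hists = []
    for l in levels:
        cells = 2 ** l
        cx = np.minimum((xy[:, 0] * cells // img_w).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells // img_h).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cx == i) & (cy == j)
                hists.append(np.bincount(words[in_cell], minlength=vocab_size))
    h = np.concatenate(hists).astype(float)
    return h / max(h.sum(), 1.0)  # normalized image descriptor
```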

Yu et al. [12] demonstrated that under certain assumptions, locality is more essential than sparsity for the training of nonlinear classifiers and proposed a modification of SC, named local coordinate coding (LCC). However, in both SC and LCC, a computationally expensive \(\ell _1\)-norm optimization problem must be solved. Wang et al. [13] developed a faster implementation of LCC, named locality-constrained linear coding (LLC), which utilizes the locality constraint to project each descriptor into its local coordinate system. It achieves state-of-the-art image classification accuracy even with just a linear SVM classifier.

According to our literature survey, although various kinds of final-representation (FR) based image-classification algorithms with state-of-the-art performance have been developed, most of them use only gray-based SIFT descriptors [10, 11, 13–16]. Using color information can improve the robustness of the traditional SIFT descriptor with respect to both color variations and geometrical changes. However, facing the diverse CSIFT descriptors, the following questions are worth studying.

  • Which CSIFT descriptor is best for an FR-based image classification system?

  • To what extent can the performance of an FR-based image classification system be improved by using CSIFT?

To fully exploit the potential of CSIFT descriptors for image category recognition tasks, a CSIFT-based image-classification system is constructed in this work. LLC, a widely used state-of-the-art SC-based encoding algorithm, is employed to encode the CSIFT descriptors for classification. Moreover, a simple \(\ell _2\)-norm regularized locality distance method is proposed to enhance the performance of traditional LLC.

Real experiments with different kinds of CSIFT descriptors demonstrate that significant improvements can be obtained from the color information and the \(\ell _2\)-norm regularized locality distance, even when only a linear SVM classifier is used.

The rest of this article is organized as follows: In Sect. 2, a reflectance model for color analysis is presented. In Sect. 3, different kinds of CSIFT descriptors and their properties are discussed. Section 4 introduces the basic concepts of LLC. In Sect. 5, we introduce an \(\ell _2\)-norm regularized locality distance method. In Sects. 6 and 7, real experiments are carried out to study the proposed algorithm in various aspects. Finally, in Sect. 8, conclusions are drawn.

2 Dichromatic reflectance model

A physical model of reflection, named the dichromatic reflection model, was presented by Shafer in 1985 [17]; it investigates the relationship between the RGB values of captured images and photometric changes of the environment, such as shadows and specularities. Shafer indicated that the reflection of incident light can be divided into two distinct components: specular reflection and body reflection. Specular reflection occurs when a ray of light hits a smooth surface and is reflected at the same angle as the incident ray; it causes the effect of highlights. Body (diffuse) reflection occurs when a ray of light hits the surface and is reflected back in every direction.

Consider an image of an infinitesimal surface patch of some object. Let \(f_{R}(\lambda )\), \(f_{G}(\lambda )\) and \(f_{B}(\lambda )\) be the spectral sensitivities of the red, green and blue sensors, respectively. The corresponding sensor values of the surface image are [17, 18]:

$$\begin{aligned} L({\mathbf{n,s,v}})&=m_{b}({\mathbf{n,s}})\int _{\lambda }{f_{L}(\lambda )e(\lambda )c_{b}(\lambda )\,{\mathrm{d}}\lambda } \\ &\quad + m_{s}({\mathbf{n,s,v}})\int _{\lambda }{f_{L}(\lambda )e(\lambda )c_{s}(\lambda )\,{\mathrm{d}}\lambda } \end{aligned}$$
(1)

where \(L\in \{R, G, B\}\) is the color channel, \(\lambda\) is the wavelength, \(\mathbf n\) is the surface patch normal, \(\mathbf s\) is the direction of the illumination source, and \(\mathbf v\) is the direction of the viewer. \(e(\lambda )\) is the power of the incident light at wavelength \(\lambda\); \(c_{b}(\lambda )\) and \(c_{s}(\lambda )\) are the surface albedo and the Fresnel reflectance, respectively. The geometric terms \(m_{b}\) and \(m_{s}\) represent the diffuse reflection and the specular reflection, respectively.

When white illumination and the neutral interface reflection model hold, the incident light energy \(e(\lambda ) = e\) and the Fresnel reflectance \(c_s(\lambda ) = c_s\) are both constants independent of the wavelength \(\lambda\). Assuming additionally that

$$\int_{\lambda } f_{R}(\lambda )\,{\mathrm{d}}\lambda = \int_{\lambda } f_{G}(\lambda )\,{\mathrm{d}}\lambda = \int_{\lambda } f_{B}(\lambda )\,{\mathrm{d}}\lambda = f$$
(2)

Equation (1) then simplifies to:

$$\begin{aligned} L(\mathbf{n, s,v})=e m_{b}(\mathbf{n, s})k_L+ e m_{s}(\mathbf{n, s, v} )c_{s}f \end{aligned}$$
(3)

where \(k_L = \int _{\lambda }f_L(\lambda )c_b(\lambda )\,{\mathrm{d}}\lambda\) depends only on the sensors and the surface albedo.

3 Colored SIFT descriptors

On the basis of the dichromatic reflection model, the stability and reliability of color spaces with regard to various photometric events such as shadows and specularities have been studied theoretically and empirically [2, 19, 20]. Although many color space models exist, most of them are correlated with intensity, are linear combinations of RGB, or are normalized with respect to intensity (e.g., rgb) [19]. In this article, we concentrate on investigating CSIFT using essentially different color spaces: RGB, HSV, YCbCr, opponent, rg and color invariant spaces.

3.1 SIFT

The SIFT algorithm was originally developed for gray images by Lowe [1, 21] for extracting highly discriminative local image features that are invariant to image scaling and rotation, and partially invariant to changes in illumination and viewpoint. It has been used in a broad range of vision tasks, such as image classification, recognition, and content-based image retrieval. The algorithm involves two steps: (1) extraction of the keypoints of an image and (2) computation of the feature vectors characterizing the keypoints. The first step is carried out by convolving the input image with the DoG (difference of Gaussians) function at multiple scales and detecting the extrema of the outputs. The second step is achieved by sampling the magnitudes and orientations of the image gradient in a patch around the detected feature. A 128-D vector of gradient orientation histograms is finally constructed as the descriptor of each patch. Since the SIFT descriptor is normalized, it is invariant to changes in the scale of the gradient magnitude; however, changes in light color will affect it, because the intensity channel is a combination of the R, G and B channels.

3.2 RGB-SIFT

As the most popular color model, RGB color space provides plenty of information for vision applications. To embed RGB color information into the SIFT descriptor, we simply calculate the traditional SIFT descriptor on each channel of the RGB color space. By combining the extracted features, a \(128 \times 3\)-dimensional descriptor is built (\(128\) for each color channel). Compared with the conventional gray-based SIFT, the RGB color gradients (or edges) of the image are captured.
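A minimal sketch of this per-channel scheme, assuming OpenCV's SIFT implementation (`cv2.SIFT_create`) and the dense-grid settings later used in Sect. 6.1 (step 8 pixels, \(16\times16\) patches); helper names are ours:

```python
import cv2
import numpy as np

def dense_keypoints(h, w, step=8, size=16):
    # Regular spatial grid of keypoints (step 8 px, 16x16 patches).
    return [cv2.KeyPoint(float(x), float(y), float(size))
            for y in range(size // 2, h - size // 2, step)
            for x in range(size // 2, w - size // 2, step)]

def rgb_sift(img, sift=None):
    # img: HxWx3 uint8 image. Computes a standard SIFT descriptor on
    # each color channel and concatenates them into a 128*3-D vector
    # per keypoint; the channel order only permutes the 128-D blocks.
    sift = sift or cv2.SIFT_create()
    kps = dense_keypoints(*img.shape[:2])
    descs = [sift.compute(img[:, :, c], kps)[1] for c in range(3)]
    return np.hstack(descs)  # shape: (num_keypoints, 384)
```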

3.3 HSV-SIFT

HSV-SIFT was introduced by Bosch et al. [22] and employed for the scene classification task. Similar to RGB-SIFT discussed above, SIFT descriptors are computed over all three channels of the HSV color model, producing a \(128 \times 3\)-dimensional descriptor for each point. It is worth mentioning that the H channel of the HSV color model is scale-invariant and shift-invariant with respect to light intensity. However, due to the combination of the HSV channels, the entire descriptor has no invariance properties. The conversion from RGB space to HSV space is defined by Eqs. (4)–(6).

$$\begin{aligned} H= \left\{ \begin{array}{ll} {\rm undefined} & {\rm if }\ \max =\min \\ 60^\circ \times \frac{{G}-{B}}{\max -\min }+0^{\circ } & {\rm if }\ \max ={R }\ {\rm and }\ {G}\ge {B}\\ 60^\circ \times \frac{G-B}{\max -\min }+360^{\circ } & {\rm if }\ \max =R \ {\rm and}\ G<B\\ 60^\circ \times \frac{B-R}{\max -\min }+120^{\circ } & {\rm if }\ \max =G\\ 60^\circ \times \frac{R-G}{\max -\min }+240^{\circ } & {\rm if }\ \max =B \end{array} \right. \end{aligned}$$
(4)
$$\begin{aligned} S&= \left\{ \begin{array}{ll} 0 & {\rm if } \max =0\\ \frac{\max -\min }{\max }=1-\frac{\min }{\max } & {\rm otherwise} \end{array} \right. \end{aligned}$$
(5)
$$\begin{aligned} V&= \max \end{aligned}$$
(6)

where \(\max\) and \(\min\) denote the maximum and the minimum of \(R\), \(G\) and \(B\), respectively.
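For illustration, Eqs. (4)–(6) can be implemented in vectorized NumPy as follows (a sketch under our own conventions; the modulo folds the \(+0^\circ\)/\(+360^\circ\) branches of Eq. (4) into one expression, and the undefined hue is set to 0):

```python
import numpy as np

def rgb_to_hsv(rgb):
    # rgb: float array in [0, 1], shape (..., 3); implements Eqs. (4)-(6).
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    delta = np.where(mx > mn, mx - mn, 1.0)      # avoid division by zero
    h = np.where(mx == r, (60.0 * (g - b) / delta) % 360.0, 0.0)
    h = np.where(mx == g, 60.0 * (b - r) / delta + 120.0, h)
    h = np.where(mx == b, 60.0 * (r - g) / delta + 240.0, h)
    h = np.where(mx == mn, 0.0, h)               # hue undefined: set to 0
    s = np.where(mx > 0, (mx - mn) / np.where(mx > 0, mx, 1.0), 0.0)
    return np.stack([h, s, mx], axis=-1)         # V = max, Eq. (6)
```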

3.4 rg-SIFT

The rg-SIFT descriptors are obtained from the rg color space, the normalized RGB color model, which uses the r and g channels to describe the color information of the image (b is redundant, since \(r+g+b=1\)). The rg color space is already scale-invariant with respect to light intensity. The conversion from RGB space to rg space is defined as follows,

$$\begin{aligned} r&= \frac{R}{R+G+B}\end{aligned}$$
(7)
$$\begin{aligned} g&= \frac{G}{R+G+B} \end{aligned}$$
(8)
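An illustrative NumPy sketch of Eqs. (7)–(8); the small eps guard for black pixels (\(R+G+B=0\)) is our addition:

```python
def rgb_to_rg(rgb, eps=1e-8):
    # rgb: NumPy float array, shape (..., 3). Returns the r and g
    # channels of Eqs. (7)-(8); b = 1 - r - g is redundant.
    s = rgb.sum(axis=-1, keepdims=True) + eps
    return (rgb / s)[..., :2]
```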

3.5 YCbCr-SIFT

As one of the most popular color spaces, YCbCr provides a very efficient representation of scenes/images and is widely used in the field of video compression. It represents colors in terms of one luminance component (\(Y\)) and two chrominance components (\(C_b\) and \(C_r\)). The YCbCr-SIFT descriptors are computed on all the channels of the YCbCr color space. A YCbCr image can be converted from an RGB image using the equation below:

$$\begin{aligned} \left[ \begin{array}{l} Y \\ C_b \\ C_r \end{array} \right] = \left[ \begin{array}{ccc} 0.299 & 0.587 & 0.114 \\ -0.1687 & -0.3313 & 0.5 \\ 0.5 & -0.4187 & -0.0813 \end{array} \right] \left[ \begin{array}{l} R \\ G \\ B \end{array} \right] + \left[ \begin{array}{l} 0 \\ 128 \\ 128 \end{array} \right] \end{aligned}$$
(9)
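A sketch of Eq. (9) as a single matrix product per pixel (assuming 8-bit RGB input; the matrix uses the standard 0.114 luma coefficient):

```python
import numpy as np

_RGB2YCBCR = np.array([[ 0.299,   0.587,   0.114 ],
                       [-0.1687, -0.3313,  0.5   ],
                       [ 0.5,    -0.4187, -0.0813]])

def rgb_to_ycbcr(rgb):
    # rgb: uint8 array, shape (..., 3); applies Eq. (9) with the
    # [0, 128, 128] chrominance offset.
    ycc = rgb.astype(np.float64) @ _RGB2YCBCR.T
    ycc[..., 1:] += 128.0
    return ycc
```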

3.6 Opponent-SIFT

The opponent color space was first proposed by Ewald Hering in the late nineteenth century [23]. It consists of three channels (\(O_1\), \(O_2\), \(O_3\)): the \(O_3\) channel represents the luminance of the image, while the other two describe the opponent colors (red–green, blue–yellow) of the image. The Opponent-SIFT descriptor is obtained by computing the SIFT descriptor over each channel of the opponent color space and combining the results. The transformation from RGB to the opponent color space is defined by Eq. (10).

$$\begin{aligned} \left[\begin{array}{l} O_1 \\ O_2 \\ O_3 \end{array}\right] = \left[\begin{array}{l} \frac{R-G}{\sqrt{2}} \\ \frac{R+G-2B}{\sqrt{6}} \\ \frac{R+G+B}{\sqrt{3}} \end{array}\right] \end{aligned}$$
(10)
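Eq. (10) likewise reduces to a fixed linear map of the RGB values; an illustrative sketch:

```python
import numpy as np

def rgb_to_opponent(rgb):
    # Implements Eq. (10): O1 and O2 carry the red-green and
    # blue-yellow opponents, O3 carries the intensity.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    o1 = (r - g) / np.sqrt(2.0)
    o2 = (r + g - 2.0 * b) / np.sqrt(6.0)
    o3 = (r + g + b) / np.sqrt(3.0)
    return np.stack([o1, o2, o3], axis=-1)
```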

3.7 Color invariant SIFT

Inspired by the dichromatic reflectance model (see Sect. 2), a color-based photometric invariant scheme was proposed by Geusebroek [2]. It was first applied to the SIFT descriptor by Abdel-Hakim and Farag [3]. The linear transformation from RGB to the color invariant space is given by:

$$\begin{aligned} \left[\begin{array}{l} \hat{E}(x,y)\\ \hat{E}_{\lambda }(x,y) \\ \hat{E}_{\lambda \lambda }(x,y) \end{array}\right] = \left(\begin{array}{ccc} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{array}\right) \left[\begin{array}{l} R(x,y)\\ G(x,y) \\ B(x,y) \end{array}\right] \end{aligned}$$
(11)

where \(\hat{E}(x,y)\), \(\hat{E}_{\lambda }(x,y)\) and \(\hat{E}_{\lambda \lambda }(x,y)\) denote, respectively, the intensity, the yellow–blue channel, and the red–green channel. They are the spectral differential quotients, and the measurement of the color invariants is obtained from \(\hat{E}\), \(\hat{E}_{\lambda }\) and \(\hat{E}_{\lambda \lambda }\).

4 Locality-constrained linear coding

The bag-of-features (BoF) approach now plays a leading role in the field of generic image classification research [11, 13, 15]. It commonly consists of feature extraction, codebook construction, feature coding, and feature pooling. Experimental results show that, given a visual codebook, choosing an appropriate coding scheme has a significant impact on the classification performance.

Various coding algorithms have been developed [10, 11, 13, 15]; among them, locality-constrained linear coding (LLC) [13] is considered one of the most representative methods, providing both fast coding speed and state-of-the-art classification accuracy. It has been widely cited in academic papers and employed in image classification applications. In this article, LLC is selected for feature coding in our real experiments.

Let \(X\) denote a set of \(D\)-dimensional local descriptors in an image, i.e. \(X = [x_{1}, x_{2},\ldots ,x_{N}]\in R^{D\times N}\). Let \(B = [b_{1},b_{2},\ldots ,b_{M}]\in R^{D\times M}\) be a visual codebook with \(M\) entries. A coding method converts each descriptor into an \(M\)-dimensional code. Unlike sparse coding, LLC enforces a locality constraint instead of a sparsity constraint. The codes are obtained by solving the following optimization problem:

$$\begin{aligned} \min _{v}\sum _{i=1}^{N}\Vert x_{i}-Bv_{i}\Vert ^{2}+\lambda \Vert d_{i}\odot v_{i}\Vert ^{2} \quad {\mathrm{s.t.}}\; 1^{T}v_{i}=1,\,\forall i \end{aligned}$$
(12)

where \(\odot\) denotes element-wise multiplication, and \(d_{i}\in R^{M}\) is the locality adaptor that gives a different degree of freedom to each basis vector according to its similarity to the input descriptor \(x_{i}\). Specifically,

$$\begin{aligned} d_{i}=\exp \left[ \frac{{\mathrm{dist}}(x_{i},B)}{\sigma }\right] \end{aligned}$$
(13)

where \({\rm dist}(x_{i},B)=[{\rm dist}(x_{i},b_{1}),{\rm dist}(x_{i},b_{2}),\ldots ,{\rm dist}(x_{i},b_{M})]\), and \({\rm dist}(x_{i},b_{j})\) is the Euclidean distance between \(x_{i}\) and \(b_{j}\). \(\sigma\) adjusts the weight decay speed of the locality adaptor \(d_{i}\).

An approximation is proposed in [13] to accelerate the encoding in practice by ignoring the second term in Eq. (12) and directly using the \(K\) nearest basis vectors of \(x_{i}\) to minimize the first term. The encoding process is then reduced to solving a much smaller linear system,

$$\begin{aligned} \min _{v}\sum _{i=1}^{N}\Vert x_{i}-Bv_{i}\Vert ^{2} \quad {\mathrm{s.t.}}\; 1^{T}v_{i}=1,\,\forall i \end{aligned}$$
(14)

This yields coding coefficients over only the \(K\) selected basis vectors; all other coefficients are set to zero.
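The approximated encoder has a small closed-form solution, as derived in [13]; the NumPy sketch below illustrates it (the regularization constant `beta` is our addition for numerical stability):

```python
import numpy as np

def llc_encode(x, B, k=5, beta=1e-4):
    # x: (D,) descriptor; B: (M, D) codebook with basis vectors as rows.
    # Approximated LLC (Eq. 14): restrict coding to the k nearest bases
    # and solve the small constrained least-squares system in closed form.
    d = np.linalg.norm(B - x, axis=1)      # Euclidean distances to bases
    idx = np.argsort(d)[:k]                # k nearest neighbors of x
    z = B[idx] - x                         # shift to local coordinate system
    C = z @ z.T                            # k x k local covariance
    C += np.eye(k) * beta * np.trace(C)    # regularize for stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                           # enforce 1^T v = 1
    code = np.zeros(B.shape[0])
    code[idx] = w                          # all other coefficients are zero
    return code
```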

5 \(\ell _2\)-norm regularized locality distance

In the above section, the traditional LLC method was presented. It can be seen that the distance function (Eq. 13) plays an important role in the coding scheme. In this article, we propose a simple \(\ell _2\)-norm regularized locality distance to achieve better classification accuracy. The distance function is defined as:

$$\begin{aligned} \widetilde{d}_{i}=\left\| \exp \left[ \frac{{\rm dist}(x_{i},B)}{\sigma }\right] \right\| _{\ell _{2}} \end{aligned}$$
(15)

As a result, Eq. (12) is rewritten as follows:

$$\begin{aligned} \min _{v}\sum _{i=1}^{N}\Vert x_{i}-Bv_{i}\Vert ^{2}+\lambda \Vert \widetilde{d}_{i}\odot v_{i}\Vert ^{2} \quad {\mathrm{s.t.}}\; 1^{T}v_{i}=1,\,\forall i \end{aligned}$$
(16)
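For Eq. (16), the standard constrained least-squares solution of LLC still applies; since \(\widetilde{d}_i\) is a scalar, the locality term reduces to a ridge penalty. A sketch under these assumptions (parameter defaults are ours):

```python
import numpy as np

def llc_encode_l2(x, B, sigma=1.0, lam=0.5):
    # Full (non-approximated) LLC with the l2-norm regularized locality
    # adaptor of Eq. (15) in place of the vector adaptor of Eq. (13).
    dist = np.linalg.norm(B - x, axis=1)             # dist(x_i, B)
    d_tilde = np.linalg.norm(np.exp(dist / sigma))   # scalar, Eq. (15)
    z = B - x
    C = z @ z.T + lam * d_tilde**2 * np.eye(B.shape[0])
    w = np.linalg.solve(C, np.ones(B.shape[0]))
    return w / w.sum()                               # enforce 1^T v = 1
```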

6 Experimental results

To evaluate the performance of different kinds of CSIFT descriptors in a sparse-representation based image classification system, two benchmark datasets, Caltech-101 [24] and Caltech-256 [25], are employed in the real experiments. Since color information is a prerequisite for computing CSIFT descriptors, the gray images in Caltech-101 and Caltech-256 are removed to achieve a fair comparison. For categories whose number of color images is insufficient for training a stable classifier (fewer than 31), we add new color images of the same category so that each category contains at least 31 color images.

6.1 Implementation

In all the experiments, the same processing chain with similar settings is used to ensure consistency.

  1. CSIFT/SIFT descriptor extraction. The dense CSIFT/SIFT descriptors are extracted as described in Sect. 3 within a regular spatial grid. The step size is fixed at 8 pixels and the patch size at \(16 \times 16\) pixels. The dimension of the gray-based SIFT descriptor is \(128\). For the experiments, RGB-SIFT, HSV-SIFT, YCbCr-SIFT, Opponent-SIFT, rg-SIFT and color invariant SIFT (C-SIFT) are implemented, along with gray-based SIFT for comparison.

  2. Codebook construction. After the CSIFT/SIFT descriptors are extracted, a codebook of size 1,024 is created using the \(K\)-means clustering method on a randomly selected subset (of size \(2 \times 10^6\)) of the extracted descriptors (a code sketch follows this list).

  3. Locality-constrained linear coding (LLC). The CSIFT/SIFT descriptors are encoded by LLC using the constructed codebooks. The number of neighbors is set to 5, with the shift-invariant constraint.

  4. Pooling with spatial pyramid matching (SPM) [9]. The max-pooling operation is adopted to compute the final descriptor of each image. It is performed with a 3-level SPM kernel (\(1 \times 1\), \(2 \times 2\) and \(4 \times 4\) sub-regions in the corresponding levels), with equal weight at each level. The pooled features of the sub-regions are concatenated and normalized to form the final descriptor of each image.

  5. Classification. A one-versus-all linear SVM classifier [26] is used for its good performance.
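As referenced in step 2, the following sketch illustrates the codebook construction. MiniBatchKMeans (scikit-learn) stands in here for plain \(K\)-means to keep clustering \(2\times10^6\) descriptors tractable; all names are ours:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptors, size=1024, subset=2_000_000, seed=0):
    # descriptors: (N, D) array of extracted CSIFT/SIFT descriptors.
    # Clusters a random subset into `size` visual words (step 2).
    rng = np.random.default_rng(seed)
    n = min(subset, len(descriptors))
    sample = descriptors[rng.choice(len(descriptors), n, replace=False)]
    km = MiniBatchKMeans(n_clusters=size, random_state=seed).fit(sample)
    return km.cluster_centers_   # (size, D) codebook B
```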

6.2 Assessment of color descriptors on the Caltech-101 dataset

The proposed algorithm is evaluated on the color images of the Caltech-101 dataset, which contains 101 object categories including animals, flowers, vehicles, and shapes with significant variance. Some color images are added to avoid insufficient training data in certain categories, as discussed before. The number of original images per category still varies from 31 to 800. To test the performance with different amounts of training data, different numbers (5, 10, \(\ldots\), 30) of training images per category are evaluated. In each experiment, we randomly select \(n\) images per category for training and leave the remainder for testing. The images were resized so that the maximum of height and width is no larger than 300 pixels, with the aspect ratio preserved. For simplicity, the codebook size is fixed at 1,024 (the performance of different codebook sizes is studied in Sect. 7.1). The corresponding results using the different CSIFT/SIFT descriptors (RGB-SIFT, SIFT, HSV-SIFT, YCbCr-SIFT, Opponent-SIFT, rg-SIFT and C-SIFT) are given in Table 1 and Fig. 1. According to the experimental results, all the CSIFT/SIFT descriptors achieve their best classification accuracy with 30 training images per class. This indicates that more training data may bring better classification accuracy in testing, although the improvement becomes slight once the number of training images exceeds 20. Both YCbCr-SIFT and RGB-SIFT outperform the state-of-the-art gray-based SIFT on this dataset, with YCbCr-SIFT achieving the best performance. For instance, when 30 images per category are used for training, YCbCr-SIFT obtains an average classification accuracy of \(69.1\,\%\), and RGB-SIFT provides the second best (\(68.6\,\%\)). It is worth mentioning that even without color information, SIFT achieves the third best average classification accuracy (\(68.17\,\%\)). Thus, approximately \(1\,\%\) improvement in average classification accuracy is obtained by employing CSIFT descriptors.

Table 1 Classification rate (%) comparison on Caltech-101
Fig. 1 Impact of the number of training images per class on classification performance

6.3 Assessment of color descriptors on the Caltech-256 dataset

A more complex dataset, Caltech-256 [25], is also employed for the experiments. It consists of 256 object classes and a total of 30,607 images, which have much higher intra-class variability and object location variability than the images in Caltech-101. Similar to Sect. 6.2, the gray images are removed for a fair comparison of the CSIFT/SIFT descriptors. Since there are at least 80 color images per category, no additional images are added.

In each experiment, we randomly select \(n\) (\(n \in \{15, 30, 45, 60\}\), fixed for each experiment) images from every category for training and leave the remainder for testing. For simplicity, the codebook size is fixed at 4,096 (in our experience, it produces the best classification performance). The images were resized so that the maximum of height and width is no larger than 300 pixels, with the aspect ratio preserved. The classification results are shown in Table 2 and Fig. 2. Among all the descriptors, YCbCr-SIFT again produces the best performance. When 60 randomly selected training images per category are used, YCbCr-SIFT achieves an average classification accuracy of \(41.3\,\%\), and RGB-SIFT again provides the second best (\(38.7\,\%\)). Compared with the gray-based SIFT descriptor, CSIFT brings approximately \(4\,\%\) improvement in average classification accuracy, which can be significant in many image classification tasks.

Table 2 Classification rate (%) comparison on Caltech-256
Fig. 2 Impact of the number of training images per class on classification performance

6.4 Assessment of \(\ell _2\)-norm regularized locality distance on the Caltech-101 and Caltech-256 datasets

In Sects. 6.2 and 6.3, different kinds of CSIFT descriptors were implemented and evaluated with the traditional LLC method. In this section, the \(\ell _2\)-norm regularized locality distance and CSIFT descriptors are combined to obtain better performance. The datasets are the same as in Sects. 6.2 and 6.3. Since the YCbCr-SIFT and RGB-SIFT descriptors achieved the top two classification accuracies, they are employed for comparison. The codebook size is set to 1,024. We randomly selected the training images and repeated each experiment 10 times. The corresponding average results are listed in Tables 3 and 4. The YCbCr-SIFT descriptor still provides the best performance, and with the \(\ell _2\)-norm regularized locality distance the classification accuracy increases steadily (by about \(0.6\,\%\)). In Table 3, when 30 training images per category are used, YCbCr-SIFT achieves an average classification accuracy of \(69.74\,\%\), an enhancement of approximately \(1.57\,\%\) over gray-based SIFT with traditional LLC. In Table 4, when 60 training images per category are used, YCbCr-SIFT achieves an average classification accuracy of \(41.78\,\%\); the combined method obtains approximately \(4.56\,\%\) enhancement compared with gray-based SIFT under traditional LLC.

Table 3 Classification rate (%) comparison on Caltech-101
Table 4 Classification rate (%) comparison on Caltech-256

7 Further evaluations

The experimental results in Sects. 6.2 and 6.3 show that, among the different CSIFT descriptors, YCbCr-SIFT and RGB-SIFT achieve better image classification performance than the state-of-the-art gray-based SIFT. However, it is well known that the codebook size, the number of neighbors in LLC, and the pooling method all affect the final classification results. In this section, further evaluations are carried out for a more comprehensive study of these two CSIFT descriptors.

7.1 Impact of codebook size

First, we test the impact of different codebook sizes (512, 1,024 and 2,048) using the Caltech-101 dataset. As discussed in Sect. 6, the codebooks are trained by the \(K\)-means clustering algorithm. Different numbers (5, 10, \(\ldots\), 30) of training images per category are evaluated, and the number of neighbors in LLC is set to 5. The corresponding results are presented in Tables 5, 6, 7 and Fig. 3. The YCbCr-SIFT descriptor outperforms the others in all tests. In most cases, the highest classification accuracy is obtained with a codebook of size 1,024; with a codebook of size 2,048, the classification accuracies decrease (except for YCbCr-SIFT with 30 training images per category). This may be caused by over-completeness of the codebook, which results in large deviations when representing similar local features. Interestingly, using more training data appears to mitigate the over-completeness problem: YCbCr-SIFT with a codebook of size 2,048 and 30 training images per category achieves the highest average classification accuracy.

7.2 Impact of different numbers of neighbors

The performance of the proposed algorithm with different numbers of neighbors \(K\) in LLC is also evaluated. The codebook size is fixed at 1,024 and the number of training images per category at 30. The results are shown in Table 8 and Fig. 4. As \(K\) increases, the classification accuracy first rises and then drops once \(K \ge 25\). The highest average classification accuracy is obtained with the YCbCr-SIFT descriptor (\(72.59\,\%\)); compared with the highest classification result of SIFT (\(69.18\,\%\)), an improvement of more than \(3\,\%\) is achieved.

7.3 Comparison of pooling methods

Besides max-pooling, sum-pooling is another method that can be used to summarize the features of each SPM layer. Tables 9 and 10 show the experimental results of the two methods; in Fig. 5 they are illustrated together for comparison. The codebook size is 1,024 and the number of neighbors in LLC is 5. It can be seen that max-pooling significantly outperforms sum-pooling.

$$\begin{aligned}{\rm Max}: v_{j}=\max (v_{1},v_{2},\ldots ,v_{i})\end{aligned}$$
(17)
$$\begin{aligned}{\rm Sum}: v_{j}=v_{1}+v_{2}+\cdots +v_{i} \end{aligned}$$
(18)

As can be seen from Fig. 5, the best performance is achieved by the combination of “max-pooling” and “\(\ell _{2}\)-normalization”.
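For reference, the two pooling operators of Eqs. (17)–(18), followed by the \(\ell _2\)-normalization compared in Fig. 5, can be sketched as:

```python
import numpy as np

def pool_and_normalize(codes, method="max"):
    # codes: (N, M) LLC codes of the descriptors falling in one SPM
    # sub-region; returns the pooled, l2-normalized feature vector.
    v = codes.max(axis=0) if method == "max" else codes.sum(axis=0)
    return v / max(np.linalg.norm(v), 1e-12)
```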

Table 5 Classification results with codebooks of size 512
Table 6 Classification results with codebooks of size 1,024
Table 7 Classification results with codebooks of size 2,048
Table 8 Comparison of different neighborhood sizes
Fig. 3 Impact of the number of training images per class on classification performance

Fig. 4 Impact of the number of neighbors \(K\) in LLC on classification performance

Table 9 The performance of max-pooling
Table 10 The performance of sum-pooling
Fig. 5 Impact of different pooling methods

8 Conclusion

In this article, CSIFT descriptors are introduced to improve the state-of-the-art locality-constrained linear coding (LLC) based image classification system. Different kinds of CSIFT descriptors are implemented and evaluated with varying parameter settings. Real experiments have demonstrated that considerable improvements can be obtained by utilizing color information. Among the CSIFT descriptors, YCbCr-SIFT achieves the most stable and accurate image classification performance. Compared with the highest average classification accuracy achieved by gray-based SIFT descriptors, YCbCr-SIFT acquires approximately \(1\,\%\) increase on the Caltech-101 dataset (see Sect. 7.2) and approximately \(4\,\%\) increase on the Caltech-256 dataset (see Sect. 6.3). Besides YCbCr-SIFT, the RGB-SIFT descriptor also provides favorable performance. Since LLC is one of the most representative FR-based image-classification algorithms, the improvements achieved on it show that using CSIFT descriptors is a promising approach to enhancing state-of-the-art FR-based image-classification systems. On the other hand, although some CSIFT descriptors are reported to achieve invariant or discriminative object recognition, we found that their performance is not as good as expected. Moreover, we obtain a steady rise in classification accuracy by introducing a simple \(\ell _2\)-norm regularized locality distance. The combination of the YCbCr-SIFT descriptor and the \(\ell _2\)-norm regularized locality distance provides the best performance, achieving approximately \(2\,\%\) improvement in classification accuracy on the Caltech-101 dataset and approximately \(5\,\%\) on the Caltech-256 dataset. Our future work will investigate combinations of learning-based color descriptors [27], different kinds of distance functions, and sparse coding technologies to achieve better image classification performance.