1 Introduction

Automated land-use scene classification has become highly desirable owing to the ever-growing volume of high-resolution remote sensing image data. Given an airborne or satellite image, the task is to predict the content labels that globally describe it, i.e. land-use scene labels. Such labels serve many fields, including agriculture, geography, military and humanitarian applications, and the analysis and management of natural resources. Considerable effort has gone into developing intelligent databases for effective and efficient processing of high-resolution remote-sensed images. With the development of modern acquisition technologies, the resolution of remote sensing images keeps improving and more detailed spatial information can be obtained. Owing to this improvement, and in addition to the widely used spectral information [30], novel computational approaches are constantly required.

Land-use scene classification has been explored from a variety of angles over the last decades. Generally speaking, recognition starts by extracting a set of features from training data and proceeds by deriving a classifier to label test data. Thanks to developments in the acquisition of high-resolution remote sensing images, extraction of color, texture, shape and object information has become possible [28, 29, 40, 41]. Since high-spatial-resolution images contain rich textural information, approaches that capture texture are widely adopted in this domain. With the success of local binary patterns (LBP) [22] as a texture descriptor, this model has been used in many land-use scene classification tasks [4, 5, 27, 45]. In [1], a semantic modeling based method is developed to fill the gap between low-level features and high-level user expectations and scenarios; a Bayesian framework for a visual grammar is employed to reduce this gap. Chen et al. [5] incorporate multi-orientation Gabor filters to capture the global texture information of an input image, and LBP to describe it locally. An improved Gaussian Markov random field (GMRF) model is used in [45] to extract texture features from high-spatial-resolution images; while the effectiveness of such texture features has been verified, the classification procedure is completed by combining spatial and spectral features.

However, despite long-standing efforts in the literature, predicting a semantic category from the pixel level (low-level features) remains a hard task. The challenge is due to the high variability of image appearance, i.e. variations in spatial position, illumination, and scale. Additionally, pixel-based image classification approaches may suffer from an increase in within-class spectral variation as spatial resolution improves [41]. Thus, in a different direction, other studies use region-based and object-based features to implement land-use scene classifiers [3, 6, 29, 40]. These approaches are built upon image segmentation and, unlike single-pixel approaches, can model the spatial information of image regions (i.e. segments) [3]. Region-based approaches depend heavily on a good segmentation algorithm, and usually cannot capture complex semantic information due to the semantic phenomena known as ‘synonymy’ and ‘homonymy’ [3, 47].

Local features, on the other hand, have become more and more popular since they provide robustness to rotation, scale changes, and occlusion. In addition, local features can bridge the gap between low-level features and high-level semantics by building a mid-level feature representation across image patches. One of the most popular local feature frameworks is the Bag-of-Visual-Words (BoVW) [7, 32], where an image is represented by the histogram of occurrences of vector-quantized descriptors. It is tailored to handle scale and rotation variations, provides a concise representation of an image, and has shown decent performance in whole-image categorization tasks [11, 36], including on remote sensing image datasets [12, 25, 43]. In 2010, Yang and Newsam [39] investigated the traditional BoVW approach for land-use scene classification in HSR imagery. Overall, their work does not show an improvement over the best standard approaches; however, BoVW represents a robust alternative for certain classes. To improve on the basic BoVW framework, [6, 41, 42, 44] propose adding spatial and context information to the local features. [44] presents a concentric-circle-based, spatial-rotation-invariant representation strategy for describing spatial information, while [6] proposes a pyramid-of-spatial-relatons (PSR) model to capture both absolute and relative spatial relationships of local features. [41, 42], on the other hand, combine an object-based mid-level representation method with BoVW for improved semantic classification. [47, 48] incorporate both local and global spatial features for HSR image scene classification; both papers use the BoVW framework to capture local features. For global feature extraction, [48] proposes multi-scale completed local binary patterns (MS-CLBP), whereas in [47] a shape-based invariant texture index is designed as the global texture feature.

A BoVW framework consists of the following steps (see Fig. 1): (i) key-point extraction, which samples local areas of an image and outputs image patches; (ii) key-point representation, where image patches are described via statistical analysis approaches; (iii) codebook generation, which aims at providing a compact representation of the local descriptors; (iv) feature encoding, which codes local descriptors against the codebook; (v) feature pooling, which integrates all coding responses into one vector, i.e. the final image representation; and (vi) classification, obtained from the feature vectors using, for instance, a support vector machine.

Fig. 1

The general scheme of a bag-of-visual-words (BoVW) framework. It starts by extracting several key-points (patches) from an input image, describes them with local descriptors, and clusters these into a codebook. The final image-level representation is obtained by pooling the coding coefficients of the encoded local descriptors. The idea is adopted from [13]
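To make the six steps of Fig. 1 concrete, the following minimal sketch wires them together with a k-means codebook, hard assignment, sum pooling and a linear SVM. It assumes per-image descriptor arrays (e.g. dense SIFT) are already available; all names are illustrative and the configuration is deliberately simpler than the pipelines evaluated later.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_codebook(all_descriptors, K=256, seed=0):
    # Step (iii): cluster a pool of local descriptors into K visual words.
    return KMeans(n_clusters=K, random_state=seed).fit(all_descriptors)

def encode_image(descriptors, kmeans):
    # Steps (iv)-(v): hard-assign each descriptor to its nearest word,
    # then sum-pool the assignments into a normalized histogram.
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Steps (i)-(ii) are assumed done: each entry of `train_descs` / `test_descs`
# is an (N_i x D) array of local descriptors for one image.
def train_and_classify(train_descs, train_labels, test_descs, K=256):
    kmeans = build_codebook(np.vstack(train_descs), K)
    X_train = np.array([encode_image(d, kmeans) for d in train_descs])
    X_test = np.array([encode_image(d, kmeans) for d in test_descs])
    clf = LinearSVC().fit(X_train, train_labels)   # step (vi)
    return clf.predict(X_test)
```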

The main contribution of the work reported in this paper is a detailed investigation of different coding and pooling strategies within the BoVW framework. To this end, we carry out a comparative analysis of the BoVW model with different configurations on two commonly used remote sensing datasets. We draw several conclusions from comparing different coding representations on these datasets. Furthermore, the effect of the dictionary size and of the number of training images with respect to the coding approaches is studied.

The rest of the paper is organized as follows: after reviewing the steps of a BoVW framework in Section 2, the details of the experimental setup are given in Section 3. Analysis of the results and discussion follow in Section 4. Finally, conclusions appear in Section 5.

2 Bag of visual words

To compute a global image representation from a large set of local descriptors, various approaches can be taken. The remainder of this section reviews the most commonly used methods for each step of a BoVW framework, with particular focus on codebook generation (step (iii)) and feature encoding (step (iv)). These two steps are considered the most important ones for image representation, since they have a great impact on classification performance [11, 17].

2.1 Key-point extraction – step (i)

In a BoVW model, key-points or interest points are locally sampled image patches. These patches are meant to be the most informative regions of the image, and the best candidates for image classification. They can be sampled either sparsely or densely. The notion of sparse key-point extraction [19, 34] (also known as saliency-based sampling) has evolved from edge, corner and blob detection approaches. Dense features, on the other hand, are sampled on a regular grid and can provide good coverage of the entire image [33]. Dense features have been shown to be an essential component of many state-of-the-art classification approaches [16, 26, 31]. Hu et al. [12] investigate and quantitatively compare different sampling strategies for scene classification of high-resolution remote sensing images.

2.2 Key-point representation – step (ii)

Patch representation is the process of describing the pixels of local image patches (interest points) statistically. Patch descriptor approaches can compute local features which are invariant to image scale and rotation, and provide robustness to illumination changes, noise, and changes in viewpoint.

Given a set of N image patches, the vector of local descriptors can be defined as $X = [x_1, x_2, \dots, x_N]$, where $x_i \in \mathbb{R}^D$, $i = 1, 2, \dots, N$, and D is the dimensionality of the local features computed by an image descriptor. In the domains of scene understanding, object classification, and image retrieval, the most widely used patch descriptors are the scale-invariant feature transform (SIFT) [18], the histogram of oriented gradients (HOG) [9] and local binary patterns (LBP) [22]. The SIFT descriptor computes the orientation and gradient of key-points from gray-level information, and exhibits powerful description capability in land-use scene classification [12, 44, 47, 48].

2.3 Codebook generation – step (iii)

The codebook is a concise representation of the local descriptors and pursues two main goals: (1) avoiding redundancy, and (2) adding robustness to scene layout by providing invariant features. A codebook can be seen as a collection of basic patterns (visual words / codewords) used to reconstruct local descriptors. This collection, called the vocabulary of visual words or the codebook, is commonly generated through clustering in a supervised or an unsupervised manner [13].

Given the vector of local descriptors X, any clustering approach seeks K basis vectors $c_j$ (or visual words), where $K \ll N$. The idea is to reconstruct X using a set of basis vectors (i.e. the visual codebook) $C = [c_1, c_2, \dots, c_K] \in \mathbb{R}^{D \times K}$ and the coding coefficients. The coding coefficient of $x_i$ with respect to the visual word $c_j$ is denoted $u_{ij}$.

2.3.1 Hard assignment

Hard-assignment is considered the simplest and the most common clustering approach in the literature. Usually k-means is adopted to quantize descriptors into the visual vocabulary such that the cumulative approximation error is minimized:

$$ u_{ij} = \left\{ \begin{array}{ll} 1; &\text{if } j = \underset{k = 1,2,\dots,K}{\text{argmin}} {(\| x_{i} - c_{k} \|_{2}^{2})}\\ 0; &\text{otherwise} \end{array} \right. $$
(1)

where $\|\cdot\|_2$ denotes the Euclidean distance between the descriptor vector $x_i$ and the visual word $c_k$. As (1) implies, k-means assigns a local feature to its closest visual word. However, this type of assignment can cause severe information loss, especially for features located on the boundaries between several visual words.
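As an illustration, the hard assignment of (1) reduces to a nearest-centroid search. The following numpy sketch (names are illustrative) computes the full coefficient matrix U:

```python
import numpy as np

def hard_assign(X, C):
    """Eq. (1): u[i, j] = 1 iff c_j is the nearest codeword to x_i.

    X: (N, D) local descriptors; C: (K, D) codebook rows. Sketch only."""
    # Squared Euclidean distances between every descriptor and codeword.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    U = np.zeros_like(d2)
    U[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return U
```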

2.3.2 Soft assignment

Soft assignment [24] aims at minimizing the quantization error by assigning each descriptor to more than one codeword. It is a weighted assignment of the descriptor $x_i$ to the jth visual word $c_j$ based on their distance / similarity. The coding coefficient $u_{ij}$ is controlled by the smoothing factor β and represents the degree of membership of $x_i$ to $c_j$:

$$ u_{ij} = \frac{\exp(-\beta \|x_{i}-c_{j}\|_{2}^{2})}{{\sum}_{k=1}^{K} \exp(-\beta \|x_{i}-c_{k}\|_{2}^{2})} $$
(2)

The denominator is a normalization factor. However, soft assignment results in “dense code vectors, which is undesirable, among other reasons, because it leads to ambiguities due to the superposition of the components in the pooling step” [2].
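A minimal sketch of (2) follows; the row-wise max-subtraction before exponentiating is a numerical-stability detail not discussed in the text:

```python
import numpy as np

def soft_assign(X, C, beta=1.0):
    """Eq. (2): Gaussian-kernel memberships of descriptors to all K words."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    U = np.exp(logits)
    return U / U.sum(axis=1, keepdims=True)       # rows sum to one
```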

2.3.3 GMM clustering

One way to mitigate the ambiguities of soft-assignment clustering is to use Gaussian mixture model (GMM) clustering, in which the probability density of the features is modeled by a collection of Gaussian distributions. A GMM can be thought of as a soft visual vocabulary. The coding coefficient $u_{ij}$ is:

$$ u_{ij} = \frac{p(x_{i}|\mu_{j},\Sigma_{j})\omega_{j}}{{\sum}_{k=1}^{K}p(x_{i}|\mu_{k},\Sigma_{k})\omega_{k}} $$
(3)

Each component of the GMM p(x|θ) represents a cluster of data points (a set of descriptors) and is fully defined by its parameters, i.e. the weight $\omega_k$, mean $\mu_k$, and covariance matrix $\Sigma_k$:

$$\begin{array}{@{}rcl@{}} p(x|\theta) &=& \sum\limits_{k=1}^{K}{p(x|\mu_{k},\Sigma_{k})\omega_{k}} \text{ where}\\ p(x|\mu_{k},\Sigma_{k}) &=& \frac {1}{\sqrt {(2\pi)^{D} |\Sigma_{k}|}} \exp\left(-\frac{1}{2}(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k})\right) \end{array} $$
(4)

The vector of parameters $\theta = (\omega_1, \mu_1, \Sigma_1, \dots, \omega_K, \mu_K, \Sigma_K)$ is learnt iteratively by the Expectation Maximization (EM) algorithm [20].
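In practice the EM fitting and the posteriors of (3) are available off the shelf. A minimal sketch using scikit-learn's GaussianMixture with diagonal covariances (random data stands in for real descriptors):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Eq. (3) via scikit-learn: fit a diagonal-covariance GMM with EM, then
# read off the posterior responsibilities u_ij = p(component j | x_i).
X = np.random.RandomState(0).randn(1000, 64)        # stand-in descriptors
gmm = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(X)
U = gmm.predict_proba(X)                            # (N, K) coefficients
assert np.allclose(U.sum(axis=1), 1.0)              # rows are normalized
```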

2.4 Feature encoding – step (iv)

In general, the purpose of encoding is to statistically summarize a set of local feature descriptors given the codewords. In addition, the most recent encoding approaches aim at reducing the loss of information introduced when the codebook is formed (Section 2.3). The coding can be seen as an activation function for the codebook, which is activated for each of the codewords according to the local descriptor [2].

2.4.1 Histogram coding

In the classical BoVW representation, each descriptor activates only its closest codeword, following (1). Given a set of descriptors $x_1, \dots, x_N$, the frequency of each bin reflects how many times the corresponding visual word is activated across all $x_i$. The frequency-histogram image representation is considered the baseline encoding approach. However, it suffers from instabilities when descriptors are located on the boundaries between several codewords (due to the quantization error of hard assignment).
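Combined with sum pooling, histogram coding amounts to counting nearest-word assignments; a minimal sketch:

```python
import numpy as np

def histogram_encode(X, C):
    """Classical BoVW: count how often each visual word is the nearest one."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)               # hard assignment, eq. (1)
    return np.bincount(nearest, minlength=len(C)).astype(float)
```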

2.4.2 Kernel codebook coding

To enhance the accuracy of the probability density estimation, this type of encoding was suggested by [20]. Each descriptor activates multiple codewords; that is to say, descriptors are assigned to the codewords in a soft manner. Equation (2) can be rewritten as:

$$ p(c_{j}|x_{i}) = \frac{1}{Z}\exp(s(x_{i},c_{j})) \text{ where } s(x_{i},c_{j}) = -\beta \|x_{i}-c_{j}\|_{2}^{2} $$
(5)

showing the probability of the local descriptor $x_i$ belonging to the codeword $c_j$. The normalization factor Z ensures that \({\sum }_{k=1}^{K}P(c_{k}|x_{i}) = 1\). In the original form of soft coding, all K visual words are used to compute the coding coefficients.

2.4.3 Fisher coding

In Fisher coding, the descriptors are encoded using a kernel function derived from a generative probability model p(x|θ). This is done by fitting a parametric generative model (e.g. a GMM, (4)) to the descriptors. Each descriptor $x_i$ is then represented by the gradient of the log-likelihood with respect to the GMM parameters [14]. If the covariance matrices $\Sigma_k$ are assumed to be diagonal [23], the gradients are taken with respect to the mean and standard deviation.

For each Gaussian mode k, consider the mean and covariance deviation vectors as follows:

$$ \Phi_{\mu,k} = \frac{1}{N\sqrt{\omega_{k}}}\sum\limits_{i=1}^{N}\alpha_{i}(k)\big(\frac{x_{i}-\mu_{k}}{\Sigma_{k}}\big) $$
(6)

and

$$ \Phi_{\Sigma,k} = \frac{1}{N\sqrt{2\omega_{k}}}\sum\limits_{i=1}^{N} \alpha_{i}(k)\big[\big(\frac{x_{i}-\mu_{k}}{\Sigma_{k}}\big)^{2} -1\big] $$
(7)

where $\alpha_i(k)$ is the soft-assignment weight of descriptor $x_i$ to Gaussian k. This representation captures the average first- and second-order differences between the descriptors and each of the GMM centers [31]. The final gradient vector, i.e. the Fisher vector Φ, is obtained by aggregating $\Phi_{\mu,k}$ and $\Phi_{\Sigma,k}$ over all K Gaussians, and is 2KD-dimensional. The Fisher encoding can be further improved (Improved Fisher Vector, IFV) through two normalization steps: power and $\ell_2$ normalization [23].
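A compact numpy sketch of (6) and (7), assuming a diagonal GMM whose per-dimension scales are given as standard deviations; the power and $\ell_2$ normalization steps of IFV are left out:

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Eqs. (6)-(7): FV w.r.t. means and (diagonal) variances of a GMM.

    X: (N, D); weights: (K,); means, sigmas: (K, D), sigma = std deviation.
    Sketch only; a practical IFV adds power and l2 normalization [23]."""
    N, D = X.shape
    # Posterior responsibilities alpha_i(k), computed in log space.
    diff = (X[:, None, :] - means[None]) / sigmas[None]          # (N, K, D)
    log_p = (-0.5 * (diff ** 2).sum(2) - np.log(sigmas).sum(1)
             - 0.5 * D * np.log(2 * np.pi) + np.log(weights))
    log_p -= log_p.max(1, keepdims=True)
    alpha = np.exp(log_p)
    alpha /= alpha.sum(1, keepdims=True)                          # (N, K)
    # Mean and covariance deviation vectors, one (K, D) block each.
    phi_mu = (alpha[:, :, None] * diff).sum(0) / (N * np.sqrt(weights)[:, None])
    phi_sig = ((alpha[:, :, None] * (diff ** 2 - 1)).sum(0)
               / (N * np.sqrt(2 * weights)[:, None]))
    return np.concatenate([phi_mu.ravel(), phi_sig.ravel()])     # 2KD values
```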

2.4.4 Sparse coding

Sparse coding (SC) [38], along with other coding approaches, intends to ameliorate the quantization loss of vector quantization (e.g. k-means). The core idea is to reconstruct the local descriptor $x_i$ by a linear combination of a sparse set of codewords. Generally, this is done by solving a least-squares-based optimization problem:

$$ \underset{U}{\text{argmin}} \sum\limits_{i=1}^{N} \|x_{i}-Cu_{i}\|^{2}+\lambda\|u_{i}\|_{l^{1}} $$
(8)

The first term (the least-squares term) pursues accurate reconstruction, whereas the second term (the sparsity regularization term), the $\ell_1$ norm of $u_i$, enforces sparsity.
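Problem (8) is a per-descriptor lasso, so an off-the-shelf solver applies. A sketch using scikit-learn's sparse_encode, where `alpha` plays the role of λ (up to the solver's internal scaling) and the codebook is assumed pre-learned:

```python
import numpy as np
from sklearn.decomposition import sparse_encode

# Eq. (8) via scikit-learn's lasso-based encoder; values are illustrative.
rng = np.random.RandomState(0)
C = rng.randn(64, 128)             # K=64 codewords of dimension D=128
X = rng.randn(10, 128)             # 10 local descriptors
U = sparse_encode(X, C, algorithm='lasso_lars', alpha=0.1)   # (10, 64)
print((np.abs(U) > 1e-8).sum(axis=1))    # few active codewords per descriptor
```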

2.4.5 Locality constrained linear coding

Locality-constrained linear coding (LLC) [36] can be seen as a variant of sparse coding. However, instead of the sparsity constraint, LLC incorporates a locality constraint. The LLC encoding of the local descriptor $x_i$ is of size K, with M non-zero components corresponding to the visual words closest to $x_i$ (where $M \ll K$). Mathematically, LLC can be formulated as:

$$ \underset{U}{\text{argmin}} \sum\limits_{i=1}^{N} \|x_{i}-Cu_{i}\|^{2}+\lambda\|d_{i}\odot u_{i}\|^{2} \text{ s.t. } 1^{T}u_{i}=1, \; \forall i $$
(9)

where ⊙ denotes element-wise multiplication, $d_i = \exp(\mathrm{dist}(x_i, C)/\sigma)$, and $\mathrm{dist}(x_i, C) = [\mathrm{dist}(x_i, c_1), \dots, \mathrm{dist}(x_i, c_K)]^T$ collects the Euclidean distances between $x_i$ and the codewords. Finally, σ is a weighting parameter controlling the decay speed of the locality adaptor.
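In practice [36] solves (9) approximately by restricting each descriptor to its M nearest codewords and solving a small linear system. A sketch of that approximation (the regularization constant and names are illustrative):

```python
import numpy as np

def llc_encode(X, C, M=5, eps=1e-4):
    """Approximated LLC (as in [36]): reconstruct each descriptor from its
    M nearest codewords under the constraint sum(u) = 1. Sketch only."""
    N, K = len(X), len(C)
    U = np.zeros((N, K))
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    for i in range(N):
        idx = np.argsort(d2[i])[:M]           # M nearest visual words
        Z = C[idx] - X[i]                     # shift codewords to the origin
        G = Z @ Z.T                           # local covariance (M, M)
        G += eps * np.trace(G) * np.eye(M)    # regularize for stability
        w = np.linalg.solve(G, np.ones(M))
        U[i, idx] = w / w.sum()               # enforce 1^T u_i = 1
    return U
```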

2.4.6 Super-vector coding

Zhou et al. [46] propose a coding scheme, super-vector coding (SVC), which shares some similarities with IFV and LLC. The idea is to estimate the feature manifold by deriving a non-linear mapping function (i.e. $f(x) \approx \omega^T \Phi(x)$). SVC typically uses the closest codeword of a feature (hard assignment), though a variant based on soft assignment also exists. It augments the adaptive representation histogram with the first-order statistics between the local descriptors X and the codebook C for better reconstruction:

$$ \Phi(x) = [s\gamma_{c}(x),\gamma_{c}(x)(x-c)^{T}]^{T}_{c\in C} $$
(10)

where $\gamma_c(x)$ is the coding coefficient of codeword c for the local descriptor x, and s is a non-negative constant. The resulting vector is a highly sparse representation of dimensionality K(D + 1).
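A pooled, hard-assignment reading of (10) can be sketched as follows; it aggregates the per-descriptor super vectors into one image-level vector and should be read as an interpretation of the scheme, not the reference implementation:

```python
import numpy as np

def super_vector(X, C, s=1.0):
    """Eq. (10) with hard assignment: for each codeword, stack a scaled
    occurrence mass and the aggregated first-order residual. Sketch only."""
    K, D = C.shape
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    phi = np.zeros((K, D + 1))
    for j in range(K):
        members = X[nearest == j]
        gamma = len(members) / len(X)          # hard-assignment mass
        phi[j, 0] = s * gamma
        if len(members):
            phi[j, 1:] = gamma * (members - C[j]).mean(axis=0)
    return phi.ravel()                          # K * (D + 1) values
```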

2.5 Pooling – step (v)

Pooling is the final step of a BoVW framework, where the idea is to obtain an image-level representation from the coding coefficients $u_{ij}$ of the local features $x_i$ in the image. Suppose that we have the responses of the visual word $c_j$ for all local descriptors $x_i$, $i = 1, \dots, N$. Then pooling can be seen as a function summarizing the visual word $c_j$ over the local descriptors, and “may be interpreted as the aggregation of the activations of that codeword” [2]. Two of the most common pooling functions are sum (or average) and max pooling. Given the responses for the jth visual word $c_j$, the jth component of the final image-level representation is \({\sum }_{i=1}^{N}{u_{ij}}\) for sum pooling, \(\frac {1}{N}{\sum }_{i=1}^{N}{u_{ij}}\) for average pooling, and $\max_i u_{ij}$ for max pooling. Sum / average pooling preserves the average response and is widely used in traditional BoVW, whereas max pooling preserves the maximum response and is often preferred for sparse and soft coding.
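Given the coefficient matrix U, the three pooling rules are one-liners:

```python
import numpy as np

# U holds coding coefficients: N descriptors x K visual words (stand-in data).
U = np.random.RandomState(0).rand(100, 16)
sum_pool = U.sum(axis=0)      # preserves the total response per word
avg_pool = U.mean(axis=0)     # traditional BoVW histogram behaviour
max_pool = U.max(axis=0)      # preferred for sparse / soft codes
```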

3 Performance evaluation

3.1 Coding approaches

The following five coding approaches are chosen for an extensive performance evaluation on two common land-use scene datasets, the UC Merced Land Use Dataset [39] and the High-Resolution Satellite Scene Dataset [8, 37]. It is noteworthy that both datasets contain high-spatial-resolution (HSR) remote sensing images only, so optical RGB images are used for all experiments. In addition, throughout all the experiments, only the luminance channel is used.

  1. Histogram coding with hard quantization is selected as the baseline coding method for the BoVW framework;

  2. Kernel codebook coding, or soft-assignment coding, is selected as the representative of soft quantization approaches;

  3. Locality-constrained linear coding (LLC) is selected as a good representative of the sparse coding scheme, considering both accuracy and computational cost;

  4. Improved Fisher vector (IFV) is selected since it has been shown to be a powerful coding approach for image-level representation;

  5. Super-vector coding (SVC) is selected as a simple extension of histogram coding that is similar, in different ways, to both IFV and LLC.

3.2 UC Merced land use dataset – UCMerced

The methods are first evaluated using a large ground-truth image dataset of 21 land-use classes [39]. There are 100 images for each of the following classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each image contains 256 × 256 pixels and was manually cropped from large images of the USGS National Map Urban Area Imagery collection for various urban areas around the USA [39]. Following the common benchmarking procedure in [39], 80 images from each land-use class are chosen for training and the rest are used for evaluating performance. The experiments are repeated 10 times with different randomly selected training and test images. The final result is reported as the mean and standard deviation over the individual runs. Sample images of this dataset appear in Fig. 2.

Fig. 2

Sample images from each class of the 21-class land-use dataset. One example per class is shown. From left to right, top to bottom, the classes are: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts

3.3 High-resolution satellite scene dataset – RSDataset

The second dataset used is a 19-class satellite scene dataset comprising airport, beach, bridge, commercial area, desert, farmland, football field, forest, industrial area, meadow, mountain, park, parking, pond, port, railway station, residential area, river, and viaduct [8, 37]. There are at least 50 images of size 600 × 600 per class. To facilitate a fair comparison, the same experimental setup as suggested in [4] is followed: 30 random images per class are chosen to train the models, and the remaining ones are used for testing. As with the previous dataset, the experiments are repeated 10 times and the final result is reported as the mean and standard deviation over the individual runs. Sample images of this dataset are given in Fig. 3.

Fig. 3

Sample images from each class of the 19-class satellite scene dataset. One example per class is shown. From left to right, top to bottom, the classes are: airport, beach, bridge, commercial area, desert, farmland, football field, forest, industrial area, meadow, mountain, park, parking, pond, port, railway station, residential area, river, viaduct

3.4 Experimental setup

In this study, all descriptors are extracted over a dense grid of pixels rather than from a sparse set of interest points, as suggested by [16, 26, 31]. The local descriptors $X = [x_1, \dots, x_N]$ are scale-invariant feature transform (SIFT) descriptors [18]. They are extracted densely in scale and space, on spatial grids of 8, 12, 16, 20 and 24 pixels with the step size set to 2 pixels, using the publicly available VLFeat toolbox [35]. As suggested by the respective authors, max pooling is used for LLC encoding, weighted average pooling for super-vector coding, and sum/average pooling for all other approaches. The final image-level representation is passed through signed square-rooting and $\ell_2$ normalization before being sent to a linear SVM for training and classification (LIBLINEAR package [10]). The number of visual words varies between 16 and 16384 for both datasets.
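The post-processing just described is easy to reproduce. A minimal sketch of signed square-rooting, $\ell_2$ normalization and a liblinear-backed classifier (scikit-learn's LinearSVC wraps LIBLINEAR; the data here is a random stand-in):

```python
import numpy as np
from sklearn.svm import LinearSVC

def postprocess(v):
    """Signed square-rooting followed by l2 normalization (Section 3.4)."""
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / max(np.linalg.norm(v), 1e-12)

# Illustrative training call; `features` and `labels` are stand-ins.
features = np.random.RandomState(0).randn(40, 512)
labels = np.repeat(np.arange(4), 10)
Xn = np.array([postprocess(f) for f in features])
clf = LinearSVC(C=1.0).fit(Xn, labels)
```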

3.5 Spatial pyramid

The BoVW framework provides a flexible visual layout of images by representing them as an orderless collection of local features. This results in a holistic representation in which the spatial information of the features is lost. To tackle this issue, we incorporate spatial information using the spatial pyramid matching (SPM) scheme of [16], which has been shown to be successful in object and scene recognition. The basic SPM approach partitions an image into a sequence of increasingly fine sub-regions; histograms are then computed for each grid cell and stacked together after being weighted and normalized according to their size. This concept extends easily to any of the BoVW methods by encoding the regions and stacking the feature vectors. In [15], a variant of the basic SPM [16] is suggested in which each spatial region is normalized individually prior to stacking. The final feature vector is $\ell_2$-normalized before being sent to the SVM classifier. As in [16], the spatial regions for both datasets are set to 1 × 1, 2 × 2, and 3 × 1 grids, for a total of 8 regions.
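A sketch of this region-wise encode-and-stack procedure, assuming key-point positions and per-descriptor codes are available; names are illustrative, regions are pooled and stacked with a single final $\ell_2$ normalization, and the per-region weighting/normalization variants of [15, 16] could be added inside the loop:

```python
import numpy as np

def spm_encode(keypoints, U, width, height, pool):
    """Spatial pyramid with 1x1, 2x2 and 3x1 grids (8 regions, as in [16]).

    keypoints: (N, 2) x/y pixel positions; U: (N, K) coding coefficients;
    pool(U_region) reduces an (n, K) block to a K-vector. Sketch only."""
    K = U.shape[1]
    parts = []
    for gx, gy in [(1, 1), (2, 2), (1, 3)]:           # 1 + 4 + 3 = 8 regions
        cx = np.clip((keypoints[:, 0] * gx // width).astype(int), 0, gx - 1)
        cy = np.clip((keypoints[:, 1] * gy // height).astype(int), 0, gy - 1)
        for ix in range(gx):
            for iy in range(gy):
                region = U[(cx == ix) & (cy == iy)]
                parts.append(pool(region) if len(region) else np.zeros(K))
    v = np.concatenate(parts)
    return v / max(np.linalg.norm(v), 1e-12)           # final l2 normalization
```

For max pooling, for instance, one would call `spm_encode(kp, U, 256, 256, lambda R: R.max(axis=0))`.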

4 Analysis of results

Table 1 presents the overall performance of each of the coding approaches discussed in Sections 2 and 3, together with a comparison against the state-of-the-art. Note that the results in this table are quoted directly from the respective papers, where the best overall performance is chosen for comparison. In the remainder of the paper we refer to histogram (hard-quantization) coding as HQ, kernel codebook coding as KC, locality-constrained linear coding as LLC, super-vector coding as SVC, and the improved Fisher vector as IFV.

Table 1 Performance evaluation on the two commonly used land-use datasets, UCMerced [39] and RSDataset [8, 37]. Following the common benchmarking procedure, 80 training images are used for UCMerced and 30 for RSDataset. The coding methods discussed in this paper are compared with each other and with state-of-the-art performances

Based on the results in Table 1, the following observations can be made:

  1. As expected, HQ, the baseline coding approach of the BoVW framework, stands well below the other coding approaches (except for KC). The improvement achieved by the other coding strategies (LLC, SVC, and IFV) reflects the evolution of coding approaches over time.

  2. For both datasets, HQ and KC produce almost the same results. Although soft assignment is used instead of hard assignment, almost no improvement in classification is observed. The inferior performance of KC is due to the ambiguities introduced in the pooling step. As proposed by Liu et al. [17], the accuracy of KC can be improved when localized soft assignment is combined with mix-order max pooling.

  3. When local descriptors are clustered into codewords, part of the information is lost. LLC, and sparse coding in general, employ a least-squares-based optimization to lessen this information loss. As the results suggest, LLC improves over HQ and KC.

  4. Overall, the classification accuracy of IFV outperforms all other coding schemes. Comparing the IFV results with the state-of-the-art (in particular with Local Global Fusion (LGF) [48]) implies that this approach performs just as well as the best land-use scene classification approaches.

  5. SVC outperforms LLC, but its performance does not reach the classification accuracy of IFV. Generally speaking, SVC runs faster than IFV and combines computational efficiency with reasonable classification performance.

Next, the influence of the vocabulary size on the performance of the five coding schemes is evaluated (see Fig. 4):

  1. As seen in Fig. 4, the overall tendency is that a larger number of visual words leads to higher accuracy. For HQ, KC, and LLC, and for both datasets, the performance degrades dramatically when a smaller codebook size is chosen.

  2. For all coding approaches, a saturation effect is noticed beyond a certain codebook size, which varies across coding approaches and datasets. For instance, for both datasets the saturation of HQ and KC occurs at about 1024 visual words, whereas LLC on UC Merced is not saturated even at 16384 visual words; in this case the performance could increase further if the codebook size were increased to 32768 (≈ 32K).

  3. For SVC and IFV, saturation occurs at a smaller vocabulary size. In both cases the final image-level representation is very high dimensional, so choosing an optimal number of codewords plays an important role from the computational-efficiency point of view. Perronnin [23] suggested setting the codebook size for IFV to 256; this value has been validated in different experiments [13, 15, 23, 31] and provides a good compromise between computational cost and classification accuracy. A larger codebook typically increases the accuracy by only a fraction of a percent at a higher computational cost. For both IFV and SVC, the codebook size is therefore not increased beyond 1024, due to the saturation and the computational complexity of the approach.

Furthermore, the effect of the number of training samples is investigated and presented in Table 2. We vary the number of training images from 10 to 80 for UCMerced and from 5 to 30 for RSDataset. Note that the number of training images suggested in the literature is 80 for UCMerced and 30 for RSDataset. As observed in multiple previous works [13, 15, 39], the performance can be improved considerably when the number of training samples is increased. On both datasets, and for all coding strategies, a clear enhancement in classification accuracy is observed as the number of training images grows.

Fig. 4

Comparing the performance of the five coding approaches on UCMerced (left) and RSDataset (right) when varying the codebook size from 16 to 16384. Following the common benchmarking procedure, 80 training images are used for UCMerced and 30 for RSDataset

Table 2 Comparing the classification accuracy of UCMerced (top) and RSDataset (bottom) when varying the number of training images. The codebook size is fixed for both datasets and is set to 16384 for all five coding strategies

Finally, the complexity and speed of the encoding approaches are reviewed. Since the BoVW framework consists of various steps, each step is evaluated separately:

  1. Key-point extraction and representation: The 128-dimensional dense SIFT feature vector has been adopted as the common descriptor for all coding methods. The descriptors are computed using the ‘fast’ option of the publicly available VLFeat toolbox [35]. The fast version slightly approximates the original SIFT descriptor, but is 30 to 70 times faster. In our experiments, computing ‘fast’ dense SIFT descriptors requires less than 0.5 seconds per image.

  2. Codebook generation and encoding: To generate codebooks, the approximate nearest-neighbor clustering algorithm [21] provided by the VLFeat toolbox is used. While histogram coding requires just the nearest neighbor of each descriptor, KC and LLC need to search for more than one nearest neighbor per feature. The number of nearest neighbors sought and the codebook size considerably increase the encoding time. In all our experiments we used K = 5 nearest neighbors for both LLC and KC, with the codebook size fixed at 16384. This results in an encoding time of around 0.5 seconds per image for histogram coding, 19 seconds for LLC and 24 seconds for KC. Most of the encoding time is spent on finding the nearest neighbors rather than on the encoding itself. IFV, on the other hand, requires GMM clustering, which does not cause a considerable overhead in clustering time. However, due to their size, IFV and SVC encodings are quite slow for a codebook of 16384 words; we therefore used the codebook sizes suggested in the literature, 256 for IFV and 1024 for SVC. This configuration results in an 8-second encoding time for IFV and a 10-second encoding time for SVC. All timings are for a 3.3 GHz Intel CPU, with implementations in C++/MATLAB.

5 Conclusion

The improved spatial resolution of HSR remote sensing imagery provides more detailed information for land-use classification. Land-use scene categories often cover multiple land-cover classes or ground objects. In addition, the higher spatial resolution may result in increased within-class spectral variation for the same surface features. As a consequence, pixel-based classification approaches can no longer fulfill this task, and approaches that represent the visual layout of land-use images flexibly are becoming popular for this purpose.

Local features, and BoVW in particular, bridge this gap by providing an intermediate feature representation. This paper investigated different configurations of the BoVW framework for remote sensing land-use scene classification. The importance of this work lies in comparing different coding strategies for BoVW while controlling its parameters. The detailed empirical evaluation of five coding schemes showed that the improved Fisher vector (IFV) [23] outperforms the other coding strategies, though it introduces higher computational complexity. The performance of IFV is comparable to that of state-of-the-art approaches [47, 48] in which both local and global features are used for land-use scene classification. We also investigated the effect of the codebook size and the number of training images on UC Merced and RSDataset; the experimental findings showed that more training images and more visual words improve the performance.