1 Introduction

High spatial resolution (HSR) remote sensing imagery provides useful geometric and detailed information that can precisely represent the Earth's surface. With the increasing applications of HSR remote sensing imagery, a major issue in land-cover classification is how to improve the accuracy of image processing. Common HSR remote sensing imagery is obtained from satellites such as IKONOS, QuickBird, WorldView-2 and Pleiades. The availability of HSR data increases the possibility of accurate Earth observations [27] and enables a wide range of uses. However, urban landscapes are complicated and contain many different objects with similar spectral features, so increasing the resolution does not improve the classification accuracy to the same degree. Therefore, it is necessary to explore more effective approaches that incorporate spatial features to deal with HSR images.

This paper focuses on the problem of classifying a given high-resolution image into different objects. Our approach is motivated by research on sparse signal representation [9, 36, 39, 40], which suggests a linear relationship among high-resolution signal elements. We propose an improved strategy to train a dictionary [36] by utilizing the sparsity of the input samples, and construct a sparse model that classifies the pixels of a remote sensing image by adopting the error residual of the sparse representation [9, 39]. The sparse vector representing the atoms for a test spectral pixel can be recovered by solving an optimization problem [40], and the class of the test pixel can then be determined from the characteristics of the recovered sparse vector.

The remainder of this paper is organized as follows. Section 2 reviews related work. In Section 3, the sparse-representation-based classification method is introduced. In Section 4, the results of our experiments and their analyses are described, and the effectiveness of our proposed method is demonstrated. Section 5 summarizes this work and outlines future work.

2 Related Work

Various classification approaches have been developed to improve classification accuracy, including Independent Component Analysis [30], Artificial Neural Networks [15], the Back Propagation Neural Network (BPNN) [4, 16, 45], the Hierarchical Hybrid Fuzzy-Neural Network [37], K-Nearest Neighbor [43], the maximum likelihood classifier [31], the Support Vector Machine (SVM) [3, 6, 25], Classification and Regression Trees (CART) [8], K-means [23, 34] and decision tree classification [20]. Giacinto et al. [14] proposed an approach to the automatic design of effective neural network ensembles that selects the subset formed by the most error-independent nets. Conventional clustering techniques such as K-means [23, 34] have been used for image segmentation for years; Luo et al. [23] proposed a spatially constrained K-means approach to solve the image segmentation problem. The back-propagation neural network algorithm [16], a gradient-based method, was explored for the classification of multispectral image data. A variation of the SVM-based algorithms [41] put forward a set of tools for structured classification and generalized the traditional non-structured classification approaches.

However, the above traditional classifiers are inadequate for HSR imagery [17]. In this context, additional features [1, 11, 18, 19, 29, 32, 35] were used to enhance the spectral information and improve the classification accuracy. Ouma and Tateishi [29] presented a pre-classification filtering method based on unsupervised multiresolution non-linear image filtering that combines spectral and textural image characteristics, where the local texture characteristics are extracted via wavelet decomposition. Huang et al. [18] proposed statistical measures to extract structural features and, after spatial feature extraction and dimension reduction, used different classifiers, including the maximum likelihood classifier, BPNN, a probability neural network based on expectation-maximization training, and SVM, to process the hybrid spectral-structural features. Pingel et al. [32] developed a Morphological Filter algorithm competitive with other ground filtering algorithms for LIDAR and established a baseline performance for a progressive morphological filter implemented in its simplest form.

Researchers have also proposed exploiting spatial information to complement the spectral feature space and enhance the separability of spectrally similar classes [5, 11, 42]. Dópido et al. [11] developed a semisupervised self-learning framework in which the machine learning algorithm itself selects the most useful and informative unlabeled samples for hyperspectral image classification. However, this method depends on the assumption that pixels with similar spectral signatures belong to the same class, which may hold for hyperspectral images but not for multispectral images, since the latter contain many spectral ambiguities (e.g., roofs and roads, water and shadow). Bruzzone et al. [5] proposed a pixel-based system for the supervised classification of high spatial resolution images, aimed at obtaining accurate and reliable maps both by preserving the geometrical details in the images and by properly considering the spatial-context information. Tuia et al. [42] presented a classification method for very high resolution images that efficiently exploits multisource information, both spectral and spatial, through the combination of SVMs and composite kernels. Fauvel et al. [13] used kernel methods that handle the joint use of spatial and spectral information through a support vector machine formulation.

Such spectral-spatial methods are meaningful for land-cover classification, but they are not sufficient for urban mapping applications, since the impervious surfaces need to be elaborated into more detailed objects (e.g., tree, residential area, and water). Therefore, it is desirable to explore more effective algorithms, such as sparse representation and compressive sensing. Sparse representation has been an extremely powerful tool in many classical signal processing applications.

For sparse representation, Chen and Donoho proposed the Basis Pursuit (BP) algorithm [7]. BP is a principle for decomposing a signal into an optimal superposition of dictionary elements, where optimal means having the smallest \( \ell_1 \) norm of coefficients among all such decompositions. Mallat and Zhang [24] used an over-complete redundant dictionary for signal representation; their work gave rise to the Matching Pursuit (MP) algorithm for sparse reconstruction and pointed out that the sparser a signal is, the more accurate its reconstruction will be. Like BP, MP is a greedy algorithm, but it differs in that it is a local optimization method: its result may not converge and does not necessarily find the globally optimal solution. To address this convergence problem and obtain the best-matching signal, Tropp and Gilbert presented the Orthogonal Matching Pursuit (OMP) algorithm [38], which in each iteration orthogonalizes the selected atom set by the Gram-Schmidt procedure. Compared with MP, OMP requires fewer samples and fewer iterations to achieve the optimal result. Olshausen [28] pointed out that natural images have a sparse nature.
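As a concrete illustration of the OMP loop just described, the minimal Python sketch below correlates the residual with every atom, adds the best-matching atom to the support, and re-fits the coefficients by least squares; the names, array layout, and assumption of unit-norm atoms are illustrative, not taken from [38].

```python
import numpy as np

def omp(D, f, L):
    """Greedy OMP sketch: select at most L atoms (columns of D, assumed
    unit-norm) and re-fit the coefficients by least squares each iteration."""
    alpha = np.zeros(D.shape[1])
    residual = f.copy()
    support, coef = [], np.zeros(0)
    for _ in range(L):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:                  # no new atom improves the fit
            break
        support.append(k)
        # orthogonalization step: least-squares fit on the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], f, rcond=None)
        residual = f - D[:, support] @ coef
    alpha[support] = coef
    return alpha, residual
```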

Thereafter, sparse representation theory developed rapidly and has been employed effectively in the field of image processing [2, 12]. Furthermore, sparse optimization algorithms with better performance [33] were proposed. Building on the sparse theory, Donoho and Candès presented the concept of compressive sensing [10], which further developed sparse signal representation theory.

In recent years, sparse representation has been further studied in the literature [21, 22, 26, 42, 44, 45]. A nonlocal weighted joint sparse representation classification method [46] was proposed to improve remote sensing image classification, assigning different weights to different neighboring pixels around the central test pixel and using the simultaneous orthogonal matching pursuit technique. Moody et al. [26] presented a method for unsupervised land-cover classification in multispectral satellite imagery using sparse representations in learned dictionaries: clustering on sparse approximations and applying a Hebbian learning rule to build multispectral, multi-resolution dictionaries. In [22], Zhang et al. proposed a hyperspectral image anomaly detection approach using background joint sparse representation, which adaptively selects the most representative background bases for the local region. Zhang et al. [21] put forward a superpixel-level sparse representation classification framework with multitask learning for hyperspectral imagery; their algorithm exploits the class-level sparsity prior for multiple-feature fusion as well as the correlation and distinctiveness of pixels in a spatial local region. Yu et al. [44] proposed a remote sensing image classification method based on sparse component analysis, which yields more reliable and more accurate classification results.

3 Image Classification Model

In this paper, we focus on the classification of a given high-resolution image using DigitalGlobe's WorldView-2 satellite imagery. The main contribution of this paper is an efficient solution for image classification that builds the feature dictionary by a nearest neighbor joint sparse linear combination and applies a pursuit algorithm with joint sparse representation for image reconstruction. In this section, we introduce the key components of the proposed method: feature dictionary construction, sparse representation, and image reconstruction. The idea of constructing the dictionary is to find the best matrix to represent all data vectors by extracting features directly from the data itself via nearest neighbor search. We randomly select the training data set and construct the feature dictionary according to the classes by a sparse linear combination, and we describe the algorithm used to solve for the sparse representation. In our method, the sparse coefficients of the test samples are divided into several groups corresponding to the dictionary components representing specific classes, and the test samples of the image are expressed by this sparse representation. We then discuss how to determine the class of a test pixel. The proposed classification model, shown in Fig. 1, mainly consists of three steps: (1) feature dictionary construction, (2) sparse representation, and (3) classification decision.

Fig. 1

Proposed classification model

3.1 Sparse representation model

Let f be a pixel observation from an input signal with dimension l to be classified. In the sparse representation model, test spectral pixels, which lie approximately in the union of several subspaces, are represented approximately by a few training examples. Suppose we have T distinct classes, and each class has n training samples, which are trained into k dictionary elements. Test samples can then be modeled by the T subspaces corresponding to the T classes in the dictionary D. If the pixel f belongs to the ith class, we can represent f as a linear combination of the training data of the ith class. Thus, the test pixel f can be expressed as

$$ f = D\alpha = \left[\begin{array}{ccc} d_1^i & \cdots\ d_j^i\ \cdots & d_n^i \end{array}\right] \left[\begin{array}{c} \alpha_1^i \\ \vdots \\ \alpha_n^i \end{array}\right] = d_1^i\alpha_1^i + d_2^i\alpha_2^i + \cdots + d_n^i\alpha_n^i, \quad 1 \le i \le T,\ 1 \le j \le n, $$
(1)

where \( D = \{ d_j^i \}_{j=1,\,i=1}^{n,\,T} \) is a feature dictionary containing the n training atoms of each class, and \( \alpha^i \) is a sparse vector. The coefficient vector α of the sparse representation can be decomposed into T pieces, where each \( \alpha^i \) has only a few nonzero entries. Therefore, the sparse representation of the test pixel f can also be expressed as a linear combination of only K dictionary atoms, where \( K = \Vert \alpha \Vert_0 \) is the number of nonzero entries of α. Thus, f can be written as

$$ f = D\alpha = \left[\begin{array}{ccc} d_1^i & \cdots & d_K^i \end{array}\right] \left[\begin{array}{c} \alpha_1^i \\ \vdots \\ \alpha_K^i \end{array}\right] = d_1^i\alpha_1^i + d_2^i\alpha_2^i + \cdots + d_K^i\alpha_K^i, \quad 1 \le i \le T, $$
(2)

where K denotes the number of nonzero elements in the vector α. Next, we train a dictionary from a set of input samples. We will also introduce how to obtain the sparse vector α and how to classify test samples from it.
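To make the model in (1) and (2) concrete, the toy Python snippet below builds a random dictionary with hypothetical sizes (Q bands, T classes, n atoms per class, all chosen for illustration) and a pixel that is an exact combination of K = 2 atoms of one class:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, T, n = 4, 3, 5                       # hypothetical: 4 bands, 3 classes, 5 atoms/class
D = rng.random((Q, T * n))              # columns d_j^i, stacked class by class
labels = np.repeat(np.arange(T), n)     # class of each atom (the table W of Section 3.2)

alpha = np.zeros(T * n)                 # sparse coefficient vector
alpha[n + 1], alpha[n + 3] = 0.7, 0.3   # supported only on atoms of class i = 1
f = D @ alpha                           # test pixel as in (1)-(2)

K = np.count_nonzero(alpha)             # K = ||alpha||_0 = 2
```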

3.2 Feature space construction

We consider a method for constructing the dictionary that produces sparse representations for the training examples. Sparse coding is the procedure of computing the representation coefficients from the given examples and the dictionary. Here, we construct the feature dictionary from the input examples. In the proposed model, we assume that the spectral-feature pixels belonging to the same class approximately lie in the same subspace. The construction strategy of the feature dictionary is to model the best centers from the training examples so as to express the most distinct characteristics of the presented objects.

Given a remote sensing image with Q channels and N × M pixels as the input signal, let \( F = \{ f_{i,j}^l \}_{l=1,\,i=1,\,j=1}^{Q,\,N,\,M} \) (1 ≤ l ≤ Q, 1 ≤ i ≤ N, 1 ≤ j ≤ M), where N and M are the numbers of rows and columns, respectively. Suppose the image contains T distinct classes corresponding to different plants or objects, and each class has n training data. We select T types of representative samples from the training dataset and collect them into a sample set \( S = (s_1, \cdots, s_i, \cdots, s_T) \) (1 ≤ i ≤ T), where \( s_i \) is the subset corresponding to the ith class; it contains n data points \( [x_1^i, x_2^i, \cdots, x_n^i] \) with l bands. Then, we construct the feature matrix \( D = [d_1, d_2, \cdots, d_K] \), which can be viewed as a dictionary with a total of K (K ≪ NM) elements, where \( D \in \mathbb{R}^{Q \times K} \). Associated with this feature matrix, we also keep a class index table \( W = [I_1, \cdots, I_i, \cdots, I_K] \), where \( 1 \le I_i \le T \) and \( I_i \) records the class label of the dictionary element \( d_i \), i = 1, 2, ⋯, K. Each patch of the given training set is reshaped into a vector. For convenience, the image is rewritten as \( F = \{ f_j^l \}_{l=1,\,j=1}^{Q,\,NM} \) (1 ≤ l ≤ Q, 1 ≤ j ≤ NM), and it can be represented as a sparse linear combination of these feature vectors. The representation of F may be approximate, that is, \( F \approx D\alpha \), subject to the constraint \( \Vert F - D\alpha \Vert_2 \le \varepsilon \). The vector α contains the representation coefficients of the image F. We can write \( f_j = D\alpha_j \), where \( \alpha_j = e_p \) is a vector from the trivial basis, with all elements zero except the one in the pth position. The index p is selected such that

$$ \forall p \ne q,\quad \left\Vert f_j - D\alpha_p \right\Vert_2^2 \le \left\Vert f_j - D\alpha_q \right\Vert_2^2. $$
(3)

For the sparse representation of the data set F, the representation error is minimized in order to find the best possible dictionary D with K items. This can equivalently be posed as

$$ \langle D, W, \alpha \rangle = \arg \underset{\alpha}{\min} \left\Vert F - D\alpha \right\Vert_2^2 \quad \mathrm{s.t.}\quad \forall j,\ \alpha_j = e_{p_j}. $$
(4)
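Before the full training procedure (Algorithm 1 below), the following minimal Python sketch illustrates the coding constraint in (3) and (4): each pixel is assigned the trivial-basis vector of its nearest dictionary atom. The array layout (pixels as rows of F, atoms as columns of D) is our assumption for the sketch.

```python
import numpy as np

def nearest_atom_code(F, D):
    """Coding step of (3)-(4): each pixel (row of F, shape (NM, Q)) receives
    the trivial-basis vector e_p of its nearest dictionary atom (column of D)."""
    # squared distances ||f_j - d_p||^2 between every pixel and every atom
    dists = ((F[:, None, :] - D.T[None, :, :]) ** 2).sum(axis=2)  # (NM, K)
    p = dists.argmin(axis=1)                                      # index p, as in (3)
    alpha = np.zeros((F.shape[0], D.shape[1]))
    alpha[np.arange(F.shape[0]), p] = 1.0                         # alpha_j = e_p
    return alpha, p
```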

Algorithm 1 (Training a dictionary).

Task: Find the best matrix to represent all data vectors when constructing a dictionary by nearest neighbor search.

Input: A remote sensing image with Q channels and N × M pixels, \( F = \{ f_{i,j}^l \}_{l=1,\,i=1,\,j=1}^{Q,\,N,\,M} \), 1 ≤ l ≤ Q, 1 ≤ i ≤ N, 1 ≤ j ≤ M.

Initialization: Randomly select k (k = K/T) data points from the sample subset \( s_i \) as the initial representatives \( \varphi^i = [\varphi_1^i, \varphi_2^i, \cdots, \varphi_k^i] \); set i = 1 and repeat the following steps until i reaches T.

1: Compute the k best centers from the n data points of the training sample \( s_i \), where \( w_j^i \) records the index of the best possible representative for each data point:

\( w_j^i = \{ p \mid \forall p \ne q,\ \Vert x_j^i - \varphi_p^i \Vert_2 < \Vert x_j^i - \varphi_q^i \Vert_2 \},\quad 1 \le p, q \le k,\ 1 \le j \le n,\ 1 \le i \le T. \)

2: The representatives \( [\psi_1^i, \psi_2^i, \cdots, \psi_k^i] \) are obtained by the following formula:

\( \psi^i = \{ x_p^i \mid \forall p \ne q,\ \Vert x_p^i - x_m^i \Vert_2 < \Vert x_q^i - x_m^i \Vert_2 \},\quad x_p^i, x_q^i, x_m^i \in \varphi^i. \)

3: Update \( \varphi^i \) by \( \varphi^i = \psi^i \).

4: Go back to Step 2 and repeat until \( \psi^i \) is equal to \( \varphi^i \).

Output: A dictionary and a class vector.
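The Python sketch below illustrates one plausible reading of Algorithm 1 for a single class. In particular, the representative update of Step 2, which the text does not fully pin down, is interpreted here as a medoid-style selection within each group; this interpretation, along with all names and shapes, is our assumption.

```python
import numpy as np

def train_class_dictionary(X, k, max_iter=50, seed=0):
    """Sketch of Algorithm 1 for one class: X is (n, Q) training pixels of
    class i; returns k representative atoms phi (k, Q) and the assignment w.
    Step 2 is read as a medoid-style update: each representative becomes the
    member of its group closest to all other members."""
    rng = np.random.default_rng(seed)
    phi = X[rng.choice(len(X), k, replace=False)]        # random initialization
    for _ in range(max_iter):
        # Step 1: assign every sample to its nearest representative
        w = ((X[:, None] - phi[None]) ** 2).sum(-1).argmin(1)
        # Steps 2-3: update each representative from its assigned members
        psi = phi.copy()
        for c in range(k):
            members = X[w == c]
            if len(members) == 0:
                continue
            d = ((members[:, None] - members[None]) ** 2).sum(-1).sum(1)
            psi[c] = members[d.argmin()]                 # medoid of the group
        if np.allclose(psi, phi):                        # Step 4: convergence
            break
        phi = psi
    return phi, w

# The full dictionary D stacks the k atoms of each of the T classes, and the
# class table W records which class each atom came from.
```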

3.3 Reconstruction and classification

We now describe how the sparse vector α is used to reconstruct and classify a test sample \( f_j \) (1 ≤ j ≤ NM). At this point, the dictionary D has been obtained and is known. Every image patch \( f_j \) can be represented sparsely over this dictionary, and the representation \( \alpha_j \) satisfying \( D\alpha_j = f_j \) is obtained by solving the following optimization problem:

$$ {\widehat{\alpha}}_j= \arg\ \min {\left\Vert {\alpha}_j\right\Vert}_0\ s.t\ D{\alpha}_j={f}_j. $$
(5)

In order to search for the sparsest representation of \( f_j \), the equality constraint in (5) can be relaxed to an inequality one as

$$ {\widehat{\alpha}}_j= \arg\ \min {\left\Vert {\alpha}_j\right\Vert}_0\ s.t\ {\left\Vert D{\alpha}_j-{f}_j\right\Vert}_2\le \varepsilon, $$
(6)

where ε is the error tolerance. The above problem can also be considered as minimizing the approximation error within a certain sparsity level. We can compute the error residual as \( r_j = f_j - D\widehat{\alpha}_j \). Notice that the above optimization problem can be replaced by

$$ {\widehat{\alpha}}_j= \arg \kern0.2em \min {\left\Vert D{\alpha}_j-{f}_j\right\Vert}_2\ s.t\ {\left\Vert {\alpha}_j\right\Vert}_0\le L, $$
(7)

where L expresses the sparsity level for the approximation error. We then compute the residual for the ith class, that is, the error between the test sample \( f_j \) and its reconstruction from the training samples of the ith class. The class of \( f_j \) can be determined from the recovered sparse vector \( {\widehat{\alpha}}_j \) as

$$ c\left({f}_j\right)= \arg\ \underset{i}{ \max}\left|{D}^i{\widehat{\alpha}}_j^i\right|,\kern0.1em s.t\kern0.1em \min \parallel {f}_j-{D}^i{\widehat{\alpha}}_j^i{\parallel}_2,\forall i,\kern0.1em 1\le i\le T, $$
(8)

where \( {\widehat{\alpha}}_j^i \) denotes the portion of the recovered sparse coefficients corresponding to the training samples in the ith class.

Finally, we can obtain the classification result \( \widehat{F} \) for the image F as in (9).

$$ \widehat{F} = \left\{ f_j \mid f_j = \mathrm{color}_{W\left(j_0\right)},\ \forall c\left(f_j\right) \in W\left(j_0\right),\ 1 \le W\left(j_0\right) \le T,\ 1 \le j \le NM \right\} $$
(9)

Algorithm 2 (Reconstruction and Classification).

Task: Reconstruct the image and determine the classes of the test pixels.

Input: A normalized feature dictionary D, class vector W and sparsity level L.

Initialization: Set j = 1 and repeat it until j reaches NM.

1: Choose the index \( j_0 \), 1 ≤ \( j_0 \) ≤ K, such that \( \left|{\varphi_{j_0}}^T r_j\right| \) is maximized; that is, \( j_0 \) is the index of the atom in the class index table W whose inner product with the residual \( r_j \) has the largest magnitude, i.e., \( j_0 = \arg\max_{k=1,\cdots,K} \left|\left\langle r_j, \varphi_k \right\rangle\right| \).

2: Update the index set by \( I_j = I_{j-1} \cup \{ j_0 \} \) and the incrementally sized matrix by \( A_j = A_{j-1} \cup \left\{ \varphi_{j_0} \right\} \). Then, remove the current column vector \( \varphi_{j_0} \) from the dictionary D, denoted by \( D = D \backslash \left\{ \varphi_{j_0} \right\} \).

3: Decompose \( A_j \) by the singular value decomposition \( A_j = UZV^T \), where U and V are orthogonal matrices and Z is the diagonal matrix whose diagonal elements are the singular values (all other elements are zero). The sparse coefficients are then computed via the pseudo-inverse, \( \alpha_j = V Z^{-1} U^T f_j \).

4: Update the residual by \( r_j = f_j - A_j \alpha_j \), subject to \( \Vert \alpha_j \Vert_0 \le L \).

5: The class of f j can be determined by the recovered sparse vector \( {\widehat{\alpha}}_j \) as

\( c\left({f}_j\right)= \arg \underset{i}{ \max}\left|{D}^i{\widehat{\alpha}}_j^i\right| \), s. t \( \min {\left\Vert {f}_j-{D}^i{\widehat{\alpha}}_j^i\right\Vert}_2 \), ∀ i, 1 ≤ i ≤ T.

6: The final classification result \( \widehat{F} \) for the image F is obtained as

\( \widehat{F} = \left\{ f_j \mid f_j = \mathrm{color}_{W\left(j_0\right)},\ \forall c\left(f_j\right) \in W\left(j_0\right),\ 1 \le W\left(j_0\right) \le T,\ 1 \le j \le NM \right\} \).

7: j = j + 1

Output: The colored classification image \( \widehat{F} \).
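The Python sketch below combines steps 1-5 of Algorithm 2 for a single test pixel, using a pseudo-inverse for the SVD-based step 3. It is an illustrative reading under our assumptions about array shapes and class labeling (classes numbered 0 to T-1), not the authors' exact implementation.

```python
import numpy as np

def classify_pixel(f, D, W, L):
    """Sketch of Algorithm 2 for one test pixel f: OMP-style recovery of a
    sparse code over the dictionary D (Q x K) with class table W (length K),
    then the class decision of (8): the class whose portion of the recovered
    code reconstructs f with the smallest residual. Assumes L >= 1."""
    alpha = np.zeros(D.shape[1])
    residual, support = f.copy(), []
    coef = np.zeros(0)
    for _ in range(L):
        # Step 1: atom most correlated with the current residual
        j0 = int(np.argmax(np.abs(D.T @ residual)))
        if j0 in support:                 # atom already used: stop
            break
        support.append(j0)
        # Step 3: least-squares coefficients via the pseudo-inverse (SVD)
        A = D[:, support]
        coef = np.linalg.pinv(A) @ f
        # Step 4: update the residual
        residual = f - A @ coef
    alpha[support] = coef
    # Step 5: class-wise residuals, as in (8)
    errs = [np.linalg.norm(f - D[:, W == i] @ alpha[W == i])
            for i in range(int(W.max()) + 1)]
    return int(np.argmin(errs)), alpha    # class label and recovered code
```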

4 Experimental Results and Analysis

In this paper, we focus on classification using DigitalGlobe’s WorldView-2 satellite imagery. The sensor provides the highest resolution commercially available multispectral data and has eight multispectral bands: four standard bands (red, green, blue, and near-infrared 1) and four new bands. Ordered from shorter to longer wavelength, the list of bands is coastal blue, blue, green, yellow, red, red edge, near-infrared 1 (NIR1), and near-infrared 2 (NIR2).

In this section, two data sets are used for the experiments, and we adopt just three bands (red, green, and blue), as shown in Fig. 2. We illustrate the effectiveness of the proposed classification method by comparing it with traditional classifiers, which, following previous work in this field, can be divided into two categories. The first category, supervised methods, focuses on learning feature representations and requires training samples with class labels; examples are BPNN, SVM and CART. The second category, unsupervised methods, mainly focuses on feature extraction; an example is K-means. The experiments compare the performance of the proposed method with these four classifiers. The average accuracy (AA), overall accuracy (OA), Kappa coefficient (Ka), producer's accuracy (PA), and user's accuracy (UA) are used as the accuracy statistics. For each image, we quantitatively and visually compare and evaluate the classification results of these methods.

Fig. 2

The employed remote sensing images: (a) image 1 and (b) image 2

4.1 Experiment I for the image 1

The first dataset in our experiments was obtained from DigitalGlobe and was acquired on 17 May 2010, as shown in Fig. 2a. It contains eight typical classes, including bare land, residential area, grass, tree, and four different crops, which are labeled as: 1-bare land, 2-residential area, 3-grass, 4-tree, 5-crop1, 6-crop2, 7-crop3, and 8-crop4, respectively; please refer to Fig. 3a. We randomly select around 11 % of the samples with ground-truth class labels to train the classifiers, and use the rest as testing samples for evaluation. The number of training and testing samples for each class is shown in Table 1.

Fig. 3

Classification map for the image 1: (a) ground truth, (b) the proposed method, (c) SVM, (d) CART, (e) BPNN and (f) K-means. (Objects are labeled as: 1-bare land, 2-residential area, 3-grass, 4-tree, 5-crop1, 6-crop2, 7-crop3, and 8-crop4. And the upper legend is for Fig. 3b-e, the lower for  Fig. 3f.)

Table 1 The training and testing sets for each class (labeled as 1-bare land, 2-residential area, 3-grass, 4-tree, 5-crop1, 6-crop2, 7-crop3, and 8-crop4 in Fig. 3a)

In order to verify the superiority of our proposed method, we classify this image using the proposed method, BPNN, SVM, CART and K-means. Figure 3 shows the classification results of the five classifiers, which we then analyze and compare. The classification maps are shown in Fig. 3b-f. It is clear that in the K-means map, shown in Fig. 3f, all kinds of objects are severely misclassified. As for the BPNN method, shown in Fig. 3e, the objects depicted in bright colors, such as the bare land, residential areas, and crop2, are easy to recognize, whereas the green-colored objects, such as crop1, crop2, crop3, crop4, and especially the grass and trees, are seriously misclassified; many crop4 pixels are also wrongly labeled as grass in the classification map. For the SVM and CART results in Fig. 3c, d, severe misclassification among the classes can be clearly seen. In comparison with the other four methods, our proposed method achieves a great improvement, namely, a better distinction of objects, particularly the bare land, residential areas, grass, trees, crop3, and crop4; only crop1 and crop2 are classified with some confusion. The result of the proposed method is shown in Fig. 3b.

The classification accuracies for each class obtained by the different classifiers are provided in Table 2, where AA, OA, Ka, PA, and UA are statistics derived from the confusion matrix. Table 3 lists the confusion matrix of the proposed method. AA is the mean of the eight class accuracies. OA is computed as the ratio between the number of correctly classified testing samples and the total number of testing samples. The Kappa (Ka) coefficient quantitatively measures the agreement between the classification map and the ground truth based on the confusion matrix.
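For reference, these statistics can be computed directly from the confusion matrix; the short Python sketch below shows the standard formulas, assuming rows of C hold the ground-truth classes and columns the predicted labels (an assumption about the table layout).

```python
import numpy as np

def accuracy_stats(C):
    """Standard accuracy statistics from a confusion matrix C, where C[i, j]
    counts testing samples of true class i labeled as class j."""
    total = C.sum()
    diag = np.diag(C)
    OA = diag.sum() / total                       # overall accuracy
    PA = diag / C.sum(axis=1)                     # producer's accuracy per class
    UA = diag / C.sum(axis=0)                     # user's accuracy per class
    AA = PA.mean()                                # average accuracy
    # chance agreement for the Kappa coefficient
    pe = (C.sum(axis=1) * C.sum(axis=0)).sum() / total ** 2
    Ka = (OA - pe) / (1 - pe)
    return OA, AA, Ka, PA, UA
```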

Table 2 The classification accuracies for different methods in Fig. 2a
Table 3 The confusion matrix for the proposed method in Fig. 2a

Combining the classification maps in Fig. 3b-f with the accuracy statistics in Tables 2 and 3, we can see that, according to the ground truth, some crop2 pixels are wrongly labeled as crop1, while some crop1 pixels are misclassified as crop2 and residential area in the map of the proposed method; crop1 and crop4 cannot be identified in the BPNN map; and each object is fragmented into several colors in the SVM, CART and K-means maps, for example, some bare-land pixels are misclassified as tree and crop3 in the SVM map, and some tree pixels are wrongly labeled as residential area, crop1 and crop2 in the CART map.

From Table 2, we can observe that the proposed method achieves the highest PA and UA and performs best in AA, OA, and the Kappa coefficient, yielding the best classification results for the different objects. The OA values of the proposed method, BPNN, SVM, CART and K-means are 91.22 %, 67.17 %, 20.55 %, 10 % and 51.27 %, respectively. The Ka values are 0.8984, 0.6038, 0.0785, 0.0286 and 0.3929, and the AA values are 90.28 %, 57.41 %, 10.53 %, 10 % and 47.88 %. Compared with the BPNN, SVM, CART and K-means classifiers, the PA values of the proposed method for each class are increased by at least 8 %, 51.5 %, 55.75 %, and 16 %, respectively, and the UA values for each class are increased by at least 3.35 %, 65 %, 14.689 %, and 11.49 %. Moreover, the PA values of the proposed method for all classes are increased on average by 32.88 %, 75.75 %, 80.28 %, and 42.41 %, respectively, and the UA values for all classes are increased on average by 33.79 %, 73.34 %, 75.99 %, and 48.47 %. Table 3 shows the confusion matrix for Fig. 2a.

4.2 Experiment II for the image 2

In order to verify the stability of the proposed classification method, we select another HSR WorldView-2 image, shown in Fig. 2b. It is a true-color image with 1.8-m spatial resolution of a suburban area and has eight bands, of which we again adopt just three (red, green, and blue). This image contains six main kinds of objects: 1-lake, 2-tree, 3-short bush, 4-road, 5-grass, and 6-residential area, as shown in Fig. 4a. The training and testing samples are chosen from the reference data.

Fig. 4

Classification map for the image 2: (a) ground truth, (b) proposed method, (c) SVM, (d) CART, (e) BPNN and (f) K-means. (Objects are labeled as: 1-lake, 2-tree, 3-short bush, 4-road, 5-grass, and 6-residential area. And the upper legend is for Fig. 4b-e, the lower for  Fig. 4f.)

We also apply the proposed method, BPNN, SVM, CART and the K-means classifier to image 2. The classification maps are shown in Fig. 4b-f. By comparing the classification maps in Fig. 4b-f with the original image in Fig. 4a, we can see that some pixels of residential area and grass are misclassified as road in Fig. 4b; some pixels of lake, short bush and residential area are labeled as tree, some pixels of road are labeled as residential area, and some pixels of residential area are labeled as road in Fig. 4c; some pixels of lake and grass are labeled as tree, some pixels of lake and road are labeled as residential area, and some pixels of residential area are labeled as road in Fig. 4d; and in Fig. 4e the most obvious error is the misclassification of grass as residential area, while some pixels of road are misclassified as residential area and the lake and bush are both misclassified as tree. In Fig. 4f, the tree and residential area are obviously muddled, some pixels of bush are misclassified as grass, some pixels of grass are misclassified as lake, and tree is seriously misclassified as lake.

From the results in Tables 4 and 5, we can see that the highest accuracies are again achieved by the proposed method. The OA values of the proposed method, BPNN, SVM, CART and K-means are 92.5 %, 83.86 %, 65.71 %, 65.21 % and 66 %, respectively. The Ka values are 0.9082, 0.8030, 0.5896, 0.5853 and 0.5872, and the AA values are 91.88 %, 82.89 %, 64.17 %, 72.17 % and 64.08 %. Compared with the BPNN, SVM, CART and K-means classifiers, the PA values of the proposed method for each class are increased by at least 1.5 %, 3.5 %, 11 % and 11.5 %, respectively. Moreover, the PA values of the proposed method for all objects are increased on average by 8.99 %, 23.71 %, 19.71 % and 27.79 %, respectively, and the UA values for all classes are increased on average by 9.95 %, 18.42 %, 20.52 % and 33.06 %. The best classification results for the different objects are again obtained by the proposed method. Table 5 shows the confusion matrix for Fig. 2b.

Table 4 The classification accuracies for different methods in Fig. 2b
Table 5 The confusion matrix for the proposed method in Fig. 2b

Finally, we compare the different methods in terms of computational cost, measured as the CPU time reported by Matlab. As can be seen from Table 6 and Fig. 5, the proposed method takes about 112 s on the first dataset and about 48 s on the second dataset to train the dictionary and make a decision; its computational cost is almost the same as that of SVM, more than that of BPNN and CART, but far less than that of K-means.

Table 6 Computational cost (CPU time) of the different methods in Fig. 5
Fig. 5

Computational cost (CPU time) of the different methods

5 Conclusion and Future Work

In this paper, we tackle the problem of HSR remote sensing image classification with a proposed method based on sparse representation, which represents the test samples over a dictionary. The dictionary is obtained by training samples according to their classes for a sparse linear combination. We discuss the specific idea of constructing the dictionary, namely, how to compute the best matrix to represent all data vectors by nearest neighbor search, and describe the algorithm used to solve for the sparse representation. We then discuss how to reconstruct the image and determine the classes of the test pixels. The experimental results indicate that our method performs better and achieves higher accuracies on the real remote sensing images evaluated.

In future work, we plan to explore more properties as features and to combine the proposed approach with spectral and spatial features, both at the feature and decision levels, to improve classification accuracy. We also intend to speed up the proposed method. The end goal of this work is to detect yearly and seasonal changes in vegetation cover. Additionally, we will explore how to construct dictionaries that extend to images of the same area in different seasons, and make use of them for change detection.