1 Introduction

Remotely sensed hyperspectral images (HSI) are images taken from drones, airplanes or satellites that record a wide range of electromagnetic spectrum, typically more than 100 spectral bands from visible to near-infrared wavelengths. Since different materials reflect different spectral signatures, one can identify the materials at each pixel of the image by examining its spectral signatures. HSI is used in many applications, including agriculture [1, 2], disaster relief [3, 4], food safety [5, 6], military [7, 8] and mineralogy [9].

One of the most important problems in hyperspectral data exploitation is HSI classification. It has been an active research topic in past decades [10, 11]. The pixels in the hyperspectral image are often labeled manually by experts based on careful review of the spectral signatures and investigation of the scene. Given these ground-truth labels of some pixels (also called “training pixels”), the objective of HSI classification is to assign labels to part or all of the remaining pixels (the “testing pixels”) based on their spectral signatures and their locations.

Numerous methods have been developed for HSI classification. Among these, machine learning is a well-studied approach. It includes multinomial logistic regression [12,13,14], artificial neural networks [15,16,17,18,19], and support vector machines (SVMs) [20,21,22]. Since our method is partly based on SVMs, we will discuss it in more detail here. Early SVM classification methods [23, 24] perform pixel-wise classification that utilizes spectral information but not spatial dependencies. Numerous spectral–spatial SVM classification methods have been introduced since then. They show better performance when compared to the pixel-wise SVM classifiers. Here we discuss some of them.

SVMs with composite kernels [25] use composite kernels that are weighted summations of spectral kernels and spatial kernels. The spatial information is extracted by taking the average of the spectra in a fixed window around each pixel. To further utilize the spatial information, the method in [26] first applies superpixel segmentation to break the hyperspectral image into small regions with flexible shapes and sizes. Then it extracts the spatial information based on the segmentation and finally performs the classification using SVMs with multiple kernels. In [27], a pixel-wise SVM classification is first used to produce classification maps and then a partitional clustering is applied to obtain a segmentation of the hyperspectral image. Then a majority vote scheme is used in each cluster and finally a filter is applied to denoise the result. The method in [28] first produces pixel-wise classification maps using SVMs and then applies edge-preserving filtering to the classification maps. In addition to these methods, techniques based on Markov random fields [29], segmentation [26, 27, 30, 31] and morphological profiles [31, 32] have also been incorporated into SVMs to exploit the spatial information.

Besides machine learning approaches, another powerful approach is sparse representation [33]. It is based on the observation that spectral signatures within the same class usually lie in a low-dimensional subspace; therefore test data can be represented by a few atoms in a training dictionary. A joint sparse representation method is introduced in [34] to make use of the spatial homogeneity of neighboring pixels. In particular, each testing pixel and its neighboring pixels inside a fixed window are jointly sparsely represented. In [35], a kernel-based sparse algorithm is proposed which incorporates the kernel functions into the joint sparse representation method. It uses a fixed size local region to extract the spatial information. Approaches with more flexible local regions were proposed in [36] and [37]. They incorporate a multiscale scheme and superpixel segmentation into the joint sparse representation method, respectively. Multiple-feature-based adaptive sparse representation was proposed in [38]. It first extracts various spectral and spatial features and then the adaptive sparse representations of the features are computed. The method in [39] first estimates the pixel-wise class probabilities using SVMs and then applies sparse representation to obtain superpixel-wise class probabilities in which spatial information is utilized and the final result is obtained by combining both probabilities.

A pixel-wise classifier such as SVM considers only spectral information. It generates results with decent accuracy but would appear “noisy” as spatial information is not used, see [23] and also Fig. 1. Segmentation techniques have been used to incorporate the spatial information, see [26, 27, 30, 31]. Indeed, image segmentation is a well-studied subject in image processing and numerous effective segmentation methods for noisy images have been introduced [40,41,42,43,44,45]. Among them, a variational method called the Mumford–Shah model [40, 41] is one of the most important and successful image segmentation techniques. In this paper, we propose a simple but effective two-stage classification method inspired by our previous methods for image segmentation [43,44,45] which are based on the Mumford–Shah model. In the first stage, we apply a pixel-wise SVM method that exploits the spectral information to estimate a pixel-wise probability map for each class. In the second stage, we apply a convex variant of the Mumford–Shah model to denoise the maps and exploit the spatial information so as to segment the image into different classes accurately. Traditional methods like that in [27] apply a pixel-wise classification to obtain an initialization. Then they use a segmentation algorithm followed by a denoising algorithm to do the classification. In comparison, in our proposed method, since our convex Mumford–Shah model performs denoising and segmentation simultaneously, we just need one step here. Besides, because of the superior segmentation accuracy of our convex Mumford–Shah model, our method has much better classification results.

Fig. 1
figure 1

An example of classification result using pixel-wise SVM classifiers

Our method utilizes spectral information in the first stage and spatial information in the second stage. Experiments show that our method generates very accurate results when compared to the state-of-the-art methods on real HSI data sets, especially when the inter-class spectra are similar. This is because our method can effectively exploit the spatial information even when the other methods cannot distinguish between the spectra. Moreover, our method has a much smaller number of parameters and shorter computation time than the state-of-the-art methods.

This paper is organized as follows. In Sect. 2, support vector machines and variational methods for denoising and segmentation are reviewed. In Sect. 3, our proposed two-stage classification method is presented. In Sect. 4, experimental results are presented to illustrate the effectiveness of our method. Sect. 5 concludes the paper.

2 Support Vector Machines and Variational Methods

2.1 Review of \(\nu \)-Support Vector Classifiers

Support vector machines (SVMs) have been used successfully in pattern recognition [46], object detection [47, 48], and financial time series forecasting [49, 50] etc. This approach also has superior performance in hyperspectral classification, especially when the dimensionality of the data is high and the number of training data is limited [23, 24]. In this subsection, we review the \(\nu \)-support vector classifier (\(\nu \)-SVC) [22] which will be used in the first stage of our method.

Consider for simplicity a supervised binary classification problem. We are given m training data \(\{\mathbf {x}_{i}\}_{i=1}^m\) in \(\mathbb {R}^{d_1}\), and each data is associated with a binary label \(y_i \in \{-1,+1\}\) for \(i=1,2,\ldots ,m\). In the training phase of SVM, one aims to find a hyperplane to separate the two classes of labels and maximize the distance between the hyperplane and the closest training data, which is called the support vector. In the kernel SVM, the data is mapped to a higher-dimensional feature space by a feature map \(\phi :\mathbb {R}^{d_1}\rightarrow \mathbb {R}^{d_2}\) in order to improve the separability between the two classes.

The \(\nu \)-SVC is an advanced support vector classifier which enables the user to specify the maximum training error before the training phase. Its formulation is given by

$$\begin{aligned} \left\{ \begin{array}{l} \quad \underset{\mathbf {w},b,\varvec{\xi },\rho }{\text {min }}\, \frac{1}{2}\Vert \mathbf {w}\Vert _2^2 - \nu \rho + \frac{1}{m} \sum \limits _{i=1}^{m} \xi _i \\ \text {subject to: } \\ y_i(\mathbf {w}\cdot \phi (\mathbf {x}_{i})+b)\ge \rho -\xi _{i},\; i=1,2,\ldots ,m, \\ \quad \xi _i \ge 0, \; i=1,2,\ldots ,m, \\ \quad \quad \rho \ge 0, \end{array}\right. \end{aligned}$$
(1)

where \(\mathbf {w} \in \mathbb {R}^{d_2}\) and \(b\in \mathbb {R}\) are the normal vector and the bias of the hyperplane, respectively, \(\xi _i\)’s are the slack variables which allow training errors, and \(\rho /\Vert \mathbf {w}\Vert _2\) is the distance between the hyperplane and the support vector. The parameter \(\nu \in (0,1]\) is shown to be an upper bound on the fraction of training errors [22].

The optimization problem (1) can be solved through its Lagrangian dual

$$\begin{aligned} \left\{ \begin{array}{l} \quad \underset{\mathbf {\varvec{\alpha }}}{\text {max }}\, -\frac{1}{2}\sum \limits _{i,j=1}^{m} \alpha _{i}\alpha _{j}y_{i}y_{j}K(\mathbf {x}_i,\mathbf {x}_j) \\ \text {subject to: } 0\le \alpha _i \le \frac{1}{m},\; i=1,2,\ldots ,m,\\ \quad \quad \sum \limits _{i=1}^{m}\alpha _i y_i=0, \\ \quad \quad \sum \limits _{i=1}^{m} \alpha _i \ge \nu . \end{array}\right. \end{aligned}$$
(2)

Its optimal Lagrange multipliers can be calculated using quadratic programming methods [51]. After obtaining them, the parameters of the optimal hyperplane can be represented by the Lagrange multipliers and the training data. The decision function for a test pixel \(\mathbf {x}\) is given by

$$\begin{aligned} \begin{aligned}&g(\mathbf {x})=\text {sgn}(f(\mathbf {x})),\\&\text {where }f(\mathbf {x})= \sum \limits _{i=1}^{m} \alpha _i y_i K(\mathbf {x}_i,\mathbf {x})+b. \end{aligned} \end{aligned}$$
(3)

Mercer’s Theorem [51, p. 423-424] states that a symmetric function K can be represented as an inner product of some feature maps \(\phi \), i.e., \(K(\mathbf {x},\mathbf {y})=\phi (\mathbf {x})\cdot \phi (\mathbf {y})\) for all \(\mathbf {x},\mathbf {y}\), if and only if K is positive semi-definite. In that case, the feature map \(\phi \) need not be known in order to perform the training and classification, but only the kernel function K is required. Examples of K satisfying the condition in Mercer’s Theorem include: \(K(\mathbf {x}_i,\mathbf {x}_j) = \text {exp}(-\Vert \mathbf {x}_i-\mathbf {x}_j\Vert ^2/(2\sigma ^2))\) and \(K(\mathbf {x}_i,\mathbf {x}_j) =(\mathbf {x}_i \cdot \mathbf {x}_j + 1)^p\).

2.2 Review of Variational Methods for Denoising and Segmentation

Let \(\varOmega =\{1,\ldots ,N_1\}\times \{1,\ldots ,N_2\}\) be the index set of pixel locations of an image, \(\mathbf{v}\) be the noisy image and \(\mathbf{u}\) be the restored image. One famous variational method to denoise images with Gaussian noise is the total variation (TV) method [52]. It involves an optimization model with a TV regularization term which corresponds to the function \(\Vert \nabla \cdot \Vert _1\). However, it is known that it reproduces images with staircase effect, i.e., with piecewise constant regions. One approach to improve it is to add a higher-order term, see, e.g., [53,54,55,56,57]. In [56], the authors considered minimizing

$$\begin{aligned} H(\mathbf {u})=\frac{1}{2}\Vert \mathbf {v}-\mathbf {u}\Vert _2^2+\alpha _1\Vert \nabla \mathbf {u}\Vert _1+\frac{\alpha _2}{2}\Vert \nabla \mathbf {u}\Vert _2^2. \end{aligned}$$
(4)

Here the first term is the \(\ell _2\) data-fitting term that caters for Gaussian noise. The second term is the TV term while the third term is the extra higher-order term added to introduce smoothness to the restored image \(\mathbf {u}\). By setting the parameters \(\{\alpha _i\}_{i=1}^2\) appropriately, one can control the trade-off between a piece-wise constant and a piece-wise smooth \(\mathbf {u}\).

In [43,44,45], the authors derived the same minimizational function (4) as a convex approximation of the Mumford–Shad model for segmentation. In [43,44,45], (4) is first applied to obtain a smooth denoised image and then thresholding is applied to the restored image to obtain the segmentation. The method is successful for segmenting greyscale and color images corrupted by different noises (Gaussian, Poisson, Gamma), information loss and/or blur. We note that the denoising and segmentation are intimately related. Indeed, Cai and Steidl showed that the famous Chan–Vese segmentation model [58] can be obtained by thresholding the TV denoising model with some properly chosen regularization parameter, see [59] for more details.

The two-stage approach for denoising has also been applied to impulse noise removal, see [60]. In the first stage a standard impulse noise detector, the Adaptive Median Filter [61], is used to detect the locations of possible noisy pixels. Then in the second stage, it restores the noisy pixels while keeping the non-noisy pixels unchanged by minimizing

$$\begin{aligned} \begin{aligned} F(\mathbf {u})&= \Vert \mathbf {v}-\mathbf {u}\Vert _1+\frac{\beta }{2} \Vert \nabla \mathbf {u}\Vert ^{\alpha }_{\alpha },\\ \text {s.t. }&\mathbf {u}|_{\varUpsilon }=\mathbf {v}|_{\varUpsilon }, \end{aligned} \end{aligned}$$
(5)

where \(\varUpsilon \) is the set of non-noisy pixels detected by the Adaptive Median Filter, \(\mathbf {u}|_{\varUpsilon }=(u_i)_{i\in \varUpsilon }\), and \(1 < \alpha \le 2\). We remarked that in [62], Nikolova showed that the 1-norm data-fitting term (used in (5) above) is the correct norm for impulse noise. This two-stage method is the first method that can successfully restore images corrupted with extremely high level of impulse noise (e.g., 90%).

Our proposed method is inspired by the image denoising/segmentation methods in [43,44,45, 56, 60], which apply (4) successfully to denoise/segment images with various noises. In the first stage of our proposed method, we use the spectral classifier \(\nu \)-SVC to generate a pixel-wise probability map for each class. Then in the second stage, we use a combination of (4) and (5) to denoise and segment the result from the first stage.

3 Our Two-Stage Classification Method

SVMs yield decent classification accuracy [23], but their results can be noisy (see Fig. 1) since only spectral information is used. We therefore propose to use an image denoising/segmentation scheme to incorporate the spatial information into the classification. Our method first estimates the pixel-wise probability map for each class using SVMs. Then the spatial positions of the training pixels are used in a variational denoising/segmentation method to effectively segment the image into different classes.

3.1 First Stage: Pixel-Wise Probability Map Estimation

3.1.1 SVM Classifier

HSI classification is a multi-class classification, but the SVM is a binary classifier. To extend SVM to multi-class, we use the One-Against-One (OAO) strategy [63] where \([c(c-1)/2]\) SVMs are built to classify every possible pair of classes. Here c is the number of classes. In this paper, we choose the SVM method \(\nu \)-SVC [22] with OAO strategy for the HSI multiclass classification in our first stage. Moreover, the radial basis function kernel (RBF kernel) [21] is used as the kernel function in our SVM method. The RBF kernel is defined as

$$\begin{aligned} K(\mathbf {x}_i,\mathbf {x}_j) = \text {exp}\Big (-\frac{\Vert \mathbf {x}_i-\mathbf {x}_j\Vert ^2}{2\sigma ^2}\Big ). \end{aligned}$$
(6)

We remark that one can use other SVMs, other multiclass strategies such as the One-Against-All strategy in [63], or other kernel functions such as the polynomial kernel [21] instead.

3.1.2 Probability Estimation of SVM Outputs

Given a testing pixel \(\mathbf{x}\) and a SVM classifier with decision function \(f(\mathbf{x})\) in (3), we can label \(\mathbf{x}\) with a class according to the sign of \(f(\mathbf{x})\), see [21]. Under the OAO strategy, there are \([c (c-1)]/2\) such pairwise functions \(f_{h,l}\), \(1\le h,l \le c\), \(h\ne l\). We use them to estimate the probability \({p_h}\) that \(\mathbf {x}\) is in the h-th class. The idea is given in [64, 65]. We first estimate the pairwise class probability \(\mathrm{Prob}(y=h \ | \ y=h \text { or } y=l)\) by computing

$$\begin{aligned} r_{h,l}=\frac{1}{1+e^{\eta f_{h,l}(\mathbf {x})+\tau }}, \end{aligned}$$
(7)

where \(\eta \) and \(\tau \) are computed by minimizing a negative log likelihood problem over all the training pixels [64].

Fig. 2
figure 2

Examples of probability maps on Indian Pines before and after the second stage. Here, completely white represents probability one and completely black represents probability zero

Then the probability vector \(\mathbf {p}=[p_1,p_2,\ldots ,p_c]^\intercal \) of the testing pixel \(\mathbf{x}\) is estimated by solving

$$\begin{aligned}&\underset{\mathbf {p}}{\text {min}}\; \frac{1}{2}\sum _{h=1}^{c}\sum _{l\ne h}(r_{l,h}p_h-r_{h,l}p_l)^2, \nonumber \\&\text {s.t. }p_h \ge 0, \forall h,\; \sum _{h=1}^{c}p_h=1. \end{aligned}$$
(8)

By [65], its optimal solution can be obtained by solving the simple \((c +1)-(c +1)\) linear system

$$\begin{aligned} \begin{bmatrix} Q &{} \mathbf {e} \\ \mathbf {e}^{\intercal } &{} {0} \end{bmatrix} \begin{bmatrix} \mathbf {p} \\ {b} \end{bmatrix} = \begin{bmatrix} \mathbf {0} \\ {1} \end{bmatrix}, \end{aligned}$$
(9)

where

$$\begin{aligned} Q_{hl} = {\left\{ \begin{array}{ll} \sum \limits _{s\ne h} r_{s,h}^2 \; &{}\text {if } h=l, \\ -r_{l,h}r_{h,l} &{} \text {if } h\ne l, \end{array}\right. } \end{aligned}$$

b is the Lagrange multiplier of the equality constraint in (8), \(\mathbf {e}\) is the c-vector of all ones, and \(\mathbf {0}\) is the c-vector of all zeros. In our tests, the probability vectors \(\mathbf {p}(\mathbf{x})\) for all testing pixels \(\mathbf{x}\) are computed by this method using the toolbox of LIBSVM library [66].

Fig. 3
figure 3

Spectra of training pixels of Indian Pines data

We finish Stage 1 by forming the 3D tensor \(\mathcal {V}\) where \(\mathcal {V}_{i,j,k}\) gives the probability that pixel (ij) is in class k. More specifically, if pixel (ij) is a testing pixel, then \(\mathcal {V}_{i,j,:}=\mathbf {p}(\mathbf{x}_{i,j})\); if pixel (ij) is a training pixel belonging to the c-th class, then \(\mathcal {V}_{i,j,c}=1\) and \(\mathcal {V}_{i,j,k}=0\) for all other k’s.

3.2 Second Stage: Denoising/Segmentation of the Pixel-Wise Probability Map

Given the probability tensor \(\mathcal {V}\) obtained in Stage 1, one can obtain an HSI classification by taking the maximum probability for each pixel [28]. However, the result will appear noisy as no spatial information is taken into account. The goal of our second stage is to incorporate the spatial information into \(\mathcal {V}\) by a denoising/segmentation method that keeps the value of the training pixels unchanged during the optimization, as their ground-truth labels are given a priori.

Let \(\mathbf{v} _{k}:=\mathcal {V}_{:,:,k}\), \(k=1,\ldots ,c\), be the probability map of the k-th class obtained from Stage 1. We improve them by minimizing

$$\begin{aligned} \begin{aligned} \underset{\mathbf {u}}{\text {min }}&\frac{1}{2}\Vert \mathbf {u}-\mathbf {v}_k\Vert _2^2+\beta _{1}\Vert \nabla \mathbf {u}\Vert _1+\frac{\beta _{2}}{2}\Vert \nabla \mathbf {u}\Vert _2^2,\\ \text {s.t. }&\mathbf {u}|_{\varUpsilon }=\mathbf {v}_k|_{\varUpsilon }, \end{aligned} \end{aligned}$$
(10)

where \(\beta _1\), \(\beta _2\) are regularization parameters and \({\varUpsilon }\) is the set of training pixels. We choose this minimization functional because it gives superb performance in denoising [56] and segmentation [43,44,45]. The higher-order \(\Vert \nabla \mathbf {u}\Vert _2^2\) term encourages smoothness of the solution and can improve the classification accuracy, see Sect. 4.4. In our tests, we use anisotropic TV [67] and periodic boundary condition for the discrete gradient operator, see [68, p. 258].

Table 1 Number of training/testing pixels and classification accuracies for Indian Pines data set

Alternating direction method of multipliers (ADMM) [69] is used to solve (10). First, we rewrite (10) as

$$\begin{aligned} \begin{aligned} \underset{\mathbf {u}}{\text {min }}&\frac{1}{2}\Vert \mathbf {u}-\mathbf {v}_k\Vert _2^2+ \beta _{1}\Vert \mathbf {s}\Vert _{1}+ \frac{\beta _{2}}{2}\Vert D \mathbf {u}\Vert _2^2+ \iota _\mathbf {w}\\ \text {s.t. }&\mathbf {s} = D \mathbf {u} \mathrm{\ and \ } \mathbf {w} = \mathbf {u}. \end{aligned} \end{aligned}$$
(11)

Here D denotes the discrete operator of \(\nabla \), \(D=\begin{pmatrix}D_x \\ D_y \\ \end{pmatrix} \in \mathbb {R}^{2n\times n}\), where \(D_x\) and \(D_y\) are the first-order difference matrices in the horizontal and vertical directions, respectively, and n is the total number of pixels in the hyperspectral image, \(\iota _\mathbf {w}\) is the indicator function, where \(\iota _\mathbf {w}=0\) if \(\mathbf {w}|_{\varUpsilon }=\mathbf {v}_k|_{\varUpsilon }\) and \(\iota _\mathbf {w}=\infty \) otherwise. Its augmented Lagrangian is given by

$$\begin{aligned} \begin{aligned}&L(\mathbf {u},\mathbf {s},\mathbf {w},\varvec{\lambda })\\&\quad =\frac{1}{2}\Vert \mathbf {u}-\mathbf {v}_k\Vert _2^2+\beta _1\Vert \mathbf {s}\Vert _1+\frac{\beta _{2}}{2}\Vert D\mathbf {u}\Vert _2^2\\&\qquad + \iota _\mathbf {w}+\frac{\mu }{2}\Vert E\mathbf {u}-\mathbf{g}-\varvec{\lambda }\Vert _2^2, \end{aligned} \end{aligned}$$
(12)

where \(\mu >0\) is a positive constant, \(E = \begin{pmatrix}D \\ I \\ \end{pmatrix}\), \(\mathbf{g}=\begin{pmatrix}\mathbf {s} \\ \mathbf {w} \\ \end{pmatrix}\) and \({\varvec{\lambda }}= \begin{pmatrix}{\varvec{\lambda }}_1 \\ {\varvec{\lambda }}_2 \\ \end{pmatrix}\) the Lagrange multipliers.

The formulation (12) allows us to solve \(\mathbf {u}\) and \(\mathbf{g}\) alternately as follows:

$$\begin{aligned} \mathbf {u}^{(t+1)}&=\underset{\mathbf {u}}{\text {argmin }}\ \bigg \{ \frac{1}{2}\Vert \mathbf {u}-\mathbf {v}_k\Vert _2^2+\frac{\beta _{2}}{2}\Vert D\mathbf {u}\Vert _2^2 \nonumber \\&\;\;\;\; +\frac{\mu }{2}\Vert E\mathbf {u}-\mathbf{g}^{(t)}-{\varvec{\lambda }}^{(t)}\Vert _2^2 \bigg \} \end{aligned}$$
(13a)
$$\begin{aligned} \mathbf{g}^{(t+1)}&= \underset{\mathbf{g}}{\text {argmin }}\ \bigg \{ \beta _1\Vert \mathbf {s}\Vert _1 + \iota _\mathbf {w} \nonumber \\&\;\;\;\; +\frac{\mu }{2}\Vert E\mathbf {u}^{(t+1)}-\mathbf{g}-{\varvec{\lambda }}^{(t)}\Vert _2^2 \bigg \} \end{aligned}$$
(13b)
$$\begin{aligned} {\varvec{\lambda }}^{(t+1)}&= {\varvec{\lambda }}^{(t)}-E\mathbf {u}^{(t+1)}+\mathbf{g}^{(t+1)} \end{aligned}$$
(13c)

The \(\mathbf {u}\)-subproblem (13a) is a least squares problem. Its solution is

$$\begin{aligned} \begin{aligned} \mathbf {u}^{(t+1)}&=(I+\beta _2D^{\intercal }D+\mu E^{\intercal }E)^{-1} \\&\quad (\mathbf {v}_k+\mu E^{\intercal }(\mathbf{g}^{(t)}+{\varvec{\lambda }}^{(t)})). \end{aligned} \end{aligned}$$
(14)

Since periodic boundary conditions are used, the solution can be computed efficiently using the two-dimensional fast Fourier transform (FFT) [70] in \(O(n\log n)\) complexity.

For the \(\mathbf{g}\)-subproblem (13b), the optimal \(\mathbf {s}\) and \(\mathbf {w}\) can be computed separately as follows:

$$\begin{aligned} \begin{aligned} \mathbf {s}^{(t+1)}&= \underset{\mathbf {s}}{\text {argmin }}\ \bigg \{ \beta _1\Vert \mathbf {s}\Vert _1 \\&\quad +\frac{\mu }{2} \Vert D \mathbf {u}^{(t+1)}-\mathbf {s} - {\varvec{\lambda }}_1^{(t)} \Vert _2^2 \bigg \} \end{aligned} \end{aligned}$$
(15)

and

$$\begin{aligned} \mathbf {w}^{(t+1)} = \underset{\mathbf {w}}{\text {argmin }}\ \bigg \{ \iota _\mathbf {w}+\frac{\mu }{2} \Vert \mathbf {u}^{(t+1)}- \mathbf {w} -{\varvec{\lambda }}_2^{(t)}\Vert _2^2 \bigg \} \end{aligned}$$
(16)

The solution of (15) can be obtained by soft thresholding [71], i.e.,

$$\begin{aligned} \begin{aligned} {[\mathbf {s}^{(t+1)}]}_{i}&= \text {sgn}([\mathbf{r}]_i)\cdot \text {max}\{|[\mathbf{r}]_i|-\frac{\beta _1}{\mu },0\}, \\&\;\;\;\;\;i=1,\ldots ,2n, \end{aligned} \end{aligned}$$
(17)

where \(\mathbf{r}=D\mathbf {u}^{(t+1)}-{\varvec{\lambda }}_1^{(t)}\). The solution of (16) is simply

$$\begin{aligned}{}[\mathbf {w}^{(t+1)}]_i = \left\{ \begin{array}{ll} [\mathbf {v}_k]_i &{} \mathrm{if } \ i \in {\varUpsilon } ,\\ {[\mathbf {u}^{(t+1)}-{\varvec{\lambda }}_2^{(t)}]}_i &{} \mathrm{otherwise.} \end{array} \right. \end{aligned}$$
(18)

The computation of (13c), (17) and (18) has a computational complexity of O(n). Hence, the computational complexity of our ADMM is \(O(n \log n)\) for each iteration, where n is the total number of pixels.

The convergence of our ADMM to the global minimum is guaranteed by [69]. Once it finishes, we obtain the enhanced probability map \(\mathbf {u}\) for class k. We denote it as \(\mathcal {U}_{:,:,k}\). After the map for each class is obtained, we get a 3D tensor \(\mathcal {U}\). The final classification of the (ij)-th pixel is given by finding the maximum value in \(\mathcal {U}_{i,j,:}\), i.e., \(\underset{k}{\text {argmax }}\, \mathcal {U}_{i,j,k}\). Our proposed method is summarized in Algorithm 1.

figure e

We remark that in Stage 1, the operation is along the spectral dimension, i.e., the third index of the tensor, while in Stage 2, the operation is along the spatial dimension, i.e., the first two indices of the tensor. The techniques of Stage 2 are essentially similar to our segmentation methods in [43,44,45], where a smooth denoised image is first computed and then thresholding (here maximizing) is applied to it to segment (here classify) it.

Figure 2 shows the probability maps before the second stage and the enhanced probability maps after the second stage. The figures are in gray scale, i.e., completely white represents probability one and completely black represents probability zero. Note that the second stage does not guarantee the enhanced probability maps to have a sum to one property. In Fig. 2b and d, the enhanced probability maps are normalized to sum to one.

4 Experimental Results

4.1 Experimental Setup

4.1.1 Data Sets

Three commonly tested hyperspectral data sets are used in our experiments. These data sets have pixels labeled so that we can compare the methods quantitatively. The first one is the “Indian Pines” data set acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in North-western Indiana. It has a spatial resolution of 20 m per pixel and a spectral coverage ranging from 0.2 to 2.4 \(\upmu \)m in 220 spectral bands. However, due to water absorption, 20 of the spectral bands (the 104–108th, 150–163th and 220th bands) are discarded in experiments in previous papers. Therefore our data set is of size \(145 \times 145 \times 200\), and there are 16 classes in the given ground-truth labels.

The second and third images are the “University of Pavia” and “Pavia Center” data sets acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over Pavia in northern Italy. The sensor has 1.3 m spatial resolution and spectral coverage ranging from 0.43 to 0.86 \(\upmu \)m. The data set sizes are \(610 \times 340 \times 103\) and \(1096 \times 715 \times 102\), respectively, where the third dimension is the spectral dimension. Both sets have nine classes in the ground-truth labels.

Fig. 4
figure 4

Indian Pines data set. a Ground-truth labels, b label color of the ground-truth labels, c false color image, d heatmap colorbar, ej classification results by different methods

4.1.2 Methods Compared and Parameters Used

We have compared our method with five well-known classification methods: \(\nu \)-support vector classifiers (\(\nu \)-SVC) [22, 23] (i.e., the first stage of our method), SVMs with composite kernels (SVM-CK) [25], edge-preserving filtering (EPF) [28], superpixel-based classification via multiple kernels (SC-MK) [26] and multiple-feature-based adaptive sparse representation (MFASR) [38]. All the tests are run on a laptop computer with an Intel Core i5-7200U CPU, 8 GB RAM and the software platform is MATLAB R2016a.

Table 2 Number of training/testing pixels and classification accuracies for University of Pavia data set

In the experiments, the parameters are chosen as follows. For the \(\nu \)-SVC method, the parameters are obtained by performing a fivefold cross-validation [72]. For the SVM-CK method, the parameters are tuned such that it gives the highest classification accuracy. All parameters of the EPF method, the SC-MK method, and the MFASR method are chosen as stated in [26, 28, 38], respectively, except the window size in the EPF method, the number of superpixels and the parameters of the superpixel segmentation algorithm in the SC-MK method, and the sparsity level of the MFASR are tuned such that the highest classification accuracies are obtained. For our method, the parameters of the \(\nu \)-SVC (1) in the first stage are obtained by performing a fivefold cross-validation and the parameters of the optimization problem (10) in the second stage are tuned such that it gives the highest classification accuracy. The optimal parameters in the second stage \(\beta _1\) and \(\beta _2\) are 0.4 and 3; 0.1 and 3; 0.2 and 4 for Indian Pines, University of Pavia and Pavia Center, respectively. By [69], Algorithm 1 converges for any \(\mu >0\), so \(\mu \) is fixed as 5 for all the tests on the three data sets.

4.1.3 Performance Metrics

To quantitatively evaluate the performance of the methods, we use the following three widely used metrics: (i) overall accuracy (OA): the percentage of correctly classified pixels, (ii) average accuracy (AA): the average percentage of correctly classified pixels over each class, and (iii) kappa coefficient (kappa): the percentage of correctly classified pixels corrected by the number of agreements that would be expected purely by chance [73].

For each method, we perform the classification ten times where each time we randomly choose a different set of training pixels. In the tables below, we give the averages of these metrics over the ten runs. The accuracies are given in percentage, and the highest accuracy of each category is listed in boldface. In order to graphically show the classification results in an objective way, we also count the number of mis-classifications for each testing pixel over the ten runs. The numbers of mis-classifications are shown in the corresponding heatmap figures, with the heatmap colorbar indicating the number of mis-classifications.

4.2 Classification Results

4.2.1 Indian Pines

The Indian Pines data set consists mainly of big homogeneous regions and has very similar inter-class spectra (see Fig. 3 for the spectra of the training pixels of Indian Pines data where there are three similar classes of corns, three similar classes of grasses and three similar classes of soybeans). It is therefore very difficult to classify it if only spectral information is used. In the experiments, we choose exactly the same number of training pixels as in [26, 37] and they amount to about 10% of the pixels from each class. Some classes have small numbers of pixels, and hence 10 pixels are taken as training pixels for each of these classes. The rest of the labeled pixels are used as testing pixels.

The number of training and testing pixels and the classification accuracies obtained by different methods are reported in Table 1. We see that our method generates the best results for all three metrics (OA, AA and kappa) and outperforms the comparing methods by a significant margin. They are at least 0.95% higher than the others. Also, the second stage of our method improves the overall accuracy of \(\nu \)-SVC (used in the first stage of our method) by almost 20%.

Fig. 5
figure 5

University of Pavia data set. a Ground-truth labels, b label color of the ground-truth labels, c false color image, d heatmap colorbar, ej classification results by different methods

Table 3 Number of training/testing pixels and classification accuracies for Pavia Center data set

Figure 4 shows the heatmaps of mis-classifications. The results of the \(\nu \)-SVC, SVM-CK and EPF methods produce large area of mis-classifications. The SC-MK also produces mis-classification at the top-right region and the middle-right region which are soybeans-clean and soybeans-no till, respectively. This shows that SC-MK cannot distinguishing these two similar classes well. The heatmap of MFASR method contains scattered regions of mis-classification. In contrast, our method generates smaller regions of mis-classifications and less errors as it effectively utilizes the spatial information to give an accurate result.

4.2.2 University of Pavia

The University of Pavia data set consists of regions with various shapes, including thin and thick structures and large homogeneous regions. Hence, it can be used to test the ability of the classification methods on handling different shapes. In the experiments, we choose the same number of training pixels (200 for each class) as in [26]. This accounts for approximately 4% of the labeled pixels. The remaining ones are used as testing pixels.

Table 2 reports the classification accuracies obtained by different methods. We see that the performance of SC-MK, MFASR, and our method are very close: approximately 99% in all three metrics (OA, AA and kappa) and they outperform the \(\nu \)-SVC, SVM-CK and EPF methods. However, we note that MFASR requires twice the number of parameters as ours and 12 times longer to run, see Tables 7 and 8. Figure 5 shows the heatmaps of mis-classifications. The \(\nu \)-SVC, SVM-CK and EPF methods produce large regions of mis-classifications. The SC-MK method produces many mis-classifications at the middle and bottom regions where the meadows are. The MFASR method and our method generate smaller regions of mis-classification.

4.2.3 Pavia Center

The Pavia Center data set also consists of regions with various shapes. In the experiments, we use the same number of training pixels as in [31] (150 training pixels per class). This accounts for approximately 1% of the labeled pixels. The rest of the labeled pixels are used as testing pixels. Table 3 reports the number of training/testing pixels and the classification accuracies of different methods. We see that the EPF method gives the highest OA and kappa while our method gives the second highest and their values differ by about 0.1%. However, our method gives the highest AA (99.12%) which outperforms the EPF method by almost 1%. The SC-MK and MFASR methods give slightly worse accuracies than our method. Figure 6 shows the heatmaps of mis-classifications.

Fig. 6
figure 6

Pavia center data set. a Ground-truth labels, b label color of the ground-truth labels, c false color image, d heatmap colorbar, ej classification results by different methods

4.3 Advantages of Our Two-Stage Method

4.3.1 Percentage of Training Pixels

Since our method improves on the classification accuracy by using spatial information, it is expected to be a better method if the training percentage (percentage of training pixels) is higher. To verify that, Tables 4, 5 and 6 show the overall accuracies obtained by our method on the three data sets with different levels of training percentage. We see that our method outperforms the other methods once the training percentage is reasonably high enough (6% for Example 1, 10% for Example 2, and 3% for Example 3). When it is not high, our method still gives a classification accuracy that is very close to the best method compared.

Table 4 Classification results on the Indian Pines data with different levels of training pixels
Table 5 Classification results on the University of Pavia data with different levels of training pixels
Table 6 Classification results on the Pavia Center data with different levels of training pixels
Table 7 Comparison of number of parameters
Table 8 Comparison of computation times (in seconds)

4.3.2 Model Complexity and Computation Time

Tables 7 and 8 show the computation time required and the number of parameters for all methods. We note that the reported timing does not count the time required to find the optimal set of parameters. The \(\nu \)-SVC, SVM-CK and EPF methods have fast computation time because of the simplicity of their models. They have only a few parameters (2, 3 and 4, respectively). However, from the results in Sect. 4.2, they are worse than the other three methods. The SC-MK method is a good method in terms of accuracy and timing, but it has 9 parameters. The MFASR method has 10 parameters and the longest computation time. In comparison, our method has 5 parameters (2 parameters \(\nu \) and \(\sigma \) for the \(\nu \)-SVC (1) and the RBF kernel (6), respectively, in the first stage, 2 parameters \(\beta _1\) and \(\beta _2\) for the denoising model (10) in the second stage and 1 parameter \(\mu \) for the ADMM algorithm (12)). It has much better (if not the best) classification accuracies with slightly longer computation time than those of \(\nu \)-SVC, SVM-CK and EPF.

4.4 Effect of the Second-Order Term

Here we examine empirically the importance of the term \(\Vert \nabla \mathbf {u}\Vert _2^2\) in (10). Figure 7 shows the heatmaps of mis-classifications on the Indian Pines data by using our method with and without \(\Vert \nabla \mathbf {u}\Vert _2^2\) over ten runs. The training pixels are randomly selected and consist of 2.5% of the labeled pixels. Figure 7a shows the ground-truth labels. Figure 7b–d shows the heatmaps of mis-classifications of the \(\nu \)-SVC classifier (i.e., the first stage of our method), the second stage of our method without the \(\Vert \nabla \mathbf {u}\Vert _2^2\) term, and the second stage of our method with the \(\Vert \nabla \mathbf {u}\Vert _2^2\) term, respectively. Recall the term \(\Vert \nabla \mathbf {u}\Vert _2^2\) control the smoothness of the final probability maps and the final classification result is determined by taking the maximum over this map of each class. By choosing the parameter associated with the term appropriately, we can then control the level of shrinking or expanding the homogeneous regions in the final classification result. From Fig. 7c, when the term is dropped, the mis-classification regions at the top left and bottom left of the first stage result are not only still mis-classified, but the numbers of mis-classification increase. In contrast, when the term is kept, we see from Fig. 7d that the numbers of mis-classification are significantly lowered. Moreover, most of the mis-classified regions of the first stage result are now correctly classified when the parameters are chosen appropriately.

Fig. 7
figure 7

Heatmaps of mis-classifications on Indian Pines data. a Ground-truth labels, b\(\nu \)-SVC (the first stage), c and d our method without and with the second-order term, respectively

5 Conclusions and Future Work

In this paper, we propose a novel two-stage hyperspectral classification method inspired by image denoising/segmentation. The method is simple yet performs effectively. In the first stage, a support vector machine method is used to estimate the pixel-wise probability map of each class. The result in the first stage has decent accuracy but is noisy. In the second stage, a convex variant of the Mumford–Shah model is applied to denoise and classify the hyperspectral image into different classes. Since both spectral and spatial information are effectively utilized, our method is very competitive when compared with state-of-the-art hyperspectral data classification methods. It also has a simpler framework with fewer numbers of parameters and faster computation times. It performs particularly well when the inter-class spectra are close or when the training percentage is high.

For future work, we plan to investigate the use of deep learning methods in the first stage [16,17,18,19]. We will also investigate the use of automated parameter selection [74,75,76,77] of the variational method in the second stage. Additionally, we plan on using our methods for classifying fused hyperspectral and LiDAR data [18, 78, 79].