
1 Introduction

Hyperspectral image (HSI) classification, the task of assigning each pixel in an HSI to its respective class, has been an area of active research for over two decades. HSI classification is challenging due to the high dimensionality of the data, the availability of very few labelled samples and the presence of noise. Noise corrupts both the spatial and spectral information in the data, thereby drastically degrading the performance of classification algorithms. This motivates us to devise a novel technique that achieves high classification accuracy even in the presence of a significant amount of noise. Early work on HSI classification considered only the spectral signatures, which did not provide satisfactory classification results. More recent works [4, 8, 9] acknowledge the contribution of the spatial information contained in an HSI in addition to the spectral information. The importance of spatial information stems from the simple hypothesis that nearby pixels in an HSI belong to the same class with high probability. In [12], the authors address the problem of robust face recognition when the data are corrupted by occlusion and disguise. They incorporate a structural incoherence constraint to make the bases learnt for different classes independent, which yields higher discriminating ability and, in turn, impressive classification performance. Inspired by their approach, in this work we exploit both the spectral and spatial information to classify a severely degraded HSI by learning discriminative features. Instead of learning these features from the corrupted data, we learn them from the underlying clean data, which leads to state-of-the-art classification results.

2 Prior Work

Several methods that perform pixelwise classification of HSIs have been proposed [10, 14, 23]. Sparse representation based classifiers have been used for HSI classification in [5, 6], and a combination of low rank and sparsity can be found in [8, 19, 21]. Collaborative representation based classifiers have been used in [9, 11]. In [1], extended morphological profiles were used for HSI classification, while the authors in [22] used conditional random fields. The random forest framework was investigated for HSI classification in [7]. Owing to their excellent performance in the classification of images and videos, deep learning based methods have also been extensively investigated for HSI classification: stacked autoencoders (SAEs) [16], convolutional neural networks (CNNs) [13], deep belief networks (DBNs) [4] and deep recurrent neural networks (RNNs) [15] have all been applied to this task. Recently, generative adversarial networks (GANs) have been introduced for HSI classification [24]. However, these models are data hungry and prone to overfitting due to the scarcity of available labelled samples; hence, much of the recent research with deep models focuses on techniques to mitigate this problem.

3 Proposed Methodology

In this work, we present a novel framework that combines non-negative matrix factorization (NMF), nuclear norm minimization (NNM) and structural incoherence to learn discriminative features from a degraded HSI. Given an HSI of size \(m \times n \times b\), the training samples are the b-dimensional spectral vectors. Let \(\mathbf {Y} \in \mathcal {R}^{b \times N}\) denote the matrix formed by stacking together all the N available training samples. Hence, for an HSI with C classes, \(\mathbf {Y}=[\mathbf {Y}_1, \mathbf {Y}_2, \dots ,\mathbf {Y}_C]\), where \(\mathbf {Y}_i =[\mathbf {y}^i_1, \mathbf {y}^i_2, \dots , \mathbf {y}^i_{n_i}] \in \mathcal {R}^{b \times n_i}\) (\(i= 1, 2, \dots , C\)) denotes the training samples belonging to class i and \(n_i\) is the number of available training samples from class i, so that \(N=\sum _{i=1}^{C} n_i\). We model the data as:

$$\begin{aligned} \mathbf {Y}=\mathbf {L+S+N} \end{aligned}$$
(1)

where \(\mathbf {L}=[\mathbf {L}_1, \mathbf {L}_2, \dots , \mathbf {L}_C] \in \mathcal {R}^{b \times N}\) is a low rank matrix, \(\mathbf {S}=[\mathbf {S}_1, \mathbf {S}_2, \dots , \mathbf {S}_C] \in \mathcal {R}^{b \times N}\) is the matrix of sparse noise and \(\mathbf {N}=[\mathbf {N}_1, \mathbf {N}_2, \dots , \mathbf {N}_C] \in \mathcal {R}^{b \times N}\) denotes additive Gaussian noise. We aim to combine non-negative matrix factorization and nuclear norm minimization to simultaneously denoise the available data and learn a basis matrix for classification. We wish to solve the following optimization problem:

$$\begin{aligned} \min _{\mathbf {L},\mathbf {S},\mathbf {U}\ge 0,\mathbf {V} \ge 0}&\alpha \Vert \mathbf {L}\Vert _* + \beta \Vert \mathbf {S}\Vert _1 + \gamma \sum _{i} \sum _{j \ne i} \Vert {\mathbf {L}_j^T\mathbf {L}_i}\Vert _F^2 + \delta \Vert \mathbf {L-UV}\Vert _F^2 \nonumber \\&\text{ s.t. } \Vert \mathbf {Y-L-S}\Vert _F^2 \le \epsilon \end{aligned}$$
(2)

where \(\Vert \cdot \Vert _F\) and \(\Vert \cdot \Vert _*\), respectively, denote the Frobenius norm and the nuclear norm of a matrix, \(\alpha \), \(\beta \), \(\gamma \), \(\delta \) are positive parameters and \(\epsilon \) is a small constant whose value is fixed to 0.001. It is noteworthy that the basis matrix \(\mathbf {U}\) is learnt from the underlying clean data \(\mathbf {L}\) and not directly from the degraded training data \(\mathbf {Y}\). The third term in Eq. (2) corresponds to structural incoherence [17], which measures the similarity between the derived low-rank matrices of different classes; minimizing this term separately for every class makes the derived matrices mutually incoherent and enhances their discriminating ability. Hence, for the \(i^{th}\) class, we wish to solve:

$$\begin{aligned} \min _{\mathbf {L}_i,\mathbf {S}_i,\mathbf {U}\ge \mathbf {0},\mathbf {V} \ge \mathbf {0}}&\alpha \Vert \mathbf {L}_i\Vert _* + \beta \Vert \mathbf {S}_i\Vert _1 + \gamma \sum _{j \ne i} \Vert {\mathbf {L}_j^T\mathbf {L}_i}\Vert _F^2 + \delta \Vert \mathbf {L}_i-(\mathbf {U}\mathbf {V})_i\Vert _F^2 \nonumber \\&\text{ s.t. } \Vert \mathbf {Y}_i-\mathbf {L}_i-\mathbf {S}_i\Vert _F^2 \le \epsilon \end{aligned}$$
(3)

We use the method of augmented Lagrangian multipliers (ALM) [18] to solve the above problem. To this end, we first introduce an auxiliary variable \(\mathbf {Z}_i\) as follows:

$$\begin{aligned} \min _{\mathbf {L}_i,\mathbf {S}_i,\mathbf {U}\ge \mathbf {0},\mathbf {V} \ge \mathbf {0} , \mathbf {Z}_i}&\alpha \Vert \mathbf {L}_i\Vert _* + \beta \Vert \mathbf {S}_i\Vert _1 + \gamma \sum _{j \ne i} \Vert {\mathbf {L}_j^T\mathbf {Z}_i}\Vert _F^2 + \delta \Vert \mathbf {Z}_i-(\mathbf {U}\mathbf {V})_i\Vert _F^2 \nonumber \\&\text{ s.t. } \Vert \mathbf {Y}_i-\mathbf {L}_i-\mathbf {S}_i\Vert _F^2 \le \epsilon , \; \mathbf {Z}_i=\mathbf {L}_i \end{aligned}$$
(4)

The augmented Lagrangian is as follows:

$$\begin{aligned} \mathcal {L}&(\mathbf {L}_i,\mathbf {S}_i,\mathbf {U},\mathbf {V}, \mathbf {Z}_i,\mathbf {M}_1,\mathbf {M}_2;\mu )=\alpha \Vert \mathbf {L}_i\Vert _* + \beta \Vert \mathbf {S}_i\Vert _1 \nonumber \\ +&\gamma \sum _{j \ne i} \Vert {\mathbf {L}_j^T\mathbf {Z}_i}\Vert _F^2 + \delta \Vert \mathbf {Z}_i-(\mathbf {U}\mathbf {V})_i\Vert _F^2+ \langle \mathbf {M}_1,\mathbf {Y}_i-\mathbf {L}_i-\mathbf {S}_i\rangle \nonumber \\ +&\langle \mathbf {M}_2,\mathbf {Z}_i-\mathbf {L}_i\rangle +\dfrac{\mu }{2}[\Vert \mathbf {Y}_i-\mathbf {L}_i-\mathbf {S}_i\Vert _F^2+\Vert \mathbf {Z}_i-\mathbf {L}_i\Vert _F^2] \end{aligned}$$
(5)

where \(\mathbf {M}_1\), \(\mathbf {M}_2\) are Lagrange multipliers, \(\mu \) is a positive parameter and \(\langle \cdot ,\cdot \rangle \) denotes the inner product. We now provide the update rules for each of the variables; a consolidated sketch of the resulting iteration is given after the list below.

  • Computing \(\mathbf {L}_i\) with other variables fixed: With some algebraic manipulations, the \(\mathbf {L}_i\)-subproblem can be obtained from Eq. (5) as follows:

    $$\begin{aligned} \min _{\mathbf {L}_i} \dfrac{\alpha }{2\mu } \Vert \mathbf {L}_i\Vert _* + \dfrac{1}{2} \Vert \mathbf {L}_i-\dfrac{1}{2}(\mathbf {Y}_i-\mathbf {S}_i+\mathbf {Z}_i+\dfrac{\mathbf {M}_1}{\mu }+\dfrac{\mathbf {M}_2}{\mu })\Vert _F^2 \end{aligned}$$
    (6)

    Eq. (6) can be solved using singular value thresholding [3].

  • Computing \(\mathbf {S}_i\) fixing others: The \(\mathbf {S}_i\)-subproblem can be obtained from Eq. (5) as follows:

    $$\begin{aligned} \min _{\mathbf {S}_i} \dfrac{\beta }{\mu } \Vert \mathbf {S}_i\Vert _1+\dfrac{1}{2}\Vert \mathbf {S}_i-(\mathbf {Y}_i-\mathbf {L}_i+\dfrac{\mathbf {M}_1}{\mu })\Vert _F^2 \end{aligned}$$
    (7)

    Eq. (7) can be solved using the soft shrinkage operation.

  • Computation of \(\mathbf {Z}_i\): In order to achieve factorization of \(\mathbf {Z}_i\), the non-negativity constraint on \(\mathbf {Z}_i\) must be satisfied. To this end, we first introduce an indicator function as follows:

    $$\begin{aligned} l _+(\mathbf {Z}_i)={\left\{ \begin{array}{ll} 0, &{}Z_{i_{m,n}}\ge 0 \;\;\;\; \forall m,n\\ \infty , &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
    (8)

    where, \(Z_{i_{m,n}}\) denotes the \((m,n)^{th}\) element of \(\mathbf {Z}_i\). The \(\mathbf {Z}_i\) sub-problem is then obtained from Eq. (5) as follows:

    $$\begin{aligned} \min _{\mathbf {Z}_i} \gamma \sum _{j \ne i} \Vert {\mathbf {L}_j^T\mathbf {Z}_i}\Vert _F^2 + \delta \Vert \mathbf {Z}_i-(\mathbf {U}\mathbf {V})_i\Vert _F^2 +\dfrac{\mu }{2}\Vert \mathbf {Z}_i-\mathbf {L}_i+\dfrac{\mathbf {M}_2}{\mu }\Vert _F^2+ l _+(\mathbf {Z}_i) \end{aligned}$$
    (9)

    To solve this, we use the alternating direction method of multipliers (ADMM) [2]. To do so, we first introduce an auxiliary variable in order to make the objective function separable:

    $$\begin{aligned} \min _{\mathbf {Z}_i, \mathbf {C}_i}&\gamma \sum _{j \ne i} \Vert {\mathbf {L}_j^T\mathbf {Z}_i}\Vert _F^2 + \delta \Vert \mathbf {Z}_i-(\mathbf {U}\mathbf {V})_i\Vert _F^2 +\dfrac{\mu }{2}\Vert \mathbf {Z}_i-\mathbf {L}_i+\dfrac{\mathbf {M}_2}{\mu }\Vert _F^2+ l _+(\mathbf {C}_i) \nonumber \\&\text{ s.t. } \mathbf {C}_i=\mathbf {Z}_i \end{aligned}$$
    (10)

    The above problem is solved iteratively by updating one variable at a time while keeping the others fixed, until convergence. The update equations are:

    $$\begin{aligned} \mathbf {Z}_i=[(\delta +\dfrac{\mu }{2}+\dfrac{\rho }{2})\mathbf {I}+&\gamma \sum _{j \ne i} {\mathbf {L}_j \mathbf {L}_j^T}]^{-1}[\delta (\mathbf {U}\mathbf {V})_i+\dfrac{\mu \mathbf {L}_i}{2}-\dfrac{\mathbf {M}_2}{2}-\dfrac{\mathbf {T}}{2}+\dfrac{\rho \mathbf {C}_i}{2}] \end{aligned}$$
    (11)
    $$\begin{aligned}&\mathbf {C}_i \leftarrow \max (\mathbf {Z}_i+ \dfrac{\mathbf {T}}{\rho },0)\end{aligned}$$
    (12)
    $$\begin{aligned}&\mathbf {T}\leftarrow \mathbf {T}+\rho (\mathbf {Z}_i-\mathbf {C}_i)\end{aligned}$$
    (13)
    $$\begin{aligned}&\rho \leftarrow \min (\kappa \rho ,\rho _{max}) \end{aligned}$$
    (14)

    where \(\mathbf {I}\) denotes the identity matrix of appropriate size, \(\mathbf {T}\) is the Lagrange multiplier and \(\rho > 0\), \(\kappa > 0\) and \(\rho _{max}\) are parameters.

  • Computation of \(\mathbf {U}\) and \(\mathbf {V}\) with the other variables fixed: the corresponding sub-problem is:

    $$\begin{aligned} \min _{\mathbf {U\ge 0,V\ge 0}} \Vert \mathbf {Z}_i-(\mathbf {U}\mathbf {V})_i\Vert _F^2 \end{aligned}$$
    (15)

    The above optimization problem can be solved directly using any of the existing NMF solvers.

  • The final step is to update the multipliers and \(\mu \):

    $$\begin{aligned}&\mathbf {M}_1 \leftarrow \mathbf {M}_1 +\mu (\mathbf {Y}_i-\mathbf {L}_i-\mathbf {S}_i)\end{aligned}$$
    (16)
    $$\begin{aligned}&\mathbf {M}_2 \leftarrow \mathbf {M}_2 +\mu (\mathbf {Z}_i-\mathbf {L}_i)\end{aligned}$$
    (17)
    $$\begin{aligned}&\mu \leftarrow \min (\psi \mu ,\mu _{max}) \end{aligned}$$
    (18)
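
The complete per-class solver simply cycles through the updates above until the constraint \(\Vert \mathbf {Y}_i-\mathbf {L}_i-\mathbf {S}_i\Vert _F^2 \le \epsilon \) is met. The following is a minimal NumPy sketch of one outer ALM pass, assuming the forms of Eqs. (6), (7) and (11)-(14) as written above; the function and variable names (svt, soft_threshold, update_Z, alm_iteration) are illustrative, not part of the paper, and the NMF update of Eq. (15) is assumed to be handled by an external solver whose current reconstruction is passed in as UV_i.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm (Eq. 6)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0)) @ Vt

def soft_threshold(X, tau):
    """Elementwise soft shrinkage: proximal operator of tau * l1 norm (Eq. 7)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def update_Z(L_others, UV_i, L_i, M2, mu, gamma, delta,
             rho=1.0, kappa=1.5, rho_max=1e6, inner_iters=20):
    """Inner ADMM for the non-negative Z_i sub-problem (Eqs. 9-14)."""
    b = L_i.shape[0]
    G = gamma * sum(Lj @ Lj.T for Lj in L_others)        # gamma * sum_{j != i} L_j L_j^T
    Z = np.maximum(L_i, 0)
    C, T = Z.copy(), np.zeros_like(Z)
    for _ in range(inner_iters):
        A = (delta + mu / 2 + rho / 2) * np.eye(b) + G
        rhs = delta * UV_i + (mu / 2) * L_i - M2 / 2 - T / 2 + (rho / 2) * C
        Z = np.linalg.solve(A, rhs)                      # Eq. (11)
        C = np.maximum(Z + T / rho, 0)                   # Eq. (12): projection onto C >= 0
        T = T + rho * (Z - C)                            # Eq. (13)
        rho = min(kappa * rho, rho_max)                  # Eq. (14)
    return Z

def alm_iteration(Y_i, L_i, S_i, Z_i, UV_i, L_others, M1, M2, mu,
                  alpha, beta, gamma, delta, psi=1.1, mu_max=1e6):
    """One outer ALM pass for class i; UV_i is the current NMF reconstruction (Eq. 15)."""
    L_i = svt((Y_i - S_i + Z_i + M1 / mu + M2 / mu) / 2, alpha / (2 * mu))   # Eq. (6)
    S_i = soft_threshold(Y_i - L_i + M1 / mu, beta / mu)                     # Eq. (7)
    Z_i = update_Z(L_others, UV_i, L_i, M2, mu, gamma, delta)                # Eqs. (9)-(14)
    M1 = M1 + mu * (Y_i - L_i - S_i)                                         # Eq. (16)
    M2 = M2 + mu * (Z_i - L_i)                                               # Eq. (17)
    mu = min(psi * mu, mu_max)                                               # Eq. (18)
    return L_i, S_i, Z_i, M1, M2, mu
```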

Once we obtain the basis matrix \(\mathbf {U}\) learnt from the underlying clean data \(\mathbf {L}\), we project all the degraded training and testing data onto \(\mathbf {U}\) to obtain discriminative features which are subsequently used for classification. For any spectral vector \(\mathbf {y}\), we obtain its feature vector as follows:

$$\begin{aligned} \mathbf {f}=\mathbf {U}^{\dagger }\mathbf {y} \end{aligned}$$
(19)

where \(\mathbf {U}^{\dagger }\) denotes the pseudoinverse of \(\mathbf {U}\). We propose a very simple classifier for the obtained features. For a test feature \(\mathbf {f}_{test}\) and the training features \(\mathbf {f}_i, i=1,2, \dots , N\), let the spatial positions of the test and training samples (corresponding to the test and training spectral vectors, respectively) be denoted by \(\mathbf {p}_{test}=[x,y]^T\) and \(\mathbf {p}_i=[x_i,y_i]^T\), respectively. Let \(d_1=dist(\mathbf {p}_{test}, \mathbf {p}_i)\) and \(d_2 = dist(\mathbf {f}_{test}, \mathbf {f}_i)\), where \(dist(\mathbf {a,b})\) denotes the squared Euclidean distance between the vectors \(\mathbf {a}\) and \(\mathbf {b}\). Then we propose to define

$$\begin{aligned} d_i \overset{\Delta }{=} \alpha _1d_1 + \alpha _2d_2 \end{aligned}$$
(20)

as the dissimilarity between the test feature \(\mathbf {f}_{test}\) and the training feature \(\mathbf {f}_i\). The test sample is then assigned to the class of the training sample for which this dissimilarity is minimum. The term \(d_1\) accounts for spatial correlation, acknowledging the fact that pixels close to each other belong to the same class with high probability, while \(d_2\) is the squared Euclidean distance between the test and training features. Note that setting \(\alpha _1\) to zero reduces the rule to the nearest neighbour classifier.
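
To make the classification rule concrete, the following is a minimal sketch of the feature extraction of Eq. (19) and the dissimilarity rule of Eq. (20). It assumes the training features are stacked column-wise in F_train and their pixel coordinates in P_train; these names, and the layout, are illustrative assumptions.

```python
import numpy as np

def extract_features(U, Y):
    """Project spectral vectors (columns of Y) onto the learnt basis U (Eq. 19)."""
    return np.linalg.pinv(U) @ Y                              # column k is f_k = U^+ y_k

def classify(f_test, p_test, F_train, P_train, labels, alpha1, alpha2):
    """Assign the test sample to the class of the training sample minimizing
    d_i = alpha1 * d1 + alpha2 * d2 (Eq. 20), with squared Euclidean distances."""
    d1 = np.sum((P_train - p_test[:, None]) ** 2, axis=0)     # spatial term
    d2 = np.sum((F_train - f_test[:, None]) ** 2, axis=0)     # spectral-feature term
    return labels[np.argmin(alpha1 * d1 + alpha2 * d2)]
```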


4 Experimental Results

In order to assess the performance of our algorithm, we use three HSI datasets. We synthetically corrupt these datasets by adding Gaussian noise with a standard deviation of 0.05; note that the spectral vectors are normalized to [0, 1]. We also add impulse noise and stripes to bands 61 to 70 in all three datasets. We randomly choose 10% of the labelled samples from each class for training and use the rest for testing. We compare our algorithm with SVM [14], SRC [20] and CRNN [11]. The classification performance is measured by the overall accuracy (OA), defined as the ratio of the number of correctly classified test pixels to the total number of test pixels. The robustness of our algorithm to noise is analysed by gradually increasing the amount of added noise and monitoring its effect on the class accuracies as compared to the other algorithms. The parameters \(\alpha ,\beta ,\gamma ,\delta ,\alpha _1\) and \(\alpha _2\) are tuned to obtain the best results. A sketch of this corruption and sampling protocol is given below.
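
The following sketch, assuming [0, 1]-normalized data stored as an \(m \times n \times b\) array, illustrates the degradation and the per-class training split described above. The impulse-noise density, the stripe pattern and the use of label 0 for unlabelled pixels are assumptions for illustration, as the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_hsi(cube, sigma=0.05, bands=range(60, 70),
                impulse_prob=0.1, stripe_frac=0.2):
    """Degrade an m x n x b HSI normalized to [0, 1]: Gaussian noise on all bands,
    impulse noise and column stripes on bands 61-70 (zero-indexed here).
    impulse_prob and stripe_frac are illustrative values."""
    noisy = cube + sigma * rng.standard_normal(cube.shape)
    m, n, _ = cube.shape
    for k in bands:
        mask = rng.random((m, n)) < impulse_prob              # salt-and-pepper locations
        noisy[..., k][mask] = rng.choice([0.0, 1.0], size=int(mask.sum()))
        cols = rng.choice(n, size=int(stripe_frac * n), replace=False)
        noisy[:, cols, k] += 0.2                              # additive column stripes
    return np.clip(noisy, 0.0, 1.0)

def split_per_class(labels, frac=0.1):
    """Randomly pick `frac` of the labelled pixels of every class for training
    (label 0 is assumed to mark unlabelled pixels)."""
    train_idx = []
    for c in np.unique(labels[labels > 0]):
        idx = np.flatnonzero(labels == c)
        train_idx.append(rng.choice(idx, size=max(1, int(frac * idx.size)),
                                    replace=False))
    return np.concatenate(train_idx)
```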

4.1 Datasets

The datasets used are Indian Pines, Botswana and Salinas. The Indian Pines dataset has a size of 145 \(\times \) 145 \(\times \) 200 and has 16 classes. The Botswana dataset has a size of 1476 \(\times \) 256 \(\times \) 145 and has 14 classes. The Salinas dataset has a size of 512 \(\times \) 217 \(\times \) 204 and 16 classes. The Botswana dataset has only 326 labelled training samples and is therefore the most challenging of the three due to the scarcity of training samples.

Table 1. Classification accuracy (%) for the Botswana dataset
Table 2. Classification accuracy (%) for the Salinas dataset
Table 3. Classification accuracy (%) for the Indian Pines dataset
Fig. 1. Effect of progressive addition of noise on individual class accuracies on the Botswana dataset: (a) SVM, (b) SRC, (c) CRNN and (d) proposed approach

4.2 Classification Performance

The performance of the classifiers on the three synthetically corrupted datasets is summarized in Tables 1, 2 and 3. Our algorithm achieves accuracies of 97.39%, 99.85% and 98.28% on the Botswana, Salinas and Indian Pines datasets, respectively, outperforming the other methods. On the Salinas dataset, CRNN achieves the next best accuracy of 90.42%, while SVM achieves the next best accuracies of 30.15% and 52.03% on the Botswana and Indian Pines datasets, respectively. Note that our algorithm outperforms the other state-of-the-art methods on these two datasets by a very large margin. Figure 1 depicts the sensitivity of the classifiers to noise: we progressively increase the standard deviation of the Gaussian noise in increments of 0.005 up to 0.1 and examine the effect on the class-specific accuracies obtained by the algorithms. From Fig. 1, we infer that our algorithm is robust to noise in the data, since its class-specific accuracies remain fairly constant, while the performance of all the other methods deteriorates sharply as the noise level increases.

5 Conclusion

A novel algorithm for the classification of degraded hyperspectral data is proposed. A combination of nuclear norm minimization and non-negative matrix factorization is used to exploit the low-rank nature of the data. A basis matrix is learnt from the underlying clean data and used to extract features from the degraded input data. The discriminative ability of the underlying clean data is exploited using structural incoherence which, to the best of our knowledge, is introduced here for the first time in the hyperspectral image processing literature. Both the spatial and spectral information are exploited for classification, leading to state-of-the-art results.