1 Introduction

With the rapid development of hyperspectral imaging techniques, current sensors offer both high spectral and high spatial resolution (He et al., 2018). For example, the ROSIS sensor provides a spectral resolution finer than 10 nm and a spatial resolution of 1 m per pixel (Cao et al., 2018; Zare et al., 2018). The increased spectral and spatial resolution enables us to accurately discriminate diverse materials of interest. As a result, hyperspectral images (HSIs) have been widely used in many practical applications, such as precision agriculture, environmental management, mining, and mineralogy (He et al., 2018). Among these applications, HSI classification, which aims to assign each pixel of an HSI a unique class label, has attracted increasing attention in recent years. However, the unfortunate combination of high-dimensional spectral features, limited ground-truth samples, and varying atmospheric scattering conditions makes HSI data inherently highly nonlinear and difficult to categorize (Ghamisi et al., 2017).

Early HSI classification methods straightforwardly apply conventional dimensionality reduction techniques, such as principal component analysis (PCA) and linear discriminant analysis (LDA), to the spectral domain to learn discriminative spectral features. Although these methods are conceptually simple and easy to implement, they neglect spatial information, a complement to spectral behavior that has been shown to improve HSI classification performance (He et al., 2018; Ghamisi et al., 2015). To address this limitation, Chen et al. (2011) proposed joint sparse representation (JSR) to incorporate the spatial neighborhood information of pixels. Soltani-Farani et al. (2015) designed spatial-aware dictionary learning (SADL), a structured dictionary learning model that incorporates both spectral and spatial information. Kang et al. suggested using an edge-preserving filter (EPF) to improve the spatial structure of HSI (Kang et al., 2014) and later introduced PCA to encourage the separability of the new representations (Kang et al., 2017). A similar idea appears in Pan et al. (2017), where the EPF is replaced with a hierarchical guidance filter. Although these methods perform well, the discriminative power of their extracted spectral-spatial features remains far from satisfactory when tested on challenging land covers.

A recent trend is to use deep neural networks (DNNs), such as autoencoders (AEs) (Ma et al., 2016) and convolutional neural networks (CNNs) (Chen et al., 2016), to learn discriminative spectral-spatial features (Zhong et al., 2018). Although deep features usually demonstrate more discriminative power than hand-crafted features in various computer vision and image processing tasks, existing DNN-based HSI classification methods either improve the performance only marginally or require significantly more labeled data (Yang et al., 2018). On the other hand, collecting labeled data is difficult and expensive in the remote sensing community (Zare et al., 2018). Although transfer learning has the potential to alleviate the problem of limited labeled data, constructing a reliable relevance between the target and source domains remains an open problem, owing to the large variations between HSIs obtained by different sensors with unmatched imaging bands and resolutions (Zhu et al., 2017).

Different from previous work, this paper presents a novel architecture, termed multiscale principle of relevant information (MPRI), to learn discriminative spectral-spatial features for HSI classification. MPRI inherits the merits of the principle of relevant information (PRI) (Principe, 2010, Chapter 8; Rao, 2008, Chapter 3) to effectively extract multiscale information from given data, and also takes advantage of a multilayer structure to learn representations in a coarse-to-fine manner. To summarize, the major contributions of this work are threefold.

  • We demonstrate the capability of PRI, which originates from information theoretic learning (ITL) (Principe 2010), to characterize 3D pictorial structures in HSI data.

  • We generalize PRI into a multilayer structure to extract hierarchical representations for HSI classification. A multiscale scheme is also incorporated to model both local and global structures.

  • MPRI outperforms state-of-the-art HSI classification methods based on classical machine learning models (e.g., PCA-EPF (Kang et al., 2017) and HIFI (Pan et al., 2017)) by a large margin. Using significantly fewer labeled samples, MPRI also achieves classification accuracy nearly on par with existing deep learning techniques (e.g., SAE-LR (Chen et al., 2014) and 3D-CNN (Li et al., 2017)).

The remainder of this paper is organized as follows. Section 2 reviews the basic objective of PRI and formulates PRI under the ITL framework. The architecture and optimization of our proposed MPRI is elaborated in Sect. 3. Section 4 shows experimental results on three popular HSI data sets. Finally, Sect. 5 draws the conclusion.

2 Elements of Renyi’s \(\alpha \)-entropy and the principle of relevant information

Before presenting our method, we start with a brief review of the general idea and the objective of PRI, and then formulate this objective under the ITL framework.

2.1 PRI: the general idea and its objective

Suppose we are given a random variable \(\mathbf{X}\) with a known probability density function (PDF) g, from which we want to learn a reduced statistical representation characterized by a random variable \(\mathbf{Y}\) with PDF f. The PRI (Principe, 2010, Chapter 8; Rao, 2008, Chapter 3) casts this problem as a trade-off between the entropy H(f) of \(\mathbf{Y}\) and its descriptive power about \(\mathbf{X}\) in terms of their divergence \(D(f\Vert g)\). Therefore, for a fixed PDF g, the objective of PRI is given by:

$$\begin{aligned} \underset{f}{\mathrm {minimize}}~H(f)+\beta D(f\Vert g), \end{aligned}$$
(1)

where \(\beta \) is a hyper-parameter controlling the amount of relevant information that \(\mathbf{Y}\) can extract from \(\mathbf{X}\). Note that the minimization of entropy can be viewed as a means of finding the statistical regularities in the outcomes of a process, whereas the minimization of an information theoretic divergence, such as the Kullback-Leibler divergence (Kullback & Leibler, 1951) or the Chernoff divergence (Chernoff, 1952), ensures that these regularities are closely related to \(\mathbf{X}\). The PRI is similar in spirit to the information bottleneck (IB) method (Tishby et al., 2000), but the formulation is different: PRI does not require an observed relevant (or auxiliary) variable, and the optimization is performed directly on the data \(\mathbf{X}\), which yields a set of solutions related to the principal curves (Hastie & Stuetzle, 1989) of g, as will be demonstrated below.

2.2 Formulation of PRI using Renyi’s entropy functional

In information theory, a natural extension of the well-known Shannon's entropy is Renyi's \(\alpha \)-entropy (Rényi et al., 1961). For a random variable \(\mathbf {X}\) with PDF f(x) on a finite set \(\mathcal {X}\), the \(\alpha \)-entropy \(H_{\alpha }(f)\) is defined as:

$$ H_{\alpha } (f) = \frac{1}{{1 - \alpha }}\log \int_{{\mathcal{X}}} {f^{\alpha } } (x){\text{d}}x. $$
(2)

On the other hand, motivated by the famed Cauchy–Schwarz (CS) inequality:

$$\begin{aligned} \Big | \int f(x)g(x)\hbox {d}x \Big |^2 \le \int \mid f(x)\mid ^2 \hbox {d}x \int \mid g(x)\mid ^2 \hbox {d}x, \end{aligned}$$
(3)

with equality if and only if f(x) and g(x) are linearly dependent (e.g., f(x) is a scaled version of g(x)), a measure of the "distance" between the PDFs can be defined, named the CS divergence (Jenssen et al., 2006):

$$\begin{aligned} \begin{aligned} D_\mathrm{cs} (f\Vert g)&= -\log \left( \int fg\right) ^2 + \log \left( \int f^2\right) + \log \left( \int g^2\right) \\&= 2H_2(f;g) - H_2(f) - H_2(g), \end{aligned} \end{aligned}$$
(4)

where the term \(H_2(f;g)=-\log \int f(x)g(x)\hbox {d}x\) is called the quadratic cross entropy (Principe, 2010).
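For concreteness, the following NumPy sketch, which we add for illustration and which is not taken from the authors' implementation, estimates \(H_2(f)\), \(H_2(f;g)\) and \(D_\mathrm{cs}(f\Vert g)\) from samples with Parzen windows and the Gaussian kernel \(G_{\delta }\); the kernel width delta is left as a free parameter.

```python
import numpy as np

def gaussian_gram(A, B, delta):
    """Pairwise Gaussian kernel values G_delta(a_i - b_j) between rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * delta**2))

def h2(Y, delta):
    """Parzen estimate of the quadratic Renyi entropy H_2(f) from samples Y ~ f."""
    return -np.log(gaussian_gram(Y, Y, delta).mean())

def h2_cross(Y, X, delta):
    """Parzen estimate of the quadratic cross entropy H_2(f; g), with Y ~ f and X ~ g."""
    return -np.log(gaussian_gram(Y, X, delta).mean())

def cs_divergence(Y, X, delta):
    """Cauchy-Schwarz divergence D_cs(f||g) estimated from samples, following Eq. (4)."""
    return 2 * h2_cross(Y, X, delta) - h2(Y, delta) - h2(X, delta)
```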

Combining Eqs. (2) and (4), the PRI under the second-order Renyi's entropy can be formulated as:

$$\begin{aligned} \begin{aligned} f_{\text {opt}}&=\arg \min _f H_2(f)+\beta \left( 2H_2(f;g)-H_2(f)-H_2(g)\right) \\&\equiv \arg \min _f (1-\beta )H_2(f) + 2\beta H_2(f;g), \end{aligned} \end{aligned}$$
(5)

where the second equality holds because the term \(\beta H_2(g)\) is a constant with respect to f.

Let \(\mathbf {X}=\{\mathbf {x}_i\}_{i=1}^N\) and \(\mathbf {Y}=\{\mathbf {y}_i\}_{i=1}^N\), both in \(\mathbb {R}^p\), be drawn i.i.d. from g and f, respectively. Using Parzen-window density estimation (Parzen, 1962) with the Gaussian kernel \(G_{\delta }(\cdot )=\exp (-\frac{\Vert \cdot \Vert ^2}{2\delta ^2})\), Eq. (5) can be simplified as (Rao, 2008):

$$\begin{aligned} \begin{aligned} \mathbf{Y}_{\text {opt}}=&\arg \min _\mathbf{Y}\left[ -(1-\beta )\log \left( \frac{1}{N^2}\sum _{i,j=1}^NG_{\delta }\left( \mathbf{y}_i-\mathbf{y}_j\right) \right) \right. \\&\left. -2\beta \log \left( \frac{1}{N^2}\sum _{i,j=1}^NG_{\delta }\left( \mathbf{y}_i-\mathbf{x}_j\right) \right) \right] . \end{aligned} \end{aligned}$$
(6)

It turns out that the value of \(\beta \) defines various levels of information reduction, ranging from the data mean (\(\beta =0\)), to clustering (\(\beta =1\)), to principal curve (Hastie and Stuetzle, 1989) extraction at different dimensions, up to vector quantization that recovers the initial data as \(\beta \rightarrow \infty \) (Principe, 2010; Rao, 2008). Hence, the PRI achieves an effect similar to a moment decomposition of the PDF, controlled by a single parameter \(\beta \) through a data-driven optimization approach. See Fig. 1 for an example. The figure shows that this self-organizing decomposition provides a set of hierarchical features of the input data beyond cluster centers, which may yield more robust features. Note that, despite its flexibility in finding reduced structure of given data, the PRI is largely unknown to practitioners.
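Assuming the h2 and h2_cross helpers sketched above, the empirical PRI cost of Eq. (6) can be written in a few lines; again, this is an illustrative sketch rather than the reference implementation.

```python
def pri_objective(Y, X, beta, delta):
    """Empirical PRI cost of Eq. (6): (1 - beta) * H_2(f) + 2 * beta * H_2(f; g).
    beta = 0 drives Y towards the mean of X, while very large beta keeps Y close to X."""
    return (1 - beta) * h2(Y, delta) + 2 * beta * h2_cross(Y, X, delta)
```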

Fig. 1

Illustration of the structures revealed by the PRI for a the Intersect data set. As the value of \(\beta \) increases, the solution passes through b a single point, c modes, d and e principal curves at different dimensions, and in the extreme case of f \(\beta \rightarrow \infty \) we get back the data themselves as the solution

3 Multiscale principle of relevant information (MPRI) for HSI classification

In this section, we present MPRI for HSI classification. MPRI stacks multiple spectral-spatial feature learning units, each of which consists of a multiscale PRI and a regularized LDA (Bandos et al., 2009). The architecture of MPRI is shown in Fig. 2.

Fig. 2

The architecture of the multiscale principle of relevant information (MPRI) for HSI classification. The spectral-spatial feature learning unit is marked with a red dashed rectangle. The spectral-spatial features are extracted by performing PRI (at multiple scales) and LDA iteratively and successively on the HSI data cube (after normalization). Finally, features from each unit are concatenated and fed into a k-nearest neighbors (KNN) classifier to predict pixel labels. This plot only shows a 3-layer MPRI, but the number of layers can be increased or decreased flexibly

To the best of our knowledge, apart from performing band selection (e.g., Feng et al., 2015; Yu et al., 2019) or measuring spectral variability (e.g., Chang, 2000), information theoretic principles have seldom been investigated for learning discriminative spectral-spatial features for HSI classification. The most similar work to ours is Kamandar and Ghassemian (2013), in which the authors use the minimum redundancy maximum relevance (MRMR) criterion (Peng et al., 2005) to extract linear features. However, owing to the poor approximation used to estimate multivariate mutual information, the performance of Kamandar and Ghassemian (2013) is only slightly better than that of basic linear discriminant analysis (LDA) (Du, 2007).

3.1 Spectral-spatial feature learning unit

Let \(\mathbf {T}\in \mathbb {R}^{m\times n \times d}\) be the raw 3D HSI data cube, where m and n are the spatial dimensions and d is the number of spectral bands. For a target spectral vector \(\mathbf {t}_\star \in \mathbb {R}^d\), we extract a local cube (denoted \(\hat{\mathbf{X}}\)) from \(\mathbf {T}\) using a sliding window of width \(\hat{n}\) centered at \(\mathbf {t}_\star \), i.e., \(\hat{\mathbf{X}}=\{\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \cdots , \hat{\mathbf{x}}_{\hat{N}}\}\in \mathbb {R}^{\hat{N}\times d}\), \(\hat{n}\times \hat{n}=\hat{N}\), and \(\mathbf {t}_\star =\hat{\mathbf{x}}_{\lfloor \hat{n}/2 \rceil +1,\lfloor \hat{n}/2 \rceil +1}\), where \(\lfloor \cdot \rceil \) is the nearest integer function. We obtain the spectral-spatial characterization \(\hat{\mathbf{Y}}=\{\hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2, \cdots , \hat{\mathbf{y}}_{\hat{N}}\}\in \mathbb {R}^{\hat{N}\times d}\) from \(\hat{\mathbf{X}}\) using PRI via the following objective:

$$\begin{aligned} \underset{\hat{\mathbf{Y}}}{\mathrm {minimize}}\left[ -(1-\beta )\log \left( \frac{1}{\hat{N}^2}\sum _{i,j=1}^{\hat{N}}G_{\delta }\left( \hat{\mathbf{y}}_i-\hat{\mathbf{y}}_j\right) \right) -2\beta \log \left( \frac{1}{\hat{N}^2}\sum _{i,j=1}^{\hat{N}}G_{\delta }\left( \hat{\mathbf{y}}_i-\hat{\mathbf{x}}_j\right) \right) \right] . \end{aligned}$$
(7)

We finally use the center vector of \(\hat{\mathbf{Y}}\), i.e., \(\hat{\mathbf{y}}_{\lfloor \hat{n}/2 \rceil +1,\lfloor \hat{n}/2 \rceil +1}\), as the new representation of \(\mathbf {t}_\star \). We scan the whole 3D cube with a sliding window of width \(\hat{n}\) centered at each pixel to obtain the new spectral-spatial representation. The procedure is depicted in Fig. 3.
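A minimal sketch of the local-cube extraction described above is given below; it is our own illustration, and the border handling (clipping indices at the image boundary) is an assumption, since the paper does not specify it.

```python
import numpy as np

def local_cube(T, row, col, n_hat):
    """Gather the n_hat x n_hat spatial neighborhood of pixel (row, col) from the
    HSI cube T of shape (m, n, d) as an (N_hat x d) matrix. Out-of-range indices are
    clipped to the image border (an assumption of this sketch)."""
    half = n_hat // 2
    rows = np.clip(np.arange(row - half, row + half + 1), 0, T.shape[0] - 1)
    cols = np.clip(np.arange(col - half, col + half + 1), 0, T.shape[1] - 1)
    patch = T[np.ix_(rows, cols)]          # (n_hat, n_hat, d)
    return patch.reshape(-1, T.shape[2])   # (N_hat, d); the target pixel sits at row N_hat // 2
```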

Fig. 3

For each target spectral vector (e.g., \(\mathbf {t}_\star \) or \(\mathbf {t}'_\star \)) in the raw hyperspectral image, we obtain a new vector representation by performing PRI in its corresponding local data cube (e.g., \(\hat{\mathbf{X}}\) or \(\hat{\mathbf{X}}'\))

Equation (7) is optimized iteratively. Specifically, denoting \(V(\hat{\mathbf{Y}})=\frac{1}{{\hat{N}}^2}\sum _{i,j=1}^{\hat{N}}G_\delta (\hat{\mathbf{y}}_i-\hat{\mathbf{y}}_j)\) and \(V(\hat{\mathbf{Y}}; \hat{\mathbf{X}})=\frac{1}{\hat{N}^2}\sum _{i,j=1}^{\hat{N}}G_\delta (\hat{\mathbf{y}}_j-\hat{\mathbf{x}}_i)\), taking the derivative of Eq. (7) with respect to \(\hat{\mathbf{y}}_\star \) and equating it to zero, we have:

$$\begin{aligned} \begin{aligned} \frac{1-\beta }{V(\hat{\mathbf{Y}})} \sum _{j=1}^{\hat{N}} G_\delta \left( \hat{\mathbf{y}}_\star -\hat{\mathbf{y}}_j\right) \left\{ \frac{\hat{\mathbf{y}}_j-\hat{\mathbf{y}}_\star }{\delta ^2}\right\} + \frac{\beta }{V\left( \hat{\mathbf{Y}};\hat{\mathbf{X}}\right) } \sum _{j=1}^{\hat{N}} G_\delta \left( \hat{\mathbf{y}}_\star -\hat{\mathbf{x}}_j\right) \left\{ \frac{\hat{\mathbf{x}}_j-\hat{\mathbf{y}}_\star }{\delta ^2}\right\} =0. \end{aligned} \end{aligned}$$
(8)

Rearranging Eq. (8), we have:

$$\begin{aligned} \begin{aligned} \left\{ \frac{\beta }{V\left( \hat{\mathbf{Y}};\hat{\mathbf{X}}\right) }\sum _{j=1}^{\hat{N}} G_\delta \left( \hat{\mathbf{y}}_\star -\hat{\mathbf{x}}_j\right) \right\} \hat{\mathbf{y}}_\star&=\frac{1-\beta }{V(\hat{\mathbf{Y}})} \sum _{j=1}^{\hat{N}} G_\delta \left( \hat{\mathbf{y}}_\star -\hat{\mathbf{y}}_j\right) {\hat{\mathbf{y}}_j}\\&\quad -\left\{ \frac{1-\beta }{V(\hat{\mathbf{Y}})}\sum _{j=1}^{\hat{N}} G_\delta \left( \hat{\mathbf{y}}_\star -\hat{\mathbf{y}}_j\right) \right\} {\hat{\mathbf{y}}_\star }\\&\quad +\frac{\beta }{V\left( \hat{\mathbf{Y}};\hat{\mathbf{X}}\right) }\sum _{j=1}^{\hat{N}} G_\delta \left( \hat{\mathbf{y}}_\star -\hat{\mathbf{x}}_j\right) \hat{\mathbf{x}}_j. \end{aligned} \end{aligned}$$
(9)

Dividing both sides of Eq. (9) by

$$\begin{aligned} \frac{\beta }{V(\hat{\mathbf{Y}};\hat{\mathbf{X}})} \sum _{j=1}^{\hat{N}} G_\delta \left( \hat{\mathbf{y}}_\star -\hat{\mathbf{x}}_j\right) , \end{aligned}$$
(10)

and letting

$$\begin{aligned} c={V(\hat{\mathbf{Y}};\hat{\mathbf{X}})}/{V(\hat{\mathbf{Y}})}, \end{aligned}$$
(11)

we obtain the fixed point update rule for \(\hat{\mathbf{y}}_\star \):

$$\begin{aligned} \begin{aligned} \hat{\mathbf{y}}_\star ^{\tau +1}&=c\frac{1-\beta }{\beta }\frac{\sum _{j=1}^{\hat{N}} G_{\delta }\left( \hat{\mathbf{y}}_\star ^\tau -\hat{\mathbf{y}}_j^\tau \right) \hat{\mathbf{y}}_j^\tau }{\sum _{j=1}^{\hat{N}} G_{\delta }\left( \hat{\mathbf{y}}_\star ^\tau -\hat{\mathbf{x}}_j\right) } -c\frac{1-\beta }{\beta }\frac{\sum _{j=1}^{\hat{N}} G_{\delta }\left( \hat{\mathbf{y}}_\star ^\tau -\hat{\mathbf{y}}_j^\tau \right) }{\sum _{j=1}^{\hat{N}} G_{\delta }(\hat{\mathbf{y}}_\star ^\tau -\hat{\mathbf{x}}_j)}\hat{\mathbf{y}}_\star ^\tau \\&\quad +\frac{\sum _{j=1}^{\hat{N}} G_{\delta }\left( \hat{\mathbf{y}}_\star ^\tau -\hat{\mathbf{x}}_j\right) \hat{\mathbf{x}}_j}{\sum _{j=1}^{\hat{N}} G_{\delta }\left( \hat{\mathbf{y}}_\star ^\tau -\hat{\mathbf{x}}_j\right) }, \end{aligned} \end{aligned}$$
(12)

where \(\tau \) is the iteration index. We move the sliding window pixel by pixel and only update the representation of the center target pixel, as shown in Fig. 3.
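The following sketch implements the fixed-point update of Eq. (12) for the center representation, reusing the gaussian_gram helper from the sketch in Sect. 2.2. Initializing \(\hat{\mathbf{Y}}\) with \(\hat{\mathbf{X}}\) and assuming \(\beta >0\) are our own choices for illustration; this is not the authors' implementation.

```python
import numpy as np

def gaussian_weights(y, Z, delta):
    """Kernel weights G_delta(y - z_j) for each row z_j of Z."""
    return np.exp(-np.sum((Z - y) ** 2, axis=1) / (2 * delta ** 2))

def pri_center_update(X_hat, beta, delta, n_iter=3):
    """Fixed-point iteration of Eq. (12) for the center representation y_star of a
    local cube X_hat (N_hat x d). Y_hat is initialized to X_hat and only the center
    row is moved, following Sect. 3.1; requires beta > 0."""
    Y_hat = X_hat.copy()
    star = X_hat.shape[0] // 2                               # index of the center pixel
    for _ in range(n_iter):
        y_star = Y_hat[star]
        w_yy = gaussian_weights(y_star, Y_hat, delta)        # G_delta(y_star - y_j)
        w_yx = gaussian_weights(y_star, X_hat, delta)        # G_delta(y_star - x_j)
        V_y  = gaussian_gram(Y_hat, Y_hat, delta).mean()     # V(Y_hat)
        V_yx = gaussian_gram(Y_hat, X_hat, delta).mean()     # V(Y_hat; X_hat)
        ratio = (V_yx / V_y) * (1.0 - beta) / beta           # c * (1 - beta) / beta, Eq. (11)
        Y_hat[star] = (ratio * (w_yy @ Y_hat) / w_yx.sum()
                       - ratio * (w_yy.sum() / w_yx.sum()) * y_star
                       + (w_yx @ X_hat) / w_yx.sum())        # Eq. (12)
    return Y_hat[star]
```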

We also introduce two modifications to increase the discriminative power of the new representation. First, different values of \(\hat{n}\) (3, 5, 7, 9, 11, and 13 in this work) are used to model both local and global structures. Second, to reduce the redundancy of the raw features constructed by concatenating PRI representations at multiple scales, we further perform a regularized LDA (Bandos et al., 2009).

Note that the hyper-parameter \(\beta \) and the window width \(\hat{n}\) play different roles in MPRI. Specifically, \(\beta \) balances the trade-off between the regularity of the extracted representation and its discriminative power with respect to the given data. Therefore, it should be set in a reasonable range to avoid an over-smoothed result and unsatisfactory classification performance. A deeper discussion is given in Sect. 4.1.1. By contrast, \(\hat{n}\) controls the spatial scale of the learned representation. The motivation is that the discriminative information of different categories may not be easily characterized by a sliding window of a fixed size (i.e., \(\hat{n}\times \hat{n}\)). It is therefore favorable to incorporate discriminative information from different scales by varying the value of \(\hat{n}\).
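A rough sketch of these two modifications is given below, building on the local_cube and pri_center_update helpers above. The naive per-pixel loop and the use of scikit-learn's shrinkage LDA as a stand-in for the regularized LDA of Bandos et al. (2009) are simplifications for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def multiscale_pri_features(T, beta, delta, scales=(3, 5, 7, 9, 11, 13), n_iter=3):
    """Concatenate PRI representations computed at several window widths n_hat.
    The per-pixel loop mirrors Fig. 3; it is written for clarity, not speed."""
    m, n, d = T.shape
    feats = []
    for n_hat in scales:
        F = np.zeros((m, n, d))
        for r in range(m):
            for c in range(n):
                F[r, c] = pri_center_update(local_cube(T, r, c, n_hat),
                                            beta, delta, n_iter)
        feats.append(F)
    return np.concatenate(feats, axis=2)      # (m, n, d * number of scales)

def reduce_with_lda(features, labels, mask):
    """Project the concatenated multiscale features with a shrinkage-regularized LDA
    (scikit-learn's eigen solver, used here as a stand-in for Bandos et al., 2009),
    fitted only on the labeled pixels indicated by the boolean mask."""
    m, n, D = features.shape
    flat = features.reshape(-1, D)
    lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto")
    lda.fit(flat[mask.ravel()], labels[mask])
    return lda.transform(flat).reshape(m, n, -1)
```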

3.2 Stacking multiple units

In order to characterize spectral-spatial structures in a coarse-to-fine manner, we stack multiple spectral-spatial feature learning units described in Sect. 3.1 to constitute a multilayer structure, and concatenate the representations from each layer to form the final spectral-spatial representation. This representation is finally fed into a standard k-nearest neighbors (KNN) classifier.

Different from existing DNNs, which are typically trained with error backpropagation or a combination of greedy layer-wise pretraining and fine-tuning, our multilayer structure is trained successively from the bottom layer to the top layer without error backpropagation. For the ith layer, the input of PRI is the representation learned from the previous layer (denoted \(T_{i-1}\)). We then learn the new representation \(T_i\) by iteratively updating \(T_{i-1}\) with Eq. (12), followed by a dimensionality reduction step with LDA at the end of the iterations. The multiscale PRI can be trained in parallel with respect to the different sliding window sizes.
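A greedy layer-wise sketch of this stacking procedure, assuming the multiscale helpers sketched in Sect. 3.1, could look as follows; the hyper-parameter values here are placeholders, not the tuned settings of Sect. 4, and a single \(\beta \) is used for brevity.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mpri_features(T, labels, mask, n_layers=3, beta=3.0, delta=0.4):
    """Greedy layer-wise sketch of MPRI: each layer applies multiscale PRI followed
    by LDA to the previous layer's output, and the per-layer outputs are
    concatenated to form the final spectral-spatial representation."""
    layer_in, per_layer = T, []
    for _ in range(n_layers):
        feats = multiscale_pri_features(layer_in, beta, delta)
        layer_in = reduce_with_lda(feats, labels, mask)   # input of the next layer
        per_layer.append(layer_in)
    return np.concatenate(per_layer, axis=2)

# The concatenated representation is then classified pixel-wise with a 1-NN
# classifier fitted on the labeled training pixels, e.g.:
# knn = KNeighborsClassifier(n_neighbors=1).fit(train_features, train_labels)
```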

The interpretation of DNNs as a way of creating successively better representations of the data has already been suggested and explored by many (e.g., Achille and Soatto, 2018). Most recently, Shwartz-Ziv and Tishby (2017) put forth an interpretation of DNNs as creating sufficient representations of the data that are increasingly minimal. To gain an intuitive understanding of the inner mechanism of our deep architecture, we plot the 2D projection (after 1000 t-SNE (Maaten and Hinton, 2008) iterations) of the features learned at different layers in Fig. 4. Similar to DNNs, MPRI creates successively more faithful and separable representations in deeper layers. Moreover, the deeper features can discriminate within-class samples located in different geographic regions, even though we do not manually incorporate geographic information during training.

Fig. 4

2D projection of features learned by MPRI at different layers on the Indian Pines data set. Features of "Woods" in the 1st layer, the 2nd layer, and the 3rd layer are marked with red rectangles in a-c. Similarly, features of "Grass-pasture" are marked with magenta ellipses in d-f. g The locations of "Region 1" and "Region 2". h The locations of "Region 3", "Region 4" and "Region 5". i Class legend

4 Experimental results

We conduct three groups of experiments to demonstrate the effectiveness and superiority of MPRI. Specifically, we first perform a simple test to determine a reliable range for the value of \(\beta \) in PRI and the number of layers in MPRI. Then, we implement MPRI and several of its degraded variants to analyze and evaluate component-wise contributions to the performance gain. Finally, we evaluate MPRI against state-of-the-art methods on benchmark data sets using both visual and quantitative evaluations.

Three popular data sets, namely Indian Pines (Baumgardner et al., 2015), Pavia University, and Pavia Center, are used in this work. The properties of each data set are summarized in Table 1.

  1. The first image, displayed in Fig. 5a, is the Indian Pines scene. It was gathered by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor over the agricultural Indian Pines test site in northwestern Indiana, United States. The size of this image is \(145\times 145\) pixels with a spatial resolution of 20 m. The low spatial resolution leads to the presence of highly mixed pixels (Ghamisi et al., 2014). A three-band false color image and the ground-truth map are shown in Fig. 5a, b, where there are 16 classes of interest; the name and number of samples of each class are reported in Fig. 5c. The number of bands has been reduced to 200 by removing 20 bands covering the region of water absorption. This scene constitutes a challenging classification problem due to the significant presence of mixed pixels in all available classes and the unbalanced number of available labeled pixels per class (Li et al., 2013).

  2. The second image is Pavia University, which was recorded by the reflective optics system imaging spectrometer (ROSIS) sensor during a flight campaign over Pavia, northern Italy. This scene has \(610\times 340\) pixels with a spatial resolution of 1.3 m and covers the wavelength range from 0.4 to 0.9 \(\upmu \)m. There are 9 ground-truth classes, including trees, asphalt, bitumen, gravel, metal sheet, shadow, bricks, meadow, and soil. In our experiments, 12 noisy bands were removed and 103 of the 115 bands were used. The class descriptions and sample distributions for this image are given in Fig. 6c. As can be seen, the total number of labeled samples in this image is 43,923. A three-band false color image and the ground-truth map are shown in Fig. 6.

  3. The third data set is Pavia Center. It was acquired by the ROSIS-3 sensor in 2003, with a spatial resolution of 1.3 m and 102 spectral bands (some bands have been removed due to noise). A three-band false color image and the ground-truth map are shown in Fig. 7a, b. The number of ground-truth classes is 9 (see Fig. 7), and the image consists of \(1096\times 492\) pixels. The number of samples per class ranges from 2108 to 65,278 (Fig. 7e). There are 5536 training samples and 98,015 testing samples (Fig. 7c, d). Note that the training samples are excluded from the testing samples.

Table 1 Details of data sets
Fig. 5

a False color composition of the AVIRIS Indian Pines scene. b Reference map containing 16 mutually exclusive land-cover classes. c The numbers of the labeled samples

Fig. 6

a False color composition of the Pavia University scene. b Reference map containing 9 mutually exclusive land-cover classes. c The numbers of the labeled samples

Fig. 7

a False color composition of the Pavia Center scene. b Reference map containing 9 mutually exclusive land-cover classes. c Training samples. d Testing samples. e The numbers of the labeled samples

Three metrics are used for quantitative evaluation (Cao et al., 2018): overall accuracy (OA), average accuracy (AA), and the kappa coefficient \(\kappa \). OA is the percentage of correctly classified test pixels, AA is the mean of the per-class percentages of correctly classified pixels, and \(\kappa \) accounts for both omission and commission errors and gives a good representation of the overall performance of the classifier.
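For reference, the three metrics can be computed from a confusion matrix as in the following sketch, which we add for illustration.

```python
import numpy as np

def oa_aa_kappa(conf):
    """Overall accuracy, average accuracy, and Cohen's kappa from a confusion
    matrix whose rows are true classes and columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                                    # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))                 # average per-class accuracy
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```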

For our method, the value of the kernel width \(\delta \) in PRI was tuned around the multivariate Silverman rule-of-thumb (Silverman, 1986): \((\frac{4}{d+2})^{\frac{1}{d+4}} s^{\frac{-1}{4+d}}\sigma _1\le \delta \le (\frac{4}{d+2})^{\frac{1}{d+4}} s^{\frac{-1}{4+d}} \sigma _2\), where s is the sample size, d is the variable dimensionality, and \(\sigma _1\) and \(\sigma _2\) are, respectively, the smallest and largest standard deviations among the dimensions of the variable. For example, on the Indian Pines data set, the estimated range in the 5th layer is [0.05, 0.51], and we set the kernel width to 0.4. On the other hand, the PRI in each layer is optimized with \(\tau =3\) iterations, which we observed to be sufficient to provide desirable performance.
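The rule-of-thumb range for \(\delta \) can be computed as in the following short sketch, which simply transcribes the formula above for samples arranged as rows of a matrix.

```python
import numpy as np

def silverman_range(X):
    """Return the [lower, upper] range for the kernel width delta given samples X
    of shape (s, d): factor = (4 / (d + 2))**(1 / (d + 4)) * s**(-1 / (d + 4)),
    scaled by the smallest and largest per-dimension standard deviations."""
    s, d = X.shape
    factor = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * s ** (-1.0 / (d + 4))
    stds = X.std(axis=0)
    return factor * stds.min(), factor * stds.max()
```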

4.1 Parameter analysis

4.1.1 Effects of parameter \(\beta \) in PRI

The parameter \(\beta \) in PRI balances the trade-off between the regularity of the extracted representation and its discriminative power with respect to the given data. We show the values of OA, AA, and \(\kappa \) for MPRI with respect to different values of \(\beta \) in Fig. 8a. As can be seen, these quantitative values are initially stable but decrease when \(\beta \ge 3\). Moreover, the value of AA drops more drastically than that of OA or \(\kappa \). A likely interpretation is that when training samples are limited, many classes have only a few labeled samples (about 1 for minority classes such as Oats, Grass-pasture-mowed, and Alfalfa). An unreasonable value of \(\beta \) may severely affect the classification accuracy of these classes, thereby decreasing AA first.

The corresponding classification maps are shown in Fig. 9. Obviously, the smaller the \(\beta \), the smoother the results achieved by MPRI. This is because a large \(\beta \) encourages a small divergence between the extracted representation and the original HSI data. For example, when \(\beta =0\), PRI clusters both spectral and spatial structures into a single point (the data mean) that has no discriminative power. By contrast, when \(\beta \rightarrow \infty \), the extracted representation reverts to the HSI data itself (to minimize their divergence), so that PRI fits all noisy and irregular structures.

From the above analysis, extremely large and extremely small values of \(\beta \) are not interesting for HSI classification. Moreover, the results suggest that \(\beta \in [2, 4]\) strikes a good trade-off between preserving relevant spatial information (such as edges in classification maps) and filtering out unnecessary details. Unless otherwise specified, the PRI used in the following experiments takes three different values of \(\beta \), i.e., \(\beta =2\), \(\beta =3\), and \(\beta =4\). The final representation of PRI is formed by concatenating the representations obtained for each \(\beta \).

Fig. 8

a Quantitative evaluation with different values of \(\beta \). b Quantitative evaluation with different number of layers

Fig. 9

Classification maps of MPRI with a \(\beta =0\); b \(\beta =1\); c \(\beta =2\); d \(\beta =3\); e \(\beta =4\); f \(\beta =5\); g \(\beta =6\); and h \(\beta =100\)

4.1.2 Effects of the number of layers

We then show the values of OA, AA, and \(\kappa \) for MPRI with respect to different numbers of layers in Fig. 8b. The corresponding classification maps are shown in Fig. 10. Similar to existing deep architectures, stacking more layers (within a reasonable range) increases performance. If we keep the input data size the same, adding layers beyond a certain number no longer improves performance, and the classification maps become over-smoothed. This work uses a 5-layer MPRI because it provides favorable visual and quantitative results.

Fig. 10

Classification maps of MPRI with a 1 layer; b 2 layers; c 3 layers; d 4 layers; e 5 layers; f 6 layers; g 7 layers; and h 8 layers

4.1.3 Effects of the classifier

MPRI uses a basic KNN classifier with \(k=1\) throughout this work. The purpose is to validate the discriminative power of the spectral-spatial features extracted by multiple layers of PRI. To further confirm that the superiority of MPRI is independent of the classifier used, we also evaluate the performance of MPRI and three other feature extraction methods for HSI classification (EPF (Kang et al., 2014), SADL (Soltani-Farani et al., 2015), and PCA-EPF (Kang et al., 2017)) using a kernel SVM as the baseline classifier. The kernel size \(\sigma \) is tuned by cross-validation over the set \(\{0.0001, 0.001, 0.01, 0.1, 1, 10, 100\}\). The best performance is summarized in Table 2. As can be seen, KNN and SVM lead to comparable results, and MPRI is consistently better than the other competitors, regardless of the classifier used.
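A sketch of this baseline comparison with scikit-learn is given below; note that scikit-learn parameterizes the RBF kernel by \(\gamma \) rather than by \(\sigma \) directly, so the correspondence between the two grids is an assumption of this illustration.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Cross-validate the RBF-SVM baseline over a grid of kernel sizes; gamma here plays
# the role of the kernel size searched in the text (e.g., gamma = 1 / (2 * sigma**2)
# would be one possible mapping, assumed rather than taken from the paper).
svm_search = GridSearchCV(SVC(kernel="rbf"),
                          param_grid={"gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]},
                          cv=5)
# svm_search.fit(train_features, train_labels)
# best_oa = svm_search.score(test_features, test_labels)
```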

Table 2 OAs (%) of different methods using KNN and SVM

4.2 Evaluation on component-wise contributions

Before systematically evaluating the performance of MPRI, we first compare it with its degraded baseline variants to demonstrate the component-wise contributions to the performance gain. The results are summarized in Table 3. As can be seen, models that only consider one attribute (i.e., multi-layer, multi-scale, or multi-\(\beta \)) improve the performance only marginally. Moreover, it is interesting to find that multi-layer and multi-scale play more significant roles than multi-\(\beta \). One possible reason is that the representations learned from different \(\beta \) values contain redundant information with respect to class labels. However, either the combination of multi-layer and multi-\(\beta \) or the combination of multi-scale and multi-\(\beta \) obtains remarkable improvements. Our MPRI performs the best, as expected. This result indicates that multi-layer, multi-scale, and multi-\(\beta \) are all essential for HSI classification.

Table 3 Quantitative evaluation of our MPRI (the last row) and its degraded baseline variants in terms of OA, AA, and \(\kappa \)

4.3 Comparison with state-of-the-art methods

Having illustrated component-wise contributions of MPRI, we compare it with several state-of-the-art methods, including EPF (Kang et al., 2014), MPM-LBP (Li et al., 2013), SADL (Soltani-Farani et al., 2015), MFL (Li et al., 2015), PCA-EPF (Kang et al., 2017), HIFI (Pan et al., 2017), hybrid spectral convolutional neural network (HybridSN) (Roy et al., 2020), similarity-preserving deep features (SPDF) (Fang et al., 2019), convolutional neural network with Markov random fields (CNN-MRF) (Cao et al., 2018), local covariance matrix representation (LCMR) (Fang et al., 2018), and random patches network (RPNet) (Xu et al., 2018).

Tables 4, 5 and 6 summarize the quantitative evaluation results of the different methods. For each method, we report its classification accuracy in each land cover category as well as the overall OA, AA, and \(\kappa \) values across all categories. To avoid biased evaluation, we average the results of 10 independent runs (except for the Pavia Center data set, in which the training and testing samples are fixed). Obviously, MPRI achieves the best or the second best performance on most items. These results suggest that MPRI learns more discriminative spectral-spatial features than its counterparts based on classical machine learning models.

Table 4 Classification accuracies (%) of different methods on Indian Pines data set
Table 5 Classification accuracies (%) of different methods on Pavia University data set
Table 6 Classification accuracies (%) of different methods on Pavia Center data set with fixed training and testing split

The classification maps of the different methods on the three data sets are shown in Figs. 11, 12 and 13, which further corroborate the above quantitative evaluations. The maps of EPF and MPM-LBP are omitted due to their relatively lower quantitative scores. It is easy to observe that our proposed MPRI significantly improves region uniformity (see the small regions marked with dashed borders) and edge preservation (see the small regions marked by solid-line rectangles), both of which are critical criteria for evaluating classification maps (Kang et al., 2017). By contrast, the other methods either fail to preserve local details (such as edges) of different classes (e.g., MFL) or generate noise in uniform regions (e.g., SADL, PCA-EPF and HIFI).

Fig. 11

Classification maps on Indian Pines data set. a SADL; b MFL; c PCA-EPF; d HIFI; e HybridSN; f SPDF; g CNN-MRF; h RPNet; i MPRI

Fig. 12

Classification maps on University of Pavia data set. a SADL; b MFL; c PCA-EPF; d HIFI; e HybridSN; f SPDF; g CNN-MRF; h RPNet; i MPRI

Fig. 13

Classification maps on Pavia Center data set. a SADL; b MFL; c PCA-EPF; d HIFI; e HybridSN; f SPDF; g CNN-MRF; h RPNet; i MPRI

To evaluate the robustness of our method with respect to the number of training samples, Fig. 14 shows the OA values of different methods over a range of percentages of training samples per class. As expected, more training samples yield better classification performance. However, MPRI is consistently superior to its counterparts, especially when the training samples are limited.

Fig. 14

OA values of different methods with respect to different percentages of training samples per class on a Indian Pines; and b Pavia University. The results on Pavia Center are omitted because the training and testing samples are fixed

4.4 Computational complexity analysis

We finally investigate the computational complexity of different sliding-window-filtering based HSI classification methods. Note that PRI can also be interpreted as a special kind of filtering, since the center pixel representation is determined by its surrounding pixels with Gaussian weights [see Eq. (12)].

The computational complexity and the average running time per pixel (in seconds) of the different methods are summarized in Table 7. For PCA-EPF, \({\tilde{d}}\) is the dimension of the averaged images, \({\hat{S}}\) is the number of different filter parameter settings, and \({\hat{T}}\) is the number of iterations. For HIFI, d is the number of hyperspectral bands and \(\hat{H}\) is the number of hierarchies. For MPRI, L, S, and B are, respectively, the numbers of layers, scales, and betas. Usually, \({\tilde{d}}\) is set to 16, \({\hat{S}}\) to 3, and \({\hat{T}}\) to 3, which makes PCA-EPF very fast.

According to Eq. (7), the computational complexity of PRI grows quadratically with the data size (i.e., \(\hat{N}\)). Although one can apply a rank-deficient approximation to the Gram matrix for efficient computation of PRI, this strategy is preferable only when \(l\ll \hat{N}\), where l is the square of the number of subsamples used to approximate the original Gram matrix (Sánchez Giraldo and Príncipe, 2011). In our application, \(\hat{N}\) is less than a few hundred (about 169 at most), whereas we always need to set \(l\ge 25\) to guarantee a non-decreasing accuracy. As Table 7 shows, the computation saved by the Gram matrix approximation is marginal. Moreover, as shown in Fig. 15, such an approximation is prone to causing an over-smoothing effect.
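For reference, a generic Nyström-style rank-deficient approximation of the Gaussian Gram matrix, reusing the gaussian_gram helper from Sect. 2.2, could be sketched as follows; this is an illustrative sketch, not the exact Nyström-KECA variant used in Fig. 15.

```python
import numpy as np

def nystrom_gram(X, delta, l, seed=0):
    """Rank-deficient (Nystrom) approximation K ~= C @ pinv(W) @ C.T of the Gaussian
    Gram matrix, built from l randomly chosen landmark rows of X."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=l, replace=False)
    C = gaussian_gram(X, X[idx], delta)   # (N, l) cross-kernel block
    W = C[idx]                            # (l, l) landmark block
    return C @ np.linalg.pinv(W) @ C.T
```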

Finally, one should note that although MPRI takes more time than its sliding-window-filtering based counterparts, it is still much more time-efficient than prevalent DNN-based methods. For example, CNN-MRF takes more than 6,000 s (on a PC equipped with a single 1080 Ti GPU, an i7 8700k CPU, and 64 GB RAM) to train a CNN model using \(2\%\) labeled data on the Indian Pines data set with 10x data augmentation.

Table 7 Computation complexity and running time (per pixel) of different methods
Fig. 15

Classification maps of a MPRI and b MPRI with Nyström-KECA. The Gram matrix approximation is prone to causing an over-smoothing effect

5 Conclusions

This paper proposes the multiscale principle of relevant information (MPRI) for hyperspectral image (HSI) classification. MPRI uses PRI, an unsupervised information-theoretic learning principle that performs a mode decomposition of a random variable X with a known (and fixed) probability distribution g, controlled by a hyperparameter \(\beta \), as its basic building block. It integrates multiple such blocks into a multiscale (using sliding windows of different sizes) and multilayer (stacking PRI successively) structure to extract spectral-spatial features of HSI data in a coarse-to-fine manner. Different from existing deep neural networks, MPRI can be trained efficiently in a greedy layer-wise fashion without error backpropagation. Empirical evidence indicates that \(\beta \in [2,4]\) in PRI balances the trade-off between the regularity of the extracted representation and its discriminative power on HSI data. Comparative studies on three benchmark data sets demonstrate that MPRI is able to learn discriminative representations from 3D spectral-spatial data with significantly fewer training samples. Moreover, MPRI enjoys an intuitive geometric interpretation and promotes the region uniformity and edge preservation of classification maps. In the future, we intend to speed up the optimization of PRI; along this line, random Fourier features (Rahimi & Recht, 2008) seem to be a promising avenue.