1 Introduction

Hyperspectral sensors capture images with narrow and contiguous spectral bands covering not only the visible range of the spectrum but also the ultraviolet and infrared regions. The resulting datasets are three-dimensional and are represented as data cubes of size P × Q × N, where P and Q are the spatial dimensions and N is the spectral dimension. The cube can be viewed as a collection of two-dimensional scenes of P × Q pixels, each captured at a specific wavelength λ, and each pixel can be viewed as a vector of N reflectance values. One such sensor is the airborne visible/infrared imaging spectrometer (AVIRIS), which captures images with up to 224 spectral bands ranging from 400 to 2500 nm [1]. With such high spectral dimension, much more detailed and discriminative information about the objects can be acquired, which results in increased classification accuracy. However, the performance of many supervised classification methods is strongly affected by increased dimensionality. Hence, spectral dimension reduction is a crucial step in hyperspectral image analysis. Though hyperspectral images capture a wide range of the spectrum, not all bands are equally important for classification, as each band does not contribute equally toward the discrimination of the objects. Some bands contain very little or no relevant information. Moreover, adjacent bands often share redundant information. Thus, it is important to remove such redundant and noisy information before the classification phase.
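To make this layout concrete, the following minimal sketch (in NumPy; the array contents and index values are purely illustrative) shows how a spatial plane and a pixel spectrum are extracted from such a cube:

```python
import numpy as np

# Illustrative hyperspectral cube: P x Q spatial pixels, N spectral bands.
P, Q, N = 145, 145, 220            # dimensions of the AVIRIS Indian Pines scene
cube = np.random.rand(P, Q, N)     # placeholder for real reflectance data

# A single spatial plane: the P x Q scene captured at one wavelength index.
band_k = cube[:, :, 42]            # shape (P, Q)

# A single pixel: the vector of N reflectance values at location (i, j).
spectrum_ij = cube[10, 20, :]      # shape (N,)
```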

Many dimension reduction techniques are available in the literature. Techniques like PCA and ICA map the higher-dimensional feature space into a lower-dimensional one by carrying out linear or nonlinear transformations [2, 3]. In such transformations, the original interpretation of the spectral data is compromised. Feature selection is another method for dimension reduction, where only the most salient features are selected. Such methods can broadly be divided into two categories: (1) wrappers and (2) filters [4]. In wrappers, a learning algorithm is applied to examine the utility of the features of the dataset. As the feature selection is highly dependent on the learning algorithm applied, it is difficult in wrappers to switch among learning algorithms. Moreover, for large databases containing many features, wrappers may become intractable, as each potential subset of features is evaluated by the learning algorithm. In filters, a subset of features is selected as a preprocessing step irrespective of the learning algorithm to be applied, and hence filters are relatively more general than wrappers [5, 6].

In feature selection techniques, the main objective is to select the set of features that carries maximum information about the class and yet has minimum correlation among its members. In [2, 7], the authors have shown that adjacent bands of hyperspectral images are highly correlated, and they have exploited this correlation effectively to reduce the number of bands. However, in these schemes the correlation with the class was not considered. In our proposed algorithm, to avoid the redundant information carried by adjacent bands, an initial band grouping phase is carried out. In this phase, the bands are partitioned into K \( (K \le N) \) groups by constructing a bottom-up spectral partition tree that recursively merges adjacent spectral bands carrying redundant information. In the second and final phase, a band is selected from each group as its representative; these representatives jointly form the filtered set of selected bands. Both phases are discussed in Sect. 3.

2 Entropy and Mutual Information-Based Band Selection Scheme

2.1 Entropy

In information theory, entropy measures the information content of a random variable in terms of uncertainty. Let X be a discrete random variable with probability distribution p(x), where x ∈ X; the entropy is defined by

$$ H(X) = - \sum_{x \in X} p(x)\log p(x) $$
(1)

In the literature, entropy has been used as a band selection criterion for hyperspectral images [2, 3, 8, 9]. The entropy of each spectral band is calculated to measure its information content, and the bands with the highest entropy values are selected for classification. As only the bands with the highest information content are selected, the classification accuracy is often high. However, the information content measured by entropy suffers from a lack of a reference or objective: the information content of some bands, though high, may have no relevance to the target classification.
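As an illustration, Eq. (1) can be estimated for one band from a gray-level histogram, as in the following minimal sketch (the function name, bin count, and the ranking snippet in the comments are our own, not part of the cited schemes):

```python
import numpy as np

def band_entropy(band, bins=256):
    """Shannon entropy (Eq. 1) of one spectral band, via a histogram estimate."""
    hist, _ = np.histogram(band, bins=bins)
    p = hist / hist.sum()       # empirical probability distribution p(x)
    p = p[p > 0]                # skip empty bins, since 0 log 0 = 0
    return -np.sum(p * np.log2(p))

# Entropy-based selection then ranks the bands and keeps the top ones, e.g.:
# entropies = [band_entropy(cube[:, :, k]) for k in range(N)]
# selected = np.argsort(entropies)[::-1][:num_bands]
```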

2.2 Mutual Information (MI)

Mutual information is another technique used to evaluate the effectiveness of a spectral band. Unlike entropy, mutual information-based approaches measure not only the information content of the individual spectral bands but also the relevance of that information to the target (reference) image [10, 11]. In information theory, MI measures the mutual dependence between two random variables. Mathematically, if X and Y are two discrete random variables with probability distributions p(x) and p(y) and joint probability distribution p(x, y), where x ∈ X and y ∈ Y, then MI is calculated as [12],

$$ I(X,Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\log \frac{p(x,y)}{p(x)\,p(y)} $$
(2)
$$ = H(X) - H(X|Y) $$
(3)
$$ = H(X) + H(Y) - H(X,Y) $$
(4)

Here, H(X) and H(Y) are the entropies of X and Y, respectively, H(X|Y) is the conditional entropy of X given Y, and H(X, Y) is the joint entropy of X and Y. Equations (5) and (6) define the conditional and joint entropies.

$$ H(X|Y) = - \sum_{x \in X} \sum_{y \in Y} p(x,y)\log p(x|y) $$
(5)
$$ H(X,Y) = - \sum_{x \in X} \sum_{y \in Y} p(x,y)\log p(x,y) $$
(6)

In [11], the authors calculated the MI of each band in the dataset with the corresponding reference image to measure the effectiveness of the spectral bands in terms of information content about the various classes (Fig. 1).

Fig. 1 Diagrammatic representation of entropy and mutual information
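For illustration, Eq. (2) can be estimated between a quantized band and the reference image from their joint histogram, using the identity of Eq. (4). This is a minimal sketch; the bin count and function names are assumptions:

```python
import numpy as np

def mutual_information(band, reference, bins=256):
    """MI between a band and the reference map via Eq. (4):
    I(X, Y) = H(X) + H(Y) - H(X, Y), estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(band.ravel(), reference.ravel(), bins=bins)
    pxy = joint / joint.sum()   # joint distribution p(x, y)
    px = pxy.sum(axis=1)        # marginal p(x)
    py = pxy.sum(axis=0)        # marginal p(y)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    return entropy(px) + entropy(py) - entropy(pxy.ravel())
```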

3 Proposed MI-based Hierarchical Band Selection Approach

In this paper, we have proposed a hierarchical band selection scheme based on mutual information. The algorithm is divided into two phases: (i) band grouping and (ii) band selection. Figure 2 shows the proposed MI-based band grouping and selection method.

Fig. 2 Block diagram of the proposed method

3.1 Band Grouping Phase

The binary partition tree (BPT) for hyperspectral images was introduced in [12, 13] to merge regions based on spatial features. Inspired by the BPT concept, in the proposed algorithm we construct a spectral partition tree (SPT) by iterative bottom-up merging of the spectral bands. The individual bands form the initial nodes, i.e., the leaf nodes of the SPT. Thereafter, in each iteration the nodes of the upper level are created by merging two adjacent nodes. The merging criterion is decided by the difference of the MI values of the spectral bands with the reference image and a complementary threshold T. In the first iteration, where the input is the set of N spectral bands of the image, two adjacent bands n and n + 1 are merged if the difference d(n) = |MI(n) − MI(n + 1)| < T. This condition ensures that adjacent bands which do not carry any significant complementary information are grouped. The new node formed as a result of the merging is represented in the next level as the set of the merged spectral bands. The nodes which cannot be merged are carried over unchanged to the next iteration. For the remaining iterations, any two adjacent nodes, which may contain more than one spectral band (as a result of merging in previous iterations), are merged if dmax(n) = |max(MI(n)) − max(MI(n + 1))| < T and dmin(n) = |min(MI(n)) − min(MI(n + 1))| < T, where dmax(n) is the difference between the largest MI values of the nth and (n + 1)th nodes and dmin(n) is the difference between their smallest MI values. The algorithm stops when two consecutive iterations produce the same set of nodes. Figure 3 depicts an example of the construction and representation of the spectral binary tree.

Fig. 3 Example of construction and representation of the spectral binary tree
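The following sketch illustrates one reading of the grouping pass described above. The paper does not fix the order in which eligible pairs are merged, so the greedy left-to-right pass below is an assumption, as are the function and variable names. Note that for leaf nodes, the dmax/dmin test reduces to the first-iteration criterion d(n) < T:

```python
def group_bands(mi, T):
    """Bottom-up grouping pass of the spectral partition tree (SPT).

    mi : per-band MI values with the reference image (cf. Fig. 5)
    T  : complementary-information threshold
    Returns a list of groups, each a list of adjacent band indices.
    """
    nodes = [[k] for k in range(len(mi))]  # leaf level: one band per node
    while True:
        merged, i = [], 0
        while i < len(nodes):
            if i + 1 < len(nodes):
                a, b = nodes[i], nodes[i + 1]
                d_max = abs(max(mi[k] for k in a) - max(mi[k] for k in b))
                d_min = abs(min(mi[k] for k in a) - min(mi[k] for k in b))
                if d_max < T and d_min < T:  # no significant complementary info
                    merged.append(a + b)     # merge the two adjacent nodes
                    i += 2
                    continue
            merged.append(nodes[i])          # carry node to the next level
            i += 1
        if merged == nodes:                  # two identical iterations: stop
            return merged
        nodes = merged
```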

3.2 Band Selection Phase

In this phase, a representative band is selected from each of the K groups formed in the band grouping phase. From each group, the band yielding the highest MI value is selected as the group representative. Any representative band having an MI value less than a desired threshold \( \uptheta \) is discarded.

  • Algorithm: Band Selection

  • Input: band groups formed in phase 1

  • \( i \leftarrow 1 \)

  • \( S \leftarrow \emptyset \)

  • while \( i \le K \) do

    • select the band B with maximum MI value from the ith group

    • if \( MI(B) \ge \theta \) then

      • \( S \leftarrow S\mathop \cup \nolimits \{ B\} \)

    • end if

    • \( i \leftarrow i + 1 \)

  • end while

  • Output: selected band set S
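A runnable counterpart to the pseudocode above might look as follows (a minimal sketch; `groups` and `mi` are assumed to come from the grouping phase and the per-band MI computation):

```python
def select_bands(groups, mi, theta):
    """Band selection phase: keep each group's highest-MI band,
    discarding representatives whose MI falls below the threshold theta."""
    S = []
    for group in groups:                        # the K groups from phase 1
        best = max(group, key=lambda k: mi[k])  # band with maximum MI in group
        if mi[best] >= theta:                   # reject low-MI (noisy) groups
            S.append(best)
    return S

# e.g. selected = select_bands(group_bands(mi, T), mi, theta=0.30)
```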

4 Experimental Setup and Results

4.1 Dataset Description

For our experiment, we have used the AVIRIS Indian Pines dataset, collected over Northwest Indiana in June 1992 by the AVIRIS sensor. The dataset contains the Indian Pines image consisting of 220 spectral bands and 21,025 pixels (145 × 145). The image is accompanied by a reference image in which 10,249 pixels are labeled with a number from 1 to 16, denoting the class to which each pixel belongs. The remaining 10,776 pixels are labeled 0, indicating that they either belong to areas which are not of interest or could not be labeled due to technical difficulties. Out of the 16 classes, only the nine classes listed in Table 1 have been considered for our experiment; the other classes are discarded as they have relatively few labeled samples (Fig. 4).

Table 1 Number of samples for the different classes in the Indian Pines image dataset
Fig. 4 False color image of the reference image for the Indian Pines dataset, showing the various classes present
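For reference, the scene is commonly distributed as MATLAB files; a loading sketch is given below. The file and variable names are assumptions that vary between distributions of the dataset:

```python
import numpy as np
from scipy.io import loadmat

# File and key names are assumptions; they differ between distributions
# of the Indian Pines scene.
cube = loadmat('Indian_pines.mat')['indian_pines']          # 145 x 145 x 220
labels = loadmat('Indian_pines_gt.mat')['indian_pines_gt']  # 145 x 145, 0..16

mask = labels > 0    # 10,249 labeled pixels; 0 marks unlabeled areas
X = cube[mask]       # shape (10249, 220): one spectrum per labeled pixel
y = labels[mask]
```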

4.2 Experiment and Result

The experiment was performed by first calculating the MI of each spectral band of the Indian Pines image with the reference map. Figure 5 shows the calculated values plotted against the band numbers. It can be seen that some bands have very low MI values; this is due to water absorption in those bands.

Fig. 5 Mutual information of the spectral bands of the Indian Pines image with the reference image

In the band grouping phase, the construction of the spectral binary tree was experimented with for different values of the threshold T. For the band selection phase, the threshold \( \uptheta \) was set to 0.30. The number of groups formed in the band grouping phase and the number of representative bands selected from the groups for each threshold are listed in Table 2. For classification, a support vector machine was used, as it is known to work well with high-dimensional input spaces. For training and testing, 200 random samples from each class, restricted to the selected bands, were taken. The classifier was also tested with samples from the original Indian Pines image (containing 220 bands) and the corrected image (containing 200 bands). Table 2 shows the achieved experimental results.

Table 2 Experimental results of the proposed algorithm
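A classification sketch along these lines, using scikit-learn, is shown below. The kernel choice and the train/test split details are assumptions, as the text only specifies 200 random samples per class; the synthetic placeholders stand in for the real spectra, labels, and selected bands:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Placeholders: in the real experiment X holds labeled pixel spectra and
# y their class labels (nine classes); `selected` is the band subset from
# the selection phase.
rng = np.random.default_rng(0)
X = rng.random((9 * 300, 220))             # synthetic stand-in data
y = np.repeat(np.arange(1, 10), 300)
selected = np.arange(38)                   # stand-in for the selected bands

# 200 random training samples per class; the rest are used for testing.
train_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=200, replace=False)
    for c in np.unique(y)
])
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)

clf = SVC(kernel='rbf')                    # kernel choice is an assumption
clf.fit(X[np.ix_(train_idx, selected)], y[train_idx])
pred = clf.predict(X[np.ix_(test_idx, selected)])
print('Overall accuracy:', accuracy_score(y[test_idx], pred))
```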

From the experimental results, it is observed that as we limit the number of bands, the overall accuracy (OA) of classification also decreases. However, this decrease is not significant, and the results remain comparable with those obtained when all the bands are used. As the number of bands decreases from 200 to 38, the classification accuracy decreases by only 0.23%. However, when the number of bands is further decreased to 27, the degradation in classification accuracy is somewhat higher. The most significant observation is that the OA achieved with all 220 bands is lower than that achieved using 38 bands. This is due to the rejection of the groups of bands which have very low MI with the target image: the proposed algorithm discards bands which are noisy and do not contain any relevant information. In the corrected image with 200 bands, the noisy bands are already removed, and the classification accuracy is higher.

5 Conclusion

In this paper, we have proposed a hierarchical band selection approach based on mutual information. A spectral partition tree is constructed to group adjacent bands carrying redundant information. In the band selection phase, only one band carrying maximum information about the target image is selected from each group, to eliminate redundancy. From the results, it has been seen that the bands selected by the proposed algorithm, though limited in number, are highly capable of representing the original image without losing much information. Through this algorithm, the noisy and irrelevant bands get automatically discarded in the band selection phase. When the threshold is high, very few bands are selected and the achieved overall accuracy of classification is also low. As more bands are added to the selected set, the overall accuracy improves significantly. However, beyond some limit, this growth in classification accuracy with the number of selected bands slows down, and the additional accuracy achieved becomes insignificant compared to the increase in spectral dimension.