1 Introduction

Hyperspectral remote sensing samples narrow, contiguous spectral bands densely and has therefore been applied in a wide range of fields. Each band contributes a one-dimensional feature to the dataset, and the high spectral resolution offers the potential to discriminate different objects precisely [1, 2]. Hyperspectral image (HSI) classification aims to assign a category label, such as building, vegetation, or water, to each pixel of an HSI and thereby support object analysis of the whole image [3]. In general, classification techniques fall into two groups: unsupervised and supervised. Among unsupervised techniques, fuzzy clustering models, graph-based clustering, and the iterative self-organizing data analysis techniques algorithm (ISODATA) have been used to obtain the object information of HSIs [4,5,6]. However, these approaches use no prior knowledge and classify blindly from the feature values themselves, so satisfactory results are obtained only under special conditions [7]. Among supervised techniques, active learning, Bayesian modelling, and neural networks have been used to classify HSIs [8,9,10]. In short, supervised techniques need to know the characteristics of the training samples in advance, and in most situations they achieve higher classification accuracy than unsupervised techniques.

As a frequently used supervised classification technique, the support vector machine (SVM) is a machine learning model suited to classification problems on datasets with high dimension and small samples. It is based on statistical learning theory and the structural risk minimization principle and has been applied extensively in many fields [11, 12]. In particular, several studies have explored HSI classification with SVM and achieved better classification accuracy than traditional supervised classification techniques [13, 14]. However, as spatial resolution improves, objects become more finely detailed in local regions, and it is difficult to discriminate different objects with similar feature values by using a single classifier [15]. Ensemble learning is a machine learning paradigm that combines multiple sub-classifiers to solve the same problem; because the sub-classifiers emphasize different aspects, the ensemble generalizes better than a single classifier, especially for indeterminate objects [16]. Accordingly, SVM-based classifier ensembles have been used to address financial distress and druggable protein problems with dozens of features on public datasets [17, 18]. For HSI classification, however, computational efficiency is limited when all bands are input directly into each sub-classifier [19,20,21]. Moreover, the kernel function is central to the performance of SVM: it maps the input into a higher-dimensional solution space so that a complex classification problem becomes linearly separable [22]. In recent years, wavelet theory and SVM have been combined into the wavelet SVM (WSVM) as an improvement of SVM. Feature vectors are extracted from time series by exploiting the non-linear mapping of SVM and the local analysis of the wavelet kernel, but a mapping into a trigonometric function of one fixed type may not adapt to the different distributions of datasets [23].

On the other hand, the abundant band information of HSIs also leads to the curse of dimensionality. Band selection aims to choose a set of feature values, each corresponding to one band of the HSI, that forms a band subset which avoids data redundancy and yields higher classification accuracy [24, 25]. Generating a band subset is a search process that chooses a combination of bands from the HSI via a complete, random, or heuristic search strategy. If an HSI contains N bands, there are \(2^{N}\) possible subsets, so the search space grows as \(O(2^{N})\) and the problem is regarded as NP-hard. Because band selection is a combinatorial optimization problem, swarm intelligence algorithms with heuristic search have been used to guide the search for a solution [26]. For instance, Su et al. [27] and Ghosh et al. [28] proposed band selection techniques that integrate SVM with particle swarm optimization (PSO) and differential evolution (DE) to find a band subset with satisfactory classification accuracy, but their optimization ability was limited by the lack of local search. More recently, newer swarm intelligence algorithms such as the gravitational search algorithm (GSA), the grey wolf optimizer (GWO), and the ant lion optimizer (ALO) have been applied to band selection [29,30,31]. They obtained better results than the previous techniques, but their time complexity increased with the spectral resolution and the total number of bands.

The whale optimization algorithm (WOA) is a recently proposed heuristic search algorithm that has been widely used in diverse applications [32,33,34]. Its performance depends on only one parameter, which makes it less prone to being trapped in local optima, and it converges stably towards the global optimal solution. However, in band selection the coding length of each individual equals the number of bands, and the CPU time needed to search for the global optimal solution among a large number of candidate solutions becomes uncontrolled when all bands are input into each sub-classifier. Membrane computing is a novel paradigm of natural computing in which a long coding can be separated into a series of short codings that enter elementary membranes [35], and each elementary membrane corresponds to a sub-classifier. The optimal band combination of each sub-classifier is obtained by WOA and then transmitted to the skin, where the optimal band subset of the HSI dataset is assembled. Hence, a band selection approach based on a WSVM ensemble model and membrane WOA (MWOA) is proposed to obtain the category label of each pixel. The bands of the HSI dataset are separated into several parts without duplication, and each part is input into the corresponding sub-classifier. Moreover, using WSVM for every sub-classifier addresses the homogeneity of the ensemble, while using multiple wavelet kernels emphasizes its heterogeneity. The original dataset is mapped through quadratic, exponential, and trigonometric functions of different types, which suits the different distributions of datasets, especially HSIs. The voting strategy is replaced by information integration, and the category label of each sample is output from the band information remaining on the skin of MWOA. The main contributions of the paper are as follows:

  • A novel WSVM ensemble model is designed, and WSVM with different wavelet kernels is utilized to balance the heterogeneity and homogeneity at the same time.

  • The optimal band combination of each sub-classifier is obtained by WOA, and a novel MWOA is presented to obtain the optimal band subset of HSI datasets according to the coding on the skin.

  • The category label of each sample is output by the remaining band information on the skin, and the discrimination ability of different objects with similar feature values is enhanced.

The rest of the paper is structured as follows. Section 2 describes the background of WSVM and WOA. Section 3 presents the proposed band selection technique. Section 4 summarizes the experimental results and discusses the data. Finally, Section 5 concludes the paper.

2 Backgrounds

2.1 Basic theory of SVM

Assume that a classification problem consists of n instance-label pairs \(S = \{(x_{i},y_{i})\}, i = 1,2,...,n\), where \(x_{i} \in R^{a}\) is an instance vector and \(y_{i} \in \{-1,+1\}\) is the category label of the sample. Training searches for a hyper-plane that separates the positive (+1) samples from the negative (-1) samples, which is found by minimizing the following expression:

$$ \phi(\omega,\varepsilon) = \frac{1}{2}||\omega||^{2}+C\sum\limits_{i=1}^{n} \varepsilon_{i} $$
(1)

Subject to constraints:

$$ y_{i}[(\omega \cdot x_{i}) +b] \geq 1 - \varepsilon_{i}, i = 1,2,...,n $$
(2)

where ω is the normal vector of the hyper-plane, b is the bias term, \(\varepsilon_{i} \geq 0\) is the slack variable that tolerates misclassified samples, and C is the penalty factor that weights the error term \({\sum }_{i=1}^{n} \varepsilon _{i}\). For data that are not linearly separable, a mapping function ϕ(x) projects the input into another solution space with higher dimension; the transformation of space depends on the definition of the kernel, which is expressed as (3):

$$ K(x,x^{\prime}) = \phi(x)^{T} \cdot \phi(x^{\prime}) $$
(3)

Among them, the widely adopted kernel is radial basis function (RBF) that is defined as (4):

$$ K(x,x^{\prime}) = exp\left[-\frac{\|x-x^{\prime}\|^{2}}{2\sigma^{2}}\right] $$
(4)

However, the transformation of space is conducted on a single fixed scale, which may fail to discriminate different forms of input with similar characteristics, and it is insufficient to generate a linearly separable space for similar feature values and high-dimensional vectors.
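As a minimal sketch of Eqs. (1)-(2), the following Python fragment evaluates the soft-margin objective for a given hyper-plane, taking each slack variable as the smallest value that satisfies constraint (2). The function names and the toy samples are illustrative assumptions and are not part of the method; the RBF kernel is written in its usual squared-Euclidean-distance form with a single width σ.

```python
import math


def soft_margin_objective(w, b, C, samples):
    """Evaluate Eq. (1) for a fixed hyper-plane (w, b).

    Each slack is the smallest value allowed by Eq. (2):
    eps_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    norm_sq = sum(wk * wk for wk in w)
    slack_sum = 0.0
    for x, y in samples:
        margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
        slack_sum += max(0.0, 1.0 - margin)
    return 0.5 * norm_sq + C * slack_sum


def rbf_kernel(x, x_prime, sigma):
    """RBF kernel with a single width sigma (assumed form)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Toy data (hypothetical): two well-separated samples and one inside the margin.
toy = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), 1)]
objective = soft_margin_objective(w=(1.0, 0.0), b=0.0, C=10.0, samples=toy)
```

Only the third toy sample lies inside the margin, so it alone contributes a non-zero slack; a larger C penalizes such samples more heavily relative to the margin term.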

2.2 Wavelet kernels combined with SVM

A kernel function that satisfies Mercer’s theorem is known as an admissible support vector (SV) kernel. A shift-invariant kernel takes the form \(K(x,x^{\prime })=K(x-x^{\prime })\) in the solution space, and Mercer’s theorem provides the conditions for determining whether a shift-invariant form is an admissible SV kernel [36].

The basic idea of wavelet analysis is to approximate an arbitrary function by a linear combination of wavelet bases. Assume that f(x) is a one-dimensional mother wavelet basis; a separable multi-dimensional wavelet function can then be expressed as:

$$ \phi_{d}(x) = \prod\limits_{i=1}^{d} f(x_{i}) $$
(5)

The shift-invariant form can be constructed for wavelet function as follows:

$$ \phi_{d}(x,x^{\prime}) = \prod\limits_{i=1}^{d} f\left( \frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}\right), $$
(6)

where σi > 0 is the wavelet scale factor.

The existing mother wavelet bases below satisfy the condition of the shift-invariant form. The Mexican hat, Complex Morlet, Shannon, and Harmonic functions are each an admissible SV kernel, and they are expressed as (7)-(10), respectively [37,38,39,40]:

$$ K(x,x^{\prime}) = \prod\limits_{i=1}^{d} \left[1-\frac{\left( x_{i}-x_{i}^{\prime}\right)^{2}}{\sigma_{i}^{2}}\right] \cdot exp\left[-\frac{\left( x_{i}-x_{i}^{\prime}\right)^{2}}{2\sigma_{i}^{2}}\right], $$
(7)
$$ K(x,x^{\prime}) = \prod\limits_{i=1}^{d} cos\left[1.75 \times \frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}\right] \cdot exp\left[-\frac{\left( x_{i}-x_{i}^{\prime}\right)^{2}}{2\sigma_{i}^{2}}\right], $$
(8)
$$ K(x,x^{\prime}) = \prod\limits_{i=1}^{d} \frac{sin\left[\frac{\pi}{2} \cdot \frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}\right]}{\frac{\pi}{2} \cdot \frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}} \cdot cos\left[\frac{3\pi}{2} \cdot \frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}\right], $$
(9)
$$ K(x,x^{\prime}) = \prod\limits_{i=1}^{d} \frac{e^{i4\pi\frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}}-e^{i2\pi\frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}}}{i2\pi\left( \frac{x_{i}-x_{i}^{\prime}}{\sigma_{i}}\right)}. $$
(10)

The above wavelet functions not only possess translation orthogonality but can also approximate an arbitrary function in the square-integrable space, and the transformation on multiple scales makes the input more likely to be discriminated. Because the wavelet function provides non-linear mapping on different scales, WSVM is well suited to classification decision-making and pays attention to misclassified samples.
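The real-valued kernels in (7)-(9) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the scale factors are passed as one value per dimension, and the Shannon kernel uses the limit value 1 of sin(v)/v at v = 0. The complex-valued Harmonic kernel (10) is omitted here.

```python
import math


def mexican_hat_kernel(x, x_prime, sigma):
    """Eq. (7): product over dimensions of (1 - u^2) * exp(-u^2 / 2)."""
    k = 1.0
    for xi, xpi, si in zip(x, x_prime, sigma):
        u2 = ((xi - xpi) / si) ** 2
        k *= (1.0 - u2) * math.exp(-u2 / 2.0)
    return k


def morlet_kernel(x, x_prime, sigma):
    """Eq. (8): product over dimensions of cos(1.75 u) * exp(-u^2 / 2)."""
    k = 1.0
    for xi, xpi, si in zip(x, x_prime, sigma):
        u = (xi - xpi) / si
        k *= math.cos(1.75 * u) * math.exp(-u * u / 2.0)
    return k


def shannon_kernel(x, x_prime, sigma):
    """Eq. (9): product over dimensions of sin(v)/v * cos(3v), v = (pi/2) u."""
    k = 1.0
    for xi, xpi, si in zip(x, x_prime, sigma):
        v = 0.5 * math.pi * (xi - xpi) / si
        sinc = 1.0 if v == 0.0 else math.sin(v) / v  # limit value at v = 0
        k *= sinc * math.cos(3.0 * v)
    return k
```

All three kernels depend only on the scaled difference between the two inputs, which is the shift-invariant form in (6), and each returns 1 when the two inputs coincide.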

2.3 Mathematical model of WOA

The whale optimization algorithm (WOA) is based on the predatory strategy of humpback whales, which tend to catch schools of krill or small fish near the surface by producing distinctive bubbles along a ring-shaped path. The operator is separated into three parts, namely encircling prey, spiral bubble-net attacking, and search for prey, and the main procedure of WOA is described below:

Encircling prey

Humpback whales are able to locate prey and encircle it. The position of the current optimal solution is assumed to be the target prey, or the closest available solution to the theoretical optimum. The other humpback whales then try to move their positions towards it. The process is written as follows:

$$ \mathbf{D} = |\mathbf{C} \cdot X^{*}(t)-X(t)|, $$
(11)
$$ X(t+1) = X^{*}(t) - \mathbf{A} \cdot \mathbf{D}, $$
(12)

where t is the current iteration number, \(X^{*}(t)\) is the position of the prey, and X(t) and X(t + 1) are the positions of the humpback whale in the current and next iteration, respectively. A and C are coefficient vectors expressed as \(\mathbf{A} = 2a \cdot r - a\) and \(\mathbf{C} = 2 \cdot r\), where a decreases gradually from 2 to 0 over the iterations and r is a random number uniformly distributed in [0,1].

Bubble-net attacking

During the exploitation phase, each humpback whale approaches the prey within a shrinking ring and simultaneously follows a spiral-shaped path. A probability of 0.5 is assumed for choosing between the shrinking-ring and the spiral mechanism, and the position of the humpback whale is renewed according to the distance between the current humpback whale and the prey. The spiral update is expressed as below:

$$ X(t+1) = \mathbf{D^{\prime}} \cdot e^{bl} \cdot cos(2 \pi l) + X^{*}(t) $$
(13)

where \(\mathbf {D^{\prime }}\) is the distance from the current humpback whale to the prey, expressed as \(\mathbf {D^{\prime }}=|X^{*}(t)-X(t)|\), b is a constant that defines the shape of the logarithmic spiral, and l is a random number in [-1,1].

Search for prey

This behavior uses the vector A to drive exploration: when A takes a random value greater than 1 or less than -1, the humpback whale is forced away from the local region, and its position is updated with respect to a randomly chosen humpback whale rather than the best one. The details are expressed as (14) and (15):

$$ \mathbf{D} = |\mathbf{C} \cdot X_{rand} - X(t)| $$
(14)
$$ X(t+1) = X_{rand} - \mathbf{A} \cdot \mathbf{D} $$
(15)

where D is the distance between a random humpback whale and the current one, and \(X_{rand}\) is the position of a humpback whale selected at random from the whole population.
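The three operators in (11)-(15) can be combined into a single position update, sketched below in Python for continuous positions. Several details are assumptions made for illustration: a decreases linearly from 2 to 0, A and C are drawn once per whale as scalars, the spiral constant defaults to b = 1, and the names `a_schedule` and `woa_step` are hypothetical.

```python
import math
import random


def a_schedule(t, max_iter):
    """Assumed linear decrease of a from 2 (t = 0) to 0 (t = max_iter)."""
    return 2.0 - 2.0 * t / max_iter


def woa_step(x, best, population, a, b=1.0, rng=random):
    """Return the next position of whale x according to Eqs. (11)-(15)."""
    if rng.random() < 0.5:
        # Shrinking-ring branch: encircle the best whale or a random one.
        A = 2.0 * a * rng.random() - a
        C = 2.0 * rng.random()
        # |A| < 1: exploit around the prey (Eqs. 11-12);
        # otherwise explore around a random whale (Eqs. 14-15).
        target = best if abs(A) < 1.0 else rng.choice(population)
        return [xt - A * abs(C * xt - xk) for xt, xk in zip(target, x)]
    # Spiral branch (Eq. 13) with l drawn uniformly from [-1, 1].
    l = rng.uniform(-1.0, 1.0)
    spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [abs(xb - xk) * spiral + xb for xb, xk in zip(best, x)]
```

Because |A| can exceed 1 only while a is greater than 1, exploration around random whales occurs in the early iterations under this schedule, and the later iterations concentrate on the best position found so far.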

3 The proposed method

3.1 Structure of WSVM ensemble model

In ensemble learning, multiple sub-classifiers are trained simultaneously and then aggregated to construct an ensemble model. A WSVM ensemble model is proposed to combine multiple sub-classifiers into a robust one with improved generalization and discrimination abilities. The distinctive feature of the ensemble is that a different band combination is input to each sub-classifier, and the construction is shown in Fig. 1.

Fig. 1 WSVM ensemble model for band selection

Building a classifier ensemble depends on two elements: how to construct each sub-classifier, and how to fuse the sub-classifiers into an ensemble classifier. This is conducted as follows. First, four sub-classifiers are built with WSVM, corresponding respectively to the Mexican hat, Complex Morlet, Shannon, and Harmonic kernels in Section 2.2. Then, all bands are entered into the WSVM ensemble model in the index order of the HSI: the first 25% of the band indexes are assigned to the first sub-classifier, the band indexes at 26%-50% to the second, the band indexes at 51%-75% to the third, and the last 25% to the fourth. Further, the optimal band combination of each sub-classifier is obtained, and once the optimal band subset has been assembled by information integration, the category label of each sample is output. No voting strategy is needed in the ensemble, which avoids the category ambiguity of samples that receive similar numbers of votes.
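The assignment of bands to sub-classifiers can be illustrated with a short Python sketch that splits the band indexes 1..N into consecutive, non-overlapping blocks. The function name is hypothetical, and the handling of a band count that is not divisible by four (earlier blocks receive one extra band) is an assumption, since the text does not specify it.

```python
def partition_bands(n_bands, n_parts=4):
    """Split band indexes 1..n_bands into consecutive, non-overlapping blocks."""
    size, extra = divmod(n_bands, n_parts)
    blocks, start = [], 1
    for k in range(n_parts):
        # Assumption: leftover bands are given to the earliest blocks.
        length = size + (1 if k < extra else 0)
        blocks.append(list(range(start, start + length)))
        start += length
    return blocks


blocks_204 = partition_bands(204)  # four blocks of 51 bands each
blocks_100 = partition_bands(100)  # four blocks of 25 bands each
```

For the 204-band and 100-band images used in Section 4, each sub-classifier therefore receives 51 and 25 bands, respectively.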

3.2 Strategy of MWOA

Membrane computing is a branch of natural computing that investigates computing models abstracted from the interactions of cells in tissues, and it allows a complex problem to be decomposed into a combination of easily solved problems. The core issue for MWOA and the WSVM ensemble model is how to express the band selection problem, which requires a suitable mapping between MWOA and the WSVM ensemble model in the solution space. The structure of membrane computing combined with the WSVM ensemble model is shown in Fig. 2.

Fig. 2 Membrane computing for WSVM ensemble model

Each band has exactly two candidate statuses in band selection: selected or deselected. If an HSI has N bands, the coding length of a humpback whale on each elementary membrane is [N/4]. Every bit of MWOA is set to “0” or “1”, where “1” means the current band is selected and “0” means it is not. For instance, suppose an HSI has 20 bands and 2 sub-classifiers are used for classification, and the coding on the skin of MWOA is “0010000101 (elementary membrane 1) ∥0100101000 (elementary membrane 2)”. Then the 3rd, 8th, and 10th bands from the first sub-classifier and the 12th, 15th, and 17th bands from the second sub-classifier are selected for classification, and the remaining bands are abandoned. The optimal band subset of the HSI dataset is obtained once the optimal individuals on the elementary membranes have been transmitted to the skin.
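The decoding of the skin coding into band indexes can be sketched as follows. The function name `decode_skin` is hypothetical, and each elementary membrane is represented simply as a string of “0” and “1” characters.

```python
def decode_skin(membrane_codes):
    """Map the concatenated membrane codings to 1-based selected band indexes."""
    selected, offset = [], 0
    for code in membrane_codes:
        for pos, bit in enumerate(code, start=1):
            if bit == "1":
                selected.append(offset + pos)
        # The next membrane continues where this one ended.
        offset += len(code)
    return selected


bands = decode_skin(["0010000101", "0100101000"])
print(bands)  # [3, 8, 10, 12, 15, 17]
```

The output reproduces the example in the text: bands 3, 8, and 10 from the first elementary membrane and bands 12, 15, and 17 from the second.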

3.3 Definition of objective function

The main goal of band selection for each sub-classifier is to improve the classification accuracy while reducing the number of bands selected by each individual, and the performance of the classifier ensemble is enhanced by combining the sub-classifiers. In this paper, WSVM acts as the sub-classifier. Classification accuracy is an important goal, but reducing the number of selected bands is equally imperative. The overall goal is to gain the highest possible classification accuracy with as few bands as possible. Thus, the fitness value is computed as (16):

$$ F(i) = 0.25*\sum\limits_{j=1}^{4} \left[\lambda \cdot Acc(i,j) + (1-\lambda) \cdot log_{10} \frac{n_{c}}{n_{s}(i,j)}\right] $$
(16)

where F(i) indicates the fitness value of the i-th humpback whale, nc and ns(i,j) are respectively the total and selected numbers of bands on the j-th elementary membrane, and Acc(i,j) is the classification accuracy of the j-th sub-classifier. λ is a weighting parameter that balances the classification accuracy against the selected number of bands, and it is set to λ = 0.9 here.
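A small Python sketch of (16) makes the trade-off concrete. It assumes that the accuracy is given as a fraction in [0, 1] and that a membrane with no selected band is rejected, since the logarithm is undefined in that case; both choices, and the function name, are illustrative assumptions.

```python
import math


def fitness(accuracies, selected_counts, n_c, lam=0.9):
    """Eq. (16): mean over membranes of lam * Acc + (1 - lam) * log10(n_c / n_s)."""
    terms = []
    for acc, n_s in zip(accuracies, selected_counts):
        if n_s <= 0:
            # Assumption: an empty band subset is treated as invalid.
            raise ValueError("each membrane must select at least one band")
        terms.append(lam * acc + (1.0 - lam) * math.log10(n_c / n_s))
    # With four membranes this equals the factor 0.25 times the sum in Eq. (16).
    return sum(terms) / len(terms)


# Hypothetical values: 50 bands per membrane, 5 selected, accuracy 0.9 on each.
example = fitness([0.9] * 4, [5] * 4, n_c=50)
```

With these hypothetical values the band term equals log10(10) = 1, so every membrane contributes 0.9 × 0.9 + 0.1 × 1 = 0.91. Selecting all bands would reduce the band term to zero, which is how the objective rewards smaller subsets.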

3.4 Implementation of the proposed method

The proposed band selection approach is straightforward to implement. The WSVM ensemble model and MWOA are designed to obtain the optimal band subset and conduct pixel-level classification of entire HSIs, and the exact process is given in the following flow chart (Fig. 3) and pseudocode:

Fig. 3 Flow chart of the proposed method

Pseudocode of the proposed method (figure f)

4 Experimental results and discussion

The proposed band selection approach is implemented in MATLAB 2014b on a personal computer with a 3.60 GHz CPU and 8.00 GB of RAM running the Windows 10 operating system.

The experiments focus on the optimization ability of MWOA and the classification performance of the WSVM ensemble model. One public HSI, SalinasA [41], and three measured airborne HSIs, named HSI1, HSI2, and HSI3, are used. The total number of bands is 204 for the public HSI and 100 for the measured HSIs, so 51 and 25 bands, respectively, are input into each sub-classifier, and the coding length of each MWOA individual is likewise 51 or 25. For the HSI datasets extracted with ENVI software, 10% of the samples of each category are randomly chosen as training data, and the remaining 90% are used as testing data.
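The per-category 10%/90% split can be sketched in Python as below. This is not the authors' MATLAB code: the function name, the fixed seed, the rounding of the per-category training size, and the minimum of one training sample per category are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.1, seed=0):
    """Randomly pick train_fraction of the sample indexes of each category."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Assumption: round the per-category size and keep at least one sample.
        n_train = max(1, round(train_fraction * len(idxs)))
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return sorted(train), sorted(test)


# Hypothetical labels: 50 pixels of one category and 30 of another.
train_idx, test_idx = stratified_split(["a"] * 50 + ["b"] * 30)
```

Sampling within each category keeps the class proportions of the training data equal to those of the full dataset, which matters when some categories contain few pixels.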

Moreover, several recently proposed swarm intelligence algorithms and their membrane computing variants are applied to band selection for the WSVM ensemble model, alongside the MWOA described in Section 3. To keep the comparison impartial, all algorithms encode the band combination of each sub-classifier in the same way and are used in their standard form.

4.1 Parameters setting for different algorithms

The optimization ability of WOA and the other algorithms depends to some extent on parameter settings. Table 1 lists the parameter settings of GSA [29], GWO [30], ALO [31], and their membrane computing variants used as comparative methods, as well as the parameters of WOA.

Table 1 Parameters setting for different algorithms

All of the above algorithms terminate when the number of objective function evaluations reaches 2000, and each algorithm is run 30 times independently. Comparative experimental results, including illustrative examples and evaluation tables, are presented in this section to show the advantages of the proposed WSVM ensemble model and MWOA. The primary task is to obtain the optimal band subset of the HSI datasets, which is reflected by the value of (16), and the overall interpretation performance is reflected by the pixel-level overall classification accuracy (OA).

4.2 Experiments for optimization ability

The above datasets are used in this subsection to demonstrate the optimization ability of MWOA and the classification performance of the WSVM ensemble model. Tables 2-3 show the performance of the classifier ensemble optimized by the different algorithms and their membrane computing variants, where Fiv, Acc, Ft, and Time denote the fitness value, classification accuracy, selected number of bands, and CPU time, respectively, averaged over 30 independent runs.

Table 2 Fitness value and classification accuracy of different algorithms
Table 3 Selected number of bands and CPU time of different algorithms

According to Tables 2-3, WOA has stronger optimization ability than the other algorithms: its fitness value is higher than 0.93 for the SalinasA dataset, and it needs only 3.9409 s to obtain a higher classification accuracy. For the measured datasets, the classification accuracy of WOA is higher than that of the other two algorithms, but more than 19 bands are selected, which still falls within the multi-spectral range. When WOA is combined with membrane computing, far fewer bands are selected and CPU time decreases by more than 60%. More importantly, the classification accuracy reaches 92% for all datasets and exceeds 98% for the SalinasA dataset, and misclassification is clearly reduced by the information integration of the classifier ensemble. In brief, WOA has the best optimization ability among the compared algorithms, and combined with membrane computing it converges fast enough to obtain higher accuracy with less band information. The WSVM ensemble model suits HSI datasets and maintains good generalization ability, and the proposed technique is applicable to practical band selection work.

4.3 Experiments for pixel-level classification

In this subsection, the four HSIs, SalinasA and HSI1-HSI3, are used to classify every pixel of the entire images. For comparison, several related and recently proposed techniques are also used: WSVM [23], local joint subspace (LS)-SVM [42], tangent collaborative representation classification (TCRC) (ensemble learning) [43], random patches network (RPN) (SVM for classification) [44], and deep SVM (DSVM) (SVM ensemble model via RBF kernel) [45]. The original and classified images are shown in Figs. 4, 5, 6, and 7, and Table 4 outlines the OA of each HSI.

Fig. 4 Classification results for SalinasA: (a) original HSI, (b) reference map, (c) LS-SVM, (d) WSVM, (e) TCRC, (f) RPN, (g) DSVM, (h) WOA, (i) MWOA

Fig. 5 Classification results for HSI1: (a) original HSI, (b) reference map, (c) LS-SVM, (d) WSVM, (e) TCRC, (f) RPN, (g) DSVM, (h) WOA, (i) MWOA

Fig. 6 Classification results for HSI2: (a) original HSI, (b) reference map, (c) LS-SVM, (d) WSVM, (e) TCRC, (f) RPN, (g) DSVM, (h) WOA, (i) MWOA

Fig. 7 Classification results for HSI3: (a) original HSI, (b) reference map, (c) LS-SVM, (d) WSVM, (e) TCRC, (f) RPN, (g) DSVM, (h) WOA, (i) MWOA

Table 4 OA of different HSI classification methods (%)

In the classified images, it is difficult to recognize different objects with similar feature values by using a single classifier such as LS-SVM or WSVM. Misclassification is clearly visible in parts (c)-(d) of Figs. 4, 5, 6, and 7, some categories are missing from the classified images, and the OA is lower than 72% for the HSI2 and HSI3 images. In parts (e)-(f) of Figs. 4, 5, 6, and 7, the classification results are improved by the ensemble of classifiers, but the discrimination ability of each sub-classifier is weak, and the voting strategy assigns false category labels to a considerable number of samples. Further, the OA of DSVM is higher than 99% for the SalinasA image, where the classified image largely coincides with the reference map, and exceeds 83% for the 3 measured images, but misclassification still exists in edge regions because of spectral aliasing when every sub-classifier uses the same type of SVM. In the WSVM ensemble model, homogeneity is improved by using WSVM instead of traditional SVM, and heterogeneity is simultaneously improved, compared with DSVM, by the different types of wavelet kernels. Redundant band information is abandoned, and the remaining bands contribute more to classification. More importantly, the OA is further improved because no duplicate band information is input into the sub-classifiers, and CPU time is reduced by the short coding length of membrane computing. The optimal band subset is obtained from the coding of MWOA as described in Section 3.2, and fewer than 10 bands are selected from HSIs with hundreds of bands. For the HSI1 image, the optimal band subset contains only 6 bands, corresponding to the wavelengths of 1092.5 nm, 1287.5 nm, 1452.5 nm, 1707.5 nm, 1932.5 nm, and 2202.5 nm.

In sum, the proposed approach is specific and efficient: it recognizes different objects across entire HSIs with reasonable computational efficiency, and it can be applied to the fast interpretation of HSIs.

5 Conclusion

In this paper, a band selection approach based on a WSVM ensemble model and MWOA is proposed. It is demonstrated that SVM with a wavelet kernel suits HSI datasets with high dimension and small samples, and that ensemble learning with different types of wavelet kernel is better able to balance the heterogeneity and homogeneity of the sub-classifiers. In addition, WOA has excellent optimization ability and quickly reaches a higher fitness value, and the coding length of each individual is reduced by membrane computing. Furthermore, the experimental results are compared with several related and recently proposed techniques for pixel-level classification: the OA reaches 93% for the 4 HSIs with the proposed technique, objects with similar feature values are sufficiently discriminated, and most categories are recognized in the images. In general, WSVM suits the feature values of the band information collected from HSIs, misclassification is reduced by ensemble learning and band selection, and the optimal band subset is obtained by MWOA. A good balance between computational efficiency and classification accuracy is maintained, which makes the approach more appropriate for practical applications. In future work, it would be preferable to fuse spatial and spectral features with different types of sub-classifier and to obtain the optimal feature subset on different scales.