
6.1 Introduction

Learning from high-dimensional data has important applications in areas such as speech processing, medicine, monitoring urbanization using multisource images, and mineralogy using hyperspectral images [34, 56, 62, 64, 75, 77, 93]. Despite constant improvements in computational learning algorithms, supervised classification of high-dimensional data remains a challenge, largely due to the curse of dimensionality (Hughes phenomenon) [40]: the training set is very small compared with the hundreds or thousands of dimensions of high-dimensional data [8, 54]. Considerable effort has been devoted to feature extraction and feature selection for supervised classifiers [35, 59, 61, 67, 69, 83, 92]. Since each learning algorithm (feature selection/extraction, classifier) has its own advantages and disadvantages, efficient methodologies have yet to be developed. One of the most common ways to achieve this is the Multiple Classifier System (MCS) [4, 7, 11, 20, 27, 48, 50, 74, 78, 79, 85].

MCS stems from the idea of seeking advice from several persons before making a final decision; the basic assumption is that combining several opinions produces a decision that is better than any single opinion [48, 50, 74]. The individual classifiers (member classifiers) are constructed and their outputs are integrated according to a certain combination approach to obtain the final classification result. The outputs can be generated by the same classifier with different training sets, or by different classifiers with the same or different training sets. The success of MCS depends not only on a set of appropriate classifiers but also on the diversity within the ensemble; in other words, two conditions must be met: accuracy and diversity [15, 47]. Accuracy requires each member classifier to be as accurate as possible. Diversity refers to the difference among the classification results: combining similar classification results would not further improve the accuracy. Both theoretical and empirical studies have demonstrated that a good diversity measure can quantify the extent of diversity among classifiers and estimate the improvement in accuracy obtained by combining individual classifiers [50, 74]. However, Brown et al. pointed out that diversity for classification tasks is still an ill-defined concept, and defining an appropriate diversity measure for MCS remains an open question [12].

Generally speaking, an MCS is constructed in three independent steps: topology selection, classifier generation, and classifier combination. In Sect. 6.2, we review the uses of MCS, covering these steps along with applications to remote sensing.

The rotation-based ensemble classifier is one of the current state-of-the-art ensemble methods [72]. The algorithm constructs different training sets as follows: first, the feature set is divided into several disjoint subsets onto which the training set is projected. Second, a subtraining set is obtained from each projection using the bootstrapping technique. Third, feature extraction is applied to rotate each subtraining set. The components obtained from feature extraction are rearranged to form the dataset that serves as the input of a single individual classifier. The final result is produced by combining the outputs of the individual classifiers generated by repeating the above steps multiple times.

In this chapter, we apply rotation-based ensemble classifiers to high-dimensional data. In particular, two classifiers, Decision Tree (DT) and Support Vector Machine (SVM), are selected as the base classifiers. Unsupervised and supervised feature extraction methods are employed to rotate the training set. The performance of rotation-based ensemble classifiers is evaluated on high-dimensional remote sensing images.

The remainder of this chapter is organized as follows. In Sect. 6.2, we introduce the topology, classifier generation, and classifier combination approaches of MCS, and summarize the advances of MCS for high-dimensional remote sensing data classification. The main idea and two implementations of rotation-based ensembles are presented in Sects. 6.3 and 6.4, respectively. Experimental results are presented in Sect. 6.5. The conclusion and perspectives of this chapter are drawn in Sect. 6.6.

6.2 Multiple Classifier System

Different classifiers, such as parametric and non-parametric classifiers, have their own strengths and limitations. The famous ‘no free lunch’ theorem stated by Wolpert may be extrapolated to the point of saying that there is no single computational view that solves all problems [86]. In the remote sensing community, Giacinto et al. compared the performance of different classification approaches in various applications and found that none could always achieve the best result [32]. To alleviate this problem, an MCS can exploit the complementary information of several pattern classifiers and integrate their outputs so as to make the best use of their advantages while bypassing their disadvantages. Nowadays, MCS is highlighted by review articles as a hot topic and a promising trend in remote sensing image classification and change detection [4, 21].

Most MCS approaches focus on combining supervised classifiers. Few works are devoted to combining unsupervised classification results, often called cluster ensembles [38, 41]. Gao et al. proposed an interesting approach to combine multiple supervised and unsupervised models using graph-based consensus maximization [29]. Unsupervised models (clusterings), which do not directly generate label predictions, can provide useful constraints for the joint prediction of a set of related objects. Gao et al. therefore proposed to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints, formulated as an optimization problem on a bipartite graph [29]. Experimental results on three real applications demonstrated the benefits of the proposed method over existing alternatives. In this chapter, we focus on the combination of supervised classifiers.

The main issues of MCS design are [50, 74, 88]:

  • MCS topology: How to interconnect individual classifiers.

  • Classifier generation: How to generate and select valuable classifiers.

  • Classifier combination: How to build a combination function which can exploit the strengths of the selected classifiers and combine them optimally.

6.2.1 MCS Topology

Figure 6.1 illustrates the two topologies employed in MCS design. The overwhelming majority of MCS reported in the literature are structured in a parallel style. In this architecture, multiple classifiers are designed independently without any mutual interaction, and their outputs are combined according to certain strategies [70, 71, 90]. Alternatively, in the concatenation topology, the classification result generated by one classifier is used as the input to the next classifier [70, 71, 90]. When the primary classifier cannot obtain a satisfactory classification result, its output is fed to a secondary classifier, and so on. The main drawback of this topology is that mistakes produced by an earlier classifier cannot be corrected by the later classifiers.

Fig. 6.1
figure 1

The topologies of MCS. a Parallel style. b Concatenation style

A very special case of the concatenation topology is AdaBoost [28]. The goal of AdaBoost is to enhance the accuracy of any given learning algorithm, even a weak learning algorithm with an accuracy only slightly better than chance. The algorithm trains the weak learner multiple times, each time presenting it with an updated weight distribution over the training samples. The weights of misclassified samples are increased to concentrate the learning algorithm on these samples. Finally, the decisions generated by the weak learners are combined into a single decision.
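To make the weight-update loop concrete, the following is a minimal sketch of two-class AdaBoost with decision stumps as the weak learners; the function name, the choice of stumps, and the parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, T=10):
    """Minimal two-class AdaBoost; y is assumed to take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # uniform initial sample weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # weak learner trained on weighted samples
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)
        if err >= 0.5:                         # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        w *= np.exp(-alpha * y * pred)         # increase weights of misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)

    def predict(Xnew):
        # final decision: sign of the weighted sum of the weak decisions
        scores = sum(a * clf.predict(Xnew) for a, clf in zip(alphas, learners))
        return np.sign(scores)
    return predict
```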

6.2.2 Classifier Generation

Classifier generation aims to build mutually complementary individual classifiers that are accurate and, at the same time, disagree on different parts of the input space. Diversity of the individual classifiers is a vital requirement for the success of the MCS.

Both theoretical and empirical studies indicate that diversity can be ensured using homogeneous or heterogeneous approaches [50, 74]. In homogeneous approaches, a set of classification results is obtained by the same classifier, by injecting randomness into the classifier or by manipulating the training samples and the input features. Heterogeneous approaches apply different learning algorithms to the same training set. In the following, we first review some diversity measures and then discuss how classifiers are generated to ensure diversity within the ensemble.

Fig. 6.2
figure 2

Different classifier combinations using three single classifiers. The three colors represent the different classes. The overall accuracy of each individual classifier is \(6/9\). The overall accuracies of the four combinations are \(1, 8/9, 6/9\), and \(5/9\), respectively

6.2.2.1 Diversity Measures

Diversity represents the difference among the individual classifiers [15, 47]. Figure 6.2 presents four different classifier combinations of three classes (\(9\) samples) using the majority vote approach. The overall accuracy of each individual classifier is \(6/9\). The accuracies of the four combinations are \(1, 8/9, 6/9\), and \(5/9\), respectively. Our goal is to use diversity measures to find classifier combinations like those in Fig. 6.2a or b and to avoid selecting the third or, especially, the fourth combination.

Kuncheva and Whitaker summarized the diversity measures used in classifier ensembles [53]. A special issue entitled “Diversity Measure in Multiple Classifier System” published in the Information Fusion journal indicates that diversity measurement is an important research direction in MCS [51]. Petrakos et al. applied an agreement measure for decision fusion level combination [60]. Foody compared different classification results from three aspects, similarity, non-inferiority, and difference, using hypothesis tests and confidence interval algorithms [26]. Increasing diversity is expected to lead to better accuracy, but there is no formal proof of this dependency [12]. Table 6.1 summarizes 15 diversity measures with their types, data ranges, and literature sources.

Diversity measures also play an important role in ensemble pruning. Ensemble pruning aims at reducing the ensemble size prior to combination while maintaining high diversity among the remaining members, in order to reduce the computational cost and memory storage. Several approaches have been proposed for the pruning process, such as clustering-based, ranking-based, and optimization-based approaches [82].

Table 6.1 Summary of the 15 diversity measures

6.2.2.2 Ensuring Diversity

Following the steps of pattern classification, we can enforce diversity by manipulating the training samples, the features, the outputs, and the classifiers.

Manipulating the training samples: In this method, each classifier is trained on a different version of the training samples, obtained by changing the distribution of the original training samples. This method is very useful for unstable learners (decision trees and neural networks), for which small changes in the training set lead to major changes in the obtained classifier. Bagging and Boosting belong to this category [9, 28]. Bagging applies sampling with replacement to obtain independent training sets for the individual classifiers. Boosting changes the weights of the training samples according to the results of the previously trained classifiers, focusing on the misclassified samples, and produces the final result using a weighted vote rule.
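As an illustration of the bootstrap step in Bagging, the short sketch below trains each member tree on a sample drawn with replacement and combines the member labels by majority vote; the names and parameters are illustrative, and integer class labels \(0,\dots,C-1\) are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_sketch(X, y, T=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    members = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)              # bootstrap: n indices drawn with replacement
        members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    def predict(Xnew):
        votes = np.stack([clf.predict(Xnew) for clf in members])   # shape (T, n_samples)
        # majority vote over the T member predictions for each sample
        return np.array([np.bincount(col).argmax() for col in votes.T])
    return predict
```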

Manipulating the training features: The most well-known algorithm of this type is the Random Subspace method [39]. Random subspaces can be employed with several types of base learners, such as DT (Random Forest) [10] and SVM [85]. Another development is Attribute Bagging, which establishes the appropriate size of a feature subset and then creates random projections of a given training set by random selection of feature subsets [13].

Manipulating the outputs: A multiclass classification problem can be converted into several two-class classification problems, each discriminating one class from the others. Error Correcting Output Coding (ECOC) adopts a code matrix to convert a multiclass problem into binary ones; the ensemble for the multiclass problem can thus be treated as an ensemble of two-class problems whose outputs are then combined [19]. Another method for manipulating the outputs is label switching [58]. This method generates an ensemble using perturbed versions of the training set in which the classes of some training samples are randomly switched. High accuracy can be achieved with fairly large ensembles generated by class switching.
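As a hedged illustration of ECOC, scikit-learn's OutputCodeClassifier builds a random code matrix and trains one binary learner per column; the synthetic dataset and the parameter values below are placeholders, not the data used in this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

# Synthetic multiclass problem converted into several binary problems by ECOC.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
ecoc = OutputCodeClassifier(LinearSVC(), code_size=1.5, random_state=0)
ecoc.fit(X, y)
print(ecoc.predict(X[:5]))   # labels predicted by decoding the binary outputs
```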

Manipulating the individual classifiers: We can use different classifiers, or the same classifier with different parameters, to ensure diversity. For instance, when SVM is selected as the base learner, diversity can be obtained by using different kernel functions or parameters.

6.2.3 Classifier Combination

Majority vote is a simple and effective strategy for classifier combination. Within this scheme, a pixel is assigned to the class that receives the highest number of votes from the individual classifiers. Foody et al. used the majority vote rule to integrate multiple binary classifiers for the mapping of a specific class [27]. According to the output of the individual classifiers, classifier combination approaches can be divided into three levels: abstract level, rank level, and measurement level [76]. Abstract level combination methods are applied when each classifier outputs a unique label [76]. The rank level makes use of a ranked list of classes, where the ranking is based on decreasing likelihood. At the measurement level, the class probability values provided by each classifier are used in the combination. Majority/weighted vote, fuzzy integral, evidence theory, and dynamic classifier selection belong to the abstract level combination methods. Bayesian averaging and consensus theory belong to the measurement level methods. Table 6.2 summarizes classifier combination approaches. Weighted vote, fuzzy integral, Dempster-Shafer evidence theory, and consensus theory require another training set to calculate the weights. Dynamic classifier selection computes distances between samples, so it requires the original image, and its computation time is higher than that of the other approaches.
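A minimal sketch of abstract-level combination by (weighted) majority vote is given below; in practice the weights would be estimated from the additional training set mentioned above, but here they are passed in directly and all names are illustrative. With equal weights the scheme reduces to the plain majority vote.

```python
import numpy as np

def weighted_vote(label_matrix, weights, n_classes):
    """label_matrix: (T, n_samples) integer labels produced by T classifiers."""
    T, n_samples = label_matrix.shape
    scores = np.zeros((n_samples, n_classes))
    for t in range(T):
        scores[np.arange(n_samples), label_matrix[t]] += weights[t]  # add this member's weight to its class
    return scores.argmax(axis=1)          # class receiving the highest (weighted) vote

# Example: three classifiers, equal weights (plain majority vote).
labels = np.array([[0, 1, 2, 1],
                   [0, 1, 1, 1],
                   [2, 1, 2, 0]])
print(weighted_vote(labels, weights=np.ones(3), n_classes=3))   # -> [0 1 2 1]
```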

Table 6.2 Summary of classifier combination approaches

6.2.4 Applications to High-Dimensional Remote Sensing Data

Table 6.3 lists studies of MCS applied to high-dimensional remote sensing images in recent years. These studies applied different effective MCS schemes to classify high-dimensional data, including multisource, multidate, and hyperspectral remote sensing data. In the works of Smits [78], Briem et al. [11], and Gislason et al. [33], dynamic classifier selection, Bagging, Boosting, and Random Forest were applied to classify multisource remote sensing data. Lawrence et al. [55], Kawaguchi and Nishii [44], Chan and Paelinckx [14], and Rodriguez-Galiano et al. [73] used Boosting and Random Forest for the classification of multi-date remote sensing images. Doan and Foody [20] combined soft classification results derived from NOAA AVHRR images using the average operator and evidence theory. From Table 6.3, the most widely used MCS approach for hyperspectral image classification is Random Forest. In Random Forest, each tree is trained on a bootstrapped sample of the original dataset and only a randomly chosen subset of the dimensions is considered for splitting a node. Thus, the computational complexity is reduced and the correlation between the trees is decreased. Apart from this, Waske et al. [85] developed a random selection-based SVM for the classification of hyperspectral images. Yang et al. [91] proposed a novel subspace selection mechanism, the dynamic subspace method, to improve the random subspace method by automatically determining the dimensionality and selecting component dimensions for diverse subspaces. Du et al. [22] constructed diverse classifiers using different feature extraction methods and then combined the results using evidence theory and linear consensus algorithms. Recently, Xia et al. [89] used Rotation Forest to classify hyperspectral remote sensing images. Compared to Random Forest, Rotation Forest [89] uses feature extraction to promote both the diversity and the accuracy of the individual classifiers. Therefore, Rotation Forest can generate more accurate results than Random Forest.

Table 6.3 Studies on high-dimensional remote sensing image classification using MCS published in journals in recent years

6.3 Rotation-Based Ensemble Classifiers

In this study, rotation-based ensemble classifiers are used for high-dimensional data. Let \(\left\{ \mathbf X ,Y\right\} = \left\{ \mathbf x _{i},y_{i}\right\} _{i=1}^{n}\) be the training samples, \(T\) the number of classifiers, \(K\) the number of feature subsets (\(M\): number of features in each subset), and \(\varGamma \) the base classifier. The details of the rotation-based ensemble are presented in Algorithm 1 and Fig. 6.3 [66, 72]. According to Algorithm 1 and Fig. 6.3, the main steps of a rotation-based ensemble classifier are as follows:

Fig. 6.3
figure 3

Illustration of the rotation-based ensemble

  • the input feature space is divided into \(K\) disjoint subspaces.

  • feature extraction is performed on each subset using a bootstrapped sample of 75 % of the size of \(\left\{ \mathbf X ,Y\right\} \).

  • the new training data, which is obtained by rotating the original training samples, is applied to the individual classifier.

  • the individual classification results are combined using majority voting rule.

The strong performance is attributed to a simultaneous improvement of (1) diversity within the ensemble, obtained by the use of feature extraction on training data and (2) accuracy of the base classifiers, by keeping all extracted features in the training data [66, 72].

It is important to note that in step 5 of the rotation-based ensemble presented in Algorithm 1, the sample set \(\mathbf X _{t,j}^{'}\) is chosen smaller than \(\mathbf X _{t,j}\) for two reasons: to avoid obtaining the same coefficients when the same features are chosen, and to enhance the diversity within the ensemble [72].
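The following sketch illustrates the main steps listed above (random feature partition, per-subset bootstrap and feature extraction, rotation of the whole training set, and majority vote), with PCA as the feature extractor and a CART tree as the base classifier. It is a simplified illustration assuming integer class labels \(0,\dots,C-1\) and more training samples than features per subset; it is not the authors' implementation of Algorithm 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def rotation_ensemble_fit(X, y, T=10, K=4, seed=0):
    """X: (n, d) feature matrix, y: integer labels 0..C-1 (assumption for the voting step)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ensemble = []
    for _ in range(T):
        perm = rng.permutation(d)
        subsets = np.array_split(perm, K)                  # K disjoint feature subsets
        R = np.zeros((d, d))                               # block-structured rotation matrix
        for cols in subsets:
            boot = rng.choice(n, size=int(0.75 * n), replace=True)    # 75 % bootstrap sample
            pca = PCA(n_components=len(cols)).fit(X[np.ix_(boot, cols)])
            R[np.ix_(cols, cols)] = pca.components_.T      # keep all extracted components
        ensemble.append((R, DecisionTreeClassifier().fit(X @ R, y)))  # train on rotated data
    return ensemble

def rotation_ensemble_predict(ensemble, Xnew, n_classes):
    votes = np.zeros((Xnew.shape[0], n_classes))
    for R, clf in ensemble:
        votes[np.arange(Xnew.shape[0]), clf.predict(Xnew @ R)] += 1   # one vote per member
    return votes.argmax(axis=1)                            # majority vote
```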

Given the importance of the choice of the feature extraction algorithm and the base classifier in rotation-based ensembles, several alternatives are considered in this study. The feature extraction methods and base classifiers are detailed in the following section.

figure a

6.4 Two Implementations of Rotation-Based Ensemble

6.4.1 Rotation Forest

Decision trees are often used in multiple classifier systems, and especially in rotation-based ensembles, because they are sensitive to rotations of the feature axes and fast to train. In this chapter, we adopt the Classification and Regression Tree (CART) to construct the Rotation Forest (RoF).

CART is a nonparametric decision tree learning technique that can be used for both classification and regression. Decision trees are formed by a collection of rules based on the variables in the modeling dataset: (1) rules based on the variables' values are selected to obtain the best split, differentiating observations according to the dependent variable; (2) once a rule is selected and a node is split into two, the same process is applied to each ‘child’ node; (3) splitting stops when CART detects that no further gain can be made or some preset stopping rules are met. Each branch of the tree ends in a terminal node. Each observation falls into exactly one terminal node, and each terminal node is uniquely defined by a set of rules.

Both unsupervised and supervised feature extraction methods are applied in Rotation Forest. Principal Component Analysis (PCA) is the most popular linear unsupervised feature extraction method; it retains most of the information, in terms of variance, in a few components. Although Cheriyadat and Bruce provided theoretical and experimental analyses demonstrating that PCA is not optimal for dimensionality reduction in target detection and classification of hyperspectral data, PCA is still competitive for classification purposes because of its low complexity and the absence of parameters [16, 24].

Linear Discriminant Analysis (LDA) is the best-known supervised feature extraction approach, but it has a limitation: for a \(C\)-class classification problem, it can extract at most \(C-1\) features [18, 54]. This means that, in Rotation Forest, \(C\) would have to be greater than \(K\). To overcome this limitation, we adopt Local Fisher Discriminant Analysis (LFDA) instead of LDA. LFDA effectively combines the ideas of LDA and Locality Preserving Projection (LPP), which both maximizes between-class separability and preserves the within-class local structure [80]. It can be formulated as the following eigenvalue decomposition problem:

$$\begin{aligned} \mathbf S ^{lb}\mathbf v =\lambda \mathbf S ^{lw}\mathbf v \end{aligned}$$
(6.1)

where, \(\mathbf v \) is an eigenvector and \(\lambda \) is the eigenvalue corresponding to \(\mathbf v \). \(\mathbf S ^{lb}\) and \(\mathbf S ^{lw}\) denote the local between-class and within-class scatter matrices, respectively. LFDA seeks an eigenvector matrix that maximizes the local between-class scatter in the embedding space while minimizing the local within-class scatter. \(\mathbf S ^{lb}\) and \(\mathbf S ^{lw}\) are defined as:

$$\begin{aligned} \mathbf S ^{lb}=\frac{1}{2}\sum ^{n}_{i,j=1}\omega ^{lb}_{i,j}(\mathbf x _{i}-\mathbf x _{j})(\mathbf x _{i}-\mathbf x _{j})^{\top } \end{aligned}$$
(6.2)
$$\begin{aligned} \mathbf S ^{lw}=\frac{1}{2}\sum ^{n}_{i,j=1}\omega ^{lw}_{i,j}(\mathbf x _{i}-\mathbf x _{j})(\mathbf x _{i}-\mathbf x _{j})^{\top } \end{aligned}$$
(6.3)

where, \(\omega ^{lb}\) and \(\omega ^{lw}\) are the weight matrices with:

$$\begin{aligned} \omega ^{lb}_{i,j} = \left\{ \begin{array}{rl} A_{i,j}\left( \frac{1}{n}-\frac{1}{n_{y_{i}}}\right) &{}y_{i} = y_{j} \\ \frac{1}{n} &{}y_{i} \ne y_{j} \end{array} \right. \end{aligned}$$
(6.4)
$$\begin{aligned} \omega ^{lw}_{i,j} = \left\{ \begin{array}{rl} \frac{A_{i,j}}{n_{y_{i}}} &{}y_{i} = y_{j} \\ 0 &{}y_{i} \ne y_{j} \end{array} \right. \end{aligned}$$
(6.5)

where, \(A_{i,j}\) is the affinity value between \(\mathbf x _{i}\) and \(\mathbf x _{j}\) in the local space.

$$\begin{aligned} A_{i,j}=\exp \left( -\frac{\left\| \mathbf x _{i}-\mathbf x _{j}\right\| }{\sigma _{i}\sigma _{j}}\right) \end{aligned}$$
(6.6)
$$\begin{aligned} \sigma _{i}= \left\| \mathbf x _{i}-\mathbf x _{i}^{e}\right\| \end{aligned}$$
(6.7)

where, \(\mathbf x _{i}^{e}\) is the e-th nearest neighbor of \(\mathbf x _{i}\), \(n_{y_{i}}\) is the number of labeled samples in class \(y_{i} \in \left\{ 1,2,3,...,C\right\} \).
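As an illustration, the sketch below assembles the LFDA matrices of Eqs. (6.2)–(6.7) with plain numpy and solves the generalized eigenvalue problem of Eq. (6.1); the neighbor index \(e\) is an illustrative choice, the code assumes \(\mathbf S ^{lw}\) is non-singular, and it is not intended as the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lfda_sketch(X, y, r, e=7):
    """Return the top-r LFDA eigenvectors for data X (n, d) and labels y."""
    n = X.shape[0]
    D = cdist(X, X)                                    # pairwise Euclidean distances
    sigma = np.sort(D, axis=1)[:, e]                   # sigma_i = ||x_i - x_i^(e)||, Eq. (6.7)
    A = np.exp(-D / np.outer(sigma, sigma))            # local affinity, Eq. (6.6)
    Wlb = np.zeros((n, n)); Wlw = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if y[i] == y[j]:
                n_c = np.sum(y == y[i])
                Wlb[i, j] = A[i, j] * (1.0 / n - 1.0 / n_c)   # Eq. (6.4)
                Wlw[i, j] = A[i, j] / n_c                      # Eq. (6.5)
            else:
                Wlb[i, j] = 1.0 / n
    diff = X[:, None, :] - X[None, :, :]               # all pairwise differences x_i - x_j
    Slb = 0.5 * np.einsum('ij,ijk,ijl->kl', Wlb, diff, diff)   # Eq. (6.2)
    Slw = 0.5 * np.einsum('ij,ijk,ijl->kl', Wlw, diff, diff)   # Eq. (6.3)
    vals, vecs = eigh(Slb, Slw)                         # S^lb v = lambda S^lw v, Eq. (6.1)
    return vecs[:, np.argsort(vals)[::-1][:r]]          # top-r eigenvectors as the projection
```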

6.4.2 Rotation SVM

The SVM classifier has shown better classification performance for high-dimensional data than other classifiers. SVM is so stable that small changes in the training set do not produce very different SVM classifiers.

Therefore, it is difficult to obtain an ensemble of multiple SVMs that performs better than a single SVM using state-of-the-art ensemble methods, and more diversity must be introduced into the SVM ensemble. In [52], diversity is analyzed for Random Projections (RP) with and without splitting into groups of attributes. We therefore introduce Random Projection (RP) into rotation-based SVM in order to promote diversity within the ensemble.

RP obtains the rotation matrix simply by generating random numbers. Unlike other feature extraction methods such as PCA, RP can produce a projected space that is larger than the original one. Two types of RP are used in this chapter [1]; a minimal sketch of both is given after the list:

  1. Gaussian. Each value in the transformation matrix is drawn from a Gaussian distribution (mean 0 and unit standard deviation).

  2. Sparse. The entry values are \(\sqrt{3}\times \alpha \), where \(\alpha \) is a random number taking the value \(-1\) with probability \(1/6\), \(0\) with probability \(2/3\), and \(+1\) with probability \(1/6\).
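The two projection matrices can be generated as in the short sketch below, assuming \(d\) input features and a projected dimension \(k\) (which may exceed \(d\), as in the 150 % configuration used later); the function names are illustrative.

```python
import numpy as np

def gaussian_rp(d, k, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=(d, k))           # N(0, 1) entries

def sparse_rp(d, k, seed=0):
    rng = np.random.default_rng(seed)
    alpha = rng.choice([-1, 0, 1], size=(d, k), p=[1/6, 2/3, 1/6])
    return np.sqrt(3) * alpha                                      # sqrt(3) * {-1, 0, +1}

# Usage: project (rotate) the data with X @ R, where X has shape (n, d).
```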

6.5 Experimental Results and Analysis

6.5.1 Experimental Setup

In this section, we present the results obtained with rotation-based ensembles on different types of images. Two airborne hyperspectral images are used to evaluate Rotation Forest (RoF). An airborne hyperspectral image and a multi-date remote sensing dataset are used to test the performance of Rotation SVM (RoSVM). The data are described in the following two subsections. Overall accuracy (OA), average accuracy (AA), and class-specific accuracies are used to evaluate the efficiency of RoF and RoSVM.

Popular ensemble methods, including Bagging [9], AdaBoost [28], and Random Forest (RF) [10], are included for comparison with Rotation Forest. The performance achieved by Rotation Forest is evaluated using the following design:

  • Number of features in each subset: \(M = 10\);

  • Number of classifiers in the ensemble: \(T = 10\);

  • Feature extraction method: PCA [42] and LFDA [80];

We use RoF-PCA and RoF-LFDA as abbreviations for Rotation Forest with PCA and with LFDA, respectively.

The Gaussian RBF kernel is adopted in the SVM. In order to reduce the computational time of the SVM ensembles, we used fixed SVM parameters. A Random Projection-based ensemble is included for comparison with RoSVM. Two sizes of the projected space have been tested (100 and 150 % of the original dimension); the 150 % configurations are denoted as RoSVM 150 % or RP 150 %. The performance achieved by RoSVM is evaluated using the following design:

  • Number of features in each subset: \(M = 10\);

  • Number of classifiers in the ensemble: \(T = 10\);

  • Feature extraction method: Random Projection (RP) with Gaussian and Sparse;

  • Base classifier: SVM.

In the following experiments, we use RP and RoSVM as abbreviations for the Random Projection-based ensemble and the rotation-based SVM ensemble, respectively.

6.5.2 Rotation Forest

6.5.2.1 Indiana Pines AVIRIS Image

The first hyperspectral image was recorded by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indiana Pines site in Northwestern Indiana, USA. The image is composed of \(145 \times 145\) pixels, with a spatial resolution of 20 m per pixel. This image is a classical benchmark for validating the accuracy of hyperspectral image analysis algorithms and constitutes a challenging problem due to the significant presence of mixed pixels in all available classes and the unbalanced number of labeled pixels per class. The three-band color composite and ground truth of the AVIRIS image are shown in Fig. 6.4. We chose 20 pixels of each class from the available ground truth (a total of 320 pixels) as the training set.

Fig. 6.4
figure 4

a Three-band color composite of AVIRIS image. b Ground truth: Corn-no till, corn-min till, corn, soybean-no till, soybeans-min till, soybeans-clean till, alfalfa, grass/pasture, grass/trees, grass/pasture-mowed, hay-windrowed, oats, wheat, woods, bldg-grass-tree-drives, stone-steel towers

Fig. 6.5
figure 5

Classification results of Indiana Pines AVIRIS image. Different color represents the different class. The color of the classes can be found in Fig. 6.4. a CART. b Bagging. c AdaBoost. d RF. e RoF-PCA. f RoF-LFDA

Table 6.4 Overall, average and class-specific accuracies of the Indiana Pines AVIRIS image

Table 6.4 shows the classification accuracies (OA %) obtained by the Rotation Forest approaches and by the other algorithms using different training samples. The highest accuracy in each case is highlighted in bold. From Table 6.4, it can be seen that RoF-PCA and RoF-LFDA achieve better results than the other ensemble approaches (Bagging, AdaBoost, and RF). Compared to Bagging, AdaBoost, and RF, Rotation Forest promotes the diversity and improves the accuracy of the individual classifiers within the ensemble; therefore, in most cases, Rotation Forest is superior to Bagging, AdaBoost, and Random Forest, and our experimental results are consistent with the theoretical analysis. For instance, CART, Bagging, AdaBoost, and RF obtained OAs of 41.44, 51.87, 50.54, and 56.97 %, respectively, while RoF-PCA and RoF-LFDA increased the OA to 60.88 and 60.6 % and improved the AA by 23.78 and 23.85 percentage points compared to CART. The OA of RoF-PCA is slightly higher than that of RoF-LFDA, but there is no significant difference between the two classification results according to the McNemar test. Nine of the sixteen class-specific accuracies are improved by RoF-PCA and RoF-LFDA.

The classification results for the Indiana Pines AVIRIS image are shown in Fig. 6.5. The classification map of the CART classifier is very noisy because CART is not a strong classifier for high-dimensional data. Compared with the reference data presented in Fig. 6.4b, all the ensemble methods produce more accurate classification results than CART. Looking carefully at the reference image, in particular the Soybean-no till area, this region is almost correctly classified by RoF-PCA and RoF-LFDA, whereas it is classified as Corn-min till and Corn-no till by the other classifiers.

6.5.2.2 Pavia Center DAIS Image

The second image was acquired by the DAIS sensor at 1500 m flight altitude over the city of Pavia, Italy. The image (shown in Fig. 6.6) has a size of \(400 \times 400 \) pixels, with a ground resolution of 5 m. The 80 data channels recorded by this spectrometer were used in this experiment. Nine land cover classes of interest are considered; they are detailed in Table 6.5, together with the number of labeled samples for each class.

Table 6.5 Overall, average and class-specific accuracies of the Pavia Center DAIS image

The global and class-specific accuracies for the Pavia Center DAIS image are reported in Table 6.5. The classification results achieved by the ensemble classifiers are similar to those obtained on the AVIRIS image. Regarding the overall accuracies, Rotation Forest with either feature extraction algorithm is superior to all the other approaches under comparison; RoF-LFDA yields the highest OA (91.8 %). The accuracies of the classes Bricks, Asphalt, and Bitumen are significantly improved by the Rotation Forest ensemble classifiers. The classification maps for the Pavia Center DAIS image are shown in Fig. 6.7.

Fig. 6.6
figure 6

a Three-band color composite of DAIS image. b Ground truth

Fig. 6.7
figure 7

Classification results of Pavia Center DAIS image. a CART. b Bagging. c AdaBoost. d RF. e RoF-PCA. f RoF-LFDA

6.5.2.3 Sensitivity of the Parameters

The ensemble size (\(T\)) and the number of features in a subset (\(M\)) are the key parameters of Rotation Forest and indicators of its operating complexity. In order to investigate the impact of these parameters, we computed classification results for different ensemble sizes with the number of features per subset fixed to \(M = 10\), and for different numbers of features per subset with the ensemble size fixed to \(T = 10\). Figure 6.8 shows the OA (%) obtained on the AVIRIS and DAIS images for different values of \(T\) and \(M\). For the AVIRIS image, the classification results improve as the values of \(T\) and \(M\) increase, and RoF-PCA is superior to RoF-LFDA. For the DAIS image, the OAs improve as \(T\) increases, while no clear trend is observed for different values of \(M\).

6.5.2.4 Discussion

Based on the above classification results, we can conclude that the incorporation of multiple classifiers brings a substantial improvement for the classification of high-dimensional data. In order to make MCS effective, diversity must be enforced by manipulating the training sets. Bagging and Boosting change the distribution of the training samples to obtain different training sets. The random subspace method constructs several classifiers by randomly selecting subsets of the features; it is very useful for classification problems where the number of features is much larger than the number of training samples. The random subspace method is a generalization of the Random Forest algorithm, in which the base classifiers are decision trees. Rotation Forest tries to create individual classifiers that are both diverse and accurate, each based on a different axis rotation of the attributes: to create different training sets, the features are randomly split into a given number of subsets and feature extraction is applied to each subset. Decision trees are very sensitive to a rotation of the axes, so we select CART to construct Rotation Forest in this chapter. Rotation Forest promotes more diversity than Bagging, AdaBoost, and Random Forest, and can therefore produce more accurate results than these methods.

An important issue of Rotation Forest is the selection of the parameters \(T\) and \(M\). A larger value of \(T\) usually increases the accuracy but also the computation time. The optimal value of \(M\) is hard to determine, since different datasets achieve their highest accuracy with different values of \(M\). The computation time of the Rotation Forest approaches is longer than that of Bagging, AdaBoost, and Random Forest, but its computational complexity is much lower than that of strong classifiers for high-dimensional data, such as SVM.

Fig. 6.8
figure 8

Effects of varying parameters on the performance of rotation forest. Indiana Pines AVIRIS image. a Sensitivity for change of \(T\) (\(M =10\)). b Sensitivity for change of \(M\) (\(T =10\)). Pavia Center DAIS image. c Sensitivity for change of \(T\) (\(M =10\)). d Sensitivity for change of \(M\) (\(T =10\))

6.5.3 Rotation SVM

6.5.3.1 Indiana Pines AVIRIS Image

Table 6.6 shows the overall, average, and class-specific accuracies obtained by the different versions of rotation-based SVM. The highest accuracy in each case is highlighted in bold. It can be seen that RoSVM achieves better results than RP because it provides more diversity. For this dataset, RP with the Gaussian projection is superior to RP with the Sparse one. By employing a slightly larger projected space, the results of RP improve whereas RoSVM yields worse results. The corresponding classification maps are shown in Fig. 6.9. We also studied the impact of \(T\) and \(M\) on RoSVM; the sensitivity of the performance to \(T\) and \(M\) is not pronounced.

Table 6.6 Overall, average and class-specific accuracies of the Indiana Pines AVIRIS image

6.5.3.2 CHRIS Multi-date Images

The second high-dimensional dataset consists of three dates of Compact High-Resolution Imaging Spectrometer (CHRIS) images acquired by the Project for On-Board Autonomy (PROBA)-1 satellite with a spatial resolution of 21 m/pixel. The total number of spectral bands is 54. More details about the CHRIS images can be found in [23]. The training set contains 2,297 samples and the test set contains 1,975 samples, with 11 classes.

Fig. 6.9
figure 9

Classification results of Indiana Pines AVIRIS image. a SVM. b RP Gaussian. c RP Gaussian 150 %. d RP Sparse. e RP Sparse 150 %. f RoSVM Gaussian. g RoSVM Gaussian 150 %. h RoSVM Sparse. i RoSVM Sparse 150 %

The flowchart of RoSVM for classifying the CHRIS images is the same as for the previous AVIRIS dataset. A single SVM achieved an accuracy of 84.05 %. All the RP- and RoSVM-based ensembles generate better accuracies than a single SVM. In particular, the RoSVM ensembles are slightly superior to the RP ensembles because they enforce diversity by applying RP to subsets of the features. RoSVM with the Sparse RP achieves the highest overall accuracy.

6.5.3.3 Discussion

SVM is a stable classifier, so it is hard to generate different individual SVM classifiers using the common manipulation approaches; more diversity must therefore be introduced in order to construct diverse individual SVM classifiers. In this chapter, we adopt Random Projection methods to produce diverse SVM classifiers, and two sizes of the projected space have been tested. Experimental results indicate that the RoSVM ensembles outperform the RP ensembles. The main drawback of RoSVM is its computational complexity, especially for large training sets. The sensitivity of the performance to different values of \(M\) is not pronounced.

6.6 Conclusion

In this chapter, we first presented a review of MCS approaches with a special focus on applications to high-dimensional data. Recently, rotation-based ensemble classifiers have been applied to high-dimensional data. They consist in splitting the feature set into several subsets, running a feature extraction algorithm separately on each subset, and then reassembling a new extracted feature set while keeping all the components. The CART decision tree and SVM classifiers are used as base classifiers. Different splits of the feature set lead to different rotations, so diverse classifiers are obtained, and both diversity and accuracy are taken into account. Rotation Forest using PCA and LFDA, and Rotation SVM using RP, are used to classify high-dimensional data.

Experimental results have shown that rotation-based ensemble methods (with both DT and SVM) outperform classical ensemble methods such as Bagging, AdaBoost, and Random Forest in terms of accuracy. The key parameters have also been explored in this chapter. Future studies will be devoted to the integration of rotation-based ensemble classifiers with other ensemble approaches, the selection of an optimized Decision Tree model, and the use of other effective feature extraction algorithms.