1 Introduction

The global–local hybrid ensemble (GLHE) is a classifier ensemble design characterized by two main traits [4]. First, one global learner and one local learner are explicitly used. Second, both heterogeneous and homogeneous diversity are integrated. Mitchell presents an explanation and comparison of how global and local learners work [26]. When all training instances are considered during classification of a query instance, the learner is termed global; when only nearby training instances are considered, the learner is termed local. Global learners estimate a single target function for the entire instance space, while local learners estimate target functions locally and differently for each query instance [20].

Each type of learning algorithm has advantages and disadvantages, and its ability to generalize depends on the problem at hand. Generally, global learners do not respond well to isolated data points, i.e., points lying in a sparsely populated region of the instance space: they attempt to build a model that satisfies the majority of points while paying little attention to outliers (much as linear regression does). Local learners, on the other hand, handle isolated points better since their generalization is instance-based. However, if the target function depends on only a few of the many available attributes, then the instances that are most “similar” may actually be a large distance away [27]. One can argue that these two types of learners may behave in a “complementary” way: when one fails, the other may succeed because it views the problem so differently.

The co-existence of a global learner and a local learner within the same ensemble framework may offer a powerful mixture of diversity. While traditional ensemble designs in the literature appear to use only one type of diversity, either heterogeneous or homogeneous [7], the global–local hybrid ensemble employs both forms. The heterogeneous diversity originates from the use of a global learner and a local learner, while the homogeneous diversity is due to multiple instantiations of each learner with different initial parameterizations (e.g. initial weights for neural networks, number of neighbors for kNN, pruning techniques for decision trees). The combination of the two forms of diversity is not common in the literature. Although some have experimented with it indirectly [41], others have trivialized the combination as being, in essence, nothing more than a heterogeneous ensemble [7].

The generic architecture of the GLHE design is shown in Fig. 1. The contribution of GLHE is the design or composition of the base classifiers. There are n base-classifiers from a global learner with different parameterizations. Likewise, there are m base-classifiers from a local learner with different parameterizations. The global and local learning algorithms provide the main source of diversity (heterogeneity), while instantiation of multiple classifiers from each learner (homogeneity) gives the ensemble an opportunity to benefit from the better performing learner numerous times rather than just once.
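For concreteness, the sketch below expresses this architecture with scikit-learn stand-ins rather than the Weka components used later in the paper: a decision tree family serves as the global learner, k-nearest neighbors as the local learner, and a simple soft-voting combiner stands in for the ensemble combination method discussed in Sect. 2.3. All class names and parameter values here are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the generic GLHE architecture (Fig. 1) with scikit-learn stand-ins:
# n global base classifiers and m local base classifiers, one per parameterization.
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def build_glhe(global_params, local_params):
    """Assemble n + m base classifiers and combine them by soft voting."""
    base = [(f"global_{i}", DecisionTreeClassifier(**p))
            for i, p in enumerate(global_params)]
    base += [(f"local_{j}", KNeighborsClassifier(**p))
             for j, p in enumerate(local_params)]
    # Soft voting averages the class probabilities of all base classifiers.
    return VotingClassifier(estimators=base, voting="soft")

glhe = build_glhe(
    global_params=[{"max_depth": None}, {"max_depth": 5}, {"min_samples_leaf": 5}],
    local_params=[{"n_neighbors": 1}, {"n_neighbors": 5}, {"n_neighbors": 10}],
)
```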

Fig. 1 Generic architecture of global–local hybrid ensemble (GLHE) design

A previous GLHE study [4] compared the prediction accuracy of the proposed design against those of 47 ensembles reported in six prominent studies in the literature [5, 17, 22, 27, 32, 35]. The ensembles ranged from bagging and boosting to hybrid ensembles with up to seven different learning algorithms. It was found that the performance of GLHE is not statistically different from the performances of 45 other ensembles, statistically better than that of a C4.5 bagging ensemble, and statistically worse than that of one hybrid ensemble. The results supported the hypothesis that GLHE offers the same robust performance as other, at times more complex, ensemble designs. Although a large number of comparisons were performed, because the comparisons were made against studies already reported in the literature, the study was limited by the constraints imposed by those same studies: comparisons were made with ensembles designed by others, only prediction accuracy measurements were available, and at most 27 datasets were considered by any single study. Consequently, the conclusions of the Baumgartner and Serpen study [4] support only general statements about GLHE. This study aims to provide a more precise projection of the utility of the GLHE design through a comparative performance evaluation with data manipulation ensembles, namely boosting and bagging classifier ensembles. Although not directly studied in this work, the GLHE method may be similarly applicable to relatively new problem domains to which traditional data manipulation ensembles have been extended, including adversarial environments [6].

In data manipulation ensembles, a single learning algorithm is trained on different subsets of the original dataset to produce different base classifiers. Multiple classifiers are generated from a single learning algorithm through variations of the training data (e.g. different samples of instances and/or different samples of features). Data manipulation ensembles are ideal when a single learning algorithm is known to perform well for a given dataset, as the ensemble will likely improve on that performance. However, there is also the risk that the learning algorithm may not perform well for a given dataset, which adversely affects the performance of the ensemble. Regardless of its drawbacks, the simplicity of this method makes it the most widely investigated diversity creation method [30].

Popular data manipulation ensemble techniques are bagging and boosting (AdaBoost). Breiman’s bagging, or bootstrap aggregation, uses different instance subsets of the training dataset with a single learning algorithm [8]. Generally, the subset size is near the size of the original dataset; however, random sampling with replacement creates subsets with duplicates and/or omissions of the original instances. The same learning algorithm is trained on the different subsets of data, each training episode resulting in a new base classifier. Given a new instance, the classifier predictions are aggregated with a majority vote to derive the final prediction of the ensemble. Whereas bagging relies on randomness to obtain better performance from an ensemble, boosting takes a more active role. Freund and Schapire’s popular boosting variant, AdaBoost [18], explicitly alters the distribution of the training dataset to concentrate on the instances that have not been correctly learned in previous iterations (i.e. consecutive training datasets target hard-to-classify instances). A weighted vote of each classifier’s prediction is performed to obtain the ensemble’s final prediction. A recent approach, bagging–AdaBoost, incorporates components of both classic data manipulation techniques into a single method to improve the accuracy, stability, and robustness of the ensemble [42].
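The two techniques can be sketched with scikit-learn's built-in implementations (the study itself uses Weka); the decision tree base learner and the iteration count below are illustrative assumptions.

```python
# Sketch of the two data manipulation ensemble techniques described above.
# Note: the base-learner argument is named `estimator` in scikit-learn >= 1.2
# (`base_estimator` in older versions).
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each base classifier is trained on a bootstrap sample of the
# training set (sampling with replacement, subset size near the original),
# and the predictions are aggregated by voting.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=10, max_samples=1.0, bootstrap=True)

# AdaBoost: each round reweights the training distribution toward instances
# misclassified in earlier rounds; base classifiers cast weighted votes.
boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=10)
```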

In the subsequent sections, a comparative simulation-based performance and diversity analysis of the global–local hybrid ensemble against boosting and bagging ensembles is presented. It is of interest to determine whether the global–local hybrid ensemble offers more robust performance, as a consequence of its heterogeneous diversity, when compared to data manipulation ensembles, while recognizing that the global–local hybrid ensemble entails a higher training cost due to the need to train both global and local learners.

2 Simulation study

The simulation study compares the performance of GLHE against data manipulation ensembles, namely bagging and boosting variants. The study employs 46 datasets from the UCI Machine Learning Repository [2] to profile the performance of the global–local hybrid ensemble against boosting and bagging ensembles on the basis of the prediction accuracy and SAR metrics. Rigorous statistical significance testing is applied for the performance comparisons.

All simulations are executed with the open source Weka software (version 3.5.8) using tenfold cross-validation [39], with the large experiment and evaluation tool (LEET) as a front-end [3]. The prediction accuracy estimate for a classifier on a given dataset is the average over the ten samples (10 folds). The variation of these samples is not important for this study, as will become apparent in the discussion of the statistical significance testing method in a forthcoming section. The 46 publicly available datasets from the UCI Machine Learning Repository are listed, with their characteristics, in Table 1. This collection of datasets was obtained as the union of the datasets used in six popular studies in the literature that target performance robustness [5, 17, 22, 27, 32, 35]. Only those datasets that could not be located (due to naming discrepancies) and the letter dataset (due to its relatively long run-times) are excluded. Since the exclusion of datasets does not depend on any bias for or against the GLHE design, little or no impact on the results and conclusions of this study is anticipated.
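In scikit-learn terms (a stand-in for the Weka/LEET setup), the per-dataset accuracy estimate amounts to averaging ten fold scores; the dataset and classifier below are placeholders for the 46 UCI datasets and the classifiers studied.

```python
# Sketch of the evaluation protocol: accuracy on a dataset is the mean over
# ten cross-validation folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
fold_accuracies = cross_val_score(DecisionTreeClassifier(), X, y,
                                  cv=10, scoring="accuracy")
print("ACC estimate:", fold_accuracies.mean())
```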

Table 1 Simulation study datasets and their characteristics

2.1 Statistical significance testing

Demsar [14] argues that classifier comparisons do not satisfy the conditions required for parametric tests, and instead proposes the use of nonparametric alternatives. Some have applied these tests in their studies [4, 5, 25], and we believe it is a sound approach to follow. Since nonparametric tests rank the classifiers for each dataset, the important source of variation is the (independent) datasets and not the (usually dependent) samples used to calculate the accuracy. This implies that, beyond obtaining a precise estimate of accuracy, the sampling method is irrelevant, because one need not worry about the Type I error it generates. When comparing multiple classifiers, basic statistics dictates that a certain portion of the reported statistical significance is actually due to random chance [14, 28, 35]. Demsar’s suggested procedure of tests resolves this issue.

In this study, the Friedman test is performed first; it is based on the average rank of each classifier across the datasets [19]. Iman and Davenport [21] showed that the Friedman test is undesirably conservative and derived a corrected statistic based on its value. This statistic is used to test the null hypothesis that all classifiers have equivalent performance. If the null hypothesis cannot be rejected, then there is no statistically significant difference among the classifier performances. Otherwise, the post hoc Bonferroni–Dunn test is conducted, in which a control classifier is compared to all others in the group [16]. The performance of the control and another classifier is significantly different if the corresponding average ranks differ by at least the critical difference (CD).
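A minimal implementation of this procedure, following Demsar's formulation of the Friedman/Iman–Davenport statistics and the Bonferroni–Dunn critical difference (a sketch, not the authors' code), might look as follows:

```python
# Friedman/Iman-Davenport test on an N x k matrix of per-dataset scores, and
# the Bonferroni-Dunn critical difference for a control-vs-others comparison.
import numpy as np
from scipy import stats

def friedman_iman_davenport(scores):
    """scores: array of shape (N datasets, k classifiers), higher is better."""
    N, k = scores.shape
    # Rank classifiers within each dataset (rank 1 = best, ties get average rank).
    ranks = np.apply_along_axis(lambda row: stats.rankdata(-row), 1, scores)
    avg_ranks = ranks.mean(axis=0)
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)        # Iman-Davenport correction
    p_value = 1.0 - stats.f.cdf(f_f, k - 1, (k - 1) * (N - 1))
    return avg_ranks, f_f, p_value

def bonferroni_dunn_cd(k, N, alpha=0.05):
    """Critical difference in average rank for comparing a control to k-1 others."""
    z = stats.norm.ppf(1.0 - alpha / (2.0 * (k - 1)))      # Bonferroni-adjusted z value
    return z * np.sqrt(k * (k + 1) / (6.0 * N))
```

A comparison classifier whose average rank differs from the control's by at least the returned CD would be deemed statistically different at level α.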

The results of the statistical significance tests are presented graphically. An example is shown in Fig. 2. The left-most bar is the control classifier used for the Bonferroni–Dunn test, while the remaining bars are the comparison classifiers. The height of each bar represents the average rank of the associated classifier, which indicates relative performance among the classifiers (higher ranks indicate worse performance than lower ranks). Although examining the average ranks may offer some insight, the stricter and more valid comparison involves the statistical significance thresholds resulting from the Bonferroni–Dunn test. Any classifier with an average rank above the high threshold is statistically worse than the control. Alternatively, any classifier with an average rank below the low threshold is statistically better than the control. Classifiers with ranks within the threshold lines have a performance that is statistically no different from that of the control. Note that the exact values of the thresholds are displayed at the bottom of the graph. This series of statistical significance tests and the graphical presentation are applicable to any measure of performance for multiple classifiers, including the prediction accuracy and SAR metrics described in the next section.

Fig. 2 Example results of the statistical significance tests

2.2 Performance metrics

The simulation study employs two performance metrics, namely prediction accuracy and SAR, where the latter is an aggregate measure based on prediction accuracy, area under the receiver operating characteristic curve, and root-mean-squared error.

Prediction accuracy, or ACC for short, is defined as the proportion of correct predictions the classifier makes relative to the total number of instances. That is,

$$ ACC = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Instances}}, $$

where larger prediction accuracies indicate better performance.

The receiver operating characteristic (ROC) is a 2-dimensional plot of true positives on the vertical axis against false positives on the horizontal axis. The process to construct the graph is as follows [39]. First, the predictions are sorted by descending probability, without regard to whether or not each prediction is correct. Then the graph is drawn by traversing the predictions from most probable to least probable: if a prediction is correct, move one unit up; if it is incorrect, move one unit to the right. The area under the ROC curve, ROCA, is used as a summary statistic, with larger areas indicating better performance. Although it is already in use and accepted in fields such as medicine, ROCA is gaining popularity in the wider machine learning community. For a 2-class problem, a single ROCA value is calculated between the two classes. For a c-class problem, c ROCA values are calculated such that each class is considered against the set of the remaining classes (i.e. the class is treated as 1 and all remaining classes as 0). These values are then weighted by the number of instances from each corresponding class, and their sum is divided by the total number of instances. Although calculation of ROCA for multiclass problems is not common, the stated method is accepted in the literature [31].
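A sketch of the weighted one-vs-rest ROCA computation described above, using scikit-learn for the per-class binary areas (recent scikit-learn versions also expose this directly via roc_auc_score with multi_class="ovr" and average="weighted"):

```python
# Weighted one-vs-rest ROCA: each class-vs-rest AUC is weighted by that
# class's instance count and the weighted sum is divided by the total count.
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_roca(y_true, y_proba, classes):
    """y_proba[:, idx] holds the predicted probability of classes[idx]."""
    y_true = np.asarray(y_true)
    weighted_sum = 0.0
    for idx, c in enumerate(classes):
        binary_truth = (y_true == c).astype(int)   # class c is 1, the rest are 0
        auc_c = roc_auc_score(binary_truth, y_proba[:, idx])
        weighted_sum += binary_truth.sum() * auc_c
    return weighted_sum / len(y_true)
```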

Root-mean-squared error (RMSE) is widely used in regression problems; however, it can also be adapted for classification problems. RMSE measures how much predictions deviate from the true values. For C classes, let \( p_1, p_2, \ldots, p_C \) be the predicted class probabilities (such that \( p_j \) is the probability predicted for class j for a given instance, and all the probabilities sum to 1), and let \( a_1, a_2, \ldots, a_C \) be the actual class indicators (such that the true class is 1 and the others are 0). RMSE is defined as [39]:

$$ RMSE = \sqrt{\frac{1}{N}\sum\nolimits_{z=1}^{N}\sum\nolimits_{j=1}^{C}\frac{(p_{j} - a_{j})^{2}}{C}}, $$

where N is the number of instances. Smaller RMSE values indicate better performance.

SAR is an aggregate performance metric that has been shown to correlate better with ten other prominent performance metrics than any component metric alone [10, 11]. It is the average of the prediction accuracy, the area under the ROC curve, and one minus the root-mean-squared error. That is,

$$ SAR = \frac{(ACC + ROCA + (1 - RMSE))}{3}, $$

and a higher value of SAR indicates better performance.
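As a small sketch, assuming arrays of per-instance predicted class probabilities and 0/1 actual-class indicators, RMSE and SAR can be computed directly from the definitions above:

```python
# RMSE over class-probability predictions and the SAR aggregate metric.
import numpy as np

def classification_rmse(y_proba, y_onehot):
    """y_proba, y_onehot: arrays of shape (N instances, C classes)."""
    N, C = y_proba.shape
    return np.sqrt(np.sum((y_proba - y_onehot) ** 2 / C) / N)

def sar(acc, roca, rmse):
    # ACC and ROCA reward higher values; RMSE rewards lower, hence (1 - RMSE).
    return (acc + roca + (1.0 - rmse)) / 3.0
```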

2.3 Ensemble designs

The global–local hybrid ensemble (GLHE) is designed using a decision tree classifier as its global learner and a nearest-neighbor classifier as its local learner. Specifically, the global learner is J48 (a port of the C4.5 decision tree classifier) and the local learner is IBk (an instance-based nearest-neighbor classifier [1]). J48 is a Java re-implementation of C4.5 release 8 for Weka (C4.5 builds a decision tree classifier from training data using an information entropy measure) and employs both C4.5’s confidence-based post-pruning (the default) and sub-tree raising.

The categorization of J48 as global and IBk as local is in conformity with multiple references in the literature [26, 33]. There are obviously many different learning algorithms that could be used to satisfy the global–local learner requirement of GLHE, some of which may produce better performance (e.g. multi-layer perceptrons for global learning and radial basis functions for local learning). However, initial exploratory in-house assessments indicated that the simplicity and efficiency of J48 and IBk would allow for a large-scale experimentation study while capturing the important factors of the GLHE design.

Although GLHE is a generic ensemble design, our choice of relatively simple learning algorithms (e.g. C4.5 and IBk) over complex ones (e.g. artificial neural networks, Bayesian networks, or support vector machines) for empirical evaluation further reflects a pursuit of simplicity. This ad hoc approach could be improved with in-depth testing and analysis of diversity. Both the selection of learning algorithms and the set of classifiers instantiated from each promote reduced complexity in the ensemble design, an important goal of classifier design in general.

The heterogeneous diversity of the GLHE design comes from the use of two different types of base learning algorithms, one global and one local. The homogeneous diversity of GLHE is achieved by varying the parameter settings of each learner to create multiple classifier instances. This requires three steps. First, the target number of classifiers from each learner is set; in this study, three classifiers are used from each learner, which allows homogeneous tendencies to be incorporated while keeping the total number of base classifiers at a manageable level. Second, the base parameter value settings for each learner are established. For J48, the base parameters are Weka’s defaults. For IBk, the base parameters are Weka’s defaults with the exception of distance weighting, which is set to “1/distance”. Third, for each learner, different parameter value settings are employed to create the multiple base classifier instances. For J48, the type of pruning is varied: standard pruning, reduced error pruning, and no pruning (unpruned). For IBk, the number of nearest neighbors is varied: the employed values are 1, 5, and 10.
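Expressed with scikit-learn analogues (an illustrative sketch only: scikit-learn has no reduced-error pruning, so the three J48 pruning variants are approximated here by different cost-complexity pruning strengths, while the IBk variants map directly to k = 1, 5, 10 with inverse-distance weighting):

```python
# The six GLHE base-classifier parameterizations described above, sketched
# with scikit-learn stand-ins for J48 (decision tree) and IBk (kNN).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

glhe_base_classifiers = {
    # Global learner: three tree variants that differ only in pruning strength.
    "tree_standard_pruning": DecisionTreeClassifier(ccp_alpha=0.01),
    "tree_heavier_pruning":  DecisionTreeClassifier(ccp_alpha=0.05),
    "tree_unpruned":         DecisionTreeClassifier(ccp_alpha=0.0),
    # Local learner: three kNN variants with 1/distance weighting.
    "knn_k1":  KNeighborsClassifier(n_neighbors=1,  weights="distance"),
    "knn_k5":  KNeighborsClassifier(n_neighbors=5,  weights="distance"),
    "knn_k10": KNeighborsClassifier(n_neighbors=10, weights="distance"),
}
```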

The next step in the design of GLHE is determining the ensemble combination method to be used. Three ensemble techniques or architectures were considered as follows:

  • Voting—a simple combination scheme of the base-classifier predictions to derive the final ensemble prediction. In this study, the average of probabilities combination rule is used [15].

  • StackingC [36]—stacking trains a meta-classifier on a dataset whose features are the prediction values of the base-classifiers and whose class is the original class value [40]; the meta-classifier derives the final prediction from these. StackingC is a more efficient variant that reduces the number of features in the created dataset. In this study, linear regression is used as the meta-classifier (a minimal sketch of this combination follows the list).

  • Grading [35]—another meta-classification technique that takes the opposite approach to stacking. A new dataset is created for each base-classifier, with instances containing the original features but a class that indicates whether the base-classifier’s prediction was correct. The meta-classifier then predicts which base-classifiers will be correct. In this study, IBk (an instance-based K-nearest-neighbor classifier) with ten nearest neighbors is used as the meta-classifier.
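The stacking-style combination referenced above can be sketched with scikit-learn (an approximation: Weka's StackingC with linear regression on class-probability features is stood in for by StackingClassifier with a logistic regression meta-classifier, and only a subset of base classifiers is shown for brevity):

```python
# Stacking combination sketch: the meta-classifier is trained (via internal
# cross-validation) on the base classifiers' class-probability outputs.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

base = [
    ("tree_pruned",   DecisionTreeClassifier(ccp_alpha=0.01)),
    ("tree_unpruned", DecisionTreeClassifier(ccp_alpha=0.0)),
    ("knn_k1", KNeighborsClassifier(n_neighbors=1, weights="distance")),
    ("knn_k5", KNeighborsClassifier(n_neighbors=5, weights="distance")),
]
glhe_stack = StackingClassifier(estimators=base,
                                final_estimator=LogisticRegression(),
                                stack_method="predict_proba", cv=5)
```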

A comparison of the prediction accuracies of GLHE for Voting, StackingC, and Grading techniques was performed on the 46 datasets listed in Table 1. The average ranks and statistical significance thresholds for this comparison are shown in the graph of Fig. 3. Since StackingC has the lowest average rank, it is used as the control for the Bonferroni–Dunn test. Voting and Grading are right at the statistical significance threshold, so it is safe to conclude that StackingC has the best prediction accuracy performance of the three GLHE ensembles. Therefore, it is used for comparison with the data manipulation ensembles.

Fig. 3 Prediction accuracy comparison of the GLHE ensembles

The design of data manipulation ensembles was based on two ensemble techniques, namely bagging and AdaBoost, and six learning algorithms. The base learning algorithms and their parameterizations used (default WEKA values unless otherwise stated) are as follows:

  • J48—a port of the C4.5 decision tree classifier.

  • IBk—an instance-based nearest-neighbor classifier (distance weighting of “1/distance” and 5 nearest-neighbors).

  • NB—naïve Bayes classifier (supervised discretization).

  • PART—creates rules from partial C4.5 trees.

  • KStar—an instance-based classifier that uses an entropy-based distance measure.

  • SMO—sequential minimal optimization algorithm for training a support vector machine classifier.

Six bagging and six AdaBoost (AdaBoost.M1 in Weka) ensembles are produced from the six learning algorithms, namely J48, IBk, NB, PART, KStar, and SMO. The labels “Bag-” and “Boost-” are prepended to the learning algorithm names (e.g. bagging of J48 is called Bag-J48). Ten iterations are performed for each ensemble except those with SMO on the “splice” dataset, for which only five iterations could be performed due to memory limitations of Weka.

2.4 Simulation results: GLHE versus non-ensemble classifiers

This phase of the simulation study compared the prediction accuracy performance of GLHE against individual (non-ensemble) classifiers based on J48, IBk, NB, PART, KStar, and SMO for all 46 UCI Machine Learning Repository datasets presented in Table 1. Simulation results are summarized in Fig. 4, where GLHE has all six base learners and StackingC as the meta-learner, and detailed as prediction accuracy values in Table 2. The Type I error for all tests is α = 0.05. The null hypothesis of the Iman–Davenport test is rejected, indicating that there are statistically significant differences among the performances of the classifiers. The post hoc Bonferroni–Dunn test is then conducted for each comparison with GLHE as the control, with the result that the proposed ensemble design performs statistically better than all classifiers except SMO. GLHE scored first place on 19 of the 46 datasets, second on 7, third on 11, fourth on 3, fifth on 4, sixth on 1, and never placed seventh. These results indicate that the prediction accuracy performance of GLHE tends to be leading (e.g. it claimed one of the top three places for 37 of 46 datasets) and shows much less variation across datasets or problem domains, suggesting a more robust performance profile compared to any non-ensemble classifier listed in Table 2. These results also meet the expectation that an ensemble should perform better than non-ensemble classifiers.

Fig. 4 Average ranks for prediction accuracy for the StackingC GLHE and its base classifiers. CD thresholds are for the ensemble

Table 2 Prediction accuracy values for GLHE and six other non-ensemble classifiers

2.5 Simulation results: GLHE versus data manipulation ensembles

Simulation data for each bagging and boosting ensemble classifier evaluated are presented in terms of prediction accuracy and SAR values in Tables 5, 6, 7 and 8 in the appendix. The design of these ensembles was as presented in Sect. 2.3. The same data are, however, summarized here as required by the statistical significance tests performed for comparison purposes. Accordingly, the results are presented in comparison groups. The Type I error for all tests is α = 0.05. For all comparisons conducted, the null hypothesis of the Iman–Davenport test is rejected, so the classifiers within each group do not perform equivalently. The post hoc Bonferroni–Dunn test is then conducted for each comparison with the GLHE design as the control (unless otherwise stated).

The average ranks and statistical significance tests for GLHE, bagging, and AdaBoost ensembles for the prediction accuracy metric are shown in Figs. 5 and 6, respectively, where k represents the number of ensembles evaluated and N is the number of datasets. For both data manipulation ensemble types, GLHE has the lowest average rank, and its performance is statistically better than those of the bagging and boosting ensembles with IBk, NB, and KStar as the learning algorithms. Performance results based on the SAR metric, on the other hand, project quite a different perspective. Figures 7 and 8 give the average ranks and statistical test results for the bagging and AdaBoost ensembles, respectively, for the SAR metric. GLHE is only statistically better than the KStar and SMO bagging ensembles, and Bag-PART has a slightly lower rank (better performance) than GLHE. In Fig. 8, GLHE has the lowest average rank, with the AdaBoost IBk, NB, KStar, and SMO ensembles performing statistically worse.

Fig. 5 Average ranks of accuracy for StackingC GLHE and bagging ensembles. CD thresholds for GLHE

Fig. 6 Average ranks of accuracy for StackingC GLHE and AdaBoost ensembles. CD thresholds for GLHE

Fig. 7 Average ranks of SAR for StackingC GLHE and bagging ensembles. CD thresholds for GLHE

Fig. 8 Average ranks of SAR for StackingC GLHE and AdaBoost ensembles. CD thresholds for GLHE

Results of the statistical significance tests comparing GLHE with the data manipulation ensembles are summarized in Table 3. Each entry corresponds to a comparison between GLHE StackingC and a data manipulation ensemble, specified by the column label, which gives the ensemble technique, and the row label, which gives the base learning algorithm used. For example, with regard to prediction accuracy, GLHE is statistically the same as bagging with PART and better than bagging with IBk. The comparison shows that the GLHE design’s prediction accuracy is statistically better than those of three of the six bagging ensembles. For the same metric, GLHE performs better than three AdaBoost versions (namely those with IBk, NB and KStar as learners) and scores the same as the rest. For the SAR metric, GLHE scores the same as four versions of bagging while surpassing the other two, and outperforms four AdaBoost versions while tying with the other two. In no case does GLHE lag the performance of any bagging or boosting variant. The results in Table 3 indicate that GLHE exhibits more robust performance, in the sense that it consistently maintains its competitive classification performance over a large set of problem domains or datasets, compared to the data manipulation ensembles, especially those with IBk, NB, KStar and SMO as their base learning algorithms. This finding establishes GLHE as preferable to data manipulation ensembles in situations where a consistently high-performing classifier is needed across a large set of problem domains.

Table 3 Summary of statistical significance tests for GLHE StackingC relative to data manipulation ensembles

3 Diversity analysis

In this section, a diversity analysis will be performed to explore the relationship between ensemble performance and ensemble diversity in light of the simulation results presented in the previous section.

3.1 Diversity creation methods

There are three ways in which diversity can be created among the base classifiers of an ensemble [9, 12].

  • Data manipulation—multiple classifiers are generated from a single learning algorithm through variations of the training data (e.g. different samples of instances and/or different samples of features).

  • Homogeneous—multiple classifiers are generated from a single learning algorithm through variations of the parameters (e.g. neural networks with different initial weight values). As with data manipulation ensembles, the performance of homogeneous ensembles may suffer from being limited to a single learning algorithm.

  • Heterogeneous (hybrid)—multiple classifiers are generated from two or more learning algorithms (e.g. C4.5 and naïve Bayes). The consensus in the literature is that heterogeneous ensembles are more effective at producing diversity and consequently have more robust prediction accuracy performance [7]. Certain learning algorithms may be experts in a region of the instance space, while others are not. Multiple learning algorithms may help protect the ensemble from being burdened by the poor performance of any single one. However, heterogeneous ensembles pose a higher level of difficulty compared to data manipulation or homogeneous ensembles due to the need to train, test, validate and deploy multiple learning algorithms.

3.2 Measuring diversity: pairwise measures

Pairwise measures consider the classifier population definition of diversity and evaluate the diversity between two classifiers at a time. The pairwise diversity measure of interest is the disagreement measure (Dis), which was created specifically for characterizing the diversity between two classifiers. It counts the number of instances for which one classifier is correct and the other incorrect, an intuitive notion of diversity [37]. For two classifiers \( D_i \) and \( D_j \), let \( N^{10} \) be the number of instances classified correctly by \( D_i \) but not by \( D_j \), \( N^{01} \) the number classified correctly by \( D_j \) but not by \( D_i \), and \( N \) the total number of instances. The disagreement measure is then defined as

$$ Dis_{i,j} = \frac{{N^{10} + N^{01} }}{N}. $$

The value of Dis ranges between 0 and 1, where 0 indicates no difference and 1 indicates the highest possible diversity.

Since pairwise measures consider only two classifiers at a time, an ensemble of k classifiers produces k(k − 1)/2 pairwise diversity values. To obtain a single value, the average across all pairs is taken (i.e. the sum of the pairwise measures is divided by the number of pairwise diversity values). The average Dis measure is referred to as \( Dis_{av} \).
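A small sketch of the disagreement measure and its ensemble-level average, assuming each classifier is represented by a Boolean vector marking which instances it classified correctly:

```python
# Pairwise disagreement Dis = (N10 + N01) / N, and its average over all pairs.
import itertools
import numpy as np

def disagreement(correct_i, correct_j):
    """Dis for the correctness vectors of two classifiers D_i and D_j."""
    ci = np.asarray(correct_i, dtype=bool)
    cj = np.asarray(correct_j, dtype=bool)
    n10 = np.sum(ci & ~cj)     # D_i correct, D_j incorrect
    n01 = np.sum(~ci & cj)     # D_j correct, D_i incorrect
    return (n10 + n01) / ci.size

def average_disagreement(oracle_correct):
    """Dis_av: average Dis over all k(k-1)/2 classifier pairs."""
    pairs = itertools.combinations(range(len(oracle_correct)), 2)
    return float(np.mean([disagreement(oracle_correct[i], oracle_correct[j])
                          for i, j in pairs]))
```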

3.3 Visualizing diversity: classifier projection space

Pekalska et al. [29] propose the classifier projection space (CPS) as a 2-dimensional representation of classifiers in which the points correspond to classifiers and the relative pairwise diversities are preserved as Euclidean distances between the points. They suggest that this representation is more appropriate when an ensemble contains both similar and diverse base-classifiers, since in such a case a single averaged value may simply wash out the differences.

Given k classifiers, a k × k dissimilarity matrix N is created such that the value of each entry in the matrix is the pairwise measure of diversity between classifiers associated with the row and column labels for that entry. An illustrative dissimilarity matrix for four classifiers is shown in Table 4. In this case, higher values indicate higher diversity, thus the diagonal entries of the matrix are all 0. Furthermore, the matrix is symmetric so only the top half is shown.

Table 4 An example dissimilarity matrix of pairwise diversity measures for four classifiers

A Sammon mapping [34] is a nonlinear multidimensional scaling projection onto a space \( \Re^{m} \), where m is 2 or 3, such that the distances are preserved. For the purposes of CPS, let m be 2. An error function, called stress, is defined that measures the difference between the original dissimilarities and the Euclidean distances. Let N be the k × k dissimilarity matrix with elements \( n_{ij} \), and \( \tilde{N} \) be the distance matrix with elements \( \tilde{n}_{ij} \) for a projected configuration, where \( i, j = 1, 2, \ldots, k \). The stress is computed as [29]:

$$ S = \frac{1}{\sum\nolimits_{i=1}^{k-1}\sum\nolimits_{j=i+1}^{k} n_{ij}^{2}}\,\sum\limits_{i=1}^{k-1}\sum\limits_{j=i+1}^{k}\left( n_{ij} - \tilde{n}_{ij} \right)^{2}. $$

An initial configuration of points must be chosen, and the process then proceeds iteratively until the configuration corresponds to a (local) minimum of the stress. An implementation of the Sammon mapping for Matlab under the GNU General Public License can be found at [13]; this is the implementation used in this paper.
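As a stand-in sketch (not the Matlab implementation used in the paper), a CPS-style 2-dimensional configuration can be obtained by minimizing the stress S defined above with a general-purpose optimizer:

```python
# Minimize the normalized stress between the pairwise diversities n_ij and
# the Euclidean distances of a 2-D configuration of k classifier points.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def cps_embedding(diss, seed=0):
    """diss: symmetric k x k matrix of pairwise diversity (dissimilarity) values."""
    k = diss.shape[0]
    target = diss[np.triu_indices(k, 1)]      # upper-triangle n_ij, in pdist pair order
    norm = np.sum(target ** 2)

    def stress(flat_coords):
        d = pdist(flat_coords.reshape(k, 2))  # Euclidean distances of the configuration
        return np.sum((target - d) ** 2) / norm

    rng = np.random.default_rng(seed)
    result = minimize(stress, rng.normal(scale=0.1, size=2 * k), method="L-BFGS-B")
    return result.x.reshape(k, 2), result.fun  # 2-D points and the final stress value
```

If a TRUE reference point is desired, its disagreement with a classifier equals that classifier's error rate, so the dissimilarity matrix can simply be extended by one row and column before embedding.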

The matrix resulting from the Sammon mapping can be graphed such that the distance between points reflects the relative diversity between the corresponding classifiers: the further apart two points are, the more diverse the two classifiers are from each other. The diversity matrix shown in Table 4 is put through the Sammon mapping, and the resulting CPS is shown in Fig. 9. Notice that points \( D_3 \) and \( D_4 \), the classifiers with the highest pairwise diversity value in Table 4, are the furthest from each other. Conversely, points \( D_2 \) and \( D_4 \) have the lowest diversity value and are correspondingly the closest to each other. Whereas traditionally a single average diversity value would be reported for a given set of pairwise measures, such a graph gives a simple and accurate view of the diversity structure. Furthermore, a “TRUE” point can be graphed that denotes all predictions being correct (i.e. the closer a classifier is to TRUE, the better its prediction performance). Although this point is not shown in Fig. 9, the TRUE point is used in the following diversity analysis.

Fig. 9 An example CPS for the classifiers with the diversity matrix from Table 4

3.4 Diversity analysis of GLHE and data manipulation ensembles

Three datasets are considered in this analysis—anneal, pendigits, and waveform—which were chosen to represent a variety of instance counts and class sizes. Each graph contains three J48 and three IBk points to represent the GLHE design, and four other points to specify the remaining learning algorithms in the data manipulation ensembles.

The diversities for the anneal dataset are presented in Fig. 10. Note that the J48 and IBk points are clustered together, each group on one side of the TRUE point. The other classifiers are spread throughout the space further away from the TRUE point, especially KStar and NB. This graph further exposes the volatility of performance of a data manipulation ensemble when a low-performing learning algorithm is chosen, such as KStar for the anneal dataset. Figure 11 shows the diversities for the pendigits dataset. Once again, GLHE’s local classifiers based on IBk, together with KStar, are clustered tightly around the TRUE point while the others are spread out. The last diversity presentation is for the waveform dataset, given in Fig. 12. Here, all of the classifiers tend to be more evenly distributed in the space; even the J48 and IBk classifiers spread out. The three IBk-based local classifiers are within comparatively close range of the TRUE point while the J48-based classifiers are far away. The former appear to compensate for the latter in the overall performance of the GLHE in this case.

Fig. 10 Base learning algorithms in the CPS per the disagreement measure for the anneal dataset. Each point represents a classifier (three instances each of J48 and IBk, as in GLHE, are shown), and the distance between points indicates their pairwise diversity. The TRUE point has all predictions correct

Fig. 11 Base learning algorithms in the CPS per the disagreement measure for the pendigits dataset. Each point represents a classifier (three instances each of J48 and IBk, as in GLHE, are shown). The TRUE point has all predictions correct

Fig. 12 Base learning algorithms in the CPS per the disagreement measure for the waveform dataset. Each point represents a classifier (three instances each of J48 and IBk, as in GLHE, are shown). The TRUE point has all predictions correct

Base classifier instances of GLHE tend to be closer to the TRUE classification point in the CPS graphs for all three datasets than the other base classifiers, which helps explain, at least in part, the performance superiority of GLHE over the data manipulation ensembles. It is important to note, however, that current research is cautious not to put too much emphasis on the diversity–performance relationship [23, 24, 38].

The visualizations of diversity suggest that the co-existence of global and local learning algorithms offers high levels of diversity among the six instantiated base classifiers. As indicated by these diversity visualization graphs, GLHE possesses higher levels of diversity due to two sources. The first is the presence of both global and local base learners: the cluster of global J48 instances and the cluster of local IBk instances are located apart from each other, which translates into large pairwise diversity. The second is that each separate instantiation of either base learner, J48 or IBk, creates additional diversity, although not as much as that between a global instance and a local instance. The inherently high level of diversity of GLHE is likely a major source of its performance advantage over the data manipulation ensembles discussed in this study, although most likely not the only one.

4 Conclusions

This paper presented a comparative performance study of the global–local hybrid classification ensemble against data manipulation ensembles based on simulation and diversity analysis. The simulation study employed 46 datasets from the UCI Machine Learning Repository. Statistical significance tests were employed to compare the performance of one ensemble with that of another. The diversity analysis employed the disagreement measure and the classifier projection space methodology to analyze the inherent diversity each ensemble possessed. Simulation results indicated that the global–local hybrid ensemble offers superior performance in terms of the prediction accuracy and SAR metrics compared to the data manipulation ensembles based on bagging and boosting with six base learning algorithms. The global–local hybrid ensemble was shown to have prediction accuracy better than three of the bagging and three of the boosting ensembles while demonstrating performance equivalent to the rest. In terms of the SAR metric, the global–local hybrid ensemble performed better than two bagging and four boosting ensembles while scoring a tie with the rest. The diversity analysis further provided support and explanation for the performance superiority of the global–local hybrid ensemble. The global–local hybrid ensemble projects a more robust performance profile than boosting and bagging ensembles when a large number of problem domains or, equivalently, datasets are considered.