Abstract
This study compares the classification performance of a hybrid ensemble, called the global–local hybrid ensemble, which employs both local and global learners, against data manipulation ensembles including bagging and boosting variants. A comprehensive simulation study is performed on 46 UCI Machine Learning Repository data sets using the prediction accuracy and SAR performance metrics, along with rigorous statistical significance tests. Simulation results indicate that the global–local hybrid ensemble outperforms or ties with the bagging and boosting ensemble variants in all cases. This suggests that the global–local ensemble has a more robust performance profile, since its performance is less sensitive to variation in the problem domain, or equivalently the data sets. This robustness is realized at the expense of increased complexity, since at least two types of learners, one global and one local, must be trained. A complementary diversity analysis of the global–local hybrid ensemble and the base learners used for the bagging and boosting ensembles on select data sets in the classifier projection space provides both an explanation and support for the performance-related findings of this study.
1 Introduction
The global–local hybrid ensemble (GLHE) is a classifier ensemble design that is characterized by two main traits [4]. First, one global learner and one local learner are explicitly used. Second, both heterogeneous and homogeneous diversities are integrated. Mitchell presents an explanation and comparison of how global and local learners work [26]. When all training instances are considered during classification of a query instance, the learner is termed global. When only nearby training instances are considered during classification of a query instance, the learner is called local. Global learners estimate a single target function for the entire instance space, while local learners estimate target functions locally and differently for each query instance [20].
There are advantages and disadvantages to each type of learning algorithm, and their ability to generalize depends on the problem at hand. Generally, global learners do not respond well to isolated data points, i.e. points in a sparsely populated region of the instance space: a global learner attempts to build a model that satisfies the majority of points while paying little attention to outliers (similar to how linear regression works). Local learners, alternatively, are better at handling isolated points since their generalization is instance-based. However, if the target function depends on only a few of the many available attributes, then the instances that are most "similar" may actually be a large distance away [27]. One can argue that these two types of learners may behave in a "complementary" way: when one fails, the other may succeed since it views the problem in such a different manner.
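The complementary behavior described above can be illustrated with a small sketch. The toy data below is hypothetical; scikit-learn's LogisticRegression stands in for a generic global learner and a 1-nearest-neighbor classifier for a local one:

```python
# A global learner fits one model over all data; a local learner defers to
# nearby instances at query time. A single isolated point barely moves the
# global fit but dominates local predictions near it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = np.linspace(0, 10, 40).reshape(-1, 1)
y = (X.ravel() > 5).astype(int)
y[3] = 1  # an isolated class-1 point deep inside class-0 territory

global_clf = LogisticRegression().fit(X, y)                # one model for all of X
local_clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # instance-based

q = X[3].reshape(1, -1)  # query at the isolated point
print(global_clf.predict(q)[0], local_clf.predict(q)[0])
```

The global model still classifies the query according to the majority pattern, while the local model honors the isolated point, the behavior the paragraph above attributes to each learner type.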
Co-existence of a global learner along with a local learner within the same ensemble framework may offer a powerful mixture of diversity. While traditional ensemble designs in the literature appear to use only one type of diversity, either heterogeneous or homogeneous [7], global–local hybrid ensemble employs both forms. The heterogeneous diversity originates from the use of a global learner and a local learner, while the homogeneous diversity is due to multiple instantiations of each learner with different initial parameterizations (e.g. initial weights for neural networks, number of neighbors for kNN, pruning techniques for decision trees, etc.). The combination of the two diversities is not common in the literature. Although some have experimented with it indirectly [41], others have trivialized the combination to being nothing more than a heterogeneous ensemble in essence [7].
The generic architecture of the GLHE design is shown in Fig. 1. The contribution of GLHE is the design or composition of the base classifiers. There are n base-classifiers from a global learner with different parameterizations. Likewise, there are m base-classifiers from a local learner with different parameterizations. The global and local learning algorithms provide the main source of diversity (heterogeneity), while instantiation of multiple classifiers from each learner (homogeneity) gives the ensemble an opportunity to benefit from the better performing learner numerous times rather than just once.
A previous GLHE study [4] compared the prediction accuracy of the proposed design against those of 47 ensembles reported in six prominent studies in the literature [5, 17, 22, 27, 32, 35]. The ensembles varied from bagging and boosting to hybrid ensembles with up to seven different learning algorithms. It was found that the performance of GLHE is statistically no different from the performances of 45 other ensembles, statistically better than that of a C4.5 bagging ensemble, and statistically worse than that of one hybrid ensemble. The results supported the hypothesis that GLHE offers the same robust performance as other, at times more complex, ensemble designs. Although a large number of comparisons were performed, because they were made against studies already reported in the literature, the study was limited by the constraints those studies imposed: comparisons were made with ensembles designed by others, only prediction accuracy measurements were available, and at most 27 datasets were considered by any single study. Consequently, the conclusions of the Baumgartner and Serpen study [4] permitted only general statements about GLHE. This study aims to provide a more precise projection of the utility of the GLHE design through a comparative performance evaluation with data manipulation ensembles, namely boosting and bagging classifier ensembles. Although not directly studied in this work, the GLHE method may be similarly applicable to relatively new problem domains to which traditional data manipulation ensembles have been extended, including adversarial environments [6].
In data manipulation ensembles, the approach to building an ensemble uses a single learning algorithm with different subsets of the original dataset to train different base classifiers. Multiple classifiers are generated from a single learning algorithm through variations of the training data (e.g. different samples of instances and/or different samples of features). Data manipulation ensembles are ideal when a single learning algorithm is known to perform well for a given dataset, as the performance will likely improve. However, there is also risk that the learning algorithm may not perform well for a given dataset, which adversely affects the performance of the ensemble. Regardless of its drawbacks, the simplicity of this method makes it the most widely investigated diversity creation method [30].
Popular data manipulation ensemble techniques are bagging and boosting (AdaBoost). Breiman's bagging, or bootstrap aggregation, uses different instance subsets of the training dataset with a single learning algorithm [8]. Generally, the subset size is near the size of the original dataset; however, random sampling with replacement creates subsets with duplicates and/or omissions of the original instances. The same learning algorithm is trained on the different subsets of data, each training episode resulting in a new base classifier. Given a new instance, the classifier predictions are aggregated with a majority vote to derive the final prediction of the ensemble. Whereas bagging relies on randomness to obtain better performance from an ensemble, boosting takes a more active role. Freund and Schapire's popular boosting variant, AdaBoost [18], explicitly alters the distribution of the training dataset to concentrate on the instances that have not been correctly learned in previous iterations (i.e. consecutive training datasets target hard-to-classify instances). A weighted vote of each classifier's prediction is performed to obtain the ensemble's final prediction. A recent approach, bagging–AdaBoost, incorporates components of both classic data manipulation ensembles into a single technique to improve the accuracy, stability, and robustness of the ensemble [42].
In the subsequent sections a comparative simulation-based performance and diversity analysis of global–local hybrid ensemble with boosting and bagging ensembles will be presented. It is of interest to determine if the global–local hybrid ensemble offers a more robust performance as a consequence of its heterogeneous diversity when compared to data manipulation ensembles, while also recognizing that the global–local hybrid ensemble projects a higher level of training cost due to the need to train both global and local learners.
2 Simulation study
The simulation study compares the performance of GLHE against data manipulation ensembles, namely bagging and boosting variants. The study employs 46 datasets from the UCI Machine Learning Repository [2] to profile the performance of global–local hybrid ensemble against boosting and bagging ensembles on the basis of prediction accuracy and SAR metrics. A rigorous statistical significance testing is applied for the performance comparisons.
All simulations are executed using the open source Weka software (version 3.5.8) with tenfold cross validation [39], and the large experiment and evaluation tool (LEET) as a front-end [3]. The prediction accuracy estimate for a classifier on a given dataset is an average over the ten samples (10 folds). The variation of these samples is not important for this study, as will be made apparent in the discussion of the statistical significance testing method in a forthcoming section. The 46 publicly available datasets from the UCI Machine Learning Repository are listed, with their characteristics, in Table 1. This collection of datasets was obtained as a union of the datasets used in six popular studies in the literature that target performance robustness [5, 17, 22, 27, 32, 35]. Only those that could not be found (due to naming discrepancies) and the letter dataset (due to its relatively long required run-times) are not included. Since the exclusion of datasets does not depend on a bias for or against the GLHE design, little or no impact is anticipated on the results and conclusions of this study.
2.1 Statistical significance testing
Demsar [14] argues that comparing classifiers does not satisfy the conditions for parametric tests, and instead proposes the use of nonparametric alternatives. Some have applied these tests in their studies [4, 5, 25], and we believe it is a sound approach to follow. Since nonparametric tests rank the classifiers for each dataset, the important source of variation is the (independent) datasets and not the (usually dependent) samples used to calculate the accuracy. This implies that, beyond obtaining a precise estimate of accuracy, the sampling method is irrelevant, because one does not have to worry about the Type I error it generates. When comparing multiple classifiers, basic statistics dictates that a certain portion of the reported statistical significance is actually due to random chance [14, 28, 35]. Demsar's suggested procedure of tests resolves this issue.
In this study, first the Friedman test is performed, which is based on the average rank of each classifier across the datasets [19]. Iman and Davenport [21] showed that the Friedman test is undesirably conservative, and derived a new statistic based on its value. This is then used to test the null hypothesis—that all classifiers have equivalent performance. If the null hypothesis is accepted, then there is no statistically significant difference in the classifier performances. Otherwise, the post hoc Bonferroni–Dunn test is conducted, in which a control classifier is compared to all others in the group [16]. The performance of the control and another classifier is significantly different if the corresponding average ranks differ by at least the critical difference (CD).
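The test sequence can be sketched as follows. The accuracy table `accs` is hypothetical, and the Bonferroni–Dunn critical value q_α = 2.241 for k = 3 classifiers at α = 0.05 is taken from Demsar's tables [14]:

```python
# Friedman statistic from average ranks, the Iman-Davenport correction,
# and the Bonferroni-Dunn critical difference (CD).
import numpy as np
from scipy.stats import rankdata

# hypothetical N-datasets x k-classifiers accuracy table
accs = np.array([[0.90, 0.85, 0.80],
                 [0.70, 0.72, 0.60],
                 [0.88, 0.80, 0.79],
                 [0.95, 0.93, 0.90]])
N, k = accs.shape

# rank 1 = best classifier on a dataset; average the ranks over datasets
ranks = np.array([rankdata(-row) for row in accs])
avg_ranks = ranks.mean(axis=0)

# Friedman chi-square statistic, then the Iman-Davenport F statistic
chi2_F = 12 * N / (k * (k + 1)) * (np.sum(avg_ranks**2) - k * (k + 1)**2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

# Bonferroni-Dunn critical difference: two average ranks differing by more
# than CD indicate a statistically significant difference from the control
q_alpha = 2.241  # k = 3, alpha = 0.05 (Demsar's table)
CD = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(avg_ranks, F_F, CD)
```

F_F is compared against the F distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom; only if the null hypothesis is rejected is the CD then applied in the post hoc comparison.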
The results of the statistical significance tests are presented graphically. An example is shown in Fig. 2. The left-most bar is the control classifier used for the Bonferroni–Dunn test, while the remaining bars are the comparison classifiers. The height of each bar represents the average rank of the associated classifier, which indicates relative performance among the classifiers (higher ranks are worse performing than lower ranks). Although examining the average ranks may offer some insight, the more strict and valid comparison involves the statistical significance thresholds resulting from the Bonferroni–Dunn test. Any classifier with an average rank above the high threshold is statistically worse than the control. Alternatively, any classifier with an average rank below the low threshold is statistically better than the control. Those classifiers with ranks within the threshold lines have a performance that is statistically no different than that of the control. Note that the exact values of the thresholds are displayed at the bottom of the graph. This series of statistical significance tests and the graphical presentation are applicable to any measure of performance for multiple classifiers. This includes the prediction accuracy and the SAR metrics as described in the next section.
2.2 Performance metrics
The simulation study employs two performance metrics, namely prediction accuracy and SAR, where the latter is an aggregate measure based on prediction accuracy, the area under the receiver operating characteristic curve, and the root-mean-squared error.
Prediction accuracy, or ACC in short, is defined as the proportion of correct predictions the classifier makes relative to the total number of instances. That is,

\[ {\text{ACC}} = \frac{{\text{number of correct predictions}}}{N}, \]

where N is the total number of instances and larger prediction accuracies indicate better performance.
The receiver operating characteristic (ROC) is a 2-dimensional plot of true positives on the vertical axis against false positives on the horizontal axis. The process to construct the graph is as follows [39]. First, the predictions are sorted by descending probability, without regard to whether or not the prediction is correct. Then the graph is drawn by traversing through the predictions, from most probable to least probable. If the prediction is correct, then move one unit up. Alternately, if the prediction is incorrect, then move one unit to the right. The area under the ROC curve, ROCA, is used as a summary statistic. Larger areas are considered as having better performance. Although it is already in use and accepted in fields such as medicine, ROCA is gaining popularity in the wider machine learning community. For a 2-class problem, a single ROCA value is calculated between the two classes. For a c-class problem, there are c ROCA values calculated such that a given class is considered against the set of the remaining classes (i.e. a class is considered as 1 and all remaining classes are considered as 0). These values are then weighted by the number of instances from each corresponding class and their sum is divided by the total number of instances. Although calculation of ROCA for multiclass problems is not frequent, the stated method is accepted in the literature [31].
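The class-frequency-weighted one-vs-rest ROCA described above matches what scikit-learn computes with `multi_class='ovr', average='weighted'`; the labels and class-membership probabilities below are hypothetical:

```python
# Weighted one-vs-rest area under the ROC curve for a 3-class problem:
# each class is scored against the rest, then the per-class areas are
# weighted by class frequency.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
# predicted class-membership probabilities, one column per class (rows sum to 1)
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.3, 0.4, 0.3],
                   [0.1, 0.2, 0.7],
                   [0.2, 0.2, 0.6],
                   [0.3, 0.3, 0.4]])
roca = roc_auc_score(y_true, y_prob, multi_class="ovr", average="weighted")
print(roca)
```

In this toy case every class's positive instances outscore the negatives, so each one-vs-rest area, and hence the weighted sum, is 1.0.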
Root-mean-squared error (RMSE) is widely used in regression problems; however, it can also be adapted for classification problems. RMSE measures how much predictions deviate from the true values. For C classes, let \( p_1, p_2, \ldots, p_C \) be the predicted class-probability values for an instance (such that \( p_i \) is the predicted probability of class i, and the probabilities sum to 1), and \( a_1, a_2, \ldots, a_C \) be the actual class values (such that the true class is 1 and the others are 0). RMSE is defined as [39]:

\[ {\text{RMSE}} = \sqrt{\frac{1}{N}\sum\limits_{n = 1}^{N} \frac{1}{C}\sum\limits_{i = 1}^{C} \left( p_i^{(n)} - a_i^{(n)} \right)^2 }, \]

where N is the number of instances and the superscript (n) indexes them. Smaller RMSE values indicate better performance.
SAR is an aggregate performance metric that has been shown to correlate better with ten other prominent performance metrics than any of its component metrics alone [10, 11]. It is an average of the prediction accuracy, the area under the ROC curve, and one minus the root-mean-squared error. That is,

\[ {\text{SAR}} = \frac{{\text{ACC}} + {\text{ROCA}} + (1 - {\text{RMSE}})}{3}, \]

and a higher value of SAR indicates better performance.
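A minimal sketch of the aggregation, with illustrative placeholder component values:

```python
# SAR combines the three components on a common "higher is better" scale,
# which is why RMSE enters as (1 - RMSE).
def sar(acc: float, roca: float, rmse: float) -> float:
    """SAR = (ACC + ROCA + (1 - RMSE)) / 3; higher values are better."""
    return (acc + roca + (1.0 - rmse)) / 3.0

print(sar(0.90, 0.95, 0.20))  # placeholder component values
```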
2.3 Ensemble designs
The global–local hybrid ensemble (GLHE) is designed using a decision tree classifier as its global learner and a nearest neighbor classifier as its local learner. Specifically, the global learner is J48 (a port of the C4.5 decision tree classifier) and the local learner is IBk (an instance-based nearest-neighbor classifier [1]). J48 is a re-implementation of C4.5 release 8 (which develops a decision tree based classifier from a set of training data through information entropy measure) in Java for Weka and employs both C4.5’s confidence-based post-pruning (default) and sub-tree raising.
The categorization of J48 as global and IBk as local is in conformity with multiple references in the literature [26, 33]. There are obviously many different learning algorithms that could be used to satisfy the global–local learner requirement of GLHE, some of which may produce better performance (e.g. multi-layer perceptrons for global learning and radial basis functions for local learning). However, initial exploratory in-house assessments indicated that the simplicity and efficiency of J48 and IBk would allow for a large-scale experimentation study while capturing the important factors of the GLHE design.
Although GLHE is a generic ensemble design, our choice of relatively simple learning algorithms (e.g. C4.5 and IBk) over complex ones (e.g. artificial neural networks, Bayesian nets, or support vector machines) for empirical evaluation further supports the pursuit of simplicity. This ad hoc approach could be improved with in-depth testing and analysis of diversity. Both the selection of learning algorithms and the set of classifiers for each promote reduced complexity in the ensemble design, an important goal of classifier design in general.
The heterogeneous diversity of the GLHE design comes from the use of two different types of base learning algorithms, one global and one local. The homogeneous diversity of GLHE is achieved by varying the parameter settings of each learner to create multiple classifier instances. This requires three steps. First, the target number of classifiers from each learner is set. In this study, three classifiers are used from each learner, which allows homogeneous tendencies to be incorporated while keeping the total number of base classifiers at a manageable level. Second, the base parameter value settings for each learner are established. For J48, the base parameters are Weka's defaults. For IBk, the base parameters are Weka's defaults with the exception of distance weighting, which is set to "1/distance". Third, for each learner, different parameter value settings are employed to create the multiple base classifier instances. For J48, the type of pruning is changed: standard pruning, reduced error pruning, and no pruning (unpruned). For IBk, the number of nearest neighbors is changed: the employed values are 1, 5, and 10.
The next step in the design of GLHE is determining the ensemble combination method to be used. Three ensemble techniques or architectures were considered as follows:
-
Voting—a simple combination scheme of the base-classifier predictions to derive the final ensemble prediction. In this study, the average of probabilities combination rule is used [15].
-
StackingC [36]—stacking uses meta-classification of a dataset created with the prediction values of the base-classifiers as features and the original values as the class [40]. The meta-classifier derives a final prediction from this. StackingC is more efficient with a reduced number of features in the created dataset. In this study, linear regression is used as the meta-classifier.
-
Grading [35]—another meta-classification approach that employs an opposite approach to Stacking. A new dataset is created for each base-classifier, with instances containing the original features, but the class indicates whether the base-classifier’s prediction was correct. The meta-classifier predicts which base-classifier will be correct. In this study, IBk (an instance-based K-nearest-neighbor classifier) with ten nearest-neighbors is used as the meta-classifier.
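A rough scikit-learn sketch of the resulting design, three decision-tree variants and three kNN variants combined with a StackingC-style meta-classifier, is given below. The tree parameterizations and the dataset are illustrative, and LogisticRegression on class-probability features stands in for Weka's StackingC with linear regression:

```python
# GLHE-style ensemble: a "global" learner (decision trees) and a "local"
# learner (kNN), each instantiated three times with different parameters,
# combined by stacking on the base classifiers' class probabilities.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

base = [
    # three tree variants, analogous to varying J48's pruning
    ("tree_pruned", DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)),
    ("tree_shallow", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("tree_unpruned", DecisionTreeClassifier(random_state=0)),
    # three kNN variants, analogous to varying IBk's neighbor count
    ("knn1", KNeighborsClassifier(n_neighbors=1)),
    ("knn5", KNeighborsClassifier(n_neighbors=5, weights="distance")),
    ("knn10", KNeighborsClassifier(n_neighbors=10, weights="distance")),
]
glhe = StackingClassifier(estimators=base,
                          final_estimator=LogisticRegression(max_iter=1000),
                          stack_method="predict_proba", cv=5)

X, y = load_iris(return_X_y=True)
acc = cross_val_score(glhe, X, y, cv=10).mean()  # tenfold CV, as in the study
print(f"{acc:.3f}")
```

The heterogeneity comes from mixing the two learner families, and the homogeneity from the three parameterizations within each family, mirroring the design steps described above.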
A comparison of the prediction accuracies of GLHE for Voting, StackingC, and Grading techniques was performed on the 46 datasets listed in Table 1. The average ranks and statistical significance thresholds for this comparison are shown in the graph of Fig. 3. Since StackingC has the lowest average rank, it is used as the control for the Bonferroni–Dunn test. Voting and Grading are right at the statistical significance threshold, so it is safe to conclude that StackingC has the best prediction accuracy performance of the three GLHE ensembles. Therefore, it is used for comparison with the data manipulation ensembles.
The design of data manipulation ensembles was based on two ensemble techniques, namely bagging and AdaBoost, and six learning algorithms. The base learning algorithms and their parameterizations used (default WEKA values unless otherwise stated) are as follows:
-
J48—a port of the C4.5 decision tree classifier.
-
IBk—an instance-based nearest-neighbor classifier (distance weighting of “1/distance” and 5 nearest-neighbors).
-
NB—naïve Bayes classifier (supervised discretization).
-
PART—creates rules from partial C4.5 trees.
-
KStar—an instance-based entropy-based classifier.
-
SMO—sequential minimal optimization algorithm for training a support vector machine classifier.
Six bagging and AdaBoost (AdaBoost.M1 in WEKA) ensembles are produced from the six learning algorithms, namely J48, IBk, NB, PART, KStar, and SMO. The labels “Bag-” and “Boost-” are pre-pended to the learning algorithm names (e.g. bagging of J48 is called Bag-J48). Ten iterations are performed for each ensemble except for those with SMO on the “splice” dataset, for which only five iterations could be performed due to memory limitations of Weka.
2.4 Simulation results: GLHE versus non-ensemble classifiers
This phase of the simulation study compared the prediction accuracy performance of GLHE with those of individual (non-ensemble) classifiers based on J48, IBk, NB, PART, KStar, and SMO for all 46 UCI Machine Learning Repository datasets presented in Table 1. Simulation results are summarized in Fig. 4, where GLHE has all six base learners and StackingC as the meta-learner, and detailed for prediction accuracy values in Table 2. The Type I error for all tests is α = 0.05. The null hypothesis of the Iman–Davenport test is rejected, indicating there are statistically significant differences among the performances of the classifiers. The post hoc Bonferroni–Dunn test is then conducted for each comparison with GLHE as the control, resulting in the proposed ensemble design performing statistically better than all except SMO. GLHE scored first place for 19 of the 46 datasets, second place for 7 datasets, third place for 11 datasets, fourth place for 3 datasets, fifth place for 4 datasets, sixth place for 1 dataset, and never came in seventh place for any of the datasets. These results indicate that the prediction accuracy performance of GLHE tends to be leading (e.g. it claimed one of the top 3 places for 37 out of 46 datasets) and shows much less variation with respect to dataset or problem domain, suggesting a more robust performance profile compared to any non-ensemble classifier listed in Table 2. These results are also consistent with the expectation that an ensemble should perform better than non-ensemble classifiers.
2.5 Simulation results: GLHE versus data manipulation ensembles
Simulation data for each bagging and boosting ensemble classifier evaluated are presented in terms of prediction accuracy and SAR values in Tables 5, 6, 7 and 8 in the appendix. The design of these ensembles was as presented earlier in Sect. 2.3. The same data is however summarized herein per the requirements of the statistical significance tests performed for comparison purposes. Accordingly, the results are presented in comparison groups. The Type I error for all tests is α = 0.05. For all comparisons conducted, the null hypothesis of the Iman–Davenport test has been rejected, so the classifiers within each group do not perform equivalently. The post hoc Bonferroni–Dunn test is then conducted for each comparison with the GLHE design as the control (unless otherwise stated).
The average ranks and statistical significance tests for GLHE, bagging, and AdaBoost ensembles for the prediction accuracy metric are shown in Figs. 5 and 6, respectively, where k represents the number of ensembles evaluated and N is the number of data sets. For both data manipulation ensemble types, GLHE has the lowest average rank, and its performance is statistically better than those of bagging and boosting ensembles with IBk, NB, and KStar as the learning algorithms. Performance results based on the SAR metric, on the other hand, project quite a different perspective. Figures 7 and 8 give the average ranks and statistical test results for the bagging and AdaBoost ensembles, respectively, for the SAR metric. GLHE is only statistically better than the KStar and SMO bagging ensembles; Bag-PART has a slightly lower rank (better performance) than GLHE. In Fig. 8, GLHE has the lowest average rank, with the AdaBoost IBk, NB, KStar, and SMO ensembles performing statistically worse.
Results of the statistical significance tests for the comparison between GLHE and data manipulation ensembles are summarized in Table 3. Each entry corresponds to a comparison between GLHE StackingC and a data manipulation ensemble, specified by the column label, which represents the ensemble technique, and the row label, which shows the base learning algorithm used. For example, with regard to prediction accuracy, GLHE is statistically the same as bagging with PART and better than bagging with IBk. The comparison shows that the GLHE design's prediction accuracy is statistically better than those of three of the six bagging ensembles. For the same metric, GLHE performs better than three AdaBoost versions (namely those with IBk, NB, and KStar as learners) while scoring the same as the rest. For the SAR metric, GLHE scores the same as four versions of bagging while surpassing the other two, and outperforms four AdaBoost versions while tying with the other two. In no case does GLHE lag behind any bagging or boosting variant. The results in Table 3 indicate that GLHE exhibits a more robust performance, in the sense that it consistently maintains its competitive classification performance over a large set of problem domains or data sets, compared to the data manipulation ensembles, especially those with IBk, NB, KStar, and SMO as their base learning algorithms. This finding establishes GLHE as preferable to data manipulation ensembles in situations where a consistently high-performing classifier is needed across a large set of problem domains.
3 Diversity analysis
In this section, a diversity analysis will be performed to explore the relationship between ensemble performance and ensemble diversity in light of the simulation results presented in the previous section.
3.1 Diversity creation methods
There are three ways in which diversity can be created among the base classifiers of an ensemble [9, 12].
-
Data manipulation—multiple classifiers are generated from a single learning algorithm through variations of the training data (e.g. different samples of instances and/or different samples of features).
-
Homogeneous—multiple classifiers are generated from a single learning algorithm through variations of the parameters (e.g. neural networks with different initial weight values). As with data manipulation ensembles, the performance of homogeneous ensembles may suffer from being limited to a single learning algorithm.
-
Heterogeneous (hybrid)—multiple classifiers are generated from two or more learning algorithms (e.g. C4.5 and naïve Bayes). Consensus in the literature is that heterogeneous ensembles are more effective at producing diversity, and consequently have more robust prediction accuracy performance [7]. Certain learning algorithms may be experts in a region of the instance space, while others are possibly inexpert there. Multiple learning algorithms may help protect the ensemble from being burdened by the poor performance of any single one. However, heterogeneous ensembles pose a higher level of difficulty than data manipulation or homogeneous ensembles due to the apparent need to train, test, validate, and deploy multiple learning algorithms.
3.2 Measuring diversity: pairwise measures
Pairwise measures consider the classifier population definition of diversity and evaluate the diversity between two classifiers at a time. The pairwise diversity measure of interest is the disagreement measure (Dis), which was created specifically for characterizing the diversity between two classifiers. It counts the number of times that one classifier was correct and the other incorrect, an intuitive concept of diversity [37]. For two classifiers, \( D_i \) and \( D_j \), the disagreement measure is defined as

\[ {\text{Dis}}_{i,j} = \frac{N^{01} + N^{10}}{N^{11} + N^{10} + N^{01} + N^{00}}, \]

where \( N^{ab} \) is the number of instances for which \( D_i \) is correct (a = 1) or incorrect (a = 0) and \( D_j \) is correct (b = 1) or incorrect (b = 0). The value of Dis ranges between 0 and 1, where 0 indicates no difference and 1 indicates the highest possible diversity.
Since pairwise measures consider only two classifiers at a time, an ensemble of k classifiers produces k(k − 1)/2 pairwise diversity values. To get a single value, the average across all pairs is taken (i.e. divide the sum of pairwise measures by the number of pairwise diversity values). The average Dis measure is referred to as Dis av.
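A minimal sketch of the computation; `correct` is a hypothetical k × N matrix of per-instance correctness indicators:

```python
# Pairwise disagreement (fraction of instances where exactly one of the two
# classifiers is correct) and its average over all k(k-1)/2 pairs.
import numpy as np

def disagreement(ci: np.ndarray, cj: np.ndarray) -> float:
    """Dis for one pair: how often the two correctness vectors differ."""
    return float(np.mean(ci != cj))

def dis_av(correct: np.ndarray) -> float:
    """Average Dis over all k(k-1)/2 classifier pairs."""
    k = correct.shape[0]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(disagreement(correct[i], correct[j]) for i, j in pairs) / len(pairs)

# hypothetical correctness of 3 classifiers on 5 instances (True = correct)
correct = np.array([[1, 1, 0, 1, 0],
                    [1, 0, 1, 1, 0],
                    [0, 1, 1, 0, 1]], dtype=bool)
print(dis_av(correct))
```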
3.3 Visualizing diversity: classifier projection space
Pekalska et al. [29] propose the classifier projection space (CPS) as a 2-dimensional representation of classifiers such that the points correspond to classifiers and the relative pairwise diversities are preserved by the Euclidean distances between the points. They suggest that this method is more appropriate than a single averaged value when an ensemble contains both similar and diverse base-classifiers, since in such a case the pairwise values may simply average out.
Given k classifiers, a k × k dissimilarity matrix N is created such that the value of each entry in the matrix is the pairwise measure of diversity between classifiers associated with the row and column labels for that entry. An illustrative dissimilarity matrix for four classifiers is shown in Table 4. In this case, higher values indicate higher diversity, thus the diagonal entries of the matrix are all 0. Furthermore, the matrix is symmetric so only the top half is shown.
A Sammon mapping [34] is a nonlinear multidimensional scaling projection onto a space \( \Re^{m} , \) where m is 2 or 3, such that the distances are preserved. For the purposes of CPS, let m be 2. An error function, called stress, measures the difference between the original dissimilarities and the Euclidean distances of the projection. Let N be the k × k dissimilarity matrix with \( n_{ij} \) as elements, and \( \tilde{N} \) be the distance matrix with \( \tilde{n}_{ij} \) as elements for a projected configuration, where i, j = 1, 2, …, k. The stress is computed as [29]:

\[ E = \frac{1}{\sum\limits_{i < j} n_{ij}} \sum\limits_{i < j} \frac{\left( n_{ij} - \tilde{n}_{ij} \right)^{2}}{n_{ij}}. \]
An initial distance matrix configuration must be used, and then the process proceeds in an iterative manner until a distance matrix configuration corresponds to a (local) minimum. An implementation of the Sammon mapping for Matlab under the GNU General Public License can be found at [13], which is the implementation used in this paper.
The resulting configuration from the Sammon mapping can be graphed so that the distance between points reflects the relative diversity between classifiers: the further apart two points are, the more diverse the corresponding classifiers. The diversity matrix shown in Table 4 is put through the Sammon mapping, and the resulting CPS is shown in Fig. 9. Notice that points D 3 and D 4, the classifiers with the highest pairwise diversity value in Table 4, are the furthest from each other, while points D 2 and D 4 have the lowest diversity value and are correspondingly the closest. Whereas traditionally a single average value would be reported for a given set of pairwise diversity measures, such a graph gives a simple and accurate view of the individual diversities. Furthermore, a "TRUE" point can be graphed that denotes all predictions being correct (i.e. the closer a classifier is to TRUE, the better its prediction performance). Although this is not shown in Fig. 9, the TRUE point is used in the following diversity analysis.
3.4 Diversity analysis of GLHE and data manipulation ensembles
Three datasets are considered in this analysis—anneal, pendigits, and waveform—which were chosen to represent a variety of instance counts and class sizes. Each graph contains three J48 and three IBk points to represent the GLHE design, and four other points to specify the remaining learning algorithms in the data manipulation ensembles.
The diversities for the anneal dataset are presented in Fig. 10. Note that the J48 and IBk points are clustered together, each group on one side of the TRUE point. The other classifiers are spread throughout the space further away from the TRUE point, especially KStar and NB. This graph further exposes the performance volatility of a data manipulation ensemble when a low-performing learning algorithm, such as KStar for the anneal dataset, is chosen. Figure 11 shows the diversities for the pendigits dataset. Once again, GLHE’s local classifiers based on IBk, together with KStar, are clustered tightly around the TRUE point while the others are spread out. The last diversity presentation, for the waveform dataset, is given in Fig. 12. Here, all of the classifiers tend to be more evenly distributed in the space; even the J48 and IBk classifiers spread out. The three IBk-based local classifiers are within comparatively close range of the TRUE point while the J48-based classifiers are far away, and the former appear to compensate for the latter in the overall performance of the GLHE in this case.
Base classifier instances of GLHE tend to be closer to the TRUE classification point in the CPS graphs for all three datasets compared to the other base classifiers, which helps explain, at least in part, the performance superiority of the GLHE over the data manipulation ensembles. It is important to note, however, that current research cautions against placing too much emphasis on the diversity–performance relationship [23, 24, 38].
The visualizations of diversity suggest that the coexistence of global and local learning algorithms offers high levels of diversity among GLHE's six instantiations of base learners. As these graphs indicate, GLHE draws its diversity from two sources. The first is the presence of both global and local base learners: the global-learner (J48) and local-learner (IBk) clusters are located apart from each other, which translates into large pairwise diversity. The second is that each separate instantiation of either base learner, J48 or IBk, contributes additional diversity, although not as much as that between a global instance and a local instance. The inherently high level of diversity in the GLHE is likely a major source of its performance advantage over the data manipulation ensembles discussed in this study, although most likely not the only one.
4 Conclusions
This paper presented a comparative performance study of the global–local hybrid classification ensemble and data manipulation ensembles based on simulation and diversity analysis. The simulation study employed 46 datasets from the UCI Machine Learning Repository, with statistical significance tests used to compare the performance of one ensemble with that of another. The diversity analysis employed the dissimilarity measure and the classifier projection space methodology to analyze the inherent diversity of each ensemble. Simulation results indicated that the global–local hybrid ensemble offers superior performance in terms of prediction accuracy and SAR metrics compared to six data manipulation ensembles based on bagging and boosting variants. The global–local hybrid ensemble was shown to have prediction accuracy better than three of the bagging and boosting ensembles while demonstrating equivalent performance with the rest. In terms of the SAR metric, the global–local hybrid ensemble had better performance than two bagging and four boosting ensembles while scoring a tie with the rest. The diversity analysis further provided support and explanation for the performance superiority of the global–local hybrid ensemble. Overall, the global–local hybrid ensemble projects a more robust performance profile than boosting and bagging ensembles when a large number of problem domains or, equivalently, data sets are considered.
References
Aha D, Kibler D (1991) Instance-based learning algorithms. Mach Learn 6:37–66
Asuncion A, Newman DJ (2007) UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html
Baumgartner D, Serpen G (2009) Large experiment and evaluation tool for WEKA classifiers. In: 5th international conference on data mining. Las Vegas, pp 340–346
Baumgartner D, Serpen G (2012) A design heuristic for hybrid ensembles. Intell Data Anal 16(2):233–246
Banfield RE, Hall LO, Bowyer KW, Bhadoria D, Kegelmeyer WP (2007) A comparison of decision tree ensemble creation techniques. IEEE Trans Pattern Anal Mach Intell 29(1):173–180
Biggio B, Fumera G, Roli F (2010) Multiple classifier systems for robust classifier design in adversarial environments. Int J Mach Learn Cybern 1:27–41
Bian S, Wang W (2007) On diversity and accuracy of homogeneous and heterogeneous ensembles. Int J Hybrid Intell Syst 4:103–128
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Brown G, Wyatt J, Harris R, Yao X (2005) Diversity creation methods: a survey and categorisation. Inf Fusion 6:5–20
Caruana R, Niculescu-Mizil A, Crew G, Ksikes A (2004) Ensemble selection from libraries of models. In: Proceedings of the 21st international conference on machine learning, pp 137–144
Caruana R, Niculescu-Mizil A (2004) Data mining in metric space: an empirical analysis of supervised learning performance criteria. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, pp 69–78
Canuto AM, Abreu MC, Oliveira LM, Xavier JC, Santos AM (2007) Investigating the influence of the choice of the ensemble members in accuracy and diversity of selection-based and fusion-based methods for ensembles. Pattern Recognit Lett 28:472–486
Cawley GC, Talbot NL (n.d.) Miscellaneous Matlab Software. http://theoval.sys.uea.ac.uk/~gcc/matlab/default.html. Accessed January 2009
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Dietterich TG (2000) Ensemble methods in machine learning. Lect Notes Comput Sci 1857:1–15
Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56:52–64
Dzeroski S, Zenko B (2004) Is combining classifiers with stacking better than selecting the best one? Mach Learn 54:255–273
Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th international conference on machine learning, pp 148–156
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11:56–92
Hand DJ, Vinciotti V (2003) Local versus global models for classification problems: fitting models where it matters. Am Stat 57(2):124–131
Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Commun Stat 571–595
Kotsiantis SB, Pintelas PE (2004) A hybrid decision support tool—using ensemble of classifiers. Int Conf Enterp Inf Syst (ICEIS) 2:448–456
Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles and their relationship to ensemble accuracy. Mach Learn 51:181–207
Kuncheva LI (2003) That elusive diversity in classifier ensembles. Lect Notes Comput Sci 2652:1126–1138
Luengo J, Garcia S, Herrera F (2007) A study on the use of statistical tests for experimentation with neural networks. In: Proceedings of the 9th international work-conference on artificial neural networks. Lecture notes in computer science, vol 4507, pp 72–79
Mitchell TM (1997) Machine learning. McGraw-Hill, NY
Opitz D, Maclin R (1999) Popular ensemble methods: an empirical study. J Artif Intell Res 11:169–198
Ott RL, Longnecker M (2001) An introduction to statistical methods and data analysis, 5th edn. Duxbury, Pacific Grove
Pekalska E, Duin RP, Skurichina M (2002) A discussion on the classifier projection space for classifier combining. In: Roli F, Kittler J (eds) 3rd international workshop on multiple classifier systems, MCS02, vol 2364. Springer, Cagliari, pp 137–148
Polikar R (2006) Ensemble based systems in decision making. IEEE Circuits Syst Mag 6(3):21–45
Provost F, Domingos P (2003) Tree induction for probability-based ranking. Mach Learn 52(3):199–215
Quinlan JR (1996) Bagging, boosting, and C4.5. In: Proceedings of the 13th national conference on artificial intelligence, pp 725–730
Ricci F, Aha DW (1998) Error-correcting output codes for local learners. In: 10th European conference on machine learning, ECML. Springer, Berlin, pp 280–291
Sammon JW (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18:401–409
Seewald K, Furnkranz J (2001) An evaluation of grading classifiers. In: Proceedings of the 4th international conferences on advances in intelligent data analysis, pp 115–124
Seewald K (2002) How to make stacking better and faster while also taking care of an unknown weakness. In: Proceedings of the nineteenth international conference on machine learning, pp 554–561
Skalak D (1996) The sources of increased accuracy for two proposed boosting algorithms. In: AAAI ’96 workshop on integrating multiple learned models for improving and scaling machine learning algorithms
Tang EK, Suganthan PN, Yao X (2006) An analysis of diversity measures. Mach Learn 65:247–271
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
Wolpert DH (1992) Stacked generalization. Neural Netw 5(2):241–260
Yates WB, Partridge D (1996) Use of methodological diversity to improve neural network generalization. Neural Comput Appl 4(2):114–128
Zhiwen Y, Zhongkai D, Wong HS, Tan L (2010) Identifying protein kinase-specific phosphorylation sites based on the bagging-Adaboost ensemble approach. IEEE Trans Nanobiosci 9(2):132–143
Baumgartner, D., Serpen, G. Performance of global–local hybrid ensemble versus boosting and bagging ensembles. Int. J. Mach. Learn. & Cyber. 4, 301–317 (2013). https://doi.org/10.1007/s13042-012-0094-8