Abstract
An approach to evolutionary ensemble learning for classification is proposed using genetic programming in which boosting is used to construct a stack of programs. Each application of boosting identifies a single champion and a residual dataset, i.e. the training records that thus far were not correctly classified. The next program is only trained against the residual, with the process iterating until some maximum ensemble size or no further residual remains. Training against a residual dataset actively reduces the cost of training. Deploying the ensemble as a stack also means that only one classifier might be necessary to make a prediction, so improving interpretability. Benchmarking studies are conducted to illustrate competitiveness with the prediction accuracy of current state-of-the-art evolutionary ensemble learning algorithms, while providing solutions that are orders of magnitude simpler. Further benchmarking with a high cardinality dataset indicates that the proposed method is also more accurate and efficient than XGBoost.
Supported by 2Keys Corporation - An Interac Company.
1 Introduction
Ensemble learning represents a widely employed meta-learning scheme for deploying multiple models under supervised learning tasks [1, 7, 23]. In general, there are three basic formulations: Bagging, Boosting or Stacking. We note, however, that most instances of ensemble learning adopt one scheme alone (Sect. 2). Moreover, Boosting and Bagging represent the most widely adopted approaches.
In this work, we investigate the utility of Boosting for constructing a Stacked ensemble (Sect. 3). Rather than members of an ensemble (i.e. programs) being deployed in parallel, our Stack assumes that programs are deployed sequentially in the order that they were originally discovered. Programs will be rewarded for either suggesting a class label or declaring the prediction as ‘ambiguous’. As we partner Stacking with Boosting, each program added to the Stack is explicitly focused on what previous programs could not classify. Thus, additional programs are only trained against exemplars associated with ambiguous predictions, i.e. the cardinality of the training partition actually decreases for each round of boosting. Post training, members of the Stack are sequentially visited until one makes a prediction (other than ‘ambiguous’), implying that only a fraction of the programs comprising the Stack needs to be visited to suggest a label.
A benchmarking study demonstrates the effectiveness of the approach by first comparing with recent evolutionary ensemble learning algorithms on a suite of previously benchmarked low cardinality datasets (Sect. 4). Having established the competitiveness of the proposed approach from the perspective of classifier accuracy, we then benchmark using a high cardinality classification task consisting of \(\approx 800,000\) records. The best of previous approaches is demonstrated not to scale under these conditions. Conversely, the efficiency of the proposed approach implies that parameter tuning is still feasible under the high cardinality setting.
2 Related Work
Ensemble learning appears in three basic forms, summarized as follows and assuming (without loss of generality) that the underlying goal is to produce solutions for a classification problem:
- Bagging: n classifiers are independently constructed and an averaging scheme adopted to aggregate the n independent predictions into a single ensemble recommendation (Footnote 1). The independence of the n classifiers is established by building each classifier on a different sample, or ‘bag’, taken from the original training partition.
- Boosting: constructs n classifiers sequentially. Classifier performance is used to incrementally reweight the probability of sampling training data, \(\mathcal {D}\), used to train the next classifier. Thus, each classifier encounters a sample of \(\mathcal {D}/n\) training records. Post-training, the ensemble again assumes an averaging scheme to aggregate the n labels into a single label.
- Stacking: assumes that a heterogeneous set of n classifiers is trained on a common training partition. The predictions from the n classifiers are then ‘stacked’ to define a \(n \times |\mathcal {D}|\) dataset that trains a ‘meta classifier’ to produce the overall classification prediction.
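To make the distinction concrete, the following sketch (with hypothetical predictions, not data from this paper) shows how Bagging/Boosting aggregate the n labels directly, whereas Stacking treats the \(n \times |\mathcal {D}|\) prediction matrix as meta-features for a second-level classifier:

```python
import numpy as np

# Hypothetical predictions from n = 3 base classifiers over |D| = 5 records.
base_preds = np.array([
    [0, 1, 1, 0, 1],   # classifier 1
    [0, 1, 0, 0, 1],   # classifier 2
    [1, 1, 1, 0, 0],   # classifier 3
])

# Bagging/Boosting: aggregate the n labels per record, e.g. by majority vote.
majority = (base_preds.sum(axis=0) >= 2).astype(int)   # one label per record

# Stacking: the n x |D| prediction matrix instead becomes the input to a
# 'meta classifier'; each record contributes one n-dimensional meta-feature.
meta_X = base_preds.T                                  # shape (|D|, n)
```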
Such a division of ensemble learning architectures reflects the view that the learning algorithm only constructs one model at a time. However, genetic programming (GP) develops multiple models simultaneously. Thus, one approach for GP ensemble learning might be to divide the population into islands and expose each island to different data subsets, where the data subset is constructed using a process synonymous with Bagging or Boosting (e.g. [9, 12, 13]). Some of the outcomes resulting from this research theme were that depending on the degree of isolation between populations, classifiers might result that were collectively strong, but individually weak [13], or that solution simplicity might be enabled by the use of ensemble methods [12]. Indeed, solution simplicity has been empirically reported for other GP ensemble formulations [15].
Another recurring theme is to assume that an individual takes the form of a ‘multi-tree’ [2, 17, 25]. In this case, an ‘individual’ is a team of n Tree-structured programs. Constraints are then enforced on the operation of variation operators in order to maintain context between programs in the multi-tree. This concept was developed further under the guise of ‘levels of selection’ in which selection can operate at the ‘level’ of a program or ensemble [26, 29]. However, in order to do so, it was necessary to have different performance definitions for each level.
Virgolin revisits the theme of evolving ensembles under Bagging while concentrating on the role of selection [27]. Specifically, individuals are evaluated w.r.t. all bags. The cost of fitness evaluation over all bags is minimized by evaluating programs once and caching the corresponding fitness values. The motivation for evaluating performance over all the bags is to promote uniform development toward the performance goal.
Boosting and Bagging in particular have also motivated the development of mechanisms for decoupling GP from the cardinality of the training partition. In essence, GP performance evaluation is only ever conducted relative to the content of a data subset, DS, where \(|DS| \ll |\mathcal {D}|\). However, the data subset is periodically resampled, with biases introduced to reflect the difficulty of labelling records correctly and the frequency of selecting data to appear in the data subset [11, 24]. This relationship was then later explicitly formulated from the perspective of competitively coevolving ensembles against the data subset [14,15,16], i.e. data subset and ensemble experience different performance functions.
Several coevolutionary approaches have also been proposed in which programs are defined in one population and ensembles are defined by another, e.g. [14,15,16, 21]. Programs are free to appear in multiple ensembles and the size of the ensemble is determined through evolution. The works of [14,15,16] have been extensively benchmarked over multi-class, high-dimensional and high-cardinality datasets. One limitation is the development of hitchhikers, i.e. programs that appear in the ensemble that never contribute to labelling data.
The concept of ‘Stacking’ has been less widely employed. However, the cascade-correlation approach to evolving neural networks might be viewed in this light [8]. Specifically, cascade-correlation begins with a single ‘perceptron’ and identifies the error residual. Assuming that the performance goal has not been reached, a new perceptron is added that receives as input all the previous perceptron outputs and the original data attributes. Each additional perceptron is trained to minimize the corresponding residual error. Potter and De Jong demonstrated a coevolutionary approach to evolving neural networks using this architecture [19]. Curry et al. benchmarked such an approach for training layers of GP classifiers under the cascade-correlation architecture [6]. However, they discovered that the GP classifiers could frequently degenerate to doing no more than copying the input from the previous layer.
Finally, the concept of interpreting the output of a GP program as a dimension in a new feature space rather than a label is also of significance to several ensemble methods. Thus, programs comprising an ensemble might be rewarded for mapping to a (lower-dimensional) feature space that can be clustered into class-consistent regions [16, 18]. Multiple programs appearing in an ensemble define the dimensions of the new feature space. Likewise, ensembles based on multi-trees have been rewarded for mapping to a feature space that maximizes the performance of a linear discriminant classifier [2].
3 Evolving an Ensemble Stack Using Boosting
The motivation of this work is to use a combination of boosting and stacking to address data cardinality while constructing the GP ensemble. We assume that classifiers are sequentially added to the ensemble. Our insight is that after adding each classifier, the data correctly classified is removed from the training partition. Thus, as classifiers are added to the ensemble the cardinality of the training partition decreases. Moreover, the next classifier evolved is explicitly directed to label what the ensemble cannot currently label. Post training, the ensemble is deployed as a ‘stack’ in which classifiers are applied in the order in which they were evolved. As will become apparent, we need not visit all members of the ensemble stack in order to make a prediction. In the following, we first present the evolutionary cycle adopted for developing the Boosted Ensemble Stack (Sect. 3.1) and then discuss how the Ensemble Stack is deployed (Sect. 3.2).
3.1 The Boosting Ensemble Stack Algorithm
Algorithm 1 summarizes the overall approach taken to boosting in this research. The training dataset, \(\mathcal {D}\), is defined in terms of a matrix \(X^t\) of n (input) records (each comprising d attributes) and a vector \(Y^t\) of n labels (i.e. supervised learning for classification). We may sample records pairwise, \(\langle \textbf{x}_p \in X^t, y_p \in Y^t\rangle \), during training. The outer loop (Step 3) defines the number of ‘boosting epochs’, where this sets an upper limit on ensemble size. Step 5 initializes a new population consisting of program decision stumps alone (single-node programs). Step 6 performs fitness evaluation, ranking, parent pool selection, and variation for a limited number of generations (Max_GP_Epoch). Specifically, the following form is assumed:
1. Fitness evaluation (Step 7) assumes the Gini index, in which the output of each program is modelled as a distribution (Algorithm 2). Fitter individuals are those with a higher Gini index.
2. Parent pool selection (PP) implies that the worst %Gap programs are deleted, leaving the parent pool as the survivors (Algorithm 1, Step 9).
3. Test for early stopping (Step 10), where this is defined in terms of fitness and ‘bin purity’, a concept defined below.
4. Variation operators (Step 21) are limited to: 1) cloning %Gap parents, 2) adding a single new node to a clone, where the parameters of the new node are chosen stochastically, and 3) mutating any of the parameters in the resulting offspring.
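Under stated assumptions, one generational step of this inner loop might be sketched as follows. Programs are illustrated here as flat lists of node parameters, and `clone`, `add_random_node` and `mutate_parameters` are illustrative stand-ins rather than the paper's actual operators:

```python
import random

def clone(program):
    """Programs are illustrated as flat lists of node parameters."""
    return list(program)

def add_random_node(program):
    """Grow a clone by a single stochastically parameterized node."""
    program.append(random.uniform(-1.0, 1.0))

def mutate_parameters(program):
    """Perturb one of the offspring's parameters."""
    i = random.randrange(len(program))
    program[i] += random.gauss(0.0, 0.1)

def gp_epoch(population, fitness, gap=0.5):
    """One generational step matching items 1-4 above: rank by fitness,
    delete the worst %Gap, then refill with grown and mutated clones of
    the surviving parent pool. (A sketch; node details are not the paper's.)"""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(1, int(round(len(ranked) * (1 - gap))))]
    offspring = []
    while len(survivors) + len(offspring) < len(population):
        child = clone(random.choice(survivors))
        add_random_node(child)
        mutate_parameters(child)
        offspring.append(child)
    return survivors + offspring
```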
In assuming a performance function based on the Gini index, programs perform the following mapping: \(\hat{y}_p = f(x_p)\), where \(x_p\) is input record p and \(\hat{y}_p\) is the corresponding output from the program. There is no attempt to treat \(\hat{y}_p\) as a predicted classification label for record p. Instead, \(\hat{y}_p\) represents the mapping of the original input, \(x_p\), to a scalar value on a 1-dimensional number line, \(\hat{y}\). After mapping all inputs to their corresponding \(\hat{y}_p\), we quantize \(\hat{y}\) into NumBin intervals (Step 11), as illustrated by Fig. 1. Each interval defines an equal, non-overlapping region of the number line \(\hat{y}\) and an associated bin. The bins act as containers for the labels, \(y_p = c\), associated with each \(\hat{y}_p\) appearing in the bin’s interval. Three types of bin may now appear:
- Empty bins: have no \(\hat{y}_p\) appearing in their interval.
- Pure bins: have \(\hat{y}_p\) appearing in their interval such that the majority of labels \(y_p\) are the same. A ‘pure bin’ assumes the label \(y_p\) reflecting the majority of the bin content and is declared when the following condition holds:
$$\begin{aligned} \frac{C_{bin} - y^*}{C_{bin}} < \beta \end{aligned}$$ (1)
where \(C_{bin}\) is the count of records appearing in the bin, \(C(c)\) is the number of records of class c in the bin, and \(y^* = \max _c C(c)\).
- Ambiguous bins: imply that some \(\hat{y}_p\) appear in an interval such that bin purity does not hold.
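A minimal sketch of the quantization and bin typing described above, assuming equal-width intervals over the observed range of \(\hat{y}\). The function name `bin_types` is illustrative, and Eq. (1) is applied as written, with \(\beta\) bounding the minority fraction of each bin:

```python
from collections import Counter

def bin_types(y_hat, y, num_bins, beta):
    """Quantize program outputs y_hat into num_bins equal intervals and
    classify each bin as 'empty', 'pure' (Eq. 1) or 'ambiguous'.
    Returns (types, labels), where labels holds each pure bin's majority
    class and None otherwise. A sketch; the paper's interval endpoints
    may differ."""
    lo, hi = min(y_hat), max(y_hat)
    width = (hi - lo) / num_bins or 1.0          # guard a constant mapping
    bins = [[] for _ in range(num_bins)]
    for v, label in zip(y_hat, y):
        idx = min(int((v - lo) / width), num_bins - 1)
        bins[idx].append(label)
    types, labels = [], []
    for content in bins:
        if not content:
            types.append('empty'); labels.append(None)
            continue
        majority, y_star = Counter(content).most_common(1)[0]
        if (len(content) - y_star) / len(content) < beta:   # Eq. (1)
            types.append('pure'); labels.append(majority)
        else:
            types.append('ambiguous'); labels.append(None)
    return types, labels
```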
If a pure bin is encountered for an individual during training (Step 12) that also has best fitness, then the next champion for inclusion in the ensemble has been identified. Moreover, each time a new ensemble member is to be evolved, it has to improve on the fitness of the previous ensemble member (Step 13).
Note that multiple Pure bins can appear, where this does not place any constraint on the labels representing different Pure bins. All training records corresponding to the ‘Pure bins’ are then identified (Step 27) and removed to define a candidate residual dataset (Step 30).
Finally, the champion individual is then ‘pushed’ to the list of ensemble members (Step 31). Note that the order in which champions are pushed to the list is significant (Sect. 3.2). At the next iteration of the algorithm, an entirely new population of programs is evolved to specifically label what the members of the ensemble currently cannot label. Moreover, ‘early stopping’ might appear if the residual dataset is empty (Step 32).
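The outer boosting loop (Steps 3 to 32) can then be summarized as follows, where `evolve_champion` and `pure_records` are illustrative stand-ins for the inner GP run and the pure-bin record lookup of Step 27:

```python
def boost_stack(X, y, max_boost_epochs, evolve_champion, pure_records):
    """Sketch of Algorithm 1's outer loop: each champion is trained only
    on the residual left by its predecessors (helper names are assumed,
    not the paper's)."""
    ensemble = []
    residual_X, residual_y = X, y
    for epoch in range(max_boost_epochs):
        champion = evolve_champion(residual_X, residual_y)
        ensemble.append(champion)                     # Step 31: push to stack
        covered = pure_records(champion, residual_X)  # Step 27: pure-bin records
        residual_X = [x for i, x in enumerate(residual_X) if i not in covered]
        residual_y = [l for i, l in enumerate(residual_y) if i not in covered]
        if not residual_X:                            # Step 32: early stopping
            break
    return ensemble
```

Because `pure_records` removes correctly handled records each epoch, the cardinality of the residual shrinks monotonically, which is the source of the training cost reduction claimed above.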
3.2 Evaluating an Ensemble Stack Post Training
Post-training, we need to define how the ensemble operates collectively to produce a single label; whereas, during training, programs are constructed ‘independently’, without direct reference to other members of the ensemble. Algorithm 3 summarizes how an ensemble is deployed as a ‘stack’ and evaluated following the order in which each ensemble member was originally added to the ensemble, i.e. the stack chooses programs under a first-in, first-out order (Step 3). Program i may only suggest a label for bins that were originally identified as ‘Pure bins’. If the program produces a \(\hat{y}_p\) corresponding to any other type of bin (Empty or Ambiguous), then the next program, \(i + 1\), is called on to suggest its mapping \(\hat{y}_p\) for record p (Step 16). At any point where a program maps the input to a ‘Pure bin’, then the corresponding label for that bin can be looked up. If this matches the actual label, \(y_p\), then the ensemble prediction is correct (Step 13). This also implies that the entire ensemble need not be evaluated in order to make a prediction.
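Algorithm 3's first-in, first-out deployment reduces to a short loop. Here `bin_label` is an assumed helper that returns the pure-bin label when a program's output lands in one of its pure bins, or `None` for Empty/Ambiguous bins:

```python
def stack_predict(ensemble, x, bin_label):
    """Sketch of Algorithm 3: programs are queried in the order in which
    they were evolved; the first program mapping x to one of its 'pure'
    bins supplies the ensemble's label."""
    for program in ensemble:          # first-in, first-out (Step 3)
        label = bin_label(program, x)
        if label is not None:         # x landed in one of this program's pure bins
            return label
    return None                       # no program in the stack could label x
```

Note that as soon as one program answers, the remaining programs are never evaluated, which is why only a fraction of the stack is typically visited per prediction.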
The first-in, first-out deployment of programs mimics the behaviour of ‘falling rule lists’, a model of deployment significant to interpretable machine learning [22]. Thus, the most frequent queries are answered by the classifier deployed earliest. As the classes of queries become less frequent (more atypical) prediction is devolved to classifiers appearing deeper in the list (queue). The proposed evolutionary ensemble learner adds programs to the stack in precisely this order.
3.3 Using an Extremely Large Number of Bins
One approach to parameterizing BStacGP is to force the number of bins to be very low (2 or 3). The intuition is that this rewards BStacGP for discovering a mapping of records to bins that develops at least one bin to be pure. This approach will be adopted later for the ‘small scale’ benchmark. Under high cardinality datasets the opposite approach is assumed. The insight here is that there is sufficient data to support multiple pure bins, such that we maximize the number of training records mapped to pure bins by the same program. This also results in the fastest decrease in data cardinality during training.
Taking this latter approach to the limit, we set the number of bins to the resolution of a 32-bit floating point number, i.e. \(2^{32}\). During training, bins are again identified as pure, ambiguous or empty. However, given the resolution of the bins, most training data will likely be mapped to a ‘pure’ bin, reducing the number of boosting epochs necessary to build the ensemble. Under test conditions, given that the bins have a high resolution and the data has not previously been encountered, it is likely that some test data will be mapped to ‘empty’ bins. Hence, under test conditions, the test data is labelled by the bin that it is closest to. If the closest bin is not pure but ambiguous or empty, then the next tree in the stack is queried.
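A sketch of this test-time lookup under high-resolution binning, assuming the occupied bins are stored as a mapping from bin centre (a float on the number line \(\hat{y}\)) to label, with `None` marking an ambiguous bin. The sketch simplifies the description above in that empty bins are simply absent from the mapping, so proximity is decided among occupied bins only:

```python
def nearest_bin_label(y_hat, occupied_bins):
    """occupied_bins: {bin centre: label} for pure bins, {bin centre: None}
    for ambiguous bins. Label y_hat by the nearest occupied bin; return
    None (defer to the next tree in the stack) when that bin is ambiguous
    or when this program produced no occupied bins at all."""
    if not occupied_bins:
        return None
    centre = min(occupied_bins, key=lambda c: abs(c - y_hat))
    return occupied_bins[centre]
```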
4 Experimental Methodology
A benchmarking comparison will be made against a set of five datasets from a recent study [27]. This enables us to establish to what degree the proposed approach is competitive with five state-of-the-art approaches for GP ensemble classification. The datasets appearing in this comparison are all widely used binary classification tasks from the UCI repository (Footnote 2), as summarized by Table 1. The training and test partitions are stratified to provide the same class distribution in each partition, and the training/test split is always 70/30%.
The algorithms appearing in this comparison take the following form:
- 2SEGP: A bagging approach to evolving GP ensembles in which the underlying design goal is to maintain uniform performance across the multiple bootstrap bags. Such an approach was demonstrated to be particularly competitive with other recent developments in evolutionary ensemble learning [27]. In addition, the availability of a code base enables us to make additional benchmarking comparisons under a high cardinality dataset.
- eGPw: Represents the best-performing configuration of the cooperative coevolutionary approach to ensemble learning from [21]. Specifically, benchmarking revealed the ability to discover simple solutions to binary classification problems while being competitive with Random Forests and XGBoost.
- DNGP: Represents an approach to GP ensembles in which diversity maintenance represents an underlying design goal [28]. Diversity maintenance is a recurring theme in evolutionary ensemble learning, where the motivation is to reduce the correlation between ensemble members. The framework was reimplemented and benchmarked in the study of [27].
- M3GP: Is not explicitly an evolutionary ensemble learner but does evolve a set of programs to perform feature engineering [18]. The underlying objective is to discover a mapping to a new feature space that enables clustering to separate between classes. M3GP has been extensively benchmarked in multiple contexts and is included in this study as an example of what GP classification can achieve without evolutionary ensemble learning being the design goal [4, 18].
A second benchmarking study is then performed on a large cardinality dataset describing an intrusion detection task. This dataset is the union of the normal and botnet data from the CTU-13 dataset [10], resulting in hundreds of thousands of data records (Table 1). In this case, we compare the best two evolutionary ensembles from the first study with C4.5 [20] and XGBoost [5]. The latter represent very efficient non-evolutionary machine learning approaches to classification.
Table 2 summarizes the parameters assumed for BStacGP. Two parameterizations are considered in each benchmarking study: ‘slow and complex’ versus ‘fast and simple’. One insight is that higher data cardinality can support a higher bin count. We therefore use the lowest bin count (2) and a high purity threshold (0.99) as a starting point under the small scale benchmarking study. Such a combination forces at least one of the 2 bins to satisfy the high bin purity threshold. We then differentiate between the ‘slow and complex’ and ‘fast and simple’ scenarios by increasing the population size (with the parent pool increasing proportionally). The value for Max_Boost_epoch is set intentionally high; in practice such an ensemble size is not encountered, due to early stopping being triggered (Algorithm 1, Step 32). Under the large scale benchmark the largest bin count was assumed; this does not imply that every bin must contain values, but it does give the distribution the highest resolution.
5 Results
Two benchmarking studies are performed (Footnote 3). The first assumes a suite of ‘small scale’ classification tasks (Sect. 5.1) that recently provided the basis for comparing several state-of-the-art GP evolutionary ensemble learners [27]. The second study reports results on a single large cardinality dataset using the best two evolutionary ensemble learners from the first study, and two non-evolutionary methods (Sect. 5.2). Hereafter, the proposed approach is referred to as BStacGP.
5.1 Small Scale Classification Tasks
Tables 3 and 4 report benchmarking for the five small datasets. In the previous benchmarking study, 2SEGP and M3GP were the best-performing algorithms on these datasets [27]. Introducing the proposed Boosted Stack ensemble (BStacGP) to the comparison changes the ranking somewhat. That said, all five GP formulations perform well on the BCW dataset (>95% under test), whereas the widest variance in classifier performance appears under HEART and SONAR. Applying the Friedman non-parametric test for multiple models to the ranking of test performance fails to reject the null hypothesis (all algorithms are ranked equally). Given that these are all strong classification algorithms this is not in itself unexpected.
BStacGP and 2SEGP are the two highest ranked algorithms on training and test. With this in mind, we consider the average solution complexity and time to evolve solutions. In the case of complexity, the average number of nodes in the tree-structured GP individuals comprising an ensemble is counted (Fig. 2). It is apparent that BStacGP is able to discover solutions that are typically an order of magnitude simpler. Figure 3 summarizes the wall clock time to conduct training. Both BStacGP and 2SEGP are implemented in Python. BStacGP typically completes training an order of magnitude earlier than 2SEGP. In summary, the process by which BStacGP only evolves each new classifier against the ‘residual’ of misclassified data results in a significant decrease in computational cost and model complexity.
5.2 Large Scale Classification Task
BStacGP and 2SEGP accounted for 4 out of 5 of the best classifier performances under the test partition of the ‘Small Scale’ tasks (Table 4). With this in mind, we now compare the performance of BStacGP and 2SEGP under the high-cardinality CTU dataset with the non-evolutionary classifiers C4.5 and XGBoost. Table 5 reports performance for the average (over 40 trials) of BStacGP, C4.5 and XGBoost and a single run of 2SEGP. Specifically, the high run time cost of 2SEGP precluded performing multiple trials on this benchmark (Footnote 4). The best-performing algorithm under test is XGBoost; however, BStacGP is within \(1.3\%\) of this result. Conversely, 2SEGP returns a 10% lower accuracy, a result in part due to 20 generations taking 4 hours to perform, so limiting the ability to optimize parameters.
In the case of solution complexity (as measured by node count), BStacGP returns solutions that are 2 to 3 orders of magnitude simpler than any comparator algorithm. From the perspective of the computational cost of performing training, C4.5 is the fastest. However, of the ensemble algorithms, BStacGP is twice as fast as XGBoost and 3 orders of magnitude faster than 2SEGP.
A second experiment is performed using the CTU dataset, the hypothesis, in this case, being that constraining C4.5 and XGBoost to the same complexity as BStacGP will have a significant impact on their classification performance. Put another way, the comparator algorithms will not be able to discover solutions with similar complexity to BStacGP without significantly compromising their classification accuracy.
Table 6 summarizes performance under the low complexity condition. It is now apparent that BStacGP is able to maintain solution accuracy on this task while further reducing the computation necessary to identify such a solution. This implies that BStacGP has the potential to scale to tasks that other formulations of evolutionary ensemble learning fail to scale to, while maintaining state-of-the-art classification performance. Thus, as dataset cardinality increases, algorithm efficiency has an increasing impact: without it, parameter tuning is simply not feasible, an important practical property to have.
Post training, BStacGP solutions can be queried. For example, the 1st program from a typical BStacGP ensemble stack provided \(\approx 57\%\) of the labels, the 2nd \(\approx 24\%\), the 3rd \(\approx 15\%\) and the 4th \(\approx 4\%\). This illustrates the ‘fall through’ nature of BStacGP operation, in which most of the data is labelled by a single GP program. The complexities of the trees associated with each stack level are respectively 138, 250, 305 and 407 nodes, i.e. complexity increases with position in the stack.
6 Conclusion
An approach to evolutionary ensemble learning is proposed that employs boosting to develop a stack of GP programs. Key components of the framework include: 1) interpreting the program output as a distribution; 2) quantizing the distribution into intervals and therefore ‘binning’ the number of records mapped to an interval; 3) making predictions on the basis of ‘bin purity’; and 4) removing records from the training partition corresponding to correctly classified instances. The combination of these properties incrementally reduces the cardinality of the training partition as classifiers are added to the ensemble, and explicitly focuses the role of the next program on what previous programs could not classify. Moreover, the resulting ensemble is then deployed as a ‘Stack’. This is important because it now means that only part of the ensemble is responsible for providing a label, thus improving the explainability of the resulting ensemble.
The accuracy of the BStacGP framework on small cardinality datasets previously employed for benchmarking is empirically shown to be comparable to state-of-the-art evolutionary ensemble learners. Moreover, training time and model simplicity is significantly improved. This property is shown to be key to scaling BStacGP much more efficiently to a large cardinality dataset containing hundreds of thousands of records. Indeed the results are competitive with non-evolutionary methods.
Future work will scale BStacGP to multi-class classification and continue to investigate scalability and solution transparency.
Notes
1. Classification tasks often assume the majority vote, although voting/weighting schemes might be evolved [3].
2.
3. Laptop with Intel i7 10700k CPU, 4.3 GHz single core.
4. 2SEGP parameterization: pop. size 500, ensemble size 50, max. tree size 500.
References
Agapitos, A., Loughran, R., Nicolau, M., Lucas, S.M., O’Neill, M., Brabazon, A.: A survey of statistical machine learning elements in genetic programming. IEEE Trans. Evol. Comput. 23(6), 1029–1048 (2019)
Badran, K.M.S., Rockett, P.I.: Multi-class pattern classification using single, multi-dimensional feature-space feature extraction evolved by multi-objective genetic programming and its application to network intrusion detection. Genet. Program Evolvable Mach. 13(1), 33–63 (2012)
Brameier, M., Banzhaf, W.: Evolving teams of predictors with linear genetic programming. Genet. Program Evolvable Mach. 2(4), 381–407 (2001)
Cava, W.G.L., Silva, S., Danai, K., Spector, L., Vanneschi, L., Moore, J.H.: Multidimensional genetic programming for multiclass classification. Swarm Evol. Comput. 44, 260–272 (2019)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2016)
Curry, R., Lichodzijewski, P., Heywood, M.I.: Scaling genetic programming to large datasets using hierarchical dynamic subset selection. IEEE Trans. Syst. Man, Cybern. - Part B 37(4), 1065–1073 (2007)
Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn. 40(2), 139–157 (2000)
Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In: Advances in Neural Information Processing Systems, vol. 2, pp. 524–532. Morgan Kaufmann (1989)
Folino, G., Pizzuti, C., Spezzano, G.: Training distributed GP ensemble with a selective algorithm based on clustering and pruning for pattern classification. IEEE Trans. Evol. Comput. 12(4), 458–468 (2008)
García, S., Grill, M., Stiborek, J., Zunino, A.: An empirical comparison of botnet detection methods. Comput. Secur. 45, 100–123 (2014)
Gathercole, C., Ross, P.: Dynamic training subset selection for supervised learning in genetic programming. In: Davidor, Y., Schwefel, H.-P., Männer, R. (eds.) PPSN 1994. LNCS, vol. 866, pp. 312–321. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-58484-6_275
Iba, H.: Bagging, boosting, and bloating in genetic programming. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1053–1060. Morgan Kaufmann (1999)
Imamura, K., Soule, T., Heckendorn, R.B., Foster, J.A.: Behavioral diversity and a probabilistically optimal GP ensemble. Genet. Program Evolvable Mach. 4(3), 235–253 (2003)
Lichodzijewski, P., Heywood, M.I.: Managing team-based problem solving with symbiotic bid-based genetic programming. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 363–370. ACM (2008)
Lichodzijewski, P., Heywood, M.I.: Symbiosis, complexification and simplicity under GP. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 853–860. ACM (2010)
McIntyre, A.R., Heywood, M.I.: Classification as clustering: a pareto cooperative-competitive GP approach. Evol. Comput. 19(1), 137–166 (2011)
Muni, D.P., Pal, N.R., Das, J.: A novel approach to design classifiers using genetic programming. IEEE Trans. Evol. Comput. 8(2), 183–196 (2004)
Muñoz, L., Silva, S., Trujillo, L.: M3GP – multiclass classification with GP. In: Machado, P., et al. (eds.) EuroGP 2015. LNCS, vol. 9025, pp. 78–91. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16501-1_7
Potter, M.A., Jong, K.A.D.: Cooperative coevolution: an architecture for evolving coadapted subcomponents. Evol. Comput. 8(1), 1–29 (2000)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, Burlington (1993)
Rodrigues, N.M., Batista, J.E., Silva, S.: Ensemble genetic programming. In: Hu, T., Lourenço, N., Medvet, E., Divina, F. (eds.) EuroGP 2020. LNCS, vol. 12101, pp. 151–166. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-44094-7_10
Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019)
Sipper, M., Moore, J.H.: Symbolic-regression boosting. CoRR abs/2206.12082 (2022)
Song, D., Heywood, M.I., Zincir-Heywood, A.N.: Training genetic programming on half a million patterns: an example from anomaly detection. IEEE Trans. Evol. Comput. 9(3), 225–239 (2005)
Soule, T.: Voting teams: a cooperative approach to non-typical problems using genetic programming. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 916–922. Morgan Kaufmann (1999)
Thomason, R., Soule, T.: Novel ways of improving cooperation and performance in ensemble classifiers. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1708–1715. ACM (2007)
Virgolin, M.: Genetic programming is naturally suited to evolve bagging ensembles. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 830–839. ACM (2021)
Wang, S., Mei, Y., Zhang, M.: Novel ensemble genetic programming hyper-heuristics for uncertain capacitated arc routing problem. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1093–1101. ACM (2019)
Wu, S.X., Banzhaf, W.: Rethinking multilevel selection in genetic programming. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1403–1410. ACM (2011)
Acknowledgements
This research was enabled by the support of a Natural Sciences and Engineering Research Council (NSERC) of Canada Alliance Grant.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhou, Z. et al. (2023). A Boosting Approach to Constructing an Ensemble Stack. In: Pappa, G., Giacobini, M., Vasicek, Z. (eds) Genetic Programming. EuroGP 2023. Lecture Notes in Computer Science, vol 13986. Springer, Cham. https://doi.org/10.1007/978-3-031-29573-7_9