
1 Introduction

Ensemble learning is a widely employed meta-learning scheme for combining multiple models in supervised learning tasks [1, 7, 23]. In general, there are three basic formulations: Bagging, Boosting, and Stacking. We note, however, that most instances of ensemble learning adopt one scheme alone (Sect. 2), with Boosting and Bagging being the most widely adopted approaches.

In this work, we investigate the utility of Boosting for constructing a Stacked ensemble (Sect. 3). Rather than members of an ensemble (i.e. programs) being deployed in parallel, our Stack assumes that programs are deployed sequentially in the order that they were originally discovered. Programs will be rewarded for either suggesting a class label or declaring the prediction as ‘ambiguous’. As we partner Stacking with Boosting, each program added to the Stack is explicitly focused on what previous programs could not classify. Thus, additional programs are only trained against exemplars associated with ambiguous predictions, i.e. the cardinality of the training partition actually decreases for each round of boosting. Post training, members of the Stack are sequentially visited until one makes a prediction (other than ‘ambiguous’), implying that only a fraction of the programs comprising the Stack needs to be visited to suggest a label.

A benchmarking study demonstrates the effectiveness of the approach by first comparing with recent evolutionary ensemble learning algorithms on a suite of previously benchmarked low cardinality datasets (Sect. 4). Having established the competitiveness of the proposed approach from the perspective of classifier accuracy, we then benchmark using a high cardinality classification task consisting of \(\approx 800,000\) records. The best of previous approaches is demonstrated not to scale under these conditions. Conversely, the efficiency of the proposed approach implies that parameter tuning is still feasible under the high cardinality setting.

2 Related Work

Ensemble learning appears in three basic forms, summarized as follows and assuming (without loss of generality) that the underlying goal is to produce solutions to a classification problem; a minimal library-level illustration of all three forms follows the list:

  • Bagging: n classifiers are independently constructed and an averaging scheme is adopted to aggregate the n independent predictions into a single ensemble recommendation. The independence of the n classifiers is established by building each classifier on a different sample taken from the original training partition, or a ‘bag’.

  • Boosting: constructs n classifiers sequentially. Classifier performance is used to incrementally reweight the probability of sampling training data, \(\mathcal {D}\), used to train the next classifier. Thus, each classifier encounters a sample of \(\mathcal {D}/n\) training records. Post-training, the ensemble again assumes an averaging scheme to aggregate the n labels into a single label.

  • Stacking: assumes that a heterogeneous set of n classifiers is trained on a common training partition. The predictions from the n classifiers are then ‘stacked’ to define an \(n \times |\mathcal {D}|\) dataset that trains a ‘meta classifier’ to produce the overall classification prediction.
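For readers less familiar with these formulations, the following scikit-learn sketch instantiates all three on a generic binary task. The library models and dataset are generic stand-ins for illustration only; they are unrelated to the GP ensembles considered in this paper.

```python
# Illustration of Bagging, Boosting and Stacking using scikit-learn.
# The dataset and base models are placeholders, not those used in this paper.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Bagging: n independent classifiers, each trained on a bootstrap 'bag'.
bagging = BaggingClassifier(n_estimators=10, random_state=0)

# Boosting: n classifiers built sequentially, with training data reweighted
# according to the errors of the classifiers built so far.
boosting = AdaBoostClassifier(n_estimators=10, random_state=0)

# Stacking: heterogeneous base classifiers whose predictions train a meta classifier.
stacking = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()), ('nb', GaussianNB())],
    final_estimator=LogisticRegression())

for name, clf in [('bagging', bagging), ('boosting', boosting), ('stacking', stacking)]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```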

Such a division of ensemble learning architectures reflects the view that the learning algorithm only constructs one model at a time. However, genetic programming (GP) develops multiple models simultaneously. Thus, one approach to GP ensemble learning might be to divide the population into islands and expose each island to a different data subset, where the subsets are constructed using a process analogous to Bagging or Boosting (e.g. [9, 12, 13]). Among the outcomes of this research theme were that, depending on the degree of isolation between populations, classifiers might result that are collectively strong but individually weak [13], and that solution simplicity might be enabled by the use of ensemble methods [12]. Indeed, solution simplicity has been empirically reported for other GP ensemble formulations [15].

Another recurring theme is to assume that an individual takes the form of a ‘multi-tree’ [2, 17, 25]. In this case, an ‘individual’ is a team of n Tree-structured programs. Constraints are then enforced on the operation of variation operators in order to maintain context between programs in the multi-tree. This concept was developed further under the guise of ‘levels of selection’ in which selection can operate at the ‘level’ of a program or ensemble [26, 29]. However, in order to do so, it was necessary to have different performance definitions for each level.

Virgolin revisits the theme of evolving ensembles under Bagging while concentrating on the role of selection [27]. Specifically, individuals are evaluated w.r.t. all bags. The cost of fitness evaluation over all bags is minimized by evaluating programs once and caching the corresponding fitness values. The motivation for evaluating performance over all the bags is to promote uniform development toward the performance goal.

Boosting and Bagging in particular have also motivated the development of mechanisms for decoupling GP from the cardinality of the training partition. In essence, GP performance evaluation is only ever conducted relative to the content of a data subset, DS, where \(|DS| \ll |\mathcal {D}|\). However, the data subset is periodically resampled, with biases introduced to reflect the difficulty of labelling records correctly and the frequency with which data are selected to appear in the data subset [11, 24]. This relationship was later explicitly formulated from the perspective of competitively coevolving ensembles against the data subset [14,15,16], i.e. data subset and ensemble experience different performance functions.

Several coevolutionary approaches have also been proposed in which programs are defined in one population and ensembles are defined by another, e.g. [14,15,16, 21]. Programs are free to appear in multiple ensembles, and the size of the ensemble is determined through evolution. The works of [14,15,16] have been extensively benchmarked over multi-class, high-dimensional and high-cardinality datasets. One limitation is the development of hitchhikers, i.e. programs that appear in the ensemble but never contribute to labelling data.

The concept of ‘Stacking’ has been less widely employed. However, the cascade-correlation approach to evolving neural networks might be viewed in this light [8]. Specifically, cascade-correlation begins with a single ‘perceptron’ and identifies the error residual. Assuming that the performance goal has not been reached, a new perceptron is added that receives as input all the previous perceptron outputs and the original data attributes. Each additional perceptron is trained to minimize the corresponding residual error. Potter and deJong demonstrated a coevolutionary approach to evolving neural networks using this architecture [19]. Curry et al. benchmarked such an approach for training layers of GP classifiers under the cascade-correlation architecture [6]. However, they discovered that the GP classifiers could frequently degenerate to doing no more than copying the input from the previous layer.

Finally, the concept of interpreting the output of a GP program as a dimension in a new feature space rather than a label is also of significance to several ensemble methods. Thus, programs comprising an ensemble might be rewarded for mapping to a (lower-dimensional) feature space that can be clustered into class-consistent regions [16, 18]. Multiple programs appearing in an ensemble define the dimensions of the new feature space. Likewise, ensembles based on multi-trees have been rewarded for mapping to a feature space that maximizes the performance of a linear discriminant classifier [2].

3 Evolving an Ensemble Stack Using Boosting

The motivation for this work is to use a combination of boosting and stacking to address data cardinality while constructing the GP ensemble. We assume that classifiers are sequentially added to the ensemble. Our insight is that after adding each classifier, the data it correctly classifies is removed from the training partition. Thus, as classifiers are added to the ensemble, the cardinality of the training partition decreases. Moreover, the next classifier evolved is explicitly directed to label what the ensemble cannot currently label. Post training, the ensemble is deployed as a ‘stack’ in which classifiers are applied in the order in which they were evolved. As will become apparent, we need not visit all members of the ensemble stack in order to make a prediction. In the following, we first present the evolutionary cycle adopted for developing the Boosted Ensemble Stack (Sect. 3.1) and then discuss how the Ensemble Stack is deployed (Sect. 3.2).

3.1 The Boosting Ensemble Stack Algorithm

Algorithm 1 summarizes the overall approach taken to boosting in this research. The training dataset, \(\mathcal {D}\), is defined in terms of a matrix \(X^t\) of n (input) records (each comprising d attributes) and a vector \(Y^t\) of n labels (i.e. supervised learning for classification). We may sample records pairwise, \(\langle \textbf{x}_p \in X^t, y_p \in Y^t\rangle \), during training. The outer loop (Step 3) defines the number of ‘boosting epochs’, which sets an upper limit on ensemble size. Step 5 initializes a new population, consisting of program decision stumps alone (single-node programs). Step 6 performs fitness evaluation, ranking, parent pool selection, and variation for a limited number of generations (Max_GP_Epoch). Specifically, the following form is assumed:

  1. Fitness evaluation (Step 7) assumes the Gini index, with the output of each program modelled as a distribution (Algorithm 2). Fitter individuals are those with a higher Gini index.

  2. Parent pool selection (PP) deletes the worst %Gap programs, leaving the survivors as the parent pool (Algorithm 1, Step 9).

  3. Early stopping is tested (Step 10) in terms of fitness and ‘bin purity’, a concept defined below.

  4. Variation operators (Step 21) are limited to: 1) cloning %Gap parents; 2) adding a single new node to a clone, where the parameters of the new node are chosen stochastically; and 3) mutating any of the parameters in the resulting offspring.

(Algorithm 1)
(Algorithm 2)
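As a reading aid, the following sketch illustrates the parent pool selection and variation operators described in items 2 and 4 above. The program representation (a flat list of parameterized nodes) and all helper names are illustrative placeholders for the tree-structured GP programs used in this work, not the implementation itself.

```python
# Sketch of %Gap parent-pool selection and the three variation operators.
# 'Program' here is a stand-in representation: a flat list of node dicts.
import random

def breed(population, fitnesses, gap=0.5, rng=None):
    """population: list of programs (each a flat list of node dicts);
    fitnesses: parallel list of scores (higher is better)."""
    rng = rng or random.Random(0)
    ranked = [p for _, p in sorted(zip(fitnesses, population),
                                   key=lambda t: t[0], reverse=True)]
    n_gap = int(len(ranked) * gap)
    parents = ranked[:len(ranked) - n_gap]                   # delete the worst %Gap
    offspring = []
    for _ in range(n_gap):
        clone = [dict(node) for node in rng.choice(parents)]  # 1) clone a parent
        clone.append({'op': rng.choice('+-*/'),               # 2) append one new node
                      'param': rng.uniform(-1.0, 1.0)})       #    chosen stochastically
        rng.choice(clone)['param'] += rng.gauss(0.0, 0.1)     # 3) mutate a parameter
        offspring.append(clone)
    return parents + offspring

# Toy usage: a population of three single-node 'stumps'.
pop = [[{'op': '+', 'param': 0.3}], [{'op': '-', 'param': -0.2}],
       [{'op': '*', 'param': 1.1}]]
print(breed(pop, fitnesses=[0.4, 0.9, 0.1], gap=1/3))
```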

In assuming a performance function based on the Gini index, programs perform the following mapping: \(\hat{y}_p = f(\textbf{x}_p)\), where \(\textbf{x}_p\) is input record p and \(\hat{y}_p\) is the corresponding program output. There is no attempt to treat \(\hat{y}_p\) as a predicted class label for record p. Instead, \(\hat{y}_p\) represents the mapping of the original input, \(\textbf{x}_p\), to a scalar value on a 1-dimensional number line, \(\hat{y}\). After mapping all inputs to their corresponding \(\hat{y}_p\), we quantize \(\hat{y}\) into NumBin intervals (Step 11), as illustrated by Fig. 1; a small worked example follows the figure. Each interval defines an equal, non-overlapping region of the number line \(\hat{y}\) and an associated bin. The bins act as containers for the labels, \(y_p = c\), associated with each \(\hat{y}_p\) appearing in the bin’s interval. Three types of bin may now appear:

  • Empty bins: have no \(\hat{y}_p\) appearing in their interval.

  • Pure bins: have \(\hat{y}_p\) appearing in their interval such that the associated labels \(y_p\) overwhelmingly agree, as determined by the purity threshold \(\beta\). A ‘pure bin’ assumes the majority label of its content and is declared when the following condition holds,

    $$\begin{aligned} \frac{C_{bin} - y^*}{C_{bin}} < \beta \end{aligned}$$
    (1)

    where \(C_{bin}\) is the number of records appearing in the bin, \(C(c)\) is the number of bin records belonging to class c, and \(y^* = \max _c C(c)\).

  • Ambiguous bins: have \(\hat{y}_p\) appearing in their interval whose labels are mixed, such that the bin purity condition does not hold.

Fig. 1. Illustration of the relationship between program outputs (\(\hat{y}_p\)), intervals, bins, labels (\(y_p = c\)) and bin type.
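The following worked example makes the quantization and bin typing concrete for a handful of made-up program outputs; the code follows Eq. (1) literally and is illustrative only.

```python
# Tiny worked example of the bin typing of Fig. 1 and Eq. (1), using made-up
# program outputs; the bin bookkeeping below is illustrative only.
import numpy as np

y_hat = np.array([0.10, 0.15, 0.30, 0.62, 0.68, 0.71, 0.95])   # program outputs
y     = np.array([  0,    0,    1,    1,    1,    1,    0  ])   # true labels
num_bin, beta = 4, 0.3

edges = np.linspace(y_hat.min(), y_hat.max(), num_bin + 1)      # equal-width intervals
bins = np.clip(np.digitize(y_hat, edges[1:-1]), 0, num_bin - 1)

for b in range(num_bin):
    labels = y[bins == b]
    if labels.size == 0:
        print(b, 'empty')
        continue
    cls, cnt = np.unique(labels, return_counts=True)
    c_bin, y_star = labels.size, cnt.max()
    if (c_bin - y_star) / c_bin < beta:                          # Eq. (1)
        print(b, 'pure, label =', cls[cnt.argmax()])
    else:
        print(b, 'ambiguous')
```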

If, during training, a pure bin is encountered for the individual with the best fitness (Step 12), then the next champion for inclusion in the ensemble has been identified. Moreover, each new ensemble member has to improve on the fitness of the previous ensemble member (Step 13).

Note that multiple Pure bins can appear, and no constraint is placed on the labels associated with different Pure bins. All training records corresponding to the ‘Pure bins’ are then identified (Step 27) and removed to define a candidate residual dataset (Step 30).

Finally, the champion individual is ‘pushed’ to the list of ensemble members (Step 31). Note that the order in which champions are pushed to the list is significant (Sect. 3.2). At the next iteration of the algorithm, an entirely new population of programs is evolved specifically to label what the current ensemble members cannot. Moreover, ‘early stopping’ is triggered if the residual dataset is empty (Step 32). A structural sketch of this training loop follows.
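In the sketch below, each boosting round searches for a new champion, removes the training records mapped to its pure bins, and pushes the champion onto the stack. For brevity, the GP inner loop is replaced by a crude stand-in (random linear projections scored by a Gini-style bin purity); the stand-in search and all names are illustrative assumptions, not the implementation benchmarked later.

```python
# Structural sketch of the boosting loop (Algorithm 1). The GP inner loop is
# replaced by a crude stand-in (random linear projections), so the helper
# names and program representation are illustrative only.
import numpy as np

def bins_and_purity(y_hat, y, num_bin, beta):
    """Equal-width bins over y_hat; per-record mask for 'pure' bins (Eq. 1)."""
    edges = np.linspace(y_hat.min(), y_hat.max(), num_bin + 1)
    b = np.clip(np.digitize(y_hat, edges[1:-1]), 0, num_bin - 1)
    pure_mask = np.zeros(len(y), dtype=bool)
    bin_labels, score = {}, 0.0
    for k in range(num_bin):
        lab = y[b == k]
        if lab.size == 0:
            continue                                     # empty bin
        cls, cnt = np.unique(lab, return_counts=True)
        score += (lab.size / len(y)) * np.sum((cnt / lab.size) ** 2)  # Gini-style
        if (lab.size - cnt.max()) / lab.size < beta:     # Eq. (1): pure bin
            pure_mask[b == k] = True
            bin_labels[k] = cls[cnt.argmax()]
    return pure_mask, bin_labels, edges, score

def search_program(X, y, num_bin, beta, iters=300, rng=None):
    """Stand-in for the GP inner loop: keep the best random projection."""
    rng = rng or np.random.default_rng()
    best = None
    for _ in range(iters):
        w = rng.normal(size=X.shape[1])
        pure, labels, edges, score = bins_and_purity(X @ w, y, num_bin, beta)
        if labels and (best is None or score > best[0]):
            best = (score, w, labels, edges, pure)
    return best

def train_stack(X, y, max_boost_epoch=10, num_bin=5, beta=0.2, seed=0):
    rng = np.random.default_rng(seed)
    stack, prev_score = [], -np.inf
    Xt, yt = X.copy(), y.copy()
    for _ in range(max_boost_epoch):
        found = search_program(Xt, yt, num_bin, beta, rng=rng)
        if found is None or found[0] <= prev_score:      # Step 13: must improve
            break
        score, w, labels, edges, covered = found
        prev_score = score
        stack.append((w, labels, edges))                 # Step 31: push champion
        Xt, yt = Xt[~covered], yt[~covered]              # remove pure-bin records
        if len(yt) == 0:                                 # Step 32: early stopping
            break
    return stack

# Toy usage on synthetic, linearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)
print('ensemble stack size:', len(train_stack(X, y)))
```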

3.2 Evaluating an Ensemble Stack Post Training

Post-training, we need to define how the ensemble operates collectively to produce a single label, whereas during training programs were constructed ‘independently’, without direct reference to other members of the ensemble. Algorithm 3 summarizes how an ensemble is deployed as a ‘stack’ and evaluated following the order in which each ensemble member was originally added, i.e. the stack visits programs in first-in, first-out order (Step 3). Program i may only suggest a label for bins that were originally identified as ‘Pure bins’. If the program produces a \(\hat{y}_p\) corresponding to any other type of bin (Empty or Ambiguous), then the next program, \(i + 1\), is called on to suggest its mapping \(\hat{y}_p\) for record p (Step 16). At any point where a program maps the input to a ‘Pure bin’, the corresponding label for that bin can be looked up. If this matches the actual label, \(y_p\), then the ensemble prediction is correct (Step 13). This also implies that the entire ensemble need not be evaluated in order to make a prediction; a minimal sketch of this deployment logic follows Algorithm 3.

The first-in, first-out deployment of programs mimics the behaviour of ‘falling rule lists’, a model of deployment significant to interpretable machine learning [22]. Thus, the most frequent queries are answered by the classifier deployed earliest. As the classes of queries become less frequent (more atypical), prediction is devolved to classifiers appearing deeper in the list (queue). The proposed evolutionary ensemble learner adds programs to the stack in precisely this order.

(Algorithm 3)
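A minimal sketch of the deployment logic follows. Each stack entry pairs a program with its bin edges and the labels of its pure bins, as recorded at training time; the fallback used when no program claims a record (a default class) is an illustrative assumption rather than part of Algorithm 3.

```python
# Minimal sketch of stack evaluation. Each stack entry pairs a program with
# its bin edges and the labels of its pure bins. The fallback when no program
# claims a record is an assumption on our part, not part of Algorithm 3.
import numpy as np

def predict(stack, x, default=0):
    """Visit programs first-in, first-out until one maps x to a pure bin."""
    for program, edges, pure_labels in stack:
        y_hat = program(x)
        b = int(np.clip(np.digitize(y_hat, edges[1:-1]), 0, len(edges) - 2))
        if b in pure_labels:               # pure bin: label found, stop here
            return pure_labels[b]
        # otherwise the bin was empty/ambiguous at training time: defer
    return default

# Toy stack of two hand-written 'programs' (stand-ins for evolved GP trees).
stack = [
    (lambda x: x[0] - x[1], np.array([-2.0, 0.0, 2.0]), {1: 1}),   # bin 1 pure, class 1
    (lambda x: x[0] + x[1], np.array([-2.0, 0.0, 2.0]), {0: 0}),   # bin 0 pure, class 0
]
print(predict(stack, np.array([1.5, 0.2])))    # first program claims it -> 1
print(predict(stack, np.array([-1.0, 0.5])))   # falls through to the second -> 0
```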

3.3 Using an Extremely Large Number of Bins

One approach to parameterizing BStacGP is to force the number of bins to be very low (2 or 3). The intuition is that this rewards BStacGP for discovering a mapping of records to bins under which at least one bin becomes pure. This approach will be adopted later for the ‘small scale’ benchmark. Under high cardinality datasets, the opposite approach is assumed. The insight here is that there is sufficient data to support multiple pure bins, so we maximize the number of training records mapped to pure bins by the same program. This also results in the fastest decrease in data cardinality during training.

Taking this latter approach to the limit, we set the number of bins to the resolution of a 32-bit floating-point representation, i.e. \(2^{32}\) bins. During training, bins are again identified as pure, ambiguous or empty. However, given the resolution of the bins, most training data will likely be mapped to a ‘pure’ bin, reducing the number of boosting epochs necessary to build the ensemble. Under test conditions, given that the bins have a high resolution and the data has not previously been encountered, it is likely that some test data will be mapped to ‘empty’ bins. Hence, under test conditions, a test record is labelled by the bin it is closest to. If the closest bin is ambiguous or empty rather than pure, then the next tree in the stack is queried. One possible sparse implementation of this nearest-bin lookup is sketched below.
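In the sketch, the non-empty bins are stored sparsely as their sorted output values plus labels, and a test output is resolved to the nearest recorded value. The interpretation of ‘closest bin’ as the closest non-empty bin, and all names, are illustrative assumptions.

```python
# Sketch of the extreme-bin-count variant. With ~2**32 bins, most non-empty
# bins hold a single training output, so we store the non-empty bins sparsely.
# (Records sharing an identical output would share a bin and could be
# ambiguous; that case is omitted for brevity.) Interpretation and names are
# our own assumptions.
import numpy as np

def build_sparse_bins(y_hat_train, y_train):
    order = np.argsort(y_hat_train)
    return y_hat_train[order], y_train[order]      # one record per bin: pure

def lookup(centres, labels, y_hat_test):
    """Label each test output by the nearest recorded (pure) bin."""
    i = np.searchsorted(centres, y_hat_test)
    i = np.clip(i, 1, len(centres) - 1)
    # choose the nearer of the two neighbouring centres
    left_closer = (y_hat_test - centres[i - 1]) < (centres[i] - y_hat_test)
    return labels[np.where(left_closer, i - 1, i)]

centres, labels = build_sparse_bins(np.array([0.1, 0.4, 0.9]), np.array([0, 1, 1]))
print(lookup(centres, labels, np.array([0.05, 0.55, 2.0])))   # -> [0 1 1]
```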

4 Experimental Methodology

A benchmarking comparison will be made against a set of five datasets from a recent previous study [27]. This enables us to establish to what degree the proposed approach is competitive with five state-of-the-art approaches for GP ensemble classification. The datasets appearing in this comparison are all widely used binary classification tasks from the UCI repository, as summarized by Table 1. The training and test partitions are stratified to provide the same class distribution in each partition, and the training/test split is always 70%/30%.

The algorithms appearing in this comparison take the following form:

  • 2SEGP: A bagging approach to evolving GP ensembles in which the underlying design goal is to maintain uniform performance across the multiple bootstrap bags. Such an approach was demonstrated to be particularly competitive with other recent developments in evolutionary ensemble learning [27]. In addition, the availability of a code base enables us to make additional benchmarking comparisons under a high cardinality dataset.

  • eGPw: Represents the best-performing configuration of the cooperative coevolutionary approach to ensemble learning from [21]. Specifically, benchmarking revealed the ability to discover simple solutions to binary classification problems while being competitive with Random Forests and XGBoost.

  • DNGP: Represents an approach to GP ensembles in which diversity maintenance is an underlying design goal [28]. Diversity maintenance is a recurring theme in evolutionary ensemble learning, where the motivation is to reduce the correlation between ensemble members. The framework was reimplemented and benchmarked in the study of [27].

  • M3GP: Is not explicitly an evolutionary ensemble learner but does evolve a set of programs to perform feature engineering [18]. The underlying objective is to discover a mapping to a new feature space that enables clustering to separate between classes. M3GP has been extensively benchmarked in multiple contexts and included in this study as an example of what GP classification can achieve without evolutionary ensemble learning being the design goal [4, 18].

Table 1. Properties of the benchmarking datasets. Class distribution reflects the proportion of positive to negative class instances.

A second benchmarking study is then performed on a large cardinality dataset describing an intrusion detection task. This dataset is the union of the normal and botnet data from the CTU-13 dataset [10], resulting in hundreds of thousands of data records (Table 1). In this case, we compare the best two evolutionary ensembles from the first study with C4.5 [20] and XGBoost [5]. The latter represent very efficient non-evolutionary machine learning approaches to classification.

Table 2. BStacGP parameterization. Gap defines the size of the parent pool. \(\beta \) is defined by Eq. (1). \(\alpha \) weights the fitness regularization term (Algorithm 2). NumBin appears in Algorithm 3, and the remaining parameters in Algorithm 1. The instruction set comprises the arithmetic operators \(\langle +, -, \div , \times \rangle \).

Table 2 summarizes the parameters assumed for BStacGP. Two parameterizations are considered in each benchmarking study: ‘slow and complex’ versus ‘fast and simple’. One insight is that higher data cardinality can imply a higher bin count. We therefore use the lowest bin count (2) and a high purity threshold (0.99) as a starting point under the Small Scale benchmarking study. Such a combination forces at least one of the two bins to satisfy the high bin-purity threshold. We then differentiate between ‘slow and complex’ versus ‘fast and simple’ scenarios by increasing the population size (with the parent pool increasing proportionally). The value for Max_Boost_epoch is set intentionally high; in practice such an ensemble size is not encountered because early stopping is triggered (Algorithm 1, Step 32). Under the large scale benchmark, the largest bin count was assumed; this does not imply that every bin contains values, but it does give the distribution the highest possible resolution.

Table 3. Classifier accuracy on the training partition for small binary datasets. Bold indicates the best-performing classifier on the dataset

5 Results

Two benchmarking studies are performed. The first assumes a suite of ‘small scale’ classification tasks (Sect. 5.1) that recently provided the basis for comparing several state-of-the-art GP evolutionary ensemble learners [27]. The second study reports results on a single large cardinality dataset using the two best evolutionary ensemble learners from the first study and two non-evolutionary methods (Sect. 5.2). Hereafter, the proposed approach is referred to as BStacGP.

5.1 Small Scale Classification Tasks

Tables 3 and 4 report benchmarking for the five small datasets. In the previous benchmarking study, 2SEGP and M3GP were the best-performing algorithms on these datasets [27]. Introducing the proposed Boosted Stack ensemble (BStacGP) to the comparison changes the ranking somewhat. That said, all five GP formulations perform well on the BCW dataset (>95% under test), whereas the widest variance in classifier performance appears under HEART and SONAR. Applying the Friedman non-parametric test to the ranking of test performance fails to reject the null hypothesis that all algorithms are ranked equally. Given that these are all strong classification algorithms, this is not in itself unexpected.
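For reference, such a comparison can be reproduced with scipy’s Friedman test applied to the per-dataset accuracies of the competing algorithms; the values below are placeholders, not the entries of Tables 3 and 4.

```python
# Friedman test over per-dataset test accuracies of competing algorithms.
# The accuracy values are hypothetical placeholders, not taken from Table 4.
from scipy.stats import friedmanchisquare

acc_algo_a = [0.96, 0.80, 0.90, 0.88, 0.78]   # one entry per dataset (hypothetical)
acc_algo_b = [0.97, 0.82, 0.89, 0.87, 0.80]   # hypothetical
acc_algo_c = [0.95, 0.79, 0.91, 0.86, 0.77]   # hypothetical

stat, p = friedmanchisquare(acc_algo_a, acc_algo_b, acc_algo_c)
print(f'chi-square = {stat:.3f}, p = {p:.3f}')  # p > 0.05: fail to reject equal ranks
```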

Table 4. Classifier accuracy on the test partition for small binary datasets. Bold indicates the best-performing result
Fig. 2. Solution complexity for BStacGP and 2SEGP on the small scale classification tasks. ‘fast’ and ‘slow’ represent the two BStacGP parameterizations.

BStacGP and 2SEGP are the two highest ranked algorithms on training and test. With this in mind, we consider the average solution complexity and the time to evolve solutions. In the case of complexity, the average number of nodes in the tree-structured GP individuals comprising an ensemble is counted (Fig. 2). It is apparent that BStacGP is able to discover solutions that are typically an order of magnitude simpler. Figure 3 summarizes the wall clock time to conduct training. Both BStacGP and 2SEGP are implemented in Python. BStacGP typically completes training an order of magnitude faster than 2SEGP. In summary, BStacGP’s practice of incrementally evolving classifiers against only the ‘residual’ data that earlier members could not classify results in a significant decrease in computational cost and model complexity.

Fig. 3. Training time for BStacGP and 2SEGP on the small scale classification tasks. ‘fast’ and ‘slow’ represent the two BStacGP parameterizations.

5.2 Large Scale Classification Task

BStacGP and 2SEGP accounted for the best test-partition performance on 4 of the 5 ‘Small Scale’ datasets (Table 4). With this in mind, we now compare the performance of BStacGP and 2SEGP under the high-cardinality CTU dataset with the non-evolutionary classifiers C4.5 and XGBoost. Table 5 reports performance averaged over 40 trials for BStacGP, C4.5 and XGBoost, and a single run of 2SEGP; the high run time cost of 2SEGP precluded performing multiple trials on this benchmark. The best-performing algorithm under test is XGBoost; however, BStacGP is within \(1.3\%\) of this result. Conversely, 2SEGP returns a 10% lower accuracy, a result due in part to 20 generations taking 4 h to perform, thus limiting the ability to optimize parameters.

In the case of solution complexity (as measured by node count), BStacGP returns solutions that are 2 to 3 orders of magnitude simpler than any comparator algorithm. From the perspective of the computational cost of performing training, C4.5 is the fastest. However, of the ensemble algorithms, BStacGP is twice as fast as XGBoost and 3 orders of magnitude faster than 2SEGP.

Table 5. CTU dataset with BStacGP using ‘slow’ parameters (Table 2)

A second experiment is performed using the CTU dataset, the hypothesis in this case being that constraining C4.5 and XGBoost to the same complexity as BStacGP will have a significant impact on their classification performance. Put another way, the comparator algorithms will not be able to discover solutions of similar complexity to BStacGP without significantly compromising their classification accuracy.

Table 6. CTU dataset with BStacGP using ‘fast’ parameters (Table 2)

Table 6 summarizes performance under the low complexity condition. It is now apparent that BStacGP is able to maintain solution accuracy on this task while further reducing the computation necessary to identify such a solution. This implies that BStacGP has the potential to scale to tasks that other formulations of evolutionary ensemble learning fail to scale to, while maintaining state-of-the-art classification performance. Thus, as dataset cardinality increases, algorithm efficiency has an increasing practical impact: without it, parameter tuning is simply not possible.

Post training, BStacGP solutions can be queried. For example, the 1st program from a typical BStacGP ensemble stack provided \(\approx 57\%\) of the labels, the 2nd \(\approx 24\%\), the 3rd \(\approx 15\%\) and the 4th \(\approx 4\%\). This illustrates the ‘fall through’ nature of BStacGP operation, in which most of the data is labelled by a single GP program. The complexity of the trees associated with each stack level is, respectively, 138, 250, 305 and 407 nodes, i.e. complexity increases with position in the stack.

6 Conclusion

An approach to evolutionary ensemble learning is proposed that employs boosting to develop a stack of GP programs. Key components of the framework include: 1) interpreting the program output as a distribution; 2) quantizing the distribution into intervals, thereby ‘binning’ the records mapped to each interval; 3) making predictions on the basis of ‘bin purity’; and 4) removing from the training partition the records corresponding to correctly classified instances. The combination of these properties incrementally reduces the cardinality of the training partition as classifiers are added to the ensemble, and explicitly focuses the role of the next program on what previous programs could not classify. Moreover, the resulting ensemble is deployed as a ‘Stack’. This is important because only part of the ensemble is then responsible for providing a label, improving the explainability of the resulting ensemble.

The accuracy of the BStacGP framework on the small cardinality datasets previously employed for benchmarking is empirically shown to be comparable to state-of-the-art evolutionary ensemble learners. Moreover, training time and model simplicity are significantly improved. This property is shown to be key to scaling BStacGP much more efficiently to a large cardinality dataset containing hundreds of thousands of records. Indeed, the results are competitive with non-evolutionary methods.

Future work will scale BStacGP to multi-class classification and continue to investigate scalability and solution transparency.