
6.1 Performance Measures

Performance measures are key tools for assessing the quality of machine learning approaches and models. Hence, several different measures have been proposed in the specialized literature with the goal of providing better choices in general or for specific application domains [2].

In the context of HEAD-DT’s fitness function, and given that it evaluates algorithms (individuals) over data sets, it is reasonable to assume that different classification performance measures could be employed to provide a quantitative assessment of algorithmic performance. In the next few sections, we present five different performance measures that were selected for further investigation as HEAD-DT’s fitness function.

6.1.1 Accuracy

Probably the most well-known performance evaluation measure for classification problems, the accuracy of a model is the rate of correctly classified instances:

$$\begin{aligned} { accuracy} = \frac{tp+tn}{tp+tn+fp+fn} \end{aligned}$$
(6.1)

where \(tp\) (\(tn\)) stands for the true positives (true negatives)—instances correctly classified—and \(fp\) (\(fn\)) stands for the false positives (false negatives)—instances incorrectly classified.

Even though most classification algorithms are assessed with respect to the accuracy they obtain on a data set, we must point out that accuracy may be a misleading performance measure. For instance, suppose we have a data set whose class distribution is very skewed: 90 % of the instances belong to class A and 10 % to class B. An algorithm that always classifies instances as belonging to class A would achieve 90 % accuracy, even though it never predicts a class-B instance. In this case, assuming that class B is as important as (or even more important than) class A, we would prefer an algorithm with lower accuracy that can occasionally predict instances as belonging to the rare class B correctly.
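
As a concrete illustration of this pitfall, the following minimal sketch (hypothetical counts, not taken from HEAD-DT) evaluates Eq. (6.1) for a majority-class predictor on such a skewed data set:

```python
# Minimal sketch: accuracy of a classifier that always predicts the majority
# class A on a 90/10 skewed data set (counts are hypothetical).

def accuracy(tp, tn, fp, fn):
    """Eq. (6.1): rate of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)

# Treat class B as the positive class: the always-A predictor misses every
# class-B instance (fn = 100) yet still scores 90 % accuracy.
print(accuracy(tp=0, tn=900, fp=0, fn=100))  # 0.9
```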

6.1.2 F-Measure

As presented in Sect. 4.4, the F-Measure (also called F-score or F\(_1\) score) is the harmonic mean of precision and recall:

$$\begin{aligned} { precision} = \frac{tp}{tp+fp} \end{aligned}$$
(6.2)
$$\begin{aligned} { recall} = \frac{tp}{tp+fn} \end{aligned}$$
(6.3)
$$\begin{aligned} f1 = 2 \times \frac{{ precision} \times { recall}}{{ precision} + { recall}} \end{aligned}$$
(6.4)

Note that although the F-Measure is advocated in the machine learning literature as a single measure capable of capturing the effectiveness of a system, it completely ignores the \(tn\) count, which can vary freely without affecting the statistic [8].
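
The following minimal sketch (hypothetical counts, not HEAD-DT code) computes Eqs. (6.2)–(6.4) directly from confusion-matrix entries; note that \(tn\) never appears in the computation:

```python
# Minimal sketch: precision, recall and F-Measure from confusion-matrix counts.

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)        # Eq. (6.2)
    recall = tp / (tp + fn)           # Eq. (6.3)
    return 2 * precision * recall / (precision + recall)  # Eq. (6.4)

# Hypothetical counts; tn plays no role in the result.
print(f_measure(tp=80, fp=20, fn=40))  # ~0.727
```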

6.1.3 Area Under the ROC Curve

The area under the ROC (receiver operating characteristic) curve (AUC) has been increasingly used as a performance evaluation measure in classification problems. The ROC curve graphically displays the trade-off between the true positive rate (\(tpr = tp/(tp+fn) \)) and the false positive rate (\(fpr = fp/(fp+tn) \)) of a classifier. ROC graphs have properties that make them especially useful for domains with skewed class distribution and unequal classification error costs [1].

To create the ROC curve, one plots the \(tpr\) along the y axis and the \(fpr\) along the x axis. Each point along the curve corresponds to one of the models induced by a given algorithm, and different models are built by varying a probabilistic threshold that determines whether an instance should be classified as positive or negative.
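
A minimal sketch of this construction is shown below, assuming a scoring classifier that outputs a positive-class score for each instance (scores and labels are made up for illustration, not HEAD-DT code):

```python
# Minimal sketch: (fpr, tpr) points obtained by sweeping a decision threshold
# over the predicted positive-class scores.

def roc_points(scores, labels):
    """Return (fpr, tpr) pairs, one per distinct threshold taken from the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical scores and true labels (1 = positive class)
print(roc_points([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))
```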

A ROC curve is a two-dimensional depiction of a classifier. To compare classifiers, we may want to reduce ROC performance to a single scalar value representing the expected performance, which is precisely the AUC. Since the AUC is a portion of the area of the unit square, its value always lies between 0 and 1. However, because random guessing produces the diagonal line between (0,0) and (1,1), which has an area of 0.5, no realistic classifier should have an AUC value of less than 0.5. The AUC has an important statistical property: it is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance, which makes the AUC equivalent to the Wilcoxon test of ranks [6].
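
This rank-based interpretation also suggests a direct way of estimating the AUC without tracing the curve; the sketch below (illustrative scores and labels, not HEAD-DT code) counts the fraction of positive/negative pairs ranked correctly, with ties counted as one half:

```python
# Minimal sketch: AUC as the probability that a random positive instance is
# scored higher than a random negative one.

def auc_from_scores(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_from_scores([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))  # 5/6 = 0.83
```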

The machine learning community often uses the AUC statistic for model comparison, even though this practice has recently been questioned by research showing that the AUC is quite noisy as a classification performance measure [3] and has other significant problems in model comparison [4, 5].

6.1.4 Relative Accuracy Improvement

Originally proposed by Pappa [7], the relative accuracy improvement criterion measures the normalized improvement in accuracy of a given model over the data set’s default accuracy (i.e., the accuracy achieved when using the majority class of the training data to classify the unseen data):

$$\begin{aligned} { RAI}_i={\left\{ \begin{array}{ll} \dfrac{{ Acc}_i - { DefAcc}_i}{1 - { DefAcc}_i}, & \text {if } { Acc}_i > { DefAcc}_i\\ \dfrac{{ Acc}_i - { DefAcc}_i}{{ DefAcc}_i}, & \text {otherwise} \end{array}\right. } \end{aligned}$$
(6.5)

In Eq. (6.5), \({ Acc}_i\) is the accuracy achieved by a given classifier in data set \(i\), whereas \({ DefAcc}_i\) is the default accuracy of data set \(i\). Note that if the improvement in accuracy is positive, i.e., the classifier accuracy is greater than the default accuracy, the improvement is normalized by the maximum possible improvement over the default accuracy (\(1 - { DefAcc}_i\)). Otherwise, the drop in the accuracy is normalized by the maximum possible drop, which is the value of the default accuracy itself. Hence, the relative accuracy improvement \({ RAI}_i\) regarding data set \(i\) returns a value between \(-\)1 (when \({ Acc}_i = 0\)) and 1 (when \({ Acc}_i = 1\)). Any improvement regarding the default accuracy results in a positive value, whereas any drop results in a negative value. In case \({ Acc}_i = { DefAcc}_i\) (i.e., no improvement or drop in accuracy is achieved), \({ RAI}_i = 0\), as expected.
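
A minimal sketch of Eq. (6.5) is given below (function and variable names are ours, not taken from HEAD-DT's implementation):

```python
# Minimal sketch: relative accuracy improvement of a classifier on one data set,
# where def_acc is the majority-class (default) accuracy of that data set.

def rai(acc, def_acc):
    if acc > def_acc:
        return (acc - def_acc) / (1.0 - def_acc)  # normalized by maximum possible gain
    return (acc - def_acc) / def_acc              # normalized by maximum possible drop

print(rai(acc=0.85, def_acc=0.60))  #  0.625 (improvement)
print(rai(acc=0.40, def_acc=0.60))  # -0.333 (drop)
```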

The disadvantage of the relative accuracy improvement criterion is that it is not suitable for very imbalanced problems—data sets in which the default accuracy is very close to 1—since high accuracy does not properly translate into high performance for these kinds of problems, as we have previously seen.

6.1.5 Recall

Also known as sensitivity (usually in the medical field) or true positive rate, recall measures the proportion of actual positives that are correctly identified as such. For instance, it may refer to the percentage of sick patients who are correctly classified as having a particular disease. In terms of the confusion matrix, recall is computed as follows:

$$\begin{aligned} { recall} = \frac{tp}{tp+fn} \end{aligned}$$
(6.6)

Recall is useful for imbalanced data, in which the positive class is the rare class. However, note that a classifier that always predicts the positive class achieves perfect recall, since recall does not take the \(fp\) values into consideration. This problem is alleviated in multi-class problems, in which each class is used in turn as the positive class and the average of the per-class recall values is taken.
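
The sketch below (hypothetical labels, not HEAD-DT code) computes this macro-averaged recall and shows how the degenerate always-positive classifier is penalized once every class takes its turn as the positive class:

```python
# Minimal sketch: per-class recall averaged over all classes.

def macro_recall(y_true, y_pred):
    recalls = []
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)

# A classifier that always predicts class 'A': perfect recall on 'A', zero on 'B'.
y_true = ['A'] * 9 + ['B']
y_pred = ['A'] * 10
print(macro_recall(y_true, y_pred))  # 0.5
```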

6.2 Aggregation Schemes

All classification measures presented in the previous section refer to the predictive performance of a given classifier in a given data set. When evolving an algorithm from multiple data sets, HEAD-DT’s fitness function is measured as the aggregated performance of the individual in each data set that belongs to the meta-training set. We propose employing three simple strategies for combining the per-data-set performance into a single quantitative value: (i) simple average; (ii) median; and (iii) harmonic mean.

The simple average (or arithmetic mean) is computed by taking the average of the per-data-set values, i.e., \((1/N)\times \sum _{i=1}^{N}p_i\), for a meta-training set with \(N\) data sets and a performance measure \(p\). It gives equal importance to the performance achieved in each data set, and it is best used when there are no extreme outliers and the values are independent of each other.

The median is computed by ordering the performance values from smallest to greatest and taking the middle value of the ordered list. If there is an even number of data sets, there is no single middle value; either the value at position \(N/2\) or at position \((N/2)+1\) can be used, or alternatively the average of the two. The median is robust to outliers in the data (extremely large or extremely small values that may distort the simple average).

Finally, the harmonic mean is given by \(\left( (1/N)\times \sum _{i=1}^{N}p_{i}^{-1}\right) ^{-1}\), i.e., the reciprocal of the average of the reciprocal values. Unlike the simple average, the harmonic mean gives less significance to high-value outliers, sometimes providing a better picture of the overall performance.
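
A minimal sketch of the three aggregation schemes applied to hypothetical per-data-set values follows (function names are ours):

```python
# Minimal sketch: aggregating per-data-set performance values p_1, ..., p_N.

import statistics

def simple_average(values):
    return sum(values) / len(values)

def median(values):
    return statistics.median(values)  # averages the two middle values when N is even

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

perf = [0.95, 0.90, 0.85, 0.40, 0.92]  # hypothetical per-data-set accuracies
print(simple_average(perf), median(perf), harmonic_mean(perf))
# 0.804, 0.90 and ~0.63: the harmonic mean is the most affected by the low value
```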

6.3 Experimental Evaluation

In this section, we perform an empirical evaluation of the five classification performance measures presented in Sect. 6.1 and the three aggregation schemes presented in Sect. 6.2 as fitness functions of HEAD-DT. This analysis results in a total of 15 distinct fitness functions:

  1. Accuracy + Simple Average (ACC-A);

  2. Accuracy + Median (ACC-M);

  3. Accuracy + Harmonic Mean (ACC-H);

  4. AUC + Simple Average (AUC-A);

  5. AUC + Median (AUC-M);

  6. AUC + Harmonic Mean (AUC-H);

  7. F-Measure + Simple Average (FM-A);

  8. F-Measure + Median (FM-M);

  9. F-Measure + Harmonic Mean (FM-H);

  10. Relative Accuracy Improvement + Simple Average (RAI-A);

  11. Relative Accuracy Improvement + Median (RAI-M);

  12. Relative Accuracy Improvement + Harmonic Mean (RAI-H);

  13. Recall + Simple Average (TPR-A);

  14. Recall + Median (TPR-M);

  15. Recall + Harmonic Mean (TPR-H).

For this experiment, we employed the 67 UCI data sets described in Table 5.14, organized into two scenarios: (i) 5 balanced data sets in the meta-training set; and (ii) 5 imbalanced data sets in the meta-training set. These scenarios were created to assess the performance of the 15 distinct fitness functions on balanced and imbalanced data, considering that some of the performance measures are explicitly designed to deal with imbalanced data whereas others are not. The degree of (im)balance was quantitatively measured according to the imbalance ratio (IR):

$$\begin{aligned} { IR} = \frac{F(A_{DS})}{F(B_{DS})} \end{aligned}$$
(6.7)

where \(F(.)\) returns the frequency of a given class, \(A_{DS}\) is the highest-frequency class in data set \(DS\), and \(B_{DS}\) is the lowest-frequency class in data set \(DS\).
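
A minimal sketch of Eq. (6.7), computed from the class-label column of a data set (hypothetical labels, not the actual UCI loading code), is shown below:

```python
# Minimal sketch: imbalance ratio (IR) as the frequency of the most common class
# divided by the frequency of the least common class.

from collections import Counter

def imbalance_ratio(class_labels):
    freq = Counter(class_labels)
    return max(freq.values()) / min(freq.values())

print(imbalance_ratio(['A'] * 50 + ['B'] * 47))  # ~1.06 -> balanced
print(imbalance_ratio(['A'] * 95 + ['B'] * 5))   # 19.0  -> imbalanced
```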

Given the size and complexity of this experiment, we did not optimise HEAD-DT's parameters as in Chap. 5, Sect. 5.2. Instead, we employed typical values found in the literature on evolutionary algorithms for decision-tree induction (the same parameters as in Chap. 5, Sect. 5.1), listed below (see the configuration sketch after the list):

  • Population size: 100;

  • Maximum number of generations: 100;

  • Selection: tournament selection with size \(t=2\);

  • Elitism rate: 5 individuals;

  • Crossover: uniform crossover with 90 % probability;

  • Mutation: random uniform gene mutation with 5 % probability;

  • Reproduction: cloning individuals with 5 % probability.
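
For reference, the parameters above can be expressed as a configuration sketch (the key names are hypothetical and do not correspond to HEAD-DT's actual implementation):

```python
# Hypothetical configuration sketch of the evolutionary parameters listed above.
HEAD_DT_PARAMS = {
    "population_size": 100,
    "max_generations": 100,
    "selection": {"type": "tournament", "size": 2},
    "elitism_individuals": 5,
    "crossover": {"type": "uniform", "probability": 0.90},
    "mutation": {"type": "random_uniform_gene", "probability": 0.05},
    "reproduction": {"type": "cloning", "probability": 0.05},
}
```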

In the next sections, we present the results for both meta-training scenarios. At the end of the chapter, we perform a new set of experiments with the best-performing fitness functions.

6.3.1 Results for the Balanced Meta-Training Set

We randomly selected 5 balanced data sets (IR \(<\) \(1.1\)) from the 67 UCI data sets described in Table 5.14 to be part of the meta-training set in this experiment: iris (\({ IR}=1.0\)), segment (\({ IR}=1.0\)), vowel (\({ IR}=1.0\)), mushroom (\({ IR}=1.07\)), and kr-vs-kp (\({ IR}=1.09\)).

Tables 6.1 and 6.2 show the results for the 62 data sets in the meta-test set regarding accuracy and F-Measure, respectively. At the bottom of each table, the average rank is presented for the 15 versions of HEAD-DT created by varying the fitness functions. We did not present standard deviation values due to space limitations within the tables.

Table 6.1 Accuracy values for the 15 versions of HEAD-DT varying the fitness functions
Table 6.2 F-Measure values for the 15 versions of HEAD-DT varying the fitness functions
Table 6.3 Values are the average performance (rank) of each version of HEAD-DT according to either accuracy or F-Measure

Careful inspection of both tables shows that their rankings are practically the same, with the median of the relative accuracy improvement (RAI-M) being the best-ranked method for either evaluation measure. Only minor position switches occur between the accuracy and F-Measure rankings, involving ACC-M, TPR-H, and FM-H.

Table 6.3 summarizes the average rank values obtained by each version of HEAD-DT with respect to accuracy and F-Measure. Values in bold indicate the best performing version according to the corresponding evaluation measure. It can be seen that version RAI-M is the best-performing method regardless of the evaluation measure. The average of the average ranks (average across evaluation measures) indicates the following final ranking positions (from best to worst): (1) RAI-M; (2) FM-M; (3) TPR-A; (4) RAI-A; (5) RAI-H; (6) TPR-M; (7) ACC-A; (8) FM-A; (9) ACC-H; (10) ACC-M; (11) FM-H; (12) TPR-H; (13) AUC-M; (14) AUC-A; (15) AUC-H.

To evaluate whether the differences between versions are statistically significant, we present the critical diagrams of the accuracy and F-Measure values in Fig. 6.1. It is possible to observe that there are no significant differences among the top-4 versions (RAI-M, FM-M, TPR-A, and RAI-A). Nevertheless, RAI-M is the only version that outperforms TPR-M and RAI-H with statistical significance in both evaluation measures, which is not the case for FM-M, TPR-A, and RAI-A.

Fig. 6.1 Critical diagrams for the balanced meta-training set experiment. a Accuracy rank. b F-measure rank

Some interesting conclusions can be drawn from this first set of experiments with a balanced meta-training set:

  • The AUC measure was not particularly effective for evolving decision-tree algorithms in this scenario, regardless of the aggregation scheme being used. Note that the versions of HEAD-DT that employ AUC in their fitness function perform quite poorly when compared to the remaining versions—AUC-M, AUC-A, and AUC-H are at the bottom of the ranking: 13th, 14th, and 15th positions, respectively;

  • The use of the harmonic mean as an aggregation scheme was not successful overall. The harmonic mean was often the worst aggregation scheme for the evaluation measures, occupying the lower positions of the ranking (except when combined with RAI);

  • The use of the median, on the other hand, was shown to be very effective in most cases. For three of the evaluation measures (relative accuracy improvement, F-Measure, and AUC), the median was the best aggregation scheme. In addition, the two best-ranked versions made use of the median as their aggregation scheme;

  • The relative accuracy improvement was overall the best evaluation measure, occupying the top part of the ranking (1st, 4th, and 5th best-ranked versions);

  • Finally, both F-Measure and recall were consistently among the best versions (2nd, 3rd, 6th, and 8th best-ranked versions), except once again when associated with the harmonic mean (11th and 12th).

Figure 6.2 depicts the fitness evolution throughout the evolutionary cycle. It presents both the best fitness in the population at each generation and the corresponding average fitness.

Fig. 6.2 Fitness evolution in HEAD-DT for the balanced meta-training set

Note that version AUC-M (Fig. 6.2e) achieves perfect fitness from the very first generation (AUC = 1). We further analysed this particular case and verified that the decision-tree algorithm designed in this version does not perform any kind of pruning. Even though prune-free algorithms usually overfit the training data (if no pre-pruning is performed either, they achieve 100 % accuracy on the training data) and thus underperform on the test data, it seems that this was not the case for the 5 data sets in the meta-training set. On the particular validation sets of the meta-training set, a prune-free algorithm whose stop criterion was a minimum of 3 instances per node was capable of achieving perfect AUC. Nevertheless, this automatically designed algorithm certainly suffered from overfitting on the meta-test set, since AUC-M was only the 13th best out of the 15 versions.

Versions FM-H (Fig. 6.2i) and TPR-H (Fig. 6.2o) also achieved their best fitness value in the first generation. The harmonic mean, due to its nature of downweighting higher values, seems to make the search for better individuals harder than the other aggregation schemes do.

6.3.2 Results for the Imbalanced Meta-Training Set

We randomly selected 5 imbalanced data sets (\({ IR}>10\)) from the 67 UCI data sets described in Table 5.14 to be part of the meta-training set in this experiment: primary-tumor (\({ IR}=84\)), anneal (\({ IR}=85.5\)), arrhythmia (\({ IR}=122.5\)), winequality-white (\({ IR}=439.6\)), and abalone (\({ IR}=689\)).

Tables 6.4 and 6.5 show the results for the 62 data sets in the meta-test set regarding accuracy and F-Measure, respectively. At the bottom of each table, the average rank is presented for the 15 versions of HEAD-DT created by varying the fitness functions. We once again did not present standard deviation values due to space limitations within the tables.

Table 6.4 Accuracy values for the 15 versions of HEAD-DT varying the fitness functions
Table 6.5 F-Measure values for the 15 versions of HEAD-DT varying the fitness functions

Careful inspection of both tables shows that their rankings are practically the same, with the average F-Measure (FM-A) being the best-ranked method for either evaluation measure. Only a minor position switch occurs between the accuracy and F-Measure rankings, involving ACC-H and RAI-M.

Table 6.6 summarizes the average rank values obtained by each version of HEAD-DT with respect to accuracy and F-Measure. Values in bold indicate the best performing version according to the corresponding evaluation measure. Note that version FM-A is the best-performing method regardless of the evaluation measure. The average of the average ranks (average across evaluation measures) indicates the following final ranking positions (from best to worst): (1) FM-A; (2) TPR-A; (3) TPR-H; (4) AUC-A; (5) AUC-H; (6) FM-H; (7) ACC-A; (8) ACC-M; (9) ACC-H; (10) RAI-M; (11) RAI-H; (12) FM-M; (13) TPR-M; (14) RAI-A; (15) AUC-M.

Table 6.6 Values are the average performance (rank) of each version of HEAD-DT according to either accuracy or F-Measure

To evaluate whether the differences among the versions are statistically significant, we present the critical diagrams of the accuracy and F-Measure values in Fig. 6.3. We can see that there are no statistically significant differences among the 7 best-ranked versions regarding accuracy (the 5 best-ranked regarding F-Measure). In addition, note that the 6 best-ranked versions involve performance measures that are suitable for evaluating imbalanced problems (F-Measure, recall, and AUC), which is expected given the composition of the meta-training set.

Fig. 6.3 Critical diagrams for the imbalanced meta-training set experiment. a Accuracy rank. b F-measure rank

The following conclusions can be drawn from this second set of experiments concerning imbalanced data sets:

  • The relative accuracy improvement is not suitable for dealing with imbalanced data sets and hence occupies the bottom positions of the ranking (10th, 11th, and 14th positions). This behavior is expected given that RAI measures the improvement over the majority-class accuracy, and pursuing such an improvement is often misleading for imbalanced problems, in which the goal is to improve the accuracy of the less-frequent class(es);

  • The median was the worst aggregation scheme overall, figuring in the bottom positions of the ranking (8th, 10th, 12th, 13th, and 15th). It is interesting to note that the median was very successful in the balanced meta-training experiment, and quite the opposite in the imbalanced one;

  • The simple average, on the other hand, proved to be the best aggregation scheme for the imbalanced data, figuring at the top of the ranking (1st, 2nd, 4th, and 7th), except when associated with RAI (14th), which was the worst performance measure overall;

  • The 6 best-ranked versions were those employing performance measures known to be suitable for imbalanced data (F-Measure, recall, and AUC);

  • Finally, the harmonic mean had a solid performance throughout this experiment, unlike its performance in the balanced meta-training experiment.

Figure 6.4 depicts the fitness evolution throughout the evolutionary cycle. Note that whereas some versions find their best individual at the very end of evolution (e.g., FM-H, Fig. 6.4i), others converge quite early (e.g., TPR-H, Fig. 6.4o), though there seems to be no direct relation between early (or late) convergence and predictive performance.

Fig. 6.4 Fitness evolution in HEAD-DT for the imbalanced meta-training set

6.3.3 Experiments with the Best-Performing Strategy

Considering that the median of the relative accuracy improvement (RAI-M) was the best-ranked fitness function for the balanced meta-training set, and that the average F-Measure (FM-A) was the best-ranked fitness function for the imbalanced meta-training set, we compare these two HEAD-DT versions with the baseline decision-tree induction algorithms C4.5, CART, and REPTree.

For version RAI-M, we use the same meta-training set as before: iris (\({ IR}=1\)), segment (\({ IR}=1\)), vowel (\({ IR}=1\)), mushroom (\({ IR}=1.07\)), and kr-vs-kp (\({ IR}=1.09\)). The resulting algorithm is tested over the 10 most-balanced data sets from Table 5.14:

  1. meta-data (\({ IR}=1\));

  2. mfeat (\({ IR}=1\));

  3. mb-promoters (\({ IR}=1\));

  4. kdd-synthetic (\({ IR}=1\));

  5. trains (\({ IR}=1\));

  6. tae (\({ IR}=1.06\));

  7. vehicle (\({ IR}=1.10\));

  8. sonar (\({ IR}=1.14\));

  9. heart-c (\({ IR}=1.20\));

  10. credit-a (\({ IR}=1.25\)).

For version FM-A, we also use the same meta-training set as before: primary-tumor (\({ IR}=84\)), anneal (\({ IR}=85.5\)), arrhythmia (\({ IR}=122.5\)), winequality-white (\({ IR}=439.6\)), and abalone (\({ IR}=689\)). The resulting algorithm is tested over the 10 most-imbalanced data sets from Table 5.14:

  • flags (\({ IR}=15\));

  • sick (\({ IR}=15.33\));

  • car (\({ IR}=18.62\));

  • autos (\({ IR}=22.33\));

  • sponge (\({ IR}=23.33\));

  • postoperative (\({ IR}=32\));

  • lymph (\({ IR}=40.50\));

  • audiology (\({ IR}=57\));

  • winequality-red (\({ IR}=68.10\));

  • ecoli (\({ IR}=71.50\)).

In Chap. 5, we saw that HEAD-DT is capable of generating effective algorithms tailored to a particular application domain (gene expression data). Now, with this new experiment, our goal is to verify whether HEAD-DT is capable of generating effective algorithms tailored to a particular statistical profile—in this case, tailored to balanced/imbalanced data.

Table 6.7 shows the accuracy and F-Measure values for HEAD-DT, C4.5, CART, and REPTree, in the 20 UCI data sets (10 most-balanced and 10 most-imbalanced). The version of HEAD-DT executed over the first 10 data sets is RAI-M, whereas the version executed over the remaining 10 is FM-A. In both versions, HEAD-DT is executed 5 times as usual, and the results are averaged.

Table 6.7 Accuracy and F-Measure values for the 10 most-balanced data sets and the 10 most-imbalanced data sets

Observe in Table 6.7 that HEAD-DT (RAI-M) outperforms C4.5, CART, and REPTree in 8 out of 10 data sets (in both accuracy and F-Measure), whereas C4.5 is the best algorithm in the remaining two data sets. The same can be said about HEAD-DT (FM-A), which also outperforms C4.5, CART, and REPTree in 8 out of 10 data sets, being outperformed once by C4.5 and once by CART.

We proceed by presenting the critical diagrams of accuracy and F-Measure (Fig. 6.5) in order to evaluate whether the differences among the algorithms are statistically significant. Note that HEAD-DT is the best-ranked method, often in the 1st position (rank = 1.30). HEAD-DT (versions RAI-M and FM-A) outperforms both CART and REPTree with statistical significance for \(\alpha = 0.05\). With respect to C4.5, it is outperformed by HEAD-DT with statistical significance for \(\alpha = 0.10\), though not for \(\alpha = 0.05\). Nevertheless, we are confident that being the best method in 16 out of 20 data sets is enough to conclude that HEAD-DT automatically generates decision-tree algorithms tailored to balanced/imbalanced data that are consistently more effective than C4.5, CART, and REPTree.

Fig. 6.5 Critical diagrams for accuracy and F-Measure. Values refer to the 20 UCI data sets in Table 6.7. a Accuracy rank for the balanced data sets. b F-measure rank for the balanced data sets

Since HEAD-DT is run 5 times to alleviate the randomness inherent to evolutionary algorithms, we further analyse the 5 algorithms generated by HEAD-DT for the balanced meta-training set and the 5 algorithms generated for the imbalanced meta-training set.

Regarding the balanced meta-training set, we noticed that the favored split criterion was the G statistic (present in 40 % of the algorithms). The favored stop criterion was stopping the tree-splitting process only when there is a single instance in the node (present in 80 % of the algorithms). The homogeneous stop was present in the remaining 20 % of the algorithms, but since a single instance is always homogeneous (only 1 class represented in the node), we can say that HEAD-DT's stop criterion was effectively to stop splitting nodes once they become homogeneous. Surprisingly, the favored pruning strategy was not to use any pruning at all (80 % of the algorithms). It seems that this particular combination of design components did not lead to overfitting, even though the trees were never pruned. Algorithm 1 shows this custom algorithm designed for balanced data sets.

Algorithm 1 (figure a)

Regarding the imbalanced meta-training set, we noticed that two split criteria stand out: DCSM (present in 40 % of the algorithms) and Normalized Gain (also present in 40 % of the algorithms). In 100 % of the algorithms, the nominal splits were aggregated into binary splits. The favored stop criterion was either the homogeneous stop (60 % of the algorithms) or stopping when a maximum depth of around 10 levels is reached (40 % of the algorithms). Finally, the pruning strategy was divided between PEP pruning with 1 SE (40 % of the algorithms) and no pruning at all (40 % of the algorithms). We noticed that whenever the algorithm employed DCSM, PEP pruning was the favored pruning strategy. Similarly, whenever the Normalized Gain was selected, no pruning was the favored strategy. It seems that HEAD-DT was capable of detecting a correlation between different split criteria and pruning strategies. Algorithm 2 shows the custom algorithm tailored to imbalanced data (where the executions selected different components, we present the alternative choices).

Algorithm 2 (figure b)

Regarding the missing value strategies, we did not notice any particular pattern in either the balanced or the imbalanced scenarios. Hence, the missing-value strategies presented in Algorithms 1 and 2 are only examples of selected components, though they did not stand out in terms of appearance frequency.

6.4 Chapter Remarks

In this chapter, we performed a series of experiments to analyse in more detail the impact of different fitness functions on the evolutionary cycle of HEAD-DT. In the first part of the chapter, we presented five classification performance measures and three aggregation schemes for combining these measures during fitness evaluation over multiple data sets. The combination of performance measures and aggregation schemes resulted in 15 different versions of HEAD-DT.

We designed two experimental scenarios to evaluate the 15 versions of HEAD-DT. In the first scenario, HEAD-DT is executed on a meta-training set with 5 balanced data sets and on a meta-test set with the remaining 62 available UCI data sets. In the second scenario, HEAD-DT is executed on a meta-training set with 5 imbalanced data sets and on a meta-test set with the remaining 62 available UCI data sets. To measure the level of data set balance, we make use of the imbalance ratio (IR), which is the ratio between the frequencies of the most-frequent and least-frequent classes of the data.

Results of the experiments indicated that the median of the relative accuracy improvement was the most suitable fitness function for the balanced scenario, whereas the average of the F-Measure was the most suitable fitness function for the imbalanced scenario. The next step of the empirical analysis was to compare these versions of HEAD-DT with the baseline decision-tree induction algorithms C4.5, CART, and REPTree. To do so, we employed the same meta-training sets as before, though the meta-test sets exclusively comprised balanced (imbalanced) data sets. The experimental results confirmed that HEAD-DT can generate algorithms tailored to a particular statistical profile (data set balance) that are consistently more effective than C4.5, CART, and REPTree, outperforming them in 16 out of 20 data sets.