
1 Introduction

The growing interest in using machine learning (ML) techniques to solve prediction problems and support decision-making in almost any area of human activity is undeniable. The use of artificial neural networks (ANNs) has enabled the development of novel applications. However, ANNs are black-box models that require additional mechanisms to explain how they produce their predictions. Therefore, their use has begun to be regulated to prevent wrong decisions being supported by these methods in areas such as medicine and economics [29]. This has driven the use of post hoc methods to generate explanations of ANN predictions [31]. Other ML strategies, such as decision trees (DTs), are white-box models with high levels of interpretability and transparency. Among the DT types are those that use oblique hyperplanes in their internal nodes, producing more compact DTs than those that use a single attribute in each partition. However, traditional methods to induce DTs present some drawbacks, such as overfitting, selection bias toward multi-valued attributes, and instability with respect to small changes in the training set [16]. For this reason, other induction techniques have been proposed that, instead of applying recursive partitioning, perform a search in the space of possible DTs. Evolutionary algorithms (EAs), such as genetic algorithms and genetic programming, have been widely applied to find near-optimal DTs that are more accurate than those created by traditional techniques [16].

Differential Evolution (DE) is an EA that has successfully solved numerical optimization problems, and its standard versions have also been applied to induce DTs [6, 9, 18, 24, 25]. To the best of our knowledge, adaptive DE versions have not been applied to induce oblique DTs. On the other hand, it has been observed that a randomly created initial population produces redundant internal nodes that affect the size and accuracy of the induced model. In this study, we analyze the effect of using four adaptive DE versions (JADE, SHADE, LSHADE, and JSO) to induce oblique DTs, as well as the effect on model accuracy and size of two additional strategies for creating the initial DE population: dipoles and centroids. The experimental results establish that the four algorithms exhibit similar statistical behavior. However, when the dipole-based initialization is used, the JSO method creates trees with better accuracy. Furthermore, the LSHADE algorithm induces more compact trees under the three initialization strategies evaluated.

The rest of this paper is organized into five additional sections. Section 2 introduces the characteristics of oblique DTs, and the adaptive DE approaches are described in Sect. 3. In Sect. 4, the elements of the comparative study are detailed. Section 5 presents the experimental details and results. Finally, in Sect. 6, the conclusions of this study are presented, and some directions for future work are outlined.

2 Oblique Decision Trees

DTs are classification models that split datasets according to diverse criteria, such as the distance between instances or the reduction of the classification error. These models create a hierarchical structure using test conditions (internal nodes) and class labels (leaf nodes), allowing visualization of attribute importance in decisions and of how attributes are used to classify an instance. DTs are the most popular interpretable algorithms for classification and regression [13]. Depending on the number of attributes evaluated in each internal node, two decision tree types are induced: univariate (axis-parallel DTs) and multivariate (oblique and non-linear DTs). In particular, oblique DTs (ODTs) use test conditions representing hyperplanes with an oblique orientation relative to the axes of the instance space. ODTs are generally smaller and more accurate than univariate DTs, but they are also generally more difficult to interpret [4]. ID3 [22], C4.5 [23], and CART [2] are the most popular methods for inducing univariate DTs, and CART and OC1 [19] are well-known methods for creating oblique DTs. Figure 1 shows an example of an ODT; on the right of this figure, the instance space is split using two oblique hyperplanes.
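Concretely, each internal node of an ODT evaluates a test condition of the following form, where \((z_1, \dots , z_d)\) are the attribute values of an instance and \((w_1, \dots , w_{d+1})\) are the hyperplane coefficients (the orientation convention, sending an instance to the left subtree when the inequality holds, is an illustrative assumption):

$$\begin{aligned} w_1 z_1 + w_2 z_2 + \cdots + w_d z_d + w_{d+1} > 0 \end{aligned}$$

A univariate (axis-parallel) DT corresponds to the special case in which all but one of the coefficients \(w_1, \dots , w_d\) are zero.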

Fig. 1. Example of an oblique decision tree.

3 Adaptive Differential Evolution Approaches

DE is an EA evolving a population of real-valued vectors \({\textbf {x}}_i= \big (x_{i,1},x_{i,2}, \cdots , x_{i,n} \big )^T\) of n variables, to find a near-optimal solution to an optimization problem [21]. Instead of implementing traditional crossover and mutation operators, DE applies a linear combination of several randomly selected individuals to produce a new individual. Three randomly selected candidate solutions (\({\textbf {x}}_{a}\), \({\textbf {x}}_{b}\), and \({\textbf {x}}_{c}\)) are linearly combined to yield a mutated solution \({\textbf {x}}_{mut}\), as follows:

$$\begin{aligned} {\textbf {x}}_{mut} = {\textbf {x}}_{a} + F \left( {\textbf {x}}_{b} - {\textbf {x}}_{c}\right) \end{aligned}$$
(1)

where F is a scale factor for controlling the differential variation.

The mutated solution is utilized to perturb another candidate solution \({\textbf {x}}_{cur}\) using the binomial crossover operator defined as follows:

$$\begin{aligned} x_{new,j} = {\left\{ \begin{array}{ll} x_{mut,j} & \text {if } r \le Cr \vee j = k \\ x_{cur,j} & \text {otherwise} \end{array}\right. } ; \; j \in \{1, \dots , n\} \end{aligned}$$
(2)

where \(x_{new,j}\), \(x_{mut,j}\) and \(x_{cur,j}\) are the values in the j-th position of \({\textbf {x}}_{new}\), \({\textbf {x}}_{mut}\) and \({\textbf {x}}_{cur}\), respectively, \(r \in [0,1)\) is a uniformly distributed random number drawn for each position j, \(k \in \{1, \dots , n\}\) is a uniformly distributed random index ensuring that at least one value is taken from \({\textbf {x}}_{mut}\), and Cr is the crossover rate.

Finally, \({\textbf {x}}_{new}\) is selected as a member of the new population if it has a better fitness value than that of \({\textbf {x}}_{cur}\).

DE starts with a population of randomly generated candidate solutions whose values are uniformly distributed in the range \([x_{min},x_{max}]\) as follows:

$$\begin{aligned} x_{i,j} = x_{min} + r \left( x_{max} - x_{min} \right) ; i \in \left\{ 1, \dots , NP \right\} \wedge j \in \left\{ 1, \dots , n \right\} \end{aligned}$$
(3)

where NP is the population size.
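The following is a minimal sketch of the classic DE scheme described by Eqs. (1), (2), and (3): rand/1 mutation, binomial crossover, one-to-one selection, and uniform random initialization. The objective function, the parameter values, and the identifiers are illustrative assumptions (a simple numerical function is minimized here); in this study the fitness of the decoded decision tree would be evaluated instead.

```java
import java.util.Random;

/** Minimal sketch of classic DE: rand/1 mutation (Eq. 1), binomial crossover (Eq. 2),
 *  one-to-one selection, and uniform random initialization (Eq. 3). */
public class ClassicDE {
    static final Random RND = new Random();

    /** Placeholder objective (sphere function, minimized); a DT-quality measure would be used instead. */
    static double fitness(double[] x) {
        double s = 0.0;
        for (double v : x) s += v * v;
        return s;
    }

    static double[] optimize(int n, int np, double f, double cr,
                             double xMin, double xMax, int generations) {
        // Eq. (3): initial population uniformly distributed in [xMin, xMax]
        double[][] pop = new double[np][n];
        for (double[] ind : pop)
            for (int j = 0; j < n; j++)
                ind[j] = xMin + RND.nextDouble() * (xMax - xMin);

        for (int g = 0; g < generations; g++) {
            for (int i = 0; i < np; i++) {
                // three mutually distinct individuals, all different from the current one
                int a, b, c;
                do { a = RND.nextInt(np); } while (a == i);
                do { b = RND.nextInt(np); } while (b == i || b == a);
                do { c = RND.nextInt(np); } while (c == i || c == a || c == b);

                double[] cur = pop[i];
                double[] trial = new double[n];
                int k = RND.nextInt(n); // position always taken from the mutated solution
                for (int j = 0; j < n; j++) {
                    double mut = pop[a][j] + f * (pop[b][j] - pop[c][j]);           // Eq. (1)
                    trial[j] = (RND.nextDouble() <= cr || j == k) ? mut : cur[j];   // Eq. (2)
                }
                if (fitness(trial) <= fitness(cur)) pop[i] = trial; // one-to-one selection
            }
        }
        double[] best = pop[0];
        for (double[] ind : pop) if (fitness(ind) < fitness(best)) best = ind;
        return best;
    }

    public static void main(String[] args) {
        double[] best = optimize(10, 50, 0.5, 0.9, -5.0, 5.0, 200);
        System.out.println("best fitness found: " + fitness(best));
    }
}
```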

DE is characterized by using fewer parameters than other EAs and by its spontaneous self-adaptability, diversity control, and continuous improvement [7]. Although DE requires fewer parameters than other EAs, its performance is sensitive to the values selected for Cr, F, and NP [32]. Several approaches exist in the literature to improve DE performance by adjusting the values of its parameters or by combining the advantages of different algorithm variants. Methods that adjust the algorithm parameters can be considered global approaches when the parameters are updated at the end of each generation and their values are shared by all population members [5, 17]. On the other hand, the most successful approaches are those in which each individual uses its own values of the control parameters [27, 30]. Finally, there are other methods in which different mutation or recombination strategies are combined within the evolutionary process [11, 26]. In this study, four adaptive DE versions are used:

  • 1) JADE: This DE variant introduces a successful mutation strategy and an adaptive parameter-control method based on Gaussian and Cauchy distributions [30]. The current-to-pbest mutation, shown in Eq. (4), improves the balance between exploration and exploitation of the search space by selecting an individual (\({\textbf {x}}_{pbest}\)) from the subset of the p best individuals in the population to create a mutant vector (\({\textbf {x}}_{mut}\)). An optional external archive is also used to diversify the donor vectors: \({\textbf {x}}_a\) is chosen randomly from the current population, and \({\textbf {x}}_b\) is selected from the union of the current population and this external archive, which stores solutions previously discarded during the evolutionary process.

    $$\begin{aligned} {\textbf {x}}_{mut} = {\textbf {x}}_{cur} + F_i({\textbf {x}}_{pbest} - {\textbf {x}}_{cur}) + F_i ({\textbf {x}}_a - {\textbf {x}}_b) \end{aligned}$$
    (4)

    JADE uses F and Cr parameter values adjusted for each i-th individual in the population. \(F_i\) is selected from a Cauchy distribution, \(F_i = randc_i (\mu _F , 0.1)\), and \(Cr_i\) is generated using a normal distribution, \(Cr_i = randn_i (\mu _{Cr}, 0.1)\). \(\mu _{Cr}\) is updated at the end of each generation using the arithmetic mean of the set of all successful crossover probabilities. \(\mu _F\) is computed similarly, but using the Lehmer mean of all successful scale factors. A sketch of this parameter-adaptation scheme is shown after this list.

  • 2) SHADE: The Success-History based Adaptive DE (SHADE) is an enhanced JADE version employing a historical memory of pairs of \(\mu _{Cr}\) and \(\mu _{F}\) values [27]. One pair is randomly selected to generate the parameters of each new individual, instead of using the same pair of means throughout a generation.

  • 3) LSHADE: A linear population-size reduction strategy is applied in this SHADE variant [28]. The population decreases linearly in each generation until its size reaches a minimum value.

  • 4) JSO: This LSHADE improvement replaces the \(F_i\) parameter in the second term of Eq. (4) with a weighted \(F_i\) value that is updated as a function of the number of objective-function evaluations [3].
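The following is a minimal sketch of the per-individual parameter sampling and the end-of-generation update described above for JADE, which SHADE, LSHADE, and JSO inherit through their success-history memory. The initial means, the adaptation rate c, and the identifiers are illustrative assumptions.

```java
import java.util.List;
import java.util.Random;

/** Sketch of JADE-style parameter adaptation: F_i drawn from a Cauchy distribution,
 *  Cr_i drawn from a normal distribution, and the means updated from successful values. */
public class JadeParameterAdaptation {
    private final Random rnd = new Random();
    private double muF = 0.5;     // initial location of the Cauchy distribution (assumption)
    private double muCr = 0.5;    // initial mean of the normal distribution (assumption)
    private final double c = 0.1; // adaptation rate (assumption)

    /** F_i = randc(muF, 0.1): regenerated while non-positive and truncated to 1. */
    public double sampleF() {
        double f;
        do {
            f = muF + 0.1 * Math.tan(Math.PI * (rnd.nextDouble() - 0.5)); // Cauchy sample
        } while (f <= 0.0);
        return Math.min(f, 1.0);
    }

    /** Cr_i = randn(muCr, 0.1), truncated to the interval [0, 1]. */
    public double sampleCr() {
        double cr = muCr + 0.1 * rnd.nextGaussian();
        return Math.max(0.0, Math.min(1.0, cr));
    }

    /** End-of-generation update using the F and Cr values of the trial vectors that
     *  replaced their parents: arithmetic mean for Cr, Lehmer mean for F. */
    public void update(List<Double> successfulF, List<Double> successfulCr) {
        if (successfulF.isEmpty() || successfulCr.isEmpty()) return;
        double sumF = 0.0, sumF2 = 0.0, sumCr = 0.0;
        for (double f : successfulF) { sumF += f; sumF2 += f * f; }
        for (double cr : successfulCr) sumCr += cr;
        muF = (1 - c) * muF + c * (sumF2 / sumF);                  // Lehmer mean of successful F
        muCr = (1 - c) * muCr + c * (sumCr / successfulCr.size()); // arithmetic mean of successful Cr
    }
}
```

SHADE keeps several such (\(\mu _F\), \(\mu _{Cr}\)) pairs in a memory and picks one at random per individual; LSHADE additionally shrinks NP linearly, and JSO further weights \(F_i\) according to the fraction of evaluations already spent.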

Both recursive-partitioning and global-search strategies to induce ODTs have been implemented using DE-based approaches. In the first case, the OC1-DE method [25], the Adapted JADE with Multivariate DT (AJADE-MDT) method [12], and the Parallel-Coordinates (PA-DE) algorithm [6] evolve a population of real-valued individuals to find near-optimal hyperplanes. In the second case, two methods implement a global search strategy to find a near-optimal ODT: (1) The Perceptron DT (PDT) method [9, 18], where the hyperplane coefficients of one DT are encoded in a real-valued individual; hyperplane-independent terms and the class labels of leaf nodes are stored in two additional vectors. In each DE iteration, the mutation parameters are randomly altered, and a group of randomly created new DTs replaces the worst individuals in the population. (2) The DE algorithm to build ODTs (DE-ODT) [24], where the size of the real-valued vector is computed from the number of internal nodes of the ODT, which is estimated using the number of dataset attributes.

4 Comparative Study Details

In this study, we use the mapping scheme introduced by the DE-ODT method [24]. DE-ODT evolves a population of ODTs encoded in fixed-length real-valued vectors. Figure 2 shows the scheme for converting a DE individual into an ODT. The steps of this mapping scheme are described in the following paragraphs.

Fig. 2. Mapping scheme used in the DE-ODT method.

  • 1) ODT linear representation: Each candidate solution encodes only the internal nodes of a complete binary ODT, stored in a fixed-length real-valued vector (\({\textbf {x}}_i\)). This vector represents the set of hyperplanes used as the ODT test conditions. The vector size (n) is determined using both the number of features (d) and the number of class labels (s) of the training set, as follows:

    $$\begin{aligned} n=n_e(d+1) \end{aligned}$$
    (5)

    where \(n_e = 2^{\max (H_i,H_l)-1}-1\) is the number of internal nodes, \(H_i = \lceil \log _2(d+1) \rceil \), and \(H_l = \lceil \log _2(s) \rceil \).

  • 2) Hyperplane construction: Vector \({\textbf {x}}_i\) is used to build the vector \({\textbf {w}}_i\) encoding the sequence of candidate internal nodes of a partial ODT. Since the values of \({\textbf {x}}_i\) represent the hyperplane coefficients contained in these nodes, the following criterion applies: values \(\{x_{i,1}, \dots , x_{i,d+1} \} \) are assigned to hyperplane \(h_1\), values \(\{x_{i,d+2}, \dots , x_{i,2d+2} \}\) are assigned to hyperplane \(h_2\), and so on. These hyperplanes are assigned to the elements of \({\textbf {w}}_i\): \(h_1\) is assigned to \(w_{i,1}\), \(h_2\) is assigned to \(w_{i,2}\), and so on, as illustrated in the sketch following this list.

  • 3) Partial ODT construction: A straightforward procedure is applied to construct the partial DT (\(pT_i\)) from \({\textbf {w}}_i\): First, the element in the initial location of \({\textbf {w}}_i\) is used as the root node of \(pT_i\). Next, the remaining elements of \({\textbf {w}}_i\) are inserted into \(pT_i\) as successor nodes of those previously added, so that each level of the tree is completed before placing nodes at the next level, in a way similar to the breadth-first search strategy. Since a hyperplane divides the instances into two subsets, each internal node is assigned two successor nodes.

  • 4) Decision tree completion: In the final stage, several leaf nodes are added to \(pT_i\) by evaluating the training set to build the final ODT \(T_i\). An instance set is assigned to each internal node (starting with the root node), and by evaluating each instance with the hyperplane associated with that node, two instance subsets are created and assigned to its successor nodes. This assignment is repeated for each node in \(pT_i\). Two cases should be considered: (1) If an instance set reaches the end of a branch of \(pT_i\), two leaf nodes are created and designated as its successor nodes. (2) If the instance set contains only elements of the same class, the internal node is relabeled as a leaf node, and its successor nodes are removed if they exist. Furthermore, an internal node is labeled as a leaf when it receives an empty instance subset.
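A minimal sketch of steps 1–3 of this mapping is given below: the vector length computed with Eq. (5), the slicing of \({\textbf {x}}_i\) into blocks of \(d+1\) hyperplane coefficients, and the breadth-first placement of the hyperplanes in a complete binary tree. The identifiers and the array-based tree layout are illustrative assumptions.

```java
/** Sketch of the DE-ODT mapping: an individual of length n_e*(d+1) is sliced into n_e
 *  hyperplanes, which are placed breadth-first in a complete binary tree of internal nodes. */
public class OdtMapping {

    /** Exact ceil(log2(v)) for v >= 1, avoiding floating-point rounding issues. */
    static int ceilLog2(int v) {
        int h = 0;
        while ((1 << h) < v) h++;
        return h;
    }

    /** Number of candidate internal nodes, n_e = 2^(max(Hi, Hl) - 1) - 1 (Eq. 5). */
    static int internalNodes(int d, int s) {
        int hi = ceilLog2(d + 1);                 // Hi = ceil(log2(d + 1))
        int hl = ceilLog2(s);                     // Hl = ceil(log2(s))
        return (1 << (Math.max(hi, hl) - 1)) - 1;
    }

    /** Length of the real-valued individual: n = n_e * (d + 1). */
    static int vectorLength(int d, int s) {
        return internalNodes(d, s) * (d + 1);
    }

    /** Step 2: slice x into hyperplanes, each with d coefficients plus one independent term.
     *  Storing them in this array realizes the breadth-first placement of step 3:
     *  the hyperplane at position k has its successors at positions 2k+1 and 2k+2. */
    static double[][] toHyperplanes(double[] x, int d) {
        int ne = x.length / (d + 1);
        double[][] h = new double[ne][d + 1];
        for (int k = 0; k < ne; k++)
            System.arraycopy(x, k * (d + 1), h[k], 0, d + 1);
        return h;
    }

    public static void main(String[] args) {
        int d = 4, s = 3; // e.g. the Iris dataset: 4 attributes and 3 class labels
        System.out.println("n_e = " + internalNodes(d, s) + ", n = " + vectorLength(d, s));
        // prints: n_e = 3, n = 15
    }
}
```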

As an example of this induction approach, Fig. 3 shows two DTs induced from the well-known Iris dataset: The left DT is created using the J48 method from the Weka library [8], and the right DT is an ODT induced using the LSHADE-based approach. The ODT is more compact and more accurate than the J48 tree.

Fig. 3. DTs for the Iris dataset using J48 (left) and LSHADE (right).

On the other hand, three initialization strategies are analyzed in this work:

  • 1) Random initialization: This is the classic initialization strategy previously described in Eq. (3). This approach generates many hyperplanes that do not divide the instances into two subsets, producing redundant nodes that impact the size and accuracy of the induced model.

  • 2) Dipoles: A dipole is a pair of training instances; a mixed dipole is one in which the two instances have different classes [1]. Dipoles were first used to induce ODTs through a recursive partitioning strategy [15]. To divide the instances into two nonempty subsets, a hyperplane \(h_i\) must split a mixed dipole. Each individual of the initial population is built by creating hyperplanes that split randomly selected mixed dipoles from the training set: \(h_1\) uses a mixed dipole chosen from the training set, \(h_2\) uses a mixed dipole from the first subset created by \(h_1\), \(h_3\) one from the second subset, and so on until the \(n_e\) internal nodes are completed. In particular, a random hyperplane is created if an internal node receives an empty instance subset.

  • 3) Centroids: This initialization strategy is proposed in this work. It is similar to the previous one, but the hyperplanes are created using the centroid of the instance set instead of a random mixed dipole. The centroid is a dummy instance in which each value is the midpoint between the minimum and maximum values of the corresponding attribute in the instance set. Both constructions are sketched below.
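Below is a minimal sketch of the two seeding constructions. The precise way a hyperplane is derived from a dipole or a centroid is not fixed by the description above, so the choices made here (coefficients along the dipole direction with the independent term placed at a random point between the two instances; a random orientation passing through the centroid) are illustrative assumptions.

```java
import java.util.Random;

/** Sketch of the dipole- and centroid-based hyperplane constructions used to seed the
 *  initial population. A hyperplane is stored as d coefficients followed by the
 *  independent term, matching the (d+1)-value blocks of the individuals. */
public class SeedHyperplanes {
    private static final Random RND = new Random();

    /** Hyperplane separating a mixed dipole (a, b): the coefficients follow the direction
     *  a - b, and the independent term places the hyperplane at a random point between the
     *  two instances, so they end up on opposite sides (assumed construction). */
    public static double[] splitDipole(double[] a, double[] b) {
        int d = a.length;
        double[] h = new double[d + 1];
        double delta = RND.nextDouble(); // random cut point on the segment between a and b
        double dot = 0.0;
        for (int j = 0; j < d; j++) {
            h[j] = a[j] - b[j];
            dot += h[j] * (a[j] + delta * (b[j] - a[j]));
        }
        h[d] = -dot; // with this term, h evaluates positive on a and negative on b
        return h;
    }

    /** Hyperplane with random orientation passing through the centroid of an instance
     *  subset, where the centroid is the per-attribute midpoint of minimum and maximum. */
    public static double[] throughCentroid(double[][] instances) {
        int d = instances[0].length;
        double[] centroid = new double[d];
        for (int j = 0; j < d; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] x : instances) {
                min = Math.min(min, x[j]);
                max = Math.max(max, x[j]);
            }
            centroid[j] = (min + max) / 2.0;
        }
        double[] h = new double[d + 1];
        double dot = 0.0;
        for (int j = 0; j < d; j++) {
            h[j] = RND.nextGaussian(); // random orientation
            dot += h[j] * centroid[j];
        }
        h[d] = -dot; // the hyperplane passes through the centroid
        return h;
    }
}
```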

5 Experiments and Results

In this section, the experimental study conducted to analyze and compare the performance of the algorithms is presented. First, the datasets used in this study and the validation technique applied in the comparative analysis are described. Next, the experimental results and the statistical tests are outlined. Finally, a discussion of the results is provided. Thirteen datasets (shown in Table 1) from the UCI machine learning repository are used in the experimental study [14]. These datasets have only numerical attributes, since ODT internal nodes apply a linear combination of attribute values.

Table 1. Datasets description.
Table 2. Average accuracy of DE variants using random initialization.
Table 3. Average accuracy of DE variants using dipole-based initialization.

The methods used in this study are implemented in the Java language, using the JMetal [20] and Weka [8] libraries. The implemented algorithms are run 30 times for each dataset and initialization scheme, using the ten-fold cross-validation sampling strategy to estimate the accuracy and size of the induced trees. Subsequently, the Friedman test is applied for the statistical analysis of the results. The Friedman test is selected since it has been demonstrated that the conditions required to apply parametric tests are not satisfied by EA-based machine learning methods [10]. In the subsequent tables of this section, the best result for each dataset is highlighted in bold, and the numbers in parentheses indicate the ranking reached by each method for each dataset. The last row in these tables indicates the average ranking of each method.
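For reference, a common formulation of the Friedman statistic over N datasets and k methods, with \(R_j\) the average rank of the j-th method (the value reported in the last row of each table), is the following; larger values provide stronger evidence against the hypothesis that all methods perform equivalently:

$$\begin{aligned} \chi ^2_F = \frac{12N}{k(k+1)} \left[ \sum _{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right] \end{aligned}$$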

Tables 2, 3, and 4 show the average accuracy, and Tables 5, 6, and 7 show the average size (number of leaf nodes), of the trees obtained for each dataset and each DE variant.

Table 4. Average accuracy of DE variants using centroid-based initialization.
Table 5. Average size of DE variants using random initialization.
Table 6. Average size of DE variants using dipole-based initialization.
Table 7. Average size of DE variants using centroid-based initialization.

From Tables 2, 3, and 4, it is observed that the methods achieving the best accuracy are the standard DE using random initialization, JSO with dipole-based initialization, and LSHADE with the centroid-based start strategy. On the other hand, Tables 5, 6, and 7 show that the LSHADE method produces the most compact ODTs under the three start strategies.

Table 8. Accuracy for the best methods for each initialization strategy.
Table 9. Tree size for the best methods for each initialization strategy.

Tables 8 and 9 show the best results obtained by the DE versions for accuracy and tree size, respectively.

From the results in Table 8, the Friedman test yields a p-value of 0.5836, indicating that the three methods behave similarly; nevertheless, it is observed that, on average, the JSO algorithm obtains ODTs with better accuracy than the other approaches. Table 8 also shows that JSO with dipoles finds more accurate ODTs for datasets with two or three classes (Parkinsons, Diabetes, Australian, Heart-statlog, Iris), and LSHADE with centroids finds more accurate ODTs for those with two classes (Ionosphere, Liver-disorders) or with more than three classes (Glass, Ecoli). On the other hand, standard DE with random initialization induces better ODTs for imbalanced datasets: Balance-scale has three classes with 288, 288, and 49 instances, respectively, and the Wine dataset has three classes with 59, 71, and 48 instances, respectively. Breast-tissue-6 has six classes with 22, 21, 14, 15, 16, and 18 instances, and only Seeds is a balanced dataset with 70 instances per class.

The Friedman test applied to the results in Table 9 yields a p-value of 0.926, indicating that the three initialization strategies build ODTs of similar sizes. The centroid-based initialization scheme produces the smallest ODTs for datasets with two classes, and the random initialization creates smaller ODTs for multiclass datasets (Wine, Breast-tissue-6, Glass, and Ecoli).

6 Conclusions and Future Work

The experimental results indicate that it is important to continue studying the application of adaptive DE variants to solve non-numerical optimization problems, since it is necessary to analyze in detail the characteristics of the search space and the strategies for mapping between DE individuals and their tree-like representation. LSHADE and JSO obtained slightly better results than the other approaches, which is to be expected since they are improved versions of the SHADE and JADE methods. In future work, we will integrate into the decision tree induction process a tree-pruning strategy and a more effective method to remove redundant nodes.