1 Introduction

Nowadays, data mining has emerged as a very useful tool for extracting knowledge from many sources of information, including text files, large data repositories, and data in the cloud. Machine learning provides data mining with useful procedures for constructing models from data. The most representative machine learning approaches are supervised learning, in which a model is learned from labeled data, and unsupervised learning, where a model is obtained from unlabeled data. The main supervised techniques are classification and regression, while clustering is the main unsupervised technique. In particular, decision trees are a classification method that produces simple models with a high level of interpretability.

Oblique DTs are classifiers that split the data using hyperplanes that are oblique to the axes of the instance space. Oblique DTs are generally more compact and accurate than univariate DTs. In this paper, the application of a differential-evolution-based approach named OC1-DE for inducing oblique decision trees within a recursive partitioning strategy is described. The experimental results show that OC1-DE induces more accurate classifiers than those produced by other induction methods.

In order to describe the implementation of the OC1-DE method, this paper is organized as follows: Sect. 2 describes the characteristics of the decision tree induction process for classification tasks. The use of metaheuristics for inducing oblique decision trees is described in Sect. 3, and details of the differential evolution algorithm are given in Sect. 4. The OC1-DE method is outlined in Sect. 5, and experimental results are presented in Sect. 6. Finally, Sect. 7 presents the conclusions and future work of this proposal.

2 Induction of Decision Trees for Classification

A decision tree (DT) is a classification model constructed from a set of training instances in order to define a simple and effective procedure for predicting the class membership of new, unclassified instances. A DT is a connected acyclic graph whose nodes contain either test conditions (internal nodes) or class labels (leaf nodes) and whose arcs represent the possible outcomes of each test condition. This graph defines a hierarchical evaluation scheme that stands out for its simplicity and its high level of interpretability. These characteristics, along with its predictive power, make the DT one of the most widely used classifiers. It should be noted that two methods for decision tree induction (CART [4] and C4.5 [36]) are considered among the top ten most influential data mining algorithms [46].

The great majority of algorithms for inducing decision trees apply a recursive partitioning strategy that implements some splitting criterion in order to separate the training instances. This partitioning procedure is usually complemented with a pruning mechanism to avoid overfitting and to improve the performance of the classifier. Other strategies, such as a global search in the space of DTs, have also been utilized with the aim of reaching an optimal classifier [19, 29, 44], although it is known that building optimal DTs is NP-hard [18].

According to the number of attributes used in the test conditions of DT internal nodes, two types of DTs can be induced: univariate and multivariate DTs. ID3 [34], C4.5 [36], J48 [45] and C5.0 are methods for inducing univariate DTs, in which a single attribute is evaluated to split the training set. An advantage of univariate DTs is that they are easily interpretable; however, when the instances cannot be separated by axis-parallel hyperplanes, the induced DTs are more complex, as they contain a larger number of internal nodes [37].

On the other hand, a combination of attributes participates in the test condition at each internal node of a multivariate DT. When a linear combination of attributes is used, the induced classifier is known as an oblique DT [31] or perceptron DT [2, 29]. The hyperplanes splitting the training instances in this type of DT have an oblique orientation relative to the axes of the instance space. Oblique DTs are generally smaller and more accurate than univariate DTs, but they are also generally more difficult to interpret [5]. In addition, while finding the best partition using a single attribute requires little computational effort, determining the best partition using an oblique hyperplane is an NP-hard problem [16]. CART [4], OC1 [31], LMDT [41] and LTree [10] are methods for inducing oblique DTs.

The training set used for inducing oblique DTs is a group of instances, where each instance is represented by a real-valued vector \(\mathbf v =\left( v_1, v_2, \dots , v_d,c \right) ^T\) composed of d attribute values and a label c indicating the class membership of the instance. A hyperplane divides the instance space into two half-spaces and is defined as follows:

$$\begin{aligned} \sum _{i=1}^d x_iv_i + x_{d+1} > 0 \end{aligned}$$
(1)

where \(v_i\) is the value of attribute i, \(x_i\) is a real-valued coefficient in the hyperplane and \(x_{d+1}\) represents the independent term.
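
For illustration, the following Java sketch (not part of the original implementation) evaluates the test condition of Eq. (1) to decide into which half-space an instance falls:

```java
/**
 * Minimal sketch: evaluates the oblique test condition of Eq. (1).
 * x holds the d hyperplane coefficients plus the independent term x_{d+1};
 * v holds the d attribute values of one instance.
 */
public final class ObliqueSplit {
    public static boolean aboveHyperplane(double[] x, double[] v) {
        double sum = x[v.length];            // independent term x_{d+1}
        for (int i = 0; i < v.length; i++) {
            sum += x[i] * v[i];              // accumulate x_i * v_i
        }
        return sum > 0;                      // Eq. (1) holds: instance lies above the hyperplane
    }
}
```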

3 Metaheuristics for Inducing Oblique Decision Trees

Metaheuristics are methods that combine local search procedures with higher-level strategies to perform a robust search of a solution space [14]. They have been used for inducing both univariate and multivariate DTs. Metaheuristics can be classified into three categories: single-solution-based procedures, evolutionary algorithms, and swarm intelligence methods.

Although global search is the strategy most commonly used when metaheuristic-based methods are applied to induce DTs, several proposals that use metaheuristics within a recursive partitioning strategy have been developed. Single-solution-based metaheuristics such as stochastic local search (SLS) [31], simulated annealing (SA) [5, 17] and tabu search (TS) [26, 32], as well as evolutionary algorithms such as genetic algorithms (GA) [5, 6, 20, 39] and evolution strategies (ES) [5, 47], have been used for finding good hyperplanes of an oblique DT.

The Simulated Annealing of Decision Trees (SADT) method [17] applies SA to find a near-optimal hyperplane for each internal node of an oblique DT. Starting from a fixed initial hyperplane, SADT iteratively perturbs one coefficient of this hyperplane, and each new value is accepted as a coefficient according to the Boltzmann criterion. The OC1 method [31] implements a two-step process for finding a better hyperplane: first, it uses a deterministic strategy that adjusts the coefficients of a hyperplane individually, finding a locally optimal value for one coefficient at a time. Once this value is reached, OC1 applies SLS in order to escape this local optimum. OC1-AP [30] is an OC1 variant that induces only univariate DTs. OC1 variants such as OC1-SA, OC1-ES and OC1-GA [5] simultaneously modify several coefficients of the best hyperplane produced by OC1-AP, using SA, ES and GA, respectively. Furthermore, TS has been used in two approaches: (1) in conjunction with linear discriminant analysis in the LDTS method [26], and (2) in combination with a discrete support vector machine in the LDSDT\({_{TS}}\) method [32].

Other works have used evolutionary algorithms for inducing oblique DTs: a multi-membered ES is applied for finding near-optimal hyperplanes in the MESODT method [47], and GA is implemented in two proposals: (1) the BTGA method [6], which uses a binary chromosome to represent the hyperplane coefficients, and (2) a dipole-based method [20] that encodes a chromosome by means of a real-valued vector. A special GA known as the HereBoy algorithm [24], which evolves a single chromosome, is used in the HBDT method [39] for inducing an oblique DT.

On the other hand, GA [9, 15, 21, 33, 43], differential evolution [29] and genetic programming (GP) [1, 3, 28] have been used for inducing oblique DTs in a global search strategy.

4 Differential Evolution Algorithm

Differential evolution (DE) is an efficient evolutionary algorithm designed for solving optimization problems in continuous spaces [38]. DE encodes candidate solutions by means of real-valued vectors and applies vector differences to perturb a population of these solutions. At the beginning, an initial population of candidate solutions is generated; the iterative part of DE then follows. At each iteration, a new population is constructed using mutation, crossover and selection operators. Instead of implementing traditional crossover and mutation operators, DE applies a linear combination of several randomly selected candidate solutions to produce a new solution. Finally, when the stop condition is fulfilled, DE returns the best candidate solution in the current population. An advantage of DE is that it utilizes few control parameters: a crossover rate Cr, a mutation scale factor F, and a population size NP.

DE/A/B/C is the notation commonly adopted for identifying DE variants, where A represents the procedure for selecting the candidate solutions that will be used to construct a new solution, B is the number of difference vectors used by the mutation operator, and C identifies the type of crossover operator [7]. The most popular DE variant is DE/rand/1/bin, which combines three randomly selected candidate solutions (\(\mathbf x _{a}\), \(\mathbf x _{b}\) and \(\mathbf x _{c}\)) using Eq. (2) to yield a mutated solution \(\mathbf x _{mut}\).

$$\begin{aligned} \mathbf x _{mut} = \mathbf x _{a} + F \left( \mathbf x _{b} - \mathbf x _{c}\right) \end{aligned}$$
(2)

The mutated solution is used to perturb the values of another candidate solution \(\mathbf x _{cur}\) by means of the binomial crossover operator, as follows:

$$\begin{aligned} x_{new,j} = {\left\{ \begin{array}{ll} x_{mut,j} &{} \text {if } r \le Cr \vee j = k \\ x_{cur,j} &{} \text {otherwise} \end{array}\right. } ; j \in \{1, \dots , d\} \end{aligned}$$
(3)

where \(x_{new,j}\), \(x_{mut,j}\) and \(x_{cur,j}\) are the values at the j-th position of \(\mathbf x _{new}\), \(\mathbf x _{mut}\) and \(\mathbf x _{cur}\), respectively, \(r \in [0,1)\) is a uniformly distributed random number drawn for each position j, and \(k \in \{1, \dots , d\}\) is a uniformly chosen random index that guarantees that at least one position of \(\mathbf x _{new}\) is taken from \(\mathbf x _{mut}\).

Finally, \(\mathbf x _{new}\) is selected as a member of the new population if it has a better fitness value than that of \(\mathbf x _{cur}\).
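
The following Java sketch illustrates one generation of DE/rand/1/bin under the definitions above; it is a minimal illustration rather than the implementation used in this paper, and the fitness function (assumed to be maximized) is supplied by the caller:

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** One generation of DE/rand/1/bin (Eqs. 2 and 3); assumes NP >= 4. */
public final class DERand1Bin {
    private final Random rng = new Random();

    public double[][] nextGeneration(double[][] pop, double F, double Cr,
                                     ToDoubleFunction<double[]> fitness) {
        int np = pop.length;
        int d = pop[0].length;
        double[][] next = new double[np][];
        for (int cur = 0; cur < np; cur++) {
            // select three mutually distinct solutions, all different from x_cur
            int a, b, c;
            do { a = rng.nextInt(np); } while (a == cur);
            do { b = rng.nextInt(np); } while (b == cur || b == a);
            do { c = rng.nextInt(np); } while (c == cur || c == a || c == b);

            int k = rng.nextInt(d);              // position always taken from x_mut
            double[] xNew = new double[d];
            for (int j = 0; j < d; j++) {
                double xMutJ = pop[a][j] + F * (pop[b][j] - pop[c][j]);             // Eq. (2)
                xNew[j] = (rng.nextDouble() <= Cr || j == k) ? xMutJ : pop[cur][j]; // Eq. (3)
            }
            // greedy selection: x_new replaces x_cur only if it has a better fitness
            next[cur] = fitness.applyAsDouble(xNew) > fitness.applyAsDouble(pop[cur])
                    ? xNew : pop[cur];
        }
        return next;
    }
}
```

Note that computing the mutated value position by position, as above, is equivalent to building \(\mathbf x _{mut}\) first and then applying Eq. (3).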

DE has been used in conjunction with support vector machines [25], neural networks [23], Bayesian classifiers [13] and instance-based classifiers [11] to implement classification methods. It has mainly been used for optimizing the parameters of classification methods or for conducting preprocessing tasks in a data mining process.

In the case of DTs, DE has been applied both to finding the parameter settings of a classification method and to participating in the decision tree induction process. In the DEMO algorithm [40], DE finds the J48 parameters in order to yield accurate and small DTs. On the other hand, the TreeDE method [42] induces univariate DTs by mapping full trees to vectors and representing discrete symbols as points in a real-valued vector space. In an improved version of the TreeDE method [22], each candidate solution has two types of genes: one expresses the neighborhood structure between the symbols and the other represents a full tree structure. Finally, in a method for inducing perceptron DTs [29], the coefficients of all possible hyperplanes of an oblique DT with a fixed size are encoded in a matrix, and DE then evolves the matrix values in order to construct optimized oblique DTs.

5 OC1-DE Method for Inducing Oblique Decision Trees

In this work, a method for inducing an oblique DT using DE/rand/1/bin within a recursive partitioning strategy is introduced. This method, named OC1-DE, is similar to OC1 and its variants, but it applies DE to find a near-optimal hyperplane at each internal node of an oblique DT. Since the task of finding a near-optimal hyperplane with real-valued coefficients is an optimization problem in a continuous space, the DE operators can be applied without any modification, and OC1-DE should therefore induce better oblique DTs.

When a recursive partitioning strategy is implemented for inducing oblique DTs, the classification method starts by finding the hyperplane that best splits an instance set into two subsets. This hyperplane is used as the test condition of a new internal node added to the DT. The procedure is recursively applied to each subset until a leaf node is created, either because all instances in the subset have the same class label or because a threshold value of unclassified instances is reached. The quality of a hyperplane is obtained using a splitting criterion that measures the impurity of the partition or some other discriminant value. Finally, a pruning procedure is applied in order to reduce the overfitting of the produced DT and to improve its predictive power.
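
This strategy can be outlined with the following Java sketch, which reuses the ObliqueSplit sketch of Sect. 2; the stopping threshold MIN_INSTANCES is an assumed value, the HyperplaneFinder interface stands in for the DE search of Algorithm 1, and pruning is omitted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative outline of the recursive partitioning strategy (pruning omitted). */
public final class RecursiveInduction {
    record Instance(double[] v, int label) {}

    interface HyperplaneFinder {          // stands in for the DE search of Sect. 5
        double[] find(List<Instance> set);
    }

    static final class Node {
        double[] hyperplane;              // test condition (internal node)
        Node above, below;                // children (internal node)
        Integer label;                    // class label (leaf node)
    }

    static final int MIN_INSTANCES = 5;   // assumed threshold, not taken from the paper

    static Node induce(List<Instance> set, HyperplaneFinder finder) {
        Node node = new Node();
        if (sameClass(set) || set.size() < MIN_INSTANCES) {
            node.label = majorityClass(set);                  // create a leaf node
            return node;
        }
        double[] h = finder.find(set);                        // best split found by DE
        List<Instance> above = new ArrayList<>(), below = new ArrayList<>();
        for (Instance ins : set) {
            (ObliqueSplit.aboveHyperplane(h, ins.v()) ? above : below).add(ins);
        }
        if (above.isEmpty() || below.isEmpty()) {             // degenerate split: stop
            node.label = majorityClass(set);
            return node;
        }
        node.hyperplane = h;
        node.above = induce(above, finder);                   // recurse on each half-space
        node.below = induce(below, finder);
        return node;
    }

    static boolean sameClass(List<Instance> set) {
        return set.stream().map(Instance::label).distinct().count() <= 1;
    }

    static int majorityClass(List<Instance> set) {
        Map<Integer, Long> counts = set.stream()
            .collect(Collectors.groupingBy(Instance::label, Collectors.counting()));
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue()).map(Map.Entry::getKey).orElse(-1);
    }
}
```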

The DE implementation for finding a near-optimal hyperplane at each internal node of an oblique DT is shown in Algorithm 1. In a similar fashion to the OC1-GA method [5], the axis-parallel hyperplane that best splits the set of training instances is obtained first (line 2). This hyperplane is copied to \(10\%\) of the initial population, and the remaining hyperplanes are randomly created (line 3). Each random hyperplane is constructed so that at least two instances with different class labels are separated by it. This population is evolved through several generations using the DE operators (lines 4–13), and the best hyperplane in the population is selected (line 14). The algorithm returns the better of the best axis-parallel hyperplane and the best oblique hyperplane produced by DE (line 15).

Algorithm 1. Finding a near-optimal hyperplane using DE.
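
As a hedged sketch of Algorithm 1 (reusing the DERand1Bin class of Sect. 4), the search at one internal node can be expressed as follows; the fitness function represents the splitting criterion evaluated on the current training subset, the best axis-parallel hyperplane is assumed to be supplied by the caller (as obtained by an OC1-AP-style search), and the constraint that each random hyperplane separates at least two instances of different classes is noted in a comment but omitted for brevity:

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/**
 * Hedged sketch of Algorithm 1 for a single internal node; the interface
 * (supplied axis-parallel hyperplane and fitness function) is an assumption.
 */
public final class HyperplaneSearch {
    public static double[] find(double[] axisParallel, int np, int generations,
                                double F, double Cr, ToDoubleFunction<double[]> fitness) {
        Random rng = new Random();
        int len = axisParallel.length;            // d coefficients plus independent term
        double[][] pop = new double[np][];
        for (int i = 0; i < np; i++) {            // line 3: build the initial population
            if (i < np / 10) {
                pop[i] = axisParallel.clone();    // 10% copies of the axis-parallel hyperplane
            } else {
                pop[i] = new double[len];         // random hyperplane; the paper additionally
                for (int j = 0; j < len; j++) {   // requires it to separate at least two
                    pop[i][j] = -1.0 + 2.0 * rng.nextDouble(); // instances of different classes
                }
            }
        }
        DERand1Bin de = new DERand1Bin();
        for (int g = 0; g < generations; g++) {   // lines 4-13: evolve with the DE operators
            pop = de.nextGeneration(pop, F, Cr, fitness);
        }
        double[] bestOblique = pop[0];            // line 14: best hyperplane in the population
        for (double[] x : pop) {
            if (fitness.applyAsDouble(x) > fitness.applyAsDouble(bestOblique)) {
                bestOblique = x;
            }
        }
        // line 15: return the better of the axis-parallel and DE-produced hyperplanes
        return fitness.applyAsDouble(axisParallel) >= fitness.applyAsDouble(bestOblique)
                ? axisParallel : bestOblique;
    }
}
```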

6 Experiments

In order to evaluate the performance of OC1-DE and to compare it with other approaches, two experiments were conducted using several datasets with numerical attributes chosen from the UCI repository [27]. Table 1 shows the description of the datasets used in the experiments. Results were obtained by applying repeated k-fold stratified cross-validation (CV). OC1-DE is compared with the OC1 method [31], three OC1 variants proposed in [5], and the HBDT method described in [39].

The DE version, implemented in the Java language, is the canonical DE/rand/1/bin. The parameters used in the experiments are described in Table 2. The population size is adjusted in accordance with [5], and the number of DE iterations is kept small in order to reduce the running time of the experiments, since DE is applied at each internal node of the induced oblique DT.

Table 1. Description of datasets used in the experiments.
Table 2. Parameters used in experiments with OC1-DE.

In the first experiment, OC1-DE is compared with the OC1 variants applying 5-fold CV. Ten independent runs are conducted for each dataset. For each run, an oblique DT is induced and its test accuracy is calculated. Both the average accuracy and the average DT size across these 10 runs are obtained. Table 3 shows the experimental results: columns 2–9 show both the average accuracy and the average DT size of the induced oblique DTs reported by [5]. The results obtained by OC1-DE are shown in columns 10–11 of this table. It can be observed that the accuracies obtained by OC1-DE are better than those reported for the OC1 variants, although in general OC1-DE produces oblique DTs with more leaf nodes than those produced by the OC1 variants. Fig. 1(a) shows a comparative plot of the average accuracies obtained.

In the second experiment, OC1-DE is compared with the HBDT method. Five independent runs of 10-fold CV are conducted. Results are shown in Table 4: columns 2–5 show both the average accuracy and the average DT size of the induced oblique DTs reported by [39]. The results obtained by OC1-DE are shown in columns 6–7 of this table. It can be observed that both the accuracies and the DT sizes obtained by OC1-DE are comparable with those reported for the HBDT method. OC1 induces oblique DTs with better accuracies for 2 datasets, HBDT constructs DTs with better accuracies for 5 datasets, and the accuracies of the DTs induced using OC1-DE are better than those reported by [39] for 8 datasets. Fig. 1(b) shows a comparative plot of the average accuracies obtained.

Table 3. Accuracy and size obtained for OC1 variants and OC1-DE.
Table 4. Comparison of accuracy and size reported of HBDT and OC1-DE.

In order to evaluate the performance of OC1-DE, a statistical analysis of the results was carried out. Since the conditions for using parametric tests are not guaranteed when analyzing the results of evolutionary algorithms [12], the Friedman test is applied for detecting the existence of significant differences between the performance of two or more methods, and the Nemenyi post-hoc test is used for checking these differences. The Nemenyi test uses the average rank of each classifier and checks, for each pair of classifiers, whether the difference between their ranks is greater than the critical difference [8], defined as follows:

$$\begin{aligned} CD = q_{\alpha } \sqrt{\frac{k(k+1)}{6N}} \end{aligned}$$
(4)

where CD is the critical difference, k is the number of methods, N is the number of datasets, and \(q_{\alpha }\) is a critical value associated with the significance level \(\alpha \).
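
For illustration, assuming the customary significance level \(\alpha = 0.05\) (for which \(q_{0.05} \approx 2.728\) when \(k = 5\) [8]), the critical difference in the first experiment (\(k = 5\) methods, \(N = 7\) datasets) would be:

$$\begin{aligned} CD = 2.728 \sqrt{\frac{5(5+1)}{6 \cdot 7}} = 2.728 \sqrt{\frac{30}{42}} \approx 2.31 \end{aligned}$$

Two methods are then considered significantly different only when the difference between their average ranks exceeds this value.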

Fig. 1. Average accuracies obtained in experiments.

Fig. 2. Comparison of classifiers using Nemenyi post-hoc test.

For the first experiment, the Friedman statistic for 5 methods over 7 datasets is 15.507, and the p-value obtained is 0.003757 for 4 degrees of freedom (dof) of the chi-square distribution. This p-value indicates the existence of statistical differences between these methods, and the Nemenyi post-hoc test is therefore executed. Fig. 2(a) shows the critical differences obtained with the Nemenyi test; it shows that OC1-DE is the best method when compared with the OC1 variants. For the second experiment, the Friedman statistic for 3 methods over 15 datasets is 6.8136 and the p-value is 0.03315 for 2 dof, which also indicates statistical differences between these methods. Fig. 2(b) shows that OC1-DE has better performance than OC1 and a performance comparable with HBDT.

7 Conclusions

In this paper, the OC1-DE method, which uses the differential evolution algorithm to find a near-optimal hyperplane with real-valued coefficients at each internal node of an oblique DT, is introduced. OC1-DE is a recursive partitioning scheme based on OC1 in which the DE operators can be applied without any modification. OC1-DE was evaluated using a stratified cross-validation procedure with several UCI datasets, and statistical tests suggest that OC1-DE achieves better performance than the OC1 variants and the HBDT method. In general, OC1-DE induces more accurate oblique DTs, although their sizes are overall larger than those produced by other methods. Based on our results, future work will be oriented to evaluating other DE variants for inducing oblique DTs and to investigating the effect of several parameter configurations on OC1-DE performance. More experiments will also be conducted to analyze the OC1-DE execution time, as well as to compare the performance of OC1-DE with that obtained by other classification methods such as random forests and support vector machines.