1 Introduction

Decision trees are widely used as classifiers [5, 6, 10], as a means of knowledge representation [4, 7], and as algorithms [20, 25]. In this paper, we investigate decision trees as a means of knowledge representation.

Let us consider a decision tree \(\varGamma \) for a decision table D. We investigate three parameters of \(\varGamma \):

  • \(N(\varGamma )\) – the number of vertices in \(\varGamma \).

  • \(G(D,\varGamma )\) – the global misclassification rate [7], which is equal to the number of misclassifications of \(\varGamma \) divided by the number of rows in D.

  • \(L(D,\varGamma )\) – the local misclassification rate [7], which is the maximum fraction of misclassifications among all leaves of \(\varGamma \). Note that \(G(D,\varGamma )\) never exceeds \(L(D,\varGamma )\): G is the row-weighted average of the per-leaf misclassification fractions, so it is bounded above by their maximum (see the sketch below).
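
To make these definitions concrete, here is a minimal sketch (with hypothetical leaf counts, not taken from the paper) that computes both rates from per-leaf statistics; it also illustrates the inequality above, since G is the row-weighted mean of the per-leaf fractions.

```python
# A minimal sketch, not the authors' implementation. Each leaf of a tree
# Gamma is summarized by a pair (rows, errors): the number of rows of D
# reaching that leaf and the number of those rows it misclassifies.

def global_rate(leaves):
    """G(D, Gamma): total misclassifications divided by the number of rows in D."""
    total_rows = sum(rows for rows, _ in leaves)
    total_errors = sum(errors for _, errors in leaves)
    return total_errors / total_rows

def local_rate(leaves):
    """L(D, Gamma): maximum fraction of misclassifications over all leaves."""
    return max(errors / rows for rows, errors in leaves)

# Hypothetical example: three leaves covering 50, 30, and 20 rows.
leaves = [(50, 2), (30, 3), (20, 5)]
print(global_rate(leaves))  # 10/100 = 0.10
print(local_rate(leaves))   # 5/20  = 0.25 >= G, since G is a weighted mean
```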

The decision tree \(\varGamma \) should have a reasonable number of vertices to be understandable, and to properly represent the knowledge contained in the decision table D, it should have acceptable accuracy. In [7], we noted that considering only the global misclassification rate may be insufficient: the misclassifications may be unevenly distributed, and for some leaves the fraction of misclassifications can be high. To handle this situation, we should also consider the local misclassification rate.

The optimization of decision tree parameters has been studied by many researchers [9, 11,12,13, 16,17,18,19, 24, 26]. One direction of this research is bi-objective optimization [1,2,3,4,5,6,7,8]. In [7], we proposed three techniques for building decision trees based on bi-objective optimization of trees and studied the parameters N, G, and L of the constructed trees. Unfortunately, these techniques are applicable only to medium-sized decision tables with categorical features, and sometimes the number of vertices in the trees is too high. In particular, the decision tree \(\varGamma _1\) with the minimum number of vertices constructed by these techniques for the decision table D = nursery from the UCI Machine Learning Repository [15] has the following parameters: \(N(\varGamma _1) = 70\), \(G(D,\varGamma _1) = 0.10\), and \(L(D, \varGamma _1) = 0.23\).

In this paper, instead of conventional decision trees, we study CART-like (CART-L) decision trees introduced in the books [1, 2]. Like standard CART trees [10], CART-L trees use binary splits instead of the original features. However, while a standard CART tree uses in each internal vertex the best split among all features, a CART-L tree can use in each internal vertex the best split for an arbitrary feature. This essentially extends the set of decision trees under consideration. In [1, 2], we applied the Gini index to define the notion of the best split. In this paper, we use another parameter, abs [2].
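
The parameter abs is defined in [2] and not reproduced here. As an illustration of how a best binary split for a single numerical feature can be chosen, the following sketch scores candidate thresholds by the size-weighted Gini index used in [1, 2]; the function names and data layout are our own assumptions, and the abs parameter would be a drop-in replacement for the impurity measure.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a multiset of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Best binary split 'value <= t' for one numerical feature,
    scored by the size-weighted Gini index of the two parts."""
    n = len(values)
    best = (float("inf"), None)
    for t in sorted(set(values))[:-1]:  # candidate thresholds
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = min(best, (score, t))
    return best  # (weighted impurity, threshold)

# Hypothetical example: the split 'value <= 2' separates the classes exactly.
print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))  # (0.0, 2)
```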

We design two techniques that build decision trees for medium-sized tables (at most 10,000 rows and at most 20 features) containing both categorical and numerical features. These techniques are based on bi-objective optimization of CART-L decision trees for the parameters N and G [1] and for the parameters N and L. Both techniques construct decision trees with at most 19 vertices (at most 10 leaves and at most nine internal vertices). The choice of 19 is not arbitrary: trees with such a small number of internal vertices remain understandable and are therefore useful from the point of view of knowledge representation; this choice is also supported by experimental results published in [1]. One technique (the G-19 technique) was proposed in [1]; the other (the L-19 technique) is completely new. We apply both techniques to 14 data sets from the UCI Machine Learning Repository [15] and study the three parameters N, G, and L of the constructed trees. For example, for the decision table D = nursery, the L-19 technique constructs a decision tree \(\varGamma _2\) with \(N(\varGamma _2) = 17\), \(G(D,\varGamma _2) = 0.12\), and \(L(D,\varGamma _2) = 0.22\).

The obtained results show that at least one of the considered techniques (the L-19 technique) can be useful for extracting knowledge from medium-sized decision tables and for representing it by decision trees. This technique can be used in different areas of data analysis, including rough set theory [14, 21,22,23, 27], where decision rules are used extensively: we can easily derive decision rules from the constructed decision trees and use them in rough set applications.
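
As an illustration of this derivation, the following sketch (with an assumed nested-tuple tree encoding, not the paper's data structures) collects one decision rule per leaf by walking all root-to-leaf paths of a tree with binary splits.

```python
def tree_to_rules(node, conditions=()):
    """Collect one decision rule per leaf: each root-to-leaf path yields
    'condition_1 and ... and condition_k -> decision'.
    A node is either a leaf ('leaf', decision) or an internal vertex
    ('split', condition, left_subtree, right_subtree)."""
    if node[0] == "leaf":
        return [(conditions, node[1])]
    _, cond, left, right = node
    return (tree_to_rules(left, conditions + (cond + " holds",)) +
            tree_to_rules(right, conditions + (cond + " fails",)))

# Hypothetical CART-L tree with binary splits in its internal vertices:
tree = ("split", "age <= 30",
        ("leaf", "class A"),
        ("split", "income <= 50000",
         ("leaf", "class B"),
         ("leaf", "class A")))
for conds, decision in tree_to_rules(tree):
    print(" and ".join(conds), "->", decision)
```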

The remainder of the paper is organized as follows. Section 2 explains the two techniques for decision tree construction, Sect. 3 presents the results of the experiments, and Sect. 4 contains brief conclusions.

Fig. 1. Sets of Pareto optimal points for the tables breast-cancer, nursery, and tic-tac-toe for the pairs of parameters N, G and N, L

2 Two Techniques for Decision Tree Construction

In the books [1, 2], an algorithm \(\mathcal {A}_\mathrm{{POPs}}\) is described. Given a decision table, this algorithm builds the Pareto front – the set of all Pareto optimal points (POPs) – for the bi-objective optimization of CART-L trees relative to N and G (see, for example, Fig. 1(a), (c), (e)). We extend this algorithm to build the Pareto front for the parameters N and L (see, for example, Fig. 1(b), (d), (f)). For each POP, we can derive a decision tree whose values of the considered parameters are equal to the coordinates of this point. Both the algorithm \(\mathcal {A}_\mathrm{{POPs}}\) and its extension have exponential time complexity in the worst case. We now describe two techniques for decision tree construction based on the operation of the algorithm \(\mathcal {A}_\mathrm{{POPs}}\) and its extension; the time complexity of these techniques is also exponential in the worst case.
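
The algorithm \(\mathcal {A}_\mathrm{{POPs}}\) itself is described in [1, 2]. To illustrate what its output is, the following sketch (our own illustration, not the algorithm from [1, 2]) filters the Pareto optimal points from a set of candidate pairs, where both coordinates – the number of vertices and a misclassification rate – are to be minimized.

```python
def pareto_front(points):
    """Pareto optimal points for bi-objective minimization: keep a point p
    if no other point q is <= p in both coordinates and differs from p
    (and hence is strictly better in at least one coordinate)."""
    return sorted(p for p in points
                  if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                             for q in points))

# Hypothetical candidate (N, G) pairs; dominated points are removed.
points = [(5, 0.30), (7, 0.22), (9, 0.22), (11, 0.15), (13, 0.18)]
print(pareto_front(points))  # [(5, 0.30), (7, 0.22), (11, 0.15)]
```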

2.1 G-19 Technique

We apply the algorithm \(\mathcal {A}_\mathrm{{POPs}}\) to a decision table D. The output of this algorithm is the Pareto front for the bi-objective optimization of CART-L trees for the parameters N and G. We choose the POP with the maximum value of the parameter N that is at most 19 and then derive a decision tree \(\varGamma \) whose parameters N and G are equal to the coordinates of this POP. The tree \(\varGamma \) is the output of the G-19 technique. This technique was described in [1]; however, the parameter L was not studied there for the constructed trees.

2.2 L-19 Technique

We apply the extension of the algorithm \(\mathcal {A}_\mathrm{{POPs}}\) to a decision table D to create the Pareto front for the bi-objective optimization of CART-L trees for the parameters N and L. We choose the POP with the maximum value of the parameter N that is at most 19 and then derive a decision tree \(\varGamma \) whose parameters N and L are equal to the coordinates of this POP. The tree \(\varGamma \) is the output of the L-19 technique, which is completely new.
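
Both techniques share the same selection step over the respective Pareto front. A minimal sketch, assuming the front is given as a list of (N, rate) pairs as in the previous fragment:

```python
def select_pop(front, max_vertices=19):
    """Shared selection step of the G-19 and L-19 techniques: among the
    POPs (n, rate) with n <= max_vertices, take the one with maximum n."""
    feasible = [p for p in front if p[0] <= max_vertices]
    return max(feasible, key=lambda p: p[0]) if feasible else None

# With the front [(5, 0.30), (7, 0.22), (11, 0.15)] from the previous
# sketch, select_pop picks (11, 0.15); a decision tree with these
# parameter values is then derived from the chosen POP.
```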

3 Results of Experiments

Table 1 describes the 14 decision tables used in the experiments, giving for each its name, number of features, and number of objects (rows). These tables were collected from the UCI Machine Learning Repository [15].

We applied the G-19 and L-19 techniques to each of these tables and found the values of the parameters N, G, and L for the constructed decision trees. Table 2 describes the experimental results.

The obtained results show that the L-19 technique, in comparison with the G-19 technique, decreases the parameter L on average from 0.16 to 0.11 at the cost of a slight increase in the parameter G on average from 0.06 to 0.07.

Table 1. Decision tables collected for the experiments
Table 2. Results of experiments

4 Conclusions

We proposed to evaluate the accuracy of decision trees not only by the global misclassification rate G but also by the local misclassification rate L, and designed the new L-19 technique. This technique constructs decision trees with at most 19 vertices and acceptable values of the parameters G and L. In the future, we plan to extend this technique to multi-label decision tables using the bi-objective optimization algorithms described in [2, 3], and to perform more experiments with other bounds on the number of vertices, such as 13, 15, 17, 21, and 23. Another direction of future research is to design heuristics that make the approach applicable to larger data sets.