
1 Introduction

The use of computers and computational resources continues to grow in many industrial, business, healthcare and scientific areas. Together with the boom of high-speed networks and the increasing storage capacity of database clusters, data warehouses and cloud technologies, huge amounts of diverse data can be stored. Such data are often merged from different sources, contain different data types and unusual coding schemes, and seldom come without errors (or noise). Data mining is not only an important scientific area but also an important tool in industry, business and healthcare, as the people making decisions are becoming overwhelmed by the data. As examples we can mention the Human Genome Project, the Hubble Space Telescope and the Human Brain Mapping Project. A nice overview of swarm intelligence for data mining can be found in [21].

Automatic construction of simple rules in the form of decision trees has been attempted in virtually all disciplines in which data exploration methods have been developed. It has traditionally been pursued in the fields of statistics, engineering (pattern recognition) and decision theory (decision table programming). Recently, renewed interest has been generated by research in artificial intelligence (machine learning) and the neurosciences (neural networks) [22].

Multiple metaheuristics have been used for inducing decision trees. However, to the best of our knowledge, there have been only a few successful attempts to use the ant colony optimization method for the induction of decision trees (see Sect. 2.2 for more information).

2 Background Information

2.1 Ant Colony Optimization

The high number of individuals and the decentralized approach to task coordination in many ant species give ant colonies a high degree of parallelism, self-organization and fault tolerance. Studying these paradigms offers a good chance of discovering inspirational concepts for many successful metaheuristics.

Simple Ant Colony Optimization. The first ant colony optimization algorithm was the Ant System [12]. This system has been improved over time, and the simplest optimization algorithm available is the simple ACO (SACO) [10]. SACO in fact implements the double bridge experiment [9]. It has been used for the TSP, where each edge is assigned a cost value (length) L. In SACO each edge is also assigned a pheromone value \(\tau _{ij}\) that is updated during the run. The pheromone value is used for inducing the potential solutions.

It has been observed that initial experiments on the binary bridge converged rapidly to a solution and the ants did not explore alternative paths. Therefore, pheromone evaporation at the end of each iteration has been introduced.
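
The core of SACO can be illustrated with a minimal Python sketch, assuming a pheromone table keyed by edges; the function and parameter names (alpha, rho) are illustrative, not part of the original formulation:

    import random

    def saco_select_next(current, candidates, pheromone, alpha=1.0):
        """Pick the next node with probability proportional to tau_ij^alpha."""
        weights = [pheromone[(current, j)] ** alpha for j in candidates]
        total = sum(weights)
        r, acc = random.uniform(0, total), 0.0
        for j, w in zip(candidates, weights):
            acc += w
            if r <= acc:
                return j
        return candidates[-1]

    def saco_evaporate(pheromone, rho=0.1):
        """End-of-iteration evaporation: reduce every edge by a fraction rho."""
        for edge in pheromone:
            pheromone[edge] *= (1.0 - rho)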

Ant System. Ant System (AS) [11] improves SACO by changing the transition probability between nodes to include heuristic information and by including a tabu list (thus adding a memory to the algorithm). AS in fact balances the exploration-exploitation tradeoff, which is controlled by the parameters \(\alpha \) and \(\beta \). If \(\alpha = 0\), no pheromone information is used; if \(\beta = 0\), the algorithm degrades to SACO.
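
For reference, the AS transition probability for ant \(k\) moving from node \(i\) to node \(j\) is commonly written as \(p_{ij}^{k} = \tau _{ij}^{\alpha }\eta _{ij}^{\beta } / \sum _{l \in N_i^k} \tau _{il}^{\alpha }\eta _{il}^{\beta }\) for \(j \in N_i^k\), where \(\eta _{ij}\) is the heuristic desirability of edge \((i,j)\) and \(N_i^k\) is the set of nodes not yet excluded by the tabu list.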

A different formulation of the transition probability has been defined in [20], where the \(\alpha \) parameter defines the relative importance of the pheromone and the \(\beta \) parameter is removed.

Ant Colony System. The Ant Colony System (ACS) [13] has been developed to improve the performance of AS. It uses a different transition rule, a different pheromone update rule, a local pheromone update and a candidate list to promote specific nodes.

The pseudorandom-proportional action rule is defined with a random number \(r\sim U(0,1)\) and a user-specified parameter \(r_0 \in [0,1]\). The \(r_0\) parameter balances exploration (\(r > r_0\)) and exploitation (\(r \le r_0\)).
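
A minimal sketch of this rule follows; the heuristic table eta, the exponent beta and the helper name are illustrative assumptions following the usual ACS formulation, not the authors' code:

    import random

    def acs_select_next(current, candidates, pheromone, eta, beta=2.0, r0=0.9):
        """Greedy exploitation with probability r0, biased exploration otherwise."""
        scores = {j: pheromone[(current, j)] * eta[(current, j)] ** beta
                  for j in candidates}
        if random.random() <= r0:                  # exploitation: best edge
            return max(scores, key=scores.get)
        total = sum(scores.values())               # exploration: roulette wheel
        r, acc = random.uniform(0, total), 0.0
        for j in candidates:
            acc += scores[j]
            if r <= acc:
                return j
        return candidates[-1]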

In ACS, only the best ant is allowed to reinforce the pheromone concentrations on the links (in contrast to AS). The authors implemented two methods of selecting the best path: (1) iteration-best, where the pheromone is updated according to the best solution found in the current iteration, or (2) global-best, where the pheromone is updated according to the best solution found so far.

Max-Min Ant System. It has been discovered that AS prematurely stagnates on complex problems. All the ants follow exactly the same path, and there is very little exploration and too rapid exploitation of the highest pheromone concentrations. Therefore, the Max-Min Ant System (MMAS) has been developed [28]. The main difference between AS and MMAS is that the pheromone intensities are bounded, \(\tau _{ij} \in [\tau _{min}, \tau _{max}]\), where \(\tau _{min}\) and \(\tau _{max}\) are problem-dependent static parameters. A positive lower bound \(\tau _{min}\) facilitates better exploration, as all links keep a positive probability of being selected into a solution. Additionally, only the best ant may reinforce pheromone. The first versions of MMAS focused on the iteration-best path; later versions used different strategies (a combination of both strategies, reinitialization in case of stagnation, etc.).
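
A minimal sketch of the bounded MMAS update, assuming a pheromone table keyed by edges; the bounds and rates below are illustrative placeholders (the paper treats \(\tau _{min}\) and \(\tau _{max}\) as problem dependent):

    def mmas_update(pheromone, best_path, rho=0.02, deposit=1.0,
                    tau_min=0.01, tau_max=5.0):
        """Evaporate everywhere, let only the best ant deposit, then clamp."""
        for edge in pheromone:
            pheromone[edge] *= (1.0 - rho)
        for edge in best_path:
            pheromone[edge] += deposit
        for edge in pheromone:
            pheromone[edge] = min(tau_max, max(tau_min, pheromone[edge]))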

We have tried all the ant metaheuristics mentioned above and finally settled on a modified version of the MMAS algorithm, including an adaptive change of the pheromone deposit and evaporation rates, and elitism.

More variants can be found in the relevant literature, e.g. in the series [14].

2.2 Decision Trees

Decision tree induction algorithms are used in many areas, such as astronomy, character recognition, finance, marketing, medicine, natural language processing and software engineering [1]. Decision trees are hierarchical, sequential classification structures that recursively partition the set of observations (data). Each node of the decision tree is either a test that partitions the data or a decision (classification) about the object. In this paper the decision trees are constructed automatically from the data.
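
As an illustration of automatic construction from data (a standard axis-parallel tree, not the method proposed here), a tree can be induced with a few lines of scikit-learn; the dataset and parameters are arbitrary:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree))  # each inner node is a test, each leaf a decision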

Automated construction of decision trees (or decision rules) has been performed in many areas where data exploration methods have been developed.

The Problem of Construction. The construction of optimal decision trees (in the sense of the minimal number of tests performed) is an NP-complete problem [17]. The authors of [8] proved that even finding the root node of an optimal strategy is an NP-hard problem.

Algorithms for decision tree induction using ant colonies can be found only in recent publications, such as [24].

A comprehensive overview of decision trees can be found in [27].

Use of Metaheuristics for Tree Induction. The CART algorithm [5] is fully deterministic and is prone to getting stuck in local optima. The problem of traditional decision tree construction algorithms is that they usually use the locally optimal split for a single node only and do not consider a group of data. Therefore, a heuristic approach is suitable and provides more general results.

In 1990, Koza [18] introduced genetic programming for the induction of decision trees. In 1993, [16] introduced an algorithm called SADT that uses a simulated annealing process to find oblique splits at a decision node. Later, in 1997, [22] introduced the OC1 system. It does not use the brute-force approach of SADT, but uses random search only to improve an existing solution; therefore it may be regarded as a hybridization of the CART and SADT techniques. The classical approaches are very sensitive to incomplete and noisy data, a situation that is very common in the medical environment.

Evolutionary techniques have also been successfully applied, e.g. [19]; more information can be found in [26]. These methods usually use a population of trees, selecting the best according to a certain fitness function.

The authors of [1] propose to distinguish between (1) evolutionary induction of decision trees and (2) evolutionary design/improvement of decision tree components. In (1) the individuals are decision trees, in (2) the individuals are components of decision tree classifiers. Another logical possibility is to use evolutionary algorithms to optimize the parameters of an algorithm. In our methodology we use the first approach; therefore the decision trees are sometimes referred to as individuals.

An adaptive algorithm for ant-inspired induction of decision trees (ACDT) [3] uses the so-called twoing (binary) decision criterion or other decision rules that are not able to cope with continuous attributes. The authors showed that the algorithm is comparable to the classical CART algorithm. They later presented a modification using an inequality rule to cope with continuous attributes [4]; however, a similar approach had already been used in our publications a few years earlier [6, 7]. For the comparison the authors used only 11 datasets from the UCI repository and compared against the Ant-Miner algorithm [25] and its modification for continuous attributes [23].

The second ant-colony-inspired induction algorithm can be found in [24]. It uses a virtual start node and two types of edges: nominal or continuous. For continuous attributes a dynamic discretization is used (as in C4.5 or in the cAnt-Miner rule induction algorithm by the same authors). The authors also use the same heuristics as the C4.5 algorithm (the information gain ratio for generating threshold values and for further dividing the training set into subsets). The pheromone is used as an accurate feedback on solution quality. The authors evaluate their algorithm on 22 UCI datasets and compare it to WEKA's J48 and SimpleCART algorithms.
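
For context, a rough sketch of a C4.5-style gain ratio for a single candidate threshold is shown below (an illustration of the general technique, not the implementation used in [24]):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(values, labels, threshold):
        """Information gain ratio of the binary split value <= threshold."""
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            return 0.0
        n = len(labels)
        gain = entropy(labels) - (len(left) / n * entropy(left)
                                  + len(right) / n * entropy(right))
        split_info = entropy(['L'] * len(left) + ['R'] * len(right))
        return gain / split_info if split_info > 0 else 0.0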

3 The ACO_DTree Method Description

We have used a population approach with elitism. The potential solutions (decision trees) represent the individuals. The pseudocode of the algorithm can be described as follows. The trees are induced from an incidence matrix (representing the memory of the algorithm) containing pheromone values. In the initial phase, the pheromone matrix is initialized either with random or constant values and the first population of trees is induced. In the iterative phase, the population is first optimized (the decision values in the tree nodes are optimized), then the population is enlarged and the worst solutions are removed. According to the elitist ratio, the best solutions contribute back to the pheromone matrix. Pheromone evaporation and adaptive measures keep the population diversified while keeping the pheromone matrix from being saturated (premature convergence).

[Pseudocode of the ACO_DTree algorithm (figure not reproduced here).]
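
Based on the description above, the main loop might look roughly as follows. This is a structural sketch only; every helper name (init_pheromone_matrix, induce_tree, optimize_node_decisions, fitness, evaporate, deposit_pheromone, adapt_rates) is a hypothetical placeholder for a step described in the text, not the authors' code:

    def aco_dtree(data, iterations=100, pop_size=30, elite_ratio=0.1,
                  rho=0.05, deposit=1.0):
        pheromone = init_pheromone_matrix(data)       # random or constant values
        population = [induce_tree(pheromone, data) for _ in range(pop_size)]
        for _ in range(iterations):
            # local optimization of the decision values in the tree nodes
            population = [optimize_node_decisions(t, data) for t in population]
            # enlarge the population, then drop the worst solutions
            population += [induce_tree(pheromone, data) for _ in range(pop_size)]
            population.sort(key=lambda t: fitness(t, data), reverse=True)
            population = population[:pop_size]
            # elitist pheromone reinforcement with evaporation and adaptation
            elite = population[:max(1, int(elite_ratio * pop_size))]
            evaporate(pheromone, rho)
            for tree in elite:
                deposit_pheromone(pheromone, tree, deposit)
            adapt_rates(pheromone)
        return population[0]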

Fitness Measures of the Decision Trees. For each algorithm (over a dataset), the following outputs are saved for each partial crossvalidation run and for the final result (i.e. for 10-fold crossvalidation, 10 + 1 results are available). Only the most important are listed here, the total count of measures saved being about 70; a computation sketch for the core measures follows the list:

  • Classified/misclassified/unclassified instances (absolute and relative value)

  • FN/FP/TN/TP (absolute and relative value)

  • F-Measure

  • Precision/Recall

  • Area under ROC/PRC (AUROC/AUPRC)

  • Confusion matrix

  • Kappa statistic

  • Various information scores (Kononenko & Bratko)

  • Various SF measures (entropy gain, etc.)
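
Most of these measures can be reproduced from the predictions of any classifier; the following is a minimal sketch using scikit-learn (illustrative only, not the authors' evaluation pipeline):

    from sklearn import metrics

    def core_measures(y_true, y_pred, y_score=None):
        out = {
            'accuracy':  metrics.accuracy_score(y_true, y_pred),
            'precision': metrics.precision_score(y_true, y_pred, average='macro'),
            'recall':    metrics.recall_score(y_true, y_pred, average='macro'),
            'f_measure': metrics.f1_score(y_true, y_pred, average='macro'),
            'kappa':     metrics.cohen_kappa_score(y_true, y_pred),
            'confusion': metrics.confusion_matrix(y_true, y_pred),
        }
        if y_score is not None:   # scores/probabilities of the positive class
            out['auroc'] = metrics.roc_auc_score(y_true, y_score)
            out['auprc'] = metrics.average_precision_score(y_true, y_score)
        return out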

4 Experimental Part

4.1 Methodology Overview

As we are comparing 73 different classifiers over 31 datasets, we need to divide them into natural clusters. The classifier groups are shown in Table 1 (the ACO_DTree method is not listed, as it is compared with all other classifiers). The dataset groups are shown in Table 2. The reason for this division is to obtain detailed information about the behavior of the proposed method, instead of displaying only the All Classifiers over All Datasets result (the rightmost bottom cell of the evaluation tables), as is usual in other publications.

We have used 10-fold crossvalidation. The data were first randomized (shuffled) and then divided into 10 folds (using stratification).
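
A minimal sketch of such a stratified split with scikit-learn, assuming a feature matrix X, a label vector y, a classifier model with a fit/predict interface, and the core_measures helper sketched above:

    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        fold_measures = core_measures(y[test_idx], model.predict(X[test_idx]))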

Evaluation. There are many measures available for evaluation. In this paper we evaluate the performance using the precision and recall measures. We can use (1) the average value (over the 10 crossvalidation runs) or (2) the maximal (BSF, best-so-far) value.

The statistical significance of the results will be assessed as follows. First, we will use the nonparametric Friedman test to test the null hypothesis (\(H_0\)) that the mean values of the measures are equal. If the null hypothesis is rejected, we continue with post-hoc tests – the Nemenyi test for comparing the classifiers against each other. If a statistically significant difference between the performance of two classifiers is found, we proceed with the Holm and Hochberg post-hoc tests. If we cannot reject \(H_0\) at \(\alpha =0.05\), we try \(\alpha =0.1\).
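
A minimal sketch of this procedure in Python, assuming a dictionary scores mapping each classifier to a list of per-dataset values and a list pairwise_pvalues of p-values from the pairwise comparisons against ACO_DTree (the Nemenyi test itself is available e.g. in the scikit-posthocs package):

    from scipy.stats import friedmanchisquare
    from statsmodels.stats.multitest import multipletests

    stat, p = friedmanchisquare(*scores.values())   # H0: equal performance
    if p < 0.05:
        # Holm correction of the pairwise post-hoc p-values
        reject, p_adj, _, _ = multipletests(pairwise_pvalues, alpha=0.05,
                                            method='holm')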

4.2 Classifiers

The classifier groups are shown in Table 1. The Distinct instances column shows the number of distinct classifiers implemented as separate classes. The Total instances column also includes variations with different strategies (e.g. different kernel functions). The grouping follows the Java classes of the WEKA data mining software.

Table 1. Overview of the classifiers grouping for evaluation. Note that the ACO_DTree method is included in each group and is not listed explicitly.

4.3 Datasets

We have divided the datasets into the following groups for detailed evaluation (see Table 2):

  • UCI – Contains datasets from the UCI repository. This is a selected part of the UCI repository that is used in many articles from the domain of artificial intelligence. It contains 23 datasets in total.

  • ECG – Contains MIT and AHA ECG databases, in total 6 datasets. These datasets are prepended with the MIT_ or AHA_ prefix.

  • CTG + MI – Contains data of myocardial infarction (2 datasets) and cardiotocography data (2 datasets).

  • MIT – Contains selected signals of the MIT ECG database together with the whole MIT database (total of 4 datasets). Does not contain the AHA ECG database. These datasets are prepended with the MIT_ prefix.

  • All – Includes all available datasets, 32 in total.

Table 2. Overview of the dataset grouping for evaluation. The groups are not exclusive and overlap; therefore the total is not equal to the sum of all the subgroups.

5 Results

This section summarizes the statistical evaluation of the experiments. Only the final results are included in this part (obtained after parameter tuning).

We have evaluated the average (mean) and BSF measures from 10-fold crossvalidation. In order to present the results in compact form, the following symbols are used in the result tables:

  • The \(\emptyset \) symbol means that the Friedman test did not reject the \(H_0\) hypothesis.

  • The symbols \(R_{05}\) and \(R_{10}\) mean that the Friedman test rejected the \(H_0\) hypothesis at the level of \(\alpha \) equal to 0.05 or 0.1, respectively. To save space, the decimal point is omitted in the subscripts.

  • The symbols \(\surd _{05}\) and \(\surd _{10}\) indicate significantly better results for the ACO_DTree method when compared to the other classifiers (by the Holm or Hochberg post-hoc test) at the level of \(\alpha \) equal to 0.05 or 0.1, respectively.

Each cell of the table has a left and a right part: the left part contains the statistical result symbol for the average value of the measure, while the right part contains the result for the BSF value of the measure (Table 3).

This kind of presentation is quite unusual; however, it provides much more information than the traditionally used approach, in which only the bottom-right cell – the result for all classifiers vs. all datasets – is actually presented.

Note that each cell stands for the run of multiple classifiers (including the ACO_DTree method) over multiple datasets, so that the statistical tests are valid.

Table 3. Statistical results for the precision measure (avg. and max) for the classifiers with optimized parameters.

5.1 Summary

Based on the statistical evaluation described in Sect. 4, we can conclude that the ACO_DTree method performs significantly better at the level of \(\alpha = 0.05\) when compared over 32 different datasets and 41 distinct instances of classifiers. This overall result can be further divided for different measures and for various groupings of the datasets and/or classifiers (Table 4).

We conducted 10-fold crossvalidation and, for each experiment, saved and statistically evaluated about 20 quantitative objective measures. The averaged and best-so-far values of the selected measures have been statistically evaluated using the Friedman test with the Holm and Hochberg post-hoc procedures (at the levels of \(\alpha =0.05\) and \(\alpha =0.10\)). The Nemenyi statistical test has also been conducted.

The ACO_DTree algorithm performed significantly better (\(\alpha =0.05\)) in 29 test cases for the averaged F-measure and in 14 cases for the averaged precision measure. For the accuracy measure it was 13 and 15 cases, respectively. The best results have been obtained for various subsets of the UCI database and for the dataset group combining cardiotocography data and myocardial infarction data.

Table 4. Statistical results for the recall measure (avg. and max) for the classifiers with optimized parameters.

6 Conclusion and Discussion

We have proposed, implemented and experimentally evaluated an ant-colony-inspired method for the induction of decision trees. We have also identified the areas where the ACO_DTree algorithm performed significantly better when compared to other state-of-the-art classifier implementations. In addition to the accuracy measure, we have also evaluated other quantitative measures. The advantage of the proposed method lies in the robustness achieved by replicating the self-organization behavior of ants, which has been incorporated into an algorithm for the induction of decision trees.

The algorithm uses randomization of the axis-parallel tree to improve the final solution. We have introduced a local search phase that leads to faster convergence of the algorithm. The local search phase can be either random (jittering) or based on another nature-inspired approach (PSO).

The algorithm has shown competitive results when compared to other classifiers. For the percentage of correctly classified instances (average value), it has been statistically better in 4 cases for the whole UCI database and once for the ECG MIT and the CTG + MI datasets at the significance level \(\alpha =0.05\), and in one case for the whole ECG dataset (\(\alpha =0.10\)).

More than 12400 experiments over 73 classifiers and 32 datasets have been run in the final evaluation phase.

6.1 Discussion

We are aware that in the field of classifiers and decision trees it is common to use only the classifier accuracy for evaluation. In the case of biomedical signals, it is important to measure inter- and intra-patient results, together with measures used in medicine, such as sensitivity (SE) and specificity (SP). In this project we have obtained about 20 different quantitative measures that can be used for evaluation. In addition, apart from the mean value, we can evaluate the best/worst-so-far solution found and even the standard deviation. The question is how these measures and their features should contribute to the final evaluation.

6.2 Future Work

It would be interesting to study (and extend) the algorithm to use hypercurves for splitting, as opposed to axis-parallel splitting. The problem is that the task of finding such trees is much more complicated, and the readability and comprehensibility of such trees is reduced. It is desirable to use sophisticated splits only when the extra complexity of the split is compensated by its contribution to the tree quality.