1 Introduction

In this paper, our focus is on the discovery of classification rules from supervised training data. Many classification algorithms already exist, such as decision trees, neural networks, k-nearest neighbor classifiers, and support vector machines. Some of them (e.g., neural networks and support vector machines) are opaque and incomprehensible to humans, while others are comprehensible (e.g., decision trees). In many applications, both comprehensibility and accuracy are required.

Swarm intelligence [1–3], which deals with the collective behavior of small and simple entities, has been used in many application domains. Ant colony optimization (ACO) [4–6] is one of the most famous meta-heuristics under the umbrella of swarm intelligence. Since its inception, it has been used to solve many problems, including those related to data mining [7–10]. In this paper, we propose an ACO-based algorithm for the discovery of classification rules. The proposed algorithm, called AntMiner-C, has both properties of accuracy and comprehensibility.

The overall approach of the AntMiner-C algorithm is sequential covering. It starts with the full training set, constructs a best rule that covers a subset of the training data, adds that rule to the discovered rule list, and removes the training samples that the rule correctly classifies. This process continues until none or only a few training samples are left. The proposed algorithm has some unique features that allow it to achieve a higher accuracy rate than some previously proposed ACO-based approaches for the same task.

The remainder of this paper is organized as follows. In Sect. 2, we explain the basic concepts of ACO and give an overview of existing ACO algorithms for classification rules discovery. Section 3 describes our proposed approach. In Sect. 4, we present and discuss our experimental results on some publicly available data sets. Finally, Sect. 5 concludes the paper and gives future directions of research.

2 ACO and its application to classification rule discovery

Ant colony optimization is a branch of swarm intelligence (and, more generally, a genre of population-based heuristic algorithms [11]) inspired by the food-foraging behavior of biological ants. Ants communicate by means of a chemical substance called pheromone, which they deposit in the environment to pass on information about the trail they are following. Other ants that arrive in the vicinity are more likely to take a path with a higher concentration of pheromone than paths with lower concentrations. In other words, the desirability of possible paths is proportional to their pheromone concentrations. The pheromone evaporates with time, and if new pheromone is not added, the older one dwindles away.

This indirect form of information passing via the environment helps the ants find the shortest path to a food source. If two paths between a food source and the nest are initially discovered by some ants, the longer of the two will soon become unattractive to subsequent ants: the ants following it take longer to reach the food source and return, and hence the pheromone concentration on that path does not increase as rapidly as on the shorter path. If an established path is blocked, some ants will first go to the left and some to the right with equal probability. However, an ant taking the shorter of the two detours returns earlier than an ant taking the longer one, and hence the pheromone on the shorter detour is reinforced, and subsequent ants have a higher probability of taking the efficient path. Soon a new shorter path that bypasses the blockage is established. The more ants follow a given trail, the more attractive that trail becomes to other ants.

This phenomenon is modeled in the ACO meta-heuristic. An artificial ant constructs a solution to the problem by adding solution components one by one. After a solution is constructed, its quality (i.e., fitness) is determined, and the components of the solution are assigned pheromone concentrations proportional to this quality. Subsequent ants construct their solutions one by one, guided by the pheromone concentrations in their search for components to add. The components with higher pheromone concentrations are thus identified as contributing to good solutions and appear repeatedly in them. Usually, after sufficient iterations, the ants converge on a good, if not the optimal, solution.

For the application of ACO to a problem, we have the following requirements.

  • The ability to represent a complete solution as a combination of different components.

  • A method to determine the fitness or quality of the solution.

  • A heuristic measure for the solution’s components (this is helpful, if available, but not necessary).

Since its inception, ACO has been applied to solve many problems [2, 4]. It is naturally suited to discrete optimization problems, such as quadratic assignment, job scheduling, subset problems, network routing, vehicle routing, load dispatch in power systems, bioinformatics, and, of course, data mining, which is the subject of this paper.

2.1 ACO for the discovery of classification rules

ACO has been applied to the discovery of classification rules. The first ACO-based algorithm for classification rule discovery, called AntMiner, was proposed by Parpinelli et al. [12]. An ant constructs a rule by starting with an empty rule and incrementally adding one condition at a time. The selection of a condition to be added is probabilistic and based on two factors: a heuristic quality of the condition and the amount of pheromone deposited on it by the previous ants. The authors use information gain as the heuristic value of a condition. After the antecedent part of a rule has been constructed, the consequent of the rule is assigned by a majority vote of the training samples covered by the rule. The constructed rule is then pruned to remove irrelevant terms and to improve its accuracy. The quality of the constructed rule is determined, and the pheromone values on the trail followed by the ant (the conditions selected by the ant) are updated proportional to the quality of the rule. After all ants have constructed their rules, the best rule is selected and added to a discovered rule list. The training samples correctly classified by that rule are removed from the training set. This process continues until the number of uncovered samples becomes less than a user-specified value. The final product is an ordered rule list that is used to classify the test data.

Versions of the AntMiner algorithm proposed by Liu et al. are called AntMiner2 [13] and AntMiner3 [14, 15]. In AntMiner2, the authors propose density estimation as the heuristic function instead of the information gain used by AntMiner. They show that this simpler heuristic does the job as well as the complex one, and hence AntMiner2 is computationally less expensive than the original AntMiner while having comparable performance. In AntMiner3, the authors use a different pheromone update method: they update and evaporate the pheromone of only those conditions that occur in the rule and do not evaporate the pheromone of unused conditions. In this way, exploration is encouraged.

Martens et al. [16] propose a Max–Min Ant System based algorithm (AntMiner+) that differs from the previously proposed AntMiners in several aspects. They define a different search environment for the ants, choose the class label of a rule prior to its construction, incorporate the ability to form interval rules, and make the selection of two important ACO parameters, α and β, part of the search process. Only the iteration-best ant is allowed to update the pheromone, the range of the pheromone trail is limited within an interval, and a new rule quality measure is used.

Other works on AntMiner include [17], in which an algorithm for discovering unordered rule sets is presented. In [18], a particle swarm optimization (PSO) algorithm is used for continuous-valued attributes and ACO for nominal-valued ones, and the two algorithms are jointly used to construct rules. The issue of continuous attributes has also been dealt with in [19] and [20]. An AntMiner version for the multi-label classification problem can be found in [21]. Relevant background information regarding AntMiners can be found in [22].

Our proposed algorithm differs in many ways from the previous AntMiner algorithms. The main novelty of our approach is a new class-dependent heuristic function based on correlation, which reduces the search space considerably and yields better results. Other differences are highlighted along with the new algorithm's description in the next section.

3 Correlation-based AntMiner (AntMiner-C)

In this section, we describe our ACO-based classification rules mining algorithm called AntMiner-C. We begin with a general description of the algorithm and then discuss the heuristic function, pheromone update, rule pruning, early stopping, etc.

3.1 General description

The core of the algorithm is the incremental construction of a classification rule of the type

$$ \text{IF } \langle \text{term}_{1} \text{ AND } \text{term}_{2} \text{ AND } \ldots \rangle \text{ THEN } \langle \text{class} \rangle $$

by an ant. Each term is an attribute-value pair related by an operator. In our current experiments, we use "=" as the only possible relational operator. An example term is "Color = red": the attribute's name is "Color", and "red" is one of its possible values. Since we use only "=", any continuous (real-valued) attributes present in the data have to be discretized in a preprocessing step.

For the ants to search for a solution, we need to define the space in which the search is to be conducted. This space is defined by the data set used. The dimensions (or coordinates) of the search space are the attributes of the data set. The class attribute is included in the search space but not handled in the same way as other attributes. The possible values of an attribute constitute the range of values for the corresponding dimension in the search space. For example, a dimension called ‘Color’ may have three possible values {red, green, blue}. The task of the ant is to visit a dimension and choose one of its possible values to form an antecedent condition of the rule (e.g., Color = red). The total number of terms for a data set is equal to

$$ Total\_terms = \sum_{n = 1}^{a} b_{n} $$
(1)

where $a$ is the total number of attributes (excluding the class attribute) and $b_n$ is the number of possible values that can be taken by attribute $A_n$. Once a dimension has been visited, it cannot be visited again by an ant, because we do not allow conditions of the type "Color = red OR Color = green". The search space is such that an ant may pick a term from any dimension, and there is no fixed order in which the dimensions must be visited. An example search space represented as a graph is shown in Fig. 1.

Fig. 1 An example problem's search space represented as a graph. There are four attributes A, B, C, and D having 3, 2, 3, and 2 possible values, respectively. An ant starts from the "Start" vertex and constructs a rule by adding conditions (attribute-value pairs, or terms) to the antecedent part. After a term has been selected, all other terms of the same attribute become prohibited for the ant. Suppose the ant chooses C1 as its first term; it then cannot select any more terms from attribute C. Further, suppose that the ant subsequently chooses A3, D1, and B1. The rule construction process stops when the ant reaches the "Stop" vertex. The consequent part of the rule (class label) is known to the ant prior to rule construction. The ant can stop prematurely if the addition of the latest term makes the rule cover only training samples that have the chosen class label
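For illustration, Eq. 1 can be sketched in Python over a hypothetical two-attribute schema (the schema and variable names are our own illustrative choices):

```python
# Hypothetical schema: attribute name -> list of its possible values.
attributes = {"Color": ["red", "green", "blue"],
              "Size":  ["small", "large"]}

# Every term (attribute-value pair) of the search space.
terms = [(a, v) for a, vals in attributes.items() for v in vals]

# Eq. 1: Total_terms = sum of b_n over all non-class attributes.
total_terms = sum(len(vals) for vals in attributes.values())
print(total_terms)  # 5
```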

The same search space has been utilized by AntMiner [12]. The search environment utilized by AntMiner+ [16] is different in several ways: each ant is required to choose one of the classes (the majority class is excluded from this set) before constructing its rule, the selection of the α and β parameters is made part of the search process, the dimensions (i.e., attributes) are visited in a fixed order but an ant may choose to ignore a dimension, and the discretization of continuous variables is replaced by a mechanism that enables ants to include interval conditions in their rules.

A general description of the AntMiner-C algorithm is shown in Fig. 2. Its basic structure is that of AntMiner [12]: it is a sequential covering algorithm. It discovers a rule, and the training samples correctly covered by this rule (i.e., samples that satisfy the rule antecedent and have the class predicted by the rule consequent) are removed from the training set. The algorithm discovers another rule using the reduced training set, after which the training set is further reduced by removing the samples correctly covered by the newly discovered rule. This process continues until the training set is almost empty; a user-defined threshold serves as the condition for terminating the discovery of new rules. If the number of remaining uncovered samples in the training set is lower than or equal to this threshold, the algorithm stops.

Fig. 2 The AntMiner-C algorithm

One rule is discovered in each execution of the outer WHILE loop. First, a class label is chosen from the set of class labels present in the uncovered samples of the training set. This class label remains constant for the batch of ants of the REPEAT-UNTIL loop. Each iteration of the REPEAT-UNTIL loop sends an ant to construct a rule. Each ant starts with an empty list of conditions and constructs the antecedent part of the rule by adding one term at a time, for the particular class chosen before the REPEAT-UNTIL loop begins. The quality of the rule constructed by an ant is determined and the pheromones are updated, and then another ant is sent to construct another rule. Several rules are thus constructed during an execution of the REPEAT-UNTIL loop. The best one among them, after pruning, is added to the list of discovered rules.

The algorithm terminates when the outer WHILE loop exit criterion is met. The output of the algorithm is an ordered set of rules. This set can then be used to classify unseen data. The main features of the algorithm are explained in detail in the following sub-sections.

As mentioned before, an ant constructs the antecedent part of the rule by adding one term at a time. The choice of adding a term in the current partial rule is based on the pheromone value and heuristic value associated with the term and is explained in the next sub-section.
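The loop structure described above can be summarized in the following Python sketch. The helper functions stand for the sub-steps detailed in Sects. 3.2–3.6 and are placeholders of ours, not the paper's published listing (Fig. 2):

```python
def antminer_c(training, helpers, n_ants=1000, max_uncovered=10, n_converge=10):
    """Sequential-covering skeleton of AntMiner-C. `helpers` supplies the
    sub-steps (class commitment, rule construction, quality, pheromone
    update, pruning); it is a hypothetical placeholder object."""
    rules = []
    while len(training) > max_uncovered:                    # outer WHILE loop
        rule_class = helpers.choose_class(training)         # fixed for the batch
        tau = helpers.init_pheromone()                      # Eq. 2
        eta = helpers.init_heuristic(training, rule_class)  # Eqs. 4 and 5
        best, last, repeats = None, None, 0
        for _ in range(n_ants):                             # REPEAT-UNTIL loop
            rule = helpers.construct_rule(tau, eta, rule_class, training)
            q = helpers.quality(rule, training)             # Eq. 10
            helpers.update_pheromone(tau, rule, q)          # Eq. 11, unpruned rule
            if best is None or q > helpers.quality(best, training):
                best = rule
            repeats = repeats + 1 if rule == last else 0
            last = rule
            if repeats >= n_converge:                       # convergence: stop early
                break
        rules.append(helpers.prune(best, training))         # prune best rule only
        training = helpers.remove_covered(training, rules[-1])
    rules.append(helpers.default_rule(training))            # majority-class default
    return rules
```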

3.2 Rule construction

Rule construction, which is an important part of the algorithm, is described below.

3.2.1 Class commitment

Our heuristic function (described later) is dependent upon the class label of samples, and hence, we need to decide beforehand the class of the rule being constructed. For each iteration of the WHILE loop in the algorithm, a class is chosen probabilistically, by roulette wheel selection, on the basis of the weights of the classes present in the yet uncovered data samples. The weight of a class is the ratio of its uncovered data samples to the total number of uncovered data samples. If 40% of the yet uncovered data samples are of class “A”, then there is a 40% chance of its selection. The class is chosen once and becomes fixed for all the ant runs made in the REPEAT-UNTIL loop.
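To make this step concrete, a minimal Python sketch of the roulette-wheel class commitment (the function and its inputs are our illustration, not the paper's code):

```python
import random
from collections import Counter

def choose_class(uncovered_labels, rng=random):
    """Roulette-wheel class commitment: the probability of selecting a
    class equals its share of the yet-uncovered training samples."""
    counts = Counter(uncovered_labels)
    classes = list(counts)
    weights = [counts[c] for c in classes]  # proportional to sample share
    return rng.choices(classes, weights=weights, k=1)[0]

# Example: 40% of the uncovered samples are of class "A", so "A" is
# selected with probability 0.4.
print(choose_class(["A"] * 40 + ["B"] * 60))
```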

Our method of class commitment differs from those of AntMiner and AntMiner+. In AntMiner, the consequent part of the rule is assigned after rule construction, according to the majority class among the cases covered by the rule. In AntMiner+, each ant selects the consequent of its rule prior to antecedent construction, from a set of class labels that excludes the majority class. In our case, even though the class label is selected prior to rule construction, it is not selected by the individual ants and is fixed for all the ant runs of a REPEAT-UNTIL loop. Also, the set of class labels does not exclude the majority class.

3.2.2 Pheromone initialization

At the beginning of each iteration of the WHILE loop, the pheromone values on the edges between all terms are initialized. The initial pheromone on the edge between two terms $term_i$ and $term_j$ belonging to two different attributes is

$$ \tau_{ij}(t = 1) = \frac{1}{\sum_{n = 1}^{a} x_{n} b_{n}} $$
(2)

where $a$ is the total number of attributes (excluding the class attribute), $b_n$ is the number of possible values that can be taken by attribute $A_n$, and $x_n$ is set to 0 if $A_n$ is the attribute to which $term_i$ belongs, and to 1 otherwise. The pheromones on edges between terms of the same attribute are initialized to zero. Since all the pheromone values for competing terms are the same, the first ant has no historical information to guide its search. This method of pheromone initialization is also used by AntMiner. AntMiner+ utilizes the MAX–MIN Ant System, in which all pheromone values are initialized to a value $\tau_{max}$.
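A short sketch of Eq. 2 in Python (the attribute tagging of terms is our illustrative encoding):

```python
def init_pheromone(attr_of_term):
    """Initial pheromone matrix per Eq. 2. attr_of_term[i] names the
    attribute that term i belongs to; edges between terms of the same
    attribute get zero pheromone."""
    n = len(attr_of_term)
    tau = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Denominator of Eq. 2: number of terms of all other attributes.
        others = sum(1 for j in range(n) if attr_of_term[j] != attr_of_term[i])
        for j in range(n):
            if attr_of_term[j] != attr_of_term[i]:
                tau[i][j] = 1.0 / others
    return tau

# The example of Fig. 1: attributes A, B, C, D with 3, 2, 3, 2 values.
tau = init_pheromone(["A"] * 3 + ["B"] * 2 + ["C"] * 3 + ["D"] * 2)
print(tau[0])  # row of term A1: 0 within attribute A, 1/7 elsewhere
```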

3.2.3 Term selection

An ant incrementally adds terms to the antecedent part of the rule it is constructing. The selection of the next term is subject to the restriction that no term of the same attribute $A_n$ may already be present in the current partial rule. In other words, once a term has been included in the rule, no other term of that attribute can be considered. Subject to this restriction, the probability of selecting a term for addition to the current partial rule is given by the equation:

$$ P_{ij}(t) = \frac{\tau_{ij}^{\alpha}(t)\,\eta_{ij}^{\beta}(s)}{\sum_{j = 1}^{Total\_terms} x_{j}\,\tau_{ij}^{\alpha}(t)\,\eta_{ij}^{\beta}(s)} $$
(3)

where $\tau_{ij}(t)$ is the amount of pheromone associated with the edge between $term_i$ and $term_j$ for the current ant (the pheromone value is updated after the passage of each ant), $\eta_{ij}(s)$ is the value of the heuristic function for the current iteration $s$ of the WHILE loop (constant for a batch of ants), and $x_j$ is a binary variable that is set to 1 if $term_j$ is selectable and to 0 otherwise. Selectable terms are those belonging to attributes that have not become prohibited through the selection of one of their terms. The denominator normalizes the numerator over all possible choices.

The parameters α and β are used to control the relative importance of the pheromones and heuristic values in the probability determination of next movement of the ant. We use α = β = 1, which means that we give equal importance to the pheromone and heuristic values. However, we note that different values may be used (e.g., α = 2, β = 1 or α = 1, β = 3), and we have carried out a limited experiment in this regard (Sect. 4.7).

Equation 3 is a classical equation used in Ant System, MAX–MIN Ant System, Ant Colony System (where the state transition also depends on one other equation), AntMiner (with α = β = 1), and AntMiner+.
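A sketch of this probabilistic selection in Python (roulette wheel over Eq. 3; the index-based term encoding is ours):

```python
import random

def select_term(i, tau, eta, selectable, alpha=1.0, beta=1.0, rng=random):
    """Choose the next term j after term i with probability P_ij of Eq. 3.
    `selectable` holds the indices j with x_j = 1, i.e., terms of
    attributes not yet used in the partial rule."""
    scores = [(j, (tau[i][j] ** alpha) * (eta[i][j] ** beta)) for j in selectable]
    total = sum(s for _, s in scores)
    if total == 0.0:
        return None                  # no selectable term is reachable
    r, acc = rng.random() * total, 0.0
    for j, s in scores:              # roulette wheel over the scores
        acc += s
        if r <= acc:
            return j
    return scores[-1][0]             # guard against floating-point round-off
```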

3.2.4 Heuristic function

The heuristic value of an edge gives an indication of its usefulness and thus provides a basis to guide the search. We use a heuristic function based on the correlation of the most recently chosen term with the candidate terms in order to guide the selection of the next term. The heuristic function is:

$$ \eta_{ij} = \frac{\left| term_{i}, term_{j}, class_{k} \right|}{\left| term_{i}, class_{k} \right|} \cdot \frac{\left| term_{j}, class_{k} \right|}{\left| term_{j} \right|}. $$
(4)

The most recently chosen term is $term_i$, and the term being considered for selection is $term_j$. The number of uncovered training samples that contain both $term_i$ and $term_j$ and belong to the committed class label $k$ of the rule is denoted $|term_i, term_j, class_k|$. This number is divided by $|term_i, class_k|$, the number of uncovered training samples that contain $term_i$ and belong to $class_k$, to obtain the correlation between $term_i$ and $term_j$. The value of the heuristic function is zero for edges between terms of the same attribute.

The correlation is multiplied by the importance of $term_j$ in determining $class_k$: the factor $|term_j, class_k|$ is the number of uncovered training samples that contain $term_j$ and belong to $class_k$, and the factor $|term_j|$ is the number of uncovered training samples containing $term_j$, irrespective of their class label. Since the class label of the rule is committed before the construction of the rule antecedent, our heuristic function depends on the class chosen for a batch of ants.

The heuristic function quantifies the relationship of the term to be added with the most recently added term and also takes into consideration the overall discriminatory capability of the term to be added. An example is shown in Fig. 3. Suppose the specified class label is $class_k$ and an ant has just added $term_i$ to its rule. It is now looking to add another term from the set of available terms. We take the case of two competing terms ($term_{j1}$ and $term_{j2}$), which generalizes easily to more terms. We want the chosen term to maximize the correct coverage of the rule as much as possible; this is encouraged by the first part of the heuristic function. The heuristic value on the edge between $term_i$ and $term_{j1}$ is $|term_i, term_{j1}, class_k| / |term_i, class_k|$, and that on the edge between $term_i$ and $term_{j2}$ is $|term_i, term_{j2}, class_k| / |term_i, class_k|$. We also want to encourage the inclusion of a term that has better potential for inter-class discrimination; this is made possible by the second portion of the heuristic function. The two competing terms have the values $|term_{j1}, class_k| / |term_{j1}|$ and $|term_{j2}, class_k| / |term_{j2}|$. The two ratios, quantifying correct coverage and inter-class discrimination, are multiplied to give the heuristic value of the edge between the terms. The heuristic values calculated according to Eq. 4 are normalized before use: the values on edges originating from a $term_i$ are divided by the sum of all heuristic values on the edges between $term_i$ and the other terms.

Fig. 3 Suppose the specified class label is $class_k$, and an ant has just added $term_i$ to its rule. It is now looking to add another term from the set of available terms. For two competing terms ($term_{j1}$ and $term_{j2}$), we want the chosen term to maximize the correct coverage of the rule as much as possible, and we also want to encourage the inclusion of a term with better potential for inter-class discrimination. These two considerations are incorporated in the heuristic function

The correlation-based heuristic function has the potential to be effective in high-dimensional search spaces. The heuristic values are calculated only once, at the beginning of the REPEAT-UNTIL loop. The function assigns a zero value to combinations of terms that never occur together for the given class label, thus efficiently restricting the search space for the ants. In contrast, each ant of AntMiner keeps attempting to add terms to its rule until no term is left whose addition would keep the number of covered samples above the required minimum. In AntMiner+, every ant has to traverse the whole search space, and there are no such shortcuts.
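For clarity, a Python sketch of Eq. 4, including the row-wise normalization described above (the encoding of samples as sets of (attribute, value) pairs is our illustrative choice):

```python
def heuristic_matrix(samples, labels, terms, k):
    """Heuristic values per Eq. 4 for the committed class k. `samples` is
    a list of sets of (attribute, value) pairs, `labels` the class label
    of each (uncovered) sample, and `terms` the list of all terms."""
    def count(*required, cls=None):
        return sum(1 for s, l in zip(samples, labels)
                   if all(t in s for t in required) and (cls is None or l == cls))

    n = len(terms)
    eta = [[0.0] * n for _ in range(n)]
    for i, ti in enumerate(terms):
        n_ik = count(ti, cls=k)                    # |term_i, class_k|
        for j, tj in enumerate(terms):
            if ti[0] == tj[0]:                     # same attribute: stays zero
                continue
            n_j = count(tj)                        # |term_j|
            if n_ik and n_j:
                eta[i][j] = (count(ti, tj, cls=k) / n_ik) * (count(tj, cls=k) / n_j)
        row = sum(eta[i])
        if row:                                    # normalize row i, as in the text
            eta[i] = [v / row for v in eta[i]]
    return eta
```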

3.2.5 Heuristic function for the 1st layer of terms

The heuristic values on the edges between the Start node and the first layer terms are calculated on the basis of the following Laplace-corrected confidence [9] of a term:

$$ \eta_{Start,\,j} = \frac{\left| term_{j}, class_{k} \right| + 1}{\left| term_{j} \right| + m} $$
(5)

where $m$ is the number of classes present in the data set. This heuristic function has the advantage of penalizing terms that would lead to very specific rules and thus helps to avoid over-fitting. For example, if a term occurs in just one training sample and its class is the chosen class, then its confidence is 1 without the Laplace correction; with the correction, its confidence is $(1+1)/(1+2) \approx 0.67$ if there are two classes in the data. Before use, the values obtained by Eq. 5 are normalized by the sum of all such values between the Start node and the first layer of terms.
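A corresponding sketch of Eq. 5 (same illustrative data encoding as above):

```python
def first_layer_heuristic(samples, labels, terms, k, m):
    """Laplace-corrected confidence of Eq. 5 for edges from the Start
    node, normalized over the first layer; m is the number of classes."""
    raw = []
    for t in terms:
        n_j = sum(1 for s in samples if t in s)         # |term_j|
        n_jk = sum(1 for s, l in zip(samples, labels)
                   if t in s and l == k)                # |term_j, class_k|
        raw.append((n_jk + 1) / (n_j + m))
    total = sum(raw)
    return [v / total for v in raw]
```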

Equations 4 and 5 have not been used before in any AntMiner. All previous AntMiner versions calculate the heuristic value of a candidate $term_j$ without considering any previously chosen term. AntMiner utilizes a heuristic function based on the entropy of the terms and their normalized information gain:

$$ \frac{\log_{2} m - H(W \mid term_{j})}{\sum_{competing\_terms} \left( \log_{2} m - H(W \mid term_{j}) \right)} $$
(6)

where $W$ is the class attribute, whose entropy $H$ given $term_j$ is defined as

$$ H(W \mid term_{j}) = -\sum_{k = 1}^{m} P(k \mid term_{j}) \log_{2} P(k \mid term_{j}) $$
(7)

where $k$ is one of the values of $W$, and $P(k \mid term_j)$ is the probability of class label $k$ for instances containing $term_j$. AntMiner2 and AntMiner3 use

$$ \frac{\left| term_{j}, majority\_class(term_{j}) \right|}{\left| term_{j} \right|} $$
(8)

and AntMiner+ uses

$$ \frac{\left| term_{j}, class\_chosen\_by\_ant \right|}{\left| term_{j} \right|}. $$
(9)

3.2.6 Rule termination

An ant continues to add terms to the rule it is constructing and stops only when all the samples covered by the rule have a homogeneous class label (the label committed prior to the REPEAT-UNTIL loop). The other possibility of termination is that there are no more terms to add. Due to these termination criteria, a constructed rule might cover only one training sample.

Our rule construction stoppage criterion of a homogeneous class label differs from those of AntMiner and AntMiner+. In AntMiner, rule construction stops when only those terms are left whose addition would make the rule cover fewer training samples than a user-defined value called minimum samples per rule. In AntMiner+, one term from each attribute is added; however, each attribute has a "don't care" option that allows for its non-utilization in the rule.

3.3 Rule quality and pheromone update

3.3.1 Quality of a rule

When an ant has completed the construction of a rule, the rule's quality is calculated. The quality, Q, of a rule is computed from the confidence and coverage of the rule and is given by:

$$ Q = \frac{TP}{Covered} + \frac{TP}{N} $$
(10)

where TP is the number of samples covered by the rule that have the same class label as that of the rule’s consequent, Covered is the total number of samples covered by the rule, and N is the number of samples in the training set yet uncovered by any rule in the discovered rule set. The second portion is added to encourage the construction of rules with wider coverage. Equation 10 has been used in AntMiner+, whereas AntMiner uses sensitivity multiplied by specificity as the quality measure.
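In Python, Eq. 10 can be sketched as follows (samples are again sets of terms and the rule is a term list plus a class label; this encoding is our illustrative choice):

```python
def rule_quality(rule_terms, rule_class, samples, labels):
    """Q = TP/Covered + TP/N (Eq. 10), evaluated on the yet-uncovered
    training samples; N is the number of those samples."""
    covered = [l for s, l in zip(samples, labels)
               if all(t in s for t in rule_terms)]
    if not covered:
        return 0.0
    tp = sum(1 for l in covered if l == rule_class)
    return tp / len(covered) + tp / len(samples)
```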

3.3.2 Pheromone update

An example pheromone matrix is shown in Fig. 4. The pheromone matrix is asymmetric, and the pheromone values on the edges originating from a term are kept the same in all layers of the search space. In other words, the same matrix is valid for any two consecutive layers.

Fig. 4 The pheromone values for the example problem of Fig. 1. The elements of (a) are the pheromone values on edges from the Start node to the nodes of the first layer of terms. The elements of (b) are the pheromone values on edges between terms in any two consecutive layers. For example, the first-row elements are the pheromone values for edges from the term A1 to all other terms in the next layer. These values are the same for edges between the 1st and 2nd layers, the 2nd and 3rd layers, and so on. If an ant chooses the terms C1, A3, D1, and B1 for its rule, then the elements $\tau_{0,6}$, $\tau_{6,3}$, $\tau_{3,9}$, and $\tau_{9,4}$ are updated according to Eq. 11. The elements of rows S, C1, A3, and D1 are then normalized; the remaining rows stay unchanged

The pheromone values are updated so that the next ant can make use of this information in its search. The amount of pheromone on the edges between consecutive terms occurring in the rule is updated according to the equation:

$$ \tau_{ij}(t + 1) = (1 - \rho)\,\tau_{ij}(t) + \left( 1 - \frac{1}{1 + Q} \right) \tau_{ij}(t) $$
(11)

where $\tau_{ij}(t)$ is the pheromone value encountered by ant $t$ (the $t$-th ant of the REPEAT-UNTIL loop) on the edge between $term_i$ and $term_j$, ρ is the pheromone evaporation rate, and Q is the quality of the rule constructed by ant $t$.

Equation 11 follows the Ant System and was previously used in AntMiner3, but with a different equation for determining Q. The equation updates pheromone by first evaporating a percentage of the previously present pheromone and then adding an amount dependent on the quality of the recently constructed rule. If the rule is good, the pheromone added exceeds the pheromone evaporated, and the rule's terms become more attractive for subsequent ants. The evaporation in the equation improves exploration; otherwise, in the presence of a static heuristic function, the ants tend to converge quickly to the terms selected by the first few ants.

Note that our pheromone matrix is asymmetric, whereas the matrix used in the original AntMiner is symmetric. The matrix becomes asymmetric because, for a constructed rule in which $term_i$ occurs immediately before $term_j$, the pheromone on the edge from $term_i$ to $term_j$ is updated, but the pheromone on the edge from $term_j$ to $term_i$ is not. This is done to encourage exploration and discourage early convergence.

The next step is to normalize the pheromone values. Each pheromone value is normalized by dividing it by the sum of the pheromone values on the edges of all its competing terms. In the pheromone matrix (Fig. 4), the elements are normalized by dividing them by the sum of the values of the row to which they belong. Note that for rows in which no value changed during the ant run under consideration, normalization yields the same values as before. Referring to Fig. 1, the pheromone value of an edge originating from a term is normalized by dividing it by the sum of all pheromone values of the edges originating from that term. This process changes the amount of pheromone associated with terms that do not occur in the most recently constructed rule but compete with the selected terms. If the rule's quality is good and the pheromone on its terms has increased, the competing terms become less attractive for subsequent ants; the reverse holds if the rule is of poor quality. The normalization process is thus an indirect way of simulating pheromone evaporation. Note that in the original AntMiner, every element of the pheromone matrix is normalized by dividing it by the sum of all elements; this is unnecessary, tends to discourage exploration, and favors early convergence.
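A sketch of the update and normalization steps in Python (the path is the ordered list of node indices visited by the ant, Start node included; this rendering is ours):

```python
def update_pheromone(tau, path, q, rho=0.15):
    """Apply Eq. 11 along the ant's path and then re-normalize the changed
    rows. Only the traversed direction i -> j is updated, which is what
    makes the matrix asymmetric."""
    changed = set()
    for i, j in zip(path, path[1:]):
        tau[i][j] = (1 - rho) * tau[i][j] + (1 - 1 / (1 + q)) * tau[i][j]
        changed.add(i)
    for i in changed:                 # rows with no change keep their values
        row = sum(tau[i])
        if row:
            tau[i] = [v / row for v in tau[i]]
```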

3.4 Termination of REPEAT-UNTIL loop

The REPEAT-UNTIL loop constructs as many rules as the user-defined number of ants. After the construction of each rule, its quality is determined and the pheromones on the trail are updated accordingly; the pheromone values then guide the construction of the next rule. An early termination of this loop is possible if the last few ants have constructed the same rule. This implies that the pheromone values on a trail have become very high and convergence has been achieved: any further rule construction would most probably yield the same rule again. Hence, the loop is terminated prematurely. For this purpose, each constructed rule is compared with the previous one, and a counter is incremented if the two rules are the same. If the value of this counter reaches a user-defined parameter called "No_rules_converge", the loop is terminated. In our experiments, we use a value of 10 for this parameter.

This method of early termination of the REPEAT-UNTIL loop is also used by AntMiner. In AntMiner+, there is no provision for early termination of this loop; it terminates when the pheromone values on one path converge to $\tau_{max}$ and all other paths have $\tau_{min}$, as required by the MAX–MIN Ant System.

3.5 Pruning of best rule

Rule pruning is the process of finding and removing any irrelevant terms that might have been included in the constructed rule. Rule pruning has two advantages. First, it increases the generalization of the rule thus potentially increasing its predictive accuracy. Second, a shorter rule is usually simpler and more comprehensible.

In AntMiner-C, rule construction by an ant stops when all the samples covered by the rule have a homogeneous class label (the label committed prior to the REPEAT-UNTIL loop), or when there are no more terms to add. Rules whose construction stopped due to the homogeneity of the class labels of the covered samples are already compact, and any attempt at pruning is bound to cause non-homogeneity. However, rules that consist of one term from every attribute may be improved by pruning terms. Based on an experiment (Sect. 4.5), we do not prune all constructed rules but apply the pruning procedure only to the best rule discovered during an iteration of the WHILE loop. This means that pheromone updates are done with unpruned rules.

The rule pruning procedure starts with the full rule. It temporarily removes the first term and determines the quality of the resulting rule. It then replaces the term and temporarily removes the second term, again calculating the quality of the resulting rule. This process continues until all the terms present in the rule have been assessed. After this assessment, if there is no term whose removal improves or maintains the quality, then the original rule is retained. However, if there is a term whose removal improves (or maintains) the quality of the rule, then that term is permanently removed; if there are two or more such terms, the term whose removal most improves the quality is removed. The shortened rule is again subjected to the pruning procedure. The process continues until any further shortening is impossible (removal of any term leads to a decrease in quality) or only one term remains in the rule.
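A compact Python rendering of this pruning loop (quality_fn stands for the rule quality of Eq. 10 evaluated on the uncovered training samples; the names are our placeholders):

```python
def prune_rule(rule_terms, rule_class, quality_fn):
    """Repeatedly drop the single term whose removal most improves (or
    maintains) rule quality; stop when every removal hurts or only one
    term remains."""
    terms = list(rule_terms)
    best_q = quality_fn(terms, rule_class)
    while len(terms) > 1:
        trials = [(quality_fn(terms[:i] + terms[i + 1:], rule_class), i)
                  for i in range(len(terms))]
        q, i = max(trials)
        if q < best_q:          # no removal improves or maintains quality
            break
        best_q = q
        del terms[i]            # permanently remove the least useful term
    return terms
```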

Rule pruning is also used by AntMiner and AntMiner+, though the rule quality criterion used in AntMiner is different. Furthermore, in AntMiner, each constructed rule is subjected to rule pruning (prior to the pheromone update), whereas we prune only the best rule among those found during an execution of the REPEAT-UNTIL loop, and our pheromone update is prior to and independent of rule pruning. AntMiner+ also prunes the best rule; however, it is pertinent to note that AntMiner+ has several iterations of the REPEAT-UNTIL loop, and in each iteration there are 1,000 ant runs, the best rule of which is pruned. In other words, AntMiner+ prunes several rules for each rule inserted in the rule set, whereas we prune only one rule.

3.6 Final rule set

The best rule is placed in the discovered rule set after pruning, and the training samples correctly covered by the rule are flagged (i.e., removed) and have no role in the discovery of other rules. The algorithm checks whether the uncovered training samples are still greater than the value specified for “Max_uncovered_samples”. If that is the case, a new iteration of the WHILE loop starts for discovery of the next rule. If not, a final default rule is added to the rule set, and the rule set may optionally be pruned of redundant rules (see Sect. 4.6 for more details). The rule set may then be used for classifying unseen samples. These aspects are discussed here.

3.6.1 Early stoppage of algorithm

The algorithm can be continued until the training data set is empty. However, this leads to rules which usually cover one or two training samples and thus over-fit the data. Such rules usually do not have generalization capabilities. Some of the methods to avoid this phenomenon are the following:

  • Use a separate validation set to monitor the training. The algorithm can be stopped when the accuracy on the validation set starts to dip.

  • Specify a threshold on the number of uncovered training samples. If, after an iteration of the REPEAT-UNTIL loop, the samples remaining in the training set are equal to or fewer than this threshold, then stop the algorithm. The threshold can also be defined as a percentage of the samples present in the initial training set.

  • Specify a threshold on the maximum number of rules to be found.

The validation set approach is not appropriate for small data sets, because the total data samples must be divided into training, validation, and test sets.

We have opted for the second option and use a fixed threshold for all the data sets. AntMiner and AntMiner+ both employ early stopping. We use the same option for early stopping as AntMiner, while AntMiner+ uses the validation-set technique for large data sets (more than 250 samples) and a threshold of 1% remaining samples for small data sets.

3.6.2 Default rule

A final default rule is added at the bottom of the rule set. This rule has no conditions and a consequent part only. The assigned class label for this rule is the majority class label of the remaining uncovered samples of the training set. The default rule is used by all AntMiners, and our method is the same as that used by AntMiner. AntMiner+, however, assigns the default rule's class label on the basis of the majority class of the complete set of training samples.

3.6.3 Pruning of rule set

Some rules in the final rule set may be redundant: the training samples correctly classified by a rule may all be correctly classified by one or more rules occurring later in the rule set. In such cases, the earlier rule is redundant, and its removal will not decrease the accuracy of the rule set. Furthermore, the removal of redundant rules increases the comprehensibility of the rule set. We have developed a procedure that attempts to reduce the number of rules without compromising accuracy.

The rule set pruning procedure is applied to the final rule set that includes the default rule. Each rule is a candidate for removal. The procedure checks the first rule, and if removing it from the rule list does not decrease the accuracy on the training data, then it is permanently removed. After dealing with the first rule, the second rule is checked, and it is either retained or removed. Each rule is subjected to the same procedure on its turn.
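A sketch of this procedure in Python (accuracy_fn applies the ordered rule list as in Sect. 3.6.4 and returns the training accuracy; both names are our placeholders):

```python
def prune_rule_set(rules, accuracy_fn, training):
    """Walk the rule list in discovery order and drop any rule whose
    removal does not decrease accuracy on the training data."""
    kept = list(rules)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if accuracy_fn(trial, training) >= accuracy_fn(kept, training):
            kept = trial          # redundant: later rules cover its samples
        else:
            i += 1                # needed: keep it and check the next rule
    return kept
```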

Our experiments (Sect. 4.6) show that the technique is effective in reducing the number of rules, and the remaining rules also tend to have fewer terms per rule on average. However, the strategy is greedy, and although it makes the final classifier relatively fast, it tends to decrease the predictive accuracy for some data sets. The increased accuracy on other data sets, however, makes us believe that the pruning sometimes results in a rule set with superior generalization capability. Hence, pending further investigation, we have made this procedure optional.

3.6.4 Use of discovered rule set for classifying unseen samples

A new test sample unseen during training is classified by applying the rules in order of their discovery. The first rule whose antecedents match the new sample is fired, and the class predicted by the rule’s consequent is assigned to the sample. If none of the discovered rules are fired, then the final default rule gets activated.
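As a sketch, with a rule represented as a (terms, label) pair and the default rule as an empty term set (our illustrative encoding):

```python
def classify(sample_terms, rules):
    """Fire the first rule (in discovery order) whose antecedent matches
    the sample; the condition-free default rule matches everything."""
    for terms, label in rules:
        if all(t in sample_terms for t in terms):
            return label

# Example: two discovered rules followed by the default rule.
rules = [({("Color", "red"), ("Size", "small")}, "A"),
         ({("Color", "blue")}, "B"),
         (set(), "A")]                       # default rule
print(classify({("Color", "blue"), ("Size", "small")}, rules))  # -> "B"
```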

3.7 Differences with other AntMiners

AntMiner-C has several similarities to previously proposed AntMiners, yet it is also different in many ways. The main differences are presented here.

3.7.1 Differences with AntMiner, AntMiner2, and AntMiner3

AntMiner-C has the following:

  • prior selection of class label,

  • new heuristic measure, Eq. 4,

  • pruning of best rule only,

  • rule construction termination on the basis of class homogeneity of samples covered by that rule,

  • an asymmetric pheromone matrix,

  • a different pheromone update equation, Eq. 11 (an exception is AntMiner3, which has the same update equation),

  • a different method of pheromone normalization, and

  • a different equation for assessing the quality of rule found and for rule pruning, Eq. 10.

In addition to these differences, AntMiner3 uses Eq. 3 in conjunction with a transition rule for the selection of the next term that differs from ours.

3.7.2 Differences with AntMiner+

AntMiner-C has the following:

  • a different search space,

  • a different method of class selection that is done only once and subsequently the class remains fixed for all ant runs made for the corresponding antecedent part of the rule,

  • no exclusion of the majority class from the class selection choices,

  • no provision of selection of alpha and beta parameters,

  • no mechanism by which ants can include interval conditions in their rules; continuous variables have to be discretized in a preprocessing step,

  • state transition (next term selection) and pheromone update according to Ant System,

  • a new heuristic measure, Eq. 4,

  • a criterion for early stopping of rule construction,

  • a provision for early stopping of algorithm according to a user-defined maximum number of uncovered training samples, and

  • default rule class label assignment according to the majority class of remaining uncovered samples of the training set.

Furthermore, AntMiner-C has only one complete execution of the REPEAT-UNTIL loop for the extraction of a rule, and each iteration of the loop consists of only one ant run. For example, we use 1,000 iterations in our experiments and find that, on average, only 18% of these iterations are executed, due to the early termination of the loop upon convergence (Sect. 4.2). The best solution is pruned after exiting from the REPEAT-UNTIL loop. In AntMiner+, the REPEAT-UNTIL loop is executed until the pheromone values converge to $\tau_{max}$ on one path and $\tau_{min}$ on all other paths. There are multiple ants per iteration of the loop (1,000 ants are used in the experiments reported in [16]), and the best solution of each iteration is pruned. In other words, several solutions are pruned before exiting from the REPEAT-UNTIL loop.

4 Experiments and analysis

In this section, we report our experimental setup and the results obtained.

4.1 Experimental setup and results

In our experiments, we use twelve data sets obtained from the UCI repository [23]. The main characteristics of the data sets are summarized in Table 1. The data sets in this suite have reasonable variety in terms of number of attributes, instances, and classes and are commonly used in evaluating algorithms.

Table 1 Characteristics of data sets used in experiments

The AntMiner-C algorithm works with categorical attributes, and continuous attributes need to be discretized in a preprocessing step. We use the unsupervised discretization filter of the Weka-3.4 machine learning tool [9] for discretizing continuous attributes. This filter first computes the intervals of continuous attributes from the training data set and then uses these intervals to discretize them.

We compare the results of our algorithm with those of AntMiner, C4.5 [24, 25], Ripper, logistic regression, and support vector machines (SVM) [26]. We implemented AntMiner ourselves in Matlab; for the other compared algorithms, we use the Weka machine learning tool [9]. Our performance metrics for comparing the rule sets obtained by competing algorithms are predictive accuracy, number of rules, and number of terms per rule.

The experiments are performed using a tenfold cross-validation procedure. A data set is divided into ten equally sized, mutually exclusive subsets. Each subset is used once for testing while the other nine are used for training. The results of the ten runs are then averaged, and this average is reported as the final result.

We have implemented the AntMiner-C algorithm in Matlab 7.0. The experiments are run on a machine with 1.75 GHz dual processors and 1 GB RAM. The running time for a data set depends on its number of attributes and instances. For example, one fold takes almost 3 min for the iris data set, which has 150 instances and 4 attributes, and almost 15 min for the dermatology data set, which has 34 attributes and 366 instances.

AntMiner-C has six user-defined parameters: number of ants, maximum uncovered samples, evaporation rate, convergence counter threshold, alpha, and beta. The values of these parameters are given in Table 2. The option of pruning the rule set is not used in any experiment except that of Sect. 4.6. The same parameter values are retained when obtaining results for AntMiner. These values have been chosen because they seem reasonable and have been used by other AntMiner versions reported in the literature [12–17]. In Sect. 4.7, we report the results of some limited experiments that give an indication of the influence of α, β, and ρ on the performance of AntMiner-C; we found that the predictive accuracy results are highest when ρ = 0.15, α = 1, and β = 3.

Table 2 Parameters used in experiments

The predictive accuracies, average number of rules per discovered rule set, and average number of terms per rule are shown in Tables 3 and 4. All results are obtained using tenfold cross-validation.

Table 3 Average predictive accuracies obtained using 10-fold cross validation
Table 4 Average number of rules per discovered rule set and average number of terms per rule

The results indicate that AntMiner-C achieves a higher accuracy rate than the five compared algorithms for most of the data sets. However, the number of rules and the number of terms per rule generated by our technique are mostly higher; the reason is that we allow the generation of rules with low coverage.

4.2 Convergence speed

In Table 2, we specified the value of the 'Number of ants' parameter as 1,000; that is, a maximum of 1,000 rules can be constructed, out of which the best one is placed in the discovered rule set. In practice, on average, far fewer ants are used, because the REPEAT-UNTIL loop is terminated once convergence is achieved (the recently constructed rules are all duplicates of one another; Sect. 3.4). For this purpose, the threshold parameter 'Number of rules converged' (10 in our experiments) is used. The average of the actual number of ants used per execution of the REPEAT-UNTIL loop is reported in Table 5. Convergence speed is an important aspect, particularly for large and high-dimensional data sets.

Table 5 Average number of ant runs per execution of REPEAT-UNTIL loop

4.3 Class choice prior to or after rule construction

In this experiment, we deviate from our algorithm by first constructing the rule antecedent and then assigning the rule consequent, namely the majority class label of the samples covered by the rule. Since our heuristic function (Eq. 4) requires prior commitment of the class label, we have to modify it for this particular experiment. The heuristic function used is

$$ \eta_{ij} = \frac{\left| term_{i}, term_{j} \right|}{\left| term_{i} \right|}. $$
(12)

The heuristic function used for the first layer of terms is

$$ \eta_{j} = -\sum_{k = 1}^{m} \frac{\left| term_{j}, class_{k} \right|}{\left| term_{j} \right|} \log_{2} \frac{\left| term_{j}, class_{k} \right|}{\left| term_{j} \right|}. $$
(13)

Experimental results are shown in Table 6. The predictive accuracy is lower when compared to our original algorithm. In our opinion, the prior commitment of class label focuses the search (all the ants in an iteration are searching for the best version of the same rule) resulting in more appropriate rules.

Table 6 Results obtained by first constructing rule antecedent and then choosing class label

4.4 Termination of rule construction

In our algorithm, terms are added to a rule until all the samples covered by it have the same class label or until there are no more terms to add. There is no restriction that a rule should cover a minimum number of samples; a constructed rule might cover only one sample. In this section, we report the results of an experiment in which we impose the restriction that a rule being created must cover at least a minimum number of samples. We set this threshold to 10, the value used in [12] for the same purpose. A term is added to a rule only if the rule still covers at least 10 samples after the addition; as a result, all the constructed rules cover a minimum of ten samples. The results of this experiment are shown in Table 7. The predictive accuracy decreases for most of the data sets. The reason is that such a restriction may exclude a very effective term that would have increased the accuracy of the rule. Another problem with this approach is that data sets differ in dimensionality and number of samples, which makes it difficult to decide the minimum number of samples for a given data set. Our experiment, though limited, provides some evidence that the strategy of rule construction by adding terms without any restriction is effective. However, we note that an advantage of the restriction is that fewer rules are produced.

Table 7 Results obtained by imposing restriction that a constructed rule must cover a minimum of 10 samples

4.5 Rule pruning: all rules versus only the best rule

In this experiment, we prune each rule constructed by the ants prior to the pheromone update; all other parameters are kept the same. The results are shown in Table 8. The predictive accuracy is lower for nine out of the twelve data sets, but fewer rules are generated. The number of terms per rule is almost the same.

Table 8 Result of pruning each constructed rule and pruning only the best rule

The reason for the dip in accuracy is that many rules' construction is stopped due to the homogeneity of the class labels of the samples covered. These rules are already compact, and any attempt at pruning causes non-homogeneity. The criterion for assessing the quality of a rule (Eq. 10) has two components: confidence and coverage. When pruning a rule that covers samples of a homogeneous class label, the confidence decreases, but this may be compensated by an increase in coverage; the rule may thus get pruned on the basis of Eq. 10 with lower confidence but better coverage than before. Such pruned rules cause a pheromone increase on terms that may mislead future ants. More importantly, the final best rule selected from the pruned rules may not have the same discriminatory capability as the final rule selected from unpruned rules.

4.6 Rule set pruning

This experiment tests the effect of rule set pruning (Sect. 3.6.3). The algorithm is run with and without the procedure, and the results are shown in Table 9. From the table, we see that the accuracy sometimes improves and sometimes decreases. However, the number of rules consistently decreases when pruning is used. The remaining rules are simpler, and the average number of terms per rule also decreases in most cases.

Table 9 Comparison of results with and without pruning of redundant rules from the rule set

4.7 Parameter settings

AntMiner-C has the following user-defined parameters.

  • number of ants,

  • number of rules converged (for early termination of REPEAT-UNTIL loop),

  • number of uncovered samples (for termination of WHILE loop),

  • the powers of pheromone and heuristic values (α, β) in the probabilistic selection Eq. 3, and

  • the pheromone evaporation rate (used in Eq. 11).

According to our experiments (Sect. 4.2), 1,000 ants is an adequate value, because it is used in combination with the early-convergence option of the REPEAT-UNTIL loop. Furthermore, the number of rules converged is an indication of pheromone saturation and does not seem to be a sensitive parameter as long as it is not very small. Likewise, the number of uncovered samples used for termination of the WHILE loop serves to avoid rules with very low coverage and does not seem sensitive if it is restricted to a small value. The relationship between α, β, and ρ is complex and needs to be analyzed experimentally.

To get some indication of the influence of α, β, and ρ on the performance of AntMiner-C, different combinations of these parameters are used, and the accuracy results are reported in Table 10. These results are obtained by running AntMiner-C for one fold only; accuracy is highest when ρ = 0.15, α = 1, and β = 3.

Table 10 One fold accuracy results for different combinations of ρ, α, and β

4.8 Number of probability calculations

An important aspect of AntMiner-C is its potential for extracting rules from high-dimensional data sets in reasonable time. The nature of the heuristic function (Eq. 4), combined with the fact that the class is selected prior to rule construction, reduces the search space considerably: the function assigns a value of zero to every pair of terms that does not occur together for the given class in the training samples. The heuristic matrix is calculated once and remains fixed for one complete execution of the REPEAT-UNTIL loop.

The most costly computation in the algorithm, performed repeatedly, is the probability equation (Eq. 3). A counter of its invocations gives a direct indication of the amount of search space that an ant has had to consider. We present some probability equation usage counts in Fig. 5 that indicate the reduction in search space of AntMiner-C and also highlight its performance improvement over AntMiner. The data sets used in this limited experiment are Iris, Wine, and Dermatology, with 4, 13, and 33 attributes, respectively. The experiment was run for one fold only. The ratios of overall probability calls of AntMiner to AntMiner-C for the three data sets are 11.07, 6.06, and 8.35, respectively.

Fig. 5 Probability equation usage counts for Iris, Wine, and Dermatology with 4, 13, and 33 attributes, respectively. The experiment was run for one fold only. The results provide an indication of the reduction in search space of AntMiner-C and also highlight its performance improvement over AntMiner

5 Conclusions and future work

Previously, ACO has been used to discover classification rules from data sets. This paper proposes a new algorithm whose main highlight is the use of a heuristic function based on the correlation between the most recently added term and the term to be added to the rule. We evaluate the performance of the algorithm on a suite of twelve data sets. The experimental results show that the proposed approach achieves a higher accuracy rate and has fast convergence speed.

For future work, the optimal values of parameters can be investigated more thoroughly and some general guidelines may be proposed by analyzing the experimental results. Another direction of research can be to use a heuristic function that considers the correlation between all of the selected terms and the next term to be added. For this purpose, a new search space will have to be proposed.

Finally, we would like to refer to the knowledge fusion problem that deals with the cohesion of extracted knowledge with the domain knowledge provided by experts. The rules extracted by the algorithm for a given data set may not completely adhere to the available domain knowledge. There might be missing, redundant, and unjustifiable (misleading) rules. In [27], the AntMiner+ is extended to incorporate hard constraints of the domain knowledge by modifying the search space and the soft constraints by influencing the heuristic values. The same modifications are possible in AntMiner-C and seem to be an interesting direction for future study.