
1 Introduction

Learning classification rules is one of the mature and well-studied tasks in machine learning. The popularity of rules comes from the fact that they directly provide a symbolic representation of knowledge discovered from data, which is more comprehensible and human-readable than other representations [5]. Many algorithms for inducing rules have already been introduced (for a review see, e.g., [5]). Nevertheless, such aspects of data complexity as class imbalance still pose difficulties [11]. Most standard rule algorithms are biased towards the majority classes and tend to neglect the minority class. Two kinds of reasons for the poor performance of rule-based classifiers on imbalanced data are usually pointed out – algorithmic ones and data-level ones [11, 16].

Some extensions of rule classifiers for class imbalance have already been proposed; for a review see [16]. However, most of them address only a single one, or at most a few, of the algorithmic or data-related factors. In [16] we introduced a new rule induction algorithm called BRACID (Bottom-up induction of Rules And Cases for Imbalanced Data), which attempts to deal with more of these factors at once. Previous comparative experiments clearly demonstrated that, with respect to predictive measures, the classifier induced by BRACID significantly outperformed classifiers generated by the best standard rule learning algorithms as well as by rule extensions specialized for class imbalance [16]. On the other hand, BRACID may generate too many rules (see also the experiments in Sect. 6). As this restricts human experts’ ability to analyze or interpret the rules, we are looking for a post-processing approach that could identify the most valuable ones. The first attempt, recently undertaken in [18], has shown that it is possible to select rules characterized by high support while still maintaining sufficient predictive performance.

Nevertheless, focusing attention on the most interesting rules should also take into account characteristics other than rule support alone. In particular, it is important not to neglect the descriptive abilities of rules, which are often overshadowed by the need to increase predictive performance. Note that the predictive and descriptive aspects often stand in opposition to each other [13, 20]. However, when human experts seek a compact knowledge representation, improving the interpretability of each single rule can justify even some losses in predictive performance.

Establishing when rules are interesting to users touches both subjective and objective aspects [4]. In this paper we follow the latter and consider rule interestingness measures, which are often applied to filter sets of rules [7, 15]. They are calculated from learning data and aim at quantifying the relationship between a rule’s premise and its conclusion. A particular group of these measures, called Bayesian confirmation measures, is well suited to supporting rule interpretability, as it favors rules for which the probability of the conclusion given the premise is greater than the prior probability of the conclusion itself [3, 10]. In other words, confirmation measures promote rules in which the premise adds considerable credibility to the conclusion.

Although the concept of confirmation was first considered by philosophers of science in a very different context (see, e.g., [2, 3, 19]), it has been adapted to rule interestingness measures, mainly for filtering association rules [8]. Nevertheless, these measures have not yet been considered for imbalanced data. Their application should prove particularly useful in this context, since confirmation measures consider the probability of each conclusion separately and thus naturally relate to the imbalance ratio.

For the purpose of this paper we focus on two particular confirmation measures, called S [2] and N [19]. We have chosen them from the wider collection of confirmation measures discussed in the literature because of the desirable properties they possess [9, 10]. In our opinion, these properties should positively influence the interpretability of rules [10].

The main aim of this paper is to introduce an approach that uses the confirmation measures S and N to post-prune rules induced by BRACID. We focus this study on BRACID only, as the experiments in [16] have shown that it outperformed the other best rule-based classifiers over a large collection of imbalanced data. The new approach should reduce the number of rules while at the same time improving the values of rule interestingness measures, especially for the minority class.

The paper is organized as follows. First, we briefly review related work in Sect. 2. Section 3 introduces the concept of Bayesian confirmation and defines two particularly valuable representatives of confirmation measures: S and N. The BRACID rule induction algorithm is summarized in Sect. 4. The three new rule filtering strategies are introduced in Sect. 5. Their usefulness in improving BRACID rules is evaluated in several experiments, described in Sect. 6. The experimental results are discussed in Sect. 7. In the final section we draw conclusions and outline lines of future research.

2 Related Works on Rule Evaluation and Filtering

Many algorithms for constructing rule-based classifiers employ rule pruning. Representative approaches are Grow, IREP, or RIPPER; for a review, see Chap. 9 in [5]. However, these approaches follow the classification perspective of rule induction, and their pruning is oriented toward good predictive ability of the complete set of rules. Different objectives are stated in descriptive knowledge discovery, which aims at discovering patterns and regularities (or sometimes exceptions) in data that are potentially interesting and useful to different kinds of users [20].

The descriptive rule discovery perspective, which is considered in this paper, requires algorithmic strategies other than those of the classification perspective, e.g., classification versions of association rules, richer sets of satisfactory rules [20], or the rule representations of subgroup discovery [6]. However, these algorithms often generate such a high number of rules that users or domain experts cannot inspect them. Thus, users lose the opportunity to interpret the results, find interesting rules, or further modify them to obtain a more accurate classifier [12].

To help the user find relevant knowledge inside huge rule sets, rule interestingness measures have been proposed (for a review see [7, 15]). They are divided into two categories: subjective (user-oriented) and objective (data-oriented) ones. Subjective measures take into account the user’s goals, background knowledge, or beliefs about the data domain [4]. Objective measures are not application- or user-specific and depend only on the raw data. Many of them are defined on the basis of contingency tables summarizing the data set (see the next section). Support and confidence are the most common interestingness measures; they are often applied in the process of rule generation (e.g., the Apriori search for association rules) and sometimes in post-filtering [1]. Despite their popularity, other measures may be better suited to dealing with larger sets of rules and to selecting the most relevant (i.e., interesting) candidates. Numerous rule interestingness measures have been proposed (lists can be found, for example, in [7, 12, 13, 15]), and choosing the best one for a given problem is not a trivial task.

In general, interestingness measures are used to assess, rank (sort), and filter rules according to various points of view [7]. For these aims, experts either select single measures or consider their aggregated, more complex versions. For instance, [13] describes a case study in which several measures were used and the results were interpreted by an expert, with a recommendation to use weighted relative accuracy. A more explicitly multiple-criteria analysis was advocated by Bayardo and Agrawal, who proposed to analyze a partial ordering of the rules (instead of the typical total ordering) according to different interestingness measures [1]. The authors of [12], on the other hand, discussed further related proposals and put forward a subset of measures based on specialists’ preferences; see also [14]. The authors of [9] analyzed properties of interestingness measures and showed that some measures may be preferred to others. Furthermore, other researchers have looked for concise representations (e.g., closed itemsets in association mining), rule summaries, groupings of similar rules (with respect to rule condition parts or to subsets of covered examples), or have developed interactive visualization tools.

Nevertheless, the choice of interestingness measures still depends on the expert’s preferences and the problem at hand. In this paper, following the motivations presented in Sect. 1, we direct our interest to a particular class of measures based on Bayesian confirmation. Although they have recently been used to filter association rules [8], they have not yet been considered for classification rules in class-imbalanced tasks.

3 Bayesian Confirmation Measures

To present Bayesian confirmation measures, basic notation is introduced first. Rules are consequence relations represented as IF (condition part) THEN (target class), where the condition part (premise) is a conjunction of elementary tests on values of the attributes characterizing the learning examples, and the target class points to one of the predefined values of the decision attribute (represented in the rule’s conclusion). For simplicity, a rule will be denoted as \(E \rightarrow H\) or simply as R.

Interestingness measures quantify the relationship between E and H and are usually defined as functions of four non-negative values that can be gathered in a \(2\times 2\) contingency table (see Table 1). For a particular data set, a is the number of examples that satisfy both the rule’s premise and its conclusion, b is the number of examples for which only H is satisfied, c the number for which only E is satisfied, and d the number for which neither is satisfied. For instance, the support of rule \(E \rightarrow H\) is defined as \(sup(H,E)=a\) and its confidence as \(conf(H,E)=a/(a+c)\). Note that a, b, c and d can also be regarded as frequencies for estimating probabilities, e.g. \(P(E)=(a+c)/n\) or \(P(H)=(a+b)/n\), where \(n=a+b+c+d\).

Table 1. An exemplary contingency table of the rule’s premise and conclusion
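$$\begin{array}{c|cc|c} & H & \lnot H & \Sigma \\ \hline E & a & c & a+c \\ \lnot E & b & d & b+d \\ \hline \Sigma & a+b & c+d & n \end{array}$$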

Among the many interestingness measures, we direct our attention to a particular group of Bayesian confirmation measures (or simply confirmation measures). All these measures are characterized by the property of Bayesian confirmation, which requires that an interestingness measure c(H,E) obtains: positive values when \(P(H|E) > P(H)\); zero when \(P(H|E) = P(H)\); and negative values when \(P(H|E) < P(H)\).

Thus, confirmation measures are designed to indicate, simply through the sign of their value, the confirmatory, neutral, or disconfirmatory impact of the rule’s premise on its conclusion. Confirmation, interpreted as an increase in the probability of the conclusion H provided by the premise E, is the desirable situation. Let us stress that basic interestingness measures such as support or confidence do not possess the property of confirmation; thus, their utility for the descriptive perspective of knowledge discovery is lower.

The difference in semantics and utility between confidence on the one hand, and measure S(H,E) (defined below in Eq. 1), a representative of confirmation measures, on the other, can be shown with the following illustrative example. Consider the possible results of rolling a die: 1, 2, 3, 4, 5 or 6, and let the conclusion be H = “the result is divisible by 2”. Given two different potential rule premises:

  • \(E_{1}\) = “the result is a number from the set {1, 2, 3}”,

  • \(E_{2}\) = “the result is a number from the set {2, 3, 4}”,

we get, respectively: \(conf(H, E_{1}) = 1/3\), \(S(H, E_{1}) = -1/3\), and \(conf(H, E_{2}) = 2/3\), \(S(H, E_{2}) = 1/3\). This example clearly shows that the values of confirmation measures have a more useful interpretation than confidence. In particular, in the case of rule \(E_{1} \rightarrow H\), the premise actually disconfirms the conclusion, as it reduces the probability of conclusion H from \(P(H) = 1/2\) to \(P(H|E_{1}) = conf(H, E_{1}) = 1/3\). This fact is expressed by the negative value of confirmation measure S(H,E) (and, in fact, of any confirmation measure), but it cannot be concluded by observing the value of confidence alone.

Note that the property of confirmation leaves plenty of room for defining various, non-equivalent confirmation measures (for a review see [3, 9]). To guide the user towards measures that reflect their expectations, researchers have proposed special properties of confirmation measures. These properties express requirements on a measure’s behavior in certain situations. Taking into account the possession of desirable properties, we focus our further interest on only two representatives of confirmation measures. The chosen measures S(H,E) [2] and N(H,E) [19], both ranging from \(-1\) (complete disconfirmation) to \(+1\) (complete confirmation), are defined as:

$$\begin{aligned} S(H,E)=P(H|E)-P(H|\lnot E)=\frac{a}{a+c}-\frac{b}{b+d}, \end{aligned}$$
(1)
$$\begin{aligned} N(H,E)=P(E|H)-P(E|\lnot H)=\frac{a}{a+b}-\frac{c}{c+d}. \end{aligned}$$
(2)
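For concreteness, the sketch below (ours, not part of the original study; all names are illustrative) computes support, confidence and both confirmation measures from the contingency counts of Table 1, and reproduces the die example above:

    def sup(a, b, c, d):
        # Rule support sup(H,E) = a: examples satisfying both premise and conclusion.
        return a

    def conf(a, b, c, d):
        # Rule confidence conf(H,E) = P(H|E), estimated as a / (a + c).
        return a / (a + c)

    def S(a, b, c, d):
        # Confirmation measure S(H,E) = P(H|E) - P(H|not E)  (Eq. 1).
        return a / (a + c) - b / (b + d)

    def N(a, b, c, d):
        # Confirmation measure N(H,E) = P(E|H) - P(E|not H)  (Eq. 2).
        return a / (a + b) - c / (c + d)

    # Die example: H = "result divisible by 2", E1 = {1, 2, 3}, hence
    # a = |{2}| = 1, b = |{4, 6}| = 2, c = |{1, 3}| = 2, d = |{5}| = 1.
    assert abs(conf(1, 2, 2, 1) - 1 / 3) < 1e-9
    assert abs(S(1, 2, 2, 1) + 1 / 3) < 1e-9  # S(H, E1) = -1/3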

Among the properties that valuable confirmation measures should satisfy, let us mention the property of monotonicity M [10] and the property of maximality / minimality [8]. Monotonicity M favors measures that are non-decreasing with respect to a and d, and non-increasing with respect to b and c. It is intuitively clear that we would like higher values of a measure for rules that are supported by a greater number of positive examples (i.e., an increase of a), and exactly the opposite when the number of counter-examples grows (i.e., an increase of c). The property of maximality / minimality, on the other hand, requires that a measure obtains its maximal value if and only if \(b=c=0\), and its minimal value if and only if \(a=d=0\). It is thus a property concentrating on the behavior of measures in extreme cases. It was verified in [9, 10] that the measures S(H,E) and N(H,E) are among the few confirmation measures that satisfy both monotonicity M and maximality / minimality.
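These extreme cases can be checked numerically with the functions from the sketch above (the particular counts are arbitrary):

    assert S(5, 0, 0, 7) == 1 and N(5, 0, 0, 7) == 1    # b = c = 0: maximal value
    assert S(0, 3, 4, 0) == -1 and N(0, 3, 4, 0) == -1  # a = d = 0: minimal value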

We have focused our study on these two measures also because the interpretation of their definitions is rather straightforward (contrary to some other confirmation measures possessing M and maximality / minimality, e.g., measure \(c_{3}(H,E)\) [9]Footnote 1). Measure S(H,E) expresses how much more probable H is with E than with \(\lnot E\). To use a medical example, e.g., if some symptoms occur then a certain disease is diagnosed, we could say that measure S(H,E) assesses how much more probable the disease becomes when we know that the symptoms occurred (rather than knowing that they did not occur). In the case of measure N(H,E), we would say that it expresses how much more probable some symptoms are for a certain disease than when the disease is excluded (does not occur). Measures S(H,E) and N(H,E) are thus somewhat complementary, as they look at rules from different perspectives: that of the rule’s premise and that of the rule’s conclusion.

Summing up, taking into account both the possession of desirable properties and the interpretation of the measures’ definitions, this study focuses only on the application of confirmation measures S(H,E) and N(H,E).

4 Rule Induction with BRACID

BRACID is an algorithm specialized in learning rules from imbalanced data; for its details see [16]. Here, we summarize its main characteristics:

  • Hybrid representation of rules and instances: BRACID creates a general description (rules) in regions where the examples form large disjuncts, and uses instances to better approximate the more difficult decision boundaries. BRACID allows some (difficult) examples to remain not generalized into rules; they can be treated as maximally specific rules.

  • Bottom-up rule induction: Unlike the top-down strategy typical for rule induction, BRACID follows a bottom-up (specific-to-general) strategy, which is more appropriate for imbalanced data. It starts from the set of most specific rules, each covering a single learning example – called the seed of the rule. Then, in every iteration, each rule is generalized in the direction of the nearest neighbour example from the same class, provided that this does not decrease the classification abilities of the whole rule set. The procedure is repeated until no rule in the set can be further generalized.

  • Resignation from the greedy, sequential covering technique: As this technique, popular in typical rule learning algorithms, increases data fragmentation and is problematic for minority examples, BRACID takes all the learning examples into account when evaluating a new rule candidate.

  • Handling borderline minority examples: The types of learning examples are evaluated, and rules are generated differently depending on the type of the seed example of a rule [17]. Minority examples belonging to the borderline region are allowed to be generalized into more than one rule, to lessen the dominance of the majority class in this region.

  • Handling noisy examples from the majority class: Noisy majority examples present inside minority class regions may hinder the induction of general minority rules. BRACID has an embedded mechanism for detecting such examples and removing them from the learning data set.

  • A less biased classification strategy: BRACID employs a classification strategy based on the nearest rules, to diminish the domination of strong majority rules when resolving conflicts arising when a new instance matches the condition parts of many rules.

Note that some mechanisms employed in this algorithm lead to an increase in the number of rules (mainly the bottom-up rule induction and the generation of more rules in borderline regions). However, the increased number of rules for the minority class, coupled with increased rule support, is beneficial for the final classification. The experimental evaluation of BRACID’s classification performance indeed showed that it significantly outperformed many standard rule classifiers (induced by RIPPER, PART, C4.5rules, and others) as well as other rule approaches specialized for class imbalance, such as modifications of rule search and classification strategies, or the best standard algorithms (e.g., PART) combined with the SMOTE method for transforming class distributions [16].

5 Selecting Rules with Respect to Confirmation

We aim to select a subset of the induced rules with respect to appropriate rule evaluation measures. In [18] we already postulated that it would be profitable to find rules which cover diverse sets of examples referring to different sub-parts of the class distribution. Focusing the expert’s attention on a subset of rules with such characteristics should be particularly beneficial for the minority class, which is often decomposed into many rare sub-concepts.

Recall that several post-pruning techniques have already been proposed to order rules or to reduce their number. However, as we discussed in [18], they may not lead to diverse subsets of BRACID rules, as, e.g., high support may characterize many rules having similar syntax and covering similar subsets of learning examples. Other post-pruning techniques considered in rule classifiers focus on optimizing the predictive performance of the rules rather than on improving their descriptive properties [5].

Therefore, we follow a different inspiration, coming from the use of rules to represent patterns in subgroup discovery, where the task is to find subgroups of individuals that are statistically “most interesting” (e.g., covering as many examples as possible and having the most unusual statistical characteristics [5]). In our opinion, these kinds of local, diverse patterns correspond to the decomposition of the minority class into sub-concepts. In this paper we generalize the algorithm originally proposed in [6] for finding rules describing subgroups.

Our approach to selecting a given number of diverse rules with respect to a given rule evaluation measure is presented in Algorithm 1. It is run for each class separately and takes as input the set of all rules induced for this class and their required number after selection – later on we discuss how to tune this parameter.

[Algorithm 1. Selection of a given number of diverse rules for a class with respect to a rule evaluation measure ev(R)]

First, we remove all rules with a non-positive value of the selected confirmation measure (except in the variant where rules are evaluated with support only). The key idea of the algorithm is to assign a weight c(e) to each learning example, initialized as \(c(e) = 1\) for all examples from the given class. When a rule R is selected, the weights of the examples covered by this rule are increased by 1. Then, while evaluating the next candidate rule for selection, each example takes part in all calculations with the weight 1/c(e). For instance, the support of a rule is computed as the sum of 1/c(e) over all target class examples covered by this rule.

This weighted coverage causes examples already covered by the selected rules to contribute less to the evaluation of new rules in subsequent iterations of the algorithm. It promotes rules referring to examples not yet covered and directs the search toward diverse regions of the class.
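A minimal Python sketch of this weighted covering scheme is given below. It is our reconstruction from the description above, not the authors’ implementation; the representation of rules and examples (in particular the covers predicate, and examples being hashable) is assumed:

    def select_diverse_rules(rules, examples, covers, ev, k):
        # Select k diverse rules for one class by weighted coverage.
        # rules:    candidate rules for the class (rules with non-positive
        #           confirmation are assumed to have been removed already)
        # examples: learning examples of the target class
        # covers:   covers(rule, example) -> bool
        # ev:       ev(rule, c) -> float, a rule evaluation in which each
        #           covered example e contributes with weight 1 / c[e]
        # k:        required number of rules after selection
        c = {e: 1 for e in examples}  # example weights, initialized to 1
        selected, candidates = [], list(rules)
        while candidates and len(selected) < k:
            best = max(candidates, key=lambda r: ev(r, c))
            selected.append(best)
            candidates.remove(best)
            # Examples covered by the chosen rule get their weight increased,
            # so they contribute less when evaluating subsequent candidates.
            for e in examples:
                if covers(best, e):
                    c[e] += 1
        return selected

    def weighted_support(rule, c, examples, covers):
        # Weighted support: sum of 1 / c[e] over covered target-class examples.
        return sum(1 / c[e] for e in examples if covers(rule, e))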

In this study we will consider three different versions of the rule evaluation ev(R)Footnote 2 for selecting rules:

  1. a standard rule support sup(R);

  2. a product of the support with confirmation measure S: \(sup(R) \times S(R)\);

  3. a product \(sup(R) \times N(R)\).

The choice of rule support sup(R) follows from our earlier experiments in [18], and we consider it as a baseline. The choice of both confirmation measures S and N has been justified in Sect. 3. We aggregate them with rule support to represent a trade-off in a bi-criteria evaluation where the user is interested in sufficiently strong patterns describing the classes.
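Reusing S, N and weighted_support from the earlier sketches, the three variants can be wired up as follows; counts is a hypothetical helper returning the contingency counts (a, b, c, d) of a rule on the learning data:

    def make_ev(variant, examples, covers, counts):
        # Build an ev(rule, c) function for select_diverse_rules.
        def ev(rule, c):
            ws = weighted_support(rule, c, examples, covers)
            if variant == "sup":
                return ws                     # strategy 1
            if variant == "sup*S":
                return ws * S(*counts(rule))  # strategy 2
            if variant == "sup*N":
                return ws * N(*counts(rule))  # strategy 3
            raise ValueError(variant)
        return ev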

6 Experimental Evaluation

In the experiments we verify whether the proposed post-pruning strategies select a limited number of BRACID rules with better values of interestingness measures than the non-pruned rules.

As evaluation criteria we choose the average values of the confirmation measures S and N, rule support, and rule confidence. We consider the last two measures due to their popularity in previous rule filtering techniques and their easy interpretation for users. These criteria represent descriptive properties of single rules with respect to their possible interpretability, and they are treated as the primary criteria in our study. As a secondary criterion, we also evaluate the predictive ability of the rule set, estimated by G-mean and F-measure, both well suited to imbalanced data sets. We use this criterion to verify that pruning the set of rules does not dramatically deteriorate performance compared to the full set of rules produced by the BRACID algorithm. The predictive measures are evaluated in a repeated stratified 10-fold cross-validation procedure, while the rule evaluation measures are calculated for the set of rules induced from the complete data set.
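For reference, both predictive measures can be computed from a binary confusion matrix as follows (standard formulations, with the minority class treated as the positive one):

    import math

    def g_mean(tp, fn, tn, fp):
        # Geometric mean of sensitivity (minority recall) and specificity.
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        return math.sqrt(sensitivity * specificity)

    def f_measure(tp, fn, fp):
        # F-measure: harmonic mean of precision and recall for the minority class.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)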

Table 2. Basic characteristics of data sets

We analysed the previous experiments from [16] and chose 11 data sets on which BRACID generated many more rules than other, standard rule induction algorithms. They are characterized by different imbalance ratios (from 3% to 30%), data sizes (from 155 to 1728 examples), and types of attributes (only nominal, only numeric, or mixed). Although the imbalance ratios of some of these data sets are moderate, all of them are also affected by different difficulty factors characterizing the distribution of minority class examples. According to experimental studies [17], these factors lead to difficulties in learning rules.

All these data sets come from the UCI repository. We analyzed them as binary problems – the minority class vs. the majority one (which may aggregate the remaining classes) – as this is a typical view of class imbalance, focusing attention on improving the recognition of the class of special importance. The basic characteristics of these data sets are presented in Table 2.

We verified that for all data sets (except cleveland and hepatitis), the BRACID rule sets contained some rules with negative values of the confirmation measures. For instance, balance-scale contained 8 such rules, car 36, cmc 19, solar-flareF 18, and transfusion 14.

Table 3. Characteristics of filtered rules for the minority class

When using the rule selection algorithm, we need to define the number of required rules as the stopping condition. In general, this parameter should represent the analyst’s expectations and their ability to inspect the rules. Here we rely on our previous experiments [18], where we studied a wide range of values of this parameter (up to 30%). The results showed that a threshold of 10% often led to rule sets with good average rule support and classification performance comparable to the original set of BRACID rules.

Table 4. Characteristics of filtered rules for the majority class

Yet another option is to select all the rules necessary to cover all the learning examples in each class. We studied this coverage option in [18] and observed that it usually produced higher classification performance (with respect to G-mean or the sensitivity measure) than the percentage option. However, it also selected more rules. As in this study we aim at reducing the number of rules, we decided to use the percentage option with the parameter tuned to 10% of the original set of rules for each classFootnote 3.

In our study, we examine the three proposed rule selection strategies with the rule evaluation ev(R) (see Sect. 5), defined as: (1) the standard rule support sup(R); (2) the product \(sup(R) \times S(R)\); and (3) the product \(sup(R) \times N(R)\).

The rule characteristics with respect to the considered criteria are given in Tables 3 and 4, for the minority and majority class, respectively. The column “pruning” corresponds to the selection strategy (note that the results for the standard version of BRACID without pruning are presented in the first row for each data set, with the abbreviation “none”).

Additionally, we constructed rule classifiers with the three filtering strategies and evaluated their classification performance. The values of G-mean and F-measure are presented in Table 5.

Table 5. G-mean and F-measure for BRACID with all rules vs. filtered rules

7 Discussion of the Experiments

Each of the filtering strategies improves the interestingness measure used in the given strategy. Note that all of them improve average rule supports for both the minority and majority classes. For some data sets these improvements are quite substantial; for instance, for the cmc data the average rule support increases from 6.59 to 18.57 examples in the minority class, and from 7.30 to 22.98 examples in the majority class. Similarly high improvements also occur for the car, solar-flareF, ecoli, and transfusion data.

The third strategy (based on sup(R) and N(R)) increases the average value of measure N for all data sets in both classes – see, e.g., the hepatitis data, where the improvement is from 0.23 to 0.39 for the minority class and from 0.15 to 0.53 for the majority class. Similar increases have been observed for the other data sets. Likewise, the second strategy (based on sup(R) and S(R)) improves the average values of the confirmation measure S; however, this is more visible for the minority class than for the majority one – for instance, for the haberman data, from 0.46 to 0.65 in the minority class and from 0.25 to 0.27 in the majority one. Note that the values of the confirmation measure S are always higher than those of N.

It is worth observing that the proposed strategies also improve rule evaluation measures other than the one used in each strategy. In particular, the third strategy usually provides the highest values of average support – for the majority of data sets it is better than the first strategy, which uses support only. Although it sometimes slightly improves the confirmation measure S, it usually decreases the average confidence of the rules. The second strategy, on the other hand, offers the highest increases in rule confidence. This is more visible for the minority class, as the confidence of majority rules is already quite high.

Interestingly, the classification performance of the filtered rules does not decrease much compared to the original set of rules, and for a few data sets it is even better – see the results in Table 5.

The differences in the results obtained by the strategies using measures S and N can be explained by analyzing their formulae (see Eqs. 1 and 2). They exploit the contingency table in different, although symmetric, ways. Measure S compares the ratio built from a and c with the one built from b and d, while N aggregates the complementary combination (a with b, and c with d). As BRACID tries to induce rules with very high confidence (which refers to the pair a and c), it is naturally oriented toward obtaining higher values of measure S. Measure N, on the other hand, exploits information complementary to that used in the BRACID rule induction process; it may therefore co-operate better with rule support in the pruning strategy and lead to better descriptive rule evaluations as well as classification results.

8 Conclusions and Final Remarks

To sum up, our experiments have clearly demonstrated that all the proposed filtering strategies select a much smaller number of BRACID-induced rules, characterized by better values of the considered interestingness measures than the non-pruned rules.

As future research, we plan to extend the experimental evaluation to other rule classifiers specialized for class imbalance, in order to show the generality of our approach. We also intend to compare our pruning strategies with a baseline approach involving simple rule filtering. Furthermore, we plan to investigate a more local way of calculating the interestingness measures, based on the analysis of examples in the neighborhood of a given rule rather than on all data elements, as is currently done.