1 Introduction

A popular technique in the field of data mining is association rule mining [28, 32, 36, 39]. Let \(\mathcal I = \{i_1, \ldots , i_n\}\) be a set of items. An itemset is a subset of \(\mathcal I\); an itemset \(X = \{i_1,\ldots , i_k\} \subseteq \mathcal I \) is called a \(k\)-itemset if it contains \(k\) items. An association rule is an implication of the form \(X \Rightarrow Y\) where \(X \subset \mathcal I \), \(Y \subset \mathcal I \), and \(X \cap Y = \emptyset \). The meaning of an association rule is that if the antecedent \(X\) is satisfied, then it is highly probable that the consequent \(Y\) will be satisfied. In association rule mining, given the set of all transactions \(\mathcal T = \{t_1, t_2, t_3,\ldots , t_m\}\) in a relation, the support of an itemset \(X\) is defined as the number of transactions satisfied by the itemset, that is, \(support(X) = |\mathcal S |\), where \(\mathcal S \subseteq \mathcal T \) is the set of transactions satisfied by \(X\). The itemset is frequent iff \(support(X) \ge minimum\_support\).
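
For illustration, the following minimal sketch (Python used only for illustration; the transactions and the minimum support value are assumptions, not the paper's data) computes the support of an itemset and the frequency test just defined.

```python
# Minimal sketch of the definitions above; the transactions and the
# minimum_support value are illustrative assumptions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Number of transactions satisfied by (i.e. containing) the itemset."""
    return sum(1 for t in transactions if itemset <= t)

minimum_support = 2
X = {"bread", "milk"}                                # a 2-itemset
print(support(X, transactions))                      # 2
print(support(X, transactions) >= minimum_support)   # True: X is frequent
```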

Most of the approaches for mining association rules [3, 5, 7, 16] are based on the extraction of frequent itemsets or patterns [6, 30]. These proposals were originally designed for market basket analysis, obtaining those patterns that exceed a minimum support, and then using these patterns for discovering association rules requiring a minimum confidence threshold. Whereas the support value represents the frequency of occurrence, the confidence value indicates how reliable a rule is. The higher its confidence value, the more frequently the items of the rule are associated together.

However, there are situations where it is interesting to discover abnormal or unusual behaviour in datasets, discovering rare or infrequent patterns [19, 20], that is, those that do not follow the trend of the others. A pattern \(X\) is rare or infrequent iff \(support(X) < maximum\_support\), where \(support(X) = |\mathcal S |\) and \(\mathcal S \subseteq \mathcal T \) is the set of transactions satisfied by the pattern. Examples include banking frauds [31], medical anomalies [25], network intrusion [27], etc. Such a search for rare patterns gives rise to a parallel interest in looking for infrequent and strong associations among these patterns, that is, searching for reliable association rules that do not cover a large number of instances in a dataset. An association rule \(X \Rightarrow Y\) is rare or infrequent iff \(support(X \cup Y) < maximum\_support\).

The first approaches in this field consisted of adjusting the previously existing exhaustive search algorithms for mining frequent patterns [1]. Another possibility for extracting rare association rules by means of algorithms for mining frequent ones is the use of a very low frequency of occurrence threshold. However, a much larger set of patterns would be mined this way, prompting an exponential increase in runtime. Additionally, a large set of patterns originates a much larger set of association rules. This makes a subsequent process necessary to discard those frequent rules that are not interesting for the data miner and would hamper the interpretability of the whole set of rules.

Although rare association rule mining (RARM) is a recent area in the field of association rule mining, some proposals have already been presented for this type of problem. A rare itemset miner algorithm (ARIMA) [33], Apriori-Inverse [18], Apriori-Infrequent [27] and MRG-Exp [34] are the most relevant algorithms arising from these studies. A major drawback of these proposals is that they perform an exhaustive search among the dataset patterns, so their execution over huge datasets with a large number of attributes is computationally hard. Since this exhaustive nature causes a vast number of rules to be mined, the readability of the resulting set can also be hampered. Besides, these algorithms work only in categorical domains, so numerical attributes must be discretized before the itemset mining can start.

Noisy data are also an important issue in the RARM process [38]. The presence of noise in a dataset may cause a distortion of the data distribution and, in consequence, the extraction of noisy association rules. Sometimes, relations between values that appear extremely infrequently in a dataset should not be considered rare or infrequent but noisy. In these cases, it is necessary to establish a minimum occurrence frequency threshold in order to determine the boundary between infrequent and noisy association rules. None of the current RARM algorithms address this issue, since they discover the entire set of rules under a maximum frequency threshold. Notice that this value could vary depending on the specific application domain under analysis, so configuring it properly becomes especially important before proceeding to the extraction of rules.

Different evolutionary algorithms (EAs) have been proposed to extract association rules [24], and have also been applied to other data mining tasks [13]. One technique based on EAs is grammar-guided genetic programming (G3P) [14, 17], where each individual is a derivation tree that generates and represents a solution using the language defined by a grammar. In such a way, solutions are represented with variable-length hierarchical structures, where the shape, size and structural complexity of the solution are constrained by the grammar. Recently, a G3P algorithm was proposed for mining frequent association rules with excellent results [23]. This proposal, called grammar-guided genetic programming association rule mining (G3PARM), promotes the use of a context-free grammar (CFG), which allows syntax constraints to be defined, restricting the search space and obtaining expressive solutions in different attribute domains. Since a main issue of exhaustive search algorithms is their computational cost, the use of evolutionary algorithms in this field, for example, G3PARM, reduces the computation time. In addition, the proper selection of a good heuristic strategy aids in the search for solutions. A heuristic strategy is especially important in the RARM field, where it is essential to differentiate between infrequent and noisy relations among patterns.

Motivated by the promising performance of G3PARM, a full study of its adaptability to RARM, named Rare-G3PARM, is presented in this paper. In particular, we study how effective different fitness functions are at identifying rare association rules and separating them from noise, together with a new genetic operator that guides the search process. In this proposal, the resulting set only comprises the best rules discovered along the execution, and its size tends to the size specified by the data miner. In addition, this proposal includes the lift measure together with support and confidence to overcome the problems that arise in most algorithms [3, 5, 16], and also in G3PARM, when only the support–confidence framework is followed [4]. For the sake of analysing the effectiveness of Rare-G3PARM, we compare it with the existing algorithms in the RARM field. The results show that this proposal efficiently obtains rare and reliable relations between patterns, and avoids discovering noisy ones.

This paper is structured as follows: Sect. 2 presents some related work; Sect. 3 introduces the G3PARM algorithm and describes in depth its adaptability to the RARM field; Sect. 4 presents the datasets used in the experiments, the experimental set-up, a study of different fitness functions and a discussion of the results obtained; finally, some concluding remarks are outlined in Sect. 5.

2 Related work

Despite the fact that mining frequent patterns is a major task in association rule mining, there has recently been increasing interest in the extraction of unusual or infrequent patterns. There exist many application domains for which the identification of rare patterns is an important issue. This paper focuses on rare pattern mining as an unsupervised learning task for finding association rules in large datasets. Communication failure detection [27], analysis of interesting rare patterns in telecommunication networks [21], recognition of patients who suffer a particular disease that does not often occur [25] or credit card fraud detection [31] are some of the applications where it is interesting to use association rules and, particularly, rare association rules.

The process of generating rare association rules yields two types of rules, as described in [18]. Given maximum support and minimum confidence thresholds, a rule is a perfectly rare association rule (PRAR), as shown in Eq. 1, if the rule's confidence is greater than or equal to the minimum confidence threshold, the support of the complete rule is less than the maximum support threshold, and the support of each item or condition in the rule is also less than the maximum support threshold.

$$\begin{aligned} PRAR \left\{ \begin{array}{l} Conf(A \rightarrow C)\ge minConf\\ Sup(A \rightarrow C) < maxSup\\ \forall x : x \in (A \cup C),\quad Sup(x) < maxSup \end{array}\right. \end{aligned}$$
(1)

On the other hand, and given maximum support and minimum confidence thresholds, a rule is an imperfectly rare association rule (IRAR), as shown in Eq. 2, if the rule's confidence is greater than or equal to the minimum confidence threshold, the support of the complete rule is less than the maximum support threshold, and there is at least one pattern in the rule with a support greater than or equal to the maximum support threshold.

$$\begin{aligned} IRAR \left\{ \begin{array}{l} Conf(A \rightarrow C)\ge minConf\\ Sup(A \rightarrow C) < maxSup\\ \exists x : x \in (A \cup C),\quad Sup(x) \ge maxSup \end{array}\right. \end{aligned}$$
(2)
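
For clarity, Eqs. 1 and 2 can be read as the hedged sketch below (Python used only for illustration; the support and confidence callables, the toy item supports and the thresholds are all assumptions).

```python
# Sketch of the PRAR/IRAR tests of Eqs. 1 and 2; sup and conf are assumed
# callables returning relative supports/confidences, and max_sup/min_conf
# are thresholds supplied by the caller.
def classify_rule(A, C, sup, conf, max_sup, min_conf):
    """Return 'PRAR', 'IRAR' or None for the rule A -> C."""
    if conf(A, C) < min_conf or sup(A | C) >= max_sup:
        return None                               # not a reliable rare rule
    if all(sup({x}) < max_sup for x in A | C):
        return "PRAR"                             # every item is itself rare
    return "IRAR"                                 # at least one frequent item

# Toy usage with assumed item supports; sup() here is a crude stand-in.
item_sup = {"a": 0.05, "b": 0.30, "c": 0.60}
sup = lambda s: min(item_sup[x] for x in s)
conf = lambda A, C: 0.95
print(classify_rule({"a", "b"}, {"c"}, sup, conf, 0.4, 0.9))   # IRAR
```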

In early studies, the problem of finding infrequent patterns was originally addressed by using algorithms for mining frequent patterns. In this way, Apriori [3] was the first algorithm successfully used for the extraction of frequent association rules, and it served as the starting point for subsequent research in this field. However, for the extraction of rare association rules, the runtime of this algorithm increases drastically because of the number of candidate items, since in this situation the minimum support threshold must be greatly decreased [8, 15].

A novel algorithm for discovering PRAR, called Apriori-Inverse, was proposed by Koh and Rountree [18]. During the search process, Apriori-Inverse keeps those items with a support value greater than a minimum support threshold but less than a maximum value. Then, similarly to the Apriori algorithm, a set of association rules is obtained by using a confidence threshold over all the possible combinations of items previously obtained. Consequently, this algorithm mines very infrequent association rules since the support of each one is less than or equal to that of the item with minimum support. Notice that this algorithm does not find any IRAR since it would never consider itemsets that have a support value above the maximum threshold.

Another RARM algorithm, called ARIMA, was first proposed by Szathmary et al. [33] as a naïve approach. Unlike the Apriori-Inverse algorithm, ARIMA is not limited to calculating PRARs. This algorithm firstly mines the minimal rare itemsets (mRIs), that is, those rare itemsets whose proper subsets are all frequent. Thus, whenever a candidate \(k\)-itemset survives the frequent \((k-1)\)-subset test but proves to be rare, it is kept as an mRI. Then, the algorithm finds their proper supersets, avoiding zero itemsets, that is, those having a support of zero. Similarly to Apriori, rare association rules are generated using the set of rare itemsets mined. For this final process, it is necessary to satisfy a minimum confidence threshold.

The authors of the ARIMA algorithm proposed two different ways of mining rare itemsets [33]. The first version, called Apriori-Rare, is a slightly modified version of Apriori that finds all frequent itemsets and stores the mRIs discovered. Szathmary et al. [34] also proposed a much more efficient approach, the MRG-Exp algorithm, which reduces the search space by avoiding the exploration of all frequent itemsets. Instead, it is sufficient to search for frequent generators (FGs) only. An itemset \(X\) is a generator if it has no proper subset with the same support, that is, \(\forall Y \subset X,\; support(X) < support(Y)\). Apriori-Rare requires more computational time, since it lists all frequent itemsets before reaching the mRIs, whereas MRG-Exp explores only the FGs. Both versions obtain association rules from the set of mRIs, the number of rules discovered being smaller than with the naïve approach.
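
The generator test admits a direct, if naïve, sketch (exhaustive over proper subsets, so only suitable for small itemsets; the transaction list is an assumption for the toy usage).

```python
from itertools import combinations

# Sketch of the generator test; the support callable is assumed given.
def is_generator(X, support):
    """True iff support(X) < support(Y) for every proper subset Y of X
    (support can only grow when items are removed from an itemset)."""
    X = frozenset(X)
    sup_x = support(X)
    return all(sup_x < support(frozenset(Y))
               for k in range(len(X))
               for Y in combinations(X, k))

# Toy usage over an assumed transaction list.
D = [{"a", "b"}, {"a"}, {"b"}, {"a", "b", "c"}]
support = lambda s: sum(1 for t in D if s <= t)
print(is_generator({"a", "b"}, support))   # True: every proper subset
                                           # has strictly higher support
```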

Although it is more efficient to obtain rare association rules by applying specific algorithms for the extraction of infrequent patterns, it is also possible to follow the classic Apriori proposal to obtain rare association rules from the set of infrequent items. An example is Apriori-Infrequent, which has been used for intrusion detection [27]. During the candidate generation phase, an infrequent itemset is considered as a candidate if it comprises only frequent elements or items, so this algorithm seeks IRARs. Finally, the resulting set of candidate elements is used to obtain rules that exceed a minimum confidence threshold.

A major shortcoming of current proposals for RARM is that, since they are based on exhaustive search methods, their computational time grows steeply as the number of instances and attributes in a dataset increases. Therefore, for quite some time, the data mining community has been interested in constraint-based mining, that is, the use of constraints to specify the desired properties of the patterns to be mined [2, 9]. These constraints allow the search space to be restricted and the required computational time to be decreased.

Another important issue pertaining to the current RARM approaches is that they are based only on a support–confidence framework to evaluate the rules mined, and none of them use other measures, such as lift, to filter uninteresting rules. Besides, these algorithms obtain rare association rules in categorical domains only, so numerical attributes have to be discretized in a previous step, which may cause some distortion of the original data under study. In addition, existing proposals are based on the discovery of perfectly or imperfectly rare rules, but they do not provide any mechanism to differentiate rare rules from extremely infrequent rules. Actually, because of the number of elements in the resulting set of rules, it tends to be unmaintainable and hardly interpretable by the data miner, who is usually more concerned with obtaining a short but representative set of interesting rules.

In some application domains, a number of extremely infrequent association rules could appear, accounting for a high percentage of the total number of rules. Figure 1 shows the number of extremely infrequent association rules discovered over the Automobile Performance dataset discretized using equal-width intervals. These rules comprise at least one pattern with such an extremely low support that it could be considered as noise. In such a situation, it is very important, indeed necessary, to properly separate the rare from the noisy association rules. Most of the existing proposals for mining association rules do not bear in mind this type of rule, considering that an association rule is always rare if its support is lower than a threshold.

To understand this assertion, Fig. 2 shows some rare association rules mined with different approaches using the aforementioned Automobile Performance dataset discretized using equal-width intervals. In these scatter plots, each dot represents one instance or record in the dataset, and the rectangles represent the instances covered by the attributes. For instance, Fig. 2a shows the distribution of the records for two attributes (\(Weight\) and \(MPG\)) in a sample dataset. In this figure, the rectangle shows the records covered by the conditions \(mpg=(39.08-Inf]\) and \(weight=(0-2318.4]\). In general, each scatter plot shows the instances covered by a rule mined with a different approach, where each coordinate axis represents a condition of the rule mined. Analysing these scatter plots, notice that classical approaches such as Apriori-Inverse, naïve ARIMA and Apriori-Infrequent tend to obtain very infrequent rules (see Fig. 2a–d). These rules cover so few instances that they could also be considered noisy association rules, depending on the application domain. As depicted in Fig. 2d, the goal of RARM is the discovery of those rules that do not occur often, even when they do not represent noise. Filtering out extremely infrequent patterns, which could be considered noise, is therefore an essential requirement; this filtering is carried out by means of a threshold that determines which rules are extremely infrequent.

Fig. 1 Number of association rules with extremely low support discovered over the Automobile Performance dataset

Fig. 2 Examples of rare association rules mined with different approaches

3 Discovering rare association rules

This section briefly introduces the G3PARM algorithm and its main features. Then, the mechanisms for adjusting G3PARM to the extraction of rare association rules and the new features required to make this algorithm an interesting RARM approach are described in depth. In this paper, the support, confidence and lift measures play an important role in the mining process, serving to avoid obtaining misleading rules. Furthermore, a new genetic operator serves to guide the search process in an effective way.

3.1 The G3PARM algorithm

As previously mentioned, G3PARM [23] is a grammar-guided genetic algorithm that makes use of a CFG to syntactically constrain its individuals, each of which represents an association rule defined over numerical or categorical domains. Figure 3 shows the CFG through which the population individuals are encoded. In general, a CFG is formally defined as a four-tuple \((\Sigma _{N},\, \Sigma _{T},\,P,\,S)\), where \(\Sigma _{N} \cap \Sigma _{T} = \emptyset \), \(\Sigma _{N}\) is the alphabet of non-terminal symbols, \(\Sigma _{T}\) is the alphabet of terminal symbols or tokens, \(P\) is the set of production rules and \(S\) stands for the start symbol. Productions have the format \(A \rightarrow \alpha \), where \(A \in \Sigma _{N}\) and \(\alpha \in \{ \Sigma _{T} \cup \Sigma _{N}\}^{*}\). Individuals are defined as a derivation syntax tree where the root is the symbol \(S\), the internal nodes contain only non-terminal symbols and the leaf nodes contain only tokens. Each individual is a sentence generated by the grammar and represented by a derivation syntax tree.

Fig. 3 Context-free grammar expressed in extended BNF notation

As mentioned above, the individuals of the evolutionary algorithm follow a tree structure according to the definition of the CFG previously described, and each individual may differ greatly in size and shape. This CFG, which could be easily adapted to each specific application domain or problem, is used to state syntax constraints on the rules mined. Each individual from the EA determines a genotype, denoted by a derivation syntax tree, and a phenotype, representing the entire rare association rule, comprising an antecedent and a consequent. Both the antecedent and the consequent of the entire rule constitute the set of conditions that must be satisfied.

To obtain each individual, a derivation process is carried out starting from the start symbol of the CFG. A number of steps are then performed by using the set of production rules. The number of production rules used in the derivation process should be fixed beforehand in order to avoid excessively deep trees. Hence, the data mining expert has the ability to restrict the maximum length of the resulting rules.

In G3PARM, the chromosome of each individual encodes its expression using a pre-order traversal of the parse tree. It should be noted that the terminal grammar symbol 'name' is instantiated with one of the feasible attributes, and the assigned value is then chosen from among that attribute's available values. For example, using a sample dataset, a set of available values is obtained for each attribute as shown in Table 1.

Table 1 Metadata defined in a sample dataset

Figure 4 shows a sample representation of an individual in the G3PARM algorithm using the aforementioned sample dataset. The syntax-tree structure is obtained according to the grammar defined, where each internal node represents a non-terminal symbol, each edge is a production from the set \(P\) of production rules, and the leaves of the tree are terminal symbols. Using this tree structure, each genotype is obtained by means of a pre-order traversal of the tree, so the genotype of the sample tree structure is the following:

(Genotype string of the sample individual, shown as an inline figure in the original.)

In contrast, the phenotype represents the meaning of the rule and is obtained by eliminating the non-terminal symbols from the genotype. Using the aforementioned sample genotype, the phenotype is AND != toy ball \(<\) price 35 = sex male, with the non-terminal symbol Consequent located in the genotype between the attributes price and sex. Therefore, the sample individual represents the rule IF toy != ball AND toy's price \(<\) 35 THEN sex = male.

Fig. 4 How to represent the individuals

G3PARM also implements an auxiliary population that serves as an elite set comprising only the best individuals along the generations, that is, those that have the maximum support and confidence values. This auxiliary population has a fixed size, which can be altered by the data miner before the execution, if required. This avoids obtaining a large and hardly maintainable set of rules. In addition, the G3PARM algorithm obtains solutions within specified time limits and does not require the excessive amounts of memory that exhaustive search-based algorithms do. As can be guessed, another feature of G3PARM is the use of a support–confidence framework, discovering rules having the highest support and confidence values without focusing on how interesting these rules are.

Both categorical and numerical attributes can be mined using the CFG. Categorical attributes are commonly used in most algorithms for mining association rules. In our approach, an expression of the form \(A = u\) or \(A \ne u\) is allowed, where \(A\) is a categorical attribute and \(u\) is a value in the domain \(D\) of \(A\). As for numerical attributes, our approach selects different equal-width points in the range of values and then forms conditions by using these points and four relational operators: \(>,\,>=,\,<\), and \(<=\). To avoid rules that always occur, the upper and lower bounds of the range of values are not taken into account.
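
As an illustration of this on-the-fly handling of numerical attributes, the following sketch generates candidate numerical conditions; the attribute name, its range and the number of interior points are assumptions.

```python
# Sketch: candidate conditions for a numerical attribute, using interior
# equal-width points only; the bounds lo and hi are skipped so that no
# generated condition is satisfied by every record.
def numeric_conditions(attr, lo, hi, n_points=3):
    width = (hi - lo) / (n_points + 1)
    points = [lo + width * (i + 1) for i in range(n_points)]
    return [(attr, op, round(p, 2))
            for p in points for op in (">", ">=", "<", "<=")]

print(numeric_conditions("weight", 0.0, 4000.0))
# e.g. ('weight', '>', 1000.0), ('weight', '<=', 3000.0), ...
```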

G3PARM starts by producing the population, randomly generating individuals from the CFG defined in Fig. 3 without exceeding the maximum number of derivations. In the initial generation, the auxiliary population is empty. A set of individuals is selected via binary tournament from the merging of the current population and the auxiliary population. This selector picks two individuals at random from the merged population and, after comparing them, keeps the better of the two. The selected individuals act as parents, and a new set of individuals is obtained once the genetic operators, crossover and mutation, are applied. The crossover operator swaps one condition in one parent for another condition in the other parent, whereas the mutation operator replaces one condition in one parent, trying to obtain a new condition with a higher support value. Once the new population is obtained by crossover and mutation, the auxiliary population is updated by merging the previous auxiliary population and the current population. Then, individuals are ranked according to their support, and duplicates are eliminated.
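
For instance, the binary tournament selector just described can be sketched as follows; the population encoding and the fitness callable are assumptions.

```python
import random

# Sketch of binary tournament selection: repeatedly pick two individuals
# at random and keep the better of the two.
def binary_tournament(population, fitness, n_parents):
    return [max(random.sample(population, 2), key=fitness)
            for _ in range(n_parents)]

# Toy usage: individuals are just labelled fitness values here.
pop = [("r1", 0.3), ("r2", 0.7), ("r3", 0.5)]
print(binary_tournament(pop, fitness=lambda ind: ind[1], n_parents=2))
```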

3.2 The Rare-G3PARM algorithm

In this section, the mechanisms for adjusting G3PARM for the extraction of rare association rules are described. It should be noted that Rare-G3PARM uses the same CFG as G3PARM does—this CFG was properly defined in Fig. 3. In order to make G3PARM suitable for finding rare association rules, the following sections describe the new features required.

3.2.1 Evaluation process

The process of evaluating association rules in a specific problem is not an elementary issue, since ARM algorithms base their extraction process on the quality of the rules. Several objective measures for evaluating the interest of these rules have been proposed by different researchers [35]. Two of the most important and widely used measures in this field are support and confidence. The former is formally defined in Eq. 3 as the proportion of transactions \(T\) in a dataset \(D\) that include both \(A\) and \(C\).

$$\begin{aligned} support(A \rightarrow C) = \frac{|\{T \in D : A \cup C \subseteq T\}|}{|D|} \end{aligned}$$
(3)

On the other hand, the confidence of an association rule is defined in Eq. 4 as the proportion of the number of transactions that include \(A\) and \(C\) among all the transactions that comprise \(A\). Here, \(A\) stands for the antecedent, \(C\) stands for the consequent and \(A\) and \(C\) are disjoint itemsets, that is, they have no elements in common: \(A \cap C = \emptyset \).

$$\begin{aligned} confidence(A \rightarrow C) = \dfrac{support(A \rightarrow C)}{support(A)} \end{aligned}$$
(4)

Despite the fact that most proposals in ARM, G3PARM among them, are based on a support–confidence framework, these measures are not sufficient to select interesting associations between patterns [4] in some application domains. For example, let us assume that 75 % of the customers in a market bought onions (i.e. \(customer~\rightarrow ~onions\) with a confidence value of 75 %), and 70 % of the customers that bought tomatoes also bought onions (i.e. \(tomatoes~\rightarrow ~onions\) with a confidence value of 70 %). In this situation, buying \(tomatoes\) does not increase the likelihood of buying \(onions\), so the rule is misleading: its confidence (0.70) is lower than the support of its consequent (0.75). In other words, the occurrence of the antecedent does not imply an increment in the occurrence of the consequent. The lift measure was defined to solve this problem (see Eq. 5): it establishes how many times more often the antecedent and the consequent occur together than would be expected if they were statistically independent. An association rule is interesting if its confidence is higher than the support of its consequent. On the other hand, if the confidence of the rule is equal to the support of its consequent, then the antecedent and consequent are independent.

$$\begin{aligned} lift(A \rightarrow C) = \dfrac{confidence(A \rightarrow C)}{support(C)} \end{aligned}$$
(5)
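
Putting Eqs. 3–5 together, the following sketch (toy transactions assumed) reproduces the reasoning above; a lift below 1 flags a misleading rule.

```python
# Sketch of Eqs. 3-5 over a toy transaction list (assumed data).
def support_m(itemset, D):
    return sum(1 for T in D if itemset <= T) / len(D)

def confidence(A, C, D):
    return support_m(A | C, D) / support_m(A, D)

def lift(A, C, D):
    return confidence(A, C, D) / support_m(C, D)

D = [{"tomatoes", "onions"}, {"onions"}, {"tomatoes"}, {"onions", "bread"}]
A, C = {"tomatoes"}, {"onions"}
print(confidence(A, C, D))   # 0.5
print(lift(A, C, D))         # 0.5 / 0.75 = 0.667 < 1: misleading rule
```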

Piatetsky-Shapiro [26] suggested that any accuracy measure \(\mathcal M \) should verify three specific properties in order to separate strong and weak rules (in the sense of assigning them high and low values, respectively). The properties are the following:

  • Property 1: \(\mathcal M \)(\(A \rightarrow C\)) = 0 when support(\(A \rightarrow C\))=support(\(A\))*support(\(C\)). This property claims that any accuracy measure \(\mathcal M \) must test the independence (though values other than 0 could be used, depending on the range of the measure).

  • Property 2: \(\mathcal M \)(\(A \rightarrow C\)) monotonically increases with support(\(A \rightarrow C\)) when other parameters remain the same.

  • Property 3: \(\mathcal M \)(\(A \rightarrow C\)) monotonically decreases with support(\(A\)) (or support(\(C\))) when other parameters remain the same.

The lift measure verifies these three properties, with the value 1 (instead of 0) meaning independence. Furthermore, the lift measure is symmetric (i.e. lift(\(A \rightarrow C\)) \(=\) lift(\(C \rightarrow A\))). Sometimes it is necessary to measure the strength of implication in both directions, not only the degree of dependence. Nevertheless, lift is not free of problems either. The main drawback of this measure is that its range is not bounded [4], that is, it spans \([0, \infty )\), and it is not easy to compare the values of several rules because differences between them are not meaningful. However, in RARM, the necessity of discovering interesting rules makes the use of the lift measure especially appropriate for weighing up the real interest of each rule mined. The goal of RARM is to obtain interesting rules with low support whose patterns are associated together.

In some application domains, the existing RARM approaches discover rules that comprise extremely infrequent patterns, so these rules are actually noisy association rules rather than rare ones, as described in Sect. 2. In such situations, it is essential to correctly establish a boundary that separates rare from noisy rules. However, the distinction between rare and noisy rules is not the only complex one: in some situations, depending on the application domain, it is really difficult to properly distinguish between frequent and rare rules. All these situations should be taken into account, so it is necessary to be especially careful in the choice of a proper evaluation function that guides the search process. Therefore, in subsequent sections, a number of fitness functions conceived especially for RARM will be described in depth.

3.2.2 Genetic operator

In any EA, genetic operators play an important role in the evolutionary process. These operators help maintain genetic diversity along the search for optimal solutions. In RARM, these operators are especially important, as they serve to guide the search process and to mine rules having lower support values than the original ones. In this proposal, a new genetic operator has been implemented, which modifies the highest-support condition of a rule to obtain a new condition having a lower support value. Notice that the lower the support value of the conditions, the lower the support value of the entire rule.

With this purpose, the operator mutates the condition with the highest frequency of occurrence to obtain a new individual with a lower support (see Fig. 5 for a real example). The genetic operator provides two ways of changing a condition: \((1)\) obtaining a new complete condition or \((2)\) obtaining a new value for a terminal symbol.
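
A hedged sketch of this operator follows; the condition encoding and the three helper callables are assumptions, not the authors' implementation.

```python
import random

# Sketch: locate the highest-support condition of a rule (a list of
# conditions) and either (1) replace it entirely or (2) redraw its
# terminal value, hoping to lower the support of the entire rule.
def rare_mutation(rule, cond_support, new_condition, new_value):
    i = max(range(len(rule)), key=lambda j: cond_support(rule[j]))
    mutated = list(rule)
    if random.random() < 0.5:
        mutated[i] = new_condition()              # (1) whole new condition
    else:
        attr, op, _ = mutated[i]
        mutated[i] = (attr, op, new_value(attr))  # (2) new terminal value
    return mutated

# Toy usage with assumed helpers.
rule = [("price", "<", 35), ("toy", "!=", "ball")]
print(rare_mutation(rule,
                    cond_support=lambda c: 0.9 if c[0] == "price" else 0.2,
                    new_condition=lambda: ("sex", "=", "male"),
                    new_value=lambda a: 20))
```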

Fig. 5 Example of the genetic operation

To study the convergence of the genetic operator, a number of different experiments were carried out using diverse datasets. Figure 6 depicts the average support value of both the elite population (also known as the external population) and the best individual obtained in every generation. The \(Y\)-axis represents the support, whereas the \(X\)-axis stands for the number of generations. Regardless of the dataset used, notice that the lowest support value is obtained before generation 60. As for the best individual, it is discovered in early generations, that is, between generations 5 and 15 in most datasets.

Fig. 6 Average elite population support and best individual support obtained in each generation by using the genetic operator over a set of datasets

Finally, the behaviour of this operator is analysed as its probability is increased. Figure 7 depicts the average support values calculated for the elite population using the four sample datasets. Here, the \(Y\)-axis represents the average support, and the \(X\)-axis stands for the probability of the genetic operator. As shown, there is no significant difference among the average support values obtained using the different probabilities, with the probability value of 0.80 being the closest to optimal.

Fig. 7 Average elite population support by using the genetic operator with different probabilities

3.2.3 Algorithm

The proposed algorithm for the extraction of rare association rules (see Fig. 8 for a general sketch and Listing 1 for the pseudocode) follows a generational schema. Rare-G3PARM starts by generating a set of new individuals conformant to the specified grammar (see line 1 in the pseudocode). Each individual is encoded in a tree shape through the application of production rules. Notice that each terminal symbol adopts the name and value of one of the dataset attributes. Once a set of individuals is generated, composing the general population of the algorithm, the generational schema is executed. Several steps are performed in each generation: \((1)\) in order to obtain new individuals, the algorithm selects individuals from the general population and the pool to act as parents (line 6), and the genetic operator is applied over them immediately afterwards with a certain probability (line 7). Next, \((2)\) these new individuals are evaluated (line 8). In the following step, \((3)\) the elite population or pool—empty only in the first generation—, which comprises the \(n\) most reliable individuals obtained along the evolutionary process, is combined with the population to form a new set. This new set is then ranked by fitness, and only the best individuals are selected until the new population is completed (lines 9 and 10). Finally, \((4)\) the update procedure is carried out (line 13), ranking the new set of individuals by confidence (see line 2 in the procedure); this ranking serves to select the best \(n\) individuals from the new set for the updating process. Only those individuals having a fitness value greater than zero, a confidence value greater than the minimum confidence threshold, and a lift value greater than unity are considered, prompting the discovery of infrequent, reliable and interesting association rules (see lines 6–8).
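
As an illustration of step \((4)\), the following sketch paraphrases the pool-update procedure; it is not the authors' Listing 1, and the rule encoding is an assumption.

```python
# Sketch of the pool update: rank candidates by confidence, then keep
# only reliable, interesting rules, up to the pool size n.
def update_pool(pool, population, n, min_conf):
    candidates = sorted(pool + population,
                        key=lambda r: r["confidence"], reverse=True)
    kept = [r for r in candidates
            if r["fitness"] > 0 and r["confidence"] > min_conf
            and r["lift"] > 1]
    return kept[:n]

rules = [{"fitness": 0.9, "confidence": 0.95, "lift": 1.30},
         {"fitness": 0.8, "confidence": 0.97, "lift": 0.99},   # uninteresting
         {"fitness": 0.0, "confidence": 0.99, "lift": 1.50}]   # zero fitness
print(update_pool([], rules, n=20, min_conf=0.9))   # only the first survives
```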

An important feature of Rare-G3PARM is the use of the lift measure, which represents the interest of a given association rule. Traditionally, ARM proposals, including G3PARM, make use of a support–confidence framework, attempting to discover rules whose support and confidence values are greater than certain thresholds. However, the mere fact of having rules that exceed these thresholds does not guarantee that the rules are interesting at all [4]. Even when the support and confidence thresholds are satisfied by a rule, this rule could be misleading if its confidence value is less than its consequent support. In such a situation, the occurrence of the antecedent does not imply an increment in the occurrence of the consequent. For a better understanding, two sample rules obtained when running the algorithm on the Zoo dataset are shown. The rule ( breathes = f ) -> ( legs != 8 ) is obtained as a rare association rule, having a support value of 0.198 and a confidence value of 0.952. Despite the fact that this rule could be extracted as a reliable rare association rule, it is discarded since its lift value is 0.972, that is, the occurrence of the antecedent does not imply an increment in the consequent. The same occurs with the rule ( legs = 4 ) -> ( type != 3 ), having a support of 0.356 and a confidence of 0.947 but a lift value of 0.996, so the rule is misleading and uninteresting.

Fig. 8 The flowchart for the Rare-G3PARM algorithm

In ARM, the problem of association rule extraction in a dataset can be split into two sub-problems: obtaining all the desired patterns in the dataset, and extracting all the association rules starting from the results obtained in the previous step. The first step is computationally expensive, and the second step generates a lot of high-confidence rules from each frequent itemset, requiring large amounts of memory. In Rare-G3PARM, the mining process only requires one step: obtaining association rules through the use of a grammar, with no previous step for mining rare patterns.

Listing 1 Pseudocode of the Rare-G3PARM algorithm

Another interesting feature of Rare-G3PARM is its ability to discard rules having the same meaning. This is not an easy task, because individuals having different genotypes may represent the same rule—two rules could represent the same semantic concept even though they differ syntactically. Therefore, individuals are compared based on their conditions, and those with the same conditions are not kept in the pool. A simple example is shown using a real individual obtained from the Zoo dataset: the individual ( toothed != t AND legs != 5 ) -> ( airborne = f ), which has a support value of 0.168, and the individual ( legs != 5 AND toothed != t ) -> ( airborne = f ) represent the same rule, so only one of them is kept in the pool. More complex examples can arise, depending on the individuals generated, and are equally detected and removed [23].
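
This comparison can be sketched as an order-insensitive test over the conditions; the tuple encoding of conditions is an assumption.

```python
# Sketch: two rules with the same conditions (in any order) are treated
# as semantically equal, so only one copy is kept in the pool. A rule is
# assumed to be a pair (antecedent conditions, consequent conditions).
def same_rule(r1, r2):
    return (frozenset(r1[0]) == frozenset(r2[0])
            and frozenset(r1[1]) == frozenset(r2[1]))

r1 = ([("toothed", "!=", "t"), ("legs", "!=", 5)], [("airborne", "=", "f")])
r2 = ([("legs", "!=", 5), ("toothed", "!=", "t")], [("airborne", "=", "f")])
print(same_rule(r1, r2))   # True: the second copy would be discarded
```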

Finally, two sample executions of Rare-G3PARM over the Automobile Performance and Zoo datasets are illustrated in Tables 2 and 3. As shown, using Rare-G3PARM it is possible to obtain both numerical and categorical rules having high confidence values. Additionally, the rules mined are interesting, since their lift values are greater than unity. A further experimental study is carried out in the following section.

Table 2 Sample execution of Rare-G3PARM over the Automobile Performance dataset
Table 3 Sample execution of Rare-G3PARM over the Zoo dataset

4 Experimental study

Selecting a good fitness function is an important task in RARM, as previously mentioned. In this paper, four different fitness functions are described and an analysis of their behaviour is carried out. Finally, a complete study of the effectiveness of the proposed algorithm compared with existing RARM proposals, namely Apriori-Infrequent [27], Apriori-Inverse [18] and ARIMA [33], is presented.

4.1 Experimental set-up

To analyse the performance of this proposal, a number of executions were performed over diverse datasets (see Table 4) chosen with different sizes and numbers of attributes. As mentioned in previous sections, the existing RARM algorithms are brute-force algorithms that perform an exhaustive search process, so they can hardly be executed over huge datasets composed of a large number of attributes. Our proposal does not have this limitation and can be executed over any dataset, regardless of its size and number of attributes.

Table 4 Datasets and their main characteristics

Any evolutionary proposal should be configured with a set of adjustable parameters. All these parameters require a previous study to determine the values considered optimal, that is, those that allow us to obtain the best global results. The final configuration was adopted after performing several tests using different ranges of values for each parameter over several representative datasets, and then analysing which specific set-up globally yielded the best results. It is worth mentioning that no single combination of parameter values performed best for all datasets. Notice that these parameters are not particularized for each dataset. As shown in Table 5, the best results for our approach are obtained with a population size of 50 individuals generated using a CFG with a maximum derivation size of 24. An in-depth analysis of the final configuration revealed that solutions are hardly improved after 75 generations—Fig. 6a–d show some experiments carried out using a subset of the whole experimental collection of datasets—so the evolutionary process should be carried out using this value for the number of generations. In this process, new individuals are obtained with a probability of 0.80 for the genetic operator, as previously discussed. The best individuals, that is, those that exceed certain quality thresholds, are kept in the external population, whose size is prefixed and determines the maximum number of rules to be discovered by the algorithm. Notice that this pool size could be changed as the data miner's requirements vary. Anyhow, a pool size of 20 rules is established in this paper. The remaining quality thresholds are set as follows: 0.90 for the confidence, 0.10 and 0.40 for the minimum and maximum support thresholds, respectively, and a lift value greater than one. In order to carry out a fair comparison, the algorithms used in the experimental stage are executed using the same confidence threshold (i.e. 0.90) and the same maximum support threshold of 0.40.

Table 5 Parameters established for each algorithm

Notice that the existing algorithms do not provide any minimum threshold, and that exhaustive RARM algorithms only require two parameters: the support and confidence thresholds. This might be considered both a drawback and an advantage. On the one hand, requiring only two parameters seems to be an advantage for the final user, since no additional knowledge is required. On the other hand, this type of algorithm is normally used by data mining experts, who often require the ability to finely configure and tune the algorithm execution. Hence, only Rare-G3PARM provides the ability to fine-tune the length of the rules, to set the total number of rules to be mined, and to restrict the search space by means of the CFG.

All the experiments were performed on an Intel Core i7 machine with 12 GB of main memory, running CentOS 5.4, and all the RARM proposals were written in Java. The exhaustive RARM algorithms were obtained from RM-Tool [29], a data mining tool for performing the association rule mining task. Finally, the proposal presented in this paper was coded using JCLEC [37], a Java library for evolutionary computation.

4.2 Study of the fitness functions

As mentioned in Sect. 2, it is very important to properly differentiate between rare and noisy association rules when mining them. In this section, four novel fitness functions that correctly separate rare from noisy rules are presented. To analyse the performance of these fitness functions, a number of experiments were carried out, whose results are shown in Table 6; the four fitness functions are then compared based on confidence, lift and the number of rules discovered. Notice that support values are analysed but not compared, since the four fitness functions have different goals, depending also on the expert's expectations.

Table 6 Results obtained (presented in a per unit basis) by different algorithms using support and confidence as objectives to be maximized

The first fitness function (see Fig. 9a) provides the maximum fitness value in the middle of a certain interval delimited by a minimum and a maximum support value. The closer the support of a rule is to the interval limits, the lower its fitness value. Outside this interval, a zero value is assigned. For a better understanding, some sample rules obtained using the Zoo dataset and the configuration parameters previously described (see Table 5) are shown. The rule ( breathes = f AND backbone = f AND eggs != f ) -> ( tail != t ) satisfies only 7 instances of the complete dataset, having a support value of 0.068, so this rule is considered noisy according to the parameters fixed in Table 5. Therefore, a zero fitness value is assigned to this solution. In contrast, the rule ( fins != f ) -> ( backbone = t ) satisfies 17 instances, having a support value of 0.166, so a fitness value of 0.455 is assigned to this rule. The farther a solution is from the interval limits, the higher its fitness value. For example, the rule ( legs = 4 ) -> ( catsize = t ) obtains a support value of 0.255, satisfying 26 instances, so a fitness value of 0.950 is assigned to this solution. As shown, this first fitness function is specific to application domains with a large number of both noisy association rules and rules that are close to being considered rare. In this way, a progressive fitness function is applied, since there is no clear difference between noisy, rare and frequent association rules. Therefore, the more distant the support value is from the predefined thresholds, the better the rule.

Fig. 9 Different fitness functions for the RARM process

The second fitness function, defined in Fig. 9b, is the most similar to that used in existing RARM proposals, the reason being that current algorithms do not differentiate among rules based on their support values. A clear difference is the use of a minimum support to determine which rules are noisy and which are rare association rules. Apart from the support, since all the rules belonging to the interval are equally promising, it is essential to establish a mechanism to properly differentiate among them, bringing the confidence and lift measures into play. For example, using the aforementioned dataset, the rule ( feathers != f ) -> ( hair != t ) is obtained, having a support value of 0.196, a confidence of 1.000 and a lift value of 1.741. This rule has the same support, and therefore the same fitness value, as the rule ( feathers != f ) -> ( type != 6 ). Therefore, to select which one is better, the confidence and lift values are analysed, the latter rule having a confidence value of 1.000 and a lift value of 1.086. The rule ( feathers != f ) -> ( hair != t ) is thus more interesting, since its lift value is higher, the antecedent implying an increment in the occurrence of the consequent (see Sect. 3.2.1). It should be noted that the main application domains of this fitness function are those where there is a clear difference between noisy and rare association rules.

Finally, two other fitness functions that, respectively, minimize and maximize the support within a given interval are presented. The former (see Fig. 9c) assigns a fitness value based on the proximity of the support to the lower limit. This fitness function is specific to domains where there is a clear difference between noisy and rare association rules but the difference between rare and frequent rules is not that clear. For instance, the rule ( catsize != f ) -> ( aquatic = t ) presents a fitness value of 0.904, having a support value of 0.128: since its support is close to the lower limit, its fitness is close to the maximum. The latter function, defined in Fig. 9d, maximizes the fitness value when the support of the rule is closer to the upper limit of the interval. This fitness function allows frequent and rare association rules to be differentiated in situations where there is a clear limit between both types; it is also useful where there is no clear differentiation between noisy and rare rules. For a better understanding, the rule ( legs = 4 ) -> ( fins = f ) obtains a fitness value of 0.921 and a support value of 0.374. Similarly to the fitness functions described above, these two functions provide a zero fitness value if the support value is outside the interval provided by the data miner.
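
To make the four shapes concrete, the following sketch gives plausible piecewise-linear forms over an interval \([lo, hi]\). They approximate Fig. 9 and are consistent with the worked values above (e.g. 0.907 vs. the cited 0.904 for \(F_3\)), but they are not the paper's exact formulas.

```python
# Hedged sketch: plausible piecewise-linear shapes for the four fitness
# functions; approximations of Fig. 9, not the authors' exact formulas.
def f1(s, lo, hi):   # peak in the middle of the interval
    if not lo <= s <= hi:
        return 0.0
    mid = (lo + hi) / 2
    return 1.0 - abs(s - mid) / (mid - lo)

def f2(s, lo, hi):   # flat: every support inside the interval is equal
    return 1.0 if lo <= s <= hi else 0.0

def f3(s, lo, hi):   # maximal near the lower limit
    return (hi - s) / (hi - lo) if lo <= s <= hi else 0.0

def f4(s, lo, hi):   # maximal near the upper limit
    return (s - lo) / (hi - lo) if lo <= s <= hi else 0.0

print(round(f3(0.128, 0.10, 0.40), 3))   # 0.907, close to the cited 0.904
print(round(f4(0.374, 0.10, 0.40), 3))   # 0.913, close to the cited 0.921
```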

Under these fitness functions, and since no restriction is applied to the support of each condition, both perfectly and imperfectly rare association rules can be discovered. These two types of rare rules have the maximum support restriction in common, as do the approaches described in Sect. 2. This represents an important advantage over the currently existing proposals, where the extraction process is limited to discovering only perfectly or imperfectly rare association rules. Finally, it should be noted that no condition of any rule mined will have less support than the minimum threshold fixed, since the support of an association rule is a lower bound on the support of each condition within the entire rule.

According to Table 6a, which shows the average support values obtained using these four fitness functions, we can observe that they all behave as previously discussed. \(F_1\) and \(F_2\) discover rules that tend to be located in the middle of the valid interval, whereas \(F_3\) mines rules closer to the lower bound of the interval and \(F_4\) extracts rules closer to the upper bound.

In this experimental stage, we analyse which fitness function behaves best in terms of average confidence, average lift and average number of rules. Focusing on the confidence and lift measures, a fitness function is better than another if it obtains higher values for these measures. As for the average number of rules discovered, the best fitness function is the one that obtains a number of rules closest to the maximum previously established by the data miner. In order to determine whether there exist significant differences among these four functions, a series of statistical tests [10–12] were carried out.

The Friedman statistical test is used to compare the results obtained and to precisely analyse whether there are significant differences among the four fitness functions. This test first ranks the \(j\)th of \(k\) models on the \(i\)th of \(N\) datasets, then calculates the average rank of each model across all the datasets, and finally computes the Friedman statistic, distributed according to the F distribution (\(F_F\)). If the Friedman test rejects the null hypothesis, indicating that there are significant differences, then a Bonferroni–Dunn test is performed to reveal these differences.
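
For reference, this procedure can be reproduced with standard tooling, as in the following sketch; the score matrix is an illustrative assumption, not the paper's data, and the critical difference formula follows Demšar (2006).

```python
# Sketch of the testing procedure: Friedman test over k models x N
# datasets, plus the Bonferroni-Dunn critical difference.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

scores = np.array([[0.91, 0.93, 0.95, 0.90],   # one row per dataset,
                   [0.88, 0.90, 0.94, 0.89],   # one column per fitness fn
                   [0.92, 0.92, 0.96, 0.91]])
print(friedmanchisquare(*scores.T))            # chi-square form of the test

# Average ranks differing by more than cd indicate a significant difference.
N, k = scores.shape
avg_ranks = rankdata(-scores, axis=1).mean(axis=0)   # rank 1 = best
q_alpha = 2.394                                      # k = 4 models, p = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
print(avg_ranks, cd)
```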

The Friedman average ranking statistic for the average confidence measure (see Table 6b), distributed according to \(F_{F}\) with \(k - 1\) and \((k - 1)(N - 1)\) degrees of freedom, is 1.152; it is 18.009 for the average lift measure (see Table 6c) and 0.494 for the number of rules mined (see Table 6d). The results reveal that the statistics for both the confidence and the number of rules belong to the critical interval \([0,(F_{F})_{0.01,3,27}=4.510]\), so the null hypothesis that all the fitness functions perform equally well for these measures is not rejected at \(\alpha \) = 0.01. In consequence, focusing on the confidence and the average number of rules mined, the four fitness functions described in this paper are equally valid for mining rare association rules. Nevertheless, \(F_4\) provides the best ranking for both measures. Concerning the lift measure, \(F_3\) obtains the best ranking, since this function was conceived to obtain those rules that are close to the minimum support threshold; notice that the lift measure tends to be higher as the support value decreases (see Eq. 5). Conversely, \(F_4\) provides the worst ranking for the lift measure, since this fitness function searches for rules close to the maximum support threshold.

The Friedman average ranking statistic for the average lift measure, distributed according to \(F_{F}\), is equal to 18.009, which does not belong to the critical interval \([0,(F_{F})_{0.01,3,27}=4.510]\). Thus, we reject the null hypothesis that all these fitness functions perform equally well for this measure. In order to determine where the significant differences lie, the Bonferroni–Dunn test is used, the critical difference (CD) value being 1.171 for \(p\) = 0.1, 1.318 for \(p\) = 0.05, and 1.616 for \(p\) = 0.01. The results indicate that, for the lift measure, at a significance level of \(p\) = 0.01 (i.e. with a probability of 99 %) there are significant differences between \(F_3\) and \(F_4\), the performance of \(F_3\) being statistically better. Additionally, at a significance level of \(p\) = 0.10 (i.e. with a probability of 90 %), there are significant differences between \(F_1\) and \(F_3\), the performance of \(F_3\) again being statistically better. Finally, it is not possible to assert that there are significant differences between \(F_2\) and \(F_3\), even though the latter obtains a better ranking.

Concluding the analysis, despite the fact that \(F_4\) provides the best ranking for two out of the three measures, the priority is to discover interesting rules. Furthermore, using the Friedman test, it is not possible to reject the null hypothesis that all the fitness functions perform equally well for the confidence and the average number of rules mined. Therefore, when the application domain does not provide a clear difference between noisy, rare and frequent rules, \(F_3\) is the best fitness function to use, discovering rules that are interesting and very reliable, with confidence values close to the maximum.

4.3 Comparing different algorithms

In this section, a comparison between other relevant RARM algorithms and our proposal, using the aforementioned \(F_3\) fitness function, is performed. Firstly, notice that the algorithm here presented does not require any previous step for discretizing numerical attributes, in contrast to other existing proposals.

For the evaluation of each algorithm, a number of experiments were performed. Since the existing RARM algorithms use an exhaustive search methodology, only a subgroup of the datasets could be selected to make a fair comparison, avoiding those for which a huge number of rules would be discovered and any acceptable computation time would be exceeded. Table 7 shows the results obtained with the different algorithms. The datasets with numerical attributes were discretized using equal-width discretization techniques in order to be able to make a comparison with the existing RARM algorithms. The results show the average support, the average confidence, the number of rules mined and the average runtime for each algorithm. Here, \(D-N\) signifies that the dataset \(D\) was discretized into \(N\) intervals. Finally, it should be noted that lift has been omitted from the comparison, since current exhaustive search algorithms in the RARM field do not consider this measure; including it would lead to an unfair comparison.

Table 7 Results obtained by different algorithms

Analysing the results in Table 7a, notice that the existing RARM proposals obtain rules with very low support, which could be considered as noise, without considering the interest of the rules mined by using the lift measure. On the other hand, our proposal obtains a set of rare association rules with support values in the range \([0.1, 0.4]\), which is the interval used to determine whether a rule is rare. Notice that this interval could vary, based on the specific application domain and the data miner's needs. In addition, the algorithm presented here provides interesting rare rules, since the lift measure is taken into account: only those rules whose antecedents have a positive influence on their consequents are considered interesting.

An important issue is that Apriori-Infrequent obtains an average support greater than Apriori-Inverse. As mentioned in previous sections, Apriori-Infrequent obtains rules from the infrequent itemsets mined by the classical Apriori algorithm. Therefore, the rules obtained have at least one condition with a support value greater than the threshold. Notice that no superset is mined from the infrequent itemsets obtained. On the other hand, Apriori-Inverse discovers only perfectly rare association rules, so their support values are very low. Studying the performance of the ARIMA algorithm, its results are very similar to those obtained with Apriori-Inverse, since they differ only in the minimal itemsets. As for MRG-Exp, it obtains higher support values than the other exhaustive search algorithms, since it uses minimal itemsets to discover rare association rules. However, MRG-Exp does not overcome the problem of discovering noisy patterns. For instance, it discovers the rule IF cylinders \((4.66-5]\) THEN origin \((1.93-2.07]\), which has a support value of 0.01.

Focusing on the average confidence (see Table 7b), notice that all the proposals used in the analysis obtain reliable rules, with an average confidence above 0.975.

According to Table 7c, which shows the average number of rare association rules mined, our proposal obtains a uniform set of rules (between 15 and 20 rules) that is close to the size specified by the data miner. On the other hand, the exhaustive search algorithms obtain a heterogeneous number of rules, depending on the datasets used. For example, the Apriori-Inverse algorithm obtains 32 rare association rules using the Vote dataset and 21,489 rules using the Wisconsin dataset. In any case, none of these algorithms can ensure that the rules discovered are interesting. Therefore, on some datasets, these algorithms obtain a large number of association rules, which is hardly manageable. Apriori-Infrequent only mines rules from the infrequent patterns discovered using the regular Apriori algorithm, so the number of rules it discovers is lower than for the others. Apriori-Inverse provides a higher number of rules, since it mines infrequent patterns and PRARs. ARIMA is the algorithm that discovers the highest number of association rules, since it mines not only PRARs but also IRARs. Finally, MRG-Exp discovers the lowest number of rules, since it only uses minimal itemsets. Concluding the analysis of Table 7c, the strength of the proposed algorithm should be noted: it is able to mine up to a predefined maximum number of rules, even in those cases where this number would become unmanageable if other algorithms were executed instead.

Focusing on the runtime of each algorithm (see Table 7d), notice that MRG-Exp and Rare-G3PARM have average execution times lower than those of the other algorithms. In exhaustive search algorithms, the execution time is not uniform but depends directly on the dataset used: for instance, using ARIMA, the execution time may vary from 293 to 11,113 s. However, the proposal presented in this work obtains rules in a uniformly short time, around 1 s, independently of the dataset. This is a very important advantage of our approach.

Finally, Fig. 10 lists a few rare association rules extracted by executing the G3P proposal presented here over the original Automobile Performance dataset. As shown, the rules discovered are very reliable and exceed the minimum support threshold (set to 0.100 on a per unit basis) to avoid the extraction of noisy rules. Notice also that the use of numerical values and operators allows intervals over certain attributes to be obtained. For example, rule \(1\) determines that the \(horsepower\) attribute has a value in the range \((82, 118]\). This property is an important advantage of the algorithm proposed here over the currently existing proposals in this field.

Fig. 10 Examples of rare association rules over a numerical dataset

5 Concluding remarks

The current RARM algorithms mainly use exhaustive search methods. Therefore, these algorithms cannot be successfully used with huge datasets comprising a large number of attributes, because of the large search space generated. Moreover, these algorithms obtain rare association rules only over categorical domains, so numerical values must be discretized first. Another important issue is that the existing algorithms in the field of rare association rules establish a maximum support threshold for the patterns without using any minimum support threshold for the entire rule. This allows very infrequent rules to be obtained that may be considered noise, so it is interesting to focus on rules instead of patterns.

In this paper, the Rare-G3PARM algorithm for mining interesting and reliable rare association rules was presented. Similarly to G3PARM, this algorithm uses a CFG, which provides expressiveness and flexibility and allows both categorical and numerical attributes to be defined. The use of a grammar constrains the search space, so the search cost necessary to find solutions is significantly reduced. Moreover, using an evolutionary methodology instead of an exhaustive search method decreases the computational time, so any kind of dataset can be mined regardless of its number of instances and attributes. Finally, it should be noted that it is interesting to search for rules whose antecedents provide a positive increment in the occurrence of their consequents, so rules with a lift value greater than unity are desired and obtained.

A number of fitness functions were presented, allowing the boundary between rare and noisy association rules to be established. The experimental analysis of these fitness functions showed that \(F_3\) obtains the most interesting association rules; regarding the confidence value and the average number of rules mined, the analysis shows no significant difference among the four functions. Regardless of the fitness function applied, the proposal could obtain a heterogeneous number of rules over diverse datasets; precisely for this reason, it uses a pool of individuals with a predefined size, so the number of rules mined is restricted to the best individuals. Furthermore, this proposal may be used in different applications such as banking fraud, medical anomalies, network intrusion, etc. More specifically, a previous proof of concept using G3P for mining rare patterns was proposed in the field of intruder detection in user-based collaborative recommender systems, obtaining promising results [22].

The results obtained here showed that the G3P-based algorithm mines up to a predefined number of rare association rules within a range of support values, unlike the existing algorithms, which usually obtain a huge set of rules, resulting in a collection that is hardly understandable. For example, using the Apriori-Inverse algorithm, the number of rules mined may vary from 32 to 21,489, and even from 8,519 to 159,598 using ARIMA. Notice that the rare association rules discovered have high confidence values, as well as lift values greater than unity, stating that their antecedents and consequents are positively correlated. All the algorithms under study behaved really well for mining reliable association rules, obtaining confidence values greater than 0.960. Finally, it is fitting to point out that the G3P proposal has the lowest average execution time, around 1 s in our experiments, and its execution time does not depend on the dataset. On the contrary, in exhaustive search algorithms, the execution time strictly depends on the dataset used: for example, it may vary from 0 to 1,821 s when the Apriori-Inverse algorithm is used.