
1 Introduction

Information technologies are developing rapidly and the amount of collected data keeps growing, so analyzing such big data is an important task of data mining.

The number of rules generated from “real-life” datasets can easily grow very large due to combinatorial explosion. Therefore, mining association rules from such data and reducing their number to produce compact models for end-users is becoming a crucial data mining task [17].

To overcome this problem and achieve more compact as well as understandable models, rules have to be pruned and/or clustered.

Association [1] and classification rule mining [2, 7, 26] are two important fields of data mining. Classification rule mining aims at building accurate models to forecast the class value of a future object by selecting small subsets of rules, while association rule mining algorithms find all existing rules in a dataset based on some user-specified constraints by exploring the entire search space.

Associative Classification (AC) [5, 12, 14, 18, 21, 22] is another data mining technique that combines classification and association rule mining. The main goal of AC methods is to produce compact, accurate and descriptive models based on association rules. The accuracy of AC methods can thus sometimes be better than that of traditional classification methods, in spite of their worse efficiency; however, AC methods are sensitive to user-defined parameters such as minimum support and confidence.

Another area in data mining is clustering [13, 25, 29]. Clustering methods are usually divided into two groups: partitional clustering [24, 30], which aims at grouping similar objects together by using partitioning techniques, and hierarchical clustering [28], which produces a nested sequence of partitions.

In this research work, we present a novel cluster-based associative classification method. Firstly, we describe a new normalized combined distance metric to find the similarity of two class association rules (CARs). Secondly, we cluster the CARs by using a bottom-up approach of hierarchical agglomerative clustering. In this step, we automatically identify the optimal number of clusters. Thirdly, after CARs are clustered, we present a novel method of extracting the “representative” CAR for each cluster. Finally, we develop a compact and accurate associative classification model by including just the representative CARs from each cluster.

The performance of our method (ACHC) is evaluated on 12 selected datasets taken from the UCI ML Repository [6] and compared with 9 popular (associative) classification approaches, such as Decision Table and Naïve Bayes (DTNB) [10], Decision Table (DT) [15], PART (PT) [7], C4.5 [26], CBA [18], Simple Associative Classifier (SA) [20], FURIA (FR) [11], Ripple Down rules (RDR) [27], and J&B [19].

2 Related Work

The first related approach [3] clusters association rules with the k-means clustering algorithm. Like our approach, it utilizes the “APRIORI” algorithm [1] to generate the association rules. The key differences between ACHC and this method are that (1) it uses a different algorithm in the clustering step, namely it clusters the rules based on interestingness measures using the k-means method, and (2) our method uses a novel distance metric to cluster the rules.

The fuzzy clustering algorithm on association rules (FCAR) [4] is proposed based on rule simulation. In this method, researchers aimed to develop a fuzzy clustering method which partitions n objects into k subsets. FCAR uses the “APRIORI” algorithm in the rule-generation part as in our method, but they use a different technique (partitional) in the clustering phase. Another key difference is that they cluster the association rules while we intend to cluster class association rules.

Another related approach [8] is distance-based clustering of association rules. In this method, an “indirect” distance metric (based on CAR support and coverage) is used to group the association rules, while we develop a novel “combined” distance metric (based on direct and indirect measures) and use an agglomerative clustering method to cluster the rules.

In [16], the authors mine clusters with association rules. The APRIORI algorithm is used to find strong CARs in the first step, and these CARs are then clustered with a hierarchical clustering method, as in our approach. However, they used a different, “indirect” distance metric (based on coverage probabilities) in the clustering step.

The algorithm described in this paper extends our previous work from [21] and [22]. In [21], the CMAC algorithm used a direct distance metric in the CAR clustering phase and the cluster-centroid approach to select the representative CAR for each cluster, while in [22] CMAC was compared to two similar algorithms, one using the direct distance metric and a covering approach in the representative-CAR selection phase, and the other using a combined (direct and indirect) distance metric with the same covering approach to select the representative CAR. This paper presents the only remaining combination, i.e., using a combined distance metric in the CAR clustering phase and the cluster-centroid approach to select the representative CAR for each cluster.

3 Our Proposed Method – ACHC

It is assumed that we are given a relational table consisting of S examples (transactions). Each example is described by A different attributes and classified into one of C known classes.

Our main goals and contributions within this research are listed below:

  • Developing a novel distance metric to measure the similarity of class association rules;

  • Clustering of class association rules by utilizing the hierarchical clustering algorithm and finding the optimal number of clusters for each class value;

  • Defining a new method of selecting a representative class association rule for each cluster in order to form a compact and accurate classifier;

  • Performing an experimental evaluation to show the advantages and disadvantages of the developed algorithm.

3.1 Class Association Rules

This section describes how to produce the strong class association rules. We first discover the frequent itemsets and then generate the class association rules from them. To generate the frequent itemsets, we apply the minimum support threshold, that is, only itemsets that satisfy the minimum support constraint are generated in the first step. Once the frequent itemsets have been generated, discovering the class association rules from them is straightforward. In this step, we apply the minimum confidence threshold to produce strong CARs.

We utilize the APRIORI algorithm to generate the frequent itemsets, because APRIORI is a well-known and frequently used algorithm for association rule mining.

Once we have generated the frequent itemsets, our next goal is to produce the strong class association rules which satisfy the minimum confidence constraint. We check the confidence of each candidate rule; if it satisfies the required threshold, the rule is generated. The confidence of a rule \(A\to B\) is computed as follows:

$$confidence\left(rule\right)=\frac{support\_count(A\cup B)}{support\_count(A)}$$
(1)

Equation (1) describes the confidence of a rule based on the support count of a frequent itemset, where A represents the antecedent (left-hand side of the rule) and B the consequent (the class value in the case of CARs) of the rule. Moreover, support_count \((A \cup B)\) is the number of examples in the dataset that match the itemset \(A \cup B\), and support_count (A) is the number of examples that match the itemset A. We generate the strong class association rules that satisfy the minimum confidence constraint based on Eq. (1), as follows:

  1. All nonempty subsets S of a frequent itemset L belonging to class C are generated;

  2. For every nonempty subset S of L, output the strong rule R in the form “S → C” if \(confidence(R)\ge\) min_conf, where min_conf is the minimum confidence threshold.
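To make this step concrete, the following is a minimal sketch of CAR generation with a confidence filter. It enumerates candidate antecedents naively for brevity (the paper uses APRIORI for frequent-itemset mining), and the names generate_cars, min_sup, min_conf and max_len are illustrative assumptions rather than the paper's implementation.

```python
from itertools import combinations
from collections import Counter

def generate_cars(transactions, labels, min_sup, min_conf, max_len=3):
    """Naive sketch of class association rule (CAR) generation.

    transactions: list of item sets; labels: class value per transaction.
    Returns rules as (antecedent, class, support, confidence) tuples.
    """
    n = len(transactions)
    itemset_count, pair_count = Counter(), Counter()
    for items, cls in zip(transactions, labels):
        for k in range(1, max_len + 1):
            for subset in combinations(sorted(items), k):
                itemset_count[subset] += 1          # count of the antecedent itemset
                pair_count[(subset, cls)] += 1      # count of antecedent together with class

    rules = []
    for (antecedent, cls), cnt in pair_count.items():
        support = cnt / n                           # support of the rule (antecedent + class)
        confidence = cnt / itemset_count[antecedent]  # Eq. (1)
        if support >= min_sup and confidence >= min_conf:
            rules.append((antecedent, cls, support, confidence))
    return rules

# Example usage with toy data (hypothetical):
# cars = generate_cars([{"a", "b"}, {"a", "c"}, {"a", "b"}],
#                      ["yes", "no", "yes"], min_sup=0.3, min_conf=0.6)
```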

3.2 The New “Combined” Distance Metric

In our previous research [21], we proposed a novel “direct” distance metric and discussed its advantages and disadvantages. In this research, we use that previously developed “direct” distance metric and additionally focus on an “indirect” distance metric based on the Conditional Market-Basket Probability (CMBP) [8]. Using a probability estimate for distance computation has several advantages: probabilities are well understood, intuitive, and a good basis for further processing, and the measure is appropriate for rules with the same consequent (which is the case for CARs grouped by class value). The distance \({d}^{CMBP}\) between two rules rule1 and rule2 is the (estimated) probability that one of the rules does not hold for a basket, given that at least one of them holds for that basket. This distance is defined as follows:

$${d}_{rule1,rule2}^{CMBP}=1-\frac{\left|m\left(B{S}_{rule1}\cup B{S}_{rule2}\right)\right|}{\left|m\left(B{S}_{rule1}\right)\right|+\left|m\left(B{S}_{rule2}\right)\right|-\left|m\left(B{S}_{rule1}\cup B{S}_{rule2}\right)\right|},$$
(2)

where BS denotes both sides of a rule, that is, the itemset of the association rule. m(BS) and m(\({BS}_{rule1}\cup {BS}_{rule2}\)) denote the sets of transactions (baskets) matched by BS and by \({BS}_{rule1}\cup {BS}_{rule2}\), respectively, while |m(BS)| and \(\left|m({BS}_{rule1}\cup {BS}_{rule2})\right|\) are the numbers of such transactions.

With this metric, rules having no common market baskets are at a distance of 1, and rules valid for an identical set of baskets are at a distance of 0.
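As an illustration, a minimal sketch of the CMBP distance of Eq. (2) in Python is given below, assuming each rule carries the set of ids of the transactions it matches (cover1 and cover2 are assumed names); the value is simply one minus the Jaccard similarity of the two coverage sets.

```python
def cmbp_distance(cover1, cover2):
    """Indirect (CMBP) distance of Eq. (2): 1 minus the Jaccard similarity
    of the transaction sets matched by the two rules."""
    both = len(cover1 & cover2)                 # baskets matched by both rules
    either = len(cover1) + len(cover2) - both   # baskets matched by at least one rule
    return 1.0 if either == 0 else 1.0 - both / either
```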

In this research, we combine the “direct” and “indirect” distance metrics to produce a new Weighted and Combined Distance Metric (WCDM). WCDM combines a direct measure (rule items) and an indirect measure (rule coverage). The weighted distance \({d}^{\mathrm{WCDM}}\) between two rules, rule1 and rule2, is defined in Eq. (3).

$$d_{rule1,rule2}^{WCDM} = \alpha \times d_{rule1,rule2}^{direct} + \left( {1 - \alpha } \right) \times d_{rule1,rule2}^{indirect} ,$$
(3)

where \(\alpha\) is a weight-balancing parameter in (3). We set \(\alpha = 0.5\) when constructing the distance metric, that is, the contributions of the direct and indirect metrics are considered equal in order to obtain a weighted and balanced distance metric. The resulting distance metric is defined in Eq. (4).

$${d}_{rule1,rule2}^{WCDM}=0.5\times {d}_{rule1,rule2}^{direct}+0.5\times {d}_{rule1,rule2}^{indirect}.$$
(4)
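The following sketch combines the two metrics as in Eqs. (3) and (4). The “direct” metric of [21] is not restated in this paper, so a simple antecedent-overlap distance is used here as a hypothetical stand-in; only the weighting itself follows the text, and the rule attributes .items and .cover are assumptions.

```python
def direct_distance(items1, items2):
    """Hypothetical stand-in for the 'direct' metric of [21]:
    1 minus the Jaccard similarity of the rules' antecedent itemsets."""
    union = len(items1 | items2)
    return 1.0 if union == 0 else 1.0 - len(items1 & items2) / union

def wcdm_distance(rule1, rule2, alpha=0.5):
    """Weighted and Combined Distance Metric, Eq. (3); alpha = 0.5 gives Eq. (4).
    Each rule is assumed to expose .items (antecedent itemset) and
    .cover (ids of the transactions it matches)."""
    d_direct = direct_distance(rule1.items, rule2.items)
    d_indirect = cmbp_distance(rule1.cover, rule2.cover)
    return alpha * d_direct + (1 - alpha) * d_indirect
```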

3.3 Clustering

Clustering algorithms are usually split into two groups: partitional and agglomerative hierarchical clustering. In our research, we apply the complete-linkage method of agglomerative hierarchical clustering (in a bottom-up fashion), because it is frequently used and more consistent than other algorithms.

In algorithms based on the bottom-up fashion, every example is initially treated as its own cluster, and the two nearest clusters are merged in each iteration until all clusters have been merged into a single cluster.

In the complete-linkage (farthest-neighbor) method of agglomerative hierarchical clustering, the similarity between two clusters is computed based on the similarity of their most dissimilar examples, that is, the distance between the farthest members of the two clusters is taken as the inter-cluster distance.
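A short sketch of this clustering step using SciPy is shown below, assuming the wcdm_distance function sketched earlier; the paper prescribes complete-linkage agglomerative clustering but not a particular library, so the helper name cluster_cars is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_cars(cars, n_clusters):
    """Complete-linkage agglomerative clustering of CARs from a
    precomputed WCDM distance matrix."""
    n = len(cars)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = wcdm_distance(cars[i], cars[j])
    Z = linkage(squareform(dist), method="complete")   # dendrogram construction
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return Z, labels
```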

[Algorithm 1: pseudocode figure not reproduced]

To cluster the class association rules, we first identify the optimal number of clusters. In this step, we utilize the most common technique (described in Algorithm 1), in which the dendrogram is cut at the point of maximum difference between two consecutive cluster heights.

Algorithm 1 takes as input the cluster heights (distances between merged clusters), which are computed during the dendrogram construction process, and outputs the optimal number of clusters.

Line 3 stores the total number of clusters in N. For each pair of consecutive cluster heights (lines 4–9), we compute their difference and keep track of the maximum difference in the Opt_number_of_cls parameter.
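The cut-selection idea behind Algorithm 1 can be sketched as follows, assuming the SciPy linkage matrix Z from the previous sketch; this is our reading of the maximum-gap rule, not the paper's exact pseudocode.

```python
def optimal_cluster_count(Z):
    """Number of clusters obtained by cutting the dendrogram at the largest
    gap between consecutive merge heights (the idea of Algorithm 1)."""
    heights = Z[:, 2]               # merge heights recorded by linkage()
    n = len(heights) + 1            # number of clustered CARs
    gaps = np.diff(heights)         # differences of consecutive heights
    if gaps.size == 0:              # one merge only: nothing to cut
        return n
    cut = int(np.argmax(gaps)) + 1  # number of merges kept before the cut
    return n - cut
```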

3.4 Selecting the “Representative” CAR

After clustering the CARs, we define an algorithm (presented in Algorithm 2) to select a “representative” CAR for each cluster. Only representative CARs are included in the final classifier.

[Algorithm 2: pseudocode figure not reproduced]

Algorithm 2 describes the method of extracting the representative CAR. Firstly, for each CAR, the distances between it and all other CARs in the cluster are calculated. Secondly, we find the CAR that obtains the minimum overall distance and return that class association rule as the representative.
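A sketch of this selection step is given below, reading “minimum overall distance” as the smallest summed WCDM distance to the other CARs of the cluster (i.e., the cluster medoid); the function name representative_car is an assumption.

```python
def representative_car(cluster_members):
    """Return the CAR with the smallest total WCDM distance to all
    other CARs in the cluster (the cluster 'centroid' CAR)."""
    best, best_total = None, float("inf")
    for car in cluster_members:
        total = sum(wcdm_distance(car, other)
                    for other in cluster_members if other is not car)
        if total < best_total:
            best, best_total = car, total
    return best
```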

Algorithm 3 summarizes all the above-mentioned steps and describes the final ACHC classifier. It takes the training dataset and the minimum support and confidence thresholds as input parameters, and outputs the compact and accurate associative classification model, ACHC.

[Algorithm 3: pseudocode figure not reproduced]

The first three lines generate the strong CARs defined in Sect. 3.1 and sort them in descending order of confidence and support according to the following criterion (a comparator sketch is given after the list):

If \({Rule}_{1}\) and \({Rule}_{2}\) represent two CARs, \({Rule}_{1}\) is said to have higher rank than \({Rule}_{2}\) (\({Rule}_{1}>{Rule}_{2}\)):

  • if and only if \(conf\left({Rule}_{1}\right)>conf({Rule}_{2})\), or

  • if \(conf\left({Rule}_{1}\right)=conf({Rule}_{2})\), but \(supp\left({Rule}_{1}\right)>supp({Rule}_{2})\) or

  • if \(conf\left({Rule}_{1}\right)=conf({Rule}_{2})\) and \(supp\left({Rule}_{1}\right)=supp\left({Rule}_{2}\right),\) \(\mathrm{but} \,\,{Rule}_{1}\) has fewer items in its left-hand side than \({Rule}_{2}\).
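A minimal sketch of this ranking as a sort key, assuming each CAR object exposes confidence, support and an items antecedent set; ties beyond the third criterion are left to the sort's stability.

```python
def rank_key(car):
    """Sort key implementing the ranking criterion: higher confidence first,
    then higher support, then fewer items in the antecedent."""
    return (-car.confidence, -car.support, len(car.items))

# cars.sort(key=rank_key)   # highest-ranked CAR first
```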

Once the CARs are sorted, we group them according to class value (line 4) and then build the distance matrix (line 6) for each group of CARs (as defined in Sect. 3.2) in order to apply the hierarchical clustering algorithm with complete linkage (for example, if we have 3 class values, we build the model for three groups of CARs and merge the results at the end). Line 7 finds the cluster heights (distances between clusters), and the optimal number of clusters is identified from these heights (line 8). After determining the optimal number of clusters, the hierarchical clustering algorithm is applied again to find the clusters of CARs (the Cluster array stores the list of clustered CARs) in line 9. In the last step (lines 10–12), we select the representative class association rule for every cluster (described in Sect. 3.4) to produce the final compact and accurate model.
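Putting the helper sketches together, the per-class flow of Algorithm 3 might look as follows; class_value is an assumed rule attribute, and the step that classifies new examples with the resulting model is outside this section and therefore not shown.

```python
from collections import defaultdict

def build_achc_model(cars):
    """Sketch of the ACHC model-building loop: sort and group CARs by class
    value, cluster each group, and keep one representative CAR per cluster."""
    by_class = defaultdict(list)
    for car in sorted(cars, key=rank_key):            # lines 1-4: sort and group
        by_class[car.class_value].append(car)

    model = []
    for class_value, group in by_class.items():
        if len(group) < 2:                            # too few CARs to cluster
            model.extend(group)
            continue
        Z, _ = cluster_cars(group, n_clusters=1)      # lines 6-7: distances, heights
        k = optimal_cluster_count(Z)                  # line 8: optimal cluster count
        _, labels = cluster_cars(group, n_clusters=k) # line 9: cluster the CARs
        clusters = defaultdict(list)
        for car, label in zip(group, labels):
            clusters[label].append(car)
        model.extend(representative_car(c) for c in clusters.values())  # lines 10-12
    return model
```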

4 Experimental Setting and Results

Our developed algorithm is compared against 9 rule learners in terms of classification accuracy and number of rules. All associative classifiers were run with the default parameters set by the WEKA [9] software. Some parameters (minimum support and confidence) were modified on “imbalanced” datasets to achieve the intended number of rules (at least 10 rules for each class value) for the AC methods.

Statistical significance testing was performed with the paired t-test (the significance threshold was set to 95%). The descriptions of the datasets and the input parameters are shown in Table 1.

Table 1. Description of datasets and AC algorithm parameters

Moreover, a 10-fold cross-validation assessment technique was employed to obtain all experimental results. Table 2 shows the experimental results for classification accuracy (with standard deviations).

Table 2 shows that our proposed associative classifier (ACHC) achieved an average accuracy (84.8%) comparable to the other classification models on the selected datasets. More precisely, ACHC attained the third-highest average accuracy among the 10 classification and association rule mining algorithms.

Our classifier obtained the best accuracy among all algorithms on the “Balance” (except for DTNB), “Breast.Can”, “Spect.H”, “Hayes.R” and “Connect4” datasets, while on the “Car.Evn” and “Nursery” datasets ACHC is beaten by all “classical” classification models.

Table 2. Overall accuracies with standard deviations

Unexpectedly, the standard deviation of all rule learners was high on the “Hayes.R”, “Lymp” and “Connect4” datasets (this happens mainly with imbalanced datasets, which affect the rule-generation part of the “APRIORI” algorithm).

Fig. 1. Comparison between our method and other methods on average accuracy

It can be seen from Fig. 1 that all classification algorithms achieved almost similar results on average accuracy, except for DTNB, DT and SA.

Table 3 presents the statistical significance testing (win/loss counts) on accuracy between ACHC and the other classification models. W: win count (our approach was significantly better than the compared algorithm); L: loss count (our approach was significantly worse than the compared algorithm); N: no statistically significant difference was detected in the comparison.

Table 3. Statistically significant win/loss counts of the ACHC method on accuracy

Table 3 shows that the ACHC method performs better on accuracy than the DTNB, DT and SA methods, while its performance is similar to that of C4.5, FR, J&B, RDR and CBA according to the win/loss counts. Table 4 reports the classifier sizes of all classification methods.

Table 4. Number of CARs

The experimental evaluation of the number of rules (Table 4) shows that ACHC produced the best result in terms of the average number of rules among the 10 rule learners. Our approach produced a statistically smaller classifier on the “Car.Evn” and “Nursery” datasets, although it did not achieve the best classification accuracy on those datasets.

On the “Hayes.R” and “Balance” datasets, ACHC obtained an unexpectedly large number of rules (due to the imbalanced data), but it still produced accurate classifiers for those datasets.

Figure 2 illustrates that associative classifiers achieved better results than the “classical” classification models on the average number of rules. The main reason is that “classical” classification models are sensitive to the size of the dataset.

Fig. 2. Comparison between our method and other methods on average number of rules

Table 5 shows that ACHC produced smaller classifiers than the DTNB, DT, C4.5, PT, FR, SA, and J&B algorithms on more than 8 of the 12 datasets (by win/loss counts).

Table 5. Statistically significant win/loss counts of the ACHC method on the number of rules

5 Conclusion and Future Work

Overall, our method achieved the best result on the average number of rules by exhaustively searching the entire rule space using constraints and clustering, while maintaining a classification accuracy comparable to state-of-the-art rule-learning classification algorithms. The experimental evaluation showed that ACHC produced an accurate and compact classifier, reducing the number of classification rules in the classifier by 2–4 times on average compared to the other “classical” rule learners; this ratio is even larger on datasets with a higher number of examples.

The main drawback of our proposed method (ACHC) is its time efficiency. In future work, we plan to optimize ACHC to bring its time complexity at least somewhat closer to that of state-of-the-art “divide-and-conquer” rule-learning algorithms.

We also plan to develop new methods by setting different values for the \(\alpha\) parameter and to perform experiments showing the advantages and disadvantages of those methods.