1 Introduction

Data mining technologies aim to extract valuable knowledge from large volumes of data [1]. There are many data mining methods and algorithms. One of the most traditional data mining approaches is finding frequent itemsets in transactional databases and deriving their corresponding association rules. Many algorithms have been proposed for mining association rules. The best known and simplest one is the APRIORI algorithm [2], proposed in 1993 by Agrawal. Using the APRIORI algorithm in data mining (DM) makes it possible to test the various combinations of items (data attributes) to find potential relationships, which are then expressed in the form of association rules. The rules produced by APRIORI are judged by well-known measures (support and confidence). However, this algorithm suffers from an important defect: it cannot determine the minimum values of support and confidence, so these parameters are estimated intuitively by the users. Depending on the choice of those thresholds, association rule mining algorithms can generate a huge number of rules, which leads to long execution times and large memory consumption, or may generate a small number of rules and thus discard valuable information.

On a massive database, this method can also produce a very large number of rules (millions), many of which are not useful, so it is not efficient enough in practice. We therefore require a method that finds the best values of the support parameter automatically, especially in large databases. The main goal of this paper is to present a method for finding proper values of the minimum support threshold.

The outline of our paper is as follows. In Sect. 2, we present the necessary scientific background, an overview of association rules mining, and related works. Sect. 3 presents our proposed approach, based on the APRIORI algorithm, for mining association rules with an automatically adjusted support threshold and multiple minimum supports. In Sect. 4, we discuss the experimental results and their analysis. The conclusion and the scope for future work are given in the last section.

2 An Overview of AR Mining

2.1 Process of Association Rules Mining

Association Rules Mining

We define I = {i1, i2, …, in} as the set of all items and T = {t1, t2, …, tm} as the set of all transactions; every transaction ti is an itemset and satisfies ti ⊆ I. Association rules can be generated from large (frequent/closed/maximal) itemsets. An association rule is an implication expression of the form X → Y, with X ⊆ I and Y ⊆ I, where X and Y are disjoint itemsets (i.e. X ∩ Y = ∅). X is called the antecedent and Y is called the consequent of the rule.

The strength of an association rule can be measured in terms of its support and confidence. The support of the rule X → Y is the percentage of transactions in database D that contain X ∪ Y and is represented as:

$$ \mathrm{Support}(X \rightarrow Y) = P(XY) = \frac{n(X \cup Y)}{n} $$
(1)

The confidence of a rule X → Y describes the percentage of transactions containing X which also contain Y and is represented as

$$ \mathrm{Confidence}(X \rightarrow Y) = \frac{n(X \cup Y)}{n(X)} = \frac{P(XY)}{P(X)} $$
(2)

where n(X ∪ Y) is the number of transactions that contain all the items of the rule (i.e. X ∪ Y), n(X) is the number of transactions containing itemset X, and n is the total number of transactions.

The process of mining association rules consists of discovering all association rules from the transactional database D whose support and confidence are greater than the thresholds predefined by the user: minimum support (minsup) and minimum confidence (minconf).
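As a small worked illustration of Eqs. (1) and (2), the following Python sketch computes support and confidence on a toy database. The four transactions and the function names `support` and `confidence` are assumptions made for this example only; they do not come from the datasets used in this paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset` (Eq. 1)."""
    itemset = frozenset(itemset)
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)


def confidence(transactions, antecedent, consequent):
    """n(X u Y) / n(X) for the rule X -> Y (Eq. 2)."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if xy <= t)
    return n_xy / n_x if n_x else 0.0


# Hypothetical toy database of four transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

sup_bread_milk = support(transactions, {"bread", "milk"})        # 2 of 4 transactions
conf_bread_milk = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 with bread
```

Here the rule bread → milk has support 2/4 and confidence 2/3, since bread appears in three transactions and bread together with milk in two.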

APRIORI Algorithm

Diverse algorithms for mining association rules have been proposed. The best known, and certainly the simplest, is the APRIORI algorithm [2]. It traverses the itemset lattice breadth-first, as do the Charm [3] and Closet [4] algorithms. Other algorithms traverse the lattice depth-first, which is notably the case for FP-Growth [5] and Eclat [6].

The APRIORI algorithm works in two steps:

  • Find the frequent itemsets: a frequent itemset is an itemset that satisfies a predefined threshold of minimum support.

  • Generate all strong association rules from the frequent itemsets: a strong association rule is a rule that satisfies a predefined threshold of minimum confidence.
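The second step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this paper: the function name `generate_rules` and the toy `supports` table are assumptions, and the table is assumed to contain the support of every frequent itemset together with all of its subsets.

```python
from itertools import combinations


def generate_rules(supports, minconf):
    """Return (antecedent, consequent, support, confidence) for every strong rule.

    `supports` maps each frequent itemset (a frozenset) to its support and is
    assumed to also contain every non-empty subset of each frequent itemset.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                y = itemset - x
                conf = sup / supports[x]  # confidence = sup(X u Y) / sup(X)
                if conf >= minconf:
                    rules.append((x, y, sup, conf))
    return rules


# Hypothetical supports of three frequent itemsets.
supports = {
    frozenset({"bread"}): 0.75,
    frozenset({"milk"}): 0.75,
    frozenset({"bread", "milk"}): 0.5,
}
strong_rules = generate_rules(supports, minconf=0.6)
```

With these toy values, both bread → milk and milk → bread have confidence 0.5/0.75 ≈ 0.67 and are kept; raising `minconf` to 0.7 would discard both.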

The APRIORI algorithm generates candidate (k + 1)-itemsets from frequent k-itemsets, according to the APRIORI principle that every subset of a frequent itemset is itself frequent.

First, the frequent 1-itemsets L1 are found. Then L2 is generated from L1, and so on, until no more frequent k-itemsets can be found, at which point the algorithm stops. Generating each Lk requires one scan of the database. Each candidate set Ck is generated from Lk − 1.

Pseudo code of APRIORI:

  • \( \mathtt{C}_{\mathtt{k}} \) : candidate itemsets of size k.

  • \( \mathtt{L}_{\mathtt{k}} \) : frequent itemsets of size k.

  • \( \mathtt{L}_{\mathtt{1}} \) = {frequent items}.

  • For (k = 1; \( \mathtt{L}_{\mathtt{k}} \neq \emptyset \); k++) do begin

  • \( \mathtt{C}_{\mathtt{k+1}} \) = candidates generated from \( \mathtt{L}_{\mathtt{k}} \);

  • For each transaction t in database D, increment the count of all candidates in \( \mathtt{C}_{\mathtt{k+1}} \) that are contained in t;

  • \( \mathtt{L}_{\mathtt{k+1}} \) = candidates in \( \mathtt{C}_{\mathtt{k+1}} \) with support ≥ minsup;

  • End.

  • Return \( \bigcup_{\mathtt{k}} \mathtt{L}_{\mathtt{k}} \).
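The pseudo code above can be turned into a short runnable Python sketch. It is a didactic version under simplifying assumptions (in-memory transactions, naive candidate counting), not the Java implementation used in our experiments; the function name `apriori` and the toy transactions are illustrative.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {frequent itemset: support} using a single uniform minsup."""
    n = len(transactions)

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= minsup}

    frequent = dict(current)
    k = 1
    while current:
        # C(k+1): join pairs of frequent k-itemsets, then prune any candidate
        # that has an infrequent k-subset (the APRIORI principle).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that reach minsup.
        current = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy database.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
frequent = apriori(transactions, minsup=0.5)
```

On this toy database with minsup = 0.5, the three single items and the pairs {bread, milk} and {bread, butter} are frequent; the candidate {bread, milk, butter} is pruned because its subset {milk, butter} is not frequent.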

2.2 State of the Art

Research on association rules mining (ARM) can be divided into two classes: the determination of the user-specified support and confidence thresholds, and post-processing that uses interestingness measures to evaluate and find the most interesting rules. Most ARM algorithms rely on support and confidence thresholds and use a uniform threshold at all levels. A suitable choice of those thresholds therefore directly influences the number and the quality of the association rules discovered.

Several works aim to address this challenge and help the user choose the support and confidence thresholds that best suit the decision scope.

Fournier-Viger [7, 8] redefines the problem of association rule mining as mining the top-k association rules, introducing an algorithm that finds the k rules having the greatest support, where k is the number of rules to be generated and is defined by the user. To generate rules, Top-K-Rules relies on a novel approach called rule expansion: it finds larger rules by recursively scanning the database and adding a single item at a time to the left or right part of each rule. This approach has excellent scalability: execution time increases linearly with k. The top-k pattern mining algorithm is slower but has the benefit of letting the user set the number of patterns to be discovered, which may be more intuitive.

Kuo et al. [9] introduced a method based on Binary Particle Swarm Optimization to determine the best values of the support and confidence parameters automatically, with the aim of improving computational performance as well as defining suitable threshold values. After the data are converted into binary values, the particle swarm optimization algorithm first searches for the best fitness value of each particle and then takes the corresponding support and confidence as minimal threshold values; these minimal support and confidence values are then used to extract association rules.

Other approaches [10,11,12] use multiple levels in the ARM process. Multiple-level association rule mining can run with two kinds of support: uniform and reduced. With uniform support, the same minimum support threshold is used at every level of the hierarchy, and there is no need to evaluate itemsets that include items whose ancestors do not have minimum support. The minimum support threshold has to be suitable: if it is too high, we can lose lower-level associations, and if it is too low, we can end up producing too many uninteresting high-level association rules.

Other works [13, 14] proposed an approach based on multi-criteria optimization that aims to select the most interesting association rules without the need to set any parameters at all. The idea is to find the patterns that are not dominated by any other pattern with respect to a set of interestingness measures.
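To make the notion of dominance concrete, the following Python sketch filters a set of rules down to the non-dominated ones. It illustrates generic Pareto dominance only and is not claimed to be the algorithm of [13, 14]; the rule names, the measure values, and the assumption that higher is better for every measure are all hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every measure and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(rules):
    """Keep the rules that no other rule dominates.

    `rules` maps a rule name to a tuple of interestingness measures,
    all assumed to be "higher is better".
    """
    return {
        name: measures
        for name, measures in rules.items()
        if not any(
            dominates(other, measures)
            for other_name, other in rules.items()
            if other_name != name
        )
    }


# Hypothetical rules described by (support, confidence).
rules = {
    "r1": (0.50, 0.90),
    "r2": (0.40, 0.80),  # dominated by r1 on both measures
    "r3": (0.20, 0.95),  # lower support than r1 but higher confidence
}
selected = non_dominated(rules)
```

In this toy example r2 is discarded because r1 is better on both measures, while r1 and r3 are both kept since neither dominates the other.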

3 Proposed Approach

The main problem of the APRIORI algorithm is the choice of the thresholds for support and confidence. APRIORI finds the frequent itemsets by generating all possible candidate itemsets and keeping those that satisfy a minimum threshold defined by the user. This choice influences the number and the quality of the association rules (AR). Our algorithm instead uses a minsup threshold that is derived from the transactional dataset itself.

In this paper we propose two main contributions. The first is to compute the minimum support (minsup) automatically for each dataset instead of using a constant value predefined by the user. The second is to update this minsup dynamically at each level. Most existing methods apply a single, uniform minimum support threshold to all items or itemsets. However, the items do not all behave in the same way: some appear very frequently, while others are infrequent or very rare. The minsup threshold should therefore change according to the level of the itemsets.

Our algorithm can be divided into several steps:

  • Input: a transactional dataset, i.e. a set of n transactions.

  • Step 1: Determine the minimum support for the first level (1-itemsets), minsup1, as the mean of the supports of all itemsets with one item.

    $$ \text{minsup}_{1} = \frac{1}{N} \sum\limits_{i = 1}^{N} \sup(\text{1-itemset}_{i}) $$

    Here minsup is a minimum support, a 1-itemset is a set composed of one item, and N is the number of all 1-itemsets.

  • Step 2: Check whether the support \( \sup(\text{1-itemset}_{i}) \) of each item i is larger than or equal to minsup1. If item i satisfies this condition, put it in the set of large 1-itemsets (L1).

    \( \text{L}_{1} = \{ \text{1-itemset}_{i} \mid \sup(\text{1-itemset}_{i}) \ge \text{minsup}_{1},\ i = 1, \ldots, N \} \), where N is the number of all 1-itemsets.

  • Step 3: Generate the candidate set C2 from L1 in the same way as the APRIORI algorithm. The difference lies in how the support threshold applied to the candidate k-itemsets is obtained (see step 4).

    k-itemset: a set composed of k items.

  • Step 4: Compute the new minsup for the 2-itemset level, minsup2, as the mean of the supports of the generated candidates in C2, and generate L2.

    \( \text{L}_{2} = \{ \text{2-itemset}_{i} \mid \sup(\text{2-itemset}_{i}) \ge \text{minsup}_{2},\ i = 1, \ldots, N \} \), where N is the number of all 2-itemsets.

  • Step 5: Check whether the support \( \sup(\text{k-itemset}_{i}) \) of each candidate k-itemset i is larger than or equal to the minsupk obtained as in step 4. If it satisfies this condition, put it in the set of large k-itemsets (Lk).

    \( \text{L}_{k} = \{ \text{k-itemset}_{i} \mid \sup(\text{k-itemset}_{i}) \ge \text{minsup}_{k},\ i = 1, \ldots, N \} \), where N is the number of all k-itemsets.

  • Step 6: Repeat steps 3 to 5 until Lk is empty.

  • Step 7: Construct the association rules from each large k-itemset with items {Ik1, Ik2, …, Ikq}, q ≥ 2, keeping the rules that satisfy the confidence threshold, i.e. the rules whose confidence is larger than or equal to the confidence threshold, defined by the mean of the supports of all large q-itemsets.

  • Output: a set of association rules obtained with an automatic, multilevel support threshold.
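Steps 1 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not our Java implementation (Supd): the function names, the toy transactions, the APRIORI-style pruning against the previous level, and the guard that drops zero-support candidates are choices of this sketch. Rule construction (step 7) is omitted.

```python
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def count_support(transactions, candidates):
    """Support of every candidate itemset, obtained with one database scan each."""
    n = len(transactions)
    return {c: sum(1 for t in transactions if c <= t) / n for c in candidates}


def adaptive_levels(transactions):
    """Return a list of (minsup_k, L_k) pairs, one per level k = 1, 2, ..."""
    candidates = {frozenset([item]) for t in transactions for item in t}
    levels = []
    k = 1
    while candidates:
        sup = count_support(transactions, candidates)
        minsup_k = mean(sup.values())  # steps 1 and 4: mean support of the level
        # Steps 2 and 5: keep candidates reaching the level threshold.
        # Assumption of this sketch: itemsets that never occur are dropped.
        large = {c: s for c, s in sup.items() if s >= minsup_k and s > 0}
        if not large:
            break
        levels.append((minsup_k, large))
        # Step 3: APRIORI-style join and prune to build the next candidates.
        candidates = set()
        for a, b in combinations(list(large), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in large for sub in combinations(union, k)
            ):
                candidates.add(union)
        k += 1  # step 6: repeat until no candidates remain
    return levels


# Hypothetical toy database.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
levels = adaptive_levels(transactions)
```

On this toy database the item supports are 0.75 (bread), 0.75 (milk) and 0.5 (butter), so minsup1 = 2/3 and L1 = {bread, milk}. The only candidate at level 2 is {bread, milk} with support 0.5, so minsup2 = 0.5 and L2 = {{bread, milk}}, after which no candidates remain.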

4 Experiment Study

In this part, we illustrate and investigate the advantages of our proposed algorithm (Supd). We use several public datasets (mushroom, flare1, flare2, Zoo, Connect) obtained from the UCI machine learning repository [15]. T10I4D100K (T10I4D) was generated using the generator from the IBM Almaden Quest research group [16], and Foodmart is a dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 [17]. Table 1 summarizes the properties of the datasets used.

Table 1. Characteristics of the used datasets

Our objectives in this section are threefold. First, we show through several experiments that our method reduces the huge number of generated association rules compared to the APRIORI algorithm [2]. Second, we conduct an experiment to examine the quality of the generated association rules. Third, we study the runtime and compare it to that of the APRIORI algorithm (APR) and the Topkrule algorithm (Topk) [7].

All the approaches are implemented in the Java programming language. Our first experiments were run on a computer (C1) with the following specifications: Core™ i3, 1.70 GHz, memory capacity: 4 GB.

Table 2 shows, for each dataset, the chosen confidence threshold and the support value found by our algorithm at the first level.

Table 2. Threshold values of support and confidence

4.1 Reduction of the Number of Rules

In this experiment, we show the ability of our proposed approach to reduce the number of AR generated from the chosen datasets. The experiment compares our approach to threshold-based APRIORI.

Table 3 compares the number of AR generated by our method to the number generated by APRIORI.

Table 3. Number of AR generated for each dataset

These experiments show that our approach can significantly reduce the huge number of rules generated from the datasets, which facilitates interpretation and helps users identify the most interesting rules and make decisions.

4.2 The Running Time Analysis

We implemented the traditional Apriori from [1] and our proposed algorithm (Supd), and we compare the running time of the original Apriori (APR), the Topkrule algorithm (Topk), and our proposed algorithm on several datasets, with various values for the minimum support given in the implementation. The running time may differ between machine configurations. For this reason, we also ran our experiments on a second computer (C2) with the following specifications: Core™2 Duo CPU E8400, 3.00 GHz, memory capacity: 4 GB, in order to obtain an unbiased comparison. The results are shown in Tables 4 and 5.

Table 4. Running time on the different datasets using computer C1 (in ms)
Table 5. Running time on the different datasets using computer C2 (in ms)
Table 6. The average confidence for the different datasets

As we see in Table 4, the running time of our proposed algorithm on each dataset is lower than that of the original Apriori, and the difference grows as the number of transactions in the dataset increases.

On the other hand, the running time of our approach is the same as that of Topkrule on some datasets and lower on the others.

Another advantage of our algorithm is its use of memory. No results were obtained for Topkrule on the Connect and T10I4D100K datasets: the Topkrule algorithm could not run on either machine because of memory problems, while our algorithm ran without any problem.

4.3 The Quality of the Extracted Rules

In order to analyze the performance of our proposed algorithm, we compared, for each dataset, the average confidence of the rules produced by our method to that of the original method.

Table 6 shows that the proposed method finds rules with high confidence values in the majority of the datasets, which supports the benefit of our proposed method.

5 Conclusion

In this paper, we proposed a new approach, based on the APRIORI algorithm, for discovering association rules that automatically adjusts the choice of the support threshold. The main advantage of the proposed method is that the support is chosen automatically at multiple levels; we obtain results containing the desired rules with high interestingness in little time. The number of rules generated by the proposed algorithm is significantly lower than with the APRIORI algorithm. Hence, we can say that our algorithm answers the problem of choosing the support threshold efficiently and effectively. As future work, we plan to improve our approach so that it can select the interesting association rules without using any predefined threshold.