Keywords

1 Introduction

Data Mining is vital for data analysis in numerous fields as in the references [13]. It has pulled in a lot of consideration in the information industry and in the public arena as a whole in recent years because of the wide accessibility of immense measures of data and the impending requirement for transforming such data into valuable information and knowledge. The tasks of data mining are of two types: Descriptive and predictive mining. Descriptive mining characterize the significant features in database. The different techniques of descriptive mining are clustering, association rule mining, summarization and sequential pattern mining. Predictive mining forecast the unknown or hidden values of variables utilizing their existing values. Predictive mining techniques are classification, deviation detection, regression, and prediction (Fig. 1).

Fig. 1
figure 1

Categories of data mining tasks [7]

Association Rule Mining (ARM) is one of the significant issues in mining prospect which was presented in [4] and extended in [5] to produce association rules. ARM deal with invention of valuable co-occurrence and correlations among itemsets present in vast number of transactions. Rules found fulfill user defined minimum support and confidence thresholds. The generation of association rules can be produced in two parts. Firstly, we can discover frequent itemsets from the datasets, and the secondary deals with the generation of association rules from those frequent patterns [6] (Fig. 2).

Fig. 2
figure 2

Generation of Association Rules [8]

Frequent pattern mining is a critical zone of Data mining research. The occurrence of items repeatedly within transactions in a transactional database is called frequent itemsets. Frequent itemset mining will be used to find useful patterns over customer’s transaction databases. This kind of finding item set helps organizations to make significant decisions, for example, inventory drawing, cross marketing and customer shopping performance scrutiny. A consumer’s transaction database comprises of set of transactions, where each and every transaction is an itemset. The purpose of frequent itemset mining is to mine all frequent itemsets in a transactional database. Frequent itemset mining can be ordered into different classifications based upon the type of itemset to be mined. These are discussed in detail in the below [7] (Fig. 3):

Fig. 3
figure 3

Kind of pattern to be mined with Association Rules [7]

  1. (a)

    Approximate Frequent Itemset: When an itemset derive approximate support count from its supersets.

  2. (b)

    Closed Frequent Itemset: An itemset is closed if none of its prompt supersets has the same support as an itemset.

  3. (c)

    Constrained Frequent Itemset: An itemset which satisfy user-specifies set of rules through mining.

  4. (d)

    Maximal Frequent Itemset: An itemset is maximal frequent if none of its prompt supersets is frequent.

  5. (e)

    Near-Match Frequent Itemset: An itemset is said to be near-match when its support count is same that of its near-by or similar items.

  6. (f)

    Top-k Frequent Itemset: Out of complete set of frequent itemset, top itemsets that pass user determined value “k”.

2 Literature Survey

In the preceding discussion we have studied the fundamental ideas of data mining, Association Rule Mining and frequent itemset mining. This segment will proceed with the literature review of some popular algorithms related to these mining concepts which have been presented from many years in different researches. In this section, we examine closely related past studies and contrast them with our methodology.

The AIS (Agrawal, Imielinski, Swami [4]) algorithm was introduced to mine the association rules [4]. It has some expertise in the standard of databases alongside essential to method call support queries. Through this algorithm one frequent itemsets are produce, which suggests that the following of these guidelines exclusively contain one thing, as a case we tend to exclusively produce rules like A ∩ B − > C however not those tenets as A − > B ∩ C. The database is scanned number of times to generate the frequent itemsets in AIS. The bottleneck of this algorithm is several candidate itemsets are produced that requires more space and scans.

The AprioriTID algorithm [5] generates the candidate itemsets. The difference between AIS and AprioriTID is it requires only one database scan and a group of candidate itemsets employed for k > 1. Apriori and AprioriTID use consistent candidate generation methodology and calculation sturdy itemsets. Apriori Hybrid algorithm uses Apriori inside the underlying passes and changes to AprioriTid once it expects that the candidate itemsets at the tip of the pass will be in memory. In the SETM algorithmic guideline, candidate itemsets square measure produced on-the-fly in light because the data is scanned, however counts at the highest point of the pass. The disadvantage of SETM algorithm is similar to the AIS algorithm. Another weakness is that for each candidate itemset, there square measure as a few entries as its support value.

The FP-Growth Algorithm [8], proposed by J. Han requires constructing FP-tree. It is a productive and scalable technique for mining the complete set of frequent patterns through pattern fragment growth, utilizing an amplified prefix-tree structure for storing compacted and pivotal data about frequent patterns (FP) called as Frequent Pattern Tree (FP-tree). This algorithm needs 2 passes irrespective of candidate generation and utilizes divide and conquer approach. The database scanned two times. In the first scan, it maintains a list of frequent items sorted by recurrence in dropping order and in the second scan; the FP-tree is generated through the database compression. One issue, which hampers the well known utilization of frequent itemset mining, is that it is troublesome for users to choose a legitimate minimum support threshold. To precisely control the number of the most regular itemsets, is to mine top-k frequent itemsets [917] rather than to mine frequent itemsets. The setting of k—value is more instinctive than setting the minimal support threshold because k represents the number of itemsets the user needs to discover while choosing the threshold depends exclusively on database’s attributes, which are frequently obscure to users [18]. Despite the fact that these algorithms are different in that they make use of different data structures and pursuit strategies to find top-k frequent itemsets, they follow a same general procedure for finding top-k itemsets (Table 1).

Table 1 Input parameters description

3 Top-k Association Rules Mining (TKAR) Algorithm

The Top-k Association Rules mining algorithm takes as input a transactional database, minimum confidence threshold and k the number of rules user wants to discover. The TKAR algorithm is an advantage to conventional association rule mining algorithms meant for users to control the number of association rules. Table 2 presents an overview of the input parameters for the proposed algorithm.

Table 2 Dataset Characteristics

The proposed TKAR algorithm will work as follows:

Algorithm (dataset, k, minconf)

  1. 1.

    minsup = 1; C = NULL; D = NULL

  2. 2.

    dataset is scanned one time and the sids (sequential ids of each item I is stored in a variable called Tids(i)

  3. 3.

    For every pair of items P & Q Where Tids(P) > = min_sup & Tids(Q) > = min_sup, Set Tids(P => Q) = Null & set Tids(P =>Q) = Null

  4. 4.

    For each Tid which belongs to Tids(P) ∩ Tids (Q)

  5. 5.

    If P occurs before Q in the Tid s then Tids(P => Q) = Tids (P => Q) U {s} OR If Q occurs before P in Tid s then Tids(Q => P) = Tids (Q => P) U {s}

  6. 6.

    If │Tids (P => Q) │/│Total sequences│ > = minimum support Then

  7. 7.

    Confidence (P => Q) = │Tids (P =>Q) │/│Tids(P) │

  8. 8.

    If confidence (P =>Q) > = minimum confidence then

  9. 9.

    If │C│ < K then

  10. 10.

    If there exist a rule r2 in C which is similar to the currently generated rule r1 & whose support is also similar to the support of r1 then the rule r1 is not added to C Otherwise this rule r1 is added to the C. D: D = D U r1.

  11. 11.

    If │C│ > = K then

  12. 12.

    Remove a rule r from C whose support is equal to the present minimum support, Set minimum support = lowest support of rules in C

  13. 13.

    while D is non-empty

  14. 14.

    Do

  15. 15.

    Select a rule r with the highest support in D

  16. 16.

    Expand-Left (rule, N, E, min_sup, min_conf)

  17. 17.

    Expand-Right (rule, N, E, min_sup, min_conf)

  18. 18.

    Remove r from D

  19. 19.

    END

Mining the Top-k association rules has two difficulties:

  1. 1.

    Top-k rules can’t depend on minimum support to prune the search space, however they may be adjusted to mine frequent itemsets with minimum support = 1 to guarantee that all top-k rules can be generated in the generating rules step.

  2. 2.

    Top-k association rules mining algorithm can’t utilize the two stages procedure to mine association rules however they would need to be altered to discover top-k rules by generating rules and keeping top-k rules.

4 Experimental Evaluation

For an experimental evaluation of the proposed algorithm (TKAR), we performed several experiments on real-life datasets. We implemented the proposed algorithm on a computer equipped with a Core i5 processor running Windows 7 and 4 GB of main memory running under Java. All datasets (mentioned in the Table 2) were obtained from the Machine Learning RepositoryFootnote 1 (UCI). The Chess and Connect datasets are derivative from their respective game steps, the mushroom dataset contains characteristics of a variety of species of mushrooms, and the Pumsb consists of the population and housing (census) data. Table 3 shows some characteristics of the real datasets. Typically, these datasets are dense, that is, they generate many frequent itemsets even for very high values of support.

Table 3 Evaluation for Mushroom Dataset

Jeff Schlimmer utilized a database and applied mining techniques with the intention to obtain classifiers which can be used on similar databases. He compiled a mushroom dataset illustrating the records from The Audubon Society Field to North American Mushrooms [6] (1981). The Audubon Society’s vision is to preserve and reinstate natural ecosystems. The dataset comprises 8124 instances among them 4208 are classified as edible while 3916 are poisonous and 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species may recognize as belong to one of the three classes, namely definitely edible, definitely poisonous, or unknown edibility and thus, not recommended. The purpose of choosing the mushroom dataset is to detach edible mushrooms from poisonous ones. It is a classification problem with a target variable of whether mushrooms can be eaten or not. The results generated through TKAR algorithm on mushroom dataset indicates to increase the survival rate when we are in the wild life. If one is capable of distinguishing mushrooms between edible and poisonous by their appearances then we may able to survive in deep mountains by eating mushrooms.

Impact of the k parameter

We have run algorithm by fluctuating estimation of k and considers its values as 75, 150, 600 and 2000 separately. Furthermore we have taken minconf value as 0.1–0.9. As estimation of k increases, both the execution time and memory also increases. This assessment has been appeared underneath in the table and its respective graph. Initially as appeared underneath we consider results of mushroom dataset with various estimation of k. This Table 3 and Fig. 4 indicates memory necessity is directly expanding with estimation of k.

Fig. 4
figure 4

Detailed results for varying k for the mushroom dataset

5 Conclusion

The users need to choose the parameter estimations of minconf and minsup in classical algorithm. So amazingly vast number of rules is produced. Due to this the algorithm experience the long execution time and more memory required. We proposed the methodology which produce governs precisely how much user needs. This proposed algorithm gives preferred execution and performance over others and it is worthwhile to classical algorithms when the user needs to control the number of association rules produced. The top-k association rules mined through the proposed algorithm have the redundancy rules. There is a scope to develop efficient algorithms for mining non-redundant association rules.