TKAR: Efficient Mining of Top-k Association Rules on Real—Life Datasets

Gireesha, O.; Obulesu, O.

doi:10.1007/978-981-10-3156-4_5

O. Gireesha¹⁸ &
O. Obulesu¹⁸

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 516))

956 Accesses

Abstract

Data mining is an important facet for discovering association rules among the biggest scope of itemsets. Association rule mining (ARM) is one of the techniques in data processing with the two sub processes. One is identifying frequent itemsets and the other is association rule mining. Frequent itemset mining has developed as a major issue in data mining and assumes an essential part in various data mining tasks, for example, association analysis, classification, etc. In the structure of frequent itemset mining, the outcomes are itemsets which are frequent in the entire database. Association rule mining is a basic data mining task. Researchers developed many algorithms for finding frequent itemsets and association rules. However, relying upon the choice of the thresholds, present algorithms become very slow and produce a greatly large amount of outcomes or generates few outcomes, omitting usable information. Furthermore, it is well-known that an expansive extent of association rules produced is redundant. This is truly a significant issue because in practice users don’t have much asset for analyzing the outcomes and need to find a certain amount of outcomes within a limited time. To address this issue, we propose a one of a kind algorithm called top-k association rules (TKAR) to mine top positioned data from a data set. The proposed algorithm uses a novel technique for generating association rules. This algorithm is unique and best execution and characteristic of scalability, which will be a beneficial alternative to traditional association rule mining algorithms and where k is the number of rules user want to mine.

Access provided by CONRICYT-eBooks. Download conference paper PDF

Fast Top-K association rule mining using rule generation property pruning

Article 26 October 2020

ETARM: an efficient top-k association rule mining algorithm

Article 15 August 2017

Fast Mining of Top-k Frequent Balanced Association Rules

Keywords

1 Introduction

Data Mining is vital for data analysis in numerous fields as in the references [1–3]. It has pulled in a lot of consideration in the information industry and in the public arena as a whole in recent years because of the wide accessibility of immense measures of data and the impending requirement for transforming such data into valuable information and knowledge. The tasks of data mining are of two types: Descriptive and predictive mining. Descriptive mining characterize the significant features in database. The different techniques of descriptive mining are clustering, association rule mining, summarization and sequential pattern mining. Predictive mining forecast the unknown or hidden values of variables utilizing their existing values. Predictive mining techniques are classification, deviation detection, regression, and prediction (Fig. 1).

Association Rule Mining (ARM) is one of the significant issues in mining prospect which was presented in [4] and extended in [5] to produce association rules. ARM deal with invention of valuable co-occurrence and correlations among itemsets present in vast number of transactions. Rules found fulfill user defined minimum support and confidence thresholds. The generation of association rules can be produced in two parts. Firstly, we can discover frequent itemsets from the datasets, and the secondary deals with the generation of association rules from those frequent patterns [6] (Fig. 2).

Frequent pattern mining is a critical zone of Data mining research. The occurrence of items repeatedly within transactions in a transactional database is called frequent itemsets. Frequent itemset mining will be used to find useful patterns over customer’s transaction databases. This kind of finding item set helps organizations to make significant decisions, for example, inventory drawing, cross marketing and customer shopping performance scrutiny. A consumer’s transaction database comprises of set of transactions, where each and every transaction is an itemset. The purpose of frequent itemset mining is to mine all frequent itemsets in a transactional database. Frequent itemset mining can be ordered into different classifications based upon the type of itemset to be mined. These are discussed in detail in the below [7] (Fig. 3):

(a)
Approximate Frequent Itemset: When an itemset derive approximate support count from its supersets.
(b)
Closed Frequent Itemset: An itemset is closed if none of its prompt supersets has the same support as an itemset.
(c)
Constrained Frequent Itemset: An itemset which satisfy user-specifies set of rules through mining.
(d)
Maximal Frequent Itemset: An itemset is maximal frequent if none of its prompt supersets is frequent.
(e)
Near-Match Frequent Itemset: An itemset is said to be near-match when its support count is same that of its near-by or similar items.
(f)
Top-k Frequent Itemset: Out of complete set of frequent itemset, top itemsets that pass user determined value “k”.

2 Literature Survey

In the preceding discussion we have studied the fundamental ideas of data mining, Association Rule Mining and frequent itemset mining. This segment will proceed with the literature review of some popular algorithms related to these mining concepts which have been presented from many years in different researches. In this section, we examine closely related past studies and contrast them with our methodology.

The AIS (Agrawal, Imielinski, Swami [4]) algorithm was introduced to mine the association rules [4]. It has some expertise in the standard of databases alongside essential to method call support queries. Through this algorithm one frequent itemsets are produce, which suggests that the following of these guidelines exclusively contain one thing, as a case we tend to exclusively produce rules like A ∩ B − > C however not those tenets as A − > B ∩ C. The database is scanned number of times to generate the frequent itemsets in AIS. The bottleneck of this algorithm is several candidate itemsets are produced that requires more space and scans.

The AprioriTID algorithm [5] generates the candidate itemsets. The difference between AIS and AprioriTID is it requires only one database scan and a group of candidate itemsets employed for k > 1. Apriori and AprioriTID use consistent candidate generation methodology and calculation sturdy itemsets. Apriori Hybrid algorithm uses Apriori inside the underlying passes and changes to AprioriTid once it expects that the candidate itemsets at the tip of the pass will be in memory. In the SETM algorithmic guideline, candidate itemsets square measure produced on-the-fly in light because the data is scanned, however counts at the highest point of the pass. The disadvantage of SETM algorithm is similar to the AIS algorithm. Another weakness is that for each candidate itemset, there square measure as a few entries as its support value.

The FP-Growth Algorithm [8], proposed by J. Han requires constructing FP-tree. It is a productive and scalable technique for mining the complete set of frequent patterns through pattern fragment growth, utilizing an amplified prefix-tree structure for storing compacted and pivotal data about frequent patterns (FP) called as Frequent Pattern Tree (FP-tree). This algorithm needs 2 passes irrespective of candidate generation and utilizes divide and conquer approach. The database scanned two times. In the first scan, it maintains a list of frequent items sorted by recurrence in dropping order and in the second scan; the FP-tree is generated through the database compression. One issue, which hampers the well known utilization of frequent itemset mining, is that it is troublesome for users to choose a legitimate minimum support threshold. To precisely control the number of the most regular itemsets, is to mine top-k frequent itemsets [9–17] rather than to mine frequent itemsets. The setting of k—value is more instinctive than setting the minimal support threshold because k represents the number of itemsets the user needs to discover while choosing the threshold depends exclusively on database’s attributes, which are frequently obscure to users [18]. Despite the fact that these algorithms are different in that they make use of different data structures and pursuit strategies to find top-k frequent itemsets, they follow a same general procedure for finding top-k itemsets (Table 1).

Table 1 Input parameters description

Full size table

3 Top-k Association Rules Mining (TKAR) Algorithm

The Top-k Association Rules mining algorithm takes as input a transactional database, minimum confidence threshold and k the number of rules user wants to discover. The TKAR algorithm is an advantage to conventional association rule mining algorithms meant for users to control the number of association rules. Table 2 presents an overview of the input parameters for the proposed algorithm.

Table 2 Dataset Characteristics

Full size table

The proposed TKAR algorithm will work as follows:

Algorithm (dataset, k, minconf)

1.
minsup = 1; C = NULL; D = NULL
2.
dataset is scanned one time and the sids (sequential ids of each item I is stored in a variable called Tids(i)
3.
For every pair of items P & Q Where Tids(P) > = min_sup & Tids(Q) > = min_sup, Set Tids(P => Q) = Null & set Tids(P =>Q) = Null
4.
For each Tid which belongs to Tids(P) ∩ Tids (Q)
5.
If P occurs before Q in the Tid s then Tids(P => Q) = Tids (P => Q) U {s} OR If Q occurs before P in Tid s then Tids(Q => P) = Tids (Q => P) U {s}
6.
If │Tids (P => Q) │/│Total sequences│ > = minimum support Then
7.
Confidence (P => Q) = │Tids (P =>Q) │/│Tids(P) │
8.
If confidence (P =>Q) > = minimum confidence then
9.
If │C│ < K then
10.
If there exist a rule r2 in C which is similar to the currently generated rule r1 & whose support is also similar to the support of r1 then the rule r1 is not added to C Otherwise this rule r1 is added to the C. D: D = D U r1.
11.
If │C│ > = K then
12.
Remove a rule r from C whose support is equal to the present minimum support, Set minimum support = lowest support of rules in C
13.
while D is non-empty
14.
Do
15.
Select a rule r with the highest support in D
16.
Expand-Left (rule, N, E, min_sup, min_conf)
17.
Expand-Right (rule, N, E, min_sup, min_conf)
18.
Remove r from D
19.
END

Mining the Top-k association rules has two difficulties:

1.
Top-k rules can’t depend on minimum support to prune the search space, however they may be adjusted to mine frequent itemsets with minimum support = 1 to guarantee that all top-k rules can be generated in the generating rules step.
2.
Top-k association rules mining algorithm can’t utilize the two stages procedure to mine association rules however they would need to be altered to discover top-k rules by generating rules and keeping top-k rules.

4 Experimental Evaluation

For an experimental evaluation of the proposed algorithm (TKAR), we performed several experiments on real-life datasets. We implemented the proposed algorithm on a computer equipped with a Core i5 processor running Windows 7 and 4 GB of main memory running under Java. All datasets (mentioned in the Table 2) were obtained from the Machine Learning Repository^{Footnote 1} (UCI). The Chess and Connect datasets are derivative from their respective game steps, the mushroom dataset contains characteristics of a variety of species of mushrooms, and the Pumsb consists of the population and housing (census) data. Table 3 shows some characteristics of the real datasets. Typically, these datasets are dense, that is, they generate many frequent itemsets even for very high values of support.

Table 3 Evaluation for Mushroom Dataset

Full size table

Jeff Schlimmer utilized a database and applied mining techniques with the intention to obtain classifiers which can be used on similar databases. He compiled a mushroom dataset illustrating the records from The Audubon Society Field to North American Mushrooms [6] (1981). The Audubon Society’s vision is to preserve and reinstate natural ecosystems. The dataset comprises 8124 instances among them 4208 are classified as edible while 3916 are poisonous and 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species may recognize as belong to one of the three classes, namely definitely edible, definitely poisonous, or unknown edibility and thus, not recommended. The purpose of choosing the mushroom dataset is to detach edible mushrooms from poisonous ones. It is a classification problem with a target variable of whether mushrooms can be eaten or not. The results generated through TKAR algorithm on mushroom dataset indicates to increase the survival rate when we are in the wild life. If one is capable of distinguishing mushrooms between edible and poisonous by their appearances then we may able to survive in deep mountains by eating mushrooms.

Impact of the k parameter

We have run algorithm by fluctuating estimation of k and considers its values as 75, 150, 600 and 2000 separately. Furthermore we have taken minconf value as 0.1–0.9. As estimation of k increases, both the execution time and memory also increases. This assessment has been appeared underneath in the table and its respective graph. Initially as appeared underneath we consider results of mushroom dataset with various estimation of k. This Table 3 and Fig. 4 indicates memory necessity is directly expanding with estimation of k.

5 Conclusion

The users need to choose the parameter estimations of minconf and minsup in classical algorithm. So amazingly vast number of rules is produced. Due to this the algorithm experience the long execution time and more memory required. We proposed the methodology which produce governs precisely how much user needs. This proposed algorithm gives preferred execution and performance over others and it is worthwhile to classical algorithms when the user needs to control the number of association rules produced. The top-k association rules mined through the proposed algorithm have the redundancy rules. There is a scope to develop efficient algorithms for mining non-redundant association rules.

Notes

1.
http://fimi.ua.ac.be/data/

References

L. Bin, Q. Huayong, S. Yizhen, Realization and application of customer attrition early warning model in security company. TELKOMNIKA Indonesian J. Electr. Eng. 10(5), 10(5), 1106–1110 (2012)
Google Scholar
L. Zhou, H. Wang, W. Wang, Parallel implementation of classification algorithms based on cloud computing environment. TELKOMNIKA Indonesian J. Electr. Eng. 10(5), 10(5), 1087–1092 (2012)
Google Scholar
A. Ali, M. Elfaki, D. Norhayati, Using Naïve Bayes and bayesian network for prediction of potential problematic cases in tuberculosis. Int. J. Inf. Commun. Technol. (IJ-ICT), 1(2),1(2), 63–71 (2012)
Google Scholar
R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in Proceedings of the ACM SIGMOD International Conference on Management of data, vol. 22, no. 2, (ACM, New York, 1993), pp. 207–216
Google Scholar
R. Agrawal, R. Srikant, Fast algorithms for mining Association Rules in large databases, in Proceedings of the 20th International Conference on Very Large Data Bases (Morgan Kaufmann Publishers Inc., San Francisco, 1994), pp. 487–499
Google Scholar
D. Jain, S. Gautam, Implementation of apriori algorithm in health care sector: a survey. Int. J. Comput. Sci. Commun. Eng. Nov 2013
Google Scholar
S. Panwar, Analytical study of frequent Itemset and Utility Mining Approaches. Int. J. Eng. Appl. Technol. 180–187, July 2015. ISSN: 2321-8134
Google Scholar
Y.L. Cheung, A.W. Fu, Mining frequent itemsets without support threshold: with and without item constraints. IEEE TKDE J. 16(6), 1052–1069 (2004)
Google Scholar
T. Le, B. Vo, Bay: An N-list-based algorithm for mining frequent closed patterns. Expert Syst. Appl. 42(19), 6648–6657 (2015)
Article Google Scholar
Q. Huynh-Thi-Le, T. Le, B. Vo, H. Le, H. Bac, An efficient and effective algorithm for mining top-rank-k frequent patterns. Expert Syst. Appl. 42(1), 156–164 (2015)
Article Google Scholar
Z.H. Deng, Fast mining top-rank-k frequent patterns by using node-lists. Expert Syst. Appl. 41, 1763–1768 (2014)
Article Google Scholar
M. Zihayat, A. An, Mining top-k high utility patterns over data streams. Inf. Sci. (2014)
Google Scholar
K. Chuang, J. Huang, M. Chen, Mining Top-K frequent patterns in the presence of the memory constraint. VLDB J. 17, 1321–1344 (2008)
Article Google Scholar
P. Tzvetkov, X. Yan, J. Han, TSP: Mining top-k closed sequential itemsets. KAIS J. 7(4), 438–457 (2005)
Google Scholar
J. Wang, J. Han, Y. Lu, P. Tzvetkov, TFP: an efficient algorithm for mining Top-k frequent closed Itemsets. IEEE TKDE J. 17(5), 652–664 (2005)
Google Scholar
J. Han, J. Pei, Mining frequent patterns by pattern-growth: methodology and implications. ACM SIGKDD Explor. Newsl. 2(2), 14–20 (2000)
Article MathSciNet Google Scholar
J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, D. Yang, H-mine: Hyper-Structure mining of frequent Itemsets in large databases, in ICDM, pp. 441–448 (2001)
Google Scholar
C.W. Wu, B.E. Shie, P. Yu, V.S. Tseng, Mining top-K high utility itemsets, in KDD, pp. 78–86 (2012)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information Technology, Sree Vidyanikethan Engineering College, Tirupati, 517502, Andhra Pradesh, India
O. Gireesha & O. Obulesu

Authors

O. Gireesha
View author publications
You can also search for this author in PubMed Google Scholar
O. Obulesu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to O. Gireesha .

Editor information

Editors and Affiliations

Anil Neerukonda Inst. of Tech. & Sci., Prof., Dept. of Computer Sci. & Engg. Anil Neerukonda Inst. of Tech. & Sci., Vishakapatnam, Andhra Pradesh, India
Suresh Chandra Satapathy
Professional Colleges (SRMGPC), Shri Ramswaroop Memorial Group of Professional Colleges (SRMGPC), Lucknow, Uttar Pradesh, India
Vikrant Bhateja
SCIS, University of Hyderabad , Hyderabad, India
Siba K. Udgata
KIIT University, School of Computer Engineering KIIT University, Bhubaneswar, Odisha, India
Prasant Kumar Pattnaik

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Gireesha, O., Obulesu, O. (2017). TKAR: Efficient Mining of Top-k Association Rules on Real—Life Datasets. In: Satapathy, S., Bhateja, V., Udgata, S., Pattnaik, P. (eds) Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications . Advances in Intelligent Systems and Computing, vol 516. Springer, Singapore. https://doi.org/10.1007/978-981-10-3156-4_5

Download citation

DOI: https://doi.org/10.1007/978-981-10-3156-4_5
Published: 03 March 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3155-7
Online ISBN: 978-981-10-3156-4
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics

TKAR: Efficient Mining of Top-k Association Rules on Real—Life Datasets

Abstract

Similar content being viewed by others

Fast Top-K association rule mining using rule generation property pruning

ETARM: an efficient top-k association rule mining algorithm

Fast Mining of Top-k Frequent Balanced Association Rules

Keywords

1 Introduction

2 Literature Survey

3 Top-k Association Rules Mining (TKAR) Algorithm

4 Experimental Evaluation

5 Conclusion

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

TKAR: Efficient Mining of Top-k Association Rules on Real—Life Datasets

Abstract

Similar content being viewed by others

Fast Top-K association rule mining using rule generation property pruning

ETARM: an efficient top-k association rule mining algorithm

Fast Mining of Top-k Frequent Balanced Association Rules

Keywords

1 Introduction

2 Literature Survey

3 Top-k Association Rules Mining (TKAR) Algorithm

4 Experimental Evaluation

5 Conclusion

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation