1 Introduction

Data mining is the process of extracting potentially useful, novel, understandable, and previously hidden information from databases that are large, noisy, and ambiguous [1, 2]. It plays a vital role in many modern applications, such as market analysis, credit assessment, fraud detection, medical and pharmaceutical discovery, fault diagnosis in production systems, insurance and healthcare, banking and finance, hazard forecasting, customer relationship management (CRM), and scientific exploration [3–8]. Association Rule Mining (ARM), also referred to as Frequent Itemset Mining (FIM), is one of the key areas of data mining. Its aim is to extract interesting relationships, patterns, and associations among sets of items in a transaction database or other data repository [9–12]. The most typical application of ARM is market basket analysis, which studies the purchasing behavior of customers by finding items that are frequently purchased together. Beyond its many business applications, ARM is also applied to telecommunication networks, web log mining, market and risk management, inventory control, bioinformatics, medical diagnosis, and text mining [8–13].

Recently, data mining techniques and tools have come into use in cloud computing environments. Cloud computing is now a strong trend across business and scientific fields, and it has become an important area of focus for data mining. It offers many services for analyzing, storing, and managing massive datasets, such as delivery of software and hardware over the internet and data storage that is efficient, reliable, and cost effective [14, 15].

Dozens of algorithms have been proposed to find frequent itemsets in transaction datasets. The classical association rule mining algorithm is Apriori, and several other algorithms have since been developed, some of them based on Apriori, such as AprioriTID [12], Eclat [16, 17], dEclat [17], FP-Growth [18], Relim [19], H-mine [19], and FIN [20]. In this research, we compare the performance of four well-established frequent itemset mining methods, namely Apriori, AprioriTID, Eclat, and FP-Growth, in a cloud computing environment.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 explains the basic concepts of ARM and describes the selected ARM algorithms. Section 4 presents details of Amazon Web Service (AWS) and the Amazon Elastic Compute Cloud (EC2) service, Sect. 5 provides the comparative analysis, and Sect. 6 concludes the paper.

2 Related Work

Several studies have been carried out to compare the performance of various association rule mining algorithms [21–25]. Trivedi [26] analyzed the performance of three association rule mining algorithms and concluded that FP-Growth performs best, followed by Eclat, while Apriori performs worst.

In a related study, Garg and Kumar [27] compared the performance of Apriori, Eclat, and FP-Growth. They concluded that FP-Growth is the best of the three algorithms and is also scalable, whereas Apriori performs worst. However, they used only one dataset in their experiment.

Similarly, Sinha and Ghosh [28] compared the performance of the same three algorithms. They used a single dataset, ‘Pima’, and ran several experiments with varying support counts. They concluded that Eclat performs better than Apriori and FP-Growth.

In summary, some earlier experimental studies found FP-Growth to be better than Apriori, AprioriTID, and Eclat, whereas others found Eclat to be more efficient than Apriori and FP-Growth.

The performance of a data mining algorithm depends on the size and density of the dataset and on the number of candidates and frequent itemsets generated. In this study, we therefore chose small, medium, and dense datasets for evaluating the performance of the ARM algorithms.

3 Association Rules Mining (ARM) Algorithms

ARM is one of the key techniques of data mining and was introduced by Agrawal et al. in 1993. We formally state some generic concepts of association rule mining as follows.

Let \( I = \left\{ {i_{1} ,i_{2} , \ldots ,i_{m} } \right\} \) be a set of m different literals, or items. For instance, goods such as bag, pen, and pencil for purchase in a shop are items.

An itemset \( X \) is a collection of zero or more items such that \( X \subseteq I \). An itemset that contains k items is called a k-itemset. For example, the set of items purchased together in a supermarket is an itemset.

Let \( D = \left\{ {t_{1} ,t_{2} , \ldots ,t_{n} } \right\} \) be a set of transactions, where each transaction \( t \) has an identifier \( tid \) and an itemset \( t\text{-}itemset \), that is, \( t = \left( {tid, t\text{-}itemset} \right) \).

An itemset \( X \) has support \( S \) in the transaction dataset \( D \) if \( S\% \) of the transactions in \( D \) contain \( X \); we write \( S = Supp\left( X \right) \).

$$ Supp\left( X \right) = \frac{{\left| {\left\{ {t \in D;X \subseteq t} \right\}} \right|}}{\left| D \right|} $$
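As a minimal illustration of the support measure, the following Python sketch counts the transactions that contain an itemset and divides by the size of the database. The function name `support` and the four-transaction toy database `TOY_DB` are hypothetical and serve only as an example; they are not taken from the experiments in this paper.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    target = frozenset(itemset)
    # A transaction t counts when X is a subset of t.
    hits = sum(1 for t in transactions if target <= t)
    return hits / len(transactions)


# Hypothetical toy database of four shop transactions (assumption).
TOY_DB = [
    frozenset({"bag", "pen"}),
    frozenset({"bag", "pen", "pencil"}),
    frozenset({"pen", "pencil"}),
    frozenset({"bag"}),
]

print(support({"bag"}, TOY_DB))         # 0.75
print(support({"bag", "pen"}, TOY_DB))  # 0.5
```

With a minimum support of 0.5, both itemsets in this toy example would be frequent.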

An itemset X in a transaction database D is said to be large, or frequent, if its support is greater than or equal to the minimum support threshold (minsup) given by the user. The negation of an itemset X is \( \neg X \).

The support of \( \neg X \) is \( supp\left( {\neg X} \right) = 1 - supp\left( X \right) \).

An association rule is an implication of the form \( X \to Y \), where \( X,Y \subseteq I \) and \( X \cap Y = \emptyset \) [12].

The quality of an association rule can be measured by its support and confidence.

\( Support \left( S \right) \) determines how often a rule is applicable to a given dataset.

$$ S\left( {X \to Y} \right) = Supp\left( {X \cup Y} \right) = \frac{{\left| {\left\{ {t \in D;X \cup Y \subseteq t} \right\}} \right|}}{\left| D \right|} $$

\( Confidence\left( C \right) \) determines how frequently items in \( Y \) appear in transactions that contain \( X \):

$$ C\left( {X \to Y} \right) = Supp\left( {X \cup Y} \right)/Supp\left( X \right) $$
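The confidence measure can be illustrated in the same way. The sketch below is a hedged example only: the helper names `support` and `confidence` and the toy database are assumptions made for illustration. It computes the confidence of a rule as the support of the combined itemset divided by the support of the antecedent.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for t in transactions if target <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """C(X -> Y) = Supp(X u Y) / Supp(X); 0.0 when X never occurs."""
    supp_x = support(antecedent, transactions)
    if supp_x == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / supp_x


# Hypothetical toy database (assumption, not data from this study).
TOY_DB = [
    frozenset({"bag", "pen"}),
    frozenset({"bag", "pen", "pencil"}),
    frozenset({"pen", "pencil"}),
    frozenset({"bag"}),
]

# Every transaction with a pencil also has a pen, so this rule has confidence 1.
print(confidence({"pencil"}, {"pen"}, TOY_DB))  # 1.0
```

In this toy database the reverse rule, pen to pencil, has a lower confidence of 2/3, because one of the three transactions that contain a pen has no pencil. Confidence is therefore not symmetric.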

The association rule mining task can be broken down into two subtasks [9, 29–31].

  1. Finding all frequent itemsets, that is, all itemsets whose support is at least the user-specified minimum support.

  2. Generating all rules that have minimum confidence in the following simple way: for every frequent itemset \( X \) and any non-empty \( B \subset X \), let \( A = X - B \). If the confidence of the rule \( A \to B \) is greater than or equal to the minimum confidence, that is, \( supp\left( X \right)/supp\left( A \right) \ge minconf \), then \( A \to B \) is extracted as a valid rule.
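The second subtask can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function name `generate_rules`, the dictionary layout, and the support values in `TOY_SUPPORT` are hypothetical, and the input is assumed to already contain the support of every frequent itemset and of all its non-empty subsets.

```python
from itertools import combinations


def generate_rules(freq_support, minconf):
    """Return a list of (A, B, confidence) for rules A -> B with confidence >= minconf.

    `freq_support` maps each frequent itemset (a frozenset) to its support and
    is assumed to contain every non-empty subset of each itemset.
    """
    rules = []
    for x, supp_x in freq_support.items():
        if len(x) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(x)):
            for b_items in combinations(sorted(x), size):
                b = frozenset(b_items)
                a = x - b
                conf = supp_x / freq_support[a]
                if conf >= minconf:
                    rules.append((a, b, conf))
    return rules


# Hypothetical supports of the frequent itemsets of a toy database (assumption).
TOY_SUPPORT = {
    frozenset({"bag"}): 0.75,
    frozenset({"pen"}): 0.75,
    frozenset({"pencil"}): 0.5,
    frozenset({"bag", "pen"}): 0.5,
    frozenset({"pen", "pencil"}): 0.5,
}

for a, b, conf in generate_rules(TOY_SUPPORT, minconf=0.7):
    print(sorted(a), "->", sorted(b), conf)  # ['pencil'] -> ['pen'] 1.0
```

Only supports computed in the first subtask are needed here, so rule generation does not read the database again.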

The overall performance of ARM typically depends on the first task. ARM usually generates a vast number of association rules, and it is often difficult for users to understand and validate so many complex rules. It is therefore important to generate only “interesting” and “non-redundant” rules, or rules that satisfy criteria such as being easy to handle, control, and understand, and having high strength. Dozens of algorithms have been developed to find frequent itemsets and association rules, among which the more popular are Apriori, AprioriTID, FP-Growth, Eclat, dEclat, Relim, H-mine, FIN, Charm, and dCharm. In this study, we chose four well-established algorithms, namely Apriori, AprioriTID, FP-Growth, and Eclat, and evaluated their performance on a cloud platform.

3.1 Apriori Algorithm

Apriori is a classic and widely used ARM algorithm. It uses an iterative, level-wise approach based on breadth-first search, in which frequent \( \left( {k - 1} \right) \)-itemsets are used to generate candidate \( k \)-itemsets. The basic principle of the algorithm is that all non-empty subsets of a frequent itemset must also be frequent [8, 11, 18].

The Apriori-gen function takes as its argument \( L_{k-1} \), the set of all large \( (k-1) \)-itemsets, and returns a superset of the set of all large \( k \)-itemsets. It consists of two main steps:

  • The join step: candidates are produced level by level by joining frequent itemsets with each other, that is, \( L_{k-1} \) is joined with \( L_{k-1} \) to form the candidate set \( C_{k} \).

  • The prune step: a candidate is discarded if any of its subsets is not frequent, so all itemsets \( c \in C_{k} \) for which some \( (k-1) \)-subset of \( c \) is not in \( L_{k-1} \) are deleted. The remaining candidates are then counted, and those whose support is below the user-defined min_sup are removed.

The key drawback of this algorithm is that it scans the dataset multiple times.
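The join and prune steps can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names `apriori_gen` and `apriori`, the representation of itemsets as frozensets of sortable items, and the toy database are assumptions.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Build candidate k-itemsets from a set of frequent (k-1)-itemsets (frozensets)."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = set()
    # Join step: merge two (k-1)-itemsets that share their first k-2 items.
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                candidates.add(frozenset(a + (b[-1],)))
    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    k = len(prev[0]) + 1 if prev else 0
    return {
        c for c in candidates
        if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))
    }


def apriori(transactions, min_sup):
    """Return {itemset: support} for all itemsets with support >= min_sup."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    result = {}
    while level:
        # One full pass over the database per level: the costly part.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        result.update(frequent)
        level = apriori_gen(set(frequent))
    return result


# Hypothetical toy database (assumption).
TOY_DB = [
    frozenset({"bag", "pen"}),
    frozenset({"bag", "pen", "pencil"}),
    frozenset({"pen", "pencil"}),
    frozenset({"bag"}),
]
print(len(apriori(TOY_DB, min_sup=0.5)))  # 5 frequent itemsets
```

The `while` loop makes the cost of the level-wise strategy visible: every level requires another pass over all transactions.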

3.2 AprioriTID

AprioriTID is a small variation of the Apriori algorithm. It also uses the Apriori-gen function to produce candidates, but it does not use the database for counting support after the first pass. Instead, it keeps a separate set \( \overline{C}_{k} \) whose entries have the form <TID, {\( X_{k} \)}>, where each \( X_{k} \) is a potentially large \( k \)-itemset present in the transaction with identifier TID. If a transaction does not contain any potentially large \( k \)-itemset, it is removed from \( \overline{C}_{k} \) [12, 31].
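The idea of counting support from the per-transaction candidate sets can be sketched as follows. This is a simplified illustration under stated assumptions: the names `apriori_gen`, `apriori_tid`, and `c_bar` are hypothetical, itemsets are stored as sorted tuples, and the threshold is given as an absolute count rather than a fraction.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Join and prune: candidate k-itemsets (sorted tuples) from frequent (k-1)-itemsets."""
    prev = sorted(prev_frequent)
    joined = {a + (b[-1],) for i, a in enumerate(prev) for b in prev[i + 1:]
              if a[:-1] == b[:-1]}
    return {c for c in joined
            if all(s in prev_frequent for s in combinations(c, len(c) - 1))}


def apriori_tid(transactions, min_count):
    """Return {itemset tuple: support count}; only pass 1 reads `transactions`."""
    # Pass 1: each transaction becomes an entry <TID, set of its 1-itemsets>.
    c_bar = {tid: {(i,) for i in t} for tid, t in enumerate(transactions)}
    counts = {}
    for entry in c_bar.values():
        for c in entry:
            counts[c] = counts.get(c, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(frequent)
    while frequent:
        candidates = apriori_gen(set(frequent))
        counts, next_c_bar = {}, {}
        for tid, entry in c_bar.items():
            # c occurs in the transaction iff both (k-1)-itemsets that generated it do.
            present = {c for c in candidates
                       if c[:-1] in entry and c[:-2] + c[-1:] in entry}
            for c in present:
                counts[c] = counts.get(c, 0) + 1
            if present:  # transactions holding no candidate are dropped
                next_c_bar[tid] = present
        c_bar = next_c_bar
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
    return result


# Hypothetical toy database (assumption).
TOY_DB = [{"bag", "pen"}, {"bag", "pen", "pencil"}, {"pen", "pencil"}, {"bag"}]
print(apriori_tid(TOY_DB, min_count=2)[("bag", "pen")])  # 2
```

In this sketch, each entry of `c_bar` stores every candidate contained in its transaction. For dense data such entries can become larger than the transactions themselves, so memory use can grow quickly at low support thresholds.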

3.3 FP-Growth

The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate generation, which improves performance. It uses a divide-and-conquer strategy. The core of the method is a special data structure called the frequent-pattern tree (FP-tree), which retains the itemset association information.

In simple terms, the algorithm works as follows. First, it compresses the input database into an FP-tree that represents the frequent items. It then divides the compressed database into a set of conditional databases, each associated with one frequent pattern. Finally, each conditional database is mined separately. With this strategy, FP-Growth reduces the search cost by recursively looking for short patterns and then concatenating them into long frequent patterns, which offers good selectivity [32–34]. FP-Growth is efficient and scalable for mining both long and short frequent patterns [35].
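The following compact sketch illustrates these three steps. It is a simplified illustration and not the implementation evaluated in this paper: the names `Node`, `build_tree`, and `fp_growth` are hypothetical, support is an absolute count, and each conditional database is passed on as a list of (prefix path, count) pairs. Optimizations of the original algorithm, such as single-path handling, are omitted.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(weighted_transactions, min_count):
    """Return (root, header), where header maps item -> list of its tree nodes."""
    freq = defaultdict(int)
    for items, weight in weighted_transactions:
        for i in set(items):
            freq[i] += weight
    keep = {i for i, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, weight in weighted_transactions:
        # Insert frequent items in descending frequency so that prefixes are shared.
        path = sorted((i for i in set(items) if i in keep),
                      key=lambda i: (-freq[i], i))
        node = root
        for i in path:
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = Node(i, node)
                header[i].append(child)
            child.count += weight
            node = child
    return root, header


def fp_growth(weighted_transactions, min_count, suffix=()):
    """Return {sorted itemset tuple: support count}, mined recursively."""
    _, header = build_tree(weighted_transactions, min_count)
    result = {}
    for item, nodes in header.items():
        result[tuple(sorted(suffix + (item,)))] = sum(n.count for n in nodes)
        # Conditional database: the prefix path of every node that holds `item`.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        result.update(fp_growth(base, min_count, suffix + (item,)))
    return result


# Hypothetical toy database (assumption); every transaction has weight 1.
TOY_DB = [{"bag", "pen"}, {"bag", "pen", "pencil"}, {"pen", "pencil"}, {"bag"}]
print(fp_growth([(t, 1) for t in TOY_DB], min_count=2)[("pen", "pencil")])  # 2
```

No candidate itemsets are generated and tested against the database here. Supports are read directly from the node counts of the tree and its conditional trees.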

3.4 Eclat Algorithm

Eclat performs a depth-first search and adopts a vertical layout to represent the database: each item is represented by the set of IDs of the transactions that contain it (called a tidset). The tidset of an itemset is generated by intersecting the tidsets of its items. Because of the depth-first search, it is difficult to utilize the downward closure property as Apriori does. However, tidsets have the advantage that no separate pass is needed to count support: the support of an itemset is the size of its tidset. The main operation of Eclat is tidset intersection, so the size of the tidsets is one of the main factors affecting its running time and memory usage. The bigger the tidsets are, the more time and memory are needed [16, 17].
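The vertical layout and the tidset intersection can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `eclat`, the helper `extend`, and the toy database are hypothetical, and the threshold is an absolute count.

```python
def eclat(transactions, min_count):
    """Return {sorted itemset tuple: support count} via depth-first tidset intersection."""
    # Vertical layout: item -> set of IDs of the transactions that contain it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    result = {}

    def extend(prefix, candidates):
        # `candidates` holds (item, tidset of prefix + item) pairs, all frequent.
        for idx, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            result[itemset] = len(tids)  # support is simply the tidset size
            deeper = []
            for other, other_tids in candidates[idx + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    deeper.append((other, shared))
            extend(itemset, deeper)  # depth-first: go deeper before the next item

    start = sorted(((i, t) for i, t in tidsets.items() if len(t) >= min_count),
                   key=lambda pair: pair[0])
    extend((), start)
    return result


# Hypothetical toy database (assumption).
TOY_DB = [{"bag", "pen"}, {"bag", "pen", "pencil"}, {"pen", "pencil"}, {"bag"}]
print(eclat(TOY_DB, min_count=2)[("bag", "pen")])  # 2
```

The database is read only once, to build the tidsets. All later work consists of set intersections, so running time and memory depend on how large those tidsets are.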

4 Cloud Platform

4.1 Amazon Web Service (AWS)

Amazon Web Service (AWS) provides a highly reliable, scalable, and low-cost infrastructure platform in the cloud. For indexing or analyzing large business or scientific datasets, AWS offers a set of big data tools and services that suit massive data analysis. Using AWS has several benefits. User applications can be hosted easily and securely through the AWS management console. Users can flexibly select the operating system, programming language, web application platform, database tools, and other services they need. The service is cost effective: users pay only for the computing resources they use, on an hourly basis, with no long-term contracts or up-front commitments. AWS also provides a reliable, secure, global, and scalable platform, and its auto scaling and elastic load balancing tools let users resize an application based on demand [36–38].

Furthermore, Amazon Elastic Compute Cloud (EC2) provides pre-configured templates for user instances known as Amazon Machine Images (AMIs). An AMI can include just an operating system, such as Windows or Linux, or a wider range of components, such as an operating system together with pre-installed software packages. Amazon EC2 instances range from small “micro” instances for small jobs to high-performance “X-large” instances for workloads such as data warehousing [38].

5 Comparative Analysis

5.1 Dataset Details

We chose four benchmark datasets that are commonly used in frequent itemset mining, all downloaded from [39]. Chess, accidents, and mushroom are real-life datasets, whereas T20I6D100K was generated synthetically with the IBM generator. Table 1 gives more details of the datasets.

Table 1. Dataset details with number of transactions and attributes

5.2 SPMF

SPMF is an open-source data mining library for frequent pattern mining. It is distributed under the GPL v3 license and written in the Java programming language. It offers 93 data mining algorithms for sequential pattern mining, association rule mining, itemset mining, sequential rule mining, and clustering. SPMF can be used as a standalone program with a simple user interface or from the command line [40].

5.3 AWS-EC2 Details

All experiments were performed on the Amazon Web Service cloud platform using an EC2 instance of type “m2-medium” with a Linux operating system, 4 GB of memory, and 2 cores. Figure 1 shows the login screen of the EC2 m2-medium instance.

Fig. 1. AWS-EC2 instance login screen

5.4 Results Comparison

Table 2 and Fig. 2 show the performance of the four chosen algorithms on the chess dataset for different min_sup values. The results show that FP-Growth outperforms the other three algorithms. With AprioriTID, we were only able to obtain results down to min_sup = 0.65 because of memory constraints.

Table 2. Chess dataset comparison with execution time and frequent itemset count
Fig. 2. Chess dataset result comparison with execution time

Table 3 and Fig. 3 show the performance of the four chosen algorithms on the accidents dataset for different min_sup values. The results show that FP-Growth outperforms the other three algorithms. Again, with AprioriTID we were only able to obtain results down to min_sup = 0.65 because of memory constraints.

Table 3. Accidents dataset comparison with execution time and frequent itemset count
Fig. 3. Accidents dataset result comparison with execution time

Figure 4 and Table 4 show the performance of the four chosen algorithms on the mushroom dataset for different min_sup values. The results show that Eclat outperforms the other three algorithms.

Fig. 4. Mushroom dataset result comparison with execution time

Table 4. Mushroom dataset comparison with execution time and frequent itemset count

Table 5 and Fig. 5 show the performance of the four chosen algorithms on the T20I6D100K dataset for different min_sup values. The results show that Eclat outperforms the other three algorithms.

Table 5. T20I6D100K dataset comparison with execution time and frequent itemset count
Fig. 5. T20I6D100K dataset result comparison with execution time

6 Conclusion

In this work, four association rule mining algorithms (Apriori, AprioriTID, FP-Growth, and Eclat) were run in a cloud environment. We chose Amazon Web Service as the cloud platform and used its EC2 service. We used four benchmark datasets: chess, accidents, mushroom, and T20I6D100K. We evaluated the performance of the algorithms in terms of execution time while varying the min_sup value. From this study, we draw the following observations and conclusions:

  • The cloud platform is well suited to data mining in terms of efficiency, reliability, and cost effectiveness.

  • Apriori requires more time to produce the frequent itemsets as the min_sup value decreases. AprioriTID is not suitable for dense datasets such as chess and accidents, because these datasets produce more frequent itemsets as min_sup decreases and AprioriTID cannot produce results beyond particular min_sup values, as shown in Tables 2 and 3.

  • Compared with Apriori and AprioriTID, Eclat is suitable for any dataset (small, medium, or dense).

  • FP-Growth is more suitable for medium-size and dense datasets, as the experimental results in Tables 2 and 3 show. Tables 4 and 5 indicate that FP-Growth is less suitable for small and simple datasets.

  • Eclat and FP-Growth are more efficient than Apriori and AprioriTID. Of the two, FP-Growth is more suitable for dense and medium-size datasets, whereas Eclat is appropriate for medium-size and less dense datasets.