1 Introduction

The volume of data stored in computers is growing very fast [1], so we must try to extract useful knowledge from it; this process is called knowledge discovery in databases (KDD), or data mining [2]. Knowledge discovery in databases (DBs) is important for many technical, social, and economic problems. Modern DBs contain such a huge quantity of information that it is practically impossible to analyze it manually to acquire valuable knowledge for decision making. The problem of generalizing real data that may contain noise is also considered in many research papers such as [49], and algorithms for knowledge acquisition in consistent and inconsistent multi-scale decision systems have been proposed with illustrative examples, as in [50].

Extracting association rules is a fundamental activity among the various data mining tasks. The main task of association rule mining is to search for interesting relationships among items in a given database and to display them in rule form, i.e., A ⇒ B. Many industries are becoming interested in mining associations among data due to the massive amounts of data continuously being collected and stored in databases. One of the most important examples among the various applications of association mining is market basket analysis. Association Rule Mining (ARM) is a widespread and well-known method in data mining for discovering interesting hidden relationships among items in large databases (Agrawal et al. [3]). The problem was first posed in 1993 as finding frequent itemsets, an essential step in extracting association rules. We note that this task has received a great deal of attention in the field of knowledge discovery and data mining.

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support (min_sup) and confidence (min_conf).

The rule A ⇒ B holds with support s if s% of the transactions in D contain A ∪ B. Rules with a support greater than a user-specified threshold are said to satisfy min_sup.

The rule A ⇒ B holds with confidence c if c% of the transactions in D that contain A also contain B. Rules with a confidence greater than a user-specified threshold are said to satisfy min_conf.

For example, if A and B are items, then {A} ⇒ {B} is an association rule mined from the database if the co-occurrence rate of A and B (in the same transaction) is higher than min_sup and the occurrence rate of B in the transactions containing A is higher than min_conf.

The association rules mined from a database can be used to make useful predictions. For example, the rules mined from Point of Purchase (POP) transaction databases can be used to predict the purchase behavior of customers. In general, ARM is one of the most important considerations when making decisions about marketing activities; this type of rule is useful for the business community, for instance in promotional pricing or product placement. Nowadays, ARM is employed in many attractive application areas including engineering, medicine, and more. Most previous research mainly focuses on reducing the number of candidate itemsets and the time spent scanning databases. However, many algorithms adopt an Apriori-like candidate itemset generation and support counting approach, which is the most time-consuming process. To address this issue, we propose two effective algorithms, named CountTableFI (CTFI) and BinaryCountTableFI (BCTFI), and illustrate both with a worked example. In the CTFI algorithm, a count table is used to compress the database for quick frequent itemset generation and support counting. The technique can also be used in many Apriori-like algorithms, or in algorithms based on FP-Growth, to improve performance. Experiments with both synthetic and real databases show that CTFI and BCTFI outperform Apriori and FP-Growth. Our algorithms can discover all types of frequent patterns efficiently and effectively.

2 Basic principles

2.1 Problem formulation

Let \(I = \{i_{1}, i_{2}, \ldots, i_{m}\}\) be a set of items, and let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that \(T \subseteq I\). A transaction T is said to contain A if and only if \(A \subseteq T\). An association rule is an implication of the form A ⇒ B, where \(A \subset I\), \(B \subset I\), and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, both A and B). This is taken to be the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A). That is,

$$ sup(A \Rightarrow B)\,=\,P(A \cup B), $$
(1)
$$ conf(A \Rightarrow B)\,=\,P(B|A). $$
(2)

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset; the set {milk, bread}, for example, is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the frequency, support count, or count of the itemset. Note that the itemset support defined in Eq. (1) is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support. If the relative support of an itemset I satisfies a pre-specified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by \(L_{k}\). From Eq. (2), we have

$$ conf(X \Rightarrow Y)\,=\,P(Y|X) = \frac{sup(X \cup Y)}{sup(X)}. $$
(3)

Equation (3) shows that the confidence of the rule X ⇒ Y can easily be derived from the support counts of X and X ∪ Y. That is, once the support counts of X, Y, and X ∪ Y are found, it is straightforward to derive the corresponding association rules X ⇒ Y and Y ⇒ X and check whether they are strong. Thus the problem of mining association rules can be reduced to that of mining frequent itemsets.
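To make these definitions concrete, the following minimal Python sketch evaluates Eqs. (1) and (3) over a small hypothetical transaction database; the data, function names, and thresholds are ours, chosen purely for illustration.

```python
# A toy transaction database (hypothetical data, for illustration only).
D = [{"A", "B", "E"}, {"B", "D"}, {"B", "C"},
     {"A", "B", "D"}, {"A", "C"}, {"B", "C"}]

def support(itemset, transactions):
    """Relative support of an itemset: fraction of transactions containing it (Eq. 1)."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of antecedent => consequent via Eq. (3): sup(X u Y) / sup(X)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

min_sup, min_conf = 0.3, 0.6
s = support({"A", "B"}, D)          # sup(A => B) = P(A u B) = 2/6
c = confidence({"A"}, {"B"}, D)     # conf(A => B) = P(B | A) = 2/3
print(f"sup={s:.2f} conf={c:.2f} strong={s >= min_sup and c >= min_conf}")
```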

In general, ARM can be viewed as a two-step process:

1. Discover the large itemsets, i.e., all itemsets whose transaction support is above a predetermined minimum support s.

2. Use the large itemsets to generate the association rules for the database.

The overall performance of mining association rules is, in fact, determined by the first step. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. We shall develop algorithms to deal with the first step, i.e., discovering large itemsets in the transaction database. Readers interested in more details of the second step are referred to [4].
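As a rough illustration of the second step, whose details the paper defers to [4], the sketch below derives strong rules from a map of frequent itemsets to support counts; the helper name and the toy counts are hypothetical.

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Derive strong rules A => B from frequent itemsets and their support
    counts, using conf = sup(A u B) / sup(A) (Eq. 3). By downward closure,
    every antecedent of a frequent itemset is itself in support_counts."""
    rules = []
    for itemset, count in support_counts.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / support_counts[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# Hypothetical support counts, as produced by a frequent-itemset miner.
L = {frozenset({"A"}): 3, frozenset({"B"}): 5, frozenset({"A", "B"}): 2}
print(generate_rules(L, min_conf=0.6))  # keeps A => B (conf 2/3), drops B => A (conf 2/5)
```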

2.2 Related works

Association rule mining, the task of finding correlations between items in a dataset, has received considerable attention, particularly since the publication of the AIS and Apriori algorithms [3, 4]. Frequent Itemset Mining (FIM) is the most fundamental problem in mining association rules. It started as a phase in the discovery of association rules, but has since been generalized, independently of them, to many other kinds of patterns. Two main classes of algorithms have been proposed. The first class uses a process of candidate generation and testing to find frequent patterns. The second class transforms the original data into a representation better suited to frequent pattern mining.

  • Candidate-generation-and-test approach Several algorithms proposed to find all frequent patterns in a dataset use the candidate-set generate-and-test approach and mine patterns directly from the original dataset. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the Apriori algorithm. Apriori, developed by Agrawal [4], is the best-known algorithm. It generates the k-candidates by combining two frequent (k − 1)-itemsets in a level-wise procedure, such that only the frequent itemsets at one level are used to construct candidates at the next level. However, it requires multiple database scans, as many as the length of the longest frequent itemset. There have been extensive studies aimed at improving or extending Apriori, such as [11–13]. Apriori proceeds by a bottom-up search [4] and generates candidate sets to limit pattern counting to only those patterns that can possibly meet the minimum support requirement. At each pass, the algorithm determines which candidates are frequent by counting their occurrences. Due to combinatorial explosion, this leads to poor performance when the frequent patterns are large. To avoid this problem, some algorithms output only maximal frequent patterns (Zaki et al. [14], Bayardo [27], Lin and Kedem [28]). Pincer-Search [28] uses a bottom-up search along with top-down pruning to find maximal frequent patterns. Max-Miner [27] uses a heuristic bottom-up search to identify frequent sets as early as possible. Even though the performance improvements may be substantial, maximal frequent sets are of limited use in ARM: a complete set of rules cannot be generated without the support information of the subsets of those maximal frequent sets. Park et al. [6, 29] presented the DHP algorithm; in order to efficiently find all k-subsets of a potential candidate itemset, all frequent itemsets are stored in a hash table [21]. Brin et al. [15] proposed the DIC algorithm, which tries to reduce the number of passes over the database by dividing it into intervals of a specific size: first, all candidate patterns of size 1 are generated, and the supports of the candidate sets are then counted over the first interval of the database. The Partition algorithm, proposed by Savasere et al. [5, 20], uses a partitioning approach to reduce I/O operations that is completely different from all previous approaches: the database is stored in main memory using the vertical database layout, and the large transaction database is divided into multiple partitions such that each partition fits in main memory. Holt and Chung [44] propose the IHP algorithm, which uses inverted hashing and pruning. These algorithms mainly focus on pruning the candidate itemsets to reduce their number. Zaki proposed the Eclat algorithm [10], which explores the vertical data format, and Zaki and Gouda [17] introduced a technique called diffset for reducing the memory requirement. Apriori-like algorithms generate huge numbers of candidate itemsets, which incurs high I/O overhead through iterative scans of a large database. Tsay and Chiang [24] proposed the cluster-based association rule (CBAR) algorithm. CBAR [24] creates cluster tables to discover large itemsets faster and save time.
Comparisons are performed only against the partial cluster tables created in advance, requiring a single scan of the database; this reduces the time needed for data scans and the number of comparisons, and mining then proceeds from the large 1-itemsets up to the maximal large itemsets. The cluster-decomposition association rule (CDAR) [25] algorithm is an improved version of CBAR. CDAR performs efficient clustering to represent the database with a single scan, and then proceeds from the maximal large itemsets down to the large 1-itemsets by decomposing larger candidate itemsets. Recently, Dong and Han [18] presented the BitTableFI algorithm, in which the BitTable data structure is used horizontally and vertically to compress the database for quick candidate itemset generation and support counting, respectively. As reported in [18], BitTableFI outperforms two other Apriori-like algorithms. Song et al. [19] proposed the Index-BitTableFI algorithm based on BitTableFI [18]. Finding frequent itemsets in large databases using the support-confidence framework is not a trivial task [9]. Most previous studies on frequent pattern mining, such as [37], adopt an Apriori-like approach, which is based on the anti-monotone Apriori heuristic [4]. A sketch of this candidate generation appears after this list.

  • Data-transformation, no-candidate-generation approach Most earlier algorithms used the candidate-set generate-and-test approach and mined patterns directly from the original dataset. In many cases, Apriori-like algorithms suffer from two non-trivial costs: generating a huge number of candidate sets, and repeatedly scanning the dataset. Researchers therefore explored transforming the original data into representations optimized for mining. Tree Projection is an efficient algorithm presented in [32]; it builds a lexicographic tree in which each node represents a frequent pattern, and its authors report that it is an order of magnitude faster than the existing techniques in the literature. Another innovative approach to discovering frequent patterns in transactional databases, the FP-growth method, was devised by Han et al. in 2000; it mines the complete set of frequent itemsets without candidate generation [16]. In this method, a data structure called the FP-tree stores the frequency information of the original database in a compressed form. The FP-growth algorithm transforms the problem of finding long frequent patterns into a recursive search for shorter ones followed by concatenation of the suffix, using the least frequent items as suffixes. This makes FP-growth much faster than Apriori. However, changing the support threshold may force a repetition of the whole mining process for the FP-tree, so it is difficult to use the FP-tree in an interactive mining system. Moreover, the FP-tree is not suitable for incremental mining when new datasets are inserted into the database. Many novel ideas have been proposed to deal with these issues. Furthermore, the set of frequent itemsets derived by most current itemset mining methods is too large, and there are proposals for reducing it, including maximal itemsets, closed itemsets, and non-derivable itemsets [30, 31]. Mining the FP-tree structure is done recursively by building conditional trees whose number is of the same order of magnitude as the number of frequent patterns; this massive creation of conditional trees prevents the algorithm from scaling to datasets beyond a few million transactions. There are many alternatives and extensions to the FP-growth approach, including H-Mine [8, 22], which explores hyper-structure mining of frequent itemsets; the work of Liu et al. [33], which dynamically considers the item order, the intermediate result representation and construction strategy, and the tree traversal strategy; and an array-based implementation of the prefix-tree structure for efficient pattern-growth mining by Grahne and Zhu [34]. PatriciaMine [23] employs a compressed Patricia trie to store the datasets, where a Patricia trie is a space-optimized trie in which each node with only one child is merged with that child. Gopalan et al. [35–41] proposed many algorithms to address issues in Apriori, FP-growth, H-Mine, and Eclat. FP-growth* [42] uses an array technique to reduce the FP-tree traversal time; in FP-growth-based algorithms, the recursive construction of the FP-tree affects performance. Song et al. [43] proposed a new algorithm, TM, using the vertical database representation: the transaction ids of each itemset are transformed and compressed to continuous transaction interval lists in a different space using the transaction tree, and frequent itemsets are found by intersecting transaction intervals along a lexicographic tree in depth-first order.
The TM algorithm has been shown to gain significant performance improvements over FP-growth and dEclat. In [45], Ahmed et al. examined ways of partitioning data for ARM; they identified methods that enable efficient counting of frequent sets in cases where the data is much too large to fit in primary memory and where the density of the data means that the number of candidates becomes very large. Efficient pattern mining algorithms based on encoding the database are presented in [46–48].
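For concreteness, the sketch below shows the level-wise candidate generation and pruning that the Apriori-style work surveyed above relies on; it is a generic textbook rendering under our own naming, not any cited author's implementation.

```python
def apriori_gen(frequent_k_minus_1):
    """Join frequent (k-1)-itemsets sharing a (k-2)-prefix, then prune any
    candidate that has an infrequent (k-1)-subset (anti-monotone heuristic)."""
    prev = sorted(sorted(s) for s in frequent_k_minus_1)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:-1] == prev[j][:-1]:            # shared (k-2)-prefix
                cand = frozenset(prev[i]) | frozenset(prev[j])
                # prune: every (k-1)-subset must itself be frequent
                if all(cand - {x} in frequent_k_minus_1 for x in cand):
                    candidates.add(cand)
    return candidates

L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(apriori_gen(L2))  # {A,B,C} survives; {B,C,D} is pruned ({C,D} is not frequent)
```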

3 CountTableFI(CTFI) algorithm

3.1 Details of CTFI

In our proposed algorithm (CTFI), we transform the original transaction data into a smaller set of transactions that preserves all the information needed for frequent itemset mining. In one pass over the database, we merge the given transactions, and the frequent itemsets can then be obtained by building a count table of all items in the merged transactions. CTFI uses an intersection operation to generate frequent itemsets from the count table, which compresses the items; a bitwise AND is much faster than the traditional item-comparison method used in many Apriori-like algorithms. CTFI extends this line of improvements in ARM and is based on set properties, in particular the subset property. We note some important observations on which CTFI depends:

1. A set G is a subset of a set H, \(G \subseteq H\), if and only if everything in G is also in H.

2. Two sets are equal if and only if they have the same elements: for any sets G and H, \(G = H \iff (G \subseteq H \wedge H \subseteq G)\).

3. If the set of items of each transaction can be stored in some compact structure, it may be possible to avoid repeatedly scanning the original transaction database.

4. If multiple transactions share a set of items, it may be possible to merge the shared sets, with the number of occurrences registered as a count. It is easy to check whether transactions are identical if the items in all transactions are listed in a fixed order.

5. If two transactions share a common prefix and one transaction is a subset of the other, according to the same sorted order of items, the shared parts can be merged into one transaction as long as the count is registered properly.

3.2 Extraction of frequent itemsets based on count table

Given \(I = \{i_{1}, i_{2}, \ldots, i_{m}\}\), \(D = \{t_{1}, t_{2}, \ldots, t_{n}\}\), and min_sup.count. With the above observations, we can generate the frequent k-itemsets using the following steps:

Step 1 Scan the transaction data and apply the following:

1. If two or more transactions \(T_{i}, T_{j}\) share a common prefix according to some sorted order of items and \(T_{j} \subseteq T_{i}\), i ≠ j, then \(T_{i}, T_{j}\) can be merged using \(T_{i}\) as long as the count of all items is registered properly.

2. If two transactions are equal, \(T_{j} = T_{i}\), i ≠ j, then \(T_{i}, T_{j}\) can be merged using either \(T_{i}\) or \(T_{j}\) as long as the count of all items is registered properly.

3. If \(T_{i}, T_{j}\) do not satisfy the previous observations, we use \(T_{i}, T_{j}\) unchanged in the next steps.

Step 2 Build a count (occurrence) table of all items in the transaction data (the count table is initially zero).

Step 3 Using the merges from Step 1, increase the count of each item by the number of its occurrences in the merged transactions (add the number of occurrences to the count table). For example, let \(T_{1} = \{a, b, c, d\}\), \(T_{2} = \{a, b, c, d\}\), \(T_{3} = \{a, b\}\); then \(T_{1} = T_{2}\) and \(T_{3} \subseteq T_{1}, T_{2}\), so \(T_{3}\) can be merged into either \(T_{1}\) or \(T_{2}\). Assume \(T_{3}\) is merged into \(T_{1}\), so \(T_{1}\) becomes the superset of \(T_{2}\) and \(T_{3}\). The counts of the members of the merged set are then [a(3), b(3), c(2), d(2)], where a(3) means that item a appeared three times in all transactions, and 3 is registered in the count table.

Step 4 We thereby obtain a new set of merged transactions, which is usually substantially smaller than the original database, together with a compact count table of all occurrences of each item.

Step 5 We can now easily generate frequent itemsets from the count table. We use the intersection operation ∧ to generate frequent itemsets from the count table, which compresses the itemsets. We can generate the maximum-length (superset) frequent itemsets first, after which generating all frequent subsets is straightforward by the downward closure property: if {a, b, c} is frequent, then {a}, {b}, {c}, {a, b}, {a, c}, {b, c} are frequent. A sketch of these steps appears below.
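The following Python sketch is our reading of Steps 1-5; the helper names are ours, and for simplicity a transaction is folded only into an earlier merged transaction. The per-candidate count exploits the fact that every source transaction folded into a merged transaction is a prefix of it.

```python
from collections import Counter
from itertools import combinations

def merge_transactions(transactions):
    """Step 1: with items in a fixed sorted order, fold a transaction into an
    earlier merged transaction when it is identical to it or a prefix of it,
    keeping a per-item occurrence counter for each merged transaction."""
    merged = []                                   # list of (sorted item tuple, Counter)
    for t in transactions:
        items = tuple(sorted(t))
        for sup_items, counts in merged:
            if items == sup_items[:len(items)]:   # equal, or common-prefix subset
                counts.update(items)
                break
        else:
            merged.append((items, Counter(items)))
    return merged

def frequent_itemsets(merged, min_sup_count):
    """Steps 2-5: because every folded transaction is a prefix of its merged
    superset, the items of a candidate co-occur exactly min(count) times in
    each merged transaction containing them all. Enumeration of candidates is
    kept deliberately brute-force for illustration."""
    support, seen = {}, set()
    for items, _ in merged:
        for r in range(1, len(items) + 1):
            for cand in combinations(items, r):
                if cand in seen:
                    continue
                seen.add(cand)
                s = sum(min(c[i] for i in cand)
                        for itm, c in merged if set(cand) <= set(itm))
                if s >= min_sup_count:
                    support[frozenset(cand)] = s
    return support
```

On the small example of Step 3 above, merging \(T_{2}\) and \(T_{3}\) into \(T_{1}\) yields the counter [a(3), b(3), c(2), d(2)], matching the text.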

3.3 An example of CTFI

This section walks through an example of the CTFI algorithm on the transaction database D presented in Table 1. Consider a database D consisting of nine transactions, and suppose min_sup.count = 2.

Table 1 A transaction data D in ARM

Execution of CTFI

1. Scan the transaction data in Table 1; we find that:

   (a) T1 = {A, B, E}, T2 = {B, D}, and T4 = {A, B, D} remain unchanged in the new transaction representation, because we cannot merge them.

   (b) T3 = {B, C} is equal to T6 = {B, C}, so we can keep one of them (merge T3 and T6 into NT3), giving NT3 = {B(2), C(2)}. Similarly, T5 merges with T7 into NT5 = {A(2), C(2)}.

   (c) T9 = {A, B, C} and T8 = {A, B, C, E} satisfy \(T9 \subseteq T8\) and share the common prefix {A, B, C} according to the fixed order, so we merge them into the superset T8, giving NT8 = {A(2), B(2), C(2), E(1)}.

2. Build the count (occurrence) table of all items in the transaction data (the count table is initially zero).

3. The new merged transaction data (of lower dimension) D is shown in Table 2. The transaction data shrinks from nine transactions to six without any loss of information.

4. From the new transaction data, we record the occurrences of all items in the count table, as shown in Table 3.

5. Finally, we obtain the frequent itemsets from the count table by counting the occurrences of items and applying the frequent itemset property (the anti-monotone heuristic). For example, to find the support of {A, B}, we count how often A and B co-occur within each merged transaction of the count table and continue to the next transaction; summing gives sup.count{A, B} = 4. If a merged transaction contributes A(1), B(2), its contribution to sup.count{A, B} is 1; if it contributes A(0), B(2), its contribution is 0. The CTFI algorithm can proceed from the maximal frequent itemsets down to the frequent 1-itemsets by counting co-occurrences of the maximal items within each merged transaction, or from the frequent 1-itemsets up to the maximal frequent itemsets. All frequent itemsets found by the proposed algorithm are shown in Table 4 (and are reproduced by the code snippet below).

Table 2 New transaction data D
Table 3 Count-Table of all items in D
Table 4 Frequent itemsets in D
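Assuming the sketch from Sect. 3.2, this worked example can be reproduced as follows; the nine transactions are those listed in the example above.

```python
# The nine transactions of the worked example (Table 1).
D = [{"A","B","E"}, {"B","D"}, {"B","C"}, {"A","B","D"}, {"A","C"},
     {"B","C"}, {"A","C"}, {"A","B","C","E"}, {"A","B","C"}]

merged = merge_transactions(D)
print(len(merged))                    # 6 merged transactions, as in Table 2
freq = frequent_itemsets(merged, min_sup_count=2)
print(freq[frozenset({"A", "B"})])    # 4, matching sup.count{A, B} above
```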

4 BinaryCountTableFI(BCTFI) algorithm

4.1 Details of BCTFI

In the BCTFI algorithm, we represent the original transaction data in binary 0/1 form. Using this representation, we can transform each transaction into a decimal number. As in the CTFI algorithm, we merge transactions and then obtain the frequent itemsets by constructing a count table for all items in the merged transactions; BCTFI is the same as CTFI except that the merge process is based on the decimal-number representation of each transaction. From this point of view, we note the important observations on which BCTFI depends:

1. Each transaction contains m bits, so the group of bits in each transaction (row) can be viewed as an m-bit binary code, which is equivalent to a decimal number between 0 and \(2^{m} - 1\). These decimal numbers are the major elements of our algorithm.

2. If the set of items of each transaction can be transformed into a simple count table, it may be possible to avoid repeatedly scanning the original transaction database.

3. If multiple transactions share a common prefix of 0's and 1's, according to the sorted order of the binary code, it may be possible to merge the shared sets, with the number of occurrences registered as a count and items listed in a fixed order. It is easy to check whether two transactions are identical: they have the same decimal representation. A sketch of this encoding follows.
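As a small illustration of observation 1, the sketch below packs transactions over five items into integers, most significant bit first, so that the prefix and subset relations used later become bit operations. The item order and values follow the worked example of Sect. 4.3; the code itself is ours.

```python
ITEMS = ["A", "B", "C", "D", "E"]      # fixed item order: A occupies the MSB
BIT = {item: 1 << (len(ITEMS) - 1 - i) for i, item in enumerate(ITEMS)}

def encode(transaction):
    """Pack a transaction into one m-bit integer (its decimal code)."""
    code = 0
    for item in transaction:
        code |= BIT[item]
    return code

t8, t9 = encode({"A", "B", "C", "E"}), encode({"A", "B", "C"})
print(t8, t9)                # 29 and 28, as in the worked example of Sect. 4.3
print((t9 & t8) == t9)       # True: T9 is a subset of T8 (one bitwise test)
print(t8 - t9)               # 1: the transactions differ only in item E
```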

4.2 Extraction of frequent itemsets based on binary count table

Given \(I = \{i_{1}, i_{2}, \ldots, i_{m}\}\), \(D = \{t_{1}, t_{2}, \ldots, t_{n}\}\), and min_sup.count. With the above observations, we can generate the frequent k-itemsets using the following steps:

Step 1 Transform the transaction data into a 0/1 binary code and the corresponding decimal number.

Step 2 From the new representation, we find that:

1. If two or more transactions \(T_{i}, T_{j}\) share a common prefix of the binary code, according to the same sorted order of the representation, and \(T_{j} \subseteq T_{i}\), i ≠ j, then \(T_{i}, T_{j}\) can be merged using \(T_{i}\) as long as the count of all items is registered properly.

2. If two transactions \(T_{i}, T_{j}\) have the same decimal number, i.e., \(T_{j} = T_{i}\), i ≠ j, then \(T_{i}, T_{j}\) can be merged using either \(T_{i}\) or \(T_{j}\) as long as the count of all items is registered properly.

3. If \(T_{i}, T_{j}\) do not satisfy the previous observations, we use \(T_{i}, T_{j}\) unchanged in the next steps.

Step 3 Build a count (occurrence) table of all items in the transaction data (the count table is initially zero).

Step 4 Using the merges from Step 2, increase the count of each item by the number of its occurrences in the merged transactions (add the number of occurrences to the count table).

Step 5 We thereby obtain the new merged transactions (of lower dimension) and a compact count table of all occurrences of each item, as in the CTFI algorithm.

Step 6 We can easily generate frequent itemsets from the count table: we generate the maximum-length (superset) frequent itemsets first, after which generating all frequent subsets is straightforward by the downward closure property. The sketch below illustrates the merge test of Step 2 on the decimal codes.
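The following sketch is our own formalization of the Step 2 merge tests on decimal codes, not a verbatim rendering of BCTFI: equality is code equality, and, with the MSB-first item order, a prefix-subset is a subset whose differing bits all lie below its least significant set bit.

```python
def merge_by_code(codes):
    """Fold each coded transaction into an earlier one when their codes are
    equal, or when it is a prefix-subset of it (all differing bits lie below
    the folded code's least significant set bit)."""
    merged = []                                  # list of [superset code, source codes]
    for code in codes:
        for entry in merged:
            sup = entry[0]
            diff = sup & ~code                   # bits present in sup but not in code
            if (code & sup) == code and diff < (code & -code):
                entry[1].append(code)            # equal, or prefix-subset: fold in
                break
        else:
            merged.append([code, [code]])
    return merged

codes = [25, 10, 12, 26, 20, 12, 20, 29, 28]     # the decimal codes of Table 6
print([e[0] for e in merge_by_code(codes)])      # [25, 10, 12, 26, 20, 29]: 6 rows
```

Note that T2 = (01010)_2 = 10 is a subset of T4 = (11010)_2 = 26 but not a prefix of it (the differing bit is the MSB), so the two are correctly left unmerged, as in the worked example below.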

4.3 An example of BCTFI

This section walks through an example of the BCTFI algorithm on the transaction database D presented in Table 5. Consider a database D consisting of nine transactions, and suppose min_sup.count = 2.

Table 5 A transaction data D in ARM

Execution of BCTFI The analysis of BCTFI can be summarized in the following steps:

1. Transform the transaction data into 0/1 binary codes and decimal numbers; see Table 6.

2. Scan the transaction data in Table 6; we find that:

   (a) T1 = {A, B, E} with T1.Bit = \((11001)_{2} = (25)_{10}\), T2 = {B, D} with T2.Bit = \((01010)_{2} = (10)_{10}\), and T4 = {A, B, D} with T4.Bit = \((11010)_{2} = (26)_{10}\) remain unchanged in the new transaction representation, because we cannot merge them.

   (b) T3 = {B, C} with T3.Bit = \((01100)_{2} = (12)_{10}\) is equal to T6, so we can keep one of them (merge T3 and T6 into NT3), giving NT3 = {B(2), C(2)}. Similarly, T5 merges with T7 into NT5 = {A(2), C(2)}.

   (c) T9 = {A, B, C} with T9.Bit = \((11100)_{2} = (28)_{10}\) and T8 = {A, B, C, E} with T8.Bit = \((11101)_{2} = (29)_{10}\) satisfy \(T9 \subseteq T8\); diff(T8 − T9) = \((29)_{10} - (28)_{10} = (1)_{10}\), and \((1)_{10}\) means they differ only in item E. They share the common prefix (111) ≡ {A, B, C} according to the fixed order of 0's and 1's, so we merge them into the superset T8, giving NT8 = {A(2), B(2), C(2), E(1)}.

3. Build the count (occurrence) table of all items in the transaction data (the count table is initially zero).

4. The new merged transaction data (of lower dimension) D is shown in Table 7. The transaction data shrinks from nine transactions to six without any loss of information.

5. From the new transaction data, we record the occurrences of all items in the count table, as shown in Table 8.

6. Finally, we obtain the frequent itemsets from the count table by counting the occurrences of items and applying the frequent itemset property (the anti-monotone heuristic), exactly as in the CTFI example: a merged transaction contributing A(1), B(2) adds 1 to sup.count{A, B}, one contributing A(0), B(2) adds 0, and summing over the merged transactions gives sup.count{A, B} = 4. The BCTFI algorithm can proceed from the maximal frequent itemsets down to the frequent 1-itemsets by counting co-occurrences of the maximal items within each merged transaction, or from the frequent 1-itemsets up to the maximal frequent itemsets. All frequent itemsets found by our proposed algorithm are shown in Table 4.

Table 6 Decimal numbers of all transactions in D
Table 7 New transactions in D
Table 8 Count Table of all items in D

5 Experimental and performance evaluation

In this section, we compare the performance of CTFI and BCTFI with two popular algorithms, Apriori and FP-Growth. To verify the performance of the CTFI and BCTFI algorithms, four real databases and two synthetic databases are used in the evaluation of frequent itemset mining.

5.1 Test environment and datasets

We used six datasets in our experiments. Two of them are synthetic (T10I4D100K, T40I10D100K); these resemble market basket data with short frequent patterns. The other four are real datasets (Mushroom, Connect-4, Chess, and Pumsb*), which are dense in long frequent patterns. These datasets were often used in previous studies of association rule mining and are taken from the FIMI data repository [26] to verify the performance of CTFI and BCTFI. Table 9 shows the characteristics of these datasets.

Table 9 Characteristics of experiment datasets

where |T| is the number of transactions, ATL the average transaction length, and |I| the number of items. The experimental results are presented in Figs. 1 through 6 for different minimum supports. Figures 1, 2, 3, 4, 5 and 6 show the running time that frequent itemset generation takes on each database using the four algorithms. The results show that the computation time increases sharply as MinSup decreases for the Apriori algorithm, while it increases slowly for CTFI and BCTFI, more slowly than for FP-Growth. There is no candidate itemset generation time in CTFI and BCTFI because, like FP-Growth, they do not use a candidate generation approach.

Fig. 1 Run time for Mushroom data

The dense datasets in Figs. 1, 2, 3 and 6 support the idea that CTFI and BCTFI run faster on longer itemsets; for most supports on the dense datasets, CTFI and BCTFI have the best performance. Generally, BCTFI performs very close to CTFI on all datasets. The synthetic datasets in Figs. 4 and 5 support the idea that CTFI and BCTFI run fastest on shorter itemsets. CTFI and BCTFI demonstrate the best performance of the four algorithms across the tested minimum supports.

Fig. 2 Run time for Chess data

Fig. 3 Run time for Pumsb* data

Fig. 4 Run time for T40I10D100K data

Fig. 5 Run time for T10I4D100K data

Fig. 6 Run time for Connect-4 data

The experimental results show that the performance of the BCTFI and CTFI algorithms under almost all minimum support values is much better than that of the Apriori and FP-Growth algorithms. Moreover, BCTFI and CTFI show better efficiency at the smallest minimum supports.

Finally, the CountTableFI and BinaryCountTableFI algorithms construct a highly compact count table, which is usually substantially smaller than the original database. Unlike existing algorithms such as [17], our proposals use simple intersection operations to compute frequent itemsets, and thus save the costly database scans in the subsequent mining process. Our algorithms focus on solving the candidate itemset generation and support counting problems, so they do not need to generate candidate itemsets. By contrast, Zaki and Gouda [17] reduce the memory requirement with a novel vertical data representation called diffset, which only keeps track of differences in the tids of a candidate.

6 Conclusions and future works

This paper has presented algorithms for solving the main problem in data mining, frequent itemset mining, and has addressed the issues in Apriori-like algorithms. Existing ARM algorithms suffer from many problems when mining large transaction datasets, chief among them: (1) repetitive I/O disk scans, (2) the huge computation involved in candidate generation, and (3) high memory dependency. This paper presents solutions to these issues through the CountTableFI and BinaryCountTableFI algorithms. They construct a highly compact count table, which is usually substantially smaller than the original database, and discover frequent itemsets with an intersection (AND) operation that is faster than the traditional item-comparison method used in many Apriori-like algorithms. This saves the costly database scans in the subsequent mining process. Our algorithms focus on solving the candidate itemset generation and support counting problems, so they do not need to generate candidate itemsets. CountTableFI and BinaryCountTableFI were implemented, and their performance was studied in comparison with Apriori and FP-Growth. Experimental evaluations on both synthetic and real data show that the proposed algorithms outperform Apriori-like algorithms in most cases. We plan to investigate the following future work: extending our algorithms to mine other kinds of patterns, such as sequential patterns, classification-based association rules, and high-dimensional association rules.

Dong and Han presented the BitTableFI algorithm [18] to decrease the cost of candidate generation and support counting in Apriori-like algorithms. However, in situations with a large number of frequent itemsets, long itemsets, or quite low minimum support thresholds, handling a huge number of candidate sets is costly; for candidate generate-and-test algorithms, the initial candidate generation, especially for the frequent 2-itemsets, is the key issue that really counts. BitTableFI must generate the length-2 candidates in the same way as Apriori does: storing the bit-vectors of these candidates leads to high spatial complexity, while counting their supports causes high temporal complexity. Our proposed CountTableFI and BinaryCountTableFI algorithms instead construct a highly compact count table, usually substantially smaller than the original database, which reduces the spatial complexity. The technique is based on set theory and its operations (set equivalence, subsets, ...) and, like FP-Growth, on a no-candidate-generation approach. The first algorithm, CountTableFI, uses set-theoretic properties to compress transactions and build the count table of items; the second, BinaryCountTableFI, extends the first by representing each transaction as binary and decimal numbers, again using set-theoretic properties to build the count table. Comparing our experiments with those of [18] under the same minimum supports and datasets, the improvement of our algorithms over Apriori is more significant than the improvement of the method in [18] over Apriori.

We also intend to address the problem of Incremental Maintenance of Frequent Itemsets (IMFI) and to apply lattice theory and combinatorial number theory to tackle the computational complexity of frequent itemset mining.