
1 Introduction

The purpose of frequent itemset (abbr. FI) mining [1, 7, 13] is to discover sets of items that appear frequently in databases. This technology has been widely applied to various real-life applications, such as association analysis [5], web mining [20], text mining [9], and bioinformatics [16]. Many efficient methods have been developed for mining FIs, such as Apriori [1], FP-Growth [7], and Eclat [18]. When mining FIs from sparse databases, these methods usually perform well. This is because the correlations between items in sparse databases are relatively weak, so the FIs in such databases are relatively short. In this situation, traditional FI mining methods can make good use of the downward closure property [1, 7] to effectively prune the search space. Transaction databases are typical examples of sparse databases.

However, in real-life scenarios, many datasets are dense, such as plant feature data (e.g., the Mushrooms dataset [4]), records of game steps (e.g., the Connect and Chess datasets [4]), and census data (e.g., the Pumsb dataset [4]). When mining FIs from dense datasets, traditional FI mining methods usually generate a huge number of FIs, causing them to suffer from long execution times and large memory consumption. Intuitively, the more itemsets an approach needs to process, the more processing resources it consumes. When dealing with dense datasets, the performance of FI mining methods may therefore degrade greatly. This is because, in dense datasets, the correlations between items are relatively strong and the FIs are relatively long. In this situation, the downward closure property [1, 7] cannot effectively prune the search space.

To resolve the above problems, many studies have been devoted to developing condensed representations of FIs. The experimental results of previous studies [2, 6, 8, 11] have shown that mining condensed representations from dense datasets can greatly reduce execution time and memory usage. Many types of condensed representations of FIs have been proposed, such as free itemsets [2], maximal itemsets [6], generator itemsets [8], and closed itemsets [11]. Among these representations, the closed itemset is the most popular one. The frequent closed itemset (abbr. FCI) is a lossless condensed representation [8] of FIs: with a correct algorithm, all the FIs and their supports can be completely derived from the FCIs. Therefore, the complete set of FCIs retains the complete information of FIs without any loss. Although many studies [8, 17, 19] have been proposed for efficiently mining FCIs, few of them consider developing algorithms for efficiently deriving FIs from FCIs (called deriving algorithms for simplicity). However, developing efficient deriving algorithms is very important work. When all the FIs cannot be successfully mined from a dense dataset, an alternative solution is to first mine all the FCIs from the dataset, and then apply an efficient deriving algorithm to recover the complete set of FIs and their supports from the FCIs.

However, the traditional deriving algorithm [12] adopts a breadth-first search and top-down strategy to derive FIs: it derives FIs of length (k − 1) from the FCIs and FIs of length k. For the sake of simplicity, the traditional deriving algorithm is called LevelWise in this work. A drawback of LevelWise is that it needs to maintain the complete sets of FIs of lengths k and (k − 1) in memory during the deriving process. Therefore, when there are many FIs of lengths k and (k − 1), LevelWise consumes a large amount of memory, and may even fail to complete the deriving task due to running out of memory. Moreover, the deriving process of LevelWise involves a large number of search operations, which may result in long execution times.

In view of the above, this paper proposes a novel algorithm, named DFI-Growth (Deriving Frequent Itemsets based on Pattern Growth), for efficiently deriving FIs and their supports from FCIs. The design idea of DFI-Growth is to first construct an FP-Tree to maintain the information of FCIs, and then use FP-Growth to generate FIs from the FP-Tree. However, the traditional FP-Tree and FP-Growth are not designed for the deriving task, and directly applying them produces incorrect results. Therefore, we propose two new strategies and modify the algorithmic processes of FP-Tree and FP-Growth so that they derive the correct information of FIs from FCIs. The two proposed strategies are called maximum support replacement and maximum support selection, respectively. The former is applied during the construction of the FP-Tree, while the latter is applied during pattern generation. We perform experiments on both real-life and synthetic datasets to evaluate the performance of the proposed DFI-Growth algorithm. Extensive experiments show that DFI-Growth is quite efficient and correctly derives all the FIs and their supports. Moreover, its performance is significantly better than that of the current best method LevelWise [12] in terms of both execution time and memory usage. For instance, on the Chess dataset, when the minimum support threshold is set to 40%, LevelWise cannot complete the deriving task within 12 h (i.e., 43,200 s), whereas the proposed DFI-Growth takes only 20 s.

2 Basic Concept and Definitions

Let I* = {I1, I2, …, IM} be a finite set of distinct items. A transaction database D = {t1, t2, …, tH} is a set of transactions, where each transaction tR ∈ D (1 ≤ R ≤ H) is a subset of I* and has a unique transaction identifier R. An itemset is a set of items. If all the items of an itemset X are contained in a transaction T, X is said to be contained in T, which is denoted by X ⊆ T.

Definition 1 (Length of an itemset).

The length of an itemset X = {I1, I2, …, IK} is defined as K, where K is the total number of distinct items in X.

Definition 2 (Tidset of an itemset).

The Tidset of an itemset X is denoted as r(X) and defined as the set of transaction identifiers of all the transactions containing X in D.

Definition 3 (Support of an itemset).

The support count of an itemset X is denoted as SC(X) and defined as |r(X)|. Besides, the support of X is defined as SC(X)/|D|, where |D| is the total number of transactions in D.

Definition 4 (Frequent itemset).

An itemset X is called a frequent itemset iff the support of X is no less than a user-specified minimum support threshold θ (0 < θ ≤ 1).

Definition 5 (Super-itemset and sub-itemset).

If an itemset X is a subset of another itemset Y, then X is called a sub-itemset of Y. Besides, Y is called a super-itemset of X.

Definition 6 (Closure of an itemset).

The closure of an itemset X is denoted as ρ(X) and defined as

$$ \rho (X) = \bigcap\nolimits_{R \in r(X)} {t_{R} } , $$

where ρ(X) returns the largest super-itemset Y in D such that X ⊆ Y and SC(X) = SC(Y). For example, in the database of Table 1, the closure of the itemset {W} is ρ({W}) = t1 ∩ t2 ∩ t3 ∩ t4 ∩ t5 = {ACTW} ∩ {CDW} ∩ {ACTW} ∩ {ACDW} ∩ {ACDTW} = {CW}, where t1 to t5 are the transactions containing {W}.

Property 1.

For any itemset X, SC(X) = SC(ρ(X)) ⇔ r(X) = r(ρ(X)).

Definition 7 (Closed itemset).

An itemset X is called a closed itemset iff there exists no other itemset Y such that X ⊂ Y and SC(X) = SC(Y). Otherwise, X is a non-closed itemset. Moreover, the closure of a closed itemset X is equal to X itself (i.e., ρ(X) = X).

For example, in the database of Table 1, the itemset {W} is non-closed because ρ({W}) = {CW}. However, the itemset {C} is closed because ρ({C}) = {C}.

Table 1. An example of a transaction database.

TID  Items
t1   {A, C, T, W}
t2   {C, D, W}
t3   {A, C, T, W}
t4   {A, C, D, W}
t5   {A, C, D, T, W}
t6   {C, D, T}

Definition 8 (Frequent closed itemset).

A closed itemset X is called a frequent closed itemset iff the support of X is not less than a user-specified minimum support threshold θ (0 < θ ≤ 1).
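The following minimal Java sketch makes Definitions 2 and 3 concrete by computing the Tidset and support of an itemset over the database of Table 1; the class and method names are illustrative only.

```java
import java.util.*;

// A minimal sketch of Definitions 2 and 3: computing the Tidset r(X) and the
// support of an itemset X over the database of Table 1.
public class SupportSketch {

    // r(X): the identifiers of all the transactions containing X (Definition 2).
    static Set<Integer> tidset(Set<String> x, List<Set<String>> db) {
        Set<Integer> tids = new HashSet<>();
        for (int r = 0; r < db.size(); r++) {
            if (db.get(r).containsAll(x)) tids.add(r + 1); // identifiers start at 1
        }
        return tids;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(                    // the database of Table 1
            Set.of("A", "C", "T", "W"), Set.of("C", "D", "W"),
            Set.of("A", "C", "T", "W"), Set.of("A", "C", "D", "W"),
            Set.of("A", "C", "D", "T", "W"), Set.of("C", "D", "T"));
        Set<String> x = Set.of("C", "W");
        int sc = tidset(x, db).size();                     // SC(X) = |r(X)| (Definition 3)
        double support = (double) sc / db.size();          // support(X) = SC(X) / |D|
        System.out.println("SC({CW}) = " + sc + ", support = " + support); // 5, about 0.83
    }
}
```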

Problem Definition.

Let F and C be the complete sets of FIs and FCIs in a transaction database D, where each itemset in F or C is associated with its support count. Given C, the problem to be solved in this work is to develop an algorithm for efficiently and correctly deriving F from C.
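Expressed as code, the problem is a single transformation from C to F; a minimal Java sketch of this contract is given below, where the interface name and type parameters are of our own choosing.

```java
import java.util.Map;
import java.util.Set;

// Input: C, mapping each FCI to its support count.
// Output: F, mapping every FI to its support count.
// The interface name and generics are illustrative only.
interface FiDeriver {
    Map<Set<String>, Integer> derive(Map<Set<String>, Integer> fcis);
}
```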

3 Related Work

3.1 Frequent Itemset Mining

Extensive studies have been devoted to mining FIs. The Apriori algorithm [1] is one of the best-known algorithms for FI mining. Apriori adopts a breadth-first search strategy and a candidate generation-and-test scheme to discover FIs. Although Apriori has been applied in various applications and domains, it needs to scan the original database several times and may generate a large number of candidates during the mining process, which degrades its performance. Another famous algorithm is FP-Growth [7], which is often used as the benchmark algorithm for performance comparisons in the pattern mining field. FP-Growth uses a compact FP-Tree as its internal data structure and adopts a pattern growth methodology to generate FIs from the FP-Tree. The most impressive contribution of FP-Growth is that it only needs to scan the original database twice and discovers FIs without candidate generation. It is widely recognized that pattern growth algorithms are generally superior to Apriori-like algorithms. In addition to Apriori and FP-Growth, many other efficient FI mining algorithms have been proposed, such as H-mine [14], DHP [13], Eclat [18], and COFI-Mine [10]. Although FI mining algorithms may perform well when mining FIs from sparse databases, when mining dense databases, the large number of FIs may seriously degrade their performance. This is because the more FIs that need to be processed, the more computation the algorithms require.

3.2 Frequent Closed Itemset Mining

Instead of mining all the FIs, some studies propose to mine condensed representations of FIs. In general, a condensed representation of FIs is a subset of the FIs, so the total number of itemsets in the representation is generally less than the number of FIs. Since the number of itemsets to be mined is reduced, the execution time and memory usage of the mining task are also reduced. Different types of condensed representations of FIs have been proposed, including maximal itemsets [6], closed itemsets [11], non-derivable itemsets [3], and generator itemsets [8]. Among these representations, the frequent closed itemset is a lossless condensed representation of FIs: mining only the FCIs loses no information about any FI. Due to this special property, FCI mining has become a popular and widely researched technology, and many algorithms have been proposed for efficiently mining FCIs, such as A-Close [11], CLOSET+ [17], CHARM [19], and DCI_Closed [8].

Nevertheless, few of these studies focus on developing algorithms for deriving FIs from FCIs. However, developing efficient deriving algorithms is an important task. In many applications, users may need to discover FIs from dense datasets rather than FCIs. In this case, discovering only the FCIs cannot fulfill the users' needs, while discovering FIs with traditional FI mining algorithms may be inefficient. A solution to this problem is to discover the FCIs first, and then generate the complete information of FIs through a correct and efficient deriving algorithm. To the best of our knowledge, only one study [12] has proposed such a deriving algorithm (called LevelWise in this work). However, it was proposed in 1999; in the nearly 20 years since, no other studies have revisited this research problem or developed new deriving algorithms.

3.3 Traditional Deriving Method

This subsection introduces the process of the LevelWise algorithm [12], which consists of two main steps below.

  • Step 1. The algorithm sorts all the FCIs in descending order of length. Assume that the length of the longest FCI is M, and let LM denote the set of all the itemsets of length M collected so far. A variable K is set to M.

  • Step 2. For each itemset X in LK, the algorithm generates all its sub-itemsets of length (K − 1). Then, for each generated sub-itemset Y, the algorithm searches LK−1 to see whether there exists an itemset Y′ in LK−1 such that Y = Y′. If no such itemset is found, it puts a key-value pair (Y, s) into LK−1, where s is called the current count of Y and is initially set to the support count of X. Otherwise, if such an itemset Y′ is found in LK−1 and SC(X) is greater than the current count of Y′, the algorithm replaces the current count of Y′ with SC(X). Finally, the algorithm decreases K by 1 and repeats Step 2 until K = 1.

    Although LevelWise is the first algorithm for deriving FIs from FCIs, when LK and LK−1 store too many itemsets, it suffers from huge memory consumption and spends a long time searching and matching itemsets. In the worst case, it is even unable to complete the whole process due to running out of memory. A sketch of the procedure is given below.
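The following minimal Java sketch illustrates the two steps. It assumes, as the description implies, that each level LK is seeded with the FCIs of length K before its (K − 1)-sub-itemsets are generated; all names are illustrative.

```java
import java.util.*;

// A minimal sketch of LevelWise, assuming each level LK is seeded with the
// FCIs of length K before its (K - 1)-sub-itemsets are derived.
public class LevelWiseSketch {

    static Map<Set<String>, Integer> derive(Map<Set<String>, Integer> fcis) {
        int maxLen = fcis.keySet().stream().mapToInt(Set::size).max().orElse(0);
        Map<Set<String>, Integer> result = new HashMap<>();
        Map<Set<String>, Integer> levelK = new HashMap<>();
        for (int k = maxLen; k >= 1; k--) {
            // Seed level K with the FCIs of length K (an FCI's own support
            // always dominates the counts inherited from its super-itemsets).
            for (Map.Entry<Set<String>, Integer> e : fcis.entrySet())
                if (e.getKey().size() == k)
                    levelK.merge(e.getKey(), e.getValue(), Math::max);
            result.putAll(levelK);
            // Step 2: each (K - 1)-sub-itemset takes the maximum current count
            // over all its length-K super-itemsets.
            Map<Set<String>, Integer> nextLevel = new HashMap<>();
            for (Map.Entry<Set<String>, Integer> e : levelK.entrySet()) {
                for (String item : e.getKey()) {
                    Set<String> sub = new HashSet<>(e.getKey());
                    sub.remove(item);
                    if (!sub.isEmpty()) nextLevel.merge(sub, e.getValue(), Math::max);
                }
            }
            levelK = nextLevel; // both levels coexist in memory at this point
        }
        return result;
    }
}
```

On the running example of Table 1, derive() called with C = {{CWA}:4, {CWD}:3, {CWAT}:3, {CW}:5, {CD}:4, {CT}:4, {C}:6} returns, among others, {C}:6, {W}:5, {A}:4, {T}:4, and {D}:4.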

4 The Proposed Method: DFI-Growth

This section introduces the proposed algorithm DFI-Growth (Deriving Frequent Itemsets based on Pattern Growth). DFI-Growth is developed by adapting FP-Tree and FP-Growth. We briefly describe the FP-Tree structure adopted in DFI-Growth. In DFI-Growth, the FP-Tree is used to maintain the information of FCIs instead of transactions. In an FP-Tree, each node N represents an item I ∈ I* and consists of five fields: N.name, N.count, N.parent, N.child, and N.link. The N.name and N.count fields store the item name of I and a count value of I, respectively. The N.parent and N.child fields store the parent node of N and the child nodes of N, respectively. The N.link field stores a node link, which points to another node having the same item name as N. Each FP-Tree is associated with a header table, which records the information of the items appearing in the FCIs. In a header table H, the entry for each item I consists of three fields: H[I].name, H[I].count, and H[I].link, which respectively record the name of item I, a count of I, and a node link pointing to an FP-Tree node having the same item name as I. For more information about the FP-Tree and header table, readers may refer to [7].
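A minimal Java sketch of these two structures is shown below; the field names follow the text, while the container choices (e.g., a map of children keyed by item name) are our assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of the FP-Tree node and header-table entry described above.
class FPNode {
    String name;                                 // N.name: the item this node represents
    int count;                                   // N.count: a count value of the item
    FPNode parent;                               // N.parent: the parent node of N
    Map<String, FPNode> child = new HashMap<>(); // N.child: the child nodes of N
    FPNode link;                                 // N.link: next node with the same item name
}

class HeaderEntry {
    String name;   // H[I].name: the name of item I
    int count;     // H[I].count: a count of I
    FPNode link;   // H[I].link: head of the node-link list for item I
}
```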

4.1 Construction of an FP-Tree for the Deriving Task

This subsection introduces the construction process of an FP-Tree for the deriving task, which consists of the following four main steps; a condensed code sketch follows the steps.

  • Step 1. Let C be the complete set of FCIs discovered from a database D. DFI-Growth scans C once and derives the support count of each item appearing in C according to the proposed Maximum Support Selection for Item (abbr. MSS-I) strategy.

  • Strategy 1 (MSS-I). Let I be an item and S(I) = {c1, c2, …, cz} be the set of all the FCIs containing I. The support count of I is equal to max{SC(c1), SC(c2), …, SC(cz)}.

    For example, if the minimum support count is set to 3, the left side of Fig. 1 shows the complete set of FCIs in the database of Table 1: C = {{CWA}:4, {CWD}:3, {CWAT}:3, {CW}:5, {CD}:4, {CT}:4, {C}:6}, where the number beside each itemset is its support count. The set of all the FCIs containing the item {W} is S({W}) = {{CWA}:4, {CWD}:3, {CWAT}:3, {CW}:5}. Therefore, the support count of {W} is SC({W}) = max{4, 3, 3, 5} = 5. Readers may wonder why the support count of {W} is neither 3 nor 4, but 5. This is because the support count of {CW} is 5, which means that there are 5 transactions containing {CW}, and all of these transactions also contain {W}. Therefore, the support count of {W} is at least 5. Readers may then ask whether the support count of {W} could be higher than 5, for example 6. The answer is no. The reason is that if an itemset X is non-closed (Definition 6), X has exactly one closure ρ(X) = Y (Y ≠ X and Y ∈ C) such that SC(X) = SC(ρ(X)) (Definition 7 and Property 1); if an itemset X is closed, its closure is X itself (Definition 7). After applying the MSS-I strategy, the correct support counts of the items are shown on the right side of Fig. 1.

    Fig. 1. An example of the MSS-I strategy.

  • Step 2. The algorithm creates a header table H and puts the items appearing in C into H in a fixed sorting order f. Then, it creates a node R as the root of the FP-Tree T, scans C again, and sorts the items in each FCI by the order f.

  • Step 3. The algorithm inserts each sorted FCI into the FP-Tree T by calling the subroutine InsertFCI, whose pseudocode is shown in Fig. 2. Each time a frequent closed itemset X = {I1, I2, …, IK} is retrieved, the two variables N and w are set to the root node R and 1, respectively. Besides, the variable Iw is set to I1, the first item of X. Then, the subroutine InsertFCI(N, Iw) is called to insert X into the FP-Tree T.

    Fig. 2. The pseudocode of the InsertFCI subroutine.

    The subroutine InsertFCI(N, Iw) works as follows. If the node N has a child node N′ such that N′.name = Iw, then the algorithm applies the maximum support replacement (abbr. MSR) strategy to set N′.count (Line 1–Line 2).

  • Strategy 2 (MSR). If SC(X) > N′.count, then N′.count is replaced by SC(X).

    Otherwise, if N has no child node N′ such that N′.name = Iw, the algorithm creates a new node N′ and sets it as a child of N. Besides, it sets N′.count and N′.link to the support count of X and to the most recently created node having the same item name as Iw, respectively (Line 3–Line 6). Next, the variable w is increased by 1. After that, if w is not greater than K, the subroutine InsertFCI(N′, Iw) is called recursively to process the w-th item of X (Line 7).

    For instance, Fig. 3 shows an example of the MSR strategy. Figure 3(a) shows the resulting FP-Tree after inserting the FCI {CWA}:4. Figure 3(b) shows the process of inserting the frequent closed itemset {CW}:5. When the first item of {CW} is inserted into the FP-Tree, there already exists a node N′ = {C} under the root, and the item name of N′ equals that of the first item of X (i.e., {C}). Therefore, DFI-Growth applies the MSR strategy to set N′.count. Because the support count of {CW} (i.e., 5) is greater than N′.count (i.e., 4), N′.count is replaced by SC({CW}) = 5. After that, the second item of {CW} is processed in the same way. Figure 3(c) and (d) show the insertion of the frequent closed itemsets {CD}:4 and {C}:6, respectively.

    Fig. 3. An example of the MSR strategy.

  • Step 4. During the FP-Tree construction, all the nodes having the same item name are linked together by the node links of the header table H and the related nodes in the FP-Tree T. Through these node links, DFI-Growth can traverse the nodes having the same item name efficiently.
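The following condensed Java sketch shows how Steps 1–3 fit together, reusing the FPNode type from the earlier sketch: mssI() computes the header-table counts by Strategy 1, and insertFCI() applies Strategy 2 while inserting a sorted FCI. Node-link maintenance (Step 4) is omitted for brevity, and all names are illustrative.

```java
import java.util.*;

// A condensed sketch of Steps 1-3 above, reusing FPNode from the earlier sketch.
class FPTreeBuilder {

    // Strategy 1 (MSS-I): the support count of an item is the maximum support
    // count over all the FCIs containing it.
    static Map<String, Integer> mssI(Map<Set<String>, Integer> fcis) {
        Map<String, Integer> itemCount = new HashMap<>();
        for (Map.Entry<Set<String>, Integer> e : fcis.entrySet())
            for (String item : e.getKey())
                itemCount.merge(item, e.getValue(), Math::max);
        return itemCount;
    }

    // Insert the items of one FCI (already sorted by the order f), starting at
    // position w, under node n; sc is the FCI's support count.
    static void insertFCI(FPNode n, List<String> items, int w, int sc) {
        if (w >= items.size()) return;
        String iw = items.get(w);
        FPNode child = n.child.get(iw);
        if (child != null) {
            // Strategy 2 (MSR): keep the maximum support count on this node.
            if (sc > child.count) child.count = sc;
        } else {
            child = new FPNode();        // no matching child: create a new node
            child.name = iw;
            child.count = sc;
            child.parent = n;
            n.child.put(iw, child);
        }
        insertFCI(child, items, w + 1, sc); // process the next item of the FCI
    }
}
```

Replaying the insertions of Fig. 3 with this sketch ({CWA}:4, then {CW}:5, {CD}:4, and {C}:6) yields the node counts described above: the node {C} ends with count 6, {W} with count 5, and {D} with count 4.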

4.2 Deriving FIs by DFI-Growth

After constructing the FP-Tree T and the header table H, the algorithm calls the subroutine DFI-Growth(H, T, α) to derive FIs from T, where the variable α is initialized to ϕ. The subroutine DFI-Growth(H, T, α) works as follows; a simplified code sketch is given after the steps.

  • Step 1. The algorithm visits every item I in the header table H. For each visited item, it outputs {α ∪ I} and H[I].count as an FI and its support count. Through the node link in H[I].link, DFI-Growth then traverses all the nodes in the FP-Tree T having the same item name as I.

  • Step 2. Let β = {β1, β2, …, βm} be the set of nodes having the same item name as I. For each node βi ∈ β (1 ≤ i ≤ m), the algorithm walks from the parent node of βi up to the root node R of T and collects the items of the visited nodes. These items form a sequence Pi (1 ≤ i ≤ m) called a conditional itemset. Each conditional itemset Pi is associated with a count called the conditional itemset count, denoted CIC(Pi), which is initially set to βi.count. After processing all the nodes in β in the same way, the conditional database of the itemset {α ∪ I} is obtained. Let D{α∪I} = {P1:CIC(P1), P2:CIC(P2), …, Pm:CIC(Pm)} denote the conditional database of {α ∪ I}.

  • Step 3. The algorithm scans the conditional database D{α∪I} once to find all the distinct items in the conditional itemsets. For each such item x, it calculates the support count of the frequent itemset {α ∪ I ∪ x} by the proposed Maximum Support Selection for itemSet (abbr. MSS-S) strategy.

  • Strategy 3 (MSS-S). Let x be an item appearing in D{α∪I} and P* = {P′1:CIC(P′1), P′2:CIC(P′2), …, P′h:CIC(P′h)} be the set of all the conditional itemsets containing x in D{α∪I}. The support count of the frequent itemset {α ∪ I ∪ x} is equal to max{CIC(P′1), CIC(P′2), …, CIC(P′h)}, where CIC(P′i) is the conditional itemset count of P′i (1 ≤ i ≤ h).

    Figure 4 shows an example of the MSS-S strategy. Consider the conditional database of the itemset {T}, which contains two conditional itemsets, {CWA}:3 and {C}:4. At this step of the algorithm, the value of the variable α is ϕ. The set of all the conditional itemsets containing the item {C} is {{CWA}:3, {C}:4}. The support count of {ϕ ∪ T ∪ C} = {TC} is max{CIC({CWA}), CIC({C})} = max{3, 4} = 4. Therefore, the algorithm outputs {TC} as an FI with support count 4.

    Fig. 4. An example of the MSS-S strategy.

  • Step 4. The algorithm puts the items in D{α∪I} into a new header table H′ in the sorting order f. Then, it scans the conditional database D{α∪I} again and sorts the items in each conditional itemset according to f.

  • Step 5. The algorithm creates a new node R′ as the root of a new FP-Tree T′ for the itemset {α ∪ I}, and inserts every sorted conditional itemset in D{α∪I} into T′ to construct the conditional FP-Tree of {α ∪ I}. Each time a sorted conditional itemset J is inserted, the variables N and Iw are set to R′ and the first item of J, and the subroutine InsertFCI(N, Iw) is called. After constructing T′, all the nodes in T′ having the same item name are linked together by node links.

  • Step 6. If the constructed FP-Tree T′ contains nodes other than the root node R′, the subroutine DFI-Growth(H′, T′, α ∪ I) is called to derive FIs recursively.
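The following simplified Java sketch captures the recursion of Steps 1–6 in a compact form: it represents each conditional database directly as (conditional itemset, count) pairs instead of a conditional FP-Tree, which keeps the MSS-S logic visible, although, unlike the tree-based algorithm with its fixed order f, it may enumerate the same itemset more than once. All names are illustrative.

```java
import java.util.*;

// A simplified recursion over conditional databases held as (itemset, count)
// maps. db is the conditional database of the prefix alpha; out collects the
// derived FIs with their support counts.
class DFIGrowthSketch {

    static void derive(Map<Set<String>, Integer> db, Set<String> alpha,
                       Map<Set<String>, Integer> out) {
        // Strategy 3 (MSS-S): the support count of {alpha ∪ I} is the maximum
        // conditional itemset count over all conditional itemsets containing I.
        Map<String, Integer> itemCount = new HashMap<>();
        for (Map.Entry<Set<String>, Integer> e : db.entrySet())
            for (String i : e.getKey())
                itemCount.merge(i, e.getValue(), Math::max);

        for (Map.Entry<String, Integer> e : itemCount.entrySet()) {
            Set<String> fi = new HashSet<>(alpha);
            fi.add(e.getKey());
            out.put(fi, e.getValue());          // output {alpha ∪ I} and its count
            // Build the conditional database of {alpha ∪ I}; duplicated
            // conditional itemsets are merged by keeping the maximum count.
            Map<Set<String>, Integer> condDb = new HashMap<>();
            for (Map.Entry<Set<String>, Integer> c : db.entrySet()) {
                if (!c.getKey().contains(e.getKey())) continue;
                Set<String> rest = new HashSet<>(c.getKey());
                rest.remove(e.getKey());
                if (!rest.isEmpty()) condDb.merge(rest, c.getValue(), Math::max);
            }
            if (!condDb.isEmpty()) derive(condDb, fi, out); // recurse (Step 6)
        }
    }
}
```

Calling derive(C, new HashSet<>(), out) with C being the FCIs of Table 1 fills out with all the FIs and their support counts; for instance, {TC} is produced with count 4, as in the example of Fig. 4.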

5 Experimental Results

In this section, we evaluate the performance of the proposed DFI-Growth algorithm and compare it with the current best algorithm LevelWise [12]. The purpose of both algorithms is the same: deriving FIs and their supports from FCIs. All the experiments are conducted on a computer equipped with an Intel® Core™ i7-8700 3.20 GHz CPU and 16 GB of RAM, running Windows 10 Professional. All the compared algorithms are implemented in the Java programming language. Five real-life datasets and one synthetic dataset are used for the performance evaluation: the real-life datasets are Mushrooms, Chess, Connect, BMS_WebView_1, and Chainstore, while the synthetic dataset is c20d10k. All the datasets were downloaded from SPMF [4]. Table 2 shows the characteristics of the datasets. Both the execution time and the memory usage of the algorithms are evaluated; we measure the maximum memory usage using the Java API, as sketched below.
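One simple way to track peak heap usage with the standard Java Runtime API is shown in the helper below; the helper is illustrative and not part of the algorithms themselves.

```java
// Samples used heap at checkpoints and keeps the maximum; calling sample()
// inside the deriving loop gives a peak-memory figure for the run.
class MemoryMeter {
    private long peakBytes = 0;

    void sample() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory(); // currently used heap
        peakBytes = Math.max(peakBytes, used);
    }

    double peakMegabytes() {
        return peakBytes / (1024.0 * 1024.0);
    }
}
```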

Table 2. Characteristics of different datasets used in the experiments.

Figures 5 and 6 respectively show the execution times and memory usage of the compared algorithms. The execution times of DFI-Growth and LevelWise on the dense datasets Mushrooms, Chess, and Connect are shown in Fig. 5(a), (b), and (c), respectively. As shown in these figures, the performance of DFI-Growth is superior to that of LevelWise. For example, on the Mushrooms dataset, when the minimum support threshold is lower than 1%, LevelWise cannot even complete the deriving task because it runs out of memory. Besides, on the Chess and Connect datasets, LevelWise cannot complete the deriving task within 12 h (i.e., 43,200 s) under low minimum support thresholds, whereas the proposed DFI-Growth takes only around 60 s to derive the complete FIs and their supports from the FCIs. This is because LevelWise adopts a breadth-first, top-down method to derive FIs from FCIs: when there are many FCIs and FIs, it spends a lot of time searching and matching itemsets, and it always keeps two levels of itemsets in memory, which occupies considerable memory space during the deriving process. In contrast, DFI-Growth adopts pattern growth with a divide-and-conquer scheme, which partitions the whole deriving task into smaller independent ones and thus generally takes less time than LevelWise.

Fig. 5. Execution times of DFI-Growth and LevelWise on different datasets.

Fig. 6. Memory usage of LevelWise and DFI-Growth on different datasets.

Figure 5(d), (e), and (f) show the execution times of DFI-Growth and LevelWise on the sparse datasets. From these figures, we observe that the performance gap between LevelWise and DFI-Growth is small when the threshold is high. This is because when the threshold is high, the number of FCIs is small and the FCIs are relatively short; therefore, both algorithms work well. However, as the threshold decreases, the performance gap between LevelWise and DFI-Growth increases significantly. As shown in the experiments, in terms of both execution time and memory consumption, the proposed DFI-Growth generally performs much better than LevelWise.

6 Conclusion

Given a complete set of FCIs and their supports, the problem addressed in this paper is how to efficiently, correctly, and completely derive the FIs from the FCIs. To resolve this problem, we propose a novel algorithm, named DFI-Growth (Deriving Frequent Itemsets based on Pattern Growth), for the deriving task. Moreover, we propose two novel strategies, named maximum support selection (abbr. MSS) and maximum support replacement (abbr. MSR), and incorporate them into DFI-Growth for efficiently deriving the complete set of FIs with correct supports. The MSS strategy calculates the correct supports of FIs from the information of FCIs, while the MSR strategy is applied during FP-Tree construction to set the correct information of FP-Tree nodes. Experiments show that the execution time and memory usage of DFI-Growth are significantly better than those of the current best algorithm LevelWise [12]. For example, on the Chess dataset, LevelWise cannot complete the deriving task within 12 h, while DFI-Growth finishes in only 20 s. We have contributed the implementations of LevelWise [21] and DFI-Growth [22] to the well-known open-source data mining library SPMF [4], to benefit related research and real-life applications.