Keywords

1 Introduction

There are many algorithms [1] of mining frequent itemsets on association rules; the most representative are Apriori algorithm [2] and FP-tree algorithm [3]. Apriori algorithm is a basic algorithm to mine frequent itemsets, using a kind of iterative algorithm by searching layer by layer, and k-itemsets are used to generate k + 1 itemset. The FP-tree algorithm which only scans the database twice does not need to generate candidate itemsets; it adopts a divide-and-conquer strategy. However, the transaction database is generally relatively large. For counting the support in each iterative process, the cost of scanning the database every time is extremely high. To solve the problem, we can start from two aspects: reducing the generation of candidate itemsets and the amount of data scanned in the process of each iteration.

Paper [4] proposes a graph-based approach to generate frequent itemsets of association rules from a large database of customer transactions. This approach scans the database once to construct an association graph and mine large itemsets. This paper makes a further improvement on the basis of algorithm introduced in Paper [4], and our experimental results prove the validity of the efficiency about these improvements. Moreover, relevant experimental analysis has also been provided in this paper.

2 A Graph-Based Algorithm for Mining Association Rules

2.1 Basic Idea of DLG Algorithm

The graph-based mining algorithm, referred to as the DLG algorithm [4, 5], constructs an association graph to reflect association between itemsets and then traverses the association graph to generate frequent itemsets. The DLG algorithm uses a vertical data format for mining frequent itemsets [6], in which data are represented by the symbol of {item:tid_set}, and it consists of the following three stages:

  1. 1.

    The first stage: generating frequent 1-itemset. The phase generates frequent a-itemset and records relevant information. After scanning the database once, we number items and calculate their supports to create a bit vector, whose length is the total number of transaction in database.

  2. 2.

    The second stage: creating association graph. The stage generates frequent 2-itemsets by using the logical AND operator (can be denoted as the notation “\( \wedge \)”) among bit vector which is created in the first stage.

  3. 3.

    The third stage: generating frequent K + 1-itemsets (K >= 2). The stage generates larger frequent itemsets based on the second stage than before.

2.2 Efficiency Analysis of DLG Algorithm

The DLG algorithm has some advantages, compared with other association rule mining algorithms such as Apriori algorithm and FP-tree algorithm. It only needs to scan the database once; you can construct the transaction matrix, greatly reducing the I/O times and improving efficiency. However, the DLG algorithm has the similar problem too. The common problem is that they all exist in iterative process from frequent itemsets to candidate itemsets and from candidate itemsets to frequent itemsets, and therefore, it generates inevitably redundant candidate itemsets. In addition, the DLG algorithm is less efficient in the face of some huge transaction databases.

3 Improved Algorithm Based on DLG

3.1 Theoretical Basis

An improved algorithm based on partitioned and compressed association graph is presented in this paper. We will introduce three theorems, which are the theoretical basis of our algorithm, used in association graph before describing the algorithm.

Theorem 1

In the association graph, assume that the nodes are numbered sequentially. We partition the graph by the following properties: according to the order of smallest to largest in node number, the current node and those nodes, whose number are larger than the current one, directly connected to the former will be divided into the same part. So, it will generate many independent and disjoint sub-association graphs. Then mine frequent itemsets in each sub-association graph respectively. At last the union set from total frequent itemsets of all parts is equal to all frequent itemsets generated by the original association graph.

Proof

This proof can be divided into two steps.

First Proof: We will prove that the set of frequent itemsets generated by all parts is a subset of the set of all frequent itemsets generated by the original association graph.

We assume that the original association graph, G for short, is divided into n sub-association graphs, such as \( G_{1} , G_{2} , \ldots ,G_{n} \), and the frequent itemsets generated by association graph G can be named P. Based on the algorithm of DLG, the frequent items obtained by sub-association graph must be a subset of association graph G, that is, \( P_{1} \subset P \), \( P_{2} \subset P \), …, \( P_{n} \subset P \). In addition, \( P_{1} \mathop \cup \nolimits P_{2} \mathop \cup \nolimits \cdots \mathop \cup \nolimits P_{n} \subset P \) can work according to the property of set. So the conclusion is correct.

Second Proof: We will prove that the set of all frequent itemsets generated by original association graph is a subset of the set of frequent itemsets generated by all parts.

We use mathematical reductio ad absurdum to prove this problem.

Assume that there exists a frequent itemset, I for short, generated by the original association graph, which is not in union set of frequent itemsets generated by all parts. Therefore, there must exist one such scenario: One item A of I is only from one sub-association graph, and another item B of I only comes from another sub-association graph. Based on the nature that subset of frequent itemsets must be frequent itemsets, we know that there must be an edge between A and B. Besides, according to the division process, A and B, adjacent to each other, are necessarily divided into the same subgraph, and this leads to a contradiction. Hence, the total frequent itemsets of all parts contain all frequent itemsets generated by the original association graph.

Based on the nature of set, set A will be equal to set B if A is a subset of B and B is a subset of A at the same time. In conclusion, the Theorem 1 is correct according to the above two proofs.

Theorem 2

If the degree of some node is less than k  1, the node can not be extended into frequent k-itemsets when generating k-itemsets in association graph. Then the node together with its adjacent edges can be removed from the association graph, and thus this association graph is compressed. The proof of this theorem can be seen in paper [ 7 ].

Theorem 3

When extending the edge of the known frequent k-itemset in association graph, if there is no edge between a item node of the frequent k-itemset and the node to be extended, it should not be extended. This extended  \( k + 1 \) -itemset does necessarily not belong to the frequent itemsets so that we can tailor the extended candidate itemsets. The proof of this theorem can be seen in paper [ 8 ] where the author improved the DLG algorithm depending on this theorem.

3.2 PCAG Algorithm Description

The improved algorithm based on partitioned and compressed association graph is described as follows:

  1. 1.

    Scan the database to construct the transaction matrix, generate all frequent 2-itemsets by using the logical AND operator among BVi, and construct association graph with these frequent 2-itemsets;

  2. 2.

    Divide this graph in accordance with Theorem 1 into all sub-association graphs.

  3. 3.

    Mine frequent itemsets in each sub-association graph.

  4. 4.

    Extend frequent k − 1-itemsets into k-itemsets for each sub-association graph as follows: We first calculate the degrees of each node; the node together with its adjacent edge will be removed from the sub-association graph if its degree is less than k − 1 according to Theorem 2. Then, we apply Theorem 3 while extending frequent k − 1-itemsets.

  5. 5.

    k increases by one. If the nodes of sub-association graph are less than k, the algorithm is finished, otherwise turn to 4.

Now, the PCAG algorithm pseudo-code, except the description of the first step, is presented in the Tables 1, 2 and 3.

Table 1 PCAG algorithm
Table 2 Partition algorithm of PCAG
Table 3 Compression algorithm of PCAG

The paper improves the DLG algorithm based on its shortcomings that generate many redundant candidate itemsets and is not suited to the circumstances with large number of transactions. We first divide the association graph to reduce the complexity, especially in high volumes of customer transactions. Two improved solutions are adopted when we extend candidate itemsets for each sub-association graph. One of the two solutions is to reduce the size of association graph and the other is that extended condition of the itemsets is limited, thereby helping to improve the generation efficiency of the frequent itemsets.

3.3 Experimental Results and Analysis

This section is to compare the efficiency of the DLG algorithm and PCAG algorithm. Figure 1 shows the relative execution time for the two algorithms under different transaction counts, assuming that the minimum support threshold, minisup for short, is 2 %, and Fig. 2 shows the relative execution time for the two algorithms under different minisup, assuming that the number of transaction is 3,000. From the two tables, we know that both in different transaction counts and minisup, the PCAG algorithm has better performance.

Fig. 1
figure 1figure 1

Efficiency of different transaction counts with 2 % minisup

Fig. 2
figure 2figure 2

Efficiency of different minisup with 3,000 transactions

Fig. 1 shows that the number of transactions has less effect on PCAG algorithm than DLG algorithm. In other words, compared with DLG algorithm, the larger the transaction database is, the better the efficiency of PCAG algorithm is. In Fig. 2, in general, with the changing minisup, the PCAG algorithm has higher performance than DLG algorithm on the basis of the same transaction database.

In contrast to DLG algorithm, the PCAG algorithm has improved greatly in efficiency as shown in Figs. 1 and 2.

4 Conclusion and Future Work

Some feasible improvement thoughts on the basis of association rule mining algorithm called DLG are presented in this paper. On the one hand, we introduce the idea of partition and come up with a method to divide and conquer the association graph. This method is particularly applicable to association rule mining in the large-scale transaction database and makes up for disadvantages of the DLG algorithm. On the other hand, we propose a method to compress the associated graph. Combined with the property of Apriori, we impose the further restrictions on the extension of the frequent itemsets. All these methods together improve the efficiency of generating frequent itemsets. Obviously, the PCAG algorithm also has its defects. The improvement effect is not obvious when the minisup is too small. Because a dense association graph created by too many frequent itemsets will decrease the performance of PCAG algorithm when the minisup is too small.

From the experimental results, we know that the efficiency of the PCAG algorithm decreases slowly with the increase in the transaction database size; in other words, the number of transactions has less effect on PCAG algorithm. Taking into account these natures of algorithm, we can mine parallelly frequent itemsets of each sub-association graph when dealing with large transaction databases in our future study, because all sub-association graphs are independent of each other in the process of mining. Besides, we also attempt to apply the PCAG algorithm in other fields except database.