An Improved Algorithm for Mining Association Rules Based on Partitioned and Compressed Association Graph

Jiang, Hao; He, Yabo; Wan, Wei

doi:10.1007/978-81-322-1759-6_75

Hao Jiang⁴,
Yabo He⁴ &
Wei Wan⁴

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 255))

975 Accesses

Abstract

Mining association rules have been one of the research focuses in data mining; an improved algorithm based on graph is presented for discovering various types of association rules in this paper. We introduce the original algorithm based on graph; on that basis, we improve it according to the idea of partition and compression. Firstly, the association graph constructed is partitioned into many association subgraphs. Secondly, we compress these subgraphs and mine frequent itemsets respectively which will not influence each other in different parts. In the end, the improved algorithm is compared with the original one in performance.

Access provided by Autonomous University of Puebla. Download conference paper PDF

A New Association Rule Mining Algorithm Based on Compression Matrix

Discovery of Multi-frequent Patterns Using Directed Graph

An efficient method for mining association rules based on minimum single constraints

Article Open access 13 November 2014

Keywords

1 Introduction

There are many algorithms [1] of mining frequent itemsets on association rules; the most representative are Apriori algorithm [2] and FP-tree algorithm [3]. Apriori algorithm is a basic algorithm to mine frequent itemsets, using a kind of iterative algorithm by searching layer by layer, and k-itemsets are used to generate k + 1 itemset. The FP-tree algorithm which only scans the database twice does not need to generate candidate itemsets; it adopts a divide-and-conquer strategy. However, the transaction database is generally relatively large. For counting the support in each iterative process, the cost of scanning the database every time is extremely high. To solve the problem, we can start from two aspects: reducing the generation of candidate itemsets and the amount of data scanned in the process of each iteration.

Paper [4] proposes a graph-based approach to generate frequent itemsets of association rules from a large database of customer transactions. This approach scans the database once to construct an association graph and mine large itemsets. This paper makes a further improvement on the basis of algorithm introduced in Paper [4], and our experimental results prove the validity of the efficiency about these improvements. Moreover, relevant experimental analysis has also been provided in this paper.

2 A Graph-Based Algorithm for Mining Association Rules

2.1 Basic Idea of DLG Algorithm

The graph-based mining algorithm, referred to as the DLG algorithm [4, 5], constructs an association graph to reflect association between itemsets and then traverses the association graph to generate frequent itemsets. The DLG algorithm uses a vertical data format for mining frequent itemsets [6], in which data are represented by the symbol of {item:tid_set}, and it consists of the following three stages:

1.
The first stage: generating frequent 1-itemset. The phase generates frequent a-itemset and records relevant information. After scanning the database once, we number items and calculate their supports to create a bit vector, whose length is the total number of transaction in database.
2.
The second stage: creating association graph. The stage generates frequent 2-itemsets by using the logical AND operator (can be denoted as the notation “\( \wedge \)”) among bit vector which is created in the first stage.
3.
The third stage: generating frequent K + 1-itemsets (K >= 2). The stage generates larger frequent itemsets based on the second stage than before.

2.2 Efficiency Analysis of DLG Algorithm

The DLG algorithm has some advantages, compared with other association rule mining algorithms such as Apriori algorithm and FP-tree algorithm. It only needs to scan the database once; you can construct the transaction matrix, greatly reducing the I/O times and improving efficiency. However, the DLG algorithm has the similar problem too. The common problem is that they all exist in iterative process from frequent itemsets to candidate itemsets and from candidate itemsets to frequent itemsets, and therefore, it generates inevitably redundant candidate itemsets. In addition, the DLG algorithm is less efficient in the face of some huge transaction databases.

3 Improved Algorithm Based on DLG

3.1 Theoretical Basis

An improved algorithm based on partitioned and compressed association graph is presented in this paper. We will introduce three theorems, which are the theoretical basis of our algorithm, used in association graph before describing the algorithm.

Theorem 1

In the association graph, assume that the nodes are numbered sequentially. We partition the graph by the following properties: according to the order of smallest to largest in node number, the current node and those nodes, whose number are larger than the current one, directly connected to the former will be divided into the same part. So, it will generate many independent and disjoint sub-association graphs. Then mine frequent itemsets in each sub-association graph respectively. At last the union set from total frequent itemsets of all parts is equal to all frequent itemsets generated by the original association graph.

Proof

This proof can be divided into two steps.

First Proof: We will prove that the set of frequent itemsets generated by all parts is a subset of the set of all frequent itemsets generated by the original association graph.

We assume that the original association graph, G for short, is divided into n sub-association graphs, such as \( G_{1} , G_{2} , \ldots ,G_{n} \), and the frequent itemsets generated by association graph G can be named P. Based on the algorithm of DLG, the frequent items obtained by sub-association graph must be a subset of association graph G, that is, \( P_{1} \subset P \), \( P_{2} \subset P \), …, \( P_{n} \subset P \). In addition, \( P_{1} \mathop \cup \nolimits P_{2} \mathop \cup \nolimits \cdots \mathop \cup \nolimits P_{n} \subset P \) can work according to the property of set. So the conclusion is correct.

Second Proof: We will prove that the set of all frequent itemsets generated by original association graph is a subset of the set of frequent itemsets generated by all parts.

We use mathematical reductio ad absurdum to prove this problem.

Assume that there exists a frequent itemset, I for short, generated by the original association graph, which is not in union set of frequent itemsets generated by all parts. Therefore, there must exist one such scenario: One item A of I is only from one sub-association graph, and another item B of I only comes from another sub-association graph. Based on the nature that subset of frequent itemsets must be frequent itemsets, we know that there must be an edge between A and B. Besides, according to the division process, A and B, adjacent to each other, are necessarily divided into the same subgraph, and this leads to a contradiction. Hence, the total frequent itemsets of all parts contain all frequent itemsets generated by the original association graph.

Based on the nature of set, set A will be equal to set B if A is a subset of B and B is a subset of A at the same time. In conclusion, the Theorem 1 is correct according to the above two proofs.

Theorem 2

If the degree of some node is less than k − 1, the node can not be extended into frequent k-itemsets when generating k-itemsets in association graph. Then the node together with its adjacent edges can be removed from the association graph, and thus this association graph is compressed. The proof of this theorem can be seen in paper [ 7 ].

Theorem 3

When extending the edge of the known frequent k-itemset in association graph, if there is no edge between a item node of the frequent k-itemset and the node to be extended, it should not be extended. This extended \( k + 1 \) -itemset does necessarily not belong to the frequent itemsets so that we can tailor the extended candidate itemsets. The proof of this theorem can be seen in paper [ 8 ] where the author improved the DLG algorithm depending on this theorem.

3.2 PCAG Algorithm Description

The improved algorithm based on partitioned and compressed association graph is described as follows:

1.
Scan the database to construct the transaction matrix, generate all frequent 2-itemsets by using the logical AND operator among BVi, and construct association graph with these frequent 2-itemsets;
2.
Divide this graph in accordance with Theorem 1 into all sub-association graphs.
3.
Mine frequent itemsets in each sub-association graph.
4.
Extend frequent k − 1-itemsets into k-itemsets for each sub-association graph as follows: We first calculate the degrees of each node; the node together with its adjacent edge will be removed from the sub-association graph if its degree is less than k − 1 according to Theorem 2. Then, we apply Theorem 3 while extending frequent k − 1-itemsets.
5.
k increases by one. If the nodes of sub-association graph are less than k, the algorithm is finished, otherwise turn to 4.

Now, the PCAG algorithm pseudo-code, except the description of the first step, is presented in the Tables 1, 2 and 3.

Table 1 PCAG algorithm

Full size table

Table 2 Partition algorithm of PCAG

Full size table

Table 3 Compression algorithm of PCAG

Full size table

The paper improves the DLG algorithm based on its shortcomings that generate many redundant candidate itemsets and is not suited to the circumstances with large number of transactions. We first divide the association graph to reduce the complexity, especially in high volumes of customer transactions. Two improved solutions are adopted when we extend candidate itemsets for each sub-association graph. One of the two solutions is to reduce the size of association graph and the other is that extended condition of the itemsets is limited, thereby helping to improve the generation efficiency of the frequent itemsets.

3.3 Experimental Results and Analysis

This section is to compare the efficiency of the DLG algorithm and PCAG algorithm. Figure 1 shows the relative execution time for the two algorithms under different transaction counts, assuming that the minimum support threshold, minisup for short, is 2 %, and Fig. 2 shows the relative execution time for the two algorithms under different minisup, assuming that the number of transaction is 3,000. From the two tables, we know that both in different transaction counts and minisup, the PCAG algorithm has better performance.

Fig. 1 shows that the number of transactions has less effect on PCAG algorithm than DLG algorithm. In other words, compared with DLG algorithm, the larger the transaction database is, the better the efficiency of PCAG algorithm is. In Fig. 2, in general, with the changing minisup, the PCAG algorithm has higher performance than DLG algorithm on the basis of the same transaction database.

In contrast to DLG algorithm, the PCAG algorithm has improved greatly in efficiency as shown in Figs. 1 and 2.

4 Conclusion and Future Work

Some feasible improvement thoughts on the basis of association rule mining algorithm called DLG are presented in this paper. On the one hand, we introduce the idea of partition and come up with a method to divide and conquer the association graph. This method is particularly applicable to association rule mining in the large-scale transaction database and makes up for disadvantages of the DLG algorithm. On the other hand, we propose a method to compress the associated graph. Combined with the property of Apriori, we impose the further restrictions on the extension of the frequent itemsets. All these methods together improve the efficiency of generating frequent itemsets. Obviously, the PCAG algorithm also has its defects. The improvement effect is not obvious when the minisup is too small. Because a dense association graph created by too many frequent itemsets will decrease the performance of PCAG algorithm when the minisup is too small.

From the experimental results, we know that the efficiency of the PCAG algorithm decreases slowly with the increase in the transaction database size; in other words, the number of transactions has less effect on PCAG algorithm. Taking into account these natures of algorithm, we can mine parallelly frequent itemsets of each sub-association graph when dealing with large transaction databases in our future study, because all sub-association graphs are independent of each other in the process of mining. Besides, we also attempt to apply the PCAG algorithm in other fields except database.

References

Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. ACM SIGMOD 22(2), 207–216 (1993)
Article Google Scholar
Agrawal, R., Srikant, R.: Fast algorithm for mining association rules. IBM Research Report RJ9839 (1994)
Google Scholar
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 1–12 (2000)
Google Scholar
Yen, S.-J., Chen, A.L.P.: A graph-based approach for discovering various types of association rules. IEEE Trans. Knowl. Data Eng. 13(5), 839–845 (2001)
Article Google Scholar
Yen S.J., Chen, A.L.P.: An efficient approach to discovering knowledge from large databases. In: Proceedings of the IEEE/ACM International Conference on Parallel and Distributed Information Systems, pp. 8–18 (1996)
Google Scholar
Han, J., Kamber, M.: Data Mining Concepts and Techniques, 2nd edn. China Machine Press, Beijing (2007)
Google Scholar
Wan, W.: Research and Application: Algorithm of Mining Assocition Rules Based on Graph. SouthEast University, Nanjing (2012)
Google Scholar
Huang, H.: Revised algorithm of mining association rules based on graph. Comput. Digit. Eng. 37(12), 38–42 (2009)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu, China
Hao Jiang, Yabo He & Wei Wan

Authors

Hao Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Yabo He
View author publications
You can also search for this author in PubMed Google Scholar
Wei Wan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yabo He .

Editor information

Editors and Affiliations

Computer Science and Engineering, SOA University, Bhubaneswar, India
Srikanta Patnaik
Electronics and Computer Engineering Technology, Indiana State University, Terre Haute, IN, USA
Xiaolong Li

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Jiang, H., He, Y., Wan, W. (2014). An Improved Algorithm for Mining Association Rules Based on Partitioned and Compressed Association Graph. In: Patnaik, S., Li, X. (eds) Proceedings of International Conference on Computer Science and Information Technology. Advances in Intelligent Systems and Computing, vol 255. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1759-6_75

Download citation

DOI: https://doi.org/10.1007/978-81-322-1759-6_75
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1758-9
Online ISBN: 978-81-322-1759-6
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics

An Improved Algorithm for Mining Association Rules Based on Partitioned and Compressed Association Graph

Abstract

Similar content being viewed by others

A New Association Rule Mining Algorithm Based on Compression Matrix

Discovery of Multi-frequent Patterns Using Directed Graph

An efficient method for mining association rules based on minimum single constraints

Keywords

1 Introduction