Abstract
A new association rule mining algorithm based on a compression matrix is introduced. It compresses the transaction matrix efficiently by integrating several strategies. The new algorithm improves on the matrix-based association rule mining algorithms proposed by researchers in recent years, greatly reducing both time and space complexity and thereby raising the efficiency of association rule mining. It is especially effective when the order of the frequent itemsets is high.
1 Introduction
Association rule mining is one of the most important and well-researched techniques of data mining. It was first introduced by Agrawal et al. [1], who presented the well-known Apriori algorithm in 1993. Since then, many methods have been applied to improve and optimize the Apriori algorithm, such as binary coding, genetic algorithms, and matrix-based algorithms [2, 3]. A matrix-based algorithm scans the database only once to convert the transactions into a matrix; the matrix can then be reordered by item support count in non-descending order to reduce the number of candidate itemsets, which greatly improves the time and space efficiency of the Apriori algorithm.
A great deal of work has been done on matrix-based Apriori algorithms [4, 5]. In this chapter, a new improvement of the Apriori algorithm based on a compression matrix is proposed, which achieves better performance.
2 Preliminaries
Some basic preliminaries used in association rule mining are introduced in this section. Let T = {T_1, T_2, ⋯, T_n} be a database of transactions, where T_j (j = 1, 2, ⋯, n) denotes a transaction. Let I = {I_1, I_2, ⋯, I_m} be a set of binary attributes, called items, where I_i (i = 1, 2, ⋯, m) denotes an item. Each transaction T_j in T contains a subset of the items in I. The number of items contained in T_j is called the length of transaction T_j, written |T_j|.
An association rule is defined as an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅. The support of the rule X ⇒ Y is the support (i.e., the frequency) of the itemset X ∪ Y. If the support of an itemset X is greater than or equal to a user-specified threshold min-sup, then X is called a frequent itemset.
In association rule mining, we first find the frequent itemsets and then produce association rules from them. The key procedure of association rule mining is therefore finding the frequent itemsets; some properties of frequent itemsets are given as follows:
Property 1 [1]
Every nonempty subset of a frequent itemset is also a frequent itemset.
By the definition of frequent k-itemset, the conclusion below is easily obtained.
Property 2
If the length |T_i| of a transaction T_i is less than k, then T_i is useless for generating frequent k-itemsets.
3 An Improvement on Apriori Algorithm Based on Compression Matrix
A new improvement on Apriori algorithm based on compression matrix is introduced. The process of our new algorithm is described as follows:
1. Generate the transaction matrix.
For a given database with n transactions and m items, the m × n transaction matrix D = (d_ij) is determined, where

$$ d_{ij}=\begin{cases}1, & I_i\in T_j\\ 0, & I_i\notin T_j\end{cases} \qquad i = 1, 2, \cdots, m;\ j = 1, 2, \cdots, n. $$

For each item I_k and each transaction T_j, the row and column sums are

$$ v_k=\sum_{j=1}^{n} d_{kj},\quad k = 1, 2, \cdots, m; \qquad h_j=\sum_{i=1}^{m} d_{ij},\quad j = 1, 2, \cdots, n. $$
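The construction above can be sketched in a few lines. The toy database below is invented for illustration (it is not Table 33.1 from the example section); rows of D are items, columns are transactions, and the row/column sums give v_k and h_j:

```python
import numpy as np

# Hypothetical toy database for illustration: each transaction is a set of
# item indices (these values are invented, not from the paper's example).
transactions = [{0, 2}, {0, 1, 2}, {1, 3}, {0, 2, 3}]
m = 4                     # number of items I_1..I_m (rows)
n = len(transactions)     # number of transactions T_1..T_n (columns)

# Build the m x n transaction matrix D: d_ij = 1 iff item I_i is in T_j.
D = np.zeros((m, n), dtype=np.int8)
for j, t in enumerate(transactions):
    for i in t:
        D[i, j] = 1

v = D.sum(axis=1)   # v_k: support count of item I_k (row sums)
h = D.sum(axis=0)   # h_j: length |T_j| of transaction T_j (column sums)
```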
2. Produce the frequent 1-itemset L_1 and the frequent 2-itemset support matrix D_1.
The frequent 1-itemset is L_1 = {I_k | v_k ≥ min-sup}.
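Reading L_1 off the row sums is a one-line filter; a minimal sketch with invented row sums:

```python
import numpy as np

min_sup = 2
# Hypothetical row sums v_k of a transaction matrix (invented values).
v = np.array([3, 1, 3, 2])

# L1 keeps every item index whose support count reaches min-sup.
L1 = [k for k in range(len(v)) if v[k] >= min_sup]
```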
3.1 Matrix Compression Procedure
To reduce storage space and computational cost, useless rows and columns are identified and removed in the "matrix compression procedure," which is reused frequently in subsequent steps. Useless rows and columns fall into two classes, so the compression procedure consists of two steps:
(i) A row I_k is useless when the corresponding v_k is less than min-sup; a column T_j is useless when the corresponding h_j is less than 2, according to Property 2. We drop these rows and columns one by one, updating v_k and h_j immediately after each drop, and repeat step (i) until no such row or column remains.
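Step (i) can be sketched as an iterate-until-stable loop; a minimal version assuming a numpy 0/1 matrix (the generic threshold k is 2 here, per Property 2):

```python
import numpy as np

def compress(D, min_sup, k):
    """Matrix compression step (i): repeatedly drop rows whose support count
    is below min_sup and columns whose remaining length is below k
    (Property 2), updating the sums after every drop, until stable.
    Returns the compressed matrix and the surviving row/column indices."""
    rows = np.arange(D.shape[0])
    cols = np.arange(D.shape[1])
    changed = True
    while changed:
        changed = False
        keep_r = D.sum(axis=1) >= min_sup
        if not keep_r.all():
            D, rows, changed = D[keep_r], rows[keep_r], True
        keep_c = D.sum(axis=0) >= k
        if not keep_c.all():
            D, cols, changed = D[:, keep_c], cols[keep_c], True
    return D, rows, cols
```

Note that dropping a column can push a row's sum below min-sup and vice versa, which is why the loop repeats until nothing changes.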
(ii) Now consider the second class of useless rows and store their frequent itemsets for use in the next procedure. Every row I_l whose corresponding v_l is not greater than $\left[\sqrt{n}\right]$ ([x] denotes the largest integer not greater than x) is removed after its frequent itemsets are computed as follows:
Let min-sup = b. For a qualifying item I_l, let S_l = {T_j | d_lj = 1} and let S_l′ be the set of b-combinations of elements of S_l: S_l′ = {(T_{j_1}, T_{j_2}, …, T_{j_b}) | T_{j_1}, T_{j_2}, …, T_{j_b} ∈ S_l}. Each b-tuple (T_{j_1}, T_{j_2}, …, T_{j_b}) in S_l′ is scanned in turn; if there exist items I_{l_1}, I_{l_2}, ⋯, I_{l_k} other than I_l such that d_{l_i j_1} = d_{l_i j_2} = ⋯ = d_{l_i j_b} = 1 (i = 1, 2, ⋯, k), then the collection (I_{l_1}, I_{l_2}, ⋯, I_{l_k}, I_l) is a frequent itemset containing I_l. All the frequent itemsets containing I_l can be obtained by handling every b-tuple in S_l′. After repeating steps (i) and (ii) until no such useless row or column remains in the compressed matrix of D, the frequent 2-itemset support matrix D_1 is produced:
$$ D_1 = \begin{array}{cc} & \begin{array}{cccc} T_{j_1} & T_{j_2} & \cdots & T_{j_q} \end{array}\\ \begin{array}{c} I_{i_1}\\ I_{i_2}\\ \vdots\\ I_{i_p} \end{array} & \left(\begin{array}{cccc} d_{i_1 j_1} & d_{i_1 j_2} & \cdots & d_{i_1 j_q}\\ d_{i_2 j_1} & d_{i_2 j_2} & \cdots & d_{i_2 j_q}\\ \vdots & \vdots & & \vdots\\ d_{i_p j_1} & d_{i_p j_2} & \cdots & d_{i_p j_q} \end{array}\right) \end{array} $$

where 1 ≤ i_1 < i_2 < ⋯ < i_p ≤ m, 1 ≤ j_1 < j_2 < ⋯ < j_q ≤ n.
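Step (ii) above can be sketched with `itertools.combinations`; a minimal version where `D` is a list of 0/1 rows and `l` indexes the low-support row:

```python
from itertools import combinations

def itemsets_containing(D, l, min_sup):
    """Step (ii): for a low-support row l, list the transactions S_l that
    contain I_l, scan every min_sup-combination of S_l, and collect the
    other items present in all transactions of the combination.
    D is a list of 0/1 rows (items) over the transaction columns."""
    n = len(D[0])
    S_l = [j for j in range(n) if D[l][j] == 1]
    found = set()
    for combo in combinations(S_l, min_sup):
        others = tuple(i for i in range(len(D))
                       if i != l and all(D[i][j] == 1 for j in combo))
        if others:  # these items plus I_l occur in >= min_sup transactions
            found.add(tuple(sorted(others + (l,))))
    return found
```

Each returned tuple co-occurs with I_l in at least min-sup transactions, so it is frequent by definition.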
3. Produce the frequent 2-itemset L_2 and the frequent 3-itemset support matrix D_2.
The frequent 2-itemset L_2 is the union of the 2-itemset subsets of the frequent itemsets produced in step (ii) of procedure (2) and a set L_2′ determined by comparing the inner product of each pair of row vectors of matrix D_1 with min-sup.
The matrix D_2′ is obtained by applying the logical "and" operation to the two corresponding row vectors of D_1 for every element (I_{i_h}, I_{i_r}) in L_2′, where 1 ≤ h_1 < h_2 < ⋯ < h_s ≤ p, 1 ≤ r_1 < r_2 < ⋯ < r_t ≤ p, and n_1 denotes the number of columns of matrix D_2′.
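The inner-product test and the row-wise "and" can be sketched together; a minimal version, assuming items are identified by their row index in D_1:

```python
import numpy as np

def build_L2_D2(D1, min_sup):
    """For every pair of rows of D1, the inner product of the two 0/1 row
    vectors is the support count of the corresponding 2-itemset. Frequent
    pairs form L2', and the elementwise 'and' of their rows becomes a row
    of the next support matrix D2'."""
    L2, rows = [], []
    p = D1.shape[0]
    for a in range(p):
        for b in range(a + 1, p):
            if int(D1[a] @ D1[b]) >= min_sup:
                L2.append((a, b))
                rows.append(D1[a] & D1[b])
    if rows:
        D2 = np.array(rows)
    else:
        D2 = np.empty((0, D1.shape[1]), dtype=D1.dtype)
    return L2, D2
```

The inner product works as a support count because each column contributes 1 exactly when both items appear in that transaction.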
(i) Remove rows and columns of D_2′ by the same approach as in step (i) of procedure (2), except that a column T_{j_k} is considered useless when its column sum h_{j_k} is less than 2 or the corresponding transaction length is less than 3, according to Property 2; we drop these columns, update v_k and h_j immediately, and drop every row whose corresponding v_k is less than min-sup. Repeat step (i) until no such row or column remains.
(ii) As in step (ii) of procedure (2), every row (I_{i_s}, I_{i_t}) whose corresponding row sum is not greater than $\left[\sqrt{n_1}\right]$ is removed after its frequent itemsets are found and stored.
Then the frequent 3-itemset support matrix D_2 is produced by repeating the compression steps (i) and (ii) until no more useless row or column can be found. That is,
where \( {j}_1\le {j}_{p_1}<{j}_{p_2}<\cdots <{j}_{p_w}\le {j}_q \), \( \left({I}_{i_{h_{s_y}}},{I}_{i_{r_{t_z}}}\right)\in \left\{\left({I}_{i_{h_m}},{I}_{i_{r_n}}\right)\Big|m=1,2,\cdots, s;n=1,2,\cdots, t\right\} \).
Let \( {L}_2^{{\prime\prime} }=\left\{\left({I}_{i_{h_{s_y}}},{I}_{i_{r_{t_z}}}\right)\right\} \) be the compressed frequent 2-itemset of D 2.
4. Produce the frequent 3-itemset L_3 and the frequent 4-itemset support matrix D_3.
The frequent 3-itemset L_3 is the union of all 3-itemset subsets of the frequent itemsets generated in step (ii) of procedures (2) and (3), and the set {(I_{i_{h_{s_m}}}, I_{i_{r_{t_n}}}, I_{i_{r_{t_k}}}) | (I_{i_{h_{s_m}}}, I_{i_{r_{t_n}}}), (I_{i_{h_{s_m}}}, I_{i_{r_{t_k}}}), (I_{i_{r_{t_n}}}, I_{i_{r_{t_k}}}) ∈ L_2″, and the inner product of the corresponding row vectors of (I_{i_{h_{s_m}}}, I_{i_{r_{t_n}}}) and (I_{i_{h_{s_m}}}, I_{i_{r_{t_k}}}) in D_2 is not less than min-sup}.
As in the previous steps, the intermediate matrix D_3′ is produced by applying the "and" operation to the corresponding row vectors of (I_{i_{h_{s_m}}}, I_{i_{r_{t_n}}}) and (I_{i_{h_{s_m}}}, I_{i_{r_{t_k}}}) in D_2, which are derived from the element (I_{i_{h_{s_m}}}, I_{i_{r_{t_n}}}, I_{i_{r_{t_k}}}) in L_3.
Here n_2 denotes the number of columns of matrix D_3′.
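The join condition above can be sketched directly; a minimal version where `L2` holds the compressed frequent pairs (as index tuples) and `rows2[i]` is the 0/1 row of pair `L2[i]`:

```python
import numpy as np

def candidate_3_itemsets(L2, rows2, min_sup):
    """Join step for 3-itemsets: two frequent pairs (a, b) and (a, c)
    sharing item a yield candidate (a, b, c) when (b, c) is also frequent
    (Property 1) and the inner product of the two parent rows reaches
    min_sup; the 'and' of the parent rows is the candidate's support row."""
    index = {pair: i for i, pair in enumerate(L2)}
    cands, rows3 = [], []
    for (a, b) in L2:
        for (a2, c) in L2:
            if a2 == a and b < c and (b, c) in index:
                r1 = rows2[index[(a, b)]]
                r2 = rows2[index[(a, c)]]
                if int(r1 @ r2) >= min_sup:
                    cands.append((a, b, c))
                    rows3.append(r1 & r2)
    if rows3:
        D3 = np.array(rows3)
    else:
        D3 = np.empty((0, rows2.shape[1]), dtype=rows2.dtype)
    return cands, D3
```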
(i) Remove rows and columns by the same approach as in step (i) of procedures (2) and (3), then execute the following step.
(ii) When the row sum of a row (I_{i_{h_{s_m}}}, I_{i_{r_{t_n}}}, I_{i_{r_{t_k}}}) is not greater than $\left[\sqrt{n_2}\right]$, we find and store the frequent itemsets containing the items (I_{i_{h_{s_m}}}, I_{i_{r_{t_n}}}, I_{i_{r_{t_k}}}) by the same approach as in step (ii) of procedures (2) and (3), and then remove that row. The matrix compression procedure is repeated until no more useless row or column can be found.
5. Analogously, the frequent 4-itemset, …, the frequent k-itemset are produced by repeating the above steps until the frequent k-itemset support matrix D_k is empty.
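The overall level-wise flow can be summarized in one brute-force sketch. This is a simplified illustration of the general idea (rows represent itemsets, "and"-ing rows gives candidate support rows); the paper's compression and low-support-row strategies are deliberately omitted:

```python
import numpy as np
from itertools import combinations

def mine_frequent(D, min_sup):
    """Level-wise sketch: each level maps a frequent k-itemset to its 0/1
    transaction row; 'and'-ing the rows of two (k-1)-subsets yields a
    candidate's row, whose sum is its support count. Brute-force version
    for illustration only; the compression steps are not reproduced."""
    m, _ = D.shape
    frequent = {}
    level = {(i,): D[i] for i in range(m) if D[i].sum() >= min_sup}
    while level:
        frequent.update({s: int(r.sum()) for s, r in level.items()})
        items = sorted({i for s in level for i in s})
        k = len(next(iter(level))) + 1
        nxt = {}
        for cand in combinations(items, k):
            subs = list(combinations(cand, k - 1))
            if all(s in level for s in subs):  # Property 1 pruning
                # the first two (k-1)-subsets together cover all k items
                row = level[subs[0]] & level[subs[1]]
                if row.sum() >= min_sup:
                    nxt[cand] = row
        level = nxt
    return frequent
```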
4 An Example Study of the Algorithm
Suppose that a transaction database is given as in Table 33.1, with min-sup = 2.
(1) Generate the transaction matrix, and compute the row sums v_k and the column sums h_j, as listed in Table 33.2.
(2) Produce the frequent 1-itemset: L_1 = {I_k | v_k ≥ 2} = {I_1, I_2, I_3, I_4, I_5, I_6, I_7, I_8, I_9, I_10}.
(i) Column T_5 is dropped since its sum is less than 2 (h < 2). After updating the row sums, the row of I_7 is removed since its v < 2. Recalculating the column sums, column T_9 is removed accordingly. The resulting compressed matrix is shown in Table 33.3.
(ii) Because $\left[\sqrt{n}\right]=\left[\sqrt{9}\right]=3$ and the corresponding v of I_1, I_6, I_8, I_9 is not greater than 3, we find all the frequent itemsets containing the items I_l (l = 1, 6, 8, 9) and then remove those rows.
With min-sup = 2, we first find the frequent itemsets containing I_8, since v_8 = 2. Here S_8 = {T_j | d_8j = 1} = {T_1, T_4}, and the set of 2-combinations of S_8 is S_8′ = {(T_1, T_4)}. I_1 and I_3 are the only rows whose entries in both columns T_1 and T_4 are 1, so (I_1 I_3 I_8) is the only frequent itemset containing I_8; we store (I_1 I_3 I_8) and drop row I_8. Next, consider I_1: S_1 = {T_j | d_1j = 1} = {T_1, T_2, T_4}, and the 2-combinations of S_1 are S_1′ = {(T_1, T_2), (T_1, T_4), (T_2, T_4)}. The frequent itemsets containing I_1 are obtained by handling these three 2-tuples in turn. The tuple (T_1, T_2) determines the frequent itemset (I_1 I_3), by the same approach used for I_8. Similarly, (T_1, T_4) determines (I_1 I_3), and (T_2, T_4) determines (I_1 I_3 I_4 I_5). Hence all the frequent itemsets containing I_1 are (I_1 I_3) and (I_1 I_3 I_4 I_5). Scanning the other qualifying items in the same way, all the frequent itemsets containing I_l (l = 1, 6, 8, 9) are found: L_1′ = {(I_1 I_3 I_8), (I_1 I_3 I_4 I_5), (I_6 I_2 I_3), (I_9 I_2 I_6), (I_9 I_2 I_3)}. After removing rows I_l (l = 1, 6, 8, 9), the newly compressed matrix is shown in Table 33.4.
Column T_7 is dropped since its sum is less than 2 (h < 2), and the row sums are recalculated. The support matrix of the frequent 2-itemset is then listed in Table 33.5. Checking every row and column again, no useless element remains; in other words, the support matrix in Table 33.5 is fully compressed.
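The I_8 computation in this walkthrough can be replayed in code. Only the matrix entries the text states explicitly are reconstructed here (the rest of Table 33.1 is not reproduced), so this is a partial, illustrative slice of the example:

```python
from itertools import combinations

# Rows I1, I3, I4, I5, I8 over columns T1, T2, T4, reconstructed from the
# entries this walkthrough states explicitly.
D = {
    "I1": {"T1": 1, "T2": 1, "T4": 1},
    "I3": {"T1": 1, "T2": 1, "T4": 1},
    "I4": {"T1": 0, "T2": 1, "T4": 1},
    "I5": {"T1": 0, "T2": 1, "T4": 1},
    "I8": {"T1": 1, "T2": 0, "T4": 1},
}
min_sup = 2

# Step (ii) for the low-support row I8: S_8 lists its transactions, and
# every min_sup-combination of S_8 is scanned for items that co-occur in
# all transactions of the combination.
S8 = [t for t, bit in D["I8"].items() if bit == 1]
found = set()
for combo in combinations(S8, min_sup):
    others = tuple(i for i in D if i != "I8" and all(D[i][t] == 1 for t in combo))
    found.add(others + ("I8",))
```

Running this yields the single itemset (I_1, I_3, I_8), matching the walkthrough above.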
(3) The frequent 2-itemset L_2 is the union of the 2-itemset subsets of L_1′ produced in step (ii) of procedure (2) and a set L_2′ obtained from the support matrix in Table 33.5.
That is L 2 = {(I 1 I 3),(I 1 I 8),(I 3 I 8),(I 1 I 4),(I 1 I 5),(I 3 I 4),(I 4 I 5),(I 3 I 5),(I 2 I 3),(I 2 I 6),(I 6 I 3),(I 9 I 2),(I 9 I 3),(I 9 I 6)}.
Subsequently, the uncompressed support matrix of the frequent 3-itemset is constructed as listed in Table 33.6.
First, column T_1 is removed since its sum is less than 2. Here n_1 = 6, so $\left[\sqrt{n_1}\right]=\left[\sqrt{6}\right]=2$. Second, because the corresponding v of (I_2 I_4) and (I_2 I_5) equals 2, we work out all the frequent itemsets containing (I_2 I_4) or (I_2 I_5), namely L_3′ = {(I_2 I_3 I_4), (I_2 I_3 I_5)}, and drop those rows, giving Table 33.7.
(4) Produce the frequent 3-itemset.
A frequent 3-itemset (I_3 I_4 I_5) is obtained from Table 33.7. The frequent 3-itemset is therefore L_3 = {(I_3 I_4 I_5)} ∪ {(I_1 I_3 I_8), (I_1 I_3 I_4), (I_1 I_3 I_5), (I_1 I_4 I_5), (I_3 I_4 I_5), (I_9 I_2 I_3), (I_9 I_2 I_6), (I_6 I_2 I_3)} from L_1′ ∪ {(I_2 I_3 I_4), (I_2 I_3 I_5)} from L_3′.
(5) A frequent 4-itemset (I_1 I_3 I_4 I_5) is produced from L_1′. Since no further frequent 4-itemset can be found in Table 33.7, the algorithm terminates.
5 Conclusion
A matrix-based association rule mining algorithm can discover all the frequent itemsets by scanning the database only once and generating the frequent itemsets directly rather than generating candidate itemsets, which is more efficient. Many researchers have done a great deal of work on this approach. Here, a new matrix-based algorithm for generating association rules has been proposed; it compresses the transaction matrix efficiently by integrating several strategies and achieves better performance than the known matrix-based algorithms. Further strategies for compressing the transaction matrix are worthy of future research.
References
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 207–216). Washington, DC.
Lv, T. X., & Liu, P. Y. (2011). Algorithm for generating strong association rules based on matrix. Application Research of Computers, 28(4), 1301–1303.
Cao, F. H. (2012). Improved association rule mining algorithm based on two matrixes. Electronic Science and Technology, 25(5), 126–128.
Xu, H. Z. (2012). The research of association rules data mining algorithms. Science Technology and Engineering, 12(1), 60–63.
He, B., & Xue, F. (2012). An improved algorithm for mining association rules. Computer Knowledge and Technology, 8(5), 1015–1017.
Acknowledgment
This work is financially supported by the Natural Science Foundation of the Jiangxi Province of China under Grant No. 20122BAB201004.
© 2014 Springer International Publishing Switzerland
Shu, S. (2014). A New Association Rule Mining Algorithm Based on Compression Matrix. In: Wong, W.E., Zhu, T. (eds) Computer Engineering and Networking. Lecture Notes in Electrical Engineering, vol 277. Springer, Cham. https://doi.org/10.1007/978-3-319-01766-2_33