1 Introduction

Association rule mining is one of the most important and well-researched techniques of data mining. It was first introduced by Agrawal et al. [1], who presented the well-known Apriori algorithm in 1993. Since then, many methods have been proposed to improve and optimize the Apriori algorithm, such as binary coding techniques, genetic algorithms, and matrix-based algorithms [2, 3]. A matrix-based algorithm scans the database only once to convert the transactions into a matrix; the matrix can then be reordered by item support count in non-descending order to reduce the number of candidate itemsets, which greatly improves the time and space efficiency of the Apriori algorithm.

A great deal of work has been done on matrix-based Apriori algorithms [4, 5]. In this chapter, a new improvement of the Apriori algorithm based on a compression matrix is proposed, which achieves better performance.

2 Preliminaries

Some basic preliminaries used in association rule mining are introduced in this section. Let T = {T_1, T_2, ⋯, T_n} be a database of transactions, where T_k (k = 1, 2, ⋯, n) denotes a transaction. Let I = {I_1, I_2, ⋯, I_m} be a set of binary attributes, called items, where I_k (k = 1, 2, ⋯, m) denotes an item. Each transaction T_k in T contains a subset of the items in I. The number of items contained in T_k is called the length of transaction T_k, denoted |T_k|.

An association rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The support of the association rule X ⇒ Y is the support (frequency) of the itemset X ∪ Y. If the support of an itemset X is greater than or equal to a user-specified minimum support threshold (min-sup), then X is called a frequent itemset.

In association rule mining, the frequent itemsets are found first, and the association rules are then derived from them. The key step of association rule mining is therefore finding the frequent itemsets. Some properties of frequent itemsets are given as follows:

Property 1 [1]

Every nonempty subset of a frequent itemset is also a frequent itemset.

By the definition of frequent k-itemset, the conclusion below is easily obtained.

Property 2

If the length |T_i| of a transaction T_i is less than k, then T_i is useless for generating frequent k-itemsets.

3 An Improvement on Apriori Algorithm Based on Compression Matrix

A new improvement on the Apriori algorithm based on a compression matrix is introduced. The process of the new algorithm is described as follows:

  1. Generate the transaction matrix.

For a given database with n transactions and m items, the m × n transaction matrix D = (d_ij) is constructed, in which d_ij is set to 1 if item I_i is contained in transaction T_j and to 0 otherwise.

$$ D=\begin{array}{cc} & \begin{array}{cccc} T_1 & T_2 & \cdots & T_n \end{array} \\ \begin{array}{c} I_1 \\ I_2 \\ \vdots \\ I_m \end{array} & \left(\begin{array}{cccc} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{mn} \end{array}\right) \end{array} $$

where \( d_{ij}=\left\{\begin{array}{ll}1, & I_i\in T_j\\ 0, & I_i\notin T_j\end{array}\right. \), i = 1, 2, ⋯, m; j = 1, 2, ⋯, n.

For each item I_k and each transaction T_j, define the row sum \( v_k=\sum_{j=1}^n d_{kj} \), k = 1, 2, ⋯, m, and the column sum \( h_j=\sum_{i=1}^m d_{ij} \), j = 1, 2, ⋯, n; that is, v_k is the support count of item I_k and h_j is the length |T_j| of transaction T_j.
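
As a minimal illustration of this step, the following Python sketch builds the binary transaction matrix and the two families of sums; the toy database and all variable names are assumptions made for the sketch, not taken from the chapter.

```python
import numpy as np

# Hypothetical toy database: each transaction is the set of item numbers it contains.
transactions = [{1, 3, 8}, {1, 3, 4, 5}, {2, 3, 6}, {1, 3, 4, 5, 8}]
m = 8                      # number of items
n = len(transactions)      # number of transactions

# m x n binary transaction matrix: D[i, j] = 1 iff item I_(i+1) occurs in T_(j+1).
D = np.zeros((m, n), dtype=int)
for j, t in enumerate(transactions):
    for i in t:
        D[i - 1, j] = 1

v = D.sum(axis=1)          # v_k: support count of item I_k (row sums)
h = D.sum(axis=0)          # h_j: length |T_j| of transaction T_j (column sums)
```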

  2. Produce the frequent 1-itemset L_1 and the frequent 2-itemset support matrix D_1.

The frequent 1-itemset is L_1 = {I_k | v_k ≥ min-sup}.
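
Selecting the frequent 1-itemset is a single thresholding pass over the row sums; in the small sketch below the values of v and min_sup are illustrative only, not taken from the chapter's tables.

```python
import numpy as np

min_sup = 2                                  # assumed user-specified threshold
v = np.array([3, 1, 4, 2])                   # illustrative row sums v_k for four items
L1 = [k + 1 for k in np.flatnonzero(v >= min_sup)]
print(L1)                                    # [1, 3, 4] -> the frequent 1-itemset L_1
```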

3.1 Matrix Compression Procedure

In order to reduce the storage space and computation cost, useless rows and columns are identified and removed in the “matrix compression procedure”, which is reused repeatedly in the subsequent steps. Useless rows and columns fall into two classes, so the compression procedure consists of two steps:

  (i) A row I_k is considered useless when the corresponding v_k is less than min-sup; a column T_j is considered useless when the corresponding h_j is less than 2, according to Property 2. These rows and columns are dropped one by one, and v_k and h_j are updated immediately after each drop operation. Step (i) is repeated until no such row or column remains.

  (ii) The second class of useless rows is now handled, and their frequent itemsets are stored for use in the next procedure. Every row I_l whose corresponding v_l is less than \( \left[\sqrt{n}\right] \) ([x] denotes the largest integer not greater than x) is removed after its frequent itemsets are computed as follows.

    Let min-sup = b. For such an item I_l, let S_l = {T_j | d_lj = 1} and let S_l′ be the set of b-combinations of the elements of S_l: \( S_l^{\prime}=\left\{\left(T_{j_1},T_{j_2},\dots,T_{j_b}\right)\;\middle|\;T_{j_1},T_{j_2},\dots,T_{j_b}\in S_l\right\} \). Each b-tuple \( \left(T_{j_1},T_{j_2},\dots,T_{j_b}\right) \) in S_l′ is scanned in turn; if there exist items \( I_{l_1},I_{l_2},\cdots,I_{l_k} \) other than I_l such that \( d_{l_i j_1}=d_{l_i j_2}=\dots=d_{l_i j_b}=1 \) (i = 1, 2, ⋯, k), then the collection \( \left(I_{l_1},I_{l_2},\cdots,I_{l_k},I_l\right) \) is a frequent itemset containing I_l. All the frequent itemsets containing I_l are obtained by handling every b-tuple of S_l′ in this way (a Python sketch of both compression steps is given below, after the matrix D_1). After steps (i) and (ii) are repeated until no useless row or column remains in the compressed matrix of D, the frequent 2-itemset support matrix D_1 is produced:

    $$ D_1=\begin{array}{cc} & \begin{array}{cccc} T_{j_1} & T_{j_2} & \cdots & T_{j_q} \end{array} \\ \begin{array}{c} I_{i_1} \\ I_{i_2} \\ \vdots \\ I_{i_p} \end{array} & \left(\begin{array}{cccc} d_{i_1 j_1} & d_{i_1 j_2} & \cdots & d_{i_1 j_q} \\ d_{i_2 j_1} & d_{i_2 j_2} & \cdots & d_{i_2 j_q} \\ \vdots & \vdots & & \vdots \\ d_{i_p j_1} & d_{i_p j_2} & \cdots & d_{i_p j_q} \end{array}\right) \end{array} $$

    where 1 ≤ i_1 < i_2 < ⋯ < i_p ≤ m and 1 ≤ j_1 < j_2 < ⋯ < j_q ≤ n.
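
The following sketch shows one possible reading of the two compression steps, assuming the matrix is kept as a NumPy 0/1 array; the function names are hypothetical, and step (i) batches the removals per pass rather than dropping strictly one row or column at a time.

```python
import numpy as np
from itertools import combinations

def compress_step_i(D, item_ids, trans_ids, min_sup, min_len):
    """Step (i): repeatedly drop rows whose support count is below min_sup and
    columns whose length is below min_len, recomputing the sums after each pass,
    until nothing more can be removed."""
    changed = True
    while changed:
        changed = False
        keep_rows = D.sum(axis=1) >= min_sup          # v_k >= min-sup
        if not keep_rows.all():
            D = D[keep_rows]
            item_ids = [x for x, k in zip(item_ids, keep_rows) if k]
            changed = True
        keep_cols = D.sum(axis=0) >= min_len          # h_j >= 2 for this stage
        if not keep_cols.all():
            D = D[:, keep_cols]
            trans_ids = [x for x, k in zip(trans_ids, keep_cols) if k]
            changed = True
    return D, item_ids, trans_ids

def itemsets_containing(D, item_ids, row, min_sup):
    """Step (ii): for a low-support row `row`, enumerate the min_sup-combinations
    of its supporting columns (S_l') and, for each, collect every other item that
    is present in all of the chosen transactions."""
    support_cols = [j for j in range(D.shape[1]) if D[row, j] == 1]   # S_l
    found = set()
    for cols in combinations(support_cols, min_sup):                  # b-tuples of S_l'
        present = np.flatnonzero(D[:, list(cols)].all(axis=1))        # items with 1 in every chosen column
        others = [item_ids[i] for i in present if i != row]
        if others:
            found.add(tuple(sorted(others + [item_ids[row]])))
    return found
```

In the chapter's procedure the two steps alternate until nothing more can be removed, and the itemsets harvested in step (ii) are kept for assembling L_2 later.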

  3. Produce the frequent 2-itemset L_2 and the frequent 3-itemset support matrix D_2.

The frequent 2-itemset L_2 is the union of the 2-itemset subsets of the frequent itemsets produced in step (ii) of procedure (2) and the set L_2′ determined by comparing the inner product of each pair of row vectors of matrix D_1 with min-sup:

$$ L_2^{\prime }=\left\{\left(I_{i_h},I_{i_r}\right)\;\middle|\;\sum_{k=1}^q d_{i_h j_k} d_{i_r j_k}\ge \min \hbox{-} \sup,\; h<r,\; h,r=1,2,\cdots,p\right\}. $$

The matrix D_2′ is obtained by applying the “and” operation to the two corresponding row vectors of every element \( \left(I_{i_h},I_{i_r}\right) \) in L_2′, that is:

$$ D_2^{\prime }=\begin{array}{cc} & \begin{array}{cccc} T_{j_1} & T_{j_2} & \cdots & T_{j_q} \end{array} \\ \begin{array}{c} \left(I_{i_{h_1}},I_{i_{r_1}}\right) \\ \left(I_{i_{h_1}},I_{i_{r_2}}\right) \\ \vdots \\ \left(I_{i_{h_s}},I_{i_{r_t}}\right) \end{array} & \left(\begin{array}{cccc} d_{i_{h_1}j_1}d_{i_{r_1}j_1} & d_{i_{h_1}j_2}d_{i_{r_1}j_2} & \cdots & d_{i_{h_1}j_q}d_{i_{r_1}j_q} \\ d_{i_{h_1}j_1}d_{i_{r_2}j_1} & d_{i_{h_1}j_2}d_{i_{r_2}j_2} & \cdots & d_{i_{h_1}j_q}d_{i_{r_2}j_q} \\ \vdots & \vdots & & \vdots \\ d_{i_{h_s}j_1}d_{i_{r_t}j_1} & d_{i_{h_s}j_2}d_{i_{r_t}j_2} & \cdots & d_{i_{h_s}j_q}d_{i_{r_t}j_q} \end{array}\right) \end{array} $$

where 1 ≤ h_1 < h_2 < ⋯ < h_s ≤ p, 1 ≤ r_1 < r_2 < ⋯ < r_t ≤ p, and n_1 denotes the number of rows of matrix D_2′.
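
A sketch of this pairing step, under the assumption that D1 is the compressed 0/1 matrix whose rows correspond to item_ids; the function name and return convention are illustrative.

```python
import numpy as np

def pairwise_frequent(D1, item_ids, min_sup):
    """Form L_2' and D_2': keep every pair of rows of D1 whose inner product
    reaches min_sup, together with the elementwise AND of the two rows."""
    pairs, rows = [], []
    p = D1.shape[0]
    for a in range(p):
        for b in range(a + 1, p):
            joint = D1[a] * D1[b]              # AND of two 0/1 row vectors
            if joint.sum() >= min_sup:         # inner product of the two rows
                pairs.append((item_ids[a], item_ids[b]))
                rows.append(joint)
    D2_prime = np.array(rows) if rows else np.empty((0, D1.shape[1]), dtype=int)
    return pairs, D2_prime
```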

  (i) Rows and columns of D_2′ are removed using the same approach as in step (i) of procedure (2), except that a column \( T_{j_k} \) is now considered useless when \( h_{j_k}<2 \), i.e., when its length is less than 3 according to Property 2; these columns are dropped, v_k and h_j are updated immediately, and the rows whose corresponding v_k is less than min-sup are dropped as well. This step is repeated until no such row or column remains.

  (ii) Similarly to step (ii) of procedure (2), every row \( \left(I_{i_s},I_{i_t}\right) \) whose corresponding row sum is less than \( \left[\sqrt{n_1}\right] \) is removed after its frequent itemsets have been found and stored.

Then the frequent 3-itemset support matrix D_2 is produced by repeating the matrix compression steps (i) and (ii) until no more useless row or column can be found. That is,

$$ D_2=\begin{array}{cc} & \begin{array}{cccc} T_{j_{p_1}} & T_{j_{p_2}} & \cdots & T_{j_{p_w}} \end{array} \\ \begin{array}{c} \left(I_{i_{h_{s_1}}},I_{i_{r_{t_1}}}\right) \\ \left(I_{i_{h_{s_1}}},I_{i_{r_{t_2}}}\right) \\ \vdots \\ \left(I_{i_{h_{s_u}}},I_{i_{r_{t_v}}}\right) \end{array} & \left(\begin{array}{cccc} d_{i_{h_{s_1}}j_{p_1}}d_{i_{r_{t_1}}j_{p_1}} & d_{i_{h_{s_1}}j_{p_2}}d_{i_{r_{t_1}}j_{p_2}} & \cdots & d_{i_{h_{s_1}}j_{p_w}}d_{i_{r_{t_1}}j_{p_w}} \\ d_{i_{h_{s_1}}j_{p_1}}d_{i_{r_{t_2}}j_{p_1}} & d_{i_{h_{s_1}}j_{p_2}}d_{i_{r_{t_2}}j_{p_2}} & \cdots & d_{i_{h_{s_1}}j_{p_w}}d_{i_{r_{t_2}}j_{p_w}} \\ \vdots & \vdots & & \vdots \\ d_{i_{h_{s_u}}j_{p_1}}d_{i_{r_{t_v}}j_{p_1}} & d_{i_{h_{s_u}}j_{p_2}}d_{i_{r_{t_v}}j_{p_2}} & \cdots & d_{i_{h_{s_u}}j_{p_w}}d_{i_{r_{t_v}}j_{p_w}} \end{array}\right) \end{array} $$

where \( {j}_1\le {j}_{p_1}<{j}_{p_2}<\cdots <{j}_{p_w}\le {j}_q \), \( \left({I}_{i_{h_{s_y}}},{I}_{i_{r_{t_z}}}\right)\in \left\{\left({I}_{i_{h_m}},{I}_{i_{r_n}}\right)\Big|m=1,2,\cdots, s;n=1,2,\cdots, t\right\} \).

Let \( {L}_2^{{\prime\prime} }=\left\{\left({I}_{i_{h_{s_y}}},{I}_{i_{r_{t_z}}}\right)\right\} \) be the compressed frequent 2-itemset of D 2.

  4. Produce the frequent 3-itemset L_3 and the frequent 4-itemset support matrix D_3.

The frequent 3-itemset is the union of all 3-itemset subsets of the frequent itemsets generated in step (ii) of procedures (2) and (3), and the set \( \left\{\left(I_{i_{h_{s_m}}},I_{i_{r_{t_n}}},I_{i_{r_{t_k}}}\right)\;\middle|\;\left(I_{i_{h_{s_m}}},I_{i_{r_{t_n}}}\right),\left(I_{i_{h_{s_m}}},I_{i_{r_{t_k}}}\right),\left(I_{i_{r_{t_n}}},I_{i_{r_{t_k}}}\right)\in L_2^{{\prime\prime}}\right. \) and the inner product of the corresponding row vectors of \( \left(I_{i_{h_{s_m}}},I_{i_{r_{t_n}}}\right) \) and \( \left(I_{i_{h_{s_m}}},I_{i_{r_{t_k}}}\right) \) in D_2 is not less than \( \left.\min \hbox{-} \sup\right\} \).

Similarly to the previous steps, the intermediate matrix D_3′ is produced by applying the “and” operation to the row vectors of D_2 that correspond to \( \left(I_{i_{h_{s_m}}},I_{i_{r_{t_n}}}\right) \) and \( \left(I_{i_{h_{s_m}}},I_{i_{r_{t_k}}}\right) \) in L_2′′, which are derived from the element \( \left(I_{i_{h_{s_m}}},I_{i_{r_{t_n}}},I_{i_{r_{t_k}}}\right) \) in L_3.

Here n_2 denotes the number of rows of matrix D_3′.
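
A sketch of this candidate-generation step, assuming the rows of D2 are indexed by the pairs in pair_ids (i.e., L_2′′, each pair stored with its items in sorted order); the names are illustrative.

```python
import numpy as np

def extend_to_triples(D2, pair_ids, min_sup):
    """Join pairs of L_2'' that share their first item.  A triple (a, b, c) is
    kept when (a, b), (a, c) and (b, c) all belong to L_2'' and the inner product
    of the rows of (a, b) and (a, c) in D2 reaches min_sup; the AND of those two
    rows becomes the corresponding row of D_3'."""
    index = {pair: i for i, pair in enumerate(pair_ids)}
    triples, rows = [], []
    for (a, b), i in index.items():
        for (a2, c), j in index.items():
            if a2 == a and b < c and (b, c) in index:
                joint = D2[i] * D2[j]          # AND of the two row vectors
                if joint.sum() >= min_sup:
                    triples.append((a, b, c))
                    rows.append(joint)
    D3_prime = np.array(rows) if rows else np.empty((0, D2.shape[1]), dtype=int)
    return triples, D3_prime
```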

  (i) Rows and columns of D_3′ are removed using the same approach as in step (i) of procedures (2) and (3), and the following step is then executed.

  (ii) When the row sum corresponding to \( \left(I_{i_{h_{s_m}}},I_{i_{r_{t_n}}},I_{i_{r_{t_k}}}\right) \) is less than or equal to \( \left[\sqrt{n_2}\right] \), the frequent itemsets containing the items \( \left(I_{i_{h_{s_m}}},I_{i_{r_{t_n}}},I_{i_{r_{t_k}}}\right) \) are found and stored by the same approach as in step (ii) of procedures (2) and (3), and the corresponding row is then removed. The matrix compression procedure is repeated until no more useless row or column can be found.

  5. Analogously, the frequent 4-itemset, …, and the frequent k-itemset are produced by repeating steps (2) to (5) until the frequent k-itemset support matrix D_k is empty (a simplified end-to-end sketch of this level-wise iteration is given below).
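
To show the level-wise iteration as a whole, here is a simplified, self-contained sketch that keeps the AND-of-row-vectors idea but omits the matrix compression and the \( \left[\sqrt{n}\right] \)-row harvesting described above; it illustrates the overall flow, not the proposed algorithm itself.

```python
import numpy as np
from itertools import combinations

def matrix_frequent_itemsets(transactions, n_items, min_sup):
    """Level-wise mining on the binary transaction matrix: every frequent
    k-itemset keeps the AND of its item rows, and candidates are formed by
    joining itemsets that share a (k-1)-prefix."""
    n = len(transactions)
    D = np.zeros((n_items, n), dtype=int)
    for j, t in enumerate(transactions):
        for i in t:
            D[i - 1, j] = 1                      # items are numbered 1..n_items

    level = {(i + 1,): D[i] for i in range(n_items) if D[i].sum() >= min_sup}
    frequent = dict(level)
    while level:
        nxt = {}
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:                 # join on the common (k-1)-prefix
                joint = level[a] * level[b]      # AND of the two row vectors
                if joint.sum() >= min_sup:
                    nxt[a + (b[-1],)] = joint
        frequent.update(nxt)
        level = nxt
    return {items: int(row.sum()) for items, row in frequent.items()}

# A tiny check: with min_sup = 2, the 2-itemset (1, 3) is frequent here.
print(matrix_frequent_itemsets([{1, 3}, {1, 3, 4}, {2, 4}], n_items=4, min_sup=2))
```

Because the support of every candidate is recomputed exactly from the AND of row vectors, the omitted pruning steps only affect efficiency, not the result.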

4 An Example of the Algorithm

Suppose that a transaction database is given as listed in Table 33.1 and that min-sup = 2.

Table 33.1 Transaction database
  (1) Generate the transaction matrix, and calculate the sum v_k of each row and the sum h_j of each column, as described in Table 33.2.

    Table 33.2 Transaction matrix
  (2) Produce the frequent 1-itemset: L_1 = {I_k | v_k ≥ 2} = {I_1, I_2, I_3, I_4, I_5, I_6, I_7, I_8, I_9, I_10}.

  (i) The column of T_5 is dropped first, since its sum is less than 2 (h_5 < 2). After updating the row sums, the row of I_7 is removed because its v_7 < 2. The column sums are then recalculated and the column of T_9 is removed accordingly. The resulting compressed matrix is shown in Table 33.3.

    Table 33.3 Compression matrix 1
  (ii) Because \( \left[\sqrt{n}\right]=\left[\sqrt{9}\right]=3 \) and the corresponding row sums of I_1, I_6, I_8, and I_9 are less than or equal to 3, all the frequent itemsets containing the items I_l (l = 1, 6, 8, 9) must be found, and the rows I_l (l = 1, 6, 8, 9) are then removed.

With min-sup = 2, the frequent itemsets containing I_8 are found first, since v_8 = 2. Here S_8 = {T_1, T_4} (the transactions with d_8j = 1), and the set of 2-combinations of the elements of S_8 is S_8′ = {(T_1, T_4)}. It is obvious that I_1 and I_3 are the rows whose entries in columns T_1 and T_4 are both 1, so (I_1 I_3 I_8) is the only frequent itemset containing I_8; thus (I_1 I_3 I_8) is stored and row I_8 is dropped. The item I_1 is considered next: S_1 = {T_1, T_2, T_4} and the set of 2-combinations of the elements of S_1 is S_1′ = {(T_1, T_2), (T_1, T_4), (T_2, T_4)}. The frequent itemsets containing I_1 are obtained by handling the three 2-tuples in S_1′ in turn. The collection (I_1 I_3) is the frequent itemset determined by (T_1, T_2), using the same approach as for I_8; similarly, (T_1, T_4) determines (I_1 I_3) and (T_2, T_4) determines (I_1 I_3 I_4 I_5). Hence all the frequent itemsets containing I_1 are (I_1 I_3) and (I_1 I_3 I_4 I_5). Scanning the remaining qualifying items in the same way, all the frequent itemsets containing the items I_l (l = 1, 6, 8, 9) are found: L_1′ = {(I_1 I_3 I_8), (I_1 I_3 I_4 I_5), (I_6 I_2 I_3), (I_9 I_2 I_6), (I_9 I_2 I_3)} (a small numerical check of the I_8 case is given after Table 33.4). After removing the rows I_l (l = 1, 6, 8, 9), the newly compressed matrix is shown in Table 33.4.

Table 33.4 Compression matrix 2
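
As a quick numerical check of the I_8 case above, the fragment of Table 33.3 that the text describes (only the rows I_1, I_3, I_4, I_5, I_8 and the columns T_1, T_2, T_4; the 0/1 values are reconstructed from the description, not copied from the original tables) can be fed to the step (ii) enumeration:

```python
import numpy as np
from itertools import combinations

# Fragment of Compression matrix 1 reconstructed from the text:
# rows I_1, I_3, I_4, I_5, I_8 restricted to the columns T_1, T_2, T_4.
item_ids = [1, 3, 4, 5, 8]
D = np.array([[1, 1, 1],    # I_1
              [1, 1, 1],    # I_3
              [0, 1, 1],    # I_4
              [0, 1, 1],    # I_5
              [1, 0, 1]])   # I_8

min_sup = 2
row = item_ids.index(8)                                       # handle I_8 first (v_8 = 2)
support_cols = [j for j in range(D.shape[1]) if D[row, j]]    # S_8 = {T_1, T_4}
found = set()
for cols in combinations(support_cols, min_sup):              # S_8' = {(T_1, T_4)}
    present = np.flatnonzero(D[:, list(cols)].all(axis=1))    # rows that are 1 in both columns
    others = [item_ids[i] for i in present if i != row]
    if others:
        found.add(tuple(sorted(others + [item_ids[row]])))
print(found)                                                  # {(1, 3, 8)} -> (I_1 I_3 I_8)
```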

The column of T_7 is dropped since its sum is less than 2 (h_7 < 2), and the row sums v_k are recalculated. The support matrix of the frequent 2-itemset is then as listed in Table 33.5. Checking each row and column again, no useless element remains; in other words, the support matrix in Table 33.5 is fully compressed.

Table 33.5 Support matrix of the frequent 2-itemset
  (3) The frequent 2-itemset L_2 is the union of the 2-itemset subsets produced by L_1′ in step (ii) of procedure (2) and the set L_2′ obtained from the support matrix in Table 33.5:

$$ L_2^{\prime }=\left\{\left(I_i,I_j\right)\;\middle|\;\sum_{k\in \left\{1,2,3,4,6,8\right\}} d_{ik}d_{jk}\ge 2,\; i<j,\; i,j=2,3,4,5,10\right\}=\left\{\left(I_2,I_3\right),\left(I_2,I_4\right),\left(I_2,I_5\right),\left(I_3,I_4\right),\left(I_3,I_5\right),\left(I_4,I_5\right)\right\}. $$

That is, L_2 = {(I_1 I_3), (I_1 I_8), (I_3 I_8), (I_1 I_4), (I_1 I_5), (I_3 I_4), (I_4 I_5), (I_3 I_5), (I_2 I_3), (I_2 I_4), (I_2 I_5), (I_2 I_6), (I_6 I_3), (I_9 I_2), (I_9 I_3), (I_9 I_6)}.

Subsequently, the uncompressed support matrix of the frequent 3-itemset is constructed as listed in Table 33.6.

Table 33.6 Uncompressed support matrix of the frequent 3-itemset

First, the column of T_1 is removed, since its column sum is less than 2 (h_1 < 2). Here n_1 = 6 and \( \left[\sqrt{n_1}\right]=\left[\sqrt{6}\right]=2 \). Second, because the corresponding row sums of (I_2 I_4) and (I_2 I_5) are equal to 2, all the frequent itemsets containing (I_2 I_4) or (I_2 I_5) are worked out, L_3′ = {(I_2 I_3 I_4), (I_2 I_3 I_5)}, and the corresponding rows are dropped. The result is Table 33.7.

Table 33.7 Support matrix of the frequent 3-itemset
  (4) Produce the frequent 3-itemset.

A frequent 3-itemset (I_3 I_4 I_5) is obtained from Table 33.7. The frequent 3-itemset is therefore L_3 = {(I_3 I_4 I_5)} ∪ {(I_1 I_3 I_8), (I_1 I_3 I_4), (I_1 I_3 I_5), (I_1 I_4 I_5), (I_3 I_4 I_5), (I_9 I_2 I_3), (I_9 I_2 I_6), (I_6 I_2 I_3)} (the 3-itemset subsets of L_1′) ∪ {(I_2 I_3 I_4), (I_2 I_3 I_5)} (from L_3′).

  (5) Produce a frequent 4-itemset from (I_1 I_3 I_4 I_5) in L_1′. Since no further frequent 4-itemset can be found from Table 33.7, the algorithm terminates.

5 Conclusion

A matrix-based association rule mining algorithm can discover all the frequent itemsets by scanning the database only once; it does not generate candidate itemsets but produces the frequent itemsets directly, which makes it more efficient. Many researchers have done a great deal of work in this area. In this chapter, a new matrix-based algorithm for generating association rules has been proposed. It compresses the transaction matrix efficiently by integrating several strategies and achieves better performance than the known matrix-based algorithms. New strategies for compressing the transaction matrix are worthy of further research.