Keywords

1 Introduction

For more than two decades, most of the researches are for mining frequent itemsets with the weights/significance of all items are the same (equal weight, as equal to 1), the algorithmic approaches based on Apriori [1] and FP-Tree [2]. In addition, to speed up the execution of mining frequent itemsets, Phan et al. proposed NOV-FI [3] algorithm based on the Kernel_COOC array. Besides, rare itemsets mining is an important task for potential applications such as the detection of computer attacks, fraudulent transactions in financial institutions, bioinformatics and medicine. Algorithms such as Apriori-Inverse [4] and Rarity [5] implement an Apriori-like approach. Thereafter to speed up the execution of mining minimal rare itemsets, Szathmary et al. proposed Walky-G [6] algorithm based on the IT-Tree structure. But in real-world applications, items can have different significance/importance in databases, and such databases are called weighted databases. Most algorithms for frequent weighted/significance itemsets mining are based on satisfying the downward closure property such as algorithms [7,8,9]. However, Huai et al. [10] proposed Apriori-like algorithms based on approach NOT satisfy the downward closure property (very rare proposed algorithms following this approach). This is a great challenge.

In this paper, we propose a novel algorithm called NOV-RSI for mining rare significance itemsets based on NOT satisfying the downward closure property. Furthermore, the proposed algorithm is easily expanded on parallel computing systems. The paper has algorithms as follows:

  • Algorithm 1: Computing Kernel_LOOC array of co-occurrences/occurrences of kernel item in at least one transaction;

  • Algorithm 2: Building list nLOOC_Tree based on Kernel_COOC array;

  • Algorithm 3: NOV-RSI algorithm mining all rare significance itemset based on list of nLOOC-Tree.

This paper is organized as follows: in Sect. 2, we describe the basic concepts for mining frequent itemsets, rare itemsets (the weights/significance of all items are the same or different) and data structure for datasets. Some theoretical aspects of our approach relies, are given in Sect. 3. Besides, we describe our NOV-RSI algorithm to mine rare significance itemsets based on Algorithm 1 and Algorithm 2. Details of implementation and experiment are discussed in Sect. 4. Finally, we conclude with a summary of our approach, perspectives and extensions of this future work.

2 Background

In this section, we present the basic concepts for mining frequent itemsets, rare itemsets (the weights/significance of all items are the same or different) and efficient data structure for dataset.

2.1 Mining Weighted/Significance Frequent, Rare Itemset

Let I = {i1, i2, …, im} be a set of m distinct items. A set of items X = {i1, i2, …, ik}, ∀ij ∈ I (1 ≤ j ≤ k) is called an itemset, an itemset with \( k \) items is called a k-itemset. Ɗ be a dataset containing n transaction, a set of transaction T = {t1, t2, …, tn} and each transaction tj = {ik1, ik2, …, ikl}, ∀ikl ∈ I and a set of weight/significance SIG = {sigi1, sigi2, …, sigim}, ∀sigik ∈ [0, 1] respective to each item.

Definition 1.

The count of an itemset \( X \) is the number of transaction in which occurs as a subset, denoted \( count\left( X \right) \). The support of an itemset \( X \) computes:

$$ \sup \left( X \right) = count\left( X \right)/{\text{n}} $$
(1)

Definition 2.

Let X = {i1, i2, …, ik}, ∀ij ∈ I (1 ≤ j ≤ k), significance of itemset X to compute sig(X) = max(sigi1, sigi2, …, sigik).

The significance support of itemset \( X \) to computes as follow:

$$ sigsup\left( {\text{X}} \right) = \, sig\left( {\text{X}} \right) \times sup\left( {\text{X}} \right) $$
(2)

Definition 3.

Let maxsigsup be the threshold maximum significance support value specified by user. An itemset X is a rare significance itemset if sigsup(X) < maxsigsup, denoted RSI is the set of all the rare significance itemset.

See an Example transaction database \( {\mathcal{D}} \) in Tables 1 and 2.

Table 1. The transaction database \( {\mathcal{D}} \) used as our running example
Table 2. Items significance of \( {\mathcal{D}} \)

Example 1.

See Table 1 and 2. There are eight different items I = {A, B, C, D, E, F, G, H} and ten transactions T = {t1, t2, t3, t4, t5, t6, t7, t8, t9, t10}. And see Table 2 and maxsigsup = 0.20. Consider item X = {G}, sup(G) = 0.50, sig(G) = 0.30, sigsup(G) = 0. 15 < maxsigsup, we have itemset X = {G}∈ RSI. However, itemset Y = {G, A}, sup(GA) = 0.30, sigsup(GA) = max(sigG, sigA) = 0.70, sigsup(GA) = sig(GA) × sup(GA) = 0.70 × 0.50 = 0.35 ≥ maxsigsup, we have item G ∉ RSI (DOES NOT satisfy the downward closure property).

Property 1.

ik ∈ I, sigsup(ik) < maxsigsup: ikRSI.

2.2 Data Structure for Transaction Database

The binary matrix is an efficient data structure for mining frequent itemsets [ 3 ]. The process begins with the transaction database transformed into a binary matrix BiM, in which each row corresponds to a transaction and each column corresponds to an item. Each element in the binary matrix BiM contains 1 if the item is presented in the current transaction; otherwise it contains 0.

3 The Proposed Algorithms

3.1 Generating Array Contain Co-occurrence Items of Kernel Item

In this section, we describe the framework of the algorithm that generates co-occurrence items of items in transaction database.

Definition 4. [3]

Project set of item ik on database \( {\mathcal{D}} \): π(ik) = {tj\( { \mathcal{D}} \) | ik ⊆ tj} is set of transaction contain item ik. According to Definition 1

$$ count\left( {{\text{i}}_{\text{k}} } \right) = \left| {\uppi\left( {{\text{i}}_{\text{k}} } \right)} \right| $$
(3)

Definition 5. [3]

Project set of itemset X = {i1, i2, …, ik}, ∀ij ∈ I (1 ≤ j ≤ k): π(X) = π(i1) ∩ π(i2) … π(ik).

$$ count\left( {\text{X}} \right) = \left| {\uppi\left( {\text{X}} \right)} \right| $$
(4)

Definition 6. (Reduce search space)

Let ∀ik ∈ I (i1 \( \succ \) i2 \( \succ \)\( \succ \) im) items are ordered in significance descending, ik is called a kernel item. Itemset Xlexcooc ⊆ I is called co-occurrence items with the kernel item ik, as to satisfy π(ik) ≡ π(ik ∪ ij), ik \( \prec \) ij, ∀ijXlexcooc. Denoted as lexcooc(ik) = Xlexcooc.

Definition 7. (Reduce search space)

Let ∀ik ∈ I (i1 \( \succ \) i2 \( \succ \)\( \succ \) im) items are ordered in significance descending, ik is called a kernel item. Itemset Ylexlooc ⊆ I is called occurrence items with item ik in as least one transaction, but not co-occurrence items, so that 1≤ |π(ik ∪ ij)| < |π(ik)|, ∀ij ∈ Ylexlooc. Denoted as lexlooc(ik) = Ylexlooc.

Algorithm Generating Array of Co-occurrence Items

This algorithm is generating co-occurrence items of items in transaction database and archived into the Kernel_COOC array and each element has 4 fields:

  • Kernel_COOC[k].item: kernel item k;

  • Kernel_COOC[k].sup: support of kernel item k;

  • Kernel_COOC[k].cooc: co-occurrence items with kernel item k;

  • Kernel_COOC[k].looc: occurrence items kernel item k in at least one transaction.

The framework of Algorithm 1 is as follows:

figure a

We illustrate Algorithm 1 on Example database in Table 1.

Initialization of the Kernel_COOC array, number items in database m = 8;

Item

A

B

C

D

E

F

G

H

sup

0

0

0

0

0

0

0

0

cooc

11111111

11111111

11111111

11111111

11111111

11111111

11111111

11111111

looc

00000000

00000000

00000000

00000000

00000000

00000000

00000000

00000000

Read once of each transaction from t1 to t10

Transaction t1 = {A, C, E, F} has vector bit representation 10101100;

Item

A

B

C

D

E

F

G

H

sup

1

0

1

0

1

1

0

0

cooc

10101100

11111111

10101100

11111111

10101100

10101100

11111111

11111111

looc

10101100

00000000

10101100

00000000

10101100

10101100

00000000

00000000

The same, transaction t10 = {A, C, E, F, G} has vector bit representation 10101110;

Item

A

B

C

D

E

F

G

H

sup

8

2

8

2

7

3

5

1

cooc

10100000

11101000

10100000

10110000

00001000

10100100

10100010

00001001

looc

11111110

11101010

11111110

10110110

11101111

10111110

11111110

00001001

After the processing of Algorithm 1, the Kernel_COOC array is as follows (Table 3):

Table 3. Kernel_COOC array are ordered in support ascending order (line 1 to 11)

Execute command line 12, 13 and 14 in Algorithm 1:

We added the sig field to demonstrate items the ordered by significance descending. We have looc(G) = {B, D, E, F}, where B \( \succ \) D \( \succ \) F \( \succ \) E \( \succ \) G, so lexlooc(G) = {∅} and result on Table 4.

Table 4. Kernel_COOC array are co-occurrence items ordered in significance descending

3.2 Generating List NLOOC-Tree

In this section, we describe the algorithm generating list nLOOC-Tree based on Kernel_LOOC array. Each node within the nLOOC_Tree, 2 main fields:

  • nLOOC_Tree[k].item: kernel item \( k \);

  • nLOOC_Tree[k].sup: support of item \( k \);

The framework of Algorithm 2 is as follows:

figure b
Fig. 1.
figure 1

List nLOOC-Tree based on Kernel_COOC array

Each nLOOC-Tree has the characteristics following Fig. 1:

  • The height of the tree is less than or equal to the number of items that occur at least in one transaction with the kernel item (items are ordered in significance support ascending order).

  • Single-path is an ordered pattern from the root node (kernel item) to the leaf node and the support of the pattern is the support of the leaf node (ik → ik+1 → … → i).

  • Sub-single-path is part of single-path from the root node to any node in an ordered pattern and the sub-single-path support is the support of the child node at the end of the sub-single-path.

Example 2.

Consider kernel item F, we observe nLOOC-Tree(F) generating single-path {F → E → G}, sup(FEG) = 0.10 and sigsup(FEG) = 0.06; sub-single-path {F → E}, sup(FE) = 0.20 and sigsup(FE) = sig(FE) × sup(FE) = 0.60 × 0.20 = 0.12.

3.3 Algorithm Generating All Rare Significance Itemsets

In this section, we describe the framework of the algorithm generating all rare significance itemsets based on the nLOOC-Tree and Kernel_COOC.

The power set of any itemset X is the set of all subsets of X, including the empty set and X itself, variously denoted as ℘(X). The set of subsets of X of cardinality greater than or equal to k is sometimes denoted by ℘≥k(X).

Lemma 1.

(Generating rare significance itemset from co-occurrence items) ∀ikI, if sigsup(ik) < maxsigsup and itemset Xlexcooc is set of for all element of lexcooc(ik) then sup(ikxlexcooc) < maxsigsup, ∀xlexcooc≥1(Xlexcooc) and itemset {ikxlexcooc} ∈ RSI, ∀xlexcooc≥1(Xlexcooc).

Proof.

According to Definition 6, (1), (2) and (3): itemset Xlexcooc is set of co-occurrence items with the kernel item ik, as to satisfy π(ik) ≡ π(ikxlexcooc), ∀xlexcooc≥1(Xlexcooc). Therefore, we have sup(ik) = sup (ikxlexcooc), sigsup(ik) = sigsup(ikxlexcooc) = sig(ikXlexcooc) × sup(ik) = sig(ik) × sup(ik) < maxsigsup and according to Definition 7: itemsets{ikxlexcooc} ∈ RSI, ∀xlexcooc≥1(Xlexcooc)■.

Example 3.

See Table 4. Consider the item D as kernel item (maxsigsup = 0.15), we detect co-occurrence items with the item Ɗ as lexcooc(D) = {A, C} then ≥1({A, C}) = {A, C, AC}, sigsup(DA) = sigsup(DC) = sigsup(DAC) = sig(D) × sup(D) = 0.65 × 0.20 = 0.13 < maxsigsup and itemsets {DA, DC, DAC}are rare significance itemset.

Lemma 2.

(Generating rare significance itemset from occurrence items with kernel item k in at least one transaction) ∀ikI, sigsup(ik) < maxsigsup, Xlexcooc = lexcooc(ik) ∧ ∀spj ∈ nLOOC-Tree(ik), if sigsup(spj) < maxsigsup then {ikssp}∈ RSI, ∀sspspj and {iksspjxlexcooc} ∈ RSI, ∀xlexcooc≥1(Xlexcooc).

Proof.

According to Definition 6, 7 and Lemma 1: we have |π(ik ∪ ylexlooc)| < |π(ik)| ≡ |π(ikXlexcooc)|, ylexloocspj ∈ nLOOC-Tree(ik) contain of single-paths/sub-single-paths, and sigsup(ikspj) < maxsigsup, {ikspj}∈ RSI. Therefore, we have sigsup(ikspjxlexcooc) < maxsigsup and {ikspjXlexcooc} ∈ RSI, xlexcooc≥1(Xlexcooc)■.

Example 4.

See Table 4 and Fig. 1. Consider the item D as kernel item (maxsigsup = 0.15) with sigsup(D) = 0.13 < maxsigsup, we detect occurrence items with kernel item D in at least one transaction as Ylexlooc = lexlooc(D) = {F, G}; we observe nLOOC-Tree(D) generating single-path {D → F → G}, sup(DFG) = 0.10 and sigsup(DFG) = 0.65 × 0.10 = 0.065 < maxsigsup then itemsets {DF, DG, DFG} are rare significance itemset and itemsets {DAF, DCF, DACF, DAG, DCG, DACG, DAFG, DCFG, DACFG} ∈ RSI.

Property 2.

spj ∈ nLOOC-Tree(ik) ∧ sig(ik) × minsup_leafnode(spj) ≥ maxsigsup: {ikspj}∉ RSI (minsup_leafnode is minimum support value of each leaf node on single-paths in nLOOC-Tree(ik)).

The framework of Algorithm 3 is presented as follows:

figure c

3.4 The Algorithm Diagram NOV-RSI

In this section, we represent the diagram of NOV-RSI algorithm for high-performance mining rare significance itemsets, as follows Fig. 2:

Fig. 2.
figure 2

The diagram algorithm for NOV-RSI.

We illustrate Algorithm 3 on Example database in Table 1, 2 and maxsigsup = 0.10. After the processing Algorithm 1 result the Kernel_COOC array in Table 4 and Algorithm 2 presented the list nLOOC_Tree in Fig. 1.

Consider kernel item H, sigsup(H) = 0.80 × 0.10 = 0.08 < maxsigsup (Lemma 1 - line 3) generating rare significance itemset of kernel item H as RSI[H] = {(H; 0.08), (HE; 0.08)};

Consider kernel item D, sigsup(D) = 0.65 × 0.20 = 0.13 > maxsigsup, lexcooc(D) = {A, C} have ≥1({A, C}) = {A, C, AC}. We observe nLOOC-Tree(D) have single-path/sub-single-path {D → F → G}, {D → F} and {D → G}: sigsup(DFG) = 0.65 × 0.10 = 0.065 < maxsigsup; sigsup(DF) = 0.65 × 0.10 = 0.065 < maxsigsup; sigsup(DG) = 0.65 × 0.10 = 0.065 < maxsigsup (Lemma 2 - line 5) generating rare significance itemset of kernel item D as RSI[D] = {(DFG, 0.065), {(DF, 0.065), {(DG, 0.065), (DAFG, 0.065), (DCFG, 0.065), (DACFG, 0.065), (DAF, 0.065), (DCF, 0.065), (DACF, 0.065), (DAG, 0.065), (DCG, 0.065), (DACG, 0.065)};

Consider kernel item B, sigsup(B) = 0.70 × 0.20 = 0.14 > maxsigsup, lexcooc(B) = {A, C, E} have ≥1({A, C, E}) = {A, C, E, AC, AE, CE, ACE}. We observe nLOOC-Tree(B) have single-path {B → G}: sigsup(BG) = 0.70 × 0.10 = 0.07 < maxsigsup (Lemma 2 - line 5) generating rare significance itemset of kernel item B as RSI[B] = {(BG, 0.07), (BAG, 0.07), (BCG, 0.07), (BEG, 0.07), (BACG, 0.07), (BAEG, 0.07), (BCEG, 0.07), (BACEG, 0.07)};

Consider kernel item F, sigsup(F) = 0.60 × 0.30 = 0.18 > maxsigsup, lexcooc(F) = {A, C} have ≥1({A, C}) = {A, C, AC}. We observe nLOOC-Tree(F) have single-path/sub-single-path {F → E → G}, {F → E} and {F → G}: sigsup(FEG) = 0.60 × 0.10 = 0.06 < maxsigsup; sigsup(FE) = 0.60 × 0.20 = 0.12 > maxsigsup; sigsup(FG) = 0.60 × 0.20 = 0.12 > maxsigsup generating rare significance itemset of kernel item F as RSI[F] = {(FEG, 0.06), (FAEG, 0.06), (FCEG, 0.06), (FACEG, 0.06)} (Lemma 2 - line 5);

Consider kernel item E, sigsup(E) = 0.40 × 0.70 = 0.28 > maxsigsup. We observe nLOOC-Tree(E) have single-path {E → G}, minsup_leafnode({E → G}) = 0.30 and sig(E) × minsup_leafnode({E → G}) = 0.40×0.30 = 0.12 > maxsigsup, so RSI[E] = {∅} (Property 5 - line 7).

Consider kernel item C (similarly kernel item E), sigsup(C) = 0.50 × 0.80 = 0.40 ≥ maxsigsup. We observe nLOOC-Tree(C) have single-paths {C → E → G}, {C → G}, minsup_leafnode({C → E → G}, {C → G}) = 0.30 and sig(C) × minsup_leafnode{C → E → G}, {C → G}) = 0.50 × 0.30 = 0.15 > maxsigsup, so RSI[C] = {∅} (Pro 2 - line 7).

Consider kernel item A (similarly kernel item C), sigsup(A) = 0.55 × 0.80 = 0.44 ≥ maxsigsup, RSI[A] = {∅}.

Table 5 shows the rare significance itemsets at maxsigsup = 0.10.

Table 5. RSI satisfy maxsigsup = 0.10 (Example database in Table 1 and 2)

4 Experiments

All experiments were performed on a PC with a Core Duo CPU T2500 2.0 GHz, 4 Gb main memory, running Microsoft Windows 7 Ultimate. All codes were compiled using C#, MVStudio 2010, .Net Framework 4.

We experimented on two instance types of datasets, see Table 6:

  • Two real datasets are both dense form of UCI Machine Learning Repository [http://archive.ics.uci.edu/ml] as Chess and Mushroom datasets.

  • Two synthetic sparse datasets are generated by software of IBM Almaden Research Center [http://www.almaden.ibm.com] as T10I4D100K and T40I10D100K datasets.

    Table 6. Datasets description in experiments

Additionally, we build one table to save the significance values of items by random real values in the range of 0 to 1. This is the first proposed algorithm for RSI mining based on approach DOES NOT satisfy the downward closure property. To evaluate the performance of the proposed algorithm, we modified (DOES NOT satisfy the downward closure property) the AprioriInverse [4] and Rarity [6] to mine RSI called the WaprioriInverse and WRarity algorithm. Therefore, we have compared the NOV-RSI algorithm with algorithms WAprioriInverse and WRarity.

Fig. 3.
figure 3

Running time of the three algorithms on Chess and Mushroom datasets.

Figure 3(a) and (b) show the running time of the compared algorithms on real datasets Chess and Mushroom. The NOV-RSI algorithm runs faster than WAprioriInverse and WRarity algorithms in all maximum significance supports.

Fig. 4.
figure 4

Running time of the three algorithms on T10I4D100K and T40I10D100K datasets.

Figure 4(a) and (b) show the running time of the compared algorithms on synthetic datasets T10I4KD100K and T40I10D100K. The NOV-RSI algorithm runs faster than WaprioriInverse and WRarity algorithms.

In the experiment mentioned above, results suggest the following comparison of these algorithms when running time is concerned: NOV-RSI runs faster than algorithms WaprioriInverse and WRarity algorithms in all maxsigsup on real and synthetic datasets.

5 Conclusion

According to this paper, we presented a high-performance algorithm for mining rare significance itemsets on transaction databases, comprising three phases: the first phase, we quickly detect a Kernel_COOC array of co-occurrences and occurrences of kernel item in at least one transaction; the second phase, we build the list of nLOOC-Tree base on the Kernel_COOC and a binary matrix of dataset (self-reduced search space); the last phase, the algorithm is proposed for fast mining RSI based on nLOOC-Tree. Besides, when using mining RSI with other maxsigsup value, the proposed algorithm only performs mining RSI based on the nLOOC-Tree that is calculated previously (the second phase - Algorithm 2), there by reducing the significant processing time. The experiment’s results show that the proposed algorithms perform better than other existing algorithms.

The results from the algorithm proposed: In the future, we will expand the NOV-RSI algorithm to be able to mine rare significance itemsets on Multi-Cores, Many-CPUs, GPU and distributed computing systems such as Hadoop, Spark.