Keywords

1 Introduction

In this section, we first introduce some preliminaries for mining high utility itemsets [7]. Let I = {i1, i2,…, im} be the set of all the items. An itemset X is a subset of I and the length of X is the number of items contained in X. An itemset with length k is called a k-itemset. A transaction database D = {T1, T2,…, Tn} contains a set of transactions and each transaction has a unique transaction identifier (TID). Each transaction contains the items purchased in this transaction and their purchased quantities. The purchased quantity of item ip in a transaction Tq is denoted as o(ip, Tq). The utility of item ip in Tq is u(ip, Tq) = o(ip, Tq)×s(ip), in which s(ip) is the profit of item ip. The utility of an itemset X in Tq is the sum of the utilities of the items contained in X \(\subseteq\) Tq, which is shown in expression (1). If X ⊄ Tq, u(X, Tq) = 0. The utility of an itemset X in D is the sum of the utilities of X in all the transactions containing X, which is shown in expression (2).

The transaction utility (tu) of a transaction Tq is the sum of the utilities of the items in Tq, which is shown in expression (3). The total utility of the whole transaction database D is the sum of the transaction utilities of all the transactions in D. A utility threshold is a user specified percentage and a minimum utility (MU) can be obtained by multiplying total utility of D and the user-specified utility threshold. An itemset X is a high utility itemset if the utility of X in D is no less than the minimum utility.

$${\text{u(X,T}}_{q} )= \sum\limits_{{i_{p} \in X \subseteq T_{q} }} {u\left( {{\text{i}}_{p} ,{\text{T}}_{q} } \right)}$$
(1)
$${\text{u(X)}} = \sum\limits_{{X \subseteq T_{q} \in D}} {u \, (X,{\text{T}}_{q} )}$$
(2)
$${\text{tu(T}}_{\text{q}} )= \sum\limits_{{{\text{i}}_{\text{p}} \in {\text{T}}_{\text{q}} }} {{\text{u}}\left( {{\text{i}}_{\text{p}} , {\text{T}}_{\text{q}} } \right)}$$
(3)

For example, Table 1 is a transaction database, in which each integer number represents the purchased quantity for an item in a transaction. Table 2 is a Profit Table which records the profit for each item in Table 1. Suppose the user-specified utility threshold is 60%. Because the total utility of Table 1 is 224, the minimum utility is 226 * 60% = 134.4. The utility of itemset {D} is u ({D}) = 3×6 = 18 ≦ 135, which is not a high utility itemset . The utility of itemset {BD} is u ({BD}) = (6×10 + 1×6) + (10×10 + 1×6) = 172 ≥ 135. Therefore, itemset {BD} is a high utility itemset.

Table 1 A transaction database
Table 2 Profit table

For mining frequent itemset [1], all the subsets of a frequent itemset are frequent, that is, there is a downward closure property for frequent itemsets. However, the property is not available for high utility itemsets , since a subset of a high utility itemset may not be a high utility itemset. For the above example, itemset {BD} is a high utility itemset in Table 1, but its subset {D} is not a high utility itemset. Therefore, Liu et al. [7] proposed a Two-Phase algorithm for mining high utility itemsets. They defined the transaction weighted utility (twu) for an itemset X, which is shown in expression (4).

$${\text{twu(X)}} = \sum\limits_{{{\text{X}} \subseteq {\text{T}}_{\text{q}} \in {\text{D}}}} {{\text{tu(T}}_{\text{q}} )}$$
(4)

If the twu of an itemset is no less than MU, then the itemset is a high transaction weighted utility (HTWU) itemset. According to expression (4), the twu for an itemset X must be greater than or equal to the utility of X in D. Therefore, if X is a high utility itemset , then X must be a HTWU itemset. All the subsets of a HTWU itemset are also HTWU itemsets. Therefore, there is a downward closure property for HTWU itemsets. The first phase for the Two-Phase algorithm [7] is to find all the HTWU itemsets which are called candidate high utility itemsets by applying Apriori algorithm [1]. Two-Phase algorithm scans the database again to compute the utilities for all the candidate high utility itemsets and find high utility itemsets in the second phase.

Although some approaches [2, 7, 9, 15, 16, 18] have been proposed for mining high utility itemsets in a static transaction database, these approaches cannot efficiently discover high utility itmesets in a data stream environment, since they need to rescan the original database and re-discover all the high utility itemsets when some transactions are added into or removed from the database. In a data stream environment, the transactions are generated or removed in an extremely fast way. We need to immediately identify which itemsets can be turn out to be high utility itemsets, and vice versa. Besides, in this environment, we need to keep the information for all the itemsets, otherwise some high utility itemsets may be lost. However, the memory space is limited. It is very difficult to retain the utilities of all the itemsets in a large database .

Recently, some approaches [3, 6, 10, 14] have been proposed to find high utility itemsets in a data stream, which can be divided into Apriori-like [6, 14] and Tree-based approaches [3, 10]. However, these approaches just tried to find candidate high utility itemsets, that is HTWU itemsets. They still need to take a lot of time to rescan the original database and search for high utility itemsets from the large number of candidate itemsets without using the previous found information. Therefore, in this chapter, we propose an efficient algorithm HUIStream(mining High Utility Itemset in data Stream) for mining high utility itemsets in a data stream. When the transactions are added or deleted, our algorithms can just update HTWU itemsets according to the added or deleted transactions and directly calculate the utilities of HTWU itemsets without rescan the original database and search for high utility itemsets from the HTWU itemsets.

2 Related Work

The early approaches for mining frequent itemsets [1, 4, 12] are based on Apriori-like approach, which iteratively generate candidate (k + 1)-itemsets from the frequent k-itemsets (k ≥ 1) and check if these candidate itemsets are frequent. However, in the cases of extremely large input sets or low minimum support threshold, the Apriori-like algorithms may suffer from two main problems of repeatedly scanning the database and searching from a large number of candidate itemsets.

Since Apriori-like algorithms require multiple database scans to calculate the number of occurrences of each itemset and record a large number of candidate itemsets, Tree-based algorithms [5, 11, 21] improve these disadvantages, which transform the original transaction database into an FP-tree and generate the frequent itemsets by recursively constructing the sub-trees according to the FP-Tree. Because all the transactions are recorded in a tree, Tree-based algorithms do not need multiple database scans and do not need to generate a large number of candidate itemsets.

Although Tree-based algorithms have been able to efficiently identify frequent itemsets from the transaction database, because of the number of the frequent itemsets may be very large, the execution time and memory usage would increase significantly. Therefore, some researchers have proposed the concept of closed itemsets [17, 20]. The number of the closed frequent itemsets is far less than the number of the frequent itemsets in a transaction database, and all the frequent itemsets can be derived from the frequent closed itemsets , so either memory usage or execution time for mining frequent closed itemsets is much less than that of mining frequent itemsets.

Liu et al. [7] proposed Two-Phase algorithm for mining high utility itemsets . Since the subset of a high utility itemsets may not be a high utility itemsets, that is, there is no downward closure property for high utility itemset, Liu et al. proposed the transaction weighted utility (twu) of an itemset to find out high utility itemsets. In the first stage, Two-Phase algorithm applied Apriori algorithm [1] to find all the HTWU itemsets as the candidate itemsets, and then scans the transaction database to calculate the utility for each candidate itemset in order to identify which candidate itemsets are high utility itemsets. Although Two-Phase algorithm can find all the high utility itemsets from a transaction database, a large number of HTWU itemsets would be generated in the first phase, such that much time would be taken to search for high utility itemsets from these candidates in the second phase, since the twu of an itemset is much greater than the utility for the itemset.

Tseng et al. [14] proposed an algorithm THUI-Mine for mining high utility itemsets in a data stream , which only stores length two HTWU itemsets and applies Two-Phase algorithm to find all the HTWU itemsets. When a set of transactions is added, if there are new items in the added transactions, THUI-Mine will only determine whether the new items satisfy the utility threshold in the added transactions. If the items in the added transactions already exist in the original database, THUI-Mine will judge if the items are still HTWU items. Because THUI-Mine uses Two-Phase algorithm to re-mine the high utility itemsets, it still needs to take a lot of time to scan the database many times. HUP-HUI-DEL algorithm [8] also applies Two-Phase algorithm and only considers the transaction deletion. It still needs to generate a large number of candidate high utility itemsets and repeatedly scans the database to find high utility itemsets.

Li et al. [6] proposes MHUI algorithm, which discovers high utility itemsets in a specific sliding window. MHUI takes use of BITvector or TIDlist to store the transaction IDs in which each item is contained to avoid repeatedly scanning the database. MHUI stores length 2 HTWU itemsets in the structure LexTree-2HTU (Lexicographical Tree with 2-HTU itemset). When the transactions are added or deleted, MHUI generates all the length 2 itemsets from the added or deleted transactions and updates the structure LexTree-2HTU. MHUI uses level-wise method to generate all the HTWU itemsets from the length 2 HTWU itemsets, and re-sacan the database to find high utility itmesets.

Ahmed et al. [3] proposes a tree structure IHUP to stores the utility for each transaction and divides IHUP into three types according the order of the items which appear in the tree nodes: IHUPL, IHUPTF and IHUPTWU-. When a transaction is added, IHUPL stores each item of the transaction in the tree node according to the alphabetic order, but IHUPTF and IHUPTWU need to adjust the tree nodes to make sure that the items of each transaction are ordered by support and twu, respectively. IHUP needs to spend a large amount of memory space to store the whole database in a tree structure and applies FP-Growth algorithm [5] to repeatedly generate subtree structure. Finally, IHUP still needs to rescan the whole database to calculate the utility for each HTWU itmesets and generate high utility itemsets .

Yun and Ryang proposes HUPID-Growth algorithm [19] and SHU-Grow algorithm [13], respectively. HUPID-Growth scans the database once to construct HUPID-Tree and TIList and adjust the order of the items in the tree nodes to reduce the over-estimated value of the utility for each node in a path, that is to reduce the over-estimated utility for each itemsets. SHU-Grow uses the tree structure IHUP and stores the accumulated utility for each node when a set of transactions are added. SHU-Grow applies the strategies of UP-Growth algorithm [16] to reduce the over-estimated utility and the number of the candidate high utility itemsets. UPID-Growth and SHU-Grow still apply FP-Growth algorithm to find HTWU itemsets and search for high utility itemsets from a large number of candidate high utility itemsets.

3 Mining High Utility Itemsets in a Data Stream

In this section, we first introduce the storage structure for our algorithm HUIStream. When a transaction is added into or deleted from the transaction database, HUIStream updates HTWU itemsets related to the added or deleted transaction, respectively. In the following, we propose two algorithms HUIStream+ and HUIStream− for maintaining the HTWU itemsets and generates all the high utility itemsets from the HTWU itemsets when a transaction is added and deleted, respectively.

In order to avoid rescanning the original database and searching from the candidate high utility itemsets when the transactions are added or deleted, we have the following definitions. An itmeset X is a closed twu itemset if there is no superset of X, which has the same twu as X. An itemset is a closed HTWU itemset if X is a closed twu itemset and the twu of X is no less than user-specified minimum utility. For any two itemsets X and Y (X \(\subseteq\) Y), if the twu of Y is the same as the twu of X and Y is not contained in any other itemset with the same twu as Y, then Y is the closure of X. For a closed HTWU itemset X, the proper subset of X, which has the same twu as X, is called the Equal TWU itemset of X, and X is the closure of the Equal TWU itemset.

In order to efficiently find the high utility itemsets in a data stream without information loss, HUIStream first determines which itemsets in the transaction are closed twu itemsets when a transaction is added or deleted. All the closed twu itemsets are recorded in a Closed Table, since the number of the closed itemsets is much less than the number of the itemsets and all the itemsets can be generated by the closed itemsets in a transaction database. There are three fields included in the Closed Table: Cid records the identification of each closed twu itemset; CItemset records the closed twu itemset with utility of each item in the closed twu itemset; twu records the twu of the closed twu itemset.

Table 3 shows the content of the Closed Table after the previous four transactions in Table 1 are added. There are five closed twu itemsets, in which the utility and twu of the closed twu itemset {E} with Cid 3 are 15 and 205, respectively. For each item, we use the table Cid List to record the Cids of the closed twu itemsets which contain the item. Table 4 shows the content of the Cid List after the previous four transactions in Table 1 are added, in which the field CidSet for item C is {1, 4, 5}, since item C is contained in the three closed twu itemsets with Cids 1, 4 and 5.

Table 3 The closed table after processing the previous four transactions in Table 1
Table 4 The Cid list after processing the previous four transactions in Table 1

For example, the total utility of the previous four transactions in Table 1 is 210. If the utility threshold is 60%, that is the minimum utility is 210 * 60% = 126, then the closed HTWU itemsets are {BDE} and {E}. The Equal TWU itemsets for the two closed twu itemsets are shown in Table 5. The closed HTWU itemsets and their Equal TWU itemsets form the candidate high utility itemsets . HUIStream only needs to update the closed HTWU itemsets, that is, update the content of Closed Table and Cid List, and then the twu values of all the Equal TWU itemsets can be computed without rescanning the database.

Table 5 Closed HTWU itemsets and their equal TWU itemsets

3.1 The Algorithm HUIStream+

In this subsection, we describe how HUIStream finds the closed twu itemsets which need to be updated and all the HTWU itemsets after adding a transaction. When a transaction TADD is added, the twu value will be increased just for TADD and the subsets of TADD. Therefore, HUIStream only considers whether TADD and the subsets of TADD are closed twu itemsets or not. If X \(\subseteq\) TADD is a closed twu itemset before adding the transaction TADD, it must be a closed twu itemset after adding the transaction, because the twu value for the supersets (⊄TADD) of X would not be changed [20]. Therefore, all the closed twu itemsets which need to be updated can be obtained by performing the intersections on TADD and all the closed twu itemsets in the Closed Table. However, the intersections of TADD and most of the closed twu itemsets would be empty. It will waste a lot of unnecessary time to intersect TADD with all the closed twu itemsets. In order to avoid that the intersection is empty, HUIStream identifies which closed twu itemsets contain some items in TADD = {i1, i2,…, im} from Cid List according to expression (5).

$${\text{SET}}\left( {\{ {\text{TADD}}\} } \right) = {\text{CidSet}}\left( {{\text{i}}_{ 1} } \right)\, \cup \,{\text{CidSet}}\left( {{\text{i}}_{ 2} } \right)\, \cup \, \ldots \, \cup \,{\text{CidSet}}\left( {{\text{i}}_{\text{m}} } \right)$$
(5)

The closed twu itemset X obtained by the intersection of each closed twu itemset Y with Cid in SET({TADD}) and TADD need to be updated when a transaction TADD is added. The closed twu itemsets which need to be updated after adding transaction TADD are recorded in the table TempADD, which includes the two fields: UItemset records the closed twu itemset X which needs to be updated; Closure_Id records the Cid of the closure of X before adding transaction TADD, that is, the Cid of itemset Y. If there is the same closed twu itemset X generated by the intersections of different closed twu itemsets and TADD, then the closure of X is the closed twu itemset with the largest twu value among the different closed twu itemsets. Because an added transaction TADD must be a closed twu itemset [20], TADD is recorded in TempADD and the corresponding Closure_Id is set to be 0, which represents that we cannot know the closure of TADD before adding the transaction so far. HUIStream can update the content of Closed Table according to the TempADD.

For each record in TempADD, HUIStream compares the itemset X in UItemset and the itemset Y with Cid in Closure_Id. If X and Y are the same, that is, X is a closed twu itemset before adding the transaction TADD, then the utility of each item in X is increased by adding the utility of the item in TADD, and the twu of X is increased by adding the tu of TADD. If X and Y are different, that is, X is not a closed twu itemset before adding the transaction TADD and turns out to be a closed twu itemset after adding the transaction, then HUIStream assigns X a unique Cid and adds it into The Closed Table and Cid List, in which the twu of X is the twu of Y plus the tu of TADD and the utility of each item is the utility of the item in Y in Closed Table plus the utility of the item in TADD, since the twu of X is equal to the twu of Y before adding the transaction TADD.

If itemset X is not a closed HTWU itemset before adding transaction TADD, but is closed HTWU itemsets after adding the transaction, then the Equal TWU itemsets of the Closure Y of X becomes the Equal TWU itemsets of X. If the Closure of X is not a closed HTWU Itemset, then the Equal TWU itmesets of X are the subsets of X, which have the same twu as X. HUIStream uses the following method to determine if the subset Z of X is an Equal TWU itemset of X: If the twu of X is the largest twu among all the itemsets with Cids in SET(Z) according to expression (5), then Z is an Equal TWU itemset of X, that is, X is the Closure of Z. If X is a closed twu itemset before adding the transaction, then HUIStream only needs to justify if X is a closed HTWU itemset after adding the transaction.

For example, suppose the utility threshold is 60% for Table 1. The Close Table and Cid List are shown in Tables 3 and 4 after adding the previous four transactions in Table 1. When the transaction T5 is added, HUIStream records the itemset T5 = {ADE} in the field UItemset of TempADD and set 0 to Closured_Id. Because SET(T5) = CidSet(A) ∪ CidSet(D) ∪ CidSet(E) = {1, 2, 3, 4} according to expression (1), HUIStream performs the intersections of T5 and the closed twu itemsets with Cids 1, 2, 3 and 4 from Closed Table, respectively. Firstly, because Cid 1 is itemset {CE} and {ADE} ∩ {CE} = {E}, HUIStream adds UItemset {E} and Closure_Id 1 in the TempADD. Secondly, because Cid 2 is itemset {BDE} and {ADE} ∩ {BDE} = {DE}, HUIStream adds UItemset {DE} and Closure_Id 2 in the TempADD. Thirdly, Cid 3 is itemset {E} and {ADE} ∩ {E} = {E} has existed in TempADD. Because the twu of the closed twu itemset with Cid 3 is greater than the twu of the closed twu itemset with Cid 1, the corresponding Closure_Id of UItemset {E} is replaced with Cid 1. Finally, because Cid 4 is itemset {AC} and {ADE} ∩ {AC} = {A}, HUIStream adds UItemset {A} and Closure_Id 4 in the TempADD. After adding the transaction T5, the TempADD is shown in Table 6.

Table 6 The TempADD after adding transaction T5

HUIStream updates Closed Table and Cid List according to TempADD. For Table 6, because the first UItemset {ADE} of TempADD is not in Closed Table and the corresponding Closure_Id is 0, which means that itemset {ADE} is not a closed twu itemset before adding transaction T5, but turns out to be a closed twu itemset after adding transaction T5, HUIStream adds the CItemset {A:3, D:6, E:5} with Cid 6 and twu = 3+6 + 5 = 14 into the Closed Table. HUIStream also inserts Cid 6 into Cid List for items A, D and E. Because the minimum utility is 135 after adding transaction T5, itemset {ADE} is not a closed HTWU itemset.

For the second UItemset {E} and the corresponding Closure_Id 3 in Table 6, because Cid 3 in Closed Table (Table 3) is also {E}, which means that itemset {E} is a closed twu itemset before adding transaction T5, the twu of {E} after adding the transaction T5 is the twu of {E} in the Closed Table plus the tu of T5, that is 205 + 14 = 219. Because the utility of {E} is 5 in T5, HUIStream updates the CItemset {E:15} in Table 3 as {E:20 (=15 + 5)}.

For the third UItemset {DE} and the corresponding Closure_Id 2 in Table 6, because Cid 2 in Closed Table (Table 3) is {BDE}, which means that itemset {DE} is not a closed twu itemset and the Closure of {DE} is {BDE} before adding transaction T5, the twu of {DE} after adding transaction T5 is the twu of {BDE} in the Closed Table plus the tu of T5, that is, 182 + 14 = 196. HUIStream adds the CItemset {D:18 (=12 + 6), E:15 (=10 + 5)} in the Closed Table and assigns the new closed twu itemset {DE} a Cid 7, which is added to Cid List for items D and E. Because the two itemsets {D} and {DE} are the Equal TWU itemsets of {BDE} before adding transaction T5, and {D} and {DE} both are contained in {DE}, the two itemsets become the Equal TWU itemset of {DE} after adding transaction T5.

For the fourth record in TempADD, the UItemset is {A} and the corresponding Closure_Id is 4. Because Cid 4 in the Closed Table (Table 3) is {AC}, which means that itemset {DE} is not a closed twu itemset and the Closure of {A} is {AC} before adding transaction T5. The twu of {A} after adding transaction T5 is the twu of {AC} in the Closed Table plus the tu of T5, that is 7 + 14 = 21, which is not a closed HTWU itemset. HUIStream adds the CItemset {A:9 (=6 + 3)} in the Closed Table and assigns the new closed twu itemset {A} a Cid 8, which is added to Cid List for item A. After adding transaction T5, the Closed Table and Cid List are shown in Table 7 and Table 8, and the closed HTWU itemsets and the corresponding Equal TWU itemsets are shown in Table 9.

Table 7 The closed table after adding transaction T5 in Table 1
Table 8 The Cid list after adding transaction T5 in Table 1
Table 9 The closed HTWU itemsets after adding transaction T5 in Table 1

3.2 The Algorithm HUIStream

In this subsection, we describe how HUIStream finds the closed twu itemsets which need to be updated and all the HTWU itemsets after deleting a transaction. The itemsets need to be updated after deleting a transaction are the subsets of TDEL, since the twu of the subsets of TDEL would be decreased after deleting the transaction. Therefore, the closed twu itemsets which need to be updated can be obtained by performing the intersections on TDEL and all the closed twu itemsets before deleting transaction TDEL. In order to avoid that the intersection is empty, HUIStream only performs the intersections on TDEL and the closed twu itemsets with Cids in SET({TDEL}) according to expression (5) after deleting transaction TDEL. The closed twu itemsets which need to be updated after deleting a transaction are recorded in a table TempDEL, which includes the two fields: DItemset records the closed twu itemset X which needs to be updated; C1 records the Cid of X before deleting the transaction; C2 records the information which can be used to determine if X is still a closed twu itemset after deleting the transaction. Because a deleted transaction TDEL is a closed twu itemset before the deletion [20], HUIStream firstly puts TDEL in the first record of TempDEL, and sets the corresponding C1 and C2 to be 0.

Because the intersection of TDEL and different closed twu itemsets S may obtain the same itemset X, the field C1 in TempDEL records the Cid p of the closed twu itemset with the largest twu among all the closed twu itemsets in S. Cid p is the Cid of itemset X, since itemset X is a closed itemset before deleting the transaction TDEL. Because itemset X may not be a closed twu itemset after deleting the transaction, the field C2 in TempDEL records the Cid q of the closed twu itemset with the largest twu among all the closed twu itemsets in S except the closed twu itemset with Cid p. If the twu of Cid q is equal to the twu of Cid p minus the tu of TDEL, which means that the itemset with Cid q has the same twu as X, then X is not a closed twu itemset any more after deleting the transaction. HUIStream updates the content of the Closed Table according to the Table TempDEL.

For each record with values X, p and q for DItemset, C1 and C2 in TempDEL, respectively, the twu values of the closed twu itemsets with Cid p and Cid q can be obtained from Closed Table. If the twu of X minus the tu of TDEL is equal to 0, which means that X is not contained in any transaction after deleting TDEL, then itemset X with Cid p is removed from the Closed Table. If itemset Y with Cid q is not itemset X and the twu of X minus the tu of TDEL is equal to the twu of Y, then X is not a closed twu itemset after deleting transaction TDEL, since Y is a superset of X and they have the same twu values.

If X is still a closed twu itemset after the deletion, then HUIStream updates the twu of X and the utility of each item in X in the Closed Table as follows: the updated twu of X is the twu of X minus the tu of TDEL and the updated utility of each item in X is the utility of the item in X minus the utility of the item in TDEL. If X is not a closed HTWU itemset before deleting the transaction but is a closed HTWU itemset after the deletion, then HUIStream finds all the subsets of X, which have the same twu as X, that is all the Equal TWU itemsets of X.

For example, in Table 1, when the transaction T1 = {CE} is deleted, HUIStream firstly puts {CE} in the field DItemset in TempDEL and the corresponding C1 and C2 are set to be 0. Because SET(T1) = CidSet(C) ∪ CidSet(E) = {1, 2, 3, 4, 5, 6, 7} according to expression (1) and Cid List in Table 8, HUIStream performs the intersections on T1 and the closed twu itemsets with Cids 1, 2, 3, 4, 5, 6 and 7 from Closed Table in Table 7, respectively. Firstly, Cid 1 is itemset {CE} and {CE} ∩ {CE} = {CE} which exists in TempDEL and the corresponding C1 and C2 are 0 s. Because from Table 7, we can see that the twu of Cid 1 is greater than the twu of Cid 0, the Cid in C1 is changed to 1 and the Cid in C2 remains 0 for DItemset {CE} in the table TempDEL.

Secondly, because Cid 2 is itemset {BDE} and {CE} ∩ {BDE} = {E} which is not in TempDEL, the itemset {E} is added to TempDEL, and the corresponding C1 and C2 are set to 2 and 0, respectively. Thirdly, Cid 3 is itemset {E} and {CE} ∩ {E} = {E} has existed in TempDEL. Because the twu of Cid 3 is greater than the twu of Cid 2 and the twu of Cid 2 is greater than the twu of Cid 0, the Cid in C1 is replaced with 3 and the Cid in C2 is replaced with 2 for DItemset {E} in the table TempDEL. HUIStream continuously performs the intersections on T1 and the closed twu itemsets with Cids 4, 5, 6, and 7, respectively, and updates the content of TempDEL. After deleting the transaction T1, the TempDEL is shown in Table 10.

Table 10 The TempDEL after deleting the transaction T1

HUIStream updates the content of Closed Table and Cid List according to TempDEL. For example, in Table 10, the first record in DItemset is {CE} and C1 is Cid 1. Because the twu of {CE} with Cid 1 in Table 7 minus the tu of T1 is equal to 0, itemset {CE} is not a closed twu itemset after deleting transaction T1. Therefore, HUIStream removes the information about {CE} from Closed Table and Cid List. The second record in DItemset is {E} and C1 is Cid 3. Because the twu of {E} with Cid 3 in Table 7 minus the tu of T1 is equal to the twu of {DE} with Cid 7 in C2, itemset {E} is not a closed twu itemset after deleting transaction T1, since there exists a superset {DE} of {E} and they have the same twu values. HUIStream removes the information about {E} from the Closed Table and Cid List, and moves all the Equal TWU itemsets of {E} to the Equal TWU itemsets of {DE} with Cid 7. There is the same situation with the second record for the third record in TempDEL. All the information about itemset {C} is removed from the Closed Table and Cid List. After deleting transaction T1 from Table 1, the Closed Table and Cid List are shown in Table 11 and Table 12, respectively, and the closed HTWU itemsets and their Equal TWU itemsets are shown in Table 13.

Table 11 The closed table after deleting the transaction T1 from Table 1
Table 12 The Cid list after deleting the transaction T1 from Table 1
Table 13 The closed HTWU itemsets and their equal TWU itemsetsafter deleting T1

3.3 High Utility Itemset Generation

After processing the added and deleted transactions, all the Equal TWU itemsets of the closed HTWU itemsets are the candidate high utility itemsets . The utility of each Equal TWU itemset X for each closed HTWU itemset Y can be obtained by accumulating the utility of each item of X in Y from the Closed Table, and then all the high utility itemsets can be obtained without scanning the database.

For example, from Table 13, we can see that the itemsets {BDE} and {DE} are closed HTWU itemsets. For itemset {BDE} with Cid 2, the utility of {BDE} is 182( =B:160 + D:12 + E:10), which can be obtained from Closed Table in Table 11. The utility of the Equal TWU itemset {BD} of {BDE} is 172 (=B:160 + D:12), The utility of the Equal TWU itemset {BE} of {BDE} is 170 (=B:160 + E:10). All the candidate high utility itemsets and their utilities after deleting transaction T1 are shown in Table 14, in which itemsets {B},{BD},{BE} and {BDE} are high utility itemsets.

Table 14 The candidate high utility itemsets and their utilities

4 Experimental Results

In this section, we evaluate the performance of our HUIStream algorithm and compare it with IHUP algorithm [3]. Our experiments are performed on Intel(R) Core(TM) 2 Quad CPU Q9400 @ 2.66 GHz with 4 GB RAM and running on Windows XP. The two algorithms are implemented in JAVA language.

Fig. 1
figure 1

Utility value distribution in utility table

We first generate two synthetic datasets T5I2D100 K and T5I4D100 K by using IBM Synthetic Data Generator [22], in which T is the average length of the transactions, I is the average size of maximal potentially frequent itemsets and D is the total number of the transactions. The number of distinct items is set to 1000. For the profit of each item, we use the log Normal Distribution [2, 7] and set the range of the profits between 0.01 and 10, which is shown in Fig. 1. The purchased quantity for an item in a transaction is randomly set to the number between 1 and 10.

Figure 2 and Fig. 3 show the execution time of IHUP and HUIStream, which the utility threshold is set to be 0.1%, the number of transactions is increased from 10 K to 100 K, and the size of sliding window is set to be 1 K and 10 K, respectively. From the experiments, we can see that HUIStream outperforms IHUP, and the performance gap increases as the number of transactions increases and the times of window size movements increases, since HUIStream only updates the closed twu itemsets related to the added or deleted transactions. However, IHUP needs to re-mine HTWU itemsets by using FP-Growth algorithm [5] and rescan the database to find high utility itemsets from the HTWU itemsets.

Fig. 2
figure 2

The execution time for the two algorithms on window size = 1 K

Fig. 3
figure 3

The execution time for the two algorithms on window size = 10 K

Figures 4 and 5 show the memory usages for the two algorithms IHUP and HUIStream when the number of transactions is increased from 10 K to 50 K and the size of sliding window is set to be 1 K and 10 K, respectively, from which we can see that the memory usage for IHUP is significantly larger than that of HUIStream. This is because IHUP needs to recursively construct the subtrees for re-mining HTWU itemsets when a transactions are added or deleted, but HUIStream only needs to store and update the Closed Table and Cid List.

Fig. 4
figure 4

The memory usages for the two algorithms on window size = 1 K

Fig. 5
figure 5

The memory usages for the two algorithms on window size = 10 K

In the following experiments, we generate the two datasets T10I4D100K and T10I6D100K. The number of distinct items is set to 2000, and the utility threshold is set to be 0.1%. Figures 6 and 7 show the execution time of IHUP and HUIStream, which the number of transactions is increased from 10 K to 50 K, and the size of sliding window is set to be 1 K and 10 K, respectively. From Fig. 6, we can see that HUIStream outperforms IHUP and the performance gap increases as the number of transactions increases. Although the performance gaps are similar when the number of transactions increases from 10 K to 50 K, HUIStream still outperforms IHUP in Fig. 7. The memory usages for IHUP and HUIStream in this experiment are shown in Figs. 8 and 9, from which we can see that the memory usage for HUIStream still less than the memory usage for IHUP on the datasets with longer transaction size.

Fig. 6
figure 6

The execution time for the two algorithms on window size = 1 K

Fig. 7
figure 7

The execution time for the two algorithms on window size = 10 K

Fig. 8
figure 8

The memory usages for the two algorithms on window size = 1 K

Fig. 9
figure 9

The memory usages for the two algorithms on window size = 10 K

5 Conclusion

There are many previous approaches for mining high utility itemsets in a data stream . However, they all first need to generate a large number of candidate high utility itemsets and then scan the whole database to caculate the utility for each high utility itemset. Although some approaches propose some strategies to reduce the number of the candidate high utility itemsets, the number of the candidates is still large when the size of the database is large. In order to avoid rescanning the database, some approaches store the whole database in a tree structure, but they also need to re-generate the candidate high utility itemsets when the transactions are added or deleted without using the information about previously discovered high utility itemsets.

In order to improve the performance of the previous approaches, we propose an algorithm HUIStream for mining high utility itemsets over a data stream. We take use of the concept of closed itemsets [20] and propose the definition of closed twu itemsets which can be used to derive the twu values of all the itemsets in the database. Because the number of the closed twu itemsets is much less than the number of the itemsets in the database, HUIStream only keeps all the closed twu itemsets, such that the twu values of all the itemsets in the database can be reserved. Therefore, our approach only needs to update the closed twu itemsets about the added or deleted transaction without any information loss when a transaction is added or deleted. According to the closed twu itemsets, HUIStream can directly obtain the high utility itemsets from the closed HTWU itemsets without rescanning the database. The experimental results also show that our HUIStream outperforms the other approaches.