
1 Introduction

High utility sequential pattern (HUSP) mining [1, 7], which discovers sequential patterns with high utility in a sequence database, is an important topic in the field of utility pattern mining [3, 6] and has wide real-world applications [5, 8].

Different from traditional frequent sequential pattern mining [4], HUSP mining considers both the occurrence frequency and the profit of itemsets in a sequence. Thus, infrequent but high-profit sequential patterns can be discovered by mining HUSPs.

USpan [7] is a typical HUSP mining algorithm. For width pruning, USpan uses the sequence-weighted utilization as the upper bound, which may incur unnecessary utility computations. To improve efficiency, a new HUSP mining algorithm named HUSP-MRU is proposed, in which the maximal remaining utility is defined as a tighter upper bound. The item-utility matrix and maximal utility matrix structures are used to store maximal remaining utilities. The search space is represented by a lexicographic sequential pattern tree, and branch as well as node pruning strategies are used to avoid unnecessary traversal of unpromising nodes. Experimental results show that the HUSP-MRU algorithm is efficient.

2 Problem of HUSP Mining

Let \( \varvec{I} = \, \left\{ {i_{1} ,i_{2} , \ldots ,i_{m} } \right\} \) be a finite set of items; a set \( X \subseteq \varvec{I} \) is called an itemset. An itemset sequence \( S_{r} \) is an ordered list of itemsets \( < X_{1} ,X_{2} , \ldots ,X_{n} > \). Each itemset \( X_{d} \) in sequence \( S_{r} \) is denoted as \( S_{r}^{d} \). A sequence database SDB is a set of tuples of the form (sid, S), where sid is the ID of sequence S.

The internal utility \( q(i,S_{r}^{d} ) \) represents the quantity of item i in itemset \( S_{r}^{d} \) of sequence \( S_{r} \). The external utility p(i) is the unit profit value of item i. The utility of item i in itemset \( S_{r}^{d} \) is defined as \( u(i,S_{r}^{d} ) = p(i) \times q(i,S_{r}^{d} ) \). The utility of itemset X in \( S_{r}^{d} \) of sequence \( S_{r} \) is defined as \( u(X,S_{r}^{d} ) = \sum\limits_{i \in X} {u(i,S_{r}^{d} )} \).
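For illustration, these two definitions can be sketched in Python. The profit values p(a) = 6 and p(e) = 1 below are assumptions consistent with the worked example later in this section, not a full Table 2.

```python
# Sketch of the utility definitions; p(a) = 6 and p(e) = 1 are assumed
# values, and itemset_q plays the role of one itemset S_r^d.
profit = {"a": 6, "e": 1}        # external utility p(i)
itemset_q = {"a": 3, "e": 2}     # internal utility q(i, S_r^d)

def item_utility(i, q, p):
    """u(i, S_r^d) = p(i) * q(i, S_r^d)."""
    return p[i] * q[i]

def itemset_utility(X, q, p):
    """u(X, S_r^d): sum of the utilities of the items of X in the itemset."""
    return sum(item_utility(i, q, p) for i in X)

print(item_utility("a", itemset_q, profit))            # 18
print(itemset_utility({"a", "e"}, itemset_q, profit))  # 20
```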

A sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \, \ldots ,X_{p} > \) is called a subsequence of a sequence \( S_{r} = \, < S_{r}^{1} ,S_{r}^{2} , \ldots ,S_{r}^{n} > \), denoted as \( \alpha { \sqsubseteq }S_{r} \), if there exist integers \( 1 \le j_{1} < j_{2} < \ldots < j_{p} \le n \) such that \( X_{1} \subseteq S_{r}^{{j_{1} }} ,X_{2} \subseteq S_{r}^{{j_{2} }} , \ldots ,X_{p} \subseteq S_{r}^{{j_{p} }} \). We call the ordered list of itemsets \( < S_{r}^{{j_{1} }} ,S_{r}^{{j_{2} }} , \ldots ,S_{r}^{{j_{p} }} > \) an occurrence of α in \( S_{r} \). The set of all occurrences of α in \( S_{r} \) is denoted as OccSet(α, \( S_{r} \)).
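OccSet(α, \( S_{r} \)) can be enumerated with a simple backtracking scan. The sketch below is ours, not the paper's implementation; it encodes a sequence as a list of Python sets and returns occurrences as tuples of 0-based itemset indices.

```python
def occurrences(alpha, seq):
    """All occurrences of pattern alpha in sequence seq, as tuples of
    0-based itemset indices (j_1, ..., j_p) with j_1 < ... < j_p."""
    res, chosen = [], []
    def extend(p, start):
        if p == len(alpha):
            res.append(tuple(chosen))
            return
        for j in range(start, len(seq)):
            if alpha[p] <= seq[j]:          # X_p is a subset of S_r^j
                chosen.append(j)
                extend(p + 1, j + 1)
                chosen.pop()
    extend(0, 0)
    return res

# A toy sequence <{b}, {a}, {a,b}> and the pattern <ba>:
occs = occurrences([{"b"}, {"a"}], [{"b"}, {"a"}, {"a", "b"}])
print(occs)  # [(0, 1), (0, 2)]
```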

In this paper, the term sequence refers to the transactions of the sequence database, while the term sequential pattern refers to the subsequences extracted from these sequences.

Let \( occ = < S_{r}^{{j_{1} }} ,S_{r}^{{j_{2} }} , \ldots ,S_{r}^{{j_{p} }} > \) be an occurrence of \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \) in sequence \( S_{r} \). The utility of α with respect to occ is defined as \( u(\alpha ,occ) = \sum\limits_{i = 1}^{p} {u(X_{i} ,S_{r}^{{j_{i} }} )} \), and the utility of α in \( S_{r} \) is defined as \( u(\alpha ,S_{r} ) = max\{ u(\alpha ,occ) \, |occ \in OccSet(\alpha ,S_{r} )\} \). The sequence utility (SU) of a sequence \( S_{r} \) is defined as \( SU(S_{r} ) = u(S_{r} ,S_{r} ) \). The utility of a sequential pattern α in SDB is defined as \( u(\alpha ) = \sum\limits_{{\alpha { \sqsubseteq }S_{r} \wedge S_{r} \in SDB}} {u(\alpha ,S_{r} )} \).
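In other words, u(α, \( S_{r} \)) is a maximum over occurrences and u(α) is a sum over sequences. The snippet below replays this with the <ba> utilities from the worked example later in this section (31 and 22 for the two occurrences in S4; 11 and 28 for S3 and S5):

```python
# u(alpha, S_r) takes the maximum over occurrences; u(alpha) sums the
# per-sequence utilities (values taken from the <ba> example in Sect. 2).
occ_utils_in_S4 = [31, 22]               # u(alpha, occ) for the two occurrences
u_in_S4 = max(occ_utils_in_S4)           # u(alpha, S_4) = 31

per_sequence_utils = [11, u_in_S4, 28]   # u(alpha, S_r) for S_3, S_4, S_5
u_alpha = sum(per_sequence_utils)        # u(alpha) = 70
print(u_in_S4, u_alpha)  # 31 70
```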

The minimum utility threshold δ, specified by the user, is defined as a percentage of the total SU values of the database, and the minimum utility value is defined as \( min\_util = \delta \times \sum\limits_{Sr \in SDB} {SU(S_{r} )} \). A sequential pattern α is called an HUSP if u(α) ≥ min_util. Given a sequence database SDB, the task of HUSP mining is to discover all sequential patterns whose utilities are no less than min_util. As in the algorithms of [1, 7], the concept of remaining utility is also used in our algorithm to speed up the mining process.

Let \( occ = < S_{r}^{{j_{1} }} ,S_{r}^{{j_{2} }} , \ldots ,S_{r}^{{j_{p} }} > \) be an occurrence of \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \) in sequence \( S_{r} \). The last item of \( X_{p} \), say item i, is called the extension item of α. The remaining sequential pattern of \( S_{r} \) w.r.t. position \( j_{p} \), denoted as \( S_{r} /(\alpha ,j_{p} ) \), is defined as the subsequence of \( S_{r} \) from the item after i to the end of \( S_{r} \). The remaining utility of α w.r.t. \( S_{r} /(\alpha ,j_{p} ) \), denoted as \( ru(\alpha ,S_{r} /(\alpha ,j_{p} )) \), is the sum of the utilities of all items in \( S_{r} /(\alpha ,j_{p} ) \). The highest among all values of \( ru(\alpha ,S_{r} /(\alpha ,j_{p} )) \) is called the maximal remaining utility of α in \( S_{r} \), denoted as mru(α, \( S_{r} \)).
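The remaining utility can be computed by summing everything after the extension item. The sketch below uses an assumed toy sequence (itemsets as lists of (item, utility) pairs in lexicographic order); for a pattern <ba> with occurrences ending at itemsets 1 and 2, mru is the larger of the two remaining utilities.

```python
def remaining_utility(seq, jp, ext_item):
    """ru for an occurrence whose extension item ext_item matched in
    itemset jp (0-based): the utilities of the items after ext_item in
    seq[jp] plus everything in later itemsets."""
    items = [it for it, _ in seq[jp]]
    after = items.index(ext_item) + 1
    total = sum(u for _, u in seq[jp][after:])
    for d in range(jp + 1, len(seq)):
        total += sum(u for _, u in seq[d])
    return total

# Assumed toy sequence <{b}, {a,c}, {a,e}> with per-item utilities:
seq = [[("b", 5)], [("a", 10), ("c", 3)], [("a", 4), ("e", 2)]]
rus = [remaining_utility(seq, 1, "a"), remaining_utility(seq, 2, "a")]
print(rus, max(rus))  # [9, 2] 9  (the maximum is the mru)
```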

Consider the sequence database in Table 1 and the profit table in Table 2. In the example database, the utility of item a in the second itemset of S2 (i.e., \( S_{2}^{2} \)) is \( u(a,S_{2}^{2} ) = 6 \times 3 = 18 \), and the utility of itemset {ae} in \( S_{2}^{2} \) is \( u(\left\{ {ae} \right\},S_{2}^{2} ) = 18 + 2 = 20 \). Both \( occ_{1} = < S_{4}^{1} ,S_{4}^{2} > \) and \( occ_{2} = < S_{4}^{1} ,S_{4}^{3} > \) are occurrences of α = <ba> in S4, so \( u(\alpha ,S_{4} ) = max\{ u(\alpha ,occ_{1} ) \), \( u(\alpha ,occ_{2} )\} = max\left\{ {31, \, 22} \right\} = 31 \). In the example database, \( u(\alpha ) = u(\alpha ,S_{3} ) + u(\alpha ,S_{4} ) + u(\alpha ,S_{5} ) = 11 + 31 + 28 = 70 \). Given \( min\_util = 60 \), as \( u(\alpha ) \, > min\_util \), α is an HUSP. Consider the occurrence \( occ_{2} = < S_{4}^{1} ,S_{4}^{3} > \) of α in S4: the remaining sequential pattern is \( S_{4} /(\alpha , \, 3) = < \left\{ {be} \right\} > \), and \( ru(\alpha ,S_{4} /(\alpha , \, 3)) = 5 + 2 = 7 \). Since \( ru(\alpha ,S_{4} /(\alpha , \, 2)) = 6 + 12 + 5 + 2 = 25 \) is higher than \( ru(\alpha ,S_{4} /(\alpha , \, 3)) \), we have \( mru(\alpha ,S_{4} ) = 25 \).

Table 1. Example sequence database
Table 2. Profit table

3 HUSP-MRU Algorithm

3.1 The Matrix Structure

For HUSP-MRU, an item-utility matrix structure is defined for each sequence.

Definition 1.

Let \( S_{r} = \, < S_{r}^{1} ,S_{r}^{2} , \ldots ,S_{r}^{n} > \) be a sequence. The item-utility matrix of sequence \( S_{r} \) is an m × (2 × n) matrix denoted as \( M_{r} \), where m is the number of items in \( S_{r} \). Each row of \( M_{r} \), representing an item i, is composed of n pairs, each of the form \( p_{k} = < u_{k} \left( i \right), \, iru_{k} \left( i \right) > (1 \le k \le n) \), where \( u_{k}(i) \) is calculated by Eq. 1.

$$ u_{k} (i) = \left\{ {\begin{array}{*{20}r} \hfill {u(i,S_{r}^{k} ),i \in S_{r}^{k} } \\ \hfill {0,i \notin S_{r}^{k} } \\ \end{array} } \right. $$
(1)

Let “\( \prec \)” be the lexicographic order of items. If item \( j \prec i \) and there exists no item q such that \( \, j \prec q \prec i \) (i.e., j is the immediate predecessor of i), then \( iru_{k}(i) \) is calculated by Eq. 2.

$$ iru_{k} (i) = \left\{ {\begin{array}{*{20}c} {ru(i,S_{r} /( < i > ,k)),i \in S_{r}^{k} } \\ {iru_{k} (j) \, ,i \notin S_{r}^{k} \wedge k \ne 1} \\ {SU(S_{r} ),i \notin S_{r}^{k} \wedge k = 1} \\ \end{array} } \right. $$
(2)

According to Definition 1, for two items i and j, if \( j \prec i \) and there exists no item q such that \( \, j \prec q \prec i \), we have \( iru_{k}(j) = u_{k}(i) + iru_{k}(i) \), which facilitates the calculation of the item-utility matrix. Given the item-utility matrix, the maximal utility as well as the maximal remaining utility are recorded in the maximal utility matrix.
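The sketch below fills the (u_k, iru_k) cells of the item-utility matrix directly from the definitions on a tiny assumed sequence, and then checks the recurrence above; cells for items absent from an itemset are simplified to (0, None) rather than copying the predecessor's value as Eq. 2 does.

```python
def item_utility_matrix(seq, profit):
    """seq: list of dicts item -> quantity. Returns {(item, k): (u_k, iru_k)},
    computing iru_k(i) for i in S_r^k directly from the definitions; absent
    cells are left as (0, None) (a simplification of Eq. 2)."""
    items = sorted({i for X in seq for i in X})
    M = {}
    for k, X in enumerate(seq):
        later = sum(profit[j] * seq[d][j]
                    for d in range(k + 1, len(seq)) for j in seq[d])
        for i in items:
            u = profit[i] * X.get(i, 0)
            within = sum(profit[j] * X[j] for j in X if j > i)
            M[(i, k)] = (u, within + later) if i in X else (0, None)
    return M

seq = [{"a": 1, "b": 2}, {"b": 1, "c": 1}]       # assumed toy sequence
profit = {"a": 2, "b": 3, "c": 4}                # assumed unit profits
M = item_utility_matrix(seq, profit)
# The recurrence iru_k(a) = u_k(b) + iru_k(b) holds in itemset 0:
print(M[("a", 0)], M[("b", 0)])  # (2, 13) (6, 7)
```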

Definition 2.

Let \( S_{r} = \, < S_{r}^{1} ,S_{r}^{2} , \ldots ,S_{r}^{n} > \) be a sequence. The maximal utility matrix of sequence \( S_{r} \) is an m × 3 matrix, where m is the number of items in \( S_{r} \). In each row, the first element (denoted as i) is the item the row represents, the second element is u(i, \( S_{r} \)), and the third element is mru(i, \( S_{r} \)).

As the utility of a sequential pattern is based on the maximal utility among all occurrences, the maximal utility matrix structure is convenient for HUSP mining.
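One row of the maximal utility matrix can be derived by maximizing over the positions where the item occurs. A sketch on an assumed toy sequence (same encoding as above, ours rather than the paper's):

```python
def max_utility_row(i, seq, profit):
    """One row of the maximal utility matrix: (i, u(i, S_r), mru(i, S_r)).
    seq: list of dicts item -> quantity (assumed toy encoding)."""
    ks = [k for k, X in enumerate(seq) if i in X]
    u = max(profit[i] * seq[k][i] for k in ks)
    def ru(k):  # remaining utility of <i> matched at itemset k
        within = sum(profit[j] * seq[k][j] for j in seq[k] if j > i)
        later = sum(profit[j] * seq[d][j]
                    for d in range(k + 1, len(seq)) for j in seq[d])
        return within + later
    return (i, u, max(ru(k) for k in ks))

seq = [{"a": 1, "b": 2}, {"b": 1, "c": 1}]   # assumed toy sequence
profit = {"a": 2, "b": 3, "c": 4}            # assumed unit profits
print(max_utility_row("b", seq, profit))     # ('b', 6, 7)
```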

Consider S3 in Table 1. The item-utility matrix of S3 is shown in Table 3, and the maximal utility matrix of S3 is shown in Table 4.

Table 3. The item-utility matrix of S3
Table 4. The maximal utility matrix of S3

3.2 Lexicographic Sequential Pattern Tree

Given a sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \, \ldots ,X_{p} > \) and an item x, \( \alpha \diamondsuit x \) denotes the concatenation of α with x. It can be an I-extension, \( \alpha \diamondsuit_{I} x = \, < X_{1} ,X_{2} , \ldots ,X_{p} \cup x > \); or an S-extension, \( \alpha \diamondsuit_{S} x = < X_{1} ,X_{2} , \, \ldots ,X_{p} ,x > \). For example, <{ae}> is an I-extension of <a>, while <ae> is an S-extension of <a>.
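The two extension operators are a one-liner each when a pattern is encoded as a list of sets (our encoding, for illustration):

```python
def i_extension(alpha, x):
    """alpha concatenated by I-extension: add x to the last itemset."""
    return alpha[:-1] + [alpha[-1] | {x}]

def s_extension(alpha, x):
    """alpha concatenated by S-extension: append a new itemset {x}."""
    return alpha + [{x}]

a = [{"a"}]                                    # the pattern <a>
assert i_extension(a, "e") == [{"a", "e"}]     # <{ae}>
assert s_extension(a, "e") == [{"a"}, {"e"}]   # <ae>
```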

The search space of the HUSP-MRU can be represented as a lexicographic tree. A lexicographic sequential pattern tree (LSP-Tree) T is a tree structure composed as follows: (1) One empty root labeled as “< >”. (2) Each node N except the root consists of four fields: N.seq, N.ulist, N.u, and N.mru, where N.seq records the sequential pattern represented by N, N.ulist is composed of information for search space pruning (see Definition 3), N.u is u(N.seq, \( S_{r} \)), and N.mru is mru(N.seq, \( S_{r} \)), where \( S_{r} \in SDB \).

Given a node N in an LSP-Tree T, the sequential pattern represented by a child node N’ of N (i.e., N’.seq) is either an I-extension sequential pattern or an S-extension sequential pattern of N.seq.

Figure 1 shows part of the LSP-Tree of the sequence database in Table 1. All child nodes of a parent node are ordered lexicographically, with I-extension sequential patterns before S-extension sequential patterns.
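An LSP-Tree node maps naturally onto a small record type. The sketch below is ours; the children field is an addition for traversal and is not part of the paper's four-field node definition.

```python
from dataclasses import dataclass, field

@dataclass
class LSPNode:
    """A node of the LSP-Tree (sketch; children is our addition)."""
    seq: list                                   # N.seq, the sequential pattern
    ulist: list = field(default_factory=list)   # N.ulist (Definition 3)
    u: int = 0                                  # N.u
    mru: int = 0                                # N.mru
    children: list = field(default_factory=list)

root = LSPNode(seq=[])                          # the empty root "< >"
root.children.append(LSPNode(seq=[{"a"}], u=9)) # an assumed child node
print(len(root.children))  # 1
```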

Fig. 1. Partial LSP-Tree for the example sequence database in Table 1

Definition 3.

Let SDB be a sequence database and N be a node of LSP-Tree T representing sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \). Each entry of N.ulist corresponds to an element of \( g(\alpha ) = \{ S_{r} \in \varvec{SDB}|\alpha { \sqsubseteq }S_{r} \} \) and consists of five fields: SID, TID, u, ru and pointer, where SID is the sequence ID, TID is the ID of the itemset in which the extension item of α occurs, u and ru are the utility and remaining utility w.r.t. the enumerated occurrence, and pointer links to the next occurrence of α in sequence SID, or is null if there is none.
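A ulist entry can be sketched as a linked record; the field names are ours, and the utility/remaining-utility values below are the ones from the <ba> example in Sect. 2.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UListEntry:
    """One entry of N.ulist (Definition 3); field names are ours."""
    sid: int                  # SID: sequence ID
    tid: int                  # TID: itemset ID where the extension item occurs
    u: int                    # utility w.r.t. this occurrence
    ru: int                   # remaining utility w.r.t. this occurrence
    nxt: Optional["UListEntry"] = None  # pointer to the next occurrence in sid

# Two chained occurrences of <ba> in S4 (values from the Sect. 2 example):
second = UListEntry(sid=4, tid=3, u=22, ru=7)
first = UListEntry(sid=4, tid=2, u=31, ru=25, nxt=second)
print(first.nxt.tid)  # 3
```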

For the node N representing sequential pattern <ba>, the N.ulist is illustrated in Fig. 2, while N.u and N.mru are shown in Table 5.

Fig. 2. The ulist structure of <ba>

Table 5. N.u and N.mru of <ba>

3.3 Searching Strategies

After constructing the LSP-Tree, our algorithm traverses the search space from the root using a depth-first strategy. For a sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \), the HUSP-MRU algorithm checks its extension item to perform I-extension and S-extension, respectively. The item-utility matrix can be used to facilitate the sequential pattern extension operations. For I-extension, only the items in the entries below the extension item are checked; the set of all potential I-extension items is denoted as PIS(α). For S-extension, only the items in the entries after the itemset containing the extension item are checked; the set of all potential S-extension items is denoted as PSS(α).

Consider the extension of sequential pattern <{ab}> in S3, where item b is the extension item. As shown in Table 3, items c, d and e are larger than b according to the lexicographic order, so the entries of the last three rows of the matrix can be concatenated after b for I-extension. Specifically, only \( S_{3}^{2} \) includes {ab}, so only item c can be used to form an I-extension, and the result is <{abc}>. The utility of <{abc}> is u({ab}, S3) plus the newly added item’s utility \( u(c,S_{3}^{2} ) \), i.e., \( u\left( {\left\{ {abc} \right\},S_{3} } \right) = 8 + 8 = 16 \).

We again use the sequential pattern <{ab}> to illustrate S-extension. As shown in Table 3, the items that can be used for S-extension of <{ab}> are located in the third column of the item-utility matrix of S3. Thus, the sequential patterns <{ab}a>, <{ab}d> and <{ab}e> are the candidates. Take <{ab}a> for example: \( u\left( { < \left\{ {ab} \right\}a > ,S_{3} } \right) = u\left( {\left\{ {ab} \right\},S_{3} } \right) + u(a,S_{3}^{3} ) = 8 + 6 = 14 \).

3.4 Pruning Strategies

Without pruning, an LSP-Tree search will consider every node. The key to an efficient LSP-Tree search is to remove entire branches or unpromising nodes from consideration.

In the search space formed by the LSP-Tree, when a new node \( N_{\alpha} \) representing sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \) is constructed, we compute

$$ bu(\alpha ) = \sum\limits_{{\alpha { \sqsubseteq }S_{r} \wedge S_{r} \in SDB}} {(u(\alpha ,S_{r} ) + mru(\alpha ,S_{r} ))} $$
(3)

If bu(α) < min_util, then we stop traversing all children nodes of \( N_{\alpha} \); we call this pruning strategy branch pruning.
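The branch pruning test reduces to a sum of per-sequence (utility, maximal remaining utility) pairs. A sketch, using the <ba> utilities from Sect. 2 with assumed mru values for S3 and S5:

```python
def branch_pruned(per_seq, min_util):
    """per_seq: (u(alpha, S_r), mru(alpha, S_r)) for each sequence
    containing alpha. True means the whole subtree of N_alpha is skipped."""
    bu = sum(u + mru for u, mru in per_seq)
    return bu < min_util

# u values 11, 31, 28 are from the <ba> example; the mru values 0, 25, 0
# are assumptions (only mru(<ba>, S4) = 25 is given in the paper).
print(branch_pruned([(11, 0), (31, 25), (28, 0)], min_util=60))  # False
```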

When the node \( N_{\alpha} \) passes the branch pruning and can be used to form potential sequential patterns, the items in \( PIS(\alpha )\,{\text{or}}\,PSS(\alpha ) \) are checked one by one. Specifically, for each item \( i \in PIS(\alpha )\,{\text{or}}\,i \in PSS(\alpha ) \), we compute

$$ nu(\alpha ) = \sum\limits_{{\alpha { \sqsubseteq }S_{r} \wedge S_{r} \in SDB}} {(u(\alpha ',S_{r} ) + mru(i,S_{r} ))} $$
(4)

where \( \alpha^{\prime } \) is the result of the I-extension or S-extension of α with item i. If nu(α) < min_util, then α does not need to be extended with i; we call this pruning strategy node pruning.
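Node pruning has the same shape as branch pruning, but is applied per candidate extension item. A sketch with assumed values:

```python
def node_pruned(per_seq, min_util):
    """per_seq: (u(alpha', S_r), mru(i, S_r)) pairs, where alpha' is the
    I- or S-extension of alpha with item i. True: skip extending with i."""
    return sum(u + mru for u, mru in per_seq) < min_util

# Assumed values: this extension item looks unpromising and is skipped.
print(node_pruned([(10, 5), (8, 2)], min_util=60))  # True
```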

3.5 Algorithm Description

The proposed HUSP-MRU algorithm for mining HUSPs is described in Algorithm 1.

Algorithm 1. The HUSP-MRU algorithm

In Algorithm 1, Steps 1–3 perform the branch pruning strategy: if the current sequential pattern cannot be used to generate an HUSP, the algorithm returns to its parent sequential pattern. The loop (Steps 4–14) checks each item in PIS(α) for I-extension of α. Steps 5–8 perform the node pruning strategy. Then, the new sequential pattern \( \alpha_{I} \) generated by I-extension of α is constructed in Step 9. If the utility of the extension result is no lower than the minimum utility value, it is output as an HUSP in Step 11. Step 13 recursively invokes the procedure to go deeper into the LSP-Tree. Similarly, the loop (Steps 15–25) checks each item in PSS(α) for S-extension of α. Finally, Step 26 outputs all of the discovered high utility sequential patterns.
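To make the control flow concrete, the sketch below mines a tiny assumed database end to end. It deliberately simplifies the paper's design: candidate items come from a global item list rather than PIS/PSS, only branch pruning is applied (node pruning would be analogous), and utilities are recomputed from scratch instead of being read from the matrix structures.

```python
def occurrences(alpha, seq):
    """All occurrences of pattern alpha (list of sets) in seq (list of dicts)."""
    res, chosen = [], []
    def rec(p, start):
        if p == len(alpha):
            res.append(tuple(chosen)); return
        for j in range(start, len(seq)):
            if all(x in seq[j] for x in alpha[p]):
                chosen.append(j); rec(p + 1, j + 1); chosen.pop()
    rec(0, 0)
    return res

def u_in_seq(alpha, seq, profit):
    """u(alpha, S_r): maximum utility over all occurrences (0 if none)."""
    best = 0
    for occ in occurrences(alpha, seq):
        best = max(best, sum(profit[x] * seq[j][x]
                             for X, j in zip(alpha, occ) for x in X))
    return best

def mru_in_seq(alpha, seq, profit):
    """mru(alpha, S_r): maximal remaining utility over all occurrences."""
    ext, best = max(alpha[-1]), 0
    for occ in occurrences(alpha, seq):
        jp = occ[-1]
        within = sum(profit[x] * seq[jp][x] for x in seq[jp] if x > ext)
        later = sum(profit[x] * seq[d][x]
                    for d in range(jp + 1, len(seq)) for x in seq[d])
        best = max(best, within + later)
    return best

def mine(alpha, db, profit, min_util, items, out):
    supported = [s for s in db if occurrences(alpha, s)]
    if not supported:
        return
    per_seq = [(u_in_seq(alpha, s, profit), mru_in_seq(alpha, s, profit))
               for s in supported]
    if sum(u + m for u, m in per_seq) < min_util:   # branch pruning
        return
    u_total = sum(u for u, _ in per_seq)
    if u_total >= min_util:                         # output alpha as an HUSP
        out.append((tuple(tuple(sorted(X)) for X in alpha), u_total))
    for x in items:                                 # extend and recurse
        if x > max(alpha[-1]):                      # I-extension
            mine(alpha[:-1] + [alpha[-1] | {x}], db, profit, min_util, items, out)
        mine(alpha + [{x}], db, profit, min_util, items, out)  # S-extension

# Assumed toy database and profits, min_util = 8:
db = [[{"a": 1, "b": 2}, {"a": 2}], [{"b": 1}, {"a": 1, "c": 1}]]
profit = {"a": 3, "b": 2, "c": 5}
out = []
for x in sorted(profit):
    mine([{x}], db, profit, 8, sorted(profit), out)
print(sorted(out))
```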

4 Performance Evaluation

We evaluate the performance of our HUSP-MRU algorithm and compare it with the USpan [7] algorithm. The source code of USpan was downloaded from the SPMF data mining library [2].

4.1 Experimental Environment and Datasets

The experiments were performed on a computer with a 3.40 GHz CPU and 8 GB of memory, running 64-bit Microsoft Windows 7. Our program was written in Java. Four real datasets, downloaded from the SPMF data mining library [2], were used for evaluation; their characteristics are presented in Table 6.

Table 6. Characteristics of the datasets

Bible is moderately dense and contains many medium length sequences. BMS contains many short sequences. Kosarak10k is a sparse dataset that contains short sequences and a few very long sequences. SIGN is a dense dataset having very long sequences.

4.2 Running Time

We compare the runtime efficiency of the two algorithms. When measuring the runtime, we varied the minimum utility threshold for each dataset.

Figures 3, 4, 5 and 6 show the execution time comparisons on the four datasets. The HUSP-MRU algorithm is consistently faster than the USpan algorithm: on average, HUSP-MRU is faster than USpan by 25.16%, 12.40%, 14.81% and 28.71% on Bible, BMS, Kosarak10k and SIGN, respectively. The reason is that the branch pruning and node pruning strategies avoid unnecessary traversal of the search space. Furthermore, with the item-utility matrix and maximal utility matrix, the computation of the utilities of candidate sequential patterns is accelerated.

Fig. 3. Execution times on Bible

Fig. 4. Execution times on BMS

Fig. 5. Execution times on Kosarak10k

Fig. 6. Execution times on SIGN

4.3 Effect of Pruning Strategies

To demonstrate the benefit of branch pruning and node pruning, we compare the performance of HUSP-MRU with and without the pruning strategies. In Figs. 7, 8, 9 and 10, HUSP-MRU-B denotes the baseline implementation of HUSP-MRU without the pruning strategies.

Fig. 7. Execution times of HUSP-MRU with and without pruning strategies on Bible

Fig. 8. Execution times of HUSP-MRU with and without pruning strategies on BMS

Fig. 9. Execution times of HUSP-MRU with and without pruning strategies on Kosarak10k

Fig. 10. Execution times of HUSP-MRU with and without pruning strategies on SIGN

As shown in Figs. 7, 8, 9 and 10, the pruning strategies are beneficial to efficiency. On Bible and BMS, the implementation with the pruning strategies takes about half the time of the baseline algorithm. The most pronounced case is on SIGN, where HUSP-MRU is 5.08 times faster than HUSP-MRU-B on average.

5 Conclusions

In this paper, we proposed an HUSP mining algorithm called HUSP-MRU based on the maximal remaining utility. In the HUSP-MRU algorithm, matrix structures are used to store the information needed for utility computation. The search space is organized as an LSP-Tree and traversed in a depth-first manner. To improve the mining efficiency, branch and node pruning strategies are used. Experiments on four datasets show that the HUSP-MRU algorithm outperforms the USpan algorithm.