
1 Introduction

High utility sequential pattern (HUSP) mining [1, 7], which discovers sequential patterns with high utility in a sequence database, is an important topic in the field of utility pattern mining [3, 6] and has wide real-world applications [5, 8].

Different from traditional frequent sequential pattern mining [4], HUSP mining considers both the occurrence frequency and the profit of itemsets in a sequence. Thus, infrequent but high-profit sequential patterns can be discovered by mining HUSPs.

USpan [7] is a typical HUSP mining algorithm. For width pruning, USpan uses the sequence-weighted utilization as the upper bound, which may incur unnecessary utility computations. To improve efficiency, a new HUSP mining algorithm named HUSP-MRU is proposed, in which the maximal remaining utility is defined as a tighter upper bound. The item-utility matrix and maximal utility matrix structures are used to store maximal remaining utilities. The search space is represented by a lexicographic sequential pattern tree, and branch as well as node pruning strategies are used to avoid unnecessary traversal of unpromising nodes. Experimental results show that the HUSP-MRU algorithm is efficient.

2 Problem of HUSP Mining

Let \( \varvec{I} = \, \left\{ {i_{1} ,i_{2} , \ldots ,i_{m} } \right\} \) be a finite set of items; a set \( X \subseteq \varvec{I} \) is called an itemset. An itemset sequence \( S_{r} \) is an ordered list of itemsets \( < X_{1} ,X_{2} , \ldots ,X_{n} > \). Each itemset \( X_{d} \) in sequence \( S_{r} \) is denoted as \( S_{r}^{d} \). A sequence database SDB is a set of tuples of the form (sid, S), where sid is the ID of sequence S.

The internal utility \( q(i,S_{r}^{d} ) \) represents the quantity of item i in itemset \( S_{r}^{d} \) of sequence \( S_{r} \). The external utility p(i) is the unit profit value of item i. The utility of item i in itemset \( S_{r}^{d} \) is defined as \( u(i,S_{r}^{d} ) = p(i) \times q(i,S_{r}^{d} ) \). The utility of itemset X in \( S_{r}^{d} \) of sequence \( S_{r} \) is defined as \( u(X,S_{r}^{d} ) = \sum\limits_{i \in X} {u(i,S_{r}^{d} )} \).
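For illustration, these two definitions can be sketched in Python. The profit values p(a) = 6 and p(e) = 1 below are assumptions consistent with the worked example later in this section, not a full Table 2.

```python
# Sketch of the utility definitions; p(a) = 6 and p(e) = 1 are assumed
# values, and itemset_q plays the role of one itemset S_r^d.
profit = {"a": 6, "e": 1}        # external utility p(i)
itemset_q = {"a": 3, "e": 2}     # internal utility q(i, S_r^d)

def item_utility(i, q, p):
    """u(i, S_r^d) = p(i) * q(i, S_r^d)."""
    return p[i] * q[i]

def itemset_utility(X, q, p):
    """u(X, S_r^d): sum of the utilities of the items of X in the itemset."""
    return sum(item_utility(i, q, p) for i in X)

print(item_utility("a", itemset_q, profit))            # 18
print(itemset_utility({"a", "e"}, itemset_q, profit))  # 20
```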

A sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \, \ldots ,X_{p} > \) is called a subsequence of a sequence \( S_{r} = \, < S_{r}^{1} ,S_{r}^{2} , \ldots ,S_{r}^{n} > \), denoted as \( \alpha { \sqsubseteq }S_{r} \), if there exist integers \( 1 \le j_{1} < j_{2} < \ldots < j_{p} \le n \) such that \( X_{1} \subseteq S_{r}^{{j_{1} }} ,X_{2} \subseteq S_{r}^{{j_{2} }} , \ldots ,X_{p} \subseteq S_{r}^{{j_{p} }} \). We call the ordered list of itemsets \( < S_{r}^{{j_{1} }} ,S_{r}^{{j_{2} }} , \ldots ,S_{r}^{{j_{p} }} > \) an occurrence of α in \( S_{r} \). The set of all occurrences of α in \( S_{r} \) is denoted as OccSet(α, \( S_{r} \)).
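OccSet(α, \( S_{r} \)) can be enumerated with a simple backtracking scan. The sketch below is ours, not the paper's implementation; it encodes a sequence as a list of Python sets and returns occurrences as tuples of 0-based itemset indices.

```python
def occurrences(alpha, seq):
    """All occurrences of pattern alpha in sequence seq, as tuples of
    0-based itemset indices (j_1, ..., j_p) with j_1 < ... < j_p."""
    res, chosen = [], []
    def extend(p, start):
        if p == len(alpha):
            res.append(tuple(chosen))
            return
        for j in range(start, len(seq)):
            if alpha[p] <= seq[j]:          # X_p is a subset of S_r^j
                chosen.append(j)
                extend(p + 1, j + 1)
                chosen.pop()
    extend(0, 0)
    return res

# A toy sequence <{b}, {a}, {a,b}> and the pattern <ba>:
occs = occurrences([{"b"}, {"a"}], [{"b"}, {"a"}, {"a", "b"}])
print(occs)  # [(0, 1), (0, 2)]
```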

In this paper, the term sequence refers to the transactions of the sequence database, while the term sequential pattern refers to the subsequences extracted from these sequences.

Let \( occ = < S_{r}^{{j_{1} }} ,S_{r}^{{j_{2} }} , \ldots ,S_{r}^{{j_{p} }} > \) be an occurrence of \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \) in sequence \( S_{r} \). The utility of α with respect to occ is defined as \( u(\alpha ,occ) = \sum\limits_{i = 1}^{p} {u(X_{i} ,S_{r}^{{j_{i} }} )} \), and the utility of α in \( S_{r} \) is defined as \( u(\alpha ,S_{r} ) = max\{ u(\alpha ,occ) \, |occ \in OccSet(\alpha ,S_{r} )\} \). The sequence utility (SU) of a sequence \( S_{r} \) is defined as \( SU(S_{r} ) = u(S_{r} ,S_{r} ) \). The utility of a sequential pattern α in SDB is defined as \( u(\alpha ) = \sum\limits_{{\alpha { \sqsubseteq }S_{r} \wedge S_{r} \in SDB}} {u(\alpha ,S_{r} )} \).
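In other words, u(α, \( S_{r} \)) is a maximum over occurrences and u(α) is a sum over sequences. The snippet below replays this with the <ba> utilities from the worked example later in this section (31 and 22 for the two occurrences in S4; 11 and 28 for S3 and S5):

```python
# u(alpha, S_r) takes the maximum over occurrences; u(alpha) sums the
# per-sequence utilities (values taken from the <ba> example in Sect. 2).
occ_utils_in_S4 = [31, 22]               # u(alpha, occ) for the two occurrences
u_in_S4 = max(occ_utils_in_S4)           # u(alpha, S_4) = 31

per_sequence_utils = [11, u_in_S4, 28]   # u(alpha, S_r) for S_3, S_4, S_5
u_alpha = sum(per_sequence_utils)        # u(alpha) = 70
print(u_in_S4, u_alpha)  # 31 70
```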

The minimum utility threshold δ, specified by the user, is defined as a percentage of the total SU values of the database, and the minimum utility value is defined as \( min\_util = \delta \times \sum\limits_{Sr \in SDB} {SU(S_{r} )} \). A sequential pattern α is called an HUSP if u(α) ≥ min_util. Given a sequence database SDB, the task of HUSP mining is to discover all sequential patterns whose utilities are no less than min_util. As in the algorithms of [1, 7], the concept of remaining utility is also used in our algorithm to speed up the mining process.

Let \( occ = < S_{r}^{{j_{1} }} ,S_{r}^{{j_{2} }} , \ldots ,S_{r}^{{j_{p} }} > \) be an occurrence of \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \) in sequence \( S_{r} \). The last item of \( X_{p} \), say item i, is called the extension item of α. The remaining sequential pattern of \( S_{r} \) w.r.t. position \( j_{p} \), denoted as \( S_{r} /(\alpha ,j_{p} ) \), is defined as the subsequence of \( S_{r} \) from the item after i to the end of \( S_{r} \). The remaining utility of α w.r.t. \( S_{r} /(\alpha ,j_{p} ) \), denoted as \( ru(\alpha ,S_{r} /(\alpha ,j_{p} )) \), is the sum of the utilities of all items in \( S_{r} /(\alpha ,j_{p} ) \). The highest among all values of \( ru(\alpha ,S_{r} /(\alpha ,j_{p} )) \) is called the maximal remaining utility of α in \( S_{r} \), denoted as mru(α, \( S_{r} \)).
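The remaining utility can be computed by summing everything after the extension item. The sketch below uses an assumed toy sequence (itemsets as lists of (item, utility) pairs in lexicographic order); for a pattern <ba> with occurrences ending at itemsets 1 and 2, mru is the larger of the two remaining utilities.

```python
def remaining_utility(seq, jp, ext_item):
    """ru for an occurrence whose extension item ext_item matched in
    itemset jp (0-based): the utilities of the items after ext_item in
    seq[jp] plus everything in later itemsets."""
    items = [it for it, _ in seq[jp]]
    after = items.index(ext_item) + 1
    total = sum(u for _, u in seq[jp][after:])
    for d in range(jp + 1, len(seq)):
        total += sum(u for _, u in seq[d])
    return total

# Assumed toy sequence <{b}, {a,c}, {a,e}> with per-item utilities:
seq = [[("b", 5)], [("a", 10), ("c", 3)], [("a", 4), ("e", 2)]]
rus = [remaining_utility(seq, 1, "a"), remaining_utility(seq, 2, "a")]
print(rus, max(rus))  # [9, 2] 9  (the maximum is the mru)
```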

Consider the sequence database in Table 1 and the profit table in Table 2. In the example database, the utility of item a in the second itemset of S2 (i.e., \( S_{2}^{2} \)) is \( u(a,S_{2}^{2} ) = 6 \times 3 = 18 \), and the utility of itemset {ae} in \( S_{2}^{2} \) is \( u(\left\{ {ae} \right\},S_{2}^{2} ) = 18 + 2 = 20 \). Both \( occ_{1} = < S_{4}^{1} ,S_{4}^{2} > \) and \( occ_{2} = < S_{4}^{1} ,S_{4}^{3} > \) are occurrences of α = <ba> in S4, so \( u(\alpha ,S_{4} ) = max\{ u(\alpha ,occ_{1} ) \), \( u(\alpha ,occ_{2} )\} = max\left\{ {31, \, 22} \right\} = 31 \). In the example database, \( u(\alpha ) = u(\alpha ,S_{3} ) + u(\alpha ,S_{4} ) + u(\alpha ,S_{5} ) = 11 + 31 + 28 = 70 \). Given \( min\_util = 60 \), as \( u(\alpha ) \, > min\_util \), α is an HUSP. Consider the occurrence \( occ_{2} = < S_{4}^{1} ,S_{4}^{3} > \) of α in S4: the remaining sequential pattern is \( S_{4} /(\alpha , \, 3) = < \left\{ {be} \right\} > \), and \( ru(\alpha ,S_{4} /(\alpha , \, 3)) = 5 + 2 = 7 \). Since \( ru(\alpha ,S_{4} /(\alpha , \, 2)) = 6 + 12 + 5 + 2 = 25 \) is higher than \( ru(\alpha ,S_{4} /(\alpha , \, 3)) \), we have \( mru(\alpha ,S_{4} ) = 25 \).

Table 1. Example sequence database
Table 2. Profit table

3 HUSP-MRU Algorithm

3.1 The Matrix Structure

For HUSP-MRU, an item-utility matrix structure is defined for each sequence.

Definition 1.

Let \( S_{r} = \, < S_{r}^{1} ,S_{r}^{2} , \ldots ,S_{r}^{n} > \) be a sequence. The item-utility matrix of sequence \( S_{r} \) is an m × (2 × n) matrix denoted as \( M_{r} \), where m is the number of items in \( S_{r} \). Each row of \( M_{r} \), representing an item i, is composed of n pairs, each of the form \( p_{k} = < u_{k} \left( i \right), \, iru_{k} \left( i \right) > (1 \le k \le n) \), where \( u_{k}(i) \) is calculated by Eq. 1.

$$ u_{k} (i) = \left\{ {\begin{array}{*{20}r} \hfill {u(i,S_{r}^{k} ),i \in S_{r}^{k} } \\ \hfill {0,i \notin S_{r}^{k} } \\ \end{array} } \right. $$
(1)

Let “\( \prec \)” be the lexicographic order of items. If item \( j \prec i \) and there exists no item q such that \( \, j \prec q \prec i \) (i.e., j is the immediate predecessor of i), then \( iru_{k}(i) \) is calculated by Eq. 2.

$$ iru_{k} (i) = \left\{ {\begin{array}{*{20}c} {ru(i,S_{r} /( < i > ,k)),i \in S_{r}^{k} } \\ {iru_{k} (j) \, ,i \notin S_{r}^{k} \wedge k \ne 1} \\ {SU(S_{r} ),i \notin S_{r}^{k} \wedge k = 1} \\ \end{array} } \right. $$
(2)

According to Definition 1, for two items i and j, if \( j \prec i \) and there exists no item q such that \( \, j \prec q \prec i \), we have \( iru_{k}(j) = u_{k}(i) + iru_{k}(i) \), which facilitates the calculation of the item-utility matrix. Given the item-utility matrix, the maximal utility as well as the maximal remaining utility are recorded in the maximal utility matrix.
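The sketch below fills the (u_k, iru_k) cells of the item-utility matrix directly from the definitions on a tiny assumed sequence, and then checks the recurrence above; cells for items absent from an itemset are simplified to (0, None) rather than copying the predecessor's value as Eq. 2 does.

```python
def item_utility_matrix(seq, profit):
    """seq: list of dicts item -> quantity. Returns {(item, k): (u_k, iru_k)},
    computing iru_k(i) for i in S_r^k directly from the definitions; absent
    cells are left as (0, None) (a simplification of Eq. 2)."""
    items = sorted({i for X in seq for i in X})
    M = {}
    for k, X in enumerate(seq):
        later = sum(profit[j] * seq[d][j]
                    for d in range(k + 1, len(seq)) for j in seq[d])
        for i in items:
            u = profit[i] * X.get(i, 0)
            within = sum(profit[j] * X[j] for j in X if j > i)
            M[(i, k)] = (u, within + later) if i in X else (0, None)
    return M

seq = [{"a": 1, "b": 2}, {"b": 1, "c": 1}]       # assumed toy sequence
profit = {"a": 2, "b": 3, "c": 4}                # assumed unit profits
M = item_utility_matrix(seq, profit)
# The recurrence iru_k(a) = u_k(b) + iru_k(b) holds in itemset 0:
print(M[("a", 0)], M[("b", 0)])  # (2, 13) (6, 7)
```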

Definition 2.

Let \( S_{r} = \, < S_{r}^{1} ,S_{r}^{2} , \ldots ,S_{r}^{n} > \) be a sequence. The maximal utility matrix of sequence \( S_{r} \) is an m × 3 matrix, where m is the number of items in \( S_{r} \). In each row, the first element (denoted as i) is the item the row represents, the second element is u(i, \( S_{r} \)), and the third element is mru(i, \( S_{r} \)).

As the utility of a sequential pattern is based on the maximal utility among all occurrences, the maximal utility matrix structure is convenient for HUSP mining.
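One row of the maximal utility matrix can be derived by maximizing over the positions where the item occurs. A sketch on an assumed toy sequence (same encoding as above, ours rather than the paper's):

```python
def max_utility_row(i, seq, profit):
    """One row of the maximal utility matrix: (i, u(i, S_r), mru(i, S_r)).
    seq: list of dicts item -> quantity (assumed toy encoding)."""
    ks = [k for k, X in enumerate(seq) if i in X]
    u = max(profit[i] * seq[k][i] for k in ks)
    def ru(k):  # remaining utility of <i> matched at itemset k
        within = sum(profit[j] * seq[k][j] for j in seq[k] if j > i)
        later = sum(profit[j] * seq[d][j]
                    for d in range(k + 1, len(seq)) for j in seq[d])
        return within + later
    return (i, u, max(ru(k) for k in ks))

seq = [{"a": 1, "b": 2}, {"b": 1, "c": 1}]   # assumed toy sequence
profit = {"a": 2, "b": 3, "c": 4}            # assumed unit profits
print(max_utility_row("b", seq, profit))     # ('b', 6, 7)
```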

Consider S3 in Table 1. The item-utility matrix of S3 is shown in Table 3, and the maximal utility matrix of S3 is shown in Table 4.

Table 3. The item-utility matrix of S3
Table 4. The maximal utility matrix of S3

3.2 Lexicographic Sequential Pattern Tree

Given a sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \, \ldots ,X_{p} > \) and an item x, \( \alpha \diamondsuit x \) denotes the concatenation of α with x. It can be an I-extension, \( \alpha \diamondsuit_{I} x = \, < X_{1} ,X_{2} , \ldots ,X_{p} \cup x > \); or an S-extension, \( \alpha \diamondsuit_{S} x = < X_{1} ,X_{2} , \, \ldots ,X_{p} ,x > \). For example, <{ae}> is an I-extension of <a>, while <ae> is an S-extension of <a>.
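The two extension operators are a one-liner each when a pattern is encoded as a list of sets (our encoding, for illustration):

```python
def i_extension(alpha, x):
    """alpha concatenated by I-extension: add x to the last itemset."""
    return alpha[:-1] + [alpha[-1] | {x}]

def s_extension(alpha, x):
    """alpha concatenated by S-extension: append a new itemset {x}."""
    return alpha + [{x}]

a = [{"a"}]                                    # the pattern <a>
assert i_extension(a, "e") == [{"a", "e"}]     # <{ae}>
assert s_extension(a, "e") == [{"a"}, {"e"}]   # <ae>
```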

The search space of the HUSP-MRU can be represented as a lexicographic tree. A lexicographic sequential pattern tree (LSP-Tree) T is a tree structure composed as follows: (1) One empty root labeled as “< >”. (2) Each node N except the root consists of four fields: N.seq, N.ulist, N.u, and N.mru, where N.seq records the sequential pattern represented by N, N.ulist is composed of information for search space pruning (see Definition 3), N.u is u(N.seq, \( S_{r} \)), and N.mru is mru(N.seq, \( S_{r} \)), where \( S_{r} \in SDB \).

Given a node N in an LSP-Tree T, the sequential pattern represented by a child node N’ of N (i.e., N’.seq) is either an I-extension sequential pattern or an S-extension sequential pattern of N.seq.

Figure 1 shows part of the LSP-Tree of the sequence database in Table 1. All child nodes of a parent node are ordered lexicographically, with I-extension sequential patterns before S-extension sequential patterns.
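An LSP-Tree node maps naturally onto a small record type. The sketch below is ours; the children field is an addition for traversal and is not part of the paper's four-field node definition.

```python
from dataclasses import dataclass, field

@dataclass
class LSPNode:
    """A node of the LSP-Tree (sketch; children is our addition)."""
    seq: list                                   # N.seq, the sequential pattern
    ulist: list = field(default_factory=list)   # N.ulist (Definition 3)
    u: int = 0                                  # N.u
    mru: int = 0                                # N.mru
    children: list = field(default_factory=list)

root = LSPNode(seq=[])                          # the empty root "< >"
root.children.append(LSPNode(seq=[{"a"}], u=9)) # an assumed child node
print(len(root.children))  # 1
```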

Fig. 1. Partial LSP-Tree for the example sequence database in Table 1

Definition 3.

Let SDB be a sequence database and N be a node of LSP-Tree T representing sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \). Each entry of N.ulist corresponds to an element of \( g(\alpha ) = \{ S_{r} \in \varvec{SDB}|\alpha { \sqsubseteq }S_{r} \} \) and consists of five fields: SID, TID, u, ru and pointer, where SID is the sequence ID, TID is the ID of the itemset in which the extension item of α occurs, u and ru are the utility and remaining utility w.r.t. the enumerated occurrence, and pointer links to the next occurrence of α in sequence SID, or is null if there is none.
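A ulist entry can be sketched as a linked record; the field names are ours, and the utility/remaining-utility values below are the ones from the <ba> example in Sect. 2.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UListEntry:
    """One entry of N.ulist (Definition 3); field names are ours."""
    sid: int                  # SID: sequence ID
    tid: int                  # TID: itemset ID where the extension item occurs
    u: int                    # utility w.r.t. this occurrence
    ru: int                   # remaining utility w.r.t. this occurrence
    nxt: Optional["UListEntry"] = None  # pointer to the next occurrence in sid

# Two chained occurrences of <ba> in S4 (values from the Sect. 2 example):
second = UListEntry(sid=4, tid=3, u=22, ru=7)
first = UListEntry(sid=4, tid=2, u=31, ru=25, nxt=second)
print(first.nxt.tid)  # 3
```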

For the node N representing sequential pattern <ba>, the N.ulist is illustrated in Fig. 2, while N.u and N.mru are shown in Table 5.

Fig. 2. The ulist structure of <ba>

Table 5. N.u and N.mru of <ba>

3.3 Searching Strategies

After constructing the LSP-Tree, our algorithm traverses the search space from the root using a depth-first strategy. For a sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \), the HUSP-MRU algorithm checks its extension item to perform I-extension and S-extension, respectively. The item-utility matrix can be used to facilitate the sequential pattern extension operations. For I-extension, only the items in the entries below the extension item are checked; the set of all potential I-extension items is denoted as PIS(α). For S-extension, only the items in the entries after the itemset containing the extension item are checked; the set of all potential S-extension items is denoted as PSS(α).

Consider the extension of sequential pattern <{ab}> in S3, where item b is the extension item. As shown in Table 3, items c, d and e are larger than b according to the lexicographic order, so the entries of the last three rows of the matrix can be concatenated after b for I-extension. Specifically, only \( S_{3}^{2} \) includes {ab}, so only item c can be used to form an I-extension, and the result is <{abc}>. The utility of <{abc}> is u({ab}, S3) plus the newly added item’s utility \( u(c,S_{3}^{2} ) \), i.e., \( u\left( {\left\{ {abc} \right\},S_{3} } \right) = 8 + 8 = 16 \).

We again use the sequential pattern <{ab}> to illustrate S-extension. As shown in Table 3, the items that can be used for S-extension of <{ab}> are located in the third column of the item-utility matrix of S3. Thus, the sequential patterns <{ab}a>, <{ab}d> and <{ab}e> are the candidates. Take <{ab}a> for example: \( u\left( { < \left\{ {ab} \right\}a > ,S_{3} } \right) = u\left( {\left\{ {ab} \right\},S_{3} } \right) + u(a,S_{3}^{3} ) = 8 + 6 = 14 \).

3.4 Pruning Strategies

Without pruning, an LSP-Tree search will consider every node. The key to an efficient LSP-Tree search is to remove entire branches or unpromising nodes from consideration.

In the search space formed by the LSP-Tree, when a new node \( N_{\alpha} \) representing sequential pattern \( \alpha = \, < X_{1} ,X_{2} , \ldots ,X_{p} > \) is constructed, we compute

$$ bu(\alpha ) = \sum\limits_{{\alpha { \sqsubseteq }S_{r} \wedge S_{r} \in SDB}} {(u(\alpha ,S_{r} ) + mru(\alpha ,S_{r} ))} $$
(3)

If bu(α) < min_util, then we stop traversing all children nodes of \( N_{\alpha} \); we call this pruning strategy branch pruning.
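The branch pruning test reduces to a sum of per-sequence (utility, maximal remaining utility) pairs. A sketch, using the <ba> utilities from Sect. 2 with assumed mru values for S3 and S5:

```python
def branch_pruned(per_seq, min_util):
    """per_seq: (u(alpha, S_r), mru(alpha, S_r)) for each sequence
    containing alpha. True means the whole subtree of N_alpha is skipped."""
    bu = sum(u + mru for u, mru in per_seq)
    return bu < min_util

# u values 11, 31, 28 are from the <ba> example; the mru values 0, 25, 0
# are assumptions (only mru(<ba>, S4) = 25 is given in the paper).
print(branch_pruned([(11, 0), (31, 25), (28, 0)], min_util=60))  # False
```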

When the node \( N_{\alpha} \) passes the branch pruning and can be used to form potential sequential patterns, the items in \( PIS(\alpha )\,{\text{or}}\,PSS(\alpha ) \) are checked one by one. Specifically, for each item \( i \in PIS(\alpha )\,{\text{or}}\,i \in PSS(\alpha ) \), we compute

$$ nu(\alpha ) = \sum\limits_{{\alpha { \sqsubseteq }S_{r} \wedge S_{r} \in SDB}} {(u(\alpha ',S_{r} ) + mru(i,S_{r} ))} $$
(4)

where \( \alpha^{\prime } \) is the result of the I-extension or S-extension of α with item i. If nu(α) < min_util, then α does not need to be extended with i; we call this pruning strategy node pruning.
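Node pruning has the same shape as branch pruning, but is applied per candidate extension item. A sketch with assumed values:

```python
def node_pruned(per_seq, min_util):
    """per_seq: (u(alpha', S_r), mru(i, S_r)) pairs, where alpha' is the
    I- or S-extension of alpha with item i. True: skip extending with i."""
    return sum(u + mru for u, mru in per_seq) < min_util

# Assumed values: this extension item looks unpromising and is skipped.
print(node_pruned([(10, 5), (8, 2)], min_util=60))  # True
```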

3.5 Algorithm Description

The proposed HUSP-MRU algorithm for mining HUSPs is described in Algorithm 1.

Algorithm 1. The HUSP-MRU algorithm

In Algorithm 1, Steps 1–3 perform the branch pruning strategy: if the current sequential pattern cannot be used to generate an HUSP, the algorithm returns to its parent sequential pattern. The loop (Steps 4–14) checks each item in PIS(α) for I-extension of α. Steps 5–8 perform the node pruning strategy. Then, the new sequential pattern \( \alpha_{I} \) generated by I-extension of α is constructed in Step 9. If the utility of the extension result is no lower than the minimum utility value, it is output as an HUSP in Step 11. Step 13 recursively invokes the procedure to go deeper into the LSP-Tree. Similarly, the loop (Steps 15–25) checks each item in PSS(α) for S-extension of α. Finally, Step 26 outputs all of the discovered high utility sequential patterns.
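To make the control flow concrete, the sketch below mines a tiny assumed database end to end. It deliberately simplifies the paper's design: candidate items come from a global item list rather than PIS/PSS, only branch pruning is applied (node pruning would be analogous), and utilities are recomputed from scratch instead of being read from the matrix structures.

```python
def occurrences(alpha, seq):
    """All occurrences of pattern alpha (list of sets) in seq (list of dicts)."""
    res, chosen = [], []
    def rec(p, start):
        if p == len(alpha):
            res.append(tuple(chosen)); return
        for j in range(start, len(seq)):
            if all(x in seq[j] for x in alpha[p]):
                chosen.append(j); rec(p + 1, j + 1); chosen.pop()
    rec(0, 0)
    return res

def u_in_seq(alpha, seq, profit):
    """u(alpha, S_r): maximum utility over all occurrences (0 if none)."""
    best = 0
    for occ in occurrences(alpha, seq):
        best = max(best, sum(profit[x] * seq[j][x]
                             for X, j in zip(alpha, occ) for x in X))
    return best

def mru_in_seq(alpha, seq, profit):
    """mru(alpha, S_r): maximal remaining utility over all occurrences."""
    ext, best = max(alpha[-1]), 0
    for occ in occurrences(alpha, seq):
        jp = occ[-1]
        within = sum(profit[x] * seq[jp][x] for x in seq[jp] if x > ext)
        later = sum(profit[x] * seq[d][x]
                    for d in range(jp + 1, len(seq)) for x in seq[d])
        best = max(best, within + later)
    return best

def mine(alpha, db, profit, min_util, items, out):
    supported = [s for s in db if occurrences(alpha, s)]
    if not supported:
        return
    per_seq = [(u_in_seq(alpha, s, profit), mru_in_seq(alpha, s, profit))
               for s in supported]
    if sum(u + m for u, m in per_seq) < min_util:   # branch pruning
        return
    u_total = sum(u for u, _ in per_seq)
    if u_total >= min_util:                         # output alpha as an HUSP
        out.append((tuple(tuple(sorted(X)) for X in alpha), u_total))
    for x in items:                                 # extend and recurse
        if x > max(alpha[-1]):                      # I-extension
            mine(alpha[:-1] + [alpha[-1] | {x}], db, profit, min_util, items, out)
        mine(alpha + [{x}], db, profit, min_util, items, out)  # S-extension

# Assumed toy database and profits, min_util = 8:
db = [[{"a": 1, "b": 2}, {"a": 2}], [{"b": 1}, {"a": 1, "c": 1}]]
profit = {"a": 3, "b": 2, "c": 5}
out = []
for x in sorted(profit):
    mine([{x}], db, profit, 8, sorted(profit), out)
print(sorted(out))
```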

4 Performance Evaluation

We evaluate the performance of our HUSP-MRU algorithm and compare it with the USpan [7] algorithm. The source code of USpan was downloaded from the SPMF data mining library [2].

4.1 Experimental Environment and Datasets

The experiments were performed on a computer with a 3.40 GHz CPU and 8 GB of memory, running 64-bit Microsoft Windows 7. Our program was written in Java. Four real datasets, downloaded from the SPMF data mining library [2], were used for evaluation; their characteristics are presented in Table 6.

Table 6. Characteristics of the datasets

Bible is moderately dense and contains many medium length sequences. BMS contains many short sequences. Kosarak10k is a sparse dataset that contains short sequences and a few very long sequences. SIGN is a dense dataset having very long sequences.

4.2 Running Time

We compare the runtime efficiency of the two algorithms. When measuring the runtime, we varied the minimum utility threshold for each dataset.

Figures 3, 4, 5 and 6 show the execution time comparisons on the four datasets. The HUSP-MRU algorithm is consistently faster than the USpan algorithm: on average, HUSP-MRU is faster than USpan by 25.16%, 12.40%, 14.81% and 28.71% on Bible, BMS, Kosarak10k and SIGN, respectively. The reason is that the branch pruning and node pruning strategies avoid unnecessary traversal of the search space. Furthermore, with the item-utility matrix and maximal utility matrix, the computation of the utilities of candidate sequential patterns is accelerated.

Fig. 3. Execution times on Bible

Fig. 4. Execution times on BMS

Fig. 5. Execution times on Kosarak10k

Fig. 6. Execution times on SIGN

4.3 Effect of Pruning Strategies

To demonstrate the benefit of branch pruning and node pruning, we compare the performance of HUSP-MRU with and without the pruning strategies. In Figs. 7, 8, 9 and 10, HUSP-MRU-B denotes the baseline implementation of HUSP-MRU without the pruning strategies.

Fig. 7. Execution times of HUSP-MRU with and without pruning strategies on Bible

Fig. 8. Execution times of HUSP-MRU with and without pruning strategies on BMS

Fig. 9. Execution times of HUSP-MRU with and without pruning strategies on Kosarak10k

Fig. 10. Execution times of HUSP-MRU with and without pruning strategies on SIGN

As shown in Figs. 7, 8, 9 and 10, the pruning strategies are beneficial to efficiency. On Bible and BMS, the implementation with the pruning strategies takes about half the time of the baseline algorithm. The most pronounced case is on SIGN, where HUSP-MRU is 5.08 times faster than HUSP-MRU-B on average.

5 Conclusions

In this paper, we proposed an HUSP mining algorithm called HUSP-MRU based on the maximal remaining utility. In the HUSP-MRU algorithm, matrix structures are used to store the information needed for utility computation. The search space is organized as an LSP-Tree and traversed in a depth-first manner. To improve the mining efficiency, branch and node pruning strategies are used. Experiments on four datasets show that the HUSP-MRU algorithm outperforms the USpan algorithm.