
1 Introduction

The problem of sequential pattern mining was first introduced by Agrawal and Srikant in 1995 [2]. Since then, many approaches and algorithms have been proposed to solve it. However, finding an effective method is still challenging. Recently, hybrid approaches using the DiffNodeSets [10] and N-List [9] data structures have been reported as very efficient for mining frequent itemsets. But can those approaches be applied to mining patterns with a sequential order? To the best of our knowledge, no previous work has been based on hybrid approaches using those data structures. Itemset patterns are easier to handle because each item appears at most once in each transaction of the database, and the order of items within an itemset can be fixed by the user. Sequential patterns, on the other hand, consist of multiple transactions in sequential or temporal order. Thus, each item can appear more than once in a sequence, across various transactions, and in an order that users cannot predict.

In this paper, we propose the SMUB algorithm to tackle a part of the sequential pattern mining problem by solving clickstream pattern mining, a special case of sequential pattern mining. SMUB is a hybrid approach based on the B-List, an extension of the N-List data structure. B-Lists are generated from an SPPC-tree. Our experiments on various datasets show that SMUB is more efficient than the recent state-of-the-art algorithm CM-Spade [11] with respect to runtime, especially on huge datasets with low minimum support thresholds.

This paper is organized as follows. In Sect. 2, we describe basic concepts. In Sect. 3, we review related work. In Sect. 4, we introduce the SPPC-tree and related definitions. In Sect. 5, we present the B-List and the SMUB algorithm for clickstream pattern mining. In Sect. 6, we present our experiments. In Sect. 7, we conclude our study and outline future work.

2 Basic Concepts

Let \( I = \left\{ { i_{1} , i_{2} , \ldots , i_{j} } \right\} \) be a set of distinct elements; each element is called an item. A sequence is an ordered list of items. A clickstream sequence \( S \) is denoted as \( \left\langle {s_{1} ,s_{2} , \ldots ,s_{q} } \right\rangle \), where \( s_{p} \in I \) \( \left( {1 \le p \le q} \right) \) is an item. The number of items in a clickstream is called its size or length, and a clickstream sequence of length \( k \) is called a \( k \)-sequence. A clickstream sequence \( S_{a} = \left\langle {a_{1} ,a_{2} , \ldots ,a_{n} } \right\rangle \) is a subsequence of another clickstream sequence \( S_{b} = \left\langle {b_{1} ,b_{2} , \ldots ,b_{m} } \right\rangle \), denoted by \( S_{a} \subseteq S_{b} \), if there exist integers \( 1 \le x_{1} < x_{2} < \cdots < x_{n} \le m \) such that \( a_{t} = b_{x_{t}} \) for all \( 1 \le t \le n \). In that case, \( S_{b} \) is called a super-sequence of \( S_{a} \).

A clickstream sequence database SDB is a collection of clickstream sequences, each with a unique identifier (called sid). The support of a clickstream pattern \( P \) is the number of clickstreams in SDB that are super-sequences of \( P \). Given a threshold, a clickstream sequence is a frequent clickstream pattern if its support is greater than or equal to that threshold. The clickstream pattern mining task is to discover all frequent clickstream patterns in SDB.
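
To make these definitions concrete, the following Java sketch checks the subsequence relation and counts the support of a pattern in a clickstream database represented as a list of integer arrays. The helper names are illustrative and the small database in the example is made up, not the contents of Table 1.

```java
import java.util.Arrays;
import java.util.List;

public class ClickstreamBasics {

    // Returns true if 'sub' appears in 'seq' with its item order preserved.
    static boolean isSubsequence(int[] sub, int[] seq) {
        int i = 0;
        for (int item : seq) {
            if (i < sub.length && sub[i] == item) {
                i++;
            }
        }
        return i == sub.length;
    }

    // Support of a pattern: the number of sequences in the database that are
    // super-sequences of the pattern.
    static int support(int[] pattern, List<int[]> sdb) {
        int count = 0;
        for (int[] sequence : sdb) {
            if (isSubsequence(pattern, sequence)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A small made-up database over the items {1, 2, 5}; not the paper's Table 1.
        List<int[]> sdb = Arrays.asList(
                new int[]{2, 5, 1, 5},
                new int[]{2, 5, 1},
                new int[]{2, 1},
                new int[]{5, 2, 5});
        System.out.println(support(new int[]{2, 1}, sdb)); // prints 3
    }
}
```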

3 Related Work

Several algorithms have been proposed for sequential pattern mining, such as AprioriAll [2], GSP [3] and SPADE [5]. All of them find sequential patterns with a generate-and-test candidate approach, which consumes a lot of time and memory. PrefixSpan [6], FUSP [7] and Sequential Pattern Tree [8] do not generate any candidate sequences, but their tree structures are complex; thus, they create many projected databases and must completely scan these projected databases in order to find new sequential patterns.

The SPADE algorithm [5] identifies all frequent items (viz., 1-sequences) at the beginning, converts the database to a vertical format, and identifies the remaining sequential patterns by BFS or DFS based on the lattice decomposition concept. Experiments showed it to be more efficient than the GSP algorithm. However, SPADE needs to convert the database from horizontal to vertical format, so the memory required to store the database increases and can even exceed the size of the original database.

In 2008, Lin et al. proposed the FUSP-tree [7] data structure and its maintenance algorithm for mining sequential patterns in incremental databases. An FUSP-tree consists of one root node and a set of prefix subtrees as the children branches of the root. Each node in the prefix subtrees contains three values: item-name indicates the item the node represents, count is the number of sequences represented by the portion of the path reaching the node, and node-link points to the next node with the same item in another branch of the FUSP-tree. The FUSP-tree also maintains a Header-Table which stores the frequent items, their counts, and a link to the first occurrence of each item in the tree. This table assists in finding the appropriate items or sequences in the tree.

Fournier-Viger et al. proposed CM-Spade in 2014 [11]. In that work, they proposed the CMAP data structure to store co-occurrence information of items and used it as a candidate pruning mechanism. Essentially, CM-Spade integrates the CMAP data structure into the SPADE algorithm. It was reported to outperform previous algorithms such as SPADE and SPAM. However, CM-Spade still spends much time evaluating candidates that do not exist in the sequence database.

Quite a few efficient algorithms have recently been proposed for mining frequent itemsets from transaction databases [1], such as FP-growth [4], N-List [9] and DiffNodeSets [10].

In 2012, Deng et al. proposed the PrePost [9] algorithm. PrePost is based on the N-List structure, which is generated from a PPC-tree, a structure for representing transaction databases that preserves all the information about itemsets. By combining the candidate-generation-and-test approach with the approach of mining frequent itemsets directly without candidate generation, PrePost was reported to be an efficient algorithm for mining frequent itemsets. A PPC-tree consists of a root node and a set of children nodes, and each node has five properties: item-name, count, children-list, pre-order, and post-order. Item-name records which item the node represents, count records the number of transactions represented by the portion of the path reaching the node, children-list records all children of the node, pre-order is the pre-order rank of the node, and post-order is the post-order rank of the node. The PPC-tree structure is similar to an FP-tree [4].

4 SPPC-Tree Structure

Definition 1

An SPPC-tree is a tree data structure consisting of a root and a set of item prefix subtrees as the children of the root. Each node of the tree consists of eight fields: item-name, count, first-child, first-father, right-sibling, label-sibling, pre-order, and post-order. Item-name is the item that the node represents. Count is the number of sequences that share the path reaching the node. First-child is a list containing the first children of the node. First-father is the first previous node on the path reaching the node from the root. Right-sibling is the first sibling node of the current node. Label-sibling is a list of nodes that have the same item-name, even if they lie in different branches of the tree. Pre-order is the rank generated by a pre-order traversal of the tree, and post-order is the rank generated by a post-order traversal. The SPPC-tree is derived from the PPC-tree [9]; however, there are two differences between them:

  1. The support of a frequent item is not the sum of the counts of all nodes with the same item-name in the SPPC-tree.

  2. An item can appear in more than one node within the same branch of the SPPC-tree.
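
A minimal sketch of the node structure of Definition 1 is given below. The field types are assumptions on our part: the first-child and right-sibling pointers are flattened into a single child list, and a single pre-order and post-order rank per node is assumed, matching the nodes shown in Fig. 1e (e.g. (0,4)2:3).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an SPPC-tree node following Definition 1 (field types are assumptions).
class SPPCNode {
    int itemName;          // the item this node represents
    int count;             // number of sequences sharing the path that ends at this node
    SPPCNode firstFather;  // the first previous node on the path from the root
    SPPCNode labelSibling; // next node with the same item-name, possibly in another branch
    int preOrder;          // rank assigned by a pre-order traversal of the tree
    int postOrder;         // rank assigned by a post-order traversal of the tree
    final List<SPPCNode> children = new ArrayList<>(); // first-child/right-sibling flattened

    SPPCNode(int itemName, int count) {
        this.itemName = itemName;
        this.count = count;
    }
}
```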

Based on Definition 1, an SPPC-tree can be built by the following algorithm.

Algorithm 1 (SPPC-tree construction)

For example, assume we use the example sequence database SDB in Table 1 with minimum support threshold ξ = 0.5. First, we convert the minimum support from a ratio to an absolute count: 4 * 0.5 = 2. Then, we scan SDB to find the frequent items whose support counts are greater than or equal to this count. The resulting set is SP1 = {<1>,<2>,<5>} together with their support counts. With all infrequent items eliminated, we obtain the transformed sequence database shown in Table 2.
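
This first scan could be sketched as follows; it is a minimal version assuming the database is held as a list of integer arrays and the threshold is supplied as a ratio, with illustrative names only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstScan {
    // First database scan: count, for each item, the number of sequences containing it,
    // then keep only the items whose count reaches the absolute threshold derived from
    // the minimum support ratio (e.g. 4 * 0.5 = 2 for the example SDB).
    static Map<Integer, Integer> frequentItems(List<int[]> sdb, double minSupRatio) {
        int threshold = (int) Math.ceil(sdb.size() * minSupRatio);
        Map<Integer, Integer> supportCount = new HashMap<>();
        for (int[] sequence : sdb) {
            // count each item once per sequence, even if it occurs several times
            Arrays.stream(sequence).distinct()
                  .forEach(item -> supportCount.merge(item, 1, Integer::sum));
        }
        supportCount.values().removeIf(count -> count < threshold);
        return supportCount;
    }
}
```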

Table 1. A clickstream sequence database
Table 2. The new sequence database with infrequent items already removed

Based on the transformed database, we build an SPPC-tree as follows. First, we create an empty node and assign it as the root; then we add sequence 100 to the tree. The adding process starts at the root node. From there, a node is created for each item in the sequence and appended to the tree in sequential order: the first item of the sequence is appended to the root, the second to the first node, and so on. The tree looks like Fig. 1a after sequence 100 is added. Next, we add sequence 200 to the tree. Because the subsequence <2,5,1> was already added to the tree while adding sequence 100, we increase the count of each shared node; the remaining items are processed in the same way as for sequence 100. After sequence 200 is added, the tree looks like Fig. 1b, and so on. However, sequence 400 does not start with the same item as the previous sequences. Thus, we create a new branch and add each item of this sequence to the tree as we did for sequence 100. The tree then looks like Fig. 1d. Considering the node 2:2, this notation means that it is a node of item 2 and its support count is 2.
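
The insertion step described above might look like the sketch below. This is our reading of Algorithm 1, reusing the SPPCNode sketch from the previous subsection; the maintenance of label-sibling links is omitted for brevity.

```java
class SPPCInsertion {
    // Insert one already-filtered sequence under 'root', reusing an existing child when it
    // carries the same item (the shared-prefix case) and creating a new node otherwise.
    static void insertSequence(SPPCNode root, int[] sequence) {
        SPPCNode current = root;
        for (int item : sequence) {
            SPPCNode child = null;
            for (SPPCNode c : current.children) {
                if (c.itemName == item) {
                    child = c;
                    break;
                }
            }
            if (child == null) {
                child = new SPPCNode(item, 0);
                child.firstFather = current;
                current.children.add(child);
            }
            child.count++;   // one more sequence shares the path ending at this node
            current = child;
        }
    }
}
```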

Fig. 1. Step-by-step SPPC-tree construction: (a) after adding sequence 100; (b) after adding sequence 200; (c) after adding sequence 300; (d) after adding sequence 400; (e) after adding the pre-order and post-order ranks

After adding all the sequences of the SDB in Table 2 to the tree, we traverse the tree using depth-first search (DFS) to assign the pre-order and post-order ranks to each node. The tree then looks like Fig. 1e, which depicts the final tree built from the SDB in Table 2 after executing Algorithm 1. The node (0,4)2:3 means that this is a node of item 2, its count is 3, and its pre-order and post-order ranks are 0 and 4, respectively.
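
A sketch of this traversal follows: a single DFS with two counters. Whether the root itself receives a rank is not stated in the text, so it is skipped here, which matches the fact that the first item node in Fig. 1e has pre-order 0.

```java
class SPPCOrdering {
    // Assign pre-order ranks on the way down and post-order ranks on the way up.
    static void assignOrders(SPPCNode root) {
        int[] pre = {0};
        int[] post = {0};
        for (SPPCNode child : root.children) {
            dfs(child, pre, post);
        }
    }

    private static void dfs(SPPCNode node, int[] pre, int[] post) {
        node.preOrder = pre[0]++;            // numbered on the way down
        for (SPPCNode child : node.children) {
            dfs(child, pre, post);
        }
        node.postOrder = post[0]++;          // numbered on the way up
    }
}
```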

5 Sequential Pattern Mining Using B-Lists

In this section, we describe the idea and the steps of our proposed SMUB algorithm (sequential clickstream mining using B-Lists). SMUB is a hybrid approach for mining frequent sequences. The main steps of SMUB are: (1) build the SPPC-tree and identify all frequent 1-sequences; (2) based on the SPPC-tree, construct the B-List of each frequent 1-sequence; (3) mine the remaining frequent k-sequences (k > 1). The details of the algorithm are presented in Sect. 5.2.

Definition 2

(SPP-code). Given an SPPC-tree \( S_{tr} \) and a node \( N \in S_{tr} \), the SPP-code of \( N \) is a tuple of the form (\( N \).pre-order, \( N \).post-order):\( N \).count.

Definition 3

(B-List of a frequent item; viz., a frequent 1-sequence). Given an SPPC-tree, the B-List of a frequent item is the ordered set of the SPP-codes of all nodes whose item-name equals that item. The SPP-codes are sorted in ascending order of their pre-order values, and the B-List is written as \( \left( {x_{1} ,y_{1} } \right):z_{1} \to \cdots \to \left( {x_{n} ,y_{n} } \right):z_{n} \). Every SPP-code in a B-List corresponds to exactly one node of the SPPC-tree.

Definition 4

(Support count of a B-List). Given a B-List \( BL = \left( {x_{1} ,y_{1} } \right):z_{1} \to \cdots \to \left( {x_{n} ,y_{n} } \right):z_{n} \), let \( BL_{m} = BL \setminus \{ \left( {x,y} \right):z \in BL \mid \exists \left( {x_{i} ,y_{i} } \right):z_{i} \in BL : x > x_{i} \wedge y < y_{i} \} \). The support of \( BL \) is the sum of all \( z_{k} \) such that \( \left( {x_{k} ,y_{k} } \right):z_{k} \in BL_{m} \). For example, consider the B-List of the frequent 1-sequence <1> in Table 3: its \( BL_{m} \) is (2,2):2 → (5,9):1, so its support count is 3.

Table 3. The B-Lists of frequent 1-sequences
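
A sketch of this support computation follows. The SPPCode class is an assumed flat representation of an SPP-code, and a quadratic descendant check is used for clarity.

```java
import java.util.Arrays;
import java.util.List;

// Assumed flat representation of one SPP-code, (pre-order, post-order):count.
class SPPCode {
    final int pre, post, count;
    SPPCode(int pre, int post, int count) {
        this.pre = pre;
        this.post = post;
        this.count = count;
    }
}

class BListSupport {
    // Definition 4: discard every SPP-code that is a descendant of another code in the same
    // B-List (larger pre-order and smaller post-order), then sum the remaining counts.
    static int support(List<SPPCode> bList) {
        int sum = 0;
        for (SPPCode c : bList) {
            boolean descendant = false;
            for (SPPCode other : bList) {
                if (other.pre < c.pre && other.post > c.post) {
                    descendant = true;
                    break;
                }
            }
            if (!descendant) {
                sum += c.count;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The two codes of the BL_m of <1> given in the text: (2,2):2 -> (5,9):1.
        List<SPPCode> blm = Arrays.asList(new SPPCode(2, 2, 2), new SPPCode(5, 9, 1));
        System.out.println(support(blm)); // prints 3
    }
}
```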

5.1 B-List Generation for k-Sequences

Let \( BL_{1} \) and \( BL_{2} \) be the B-Lists of two frequent \( k \)-sequences \( P_{1} = \left\langle {i_{1} ,i_{2} , \ldots ,i_{k - 1} ,x} \right\rangle \) and \( P_{2} = \left\langle {i_{1} ,i_{2} , \ldots ,i_{k - 1} ,y} \right\rangle \) that share the same (k − 1)-prefix. The B-List of the (k + 1)-sequence \( P_{3} = \left\langle {i_{1} , \ldots ,i_{k - 1} ,x,y} \right\rangle \) is formed by the BL_intersection procedure in Algorithm 2. In other words, BL_intersection only operates on two frequent k-patterns that share a (k − 1)-prefix; as a special case, all frequent 1-sequences are considered to share the empty prefix.

Algorithm 2 (BL_intersection)

For example, assume we have the frequent 1-sequence <5> and we want to generate the B-List of the 2-sequence <5,5>. As shown in Table 3, the B-List of <5> is (1,3):2 → (3,1):1 → (6,8):1 → (8,6):1. The B-List of <5,5> is generated by intersecting the B-List of <5> with itself. First, we check (1,3):2 against every element of the same B-List. The pre-order of the SPP-code (1,3):2 is 1, which is not greater than the pre-order of (1,3):2 itself, so we move to (3,1):1. The pre-order of (3,1):1 is 3, which is greater than the pre-order of (1,3):2, and its post-order is 1, which is less than the post-order of (1,3):2, so (3,1):1 is added to the B-List of <5,5>. After finishing BL_intersection, we obtain the B-List of <5,5>, which is (3,1):1 → (8,6):1.
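
Based on this worked example, the intersection can be sketched as follows, reusing the SPPCode sketch above. This is our reading of Algorithm 2, written with a simple nested loop for clarity rather than the O(m + n) merge strategy the paper reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BListIntersection {
    // A code of BL2 is copied to the result when some code of BL1 is its ancestor in the
    // SPPC-tree (smaller pre-order and larger post-order).
    static List<SPPCode> intersect(List<SPPCode> bl1, List<SPPCode> bl2) {
        List<SPPCode> result = new ArrayList<>();
        for (SPPCode b : bl2) {
            for (SPPCode a : bl1) {
                if (a.pre < b.pre && a.post > b.post) { // a is an ancestor of b
                    result.add(b);
                    break;                              // add each code at most once
                }
            }
        }
        return result;                                  // stays sorted by pre-order
    }

    public static void main(String[] args) {
        // The B-List of <5>: (1,3):2 -> (3,1):1 -> (6,8):1 -> (8,6):1.
        List<SPPCode> bl5 = Arrays.asList(
                new SPPCode(1, 3, 2), new SPPCode(3, 1, 1),
                new SPPCode(6, 8, 1), new SPPCode(8, 6, 1));
        // Intersecting <5> with itself yields the B-List of <5,5>: (3,1):1 -> (8,6):1.
        for (SPPCode c : intersect(bl5, bl5)) {
            System.out.println("(" + c.pre + "," + c.post + "):" + c.count);
        }
    }
}
```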

5.2 Mining Clickstream Sequential Patterns

Based on the previous definitions, Algorithm 3 illustrates the SMUB process in high-level pseudocode.

Algorithm 3 (SMUB)

For example, considering the minimum support ξ = 3, we have \( SP_{1} \), the set of frequent 1-sequences <5>, <2> and <1> mined from the example database SDB, together with their respective B-List set \( BL_{1} \). Running Algorithm 3, we first join <5> with <5>, <2> and <1> to form the 2-sequence candidates <5,5>, <5,1> and <5,2>. By generating the B-Lists of these candidates, we can obtain the support count of each candidate. Only <5,5> and <5,2> have support counts of at least ξ, so they are frequent 2-sequences and are added to the set of frequent 2-patterns \( SP_{2} \). In the same way, <2> is joined with <5>, <2> and <1>, and <1> is joined with <5>, <2> and <1>. The resulting frequent 2-patterns are added to \( SP_{2} \) and their respective B-Lists to \( BL_{2} \). We then recursively re-run the mining_L procedure with \( SP_{2} \) and \( BL_{2} \), and so on, until no more candidates can be generated. Figure 2 illustrates the full set of frequent clickstream patterns.
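
The recursion described above might be sketched as follows, reusing the SPPCode, BListIntersection and BListSupport sketches. The names and details, such as how prefixes are tracked, are our assumptions and not the authors' Algorithm 3.

```java
import java.util.ArrayList;
import java.util.List;

class SMUBMining {
    static class Pattern {
        final List<Integer> items;   // the clickstream pattern itself
        final List<SPPCode> bList;   // its B-List
        Pattern(List<Integer> items, List<SPPCode> bList) {
            this.items = items;
            this.bList = bList;
        }
    }

    // Join every pair of k-patterns sharing a (k-1)-prefix, keep the extensions whose
    // B-List support reaches minSup, and recurse until no candidate survives.
    static void miningL(List<Pattern> level, int minSup, List<Pattern> results) {
        List<Pattern> next = new ArrayList<>();
        for (Pattern p1 : level) {
            for (Pattern p2 : level) {
                if (!samePrefix(p1, p2)) {
                    continue;
                }
                List<SPPCode> bl = BListIntersection.intersect(p1.bList, p2.bList);
                if (BListSupport.support(bl) >= minSup) {
                    List<Integer> items = new ArrayList<>(p1.items);
                    items.add(p2.items.get(p2.items.size() - 1)); // append p2's last item
                    next.add(new Pattern(items, bl));
                }
            }
        }
        if (!next.isEmpty()) {
            results.addAll(next);
            miningL(next, minSup, results);   // mine the (k+1)-level in the same way
        }
    }

    // Two k-patterns share a prefix when they agree on their first k-1 items
    // (all 1-sequences share the empty prefix).
    private static boolean samePrefix(Pattern p1, Pattern p2) {
        int k = p1.items.size();
        return p2.items.size() == k
                && p1.items.subList(0, k - 1).equals(p2.items.subList(0, k - 1));
    }
}
```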

Fig. 2. The tree of frequent clickstream patterns

6 Experimental Evaluation

In this section, we evaluate the performance of the proposed algorithm. The experiments were performed on a computer with an Intel Core i7 2.2 GHz CPU, 16 GB of memory, and macOS Sierra 10.12.6. We configured the JVM with the flags -Xmx10G -Xms10G (viz., the maximum memory allowed was 10 GB). CM-Spade, the state-of-the-art algorithm for sequential pattern mining, was shown in [11] to be more efficient than the earlier algorithms GSP, PrefixSpan and FUSP; therefore, in this paper we compare the proposed algorithm, SMUB, only with CM-Spade. We used the Kosarak, FIFA, MSNBC, and BMS2 datasets (Table 4) to evaluate performance. We implemented SMUB in Java 8. The experiments were conducted on each database by decreasing the minimum support threshold until an algorithm took too long to execute (more than 2000 s) or ran out of memory. The reported running time is the total execution time of the algorithm.

Table 4. Database description

Figure 3 shows the running times of SMUB and CM-Spade on Kosarak, FIFA, MSNBC, and BMS2, respectively. Overall, SMUB ran faster than CM-Spade, and the gap grew as the minimum support decreased. Thus, SMUB is more efficient than CM-Spade at low minimum support thresholds.

Fig. 3. Runtime of SMUB and CM-Spade

7 Conclusions and Future Work

In this paper, we proposed a novel data structure, the B-List, for compressing and storing the information needed for mining clickstream patterns. Based on B-Lists, we developed an algorithm, SMUB, for fast mining of clickstream patterns in clickstream databases. The advantages of the SMUB algorithm over previous algorithms are as follows. First, it uses a compact data structure, the B-List, which is usually substantially smaller than the original database, and thus avoids costly database scans in the subsequent mining processes. Second, counting the support of a sequence is transformed into the intersection of B-Lists, and SMUB employs an efficient strategy with complexity O(m + n) for intersecting two B-Lists, where m and n are the cardinalities of the two B-Lists. We implemented the SMUB algorithm and studied its performance in comparison with CM-Spade, a well-known sequential pattern mining algorithm, on a variety of real and synthetic datasets. Our performance study shows that SMUB is more efficient than CM-Spade.

In future work, we will extend our method to the full sequential pattern mining problem (viz., where itemsets may contain more than one element). We will also consider a parallel version of SMUB so that it can handle even bigger databases.