1 Introduction

Sequential pattern mining is used to reveal the set of frequent patterns in a sequence database [1,2,3,4,5,6], while inter-sequence pattern mining finds frequent patterns that span multiple sequences in the database [7,8,9,10,11,12].

Among the important tasks in data mining, inter-sequence pattern mining is an especially practical one. Its purpose is to find frequent patterns that are common to many sequences in the database and to reveal the relationships between these patterns over time. It therefore provides important and meaningful information for rule generation to support decision-making, management, and prediction.

1.1 Motivation

In recent years, various methods for mining constraint-based (inter-)sequential patterns or rules have been developed [8, 13, 14]. However, most of these address the problems of mining sequential patterns and sequential rules with constraints [13, 14]. The ISP-IC algorithm, proposed by Vo et al. in 2018 [8], focuses on discovering constraint-based inter-sequence patterns. Inter-sequence pattern mining with itemset constraints, however, is not addressed by any of the mentioned approaches. We want to mine patterns such as: “If a customer visits URL1, URL2, URL3 today, then he/she will visit URL2, URL3, URL5 two days later. The constraint is that he/she must visit {URL2, URL3}”. We could solve this problem by applying the algorithm from [7] to mine all inter-sequence patterns and then checking the constraints on each mined pattern to obtain the result. However, this approach takes more time and memory to mine and store the patterns.

As far as we know, the complexity of mining inter-sequence patterns is much higher than that of mining frequent sequences. This is because we must mine all sequences appearing across multiple sequences, and handling the itemset constraints requires additional processing to check which patterns satisfy them. Therefore, the major challenge of this study is to address the problem of inter-sequence pattern mining in combination with itemset constraints.

1.2 Contributions

In this study, our goal is to solve the task of inter-sequence pattern mining with itemset constraints. Unlike mining sequential patterns with itemset constraints, this task requires more complex processing because many candidates are generated during the mining process. Our major contributions are as follows.

  1. Based on the EISP-Miner algorithm [7] and the dynamic bit vector data structure [9], we state the problem of inter-sequence pattern mining in combination with itemset constraints.

  2. We then suggest a proposition that helps reduce candidate checking during sequence expansion in the EISP-Miner algorithm, thus reducing the search space for inter-sequence pattern mining with itemset constraints.

  3. Next, an algorithm named DBV-ISPMIC is developed to discover constraint-based inter-sequence patterns. A parallel version of the DBV-ISPMIC algorithm, named pDBV-ISPMIC, is also presented to improve the runtime.

  4. Finally, we conduct experiments on various databases to evaluate the proposed methods.

The rest of this paper is organized as follows. The next section presents our literature review on sequential pattern mining, sequential pattern mining with constraints, inter-sequence patterns, and inter-sequence patterns with constraints. The third section offers the basic concepts and the problem statement. The proposed algorithms are presented in the fourth section, where we introduce a structure named the DBV-PatternList and propose the DBV-ISPMIC and pDBV-ISPMIC algorithms to solve the problem. A proposition is also developed to reduce candidate checking in the improved version, DBV-ISPMIC-IMPROVING. The experimental results are discussed in the fifth section. The last section presents the conclusion and some directions for future studies.

2 Related work

2.1 Mining frequent (inter-)sequences

Mining frequent sequences has been explored by many researchers, and many algorithms have been developed for this task. AprioriAll was first proposed by Agrawal and Srikant [2] in 1995 for mining sequential patterns from a sequence database. AprioriAll is based on two basic principles for finding frequent sequences: candidate generation and testing [2].

However, candidate generation and testing are time-consuming and need many database scans, and thus Zaki introduced the SPADE algorithm in 2001 [3]. This algorithm is based on lattice theory to divide the search space and reveal frequent sequences. It utilizes a vertical database format and equivalence classes to find frequent sequences, and therefore SPADE outperforms AprioriAll in most experiments. Han et al. suggested the FreeSpan algorithm in 2000 [4]. This algorithm uses projected sequence databases to confine the search and the growth of subsequence fragments. Pei and Han proposed the PrefixSpan approach in 2001 [5]. The algorithm is based on a divide-and-conquer method that mines frequent sequential patterns from prefix-projected databases (like the FP-Growth algorithm). PrefixSpan can generate frequent sequences without candidate generation, and therefore it has a better execution time than either SPADE or AprioriAll. Thanks to the divide-and-conquer technique, PrefixSpan significantly reduces the size of the projected databases as the length of the common patterns increases, leading to savings in memory and computation time. Gouda et al. used prime block encoding to compress sequence IDs to save memory and runtime [15, 16].

In 2020, Huynh et al. [17] proposed the CUP algorithm to mine clickstream patterns. In their study, the authors employed a structure called the pseudo-IDList. In addition, the algorithm adopts the DUB pruning technique, which helps CUP run 13% to 45% faster than state-of-the-art algorithms. Huynh et al. also proposed efficient parallel algorithms using multi-core processors for mining clickstream patterns [18], as well as a weight-based method for mining clickstream patterns. The latter method is based on the average weight measure, and the authors proposed two algorithms, named CM-WSPADE and Compact-SPADE, for efficient mining of weighted clickstream patterns.

In 2009, a novel method for inter-sequence pattern mining based on a vertical database format was introduced by Wang and Lee [7]. The authors developed an ISP-tree to generate candidates that satisfy the minimum support threshold. With inter-sequence pattern mining, the authors defined a mining approach that can still discover traditional sequential patterns, as the SPADE [3], CM-SPADE [6], and PRISM [15, 16] algorithms do, but can also mine patterns that span multiple sequences in the sequence database. This approach stores sequence IDs (SIDs) and uses them to compute the support of patterns; it therefore needs a large amount of memory to store the SIDs and time to compute their intersections. In 2012, Vo et al. introduced a novel approach that employs a highly efficient structure called the dynamic bit vector (DBV-PatternList) [9] to replace the pattern-list structure used by the EISP-Miner approach. This method significantly reduces the search space and time in mining inter-sequence patterns, as well as in mining closed inter-sequence patterns [10].

2.2 Mining frequent (inter-)sequences with constraints

In 2018, Van et al. proposed the MSPIC-DBV algorithm to mine sequential patterns with itemset constraints [14]. To efficiently discover sequential patterns, MSPIC-DBV adopts the dynamic bit vector and a prefix-tree structure. A method for early pruning of candidates to reduce processing time is also introduced. The authors then extended their work to mining sequential rules with itemset constraints through two new algorithms, MSRIC-R and MSRIC-P, introduced in 2021 [13]. MSRIC-R incorporates the constraints into the rule generation phase, while MSRIC-P incorporates them into the pattern mining phase. The authors suggested a technique to integrate the itemset constraints into the mining process, so that the algorithm only creates constraint-satisfying patterns and thus runs faster. These methods solve the problem of itemset constraints but can only be applied to mining sequential patterns or sequential rules, so the generated subsequences do not satisfy the inter-sequence mining requirements. Therefore, these algorithms cannot be applied to the problem examined in the current study.

An algorithm named ISP-IC was developed by Le et al. [8] for constraint-based inter-sequence pattern mining, along with its improvements, iISP-IC and piISP-IC. The authors also applied parallel processing to speed up the runtime. As stated in the introduction, ISP-IC, iISP-IC, and piISP-IC focus on conditions over individual items in a sequence, so this approach cannot be applied to the itemset-constraint problem.

3 Basic concepts and problem statement

3.1 Inter-sequence pattern mining

Let I = {u1, u2, u3,…, um} be the set of items, where ui is an item (1 ≤ i ≤ m). A sequence s = 〈t1, t2, t3,…, tn〉 is an ordered list of itemsets, where ti ⊆ I (1 ≤ i ≤ n) is an itemset. A sequence database D = {s1, s2, s3,…, sw} is a set of sequences, where w = |D| is the number of sequences in D and each si (1 ≤ i ≤ w) is a pair of values 〈Dat, Sequence〉, in which Dat is an attribute of si describing contextual information based on the time of the transaction.
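To make these notions concrete, the following is a minimal sketch of how a 〈Dat, Sequence〉 pair could be represented in memory; the class and member names (DataSequence, Dat, Itemsets) are our own illustration, not the paper's implementation.

```csharp
using System;
using System.Collections.Generic;

// Illustrative in-memory representation of a <Dat, Sequence> pair: an itemset is a
// sorted set of items, and a sequence is an ordered list of itemsets.
public sealed class DataSequence
{
    public int Dat { get; }                                  // transaction-time attribute (DAT)
    public IReadOnlyList<SortedSet<string>> Itemsets { get; }

    public DataSequence(int dat, IEnumerable<IEnumerable<string>> itemsets)
    {
        Dat = dat;
        var list = new List<SortedSet<string>>();
        foreach (var itemset in itemsets)
            list.Add(new SortedSet<string>(itemset, StringComparer.Ordinal));
        Itemsets = list;
    }
}

public static class ExampleDatabase
{
    // A sequence database D is simply a list of such pairs; the entry below mirrors
    // the DAT 1 example in Table 1: first buy item C, then buy itemset (AB).
    public static List<DataSequence> Build() => new List<DataSequence>
    {
        new DataSequence(1, new[] { new[] { "C" }, new[] { "A", "B" } })
    };
}
```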

What follows is an example based on the sequence database in Table 1.

Table 1 An example of creating a sequential database from a customer dataset

With DAT = 1 (DAT is an attribute describing the time of the transaction): the customer first buys item C and then buys itemset (AB).

Assume the sequences s1 and s2 have d1 and d2, respectively, as their domain attributes (DAT). If we take d1 as the reference point, the span between s2 and s1 is defined as [d2 − d1]. The sequence s2 at domain attribute d2 with respect to d1 is called an extended sequence and is denoted s2[d2 − d1]. Take the database given in Table 1.

For instance, if the DAT of the 1st transaction is used as the reference point, then the extended sequence from the 2nd transaction is ⟨C(ABC)A⟩[1].

Let s[d] = 〈t1, t2, t3,…, tn〉[d] be a sequence where ti is an itemset (1 ≤ i ≤ n) and [d] is the span of s. An itemset ti associated with [d] is defined as an extended itemset (e-itemset), denoted 〈ti〉[d]. If ti = (u1, u2, u3,…, um), where each ui (1 ≤ i ≤ m) is an item, then ui associated with [d] is an extended item (e-item), denoted (ui)[d]. For example, 〈C(ABC)A〉[1] has 3 e-itemsets, 〈C〉[1], 〈(ABC)〉[1] and 〈A〉[1], and 3 e-items: (C)[1], (A)[1] and (B)[1].

Definition 1 (Megasequence) [7]

Given a list of k sequences 〈d1, s1〉, 〈d2, s2〉,…, 〈dk, sk〉 in the sequence database, a megasequence with k ≥ 1 is denoted as Ψ = s1[0] ∪ s2[d2 − d1] ∪ … ∪ sk[dk − d1].

From the example database shown in Table 1, with maxspan = 1 and DAT = 1 as the reference point, we obtain the list of megasequences shown in Table 2.

Table 2 Converting a sequential database to megasequences

Definition 2 [8]

Let a = (im)[x] and b = (in)[y] be 2 e-items.

  1. a = b if and only if (im = in) ∧ (x = y), and

  2. a < b if and only if x < y or ((x = y) ∧ (im < in)).

For example, (A)[0] = (A)[0], (A)[1] < (A)[2], and (B)[1] < (D)[1].
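The total order in Definition 2 can be captured by a simple comparer; the following sketch is illustrative (the EItem type and its field names are our own), comparing first by span and then by item.

```csharp
using System;

// An e-item (u)[x]: an item u together with its span x (Definition 2).
public readonly struct EItem : IComparable<EItem>
{
    public string Item { get; }
    public int Span { get; }

    public EItem(string item, int span) { Item = item; Span = span; }

    // a < b iff x < y, or (x = y and im < in); a = b iff both components are equal.
    public int CompareTo(EItem other)
    {
        int bySpan = Span.CompareTo(other.Span);
        return bySpan != 0 ? bySpan : string.CompareOrdinal(Item, other.Item);
    }

    public override string ToString() => $"({Item})[{Span}]";
}
// For example, (A)[1] < (A)[2] and (B)[1] < (D)[1], matching the text above.
```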

Definition 3 [8]

Given a pattern p, the function subi,j(p) is defined as the subsequence of (j − i + 1) e-items of p from position i to position j. For example, sub1,3(〈(BC)〉[0]〈(AD)B〉[1]) = 〈(BC)〉[0]〈(A)〉[1] and sub4,4(〈(BC)〉[0]〈(AD)B〉[1]) = (D)[1].

Definition 4 [8]

Given two frequent 1-patterns α = 〈u〉[0] and β = 〈v〉[0], α is joinable to β in any instance, and there are three types of join operation: (1) itemset-join: α ∪i β = {〈(uv)〉[0] | u < v}; (2) sequence-join: α ∪s β = {〈uv〉[0]}; (3) inter-join: α ∪t β = {〈u〉[0]〈v〉[x] | 1 ≤ x ≤ maxspan}. For example, given maxspan = 2, 〈C〉[0] ∪i 〈D〉[0] = 〈(CD)〉[0]; 〈C〉[0] ∪s 〈D〉[0] = 〈CD〉[0]; and 〈C〉[0] ∪t 〈D〉[0] = {〈C〉[0]〈D〉[1], 〈C〉[0]〈D〉[2]}.
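As a small worked illustration of Definition 4 (not the authors' data structures), the following helpers produce the three kinds of join candidates for two 1-patterns as display strings; all names are ours.

```csharp
using System.Collections.Generic;

public static class OneJoinExample
{
    // itemset-join: <(uv)>[0]
    public static IEnumerable<string> ItemsetJoin(char u, char v)
        => new[] { $"<({u}{v})>[0]" };

    // sequence-join: <uv>[0]
    public static IEnumerable<string> SequenceJoin(char u, char v)
        => new[] { $"<{u}{v}>[0]" };

    // inter-join: <u>[0]<v>[x] for 1 <= x <= maxspan
    public static IEnumerable<string> InterJoin(char u, char v, int maxspan)
    {
        for (int x = 1; x <= maxspan; x++)
            yield return $"<{u}>[0]<{v}>[{x}]";
    }
}
// With maxspan = 2, ItemsetJoin('C','D') yields <(CD)>[0], SequenceJoin('C','D') yields
// <CD>[0], and InterJoin('C','D',2) yields <C>[0]<D>[1] and <C>[0]<D>[2], as in the text.
```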

Definition 5 [8]

Given two frequent k-patterns α and β with k > 1, let subk,k(α) = (u)[i] and subk,k(β) = (v)[j]. α is joinable to β if sub1,k−1(α) = sub1,k−1(β) and i ≤ j, which yields three types of join operation: (1) itemset-join: α ∪i β = {α +i (v)[j] | (i = j) ∧ (u < v)}; (2) sequence-join: α ∪s β = {α +s (v)[j] | (i = j)}; (3) inter-join: α ∪t β = {α +t (v)[j] | (i < j)}.

For example, 〈BC〉[0] ∪i 〈BD〉[0] = 〈B(CD)〉[0], 〈BC〉[0] ∪s 〈BD〉[0] = 〈BCD〉[0], and 〈BC〉[0] ∪t 〈B〉[0]〈D〉[2] = 〈BC〉[0]〈D〉[2].

Definition 6 (Prefix) [14]

A pattern β = ⟨b1b2 ... bm⟩ is called a prefix of pattern α = ⟨α1α2 ... αn⟩ if and only if bi = αi for all 1 ≤ i ≤ m − 1, bm ⊆ αm, m < n. For instance, the prefixes of pattern ⟨D(BC)B⟩ are: ⟨D⟩, ⟨DB⟩ and ⟨D(BC)⟩. Thus, any sequence would be the prefix of its extended sequences based on this definition.
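Under Definition 6, the prefix test reduces to checking itemset equality on all but the last position and set inclusion on the last one. A minimal sketch (types and names are illustrative) follows.

```csharp
using System.Collections.Generic;

public static class PrefixCheck
{
    // Returns true if beta is a prefix of alpha, with each pattern given as its list of
    // itemsets: b_i = a_i for 1 <= i <= m-1, b_m ⊆ a_m, and m < n (Definition 6).
    public static bool IsPrefix(IList<ISet<string>> beta, IList<ISet<string>> alpha)
    {
        int m = beta.Count, n = alpha.Count;
        if (m >= n) return false;                        // m < n is required
        for (int i = 0; i < m - 1; i++)
            if (!beta[i].SetEquals(alpha[i])) return false;
        return beta[m - 1].IsSubsetOf(alpha[m - 1]);
    }
}
// For example, <D>, <DB> and <D(BC)> are all prefixes of <D(BC)B>, as stated above.
```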

3.2 Problem statement

Given a sequence database D, a minimum support threshold (minsup), and a set of constraint itemsets IC = {b1, b2, b3, …, bk}, the task of inter-sequence pattern mining with itemset constraints is to discover all frequent sequences α = 〈α1[w1] α2[w2] … αm[wm]〉 such that ∃αi[wi] ∈ α, ∃bj ∈ IC: bj ⊆ αi.

For instance, let IC = {(C), (E)}; the sequence ⟨C(AB)⟩[0]⟨C(ABC)A⟩[1] satisfies the constraint, whereas the sequence ⟨AD⟩[0]⟨A⟩[1] does not.
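The constraint test itself is a simple subset check: a pattern satisfies IC if at least one of its itemsets contains at least one constraint itemset. A small sketch (illustrative names, not the paper's Check function) follows.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ConstraintCheck
{
    // True if some itemset of the pattern contains some constraint itemset of IC.
    public static bool Satisfies(IReadOnlyList<ISet<string>> patternItemsets,
                                 IReadOnlyList<ISet<string>> constraints)
    {
        return patternItemsets.Any(itemset =>
            constraints.Any(c => c.IsSubsetOf(itemset)));
    }
}
// With IC = {(C), (E)}: <C(AB)>[0]<C(ABC)A>[1] satisfies IC because its first itemset
// contains (C), while <AD>[0]<A>[1] does not, matching the example above.
```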

4 Proposed algorithm

4.1 DBV-PatternList structure

Instead of using the PatternList data structure of the EISP-Miner algorithm [7], we use the DBV-PatternList data structure [9]. This structure uses dynamic bit vectors to store the t-values and p-values of an e-item in the database, and it minimizes the space and time needed to extend e-items into inter-sequence patterns. The DBV-PatternList structure is presented as follows:

  • Start index: the index of the first non-zero value in the bit vector.

  • Bit vector: the array of values remaining after trimming the zero values at the start and the end.
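A minimal sketch of this trimming idea is given below; the field names and the choice of 32-bit words are our own assumptions, not the exact layout used in [9].

```csharp
using System.Collections.Generic;

// A dynamic bit vector stores only the non-zero middle part of a plain bit vector,
// together with the index of its first non-zero word.
public sealed class DynamicBitVector
{
    public int StartIndex { get; }      // index of the first non-zero word
    public uint[] Words { get; }        // words remaining after trimming leading/trailing zeros

    private DynamicBitVector(int startIndex, uint[] words)
    {
        StartIndex = startIndex;
        Words = words;
    }

    public static DynamicBitVector FromWords(IReadOnlyList<uint> raw)
    {
        int first = 0, last = raw.Count - 1;
        while (first <= last && raw[first] == 0) first++;
        while (last >= first && raw[last] == 0) last--;
        if (first > last) return new DynamicBitVector(0, new uint[0]);   // all-zero vector

        var trimmed = new uint[last - first + 1];
        for (int i = 0; i < trimmed.Length; i++) trimmed[i] = raw[first + i];
        return new DynamicBitVector(first, trimmed);
    }
}
```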

Figure 1 presents the PatternList and DBV-PatternList structures.

Fig. 1 Structures of (a) PatternList and (b) DBV-PatternList

As shown in Fig. 1a, storing the information for an e-item with the PatternList data structure requires 26 bytes (12 bytes for the t-values and 14 bytes for the p-values). With the DBV-PatternList data structure (Fig. 1b), only 20 bytes are needed (2 bytes for the start position of the e-item, 4 bytes for the t-values, and 14 bytes for the p-values). Because the t-values are converted to a dynamic bit vector [9], the storage space is reduced.

4.2 DBV-ISPMIC algorithm

Our proposed method relies on the DBV-PatternList structure and the DBV-PatternList joining methods. The model of the DBV-ISPMIC algorithm is shown in Fig. 2.

Fig. 2 The DBV-ISPMIC algorithm

In this algorithm, there are five functions – ISP-Join1, ISP-Joink, ISP-Join1-Extension, ISP-Joink-Extension and Check – as shown in Figs. 3, 4, 5, 6 and 7, respectively. The algorithm operates on a given sequential database with a user-specified minsup, maxspan and set of itemset constraints. The result is the set of frequent inter-sequence patterns that satisfy both the minsup condition and the itemset constraints. The DBV-ISPMIC algorithm has three main steps, presented below.

  • Step 1 The sequential database is scanned once to find all frequent 1-patterns, i.e., those whose support is not less than the user-specified minimum support threshold minsup (Fig. 2, line 1). Then, a tree T is created with a NULL root and the found 1-DBV-PatternLists as its children. The algorithm iterates over all frequent 1-DBV-PatternLists to start expanding the nodes by itemset, sequence and inter extension (Fig. 2, line 2).

  • Step 2 The algorithm calls the ISP-Join1 function (Fig. 3) to extend an αlist node with the remaining children of the T|NULL tree (Definition 4). This function builds new DBV-PatternLists in three ways, via itemset, sequence and inter extension (the latter based on the maxspan value). Each resulting DBV-PatternList is checked against the minsup condition and the itemset constraints; the Check function in Fig. 7 tests whether a new DBV-PatternList satisfies the constraints. If it does, the DBV-PatternList is added as a child node of T|αlist (Fig. 5, lines 1–9).

  • Step 3 The algorithm calls the ISP-Joink function (Fig. 4) to extend each child node of the αlist node with the remaining children according to the k-pattern join (Definition 5). This function again builds new DBV-PatternLists by itemset, sequence and inter extension and checks whether each result satisfies the minsup condition. The Check function tests whether a new DBV-PatternList satisfies the constraints; if it does, the DBV-PatternList is added as a child node of T|αlist (Fig. 6, lines 1–9). A compressed sketch of these steps is given after this list.
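The sketch below condenses Steps 1–3 into a recursive depth-first expansion. It is not the authors' pseudocode (which is given in Figs. 2–7); the types, the Join placeholder and the 1-pattern scan are illustrative stubs only.

```csharp
using System.Collections.Generic;
using System.Linq;

// A tree node: the pattern's itemsets, its support, and its children in the ISP-tree.
public sealed class Node
{
    public List<HashSet<string>> Itemsets = new List<HashSet<string>>();
    public int Support;
    public List<Node> Children = new List<Node>();
}

public static class DbvIspmicSketch
{
    // Step 1 (stub): one database scan would return the frequent 1-patterns here.
    public static List<Node> FindFrequent1Patterns() => new List<Node>();

    // Steps 2-3: join each node with its siblings (itemset-, sequence- and inter-join),
    // keep candidates meeting minsup, report those satisfying the constraints IC, and
    // recurse depth-first on the new children.
    public static void Extend(List<Node> siblings, int minsup,
                              List<HashSet<string>> ic, List<Node> result)
    {
        foreach (var alpha in siblings)
        {
            foreach (var beta in siblings)
                foreach (var cand in Join(alpha, beta))
                    if (cand.Support >= minsup)
                        alpha.Children.Add(cand);

            result.AddRange(alpha.Children.Where(c =>
                c.Itemsets.Any(t => ic.Any(x => x.IsSubsetOf(t)))));

            Extend(alpha.Children, minsup, ic, result);   // depth-first expansion
        }
    }

    // Placeholder for the three join operations of Definitions 4 and 5.
    private static IEnumerable<Node> Join(Node a, Node b) => Enumerable.Empty<Node>();
}
```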

Fig. 3 The ISP-Join1 function

Fig. 4 The ISP-Joink function

Fig. 5 The ISP-Join1-Extension function

Fig. 6 The ISP-Joink-Extension function

Fig. 7 The Check function

4.3 Improved DBV-ISPMIC algorithm

We can see that if a pattern satisfies the itemset constraints, then any new pattern extended from this node by itemset, sequence or inter extension will also satisfy the itemset constraints. This helps to reduce the algorithm’s DBV-PatternList node expansion time. We state a proposition to verify this, as follows.

Proposition 1

(Checking itemset constraints) Given an inter-sequence α and a set of itemset constraints IC, if α satisfies IC, then any sequence β generated from α by extension also satisfies IC.

Proof

Let α = ⟨α1[w1] α2[w2] ... αm[wm]⟩ and β = ⟨β1[w1] β2[w2] ... βu[wu]⟩ be inter-sequences, where each αi and βi is an itemset. Because α satisfies IC, by the problem statement the following condition holds: ∃αi[wi] ∈ α, ∃bj ∈ IC: bj ⊆ αi.

Based on Definition 5, there are three cases to consider:

Itemset-join: β always contains a βi[wi] such that αi ⊆ βi, so bj ⊆ αi ⊆ βi, i.e., bj ⊆ βi. Hence β satisfies IC.

Sequence-join: The itemset αi does not change, i.e., βi = αi, and therefore bj ⊆ βi, so β satisfies IC.

Inter-join: Under the inter-join, all itemsets of α also appear in β, so β always contains a βk[wk] such that αi ⊆ βk; hence bj ⊆ αi ⊆ βk, i.e., bj ⊆ βk. Thus β satisfies IC.

In the ISP-Join1-improving function, we first check the αlist pattern against the itemset constraints. If it satisfies them, then by Proposition 1 all frequent DBV-PatternLists extended from αlist also satisfy the itemset constraints and need not be checked again (Fig. 8, lines 3–12). The same applies to the ISP-Joink-improving function (Fig. 9, lines 2–13). A small sketch of this idea is given below.
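The following fragment sketches how Proposition 1 can be exploited during extension: each candidate carries a flag, and when its parent already satisfies the constraints the subset test is skipped. The types and names are illustrative, not the authors' pseudocode.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class Candidate
{
    public List<HashSet<string>> Itemsets { get; } = new List<HashSet<string>>();
    public bool SatisfiesIC { get; set; }
}

public static class ImprovedCheck
{
    // Proposition 1: any itemset-, sequence- or inter-extension of a pattern that
    // satisfies IC still contains the satisfying itemset, so the flag is inherited
    // and the subset test is only run when the parent does not already satisfy IC.
    public static void MarkConstraint(Candidate parent, Candidate child,
                                      List<HashSet<string>> ic)
    {
        child.SatisfiesIC = parent.SatisfiesIC
            || child.Itemsets.Any(t => ic.Any(c => c.IsSubsetOf(t)));
    }
}
```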

Fig. 8 The ISP-Join1-improving function

Fig. 9 The ISP-Joink-improving function

Consider the example shown in Table 1, with minsup = 2, maxspan = 1 and itemset constraints IC = {(AB), CA, AD}. The frequent patterns generated in the mining phase are presented in Fig. 10.

Fig. 10 The extended tree of patterns corresponding to the example database

Step 1

The sequential database is scanned once by the algorithm to enumerate all frequent 1-patterns, which are {⟨A⟩, ⟨B⟩, ⟨C⟩, ⟨D⟩} with supports of {4, 2, 2, 2}, respectively (Fig. 10, level 0). During this database scan, the 1-DBV-PatternLists are also generated.

Step 2

The algorithm generates the candidates {⟨A⟩[0]⟨A⟩[1], ⟨(AB)⟩[0], ⟨AD⟩[0], ⟨A⟩[0]⟨D⟩[1]}, which share the 1-prefix ⟨A⟩, by joining ⟨A⟩ with ⟨A⟩, ⟨B⟩, ⟨C⟩ and ⟨D⟩ based on itemset, sequence and inter extension (Fig. 8). The other candidates ⟨B⟩[0]⟨A⟩[1], ⟨CA⟩[0], ⟨C⟩[0]⟨A⟩[1] and ⟨CB⟩[0], which share the 1-prefixes ⟨B⟩ and ⟨C⟩, are created at the same time (Fig. 10, level 1).

Step 3

The algorithm traverses the children of each (k−1)-pattern by depth-first search to generate k-patterns. If a (k−1)-pattern satisfies the constraints IC, then its super-patterns do not need to be checked. For instance, ⟨(AB)⟩[0] satisfies the itemset constraints, so we do not need to check the itemset constraints for ⟨(AB)⟩[0]⟨A⟩[1]. This reduces the checking time of the algorithm. In contrast, ⟨A⟩[0]⟨A⟩[1] does not satisfy the itemset constraints, so we must check the itemset constraints for its child nodes. The algorithm backtracks to Step 3 and completes the candidates of ⟨B⟩, ⟨C⟩ and then ⟨D⟩. As no more candidates can be found in branch ⟨D⟩, the algorithm terminates. We obtain FP = {⟨AA⟩[0]⟨AD⟩[1]: support = 2, ⟨(AB)⟩[0]: support = 2, ⟨(AB)⟩[0]⟨A⟩[1]: support = 2, ⟨AD⟩[0]: support = 2, ⟨CA⟩[0]: support = 2, ⟨CA⟩[0]⟨A⟩[1]: support = 2, ⟨C(AB)⟩[0]: support = 2, ⟨C(AB)⟩[0]⟨A⟩[1]: support = 2}.

4.4 Parallel DBV-ISPMIC algorithm

As shown in Fig. 2, the DBV-ISPMIC algorithm is a sequential algorithm. Its running time is tnode_1 + tnode_2 + tnode_3 + … + tnode_n, where the set {node_1, node_2, node_3, …, node_n} contains the child nodes of the ISP-tree T and tnode_i is the time to extend node_i by 1-pattern and k-pattern extension (in the ISP-Join1 and ISP-Joink functions, respectively). Because the ISP-Joink extension function expands the sub-nodes of the ISP-tree T independently, if each sub-branch of T is expanded in its own task, the running time of the algorithm improves. The overall runtime of the algorithm can then be determined as Max{tnode_1, tnode_2, tnode_3, …, tnode_n}.

An example of the above proposal is given in Fig. 11. Based on the database in Table 2 with minsup = 2, the frequent patterns generated at the first level by 1-pattern extension are ⟨A⟩, ⟨B⟩, ⟨C⟩ and ⟨D⟩. For each frequent pattern, the k-pattern extension is processed in an individual task. As stated, the runtime of the algorithm is Max{tTask1, tTask2, tTask3}, since the tasks run in parallel with each other. The number of tasks that can be executed simultaneously is determined by the number of processor cores. This can also be extended to distributed systems, where each task is processed on a separate system and the final results are gathered and combined.
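One way to realize this task allocation in C# is sketched below, assuming the .NET Parallel class is used to expand each first-level branch concurrently; the paper does not specify the exact API, and the names here are illustrative.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelExpansion
{
    // Each frequent 1-pattern roots an independent sub-tree of the ISP-tree, so the
    // sub-trees can be expanded concurrently; the elapsed time is then governed by
    // the slowest branch, i.e., the maximum over the per-branch times.
    public static void ExpandAll(IReadOnlyList<string> frequent1Patterns)
    {
        Parallel.ForEach(frequent1Patterns, pattern =>
        {
            ExpandBranch(pattern);   // k-pattern extension of this branch (placeholder)
        });
    }

    private static void ExpandBranch(string rootPattern)
    {
        // Depth-first expansion of the sub-tree rooted at rootPattern would go here.
    }
}
```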

Fig. 11 Example of using parallel processing for ISP-tree extension

The pDBV-ISPMIC algorithm is the parallelized version of DBV-ISPMIC. Figure 12 shows, as flowcharts, the main steps of the sequential DBV-ISPMIC algorithm and its parallel counterpart, pDBV-ISPMIC. The pDBV-ISPMIC algorithm has two main steps:

  • Step 1: Loading the database and finding the frequent 1-pattern sets that satisfy the minsup.

  • Step 2: Allocating execution tasks, with each task handling the k-pattern extension of one branch. The search space is explored recursively by depth-first search (DFS). When no more candidates can be generated, the algorithm terminates.

    Fig. 12 Flowcharts showing the difference between the sequential and parallel algorithms: (a) the main steps of a sequential algorithm, and (b) the main steps of a parallel processing algorithm

5 Experimental results

To evaluate the performance of the DBV-ISPMIC algorithm and its improvement in runtime, all experiments were carried out on a PC with an Intel® Core™ i7-10510U (10th generation) processor @ 1.8–4.9 GHz and 20 GB of RAM, running Windows 10 64-bit. The algorithms were implemented in C# with Visual Studio 2017.

5.1 Experimental databases

We ran tests on five databases, namely C6T5S4I4N1kD1k, C6T5S4I4N1kD10k, Gazelle, BIKE and BMSWebView1. The synthetic databases used for comparison were generated using the IBM synthetic data generator. These databases are available at https://www.mediafire.com/folder/id3p3z6b9g8kj. Their characteristics are shown in Table 3.

Table 3 Test database characteristics

5.2 Runtime

For the C6T5S4I4N1kD1k database, we evaluated the algorithms with minsup = 0.5% and maxspan = {1, 2, 3, 4, 5}; for the C6T5S4I4N1kD10k database, with minsup = 5% and maxspan = {1, 2, 3, 4, 5}; for the Gazelle database, with minsup = 1% and maxspan = {1, 2, 3, 4, 5}; for the BIKE database, with minsup = 0.5% and maxspan = {1, 2, 3, 4, 5}; and for the BMSWebView1 database, with minsup = 0.5% and maxspan = {1, 2, 3, 4, 5}. As a baseline for evaluating the proposed algorithms, we use the EISP-Miner algorithm [7] with a constraint check added after mining; this approach is called the Post-EISPMiner algorithm.

Based on the experimental results in Figs. 13, 14, 15, 16 and 17, we can see that the DBV-ISPMIC-IMPROVING algorithm runs faster than the other two algorithms, Post-EISPMiner and DBV-ISPMIC. In Fig. 13, we compare the runtimes of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING on the Gazelle dataset. As the value of maxspan increases, the running times of all three algorithms increase relatively evenly.

Fig. 13 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the Gazelle dataset

Fig. 14 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset

Fig. 15 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD10k dataset

Fig. 16 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BIKE dataset

Fig. 17 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BMSWebView1 dataset

Figures 14, 15, 16 and 17 show the results for the C6T5S4I4N1kD1k, C6T5S4I4N1kD10k, BIKE and BMSWebView1 datasets. It is clear that the runtimes for DBV-ISPMIC-IMPROVING and DBV-ISPMIC are much better than that of Post-EISPMiner.

5.3 Parallel method for efficient mining of inter-sequence patterns with itemset constraints

Because DBV-ISPMIC-IMPROVING is the best algorithm for mining inter-sequence patterns with itemset constraints, we developed a parallel version of it, pDBV-ISPMIC-IMPROVING, using the C#.NET software library to improve the performance. The performance of the pDBV-ISPMIC-IMPROVING algorithm is evaluated by comparing it with that of the DBV-ISPMIC-IMPROVING algorithm. The results are shown in Figs. 18, 19, 20 and 21; it can be seen that as maxspan increases, the runtime of pDBV-ISPMIC-IMPROVING is much less than that of the DBV-ISPMIC-IMPROVING algorithm.

Fig. 18 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset

Fig. 19 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD10k dataset

Fig. 20 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the BIKE dataset

Fig. 21 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the BMSWebView1 dataset

5.4 Memory usage

Figures 22, 23, 24, 25 and 26 show the peak memory consumption of the three algorithms, Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING. The results show that the memory needed by DBV-ISPMIC and DBV-ISPMIC-IMPROVING is less than that needed by the Post-EISPMiner algorithm for almost all database parameter values. Because the two proposed algorithms reduce the number of generated child nodes that must be checked and stored, they use less memory than the Post-EISPMiner algorithm.

Fig. 22 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the Gazelle dataset

Fig. 23 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset

Fig. 24 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD10k dataset

Fig. 25 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BIKE dataset

Fig. 26 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BMSWebView1 dataset

5.5 Impact of maxspan

For mining inter-sequence patterns, when the maxspan value increases, the number of generated candidates also increases. Therefore, using Proposition 1 to reduce the itemset constraint checking improves the processing time. To evaluate this, we used two databases, Gazelle and C6T5S4I4N1kD1k. The Gazelle database (minsup = 3%) and the C6T5S4I4N1kD1k database (minsup = 0.8%) were tested with the maxspan value increasing from 2 to 12. The results show that the proposed algorithms (DBV-ISPMIC and DBV-ISPMIC-IMPROVING) consistently perform well (Figs. 27 and 28).

Fig. 27 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the Gazelle dataset, with maxspan from 2 to 12

Fig. 28 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset, with maxspan from 2 to 12

6 Conclusions and future work

In this study, we introduced an algorithm to solve the problem of mining inter-sequence patterns with itemset constraints. This algorithm, named DBV-ISPMIC, is based on the EISP-Miner algorithm for mining inter-sequence patterns and uses a dynamic bit vector structure to store data, which increases the processing speed and reduces the storage space compared to EISP-Miner. Based on the DBV-ISPMIC algorithm, we also proposed an improvement that helps reduce the processing time.

In the future, we will apply distributed computing to the improved algorithm to help optimize the running time. We will also study how to put the constraints into mining frequent closed inter-sequences. Finally, algorithms for mining high utility sequences have been proposed in recent years [19,20,21,22,23,24,25], and we will study how to mine high utility inter-sequences and high utility inter-sequences with constraints.