1 Introduction

Sequential pattern mining is used to reveal the set of frequent patterns in a sequence database [1,2,3,4,5,6], while inter-sequence pattern mining finds frequent patterns that span multiple sequences in the database [7,8,9,10,11,12].

Among the important tasks in data mining, inter-sequence pattern mining is an especially practical one. Its purpose is to find frequent patterns that are common to many sequences in the database and to reveal the relationships between these patterns over time. It therefore provides important and meaningful information for rule generation to support decision-making, management, and prediction.

1.1 Motivation

In recent years, various methods for mining constraint-based (inter-)sequential patterns or rules have been developed [8, 13, 14]. However, most of these address the problems of mining sequential patterns and sequential rules with constraints [13, 14]. The ISP-IC algorithm, proposed by Vo et al. in 2018 [8], focuses on discovering constraint-based inter-sequence patterns. Inter-sequence pattern mining with itemset constraints, however, is not addressed by any of the mentioned approaches. We want to mine patterns such as: “If a customer visits URL1, URL2, URL3 today, then he/she will visit URL2, URL3, URL5 two days later. The constraint is that he/she must visit {URL2, URL3}”. We could solve this problem by applying the algorithm from [7] to mine all inter-sequence patterns and then checking the constraints on each mined pattern to obtain the result. However, this approach takes more time and memory to mine and store the patterns.

As far as we know, the complexity of mining inter-sequence patterns is much higher than that of mining frequent sequences. This is because we must mine all sequences appearing across multiple sequences, and handling the itemset constraints requires additional processing to check which patterns satisfy them. Therefore, the major challenge of this study is to address the problem of inter-sequence pattern mining in combination with itemset constraints.

1.2 Contributions

In this study, our goal is to solve the task of inter-sequence pattern mining with itemset constraints. Unlike mining sequential patterns with itemset constraints, this task requires more complex processing because many candidates are generated during the mining process. Our major contributions are as follows.

  1. Based on the EISP-Miner algorithm [7] and the dynamic bit vector data structure [9], we state the problem of inter-sequence pattern mining in combination with itemset constraints.

  2. We then suggest a proposition that helps reduce candidate checking during sequence expansion in the EISP-Miner algorithm, thus reducing the search space for inter-sequence pattern mining with itemset constraints.

  3. Next, an algorithm named DBV-ISPMIC is developed to discover constraint-based inter-sequence patterns. A parallel version of the DBV-ISPMIC algorithm, named pDBV-ISPMIC, is also presented to improve the runtime.

  4. Finally, we conduct experiments on various databases to evaluate the proposed methods.

The rest of this paper is organized as follows. The next section presents our literature review on sequential pattern mining, sequential pattern mining with constraints, inter-sequence patterns, and inter-sequence patterns with constraints. The third section offers the basic concepts and the problem statement. The proposed algorithms are presented in the fourth section, where we introduce a structure named the DBV-PatternList and propose the DBV-ISPMIC and pDBV-ISPMIC algorithms to solve the problem. A proposition is also developed to reduce candidate checking in the improved version, DBV-ISPMIC-IMPROVING. The experimental results are discussed in the fifth section. The last section presents the conclusion and some directions for future studies.

2 Related work

2.1 Mining frequent (inter-)sequences

Mining frequent sequences has been explored by many researchers, and many algorithms have been developed for this task. AprioriAll was first proposed by Agrawal and Srikant [2] in 1995 for mining sequential patterns from a sequence database. AprioriAll is based on two basic principles for finding frequent sequences: candidate generation and testing [2].

However, candidate generation and testing are time-consuming and need many database scans, and thus Zaki introduced the SPADE algorithm in 2001 [3]. This algorithm is based on lattice theory to divide the search space and reveal frequent sequences. It utilizes a vertical database format and equivalence classes to find frequent sequences, and therefore SPADE outperforms AprioriAll in most experiments. Han et al. suggested the FreeSpan algorithm in 2000 [4]. This algorithm uses projected sequence databases to confine the search and the growth of subsequence fragments. Pei and Han proposed the PrefixSpan approach in 2001 [5]. The algorithm is based on a divide-and-conquer method that mines frequent sequential patterns from prefix-projected databases (like the FP-Growth algorithm). PrefixSpan can generate frequent sequences without candidate generation, and therefore it has a better execution time than either SPADE or AprioriAll. Thanks to the divide-and-conquer technique, PrefixSpan significantly reduces the size of the projected databases as the length of the common patterns increases, leading to savings in memory and computation time. Gouda et al. used prime block encoding to compress sequence IDs to save memory and runtime [15, 16].

In 2020, Huynh et al. [17] proposed the CUP algorithm to mine clickstream patterns. In their study, the authors employed a structure called the pseudo-IDList. In addition, the algorithm adopts the DUB pruning technique, which helps CUP run 13% to 45% faster than state-of-the-art algorithms. Huynh et al. also proposed efficient parallel algorithms using multi-core processors for mining clickstream patterns [18], as well as a weight-based method for mining clickstream patterns. The latter method is based on the average weight measure, and the authors proposed two algorithms, named CM-WSPADE and Compact-SPADE, for efficient mining of weighted clickstream patterns.

In 2009, a novel method for inter-sequence pattern mining based on a vertical database format was introduced by Wang and Lee [7]. The authors developed an ISP-tree to generate candidates that satisfy the minimum support threshold. With inter-sequence pattern mining, the authors defined a mining approach that can still discover traditional sequential patterns, as the SPADE [3], CM-SPADE [6], and PRISM [15, 16] algorithms do, but can also mine patterns that span multiple sequences in the sequence database. This approach stores sequence IDs (SIDs) and uses them to compute the support of patterns; it therefore needs a large amount of memory to store the SIDs and time to compute their intersections. In 2012, Vo et al. introduced a novel approach that employs a highly efficient structure called the dynamic bit vector (DBV-PatternList) [9] to replace the pattern-list structure used by the EISP-Miner approach. This method significantly reduces the search space and time in mining inter-sequence patterns, as well as in mining closed inter-sequence patterns [10].

2.2 Mining frequent (inter-)sequences with constraints

In 2018, Van et al. proposed the MSPIC-DBV algorithm to mine sequential patterns with itemset constraints [14]. To efficiently discover sequential patterns, MSPIC-DBV adopts the dynamic bit vector and a prefix-tree structure. A method for early pruning of candidates to reduce processing time is also introduced. The authors then extended their work to mining sequential rules with itemset constraints through two new algorithms, MSRIC-R and MSRIC-P, introduced in 2021 [13]. MSRIC-R incorporates the constraints into the rule generation phase, while MSRIC-P incorporates them into the pattern mining phase. The authors suggested a technique to integrate the itemset constraints into the mining process, so that the algorithm only creates constraint-satisfying patterns and thus runs faster. These methods solve the problem of itemset constraints but can only be applied to mining sequential patterns or sequential rules, so the generated subsequences do not satisfy the inter-sequence mining requirements. Therefore, these algorithms cannot be applied to the problem examined in the current study.

An algorithm named ISP-IC was developed by Le et al. [8] for constraint-based inter-sequence pattern mining, along with its improvements, iISP-IC and piISP-IC. The authors also applied parallel processing to speed up the runtime. As stated in the introduction, ISP-IC, iISP-IC, and piISP-IC focus on conditions over individual items in a sequence, so this approach cannot be applied to the itemset-constraint problem.

3 Basic concepts and problem statement

3.1 Inter-sequence pattern mining

Let I = {u1, u2, u3,…, um} be the set of items, where ui is an item (1 ≤ i ≤ m). A sequence s = 〈t1, t2, t3,…, tn〉 is an ordered list of itemsets, where ti ⊆ I (1 ≤ i ≤ n) is an itemset. A sequence database D = {s1, s2, s3,…, sw} is a set of sequences, where w = |D| is the number of sequences in D and each si (1 ≤ i ≤ w) is a pair of values 〈Dat, Sequence〉, in which Dat is an attribute of si describing contextual information based on the time of the transaction.
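To make these notions concrete, the following is a minimal sketch of how a 〈Dat, Sequence〉 pair could be represented in memory; the class and member names (DataSequence, Dat, Itemsets) are our own illustration, not the paper's implementation.

```csharp
using System;
using System.Collections.Generic;

// Illustrative in-memory representation of a <Dat, Sequence> pair: an itemset is a
// sorted set of items, and a sequence is an ordered list of itemsets.
public sealed class DataSequence
{
    public int Dat { get; }                                  // transaction-time attribute (DAT)
    public IReadOnlyList<SortedSet<string>> Itemsets { get; }

    public DataSequence(int dat, IEnumerable<IEnumerable<string>> itemsets)
    {
        Dat = dat;
        var list = new List<SortedSet<string>>();
        foreach (var itemset in itemsets)
            list.Add(new SortedSet<string>(itemset, StringComparer.Ordinal));
        Itemsets = list;
    }
}

public static class ExampleDatabase
{
    // A sequence database D is simply a list of such pairs; the entry below mirrors
    // the DAT 1 example in Table 1: first buy item C, then buy itemset (AB).
    public static List<DataSequence> Build() => new List<DataSequence>
    {
        new DataSequence(1, new[] { new[] { "C" }, new[] { "A", "B" } })
    };
}
```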

What follows is an example based on the sequence database in Table 1.

Table 1 An example of creating a sequential database from a customer dataset

With DAT = 1 (DAT is an attribute describing the time of the transaction): the customer first buys item C and then buys itemset (AB).

Assume the sequences s1 and s2 have d1 and d2, respectively, as their domain attributes (DAT). If we take d1 as the reference point, the span between s2 and s1 is defined as [d2 − d1]. The sequence s2 at domain attribute d2 with respect to d1 is called an extended sequence and is denoted s2[d2 − d1]. Take the database given in Table 1.

For instance, if the DAT of the 1st transaction is used as the reference point, then the extended sequence from the 2nd transaction is ⟨C(ABC)A⟩[1].

Let s[d] = 〈t1, t2, t3,…, tn〉[d] be a sequence where ti is an itemset (1 ≤ i ≤ n) and [d] is the span of s. An itemset ti associated with [d] is defined as an extended itemset (e-itemset), denoted 〈ti〉[d]. If ti = (u1, u2, u3,…, um), where each ui (1 ≤ i ≤ m) is an item, then ui associated with [d] is an extended item (e-item), denoted (ui)[d]. For example, 〈C(ABC)A〉[1] has 3 e-itemsets, 〈C〉[1], 〈(ABC)〉[1] and 〈A〉[1], and 3 e-items: (C)[1], (A)[1] and (B)[1].

Definition 1 (Megasequence) [7]

Given a list of k sequences 〈d1, s1〉, 〈d2, s2〉,…, 〈dk, sk〉 in the sequence database, a megasequence with k ≥ 1 is denoted as Ψ = s1[0] ∪ s2[d2 − d1] ∪ … ∪ sk[dk − d1].

From the example database shown in Table 1, with maxspan = 1 and DAT = 1 as the reference point, we obtain the list of megasequences shown in Table 2.

Table 2 Converting a sequential database to megasequences

Definition 2 [8]

Let a = (im)[x] and b = (in)[y] be 2 e-items.

  1. a = b if and only if (im = in) ∧ (x = y), and

  2. a < b if and only if x < y or ((x = y) ∧ (im < in)).

For example, (A)[0] = (A)[0], (A)[1] < (A)[2], and (B)[1] < (D)[1].
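The total order in Definition 2 can be captured by a simple comparer; the following sketch is illustrative (the EItem type and its field names are our own), comparing first by span and then by item.

```csharp
using System;

// An e-item (u)[x]: an item u together with its span x (Definition 2).
public readonly struct EItem : IComparable<EItem>
{
    public string Item { get; }
    public int Span { get; }

    public EItem(string item, int span) { Item = item; Span = span; }

    // a < b iff x < y, or (x = y and im < in); a = b iff both components are equal.
    public int CompareTo(EItem other)
    {
        int bySpan = Span.CompareTo(other.Span);
        return bySpan != 0 ? bySpan : string.CompareOrdinal(Item, other.Item);
    }

    public override string ToString() => $"({Item})[{Span}]";
}
// For example, (A)[1] < (A)[2] and (B)[1] < (D)[1], matching the text above.
```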

Definition 3 [8]

Given a pattern p, the function subi,j(p) is defined as the subsequence of (j − i + 1) e-items of p from position i to position j. For example, sub1,3(〈(BC)〉[0]〈(AD)B〉[1]) = 〈(BC)〉[0]〈(A)〉[1] and sub4,4(〈(BC)〉[0]〈(AD)B〉[1]) = (D)[1].

Definition 4 [8]

Given two frequent 1-patterns α = 〈u〉[0] and β = 〈v〉[0], α is joinable to β in any instance, and there are three types of join operation: (1) itemset-join: α ∪i β = {〈(uv)〉[0] | u < v}; (2) sequence-join: α ∪s β = {〈uv〉[0]}; (3) inter-join: α ∪t β = {〈u〉[0]〈v〉[x] | 1 ≤ x ≤ maxspan}. For example, given maxspan = 2, 〈C〉[0] ∪i 〈D〉[0] = 〈(CD)〉[0]; 〈C〉[0] ∪s 〈D〉[0] = 〈CD〉[0]; and 〈C〉[0] ∪t 〈D〉[0] = {〈C〉[0]〈D〉[1], 〈C〉[0]〈D〉[2]}.
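As a small worked illustration of Definition 4 (not the authors' data structures), the following helpers produce the three kinds of join candidates for two 1-patterns as display strings; all names are ours.

```csharp
using System.Collections.Generic;

public static class OneJoinExample
{
    // itemset-join: <(uv)>[0]
    public static IEnumerable<string> ItemsetJoin(char u, char v)
        => new[] { $"<({u}{v})>[0]" };

    // sequence-join: <uv>[0]
    public static IEnumerable<string> SequenceJoin(char u, char v)
        => new[] { $"<{u}{v}>[0]" };

    // inter-join: <u>[0]<v>[x] for 1 <= x <= maxspan
    public static IEnumerable<string> InterJoin(char u, char v, int maxspan)
    {
        for (int x = 1; x <= maxspan; x++)
            yield return $"<{u}>[0]<{v}>[{x}]";
    }
}
// With maxspan = 2, ItemsetJoin('C','D') yields <(CD)>[0], SequenceJoin('C','D') yields
// <CD>[0], and InterJoin('C','D',2) yields <C>[0]<D>[1] and <C>[0]<D>[2], as in the text.
```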

Definition 5 [8]

Given two frequent k-patterns α and β with k > 1, let subk,k(α) = (u)[i] and subk,k(β) = (v)[j]. α is joinable to β if sub1,k−1(α) = sub1,k−1(β) and i ≤ j, which yields three types of join operation: (1) itemset-join: α ∪i β = {α +i (v)[j] | (i = j) ∧ (u < v)}; (2) sequence-join: α ∪s β = {α +s (v)[j] | (i = j)}; (3) inter-join: α ∪t β = {α +t (v)[j] | (i < j)}.

For example, 〈BC〉[0] ∪i 〈BD〉[0] = 〈B(CD)〉[0], 〈BC〉[0] ∪s 〈BD〉[0] = 〈BCD〉[0], and 〈BC〉[0] ∪t 〈B〉[0]〈D〉[2] = 〈BC〉[0]〈D〉[2].

Definition 6 (Prefix) [14]

A pattern β = ⟨b1b2 ... bm⟩ is called a prefix of pattern α = ⟨α1α2 ... αn⟩ if and only if bi = αi for all 1 ≤ i ≤ m − 1, bm ⊆ αm, m < n. For instance, the prefixes of pattern ⟨D(BC)B⟩ are: ⟨D⟩, ⟨DB⟩ and ⟨D(BC)⟩. Thus, any sequence would be the prefix of its extended sequences based on this definition.
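Under Definition 6, the prefix test reduces to checking itemset equality on all but the last position and set inclusion on the last one. A minimal sketch (types and names are illustrative) follows.

```csharp
using System.Collections.Generic;

public static class PrefixCheck
{
    // Returns true if beta is a prefix of alpha, with each pattern given as its list of
    // itemsets: b_i = a_i for 1 <= i <= m-1, b_m ⊆ a_m, and m < n (Definition 6).
    public static bool IsPrefix(IList<ISet<string>> beta, IList<ISet<string>> alpha)
    {
        int m = beta.Count, n = alpha.Count;
        if (m >= n) return false;                        // m < n is required
        for (int i = 0; i < m - 1; i++)
            if (!beta[i].SetEquals(alpha[i])) return false;
        return beta[m - 1].IsSubsetOf(alpha[m - 1]);
    }
}
// For example, <D>, <DB> and <D(BC)> are all prefixes of <D(BC)B>, as stated above.
```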

3.2 Problem statement

Given a sequence database D, a minimum support threshold (minsup), and a set of constraint itemsets IC = {b1, b2, b3, …, bk}, the task of inter-sequence pattern mining with itemset constraints is to discover all frequent sequences α = 〈α1[w1] α2[w2] … αm[wm]〉 such that ∃αi[wi] ∈ α, ∃bj ∈ IC: bj ⊆ αi.

For instance, let IC = {(C), (E)}; the sequence ⟨C(AB)⟩[0]⟨C(ABC)A⟩[1] satisfies the constraint, whereas the sequence ⟨AD⟩[0]⟨A⟩[1] does not.
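The constraint test itself is a simple subset check: a pattern satisfies IC if at least one of its itemsets contains at least one constraint itemset. A small sketch (illustrative names, not the paper's Check function) follows.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ConstraintCheck
{
    // True if some itemset of the pattern contains some constraint itemset of IC.
    public static bool Satisfies(IReadOnlyList<ISet<string>> patternItemsets,
                                 IReadOnlyList<ISet<string>> constraints)
    {
        return patternItemsets.Any(itemset =>
            constraints.Any(c => c.IsSubsetOf(itemset)));
    }
}
// With IC = {(C), (E)}: <C(AB)>[0]<C(ABC)A>[1] satisfies IC because its first itemset
// contains (C), while <AD>[0]<A>[1] does not, matching the example above.
```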

4 Proposed algorithm

4.1 DBV-PatternList structure

Instead of using the PatternList data structure of the EISP-Miner algorithm [7], we use the DBV-PatternList data structure [9]. This structure uses dynamic bit vectors to store the t-values and p-values of an e-item in the database, and it minimizes the space and time needed to extend e-items into inter-sequence patterns. The DBV-PatternList structure is presented as follows:

  • Start index: the index of the first non-zero value in the bit vector.

  • Bit vector: the array of values remaining after trimming the zero values at the start and the end.
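A minimal sketch of this trimming idea is given below; the field names and the choice of 32-bit words are our own assumptions, not the exact layout used in [9].

```csharp
using System.Collections.Generic;

// A dynamic bit vector stores only the non-zero middle part of a plain bit vector,
// together with the index of its first non-zero word.
public sealed class DynamicBitVector
{
    public int StartIndex { get; }      // index of the first non-zero word
    public uint[] Words { get; }        // words remaining after trimming leading/trailing zeros

    private DynamicBitVector(int startIndex, uint[] words)
    {
        StartIndex = startIndex;
        Words = words;
    }

    public static DynamicBitVector FromWords(IReadOnlyList<uint> raw)
    {
        int first = 0, last = raw.Count - 1;
        while (first <= last && raw[first] == 0) first++;
        while (last >= first && raw[last] == 0) last--;
        if (first > last) return new DynamicBitVector(0, new uint[0]);   // all-zero vector

        var trimmed = new uint[last - first + 1];
        for (int i = 0; i < trimmed.Length; i++) trimmed[i] = raw[first + i];
        return new DynamicBitVector(first, trimmed);
    }
}
```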

Figure 1 presents the PatternList and DBV-PatternList structures.

Fig. 1 Structures of (a) PatternList and (b) DBV-PatternList

As shown in Fig. 1a, storing the information for an e-item with the PatternList data structure requires 26 bytes (12 bytes for the t-values and 14 bytes for the p-values). With the DBV-PatternList data structure (Fig. 1b), only 20 bytes are needed (2 bytes for the start position of the e-item, 4 bytes for the t-values, and 14 bytes for the p-values). Because the t-values are converted to a dynamic bit vector [9], the storage space is reduced.

4.2 DBV-ISPMIC algorithm

Our proposed method relies on the DBV-PatternList structure and the DBV-PatternList joining methods. The model of the DBV-ISPMIC algorithm is shown in Fig. 2.

Fig. 2 The DBV-ISPMIC algorithm

In this algorithm, there are five functions – ISP-Join1, ISP-Joink, ISP-Join1-Extension, ISP-Joink-Extension and Check – as shown in Figs. 3, 4, 5, 6 and 7, respectively. The algorithm operates on a given sequential database with a user-specified minsup, maxspan and set of itemset constraints. The result is the set of frequent inter-sequence patterns that satisfy both the minsup condition and the itemset constraints. The DBV-ISPMIC algorithm has three main steps, presented below.

  • Step 1 The sequential database is scanned once to find all frequent 1-patterns, i.e., those whose support is not less than the user-specified minimum support threshold minsup (Fig. 2, line 1). Then, a tree T is created with a NULL root and the found 1-DBV-PatternLists as its children. The algorithm iterates over all frequent 1-DBV-PatternLists to start expanding the nodes by itemset, sequence and inter extension (Fig. 2, line 2).

  • Step 2 The algorithm calls the ISP-Join1 function (Fig. 3) to extend an αlist node with the remaining children of the T|NULL tree (Definition 4). This function builds new DBV-PatternLists in three ways, via itemset, sequence and inter extension (the latter based on the maxspan value). Each resulting DBV-PatternList is checked against the minsup condition and the itemset constraints; the Check function in Fig. 7 tests whether a new DBV-PatternList satisfies the constraints. If it does, the DBV-PatternList is added as a child node of T|αlist (Fig. 5, lines 1–9).

  • Step 3 The algorithm calls the ISP-Joink function (Fig. 4) to extend each child node of the αlist node with the remaining children according to the k-pattern join (Definition 5). This function again builds new DBV-PatternLists by itemset, sequence and inter extension and checks whether each result satisfies the minsup condition. The Check function tests whether a new DBV-PatternList satisfies the constraints; if it does, the DBV-PatternList is added as a child node of T|αlist (Fig. 6, lines 1–9). A compressed sketch of these steps is given after this list.
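The sketch below condenses Steps 1–3 into a recursive depth-first expansion. It is not the authors' pseudocode (which is given in Figs. 2–7); the types, the Join placeholder and the 1-pattern scan are illustrative stubs only.

```csharp
using System.Collections.Generic;
using System.Linq;

// A tree node: the pattern's itemsets, its support, and its children in the ISP-tree.
public sealed class Node
{
    public List<HashSet<string>> Itemsets = new List<HashSet<string>>();
    public int Support;
    public List<Node> Children = new List<Node>();
}

public static class DbvIspmicSketch
{
    // Step 1 (stub): one database scan would return the frequent 1-patterns here.
    public static List<Node> FindFrequent1Patterns() => new List<Node>();

    // Steps 2-3: join each node with its siblings (itemset-, sequence- and inter-join),
    // keep candidates meeting minsup, report those satisfying the constraints IC, and
    // recurse depth-first on the new children.
    public static void Extend(List<Node> siblings, int minsup,
                              List<HashSet<string>> ic, List<Node> result)
    {
        foreach (var alpha in siblings)
        {
            foreach (var beta in siblings)
                foreach (var cand in Join(alpha, beta))
                    if (cand.Support >= minsup)
                        alpha.Children.Add(cand);

            result.AddRange(alpha.Children.Where(c =>
                c.Itemsets.Any(t => ic.Any(x => x.IsSubsetOf(t)))));

            Extend(alpha.Children, minsup, ic, result);   // depth-first expansion
        }
    }

    // Placeholder for the three join operations of Definitions 4 and 5.
    private static IEnumerable<Node> Join(Node a, Node b) => Enumerable.Empty<Node>();
}
```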

Fig. 3 The ISP-Join1 function

Fig. 4 The ISP-Joink function

Fig. 5 The ISP-Join1-Extension function

Fig. 6 The ISP-Joink-Extension function

Fig. 7 The Check function

4.3 Improved DBV-ISPMIC algorithm

We can see that if a pattern satisfies the itemset constraints, then any new pattern extended from this node by itemset, sequence or inter extension will also satisfy the itemset constraints. This helps to reduce the algorithm’s DBV-PatternList node expansion time. We state a proposition to verify this, as follows.

Proposition 1

(Checking itemset constraints) Given an inter-sequence α and a set of itemset constraints IC, if α satisfies IC, then any sequence β generated from α by extension also satisfies IC.

Proof

Let α = ⟨α1[w1] α2[w2] ... αm[wm]⟩ and β = ⟨β1[w1] β2[w2] ... βu[wu]⟩ be inter-sequences, where each αi and βi is an itemset. Because α satisfies IC, by the problem statement the following condition holds: ∃αi[wi] ∈ α, ∃bj ∈ IC: bj ⊆ αi.

Based on Definition 5, there are three cases to consider:

Itemset-join: β always contains a βi[wi] such that αi ⊆ βi, so bj ⊆ αi ⊆ βi, i.e., bj ⊆ βi. Hence β satisfies IC.

Sequence-join: The itemset αi does not change, i.e., βi = αi, and therefore bj ⊆ βi, so β satisfies IC.

Inter-join: Under the inter-join, all itemsets of α also appear in β, so β always contains a βk[wk] such that αi ⊆ βk; hence bj ⊆ αi ⊆ βk, i.e., bj ⊆ βk. Thus β satisfies IC.

In the ISP-Join1-improving function, we first check the αlist pattern against the itemset constraints. If it satisfies them, then by Proposition 1 all frequent DBV-PatternLists extended from αlist also satisfy the itemset constraints and need not be checked again (Fig. 8, lines 3–12). The same applies to the ISP-Joink-improving function (Fig. 9, lines 2–13). A small sketch of this idea is given below.
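The following fragment sketches how Proposition 1 can be exploited during extension: each candidate carries a flag, and when its parent already satisfies the constraints the subset test is skipped. The types and names are illustrative, not the authors' pseudocode.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class Candidate
{
    public List<HashSet<string>> Itemsets { get; } = new List<HashSet<string>>();
    public bool SatisfiesIC { get; set; }
}

public static class ImprovedCheck
{
    // Proposition 1: any itemset-, sequence- or inter-extension of a pattern that
    // satisfies IC still contains the satisfying itemset, so the flag is inherited
    // and the subset test is only run when the parent does not already satisfy IC.
    public static void MarkConstraint(Candidate parent, Candidate child,
                                      List<HashSet<string>> ic)
    {
        child.SatisfiesIC = parent.SatisfiesIC
            || child.Itemsets.Any(t => ic.Any(c => c.IsSubsetOf(t)));
    }
}
```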

Fig. 8 The ISP-Join1-improving function

Fig. 9 The ISP-Joink-improving function

Consider the example shown in Table 1, with minsup = 2, maxspan = 1 and itemset constraints IC = {(AB), CA, AD}. The frequent patterns generated in the mining phase are presented in Fig. 10.

Fig. 10 The extended tree of patterns corresponding to the example database

Step 1

The sequential database is scanned once by the algorithm to enumerate all frequent 1-patterns, which are {⟨A⟩, ⟨B⟩, ⟨C⟩, ⟨D⟩} with supports of {4, 2, 2, 2}, respectively (Fig. 10, level 0). During this database scan, the 1-DBV-PatternLists are also generated.

Step 2

The algorithm generates the candidates {⟨A⟩[0]⟨A⟩[1], ⟨(AB)⟩[0], ⟨AD⟩[0], ⟨A⟩[0]⟨D⟩[1]}, which share the 1-prefix ⟨A⟩, by joining ⟨A⟩ with ⟨A⟩, ⟨B⟩, ⟨C⟩ and ⟨D⟩ based on itemset, sequence and inter extension (Fig. 8). The other candidates ⟨B⟩[0]⟨A⟩[1], ⟨CA⟩[0], ⟨C⟩[0]⟨A⟩[1] and ⟨CB⟩[0], which share the 1-prefixes ⟨B⟩ and ⟨C⟩, are created at the same time (Fig. 10, level 1).

Step 3

The algorithm traverses the children of each (k−1)-pattern by depth-first search to generate k-patterns. If a (k−1)-pattern satisfies the constraints IC, then its super-patterns do not need to be checked. For instance, ⟨(AB)⟩[0] satisfies the itemset constraints, so we do not need to check the itemset constraints for ⟨(AB)⟩[0]⟨A⟩[1]. This reduces the checking time of the algorithm. In contrast, ⟨A⟩[0]⟨A⟩[1] does not satisfy the itemset constraints, so we must check the itemset constraints for its child nodes. The algorithm backtracks to Step 3 and completes the candidates of ⟨B⟩, ⟨C⟩ and then ⟨D⟩. As no more candidates can be found in branch ⟨D⟩, the algorithm terminates. We obtain FP = {⟨AA⟩[0]⟨AD⟩[1]: support = 2, ⟨(AB)⟩[0]: support = 2, ⟨(AB)⟩[0]⟨A⟩[1]: support = 2, ⟨AD⟩[0]: support = 2, ⟨CA⟩[0]: support = 2, ⟨CA⟩[0]⟨A⟩[1]: support = 2, ⟨C(AB)⟩[0]: support = 2, ⟨C(AB)⟩[0]⟨A⟩[1]: support = 2}.

4.4 Parallel DBV-ISPMIC algorithm

As shown in Fig. 2, the DBV-ISPMIC algorithm is a sequential algorithm. Its running time is tnode_1 + tnode_2 + tnode_3 + … + tnode_n, where the set {node_1, node_2, node_3, …, node_n} contains the child nodes of the ISP-tree T and tnode_i is the time to extend node_i by 1-pattern and k-pattern extension (in the ISP-Join1 and ISP-Joink functions, respectively). Because the ISP-Joink extension function expands the sub-nodes of the ISP-tree T independently, if each sub-branch of T is expanded in its own task, the running time of the algorithm improves. The overall runtime of the algorithm can then be determined as Max{tnode_1, tnode_2, tnode_3, …, tnode_n}.

An example of the above proposal is given in Fig. 11. Based on the database in Table 2 with minsup = 2, the frequent patterns generated at the first level by 1-pattern extension are ⟨A⟩, ⟨B⟩, ⟨C⟩ and ⟨D⟩. For each frequent pattern, the k-pattern extension is processed in an individual task. As stated, the runtime of the algorithm is Max{tTask1, tTask2, tTask3}, since the tasks run in parallel with each other. The number of tasks that can be executed simultaneously is determined by the number of processor cores. This can also be extended to distributed systems, where each task is processed on a separate system and the final results are gathered and combined.
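One way to realize this task allocation in C# is sketched below, assuming the .NET Parallel class is used to expand each first-level branch concurrently; the paper does not specify the exact API, and the names here are illustrative.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelExpansion
{
    // Each frequent 1-pattern roots an independent sub-tree of the ISP-tree, so the
    // sub-trees can be expanded concurrently; the elapsed time is then governed by
    // the slowest branch, i.e., the maximum over the per-branch times.
    public static void ExpandAll(IReadOnlyList<string> frequent1Patterns)
    {
        Parallel.ForEach(frequent1Patterns, pattern =>
        {
            ExpandBranch(pattern);   // k-pattern extension of this branch (placeholder)
        });
    }

    private static void ExpandBranch(string rootPattern)
    {
        // Depth-first expansion of the sub-tree rooted at rootPattern would go here.
    }
}
```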

Fig. 11 Example of using parallel processing for ISP-tree extension

The pDBV-ISPMIC algorithm is the parallelized version of DBV-ISPMIC. Figure 12 shows, as flowcharts, the main steps of the sequential DBV-ISPMIC algorithm and its parallel counterpart, pDBV-ISPMIC. The pDBV-ISPMIC algorithm has two main steps:

  • Step 1: Loading the database and finding the frequent 1-pattern sets that satisfy the minsup.

  • Step 2: Allocating execution tasks, with each task handling the k-pattern extension of one branch. The search space is explored recursively by depth-first search (DFS). When no more candidates can be generated, the algorithm terminates.

    Fig. 12 Flowcharts showing the difference between the sequential and parallel algorithms: (a) the main steps of a sequential algorithm, and (b) the main steps of a parallel processing algorithm

5 Experimental results

To evaluate the performance of the DBV-ISPMIC algorithm and its improvement in runtime, all experiments were carried out on a PC with an Intel® Core™ i7-10510U (10th generation) processor @ 1.8–4.9 GHz and 20 GB of RAM, running Windows 10 64-bit. The algorithms were implemented in C# with Visual Studio 2017.

5.1 Experimental databases

We ran tests on five databases, namely C6T5S4I4N1kD1k, C6T5S4I4N1kD10k, Gazelle, BIKE and BMSWebView1. The synthetic databases used for comparison were generated using the IBM synthetic data generator. These databases are available at https://www.mediafire.com/folder/id3p3z6b9g8kj. Their characteristics are shown in Table 3.

Table 3 Test database characteristics

5.2 Runtime

For the C6T5S4I4N1kD1k database, we evaluated the algorithms with minsup = 0.5% and maxspan = {1, 2, 3, 4, 5}; for the C6T5S4I4N1kD10k database, with minsup = 5% and maxspan = {1, 2, 3, 4, 5}; for the Gazelle database, with minsup = 1% and maxspan = {1, 2, 3, 4, 5}; for the BIKE database, with minsup = 0.5% and maxspan = {1, 2, 3, 4, 5}; and for the BMSWebView1 database, with minsup = 0.5% and maxspan = {1, 2, 3, 4, 5}. As a baseline for evaluating the proposed algorithms, we use the EISP-Miner algorithm [7] with a constraint check added after mining; this approach is called the Post-EISPMiner algorithm.

Based on the experimental results in Figs. 13, 14, 15, 16 and 17, we can see that the DBV-ISPMIC-IMPROVING algorithm runs faster than the other two algorithms, Post-EISPMiner and DBV-ISPMIC. In Fig. 13, we compare the runtimes of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING on the Gazelle dataset. As the value of maxspan increases, the running times of all three algorithms increase relatively evenly.

Fig. 13 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the Gazelle dataset

Fig. 14 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset

Fig. 15 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD10k dataset

Fig. 16 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BIKE dataset

Fig. 17 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BMSWebView1 dataset

Figures 14, 15, 16 and 17 show the results for the C6T5S4I4N1kD1k, C6T5S4I4N1kD10k, BIKE and BMSWebView1 datasets. It is clear that the runtimes for DBV-ISPMIC-IMPROVING and DBV-ISPMIC are much better than that of Post-EISPMiner.

5.3 Parallel method for efficient mining of inter-sequence patterns with itemset constraints

Because DBV-ISPMIC-IMPROVING is the best algorithm for mining inter-sequence patterns with itemset constraints, we developed a parallel version of it, pDBV-ISPMIC-IMPROVING, using the C#.NET software library to improve the performance. The performance of the pDBV-ISPMIC-IMPROVING algorithm is evaluated by comparing it with that of the DBV-ISPMIC-IMPROVING algorithm. The results are shown in Figs. 18, 19, 20 and 21; it can be seen that as maxspan increases, the runtime of pDBV-ISPMIC-IMPROVING is much less than that of the DBV-ISPMIC-IMPROVING algorithm.

Fig. 18 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset

Fig. 19 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD10k dataset

Fig. 20 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the BIKE dataset

Fig. 21 Execution times in a parallel evaluation of pPost-EISPMiner, pDBV-ISPMIC and pDBV-ISPMIC-IMPROVING for the BMSWebView1 dataset

5.4 Memory usage

Figures 22, 23, 24, 25 and 26 show the peak memory consumption of the three algorithms, Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING. The results show that the memory needed by DBV-ISPMIC and DBV-ISPMIC-IMPROVING is less than that needed by the Post-EISPMiner algorithm for almost all database parameter values. Because the two proposed algorithms reduce the number of generated child nodes that must be checked and stored, they use less memory than the Post-EISPMiner algorithm.

Fig. 22 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the Gazelle dataset

Fig. 23 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset

Fig. 24 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD10k dataset

Fig. 25 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BIKE dataset

Fig. 26 Memory usage of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the BMSWebView1 dataset

5.5 Impact of maxspan

For mining inter-sequence patterns, when the maxspan value increases, the number of generated candidates also increases. Therefore, using Proposition 1 to reduce the itemset constraint checking improves the processing time. To evaluate this, we used two databases, Gazelle and C6T5S4I4N1kD1k. The Gazelle database (minsup = 3%) and the C6T5S4I4N1kD1k database (minsup = 0.8%) were tested with the maxspan value increasing from 2 to 12. The results show that the proposed algorithms (DBV-ISPMIC and DBV-ISPMIC-IMPROVING) consistently perform well (Figs. 27 and 28).

Fig. 27 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the Gazelle dataset, with maxspan from 2 to 12

Fig. 28 Execution times of Post-EISPMiner, DBV-ISPMIC and DBV-ISPMIC-IMPROVING for the C6T5S4I4N1kD1k dataset, with maxspan from 2 to 12

6 Conclusions and future work

In this study, we introduced an algorithm to solve the problem of mining inter-sequence patterns with itemset constraints. This algorithm, named DBV-ISPMIC, is based on the EISP-Miner algorithm for mining inter-sequence patterns and uses a dynamic bit vector structure to store data, which increases the processing speed and reduces the storage space compared to EISP-Miner. Based on the DBV-ISPMIC algorithm, we also proposed an improvement that helps reduce the processing time.

In the future, we will apply distributed computing to the improved algorithm to help optimize the running time. We will also study how to put the constraints into mining frequent closed inter-sequences. Finally, algorithms for mining high utility sequences have been proposed in recent years [19,20,21,22,23,24,25], and we will study how to mine high utility inter-sequences and high utility inter-sequences with constraints.