Abstract
The prediction of the future workload of applications is an essential step guiding resource provisioning in cloud environments. In our previous works, we proposed two prediction models based on pattern mining. This paper builds on that experience and focuses on the time and space complexities of the prediction model. Specifically, it presents a general approach to improving the efficiency of the pattern mining engine, which in turn improves the efficiency of the predictors. The approach is composed of two steps: (1) to improve space complexity, redundant occurrences of patterns are defined and algorithms are suggested to identify and omit them; (2) to improve time complexity, a new data structure, called the closed pattern backward tree, is presented for mining closed patterns directly. The approach not only improves the efficiency of our predictors but can also be employed in other fields of pattern mining. The performance of the proposed approach is investigated on real and synthetic cloud workloads. The experimental results show that the proposed approach improves the efficiency of the pattern mining engine significantly in comparison with common methods for extracting closed patterns.
1 Introduction
Elasticity, one of the prominent features of cloud computing, is the degree to which a system adapts to workload changes by automatically provisioning and deprovisioning resources so that the allocated resources match the current demand [1, 2]. Predicting future demand is the only practical and effective way to provision resources quickly and implement rapid elasticity [3, 4]. The most important challenges of application prediction models are as follows [3]:
Complexity Each prediction model needs computation resources to estimate the future behavior of applications. This consumption should not be significant in comparison with that of the applications themselves. So, the time and space complexities of the prediction model should be low enough that its deployment is affordable.
Pattern length In most of the prediction models, such as [4,5,6,7,8,9,10], the pattern length is fixed. These models use a sliding window, so the extracted patterns have a predefined length. This constraint restricts the model to specific patterns and prevents it from learning other useful patterns. Choosing the pattern length is itself a challenge: it should be selected so that the most popular patterns can be extracted and application behavior can be estimated accurately.
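The fixed-length extraction described above can be sketched in a few lines of Python; the function name and the use of plain lists of abstract states are illustrative simplifications, not the cited models' implementations.

```python
def sliding_windows(seq, length):
    """Extract every fixed-length pattern from a sequence of abstract
    states with a sliding window (illustrative simplification)."""
    return [tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)]

print(sliding_windows(["Low", "High", "High", "Medium"], 2))
# → [('Low', 'High'), ('High', 'High'), ('High', 'Medium')]
```

Every extracted pattern has exactly the window length, which is precisely the restriction the paragraph above points out.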
In our previous works, based on Sequential Pattern Mining (SPM), we proposed two new prediction models, called POSITING [11] and RELENTING [2]. As Fig. 1 shows, POSITING considers application behavior in the past, extracts behavioral patterns and stores them in the off-line pattern base. Based on the extracted patterns and the recent behavior of the application, POSITING predicts the future demand for different resources. In [2], we developed POSITING with the capability of online learning and proposed RELENTING. While RELENTING predicts the status of all the allocated resources, it also learns the new behavior of the application rapidly without gathering new data and retraining the model.
According to Fig. 1, the pattern mining engine is the core of the predictors. To improve mining efficiency and avoid information loss, a compressed set of patterns, called closed patterns, is extracted by the pattern mining engine [2, 11]. A pattern is closed if none of its super-patterns has the same frequency as the pattern [12]. The common approach for extracting closed episodes under gap constraints is based on a hash table [12, 13]. As our experimental results show, this approach is very time-consuming when there are many candidate closed patterns.
The main goal of the paper is to improve the time and space complexities of the pattern mining engine so that it can be employed efficiently in different fields such as [14,15,16,17]. To improve space complexity, we define redundant occurrences of patterns and prove that omitting them causes no information loss. Then, using a new imProved RepresentatiOn of the Stream based on PointERs (\({\textit{P}ROSPER}\)), we extend the algorithms proposed in [11] to identify the redundant occurrences and omit them from the occurrence list of patterns, thereby reducing memory consumption. To improve time complexity, we introduce a new data structure, called the closed pattern backward tree (\({\textit{C}PBT}\)), to extract the closed patterns directly. As the experimental results will show, \({\textit{C}PBT}\) improves the efficiency of the pattern mining engine significantly when there are a huge number of patterns. So, the contributions of this paper are as follows:
Since SPM is widely used in different fields, this paper presents a general approach to improve time and space complexities of the pattern mining engine.
To improve space complexity, this paper proposes \({\textit{P}ROSPER}\), defines the redundant occurrences of patterns and suggests algorithms to identify and omit them. Thus, the length of the occurrence list of patterns decreases and memory consumption improves (Sect. 4).
To improve time complexity, this paper introduces a new data structure, called \({\textit{C}PBT}\), to store the closed patterns. Based on \({\textit{C}PBT}\), depth-first search algorithms are presented to extract the closed patterns under gap constraints directly without storing/processing all of the candidate closed patterns (Sect. 5).
The performance of the proposed approach is evaluated using both real and synthetic workloads and compared with the hashing based approach. According to the evaluation results, the proposed approach outperforms the hashing based approach significantly when there are a large number of the candidate closed patterns (Sect. 6).
The rest of the paper is organized as follows: The related works are discussed in Sect. 2. Section 3 introduces the main concepts of POSITING and RELENTING briefly. To improve space complexity, the redundant occurrences of patterns are discussed in Sect. 4. To improve time complexity, Sect. 5 introduces \({\textit{C}PBT}\) and presents new algorithms to extract the closed patterns directly. Experimental results are presented in Sect. 6. Finally, the paper is concluded in Sect. 7.
2 Related work
The sequential patterns could capture causative chains of influences present in data. They are useful in many real-life applications [12, 18, 19] such as the prediction of system failures [20], ICT risk assessment and management [21] and mining web access patterns [22].
Events can be classified into two groups: time-point events and time-interval events. Time-point events are stamped with the time of the event occurrence. In contrast, time-interval events describe the status of variables over time intervals. An episode is defined as a partially ordered collection of events that occur together [23]. The main goal of episode mining is to find the relationships between events [24].
Most research works focus on events stamped with a time point and extend algorithms to extract the corresponding episodes. However, time-interval events arise in many applications such as health care, data networks, financial applications [25, 26] and cloud workloads [2, 11]. Mining time-interval episodes is a relatively young research field [27], and most research works ignore time-interval events and the relationships between them [28]. Presenting algorithms capable of learning from such complex data is therefore one of the most important and challenging topics in data mining [27, 29]; mining time-interval episodes from such data is clearly more complicated [28]. This paper focuses on time-interval events and proposes a general approach for mining closed time-interval episodes efficiently. In the following, we briefly review some research works on time-interval events.
Winarko et al. [28] propose a new algorithm, ARMADA, for discovering temporal patterns from interval-based data. The authors extend MEMISP (MEMory Indexing for Sequential Pattern mining) [30] to mine frequent patterns. Their algorithm requires one database scan and does not require the candidate generation.
In [31], time-stamped multivariate data are converted into time interval-based abstract concepts by using the temporal abstraction (see Sect. 3.1). An algorithm, called KarmaLego, enumerates all of the patterns whose frequency is above a given support threshold. Moskovitch et al. [32] improve KarmaLego to handle thousands of symbol types.
Batal et al. [27] present Recent Temporal Pattern (RTP) mining to find predictive patterns for the event detection problems. At first, their approach converts the time series data into time-interval sequences of temporal abstractions. Then complex time-interval patterns are constructed by using temporal operators. The mining algorithm explores the space of temporal patterns in the level by level fashion. They also present the minimal predictive recent temporal patterns framework to choose a small set of predictive and non-spurious patterns.
In [33], the abstracted time series is used to find temporal association rules by generalizing Allen’s rules [34] into a relation called PRECEDES. The user defines a set of complex patterns, which constitutes the basis of the construction of temporal rules. An Apriori-like algorithm looks for meaningful temporal relationships among the complex patterns.
Patel et al. [35] augment the hierarchical representation with count information to achieve a lossless representation. The hierarchical representation provides a compact mechanism to express the temporal relations among events. Based on this representation, an Apriori-based algorithm, called IEMiner (Interval-based Event Miner), is proposed to discover frequent temporal patterns. IEMiner employs two optimization strategies to reduce the search space. Finally, interval-based temporal patterns are used for classification.
Physiological conditions of patients are reported using variables such as blood pressure and heart rate [27, 36, 37]. Ghosh et al. [37] propose an approach that combines sequential patterns extracted from multiple physiological variables and captures the interactions between these patterns by Coupled Hidden Markov Models (CHMMs).
Laxman et al. [38] present a pattern discovery framework that utilizes event duration information in the temporal data mining process. They incorporate event duration information explicitly into the episode structure and discover event duration-dependent temporal correlations in the data. Furthermore, they define “principal episodes”, which are similar to closed episodes, and extract them based on the hashing approach.
As observed, most research works focus on mining frequent patterns, which might lead to extracting a huge number of patterns. To improve mining efficiency and avoid information loss, closed patterns are usually extracted [39]. Most works, such as [12, 39], focus on extracting closed episodes from sequences of time-point events. The common approach to extract closed episodes under gap constraints is based on a hash table [12, 13]. It generates closed patterns in two steps [13]. In the first step, candidate closed patterns are extracted from the frequent patterns. In the second step, closed patterns are determined among the candidates by a hashing procedure with frequency as the key: all the candidates with the same frequency are hashed to the same bucket, and among the candidates in a bucket, those for which a super-pattern with the same frequency is found are discarded. As our experimental results show, this approach is not appropriate for extracting closed time-interval patterns for small values of the frequency threshold because it generates a huge number of candidate closed patterns. In this paper, we introduce a new data structure, called \({\textit{C}PBT}\), to store the closed patterns. Based on \({\textit{C}PBT}\), depth-first search algorithms are proposed to extract the closed patterns directly. As the experimental results will show, \({\textit{C}PBT}\) improves the efficiency of the pattern mining engine significantly when there are a huge number of candidate closed patterns.
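As a concrete illustration of the hashing procedure described above, the following Python sketch groups candidate patterns by frequency and discards a candidate when its bucket contains a super-pattern of equal frequency. For simplicity, a pattern is modeled as a flat tuple of states and the sub-pattern test is a plain subsequence check; the names `filter_closed` and `is_super_pattern` are illustrative, not the paper's API.

```python
from collections import defaultdict

def is_super_pattern(sup, sub):
    # Toy sub-pattern test: a pattern is a tuple of states here, and
    # sub must appear in sup as a subsequence (order preserved).
    it = iter(sup)
    return all(state in it for state in sub)

def filter_closed(candidates):
    """candidates: list of (pattern, frequency) pairs."""
    buckets = defaultdict(list)
    for pattern, freq in candidates:
        buckets[freq].append(pattern)   # same frequency -> same bucket
    closed = []
    for freq, patterns in buckets.items():
        for p in patterns:
            # p survives only if no strict super-pattern shares its frequency
            if not any(q != p and is_super_pattern(q, p) for q in patterns):
                closed.append((p, freq))
    return closed

cands = [(("a",), 3), (("a", "b"), 3), (("a", "b", "c"), 2)]
print(filter_closed(cands))  # ("a",) is pruned: ("a", "b") has the same frequency
```

Note that every candidate in a bucket is compared against every other, which is exactly why this post-processing step becomes expensive when low frequency thresholds produce huge candidate sets.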
3 An overview of the pattern mining engine of POSITING/RELENTING
In this section, we consider the structure of POSITING briefly. Firstly, the background concepts such as event, stream and episode are defined. Then, the episode occurrence is discussed. Finally, the pattern mining engine is explained concisely. We recommend that readers refer to [2, 11] for more detail.
3.1 Background concepts
As Fig. 2 shows, POSITING converts the numeric time series of all the resources allocated to the application into a sequence of abstractions \(\langle S_1[st_1, et_1],\ldots , S_n[st_n, et_n] \rangle \) where \(S_i\in Status,1\le i \le n\) is an abstraction that holds from time \(st_i\) to time \(et_i\) and Status is the abstraction alphabet. Let \(Status=\{S_1,\ldots ,S_M\}\) be a set of the abstract values and \(ResourceType=\{R_1,\ldots ,R_N\}\) be a set of all the resources allocated to the application. Without loss of generality, we define an arbitrary order on ResourceType, for example \(R_1<R_2<\cdots <R_N\).
Definition 1
An event \(e_i\) is defined as a 4-tuple \(\langle r_i,s_i,st_i,et_i\rangle \), which means the abstract value of \(r_i\in ResourceType\) is \(s_i\in Status\) from the start time \(st_i\) to the end time \(et_i\). The span of the event \(e_i=\langle r_i,s_i,st_i,et_i\rangle \) is \(\varDelta e_i=et_i-st_i>\epsilon \), where \( \epsilon \) is a non-negative integer constant (\( \epsilon \in \mathbb {Z}_{\ge 0} \)). So the span of each event is at least \( \epsilon +1 \) time slots.
All the discretized time series of the resources are represented as a multivariate stream. Note that the value of \(\epsilon \) depends on the length of the sampling intervals. In coarse-grained sampling, \(\epsilon \) is set to small values; for fine-grained sampling, \(\epsilon \) can be set to larger values.
Definition 2
A multivariate stream \(E=\langle e_1,e_2,\ldots ,e_n\rangle \), where n is the index of the latest observed event, is a sequence of events that are ordered according to their start times: \(\forall i,j,\;1\le i<j\le n:\;st_i\le st_j\).
Definition 3
A state is an ordered pair of (r, s), where \(r\in ResourceType\) and \(s\in Status\). The Resource-Status (RS) is a set of all the possible states: \(RS=\{(r,s)|\forall r\in ResourceType, \forall s\in Status\}\).
Example 1
Figure 3 shows a multivariate stream E with \(ResourceType=\{CPU,Memory,Disk\}\) and \(Status=\{Very\;Low,Low,Medium,High\}\). If the order on ResourceType is defined as \(CPU<Memory<Disk\), the events of E are ordered according to Definition 2 as shown in Fig. 3.
If the span of an event is large, it is decomposed based on the decomposition unit \(\mu \). For example, the event (Disk, Medium, 3, 7) with \(\mu =3\) is decomposed into two events (Disk, Medium, 3, 6) and (Disk, Medium, 6, 7). However, after decomposing the event e, the span of the last decomposed event might not exceed \(\epsilon \). In that case, to satisfy Definition 1, the last and penultimate decomposed events are merged.
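The decomposition and merge step above can be sketched as follows; the function name and the tuple layout of events are illustrative assumptions.

```python
def decompose(event, mu, eps):
    """Split a long event into chunks of span mu; merge a too-short tail.

    event: (resource, status, st, et); by Definition 1, spans must stay > eps.
    """
    r, s, st, et = event
    pieces = []
    t = st
    while et - t > mu:
        pieces.append((r, s, t, t + mu))
        t += mu
    pieces.append((r, s, t, et))
    # If the tail's span is <= eps, merge it into the previous piece.
    if len(pieces) > 1 and pieces[-1][3] - pieces[-1][2] <= eps:
        r_, s_, st_, _ = pieces[-2]
        pieces[-2:] = [(r_, s_, st_, et)]
    return pieces

print(decompose(("Disk", "Medium", 3, 7), mu=3, eps=0))
# → [('Disk', 'Medium', 3, 6), ('Disk', 'Medium', 6, 7)]
```

With `eps=1` the same call would merge the one-slot tail back, returning the undecomposed event, which is exactly the merge rule described above.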
Inspired by the temporal relations defined in [27], we define two types of relations between events: concurrent and consecutive.
Definition 4
Given the stream \(E=\langle e_1,\ldots ,e_n \rangle \), two events \(e_i\) and \(e_j\), \(1\le i,j\le n \), are concurrent iff \(|st_i-st_j|\le \epsilon \) and are consecutive iff \(|st_i-st_j|>\epsilon \).
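Under Definition 4, classifying a pair of events reduces to comparing their start times; a minimal sketch (the function name and event-tuple layout are illustrative):

```python
def relation(ei, ej, eps):
    """Classify two events per Definition 4 using their start times."""
    sti, stj = ei[2], ej[2]   # events are (resource, status, st, et) tuples
    return "concurrent" if abs(sti - stj) <= eps else "consecutive"

e1 = ("CPU", "High", 0, 3)
e2 = ("Memory", "Medium", 1, 4)
print(relation(e1, e2, eps=1))  # → concurrent
```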
Mannila et al. [23] informally define an episode as a partially ordered collection of events that occur together. Inspired by this definition, we present a detailed definition of the episode based on our problem domain. Note that we use the terms “pattern” and “episode” interchangeably in this paper.
Definition 5
A Concurrent Nodes Group (CNG) \(G=D_1D_2\cdots D_l\) is a group of nodes such that \(\forall D_{j},D_{m}\in G,1\le j,m\le l\), there is no partial order between \(D_{j}\) and \(D_{m}\).
Definition 6
An episode \(\alpha \) is defined as a directed acyclic graph \(\alpha =(V_{\alpha },\prec _{\alpha },g_{\alpha })\), where \(V_{\alpha }\) is a set of nodes, \(\prec _{\alpha }\) is a partial order on \(V_{\alpha }\) and \(g_{\alpha }:V_{\alpha }\rightarrow RS\) is a function that maps each node to one state. The episode \(\alpha \) is composed of \(k\;(\ge 1)\) CNGs in the form of \(G_1=D^{1}_{1},D^{1}_{2},\ldots ,D^{1}_{l_1}\), ..., \(G_k=D^{k}_{1},D^{k}_{2},\ldots ,D^{k}_{l_k}\) such that:
1. \(|G_i|=l_i\)
2. \(V_{\alpha }=\{D^{1}_{1}\ldots ,D^{1}_{l_1},D^{2}_{1}\ldots ,D^{2}_{l_2},\ldots ,D^{k}_{1}\ldots ,D^{k}_{l_k}\}\)
3. \(\forall D^{i}_{j}\in G_i,\forall D^{m}_{n}\in G_{m},1\le i<m\le k,j\in \{1,\ldots ,l_i\},n\in \{1,\ldots ,l_m\}:D^{i}_j\prec _{\alpha }D^{m}_{n}\)
4. \(|CNG_{\alpha }|=k\)
5. \(G'_{i}=\{(r,s)\in RS\mid g_{\alpha }(v)=(r,s),\;\forall v\in G_i\}\)
The episode \(\alpha \) could be represented as a general form \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\).
Example 2
Consider the episode \(\alpha =(V_{\alpha },\prec _{\alpha },g_{\alpha })\) in Fig. 4. The set \(V_{\alpha }\) contains four nodes. As it is shown, the function \(g_{\alpha }\) maps the nodes into the states and \(D^{1}_{1}\prec _{\alpha }D^{2}_{1}\), \(D^{1}_{1}\prec _{\alpha }D^{2}_{2}\), \(D^{1}_{2}\prec _{\alpha }D^{2}_{1}\) and \(D^{1}_{2}\prec _{\alpha }D^{2}_{2}\). As a simple graphical notation, this episode is represented as \(\alpha =(CPU,High)(Memory,Medium)\rightarrow (CPU,Low)(Disk,Low)\).
3.2 The episode occurrence
Informally, an occurrence of an episode in the stream means that the nodes of the episode have corresponding events in the stream such that the partial order of the episode is preserved [40]. A frequent episode occurs often enough in the stream. Given a frequency threshold \( \theta \in {\mathbb {R}}_{\ge 0} \), the goal of episode mining is to extract all the frequent episodes in the stream. We choose the Non-Overlapped (NO) frequency [41] to compute the frequency of episodes. Two occurrences \(h_1\) and \(h_2\) of the episode \( \alpha \) are said to be non-overlapped if either \( h_1 \) starts after \( h_2 \) ends or \( h_2 \) starts after \( h_1 \) ends [41]. The NO frequency is computed using minimal occurrences. A minimal occurrence is an occurrence that contains no other occurrence. So, \( freq(\alpha ) \) is the cardinality of a maximal NO set of minimal occurrences of the episode \( \alpha \) in the stream [11, 41]. In Sect. 4, we discuss how to compute the episode frequency based on the occurrences.
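Given the minimal occurrences of an episode as (start, end) spans, a maximal non-overlapped set can be counted greedily by always keeping the occurrence that ends earliest. The sketch below is an illustrative simplification of the NO frequency computation, not the engine's actual algorithm:

```python
def no_frequency(occurrences):
    """Greedily count a maximal non-overlapped set of occurrences.

    occurrences: (start, end) spans of minimal occurrences; two spans
    are non-overlapped when one starts strictly after the other ends.
    """
    count, last_end = 0, float("-inf")
    for start, end in sorted(occurrences, key=lambda o: o[1]):
        if start > last_end:   # this occurrence starts after the last kept one ends
            count += 1
            last_end = end
    return count

print(no_frequency([(0, 3), (2, 5), (6, 8)]))  # → 2
```

Sorting by end time before the greedy pass is the standard interval-scheduling trick that makes the selected set maximal.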
Definition 7
Given the episode \(\alpha \) such that \(|CNG_{\alpha }|=k\) and \(1\le i\le k \), for each occurrence of \(\alpha \), the starting interval of the occurrence of \(G_i\) is \([t^{i}_{1},t^{i}_{2}]\), where \(t^{i}_{1}\) and \(t^{i}_{2}\) are the minimum and maximum start times of the events matched to the nodes of \(G_i\).
Definition 8
Given the episode \(\alpha \) such that \(|CNG_{\alpha }|=k\), each occurrence O of \(\alpha \) is determined as a sequence of the starting intervals of CNGs: \( O=([t^{i}_{1},t^{i}_{2}]^{k}_{i=1}) \)
Example 3
Consider the stream \(E=\langle e_1=(CPU,High,0,3),e_2=(Memory,Medium,0,4),e_3=(Network,Low,0,2),e_4=(Disk,Medium,0,3),e_5=(Network,Medium,2,5),e_6=(CPU,Low,3,5),e_7=(Disk,Low,3,5),e_8=(Memory,Very\;Low,4,5)\rangle \). For \(\epsilon =0\), there is an occurrence of the episode \(\alpha \) given in Example 2 in the stream E. The starting intervals of the occurrence of \(G_1\) and \(G_2\) are \([t^{1}_1,t^{1}_2]=[0,0]\) and \([t^{2}_1,t^{2}_2]=[3,3]\) respectively. So the occurrence is represented as \( O=([0,0],[3,3])\).
Dynamic resource allocation is based on virtualization techniques [42]. Given the time spent booting VMs, patterns should be extracted from application behavior in a way that the SLA is satisfied and energy waste is avoided. Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\) and an occurrence \(O=([w^{i}_1,w^{i}_2]^{k}_{i=1})\) of \(\alpha \), if the time it takes to instantiate a new VM instance is \(\delta (>\epsilon )\) time slots, the starting interval of \(G'_{i+1},1\le i< k,\) should begin after \(w^{i}_2+\delta \). Thus, the resources manager has enough time to instantiate a new VM instance. On the other hand, if resources are allocated long before a workload burst occurs, energy and resources are wasted. According to the discretion of the resources manager and the characteristics of the cloud data center, the gap constraint \(\varDelta (\ge \delta )\) determines that resources may be allocated at most \(\varDelta -\delta \) time slots before a workload burst occurs. Therefore, the Valid Interval (VI) of \(G'_{i+1}\) is \(VI([w^{i}_1,w^{i}_2],i+1)=[w^{i}_2+\delta , w^{i}_2+\varDelta ]\), which satisfies QoS and SLA requirements while avoiding energy waste. \(\delta \) and \(\varDelta \) are called the minimum and maximum internal gaps respectively.
To compute the NO frequency of episodes under gap constraints, tracking the minimal occurrences of episodes is not enough [12]. So we introduced a new type of the occurrence, called the latest occurrence, to compute the NO frequency under gap constraints in [11]:
Definition 9
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\) and the internal gaps \(\delta \) and \(\varDelta \), an occurrence \(O=([w^{i}_1,w^{i}_2]^{k}_{i=1})\) of \(\alpha \) is a valid occurrence iff \(\forall i, 1\le i<k, w^{i}_2+\delta \le w^{i+1}_{1}\le w^{i}_2+\varDelta \).
Example 4
Consider Example 3. If \( \delta =2 \) and \( \varDelta =3 \), \(VI([0,0],2)=[0+2,0+3]=[2, 3]\). In addition, the occurrence O is valid because we have \( 0+2\le 3\le 0+3 \).
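The valid interval and the validity test of Definition 9 can be expressed directly in code; this is a minimal sketch with illustrative names, reproducing Example 4:

```python
def valid_interval(w2, delta, Delta):
    """VI of the next CNG, given that the previous start interval ends at w2."""
    return (w2 + delta, w2 + Delta)

def is_valid(occurrence, delta, Delta):
    """Definition 9: each CNG must begin inside the VI of its predecessor.

    occurrence: list of start intervals [(w1, w2), ...], one per CNG.
    """
    for (_, w2), (n1, _) in zip(occurrence, occurrence[1:]):
        if not (w2 + delta <= n1 <= w2 + Delta):
            return False
    return True

# Example 4: O = ([0, 0], [3, 3]) with delta=2, Delta=3
print(valid_interval(0, 2, 3))           # → (2, 3)
print(is_valid([(0, 0), (3, 3)], 2, 3))  # → True
```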
Definition 10
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\), if for a valid occurrence \(O=([t^{i}_1,t^{i}_2]^{k}_{i=1})\) of \(\alpha \) there exists no other valid occurrence \(Q=([w^{i}_1,w^{i}_2]^{k-1}_{i=1},[t^k_1,t^k_2])\) of \( \alpha \) such that \(\exists j,1\le j<k, w^{j}_1>t^{j}_{1}\), it is said that O includes the Latest Prefix Occurrence (LPO).
Definition 11
Each valid occurrence of the episode \(\alpha \) that includes LPO, is called the Latest Occurrence (LO). \(LO(\alpha )\) is a set of all the latest occurrences of \(\alpha \).
Example 5
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow G'_3\), \(\epsilon =1\), \(\delta =4\) and \(\varDelta =7\), Fig. 5a shows the occurrences of \(G'_i,i=1,2,3\). Note that there are two occurrences for each \(G'_i\): \(A_1=[1,2]\) and \(A_2=[4,5]\) are the occurrences of \(G'_1\), \(B_1=[9,10]\) and \(B_2=[12,13]\) are the occurrences of \(G'_2\) and \(C_1=[17,18]\) and \(C_2=[21,22]\) are the occurrences of \(G'_3\). Figure 5b shows that four valid occurrences (note that each LO is also a valid occurrence) can be identified for \(\alpha \) under the gap constraints. According to Definition 11, the occurrences corresponding to the red lines in Fig. 5b are not LOs. As the figure shows, there are two LOs for \(\alpha \): \((A_2,B_2,C_1)\) and \((A_2,B_2,C_2)\).
In previous works, a superset of the minimal occurrences, called Minimal Prefix Occurrences (MPOs), is extracted to compute the NO frequency under gap constraints. An MPO with the span of [ts, te] is a valid occurrence (satisfying gaps) such that there is no other valid occurrence that starts strictly after ts and ends at or before te [12].
Example 6
Consider Example 5 again. Assume there is only the occurrence \(A_2\) for \(G'_1\). In this case, according to the definition of MPO, there are three MPOs: \(O_1=(A_2,B_1,C_1)\), \(O_2=(A_2,B_2,C_1)\) and \(O_3=(A_2,B_2,C_2)\) for \(\alpha \). On the contrary, there are two LOs: \(O_2=(A_2,B_2,C_1)\) and \(O_3=(A_2,B_2,C_2)\) for \(\alpha \). This example shows that LOs of \(\alpha \) are a subset of its MPOs.
Lemma 1
Given the episode \(\alpha =G'_1\rightarrow \cdots \rightarrow G'_k\), if \(MPO(\alpha )\) is a set of all the minimal prefix occurrences of \(\alpha \), then \(LO(\alpha )\subseteq MPO(\alpha )\).
Definition 12
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\), \(LOList(\alpha )\) includes a 4-tuple \((t^{k-1}_{2},t^{k}_1,t^{k}_2,t^{1}_1)\) for each occurrence \(O=([t^{i}_{1},t^{i}_{2}]^{k}_{i=1})\in LO(\alpha )\). \( LOList(\alpha )[i]\) is the i-th member of \(LOList(\alpha )\).
The focus of the following sections is on improvements to time and space complexities of the pattern mining engine.
3.3 The pattern extraction
The pattern mining engine constructs a pattern tree and extracts frequent patterns.
Definition 13
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\) and \((r,s)\in RS\), the serial extension of \(\alpha \) with (r, s) is: \(G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\rightarrow \{(r,s)\}\).
Definition 14
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k-1}\rightarrow G'_k\) and \((r,s)\in RS\), the concurrent extension of \(\alpha \) with (r, s) is: \(\alpha \odot (r,s)=G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k-1}\rightarrow (G'_k\cup \{(r,s)\})\).
The lexicographic tree (pattern tree) is constructed based on the serial and concurrent extensions as follows [13]:
The root is labeled with \(\varnothing \).
Each node n of the tree is labeled with a state. Label(n) is the corresponding label of the node n.
Each node n of the tree corresponds to an episode. Pattern(n) is the corresponding episode of the node n.
If \(Pattern(n)=\alpha \), the corresponding episode of each child of n is either a serial extension or a concurrent extension of \(\alpha \).
Siblings are ordered from left to right: the label of a left sibling is less than the label of its right sibling.
Figure 6 shows a part of the pattern tree constructed on RS [11]. Note that \(\forall i,j, 1\le i\le N,1\le j\le M,(R_i,S_j)\in RS\). Here, the lexicographic order is defined on RS as \((R_1,S_1)<\cdots<(R_1,S_M)<(R_2,S_1)<\cdots<(R_2,S_M)<\cdots<(R_N,S_1)<\cdots <(R_N,S_M)\). The root of the tree is null. All the patterns in the tree are generated only by the serial extension or the concurrent extension. For example the episode \(((R_1,S_1)(R_2,S_1))\) is generated from the concurrent extension of \((R_1,S_1)\) with \((R_2,S_1)\) and the episode \(((R_1,S_1)\rightarrow (R_N,S_M))\) is generated from the serial extension of \((R_1,S_1)\) with \((R_N,S_M)\).
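The two extension operators that generate the children of a tree node can be sketched as follows, representing an episode as a list of CNGs (each CNG a sorted tuple of (resource, status) states); the function names and this representation are illustrative assumptions, not the engine's implementation.

```python
def serial_extend(episode, state):
    """Append a new CNG holding only `state` (Definition 13)."""
    return episode + [(state,)]

def concurrent_extend(episode, state):
    """Add `state` to the last CNG (Definition 14). In the lexicographic
    tree only states greater than those already in the last CNG are added;
    the CNG is kept sorted here for safety."""
    *prefix, last = episode
    return prefix + [tuple(sorted(last + (state,)))]

alpha = [(("CPU", "High"), ("Memory", "Medium"))]
print(serial_extend(alpha, ("CPU", "Low")))
print(concurrent_extend([(("CPU", "High"),)], ("Memory", "Medium")))
```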
Mining frequent episodes might lead to extracting a huge number of patterns. To improve mining efficiency and avoid information loss, a compressed set of episodes, called closed episodes, is extracted [43].
Definition 15
The episode \( \beta \) is a sub-episode of the episode \( \alpha \) (or \( \alpha \) is a super-episode of \( \beta \)), \( \beta \sqsubseteq \alpha \), if all the states of \( \beta \) and the partial order between them exist in \( \alpha \).
Definition 16
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k \) and \( 1\le i\le k \), \( Prefix(\alpha ,i)\)= \(G''_1\rightarrow G''_2\rightarrow \cdots \rightarrow G''_i\) and \(Postfix(\alpha ,i)=G''_i\rightarrow G''_{i+1}\rightarrow \cdots \rightarrow G''_k\), where \( G''_i\subseteq G'_i\).
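With an episode represented as a list of CNGs, Definition 16 can be sketched as simple slicing; for simplicity this illustration ignores the subset relaxation \(G''_i\subseteq G'_i\) and keeps each CNG whole (function names are illustrative assumptions).

```python
def prefix(episode, i):
    """Prefix(alpha, i): the first i CNGs (1-based, CNGs kept whole)."""
    return episode[:i]

def postfix(episode, i):
    """Postfix(alpha, i): the CNGs from position i to the end (1-based)."""
    return episode[i - 1:]

alpha = [(("Memory", "Low"),), (("Disk", "High"),), (("CPU", "Low"),)]
print(prefix(alpha, 2))   # → [(('Memory', 'Low'),), (('Disk', 'High'),)]
print(postfix(alpha, 2))  # → [(('Disk', 'High'),), (('CPU', 'Low'),)]
```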
Definition 17
The episode \( \alpha \) is closed under the gap constraints \( \delta \) and \( \varDelta \) iff there is no other episode \( \beta \) such that \( \alpha \) is a prefix or postfix of \( \beta \) and \(freq(\alpha )=freq(\beta )\) [12].
Example 7
Consider the two episodes \( \alpha =(Memory,Low)\rightarrow (Disk,High) \) and \( \beta =(CPU,High)(Memory,Low)\rightarrow (Disk,High) \). It is clear that \( \alpha \sqsubseteq \beta \) and \( \alpha =Postfix(\beta ,1) \). So if \( freq(\alpha )=freq(\beta ) \), then the episode \( \alpha \) is not closed.
Based on the definitions of the serial and concurrent extensions, if the nodes \(n'\) and \(n''\) are the serial and concurrent extensions of the node n with the states x and y respectively, and \(Pattern(n)=\beta \), then \(\alpha =Pattern(n')=\beta \rightarrow \{x\}\) and \(\gamma =Pattern(n'')=\beta \odot y\). So, without scanning the stream, a maximal non-overlapped set of minimal occurrences of \(\alpha \) and \(\gamma \) can be determined by joining \(LO(\beta )\) with the occurrences of x and y respectively [12].
Firstly, the pattern mining engine extracts frequent episodes by completely traversing the pattern tree in a depth-first manner. For this purpose, all the frequent 1-node episodes are extracted. Then, the pattern tree is traversed in a depth-first manner from each of the frequent 1-node episodes. When the serial and concurrent extensions of an episode are constructed, it is checked whether any of its super-patterns has the same frequency as the episode; if not, the episode is added to the list of candidate closed episodes. After extracting the candidate closed episodes, a post-processing step is performed on them using a hashing procedure with frequency as the key. Finally, the set of all the closed frequent episodes is extracted. Note that, to avoid enlarging the pattern tree, we can limit the number of CNGs of episodes. We define Level as the maximum number of CNGs of episodes.
4 Improving space complexity: computing NO frequency based on redundant occurrences
This section focuses on improvements to space complexity of the pattern mining engine. In Sect. 4.1, redundant LOs are introduced. The improved representation of the stream is introduced to identify the redundant occurrences in Sect. 4.2. In Sect. 4.3, algorithms are proposed to compute the NO frequency under gap constraints.
4.1 Redundant LO
In this section, the redundant LOs are defined and it is proved that removing these occurrences does not affect the frequency of episodes. Therefore, the redundant occurrences could be removed without any information loss, which improves memory consumption.
Definition 18
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\) and the occurrence \(O=([t^{i}_{1},t^{i}_{2}]^{k}_{i=1})\in LO(\alpha )\), the occurrence O is a redundant occurrence iff three conditions below are satisfied:
1. There exists another occurrence \(Q=([w^{i}_{1},w^{i}_{2}]^{k}_{i=1})\in LO(\alpha )\) such that \(w^1_1=t^1_1\) and \(w^k_2<t^k_1\).
2. There is no sub-interval of \([t^k_2+\delta ,t^k_2+\varDelta ]\) such that only O covers it.
3. There exists no event \(e=(r,s,st,et)\) such that \(|t^k_2-st|\le \epsilon \) and \(|t^k_1-st|\le \epsilon \).
In Sect. 3.3, we explained how the pattern tree is constructed based on the serial and concurrent extensions. If for the occurrence O there is an event \(e=(r,s,st,et)\) such that \(|t^k_2-st|\le \epsilon \) and \(|t^k_1-st|\le \epsilon \), then \(\alpha \) can be extended to an episode \( \beta \) by the concurrent extension (\(\beta =\alpha \odot (r,s)\)). Therefore, the occurrence O is not redundant and should not be removed.
Example 8
Given the episode \(\alpha =G'_1\rightarrow G'_2\), \(\epsilon =1\), \(\delta =12\) and \(\varDelta =23\), Fig. 7 shows the occurrences of \(G'_i,i=1,2\). \(A_1=[80,81]\) and \(A_2=[92,93]\) are the occurrences of \(G'_1\) and \(B_1=[100,101]\), \(B_2=[103,104]\) and \(B_3=[106,106]\) are the occurrences of \(G'_2\). There are three LOs: \(O_1=(A_1,B_1)\), \(O_2=(A_1,B_2)\) and \(O_3=(A_2,B_3)\). Note that the occurrence \(O_2\) satisfies the first two conditions of Definition 18: \(O_1\) plays the role of Q in the first condition, and, as Fig. 8 shows, VI([103, 104], 3) is covered completely by the valid intervals of \(O_1\) and \(O_3\). Therefore, if there exists no event \(e=(r,s,st,et)\) such that \(|104-st|\le \epsilon \) and \(|103-st|\le \epsilon \), then \(O_2\) is redundant.
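The second condition of Definition 18 amounts to an interval-coverage test: the valid interval following the candidate occurrence must be fully covered by the valid intervals of the other LOs. Below is a minimal sketch with inclusive integer endpoints (an assumption of this illustration), reproducing the numbers of Example 8:

```python
def covered(target, intervals):
    """Check that target = (lo, hi) is fully covered by the union of
    intervals; all endpoints are inclusive integers."""
    lo, hi = target
    reach = lo                     # first point not yet known to be covered
    for a, b in sorted(intervals):
        if a > reach:              # a gap remains that only the candidate covers
            return False
        reach = max(reach, b + 1)
        if reach > hi:
            return True
    return reach > hi

# Example 8 with delta=12, Delta=23: VI of O2 is [104+12, 104+23] = [116, 127];
# the VIs of O1 ([113, 124]) and O3 ([118, 129]) cover it, so condition 2 holds.
print(covered((116, 127), [(113, 124), (118, 129)]))  # → True
```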
Lemma 2
Given the episode \(\alpha \), if \(\beta \) and \(\gamma \) are the serial and concurrent extensions of \(\alpha \), removing redundant occurrences from \(LOList(\alpha )\) does not affect \(freq(\alpha )\), \(freq(\beta )\) and \(freq(\gamma )\).
Lemma 3
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\) and the occurrences \(A=([a^i_1,a^i_2]^k_{i=1})\), \(B=([b^i_1,b^i_2]^k_{i=1})\) and \(C=([c^i_1,c^i_2]^k_{i=1})\), where \(A,B,C\in LO(\alpha )\) and A and C are LOs immediately before and after B respectively, if \(a^1_1\ne b^1_1\) and \([b^k_1,b^k_2]\) is not covered by \([a^k_1,a^k_2]\) and \([c^k_1,c^k_2]\), then all of the LOs before A start before B and \([b^k_1,b^k_2]\) is covered by none of the LOs before A and after C.
Lemma 4
If \(\epsilon \ge \frac{\delta }{4}\) and \( \varDelta \in [\delta ,2\delta ) \), then there is no redundant LO.
4.2 Improved representation of the stream based on pointers \(({\textit{P}ROSPER})\)
According to Lemma 2, not all of the LOs of an episode carry useful information, so removing the redundant ones reduces the memory consumed by the LOLists of episodes. To identify redundant LOs, the third condition of Definition 18 requires checking that an LO cannot be extended concurrently. For this purpose, the occurrence lists of all the states of RS would have to be checked, which might be time-consuming. To expedite the identification of concurrent events and the removal of redundant LOs, we propose \({\textit{P}ROSPER}\), an improved representation of the stream based on pointers. \({\textit{P}ROSPER}\) builds on the vertical representation of the stream and connects concurrent events by using pointers.
In the vertical representation of the stream [44], each \((r,s)\in RS\) is associated with a list whose entries hold the starting intervals of that (r, s). In \({\textit{P}ROSPER}\), each entry of the list is augmented with two pointers, called Next and Previous, which connect the entries of concurrent events. The list corresponding to (r, s) in \({\textit{P}ROSPER}\) is called LOListRS(r, s).
Definition 19
LOListRS(r, s) includes a tuple \( ([v,v'],Next,Previous) \) for each occurrence of \( (r,s)\in RS \), where \( [v,v'] \) is the starting interval of the occurrence and Next and Previous are the pointers that connect the concurrent events. LOListRS(r, s)[i] is the i-th member of LOListRS(r, s). (Note that for each occurrence of (r, s), \( v=v' \).)
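The construction of the pointer-augmented lists can be sketched as follows. This is a minimal, hypothetical sketch (events reduced to (r, s, start) triples, and only the most recent concurrent event linked), not the full \({\textit{P}ROSPER}\) structure:

```python
class Entry:
    """One occurrence of (r, s): its starting interval plus pointers."""
    def __init__(self, v):
        self.interval = (v, v)   # for single events, v == v'
        self.next = None         # pointer to the next concurrent event
        self.prev = None         # pointer to the previous concurrent event

def build_prosper(events, eps):
    """events: list of (resource, status, start_time) triples.
    Returns {(r, s): LOListRS} with concurrent events chained by pointers."""
    lists, order = {}, []
    for r, s, st in sorted(events, key=lambda e: e[2]):
        e = Entry(st)
        lists.setdefault((r, s), []).append(e)
        # link to the previous event if its start time is within eps
        if order and abs(order[-1].interval[0] - st) <= eps:
            order[-1].next, e.prev = e, order[-1]
        order.append(e)
    return lists
```

For \(\epsilon =0\), two events starting at the same time slot are chained, so concurrent events can be found by following pointers instead of scanning all occurrence lists.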
Example 9
Consider the stream E:
Figure 9 shows the vertical representation and \({\textit{P}ROSPER}\) of the stream E for \(\epsilon =0\). As Fig. 9b shows, concurrent events can easily be identified using the pointers. Note that the pointer L always points to the last event of the stream.
4.3 NO frequency under gap constraints
To compute the NO frequency of episodes, their LOList should be extracted first [11]. In this section, based on \({\textit{P}ROSPER}\), we modify the algorithms SSMakeLOList and SCMakeLOList presented in [11] to extract the non-redundant LOs of episodes.
4.3.1 Extracting the non-redundant LOs of episodes using the serial extension
The algorithm SMakeLOList (Algorithm 1) is proposed to extract the non-redundant LOs of episodes using the serial extension. The algorithm receives \(LOList(\alpha )\) (see Definition 12) and LOListRS(r, s) (see Definition 19), where \(\alpha \) is an episode and \((r,s)\in RS\), and computes \(LOList(\beta =\alpha \oplus (r,s))\) without scanning the stream. Note that LOListRS(r, s) is the occurrence list of (r, s) in \({\textit{P}ROSPER}\). The counters i, z and j traverse the LOLists of \(\alpha \) and \(\beta \) and LOListRS(r, s) respectively. Lines 2 to 23 determine, for each LO of \(\alpha \), which occurrences of (r, s) can create a non-redundant LO for \(\beta \). Lines 4 to 6 check whether the i-th LO of \(\alpha \) can be the latest prefix occurrence for the j-th occurrence of (r, s). If not, this LO of \(\alpha \) cannot be the latest prefix occurrence for the following occurrences of (r, s) either, so the next LOs of \(\alpha \) are considered (line 22). For the new LOs of \(\alpha \), we start from the occurrences of (r, s) for which no latest prefix occurrence has been found yet. In lines 4 to 5, if an LO of \(\alpha \) is the latest prefix occurrence for an occurrence of (r, s), the corresponding LO of \(\beta \) is generated and inserted in \(LOList(\beta )\). In lines 7 to 17, the LOs immediately before and after each LO of \(\beta \) are examined to decide whether that LO is redundant according to Definition 18. In line 12, the function CExtending (see Algorithm 7 in “Appendix B”) checks whether the LO can be extended concurrently. According to Lemma 3, if the conditions of Definition 18 are not satisfied for the immediately preceding and following LOs, they are not satisfied for any other LO either. In line 13, if an LO is redundant, it is removed and the counter z is updated. Note that in line 10, if \(b_3+\delta \le b_1+\varDelta \), then \([b_2+\delta ,b_2+\varDelta ]\) is covered because \(b_1<b_2<b_3\): since \(b_1+\delta<b_2+\delta<b_3+\delta \le b_1+\varDelta<b_2+\varDelta <b_3+\varDelta \), \(VI([a_2,b_2],k+1)\) is completely covered by \(VI([a_1,b_1],k+1)\) and \(VI([a_3,b_3],k+1)\), where \(|CNG_{\alpha }|=k\). Time complexity of the algorithm is discussed in Lemma 9 in “Appendix A”.
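The core join performed by the serial extension can be sketched without the redundancy pruning of lines 7 to 17. This is a simplified, hypothetical sketch (occurrence start times of (r, s) as plain integers), not Algorithm 1 itself:

```python
def serial_extend(lolist_alpha, starts_rs, delta, big_delta):
    """Join each occurrence of (r, s) with its latest prefix occurrence in
    LOList(alpha): the LO of alpha ending latest while still satisfying
    the gap constraint end + delta <= v <= end + big_delta."""
    result = []
    for v in starts_rs:
        best = None
        for occ in lolist_alpha:
            end = occ[-1][1]
            if end + delta <= v <= end + big_delta:
                if best is None or end > best[-1][1]:
                    best = occ
        if best is not None:
            result.append(best + [(v, v)])
    return result
```

Applied to the data of Example 8 (\(\delta =12\), \(\varDelta =23\)), this join produces the three LOs \(O_1\), \(O_2\) and \(O_3\) before redundancy removal.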
Theorem 1
Given the episode \(\alpha \) and \((r,s)\in RS\), the algorithm SMakeLOList finds all and only the non-redundant LOs of \(\beta =\alpha \oplus (r,s)\).
Example 10
Consider Example 8. Figure 10 shows \( LOList(G'_1) \), LOListRS(r, s) and \( LOList(\beta =G'_1\rightarrow (r,s))\) extracted by Algorithm 1. Since all of the pointers of LOListRS(r, s) are null, according to lines 10 to 20 of the algorithm, the second element of \( LOList(\beta ) \) is redundant. So it is removed.
4.3.2 Extracting the non-redundant LOs of episodes using the concurrent extension
The algorithm CMakeLOList (Algorithm 2) is proposed to extract the non-redundant LOList of episodes using the concurrent extension. The algorithm receives \(LOList(\alpha )\) and LOListRS(r, s), where \(\alpha \) is an episode and \((r,s)\in RS\), and computes \(LOList(\beta =\alpha \odot (r,s))\) without scanning the stream.
The counters i, z and j traverse \(LOList(\alpha )\), \(LOList(\beta )\) and LOListRS(r, s) respectively. In lines 3 to 5, the first element of LOListRS(r, s) that is concurrent with at least one event is found. There are three cases for \(LOList(\alpha )\) and LOListRS(r, s): (1) In lines 7 to 20, if the corresponding entries of \(LOList(\alpha )[i]\) and LOListRS(r, s)[j] can generate an LO of \(\beta =\alpha \odot (r,s)\), it is inserted in \(LOList(\beta )\) and the three counters i, j and z are incremented. In lines 10 and 20, if an LO is redundant, it is removed and the counter z is updated. (2) In lines 21 to 22, if LOListRS(r, s)[j] occurs after \(LOList(\alpha )[i]\), then \(LOList(\alpha )[i]\) should not be considered for the members after LOListRS(r, s)[j], so the next LO of \(\alpha \) is checked. (3) In lines 23 and 24, if LOListRS(r, s)[j] occurs before \(LOList(\alpha )[i]\), the next occurrences of (r, s) are considered for \(LOList(\alpha )[i]\). Time complexity of the algorithm CMakeLOList is discussed in Lemma 10 in “Appendix A”.
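The two-pointer merge underlying the three cases can be sketched as follows. This is a simplified, hypothetical sketch (redundancy removal omitted, concurrency tested by \(\epsilon \)-closeness of start times), not Algorithm 2 itself:

```python
def concurrent_extend(lolist_alpha, starts_rs, eps):
    """Merge each LO of alpha with an occurrence of (r, s) that is
    concurrent with its last CNG, advancing two cursors in one pass."""
    result, j = [], 0
    for occ in lolist_alpha:
        t1, t2 = occ[-1]
        # case (3): skip occurrences of (r, s) that occur too early
        while j < len(starts_rs) and starts_rs[j] < t1 - eps:
            j += 1
        # case (1): concurrent entries generate an LO of beta
        if (j < len(starts_rs)
                and abs(starts_rs[j] - t1) <= eps
                and abs(starts_rs[j] - t2) <= eps):
            merged = occ[:-1] + [(min(t1, starts_rs[j]),
                                  max(t2, starts_rs[j]))]
            result.append(merged)
            j += 1
        # case (2): otherwise the next LO of alpha is checked
    return result
```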
Theorem 2
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k}\) and \(G=(r,s)\in RS\), the algorithm CMakeLOList finds all and only the non-redundant LOs of \(\beta =\alpha \odot G\).
Example 11
Given the episode \( \alpha =G'\rightarrow (r,s)\), \( (r',s')\in RS \), \( (r,s)<(r',s') \), \( \epsilon =1 \), \( \delta =12 \) and \( \varDelta =23\), Fig. 11 shows \( LOList(\alpha ) \), LOListRS(r, s) , \( LOListRS(r',s') \) and \( LOList(\beta )\) extracted by Algorithm 2. According to the algorithm, the second element of \( LOList(\beta ) \) is removed because it is redundant.
After extracting the LOList of an episode, its NO frequency can easily be computed by calling the function ComputeFreq presented in [11], which scans the LOList of the episode and counts the number of non-overlapped LOs.
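The non-overlapped count can be sketched as a standard greedy scan. This is a hypothetical sketch assuming the LOList is sorted by end time, not the paper's ComputeFreq:

```python
def compute_freq(lolist):
    """Count non-overlapped LOs greedily: take an LO whenever it starts
    strictly after the previously selected LO ends."""
    count, last_end = 0, float('-inf')
    for occ in lolist:
        start, end = occ[0][0], occ[-1][1]
        if start > last_end:
            count += 1
            last_end = end
    return count
```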
5 Improving time complexity: a new approach for mining the closed episodes
As mentioned, the common approach to extract closed episodes under gap constraints is based on a hash table [13]. It is a two-step approach: in the first step, candidate closed episodes are extracted; in the second, they are examined and closed episodes are determined by a hashing procedure with frequency as the key [12, 13]. As the evaluation results will show, the number of closed episodes is usually much smaller than the number of candidate closed episodes, so extracting closed episodes from among a huge number of candidates is very time-consuming. To address this, we introduce a new data structure, called \({\textit{C}PBT}\), to store closed episodes and present depth-first search algorithms based on \({\textit{C}PBT}\) to extract closed episodes directly. In this section, \({\textit{C}PBT}\) is introduced first. Then, the algorithms for mining closed frequent episodes are presented.
5.1 The data structure \({\textit{C}PBT}\)
The data structure \({\textit{C}PBT}\) is introduced to store closed episodes compactly. The root of \({\textit{C}PBT}\) is labeled with \(\varnothing \). An episode is traversed in the backward direction and inserted in \({\textit{C}PBT}\) so that each node corresponds to one CNG of the episode. Hence, episodes with the same postfix share the same nodes. To avoid losing the frequency of the episodes, each node maintains the sum of the frequencies of the episodes that share that node. Each node n of \({\textit{C}PBT}\) includes:
label: Given the episode \(\alpha =G'_1\rightarrow \cdots \rightarrow G'_k\), if the node n corresponds to \(G'_i,1\le i\le k\), then label(n) is the inverse of \(G'_i\).
freq: It is the sum of the frequencies of the episodes that share n.
children: There is a child of n for each episode that shares n such that n does not correspond to the first CNG of that episode.
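The node structure and the backward insertion described above can be sketched as follows. This is a minimal, hypothetical sketch in which CNG labels are plain strings; it omits the absorption handling presented later:

```python
class Node:
    """One CPBT node: a CNG label, an accumulated frequency and children."""
    def __init__(self, label=None):
        self.label = label
        self.freq = 0          # sum of frequencies of episodes sharing this node
        self.children = {}     # label -> Node

def insert_episode(root, cngs, freq):
    """Insert an episode G'_1 -> ... -> G'_k backward (G'_k first), so
    episodes with the same postfix share a path from the root."""
    node = root
    for label in reversed(cngs):
        node = node.children.setdefault(label, Node(label))
        node.freq += freq      # shared nodes accumulate frequencies
    return node
```

In the spirit of Example 12, inserting two episodes that end with the same CNG makes them share that node, whose frequency becomes the sum of theirs.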
Example 12
Given \(RS_1=\{A,B,C,D,E,F,G,M,N,Z\}\), where \(RS_1\subseteq RS\), assume the three episodes \(\alpha \), \(\beta \) and \(\gamma \) are extracted as follows:
As Fig. 12a shows, the episode \(\alpha \) is inserted in the backward direction in \({\textit{C}PBT}\). Each node of \({\textit{C}PBT}\) corresponds to one CNG of \(\alpha \), and each node records the frequency of \(\alpha \). Figure 12b shows \({\textit{C}PBT}\) after inserting the episode \(\beta \). Since the episodes \(\alpha \) and \(\beta \) have no common postfix, the episode \(\beta \) is inserted as a new branch of the root. Since the postfixes of the episodes \(\gamma \) and \(\alpha \) are equal, the node labeled MF is shared between them. Note that the frequency of this node is the sum of the frequencies of \(\alpha \) and \(\gamma \).
5.2 Mining closed frequent episodes
Firstly, we introduce two new supersets of closed episodes, called Forward Closed and Backward Closed. Then based on them and \({\textit{C}PBT}\), we propose algorithms that extract closed frequent episodes directly.
Definition 20
Given \(\alpha =G'_1\rightarrow \cdots \rightarrow G'_k\), if there is no episode \( \beta \) such that \(\alpha =Prefix(\beta ,k)\) and \(freq(\alpha )=freq(\beta )\), then \(\alpha \) is Forward Closed (FC).
Definition 21
Given \(\alpha =G'_1\rightarrow \cdots \rightarrow G'_k\), if there is no episode \(\beta \) such that \( |CNG_{\beta }|=u\), \(\alpha =Suffix(\beta ,u-k+1)\) and \(freq(\alpha )=freq(\beta )\), then \(\alpha \) is Backward Closed (BC).
If an episode is both FC and BC, then it is a closed episode. Therefore, we first extract the FC episodes and insert them in \({\textit{C}PBT}\). Among the FC episodes, those that are not BC are absorbed by their super-episodes. Thus, \({\textit{C}PBT}\) only includes closed frequent episodes. Note that for two FC episodes \( \alpha \) and \(\beta \), checking the CNG occurrences of the episodes is unnecessary because, using \({\textit{C}PBT}\), we can simply check whether \( \alpha \) is a suffix of \( \beta \) and compare their frequencies. As will be shown, closed episodes and their frequencies can be extracted from \({\textit{C}PBT}\) easily.
Algorithm 3 extracts closed episodes by a complete depth-first traverse of the pattern tree. At first, in line 1, all the 1-node episodes (denoted by P) are extracted. Then, in lines 3 and 4, the pattern tree is traversed in a depth-first manner from each of the 1-node episodes using recursive calls of the algorithm FindClosedFreqEpisode (see Algorithm 8 in “Appendix B”). Note that LOListRS(p) is the corresponding list of p in \({\textit{P}ROSPER}\). The algorithm FindClosedFreqEpisode identifies the FC episodes and inserts them in \( {\textit{C}PBT} \) by calling the function InsertInCPBT. Finally, after extracting closed episodes, Algorithm 3 calls the function ExtractClosedEpisodesFromCPBT to traverse \({\textit{C}PBT}\) and extract the closed episodes from it. Note that episodes are represented in the form of SAVE [11] to expedite the episode extraction. In the next section, we explain in detail how the FC episodes are inserted in CPBT.
5.3 Insert in \({\textit{C}PBT}\)
When the FC episode \(\alpha \) is inserted in \({\textit{C}PBT}\), two cases can occur: (1) \({\textit{C}PBT}\) already contains another episode \(\beta \) whose suffix is \(\alpha \) and \(freq(\alpha )=freq(\beta )\). This means that \(\beta <\alpha \), because \(\beta \) has been inserted in \({\textit{C}PBT}\) earlier than \(\alpha \). So \(\alpha \) is not BC and should not be inserted in \({\textit{C}PBT}\). (2) After inserting \(\alpha \), another episode \(\beta \) whose suffix is \(\alpha \) with \(freq(\alpha )=freq(\beta )\) might be inserted. This means that \(\alpha <\beta \), so \(\alpha \) should be removed from \({\textit{C}PBT}\).
Definition 22
If the episodes \(\alpha \) and \(\beta \) are FC, \(\alpha \) is a suffix of \(\beta \) and \(freq(\alpha )=freq(\beta )\), it is said that \(\beta \) absorbs \(\alpha \).
The procedure for calling functions to insert an FC episode in \( {\textit{C}PBT} \) is shown in Fig. 13. According to the figure, the episode is converted to a branch by calling the function CreateBranch. In the next step, the function SearchInTree is called. This function calls the function EpisodeAbsorbByTree, which checks whether \( {\textit{C}PBT} \) absorbs the episode. If not, the function TreeAbsorbByEpisode is called. This function finds the branches that are absorbed by the episode. Then, these branches are updated by calling the function UpdateBranch. Finally, the function InsertInTree is called to insert the episode in the right place.
The algorithm InsertInCPBT inserts the FC episode \(\alpha \) in \({\textit{C}PBT}\). As Algorithm 4 shows, this function receives the FC episode \(\alpha \) and its frequency. The pointer R initially points to the root of \({\textit{C}PBT}\). In line 2, the corresponding branch of the episode \(\alpha \), \(\alpha '\), is created by calling the function CreateBranch (Algorithm 9). In line 3, the branch \(\alpha '\) is searched in \({\textit{C}PBT}\) by calling the function SearchInTree (Algorithm 5). If \(\alpha '\) is absorbed by a branch of \({\textit{C}PBT}\), this function returns \(-\,1\). Otherwise, the branches of \({\textit{C}PBT}\) that are absorbed by \(\alpha '\) are removed and the function returns 0. Finally, the branch \(\alpha '\) is inserted in \({\textit{C}PBT}\) in lines 4 to 6 by calling the function InsertInTree (Algorithm 6). So while inserting the episode \(\alpha \) in CPBT, it must be checked whether \(\alpha \) absorbs other episodes or is absorbed by another episode. In the following, the functions called by InsertInCPBT are presented in detail.
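The absorption test behind SearchInTree can be sketched as follows. This is a simplified, hypothetical sketch (the tree as nested dicts {label: (freq, children)}, and the check restricted to direct extensions), not Algorithm 5 or 10; in the full CPBT, child frequencies are accumulated over sharing episodes, so this comparison is only exact when the child is not shared:

```python
def is_absorbed(root, cngs, freq):
    """Since episodes are stored backward, alpha is a suffix of a stored
    episode beta iff alpha's reversed CNG list is a path from the root
    that continues further; equal frequency marks absorption."""
    node = root
    for label in reversed(cngs):
        if label not in node:
            return False
        node = node[label][1]          # descend into the children dict
    # a child continuing past alpha with the same frequency means a
    # super-episode beta with freq(beta) == freq(alpha) absorbs alpha
    return any(child_freq == freq for child_freq, _ in node.values())
```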
(1) Function CreateBranch: The algorithm CreateBranch (Algorithm 9 in “Appendix B”) receives the FC episode \(\alpha \) and its frequency, converts them into a branch of \({\textit{C}PBT}\) and returns a pointer to this branch. Time complexity of the algorithm is discussed in Lemma 11 in “Appendix A”.
Definition 23
Given the node n of \({\textit{C}PBT}\), Episode(n) is the corresponding episode of a branch that starts from the root and ends in the node n.
(2) Function SearchInTree: This function (Algorithm 5) searches an episode in \({\textit{C}PBT}\) and checks whether the episode is absorbed by a branch of the tree. If it is not absorbed, the function checks whether the episode absorbs branches of \({\textit{C}PBT}\). As Algorithm 5 shows, by calling the function EpisodeAbsorbByTree (Algorithm 10 in “Appendix B”) in lines 2 to 4, it is checked whether the branch \(\alpha '\) is absorbed by a branch of \({\textit{C}PBT}\). If \(\alpha '\) is absorbed, the value \(-\,1\) is returned, which shows that \(\alpha '\) should not be inserted in \({\textit{C}PBT}\). If \(\alpha '\) is not absorbed, it is checked whether \(\alpha '\) absorbs branches of \({\textit{C}PBT}\); if it does, these branches must be removed from \({\textit{C}PBT}\) before \(\alpha '\) is inserted. The stack Path, defined in line 1, stores the path of branches that should be removed. The function TreeAbsorbByEpisode (Algorithm 11 in “Appendix B”) in line 6 finds these paths and inserts them into the global variable PathList. Finally, in lines 7 to 9, the function UpdateBranch (Algorithm 12 in “Appendix B”) is called to update the corresponding branches of the paths in PathList.
(3) Function InsertInTree: This function (Algorithm 6) inserts the corresponding branch of the episode \(\alpha \), \(\alpha '\), in the subtree of the node R in \({\textit{C}PBT}\). As Algorithm 6 shows, in lines 2 to 12, a child of R whose label is equal to the label of \(\alpha '\) is found. Note that there exists at most one such node, because the labels of the children of a node are unique. Since this node is shared with the branch \(\alpha '\), the frequency of \(\alpha '\) is added to the node's frequency in line 4. In lines 5 to 8, the following nodes of the branch \(\alpha '\) are traversed and this process is repeated. If, while inserting \(\alpha '\), no node whose label is equal to the label of \(\alpha '\) is found, the value of flag remains False. So in lines 13 to 15, \(\alpha '\) is added to the children of R as a new child.
Theorem 3
The algorithm AllClosedFreqEpisodes finds all and only the closed episodes.
Example 13
Given \(RS_1=\{A,B,C,E,F,G\}\), where \(RS_1\subseteq RS\), assume the FC episodes \(\alpha _i,i=1,2,3,4,5\) are identified as follows:
We define the lexicographic order on RS as \(A<B<C<E<F<G\). Therefore, based on the lexicographic tree of episodes, we have \(\alpha _1<\alpha _2<\alpha _3<\alpha _4<\alpha _5\) [11]. It is clear that the function InsertInCPBT is called for episodes in ascending order:
\(\alpha _1\): Since CPBT has no branch, \(\alpha _1\) is inserted in it. Figure 14a shows \({\textit{C}PBT}\) after inserting \(\alpha _1\).
\(\alpha _2\): As Fig. 14b shows, the episode \(\alpha _2\) is also inserted in \({\textit{C}PBT}\).
\(\alpha _3\): When the function InsertInCPBT is called for \(\alpha _3\), the function SearchInTree returns \(-\,1\) because the function EpisodeAbsorbByTree detects that \(\alpha _3\) is absorbed by \(\alpha _2\). Therefore, \(\alpha _3\) is not inserted in \({\textit{C}PBT}\).
\(\alpha _4\): Since none of the branches of \({\textit{C}PBT}\) absorbs \(\alpha _4\), the function TreeAbsorbByEpisode is called for \(\alpha _4\). It detects that \(\alpha _1\) is absorbed by \(\alpha _4\). As Fig. 14c shows, \(\alpha _1\) is replaced with \(\alpha _4\).
\(\alpha _5\): \(\alpha _5\) is absorbed by no episode. Furthermore, it absorbs no episode. Therefore, it is inserted in \({\textit{C}PBT}\) as Fig. 14d shows.
The function ExtractClosedEpisodesFromCPBT (see Algorithm 3 in “Appendix B”) extracts the closed episodes stored in \({\textit{C}PBT}\) and provides fast access to them.
5.4 Analysis of time complexity
In general, the running time of an algorithm is roughly proportional to the number of times some basic operation is performed [45]. To analyze time complexity, we take the comparison of FC/candidate closed episodes as the basic operation. In the hashing approach, all the candidates with the same frequency are hashed to the same bucket of the hash table. If there are v distinct frequency values, then there are v buckets such that \( |bucket_i|=l_i, 1\le i\le v \) and \( \sum _{i=1}^{v}{l_i}=|CandidateClosedEpisodes|=|FC~Episodes| \). Among the candidate patterns hashed to the same bucket, those for which a super-pattern with the same frequency is found are discarded. So the number of comparisons is \( \sum _{i=1}^{v}{l_i^2}\le |CandidateClosedEpisodes|^2 \), which means that time complexity of the hashing approach is \( O(|CandidateClosedEpisodes|^2)\). Since the maximum number of CNGs of an episode is Level, the height of \( {\textit{C}PBT} \) is also Level. Therefore, in our approach, the number of comparisons is \( O(BranF\times Level) \), where BranF is the branching factor of \( {\textit{C}PBT}\). In the worst case, the branching factor of \( {\textit{C}PBT} \) is \(MaxBranF=\sum _{|CNG|=1}^{N}{{N \atopwithdelims ()|CNG|}M^{|CNG|}} \), where \( |ResourceType|=N \) and \( |Status|=M \). As we will see in Sect. 6.2, small values such as 6 should be chosen for Level. In addition, M and N are not large (in this paper, we set \( M=N=4 \)). In general, BranF depends on the extracted FC episodes, and in practice \( BranF\ll MaxBranF \). As we will see in Sect. 6, if |CandidateClosedEpisodes| is small, the hashing approach is a good choice; otherwise, our approach can identify closed episodes much faster than the hashing approach.
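The worst-case branching factor above is easy to evaluate numerically; by the binomial theorem it equals \((1+M)^N-1\). A short check for the paper's setting \(M=N=4\):

```python
from math import comb

def max_bran_f(n, m):
    """MaxBranF = sum over |CNG| = 1..N of C(N, |CNG|) * M^|CNG|."""
    return sum(comb(n, c) * m ** c for c in range(1, n + 1))

# For M = N = 4: (1 + 4)^4 - 1 = 624
print(max_bran_f(4, 4))
```

So even in the worst case the branching factor is a small constant (624) compared to the potentially huge number of candidate closed episodes.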
6 Evaluation
In [2, 11], we evaluated POSITING and RELENTING on both the real and synthetic workloads comprehensively and investigated the impact of different parameters on them. According to the main concepts introduced in Sect. 3.1, the pattern mining engine has the parameters \(\delta \), \(\varDelta \), \(\epsilon \), \(\mu \), Level and \(\theta \). The parameter settings for the evaluation of the proposed approach are as follows:
\(\varDelta \) and \(\delta \): As mentioned in Sect. 3.1, \( \delta \) and \( \varDelta \) are internal gaps, which determine the starting interval of CNGs. The value of \( \delta \) depends on the time spent on booting VMs. In [2, 11], the valid interval of \( \varDelta \) is \( [\delta ,\delta +\epsilon ] \). To provide a more general approach for mining closed episodes in different fields and to conduct more comprehensive experiments, we extend the valid interval of \( \varDelta \) as follows:
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\) and the occurrence \(O=([t_j,t'_j]_{j=1}^{k})\in LOList(\alpha )\) such that \(k>2\) and \(1\le i<k-1\), if there is an overlap between \(VI([t_i,t'_i], i + 1)\) and \(VI([t_{i+1},t'_{i+1}], i + 2)\), then the two serial episodes \(\beta =G'_1\rightarrow \cdots \rightarrow G'_i\rightarrow G'_{i+1}\) and \(\gamma =G'_1\rightarrow \cdots \rightarrow G'_i\rightarrow G'_{i+2}\) could be extracted. Based on these episodes, different statuses could be predicted for the next slots. So \(\varDelta \) should be selected such that precise prediction is possible and extracting redundant episodes is avoided. For this purpose, as (6.1) implies, \(\varDelta \) should lie in the interval \([\delta ,2\delta )\):
$$\begin{aligned} \left. \begin{aligned} t'_{i+1}&\ge t'_i+\delta \\ t'_{i+1}+\delta&>t'_{i}+\varDelta \end{aligned}\right\} \rightarrow \varDelta< t'_{i+1}-t'_i+\delta \rightarrow \varDelta <2\delta \rightarrow \varDelta \in [\delta ,2\delta ) \end{aligned}$$ (6.1)
\(\epsilon \): According to Definition 1, the span of the events should be greater than \( \epsilon \), and \( \epsilon \) should be determined based on the length of the sampling intervals. Since the real workloads [46, 47] are coarse-grained and the synthetic workloads are generated in a similar way, we set \(\epsilon =0\) in all the experiments.
\(\mu \): Events with a large span are decomposed into two events based on the decomposition unit \( \mu \). For smooth workloads, small values of \(\mu \) might generate many events, which could increase the duration of the episode extraction. On the other hand, large values of \(\mu \) might make it impossible to extract all the hidden useful episodes. We evaluate the impact of \(\mu \) on the pattern mining engine for both the real and synthetic workloads.
\(\theta \): It is a threshold value used to extract frequent episodes. Small values of \(\theta \) might lead to identifying a huge number of episodes, which can be very time-consuming; large values of \(\theta \) could lose episodes useful for prediction. We evaluate the impact of \(\theta \) on the pattern mining engine for both the real and synthetic workloads.
Level: To avoid enlarging the pattern tree, Level limits the length of episodes. As Level increases, the height of the pattern tree, the number of episodes and the time consumed to identify them increase. So we investigate the impact of Level on the pattern mining engine for both the real and synthetic workloads.
Since the focus of the paper is on the pattern mining engine, we evaluate the efficiency of the proposed approach to extract closed episodes. For this purpose, the effect of two important parameters \(\theta \) and \(\mu \) is considered on the pattern mining engine for both the synthetic and real workloads.
6.1 Workloads
The data set GWA-T-12 Bitbrains contains the performance metrics of 1750 VMs from a distributed data center of Bitbrains, a service provider that specializes in managed hosting and business computation for enterprises. The workload traces correspond to the requested and actually used resources in a distributed data center servicing business-critical workloads. The data set focuses on four key types of resources that can become bottlenecks for business-critical applications: CPU, disk I/O, memory and network I/O. For each VM, the performance metrics are sampled every 5 min. The traces cover 1750 nodes with over 5000 cores and 20 TB of memory, and include over 5 million CPU hours across 4 operational months [48].
Since the workloads of the data set GWA-T-12 Bitbrains are more dynamic than the other public workloads [48], similarly to [11], we use these workloads for evaluation. Furthermore, we use the synthetic workloads generated in [2, 11]. Table 1 shows the types of the generated synthetic workloads and their corresponding embedded episodes. Since instantiating a new VM instance takes about 5–15 min [49] and VMs are sampled every 5 min, the three values 1, 2 and 3 should be evaluated for \(\delta \).
6.2 Impact of Level
Since both methods extend the pattern tree in the same way, we only report the impact of Level on the hashing approach. For this purpose, we set \(\theta = 0.1\), \( \mu =3 \) and \( \delta =\varDelta \), and investigate the impact of Level on both the real and synthetic workloads.
Impact of Level on the real workload: One VM, called \( VM_1 \), is selected randomly from GWA-T-12 to evaluate the impact of Level on the number of episodes and the processing time to identify them. Table 2 shows the impact of Level on \( VM_1 \) for \( \delta =\varDelta =1,2,3\). According to the table, as Level increases, the number of episodes and the time consumed to identify them increase. Although there is no significant change in the number of episodes for \( Level>6 \), the processing time increases strongly. So, according to the bolded row, \(Level=6\) offers a good trade-off between the processing time (Time) and the number of episodes.
Impact of Level on the synthetic workload: As Table 3 shows, for each workload type, one trace is generated by the workload generator. Distinct values of \( \delta \) are selected for each workload type randomly. Table 4 shows the impact of Level on \( Trace_i,i=A,B,C \). According to the table, as Level increases, the number of episodes and the time consumed to identify them increase. Similarly to the real workloads, for \( Level>6 \) there is no significant change in the number of episodes, but the processing time increases strongly. According to the bolded rows, for all the traces \(Level=6\) offers a good trade-off between the processing time and the number of episodes. So in all of the experiments, we set \(Level=6\).
6.3 Evaluation results
To the best of our knowledge, the hashing approach is the only approach used in the literature to identify closed patterns among frequent patterns [12, 13, 44]. Therefore, to evaluate the performance and efficiency of the approach proposed in this paper, it is compared to the hashing approach [13]. For evaluation, there are four parameters \(\delta \), \(\varDelta \), \(\theta \) and \(\mu \) to investigate. As mentioned, the valid values of \(\delta \) are 1, 2 and 3 and \(\varDelta \) lies in the interval \([\delta ,2\delta )\). The parameter \(\theta \) lies in the interval [0, 1]. Since \(\mu \ge \max (2\epsilon +1,\epsilon + 2)\) [11] and \(\epsilon = 0\), we have \(\mu \ge 2\). If \(\mu \) is bounded by 10 and evaluated in steps of 1, and \(\theta \) is evaluated in steps of 0.1, then there are \(6\times 9\times 10=540\) distinct combinations of the parameters. Due to space limitations, we select some combinations of the parameters and investigate their impact on the pattern extraction of some VMs. The valid values of the parameters \(\delta \) and \(\varDelta \) can be divided into three groups: \(\delta =\varDelta \), and \(\delta <\varDelta \) with spans of 1 and 2 time slots. So we select (\(\delta =1,\varDelta =1\)) from the group \(\delta =\varDelta \), (\(\delta =2\), \(\varDelta =3\)) from the group \(\delta <\varDelta \) with the span of 1, and (\(\delta =3\), \(\varDelta =5\)) from the group \(\delta <\varDelta \) with the span of 2. The impact of \(\varDelta -\delta \) on episode mining is investigated on the real workloads. Since the main goal of cloud is to satisfy the SLA and avoid wasting resources [3], the occurrence time of future events should be determined precisely, so the main focus of the evaluation is on the group \(\delta =\varDelta \). The values of \(\delta \) and \(\varDelta \) for the synthetic workloads are selected from this group randomly.
All of the experiments run on a machine with an Intel Core 2 Duo 2.53 GHz processor and 4GB of RAM.
6.3.1 Experimental results of the real workload
In addition to \( VM_1 \), we select two other VMs of GWA-T-12 randomly, called \(VM_2\) and \(VM_3\). Each VM is evaluated for distinct values of \(\delta \) and \(\varDelta \): \(VM_1(\delta =1,\varDelta =1)\), \(VM_2(\delta =2,\varDelta =3)\) and \(VM_3(\delta =3,\varDelta =5)\). For each VM, the impact of the parameters \(\mu \) and \(\theta \) on the number of patterns and the processing time is investigated. Note that since both methods extract the same set of closed episodes from each VM, the number of closed episodes is only reported for the hashing approach.
Impact of \(\theta \): To consider the impact of \(\theta \), the values of \(\mu \) and Level are set to 3 and 6 respectively. For different values of \(\theta \), Table 5 shows the number of episodes, candidate closed and closed episodes, the extraction time of candidate closed episodes (\(Time_E\)) and the processing time of candidate closed episodes (\(Time_P\)) in seconds for the hashing approach. The symbol \(\infty \) indicates that the hashing approach could not complete the extraction of candidate closed episodes due to insufficient memory; in this case, the number of closed episodes is reported using the proposed approach. As the table shows, small values of \(\theta \) increase the number of extracted episodes (and closed episodes) and the time consumed to identify them. According to the table, as the span of \(\varDelta -\delta \) increases, the number of candidate and closed episodes increases abruptly. For example, if \(\theta =0.1\), the number of closed episodes is 247 for \(VM_1\) with \(\varDelta -\delta =0\) and 22166 for \(VM_2\) with \(\varDelta -\delta =1\), while for \(VM_3\) with \(\varDelta -\delta =2\) the hashing approach could not extract candidate closed episodes due to insufficient memory. On the other hand, as the table shows for \(VM_2\) with \(\theta =0.1\) and \(VM_3\) with \(\theta =0.3\), when there is a large number of candidate closed episodes, processing them consumes more time than extracting them. Furthermore, the number of closed episodes is much smaller than the number of candidate closed episodes. These points imply that the hashing approach is not a good choice for small values of \(\theta \) or large spans of \(\varDelta -\delta \).
The total time consumed by the hashing approach to extract closed episodes is defined as \(Time_E+Time_P\). Figure 15 shows the total time consumed to extract the closed episodes by the hashing approach and the proposed approach for different values of \(\theta \). According to the figure, as the value of \(\theta \) increases, the consumed time decreases. Furthermore, for \(VM_1\) with \(\varDelta -\delta =0\) (Fig. 15a), the time consumed by the proposed approach is reasonable and similar to that of the hashing approach. As the span \(\varDelta -\delta \) increases for \(VM_2\) and \(VM_3\) (Fig. 15b, c), the number of candidate closed episodes grows and the total time consumed by the hashing approach increases abruptly. As Fig. 15 shows, the proposed approach extracts the closed episodes much faster than the hashing approach for small values of \(\theta \) and large spans of \(\varDelta -\delta \) because it extracts closed episodes directly, without storing and processing all of the candidate closed episodes.
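To make the candidate-then-filter idea concrete, the following minimal Python sketch filters closed patterns from a set of frequent patterns by hashing them on frequency. It is an illustration only: episodes are simplified to serial tuples of items, and `freq_of`, `closed_patterns` and `is_subpattern` are hypothetical names, not the paper's implementation.

```python
from collections import defaultdict

def is_subpattern(p, q):
    """True if pattern p is an order-preserving sub-pattern of q."""
    it = iter(q)
    return all(item in it for item in p)   # subsequence test over one pass of q

def closed_patterns(freq_of):
    """freq_of maps a pattern (a tuple of items) to its frequency.
    A pattern is closed if no proper super-pattern has the same frequency."""
    by_freq = defaultdict(list)            # hash candidate patterns by frequency
    for p, f in freq_of.items():
        by_freq[f].append(p)
    # keep p only if no longer pattern in the same frequency bucket contains it
    return [p for p, f in freq_of.items()
            if not any(len(q) > len(p) and is_subpattern(p, q)
                       for q in by_freq[f])]
```

A pattern survives only if no longer pattern with the same frequency contains it; with many candidates, this pairwise post-processing corresponds to the \(Time_P\) cost that grows with the number of candidate closed episodes.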
Impact of \(\mu \): To evaluate the impact of \(\mu \), we set \(\theta = 0.1\) and \( Level=6 \). Since \(\mu \ge \max (2\epsilon +1,\epsilon +2)\) and \(\epsilon = 0\), we have \(\mu \ge 2\). Table 6 shows the impact of \(\mu \) in the interval [2, 10] on the number of patterns and the time consumed by the hashing approach. For small values of \(\mu \) such as 2 and 3, the number of candidate closed episodes increases abruptly. Moreover, as the span \(\varDelta -\delta \) increases, the number of episodes grows sharply. So for \(VM_2\) with \(\mu =2\) and \(VM_3\) with \(\mu =2,3\), the hashing approach could not complete the extraction of candidate closed episodes due to insufficient memory. In these cases, the number of closed episodes is reported using the proposed approach. It is clear that as \(\mu \) increases, the span of events increases and the number of events decreases accordingly; hence the number of episodes decreases as \(\mu \) increases. Furthermore, for \(VM_2\) with \(\mu =3\) and \(VM_3\) with \(\mu =4\), processing the candidate closed episodes consumes more time than extracting them due to the large number of candidate closed episodes. Therefore, since the hashing approach extracts a large number of candidate closed episodes, it is not an appropriate choice for small values of \(\mu \) or large spans of \(\varDelta -\delta \).
Figure 16 shows the total time consumed by the hashing approach and the proposed approach to extract the closed episodes for different values of \(\mu \) and \(\theta =0.1\). According to the figure, as the value of \(\mu \) increases, the consumed time decreases. Furthermore, for \(VM_1\) with \(\varDelta -\delta =0\) (Fig. 16a), the processing time of the hashing approach is reasonable. As the span \(\varDelta -\delta \) increases for \(VM_2\) and \(VM_3\) (Fig. 16b, c), the number of candidate closed episodes grows and the total time consumed by the hashing approach increases abruptly. As Fig. 16 shows, the proposed approach extracts closed episodes much faster than the hashing approach for small values of \(\mu \) and large spans of \(\varDelta -\delta \). For example, for \(VM_3\), the time consumed by the hashing approach for \(\mu =4\) is nearly equal to the time consumed by the proposed approach for \(\mu =2\). All these points show that extracting closed episodes without storing and processing all of the candidate closed episodes improves the mining efficiency significantly.
6.3.2 Experimental results of the synthetic workload
In this section, the impact of the two parameters \(\theta \) and \(\mu \) on the number of closed episodes extracted from the synthetic workloads, and on the time consumed by the methods to identify them, is considered. We use the traces generated in Table 3 and compare the time consumed by the hashing approach with that of the proposed approach.
Impact of \(\theta \): To consider the impact of \(\theta \), the value of \(\mu \) is set to 3. For different values of \(\theta \), Table 7 shows the number of episodes, candidate closed and closed episodes, and \(Time_E\) and \(Time_P\) in seconds for the hashing approach. According to the table, small values of \(\theta \) increase the number of extracted episodes (and closed episodes) and the time consumed to extract them. As the table shows for \(Trace_B\) with \(\theta =0.1\), when the number of candidate closed episodes is large, processing them consumes more time than extracting them. So the time consumed by the hashing approach mainly depends on the number of candidate closed episodes. For example, \(Trace_C\) with \(\theta =0.1\) needs the minimum time for the episode extraction due to its minimum number of candidate closed episodes. Furthermore, for all the traces, closed episodes are a small fraction of the candidate closed episodes. These points imply that the hashing approach is not an appropriate choice for small values of \(\theta \) because it generates a large number of candidate closed episodes.
Figure 17 shows the total time consumed by the hashing approach and the proposed approach to extract closed episodes for different values of \(\theta \). According to the figure, as the value of \(\theta \) increases, the consumed time decreases for both methods. Since a large number of candidate closed episodes are extracted for \(Trace_A\) and \(Trace_B\) with small values of \(\theta \), the proposed approach improves the consumed time for them significantly (Fig. 17a, b). As observed, for \(Trace_B\) with a larger number of candidate closed episodes, the effect of the proposed approach on the consumed time is more prominent. On the contrary, since a smaller set of candidate closed episodes is extracted by the hashing approach for \(Trace_C\), the hashing approach consumes less time than the proposed approach (Fig. 17c). However, the times consumed by the two methods are similar and comparable.
Impact of \(\mu \): To evaluate the impact of \(\mu \), we set \(\theta = 0.1\). Table 8 shows the impact of \(\mu \) in the interval [2, 10] on the number of patterns and the time consumed by the hashing approach. Unlike the real workloads, the impact of \(\mu \) on the traces shows no clear trend. Increasing \(\mu \) has no clear effect on the training phase of \(Trace_A\); for example, the number of candidate closed episodes for \(\mu =10\) is greater than that for \(\mu =4\). On the other hand, for \(\mu \ge 3\), there is no change in the number of candidate closed episodes of \(Trace_B\), and for \(\mu \ge 8\), there is no significant change in the episode extraction of \(Trace_C\). These results imply that increasing \(\mu \) does not affect the span of events of dynamic workloads. As the table shows, the time consumed by the hashing approach mainly depends on the number of episodes and candidate closed episodes.
Figure 18 shows the total time consumed by the hashing approach and the proposed approach to extract the closed episodes for different values of \(\mu \) and \(\theta =0.1\). According to the figure, the impact of \(\mu \) on the time consumed by the two methods shows no clear trend. However, as the figure shows, since there are a large number of candidate closed episodes for \(Trace_A\) and \(Trace_B\), the proposed approach reduces the time consumed for episode mining (Fig. 18a, b). According to the results, for a larger number of candidate closed episodes, the effect of the proposed approach on the consumed time is more prominent (Fig. 18b). On the contrary, although the times consumed by the two methods are comparable for \(Trace_C\), the hashing approach needs less time due to its smaller set of candidate closed episodes.
7 Conclusion and future work
The prediction of the future workload of applications is an essential step before resource provisioning in cloud environments. This paper improves the efficiency of our previously proposed pattern-mining-based predictors. It presents a general approach that not only improves the time and space complexities of the pattern mining engine of the predictors, but also can be employed in different fields of SPM. To improve space complexity, redundant LOs are identified and omitted based on the improved vertical representation of the stream. To improve time complexity, a new data structure, called \({\textit{C}PBT}\), is introduced to store closed episodes. Based on \({\textit{C}PBT}\), a new approach is suggested to extract closed episodes directly. The experimental results show that for small values of the frequency threshold and the decomposition unit, the proposed approach improves the efficiency of mining closed episodes significantly in comparison to the hashing approach.
In future work, we plan to conduct more experiments to evaluate the efficiency of the proposed approach for pattern mining in different fields. Furthermore, we plan to propose an approach that selects the parameter values dynamically according to workload changes.
Notes
The proof of lemmas and theorems could be found in “Appendix A”.
These traces can be accessed at http://gwa.ewi.tudelft.nl/datasets/Bitbrains.
The proof of lemmas and theorems could be found in “Appendix A”.
References
Petcu D, Vázquez-Poletti JL (2012) European research activities in cloud computing. Cambridge Scholars Publishing, Cambridge
Amiri M, Mohammad-Khanli L, Mirandola R (2018) An online learning model based on episode mining for workload prediction in cloud. Future Gener Comput Syst 87:83
Amiri M, Mohammad-Khanli L (2017) Survey on prediction models of applications for resources provisioning in cloud. J Netw Comput Appl 82:93–113
Jiang Y, Perng C-S, Li T, Chang RN (2013) Cloud analytics for capacity planning and instant VM provisioning. IEEE Trans Netw Serv Manag 10(3):312–325
Cetinski K, Juric MB (2015) AME-WPC: advanced model for efficient workload prediction in the cloud. J Netw Comput Appl 55:191–201
Amiri M, Feizi-Derakhshi MR, Mohammad-Khanli L (2017) IDS fitted Q improvement using fuzzy approach for resource provisioning in cloud. J Intell Fuzzy Syst 32(1):229–240
Altevogt P, Denzel W, Kiss T (2016) Cloud modeling and simulation. Wiley-IEEE Press, London
Yang J, Liu C, Shang Y, Cheng B, Mao Z, Liu C, Niu L, Chen J (2014) A cost-aware auto-scaling approach using the workload prediction in service clouds. Inf Syst Front 16(1):7–18
Shi P, Wang H, Yin G, Fengshun L, Wang T (2012) Prediction-based federated management of multi-scale resources in cloud. Adv Inf Sci Serv Sci 4(6):324–334
Matsunaga A, Fortes JAB (2010) On the use of machine learning to predict the time and resources consumed by applications. In: Proceedings of the 2010 10th IEEE/ACM international conference on cluster, cloud and grid computing, Melbourne, Victoria, Australia, pp 495–504. IEEE Computer Society
Amiri M, Mohammad-Khanli L, Mirandola R (2018) A sequential pattern mining model for application workload prediction in cloud environment. J Netw Comput Appl 105:21–62
Achar A, Ibrahim A, Sastry PS (2013) Pattern-growth based frequent serial episode discovery. Data Knowl Eng 87:91–108
Yan X, Han J, Afshar R (2003) CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the 2003 SIAM international conference on data mining, San Francisco, CA, USA, pp 166–177
Fahed L, Brun A, Boyer A (2014) Episode rules mining algorithm for distant event prediction. Technical Report hal-01062542, HAL
Huang P, Liu CJ, Yang X, Xiao L, Chen J (2014) Wireless spectrum occupancy prediction based on partial periodic pattern mining. IEEE Trans Parallel Distrib Syst 25(7):1925–1934
Li K, Fu Y (2014) Prediction of human activity by discovering temporal sequence patterns. IEEE Trans Pattern Anal Mach Intell 36(8):1644–1657
Wright AP, Wright AT, McCoy AB, Sittig DF (2015) The use of sequential pattern mining to predict next prescribed medications. J Biomed Inf 53:73–80
Gan W, Lin JCW, Fournier-Viger P, Chao HC, Yu PS (2018) A survey of parallel sequential pattern mining. CoRR, arXiv:1805.10515
Dinh D-T, Le B, Fournier-Viger P, Huynh V-N (2018) An efficient algorithm for mining periodic high-utility sequential patterns. Appl Intell 48(12):4694–4714
Martin F, Méger N, Galichet S, Becourt N (2012) Forecasting failures in a data stream context application to vacuum pumping system prognosis. Trans Mach Learn Data Min 5(2):87–116
D’Andreagiovanni M, Baiardi F, Lipilini J, Ruggieri S, Tonelli F (2019) Sequential pattern mining for ICT risk assessment and management. J Log Algebraic Methods Program 102:1–16
Van T, Yoshitaka A, Le B (2018) Mining web access patterns with super-pattern constraint. Appl Intell 48(11):3902–3914
Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1(3):259–289
Rathore S, Goyal V (2015) Top-K high utility episode mining in complex event sequence. PhD thesis
Höppner F (2001) Discovery of temporal patterns. Learning rules about the qualitative behaviour of time series. In: Proceedings of the 5th European conference on principles of data mining and knowledge discovery, PKDD ’01. Springer, London, pp 192–203
Papapetrou P, Kollios G, Sclaroff S, Gunopulos D (2005) Discovering frequent arrangements of temporal intervals. In: Fifth IEEE international conference on data mining (ICDM’05), Houston, TX, USA. IEEE
Batal I, Cooper GF, Fradkin D, Harrison J Jr, Moerchen F, Hauskrecht M (2016) An efficient pattern mining approach for event detection in multivariate temporal data. Knowl Inf Syst 46(1):115–150
Winarko E, Roddick JF (2007) ARMADA: an algorithm for discovering richer relative temporal association rules from interval-based data. Data Knowl Eng 63(1):76–90 (Data Warehouse and Knowledge Discovery, DAWAK’05)
Papadopoulos S, Drosou A, Tzovaras D (2016) Fast frequent episode mining based on finite-state machines. In: Abdelrahman OH, Gelenbe E, Gorbil G, Lent R (eds) Information sciences and systems 2015. Springer International Publishing, Cham, pp 199–208
Lin M-Y, Lee S-Y (2002) Fast discovery of sequential patterns by memory indexing. Springer, Berlin, pp 150–160
Moskovitch R, Shahar Y (2009) Medical temporal-knowledge discovery via temporal abstraction. AMIA Annu Symp Proc 2009:452–456
Moskovitch R, Walsh C, Wang F, Hripcsak G, Tatonetti N (2015) Outcomes prediction via time intervals related patterns. In: 2015 IEEE international conference on data mining, pp 919–924
Sacchi L, Larizza C, Combi C, Bellazzi R (2007) Data mining with temporal abstractions: learning rules from time series. Data Min Knowl Discov 15(2):217–247
Allen JF (1984) Towards a general theory of action and time. Artif Intell 23(2):123–154
Patel D, Hsu W, Lee ML (2008) Mining relationships among interval-based events for classification. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, SIGMOD ’08. ACM, New York, NY, USA, pp 393–404
Batal I, Fradkin D, Harrison J, Moerchen F, Hauskrecht M (2012) Mining recent temporal patterns for event detection in multivariate time series data. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12. ACM, Beijing, China, pp 280–288
Ghosh S, Li J, Cao L, Ramamohanarao K (2017) Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns. J Biomed Inf 66:19–31
Laxman S, Sastry P, Unnikrishnan K (2007) Discovering frequent generalized episodes when events persist for different durations. IEEE Trans Knowl Data Eng 19(9):1188–1201
Tatti N, Cule B (2010) Mining closed strict episodes. In: Proceedings of the 2010 IEEE international conference on data mining, ICDM ’10. IEEE Computer Society, Washington, DC, USA, pp 501–510
Wu S-Y, Chen Y-L (2007) Mining nonambiguous temporal patterns for interval-based events. IEEE Trans Knowl Data Eng 19(6):742–758
Laxman S, Sastry PS, Unnikrishnan KP (2005) Discovering frequent episodes and learning hidden Markov models: a formal connection. IEEE Trans Knowl Data Eng 17(11):1505–1517
Hwang K, Bai X, Shi M, Li Y, Chen WG, Wu Y (2016) Cloud performance modeling and benchmark evaluation of elastic scaling strategies. IEEE Trans Parallel Distrib Syst 27(1):130–143
Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66
Zaki MJ (2001) Spade: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60
Neapolitan RE, Neapolitan R, Naimipour K (2010) Foundations of algorithms. Jones & Bartlett Learning, Burlington
Alam M, Shakil KA, Sethi S (2016) Analysis and clustering of workload in google cluster trace based on resource usage. In: 2016 IEEE international conference on computational science and engineering (CSE) and IEEE international conference on embedded and ubiquitous computing (EUC) and 15th international symposium on distributed computing and applications for business engineering (DCABES), pp 740–747. IEEE
Alexandru I, Hui L, Mathieu J, Shanny A, Catalin D, Lex W, Epema Dick HJ (2008) The grid workloads archive. Future Gener Comput Syst 24(7):672–686
Shen S, van Beek V, Iosup A (2015) Statistical characterization of business-critical workloads hosted in cloud datacenters. In: 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid), pp 465–474. IEEE
Li A, Yang X, Kandula S, Zhang M (2010) Cloudcmp: comparing public cloud providers. In: Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pp 1–14. ACM
Acknowledgements
The GWA-T-12 Bitbrains traces are provided by Bitbrains IT Services Inc., which is a service provider that specializes in managed hosting and business computation for enterprises. We thank the GWA team and all those who have graciously provided the data for us.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Proofs
The proofs of all the theorems, lemmas and corollaries are presented in this appendix. Furthermore, some new lemmas are introduced that are used to prove the other lemmas and theorems.
Lemma 5
Given the episode \(\alpha =G'_1\rightarrow \cdots \rightarrow G'_k\) and the occurrence \(x=([t^j_1,t^j_2]_{j=1}^{k})\in LO(\alpha )\), if there exists a valid occurrence \(y=([w^j_1,w^j_2]_{j=1}^{k})\) such that \(t^1_1<w^1_1\), then \(t^k_1<w^k_1\).
Proof
The proof is by induction on k: Base case for \(k=2\): The proof is by contradiction: Assume \(w^2_1<t^2_1\). We have:
Therefore, x does not include LPO and it is not an LO. Induction Step: Assume it is true for \(k=m-1\). Now, we should prove it for \(k=m\): The proof is by contradiction: Assume \(w^m_1<t^m_1\). We have:
It means that x does not include LPO. So it is not an LO, which is in contradiction to the assumption. \(\square \)
Lemma 1 Given the episode \(\alpha =G'_1\rightarrow \cdots \rightarrow G'_k\), if \(MPO(\alpha )\) is the set of all the minimal prefix occurrences of \(\alpha \), then \(LO(\alpha )\subseteq MPO(\alpha )\).Footnote 3
Proof
The proof is by contradiction: assume there exists at least one LO \(x=([t_1^{j},t_2^{j}]^{k}_{j=1})\) such that \(x\notin MPO(\alpha )\). Since \(x\notin MPO(\alpha )\), there should exist a valid occurrence \(y=([w_1^{j},w_2^{j}]^{k}_{j=1})\) where \(w^{1}_1>t^{1}_1\) and \(w^{k}_1\le t^{k}_1\). According to Lemma 5, for each valid occurrence \(y = ([w^{j}_{1}, w^{j}_{2}]^{k}_{j=1})\) such that \( t^{1}_{1}<w^{1}_{1} \), we should have \( t^{k}_{1}<w^{k}_{1} \). So x is not an LO, which is in contradiction to the assumption. \(\square \)
Lemma 2 Given the episode \(\alpha \), if \(\beta \) and \(\gamma \) are the serial and concurrent extensions of \(\alpha \), removing redundant occurrences from \(LOList(\alpha )\) does not affect \(freq(\alpha )\), \(freq(\beta )\) and \(freq(\gamma )\).
Proof
We define \(OSet^N_M(\alpha )=\{O_1,\ldots ,O_L\}\) as the set of all the non-overlapped minimal occurrences such that \( O_1 \) is the first minimal occurrence of \( \alpha \) and \( O_{i+1},1\le i<L,\) is the first non-overlapped minimal occurrence after \( O_i \). In [11], we proved that \( OSet^N_M(\alpha ) \) is a maximal non-overlapped set of the minimal occurrences of \(\alpha \) in the stream and \(freq(\alpha )=|OSet^N_M(\alpha )|\). The proof of the lemma includes three cases:
Impact of removing redundant LOs on \(freq(\alpha )\): according to the first condition of Definition 18, there is overlap between the two occurrences O and Q. So at most one of them could be in \(OSet^N_M(\alpha )\). On the other hand, O is not a minimal occurrence. So \(O\notin OSet^N_M(\alpha )\) and removing it does not affect \(freq(\alpha )\).
Impact of removing redundant LOs on \(freq(\beta )\): If there exists a sub-interval of \([t^k_2+\delta ,t^k_2+\varDelta ]\) such that the occurrence O covers it exclusively or O is a minimal occurrence, then O might be a non-overlapped occurrence of \(\alpha \) and form a non-overlapped occurrence for \(\beta \). Therefore, removing O might lead to losing a non-overlapped occurrence of \(\beta \). According to the first condition of Definition 18, each non-overlapped occurrence of \(\beta \) whose LPO is O could be formed by using Q. On the other hand, there exists no sub-interval of \([t^k_2+\delta ,t^k_2+\varDelta ]\) such that the occurrence O covers it exclusively. Therefore, removing O does not affect \(freq(\beta )\).
Impact of removing redundant LOs on \(freq(\gamma )\): If there exists the event \(e=(v,s,st,et)\) such that \(|t^k_2-st|<\epsilon \) and \(|t^k_1-st|<\epsilon \), then removing O affects the frequency of \(\gamma =\alpha \odot (r,s)\). So if there is no such event, removing O does not affect \(freq(\gamma )\).
Therefore, if the three conditions are satisfied together, removing O does not affect \(freq(\alpha )\), \(freq(\beta )\) and \(freq(\gamma )\). \(\square \)
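The frequency measure \(freq(\alpha )=|OSet^N_M(\alpha )|\) used in this proof admits a simple greedy sketch: take the first minimal occurrence, then repeatedly take the first minimal occurrence that does not overlap the last one taken. The Python sketch below is an illustration over hypothetical occurrence spans, not the paper's implementation.

```python
def non_overlapped_count(occurrences):
    """occurrences: list of (start, end) spans of minimal occurrences.
    Greedily builds a maximal non-overlapped set by scanning left to right
    and returns its size, i.e. the frequency of the episode."""
    freq, last_end = 0, float('-inf')
    for start, end in sorted(occurrences):
        if start > last_end:          # non-overlapped with the last pick
            freq += 1
            last_end = end
    return freq
```

Because each occurrence is visited once after sorting, the count is obtained in \(O(n\log n)\) time for n minimal occurrences.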
Lemma 6
Given the episode \(\alpha \) such that \(|CNG_{\alpha }|=k\) and \(\mu \ge \max (2\epsilon +1,\epsilon +2)\), the successive starting intervals of \(G_i, 1\le i\le k\), have no overlap.
Proof
The proof is by contradiction: suppose there are two starting intervals \([t_1,t_2]\) and \([t'_1,t'_2]\) of \(G_i\) such that there is overlap between them: \(t_1\le t'_1\le t_2<t'_2\). According to the definition of the episode occurrence in [11], \(t_2-t_1\le \epsilon \) and \(t'_2-t'_1\le \epsilon \). For every event \(e=(r,s,st,et)\in E\) such that \(t_2< st\le t'_2\), there should exist another event \(e'=(r'=r,s'=s,st',et')\) such that \(t_1\le st'\le t_2\). Since \(et'\le st\), if \(t'_1\le st'\le t_2\) then \(\varDelta e'<\epsilon \). If \(t_1\le st'<t'_1\) and \(et'=st\), we have \(\varDelta e'=\mu <2\epsilon \), which is in contradiction to \(\mu >2\epsilon \). If \(t_1\le st'<t'_1\) and \(et'<st\), there should exist another event \(e''=(r,s''\ne s,st'',et'')\) such that \(et'\le st''<et''\le st'\). Since \(\varDelta e'>\epsilon \), we have \(et'>t_2\) and \(\varDelta e''<\epsilon \). These show that the successive starting intervals of \(G_i\) have no overlap. \(\square \)
Lemma 7
Given the episode \(\alpha \) such that \(|CNG_{\alpha }|=k\), \( 1\le i\le k \), \(\mu \ge \max (2\epsilon +1,\epsilon +2)\) and the two occurrences \(O,O'\in OSet(\alpha )\), if \([u,u']\) is the starting interval of \(G_i\) in O, then the following starting interval of \(G_i\) in \(O'\) is \([w,w']\) such that \(w>2\epsilon +u\).
Proof
According to the definition of the occurrence O in [11], \(\exists A^{i}_{j}\in G_i, 1\le i\le k, j\in \{1,\ldots ,l_i\}\) such that \(g_{\alpha }(A^{i}_{j})=(r,s)\), \(h(A^{i}_{j})=a\) and \(e_a=(r,s,st=u,et)\). Since \(\varDelta e_a>\epsilon \), we have \(et>\epsilon +u\). According to Lemma 6, \(w>u'\). For the occurrence \(O'\), \(h'(A^{i}_{j})=b\) such that \(e'_b=(r'=r,s'=s,st'=v,et')\), \(w\le v\le w'\), \(\varDelta e'_b>\epsilon \) and \(et'>v+\epsilon \). If \(st'>et\), there should exist another event \(e_m=(r,s_m\ne s,st_m,et_m)\) such that \(st_m=et\). Since \(\varDelta e_a>\epsilon \) and \(\varDelta e_m>\epsilon \), then \(w>2\epsilon +u\). If \(st'=et\), we have \(\varDelta e_a=\mu \), so \(w-u=\mu >2\epsilon \). Note that the condition \(\mu >2\epsilon \) is reasonable because the minimum span of events is \(\epsilon +1\) [11], so the decomposition unit of events, \(\mu \), can be about twice the minimum span. \(\square \)
Lemma 8
Given the episode \(\alpha =G'_1\rightarrow \cdots \rightarrow G'_k\) and the two occurrences \(O=([t^i_1,t^i_2]^{k}_{i=1})\) and \(Q=([w^i_1,w^i_2]^{k}_{i=1})\), where \(O,Q\in LO(\alpha )\), if \(\exists j, 1\le j\le k-1\), such that for \(r=1,2,\ldots ,j-1\): \(t^r_1=w^r_1\), \(t^r_2=w^r_2\) and \(t^j_1<w^j_1\), then \(t^k_1<w^k_1\) and \(t^k_2<w^k_2\).
Proof
The proof is by induction on k: Base case for \(k=2\): we have \(j=1\) and, according to Lemmas 6 and 7, \(w^1_1>t^1_2\). If \(t^2_1\in [t^1_2+\delta ,\min (t^1_2+\varDelta ,w^1_2+\delta -1)]\), then we have \(O\in LO(\alpha )\). If \(w^2_1\in [w^1_2+\delta ,w^1_2+\varDelta ]\), then \(Q\in LO(\alpha )\). So we have \(w^2_1>t^2_1\) and, according to Lemmas 6 and 7, \(w^2_2>t^2_2\). Induction Step: Assume it is true for \(k=m-1\). It should be proved for \(k=m\). Since the lemma is correct for \(k=m-1\), if \(t^j_1<w^j_1,1\le j\le m-2\), then \(t^{m-1}_{2}<w^{m-1}_{2}\). The starting interval of \(G'_m\) in O, \(t^m_1\), should be in the interval \([t^{m-1}_2+\delta ,\min (t^{m-1}_2+\varDelta ,w^{m-1}_2+\delta -1)]\) because if \(t^m_1<t^{m-1}_2+\delta \), then O is not a valid occurrence, and if \(t^m_1>\min (t^{m-1}_2+\varDelta ,w^{m-1}_2+\delta -1)\), then O is not a valid occurrence or, since \(w^{m-1}_2>t^{m-1}_2\) and \(t^1_1\le w^j_1\), O cannot include the starting interval of \(G'_m\) (because O does not include an LPO).
The starting interval of \(G'_m\) in Q, \(w^m_1\), should also be in the interval of \([w^{m-1}_2+\delta ,w^{m-1}_2+\varDelta ]\) because gap constraints are satisfied. Since \(t^1_1\le w^1_1\) and \(w^{m-1}_2>t^{m-1}_2\), so Q could include the starting interval of \(G'_m\) because it includes an LPO. So we have:
On the other hand, according to Lemmas 6 and 7, two occurrences of \(G'_m\) have no overlap. So we have \(w^m_2\ge w^m_1>t^m_2\). For \(j=m-1\), in a similar way to the previous case, if \(t^m_1\in [t^{m-1}_2+\delta ,\min (t^{m-1}_2+\varDelta ,w^{m-1}_{2}+\delta -1)]\), then O is an LO. If \(w^m_1\in [w^{m-1}_2+\delta ,w^{m-1}_2+\varDelta ]\), then Q is an LO. So we have \(w^m_1>t^m_1\) and, according to Lemmas 6 and 7, \(w^m_2\ge w^m_1>t^m_2\ge t^m_1\). \(\square \)
Lemma 3 Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k\) and the occurrences \(A=([a^i_1,a^i_2]^k_{i=1})\), \(B=([b^i_1,b^i_2]^k_{i=1})\) and \(C=([c^i_1,c^i_2]^k_{i=1})\), where \(A,B,C\in LO(\alpha )\) and A and C are LOs immediately before and after B respectively, if \(a^1_1\ne b^1_1\) and \([b^k_1,b^k_2]\) is not covered by \([a^k_1,a^k_2]\) and \([c^k_1,c^k_2]\), then all of the LOs before A start before B and \([b^k_1,b^k_2]\) is covered by none of the LOs before A and after C.
Proof
According to Lemmas 6, 7 and 8, the starting intervals of all the LOs before A are equal to \([a^1_1,a^1_2]\) or before \([a^1_1,a^1_2]\). If \(a^1_1\ne b^1_1\), it means that \(a^1_1<b^1_1\). It is clear that the starting intervals of all the LOs before A do not coincide with \(b^1_1\). If \(VI([b^k_1,b^k_2],k+1)\) is not covered by \(VI([a^k_1,a^k_2],k+1)\) and \(VI([c^k_1,c^k_2],k+1)\), the starting intervals of all the LOs before A are before \([a^k_1,a^k_2]\) and the starting intervals of all the LOs after C are after \([c^k_1,c^k_2]\). So the VIs of these intervals do not cover \(VI([b^k_1,b^k_2],k+1)\). \(\square \)
Lemma 4 If \(\epsilon \ge \frac{\delta }{4}\) and \( \varDelta \in [\delta ,2\delta ) \), then there is no redundant LO.
Proof
According to the second condition of Definition 18 and Lemma 3, \([t^k_2+\delta ,t^k_2+\varDelta ]\) of O (the redundant LO) should be covered by the LOs immediately before and after O. Assume \(O_1\) and \(O_2\) are the LOs immediately before and after O respectively. We define \([b_1,b'_1], [b,b']\) and \([b_2,b'_2]\) as the starting intervals of the last group of the episode in \(O_1\), O and \(O_2\) respectively. It is clear that \(b'_1<b'<b'_2\). If \(b'_2+\delta \le b'_1+\varDelta \), then we have \(b'_1+\delta<b'+\delta<b'_2+\delta \le b'_1+\varDelta<b'+\varDelta <b'_2+\varDelta \). It means that \([b'+\delta ,b'+\varDelta ]\) is covered completely. Since \(b'_2+\delta \le b'_1+\varDelta \), we have \(b'_2-b'_1\le \varDelta -\delta \). According to the upper bound of \(\varDelta \) (\(\delta \le \varDelta <2\delta \)), \(b'_2-b'_1< \delta \). On the other hand, we have \(b>b_1+2\epsilon \) and \(b_2>b+2\epsilon \) according to Lemma 7. So we have \(b'_2-b'_1>4\epsilon \). It means that \(4\epsilon<b'_2-b'_1<\delta \). Therefore, if \(\epsilon \ge \frac{\delta }{4} \), the second condition of Definition 18 is not satisfied and there is no redundant LO. \(\square \)
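As an illustrative numeric check of Lemma 4 (the concrete values below are ours, not the paper's), take \(\delta =4\), \(\epsilon =1=\frac{\delta }{4}\) and \(\varDelta =7\in [\delta ,2\delta )\). Covering the interval \([b'+\delta ,b'+\varDelta ]\) would then require

```latex
% Illustrative values: delta = 4, epsilon = 1, Delta = 7
\[
  b'_2-b'_1 \;\le\; \varDelta -\delta = 3
  \qquad\text{while Lemma 7 gives}\qquad
  b'_2-b'_1 \;>\; 4\epsilon = 4,
\]
```

a contradiction, so no LO can be redundant for these parameter values, as the lemma states.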
Theorem 1 Given the episode \(\alpha \) and \((r,s)\in RS\), the algorithm SMakeLoList finds all and only the non-redundant LOs of \(\beta =\alpha \oplus (r,s)\).
Proof
To prove this theorem, we focus on the span of LOs in the LOList of episodes. The proof includes three parts: (1) Each occurrence extracted by the algorithm is an LO. Given \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k-1}\) and \(G'=(r,s)\in RS\), we have \(\beta =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k-1}\rightarrow G'\). The proof is by contradiction: assume there is at least one extracted occurrence of \(\beta \) that is not the latest occurrence. For this occurrence, assume there are the corresponding occurrences \(O_{\alpha }\) and \(O_{G}\) of \(\alpha \) and \(G'\) with spans \([r,r']\) and \([x,x']\) respectively. There are two cases: (a) The gap constraints have not been satisfied, which is impossible due to line 11 of the algorithm. (b) There is another LO \(Q_{\alpha }\) of the episode \(\alpha \) with the span \([u,u']\) that satisfies the gap constraints for \([x,x']\) and \(u>r\), \(u'>r'\). Otherwise, \([r,r']\) and \([x,x']\) could form an LO for \(\beta \). So, we have:
Since \([r,r']\) is before \([u,u']\), \(\forall [f,f']\in LOList(\alpha )\) that \(r\le f\le u\), we have:
Since line 7 of the algorithm is satisfied for \(O_{\alpha }\), \([r,r']\) could not be the latest prefix occurrence. (2) The extracted LOs are non-redundant. According to Definition 18, a redundant LO satisfies three conditions, which are checked in lines 15 to 27. According to Lemma 3, it is sufficient to investigate the immediately next and previous LOs. The first condition of Definition 18 is considered in line 19. The second and third conditions are investigated in lines 20 and 21 respectively. So if all the conditions are satisfied, \(LOList(\beta )[z-1]\) is redundant and is removed in line 22. (3) It should be proved that “all” of the non-redundant LOs are found. The proof is by contradiction: suppose there is at least one non-redundant LO \(O_{\beta }\) of \(\beta \), composed of \(O_{\alpha }\) and \(O_{G}\) with spans \([r,r']\) and \([x,x']\) respectively, which is not extracted. Since this LO is non-redundant, the conditions of lines 19 to 22 are not satisfied. Since LOListRS(G) is complete, there are two cases for \(O_{\alpha }\): (a) \([r,r']\in LOList(\alpha )\): it is checked in the first while loop. If \([x,x']\) is not checked for \([r,r']\), it means that the other latest prefix occurrence has been found for it previously, so \([r,r']\) could not be the latest prefix occurrence. When \([x,x']\) is checked for \([r,r']\), if \([u,u']\) is after \([r,r']\) in \(LOList(\alpha )\), then \(u>r\), and we have \(u'>r'\) according to Lemma 8. Since \([r,x']\) is not redundant, it means that \(VI([x,x'],k+1)\) is not covered, or \([x,x']\) extends concurrently, or there is no LO of \(\beta \) with the span \([r,y']\) such that \(y'<x'\). So if \([r,x']\) is not redundant, then \([u,x']\) is not redundant. Thus, the non-redundant LO with the span \([u,x']\) is formed. Therefore, \([r,r']\) could not be an LPO and \([r,x']\) is not an LO, which is in contradiction to the assumption.
(b) If \([r,r']\notin LOList(\alpha )\), then there is a latest occurrence with the span \([u,r']\) such that \(u>r\). Since \(r'\) satisfies the gap constraints with x, the latest occurrence with the span \([u,r']\) also satisfies the gap constraints with \([x,x']\). So there is another valid occurrence with the span \([u,x']\) for which \(\exists j, 1\le j\le k-2\), such that the starting interval of \(G'_j\) in the span \([u,r']\) is greater than in \([r,r']\). So the occurrence \(O_{\beta }\) is not an LO. Therefore, in both cases, all the non-redundant LOs whose LPO is \([r,r']\) are extracted. \(\square \)
Lemma 9
Given the episode \(\alpha \), \((r,s)\in RS\), \(|ResourceType|=r\), \(|LOList(\alpha )|=q\) and \(|LOListRS(r,s)|=p\), time complexity of the algorithm SMakeLOList is \(O(p(r+\frac{\delta }{\epsilon })+q)\) in the worst case and O(p) or O(q) in the best cases.
Proof
Generally, time complexity of the algorithm SMakeLOList is \(O(k\% \times p\times r+q-f)\) where \(0\le k\le 100\) and \(0\le f\le q\). It means that when \(k\%\) of LOListRS(r, s) have been traversed by the f elements of \(LOList(\alpha )\), a member of LOListRS(r, s) is met that should be compared with the remaining \(q-f\) elements of \(LOList(\alpha )\). In the worst case, the first element of \(LOList(\alpha )\) connects to all the \(p-1\) elements of LOListRS(r, s) and the last element of LOListRS(r, s) connects to no element of \(LOList(\alpha )\). Furthermore, for all the extracted occurrences, the functions CExtending and FindIndex are called. Time complexity of CExtending is O(r). For the redundant LOs we have \( b_3+\delta \le b_1+\varDelta \). In addition, in [11], we proved that \( \varDelta -\delta <\delta \). So we have \( b_1<b_2<b_3<b_1+\delta \). On the other hand, in [11], we proved that if \( [x_1,x_1] \) and \( [y_1,y_1] \) are two consecutive starting intervals of (r, s), then \(y_1-x_1>\epsilon \). So time complexity of FindIndex is \( O(\frac{\delta }{\epsilon }) \). Therefore, time complexity is \(O(p(r+\frac{\delta }{\epsilon })+q)\). In the best cases, the first element of \(LOList(\alpha )\) connects to all the members of LOListRS(r, s), or the first element of LOListRS(r, s) connects to no element of \(LOList(\alpha )\). In these cases, the functions CExtending and FindIndex are not called either. Therefore, time complexity is O(p) and O(q) respectively. \(\square \)
Theorem 2
Given the episode \(\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k}\) and \(G=(r,s)\in RS\), the algorithm CMakeLOList only finds all the non-redundant LOs of \(\beta =\alpha \odot G\).
Proof
The proof of the theorem includes three parts: (1) Each occurrence extracted by the algorithm is an LO. According to the definition of the concurrent extension, we have \(\beta =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow (G'_k\cup G=G')\). Since \(LOList(\beta )\) is constructed based on \(LOList(\alpha )\), all the occurrences extracted by the algorithm satisfy the definition of LO. (2) The extracted LOs are non-redundant. According to Definition 18, a redundant LO satisfies three conditions, which are checked in lines 14 to 26. According to Lemma 3, it is sufficient to investigate the immediately next and previous LOs. The first condition of Definition 18 is considered in line 18; the second and third conditions are investigated in lines 19 and 21 respectively. So if all the conditions are satisfied, \(LOList(\beta )[z-1]\) is redundant and is removed in line 22. (3) It should be proved that “all” of the non-redundant LOs are found. The proof is by contradiction: suppose there is at least one non-redundant LO \(A\) of \(\beta \) that is not extracted. According to the previous parts, all the extracted LOs are non-redundant, so the algorithm must not recognize the occurrence \(A\) as an LO. Since each LO of \(\beta \) includes one LO of \(\alpha \) and one LO of G, this means that \(LOList(\alpha )\) or LOListRS(G) is not complete, or the algorithm cannot find the occurrence \(A\) of \(\beta \). Since \(LOList(\alpha )\) and LOListRS(G) are complete and line 9 of the algorithm checks the concurrent extensions of \(\alpha \) with G, \(A\) and all the possible LOs of \(\beta \) are extracted. Therefore, \(A\) is extracted by the algorithm, which contradicts the assumption. \(\square \)
Lemma 10
Given the episode \(\alpha \), \(G=(r,s)\in RS\), \(|LOList(\alpha )|=q\) and \(|LOListRS(r,s)|=p\) and \(|ResourceType|=r\), time complexity of the algorithm CMakeLOList is \(O(p(r+\frac{\delta }{\epsilon })+q)\) or \(O(p+q(r+\frac{\delta }{\epsilon }))\) in the worst cases and \(O(\min (p,q))\) in the best case.
Proof
In the best case, the corresponding element of \(LOList(\alpha )[i]\) matches the corresponding element of LOListRS(G)[j] and both the counters i and j increase repeatedly. Furthermore, all the LOs are non-redundant, so the functions CExtending and FindIndex are called for none of the identified LOs. Thus, time complexity is \(O(\min (p,q))\). In the first worst case, each element of \(LOList(\alpha )\) is checked against t elements of LOListRS(G), where \( 1\le t\le p \); then an LO is identified for each element of \(LOList(\alpha )\), for which the functions CExtending and FindIndex are called in a similar way to Lemma 9. So time complexity is \(O(q(r+\frac{\delta }{\epsilon })+p)\). In the other worst case, each element of LOListRS(G) is checked against w elements of \(LOList(\alpha )\), where \( 1\le w\le q \); then an LO is identified for each element of LOListRS(G), for which the functions CExtending and FindIndex are called. So time complexity is \(O(p(r+\frac{\delta }{\epsilon })+q)\). \(\square \)
Lemma 11
Given \(|ResourceType|=r\), time complexity of the algorithm CreateBranch (Algorithm 9) for the episode \(\alpha \) is \(O(r|CNG_{\alpha }|)\).
Proof
Algorithm 9 has a for loop that processes one CNG of \(\alpha \) in each iteration. Since episodes are represented in the form of SAVE [11], we have \(O(|RArray_{\alpha }[j]|)=r\) where \( 1\le j\le |CNG_{\alpha }|\). Thus, time complexity of reversing each CNG is O(r). So, time complexity of the algorithm is \(O(r|CNG_{\alpha }|)\). \(\square \)
Lemma 12
Given the episodes \(\alpha \) and \(\beta \) and the threshold \(\theta \in {\mathbb {R}}_{\ge 0}\), if \(\beta \sqsubseteq \alpha \) and \(freq(\alpha )\ge \theta \), then \(\theta \le freq(\alpha )\le freq(\beta )\) (the anti-monotonic constraint).
Proof
Since \(\beta \sqsubseteq \alpha \), each occurrence of the episode \(\alpha \) includes an occurrence of \(\beta \). So for every \(O_i\in OSet^{N}_{M}(\alpha )\) there exists \(O'_{i}\subseteq O_i\) such that \(O'_i\in OSet^{N}_{M}(\beta )\) (see Lemma 2). Therefore, \(|OSet^{N}_{M}(\alpha )|\le |OSet^{N}_{M}(\beta )|\), i.e., \(freq(\alpha )\le freq(\beta )\). \(\square \)
Lemma 13
The function InsertInCPBT is only called for FC episodes.
Proof
According to lines 31 to 33 of Algorithm 8, when Flag is True, the function InsertInCPBT is called for the episode \(\alpha \). Initially, Flag is True. When there exists a serial or a concurrent extension of \(\alpha \) whose frequency is equal to \(\alpha \)’s, Flag is set to False and, according to Definition 20, \(\alpha \) is not FC. If there are no such extensions, then according to Lemma 12 the frequency of every episode extended from \(\alpha \) is less than \(\alpha \)’s, so \(\alpha \) is FC. In that case the value of Flag remains True and the function InsertInCPBT is called for it. \(\square \)
Lemma 14
Given the episode \(\alpha \), if the function FindClosedFreqEpisode is not called for \(\alpha \), then \(\alpha \) is not a closed frequent episode.
Proof
The function FindClosedFreqEpisode is called by the function AllClosedFreqEpisodes and by itself. The function AllClosedFreqEpisodes calls it for \(P=\{(r,s)\;|\;|LOListRS(r,s)|>0,\forall (r,s)\in RS\}\) and does not call it for the members of RS that are not frequent. If an episode is not frequent, then it cannot be closed frequent either. The function FindClosedFreqEpisode is recursively called for the serial and concurrent extensions whose frequency is larger than the threshold value c (see lines 13 and 27 of Algorithm 8). So if this function is not called for an episode \(\alpha \), it means that \(freq(\alpha )\) is less than c and \(\alpha \) is not frequent. Therefore it cannot be a closed frequent episode either. \(\square \)
Theorem 3
The algorithm AllClosedFreqEpisodes only finds all the closed episodes.
Proof
The proof includes two parts: (1) all the extracted episodes are closed. The proof is by contradiction: assume there is an episode \(\alpha \) such that \(|CNG_{\alpha }|=k\) and it is not closed. It means that at least one of the scenarios below occurs:
There is at least one episode \(\beta _1\) such that \(\alpha =Prefix(\beta _1,k)\) and \(freq(\alpha )=freq(\beta _1)\): since \(\alpha <\beta _1\), \(\alpha \) is processed earlier. While processing \(\alpha \), all the serial and concurrent extensions of \(\alpha \) are generated. If the frequency of one of them is equal to \(\alpha \)’s, Flag is set to False in Algorithm 8, so \(\alpha \) is not inserted in \({\textit{C}PBT}\). It means that such an episode cannot be found in \({\textit{C}PBT}\).
There is at least one episode \(\beta _2\) such that \(|CNG_{\beta _2}|=n \), \(\alpha =Suffix(\beta _2,n-k+1)\) and \(freq(\alpha )=freq(\beta _2)\). There are two cases: (a) \(\alpha <\beta _2\): in this case, \(\alpha \) is inserted in \({\textit{C}PBT}\) earlier. Since \(\alpha \) is FC and \(\alpha =Suffix(\beta _2,n-k+1)\), \(\beta _2\) is also FC and, according to Lemma 13, it is inserted in \({\textit{C}PBT}\). While inserting \(\beta _2\), the algorithm SearchInTree detects that \(\alpha \) is absorbed by \(\beta _2\). Therefore \({\textit{C}PBT}\) is updated and \(\alpha \) is removed from it. (b) \(\beta _2<\alpha \): since \(\beta _2\) is inserted in \({\textit{C}PBT}\) earlier, the function SearchInTree returns \(-1\) and \(\alpha \) is not inserted in \({\textit{C}PBT}\). Therefore, \(\alpha \) does not exist in \({\textit{C}PBT}\). It means that all the episodes stored in \({\textit{C}PBT}\) are closed.
(2) All of the closed episodes are found. The proof is by contradiction: suppose there exists at least one closed episode \(\alpha \) which has not been found. There are two cases: (a) \(\alpha \) has not been inserted in \({\textit{C}PBT}\). Since \(\alpha \) is a closed episode, it is FC, so when the function FindClosedFreqEpisode is called for \(\alpha \), it is inserted in \({\textit{C}PBT}\). Hence, if the function InsertInCPBT has not been called for \(\alpha \), FindClosedFreqEpisode has not been called for it, and by Lemma 14 \(\alpha \) cannot be a closed frequent episode. So the function InsertInCPBT is called for \(\alpha \). (b) \(\alpha \) has been removed from \({\textit{C}PBT}\). It means that there is another FC episode \(\beta \) whose suffix is \(\alpha \) and \(freq(\alpha )=freq(\beta )\). So, according to the definition of closed episodes, \(\alpha \) is not a closed episode, which contradicts the assumption. Therefore, all the closed episodes are extracted. \(\square \)
B Algorithms
In this appendix, we present some functions in a canonical form and explain them in detail.
1.1 B.1 Function CExtending
This function checks whether the third condition of Definition 18 is satisfied for an occurrence of the episode. As Algorithm 7 shows, the function receives (r, s), which is the last member of the last CNG in the episode, the corresponding entry of LOListRS(r, s) in the occurrence, and the starting interval of the last CNG in the occurrence. It determines whether an event concurrent with (r, s) exists whose state is greater than (r, s) based on the order defined on RS. In lines 1 to 3, if the pointers Next and Previous are null, there is no concurrent event, so the occurrence cannot extend concurrently. In lines 4 to 12, the list linked to the pointer Next is scanned to find such a concurrent event. In lines 13 to 21, the list linked to the pointer Previous is scanned in the same way.
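As a rough illustration of this scan (not the paper's exact Algorithm 7), the sketch below assumes a hypothetical Entry record with rs, nxt and prev fields, and uses Python's default tuple order as the order defined on RS:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    rs: tuple                       # a (resource, state) pair
    nxt: Optional["Entry"] = None   # chain of events concurrent with this one
    prev: Optional["Entry"] = None  # chain in the other direction

def c_extending(rs, entry):
    """Report whether some event concurrent with `entry` has a
    (resource, state) greater than `rs` under the assumed order on RS."""
    e = entry.nxt
    while e is not None:            # scan the list linked to Next
        if e.rs > rs:
            return True
        e = e.nxt
    e = entry.prev                  # then the list linked to Previous
    while e is not None:
        if e.rs > rs:
            return True
        e = e.prev
    return False
```

If both chains are empty, the occurrence cannot extend concurrently, matching the early exit in lines 1 to 3.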
1.2 B.2 Function FindClosedFreqEpisode
The algorithm FindClosedFreqEpisode (Algorithm 8) receives the parameters \( \delta \), \( \varDelta \), \( \epsilon \), the thresholds \(\theta \) and Level, the episode \(\alpha \) and \(LOList(\alpha )\). It forms the concurrent and serial extensions of \(\alpha \) (lines 6 and 20) as the episode \(\beta \) and computes \(LOList(\beta )\) by calling the functions CMakeLOList and SMakeLOList (lines 7 and 21). Then the NO frequency of \(\beta \) is computed by calling the function ComputeFreq (lines 8 and 22). If \(freq(\beta )\) is above the threshold c (computed based on \(\theta \) [11]), the tree is traversed further down by calling FindClosedFreqEpisode recursively in lines 13 and 27, with \(\beta \) and \(LOList(\beta )\) as its parameters. When the serial and concurrent extensions of \(\alpha \) are constructed, it is checked (in lines 9 and 23) whether any of the super patterns \(\beta \) formed from \(\alpha \) has the same frequency as \(\alpha \)’s; if not, \(\alpha \) is FC, and the function InsertInCPBT is called to insert the FC episode \(\alpha \) in \({\textit{C}PBT}\) (line 32).
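The recursive control flow can be modeled with a minimal Python sketch. This is not Algorithm 8 itself: the hypothetical `children` map stands in for the serial/concurrent extensions produced by SMakeLOList and CMakeLOList, and a plain list stands in for InsertInCPBT:

```python
def find_closed_freq_episode(alpha, alpha_freq, children, c, fc_out):
    """Grow `alpha` by every extension, recurse on the frequent ones, and
    keep `alpha` only if no extension preserves its frequency (i.e. it is
    FC).  `children[alpha]` lists (extension, frequency) pairs; `fc_out`
    collects FC episodes in place of the CPBT insertion."""
    flag = True
    for beta, beta_freq in children.get(alpha, []):
        if beta_freq == alpha_freq:
            flag = False            # a super pattern has the same frequency
        if beta_freq >= c:          # frequent: traverse further down
            find_closed_freq_episode(beta, beta_freq, children, c, fc_out)
    if flag:
        fc_out.append(alpha)        # alpha is frequent-closed (FC)
```

On a toy pattern space where "A" extends to "A->B" with equal frequency, only "A->B" survives, mirroring the Flag mechanism of the paper.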
1.3 B.3 Function CreateBranch
The algorithm CreateBranch receives the FC episode \(\alpha \) and its frequency, converts the episode into a branch of \({\textit{C}PBT}\) and returns a pointer to this branch. In lines 3 to 16, the CNGs of \(\alpha \) are processed in the backward direction. In lines 4 and 5, the order of the members of each CNG is reversed. In lines 6 to 15, for each CNG, a Node is created and added to the end of the branch. Finally, \(L_1\), which is a pointer to the first node of the branch, is returned in line 17.
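A minimal sketch of the conversion, assuming a hypothetical Node record with label, freq and children fields (the paper's SAVE representation and the exact node layout are not reproduced; CNGs are modeled as unordered sets, so the member-reversal step collapses into the frozenset):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: frozenset                       # the CNG stored at this node
    freq: int
    children: list = field(default_factory=list)

def create_branch(episode, freq):
    """Convert an FC episode (a list of CNGs, first-to-last) into a
    backward-ordered chain of Nodes; return the head of the chain."""
    head = None
    parent = None
    for cng in reversed(episode):          # backward direction
        node = Node(label=frozenset(cng), freq=freq)
        if head is None:
            head = node
        else:
            parent.children.append(node)   # append to the end of the branch
        parent = node
    return head
```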
1.4 B.4 Function EpisodeAbsorbByTree
Algorithm 10 checks whether \({\textit{C}PBT}\) can absorb the episode \( \alpha \). Finding at least one branch that absorbs \( \alpha \) is sufficient to omit it. In line 1, the function finds the children of the node R whose label includes \(\alpha '.label\) and whose frequency is greater than \(\alpha '.freq\). Note that \(\alpha '\) is the pointer to the first node of the corresponding branch of the episode \(\alpha \). In lines 2 to 4, if there are no such children, \(\alpha '\) cannot be absorbed by \({\textit{C}PBT}\) and the function returns False. In lines 5 to 12, if the last node of \(\alpha '\) is being checked, it is considered whether there exists a member of SubSetChildren that satisfies the condition of equality of the frequencies (see the function CheckFreq in B.7). If such an episode is found, \(\alpha \) can be absorbed by \({\textit{C}PBT}\); so \(\alpha \) is not a closed episode and the function returns True. In line 11, if no such episode exists, the function returns False. In lines 12 to 19, the middle nodes of the branch \(\alpha '\) are checked to determine whether there exists a super-episode that absorbs \(\alpha \). As soon as such a super-episode is found, the function returns True in line 15.
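The subset-descent at the heart of this test can be sketched as below. The Node record and the absorption condition (each episode label a subset of the matching tree node's label, with tree frequencies at least the episode's) are simplifications of Algorithm 10, which additionally distinguishes the last node via CheckFreq:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: frozenset
    freq: int
    children: list = field(default_factory=list)

def episode_absorbed_by_tree(node, labels, freq):
    """Does some branch below `node` absorb the episode whose backward
    branch has node labels `labels` and frequency `freq`?  Each label
    must be a subset of the corresponding tree node's label."""
    if not labels:
        return True                       # whole branch matched
    first, rest = labels[0], labels[1:]
    for child in node.children:
        if first <= child.label and child.freq >= freq:
            if episode_absorbed_by_tree(child, rest, freq):
                return True
    return False
```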
1.5 B.5 Function TreeAbsorbByEpisode
Algorithm 11 finds all the branches of \({\textit{C}PBT}\) that are absorbed by the episode \(\alpha \) (\( \alpha ' \) is the corresponding branch of \( \alpha \)). The path of each such branch is accumulated in Path, and the completed paths are added to PathList. Finally, PathList includes all the paths whose corresponding episodes should be removed from \({\textit{C}PBT}\). In line 1, the children of the node R whose label is a subset of \(\alpha '.label\) and whose frequency is greater than or equal to \(\alpha '.freq\) are found. All the found children are considered in lines 2 to 13. In lines 5 to 7, if an episode is found that \(\alpha \) absorbs, the corresponding Path of the episode is added to PathList and Path is updated. In lines 8 and 9, the middle nodes of \(\alpha '\) are checked to find the episodes that can be absorbed by \(\alpha \). In lines 10 and 11, since the last node of \(\alpha '\) has been reached, Path is updated.
1.6 B.6 Function UpdateBranch
After the non-closed episodes of the tree are recognized by Algorithm 11, their corresponding branches should be updated. The function UpdateBranch (Algorithm 12) updates these branches based on the frequency of the episodes. The function receives the Path and freq of a non-closed episode and the node R from which the search proceeds downwards. In lines 2 and 3, the function starts the search of Path from the node R and decreases the frequency of the node corresponding to Path[1] by freq. If the frequency of the node becomes 0, the frequency of its corresponding episode and of all of its super-episodes is 0; so in lines 4 and 5, that node and its subtree are removed. Otherwise, this procedure is repeated for the remaining entries of Path.
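A minimal, self-contained sketch of this update; the Node record is a hypothetical stand-in for the paper's CPBT node, and Path is modeled as a list of node labels:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: frozenset
    freq: int
    children: list = field(default_factory=list)

def update_branch(node, path, freq):
    """Follow `path` downwards from `node`, decreasing each visited
    node's frequency by `freq`; prune a node (with its whole subtree)
    as soon as its frequency reaches 0."""
    if not path:
        return
    for child in list(node.children):
        if child.label == path[0]:
            child.freq -= freq
            if child.freq == 0:
                node.children.remove(child)   # drop node and its subtree
            else:
                update_branch(child, path[1:], freq)
            return
```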
1.7 B.7 Functions CheckFreq and ComputeNodeFreq
The function CheckFreq (Algorithm 13) determines whether there is an episode in the subtree of the node n of \({\textit{C}PBT}\) whose frequency is equal to the frequency of the episode \(\alpha \). This function receives \(\alpha '\) (the corresponding branch of \(\alpha \)) and the node n of \({\textit{C}PBT}\), and returns True if such an episode exists in \({\textit{C}PBT}\). In lines 1 and 2, it is checked whether freq(Episode(n)) is equal to \(freq(\alpha )\). If not, the children of n are traversed by calling CheckFreq recursively in lines 4 to 8. As soon as a child with the frequency freq is found, the search is stopped and True is returned. The function ComputeNodeFreq (Algorithm 14) computes the frequency of Episode(n): in lines 2 to 4, the frequency of the node n is decreased by the sum of the frequencies of its children, which yields freq(Episode(n)). If \(ComputeNodeFreq(n)>0\), then Episode(n) has occurred in the stream.
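Both helpers fit in a few lines; the sketch below assumes the same hypothetical Node record as above:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: frozenset
    freq: int
    children: list = field(default_factory=list)

def compute_node_freq(n):
    """freq(Episode(n)): the node's frequency minus the sum of its
    children's frequencies (Algorithm 14)."""
    return n.freq - sum(c.freq for c in n.children)

def check_freq(n, target):
    """Is there a node in the subtree of n whose episode frequency
    equals `target`?  Stops at the first match (Algorithm 13)."""
    if compute_node_freq(n) == target:
        return True
    return any(check_freq(c, target) for c in n.children)
```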
1.8 B.8 Function ExtractClosedEpisodeFromCPBT
Algorithm 15 shows the function ExtractClosedEpisodeFromCPBT. The main loop of Algorithm 15 traverses \({\textit{C}PBT}\) until no node except the root of \({\textit{C}PBT}\) remains. In line 3, the traversal starts from the leftmost child. In lines 6 to 11, the leftmost branch of \({\textit{C}PBT}\) is found and its corresponding episode is stored in the episode \(\alpha \). Since episodes have been inserted in the backward direction in \({\textit{C}PBT}\), \(\alpha \) is added to ClosedSet in reverse order in line 12. Furthermore, the branch should be updated: in lines 13 to 25, the frequency of all the nodes of the branch is decreased by \(\alpha \)’s frequency. In lines 15 to 18, if the frequency of a node becomes 0, that node and its subtree are removed. Finally, in line 27, the algorithm returns ClosedSet, which includes all the closed frequent episodes.
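The leftmost-branch extraction can be sketched as follows, again with the hypothetical Node record standing in for the paper's CPBT node:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: frozenset
    freq: int
    children: list = field(default_factory=list)

def extract_closed_episodes(root):
    """Repeatedly follow the leftmost branch to a leaf, emit its episode
    in reverse (episodes are stored backward), subtract the leaf
    frequency along the branch, and prune nodes whose frequency drops
    to 0."""
    closed = []
    while root.children:
        branch = []
        node = root
        while node.children:
            node = node.children[0]           # leftmost child
            branch.append(node)
        leaf_freq = branch[-1].freq
        closed.append(([n.label for n in reversed(branch)], leaf_freq))
        parent = root
        for node in branch:                   # update the branch
            node.freq -= leaf_freq
            if node.freq == 0:
                parent.children.remove(node)  # remove node and subtree
                break
            parent = node
    return closed
```

Each iteration removes at least the leaf of the leftmost branch, so the loop terminates once only the root remains.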
Cite this article
Amiri, M., Mohammad-Khanli, L. & Mirandola, R. A new efficient approach for extracting the closed episodes for workload prediction in cloud. Computing 102, 141–200 (2020). https://doi.org/10.1007/s00607-019-00734-3