
1 Introduction

Time series motif discovery is the task of finding repeated and similar (correlated) patterns in a time series. In recent years, it has received extensive attention and has been applied in many time series data mining and analysis tasks, such as classification, clustering, activity recognition and outlier detection.

In the past decade, a large number of motif discovery algorithms have been proposed. Most existing algorithms aim to find fixed-length motifs. However, in most cases the length of a motif cannot be predicted in advance, so interesting motifs of other lengths will be missed. In most cases, finding variable-length motifs in a time series can reveal more latent patterns than fixed-length motifs, so some works such as [3, 4] have tried to discover variable-length motifs. The biggest challenge of variable-length motif discovery is its massive computational cost: its complexity is 10 times higher than that of the fixed-length motif discovery in [17]. For example, if the motif lengths range from 300 to 10300, the brute-force algorithm requires \(5\times 10^{18}\) Euclidean distance calls [3]. In practice, many industrial applications generate very large time series, yet most algorithms analyze them on a single compute node, which makes it difficult to complete the analysis in a feasible time. Therefore, in recent years, distributed and parallel computing platforms have been widely used for mining and analyzing large-scale time series.

Fig. 1. Variable-length motifs discovered by the PMDSC algorithm on the random walk dataset

In addition, these motif discovery algorithms have other limitations. For example, algorithms that use Euclidean distance as the similarity measure require the two compared subsequences to have the same length, which is not suitable in many applications. We therefore propose a variable-length motif discovery algorithm that uses the Pearson Correlation Coefficient as the similarity measure. The Pearson Correlation Coefficient is a commonly used similarity measure in time series data mining due to its beneficial mathematical properties, such as invariance to scale and offset, and it can reveal the true similarity of two time series. Therefore, we argue that the Pearson Correlation Coefficient is a good similarity measure for time series motif discovery. Figure 1 shows one of the results found by our proposed algorithm.

To address these drawbacks and improve the efficiency of motif discovery in long time series, we introduce a parallel motif discovery algorithm on the Spark platform. The algorithm applies the Pearson Correlation Coefficient as the similarity measure to find variable-length motifs in large-scale time series.

In short, the main contributions of this paper are:

  1. We propose a parallel algorithm for variable-length motif discovery based on subsequence correlation using Spark.

  2. To compute the correlations efficiently, we propose a parallel FFT algorithm on Spark.

  3. To improve the concurrency of parallel jobs, we propose a time series segmentation method and a dot product matrix partition method.

  4. We demonstrate the efficiency and scalability of the proposed algorithm through extensive experiments.

The rest of the paper is organized as follows. Section 2 discusses related works on motif discovery. In Sect. 3, we introduce the problem definition and background concepts used in this paper. In Sect. 4, we present our proposed parallel PMDSC algorithm. The experimental results are shown in Sect. 5 and conclusions are given in Sect. 6.

2 Related Works

The time series motif discovery problem was first proposed in 2002 [7]. Since then, it has received extensive attention and many motif discovery algorithms have been proposed. These methods discover motifs as the most similar subsequences under some similarity measure. As motif discovery is a basic operation in time series data analysis, a large number of research works on it have appeared in recent years.

The existing algorithms for time series motif discovery mainly follow two strategies: fixed-length and variable-length. The fixed-length algorithms proposed in [15] are based on SAX (Symbolic Aggregate approXimation) [6], which represents the time series with symbols. [10] proposed the MK algorithm to find the most similar pair of subsequences as a motif, adopting a pruning method to speed up the brute-force algorithm. In [5], the authors introduced the Quick-Motif algorithm, which is 3 orders of magnitude faster than the traditional fixed-length motif discovery algorithm in [10]. Several recent works also focus on fixed-length motif discovery: [16] introduces an algorithm named STAMP that combines the MASS algorithm to find exact motifs of a given length, and the STOMP algorithm proposed in [17] reduces the time complexity of STAMP from \(O(n^{2}\log n)\) to \(O(n^{2})\).

As variable-length motifs can reveal many more interesting latent patterns than fixed-length ones, many research works have focused on variable-length motif discovery in recent years. In 2011, the VLMD algorithm [11] was proposed, which calls a fixed-length motif discovery algorithm to find K pair-motifs of variable length. In [9], the authors proposed an algorithm that uses Euclidean distance as the similarity measure for Z-normalized segments and applies a lower bound to reduce the computing time of variable-length motif discovery. In [14], the authors proposed a novel method that incorporates grammar induction to find approximate variable-length motifs; its running time is faster than other algorithms, but since the idea relies on grammar induction, the method may be limited in some applications. [2] proposed a method based on discretization in which subsequences do not overlap with their neighbors, which may cause some real results to be lost. [3] introduced an algorithm named HIME based on SAX and an induction graph to find variable-length motifs. This approach can find exact motifs in an acceptable time; however, it is difficult to implement.

In summary, most existing motif discovery algorithms find fixed-length motifs and mainly use Euclidean distance as the similarity measure. As illustrated above, these algorithms have many limitations. In this paper, we propose an efficient parallel time series motif discovery method on Spark. Our approach uses subsequence correlation, combined with a parallel FFT algorithm and a time series segmentation method, to efficiently find variable-length motifs.

Table 1. Symbols and Definitions

3 Problem Definition and Background

In this section, we present the problem definitions and introduce the background concepts used in this paper. The symbols used in this paper are listed in Table 1.

Definition 1

(Time Series). A \(\textit{Time Series T}\) is a sequence of real numbers sampled at a uniform time interval: \(T=[t_1,t_2,\ldots ,t_n]\), where n is the length of T.

Definition 2

(Subsequence). A Subsequence of length m in a time series T is a contiguous run of m points \(T[j:j+m]=[t_j,t_{j+1},\ldots ,t_{j+m-1}]\) starting at position j.

Definition 3

(\(\textit{K-Frequent Motif}\)). Given a time series T and a minimum motif length \(L_{min}\), a K-frequent motif of T is defined as a set \(\phi \) of subsequences that has at least K matches, \(|\phi |\ge K\): \(\phi =\{T'' \mid \forall T' \subset T, \exists T'' \subset T \wedge Corre(T', T'') \ge \theta \}\), where \(T'\) and \(T''\) are subsequences of T with \(|T'| \ge L_{min}\) and \(|T''| \ge L_{min}\), \(Corre(T', T'')\) is the correlation coefficient between \(T'\) and \(T''\), and \(\theta \) is a user-specified similarity threshold on the correlation.

Definition 4

(Pearson Correlation Coefficient). The Pearson Correlation Coefficient is a measure of the correlation between two variables and reflects the degree of similarity between two subsequences. For two subsequences \(T'\) and \(T''\), it is computed as follows.

$$\begin{aligned} Corre(T',T'')=\frac{E[(T'-E(T'))(T''-E(T''))]}{\sigma _{T'} \sigma _{T''}} \end{aligned}$$
(1)

The Pearson Correlation Coefficient can equivalently be written as Formula 2, where \(T'\) and \(T''\) are subsequences of T, \(\mu _{T'}\), \(\mu _{T''}\) and \(\sigma _{T'}\), \(\sigma _{T''}\) are the means and standard deviations of \(T'\) and \(T''\) respectively, and \(\sum {T'T''}\) can be computed from the dot product between the subsequences.

$$\begin{aligned} Corre(T',T'')=\frac{\sum T'T'' -m\mu _{T'} \mu _{T''}}{m\sigma _{T'} \sigma _{T''}} \end{aligned}$$
(2)
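To make Formula 2 concrete, the following is a minimal Scala sketch (Scala being the language of our implementation, though these helper names are illustrative rather than the paper's code) of evaluating the correlation from a precomputed dot product:

```scala
// Sketch of Formula 2: Pearson correlation from a precomputed dot product.
// `dot` is sum(T'(k) * T''(k)) over the m aligned points.
def pearson(dot: Double, m: Int,
            mu1: Double, mu2: Double, sd1: Double, sd2: Double): Double =
  (dot - m * mu1 * mu2) / (m * sd1 * sd2)

// Mean and (population) standard deviation of a subsequence.
def meanStd(xs: Array[Double]): (Double, Double) = {
  val mu = xs.sum / xs.length
  val sd = math.sqrt(xs.map(x => (x - mu) * (x - mu)).sum / xs.length)
  (mu, sd)
}
```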

FFT (\(\textit{Fast Fourier Transform}\)) is an efficient algorithm for computing the DFT (\(\textit{Discrete Fourier Transform}\)) of a sequence. In many applications, we are interested in finding motifs or similar shapes. Before computing the correlation coefficient, we normalize the two subsequences by \(\textit{Z-Normalization}\). After \(\textit{Z-Normalization}\), FFT can be used to compute the cross products of arbitrary subsequences of two sequences, which reduces the computational time complexity of \(\sum {T'T''}\) in Formula 2 to \(O(n \log n)\). To improve the efficiency of motif discovery, we implemented a parallel FFT on Spark. To avoid redundant computations, we precompute the dot products of all time series subsequences using the parallel FFT and organize them as a Dot-Product-Matrix \(\mathcal {Z}\).
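To illustrate the \(O(n \log n)\) trick, below is a minimal single-machine sketch of computing all sliding dot products of a query against a series via FFT, using the Apache Commons Math library. This is our illustration of the standard FFT-based correlation technique behind \(\mathcal {Z}\), not the paper's distributed implementation.

```scala
import org.apache.commons.math3.complex.Complex
import org.apache.commons.math3.transform.{DftNormalization, FastFourierTransformer, TransformType}

// All sliding dot products sum_k t(i+k) * q(k) of a non-empty query q
// (|q| <= |t|) against a series t, via FFT in O(n log n) instead of O(n m).
def slidingDotProducts(t: Array[Double], q: Array[Double]): Array[Double] = {
  val n = t.length; val m = q.length
  // Commons Math requires a power-of-two length; pad past n + m - 1.
  val size = Integer.highestOneBit(2 * n - 1) * 2
  val ta = java.util.Arrays.copyOf(t, size)
  val qa = new Array[Double](size)
  for (k <- 0 until m) qa(k) = q(m - 1 - k) // reversed: convolution -> correlation
  val fft = new FastFourierTransformer(DftNormalization.STANDARD)
  val ft = fft.transform(ta, TransformType.FORWARD)
  val fq = fft.transform(qa, TransformType.FORWARD)
  val prod = ft.zip(fq).map { case (a, b) => a.multiply(b) }
  val inv = fft.transform(prod, TransformType.INVERSE)
  // Entry m - 1 + i of the convolution holds sum_k t(i+k) * q(k).
  (0 to n - m).map(i => inv(m - 1 + i).getReal).toArray
}
```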

Fig. 2. The framework of the PMDSC algorithm

4 Parallel Variable-Length Motifs Discovery

In this section, we present the implementation details of the parallel motif discovery based on subsequence correlation on Spark. Our approach relies on a new parallel FFT and a data segmentation technique. Figure 2 shows the framework of our motif discovery algorithm. In the following, we first introduce the time series segmentation method, then describe the Dot-Product-Matrix computation and partition method, and finally detail the motif discovery.

Algorithm 1. Time Series Segmentation

4.1 Time Series Segmentation

To better utilize the properties of RDDs (Resilient Distributed Datasets), we introduce a partition approach that divides a time series into multiple segments (subsequences) of equal length \(\textit{len}\). To ensure exact results, we keep an overlap of length \(L_{min}\) between adjacent segments. For each segment, the range of indices of its data points is given by Formulas 3 and 4, in which \(I_{min}\) and \(I_{max}\) are the minimum and maximum data point indices in segment \(S_P\), respectively. The number of segments \(P_{max}\) of a time series is obtained from len and \(L_{min}\) by Formula 5. Algorithm 1 describes the implementation on Spark.

$$\begin{aligned} I_{min}\ge P*(len-L_{min}) \end{aligned}$$
(3)
$$\begin{aligned} I_{max}\le (P+1)*len-P*L_{min} \end{aligned}$$
(4)
$$\begin{aligned} P_{max}=\frac{n-L_{min}}{len-L_{min}}+1 \end{aligned}$$
(5)
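As a worked example (using the parameter values from Sect. 5): a time series of length \(n = 16384\) with \(len = 400\) and \(L_{min} = 100\) gives \(P_{max} = (16384-100)/(400-100)+1 \approx 55.3\), i.e., 56 segments if the fraction is rounded up; we assume rounding up so that the tail of the series is covered.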

In Spark, we use the \(\textit{map}\), \(\textit{partitionBy}\) and \(\textit{groupByKey}\) operators to complete the time series segmentation task, as presented in Algorithm 1. The \(\textit{map}\) operator transforms the time series data points into \(\langle key, value \rangle \) pairs (Lines 1–2), where the value is the data point \(\textit{T(i)}\) and the key is the index i of the data point. For each data point, we assign its partition number P as its new key according to Formulas 3 and 4 and transform it into \(\langle P, (i, T(i)) \rangle \) (Lines 3–5). Then, the \(\textit{partitionBy}\) operator partitions the new key/value pairs into multiple partitions according to the partition number (Line 6). Finally, the \(\textit{groupByKey}\) operator aggregates the new key/value pairs into groups, keeping the same order as in the original time series (Line 7). This completes the segmentation.
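The following is a minimal Spark Scala sketch of this segmentation, under our assumption that the input RDD holds (index, value) pairs; points in the overlap region are emitted to both adjacent segments, consistent with Formulas 3 and 4.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Sketch of Algorithm 1: split an (index, value) time series RDD into
// segments of length `len` that overlap their neighbors by `lMin` points.
def segment(ts: RDD[(Long, Double)], len: Int, lMin: Int,
            numPartitions: Int): RDD[(Int, Seq[(Long, Double)])] = {
  val stride = len - lMin
  ts.flatMap { case (i, v) =>
    // Candidate segment numbers whose index range (Formulas 3-4) covers i;
    // a point in the overlap region belongs to two segments.
    val pHigh = (i / stride).toInt
    val pLow  = math.max(0L, (i - len + stride) / stride).toInt
    (pLow to pHigh)
      .filter { p =>
        val start = p.toLong * stride
        i >= start && i < start + len
      }
      .map(p => (p, (i, v)))
  }
  .partitionBy(new HashPartitioner(numPartitions)) // Line 6 of Algorithm 1
  .groupByKey()                                    // Line 7
  .mapValues(_.toSeq.sortBy(_._1))                 // restore time order
}
```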

Each time series is processed in a similar way. After that, all possible subsequence pairs must be generated. First, we use the map operator to assign new keys to the divided subsequences according to the maximum partition number \(P_{max}\) of each time series. Then, we join the subsequences from the two time series using the \(\textit{join}\) operator. This step generates records of the form \(\langle (P_1, P_2) , (Iterable[T'], Iterable[T''])\rangle \), where \(P_1\) and \(P_2\) are the indices of the two subsequences.
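One plausible Spark reading of this pairing step (the names and the replication scheme are our sketch, not the paper's exact code):

```scala
// segsA, segsB are outputs of `segment` above; pMaxA, pMaxB are the
// segment counts of the two series (Formula 5).
val keyedA = segsA.flatMap { case (p1, s) =>
  (0 until pMaxB).map(p2 => ((p1, p2), s))
}
val keyedB = segsB.flatMap { case (p2, s) =>
  (0 until pMaxA).map(p1 => ((p1, p2), s))
}
// Every pair of segments meets on one worker, keyed by (P1, P2).
val segPairs = keyedA.join(keyedB)
```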

Algorithm 2. Dot-Product-Matrix Computation and Partition

4.2 Dot-Product-Matrix Computation and Partition

Variable-length motif discovery can reveal many more interesting latent patterns than fixed-length motif discovery. However, it is very expensive in time cost. As motif discovery is a pairwise operation that computes the correlations of all possible subsequences, it involves many redundant computations [8]. To avoid them, we precompute the dot products using the parallel FFT and store the results in the Dot-Product-Matrix \(\mathcal {Z}\). In fact, the matrix partition is performed during its computation, and the matrix is stored in multiple distributed blocks, as shown in Algorithm 2.

The Dot-Product-Matrix \(\mathcal {Z}\) stores the shift-cross products of all possible subsequences of two time series. For parallel processing, the naive way to use \(\mathcal {Z}\) is to send it to all worker nodes. In fact, only a few columns of \(\mathcal {Z}\) are needed to compute the correlations of each pair of subsequences. To avoid unnecessary data transfer and improve efficiency, we propose a partition technique for \(\mathcal {Z}\). For each pair of subsequences, we can locate the corresponding blocks in \(\mathcal {Z}\) according to their partition numbers. Without loss of generality, assume partition numbers \(P_1\) and \(P_2\) for two subsequences \(T'\) and \(T''\) with lengths \(len_1\) and \(len_2\). The part \(\mathcal {Z}'\subset \mathcal {Z}\) needed for computing Corre(\(T''\), \(T'\)) ranges from \(\mathcal {Z}[*][|T|+(P_2-P_1+1)*L_{min}+P_1*len_1-(P_2+1)*len_2]\) to \(\mathcal {Z}[*][|T|+(P_2-P_1-1)*L_{min}+(P_1+1)*len_1-P_2*len_2]\).
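A direct transcription of these bounds as a helper (the function is ours; it mirrors the expressions above without further simplification):

```scala
// Column range of Z needed to compute Corre(T'', T') for segments P1, P2
// (a transcription of the bounds given in the text; tLen is |T|).
def blockColumns(tLen: Int, p1: Int, p2: Int,
                 len1: Int, len2: Int, lMin: Int): (Int, Int) = {
  val first = tLen + (p2 - p1 + 1) * lMin + p1 * len1 - (p2 + 1) * len2
  val last  = tLen + (p2 - p1 - 1) * lMin + (p1 + 1) * len1 - p2 * len2
  (first, last)
}
```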

In Algorithm 2, the first part computes \(\mathcal {Z}\) (Lines 1–8). It gets all subsequences of T (Lines 2–3) and extends their length to twice the length of T (Lines 4–6). Then, each subsequence \(T_i\) of T and T itself are transformed using the FFT (Line 7) and their cross products are obtained (Line 8). Finally, a part of the Dot-Product-Matrix is retrieved by the inverse Fourier transform (Line 9). The algorithm returns a two-dimensional array containing the sums of the products of the elements of T and \(T_i\) for the different shifts of T. The \(\textit{FFT()}\) (Line 7) ensures that this process runs in \(O(n \log n)\) time. The \(\textit{FFT}\) is a parallel implementation [13], carried out in the following four steps.

  1. Get the original index of each element in the time series.

  2. Compute the binary code \(\mathcal {B(I)}\) of each element according to its original index and the length of the time series.

  3. Compute the butterfly coefficients using Formula 6 based on the values of the bits in \(\mathcal {B(I)}\).

  4. Compute the final result of each element using Formula 7 (a single-machine sketch of these butterfly stages follows the formulas).

$$\begin{aligned} W(n)=\prod W_{2^j}^k \cdot (-1)^t, \quad k \in [0, 2^{j-1}), \quad t= \begin{cases} 0 & n = k\\ 1 & n = k+2^{j-1} \end{cases} \end{aligned}$$
(6)
$$\begin{aligned} X(n)=\sum _{i=0}^{n-1} x(i)*W(n) \end{aligned}$$
(7)
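For reference, here is a compact single-machine sketch of the radix-2 butterfly scheme that the four steps and Formulas 6–7 describe; the Spark version distributes these stages across workers, while this sketch only shows the arithmetic.

```scala
// In-place FFT of a complex signal split into real/imaginary arrays;
// the length must be a power of two.
def fft(re: Array[Double], im: Array[Double]): Unit = {
  val n = re.length
  // Steps 1-2: reorder elements by the bit-reversed code of their index.
  var j = 0
  for (i <- 0 until n - 1) {
    if (i < j) {
      val tr = re(i); re(i) = re(j); re(j) = tr
      val ti = im(i); im(i) = im(j); im(j) = ti
    }
    var bit = n >> 1
    while (j >= bit) { j -= bit; bit >>= 1 }
    j += bit
  }
  // Steps 3-4: butterfly stages; (wr, wi) is the coefficient W_{2^j}^k of
  // Formula 6, and the +/- pair realizes its t = 0 / t = 1 cases.
  var size = 2
  while (size <= n) {
    val ang = -2.0 * math.Pi / size
    for (start <- 0 until n by size; k <- 0 until size / 2) {
      val wr = math.cos(ang * k); val wi = math.sin(ang * k)
      val a = start + k; val b = a + size / 2
      val xr = re(b) * wr - im(b) * wi
      val xi = re(b) * wi + im(b) * wr
      re(b) = re(a) - xr; im(b) = im(a) - xi
      re(a) += xr; im(a) += xi
    }
    size <<= 1
  }
}
```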

The last part of Algorithm 2 (Lines 9–16) shows the details of the partition of matrix \(\mathcal {Z}\). While computing \(\mathcal {Z}\), we assign a key to the elements belonging to the same block according to the partition numbers of the two subsequences. \(\mathcal {Z}\) is organized and stored in the form of key/value pairs, which completes the matrix partition task. When performing motif discovery and computing the correlations of each pair of subsequences, the subsequence pair and the corresponding block \(\mathcal {Z}' \subset \mathcal {Z}\) are grouped together and shuffled to the same worker node according to their assigned key in Algorithm 3.
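In Spark terms, this co-location is simply a join on the shared block key; a sketch under our naming:

```scala
// zBlocks: RDD[((Int, Int), Array[Array[Double]])] built in Algorithm 2,
// keyed by the (P1, P2) block id; segPairs comes from the pairing step in
// Sect. 4.1. Joining on the key ships each block only where it is needed.
val work = segPairs.join(zBlocks)
// => RDD[((p1, p2), ((segA, segB), zBlock))], ready for Algorithm 3
```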

Algorithm 3. K-Frequent Motif Discovery

4.3 K-Frequent Motif Discovery

In this part, we introduce the algorithm for discovering K-frequent motifs, shown as Algorithm 3. It computes the correlations of all possible subsequence pairs and filters out the covered subsequence pairs to keep the longest ones.

In Algorithm 3, the input list contains two segments, \(Segment_{T'}\) and \(Segment_{T''}\), generated by applying the segmentation method of Algorithm 1 to the two time series. To obtain variable-length motifs, we must compute the correlations of all possible subsequences contained in these two segments (Lines 3–11). According to the Dot-Product-Matrix partition method, only a small block \(\mathcal {Z}' \subset \mathcal {Z}\) is needed for each pair of segments to compute the correlations of the contained subsequences (Line 8). For each pair of subsequences \((S_i,S_j)\), where \(S_i \in Segment_{T'}\) and \(S_j \in Segment_{T''}\), the means and standard deviations of the data points in \(S_i\) and \(S_j\) are computed (Lines 9–10). After that, Formula 2 gives the correlation coefficient of the two subsequences (Line 11). Then, a filtering method selects the required subsequence pairs (Lines 12–15). Note that some short subsequence pairs may be covered by longer ones, so it is necessary to remove the covered pairs: in the last step, we filter out the short ones and keep the long ones (Line 17; a sketch of this test follows). Finally, Algorithm 3 returns the longest subsequence pair contained in each pair of segments.
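The covering test can be sketched as follows; this is our minimal interpretation, as the paper does not spell out the predicate:

```scala
// A match is a correlated subsequence pair: start positions i, j and length.
case class Match(i: Int, j: Int, len: Int, corr: Double)

// A shorter match is covered when a longer one contains it on both series.
def covers(l: Match, s: Match): Boolean =
  l.i <= s.i && s.i + s.len <= l.i + l.len &&
  l.j <= s.j && s.j + s.len <= l.j + l.len

// Keep only matches not covered by any other (Lines 12-17 of Algorithm 3).
def keepLongest(ms: Seq[Match]): Seq[Match] =
  ms.filterNot(s => ms.exists(l => l != s && covers(l, s)))
```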

After that, we use the groupByKey operator to aggregate the motifs with the same key, obtaining a series of motifs of the form \(\langle Pid_1, Iterable(Pid_2, len,Correlation)\rangle \). From this, the motifs whose frequency is at least K can be found easily.
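Under our naming, this final frequency filter reduces to:

```scala
// matches: RDD[(Int, (Int, Int, Double))] holding (pid1, (pid2, len, corr))
// records produced by Algorithm 3; keep groups with at least K matches.
val kFrequent = matches.groupByKey()
  .filter { case (_, ms) => ms.size >= k }
```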

Fig. 3. Time series length.

Fig. 4. Partition segment length.

Fig. 5. Computing nodes number.

5 Experiments

We implemented our proposed algorithm in the Scala programming language on Spark. Our experiments are conducted on a Spark-2.1.0 cluster with one master node and 8 worker nodes. Each node runs Linux CentOS 6.2 and is equipped with 4 cores, 6 GB of memory and 500 GB of disk storage. In this section, we evaluate the experimental results under variations of different parameters, including the segment length, the length of the time series and the threshold \(\theta \). We also test scalability by changing the number of computing nodes. All experiments use public datasets downloaded from the web. Four datasets are used in our experiments: Blood Pressure [1], Light Curve [12], Power [8] and Random Walk.

5.1 Effect of Time Series Length

In this experiment, we verify the efficiency of PMDSC by varying the length of the time series from 2000 to 32,768, with the minimum motif length \(L_{min} = 100\), the correlation threshold \(\theta \) = 0.9 and the partitioned segment length len = 400. The experimental results are shown in Fig. 3.

From Fig. 3, we observe that the time cost increases nearly linearly on the four datasets as the length of the time series grows. Comparing the four datasets, the Power dataset is the most expensive because it contains large spikes that increase the time to compute the correlations in each partition. The time costs on the Blood Pressure and Random Walk datasets are smaller and nearly the same; the reason is that after normalization, the spikes and changing frequencies of these two datasets are similar to each other.

5.2 Effect of the Segment Length

In this test, we measure the time cost while changing the segment length len from 400 to 800 in steps of 100, with the time series length set to 16,384, the correlation threshold \(\theta \) = 0.9 and \(L_{min}\) = 100. Figure 4 shows the effect of the segment length on the time cost.

From Fig. 4, we find that the time costs increase on all four datasets as the segment length grows. The reason is that a larger segment length brings more subsequence pairs to be processed in each segment. We can also see that the growth trend differs across the four datasets. When the segment length exceeds 500, the time cost increases most on the Power dataset, followed by the Light Curve dataset; this is caused by the large variations of the data values in these two time series. In contrast, the distribution of data point values in the other two datasets is relatively stable, so the changes in time cost on Blood Pressure and Random Walk are relatively small.

5.3 Effect of the Number of Computing Nodes

In this part, we test the scalability of our proposed method by changing the number of computing nodes in the cluster, with the length of the time series set to 16,384, the segment length len = 400 and the correlation threshold \(\theta \) = 0.9. The experimental results are shown in Fig. 5: the time costs on the four datasets decrease as the number of computing nodes increases from 4 to 8, since more computing nodes mean more computing power and higher parallel concurrency.

6 Conclusion

A time series motif is a repetitive similar pattern in a time series. In this paper, we introduced a parallel algorithm to discover variable-length time series motifs that can process large-scale time series in an acceptable time. Experimental results demonstrate that our algorithm can efficiently and precisely find motifs in large-scale time series. In the future, we will extend our method to find motifs in multivariate time series.