Keywords

1 Introduction

Many areas of science, engineering and business are generating, archiving and processing vast amounts of time series data. Mathematically, a time series T = {T[1], T[2], …, T[n]} is a sequence of n real numbers in the increasing order of time, where each value has a time stamp. The ability to match two time series, T1 of length m, and T2 of length n, to determine their similarity, is a fundamental and critical step in most time series data mining applications including query by content, clustering and classification [1]. Time series T1 and T2 may differ in length, time scale, amplitude scale, time shift, and amplitude shift. There may be considerable distortion due to time warp and noise, and the elements of T1 and T2 may not align. In such cases, the matching should be established on general shapes and local trends of T1 and T2. There are three time series matching scenarios.

  1. 1.

    Sequence-to-sequence matching

  2. 2.

    Sequence-to-subsequence matching

  3. 3.

    Subsequence-to-subsequence matching

The subsequence-to-subsequence matching, which is the task of establishing similarity based on sufficiently long similar subsequences of T1 and T2, has received very little attention and remains an open research problem. In the simplest case, where T1 and T2 are of the same time scale and basic Euclidean distance is used as the similarity measure, the order of computation for subsequence to subsequence matching is O(n3m). As time series are high dimension data, the computation of dissimilarity between two time series in their raw form is very expensive. The situation becomes even worse, if the data points of the two time series do not align.

This paper presents an approach capable of handling sequence-to-sequence, sequence-to-subsequence, and subsequence-to-subsequence matching effectively and efficiently. A segmentation algorithm that selects perceptually important points as primary break points, a time series alignment algorithm that suggests sequence-to-sequence, sequence-to-subsequence, or subsequence-to-subsequence based on the relational analysis of primary break points, and a hierarchical representation that supports coarse-fine matching are the primary contribution of this paper.

The remainder of this paper is organized five sections. Section 2 reviews briefly the related work. The HPLA based time series matching approach is described in Sect. 3. The approach consists of three major steps: Identification of perceptually important primary breakpoints, alignment of the two time series to identify pairs of corresponding candidate matching segments, and matching of segments in each suggested pair using their HPLA representations. The experimental results using a variety of time series data are analyzed and discussed in Sect. 4. Finally, the conclusion is given in Sect. 5.

2 Related Work

Several distance measures have been developed for the computation of the dissimilarity between T 1 and T 2 [2]. These distance measured are grouped into 4 categories: lock-step, elastic, edit and threshold based distance measures. The computationally efficient lock-step distances are not capable of handling even the slightest misalignment between T 1 and T 2. The elastic measures, such as Dynamic Time Warping (DTW), allow time series to be stretched or compressed as needed to achieve good matching [3]. As time series are high dimension data and DTW uses dynamic programming requiring O(mn) time, matching time series in their raw form is computationally expensive. Researchers have embraced two approaches for improving computational efficiency. They have developed techniques to speed-up DTW and other time series matching algorithms, and to represent time series compactly while preserving salient attributes.

Sakoe and Chiba [4] improved the efficiency of the DTW algorithm by defining a warp window, and by comparing each data point of T2 (query sequence) with only the data points of T1 (sequence in the data base) that are inside the warp window. The Fast Time Series Evaluation algorithm (FTSE) maps data points of T1 into a grid based on their values. The data point of T1 that matches T2[i] is determined by comparing T2[i] with only the data points of T1 that reside in the same grid cell as T2[i] [5]. The Embedding-Based Subsequence Matching (EBSM) algorithm converts each subsequence of T1 into a k-dimensional vector, where the ith component of the vector is the DTW distance between the subsequence and the ith embedding sequence. Thus, each time series T1 in the database becomes a sequence of vectors. The query sequence T2 is also converted to a vector using the same embedding sequences, and vector matching techniques are used for retrieval. The experimental results in [6] indicate one to two orders of magnitude faster retrieval than the brute force method. Though time series are converted to sequence of vectors offline, the approach generates a large number of vectors with high computational cost. Many widely used methods including DTW are natural for only sequence-to-sequence matching. There are variants of DTW algorithm, which are developed for sequence-to-subsequence matching [7]. Some methods, in order to handle this problem, cut the long time series into non-overlapping short segments, and match each segment with the query sequence [8]. Such approaches cannot retrieve any subsequence other than the stored segments. Faloutsos et al. [9] use a sliding window of size w to convert each time series in the database to a trail in a low-dimensional feature space. The window is placed at all possible position, features are extracted for the subsequence inside the window and used to map the subsequence to a point in the feature space. The trail is partitioned into sub-trails, and each sub-trail is enclosed in a minimum bounding rectangle for indexing purpose. Similarly, the query sequence T2 is mapped to the feature space to determine the sequences for retrieval.

The second approach obtains compact representations of the two time series, and matches them in the representation space. During the past two decades, several representations, such as Discrete Fourier Transform (DFT) [10], Discrete Wavelet Transform (DWT) [10], Singular Value Decomposition (SVD) [11], Piecewise Aggregation Approximation (PAA) [12], Adaptive Piecewise Constant Approximation (APCA) [13], Piecewise Linear Approximation (PLA) [14], Piecewise Polynomial Approximation (PPA) [16], Symbolic Representations (SAX) [15], etc. have been developed.

It is important to note that most of the time series matching research is in the context of query by content, where the focus is on whole sequence or sequence-to-subsequence matching. The indexing centered algorithms and representations are not very beneficial to many other applications of time series matching. The speed-up techniques beneficial to indexing, for example, are not beneficial to clustering, where pairwise comparison of all time series in the dataset is needed. It is fair to say that, relatively, very little work has been done on subsequence-to-subsequence matching of time series.

Any representation that is suitable for subsequence-to-subsequence matching must segment and represent the matching subsequences in the two time series similarly, even if the two time series differ in translation, time and amplitude scale. Ideally, the segmentation should ensure a one-to-one mapping of segments of the two matching subsequences, and the corresponding segments must be similar. One approach is to partition time series at perceptually important points, and build the representation for each segment independently. The local maxima and minima, and points at which the slope changes abruptly may be taken as the perceptually important points. Bettaiah and Ranganath [17] have clearly shown that the segmentation algorithm used for the generation of PAA, APCA, PLA, and SAX representations do not segment the two time series to meet the stated requirement. Thus, they are not able to support subsequence-to-subsequence matching [19]. The DFT being a global transform is also not able to handle subsequence-to-subsequence matching [17]. The PLA representation has the potential to support the development of algorithms for all matching scenarios if each time series is segmented into identifiable segments by placing breakpoints at the perceptually important local maxima and minima [18]. However, well-known and frequently used segmentation methods (sliding window, top-down and bottom-up) do not guarantee the identification of PIPs as breakpoints.

3 The HPLA Based Time Series Matching Approach

An efficient and effective approach for subsequence-to-subsequence matching is given in this section. The approach consists of three major steps: Identification of perceptually important primary breakpoints, alignment of the two time series to identify pairs of corresponding candidate matching segments, and matching of segments in each suggested pair using their HPLA representations.

3.1 Identification of Perceptually Important Primary Breakpoints

Usually, a time series has many local maxima and minima due to the presence of noise. The goal is to develop an algorithm that ignores minor fluctuations and identifies prominent peaks and valleys that define the general shape of the time series. The following algorithm selects such prominent peaks and valleys as primary breakpoints.

The algorithm first identifies every maximum (minimum) with a raise (fall) greater than the average raise (average fall) as an initial perceptually important maximum (minimum), and stores its value in Prominent_MaxMin_I. The time index of each entry in Prominent_MaxMin_I is recorded in Prominent_MaxMinIndex_I. The adja-cent elements of Prominent_MaxMin_I are examined to obtain Prominent_MaxMin_F, the final list of perceptually important points.

3.2 Time Series Alignment Algorithm

Let, {p 1, p 2, p 3, …, p N } be the set of breakpoints of T 1 identified by Determine_Primary_BeakPoints. Let, A be the (N × N × N) relational array, where a ijk is a r-dimensional vector that specifies the relationship between p i , p j , and p k . In this paper, a 2-dimensional vector \( a_{ijk} = \left[ {\left( {t_{j} - t_{i} } \right)/\left( {t_{k} - t_{j} } \right) \, \left( {T_{1} \left[ {t_{j} } \right]{-}T_{1} \left[ {t_{i} } \right]} \right)/(T_{1} \left[ {t_{k} } \right]{-}T_{1} [t_{j} ])} \right] \) is used to specify the relationship among p i , p j , and p k . Note, a ijk is computed for all (i, j, k), where \( 1\le i \le \left( {N - 2} \right),\,i + 1\le j \le \left( {N - 1} \right) \), and j + 1 ≤ k ≤ N. Similarly, B is a (M × M × M) relational array, where b lmn specifies the relationship among q l , q m , and q n , and is computed as \( \left[ {\left( {t_{m} - t_{l} } \right)/\left( {t_{n} - t_{m} } \right) \, \left( {T_{2} \left[ {t_{m} } \right] \, {-} \, T_{2} \left[ {t_{l} } \right]} \right)/(T_{2} \left[ {t_{n} } \right] \, {-} \, T_{2} [t_{m} ])} \right] \). Note that A and B are invariant to translation, time scale, amplitude shift, and amplitude scale. The primary breakpoint mapping matrix C is computed by matching elements of A and B as follows.

The contents of matrix C suggest possible correspondences between the break-points of T 1 and T 2. For example, a high value of c il suggests that p i in T 1 is very likely to correspond to q l in T 2. If c jm is zero or close to zero then p j is unlikely to correspond to q m . The algorithm Align_Primary_BreakPoints identifies likely correspondences between the primary breakpoints of T 1 and T 2 by analyzing C. The following example illustrates the use of the above algorithm for aligning two time series.

The two time series T1 and T2 to be aligned are given in Figs. 1 and 2, respectively. The time scale of T1 and the time scale of T2 differ by a factor of 2, and the subsequence T1[100:524] is identical to T2 in shape. In other words, T1 and T2 differ in translation, time scale and length. The algorithm Determine_Primary_BreakPoints partitions time series T1 into 6 primary segments by identifying 7 primary breakpoints (including end points) labeled p1 through p7. The primary breakpoints of T1 are {(1, 0), (52, 0.1735), (191, −0.1979), (263, 0.2072), (431, 0.16), (504, 0.211), (635, −0.1375)}. The five 2-dimensional arrays that make 7 × 7 × 7 relational array A are shown in Fig. 3.

Fig. 1
figure 1

The time series T1 of length 635 with 7 primary breakpoints

Fig. 2
figure 2

The time series T2 of length 212 with 6 primary breakpoints

Fig. 3
figure 3

The relational array for the time series in Fig. 1

The time series T 2 is partitioned into 5 primary segments and the 6 primary break-points are labeled q 1 through q 7. The primary breakpoints of T 2 are {(1, −0.0023), (45, −0.1969), (82, 0.2061), (167,−0.1622), (203, 0.2112), (213, 0.075)}. The four 2-dimensional arrays that make 6 × 6 × 6 relational array B are shown in Fig. 4.

Fig. 4
figure 4

The relational array for the time series in Fig. 2

The vector aijk is taken as a match to blmn if corresponding values are within 15 % of each other. That means ε1 is set to 0.15. The 6 × 7 breakpoint mapping matrix C, where rows represent break points of T2 (q1 through q6), and columns represent breakpoints of T1 (p1 through p7) is given Fig. 5a. If the tolerance is increased to 20 % (ε1 = 0.2), the C matrix changes as shown in Fig. 5b.

Fig. 5
figure 5

Breakpoint mapping matrices for tolerances of 10 and 20 %

In both cases, it is clear that breakpoints p 3, p 4, p 5, and p 6 align with q 2, q 3, q 4, and q 5, respectively. Once again, with only one external input (tolerance ε1), even when the two time series differ in scale and translation, the algorithm automatically suggested subsequence-to-subsequence matching, and correctly identified likely correspondence between the primary breakpoints of T 1 and T 2. The pairs of corresponding candidate matching segments suggested by Align_Primary_Breakpoints for further matching are (p 3 p 4, q 2 q 3), (p 4 p 5, q 3 q 4), (p 5 p 6, q 4 q 5).

3.3 Matching of Segments Using the HPLA Representations

The alignment algorithm gives a list of pairs of segments to be matched to deter-mine the similarity between the two given time series. It does not consider the shape of segments in the decision making process. In order to ascertain that the suggested pairs of segments indeed match, further processing is necessary. Depending on the application, one of the following two approaches may be taken.

  1. 1.

    Euclidean distance or DTW may be used to compute the similarity between the corresponding segments identified by the alignment algorithm. This method is suitable if the application requires the comparison of a specific time series with many time series in a dataset. As each time series in the dataset participates only in one comparison, there is no incentive to develop compact representations of segments for the purpose of matching.

  2. 2.

    The Hierarchical Piecewise Linear Approximation (HPLA), which supports coarse-fine matching of time series, may be used for the determination of similarity between two segments. In HPLA, any segment between adjacent primary breakpoints is called a primary segment. Each primary segment is partitioned recursively into two segments at the optimal point (secondary breakpoint) until the desired level of representation accuracy is achieved. A binary tree is used for recording the segmentation hierarchy of each primary segment (subsequence). The structure of the binary tree is simple. Each non-leaf node represents a subsequence T[i:j], its left child represents T[i:p], and right child represents T[p:j], where p is the optimal break point that splits T[i:j] into two sub-segments. Each non-leaf node includes a feature vector, components of which relate attributes of the two sub-segments represented by its two child nodes. The components of the feature vector are (pi)/(jp) and (T[p]−T[i])/(T[j]−T[p]). The root node represents the subsequence by two line segments. The two non-leaf nodes in level-1 represent the subsequence by 4 line segments, and so on. The representation accuracy increases with the increasing level. Each leaf node specifies the starting point of the segment represented by the node. The HPLA representation of a time series is the time ordered sequence of binary trees of its primary segments.

The two segments in each suggested pair of candidate segments are matched by matching their binary trees. Two primary segments are considered similar if the feature vectors of the corresponding nodes of their binary trees are similar. The time series T 1 and T 2 are considered similar, if a sufficiently long continuous sequence of binary trees in the HPLA representation of T 1 matches a sequence of binary trees in the HPLA representation of T 2.

Therefore, for the example being considered, the six primary segments p3p4, p4p5, and p5p6 of T1, and their corresponding segments q2q3, q3q4, and q4q5 of T2 are further segmented to obtain their HPLA representations. As the components of the feature vectors of the corresponding non-leaf nodes of the corresponding segments are within 15 % of each other, the segments in all three suggested pairs are taken as similar. Therefore, T1 and T2 are taken as similar time series.

4 Experimental Results

Two datasets from UCR (Mallat and OliveOil) [20] and Pseudo Periodic Synthetic Time Series from UC Irvine archive (http://kdd.ics.uci.edu) are used as base time series to create a search set and a query set of time series. The three base time series are shown in Fig. 6. From each base time series, several translated, time scaled, amplitude scaled, and amplitude shifted versions are created. Some of these are corrupted with random Gaussian noise. Finally, the created time series are partitioned into two sets, a search set of 212 time series and query set of 40 time series. The search set includes 84 time series created from Mallat, 44 time series created from OliveOil, and 84 time series created from Pseudo Synthetic data. The query set includes 16 time series created from Mallat, 8 time series created from OliveOil, and 16 time series created from Pseudo Synthetic data. The data is carefully chosen to include several cases of sequence-to-sequence, sequence-to-subsequence, and sub-sequence-to-subsequence matching scenarios.

Fig. 6
figure 6

The three time series used for the creation of the search and query sets. a Mallat time series of length 1,024. b OliveOil time series of length 570. c Pseudo Synthetic time series of length 3,313

4.1 Matching Approach and Simulation Results

Each query time series Q in {Q 1, Q 2, …, Q 40}, is matched with all 212 time series in the search set S = {T 1, T 2, …, T 212}, and time series similar to the query time series are identified. Ideally, the matching algorithm should identify all time series in the search set that are created from the same base time series as the query time series, and others should not be selected. The details of the two stage matching approach used are given below.

  1. 1.

    As the 2 × 1 feature vectors used in the HPLA representation are invariant to amplitude scale and amplitude shift, the time series are not normalized. In fact, normalization is meaningful only in the case of whole sequence matching for representations that are not invariant to amplitude shift and amplitude scale. As the mean and standard deviation of a time series and its sub-series are usually not equal, the normalization is more likely to hurt the matching process than help when the similarity is based on sequence-to-subsequence or subsequence-to-subsequence matching. The optional smoothing step is also not used.

  2. 2.

    For each time series T in S, primary breakpoints are identified using the algorithm Determine_Primary_Breakpoints.

  3. 3.

    Each primary segment is partitioned recursively at the optimal point to obtain its binary tree representation [19]. For the sake of uniformity, each primary segment is represented by a complete binary tree with 8 leaf nodes (4 levels). The feature vector of each non-leaf node is computed.

Each query time series Q in the query set is matched with each T in S in two stages as described below.

  • Stage 1: Selection of candidate time series from the search set

The goal is to select a candidate subset of S for further matching in Stage 2. Ideally, the candidate set should include all time series in S that are similar to Q and none of the time series that are not similar to Q. In reality, the set will include a few time series not similar to Q (false positives) and not include a few time series similar to Q (false negatives).

In order to determine the candidate set, the primary breakpoints and the HPLA representation of Q are determined first. Algorithm Align_TimeSeries is used to obtain the list of pairs of potential matching primary breakpoints, which suggest pairs of candidate segments from Q and T for further matching. The value of the parameter ε1 used for comparing the corresponding elements of the relational arrays of T and Q is set at 0.15. If more than 50 % of the primary breakpoints of Q align with the primary breakpoints of T in the same time order, T is selected for further matching in Stage 2. Otherwise, T is not a candidate for further matching.

  • Stage 2: Filtering the false positives

The goal is to filter as many false positives as possible by matching the HPLA representations of Q and each T. The two segments (one of Q and one of T) in each suggested pair are matched by comparing their HPLA representations to determine if they are similar. The matching of the two binary trees begins with the matching of the feature vectors stored in their root nodes. If the two feature vectors are not similar, the two segments are considered as dissimilar. If the two feature vectors are similar, the matching is continued using non-leaf nodes in the next level. This process terminates when all corresponding non-leaf nodes are exhausted. The user may limit matching to root nodes, or consider the non-leaf nodes in other levels, depending on the level of accuracy desired. The time series T is considered similar to Q, if 75 % of the suggested segment pairs of T and Q are found similar. The simulation results, evaluated and discussed in the next section are tabulated in Table 1. Because of space constraints, results for 12 out of 40 query time series are shown.

Table 1 Simulation results of the HPLA representation based time series matching

4.2 Evaluation of Simulation Results

The two metrics, recall and precision, commonly used for evaluating the performance of database retrieval methods are used for the evaluation of the performance of the HPLA based time series matching approach. Recall is defined as the ratio of the number of truly matching time series retrieved to the total number of matching time series in the search set. The value of recall is in the range [0, 1], where higher value indicates better performance. A recall value of 1 indicates that all matching time series in the database are retrieved without any false negatives. Precision is defined as the ratio of the truly matching time series retrieved to the total number of time series retrieved from the search set. The value of precision is also in the range [0, 1]. A value of 1 indicates that there are no false positives.

Pruning power is another frequently used metric which specifies the number of time series considered for matching. For the current situation, as an HPLA based indexing method is not developed, the number of time series ruled as non-matching by Align_TimeSeries may be used as a measure of the pruning power. In this paper, pruning power is defined as the ratio of the number of time series selected for matching by Align_TimeSeries minus the number of time series similar to Q in S to the total number of time series in S minus the number of time series similar to Q in S.

The simulation results in Table 1 (shown only for 12 query time series), which lists number of false positives, number of false negatives, recall, precision, and pruning power are very impressive. The important observations about the HPLA based matching approach, and results are discussed below.

  1. 1.

    Each time series in the query set is similar to a subset of time series in the search set. The similarity may be based on sequence-to-sequence, sequence-to-subsequence or subsequence-to-subsequence matching. The matching scenario that establishes similarity between query and search time series is not known in advance. For example, consider Q 2 of length 512, which is created from Mallat of length 1,024 by adding 1.85 to each time scaled (factor of 2) value. In the search set there are 23 time series similar to Q 2 based on whole sequence matching, 31 time series similar to Q 2 based on sequence-to-subsequence matching, and 30 time series similar to Q 2 based on sub-sequence-to-subsequence matching. A total of 84 time series in S are similar to Q 2, and the remaining 128 time series in S are not similar. In Stage 1, the algorithm Align_TimeSeries has identified the correct matching scenario for Q 2, and for all other 39 query sequences.

  2. 2.

    The analysis of relative positions of primary breakpoints is able to prune most of the non-matching time series in S from further consideration. For example, in the worst case, 101 candidates are selected by Align_TimeSeries for further matching with Q 2 in Stage 2. As there are 84 time series in S that are similar to Q 2, even in the worst case, only 17 additional time series are selected for matching in Stage 2. On the average, for the Mallat query set, only 90 time series are selected as candidates for matching in Stage 2, giving an average pruning power of 0.0469 ((90 − 84)/(212 − 84)). The average pruning powers are also given for OliveOil and Pseudo Synthetic query sets in Table 2.

    Table 2 The average pruning power, recall and precision for Mallat, OliveOil, and Pseudo Synthetic query sets
  3. 3.

    The segment by segment matching of the HPLA representations of Q and T has filtered most of the false positives selected in Stage 1. For example, 101 candidates selected by Align_TimeSeries for further matching in Stage 2 include all 84 time series that are similar to Q 2, and 17 time series that are not similar to Q 2 (false positives). The HPLA based matching in Stage 2 filters all the false positives, and retains all 84 time series that are similar to Q 2. The ability of the method in eliminating or significantly reducing the number of false positives is consistently good for all 40 cases.

  4. 4.

    The effectiveness of the HPLA based time series matching is obvious from high recall and precision values, which are close to the ideal value of 1. The low values of pruning power (close to the ideal value of 0) indicate the potential for building an efficient HPLA based indexing approach. The average recall and precision for the three groups of query sequences (Mallat, OliveOil, and Pseudo Synthetic) are given in Table 2.

  5. 5.

    The approach requires the specification of only two tolerance parameters ε 1 and ε 2. The value of ε 1 directly affects the number of false negatives and false positives in Stage 1. As ε 1 increases, the number of false positives increases and the number of false negatives decreases. In order to avoid the risk of losing matching time series in Stage 1, the value of ε 1 should be relatively high. Similarly, the value of ε 2 also affects the number of false positives and false negatives in Stage 2. As the value of ε 2 increases, the number of false positives increases and the number of false negatives decreases.

5 Conclusion

In summary, the HPLA based time series approach described in this paper handles all three matching scenarios uniformly by identifying the appropriate scenario to be used through the analysis of the relative positions of the perceptually important primary breakpoints in query and search sequences. There is no need for the user to specify the type of matching scenario needed. The approach identifies the matching scenario automatically, and also prunes most of the non-matching time series in the search set from further consideration. The HPLA based matching algorithm filters most of the false positive, and achieves high precision and recall. The approach is invariant to time and amplitude translation and scale differences between the two time series matched, and requires the specification of two simple tolerance parameters as external input.