
1 Introduction

The use of dissimilarity measures between time series is critical in several data analysis tasks which range from simple querying to classification, clustering and anomaly detection. The role of dissimilarity measures in these contexts has been acknowledged by several works, e.g. [1,2,3].

Recently, in [4], we proposed a new dissimilarity measure, COMB, a convex combination of four (normalized) distance measures which offer complementary perspectives on the differences between two time series: the Euclidean distance, which captures differences in scale; a Pearson-correlation-based measure, which takes into account linear increasing and decreasing trends over time; a Periodogram-based measure, which expresses the dissimilarities between the frequencies or cyclical components of the series; and a distance between estimated autocorrelation structures, which compares the series in terms of their dependence on past observations.

COMB achieved good results when clustering electricity market price time series in European regions and also when clustering electricity load time series (Portuguese Transmission System Operator data)—[4, 5].

In this work, we conduct an experimental analysis to evaluate the comparative performance of the proposed combined distance measure.

The remainder of the paper is structured as follows: first, we present the Methodology used to compare COMB with alternative distance measures; then, the Data Analysis and Results section presents the comparative analysis; finally, we end with a Discussion and Future Research section.

2 Methodology

2.1 UCR Repository

We resort to the University of California Riverside (UCR) Time-Series Archive where we can find time series of diverse lengths and numbers of target classes, with corresponding test and train sets—[6]. The UCR time-series datasets are from a wide variety of application domains and have been used to study the comparative performance of time-series classifiers—e.g. [2]—and specifically used in comparative studies of dissimilarity measures between time series, e.g. [7].

We limit the datasets considered to 57, taking computational cost into account, a criterion that has been invoked in similar studies—e.g. [8]. In our study we found that, for example, the script routine took 18 h 45 min to run on the “GesturePebbleZ2” UCR dataset (using a PC with an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz and 32 GB of RAM). Nevertheless, we tried to include dissimilar datasets, namely regarding the number of target classes: 28 datasets have 2 target classes while 29 have more than 2. The selected datasets are presented in the Appendix. As in previous studies—e.g. [7]—and although this can limit the analysis, z-standardization is adopted for fairness, since many of the UCR series are presented in their z-standardized form.
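For reference, z-standardization rescales each series to zero mean and unit standard deviation. A minimal sketch in Python (an illustrative transform, not the authors' code; the study itself uses R):

```python
import numpy as np

def z_standardize(x):
    """Rescale a series to zero mean and unit standard deviation."""
    x = np.asarray(x, float)
    return (x - x.mean()) / x.std()

z = z_standardize([2.0, 4.0, 6.0])  # mean 0, standard deviation 1
```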

2.2 Using the 1NN Classifier

We follow a methodology suggested in previous studies comparing several dissimilarity measures and their variants—e.g. [7]: we use a one-nearest-neighbour (1NN) classifier on labelled data to evaluate the performance of the distance measures. In fact, since the distance measure used is critical to 1NN accuracy, this indicator directly reflects the effectiveness of the dissimilarity measure used. According to [7], p. 1890, 1NN classifiers are suitable methods for distance measure evaluation for several reasons:

  1. they resemble the problem solved in time-series similarity search;

  2. they are parameter-free and easy to implement;

  3. they are dependent on the choice of distance measure;

  4. they provide an easy-to-interpret (classification) accuracy measure, capturing whether the query and its nearest neighbour belong to the same class.
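The 1NN evaluation scheme can be sketched as follows (a minimal Python illustration; the study itself uses R, and all names here are ours):

```python
import numpy as np

def one_nn_predict(train_X, train_y, test_X, dist):
    """Label each test series with the class of its nearest training series."""
    preds = []
    for q in test_X:
        d = [dist(q, x) for x in train_X]  # dissimilarity to every training series
        preds.append(train_y[int(np.argmin(d))])
    return preds

# Toy example using the Euclidean distance as the dissimilarity measure
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
train_X = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
train_y = ["low", "high"]
print(one_nn_predict(train_X, train_y, [[0.2, 0.1, 0.0], [4.8, 5.1, 5.0]], euclid))
# → ['low', 'high']
```

Plugging a different `dist` function into the same loop is all that is needed to compare dissimilarity measures under identical conditions.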

2.3 Dissimilarity Measures

We compare COMB [4] with three baseline dissimilarity measures between time series: the Euclidean distance, DTW (Dynamic Time-Warping, with a Sakoe-Chiba band [9] of 20% of the time-series length) and the Complexity Invariance Distance (CID).

COMB Distance. Considering two time series \(x_{t}\) and \(y_{t}\), \(\left( t=1, \ldots ,T\right) \), the COMB distance is a convex combination of four distances: the Euclidean distance \((d_{Eucl})\), a Pearson-correlation-based measure \((d_{Pearson})\), a Periodogram-based measure \((d_{Period})\) and an autocorrelation-based measure \((d_{Autocorr})\).

The Euclidean distance, \(d_{Eucl}\), combines the pointwise differences between the pairs \(\left( x_t,y_t\right) \), capturing the differences in scale:

$$\begin{aligned} d_{Eucl}={\left( \sum _{t=1}^{T}{\left( x_t-y_t\right) ^{2}}\right) }^\frac{1}{2}. \end{aligned}$$
(1)

The Pearson-correlation-based measure takes into account linear increasing and decreasing trends over time. The following measure was suggested by [10]:

$$\begin{aligned} d_{Pearson}=\sqrt{\frac{1-r_{x_{t},y_{t}}}{2}}, \end{aligned}$$
(2)

where \(r_{x_{t},y_{t}}\) represents the Pearson correlation.

The Periodogram-based measure [11] considers the Euclidean distance between the Periodograms, which express the contribution of the various frequencies or cyclical components to the variability of the series,

$$\begin{aligned} d_{Period}=\left( \sum _{j=1}^{\left[ \frac{T}{2}\right] }\left( P_x\left( w_j\right) -P_y\left( w_j\right) \right) ^2\right) ^\frac{1}{2}, \end{aligned}$$
(3)

where \(P_x\left( w_j\right) \) is the Periodogram of time series \(x_t\) at frequencies \(w_{j}=2\pi j/T\), \(j=1, \ldots ,[T/2]\), in the range 0 to \(\pi \), with \([T/2]\) the largest integer less than or equal to \(T/2\),

$$\begin{aligned} P_x\left( w_j\right) =\left( \frac{1}{T}\right) \left| \sum _{t=1}^{T}{x_te^{-itw_j}}\right| ^2. \end{aligned}$$
(4)

The autocorrelation-based distance [12] calculates the Euclidean distance between the estimated autocorrelation structures, comparing the series in terms of their dependence on past observations,

$$\begin{aligned} d_{Autocorr}=\left( \sum _{l=1}^{L}\left( r_l\left( x_t\right) -r_{l}\left( y_t\right) \right) ^2\right) ^\frac{1}{2}, \end{aligned}$$
(5)

where \(r_l\left( x_t\right) \) and \(r_{l}\left( y_t\right) \) represent the estimated autocorrelations of lag l of \(\left( x_t\right) \) and \(\left( y_t\right) \), respectively.

In this study, we specifically use a uniform convex combination, with all four weights equal to 1/4.
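The four components and their uniform combination can be sketched in Python as follows. This is an illustrative sketch, not the authors' R implementation: the normalization of the component distances used in [4] is omitted, and the autocorrelation lag bound `max_lag` is an arbitrary choice of ours.

```python
import numpy as np

def d_eucl(x, y):                      # Eq. (1): differences in scale
    return float(np.linalg.norm(x - y))

def d_pearson(x, y):                   # Eq. (2): linear trend (dis)agreement
    r = np.corrcoef(x, y)[0, 1]
    return float(np.sqrt(max((1.0 - r) / 2.0, 0.0)))  # clamp guards rounding

def periodogram(x):                    # Eq. (4), frequencies w_j, j = 1..[T/2]
    T = len(x)
    return (np.abs(np.fft.fft(x)[1:T // 2 + 1]) ** 2) / T

def d_period(x, y):                    # Eq. (3)
    return float(np.linalg.norm(periodogram(x) - periodogram(y)))

def acf(x, max_lag):                   # estimated autocorrelations r_1..r_L
    c = x - x.mean()
    return np.array([np.dot(c[:-l], c[l:]) / np.dot(c, c)
                     for l in range(1, max_lag + 1)])

def d_autocorr(x, y, max_lag):         # Eq. (5)
    return float(np.linalg.norm(acf(x, max_lag) - acf(y, max_lag)))

def comb(x, y, weights=(0.25, 0.25, 0.25, 0.25), max_lag=3):
    """Convex combination of the four component distances (uniform by default)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    parts = (d_eucl(x, y), d_pearson(x, y),
             d_period(x, y), d_autocorr(x, y, max_lag))
    return float(np.dot(weights, parts))
```

Changing the `weights` tuple reproduces the kind of sensitivity analysis discussed later for the Herring dataset.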

Eucl—Euclidean Distance. The comparison with the Euclidean distance is unavoidable in all studies of this type: despite its simplicity, this distance can obtain surprisingly good results, especially if the training set/database is relatively large, [13], p. 281.

DTW—Dynamic Time-Warping. DTW is an elastic measure that computes the optimal alignment between two time series to minimize the sum of distances between aligned elements.

Considering two time series \(x_{t}\) and \(y_{t}\), \(\left( t=1, \ldots ,T\right) \), let M be the \(T\times T\) matrix where each element is a dissimilarity \(d_{i,j}\) (commonly the Euclidean distance is considered) between any pair of elements \(x_i\) and \(y_j\) \(\left( i,j=1, \ldots ,T\right) \).

A warping path \(P=\left( \left( i_1,j_1\right) , \left( i_2,j_2\right) , \ldots , \left( i_K,j_K\right) \right) \) is a sequence of indexes of M defining a mapping from each element of one time series to one, more than one, or even none, of the elements of the other time series. A valid path must satisfy several conditions; for example, \(i_{k+1} \ge i_{k}\) ensures the path does not go back in time. For other step pattern constraints see, e.g. [14]. For each path P through M, the total sum of the distances along it is

$$\begin{aligned} D\left( P\right) =\sum _{k=1}^{K} d_{i_k,j_k}. \end{aligned}$$
(6)

For example, the Euclidean distance corresponds to the total distance along the diagonal of M. The goal of the DTW measure is to find the path \(P^*\) that minimizes the total distance \(D\left( P\right) \):

$$\begin{aligned} P^*=\arg \min _{P} D\left( P\right) . \end{aligned}$$
(7)

To improve the efficiency of the procedure, it is a common practice to limit the time distortion (e.g. considering 20% of the time-series length). For example, the Sakoe-Chiba band [9] limits the warping path to a band of size \(T_0\) directly above and to the right of the diagonal of the matrix M, by enforcing the constraint \(|i_k-j_k|<T_0\).
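The dynamic-programming recursion with a Sakoe-Chiba band can be sketched as follows (an illustrative Python version; for scalar series we use the absolute difference as the local cost \(d_{i,j}\), which coincides with the pointwise Euclidean distance):

```python
import numpy as np

def dtw(x, y, band=None):
    """DTW distance; band enforces the Sakoe-Chiba constraint |i - j| < band."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)  # D[i, j]: best cost aligning x[:i] with y[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) >= band:
                continue  # cells outside the band stay infeasible
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return float(D[n, m])
```

Warping lets repeated values be absorbed at no cost: `dtw([1, 2, 3], [1, 2, 2, 3])` returns 0, whereas the Euclidean distance is not even defined for series of unequal length.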

CID—Complexity Invariance Distance. The CID measure was proposed by [15]. A time series’ complexity is estimated by stretching the series and measuring the length of the resulting line:

$$\begin{aligned} CID\left( x_t,y_t\right) =d_{Eucl}\cdot CF\left( x_t,y_t\right) , \end{aligned}$$
(8)

where

$$\begin{aligned} CF\left( x_t,y_t\right) =\frac{\max \left( CE\left( x_t\right) ,CE\left( y_t\right) \right) }{\min \left( CE\left( x_t\right) ,CE\left( y_t\right) \right) } \end{aligned}$$
(9)

is the Complexity Factor, and

$$\begin{aligned} CE\left( x_t\right) =\left( \sum _{t=1}^{T-1}{\left( x_t-x_{t+1}\right) ^2}\right) ^\frac{1}{2} \end{aligned}$$
(10)

is the Complexity Estimate of time series \(x_t\).
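Equations (8)–(10) translate directly into code. A minimal Python sketch (assuming neither series is constant, so both complexity estimates are nonzero):

```python
import numpy as np

def complexity_estimate(x):
    """CE, Eq. (10): length of the 'stretched-out' series."""
    return float(np.sqrt(np.sum(np.diff(x) ** 2)))

def cid(x, y):
    """CID, Eq. (8): Euclidean distance scaled by the complexity factor CF, Eq. (9)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    ce_x, ce_y = complexity_estimate(x), complexity_estimate(y)
    cf = max(ce_x, ce_y) / min(ce_x, ce_y)  # >= 1; penalizes complexity mismatch
    return float(np.linalg.norm(x - y)) * cf
```

Since \(CF \ge 1\), CID never decreases the Euclidean distance; it only inflates it when the two series differ in complexity.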

We resort to the R package “TSclust” [12], where the four distances that compose the COMB distance and the CID are implemented; for DTW we use the “dtw” package [14].

2.4 Evaluating the Classification Results

The performance of the 1NN classifiers is evaluated on the test sets of the UCR time series considered. For unbalanced sets, [6] suggests the balanced accuracy measure (the average of sensitivity and specificity). We propose using the Huberty index (HI)—e.g. [16]—as a measure of classification performance:

$$\begin{aligned} HI={\frac{p^c-p^{def}}{1-p^{def}}}, \end{aligned}$$
(11)

where \(p^c\) and \(p^{def}\) are the proportion of observations correctly classified and the proportion of observations in the modal class, respectively. This measure is clearly useful for evaluating performance on unbalanced datasets. Furthermore, it provides a fair and interpretable view of the success of a classification task, which plain accuracy can overestimate on strongly unbalanced sets: e.g. a 90% accuracy result when a target class includes 95% of the observations yields a negative Huberty index (one would do better by simply allocating all observations to the modal class). In addition, computational time is also taken into account when evaluating the 1NN results for the four dissimilarity measures considered.
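The index, and the 90%/95% example above, can be sketched as follows (illustrative Python; the function name is ours):

```python
def huberty_index(y_true, y_pred):
    """HI = (p_c - p_def) / (1 - p_def): accuracy gain over always
    predicting the modal (most frequent) class."""
    n = len(y_true)
    p_c = sum(t == p for t, p in zip(y_true, y_pred)) / n    # proportion correct
    p_def = max(y_true.count(c) for c in set(y_true)) / n    # modal-class proportion
    return (p_c - p_def) / (1.0 - p_def)

# 95% of observations in the modal class, 90% accuracy -> HI = -1
y_true = ["a"] * 19 + ["b"]
y_pred = ["b", "b"] + ["a"] * 17 + ["b"]  # 18 of 20 correct
print(huberty_index(y_true, y_pred))  # negative: worse than the modal baseline
```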

After the evaluation of aggregated results, comparisons on specific datasets are considered to gain insights into the dissimilarities’ performance. In a “closer look to specific problems”, [2] selects some time series from each target class, trying to capture the main differences between these classes on specific datasets. We propose instead using the medoids of each class, as defined by each dissimilarity measure, to obtain those insights: a class medoid is the observation that minimizes the sum of distances to all elements in the same class—[17].
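Medoid selection works with any dissimilarity function. A minimal Python sketch (names are ours):

```python
import numpy as np

def class_medoid(series_list, dist):
    """Index of the series minimizing the sum of distances to all series in the class."""
    totals = [sum(dist(s, t) for t in series_list) for s in series_list]
    return int(np.argmin(totals))

# Toy class: the middle series is the most central one
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
print(class_medoid([[0, 0], [1, 1], [10, 10]], euclid))  # → 1
```

Because the medoid depends on the `dist` argument, each dissimilarity measure can nominate a different representative for the same class, which is precisely what the per-dataset comparisons exploit.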

3 Data Analysis and Results

3.1 General Comparisons

A brief exploratory data analysis leads to the conclusion that, in the datasets considered, DTW generally provides better classification results than the alternative distances, followed by COMB—Table 1. COMB comparative results are illustrated in Fig. 1. However, for time series with two target classes (K=2) only, COMB provides slightly better results—see Table 2. Regarding computation time, DTW clearly provides the worst results—see Table 5.

Table 1 All time-series results
Fig. 1
figure 1

Plot of Huberty index results: COMB versus Euclidean, DTW and CID

Table 2 Huberty index results: time series with two target classes versus more than two target classes
Table 3 Friedman test’s results regarding Huberty index
Table 4 Dunn’s pairwise comparison tests regarding Huberty index for data with more than two classes (“Adj. Sig” are p-values adjusted by Bonferroni correction)

According to the Friedman test’s results, there are no significant differences between the distributions of HI for the four dissimilarity measures (see Table 3). However, significant differences are found when analysing data with more than two classes (K>2); after Dunn’s pairwise comparison test, they can be attributed to the significant difference between HI.Eucl and HI.DTW (see Table 4).

The differences between computation times regarding the four dissimilarity measures are all significant according to Friedman’s test—see Table 5.

Table 5 Friedman test’s results regarding computation time

3.2 COMB “Wins” and “Loses” Examples

In an attempt to understand the data conditions that could (un)favour COMB, we looked for insights from a “COMB wins” example and a “COMB loses” example: the ToeSegmentation2 and Herring time series, respectively. ToeSegmentation2 originates from the CMU Graphics Lab Motion Capture Data, referring to right-toe movements, with target classes “Walk Normally” and “Walk Abnormally”. The Herring data refer to calcium carbonate structures from two classes of Herring: North Sea or Thames. In Table 6, we present the details of these two datasets.

Table 6 COMB “wins” and “loses” datasets

On the assumption that exploring the target classes in the test set could bring some insight into the performance of the 1NN classifier, we obtained the medoids of the target classes according to each of the four dissimilarity measures. The ToeSegmentation2 test set classes’ medoids are depicted in Fig. 2. The COMB measure reveals not only scale differences between the medoids (as the Euclidean distance does, with the poorest results) but also, for example, how the medoids’ tendencies diverge from each other; combined with the additional differences COMB captures, this results in its best performance according to the HI.

Fig. 2
figure 2

Medoids of ToeSegmentation2 test set classes, according to dissimilarity measures Eucl, DTW, CID and COMB

The Herring test set classes’ medoids coincide for all dissimilarity measures except DTW which presents slightly different medoids. Nevertheless, a negative HI was obtained for all measures (revisit Table 6).

In an attempt to explore the potential of the COMB measure in a worst-case scenario, we performed a brief sensitivity analysis manipulating COMB’s weights. After some trials, setting the weights of \(d_{Period}\) and \(d_{Autocorr}\) to nine times those of \(d_{Eucl}\) and \(d_{Pearson}\), we managed to cross the “waterline”, obtaining a slightly positive HI, which the alternative measures could not. Note, however, that a customized parametrization of DTW could eventually obtain better results as well, but we believe it would also bring a relevant increase in computation time.

4 Discussion and Future Research

We conducted experiments on 57 time-series datasets from diverse application domains to compare the proposed dissimilarity measure, COMB, with three baseline measures: Euclidean, Dynamic Time-Warping and Complexity Invariance Distance. We resorted to the 1-Nearest Neighbour classifier, using the four dissimilarities, to compare their effectiveness. The Huberty index was used as the classification metric, providing more informative results than the simple accuracy measure adopted in previous studies (which ignores prevalence). The experimental results indicate that there are no significant differences between the classification performance (Huberty index) of the four dissimilarity measures. Nevertheless, it appears COMB can produce better results on time series with two target classes. Furthermore, there is potential to improve the results obtained with COMB by changing the weights of the convex combination: an example was provided for the Herring dataset, where COMB with uniform weights provided the worst classification results while COMB with tuned weights provided the best. Regarding computation time, Dynamic Time-Warping, which appears to be the most direct competitor of COMB in classification performance, presented the (significantly) worst results. Considering the joint classification performance and runtime results, we conclude that the proposed combined measure can be seen as competitive in several settings.

In future research, we aim to extend the present analysis to all 128 UCR datasets, which will require exploring hardware-aware implementations and/or algorithmic solutions to make the measures’ implementations as efficient as possible. We also think the Complexity Invariance Distance, which proved to be a competitive measure, should definitely play a role in future similar studies (along with the unavoidable Euclidean and Dynamic Time-Warping dissimilarities and possibly other baseline measures). An investigation of the process to determine the COMB weights should also be considered. Finally, the experimental design should include additional characteristics of the time-series data besides the number of target classes; namely, we think a measure of separation between classes should be considered.