
1 Introduction

Year after year, we see a remarkable increase of interest in both collecting and mining data. Typically, we differentiate time series problems from other data analysis tasks because the attributes are ordered and we may look for a discriminatory feature that depends on this ordering [4]. In the past 20 years, interest in the area of time series has soared and many tasks have been deeply investigated, such as classification [4], clustering [29], indexing [26], prediction [50], anomaly detection [51], motif discovery [34] and more. In our opinion, there is a problem that appears throughout almost all of these topics: how to compare two given time series in the most appropriate way?

The problem of pairwise similarity of time series rests on the underlying distance measures (which are not necessarily metrics or even dissimilarity measures). To the best of our knowledge, about 40 distance measures have already been proposed in the literature. Some of them are based on certain features of the data, while others use predictions, underlying models or some transformation. Such variety may be confusing and makes it hard to find the most appropriate measure, especially for application-oriented scientists. The available research includes only two papers providing a partial comparison of selected distance measures.

Wang et al. [48] provide an extensive comparison of 9 different similarity measures and 4 of their variants, carried out on 38 time series datasets from the UCR archive [13]. The authors conclude that they did not find any measure that is “universally better” on all datasets: a measure may outperform the rest on some datasets while being worse on others. However, dynamic time warping (DTW; [7]), slightly ahead of some edit-based measures (LCSS, EDR and ERP), seems to be superior to the others. This is in line with the widespread opinion that DTW is not always the best, but is in general hard to beat [45, 52]. On the other hand, the study points out that the Euclidean distance remains a quick and efficient way of measuring distances between time series; in particular, as the training set grows, the accuracy of elastic measures converges to that of the Euclidean distance.

Serrà et al. [44] compare 7 similarity measures on 45 datasets originating from the UCR archive. The authors suggest that, within the set of investigated distances, there is a group of measures with no statistically significant differences: DTW, EDR and MJC. Another finding is that the TWED measure seems to consistently outperform all the considered distances. The Euclidean distance is reported to perform statistically worse than TWED, DTW, EDR and MJC, and even its performance on large datasets was “not impressive”. Furthermore, an interesting remark is made about various post-processing steps that may increase classification accuracy: the complexity-invariant correction [5], the hubness correction for time series classification [42], and unsupervised clustering algorithms to prune nearest neighbor candidates [44]. For details see Serrà et al. [44].

Despite giving interesting results, both studies take into account only some distance measures, while nowadays, due to the very dynamic increase of interest in the time series area, about 40 measures are available. As the comparison is computationally expensive, in this paper we compare 30 of them, but we plan to extend our experiment in the near future. Our contribution is an extensive comparison supported by a deep statistical analysis. We would like to create a benchmark study that could be used not only by researchers from different application fields, but also by authors of new distance measures to assess their effectiveness. We give only basic descriptions of the similarity measures used, provided along with references, as our intention is not to develop distance measures themselves, but rather to compare their efficacy.

2 Distances’ Classification and Description

To the best of our knowledge, about 40 distance measures exist, so there is a strong need to classify them. Montero and Vilar [38] proposed grouping measures into four categories: model-free measures, model-based measures, complexity-based measures and prediction-based measures. Wang et al. [48] in their research named four groups of distance measures: lock-step measures, elastic measures, threshold-based measures and pattern-based measures. In our opinion, the most universal categorization, covering almost all distances, is the one proposed by Esling and Agon [18]: shape-based measures, edit-based measures, feature-based measures and structure-based measures. We follow this last classification. In this section, we list all 30 distance measures compared in this paper. We provide the most important formulas, assuming we are given two time series: \(\mathbf X _T = (x_1, x_2, \ldots , x_T)\), \(\mathbf Y _T = (y_1, y_2, \ldots , y_T)\).

2.1 Shape-Based Distance Measures

This group of distance measures compares the overall shape of the series, looking mostly at the raw values.

The basic measures here are derived directly from \(L_p\) norms and we call them \(L_p\) distances: the Manhattan distance, Minkowski distance, Euclidean distance and infinite norm distance. They are relatively simple to understand and compute, but they can compare only time series of equal length, sometimes perform poorly, and are highly influenced by outliers, noise, scaling or warping. For more information, we refer to Yi and Faloutsos [53], Antunes and Oliveira [2]. The basic formulas are given in Table 1.

Table 1 \(L_p\) distances, \(1< p < \infty \)
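For illustration, the four \(L_p\) distances can be computed directly from their definitions; the following NumPy sketch (a hypothetical helper, not tied to any particular library implementation) returns all of them at once:

```python
import numpy as np

def lp_distances(x, y, p=3):
    """L_p distances between two equal-length series x and y (p > 1 is arbitrary here)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    diff = np.abs(x - y)
    return {
        "manhattan": diff.sum(),                      # L_1
        "euclidean": np.sqrt((diff ** 2).sum()),      # L_2
        "minkowski": (diff ** p).sum() ** (1.0 / p),  # L_p, 1 < p < infinity
        "infinite_norm": diff.max(),                  # L_inf
    }
```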

Berndt and Clifford [7] proposed the Dynamic Time Warping (DTW) distance, which not only solves most problems known from the \(L_p\) distances but, due to its ability to deal with warping of the time axis, became one of the most popular measures for time series. In practice, we compute DTW using dynamic programming with the following recurrence:

$$\begin{aligned} \Gamma (i,j) = D(i,j) + \min \{\Gamma (i-1, j-1), \Gamma (i-1, j), \Gamma (i, j-1)\} \end{aligned}$$

with initial conditions:

$$\begin{aligned} \Gamma (0,0) = 0, \Gamma (0,i) = \infty , \Gamma (i,0) = \infty (i = 1,2, \ldots , n), \end{aligned}$$

where \(\Gamma \) is the cumulative distance matrix and \(D(i,j) = d(x_i, y_j)\) with \(d(x_i, y_j) = (x_i - y_j)^2\). The value of DTW at position \((n, n)\) of the matrix \(\Gamma \) is then calculated as \(DTW(x,y) = \sqrt{\Gamma (n,n)}\).
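A direct dynamic-programming sketch of this recurrence (without any lower bounding or warping-window constraint, and written for possibly unequal lengths) could look as follows:

```python
import numpy as np

def dtw(x, y):
    """Unconstrained DTW computed with the recurrence for the cumulative matrix Gamma."""
    n, m = len(x), len(y)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2              # d(x_i, y_j)
            gamma[i, j] = cost + min(gamma[i - 1, j - 1],  # match
                                     gamma[i - 1, j],      # insertion
                                     gamma[i, j - 1])      # deletion
    return np.sqrt(gamma[n, m])                            # sqrt of Gamma(n, n) for equal lengths
```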

Because of the long computation time of the basic DTW distance, several lower bounding and temporal constraint techniques have been proposed. In Sect. 4 we denote DTW with the Sakoe–Chiba band as “DTWc” and we use the window size as in Dau et al. [13]. For more details about DTW we refer to Bagnall et al. [4], Keogh and Ratanamahatana [30], Mori et al. [39]. We also examine two distance measures extending DTW with derivatives. Keogh and Pazzani [32] defined Derivative Dynamic Time Warping (DDTW), which is the DTW distance between the data transformed by the first (discrete) derivative. Górecki and Łuczak [22] proposed Parametric Derivative Dynamic Time Warping (\(\mathrm {DD}_\mathrm{{DTW}}\)) as a convex combination of the DTW and DDTW distances, which brought further performance improvements.
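As a rough sketch of the derivative transform behind DDTW, the estimate from Keogh and Pazzani [32] averages the slope to the previous point and the slope between the two neighbouring points; DTW is then applied to the transformed series. The combination weight `alpha` in the comment below is a hypothetical parameter standing in for the convex combination used by \(\mathrm {DD}_\mathrm{{DTW}}\):

```python
import numpy as np

def derivative_transform(x):
    """Discrete derivative estimate applied before DTW in DDTW: the average of the
    left slope and the slope between the two neighbouring points; the first and
    last values are copied from their neighbours."""
    x = np.asarray(x, dtype=float)
    d = np.empty_like(x)
    d[1:-1] = ((x[1:-1] - x[:-2]) + (x[2:] - x[:-2]) / 2.0) / 2.0
    d[0], d[-1] = d[1], d[-2]
    return d

# DDTW(x, y)   = dtw(derivative_transform(x), derivative_transform(y))
# DD_DTW(x, y) = alpha * dtw(x, y) + (1 - alpha) * DDTW(x, y), alpha in [0, 1],
#                with alpha tuned on the training data (convex combination as above)
```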

For irregularly spaced series, the authors of [37] proposed the Short Time Series (STS) distance, given by

$$\begin{aligned} d_\text {STS}(\mathbf X _T, \mathbf Y _T) = \sqrt{ \sum _{i=1}^{T-1} \left( \frac{y_{i+1}-y_i}{t^{'}_{i+1}-t^{'}_i} - \frac{x_{i+1}-x_i}{t_{i+1}-t_i} \right) ^2} , \end{aligned}$$

where t and \(t^{'}\) are the temporal indexes of series \(\mathbf X _T\) and \(\mathbf Y _T\) respectively. It is able to measure similarity of shapes formed by both the relative change of amplitude and the corresponding temporal information.

Another very important aspect of similarity measures is their tendency to place time series with a high complexity level further apart than simple ones [5]. In order to correct this distortion, Batista et al. [5] proposed a Complexity-Invariant dissimilarity measure (CID). The general formula is as follows

$$\begin{aligned} d_{\text {CID}}(\mathbf X _T, \mathbf Y _T) = \text {CF}(\mathbf X _T, \mathbf Y _T) \cdot d(\mathbf X _T, \mathbf Y _T), \end{aligned}$$

where \(d(\mathbf X _T, \mathbf Y _T)\) is the distance to be adjusted and \(\text {CF} (\mathbf X _T, \mathbf Y _T)\) is a complexity correction factor defined as

$$\begin{aligned} \text {CF}(\mathbf X _T, \mathbf Y _T) = \frac{\max \{\text {CE}(\mathbf X _T), \text {CE}(\mathbf Y _T)\}}{\min \{\text {CE}(\mathbf X _T), \text {CE}(\mathbf Y _T)\}}, \end{aligned}$$

where \(\text {CE} (\mathbf X _T)\) stands for a complexity estimator of \(\mathbf X _T\). We can now observe that when the complexities of both time series are equal, we get \(d_{\text {CID}} (\mathbf X _T, \mathbf Y _T) = d(\mathbf X _T, \mathbf Y _T)\), and that an increase of the complexity difference results in an increase of the distance between the time series. As a complexity estimator, Batista et al. [5] proposed \({\text {CE}}(\mathbf{X }_T) = \sqrt{\sum _{t=1}^{T-1} (X_t - X_{t+1})^2}\).
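A minimal sketch of CID under these definitions, using the Euclidean distance as the base distance \(d\) (any other distance could be plugged in; the code assumes neither series is constant, so that \(\text {CE} > 0\)):

```python
import numpy as np

def complexity_estimate(x):
    """CE(X) = sqrt(sum_t (x_t - x_{t+1})^2), the complexity estimator from [5]."""
    return np.sqrt(np.sum(np.diff(np.asarray(x, dtype=float)) ** 2))

def cid_distance(x, y, base_dist=None):
    """CID(X, Y) = CF(X, Y) * d(X, Y), with CF = max(CE) / min(CE)."""
    if base_dist is None:  # default base distance: Euclidean
        base_dist = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    ce_x, ce_y = complexity_estimate(x), complexity_estimate(y)
    cf = max(ce_x, ce_y) / min(ce_x, ce_y)   # complexity correction factor
    return cf * base_dist(x, y)
```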

2.2 Edit-Based Distance Measures

Edit-based distances were initially proposed to measure the similarity between two strings; they use the minimal number of edit operations (delete, insert, replace) necessary to transform one sequence into another.

As edit-based distances may be computed for time series of different length, in this section we will assume we are given two time series: \(\varvec{X_N} = (x_1, x_2, \ldots , x_N)\) and \(\varvec{Y_M} = (y_1, y_2, \ldots , y_M)\). For clarification and simplicity, in all other sections the notation is as mentioned in the introduction to Sect. 2.

The LCSS distance was proposed by Vlachos et al. [47] and measures the similarity between time series in terms of the longest common subsequence, with the addition that gaps and unmatched regions are permitted. LCSS is robust to noise and we expect it to be more accurate than DTW in the presence of outliers and noise. The measure has two constant parameters. The first one, \(\delta \), controls the size of the window for matching a given point from one series to a point in the other series. The second one, \(\epsilon \), is the matching threshold: two points are considered to match if their distance is less than \(\epsilon \). Given

$$\begin{aligned} L(i,j) = \left\{ \begin{array}{ll} 0 &{} \text {for } i=0 \text { or } j=0,\\ 1 + L(i-1, j-1) &{} \text {for } |x_i - y_j| < \epsilon \\ &{} \text {and } |i-j| \le \delta ,\\ \max \{L(i-1,j), L(i,j-1)\} &{} \text {in other cases}, \end{array} \right. \end{aligned}$$

we can compute [43]

$$\begin{aligned} LCSS(\varvec{X_N, Y_M}) = \frac{N+M-2L(N,M)}{N+M}. \end{aligned}$$
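A dynamic-programming sketch of the recursion \(L(i, j)\) and the resulting LCSS dissimilarity; the values of \(\epsilon \) and \(\delta \) below are arbitrary placeholders, not recommendations:

```python
import numpy as np

def lcss_distance(x, y, epsilon=0.5, delta=5):
    """LCSS dissimilarity (N + M - 2 L(N, M)) / (N + M) with matching threshold
    epsilon and temporal window delta."""
    n, m = len(x), len(y)
    L = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) < epsilon and abs(i - j) <= delta:
                L[i, j] = 1 + L[i - 1, j - 1]            # points match
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])  # skip a point
    return (n + m - 2 * L[n, m]) / (n + m)
```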

The Edit Distance on Real Sequence (EDR) is an adaptation of the edit distance that finds the minimal number of edit operations needed to convert one series into another [11]. Similarly to LCSS, EDR permits gaps and unmatched regions, but penalizes such occurrences with a value equal to their length. The computation of the EDR measure can be converted into an iteration using dynamic programming as follows

$$\begin{aligned} \text {EDR}(\varvec{X_N}, \varvec{Y_M}) = \left\{ \begin{array}{ll} N &{} \text {for } M = 0,\\ M &{} \text {for } N = 0,\\ \min \{\text {EDR}(\text {Rest}(\varvec{X_N}), \text {Rest}(\varvec{Y_M})) + d_{\text {edr}}(x_1, y_1), &{} \\ \quad \ \text {EDR}(\text {Rest}(\varvec{X_N}), \varvec{Y_M}) + 1, \text {EDR}(\varvec{X_N}, \text {Rest}(\varvec{Y_M})) + 1\} &{} \text {otherwise}, \end{array} \right. \end{aligned}$$

where \(\text {Rest}(\varvec{X_N}) = (x_2, x_3, \ldots , x_{N})\) and \(d_{\text {edr}}\) stands for the matching cost of two points of the series, computed according to the rule: if \(x_i\) and \(y_j\) are closer to each other in the absolute sense than \(\epsilon \), it is equal to 0; otherwise, it is equal to 1.

The third variation of the edit distance is the Edit Distance with Real Penalty (ERP) [10], which may be considered a combination of DTW and EDR. It uses the \(L_1\) distance between elements of the time series as the penalty for local shifting of the time series. Penalization is carried out by setting a constant g and adding the distance between each unmatched point and g. The ERP measure is given by

$$\begin{aligned} \text {ERP}(\varvec{X_N}, \varvec{Y_M}) = \left\{ \begin{array}{ll} \sum _{i=1}^{M} |y_i - g| &{} \text {for } N = 0,\\ \sum _{i=1}^{N} |x_i - g| &{} \text {for } M = 0,\\ \min \{\text {ERP}(\text {Rest}(\varvec{X_N}), \text {Rest}(\varvec{Y_M})) + |x_1 - y_1|, &{} \\ \quad \ \text {ERP}(\text {Rest}(\varvec{X_N}), \varvec{Y_M}) + |x_1 - g|, &{} \\ \quad \ \text {ERP}(\varvec{X_N}, \text {Rest}(\varvec{Y_M})) + |y_1 - g|\} &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
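The recursion can be evaluated iteratively; a sketch is given below. The gap value \(g = 0\) is an assumption (a common choice for z-normalized series), not a prescription:

```python
import numpy as np

def erp_distance(x, y, g=0.0):
    """ERP: edit distance with a real penalty; unmatched points are compared to
    the constant gap value g."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    D = np.zeros((n + 1, m + 1))
    D[1:, 0] = np.cumsum(np.abs(x - g))   # all of x left unmatched
    D[0, 1:] = np.cumsum(np.abs(y - g))   # all of y left unmatched
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j - 1] + abs(x[i - 1] - y[j - 1]),  # match
                          D[i - 1, j] + abs(x[i - 1] - g),             # gap in y
                          D[i, j - 1] + abs(y[j - 1] - g))             # gap in x
    return D[n, m]
```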

2.3 Feature-Based Distances

These distances look at some aspect of the time series by extracting certain features; a similarity measure is then calculated based on them.

Taking into account correlation in time series, we may define several measures. Golay et al. [21] defined a distance based on Pearson’s correlation coefficient as follows:

$$\begin{aligned} d_{\text {PC}}(\mathbf X _T, \mathbf Y _T) = 2(1 - PC), \end{aligned}$$

where PC denotes Pearson’s correlation coefficient.

Warren Liao [49] proposed to use the cross-correlation between two series and, based on it, formulated

$$\begin{aligned} d_{\text {CC}}(\mathbf X _T, \mathbf Y _T) = \sqrt{\frac{(1 - CC_0(X, Y))}{\sum _{k=1}^{k_{\max }}CC_k(X, Y)}} , \end{aligned}$$

where \(\text {CC}_k(X, Y)\) is the cross-correlation between the two series at lag k and \(k_{\max }\) is the maximum lag considered.

Let \( \varvec{ \hat{\rho }}_{X_{T}} = (\hat{\rho }_{1, X_T}, \ldots ,\hat{\rho }_{L, X_T})^T, \varvec{\hat{\rho }}_{Y_{T}} = (\hat{\rho }_{1, Y_T}, \ldots ,\hat{\rho }_{L, Y_T})^T\) be the estimated autocorrelation vectors of \(\mathbf X _T, \mathbf Y _T\) (respectively), for some L such that \(\hat{\rho }_{i, X_T}, \hat{\rho }_{i, Y_T} \approx 0\) for \(i>L\). Peña and Galeano [40] proposed the following distance:

$$\begin{aligned} d_{\text {ACF}}(\mathbf X _T, \mathbf Y _T) = \sqrt{(\varvec{\hat{\rho }}_{X_{T}} - \varvec{\hat{\rho }}_{Y_{T}})^{T} \varvec{\Omega }(\varvec{\hat{\rho }}_{X_{T}}-\varvec{\hat{\rho }}_{Y_{T}})} , \end{aligned}$$

where \(\varvec{\Omega }\) is a matrix of weights defining the importance of the correlations at different lags. Obviously, to emphasize a slightly different aspect of the data, it is possible to replace the autocorrelations by partial autocorrelations and obtain \(d_{\text {PACF}}\).
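A sketch of \(d_{\text {ACF}}\) under two simplifying assumptions: a fixed number of lags L (here 10, an arbitrary choice) and uniform weights, i.e. \(\varvec{\Omega }\) equal to the identity matrix:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelations rho_1, ..., rho_L of a series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, max_lag + 1)])

def acf_distance(x, y, max_lag=10, omega=None):
    """d_ACF: weighted Euclidean distance between autocorrelation vectors."""
    if omega is None:
        omega = np.eye(max_lag)            # uniform weights across lags
    diff = acf(x, max_lag) - acf(y, max_lag)
    return float(np.sqrt(diff @ omega @ diff))
```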

The first-order temporal correlation coefficient is defined by

$$\begin{aligned} \text {CORT}(\mathbf X _T, \mathbf Y _T) = \frac{\sum _{t=1}^{T-1} (X_{t+1} - X_t)(Y_{t+1} - Y_t)}{\sqrt{\sum _{t=1}^{T-1}(X_{t+1} - X_t)^2} \sqrt{\sum _{t=1}^{T-1}(Y_{t+1} - Y_t)^2}} . \end{aligned}$$

The CORT coefficient reflects the dynamic behavior of the series [38]. The related dissimilarity measure was proposed by Chouakria and Nagabhushan [12] and is defined as

$$\begin{aligned} d_{\text {CORT}}(\mathbf X _T, \mathbf Y _T) = \phi _k [\text {CORT}(\mathbf X _T, \mathbf Y _T)] \cdot d(\mathbf X _T, \mathbf Y _T), \end{aligned}$$

where \(\phi _k(\cdot )\) is an adaptive tuning function that automatically modulates a conventional data distance according to the temporal correlation. Chouakria and Nagabhushan proposed \(\phi _k(u) = \frac{2}{1 + \exp (ku)}, k \ge 0 \).
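A sketch combining both pieces, with the Euclidean distance as the conventional data distance and \(k = 2\) chosen arbitrarily for illustration:

```python
import numpy as np

def cort(x, y):
    """First-order temporal correlation coefficient CORT(X, Y)."""
    dx, dy = np.diff(np.asarray(x, float)), np.diff(np.asarray(y, float))
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

def cort_distance(x, y, k=2.0, base_dist=None):
    """d_CORT = phi_k(CORT(X, Y)) * d(X, Y) with phi_k(u) = 2 / (1 + exp(k * u))."""
    if base_dist is None:  # conventional data distance: Euclidean
        base_dist = lambda a, b: np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
    phi = 2.0 / (1.0 + np.exp(k * cort(x, y)))   # adaptive tuning function
    return phi * base_dist(x, y)
```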

Another aspect of time series may be revealed by the Discrete Fourier Transform. Based on it, we may compute the Euclidean distance \(d_{\text {FC}}\) between the first n Fourier coefficients [1]:

$$\begin{aligned} d_{\text {FC}}(\mathbf X _T, \mathbf Y _T) = \sqrt{\sum _{i=0}^n ((a_i - a_i^{'})^2 + (b_i - b_i^{'})^2)} , \end{aligned}$$

where \((a_i, b_i)\) and \((a_i^{'}, b_i^{'})\) are the Fourier coefficients of \(\mathbf X _T\) and \(\mathbf Y _T\), respectively.

There are at least several distances based on the frequency domain of the time series. Caiado et al. [9] proposed the Euclidean distance \(d_{\text {P}}\) between the periodogram coordinates as follows:

$$\begin{aligned} d_\text {P}(\mathbf X _T, \mathbf Y _T) = \frac{1}{n}\sqrt{\sum _{k=1}^n (I_{X_T}(\lambda _k) - I_{Y_T}(\lambda _k))^2}, \end{aligned}$$

where \(I_{X_T}(\lambda _k)\) and \(I_{Y_T}(\lambda _k)\) for \(k=1, \ldots , n\) are periodograms of \(\mathbf X _T\) and \(\mathbf Y _T\) (respectively).
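A sketch of \(d_{\text {P}}\) for two equal-length series; the periodogram normalization below (\(|\text {DFT}|^2 / T\) at the first \(\lfloor (T-1)/2 \rfloor \) Fourier frequencies) is one common convention and is an assumption of this sketch:

```python
import numpy as np

def periodogram(x):
    """Periodogram I(lambda_k) at the Fourier frequencies k = 1, ..., n."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    n = (T - 1) // 2
    dft = np.fft.fft(x)[1:n + 1]          # skip the zero frequency
    return (np.abs(dft) ** 2) / T

def periodogram_distance(x, y):
    """d_P = (1/n) * sqrt(sum_k (I_X(lambda_k) - I_Y(lambda_k))^2)."""
    ix, iy = periodogram(x), periodogram(y)
    return float(np.sqrt(np.sum((ix - iy) ** 2)) / len(ix))
```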

Alternatively, de Lucas [14] introduced a distance measure based on the integrated periodogram, arguing that, due to some of its properties, it presents several advantages over the previous one. The distance is defined as

$$\begin{aligned} d_{\text {IP}}(\mathbf X _T, \mathbf Y _T) = \int _{- \pi }^{\pi } |F_\mathbf{X _T}(\lambda ) - F_\mathbf{Y _T}(\lambda )| d \lambda , \end{aligned}$$

where \(F_\mathbf{X _T}(\lambda _j) = C_\mathbf{X _T}^{-1} \sum _{i=1}^{j}I_\mathbf{X _T}(\lambda _i)\) and \(F_\mathbf{Y _T}(\lambda _j) = C_\mathbf{Y _T}^{-1} \sum _{i=1}^{j}I_\mathbf{Y _T}(\lambda _i)\), with

\(C_\mathbf{X _T} = \sum _i I_\mathbf{X _T} (\lambda _i)\), \(C_\mathbf{Y _T} = \sum _i I_\mathbf{Y _T} (\lambda _i)\).

Kakizawa et al. [24] proposed a general spectral disparity measure between two time series as

$$\begin{aligned} d_\text {LLR}(\mathbf X _T, \mathbf Y _T) = \int _{- \pi }^{\pi }\tilde{W} \left( \frac{f_{X_{T}}(\lambda )}{f_{Y_{T}}(\lambda )} \right) d \lambda , \end{aligned}$$

where \(f_{X_T}\) and \(f_{Y_T}\) are spectral densities of \(\mathbf X _T\) and \(\mathbf Y _T\). \(\tilde{W} = W(x) + W(x^{-1})\), \(W(x) = \log ( \alpha x + (1 - \alpha )) - \alpha \log x\), with \(0< \alpha <1 \). \(W(\cdot )\) is a divergence function satisfying regular quasi-distance conditions for \(d_{LLR}\).

Alternatively, Díaz and Vilar [16] described the following two distances. The first one is defined as

$$\begin{aligned} \begin{aligned} d_\text {GLK}(\mathbf X _T, \mathbf Y _T) = \sum _{k=1}^{n} \left[ Z_k - \hat{\mu }(\lambda _k) - 2\log (1+e^{Z_k-\hat{\mu }(\lambda _k)}) \right] - \\ - \sum _{k=1}^{n} \left[ Z_k -2 \log (1+e^{Z_k}) \right] , \end{aligned} \end{aligned}$$

where \(Z_k = \log (I_{X_T}(\lambda _k)) - \log (I_{Y_T}(\lambda _k))\) and \(\hat{\mu }(\lambda _k)\) is the local maximum log-likelihood estimator of \(\mu (\lambda _k) = \log (f_{X_T}(\lambda _k)) - \log (f_{Y_T}(\lambda _k))\) computed by local linear fitting.

The second distance is given by

$$\begin{aligned} d_\text {ISD}(\mathbf X _T, \mathbf Y _T) = \int _{- \pi }^{\pi }(\hat{m}_{X_T}(\lambda ) - \hat{m}_{Y_T}(\lambda ))^2 d\lambda , \end{aligned}$$

where \(\hat{m}_{X_T}(\lambda )\) and \(\hat{m}_{Y_T}(\lambda )\) are local linear smoothers of the log-periodograms obtained with the maximum local likelihood criterion.

Moving on to another characteristic, Aßfalg et al. [3] proposed a distance measure \(d_{TQ}\) based on threshold queries, using a given parameter \(\tau \) as a threshold to transform a time series into the sequence of time stamps at which the threshold is crossed. Let us denote the time stamps for a certain threshold \(\tau \) as a sequence \((t_1, t_2, \ldots , t_n)\). For a time series \(\varvec{X_T}\) and a threshold \(\tau \) we define the interval set \(S(\varvec{X_T}, \tau ) = \{(t_1, t_2), (t_3, t_4), \ldots , (t_{n-1}, t_n) \}\). The distance between time series \(\varvec{X_T}\) and \(\varvec{Y_T}\), represented by the interval sets \(S(\varvec{X_T}, \tau )\) and \(S(\varvec{Y_T}, \tau )\), is given by

$$\begin{aligned} \text {TQuest}(\varvec{X_T, Y_T})&= \frac{1}{|S(\varvec{X_T}, \tau )|} \sum _{s \in S(\varvec{X_T}, \tau )} \min _{s' \in S(\varvec{Y_T}, \tau )} d(s, s') \,+ \\&\quad +\, \frac{1}{|S(\varvec{Y_T}, \tau )|} \sum _{s' \in S(\varvec{Y_T}, \tau )} \min _{s \in S(\varvec{X_T}, \tau )} d(s', s), \end{aligned}$$

where the distance between two intervals \(s=(s_l, s_u)\) and \(s'=(s'_l, s'_u)\) is computed as

$$\begin{aligned} d(s, s') = \sqrt{(s_l - s'_l)^2 + (s_u - s'_u)^2}. \end{aligned}$$

The TQuest measure is based on an interesting feature extraction idea, but, in our opinion, it is highly dependent on the user’s domain knowledge, as the \(\tau \) parameter must be set.

The symbolic aggregate approximation (SAX) was introduced by Lin et al. [33] and became one of the best symbolic representations for most time series problems [27]. The original data are first transformed into the piecewise aggregate approximation (PAA) representation [53] and then into a discrete string. For the full outline of the MINDIST dissimilarity measure based on the SAX representation, see Lin et al. [35].
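A minimal sketch of the PAA step and the SAX symbolization for a z-normalized series (the MINDIST lookup table itself is omitted); the word length w and alphabet size a below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def paa(x, w):
    """Piecewise aggregate approximation: means of w (almost) equal segments."""
    return np.array([seg.mean() for seg in np.array_split(np.asarray(x, float), w)])

def sax(x, w=8, a=4):
    """SAX word of a z-normalized series: PAA values mapped to a symbols using
    breakpoints that split the standard normal distribution into equal areas."""
    breakpoints = norm.ppf(np.arange(1, a) / a)      # a - 1 cut points
    return np.searchsorted(breakpoints, paa(x, w))   # integer symbols 0 .. a-1
```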

2.4 Structure-Based Distances

The last group of distance measures tries to find some higher-level structure and then compares time series on this basis. This category can be subdivided into two further groups: model-based measures, which fit a model and then compare its coefficients through a certain distance function, and compression-based measures, which work with compression ratios.

The first category is represented by the distance described by Piccolo [41] as the Euclidean distance between coefficients derived from AR representation of processes:

$$\begin{aligned} d_\text {{PIC}}(\mathbf X _T, \mathbf Y _T) = \sqrt{\sum _{j=1}^k(\hat{\pi }^{'}_{j, X_T} - \hat{\pi }^{'}_{j, Y_T})^2}, \end{aligned}$$

where the coefficient vectors of the \(AR(k_1)\) and \(AR(k_2)\) models fitted to \(\mathbf X _T\) and \(\mathbf Y _T\) are denoted by \(\hat{\varvec{\Pi }}_{X_T} = (\hat{\pi }_{1, X_T}, \ldots , \hat{\pi }_{k_1, X_T})\) and \(\hat{\varvec{\Pi }}_{Y_T} = (\hat{\pi }_{1, Y_T}, \ldots , \hat{\pi }_{k_2, Y_T})\), respectively, \(k = \max (k_1, k_2)\), \(\hat{\pi }^{'}_{j, X_T} = \hat{\pi }_{j, X_T}\) if \(j \le k_1\) and \(\hat{\pi }^{'}_{j, X_T} = 0\) otherwise, and analogously \(\hat{\pi }^{'}_{j, Y_T} = \hat{\pi }_{j, Y_T}\) if \(j \le k_2\) and \(\hat{\pi }^{'}_{j, Y_T} = 0\) otherwise. In the case of nonstationary series, differencing is carried out first. To fit the truncated AR(\(\infty \)) model, a criterion such as BIC or AIC is used. There are at least two other distances (proposed by [25, 36]) based on the idea of fitting an ARIMA model to each series and then measuring the dissimilarity between the models, but we do not use them due to implementation problems.
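A sketch of the Piccolo distance with two simplifications labelled as assumptions: the AR orders are fixed (instead of being selected by AIC/BIC) and the coefficients are estimated by ordinary least squares on already stationary series (differencing is omitted):

```python
import numpy as np

def ar_coefficients(x, k):
    """Least-squares estimate of AR(k) coefficients of a stationary series."""
    x = np.asarray(x, dtype=float)
    # column j holds x_{t-j-1}; the target is x_t for t = k, ..., T-1
    X = np.column_stack([x[k - j - 1:len(x) - j - 1] for j in range(k)])
    coef, *_ = np.linalg.lstsq(X, x[k:], rcond=None)
    return coef

def piccolo_distance(x, y, kx=3, ky=3):
    """d_PIC: Euclidean distance between zero-padded AR coefficient vectors."""
    px, py = ar_coefficients(x, kx), ar_coefficients(y, ky)
    k = max(kx, ky)
    px, py = np.pad(px, (0, k - kx)), np.pad(py, (0, k - ky))
    return float(np.linalg.norm(px - py))
```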

The distances from the second group compare the complexity levels of time series. In contrast to the approaches of the previous sections and paragraphs, complexity-based approaches do not rely on a specific feature or on knowledge of the underlying models, but on evaluating the level of information shared by both series [38]. Keogh et al. [31] proposed the compression-based dissimilarity measure defined as

$$\begin{aligned} d_\text {{CDM}}(\mathbf X _T, \mathbf Y _T) = \frac{\text {C}(\mathbf X _T, \mathbf Y _T)}{\text {C}(\mathbf X _T) + \text {C}(\mathbf Y _T)}, \end{aligned}$$

where \(\text {C}(\mathbf X _T, \mathbf Y _T)\) denotes the compressed size of the concatenation of the two series.

The CDM distance is descended from normalized compression distance (NCD) proposed by Lin et al. [34], using the compressed size of \(\mathbf X _T\)\(\text {C}(\mathbf X _T)\)—as an approximation of Kolmogorov complexity.
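A sketch of CDM where \(\text {C}(\cdot )\) is approximated by zlib compression of a coarse textual encoding of the series; in the original proposal the series are first discretized (e.g. via SAX), so the encoding below is only a stand-in for that step:

```python
import zlib
import numpy as np

def compressed_size(values):
    """Compressed size (in bytes) of a coarse textual encoding of a series."""
    text = ",".join(f"{v:.3f}" for v in np.asarray(values, dtype=float))
    return len(zlib.compress(text.encode("utf-8")))

def cdm_distance(x, y):
    """CDM: compressed size of the concatenated series relative to the sum of
    the individually compressed sizes."""
    xy = np.concatenate([np.asarray(x, float), np.asarray(y, float)])
    return compressed_size(xy) / (compressed_size(x) + compressed_size(y))
```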

The dissimilarity measure based on permutation distribution clustering (PDC) uses the permutations \(\Pi (\mathbf X ^{'}_T)\) of the m-dimensional embedding \(\mathbf X ^{'}_T\) of \(\mathbf X _T\). The dissimilarity between two time series \(\mathbf X _T\) and \(\mathbf Y _T\) is expressed in terms of the divergence between the distributions of these permutations, denoted by \(P(\mathbf X _T)\) and \(P(\mathbf Y _T)\). Specifically, Brandmaier [8] proposed the \(\alpha \)-divergence between \(P(\mathbf X _T)\) and \(P(\mathbf Y _T)\) as the dissimilarity between the time series \(\mathbf X _T\) and \(\mathbf Y _T\).

3 Experimental Design

We performed experiments on 47 real time series datasets that come from the UCR time series repository [13]. Each dataset is split into training and testing subsets. Within the data, the number of classes ranges from 2 to 50, the number of time series per dataset goes from 56 to 9236, and the time series lengths range from 60 to 1882 samples. All time series instances are z-normalized.

In our paper, we follow the methodology proposed by Keogh and Kasetty [28], which evaluates the efficacy of a distance measure through the prism of the accuracy of a 1NN classifier. While one should be aware that this approach cannot deliver an overall evaluation of a distance measure, the chosen method seems to have more pros than cons. For example, Wang et al. [48] pointed out three advantages: the simplicity of implementation, the performance being directly dependent on the choice of distance, and relatively good performance (compared to other, often more complex classifiers). For more information we refer to Batista et al. [5], Ding et al. [17], Tan et al. [46], Xi et al. [52].

Specifically, for each dataset, we computed the classification error rate on the test subset. When a parameter was needed to train the 1NN classifier, we tried to use values already proposed in the literature (referenced in Sect. 2).
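For concreteness, the evaluation protocol amounts to the following sketch (the helper name and the plugged-in distance are illustrative; any of the measures from Sect. 2 can be passed as `dist`):

```python
import numpy as np

def one_nn_error_rate(train_X, train_y, test_X, test_y, dist):
    """Test-set error rate of the 1NN classifier for a given distance measure.
    train_X / test_X are sequences of series, train_y / test_y their labels."""
    errors = 0
    for series, label in zip(test_X, test_y):
        d = [dist(series, other) for other in train_X]
        predicted = train_y[int(np.argmin(d))]   # label of the nearest neighbour
        errors += predicted != label
    return errors / len(test_X)

# Example: error rate of the Euclidean distance on one train/test split
# err = one_nn_error_rate(train_X, train_y, test_X, test_y,
#                         lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b)))
```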

4 Results

The results are presented in Tables 2 and 3, which contain the absolute error rates on the test subsets obtained with the 1NN classifier for each of the 30 distance measures. In Fig. 1 we present the ranks of all considered distances.

Table 2 Error rates (in %) of all considered distance measures with the 1NN classifier. The best classifier for each dataset is shown in bold. The penultimate row gives the number of wins for each distance and the last row the average ranks

Looking at the overall results, we can observe that none of the compared distances achieves the best performance on all, or even most, of the datasets. In fact, the lowest error rates are obtained by \(\mathrm {DD}_\mathrm{{DTW}}\) (15 wins), DDTW (9 wins) and DTW (8 wins), ahead of ERP (6 wins), EDR (5 wins), LCSS (5 wins) and DTWc (5 wins). CORT and ISD both have 4 wins, but the remaining measures perform clearly worse. This may be evidence for the superiority of elastic measures, and of those connected with the DTW distance, over the rest. On the other hand, looking at the average ranks, one may be surprised by the good performance of the \(L_p\) norms: MAN, ED and MIN. The CID distance is also worth mentioning: it achieved a better average rank than DTW, although it merely augments the Euclidean distance with a simple complexity correction factor.

Table 3 Error rates (in %) of all considered distance measures with the 1NN classifier. The best classifier for each dataset is shown in bold. The penultimate row gives the number of wins for each distance and the last row the average ranks

Looking at particular datasets, we see that some of them are almost perfectly classified (e.g., Coffee, DiatSizeRed, GunPoint, Plane), which could mean that their classes are relatively easy for the algorithm to recognize. Another interesting observation is that some datasets are classified better by a particular group of distances. For example, the performance of the \(L_p\) norms is relatively good for MALLAT and SynthetCont, while clearly worse for CricketX, CricketY and Haptics, which may indicate cases where we should, or should not, pay attention to the raw shape (without editing). Correlation-based distances (e.g. ACF, PACF, CCOR) may be considered a good choice for the ECGFive and Trace datasets.

Fig. 1

Box plot of ranks of each measure across all datasets. Boxes are colored according to the category of a measure: shape-based (blue), edit-based (green), feature-based (orange), structure-based (gray)

To assess the differences between the examined methods, we performed a detailed statistical comparison. We tested the hypothesis that there are no differences between the 1NN classifiers using different measures. Firstly, we employed the test proposed by Iman and Davenport [23], which is a less conservative variant of Friedman’s ANOVA [19]. The test is recommended by Demšar [15] and Garcia and Herrera [20]. If the hypothesis is rejected, we can proceed with a post hoc test providing all pairwise comparisons. In this way we can detect statistically significant differences between particular classifiers. Garcia and Herrera [20] proved that the procedure presented in Bergmann and Hommel [6] is the most powerful post hoc comparison test. It is based on the idea of finding all elementary hypotheses which cannot be rejected. However, finding all possible exhaustive sets of hypotheses for a given comparison is extremely computationally expensive; thus, we are able to compare at most 9 classifiers in the post hoc test.

The p-value from the Iman and Davenport test performed for all classifiers is equal to 0. We can therefore proceed with the post hoc tests. The results of the multiple comparisons are given in Table 4. For this comparison we chose the 9 distance measures that achieved the best average ranks. The p-value of the Iman and Davenport test for these measures was also equal to 0.

Table 4 p-values of the Bergmann–Hommel post hoc test for the 9 best measures (in terms of average ranks). Statistically significant differences (\(p<0.05\)) are in bold
Table 5 Results of the Bergmann–Hommel post hoc test: division into groups

Based on Fig. 1 and Table 4, we see that there is one measure that significantly outperforms most of the rest: \(\mathrm {DD}_\mathrm{{DTW}}\). Within the group of the 9 best classifiers, using the p-values obtained from the Bergmann–Hommel post hoc test, we can divide the distances into 3 groups (Table 5). We observe that there are no statistically significant differences between \(\mathrm {DD}_\mathrm{{DTW}}\) and DTWc, nor between \(\mathrm {DD}_\mathrm{{DTW}}\) and LCSS. The MAN distance is the worst performing one in this group (in terms of mean ranks), but the post hoc test did not signal differences with DTW (which is considered one of the most efficient measures) or with EDR. Another interesting fact is that the CID distance may be treated as statistically equal to the much more computationally expensive elastic measures such as DTW, DTWc, EDR, ERP and LCSS. In Fig. 2 we provide the plot of critical differences (CD) from the Bergmann–Hommel post hoc test, in the form shown in Demšar [15].

We also provide comparisons of pairs of classifiers (Fig. 3). We can see that \(\mathrm {DD}_\mathrm{{DTW}}\) is observably better than DTW and LCSS (most of the points lie above the diagonal). Looking at \(\mathrm {DD}_\mathrm{{DTW}}\) and MAN, we see that some datasets are classified better with the MAN distance, but this occurs extremely rarely; in most cases the performance of \(\mathrm {DD}_\mathrm{{DTW}}\) is far better (the points are far from the diagonal). Comparing ERP with MAN and DTW, we observe that the edit-based measure achieves lower error rates than both shape-based distances. The plot of CID against ED shows that adding a simple complexity correction factor results in a considerable increase in accuracy.

Fig. 2

Plot of critical differences from Bergmann–Hommel post hoc test. Groups of classifiers that are not statistically significantly different (at \(p = 0.05\)) are connected

Fig. 3

Comparison of error rates

5 Conclusion

In this article, we have compared the efficacy of 30 distance measures on 47 datasets through the prism of 1NN classifier accuracy. Similarly to Serrà et al. [44] and Wang et al. [48], we have observed that there is no measure distinctly better than the others or appropriate for the majority of datasets. Thus, there is still room for new ones, perhaps combining properties of already existing measures. On the other hand, the best average ranks were achieved by modifications of the DTW distance (\(\mathrm {DD}_\mathrm{{DTW}}\), DDTW, DTWc) and by edit-based distances (LCSS, ERP, EDR). Thus, we may draw two conclusions. First, processing the shape of a time series in a smart way may be a direction for future research. Second, comparing time series by means of edit operations brings remarkable results. Finally, we have also observed that some datasets are classified better by certain groups of measures. It would be highly desirable to find a set of metadata which could help choose the most appropriate measure.

Since this study discussed only 30 of the roughly 40 available distance measures, there is still potential to extend the presented comparison. We plan to cover all available distance measures in the near future and also to extend the number of datasets used for testing. It would also be interesting to confront the conclusions drawn from these analyses with different time series mining tasks, e.g. clustering.