1 Introduction

A wide range of complex algorithms for time series classification (TSC) have been proposed. These include ensembles of deep neural networks [14], heterogeneous meta-ensembles built on different representations [22], homogeneous ensembles with embedded representations [26] and randomised kernels [10]. The majority of these algorithms rely on some form of transformation: features that in some way model the discriminatory time characteristics are extracted and used in the classification process. These features are often very complex, and usually embedded in the classifiers in complicated ways. For example, the Temporal Dictionary Ensemble (TDE) [19] is centred around the Symbolic Fourier Approximation (SFA) [25] transformation. The transform itself simply discretises the series into a set of words using a sliding window. However, performing the transform alone does not lead to an algorithm that is competitive in accuracy. TDE also employs a spatial pyramid, bi-gram frequencies, a bespoke distance function and a Gaussian process based parameter setting mechanism. The complexity increases further if the data is multivariate, containing multiple time series per case.

Researchers not directly involved in TSC algorithm research, and data scientists in particular, often ask the not unreasonable question of whether these complicated representations are really necessary to get a good classifier. They wonder whether a simple pipeline using standard feature extractors, as illustrated in Fig. 1, would in fact be at least as good as complicated classifiers claiming to be state of the art. Clearly, the answer will not be the same for all problems, and the detailed answer depends on what level of accuracy is deemed sufficient for a particular application. However, we can address the hypothesis of whether, on average, a standard pipeline of transformation plus classifier performs as well as bespoke benchmarks and the state of the art. Specifically, we compare a range of pipeline combinations of off-the-shelf unsupervised time series transformers with commonly used vector based classifiers to the current state of the art in TSC as described in [22]. In Sect. 2 we describe the transformers and classifiers used in our pipeline experiments, and give a brief overview of the state of the art in TSC. In Sect. 3 we describe our experimental structure, and in Sect. 4 we present our findings. Finally, in Sect. 5 we draw our conclusions and summarise what we have learnt from this study.

Fig. 1. Visualisation of a simple pipeline algorithm for TSC. Could using standard transformers and vector based classifiers be as good as state of the art TSC algorithms?

2 Background

TSC algorithms tend to follow one of three structures. The simplest involves single pipelines such as that described in Sect. 1, where the transformation is either supervised (e.g. the Shapelet Transform Classifier [6]) or unsupervised (e.g. ROCKET [10]). These algorithms tend to involve an over-produce and select strategy: a huge number of features are created, and the classifier is left to determine which are most useful. The transform can remove time dependency, e.g. by calculating summary features; we call these series-to-vector transformations. Alternatively, a transform may be series-to-series, mapping the data into an alternative time series representation in which we hope the task becomes more tractable (e.g. transforming to the frequency domain of the series).
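The distinction can be illustrated in a few lines of Python (a toy sketch of our own, not code from any of the cited algorithms):

    import numpy as np

    x = np.sin(np.linspace(0, 10, 100))  # a toy univariate series of length 100

    # Series-to-vector: summary features discard the time ordering entirely.
    vector_features = np.array([x.mean(), x.std(), x.min(), x.max()])

    # Series-to-series: the series is mapped to another ordered representation,
    # here the magnitudes of its discrete Fourier transform.
    frequency_series = np.abs(np.fft.rfft(x))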

The second transformation based design pattern involves ensembles of pipelines, where the ensemble repeatedly applies different (typically randomised) transforms, each feeding the same type of base classifier (e.g. the Canonical Interval Forest [21]). These ensembles can also be heterogeneous, collating the classifications from transformation pipelines and ensembles built on differing representations of the time series (e.g. HIVE-COTE [22]).

The third common pattern involves transformations embedded inside a classifier structure. An example of this is a decision tree in which the data is transformed, or a distance measure applied, as part of the splitting criterion at each node (e.g. TS-CHIEF [26]).

2.1 State of the Art for TSC

The state of the art for TSC consists of one classifier from each of the structures described, as well as a deep learning approach.

The Random Convolutional Kernel Transform (ROCKET) [10] is a transform designed for classification. It generates a large number of parameterised convolutional kernels, used as part of a pipeline alongside a linear classifier. Kernels are randomly initialised with respect to the following parameters: the kernel length; a vector of weights; a bias term added to the result of the convolution operation; the dilation, which defines the spread of the kernel weights over the input instance; and padding for the input series at the start and end. Each kernel is convolved with an instance through a sliding window dot-product, producing an output vector from which only two values are extracted: the maximum and the proportion of positive values. These are concatenated into a single feature vector over all kernels.
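To make the idea concrete, the following is a heavily simplified sketch of a ROCKET-style transform of our own (fixed kernel length, no dilation or padding; the real implementation randomises all five parameters listed above):

    import numpy as np

    def rocket_like_transform(x, n_kernels=100, kernel_length=9, seed=0):
        """Convolve random kernels with series x, keeping the maximum and the
        proportion of positive values (PPV) per kernel: two features each."""
        rng = np.random.default_rng(seed)
        features = []
        for _ in range(n_kernels):
            weights = rng.normal(size=kernel_length)
            bias = rng.uniform(-1, 1)
            # sliding-window dot product (cross-correlation, 'valid' region)
            out = np.convolve(x, weights[::-1], mode="valid") + bias
            features.extend([out.max(), (out > 0).mean()])
        return np.array(features)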

The Time Series Combination of Heterogeneous and Integrated Embedding Forest (TS-CHIEF) [26] is a homogeneous ensemble where hybrid features are embedded in tree nodes rather than modularised through separate classifiers. The trees in the TS-CHIEF ensemble embed distance measures, dictionary based histograms and spectral features. At each node, a number of splitting criteria from each of these representations are considered. These splits use randomly initialised parameters to help maintain diversity in the ensemble.

InceptionTime [14] is the only deep learning approach we are aware of which achieves state-of-the-art accuracy for TSC. InceptionTime builds on a residual network (ResNet), the prior best network for TSC [13]. The network is composed of two blocks of three Inception modules [27] each, as opposed to the three blocks of three traditional convolutional layers in ResNet. These blocks maintain residual connections, and are followed by global average pooling and softmax layers as before. InceptionTime creates an ensemble of these networks with randomly initialised weights.
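A sketch of a single Inception module in Keras may help to fix ideas; this is our own approximation of the published architecture, not the authors' code, and the filter counts are illustrative:

    from tensorflow.keras import layers

    def inception_module(inputs, n_filters=32):
        # bottleneck reduces channel depth before the wide convolutions
        bottleneck = layers.Conv1D(n_filters, 1, padding="same", use_bias=False)(inputs)
        branches = [
            layers.Conv1D(n_filters, k, padding="same", use_bias=False)(bottleneck)
            for k in (10, 20, 40)
        ]
        # parallel max-pooling branch followed by a length-1 convolution
        pool = layers.MaxPooling1D(3, strides=1, padding="same")(inputs)
        branches.append(layers.Conv1D(n_filters, 1, padding="same", use_bias=False)(pool))
        x = layers.Concatenate(axis=-1)(branches)
        x = layers.BatchNormalization()(x)
        return layers.Activation("relu")(x)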

The Hierarchical Vote Collective of Transform Ensembles, HIVE-COTE 1.0 (HC1) [2], and the three algorithms above are not significantly different from each other in terms of accuracy. Additionally, all are significantly more accurate on average than the best performing algorithms from the bake off comparison of time series classifiers published five years prior [3].

The second release of HIVE-COTE, HIVE-COTE 2.0 (HC2) [22] is a heterogeneous ensemble of four classifiers built on four different base representations. HC2 is the only algorithm we are aware of which performs significantly better than the four algorithms above. In HC2, three new classifiers are introduced, with only the Shapelet Transform Classifier (STC) [5] retained from HC1. TDE [19] replaces the Contractable Bag-of-SFA-Symbols (cBOSS) [20]. The Diverse Representation Canonical Interval Forest (DrCIF) replaces both Time Series Forest (TSF) [12] and the Random Interval Spectral Ensemble (RISE) [17] for the interval and frequency representations. An ensemble of ROCKET [10] classifiers called the Arsenal is introduced as a new convolutional/shapelet based approach. Estimation of test accuracy via cross-validation is replaced by an adapted form of out-of-bag error, although the final model is still built using all training data.

2.2 Unsupervised Time Series Transformations

Time Series Feature Extraction based on Scalable Hypothesis Tests (TSFresh) [8] is a collection of just under 800 features extracted from time series data. TSFresh is very popular with the data science community, and is frequently proposed as a good transform for classification. The Highly Comparative Time Series Analysis (hctsa) [15] toolbox can create over 7700 features for exploratory time series analysis. Alongside basic statistics of time series values, hctsa includes features based on linear correlations, trends and entropy, as well as features from time series domains such as wavelets, information theory and forecasting. Both TSFresh and hctsa cover similar domains, extracting masses of summary features from the time series. Some of these extracted features will be similar, with differently parameterised variations of the same feature included where applicable.
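For illustration, a minimal TSFresh extraction looks like the following (the long-format column names and random data are ours; our experiments use the sktime wrapper described in Sect. 3):

    import numpy as np
    import pandas as pd
    from tsfresh import extract_features

    # tsfresh expects a long-format frame: one row per observation, with
    # columns identifying the series, the time index and the value.
    df = pd.DataFrame({
        "id": np.repeat([0, 1], 100),
        "time": np.tile(np.arange(100), 2),
        "value": np.random.randn(200),
    })
    X = extract_features(df, column_id="id", column_sort="time")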

The Canonical Time Series Characteristics (catch22) [18] are 22 features chosen to be the most discriminatory of the full hctsa [15] set. This was determined by an evaluation over the UCR datasets. The hctsa features were initially pruned, removing those which are sensitive to mean and variance and any which could not be calculated on over 80% of the UCR datasets. A feature evaluation was then performed based on predictive performance. Any features which performed below a threshold were removed. For the remaining features, a hierarchical clustering was performed on the correlation matrix to remove redundancy. From each of the 22 clusters formed, a single feature was selected, taking into account balanced accuracy, computational efficiency and interpretability. Like the hctsa set it was extracted from, the catch22 features cover a wide range of feature concepts.
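The features can be computed with, for example, the pycatch22 package (an assumption on our part; our experiments use the sktime implementation):

    import numpy as np
    import pycatch22

    series = np.random.randn(100)
    result = pycatch22.catch22_all(list(series))  # feature names and values
    features = dict(zip(result["names"], result["values"]))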

Time series intervals underpin the interval based representation of TSC algorithms. Classifiers from this representation select multiple phase-dependent subseries from which discriminatory features are extracted; examples include TSF [12] and the Canonical Interval Forest (CIF) [21]. Both of these algorithms select intervals with a random length and position, extract summary features from the resulting subseries and concatenate the output of each. This interval selection and feature extraction process can itself be used as an unsupervised transformation.
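A minimal sketch of such an unsupervised interval transform (the interval count and statistics are chosen for illustration only):

    import numpy as np

    def random_interval_features(x, n_intervals=100, min_length=3, seed=0):
        """Extract summary statistics from randomly positioned intervals."""
        rng = np.random.default_rng(seed)
        features = []
        for _ in range(n_intervals):
            start = rng.integers(0, len(x) - min_length)
            end = rng.integers(start + min_length, len(x) + 1)
            sub = x[start:end]
            features.extend([sub.mean(), np.median(sub), sub.std(),
                             sub.min(), sub.max()])
        return np.array(features)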

Generalised Signatures [23] are a set of feature extraction techniques, primarily for multivariate time series, based on rough path theory. We specifically look at the generalised signature method [23] and the accompanying canonical signature pipeline. Signatures are collections of ordered cross-moments. The pipeline begins by applying two augmentations by default. The basepoint augmentation simply adds a zero at the beginning of the time series, making the signature sensitive to translations of the series. The time augmentation adds the series timestamps as an extra coordinate, guaranteeing that each signature is unique and capturing information about the parameterisation of the series. A hierarchical dyadic window is then run over the series, with the signature transform applied to each window, and the output of each window is concatenated into a feature vector.
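A rough sketch of the pipeline using the third-party iisignature package (our own choice for illustration; the paper's experiments use the sktime implementation, and details such as the global time index per window differ):

    import numpy as np
    import iisignature

    def signature_window_features(w, depth=3):
        """Basepoint and time augmentation followed by a depth-3 signature."""
        t = np.linspace(0, 1, len(w))
        path = np.column_stack([t, w])          # time augmentation
        path = np.vstack([[0.0, 0.0], path])    # basepoint augmentation
        return iisignature.sig(path, depth)

    def dyadic_signature_features(x, levels=2, depth=3):
        """Dyadic windows: the whole series, then halves, quarters, ..."""
        feats = []
        for level in range(levels + 1):
            for w in np.array_split(x, 2 ** level):
                feats.append(signature_window_features(w, depth))
        return np.concatenate(feats)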

3 Experimental Structure

We perform our experiments on 112 equal length datasets with no missing values from the UCR time series archive [9]. We resample each dataset randomly 30 times in a stratified manner, with the first resample being the original train-test split from the archive. Each algorithm and dataset resample is seeded using the fold index to ensure reproducibility.
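The resampling procedure is equivalent to the following sketch (using sklearn; the actual tsml/sktime resampling code differs in detail, and fold 0 keeps the archive's original split rather than resampling):

    from sklearn.model_selection import train_test_split

    def resample_dataset(X, y, fold, train_size):
        """Stratified resample of a dataset, seeded by the fold index."""
        return train_test_split(X, y, train_size=train_size,
                                random_state=fold, stratify=y)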

The transformations used in our experiments can be found in the Python sktime package. Each transformer was built and saved to file, with the process being timed for our timing experiments. The classification portion of our pipelines, and the TSC algorithms used in our comparison, were run using the Java tsml toolkit implementations. The exception is the deep learning approach InceptionTime, which we run using the sktime companion package sktime-dl.

To compare our results for multiple classifiers over multiple datasets we use critical difference diagrams [11]. We replace the post-hoc Nemenyi test with a comparison of all classifiers using pairwise Wilcoxon signed-rank tests, and cliques formed using the Holm correction as recommended in [4, 16].
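The pairwise testing step can be sketched as follows (clique formation on top of these adjusted p-values is more involved and is omitted; function and variable names are ours):

    from itertools import combinations
    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    def pairwise_tests(accs, names, alpha=0.05):
        """accs maps classifier name -> array of accuracies per dataset."""
        pairs = list(combinations(names, 2))
        p_values = [wilcoxon(accs[a], accs[b]).pvalue for a, b in pairs]
        # Holm correction over all pairwise comparisons
        reject, p_adj, _, _ = multipletests(p_values, alpha=alpha, method="holm")
        return dict(zip(pairs, zip(p_adj, reject)))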

We create pipelines primarily using the transformations described in Sect. 2.2, with the exception of the hctsa feature set, which required too much processing time and memory to be run in our timeframe. In addition to these transformations, we also include two benchmark transformations: Principal Component Analysis (PCA) and seven basic summary statistics. The seven statistics we use are the mean, median, standard deviation, minimum, maximum and the quantiles at 25% and 75%. PCA and basic summary statistics are the simplest transformations available, and represent perhaps the simplest approaches one could take to TSC, alongside building classifiers on the raw time series and one-nearest-neighbour classification with Euclidean distance.
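The summary statistics benchmark transform amounts to no more than the following:

    import numpy as np

    def seven_number_summary(x):
        """The seven basic summary statistics used as a benchmark transform."""
        return np.array([x.mean(), np.median(x), x.std(), x.min(), x.max(),
                         np.quantile(x, 0.25), np.quantile(x, 0.75)])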

Our random interval transformation experiments extract 100 randomly selected intervals per dataset. We form two random interval pipelines, one extracting our basic summary statistics from each interval (RandInterval) and the other extracting the Catch22 features (RandIntC22).

For the classifier portion of our pipelines, we test three different vector based classifiers. Rotation Forest (RotF) [24] is the classifier of choice for the STC pipeline, and has been shown to be significantly better than other popular approaches on problems containing only continuous attributes [1]. Extreme Gradient Boosting (XGBoost) [7], of Kaggle fame, is our second classifier option. Our third option is a ridge regression classifier with cross-validation to select parameters (RidgeCV), the better performing linear classifier suggested for the ROCKET pipeline [10].
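The three classifiers can be instantiated as follows (parameter values are illustrative rather than those used in our experiments, and the sktime module path for RotationForest may vary between versions):

    import numpy as np
    from sklearn.linear_model import RidgeClassifierCV
    from xgboost import XGBClassifier
    from sktime.classification.sklearn import RotationForest

    ridge = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
    xgb = XGBClassifier(n_estimators=500)
    rotf = RotationForest(n_estimators=200)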

4 Results

We structure our results to answer four specific questions:

  1. Which transformation is best, given a specific classifier?

  2. Which classifier is best, given a specific transform?

  3. How do the pipeline classifiers compare to standard benchmarks?

  4. How do the pipeline classifiers compare to the state of the art?

Figure 2 shows the relative performance of different transforms for our three base classifiers. We include for reference our two baseline classifiers: Rotation Forest (RotF) built on the raw time series, and 1-nearest neighbour using dynamic time warping with a tuned window size (DTWCV).

Fig. 2. Relative rank performance of seven transforms used in a simple pipeline with a linear ridge classifier (a), XGBoost (b) and rotation forest (c). TSFresh and RandIntC22 are significantly better than all other transforms with most base classifiers.

The pattern of results is similar for all three classifiers: TSFresh and RandIntC22 are ranked top in each case. Both are significantly higher ranked than all the other transforms and both baseline classifiers, except for the case of TSFresh with a ridge classifier. Summary statistics is always the worst approach, and PCA, catch22 and Signatures are no better than, or worse than, the benchmark classifiers. RandomIntervals is significantly better than the benchmarks with rotation forest. There is an anomaly when drawing cliques (RotF and DTWCV are not always in the same clique, despite there being no significant difference between them in any experiment, because cliques cannot always be drawn exactly), but the initial indications are clear: TSFresh and RandIntC22 are the best performing techniques and, with the possible exception of RandomIntervals, the others do not outperform the standard benchmarks, and are therefore of less interest.

We investigate the relative performance of classifiers by comparing the two best transforms (TSFresh and RandIntC22) in combination with the three classifiers. Figure 3 shows that RotF is significantly better than RidgeCV and XGBoost for both transforms. This supports the argument made in [1] that rotation forest is the best classifier for problems with all continuous attributes. Figure 4 shows the pairwise scatter plots for four pairs of pipelines. Plots (a), (b) and (c) show the difference in accuracies on the archive between TSFresh and RandIntC22 using each of our base classifiers. Plot (d) compares our best performing pipeline, TSFresh with rotation forest, to the next best, TSFresh with XGBoost.

Fig. 3. Relative performance of three classifiers, Rotation Forest, XGBoost and RidgeCV (prefixes RotF, XG and Ridge), with two transforms, TSFresh and RandIntC22 (suffixes TSFr and RIC22). RotF is significantly better than the other classifiers, and RotF with TSFresh is the best overall combination.

Fig. 4. Pairwise scatter plots for TSFresh vs RandIntC22 with (a) RidgeCV, (b) XGBoost and (c) rotation forest, and (d) rotation forest vs XGBoost, both using TSFresh. Plots (a), (b) and (c) demonstrate the superiority of TSFresh over RandIntC22; (d) shows that rotation forest significantly outperforms XGBoost.

Fig. 5. Critical difference plot for FreshPRINCE against SOTA and DTWCV.

Our primary finding is that the pipeline of TSFresh and rotation forest is, on average, the highest ranked and the most accurate simple pipeline approach for classifying data from the UCR archive. We feel the approach deserves a better name than RotF-TSFr; hence, we call it the FreshPRINCE (Fresh Pipeline with RotatIoN forest Classifier). We investigate the classification performance of the FreshPRINCE against the current and previous state of the art. Figure 5 compares FreshPRINCE to the very latest state of the art, HIVE-COTE 2.0 (HC2), the previously best performing algorithms, InceptionTime, TS-CHIEF and ROCKET, and the popular benchmark 1-NN with DTW (DTWCV). It shows that FreshPRINCE does not achieve SOTA, but it does perform better than DTWCV. Table 1 presents the summary performance measures averaged over all datasets. FreshPRINCE is approximately 6.5% more accurate than DTWCV, but on average 1.4% and 3.8% less accurate than ROCKET and HC2 respectively.

Table 1. Summary performance statistics averaged over 112 UCR datasets: test set accuracy (Acc), balanced accuracy (BalAcc), F1 statistic (F1), area under the receiver operating characteristic curve (AUROC) and negative log likelihood (NLL).

Table 2 displays the run times for generating the results summarised in Fig. 5 and Table 1. The FreshPRINCE is not as fast as the ROCKET classifier, but is still faster than the other SOTA TSC algorithms.

Table 2. Classifier runtimes: average (minutes), total (hours) and maximum (hours).

We believe that, given the simplicity of the pipeline approach, the FreshPRINCE pipeline should serve as a benchmark against which new algorithms are compared. If the claimed merits of an approach are primarily its accuracy, then we believe it should achieve significantly better accuracy than the simple approach of a TSFresh transform followed by a rotation forest classifier.

4.1 Implementation and Reproduction of Results

Given that we suggest FreshPRINCE as a benchmark classifier for new comparisons, we also provide resources for using it as such. We include our results for FreshPRINCE on the 112 UCR datasets used in these experiments on the time series classification web page. For experiments outside the UCR archive, we have implemented the pipeline in the Python sktime package. The most commonly used machine learning package for Python, sklearn, does not contain a rotation forest implementation; as such, we also include an implementation of the algorithm in sktime.

Listing 1.1 displays the process for running FreshPRINCE using the sktime package, loading data from its .ts file format.

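The original listing did not survive extraction; the following sketch, using a recent sktime release, illustrates the same process (the module paths, the FreshPRINCE estimator name and the dataset file names are our assumptions and may vary between versions):

    from sktime.classification.feature_based import FreshPRINCE
    from sktime.datasets import load_from_tsfile_to_dataframe

    # file paths are placeholders; any UCR-format .ts dataset will do
    X_train, y_train = load_from_tsfile_to_dataframe("ItalyPowerDemand_TRAIN.ts")
    X_test, y_test = load_from_tsfile_to_dataframe("ItalyPowerDemand_TEST.ts")

    fp = FreshPRINCE(random_state=0)
    fp.fit(X_train, y_train)
    accuracy = fp.score(X_test, y_test)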

FreshPRINCE can also be run as an sklearn pipeline, using the sktime TSFresh transformer and rotation forest implementations, as shown in Listing 1.2.

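Again, the listing itself is not preserved; a plausible reconstruction, reusing the data loaded above (module paths and parameter values are our assumptions), is:

    from sklearn.pipeline import Pipeline
    from sktime.classification.sklearn import RotationForest
    from sktime.transformations.panel.tsfresh import TSFreshFeatureExtractor

    # TSFresh transform followed by rotation forest, composed with sklearn
    fresh_prince = Pipeline([
        ("tsfresh", TSFreshFeatureExtractor(default_fc_parameters="comprehensive")),
        ("rotf", RotationForest(n_estimators=200)),
    ])
    fresh_prince.fit(X_train, y_train)
    accuracy = fresh_prince.score(X_test, y_test)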

5 Conclusion

We have tested a commonly held belief that a simple pipeline of transformation and standard classifier is a useful approach for time series classification. We have found that there is some merit in this opinion: simple transformations such as PCA or summary statistics are not effective, but more complex transformations such as TSFresh and random intervals with the catch22 features do achieve a respectable level of accuracy on average. They are significantly worse than the state of the art of 2020 and 2021, but significantly better than the state of the art from ten years ago (DTWCV). We suggest that the best performing pipeline, a combination of TSFresh and rotation forest which we call FreshPRINCE for brevity, be used more commonly as a TSC benchmark.