1 Introduction

In many domains, repeated measurements are collected in order to obtain characteristics of objects or situations that evolve over time. Examples of such domains include shape outline recognition, e.g., for classifying historical documents or projectile points (Ye and Keogh 2009), classification of electrocardiograms (ECGs) (Kampouraki et al. 2009), and anomaly detection in streaming data (Rebbapragada et al. 2009). These measurements are typically collected at a fixed rate and such collections are commonly referred to as data series. In the case of measurements over time, such series are referred to as time series, and they can be either univariate (with a single variable evolving over time) or multivariate (with d time-evolving variables).

Our main focus in this paper is time series classification. In other words, given a collection of time series, we would like to infer a model that can predict the value of a categorical output variable, based on the observed time series describing an instance.

Example Let us consider an example from the medical domain. More specifically, given an electrocardiogram (ECG) of a patient, we would like to infer the cardiovascular condition of that patient by exploiting the information contained in a collection of past observations of ECGs, labeled with a categorical outcome variable, taken from different patients. Hence, in this case, the outcome variable corresponds to the heart condition of the patient. An example of an ECG is illustrated in Fig. 1. We can observe that an ECG in this case can be modeled as a multivariate time series consisting of 15 concurrently evolving variables, each corresponding to the voltage of a heart signal over time.

Fig. 1
figure 1

An example of an electrocardiogram (ECG) with 15 variables. Each variable is a heart signal, measured by a separate voltage meter, that evolves over time. This ECG can be seen as a 15-dimensional time series

Achieving fast and accurate time series classification has attracted significant interest of the data mining community over the past decade. One line of research has been focusing on representing time series as ordinary feature vectors. Such representations allow for direct application of standard supervised learning algorithms, such as decision trees (Rodríguez and Alonso 2004), support vector machines (SVMs) (Wu and Chang 2004), neural networks (Nanopoulos et al. 2001), and nearest neighbor classifiers (Batista et al. 2011). For univariate time series (UTS) classification, experimental evidence has repeatedly demonstrated (Xi et al. 2006; Wang et al. 2013) that similarity-based approaches with elastic measures, such as dynamic time warping (DTW) (Berndt and Clifford 1994) provide state-of-the-art predictive performance. Multivariate time series (MTS), however, are characterized not only by similarities between individual attributes, but also by relations between different attributes. The latter is not captured by the traditional univariate approaches (Bankó 2012), which has directed the attention to feature based approaches such as learned pattern similarity (Baydogan and Runger 2015).

A recent approach for time series classification is to identify and extract time series subsequences, called shapelets, that can be used as discriminatory features for classification (Ye and Keogh 2009). Their main characteristic is that they are class-discriminant, phase-independent, i.e., location invariant, subsequences of a longer time series. Several shapelet-based approaches have been proposed for univariate time series classification, such as shapelet-based decision trees (Ye and Keogh 2009) and pre-processing approaches using shapelet transformations (Hills et al. 2014). Similar approaches have been proposed also for multivariate time series. One approach is to build a shapelet-based decision tree for each dimension and then combine the individual classifiers using majority voting (Cetin et al. 2015), while another approach is to perform a shapelet transformation with weighted voting (Patri et al. 2014). It has been demonstrated that while shapelet-based decision trees can provide interpretable rules and, hence, insights to practitioners (Ye and Keogh 2009), their classification accuracy and training costs are often prohibitive (Rakthanmanon and Keogh 2013), limiting their applicability when dealing with large and multivariate time series. Therefore, a natural extension to shapelet-based decision trees is to consider forests of such trees, where each individual tree is built by considering only a subset of all available shapelets. The latter can lead to a substantial speedup, especially for large multivariate time series, leading to that entire forest can be generated faster and obtain higher predictive performance than a single tree (generated by enumerating all shapelets), similar to how random forests (Breiman 2001) improve upon single decision trees.

The main contributions of the paper can be summarized as follows:

  • Novelty We propose a generalized method for efficient and effective time series classification. The key novelty of our approach is that it extends the random shapelet forest algorithmFootnote 1 to support both uni- and multivariate time series classification.

  • Efficiency and effectiveness Through an extensive experimental evaluation on both univariate and multivariate time series datasets, we demonstrate that the proposed algorithm can achieve competitive predictive performance compared to state-of-the-art methods, while reducing the training cost by up to an order of magnitude on average. In addition, we evaluate the robustness of the method with respect to different parameter settings and provide insights on how to tune them for different situations, exploring their impact on the evaluation measures through a decomposition of the mean square error into the bias and the variance of class probabilities assigned by the individual trees in the forest.

  • Applicability We demonstrate the applicability of the novel method on a large and diverse collection of datasets spanning various application domains, including 56 univariate and 14 multivariate time series datasets. Finally, the presented algorithm is: simple to implement, accurate, fast and embarrassingly parallel, i.e., applicable to a wide range of tasks.

The remainder of the paper is organized as follows. In the next section, we discuss related work on time series classification, focusing on shapelet-based classifiers and multivariate time series. In Sect. 3, the notation and problem setting are defined. In Sect. 4, we present the algorithm for generating forests of randomized shapelet trees and discuss various implementation choices. In Sect. 5, the experimental setup and results from the empirical investigation are presented. Finally, we summarize the main findings and point out directions for future work in Sect. 6.

2 Related work

Most approaches for univariate time series classification generally rely on instance-based classification algorithms, e.g., k-nearest neighbor, using different similarity (or distance) measures, of which the most common and simplest is the Euclidean norm. To improve accuracy, elastic distance measures have been proposed, such as dynamic time warping (DTW) or longest common subsequence (Maier 1978) and variants, e.g., cDTW (Sakoe and Chiba 1978), EDR (Chen and Özsu 2005), ERP (Chen and Ng 2004), that are robust to misalignment and time warps. These distance measures allow for localized distortion, e.g., DTW finds the optimal match between two sequences by allowing non-linearity in the distance calculation. By regularization using, e.g., a band (Ratanamahatana and Keogh 2004), the search performance and generalization behavior of k-nn can be greatly improved (Ding et al. 2008). For a more complete overview of instance based univariate time series classifiers, the reader is referred to, e.g., Ding et al. (2008). Although the extension of the DTW algorithm for the multivariate case is non-trivial (Shokoohi-Yekta et al. 2015), several alternatives have been proposed for multivariate time series classification, where the simplest and most commonly employed is the cumulative distance of all univariate distances. Another alternative is introduced by Bankó (2012), where a correlation-based version of DTW (CBDTW) is proposed that combines DTW and principal component analysis to preserve the correlation structure between time series.

To supplement instance-based classifiers and improve the interpretability of the generated models, the concept of shapelets has been introduced (Ye and Keogh 2009). Shapelets are usually described as subsequences whose distance to other time series provides discriminative features used for classification and interpretation (Ye and Keogh 2011). For shapelet-based classifiers, the idea is to consider all subsequences of the training data recursively in a divide-and-conquer manner, while assessing the quality of the shapelets using a scoring function to estimate their discriminative power, constructing an interpretable shapelet tree classifier (Ye and Keogh 2009, 2011).

The most common scoring function is information gain (Shannon 1948), which is also commonly employed when constructing traditional decision trees, (see e.g., Quinlan 1993). To prune non-informative shapelet candidates early and, hence, speed-up the exhaustive search, Ye and Keogh (2009) employ early-abandoning and lower-bounding on the information gain. Due to the combinatorial explosion, however, lower-bounding the information gain does not scale and becomes infeasible when the number of classes increases. To overcome this limitation, other measures based on the analysis of variance have been considered, with the rationale being that these measures are computationally less costly to compute (Lines et al. 2012; Hills et al. 2014). Hills et al. (2014) showed that although there is no significant difference in predictive performance between using information gain and the measures based on variance, the computational cost is significantly reduced. In addition to these techniques for speeding up the exhaustive search for shapelets, several other approaches have been proposed in the literature. For example, methods for trading time complexity for memory consumption, while finding the optimal match have been explored (Mueen et al. 2011).

Decision trees are interpretable, something which is useful in many domains. However, when it comes to predictive performance, they are in almost all cases outperformed by other classifiers, such as support vector machines (Cortes and Vapnik 1995), deep neural networks, (Schmidhuber 2014) and random forests (Breiman 2001). To overcome this limitation, Hills et al. (2014) proposes a single-scan algorithm for finding the best k shapelets in a collection of time series. Subsequently, the produced set of k shapelets is used to generate a new transformed k-feature dataset, with each attribute being the (minimum) distance from the i:th shapelet to the j:th original time series. By disconnecting the shapelet search and the model generation, time series classification is reduced to a feature selection (or generation) problem, enabling the use of a wide range of efficient learning algorithms (Hills et al. 2014).

Shapelet transformation is one instance of a more general concept of feature generation, which has been thoroughly investigated for time series classification. For example, generated features include interval-based features (Rodríguez et al. 2005; Rodríguez and Alonso 2004), statistical features (Nanopoulos et al. 2001; Deng et al. 2013), and other interpretable features, such as correlation structure, distribution or entropy (Fulcher and Jones 2014). Typically the features produced by these transformations can be grouped into three main categories: correlation-based, auto-correlation-based, and shape-based, each denoting similarity in time, change, and shape, respectively. For example, a time series forest based on interval features, such as averages, standard deviations and slope has been proposed by Deng et al. (2013) and a transformation based on time series bag-of-words by Baydogan et al. (2013).

Since exhaustive shapelet discovery requires the enumeration of every subsequence in the training data, it is not feasible for large datasets with large, multivariate, time series. To improve speed, the search space is usually reduced by only considering subsequences of specific lengths, e.g., within predefined ranges. The optimal shapelet length is generally unknown, requiring brute-force searches over multiple classifiers using cross-validation. Due to computational constraints, this is however often infeasible. To improve the performance, Hills et al. (2014) introduce a heuristic-based algorithm for estimating the shapelet length.

For univariate time series classification, an alternative to exhaustively enumerating and searching among all shapelets has been introduced by Grabocka et al. (2014) through the notion of shapelet learning (LTS). In the domain of univariate time series classification, this can be considered the current state-of-the-art in terms of classification accuracy. In this framework, shapelets are learned from the training data, while simultaneously minimizing both the training error using the logistic loss function and the minimum distance using a soft minimum (Grabocka et al. 2014). The learned shapelets can improve predictive performance significantly compared to other shapelet-based classifiers that are primarily based on exhaustive search (Grabocka et al. 2014). Apart from the high computational cost, the primary disadvantage of LTS is the large number of hyper-parameters that require tuning and the difficulty of choosing the initial shapelet prototypes, which typically has a large impact on the resulting accuracy (Grabocka et al. 2014).

As far as multivariate time series classification is concerned, two main approaches using shapelets have been introduced in the literature. Patri et al. (2014) proposes a shapelet forest for classifying heterogeneous multivariate sensor data. The algorithm uses the Fast Shapelet approach (Rakthanmanon and Keogh 2013) to extract the most informative shapelets from each dimension. Using these shapelets, a distance matrix is computed and weights are learned for each shapelet using different feature selection methods. The classification of a new multivariate time series is performed by weighting the votes of the individual shapelet trees built for each dimension by summing the learned weights of the used shapelets. A similar approach, in which one shapelet tree is built from each time series dimension, is presented by Cetin et al. (2015). For the latter, several techniques for improving the search speed are proposed, which makes building several trees tractable. The techniques improve the performance of tree generation significantly over existing methods by using multi-length indexing and dynamic stepping (Cetin et al. 2015). Different voting approaches are evaluated, showing that building one shapelet tree per dimension outperforms shapelets defined over multiple dimensions (Cetin et al. 2015). The main drawback of these approaches is however that, by building an individual tree per dimension, the trees do not take into consideration potential dependencies between the dimensions. In addition, the weighted voting technique that is employed for determining the final class does not consider the varying importance of different dimensions.

Similar to shapelet transformation, Wistuba et al. (2015) proposes a method for multivariate time series, Ultra-Fast Shapelets (UFS), where random shapelets are used as features. The authors show that such transformations are both fast and accurate. In particular, the method outperforms related algorithms for multivariate time series classification (Wistuba et al. 2015). The most prominent advantage of UFS, which is rather similar to the method proposed in this work, is the ability to model relationships between different dimensions at a low computational cost. A drawback, however, is the fact that UFS considers shapelets globally from a restricted pool of preselected shapelets, as opposed to the method proposed in this study, which, at each node in each tree, locally considers a set of shapelets, improving diversity and directing the search to more important regions. Finally, another line of research for multivariate time series classification includes feature based approaches, such as learned pattern similarity (LPS) (Baydogan and Runger 2015), symbolic representation for multivariate time series (SMTS) (Baydogan and Runger 2014), and multivariate extensions of time series bag-of-words (Baydogan et al. 2013).

3 Problem setting

The problem studied in this paper is uni- and multivariate time series classification, i.e., the task at hand is, given one or several set-valued variables, each with values ordered by time and sampled at regular and fixed intervals, we want to infer a model that is able to accurately predict the class labels of unseen examples based on the observed variables.

Next, we proceed by providing some background definitions along with the problem formulation.

Definition 1

(d-dimensional time series) A d-dimensional time series \(\mathcal {T} = \{\mathbf {T_{1}}, \ldots , \mathbf {T_{d}}\}\) is a sequence of d variables, such that \(\mathbf {T}_k\in \mathbb {R}^m\), \(\forall k\in \{1,\dots ,d\}\), where \(\mathbf {T_k}=\{T_{k,1},\ldots ,T_{k,m}\}\), with \(T_{k,j}\in \mathbb {R}\), \(\forall j\in \{1,\dots ,m\}\). For \(d=1\), \(\mathcal {T}\) defines a univariate time series, denoted simply as \(\mathcal {T}=\{T_1,\ldots ,T_m\}\) consisting of a sequence of m ordered elements \(T_j \in \mathbb {R}\). For \(d>1\), \(\mathcal {T}\) defines a multivariate time series.

A local segment of a time series is called a time series subsequence. A more formal definition is given next.

Definition 2

(time series subsequence) A time series subsequence of the kth dimension of a time series \(\mathcal {T}\) is a sequence of l contiguous elements of \(\mathbf {T}_k\), denoted as \(\mathbf {T}_{k}^{s:s+l-1} = \{T_{k,s}, \ldots , T_{k,s+l-1}\}\), where s is the starting position and l is its length. Time series subsequence and shapelet is used interchangeably.

As stated earlier, time series classification predominantly relies on the chosen distance (similarity) measure to compare and discriminate between instance pairs. Depending on the application domain and the nature of the time series, various distance measures can be used. For the case of a single time series dimension, k, we will consider the following two distance functions.

Definition 3

(time series distance) Given two time series \(\mathcal {T}\) and \(\mathcal {T'}\) of equal length l, the time series distance between their corresponding kth dimensions is the length-normalized Euclidean distance between \(\mathbf {T}_k\) and \(\mathbf {T}_k'\), i.e.:

$$\begin{aligned} Fdist(\mathbf {T}_k,\mathbf {T}_k') = ED (\mathbf {T}_k,\mathbf {T}_k') = \sqrt{\frac{1}{l} \sum _{i=1}^{l}(T_{k,i}-T_{k,i}')^2} \ . \end{aligned}$$
(1)

Finally, we define the distance between a time series subsequence and a time series.

Definition 4

(time series subsequence distance) Given a 1-dimensional time series \(\mathcal {S}\) and a d-dimensional time series \(\mathcal {T}\) of lengths l and m, respectively, such that \(l \le m\), the time series subsequence distance between \(\mathcal {S}\) and the kth dimension of \(\mathcal {T}\), is the minimum distance between \(\mathcal {S}\) and any subsequence of \(\mathbf {T}_k\) of length l, i.e.:

$$\begin{aligned} Sdist(\mathcal {S}, \mathbf {T}_k) = \min _{s=1}^{m-l+1} \{ Fdist (\mathcal {S}, \mathbf {T}_k^{s:s+l-1}) \} \ . \end{aligned}$$
(2)

Note that since a \(\mathcal {S}\) is 1-dimensional time series of length equal to the length of each subsequence \(\mathbf {T}_k'^{s:s+l-1}\), Fdist can be applied directly. Also note that the distance function must be length invariant in order to avoid penalizing longer sequences (Ye and Keogh 2009).

It is important to mention that in the two previous definitions, alternative distance measures can be used, instead of the Euclidean distance, without loss of generality. We have chosen Euclidean distance since it is almost exclusively used in the literature for building shapelet-based time series classifiers.

A collection of n time series \(\mathcal {D} = \{\mathcal {T}^{1},\ldots ,\mathcal {T}^{n}\}\) defines a time series dataset. For notational simplicity, we assume that every time series in \(\mathcal {D}\) has the same number of dimensions d and that each dimension is of the same length m. Although this is one of the most common settings for time series classification, the equal length assumption is not a requirement in general (Hu et al. 2013) and the presented method can in fact handle time series of varying lengths.

Supervised learning involves a set of training instances, usually denoted as the learning set, labeled with one of a finite set of possible values, which we denote as \(\mathcal {C} = \{c_1,c_2,\ldots , c_q\}\). For time series classification, the learning set consists of a collection of labeled time series. Next, we formally define a time series learning set.

Definition 5

(time series learning set) Given a time series dataset \(\mathcal {D}\) of size n and a set of class labels \(\mathcal {C}\), a time series learning set \(\mathcal {Z}_{\{\mathcal {D}, \mathcal {Y}\}}\) is defined as a vector of instance tuples \(z^{i} = (\mathcal {T}^{i}, y^{i})\), where each \(\mathcal {T}^{i} \in \mathcal {D}\) and each \(y^{i} \in \mathcal {C}\), \(\forall i \in \{1,\ldots n\}\). We use \(\mathcal {Y} = \{y^{1},\ldots , y^{n}\}\) to denote the vector of assigned labels for the time series.

Hence, time series classification can be seen as the task of learning a classification function from a dataset of d-dimensional time series \(\mathcal {D}\) to a set of labels \(\mathcal {C}\), such that the predicted class labels are as close as possible to the true time series class labels.

Definition 6

(time series classification function) Given a dataset \(\mathcal {D}\) of n time series and a finite set of labels \(\mathcal {C}\), a classification function is a mapping \(f : \mathcal {D} \rightarrow \mathcal {C}\), such that for each \(\mathcal {T}^{i}\in \mathcal {D}\)

$$\begin{aligned} f(\mathcal {T}^{i}) = \hat{y}_i\in \mathcal {C} \ , \forall i\in \{1,\dots ,n\} \ . \end{aligned}$$

Based on the above, the problem studied in this paper can be formulated as follows:

Problem 1

(time series classification) Given a time series learning set \(\mathcal {Z}_{\{\mathcal {D}, \mathcal {Y}\}}\) of size n, a finite set of labels \(\mathcal {C}\), and a loss function \(\mathcal {L}\), we want to learn a classification function \(f : \mathcal {D} \rightarrow \mathcal {C}\) that minimizes \(E_{(x,y) \in \mathcal {U}}[\mathcal {L}(y,f(x))]\), i.e., the expected loss on samples drawn from the (unknown) target distribution \(\mathcal {U}\).

In this work, we will consider the 0/1 loss function:

$$\begin{aligned} \mathcal {L}(y,y') = {\left\{ \begin{array}{ll} 0,&{} \text {if } y = y'\\ 1, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Our goal in this paper is to solve Problem 1 using this loss function for the case of multivariate time series, using the distance measures described above. The proposed approach is, however, generic enough for any type of data series, given a well-defined distance or similarity function.

4 Forests of generalized random shapelet trees

4.1 Background

Given a learning set, the standard decision tree learning algorithms (Quinlan 1993; Breiman et al. 1984) deterministically produce a classifier. For decision trees, however, it has been noted that minor changes of the learning set or parameter settings can have a fairly large impact on the resulting model. This variability can be remedied, or in fact benefited from, by generating sets of, rather than single, trees, as done by ensemble methods, among which bagging is one of the simplest (Breiman 1996). Other examples, such as random forests, apply randomization both to the learning set and the learning algorithm (Breiman 2001). In random forests, the first is done by employing bagging when building each tree and the latter by only selecting a (small) fixed number of random features to evaluate at each node.

In the traditional decision tree framework (Breiman et al. 1984), a shapelet tree, denoted as ST, is in essence generated by a combination of a decision tree learning algorithm and a feature extractor. In the case of multivariate time series, a shapelet \(\mathcal {S}\) is extracted from a selected dimension, say k, of a selected time series, say \(\mathcal {T}^i\); hence, it corresponds to a time series subsequence, i.e., \(\mathbf {T}_k\). Since it is highly essential to keep track of the dimension k as well as the time series index i, from which a shapelet is extracted, a shapelet is denoted as \(\mathcal {S}_{k,i}\). In the algorithm, each split consists of an shapelet and a distance threshold \(\tau \). The distance threshold is found by computing the distance between \(\mathcal {S}_{k,i}\) and all n time series of the \(k^{th}\) dimension in the dataset \(\mathcal {D}\), resulting in a list of n distances, i.e., \(\{Sdist(\mathcal {S}_{k,i}, \mathbf {T}_k) : \mathbf {T}_k \in \mathcal {T} \in \mathcal {D}\}\) (Ye and Keogh 2009, 2011).

The distance list is discretized, similar to the way numerical features are handled in decision trees (Quinlan 1993), i.e., by sorting the distances and at each possible split point \(\tau \) evaluating the impurity measure for a binary split. This results in partitioning the instances into two groups: one with minimum distance \({\le }\tau \) and one with minimum distance \({>}\tau \). Let \(\mathcal {D}^t\) denote the set of time series appearing in node t of the shapelet tree and let \(\mathcal {S}_{k,i}\) be a candidate shapelet. Then, \(\mathcal {D}^t\) is partitioned into two sets: \(\mathcal {D}^t_{L} = \{\mathcal {T} : \mathbf {T}_k \in \mathcal {T} \in \mathcal {D}^t, Sdist(\mathcal {S}_{k,i},\mathbf {T}_k)\le \tau \}\) and \(\mathcal {D}^t_{R} = \{\mathcal {T} : \mathbf {T}_k \in \mathcal {T} \in \mathcal {D}^t, Sdist(\mathcal {S}_{k,i},\mathbf {T}_k)>\tau \}\). A valid split point is the mean of two consecutive distances for which the associated instances are labeled differently. To evaluate the goodness of a split at a particular node t, an impurity measure is used. The impurity is defined as

$$\begin{aligned} Im(\mathcal {S}_{k,i}, t) = I(t) - \frac{n^{L}_t}{n_t} I\big (t^{L}\big ) - \frac{n^R_t}{n_t}I\big (t^R\big ) \ , \end{aligned}$$
(3)

where \(n^L_t = |\mathcal {D}^t_L|\) and \(n^R_t = |\mathcal {D}^t_R|\) denote the sizes of the two partitions emerging from node t, \(n_t\) the total number of time series instances in node t, and I(t) evaluates the goodness of a partitioning of node t. In this work, entropy is employed as goodness measure, i.e., \(I_{entropy} = -\sum _{c\in \mathcal {C}}p(c|t)\mathrm {log}_2\left( p(c|t)\right) \) since it has been shown to work well for both single decision trees, random forests (Breiman 2001), shapelet trees (Ye and Keogh 2009), and logical shapelets (Mueen et al. 2011). The selected shapelet is, hence, the time series subsequence that achieves the highest decrease in impurity when splitting the learning set at node t into two disjoint partitions using a time series shapelet from the kth dimension as a split attribute, i.e., \(\max \limits _{\mathcal {S}_{k,i}} Im(\mathcal {S}_{k,i}, t)\).

The most important factor limiting practical use of shapelets as discriminative patterns is the computational cost of evaluating shapelet candidates, especially in the case of multivariate time series. To overcome this limitation, the number of evaluated candidates needs to be reduced. One way of improving the computational (and predictive) performance of decision tree ensembles is to only evaluate a small number of attributes at each node (Ho 1998). In a similar way, the computational cost of building shapelet trees can be reduced by only evaluating a constant (small) number of, e.g., randomly selected, candidates, r, at each internal node. To reduce the variance of the generalized random shapelet trees, we adopt an ensemble approach for generating a set \(\mathcal {R} = \{ST_1, \ldots , ST_p\}\) of p random shapelet trees and combining their predictions using majority voting to determine the final class label. The proposed learning algorithm, as outlined in Algorithm 1, is elaborated in detail in the next section.

4.2 Generalized random shapelet forests

The generalized random shapelet forest (gRSF) algorithm (Algorithm 1) is a randomized ensemble method, which generates p generalized trees (using Algorithm 2), each built using a random selection of instances and a random selection of shapelets. In Algorithm 1, Sample draws a random sample of indices from the learning set \(\mathcal {Z}\), which we denote as \(\mathcal {Z}_{I_j}\) and defines the in-bag instances for the jth tree. Although, the function Sample can be implemented in a number of ways, the traditional bootstrap approach (Breiman 1996) is chosen here since, for each tree, several instances will be left out during training, i.e., the out-of-bag instances. The out-of-bag instances can subsequently be used to compute the performance of the forest using a subset of trees for which an instance is left out, enabling estimates of the running performance. Hence, Sample returns a vector of n indices drawn with replacement from the range [1, n]. Using this sample, the algorithm continues by generating the jth tree (sequentially or in parallel) using the function RandomShapeletTree and the instances included in the sample \(\mathcal {Z}_{I_j}\).

figure a

Each random shapelet tree is constructed using Algorithm 2. In the algorithm, TerminalNode returns true if certain conditions, which may vary between implementations, are met. Common conditions for traditional random forests, which are also employed here, include that all instances are labeled with the same class label, i.e., the node is pure, or that the number of instance is below a certain threshold (here we adopt \(|\mathcal {Z}| \le 2\)). If the termination condition is not met, Algorithm 2 continues by sampling r shapelets randomly from the learning set using the function SampleShapelet. This function, which introduces the second type of randomization mentioned above, can, again, be implemented in a number of ways of which we here adopt a straightforward approach. SampleShapelet is implemented to randomly and uniformly select a single time series from a randomly selected dimension k, \(T_{k}^{i} \in \mathcal {Z}\) and extract a shapelet \(S_{k,i}^{j,e}\) from \(T_{k}^{i}\) by uniformly selecting a length l in the range \(e=rand([l, u])\) and a start position in the range \(j=rand([1, m-l])\), where l and u denotes minimum and maximum shapelet sizes. Employing this function r times, where r is a parameter of the algorithm, results in a subset \(\mathbf {S}\) of candidates, where \(|\mathbf {S}| = r\).

figure b

The BestSplit function determines which shapelet, \(\mathcal {S}\in \mathbf {S}\), from the k:th dimension and distance threshold, \(\tau \), should be selected as the test condition at node t. The chosen test condition is subsequently used to separate the instances into two groups, those with a distance \(Sdist(\mathcal {S},\mathbf {T}_{k})\le \tau \) and those with a distance \(Sdist(\mathcal {S},\mathbf {T}_{k}) >\tau \). The utility of a split is determined by the information gain, i.e., Eq. 3, using entropy. To resolve conflicts in information gain, i.e., two or more thresholds with the same gain, we choose the threshold that maximizes the separation gap (Rakthanmanon and Keogh 2013):

$$\begin{aligned} \frac{1}{n_t^L}\sum _{\mathbf {T}_{k}\in \mathcal {D}_L}Sdist(\mathcal {S}, \mathbf {T}_{k}) - \frac{1}{n_t^R}\sum _{\mathbf {T}_{k}\in \mathcal {D}_R}Sdist(\mathcal {S}, \mathbf {T}_{k}). \end{aligned}$$
(4)

The SplitDataset function partitions the instances according to the chosen split point, i.e., \(\mathcal {Z}_L\) contains the instances with a distance less than or equal to \(\tau \) and vice versa for \(\mathcal {Z}_R\). The partitions are subsequently used to recursively build sub-trees. Finally, MakeLeaf returns a representation of a leaf in the generated tree by simply assigning the class label that occurs most frequently among the instances reaching the node, dealing with ties by selecting a label at random according to a uniform distribution.

4.2.1 Parameters and computational cost

From a computational point of view, the rational behind the random shapelet tree is that the cost of generating such trees is significantly lower than that of generating a traditional shapelet tree (Ye and Keogh 2009) and also lower than the cost of a Fast Shapelet Tree (Rakthanmanon and Keogh 2013) and a single shapelet ensemble tree (Cetin et al. 2015). Furthermore, from a bias-variance point of view, the rationale behind the random shapelet forest is that the randomization of the selected shapelets combined with the ensemble averaging is able to reduce the variance of variable base models and hence improve the generalization behaviour of the ensemble.

The total number of shapelet candidates in a 1-dimensional time series dataset with n instances of length m is \(\mathcal {O}(nm^2)\). Exhaustively comparing all pairs of candidates of equal length thus requires a runtime of \(\mathcal {O}(n^2m^4)\), which hence results in a computational complexity of \(\mathcal {O}(n^2m^3 \log nm^3)\) for the original shapelet tree algorithm (Ye and Keogh 2009). For the fast shapelet algorithm, an exhaustive search is only performed for a subset of r shapelets, found using a SAX approximation, reducing the complexity to \(\mathcal {O}(rnm^2)\) for a single node and \(\mathcal {O}(n^2rm^2 \log nrm^2)\) for a full tree. Similarly, the computational cost for a generalized random shapelet tree is \(\mathcal {O}(n^2rm^2 \log nrm^2)\). For both generalized random shapelet trees and fast shapelet trees the amortized computational cost is \(\mathcal {O}(n^2m^2 \log nm^2)\), since r is constant in both cases. In practice, however, generating a random shapelet tree is faster then building a fast shapelet tree, since the former randomly samples shapelets in constant time, whereas the latter performs a more costly SAX approximation (Rakthanmanon and Keogh 2013). Finally, the shapelet ensemble tree approach (Cetin et al. 2015) utilizes several speed-up techniques that, in practice, significantly improves the computational cost compared to the fast shapelet algorithm and could also be used to improve the performance of random shapelet trees. The worst case run-time, however, remains unchanged compared to the original shapelet tree algorithm (Cetin et al. 2015).

As summarized in Table 1, the random shapelet forest algorithm has three parameters that require tuning: r (the number of shapelets randomly selected at each node) and [lu] (the range of allowed shapelet sizes defined as fractions of m). It also has a final parameter, p (the number of constructed trees in the ensemble) that for most practical purposes does not require tuning (larger is better). The parameters r, [lu], and p have different effects: r determines the strength of the shapelet selection procedure, where a low number of selected shapelets results in a relatively high variance of the base models; while p determines the strength of the variance reduction of the ensemble model averaging. These parameters could be selected to suite the application in a number of manual or automatic ways using, for example, cross-validation, or employing the less costly out-of-bag error estimate. In Sect. 5.4, we empirically investigate the effect of different parameter configurations. Similar to what is commonly chosen for the traditional random forest algorithm, the default value for r is defined as the square-root of the total number of possible shapelets in a single time series, i.e., \(\sqrt{md(md+1)/2}\), and the default values for l and u are set to include shapelets of all possible lengths, i.e., as limited by the training set.

Table 1 The parameters of the defined random shapelet forest

5 Experimental evaluation

In this section, the empirical methodology is outlined together with the experimental design for evaluating the predictive performance of the gRSF algorithm compared to state-of-the-art univariate and multivariate time series classifiers. We also present experiments for exploring the effect of different parameter configurations for gRSF in terms of the bias-variance decomposition.

5.1 Experimental setup

For time series classification, the most common metric for evaluating the predictive performance of classifiers is by measuring the accuracy, i.e., the fraction of correctly classified instances (Hills et al. 2014; Ye and Keogh 2009; Gordon et al. 2012). We note, however, that other measures, such as the area under ROC curve (Bradley 1997) can be more suitable in some cases, e.g., when the class distribution differ between training and testing.

To explore and analyze how different parameter configurations affect the predictive performance, internal estimates for computing the average strength and the level of correlation in predictions by the individual classifier (based on the out-of-bag performance) are employed. For completeness, the definitions of these metrics are given in Appendix 1. Briefly, the strength measures the accuracy of each classifier in the ensemble based on the margin, i.e., the average number of votes for the right class that exceeds the average number of votes for any other class; and the correlation measures the dependence between classifiers. To further explore the gRSF, we also consider the bias/variance decomposition and compare how different parameter configurations impact the predicted probabilities. To make the paper self-contained, Appendix 1 provides a decomposition of the bias and variance in terms of the posterior probabilities assigned by the forest.

5.1.1 Hypothesis testing

The generally accepted procedure for inferring significant differences in terms of, e.g., classification accuracy, when comparing multiple classifiers over many datasets is the non-parametric Friedman test based on ranks (Demšar 2006). Using this test, the best performing algorithm receives a rank of 1 (for a particular dataset), the second best received rank 2, and so on, while ties are resolved by assigning the average rank. If the Friedman test allows for rejecting the null hypothesis (stating that there are no differences in performance between the methods), then a post-hoc test, e.g., a Nemenyi test, may be employed for identifying pairs of methods for which the difference is significant. Following Demšar (2006), the latter can be visualized by depicting the ranks of classifiers as points on a horizontal line, connecting points for which the difference is not significant, i.e., where the difference between the algorithms is less than a critical distance, which is dependent on the chosen level of significance. For a more complete description of the employed statistical tests, the reader may consult Demšar (2006).

5.1.2 Datasets

To evaluate the classification accuracy and run-time of the gRSF algorithm compared to the state-of-the-art univariate time series classifiers, we have selected 56 diverse datasets (see Tables 2 and 3). The majority of the datasets are from the UCR repository (Keogh et al. 2015) and the rest are from Lines et al. (2012).Footnote 2 The datasets cover a range of domains, commonly grouped into four categories: image outline classification, motion classification, sensor reading classification and simulated classification problems (Lines and Bagnall 2014). Similarly, to evaluate the performance of gRSF against the state-of-the-art multivariate time series classifiers, we have selected 14 datasets covering domains such as image outline classification, motion classification and sensor reading classification (Baydogan and Runger 2015).

Table 2 The predictive performance, as measured by accuracy, of gRSF compared to LTS, shapelet transformations and cDTW
Table 3 The predictive performance, as measured by accuracy, of gRFS compared to LTS, cDTW and SAX

5.1.3 Parameter optimization protocol

To avoid selection of sub-optimal parameter values for the baseline approaches, the results of gRSF are compared to results previously reported in the literature using the already provided training and test sets, for which typically the hyper-parameters have been optimized. For the presented method and the UTS datasets, we perform a grid-search over some of the parameters enumerated in Table 1 using the out-of-bag error rate of the gRSF, i.e., using only the training data, to identify the best performing configuration. To resolve conflicts between configurations with the same out-of-bag error rate, the correlation and strength square ratio, i.e., \(\bar{p}/s^2\), as defined in Appendix 1, is employed. To simplify replication of the experiment, the finally selected parameter configurations are listed in Tables 2 and 3. Since the generalization error of an ensemble is expected to decrease with an increasing number of members (cf. Fig. 6), the forest size is kept static at \(p=500\), which has been shown to be a suitable forest size for traditional random forests (Boström 2011). For univariate time series, the following configurations of upper and lower shapelet length, i.e., l and u, are used in the optimization: \(l=0.025, u=\{0.1, 0.2, 0.3, 0.4\}\); \(l=0.2, u=0.5\); \(l=0.3, u=0.6\) and \(l=\{0.6,0.7,0.8,0.9\}\), \(u=1\), i.e. 10 possible configurations. Furthermore, for multivariate time series, we here opt for using the default values for upper and lower shapelet length, i.e., all possible lengths. For both univariate and multivariate time series, the number of inspected shapelets is set to 1, 10, 50, 100, 500 and \(\sqrt{md(md+1)/2}\), yielding a total of 60 evaluated parameter configurations for univariate and 6 parameter configurations for multivariate time series. By using the out-of-bag error rate, we avoid more costly strategies, such as employing k-fold cross-validation.

5.2 Competing approaches

In this section, we list the baseline algorithms to which the presented approach is compared. In the univariate case, all competing algorithms, except cDTW, are based on shapelets. For multivariate time series, all alternative approaches are based on time series features extraction.

5.2.1 Nearest neighbours (univariate/multivariate)

The most widely adopted time series classifier is the nearest neighbour classifier, which have predominantly and successfully been used together with the constraint dynamic time warping distance measure (Sakoe and Chiba 1978; Ratanamahatana and Keogh 2004). We denote this approach cDTW. The nearest neighbour classifier requires the number of nearest neighbours k and DTW requires one additional parameter: the width of the band. Here, we adopt \(k=1\) using a cross-validation optimized DTW constraint. We opt for not optimizing the number of nearest neighbours, since the gain is often minimal (Bagnall and Lines 2014). For the univariate case, the predictive performance for cDTW in Tables 2 and 3 are taken from Keogh et al. (2015). For the multivariate case, the distance between two MTS is the sum of DTW distances between the associated UTS, similar to Baydogan and Runger (2014).

5.2.2 Shapelet trees (univariate)

Here we adopt the shapelet-based decision tree classifier with the F-stat measure (FST) proposed by Hills et al. (2014), since, in their work, the F-stat measure is able to both generate trees slightly faster and provide slight benefits in terms of classification accuracy compared to the traditional shapelet tree. The shapelet tree classifier requires two parameters: the upper and lower shapelet size, which were optimized using a subset of time series (Hills et al. 2014). The predictive performance in Table 2 is taken from Hills et al. (2014).

5.2.3 Fast shapelets (univariate)

The fast shapelet algorithm, which we denote SAX, uses symbolic aggregate approximations (SAX) to reduce the dimensionality of the shapelet search problem, improving the run-time of the shapelet-based decision tree algorithm with several orders of magnitude. The employed fast shapelet algorithm requires four parameters: the upper and lower shapelet size, the top-k shapelets for the approximations and a step-size. Additionally, SAX requires a number of parameters, e.g., the alphabet size. Here, we use the default setting. Predictive performance in Table 3 is taken from Rakthanmanon and Keogh (2013). Note that the predictive performance of SAX is not included in Table 2, since it provides an approximation of the single shapelet tree (FST) and, hence, gives similar results. Similarly, the single shapelet tree is not included in Table 3 due to computational constraints, instead we here include the approximation provided by SAX. Since the shapelet based ensemble (Cetin et al. 2015), provide comparable results to SAX, we here opt to include only the latter.

5.2.4 Shapelet transformations (univariate)

We also consider shapelet transformations (Hills et al. 2014). Since the transformation is independent from the used classifier, we here opt for the classifier that has been shown to give the best results, namely support vector machines (Hills et al. 2014). We refer to this approach as SVM. The transformation requires two parameters; the top-k shapelets to include and the number of clusters. Additionally, the chosen classifier might have parameters of its own that require tuning, e.g., kernel for SVMs. The predictive performance in Table 2 is taken from Hills et al. (2014), using a linear kernel.

5.2.5 Learning shapelets (univariate)

Finally, for the univariate case, we consider learning shapelets (LTS) (Grabocka et al. 2014), which instead of enumerating shapelets from a restricted pool of candidates in the learning set, consider them as parameters in an optimization problem. The LTS algorithm requires six hyper-parameters; the learning rate \(\eta \), regularization \(\lambda \), the number of iterations iter, the soft max parameter \(\alpha \), the scales of pattern lengths r and number of latent patterns k. The predictive performance using optimized parameters in Table 2 is taken from Grabocka et al. (2014) and in Table 3 from their supporting website.Footnote 3

Table 4 Predictive performance, as measured by accuracy, of gRSF compared to SMTS, cDTW, UFS, and LPS for the multivariate time series datasets

5.2.6 Symbolic representation for multivariate time series (multivariate)

For MTS, we consider a learned symbolic representation for multivariate time series (Baydogan and Runger 2014), which we will refer to as SMTS. SMTS provides a symbolic representation of a multivariate time series based on time index and values which are exploited to generate time segmented features using the terminal nodes in a decision forest. The features are subsequently exploited for making predictions using a second ensemble. The predictive performance in Table 4 is taken from (Baydogan and Runger 2014).

5.2.7 Learned pattern similarity (multivariate)

For the multivariate case, we consider learned pattern similarity (LPS), which models the dependency structure of time series based on the concept of local autopatterns, i.e. time segments, learned using an ensemble of regression trees. A similarity measure based on these patterns are subsequently used to make predictions using a nearest neighbour approach (Baydogan and Runger 2015). The feature extraction algorithm has two parameters; the number of regression trees and the number of segments. In Table 4, the predictive performance is taken from Baydogan and Runger (2015).

5.2.8 Ultra fast shapelets (multivariate)

Finally, we also include the the ultra-fast shapelet transformation (UFS) approach, which models the input data as a set of distances from randomly chosen shapelets (Wistuba et al. 2015). The algorithm has only one parameter; the number of random shapelets to sample. The predictive performance in Table 4 is in part taken from (Wistuba et al. 2015) and in part from experiments using our own implementationFootnote 4 with the recommended parameter configurations (Wistuba et al. 2015).

5.3 Empirical results

The performance of the gRSF algorithm is compared to the competing approaches both in terms of classification accuracy and run-time. The results are presented in three stages: in Sect. 5.3.1, we compare the predictive performance of gRSF against the competing approaches described in Sect. 5.2 for both univariate and multivariate time series. For univariate time series the comparison is first conducted for a subset of datasets commonly used when evaluating shapelet classifiers and then for the full set of datasets. In Sect. 5.3.2, the computational performance and scalability of gRSF is evaluated and compared to alternative methods. Finally, in Sect. 5.4, we investigate the effect of gRSF hyper-parameters in terms of the bias-variance decomposition.

5.3.1 Predictive performance

Univariate time series classification Since the computational cost for shapelet based classifiers have been a limiting factor for full utilization of the algorithms, most such classifiers (see Hills et al. 2014; Grabocka et al. 2014) have been evaluated on a subset of smaller datasets only. More specifically, 17 datasets from the UCR time series repository (Keogh et al. 2015) and 11 datasets from Hills et al. (2014) have usually been selected. To give an overview, the results for all algorithms and datasets are presented in Table 2. Statistical tests for determining whether differences in predictive performance are significant or not are undertaken by comparing the performance ranks of the algorithms.

In Table 2, we can see the classification accuracy of the different methods. By inspecting the ranks of the individual algorithms, on average gRFS and LTS are ranked second (with ranks 1.87 and 1.92, respectively) and SVM third (3) and cDTW and FST fourth (3.97 and 4.23). A significance test reveals that the observed differences in accuracy ranks significantly (\(p<0.001\)) deviate from what can be expected under the null hypothesis. For a significance level of 0.05 and 0.1, the critical distance is 1.15 and 1.04 respectively. As seen in Fig. 2, a post-hoc Nemenyi test reveals that there is a significant difference between gRSF when comparing against single shapelet-based decision trees (FST), cDTW and SVM for \(p<0.05\) and between gRSF and LTS compared to FST and cDTW for \(p<0.1\). Thus, for the lower significance levels, we can identify two main groups of algorithms; the performance of FST and cDTW are significantly worse than that of gRSF and LTS. The SVM approach, however, falls in between of the two groups; it cannot clearly be separated from the worst (or best) performing group at the lowest significance level. Since the difference between gRSF and LTS is minimal, the empirical findings for univariate time series classification indicate that when aiming for highest accuracy, either gRSF or LTS can be safely recommended.

Fig. 2
figure 2

Comparison of pair-wise Nemenyi tests for all classifiers used for the 28 univariate datasets in Table 2. Classifiers that are not significantly different at \(p=0.1\) (left) and at \(p=0.05\) (right) are connected

Fig. 3
figure 3

Comparison of pair-wise Nemenyi tests for all classifiers used for the 45 univariate datasets in Table 3. Classifiers that are not significantly different at \(p=0.1\) (left) and at \(p=0.05\) (right) are connected

Since the computational cost for gRSF is generally lower than that of other shapelet based algorithms, e.g., LTS (see Sect. 5.3.2), the second experiment include all 45 datasets from the UCR repository. In Table 3, the results for these datasets are presented for the gRSF algorithm and three state-of-the-art classifiers able to cope with these larger time series datasets, namely: cDTW, SAX, and LTS. As can be seen in Table 3, the two best performing classifiers, in terms of accuracy, are again the LTS and gRSF algorithms. This is also confirmed by the average performance ranks, where gRSF and LTS are ranked second (with ranks 1.76 and 1.89, respectively), while cDTW ranks 2.69 and SAX 3.67. For these datasets, the observed differences in accuracy ranks deviate significantly (\(p<0.001\)) from what can be expected under the null-hypothesis of no difference.

To investigate if there are any significant differences between pairs of classifiers, a post-hoc Nemenyi test is again performed (Fig. 3). Given the critical distances for the significance levels of 0.05 and 0.1 as 0.69 and 0.62, the post-hoc test reveals that there is no significant difference between gRSF and LTS for both significance levels. However, SAX is significantly less accurate than all alternatives (again, for both significance levels). Furthermore, at both significance levels, cDTW is less accurate than both LTS and gRSF. Partitioning the algorithms into groups for \(p<0.05\), we can identify two groups; the performances of SAX and cDTW are significantly worse than gRSF and LTS. Thus, since gRSF and LTS cannot be separated in terms predictive performance, either is expected to outperform the alternatives.

Multivariate time series classification Univariate time series classification is a special case of multivariate time series classification where each instance is described by a single time series. In many domains and settings, where more than one variable per instance evolves over time, this is however overly simplistic. Table 4 presents the accuracy for the evaluated multivariate time series classifiers. Looking at the accuracies, we can see that the gRSF algorithm predominantly performs either better or on par with the competing approaches. By inspecting the ranks of the individual algorithms, gRSF is ranked highest (with rank 1.86) and cDTW ranked last (with rank 4.18) and LPS, UTF and SMTS in between (with ranks close to 3). A Friedman test reveals that the observed differences in accuracy ranks deviate significantly (\(p<0.001\)) from what can be expected under the null-hypothesis of no difference.

Fig. 4
figure 4

Comparison of pair-wise Nemenyi tests for all MTS classifiers over the 14 multivariate datasets in Table 4. Classifiers that are not significantly different at \(p=0.1\) (left) and at \(p=0.05\) (right) are connected

To investigate if there are any significant differences between pairs of algorithms, a post-hoc Nemenyi test is again performed (see Fig. 4). The post-hoc test does not indicate any significant differences between gRSF, LPS, UFS and SMTS for neither the 0.05 nor 0.1 significance levels. cDTW is, however, significantly outperformed by gRSF (\(p<0.05\)), but not by LPS, UFS or SMTS. Partitioning the algorithms into groups for \(p<0.05\), we can identify two groups: the performance of cDTW is significantly worse than for gRSF. Since neither LPS, UFS or SMTS clearly belong to any of these groups, the experimental data is not sufficient to reach any conclusions regarding the relative performance of LPS, UFS or SMTS for multivariate time series classification.

5.3.2 Computational performance

In this section, we show that in addition to achieving state-of-the-art predictive performance for univariate and multivariate datasets, the gRSF algorithm is also competitive in terms of run-time performance. In the univariate case, we include shapelet based classifiers that have been designed for either speed (SAX) or predictive performance (LTS) and show that the gRSF is significantly faster than the latter and perform similar to, albeit is significantly more accurate, than the former. In the multivariate case, we include the classifiers with highest predictive performance, i.e., LPS, UFS and gRSF, and show that the run-time of gRSF is comparable to or better than the alternative approaches. For both UTS and MTS classification, the algorithms have approximately the same number of hyper-parameters and would, hence, be penalized in similar ways when performing grid search for the best setting. This cost cannot, however, be directly compared and is therefore left out.

In the experiments, we report training time relative to that of the gRSF algorithm, hence, a number larger than 1 indicates slower training than gRSF while a number smaller than 1 indicates faster training than gRSF. Since the training time for LPS only includes feature extraction and classification is instance based, i.e., has zero training time, we include both training and testing time in the multivariate case. For the comparison, we select a number of UTS datasets with representative properties and all available MTS datasets.

Univariate time series classification In the comparison of UTS classifiers, we have included SAX and LTS and used the default settings for all algorithms. For LTS this amounts to: \(\alpha =-30\), \(\eta =-0.1\), \(iter=1000\) and \(\lambda =0.01\), for gRSF: \(r=\sqrt{md(md+1)/2}\), \(l=0.025\), \(u=1\) and \(p=500\) and finally, for SAX: \(step=1\) and \(topk=10\). Table 5 shows that in the univariate case, gRSF is systematically at least three times faster than LTS, and in some cases also faster than SAX.Footnote 5 On average, over the selected datasets, the performance ratio shows that the gRSF algorithm is approximately 45 times faster than LTS and has a similar performance to SAX-based decision trees. Note, however, that we employ a rather large forest (500 trees) in the comparison, but a smaller one (consisting of less than 100 trees), which is generally faster to train, in most cases still outperforms SAX in terms of classification accuracy (see Karlsson et al. 2015).

Table 5 Relative training time of gRSF compared to LTS and SAX for univariate time series
Table 6 Relative training and testing time of gRSF compared to UFS and LPS for multivariate time series

Multivariate time series classification Comparing the MTS classifiers in terms of computational cost, we use 500 trees and 100 shapelets for gRSF; 500 trees and 10 segments for LPS and 10,000 shapelets for UFS.Footnote 6 Similarly to the univariate case, Table 6 shows that in the multivariate case, gRSF is competitive in terms of computational cost. On average, the gRSF algorithm is six times faster than UFS and eight times faster than the LPS algorithm.Footnote 7 One reason for the strong performance of gRSF in the multivariate case, compared to UFS, is that a dynamic (often lower) number of shapelets are required to construct a tree. Another, related, reason is that gRSF only computes the similarity between shapelets and time series from the same dimensions, whereas UFS finds the minimum distance to any dimension, increasing the number of comparisons with a factor d.

Fig. 5
figure 5

The reported relative training time (y-axis) as a function of increasing time series length (m) (right) and of increasing number of time series (n) (left)

To confirm the theoretical run-time (see Sect. 4.2.1), the computational performance of gRSF is also empirically evaluated in terms of how well the algorithm scales with increasing time series length (m) and increasing number of time series in the data set (n). For the former, univariate datasets with \(m>1000\) time points, and for the latter univariate datasets with \(n>1000\) instances were selected. Since the run-time of gRSF is unaffected by additional dimensions, only UTS datasets are used. In Fig. 5, the computational performance of running the gRSF algorithm is shown as the size of the training set (left) and time series length (right) increases with an increasing multiplier of \(0.1,\dots ,1\). As we can see, the gRSF algorithm tends to scale approximately linearly in the size of the learning set, i.e. n, and quadratically in time series length, i.e. m.

5.4 The effect of parameters

In this section, we analyze and discuss the effect of the random shapelet forest parameters, including the number of shapelets r, the shapelet length [lu] and finally the averaging strength of adding more trees to the ensemble, i.e., p. Since the effect of parameters is independent from the number of time series dimensions, we limit the analysis to univariate time series databases.

Fig. 6
figure 6

The evolution of the out-of-bag error rate with \(l=0.025\), \(u=1\) on 9 time series classification tasks. The vertical line denotes the default value of \(r = \sqrt{md(md+1)/2}\)

5.4.1 The shapelet selection strength r

The parameter r denotes the number of random shapelets to evaluate at each node of the forest. For a given problem, the smaller r, the greater the randomization of the trees and, hence, the weaker the output value depends on the structure of the tree. In the extreme, with \(r=1\), the tree is built using a single shapelet sampled independently from the target label, i.e., the trees are grown randomly in an uninformed manner. Figure 6, shows the evolution of the out-of-bag error rate for 9 time series classification data sets as we increase r. The default value of r is shown as a vertical line. In the figure, we see a number of trends, e.g., decreasing (Fish, Adiac, OSULeaf) and increasing (FaceFour). For these, the default value is clearly not the optimal choice. For the other data sets, however, the default value provides a good starting point for optimization. Since r provides a mechanism for balancing strength and correlation, it is an important hyper-parameter.

On average, over all datasets, when \(r=500\), the strength (s) and correlation (\(\bar{d}\)) (see Appendix 1) is \(s=0.36\) and \(\bar{d}=0.12\) and when \(r=1\), \(s=0.26\) and \(\bar{d}=0.1\), i.e., as expected, increasing r strongly increases the strength, but only slightly increases the correlation. By analyzing the bias, Fig. 7 (left) shows that it decreases as r increases, from 0.2 when \(r=1\) to 0.15 when \(r=500\). Similarly, as r increases the variability of the forests decreases, from 0.55 with \(r=1\) to 0.46 with \(r=500\). One reason for the increase in bias for forests with more randomization could be the fact that the tree generation results in trees predominantly voting for the majority class, hence producing highly biased probability estimates.

5.4.2 The shapelet sampling effect of l and u

The parameters l and u denote the fraction of shapelet lengths sampled. For a given problem, l and u defines the search space for suitable features and guides the algorithm towards global or local similarities. In the default case, every possible shapelet length is a possible candidate, but the choice can be specialized using the out-of-bag error rate. In Fig. 7 (right) the average out-of-bag error rate and bias and variance are shown for the average shapelet length when considering \(r=\sqrt{md(md+1)/2}\) shapelets at each node.

On average, the error is highest when limiting the search to very short or very long shapelets, indicating that a wider sampling range is needed to get sufficient variability among the trees. An analysis of the bias and the variance, as shown in Fig. 7 (right), shows that the bias decreases from 0.47 when limiting the search to short shapelets to 0.34 when increasing the sampled shapelet sizes. Similarly, the variance decreases as we limit the search to longer shapelets. As seen in Fig. 7 (right), the setting that minimizes the trade-off between bias and variance is, however, inspecting shapelets shapelets of all possible lengths, explaining its superior performance on average. We note, as shown in the result tables, that the parameter requires tuning to find the best performing model.

Fig. 7
figure 7

The shapelet sampling effect of l and u (right) and the average effect of increasing the number of shapelets r (left)

Fig. 8
figure 8

The averaging strength of p

5.4.3 The averaging strength of p

It is well known that the prediction error is a continual, asymptotically, declining function of the number of members in an ensemble. The latter corresponds to the parameter p, which denotes the number of trees in a random shapelet forest. Hence, the higher the value of p the better the ensemble will perform, essentially rendering the choice of p to be a compromise between computational efficiency and accuracy. Fig. 8 shows the effect of p on the out-of-bag error, bias and variance for a number of data sets (with \(r=\sqrt{md(md+1)/2}\)). In general, we can see that while randomization increases the bias and variance of individual trees, the variance due to randomization can be canceled out by averaging over a (large) ensemble.

6 Conclusions

In this paper, we have proposed and investigated a novel approach for both univariate and multivariate time series classification; the gRSF algorithm, building on the well-established shapelet tree algorithm (Ye and Keogh 2009) and the random forest algorithm (Breiman 2001). The use of shapelets has been shown to be an important approach for time series classification tasks, but current methods have been either too slow, e.g., learning time series shapelets or shapelet transforms, too inaccurate, e.g., fast shapelets, or both, e.g., single shapelet-based decision trees. We have demonstrated that by combining several weak randomized shapelet-based decision trees into an ensemble, substantial gains in predictive and computational performance can be achieved compared to generating single shapelet-based decision trees. Moreover, the performance of gRSF is comparable to the current state-of-the-art in terms of accuracy for both univariate and multivariate time series classification, but with substantially lower computational cost. The results from the presented extensive empirical investigation suggest that the proposed algorithm is among the strongest classifiers currently available for both uni- and multivariate time series.

Although the proposed algorithm requires tuning of a set of parameters, the number of such parameters are comparable to, or fewer, than for alternative methods. Most importantly, the use of out-of-bag examples allows for tuning the parameters without requiring that a computationally costly cross-validation procedure is employed or that a validation set is put aside. The latter is not as computationally costly as the former, but may negatively affect predictive performance due to leaving fewer examples for model construction. The parameters of gRSF include the number of trees to generate (a common parameter for most ensemble methods), the number of sampled shapelets, for controlling the strength of the random shapelet selection, and also a secondary pair of parameters for controlling the size of the evaluated random shapelets. The empirical investigation has shown that the predictive performance can be greatly improved by optimizing the parameters of the algorithm, reaching state-of-the-art predictive performance for both univariate and multivariate time series classification. Furthermore, by using a bias and variance decomposition of the (squared) prediction error, the ingredients making the algorithm performing well have been investigated, namely the trade-off between the strength of each individual tree and the diversity among the trees.

One advantage of shapelet based decision trees, similar to traditional decision trees, is the possibility to interpret the generated model and gain insights into predictions. For traditional random forests, the importance of factors contributing to predictions can be computed using various methods. A similar procedure was investigated for the univariate version of the random shapelet forest in Karlsson et al. (2015), where the importance of regions and shapelet sizes can be used to support interpreting the model. Since this method is limited to globally important regions and not shapelets, one direction for future work concerns alternative approaches for interpreting random shapelet forests, e.g., by providing the importance of specific shapes, or in the multivariate case, specific dimensions. Other directions for future work include investigating approaches for handling multivariate and heterogeneous time series consisting of both discrete and numerical data streams.