1 Introduction

Time series is a sequence of data changing with time order, so it is high-dimensional, consecutive and infinitely increasing. In recent years, time series classification has attracted rising research interest, and has been widely used in many application domains, such as medical diagnosis [19, 27, 31], disaster prediction [38], industrial production control [39], financial market [32] and community discovery [7]. At present, time series classification has many new extensions, of which early classification on time series data is becoming increasingly important. Early classification on time series can be used in many time-sensitive fields, including but not limited to video surveillance, intrusion detection, earthquake warning, early diagnosis, and action recognition [23, 28]. For example, in the early diagnosis of heart disease, abnormal ECG signals may indicate a specific heart disease that needs immediate treatment. Early diagnosis is critical in applications such as intensive care. If a classification model that can make early diagnosis as soon as early of ECG time series is available, the patient with the heart disease can get early treatment.

The goal of early classification on time series is to make prediction as early as possible provided that the accuracy is comparable to the full length. Therefore, there are two requirements in early classification on time series. First, the algorithm should confirm at the earliest time of reliable classification, and thus the predictions could be used for next steps. Second, the classification accuracy of the classifier trained on partial data should achieve comparable performance to that of the classifier trained on the entire data.

It is challenging for early classification on time series to construct an effective classifier. For time series classification with complete data, it is feasible to extract the required features from the whole data for constructing an accurate classifier. However, this does not work for the early classification task as many features are unavailable due to the very limited observed time series data. Xing et al. [34, 35] proposed Early Classification on Time Series (ECTS) model to tackle the problem that conventional classifiers lack earliness. ECTS is a nearest neighbor based classifier, and uses minimum prediction length (MPL) to find a stable prefix subspace to make prediction, but the prediction result lacks interpretability. In [37], Ye et al proposed a new data mining primitive, called time series shapelets. Shapelets are time series subsequences which are in some extent maximally representative of a class. Shapelets could make the classification more accurate, interpretable, and efficient. Following this work, several works also tried to use shapelets for time series classification [11, 20, 22]. Xing et al. [36] constructed an early classifier by extracting the shapelets and the classification result of their method is more accurate and interpretable.

There are mainly two types of works that use shapelets for time series classification. One is to first extract all the shapelets and then select useful shapelets by certain feature selection strategy [36], while the other is to generate shapelets by optimization methods [11]. In this paper, we focus on improving the first type of work. A major limitation of exsting works is that they will produce a large number of redundant shapelets, and thus lead to low efficiency. Another issue of current works is that they do not consider the diversity of shapelets. If we keep the diversity of shapelets, we will make the shapelets more distinguishable and achieve more accurate classification.

In this paper, we propose an Improved Early Distinctive Shapelet Classification method named IEDSC for more accurate early classification on time series. Specifically, we first present a new trend-based Euclidean distance to measure the similarity between two time series. Next we propose to prune the shapelets based on the estimated starting positions. Because shapelet is a subsequence of time series, it has a starting position and end position. If we could obtain starting positions of high quality shapelets and extract shapelets near these starting positions, the number of shapelet candidates can be reduced greatly. Further, we define an extended self-similarity and propose a diverse shapelet selection method to make the extracted shapelets more diverse. To more clearly show the novelty of IEDSC method, we compare it with four candidate methods under five characteristics as shown in Table 1.

Table 1 Comparison of five classification methods

The main contributions of this paper are summarized as follows:

  • A new similarity measure for time series is proposed by taking the relative trends of two time series into consideration.

  • A shapelet pruning technique is proposed to effectively prune shapelet candidates. It first estimates the starting positions of the shapelets and then extracts shapelets from these starting positions, such that only the shapelets near the estimated starting position are extracted. Thus the number of shapelet candidates can be significantly reduced.

  • A new shapelet selection method is proposed to effectively remove the similar shapelets so as to make the selected shapelets more diverse.

  • Extensively experiments on a bunch of real datasets verify the effectiveness and efficiency of our proposal.

The rest of this paper is organized as follows. In Section 2, we review the related work. We introduce some definitions and preliminaries in Section 3. In Section 4, we describe the feature extraction method. We introduce the feature selection and early classification method in Section 5. We demonstrate our experimental results in Section 6. Finally, we conclude the paper in Section 7.

2 Related work

In recent years, early classification on time series has received increasing research attention in the community of machine learning [15, 26, 33]. However, early classification has its own uniqueness, such as the temporal correlation in the data [24, 30]. Xing et al. [34, 35] for the first time formulated early classification problem explicitly. They proposed the concept of minimum prediction length (MPL), and then presented an early classification algorithm called ECTS. During model training, ECTS needs to calculate the MPL for each sequence in the training set. To predict the class label of an unlabeled time series, at each time stamp ECTS searchs its current nearest neighbor under the condition that the MPL of nearest neighbor cannot be larger than the current time stamp. If the nearest neighbor meets the condition, the prediction can be made in advance, otherwise, it should wait for more incoming data.

Some early classification methods simply train a classification model with the time series data at the early stage, and design criteria to judge whether the prediction result is reliable. Lin et al. [21] proposed an early classification method that utilized the hidden markov models (HMM) to classify the time series. It was evaluated at each time stamp and could get the best results when the complete time series was obtained. This approach lacks flexibility, because the prediction is always done at time stamp which is consistent with the length of training time series, and thus limits the application of the model [9]. Ghalwash et al. [10] proposed a specially designed hybrid model, which integrated HMM and support vector model (SVM). Although this model can make more reliable prediction by setting threshold value, its limitation is that plenty of samples are required for training.

Mori et al. [25] proposed an algorithm Early Classification of Time Series based on Discriminating Classes Over Time (ECDIRE). In the training phase, ECDIRE analyzes the discrimination of each class at each time stamp, and selects some certain time stamps that the classification accuracy of each class should exceed a predefined threshold. In the prediction phase, the prediction should be made at or after these time stamps. Ando and Suzuki [1] proposed an optimization-based learning method for timely prediction, which could directly minimize an empirical risk function and response time to achieve the minimal prediction risk without a user-defined trade-off between accuracy and earliness. Parrish et al. [26] proposed optimal and practical decision rules for classifying incomplete data. In [26], the class label of unclassified data is obtained only if the reliability threshold is met. The reliability threshold is used to ensure that the predicted class label of incomplete data would be the same as the label assigned for the complete data.

For early classification, the classification model should achieve a good trade-off between earliness and accuracy. Mori et al. [24] presented an approach for early classification of time series based on combining a set of probabilistic classifiers together with a stopping rule. The rule could decide when to make a prediction or when to wait for more data, which ensures the prediction accuracy and reliability. Romain et al. [27] proposed an early classification algorithm which can optimize classification accuracy and decision delay cost, and can point out the best time to make early classification. Hartvigsen et al. [12] proposed an early classification model, called EARLIEST. The model consists of the novel pairing of a recurrent discriminator network with a stochastic policy network, with the latter learning halting-policy as a reinforcement learning task. Schäfer and Leser [29] presented TEASER that modeled eTSC as a two-tier classification problem. The first tier periodically computes class probabilities and the second tier decides reliability of the predicted label. Once the reliability attains a threshold, the prediction is made.

Some early classification algorithms achieve good classification accuracy, but the drawback is the lack of interpretability. Ye et al. [37] proposed a new primitive, called time series shapelet. Shapelet can make classification faster, more accurate and interpretable. Therefore, there are some works that use shapelet for time series classification. Xing et al. [36] proposed the early distinctive shapelet classification (EDSC) algorithm. EDSC used shapelet with earliness to classify new time series, but its performance is undesirable when applied on larger dataset. Karlsson et al. [16] proposed a random forest algorithm based on shapelet for early classification, which can obtain high accuracy and earliness. As EDSC did not estimate the confidence of the prediction result, Ghalwash et al. [9] presented the modified EDSC with an uncertainty estimates (MEDSC-U) algorithm, which estimated the temporal uncertainty associated with the prediction. Wang et al. [32] proposed a new end-to-end deep learning framework Earliness-Aware Deep Convolutional Networks(EA-ConvNets) for early classification on time series. This framework can jointly perform feature learning and a dynamic truncation model learning to help deep feature learning architecture focusing on the early parts of each time series.

There are many researches on early classification for multivariate time series, Galwash et al. [8] proposed an early classification method called multivariate shapelets detection(MSD) for multivariate time series. MSD extracts shapelets from all dimensions of the multivariate time series. The shapelet in each dimension starts from the same position and has the same length. Furthermore, MSD uses weighted information gain based utility score to evaluate the effectiveness of shapelets, which indicates the precision and earliness of shapelets. MSD is developed based on an assumption that all the multivariate shapelets have the same starting position and the same length, making many shapelet candidates lost. He et al. [13] proposed an early classification method on multivariate time series, called Mining Core Feature for Early Classification (MCFEC). In MCFEC, a new shapelet quality assessment method was proposed, so as to ensure the accuracy and the earliness. Furthermore, MCFEC designs two methods for building an early prediction of the class of unknown multivariate time series object. Further, He et al. [14] proposed an adaptive classification ensemble method to deal with early classification on inter-class and intra-class imbalanced MTS data. They first design an adaptive ensemble framework to learn an early classification model. Second, they introduce a cluster-based shapelet selection method to obtain optimal shapelets. Finally, they design an associate-pattern mining approach to learn base classifiers.

3 Definition and preliminaries

In this section, we will give some terminology definitions and the preliminaries.

3.1 Definition

Definition 1

Univariate Time Series. Univariate time series T = {t1,t2,t3,…,tL} is an ordered set of L real-valued variables. Data points t1,t2,t3, …, tL are typically arranged by temporal order, and ti is the value at time stamp i.

Definition 2

Subsequence of time series. Given a time series T of length L, a subsequence s of T is a sampling from l(lL) continuous positions of T. That is, s = {tp,…,tp+l− 1}, 1 ≤pLl + 1.

Definition 3

Distance between the time series. Given two time series T and R with the same length L, the distance between T and R can be denoted by Dist(T,R). By using Euclidean distance as measure, Dist(T,R) is calculated as follows.

$$ Dist(T,R)=\sqrt{\frac{1}{L}\sum\limits_{i=1}^{L} (t_{i}-r_{i})^{2}} $$
(1)

Definition 4

Shapelet. Shapelet is a time series subsequence which is in some sense maximally representative of a class, and can be represented by a triple f = (s,δ,c), where s is a time series subsequence, δ is a distance threshold, and c is the class label.

Definition 5

Best Match Distance (BMD). The best match distance between shapelet f = (s,δ,c) and time series T is defined as:

$$ BMD(f,T)=\min(Dist(s^{\prime},s)),s^{\prime}\in s_{T}^{|s|} $$
(2)

where, \(s_{T}^{|s|}\) is a collection of subsequence with length |s| in T.

Given a shapelet f = (s,δ,c) and a time series T, if BMD(f,T) ≤ δ, T is classified to class c.

Definition 6

BMD-list. Given a shapelet f and a training dataset D, which contains r time series, BMD-list is a list of the BMDs between f and the time series in D, sorted in a non-descending order, as shown in Eq. 3:

$$ V_{f}=\{(d_{1},c_{1}),(d_{2},c_{1}),\ldots,(d_{r},c_{2})\} $$
(3)

where di = BMD(f,Ti), TiD, didj for i < j, and ci is the class label of time series Ti.

Definition 7

Earliest Match Length (EML). Given a shapelet f = (s,δ,c) and a time series T, EML is defined as the minimal identifiable length of T. It means f could classify the time series T using its prefix from the beginning to the position EML(f,T).

$$ EML(f,T)=\min\limits_{len(s) \leq i \leq len(T)} (Dist(T[i-len(s)+1,i],s)\leq \delta) $$
(4)

EML measures the earliness of f in correctly classifying T. If f cannot classify the time series T, we have \(EML(f,T)=\infty \).

Definition 8

Weight Recall (WRecall). WRecall is defined to measure the earliness of shapelet f on a training dataset D.

$$ WRecall(f)=\frac{1}{\lVert D_{\bar{c}} \rVert}\sum\limits_{T \in D} \frac{1}{\sqrt[\alpha]{EML(f,T)}} $$
(5)

where α is used to control the importance of earliness, and \(\lVert D_{\bar {c}} \rVert \) represents the number of non-target time series.

Definition 9

Precision. Given a shapelet f = (s,δ,c) and a training set \(D^{\prime }\), precision is defined as the proportion of time series in \(D^{\prime }\) that f can classify correctly.

$$ Precision(f)=\frac{\lVert BMD(f,T) \leq \delta \land C(T)=c \rVert}{\lVert BMD(f,T) \leq \delta \rVert},T \in D $$
(6)

Definition 10

Utility. Utility is defined as the quality score of shapelet f = (s,δ,c). A higher utility means a better quality of the shapelet.

$$ Utility(f)=\frac{2\times Precision(f) \times Wrecall(f)}{Precision(f)+Wrecall(f)} $$
(7)

3.2 Preliminaries

The concept of shapelet was first proposed in 2009 by Ye and Keogh [37]. Shapelet is a time series subsequence, which is in some sense maximally representative of a class. Ye and Keogh proposed a brute force search method, which can be used to find the optimal shapelet. However, early classification on time series needs to find multiple effective shapelets. Most shapelet based early classification algorithms first extract and estimate all the shapelets, and then select some good shapelets through a selection strategy. In shapelet extraction phase, two parameters minL and maxL need to be set first, and then extract all the possible shapelets from the training set whose lengths are between minL and maxL. The procedure of generating the shapelet candidates is shown in Algorithm 1.

figure a

Given a training dataset D, Algorithm 1 extracts all the shapelet candidates (lines 2-14). In line 6, the distance threshold and utility of each candidate are obtained by Generateshapelet() (shown in Algorithm 2). In lines 7-11, we remove part of shapelet candidates according to a precision threshold. At last, Algorithm 1 returns a shapelet candidates set (line 15).

figure b

Algorithm 2 describes how to generate shapelet. It first computes the bmdlist of each shapelet (lines 1-6), and obtains delta according to bmdlist. Then, the precision and wRecall are computed by computePrecision() and computeWrecall() (lines 7-8). Finally, the utility of each shapelet is obtained, which indicates the quality of shapelet (line 9).

4 Feature extraction

The workflow of IEDSC is shown as Figure 1. First, we generate shapelets from time series training data, by using the generation methods as shown in Algorithms 1 and 2 in Section 3. Algorithm 1 returns a shapelet candidates set, while Algorithm 2 describes how to generate shapelet. Second, we make feature extraction from the generated shapelets candidates which is presented in Section 4. Concretely, we propose the Trend-based Euclidean distance and shapelet pruning technique. Third, we select features from the extracted features and propose the diverse shapelet selection method in Section 5. Finally, we use the selected feature to make early classification on testing time series data.

Figure 1
figure 1

Workflow of the proposed algorithm IEDSC

In the feature extraction phase, in order to evaluate each shapelet, we need to compute the similarity between a shapelet and a time series. Most algorithms use Euclidean distance to measure the similarity. However, Euclidean distance simply calculates the point-to-point distance, without considering the trend of time series. In order to more precisely measure time series similarity we propose a trend-based Euclidean distance. In addition, some shapelet extraction methods extract all the shapelets directly, which can be time consuming. If the dataset is large, the shapelet candidate space will become extremely large. To solve this problem, we propose a shapelet pruning method, which can effectively reduce the number of shapelet candidates.

In this section, we first present the trend-based Euclidean distance, and then introduce the distance threshold calculation method. Finally, we introduce the shapelet pruning technique based on the prediction on the starting position.

4.1 The trend-based Euclidean distance

To measure the similarity of time series, we need compute the distance between two time series, which usually presents some trend changing over time. To take the changing trend of the time series data into consideration, we propose a trend-based Euclidean distance computation method. The idea is that if two time series present similar trend and the distance between them is small, we consider that the two time series are similar. The similarity can be calculated by:

$$ TDist(T,R)=\sqrt{\frac{1}{L} \left( \sum\limits_{i \in I_{s}} (t_{i}-r_{i})^{2}+\sum\limits_{j \in I_{\bar{s}}} (t_{j}-r_{j})^{2}*{\lambda}\right)} $$
(8)

where T and R are two time series with the same length L. ti, ri, tj, and rj represent the value of time series at time stamp i or j. All the time stamps in Is indicate that the size relationship between T and R at time stamp i are the same as the last time stamp. \(I_{\bar {s}}\) is the opposite of Is.

The specific steps of trend-based Euclidean distance computation algorithm are shown as Algorithm 3. Given two time series, we first calculate the distance between the data points of the two sequences at the first time stamp and set it as the initial value of dist, and then calculate the size relationship between them (lines 1-2). We use sgn() to represent the size relationship. sgn() is a symbolic function that returns the positive or minus of a parameter. When calculating the distance between two points of two sequences at time stamp i , it is necessary to consider the size relationship between the two points at the previous time stamp i − 1. If the size relationship between two points at time stamp i of two sequences is the same as that at time stamp i − 1, the distance between points of the two sequences at time stamp i is added to dist directly; otherwise the distance should be punished with the parameter λ ∈ [1,2], and then added to dist (lines 3-9). The distance between two sequences will increase with the increase of λ, so the distance between two shapelets with similar trend will be small. Therefore, the similarity measure is more accurate. If the two sequences have similar trend, which means that the size relationship between two points at each time stamp of two sequences could be the same. If two points at any time stamp do not satisfy this condition, the extent of similarity is diminished. So we amplify the distance between the two points. Finally, the algorithm returns the final distance (line 12).

figure c

Example 1

Suppose there are 5 time series with length 10, A = {0.5,0.9,2,2.5,4.5,3.9,1.2,− 0.5,− 0.3,0.6}, B = {1,1.1,2.4,2.8,5.5,4.4,1.6,0,0.2,2.1}, C = {0,0.7,1.6,2.2,3.5,3.4,0.8,− 0.8,− 0.8,0.2}, D = {0,0.5,2.4,2.8,5.5,4.4,0.8,− 0.2,0.2,0.8}, E = {0.1,1.4,2.3,2.9,4.3,4.2,0.7,0.5,0.1,0.1}.The trend relationship between A and the others is shown in Figure 2.

Figure 2
figure 2

Trend relations between A and other four time series

In Figure 2, if the Euclidean distance is used to calculate the distance between sequence A and the others, the distances are the same as the \(\sqrt {0.245}\). But from Figure 2 one can see that the two curves in Figure 2 (1)(2) are closer than that in Figure 2 (3)(4). If we use Algorithm 3 to compute the trend-based Euclidean distance between the sequences, their distances are different. The largest distance is the distance between curves A and E, which is \(\sqrt {0.247}\). The distance between curves A and B is the same as the distance between curves A and C, which is \(\sqrt {0.245}\). The distance between curves A and D is \(\sqrt {0.246}\). This example shows that our proposed trend-based Euclidean distance can better measure the similarity between time series compared with Euclidean distance.

4.2 Compute the distance threshold

The distance threshold is a very important attribute for shapelet, which is used to classify time series. Before calculating the distance threshold, we first get the BMD-list of shapelet, and then choose the best threshold according to the evaluation strategy.

Ye et al. [37] used information gain to evaluate the quality of the threshold, which divided the training dataset into two subsets. A higher purity of the two subsets leads to a higher information gain, indicating that the quality of this threshold is better. However, the limitation of this method is that it is prone to overfitting. Xing et al. [36] used kernel density estimation [6] and Chebyshev inequality to compute the distance threshold. Given a shapelet f = (s,δ,c) and a time series T, if BMD(f,T) ≤ δ, the kernel density estimation can guarantee the probability of time series T belonging to the target class is higher than a user-defined threshold. The Chebyshev inequality can guarantee the probability of the time series belonging to non-target class is lower than a user-defined threshold. He et al. [13] considered the accuracy and recall simultaneously, and obtained the best threshold when the harmonic mean of accuracy and recall was maximized. We use kernel density estimation to calculate the distance threshold of shapelet, because the distance threshold obtained by kernel density estimation is more accurate.

Given a sample y = {x1,x2,…,xn}, the kernal density estimation of f(X = x) can be calculated as follows:

$$ f(X=x)=\frac{1}{nh}\sum\limits_{i=1}^{n} K\left( \frac{x-x_{i}}{h}\right) $$
(9)
$$ K\left( \frac{x-x_{i}}{h}\right)=\frac{1}{\sqrt{2\pi}}e^{-\frac{(x-x_{i})^{2}}{2h^{2}}} $$
(10)
$$ h=1.06\delta n^{-\frac{1}{5}} $$
(11)

where K is a Gaussian kernel function, h is a smoothing factor, and δ is the standard deviation of the sample.

If the training set contains m classes , c ∈{1,…,m}, for a threshold x, we can get the probability of x belonging to class i as follows.

$$ Pr(c_{x}=i|X=x)=\frac{p_{i}f_{i}(x)}{{\sum}_{k=1}^{m} p_{k}f_{k}(x)} $$
(12)

where pk is a prior probability of class k .

Given a shapelet f = (s,?,c) and the BMD-list of f, we can use kernel density estimation to obtain a distance threshold δ, satisfying that if BMD(f,T) ≤ δ, we have Pr(cx = c|X = x) ≥ β. Here, T is the time series in training set and β is a user-defined probability threshold.

4.3 Shapelet pruning based on the estimated starting position

In the conventional method of shapelet extraction, the number of shapelet candidates is extremely huge, and it discovers optimal shapelet by estimating the whole shapelet candidates. As it is time consuming to scan all the candidates, we argue that we do not need to estimate some shapelets with bad quality, and pruning them can significantly reduce the computational complexity. In this paper, we propose a shapelet pruning algorithm based on the prediction on the starting position. The basic idea is to estimate the starting positions of shapelets with good quality and then extract shapelets from the starting positions. Here we preset a user-defined parameter, stepSize, which is used to change the length of shapelets in shapelet extraction. In our method, we estimate the start positions according to part of shapelets, and the lengths of shapelets vary. In order to make the number of extracted shapelets appropriate, we extract part of shapelets with different lengths according to stepSize. The pruning process is as follows. First, we extract all the shapelets with length minL, and then extract all the shapelet candidates with length minL + stepSize. Next time we add stepSize to the previous shapelet length until maxL. Second, we evaluate the extracted shapelets candidates and then select the better shapelets from these shapelet candidates. We take one of the extracted shapelets candidates as a feature denoted by f1. All training examples covered by f1 are marked. We consider the remaining unmarked shapelet candidates that can cover some training examples as covered, and iteratively select the shapelets candidates as features. The iteration continues until all the training examples are covered to obtain better shapelets. Finally, we extract the starting positions of these shapelets, and only keep the shapelets with lengths between minL and maxL from these starting positions. Note that these lengths do not include the length that has been previously extracted. Thereby we achieve the pruning of shapelets. The pruning method is described as Algorithm 4.

figure d

Given a training set, we first calculate the length array lenArr of the shapelet according to the parameter stepSize(lines 3-9). The function FirstExtractShapelets() only returns the shapelets with lengths in the length array lenArr (line 10). SelectionShapelet() and GetStartPos() are used to select the better shapelets and obtain the starting positions of these better shapelets(lines 10-12). After that, according to the starting positions, we extract all the lengths of the shapelets, but those extracted shapelets are no longer extracted (line 13). Finally, lines 14-15 merge the two sets valShapelets and shapeletCandidates, and return the final shapelet candidates .

Given a time series with length 80 as shown in Figure 3, suppose minL = 5, maxL = L/2, where L is the length of the complete time series. According to Algorithm 4, assuming stepSize is 8, we first compute the length array [5,13,21,29,37,40]. We extract shapelets with lengths in the length array and evaluate these shapelets. Then we select the better shapelets (the red sequence in Figure 3) and obtain their starting positions (the red dots in Figure 3). Finally, we start to extract shapelets from these positions, for reducing the number of shapelets.

Figure 3
figure 3

An illustration of the starting position estimation of shapelets

In the feature extraction step, we assume all the shapelet of length between minL and maxL. Suppose we have N time series with length L, and a parameter stepSize which is used to obtain lenArr. For a shapelet with length k, the computation cost is O(kN(Lk + 1)). We need extract shapelets twice, the cost of first extraction is \(O({\sum }_{k \in lenArr}kN^{2}(L-k+1)^{2})\). Supposing we obtain M starting positions from the first step, the cost of the second extraction is \(O(M {\sum }_{k=minL}^{maxL} kN^{2}(L-k+1))\). The total cost of extraction is \(O({\sum }_{k \in lenArr}kN^{2}(L-k+1)^{2}+ M {\sum }_{k=minL}^{maxL} kN^{2}(L-k+1))=O(N^{2}L^{4} )\).

5 Feature selection and early classification based on shapelet

Feature selection is one of the key steps of IEDSC, and the quality of shapelets determines the accuracy of classification. For early classification, good features should be frequent, discriminative and of good earliness, and the selected features should be diverse. To this end, we propose a new feature selection method.

Xing et al. [36] uses a greedy method to select shapelets, and the steps are as follows. First, sort all the shapelets in non-ascending order according to the quality score. Second, select the shapelet f = (s,δ,c) with the highest score, and mark all the samples covered by f in the training set. Here, the covering means BMD(f,T) ≤ δc(T) = c. Third, select the shapelet \(f^{\prime }=(s^{\prime },\delta ',c^{\prime } )\) with the second highest quality score, and repeat the covering operation on training examples that are not marked. The above steps are repeated until all training examples are marked.

Although the greedy method is simple, it has some disadvantages. The shapelets extracted by this method may be highly similar, and it might ignore some useful shapelets, leading to undesirable classification accuracy. In order to tackle this problem, we design a selection method according to the concept of non-self match [18] and trivial match [2, 3].

Definition 11

Non-Self Match. Given a time series T, containing a subsequence C of length n beginning at position p and a matching subsequence M beginning at q, we say that M is a non-self match to C at distance of Dist(M,C) if |pq|≥ n.

Definition 12

Trivial Match. Given a time series T, which contains a subsequence C with a beginning position p and a matching subsequence M with a beginning position q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence\( M^{\prime }\) beginning at \(q^{\prime }\) such that \(D(C, M^{\prime })> \mathfrak {R}\), and either \(q < q^{\prime } < p\) or \(p < q^{\prime } < q\).

For two shapelets, if they derive from the same time series and their starting positions are nearby, they should be similar. The similarity between shapelets is shown in Figure 4. In Figure 4, there are five subsequences with the same length and different starting positions. The starting positions of subsequences a, b, c, d, e are 19, 20, 21, 22, 23 respectively. From Figure 4, we can see that two shapelets are very similar if their starting positions are 1 or 2 apart.

Figure 4
figure 4

Shapelets with Similarity

5.1 Diverse shapelet selection

In order to remove similar shapelets and keep diverse shapelets, we define extended self-similarity below.

Definition 13

Extended Self-Similarity. Given two shapelets f1 = (s1,δ1,c1) and f2 = (s2,δ2,c2). We add three attributes Id, staPos and len to shapelets. Id represents the time series ID that we extract the shapelet from. staPos is the starting position in time series and len is the length of shapelet. We say f1 and f2 have extended self-similarity if the following conditions are satisfied.

  1. 1.

    Id1 = Id2;

  2. 2.

    |staPos1staPos2|≤ γ;

  3. 3.

    |len1len2|≤ η.

where, γ and η are user-defined threshold, γ denotes the allowed distance between the starting positions of two shapelets, and η represents the allowed difference of two shapelet lengths.

We illustrate extended self-similarity by the examples shown in Figure 5, assuming γ = 1 and η = 1. In Figure 5 (1), the starting position of c is the same as c1 and c2, but their lengths are different. c and c1, c and c2 both have a length difference of 1. Therefore, according to Definition 13, c and c1, c and c2 have extended self-similarity. In Figure 5 (2), the starting positions of c, c3, and c4 are different, and the starting position distance between c and c3, c and c4 is 1, but their lengths are identical. So from Definition 13 we know c and c3, c and c4 have extended self-similarity. In Figure 5 (3), c, c5 and c6 have different starting positions and lengths, but they satisfy the conditions in Definition 13. Thus c and c5, c and c6 have extended self-similarity. In Figure 5 (4), c, c7 and c8 also have various starting positions and lengths. c and c7 have a starting position distance of 1 and length difference of 2, and c and c8 are the same. Therefore, c and c7, c and c8 do not have extended self-similarity.

Figure 5
figure 5

An Example of Extended Self-Similarity

The traditional feature selection algorithm does not consider the similarity between features, and the results might contain some similar features. In order to avoid this problem, we propose a new feature selection algorithm that selects features with larger diversity. The main idea is that when selecting a new feature, we first judge whether there is extended self-similarity between the current feature and the extracted features. If there exists, we omit it, otherwise, we use it to cover the samples.

The steps of the new feature selection algorithm are as follows:

  1. 1.

    Sort all the shapelets in non-ascending order according to the utility.

  2. 2.

    Take the shapelet f = (s,δ,c) with the highest utility, and then mark all the samples covered by f in the training set. The cover means BMD(f,T) ≤ δc(T) = c.

  3. 3.

    Take the shapelet \(f^{\prime }=(s^{\prime },\delta ^{\prime },c^{\prime } )\) with second highest utility, judge whether the current shapelet has extended self-similarity to the extracted shapelets. If there exists, omit this shapelet and select the next shapelet to repeat this step; if not, cover the unlabeled sample in the dataset, and then mark the samples covered by \(f^{\prime }\) until most samples are marked.

5.2 Early classification method

The proposed early classification method is shown as Algorithm 5.

figure e

Given a sample in testing set \(D^{\prime }\), classifier first obtains the minimum length of the extracted shapelets (line 1), and then tries to match the time series with each shapelet in the shapelets. If there is a match, return the class label, otherwise get more data and repeat this operation (lines 2-8). If it is not possible to classify at the last time stamp, the sample is unclassified.

Given the selected m shapelets with length k, the number of the samples to be classified is \(|D^{\prime }|\) with length L, the online classification complexity is \((L-k+1)*k*m*|D^{\prime }|\).

6 Experimental evaluation

6.1 Experimental settings

We compare IEDSC with five baselines, 1NN-DTW(learned-w) [17], ECTS [35], EDSC [36], ECDIRE [25] and RelClass [26]. 1NN is a nearest neighbor based classification algorithm with complete time series, and we use Dynamic Time Warping(DTW) to measure the similarity of time series due to its good performance, where the warping windows are obtained after learning the best constraint from the training set. ECTS is the first early classification method, which uses partial time series for classification. EDSC is an early classification algorithm based on shapelet, which achieves good accuracy and interpretablity. ECDIRE uses probabilistic classifier to classify time series at time stamps that are discovered in learning phase. RelClass treats data as a random variable and makes decision depending on a reliability threshold.

The UCR time series data library [4] collects a number of standard time series datasets that are widely used in time series classification, each containing a training set and a test set. We select the following 16 datasets which are the representative datasets of different types from the UCR time series data library for evaluation and show the parameter setting of each dataset in Table 2.

Table 2 Datasets description

We use accuracy and earliness as the performance evaluation metrics. The accuracy represents the proportion of samples that the classifier can correctly classify in the testing set.

$$ Accuracy(EC)=\frac{NUM}{|D^{\prime}|} $$
(13)

where, EC is the early classifier, \(D^{\prime }\) is a testing dataset, \(|D^{\prime }|\) is the number of samples in the testing dataset, and NUM represents the number of samples that classifier EC correctly classifies.

Earliness is defined as the average minimum prediction length on time series set in which the time series are classified.

$$ Earliness=\frac{1}{N}{\sum}_{i=1}^{N}\frac{l_{i}}{L} $$
(14)

where N represents the number of samples that could be classified, li is obtained by the EML(Eq. 4), which indicates the minimal indentifiable length of ith time series, and L represents the length of the complete ith sequence.

All the experimental results are obtained on the computer with Intel (R) Core (TM) I5-7400 , 3.0 GHZ and memory 8G. The algorithms are implemented by Java.

6.2 Performance comparison over accuracy and earliness

In the experiments, we set minL = 5, maxL = L/2, where L is the length of the complete time series. The two length parameters are set empirically, and all the datasets in experiment have the same minL and maxL. In the distance calculation, we use the training set of each dataset to make cross validation and obtain the value of λ. The search space of λ is in [1,2] and we make 5-fold cross validation on the training set of each dataset to find the optimal value of λ shown in Table 2. When we evaluate the shapelets, we remove some shapelets whose precision obtained by a distance threshold is lower than 0.8. This is also an empirical value and is fixed for all datasets. When calculating the distance threshold, we set α = 3, β = 0.95. The values of two parameters are obtained from EDSC [36], which could achieve higher accuracy. In feature extraction, the stepSize is fixed to 8. In feature selection, we set γ = 2, the value of γ is fixed. The value η ∈ [5,6,7,8], we search the solution space and find there is no significant difference on the results with different η values, and thus we set η = 5.

We compare the proposed algorithm IEDSC with 1NN-DTW(learned-w), ECTS, EDSC , ECDIRE and RelClass, and the experimental results of accuracy and earliness are shown in Tables 3 and 5 respectively.

Table 3 Accuracy

From Table 3, we can observe that 1NN-DTW obtains the best accuracy in 7 datasets and has the highest average rank. It is because 1NN-DTW uses the full time series data to make classification. EDSC is based on shapelet, but it only obtains the best result in GunPoint.The reason is that EDSC selects shapelet through a simple cover method, and thus the shapelets do not have enough diversity. The ECTS beats all the other methods in 2 datasets. ECTS is based on 1NN method, so the decision time stamp would affect the accuracy. The earlier decision time stamp usually leads to a lower accuracy, while the later decision time stamp leads to a higher accuracy. For ECDIRE, due to the unreliability of class label, some series could remain unclassified. RelClass obtains the best accuracy in two datasets. IEDSC obtains the best results in 5 datasets and achieves the second accuracy average rank, only worse than the full time series method, 1NN-DTW, which does not provide early detection. IEDSC classifies time series by shapelets, and the quality of shapelets decides the accuracy. IEDSC eliminates parts of shapelet candidates in the extraction process, selects more diverse shapelets, and thus it achieves a higher accuracy. Moreover, Figure 6 shows a critical difference diagram for Nemenyi test [5] over the accuracy average ranks of six classifiers. In the critical difference diagram, α is set to 0.05. The critical difference (CD) length is shown above the graph. From Figure 6, we can see that DTW is the best-performing classifier as it is based on full length time series, and IEDSC is ranked the second. There is no significant difference among DTW, IEDSC, RelClass, ECTS and ECDIRE because they form a clique of classifiers.

Figure 6
figure 6

Critical difference diagram of accuracy for six classifiers

In order to further investigate whether the five early classification methods present similar performance, we conduct Wilcoxon rank sum test to evaluate the significance level of ranks on accuracy and earliness. Table 4 shows the results of rank sum test of significance level of the ranks on accuracy, wherein, \(\overline {\alpha }\) is the significance level which is set to 0.05 for test. The p-value is the significant probability that the pairwise is consistent. The last column h denotes the test result. If h is equal to 1, it indicates the pairwise has significance difference. From Table 4, one can observe that the pairwise EDSC vs. IEDSC are significantly different, while the others do not present significant difference.

Table 4 Wilcoxon Rank Sum Test of Significance Level of Average Ranks on Accuracy

Table 5 shows the result for earliness. It shows that IEDSC obtains the lowest values in 5 datasets out of 16, followed by EDSC, which obtains the best results in 4 datasets. ECDIRE and RelClass obtain the best resluts in only 3 datasets. ECTS does not obtain the best result in any dataset. Table 6 shows the results of rank sum test of significance level of the ranks on earliness. One can see that ECTS vs. IEDSC has significance difference, and the pairwise RelClass vs. IEDSC also has significance difference.

Table 5 Earlinesss
Table 6 Wilcoxon Rank Sum Test of Significance Level of Average Ranks on Earliness

From the above analysis, we can conclude that IEDSC could not only make prediction as early as possible, but also use partial data to achieve comparable accuracy with the classifier using complete time series.

6.3 Method effectiveness analysis

IEDSC contains three key designed features, trend-based Euclidean distance, shapelet pruning, and diverse shapelet selection. In order to evaluate how these three parts affect the classification performance, we conduct the experiments by changing one of these three parts and keeping the others unchanging.

We conduct the experiments on four datasets, i.e., ECG200, SonyAIBORobotSurface2, ECGfivedays and FaceFour. We set λ = 1.1 for ECG200 and SonyAIBORobotSurface2 datasets, and λ = 1.2 for ECGfivedays and FaceFour datasets.

Figure 7 shows the accuracy of four datasets with Euclidean distance and Trend-based Euclidean distance. We can observe that, the accuracy of ECG200 and SonyAIBORobotSurface2 datasets has improved, while the accuracy of ECGfivedays and FaceFour datasets does not change much.

Figure 7
figure 7

Accuracy with different distance methods

Figure 8 shows the accuracy of four datasets without and with pruning methods. The accuracy of ECG200 increased by 5%. For the SonyAIBORobotSurface2, the accuracy increases from 0.81 to 0.82. The accuracy of ECGfivedays and FaceFour datasets with pruning also increases. Figure 9 shows the extraction time and the number of shapelet candidates without and with pruning method for four datasets. From Figure 9 (1)(2)(3)(4), we can observe that the extraction time reduces significantly after pruning, and the number of shapelet candidates is also signficantly reduced.

Figure 8
figure 8

Accuracy of without pruning and with pruning methods

Figure 9
figure 9

Comparison of without pruning and with pruning methods

Figure 10 clearly shows that the proposed shapelet selection method improves the classification accuracy. The accuracy of ECG200 rises to 0.83, and the accuracy of SonyAIBORobotSurface2 increases by 6% with diverse shapelet selection method. The accuracy of ECGfivedays and FaceFour with diverse selection method also has improvement. The reason is that diverse selection method removes some of similar shapelets, and thus makes the shapelets more diverse and makes classifier more robust.

Figure 10
figure 10

Accuracy of different selection methods

6.4 Interpretability of features

In this section, we use ECG200 as an example to demonstrate the selected shapelets by our feature selection method. ECG200 dataset consists of the two classes of time series, and the lengh of each time series is 96, as shown in Figure 11. The class 1 represents a normal heartbeat, and the class -1 is an abnormal class which represents a myocardial infarction. In a normal human ECG, there are five waves in a heartbeat cycle, including the PQRST waves shown as Figure 12(1). If a person has a myocardial infarction, it is usually observed from the ECG that the ST wave is changed and elevated.

Figure 11
figure 11

ECG time series

Figure 12
figure 12

Interpretability of shapelets for classification

From Figure 12 (2), one can observe that the shapelet shown as red line is corresponding to the ST wave, and it has a gentle trend which indicates a normal heartbeat. The shapelet shown as blue line in Figure 12 (3) also contains the ST wave, and the elevated ST wave indicates that a person has a myocardial infarction.

Using 1NN-DTW on this dataset with complete data, the accuracy is 0.88. However, IEDSC uses only 26.91% data to obtain a comparable accuracy 0.84. From the above analysis, one can see the selected shapelets could represent the difference between two class time series and achieve a good accuracy.

6.5 Case study: Early classification of CBF

The CBF dataset is one of the most commonly used datasets in time series classification. In order to show the performance of IEDSC, we select the CBF dataset as a case study, and in the training phase we use standard-divided training sets to train the classifier.

CBF contains three classes of time series, and we draw the contours of the time series for these three classes in Figure 13. The accuracy of EDSC on CBF is 84% and the earliness is 31.85%, while the accuracy of IEDSC on CBF is 92% and the earliness is 27.41%. Obviously, our algorithm achieves higher accuracy than EDSC, and also gets the result more earlier. In feature selection phase, EDSC only gets three shapelets from CBF dataset, and in this paper we obtain 19 shapelets, which are displayed and marked out in red in Figures 1415 and 16 respectively.

Figure 13
figure 13

CBF time series

Figure 14
figure 14

Shapelet of class 1

Figure 15
figure 15

Shapelet of class 2

Figure 16
figure 16

Shapelet of class 3

IEDSC selects 10 shapelets from four different time series belonging to class 1 shown as Figure 14 . Shapelets from the same time series have partial differences, but shapelets from disparate time series are distinct. For class 2, there are only one shapelet selected in Figure 15 . Moreover, IEDSC extracts 8 shapelets from class 3, and from Figure 16 (1)(2) one can see that the shapelets from the same time series are diverse. The feature selection results show that, IEDSC can get more diverse shapelets, which may be the reason that the accuracy increases.

7 Conclusions

In this paper, we propose a new algorithm extracting diverse-shapelet for early classification on time series, called IEDSC. First, the IEDSC algorithm uses a new trend-based Euclidean distance to compute the similarity of time series, which can better measure the similarity of time series. Second, in order to reduce the shapelet candidates space, the IEDSC algorithm presents a shapelet pruning method based on the predictive starting positions to reduce the number of candidates. Third, in the shapelet selection phase, IEDSC makes the selected shapelets more diverse by a new feature selection method. Finally, we conduct the experiments on the real datasets. The experimental results demonstrate that IEDSC can make prediction earlier, obtain the comparable classification accuracy with that using full time series, and provide more interpretable results. In future work, we plan to improve the efficiency and effectiveness of similarity measure and feature selection, and also further study how to apply IEDSC to multivariate time series.