1 Introduction

Time series classification (TSC) has attracted considerable attention from the data mining community over the past decades, driven by the growth of time series data in domains such as the Internet of Things, finance, and medicine. Similar to the general classification task in machine learning, TSC aims to build a classifier from a labeled time series training dataset and predict the labels of target time series [1, 2]. Existing TSC techniques can be divided into four categories. The first is the whole-series comparison approach, which combines a classifier such as 1-Nearest Neighbor (1NN) with a similarity measure such as Euclidean Distance (ED) or Dynamic Time Warping (DTW) distance. The second category is the Deep Neural Network based TSC algorithm (DNN-TSC), which has recently become a popular topic in machine learning [3]. The third category is the ensemble approach, which combines two or more popular TSC algorithms and employs a voting strategy to determine the label of the target time series. Algorithms in these two categories, such as ResNet [4], InceptionTime [5], TapNet [6], Elastic Ensemble [7], COTE [8, 9], and Proximity Forest [10], achieve competitive results on most datasets. However, most of them cannot explain why a target time series is assigned a particular label. Besides, training a DNN-TSC model or an ensemble classifier requires massive training data as well as considerable computing resources. The last category is the pattern-based TSC algorithm, which aims to find local patterns that discriminate time series of different classes. The advantage of pattern-based TSC is that it not only achieves high classification accuracy but also provides good interpretability. In this work, we focus on pattern-based TSC.

For TSC, a high-quality pattern is essentially a discriminative subsequence that helps improve classification accuracy. There are two approaches to pattern-based TSC: the shapelet-based algorithms [11] and the dictionary-based algorithms [12]. Generally, a shapelet-based algorithm extracts a number of subsequences from the raw numeric time series and employs information gain (IG) to measure their classification ability. The subsequences with the highest IG are regarded as shapelets and used to build a classifier. For the dictionary-based approach, a classic example is the bag-of-patterns (BOP) algorithm [13], which converts a time series into a discrete series using Symbolic Aggregate approXimation (SAX), creates a set of SAX words for each series by applying a short sliding window, and then uses the frequency counts of the words in a series as the pattern. Senin et al. [14] extend the BOP algorithm in SAX-VSM by computing TF-IDF weights for each word and class. Schäfer [15] proposes the BOSS algorithm, which converts time series using Symbolic Fourier Approximation (SFA) and then creates a dictionary of patterns represented by SFA words.

There are two key issues in pattern-based TSC: how to measure the discriminative power of subsequences and how to find the optimal combination of subsequences. Ye and Keogh [16], who first proposed shapelets, employ a brute-force algorithm to enumerate subsequences from the raw time series dataset and use IG to measure the quality of subsequences. Since the number of subsequences is quadratic in the length of the time series and computing IG requires the distances between all subsequences and all time series, the complexity of the algorithm is \(O\left( {n^{2} \cdot m^{4} } \right)\), where n is the number of time series and m is the length of the time series. Although Keogh et al. [16, 17] proposed some techniques, e.g., candidate pruning and random masking, to accelerate the discovery of shapelets, these techniques sacrifice classification accuracy. Moreover, their method selects only the subsequence with the highest IG score as the shapelet to build a decision tree classifier, which limits the accuracy of TSC. Hills et al. [18] proposed the shapelet transformation (ST) algorithm, which finds the top-k shapelets to produce a transformed dataset. Since the transformed dataset removes the temporal relation from the time series, the ST method can be combined with any machine learning classifier, such as SVM, random forest, or neural network, which significantly improves the accuracy of shapelet-based classification. However, the ST algorithm requires setting parameters that limit the final shapelet lengths, and these settings are sensitive to both the dataset and the algorithm. Moreover, it also needs to evaluate all the subsequences, which is computationally expensive. To handle this problem, some shapelet-based methods replace the brute-force enumeration with random sampling to improve efficiency, such as gRSF [19], CRSF [20], ELIS [21], and BSPCover [22].

To address the two issues mentioned above, we propose a new method, Coverage Ratio and Mutual Information (CRMI), to find discriminative subsequences for multi-class time series classification (MC-TSC). The first step of CRMI is random sampling for a fixed time budget of several minutes, which yields a large number of subsequence candidates. Then, it calculates a distance matrix in which each cell is the distance between a subsequence candidate and a time series. Based on the distance matrix, a clustering technique is employed to derive a coverage matrix, which is used to measure the discriminative power of the subsequence candidates. In this way, CRMI can efficiently determine subsequence candidates that maximally represent each time series class. Next, a heuristic algorithm based on mutual information (MI) is designed to find the optimal combination of subsequence candidates. Finally, we adopt the ST method to produce a transformed dataset and build a TSC classifier. The major contributions of this work can be summarized as follows:

(1) We propose an efficient algorithm named CRMI to find discriminative subsequences for MC-TSC. Different from existing shapelet-based and dictionary-based algorithms, we exploit a clustering technique to build a coverage matrix and propose a new measure, named coverage ratio (CR), to evaluate the discriminative power of subsequence candidates. Moreover, we consider the effect of feature combinations and propose a heuristic algorithm based on MI to find the optimal combination of subsequences.

(2) Extensive experiments were performed on 54 UCR [23] time series datasets with at least three categories to evaluate the effectiveness of the proposed algorithm. First, we explore the parameter settings of the CRMI algorithm, which has only two parameters. Second, ablation experiments were performed to demonstrate the effectiveness of our methods. Finally, we compare the CRMI algorithm with seven classic algorithms, including 1NN + DTW and six state-of-the-art shapelet-based TSC algorithms. The efficiency of the algorithm is also discussed.

The rest of the paper is organized as follows. In Sect. 2, we review related work. Section 3 introduces the symbols, concepts, and definitions used in the paper. In Sect. 4, we present the CRMI algorithm in detail. In Sect. 5, the experimental design and the analysis of the results are presented. Finally, we summarize the findings of this work in Sect. 6.

2 Related Works

In this section, we review related work on TSC. A classic approach to TSC is 1NN + DTW, which Wang et al. [24] have shown to be hard to beat. Given a target time series, it calculates the DTW distance to all training time series and adopts the 1NN rule to determine the label.

There is an enormous body of work on TSC; among it, the shapelet-based methods are closest to our work, which also aims to find discriminative subsequences for building classification models. Therefore, we review the most important works on shapelet-based TSC. Ye and Keogh first proposed the concept of shapelets in [16]; shapelets are time series subsequences that are maximally representative of a class.

In that groundbreaking work, IG is employed to measure the quality, i.e., the discriminative power, of subsequence candidates, and a brute-force algorithm is used to enumerate subsequences from the raw time series. Since the number of subsequences is quadratic in the length of the time series and computing IG requires the distances between all subsequences and all time series, the complexity of the algorithm is \(O\left( {n^{2} \cdot m^{4} } \right)\), where n is the number of time series and m is the length of the time series. To improve efficiency, Keogh et al. proposed fast shapelets (FS) [25], which exploits a random projection technique on the SAX representation of time series to find potential shapelet candidates. Its time complexity is \(O\left( {n \cdot m^{2} } \right)\), which outperforms the previous work by two to three orders of magnitude. However, the FS algorithm cannot guarantee finding the best shapelet, and it sacrifices classification accuracy.

Grabocka et al. proposed the Scalable shapelet Discovery algorithm (SD) [26], which uses Piecewise Aggregate Approximation (PAA) to represent time series, adopts an online clustering/pruning technique to avoid measuring the prediction accuracy of similar candidates in Euclidean distance space, and incorporates supervised shapelet selection to improve classification accuracy. Since the SD algorithm prunes 99% of the shapelet candidates, it is three to four orders of magnitude faster than FS.

The tight coupling of shapelet discovery and decision tree training is another factor that hinders accuracy. Hills et al. proposed the ST technique [18], which separates shapelet discovery from classifier fitting and can therefore cooperate with any existing classifier, e.g., SVM, kNN, or a neural network. The coupling is removed by constructing a new feature space based on the top-k best shapelets and then transforming the original time series data into that space, in which each data point is the similarity (i.e., distance) between the corresponding shapelet and the time series. However, ST is one of the slowest shapelet-based algorithms since it needs to evaluate all possible candidates.

Karlsson et al. proposed the generalized random shapelet forest (gRSF) [19] algorithm, which consists of a set of shapelet-based decision trees in which both the instances used for building a tree and the shapelets are chosen at random. Their experiments demonstrate its effectiveness and efficiency. However, some techniques in gRSF still reduce the efficiency of the algorithm; e.g., growing a decision tree requires repeatedly sampling shapelets for every tree node. Yang et al. proposed the compressed random shapelet tree (CRSF) [20] algorithm, which improves gRSF with three techniques: a SAX representation, a new SAX-based distance measure, and a shapelet-pool strategy for generating shapelet-based decision trees.

Li et al. proposed BSPCover [22], which focuses on discovering a set of high-quality shapelet candidates for model building. BSPCover prunes identical and highly similar shapelet candidates, then uses the p-cover algorithm to determine discriminative candidates, and finally applies an existing shapelet-learning technique to build the classifier. They report that BSPCover achieves very competitive performance compared with existing TSC methods.

3 Preliminaries

A time series is an ordered sequence of \(m\) real values, denoted as \(T_{i} = t_{i,1} t_{i,2} t_{i,3} \cdots t_{i,m}\), where \(m\) is the length of the time series. Usually, the \(m\) real values are obtained by observing a target object at a fixed time interval. A time series dataset contains \(n\) time series instances and can be denoted as \(D = \left\langle {T,Y} \right\rangle\), \(T = \left\{ {T_{1} ,T_{2} , \cdots ,T_{n} } \right\}\), \(Y = \left\{ {y_{1} ,y_{2} , \cdots ,y_{Q} } \right\}\), where \(T\) is the set of time series, \(Y\) is the set of time series labels, and \(Q\) is the number of distinct labels. The label of \(T_{i}\) can be obtained by a function \(label\left( {T_{i} } \right)\). The proposed algorithm is designed for MC-TSC with \(Q \ge 3\).

A time series subsequence \(T_{i}^{j,k} = t_{i,j} t_{i,j + 1} \cdots t_{i,j + k - 1}\) is a consecutive sequence taken from a time series \(T_{i}\), where \(j\) and \(k\) are the starting position and the length of the subsequence, respectively. For simplicity, we also use the symbol \(s\) to denote a time series subsequence. A discriminative subsequence is a specific subsequence that strongly represents one class of time series. Finding discriminative subsequences requires calculating the distance between a subsequence and a time series. Regarding a subsequence as another, shorter time series, the distance between two time series of different lengths is defined below.
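To make the notation concrete for the sketches in Sect. 4, a dataset can be held as a double[][] (one row per series) together with an int[] of labels; the helper below extracts the subsequence \(T_{i}^{j,k}\). This is a minimal illustration with 0-based indexing (the paper uses 1-based indexing), not the exact implementation used in the experiments.

```java
// Illustration of the notation: series[i] plays the role of T_i (0-based indexing),
// and subsequence(series, i, j, k) returns T_i^{j,k}, the k consecutive values of T_i
// starting at position j. Names are illustrative only.
static double[] subsequence(double[][] series, int i, int j, int k) {
    double[] s = new double[k];
    System.arraycopy(series[i], j, s, 0, k);
    return s;
}
```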

Definition 1 (Distance between two time series)

The distance between time series \(T_{1}\) with length \(m_{1}\) and time series \(T_{2}\) with length \(m_{2}\) can be calculated by Formula (1). Without loss of generality, we assume \(m_{1} \le m_{2}\).

$$d\left( {T_{1} ,T_{2} } \right) = \mathop {\min }\limits_{{j = 0 \ldots m_{2} - m_{1} }} \sqrt {\frac{1}{{m_{1} }}\sum\limits_{i = 1}^{{m_{1} }} {\left( {t_{1,i} - t_{2,j + i} } \right)^{2} } }$$
(1)
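A minimal sketch of Formula (1) is shown below: it slides the shorter series over the longer one and keeps the smallest length-normalized Euclidean distance. The method name and types are illustrative only.

```java
// Formula (1): minimum length-normalized Euclidean distance over all alignments
// of the shorter series against the longer one.
static double distance(double[] shorter, double[] longer) {
    int m1 = shorter.length, m2 = longer.length;   // assumes m1 <= m2
    double best = Double.POSITIVE_INFINITY;
    for (int j = 0; j <= m2 - m1; j++) {           // j = 0 ... m2 - m1
        double sum = 0.0;
        for (int i = 0; i < m1; i++) {
            double diff = shorter[i] - longer[j + i];
            sum += diff * diff;
        }
        best = Math.min(best, Math.sqrt(sum / m1));
    }
    return best;
}
```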

This distance is a length-normalized variant of the Euclidean distance, minimized over all alignments of the shorter series against the longer one. Like the ST algorithm, the proposed method adopts a transformation that converts the raw time series dataset into a distance matrix, also called the transformation matrix.

Definition 2 (Subsequence transformation)

Assuming a set of subsequences \(S = \left\{ {s_{1} ,s_{2} , \cdots ,s_{{n_{S} }} } \right\}\) has been extracted from a time series dataset \(D = \left\langle {T,Y} \right\rangle\), the dataset can be transformed into a distance matrix \(D^{\prime} = \left\{ {d_{i,j} } \right\}\), where \(d_{i,j}\) is the distance between the i-th time series and the j-th subsequence, i.e., \(d\left( {s_{j} ,T_{i} } \right)\). Note that all distances need to be normalized.
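The transformation of Definition 2 can be sketched as follows, reusing the distance method above. Normalizing each column with a min–max scaling is an assumption made for illustration; the text only states that all distances are normalized.

```java
// Definition 2: build the n x n_S distance matrix D' and min-max normalize each column.
static double[][] transform(double[][] series, double[][] subsequences) {
    int n = series.length, nS = subsequences.length;
    double[][] d = new double[n][nS];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < nS; j++)
            d[i][j] = distance(subsequences[j], series[i]);   // Formula (1)
    for (int j = 0; j < nS; j++) {                            // per-subsequence min-max normalization
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            min = Math.min(min, d[i][j]);
            max = Math.max(max, d[i][j]);
        }
        double range = Math.max(max - min, 1e-12);            // avoid division by zero
        for (int i = 0; i < n; i++) d[i][j] = (d[i][j] - min) / range;
    }
    return d;
}
```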

4 Finding Discriminative Subsequences for TSC

4.1 Framework of Algorithm

This paper proposes an algorithm, CRMI, that finds discriminative subsequences from time series to handle the TSC problem. The framework of the algorithm is shown in Fig. 1. First, the CRMI algorithm performs random sampling on the raw time series dataset and extracts a large number of subsequence candidates; this step is the same as in some existing works. Then, it calculates the distances between each subsequence candidate and all the time series based on Eq. 1 and forms a distance matrix. The third step is clustering on the distance matrix, which aims to reveal the coverage relation between subsequences and time series; the details of this step are explained in Sect. 4.2. After that, a coverage matrix is produced, and a new measure, named CR, is employed to evaluate the discriminative power of each subsequence candidate. In step 5, a heuristic algorithm based on MI finds the optimal combination of subsequences. In the last step, the set of discriminative subsequences is used to produce a transformed dataset for classifier training.

Fig. 1 Framework of the CRMI algorithm

The pseudo code of the proposed algorithm is given in Algorithm 1. In Lines 2–4, a set of subsequences is randomly sampled from the time series dataset \(D\) according to the minimum subsequence length \(minLen\) and the maximum subsequence length \(maxLen\); the sampling stops once the elapsed time exceeds the parameter \(samplingTime\). In Lines 5–7, the algorithm calculates the distance between each subsequence and each time series and stores the distances in a matrix \(M\). In Line 8, a K-means algorithm is performed on the distance matrix to derive the coverage matrix \({\mathbb{C}}\). In Lines 9–10, the CR of each subsequence is calculated from the coverage matrix. In Line 11, an MI-based heuristic algorithm finds the optimal combination of subsequences \(S^{\prime}\). The algorithm transforms the time series dataset \(D\) into a new dataset \(D^{\prime}\) based on \(S^{\prime}\) in Line 12 and trains the classifier \(\Psi\) on \(D^{\prime}\) in Line 13. More details of the subsequence selection are elaborated in Sects. 4.2 and 4.3.

Algorithm 1: Training a time series classifier based on Coverage Ratio and Mutual Information
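As an illustration of the sampling step (Lines 2–4 of Algorithm 1), the sketch below draws candidates from random source series, with random lengths in [minLen, maxLen] and random start positions, until the time budget is exhausted. The method and variable names are illustrative only; the remaining steps of Algorithm 1 are sketched in Sects. 4.2 and 4.3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Lines 2-4 of Algorithm 1: random sampling of subsequence candidates under a time budget.
static List<double[]> sampleCandidates(double[][] series, int[] seriesLabels,
                                       int minLen, int maxLen, long samplingTimeMs,
                                       List<Integer> candidateLabels) {
    List<double[]> candidates = new ArrayList<>();
    Random rnd = new Random();
    long deadline = System.currentTimeMillis() + samplingTimeMs;
    while (System.currentTimeMillis() < deadline) {
        int i = rnd.nextInt(series.length);                      // pick a source series at random
        int len = minLen + rnd.nextInt(maxLen - minLen + 1);     // random length in [minLen, maxLen]
        int start = rnd.nextInt(series[i].length - len + 1);     // random start position
        double[] s = new double[len];
        System.arraycopy(series[i], start, s, 0, len);
        candidates.add(s);
        candidateLabels.add(seriesLabels[i]);                    // a candidate inherits its source label
    }
    return candidates;
}
```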

4.2 A Measure for Estimating Discriminative Power of Subsequences

IG is currently a popular measure for evaluating the discriminative power of subsequences. However, IG has several shortcomings: (1) it is easily disturbed by noise; (2) its computation is time-consuming. To overcome these shortcomings, a new measure named CR is proposed to estimate the discriminative power of subsequences.

Given a subsequence \(s_{j}\) sampled from one of the time series in a dataset \(D = \left\langle {T,Y} \right\rangle\) with \(Q\) distinct labels, the method first calculates each distance \(d\left( {s_{j} ,T_{i} } \right)\), \(1 \le i \le n\). Obviously, \(d\left( {s_{j} ,T_{i} } \right) = 0\) when \(T_{i}\) is the time series from which \(s_{j}\) was extracted, and \(d\left( {s_{j} ,T_{i} } \right) \ge 0\) otherwise. After that, the distance values are normalized with a min–max strategy, and a clustering algorithm, e.g., K-means, is used to group the distances into \(Q\) clusters.

Since a small distance value means that the time series contains a subsequence whose shape is similar to the target subsequence, we construct a coverage vector for \(s_{j}\) in which the i-th value is set to 1 if \(d\left( {s_{j} ,T_{i} } \right)\) falls into the cluster that contains the distance value 0, and to 0 otherwise. In other words, the cover relation between a subsequence and a time series is revealed by clustering their distances, which is the main difference between our method and existing works. Finally, the vectors of all subsequences form a Boolean matrix, called the coverage matrix.

The advantages of our method are: (1) it is a parameter-free technique, unlike some works that require a distance threshold to determine whether a time series contains a subsequence; (2) it does not require any transformation of the raw time series, e.g., SAX or SFA, which may lose information and increase the computing overhead. Moreover, the number of clusters is naturally set to \(Q\), the number of classes in the time series dataset. A toy example is given below.

Example 1

Assume a time series dataset that contains 5 time series \(T_{1} ,T_{2} ,T_{3} ,T_{4} ,T_{5}\) and has 3 distinct labels. Let \(s\) be a subsequence extracted from \(T_{1}\). The distances between \(s\) and the 5 time series form a vector [0.00, 0.56, 0.37, 0.62, 0.08], e.g., \(d\left( {s,T_{5} } \right) = 0.08\). Then, we perform the K-means algorithm on the distance vector with \(K = 3\) (the same as the number of distinct labels) and obtain the clustering result {{0.00, 0.08}, {0.37}, {0.56, 0.62}}. Based on this result, the time series are divided into three groups: {\(T_{1}\), \(T_{5}\)}, {\(T_{3}\)} and {\(T_{2}\), \(T_{4}\)}. Since \(s\) is taken from \(T_{1}\) and the group containing \(T_{1}\) also contains another time series \(T_{5}\), the values corresponding to \(T_{1}\) and \(T_{5}\) in the coverage vector are set to 1 and the others are set to 0, i.e., \(\left[ {1,0,0,0,1} \right]^{T}\). This indicates that the time series \(T_{1}\) and \(T_{5}\) are covered by the subsequence \(s\). In other words, \(T_{1}\) and \(T_{5}\) contain subsequences that are similar to \(s\) in shape.
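The coverage step of Example 1 can be sketched as follows: a simple one-dimensional K-means groups the (normalized) distances from one candidate to all time series into \(Q\) clusters, and every time series whose distance falls into the cluster containing the zero distance is marked as covered. The quantile-based initialization is a choice made for illustration; any standard K-means implementation, e.g., the one in Weka, can be used instead.

```java
// Coverage vector of one candidate: cluster its distances to all n series into Q groups
// with a 1-D K-means and mark the series that fall into the cluster containing the
// minimum (zero) distance, i.e., the cluster of the candidate's source series.
static boolean[] coverageVector(double[] dist, int Q) {
    int n = dist.length;
    double[] sorted = dist.clone();
    java.util.Arrays.sort(sorted);
    double[] centroid = new double[Q];
    for (int q = 0; q < Q; q++)                       // quantile-based initialization
        centroid[q] = sorted[q * (n - 1) / Math.max(Q - 1, 1)];
    int[] assign = new int[n];
    for (int iter = 0; iter < 100; iter++) {
        boolean changed = false;
        for (int i = 0; i < n; i++) {                 // assignment step
            int best = 0;
            for (int q = 1; q < Q; q++)
                if (Math.abs(dist[i] - centroid[q]) < Math.abs(dist[i] - centroid[best])) best = q;
            if (assign[i] != best) { assign[i] = best; changed = true; }
        }
        if (!changed) break;                          // converged
        for (int q = 0; q < Q; q++) {                 // update step
            double sum = 0.0; int cnt = 0;
            for (int i = 0; i < n; i++) if (assign[i] == q) { sum += dist[i]; cnt++; }
            if (cnt > 0) centroid[q] = sum / cnt;
        }
    }
    int minIdx = 0;                                   // series with the smallest distance
    for (int i = 1; i < n; i++) if (dist[i] < dist[minIdx]) minIdx = i;
    int zeroCluster = assign[minIdx];
    boolean[] covered = new boolean[n];
    for (int i = 0; i < n; i++) covered[i] = (assign[i] == zeroCluster);
    return covered;
}
```

On the distance vector of Example 1 with \(Q = 3\), this sketch reproduces the clusters {0.00, 0.08}, {0.37}, {0.56, 0.62} and the coverage vector \(\left[ {1,0,0,0,1} \right]^{T}\).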

Based on the coverage matrix, this paper proposes a new measure, CR, for estimating the discriminative power of a subsequence. The definition of CR is given below.

Definition 3

(Coverage ratio, CR): Given a coverage matrix \({\mathbb{C}}\) and a subsequence \(s_{j}\) with the label \(label(s_{j} )\), the CR of \(s_{j}\) can be calculated by the following formula.

$$ CR\left( {s_{j} ,{\mathbb{C}}} \right) = \frac{{\# \left( {c_{i,j} = 1 \wedge label\left( {T_{i} } \right) = label\left( {s_{j} } \right)} \right)}}{{\# \left( {label\left( {T_{i} } \right) = label\left( {s_{j} } \right)} \right)}} - \lambda \cdot \frac{{\# \left( {c_{i,j} = 1 \wedge label\left( {T_{i} } \right) \ne label\left( {s_{j} } \right)} \right)}}{{\# \left( {label\left( {T_{i} } \right) \ne label\left( {s_{j} } \right)} \right)}} $$
(2)

where \(T_{i}\) and \(s_{j}\) represent the i-th time series and the j-th subsequence, respectively, and \(label\left( {T_{i} } \right)\) and \(label\left( {s_{j} } \right)\) represent the labels of \(T_{i}\) and \(s_{j}\), respectively. In the coverage matrix \({\mathbb{C}}\), \(c_{i,j} = 1\) when the j-th subsequence covers the i-th time series. The symbol \(\# \left( \right)\) is a cardinality operator, and \(\lambda\) is a parameter that controls the weight of the two components; its range is \(\left( {0,1} \right]\). It is not difficult to figure out that the range of \(CR(s_{j} ,{\mathbb{C}})\) is \([ - \lambda ,1] \subseteq [ - 1,1]\). The rationale behind CR is simple: a subsequence \(s_{j}\) with great discriminative power should cover as many time series with the label \(label\left( {s_{j} } \right)\) as possible and as few time series with other labels as possible. A toy example for explanation is shown below.
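Formula (2) can be computed directly from one column of the coverage matrix (the coverage vector of a candidate), the label of the candidate, and the labels of the time series. The sketch below is a minimal illustration; the method and variable names are not taken from the actual implementation.

```java
// Formula (2): fraction of same-label series covered minus lambda times the
// fraction of other-label series covered.
static double coverageRatio(boolean[] covered, int[] seriesLabels, int candidateLabel, double lambda) {
    int sameCovered = 0, same = 0, otherCovered = 0, other = 0;
    for (int i = 0; i < seriesLabels.length; i++) {
        if (seriesLabels[i] == candidateLabel) {
            same++;
            if (covered[i]) sameCovered++;
        } else {
            other++;
            if (covered[i]) otherCovered++;
        }
    }
    double hit  = same  == 0 ? 0.0 : (double) sameCovered  / same;
    double miss = other == 0 ? 0.0 : (double) otherCovered / other;
    return hit - lambda * miss;
}
```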

Example 2

The time series dataset contains 12 time series with \(Y = \left\{ {1,2,3} \right\}\) (as shown in Table 1). Two subsequences \(s_{1} ,s_{2}\) are extracted from time series with the label 1. The coverage vectors of the two subsequences are shown in the 3rd and the 4th rows of Table 1, respectively. The parameter \(\lambda\) is set to 0.5. Based on Formula (2), the CR of \(s_{1}\) is 0.6875 and the CR of \(s_{2}\) is 0.3125.

Table 1 Illustration for coverage ratio

This demonstrates that \(s_{1}\) is more discriminative than \(s_{2}\). Analyzing the coverage vectors of the two subsequences, it is easy to find that most of the time series covered by \(s_{1}\) belong to the same category (i.e., the first category), whereas the time series covered by \(s_{2}\) are distributed across all three categories. This shows that the new measure, i.e., CR, can reveal the discriminative power of subsequences.

4.3 A MI-Based Algorithm for Discriminative Subsequence Selection

Although a new measure has been proposed to evaluate the discriminative power of subsequences, selecting a set of subsequences that maximally represents the time series classes is still an issue. Some works simply choose the top-k subsequences with the highest scores for classifier training. However, this does not work well on some datasets because it ignores the correlation among the subsequences.

MI is an important tool in information theory that quantifies the dependence between two variables. Moreover, MI-based feature selection has been proven effective in many works. In this section, an MI-based algorithm is proposed to handle the problem of discriminative subsequence selection. First, the definition of MI is given below.

Definition 4 (Mutual information, MI)

Given a coverage matrix \({\mathbb{C}}\) of a subsequence set \(S = \left\{ {s_{1} ,s_{2} , \cdots ,s_{{n_{S} }} } \right\}\) on time series dataset \(D = \left\langle {T,Y} \right\rangle\), the MI between \(S\) and \(Y\) is denoted as follows.

$$MI(S;Y) = \sum\limits_{y \in Y} {\sum\limits_{s \in S} {p(s,y)log_{2} \left( {\frac{p(s,y)}{{p(s)p(y)}}} \right)} } = H(Y) - H(Y|S)$$
(3)

where \(p(s,y)\) is the joint probability distribution of \(s\) and \(y\) in \({\mathbb{C}}\), while \(p(s)\) and \(p(y)\) are the marginal probability distributions of \(s\) and \(y\) in \({\mathbb{C}}\), respectively. \(H(Y)\) is the information entropy of \(Y\), and \(H(Y|S)\) is the conditional entropy of \(Y\) given \(S\). Similar to existing work, the algorithm aims to find a subset \(S^{\prime}\) of \(S\) such that \(MI(S^{\prime};Y)\) is (approximately) equal to \(MI(S;Y)\).
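For illustration, \(MI(S;Y)\) can be estimated empirically from the coverage matrix by treating the Boolean coverage pattern of each time series over the selected subsequences as a single discrete variable and computing \(H(Y) - H(Y|S)\) from the observed counts. This estimator is one possible reading of Definition 4, not necessarily the exact one used in the experiments.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Empirical MI(S;Y): map each series to its coverage pattern over the selected
// candidates, then compute H(Y) - H(Y|S) from the observed joint counts.
static double mutualInformation(boolean[][] C, List<Integer> selected, int[] labels) {
    int n = labels.length;
    Map<String, Map<Integer, Integer>> jointCounts = new HashMap<>();
    Map<Integer, Integer> labelCounts = new HashMap<>();
    for (int i = 0; i < n; i++) {
        StringBuilder key = new StringBuilder();
        for (int j : selected) key.append(C[i][j] ? '1' : '0');        // coverage pattern of series i
        jointCounts.computeIfAbsent(key.toString(), k -> new HashMap<>())
                   .merge(labels[i], 1, Integer::sum);
        labelCounts.merge(labels[i], 1, Integer::sum);
    }
    double hY = 0.0;                                                   // H(Y)
    for (int cnt : labelCounts.values()) {
        double p = (double) cnt / n;
        hY -= p * Math.log(p) / Math.log(2);
    }
    double hYgivenS = 0.0;                                             // H(Y|S)
    for (Map<Integer, Integer> byLabel : jointCounts.values()) {
        int patternTotal = 0;
        for (int cnt : byLabel.values()) patternTotal += cnt;
        for (int cnt : byLabel.values()) {
            double pJoint = (double) cnt / n;                          // p(s, y)
            double pCond  = (double) cnt / patternTotal;               // p(y | s)
            hYgivenS -= pJoint * Math.log(pCond) / Math.log(2);
        }
    }
    return hY - hYgivenS;
}
```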

To find the optimal combination of subsequences, we first group the subsequences into \(Q\) groups by their labels and sort them by their CR values in descending order within each group. Then, the subsequence with the highest CR in each group is chosen to form a core set, which is regarded as the starting set of the subsequence selection. Next, a round-robin strategy successively selects a subsequence from each group in turn and adds it to \(S^{\prime}\) until \(MI(S^{\prime};Y)\) no longer increases. The process of the MI-based subsequence selection is shown in Fig. 2.

Fig. 2 Process of the MI-based subsequence selection

The pseudo-code of the algorithm is shown in Algorithm 2. In the first line, the algorithm groups all the subsequences by their labels and sorts them by CR score in descending order within each group. In Line 2, the result set is initialized to be empty. In Lines 3–4, a core set is built from the first subsequence of each group (i.e., the subsequence with the highest CR score in each group), where the function \(group.getAndRemove(0)\) removes the first element from the group and returns it. The MI of the core set is calculated in Line 5 and assigned to the variable \(lastMI\) in Line 6. In Lines 8–19, a round-robin strategy traverses all groups and, in each group, finds the subsequence that yields the maximum MI value when added. The loop ends when the MI value no longer increases (Lines 20–23). Finally, the algorithm returns the selected subsequences \(S^{\prime}\).

Algorithm 2: Select Subsequences by Mutual Information (\(S\), \({\mathbb{C}}\), \(Y\))
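A minimal sketch of the Core + Round-Robin selection is given below, reusing the mutualInformation method above. groupsByLabel maps each class label to the indices of its candidates, already sorted by CR in descending order. Choosing, within each group, the remaining candidate that maximizes MI when added follows the description of Lines 8–19; the data structures and tie handling are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Algorithm 2 (sketch): build a core set from the top-CR candidate of each class,
// then round-robin over the class groups, adding a candidate only if MI strictly increases.
static List<Integer> selectSubsequences(boolean[][] C, int[] labels,
                                        Map<Integer, List<Integer>> groupsByLabel) {
    List<Integer> selected = new ArrayList<>();
    for (List<Integer> group : groupsByLabel.values())         // core set (Lines 3-4)
        if (!group.isEmpty()) selected.add(group.remove(0));
    double lastMI = mutualInformation(C, selected, labels);    // Lines 5-6
    boolean improved = true;
    while (improved) {                                          // round-robin (Lines 8-19)
        improved = false;
        for (List<Integer> group : groupsByLabel.values()) {
            int bestIdx = -1;
            double bestMI = lastMI;
            for (int k = 0; k < group.size(); k++) {            // best remaining candidate of this group
                selected.add(group.get(k));
                double mi = mutualInformation(C, selected, labels);
                selected.remove(selected.size() - 1);
                if (mi > bestMI) { bestMI = mi; bestIdx = k; }
            }
            if (bestIdx >= 0) {                                 // keep it only if MI increases (Lines 20-23)
                selected.add(group.remove(bestIdx));
                lastMI = bestMI;
                improved = true;
            }
        }
    }
    return selected;
}
```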

4.4 Analysis of Time Complexity

The proposed algorithm is composed of five steps: randomly sampling time series subsequences, calculating the distances between each subsequence and each time series, clustering based on the distance matrix, calculating the CR of each subsequence, and finding the optimal subset of subsequences. The duration of random sampling is a parameter, and we discuss it in the experiments. The worst-case time complexity of the second step is \(O\left( {m \cdot n \cdot n_{S} } \right)\), where m is the length of the time series, and n and \(n_{S}\) are the number of time series and the number of subsequence candidates, respectively. The worst-case time complexity of the third step, e.g., K-means clustering, is \(O\left( {Q \cdot n \cdot n_{S} } \right)\), where \(Q\) is the number of distinct classes. The key operation of the fourth step is counting coverage entries, and its worst-case time complexity is \(O\left( {n \cdot n_{S} } \right)\). The final step requires calculating the MI many times, and its worst-case time complexity is \(O\left( {n \cdot Q \cdot \left( {Q + \left( {Q + 1} \right) + \left( {Q + 2} \right) + \cdots + n_{S} } \right)} \right) \approx O\left( {n \cdot Q \cdot n_{S}^{2} } \right)\). Since \(Q \cdot n_{S}\) is usually much larger than the length of the time series, the worst-case time complexity of the proposed algorithm is \(O\left( {n \cdot Q \cdot n_{S}^{2} } \right)\).

5 Experiments

5.1 Experiment Setup

In this section, we perform extensive experiments to evaluate the performance of the proposed algorithm. First, we introduce the datasets. Second, we explore the parameter setting strategies. Third, ablation experiments are performed to evaluate the effectiveness of the components of CRMI, including the CR measure and the MI-based heuristic algorithm for discriminative subsequence selection. Finally, the performance of the proposed algorithm is evaluated by comparing it with six shapelet-based TSC algorithms as well as the 1NN + DTW algorithm, which has been claimed to be difficult to beat.

The experiments were carried out on a server equipped with an Intel Xeon Gold 5215 CPU (2.5 GHz) and 64 GB of memory. The algorithm is implemented in Java with the Weka 3.4.3 and Time Series Machine Learning (TSML) [27] toolkits. Since the proposed method targets MC-TSC, 54 datasets with at least three categories were selected from the UCR repository. The details of the datasets are shown in Table 2; the last four columns list the number of training instances, the number of testing instances, the number of categories, and the length of the time series.

Table 2 Descriptions of datasets

5.2 Influence of Parameter Settings

In this section, we conduct experiments to study how to set the parameters in the CRMI algorithm, including the time of subsequence sampling and the parameter \(\lambda\) in Formula (2).

First, we investigate the impact of the subsequence sampling time by varying its value from 2 to 5 min with a step of 1 min. The value of \(\lambda\) is set to 1.0. For fairness, the experiments were repeated 50 times on each dataset because the algorithm is stochastic. The classification accuracy, as well as the training time and the testing time on each dataset, was recorded. For simplicity, we calculate the average classification accuracy, training time, and testing time over all datasets. The results are shown in Fig. 3, in which the abscissa is the sampling time, the left ordinate is the average classification accuracy over all datasets, and the right ordinate is the time cost of the algorithm. From the figure, the classification accuracy achieves its highest value, i.e., 74.91%, when the sampling time is 2 min. The second highest classification accuracy is 74.59%, which is achieved when the sampling time is set to 4 and 6 min. Therefore, 2 min is a good choice for the sampling time.

Fig. 3 Influence of the sampling time setting

Next, we conduct experiments to investigate the influence of the setting of \(\lambda\). The value of \(\lambda\) is varied from 0.2 to 1.0 with a step of 0.2, and the sampling time is set to 2 min. Similarly, the experiments were repeated 50 times on each dataset, and the average classification accuracy was calculated. The results are shown in Fig. 4. As can be seen from the figure, the highest classification accuracy is 74.91% when \(\lambda\) is set to 1.0, and the lowest is 73.70% when \(\lambda\) is set to 0.8. Therefore, the value of \(\lambda\) in the experiments is set to 1.0. We also changed the order of parameter selection, i.e., first determining \(\lambda\) and then the sampling time; the results show that the order does not influence the performance of the algorithm.

Fig. 4 Average classification accuracy achieved by different values of λ

5.3 Comparison of Different Strategies in MI-Based Discriminative Subsequence Discovery

Algorithm 2 is an MI-based algorithm for discovering discriminative subsequences that employs two strategies: a core set consisting of the subsequence with the highest CR value in each group, and a round-robin selection strategy. In this section, we evaluate three different selection strategies.

(1) Core + Round-Robin.


This is the strategy employed in the proposed algorithm. After calculating the CR values of all subsequences, the algorithm groups the subsequences according to their labels, and the subsequences in each group are ranked in descending order of CR. Then, it constructs a core set consisting of the subsequence with the highest CR value in each group. Finally, it successively selects the subsequence with the highest remaining CR value from each group in turn and adds it to the set. The algorithm stops when the MI value no longer increases.

(2) Core + Ranking.


The difference from Core + Round-Robin is that this strategy does not select successively from each group but from a single pool of all subsequences. After constructing the core set, it ranks all remaining subsequences in descending order of CR and successively selects the one with the highest CR value. The algorithm stops when the MI value no longer increases.

(3) Ranking.


This strategy sorts all subsequences in descending order of CR and adds them to the result set one by one. It calculates the MI of the selected subsequences and stops when the MI no longer increases.

In the experiments, the sampling time is set to 2 min and the value of \(\lambda\) is set to 1.0. Each strategy was run 50 times on each dataset, and the average classification accuracy as well as the training time was calculated. The experimental results are shown in Fig. 5. The Core + Ranking strategy performs slightly better than the Ranking strategy, which demonstrates that the core set helps improve classification accuracy. Furthermore, the Core + Round-Robin strategy performs much better than the Core + Ranking strategy in terms of classification accuracy, which demonstrates the effectiveness of the round-robin strategy. However, the training time of the Core + Round-Robin strategy is slightly longer than that of the Core + Ranking strategy. The reason is that the former selects more subsequences than the latter, because the number of subsequences from each group is balanced under the Core + Round-Robin strategy. The experimental results indicate that balancing the number of subsequences (features) from different categories of time series is helpful for TSC.

Fig. 5 Comparison of average accuracy and training time for different selection strategies

5.4 Comparison Against State-of-the-Art Algorithms

We compared the CRMI algorithm with seven TSC classifiers, including 1NN + DTW and six state-of-the-art shapelet-based algorithms: ST [18], gRSF [19], CRSF [20], BSPCover [22], Fast Shapelets (FS) [25], and SD [26]. The reasons for selecting only shapelet-based algorithms for comparison are twofold. On the one hand, existing works have shown that state-of-the-art shapelet-based algorithms perform much better than most dictionary-based algorithms. On the other hand, the proposed algorithm is similar to the shapelet-based algorithms in that all of them work on the raw time series data, unlike the dictionary-based methods, which need to convert the raw time series with SAX, SFA, etc. The parameters of the compared algorithms were set according to the corresponding references. For datasets that do not appear in the references, we used the default parameters of the open-source code.

The experiments were conducted on 54 datasets. All the stochastic algorithms, including FS, SD, BSPCover, gRSF, and CRSF, were run 50 times on each dataset. The experimental results are shown in Table 3, which lists the average classification accuracy of each algorithm on the 54 datasets; the highest accuracy for each dataset is marked in bold. To describe the experimental results more intuitively, some comparisons are provided in the last five rows of the table. The “Wins” row shows the number of datasets on which the corresponding algorithm achieved the best accuracy. The “Ave. Rank” row shows the average ranking of each algorithm over the 54 datasets. The last three rows show the number of datasets on which CRMI wins, draws, and loses compared with each other algorithm, respectively. For example, compared with CRSF, CRMI won 38 times, drew 1 time, and lost 15 times on the 54 datasets. As shown in Table 3, the CRMI algorithm achieves the best results on 19 datasets, slightly more than the ST algorithm, which won on 18 datasets; the numbers of datasets won by the other algorithms are far smaller. In terms of the average ranking, the top three algorithms are CRMI, ST, and gRSF, with 2.70, 2.74, and 2.81, respectively. In one-to-one comparisons, the CRMI algorithm also outperforms every other algorithm.

Table 3 Accuracy comparison of 8 classifiers on 54 datasets

We also apply the Nemenyi test to detect whether there are significant differences among the eight algorithms. A critical difference diagram is shown in Fig. 6; classifiers that are not significantly different at p = 0.05 are connected. CRMI is significantly superior to five algorithms: 1NN + DTW, SD, FS, BSPCover, and CRSF. Since the differences among CRMI, ST, and gRSF are not significant, the empirical findings indicate that, when aiming for the highest accuracy, any one of them can be safely recommended. However, the ST algorithm is very time-consuming because it needs to evaluate all the subsequences in a time series dataset. Based on the above analysis, we can conclude that the accuracy achieved by CRMI is slightly better than that of gRSF and ST, and clearly better than that of the remaining methods, which demonstrates the effectiveness of the proposed algorithm.

Fig. 6 Nemenyi test for 8 classifiers (p = 0.05)

5.5 Cross Validation

To further analyze the effectiveness of the CRMI algorithm, fivefold cross-validation experiments were conducted on all 54 datasets. Specifically, the training and testing data in the UCR archive were merged, and the data with the same label were divided into five groups. One group was designated as the validation set, while the remaining four groups served as the training set. We recorded the accuracy in the cross-validation experiments and computed the average score. As in the previous experiments, the stochastic algorithms were repeated 50 times on each dataset to ensure the fairness of the results. Pairwise comparisons of these experimental results are illustrated in Fig. 7. Each subplot represents the comparison between the CRMI algorithm and another algorithm, where the y-axis is the average accuracy of CRMI and the x-axis is the average accuracy of the compared algorithm. Each point represents a dataset. If a point falls above the diagonal, the CRMI algorithm performs better than the compared algorithm on that dataset; otherwise, CRMI is inferior to the compared algorithm.

Fig. 7 Results of cross validation on 54 datasets

From subplots (a), (b), (c), (e), and (g), it is obvious that the CRMI algorithm is superior to the 1NN-DTW, FS, SD, BSPCover, and CRSF algorithms, because most of the points fall above the diagonal. In subplots (d) and (f), most points lie near the diagonal; nevertheless, the CRMI algorithm is still slightly superior to the two compared algorithms, i.e., ST and gRSF. The results of the cross-validation also demonstrate the effectiveness of the proposed algorithm.

5.6 Analysis of the Impact of the Number of Categories

In this section, we investigate the impact of the number of categories in a dataset on the classification accuracy of CRMI. According to the number of categories, the datasets are divided into three groups: datasets with no more than 6 categories, datasets with no less than 7 categories, and datasets with no less than 10 categories (the third group is a subset of the second).

The experimental results are re-analyzed in Table 4. From the table, when the number of categories is no more than 6, the number of datasets won by the CRMI algorithm is 9, which is equal to that of the ST algorithm. Although the CRMI algorithm ties with the ST algorithm for first place on this metric, its average ranking of 2.83 is inferior to that of ST and gRSF, which is 2.48. When the number of categories is no less than 7, the CRMI algorithm takes first place on both metrics, i.e., Ave. Rank and Wins, with 2.56 and 10, respectively; the second place is the ST algorithm with 3.04 and 9. Furthermore, the CRMI algorithm obtains 2.60 and 7 on the two metrics when the number of categories is no less than 10, where the second-place values of 3.13 and 4 are obtained by gRSF and ST, respectively. Although the proposed algorithm achieves first place in both of the latter groups, it is easy to see that the advantage of the CRMI algorithm becomes more apparent as the number of categories in the dataset increases.

Table 4 Experimental results for different numbers of categories in the datasets

The analyses show that CRMI, ST, and gRSF all have good classification ability for multi-class time series classification. However, both ST and gRSF require calculating IG to evaluate the discriminative ability of shapelets, and ST additionally evaluates all shapelet candidates; therefore, both of them are time-consuming. For example, the time costs of ST and gRSF on the NonInvasiveFetalECGThorax1 dataset are 6752 s and 1872 s, respectively, which is far higher than the 129 s required by CRMI. These experiments demonstrate the effectiveness and efficiency of the proposed algorithm.

5.7 Other Advantages

The experiments show the effectiveness and efficiency of the CRMI algorithm, especially in multi-class classification tasks. Furthermore, the proposed algorithm is easier to use than most pattern-based TSC algorithms because it is nearly parameter-free. Apart from the sampling time and \(\lambda\), the CRMI algorithm only requires the minimum and maximum lengths of the subsequences, which we always set to three and the length of the time series, respectively. Existing pattern-based TSC algorithms, such as CRSF, ST, etc., must set a distance threshold to determine whether a time series contains a subsequence, which may greatly decrease the classification accuracy. Besides, the CRMI algorithm does not require any transformation of the raw time series, whereas some works, such as CRSF, BSPCover, and BOSS, require representing the time series with SAX or SFA, which may lose information and introduce extra computing overhead. Finally, the CRMI algorithm employs an MI-based heuristic to search for the optimal combination of subsequences. Feature selection has been studied for a long time, and many efficient techniques, such as meta-heuristic algorithms, could be exploited in our work, which may further improve the effectiveness and efficiency of the proposed algorithm.

6 Conclusions

In this paper, we present a new algorithm named CRMI to find discriminative subsequences for MC-TSC. A new measure for evaluating the discriminative power of subsequences, called CR, is proposed, which reveals the cover relation between a subsequence and the time series by clustering their distances. Besides, an MI-based heuristic algorithm for finding the optimal combination of subsequences is also presented. We performed extensive experiments on 54 datasets to evaluate the effectiveness and efficiency of the CRMI algorithm, from which several conclusions can be drawn. (1) Compared with the 1NN + DTW classifier and six well-known shapelet-based classifiers, the CRMI algorithm achieves the best average accuracy on the 54 datasets. Although the differences in accuracy among CRMI, ST, and gRSF are not statistically significant, CRMI is much more efficient than the latter two. (2) Extensive experiments were conducted to evaluate the proposed strategies and explore the parameter settings. The experimental results show that the proposed strategies, i.e., using a core set as the starting point of subsequence selection and the round-robin selection strategy, are effective. (3) The CRMI algorithm only requires setting two parameters, i.e., the subsequence sampling time and the \(\lambda\) in Formula (2); experimental results show that 2 min and 1.0 are good choices, respectively. (4) The CRMI algorithm performs particularly well on time series datasets with a large number of categories.

However, this study still has limitations and faces several challenges: (1) the UCR datasets are balanced, while much real-life data is imbalanced, so the CRMI algorithm, which is trained on balanced datasets, needs to be extended to handle imbalanced data; (2) although the effectiveness of the algorithm has been demonstrated in this paper, its efficiency could still be enhanced by utilizing representations such as SAX or SFA or by employing parallel computing; (3) it is necessary to consider unseen real-life time series data to improve the generalization of the algorithm.