Keywords

1 Introduction

Time series classification (TSC) is one of the classical and hot issues in time series data mining. Time series come from a wide range of sources, including weather prediction, malware detection, voltage stability assessment, medical monitoring, and network anomaly detection [1]. In general, time series \(T=\{{t}_{1}\), \({t}_{2}\), … \({t}_{m}\}\) is a series of the values of the same statistical index arranged according to the time sequence of their occurrence [2]. The main goal of TSC is to divide an unlabeled time series into a known class.

In the existing time series classification algorithms, shapelet based algorithms are promising. Shapelet is a sub-sequence in time series which can represent a class maximally. These sub-sequences may appear anywhere in the time series and are generally shorter in length. Compared with other TSC algorithms, shapelet based classification method has the advantages of high classification accuracy, fast classification speed and strong interpretability [3]. In order to improve the speed of shapelet dicovery, a fast shapelet algorithm (FS) which uses Symbolic Aggregate Approximation (SAX) representation is proposed by Rakthanmanon and Keogh. However, SAX only uses the mean of sequence to represent time series, and it may cause the loss of trend information of time series. To solve this problem, a fast shapelet discovery algorithm based on Trend SAX (FS-TSAX) is proposed in this work. The main contributions of this paper are as follows:

  1. (1)

    A new shapelet discovery algorithm is proposed, combining FS and TSAX. It solves the shortcoming that SAX is easy to lose the trend information of time series and improves the accuracy of shapelet classification.

  2. (2)

    Experiments are conducted on different data sets to evaluate the performance of the proposed algorithm. Experimental results show that the accuracy of our algorithm is at a leading level.

The remainder of this paper is structured as follows. Section 2 gives some related works on shapelet based algorithms and TSAX. Section 3 gives some definitions about FS-TSAX algorithm. Section 4 introduces our proposed FS-TSAX algorithm. Experimental results are presented in Sect. 5 and our conclusions are given in Sect. 6.

2 Related Works

2.1 Shapelet Based Algorithms

Since the concept of the shapelet was first proposed in 2009 [4], algorithms based on Shapelet have been proposed in large numbers.

However, shapelet-based algorithms are complex and take a long time to train [5]. For this, Rakthanmanon and Keogh proposed fast shapelet algorithm (FS) [6]. It uses SAX to reduce the dimension of the original data and uses the mean of sequence to represent time series. Then random masking the SAX string and construct Hash table statistics scores. Finally select the best shapelet according to the scores.

In addition, Wei et al. [7] combined existing acceleration techniques with sliding window boundaries and used the maximum correlation and minimum redundancy feature selection strategy to select appropriate shapelets. To dramatically speed up the discovery of shapelet and reduce the computational complexity, a random shapelet algorithm is proposed by Renard et al. [8]. In order to avoid using online clustering/pruning techniques to measure the accuracy of similar candidate predictors in Euclidean distance space, Grabocka et al. proposed a new method denoted as SD [9], which includes a supervised shapelet discover that filters out only similar candidates to improve classification accuracy. Ji et al. proposed a fast shapelet discovery algorithm based on important data point [12] and a fast shapelet selection algorithm [15]. The former accelerated the discovery of shapelet through important data points. The latter was based on shapelet transformation and LFDPs identification of the sampling time series, and then select the sub-sequences between two non-adjacent LFDPs as candidate sub-sequences of shapelet.

2.2 Trend-Based Symbolic Aggregate Approximation (TSAX) Representation

The symbolic representation of time series is an important step in data preprocessing, which may directly leads to the low accuracy of data mining. SAX is one of the most influential symbolic representation methods at present. SAX is a discrete method based on PAA, which can carry out dimensionality reduction processing simply and mine time series information efficiently. The main step of SAX is dividing the original time series into equal length sub-sequences, and then calculate the mean value of each subsequences and use the mean value to represent the subsequences, that is PAA. Then the breakpoint is found in the breakpoint table with the selected alphabet size, and the mean value of the PAA computed subsequences is mapped to the corresponding letter interval, finally the time series is discretized into strings [10].

However, SAX uses the letters after the mapping of PAA to represent each sub-sequence after segmentation, which may lose important features or patterns in the time series and lead to poor results in subsequent studies. As shown in Fig. 1, the result of SAX string is the same between two time series with completely different trend information.

Fig. 1.
figure 1

Two time series with different trends get the same SAX string.

To solve the problem that SAX is easy to lose trend information of time series, Zhang et al. proposed a symbolic representation method based on series trend [11]. Specifically, after PAA is used on the time series, the sub-sequence with equal length are evenly divided into three segments, and the mean value of each segment is calculated respectively. Then a smaller threshold ε is defined and the size of the subsequence is calculated according to the formula (1). The letter “u” represents the upward trend, “d” represents the downward trend and “s” represents for horizontal trend. For example: if the mean of the first sub-sequence is less than the second one, represented as “u”, and the mean of the second sub-sequence is larger than the third one, represented as “d”, then this piecewise trend information is represented as “ud”, so the trend information of the time series is represented. TSAX is a combination of the SAX letters which represent trends.

$$ \mathop {Trend}\nolimits_{indicator} = \left\{ \begin{gathered} d,\overline{{C_{i1} }} - \overline{{C_{i2} }} >\upvarepsilon \hfill \\ s,\left| {\overline{{C_{i1} }} - \overline{{C_{i2} }} } \right| \le\upvarepsilon \hfill \\ u,\overline{{C_{i1} }} - \overline{{C_{i2} }} <\upvarepsilon \hfill \\ \end{gathered} \right. $$
(1)

3 Definition

Definition 1:

\(T=\{{t}_{1}\), \({t}_{2}\), … \({t}_{m}\}\) is a time series which contains an ordered list of numbers. Each value \({t}_{i}\) can be any finite number and assume that m is the length of T.

Definition 2:

S is a continuous sequence on time series, which can be expressed by formula (2). Where l is the length of S, i is the start position of S.

$$ S = T_{i}^{l} = \left\{ {t_{i} ,t_{i + 1} ,...,t_{i + l - 1} } \right\} $$
(2)

Definition 3:

Time series dataset D is a set of N time series, each of which is m in length and belongs to a specific class. The class number in D is C.

Definition 4:

(dist (T, R)) is a distance function, whose input is two time series \(T=\{{t}_{1}\), \({t}_{2}\), … \({t}_{m}\}\) and \(R=\{{r}_{1}\), \({r}_{2}\), …\({r}_{m}\}\). It returns a non-negative value. This paper uses Euclidean distance, and its calculation method is shown in formula (3).

$$ dist(T,R) = \sqrt[2]{{\sum\limits_{i = 1}^{m} {\mathop {(t_{i} - r_{i} )}\nolimits^{2} } }} $$
(3)

Definition 5:

The distance between the subsequence S and the time series \(T (subdist(T,S))\) is defined as the minimum distance between subsequence \(S\) and any subsequence of \(T\) of the same length as subsequence \(S\). It is a distance function, which inputs time series \(T\) and sub-sequence \(S\), returns a non-negative value. Intuitively, this distance is the distance between \(S\) and the best matching point at a certain position in \(T\), as shown in Fig. 2, and its calculation method is given by formula (4).

$$ SubDist(T,S) = \min (dist(T_{1}^{l} ,S),dist(T_{2}^{l} ,S),...,dist(T_{m - l + 1}^{l} ,S)) $$
(4)
Fig. 2.
figure 2

Best matching point.

Definition 6:

Entropy is used to indicate the level of clutter in a dataset. The entropy of dataset D shown in formula (5). Where \(D\) are datasets, \(C\) are different classes, and \({p}_{i}\) is the proportion of time series in class i.

$$ e(D) = - \sum\limits_{i = 1}^{c} {(p_{i} \log p_{i} )} $$
(5)

Definition 7:

Information gain represents the degree of uncertainty reduction in a dataset under a partition condition. For a spilt strategy, the information gain calculation method is shown in Formula (6).

$$ gain(TSAX) = e(D) - \frac{{_{{\mathop n\nolimits_{C} }} }}{n}e(D_{C} ) - \frac{{_{{\mathop n\nolimits_{N} }} }}{n}e(D_{N} ) $$
(6)

Definition 8:

Optimal Split Point (OSP). When the information gain obtained by splitting at a threshold is larger than any other point, this threshold (distance value) is the optimal split point.

4 Fast Shapelet Discovery Algorithm Based on TSAX(FS-TSAX)

4.1 Overview of the Algorithm

The TS-TSAX algorithm is shown in Algorithm 1 and Fig. 3. Figure 3(a) shows the four processes of the FS-TSAX algorithm: (1) Generating TSAX words (Line 1–Line 3 in Algorithm 1); (2) Random masking of TSAX Words (Line 5–Line 7 in Algorithm 1); (3) Choose the top-k TSAX words with highest scores (Line 9 in Algorithm 1); (4) Find the best shapelet (Line 17–Line 21 in Algorithm 1).

figure a

Figure 3(b) shows the visual description of these four steps. Next, these four steps are described in detail.

Fig. 3.
figure 3

The flow of FS-TSAX.

4.2 Create TSAX Words

For an original time series, after normalization, it is divided into equal length subsequences, then calculate the mean value of each segment and use the mean to find the corresponding letter in the breakpoint table, and a string is used to represent the time series, this process is SAX. After the time series is segmented with equal length by PAA, the sub-sequence segments are evenly divided into three segments. Calculate the mean of each segment and use formula (1) to get the trend letters. Then the trend letters of each segment with the SAX letter are combined, and the time series is represented in TSAX. Figure 4 shows an example of this process.

Fig. 4.
figure 4

Combine the SAX string “AEEDCF” and the trend letters “du sd sd dd ud dd” and get the TSAX string “AduEsdEsdDddCudFdd”.

4.3 Random Masking TSAX Words

Two time series with similar real values may produce two different TSAX words just because of a minimal difference. Therefore, the best shapelet in the original time series may map to different TSAX words. The solution to this problem is to use random masking, which is the idea of projecting all the higher-dimensional TSAX words into smaller dimensions. The process of random masking TSAX words is as follows:

  1. (1)

    Randomly select a character in a TSAX word.

  2. (2)

    Select another character to generate the new TSAX word.

Typically, this process requires 10 iterations, which means masking 10 times TSAX words. After the TSAX words are randomly masked, the TSAX words and the words after masking are applied together for subsequent processing.

4.4 Choose the Best k TSAX Words

For a TSAX word, note 1 when it appears in a time series and 0 when it does not. By statistics of this information, training data set \(D\) can be divided into two sub-datasets: dataset \({D}_{C}\) composed of time series containing the TSAX words, and dataset \({D}_{N}\) composed of time series not containing the TSAX words. The information gain for this TSAX word can be calculated according to the following formula.

$$ gain(TSAX) = e(D) - \frac{{_{{\mathop n\nolimits_{C} }} }}{n}e(D_{C} ) - \frac{{_{{\mathop n\nolimits_{N} }} }}{n}e(D_{N} ) $$
(7)

Where D is the dataset, \({D}_{C}\) is the data set formed by the time series containing the TSAX words, \({D}_{N}\) is the dataset formed by the time series without the TSAX words, n, \({ n}_{C}\), and \({n}_{N}\) are the number of time series contained in each of the three datasets. After calculating the information gain of TSAX, we find the best k TSAX words which has the best information gain and obtain the final shapelet.

4.5 Discover the Best Shapelet

Each TSAX word represents a corresponding time series, so after getting the top-k TSAX words, the corresponding relationship between TSAX words and time sequence can be used to get the corresponding sub-sequence. Then, we can find the final shapelet from the corresponding sub-sequence.

The finall shapelet is the one with the greatest information gain among the subsequences corresponding to the top k TSAX words. If there are multiple subsequences with the maximum information gain, the sub-sequence with the maximum clearance is selected [11].

5 Experiments and Evaluation

5.1 Datasets

UEA&UCR time series classification warehouse is an important open source data set in the field of time series data mining. In this chapter, we select 12 datasets from it for comparative experiments [13]. These data sets are set in “arff” format and each dataset sample carries a category label. Table 1 shows the information of these datasets.

Table 1. Dataset.

5.2 Effect of the Number of TSAX Segments

To verify the effect of the number of segments, we compared the classification accuracy on different data sets when the number of TSAX segments is 2 and 3. The experimental results are shown in Fig. 5. One thing to explain, theoretically, the number of segments is artificially selected. But our code uses binary to symbolize the time series, and the int in Java is 32 bits, and representing a letter needs two bits. A TSAX word is 15 characters, consisting of 5 SAX words and 10 trend letters, so it needs to be represented with 30 bits, and the maximum value of segment number is 3. Therefore, we compared the influence on classification accuracy when the number of segments is 2 and 3. As shown in Fig. 5, the accuracy of FS-TSAX (three-segments) on the 10 data sets was higher than that of FS-TSAX (two-segments), the accuracy of FS-TSAX (three-segments) on the 1 data sets was lower than that of FS-TSAX (two-segments), and they are equally accurate on 1 data set. In general, FS-TSAX (three-segments) is more competitive than FS-TSAX (two-segments).

Fig. 5.
figure 5

Comparison of accuracy between FS-TSAX (three-segments) and FS-TSAX (two-segments).

5.3 Accuracy Comparison

In this section, we calculate the accuracy of our method and other three shapelet discovery algorithms. Table 2 shows the accuracy of these algorithms. The algorithm with the best result in each data set is shown in bold. The algorithms we selected are FS [6], SD [9] and the TSAX (Two-segments) algorithm mentioned in the previous section. As Table 2 shows, our approach gets the highest accuracy on 9 data sets, and the average rank is 1.25.

Table 2. Comparison of accuracy of different algorithms on different data sets.

6 Conclusion

The algorithms based on shapelet have attracted great attention in recent years. Shapelet discovery algorithm is the basis of shapelet transformation algorithm and shapelet learning algorithm, and it has various of acceleration strategies. We propose a shapelet discovery algorithm of FS-TSAX, which uses TSAX to represent time series, and can retain the trend information of time series well, and then carry out the process of shapelet discovery. Experiments on different data sets show that compared with other shapelet discovery algorithms, the accuracy of the proposed algorithm is at a leading level, especially in some time series with obvious trends.