1 Introduction

Time series analysis is applied in many areas such as business, engineering, finance, economics, and health care. It serves various purposes such as subsequence matching, anomaly detection, pattern discovery, clustering, and classification. Our study focuses on time series clustering. There are two main approaches for time series clustering. The first approach is based on feature construction: series are described by a vector of feature attributes [5], and instances are grouped using a classical clustering method (K-means, DBSCAN, ...). The second one uses similarity measures adapted to time series comparison, combined with basic approaches (e.g., K-means) to cluster sets of raw time series. Several similarity measures have been suggested for time series clustering, such as DTW [8], SBD [7], LCSS [10], and ERP [1]. All these distance measures compare series considering only the effects of temporal phase shift, and do not take amplitude drifts into account. However, in some application domains, time series clustering should consider invariance and the interval of measurements on the y axis as well as the shift of series on the x axis. Indeed, the range of values on the y axis can strongly discriminate between classes. For example, in agriculture or aquaculture, the range of y values in time series of environmental data, such as changes in temperature, can significantly influence the growth and survival of living species.

In this paper, we propose an approach based on shape analysis that also takes into account the variance along the y axis. In addition, we develop a strategy that automatically determines an optimal number of clusters k, using a new dispersion criterion applied to the distances between instances and their representative within each cluster. Unlike most methods that normalize data, our approach can be applied to both normalized and raw time series. The new method is robust to the shifting of series on the x axis because we use metrics that take into account the distortion of series over time, in particular DTW, which is the most widely used for time series clustering [3, 11]. For the y shift, we consider a maximum interval over which these metrics vary.

Section 2 presents notations and basic definitions. In Sect. 3, we present our contribution, in which a new dispersion measure of the distance distribution is introduced as well as the principle of our method. Section 4 gives the results of experiments on several datasets and compares them to those of TSK-Means [4] and K-Shape [9].

2 Notations and Definitions

Let s be a time series of length n, where s(i) denotes the value of the signal at time i. Let \(T = \lbrace s_1, s_2, \dots , s_N \rbrace \) be a set of N time series.

Clusters and Their Representatives: We call a k-clustering C of T the set \(C=\{C_1, C_2, \dots , C_k\}\) of k homogeneous subsets of T (with respect to a distance measure Dist), each having a representative denoted \(R_{C_i}\), with \(\forall i\in \{1,\dots ,k\}\), \(C_i = \lbrace s_{i_1}, s_{i_2}, \dots , s_{i_{m_i}} \rbrace \), verifying the following criteria: (1) \(T = \cup _{i=1}^k C_i\) and \(C_i \cap C_j = \emptyset \) \(\forall i \ne j \), and (2) \(Dist(R_{C_i}, s) < Dist(R_{C_j}, s)\) \(\forall s\in C_i\) and \(j\ne i \). The representative of a cluster (called the prototype) can be a centroid, a medoid, etc.

Standard Deviation and Entropy of a Cluster: Let \(C_i\) be a cluster of C on T according to a distance measure Dist. Let \(Dist(C_i)=\{d_{i_1},\dots ,d_{i_{m_i}}\}\) be the set of distances Dist between each instance of \(C_i\) and its representative \(R_{C_i}\). Let \(\sigma (C_i)\) be the standard deviation computed on the distribution of values taken by \(Dist(C_i)\), and \(E(C_i)\) its entropy: \(\sigma (C_i)=\sqrt{\frac{1}{m_i}\sum ^{m_i}_{k=1} (d_{i_k} - \overline{d_i} )^2}\), where \(\overline{d_i}\) is the average of \(Dist(C_i)\), and \(E(C_i) = -\sum ^{m_i}_{k=1} P(d_{i_k}) \times \log (P(d_{i_k}))\). In this paper, we use the DTW distance measure optimized by Keogh [6].
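For illustration, the following Python sketch computes these two quantities from the distances of a cluster's instances to its representative. Since the estimator of the probabilities \(P(d_{i_k})\) is not fixed by the definitions above, the sketch assumes histogram-based probabilities with a default of 10 bins; this choice and the function names are illustrative assumptions.

```python
import numpy as np

def cluster_sigma(distances):
    """Standard deviation sigma(C_i) of the distances between the
    instances of a cluster and its representative (1/m_i formula)."""
    d = np.asarray(distances, dtype=float)
    return float(np.sqrt(np.mean((d - d.mean()) ** 2)))

def cluster_entropy(distances, bins=10):
    """Entropy E(C_i) of the distance distribution, with P(d_ik)
    estimated from a normalized histogram (illustrative choice)."""
    d = np.asarray(distances, dtype=float)
    counts, _ = np.histogram(d, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())
```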

3 TSX-Means: A New Method for Time Series Clustering

Our approach mainly focuses on a new strategy for robust cluster refinement and automatic determination of the optimal number of clusters k. Any distance (or similarity) measure adapted to time series can be used in this approach; we tested it with different measures, in particular measures derived from DTW. Starting from a minimum number of clusters initially set to \(nb\_min\_clust\) and a set of defined criteria, the method refines each cluster by revisiting all its instances. Instances that do not verify the criteria, with respect to the cluster they belong to, are put in a reject class. We then iterate the principle on that reject class (considered as a new set of series to be clustered) until the stopping conditions are verified. The criteria used in our approach are linked to the following thresholds: (1) \(nb\_min\_inst\), the minimum number of instances allowed per cluster, and (2) \(s_d\), the intra-cluster dispersion threshold, defined from a new dispersion measure that depends on both the variability and the entropy of the distances between each instance and the representative of the cluster it belongs to. In this contribution, we propose a new dispersion measure of the distances between the instances of a cluster and their representative. This dispersion measure, denoted disp, is the ratio between the standard deviation and the entropy of the distance values.

Definition 1

(measure of dispersion disp). Let \(C_i\) be a cluster of the set T. We define its measure of dispersion as \(disp(C_i)=\frac{\sigma (C_i)}{E(C_i)} \).

If the dispersion is minimal, then the homogeneity is maximal. \(disp(C_i)\) reflects the intra-cluster variability: the smaller disp is, the smaller the variability around the representative. This makes it possible to select the instances nearest to a representative according to a fixed threshold, denoted \(s_d\) in the following.
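A minimal sketch of the dispersion measure, reusing the cluster_sigma and cluster_entropy helpers above; the guard against a zero entropy is an added safeguard, not part of the definition.

```python
def dispersion(distances, bins=10):
    """disp(C_i) = sigma(C_i) / E(C_i): the smaller the value, the more
    homogeneous the cluster around its representative."""
    e = cluster_entropy(distances, bins=bins)
    return float('inf') if e == 0.0 else cluster_sigma(distances) / e
```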

Criteria for Selecting Cluster Instances: Let \(s_d\) be a fixed threshold and \(C_i\) a cluster. A new associated cluster \(C'_i\subset C_i\) is built, verifying \(disp(C'_i)\le s_d\). Computing the dispersion measure requires at least two values, so a minimum number of instances initially placed in the new cluster \(C'_i\) is provided by \(nb\_min\_inst\) in the algorithm. To determine those instances, the values of \(Dist(C_i)\) are sorted and saved in \(Sort(Dist(C_i)) = \lbrace v_1, v_2, \dots , v_m \rbrace \), with \(v_i \le v_j\) \(\forall \) \(i < j\) (procedure ApplyCriteria). We place in \(C'_i\) the first \(nb\_min\_inst\) instances of the sorted list \(Sort(Dist(C_i))\). If \(disp(C'_i)\le s_d \), then the other instances are added to \(C'_i\) one by one, as long as the criterion remains true; instances that do not verify the criterion are put in the reject cluster. The value \(disp(C'_i)\) is updated each time an instance is added.
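The sketch below illustrates this selection procedure for a single cluster, reusing the dispersion helper above. The rejection of the whole cluster when even the \(nb\_min\_inst\) closest instances exceed the threshold reflects our reading of the procedure, and all names are illustrative.

```python
def apply_criteria(dist_to_rep, s_d, nb_min_inst, bins=10):
    """Build C'_i from one cluster. dist_to_rep is a list of pairs
    (instance_id, distance to the representative). Returns the kept
    instance ids and the rejected ones."""
    ordered = sorted(dist_to_rep, key=lambda pair: pair[1])    # Sort(Dist(C_i))
    ids = [i for i, _ in ordered]
    if len(ordered) < nb_min_inst:
        return [], ids                         # cluster too small: reject everything
    dists = [d for _, d in ordered[:nb_min_inst]]              # seed C'_i
    if dispersion(dists, bins) > s_d:
        return [], ids
    kept = nb_min_inst
    for _, d in ordered[nb_min_inst:]:
        if dispersion(dists + [d], bins) > s_d:
            break                              # farther instances go to the reject cluster
        dists.append(d)                        # disp(C'_i) is updated after each addition
        kept += 1
    return ids[:kept], ids[kept:]
```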

3.1 Principle of the Method

The algorithm takes as parameters the thresholds \(nb\_min\_clust\), \(nb\_min\_inst\), and \(s_d\), and can use any distance measure Dist. As output, it provides a number of clusters determined automatically based on the dispersion criterion, and a reject class denoted CR. The principle of the algorithm is the following:

Step 1: Definition of Initial Clusters. The instances of T (the set of time series) are partitioned into a minimum number of \(nb\_min\_clust\) clusters. To create those clusters, we apply the classic TSK-Means algorithm (or K-Shape) with \(k=nb\_min\_clust\) and a distance measure Dist (e.g., DTW). The procedure \([C, Dist(C)]= CreateInitialsClusters(T,nb\_min\_clust)\) of Algorithm 1 returns the initial clusters.
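As an illustration, the initial clusters can be obtained with the DTW-based K-means of the tslearn library, used here as a stand-in for TSK-Means (tslearn's KShape can be substituted in the same way). The function name, which mirrors the CreateInitialsClusters procedure, and the use of cluster centroids as representatives are illustrative assumptions.

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.metrics import dtw

def create_initial_clusters(X, nb_min_clust, random_state=0):
    """X: array of shape (n_series, length). Returns the cluster label of
    each series and its DTW distance to the centroid of its cluster."""
    model = TimeSeriesKMeans(n_clusters=nb_min_clust, metric="dtw",
                             random_state=random_state)
    labels = model.fit_predict(X)
    dist_to_rep = np.array([dtw(X[i], model.cluster_centers_[labels[i]])
                            for i in range(len(X))])
    return labels, dist_to_rep
```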

Step 2: Refining Clusters by Applying the Dispersion Criterion. The procedure \([C', CR]= ApplyCriteria(C,Dist,s_d, nb\_min\_inst)\) consists in applying the homogeneity criterion to each cluster \(C_i\) so as to keep only the instances verifying that criterion. The remaining instances are assigned to the reject class CR. If the number of instances of an initial cluster \(C_i\) is less than \(nb\_min\_inst\), this cluster is deleted and its instances are assigned to the reject class.

Step 3: Applying the Stopping Criterion. If the number of instances in the reject class is greater than \(nb\_min\_inst\), the initial step is repeated taking the reject class as the new set T. Otherwise, the algorithm stops.

3.2 TSX-Means Algorithm

At the first call of our recursive method (Algorithm 1), the number of clusters to be determined, nbClust, is initialized to 0, and the set of final clusters \(C_f\) to the empty set. At each call of the recursive algorithm, a new set of at most \(nb\_min\_clust\) clusters plus the reject cluster are created from the initial clusters obtained by the CreateInitialsClusters method. The algorithm is repeated as long as the reject cluster is not empty and its number of instances is greater than \(nb\_min\_inst\). The method could assign the same instances to the reject cluster CR indefinitely if no new admitted cluster \(C_f\) were generated; the recursiveCpt iteration counter therefore stops the algorithm when it reaches a maximum number of iterations provided by the user. Thus, it is possible to obtain a number of clusters lower than \(nb\_min\_clust\), or even no cluster at all. This occurs when the ApplyCriteria method does not find any instance verifying the dispersion criterion in any of the initial clusters, and is linked to a low value of the dispersion threshold. Nevertheless, increasing the threshold integrates instances that are far from the representative and leads to clusters with high variability.

Algorithm 1. The TSX-Means clustering algorithm.
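The following compact sketch of the overall refinement loop is written iteratively rather than recursively and reuses the create_initial_clusters and apply_criteria sketches above; the exact bookkeeping of Algorithm 1 (e.g., nbClust, the handling of clusters smaller than \(nb\_min\_inst\)) may differ.

```python
def tsx_means(X, nb_min_clust, nb_min_inst, s_d, max_iter=10):
    """Returns the list of refined clusters (as arrays of indices into X)
    and the final reject class CR."""
    final_clusters = []                        # C_f
    remaining = np.arange(len(X))              # instances still to be clustered
    for _ in range(max_iter):                  # recursiveCpt safeguard
        if len(remaining) <= nb_min_inst:
            break
        labels, dist = create_initial_clusters(X[remaining], nb_min_clust)
        reject = []
        for c in range(nb_min_clust):
            members = np.where(labels == c)[0]
            pairs = [(i, dist[i]) for i in members]
            kept, rejected = apply_criteria(pairs, s_d, nb_min_inst)
            if kept:
                final_clusters.append(remaining[kept])
            reject.extend(rejected)
        remaining = remaining[np.asarray(reject, dtype=int)]   # reject class becomes the new T
    return final_clusters, remaining           # remaining = CR
```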

4 Experimental Results

The method has been tested on data from the UEA & UCR archives [2]. We tested our algorithm on 20 datasets. The chosen datasets have series of various lengths and different numbers of classes. Most of them have a low number of classes (\({\le }7\)); we call these non-complex data. In order to test our new method TSX-Means on more complex data, the last 7 datasets have a higher number of classes (\({\ge }24\)). For each dataset and each distance used, we tested our method by varying the parameters \(s_d\) and \(nb\_min\_clust\); \(nb\_min\_inst\) was set to the number of instances of the smallest class of the dataset. Once the number k is found by our algorithm, we run TSK-Means and K-Shape with the same value of k to compare the performance of the three methods. We used several metrics (Accuracy, ARI, and V-Measure (VM)), averaged over the tested parameters, for performance comparison. Accuracy is computed when the number of clusters found equals the actual number of classes; otherwise, ARI and VM are used. The parameter \(nb\_min\_clust\) has a greater impact on the performance measures, and particularly on V-Measure, than the threshold \(s_d\). For complex data, ARI and VM are on average 10% higher for TSX-Means than for the K-Shape method. The new dispersion measure is a good indicator of cluster homogeneity. In general, the TSX-Means method is more efficient than the other methods, especially when the number of classes is very high. Table 1 shows the results for the accuracy scores.
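The ARI and V-Measure comparisons can be reproduced with scikit-learn's implementations, as in the toy example below; the accuracy comparison additionally requires matching the predicted clusters to the ground-truth classes, which is omitted here.

```python
from sklearn.metrics import adjusted_rand_score, v_measure_score

# Toy ground-truth classes and predicted cluster labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]

print(adjusted_rand_score(y_true, y_pred))   # ARI: 1.0, labels match up to renaming
print(v_measure_score(y_true, y_pred))       # V-Measure: 1.0 for this perfect grouping
```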

Table 1. Accuracy of TSX-Means with initial clusters from TSK-Means.

We notice that the dispersion measure improves clustering performance with the set of measures derived from DTW: TSX-Means outperforms the TSK-Means and K-Shape methods for the majority of the datasets. Table 2 shows the accuracy scores of TSX-Means using K-Shape as the generator of initial clusters. Our method outperforms K-Shape on 5 of the 7 datasets.

Table 2. Accuracy of TSX-Means and K-Shape, with initial clusters from K-Shape.

5 Conclusion and Perspectives

We proposed a new measure of the dispersion within a cluster and designed TSX-Means, a new method for time series clustering that automatically determines an optimal number of clusters. The dispersion measure allows refining clusters initially generated by existing clustering methods. The performance of TSX-Means was compared to that of TSK-Means and K-Shape on a set of datasets. The clustering quality measures showed that TSX-Means outperforms TSK-Means and K-Shape, especially for data with a very large number of classes.