1 Introduction

Time series clustering has become a very active research area, with applications in many different fields. However, most methods developed so far do not take into account the possible presence of contamination by outliers or spurious information. In this work, we propose a clustering algorithm for stationary time series that is based on considering the estimated spectral density functions as functional data. This procedure has robust features that mitigate the effect of noise in the data and help to prevent the identification of spurious clusters.

Comprehensive reviews of the area can be found in [1, 8, 19], and [21] presents an R package for time series clustering with a wide range of alternative methods. According to Liao [19], there are three approaches to the clustering of time series: methods that depend on the comparison of the raw data, methods based on models fitted to the data, and methods based on features derived from the time series. Our proposal falls within the third approach, and the spectral density is the characteristic used to gauge the similarity between the time series in the sample.

Spectral characteristics have been previously considered as the main tool for time series clustering. Caiado et al. [6, 7] use the periodogram and normalized periodogram ordinates for clustering time series. Maharaj and D’Urso [20] propose a fuzzy clustering algorithm based on the estimated cepstrum, which is the spectrum of the logarithm of the spectral density. Alvarez-Esteban et al. [3] and Euán et al. [13] consider the use of the total variation distance on normalized estimates of the spectral density as a similarity measure for clustering. A brief description of the last two algorithms will be given in Sect. 2.

Other works have focused on developing robust clustering algorithms for time series. D’Urso et al. [11] use a fuzzy approach to propose a robust clustering model based on autoregressive models. D’Urso et al. [12] present robust fuzzy clustering schemes for heteroskedastic time series based on parametric models. Bahadori et al. [4] propose a clustering framework for functional data, which can be applied to time series with warped alignments.

Our proposal is based on the use of spectral densities, considered as functional data, and the application of the clustering algorithm recently developed in [24], which will be described in Sect. 3. Several clustering methods for functional data have been proposed in the literature, for instance [5, 17, 18], but these methods are not aimed at dealing with outlying curves. Trimming techniques for robust clustering have been applied in [9, 15].

The rest of the paper is organized as follows: Sect. 2 describes the idea behind our proposal for time series clustering. Section 3 gives a brief description of the robust clustering procedure for functional data that supports the time series clustering algorithm. Section 4 presents a simulation study that compares the performance of the algorithm with existing alternatives and Sect. 5 gives an application to a real data set. The paper ends with a discussion of the results.

2 Time Series Clustering

Consider a collection of n stationary time series \(X_{1,t},X_{2,t},\dots ,X_{n,t}\) with \(1\le t \le T\). For ease of notation we take all series to have the same length, but this is not a requirement of the procedure. The spectral density of each time series is estimated by one of the many procedures available. Previous clustering methods based on the use of spectral densities relied on similarity measures for discriminating between them. In this work, the spectra are considered as functional data to which the robust clustering procedure developed in [24] is applied. The resulting clusters correspond to time series whose spectral densities have similar shapes, and hence similar oscillatory behavior. The procedure is able to detect outliers among the spectral densities, which correspond to time series having atypical oscillatory characteristics.

Two methods based on estimated spectral densities are presented in [3, 13]. We describe them in more detail since they will be used later for comparison purposes. We refer to them as “TVDClust” and “HSMClust”, respectively. In both cases the total variation distance (TVD) is used to measure similarity between spectral densities. TVD is a frequently used distance between probability measures that, in the case of probability distributions having a density, measures the complement of the common area below the density curves. Thus, the more alike the densities are, the larger this common area and the smaller the TV distance. To use this distance to compare spectral densities, they need to be normalized so that the total area below the curve is equal to 1, which is equivalent to normalizing the original time series to have unit variance. Thus, it is the oscillatory behavior of the series, and not the magnitude of the oscillations, that is taken into account in these clustering algorithms.
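On a discretized frequency grid, the common-area form of TVD is straightforward to compute. The following is a minimal R sketch, assuming two estimated spectral densities evaluated on the same equispaced grid `freq` (names are illustrative, not from [3, 13]):

```r
## Assumed setup: f1, f2 are spectral ordinates on a common equispaced
## frequency grid `freq`; both are normalized to unit area before comparison.
tvd <- function(f1, f2, freq) {
  d  <- freq[2] - freq[1]          # grid step
  p1 <- f1 / sum(f1 * d)           # normalize: total area below curve = 1
  p2 <- f2 / sum(f2 * d)
  1 - sum(pmin(p1, p2) * d)        # TVD = 1 - common area below the curves
}
```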

For “TVDClust”, a dissimilarity matrix is built by measuring the TVD between all pairs of normalized estimated spectral densities. This matrix is then fed to a hierarchical agglomerative algorithm with the complete or average linkage functions. The result is a dendrogram, which can be cut to obtain the desired number of groups. To decide on the number of clusters, an external criterion such as the Silhouette or Dunn’s index is used. More details can be found in [3].
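A sketch of this pipeline with base R functions, assuming a matrix `spectra` whose rows are the normalized estimated spectral densities on the grid `freq`, and the `tvd()` helper above:

```r
n <- nrow(spectra)
D <- matrix(0, n, n)                               # TVD dissimilarity matrix
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    D[i, j] <- D[j, i] <- tvd(spectra[i, ], spectra[j, ], freq)
  }
}
hc     <- hclust(as.dist(D), method = "complete")  # or "average" linkage
groups <- cutree(hc, k = 2)                        # cut the dendrogram at K groups
```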

The second method, “HSMClust”, is a modification in which, every time two clusters are joined, all the information in them is used to obtain a representative spectrum for the new cluster. There are two ways to do this: either all the spectral densities are averaged, which is the average option in the algorithm, or else all the time series in the two groups are concatenated and a new spectral density is estimated, which corresponds to the single option. Under the assumption that the series in the same cluster have common oscillatory characteristics, either of these procedures will give a more accurate estimate of the common spectral density for the whole group. This algorithm is known as the Hierarchical Spectral Merger (HSM) algorithm, and its implementation in R is available at http://ucispacetime.wix.com/spacetime#!project-a/cxl2.

In “HSMClust”, every time two clusters are merged the dissimilarity matrix is reduced in size. In “TVDClust”, by contrast, the matrix remains the same throughout the procedure and the distances between clusters are calculated using linear combinations of the distances between the individual points in each cluster, where the linear combination used is determined by the linkage function employed. More details can be found in [3].

3 Robust Clustering for Functional Data

We now give a brief description of the algorithm proposed in [24], where more details can be found. Let X be a random variable taking values in the Hilbert space \( L^{2}([0,T])\) of square integrable functions defined on the interval [0, T], with inner product given by \(\langle f,g \rangle = \int f(t)g(t)\, dt\). If \(\mu (t)=E\{X(t)\}\) and \(\varGamma (s,t)=\text {cov}\{X(s),X(t)\}\), then it is usual to represent X through its Karhunen-Loève expansion \( X(t)=\mu (t)+\sum _{j=1}^{\infty }C_{j}(X)\psi _{j}(t) \). In this expansion, the \(\psi _{j}\) form an orthonormal system of functions obtained as eigenfunctions of the covariance operator \(\varGamma \), i.e. \(\langle \varGamma (\cdot ,t),\psi _{j} \rangle =\lambda _{j}\psi _{j}(t)\), and the eigenvalues \(\lambda _j\) are taken in decreasing order and assumed to satisfy \(\sum _{j=1}^{\infty }\lambda _{j}<\infty \). The principal component scores \(C_{j}(X)=\langle X-\mu , \psi _{j}\rangle \) are uncorrelated univariate random variables with zero mean and variance equal to \(\lambda _{j}\). Delaigle and Hall [10] show that \( \log P(||X-x||\le h)\) can be approximated by \( \sum _{j=1}^{p}\log f_{C_{j}}(c_{j}(x))\), for any \(x\in L^{2}([0,T])\) and small h, where \(f_{C_{j}}\) is the probability density function of \(C_{j}(X)\) and \(c_{j}(x)=\langle x, \psi _{j} \rangle \). This approximation entails a kind of “small-ball pseudo-density” approach to Functional Data Analysis, taking into account that probability density functions in the finite dimensional case can be seen as the limit of \(P(||X-x||\le h)/h\) when \(h\rightarrow 0\). In the particular case where X is a Gaussian process, the \(C_{j}(X)\) are independent normally distributed random variables with mean 0 and variance \(\lambda _j\).
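In practice, the eigenfunctions \(\psi _{j}\), eigenvalues \(\lambda _{j}\) and scores \(C_{j}(X)\) can be estimated by functional principal component analysis. A self-contained R sketch using the fda package on a toy sample (the data, basis size and dimension p = 5 are all illustrative):

```r
library(fda)
set.seed(1)
tt    <- seq(0, 1, length.out = 101)                        # grid on [0, T]
ymat  <- replicate(20, sin(2 * pi * tt) + rnorm(101, sd = 0.2))
basis <- create.bspline.basis(c(0, 1), nbasis = 15)
fdobj <- smooth.basis(tt, ymat, basis)$fd    # curves as a functional data object
fpca  <- pca.fd(fdobj, nharm = 5)   # eigenfunctions psi_j of the covariance
scores <- fpca$scores               # scores C_j(X): uncorrelated, zero mean
lambda <- fpca$values[1:5]          # eigenvalues lambda_j = Var(C_j)
```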

With these ideas in mind, Jacques and Preda [17] proposed a “model-based” approach for clustering functional data, where a finite number of independent, normally distributed principal component scores are assumed and different variances are allowed for each cluster. Bouveyron and Jacques [5] had previously considered a different approach, where a certain fraction of the smallest variances are constrained to be equal for each cluster.

In Rivera-García et al. [24], starting from Bouveyron and Jacques [5] and Jacques and Preda [17], a robust functional clustering procedure is proposed in which a proportion \(\alpha \) of curves is allowed to be trimmed and constraints on the variances are imposed. If \(\{x_1,...,x_n\}\) is a set of curves in \(L^2([0,T])\), we consider the maximization of a trimmed mixture loglikelihood defined as

$$\begin{aligned} \sum _{i=1}^{n} \eta (x_{i}) \log \bigg (\sum _{g=1}^{K} \pi _{g} \bigg [\prod _{j=1}^{q_{g}} \frac{1}{\sqrt{2 \pi a_{jg}}} \exp \Big (\frac{-c_{ijg}^{2}}{2a_{jg}}\Big ) \prod _{j=q_{g}+1}^{p} \frac{1}{\sqrt{2 \pi b_{g}}} \exp \Big (\frac{-c_{ijg}^{2}}{2 b_{g}}\Big )\bigg ]\bigg ) \end{aligned}$$
(1)

where \(c_{ijg}=c_{jg}(x_{i})\) is the j-th principal component score of curve \(x_{i}\) in group g, \(g=1,...,K\), \(\eta (\cdot )\) is an indicator function with \(\eta (x_{i})=0\) if the curve \(x_{i}\) is trimmed and 1 otherwise, and \(\pi _g\), \(g=1,...,K\), are mixing weights that add up to 1. A proportion \(\alpha \) of curves is trimmed, so that \(\sum _{i=1}^{n}\eta (x_{i})=[n(1-\alpha )]\). The main variance contributions in the g-th cluster are assumed to be \(a_{1g}\),..., \(a_{q_gg}\), for the first \(q_g\) components, while each of the remaining \(p-q_g\) components is assumed to contribute the same variance \(b_{g}\). Notice that we take an equal number of principal components p in every cluster, but the number of main components \(q_g\) may vary across clusters. Finally, to prevent the detection of spurious clusters, two constants \(d_1\ge 1\) and \(d_2\ge 1\) are fixed such that the maximization of (1) is done under the constraints:

$$\begin{aligned} \frac{\max _{g=1,...,K;\,j=1,...,q_g}a_{jg}}{\min _{g=1,...,K;\,j=1,...,q_g}a_{jg}}\le d_1 \quad \text {and} \quad \frac{\max _{g=1,...,K}b_{g}}{\min _{g=1,...,K}b_{g}}\le d_2. \end{aligned}$$
(2)
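To make the structure of (1) concrete, the bracketed term for curve \(x_i\) in group g is a product of p zero-mean Gaussian densities with variances \(a_{1g},...,a_{q_gg}\) followed by \(p-q_g\) copies of \(b_g\). A hedged R sketch (function and variable names are ours, not from [24]):

```r
## log of the bracketed term in Eq. (1) for one curve in group g:
## c_ig = scores (c_{i1g},...,c_{ipg}); a_g = (a_{1g},...,a_{q_g g}); b_g scalar
log_dens_g <- function(c_ig, a_g, b_g) {
  p    <- length(c_ig)
  vars <- c(a_g, rep(b_g, p - length(a_g)))   # main variances, then b_g's
  sum(dnorm(c_ig, mean = 0, sd = sqrt(vars), log = TRUE))
}
## The contribution of a non-trimmed curve to (1) is then
## log(sum over g of pi_g * exp(log_dens_g(...))) across the K groups.
```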

A feasible algorithm for performing the constrained maximization, detailed in [24], is a modification of the traditional EM algorithm used in model-based clustering, in which a “trimming” step (T-step) is added. In the T-step, the curves with the smallest contributions to the trimmed likelihood are temporarily left out in each iteration of the algorithm. This trimming step is similar to the “concentration” steps used in the fast-MCD algorithm [25]. To enforce the required constraints on the variances, optimally truncated variances, as done in [14], are adopted if needed. For the estimation of the dimension \(q_g\) in each cluster, a Bayesian Information Criterion (BIC) approach was proposed in [24].
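The T-step itself reduces to ranking the per-curve contributions. A minimal sketch, assuming the mixture log-contributions to (1) have already been collected in a vector:

```r
## loglik_i: length-n vector of per-curve mixture log-contributions
t_step <- function(loglik_i, alpha) {
  n <- length(loglik_i)
  ## eta(x_i): FALSE (trimmed) for the floor(n * alpha) smallest contributions
  rank(loglik_i, ties.method = "first") > floor(n * alpha)
}
```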

4 Simulation Study

To evaluate the performance of the proposed Robust Functional Clustering (RFC) methodology, a simulation study was carried out. We now describe the different scenarios and contamination types. As in [13], the simulations are based on combinations of autoregressive processes of order 2, AR(2), which are defined as \( X_{t} = u_{1}X_{t-1}+ u_{2}X_{t-2}+\epsilon _{t} \) where \(\epsilon _{t}\) is a white noise process. The associated characteristic polynomial is \(h(y)=1-u_{1}y-u_{2}y^{2}\) and its roots, denoted by \(y_{1}\) and \(y_{2}\), are related to the spectrum of the time series. If the roots are complex-valued, they must be conjugate, i.e. \(y_{1}=\overline{y_{2}}\), and their polar representation is \( |y_{1}|=|y_{2}|=M \quad \text {and} \quad \arg (y_{i})= 2\pi \nu /w_{s} \) where \(w_{s}\) is the sampling frequency in Hertz, M is the magnitude of the root (\(M>1\) for causality) and \(\nu \) is the frequency index, \(\nu \in (0, w_{s}/2)\). The spectrum will have a peak at the modal frequency \(\nu \), which becomes broader as \(M\rightarrow \infty \) and narrower as \(M\rightarrow 1^{+}\). Then, given (\(\nu , M, w_{s}\)) and with \(\omega _{0}=\frac{2\pi \nu }{w_{s}}\), we have \( u_1 = 2M^{-1}\cos \omega _0 \quad \text {and} \quad u_2=-M^{-2}\).
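The map from \((\nu , M, w_{s})\) to \((u_1, u_2)\) is immediate to code. A short R sketch, simulating one such series with the base function arima.sim (parameter values taken from the study below):

```r
ar2_coef <- function(nu, M, ws = 1) {
  w0 <- 2 * pi * nu / ws                  # peak angular frequency
  c(u1 = 2 * cos(w0) / M, u2 = -1 / M^2)  # causal AR(2) with root modulus M
}
u <- ar2_coef(nu = 0.21, M = 1.15)
x <- arima.sim(model = list(ar = u), n = 1000)  # one series of length T = 1000
```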

Two groups of 50 time series each were simulated, with parameters \(\nu _1 = 0.21\), \(\nu _2 = 0.22\), \(M_1=M_2 = 1.15\), \(w_s=1\) and length \(T=1000\). From the simulated time series, the spectral densities were estimated using a smoothed lag-window estimator with a Parzen window and bandwidth 100/T. The estimated spectral densities are shown in Fig. 1(a). The functional form of the estimated spectral densities was recovered using a B-spline basis of degree 3 with 14 equispaced nodes and smoothing parameter \(\lambda =0.000003\) (see e.g. [22], Chap. 3). We want to test the performance of the different algorithms in recovering these two groups, even in the presence of contaminating data. In the absence of contamination we have 100 observations divided into two groups.
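The following R sketch reproduces this setup under stated substitutions: base R has no Parzen lag-window estimator, so a Daniell-smoothed periodogram (spec.pgram) stands in for it, with an illustrative span; the B-spline smoothing uses the fda package with the 14 nodes and \(\lambda \) given above. It reuses ar2_coef() from the previous sketch.

```r
simulate_group <- function(n, nu, M, len) {
  u <- ar2_coef(nu, M)
  replicate(n, as.numeric(arima.sim(model = list(ar = u), n = len)))
}
len  <- 1000
xmat <- cbind(simulate_group(50, 0.21, 1.15, len),   # group 1
              simulate_group(50, 0.22, 1.15, len))   # group 2
spec <- apply(xmat, 2, function(x)
  spec.pgram(x, spans = 11, taper = 0, plot = FALSE)$spec)
freq <- spec.pgram(xmat[, 1], spans = 11, taper = 0, plot = FALSE)$freq
## recover the functional form with a cubic B-spline basis (fda package)
library(fda)
basis <- create.bspline.basis(range(freq),
                              breaks = seq(min(freq), max(freq), length.out = 14))
fdobj <- smooth.basis(freq, spec, fdPar(basis, 2, lambda = 3e-6))$fd
```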

We introduce the mixtures of AR(2) processes that will be used in the contamination schemes. Let \(Y_t^i, i=1,2\) be two AR(2) processes with parameters \(M_i\) and \(\nu _i\), \(i=1,2\). Their mixture is given by \( X_t = a_1 Y_t^1 + a_2 Y_t^2 + \epsilon _t \) where \(a_i, i=1,2\) are the weights and \(\epsilon _t\) is a white noise process. This mixture creates a signal that combines the oscillatory behavior of the processes \(Y_t^i, i=1,2\).
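A sketch of such a mixture in R, reusing ar2_coef() from above and assuming equal weights \(a_1=a_2=0.5\) for illustration:

```r
mix_ar2 <- function(len, nu, M, a = c(0.5, 0.5), ws = 1) {
  y1 <- arima.sim(model = list(ar = ar2_coef(nu[1], M[1], ws)), n = len)
  y2 <- arima.sim(model = list(ar = ar2_coef(nu[2], M[2], ws)), n = len)
  as.numeric(a[1] * y1 + a[2] * y2 + rnorm(len))   # white noise epsilon_t
}
z <- mix_ar2(1000, nu = c(0.20, 0.25), M = c(1.05, 1.1))  # as in scheme (ii) below
```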

Starting from the two groups of 50 AR(2) time series described previously, which are considered as the clean data, we added another 11 time series (around 10% contamination level), generated according to the following schemes:

  (i) AR(2) processes with parameters \(\nu _{i}\) chosen randomly with uniform distribution on the interval (.20, .25), denoted U(.20, .25), \(M=1.2\) and \(w_{s}=1\). The contaminating series have smaller variance than the series in the clusters. See Fig. 1(b).

  (ii) A mixture of two AR(2) processes with parameters \(\nu _{i}=.20\) and .25; \(M_{i}=1.05, 1.1\), \(i=1,2\) and \(w_{s}=1\). See Fig. 1(c).

  (iii) A mixture of two AR(2) processes with random parameters \(\nu _{1}=U(.19,.22)\) and \(\nu _{2}=U(.24,.26)\); \(M_{i}=1.05, 1.1\), \(i=1,2\) and \(w_{s}=1\). See Fig. 1(d).

Figure 1(b), (c) and (d) show the spectral densities for the simulated time series with the three contamination schemes described.

Fig. 1. Spectral density of the simulated time series: (a) No contamination, (b) Contamination type (i), (c) Contamination type (ii) and (d) Contamination type (iii).

In order to test the performance of the RFC methodology, the simulated processes and their estimated spectral densities were used to compare it with the “Funclust” algorithm [17] and with the hierarchical methods based on the total variation distance: “HSMClust” [13] and “TVDClust” [2, 3].

It is important to recall that we treat the dimensions \(q_g\) in the RFC procedure as unknown parameters and that the BIC criterion is used to estimate them when applying this algorithm. The results in [24] already show the importance of trimming. Trimming levels \(\alpha =0\) and \(\alpha =0.1\) are used. As regards the constraints, we assume \(d_1=d_2=d\) to simplify the simulation study. Values of \(d=3\), 10 and \(10^{10}\) (i.e., almost unconstrained in this last case) were used. We always return the best solution in terms of the highest BIC value for each combination of those fixed values of the trimming level and constraints. We use 100 random initializations with 20 iterations each.
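Since the implementation of [24] is not distributed as a standard package, the following R sketch uses a hypothetical interface rfc() purely to illustrate the tuning grid of the study; the function name, its arguments and the fd object fdobj of smoothed spectra are all assumptions, not an actual API:

```r
grid <- expand.grid(alpha = c(0, 0.1), d = c(3, 10, 1e10))
fits <- lapply(seq_len(nrow(grid)), function(k)
  rfc(fdobj, K = 2,                      # hypothetical function and signature
      alpha = grid$alpha[k],             # trimming level
      d1 = grid$d[k], d2 = grid$d[k],    # variance constraints of (2)
      nstart = 100, iter.max = 20))      # best-BIC solution over random starts
```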

Fig. 2. Correct classification rate (CCR) for the four methods considered, represented in different columns. Rows correspond to the different contamination schemes, starting with no contamination in the first row and following with schemes (i), (ii) and (iii) described in the text. Constraint levels \(d_1=d_2=3\), 10 and \(10^{10}\) and trimming levels \(\alpha =0\) and 0.1 were used for the RFC method. Threshold values \(\varepsilon = 0.001, 0.05\) and 0.1 were used for the “Cattell” procedure in “Funclust”. Single and average versions were used for “HSMClust”, while average and complete linkage functions were used for “TVDClust”.

For the “Funclust” method we used the library Funclustering [26] in R, where the EM algorithm is initialized with the best solution out of 20 “short” EM runs of only 20 iterations, with threshold values \(\varepsilon =0.001, 0.05, 0.1\) in the Cattell test. For the agglomerative methods, we use the R library HSMClust for “HSMClust”, and for “TVDClust” the algorithm described in [2, 3].

Figure 2 shows the results for the simulation study. It is composed of a matrix of graphs, where the rows correspond to the different contamination schemes (uncontaminated in the first row) while the columns correspond to the methodologies tested. The first column corresponds to “Funclust”, the second to “HSMClust”, the third shows the results for the RFC procedure with trimming levels \(\alpha =0\) (untrimmed) and \(\alpha =0.1\) and three constraint levels \(d=3\), 10 and \(10^{10}\) (i.e., almost unconstrained in this last case). The fourth column shows the results corresponding to “TVDClust”. The x-axis corresponds to the threshold applied in the Cattell test for “Funclust”, the procedure in “HSMClust”, the constraint level for RFC and the linkage function for the agglomerative method “TVDClust”, while the y-axis corresponds to the correct classification rate (CCR).

Results show that the hierarchical methods “HSMClust” and “TVDClust” are better in the absence of contamination, giving very consistent results. However, their performance degrades sharply in the presence of noise. This is not surprising, since these procedures were not designed to handle contamination in the sample. The joint use of trimming and constraints in RFC improves the results (CCR) substantially. Results are very good for moderate \((d=10)\) and small \((d=3)\) values of the constraint constants, while for high values the results are poor; very high values for these constants are equivalent to having unconstrained parameters. Trimming turns out to be very useful in all the contaminated cases, while the results are not affected by trimming in the uncontaminated case.

In the presence of contamination, the results for “Funclust”, “HSMClust” and “TVDClust” fall below those of RFC when applying the \(\alpha =0.1\) trimming and small or moderate values of \(d_1\) and \(d_2\) for the variance constraints.

5 Analysis of Real Data

We now consider wave-height data measured by a buoy located in Waimea Bay, Hawaii, at a water depth of approximately 200 m. This buoy is identified as number 106 (51201 for the National Data Buoy Center). The data, which corresponds to 72.5 h divided into 30-minute intervals, was collected in June 2004 and has previously been analyzed in [2], where more details can be found.

In [2] the spectrum for each 30-minute interval was estimated and normalized. The TV distance between all spectral densities was used to build a dissimilarity matrix, which was fed into a hierarchical agglomerative clustering algorithm. For more details see [3]. The 145 normalized densities are shown in Fig. 3(a).

Fig. 3. Spectral densities for the sea wave data after normalization. (a) Original data. (b) Original data plus 22 additional densities in black, considered as noise.

The RFC method was applied to this data set in order to obtain an alternative clustering. The functional form of the data was recovered using B-splines of order 3 with 31 equispaced nodes. We use 100 initializations with 20 iterations each. The constraint level considered was \(d_{1}=d_{2}=3\), and the trimming level \(\alpha =0.13\). In [2] two different clusterings were obtained, depending on the linkage function used: 4 clusters for the complete linkage and 3 for average. We will only consider the clustering into 4 groups for comparison purposes in what follows.

To compare the two results, the Adjusted Rand Index (ARI) [16] was used. This is an improvement of the original Rand Index [23] and measures the similarity between two partitions of the same set, taking the value 1 when the partitions coincide exactly and values close to 0 for a completely “random” clustering of the data.
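One standard implementation is adjustedRandIndex() from the R package mclust; the label vectors below are illustrative names for the two partitions being compared:

```r
library(mclust)
## labels_tvd, labels_rfc: cluster labels of the same 145 densities
ari <- adjustedRandIndex(labels_tvd, labels_rfc)
```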

One can see that the effect of trimming and constraints is not harmful, even in the absence of contamination. For instance, the ARI of RFC with \(d_1=d_2=3\) and \(\alpha =0.13\) is equal to 0.513 with respect to the “reference” partition, which is obtained when applying [2] with 4 groups. To compute this ARI, we assign all the time series (trimmed and non-trimmed) to clusters by using the posterior probabilities from the fitted mixture model described in Sect. 3. The two rows in Fig. 4 show the clusters found when using TVD and RFC, respectively. Even though the groups differ in membership and size, it is possible to see from the figures that the shapes of the functions in the corresponding clusters are very similar and the mean functions are close. The variations are probably due to the different clustering techniques employed, but the similarity of the groups obtained points to consistent results for both methods. Observe that the trimmed curves for the RFC method are different from the rest of the functions in their cluster. For “HSMClust” both versions gave a value of 0.723 for the ARI, higher than that obtained with RFC, while for “Funclust” values were lower, with a maximum of 0.315 for a threshold value of 0.01 or 0.1 in the “Cattell” test.

Fig. 4. (Top) Clusters found using the TV distance with the complete linkage function. (Bottom) Clusters found using the RFC method for \(K=4\) with constraints \(d_{1}=d_{2}=3\) and trimming \(\alpha =0.13\). Each panel corresponds to the spectral densities in one cluster; grey lines represent the means and black lines represent the trimmed observations.

In order to test the performance of the different methods with real data in the presence of contamination, 22 time series were added to the sample. These measurements were recorded at the same location and during the same month, but on different days. The corresponding estimated spectral densities are shown in black in Fig. 3(b). Some of these densities are bimodal, while others are unimodal but have a lower modal frequency than those in the original sample.

Fig. 5. Clusters found with the different procedures when \(K=4\). Each panel corresponds to a cluster of spectral densities. Gray lines represent the means and black lines represent the trimmed observations. First row: original clusters for the TV distance with complete linkage, before adding the contaminating time series. Second row: clusters for the RFC method for \(K=4\) with constraints \(d_{1}=d_{2}=3\) and trimming \(\alpha =0.13\). Third row: clusters for the “TVDClust” method with complete linkage. Fourth row: clusters for the “HSMClust” method. Fifth row: clusters for “Funclust”.

The four clustering procedures considered were applied to this contaminated sample and the results were compared, again using as “reference” the clustering obtained in [2] with 4 groups applied to the clean data (i.e., before adding the contaminating curves). The ARI was computed by taking into account only the classification of the original (non-contaminating) densities. In the case of the RFC methodology, the assignments based on “posterior” probabilities were considered for the wrongly trimmed observations.

The ARI for the RFC method with \(K=4\) and \(d_1=d_2=3\) is equal to 0.167, 0.723 and 0.599 when trimming levels \(\alpha =0\), \(\alpha =0.13\) and \(\alpha =0.2\), respectively, are used. The corresponding ARI values for “Funclust” are always below 0.21 for all three Cattell thresholds tested (0.002, 0.05 and 0.1), as this method is not designed to cope with outlying curves. The other methods tested, “average TVD” and “average HSMClust”, give even worse results in this contaminated case, reaching ARI values equal to 0 in both cases. Therefore, the best results overall were obtained using RFC with \(\alpha =0.13\), while the other methods show poor results in the presence of contaminating data.

To reinforce the previous claims, Fig. 5 shows, in the first row, the partition obtained in [2] with four clusters before adding the contaminating time series and, in the second row, the results of applying RFC with four clusters, \(d_1=d_2=3\) and trimming level \(\alpha =13\%\) to the “contaminated” data set. The third to fifth rows show the results obtained with “TVDClust”, “HSMClust” and “Funclust”, also after the contaminating time series were added. Once again, the clusters obtained with RFC differ slightly from those obtained in [2] but, in spite of the presence of contamination, the shapes of the spectral densities in the corresponding clusters are very similar and the average densities are very close. The trimmed functions for level \(\alpha =13\%\) are shown in black in the second row. The last three rows show the poor results obtained with the other three methods. For instance, in the third row, corresponding to “TVDClust”, the original sample is clustered together in a single group in the leftmost panel, while the other three groups contain only the contaminating functions that were added as noise. Since \(\alpha =0.13\), 19 curves were trimmed by the RFC procedure, most of which come from the contaminating series that were added. Finally, it is also important to point out that trimming and clustering are performed simultaneously in the RFC approach.

6 Conclusions

A feasible methodology for robust clustering of stationary time series has been proposed and illustrated. The key idea behind the algorithm presented is the use of the estimated spectral densities of the time series, which are considered as functional data. A robust model-based algorithm with the simultaneous use of trimming and constraints is then used to cluster the original time series.

The use of trimming protects the estimation of the parameters against the effect of outlying curves, while the constraints avoid the appearance of spurious clusters and improve the performance of the algorithms. Simulations show that the joint use of constraints and trimming improves results in the presence of outliers, in comparison with other procedures for time series and functional data clustering that were not designed to work with contamination. The real data example shows that the proposed RFC method for time series clustering performs well, with or without outlying curves. In the presence of contamination, RFC is able to detect almost all the outliers in the data, and the trimmed curves often have characteristics different from the rest. We conclude that the proposed robust methodology can be a useful tool to detect contamination and groups in a time series data set simultaneously.

However, this methodology has some limitations. The choices of the trimming level \(\alpha \) and of the scatter constraint constants \(d_1\) and \(d_2\) can be subjective and sometimes depend on the final purpose of the cluster analysis. For this reason, we always recommend trying different values of the trimming level and constraints, monitoring the effect of these choices on the clustering partition.