1 Introduction

On the one hand, the goal of clustering is to discover homogeneous groups in the data. Clustering approaches are divided into hierarchical and non-hierarchical methods; K-means belongs to the second group (Everitt et al. 2011, Chapters 4 and 7).

On the other hand, functional data appear in many fields such as physical processes, genetics, biology, meteorology and signal processing (Bernardi et al. 2017; Ferraty and Vieu 2006; Ruiz-Medina and Espejo 2012; Bagirov and Mardaneh 2006). “A random variable \({\mathbf {X}}\) is called a functional variable if it takes values in an infinite dimensional space (or a functional space). An observation X of \({\mathbf {X}}\) is a functional data” (Ferraty and Vieu 2006, Chapter 1), and its analysis is called functional data analysis (FDA). FDA is an important topic in statistics with a wide range of applications. It includes principal component analysis, canonical correlation and discriminant analysis, functional linear models and clustering. Ramsay and Silverman (2005) presented various techniques to analyze these data.

Generic examples of functional data include weather data and human growth data. Some authors, such as Balzanella et al. (2017), Bohorquez et al. (2017) and Ruiz-Medina (2014), used FDA for environmental science applications. Mateu and Romano (2017) presented a comprehensive review of spatial functional data. In medicine, the growth of a child and the evolution of a patient's state are essentially functional data (Jacques and Preda 2014b; Tokushige et al. 2007), even though their observations are discrete: they can be regarded as continuous functions of another variable such as time.

In practice, functional data are observed at a finite set of time-points. In other words, we usually have a set \(\{f_{i}(t_{1}), \ldots , f_{i}(t_{n})\}\) with \(i=1, \ldots , N\), \(t_{j} \in [0, T]\) and \(T \in {\mathbb {R}}^{+}\). The matrix representation of these observations is as follows:

$$\begin{aligned} \begin{bmatrix} f_{1}(t_{1})&\quad f_{1}(t_{2})&\quad \cdots&\quad f_{1}(t_{n})\\ f_{2}(t_{1})&\quad f_{2}(t_{2})&\quad \cdots&\quad f_{2}(t_{n})\\ \vdots&\quad \vdots&\quad \ddots&\quad \vdots \\ f_{N}(t_{1})&\quad f_{N}(t_{2})&\quad \cdots&\quad f_{N}(t_{n})\\ \end{bmatrix}. \end{aligned}$$

Each row of this matrix corresponds to one observation, such as the growth of a child recorded at n points. If n tends to infinity, then each row becomes a continuous curve (i.e. infinite dimensional, or functional, data). The raw data do not convey the extra information carried by the underlying process that generates them. Because of the continuous nature of the data, applying a multivariate analysis directly to the raw data may not yield suitable results. In functional data clustering, a suitable result means that the algorithm assigns the data to the true clusters; this can be measured with different indices such as the mean square distance. Multivariate analysis has three shortcomings (Ramsay 1982) for these types of data: (1) further investigation of the variation, e.g. estimation of the first and second derivatives over time, is impossible; (2) the dependence between realizations at separate time-points is ignored; (3) information about the variable between two consecutive time-points is not available. Hence, FDA is appropriate. To use FDA, including functional data clustering, it is common to reconstruct a continuous curve from the finite points. This is done by approximating the curves through a finite set of basis functions such as splines (especially B-splines), Fourier series, wavelets, or a reproducing kernel Hilbert space (RKHS); the last one is used in this paper. To this end, let \({\mathcal {H}}\) be a real Hilbert space of functions and \(\{\phi _{1}, \ldots , \phi _{L}\}\), \(L \in {\mathbb {N}}\), be an orthonormal basis that generates a subspace of functions in \({\mathcal {H}}\). Each \(\phi _{l}\), \(l=1, \ldots , L\), is a continuous function; for example, in a Fourier series \(\phi _{l}(t)=A_l\cos (lt)+B_l\sin (lt)\) where \(A_l\) and \(B_l\) are constants. Then, a function \(f_i \in {\mathcal {H}}\) can be expressed as a linear combination as follows:

$$\begin{aligned} f_{i}(t) = \sum _{l=1}^{L} \alpha _{il}\phi _{l}(t) = {\varPhi }^T(t) {\mathbf {\alpha }}_{i}, \end{aligned}$$
(1)

where \(\varPhi (t) = (\phi _{1}(t), \ldots , \phi _{L}(t))^T\) and the expansion coefficient vector \({\mathbf {\alpha }}_{i} = (\alpha _{i1}, \ldots , \alpha _{iL})^T\) of the i-th curve, \(f_{i}\), is estimated using an interpolation procedure or least squares smoothing (Jacques and Preda 2014a). The continuous estimation of \(f_i\) is \(\hat{f_i}(t)= {\varPhi }^T(t) \hat{{\mathbf {\alpha }}}_{i}.\)
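For illustration, the following R sketch (our own helper names, a plain Fourier-type basis and synthetic data, not part of the original study) estimates the expansion coefficients of one discretely observed curve by least squares and reconstructs the curve at the observation points.

```r
# Least-squares estimation of the expansion coefficients of one curve (sketch).
fourier_basis <- function(t, L) {
  # n x (2L) matrix Phi with columns cos(l*t), sin(l*t), l = 1, ..., L
  do.call(cbind, lapply(1:L, function(l) cbind(cos(l * t), sin(l * t))))
}

n <- 100
t <- seq(0, 2 * pi, length.out = n)                    # time-points t_1, ..., t_n
y <- sin(t) + 0.5 * cos(3 * t) + rnorm(n, sd = 0.1)    # noisy observations f_i(t_j)

Phi       <- fourier_basis(t, L = 5)
alpha_hat <- solve(t(Phi) %*% Phi, t(Phi) %*% y)       # least-squares coefficients
f_hat     <- Phi %*% alpha_hat                         # continuous estimate evaluated at t_j
```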

Functional data are infinite dimensional because they are continuous curves. In most functional data clustering methods, the expansion coefficients \({\mathbf {\alpha }}_{i}\) obtained from Eq. (1) are clustered instead of the functional data themselves; see e.g. James and Sugar (2003), Abraham et al. (2003) and Luz-López-García et al. (2015). In other words, these methods project the infinite dimensional data onto the finite basis functions \(\{\phi _{1}, \ldots , \phi _{L}\}\) to obtain \({\mathbf {\alpha }}_{i}\) and then cluster the projected data \({\mathbf {\alpha }}_{i}\), which are finite dimensional. The K-means algorithm is commonly used to cluster the \({\mathbf {\alpha }}_{i}\).

In this paper, we focus on functional data clustering using RKHS and random projection.

In the next paragraph, we review some papers on functional data clustering. In these articles, the functional data are projected onto basis functions and the coefficients \(\alpha_i\) are clustered using K-means. The main purpose of these works is to improve the K-means algorithm. The basis most commonly applied in these works is the spline basis. Although splines have convenient properties such as ease of use and a simple mathematical description (James and Sugar 2003; Jacques and Preda 2014a), they have two main drawbacks: (1) they apply to functions with some degree of regularity and are not appropriate for irregular functions or functions with peaks; (2) they require heavy computation for high dimensional data (Antoniadis et al. 2013; Giacofci et al. 2013). For these reasons, some works used wavelet decompositions, which are suitable for irregular functions and non-stationary data (Antoniadis et al. 2013).

Abraham et al. (2003) projected data onto a B-spline basis and clustered the expansion coefficients via the K-means algorithm using the Euclidean distance. Yamamoto (2012) proposed a new objective function for this aim. Giuseppe et al. (2013) presented a method for clustering Italian climate zones using B-splines. Antoniadis et al. (2013) considered K-means clustering and proposed a dissimilarity measure between curves based on the wavelet transform. Ieva et al. (2013) clustered multivariate functional data via the K-means algorithm using a proper distance. Luz-López-García et al. (2015) projected the original data onto an RKHS and reduced the dimension of the coefficients by two strategies: functional principal components and random projection; they then clustered the coefficients with the K-means algorithm. Abramowicz et al. (2017) presented a non-parametric statistical method which gives suitable results for clustering misaligned dependent curves for seasonal climate interpretation.

In the following, we present various works that improved the performance of functional data clustering approaches, such as non-parametric and model-based approaches; some of them also compare different methods. James and Sugar (2003) projected the functional data onto a spline basis and proposed a model-based algorithm that is especially suited to sparse data. Rossi et al. (2004) projected data onto a spline basis and used the self-organizing map (SOM) algorithm to cluster the expansion coefficients. Ray and Mallick (2006) decomposed the data with a wavelet transform and clustered the obtained coefficients using a model-based algorithm. Bouveyron and Jacques (2011) applied a model-based method to the functional principal components. Boullé (2012) proposed non-parametric density estimation to cluster functional data. Haggarty et al. (2012) developed univariate and multivariate functional data clustering to determine patterns in water quality data in Scotland. Jacques and Preda (2013) applied model-based clustering via the probability density defined by Delaigle and Hall (2010); this model-based clustering was extended to the multivariate case by Jacques and Preda (2014b). Jacques and Preda (2014a) presented a classification of functional data clustering methods: (1) raw-data clustering, (2) two-stage methods, (3) non-parametric clustering, (4) model-based clustering. Finazzi et al. (2015) presented a novel approach for functional data clustering and compared it with other approaches such as the K-means algorithm.

By a regionalization procedure, groups of sites that are homogeneous in terms of hydro-meteorological characteristics can be found. One useful technique for dividing the catchments in a region into natural groups is clustering. In what follows, we present several works related to the regionalization of climatological data. Some of them, e.g. Rao and Srinivas (2006b) and Srinivas (2008), are based on hybrid clustering and global fuzzy K-means.

Mitchell (1976) regionalized the western United States into climatic regions based on air masses. Anyadike (1987) applied cluster analysis to regions over West Africa and obtained ten climatic clusters. Bärring (1988) regionalized the daily rainfall in Kenya using common factor analysis. El-Jabi et al. (1998) divided the province of New Brunswick (Canada) into homogeneous regions by the regionalization of floods. Alila (1999) regionalized the annual precipitation maxima in Canada using a hierarchical approach. Guenni and Bárdossy (2002) disaggregated highly seasonal monthly rainfall data using a two-step stochastic method. Rao and Srinivas (2006a) regionalized the annual maximum flow data from watersheds in Indiana by fuzzy cluster analysis. The regionalization of watersheds by hybrid cluster analysis (a hybrid of Ward's and K-means algorithms) was proposed by Rao and Srinivas (2006b); they concluded that the hybrid method is flexible and gives suitable results for regionalization studies. Srinivas (2008) combined a self-organizing feature map with fuzzy clustering for the regionalization of watersheds. Bharath and Srinivas (2015b) delineated extreme rainfall regions in India by global fuzzy c-means based on location indicators (latitude, longitude and altitude) and a seasonality measure of extreme rainfall. Asong et al. (2015) proposed a new approach for the regionalization of precipitation climate characteristics in the Canadian prairie provinces; the obtained regions are statistically and climatologically homogeneous. Nam et al. (2015) delineated the climatic rainfall regions of South Korea using multivariate analysis and regional rainfall frequency analysis. Rahman et al. (2017) developed regional flood frequency analysis techniques based on generalized additive models for Australia; the proposed method obtains better results than three other methods (the log-linear model, canonical correlation analysis and the region-of-influence approach).

Most of the mentioned methods are based on the K-means algorithm (James and Sugar 2003; Ieva et al. 2013; Antoniadis et al. 2013; Luz-López-García et al. 2015). The K-means algorithm starts by initializing the k cluster centers randomly and then minimizes the total distance (total square error) between each of the N points and its nearest centroid. The cluster centroid is the arithmetic mean of the points in each cluster. The distance is minimized when the algorithm converges; however, the result may be a local minimum. In general, the objective function of K-means is not convex, and there is no guarantee that its solution is global; on some datasets it reaches only a local solution (Cardoso and Wichert 2012).
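A small R illustration of this sensitivity (synthetic data and arbitrary seeds, purely for demonstration): with a single random start, kmeans() can converge to different total within-cluster sums of squares, i.e. different local minima.

```r
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # three overlapping groups
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# one random start per run: the attained objective varies from run to run
wss <- sapply(1:20, function(s) {
  set.seed(s)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})
range(wss)   # different values indicate different local minima
```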

Before analyzing high dimensional data, their dimension is usually reduced. Principal component analysis (PCA) (Jolliffe 2002) and random projection (RP) (Johnson and Lindenstrauss 1984) are used for this purpose. In this paper, as in some other works (Luz-López-García et al. 2015; Cardoso and Wichert 2012), high dimensional data are projected onto a lower dimensional subspace via RP such that the Euclidean distances between the points are approximately preserved.

In this paper, we propose a new algorithm which has the following advantages:

  1. It improves all known K-means based functional data clustering approaches.

  2. It avoids local minima, which leads to a solution (i.e. a set of labels for the curves) with a lower mean square distance (MSD) and a larger Rand index.

  3. A new kernel based on a wavelet function reduces the condition number of the linear system and the MSD. For most of the datasets, the MSD, Rand index and number of iterations of the proposed method based on the wavelet kernel are better than those obtained with the other kernels.

  4. The K-means step at the end of the proposed method needs fewer iterations than clustering using a single random projection or classical K-means.

  5. Since the condition number obtained under the new kernel is small, it produces a suitable and stable reconstruction of the curves that may also be used in the estimation (prediction) problem of FDA.

The paper is organized as follows: Sect. 2 briefly presents RKHS, random projection and the random projection K-means algorithm for functional data clustering (denoted RPFD). The proposed method is introduced in Sect. 3. Section 4 reports the results of simulation studies. The real datasets are clustered in Sect. 5. Conclusions are given in Sect. 6.

2 RKHS and RPFD K-means

In this section, we present a brief review of RKHS and of the application of Tikhonov regularization theory in an RKHS. For the general version of Tikhonov regularization and the theory of RKHS see Tikhonov and Arsenin (1997) and Aronszajn (1950), respectively. An excellent background for the RKHS representation is given by González-Hernández (2010). At the end of this section, we recall the algorithm introduced by Luz-López-García et al. (2015), which is called RPFD.

2.1 RKHS

Definition 1

(Reproducing kernel) Let \({\mathcal {H}}\) be a real Hilbert space of functions defined in \(X \subset {\mathbb {R}}^n\) with inner product \(\langle ., . \rangle _{{\mathcal {H}}}\) and the norm \(||f||_{\mathcal {H}} =\langle f, f \rangle _{\mathcal {H}}^{1/2}\). A function \(K: X \times X \rightarrow {\mathbb {R}}\) is called a reproducing kernel of \({\mathcal {H}}\) if

  1. \(K(., x) \in {\mathcal {H}}\) for all \(x \in X\),

  2. \(f(x) = \langle f(.), K(., x) \rangle _{\mathcal {H}}\) for all \(f \in {\mathcal {H}}\) and for all \(x \in X\).

A Hilbert space of functions is called a reproducing kernel Hilbert space (RKHS) if it admits a reproducing kernel. From Definition 1, it is concluded that \(K(x,y) = \langle K(., x), K(., y) \rangle _{\mathcal {H}}\) and for each \(f \in {\mathcal {H}}\) of the form \(f(x)=\sum _{i=1}^{n} \alpha _{i} K(x,x_{i})\) with \(x, x_{i} \in X\) and \(\alpha _{i} \in {\mathbb {R}}\), we have

$$\begin{aligned} ||f||_{\mathcal {H}}^{2} = \sum _{i=1}^{n} \sum _{j=1}^{n} \alpha _{i}\alpha _{j} K(x_{i},x_{j}) = \boldsymbol{\alpha} ^{T}K_{\mathbf {x}} \boldsymbol{\alpha} . \end{aligned}$$
(2)

Now, we recall the Tikhonov regularization theory used to reconstruct functions from sample points. Let \(X \subset {\mathbb {R}}^p\) be a compact subset, K be a Mercer kernel (a positive definite and symmetric kernel), \({\mathcal {H}}_{K}\) be its associated RKHS, and let \(\nu\) denote a Borel probability measure on \(X \times {\mathbb {R}}\). Let \(f_{\nu }(x) = \int _{\mathbb {R}} y d_{\nu }(y|x)\) be the regression function, where \(d_{\nu }(y|x)\) is the conditional probability measure on \({\mathbb {R}}\). Both \(\nu\) and \(f_{\nu }\) are unknown, and we want to reconstruct \(f_{\nu }\) from a random sample. Let \(X_{n} := \{x_{1}, \ldots , x_{n}\} \subset X\) and \(f^{j} := \{(x_{i}, y_{ij}) \in X \times {\mathbb {R}} \}_{i=1}^{n}\), \(j=1, \ldots , N\), be random samples. Let \(V_{n} :={\mathtt{span}} \{ K(.,x_{i}): x_{i} \in X_{n} \}\) be a function space (the span of an arbitrary set is the set of all finite linear combinations of its elements).

Tikhonov regularization theory projects each sample \(f^{j}\) onto \(V_{n}\). It builds a stable reconstruction of \(f_{\nu }\) by solving

$$\begin{aligned} f_{n}^{*^{j}} := \mathop {\hbox {arg min}}\limits _{f^{j} \in V_{n}} \frac{1}{n} \sum _{i=1}^{n} (f^{j}(x_{i})-y_{ij})^{2} + \gamma ||f^{j}||_{\mathcal {H}_{K}}^{2} \end{aligned}$$
(3)

where \(\gamma >0\) is a regularization parameter, \(j=1, \ldots , N\) and \(||f||_{\mathcal {H}_{K}} =\langle f, f \rangle _{\mathcal {H}_{K}}^{\frac{1}{2}}\). In other words, Tikhonov regularization theory states that there is a stable solution for the optimization problem (3). The following theorem gives a representation form for \(f_{n}^{*^{j}} \in V_{n} \subset {\mathcal {H}_{K}}\).

Theorem 1

(Luz-López-García et al. 2015) Under the above assumptions, there is a unique solution of (3), which can be represented by

$$\begin{aligned} f_{n}^{*^{j}}(x) = \sum _{i=1}^{n} \alpha _{i} K(x, x_{i}), \quad \hbox { for all}\quad x \in X \quad \hbox {and}\quad j=1, \ldots , N, \end{aligned}$$

where \({\mathbf {\alpha }} = (\alpha _{1}, \ldots , \alpha _{n})^{T}\) is obtained from

$$\begin{aligned} (n \gamma {\mathbf {I}}_{n} + K_{\mathbf {x}}){\mathbf {\alpha }} = y_{j}, \end{aligned}$$
(4)

where \({\mathbf {I}}_{n}\) is the \(n \times n\) identity matrix, \(y_{j} = (y_{1j}, \ldots , y_{nj})^{T}\) and the matrix \(K_{\mathbf {x}}\) has entries \((K_{\mathbf {x}})_{st} = K(x_{s}, x_{t})\) for \(s, t=1, \ldots , n\). The estimate of \(f_{\nu }\) on \(X_{n}\) is as follows:

$$\begin{aligned} \hat{f}_{n}^{*^{j}} = K_{\mathbf {x}} {\mathbf {\alpha }}. \end{aligned}$$
(5)

Since K is a Mercer kernel, the matrix \(K_{\mathbf {x}}\) is symmetric and positive semi-definite, so it can be decomposed as \(K_{\mathbf {x}} =V^{T}DV\), where the rows of V are the eigenvectors of \(K_{\mathbf {x}}\) and the elements of the diagonal matrix D are the corresponding (non-negative) eigenvalues. Therefore, we can write

$$K_{\mathbf {x}} = \left(D^{\frac{1}{2}}V \right)^{T} \left(D^{\frac{1}{2}}V \right) = U^{T}U, \quad \hbox {where } \quad U = D^{\frac{1}{2}}V.$$
(6)

Using Eqs. (6) and (2), \(||f||_{V_{n}}^{2} = ||U{\mathbf {\alpha }}||^{2}\), which means that the norm of \(f \in V_{n}\) is equal to the Euclidean norm of its transformed coefficient vector. Let \(\nu\) and \(\mu\) be Borel probability measures defined on \(X \times {\mathbb {R}}\). Suppose that \(f_{n}^{*}(x) = \sum _{i=1}^{n} \alpha _{i} K(x, x_{i})\) and \(g_{n}^{*}(x) = \sum _{i=1}^{n} \beta _{i} K(x, x_{i})\) are the regularized \(\gamma\)-projections of \(f_{\nu }\) and \(g_{\mu }\), respectively, obtained from two samples \(f_{n} := \{(x_{i}, y_{i}) \in X \times {\mathbb {R}} \}_{i=1}^{n}\) and \(g_{n} := \{(x_{i}, y_{i}^{\prime }) \in X \times {\mathbb {R}} \}_{i=1}^{n}\). It follows from Luz-López-García et al. (2015) that the distance between \(f_{n}\) and \(g_{n}\) is given by

$$\begin{aligned} d_{V_{n}}^{2} (f_{n}, g_{n}) := ||f_{n}^{*} -g_{n}^{*}||_{V_{n}}^{2} = (\alpha -\beta )^{T} K_{\mathbf {x}} (\alpha -\beta ) = ||U\alpha - U\beta ||^{2}. \end{aligned}$$
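A short R sketch of these computations (a Gaussian kernel, synthetic curves and arbitrary parameter values; a sketch rather than the authors' implementation): it forms the kernel matrix \(K_{\mathbf {x}}\), solves the regularized system (4) for two sampled curves, builds U from Eq. (6) and evaluates \(d_{V_{n}}\).

```r
# Regularized projection onto V_n and the induced distance (sketch).
n     <- 50
x     <- seq(0, 1, length.out = n)                  # design points x_1, ..., x_n
Kx    <- exp(-outer(x, x, "-")^2 / 0.5^2)           # Gaussian kernel matrix (K_x)_{st}
gamma <- 1e-4

y1 <- sin(2 * pi * x) + rnorm(n, sd = 0.05)         # two sampled curves
y2 <- cos(2 * pi * x) + rnorm(n, sd = 0.05)

Tmat  <- n * gamma * diag(n) + Kx                   # T = n*gamma*I + K_x, Eq. (4)
alpha <- solve(Tmat, y1)
beta  <- solve(Tmat, y2)

eig <- eigen(Kx, symmetric = TRUE)
U   <- diag(sqrt(pmax(eig$values, 0))) %*% t(eig$vectors)   # U = D^{1/2} V, Eq. (6)

d2 <- sum((U %*% (alpha - beta))^2)                 # d_{V_n}^2(f_n, g_n)
```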

2.2 Random projection and RPFD

The nature of functional data requires high CPU time, so we need to reduce their dimension. This encourages us to use RP for dimensionality reduction. RP projects N points of dimension n onto a random subspace of dimension h. To use RP, we generate a random matrix whose elements are randomly drawn from any zero-mean distribution. For example, let \(A=\begin{bmatrix} 2&\quad 3&\quad 2&\quad 0&\quad 4\\ -2&\quad 3&\quad 12&\quad 10&\quad 9\\ 0&\quad -3&\quad 4&\quad 1&\quad 0\\ 4&\quad 5&\quad -1&\quad 7&\quad -4\\ \end{bmatrix}\) be data in \({\mathbb {R}}^5\) (four points of dimension 5). To project them onto \({\mathbb {R}}^2\), we generate a random matrix \(R=\begin{bmatrix} 2.63&\quad 1.52\\ 0.44&\quad 0.50\\ -0.50&\quad 1.75\\ -0.15&\quad 1.17\\ -0.53&\quad -0.16\\ \end{bmatrix}\) whose elements are drawn from the standard normal distribution. The projected data in \({\mathbb {R}}^2\) are obtained as follows:

$$\begin{aligned} B=A\times R= \begin{bmatrix} 3.49&\quad 7.38\\ -16.15&\quad 29.67\\ -3.46&\quad 6.67\\ 14.26&\quad 15.65\\ \end{bmatrix} \quad \hbox {(four points of dimension 2)}. \end{aligned}$$
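In R, this projection is a single matrix product; the following sketch uses a freshly drawn random matrix rather than the rounded R displayed above, so the resulting numbers will differ slightly.

```r
set.seed(123)
A <- matrix(c( 2,  3,  2,  0,  4,
              -2,  3, 12, 10,  9,
               0, -3,  4,  1,  0,
               4,  5, -1,  7, -4), nrow = 4, byrow = TRUE)   # N = 4 points in R^5

h <- 2
R <- matrix(rnorm(ncol(A) * h), nrow = ncol(A))              # 5 x 2 standard normal matrix
B <- A %*% R                                                 # projected data: 4 points in R^2
```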

The success of RP is underpinned by the Johnson–Lindenstrauss lemma (Johnson and Lindenstrauss 1984; Dasgupta and Gupta 2003), which is presented as follows:

Lemma 1

For any real numbers \(\varepsilon , \delta \in (0, 1)\) and any \(N \in {\mathbb {N}}\), let \(h \in {\mathbb {N}}\) be such that

$$\begin{aligned} h \ge 4(\frac{\varepsilon ^{2}}{2}-\frac{\varepsilon ^{3}}{3})^{-1} \log \left( \frac{N}{\sqrt{\delta }}\right) \end{aligned}$$

then, with probability \(1-\delta\)

$$\begin{aligned} (1-\varepsilon )||U\alpha - U\beta || \le ||R U\alpha - R U\beta || \le (1+\varepsilon )||U\alpha - U\beta ||, \end{aligned}$$

where R is a \(h \times n\) matrix and \(\alpha , \beta\) are obtained from Eq. (4) and U is attained from Eq. (6).

Lemma 1 states that the Euclidean distances between the points are approximately preserved by the projection. The distance between \(f_{n}\) and \(g_{n}\) is defined as \(d_{V_{n}} (f_{n}, g_{n}) = ||U\alpha - U\beta ||\). According to Lemma 1, \(d_{V_{n}} (f_{n}, g_{n}) = ||U\alpha - U\beta || \cong ||R U\alpha - R U\beta ||\), and the projection matrix is \(P =R U\). Now, we can cluster the reduced expansion coefficients \({\mathbf {\alpha }}\). A review of RPFD is presented in Algorithm 1.

Algorithm 1

Random projection K-means for functional data (RPFD)

input:

  1. Dimension h (dimension of the projected space).

  2. The number of clusters k.

  3. \(X_{n} := \{x_{1}, \ldots , x_{n}\} \subset X\) and a sample of empirical functions \(\{f^{j}\}_{j=1}^{N}\) where \(f^{j} := \{(x_{i}, y_{ij}) \in X \times {\mathbb {R}} \}_{i=1}^{n}.\)

output:

  1. Cluster label G.

begin

  1. Calculate the kernel matrix \(K_{\mathbf {x}}\) by \((K_{\mathbf {x}})_{ij} = K(x_{i}, x_{j})\).

  2. Compute the matrix \(U = D^{\frac{1}{2}}V\) where \(K_{\mathbf {x}} = (D^{\frac{1}{2}}V)^{T} (D^{\frac{1}{2}}V) = U^{T}U\).

  3. Choose the regularization parameter \(\gamma\) based on cross-validation.

  4. For \(j=1, \ldots , N\), solve \((\gamma n{\mathbf {I}}_{n} + K_{\mathbf {x}}){\mathbf {\alpha }} = y^{j}\) where \(y^{j} = (y_{1j}, \ldots , y_{nj})^{T}\). Denote the solution by \(\alpha ^{j}\).

  5. Initialize G randomly.

  6. Generate the random matrix \(R_{h \times n}\).

  7. Set the projection matrix \(P_{h \times n} =R_{h \times n}U\).

  8. For \(j=1, \ldots , N\), project the data using \(\beta _{h \times 1}^{j^{T}} =P_{h \times n}\alpha _{n \times 1}^{j}\) and put them in the rows of a matrix \({\mathcal {A}}_{N \times h}^{RP}\).

  9. According to G, calculate the mean of each cluster in \({\mathcal {A}}_{N \times h}^{RP}\) and put them in the rows of a matrix \(C_{k \times h}^{RP}\) as centers of the clusters.

  10. Apply K-means on \({\mathcal {A}}_{N \times h}^{RP}\) with \(C_{k \times h}^{RP}\) as initialization to obtain G.

  11. Return G.

end
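A compact R sketch of Algorithm 1 under a Gaussian kernel (the function and variable names are ours; the regularization parameter and h are assumed to be given rather than chosen by cross-validation). The curves are stored as the columns of an \(n \times N\) matrix Y observed at the points x.

```r
# RPFD (Algorithm 1) sketch; assumes every initial label occurs at least once.
rpfd_kmeans <- function(Y, x, k, h, gamma = 1e-4, sigma = 0.5) {
  n   <- nrow(Y); N <- ncol(Y)
  Kx  <- exp(-outer(x, x, "-")^2 / sigma^2)                   # step 1: kernel matrix
  eig <- eigen(Kx, symmetric = TRUE)
  U   <- diag(sqrt(pmax(eig$values, 0))) %*% t(eig$vectors)   # step 2: U = D^{1/2} V
  alpha <- solve(n * gamma * diag(n) + Kx, Y)                 # step 4: one column per curve
  G <- sample(1:k, N, replace = TRUE)                         # step 5: random labels
  R <- matrix(rnorm(h * n), nrow = h)                         # step 6: random matrix
  P <- R %*% U                                                # step 7: projection matrix
  A <- t(P %*% alpha)                                         # step 8: N x h projected coefficients
  C <- do.call(rbind, lapply(1:k, function(g)                 # step 9: cluster means from G
         colMeans(A[G == g, , drop = FALSE])))
  kmeans(A, centers = C)$cluster                              # step 10: K-means, returns G
}
```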

3 Proposed method

In Sect. 2, we reviewed an algorithm for functional data clustering. It first projects the curves onto an RKHS, then RP reduces the dimension of the expansion coefficients, and finally K-means clusters them. There are two difficulties with this algorithm: (1) projecting the data onto the RKHS and then reducing the dimension by RP causes more details to be lost, so the probability of assigning a point to a wrong cluster increases; (2) clustering by the K-means algorithm may give only a local solution (i.e. a suboptimal set of labels).

To overcome these difficulties, we propose an algorithm that improves Algorithm 1. We call the proposed algorithm iterative RP functional data clustering (IRPFD). The scheme of this algorithm is presented in Fig. 1. Note that with iterative RP, K-means attains a non-local solution. We explain IRPFD in what follows.

Fig. 1

The flowchart of the proposed method. G is a set of cluster labels

As mentioned before, RPFD uses a single RP onto one space, whereas IRPFD projects the expansion coefficients onto several spaces and applies K-means in each of them. At each epoch, we increase the dimension of the projection, and the clusters obtained in a given dimension are used to initialize K-means in the next dimension.

Let \(X_{n} := \{x_{1}, \ldots , x_{n}\} \subset X\) and a sample of empirical functions \(\{f^{j}=(x_{i}, y_{ij}), \ \ i=1, \ldots , n\}_{j=1}^{N}\) be the data. To project the data onto the RKHS, we calculate the kernel matrix \(K_{\mathbf {x}}\) of size \(n \times n\). We choose a suitable regularization parameter \(\gamma\) using cross-validation (CV, which will be explained briefly for a real dataset in Sect. 5.2). The expansion coefficients \(\alpha _{n \times 1}^{j}\) are obtained from Eq. (4) for all curves. We also compute the Cholesky decomposition of \(K_{\mathbf {x}}\) from Eq. (6) to obtain \(U_{n \times n}\). After projecting the data onto the RKHS (i.e. obtaining the expansion coefficients \(\alpha _{n \times 1}^{j}\)), we project the expansion coefficients onto a random subspace of dimension \(d_{1} < n\) as follows. For dimension \(d_1\), we generate the random matrix \(R_{d_{1} \times n}\) and obtain the projection matrix \(P_{d_{1} \times n}\) from \(P_{d_{1} \times n} = R_{d_{1} \times n} U_{n \times n}\). Now, we project \(\alpha _{n \times 1}^{j}\) using \(\beta _{d_{1} \times 1}^{j^{T}} =P_{d_{1} \times n}\alpha _{n \times 1}^{j}\) and put \(\beta _{d_{1} \times 1}^{j^{T}}\) in the rows of a matrix \({\mathcal {A}}_{N \times d_{1}}^{RP_{1}}\). To initialize the cluster labels, we generate N random numbers from \(unif(\{1,\ldots , k\})\), where unif denotes the discrete uniform distribution, and put them in the set G. Using G, we obtain the cluster centers \(C_{k \times d_{1}}^{RP_{1}}\) as the mean of each cluster in \({\mathcal {A}}_{N \times d_{1}}^{RP_{1}}\). By applying K-means on \({\mathcal {A}}_{N \times d_{1}}^{RP_{1}}\) initialized by \(C_{k \times d_{1}}^{RP_{1}}\), G is updated. Based on G, we obtain \(C_{k \times d_{2}}^{RP_{2}}\) from \({\mathcal {A}}_{N \times d_{2}}^{RP_{2}}\). We initialize K-means for dimension \(d_2\) with \(C_{k \times d_{2}}^{RP_{2}}\) to cluster the matrix \({\mathcal {A}}_{N \times d_{2}}^{RP_{2}}\). We repeat this process for several subspaces with increasing dimension, i.e. \(d_{1} \le d_{2} \le \cdots \le d_{e} < n\). The choice of the \(d_{z}\), \(z=1, \ldots , e\), is heuristic, and suitable values can be found using cross-validation.

A low dimension plays the role of a high temperature in simulated annealing clustering, and increasing the dimension corresponds to cooling; hence the schedule of the \(d_{z}\) is similar to a simulated annealing cooling schedule (Cardoso and Wichert 2012; Selim and Alsultan 1991). The details of the proposed method are given in Algorithm 2.

Algorithm 2

Iterative random projection K-means for functional data (IRPFD)

input:

  1. List of dimensions \(d_{z}, z=1, \ldots , e\) (dimensions of the projected spaces).

  2. The number of clusters k.

  3. \(X_{n} := \{x_{1}, \ldots , x_{n}\} \subset X\) and a sample of empirical functions \(\{f^{j}\}_{j=1}^{N}\) where \(f^{j} := \{(x_{i}, y_{ij}) \in X \times {\mathbb {R}} \}_{i=1}^{n}.\)

output:

  1. Cluster label G.

begin

  1. Calculate the kernel matrix \(K_{\mathbf {x}}\) by \((K_{\mathbf {x}})_{ij} = K(x_{i}, x_{j})\).

  2. Compute the matrix \(U = D^{\frac{1}{2}}V\) where \(K_{\mathbf {x}} = (D^{\frac{1}{2}}V)^{T} (D^{\frac{1}{2}}V) = U^{T}U\).

  3. Choose the regularization parameter \(\gamma\) based on cross-validation.

  4. For \(j=1, \ldots , N\), solve \((\gamma n{\mathbf {I}}_{n} + K_{\mathbf {x}}){\mathbf {\alpha }} = y^{j}\) where \(y^{j} = (y_{1j}, \ldots , y_{nj})^{T}\). Denote the solution by \(\alpha ^{j}\).

  5. Initialize G randomly.

  For z = 1: e

    L1. Generate the random matrix \(R_{d_{z} \times n}\).

    L2. Set the projection matrix \(P_{d_{z} \times n} =R_{d_{z} \times n}U\).

    L3. For \(j=1, \ldots , N\), project the data using \(\beta _{d_{z} \times 1}^{j^{T}} =P_{d_{z} \times n}\alpha _{n \times 1}^{j}\) and put them in the rows of a matrix \({\mathcal {A}}_{N \times d_{z} }^{RP_{z}}\).

    L4. According to G, calculate the mean of each cluster in \({\mathcal {A}}_{N \times d_{z} }^{RP_{z}}\) and put them in the rows of a matrix \(C_{k \times d_{z} }^{RP_{z}}\) as centers of the clusters.

    L5. Apply K-means on \({\mathcal {A}}_{N \times d_{z} }^{RP_{z}}\) with \(C_{k \times d_{z} }^{RP_{z}}\) as initialization to obtain G.

  end

  6. Return G.

end
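The following R sketch extends the RPFD sketch given after Algorithm 1 with the loop over an increasing dimension schedule; the labels G obtained at one dimension initialize K-means at the next (again our own naming, a Gaussian kernel, and the assumption that every label occurs at least once at each epoch).

```r
# IRPFD (Algorithm 2) sketch: warm-started K-means over increasing dimensions d_1 <= ... <= d_e.
irpfd_kmeans <- function(Y, x, k, dims, gamma = 1e-4, sigma = 0.5) {
  n   <- nrow(Y); N <- ncol(Y)
  Kx  <- exp(-outer(x, x, "-")^2 / sigma^2)                   # steps 1-2
  eig <- eigen(Kx, symmetric = TRUE)
  U   <- diag(sqrt(pmax(eig$values, 0))) %*% t(eig$vectors)
  alpha <- solve(n * gamma * diag(n) + Kx, Y)                 # step 4: expansion coefficients
  G <- sample(1:k, N, replace = TRUE)                         # step 5: random initial labels
  for (d in dims) {                                           # loop L1-L5
    P <- matrix(rnorm(d * n), nrow = d) %*% U                 # projection matrix for this epoch
    A <- t(P %*% alpha)                                       # N x d projected coefficients
    C <- do.call(rbind, lapply(1:k, function(g)               # centers from the current labels G
           colMeans(A[G == g, , drop = FALSE])))
    G <- kmeans(A, centers = C)$cluster                       # warm-started K-means updates G
  }
  G
}
```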

4 Simulation studies

In this section, we implement three methods, classical K-means, RPFD K-means and IRPFD K-means (the proposed method), under the three kernels used by Luz-López-García et al. (2015) and a new kernel, on simulated and real datasets that are widely used in the literature.

4.1 Simulated data

We describe the following datasets, whose figures are presented at the end of the paper.

  1. Waves (Luz-López-García et al. 2015): For these data, there are \(N=450\) realizations of curves at \(n=400\) time-points, and they have \(k=3\) clusters (an R sketch generating these curves is given after this list). Let

     $$\begin{aligned}h_{1}(x) := \hbox {max}(6 - |x - 11|, 0); \quad h_{2}(x) = h_{1}(x-4); \quad h_{3}(x) = h_{1}(x+4),\end{aligned}$$

     where \(x \in [1, 21]\). The waves data are obtained by a convex combination as follows:

     $$\begin{aligned} y_{1}(x)= \, & {} uh_{1}(x) + (1-u)h_{2}(x) + \varepsilon (x) \quad \hbox {for class 1,}\\ y_{2}(x) = \, & {} uh_{1}(x) + (1-u)h_{3}(x) + \varepsilon (x) \quad \hbox {for class 2,} \\ y_{3}(x) = \, & {} uh_{2}(x) + (1-u)h_{3}(x) + \varepsilon (x) \quad \hbox {for class 3,} \end{aligned}$$

     where u is a uniform random variable in (0, 1) and \(\varepsilon (x)\) is a standard normal variable. The first row of Fig. 2a corresponds to these data and their centroids.

  2. Irrationals (Luz-López-García et al. 2015): For these data, there are \(N=300\) realizations of curves at \(n=1009\) time-points, generated from \(k=3\) clusters. The time-points are not equidistant. Three irrational functions are constructed as follows:

     $$\begin{aligned} y_{1}(x)= & {} x(1-x) + \varepsilon (x) \ \ \quad \hbox {for class 1,}\\ y_{2}(x)= & {} x^{\frac{3}{2}}(1-x) + \varepsilon (x) \quad \hbox {for class 2,} \\ y_{3}(x)= & {} x^{2}(1-x) + \varepsilon (x) \ \quad \hbox {for class 3,} \end{aligned}$$

     where \(\varepsilon (x)\) is a uniform random variable in \((-0.2, 0.2)\). The last row of Fig. 2a shows them.

  3. Three groups (Ferreira and Hitchcock 2009): This dataset contains three main groups of signal functions, which we call group 1, group 2 and group 3. Each group has \(k=4\) clusters, with \(N=120\) realizations of curves at \(n=200\) time-points. They are shown in Fig. 2b.

Fig. 2 Simulated datasets and their centroids. a Waves and irrationals datasets and their centroids, b three groups data and their centroids
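A minimal R sketch generating the waves data under our reading of the description above (we assume 150 curves per class, so that N = 450):

```r
# Waves data: convex combinations of three triangular bumps plus standard normal noise.
n  <- 400
x  <- seq(1, 21, length.out = n)
h1 <- pmax(6 - abs(x - 11), 0)          # peak at x = 11
h2 <- pmax(6 - abs(x - 15), 0)          # h1(x - 4): peak at x = 15
h3 <- pmax(6 - abs(x - 7),  0)          # h1(x + 4): peak at x = 7

make_class <- function(m, ha, hb) {
  t(sapply(1:m, function(i) {
    u <- runif(1)                       # convex-combination weight
    u * ha + (1 - u) * hb + rnorm(n)    # epsilon(x): standard normal noise
  }))
}

waves  <- rbind(make_class(150, h1, h2),   # class 1
                make_class(150, h1, h3),   # class 2
                make_class(150, h2, h3))   # class 3
labels <- rep(1:3, each = 150)
```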

4.2 Kernels

We use the three kernels applied by Luz-López-García et al. (2015) and a new kernel, the Mexican hat wavelet. We calculate their parameters using the cross-validation (CV) technique, which will be explained briefly for a real dataset. These kernels are as follows (an R sketch of them is given after the list):

  1. Laplacian kernel:

     $$\begin{aligned}K(x_{i}, x_{j}) := e^{-\frac{1}{\sigma _{1}^{2}} ||x_{i} - x_{j}||_{1}}\end{aligned}$$

  2. Gaussian kernel:

     $$\begin{aligned}K(x_{i}, x_{j}) := e^{-\frac{1}{\sigma _{2}^{2}} ||x_{i} - x_{j}||_{2}^{2}}\end{aligned}$$

  3. Polynomial kernel:

     $$\begin{aligned}K(x_{i}, x_{j}) := (1 + ax_{i}^{T} x_{j})^{b} \end{aligned}$$

  4. Mexican hat kernel:

     $$\begin{aligned}K(x_{i}, x_{j}) := (\sigma _{3}^{2} - ||x_{i} - x_{j}||_{2}^{2}) \ e^{-\frac{1}{\sigma _{3}^{2}} ||x_{i} - x_{j}|| _{2}^{2}}\end{aligned}$$

     where \(\sigma _{1}, \sigma _{2}, \sigma _{3}, a >0\), \(b \in {\mathbb {N}}\) and \(|| . ||_{p}\) is the p-norm.
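For reference, the four kernels can be written as R functions of two points (a sketch; the arguments s1, s2, s3, a and b correspond to \(\sigma _{1}, \sigma _{2}, \sigma _{3}, a, b\) above):

```r
# The four kernels as functions of two vectors xi, xj.
k_laplace <- function(xi, xj, s1)   exp(-sum(abs(xi - xj)) / s1^2)
k_gauss   <- function(xi, xj, s2)   exp(-sum((xi - xj)^2) / s2^2)
k_poly    <- function(xi, xj, a, b) (1 + a * sum(xi * xj))^b
k_mexhat  <- function(xi, xj, s3)   (s3^2 - sum((xi - xj)^2)) * exp(-sum((xi - xj)^2) / s3^2)

# Kernel matrix on the observation points x_1, ..., x_n.
kernel_matrix <- function(x, kern, ...) {
  n <- length(x)
  K <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] <- kern(x[i], x[j], ...)
  K
}
```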

Hereafter, we refer to the fourth kernel as the wavelet kernel. The kernel parameters and \(\gamma\) for the simulated data are given in Table 1. Note that for the three groups dataset the presented parameters are the same for groups 1, 2 and 3, except that for group 3 \(\sigma _{3} = 0.7\).

Table 1 Kernel parameters of simulated datasets

4.3 Experiments

We measure the performance of the three methods (classical K-means, RPFD K-means and IRPFD K-means) on the reported datasets with the MSD and the Rand index. We also record the running time and the number of iterations until K-means converges, and we compare the condition numbers of the four kernels on all datasets. We now explain these measures.

The MSD is the mean square distance between each cluster center and its members. In other words, MSD is obtained by

$$\begin{aligned} {\text{MSD}} = \frac{1}{N} \sum _{j=1}^{k} \sum _{f \in S_{j}} ||f - C_{j}||_{2} \end{aligned}$$
(7)

where \(S_{j}\) is the j-th cluster and \(C_{j}\) is its center.
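With the curves in the rows of a matrix Y, cluster labels G and the cluster centers in the rows of C, Eq. (7) can be computed in R as follows (a direct transcription of the formula above):

```r
# Mean square distance between each curve and the center of its cluster, Eq. (7).
msd <- function(Y, G, C) {
  mean(sapply(seq_len(nrow(Y)), function(i) sqrt(sum((Y[i, ] - C[G[i], ])^2))))
}
```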

Rand index (Rand 1971; Hubert and Arabie 1985) measures the similarity between two partitions. It takes a value between 0 and 1. When it tends to 1, two partitions have a high agreement.

It can be calculated using the function rand.index() of the package fossil of the R software.

To obtain the expansion coefficients, Eq. (4) is usually solved by numerical algorithms. The type of the matrix \(K_{x}\) has an important effect on the accuracy of the solution of Eq. (4), i.e. on \({\mathbf {\alpha }}\). In the following, we consider one validity criterion that reflects this accuracy. An important issue in numerical analysis is the condition number (Trefethen and Bau 1997) of the matrix associated with a linear system \(Ax = b\). It quantifies the sensitivity of the system: a higher condition number means greater sensitivity and an ill-posed system, i.e. a small variation of b makes big changes in x. The condition number of a nonsingular matrix \(A_ {n \times n}\) is defined by \(\hbox {cond}(A) = ||A||_{2}\ .\ ||A^{-1}||_{2}\). In general, if the condition number of a matrix A is \(10^{m}\), then up to m digits of accuracy may be lost. In the next sections, for each kernel and dataset, we calculate the condition number of the kernel matrix \(K_{x}\) and of \(T=(n \gamma {\mathbf {I}}_{n} + K_{\mathbf {x}})\) defined in (4). We implement the mentioned algorithms on the simulated and real datasets separately.
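In R, the 2-norm condition numbers of \(K_{x}\) and T can be obtained, for example, with kappa(..., exact = TRUE) or directly from the singular values (a small self-contained sketch with a Gaussian kernel and arbitrary parameters):

```r
n     <- 50
x     <- seq(0, 1, length.out = n)
gamma <- 1e-4
Kx    <- exp(-outer(x, x, "-")^2 / 0.5^2)         # kernel matrix
Tmat  <- n * gamma * diag(n) + Kx                 # T from Eq. (4)

cond_Kx <- kappa(Kx,   exact = TRUE)              # 2-norm condition number of K_x
cond_T  <- kappa(Tmat, exact = TRUE)              # 2-norm condition number of T
cond_T2 <- { s <- svd(Tmat)$d; max(s) / min(s) }  # equivalent: ratio of singular values
```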

4.3.1 Performance on simulated datasets

We generate data from the stated models and run each algorithm 1000 times on each dataset. In all experiments, the entries of the random matrix are independent and identically distributed standard Gaussian random variables, N(0, 1). The classical K-means algorithm uses the raw data, i.e. only the \(y_{ij}\) for \(i=1, \ldots , n\) and \(j=1, \ldots , N\). The choice of h and \(d_ {z}\) in RPFD and IRPFD, respectively, is heuristic: we run the algorithms for different values and choose the best one based on the MSD and the Rand index. In Table 2, we present the average running time (Time), the average number of iterations (Iter), the average MSD (MSD), the average Rand index (RI) and their standard deviations (SD) over the 1000 runs. The bold numbers are the best results, and the selected values of h and \(d_{z}\) are given in the table. The values are rounded to at most 2 decimal places; cells showing 0.00 have an exact value less than 0.005. According to Table 2, we conclude that

  1. IRPFD K-means has the minimum MSD among the three methods; in other words, it avoids local minima.

  2. As expected, in all datasets the running time of classical K-means is less than that of the others, because it performs K-means only once on the raw data, whereas RPFD and IRPFD project the raw data onto one or several spaces, which needs more time. The running times of IRPFD K-means and RPFD K-means are approximately the same for most datasets. Note that K-means on the projected data takes a shorter running time than on the raw data.

  3. The Iter of IRPFD K-means is less than that of the two other methods for all datasets except the irrationals dataset.

  4. The Rand index of IRPFD K-means is the best among all methods for all datasets.

  5. For most datasets and for both IRPFD K-means and RPFD K-means, the MSD, Iter and Rand index of the wavelet kernel are better than those of the three other kernels.

Thus, in all datasets, clustering that exploits the functional nature of the data (especially IRPFD) is better than classical K-means in terms of MSD, RI and Iter.

Table 2 The clustering results of simulated data: Time, Iter, MSD and RI for different kernels. SD is a standard deviation

The results for the condition number are shown in Table 3. For the three groups dataset, the results for group 1, group 2 and group 3 are equal. In all datasets, the wavelet kernel has the best condition number for the matrix T (bold values).

Table 3 The condition number of simulated data and \(T=(n \gamma {\mathbf {I}}_{n} + K_{\mathbf {x}})\) from Eq. (4)

For the group 1 dataset, the boxplots of the results are presented in Fig. 3. In this figure, the classical method is denoted by C; RPFD using the wavelet, Laplacian, Gaussian and polynomial kernels is denoted by Wrp, Lrp, Nrp and Prp, respectively; and IRPFD using the wavelet, Laplacian, Gaussian and polynomial kernels is denoted by Wirp, Lirp, Nirp and Pirp, respectively. According to these boxplots, the wavelet IRPFD has the best performance with respect to MSD and RI. We also show the scatter plots of the first and second principal components of the true partition of \(y_{1}, \ldots , y_{N}\) (top-left) and of the partitions obtained by the classical method (top-right), IRPFD (bottom-left) and RPFD via the wavelet kernel (bottom-right) for the group 1 and group 2 datasets in Fig. 4a, b, respectively. The clusters of the group 1 dataset are close together and difficult to recognize, whereas the clusters of the group 2 dataset are distinct. The IRPFD method gives a partition that is more similar to the true partition than the others.

Fig. 3

Boxplot of Rand index, MSD, Iter and running time for group 1 data

Fig. 4

The first and the second principal components of the different methods for group 1 and group 2 datasets

5 Application to real datasets

In this part, we try to obtain homogeneous groups (clusters) for real datasets. We introduce the real datasets and explain the CV, the determination of the number of clusters (using Wilk's Lambda) and the setting of IRPFD for one dataset.

5.1 Real data

Some examples of functional data appear in meteorology and earth science (Tokushige et al. 2007; Luz-López-García et al. 2015; Ferraty and Vieu 2006). We describe the following three datasets to show the performance of the proposed method; their figures are presented at the end of the paper.

  1. Canadian temperature: Ramsay and Silverman (2005) presented the details of these data, which are available in the R package fda. This dataset has been used by several authors for FDA and clustering; see, for example, Aguilera-Morillo et al. (2017) and Jacques and Preda (2014b). For \(N=35\) locations in Canada, the daily temperature was measured and averaged over 1960–1994. The obtained data constitute a \(35\times 365\) matrix whose elements are the daily temperatures averaged over the 35 years. We consider each row of this matrix as a curve. We present the data in Fig. 5a and their geographical locations in Fig. 6a. Our purpose is to find the homogeneous groups in this dataset.

  2. Russian temperature (Footnote 1): These data were investigated by Bulygina and Razuvaev (2012) and are presented in detail in CDIAC (Carbon Dioxide Information Analysis Center). The source of this dataset contains the temperature of 518 stations in Russia, recorded during 1874–2011. We consider 490 stations which cover the interval 1978–2009, i.e. 32 years. For each station, after removing its trend (if one exists; for example, stations 76 and 77 are stationary), the average temperature over the 32 years is computed. Therefore, we have a matrix of size \(490\times 365\) such that each row corresponds to the average temperature over 32 years for one station. Each row is considered as a curve over the interval [1, 365]. Thus, we have 490 curves, which are shown in Fig. 5b. We also show the geographical locations of these stations in Fig. 6b.

  3. Radar (Footnote 2): This dataset was recorded by the Topex/Poseidon satellite over a region of 25 kilometers on the Amazon river and contains \(N=472\) waveforms (i.e. curves). Each curve is measured at 70 points, \(\{X_{i}(t_{1}), \ldots , X_{i}(t_{70})\}_{i=1}^{472}\), and corresponds to a type of ground (i.e. its natural features). This dataset is analyzed to explore hydrological aims in the Amazonian basin. It forms a \(472\times 70\) matrix, each row of which is considered as a curve. These data were clustered by Luz-López-García et al. (2015) and are shown in Fig. 5c.

Fig. 5 Curves of real datasets. a Canadian temperature data, b Russian temperature data, c radar data

Fig. 6 Geographical locations. a Location of Canadian weather stations, b location of Russian weather stations

5.2 Cross-validation and condition number for real datasets

We determine the suitable parameters of the stated kernels by the CV technique. For example, to select the parameters of each kernel for the Canadian temperature dataset, we calculate the corresponding projections (i.e. we project the data onto the RKHS) for different values of \(\gamma _{s}\) and \(\sigma _{t}\) with \(s=1, \ldots , 4\), \(t=1,2,4\) and \(\sigma _{3}=(a, b)^T\); here \(\sigma _{3}\) is a vector. The optimal parameters are chosen such that the sum of squared errors, \(\sum _{j=1}^{35}\sum _{i=1}^{365} \left( f^{j}(x_{i}) - \hat{f}^{*^{j}}(x_{i})\right) ^{2}\), is minimized, where \(f^{j}\) and \(\hat{f}^{*^{j}}\) are the original signal and its reconstruction, respectively [\(\hat{f}^{*^{j}}\) is defined in Eq. (5)]. In addition, these parameters should not yield an ill-posed system. For the Canadian temperature dataset, we consider the following candidate values: \(\gamma _{s}=10^0, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}\) and \(\sigma _{t} = 0.1, 0.2, \ldots , 2\) for \(s=1, \ldots , 4\), \(t=1,2,4\), and \(\sigma _{3}=(10^{-3}, 3)^T\), \((10^{-3}, 4)^T\), \((10^{-3}, 5)^T\), \((10^{-4}, 3)^T\), \((10^{-4}, 4)^T\), \((10^{-4}, 5)^T\), \((10^{-5}, 3)^T\), \((10^{-5}, 4)^T\), \((10^{-5}, 5)^T.\) Using these parameters, we calculate the kernel matrix \(K_{365 \times 365}\), the expansion coefficients \(\alpha\) and the reconstruction of the curves from \(\hat{f}_{n}^{*^{j}} = K_{\mathbf {x}} {\mathbf {\alpha }}.\) We compare the obtained squared errors and choose the parameters that yield the minimal squared error and a well-posed system in Eq. (4). The same procedure can be used for the Russian temperature and radar datasets. The polynomial kernel cannot be used, since it yields an ill-posed system in Eq. (4) for the real datasets. In other words, if for a kernel the condition numbers of the matrix \(K_x\) and of \(T=(n \gamma {\mathbf {I}}_{n} + K_{\mathbf {x}})\) from Eq. (4) are large, that kernel does not give a suitable reconstruction of the data. In Fig. 7a, we present the average of the original data and the average of their reconstructions based on the polynomial kernel. For the real datasets, the condition number of the polynomial kernel is presented in Table 4; as we can see, it is very large and yields an ill-posed system.
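A sketch of this grid search in R for the Gaussian kernel (Y holds one curve per column; the helper name and loop structure are ours, and the reconstruction follows Eq. (5)):

```r
# Grid search over (sigma, gamma) minimizing the reconstruction sum of squared errors.
cv_gauss <- function(Y, x, sigmas, gammas) {
  n    <- nrow(Y)
  best <- list(sse = Inf)
  for (s in sigmas) {
    Kx <- exp(-outer(x, x, "-")^2 / s^2)
    for (g in gammas) {
      alpha <- solve(n * g * diag(n) + Kx, Y)   # Eq. (4), all curves at once
      Yhat  <- Kx %*% alpha                     # Eq. (5): reconstruction at the x_i
      sse   <- sum((Y - Yhat)^2)
      if (sse < best$sse) best <- list(sigma = s, gamma = g, sse = sse)
    }
  }
  best
}

# e.g. cv_gauss(Y, x = 1:365, sigmas = seq(0.1, 2, by = 0.1), gammas = 10^(-(0:4)))
```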

Fig. 7

a Average of projections and average of original curves via polynomial kernel for real datasets, and b average of projections and average of original curves via each kernel for real datasets. The second column shows the differences between the average of original signals and the average of their projections for different kernels

Table 4 The condition number of real datasets for polynomial kernel and \(T=(n \gamma {\mathbf {I}}_{n} + K_{\mathbf {x}})\)

The best values for the real datasets and the corresponding errors are shown in Table 5, where the smallest error is bolded.

Table 5 Kernel parameters for real datasets and corresponding error

Now, we present the average of the original signals and the average of their projections onto the RKHS (reconstructions) for each dataset in Fig. 7b. Since the average of the original signals and the average of their projections are close together, we also show their difference in the second column. As we can see, an excellent reconstruction is obtained with the wavelet kernel for the Canadian temperature and radar datasets and with the Laplacian kernel for the Russian temperature dataset.

The condition numbers for the real datasets are presented in Table 6. The Gaussian kernel, the wavelet kernel and the Laplacian kernel have the best condition number for the Canadian temperature, Russian temperature and radar datasets, respectively.

Table 6 Condition number of real datasets and \(T=(n \gamma {\mathbf {I}}_{n} + K_{\mathbf {x}})\)

5.3 Determining the number of clusters

A major problem in clustering is the detection of the number of clusters. There are many validity indices to identify it, but none covers all problems. Three examples of validity indices are the Dunn index, the Davies–Bouldin index and Wilk's Lambda. We explain them in the following.

The Dunn index (Dunn 1974) combines the cluster diameters and the dissimilarity between clusters to determine the most suitable number of clusters. A high value of the Dunn index indicates that the clusters are compact and well separated.

The Davies–Bouldin index (Davies and Bouldin 1979) measures the dispersion within a cluster and the dissimilarity between clusters. A small value of the Davies–Bouldin index indicates a suitable number of clusters.

These indices are successful only in special cases and are not adequate in general. The Dunn index requires expensive computations and is sensitive to noise in the data, so it is not used for noisy, moderate-sized or big data. Also, the Davies–Bouldin index does not give good results for overlapping clusters (Saitta et al. 2007).

For the real data, the number of clusters is unknown. The Canadian temperature dataset was clustered by Jacques and Preda (2014b) into 4 clusters. Luz-López-García et al. (2015), using Wilk's Lambda (presented in the following), determined \(k=3\) for the radar dataset. However, we applied the mentioned indices (the Dunn and Davies–Bouldin indices) to the real data and did not obtain appropriate results for the number of clusters (Footnote 3).

Wilk’s Lambda (Kuo and Lin 2010) is usually used to determine the number of clusters. It is defined as follows:

$$\begin{aligned} {\hbox {Wilk's Lambda}} = \frac{SS_{{\hbox {within}}}^{k}}{SS_{{ \hbox {total}}}}; \qquad k=2, \ldots , k_{{\hbox {max}}}, \end{aligned}$$

where \(SS_{{\hbox {within}}}^{k}\) is the within-cluster variance and \(SS_{{ \hbox {total}}}\) is the total variance. Using IRPFD K-means, we compute it for the Canadian and Russian temperature datasets with three kernels and for the radar dataset with two kernels (Gaussian and wavelet). For \(k=2,3, \ldots , 20\), Wilk's Lambda is shown in Fig. 8a, b for the three datasets. We choose \(k=4\), \(k=8\) and \(k=3\) for the Canadian temperature, Russian temperature and radar datasets, respectively; Wilk's Lambda does not decrease considerably after the selected points for most kernels.
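With a kmeans fit in R, this ratio is simply tot.withinss / totss; the sketch below scans k with plain kmeans for brevity (in the paper the labels come from IRPFD K-means):

```r
# Wilk's Lambda = within-cluster SS / total SS, evaluated for k = 2, ..., 20.
wilks_lambda <- sapply(2:20, function(k) {
  fit <- kmeans(A, centers = k, nstart = 10)    # A: projected coefficients, one row per curve
  fit$tot.withinss / fit$totss
})
plot(2:20, wilks_lambda, type = "b", xlab = "k", ylab = "Wilk's Lambda")
```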

Fig. 8

a The values of Wilk’s Lambda for Canadian temperature dataset (the first column) and Russian temperature dataset (the second column) for different kernels, and b the values of Wilk’s Lambda for radar dataset

5.4 The setting of IRPFD for Canadian temperature dataset

Now, we explain the working of IRPFD for the Canadian temperature dataset using the parameters obtained in Sects. 5.2 and 5.3. The optimal parameters are \(\sigma _{1}=1\), \(\gamma _{1}=10^{-4}\), \(\sigma _{2}=0.1\), \(\gamma _{2}=10^{-4}\), \(\sigma _{3}=1.4\) and \(\gamma _{3}=10^{-4}\). The number of clusters is \(k=4\). Here, we have \(N=35\), \(n=365\), \(d_z=(10, 40, 60, 80)^T\) and \(X_{n}=\{1,2, \ldots , 365\}\). The matrix K is computed for the three kernels using the mentioned values. Here, we consider the Gaussian kernel, which is

$$\begin{aligned} K(x_i, x_j) = e^{-\frac{1}{0.1^{2}} ||x_{i} - x_{j}||_{2}^{2}} \end{aligned}$$

where \(||.||_2\) is the Euclidean norm and \(i, j=1, \ldots , 365\). We obtain its decomposition \(K=U^{T}U\) from Eq. (6), where the matrix U is \(365 \times 365\). For \(j=1, \ldots , 35\), we solve the linear system in Eq. (4) to obtain \(\alpha ^j\) for the Gaussian kernel; \(\alpha ^j\) is a vector of size \(365 \times 1\). We generate a random matrix \(R_{10\times 365}\) whose elements are drawn from the standard normal distribution and set \(P_{10\times 365} = R_{10\times 365}U_{365 \times 365}\). For \(j=1, \ldots , 35\), we project the expansion coefficients by RP as \(\beta _{10 \times 1}^{j^{T}} =P_{10 \times 365}\alpha _{365 \times 1}^{j}\) and put them in the rows of a matrix \({\mathcal {A}}_{35 \times 10}^{RP_{1}}\). Now, we initialize G by generating 35 random numbers from \(unif(\{1,2,3,4\})\), where unif denotes the discrete uniform distribution. According to G, we calculate the mean of each cluster and put the means in the rows of a matrix \(C_{4 \times 10}^{RP_{1}}\) as cluster centers. That means

$$\begin{aligned}C_{4 \times 10}^{RP_{1}}= \begin{bmatrix} \frac{1}{|\{i | g_i=1\}|} \sum _{\{i | g_i=1\}} A_i\\ \\ \frac{1}{|\{i | g_i=2\}|} \sum _{\{i | g_i=2\}} A_i\\ \\ \frac{1}{|\{i | g_i=3\}|} \sum _{\{i | g_i=3\}} A_i\\ \\ \frac{1}{|\{i | g_i=4\}|} \sum _{\{i | g_i=4\}} A_i\\ \end{bmatrix}, \end{aligned}$$

where |.| is the cardinality of a set, \(g_i\) belongs to G and \(A_i\) is the i-th row of \({\mathcal {A}}_{35 \times 10}^{RP_{1}}\) for \(i=1, \ldots , 35\). We cluster \({\mathcal {A}}_{35 \times 10}^{RP_{1}}\) by K-means initialized with \(C_{4 \times 10}^{RP_{1}}\); K-means gives a new G. We generate a new random matrix \(P_{40 \times 365}\) as before and obtain \({\mathcal {A}}_{35 \times 40}^{RP_{2}}\). Using the obtained G and \({\mathcal {A}}_{35 \times 40}^{RP_{2}}\), we calculate \(C_{4 \times 40}^{RP_{2}}\). Applying K-means on \({\mathcal {A}}_{35 \times 40}^{RP_{2}}\) initialized with \(C_{4 \times 40}^{RP_{2}}\) gives a new G. We repeat this process for dimensions 60 and 80. Using the G obtained in dimension 80, we calculate the cluster centers in the original space. \(d_z=(10, 40, 60, 80)^T\) is selected from candidate values that minimize the MSD.
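As a sketch of this setup in R, assuming the fda package's layout of the CanadianWeather object and reusing the hypothetical irpfd_kmeans() function sketched after Algorithm 2:

```r
library(fda)
# Daily temperatures averaged over 1960-1994: a 365 x 35 matrix, one station per column.
Y <- CanadianWeather$dailyAv[, , "Temperature.C"]
x <- 1:365

set.seed(2018)
G <- irpfd_kmeans(Y, x, k = 4, dims = c(10, 40, 60, 80), gamma = 1e-4, sigma = 0.1)
table(G)   # cluster sizes
```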

5.5 Results

In this section, we present the results of applying classical K-means, RPFD and IRPFD to the real datasets. Note that we use hard clustering techniques; however, several works in hydrometeorology have used fuzzy clustering to find statistically and climatologically homogeneous regions (Srinivas 2008; Satyanarayana and Srinivas 2011; Bharath and Srinivas 2015a). The clustering results for the real datasets (analogous to those for the simulated datasets) over 5 iterations are shown in Table 8, where SD is the standard deviation. Note that for the real datasets the Rand index cannot be calculated, since the true clusters are unknown. The proposed method with the wavelet kernel for the first two datasets and with the Gaussian kernel for the third dataset has the best MSD among the methods. Let us note that Jacques and Preda (2014b) also clustered the Canadian temperature and precipitation data; for their clusters, we calculate the MSD of the temperature data. It equals \(3.56 \times 10^{3}\), which is greater than the MSD of the proposed method (\(2.89 \times 10^{3}\)).

The boxplots of the MSD are shown in Fig. 9. According to this figure, IRPFD based on the wavelet kernel for the Canadian and Russian temperature datasets and based on the Gaussian kernel for the radar dataset has the best performance.

Fig. 9

MSD of real datasets. a Canadian temperature dataset, b Russian temperature, c radar dataset

The geographical locations of the clusters of the Canadian temperature dataset obtained by classical K-means, RPFD and IRPFD based on the wavelet kernel are shown in Fig. 10. We also present the geographical locations of the clusters found by Jacques and Preda (2014b) in Fig. 10d. The clusters detected by IRPFD and those of Jacques and Preda (2014b) clearly differ for the northern and southern stations.

Fig. 10

Geographical locations of Canadian temperature dataset with a classical K-means, b RPFD, c IRPFD based on wavelet kernel and d the results of Jacques and Preda (2014b)

The geographical locations of the clusters of the Russian temperature dataset obtained by classical K-means, RPFD and IRPFD based on the wavelet kernel are shown in Fig. 11. According to Figs. 10c and 11c, IRPFD has found homogeneous clusters for the Canadian and Russian temperature datasets which are consistent with the actual climates of Canada and Russia. Note that we do not use the geographical positions of the stations to obtain these clusters. The proposed method handles only a univariate functional random variable; Horenko (2010) also clustered historical temperature data in Europe using only the temperature. If we considered temperature and precipitation together, we would have bivariate functional data. Let us note that Satyanarayana and Srinivas (2011) and Bharath et al. (2016) proposed approaches for the regionalization of precipitation and temperature, respectively, using location indicators (e.g. latitude, longitude, altitude and distance to sea) and large-scale atmospheric variables, and they obtained suitable results.

Fig. 11

Geographical locations of Russian temperature dataset with a classical K-means, b RPFD and c IRPFD based on wavelet kernel

The curves of each cluster obtained with IRPFD based on the wavelet kernel for the Canadian temperature dataset and based on the Gaussian kernel for the radar dataset are presented in Fig. 12a, b. For the Canadian temperature dataset with IRPFD, cluster 1 contains 17 signals, cluster 2 has 10 signals, cluster 3 has 3 signals and cluster 4 has 5 signals, whereas Jacques and Preda (2014b) found 23, 3, 1 and 8 signals for clusters 1, 2, 3 and 4, respectively. For the radar data using IRPFD, clusters 1, 2 and 3 have 88, 50 and 334 signals, whereas those obtained by Luz-López-García et al. (2015) have 94, 47 and 329, respectively.

Fig. 12

a The solution of IRPFD based on wavelet kernel for Canadian temperature dataset, and b the solution of IRPFD based on Gaussian kernel for radar dataset

For the radar dataset, we present the breakdown of the running time of classical K-means, RPFD and IRPFD based on the Gaussian kernel in Table 7. According to Algorithms 1 and 2, the running time of RPFD and IRPFD is the sum of the running times of: calculating the kernel matrix (denoted by t.ker), the Cholesky decomposition (denoted by t.Ch), computing the coefficients \({\mathbf {\alpha }}\) (denoted by t.alp) and K-means (denoted by t.K). According to this table, the K-means running time (t.K) is fairly small compared to the total running time.

Table 7 The details of the running time for radar dataset
Table 8 The clustering results of real data: Time, Iter, MSD. Also, SD is a standard deviation

6 Conclusions

Although functional data clustering can be performed by clustering the raw data (without considering the functional nature), there are datasets for which the functional nature must be taken into account to obtain appropriate results. In other words, functional data clustering gives better results than classical K-means (see, for example, the MSD of the radar dataset in Table 8). In this paper, we have introduced iterative random projection K-means for functional data (IRPFD), a method for clustering functional data based on random projection (RP) and a reproducing kernel Hilbert space (RKHS). We increase the dimension of the projection several times to avoid local minima.

We implemented the three methods (classical K-means, RPFD, IRPFD) on simulated and climatological datasets. In the simulation study, the Rand index (RI) and mean square distance (MSD) of IRPFD are the best on all datasets, and its average number of iterations (Iter) is the best for most datasets. In general, clustering via the functional nature is better than classical K-means (although on some datasets classical K-means is better than RPFD in terms of RI and MSD). The running time of classical K-means is less than that of the others because it uses K-means only once on the raw data, while the running times of IRPFD K-means and RPFD K-means are approximately the same for most (but not all) datasets. Thus, IRPFD takes a little longer than classical K-means and RPFD to find the best solution, but it has a smaller Iter than the others.

We have used the wavelet kernel, which gives a better reconstruction of the signals based on Tikhonov regularization theory for most datasets. This kernel also improves the Iter, RI and MSD of RPFD and IRPFD in most cases.

For the real datasets, we determined the number of clusters using Wilk's Lambda, which is based on the within-cluster variance. Since the true clusters of the real datasets are unknown, the Rand index was not calculated. IRPFD finds the solution with the smallest MSD based on the wavelet kernel for the Canadian and Russian temperature datasets and based on the Gaussian kernel for the radar dataset.

To cluster functional data, we project the data onto an RKHS, so we have to determine suitable parameters for the Mercer kernel as well as the regularization parameter (in this paper, we consider 4 Mercer kernels). It is not clear before clustering which kernel is appropriate for the data. Determining the true number of clusters and an optimal dimension for RP is also a hard task. Future research could provide further guidelines for kernel parameterization and for detecting the true number of clusters and an optimal RP dimension. The proposed method could be extended to multivariate functional data and spatial data in future work, and a fuzzy framework for the proposed method could also be considered.