1 Introduction

Climate regionalization (CR) is defined as the process of dividing a given area into smaller regions, in such a way that they are somehow homogeneous with respect to a specified climatic variable (Badr et al. 2015). CR is a key point in climate studies, since it allows explaining small-scale climate events in terms of the spatio-temporal mechanisms which produce them. CR has been specifically applied to palaeo-climatic problems (Knapp et al. 2002), precipitation trends, floods and drought events (Comrie and Glenn 1998; Baeriswyl and Rebetez 1997; Burn 1989), numerical models improvement for climate studies (Argüeso et al. 2011; Regonda et al. 2016), or climate change studies (Önol and Semazzi 2009), among others.

There are a number of well-known linear analysis techniques for obtaining high-quality CR. Empirical Orthogonal Function (EOF) analysis, also known as Principal Component Analysis (PCA), is one of the most standard techniques in climatology with direct application in CR. EOF analysis tries to identify natural spatio-temporal variability of observations (Jolliffe 2002). The idea behind EOF analysis is to identify a set of orthogonal eigenfunctions which accounts for most of the system’s total variance (von Storch and Zwiers 1999). Thus, EOF analysis tries to obtain the dominant modes of variability, in turn reducing the data space by only considering those EOFs which cover a large percentage of the total variance. EOF analysis has been intensely used in CR (White et al. 1991; Comrie and Glenn 1998; Baeriswyl and Rebetez 1997). The basic idea is to use EOF or Rotated Empirical Orthogonal Function (REOF) to define and interpret clusters of different climatic variables, at different spatial resolutions of the grid. However, different studies have pointed out some issues in the use of EOF analysis (Dommenget and Latif 2002). For example, in Kim et al. (2015), it is shown that EOF analysis usually considers stationarity in the covariance of the data, when some of the most important climatic variables show cyclic components. This could lead to misleading results, and the authors propose the use of cyclostationary EOF analysis in these cases. Another important problem when applying EOF analysis is due to the orthogonality constraint of the procedure. It is shown that unphysical modes may appear due to this constraint (Lian and Chen 2012), producing results difficult to explain. Different works have shown that this issue may be alleviated by using REOF analysis, a method which has been shown to perform much better than the classical EOF approach in many particular cases. In fact, previous studies have shown how REOF analysis is able to avoid unphysical modes of EOFs while keeping the important and robust spatio-temporal patterns of the data (Richman 1986). On the other hand, REOF analysis has its own set of problems: (1) the database size necessary to obtain a good calculation of the REOFs, (2) the selection of the number of modes involved in their calculation and last, but not least, (3) the selection of the rotation criteria (Jolliffe 2002, p. 271) all of which make the results highly dependent on subjective decisions.

Non-linear clustering techniques have also been applied to CR problems. These include methods such as hierarchical clustering (Ward 1963; Unal et al. 2003; Badr et al. 2014, 2015, Nojarov 2017), k-means (Cassou et al. 2004; Carvalho et al. 2016; Zhang et al. 2016), multivariate statistical techniques (Shahriar et al. 2015), or fuzzy clustering (Sarma and Hazarika 2014; Irwin 2015). The main advantages of these methods are that they allow a very high spatial resolution of the grid in climate problems, with a tunable similarity criterion (different from the variance as in EOFs). However, there are some issues related to clustering techniques in CR, such as a more difficult interpretation that can lead to different conclusions due to the use of different similarity criteria or schemes (Cassou et al. 2004). Also, the need for tuning some specific parameters of the methods (for example the number k of clusters in the k-means approach) might be an issue.

In this paper, a novel self-organized heuristic clustering technique is proposed for CR problems. Specifically, the Second-Order Data-Coupled Clustering (SODCC) algorithm is used (Chidean et al. 2015a), a self-organized clustering approach that uses statistical characteristics of the measured data to geographically group similar nodes. To this end, the proposed algorithm uses signal subspace dimension to determine the minimum amount of linearly independent components in each cluster, i.e. the number of eigenvalues that explain most of the variance in the data. It approaches the clustering problem from a bottom-up perspective, in which neighbouring nodes are initially grouped and, solely based on the signal subspace dimension of the covariance matrix, it decides whether to fuse to other cluster or not. This procedure is repeated until there is no free node left in the system. This procedure is robust to the selection of specific tuning parameters, apart from those that define the statistical distribution of eigenvalues in covariance matrices.

The proposed SODCC algorithm builds a bridge between linear (EOFs and REOFs) and non-linear analysis techniques (hierarchical, k-means... clustering) as it uses characteristics from the latter to build spatial structures comparable with the former. Moreover, by using SODCC, it is also possible to analyze the temporal and spatial structures of the dataset, with a higher spatial resolution of the grid. This analysis permits a direct comparison with climate patterns such as the North Atlantic Oscillation (NAO).

The rest of the paper is structured as follows: Section 2 includes a detailed description of the SODCC algorithm and its analysis from a theoretical point of view. Section 3 covers the case study considered in this work, where the regionalization and spatio-temporal analysis of wind speed data in the Iberian Peninsula is carried out using the SODCC algorithm and robustness to the selection of tunable parameters is demonstrated. Finally, Section 4 closes the article by giving some concluding remarks.

2 Methods

In this section, the clustering algorithm proposed in this work is detailed for CR. First, the system model is described, providing mathematical definitions for the network and dataset. Next, the SODCC algorithm is described and the procedure to analyze climatological data using the aforementioned algorithm. In the following, the operation of SODCC is analyzed from a theoretical point of view.

2.1 System model

Consider a set of N geographical locations determined by their corresponding latitude and longitude coordinates to be the measuring stations (or grid nodes) for a given climate variable, i.e. wind speed. Let Ts be the uniform sampling interval, namely the time interval between two consecutive data measurements. Each measuring station can be modeled as an element of a network with N nodes and C connections among them.

Let MT be the total number of data measurements per node and \({\mathbf x}_{m} \in {\mathbb R}^{N}\), with \({\mathbf x}_{m} = [x_{m}(1), x_{m}(2),\dots ,x_{m}(N)]^{\top }\), be the vectorFootnote 1 of data measurements for all nodes in a specific time instant \(m=1,2,\dots ,M_{T}\). By assembling the MT data vectors in a data matrix, we obtain \(\mathbf {X}=[{\mathbf x}_{1},\dots ,{\mathbf x}_{M_{T}}]\), which is the dataset measured by the entire network. The covariance matrix of a data matrix X is defined as \({\Sigma } = \mathbf {M_{X}}\mathbf {M_{X}}^{\top }/M_{T}\), being MX the centered data matrix.

With this system model, the spatio-temporal correlations of the climate variable can be analyzed considering the complete set of N nodes and using traditional climate analysis such as EOFs or REOFs (Jolliffe 2002). However, different approaches can be taken into account, such as the recently proposed clustering algorithm SODCC that we consider in this work. SODCC organizes the nodes in groups in terms of the statistics of the measured data (Chidean et al. 2015a). In the following, the details of the SODCC operation and the procedure to apply it to climate variables are further explained.

2.2 Clustering algorithm: SODCC

The SODCC algorithm has been previously proposed and described in detail for wireless sensor network applications (Chidean et al. 2015a).Footnote 2 However, previous works have shown that it is also suitable for climate data analysis, e.g. in Chidean et al. (2015b), where the structure of temperature field in Europe from more than 100 measuring stations was studied using this approach. In this section, we focus on the general description of the SODCC and its operation from the climate application point of view.

The SODCC algorithm organizes the nodes of a network in logical groups, also known as clusters, following a data-driven criterion, i.e. uses data statistics to decide. The outcome of SODCC is a set of clusters, each one containing Ni nodes. Note that all N nodes of the network must belong to one and only one cluster and that each cluster may have a different cluster size Ni. The final cluster size is determined by the minimum amount of nodes that explain a minimum of 90% of the data variance of that region, by means of the calculation of the corresponding eigenvalues. For example, small cluster sizes appear in regions with high data correlation and large cluster sizes appear in regions with low data correlation.

The fact that it is possible to explain a minimum amount of data variance for all the obtained clusters leads to a more relevant characteristic of SODCC: the phase transition of the data matrices of each cluster is achieved, i.e. it is possible to properly perform the eigendecomposition within each cluster. For details about the phase transition of correlation matrices with a finite number of samples, refer to Appendix A.

The SODCC algorithm follows a bottom-up approximation, where the nodes start belonging to small clusters and, by means of cluster fusion processes, they end up belonging to larger clusters. The algorithm has two phases that are different in both objective and operation:

  1. 1.

    Cluster initialization—random initialization of small clusters that act as seeds for the final set of clusters.

  2. 2.

    Cluster growing—all non-final (or non-stable) clusters take part in cluster fusion processes until a stopping criterion is reached and the final set of clusters is determined.

2.2.1 Cluster initialization

The objective of this phase is the random initialization of the set of clusters, so that the second phase can use them, perform its operation, and output the final set of clusters.

Given the N nodes that form the network, there are three possible and exclusive status for any of them:

  1. 1.

    Cluster initialization (CH)—this is the first node of a given cluster.

  2. 2.

    Normal node—the node belongs to a cluster and it is not the CH of that cluster.

  3. 3.

    Role-free node—the node does not belong to any cluster in any of the previous ways.

The first phase of the SODCC algorithm also requires two preset parameters in order to form the first set of clusters. Parameter P indicates the probability that a given role-free node becomes CH and parameter N1st indicates the maximum initial cluster size. These parameters are set in order to obtain clusters that are small enough to act as “seeds” of the final clusters and large enough to not interfere with the operation of the second phase.

The results presented in this work are obtained considering P = 0.35 and N1st = 3, that were also considered in previous works (Chidean et al. 2015a, b). However, Section 3 includes a brief discussion regarding method robustness with respect to these parameters.

The operation of this first phase is summarized in Algorithm 1. In short, role-free nodes independently decide to become CH according to P. Next, the new CHs search for role-free nodes in their neighbourhood and, if it is possible, include up to N1st − 1 role-free nodes into their own cluster. “Neighbourhood” allows multiple definitions in a network; however, in this work can be understood as nodes located within a short distance of a given node, e.g. the eight nearest nodes in a squared regular grid as the one sketched in Fig. 1.

figure a
Fig. 1
figure 1

Example of clusters obtained by SODCC. The embedded graphs schematize the sorted eigenvalues λi, highlighting the \(\hat d\) eigenvalues that explain a minimum of 90% of the variance in each cluster

Due to chance, it is possible that there are still some role-free nodes after this algorithm loop. For this case, the SODCC algorithm allows the repetition of this cluster initialization phase multiple times (up to Nrep times). Eventually, after a given number of repetitions, all remaining role-free nodes turn into CH and, unable to find available role-free nodes, their cluster is formed by a single node.

This random cluster initialization may affect the interpretation of results. Therefore, in order to cancel its effect, the SODCC algorithm has to be applied multiple times and the results have to be analyzed in a probabilistic manner.

2.2.2 Cluster growing

The objective of this phase is to determine the final set of clusters, by means of cluster fusions processes until a stopping criterion is achieved. The SODCC stopping criterion is independently evaluated for each non-final cluster after each cluster fusion process and uses the covariance matrix \({\Sigma }_{i}\in {\mathbb R}^{N_{i}\times N_{i}}\) of the data measured by the Ni nodes that form the i th cluster. To ensure the achievement of the stopping criterion, SODCC does not consider cluster division processes. The detailed operation of this second phase is summarized in Algorithm 2. At the beginning, for the i th cluster, the minimum amount of data per node Mi to achieve the phase transition is calculated, that is, according to matrix perturbation theory (Nadler 2008), the minimum number of entries per node needed for the successful construction of a covariance matrix of a given size. Next, the covariance matrix Σi is calculated, and its eigenvalues are estimated. Following, the number of eigenvalues \(\hat d\) (sorted in decreasing order) that account for a minimum of 90% of explained variance of the dataset measured by all the nodes forming i th cluster is determined. In this work, the value of the \(\hat d\) is calculated by means of the Fast Subspace Decomposition (FSD) algorithm (Xu and Kailath 1994). For details about the FSD algorithm, please refer to Appendix B.

figure b

Finally, based on the values of Ni and \(\hat d\), the stopping criterion is evaluated. More specifically, the i th cluster fulfills the stopping criterion if \(\hat d<N_{i}\) and the SODCC algorithm ends its operation for this cluster. Otherwise, this cluster must fuse to a neighbouring cluster (minimum euclidean distance) in order to grow in size and to span a larger spatial region and fulfill the stopping criterion in future algorithm iterations. Again, in the present work, cluster fusion is based on distance between nodes and a cluster can only fuse to those clusters located in the neighbourhood of its nodes.

The relation between the stopping criterion of the SODCC and the final cluster sizes has been shown in Chidean et al. (2015a). In the following, we summarize the main idea, also represented in Fig. 1. Given the final set of clusters, the small clusters are expected to appear in areas where there exists a high data correlation, as only data from few nodes are needed to determine the \(\hat d\) eigenvalues with 90% of the explained variance. For example, in Fig. 1, the cluster shaded in purple is final as the sum of the explained variance of the first \(\hat d=2\) eigenvalues is greater than 90%, indicating that the Ni = 3 nodes of the cluster have high data correlation. On the other hand, large clusters are expected to appear in areas where the cross-correlation of the data among nodes is low. Therefore, the SODCC algorithm ensures that closely correlated data series are clustered together (more than 90% of the variance can be explained with \(\hat d < N_{i}\) eigenvalues) and that the cluster configuration reflects the underlying data spatio-temporal correlations.

2.3 Theoretical analysis of SODCC

In the previous subsection, we have described an algorithm that drives the clustering in the different stages. The cluster fusion processes can be directly assimilated with diffusion-limited aggregation processes (in terms of the cluster border interaction) for irreversible clusters (i.e. once they form, they do not diminish in size). This assimilation provides us with the means to propose a theoretical analysis of the cluster size probability (CSP) theoretical distribution obtained for the final set of clusters.

An initial formulation for this distribution can be proposed following a dynamic scaling function (Vicsek and Family 1984):

$$ n(t,N_{i})\sim t^{-w} N_{i}^{-\tau} f\left( \frac{N_{i}}{t^{z}}\right) $$
(1)

where t is the time throughout the clustering process until the final distribution is achieved and Ni is the number of nodes that form the i th cluster. The importance of time in the present theoretical formulation lies on its scaling dependence, independent from that of the cluster size. Equation 1 is formed by three terms that describe the cluster fusion processes through a power-law (with w > 0), the final CSP theoretical distribution through a different power-law (with 0 < τ < 2), and the characteristic cluster size (with z > 1), respectively. Function f(x) has power-law behaviour for small x, i.e. x2−τ for x ≪ 1 and f(x) ≪ 1 for x ≫ 1. In the context of climatology, the scaling functions for cluster size distribution are related to stable spatio-temporal characteristics that exponentially relate the spatial extent of physical phenomena to their temporal stability. In the present case, the cluster second-order statistics features define both relations. In climatology, the extensive study of the EOF and their is the closest relative to the present approach. Generally, EOF is used to characterize global sets of data. In the present study, a possible interpretation would be the local determination of EOF and their temporal stability.

The CSP theoretical distribution for the final set of clusters (the equilibrium condition of the fusion processes) can be approximated by

$$ n(t,N_{i})\sim t^{\tau-2}N_{i}^{-\tau} $$
(2)

Equation 2 provides us with a power-law dependence on the cluster size from the smallest cluster size possible (Ni = 3 in our case) to infinity. A normalization condition is necessary in order to define a proper probability distribution, i.e.

$$ \sum\limits_{N_{i}=3}^{N}n(t,N_{i}) = 1 $$
(3)

The time dynamics of the formation of the final set of clusters (governed by the second phase of the SODCC) are quantized at the characteristic times, when the phase transition condition for the covariance matrix is fulfilled (see Appendix A) and the FSD statistic is computed. Then, a fraction of the clusters with size Ni will grow in size, as they are fused to other clusters.

In a CR problem, where spatial correlations are studied, the minimum meaningful cluster size of the final set of clusters is \(N_{i}^{\min \limits }=3\), where the SODCC determines two signal eigenvalues (\(\hat {d}_{\mathsf {min}}=2\)) that account for a minimum of 90% of explained variance and one noise eigenvalue. Therefore, the CSP theoretical distribution includes an exponential decay of the cluster size from Ni = 3 onwards.

On the other hand, the theoretical maximum resolvable signal subspace dimension \(\hat {d}_{\mathsf {max}}\) is limited by the temporal size of the data MT. For the final set of clusters, the condition for phase transition holds for all clusters. As the minimum amount of data required to properly calculate the data covariance matrix, Mi can be set to \(M_{i}=4\times N_{i}^{\max \limits }\) and as at least one noise eigenvalue is determined, the maximum value for \(\hat {d}_{\mathsf {max}}\) is \(N_{i}^{\max \limits }-1\) (Chidean et al. 2015a). Clusters where the signal subspace dimension is \(\hat {d}_{\mathsf {max}}\) (the maximum obtained value considering the complete set of grid points and all time instants) will be the largest clusters in the grid with FSD convergence. In statistical terms, these clusters will be the largest connected structures in the grid that are bound with the 90% explained variance. The larger clusters (where \(N_{i}>\hat {d}_{\mathsf {max}}+2\)) are not bestowed with such tight relation and, therefore, are bound to be ephemeral through the different realizations of the SODCC algorithm, thus changing their shape, size, and even their location. That is the main reason for their exponentially decaying sizes in the CSP distribution.

Thus, the CSP theoretical distribution includes an exponential decay between cluster sizes \(\hat {d}_{\mathsf {min}}+1\) and \(\hat {d}_{\mathsf {max}}\), a sharp probability rise in \(N_{i}=\hat {d}_{\mathsf {max}}+1\), followed by an exponential decay with the same exponent. These behaviours are universal in clustering processes with fusion (Vicsek and Family 1984).

Considering the above arguments, we formulate the CSP theoretical distribution obtained for the final set of clusters as:

$$ n(t,N_{i})=\kappa \times t^{-w}\left( g(N_{i},\hat{d}_{\mathsf{min}})+g(N_{i},\hat{d}_{\mathsf{max}})\right) $$
(4)

where the modulation function

$$ g(x,x_{0}) = \theta(x-x_{0})(x-x_{0})^{-\tau} $$
(5)

θ(x) is the Heaviside step function (θ(xx0) denotes the θ(x) function shifted x0 units in the abscissa axis) and

$$ \kappa=\frac{\tau -1}{(N-\hat{d}_{\mathsf{min}})^{1-\tau }+(N-\hat{d}_{\mathsf{max}})^{1-\tau }-2} $$
(6)

is the normalization constant s.t. Equation 3 is satisfied.

In Fig. 2, we can see a generic CSP in which \(\hat {d}_{\mathsf {min}}+1\) and \(\hat {d}_{\mathsf {max}}+1\) are the most relevant features of the distribution. Due to the dynamics of cluster aggregation, the clusters with sizes in the trough of the \(N_{i}\in {\left [\hat {d}_{\mathsf {min}}+1, \hat {d}_{\mathsf {max}}+1\right ]}\) interval are those in which their signal eigenvalues are relevant to the statistics of the data under study. Clusters with sizes beyond \(N_{i}=\hat {d}_{\mathsf {max}}+1\) are the result of cluster aggregation of large, unstructured clusters with unresolved signal eigenvalues according to the FSD statistic. Thus, let the “focus zone” be the cluster size range where the CSP is able to resolve the climate regions (e.g. the shaded area in Fig. 2).

Fig. 2
figure 2

Plot of a generic CSP n(t,Ni), where \(\hat {d}_{\mathsf {min}}+1\) and \(\hat {d}_{\mathsf {max}}+1\) are indicated with grey arrows and the interval of interest of cluster size probability (focus zone) is shown in blue

In this section, we have shown that the resulting CSPs do not have an arbitrary shape. On the contrary, they result from the cluster fusion process. The fitting of the experimental CSPs to their theoretical functions by means of the scaling parameters τ, κ, and w may reveal further details of the physical processes in a single month/year. These scaling parameters determine the statistics of the cluster sizes. Therefore, examining their variations through the years may lead to revealing the physical nature of the interactions among clusters. This task is out of the scope of the present work but it will be undertaken in the future.

3 Case study: wind resource in the Iberian Peninsula

In this section, we analyze the wind resource in the Iberian Peninsula in the framework of CR, in order to show the performance of the proposed SODCC. Before this, we discuss some previous works dealing with spatio-temporal analysis of wind speed and different approaches for its regionalization.

Spatio-temporal analysis of wind speed is significant for a number of problems related to climate, as indicator of circulation changes, and also for renewable energy resource analysis, among others. Very recent works deal with wind speed regionalization or spatio-temporal analysis of wind speed using reanalysis data. In Yu et al. (2015), the temporal variability of wind speed is studied in the USA, based on the climate forecast system reanalysis data. EOF analysis is used to find connections to NAO and ENSO patterns, which seem to control part of the wind speed variability across the USA. Another related work is Troccoli et al. (2012), which studies long-term variability of wind speed trends over Australia. In this case, wind speed observations and reanalysis data are used to obtain the wind speed trends.

Also, recent works on wind speed regionalization in the Iberian Peninsula have been published. In Azorin-Molina et al. (2014), the analysis of homogenization of wind speed in the Iberian Peninsula is considered. The study analyzes wind speed trends recorded at 67 land-based stations across Spain and Portugal for the period 1961–2011, finding a slight downward trend in wind speed. The paper also analyzes the possible impact of three atmospheric indices (NAO, Mediterranean oscillation, and western Mediterranean oscillation) and the role played by the urbanization growth in the observed decline of wind speed. Finally, in Santos-Alamillos et al. (2016), the effect of spatio-temporal balancing of wind resource in wind farms in Spain is studied. A regionalization approach based on EOF analysis is carried out in this case.

With this discussion on previous related works in mind, we start our analysis of wind speed in the Iberian Peninsula, by considering wind speed at 10-m data obtained from the ERA–Interim reanalysis dataset. We apply the SODCC clustering algorithm, from both spatial and temporal points of view, including an analysis of the results in the framework of climate indices such as the NAO. Next, we compare our findings with the results obtained by analyzing the dataset by means of classical data analysis in climatology, i.e. EOFS and REOFs. With this dual analysis, we are able to show the main benefits of the CR method proposed in this work. Finally, we demonstrate the robustness of the SODCC algorithm to initialization parameters.

3.1 Dataset

In this case study, we use wind speed at 10-m data obtained from the ERA–Interim reanalysis dataset (Dee et al. 2011). The geographical area is bounded by parallels 35° N and 44° N and by meridians 10° W and 6° E, demarcating the Iberian Peninsula and the Balearic Islands. The grid is set to 0.5° × 0.5°, making a total of N = 627 nodes in the network, placed in a regular grid as shown in Fig. 3. Given this node location, the distance resolution is approximately 56 km. For this case study, we consider the wind speed data for January, in representation for the most windy season in the area of interest (Zishka and Smith 1980; Branick 1997). We consider the 1979–2014 time interval with a temporal resolution of Ts = 6 h, making a total of Mmonth = 31 × 4 = 124 data values per month. As the analysis that we consider in this work is performed independently per year, we define the following set of data (TJ) to clearly specify the time period under analysis:

$$ T_{J} = \{\text{first time instant of } j\ |\ j = 1979, \dots, 2014 \} $$
(7)
Fig. 3
figure 3

Geographical location of the N = 627 nodes (blue dots) of the network used in the present case study

Given the regular grid of the nodes, we should expect edge effects in the border of the studied area. However, as it will be commented on Section 3.b.2, these effects are not apparent in the results.

3.2 Data analysis with SODCC

The signal processing method for CR presented in this work is based on the SODCC algorithm. The idea behind this method is that, for each realization, the statistics calculated for the set of clusters obtained reveal both the spatial and temporal modes of the climate variable. All of the software used in the present work for the SODCC analysis is original and it has been developed by some of the authors in Matlab and it is available on the following link (https://github.com/MihChi/SODCC).

The considered dataset includes January wind speed data between 1979 and 2014. Data of each particular year is independently analyzed, giving a total of 36 starting points for the SODCC algorithm. This analysis allows the assessment of the temporal evolution of the spatial structures of the data.

For each possible value of TjTJ, in this case study, we perform 5000 independent simulations, applying the SODCC algorithm to the wind speed dataset of N = 627 nodes. Such experiments allow the analysis of the results with sufficient statistical representativity that are not determined by specific realizations of the random initialization. As previously mentioned, the output of each independent SODCC realization is a set of clusters, formed by nearby nodes, where it is possible to estimate the eigenvalues that explain a minimum of 90% of the data variance.

Result analysis can be performed at node level and at set of clusters level for each Tj. In the first case, by analyzing how likely is each node of the grid to belong to a cluster of a given size, the spatial correlation of the data at multiple spatial scales is revealed. Moreover, changes over the different years will show both the temporal evolution of the data statistics and their spatial extent. The following section is devoted to the spatial analysis of the results by means of the node to cluster size probability (NCSP) histograms. In the second case, studying the complete set of clusters allows a more straightforward analysis of the temporal evolution of wind resources in the Iberian Peninsula. Section 3.2.2 includes the temporal analysis by means of the CSP histograms normalized to have unit area as approximations of the corresponding CSP theoretical distributions that were previously introduced in Section 2.2.3.

3.2.1 Spatial analysis

The spatial analysis of the wind speed dataset can be performed by means of the node to cluster size probability (NCSP) histograms normalized to unit area, namely the probability of a given node to be associated with a cluster of a certain size Ni for the starting point of the simulation TjTJ. These NCSP histograms can be displayed over a map, similar to the EOF and REOFs representations. We can represent one map for each combination of Ni and Tj enabling the analysis of the spatial extent of the data correlations with great detail. Moreover, this representation also allows the simultaneous analysis of various spatial extents by using multiple values of Ni for each value of Tj (e.g. Ni ∈ {5, 6} or Ni ∈ focus zone).

The decision criterion of the SODCC organizes the nodes in terms of the explained variance, i.e. in a cluster of Ni a minimum of 90% of the data variance is explained by \(\hat d\) eigenvalues, being \(\hat d<N_{i}\). Then, geographical areas where the measured data has higher spatial correlation will result in higher probability to form smaller clusters and lower probability to form larger clusters, while areas where the spatial correlation is lower the behaviour is opposite. In the present case study, from Figs. 4 and 5, we can identify the extent of the spatial correlation for each geographical areas. The complete set of figures showing the NCSP histograms for all possible cluster sizes (\(N_{i}\in [N_{i}^{\min \limits },N_{i}^{\max \limits }]\)) and for all starting points of the simulation (TjTJ) is provided as electronic supplemental material (file “2_NCSP.pdf”). All these additional figures reaffirm our conclusions regarding the spatial correlation of the wind speed data.

Fig. 4
figure 4

NCSP values for Ni = 7 and multiple values of Tj represented over the map, showing the probability of a given node to belong to a cluster of size Ni = 7. This representation allows the spatial analysis of the data correlations

Fig. 5
figure 5

NCSP values for Ni = 18 and multiple values of Tj represented over the map, showing the probability of a given node to belong to a cluster of size Ni = 18. This representation allows the spatial analysis of the data correlations

Figures 4 and 5 show the NCSP for Ni = 7 and Ni = 18, respectively, for a subset of TjTJ, i.e. each figure indicates the probability of each node to belong to a cluster if size Ni in a given year. The representation of the NCSP over a map and showing the probability value using colors allows the spatial analysis of the data correlations: nodes located in areas with similar colors have similar data statistics. The selected cluster sizes are representative values from both inside and outside the CSP focus zone and will allow the analysis at different spatial scales, as the cluster size is directly related with the spatial extent. Both figures use identical color range in order to show the probability values and allow direct comparison. From these figures we can clearly identify two different behaviours regarding the spatial extent of the data correlations.

Figures 4 and 5 show the NCSP for Ni = 7 and Ni = 18, respectively, for a subset of TjTJ, i.e. each figure indicates the probability of each node to belong to a cluster if size Ni in a given year. The representation of the NCSP over a map and showing the probability value using colors allow the spatial analysis of the data correlations: nodes located in areas with similar colors have similar data statistics. The selected cluster sizes are representative values from both inside and outside the CSP focus zone and will allow the analysis at different spatial scales, as the cluster size is directly related with the spatial extent. Both figures use identical color range in order to show the probability values and allow direct comparison. From these figures, we can clearly identify two different behaviours regarding the spatial extent of the data correlations.

The fact that multiple nearby nodes have a higher probability to belong to a cluster of a given size (e.g. blue and green areas in Figs. 4 and 5) indicates that the statistics of the wind speed data of those locations are similar (from an explained variance point of view). However, while in Fig. 4, the examined clusters reveal the short range inter-node relations (clusters with size 7, with an isotropic distribution refer to an area of approximate radius of 87 km), Fig. 5 refers to longer range inter-node relations (areas with approximate radius of 137 km). Even though the particular probabilities for both cluster sizes may be similar, the spatial spread of both is obviously different. This fact is also apparent in these figures due to contrast differences: lower for larger cluster sizes, higher for small cluster sizes.

A consistent feature can be observed in both figures. If a given node has a high probability of belonging to a small size, then the same node has lower probability of being assigned to a larger cluster, as the NCSP is calculated per node and its sum has to be 1. However, individual clusters are not necessarily isotropic, i.e. radially symmetric. As a matter of fact, individual clusters may adopt elongated or even non-convex shape. And self-averaging over multiple cluster sets makes indistinct the variations on the clusters frontiers.

The particular spatial patterns that appear in Figs. 4 and 5 are revealing a shorter range explained variance in the area around the Balearic Islands for the subfigures plotted in the rightmost columns. As for the physical interpretation of these results, very similar variations in wind speed through time in neighbouring nodes are revealed as blue/green patches. Thus, from subfigures (a), (c), (e), (g), and (j) of both figures, we can clearly define short-range similarly varying areas. These homogeneously varying areas and their relation to macroscopic climate features will be addressed later in the temporal analysis of the SODCC.

One of the main benefits of the present methodology relies on the fact that cluster sizes are directly related to the explained variance of the data from the encompassed nodes. Smaller cluster sizes can be directly related to fewer eigenvalues being directly responsible for 90% of the variance of the data in the cluster (by construction). On the other hand, in larger clusters, a larger number of eigenvalues is needed to explain the same amount of variance of the data over a larger spatial extension. Thus, the fewer eigenvalues needed (small clusters), the higher the variance endowed to each of those eigenvalues than with respect to the set in larger clusters. In the present case, the larger variance in the smaller clusters (in wind speed) can be related directly to a higher amount of energy concentrated in smaller areas. This fact can be directly related to wind energy production in those areas (Chidean et al. 2018). Furthermore, a relative increase (decrease) of the population of smaller clusters in successive years would indicate the absence (presence) of dominant components at a larger scale. This facilitates not only the quantification of the correlation distance of turbulence in wind speed, but also their dominant locations at any given time.

At this point, one further comment is required with respect to the edge effects. Although we expected edge effects to be apparent in the NCSP histograms, we did not find evidence of such. We performed spatial sensitivity analysis considering both the center and the borders of the grid, obtaining, up to experimental measurement error, virtually the same results. Thus, this type of analysis for large grids does not exhibit edge effects.

In conclusion, we can state that the CR based on the SODCC algorithm allows for the identification of regions of interest where the climatic data share similar statistics or belong to the same physical mode.

3.2.2 Temporal analysis

The temporal analysis of the results can be performed by means of the NCSP. However, due to the many possible cluster sizes (from cluster sizes 3 to 35 for each year under study), an alternative method based on the theoretical CSP distribution presented earlier is considered. The proposed method is based on the analysis of CSP histograms for each Tj and its inter-annual variations.

The CSP histogram reveals overall information about the behaviour of the results for each Tj. For example, Fig. 6 shows the CSP for Tj = 1979 and Tj = 1980. As previously stated, these histograms reveal a bi-modal shape, with two exponential decays starting in \(\hat {d}_{\mathsf {min}}+1\) and \(\hat {d}_{\mathsf {max}}+1\), respectively. The complete set of figures showing the CSP histograms is provided as electronic supplemental material (file “1_CSP.pdf”).

Fig. 6
figure 6

CSP histograms resulting from the SODCC clustering of the wind speed data from the N = 627 nodes for Tj = 1979 and Tj = 1980

Again, the temporal analysis of the results using the complete set of figures like the ones represented in Fig. 6 might be arduous and relevant details might be omitted. Thus, Fig. 7 combines all the obtained CSPs in a 3D representation, where the color gradient indicates the bar height (the probability of each cluster size) and the third axis is time. The orientation of the figure is chosen to better visualize the most representative differences over the time.

Fig. 7
figure 7

CSP histograms resulting from the SODCC clustering of the wind speed data for the N = 627 nodes and for all values of TjTJ (see Eq. 7), represented as a 3D-bar plot and color gradient to indicate the bar height

In this representation, the variations in time in the different CSPs are now apparent and are revealed as a series of hills and valleys (in time). For example, for a cluster size of 15 nodes, we can see a valley for Tj = 1981 and a hill for Tj = 1991. However, some datasets might lead to less clear results and additional calculations have to be made in order to obtain similar conclusions. Possibly one might criticize that visual analysis of the figures is not enough to offer convincing conclusions. Therefore, the next step in this methodology is to quantify the differences between the different CSPs. A canonical measure of the “distance” between any two probability mass functions is the Kullback-Leibler divergence (KLD) (Kullback and Leibler 1951). Thus, to quantify the variations in time between any two of the CSP histograms Pcluster(Tj) and Pcluster(Tk), we examine their KLD computed as

$$ D_{\text{KL}}(T_{j},T_{k}) = \sum\limits_{N_{i}=N_{\min}}^{N_{\max}}P_{\text{cluster}}(T_{k}|N_{i})\log\frac{P_{\text{cluster}}(T_{k}|N_{i})}{P_{\text{cluster}}(T_{j}|N_{i})} $$
(8)

where Pcluster(a|b) states for the value of the CSP histogram for the year a and cluster size b. In order to avoid numerical issues and still calculate the KLD, if any Pcluster(a|b) = 0 we replace this zero value by 2− 52, the machine epsilon of the simulation and analysis equipment.

As stated above, the KLD is used here as a measure of dissimilarity among the different CSP histogram with different years as reference, i.e. a low KLD value means that the CSP histograms from two different years are very similar. Thus, we will have as many KLD curves as reference years.

We calculate the KLD, DKL(Tj,Tk), for all possible combination of (Tj,Tk), being TjTJ and TkTJ. Figure 8 a and b show all the calculated KLD curves, separated into two differentiated patterns depending on the reference year. This separation depends on both the decay exponent of the CSP histograms shown in Fig. 7 and the shape of each KLD curve. In Fig. 8a, the reference years Tj are

$$ T_{J}^{*} \in \{1986, 1988, 1994, 1996, 2001, 2003, 2008, 2010, 2014\} $$
(9)
Fig. 8
figure 8

Kullback-Leibler divergence DKL(Tj,Tk) for TkTJ and separating between (a) \(T_{j}\in T_{J}^{*}\) (see Eq. 9) and (b) \(T_{j}\not \in T_{J}^{*}\), plotted in black lines. c January index of the NAO between 1979 and 2014, plotted in red and blue bars for NAO− and NAO + , respectively. d Correlation between all KLD curves and NAO index (normalized to their respective maximum values, \(\text {Corr}\left (D_{\text {KL}},NAO+\right )_{\max \limits }=0.41\), \(\text {Corr}\left (D_{\text {KL}},NAO-\right )_{\max \limits }=0.21\)). In (b) and (d), years \(T_{j}\in T_{J}^{*}\) have been highlighted with grey lines

The representation of the KLD curves in separated plots is useful as it helps the analysis of the results by enhancing the differences between two distinct behaviours. For example, the \(T_{J}^{*}\) subset can be easily identified from Fig. 8b as the \(D_{\text {KL}}(T_{j}\not \in T_{J}^{*},T_{k}\in T_{J}^{*})\) stand out as maxima. This subfigure also reveals a quasi-periodic pattern for the \(T_{J}^{*}\) subset, that appears in non-consecutive pairs every 4–6 years. On the other hand, in Fig. 8a, \(D_{\text {KL}}(T_{j}\in T_{J}^{*},T_{k})\) exhibit significantly higher values that confirm the lower degree of similarity to other years, as the years from \(T_{J}^{*}\) are the reference for the KLD.

3.2.3 Correlation with climate indices

The KLD between the different CSP histograms offers the possibility of relating the obtained results with different climate indices. Note that it is possible to calculate the correlation between the different KLD curves and any climate index related to the phenomenon under study. In this case, a good choice for a climate index to be correlated with wind speed in the Iberian Peninsula is the NAO (Working Group on Surface Pressure 2016). This climate index controls important processes related to Iberian Peninsula climate (Trigo et al. 2002), with deep effects not only on temperature (Prieto et al. 2002) or precipitation trends (Trigo et al. 2004), but also in renewable energy resources such as wind speed, as shown in Jerez et al. (2013). Of course, the NAO is not the only index with influence in the climate of the region, but different studies have pointed out a stronger relationship of the Iberian Peninsula climate with the NAO (Gimeno et al. 2002). On top of the already referred to works in the introduction of the present work, there are more recent studies that build on to that relation (Qu et al. 2012; Burningham and French 2013) and (Bierstedt et al. 2014).

To sum up the following analysis, we have encountered the difficulty of correlating a time series which can be positive or negative (the NAO) with a positive defined time series (the KLD). If we do that the usual way, both the correlation and the anticorrelation cancel each other, thus obtaining a numerical small correlation (< 0.5) where a clear graphical correlation between signals can be observed. We have separated the positive and negative phases of the NAO as they have a clear climatological interpretation.

More specifically, we consider the January index of the NAO for the years under study, that is plotted in red (NAO−) and blue (NAO + ) bars in Fig. 8c. Visual comparison between the KLD curves in Fig. 8a and the NAO index reveals a high degree of similarity. To attest this fact, we calculate the correlation between the KLD curves and the January index of the NAO, differentiating between NAO− and NAO + , and normalizing to the maximum obtained value.

This normalization is necessary as in this work we calculate the correlation between a real-defined time series (the NAO index) and a non-negative defined one (the KLD time series). Moreover, the separation between NAO + and NAO− (e.g. by means of a half-wave rectification) is also useful because our signals of interest exhibit both correlation or anti-correlation, the cross correlation operator performs poorly (cancelling each other), and the actual similarity between the signals is not apparent. This non-linear operator (correlation a positive signal with a half-wave rectified real signal) has been described elsewhere (Eq. (3) in Doi and Fujita 2014) and it has been referred to as cross-matching operator.

By separating between NAO + and NAO−, the total signal energy of the NAO series (i.e. the squared modulus of the signal integrated in time, such that the larger the variation of the NAO series between positive and negative values, the larger is its corresponding energy) is distributed between the two rectified signals (energy of the NAO signal \(={\sum }_{i}|\text {NAO}_{i}|^{2}={\sum }_{i}|\text {NAO}_{i}^{+}|^{2}+{\sum }_{i}|\text {NAO}_{i}^{-}|^{2}\)), namely in the following percentage 68% and 32%, respectively. Therefore, the maximum value of the correlation between the KLD series and NAO + signal would always result in value lower than unity, to be precise 0.68 for NAO + (and 0.32 for NAO−) given the considered NAO series. In this work, the correlation values obtained between all the KLD series and NAO + signal are ≤ 0.41 (and ≤ 0.21 for NAO−). Taking into account the half-wave rectification, these values correspond to actual correlation values of 0.41/0.68 = 0.60 between KLD and NAO + and of 0.21/0.32 = 0.66 between KLD and NAO−.

As both actual correlation values are similar and in order to perform fair comparison avoiding different signal energy issues, in Fig. 8d, we represent the correlations series normalized to their maximum value. In order to clarify the relation to the KLD represented in Fig. 8b, we marked with vertical grey lines the reference years \(T_{j}\in T_{J}^{*}\) in both figures. From these results, we can explicitly state that the KLD with reference years \(T_{j}\in T_{J}^{*}\) is highly correlated with both phases of the NAO. Then, the spatial extent of the data correlations, observed in the NCSP histograms, is also very similar between years \(T_{j}\in T_{J}^{*}\) but very different to those of any other year \(T_{j}\not \in T_{J}^{*}\). As the KLD is used as a distance metric of “difference” between histograms, each KLD series quantifies differences between the CSPs of different years. Namely the higher the KLD, the more different the CSPs histograms and a given time series is dependent on Tj. For example, CSPs obtained for \(T_{j}\in T_{J}^{*}\) reveal a higher occurrence of small clusters, perceptible by the presence of multiple “valleys” in Fig. 7.

Therefore, KLD series calculated with \(T_{j}\in T_{J}^{*}\) as a reference, shown in Fig. 8a, have higher values on a sustained basis. These higher values are caused by the displacement of the CSPs histograms, that shift from high occurrence of small clusters to higher occurrence of large clusters. Moreover, it is noticeable that common patterns in Fig. 8a closely follow the shape of the January NAO index (depicted in Fig. 8c). Therefore, we can assert with confidence that almost all years with positive NAO index and most of the years with negative NAO index show a higher occurrence of larger clusters. We can also conclude that years \(T_{j}\in T_{J}^{*}\) are “tipping point” years, where no clear phase (either positive or negative) of the NAO index dominates in the Iberian Peninsula. This situation results in fewer dominant large-scale components and a larger short scale wind speed correlation, causing a more frequent and uneven distribution of turbulent cells. In the present work, these cells correspond to clusters sizes of 7 − 9 nodes covering convex clusters with a ≈ 50-km radius.

In conclusion, a macroscopic time evolution of the climate regions with the SODCC analysis and its relation to NAO climate index has been established and the associated physical interpretation can be extracted from the results.

3.3 Comparison with EOFs and REOFs

In the following, we analyze the wind speed dataset using EOFs and REOFs, independently for each year considered in the dataset (from 1979 to 2014), using the methodology detailed in Björnsson and Venegas (1997).

The first step for this analysis is to check out whether the phase transition is achieved. In other words, we calculate the ratio between the amount of nodes and the amount of data available for each node and then we check if it is sufficient to properly calculate the EOFs. The considered dataset includes N = 627 nodes in the network and Mmonth = 31 × 4 = 124 data values per node and month. The condition that establishes that the phase transition is achieved and therefore that the EOFs are properly calculated is the following:

$$ \frac{M_{\text{month}}}{N}\gtrsim4 $$
(10)

Given the considered dataset, it is obvious that the phase transition is not achieved and then it is not possible to calculate the EOFs using all the geographical locations in the Iberian Peninsula. Therefore, it is necessary to use a subsample of the N possible nodes, in order to satisfy both the phase transition condition and the regular spatial grid requirement. Specifically, we consider the Nsubsample = 25 nodes highlighted in Fig. 9 to calculate the EOFs and the REOFs. This spatial subsampling diminishes the spatial resolution of the grid from 56 to 168 km in the longitudinal axis and to 280 km in the latitudinal axis.

Fig. 9
figure 9

Geographical location and identification of the Nsubsample = 25 nodes (red dots) of the network used for the EOF and REOF analysis

Regarding the REOF calculation, we consider the varimax criterion (Kaiser 1958) applied to the first three EOFs. Figure 10 shows the first EOF and the first REOF calculated over the considered dataset for several values of TjTJ, revealing the different physical modes of the wind speed in the region of interest. The complete sets of figures showing the EOF and REOF are provided as electronic supplemental material (files “3_EOF.pdf” and “4_REOF.pdf”). As expected for this analysis, we can observe the differences between the EOFs and the REOFs, mainly caused by the inherent orthogonality obtained from the EOF calculation and relaxation of this constraint in the REOF case. It is commonly accepted that the physical modes are not in general orthogonal and, therefore, the EOF analysis is not the most suitable. On the other hand, the REOF analysis reveals the structure of the physical modes and the obtained patterns indicate underlying behaviour of the data, i.e. the wind speed in this work. Note, however, that the proposed SODCC algorithm also allows the study of the underlying data behaviour, and besides that, as we previously mentioned, it allows carrying out this study at multiple spatial resolutions of the grid.

Fig. 10
figure 10

Spatial patterns and percentage of the explained variance of the first EOF ((a), (c), (e) , and (ff)) and REOF ((b), (d), (f), and (h)) modes calculated for the Nsubsample = 25 spatial locations and multiple values of Tj, presented as dimensionless maps normalized to the [− 1,1] interval. Selection of years has been done in order to show no bias with respect to set selection, i.e. \(\{1987,1994\}\in T_{J}^{*}\), \(\{1990,1994\}\not \in T_{J}^{*}\)

In the following, we directly compare the REOF results with the results obtained from SODCC for the NCSP histograms, that were previously described. Figure 11 represents the results obtained for multiple values of Tj using the REOF analysis and the NCSP histograms considering cluster sizes from Ni = 5 to Ni = 8. In this case, the NCSP histogram when multiple cluster sizes are considered represents the probability of a node to belong to clusters of multiple sizes, therefore is the sum of the histograms of individual cluster size. The complete set of figures showing the REOF plots overlapped to the NCSP histograms is provided as electronic supplemental material (file “5_NCSP_REOF.pdf”). This range of cluster sizes represents the focus zone for the CSP histograms, it allows the study of short range spatial correlation, and limits the small variations between consecutive cluster sizes that can appear by chance. Moreover, the results that we may obtain with this analysis can be extrapolated to long range spatial correlations due to the dichotomy previously observed.

Fig. 11
figure 11

Comparison between results obtained with SODCC and REOF analysis for multiple values of Tj. The NCSP distributions are represented for Ni = 5 to Ni = 8 using the colorbar included at the top of this figure. The first REOF mode is represented as dimensionless map normalized to maximum 1. Colorbars are selected to facilitate the visualization. Left column corresponds to set of years \(T_{j}\not \in T_{J}^{*}\), right column corresponds to set of years \(T_{j}\in T_{J}^{*}\). Climate regions can be clearly discerned in the NCSP distributions

We can clearly see in Fig. 11 the marked change in the long to short range behaviour of clusters/cells (larger probability values) in years with either clear positive or negative NAO indices (\(T_{j} \not \in T_{J}^{*}\)) with respect to transition years (\(T_{j} \in T_{J}^{*}\)) commented in the previous subsection. REOF values tend to show the sharpest transitions following the edges of probability change regions (red vs. green or blue) albeit with much less definition. Thus, where there is no dominant phase of the NAO, wind behaviour does not determine the sharp regional separation in climate zones. The Northeastern part of the Iberian Peninsula has marked long range behaviour (low probability for the NCSP distribution, i.e. orange/red regions) independent of the NAO phases (mainly due to the Pyrenees). On the other hand, the behaviour of the rest of the regions in the Iberian Peninsula evolves with time and shifts from long range corridors to short range turbulent cells (higher probabilities for the NCSP distribution, i.e. blue/green regions). The Northwestern region clearly shifts from long range correlations (with either negative or positive NAO) to short range (in years with transitional NAO).

A close analysis of the contour lines that represent the REOFs and the color patches that represent the NCSP histograms reveals that both results lead to comparable conclusions. Thus, it is possible to observe the same physical phenomenon using either REOF analysis or the SODCC analysis. However, the EOF and REOF calculation has several drawbacks that the SODCC algorithm is able to overcome. For example, there exists a high dependency between the REOF results and several criterion choices during their calculation, i.e. the number of EOFs used for their calculation or the rotation algorithm. Moreover, when calculating the EOFs and the REOFs, the phase transition has to be achieved for the autocorrelation matrix of the considered dataset. In other words, if a high spatial resolution is required for the grid then the dataset must include a very large amount of data. Moreover, if the dataset has not enough amount of data, the spatial resolution of the grid is compromised, as it has happened in this case study.

On the other hand, the SODCC algorithm does not depend on any parameter definition that could alter its final result. Moreover, a significantly higher spatial resolution for the grid can be obtained as the cluster organization follows a bottom-up approximation and the phase transition condition has to be achieved at cluster level, requiring a significantly lower amount of data.

A further advantage of SODCC over REOF is the following. While the REOF analysis offers a panoramic view of the behaviour of the underlying data and its statistics, the SODCC algorithm allows the study of multiple spatial correlation extents. This benefit is achieved by analyzing different cluster sizes where the degree of correlation in different geographical regions can be identified (recall that small clusters emerge in areas with high data correlation).

3.4 Robustness to initialization parameters

In the present section, we closely examine the robustness of the results with respect to the initialization parameters. We analyze the variation between CSP originated from simulations with different initialization parameters, using the KLD metric. The initialization parameters under examination are as follows:

  1. 1.

    P—the probability of a node being selected as CH,

    $$P_{i}\in\{0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75\}$$
  2. 2.

    N1st—the maximum initial cluster size, \(N^{1st}_{i}\in \{2, 3, 4, 5\}\), and

  3. 3.

    Nrep—the number of repetitions allowed for the first phase, \(N_{i}^{rep}\in \{2, 3, 4, 5\}\)

Without any loss of generality, we have selected for the present test the first year in the series, Tj = 1979.

Figure 12 shows the resulting KLD calculated between the CSPs obtained varying the three tuning parameters. In Fig. 12a, we may observe that changes in the value of Pi lead to a convex evolution in KLD (with respect to Pi = 0.35, dash-dotted green line, the value that was used throughout this work) for all values in the 0.35 ≥ Pi ≥ 0.65 range. Furthermore, the maximum difference in KLD for that range is below 10− 3. Thus, for reasonable values of Pi (neither too few nor too many initial clusters), selection of the specific value has no relevant impact on the obtained CSPs and, therefore, on the results of the present work. Nevertheless, extreme values of Pi are shown to have limited impact (KLD difference well below 10− 2).

Fig. 12
figure 12

KLD between CSPs obtained from simulations with different initialization parameters for year Tj = 1979 with respect to the parameters used in the present work: a variation of the probability of a node being selected as CH, Pi, b variation of the maximum initial cluster size, \(N^{1st}_{i}\), and (c) variation of the number of repetitions allowed for all the nodes to be included in any cluster, \(N_{i}^{rep}\)

In Fig. 12b, we may observe that the variation of the maximum initial cluster size, \(N^{1st}_{i}\) also leads to negligible differences between the CSPs, as the resulting KLD is well below 10− 3 for the considered range.

Finally, in Fig. 12c, we can see the resulting KLD calculated between CSPs obtained from simulations where different \(N_{i}^{rep}\) was used. Again, the resulting KLD is also below 5 ⋅ 10− 4 indicating that the CSPs are very similar between them.

In order to clarify the overall impact of a purposefully bad selection of initial parameters (worst case scenario which determines the upper limit of variation of KLD), we have represented the maximum obtained KLD in each panel of Fig. 12 as horizontal lines in Fig. 13. Figure 13 also includes the KLD obtained for all TjTJ, previously shown in Fig. 8 (a) and 8(b). It can be clearly seen that, even in the case of a bad selection of the tuning parameters, the impact on the resulting KLD patterns is well below the amplitude of the significant variations (≈ 10− 4). Therefore, the correlation analysis presented in Section 3.2.3 would remain qualitatively invariant should a worst case scenario of initialization parameters arise.

Fig. 13
figure 13

Upper panel: KLD for the different CSPs for all TjTJ (joint representation of the curves in Fig. 8 (a) and (b)) along with the maximum obtained KLD values in Fig. 12 for 0.15 ≥ Pi ≥ 0.75 (magenta), reasonable values of 0.25 ≥ Pi ≥ 0.55 (cyan), \(2 \geq N^{1st}_{i}\geq 5\) (dashed blue), and \(2 \geq N_{i}^{rep}\geq 5\) (purple). Lower panel: same as the above but y-axis zoomed and in log-scale (Zero values of KLD are omitted for clarity)

Therefore, we have shown that the SODCC method is robust to the selection of the initialization parameters.

4 Conclusions

In this work, we have presented a novel methodology for self-organized CR based on the SODCC algorithm. This method organizes the regions by clustering the grid nodes in terms of the explained variance of physical measurements and their extent, i.e. in terms of the statistical characteristics of the time series. The main advantages of the SODCC algorithm, and hence of the CR method proposed, are that it is robust to the selection of tunable parameters and that it does not require any regular or homogeneous grid for the nodes. Moreover, the present method has higher spatial resolution of the grid, lower computational complexity, and a more direct physical interpretation of the outputs than other existing CR methods.

First, the operation of the SODCC algorithm only requires the selection of the margin of the FSD statistic to adjust to the corresponding chi-square distribution, which controls the amount of explained variance in a cluster. In this study, the margin was set such that, once the phase transition for the covariance matrix is achieved, the explained variance in each cluster is higher than 90%. This feature of SODCC is a benefit over other classical methods as the obtained result is due to the self-organization nature of the algorithm and it is not necessary to determine a priori some key parameters, such as the number of EOFs used in the REOF calculation.

Second, the SODCC independence from the measurement nodes location is also a significant advantage over other classical CR methods. In the wind speed case study described in this paper, the SODCC algorithm has been applied to a regular grid. However, this method has been successfully applied to non-regular inhomogeneous grids (Chidean et al. 2015b).

The analysis of the obtained set of clusters enables a microscopic characterization of the climate regions and their spatial extent. The NCSP histograms show the probability of each node to belong to a given climate region (with joint explained variance of 90%) of a given size. These outcomes reveal the different physical modes of the data in the region of interest.

The SODCC outcomes also permit the study of the macroscopic behaviour of the different climate regions through time. The CSP histograms show the overall behaviour of the final set of clusters. The temporal analysis of these results can be performed by using a measure of similarity between the CSP histograms, such as the KLD.

This work also includes the analysis of the wind speed in the Iberian Peninsula in January for the period (1979 − 2014) as a specific case study. By applying the SODCC method to the present time series of data, it is possible to (1) characterize spatial probabilistic climate regions that have higher than 90% explained variance with high spatial resolution of the grid and (2) characterize the time evolution of the similarity of the studied area as a whole. Two distinct patterns emerged in the KLD curves initialized in the different years, pointing to a subset of years (\(T_{J}^{*}\)) with distinct behaviour. Comparison between the KLD curves and both the positive and negative phases of the NAO showed higher correlation values for the \(T_{J}^{*}\) subset of reference years. The CR results corresponding to the \(T_{J}^{*}\) subset of years also showed distinctive features with large spatial correlations. These results are corroborated by means of the REOF analysis, with much lower spatial resolution of the grid however.

Some general results obtained in the present study are in the line pointed out by previous approaches. In Lorente-Plazas et al. (2015), a characterization of surface winds over the Iberian Peninsula is performed via a PCA and clustering approach in which 20 differentiated regions in the Iberian Peninsula. In the present work, we go beyond that characterization and perform a time analysis and a novel correlation with the NAO to show a high correlation of wind patterns in the Iberian Peninsula with such oscillation. This finding is supported, for example, in Jerez et al. (2013) where it indicates a close relationship of the wind speed resource with the NAO. Another work with comparable results with this one is Azorin-Molina et al. (2014), where a study on surface wind speed trend in the Iberian Peninsula is carried out based on measuring stations data. A direct comparison of results is difficult to establish, due to the different methodologies applied, observation lengths, and data used. However, in Azorin-Molina et al. (2014), it is also recognized the relationship of the wind speed resource with the different phases of the NAO. But to add to these previous findings, we have shown that there is statistical evidence that the local wind speed clusters are large while the NAO phase is either positive or negative. However, in the transition years, where the NAO changes its sign, the local wind speed clusters diminish in size. Thus, the regionalization of wind speed zones in the Iberian Peninsula is shown to depend, not only in the geographical features, but also on the NAO phase.

Finally, as main conclusions of this work, we have found the following: (1) the climate regions obtained by means of the SODCC algorithm can be determined with higher spatial resolution of the grid than equivalent analysis methods, e.g. REOF analysis; (2) the SODCC outcomes facilitate the physical interpretation of the underlying phenomena; and (3) the NAO has a huge impact in the wind distribution in the Iberian Peninsula in only a subset of years in the studied period. This indicates that the SODCC is a valuable tool to study the microscopic spatial distribution of climate regions and also the macroscopic temporal evolution of those regions.