1 Introduction

The study of the many environmental and socio-economic impacts of meteorological phenomena and climate change requires improving our knowledge of climate at a local scale. Indeed, studying climate change impacts on agriculture, water resources, pollution, and many other environmental features at a human scale makes high-resolution model simulations essential. However, General Circulation Model (GCM) simulations of the different future climate scenarios prescribed by the Intergovernmental Panel on Climate Change (Vuuren et al. 2011) generally have a coarse spatial resolution (about 250 km) and are thus not suited as inputs to impact models, which need much finer-scale climate information. Hence, GCM climate simulation information has to be brought to more regional or local scales, i.e., high-resolution simulations have to be generated from large-scale (reanalysis or GCM) information. This is the aim of downscaling. Downscaling models can be dynamical or statistical, both approaches being driven by GCM or reanalysis data.

Dynamical downscaling models correspond to the so-called “Regional Climate Models” (RCM), which simulate high-resolution physical processes consistent with the prescribed large-scale dynamics. RCMs can be a GCM with grid refinement over a specific region (e.g., Déqué and Piedelievre 1995; Hourdin et al. 2006) or a limited area model (LAM) constrained at its lateral boundaries by GCMs (WRF, Skamarock et al. 2008). Both GCMs and LAMs are sensitive to the resolution and the physical package which regroups all the model parametrization used in the model to take into account sub-grid scale processes. While the use of LAMs presents some advantages, for instance the fact that they are non-hydrostatic allows very high-resolution downscaling or also the possibility to set a region-specific parametrization, it also creates discontinuities at the boundaries. Previous studies have investigated the sensitivity of the results to the frequency of boundary conditions, size and resolution of the domain (e.g., Noguer et al. 1998; Seth and Giorgi 1998), lateral conditions (Denis et al. 2003) and frequency of reinitialization (Lo et al. 2008). Those studies show that the internal variability of RCMs can strongly influence the results at regional scales and that the small-scale field inside the domain is not always consistent with the driving field (Laprise et al. 2008). To ensure the consistency between the small- and large-scale fields, the model can be driven using nudging techniques (e.g., Omrani et al. 2012a, b). The choice of the physical package that allows the model to simulate all the sub-grid scale processes using parameterizations is also very important and induces large discrepancies between model outputs (e.g., Flaounas et al. 2011). Despite the increase of computing power, running an RCM including all those different formulations still requires important computational resources. 
This often puts limits on the number, the resolution and the time period length of the RCMs runs.

The alternative approach to RCMs is based on Statistical Downscaling Models (SDMs) that rely on determining statistical relationships between large- and local-scale variables and do not try to solve the physical equations modelling the dynamic of the atmosphere. Due to their statistical formulation, they generally have a low computational cost and provide relatively fast simulations. SDMs are now considered as complementary to RCMs, for example in terms of applications for ensembles uncertainties studies (Sachindra et al. 2014). SDMs are based on a static relationship, i.e. the mathematical formulation of the relation between predictand (i.e., the local-scale variable to simulate) and predictors (i.e., the large-scale information or data used as inputs in the SDMs) is supposed to be valid for any time period: not only for the current climate on which the relationship is calibrated, but also, for example, for future climates. This does not mean that the statistical properties of the predictands are stationary (i.e., are the same in current and future climates): if the statistical properties of the large-scale predictors evolve in time, those of the local-scale predictands will evolve as well. Hence, if the relationship is said to be “static” (or “stationary”), the statistical distribution of the predictands is “non-stationary” and the SDMs can be said to be non-homogeneous (e.g., Vrac et al. 2007b). Most state-of-the-art SDMs can be divided into the four following families: transfer functions (TFs), stochastic weather generators (WGs), weather typing (WT) based methods and model output statistics (MOS).

The TFs approach groups together the deterministic functions which “transfer” the large-scale information to the local scale. Those mathematical functions characterize the nature of the dependencies between the predictors and the predictands. They can be linear [e.g., through a multi-linear regression (MLR), see Jeong et al. 2012] or non-linear [e.g., through polynomial regression or an artificial neural network (ANN), see Xiaoli et al. 2008]. These methods are usually easy to implement and apply but tend to underestimate the variance (see the variance inflation procedure in Wilby et al. 2002). One solution is to use stochastic modelling in order to adapt the statistical distribution instead of “inflating” the variance.

Stochastic WGs simulate daily weather scenarios thanks to probability distribution functions (pdfs) estimated from observations. A wide range of WGs has been developed to generate weather variables (e.g., see for a review Wilks 2012). Historically, WGs were used to reproduce the observed rain statistical properties (Wilks 2010). However, in a downscaling context where the statistical properties may evolve in time, WGs have to be based on pdfs that depend on atmospheric predictors. These conditional pdfs can evolve in time, i.e., their parameters can change with the predictors (e.g. Bardossy and Plate 1992). This approach is particularly interesting to generate variability in data.

The WT approach defines large-scale patterns from circulation variables and relies on clustering techniques. The main assumption is that for a given large-scale pattern, the relationship between the large- and local-scale variables is always the same. One particular method is the “analog” method, where each daily large-scale situation is considered as a pattern. For a day to be downscaled, the day in the past which has the closest large-scale situation (according to a similarity metric) is chosen (Zorita and Storch 1999). The local-scale observations of the selected day are then the downscaled values. This approach also provides methods that are easy to implement. However, in a climate change context these methods can miss a possible climate change signal because of their inability to generate values beyond the range of past values.
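The analog selection described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the Euclidean distance is assumed as the similarity metric, and the function name `analog_downscale` and the toy arrays are hypothetical.

```python
import numpy as np

def analog_downscale(X_past, Y_past, X_target):
    """For each target day, return the local observation of the most
    similar past day (Euclidean distance is an assumed metric)."""
    # Pairwise distances: rows are target days, columns are archive days.
    diff = X_target[:, None, :] - X_past[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest = dist.argmin(axis=1)        # index of the closest past day
    return Y_past[nearest], nearest

# Toy archive: 3 past days described by 2 large-scale features each.
X_past = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
Y_past = np.array([0.0, 2.5, 10.0])      # local precipitation (mm)

# Two days to downscale; the second lies far outside the archive.
X_target = np.array([[0.9, 1.2], [10.0, 10.0]])
downscaled, idx = analog_downscale(X_past, Y_past, X_target)
```

The second target day lies well outside the archive, yet it can only receive a value that was already observed, which is the limitation mentioned above for climate change applications.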

All the previous approaches need daily synchronicity between large-scale and local-scale data to be calibrated. They are referred to as “Perfect-prog” downscaling (Klein et al. 1959). The model output statistics (MOS) approach is different in essence because it generally works directly on model outputs, without calibration based on reanalysis data. MOS aims to link characteristics of model outputs and observations, such as the mean, the variance or the probability (or cumulative) distribution function (pdf or CDF). This approach has many interesting applications in terms of downscaling and bias correction, but its performance is strongly tied to the quality of the modelled large-scale variable (Coiffier 2011).

Many different intercomparison studies have been conducted lately. These studies have a wide range of purposes. They can be discriminated for instance by the type of models which were compared: RCMs only, SDMs only or both SDMs and RCMs.

Concerning RCMs, several coordinated projects have been developed involving collaborations between Regional Climate Modelling groups. There are several projects taking place around the world over different regions. Over Europe, the MERCURE project (1997–2000, Machenhauer et al. 1998) aimed at identifying the strengths and weaknesses of RCM simulations driven by atmospheric analyses. It led to the project PRUDENCE (2001–2004, Christensen et al. 2007; see Footnote 1), where one important goal was to analyse future projections according to four uncertainty aspects: natural variability, greenhouse gas emission and concentration scenarios, the choice of the driving GCM atmospheric and oceanic boundary conditions and finally the RCM formulation. This was followed by the project ENSEMBLES (2004–2009, Hewitt 2004; see Footnote 2). It produced for the first time a probabilistic estimate of the uncertainty of future climate at several timescales, using an ensemble validated against observational datasets for Europe. Note that similar projects exist over other regions, like the Asian RMIP project (Fu et al. 2005) or the North American projects PIRCS (Takle et al. 1999) and NARCCAP (Mearns et al. 2013). Lately, the Coordinated Regional Climate Downscaling Experiment (CORDEX, Giorgi et al. 2009) initiative from the World Climate Research Program has promoted running multiple RCM simulations at 50 km and higher resolution for multiple regions. This initiative mainly aims to assess RCM quality and uncertainty for the recent past and for twenty-first century projections, covering the majority of populated land regions on the globe. The uncertainties are associated with varying GCM simulations, varying greenhouse gas concentration scenarios, natural climate variability and different downscaling methods. In contrast to the former intercomparisons, the CORDEX initiative imposes several additional and mandatory constraints which make the runs comparable. 
The constraints include the domain definitions, the time period, a common spatial resolution and the boundary forcing (ERAinterim Reanalysis, Dee et al. 2011) for the hindcast evaluation, in order to provide a framework for model evaluation and assessment.

SDM-focused intercomparisons are also increasingly available, but are mostly carried out by modest research initiatives compared to CORDEX, for instance. One of the first intercomparison studies was conducted by Wilby and Wigley (1997), who aimed to review the SDMs available at the time and to compare precipitation models in terms of present and future climate over North America. Six SDMs calibrated on NCEP reanalyses were compared with one GCM. The main result pointed out intervariable inconsistencies in the GCM, which made the precipitation changes generated by the GCM unreliable. Even if this study was quite exhaustive, the MOS approach was not represented in it.

Since then, many intercomparison studies have been conducted, often leaving out one or several SDM approaches and with specific purposes. For instance, Schoof and Pryor (2001) aimed to compare two TF methods calibrated on circulation indices over the midwestern USA. The evaluation performed on present climate pointed out that the models failed to capture the variability of precipitation as governed by the large-scale circulation and suggested that other variables were necessary to capture precipitation. Although this paper is an important contribution, only TF methods were discussed in it. Focussing also mostly on TF methods, Harpham and Wilby (2005) evaluated two ANN-based SDMs (i.e., TF) and one WG to downscale heavy precipitation and their multisite behaviour in a present climate context over the United Kingdom. A follow-up study included three supplementary SDMs and two RCMs in a future climate context (Haylock et al. 2006). The results underlined the need for an ensemble approach when considering future climate projections. However, the WT approach was missing in the first study and the MOS approach in both studies. Similar studies conducted over the Serpent River basin (Quebec, Canada) focused on a particular temporal neural network (TNN, i.e., TF) to downscale precipitation (Dibike and Coulibaly 2006; Khan et al. 2006). Results showed the high performance of that particular TNN model, but neither study included WT or MOS models. Moreover, in a recent study, Gaitan et al. (2014) compared high-resolution precipitation models over Ontario and Quebec, Canada, in their ability to reproduce the climate change signal, based on the RCM pseudo-observation approach developed in Vrac et al. (2007c). Six rain occurrence WT models and four rain intensity TF models were designed to this end from the same set of predictors. Their ANN (i.e., TF, named ANN-F in their study) was found to be the best model. 
Although this is an interesting intercomparison, the study did not investigate the MOS and WG approaches.

Other recent studies compared methods from the four statistical families but with different predictors. For instance, precipitation outputs from six SDMs and three RCMs were compared over the Alps by Schmidli et al. (2007). The SDMs were calibrated on several reanalysis databases for present climate and applied to GCMs for future climate. Results showed that the statistics of most statistically and dynamically downscaled precipitation were similar. In another study, by Bürger et al. (2012), five SDMs, each with its own set of predictors, were evaluated in a present climate context over British Columbia, Canada, focusing on how the compared SDMs represent temperature and precipitation extremes. It turned out that the use of hybrid models (i.e., models with components built from several families) made it difficult to identify which component of a model explains its efficiency. Even if all the SDM families were studied in both papers, the models were calibrated on different sets of predictors. A common set of predictors would have allowed an easier comparison.

All these references are examples and this list is by no means exhaustive. They give a general idea of some major studies. More generally, none of those studies (except Gaitan et al. 2014; see also Vrac et al. 2007c) included a cross-validation procedure in their model evaluation, which is an important step to validate SDMs in a present climate context (this notion is illustrated in Sect. 2.2). Even if they compared many models with different interesting features and results, they all presented some inconsistencies. Indeed, one important argument is that in most studies the predictors were different: for instance, they were selected according to the observation station (Harpham and Wilby 2005; Haylock et al. 2006; Dibike and Coulibaly 2006; Khan et al. 2006) or were specific to the model features (Schmidli et al. 2007; Bürger et al. 2012). Sometimes the purpose of the study was not the intercomparison itself but rather to highlight the development of a new model (e.g., Dibike and Coulibaly 2006; Harpham and Wilby 2005). As said above, in most of the studies at least one type of model is missing and the SDMs were calibrated on a more or less sparse observation network. In this paper, to perform a consistent intercomparison, we want to compare outputs from all types of models (i.e., from the four SDM approaches and from RCMs, see Schmidli et al. 2007) and observational data with similar resolution over a common area. Another criterion is to calibrate all the models with a common set of predictors (as much as possible) and with a cross-validation procedure (see Gaitan et al. 2014). Thus, the three main requirements of this intercomparison study are: (1) models must have common predictors, (2) RCM and SDM outputs and observations have to share the same area and resolution, (3) all SDM families have to be represented and a representative number of dynamical models have to be included. 
Two recent initiatives sharing similar objectives have to be mentioned: CORDEX-ESD (http://wcrp-cordex.ipsl.jussieu.fr/index.php/community/cordex-esd) and the COST Action VALUE (http://www.value-cost.eu/, Maraun et al. 2015). These projects aim at coordinating SDM intercomparisons at a continental scale and at making SDMs comparable to RCMs.

The present intercomparison takes place under the CORDEX initiative hindcast evaluation: all the models have to be forced by ERAinterim reanalyses and run over the 1989–2008 period at \(0.44^\circ\) resolution. For the present study, the variable of interest is precipitation. This choice is motivated by its high spatial and temporal variability and by the difficulty of modelling precipitation compared to other variables like temperature. Another argument is that rainfall is one of the most important variables for many impact studies (e.g., for flood prediction, Raje and Mujumdar 2010, or crop yields, Oettli et al. 2011).

Hence, in this paper, several downscaling models are compared through a common and well-defined framework. The aim is to set up a generic intercomparison framework. More precisely, our goal is not to select the best model or to develop a model with new features. The objective is to design an intercomparison experiment in which the models are easily comparable. The performance criteria are expected to be wide enough to correctly inspect the main aspects of the models representing each statistical downscaling family. The chosen indicators are relevant for climatological studies. They can be different for other application domains (e.g., hydrology), which can produce different performance evaluation results. The proposed framework should help to point out the main strengths and weaknesses of the models, identify the needed improvements and provide statistically simulated time series to be compared to RCMs over a common area and forced by a common set of predictors (ERAinterim). Differences between models with specific features, both in conceptual terms (e.g., dynamical versus statistical or deterministic versus stochastic) and in technical details, are described and evaluated. This intercomparison is also designed in such a way that other models or indicators can easily be added.

This paper is organized as follows: the data and experimental set-up are presented in Sect. 2, while Sect. 3 describes the downscaling models used in this study. The results of the comparison are presented in Sect. 4. Finally, in Sect. 5, some conclusions, perspectives and discussions are proposed.

2 Data and experimental setup

SDMs seek to establish a link between large-scale and local-scale climate data. The experimental setup thus has to state which large-scale variables will act as predictors and which local-scale variables will be predicted. In addition, the validation procedure has to be defined. In order to design the experiment rigorously, it is essential to keep in mind assumptions under which the SDMs are performed (Hewitson and Crane 1996): (1) the relationship between local-scale data and large-scale predictors is fixed in time (even if the statistical properties of the downscaled simulations can evolve in time), (2) the predictors fully represent the climate signal, (3) the large-scale variables are well reproduced by climate models, including reanalysis.

2.1 Local-scale predictands and large-scale predictors

In order to limit any transformation of the RCM data from their initial spatial resolution, the common RCM resolution of \(0.44^\circ\) has been chosen for the local-scale predictands. Therefore, the comparison with the E-OBS V8 gridded dataset at \(0.44^\circ \times 0.44^\circ\) from the EU-FP6 project ENSEMBLES (see Footnote 3) and the data providers in the ECA&D project (see Footnote 4; Haylock et al. 2008) is straightforward. In the experimental setup, the E-OBS precipitation data serve as the local-scale reference predictand for the calibration of the statistical models, which therefore downscale large-scale information to a \(0.44^\circ\) spatial resolution, directly comparable to RCM outputs. Note that there are some quality inconsistencies in this version of E-OBS data (Hofstra et al. 2009). The reader has to keep in mind that this intercomparison is done using E-OBS data as reference, which can potentially induce some inexact results over specific areas. This issue is discussed in Sect. 5.2.

As one of the goals of this study is to make intercomparisons between SDMs and RCMs involved in the EURO-CORDEX (Jacob et al. 2014) and MED-CORDEX/HYMEX (Drobinski et al. 2014, www.medcordex.eu/medcordex.php) initiatives, the atmospheric data chosen to drive the statistical models (i.e., the large-scale predictors) are the same as those used as forcing for the RCMs for the hindcast evaluation. Figure 1 represents the geographical areas over which the models are evaluated: in green the SDMs domain corresponding to the domain of E-OBS data, in blue the EURO-CORDEX evaluation domain, which is the intersection between the EURO-CORDEX and E-OBS domains, and in orange the MED-CORDEX evaluation domain, which is the intersection between the MED-CORDEX and E-OBS domains. The atmospheric variables used as predictors are selected from the ERAinterim Reanalysis (ERAi, Dee et al. 2011) at \(1.125^\circ \times 1.125^\circ\) resolution, over the North-Atlantic region, which includes the EURO-CORDEX and MED-CORDEX domains. It corresponds to 5452 grid-points over the geographical area \([-52.875^\circ \text {E}; 76.50^\circ \text {E}] \times [20.25^\circ \text {N}; 72.00^\circ \text {N}]\). All fields are taken at the daily time scale, obtained by averaging the 6-hourly reanalysis outputs. These variables are selected according to several criteria. First, considering that the objective of our study is to intercompare models, a common set of predictors for all the statistical models is needed. Such a consideration makes the study as fair as possible in the way the models are considered. The choice is also motivated by the physical relation of the variables to precipitation and their role in the precipitation processes. Another criterion is the availability of the common variables in GCMs and reanalysis products, and also the correct representation of the predictors over the domain (Hewitson and Crane 1996). 
Note that this is not a requirement for the intercomparison itself (one can imagine an intercomparison with badly simulated predictors) but only for our choice of predictors. Table 1 shows the chosen variables. Some of them have been widely used in a statistical downscaling context with good results. For instance, surface variables such as the temperature at 2 m (T2) and the sea level pressure (SLP), or atmospheric variables such as the geopotential height, the zonal and meridional wind components and the relative humidity at 850 hPa (Z850, U850, V850, R850), can be found in studies like Cavazos et al. (2005) or Crawford et al. (2007). The dew point at 2 m (D2) was also added. Physically, precipitation results from saturation of water vapour due to a vertical lift of the atmospheric cell, that is to say a combination of atmospheric instability and humidity convergence. As saturation is a non-linear function of temperature and moisture, it is important to include both temperature and moisture (relative, specific, or dew point temperature) as predictors. Moreover, SLP (or geopotential height at some tropospheric level) is a good large-scale predictor candidate, as it includes the direction of the advection (which implicitly interacts with orography) and the convergent motions (which also produce vertical lift). The U and V components of the wind also bring relevant information in terms of synoptic motions. Finally, using two levels (SLP and Z850) makes it possible to take into account, to some extent, the vertical stability of the lower troposphere through the baroclinicity. T2 also accounts for the degree of atmospheric stability. A statistical analysis based on sparse canonical correlation analysis (SPARSE CCA, Witten et al. 2009) was conducted and corroborates our choice of predictors. Traditional CCA seeks the best projections of two sets of variables (in our case, the predictors and the predictand over the spatial domain) by iteratively maximizing the correlation of the projections. 
The sparse version of CCA adds sparsity constraints on the projections, resulting in projection vectors with a number of zero coefficients that depends on the sparsity enforced. Each potential predictor variable was first spatially summarized by taking its first principal component (PC) computed from a principal component analysis (PCA, Barnston and Livezey 1987) applied, separately for each climate variable, to the 5452 grid-points over the North-Atlantic region. Then SPARSE CCA was carried out between a set containing the first PC of each of the seven potential predictors and a second set of variables comprising the precipitation over the EURO-CORDEX area. Only the first PC is considered to spatially summarize a climate variable. Two points motivated this choice. First, the aim was only to determine, at first order, the physical/atmospheric variables that make sense as predictors for precipitation downscaling. Hence, a natural choice was to retain only the first PC. Second, applying the SPARSE CCA algorithm over the whole EURO-CORDEX region based on the relatively high-resolution E-OBS dataset is computationally intensive, even for a single principal component. Therefore it has been decided to limit this first exploratory step of SPARSE CCA to the first PC only. The sparsity constraints are tuned so that only one predictor variable appears in each projection vector (only one non-zero coefficient). Thus, each predictor variable is associated with a rank given by its correlation with the projected predictand (see Table 1). The representation of some predictors has known issues in some GCMs, in particular for R850 and D2, which are included in this study. Humidity has been shown to improve the quality of the downscaled precipitation estimates (e.g., Vrac et al. 2007b). Therefore, although some GCMs may have some problems representing this variable, it was decided to include it among the variables to be tested in the CCA analysis. 
The outcome of the SPARSE CCA excludes the relative humidity as a predictor. Instead, the dew point temperature at 2 m (D2), an index of moisture saturation (Charles et al. 1999), is kept. Although D2 depends on humidity, it also integrates pressure and temperature in its computation, two physical variables that are expected to be relatively well represented by most GCMs. The dew point temperature is then expected to be relatively well represented. The MOS model uses only the large-scale version of the predictand variable (here, precipitation) as predictor (cf. Sect. 3.1.5). As precipitation is usually not well represented by the GCMs, this variable is rarely employed as a predictor in Statistical Downscaling Models. Nevertheless, since the present intercomparison exercise aims at having predictors that are as common as possible, the large-scale precipitation has also been added in order to have at least one predictor common to all the SDMs. More precisely, to account for the non-Gaussian behaviour of daily precipitation, whose distribution is generally skewed, a transformation of the precipitation data has to be performed before applying PCA. Hence, as in Vrac and Friederichs (2014), the zero precipitation values have been set to a small non-zero value (0.00033) and the logarithm of all precipitation data (with 0’s transformed to 0.00033) has been computed.
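The precipitation transformation and the PC extraction described above can be illustrated as follows. This is only a sketch under stated assumptions: the PCA is computed here through an SVD of column-centred anomalies (whether the fields were also standardized is not specified in the text), and the function names and the synthetic field are hypothetical. Only the replacement value 0.00033 comes from the text.

```python
import numpy as np

ZERO_REPLACEMENT = 0.00033   # value quoted in the text for zero precipitation

def log_transform_precip(precip):
    """Replace exact zeros by a small positive value, then take the log."""
    precip = np.asarray(precip, dtype=float)
    return np.log(np.where(precip == 0.0, ZERO_REPLACEMENT, precip))

def leading_pcs(field, n_pcs=2):
    """Leading principal-component time series of a (n_days, n_gridpoints)
    field, obtained from an SVD of the centred anomalies (assumed choice)."""
    anomalies = field - field.mean(axis=0)
    U, s, _ = np.linalg.svd(anomalies, full_matrices=False)
    return U[:, :n_pcs] * s[:n_pcs]

# Synthetic large-scale precipitation: 200 days, 30 grid-points, many dry days.
rng = np.random.default_rng(0)
precip = rng.gamma(shape=0.5, scale=4.0, size=(200, 30))
precip[rng.random(precip.shape) < 0.4] = 0.0

pcs = leading_pcs(log_transform_precip(precip), n_pcs=2)   # shape (200, 2)
```

Each column of `pcs` is one daily predictor time series; keeping two columns per variable corresponds to the choice of the first two PCs made for the intercomparison.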

Fig. 1

The models are run and evaluated over the following geographical areas: the SDMs domain in green corresponding to the E-OBS domain, the intersection between EURO-CORDEX and E-OBS domain in blue which is the evaluation domain of EURO-CORDEX models and the intersection between MED-CORDEX and E-OBS domain in orange which is the evaluation domain of MED-CORDEX models

Table 1 Selected predictors for each season, with their correlation and rank (in parentheses) given by the SPARSE CCA algorithm

The SPARSE CCA was carried out over two 6-month periods: a 6-month “summer” (from April 15th to October 14th) and a 6-month “winter” (from October 15th to April 14th). Table 1 shows the selected predictors for each season and their order according to the rank given by SPARSE CCA: the first five variables have been selected for each season. For the intercomparison, the first two PCs of each selected large-scale variable are kept as predictors. This choice is made to avoid the optimization of too many parameters, since the SDM calibrations/simulations are performed pointwise over 6043 E-OBS grid-points. This is a trade-off that keeps a relatively low complexity (i.e., a relatively low number of parameters), especially for the stochastic and TF models, while including a significant number of physical variables as predictors. The variable selection pre-processing resulted in 12 predictors (the first 2 PCs for each of the 5 variables selected through the SPARSE CCA and for precipitation). For example, for the stochastic models this corresponds to 39 parameters to be estimated (13 for the occurrences, 26 for the intensity, see Sect. 3.1.3) for each of the 6043 E-OBS grid-points.
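As a quick check of the counts quoted above, the arithmetic can be written out. The split into 13 and 26 parameters is taken from the text; reading the 13 occurrence parameters as one intercept plus one coefficient per predictor is an assumption, consistent with the logistic regression of Sect. 3.1.1. The symbols \(N_{\text {pred}}\), \(N_{\text {occ}}\), \(N_{\text {par}}\) and \(N_{\text {total}}\) are introduced here for illustration only.

$$\begin{aligned} N_{\text {pred}}&= 2 \times (5 + 1) = 12, \\ N_{\text {occ}}&= N_{\text {pred}} + 1 = 13, \\ N_{\text {par}}&= 13 + 26 = 39, \\ N_{\text {total}}&= 39 \times 6043 = 235{,}677. \end{aligned}$$

The last line gives the total number of parameters estimated over the whole E-OBS grid for one stochastic model and one calibration.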

2.2 Cross-validation set up

In order to intercompare some SDMs and RCMs involved in the CORDEX exercise, all evaluations have to be made within the constraints of this program, i.e., over the 1989–2008 time period, which is the hindcast evaluation time period. Figure 2 sketches the two calibration (\(\hbox {C}_1\) and \(\hbox {C}_2\)) and validation (\(\hbox {V}_1\) and \(\hbox {V}_2\)) time periods used in this study for SDMs. The models are trained and validated sequentially, first over \(\hbox {C}_1\) (i.e., [1979–1998]) and \(\hbox {V}_1\) (i.e., [1999–2008]) respectively, and then over \(\hbox {C}_2\) (i.e., [1979–1988] \(\cup\) [1999–2008]) and \(\hbox {V}_2\) (i.e., [1989–1998]) respectively. The model evaluation is performed over \(\hbox {V}_2\cup \hbox {V}_1\;=\) 1989 to 2008, therefore with the outputs of two different calibrations per model. The rain occurrence threshold is set at 1 mm per day for the evaluation. In the literature, a wide range of thresholds has been used: 0 mm in Semenov et al. (1998), 0.5 mm in Ambrosino et al. (2014) or 5 mm in Bouvier et al. (2003). In this study, a middle ground is struck and a threshold of 1 mm is selected since it is commonly used (e.g., Schmidli et al. 2007).

Fig. 2

Cross-validation scheme over the period 1979–2008 with two calibrations (\(\hbox {C}_1=\) [1979–1998] and \(\hbox {C}_2=\) [1979–1988] \(\cup\)  [1999–2008]) and two validations (\(\hbox {V}_1=\) [1999–2008] and \(\hbox {V}_2=\) [1989–1998]) periods. The intercomparison is done over the period \(\hbox {V}_2\cup \hbox {V}_1=\) [1989–2008] corresponding to the CORDEX RCM runs period
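The two-fold scheme of Fig. 2 can be written down as boolean masks over the years 1979–2008. This is a minimal sketch; the helper `in_range` and the dictionary layout are hypothetical conveniences, while the year ranges are those given in the text.

```python
import numpy as np

years = np.arange(1979, 2009)            # 1979, ..., 2008

def in_range(y, start, end):
    """Boolean mask for years in the closed interval [start, end]."""
    return (y >= start) & (y <= end)

folds = {
    "fold1": {                            # C1 / V1
        "calib": in_range(years, 1979, 1998),
        "valid": in_range(years, 1999, 2008),
    },
    "fold2": {                            # C2 / V2
        "calib": in_range(years, 1979, 1988) | in_range(years, 1999, 2008),
        "valid": in_range(years, 1989, 1998),
    },
}

# Evaluation period: union of the two validation periods (V2 and V1).
evaluation = folds["fold1"]["valid"] | folds["fold2"]["valid"]
```

Each fold calibrates on 20 years and validates on 10 years that were not used for calibration, and the union of the validation years is the 1989–2008 CORDEX hindcast period.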

3 Statistical and dynamical downscaling models

3.1 Statistical downscaling models

One SDM from each of the four families of approaches (TF, WG, WT-based methods and MOS) has been selected, potentially with some variants, in order to evaluate the main conceptual and technical differences between the approaches, e.g., deterministic versus stochastic. Statistical modelling of precipitation is usually divided into two successive steps: first the occurrence and then the intensity. Section 3.1.1 describes rain occurrence modelling and Sects. 3.1.2 to 3.1.5 the different rain intensity models.

3.1.1 Rain occurrence

In this study, two ways to model rain occurrence are considered. In the first one, the outputs of a model that is calibrated on all values, making no difference between 0’s and positive values, are simply thresholded at a given level (1 mm in this case). If negative values are generated, they are set to 0. In the second one, rain occurrence at a given location is modelled as a binomial distribution \(B\left( 1,p\right)\) using a logistic regression (LR, see Buishand et al. 2004; Fealy and Sweeney 2007). Let \(p_i\) be the probability of rain for day \(i\) conditional on an N-length predictor (or covariate) vector \(\mathbf {X_i}\). The conditional probability of occurrence \(p_i\) is formulated through an LR as:

$$\begin{aligned} \log \left( \frac{p_i}{1-p_i}\right) = \underbrace{P^{0} +\sum \limits _{j=1}^{N}P^{j}X_{i,j}}_{=S} \end{aligned}$$
(1)
$$\begin{aligned} \Leftrightarrow \quad p_i = \frac{\exp (S)}{1+\exp (S)}, \end{aligned}$$
(2)

where \((P^{0}, P^{1}, \ldots , P^{N})\) is a vector of coefficients to be estimated. Based on the predictors for day \(i\), Eqs. (1) and (2) provide the probability of rain, from which a rainfall occurrence can easily be simulated. Computational details to estimate \(p_i\) are available in “Appendix”.
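A minimal sketch of Eqs. (1) and (2) is given below, assuming that the coefficients \((P^{0}, \ldots , P^{N})\) have already been estimated. The function names and the example coefficients are hypothetical and only serve to show how a daily occurrence can be drawn from \(p_i\).

```python
import numpy as np

def rain_probability(X, coefs):
    """Eqs. (1)-(2): p_i = exp(S) / (1 + exp(S)), with S = P^0 + sum_j P^j X_ij.

    X is an (n_days, N) predictor array, coefs = (P^0, P^1, ..., P^N).
    """
    X = np.asarray(X, dtype=float)
    coefs = np.asarray(coefs, dtype=float)
    s = coefs[0] + X @ coefs[1:]
    # 1 / (1 + exp(-S)) is algebraically identical to exp(S) / (1 + exp(S)).
    return 1.0 / (1.0 + np.exp(-s))

def simulate_occurrence(p, rng):
    """Draw one B(1, p_i) outcome per day (1 = rain, 0 = no rain)."""
    p = np.asarray(p, dtype=float)
    return (rng.random(p.shape) < p).astype(int)

# Hypothetical usage with two predictors and made-up coefficients.
rng = np.random.default_rng(0)
X_example = rng.normal(size=(5, 2))
p_example = rain_probability(X_example, [-0.5, 1.2, -0.8])
occurrence_example = simulate_occurrence(p_example, rng)
```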

3.1.2 Transfer functions (TFs)

The models belonging to this family directly link the large-scale information to local-scale variables using deterministic functions. As stated in the introduction, those functions characterize the nature (linear or non-linear) of the predictor-predictand relationships. For this approach, the Generalized Additive Models (GAM) framework (Hastie and Tibshirani 1990) has been chosen. It is a deterministic model which consists in modelling the expectation of \(Y\) (here, the precipitation) conditionally on the \(N\) large-scale predictors \(\left( X_{1}\ldots X_{N}\right)\) as a sum of spline functions \(f_{j}(X_{j})\):

$$\begin{aligned} E(Y|X_{1}\ldots X_{N}) = \sum \limits _{j=1}^{N}f_{j}(X_{j}) \end{aligned}$$
(3)

where \(f_j\) are cubic regression spline functions. The cubic splines have a relatively low complexity while allowing a high non-linearity to model the link between \(X_j\) and \(Y\), i.e., the large- and the local-scale data. This method has been applied for the present time period, for instance to downscale the near surface wind fields in Salameh et al. (2009), or for the Last Glacial Maximum time period (−21 ky), to retrieve monthly climatology for temperature and precipitation over Europe (Vrac et al. 2007a) or global permafrost (Levavasseur et al. 2011). GAM is a data-driven approach in the sense that it can model both piecewise linearities and non-linearities depending on the nature of the predictor-predictand dependence. Two variants have been defined in the present study: (1) GAM and (2) GAM-so. In the first one, GAM is calibrated with all values (i.e., including 0’s), rain intensity is simulated directly and rain occurrence is dealt with by thresholding the outputs at 1 mm. In the second one (i.e., the GAM-so approach), the LR is first used to model the occurrence and then \(E(\text {log}(Y)|X_{1}\ldots X_{N})\) (instead of \(E(Y|X_{1}\ldots X_{N})\)) is modelled for positive rain intensities. Computational details for GAM simulations are available in “Appendix”.
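The additive structure of Eq. (3) can be illustrated with a deliberately simplified sketch. It swaps in unpenalized cubic regression splines (truncated power basis, knots at predictor quantiles) fitted by ordinary least squares, which is not the penalized estimation usually performed by GAM software and is not claimed to be the procedure of the study. All function names are hypothetical, and treating amounts of at least 1 mm as wet is an assumption. Only the first variant (GAM with thresholding) is sketched.

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic regression-spline basis (truncated power form), without intercept."""
    cols = [x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def design_matrix(X, knots_per_predictor):
    """Intercept plus one spline block f_j(X_j) per predictor, as in Eq. (3)."""
    blocks = [np.ones((X.shape[0], 1))]
    for j, knots in enumerate(knots_per_predictor):
        blocks.append(spline_basis(X[:, j], knots))
    return np.hstack(blocks)

def fit_additive(X, y, n_knots=3):
    """Least-squares fit; interior knots are placed at predictor quantiles."""
    probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    knots = [np.quantile(X[:, j], probs) for j in range(X.shape[1])]
    coef, *_ = np.linalg.lstsq(design_matrix(X, knots), y, rcond=None)
    return knots, coef

def predict_additive(X, knots, coef, wet_threshold=1.0):
    """Predicted amounts (negative values set to 0) and thresholded occurrence."""
    y_hat = np.maximum(design_matrix(X, knots) @ coef, 0.0)
    wet = y_hat >= wet_threshold  # assumption: 1 mm and above counts as wet
    return y_hat, wet
```

In practice the fit would be done on the calibration period and `predict_additive` applied to the predictors of the validation period.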

3.1.3 Stochastic weather generator (WG)

WGs are models that generate daily weather scenarios from pdfs estimated from observations. As previously stated, they are mainly used to simulate data whose statistical properties are similar to those of observations. They present a large diversity in terms of techniques and complexity: from quite simple series generators (e.g., Semenov and Stratonovitch 2010), through Markov chain based models (e.g., Kilsby et al. 2007), to sophisticated approaches representing the observed hierarchical organization of rainfall and rain-cell space and time-clustering processes (e.g., Onof et al. 2000). One way to build a stochastic SDM is based on generalized linear models (GLMs). GLMs were first applied by Stern and Coe (1984) for the generation of precipitation. GLMs link the expected mean of a random variable to the \(N\) predictors as:

$$\begin{aligned} g(\mu ) = \sum \limits _{j=1}^{N}\theta ^{j}\cdot X_{j} \end{aligned}$$
(4)

where \(\mu\) is the expected mean, \(\theta ^j\) are regression coefficients to be estimated and \(g(\cdot )\) is a monotonic link function. In this work, an extension of this formulation is used. Conditional pdfs are used to model the precipitation in a Vector Generalized Linear Model (VGLM) framework as in Chandler and Wheater (2002). This means that the distribution family is fixed and the distribution parameters are estimated by a GLM. Thus, the rain distribution parameters for each day are estimated from the selected predictors. This method also allows the simulation of spatio-temporal rainfall with an appropriate covariance function (Yang et al. 2005) or at subdaily temporal resolution (Mezghani and Hingray 2009). In all those works, a two-step approach is implemented to model precipitation. It proceeds as follows:

  1. Rain occurrence is modelled by an LR as given in Eqs. (1) and (2),

  2. Rain intensity is supposed to follow a Gamma distribution \(\Gamma _{\alpha ,\beta }(\cdot )\) whose parameters shape \(\alpha\) and rate \(\beta\) are functions of the large-scale predictors at day \(i\):

    $$\begin{aligned} \left\{ \begin{array}{lcl} \log (\alpha _i) &=& \alpha ^{0} + \sum \nolimits _{j=1}^{N}\alpha ^{j}X_{i,j}\\ \log (\beta _i) &=& \beta ^{0} + \sum \nolimits _{j=1}^{N}\beta ^{j}X_{i,j} \end{array} \right. \end{aligned}$$
    (5)

Hence, for each day the parameters are calculated and a distribution is obtained, which makes the model non-stationary and able to evolve with the predictors. Then, simulations are performed based on the daily pdf. Note that the Gamma distribution parameters have been estimated from all values above 0 mm but only rain amounts above 1 mm are simulated. Indeed, if the distribution is estimated only from values above 1 mm, the Gamma hypothesis for rain intensity is no longer valid. Besides, calibrating the model over precipitation amounts above 1 mm causes an artificial increase of the variability of the generated time series: the variance is about twice the variance of the data generated from a calibration with all positive precipitation (not shown). Computational details to infer the Gamma distribution parameters are available in “Appendix”.
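The intensity step of Eq. (5) can be sketched as follows, assuming the regression coefficients are already known. How amounts above 1 mm are enforced is not detailed in the text; redrawing values until they exceed the threshold is an assumption made here, and the function names are hypothetical.

```python
import numpy as np

def gamma_parameters(X, alpha_coefs, beta_coefs):
    """Eq. (5): log-linear daily shape (alpha_i) and rate (beta_i)."""
    X = np.asarray(X, dtype=float)
    alpha_coefs = np.asarray(alpha_coefs, dtype=float)
    beta_coefs = np.asarray(beta_coefs, dtype=float)
    alpha = np.exp(alpha_coefs[0] + X @ alpha_coefs[1:])
    beta = np.exp(beta_coefs[0] + X @ beta_coefs[1:])
    return alpha, beta

def simulate_intensity(alpha, beta, rng, min_amount=1.0, max_tries=1000):
    """One draw per day from Gamma(shape=alpha_i, rate=beta_i).

    Assumption: values not exceeding min_amount are redrawn (at most
    max_tries times), so that only amounts above 1 mm are kept.
    """
    alpha = np.atleast_1d(np.asarray(alpha, dtype=float))
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    # NumPy parameterizes the Gamma law by shape and scale = 1 / rate.
    amounts = rng.gamma(shape=alpha, scale=1.0 / beta)
    for _ in range(max_tries):
        low = amounts <= min_amount
        if not low.any():
            break
        amounts[low] = rng.gamma(shape=alpha[low], scale=1.0 / beta[low])
    return amounts
```

For the stationary variant, the same functions can be used with all predictor coefficients set to zero, so that \(\alpha\) and \(\beta\) are constant.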

In the following, two variants are applied and tested: (1) SWG, the non-stationary model described above in Eqs. (1) and (5), and (2) SWG-s, the stationary version of SWG, where the occurrence probability and the parameters \(\alpha\) and \(\beta\) are constant and do not depend on any predictor. All the parameters of both variants are estimated by maximizing the likelihood function, except the constant occurrence probability (hereafter referred to as COP), which corresponds to the observed occurrence frequency.

3.1.4 Weather typing (WT)

The WT family is based on large-scale circulation (LSC) patterns. It relies on the idea that the same LSC situation (i.e., predictors) produces the same local-scale effects (here E-OBS rain fields). WT consists in grouping days with similar LSCs. This is classically done with statistical clustering methods: given a number K of clusters and a measure of similarity, data (here daily situations) are grouped in K clusters such that the situations within a given cluster are as similar as possible, while situations in different clusters are as different as possible. Clustering methods are widely used to study weather regimes (e.g., Yiou 2004; Vrac et al. 2014, and the references therein). In terms of SDMs, those methods are rather used to condition statistical models, for example a stochastic model as in Schnur et al. (1998), Bellone et al. (2000) or Vrac et al. (2007b).

In this study, the analog method is employed as representative of the WT family. This method considers each day as a cluster. A deterministic analog modelling as defined in Yiou et al. (2013) has been chosen here. It has been used in several previous studies (Zorita and Storch 1999; Yiou et al. 2007; Vautard and Yiou 2009; Chiriaco et al. 2014). It consists in determining, for a given day to be downscaled in the validation period, the day in the calibration period which has the closest atmospheric situation. This day is determined by a similarity metric between the predictor set of the day to be downscaled (\(X_{\text {V}}\)) and the predictor set of each day in the calibration period (\(X_{\text {C}}\)). This approach is quite flexible with respect to the choice of the distance or the temporal window of the situations (Yiou 2014). Many families of metrics can be used (e.g., Grenier et al. 2013) and one of them is a distance:

$$\begin{aligned} \text {Day}_{\in \text {C}} = \text {argmin}_{\text {day}\in \text {C}}\left( \text {dist}(X_{\text {V}},X_{\text {C}})\right) . \end{aligned}$$
(6)

The Euclidean distance is chosen in this study. Only one experiment has been set and is called ANALOG. Note that one important difference with the other models is that this method is applied over the entire predictor dataset anomalies, not only over the first two PCs. Hence, much more information has been provided to this model than to the others; this will be discussed in Sect. 5.2. A threshold at 1 mm has also been applied to the output values for rainfall occurrences.
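The analog search of Eq. (6) with the Euclidean distance can be sketched as below. The array layout (one row per day, one column per predictor value or grid point) and the function name are assumptions made for illustration.

```python
import numpy as np

def analog_downscale(X_cal, Y_cal, X_val):
    """For each validation day, copy the local field of the closest calibration day.

    X_cal: (n_cal, n_features) predictor anomalies over the calibration period
    Y_cal: (n_cal, n_points)   local-scale precipitation over the same days
    X_val: (n_val, n_features) predictor anomalies of the days to downscale
    """
    # Squared Euclidean distance between every validation and calibration day.
    # This builds an (n_val, n_cal, n_features) array, which is fine for a
    # sketch but should be chunked for large predictor fields.
    d2 = ((X_val[:, None, :] - X_cal[None, :, :]) ** 2).sum(axis=2)
    analog_index = d2.argmin(axis=1)  # Eq. (6): argmin over calibration days
    return Y_cal[analog_index], analog_index
```

Because whole observed fields are resampled, the spatial structure of each downscaled day is by construction that of an observed day.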

3.1.5 Model output statistics (MOS)

This approach groups all the “quantile-mapping” related methods, more precisely all the methods relating the large-scale CDFs to the local-scale CDFs. For instance, quantile-quantile based methods have been widely used for downscaling (e.g., Vrac et al. 2012 and references therein) or to correct bias in model outputs using observed CDFs (e.g., Gudmundsson et al. 2012, and references therein) and the correspondences between predictor and predictand quantiles. Those methods can be directly calibrated on model outputs (e.g., GCM or RCM). Those correspondences can be based on non-parametric (Déqué 2007) or parametric (Piani et al. 2010) models. Many methods have been implemented and compared in Gudmundsson et al. (2012). The MOS technique used here is the “Cumulative Distribution Function-transform” (CDF-t) initially developed in Michelangeli et al. (2009) to downscale wind and applied later to temperature and precipitation, for example in Lavaysse et al. (2012), Vrac et al. (2012) and Vigaud et al. (2013).

The CDF-t model consists in relating local-scale (i.e., here E-OBS precipitation) CDF to the large-scale (i.e., here ERAi reanalysis precipitation) CDF. The CDF-t and quantile-quantile methods are similar in philosophy, except that CDF-t takes into account the change in the large-scale CDF from the calibration to the projection (or validation) time period, while quantile-quantile projects the simulated large-scale values onto the historical CDF to compute and match quantiles. Let \(F_{\text {Rc}}(x)\) and \(F_{\text {Ec}}(x)\) define respectively the rain CDFs from the Reanalyses (subscript R) and from E-OBS (subscript E) over the calibration period (subscript c) and \(F_{\text {Rv}}(x)\) and \(F_{\text {Ev}}(x)\) the CDFs over the validation period (subscript v). An estimation of \(F_{\text {Ev}}(x)\) is assumed to be:

$$\begin{aligned} F_{\text {Ev}}(x) = F_{\text {Ec}}\left( F^{-1}_{\text {Rc}}\left( F_{\text {Rv}}(x)\right) \right) , \end{aligned}$$
(7)

with \(x\) in the range of the physical variable of interest. Thus, the local-scale CDF over the validation period, \(F_{\text {Ev}}\) is obtained from the large-scale CDF \(F_{\text {Rv}}\) over the validation period, on which a transformation \(T\) defined from the CDFs over the calibration period, \(T(u) =F_{\text {Ec}} \left( F^{-1}_{\text {Rc}}\left( u\right) \right)\) is applied. Then, a quantile mapping between \(F_{\text {Ev}}(x)\) and \(F_{\text {Rv}}(x)\) is performed to retrieve the precipitation values at local scale. More detailed information, descriptions and evaluations of CDF-t are available in Vrac et al. (2012). CDFt-so is the only experiment set for this approach. In the same way as GAM-so, rain amount is modelled by CDF-t and rain occurrence by the LR. Because the ERAi precipitation presented too few days with precipitation amounts above 1 mm, CDF-t has been calibrated over precipitation above 0 mm. Indeed, too few rainy days (rain above 1 mm in that case) at the large-scale will produce too few rainy days in the downscaled data (not shown). That is why the calibration has been made for days above 0 mm and then the outputs have been thresholded at 1 mm. This model is the one which has the lowest quantity of information in terms of predictors: the large-scale precipitation only. Indeed the other models have six variables with precipitation among them. Computational details on CDF-t are available in “Appendix”.
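A bare-bones sketch of Eq. (7) followed by the quantile mapping step is given below. It relies on empirical CDFs evaluated on a regular grid and assumes that the large-scale and local-scale samples span comparable ranges; it includes neither the treatment of values outside the calibration range nor the occurrence model, so it should not be read as the implementation used in the study. Function and argument names are hypothetical.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at the points `x`."""
    s = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(s, x, side="right") / s.size

def cdft(obs_cal, mod_cal, mod_val, n_grid=1000):
    """Return the grid x, the estimate of F_Ev(x) and the downscaled values.

    obs_cal: local-scale values over the calibration period (F_Ec)
    mod_cal: large-scale values over the calibration period (F_Rc)
    mod_val: large-scale values over the validation period  (F_Rv)
    """
    obs_cal, mod_cal, mod_val = (
        np.asarray(a, dtype=float) for a in (obs_cal, mod_cal, mod_val)
    )
    lo = min(obs_cal.min(), mod_cal.min(), mod_val.min())
    hi = max(obs_cal.max(), mod_cal.max(), mod_val.max())
    x = np.linspace(lo, hi, n_grid)

    F_rv = ecdf(mod_val, x)           # F_Rv(x)
    q = np.quantile(mod_cal, F_rv)    # F_Rc^{-1}(F_Rv(x))
    F_ev = ecdf(obs_cal, q)           # Eq. (7): F_Ec(F_Rc^{-1}(F_Rv(x)))

    # Quantile mapping between F_Rv and F_Ev: y = F_Ev^{-1}(F_Rv(value)),
    # using the generalized inverse of F_Ev on the grid.
    u = ecdf(mod_val, mod_val)
    idx = np.clip(np.searchsorted(F_ev, u, side="left"), 0, n_grid - 1)
    return x, F_ev, x[idx]
```

When the large-scale CDF does not change between calibration and validation, \(F_{\text {Ev}}\) reduces to \(F_{\text {Ec}}\) and the sketch behaves like a plain quantile-quantile correction.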

3.2 Regional Climate Models

Concerning dynamical models, five runs have been selected: two from the EURO-CORDEX and three from the MED-CORDEX experiment. These simulations cover two different domains (Fig. 1) but use the same horizontal resolution (\(\text {0.44}^\circ\)) and are all initialised and forced at their boundaries by ERAinterim data. None of the models uses nudging inside the domain except IPSL-WRF311. A relaxation region of different widths (a few hundred km, depending on the model) is used to account for boundary imbalance effects. The common period of simulation is 1989–2008 and each model uses its own set of parameterizations. Details on each run can be found in the following references: Flaounas et al. (2013) for IPSL-WRF311, Nabat et al. (2014) for CNRM-ALADIN52 (see also Colin et al. 2010; Herrmann et al. 2011), Domínguez et al. (2013) and Jiménez-Guerrero et al. (2013) for UCLM-PROMES and Table 1 of Vautard et al. (2013) for IPSL-INERIS44 and for CNRM-ARPEGE51. These models are hereafter referred to as MED-IPSL, MED-CNRM, MED-UCLM, EURO-IPSL and EURO-CNRM respectively. As indicated in Table 3, CNRM and UCLM models repeat the year 1989 two or three times to take into account the spin-up associated with the surface scheme initialization. This is largely sufficient to equilibrate moisture in the soil levels that interact with the atmosphere through evapotranspiration. The effect of repeating the year 1989 two or three times on the final results is considered negligible. IPSL models do not repeat this year but this does not influence the results. Indeed, year 1989 has been tested and shows a behaviour similar to that of other years, and several tests have shown that simulations were converging after a few days. Moreover, the use of nudging for the MED-IPSL simulations reduces the spin-up period. Besides, this investigation is beyond the scope of this paper.

Tables 2 and 3 summarize all the models (SDMs and RCMs) and their features.

Table 2 Statistical Downscaling Models features concerning the occurrence model (LR: logistic regression, COP: constant occurrence probability, T: thresholded) and the predictors (Anom.: anomalies over all the variables, ERAi PR: ERAi reanalyses precipitation, 6 \(\times\) 2 PCs: the first two PCs of the five selected predictors and precipitation) used in each case
Table 3 Dynamical downscaling models features

4 Intercomparison results

The quality of the simulations is assessed by comparison to the data product considered as pseudo local-scale observations (E-OBS) in terms of rain occurrence and intensity, as well as spatial and temporal properties through selected indicators. In view of the relatively equivalent results over the two seasons, only the results over the “summer” season will be presented hereafter. Besides, even if impact studies generally need annual precipitation data, those focusing for example on agriculture, heatwaves or droughts need accurate precipitation data during spring and summer. Intense precipitation events around the Mediterranean usually take place between mid August and mid November and cause floods. Precipitation during winter is easier to model by the RCMs because of the stratiform nature of precipitation, whereas summer rainfalls are driven by convective rain processes, more difficult to represent and resulting from a parametrization in the RCMs. All the indicators are computed over the 1989–2008 period. Results specific to the “winter” season will be described and the corresponding figures are available as auxiliary material. In the following section, most evaluations are presented in terms of bias of the indicators with respect to those of the pseudo-observations, defined as “Indicator(simulation) minus Indicator(observation)”. In terms of colours, blue means that the model underestimates and orange/red means that it overestimates the considered criterion with respect to the observations.

4.1 Occurrence indicators

The evaluation begins by exploring the ability of the models to reproduce the occurrence properties: Do the models respect the observed proportions of wet or dry days and the time they occur? In this part, only nine models are considered for occurrence evaluation (ANALOG, GAM, LR, COP, EURO-CNRM, EURO-IPSL, MED-CNRM, MED-IPSL and MED-UCLM) since SWG, CDFt-so and GAM-so share the same LR occurrence model presented in Eq. (1).

First, biases (in %) of wet day frequency are shown in Fig. 3. The LR, COP and ANALOG models perform well. They show biases close to zero, with very small positive or negative values distributed over the whole area. All the other SDMs and RCMs are strongly biased. Most of them are mainly positively biased, which is a well known problem for RCMs: the models produce small rainfall amounts too often (see Sun et al. 2006; Stephens et al. 2010). The negative bias of MED-IPSL is due to land surface/atmosphere feedbacks that are not well reproduced, generating dry soil too early in spring over most of Western Europe and then fewer clouds, less precipitation and higher temperature in summer. Except ANALOG, LR and COP, all the models are globally producing rainfall too frequently. Both IPSL RCMs show patterns at the borders of the domain. This is a consequence of the relaxation zone at the domain boundaries. Similar patterns are observed on Figs. 6 and 7 for the same reason. Note the very poor performance of GAM, which largely overestimates the percentage of rainy days.

Fig. 3

Wet day frequency bias in percentage (%). Some bias values have been saturated to ease the reading and to improve the colour contrast: black means that the bias values exceed 20 % in absolute values. The real range of values for each map is given in auxiliary material table 3

For the winter season the results are more or less the same for all the models except for the EURO-CNRM model, where the biases are smaller and of both signs all over the domain. MED-IPSL also presents some interesting differences. The biases evolve from negative in the south–west to positive in the north–east of the domain (see auxiliary materials). This gradient is a consequence of a humidity bias observed in winter in the model (compared to GPS measurements). Indeed, there is a slight positive humidity bias in Western Europe and it increases when going eastward. One explanation is that the microphysics scheme is not efficient enough for precipitation and can induce a lower precipitation amount for a given humidity rate. Besides, in winter the air mass flows from west to east, which also increases the humidity and therefore the precipitation in the east.

Periods of consecutive wet or dry days (or spells) have also been considered, in particular the biases of the mean length of the wet and dry spells (expressed in days). In other words, the mean wet and dry persistence biases are investigated. They are pictured by boxplots in Fig. 4a, b respectively. In order to remain consistent with the domains presented in the maps of Fig. 1, they are computed over different domains. These boxplots are nevertheless relevant since, when the indicators are calculated only over the MED-CORDEX domain for all the models, the ranking of the models and the global aspects of the boxplots are similar (not shown). All the models except GAM show skill in reproducing the wet spells of E-OBS, especially the EURO-IPSL and ANALOG models. In contrast, GAM is strongly biased. Interestingly, although not perfect, the LR occurrence provides better results than the constant occurrence probability (COP) approach. In other words, the non-stationarity brought by the logistic regression improves the wet occurrence modelling compared to the stationary COP model. Concerning the mean dry spells, the models uniformly perform worse (i.e., they present larger biases) than for mean wet spells. They all also have difficulty reproducing dry spells around the Mediterranean (not shown). They mainly underestimate the mean dry spell length, except for MED-IPSL. The mean wet and dry spell biases do not cancel each other even if the MED-CORDEX models, GAM and EURO-CNRM show opposite bias signs between wet and dry spells. In other words, a deficit (or an excess) of wet day persistence does not necessarily imply an excess (or a deficit) of dry day persistence. In winter, the results are similar except that the absolute values of the mean wet spell biases are smaller for all the models (see auxiliary materials).
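The mean spell length at one grid point can be computed from a daily wet/dry series as sketched below. The function name is hypothetical, and the sketch treats the input as one continuous series; how spells crossing season boundaries are handled is not specified in the text and is left aside here.

```python
import numpy as np

def mean_spell_length(wet, state=True):
    """Mean length (in days) of runs of `state` in a boolean daily series.

    state=True gives wet spells, state=False gives dry spells.
    """
    wet = np.asarray(wet, dtype=bool)
    match = (wet == state).astype(int)
    # Pad with zeros so that every run has a detectable start and end.
    steps = np.diff(np.concatenate(([0], match, [0])))
    starts = np.flatnonzero(steps == 1)
    ends = np.flatnonzero(steps == -1)
    lengths = ends - starts
    return float(lengths.mean()) if lengths.size else 0.0

def spell_length_bias(sim_wet, obs_wet, state=True):
    """Bias defined as Indicator(simulation) minus Indicator(observation)."""
    return mean_spell_length(sim_wet, state) - mean_spell_length(obs_wet, state)
```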

Fig. 4

Mean wet and dry spells length bias in days for the summer season. The left and right fences are situated at 1.5 times the interquartile range from the box and the dots are the values beyond these fences. Bias values above 6 days or below −6 days have been ignored in these plots for readability reasons (c.f. auxiliary material for the real range). a Wet spells bias. b Dry spells bias

Until now, the rain occurrence has been tested only in terms of frequencies. In order to characterize the time synchronicity of the rainy events, the Brier score (hereafter referred to as BS, Brier 1950) is computed. The BS describes how close to the daily observed occurrences the daily estimated probabilities are:

$$\begin{aligned} BS = \frac{1}{N}\sum \limits _{t=1}^{N}(p_t-o_t)^{2}, \end{aligned}$$
(8)

where \(p_t\) is the estimated probability at time \(t\) (taken from the LR, or equal to 1 or 0 for deterministic models for rain or no rain respectively), \(o_t\) is the observed occurrence at time \(t\), which takes the values 1 or 0 (meaning rain or no rain), and \(N\) is the number of days. Hence, the closer the score is to 0, the more synchronized the model is with the observations. Figure 5 shows the scores computed for each model. LR and MED-IPSL have the smallest values, on average below 0.2. The other models, except GAM and ANALOG, have a BS on average below 0.4. Note that the ANALOG approach has better results in terms of rainy days proportion than for the timing of rainfall events. This means that the ANALOG model produces sequences of wet or dry days with correct proportions but not at the right moment. In winter (see auxiliary materials), the results are similar.
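Equation (8) translates directly into a few lines of code. In the sketch below, the function name is chosen for illustration; the same function covers probabilistic outputs (LR) and deterministic ones, for which \(p_t\) is simply 0 or 1.

```python
import numpy as np

def brier_score(p, o):
    """Eq. (8): mean squared difference between probabilities and occurrences.

    p: estimated daily rain probabilities (or 0/1 for deterministic models)
    o: observed daily occurrences (1 = rain, 0 = no rain)
    """
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return float(np.mean((p - o) ** 2))
```

A deterministic model that is wrong on a fraction \(f\) of the days obtains a score of exactly \(f\), since each error contributes 1 to the sum.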

Fig. 5

Brier score: the scores boxplots of each model are represented between 0 and 1. Models with scores close to zero have good estimation of the daily rain occurrence probabilities

4.2 Intensity indicators

The statistical properties of the downscaled rain intensity at individual grid-points of the 11 models are now compared to those of the observations. Figure 6 shows mean daily precipitation biases (in mm) for the precipitation above 1 mm. The average rain amounts are well represented by the SWG, SWG-s, ANALOG and CDFt-so models. CDFt-so shows small positive biases over almost the whole domain while SWG, SWG-s and ANALOG models present small positive and negative biases distributed all over the domain. GAM, GAM-so and the dynamical models are more or less strongly positively and negatively biased. MED-IPSL is the best among them, with positive and negative biases distributed all over the domain, which is also the case for MED-UCLM. EURO-CNRM and MED-CNRM present mostly negative biases while EURO-IPSL has mostly positive ones. Border patterns are visible for all the RCMs, which are a consequence of the relaxation zone. Similar results are found for winter although with smaller biases for all the models except MED-IPSL (see auxiliary materials).

Fig. 6

Mean precipitation bias in \(\hbox {mm}\cdot \hbox {day}^{-1}\). Bias values above 6 mm or below −6 mm have been saturated in black in order to have a good colour contrast (c.f. auxiliary material for the real range)

Figure 7 displays the variance ratio (in percentage). It is the ratio between the variance of the simulations and that of the observations:

$$\begin{aligned} \%rv=\frac{\sum _{i=1}^{n} (S_{i}-\overline{S})^2}{\sum _{i=1}^{n}(O_{i}-\overline{O})^2} \times 100, \end{aligned}$$
(9)

where \(S_i\) is the simulated value for day \(i\), \(O_i\) is the observed value for day \(i\), \(\overline{O}\) is the mean of the observations for the period, \(\overline{S}\) is the mean of the simulated data and \(n\) is the number of days. While CDFt-so performs well with some variations and mostly overestimates the variance over the area, ANALOG, SWG and SWG-s tend to underestimate the variance. For the stochastic models, it is caused by the way rain amounts have been simulated. Indeed, the SWG and SWG-s models have been forced to simulate precipitation above 1 mm, which can reduce the variability of the generated data. Once again GAM and GAM-so perform poorly. While the other SDMs reach an average ratio between 80 and 150 %, GAM and GAM-so barely reach 25 % and are the worst among all models. Concerning RCMs, CNRM models are the best among them although they mainly underestimate the variance, with ratios around 80 %. The others are much more biased and mostly overestimate it, with variance ratios above 150 %. Here, the patterns at the boundaries for RCMs are stronger than for the previous indicators. In winter, the SDMs have the same behaviour, unlike the RCMs. CNRM models and MED-IPSL present larger variance ratios than in summer: closer to 100 % for CNRM models and above 150 % for MED-IPSL (see auxiliary materials).
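The variance ratio of Eq. (9) at one grid point can be computed as in the following sketch. The function name is hypothetical, and whether the series are restricted to wet days beforehand is left to the caller since the text does not specify it.

```python
import numpy as np

def variance_ratio(sim, obs):
    """Eq. (9): variance of the simulations over that of the observations, in %."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    num = np.sum((sim - sim.mean()) ** 2)
    den = np.sum((obs - obs.mean()) ** 2)
    return 100.0 * num / den
```

A ratio of 100 % means that the simulated and observed variances coincide; values below (above) 100 % indicate an underestimation (overestimation) of the variance.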

Fig. 7

Variance ratio in percentage (%). Values above 300 % have been saturated in black for the same reason as previously (c.f. auxiliary material for the real range)

As a last indicator of marginal intensity, the reproduction of extreme values is investigated. The 99th quantile bias (in mm) is considered and shown in Fig. 8. Overestimation and underestimation patterns are quite similar to those observed for the variance ratio (see Fig. 7), i.e., biases are quite similarly distributed all over the area (not shown). Thus, similarly to the variance ratio, ANALOG, CDFt-so, SWG and SWG-s reproduce extremes well. ANALOG, SWG and SWG-s slightly underestimate the 99th quantile, while CDFt-so overestimates it. Note that MOS models like CDF-t may be unstable when simulating extremes, especially for future projections. In order to deal with this issue, the constant correction method defined in Déqué (2007) is used in CDF-t. The underestimation for SWG and SWG-s results from the marginal Gamma pdf used here, which is not able to reproduce the extremes correctly. This is a known issue investigated in the literature (e.g., Vrac and Naveau 2007). GAM and GAM-so reach a median bias below −10 mm and therefore widely underestimate the 99th quantile. RCMs over- or underestimate depending on the model. CNRM RCMs present mostly negative biases and the others positive biases. In winter (see auxiliary materials), the results are similar except that biases are smaller in absolute values. The only remarkable difference is for CNRM models, which present mostly positive 99th quantile bias.

Fig. 8

99th quantile bias in mm. Bias values exceeding 20 mm in absolute values are not represented (c.f. auxiliary material for the real range)

4.3 Spatial indicators

The spatial properties of the downscaling models, more precisely the spatial variability, are now evaluated. To this end, a PCA is performed on daily downscaled precipitation outputs for each of the 11 models and on E-OBS data. Figure 9 pictures the first summer EOF of E-OBS and of each model. Since the distribution of precipitation is skewed, and therefore non-Gaussian, a transformation of the precipitation data has to be performed before applying a PCA. Here, the approach suggested by Vrac and Friederichs (2014) is followed: the zero precipitation values are set to a small non-zero value (0.00033) and the logarithm of all precipitation data is then computed. The PCA is performed on those transformed precipitation outputs. The variance explained by the first EOF is indicated for each model. Even if the values are generally low (mostly around 10 %), the first EOF is, in the present case, a valuable tool to spatially compare modes of variability. The EOF coefficient characterizes the contribution of each grid-point to the variability explained by a PC. The aim is to see if the EOF values for each model have the same spatial distribution as for E-OBS. Similar patterns mean that the models have a good ability to reproduce the spatial variability of the observations. The ANALOG model has almost the same spatial structure as the observations. This was expected since ANALOG is based on a resampling procedure and therefore keeps the spatial structure. The other statistical models have quite different spatial patterns even if CDFt-so, GAM and SWG are quite close. In some cases, they even present “flat” spatial patterns (i.e., EOF coefficients are almost equal). The “flat” spatial patterns come from models that are not able to reproduce any spatial variability in their simulations. That is the case for GAM-so and SWG-s for example, whose simulations are made pointwise without spatial constraints.
EURO-CORDEX models reproduce the observation pattern well, whereas MED-CORDEX models do so less well. In winter (see auxiliary materials), the spatial variability of all the models is better captured than in summer, except again for GAM-so and SWG-s. This is probably because the rain processes involved differ depending on the season. In winter, the precipitation is stratiform or dynamic, which is related to large-scale atmospheric systems. In summer, the precipitation relies on convective processes (e.g., isolated storms), which have a complex spatial structure.
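The transformation and the extraction of the first EOF described above can be sketched with a singular value decomposition. The function name and the array layout (days in rows, grid points in columns) are assumptions; centring each grid point before the decomposition is the usual PCA convention and is assumed here as well.

```python
import numpy as np

def first_eof(precip, floor=0.00033):
    """First EOF of log-transformed daily precipitation.

    precip: (n_days, n_points) daily amounts in mm; zeros are replaced by
    `floor` before taking the logarithm.
    Returns the EOF loadings (one per grid point) and the fraction of
    variance explained by this EOF.
    """
    precip = np.asarray(precip, dtype=float)
    z = np.log(np.where(precip > 0, precip, floor))
    z = z - z.mean(axis=0)  # centre each grid point
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s[0] ** 2 / np.sum(s ** 2)
    return vt[0], float(explained)
```

The sign of an EOF is arbitrary, so two patterns should be compared up to a change of sign.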

Fig. 9

Spatial distribution of the first EOF loadings of each model. The variance explained by each EOF is indicated for each model

The pattern correlation of the daily maps has also been investigated. It was computed between the previously transformed precipitation outputs used to compute the EOF and the transformed E-OBS. In Fig. 10, the boxplots of daily pattern correlation are given. RCMs, which are spatially constrained, are better than SDMs. Even ANALOG, which is considered efficient for reproducing the spatial variability, fails to reproduce daily spatial patterns. This is consistent with the result given by the Brier score, which indicates that ANALOG fails in terms of synchronicity of the events. The best model is the MED-IPSL model; this might be explained by the fact that it is nudged. Note that ERAi presents the best pattern correlation with E-OBS, with the exception of MED-IPSL. Thus, although MED-IPSL is nudged towards ERAi, it seems to improve on the pattern correlation of ERAi with E-OBS.
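A daily pattern correlation can be computed as the correlation, across grid points, between the simulated and observed maps of the same day. The sketch below uses the centred (Pearson) form, which is an assumption since the text does not state which variant is used, and expects the already transformed fields as inputs; the function name is hypothetical.

```python
import numpy as np

def daily_pattern_correlation(sim, obs):
    """Pearson correlation across grid points, one value per day.

    sim, obs: (n_days, n_points) arrays of (transformed) precipitation on
    the same grid. Days with a spatially constant map yield NaN.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    sa = sim - sim.mean(axis=1, keepdims=True)
    oa = obs - obs.mean(axis=1, keepdims=True)
    num = (sa * oa).sum(axis=1)
    den = np.sqrt((sa ** 2).sum(axis=1) * (oa ** 2).sum(axis=1))
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den
```

The distribution of the resulting daily values is what a boxplot per model would summarize.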

Fig. 10

Pattern correlation of daily maps between each model and E-OBS

4.4 Temporal indicators

The temporal aspect is studied from two angles: the interannual variability and the seasonality. Naturally, these indicators are examined over the whole year.

Figure 11 shows the cumulated annual rain amount at two illustrative stations (see Fig. 1 for their location): Montpellier (Fig. 11a) and Moscow (Fig. 11b). The top panels display the E-OBS amounts, all the statistical models and ERAi, while the bottom panels show the results from the dynamical models and E-OBS. The reanalysis precipitation is plotted since it is the only predictor of CDFt-so. First, the case of Montpellier is considered: among statistical models, all deterministic models (in purple) except GAM-so seem to be better than the stochastic models (in green) at reproducing the inter-annual variability. GAM-so annual amounts are low because of the combination of LR to model rain occurrence and GAM to model rain intensity. The latter is designed to simulate the average rain amount but the random trial for the rain occurrence reduces the annual amount. The dynamical models are better than the statistical models for the inter-annual variability (except ANALOG and CDFt-so).

Fig. 11 Annual rain amount in \(\hbox {mm}\cdot\hbox {year}^{-1}\) of the models, left for Montpellier and right for Moscow. Top panels group all the SDMs, ERAi and E-OBS; bottom panels group all the RCMs and E-OBS. a Montpellier, b Moscow

For Moscow, the evaluation result is quite different. In this case, no SDM seems to reproduce the inter-annual variability of the observations. As for Montpellier, low annual rainfall amounts are observed for GAM-so. Almost all dynamical models overestimate precipitation at this station, except EURO-CNRM, which is particularly close to E-OBS. To obtain a more global overview over the domain, the correlation between the cumulated annual rain amount time series of each model and that of E-OBS has been computed pointwise. The boxplots of the correlations are shown in Fig. 12. The SDMs clearly have difficulty reproducing the inter-annual variability compared to the RCMs, except CDFt-so, whose predictor is ERAi total precipitation (cf. the boxplot of ERAi total precipitation above 1 mm in Fig. 12). The performance of the other SDMs is poor (with correlations from 0.2 to 0.4), and the stochastic models and ANALOG have the worst performance although they were the best for the occurrence and intensity marginal properties. RCMs are more satisfactory, especially EURO-CNRM and MED-IPSL. However, these results have to be interpreted carefully because they only characterize the year-to-year synchronisation of the variability, i.e., whether the annual amounts increase or decrease at the same time. In terms of RMSE (given in auxiliary material), SDMs are better than RCMs (except GAM-so), as already suggested by the evaluation of the Brier score. This observation does not hold for EURO-CNRM, which is good in terms of RMSE and correlation but not for the Brier score.

Fig. 12 Boxplots of the correlation coefficients between annual rain amount time series of each model and E-OBS
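The distinction made above between correlation (year-to-year synchronisation) and RMSE (closeness of the amounts) can be illustrated with a minimal sketch. It assumes daily precipitation stored as arrays of shape (days, grid points) with an accompanying vector of calendar years; the function names are hypothetical and chosen for this example only.

```python
import numpy as np

def annual_totals(daily, years):
    """Sum daily precipitation per calendar year.

    daily: (n_days, n_points); years: (n_days,) calendar year of each day.
    Returns an array of shape (n_years, n_points).
    """
    daily = np.asarray(daily, dtype=float)
    years = np.asarray(years)
    return np.stack([daily[years == y].sum(axis=0) for y in np.unique(years)])

def pointwise_correlation(a, b):
    """Pearson correlation along the time axis, one value per grid point."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    den = np.sqrt((a**2).sum(axis=0) * (b**2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, (a * b).sum(axis=0) / den, np.nan)

def pointwise_rmse(a, b):
    """Root-mean-square error along the time axis, one value per grid point."""
    return np.sqrt(((a - b) ** 2).mean(axis=0))
```

A model whose annual totals are shifted by a constant offset keeps a correlation of 1 with the observations while its RMSE equals the offset, which is why the two indicators can rank the models differently.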

Now the seasonality is examined. To this end, the mean daily precipitation of each calendar month (including zeros) over the 20 years is computed (i.e., 12 values, one per month) for each model and for E-OBS. Then the correlation between the seasonal cycle of each model and that of E-OBS is calculated. Figure 13 shows the corresponding boxplots. Here the comparison between SDMs and RCMs is reversed with respect to the previous figure. This time SDMs achieve the higher correlations (except SWG-s), around 0.9, while RCMs have more difficulty reproducing the seasonal cycle, although they still reach correlations around 0.75. In the case of MED-IPSL, the poor seasonal cycle is partly a consequence of the land surface/atmosphere feedbacks described in Sect. 4.1.

Fig. 13 Correlation coefficient between the seasonal cycle of each model and E-OBS
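The seasonal-cycle indicator described above (12 monthly means, then a correlation over those 12 values at each grid point) can be sketched as follows. The array layout (days, grid points), the month vector and the function names are assumptions made for this example.

```python
import numpy as np

def monthly_climatology(daily, months):
    """Mean daily precipitation (zeros included) for each calendar month.

    daily: (n_days, n_points); months: (n_days,) with values in 1..12.
    Every calendar month is assumed to be present in the record.
    Returns an array of shape (12, n_points).
    """
    daily = np.asarray(daily, dtype=float)
    months = np.asarray(months)
    return np.stack([daily[months == m].mean(axis=0) for m in range(1, 13)])

def seasonal_cycle_correlation(model_daily, obs_daily, months):
    """Correlation between the 12-value seasonal cycles, per grid point."""
    cm = monthly_climatology(model_daily, months)
    co = monthly_climatology(obs_daily, months)
    cm = cm - cm.mean(axis=0)
    co = co - co.mean(axis=0)
    den = np.sqrt((cm**2).sum(axis=0) * (co**2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, (cm * co).sum(axis=0) / den, np.nan)
```

Because the correlation is computed on only 12 values and is insensitive to scale, a model with a systematic multiplicative bias can still reach a correlation close to 1.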

A third index has been considered to evaluate the temporal properties. Figure 14 shows the first-order summer autocorrelation coefficients (AR1) for each model. As for the first EOF, the aim is to see whether the spatial distribution of the coefficients of each model is the same as for E-OBS. The AR1 coefficient is computed on the precipitation outputs gaussianized as in Sect. 4.3. GAM gives too high an autocorrelation because it generates small amounts of rain too frequently. The other SDMs have very low autocorrelation, except ANALOG, which closely reproduces the autocorrelation of the observations. Note that the AR1 coefficients of CDFt-so are very different from those of ERAi; this is a consequence of the LR used for the occurrence in this model, which strongly modifies the rain occurrence observed in ERAi and therefore influences the autocorrelation. RCMs have autocorrelation values that differ from those of the observations but are very close to E-OBS in terms of range. The 2-day and 3-day lag autocorrelation values have also been computed (not shown): these coefficients decrease quite fast, as expected for rain, and the ranking of the models compared to E-OBS is the same as for AR1. In winter (see auxiliary materials) the results are similar for SDMs, while RCMs are globally much better, with autocorrelation coefficients very close to those computed for E-OBS.

Fig. 14 First-order autocorrelation coefficient for each model. Only values below 0.6 are represented; values above are saturated in black
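A lag-k autocorrelation restricted to one season can be sketched as below. This is a minimal illustration that assumes the input is a continuous daily series at one grid point, already transformed (the gaussianization of Sect. 4.3 is not reproduced here), with a boolean mask marking the days of the season of interest. The function name and arguments are hypothetical.

```python
import numpy as np

def lag_autocorrelation(x, lag=1, in_season=None):
    """Correlation between x[t] and x[t + lag] for a daily series.

    x: 1-D continuous daily series (assumed already transformed).
    in_season: optional boolean mask; a pair (t, t + lag) is used only
    if both days belong to the season, so no pair spans two summers.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[:-lag], x[lag:]
    if in_season is not None:
        s = np.asarray(in_season, dtype=bool)
        keep = s[:-lag] & s[lag:]
        a, b = a[keep], b[keep]
    a = a - a.mean()
    b = b - b.mean()
    den = np.sqrt((a**2).sum() * (b**2).sum())
    return float((a * b).sum() / den) if den > 0 else float("nan")
```

Calling the function with `lag=2` or `lag=3` gives the longer-lag coefficients; applying it at every grid point yields a map comparable in form to those of Fig. 14.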

5 Conclusions and discussion

5.1 Conclusions

In this study, an intercomparison of several precipitation downscaling models has been conducted. To this end, an intercomparison framework has been built following some essential requirements. First, all the models had to share common predictors (as far as possible) coming from the same database, here the ERAi reanalyses. Second, observations and model outputs had to have the same spatial resolution and cover a common area. Given the resolution of the available RCMs and of the observational data (E-OBS), a resolution of \(0.44^\circ\) has been chosen. Third, the selected models had to represent all the downscaling approaches the authors have defined (the TF, WG, WT and MOS statistical families and some dynamical models). Thus, 11 models (six SDMs and five RCMs) have been selected and their outputs compared according to criteria characterizing four aspects of the rain: occurrence, intensity, and spatial and temporal properties. This study is an opportunity to set up the intercomparison framework and to test its consistency when comparing outputs from SDMs as well as RCMs. Very different downscaling models, at least in terms of model philosophy, have been compared.

All the RCMs (except MED-IPSL), as well as GAM, seem to produce too many rainy days. More generally, modelling the rain occurrence with an LR (logistic regression) proves to be a better approach than thresholding the outputs. Concerning the spells, all the models reproduce the wet spells better than the dry ones, and ANALOG reproduces them best. However, even if ANALOG is good in terms of occurrence statistics, it fails in terms of time accuracy (Brier score).

The second examined aspect is the rain intensity. Here, the mean climatology is better reproduced by the stochastic models (SWG and SWG-s). Variability and extremes are better handled by ANALOG and CDFt-so, with SWG and SWG-s close behind. All the other models present strong biases that vary over the domain. GAM-so and GAM completely fail to reproduce the intensity properties. This is in agreement with Schoof and Pryor (2001) concerning the performance of TF models. Concerning RCMs, the study corroborates the classical result found in the literature, namely that they produce too many rainfall events (occurrence) but with low intensity (Sun et al. 2006; Stephens et al. 2010), except for MED-IPSL.

Spatial patterns are studied from two specific angles: first, the spatial variability through an EOF analysis; second, the pattern correlation of the daily maps. Concerning the spatial variability, ANALOG and the EURO-CORDEX models reproduce the E-OBS spatial rain patterns better, while the others show quite different patterns or no patterns at all. The models with good spatial patterns are those that have spatial constraints: by construction for RCMs, and by keeping the spatial structure of the observations for ANALOG. This shows the importance of developing statistical spatial models in the future. In terms of daily pattern correlation, the only model that has been nudged, MED-IPSL, best reproduces the daily patterns of E-OBS (even though the nudging has been done with ERAi).

Finally, the temporal properties were investigated. In this study, SDMs fail to retrieve the E-OBS inter-annual variability, especially the SWG and SWG-s models. This is probably due to the random nature of the simulations, which can generate too large or too small rain amounts and thus simulate annual rain amounts very different from the observations. Another explanation could be the lack of information on inter-annual variability provided by the predictors. RCMs, by contrast, are in general better, with good performance of EURO-CNRM and MED-IPSL. MED-IPSL achieves good performance for several aspects, which can be partly explained by the nudging performed inside the domain with ERAi. Conversely, the SDMs succeed in reproducing the seasonality, whereas RCMs have more difficulty achieving a good seasonal cycle. Finally, in terms of autocorrelation, ANALOG, followed by the RCMs, is close to the E-OBS autocorrelation. The other SDMs are quite far from the E-OBS autocorrelation values.

In order to synthesize the results, the statistical models and the dynamical models are ranked according to each criterion. The models are scored according to the domain-wide averaged indicators in Tables 4 and 5, over the MED-CORDEX and EURO-CORDEX domains respectively: the lower the score, the better the model. A global score can be obtained by simply adding the indicator ranks over each of the considered aspects of the model evaluation (occurrence, intensity, spatial and temporal). Tables 4 and 5 can be used as a guideline by the users of the simulations: they help to choose the model(s) to be used, depending on the statistical properties that the simulations must satisfy for a particular application. Indeed, no single model clearly outperforms the others across the four aspects of the evaluation. Performance depends on the indicator considered and therefore on the intended use of the model simulations. Thus, the “best” model depends on the end-users’ needs and on the properties they expect the data to have.

Table 4 Score and rank table for summer season computed over the MED-CORDEX domain
Table 5 Score and rank table for summer season computed over the EURO-CORDEX domain
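The rank-sum scoring described above can be sketched as follows. This is a minimal illustration under the assumption that each indicator has already been converted to a domain-averaged distance to E-OBS for which lower is better; the function name, the tie rule (tied models share the best rank) and the data layout are choices made for this example, not necessarily those behind Tables 4 and 5.

```python
def rank_sum_scores(indicators):
    """Sum per-indicator ranks into one global score per model.

    indicators: dict mapping an indicator name to a dict
    {model name: distance to the reference}, lower meaning better
    (hypothetical layout). Returns {model name: summed rank}.
    """
    models = sorted(next(iter(indicators.values())))
    totals = {m: 0 for m in models}
    for values in indicators.values():
        ordered = sorted(values[m] for m in models)
        for m in models:
            # Rank 1 is the best model; tied models share the best rank.
            totals[m] += ordered.index(values[m]) + 1
    return totals
```

Summing the ranks separately for the occurrence, intensity, spatial and temporal indicators, rather than over all indicators at once, gives one score per aspect, so that a user can weight the aspects according to the intended application.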

5.2 Perspectives and discussion

Many perspectives can be foreseen for this work. The choice between SDM and RCM methods cannot be made solely on the reproduction of the ERAi climate. A direct continuation would be an intercomparison in a future climate context. First, the “GCM/SDM” couple has to be evaluated over the historical (or CTRL) period. With the SDMs fitted over the historical period (e.g., 1979–2008) to the observations (E-OBS) and reanalyses (ERAi) (i.e., basically as has been done in this study), new time series driven this time by GCM predictors will be generated and evaluated. A good agreement of those time series with observations would mean that GCMs provide good predictors to simulate local-scale variables. Thus, the ability of the SDMs to reproduce the present climatological characteristics of precipitation when driven by historical GCM fields would be assessed. The evaluation would be performed only in terms of statistics; in other words, indicators requiring day-to-day synchronicity (e.g., Brier score and daily map correlation) would not be relevant in that case. The next step would be to assess the capability of the SDMs to capture changes in future spatial and/or temporal local-scale properties. The “GCM/SDM” couple would be evaluated in a climate change context with an RCM-based pseudo-observations approach, for example as developed in Vrac et al. (2007c) and applied in Gaitan et al. (2014). RCMs will be considered as proxies of future climate conditions, and RCMs and SDMs have to be driven by the same GCM simulations. SDMs fitted to CTRL GCM simulations and to pseudo-observations coming from the RCM over the same time period will be driven with future GCM simulations (multiple emissions scenarios can be used) to generate new time series. Good agreement between those time series and the future RCM time series would mean that the SDM is able to capture a climate change signal similar to that simulated by the RCM.

A multi-model approach can also be an interesting follow-up study. It was first tested in Sanders (1963) for subjective and in Perrone and Miller (1985) for objective weather forecasting, and has proven superior to the methodologies applied individually. This result has been verified on many occasions, and theoretical contributions have even been made to support these experimental facts (e.g. Hagedorn et al. 2005). However, the approach was not generalized until the 2000s (e.g. Palmer and Shukla 2000; Pavan and Doblas-Reyes 2000; Lambert and Boer 2001; Gillett et al. 2003; Jacob et al. 2007; Ruti et al. 2011; Solman et al. 2013; Gallardo et al. 2013), and it is now consolidated as the standard in climate studies performed with dynamical models. Therefore, future studies should include the multi-model approach once the MED-CORDEX and EURO-CORDEX databases are completed. This methodology could thus be extended, as noted by Haylock et al. (2006), to a mix of dynamical and statistical models. Note that one major difference between ensemble methods in weather forecasting and in climate studies is that the former must deal effectively with uncertainty in initial conditions, while in climate studies this uncertainty is less relevant.

Moreover, a way to refine the results would be to study the impacts of E-OBS uncertainties on the downscaled data. Some studies have pointed out quality inconsistencies. For instance, Hofstra et al. (2009) pointed out problems such as data homogeneity over the E-OBS domain, or oversmoothing in the interpolation scheme that makes it difficult to correctly capture extremes or rain patterns over mountains. Therefore, the data uncertainty caused by the interpolation is ill-estimated. In this study, E-OBS V8 has been used. Some improvements can potentially be expected if the latest version, E-OBS V11, is used instead, since the network density has been increased and an artefact in drizzle occurrence has been corrected. However, the drizzle effect should not influence our results since the rain occurrence threshold is set at 1 mm. This occurrence threshold could itself influence the results. In our case, simulations with a 0 mm threshold have also been tested for all the models (not shown). This changes the indicator values but does not change the ranking of the models. The poor performance of GAM is not a consequence of the threshold, since the same poor performance has also been observed with the 0 mm threshold. It mainly comes from the fact that the deterministic TF-based models are not suited to simulating precipitation. Besides, concerning the drizzle effect of the RCMs, the results show that the tested RCMs produce too many rainy days even with the 1 mm threshold, except for MED-IPSL (see Fig. 3).

Improvements can also be made to the SDM calibration, for instance by improving the predictor selection process or by adding other predictors. It is worth noting that the first exploratory step based on the SPARSE CCA algorithm (i.e., determining the variables that make sense as predictors for precipitation downscaling) has been performed only on the first principal component of each variable. Although the SPARSE CCA method is computationally intensive, it would be interesting to include additional leading PCs in this exploratory analysis to make the choice of predictors more robust. Moreover, as the SPARSE CCA has not been applied within the cross-validation, the performance of the SDMs as assessed via cross-validation could be overly optimistic (or at least biased). Although the differences could be minor, it would be interesting to perform the selection of the predictors within the cross-validation procedure. Note that the cross-validation scheme used in this study has a rather short calibration period (20 years), which may underestimate or even overestimate the skill of some methods. One solution could be to use a “29-leave-one-out” scheme, with the models calibrated on 29 years and evaluated on the left-out year. This 29-leave-one-out strategy, however, may not be suitable for evaluating the performance of the models in a changing climate context. Indeed, as the left-out year would be either surrounded by the 29 calibration years or appended (before or after) to them, the basic statistical properties of the large-scale predictors and of the local-scale data should be the same in the 29 calibration years and in the evaluation year. Hence, this strategy could provide overly optimistic results compared to an evaluation performed on a whole decade (or more). Besides, the 20-leave-10-out method is closer to the framework in which the downscaling methods are applied (calibration on a historical period and application to a future period). That is why, despite the limited length (20 years) of the calibration period, the “turning” 20-leave-10-out cross-validation procedure has been favoured in this study.

Predictors relevant to the physical processes of rain and characterizing atmospheric instability, such as the CAPE (convective available potential energy, Foufoula-Georgiou and Tsonis 1996), the vertical wind shear (Wingo and Cecil 2009) or the moisture flux (Yang et al. 2010), could also be considered. Some temporal information could be added by including the previous day’s precipitation observation, especially for the occurrence model (Kleiber et al. 2012). Weather regimes or seasonal cycle indicators could also bring interesting information leading to potential improvements. More broadly, the intercomparison could be extended by adding more statistical and dynamical models, or new variables of interest such as temperature or wind. An inter-variable analysis could then be carried out based on adapted indicators.
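The two cross-validation schemes compared in this section differ only in the length of the block of years that is left out. A minimal sketch is given below, assuming a 30-year record (1979–2008 is used purely as an example) and assuming that the “turning” scheme leaves out each of three consecutive decades in turn; the function name `leave_block_out` is hypothetical.

```python
def leave_block_out(years, block):
    """Build (calibration, evaluation) year lists for block cross-validation.

    years: ordered sequence of years; block: number of consecutive
    years left out in each fold.
    """
    years = list(years)
    folds = []
    for start in range(0, len(years), block):
        evaluation = years[start:start + block]
        calibration = years[:start] + years[start + block:]
        folds.append((calibration, evaluation))
    return folds

# Assumed 30-year record, for illustration only.
record = range(1979, 2009)
folds_20_10 = leave_block_out(record, 10)   # 3 folds: 20 years in, 10 out
folds_29_1 = leave_block_out(record, 1)     # 30 folds: 29 years in, 1 out
```

With `block=10` the first and last folds evaluate on a decade lying entirely before or after the calibration years, which is the feature that brings this scheme closer to a calibration on a historical period followed by an application to a later one.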

Besides, the SDMs’ features can be improved. According to the results, it would be legitimate to focus on the ANALOG model. However, this model has some limitations. It is limited by the range covered during the calibration period: in future projections under climate change, it may miss the climate change signal because ANALOG cannot go beyond the climate range of the calibration period. Besides, this model uses more large-scale information than the other models tested here, which could also explain its performance. One can object that the ANALOG model could have been run with the same set of predictors as the other SDMs (i.e., the 12 PCs). The authors are not aware of any application of the ANALOG model with PCs as predictors; the usual way to apply it is to work with fields of anomalies. Nevertheless, the ANALOG model has also been run with PCs as predictors for comparison. This approach strongly degrades the results compared to using the anomalies as predictors: the model then presents large biases and sometimes the results are even unrealistic (not shown). Some analog approaches combine multiple analogs (e.g., Radanovics et al. 2013; Chardon et al. 2014; Yiou 2014). Given the way the analogs are computed in our study, using a combination (e.g., a mean or weighted average) of multiple analogs would decrease the quality of the ANALOG simulation. Indeed, it would degrade the mean and the variance of the ANALOG model output and could also introduce a bias in the wet-day frequency. An artificial variance-inflation procedure would then be necessary to maintain the main statistical properties.

Conversely, a focus can be placed on the SWG model. Indeed, in spite of its shortcomings in terms of spatial and temporal properties, it seems very promising. There are many avenues for improvement, for instance giving the model a spatial structure through a covariance function (e.g., Vischel et al. 2009) or improving the Bernoulli/Gamma marginal probability distribution function used here. A spatial structure would allow us to generate daily rain fields with spatial coherence and to use one model for an entire region instead of one model per grid point. Considering weather regimes instead of two seasons could also lead to a potential improvement (Vrac et al. 2007b). Of course, the CORDEX regions are probably too large to define a simple but realistic dependence model. However, improving the SWG model seems a good compromise between the many avenues for improvement and the model flexibility. Spatial coherence can also be ensured in other modelling frameworks, for instance the spatial MOS model, EC-BC, developed in Vrac and Friederichs (2014). Another path could be a combination of a stochastic model with an ANALOG model.

Finally, the present study has focused entirely on the intercomparison framework and the results that have come out of it. This work aims to set easily reproducible ground rules for conducting an intercomparison that includes RCMs as well as SDMs and that allows the SDMs to fit into the CORDEX initiative. On this basis, consistent future intercomparison studies of SDMs as well as RCMs are expected to be performed.