1 Introduction

General Circulation Models (GCMs) are the principal instrument for making projections of future climatic conditions. They are designed to provide a large-scale view of the evolution of climate in response to natural and anthropogenic forcing. Because of current limitations in computing resources, their spatial resolution is still limited. As a consequence, raw GCM outputs do not meet the needs of regional impact studies (Wilby et al. 2004). Another issue is that different GCMs return different climate projections for both the present-day and future periods. Using outputs from a range of GCMs is therefore recommended for impact studies (IPCC 2007). Multi-model approaches have recently been developed at the global scale using a Bayesian approach (Lopez et al. 2006), and at the regional scale with a Bayesian linear model (Greene et al. 2006) or more classic weighting methods (Giorgi and Mearns 2002).

A further issue is that projections in complex topographic areas, such as mountainous regions, cannot be rendered with enough accuracy (Hostetler 1994; IPCC 2007). Regional climate is also often affected by circulations that occur at smaller scales (e.g. Giorgi and Mearns 1991). Recognising this need for small-scale projections, the scientific community has developed several so-called downscaling techniques over the last decade. Among these, statistical downscaling is valued for being simple and easy to implement compared to other techniques (Wilby et al. 2004). Another advantage is that spatial and temporal variability are conserved, because historical records are used to determine the model's parameters. The Statistical Downscaling Model (SDSM) developed by Wilby et al. (2002) is a widely used linear regression model, especially in the analysis of climate change impacts on water resources.

The interaction between the atmospheric circulation and the complex topography of the Andes mountain range has a strong influence on the regional climate. It also makes this region especially vulnerable to climate variability. Such mountainous arid regions, most of which are characterized by nival regimes, are highly dependent on precipitation but are also strongly influenced by temperature (Singh et al. 2006). While downscaling techniques, including statistical downscaling, have mainly been tested and applied in North America and Europe (e.g. Dibike et al. 2008; Diaz-Nieto and Wilby 2005; Boe et al. 2007), fewer contributions exist for South America. Solman and Nunez (1999) reported one of the first attempts at statistical downscaling of temperature in central Argentina. The authors concluded that, although the interannual variability is not always well captured, the statistically downscaled data show good agreement with observations. More recently, Labraga (2010) proposed one of the few applications of a statistical downscaling model in western Argentina. The author concludes that significant and useful relationships between precipitation and atmospheric patterns could be found. However, the spatial and temporal scarcity of observational stations adds to the challenges of downscaling in mountainous areas (Labraga 2010). Recently, a preliminary investigation carried out in arid northern Chile by Souvignet et al. (2010) proposed a successful application of statistical downscaling for the projection of future precipitation and temperature scenarios. However, the authors concluded that uncertainties linked with the technique should be addressed.

In southern central Chile, Rojas (2006) first reported on the application of downscaling in a benchmark study. Although dynamical rather than statistical downscaling was involved, her conclusions indicate that biases were highly correlated with station elevation and with the terrain representation in her model. More recently, Marengo et al. (2009) used the PRECIS regional climate modelling system to investigate the distribution of extreme temperature and precipitation in South America over the recent past as well as under future climate forcing. In their analysis, the authors note significant biases in the central Andes between observed and simulated values.

Uncertainties in downscaling techniques are manifold. Part of them can be attributed to those intrinsic to the GCMs (Mearns et al. 2001). In addition, the structure of the downscaling model contributes to the overall downscaling uncertainty. Nevertheless, several solutions exist to address the issue of uncertainty in statistical downscaling:

First, the use of multiple climate models is expected to reduce the uncertainty linked with GCMs (e.g. Doblas-Reyes et al. 2003). It provides a measure of uncertainty as well as a better mean estimate, which is expected to mitigate possible biases of the individual GCMs. Therefore, a multi-model approach should be used when implementing statistical downscaling techniques for impact studies. In the case of SDSM, climate model predictors are made available courtesy of the Canadian Institute for Climate Studies. However, predictors for South America and most of the world are available from only one GCM. As a consequence, most impact studies using SDSM rely on the outcomes of a single GCM (e.g. Chu et al. 2010).

Second, the quantification of the uncertainty linked with the downscaling model structure is made possible by several statistical methods. The most commonly used statistical techniques for assessing model uncertainty include the analysis of statistical properties of the model errors.

Hence, this paper aims at providing a comprehensive comparison of uncertainties linked with different GCMs using SDSM in an arid mountainous watershed of the central Andes. The uncertainty assessment includes the deviations in terms of median and variance between downscaled variables and their corresponding observations, using boundary data from three GCMs that, to the best of our knowledge, have never been tested with SDSM in this region. In addition, given the high regional climatic variability, the ability of SDSM to reproduce extreme events is tested. By quantifying the biases linked with these three GCMs, this work is expected to contribute to a reduction of uncertainties in impact studies in the region and in similar areas elsewhere in the world. In the case of the central Andes, a better knowledge of the uncertainties at stake is expected to enhance the decision capacity of national and local water managers and to provide decision makers with sound estimates of the potential impacts of climate change in the region.

This paper is structured in four main sections. First, the study area and the database used for this study are presented. The next section describes the methodology: after a description of the linear regression downscaling technique, the statistical tests and analyses used in the uncertainty analysis are presented. Thereafter, results for the performance and associated uncertainties of downscaled variables are explored. Finally, general concluding remarks are presented in the last section.

2 Study area and database

2.1 Study area

This work focuses on the span between 29°20′ and 32°17′ S, the so-called Coquimbo Region (see Fig. 1). Located just south of the arid diagonal (Messerli et al. 1992), its climate ranges from arid to semi-arid from the north to the south (Miller 1976). The main reasons for the choice of the Coquimbo Region for an uncertainty analysis are described below.

Fig. 1 Location of the meteorological stations within the study area

A strong future climate change signal on the western side of the Andes is suggested by most climate models (Mata and Campos 2001). This makes the region a meaningful candidate for impact studies in arid mountainous regions (e.g. investigating the influence of climate variability on water resources).

Like other arid and semi-arid areas, which cover more than 30% of the Earth's land surface (Dregne 1985), the region has climatic, physiographic and ecological features that make it highly vulnerable to climate variations (Downing et al. 1994; Holmgren et al. 2006). The annual precipitation shows a strong orographic dependence, increasing from ca. 80 to 300 mm a−1 from the coastal area to the Andes (Favier et al. 2009). An increase in precipitation of similar magnitude is also observed from north to south.

Moreover, precipitation is strongly influenced by several natural large-scale phenomena such as the El Niño Southern Oscillation (ENSO) and the Pacific Decadal Oscillation (PDO) (Aceituno 1988; Garreaud et al. 2009; Verbist et al. 2010). Temperature, also influenced by these phenomena, reaches its minimum in winter (June–August), which coincides with the precipitation maximum.

In addition, hydrological processes are expected to be strongly affected. Snowfall covers large areas, up to 50% of total surface in certain watersheds at high altitude (Favier et al. 2009). With a relatively low glacier coverage (ca. 7 km2 between 29–32° S; see Garin 1987), river discharge is then mainly driven by the melting process of winter-accumulated snowpack. Therefore, a better understanding of uncertainties linked with downscaled temperature and precipitation for future scenarios will give more confidence in the outcomes of future impacts studies.

Last but not least, this region, like other areas in developing countries, has been given very little attention with regard to climate change. Most comparative studies of downscaling techniques and most impact assessments concentrate on areas within Western Europe and North America (Kundzewicz et al. 2008).

2.2 Predictands

The database used in this study is threefold. It includes (1) a set of observed daily data records, (2) daily variables based on the National Center for Environmental Prediction (NCEP) re-analysis dataset and (3) daily variables based on the datasets of three GCMs.

The predictands, i.e. observed data, were provided by the Dirección General de Agua (DGA) for eight stations and are presented in Table 1. Daily maximum temperature (referred to as Tmax), daily minimum temperature (referred to as Tmin) and daily precipitation for a 22-year period (1979–2000) were considered. Because of the relative scarcity of data and quality constraints, a longer time period was not available.

Table 1 Location of meteorological stations used in the uncertainty analysis

Data quality assessment is an important preliminary step in uncertainty analysis. Because input data that fail to meet certain quality standards increase the biases introduced into the simulation, special emphasis was placed on data quality control. The homogeneous data series used in this study were provided by Souvignet et al. (2011), who prepared them for a trend analysis of temperature and precipitation records in central–northern Chile.

2.3 Predictors

In statistical downscaling, the strength of the relationship between large-scale predictors (i.e. variables from the NCEP re-analysis and GCM datasets) and the small-scale predictands (i.e. precipitation, Tmax and Tmin at the station scale) determines the model's ability to produce good climate projections for the research area. This is based on the assumption that the predictor–predictand relationships under current conditions remain valid under future climate conditions. This assumption allows the statistical downscaling of global future climate projections. Therefore, the choice of predictors is of particular importance to linear regression downscaling techniques.

The NCEP re-analysis dataset (Kalnay et al. 1996; Kistler et al. 2001) was re-gridded to the same coordinate system as the selected GCMs (listed in Table 2) and normalized with respect to their respective 1961–1990 mean. This modified dataset will be referred to here as NCEP.

Table 2 GCMs used for uncertainty analysis

When the best fit between predictands and predictors is validated for observed climate, a simulated control signal is introduced by substituting observed large-scale predictors from the NCEP dataset with the corresponding simulated GCMs predictors.

GCM predictors were obtained from the World Climate Research Programme Coupled Model Inter-comparison Project Phase 3 (CMIP3) dataset for the climate of the twentieth century experiment (20C3M) scenario. This scenario corresponds to a change in greenhouse gases as observed until 2000. Thereafter, variables were standardized according to their 1961–1990 variance and mean values. The modified datasets for GCM predictor variables will be referred to by the GCM short names listed in Table 2.
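As an illustration of the standardization step, the following minimal sketch subtracts the baseline mean and divides by the baseline standard deviation. The function name, the use of the population standard deviation and the sample numbers are assumptions made for this example; the actual processing in the study was carried out in the IRI Data Library.

```python
from statistics import mean, pstdev

def standardize(values, years, base_start=1961, base_end=1990):
    """Standardize a series with the mean and standard deviation of a baseline period."""
    baseline = [v for v, y in zip(values, years) if base_start <= y <= base_end]
    mu = mean(baseline)
    sigma = pstdev(baseline)  # assumption: population standard deviation of the baseline
    if sigma == 0:
        raise ValueError("baseline period has zero variance")
    return [(v - mu) / sigma for v in values]

# Made-up values of one predictor, with a shortened baseline for readability
years = [1960, 1961, 1962, 1963, 1991]
values = [10.0, 12.0, 14.0, 16.0, 20.0]
z = standardize(values, years, base_start=1961, base_end=1963)
print(z)
```

By construction, the standardized values falling inside the baseline period have zero mean and unit variance, which makes predictors from NCEP and from different GCMs comparable.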

Among all predictors available from the CMIP3 database, only 16 were available at the daily resolution for NCEP and the three GCMs considered in this study. Subsequently, re-gridding and standardization of data of NCEP and the GCMs were processed using the PostScript-based language of the International Research Institute for Climate and Society (IRI) Data Library (http://iridl.ldeo.columbia.edu/SOURCES/.WCRP/.CMIP3/). The 16 predictors considered for the calibration exercise are displayed in Table 3.

Table 3 Available predictor variables considered for NCEP, CGCM3, GFDL and MRI models

3 Methodology

3.1 Statistical downscaling technique

The Statistical Downscaling Model was developed by Wilby et al. (2002). As Wilby et al. (2002) already provide a complete and detailed description of SDSM, the model is described here only briefly. Additional details can be found in the literature cited in this section. SDSM is designed to produce high-resolution daily climate information from coarse-resolution GCM simulations. It is best described as a hybrid of the stochastic weather generator and multiple linear regression methods. The underlying idea of the model is that large-scale circulation patterns and atmospheric moisture variables (e.g. observed measures of vorticity or relative humidity) are used to condition local-scale weather generator parameters (e.g. precipitation occurrence and intensity; Penlap et al. 2004). It is then assumed that this relationship remains constant under climatic change. This allows the generation of future local climate scenarios.
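The "regression plus stochastic component" idea can be illustrated with a toy sketch. It is not SDSM code: the function name, the coefficients and the Gaussian noise term are assumptions chosen for illustration, and the sketch covers only an unconditional variable such as temperature, not the occurrence and intensity of precipitation.

```python
import random

def downscale_ensemble(predictors, intercept, coefs, se, n_members=20, seed=0):
    """Toy hybrid downscaling: a linear regression on large-scale predictors
    plus a random term, repeated to obtain equally likely ensemble members."""
    rng = random.Random(seed)
    # Deterministic part: multiple linear regression, one value per day
    deterministic = [
        intercept + sum(c * x for c, x in zip(coefs, day)) for day in predictors
    ]
    # Stochastic part: Gaussian noise with standard deviation `se` (an assumption)
    return [
        [d + rng.gauss(0.0, se) for d in deterministic] for _ in range(n_members)
    ]

# Made-up standardized predictors for three days (two predictors per day)
predictors = [[1.0, 0.5], [0.0, 0.0], [-1.0, 2.0]]
ensemble = downscale_ensemble(predictors, intercept=15.0, coefs=[2.0, -1.0], se=1.5)
print(len(ensemble), len(ensemble[0]))
```

Replacing the re-analysis predictors with GCM predictors, while keeping the calibrated coefficients, corresponds to the substitution step described in Section 2.3.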

Among statistical downscaling techniques, SDSM has become increasingly used, mostly because it offers a set of pre-processed predictors (NCEP and GCMs) available for most regions of the world. Although this model has been used in several countries, uncertainties linked with its downscaled variables appear to be highly dependent on the local climate regimes.

Lately, Chu et al. (2010) tested SDSM in semi-arid China. Their results suggest that temperature data are well simulated, whereas downscaled precipitation introduced systematic errors in extreme events.

In a recent study, Dibike et al. (2008) investigated the uncertainties linked with SDSM downscaled precipitation and temperature regimes in Northern Canada. Their approach is based on the systematic analysis of uncertainties using hypothesis tests of median and variance along with their corresponding confidence intervals. The authors concluded that the biases introduced by the SDSM technique alone were acceptable for impact studies in the region. In central Sweden, Wetterhall et al. (2007) investigated four statistical downscaling methods in terms of their ability to capture statistical properties of daily precipitation in different seasons. Their conclusions showed that although SDSM outperformed the other analogue models, it was unable to capture the differences between wet and dry summers well. Khan et al. (2006) compared three different downscaling techniques in northern Canada using an approach similar to the one later used by Dibike et al. (2008). They concluded that SDSM was one of the best performers in simulating temperature and precipitation in this region. In addition, Wetterhall et al. (2006) investigated the ability of four statistical downscaling methods to simulate rainfall in south-eastern and central China. The authors conclude that the annual cycle was well captured with SDSM.

3.2 Screening predictor variables

In order to define the statistical downscaling model parameters, a screening of available predictors is necessary. To ensure a proper calibration, NCEP predictor variables (re-gridded for the different GCMs) were selected according to their statistical significance, i.e. a significant partial correlation (r) at the 95% confidence level with the local predictand (i.e. precipitation and temperature). This statistical measure helps in identifying the explanatory power of each predictor once the influence of the other predictors has been removed.
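The notion of partial correlation can be made concrete with a short sketch. It covers only the first-order case (one controlling variable), whereas the screening in the study controls for all other predictors; the function names and the sample numbers are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y after removing the
    linear influence of a single controlling variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Made-up daily values: one predictand and two standardized predictors
predictand = [0.0, 1.2, 0.0, 4.5, 0.3, 2.8]
pred_a = [0.4, -0.8, 0.9, -1.6, 0.2, -1.1]
pred_b = [-0.2, 0.6, -0.5, 1.1, 0.4, 0.3]
print(partial_corr(predictand, pred_a, pred_b))
```

A predictor can show a high simple correlation with the predictand and still have a low partial correlation if its information is already carried by another predictor, which is why the screening relies on the latter.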

In order to allow a comparison among the three GCMs, the whole set of predictors listed in Table 3 is tested for each GCM against observed values. Predictors with non-significant correlations are discarded until a set of predictors with the highest partial correlations is obtained (from two up to six, depending on the predictand). This procedure, although it is a compromise between the three grid box outputs, allows a sound comparison between the different GCMs, because the same set of statistically significant predictand–predictor relationships is assigned to each of them. In addition, based on the statistical characteristics of the data, a lag-1 (1-day) auto-regression term was introduced in the model for temperature simulation. The outcomes of the predictor screening for precipitation, maximum and minimum temperature are displayed in Table 4.

Table 4 Selected NCEP predictor variables and their respective partial correlation for precipitation, Tmax and Tmin corresponding to the investigated GCMs

Predictor variables such as sea level pressure (mslp) and specific humidity at different pressure levels (sh85 and sh50) indeed return high partial correlation values for precipitation. Nevertheless, at high altitude (e.g. La Laguna), mslp no longer plays a prevailing role, giving way to a greater influence of winds (p5_u). As shown by Kalthoff et al. (2002), westerly winds prevail at high altitude (4,000 m a.s.l.), while below this height winds flow southward along the Andes. These winds appear to have an influence on both precipitation and temperature, with meridional and zonal velocity at appropriate pressure levels (surface, 850 hPa, 500 hPa) returning relatively high partial correlations. However, few predictor variables with the above-mentioned regional physical characteristics returned satisfactory partial correlation values for precipitation. In contrast, a larger set of physically based predictors was found to be correlated with Tmin and Tmax.

In conclusion, among the predictors considered for SDSM calibration in Table 3, sea level pressure plays a major role in the prediction of precipitation, with high partial correlations. Although this predictor is also significant for temperature (especially for Tmin), its influence there is lower. Meridional and zonal velocity appear to influence both precipitation and temperature in the region, with the relevant pressure level depending on the station's altitude. Equally, relative humidity predictors at higher pressure levels return significant partial correlations for both precipitation and temperature. On the other hand, surface relative humidity and high pressure winds (10_u and 10_v) do not prove to be appropriate predictors for the region.

A 16-year calibration period (1979–1994) was chosen for temperature. The statistical model's parameters were then validated over a 6-year period (1995–2000). Because rainy days are very scarce in the region, a longer validation period was necessary for precipitation. Therefore, a 10-year calibration period (1979–1988) was chosen for precipitation, along with a 12-year validation period (1989–2000). As a result of these data availability constraints, the calibration period is relatively short for both precipitation and temperature. However, as the region's climate is strongly affected by phenomena such as ENSO and the PDO, it should be verified that the calibration phase is representative of the regional climate variability. Figure 2 compares the calibration and validation periods for precipitation and temperature against the Oceanic Niño Index (ONI) from 1950 to 2000. The PDO is not represented on this graph because, as shown by Biondi et al. (2001), the PDO entered a so-called warm phase after 1977, which is homogeneous over the calibration and validation periods presented here. Nevertheless, it should be noted that the “cold phase” of the PDO is consequently not taken into consideration. Concerning the ENSO phenomenon, the ONI is represented by a dark grey shaded area from 1950 to 2000. Standardized regional average annual precipitation (for all stations considered in this study) is represented by a black line. The dashed line shows the standardized regional average mean temperature for the region of interest. For both temperature and precipitation, medium and extreme warm/cold phases of the ENSO phenomenon are included within the calibration period. Hence, although the calibration period is relatively short, it captures the regional climate variability well.

Fig. 2 Long-term regional climate variability against the calibration and validation periods. The grey shaded area represents the ONI, the black line displays standardized average precipitation and the dashed line represents the standardized mean temperature. The calibration phases for temperature are indicated by a light grey area

3.3 Uncertainty analysis

The mean value and variability of observed events over a baseline period must be conserved in the simulations in order to have reasonable confidence in the statistical model outputs for future events. Therefore, the uncertainty analysis aims at quantifying the model's ability to reproduce the current state of precipitation and temperature. This is a prerequisite for the simulation of future climate based on the outputs of GCMs.

The methodology to identify uncertainties of downscaled variables was first proposed by Khan et al. (2006). Given the regional climate characteristics of the central Andes, a modified approach is used for the region and an analysis of biases linked with the reproduction of extreme events was introduced.

Hence, the uncertainty linked with the statistical downscaling of daily temperatures (Tmax and Tmin) and precipitation is determined in terms of (1) model's biases in the estimates of median, (2) model's biases in the estimates of variance and (3) model's ability in the simulation of extreme events.

The uncertainty analysis is carried out in three steps. The first step is the analysis of the basic statistical characteristics of the data. This step, commonly called exploratory data analysis (EDA), is recommended when applying statistical tests in the natural sciences (Helsel and Hirsch 2002). A hypothesis test requires the population's distribution to be characterized by certain parameters. For example, many tests rely on the assumption that the population follows a normal distribution, is free of outliers and is not auto-correlated when values are collected at regular time intervals. Non-parametric tests, by contrast, do not make these assumptions, so they are useful for data not meeting the above-mentioned characteristics. The EDA helps in determining the statistical characteristics of the data and therefore in identifying which tests are best suited. The outcomes of the EDA, not detailed here, showed that temperature data (Tmax and Tmin) follow a normal distribution with few outliers, whereas precipitation data are heavily right skewed (i.e. they have a strongly asymmetrical probability distribution, with the mass of the distribution concentrated on the left of the figure) with many outliers (as this is an arid area, the outliers are in fact the records of precipitation, most records being zeros). In addition, the EDA shows the presence of autocorrelation for temperature, which disappears after a few lags. However, no evidence of autocorrelation was found for precipitation records. Hence, the outcomes of the EDA suggest that specific statistical tests should be used.
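Two of the quantities examined in such an EDA, skewness and autocorrelation, can be computed as in the following minimal sketch. The moment-based skewness estimator, the autocorrelation estimator and the sample values are assumptions for illustration; the study does not state which estimators were used.

```python
def skewness(x):
    """Moment-based sample skewness (positive for right-skewed data)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

def autocorr(x, lag=1):
    """Sample autocorrelation of a regularly spaced series at a given lag."""
    n = len(x)
    m = sum(x) / n
    den = sum((v - m) ** 2 for v in x)
    num = sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag))
    return num / den

# Made-up daily records: mostly dry days with two rain events,
# and a slowly varying temperature series
precip = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 0.0, 0.0, 0.0, 3.0]
tmax = [24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0]
print(skewness(precip), autocorr(tmax, lag=1))
```

A strongly positive skewness, as for the precipitation series here, points towards rank-based tests, while a high lag-1 autocorrelation indicates that consecutive days are not independent.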

In a second step, hypothesis tests of median and variance explored the model's ability to reproduce daily observed data (Tmax, Tmin and precipitation) for three different GCMs during the 1979–2000 baseline period. For a single run, SDSM is able to produce several equally likely simulations (in this case, 20). One set of 20 simulations is called an ensemble. The hypothesis tests were based on the ensemble means. Analyses were performed on a seasonal and an annual basis. Seasons are defined as follows: summer (December–February), fall (March–May), winter (June–August) and spring (September–November). A complementary technique, namely frequency analysis, was used in a third step to analyse the model's ability to accurately simulate extreme events. The next sections briefly address the theory related to the steps mentioned above.
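The ensemble mean and the seasonal grouping can be sketched as follows. The data layout (one list per ensemble member, one calendar month per day) and the function names are assumptions for illustration, and only three members are shown instead of 20.

```python
# Austral seasons as defined in the text, keyed by calendar month
SEASON_OF_MONTH = {
    12: "summer", 1: "summer", 2: "summer",
    3: "fall", 4: "fall", 5: "fall",
    6: "winter", 7: "winter", 8: "winter",
    9: "spring", 10: "spring", 11: "spring",
}

def ensemble_mean(members):
    """Day-by-day mean of equally likely ensemble members of equal length."""
    return [sum(day) / len(day) for day in zip(*members)]

def split_by_season(months, values):
    """Group daily values by season, given the calendar month of each day."""
    groups = {"summer": [], "fall": [], "winter": [], "spring": []}
    for month, value in zip(months, values):
        groups[SEASON_OF_MONTH[month]].append(value)
    return groups

# Made-up example: three members and four days
members = [
    [20.0, 21.0, 9.0, 10.0],
    [22.0, 23.0, 11.0, 12.0],
    [21.0, 22.0, 10.0, 11.0],
]
months = [1, 1, 7, 7]
mean_series = ensemble_mean(members)
by_season = split_by_season(months, mean_series)
print(by_season)
```

The seasonal subsets of the ensemble mean and of the observations are then the two samples passed to each hypothesis test.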

3.3.1 Test of equality of median

The Mann–Whitney test is one of the most powerful non-parametric hypothesis tests of the equality of medians of two populations (Mann and Whitney 1947). This test is also known as the two-sample rank test or the two-sample Wilcoxon rank sum test. It is based on the idea that, if the two populations share the same median, the ranks of the pooled observations should be evenly spread between the two samples, so that their rank sums are similar. In this study, a significance level of 5% (p < 0.05) is used. Like other non-parametric statistics, the Mann–Whitney test uses the ranks of the sample data, instead of their specific values, to compute the p value. This makes the test particularly robust against outliers, non-normally distributed and auto-correlated data.
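A minimal sketch of the rank-sum computation is given below. It uses mid-ranks for ties and a normal approximation with tie correction (no continuity correction) for the two-sided p value; these choices, the function names and the sample values are assumptions for illustration, and statistical packages may use exact or differently corrected p values.

```python
from math import erfc, sqrt

def midranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney(x, y):
    """Return (U statistic of x, two-sided p value from a normal approximation)."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # all pooled values identical
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / sqrt(variance)
    return u1, erfc(abs(z) / sqrt(2))

# Made-up observed and downscaled values
observed = [14.2, 15.1, 13.8, 16.0, 15.5, 14.9]
downscaled = [14.0, 15.3, 14.1, 15.8, 15.2, 15.0]
u_stat, p_value = mann_whitney(observed, downscaled)
print(u_stat, p_value)
```

With the 5% significance level used in the study, the hypothesis of equal medians would be rejected when the p value falls below 0.05. The normal approximation is only appropriate for reasonably large samples, which daily series generally are.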

3.3.2 Test of equality of variance

Levene's test is commonly used to test the equality of the variances of two populations. It is based on an analysis of variance of the absolute differences from the mean and is most appropriate when data are normally distributed (Levene 1960). However, as the precipitation series displayed a strong skewness in the preliminary EDA, the more robust Brown–Forsythe test is used in this study (Brown and Forsythe 1974). This is a modification of Levene's test in which the absolute difference from the mean is replaced with the absolute difference from the median. This test appears to be more robust and powerful for skewed data. Using the sample median rather than the sample mean also makes the test more robust for smaller samples (Conover et al. 1981).
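The Brown–Forsythe statistic can be sketched as a one-way analysis of variance applied to the absolute deviations from each group's median. The function name and the sample values are assumptions for illustration; the sketch returns the F statistic and its degrees of freedom only, and the p value would be read from an F distribution, for example with a statistics library.

```python
from statistics import mean, median

def brown_forsythe(*groups):
    """Return (F statistic, (df_between, df_within)) of the Brown-Forsythe test."""
    # Absolute deviations of each value from the median of its own group
    devs = []
    for g in groups:
        med = median(g)
        devs.append([abs(v - med) for v in g])
    k = len(devs)
    n = sum(len(d) for d in devs)
    grand_mean = sum(sum(d) for d in devs) / n
    group_means = [mean(d) for d in devs]
    between = sum(
        len(d) * (m - grand_mean) ** 2 for d, m in zip(devs, group_means)
    ) / (k - 1)
    within = sum(
        sum((v - m) ** 2 for v in d) for d, m in zip(devs, group_means)
    ) / (n - k)
    return between / within, (k - 1, n - k)

# Made-up observed and downscaled samples with different spread
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
downscaled = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat, dof = brown_forsythe(observed, downscaled)
print(f_stat, dof)
```

Replacing the median with the mean in the first loop would give the original Levene statistic, which shows how small the modification is.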

In this study, the equality of variances between observed and downscaled Tmin will be tested for four different GCMs and a significance level of 5% (p < 0.05) is used.

3.3.3 Frequency analysis

Frequency analysis consists of fitting a statistical distribution to observed and downscaled data, and can be used to interpret extreme events. Different statistical distributions can be fitted to the selected data. In this case, as the distribution of extreme events (Tmax, Tmin and precipitation) is of interest, a generalised extreme value (GEV) distribution was used. A GEV distribution of the following form is fitted to the data:

$$ F(x) = \exp\left\{ -\exp\left[ \frac{-\left( x - \xi \right)}{\beta} \right]^{\frac{1}{k}} \right\}, $$

where x is the random variable, and ξ, β and k are, respectively, the location, scale and shape parameters estimated from the sample.

This allows interpreting the ability of the statistical downscaling model to simulate the extreme events with regard to observed data sets at various return periods. When plotting extreme value distributions for modelled data with a number of ensembles, it is possible to plot confidence limits (lower and upper percentiles) around the fitted line. In this study, 20 ensembles were simulated for the 1979–2000 period and the 95% confidence interval boundary is chosen.
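The link between return periods, fitted parameters and ensemble confidence limits can be sketched as follows. The sketch assumes the widely used parameterization F(x) = exp{−[1 + k(x − ξ)/β]^(−1/k)}, with the Gumbel limit for k = 0; sign conventions for the shape parameter differ between references, and this is not necessarily the exact form used in this study. The function names and all parameter values are made up for illustration.

```python
from math import exp, log

def gev_cdf(x, loc, scale, shape):
    """GEV distribution function in the assumed parameterization."""
    if shape == 0:
        return exp(-exp(-(x - loc) / scale))  # Gumbel limit
    t = 1 + shape * (x - loc) / scale
    if t <= 0:
        # Outside the support: below the lower bound (shape > 0)
        # or above the upper bound (shape < 0)
        return 0.0 if shape > 0 else 1.0
    return exp(-t ** (-1 / shape))

def gev_return_level(period, loc, scale, shape):
    """Value exceeded on average once every `period` years (period > 1)."""
    y = -log(1 - 1 / period)
    if shape == 0:
        return loc - scale * log(y)
    return loc + scale / shape * (y ** (-shape) - 1)

def ensemble_band(levels, lower=0.025, upper=0.975):
    """Lower and upper percentiles of a set of values, by linear interpolation."""
    s = sorted(levels)

    def quantile(p):
        pos = p * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    return quantile(lower), quantile(upper)

# Made-up parameters: 20 ensemble members with slightly different locations
levels_50yr = [gev_return_level(50, 30.0 + 0.1 * i, 10.0, 0.1) for i in range(20)]
print(ensemble_band(levels_50yr))
```

Repeating the last step for a range of return periods gives the lower and upper curves drawn around the fitted line, against which the return levels of the observed series can be compared.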

4 Calibration and validation

Before beginning the uncertainty analysis per se, the performance of the model is explored through classical statistical indicators for the calibration and validation phases.

The analysis of the model's performance is twofold: first, its ability to reproduce observed data by simulations based on the NCEP predictors is investigated. Second, its ability to reproduce observed data by simulation based on the three GCMs predictor variables is analysed. The overall performance of SDSM at investigated stations is summarized in Table 5.

Table 5 Performance indicators for calibration and validation of the model for precipitation, Tmax and Tmin

It is clear that Tmax and Tmin are simulated more accurately than precipitation for both NCEP and GCM variables. The calibration process returns similar statistics for temperature and precipitation, with only slightly better values for Tmax and Tmin. However, when the coefficients of determination (R²) of precipitation and temperature are compared for the validation period (NCEP and all GCMs), the superior accuracy of the temperature simulations is obvious. In contrast, R² values for the simulation of precipitation with NCEP and GCM variables are rather low. Likewise, a decrease (increase) of the standard error values for temperature (precipitation) is observed at most stations.
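The two indicators can be illustrated with a short sketch. It assumes that R² is the squared Pearson correlation between observed and simulated values and that the standard error (SE) is the root mean squared difference between them; the exact definitions behind Table 5 are not stated here, so both are assumptions, as are the sample values.

```python
from math import sqrt

def r_squared(obs, sim):
    """Squared Pearson correlation between observed and simulated values."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    sxy = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((s - ms) ** 2 for s in sim)
    return sxy ** 2 / (sxx * syy)

def standard_error(obs, sim):
    """Root mean squared difference between observed and simulated values."""
    return sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

# Made-up values: the simulation follows the observations but is biased
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 4.0, 6.0, 8.0]
print(r_squared(obs, sim), standard_error(obs, sim))
```

In this example R² is close to 1 although the simulated values are twice the observed ones, which is why R² and SE are read together.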

These general observations agree with the conclusions of most works on SDSM and on statistical downscaling in a broader sense: the model tends to underestimate precipitation, while reproducing temperature accurately.

From another perspective, the performance of the model can be assessed according to the predictor dataset used. This is possible for the validation period only, the model being calibrated with the NCEP dataset. First, concerning R² values for temperature, no significant difference is observed between NCEP-based and GCM-based simulations. The parameters, derived from the NCEP predictor variables, seem to be well reproduced by the GCM predictors. In the same way, no major difference could be determined between the different GCMs. In contrast, precipitation simulations based on the NCEP dataset return greater R² and smaller SE values than those based on the GCM predictor variables. This indicates that the parameters estimated during the calibration process do not entirely fit the different GCM predictor variables. As comparison studies tend to rely on only one or two stations for their analysis, it is also important to examine whether the accuracy of the model depends on the station used for the simulation.

Table 5 shows that certain stations return R² values that differ significantly from one another. In the case of La Laguna (3,160 m a.s.l.), the R² value during the calibration process is distinctly smaller than at other stations, especially those located at lower elevations. This variability among stations is also observed for temperature, with high elevation stations such as La Laguna performing better (in terms of R²) than the lower elevation stations Illapel DGA (290 m a.s.l.), La Paloma Embalse (320 m a.s.l.) and Rivadavia (820 m a.s.l.). This confirms a behaviour observed for simulations of temperature and precipitation with SDSM in the region (Rojas 2006; Souvignet et al. 2010). The steep eastward slope in the region, which causes strong gradients in precipitation and temperature, seems to affect the calibration. Nevertheless, during the validation period, the variability of performance between stations disappears.

Hence, despite a careful screening of predictor variables, the overall accuracy of downscaled precipitation remains poor, whereas the model returns good results for downscaled temperature. The use of several stations at different elevations allows conclusions to be drawn about the influence of the slope gradient on the accuracy of the downscaled variables. Indeed, results show that downscaled precipitation is influenced by the slope gradient.

5 Uncertainty analysis

To have confidence in simulations of future temperature and precipitation, one should be convinced of the ability of the downscaling model to simulate observed data accurately based on GCM-derived parameters. This section explores how the downscaling results corresponding to the GCM predictors reproduce the mean, median and extreme events of daily observed records for the 1979–2000 period. The selection of the hypothesis tests of equality was based on the statistical characteristics of observed temperature and precipitation. These characteristics were determined through the EDA, which suggested that, given the uneven distribution of precipitation and possible outliers due to extreme events, robust statistical tests should be used. The next sections present the outcomes of the hypothesis tests of equality of median and variance. Subsequently, a frequency analysis determines whether extreme events are accurately simulated by downscaling results based on the GCM predictors.

5.1 Equality of median

The uncertainty linked with the ability of downscaled data, based on three different GCM predictor sets, to reproduce observed median values has been quantified by applying the non-parametric Mann–Whitney test. This test was applied to all stations on a seasonal basis at the 95% confidence level for temperature and precipitation. The results of the test of equality of median are presented for daily precipitation, Tmax and Tmin in Fig. 3a–c, respectively. Stations satisfying the hypothesis of equality of median between observed and simulated variables are represented by a black dot whose size corresponds to the respective p value. Stations which did not satisfy the hypothesis of equality are not represented. In addition, the respective percentages of rejected tests of equality of median for precipitation, Tmax and Tmin, summarized over all stations, are displayed in Fig. 4a–c. The results are discussed below.
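
As an illustration of the test described above, the following Python sketch computes a two-sided Mann–Whitney p value using the normal approximation with a tie correction. It is not the authors' implementation: the function names `average_ranks` and `mann_whitney_p`, the use of the normal approximation and the absence of a continuity correction are assumptions made here to keep the example self-contained. The tie correction is included because daily precipitation series in an arid regime contain many identical (zero) values.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_p(observed, simulated):
    """Two-sided p value (normal approximation, tie-corrected variance)."""
    n1, n2 = len(observed), len(simulated)
    n = n1 + n2
    pooled = list(observed) + list(simulated)
    ranks = average_ranks(pooled)
    # U statistic of the first sample.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all pooled values identical: no evidence of a shift
        return 1.0
    z = (u1 - mean_u) / sqrt(var_u)
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Applied to the observed and downscaled series of one station and one season, a returned value below 0.05 would correspond to a rejected test at the 95% confidence level.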

Fig. 3

a–c Results of the Mann–Whitney test for the difference of median between observed and downscaled daily precipitation, Tmax and Tmin. Black dots represent no significant difference at the 95% confidence level. The magnitude of the dots is proportional to the test p values for the different stations

Fig. 4

Percentage of rejected tests for a–c the Mann–Whitney test (equality of median) and d–f the Brown–Forsythe test (equality of variance) for daily precipitation, Tmax and Tmin. Note: summer is not displayed for the tests with precipitation

In the case of precipitation, Fig. 3a shows that the statistical downscaling model reproduces the median of daily observed data accurately for the investigated stations at the annual level, with the exception of the station La Laguna. Nevertheless, the comparison of median values in spring indicates that daily observed precipitations are not well reproduced by the model in almost all stations (with the exception of La Tranquilla and La Laguna). In addition, lower p values in the northern part of the study area tend to indicate less confidence in the simulation of median values by downscaled precipitation. The graphical comparison in Fig. 4a shows that the three GCMs do not differ significantly in how the median is reproduced by daily downscaled precipitations. CGCM3, GFDL and MRI return a 14% rejection rate at the annual level and in fall, whereas all three models display a 71% rejection rate in spring. Hence, the comparison of daily median values indicates homogeneity among stations and GCMs with respect to their ability to reproduce daily observed data. This homogeneity also exists at the seasonal level, with the exception of spring. In addition, with the exception of the spring season, precipitation might not be reproduced adequately at high elevations, as indicated by the lower performance of the La Laguna station.
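
The rejection percentages quoted in this section can be obtained from per-station p values as in the short Python sketch below. The function name `rejection_rate` and the example p values are hypothetical and serve only to show the arithmetic. The number of stations is likewise an assumption: rates of 14% and 71% would be consistent with one and five rejections among seven tests, but that count is inferred from the percentages and is not stated in this section.

```python
def rejection_rate(p_values, alpha=0.05):
    """Percentage of tests whose p value falls below the threshold alpha."""
    rejected = sum(1 for p in p_values if p < alpha)
    return 100.0 * rejected / len(p_values)


# Hypothetical p values for seven stations (illustration only).
annual = [0.01, 0.20, 0.35, 0.41, 0.52, 0.66, 0.80]   # one rejection
spring = [0.01, 0.02, 0.03, 0.04, 0.01, 0.30, 0.45]   # five rejections
annual_rate = rejection_rate(annual)
spring_rate = rejection_rate(spring)
```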

In the case of daily Tmax, Fig. 3b shows that observed daily temperature median values are accurately reproduced in all stations by downscaled Tmax at the annual and seasonal level. Compared with downscaled precipitation, downscaled Tmax shows a better ability to reproduce median values. Nevertheless, lesser confidence should be given to certain stations (La Paloma Embalse, Rivadavia and Illapel DGA), which display p values indicating that the hypothesis of equality of median should be rejected, especially in summer and fall. The graphical comparison in Fig. 4b indicates that disparities exist in how the different GCM-based downscaled Tmax reproduce the observed median. Whereas CGCM3 and MRI-based simulations return low rejection rates, GFDL shows a rejection percentage of the hypothesis of equality of median of 67% (50%) for summer (fall). This is also confirmed by the lower p values displayed in Fig. 3b for GFDL. Thus, the comparison of median values between downscaled and observed Tmax shows that uncertainty with SDSM downscaling exists both among stations and among the different GCMs. Consequently, the choice of the GCM and of the station selected for analysis is expected to influence the uncertainty linked with the downscaled values.

In the case of Tmin, Fig. 3c shows that all p values exceed the significance threshold, so that all downscaled Tmin medians are considered equal to their respective observed values. In the same vein, Fig. 4c shows a rejection rate of zero for all seasons and all models. In addition, p values are larger than those of Tmax and precipitation, which points to an excellent agreement between observed and downscaled values. This indicates that a high confidence level exists in the ability of SDSM to simulate Tmin in the region, independently of the season, the model or the station selected.

Hence, results from the Mann–Whitney test show that very high confidence can be given to the ability of downscaled Tmin to reproduce the median of observed data. This remains valid at the seasonal and annual levels, independently of the GCM used or the station considered. However, more uncertainties arise in the case of downscaled Tmax and precipitation. Although downscaled precipitation shows a relative homogeneity among stations and GCMs with regard to their ability to reproduce the median, significant biases are observed in spring and possibly at high elevations. In the case of Tmax, the choice of the GCM from which the predictor variables are taken for downscaling appears to have an influence on the accuracy of the median simulation. In this case CGCM3 and MRI return the best estimates.

5.2 Equality of variance

The uncertainty linked with the ability of downscaled data to reproduce the variance of observed values has been quantified by applying the Brown–Forsythe test. The test was applied to all stations on a seasonal basis at the 95% confidence level for temperature and precipitation. The results of the test of equality of variance are presented for daily precipitation, Tmax and Tmin in Fig. 5a–c, respectively. Stations satisfying the hypothesis of equality of variance between observed and simulated variables are represented by a black dot whose size corresponds to the respective p value. Stations which did not satisfy the hypothesis of equality are not represented. The respective percentages of rejected tests of equality of variance for precipitation, Tmax and Tmin, summarized over all stations, are displayed in Fig. 4d–f. The results are discussed below.
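
The Brown–Forsythe statistic for two samples can be sketched in Python as follows. The statistic is the one-way ANOVA F ratio computed on absolute deviations from each sample's median. The function name `brown_forsythe_two_groups` is hypothetical, and the p value relies on an assumption that is not taken from the paper: for long daily series the F(1, N-2) reference distribution is replaced by its large-sample limit, a chi-square distribution with one degree of freedom, so that no F-distribution routine is needed. For short series an exact F-based p value should be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, median


def brown_forsythe_two_groups(observed, simulated):
    """Return (W, p) for the two-sample Brown-Forsythe test.

    W = (N - 2) * between-group / within-group sum of squares of the
    absolute deviations from each group's median. The p value uses the
    large-sample approximation F(1, N-2) -> chi-square(1).
    """
    deviations = []
    for group in (list(observed), list(simulated)):
        centre = median(group)
        deviations.append([abs(x - centre) for x in group])

    n_total = sum(len(d) for d in deviations)
    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total

    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for d, m in zip(deviations, group_means) for z in d)

    if within == 0:  # degenerate case: no spread of deviations in either group
        return (0.0, 1.0) if between == 0 else (float("inf"), 0.0)

    w = (n_total - 2) * between / within
    # chi-square(1) tail probability expressed through the standard normal.
    p = 2 * (1 - NormalDist().cdf(sqrt(w)))
    return w, p
```

Because the deviations are taken from the median rather than the mean, the statistic is less sensitive to skewed distributions and outliers, which is the property that motivates robust tests for precipitation data.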

Fig. 5

a–c Results of the Brown–Forsythe test (modified Levene's test) for the difference of variance between observed and downscaled daily precipitation, Tmax and Tmin. Black dots represent no significant difference at the 95% confidence level. The magnitude of the dots is proportional to the test p values for the different stations

In the case of daily downscaled precipitation, Fig. 5a shows that SDSM does not reproduce observed variance values accurately for all investigated stations at the annual level. However, the variance is well reproduced by downscaled precipitations in spring for the majority of stations. Nevertheless, in some stations located in the southern part of the region of interest (La Tranquilla and Limahuida), there is no evidence that the variance of downscaled precipitations is reproduced accurately. The graphical comparison in Fig. 4d shows that the three GCMs differ in how well variance is reproduced by daily downscaled precipitations. For instance, CGCM3 returns the lowest rejection rate in fall (57%) and spring (29%). On the other hand, with a 100% rejection rate, downscaled precipitations from the same GCM seem unable to reflect the variance of observed data at the annual level. In the same way, GFDL has the highest rejection rate (86%) in fall. Consequently, the histogram indicates that levels of rejection for the test of equality of variance are high for all GCMs and for all seasons. In addition, there are some differences between the simulations of variance by the GCMs. Hence, the variance of daily observed precipitation is not accurately reproduced, independently of the GCM chosen. Slightly better figures exist for spring. However, this season is of lesser importance for arid mountainous regions, where precipitations occur mainly in winter.

In the case of daily Tmax, Fig. 5b shows that observed variance is well reproduced in all stations. Nevertheless, in Hurtado, p values are below 0.05 for all GCMs in summer and fall. In addition, compared with downscaled precipitation, there is a higher confidence in the ability of downscaled Tmax to reproduce the variance of observed daily data. The graphical comparison in Fig. 4e shows that the variance is well reproduced by downscaled Tmax for all seasons. Nevertheless, the simulations based on CGCM3 return a high rejection rate (83%) of the test of equality of variance in summer. Consequently, uncertainty linked with the variance of downscaled Tmax is influenced by the GCM choice. Overall, GFDL is the best suited model to reproduce the variance of observed data for Tmax in the region.

In the case of daily downscaled Tmin, Fig. 5c shows that the hypothesis of equality was rejected less often in only three stations (Hurtado, La Laguna and Illapel DGA), which indicates that GCM-based predictors reproduce the variance of observed data well at these stations. Nevertheless, the lesser performance with regard to the statistical test in other stations suggests that downscaled Tmin is subject to an important variability among stations. Fig. 4f shows that the highest rejection rates occur in summer, with values ranging from 50% for GFDL to 83% for MRI. MRI-based predictors appear to introduce the most biases in downscaled Tmin variance in all seasons. GFDL is the best performer in terms of non-rejection of the hypothesis of equality of variance. Hence, these results indicate that moderate confidence can be placed in the ability of downscaled daily Tmin to reproduce the variance of observed data in the region. In addition, the GCMs from which predictors are retrieved have an influence on the uncertainty linked with variance simulation. Moreover, disparities between seasons exist, with higher biases in summer for all GCMs.

Hence, considering the outcomes of the Brown–Forsythe test, downscaled precipitation failed to reproduce the variance of observed data, independently of which GCM is chosen. On the other hand, the choice of the GCM appears to introduce biases for downscaled Tmax. In this case, GFDL was found to return the best estimates compared with observed data. With regard to Tmin, moderate confidence can be placed in the downscaled variances. A high uncertainty exists in summer and the choice of the GCM still has an influence, with GFDL appearing again to be the best choice for the region.

5.3 Simulation of extreme events

The following section focuses on the model's ability to simulate extreme events. The results of the frequency analysis, performed for the different GCMs for the 1979–2000 time period, are displayed in Fig. 6a–f for precipitation, in Fig. 6g–l for Tmax and in Fig. 6m–r for Tmin. Because the results for all 19 stations and three GCMs could not be displayed intelligibly, a selection of investigated stations was made for the display of the frequency analysis. Since, as seen previously, the performance indicator (R²) of simulations varies significantly between stations located at different elevations, two stations were selected according to their elevation: one at high altitude (La Laguna, 3,160 m a.s.l.) and one at low altitude (La Paloma Embalse, 320 m a.s.l.). Because the results of the frequency analysis for other stations in the region could be of interest for comparison purposes, outcomes for other stations can be provided upon request.
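
The section does not state which plotting position or distribution underlies the frequency analysis. As a minimal sketch of how empirical return periods can be derived from daily records, the Python example below assumes annual maxima and the Weibull plotting position T = (n + 1) / m, where n is the number of years and m the rank of the value in descending order. Both function names are hypothetical, and these choices are assumptions made for illustration rather than the authors' method.

```python
def annual_maxima(daily_records):
    """Reduce (year, value) pairs to one maximum per year, ordered by year."""
    best = {}
    for year, value in daily_records:
        if year not in best or value > best[year]:
            best[year] = value
    return [best[year] for year in sorted(best)]


def empirical_return_periods(maxima):
    """Pair each annual maximum with its Weibull return period in years.

    The largest value gets rank m = 1 and therefore the longest return
    period, (n + 1) / 1.
    """
    n = len(maxima)
    ranked = sorted(maxima, reverse=True)
    return [((n + 1) / m, value) for m, value in enumerate(ranked, start=1)]
```

With 22 years of data, as in the 1979–2000 period, the longest empirical return period under this plotting position is 23 years, so statements about longer return periods would rest on a fitted distribution. For extreme minimum temperature, annual minima would be ranked in ascending order instead.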

Fig. 6

Frequency analysis for precipitation (a–f), maximum temperature (g–l) and minimum temperature (m–r) in La Laguna (3,160 m a.s.l.) and La Paloma Embalse (320 m a.s.l.) for CGCM3, GFDL and MRI models (1979–2000). Note: all statistical ensembles (20) are included for the GCMs simulations

First, simulations of extreme events will be analysed with regards to precipitation. Subsequently, an analysis of extreme Tmax and Tmin will follow.

The extreme events prediction ability of SDSM for precipitation is displayed for La Laguna in Fig. 6a–c and for La Paloma Embalse in Fig. 6d–f. At both stations and for all three GCMs, downscaled precipitations clearly underestimate historical records. Nevertheless, extreme events with a return period shorter than 20 years are accurately simulated in La Laguna. In contrast, downscaled precipitations in La Paloma are underestimated for shorter return periods (5 years). These results seem to indicate that extreme events with a large return period (>20 years) could not be simulated with confidence in the upper part of the Andes, where the largest rainfall amounts occur. Also, the systematic underestimation of rainfall amounts and the inability to simulate short term extreme events (events with a 5-year return period) at lower elevation is problematic with regard to the ENSO phenomenon. In addition, no evident difference is observed between precipitations downscaled with the three different models. This indicates that a multi-model approach to the simulation of future precipitations in the region might not enhance the model response. Thus, lower reliance levels should be set for extreme precipitation values. This behaviour is explained by the inherent characteristics of statistical downscaling procedures. Indeed, as mentioned by Kim et al. (2000), this type of method performs better “if two conditions are met: (1) there are sufficient historical data for generating probability distribution functions and (2) variables of interest have well defined statistical patterns”. Thus, in the case of our study, in an arid regime with rather few precipitation events per year, it seems that the first condition is hardly met, i.e. there are not enough values to allow for an accurate regression with the NCEP and the GCM data.
Second, the occurrence within the validation data set of a particularly strong El Niño event (1997) and a severe, La Niña-related, drought period (1994–1996) is likely an additional source of variability. This is supported by the studies of Barnston et al. (1999), Kim et al. (2000) and Wood et al. (2002) that show the effect of the 1997 El Niño event, one of the strongest in the last century, on the performance of several climate and downscaling models.

The ability to simulate extreme downscaled maximum temperature is displayed for La Laguna in Fig. 6g–i and for La Paloma Embalse in Fig. 6j–l. It appears that extreme events, based on all ensembles, are not equally well reproduced at the two stations. In addition, it does not appear that the three GCMs have the same ability to simulate downscaled values for Tmax. Indeed, downscaled Tmax is overestimated in La Laguna for the CGCM3 model and in La Paloma Embalse for the GFDL model. In contrast, downscaled Tmax based on MRI predictors performs well in both stations. This indicates that the choice of the GCM might be of some importance with regard to the ability of the model to reproduce downscaled extreme Tmax.

Downscaled extreme minimum temperatures are plotted against observed records in Fig. 6m–o for La Laguna and in Fig. 6p–r for La Paloma Embalse. Results suggest that extreme minimum temperatures are simulated accurately by all models and in both stations. However, for different return periods (<10 years for CGCM3 and GFDL; <20 years for MRI), downscaled Tmin is overestimated in La Laguna. This does not appear in La Paloma Embalse, where extreme events are simulated accurately for all return periods. Hence, in the case of Tmin, the elevation of the station seems to influence the model's ability to simulate extreme events.

Hence, outcomes from the frequency analysis show that the ability of SDSM to simulate extreme events is not always assured. In the case of precipitation, a systematic underestimation of extreme events is observed. Therefore, limited confidence should be placed in the ability of the model to simulate future ENSO phenomena, and this is independent of the GCM chosen. With regard to Tmax and Tmin, higher confidence exists in the model's ability to simulate extreme events. Nevertheless, biases persist, with an overestimation of Tmax, a dependency upon which GCM predictors are used and biases introduced by altitude gradients. This has important implications for arid mountainous regions, where snow and glacier melting are heavily influenced by temperature changes (Pouyaud et al. 2005; Singh et al. 2006).

6 Conclusions

The evaluation of uncertainties linked with downscaled temperature and precipitation variables is crucial for regional and local impact studies. This paper showed the potential of a statistical downscaling technique using predictors from three different GCMs over the highly heterogeneous area of the central Andes in Chile. Biases in median and variance were estimated for downscaled temperature and precipitation. In addition, the ability of the downscaled variables to reproduce extreme events was tested using a frequency analysis.

Despite a careful screening of predictor variables, the overall accuracy of downscaled precipitations remains poor compared to downscaled temperatures, which return more accurate simulations of observed records. Extreme events, of particular relevance in the central Andes, where climate is heavily influenced by the ENSO phenomenon, were systematically underestimated for precipitation. Whereas higher confidence can be placed in the simulation of extreme temperature conditions, the existence of biases due to altitude gradients questions the accuracy of such predictions in regions where snow and ice processes are strongly influenced by temperature gradients. Besides, no evidence was found that a particular GCM was best adapted to the region. This underlines the importance of a multi-model approach for impact studies in the region and, more generally, in areas with heterogeneous surface conditions. Such uncertainties in the downscaled results diminish the confidence that one should have in simulations of future climate scenarios, and raise the question of whether local climate impact studies based on the outcomes of SDSM in the region (e.g. on water resources) are meaningful.

Nevertheless, in general, the simulations produced with the statistical downscaling approach (viz. SDSM) still outperformed raw GCM outputs, which are unable to reproduce the complexity of the central Andes and similar regions.

Testing the outcomes of downscaling is important and, as stated by Blöschl and Montanari (2010), climate impact studies tend to be over-optimistic about their own reliability and over-pessimistic concerning the potential impacts on society. Therefore, evaluating the reliability and quantifying the uncertainties linked with statistical downscaling allows a more serene approach to local climate impact studies. In addition, this study contributed to evaluating which predictors are the most appropriate to the region. However, further work is needed to develop predictors better adapted to precipitation and temperature. Other downscaling techniques should also be explored. For instance, a multi-site artificial neural network approach using a non-linear transfer function might be introduced to map the predictor–predictand relationships. Moreover, the multi-objective fuzzy logic-based classification approach offers the possibility to identify large-scale atmospheric or oceanic patterns which are responsible for wet and dry phenomena in the research area. This method has already been successfully used to identify dry and wet weather patterns in West Africa (Laux et al. 2009).