1 Introduction

With the onset of the Coupled Model Intercomparison Project Phase 5 (CMIP5), a new generation of General Circulation Models (GCMs) has become available to the scientific community. Compared with the former model generation, these ‘Earth System Models’ (ESMs) incorporate additional components describing the atmosphere’s interaction with land use and vegetation, and explicitly take into account atmospheric chemistry, aerosols and the carbon cycle (Taylor et al. 2012). The new model generation is driven by newly defined atmospheric composition forcings: the ‘historical forcing’ for present climate conditions and the ‘Representative Concentration Pathways’ (RCPs, Moss et al. 2010) for future scenarios. The dataset resulting from these global simulations will be the mainstay of future climate change studies and is the baseline of the Fifth Assessment Report (AR5) of the Intergovernmental Panel on Climate Change (IPCC). Moreover, this dataset is the starting point for several regional downscaling initiatives that generate regional climate change scenarios and are being coordinated worldwide within the framework of the COordinated Regional Climate Downscaling EXperiment (CORDEX) (Jones et al. 2011). These initiatives use both dynamical and statistical downscaling (SD) approaches to provide high-resolution information over a specific region of interest (e.g. Europe or Africa) at the spatial scale required by many impact studies (Fowler et al. 2007; Maraun et al. 2010; Winkler et al. 2011a, b). This is done either by running a Regional Climate Model (RCM), driven by a GCM at its lateral boundaries, or by applying empirical relationships, usually found between large-scale reanalysis data and small-scale station data, to GCM output (Giorgi and Mearns 1991).
The basic assumption when applying downscaling methods in this context is that the ESMs closely reproduce the observed climatology of the large-scale variables used as predictors/drivers in statistical/dynamical schemes (Hewitson and Crane 1996; Timbal et al. 2003; Charles et al. 2007; Plavcova and Kysely 2012).

In this study, we provide a comprehensive evaluation of the new GCM generation from a downscaling perspective, taking into account the requirements of both statistical and dynamical approaches. To this end, we test the ability of seven ESMs to reproduce present-day climate conditions as represented by ERA-Interim reanalysis data (Dee et al. 2011). This is hereafter referred to as the ‘performance’ of the ESMs (Giorgi and Francisco 2000). ERA-Interim is used as the reference for evaluating ESM performance, not because it is assumed to be superior to other reanalysis products, but because it is the one used within the CORDEX initiative (http://wcrp-cordex.ipsl.jussieu.fr). The models’ performance is assessed by testing their ability to reproduce the mean and cumulative distribution function of season-specific daily data, hereafter jointly referred to as the ‘climatology’.

The study focusses on middle-tropospheric circulation, temperature and humidity variables, which are of particular importance for downscaling since they are either used as predictor variables in statistical schemes (Cavazos and Hewitson 2005; Sauter and Venema 2011; Brands et al. 2011b) or form the lateral boundaries in dynamical applications (Fernández et al. 2007; Laprise 2008). In order to test ESM performance in different climate regions, we consider a large spatial domain covering Europe and Africa. Specific information for the dynamical downscaling approach is provided by assessing ESM performance along the lateral boundaries of the three domains used in the Euro-CORDEX, Med-CORDEX and CORDEX-Africa initiatives.

In downscaling studies, reanalysis products are commonly used as a surrogate for observational data. However, reanalyses are known to suffer from biases with respect to observations and consequently can differ significantly over certain regions (see Brands et al. 2012, and references therein). As outlined by Sterl (2004), the difference between two distinct reanalysis datasets is a reasonable estimator of observational uncertainty, especially when an accepted observational dataset for the variables in question is not available. Albeit seldom assessed in downscaling studies (Koukidis and Berg 2009; Hofer et al. 2012), reanalysis uncertainty is relevant for (1) the evaluation of ESM performance and (2) the applicability of the downscaling methods themselves. With respect to (1), large differences between reanalyses indicate that ESM performance is sensitive to the choice of reanalysis used as reference for validation and, consequently, cannot be objectively assessed (Gleckler et al. 2008). With respect to (2), calibrating SD methods and coupling RCMs require the large-scale predictor/boundary data to reflect ‘real’ atmospheric processes (Fernández et al. 2007; Koukidis and Berg 2009; Hofer et al. 2012). Strictly speaking, downscaling is not applicable in regions where reanalysis uncertainty is large, since the latter assumption does not hold there. Therefore, apart from assessing ESM performance, we provide a simple estimate of reanalysis uncertainty by calculating the climatological differences between an additional reanalysis product, the Japanese Reanalysis JRA-25 (Onogi et al. 2007), and ERA-Interim. Note that a comprehensive assessment of this issue, which would involve a comparison with observations, is beyond the scope of the present paper.

Our results are expected to be of value for the downscaling community because little to no information on the relative performance of the CMIP5-ESMs is available at a time when this community has to decide which ESMs to rely on. Our approach provides a general overview of ESM performance on hemispheric to continental scales and, as such, is not meant to replace studies on synoptic-scale performance (Maraun et al. 2012). The additional assessment of reanalysis uncertainty is an update of Brands et al. (2012), who assessed the differences between ECMWF ERA-40 (Uppala et al. 2005) and NCEP/NCAR reanalysis 1 (Kalnay et al. 1996) from a downscaling perspective, and is meant to foster the scientific discussion on this important issue within the downscaling community.

2 Data

The study area considered in this work is shown in Fig. 1. It extends from the Arctic to South Africa and from the Central Atlantic to the Ural Mountain Range and the Arabian Peninsula, covering the Euro-CORDEX, Med-CORDEX and CORDEX-Africa domains.

Fig. 1 Geographical domain considered in the study (black dots) and lateral boundaries of the Euro-CORDEX, Med-CORDEX and CORDEX Africa domains; solid and dashed squares refer to the exterior and interior of these boundaries

We use data from the seven ESMs listed in Table 1, which were obtained from the Earth System Grid Federation (ESGF) gateways of the German Climate Computing Center (http://ipcc-ar5.dkrz.de), the Program for Climate Model Diagnosis and Intercomparison (http://pcmdi3.llnl.gov), and the British Atmospheric Data Centre (http://cmip-gw.badc.rl.ac.uk). Since we evaluate performance in present climate conditions, we use CMIP5 experiment number ‘3.2 historical’ (Taylor et al. 2012). This new generation of control runs is forced by observed atmospheric composition changes of both natural and anthropogenic nature in the period 1850–2005. The first historical run of the available ensemble was chosen for the variables listed in Table 2. These variables are standard predictors in statistical downscaling studies (Hanssen-Bauer et al. 2005; Cavazos and Hewitson 2005), and they are also taken into account for defining the lateral boundaries in the process of nesting an RCM into a GCM.

Table 1 CMIP5 Earth System Models used in this study
Table 2 Variables used in this study

As reference dataset for assessing ESM performance, we use the European Centre for Medium-Range Weather Forecasts ERA-Interim reanalysis (Dee et al. 2011). As a second quasi-observational dataset, the Japan Meteorological Agency JRA-25 reanalysis (Onogi et al. 2007) is used for comparison with ERA-Interim in order to obtain an estimate of reanalysis uncertainty (see Sect. 3 for more details).

Due to distinct native horizontal resolutions (see Table 1), the reanalysis and ESM data were regridded to a regular 2.5° grid by using bilinear interpolation, which is a common step in downscaling and GCM performance studies. The period under study is 1979–2005. Daily mean values were used and, when not provided by the original data set, were derived from 6-hourly instantaneous values.
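The two preprocessing steps described above can be illustrated with a minimal, dependency-free Python sketch. The function names (`daily_means`, `bilinear`), the assumption of four instantaneous values per day, and the restriction to ascending, non-periodic coordinate axes are illustrative choices; the study does not specify its implementation, and in practice an array or regridding library would normally be used.

```python
from bisect import bisect_right


def daily_means(six_hourly, steps_per_day=4):
    """Average consecutive groups of `steps_per_day` instantaneous values."""
    if len(six_hourly) % steps_per_day:
        raise ValueError("series length must be a multiple of steps_per_day")
    return [
        sum(six_hourly[k:k + steps_per_day]) / steps_per_day
        for k in range(0, len(six_hourly), steps_per_day)
    ]


def bilinear(lons, lats, field, lon, lat):
    """Bilinearly interpolate field[j][i] (lat index j, lon index i) to (lon, lat).

    Assumption: both axes are ascending and non-periodic, and the target
    point lies inside the source grid.
    """
    # Index of the grid cell whose south-west corner is (lons[i], lats[j]).
    i = min(max(bisect_right(lons, lon) - 1, 0), len(lons) - 2)
    j = min(max(bisect_right(lats, lat) - 1, 0), len(lats) - 2)
    # Fractional position of the target inside that cell.
    tx = (lon - lons[i]) / (lons[i + 1] - lons[i])
    ty = (lat - lats[j]) / (lats[j + 1] - lats[j])
    south = (1 - tx) * field[j][i] + tx * field[j][i + 1]
    north = (1 - tx) * field[j + 1][i] + tx * field[j + 1][i + 1]
    return (1 - ty) * south + ty * north
```

Regridding a whole field to the 2.5° grid would then amount to calling `bilinear` once per target grid point and time step.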

3 Methods

The methodological approach followed in this study is twofold. First, to evaluate the degree of reanalysis uncertainty, atmospheric variables from JRA-25 are validated against those from ERA-Interim. Due to the lack of observational datasets for free-tropospheric variables on daily timescales, the difference between two distinct reanalyses is a reasonable estimator of observational uncertainty. If close agreement is found, both reanalyses are likely driven by assimilated observations and reasonably reflect reality. In contrast, in case of considerable differences, at least one reanalysis is dominated by internal model variability rather than observations and therefore does not reflect reality (Sterl 2004). Consequently, validating JRA-25 against ERA-Interim does not yield an ‘error’ in the sense of one reanalysis being ‘better’ than the other, but is interpreted as an estimate of reanalysis uncertainty.

Second, ESM performance in present climate conditions is assessed by validating the ESMs listed in Table 1 against ERA-Interim. At this point, the results obtained from the first step allow testing whether the degree of reanalysis uncertainty permits assessing ESM performance in an objective manner. In case of large reanalysis uncertainties, ESM performance cannot be objectively assessed since it is sensitive to reanalysis choice. In contrast, in case of negligible reanalysis uncertainties, ESM performance is not sensitive to reanalysis choice and applying JRA-25 as reference for validation would lead to similar results.

The first measure for evaluating reanalysis uncertainty and ESM performance in this study is the mean difference (bias). Since the variability of the applied daily time series is much smaller in the tropics than in the mid-latitudes, the bias is normalized by the standard deviation of ERA-Interim to make results comparable (Brands et al. 2011b). This is hereafter referred to as ‘normalized bias’ or, when applied to the two reanalyses, ‘normalized mean difference’.
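As a minimal sketch of this measure, the normalized bias for a single grid box and season could be computed as follows. The function name `normalized_bias` and the use of the sample standard deviation (n − 1 in the denominator) are assumptions made for illustration; the text does not state which estimator was used.

```python
from statistics import fmean, stdev


def normalized_bias(candidate, reference):
    """Mean difference (candidate minus reference) in units of the
    reference standard deviation.

    `candidate` is an ESM (or JRA-25) series, `reference` the ERA-Interim
    series for the same grid box and season. The sample standard
    deviation is an assumption of this sketch.
    """
    return (fmean(candidate) - fmean(reference)) / stdev(reference)


# Toy example: the candidate is shifted by +1 relative to a reference
# whose standard deviation is 2, giving a normalized bias of 0.5.
reference = [0.0, 2.0, 4.0]
candidate = [1.0, 3.0, 5.0]
print(normalized_bias(candidate, reference))
```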

To detect distributional differences, we apply the two-sample Kolmogorov–Smirnov test (KS test) to the original time series and to the time series centered to have zero mean, which are obtained by subtracting the seasonal mean from each time step. For simplicity, the resulting time series will hereafter be referred to as ‘centered’. Validating centered time series is equivalent to removing the mean difference and, consequently, permits detecting distributional differences in higher order moments. Note that comparing centered ESM data to centered ERA-Interim data is one possible way of correcting the mean error of the ESM, which is commonly done in statistical downscaling studies (Wilby et al. 2004) and has recently also been proposed for the dynamical downscaling approach (Colette et al. 2012; Xu and Yang 2012).
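The centering step can be sketched in a few lines of Python. The function name `center` is hypothetical; the example only illustrates that two series differing by a constant offset become identical once each is centered, so that any remaining difference must lie in higher order moments.

```python
from statistics import fmean


def center(series):
    """Subtract the (seasonal) mean so that the series has zero mean."""
    mean = fmean(series)
    return [x - mean for x in series]


# Two toy series with the same shape but a mean offset of 10.
reference = [1.0, 2.0, 3.0]
shifted = [11.0, 12.0, 13.0]
print(center(reference) == center(shifted))  # the offset is removed
```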

The KS test is a non-parametric hypothesis test assessing the null hypothesis (\( H_0 \)) that two candidate samples (here: reanalysis and ESM series for a particular grid box and season of the year) come from the same underlying theoretical probability distribution. It is defined by the statistic:

$$ \text{KS-statistic} = \max_{i=1,\ldots,2n} |E(z_i) - I(z_i)| $$
(1)

where n is the length of the time series, \( z_i \) denotes the i-th data value of the sorted joined sample, and E and I are the empirical cumulative frequencies from a given ESM (or JRA-25, in case reanalysis uncertainty is assessed) and the ERA-Interim reanalysis, respectively, the latter serving as reference for validation in any case. This statistic is bounded between zero and one, with low values indicating distributional similarity. In this study we use the p value of this statistic as a measure of distributional similarity. Decreasing p values indicate increasing confidence in distributional differences between both series. Note that a base 10 logarithmic transformation is applied to the p values in order to better indicate the different significance levels, \( 10^{-1} \), \( 10^{-2} \) and \( 10^{-3} \), corresponding to increasing confidence (90, 99 and 99.9 %, respectively) in the dissimilarity of the distributions.
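To make Eq. 1 and the logarithmic p value concrete, the following pure-Python sketch computes the two-sample KS statistic and an approximate \( \log_{10} \) p value. The function names are hypothetical, and the p value is obtained from the asymptotic Kolmogorov distribution, which is an assumption of this sketch: the text does not state which approximation was used. The sample sizes are passed explicitly so that effective sample sizes can be substituted for the nominal ones.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs,
    evaluated at every value of the joined sample (Eq. 1)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for z in a + b:
        ecdf_a = bisect_right(a, z) / len(a)
        ecdf_b = bisect_right(b, z) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


def ks_log10_p(d, n_a, n_b, terms=100):
    """log10 of the asymptotic two-sample KS p value.

    n_a and n_b may be effective sample sizes. Assumption: the p value
    follows the asymptotic Kolmogorov series
    Q(lam) = 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * lam**2).
    """
    lam = math.sqrt(n_a * n_b / (n_a + n_b)) * d
    if lam < 0.2:
        return 0.0  # p is indistinguishable from 1 in this range
    q = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return math.log10(min(max(q, 1e-300), 1.0))
```

A grid box would then be flagged as significantly different at the 5 % level when `ks_log10_p(...)` falls below `math.log10(0.05)`, i.e. roughly −1.301.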

Since the daily time series applied here are serially correlated, we calculate their effective sample size before estimating the p value of the KS statistic in order to avoid committing too many type I errors (i.e. erroneous rejections of \( H_0 \)). Under the assumption that the underlying time series follow a first-order autoregressive process, the effective sample size, \( n^* \), is defined as follows (Wilks 2006):

$$ n^* = n\frac{1-p_1}{1+p_1} $$
(2)

where n is the sample size and \( p_1 \) is the lag-1 autocorrelation coefficient.
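Eq. 2 can be sketched as follows. The function names and the particular estimator of the lag-1 autocorrelation (lagged products of deviations from the overall mean, divided by the total sum of squares) are assumptions made for illustration; other standard estimators give slightly different values for short series.

```python
from statistics import fmean


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation estimated from deviations about the overall mean."""
    mean = fmean(series)
    dev = [x - mean for x in series]
    numerator = sum(dev[t] * dev[t + 1] for t in range(len(dev) - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


def effective_sample_size(series):
    """Effective sample size n* = n (1 - p1) / (1 + p1) for an AR(1) process (Eq. 2)."""
    n = len(series)
    p1 = lag1_autocorrelation(series)
    return n * (1.0 - p1) / (1.0 + p1)


# Toy example: a smoothly increasing series is positively autocorrelated,
# so its effective sample size is smaller than its nominal length of 5.
print(effective_sample_size([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The resulting \( n^* \) would replace n when the p value of the KS statistic is estimated.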

Unless stated otherwise in the text, all of the above-mentioned validation measures are applied at the grid-box scale, using season-specific time series.

4 Results

In this section we first assess reanalysis uncertainty (by comparing JRA-25 with ERA-Interim) and then evaluate ESM performance (by comparing the ESMs with ERA-Interim). The normalized bias is applied to assess reanalysis differences and ESM errors in the mean of the distribution. Then, to detect reanalysis differences and ESM errors in higher order moments, we apply the KS test to the centered time series. Note that in the latter case the degrees of freedom are reduced by one, which is a negligible problem since \( n^* \) is of the order of several hundreds in any case.

Finally, model performance for the original (i.e. non-transformed) data is specifically assessed along the lateral boundaries of the three CORDEX domains defined in Fig. 1, which is of particular interest for the dynamical downscaling community. Unless RCMs are nudged to the large scale information (von Storch et al. 2000), ESM performance in the interior of the aforementioned domains is less important for the purpose of dynamical downscaling, since the corresponding atmospheric variability is simulated by the RCM, which is driven by the ESM at the boundaries of the domain only.

4.1 Reanalysis uncertainty

In Fig. 2, the results of validating JRA-25 against ERA-Interim in boreal winter (DJF, first and second column) and summer (JJA, third and fourth column) are mapped for the variables SLP, T2, T850, Q850, U850, V850, T500 and Z500 (from top to bottom). The first and third columns show the mean difference between JRA-25 and ERA-Interim, normalized by the standard deviation of ERA-Interim (Bias/Std). The second and fourth columns display the logarithm to base 10 of the KS statistic’s p value (KS pVal), obtained from applying the KS test to the centered time series. Recall that using centered data at this point permits detecting reanalysis uncertainties in higher order moments. Values below −1.3 indicate that distributional differences in higher order moments are significant (α = 0.05), whereas values exceeding this threshold indicate differences that are not significant (see the white area in the panels). For simplicity, the latter will hereafter be referred to as ‘perfect’ distributional similarity. A grid box is marked with a black dot if significant distributional differences for the original data disappear when applying the KS test to the centered time series, thereby indicating that reanalysis uncertainty is restricted to a shift in the mean of the distribution.
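The decision logic behind the shading and the black dots can be summarised in a short sketch. The function name, the dictionary keys and the constant name are hypothetical; the only quantities taken from the text are the significance level α = 0.05 and the comparison of the logarithmic p values for the original and the centered series.

```python
import math

# log10(0.05) is approximately -1.301, the threshold quoted in the text.
LOG10_ALPHA = math.log10(0.05)


def classify_grid_box(log10_p_original, log10_p_centered):
    """Return how a grid box would be drawn in the KS pVal panels.

    'shaded'    : differences in the centered series are significant.
    'black_dot' : differences are significant for the original series
                  but vanish after centering (pure shift in the mean).
    """
    significant_original = log10_p_original < LOG10_ALPHA
    significant_centered = log10_p_centered < LOG10_ALPHA
    return {
        "shaded": significant_centered,
        "black_dot": significant_original and not significant_centered,
    }


# A box with a pure mean shift: significant before centering, not after.
print(classify_grid_box(-3.0, -0.5))
```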

Fig. 2 Columns 1 and 3: mean differences between JRA-25 and ERA-Interim, normalized by the standard deviation of the latter. Columns 2 and 4: p value (on a logarithmic scale) of the KS test applied to the time series from JRA-25 and ERA-Interim, both centered to have zero mean. Grid boxes are whitened if the logarithmic p value exceeds the threshold value of −1.301, i.e. if the distributional differences are not significant (α = 0.05). Colour darkening corresponds to increasing (and significant) distributional differences/reanalysis uncertainties. Grid boxes marked with a black dot indicate areas where significant distributional differences for the original reanalysis data are eliminated by using the centered time series

Reanalysis uncertainty for SLP (see row 1 in Fig. 2) is negligible north of 45° N and clearly depends on season in the Northern Hemisphere subtropics (25° N–45° N), where it is more (less) pronounced in JJA (DJF). Over Africa (and especially in JJA), SLP in JRA-25 is much lower than in ERA-Interim, while the opposite is the case over the adjacent ocean areas. Consequently, JRA-25 is characterized by a more pronounced land-sea pressure gradient than ERA-Interim. For the Southern and Northern Hemisphere mid-latitude oceans, reanalysis differences are negligible.

Reanalysis uncertainty for T2 (see row 2 in Fig. 2) is more widespread than for any other variable under study, with JRA-25 being systematically warmer than ERA-Interim. Exceptions to this general result occur over land areas north of 45° N and the northern Arctic Ocean, where differences are negligible or even negative during DJF and MAM (MAM is not shown).

As was the case for SLP, reanalysis uncertainty for T850 (see row 3 in Fig. 2) is most pronounced over Africa and negligible over the Northern Hemisphere extratropics (with the exception of the Scandinavian Mountains in DJF and Greenland in all seasons). Along the ascending branch of the Hadley Cell, JRA-25 is considerably warmer than ERA-Interim, while the opposite is the case for the large-scale subsidence zones. Interestingly, the resulting meridional tripole structure (JRA-25 colder, JRA-25 warmer, JRA-25 colder) follows the seasonal march of the Hadley Cell.

The tripole difference structure found for T850, as well as its associated seasonality, also appears in Q850 (see row 4 in Fig. 2). Along the ascending branch of the Hadley Cell, JRA-25 is drier than ERA-Interim, while the opposite is the case along the descending branches. Except for central-to-east Europe and the northern North Atlantic, differences for Q850 are substantial over the whole study area.

For U850 and V850 (see rows 5 and 6 in Fig. 2), reanalysis uncertainty is generally weaker than for the other variables under study and, in the extratropics, is confined to regions of high orography. During the core of the monsoon season (JJA), U850 and V850 over West Africa are weaker in JRA-25 than in ERA-Interim, while over East Africa the sign of the difference is more heterogeneous.

Considerable reanalysis uncertainties for T500 (see row 7 in Fig. 2) are mainly confined to the Tropics. In DJF, JRA-25 is generally colder than ERA-Interim (exception: western South Africa), whereas in JJA it is colder near the Equator but warmer over the semi-arid to arid regions of the Northern Hemisphere.

Finally, although reanalysis uncertainty for Z500 (see row 8 in Fig. 2) is generally lower than for any other variable under study, considerable differences are found over the tropics and subtropics. Over Africa and the tropical oceans, and especially during DJF and MAM, Z500 in JRA-25 is lower than in ERA-Interim. In conjunction with higher values in the area of the St. Helena High, the meridional gradient for Z500 over the South Atlantic is more pronounced in JRA-25 than in ERA-Interim.

When applying the KS test to centered/zero-mean data, no significant distributional differences are detected for SLP, T500 and Z500. For T850 and T2, the area of significant distributional differences is reduced to Central Africa (Congo Basin), where it follows the seasonal march of the Hadley Cell, as was the case for the original data (see Fig. 2, columns 2 and 4). For U850 and V850, this area is confined to high-orography regions and, in case of V850, to the Guinea Coast (with a widespread error in JJA, i.e. during the core of the summer monsoon). For Q850, significant distributional differences are essentially removed in the extratropics, while large areas of significant differences remain over the South Atlantic, Tropical Africa and, with a considerable error magnitude (i.e. low p value), over the Indian Ocean.

As a preliminary conclusion to bear in mind when interpreting the results of the next section, the mean difference between JRA-25 and ERA-Interim generally exceeds a magnitude of one standard deviation for central-to-south Africa. Even if the data are centered to have zero mean, i.e. if differences in the mean are removed, significant differences in higher order moments remain. Consequently, it is neither possible to objectively assess ESM performance for central-to-south Africa, nor does the basic assumption of ‘real’ or ‘perfect’ large-scale data hold in this region.

In contrast to the tropics, reanalysis uncertainty in the extratropics is generally negligible and the above-mentioned problems may consequently be ignored, meaning that ESM performance can be assessed and the basic downscaling assumption can be affirmed.

4.2 Performance maps

Figures 3, 4, 5, 6, 7, 8, 9 and 10 show the results of validating the 7 ESMs listed in Table 1 against ERA-Interim for SLP, T2, T850, Q850, U850, V850, T500 and Z500, respectively. Columns 1 and 2 (3 and 4) refer to the results for DJF (JJA). For each season we show the bias normalized by the standard deviation of ERA-Interim (Bias/Std), as well as the logarithmic p value of the KS statistic (KS pVal) obtained from the centered/zero-mean data. For ease of comparison, the corresponding panels for reanalysis uncertainty (copied from Fig. 2) are displayed at the bottom of each figure.

Fig. 3 Columns 1 and 3: mean differences between the seven ESMs listed in Table 1 and ERA-Interim, normalized by the standard deviation of ERA-Interim. Columns 2 and 4: p value (on a logarithmic scale) of the KS test applied to the time series from the respective ESM and ERA-Interim, both centered to have zero mean. Grid boxes are whitened if the logarithmic p value exceeds the threshold value of −1.301, i.e. if the distributional differences are not significant (α = 0.05). Colour darkening corresponds to increasing (and significant) distributional differences/ESM errors. Grid boxes marked with a black dot indicate areas where significant ESM errors in the original data are eliminated by using the centered time series; results for SLP. For ease of comparison, the corresponding panels for reanalysis uncertainty (copied from Fig. 2) are displayed at the bottom of the figure

Regarding the ESM error for SLP (see Fig. 3), the meridional pressure gradient in the Northern Hemisphere (NH) extratropics during DJF and MAM is too strong in CanESM2, IPSL-CM5A-MR, MIROC-ESM, MPI-ESM-LR and NorESM1-M (MAM is not shown). In JJA, CanESM2 and CNRM-CM5 suffer from too low SLP values over a large fraction of the land areas. For MIROC-ESM, MPI-ESM-LR and NorESM1-M, and in the light of considerable reanalysis uncertainty, both the Sahara Heat Low and the St. Helena High are too weak during JJA, leading to an underestimation of the land-sea pressure gradient during the West African rainy season. Over the extratropical North Atlantic, SLP during JJA is systematically overestimated by all ESMs except MPI-ESM-LR and CanESM2, the latter two showing more heterogeneous spatial patterns.

The T2 bias is generally larger and more widespread than the bias at 850 hPa (compare Figs. 4, 5). The aforementioned exaggerated meridional pressure gradient during boreal winter and spring is associated with westerlies that are too strong in the Northern Hemisphere mid-latitudes. These lead to an exaggerated advection of oceanic air masses, resulting in too mild and moist conditions in continental Europe, an effect that extends throughout the whole planetary boundary layer (see Figs. 4, 5, 6 for T2, T850 and Q850 respectively).

Fig. 4 As Fig. 3, but for T2

Fig. 5 As Fig. 3, but for T850; green grid boxes refer to lack of data at the ESGF portals

Fig. 6 As Fig. 3, but for Q850; empty panels and green grid boxes refer to lack of data at the ESGF portals

During the core of the West African monsoon (JJA), and as revealed by U500 (not shown), the ESMs simulate a Subtropical Jet that is too strong and an African Easterly Jet (Cook 1999) that is too weak, with NorESM1-M performing best for these features. The monsoonal winds over West Africa, as represented by U850 in JJA, are underestimated over the Sahel but overestimated over the subhumid to humid zones along the Guinea Coast in all ESMs except IPSL-CM5A-MR, which underestimates this variable over the entire region (see Fig. 7). Also reflected in U850 is the above-mentioned overestimation of the wintertime westerlies in the North Atlantic-European region. In general, the bias for U850 is larger and more widespread than for V850 (compare Figs. 7, 8).

Fig. 7 As Fig. 3, but for U850; green grid boxes refer to lack of data at the ESGF portals

Fig. 8 As Fig. 3, but for V850; green grid boxes refer to lack of data at the ESGF portals

For all ESMs except IPSL-CM5A-MR, a cold bias was found in the middle troposphere (see Fig. 9), which covers a large fraction of the domain under study in any season and, with the exception of CanESM2 and IPSL-CM5A-MR, is associated with an underestimation of the geopotential at 500 hPa over the Tropics (see Fig. 10).

Fig. 9 As Fig. 3, but for T500

Fig. 10 As Fig. 3, but for Z500; empty panels refer to lack of data at the ESGF portals

One would expect the spatial pattern of the normalized ESM error to be independent of the spatial pattern of the normalized reanalysis difference. Remarkably, however, a considerable agreement between both types of patterns is found in central-to-south Africa, at least for some variables. To mention an example, the pattern of reanalysis uncertainty for T850 (see Fig. 5, JRA-25 is warmer than ERA-Interim over central Africa) is approximately matched by a warm bias in all of the 7 ESMs under study (compare last row to remaining rows in Fig. 5). This points to a substantial error in the reference data set (ERA-Interim) for this specific region. This error, however, cannot be conclusively deduced from our analyses, since this would require a more thorough verification against independent station and/or radiosonde data.

For all applied variables, ESM performance largely improves when applying centered time series (see columns 2 and 4 in Figs. 3, 4, 5, 6, 7, 8, 9, 10). In case of SLP, errors in higher order moments are detected over the high-orography regions of the Middle East (for CanESM2, IPSL-CM5A-MR and MIROC-ESM in at least one season of the year), over the Red Sea and adjacent land areas (MIROC-ESM in JJA and SON, the latter season not shown), the Mediterranean (MIROC-ESM, NorESM1-M and MPI-ESM-LR in JJA), South Africa (CanESM2, IPSL-CM5A-MR and MIROC-ESM in SON and/or DJF) and West Africa (CNRM-CM5 in JJA). The best overall performance is obtained for HadGEM2-ES, which, at least in case of SLP, does not suffer from errors in higher order moments at all (see Fig. 3, row 3, columns 2 and 4).

In case of the centered T850 data (see Fig. 5), all ESMs except CanESM2 and HadGEM2-ES suffer from significant distributional differences over the tropics, the Southern Hemisphere subtropics and the North Atlantic, while errors for T2 (see Fig. 4) are more widespread and additionally cover the Southern Hemisphere mid-latitudes. Interestingly, HadGEM2-ES again outperforms any other ESM for both T850 and T2, the performance of CanESM2 being comparable in case of T850.

Regarding the centered U850 and V850 data (see Figs. 7, 8), performance is generally better for U850. Errors in higher order moments appear over the tropics and subtropics. Large inter-model differences are found for both variables, with HadGEM2-ES and IPSL-CM5A-MR performing clearly better than the remaining ESMs.

Although the errors in T500 are largely reduced by using centered data, CanESM2, MIROC-ESM, and NorESM1-M suffer from errors in higher order moments along the ascending branch of the Hadley Cell in JJA (see Fig. 9). In IPSL-CM5A-MR, this error type appears during DJF between the Azores and the Bay of Biscay, while it is virtually absent in HadGEM2-ES.

As shown in Fig. 10, ESM errors for Z500 disappear almost completely for the centered data.

4.3 Performance along the lateral boundaries of the CORDEX domains

Figure 11 displays the medians (bars) of the samples formed by the absolute normalized differences along the lateral boundaries (LB) of the 3 CORDEX domains shown in Fig. 1. From top to bottom (left to right) the results for different variables (LBs) are shown, while the season-specific results are displayed within each panel (see x-axes). For reasons of simplicity, the interquartile ranges (IQRs) are not shown since they are roughly proportional to their respective medians (i.e. the higher the median, the broader the IQR).
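The summary statistic shown in Fig. 11 can be sketched as follows. The function names and the description of a lateral boundary as the outer ring of a rectangle in grid-index space are simplifying assumptions made for illustration; the actual CORDEX boundaries (including their exterior and interior limits, see Fig. 1) are not reproduced here.

```python
from statistics import median


def rectangle_boundary(j0, j1, i0, i1):
    """Index pairs (j, i) on the outer ring of a rectangular index box.

    Hypothetical stand-in for a lateral boundary; j is the latitude
    index, i the longitude index, both bounds inclusive.
    """
    return [
        (j, i)
        for j in range(j0, j1 + 1)
        for i in range(i0, i1 + 1)
        if j in (j0, j1) or i in (i0, i1)
    ]


def boundary_median(normalized_bias_map, boundary_cells):
    """Median of the absolute normalized differences along the boundary."""
    return median(abs(normalized_bias_map[j][i]) for j, i in boundary_cells)


# Toy 3 x 3 map of normalized biases; the centre cell is the 'interior'.
bias_map = [[-1.0, 2.0, -3.0],
            [4.0, 0.0, -6.0],
            [7.0, -8.0, 9.0]]
cells = rectangle_boundary(0, 2, 0, 2)
print(boundary_median(bias_map, cells))
```

One such median per variable, season, domain and model (or JRA-25) would correspond to one bar in the figure.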

Fig. 11 Median of the absolute normalized mean differences between JRA-25 and ERA-Interim (reanalysis uncertainty, first bar in each panel) and between the ESMs and ERA-Interim (ESM errors, remaining bars) along the lateral boundaries of the three CORDEX domains shown in Fig. 1. Left: Euro-CORDEX, middle: Med-CORDEX, right: CORDEX Africa. Results are shown for all seasons; grey bars indicate lack of availability at the ESGF portals. Due to the larger error magnitude, y-axes have been stretched for T2 and T500

It is remarkable that ESM performance along the lateral boundaries of the 3 domains is generally very similar, i.e. the models do not perform systematically worse for any single domain compared to the other two. For any domain under study, ESM performance is best for V850, followed by U850, and is worst for T2 and T500 (note the distinct scaling of the y-axis for the latter two). Intermodel performance differences are smallest for U850 (except over the African domain) and V850 and generally larger for the remaining variables. Also, intermodel performance differences for the Med-CORDEX and CORDEX Africa domains are more pronounced than for the Euro-CORDEX domain. While MPI-ESM-LR and HadGEM2-ES are among the best models in any case, MIROC-ESM and IPSL-CM5A-MR generally perform worse, the remaining ESMs lying in between in most cases. Interestingly, for the CORDEX Africa domain, ESM performance (and reanalysis agreement) along the lateral boundaries is systematically better than in the interior of the domain.

5 Discussion and conclusions

This study has shown that distributional differences between free tropospheric circulation, temperature and humidity data from JRA-25 and ERA-Interim are comparable to those obtained from validating the ESMs against ERA-Interim in central-to-south Africa. This questions the basic downscaling assumption of ‘real’ or ‘perfect’ reanalysis data and hinders the objective evaluation of ESM performance (Gleckler et al. 2008) in these regions.

The reason behind the differences cannot be inferred from our analyses. However, the large differences between JRA-25 and ERA-Interim over central-to-south Africa are consistent with Betts et al. (2009), who found ERA-Interim to be cold-biased over the Amazon basin compared to in-situ station data. Moreover, the cold bias of ERA-Interim over African tropical regions, which was systematically found against JRA-25 and the 7 ESMs, indicates that ERA-Interim might not reflect ‘real’ atmospheric conditions in that area and that, in a strict sense, it should not be applied there for the purpose of downscaling. This should be a warning sign for the CORDEX Africa community, indicating that the errors of the downscaled time series may originate from the driving reanalysis, apart from being caused by SD or RCM errors.

In contrast, reanalysis uncertainty for the Northern Hemisphere extratropics is negligible, which (1) affirms the above-mentioned basic downscaling assumption and (2) permits assessing ESM performance. A largely overestimated meridional pressure gradient was found in 5 out of 7 ESMs during boreal winter and spring, leading to too mild and moist conditions in continental Europe. This is in agreement with van Ulden and van Oldenborgh (2006) and Vial and Osborn (2011), who found serious circulation biases and an underestimation of the frequency and duration of wintertime atmospheric blocking in most CMIP3-GCMs. Consequently, artificial feedback processes in the scenario period resulting from ESM errors in the control/historical period (Raisanen 2007) cannot be ruled out for Europe.

HadGEM2-ES and MPI-ESM-LR generally outperform the remaining models along the lateral boundaries of the Euro-CORDEX, Med-CORDEX and CORDEX Africa domains, which is in qualitative agreement with Brands et al. (2011a), who validated the former versions of these models over southwestern Europe. The systematic superiority of these models questions the paradigm of equiprobable treatment of the driving models in downscaling studies.

For the CORDEX Africa domain, ESM performance and reanalysis agreement along the lateral boundaries are systematically better than in the interior of the domain, which might be one argument against the use of RCM nudging (von Storch et al. 2000). In this context, it is worth mentioning that GCM control runs nudged to reanalysis data (Eden et al. 2012) fail to reproduce the temporal variability of observed precipitation in the tropics (where reanalysis uncertainty is large) whereas they perform well in the extratropics (where reanalysis uncertainty is low). This indicates that the success of nudging GCMs (and also RCMs) to reanalysis data might critically depend on the degree of reanalysis uncertainty.

The final message is that many of the errors found in the CMIP3-GCMs are still present in current Earth System Models. For instance, the systematic domain-wide cold bias in the middle troposphere found in this study is consistent with John and Soden (2007), who found similar results for the CMIP3-GCMs. Thus, the shortcomings and corresponding recommendations for working with GCM data in the context of downscaling (Wilby et al. 2004) remain valid for the new model generation.