1 Introduction

Global climate models (GCMs) are a primary tool not only in providing insight into global climate responses to natural and anthropogenic changes, but also in assessing potential impacts of climate change on local hydrologic systems and water resources (Christensen and Lettenmaier 2007; Kim et al. 2007; Liu et al. 2015; Hay et al. 2014). Most climate change impact studies require fine-resolution input data sufficient to model and simulate climate-relevant problems at regional or local scales, whereas GCMs provide climate information at a coarse resolution, in general on ~1° to 2° grids (Olsson et al. 2001; Dibike and Coulibaly 2006; Fowler et al. 2007). In addition, simulated outputs from climate models (global and regional) have shown systematic biases with respect to observational data sets (Mearns et al. 2012; Sillmann et al. 2013) mainly due to unresolved sub-grid scale processes (Cherubini et al. 2002), physical parameterizations (Jenkins and Lowe 2003; Mizuta et al. 2006), and cascading errors from boundary forcing in regional climate models (Deque et al. 2007; Christensen et al. 2007).

To provide relevant climate information at scales needed to assess local and regional impacts, downscaling techniques are often employed (Wilby et al. 1998; Haylock et al. 2006; Hidalgo et al. 2008; Abatzoglou and Brown 2011; Stoner et al. 2013). Between the two main categories of downscaling (dynamical and statistical), many prior studies have applied statistical downscaling methods, which have advantages in computational efficiency and ability to reproduce essential statistics of regional observed climate data (Bardossy et al. 2005; Haylock et al. 2006; Eum and Simonovic 2012).

Recently, model output statistics (MOS)-based methods such as bias-correction/spatial disaggregation (BCSD, Wood et al. 2004), bias-correction/constructed analogue (BCCA, Maurer et al. 2010), multivariate adaptive constructed analogs (MACA, Abatzoglou and Brown 2011), bias-correction/stochastic analog (BCSA, Hwang and Graham 2013), and bias-correction/climate imprint (BCCI, Hunter and Meentemeyer 2005) have been actively applied to downscale GCM outputs to local scales. Given this plethora of techniques, inter-comparison studies have been conducted to identify the more robust downscaling methods with regard to extremes (Segui et al. 2010; Goodess et al. 2012; Thrasher et al. 2012; Bürger et al. 2013; Werner and Cannon 2016), wildfire applications (Abatzoglou and Brown 2011), and water resources assessment (Eum et al. 2010; Chen et al. 2011; Gutmann et al. 2014; Rana and Moradkhani 2016; Mizukami et al. 2016). We selected four statistical downscaling methods, (1) BCSD, (2) BCCA, (3) MACA, and (4) BCCI, because they have been widely used in natural hazards, hydrology, and water sectors such as water resources planning (Brown et al. 2012; Miller et al. 2013), integrated water management (Hanson et al. 2014), climate change impact studies (Salathé et al. 2007; Brekke et al. 2009; Werner et al. 2013), and wildfire warning systems (Abatzoglou and Brown 2011). In addition, the selected techniques have been well tested and documented (Gutmann et al. 2014), and downscaled data archives are available for the US (at http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/dcpInterface.html#Limitations and http://climate.northwestknowledge.net/MACA/downloadTools.php) and Canada (http://tools.pacificclimate.org/dataportal/downscaled_gcms/map/). This study applied the four statistical downscaling methods to South Korea and assessed the characteristics of each method based on a suite of evaluation metrics.

Formulating precise and informative evaluation metrics is a prerequisite to accurately quantifying statistical robustness and performance in statistical downscaling inter-comparison studies. Hayhoe (2010) grouped performance metrics from 466 recent journal articles into five categories: (1) climatological biases, (2) correlation, (3) variance, (4) extremes, and (5) persistence. At the same time, a standardized evaluation framework for intercomparison studies was suggested to provide pertinent, transparent and independent statistical and physical tests across three categories: (1) mean values and trends, (2) thresholds, exceedance probabilities, and quantiles, and (3) multi-day persistence. An evaluation framework working group of scientists and practitioners in the National Climate Predictions & Projections (NCPP) platform (refer to https://earthsystemcog.org/projects/downscaling-2013/metrics for more details) also suggested an expanded evaluation framework built from three groups of proposed metrics, related to uniform comparison, application-specific comparison, and process-related issues, respectively. In particular, the first group proposed time-series related, distribution related, temporal-structure, and spatial-correlation metrics, while the second group suggested indices related to accumulated parameters such as cold and dry spells, heat waves, etc. The third group suggested process-related indices such as monsoon and weather typing of extreme events. In accordance with the NCPP evaluation framework for uniform comparison, Murdock et al. (2013) evaluated statistical downscaling methods using three diagnostics: sequencing of events, distribution of values, and spatial structure. Bürger et al. (2013) intercompared multiple methods in terms of their ability to reproduce the climate extremes indices recommended by the World Meteorological Organization’s Expert Team on Climate Change Detection and Indices (ETCCDI) (Zhang et al. 2011).
Integrating the performance metrics and diagnostics employed in prior studies, we suggest a suite of evaluation metrics grouped into five categories: (1) time-series related, (2) distribution related, (3) multi-day persistence, (4) extremes, and (5) spatial structure. More details are discussed in the methodology section. We use the Climate Forecast System Reanalysis (CFSR) version 2 (Saha et al. 2010) for training and validation, which places more focus on the skill of the methods rather than on the fidelity of one or more climate models (Wilby et al. 2000; Fasbender and Ouarda 2010; Nicholas and Battisti 2012).

In prior studies, downscaling methods showed different levels of skill for different performance metrics, which makes selecting the most skillful method overall difficult and potentially ambiguous. As a branch of multi-criteria decision making (MCDM), the technique for order of preference by similarity to ideal solution (TOPSIS), originally developed by Hwang and Yoon (1981), provides a tool to sort alternatives that are simultaneously far from the worst solution and close to the best condition (Triantaphyllou 2000). TOPSIS is a compensatory procedure that compares a set of alternatives by calculating the geometric distance of each alternative from an ideal solution, under the assumption that the criteria are monotonically increasing or decreasing (Garvey 2008). Therefore, it is easy not only to define the best and worst solutions but also to apply the method and order the preference of alternatives (Ozturk and Batuk 2011; Chu 2002). The TOPSIS method has successfully been applied to water sectors in South Korea, for instance to identify hydrologically vulnerable locations (Chung and Lee 2009; Jun et al. 2013) and alternatives for watershed management (Kim and Chung 2014). Using the TOPSIS technique, therefore, we present an approach to rank multiple statistical downscaling methods based on the full suite of evaluation metrics suggested in this study. The resulting ranks may help end-users select the most suitable method from an available inventory.

Therefore, the main objectives of this study are to: (1) develop and apply the four statistical downscaling methods in South Korea, (2) intercompare the skill of each method on the five evaluation criteria, and (3) rank the methods using the TOPSIS technique.

2 Study area and climate data sets

We developed and applied the four statistical downscaling methods over South Korea (~100,210 km2), a country with complex topography including islands (approximately 3000 km2) and the southern portion of the Korean peninsula (refer to Fig. 1a). Because the spatial resolutions of GCMs are too coarse to describe regional climate characteristics over South Korea, downscaling plays a crucial role in regional applications.

Fig. 1

Schematic maps of a elevation and b surface networks for AWS over South Korea

For historical validation, we selected the CFSR data set at 1.0 degree grid spacing over the period from 1979 to the present (Saha et al. 2010). The latest CFSR is a coupled (i.e. atmosphere, ocean, land, and sea ice) reanalysis system with an interactive sea ice model, assimilation of satellite radiance data for the entire period, and relatively high horizontal and vertical resolutions. We downscaled daily precipitation (PRCP), maximum temperature (TMAX) and minimum temperature (TMIN) from CFSR by the four statistical downscaling methods to the observational station network.

In South Korea, the Automatic Weather Station (AWS) and the Automated Synoptic Observing System (ASOS) networks provide observational climate data with inter-station spacing of approximately 12 km (refer to Figs. 1b, 2). ASOS provides long-term climate data (more than 30 years) whereas AWS stations have only operated since 2000, thereby providing a short-term data set of only ~10 to 15 years. In general, a reliable long-term observed data set is a prerequisite for statistical downscaling models that reflect the climate variability of the historical period. In this study, 60 stations (Fig. 2) that provide nine observational climate variables (PRCP, TMAX, TMIN, average air temperature, dew point temperature, average and maximum wind speed, and average and minimum humidity) with few missing values over a common period from 1973 to 2010 were selected. Although leave-one-year-out cross validation with 38 years can avoid the drawback of a short data record for model validation, it can provide overly optimistic estimates of skill due to year-to-year persistence related to low frequency climate variability such as the El Niño Southern Oscillation (ENSO). Therefore, we adopted a split-sample validation approach, with the first 19 years for calibration and the last 19 years for validation.

Fig. 2

ASOS stations (green) and CFSR (black) grid points located over South Korea

3 Methodology

3.1 Simple spatial interpolation (SSI)

As the simplest downscaling method, simple spatial interpolation (SSI) produces downscaled values at each station by the inverse distance technique (Lapen and Hayhoe 2003). Downscaled values therefore preserve the bias of the climate model, which varies with the level of topographical complexity at each station. This means that the skill of SSI is entirely dependent on the skill of the climate model. Consequently, we can quantitatively evaluate the effects of the statistical downscaling methods by measuring their improvement in skill relative to SSI.
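The inverse-distance step of SSI can be sketched as follows; this is a minimal illustration, assuming grid and station coordinates are given in a common projected coordinate system, and the function name and power-of-2 weighting exponent are our own choices, not specified by the source:

```python
import numpy as np

def idw_downscale(grid_xy, grid_vals, station_xy, power=2.0):
    """Inverse-distance-weighted interpolation of coarse-grid values
    to station locations (simple spatial interpolation, SSI)."""
    # pairwise distances: (stations, grid points)
    d = np.linalg.norm(grid_xy[None, :, :] - station_xy[:, None, :], axis=2)
    d = np.maximum(d, 1e-12)           # avoid division by zero at co-located points
    w = d ** -power                    # inverse-distance weights
    return (w * grid_vals[None, :]).sum(axis=1) / w.sum(axis=1)
```

A station halfway between two grid points receives the average of their values, so SSI inherits any bias present in the coarse field, as noted above.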

3.2 Daily bias-correction/spatial disaggregation (BCSD)

BCSD was originally developed to downscale GCM output to provide input data for a macro-scale process-based hydrologic model (Wood et al. 2004). Traditionally, BCSD downscales at a monthly time step and temporally disaggregates to a daily time step by randomly selecting daily sequences from the historical data set (Maurer and Hidalgo 2008; Werner et al. 2013). The use of BCSD has been extended to directly downscale at daily scale (Abatzoglou and Brown 2011; Thrasher et al. 2012). We employ daily BCSD for the intercomparison of statistical downscaling methods in this study. Because daily BCSD performs spatial disaggregation and bias-correction with daily coarse-resolution climate data, this method does not require temporal disaggregation. As a result, it maintains the daily spatial and temporal structure of the coarse-resolution data. Therefore, the skill of the coarse-resolution historical data plays an important role in the performance of daily BCSD.

Daily BCSD first spatially disaggregates coarse-resolution data to a finer resolution by an inverse distance interpolation scheme (Lapen and Hayhoe 2003), and then applies bias-correction by a quantile mapping algorithm that equates empirical cumulative distribution functions (CDFs) of observed and modeled data (F o and F m , respectively),

$$\hat{x}_{m} (t) = F_{o}^{ - 1} [F_{m} \{ x_{m} (t)\} ]$$
(1)

where \(\hat{x}_{m} (t)\) and \(x_{m} (t)\) are bias-corrected and modeled data at time t, respectively. Note that CDFs, F o and F m , are formulated by observed and modeled data during historical period in this study. Although the majority of the model outputs can be corrected by Eq. (1), some extreme cases may be out of the range of the historical data. For these cases, the CDFs can be estimated by parametric distribution functions and then corresponding values are extrapolated (Wood et al. 2004). For precipitation, Gumbel distributions are employed for high precipitation events whereas a normal distribution is employed for temperature. To formulate the sample CDFs, we use observational and modeled daily data within 15-day moving windows centered on each calendar day.
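The empirical quantile mapping of Eq. (1) can be sketched as below. This is a minimal illustration using the sample CDFs only; the parametric tail extrapolation (Gumbel/normal) and the 15-day moving window described above are omitted, and the function name `quantile_map` is our own:

```python
import numpy as np

def quantile_map(x_model, obs_hist, model_hist):
    """Empirical quantile mapping (Eq. 1): map each model value to the
    observed value at the same empirical non-exceedance probability."""
    obs_sorted = np.sort(obs_hist)
    model_sorted = np.sort(model_hist)
    # F_m: empirical CDF of the model training data, evaluated at x_model
    p = np.searchsorted(model_sorted, x_model, side="right") / model_sorted.size
    p = np.clip(p, 1.0 / obs_sorted.size, 1.0)
    # F_o^{-1}: empirical quantile function of the observations
    return np.quantile(obs_sorted, p)
```

For a model series with a constant bias, the corrected values recover the observed distribution; values outside the training range would require the parametric extrapolation described in the text.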

3.3 Bias-correction/climate imprint (BCCI)

A spatial climate imprint which represents the correspondence between the coarse and fine resolutions is obtained from the mean value at each grid point for the calibration period from 1973 to 1991. The fine-resolution historical data are aggregated to the CFSR spatial resolution prior to bias correction by quantile mapping. Ratios of daily model output relative to the mean values at the coarse resolution are calculated for PRCP and then spatially interpolated to the fine (station-based) resolution. The interpolated ratios are multiplied by the mean values at each station to obtain downscaled values [Eq. (2)]. For TMAX and TMIN, differences between daily model output and the long-term mean values are used instead of the ratios [Eq. (3)]. We have

$$P_{\text{downscaled}} (t) = R_{\text{interpolated }} P_{\text{ave}} ,$$
(2)
$$T_{\text{downscaled}} (t) = T_{\text{ave}} + D_{\text{interpolated }} ,$$
(3)

where P downscaled is the downscaled PRCP, R interpolated is the spatially interpolated ratio at each station, P ave is the long-term mean precipitation at each station, T downscaled is the downscaled TMAX or TMIN, D interpolated is the spatially interpolated difference at each station, and T ave is the long-term mean temperature at each station. As a post-process, BCCI applies quantile mapping again at the fine scale, as in daily BCSD.
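The precipitation imprint step [Eq. (2)] can be sketched as below. This is a minimal illustration assuming a precomputed grid-to-station interpolation weight matrix `W` (our own construction; the text interpolates the ratios spatially without specifying the weights):

```python
import numpy as np

def bcci_precip(coarse_daily, coarse_mean, W, station_mean):
    """Climate-imprint downscaling for precipitation (Eq. 2).
    coarse_daily : (T, G) daily model precipitation on the coarse grid
    coarse_mean  : (G,)  long-term mean at each coarse grid point
    W            : (S, G) interpolation weights from grid to stations (rows sum to 1)
    station_mean : (S,)  long-term observed mean at each station
    """
    ratio = coarse_daily / np.maximum(coarse_mean, 1e-6)  # daily ratio to climatology
    ratio_interp = ratio @ W.T                            # spread ratios to stations
    return ratio_interp * station_mean                    # re-impose station climatology
```

For TMAX and TMIN, the same structure applies with additive differences in place of ratios [Eq. (3)].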

3.4 Bias correction/constructed analogue (BCCA)

A primary assumption of analogue downscaling methods is that weather patterns in the historical record can be used to represent those in the future (Lorenz 1969; van den Dool 1994). By relating coarse- and fine-scale historical weather patterns to a specified GCM weather event, useful coarse-scale analogues can be constructed and used for downscaling (Timbal et al. 2003; van den Dool et al. 2003; Diez et al. 2005). Because a linear combination of multiple patterns, rather than a single analogue (Fernández and Sáenz 2003), provides added value in forecasts (van den Dool et al. 2003), a best-fit analogue is constructed from a linear combination of multiple coarse-resolution historical weather patterns (called the library) that match the bias-corrected GCM weather pattern on a given day. The fine-resolution historical data are aggregated to the climate model resolution to form the library. Then, the constructed analogue weights at the coarse resolution are applied to the fine-resolution weather patterns on those days to produce downscaled outputs. As Maurer and Hidalgo (2008) found that the greatest skill was obtained by using precipitation and temperature as predictors, the library is composed of precipitation and temperature patterns in this study. Bias-correction is applied to the coarse-resolution GCM data using quantile mapping. Hence, following Maurer et al. (2010), absolute values rather than anomalies are used to identify analogues in BCCA.

Mathematically, the target pattern (T m ) should be defined, as shown in Eq. (4), by the regression coefficient matrix (A analogue) and analogue patterns (C analogue) composed of the best 10 days selected based on root mean square error (RMSE) between the target pattern and the historical weather patterns in the library during the calibration period.

$${\mathbf{T}}_{m} = {\mathbf{C}}_{\text{analogue}} {\mathbf{A}}_{\text{analogue}}$$
(4)

Applying the Moore–Penrose inverse to C analogue, A analogue can be obtained by Eq. (5).

$${\mathbf{A}}_{\text{analogue}} = \left[ {\left( {{\mathbf{C^{\prime}}}_{\text{analogue}} {\mathbf{C}}_{\text{analogue}} } \right)^{ - 1} {\mathbf{C^{\prime}}}_{\text{analogue}} } \right]{\mathbf{T}}_{m}$$
(5)

For precipitation, the matrix to be inverted in Eq. (5) is often near-singular, for example when the domain is dominated by dry grid cells. Therefore, we employ the ridge regression technique (Tikhonov et al. 1995) with a small penalty to solve Eq. (5) in these cases. The same regression coefficient matrix (A analogue) is applied to the fine-resolution weather patterns on the same days that correspond to the dates of the constructed analogue (C analogue) at the coarse resolution. That is, the downscaled values (V downscaled) are obtained by Eq. (6).

$${\mathbf{V}}_{\text{downscaled}} = {\mathbf{V}}_{\text{analogue}} [({\mathbf{C^{\prime}}}_{\text{analogue}} {\mathbf{C}}_{\text{analogue}} )^{ - 1} {\mathbf{C^{\prime}}}_{\text{analogue}} ]{\mathbf{T}}_{m}$$
(6)

where V analogue is a constructed fine-resolution analogue, i.e. spatial pattern of climate variables over the 60 stations.
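The constructed-analogue steps [Eqs. (4)-(6)] can be sketched as below. The analogue count of 10 matches the text, while the ridge penalty value and the function name are illustrative assumptions:

```python
import numpy as np

def constructed_analogue(target, library_coarse, library_fine,
                         n_analogues=10, ridge=0.1):
    """Constructed-analogue downscaling (Eqs. 4-6) with a ridge penalty.
    target         : (G,)   coarse-resolution pattern for the day to downscale
    library_coarse : (N, G) coarse-resolution historical patterns (the library)
    library_fine   : (N, S) matching fine-resolution (station) patterns
    """
    # select the n best analogue days by RMSE against the target pattern
    rmse = np.sqrt(((library_coarse - target) ** 2).mean(axis=1))
    idx = np.argsort(rmse)[:n_analogues]
    C = library_coarse[idx].T            # (G, n) analogue matrix C_analogue
    V = library_fine[idx].T              # (S, n) same days at fine resolution
    # ridge-regularized least squares for the analogue weights (Eq. 5)
    A = np.linalg.solve(C.T @ C + ridge * np.eye(len(idx)), C.T @ target)
    return V @ A                         # Eq. (6): apply weights at fine resolution
```

With a negligible penalty, a target that exactly matches a library day reproduces that day's fine-resolution pattern; the ridge term stabilizes the solve when the analogue matrix is near-singular, as for dry precipitation domains.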

3.5 Multivariate adaptive constructed analogs (MACA)

Mathematically, MACA follows the same procedure as BCCA except that it incorporates additional variables into the analogues at both coarse and fine resolutions. An epoch adjustment that removes and re-introduces the difference of mean values between the current and future time slices (e.g. 1990s and 2050s) in the original MACA methodology (Abatzoglou and Brown 2011) is skipped in this study because we downscale the CFSR data set over a relatively stationary historical period. After analyzing correlations between variables (Table 1), MACA is conducted for PRCP with minimum humidity and for TMAX and TMIN with average temperature to improve coherence in spatial weather patterns on a given day. Thus, we can evaluate the added-value of MACA, which introduces more variables into the constructed analogues, relative to BCCA.

Table 1 Averaged correlation coefficients between climate variables at the ASOS stations

3.6 Assessment of downscaling skill

The downscaled daily PRCP, TMAX, and TMIN from CFSR for the validation period from 1992 to 2010 are assessed using five criteria, comprising 23 ETCCDI indices (refer to Table 2 for details) and spatial correlation for PRCP, TMAX and TMIN, as presented in Table 3. In Table 3, quantitative evaluations of the 23 ETCCDI indices use the Euclidean distances of the errors in mean and standard deviation between observed and downscaled (simulated) ETCCDI indices at the 60 ASOS stations. Note that all metrics are normalized by Eq. (7) and then the Euclidean distances are evaluated by Eq. (8).

Table 2 Expert Team on Climate Change Detection and Indices (ETCCDI) indices employed in this study
Table 3 A list of evaluation framework comprised of five criteria
$$\mu_{E} = \frac{{\mu_{s} - \mu_{o} }}{{\mu_{o} }}, \sigma_{E} = \frac{{\sigma_{s} - \sigma_{o} }}{{\sigma_{o} }} ,$$
(7)
$$D = \sqrt {\mu_{E}^{2} + \sigma_{E}^{2} } ,$$
(8)

where μE and σE are the normalized errors of the mean and standard deviation, subscripts s and o represent simulated and observed values, respectively, and D represents the Euclidean distance of the errors.
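Eqs. (7) and (8) translate directly into a short evaluation routine (a minimal sketch; the function name is ours):

```python
import numpy as np

def error_distance(sim, obs):
    """Normalized errors of mean and standard deviation (Eq. 7) combined
    into a Euclidean distance (Eq. 8); lower values indicate better skill."""
    mu_e = (sim.mean() - obs.mean()) / obs.mean()                     # Eq. (7), mean
    sigma_e = (sim.std(ddof=1) - obs.std(ddof=1)) / obs.std(ddof=1)   # Eq. (7), std
    return np.sqrt(mu_e ** 2 + sigma_e ** 2)                          # Eq. (8)
```

For identical simulated and observed series the distance is zero; doubling both the mean and the spread gives normalized errors of 1 each and a distance of √2.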

The first metric measures the ability to reproduce time-series related indicators, for instance annual precipitation, trends in annual averages, annual summer and frost days, and so on. Among them, τ is estimated by the Mann–Kendall trend test and represents the strength of the trend in each variable. The second metric measures the ability to reproduce the distribution of values, estimated by the Kolmogorov–Smirnov (K–S) D statistic, the supremum of the distances between the empirical distributions of the observed and downscaled data. This study evaluates the K–S D for each month and then uses the average value over all months as the evaluation metric. The third metric measures the ability to reproduce multi-day persistence, via consecutive dry days (CDD) and consecutive wet days (CWD) for precipitation, the warm spell duration index (WSDI) for TMAX, and the cold spell duration index (CSDI) for TMIN. The fourth metric measures the ability to reproduce extremes, i.e. maximum, minimum, and threshold-exceedance values. The fifth metric measures the ability to reproduce observed spatial patterns, quantified by the RMSE of spatial correlations between the observed and downscaled data sets. In other words, we calculated correlations between stations for the observed and downscaled data sets individually (1830 station pairs for 60 stations in this study). Then, the RMSE of spatial correlations was calculated from the differences in correlation between the observed and downscaled data, and the RMSEs were compared across all statistical downscaling methods. As with the second metric, the RMSEs of spatial pattern are evaluated every month and the average value is used in this study.
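The spatial-structure criterion (RMSE of inter-station correlations) can be sketched as below; the function name is our own:

```python
import numpy as np

def spatial_corr_rmse(obs, sim):
    """RMSE of inter-station correlations (fifth criterion).
    obs, sim : (T, S) daily series at S stations; compares the S*(S-1)/2
    unique station-pair correlations of the downscaled data with observations."""
    iu = np.triu_indices(obs.shape[1], k=1)            # unique station pairs
    r_obs = np.corrcoef(obs, rowvar=False)[iu]
    r_sim = np.corrcoef(sim, rowvar=False)[iu]
    return np.sqrt(((r_sim - r_obs) ** 2).mean())
```

For 60 stations this yields the 1830 station pairs noted above; a downscaled data set that exactly reproduces the observed correlation structure scores zero.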

3.7 Technique for order of preference by similarity to ideal solution (TOPSIS)

A goal of this study is to identify the most suitable technique among the four statistical downscaling methods considered in this study for South Korea. The use of multiple performance criteria means that methods may perform well in some areas but not others, potentially leading to ambiguity when making a final recommendation. Therefore, we introduce the use of TOPSIS, a systematic process for determining suitable alternatives to a multi-criteria problem. Because of its simplicity, TOPSIS has been applied to a range of problems, including a business model comparison (Zhou et al. 2012) and watershed management (Chung and Lee 2009; Jun et al. 2013; Lee et al. 2013). TOPSIS starts by creating an evaluation matrix, (x ij ) m × n for the ith alternative and jth criterion. Then, the weighted normalized decision matrix, (t ij ) m × n is given by

$$(t_{ij} )_{m\, \times \,n} = (w_{j} n_{ij} )_{m\, \times \,n} \quad i = 1,\,2, \ldots ,m; \quad j = 1, \,2, \ldots , n,$$
(9)
$$n_{ij} = \frac{{x_{ij} }}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{m} x_{ij}^{2} } }} ,$$
(10)

where n ij is the matrix normalized from x ij by vector normalization [Eq. (10)], and w j is the weight on the jth criterion (Hwang and Yoon 1981). The weights reflect the preferences of decision makers, i.e. the relative importance of the criteria under consideration. However, objectively assigning the weights is not straightforward, because the appropriate weighting depends strongly on the aims of a particular task. Therefore, invoking the law of large numbers to determine a sufficient number of weighting combinations without assuming a distribution of weights, we generated 10,000 combinations of weights on the five criteria, including equal weighting (=1/n). The use of 10,000 combinations ensures that the mean weight for each criterion is within 0.01 of the population mean at the 95 % confidence level. We then ranked the statistical downscaling methods under each combination to identify a robust method among the four. Next, the worst (A w ) and best (A b ) conditions are determined [refer to Eqs. (11) and (12)] in order to calculate the Euclidean distances of each alternative i from the worst and best conditions, d iw and d ib , using Eqs. (13) and (14), respectively.

$$A_{w} = \left\{ {\left\langle {\max (t_{ij} |i = 1,\,2, \ldots ,m)|j \in J_{ - } } \right\rangle ,\left\langle {\min (t_{ij} |i = 1,\,2, \ldots ,m)|j \in J_{ + } } \right\rangle } \right\} \equiv \{ t_{wj} |j = 1,\,2, \ldots ,n\} ,$$
(11)
$$A_{b} = \left\{ {\left\langle {\min (t_{ij} |i = 1,\,2, \ldots ,m)|j \in J_{ - } } \right\rangle ,\,\left\langle {\max (t_{ij} |i = 1,\,2, \ldots ,m)|j \in J_{ + } } \right\rangle } \right\} \equiv \{ t_{bj} |j = 1,\,2, \ldots ,n\} ,$$
(12)
$$d_{iw} = \sqrt {\mathop \sum \limits_{j = 1}^{n} (t_{ij} - t_{wj} )^{2} } \quad i = 1,2, \ldots ,m ,$$
(13)
$$d_{ib} = \sqrt {\mathop \sum \limits_{j = 1}^{n} (t_{ij} - t_{bj} )^{2} } \quad i = 1,\,2, \ldots ,m .$$
(14)

For the five performance criteria (n = 5) from the four downscaling methods (m = 4) in this study, for example, (x ij )4×5 is evaluated and then the weighted normalized matrix (t ij ) can be obtained. According to the rescaled metrics (t ij ) and the characteristics of each performance metric, the worst and best conditions (A w and A b ) are determined. In this study, all performance criteria are in J − because lower values indicate better performance. For instance, t w1 and t b1 are the maximum and minimum values of the time-series related criterion across the four downscaling methods, respectively. Then, using Eqs. (13) and (14), the Euclidean distances from the worst and best conditions are calculated for each downscaling method.

As a final step in TOPSIS, we rank all alternatives based on the similarity [Eq. (15)] for determining suitable alternatives or excluding worst alternatives by introducing a cut-off threshold.

$$s_{iw} = \frac{{d_{iw} }}{{d_{iw} + d_{ib} }}, \quad 0 \le s_{iw} \le 1, \quad i = 1,2, \ldots ,m .$$
(15)

When s iw  = 1, the alternative coincides with the best condition, and s iw  = 0 when the alternative coincides with the worst condition. That is, the statistical downscaling method whose s iw is nearest to 1 is the most suitable and robust method, and vice versa.
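The full TOPSIS procedure [Eqs. (9)-(15)] can be sketched as below, using the standard vector normalization; the function name and interface are our own:

```python
import numpy as np

def topsis(X, w, cost=None):
    """Rank alternatives by TOPSIS (Eqs. 9-15).
    X    : (m, n) evaluation matrix, m alternatives x n criteria
    w    : (n,) criterion weights summing to 1
    cost : boolean mask of cost criteria (lower is better); defaults to all
           cost, as for the error-distance criteria in this study
    Returns the similarity s_iw for each alternative; closer to 1 is better.
    """
    if cost is None:
        cost = np.ones(X.shape[1], dtype=bool)
    T = w * X / np.sqrt((X ** 2).sum(axis=0))              # Eqs. (9)-(10)
    worst = np.where(cost, T.max(axis=0), T.min(axis=0))   # Eq. (11)
    best = np.where(cost, T.min(axis=0), T.max(axis=0))    # Eq. (12)
    d_w = np.sqrt(((T - worst) ** 2).sum(axis=1))          # Eq. (13)
    d_b = np.sqrt(((T - best) ** 2).sum(axis=1))           # Eq. (14)
    return d_w / (d_w + d_b)                               # Eq. (15)
```

In this study the routine would be called 10,000 times with different weight vectors `w`, and the resulting similarities accumulated into ranks for the four downscaling methods.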

4 Results and discussion

4.1 Time-series related criterion

Performance evaluation of the four statistical downscaling methods for all variables is shown in Table 4, where all values are averaged over the 60 ASOS stations. The performance of SSI for SDII, the simple daily precipitation intensity index, is the poorest, which may be related to a drizzling bias in the CFSR precipitation fields; that is, very small amounts of precipitation are generated in the CFSR precipitation data (Demirel and Moradkhani 2016). The skill for the ID indicator, icing days, is very low for all methods, but the four statistical downscaling methods improve on SSI. Interestingly, the skill in reproducing trends for PRCP, TMAX, and TMIN is quite good for all methods, even SSI, because the CFSR reanalysis replicates day-to-day weather in the region with high skill. Overall, BCSD provides the highest skill for PRCP and TMAX while MACA does for TMIN. However, the difference in evaluation scores between BCSD and MACA is small for most indicators.

Table 4 Performance evaluation of four statistical downscaling methods for time-series related indicators

PRCPTOT (annual precipitation) is often used to categorize climatologically wet and dry areas. Figure 3 displays spatial patterns of mean PRCPTOT during the validation period (1992–2010) for the observations and all statistical downscaling methods. SSI underestimates PRCPTOT over South Korea while the four other methods show good agreement with the observations, in particular in the southeast. However, the statistical downscaling methods underestimate PRCPTOT in the central and northern areas. In addition, the time series of spatially averaged indices in Fig. 4 show that the statistical downscaling methods improve the skill in reproducing time-series related indicators compared to SSI. In particular, SSI severely underestimates SDII, SU, and FD, mainly due to the drizzling effect in CFSR, a cold bias in TMAX, and a warm bias in TMIN, respectively. The biases in TMAX and TMIN may result from the coarse grid size, which can include both ocean and land in a single grid cell.

Fig. 3

Spatial patterns of mean annual precipitation (PRCPTOT) from the observations (Obs), SSI, and the four downscaling methods during the validation period

Fig. 4

Time series of ETCCDI indices related to time-series during the validation period from 1992 to 2010

4.2 Distribution related criterion

The skill of statistical downscaling methods in reproducing distributions of observed daily climate data is evaluated by the K–S D statistic, which is defined as the maximum distance between empirical cumulative density functions (CDFs) of the observed and downscaled climate data sets. Therefore, a lower value of K–S D represents better performance on this criterion.

Seasonal K–S D values are shown in Table 5. As for the time-series related criterion, all values are averages of the K–S D values over the 60 stations. Notably, K–S D values for all variables are considerably improved by the four downscaling methods, by 67 % for TMAX, 55 % for TMIN, and 88 % for PRCP compared with SSI. Interestingly, K–S D values for TMAX and TMIN during winter are the poorest among the four seasons, which indicates that CFSR lacks skill in capturing hot and cold extreme events during winter. As expected, SSI has the lowest skill in simulating summer precipitation, with the K–S D statistic improved by introducing the statistical downscaling methods. BCSD, which incorporates spatial disaggregation with quantile mapping bias-correction, outperforms the other methods in reproducing the distribution of station data for all variables and seasons. Corresponding to the main types of statistical downscaling algorithms described in the methodology section, the spatial patterns of K–S D statistics from BCSD and BCCI are similar, while BCCA is similar to MACA (not shown here).

Table 5 Seasonal K–S D statistics in two-sample K–S test

4.3 Multi-day persistence related criterion

Four indicators related to the multi-day persistence criterion are evaluated for TMAX, TMIN, and PRCP. Table 6 presents the Euclidean distances of the errors between observed and simulated values for each downscaling method. All values are spatially averaged over all stations. In general, MACA and BCCA, which incorporate spatial patterns, show higher skill for TMAX and TMIN than BCSD and BCCI, which employ spatial disaggregation. In particular, MACA and BCCA outperform the others for CSDI, which represents the skill in reproducing periods of at least six consecutive extreme cold days, i.e. days when TMIN is below its 10th percentile. Long-lasting (i.e. longer than 6 days) cold extremes of TMIN are normally affected both by surface conditions and by large-scale forcing such as cold waves (Christensen et al. 2007). Therefore, downscaling methods employing spatial patterns may provide better skill for long-lasting cold as well as warm events. For PRCP, on the contrary, BCSD shows the highest skill in reproducing consecutive dry and wet days.

Table 6 Euclidean distances of errors between observed and simulated values for multi-day persistence related indicators

Figure 5 shows the time series of the four indicators, spatially averaged, during the validation period. As shown in Table 6, SSI and BCSD perform poorly for CSDI, overestimating it from 1998 to 2001. In addition, the sequencing of WSDI and CSDI for SSI and BCSD is very similar, which indicates that BCSD may depend strongly on the skill of the climate model in reproducing long-lasting consecutive warm and cold events. For the precipitation indicators, SSI shows much better skill for CDD than for CWD, which indicates that CFSR overpredicts wet days (refer to CWD in Fig. 5) as a result of the drizzling effect in the CFSR precipitation data, while CFSR may provide reliable CDD. However, the four statistical downscaling methods properly bias-correct the sequencing of dry and wet days in the CFSR data through quantile mapping with daily data (Hwang and Graham 2013).

Fig. 5

Time series of ETCCDI indices related to multi-day persistence during the validation period from 1992 to 2010

4.4 Extremes related criterion

The extremes related criterion measures how well the statistical downscaling models simulate extreme events. Table 7 shows the evaluation metrics for TMAX, TMIN, and PRCP. In general, the performance of the statistical downscaling methods follows that of SSI, indicating that downscaling performance may be highly dependent on the skill of CFSR. We also found that CFSR reproduces observed PRCP extremes well, i.e. it has the lowest Euclidean distance of errors on average. Despite this strong baseline, performance is considerably improved by the statistical downscaling methods relative to SSI. Based on the average of the error distances over all indicators for each variable, BCSD shows the best performance for TMAX and PRCP, and BCCI for TMIN. Overall, BCSD shows the highest skill in simulating extremes, which may be related to its performance on the distribution related criterion described above, as the maximum difference between distributions often occurs at the most extreme values.

Table 7 Euclidean distances of errors between observed and simulated values for extreme related indicators

In addition, time series of spatially averaged extreme related indicators show that SSI has poor skill compared to the statistical downscaling methods. Figure 6 displays time series of indices related to TMAX (TXn, TXx, TX10p, and TX90p). SSI has a cold bias and underestimates in most cases, mainly because its coarse grid, which can include both ocean and land within a single cell, fails to reflect regional topographic characteristics. Specifically, annual maximum values of TMAX (TXx) are underestimated by 2.7 °C on average for SSI, while the four downscaling methods remove this bias and show a large improvement in TXx. For the extreme indicators related to TMIN (Fig. 7), the performance of SSI is comparable to that of the downscaling methods. However, SSI has a warm bias in most cases, leading to an underestimation of FD and TNx and an overestimation of TNn and TN90p. On the contrary, the four statistical downscaling methods show a cold bias in TNn, which worsens their error distances as shown in Table 7. Such results indicate that downscaling methods may introduce additional biases into climate data. Figure 8 shows time series of precipitation related extreme indicators (Rx1day, Rx5day, R95pTOT, and R99pTOT). SSI underestimates Rx1day and Rx5day, likely due to the inability of CFSR to capture topographic effects on precipitation. Total precipitation above a threshold (i.e. R95pTOT and R99pTOT) from SSI shows relatively good agreement with observations, while the downscaling methods overestimate extreme precipitation in Rx1day, R95pTOT, and R99pTOT during years with high precipitation. These overestimates may be induced by extrapolating with extreme-value distributions (e.g. Gumbel) for values outside the calibration range in the statistical downscaling methods; such extrapolation may cause a substantial distortion of climate events (Maraun et al. 2010; Maraun 2013). Nevertheless, all indices are considerably improved by the statistical downscaling methods in most cases.
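Two of the precipitation indices discussed above, Rx5day and R95pTOT, can be computed along the following lines. This is a minimal annual-series sketch in which `p95` is assumed to be the wet-day 95th percentile estimated from the calibration period, and function names are illustrative.

```python
import numpy as np

def rx5day(prcp):
    """Maximum consecutive 5-day precipitation total (Rx5day) for one
    year of daily precipitation."""
    p = np.asarray(prcp, float)
    totals = np.convolve(p, np.ones(5), mode="valid")  # all 5-day sums
    return float(totals.max())

def r95ptot(prcp, p95):
    """Total precipitation from days exceeding the reference-period
    95th percentile of wet-day amounts (R95pTOT)."""
    p = np.asarray(prcp, float)
    return float(p[p > p95].sum())
```

Applied year by year, these give the annual index series that are compared against observations via the Euclidean error distance in Table 7.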

Fig. 6

Time series of extreme related indicators for TMAX during the validation period from 1992 to 2010

Fig. 7

As in Fig. 6, but for TMIN

Fig. 8

As in Fig. 6, but for PRCP

4.5 Spatial structure related criterion

The last criterion measures the ability of the methods to reproduce the spatial correlation of each variable, an important factor when simulating hydrologic responses for water resource management because the spatial distribution of precipitation plays a crucial role in simulating flood and drought events. Root mean square errors (RMSEs) of the seasonal spatial correlations between the observed and downscaled data sets are shown in Table 8. RMSEs for SSI during summer are higher than during other seasons for all variables, which may be attributed to the localized nature of heat waves and heavy rainfall in summer. However, the RMSEs are considerably reduced by the downscaling methods. In particular, large improvements by BCCA and MACA are found for all variables, up to 45 % on average by MACA for PRCP versus 8 % for BCCI. An important finding is that the spatial structure of summer precipitation is improved by MACA. Summer precipitation, as indicated above, is significantly affected by typhoons, leading to localized heavy rainfall under a monsoon climate, which makes its spatial structure difficult to capture over the study area. Approximately 60 % of the annual precipitation occurs during summer, which has significant implications for water resources management, as a balance must be kept between maintaining water supply during the drawdown period (spring, autumn, and winter) and flood prevention during summer (Eum et al. 2011).
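The seasonal spatial-correlation RMSE can be sketched as follows, assuming each season's station series are arranged as a (time, stations) array; only the unique station pairs in the upper triangle of the correlation matrix are compared. The function name is illustrative.

```python
import numpy as np

def spatial_corr_rmse(obs, sim):
    """RMSE between observed and downscaled station-pair correlations.

    obs, sim : arrays of shape (time, stations), e.g. one season's
               daily values at every station.
    """
    c_obs = np.corrcoef(obs, rowvar=False)   # (stations, stations)
    c_sim = np.corrcoef(sim, rowvar=False)
    iu = np.triu_indices_from(c_obs, k=1)    # unique station pairs
    return float(np.sqrt(np.mean((c_obs[iu] - c_sim[iu]) ** 2)))
```

Computing this separately for each season and variable yields entries comparable to those in Table 8.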

Table 8 RMSEs of seasonal spatial correlations between stations for the observed and downscaled data sets

Figure 9 displays quantile–quantile (Q–Q) plots of downscaled versus observed spatial correlations (1830 cases for 60 stations) for summer PRCP during the validation period. For summer precipitation, SSI and BCCI overestimate spatial correlations, lying above the diagonal line, due to the intrinsic bias of CFSR mainly induced by its coarse resolution, whereas MACA is concentrated near the 1:1 line. Overall, methods incorporating spatial weather patterns (BCCA and MACA) better reproduce the spatial structure of PRCP, TMAX, and TMIN, while methods that rely on spatial interpolation schemes (SSI, BCSD, and BCCI) tend to overestimate spatial correlations, mainly due to the mismatch between the spatial resolution of CFSR and the station spacing.

Fig. 9

Quantile–quantile (Q–Q) plots of spatial correlations between stations from all methods for summer precipitation during the validation period; dots falling on the diagonal line indicate perfect correspondence between observed and downscaled spatial correlations

4.6 Selecting a robust statistical downscaling method with TOPSIS

In the previous sections, each method showed different levels of skill on each performance metric. For example, BCSD outperformed the other methods on the distribution related criterion while MACA did so on the spatial structure criterion. Selecting a preferred technique when considering a single metric is straightforward. In general, however, multiple criteria such as those in the evaluation framework suggested in this study should be considered, as accurate simulation of hydrologic and ecological responses depends on all aspects of skill considered here. Therefore, we use a systematic procedure based on the TOPSIS technique, running experiments with various combinations of weightings for the five criteria, to identify a downscaling method that is robust across the range of criteria.
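A minimal TOPSIS sketch under the cost-criterion convention (smaller error distances are better) might look like this; the random weighting combinations can then be drawn, for example, from a flat Dirichlet distribution so that each set of weights sums to one. Function names and the sampling choice are ours, not from the original study.

```python
import numpy as np

def topsis_rank(scores, weights):
    """Rank alternatives with TOPSIS.

    scores  : (alternatives, criteria) matrix of error distances
              (cost criteria: smaller is better)
    weights : criterion weights summing to 1
    """
    m = np.asarray(scores, float)
    # Vector-normalize each criterion column, then apply the weights
    v = weights * m / np.sqrt((m ** 2).sum(axis=0))
    # For cost criteria the ideal is the column minimum, the anti-ideal the maximum
    ideal, anti = v.min(axis=0), v.max(axis=0)
    d_pos = np.sqrt(((v - ideal) ** 2).sum(axis=1))
    d_neg = np.sqrt(((v - anti) ** 2).sum(axis=1))
    closeness = d_neg / (d_pos + d_neg)              # higher = closer to ideal
    ranks = closeness.argsort()[::-1].argsort() + 1  # rank 1 is best
    return closeness, ranks

# Repeating the ranking over many random weighting combinations:
rng = np.random.default_rng(0)
weight_samples = rng.dirichlet(np.ones(5), size=10000)  # 10,000 weight sets
```

Tabulating how often each method attains each rank across the 10,000 weight sets yields percentages analogous to those reported in Table 9.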

Table 9 shows the percentage of each ranking for the five downscaling methods among TOPSIS experiments with 10,000 different weighting combinations based on the comprehensive performance metrics evaluated in this study. Bold numbers represent the most frequent rank for each method in this experiment. The ranking order changes with the variable considered; however, MACA shows the highest ranking for all variables. BCCA is second for TMAX and TMIN, while BCSD is second for PRCP. These results imply that MACA and BCCA, which employ spatial climate patterns, can provide more reliable downscaled temperature information, while BCSD may provide reliable precipitation data with regard to distribution and extremes. This makes sense because temperature is dominated by large-scale forcing such as heat and cold waves, while precipitation is mainly affected by local effects such as topography and local convection (Christensen et al. 2007; Eum et al. 2016). Therefore, the most robust of the four statistical downscaling methods compared in this study is MACA overall, followed by BCCA for temperature (TMAX and TMIN) and BCSD for PRCP. In addition, MACA's use of auxiliary variables to improve the coherence of the analogs may explain its improved performance in this experiment. Because MACA requires more variables as predictors, BCCA and BCSD can be considered the second alternatives for temperature and precipitation, respectively, when only TMAX, TMIN, and PRCP fields are available as predictors.

Table 9 Percentage of each ranking for the five downscaling methods among 10,000 weighting combinations by using the TOPSIS technique based on the comprehensive performance metrics evaluated in this study

We also need to determine which criteria play the more important role in the TOPSIS rankings. Although we tested the sensitivity of the ranking to changes in the weightings for all criteria, we present results for the two criteria, 3 (multi-day persistence) and 5 (spatial structure), that contributed most prominently to the ranking of the statistical downscaling methods. Figure 10 shows the ranking of each method as the weightings on criteria 3 and 5 are varied. For PRCP, the contribution of each criterion to the ranking is most prominent, i.e. the weighting of criterion 5 (spatial structure) plays the most crucial role in determining the rankings of the four statistical downscaling methods. When the weighting of criterion 5 is higher (>0.1), MACA and BCCA also rank higher: first for MACA and third for BCCA. When criterion 5 is negligible, however, BCSD and BCCI are ranked first and third, respectively. For TMAX, the higher the weightings of criteria 3 and 5 (>0.2), the more distinct the ranking order becomes, i.e. MACA is first, BCCA second, and BCSD third, mainly because MACA and BCCA consider spatial patterns. With lower weightings, however, cases occur in which BCSD ranks first. These results indicate that both criteria play a crucial role in the ranking of the four methods for TMAX. For TMIN, MACA is not sensitive to the range of weightings on any criterion, i.e. MACA is ranked first in most cases. As with TMAX, however, criteria 3 and 5 play an important role in deciding the ranking of the other methods (BCCA, BCSD, and BCCI). When the weightings of criteria 3 and 5 are higher, BCCA has the highest chance of being the second alternative. In particular, the weighting of criterion 3 is more crucial in ranking BCSD against BCCI because the performance of BCSD for CSDI is substantially different from that of the other methods (refer to Table 6).

Fig. 10

Scatter plot of the ranking of each statistical downscaling method corresponding to the weightings of criterion 3 (multi-day persistence) and criterion 5 (spatial structure) for PRCP, TMAX, and TMIN among experiments with 10,000 weighting combinations

5 Conclusions

MOS-based statistical downscaling methods have been developed and successfully used to downscale GCM outputs to the local scale over South Korea. Four methods (BCSD, BCCI, BCCA, and MACA) that have been widely used in various fields, along with SSI as a surrogate downscaling scheme to measure the skill of CFSR, were evaluated with a framework consisting of time-series, distribution, multi-day persistence, extremes, and spatial structure related indicators. We downscaled PRCP, TMAX, and TMIN from CFSR to 60 ASOS stations over South Korea. Dividing the historical station observations into two parts, we calibrated all statistical downscaling models with 19 years of data from 1973 to 1991 and validated them on the 19 years from 1992 to 2010. Based on the evaluation framework, we employed the TOPSIS technique with 10,000 weighting combinations to identify a robust method for the study area.

Regarding the time-series criterion, BCSD and MACA showed comparable skill for all variables. However, BCSD outperformed the other methods in reproducing the distributions of the variables. In addition, all methods showed less skill at higher-elevation stations, where extremes are more strongly influenced by orographic effects and complex topography. For the multi-day persistence criterion, MACA and BCCA, which incorporate spatial patterns, showed higher skill for TMAX and TMIN; in particular, both methods outperformed BCSD in CSDI, an indicator of the ability to reproduce consecutive cold extremes. This result implies that the spatial patterns of cold events may play an important role in providing better skill for long-lasting extreme cold and warm temperatures. On the contrary, BCSD showed the highest skill in reproducing consecutive dry and wet days. For the extremes related criterion, BCSD showed the highest skill in simulating extremes overall, which is related to its best performance on the distribution related criterion. While the four statistical downscaling methods improved the cold bias relative to SSI, they may introduce additional biases when the climate model already provides relatively accurate extreme indicators, mainly due to extrapolation for values outside the range observed during the calibration period. In terms of spatial structure, BCCA and MACA outperformed BCSD and BCCI, mainly because they incorporate spatial weather patterns into the downscaling process. In particular, MACA showed large increases in skill for summer precipitation, which is affected by localized heavy rainfall under a monsoon climate. Based on the TOPSIS analysis, MACA is the most reliable and robust method for all variables in South Korea, with BCCA second for TMAX and TMIN and BCSD second for PRCP. We also found that the weightings on criterion 3 (multi-day persistence) and criterion 5 (spatial structure) are crucial to ranking the skill of the statistical downscaling methods.

As one of the first downscaling intercomparison studies for South Korea, this study downscaled CFSR data to a network of stations. Climate variables downscaled to a gridded data set, however, may be more useful for distributed hydrologic models that require input data on a regular grid. Recently, gridded precipitation and temperature data have been generated at 1 and 5 km grid spacing for South Korea (Hong et al. 2007; Shin et al. 2008; Kim et al. 2014), and a straightforward next step would be to repeat our analysis for these data sets. Also, recent studies have focused on 21st century downscaled climate change projections based on the fifth phase of the Coupled Model Intercomparison Project (CMIP5) (Maloney et al. 2014). At the same time, studies have raised potentially serious issues with bias-correction techniques such as quantile mapping, which is used in BCSD, BCCI, BCCA, and MACA, with regard to preserving long-term projected trends from the driving GCMs (Maraun 2013; Maurer and Pierce 2014). In response, various algorithms have been developed to preserve the long-term trend of climate projections, such as equidistant quantile matching (Li et al. 2010), the ISI-MIP approach (Hempel et al. 2013), detrended quantile mapping (Bürger et al. 2013), and quantile delta mapping (Cannon et al. 2015). Employing such a trend-preserving bias-correction technique together with the methods recommended in this study, we plan to downscale various CMIP5 projections for regional climate change impact studies in South Korea.