1 Introduction

Stochastic weather generators (referred to as SWGs hereafter) are numerical tools employed broadly to simulate the statistical characteristics of observed climate variables and to generate random time series that can be used as inputs for climate-sensitive hydrological models (Wheater et al. 2005). The variability in the input translates into variability in the generated hydrological time series. The risk associated with, and performance of, the modeled water system are assessed by estimating statistics of the simulated variables. The use of SWG outputs in such studies is convenient, as SWGs can generate long and gap-free synthetic sequences based on historical observations and can be used for water resources planning and management (Vu et al. 2018). A large ensemble of synthetic weather sequences (or realizations) is assumed to represent the internal variability of hydroclimatic variables, most commonly precipitation, maximum temperature, minimum temperature, solar radiation, and relative humidity (Santer et al. 2008), at different spatial and temporal scales (Ailliot et al. 2015).

According to Guenni (1994), SWGs are mainly useful in: (1) extending insufficient or incomplete records that constrain the modeling approach (e.g., Fodor et al. 2013; Fatichi et al. 2016), (2) developing datasets for ungauged sites by spatially interpolating model parameters from adjacent areas with sufficient records (e.g., Baffault et al. 1996; Fodor et al. 2013), and, more recently, (3) accounting for the uncertainty that arises from natural variability along with anthropogenic forcing in climate-change simulations (e.g., Räisänen and Ruokolainen 2006; Minville et al. 2008; Deser et al. 2012; Thompson et al. 2015). Ailliot et al. (2015) classified SWGs into four groups according to the random number generation process: resampling techniques (e.g., Räisänen and Ruokolainen 2006; Oriani et al. 2014), multivariate autoregressive models applying the Box-Jenkins method (e.g., Box and Jenkins 1976), point process models (e.g., Rodriguez-Iturbe et al. 1987; Onof et al. 2000), and Bayesian hierarchical modeling such as weather type models (e.g., Thompson et al. 2007).

SWGs were introduced initially for hydrological applications requiring long sequences of daily weather data (e.g., Gabriel and Neumann 1962; Todorovic and Woolhiser 1975; Buishand 1977). Since then, SWGs have found wide application in various hydrologic investigations, such as the assessment of anthropogenic climate change impacts (e.g., Zwiers 1996; Eames et al. 2012; Kilsby et al. 2007; Candela et al. 2012), crop yield estimates (e.g., Vesely et al. 2019), ecosystem and food security models (e.g., Stevens and Madani 2016), and streamflow simulations (e.g., Zhang and Garbrecht 2003; Dubrovský et al. 2004; Alodah and Seidou 2019a), mainly to characterize internal atmospheric variability (or climate noise) (Räisänen and Ruokolainen 2006; Santer et al. 2008; Deser et al. 2012) and particularly under conditions of data scarcity (Breinl et al. 2017). The use of observed climate data in hydrological modeling is always preferable; however, SWGs provide a suitable complement, as some plausible localized high-risk events may not be fully represented in the observed record (Räisänen and Ruokolainen 2006; Ivanov et al. 2007; Santer et al. 2008; Vu et al. 2018).

Several authors have investigated the abilities of SWGs to represent the statistical properties of observed weather series (e.g., Semenov et al. 1998; Hayhoe 1998, 2000; Qian et al. 2004; Ivanov et al. 2007; Chen et al. 2014; Ailliot et al. 2015; Breinl et al. 2017; Mehan et al. 2017; Gitau et al. 2018; Vesely et al. 2019). Well-known limitations of SWGs include their limited ability to reproduce low-frequency variability (Soltani and Hoogenboom 2003a) and to correctly reproduce the dependence of temperature variables and wet-day precipitation amounts on the parameters determining wet/dry spell lengths (Wilby et al. 2004). Alternatively, Panagoulia (2006) showed the great potential of artificial neural network (ANN) models in simulating the nonlinear processes of extreme river flows under various climate conditions. The ability of ANNs to simulate flows has also been demonstrated in detailed localized studies, provided that the input variables are selected appropriately (Panagoulia et al. 2017).

SWGs are often employed to study the impacts of climatic variability, for instance, in rainfall-runoff simulations (e.g., Dubrovský et al. 2004; Panagoulia 2006), erosion simulations (e.g., Zhang and Garbrecht 2003), simulations of extreme precipitation events (e.g., Furrer and Katz 2008; Semenov 2008), and in climate-change studies (e.g., Kilsby et al. 2007; Kim et al. 2007; Al-Mukhtar et al. 2014; Alodah and Seidou 2019b). Unlike observed weather data, which provide only one realization, an unlimited number of weather realizations can be generated (Kim et al. 2018; Vu et al. 2018), and it is statistically very improbable that any two realizations will be identical (i.e., the data are uncorrelated from one realization to the next). In general, multiple stochastically generated time series can provide a broad range of weather possibilities for a detailed sensitivity analysis (Dubrovský et al. 2004; Santer et al. 2008), such as the recently introduced vulnerability-based methods (e.g., bottom-up approaches) for evaluating uncertainty in projected climate change impacts (e.g., Brown et al. 2011; Steinschneider and Brown 2013; Mukundan et al. 2019; Alodah and Seidou 2019b). An ensemble of multiple realizations is recommended in order to characterize the variability in climate data adequately and to estimate realistic mean values and variances of meteorological variables (Alodah and Seidou 2019a; Guo et al. 2018; Mehrotra et al. 2006; Anyah and Semazzi 2006; Dubrovský et al. 2004).

Multiple realizations of climate series are increasingly becoming the adopted modeling approach when evaluating the variability of complex climate systems and accounting for rare occurrences of climate variables (Anyah and Semazzi 2006). Typically, an arbitrary (and commonly limited) number of realizations (ranging from 5 to 1000) is used. Examples of recent publications utilizing multiple runs of weather generators are presented in Table 1. It is also common to use SWGs to produce time series that are longer than the observed ones (e.g., Kou et al. 2007; Caron et al. 2008; Chen et al. 2012; Eames et al. 2012), although this might lead to biases due to insufficient sampling of the distribution (Mithen and Black 2011). Therefore, it is recommended that multiple realizations with the same length as the training set be used (Dubrovský et al. 2004; Guo et al. 2018). However, the use of multiple realizations requires high-performance computational resources, especially when used in conjunction with a complex impact model. For example, Gitau et al. (2012) analyzed 172 management scenarios and ran a SWAT model 250 times for each of them, for a total of 43,000 runs, using a large Condor-based distributed computing framework; they estimated that the same work would have taken up to 3.3 years on a traditional desktop workstation. Thus, given the limitations imposed by time and computational expense, the question of how many realizations are required to fairly characterize the hydrological space remains open.

Table 1 Examples of ensembles of multiple realizations used in prior hydrological investigations

This prolonged process, particularly for large watersheds, may be overcome with the help of expensive supercomputers or by identifying a sufficiently representative number of outputs needed to capture the random component of the hydrological model and thereby reduce the computations. Guo et al. (2018) investigated the number of realizations necessary to satisfactorily capture several statistical characteristics of meteorological variables (precipitation and minimum and maximum temperature) generated synthetically by CLIGEN, LARS-WG, and WeaGETS. They analyzed increasing discrete numbers of realizations (1, 25, 50, and 100) and concluded that a weather generator reproduces the essential climate characteristics well with 25 realizations. The current work generally builds on their ideas. However, the statistics considered in their work belong to the climatic data space only (precipitation and temperature variables); thus, their findings may not be applicable to hydrological variables, especially given the non-linearity of the hydrologic response in rainfall-runoff transformations.

Synthetically generated climate sequences are frequently fed to hydrological models and then used to explore risk spaces. This study analyzes how the accuracy of the estimates of key statistics evolves with the number of SWG realizations. Five SWGs were used to generate ensembles of daily precipitation occurrences and amounts (PCP) and daily maximum (Tmax) and minimum (Tmin) temperatures, which were then fed into a hydrological model (SWAT) to simulate streamflow. A variety of diagnostic tools were then applied to identify the optimal number of realizations needed for both the climatic and hydrologic variables.

2 Materials and methods

2.1 Study area and available hydro-climatic data

The study area is the South Nation Watershed (SNW), located in Eastern Ontario, Canada. The SNW is relatively flat, covers an area of about 4000 km², and lies between 74°22′ and 75°43′ W longitude and 44°40′ and 45°38′ N latitude. The watershed is drained by the South Nation River, which runs northeast for 175 km towards Plantagenet, with a low topographic gradient of only 80 m between its headwaters and the confluence with the Ottawa River. This characteristic maximizes the flood risk and boosts the erosion of riverbanks and agricultural topsoil. The reader is referred to Alodah (2015, 2019) for detailed descriptions of the study area. Climate data were collected for a 41-year period (1971–2011), based on the availability and consistency of the observed data, at four meteorological stations, namely, Russell Station (Climate Identifier (CI): 6107247, Latitude: 45° 15′ 46″N, Longitude: 75° 21′ 34″W, Elevation: 76.2 m), South Mountain Station (CI: 6107955, Latitude: 44° 58′ 00″N, Longitude: 75° 29′ 00″W, Elevation: 84.7 m), Morrisburg Station (CI: 6105460, Latitude: 44° 55′ 25″N, Longitude: 75° 11′ 18″W, Elevation: 81.7 m), and St. Albert Station (CI: 6107276, Latitude: 45° 17′ 14″N, Longitude: 75° 03′ 49″W, Elevation: 80 m). In addition, observed daily discharge data were collected downstream at the Plantagenet Gauging Station (ID: 02LB005, Latitude: 45° 31′ 01′′ N, Longitude: 74° 58′ 41′′ W). There were no missing data in either dataset for the reference period. A detailed description of the observed hydroclimatic data has been presented previously in Alodah and Seidou (2019a).

2.2 Stochastic weather generators

The observed 41-year climate series of maximum air temperature, minimum air temperature, and precipitation from the four meteorological stations were fed into five SWGs, namely, WeaGETS implementing multi-Gamma (referred to as WG hereafter) and multi-Exponential (referred to as WE hereafter) distributions for wet-day sequences (Chen et al. 2012), MulGETS implementing multi-Gamma (referred to as MG hereafter) and multi-Exponential (referred to as ME hereafter) distributions for wet-day sequences (Chen et al. 2014), and a k-nearest neighbor resampling model (Sharif and Burn 2007; Goyal et al. 2013). WeaGETS, a single-site weather generator from the École de Technologie Supérieure (ÉTS), is a multivariate parametric model that simulates the temperature variables conditionally on each other, assuming a normal distribution and using first-order linear auto-regression with constant lag-1 autocorrelation and cross-correlation. It also accounts for seasonal cycles using a finite Fourier series with two harmonics. MulGETS, a multi-site weather generator also from ÉTS, is an extension of WeaGETS that takes into account the spatial attributes of climate data, which is crucial in most hydrological models. For the simulation of precipitation occurrence, MulGETS uses a two-state (dry or wet) first-order Markov chain with Cholesky factorization, whereas WeaGETS uses a third-order Markov model without parameter smoothing.

A higher-order Markov model is used in WeaGETS because it is recommended for better reproducing long dry and wet spells (Bastola et al. 2012; Chen et al. 2012), whereas a first-order Markov chain is the only option in MulGETS. The choice of order in Markov chain models for the simulation of precipitation has been assessed in numerous studies (e.g., Schoof and Pryor 2008; Stowasser 2012). For instance, Schoof and Pryor (2008) examined Markov chains of order 0–3 for replicating monthly precipitation occurrence using the Bayesian information criterion (BIC) and found that the higher-order models performed better in simulating wet spells but underperformed for dry spell lengths. The inherent inadequacy of the exponential Markov approach proposed by Richardson (1981) in simulating the length of dry and wet spells was purportedly improved by the serial spell-length approach (Racsko et al. 1991). Also, temperature variables and wet-day precipitation amounts are conditional on the parameters determining wet/dry spell lengths (Wilby et al. 2004). Stowasser (2012), however, indicated that the improvement in reproducing precipitation statistics when using the theoretically best mixed-order model was minimal in comparison to simpler models.
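As a concrete illustration of the occurrence process described above, the sketch below simulates a dry/wet sequence with a two-state first-order Markov chain (the option used in MulGETS; the third-order chain used by WeaGETS is not reproduced here). The transition probabilities p01 and p11 are illustrative placeholders rather than values fitted to the SNW record, which would normally be estimated month by month from the observations.

```python
# Minimal sketch (not the WeaGETS/MulGETS source code): daily precipitation occurrence
# from a two-state (dry/wet) first-order Markov chain. p01 = P(wet | previous day dry),
# p11 = P(wet | previous day wet); both values are illustrative placeholders.
import numpy as np

def simulate_occurrence(n_days, p01=0.30, p11=0.65, seed=None):
    """Return a 0/1 array of dry/wet days generated by a first-order Markov chain."""
    rng = np.random.default_rng(seed)
    occ = np.zeros(n_days, dtype=int)
    occ[0] = rng.random() < p01                 # arbitrary initialization from the dry state
    for t in range(1, n_days):
        p_wet = p11 if occ[t - 1] == 1 else p01
        occ[t] = rng.random() < p_wet
    return occ

wet_days = simulate_occurrence(365 * 41, seed=1)
print("Simulated wet-day frequency:", wet_days.mean())
```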

Both models (WeaGETS and MulGETS) were each run with two probability distributions for the daily wet-day precipitation amounts: a multi-Gamma distribution (a combination of several gamma distributions) and a multi-Exponential distribution. The probability density functions (PDFs) of the Gamma and Exponential models are:

$$f_{Gamma}\left(x\right) = \frac{\left(x/\beta\right)^{\alpha - 1} \exp\left[-x/\beta\right]}{\beta\,\Gamma\left(\alpha\right)}$$
(1)
$$f_{Exp}\left(x\right) = \lambda e^{-\lambda x}$$
(2)
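In Eq. (1), α and β are the shape and scale parameters of the Gamma distribution, and in Eq. (2), λ is the rate parameter of the Exponential distribution. For illustration only, the sketch below draws wet-day amounts from these two distributions with placeholder parameter values; the "multi-" mixtures actually used by WeaGETS and MulGETS (several such distributions combined) are not reproduced here.

```python
# Minimal sketch of sampling wet-day precipitation amounts from Eqs. (1) and (2).
# The parameter values are placeholders; in practice they are fitted to the observed
# wet-day amounts of each station and calendar period.
import numpy as np

rng = np.random.default_rng(42)

def gamma_amounts(n_wet, alpha=0.8, beta=6.0):
    # numpy parameterizes the gamma distribution by shape (alpha) and scale (beta), as in Eq. (1)
    return rng.gamma(shape=alpha, scale=beta, size=n_wet)

def exponential_amounts(n_wet, lam=0.2):
    # numpy parameterizes the exponential distribution by its scale 1/lambda, as in Eq. (2)
    return rng.exponential(scale=1.0 / lam, size=n_wet)

print(gamma_amounts(5))
print(exponential_amounts(5))
```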

The k-nearest neighbor (KNN) model is a daily generator based on nonparametric resampling of the observed climate dataset. Because it is nonparametric, KNN can generate sequences that did not occur in the historical period, although the simulated values remain within the range of the sampled observations. The reader is referred to Alodah and Seidou (2019a) for a full description of the configurations of the abovementioned stochastic models and their performance.
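For illustration, the sketch below shows the resampling idea in the spirit of the KNN generators cited above; it is a simplified version, not the exact algorithm of Sharif and Burn (2007). `history` is assumed to be an array of observed daily values of [PCP, Tmax, Tmin], and each simulated day is the historical successor of one of the k observed days closest to the current simulated state.

```python
# Minimal sketch of k-nearest-neighbour (KNN) resampling of daily weather (illustrative only).
import numpy as np

def knn_resample(history, n_days, k=5, seed=None):
    """history: (n_obs, 3) array of observed [PCP, Tmax, Tmin]; returns an (n_days, 3) array."""
    rng = np.random.default_rng(seed)
    scale = history.std(axis=0)                        # simple standardization of the distance
    weights = 1.0 / np.arange(1, k + 1)
    weights /= weights.sum()                           # decreasing kernel: nearest neighbour most likely
    sim = [history[rng.integers(len(history) - 1)]]    # random initial day
    for _ in range(n_days - 1):
        d = np.linalg.norm((history[:-1] - sim[-1]) / scale, axis=1)
        neighbours = np.argsort(d)[:k]                 # indices of the k closest observed days
        pick = rng.choice(neighbours, p=weights)
        sim.append(history[pick + 1])                  # resample the day that followed the chosen neighbour
    return np.array(sim)
```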

2.3 Rainfall-runoff model

The Soil and Water Assessment Tool (SWAT) is a well-known hydrological model that has been used widely for many applications, including the simulation of sediment and nutrient fluxes, but mainly for streamflow simulation (Neitsch et al. 2011). SWAT is a semi-distributed watershed-scale model that relies on hydrologic response units (HRUs) with uniform land and climate characteristics. The SWAT model for this study was first calibrated and validated using the observed climate data at a daily time step, based on the Nash–Sutcliffe efficiency (NSE), the RMSE-observations standard deviation ratio (RSR), and the percent bias (PBIAS). Mehrotra et al. (2006) pointed out that care should be taken when applying the NSE alone, particularly because of its dependence on the size of the test sample; hence, more than one metric should be considered (Criss and Winston 2008). The results of the calibration and validation of the model indicate a good fit between the observed and simulated flows (Metric: Calibration, Validation; NSE: 0.90, 0.81; RSR: 0.31, 0.43; PBIAS: −10.0%, −8.3%). The reader is referred to Alodah and Seidou (2019a) for a detailed description of the SWAT configuration and parameter selection. Next, the synthetic climate time series were fed independently into the SWAT model to generate synthetic daily streamflow time series. To examine the hydrological responses to the various synthetic climate scenarios, all SWAT parameters were kept unchanged and only the climate input was replaced, enabling the effect of climate variability on the hydrological variables to be tracked.
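For reference, the three goodness-of-fit metrics reported above can be computed as in the sketch below (standard definitions, not code from the SWAT calibration tools); `obs` and `sim` are assumed to be equal-length daily discharge series.

```python
# Minimal sketch of the calibration/validation metrics used for the SWAT model.
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rsr(obs, sim):
    """RMSE-observations standard deviation ratio: 0 is a perfect fit."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.sqrt(np.sum((obs - sim) ** 2)) / np.sqrt(np.sum((obs - obs.mean()) ** 2))

def pbias(obs, sim):
    """Percent bias: positive values indicate underestimation of the observed volume."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * np.sum(obs - sim) / np.sum(obs)
```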

2.4 Definitions and notations

For additional clarity, the definitions of some terms used herein are given below:

  • A realization is a random output generated by running a SWG (climate) or the SWAT model forced with synthetic climate data (streamflow) for a number of years (a 41-year cycle herein); all realizations are considered equally plausible for a given SWG (the terms “realization”, “run”, “iteration” and “scenario” are frequently interchangeable in the prior literature).

  • Cloud refers to an ensemble of separately generated realizations (one thousand herein) of synthetic climate (or streamflow) series obtained by running a given SWG (coupled with SWAT) 1000 separate times.

  • Sample refers to a set of N realizations drawn randomly from the cloud (for example, a 10-realization sample); 10,000 such samples of length N are drawn from the cloud.

To ease comprehension, the following notations are adopted:

  • Index s goes from 1 to S and represents the meteorological stations listed below:

    1. Russell

    2. South Mountain

    3. Morrisburg

    4. St. Albert

    where S is the number of stations (4).

  • T is the length in years of all climatic and hydrological time series (41-yr series).

  • The observed climate and flow time series are denoted as:

    • \({\text{PCP}}_{\text{t}}^{{ {\text{obs,s}}}}\), t = 1,…,T; s = 1,..,S, which represents the observed precipitation at time t at meteorological station s

    • \({\text{Tmax}}_{\text{t}}^{{ {\text{obs,s}}}}\), t = 1,…,T; s = 1,..,S, which represents the observed maximum temperature at time t at meteorological station s

    • \({\text{Tmin}}_{\text{t}}^{{ {\text{obs,s}}}}\), t = 1,…,T; s = 1,..,S, which represents the observed minimum temperature at time t at meteorological station s

    • \({\text{Q}}_{\text{t}}^{{ {\text{obs}}}}\), t = 1,…,T, which represents the observed discharge (OBS Flow) at time t at the outlet for the SNW.

  • The flow time series at the outlet for the SNW obtained by forcing the SWAT model with observed climate data, called Simulated Flow using Observed Climate (SFOC), is denoted as:

    • \({\text{SFOC}}_{\text{t}}\), t = 1,…,T.

The following sections provide a more detailed description of the main steps involved in the integrated framework: (a) the generation of multiple realizations of climate and streamflow data, and (b) the evaluation criteria used to determine the number of realizations needed in hydrological simulations. A schematic illustration of the overall modeling framework is presented in Fig. 1.

Fig. 1
figure 1

Schematic representation of the current work, where N ranges from 1 to 1000 unique realizations

2.5 Climate and flow cloud generation

In this work, 41 years (1971–2011) of observed climate and streamflow records are used as the reference data from which the deviations are calculated. The synthetic precipitation and temperature time series in this paper are also 41 years long to permit an adequate risk analysis to be conducted (Semenov and Barrow 1997; Elliot and Arnold 2001); Soltani and Hoogenboom (2003b) found that at least 15 years of historical climate data are required to generate synthetic time series that mimic the observed statistical characteristics. Each SWG was run 1000 separate times, resulting in a total of 5000 realizations of weather sequences at a daily time step. Each realization is 41 years long to match the length of the observed climate data, giving a total of 205,000 synthetic weather years (5 SWGs × 41 years × 1000 realizations). Similarly, the SWAT model was run 5000 separate times, with each run producing a unique 41-year realization of streamflow. The choice of 1000 realizations per weather generator, despite the excessive computational demand, particularly for the hydrological modeling, was made to form a dense cloud of realizations from which a satisfactory number of realizations could be identified. The 1000 synthetic time series of precipitation, minimum temperature, and maximum temperature generated by each SWG and representing the climate at station s (referred to as the climate cloud hereafter) are denoted:

  • \({\text{PCP}}_{\text{t}}^{{ {\text{i,s}}}}\), t = 1,…,T; s = 1,..,S for the precipitation time series,

  • \({\text{Tmax}}_{\text{t}}^{{ {\text{i,s}}}}\), t = 1,…,T; s = 1,..,S for the maximum temperature time series,

  • \({\text{Tmin}}_{\text{t}}^{{ {\text{i,s}}}}\), t = 1,…,T; s = 1,..,S for the minimum temperature time series.

The 1000 streamflow time series obtained by forcing the SWAT model with the synthetic climate time series (referred to as the flow cloud hereafter) are each called Simulated Flow using Synthetic Climate (\({\text{SFSC}}\)) and denoted as:

  • \({\text{SFSC}}_{\text{t,SWG}}^{{ {\text{i}}}}\), t = 1,…,T; i = 1,…,1000, SWG\(\in \left\{ {\text{ME,MG,WE,WG, KNN}} \right\}\).
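Purely as an illustration of the notation above, the clouds could be organized in memory as one array of realizations per variable, station, and weather generator; the shapes follow the text (1000 realizations of 41 years of daily values), and all names below are hypothetical.

```python
# Illustrative sketch of the climate and flow clouds as in-memory arrays (hypothetical layout).
import numpy as np

SWGS = ["ME", "MG", "WE", "WG", "KNN"]
N_REALIZATIONS, N_STATIONS, T_DAYS = 1000, 4, 41 * 365   # leap days ignored for simplicity

# pcp_cloud[swg][s][i, t] corresponds to PCP_t^{i,s}; analogous arrays would hold Tmax and Tmin
pcp_cloud = {swg: {s: np.empty((N_REALIZATIONS, T_DAYS)) for s in range(N_STATIONS)}
             for swg in SWGS}

# sfsc_cloud[swg][i, t] corresponds to SFSC_{t,SWG}^{i} at the watershed outlet
sfsc_cloud = {swg: np.empty((N_REALIZATIONS, T_DAYS)) for swg in SWGS}
```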

2.6 Estimation of a statistic V using N realizations

The following algorithm is used to estimate a statistic V using N realizations. For k between 1 and 10,000:

  • Sample, without repetition, a subset of N indices between 1 and 1000, i.e., \(\left\{ {j_{1}^{k} ,j_{2}^{k} , \ldots ,j_{N}^{k} } \right\}\).

  • The kth estimate of the mean value of the statistic V is

    $$\mu_{k} = \frac{1}{N}\sum_{m = 1}^{N}\left(\frac{\sum_{t = 1}^{T} V_{t}^{j_{m}^{k},s}}{T}\right)$$

The more variability there is in \(\left\{ {\mu_{k} } \right\},k = 1, \ldots ,10000\), the less precise the estimate. The variability in these estimated means can be illustrated using violin plots, and the deviations from the reference value \(Y_{ref}\) (defined in Sect. 2.7.2) quantify the biases of the estimates.
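A minimal sketch of this sampling algorithm is given below, assuming the statistic V has already been evaluated on each of the 1000 cloud realizations (e.g., as the time-mean in the expression above), so that each \(\mu_{k}\) reduces to the mean of V over an N-realization sample.

```python
# Minimal sketch of Sect. 2.6: draw 10,000 samples of N realizations (without repetition
# within a sample) from a 1000-member cloud and compute the mean of the statistic V for each.
# `cloud_stat` is assumed to be a length-1000 vector of V evaluated on each realization.
import numpy as np

def sample_estimates(cloud_stat, n_realizations, n_samples=10_000, seed=None):
    rng = np.random.default_rng(seed)
    cloud_stat = np.asarray(cloud_stat, float)
    mu = np.empty(n_samples)
    for k in range(n_samples):
        idx = rng.choice(cloud_stat.size, size=n_realizations, replace=False)
        mu[k] = cloud_stat[idx].mean()          # mu_k in the notation above
    return mu

# The spread of sample_estimates(v, N) across the 10,000 samples is what the violin plots display.
```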

2.7 Evaluation criteria

Given that the samples generated by the SWGs will not be identical, the impact of such variation among samples is first investigated visually using graphs of the simulated sequences, such as sequence plots, running-mean plots, and violin and box plots of the samples.

2.7.1 Visual convergence assessment

An examination using appropriate graphical techniques can provide a general idea about the variable of interest (Ott and Longnecker 2015). Plots of each statistic and of its running mean are used to examine the simulation process as the number of realizations increases. A time series plot of the running mean is simple to implement and is used to check when a newly generated flow realization no longer shifts the mean of the previous realizations appreciably. The running mean is computed as the mean of all sampled values up to and including the current realization. The plot then shows whether the running mean, computed over randomly ordered realizations, stabilizes around the mean of all realizations (Smith 2007). By the law of large numbers, these plots eventually converge to a constant value, which is the mean of all realizations. Such visual evaluations provide general insights, but they are not sufficient on their own and further statistical analyses must be conducted.
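A minimal sketch of the running-mean diagnostic described above, assuming the per-realization statistics are supplied as a vector:

```python
# Minimal sketch of the running mean: the cumulative mean of a randomly ordered statistic
# across realizations, which should flatten once extra realizations stop changing the estimate.
import numpy as np

def running_mean(values, seed=None):
    rng = np.random.default_rng(seed)
    v = rng.permutation(np.asarray(values, float))   # random realization order
    return np.cumsum(v) / np.arange(1, v.size + 1)

# e.g., plot running_mean(mean_annual_flow_per_realization) against the mean of all realizations
```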

2.7.2 Quantitative assessment

The four key statistics to be estimated from the time series are the mean (μ), standard deviation (σ), and the skewness (\(\alpha_{3}\)) and kurtosis (\(\alpha_{4}\)) coefficients of the climate or flow variable of interest. For the sake of simplicity, Y will be used herein to denote the estimate of any of the above statistics. The statistical measures considered in this paper are the Relative Error (RE), the Relative Root Mean Squared Error (\(RMSE_{r}\)), and Cohen’s effect size (d). The relative error (RE) is the difference between a sample estimate (\(Y_{i}\)) and the known or accepted reference value (\(Y_{ref}\)), expressed relative to the reference value:

$$RE\left( \% \right) = \left( \frac{Y_{ref} - Y_{i}}{Y_{ref}} \right) \times 100$$
(3)

The root mean squared error (RMSE), also called the root-mean-square deviation, is one of the most common metrics used to measure the accuracy of continuous variables by quantifying the average magnitude of the error. It is a negatively oriented score that ranges from 0 to ∞, meaning that values closer to 0 are preferable. This metric is particularly useful when large errors cannot be tolerated, as the errors are squared when computing it. The RMSE and relative RMSE (\({\text{RMSE}}_{r}\)) are computed as:

$$RMSE = \sqrt{\frac{1}{KN}\sum_{j = 1}^{K}\sum_{i = 1}^{N}\left(Y_{ref} - Y_{j}\right)^{2}}$$
(4)
$$RMSE_{r}\left( \% \right) = \left( \frac{RMSE}{Y_{ref}} \right) \times 100$$
(5)

The improvement in the \(RMSE_{r}\) value obtained by adding one more realization (\(RMSE_{r, improvement}\)) and the marginal improvement (\(RMSE_{r, mar. improvement}\)) are defined as:

$$RMSE_{r, improvement} = RMSE_{r, n - 1} - RMSE_{r, n}$$
(6)
$$RMSE_{r, mar. improvement} = RMSE_{r, n = 1} - RMSE_{r, n}$$
(7)

where n = 2, 3, …., N, and N = 1000.
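The error metrics of Eqs. (3)-(7) can be computed as in the sketch below, assuming `estimates_by_n[n]` holds the sample estimates \(Y_{j}\) obtained with n realizations (10,000 values per n, as in Sect. 2.6) and `y_ref` is the chosen reference value.

```python
# Minimal sketch of Eqs. (3)-(7): relative error, relative RMSE over the sample estimates,
# and the stepwise and marginal improvements as the number of realizations grows.
import numpy as np

def relative_error(y_ref, y_i):                        # Eq. (3), in percent
    return 100.0 * (y_ref - y_i) / y_ref

def rmse_r(y_ref, estimates):                          # Eqs. (4)-(5), in percent
    estimates = np.asarray(estimates, float)
    rmse = np.sqrt(np.mean((y_ref - estimates) ** 2))
    return 100.0 * rmse / y_ref

def improvements(y_ref, estimates_by_n):               # Eqs. (6)-(7)
    ns = sorted(estimates_by_n)
    r = {n: rmse_r(y_ref, estimates_by_n[n]) for n in ns}
    stepwise = {n: r[prev] - r[n] for prev, n in zip(ns, ns[1:])}
    # Eq. (7): improvement relative to the smallest sample size (assumed to be n = 1)
    marginal = {n: r[ns[0]] - r[n] for n in ns[1:]}
    return stepwise, marginal
```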

2.7.3 Evaluation of effect size

Cohen’s effect size (Cohen’s d) is a standardized quantitative index that can help in interpreting such large Monte-Carlo-like experiments by checking the deviation (or overlap) between two groups in standard deviation units. Cohen’s d uses the difference in means between the control (or reference) group and the sample group and the standard deviation (SD) of the control group (Rosnow and Rosenthal 1996), and is determined mathematically as

$$\text{Cohen's } d = \frac{\left| \text{Mean}_{Y_{ref}} - \text{Mean}_{Y_{i}} \right|}{\text{SD}_{Y_{ref}}}$$
(8)

The standard deviation of the control group is used following Mehan et al. (2017), under the assumption that the variances of the two groups are not similar. This variant, standardized by the control group, is also known as Glass’s d or \(\Delta\) (Glass 1976). Larger effects correspond to larger differences in means and less overlap between the two distributions. Cohen (1988) offered rules of thumb by which the effect size can be interpreted as small (d < 0.2), medium (\(0.2 < d < 0.8\)), or large (d > 0.8). Nevertheless, the interpretation of effect sizes should not rigidly follow Cohen’s thresholds; rather, results should be evaluated in the context of related prior literature, as suggested by Vacha-Haase and Thompson (2004).
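A minimal sketch of Eq. (8) together with Cohen's (1988) rule-of-thumb interpretation; `reference` would be the 1000-member cloud and `sample` an N-realization subset of it.

```python
# Minimal sketch of Eq. (8): effect size standardized by the control (reference) group,
# i.e., Glass's variant of Cohen's d, plus the rule-of-thumb interpretation used in the text.
import numpy as np

def cohens_d(reference, sample):
    reference, sample = np.asarray(reference, float), np.asarray(sample, float)
    return abs(reference.mean() - sample.mean()) / reference.std(ddof=1)

def interpret(d):
    # thresholds from Cohen (1988), applied loosely rather than rigidly
    return "small" if d < 0.2 else ("medium" if d < 0.8 else "large")
```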

2.7.4 The reference values for the key statistics

For any given statistic, several reference values can be used to calculate both the RE and the RMSEr. The two reference values for the key climate statistics are:

  • The statistics calculated from observations (\(V_{ref, Y,OBS}\)), and

  • The average of the statistics calculated from the 1000 realizations in the synthetic climate (\(V_{ref, Y,SC}\)).

The three reference values for the key flow statistics are:

  • The estimates of statistics V calculated with observations, denoted by (\(V_{ref, OBS }\));

  • The estimates of statistics V calculated from the time series simulated via SWAT using the observed climate, denoted by (\(V_{ref, SFOC }\)); and

  • The average of statistics V calculated from the 1000 realizations in the flow cloud, denoted by (\(V_{ref, SFSC }\)).

3 Results and discussion

The results are presented and discussed in three parts: first, a visual assessment of the synthetically generated climate and flow time series is presented. Second, the effect of the number of SWG realizations on the accuracy of basic annual climatic indices is assessed. Third, the degree of divergence between the sample and the cloud mean (the control group) is characterized using Cohen’s d effect size. Variability is presented via violin and box plots and graphs of the running mean, the RMSEr, and the RE, where the x-axis in each case represents the number of realizations, ranging from 1 to 1000. The same analysis is performed for each climate and flow variable, performance index, and reference value.

3.1 Visual convergence assessment

Figure 2 shows that the mean annual precipitation estimated by the MulGETS and WeaGETS realizations is reasonably close to the mean of the observed values (\(\mu_{\text{ref, PCP,OBS}}\)), whereas KNN underestimates it. The WeaGETS models (WE and WG) and KNN underestimated the standard deviation \(\sigma_{\text{ref, PCP,OBS}}\) of the annual precipitation, while both MulGETS models (ME and MG) captured σ adequately (Fig. 2). The kurtosis coefficients of the synthetic annual precipitation were consistently higher than those of the observed precipitation (Fig. 2). These results are consistent with the findings of Chen and Brissette (2014), who reported that the kurtosis coefficient of the mean annual precipitation is poorly reproduced by SWGs. The differences among the five models in terms of generating \(\alpha_{3}\) for the synthetic annual precipitation were not notable.

Fig. 2
figure 2

Plots of the precipitation statistics generated by five SWGs compared to the observed climate values (black line). The side boxes delineate the distributions of all realizations with the interquartile range (IQR: \({\text{q}}_{ 2 5}\) to \({\text{q}}_{ 7 5}\)), while the whisker limits correspond to \({\text{q}}_{ 2 5}\) ± 1.5 IQR

Fig. 3
figure 3

Plots of mean annual streamflow statistics generated by five SWGs compared to the observed flow and SFOC values (black and blue dashed lines, respectively). The side boxes delineate the distributions of all realizations with the interquartile range (IQR: \({\text{q}}_{ 2 5}\) to \({\text{q}}_{ 7 5}\)), while the whisker limits correspond to \({\text{q}}_{ 2 5}\) ± 1.5 IQR

The interannual σ’s of \({\text{SFSC}}_{\text{WE}}^{{}}\) and \({\text{SFSC}}_{\text{WG}}^{{}}\) were underestimated relative to the observed flow and, to a lesser degree, relative to the SFOC (Fig. 3). The interannual variability of \({\text{SFSC}}_{\text{KNN}}^{{}}\) closely matched that of the SFOC, while \({\text{SFSC}}_{\text{ME}}^{{}}\) and \({\text{SFSC}}_{\text{MG}}^{{}}\) fell between the two reference datasets (mostly underestimating the observed flow but overestimating the SFOC). Interestingly, the \({\text{SFSCs}}\) of all SWGs performed similarly, reproducing the \(\alpha_{3}\) of the observed flow well while overestimating the \(\alpha_{3}\) of the SFOC. As with the precipitation results, most outputs of the tested SWGs overestimated the \(\alpha_{4}\) of the annual streamflow relative to both the observed flow and the SFOC data (Fig. 3). In general, it is fairer to compare the SFSC to the SFOC than to the observed flow, as the former two were both simulated by SWAT and therefore share the biases of the model itself. Such figures help convey a general idea about the realizations, but further investigation using more rigorous statistical methods is needed.

The annual precipitation and streamflow statistics are plotted as a function of the number of realizations in Figs. 4 and 5. The running mean plots show the mean of all previous realizations up to and including each iteration displayed on the x-axis. These figures show how strongly the running mean fluctuates at the beginning of the sequence, making it difficult to construct robust confidence intervals. The statistics of the outputs of the five weather generators, however, do not differ much after 100 realizations; that is, almost all parameter estimates appear to stabilize around 100 realizations. Fluctuations caused by the stochastic generation of the cloud are eventually outweighed by the increasing number of realizations, as stated in Räisänen and Ruokolainen (2006) and as is the case for any Monte Carlo experiment (Cunha et al. 2014). In other words, the approximation can be refined by increasing the number of realizations until a desired level of precision is achieved. Graphs for the temperature variables are not presented owing to space limitations, but similar patterns were observed.

Fig. 4
figure 4

Running mean plots for the mean annual precipitation statistics generated by five SWGs in which the order of the realizations is random. The black dashed lines indicate the observed climate values

Fig. 5
figure 5

Running mean plots for the annual streamflow statistics generated by five SWGs in which the order of the realizations is random. The observed flow (SFOC) values are represented by the black (blue dashed) lines

3.2 Variations in the spread, RMSEr’s, and REs for key statistics as a function of the number of realizations

3.2.1 Climate space

As explained in the methodology section, the spread of the estimates was visualized using violin and box plots. The violin plots are accompanied by black box plots (boxes spanning the 25th to 75th percentiles, i.e., the interquartile range, with whiskers extending to 1.5 times the IQR). These plots were generated using the functions by Bastian Bechtold available on GitHub (Violin plots for Matlab, https://github.com/bastibe/Violinplot-Matlab). As expected, the variability in each of the indicators decreases as the number of realizations increases (Figs. 6, 7, and 8). The use of a single realization is not recommended due to the high error expected, particularly for applications that depend heavily on higher moments, such as an assessment of extremes. For instance, the estimate of the \(\alpha_{3}\) of the annual precipitation obtained from a single realization can be off by more than 500%. As the number of realizations increases, the expected error decreases dramatically; this decrease is particularly clear for the higher moments at 25 realizations and beyond. Moreover, using more than 100 realizations appears unnecessary.

Fig. 6
figure 6

Violin plots of the relative errors (%) of the main annual precipitation statistics for the N-realization samples used to estimate these statistics from the cloud; an N-realization sample is derived from 10,000 different randomly selected SFSC sets

Fig. 7
figure 7

Violin plots of the relative errors (%) of the main annual maximum temperature statistics for the N-realization samples used to estimate these statistics from the cloud; an N-realization sample is derived from 10,000 different randomly selected SFSC sets

Fig. 8
figure 8

Violin plots of the relative errors (%) of the main annual minimum temperature statistics for the N-realization samples used to estimate these statistics from the cloud; an N-realization sample is derived from 10,000 different randomly selected SFSC sets

The marginal improvements in the RMSEr’s of PCP, Tmax, and Tmin as the number of realizations increases are shown in Figs. 9, 10, and 11, respectively, where the synthetic climate using N realizations (relative to using N-1 realizations) is compared to the two reference datasets: the climate cloud (synthetic climate) and the observed climate. Tables 2, 3, and 4 present a similar comparison for the three climate variables, but relative to the results of a single realization. These results are consistent with the previous findings suggesting that after 100 realizations, the marginal improvement in the RMSEr becomes insignificant (e.g., less than a 0.21% (1.09%) improvement across SWGs in \(\mu_{\text{Tmax}}\) (\(\sigma_{\text{Tmax}}\)) when adding 900 realizations). Also, 25 realizations appear to be reasonably adequate, particularly for the first two moments (e.g., less than a 0.46% (2.34%) improvement across SWGs in \(\mu_{\text{Tmax}}\) (\(\sigma_{\text{Tmax}}\)) when adding 975 realizations). The results are very similar for the temperature variables, whereas the precipitation indicators require even fewer realizations.

Fig. 9
figure 9

Improvement in the RMSEr’s of the main annual precipitation statistics for the N-realization samples generated by the five SWGs. The RMSEr is calculated using either the observations (red line) or the cloud mean (blue line) as reference. The N-realization samples are derived from 10,000 different randomly selected SFSC sets. Scatter markers show the actual results; the lines are slightly smoothed using moving averages with a span of 3. Vertical black dashed, dash-dotted, and solid lines represent 10, 25, and 100 realizations, respectively

Fig. 10
figure 10

Improvement in the RMSEr’s of the main annual maximum temperature statistics for the N-realization samples generated by the five SWGs. The RMSEr is calculated using either the observations (red line) or the cloud mean (blue line) as reference. The N-realization samples are derived from 10,000 different randomly selected SFSC sets. Scatter markers show the actual results; the lines are slightly smoothed using moving averages with a span of 3. Vertical black dashed, dash-dotted, and solid lines represent 10, 25, and 100 realizations, respectively

Fig. 11
figure 11

Improvement in the RMSEr’s of the main annual minimum temperature statistics for the N-realization samples generated by the five SWGs. The RMSEr is calculated using either the observations (red line) or the cloud mean (blue line) as reference. The N-realization samples are derived from 10,000 different randomly selected SFSC sets. Scatter markers show the actual results; the lines are slightly smoothed using moving averages with a span of 3. Vertical black dashed, dash-dotted, and solid lines represent 10, 25, and 100 realizations, respectively

Table 2 Marginal improvements in RMSEr (\(RMSE_{r, mar. improvement}\)) of two precipitation reference datasets obtained by using N realizations relative to a single realization
Table 3 Marginal improvements in RMSEr (\(RMSE_{r, mar. improvement}\)) of two maximum temperature reference datasets obtained by using N realizations relative to a single realization
Table 4 Marginal improvements in RMSEr (\(RMSE_{r, mar. improvement}\)) of two minimum temperature reference datasets obtained by using N realizations relative to a single realization

3.2.2 Hydrological space

For the streamflow data, Fig. 12 presents the REs of the key annual streamflow statistics, including the mean, standard deviation, skewness, and kurtosis. The variability of each RE as a function of the number of realizations (1, 5, 10, 25, 50, 100, and 1000) is represented using violin plots, each of which is based on ten thousand N-realization samples randomly taken from the cloud containing all SFSC time series. Figure 12 strongly suggests that a single realization is not sufficient for representing SWGs in hydrological modeling. As with the climate variables, 100 realizations seem adequate, with very low relative errors across the different statistics.

Fig. 12
figure 12

Violin plots of the relative errors (%) of the main annual streamflow statistics for the N-realization samples used to estimate these statistics from the cloud; an N-realization sample is derived from 10,000 different randomly selected SFSC sets

The marginal improvements in the RMSEr’s of the streamflow statistics are plotted in Fig. 13, and Table 5 lists these improvements as functions of the number of realizations. In Fig. 13, the estimates are compared to the three reference values (SFSC, SFOC, and OBS Flow), whereas in Table 5 each estimate is compared to the estimate obtained using a single realization. The results are consistent with the previous findings, which suggest that after 100 realizations, the marginal improvement in the RMSEr becomes insignificant (e.g., less than a 0.55% improvement for all three reference datasets and across all SWGs for μ when compared to the μ calculated from 1000 realizations). Also, 25 realizations appear to be reasonably adequate, particularly for the first two moments (e.g., less than a 1.78% improvement for all three reference datasets and across the SWGs for μ when compared to the μ calculated from 1000 realizations).

Fig. 13
figure 13

Improvement in the RMSEr’s of the main annual streamflow statistics of the N-realization samples generated by five SWGs. The RMSEr is calculated using either the observed flow (green line), the simulated flow using the synthetic climate (SFSC, purple line), or the simulated flow using the observed climate (SFOC, blue line) as reference. The N-realization SFSC sample is derived from 10,000 different randomly selected SFSC sets. Scatter markers show the actual results; the lines are slightly smoothed using moving averages with a span of 3. Vertical black dashed, dash-dotted, and solid lines represent 10, 25, and 100 realizations, respectively

Table 5 Marginal improvements in RMSEr (\(RMSE_{r, mar. improvement}\)) of three reference datasets obtained by using N realizations relative to a single realization

3.3 Impact of the number of realizations on Cohen’s d effect size

Cohen’s d values for the precipitation, maximum temperature, minimum temperature, and streamflow statistics are presented in Figs. 14, 15, 16, and 17, respectively. These figures clearly indicate a very large effect size, as expected, when a single realization is used, for all variables, statistics, and SWGs. Ensembles of 5, 10, and 25 realizations are still not sufficient, but the effect size decreases toward zero as the number of realizations increases. Moreover, across variables, SWGs, and statistics, the upper quartile of the 10,000 randomly selected sample sets shows that in more than 75% of cases the effect size is small (d < 0.2) once 50 realizations are used. Further, the figures show that with 100 realizations the effect size of the randomly selected sets is smaller still, with Cohen’s d values always below 0.5 (the horizontal red dash-dotted line), indicating a small to medium effect. A similar interpretive framework was followed in the related prior literature (cf. Mehan et al. 2017; Guo et al. 2018).

Fig. 14
figure 14

Range of Cohen’s d results of precipitation statistics where an N-realization sample is derived from 10,000 different randomly selected sets from the five SWGs. Horizontal red dashed, dash-dotted, and solid lines represent 0.8, 0.5, and 0.2 Cohen’s d values, respectively

Fig. 15
figure 15

Range of Cohen’s d results of maximum temperature statistics where an N-realization sample is derived from 10,000 different randomly selected sets from the five SWGs. Horizontal red dashed, dash-dotted, and solid lines represent 0.8, 0.5, and 0.2 Cohen’s d values, respectively

Fig. 16
figure 16

Range of Cohen’s d results of minimum temperature statistics where an N-realization sample is derived from 10,000 different randomly selected sets from the five SWGs. Horizontal red dashed, dash-dotted, and solid lines represent 0.8, 0.5, and 0.2 Cohen’s d values, respectively

Fig. 17
figure 17

Range of Cohen’s d results of streamflow statistics where an N-realization sample is derived from 10,000 different randomly selected sets from the five SWGs. Horizontal red dashed, dash-dotted, and solid lines represent 0.8, 0.5, and 0.2 Cohen’s d values, respectively

3.4 Discussion

The main finding of this work is that, while a larger number of realizations may provide a better representation of climate variability, a limited, problem-dependent number of realizations can provide robust estimates of key risk statistics. In this particular application, the marginal improvements in the RMSEr’s of all statistics (for both climatic and hydrological variables) are not substantial after 25 realizations, particularly for the first two moments (i.e., μ and σ) and, to a lesser extent, for the higher moments (i.e., \(\alpha_{3}\) and \(\alpha_{4}\)). Cohen’s d, which characterizes the degree of divergence between the sample and the cloud mean (the control group), was used to compare effect sizes as the number of realizations increases. As in any Monte Carlo simulation relying on repeated random sampling, the marginal value of an additional realization decreases as the number of realizations increases. These metrics suggest that, for this particular SWAT model and this particular set of SWGs, going beyond 100 realizations is redundant and incurs a higher computational cost, as the subsequent improvement is minimal even for the higher moments. Such results can help account for risk in quantitative analysis and decision making in a variety of water- and environment-related problems with minimal computational and time requirements.

An interesting finding is that systematic biases within the weather generators and the rainfall-runoff model cause the SFSC and SFOC to differ from the observed flow, and increasing the number of realizations cannot reduce these biases. That is, repeated runs of a given SWG that tends to misestimate a particular variable will not yield a correct characterization of the observed variable. A few ways to decrease such biases include improving the SWGs, selecting a SWG with minimal bias, and/or improving the rainfall-runoff model. Alternatively, one can generate a large dataset of realizations and then select the subset of realizations that best represents the observed set, as suggested by Gitau et al. (2018). However, the latter approach remains challenging, as it can be computationally expensive and time-consuming. The simplest of these solutions is to use the methodology presented herein to select the number of realizations that leads to the lowest feasible RMSEr or RE for the problem at hand (i.e., the point beyond which the relative improvement becomes very small). Alternative methods for assessing SWGs include statistical tests of significance, such as the t test and F test (e.g., Min et al. 2011; Chen and Brissette 2014); the \(\chi^{ 2}\) goodness-of-fit test (e.g., Semenov et al. 1998); nonparametric tests, such as the Wilcoxon rank sum test, the Kolmogorov–Smirnov (K–S) test, and the Mann–Whitney test (e.g., Zhang and Garbrecht 2003; Qian et al. 2004; Chen et al. 2010); the RMSEs of various statistics of interest (e.g., Mehrotra et al. 2006); and distance techniques, such as the Mahalanobis distance between statistics derived from observed and simulated time series (e.g., Alodah and Seidou 2019a).

One limitation of the present work is that the results are specific to a particular hydrological model, a particular watershed, and particular SWGs. However, the methodology can be applied to any case in which multiple weather generators are compared and there is a strong incentive to limit the number of simulations, for instance to save time and computational resources. The current paper required 5000 runs (scenarios) of the SWAT model, and the computation time required to complete these scenarios was almost a month on a typical desktop workstation (Intel Core i7-4790 processor @ 3.60 GHz (8 CPUs), 16 GB (2 × 8 GB) RAM, 1 TB disk), exclusive of the subsequent time spent post-processing the outputs. The time involved could be even higher for larger watersheds or a longer simulation period. Thus, the use of a small but adequately representative number of realizations, as determined herein, can significantly reduce the computational burden and the simulation time without losing much information (e.g., 25 realizations would take less than a day on a 3.60 GHz Intel Core i7 CPU with 16 GB of RAM). Furthermore, the methodology presented in this paper has the advantage of making a straightforward link between the number of realizations and common statistical indicators, and is therefore more likely to appeal to practitioners.

Indeed, it can be argued that the results depend on the SWG, the hydrologic model, and the risk parameter used; high flow quantiles and other hydrological outputs, such as sediment loads, would behave differently. Therefore, the results of this work could be further extended to include multiple hydrological models and additional risk parameters. However, the vast majority of risk statistics derived from environmental models are related to the first four moments of the time series, which were examined in this paper. We also used a relatively complex hydrological model that is applied worldwide, suggesting that the results of this study would be a reasonably informed guess in most practical cases where the modeler does not want to run an experiment to determine the optimal number of realizations. Our findings are supported by those of Guo et al. (2018), who found an optimal number of 25 realizations using a different set of SWGs and risk parameters on a different watershed.

4 Conclusions

In summary, five SWGs coupled with a SWAT model were used to generate multiple time series of four hydroclimatic variables at four climatic stations and one hydrometric station in the South Nation Watershed, located in Eastern Ontario, Canada. The investigated variables, namely precipitation, maximum and minimum air temperature, and streamflow, were used to determine the optimal size of the cloud of realizations representing the SWGs. Four risk and performance indicators, namely the mean, standard deviation, skewness, and kurtosis of these variables, were estimated to assess the level of agreement between the synthetic time series and the observations. The number of realizations required to reach a predefined relative root mean square error was then investigated with the aim of enabling computationally inexpensive impact studies. Using the two error metrics (RE and RMSEr) and the effect-size metric (Cohen’s d), it was shown that when the number of realizations is high, the five weather generators considered perform somewhat similarly in terms of reproducing the risk and performance indicators. Overall, the results indicate that there is no major benefit from generating more than 25 realizations in hydrological modeling. Applications requiring more precision (e.g., the analysis of hydro-climatic extreme events) may use 100 realizations, as the results obtained from 100 realizations are not notably different from those obtained using 1000 realizations. Adopting a smaller, but carefully chosen, number of realizations can significantly reduce the workload on analysts and therefore benefit a larger audience in risk assessment studies, particularly when high-performance machines are not easily accessible.