1 Introduction

Weather generators are increasingly used in environmental, water resources, and agricultural applications, especially where observed data are not consistent or available in sufficient quantities for these applications (Yu 2003; Fodor et al. 2013; Lemann et al. 2016). In climate-related studies, the generators are used to obtain localized predictions based on observed data (Al-Mukhtar et al. 2014) and/or to downscale from coarse-resolution predictions (Semenov and Barrow 1997; Fatichi et al. 2011; Farzanmanesh et al. 2012). Given this potential, weather generators need to be evaluated to determine the suitability of generated data for the intended application. This is particularly so with respect to their ability to capture extreme events, given the projected changes in the magnitude and severity of such events and the challenges they present with respect to impact (Klein Tank et al. 2009). Considerable work has been conducted to determine weather generator effectiveness in capturing the primary statistical characteristics of observed data, such as the daily mean, standard deviation, skewness, kurtosis, and percentiles (e.g., Zhang and Garbrecht 2003; Min et al. 2011; Chen and Brissette 2014), and other essential characteristics, as described in Gitau et al. (2017). For example, Mehan et al. (2017) evaluated three commonly used weather generators with respect to reproducing the distribution, occurrence of wet and dry spells, number of snow days, growing season temperatures, and growing degree days in relation to those from observed data.

Relatively less work has been done on generator effectiveness in capturing extremes, with primary works including Kyselý and Dubrovský (2005), Semenov (2008), Hundecha et al. (2009), and Acharya et al. (2017). Extreme metrics assessed in these works include: mean annual maxima of daily precipitation and maximum temperatures; mean annual minima of minimum temperatures; distributions of annual 1-day maxima; lengths of heat and cold waves; 10- and 20-year return periods determined based on Generalized Extreme Value distributions; wet/dry probabilities; and, lengths of wet and dry periods and their distributions. Annual means of maximum precipitation were generally captured well (Semenov 2008) as were annual maxima, multi-day extremes, and heat/cold characteristics for temperatures (Kyselý and Dubrovský 2005). The weather generators evaluated were, however, less effective in capturing 1-day extremes for temperature (Kyselý and Dubrovský 2005; Semenov 2008), while a mixed distribution was better at reproducing 1-day precipitation extremes than the gamma distribution (Hundecha et al. 2009).

Weather generators produce stochastic outputs. Thus, it is unlikely that any two simulation runs (realizations) will produce the same output. This is desirable as it provides a range of possible outputs, making it possible to better capture the range and variability of climate. It also provides opportunities for probabilistic assessments (Kalnay et al. 2006; Wiegel 2012). Several studies have used multiple realizations, including: 10 (Elliot and Arnold 2001); 30 (Caron et al. 2008); 100 (Hundecha et al. 2009; Min et al. 2011); 250 (Gitau et al. 2012); and, 1000 (Brisson et al. 2015), although the use of one realization is more common (e.g., Zhang and Garbrecht 2003; Farzanmanesh et al. 2012; Fan et al. 2013; Al-Mukhtar et al. 2014). Using too many realizations would be time consuming and computationally expensive, especially where generated data are used in subsequent applications.

In this paper, we present an evaluation of weather generator representation of climate extremes based on work done with the Long Ashton Research Station Weather Generator (LARS-WG; Racsko et al. 1991) on three stations in the Western Lake Erie Basin, U.S., and considering 50 realizations. Since daily weather data are needed to run hydrologic, water quality, and crop-growth models, this study was focused on the suitability of generated values on a daily time scale. We also discuss some important considerations with respect to the use of multiple realizations and with regard to translating the work to data-sparse and non-humid regions.

2 Methods

2.1 Study Site and Base Data Description

This study was focused on the Fort Wayne, Norwalk, and Adrian stations in the Western Lake Erie Basin, part of the U.S. Great Lakes region. These stations, which were adopted from Mehan et al. (2017), were selected based on an analysis of eight stations within the basin, and so as to provide coverage across the basin. Long-term and consistent datasets were available for all three stations, making them suitable for use with the analysis. Based on historical data ranging from as far back as July 1887 to November 2017 (MRCC 2017), 1-day maximum precipitation can range from 125.2 mm (4.9 in.) to 229.1 mm (9.0 in.), these having been recorded at Fort Wayne and Norwalk, respectively. One-day maximum temperatures can range from 20.0 °C (68.0 °F) in January to 42.2 °C (108.0 °F) in July, while one-day minima can range from −31.1 °C (−24.0 °F) in January to 6.7 °C (44.0 °F) in July based on July 1887 to June 2017 records for Fort Wayne and Adrian (MRCC 2017). Historical data (1887-current) on extreme temperatures were not available at Norwalk. For this study, observed data for the period 1966–2015 were used with 100%, 98.8%, and 97.6% availability at Fort Wayne, Norwalk, and Adrian, respectively.

2.2 Data Development

The data used for this study were adopted from Mehan et al. (2017). These data included fifty years of observed data for the period 1966–2015 and synthetic data comprising 50 realizations, each 50 years in length, generated from LARS-WG. The synthetic data were developed using different random number seeds to generate each realization, such that each set of 50-year outputs was expected to be different from the others making it possible to represent a range of variability in the generated values (Mehan et al. 2017). An analysis of LARS-WG effectiveness in reproducing essential characteristics of observed data at the three stations was conducted by Mehan et al. (2017). Selected results are provided in Table 1. Based on the analysis, LARS-WG captured the essential characteristics relatively well and was found especially suitable for hydrologic and water resources applications (Mehan et al. 2017).
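The role of the random number seed can be illustrated with a minimal sketch. The generator below is a toy stand-in (a fixed wet-day probability with gamma-distributed amounts) and is not the LARS-WG algorithm; the function name `toy_realization` and all parameter values are assumptions made purely for illustration.

```python
import random

def toy_realization(seed, n_years=50, wet_prob=0.3):
    """Toy daily precipitation series in mm; NOT the LARS-WG algorithm."""
    rng = random.Random(seed)  # one independent random stream per seed
    series = []
    for _ in range(n_years * 365):
        if rng.random() < wet_prob:
            # Hypothetical gamma shape/scale, chosen only for illustration
            series.append(rng.gammavariate(0.8, 8.0))
        else:
            series.append(0.0)
    return series

# 50 realizations, each 50 years long, one seed per realization
realizations = [toy_realization(seed) for seed in range(50)]

# Mean annual precipitation differs from one realization to the next
mean_annual = [sum(r) / 50 for r in realizations]
print(min(mean_annual), max(mean_annual))
```

Reusing a seed reproduces the same series, whereas distinct seeds give distinct series; the spread across the set is what represents the range of variability.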

Table 1 Summary of LARS-WG performance in representing essential characteristics of observed data for Adrian, Norwalk, and Fort Wayne stations. Adapted from Mehan et al. (2017)

2.3 Analysis of Extremes

A number of extreme metrics were evaluated in this study considering seasonal, annual, and multiday contexts, as well as sequences. Seasonal metrics included: the 1-day maximum precipitation; 1-day maximum temperatures; and, 1-day minimum temperatures, each of which was considered on a seasonal basis. The 1-day maximum precipitation provides important information for flood risk assessment (Hundecha et al. 2009). Extreme high or extreme low temperatures can lead to crop damage (Semenov 2008) if these occur during the growing season. For example, corn, which is an economically important crop in the study area, is severely impacted when temperatures rise above 35 °C (95 °F) or drop below 10 °C (50 °F). For this study, the seasons were defined as: winter (Jan, Feb, Mar); spring (Apr, May, Jun); summer (Jul, Aug, Sep); and, fall (Oct, Nov, Dec), consistent with definitions in other studies conducted within the basin.

Annual metrics included the number of days in a year in which: precipitation exceeded 50.8 mm (2 in.); precipitation exceeded 101.6 mm (4 in.); maximum temperatures exceeded 32.2 °C (90 °F); and, minimum temperatures were below 0 °C (32 °F). The metrics related to precipitation indicate the extent to which heavy precipitation is experienced in an area. The amounts represent precipitation that is much greater than normal (Karl et al. 1996). Days with minimum temperatures below 0 °C indicate frost days, while 32.2 °C is associated with heat stress and heat injury (NIOSH 2016; Blevins 2011; Epstein and Moran 2006; Griffin undated).

Continuous periods in which extreme events occur are of interest due to the sustained impacts on agricultural, environmental, and water resource systems. Continuous metrics considered in this study include: number of periods with 7-day continuous precipitation; number of dry sequences; and, number of wet sequences in the 50-year period considered. The latter two were derived from Mehan et al. (2017) as these had been computed using the same dataset used in this study.

Visual comparisons were also employed to allow a more detailed examination of the data and associated distributions. This was done using Q-Q plots and plots of distribution of distributions (Mehan et al. 2017) with a focus on skewness. Q-Q plots are typically used to determine whether data come from a particular distribution (Dalgaard 2008; Hensel and Hirsch 2002). For this study, Q-Q plots were developed considering annual 1-day maximum precipitation and temperatures and annual 1-day minimum temperatures from the observed versus simulated datasets. This was done to assess whether the data came from the same distribution, with the expectation that quantiles coming from the same distribution would form a straight line. Thus, the Q-Q plots allowed an evaluation of where and to what extent there were deviations in the data, consistent with Hundecha et al. (2009).
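The seasonal, annual, and continuous metrics described above can be sketched as follows. This is an illustrative sketch only, not the code used in the study: the function names are hypothetical, a wet day is assumed to be any day with precipitation greater than 0 mm, and a period with 7-day continuous precipitation is assumed to be an unbroken run of at least seven wet days.

```python
from collections import defaultdict
from datetime import date, timedelta

# Season definitions used in this study (winter = Jan-Mar, etc.)
SEASONS = {1: "winter", 2: "winter", 3: "winter",
           4: "spring", 5: "spring", 6: "spring",
           7: "summer", 8: "summer", 9: "summer",
           10: "fall", 11: "fall", 12: "fall"}

def seasonal_extreme(days, values, pick=max):
    """Most extreme 1-day value per season (pick=max or pick=min)."""
    out = {}
    for d, v in zip(days, values):
        season = SEASONS[d.month]
        out[season] = v if season not in out else pick(out[season], v)
    return out

def days_per_year(days, values, condition):
    """Number of days in each year for which condition(value) holds."""
    counts = defaultdict(int)
    for d, v in zip(days, values):
        counts[d.year] += 1 if condition(v) else 0
    return dict(counts)

def count_wet_periods(precip, min_length=7, wet_threshold=0.0):
    """Number of unbroken wet runs lasting at least min_length days."""
    periods, run = 0, 0
    for p in precip:
        if p > wet_threshold:
            run += 1
        else:
            periods += run >= min_length
            run = 0
    return periods + (run >= min_length)

# Ten days of made-up data, for illustration only
days = [date(2015, 1, 1) + timedelta(days=i) for i in range(10)]
precip = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 0.0, 60.0, 0.0]   # mm
tmin = [-5.0, 1.0, -2.0, 0.0, 3.0, -1.0, 2.0, 4.0, -3.0, 5.0]  # deg C

heavy_days = days_per_year(days, precip, lambda p: p > 50.8)
frost_days = days_per_year(days, tmin, lambda t: t < 0.0)
max_precip = seasonal_extreme(days, precip)
min_tmin = seasonal_extreme(days, tmin, pick=min)
wet_periods = count_wet_periods(precip)
```

Passing the threshold as a small function keeps the same counter usable for the 50.8 mm, 101.6 mm, 32.2 °C, and 0 °C metrics.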

3 Results and Discussion

3.1 Performance on Extreme Metrics

Based on the analysis (Table 2), LARS-WG generally underestimated 1-day maximum precipitation across all stations based on the range and median values obtained. However, other extreme values associated with precipitation were captured relatively well, although there was a tendency towards overestimation for the number of days with greater than 50.8 mm (2 in.) precipitation. In the series of 50 years considered for this study, there were 29, 34, and 36 periods with continuous 7-day precipitation recorded from observed data for Adrian, Norwalk, and Fort Wayne, respectively. Values obtained from the generator were slightly lower based on median values (24, 28, and 32, respectively). From the observed data, the highest value for maximum one-day precipitation was observed during summer except at Fort Wayne, where the maximum one-day precipitation was recorded in spring during the study period (1966–2015). This differed from the historical data (MRCC 2017), in which all the highest values occurred in the summer. The generator was, however, able to capture the prevailing pattern in the data. The 1-day maximum precipitation values for the study period were also similar to historical values, except at Fort Wayne where the value was substantially lower (111.8 mm compared with 125.7 mm for the historical). Precipitation events larger than 101.6 mm (4 in.) were rare and these were generally well captured by LARS-WG. The generator captured the occurrence of wet and dry spells relatively well, as detailed in Mehan et al. (2017).

Table 2 Evaluation of LARS-WG effectiveness in simulating different extreme variables associated with precipitation and temperature in comparison to values obtained from observed data

One-day maximum temperatures in the fall and winter were captured relatively well. However, the generator had a tendency to underestimate these values in the spring and summer. The number of days with maximum temperatures greater than 32.2 °C (90 °F) obtained from simulated data was much lower than (about half of) that obtained from observed data, suggesting a tendency towards simulating fewer extremely hot days. The highest value of 1-day maximum temperature was observed in the spring at all stations during the study period (1966–2015). The generator tended to simulate about the same values in both spring and summer. This was consistent with historical values of 1-day maximum temperatures, where maximum values were recorded in June and July for Fort Wayne and July for Adrian. Seasonal 1-day minimum temperatures were simulated relatively well at all stations, although actual winter values tended to coincide with the absolute minima of the simulated data, hence a tendency to simulate winter temperatures that were warmer than observed. However, the number of days with temperatures less than 0 °C (32 °F) was overestimated, hence overall more frost days being simulated by the generator.

3.2 Visual Comparisons

For annual 1-day maximum precipitation, quantiles from the simulated data matched those from the observed data relatively well at Fort Wayne. Some deviations were observed at the lower end and towards the upper end at Adrian, although the quantiles at the upper end were relatively well matched. For the most part, the plot deviated from the ideal at Norwalk with larger deviations seen on the upper tail. This was consistent with findings in Mehan et al. (2017), in which the generator tended to overestimate the 99.5 percentile precipitation at this station and the number of days with greater than the 99.5 percentile precipitation, although this latter value was overestimated at all stations (Table 1). The distributions of annual 1-day maximum temperatures were not captured well at any of the stations (Fig. 1). In general, there was a tendency to overestimate values in the lower quantiles and underestimate those in the upper quantiles. Annual 1-day minimum temperatures were generally well captured at the lower end although with a tendency toward underestimating the lower colder temperatures and overestimation at the higher quantiles. Overall, simulated 1-day minimum temperatures were warmer at the lower extremes and cooler at the higher ends relative to observed data. This pattern was consistent across the stations based on the Q-Q plots (Fig. 1). When considered across realizations, the skewness of annual 1-day maximum precipitation was better captured at Fort Wayne but tended to be underestimated at both Norwalk and Adrian (Fig. 2). The skewness for both annual 1-day maximum temperatures and annual 1-day minimum temperatures was generally underestimated by LARS-WG although simulations were better for minimum temperatures for which values were only slightly underestimated (Fig. 2). The implications of underestimated skewness are more evident with precipitation and translate to a tendency to overestimate mean and median values. This effect is less evident with annual 1-day maximum and minimum temperatures due to the nature of their distributions.
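The quantities behind the Q-Q and skewness comparisons can be sketched as below. This is a hedged illustration rather than the procedure used in the study: the plotting positions, the linear-interpolation quantile, and the moment-based skewness estimator are all assumptions, and the function names are hypothetical.

```python
def quantile(sorted_vals, p):
    """Quantile at probability p by linear interpolation (assumed method)."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def qq_pairs(observed, simulated):
    """(observed, simulated) quantile pairs at common probabilities."""
    obs, sim = sorted(observed), sorted(simulated)
    n = len(obs)
    probs = [(i + 0.5) / n for i in range(n)]  # assumed plotting positions
    return [(quantile(obs, p), quantile(sim, p)) for p in probs]

def skewness(values):
    """Moment-based sample skewness, m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def yearwise_max(realizations):
    """Largest value in each year, regardless of the realization it came from."""
    return [max(year_values) for year_values in zip(*realizations)]

# Made-up annual 1-day maximum precipitation (mm), for illustration only
observed = [30.5, 42.1, 55.0, 38.7, 61.2]
simulated = [[28.0, 40.3, 47.9, 35.2, 52.4],
             [31.1, 39.8, 50.6, 36.0, 49.7]]

pairs = qq_pairs(observed, yearwise_max(simulated))
skew_by_realization = [skewness(r) for r in simulated]
```

Points in `pairs` that fall along the 1:1 line indicate matching quantiles, and one skewness value per realization gives the kind of spread that can be compared against the single observed value.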

Fig. 1 Q-Q plots for annual 1-day maximum precipitation, annual 1-day maximum temperatures, and annual 1-day minimum temperatures based on values simulated by LARS-WG for 50 different realizations, each 50 years in length (maximum value in any one year was used regardless of the realization in which it occurred), for the weather stations at Adrian, MI; Norwalk, OH; and Fort Wayne, IN

Fig. 2 Variability of skewness for annual 1-day maximum precipitation, annual 1-day maximum temperatures, and annual 1-day minimum temperatures based on values simulated from the weather generator LARS-WG for 50 different realizations, each 50 years in length, for the weather stations at Adrian, MI; Norwalk, OH; and Fort Wayne, IN compared to the values calculated from observed data (dashed lines)

3.3 Suitability of Number of Realizations

Fifty realizations were used in this study as this was the number of realizations used by Mehan et al. (2017), from which the data were taken. Based on a comparison of 1, 10, 25, 50, and 100 realizations, Guo et al. (2017) found no added advantage in going beyond 25 realizations, given that generator effectiveness in capturing the essential characteristics of observed data did not improve appreciably when more than 25 realizations were used. Guo et al. (2017) also found that 10 realizations were not sufficient to capture essential characteristics of observed data. That said, increasing the number of realizations was not helpful for a particular characteristic if the generator was unable to capture the characteristic. An assessment of data for the Fort Wayne and Norwalk stations using 25 realizations taken at random (Table 3) showed that annual 1-day maximum precipitation was generally estimated well except at the absolute maximum, where it was underestimated at both stations, consistent with results from 50 realizations; the 75th percentile value was, however, overestimated at Norwalk. Annual 1-day maximum temperatures were also well captured except for the maximum value, which was underestimated. Annual 1-day minimum temperatures were underestimated (warmer) at the lower end and overestimated (colder) at the higher end, also consistent with results from 50 realizations. The interquartile range for annual 1-day minimum temperatures based on simulated values was smaller than that from observed data at the Fort Wayne station.

The number of realizations has implications for computational needs, especially if the generated data are used in subsequent applications such as water resources modeling, thus the need to determine a suitable number of realizations. Oftentimes, a long time series is generated so as to reduce sampling error associated with the stochastic nature of the weather generator (e.g., Wilks 2002; Zhang and Garbrecht 2003; Chen et al. 2012; Eames et al. 2012; Chen and Brissette 2014), sometimes dividing these into sets of equal lengths (pseudo-realizations), e.g., Kou et al. (2007). In the latter case, the entire dataset is essentially a reflection of the single realization and errors may simply be propagated across the dataset as discussed in Gitau et al. (2017). In general, generating a time series that is substantially longer than the base data could result in biases as the sampling of the distribution would be insufficient (Mithen and Black 2011).
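The 25-realization assessment described above can be sketched as drawing a random subset and summarizing its distribution. How the values in Table 3 were actually aggregated is not stated here, so pooling the annual values across the selected realizations is an assumption, as are the function names and the toy input data.

```python
import random
import statistics

def five_number_summary(values):
    """Minimum, quartiles, and maximum of a list of values."""
    q25, median, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q25": q25, "median": median,
            "q75": q75, "max": max(values)}

def subset_summary(annual_values_by_realization, k=25, seed=0):
    """Summarize k realizations drawn at random without replacement."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(annual_values_by_realization)), k)
    # Assumption: annual values are pooled across the chosen realizations
    pooled = [v for i in chosen for v in annual_values_by_realization[i]]
    return sorted(chosen), five_number_summary(pooled)

# Toy data: 50 realizations x 50 annual 1-day maxima (mm), illustration only
rng = random.Random(42)
toy_maxima = [[rng.gammavariate(9.0, 5.0) for _ in range(50)]
              for _ in range(50)]

chosen, summary_25 = subset_summary(toy_maxima, k=25, seed=1)
_, summary_50 = subset_summary(toy_maxima, k=50, seed=1)
print(summary_25)
print(summary_50)
```

Comparing `summary_25` with `summary_50` against the same summary of observed data is one way to judge whether the smaller set changes the conclusions.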

Table 3 Distribution of simulated data for annual 1-day maximum precipitation, annual 1-day maximum temperature, and annual 1-day minimum temperatures as obtained from 25 realizations in comparison to values from observed data at Fort Wayne and Norwalk stations

3.4 General Discussion

In this study, an evaluation of weather generator representation of climate extremes was conducted based on simulated values from LARS-WG and observed data from three stations in the Western Lake Erie Basin, and considering 50 realizations. Generally, LARS-WG captured the number of days with precipitation greater than 50.8 mm (2 in.) and 101.6 mm (4 in.), 7-day wet sequences, and wet and dry sequences relatively well. The distribution of 1-day maximum precipitation was generally captured well with the exception of the upper tail at the Norwalk station. One-day maximum temperatures were also captured relatively well in the fall and winter, although the generator tended to underestimate spring and summer values across all stations. The generator greatly underestimated the number of days per year with maximum temperatures greater than 32.2 °C (90 °F) and overestimated the number of days with temperatures less than 0 °C (32 °F).

When evaluating generator effectiveness, the choice of characteristics to evaluate largely depends on the intended use for the synthetic output. However, 1-day extreme variables (maximum precipitation, maximum temperatures, minimum temperatures), tendency to exceed selected extreme thresholds (number of days in which the threshold is exceeded), occurrence and severity of multi-day events (e.g., 7-day precipitation, wet sequences, dry sequences), and tendency to exceed certain percentile thresholds (e.g., 90th percentile precipitation) are commonly evaluated. These can be categorized as absolute, threshold, duration, and percentile-based metrics, respectively, based on Sillmann et al. (2013). The 90th and 10th percentiles, for example, have been used to represent extreme hot days and extreme cold days (Vincent et al. 2005), while the 95th percentile has been used to represent extreme precipitation (Sillmann and Roeckner 2008). Metrics based on percentiles reflect anomalies relative to the local climate and are, therefore, location-specific (Klein Tank and Können 2003). They, however, have an advantage in that they sample the same part of a distribution regardless of the location, making them especially suitable for comparative analysis across locations (Klein Tank et al. 2009; Zhang et al. 2011). More universal measures include: a 50.8 mm (2 in.) threshold for precipitation; a 30 °C (86 °F) threshold for maximum temperatures, used in relation to heat waves; and, a 32.2 °C (90 °F) threshold for maximum temperatures, which is commonly associated with heat stress as previously discussed. Semenov (2008) defined a heat wave as two or more days with temperatures greater than 30 °C. A threshold of 0 °C (32 °F) is used for assessment of frost.
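The difference between a fixed threshold and a percentile-based threshold, and the heat wave definition cited above, can be illustrated with a short sketch. The function names are hypothetical, the days in a heat wave are assumed to be consecutive, and the percentile is computed by linear interpolation, which is only one of several conventions.

```python
def count_heat_waves(tmax, threshold=30.0, min_days=2):
    """Number of runs of at least min_days consecutive days above threshold."""
    waves, run = 0, 0
    for t in tmax:
        if t > threshold:
            run += 1
        else:
            waves += run >= min_days
            run = 0
    return waves + (run >= min_days)

def percentile_threshold(values, pct):
    """Location-specific threshold: the pct-th percentile of local data."""
    ordered = sorted(values)
    pos = (pct / 100.0) * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def days_above(values, threshold):
    """Number of days on which the value exceeds the threshold."""
    return sum(1 for v in values if v > threshold)

# Made-up daily maximum temperatures (deg C), for illustration only
tmax = [31.0, 32.0, 25.0, 31.0, 29.0, 33.0, 34.0, 35.0]

waves = count_heat_waves(tmax)           # fixed 30 deg C threshold
hot_fixed = days_above(tmax, 32.2)       # fixed 32.2 deg C threshold
p90 = percentile_threshold(tmax, 90)     # local 90th percentile
hot_local = days_above(tmax, p90)
```

A fixed threshold yields counts that vary with the local climate, whereas the percentile threshold adapts to each location and samples the same part of the distribution everywhere.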

Generally, a number of realizations are needed in order to capture the range of variability in climate. The use of multiple realizations is, however, only useful with respect to characteristics that the generator is able to simulate relative to observed data (Guo et al. 2017; Gitau et al. 2017). For example, if a generator has the tendency to underestimate a particular extreme variable, then increasing the number of realizations will result in multiple datasets in which the variable is underestimated. The use of multiple weather realizations in subsequent applications, such as water resources modelling studies, provides opportunities to capture a range of possible outcomes, particularly as related to future climate. Such evaluations could, however, become computationally expensive and produce large datasets beyond what can be handled effectively using a standard workstation. An alternative might be to select a best set from the realizations based, for example, on verification methods for paired data such as those described in Jolliffe and Stephenson (2012). However, the time series produced by weather generators and the base input data are not necessarily paired, although this might be implied if the length of the generated data is equal to that of the base data. Moreover, if the length of the generated data (Y) is shorter than that of the base input data (X), generated data could represent any of the many different potential combinations of time series each of length Y, as detailed in Gitau et al. (2017). This might present a challenge where data are generated for use with modelling applications, and especially when model outputs are used in parameter optimization. A challenge may also arise where data are extrapolated beyond the length of the base input data, as is sometimes done where the interest is in evaluating impacts of factors other than climate or in assessing future scenarios.
An important aspect relates to how generator effectiveness would translate with respect to representing future climate. Some key questions to be addressed include how the statistics might change as climate change impacts are realized and considerations that would be needed to allow the generator to be used in a meaningful way to quantify the impacts. Such analysis is, however, beyond the scope of this work.

The stations considered in this study had very rich datasets with long records dating as far back as 1887 and consistent data availability through the study period. This is not often the case. Furthermore, the study site is located in a humid region where rainfall occurs throughout the year. It is not clear how the generator would perform with respect to both essential characteristics and extreme variables in areas with much smaller and less consistent observed datasets and/or in non-humid regions. Possibly the biggest challenge with respect to data-sparse regions is in the ability to derive a suitable set of input parameters. A 20-year period is recommended for LARS-WG (Semenov and Barrow 1997). However, such a record might be difficult to find in some areas and the analysis of extremes might be confounded by the limited knowledge of such events afforded by an inconsistent dataset. In arid and semi-arid regions, precipitation events occur infrequently but are often very severe. Extreme temperatures might also be much higher than what is implied by global thresholds. The occurrence of such events might, thus, be characterised differently than they would in humid regions. Thus, more localized metrics such as percentiles might be more applicable.

4 Conclusions

Given the increasing use of weather generators in environmental, water resources, and agricultural applications, and their potential for use in climate impact and climate change studies, it is important that the generators be evaluated for their suitability for the area in which they would be applied. This is particularly so with respect to their ability to capture extreme events. In this study, LARS-WG was evaluated using data from three stations in the Western Lake Erie Basin. Generally, the generator performed well particularly with respect to capturing wet and dry sequences, winter and fall 1-day maximum temperatures and 1-day minimum temperatures in general. The 50 realizations used in this study were sufficient for the analysis although similar results might have been obtained with 25 realizations based on recent literature. Further work is needed to determine the impact of number of realizations on generator effectiveness in simulating extremes. Further work is also needed to evaluate the generator for use in data-sparse and/or non-humid regions.