1 Introduction

East Coast Lows (ECLs) are intense low-pressure systems that occur off the eastern coast of Australia. Extreme rainfall events associated with ECLs frequently cause significant flash flooding near the coast, as well as major flooding in river systems with headwaters in the Great Dividing Range. Despite their destructive capacity, ECLs also play a beneficial role for coastal communities, as the associated rainfall provides significant inflow to coastal storages along the New South Wales (NSW) coast (Pepler and Rakich 2010). Large events can even provide inflow to the headwaters of western-flowing rivers, particularly in northeastern NSW. Correctly capturing ECL features in models, in particular reproducing the observed rainfall amounts and spatial patterns, is therefore important for both hazard management and water management.

The Weather Research and Forecasting (WRF) model (Skamarock et al. 2008) is a numerical weather prediction and atmospheric simulation system designed for operational forecasting, atmospheric research, and dynamical downscaling of Global Climate Models. Previous studies have shown that the WRF model performs well in simulating the regional climate of south-eastern Australia (Evans and McCabe 2010, 2013; Evans and Westra 2012). Evans et al. (2012) evaluated physics scheme combinations for hindcast simulations of four ECL events using the WRF model. The authors investigated the influence of selecting different planetary boundary layer (pbl), cumulus (cu), microphysics (mp), and radiation (ra) schemes on the accuracy of maximum and minimum temperature, wind speed, mean sea level pressure, and rainfall. Similar sensitivity studies have been carried out for other regions (Yuan et al. 2012; Jankov et al. 2005; Awan et al. 2011). For example, Yuan et al. (2012) used the WRF model configured with two alternative schemes each for mp, cu, ra, and land surface physics when forecasting winter precipitation in China. These studies were unable to identify a single “best” physics scheme combination for all variables, although some combinations clearly performed better than others for certain variables (Evans et al. 2012).

While sensitivity analyses do not identify a single best model configuration (Jankov et al. 2005; Evans et al. 2012), other methods can be used to maximize the information gained from multiple model runs with different parameterizations. One such method is ensemble averaging, which is widely used in weather forecasting, seasonal prediction, and climate simulation (Fraedrich and Leslie 1987; Hagedorn et al. 2005; Phillips and Gleckler 2006; Schwartz et al. 2010; Schaller et al. 2011). Many studies have investigated ensemble averages of various regional climate models and perturbed initial conditions in simulating regional rainfall (Cocke and LaRow 2000; Yuan and Liang 2011; Carril et al. 2012; Ishizaki et al. 2012). Some showed that an ensemble average has skill in reproducing heavy rainfall events (Yuan and Liang 2011; Yuan et al. 2012), while others reported results that were far from satisfactory (Carril et al. 2012).

In this paper, we evaluated the skill of ensemble averages (for the full ensemble and for subsets sharing particular physics schemes), relative to individual ensemble members, in capturing the spatial and distributional properties of rainfall associated with eight ECLs. We used the same physics scheme combinations as Evans et al. (2012) but extended the modeling to cover four additional events, in order to give a more complete representation of the different types of synoptic events typically associated with ECLs (Speer et al. 2009).

2 Method

2.1 Physics scheme ensemble and model domain

This study used version 3.2.1 of WRF with the Advanced Research WRF dynamical core (Skamarock et al. 2008). Initial and boundary conditions were provided by the European Centre for Medium-Range Weather Forecasts interim re-analysis (ERA-Interim) (Dee et al. 2011). The experimental configuration comprises 2 pbl, 2 cu, 3 ra, and 3 mp schemes, giving a total of 36 runs for each event (Table 1). Full details of the experimental setup are described in Evans et al. (2012).

Table 1 WRF physics parameterization schemes used in the study
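As an illustration of how this experimental design can be expressed programmatically, the sketch below enumerates the 36 physics combinations. The scheme names are those mentioned in the text (Table 1 and Section 3); the dictionary layout and run representation are illustrative assumptions rather than part of the original setup.

```python
# Minimal sketch: enumerate the 2 x 2 x 3 x 3 = 36 physics combinations.
# Scheme names follow Table 1 / Section 3; the data structure is illustrative.
from itertools import product

schemes = {
    "pbl": ["YSU", "MYJ"],
    "cu":  ["KF", "BMJ"],
    "mp":  ["WSM3", "WSM5", "WDM5"],
    "ra":  ["Dudhia/RRTM", "CAM/CAM", "RRTMG/RRTMG"],
}

# One WRF run per unique combination of the four physics options
members = [dict(zip(schemes, combo)) for combo in product(*schemes.values())]
assert len(members) == 36
```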

Two model domains with one-way nesting (and spectral nudging of wind and geopotential above 500 hPa in the outer domain) were used in this study (see Fig. 1), with grid spacing of 50 and 10 km for the outer and inner model domain, respectively. Both domains had 30 vertical levels. Each run started 1 week prior to the event and covered a 2-week period, thus encompassing pre- and post-storm days. The total number of events simulated and the resolution chosen were limited by the available computational resources.

Fig. 1

Topographic map showing WRF model domains with grid spacing of about 50 and 10 km for the outer and inner domain, respectively (inner domain marked by red box). All evaluations are conducted in the inner domain minus a border of six grid cells

2.2 Case study periods

Using mean sea level pressure, wind speed, rainfall, and wave height, Speer et al. (2009) identified six different types of ECLs: (1) ex-tropical cyclones, (2) inland trough lows, (3) easterly trough lows, (4) wave on front lows, (5) decaying front lows, and (6) lows in the westerlies. Unlike Evans et al. (2012), our study of eight ECL events includes examples of all common synoptic ECL types (Table 2). The events were subjectively named based on the location, timing, or type of event.

Table 2 Eight events used in the study

The eight ECL events were subjectively divided into two categories (strong and weak) according to the observed rainfall amount. Four strong events (NEWY, SURFERS, JUN, and FEB) generally produced more than 200 mm of cumulative rainfall and caused regional or local flooding. In contrast, the four weak events (CTLOW, OCT, MAY, and SOLOW) generated less rainfall.

2.3 Observation and evaluation methodology

Gridded daily rainfall data over land with ∼5-km horizontal grid spacing, obtained from the Australian Water Availability Project (AWAP) (Jones et al. 2009), were used for evaluation of the model simulations. For the evaluation, the 13-day accumulated AWAP rainfall was re-gridded to the 10-km resolution domain. Figure 2 shows accumulated AWAP rainfall maps for the eight ECL events.
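The re-gridding method is not specified in the text; as a hedged illustration only, the sketch below shows one plausible approach, bilinear interpolation of the AWAP accumulation onto the WRF grid points using SciPy. All array names are hypothetical.

```python
# Hedged sketch of re-gridding ~5-km AWAP rainfall onto the 10-km WRF grid.
# Bilinear interpolation is assumed purely for illustration; the paper does
# not state the method used. Array names are hypothetical.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def regrid_to_wrf(awap_rain, awap_lat, awap_lon, wrf_lat2d, wrf_lon2d):
    """Interpolate an AWAP accumulated-rainfall field (lat x lon) to WRF grid points."""
    interp = RegularGridInterpolator(
        (awap_lat, awap_lon), awap_rain,
        method="linear", bounds_error=False, fill_value=np.nan)
    pts = np.column_stack([wrf_lat2d.ravel(), wrf_lon2d.ravel()])
    return interp(pts).reshape(wrf_lat2d.shape)
```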

Fig. 2

Observed rainfall totals for each event (millimeter): a NEWY, b SURFERS, c JUN, d FEB, e CTLOW, f OCT, g MAY, and h SOLOW

The first day of each simulation was treated as spin-up and hence excluded from the analysis. The skill of each WRF physics scheme combination, and of their ensembles, in simulating accumulated rainfall was assessed using spatial correlation (R), bias, mean absolute error (MAE), root mean square error (RMSE), and the equitable threat score (ETS), also known as the Gilbert skill score (Wilks 2006). Higher values of R and ETS indicate better forecasts, with a perfect ETS achieving a score of 1; an ETS below zero indicates that random chance would provide a better simulation than the model. Bias, MAE, and RMSE are all better as they approach zero.
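For concreteness, a minimal sketch of the continuous metrics is given below, assuming the simulated and observed 13-day accumulations have already been collocated on the land grid cells of the inner domain; the function and variable names are illustrative.

```python
# Minimal sketch of the continuous skill metrics, computed over land cells only.
import numpy as np

def skill_scores(sim, obs):
    """sim, obs: 1-D arrays of accumulated rainfall at collocated land grid cells."""
    valid = np.isfinite(sim) & np.isfinite(obs)
    s, o = sim[valid], obs[valid]
    return {
        "R":    np.corrcoef(s, o)[0, 1],         # spatial (pattern) correlation
        "bias": np.mean(s - o),                  # mean error
        "MAE":  np.mean(np.abs(s - o)),          # mean absolute error
        "RMSE": np.sqrt(np.mean((s - o) ** 2)),  # root mean square error
    }
```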

The ETS is commonly used in forecast verification to investigate the overall spatial performance of the simulations for different rainfall thresholds (10, 25, 50, 100, 200, and 300 mm for all ECL events). It should be noted that the higher rainfall thresholds have smaller sample sizes, with the 200-mm threshold sampling 1,136 grid points and the 300-mm threshold only 197 grid points from AWAP. For each threshold, each grid cell is classified as: “a” (forecast and observed agree on exceedance of the threshold, i.e., a hit), “d” (forecast and observed agree on non-exceedance), “c” (exceeded in the observations but not in the forecast, i.e., a missed event), or “b” (exceeded in the forecast but not in the observations, i.e., a false alarm). Having classified all grid cells according to a–d, the ETS is calculated as \( \mathrm{ETS} = \frac{a - a_r}{a + b + c - a_r} \), where \( a_r \) is the number of hits expected by random chance, given by \( a_r = \frac{(a + b)(a + c)}{a + b + c + d} \).
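The contingency-table calculation above can be written compactly as a short sketch applied to a single threshold; the array names are hypothetical.

```python
# Sketch of the equitable threat score (ETS) for one rainfall threshold,
# following the contingency-table definition given in the text.
import numpy as np

def ets(sim, obs, threshold):
    f, o = sim >= threshold, obs >= threshold
    a = np.sum(f & o)    # hits
    b = np.sum(f & ~o)   # false alarms
    c = np.sum(~f & o)   # missed events
    d = np.sum(~f & ~o)  # correct non-exceedances
    ar = (a + b) * (a + c) / (a + b + c + d)  # hits expected by random chance
    return (a - ar) / (a + b + c - ar)
```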

2.4 Ensemble integration

Ensemble averages were calculated for each event using all 36 members and using subsets of runs that share a common pbl, cu, mp, or ra physics scheme. The names and member sets of these ensembles are summarized in Table 3. For example, the “YSU” ensemble is the average over the 18 runs that use the YSU pbl scheme.

Table 3 List of ensembles and the members used to calculate them
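As an illustration of how the sub-ensembles in Table 3 can be formed, the sketch below averages the members matching a given set of physics options, reusing the hypothetical member list from the Section 2.1 sketch; the data layout is an assumption.

```python
# Sketch of forming the full and sub-ensemble averages. `rain` is a list of
# accumulated-rainfall fields aligned with the `members` list of physics
# dictionaries from the Section 2.1 sketch (both hypothetical structures).
import numpy as np

def ensemble_mean(rain, members, **filters):
    """Average the members whose physics options match all key=value filters."""
    subset = [field for m, field in zip(members, rain)
              if all(m[k] == v for k, v in filters.items())]
    return np.mean(subset, axis=0)

# e.g. the ensembles of Table 3:
# all_mean    = ensemble_mean(rain, members)                      # 36 members
# ysu_mean    = ensemble_mean(rain, members, pbl="YSU")           # 18 members
# ysu_kf_mean = ensemble_mean(rain, members, pbl="YSU", cu="KF")  # 9 members
```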

3 Results

The skill of the ensemble averages was assessed using the metrics described in Section 2.3, and the results were compared with each other and with the median of the 36 individual members. Inter-event comparisons showed similar results for all events except SURFERS. Therefore, in this section we focus on the results from the JUN event (typical of the seven events with similar results) and the SURFERS event to demonstrate our findings.

For the JUN event, the ensemble average of all members (ALL) provided substantially higher spatial correlation, lower MAE and RMSE, and better ETS (except at the 300-mm threshold) than the median of the 36 individual members (Fig. 3a), but also resulted in a more negative bias than the median. Some of the ALL results (i.e., R, RMSE, and ETS at the 10-mm threshold) were even better than the 75th percentile of the 36 individual members. This suggests that ALL provides better estimates of rainfall amounts and patterns than the median of the 36 individual members at all thresholds below 300 mm.

Fig. 3

Plots summarizing statistics from the 36 members and the ensemble averages for the JUN (a) and SURFERS (b) events. The boxes and whiskers show the results from the 36 members: the boxes show the inter-quartile range, the middle horizontal lines show the median, and the whiskers show the best and worst values. The bias, MAE, and RMSE values were divided by 30, 40, and 80, respectively, to allow them to be plotted on the same graph. The results from five ensembles are shown with different markers: ALL is the ensemble average of all 36 members, YSU the ensemble average of all members using the YSU scheme, KF the ensemble average of all members using the KF scheme, YSU_KF the ensemble average of all members using both the YSU and KF schemes, and YSU_BMJ the ensemble average of all members using both the YSU and BMJ schemes

Ensembles based on particular mp (WSM3, WSM5, and WDM5) or ra (Dudhia/RRTM, CAM/CAM, and RRTMG/RRTMG) schemes showed little to no improvement in skill relative to ALL. However, there were large differences between ensembles using different pbl (YSU, MYJ) and cu (KF, BMJ) schemes. The ensemble using the YSU pbl scheme performed better than the ensemble using the MYJ pbl scheme, and similarly the ensemble using the KF cu scheme performed better than the ensemble using the BMJ cu scheme (not shown). The ensembles using either the YSU pbl scheme or the KF cu scheme outperformed ALL, except for ETS at the 25-mm threshold, where KF was worse than ALL. The best performance was typically given by the ensemble combining the YSU pbl and KF cu schemes (nine members), which also considerably outperformed the combination of the YSU pbl and BMJ cu schemes. At high rainfall thresholds (200 and 300 mm), this ensemble was even superior to the best of the 36 individual members.

As mentioned above, results for the SURFERS event differed from those of the other events. While ALL gave higher spatial correlation and better ETS at most thresholds (except 10 and 300 mm) than the median of the 36 individual members, it also produced larger bias, MAE, and RMSE than the median (Fig. 3b). The ensemble using the YSU pbl scheme performed better than ALL, while the ensemble using the KF cu scheme performed worse than ALL at all rainfall thresholds for ETS but was still better than ALL for bias, MAE, and RMSE. The ensemble using both the YSU pbl and KF cu schemes showed higher R, smaller bias, MAE, and RMSE, and better skill at rainfall thresholds below 100 mm compared to ALL, but poorer results at the 100- and 200-mm thresholds.

4 Discussion

Ensemble averages have been found to perform consistently as well as, if not better than, the median of individual members when evaluated using common metrics in weather forecasting, seasonal prediction, and climate simulation (Fraedrich and Leslie 1987; Hagedorn et al. 2005; Phillips and Gleckler 2006; Schwartz et al. 2010; Schaller et al. 2011). For many of these metrics (e.g., bias, RMSE), ensemble averaging smooths the field of interest, thus removing large errors present in individual ensemble members. However, this smoothing also damps outlier values and hence the ability to capture extremes. The results presented in Fig. 3 suggest that a judicious choice of ensemble members allows the high rainfall centers to be captured by an ensemble average. For all events, the ensemble mean of all members (ALL) and of the members using the YSU pbl scheme (YSU) provided better estimates of rainfall amounts and patterns than the median of the individual members, even though they also resulted in larger bias, MAE, and RMSE for the SURFERS event relative to the median. The ensemble using the combination of the YSU pbl scheme and KF cu scheme (YSU_KF) was superior to all the other ensembles for seven of the eight events. For the SURFERS event, the ensemble using the combination of the YSU pbl scheme and BMJ cu scheme (YSU_BMJ) gave the best performance for ETS at the 100–300-mm thresholds.

The unique response of the SURFERS event prompted further investigation into the synoptic conditions prevailing during this event. As described in Evans et al. (2012), the SURFERS event developed from a tropical low that persisted for 5 to 6 days over the Coral Sea, and is classified as an ex-tropical cyclone (xTC) type by Speer et al. (2009). While SURFERS caused flash flooding throughout the region, including Surfers Paradise in Queensland, satellite imagery and the Climate Prediction Center Merged Analysis of Precipitation show that the major rainfall center was offshore and hence not captured by the land-based AWAP observations, whereas for all other events the major rainfall centers were on land. Because the rainfall evaluation was conducted only over land, the SURFERS simulations were assessed on conditions peripheral to the main storm center. Comparing the complete rainfall field from WRF (i.e., including ocean grid cells) with the observational rainfall data, we therefore propose that the geographical positioning of the main rainfall center of SURFERS relative to the observational rainfall dataset explains the different behavior of SURFERS relative to the other events.

To further investigate whether ensembles using the YSU pbl scheme, the KF cu scheme, or their combination give more realistic rainfall estimates across different weather conditions, we compared these ensembles with the full-member ensemble (ALL) using skill metrics averaged over the eight events. Figure 4a shows the differences between the three ensembles (YSU, KF, and YSU_KF) and the full-member ensemble (ALL) averaged over the eight events. The bias, MAE, and RMSE values were each divided by 100 to allow them to be plotted on the same scale as the other metrics. Positive values for R and the ETSs and negative values for bias, MAE, and RMSE indicate improvement relative to ALL. The results show that ensembles using either the YSU pbl scheme, the KF cu scheme, or their combination generally produce relatively small changes in R, bias, MAE, and RMSE relative to ALL. At higher rainfall thresholds, however, these ensembles performed much better than ALL based on the ETS, indicating that an ensemble based on carefully chosen physics schemes can dramatically improve the ability of the ensemble mean to capture centers of high rainfall. When the SURFERS event was excluded from the statistics, the performance of these ensembles for high rainfall improved further (Fig. 4b). Considering that the ex-tropical cyclone type (like the SURFERS event) constitutes only 4 % of observed ECL events (Speer et al. 2009), the climatological performance of these ensembles would be closer to Fig. 4b.
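As a hedged sketch of the calculation behind Fig. 4, the snippet below computes the metric difference of a sub-ensemble relative to ALL, averaged over events, with bias, MAE, and RMSE rescaled by 1/100 as in the figure. The nested dictionary of scores is a hypothetical data structure, not part of the original analysis code.

```python
# Sketch of the Fig. 4 comparison: ensemble-minus-ALL metric differences
# averaged over events. `scores[ensemble][event][metric]` is hypothetical.
import numpy as np

SCALED = {"bias", "MAE", "RMSE"}  # rescaled by 1/100 for plotting

def mean_difference(scores, ensemble, events, metric):
    diffs = [scores[ensemble][e][metric] - scores["ALL"][e][metric]
             for e in events]
    d = np.mean(diffs)
    return d / 100.0 if metric in SCALED else d
```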

Fig. 4

Difference between the three ensembles (YSU, KF, and YSU_KF) and the full-member ensemble (ALL) averaged over the eight events (a) and over seven events (the eight events excluding the SURFERS event) (b). The bias, MAE, and RMSE values were each divided by 100 to allow them to be plotted on the same scale as the other metrics. Positive values for R and the ETSs and negative values for bias, MAE, and RMSE indicate improvement relative to ALL

5 Conclusions

The performance of 36 physics scheme combinations of the WRF model and their ensembles was evaluated for modeling rainfall totals associated with eight ECL events (four strong and four weak) using five statistical metrics (R, bias, MAE, RMSE, and ETS at six rainfall thresholds).

The results show that the ensemble average generally provides more realistic rainfall estimates than the median performer of the individual members, with improvements in both the rainfall amount and the spatial pattern. Furthermore, based on the sensitivity analyses of physics scheme combinations, we found that ensembles using the YSU pbl scheme or the KF cu scheme provided improved rainfall estimates compared to both the median performer of the individual members and the all-member ensemble mean, particularly at high rainfall thresholds. The ensemble average using the combination of the YSU pbl and KF cu schemes provided the best results for simulating centers of high rainfall. This points to the value of focusing on a subset of ensemble runs when calculating an ensemble mean. Depending on the rainfall characteristics of interest, different parameterization-based sub-ensembles may improve the ensemble mean estimate.