1 Introduction

Floods and droughts are major natural disasters in China and are enormously destructive, restricting economic and social development (Wang 2017; Zhang Xu et al. 2019). As an essential part of the hydrological forecasting system, skillful precipitation forecasts are of vital importance in mitigating the risks associated with extreme events and support decision-making for water resource utilization. In addition, precipitation forecasts provide decision-makers with information on the uncertainty in precipitation and flood forecasts (Ying et al. 2019).

Numerical weather prediction (NWP) models have been developed and improved since the 1940s (Trenberth 1992), and forecast accuracy has steadily improved (Buizza et al. 1999; Lan et al. 2011). In the past 25 years, NWP has evolved from deterministic forecasting to a new stage of ensemble prediction systems (EPSs) (Molteni et al. 1996). In contrast to deterministic forecasting, an EPS generates a set of forecasts by perturbing the initial conditions and representing model uncertainty, thereby providing the most likely forecast value as well as the uncertainty of the forecast. Improved EPS performance is attributed to advances in initial perturbation strategies (Meng 2011; Whitaker and Hamill 2002), model uncertainty simulation strategies, resolution, the number of members and the forecast length (Roberto 2019; Roebber et al. 2004). Currently, EPSs are used operationally not only to generate forecasts on different time scales, such as short-term forecasts (up to 2–3 days), medium-term forecasts (up to 2 weeks), subseasonal (10–90 days) forecasts and seasonal forecasts, but also for different hydrometeorological variables, such as temperature, precipitation, wind speed, and tropical cyclone tracks (François et al. 2018; Hemri et al. 2014). EPSs are also widely used in many fields. For example, in hydrology, precipitation ensembles coupled with hydrological models can generate runoff forecasts (Cloke and Pappenberger 2009; Lan et al. 2011; Pappenberger et al. 2005). In the energy sector, ensembles create different weather scenarios to estimate the uncertainty of electricity demand forecasts (Taylor and Buizza 2003). In aviation, ensembles provide the probability of convective hazards and flying conditions, guiding air traffic control (Robert 2018; Verlinden 2017). Furthermore, the inherent forecast limitations of a single model are difficult to measure, so it is common to combine ensembles from multiple independent models in a scheme called a multimodel ensemble. This practice considerably reduces systematic errors in forecasts and improves reliability (Kirtman et al. 2014; Krishnamurti et al. 1999, 2016).

The Observing System Research and Predictability Experiment Interactive Grand Global Ensemble (TIGGE) provides solid technical and data support for studies on operational global ensemble forecasts (Park et al. 2008). In recent years, regional studies of ensemble forecast systems have been carried out extensively for quantitative precipitation forecasts (QPFs) and probabilistic quantitative precipitation forecasts (PQPFs). Hamill (2012) examined the PQPFs from four TIGGE EPSs over the contiguous United States during July–October 2010 and discussed the TIGGE multimodel and European Centre for Medium-Range Weather Forecasts (ECMWF) reforecast-calibrated PQPFs. The author concluded that PQPFs from the Canadian Meteorological Centre (CMC) EPS are the most reliable, while those from the U.S. National Centers for Environmental Prediction (NCEP) and the United Kingdom Meteorological Office (UKMO) EPSs are the least reliable. In addition, the TIGGE multimodel shows better forecast skill, while the accuracy of ECMWF reforecast-calibrated PQPFs is reduced. Xiang et al. (2014) evaluated the QPFs and PQPFs from six TIGGE EPSs during June–August 2008–2010 in the Northern Hemisphere (NH) midlatitudes and tropics, as well as changes in performance after system upgrades. Their study indicated that the overall forecast skill is better in the NH midlatitudes than in the NH tropics and that the ECMWF EPS generally performs best. After its upgrade, the overall QPF and PQPF errors of the CMC EPS increased because of its excessively enlarged ensemble spread. Louvet et al. (2016) compared PQPFs from seven TIGGE EPSs with satellite rainfall estimates over West Africa during 2008–2012 and examined the performance of the ensemble mean of all models. They found that the UKMO and ECMWF EPSs are more skillful than the others. As the lead time increases from 1 to 15 days, the skill of TIGGE forecasts decreases, and the multimodel outperforms any individual model. Karuna et al. (2017) assessed the skill of three TIGGE EPSs in predicting 15 rainstorm events over India during 2007–2015. Their results showed that the NCEP EPS has the smallest spread but that its QPFs are poorly predicted; displacement and pattern errors contribute most to the total root mean square error (RMSE). Using deterministic, dichotomous (yes/no) and probabilistic techniques, Aminyavari et al. (2018) verified the precipitation forecast performance of three TIGGE EPSs over Iran for 2008–2016. They concluded that all EPSs underestimate precipitation in high-precipitation regions and overestimate it elsewhere. The ECMWF EPS scores better than the others, while the UKMO EPS yields higher scores in mountainous regions. A multimodel superensemble is recommended to improve forecast quality.

However, systematic studies on regional TIGGE precipitation forecasts are scarce; a more comprehensive study is needed to reveal the detailed properties of regional precipitation EPSs. In addition, statistical postprocessing can construct a multimodel superensemble from EPSs to remove systematic biases and improve the accuracy and robustness of the forecasts (Qingyun et al. 2019). It is therefore of interest to analyze the forecast skill of a multimodel superensemble over a particular area.

This study focuses on the QPFs and PQPFs generated by individual TIGGE centers from April to December 2015 over the Huaihe River basin (HB). Forecast quality is assessed from many aspects to obtain a comprehensive understanding of the precipitation forecast properties of five selected operational global EPSs in the HB. The overall forecast quality is verified, and forecast quality at different precipitation thresholds is further discussed. Changes in forecast quality with lead time are also examined. We evaluate the spatial distribution of forecast performance to reveal the adaptability of the EPSs to the terrain and climate background. In addition, a multimodel superensemble is built from the five EPSs using Bayesian model averaging (BMA), and its performance is evaluated against the individual EPSs.

The rest of the paper is organized as follows: Sect. 2 describes the study area, datasets and methods. Section 3 provides the results and discussions. A summary is presented in Sect. 4.

2 Study area, datasets and methods

2.1 Study area

The HB is located at 111°55′ E–121°25′ E and 30°55′ N–36°36′ N (Fig. 1). Tributaries on the left bank of the Huaihe River are almost all plain rivers with large drainage areas, while those on the right bank are hilly rivers with small drainage areas. In addition, the HB is a transitional zone between the northern and southern climates of China (Robert 2018). The northern region lies in the warm temperate zone with a semihumid monsoon climate, while the southern region lies in the subtropical zone with a humid monsoon climate.

Fig. 1
figure 1

The location of HB, predicted grid points and stations

The average annual precipitation over the HB is approximately 910 mm, and precipitation decreases from south to north. June–September is the flood season in the HB, with precipitation of usually 500–600 mm during this period, accounting for 50–80% of the annual total. During the distinctive plum rain season (June and July), rainfall lasts for one to two months and covers the whole basin. The atmospheric system over the HB is complex and changeable, and the spatial and temporal distribution of precipitation is uneven, making the basin prone to floods, droughts and other disasters. The complex terrain and distinctive climate background make precipitation difficult to forecast in this region.

2.2 Datasets

2.2.1 Observed data

The observed dataset comes from the National Meteorological Information Centre of China. It is a collection of surface meteorological records submitted monthly by the data-processing departments of provinces, municipalities and autonomous regions and comprises daily data from 752 meteorological stations in China from 1951 to 2015. Daily precipitation data from 40 stations over the HB are used in this study. Dates with missing data or outliers are removed.

2.2.2 Precipitation forecast data

The 1–9-day accumulated precipitation forecasts provided by the Japan Meteorological Agency (JMA), China Meteorological Administration (CMA), UKMO, NCEP and ECMWF EPSs in the TIGGE dataset are adopted for evaluation. The regional range is 112°–121° E and 30.5°–36.5° N. The original precipitation data are interpolated to a common 0.5° × 0.5° grid before downloading using the bilinear interpolation software provided by the ECMWF TIGGE data portal. The configurations of the selected operational global EPSs are shown in Table 1.

Table 1 Configurations of the five TIGGE EPSs used in this study

The JMA EPS began providing data in February 2014, and the CMA EPS is missing data from October 2014 to March 2015 due to a system upgrade; thus, the verification period in this study covers April–December 2015. Negative forecast values are set to 0, and missing values are estimated by linear interpolation in time. The nearest-neighbor approach is used to obtain the forecast at a specific station from the gridded forecast data (Vogel et al. 2017). Figure 1 illustrates the distribution of the selected predicted grid points and the meteorological stations. The background grid in gray is the original grid.
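For illustration, the nearest-neighbor extraction reduces to picking the grid indices closest to each station. The following minimal Python sketch is ours, not the authors' code; the grid setup matches the study domain, but the station coordinates and the toy precipitation field are hypothetical.

```python
import numpy as np

def nearest_grid_forecast(grid_lons, grid_lats, field, stn_lon, stn_lat):
    """Return the value at the grid point nearest to a station.

    field: 2-D array of shape (len(grid_lats), len(grid_lons)) holding one
    member's accumulated precipitation on the regular 0.5-degree grid.
    """
    i = np.abs(grid_lats - stn_lat).argmin()  # nearest latitude row
    j = np.abs(grid_lons - stn_lon).argmin()  # nearest longitude column
    return field[i, j]

# Hypothetical example on the study grid (112-121E, 30.5-36.5N, 0.5 degrees)
lons = np.arange(112.0, 121.0 + 0.5, 0.5)
lats = np.arange(30.5, 36.5 + 0.5, 0.5)
rng = np.random.default_rng(42)
toy_field = rng.gamma(shape=0.5, scale=5.0, size=(lats.size, lons.size))
print(nearest_grid_forecast(lons, lats, toy_field, 117.2, 32.9))
```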

2.3 Verification metrics and postprocessing method

The QPFs and PQPFs of the five selected EPSs are verified with various verification metrics, and a multimodel superensemble is constructed by BMA. The principles of each part are described below.

2.3.1 Verification methods

2.3.1.1 Verification metrics of PQPFs

The direct output of an EPS is a set of possible values (i.e., a PQPF); thus, a PQPF is a probabilistic prediction. Sharpness, skill, reliability and resolution are the most common aspects of probabilistic prediction quality. Sharpness describes the concentration of the predictive distributions. Skill represents forecast accuracy relative to a reference forecast. Reliability relates to the average consistency between forecasts and observations when a specific forecast is issued, measuring how well forecast probabilities match observed frequencies. Resolution shows differences in outcomes for the different forecasts issued; that is, the distribution of outcomes when "A" is forecast differs from the distribution of outcomes when "B" is forecast (Qingyun et al. 2019). In this study, the continuous ranked probability skill score (CRPSS) is applied to assess the forecast skill of PQPFs (Hersbach 2000). For dichotomous events, the reliability diagram and the Brier score reliability term are used to evaluate the reliability of PQPFs intuitively and quantitatively, respectively. Dichotomous events are events whose outcomes can be divided into occurrence and nonoccurrence through thresholds. The Brier skill score and Brier score resolution represent the prediction skill and resolution of PQPFs for dichotomous events, respectively.

The CRPSS is calculated by normalizing the continuous ranked probability score (CRPS) with the reference forecast, which is defined as follows:

$$CRPSS_{j}^{T} = \frac{{CRPS_{ref,j} - CRPS_{j}^{T} }}{{CRPS_{ref,j} }}$$
(1)

where \(CRPSS_{j}^{T}\) represents the CRPSS of EPS \(T\) at station \(j\) and \(CRPS_{j}^{T}\) is the CRPS of EPS \(T\) at station \(j\) (Tilmann and Raftery 2007):

$$CRPS_{j}^{T} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\int_{ - \infty }^{\infty } {\left[ {G_{ij}^{T} \left( x \right) - H\left( {x - o_{ij} } \right)} \right]^{2} dx} }$$
(2)
$$H\left( {x - o_{ij} } \right) = \left\{ {\begin{array}{*{20}c} 1 & {x \ge o_{ij} } \\ 0 & {x < o_{ij} } \\ \end{array} } \right.$$
(3)

where \(x\) represents accumulated precipitation; \(o_{ij}\) is the observation on day \(i\) at station \(j\); \(N\) is the number of days in the verification period; and \(G_{ij}^{T}\) represents the predictive cumulative distribution function of EPS \(T\) on day \(i\) at station \(j\).

\(CRPS_{ref,j}\) represents the reference CRPS at station \(j\) and is generated using the cumulative distribution function (CDF) of the observed samples (i.e., the sample climatology) (Konstantinos et al. 2019):

$$CRPS_{ref,j} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left| {o_{ij} - \overline{o}_{j} } \right|}$$
(4)

where \(\overline{o}_{j}\) is the average observed precipitation at station \(j\) during the verification period. CRPSS ranges from \(- \infty\) to 1, and a negative value indicates that the forecast skill of EPS \(T\) is worse than that of the sample climatology (Demargne et al. 2010; Ye et al. 2014). In this study, 95% confidence intervals for the CRPSS are calculated by bootstrapping, recomputing the statistic from 10,000 random resamples (Xiang et al. 2014).
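As a concrete sketch of Eqs. (1)–(4) and the bootstrap interval, the Python fragment below evaluates the CRPS of an ensemble's empirical CDF through the standard kernel identity CRPS = E|X − o| − ½E|X − X′|, which is equivalent to the integral in Eq. (2); the function names and interface are ours.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast against a scalar observation,
    via the kernel identity for the empirical CDF (equivalent to eq. 2)."""
    m = np.asarray(members, dtype=float)
    return np.abs(m - obs).mean() - 0.5 * np.abs(m[:, None] - m[None, :]).mean()

def crpss(member_sets, observations):
    """CRPSS at one station, eqs. (1) and (4): the reference CRPS is the
    mean absolute deviation of the observations from their mean."""
    obs = np.asarray(observations, dtype=float)
    crps = np.mean([crps_ensemble(m, o) for m, o in zip(member_sets, obs)])
    crps_ref = np.abs(obs - obs.mean()).mean()  # eq. (4)
    return (crps_ref - crps) / crps_ref         # eq. (1)

def bootstrap_ci(member_sets, observations, n_boot=10000, seed=1):
    """95% confidence interval by resampling forecast-observation days."""
    rng = np.random.default_rng(seed)
    n = len(observations)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample days with replacement
        stats[b] = crpss([member_sets[k] for k in idx],
                         [observations[k] for k in idx])
    return np.percentile(stats, [2.5, 97.5])
```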

The reliability diagram shows the frequency with which an event actually occurs when it is predicted with a certain probability; it plots the observed relative frequency of an event against the forecast probability of the event (Fig. 2). Given \(M\) forecast probability bins indexed by \(m\), the observed relative frequency \(q_{m}^{T}\) is given by the following equation (Wilks 2009):

$$q_{m}^{T} = \frac{1}{{n_{mj}^{T} \times J}}\sum\limits_{j = 1}^{J} {\sum\limits_{i = 1}^{{n_{mj}^{T} }} {\gamma_{mj}^{T} } }$$
(5)

where \(n_{mj}^{T}\) denotes the number of forecast–observation pairs in bin \(m\) during the verification period for EPS \(T\) at station \(j\); and \(J\) represents the total number of stations. Since the observation of the event is dichotomous for each forecast–observation pair of EPS \(T\) at station \(j\), \(\gamma_{mj}^{T} = 1\) if the event occurs and \(\gamma_{mj}^{T} = 0\) otherwise. The reliability diagram partitions the verification dataset into subsamples according to the forecast probability, which means that it requires a fairly large dataset.
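A sketch of how the pooled verification pairs could be binned into a reliability curve follows; the binning scheme and names are ours, and the ensemble forecast probability is assumed to be the fraction of members exceeding the event threshold.

```python
import numpy as np

def reliability_curve(prob_fcst, event_obs, n_bins=10):
    """Observed relative frequency per forecast-probability bin (cf. eq. 5).

    prob_fcst: event probabilities (e.g., fraction of ensemble members
               exceeding the threshold), pooled over stations and days.
    event_obs: 1 if the event occurred, else 0 (the gamma indicator).
    """
    p = np.asarray(prob_fcst, dtype=float)
    o = np.asarray(event_obs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    curve = []
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            curve.append((p[mask].mean(),    # mean forecast probability p_m
                          o[mask].mean(),    # observed relative frequency q_m
                          int(mask.sum())))  # subsample size n_m
    return curve
```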

Fig. 2
figure 2

Schematic of the reliability diagram

The Brier score resolution and Brier score reliability for EPS \(T\) are defined as follows (Wilks 2009):

$$RES^{T} = \frac{1}{{\sum\limits_{j = 1}^{J} {\sum\limits_{m = 1}^{M} {n_{mj}^{T} } } }}\sum\limits_{j = 1}^{J} {\sum\limits_{m = 1}^{M} {n_{mj}^{T} \left( {q_{m}^{T} - \overline{\gamma }} \right)^{2} } }$$
(6)
$$REL^{T} = \frac{1}{{\sum\limits_{j = 1}^{J} {\sum\limits_{m = 1}^{M} {n_{mj}^{T} } } }}\sum\limits_{j = 1}^{J} {\sum\limits_{m = 1}^{M} {n_{mj}^{T} \left( {p_{m}^{T} - q_{m}^{T} } \right)^{2} } }$$
(7)

where \(p_{m}^{T}\) is the forecast probability of bin \(m\); \(q_{m}^{T}\) is the observed relative frequency in bin \(m\) for EPS \(T\); and \(\overline{\gamma }\) is the overall observed sample frequency of the event. The larger RES is, the higher the resolution of the PQPFs; the smaller REL is, the better the reliability.

The Brier skill score (BSS) normalizes the Brier score (BS) of PQPFs of dichotomous events by that of a reference forecast. For the PQPFs of EPS \(T\), the BSS is given by the following equation:

$$BSS^{T} = \frac{{BS_{ref} - BS^{T} }}{{BS_{ref} }}$$
(8)

where \(BS^{T}\) and \(BS_{ref}\) represent the BS of EPS \(T\) and of the reference forecast, respectively; \(BS_{ref}\) is calculated from the observed sample frequency at each station (Thomas and Josip 2006; Wilks 2009):

$$BS^{T} = \frac{1}{N \times J}\sum\limits_{j = 1}^{J} {\sum\limits_{i = 1}^{N} {\left( {p_{ij}^{T} - \gamma_{ij} } \right)^{2} } }$$
(9)
$$BS_{ref} = \frac{1}{N \times J}\sum\limits_{j = 1}^{J} {\sum\limits_{i = 1}^{N} {\left( {p_{ref,j} - \gamma_{ij} } \right)^{2} } }$$
(10)
$$\gamma_{ij} = \left\{ {\begin{array}{*{20}c} 1 & {o_{ij} \in I} \\ 0 & {o_{ij} \notin I} \\ \end{array} } \right.$$
(11)

where \(p_{ij}^{T}\) is the forecast probability of the event for EPS \(T\) on day \(i\) at station \(j\), and \(p_{ref,j}\) is the observed frequency at station \(j\). Similar to the CRPSS, the perfect score is 1, and the lower the BSS is, the worse the skill of the PQPFs. A lower limit of atmospheric predictability is a prediction that the future will be like past climatology (Qingyun et al. 2019). Climatology is a forecast of the climatological outcome and is often used as an important reference for forecast skill (Qingyun et al. 2019). In this study, climatology is computed from the observed samples; that is, the average precipitation and the precipitation distribution of the observed samples are taken as the climatological forecast (Konstantinos et al. 2019).
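The sketch below computes the BS, BSS and the binned reliability and resolution terms in one pass, consistent with Eqs. (6)–(10) and Murphy's decomposition BS = REL − RES + UNC; the function and variable names are our own.

```python
import numpy as np

def brier_scores(prob_fcst, event_obs, n_bins=10):
    """BS, BSS, REL and RES for pooled forecast-observation pairs."""
    p = np.asarray(prob_fcst, dtype=float)
    o = np.asarray(event_obs, dtype=float)
    bs = np.mean((p - o) ** 2)              # eq. (9)
    obar = o.mean()                         # observed sample frequency
    bs_ref = np.mean((obar - o) ** 2)       # eq. (10), reference forecast
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    rel = res = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            p_m, q_m = p[mask].mean(), o[mask].mean()
            rel += mask.sum() * (p_m - q_m) ** 2   # reliability term, eq. (7)
            res += mask.sum() * (q_m - obar) ** 2  # resolution term, eq. (6)
    rel /= p.size
    res /= p.size
    bss = (bs_ref - bs) / bs_ref            # eq. (8)
    return bs, bss, rel, res
```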

2.3.1.2 Verification metrics of QPFs

The output of an EPS is a set of possible values, which provides not only PQPFs (the ensemble) but also relatively robust QPFs given by the mean of all ensemble members (the ensemble mean) (WMO 2012). Several deterministic verification metrics are used to demonstrate different aspects of these QPFs. Scatter plots (Fig. 3a) and Pearson correlation coefficients measure the linear relationship between forecasts and observations (Qingyun et al. 2019). The RMSE and the discrimination diagram are used to evaluate the accuracy and discrimination of QPFs, respectively. Accuracy refers to the average difference between individual forecasts and observations, while discrimination represents differences in forecasts for different outcomes.

Fig. 3
figure 3

a Schematic diagram of scatter plot; b Schematic diagram of discrimination diagram

In the scatter plots (Fig. 3a), a lower correlation between forecasts and observations results in greater scatter about the one-to-one line. The Pearson correlation coefficient measures the degree of linear correlation between QPFs and observations: a coefficient of 1 (− 1) indicates a perfect positive (negative) linear correlation, while a coefficient of 0 indicates the absence of any linear relationship.

The RMSE is often used to measure the accuracy of deterministic predictions; it summarizes the magnitude of the error between deterministic predictions and observations. A lower RMSE indicates better accuracy.

The discrimination diagram divides predictions into three types: correct prediction, false positive and false negative (Fig. 3b).
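These deterministic metrics reduce to a few array operations; a sketch with our own naming follows, where the three discrimination ratios correspond to the categories in Fig. 3b.

```python
import numpy as np

def qpf_metrics(qpf, obs, threshold):
    """Pearson r, RMSE and discrimination ratios for ensemble-mean QPFs."""
    f = np.asarray(qpf, dtype=float)
    o = np.asarray(obs, dtype=float)
    r = np.corrcoef(f, o)[0, 1]                # linear association
    rmse = np.sqrt(np.mean((f - o) ** 2))      # accuracy
    f_evt, o_evt = f >= threshold, o >= threshold
    correct = np.mean(f_evt == o_evt)          # correct predictions
    false_pos = np.mean(f_evt & ~o_evt)        # forecast yes, observed no
    false_neg = np.mean(~f_evt & o_evt)        # forecast no, observed yes
    return r, rmse, correct, false_pos, false_neg
```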

2.3.2 Bayesian model average method

BMA, developed at the University of Washington (Raftery et al. 2005), is now recognized as one of the best statistical postprocessing methods for constructing multimodel superensemble forecasts (Sloughter et al. 2007). By combining data from different EPSs, BMA generates a single probabilistic prediction in the form of a predictive probability density function (PDF) (Vogel et al. 2017). Let \(y\) be the predictive variable, let \(f_{1} , \ldots ,f_{K}\) be the outputs of models \(M_{1} , \ldots ,M_{K}\), and let \((y^{T} ,f^{T} )\) be the training dataset. Then,

$$p\left[ {y|(f_{1} , \ldots ,f_{K} ,y^{T} )} \right] = \sum\limits_{k = 1}^{K} {\omega_{k} g_{k} \left[ {y|(f_{k} ,y^{T} )} \right]}$$
(12)
$$\sum\limits_{k = 1}^{K} {\omega_{k} = 1}$$
(13)

where \(g_{k} (y|f_{k} ,y^{T} )\) is the conditional PDF associated with EPS \(M_{k}\), and \(\omega_{k}\) is its BMA weight, reflecting the overall performance of EPS \(M_{k}\) during the training period.

In BMA, the default distribution of the variable is normal. However, accumulated precipitation is zero in many cases, and when it is nonzero, its distribution is highly skewed. Therefore, a modified conditional PDF is applied to extend BMA. In addition, the BMA variable in this study is taken as the cube root of precipitation to yield a better-behaved distribution (Jianguo 2014; Sloughter et al. 2007).

The modified conditional PDF comprises two parts. The first part models the probability of zero precipitation by logistic regression:

$$\operatorname{logit} \left\{ {p\left[ {y = 0|(f_{k} ,y^{T} )} \right]} \right\} = \log \frac{{p\left[ {y = 0|(f_{k} ,y^{T} )} \right]}}{{p\left[ {y > 0|(f_{k} ,y^{T} )} \right]}} = a_{0k} + a_{1k} f_{k}^{1/3} + a_{2k} \delta_{k}$$
(14)

where \(a_{0k}\), \(a_{1k}\), and \(a_{2k}\) are estimated by logistic regression, and \(\delta_{k}\) is an indicator that equals 1 if \(f_{k} = 0\) and 0 otherwise (Sloughter et al. 2007).

The second part is the PDF when the precipitation is nonzero, which is represented by a gamma distribution (Sloughter et al. 2007):

$$h_{k} \left[ {y|(f_{k} ,y^{T} )} \right] = \frac{1}{{\beta_{k}^{{\alpha_{k} }} \Gamma (\alpha_{k} )}}y^{{\alpha_{k} - 1}} \exp \left( { - y/\beta_{k} } \right)$$
(15)

where the shape parameter \(\alpha_{k}\) and scale parameter \(\beta_{k}\) satisfy the following:

$$\mu_{k} = \alpha_{k} \beta_{k} = b_{0k} + b_{1k} f_{k}^{1/3}$$
(16)
$$\sigma_{k}^{2} = \alpha_{k} \beta_{k}^{2} = c_{0} + c_{1} f_{k}$$
(17)

where \(\mu_{k}\) and \(\sigma_{k}^{2}\) are the mean and variance in the gamma distribution, respectively; \(b_{0k}\) and \(b_{1k}\) are calculated by generalized linear regression; and \(c_{0}\) and \(c_{1}\) are obtained by using the maximum likelihood method.

In summary, the BMA predictive PDF of the cube root of the accumulated precipitation \(y\) is

$$\begin{aligned} p\left[ {y|(f_{1} , \ldots ,f_{K} ,y^{T} )} \right] &= \sum\limits_{k = 1}^{K} {\omega_{k} \left\{ {p\left[ {y = 0|(f_{k} ,y^{T} )} \right]I\left[ {y = 0} \right]} \right.} \\ &\quad \left. { + p\left[ {y > 0|(f_{k} ,y^{T} )} \right]h_{k} \left[ {y|(f_{k} ,y^{T} )} \right]I\left[ {y > 0} \right]} \right\} \\ \end{aligned}$$
(18)

where the indicator function \(I\left[ {} \right]\) is 1 if the condition in brackets holds and 0 otherwise. The weights \(\omega_{k}\), together with \(c_{0}\) and \(c_{1}\), are estimated by maximizing the log-likelihood:

$$l(\omega_{1} , \ldots ,\omega_{K} ;c_{0} ;c_{1} ) = \sum\limits_{j = 1}^{J} {\sum\limits_{i = 1}^{n} {\log p\left[ {y_{i,j} |(f_{1,i,j} , \ldots ,f_{K,i,j} ,y^{T} )} \right]} }$$
(19)

where \(i\) and \(j\) index time and station, respectively; \(J\) is the total number of stations; and \(n\) is the number of days in the training period. The equation above is maximized numerically by the expectation–maximization (EM) algorithm (Dempster 1977; McLachlan and Krishnan 1988).
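To make the mixture concrete, the sketch below evaluates the predictive PDF of Eq. (18) for one forecast case, assuming the coefficients \(a\), \(b\), \(c\) and the weights \(\omega\) have already been fitted by logistic regression, generalized linear regression and EM as described above; the interface and parameter layout are our assumptions, not the authors' code.

```python
import numpy as np
from scipy.special import expit, gammaln

def bma_density(y, fcst, w, a, b, c):
    """BMA predictive density (eq. 18) of y, the cube root of precipitation.

    fcst: raw member forecasts f_k (length K).
    w: EM-fitted weights; a[k] = (a0k, a1k, a2k); b[k] = (b0k, b1k);
    c = (c0, c1). All parameters assumed fitted on the training window.
    """
    dens = 0.0
    for k, f in enumerate(fcst):
        f3 = f ** (1.0 / 3.0)
        delta = 1.0 if f == 0 else 0.0
        p_zero = expit(a[k][0] + a[k][1] * f3 + a[k][2] * delta)  # eq. (14)
        if y == 0:
            dens += w[k] * p_zero          # point mass at zero precipitation
            continue
        mu = b[k][0] + b[k][1] * f3        # eq. (16): mean of the gamma part
        var = c[0] + c[1] * f              # eq. (17): variance of the gamma part
        alpha, beta = mu ** 2 / var, var / mu        # shape and scale
        log_pdf = ((alpha - 1) * np.log(y) - y / beta
                   - alpha * np.log(beta) - gammaln(alpha))  # eq. (15)
        dens += w[k] * (1.0 - p_zero) * np.exp(log_pdf)
    return dens
```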

3 Results and discussion

3.1 The performances of EPSs for a lead time of 24 h

The precipitation forecast for a lead time of 24 h receives the most attention in flood control. For this lead time, the overall performance, the performance at different precipitation thresholds and the spatial distribution of the performance of the five EPSs are examined.

3.1.1 The overall performances of EPSs

Figure 4 shows scatter plots of QPFs versus observations. The abscissa of the black box in Fig. 4 represents the average QPF, and the ordinate represents the average observation. The Pearson correlation coefficients of the QPFs are given as numbers in the figure. The QPFs of JMA EPS have the best correlation with observations, while those of CMA EPS have the worst. Table 2 lists the mean QPF RMSEs of the five EPSs during the verification period, reflecting QPF accuracy. Consistent with the correlation results, the QPFs of JMA EPS show the best accuracy (i.e., the lowest RMSE), followed by ECMWF EPS, and CMA EPS shows the worst accuracy.

Fig. 4
figure 4

Scatter plots of QPFs versus observations

Table 2 Basin mean QPF RMSEs of the five EPSs

Figure 5 illustrates the PQPF skill of each EPS relative to climatology; a CRPSS greater than 0 indicates more forecast skill than climatology. The PQPFs of all EPSs have positive CRPSSs, indicating that they are more skillful than climatology (i.e., the observed sample). The mean CRPSS of JMA EPS is the highest, followed by those of ECMWF and UKMO EPSs, and its confidence interval is the narrowest, indicating the best PQPF skill. CMA EPS has the worst PQPF skill.

Fig. 5
figure 5

Basin mean PQPF CRPSSs of EPSs

ECMWF has proven to be a superior EPS in multiple regions, but the adaptability of the other EPSs varies from region to region. For instance, the ECMWF and JMA EPSs show the best skill in China's Huai River basin (Tao et al. 2014). Along the coasts of the northern Indian Ocean, the ECMWF, UKMO and NCEP EPSs produce more skillful forecasts (Bhomia et al. 2017). In West Africa, the forecasts of the ECMWF and UKMO EPSs are the best (Louvet et al. 2016).

3.1.2 EPS performances at different precipitation thresholds

In general, drought relief focuses on forecast quality at a low precipitation threshold, while flood control concerns forecast quality at a large precipitation threshold. Therefore, a low threshold and a large threshold are selected in this paper to evaluate the EPS capacity for drought relief and flood control, respectively. Precipitation between the two thresholds is not considered here.

There are few data points above the threshold of 50 mm/day (Table 3) during the verification period, too few to meet the needs of the reliability diagram. Therefore, 10 mm/day and 25 mm/day are selected as the low and large thresholds for dichotomous events, respectively, and the quality of the QPFs and PQPFs from the five EPSs is estimated at these two thresholds.

Table 3 The proportion of data for different dichotomous events (%)

Figure 6 plots the reliability curves of PQPFs for the different events. The closer a curve is to the diagonal, the more reliable the PQPF. For clarity, EPSs with more ensemble members are plotted with more probability bins. The BSS, reliability (REL) and resolution (RES) terms of the BS are shown as numbers in the figure. The horizontal dashed line is the observed sample frequency (i.e., climatology); when the reliability curve falls below this line, the forecast skill is inferior to climatology at that forecast probability.

Fig. 6
figure 6

Reliability diagrams of PQPFs at two thresholds. The bar graphs show the subsample frequencies on the logarithm scale. The horizontal dashed line is the observed sample frequency (i.e., climatology)

For the dichotomous event at the low precipitation threshold (< 10 mm/day), the reliability curve deviates substantially from the diagonal as the forecast probability decreases, indicating severe false negatives. For the dichotomous event at the large precipitation threshold (> 25 mm/day), severe false negatives occur as the forecast probability increases. However, in contrast to false negatives, false positives are more acceptable for flood control safety.

All EPSs have better PQPF skill at the low precipitation threshold, as indicated by their higher BSS values there. CMA and NCEP EPSs have relatively poor PQPF reliability at both thresholds and thus poor PQPF skill at both thresholds. UKMO EPS is sharper and has the best PQPF skill at the low threshold (the largest BSS), which is mainly attributed to its best reliability and resolution (the smallest REL and the largest RES). At the large precipitation threshold, ECMWF and JMA EPSs have better PQPF skill: ECMWF has better resolution and is sharper, while JMA is more reliable.

The reliability term is easy to calibrate through postprocessing, while the resolution term is difficult to improve because it is intrinsic to the model (Xiang et al. 2014). Therefore, for flood warnings, the ECMWF EPS is relatively more promising and can be expected to gain further skill through postprocessing.

Figure 7 reveals the discrimination of QPFs at the two thresholds. At the low threshold (< 10 mm/day), the ratio of correct predictions is approximately 90% for each EPS, indicating superior QPF discrimination for all EPSs and providing reference value for drought warnings. At the large threshold (> 25 mm/day), no EPS discriminates well. The CMA EPS has almost the same ratio of correct predictions as the others, while its false negative ratio is the lowest. Therefore, for flood control, the QPFs of the CMA EPS are preferred among these EPSs when adopting deterministic forecasts.

Fig. 7
figure 7

Discrimination diagrams of QPFs at the two thresholds

Overall, EPS forecast skill for precipitation at the large threshold is far lower than that at the low threshold. This result relates to the main precipitation types over the HB and the characteristics of EPSs. Typhoons and plum rains are the main sources of precipitation in the HB: the typhoon is a tropical cyclone, and the plum rain belongs to the East Asian monsoon (Wang et al. 2011; Chen et al. 2018). EPSs are good at predicting precipitation generated by these two systems (Lan et al. 2011; Olson et al. 1995), so they show good forecast quality at the low threshold. However, accurately forecasting heavier precipitation remains a challenge for EPSs (Lan et al. 2011). Evidently, EPSs can play an effective role in drought prediction for the HB; however, when EPSs are used to force hydrological models to produce flood forecasts, they should be applied with caution.

3.1.3 Spatial distribution of EPS performances

The spatial distribution of precipitation is more realistic and accurate in mountainous terrain when elevation dependence is considered (Song et al. 2019). Thus, the interpolation method used in this paper is Gradient plus Inverse-Distance-Squared (GIDS) (Price et al. 2000), which accounts for the influence of elevation. Affected by the different climates over the HB, the average daily precipitation during the verification period decreases from south to north (Fig. 8). The precipitation distribution in the HB is affected not only by climate but also by topography and geographical location: precipitation increases in mountainous and coastal areas.
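A minimal sketch of GIDS in the spirit of Price et al. (2000) is given below: a multiple linear regression supplies gradients in longitude, latitude and elevation, and inverse-distance-squared weighting combines the gradient-adjusted station values. Implementation details here (using all stations rather than the N nearest, the numerical guard) are our assumptions.

```python
import numpy as np

def gids(x, y, e, stn_x, stn_y, stn_e, stn_z):
    """Gradient plus Inverse-Distance-Squared interpolation to (x, y, e).

    stn_x, stn_y, stn_e, stn_z: station longitudes, latitudes, elevations
    and observed values (e.g., daily precipitation or a verification metric).
    """
    # Fit gradients of the field in x, y and elevation by least squares.
    X = np.column_stack([np.ones_like(stn_x), stn_x, stn_y, stn_e])
    (_, cx, cy, ce), *_ = np.linalg.lstsq(X, stn_z, rcond=None)
    # Inverse-distance-squared weights (guard against collocated points).
    d2 = np.maximum((stn_x - x) ** 2 + (stn_y - y) ** 2, 1e-12)
    w = 1.0 / d2
    # Adjust each station value along the fitted gradients, then average.
    adj = stn_z + cx * (x - stn_x) + cy * (y - stn_y) + ce * (e - stn_e)
    return np.sum(w * adj) / np.sum(w)
```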

Fig. 8
figure 8

The average daily precipitation at stations

In this study, two approaches are applied to evaluate the spatial differences in EPS performance in the HB. First, the main stream is taken as the dividing line between the northern and southern HB (the Qinling Mountains–Huai River line is the north–south boundary of China). The mean verification metrics of the EPSs in the northern and southern HB are calculated and displayed in Table 4 to study the adaptability of the EPSs in the climatic transition zone. Second, the spatial distributions of the verification metrics of the EPSs are mapped by GIDS (Figs. 9, 10), which intuitively depicts spatial changes in the prediction quality of the EPSs.

Table 4 Mean verification metrics of EPSs in northern and southern HB
Fig. 9
figure 9

Spatial distributions of PQPF CRPSSs in the HB

Fig. 10
figure 10

Distributions of QPF RMSEs in the HB

The PQPF skill and QPF accuracy of all EPSs at the large threshold are worse than those at the low threshold (Figs. 6, 7); because heavy precipitation is more frequent in the southern HB, the PQPF skill and QPF accuracy are better in the northern HB than in the southern HB. UKMO EPS has the best PQPF skill in the northern HB due to its better skill at the low threshold, while the PQPF skill of JMA EPS is superior in the southern HB. In terms of QPF accuracy, the ECMWF and JMA EPSs perform best in the southern and northern HB, respectively. The PQPF skill and QPF accuracy of CMA EPS are the worst in both the northern and southern HB.

The PQPF skill decreases significantly in the mountainous areas (the yellow and green parts in Fig. 9). The PQPF skill distribution is not completely consistent with the precipitation distribution because atmospheric predictability varies with precipitation formation and type (Qingyun et al. 2019). Many factors affect precipitation formation and type, such as atmospheric circulation, topography, and geographical location (including lake and ocean effects) (Chen et al. 2018). For PQPF, forecasts of precipitation caused by ocean effects are skillful, whereas precipitation caused by complex terrain remains very difficult to forecast because the native resolution of the EPSs is inadequate (Kaufman et al. 2003).

For QPF, ensemble averaging smooths out the above skill, giving rise to a QPF accuracy distribution similar to the precipitation distribution. In addition, regardless of the amount of precipitation, QPF accuracy is always low in mountainous areas.

3.2 The performances of EPSs for different lead times

A longer lead time is more favorable for future flood control operations but is disadvantageous to forecast accuracy. Hence, it is necessary to analyze the forecast quality of EPSs for different lead times. In this study, we select four lead times of 24 h, 48 h, 72 h and 168 h and verify the accumulated precipitation forecast for each.

Figure 11 demonstrates the PQPF CRPSSs of the EPSs for different lead times, along with the 95% confidence intervals. As the lead time increases, the PQPF skill consistently decreases, and the confidence interval widens. The PQPF skill is poor for the lead time of 168 h; as a result, the forecast at this lead time has little practical value. As the lead time increases, the PQPF skill advantage of ECMWF EPS gradually appears. The PQPF of CMA EPS performs poorly at all lead times.

Fig. 11
figure 11

PQPF CRPSSs of EPSs for different lead times

Figure 12 shows the QPF accuracies for different lead times. With increasing lead time, QPF accuracy decreases. For a lead time of 24 h, the QPF accuracy of ECMWF EPS ranks second, behind JMA EPS. The QPF accuracy of CMA EPS lags behind the others at all lead times.

Fig. 12
figure 12

QPF RMSEs of EPSs for different lead times

In particular, ECMWF EPS shows good PQPF skill and QPF accuracy at long lead times because a long lead time requires more ensemble members to attain the maximum forecast skill (Clark et al. 2011; Richardson 2001). CMA EPS has the fewest ensemble members among the five EPSs, which may be one reason for its poor performance. Thus, for long lead times, the number of ensemble members is an important consideration when selecting an EPS.

3.3 The performance of the multimodel superensemble

The multimodel superensemble is built from all members of the five EPSs by BMA. The weights of the members of an individual EPS are constrained to be equal because they are derived from the same model (Robert 2018; Xiang et al. 2014). This section focuses on the flood control support capacity of the multimodel superensemble. Since the flood season in the HB lasts from June to September each year, July 31 to August 31 is selected as the verification period for the multimodel superensemble, and the performances of the five individual EPSs during the same period are also verified for comparison.

  1. (1)

    The length of the BMA training period

    The BMA model is reconstructed each day for each station throughout the verification period. The training period is a sliding window, and the parameters are calibrated using the n days preceding each forecast. In this study, following the references (Bo et al. 2017; Wu et al. 2014), training periods of 35, 40, 45, 50, 55 and 60 days are tested. The means of the CRPSS and RMSE are taken over all stations and all days in the verification period.

    Table 5 lists the results for the different training periods for a lead time of 24 h. The multimodel superensemble shows the lowest RMSE with 35 days, and the discrepancy in CRPSS between 35 days and the other training periods is quite small. Thus, 35 days is chosen as the length of the training period for the lead time of 24 h. Table 6 lists the training period lengths for the other lead times.

  2. (2)

    The performance of the multimodel superensemble

    The multimodel superensemble is expected to outperform all individual EPSs. Figure 13 illustrates the PQPF skill of the five EPSs and the multimodel superensemble for different lead times from July 31 to August 31, along with the 95% confidence intervals. Only at lead times of 24 h and 48 h do the EPSs and the multimodel superensemble show better PQPF skill than climatology during the flood season. The multimodel superensemble slightly improves on the individual EPSs at all lead times except 168 h, manifesting a slightly higher CRPSS and the narrowest confidence interval. At the lead time of 168 h, however, the PQPF skill of the multimodel superensemble ranks second, behind ECMWF EPS. It seems that a 60-day training period does not meet the training requirements at 168 h.

    The QPF accuracy of the multimodel superensemble is likewise expected to improve on that of the five individual EPSs. Figure 14 plots the QPF accuracies of the EPSs and the multimodel superensemble for different lead times. In contrast to the PQPF results, the multimodel superensemble only slightly improves QPF accuracy over the individual EPSs at all lead times except 168 h, at which it exhibits remarkable performance with the highest QPF accuracy.

    Compared with meteorological elements such as air temperature and wind speed, precipitation is more difficult to postprocess statistically. The reasons, as listed by Scheuerer and Hamill (2015), are as follows: (1) the discontinuous, skewed distribution of precipitation is difficult to fit; (2) the difficulty of forecasting increases with increasing precipitation threshold; and (3) the shortage of samples of heavy rainfall and rainstorms is also a major problem.

    Research on the value of multimodel superensembles has yielded mixed results. Hamill (2012) stated that PQPFs based on multimodel superensembles have better reliability and prediction skill than PQPFs based on individual EPSs. Vogel et al. (2017) found that BMA was not very good at improving skill and was not much more valuable than climatology at long lead times; this finding is consistent with the results of this paper. Hagedorn et al. (2012) investigated combining all available EPSs into a multimodel superensemble and found that ECMWF EPS was the major contributor to the performance improvement and that the multimodel superensemble did not improve much on ECMWF alone, which may explain the CRPSS results at 168 h in this paper. Saedi et al. (2020) showed that BMA greatly improves probabilistic predictions but is not very effective for deterministic predictions. Ji et al. (2019) further showed that deterministic predictions constructed by BMA are accurate at low precipitation thresholds but have limited accuracy at medium and high thresholds. Our QPF accuracy results at short lead times agree well with these two findings.

    The improvement in forecast reliability of the multimodel superensemble mainly comes from the potential cancelation of biases among different members (Duan et al. 2012). If the forecast skills of the EPSs differ markedly from each other, then the deterministic prediction obtained through postprocessing is better than the best EPS (Winter and Nychka 2010). If the QPFs released by different EPSs are highly correlated, or if the best EPS performs significantly better than the others, then the postprocessed QPF cannot always beat the best individual EPS (Jeong and Kim 2009; Hagedorn et al. 2012; Winter and Nychka 2010). For the lead time of 168 h, the EPSs have markedly different QPF accuracies (Fig. 12), eliciting the good performance of BMA in improving QPF accuracy.

Table 5 The mean verification metrics of the multimodel superensemble by different training sample periods for a lead time of 24 h
Table 6 The lengths of the BMA training period for different lead times
Fig. 13
figure 13

PQPF CRPSSs of EPSs and multimodel superensemble for different lead times

Fig. 14
figure 14

QPF RMSEs of EPSs and multimodel superensemble for different lead times

4 Summary

This study provides a comprehensive verification of the QPFs and PQPFs from five operational global EPSs in the HB from April to December 2015. Focusing on the lead time of 24 h, forecast quality is evaluated in terms of overall performance, different thresholds and spatial adaptability. Forecast quality at different lead times is then assessed. In addition, for each lead time, BMA is used to integrate all members of the five EPSs, and the overall performance of the multimodel superensemble in the main flood season of the HB is verified. The main conclusions are as follows:

  1. (1)

    As the ECMWF EPS has the most ensemble members, it has the best forecast quality in both QPFs and PQPFs at longer lead times. CMA EPS has the fewest ensemble members, which may account for its poor forecast quality at all lead times.

  2. (2)

    EPSs have reference value for drought warnings in the HB. The PQPF of the ECMWF EPS shows potential for the prediction of intense precipitation.

  3. (3)

    For long lead times, a large number of ensemble members is valuable for high forecast quality, so computing resources should be allocated to increasing the number of ensemble members. According to the spatial distribution of EPS performance, for a lead time of 24 h, resources should instead be focused on developing higher resolution, which is conducive to increasing forecast skill for various types of precipitation.

  4. (4)

    Owing to the climatic transition zone over the HB, EPS forecast quality is better in the northern HB than in the southern HB. Furthermore, the PQPF skill is also affected by precipitation type: PQPFs are skillful for precipitation caused by the ocean effect but poor for precipitation affected by mountainous topography. QPF accuracy is also influenced by terrain, decreasing in mountainous areas.

  5. (5)

    The multimodel superensemble slightly improves the PQPF skill at short lead times; at long lead times, it is not much more valuable than climatology. When the QPF accuracies of the individual EPSs differ significantly, the multimodel superensemble obtains improved QPF accuracy.

The results of this study apply only to a specific river basin, but the analytical method for assessing the adaptability of ensemble forecasting over a river basin is generally applicable. The results not only provide a detailed feedback report for precipitation ensemble forecast models but also inform subsequent watershed flood forecasting based on precipitation ensemble forecasts. Limited by the overlap between the forecast datasets and the observed data, only a portion of 2015 is used in this study; future work should therefore include a longer verification period to derive more general conclusions. In future studies, many other postprocessing methods should also be tested and compared (Aminyavari and Saghafian 2019; Huo et al. 2019; Shin et al. 2019). Furthermore, gridded datasets may help further improve the accuracy of EPS assessment.