1 Introduction

The Huaihe River Valley (HRV) is a key agricultural area in China that sustains food supplies for the whole nation, and summertime flooding is a primary threat to local agriculture (Li et al. 2011). Historical records show that devastating floods over this region are tightly associated with heavy rainfall events. For example, in late June and early July of 2003, consecutive heavy rainfall events brought about 450 mm of precipitation to the HRV, causing severe flooding, loss of life, and billions of dollars in agricultural and economic damage (Zhang and You 2014). Thus, accurately predicting HRV heavy rainfall events is of vital importance for reducing the economic loss caused by flooding and downpours.

However, previous studies on rainfall prediction focused primarily on the seasonal mean precipitation, and their findings might not be readily adaptable to heavy rainfall events. Specifically, the statistical behavior of heavy rainfall events is determined by the tail of the probability distribution (Katz and Brown 1992), which behaves substantially differently from the mean (Gumble 1954; Wilson and Toumi 2005). Furthermore, heavy rainfall events have been found to respond differently to climate variability and climate change than the seasonal mean precipitation does (Chen et al. 2012; Chou et al. 2012; Qian et al. 2007; Wu and Fu 2013). Consequently, the factors responsible for seasonal mean precipitation over the HRV, including the sea surface temperature anomaly (SSTA) patterns (Chang et al. 2000; Huang and Sun 1992; Yang and Lau 2004), springtime soil moisture content (Meng et al. 2014; Zhang and Zuo 2011), and Arctic sea ice extent (Li and Leung 2013; Vihma 2014), might not serve in the same way as predictors of heavy rainfall events.

In order to improve seasonal prediction of heavy rainfall events, the prediction models should be updated and the climate predictors suitable for the HRV heavy rainfall events should be identified. On top of that, a reliable statistical inference on the HRV heavy rainfall events and a better understanding of the related physical processes are needed.

This study advances climate prediction of HRV heavy rainfall by applying a novel statistical framework. Unlike traditional statistical prediction models, the framework does not require predefinition of distribution kernels, and thus largely reduces the biases that arise in rainfall statistical models from the subjective selection of distribution kernels (Li and Li 2013). By objectively identifying heavy rainfall events, the framework can better model the statistical behavior of heavy rainfall, and thus help to improve the understanding of related physical mechanisms. The probability behavior of heavy rainfall events is then linked to preceding SSTA patterns to identify the potential climate predictors. Two statistical prediction models, one a linear model and the other a machine learning algorithm, are constructed to assess the predictability of the HRV heavy rainfall events using the identified SSTA predictors.

The rest of the paper is organized as follows. In Sect. 2, data, analysis methods, and the statistical framework are described. The Bayesian inference on the HRV summertime heavy rainfall events is presented in Sect. 3. The SSTA predictors of the HRV heavy rainfall are identified based on exploratory data analysis in Sect. 4. Furthermore, the atmospheric circulation is analyzed to establish the physical linkage between preceding SSTA and summertime HRV heavy rainfall events. In Sect. 5, two statistical prediction models, a multiple linear regression model and a support vector machine (SVM) algorithm, are constructed to test the predictability of the HRV heavy rainfall events using the SSTA predictors identified in Sect. 4. Concluding remarks are presented in the last section.

2 Data and methods

2.1 Data

The data used in this study include gridded daily precipitation data archived by the China Meteorological Administration (CMA) (Xie et al. 2007). The period analyzed is 1961–2012, chosen to avoid potential inaccuracy in the statistical inference caused by sparse data coverage prior to the 1960s. The HRV covers a geographical domain of 30.5°N–36.5°N; 110.5°E–121.5°E (Hong and Liu 2012), which is delineated by the box in Fig. 1. Daily precipitation within this domain is averaged in order to capture the regional-scale features of rainfall from the climate perspective.

Fig. 1 Climatology (1961–2012) of monthly mean precipitation (gray bars, mm day−1) over the HRV; the error bars denote one standard deviation of interannual variation of precipitation in each month. The inner plot shows the 1961–2012 climatology of summer (JJA) precipitation in Eastern China. The black box denotes the HRV region

Summer season is defined as June–July–August (JJA), when precipitation over the HRV peaks (Fig. 1). The precipitation during these three months also displays the strongest interannual variation, indicating the importance of JJA precipitation to climate variability in this region (Fig. 1). Furthermore, summer precipitation in the HRV is critical to local agriculture and economy, making warm-season rainfall a central problem of climate prediction for this region.

Atmospheric circulation fields, i.e. wind patterns, were taken from the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis (Kalnay et al. 1996) for 1961–2012. In this study, both daily and JJA mean circulation were analyzed. In order to explore the potential predictability of rainfall, both simultaneous and preceding sea surface temperature anomalies (SSTA) were examined. The SST data are from the National Oceanic and Atmospheric Administration (NOAA) Extended Reconstructed SST [ERSST Version 3 (Smith et al. 2008)].

2.2 Rainfall probability framework: finite Normal mixture model

To describe the probability distribution of summer rainfall over the HRV, a rainfall framework based on a three-cluster Normal mixture model was implemented (Li and Li 2013). The advantage of the Normal mixture model lies in the flexibility of its distribution shape, because any smooth distribution can be approximated by a combination of a finite number of Normals. Thus, it overcomes a limitation of traditional statistical models, which require the subjective selection of predefined distribution kernels and cannot be easily adapted to different climate zones. For example, the Log-Normal distribution tends to better model the rainfall in the subtropical regions, while the Gamma distribution fits tropical precipitation better (Cho et al. 2004). Such a subjective selection of distribution models can introduce bias to the statistical inference on the HRV heavy rainfall events. The flexibility of the Normal mixture model in distribution shape is especially important for HRV summer rainfall, because daily rainfall displays multi-modal features that cannot be captured by traditional models with uni-modal distribution kernels.

When constructing finite Normal mixture models, choosing the optimal number of clusters is difficult and sometimes controversial (McLachlan and Peel 2000; Melnykov and Maitra 2010; Richardson and Green 1997). Generally, adding more clusters to the mixture model could better approximate the true distribution of rainfall. However, an unlimited increase in clusters also increases the risk of over-fitting (Lin et al. 2007), blurs the physical meanings of each cluster, and hampers the interpretation of mechanisms that control the rainfall probability distribution. In this study, a three-cluster Normal mixture was constructed. The three clusters reflect the probability behavior of different summer rainfall categories over the HRV, i.e. the light, moderate, and heavy rainfall according to the American Meteorological Society (AMS) Glossary of Meteorology (2009).

The three-cluster finite Normal mixture model takes the mathematical form:

$$ y_{i} \left| {\pi ,\mu ,\phi \sim \sum\limits_{h = 1}^{3} {\pi_{h} N\left( {y_{i} \left| {\mu_{h} ,\phi_{h}^{ - 1} } \right.} \right)} } \right. , $$
(1)

where \( \pi \) is the weight of the rainfall clusters (i.e. the frequency of each rainfall type). It is noteworthy that \( \sum\nolimits_{h = 1}^{3} {\pi_{h} } = 1 \), and thus the \( \pi_{h} \) are mutually dependent. \( \mu \) is the cluster mean, and \( \phi \) is the precision of the Normal distributions. \( h \in \left\{ {1,2,3} \right\} \) is the cluster index.
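To make Eq. (1) concrete, the following minimal sketch evaluates a three-cluster Normal mixture density with NumPy. The function names and the parameter values are illustrative assumptions (they are not the fitted HRV values); the only point is to show how \( \pi_{h} \), \( \mu_{h} \), and the precision \( \phi_{h} \) enter the density.

```python
import numpy as np

def normal_pdf(y, mean, precision):
    """Density of a Normal with the given mean and precision (1 / variance)."""
    return np.sqrt(precision / (2.0 * np.pi)) * np.exp(-0.5 * precision * (y - mean) ** 2)

def mixture_pdf(y, weights, means, precisions):
    """Mixture density of Eq. (1): sum over clusters of pi_h * N(y | mu_h, 1/phi_h)."""
    y = np.asarray(y, dtype=float)[..., None]   # trailing axis broadcasts over clusters
    return np.sum(weights * normal_pdf(y, means, precisions), axis=-1)

# Hypothetical parameter values chosen only for illustration.
weights = np.array([0.5, 0.35, 0.15])       # pi_h, must sum to 1
means = np.array([1.0, 8.0, 20.0])          # mu_h (mm/day)
precisions = np.array([1.0, 0.1, 0.02])     # phi_h = 1 / variance

grid = np.linspace(-50.0, 150.0, 20001)
density = mixture_pdf(grid, weights, means, precisions)
total_mass = np.sum(density) * (grid[1] - grid[0])   # numerical integral, should be close to 1
```

Because the weights sum to one and each component is a proper density, the mixture integrates to one, which `total_mass` checks numerically.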

To obtain the distribution parameters (\( \pi \), \( \mu \), and \( \phi \)) in the mixture model, Bayesian statistical inference was implemented. The priors about \( \pi \), \( \mu \), and \( \phi \) are as follows:

$$ \pi \left| {a_{1} ,a_{2} ,a_{3} \sim Dirichlet\left( {a_{1} ,a_{2} ,a_{3} } \right)} \right. $$
(2)
$$ \left( {\mu_{h} ,\phi_{h} } \right) \sim Normal\left( {\mu_{h} \left| {\mu_{0h} ,\kappa_{h} \phi_{h}^{ - 1} } \right.} \right)Gamma\left( {\phi_{h} \left| {\alpha_{h} ,\beta_{h} } \right.} \right) $$
(3)

In Eq. (3), \( \kappa \) is the degree of freedom, and the \( Gamma\left( {\phi_{h} \left| {\alpha_{h} ,\beta_{h} } \right.} \right) \) was parameterized to have mean \( \alpha_{h} /\beta_{h} \) and variance \( \alpha_{h} /\beta_{h}^{2} \). The parameters in the prior distributions (Eqs. 2–3) were assigned according to the AMS definitions of light (0–6 mm day−1), moderate (6–18 mm day−1), and heavy rainfall (>18 mm day−1) (AMS Glossary of Meteorology 2009). However, we kept the prior distributions weakly informative so that the posterior distribution is driven mainly by the data: \( (a_{1} ,a_{2} ,a_{3} ) = (0.5,0.35,0.15) \), \( \mu_{0h} = (1.0,8.0,20.0) \), \( \kappa_{h} = (1,1,1) \), \( \alpha_{h} = (1.0,1.0,0.4) \), and \( \beta_{h} = (1.0,1.0,1.0) \).

Since the priors (Eqs. 2–3) and the likelihood model (Eq. 1) are semi-conjugate, the full conditional posterior distributions can be derived analytically (Gelfand 2000). The Gibbs sampler used for posterior computation with the Markov Chain Monte Carlo (MCMC) algorithm is as follows:

$$ \Pr \left( {z_{i} = h\left| - \right.} \right) = \frac{{\pi_{h} Normal\left( {y_{i} \left| {\mu_{h} ,\phi_{h}^{ - 1} } \right.} \right)}}{{\sum\nolimits_{h = 1}^{3} {\pi_{h} Normal\left( {y_{i} \left| {\mu_{h} ,\phi_{h}^{ - 1} } \right.} \right)} }} $$
(4)
$$ \left( {\mu_{h} ,\phi_{h} \left| - \right.} \right) \sim Normal\left( {\mu_{h} \left| {\hat{\mu }_{0h} ,\hat{\kappa }_{h} \phi_{h}^{ - 1} } \right.} \right)Gamma\left( {\phi_{h} \left| {\hat{\alpha }_{h} ,\hat{\beta }_{h} } \right.} \right) , $$
(5)

where \( \hat{\kappa }_{h} = \left( {\kappa_{h}^{ - 1} + n_{h} } \right)^{ - 1} \); \( \hat{\mu }_{0h} = \hat{\kappa }_{h} \left( {\kappa_{h}^{ - 1} \mu_{0h} + n_{h} \bar{y}_{h} } \right) \); \( \hat{\alpha }_{h} = \alpha_{h} + \frac{{n_{h} }}{2} \); and \( \hat{\beta }_{h} = \beta_{h} + \frac{1}{2}\left\{ {\sum\nolimits_{{i:z_{i} = h}} {\left( {y_{i} - \bar{y}_{h} } \right)^{2} } + \left( {\frac{{n_{h} }}{{1 + \kappa_{h} n_{h} }}} \right)\left( {\bar{y}_{h} - \mu_{0h} } \right)^{2} } \right\} \). \( n_{h} = \sum\nolimits_{i = 1}^{n} {1\left( {z_{i} = h} \right)} \) denotes the number of samples in cluster \( h \), and \( \bar{y}_{h} = n_{h}^{ - 1} \sum\nolimits_{{i:z_{i} = h}} {y_{i} } \) is the sample mean of cluster \( h \).

$$ \left( {\pi_{1} ,\pi_{2} ,\pi_{3} \left| - \right.} \right) \sim Dirichlet\left( {a_{1} + n_{1} ,a_{2} + n_{2} ,a_{3} + n_{3} } \right) $$
(6)

In this study, the MCMC algorithm (Eqs. 4–6) is applied to the daily rainfall of each summer during 1961–2012. Since heavy rainfall is more intense than moderate and light rainfall, this physical constraint is placed upon \( \mu_{h} \) (\( \mu_{1} < \mu_{2} < \mu_{3} \)) to deal with the label switching issue (Stephens 2000). The MCMC algorithm is run for 1000 iterations; the first 200 burn-in samples are discarded, and the remaining 800 post burn-in samples are used in the analysis.
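The sampling steps in Eqs. (4)–(6) can be sketched as follows. This is a minimal NumPy illustration rather than the implementation used in the study: the function name, the starting values, the synthetic input, and in particular the way the ordering constraint is handled (relabeling the clusters by sorted \( \mu_{h} \) after each sweep) are assumptions made for the example. The hyperparameters are the ones quoted above.

```python
import numpy as np

def gibbs_normal_mixture(y, n_iter=1000, burn_in=200, seed=0):
    """Gibbs sampler sketch for the three-cluster Normal mixture (Eqs. 4-6)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    # Prior hyperparameters quoted in the text.
    a = np.array([0.5, 0.35, 0.15])
    mu0 = np.array([1.0, 8.0, 20.0])
    kappa = np.array([1.0, 1.0, 1.0])
    alpha = np.array([1.0, 1.0, 0.4])
    beta = np.array([1.0, 1.0, 1.0])
    # Starting values (an arbitrary choice for this sketch).
    pi, mu, phi = a / a.sum(), mu0.copy(), np.ones(3)
    kept = {"pi": [], "mu": [], "phi": []}
    for it in range(n_iter):
        # Eq. (4): membership probabilities, then one label draw per day.
        log_w = np.log(pi + 1e-300) + 0.5 * np.log(phi) - 0.5 * phi * (y[:, None] - mu) ** 2
        w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        u = rng.random(y.size)
        z = np.minimum((u[:, None] > np.cumsum(w, axis=1)).sum(axis=1), 2)
        counts = np.zeros(3)
        for h in range(3):
            members = y[z == h]
            n = members.size
            ybar = members.mean() if n > 0 else 0.0
            ss = np.sum((members - ybar) ** 2)
            # Eq. (5): Normal-Gamma full conditional for (mu_h, phi_h).
            kappa_hat = 1.0 / (1.0 / kappa[h] + n)
            mu_hat = kappa_hat * (mu0[h] / kappa[h] + n * ybar)
            alpha_hat = alpha[h] + 0.5 * n
            beta_hat = beta[h] + 0.5 * (ss + n / (1.0 + kappa[h] * n) * (ybar - mu0[h]) ** 2)
            phi[h] = max(rng.gamma(alpha_hat, 1.0 / beta_hat), 1e-12)  # Gamma(shape, scale=1/rate)
            mu[h] = rng.normal(mu_hat, np.sqrt(kappa_hat / phi[h]))
            counts[h] = n
        # Eq. (6): Dirichlet full conditional for the cluster weights.
        pi = rng.dirichlet(a + counts)
        # Assumed handling of label switching: relabel so that mu_1 < mu_2 < mu_3.
        order = np.argsort(mu)
        mu, phi, pi = mu[order], phi[order], pi[order]
        if it >= burn_in:
            for name, value in (("pi", pi), ("mu", mu), ("phi", phi)):
                kept[name].append(value.copy())
    return {name: np.array(values) for name, values in kept.items()}

# Synthetic 92-day "summer" used only to exercise the sampler.
rng = np.random.default_rng(42)
rain = np.concatenate([rng.normal(1.0, 0.5, 50), rng.normal(6.0, 1.5, 30), rng.normal(17.0, 3.0, 12)])
post = gibbs_normal_mixture(rain)
```

Each entry of `post` holds the 800 post burn-in draws, one row per draw and one column per cluster. Rejection sampling under the ordering constraint would be an alternative to the relabeling step used here.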

The goodness-of-fit of the constructed three-cluster Normal mixture model was compared with that of widely used distribution models, including the Gamma, Exponential, Weibull, and Log-Normal (Table 1). The quantile-based squared error of the five distribution models was calculated to assess the goodness-of-fit of each model. A smaller squared error indicates that a model better approximates the observed rainfall distribution. According to Table 1, the Normal mixture model shows the smallest squared error, suggesting that it performs better than the other four traditional rainfall models. Thus, the application of the Normal mixture model can improve the statistical inference and diagnostic analysis of regional hydroclimate over the HRV.
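The exact definition of the quantile-based squared error is not spelled out here, so the sketch below adopts one plausible reading as an assumption: the sum, over a fixed set of quantile levels, of the squared difference between observed and model quantiles. The function name, the quantile levels, the synthetic data, and the moment-matched candidate models are all hypothetical.

```python
import numpy as np

def quantile_squared_error(observed, model_sample, levels=None):
    """Sum of squared differences between observed and model quantiles."""
    if levels is None:
        levels = np.linspace(0.01, 0.99, 99)   # assumed set of quantile levels
    q_obs = np.quantile(observed, levels)
    q_mod = np.quantile(model_sample, levels)
    return float(np.sum((q_obs - q_mod) ** 2))

rng = np.random.default_rng(1)
observed = rng.gamma(shape=0.7, scale=6.0, size=4000)   # synthetic stand-in for daily rainfall

# Two candidate models, matched to the sample moments and represented by large samples.
mean, var = observed.mean(), observed.var()
exp_sample = rng.exponential(scale=mean, size=200_000)
gam_sample = rng.gamma(shape=mean ** 2 / var, scale=var / mean, size=200_000)

err_exp = quantile_squared_error(observed, exp_sample)
err_gam = quantile_squared_error(observed, gam_sample)
```

Under this reading, the model with the smaller error reproduces the observed quantiles more closely across the whole distribution, including the upper tail that matters for heavy rainfall.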

Table 1 Comparison of distribution models used to describe the probability behaviors of summer precipitation over the HRV

3 Bayesian inference on summer precipitation over the HRV

From the posterior distributions (Eqs. 4–6), the parameters in the Normal mixture model (Eq. 1) can be sampled. In this study, the interannual variation of the distribution parameters \( \mu_{h} \) and \( \pi_{h} \) is of primary interest, where \( \mu_{h} \) reflects the intensity of light, moderate, and heavy rainfall and \( \pi_{h} \) describes their frequency.

3.1 Bayesian inference

Figure 2 shows the year-to-year fluctuations of \( \mu_{h} \) and \( \pi_{h} \) over the HRV region during the period of 1961–2012. The 52-year climatology is about 1.09 mm day−1 for light rainfall, 5.45 mm day−1 for moderate rainfall, and 16.23 mm day−1 for heavy rainfall (Fig. 2a). Compared to the AMS criteria for the three rainfall types, the \( \mu_{h} \) derived from the posterior distribution is lower for moderate and heavy rainfall, indicating that the criteria used to define rainfall categories should be adjusted for different regions. Constraints placed on \( \mu_{h} \) are needed in the algorithm so that the sampled \( \mu_{h} \) of different rainfall clusters do not overlap in the MCMC algorithm (Fig. 2a). Besides rainfall intensity, the frequency of each rainfall category (\( \pi_{h} \)) can also be derived from the framework. The climatological frequency is 49 % for light rainfall, 38 % for moderate rainfall, and 13 % for heavy rainfall (Fig. 2b).

Fig. 2 Bayesian inference on the interannual variation of a the intensity (mm day−1) and b the frequency of light (red curves), moderate (black curves), and heavy (blue curves) rainfall events. Shading represents the 95 % credible interval as derived from the post burn-in Markov Chain Monte Carlo samples

The sampling uncertainty of \( \mu_{h} \) and \( \pi_{h} \) from the MCMC algorithm is also calculated. The uncertainty range is derived as the 95 % credible interval based on post burn-in MCMC samples (Hoff 2009). Among the three clusters, the inference is most confident for light rainfall intensity, owing to the relatively larger sample size (\( \hat{\kappa } \)) and smaller sample standard deviation (\( \phi_{1}^{ - 1} \)) in this cluster. In contrast, the heavy rainfall cluster has the fewest rainfall samples, and the cluster mean \( \mu_{3} \) usually shows the largest uncertainty compared to \( \mu_{1} \) and \( \mu_{2} \) (Fig. 2a). However, compared to other regions such as the Southeastern United States (Li and Li 2013), sampling uncertainty over the HRV is within the magnitude of interannual variation of heavy rainfall intensity in the period of 1961–2012, i.e. the inference on the HRV heavy rainfall events is of considerably higher confidence.
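As an illustration of how such an interval can be obtained from the post burn-in samples, the sketch below computes an equal-tailed percentile interval. Whether the study uses an equal-tailed or a highest-density interval is not stated in this section, so the equal-tailed choice, the function name, and the synthetic samples are assumptions.

```python
import numpy as np

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from MCMC draws (along the first axis)."""
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(samples, [tail, 100.0 - tail], axis=0)
    return lower, upper

# Synthetic stand-in for 800 post burn-in draws of mu_3 in one summer.
rng = np.random.default_rng(2)
mu3_samples = rng.normal(16.0, 1.5, size=800)
lower, upper = credible_interval(mu3_samples)
inside = np.mean((mu3_samples >= lower) & (mu3_samples <= upper))   # fraction of draws covered
```

Applied year by year to the draws of \( \mu_{h} \) and \( \pi_{h} \), this yields the kind of shaded band shown around each curve.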

It is noteworthy that Bayesian inference on the Normal mixture model does not require the samples to be strictly independent and identically distributed (i.i.d.) (Hoff 2009). In other words, the inference on the posterior distribution is not affected by the autocorrelation of rainfall samples. We ran the model using de-autocorrelated samples, and the inference on the distribution parameters is almost identical (Appendix 1). The results suggest that the model can be applied to climate variables, including precipitation and temperature, which usually have high autocorrelation. In the remainder of this study, the inference using the original sample set is adopted for better temporal consistency.

3.2 Contribution of heavy rainfall to HRV regional hydroclimate

According to the Normal mixture model, the sample mean (i.e. seasonal mean) of rainfall equals the weighted average of the three rainfall clusters: \( \bar{Y}\left| {\pi ,\mu ,\phi } \right. = \sum\nolimits_{h = 1}^{3} {\pi_{h} \mu_{h} } \). Using this relationship, the contribution of each rainfall cluster to summertime hydroclimate over the HRV can be assessed.

Figure 3a shows the relationship between \( \pi_{h} \mu_{h} \) and seasonal mean precipitation (\( \bar{Y} \)). Climatologically, the HRV receives 432 mm of precipitation in JJA, with approximately 50 mm from light rainfall, 190 mm from moderate rainfall, and 192 mm from heavy rainfall (Fig. 3a). The contributions of moderate and heavy rainfall to the total seasonal precipitation amount are almost equivalent. However, the variance of seasonal mean precipitation at interannual scales is primarily explained by heavy rainfall. Specifically, the \( R^{2} \) between heavy rainfall and seasonal mean precipitation reaches 0.58 for the 1961–2012 period, indicating that about 60 % of the JJA mean precipitation variance can be explained by the heavy rainfall events (Fig. 3a). In contrast, light and moderate rainfall together explain only 21 % of the seasonal rainfall variance (Fig. 3a). The results indicate that heavy rainfall events can significantly modulate seasonal scale hydroclimate over the HRV, exerting persistent climatic impact on regional water resources.
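As a rough consistency check on this decomposition, the climatological intensities and frequencies reported in Sect. 3.1 can be combined directly. The calculation assumes a 92-day JJA season and multiplies climatological means of \( \pi_{h} \) and \( \mu_{h} \), which only approximates the average of the yearly products:

$$ \sum\limits_{h = 1}^{3} {\pi_{h} \mu_{h} } \approx 0.49 \times 1.09 + 0.38 \times 5.45 + 0.13 \times 16.23 \approx 0.53 + 2.07 + 2.11 \approx 4.7\;{\text{mm}}\;{\text{day}}^{ - 1} , $$

$$ 92 \times \left( {0.53,\;2.07,\;2.11} \right) \approx \left( {49,\;191,\;194} \right)\;{\text{mm}},\quad {\text{total}} \approx 434\;{\text{mm}} . $$

These values agree with the reported 50, 190, and 192 mm (432 mm in total) to within the rounding of the inputs.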

Fig. 3 a Contribution of light (red dots), moderate (gray dots), and heavy (blue dots) rainfall clusters to summer season cumulative precipitation over the HRV (mm); and the relationship between summer precipitation (unit: mm day−1) and b heavy rainfall intensity (blue dots; unit: mm day−1) and c heavy rainfall frequency (blue dots). The straight lines are the least-squares best-fit lines

Heavy rainfall events contribute to summer season precipitation mainly through the interannual variation of their frequency (\( \pi_{3} \)) rather than their intensity (\( \mu_{3} \)). Specifically, the correlation between heavy rainfall intensity (\( \mu_{3} \)) and seasonal mean precipitation (\( \bar{Y} \)) is 0.17, which is statistically insignificant (Fig. 3b). In contrast, the frequency of heavy rainfall is highly correlated with seasonal mean precipitation, with a correlation coefficient approaching 0.60 (p < 0.001). Thus, a better understanding of the factors and processes responsible for HRV heavy rainfall frequency is key to improving seasonal prediction of HRV summer precipitation.

4 Wintertime SSTA related to HRV heavy rainfall events in the summer

Previous studies have suggested that global climate modes could influence rainfall events by changing the large-scale background circulation in which synoptic-scale systems develop (Higgins et al. 2007; Ropelewski and Halpert 1987; Ting and Wang 1997). In East China, summer rainfall events are associated with the East Asia summer monsoon (EASM) (Ding 1992; Lau 1992; Matsumoto 1988), and are further linked to well-defined climate modes, including the El Niño–Southern Oscillation (ENSO) (Wang et al. 2003; Wu and Wang 2002), the Pacific Decadal Oscillation (PDO) (Lei et al. 2011), and the Indian Ocean Dipole (IOD) (Yang et al. 2010), as well as anthropogenic forcing (Wang et al. 2013).

The linkage between the HRV heavy rainfall frequency and climate factors is explored by regressing SSTA upon the sampled heavy rainfall frequency. Establishing such linkages could provide insights for seasonal prediction of HRV heavy rainfall events as well as summertime hydroclimate over the HRV.
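A gridpoint regression of this kind can be sketched as follows. The array layout (years, latitude, longitude), the function name, and the synthetic data are assumptions for illustration; the significance test used for stippling is not reproduced here, and only the regression slope and the correlation at each gridpoint are computed.

```python
import numpy as np

def regress_field_on_index(field, index):
    """Regress a (years, lat, lon) field on a (years,) index at every gridpoint."""
    index_anom = index - index.mean()
    field_anom = field - field.mean(axis=0)
    sxx = np.sum(index_anom ** 2)
    # Least-squares slope: covariance of field and index divided by index variance.
    slope = np.tensordot(index_anom, field_anom, axes=(0, 0)) / sxx
    # Pearson correlation at each gridpoint, e.g. as input to a significance test.
    corr = slope * np.sqrt(sxx / np.sum(field_anom ** 2, axis=0))
    return slope, corr

# Synthetic example: 52 years on a 4 x 5 grid with a known pattern plus noise.
rng = np.random.default_rng(3)
frequency = rng.normal(0.13, 0.03, size=52)            # stand-in for heavy rainfall frequency
pattern = rng.normal(0.0, 5.0, size=(4, 5))            # hypothetical SSTA response per unit frequency
ssta = frequency[:, None, None] * pattern + rng.normal(0.0, 0.05, size=(52, 4, 5))
slope_map, corr_map = regress_field_on_index(ssta, frequency)
```

Repeating the call with SSTA fields averaged over different three-month windows (JFM through JJA) gives one regression map per lead time.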

4.1 Association of tropical SSTA to heavy rainfall events over the HRV

Figure 4 shows the precedent and synchronized SSTA regressed upon the frequency of the heavy rainfall events. Generally, the increased frequency of HRV heavy rainfall events is associated with warmer SSTA over the north Indian Ocean, the equatorial western Pacific, and the tropical Atlantic Ocean (Fig. 4). Specifically, the warm SSTAs over the three tropical oceans occur in the preceding winter (January–February–March, JFM), 5 months before the summer (JJA, Fig. 4a). In the north Indian Ocean and the equatorial western Pacific, SSTAs persist throughout summer (Fig. 4b–f), indicating that the SSTAs might influence HRV heavy rainfall events by providing persistent oceanic forcing on the overlying atmosphere and thus impacting the summertime circulation pattern over the HRV.

Fig. 4 Sea surface temperature anomalies (SSTA; shaded) regressed upon the interannual variation of heavy rainfall frequency over the HRV: a JFM, b FMA, c MAM, d AMJ, e MJJ, and f JJA. Regression coefficients significant at the α = 0.05 level are stippled

Compared to the SSTAs over the Indian Ocean and the equatorial western Pacific, the tropical Atlantic SSTA is less persistent. In JFM, the warm SSTAs span the entire tropical Atlantic and the Intra-Americas Seas (Fig. 4a). In the following months, the areas with significant SSTAs gradually shrink and the SSTAs in the southern portion of the tropical oceans completely decay in JJA months (Fig. 4d–f). In the north Tropical Atlantic, although statistically significant, the SSTAs become weaker in summer months (Fig. 4f). The gradually decaying feature of SSTAs in the tropical Atlantic may be related to the negative feedback between tropical oceans and the seasonal evolution of atmospheric circulation, including North Atlantic Oscillation (Huang and Shukla 2005). Thus, tropical Atlantic SSTA might influence summertime atmospheric circulation through the air–sea interaction over the Atlantic and modulate heavy rainfall events over the HRV (Sun and Wang 2012).

In conclusion, the analysis of HRV heavy rainfall frequency and global SSTA identified a significant relationship between summertime HRV heavy rainfall events and wintertime tropical SSTAs. It is noteworthy that this relationship should not be interpreted as a coincidence between the SST warming trend and the wetting trend over the HRV. Regression analysis using detrended SSTA records shows similar results, except for minor changes in the regression coefficients (not shown). These tropical SSTAs occur 5 months prior to the summer season, providing potential sources of predictability for the HRV heavy rainfall frequency.

4.2 Contribution of tropical SSTA to the typical circulation pattern associated with HRV heavy rainfall

The analysis of HRV heavy rainfall and global SSTA identifies a positive relationship between tropical SSTAs and the HRV heavy rainfall frequency. Usually, the SSTA influences heavy rainfall events by changing the large-scale atmospheric circulation that provides background conditions for the development of rain-bearing systems. Thus, exploring the contribution of tropical SSTAs to the large-scale circulation patterns favorable for the HRV heavy rainfall events can improve the mechanistic understanding of the relationship between tropical SSTAs and the HRV heavy rainfall events.

The typical circulation pattern associated with the HRV heavy rainfall is analyzed based on the posterior predictive. Specifically, using the post burn-in parameter sets (i.e. the 800 MCMC samples), the probability of individual rainfall events falling into different rainfall clusters can be quantified (Eq. 4). By comparing the calculated probability, we can objectively categorize each rainfall event and identify all heavy rainfall events during 1961–2012. Thus, the atmospheric circulation patterns associated with the heavy rainfall events can be obtained accordingly.
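The categorization step can be sketched as follows: Eq. (4) is evaluated for every event under each posterior draw, the membership probabilities are averaged over the draws, and each event is assigned to the most probable cluster. The function name, the array shapes, and the constant illustrative draws are assumptions; in practice the draws would come from the MCMC output for the summer in which the event occurred.

```python
import numpy as np

def cluster_probabilities(y, pi_draws, mu_draws, phi_draws):
    """Average Eq. (4) over posterior draws.

    y: (events,) daily rainfall; *_draws: (draws, 3) posterior samples.
    Returns an (events, 3) array of membership probabilities.
    """
    y = np.asarray(y, dtype=float)[:, None, None]                  # (events, 1, 1)
    log_w = (np.log(pi_draws) + 0.5 * np.log(phi_draws)
             - 0.5 * phi_draws * (y - mu_draws) ** 2)               # (events, draws, 3)
    w = np.exp(log_w - log_w.max(axis=2, keepdims=True))
    w /= w.sum(axis=2, keepdims=True)
    return w.mean(axis=1)

# Hypothetical posterior draws (held constant here purely for illustration).
n_draws = 800
pi_draws = np.tile([0.5, 0.35, 0.15], (n_draws, 1))
mu_draws = np.tile([1.0, 6.0, 16.0], (n_draws, 1))
phi_draws = np.tile([1.0, 0.1, 0.02], (n_draws, 1))

events = np.array([0.5, 7.0, 25.0])                 # mm/day
prob = cluster_probabilities(events, pi_draws, mu_draws, phi_draws)
labels = prob.argmax(axis=1)                         # 0 = light, 1 = moderate, 2 = heavy
heavy_days = labels == 2
```

The dates flagged by `heavy_days` would then be used to composite the circulation fields.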

Figure 5 shows the circulation maps at 850, 500, and 200 hPa composited on the objectively identified heavy rainfall events. The climatology of the circulation has been removed from each composite map. According to Fig. 5, heavy rainfall events are associated with abnormally strong southerly wind over Eastern China at the 850 hPa level (Fig. 5a). The significantly intensified southerly wind indicates a strong East Asia summer monsoon, which favors the onset of heavy rainfall events by increasing the moisture supply from the South China Sea (SCS) (Zhou and Yu 2005) and along the western edge of the western Pacific subtropical high (Zhang 2001).

Fig. 5 Typical atmospheric circulation (vectors, m s−1) patterns (represented as wind anomalies) composited on heavy rainfall events as identified using posterior predictives (Eqs. 4–6): a 850 hPa, b 500 hPa, and c 200 hPa. The bold red vectors represent wind anomalies significant at the 95 % confidence level

At 500 hPa, an anomalous anticyclone occurs with its center located off the eastern coast (Fig. 5b). The anticyclone is associated with an intensification and southwestward movement of the northwestern Pacific subtropical high. Accompanying this anticyclone, a cyclone is generated inland, presenting a meridionally oriented circulation pattern (Fig. 5b). Such a circulation pattern is consistent with previous studies (Bao 2008).

This meridionally oriented circulation is also observed in the upper troposphere (200 hPa), although the upper tropospheric circulation pattern becomes less significant and shifts northward by about 5 degrees (Fig. 5c). The northward tilt of the circulation pattern in the vertical direction indicates an anomalously strong EASM, favoring the northward migration of monsoon rain-belt to the HRV region.

These typical circulation patterns tend to become more frequent when the tropical oceans are warmer, according to the composite analysis of JJA circulation upon the JFM SSTAs over the north Indian Ocean, the equatorial western Pacific, and the tropical Atlantic (Fig. 6). Warm years are defined as those in which the SSTAs exceed 0.5 standard deviation (STD), and cold years as those in which the SSTAs are below −0.5 STD. Here, a 0.5 STD criterion is used instead of 1 STD in order to increase the size of the composite samples.
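The year selection and the warm-minus-cold composite can be sketched as follows. The sketch assumes one JFM SSTA index per region and requires all three regions to pass the threshold, following the caption of Fig. 6; the function names and the small synthetic arrays are hypothetical.

```python
import numpy as np

def warm_cold_years(indices, threshold=0.5):
    """Select warm and cold years from (years, regions) SSTA indices.

    A year is warm (cold) when the standardized index of every region is
    above +threshold (below -threshold).
    """
    z = (indices - indices.mean(axis=0)) / indices.std(axis=0)
    warm = np.all(z > threshold, axis=1)
    cold = np.all(z < -threshold, axis=1)
    return warm, cold

def composite_difference(field, warm, cold):
    """Warm-minus-cold mean of a field whose first axis is the year."""
    return field[warm].mean(axis=0) - field[cold].mean(axis=0)

# Tiny synthetic example: six years, three regions.
indices = np.array([[2.0, 2.0, 2.0],
                    [-2.0, -2.0, -2.0],
                    [0.0, 0.0, 0.0],
                    [2.0, -2.0, 0.0],
                    [0.1, 0.0, -0.1],
                    [-0.1, 0.0, 0.1]])
wind = np.arange(6.0)[:, None] * np.ones((6, 2))     # stand-in for a JJA wind field
warm, cold = warm_cold_years(indices)
difference = composite_difference(wind, warm, cold)
```

Lowering the threshold from 1 to 0.5 STD enlarges both masks, which is the sample-size argument made above.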

Fig. 6 Difference fields of JJA moisture flux (vectors, unit: kg m−1 s−1) and moisture flux divergence (shaded, unit: mm day−1) between warm and cold tropical SSTAs (a); b–d difference fields of 850 hPa, 500 hPa, and 200 hPa wind (blue vectors; unit: m s−1), respectively. The years with JFM SSTA over all three regions above 0.5 (below −0.5) standard deviation are selected to denote the warm (cold) events

According to the composite analysis, SSTAs over the tropical oceans collectively contribute to the atmospheric circulation patterns favorable for the HRV heavy rainfall events (Figs. 5, 6). Specifically, warmer tropical oceans intensify northward moisture transport into Eastern China, which enhances the moisture supply from the SCS and along the western ridge of the western Pacific subtropical high (Zhang 2001; Zhou and Yu 2005). The anomalous moisture flux converges in the HRV, facilitating the heavy rainfall events there (Fig. 6a). The intensification of northward moisture flux is associated with an increase in 850 hPa southerly wind (Fig. 6b), a key circulation feature that sustains HRV heavy rainfall events (Fig. 5a). Furthermore, the southerly wind can enhance the upward motion as indicated by Sverdrup vorticity balance \( \left( {\beta v \propto - f\frac{\partial \omega }{\partial z}} \right) \) along the subtropics (Liu et al. 2004, 2007).

Intensification of the southerly wind along with the anomalous easterly wind over the SCS (Fig. 6b) might result from the Gill-type response of atmospheric circulation to the warming over the north Indian Ocean (Gill 1980). The warming of the north Indian Ocean generates equatorial Kelvin waves, which enhance tropical easterlies and induce southerly wind to the east of the SSTA center (Gill 1980; Kosaka et al. 2013). The circulation pattern shown in Fig. 6b is consistent with the numerical simulation of the tropical Indian Ocean's impact on the HRV summer precipitation (Hong and Liu 2012). Furthermore, the warming over the equatorial western Pacific can reinforce the southerly wind through a meridionally oriented tropical–extratropical teleconnection pattern. The warming over the equatorial western Pacific induces a cyclone over the local SSTA center and an anticyclone off the eastern coast of China (Huang and Sun 1992; Ji et al. 2014; Kosaka and Nakamura 2010). The anticyclone can enhance the southerly wind over East China, favoring the HRV heavy rainfall events. In addition, the tropical Atlantic also contributes to the above-mentioned atmospheric circulation, likely through wave propagation (Zuo et al. 2013).

At 500 hPa, the warming over the tropical oceans is accompanied by the intensification of the Northwest Pacific subtropical high and the westward extension of its western ridge (Fig. 6c), consistent with previous studies (Li et al. 2012; Zhou et al. 2009). The extension of the western ridge is a typical feature of the intensified EASM (Chang et al. 2000), which provides a favorable circulation pattern for the HRV heavy rainfall events (Fig. 5b). In the upper troposphere (200 hPa), the circulation composite on tropical SSTAs also resembles the typical patterns during the HRV heavy rainfall events (Figs. 5c, 6d).

Overall, the above analysis suggests a collective impact of tropical SSTAs on HRV heavy rainfall events, likely exerted through the oceanic influence on large-scale circulation. Warmer tropical oceans can enhance the southerly wind and supply abundant moisture to the HRV, which favors the generation and maintenance of heavy rainfall events.

5 Predictability of HRV heavy rainfall frequency

The analysis of HRV heavy rainfall, atmospheric circulation, and SSTA patterns demonstrates a significant relationship between heavy rainfall frequency and tropical SSTA in preceding months (Fig. 4a–e). The tropical SSTAs occur five months before the HRV rainy season (Fig. 4f), providing potential sources of predictability for heavy rainfall events over this region.

Using these tropical SSTAs as predictors, statistical prediction models are constructed to predict HRV heavy rainfall events. In this study, a multiple linear regression model and a support vector machine (SVM) algorithm are applied to assess whether the choice of prediction model contributes to the prediction skill for HRV heavy rainfall.

The multiple linear regression aims to model the relationship between two or more predictors and one response variable by fitting a linear equation to observed data (Christensen 2011). In this study, the regression model is formulated as:

$$ y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \varepsilon , $$
(7)

where \( y \) is the observed heavy rainfall frequency over the HRV; \( x_{1} \), \( x_{2} \) and \( x_{3} \) denote the SSTAs over the north Indian Ocean, equatorial western Pacific, and the tropical Atlantic, respectively; and \( \varepsilon \) is the residual of the prediction model. \( \beta_{0} \), \( \beta_{1} \), \( \beta_{2} \) and \( \beta_{3} \) are derived so that the sum of squared errors between the observations and the model predictions is minimized.
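A least-squares fit of Eq. (7) can be sketched with NumPy as follows. The function names and the synthetic predictors are assumptions for illustration; the coefficients used to generate the data are arbitrary and unrelated to the values obtained in the study.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares for Eq. (7); returns (beta_0, beta_1, beta_2, beta_3)."""
    design = np.column_stack([np.ones(len(y)), X])    # leading column of ones for beta_0
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict_linear(X, beta):
    """Evaluate the fitted linear model at the rows of X."""
    return beta[0] + X @ beta[1:]

def r_squared(y, y_hat):
    """Fraction of variance in y explained by y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic example: 52 years, three SSTA predictors (hypothetical values).
rng = np.random.default_rng(4)
X = rng.normal(0.0, 0.3, size=(52, 3))
y = 0.13 + X @ np.array([0.02, 0.03, 0.01]) + rng.normal(0.0, 0.02, size=52)
beta = fit_linear(X, y)
skill = r_squared(y, predict_linear(X, beta))
```

Note that an \( R^{2} \) computed on the training data, as here, is an in-sample measure and tends to overstate out-of-sample prediction skill.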

Consistent with the exploratory data analysis, multiple linear regression using tropical SSTAs shows some skill in predicting heavy rainfall frequency over the HRV (Fig. 7). The \( R^{2} \) between prediction and observation is 0.20, significant at the \( \alpha = 0.01 \) level. However, the model (Eq. 7) underestimates the heavy rainfall frequency at the distribution tails (Fig. 7). Thus, it is likely that nonlinearity between the predictors and the response variable should be considered in order to achieve better prediction skill (Fig. 7).

Fig. 7 Heavy rainfall frequency over the HRV as predicted by the multiple linear regression model (blue dots) and the support vector machine (SVM) algorithm (gray dots). The straight lines are the least-squares best-fit lines between the observations and the model predictions

The potential nonlinearity between the predictors and the response variable is considered by constructing an SVM regression model (Burges 1998). The SVM algorithm uses a nonlinear kernel function and is formulated as:

$$ \hat{y} = f\left( {x_{1} ,x_{2} ,x_{3} } \right) = w^{T} \varPhi \left( {x_{1} ,x_{2} ,x_{3} } \right) + b , $$
(8)

where \( \left( {x_{1} ,x_{2} ,x_{3} } \right) \) is the predictor vector (i.e. tropical SSTAs) and \( \hat{y} \) is the SVM model output (i.e. HRV heavy rainfall frequency). \( \varPhi \left( {x_{1} ,x_{2} ,x_{3} } \right) \) is the nonlinear mapping associated with the kernel function, and \( w \) is the weight vector normal to the hyperplane in the feature space defined by \( \varPhi \left( {x_{1} ,x_{2} ,x_{3} } \right) \). In this study, we apply a least squares SVM, meaning that \( w \) and \( b \) in Eq. (8) are chosen to minimize the cost function (see details in Appendix 2).
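To make the least squares SVM step concrete, the sketch below implements the standard formulation in Python with NumPy. It assumes a cost function of the form \( \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i} e_{i}^{2} \), with \( e_{i} \) the fitting error for sample \( i \), whose dual reduces to a single linear system. The Gaussian (RBF) kernel, the values of `gamma` and `sigma`, the function names, and the synthetic data are all assumptions made for illustration; the cost function actually used in this study is given in Appendix 2 and may differ in detail.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def fit_lssvm(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM dual system for the bias b and dual weights alpha.

    [ 0   1^T         ] [ b     ]   [ 0 ]
    [ 1   K + I/gamma ] [ alpha ] = [ y ]
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate([[0.0], y])
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]


def predict_lssvm(X_train, b, alpha, X_new, sigma=1.0):
    """Evaluate y_hat(x) = sum_i alpha_i K(x, x_i) + b, the kernel form of Eq. (8)."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b


# Synthetic stand-in data (assumption): a nonlinear response to three predictors.
rng = np.random.default_rng(0)
ssta = rng.normal(size=(60, 3))
freq = 5.0 + np.sin(ssta[:, 0]) + ssta[:, 1] ** 2 + rng.normal(scale=0.1, size=60)

b, alpha = fit_lssvm(ssta, freq)
freq_hat = predict_lssvm(ssta, b, alpha, ssta)
```

In this dual form, \( w \) never needs to be computed explicitly: the prediction depends on the training data only through kernel evaluations. In practice, `gamma` and `sigma` would be chosen by cross-validation rather than fixed as here.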

Taking into account the nonlinear relationship between predictors and responses, the SVM regression substantially improves the prediction of heavy rainfall frequency over the HRV. The \( R^{2} \) reaches 0.47, indicating that the SVM increases the explained variance in comparison with the multiple linear regression model (Fig. 7). More importantly, the slope of the fitted line is closer to that of the y = x line than for the linear model (Fig. 7). The results support our hypothesis and suggest that nonlinearity should be considered when constructing prediction models for HRV heavy rainfall using preceding tropical SSTA.
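The two diagnostics used in this comparison, the \( R^{2} \) and the slope of the fitted line between observation and prediction, can be computed as in the following Python sketch. It assumes that \( R^{2} \) is the squared Pearson correlation between observed and predicted frequency and that the fitted line regresses the prediction on the observation; the function name `skill_summary` and the example numbers are hypothetical.

```python
import numpy as np


def skill_summary(obs, pred):
    """Return the squared correlation and the least-squares line pred = slope * obs + intercept."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    r = np.corrcoef(obs, pred)[0, 1]
    slope, intercept = np.polyfit(obs, pred, deg=1)
    return {"r2": r**2, "slope": slope, "intercept": intercept}


# Hypothetical example: a prediction that is perfectly correlated with the
# observations but damped toward the mean.
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = 0.5 * obs + 1.0
summary = skill_summary(obs, pred)
```

A slope below 1 under this convention means that predicted values vary less than the observed ones, so the largest observed frequencies are underestimated even when the correlation is high. This is why the slope is examined in addition to \( R^{2} \).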

The nonlinearity probably arises from the nonlinear relationship between atmospheric moisture content and air temperature (Chou et al. 2009; Held and Soden 2006), as well as the potential positive feedback between heavy precipitation and vertical motion (Chou et al. 2012). These nonlinear processes are missing in linear models, but can be captured by the SVM algorithm built on nonlinear kernel functions. However, it is noteworthy that an explicit expression for the SVM regression is virtually impossible to obtain, which makes the method unsuitable for identifying the mechanisms responsible for the nonlinearity. To achieve this mechanistic understanding, numerical simulations are needed, which is beyond the scope of this study.

6 Conclusions

Summertime heavy rainfall events over the HRV in China are vitally important to the regional and national agriculture, economy and social development. However, accurately predicting HRV heavy rainfall remains challenging due to its complicated statistical behavior and the poorly understood physical processes governing heavy rainfall.

This study advances the understanding of HRV heavy rainfall events and improves their prediction skill by applying a novel rainfall framework built on a three-cluster Normal mixture model (Li and Li 2013). The three clusters reflect the probability behavior of light, moderate and heavy rainfall. Bayesian inference with a Gibbs sampler, a Markov chain Monte Carlo algorithm, is applied to sample the distribution parameters of the model. Compared to traditionally used distribution models, the new framework improves the statistical inference on the HRV summer rainfall.

The analysis shows that the heavy rainfall cluster contributes the largest amount of variance (58%) to summer precipitation over the HRV, almost three times higher than that of the light and moderate rainfall clusters. Furthermore, the contribution of heavy rainfall is manifested mainly by the interannual variation of heavy rainfall frequency, whereas the variability of heavy rainfall intensity is secondary.

The HRV heavy rainfall frequency is most influenced by SSTA patterns over the north Indian Ocean, equatorial western Pacific, and the tropical Atlantic that collectively modulate summertime atmospheric circulation. When the tropical oceans are warmer than normal, anomalously strong southerly winds supply more moisture to the HRV, which favors the onset and maintenance of heavy rainfall events over the region.

The SSTA signals appear five months prior to the summer season, providing potential sources of prediction skill for heavy rainfall frequency over the HRV. Two statistical prediction models are thus constructed and tested: a multiple linear regression model and the SVM algorithm. Both prediction models show skill in predicting the frequency of HRV heavy rainfall events (\( R^{2} \) of 0.20 and 0.47, respectively). Comparatively, the SVM algorithm further improves the predictions owing to its capability to capture the nonlinear relationship between SSTA and rainfall over the HRV. Thus, our study suggests that the application of the new rainfall framework and the SVM algorithm has the potential to improve seasonal prediction of heavy rainfall frequency over the HRV region. The results obtained in this study have important implications for improving the nation’s disaster early warning system, which can help reduce the economic and agricultural losses caused by heavy rainfall and related natural hazards.