1 Introduction

In the context of global warming, extratropical land, which is densely populated, has experienced extremely high surface air temperature (SAT) more frequently during boreal summer (Meehl and Tebaldi 2004). For instance, an intense heat wave hit the mid-west United States in July 1995, leading to more than 700 deaths in just 5 days in Chicago (Changnon et al. 1996; Kunkel et al. 1996); Europe suffered its hottest summer of the last 500 years in 2003, with a heat-related death toll exceeding 70,000 (Robine et al. 2008); a record-breaking hot summer choked western Russia in 2010, leading to around 55,000 fatalities (Barriopedro et al. 2011; Hoag 2014); in summer 2013, southern China experienced its worst heat wave on record for the past 113 years (Xia et al. 2016); and in Hawaii, which is usually controlled by breezy trade winds, the summer of 2015 was the hottest since records began in 1948, with 98 exceptional heat-wave days occurring during the local Hawaiian summer (July, August, September and October) (Zhu and Li 2017c). Moreover, the areas affected by heat waves are projected to increase persistently in the coming decades, at a rate dependent on the emissions scenario (Battisti and Naylor 2009; Rahmstorf and Coumou 2011; Coumou and Robinson 2013; Fischer and Knutti 2013; Russo et al. 2014, 2015). Regardless of when and where they occur, extremely hot summers and heat waves exert notable influences on human mortality, regional economies, natural ecosystems and many other aspects (Schar and Jendritzky 2004; Meehl and Tebaldi 2004), and are therefore important phenomena to understand and forecast.

Whilst numerical models are unlikely ever to simulate the global climate perfectly, continuous efforts have been made towards producing statistical predictions of monsoons, SAT and their extremes over different regions (Barnett et al. 1984; Cohen and Fletcher 2007; Lee et al. 2013a; Li and Wang 2016; Li et al. 2016; Wang et al. 2015a, b). However, compared with seasonal prediction, extended-range (5–30-day lead) statistical forecasting of SAT and heat waves has received less attention. Forecasting at this time scale is extremely useful in areas such as agricultural planning, power management and activity scheduling. As such, research on the extended-range forecasting of SAT and heat waves is of considerable importance.

Statistical models for the extended-range forecasting of intraseasonal oscillation (ISO) have achieved rapid development in the past decade (Jiang et al. 2008; Kang and Kim 2010; Roundy 2012; Cavanaugh et al. 2014; Lee et al. 2013b, 2015; Lee and Wang 2016; Zhu and Li 2017d). Based on the spatial and temporal information of the “nonconventional filtering” coupled predictor–predictand fields (Zhu et al. 2015), spatial–temporal projection models (STPMs) have been constructed to realize the real-time extended-range forecasting of different meteorological variables. For example, tropical convection patterns associated with the Madden–Julian oscillation (MJO) were reproduced well in a study by Zhu et al. (2015), 20–30 days in advance; forecasting of Chinese subseasonal summer rainfall anomalies is practically useful at a 20-day lead time over most parts of China (Zhu and Li 2017a); the onset date of the South China Sea monsoon can be captured (Zhu and Li 2017b); the occurrence of western North Pacific clustering tropical cyclogenesis has been reproduced reasonably well at a 15-day lead time (Zhu et al. 2016).

Chinese summer SAT presents considerable intraseasonal variability. As shown in Fig. 1a, the intraseasonal (10–80-day) SAT accounts for more than 80% of the total pentad-mean SAT variance over most parts of China, and the areal-mean fractional variance over the whole of China is 84%. There are two dominant modes of ISO over China. One is the Quasi-Biweekly Oscillation (QBWO) mode, with a period of around 10–30 days (Yang et al. 2010), which explains around 60% of the total variance over most of China (Fig. 1b), with an areal-mean fractional variance of 58%. The other is the MJO mode, with a period of 30–80 days, which accounts for more than 36% of the total variance over southern China (Fig. 1c). Since these two dominant ISO modes possess different spatiotemporal characteristics, they may have different predictability sources. Bearing that in mind, the present study aims to construct STPMs for the two SAT modes separately, using different projection domains of predictors. Specifically, we seek to answer the following two questions: By splitting the 10–80 day SAT into 30–80 and 10–30 day SAT, can separate STPMs for the MJO and QBWO modes achieve better forecasting skill than a single STPM for the 10–80 day SAT? And how predictable are Chinese heat waves at 5–30-day lead times? To this end, several STPMs are constructed to assess the forecasting skills for Chinese SAT and heat waves 5–30 days in advance.

Fig. 1 The fractional variance of (a) 10–80 day, (b) 10–30 day and (c) 30–80 day filtered SAT variability relative to the pentad-mean total SAT variability. The green contour line is the areal-mean fractional variance for each band of SAT variability

Following this introduction, Sect. 2 introduces the data, method and the configuration of the STPMs. In Sect. 3, the predictability sources are detected for the 30–80 and 10–30 day modes of SAT, and the projection domains for STPMs are determined. The forecasting skills for SAT and heat waves are assessed in Sect. 4. Section 5 provides the conclusion and discussion of the present study.

2 Data, method and model configuration

2.1 Dataset

A homogenized daily maximum air temperature dataset derived from 753 gauge stations over mainland China, run by the China Meteorological Administration, is employed. The atmospheric circulation variables are derived from the National Centers for Environmental Prediction–National Center for Atmospheric Research (NCEP–NCAR) reanalysis (Kalnay et al. 1996), with an original horizontal resolution of 2.5° × 2.5° (longitude × latitude). In order to focus on large-scale circulation, the atmospheric circulation datasets are first interpolated to a 5° × 5° resolution. All the datasets are converted to pentad data by taking 5-day means, except for the 12th pentad of leap years, to which a 6-day mean is applied. Therefore, there are always 73 pentads in each year. All datasets span from 1960 to 2013. In the present study, we focus on the extended boreal summer, which includes 6 months (May, June, July, August, September and October), from pentad 25 to pentad 61.
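The pentad averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `pentad_means` and `extended_summer` are hypothetical, the input is assumed to be one complete calendar year of daily values for a single station or grid point, and missing data are not handled.

```python
import calendar


def pentad_means(daily, year):
    """Average one calendar year of daily values into 73 pentad means.

    Pentad 12 (25 February to 1 March) absorbs 29 February in leap years,
    so it is a 6-day mean in leap years and a 5-day mean otherwise.
    """
    n_days = 366 if calendar.isleap(year) else 365
    if len(daily) != n_days:
        raise ValueError(f"expected {n_days} daily values for {year}")
    means = []
    start = 0
    for pentad in range(1, 74):
        length = 6 if (pentad == 12 and n_days == 366) else 5
        window = daily[start:start + length]
        means.append(sum(window) / length)
        start += length
    return means


def extended_summer(pentads, first=25, last=61):
    """Select pentads 25 to 61 (1-based, inclusive) from a 73-pentad year."""
    return pentads[first - 1:last]
```

With this convention every year yields 73 values, and the extended summer selection always contains 37 pentads.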

2.2 ISO signal extraction

To avoid the tapering problem, an “unconventional filtering” method is applied to extract the ISO signal. This method is reliable in extracting different bands of the ISO signal (Hsu et al. 2015). Taking the extraction of the 30–80 day ISO signal as an example, three steps are followed: firstly, the climatological annual cycle is removed from the original pentad-mean data by subtracting a climatological 18-pentad (90-day) low-pass filtered component; secondly, the running mean of the last 8 pentads [i.e., from day (−40) to day (0)] is removed, so that the low-frequency signal with periods longer than 16 pentads (80 days) is eliminated; finally, a running mean over the last 3 pentads [i.e., from day (−15) to day (0)] is applied, so that the high-frequency signal (shorter than 30 days, i.e., synoptic-scale perturbations and the QBWO) is eliminated. In this way, the 30–80 day ISO signal is extracted. Note that this “unconventional filtering” method may create a 1–2 pentad delay in intraseasonal peaks and troughs (figures not shown) compared with the original intraseasonal signal, but this phase delay does not affect the forecasting skills and can be resolved by a post-hoc correction.
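A minimal sketch of the second and third filtering steps for the 30–80 day band is given below. It assumes the climatological annual cycle has already been removed (step one), uses only past and present values so that it can run in real time, and marks positions without enough history as `None`. The function names are hypothetical, and the window lengths for other bands are not given in the text, so only the 8-pentad and 3-pentad windows are shown.

```python
def trailing_mean(series, window):
    """Mean of the current value and the previous (window - 1) values.

    Positions without a full window of valid past data are returned as None.
    """
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i + 1 - window):i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out


def extract_30_80_day(anomaly):
    """Apply steps two and three to a pentad anomaly series.

    anomaly: pentad values with the climatological annual cycle removed.
    """
    # Step two: subtract the mean of the last 8 pentads (40 days) to damp
    # variability slower than about 16 pentads (80 days).
    slow = trailing_mean(anomaly, 8)
    high_passed = [None if s is None else a - s for a, s in zip(anomaly, slow)]
    # Step three: average the last 3 pentads (15 days) to damp variability
    # faster than about 30 days.
    return trailing_mean(high_passed, 3)
```

Because both windows look backwards only, the first nine pentads of any input series have no filtered value, and no future data are needed at forecast time.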

2.3 Model configuration

To design appropriate STPMs, we first unravel the spatiotemporal characteristics of the Chinese intraseasonal SAT. Figure 2 shows the empirical orthogonal function (EOF) analysis for the 30–80 and 10–30 day summer SAT, separately. As shown in Fig. 2a, the first three leading modes for 30–80 day SAT present a monopole, dipole and tripole spatial pattern, respectively. The monopole pattern of the first EOF has maximum loadings over central and southern China. The Tibetan Plateau, northwestern China and northeastern China present relatively smaller loadings, which may relate to their high altitude compared with elsewhere. The first EOF explains more than 30% of the total 30–80 day SAT variance. The second EOF shows a meridional dipole pattern, with positive loadings over southern/southwestern China and negative loadings over northern China. The boundary between positive and negative loadings, at around 32°N, is consistent with the Huai River–Qin Mountains line, which geographically divides China into northern and southern parts. The tripole pattern in the third EOF shows negative loadings in central and western China but positive loadings over southern and northeastern China. The second and third modes explain 14% and 12% of the total 30–80 day SAT variance, respectively. As indicated in Fig. 2b, the first three leading modes (accounting for 56% of the total 30–80 day SAT variance) are highly independent of the other modes (North et al. 1982).

Fig. 2 (a) The first three leading EOF modes of 30–80 day SAT and (b) the fractional variance (red line; units: %) explained by the first ten modes and their errors (blue bars). (c, d) As in (a, b) but for 10–30 day SAT

Although the frequency is quite different, it is interesting to note that the first three leading EOF patterns of the 10–30 day SAT (Fig. 2c) are quite similar to those of the 30–80 day SAT. Meanwhile, the first three leading modes of 10–30 day SAT are also highly independent, explaining about 60% of the total 10–30 day SAT variance (Fig. 2d). Note that the first two EOF modes may still have some physical association (Roundy 2014) even though they are statistically independent based on the criterion of North et al. (1982), because their principal components (PC1 and PC2) are significantly lag-correlated, with 1-pentad lag correlation coefficients of 0.36 and 0.22 for 30–80 and 10–30 day SAT, respectively. Nevertheless, the first three modes for both 30–80 and 10–30 day SAT are significantly separated from the higher modes.

The three leading modes for 30–80 and 10–30 day SAT already account for 56% and 60% of the total SAT variance, respectively, and the higher modes can be considered as noise. Therefore, the first priority is to reproduce the first three leading PCs. Note that we can filter out the “higher-mode noise” by reconstructing the temporally varying SAT patterns as the summation of the products of the three leading EOF patterns and their corresponding PCs. This reconstructed SAT is referred to as the EOF-filtered SAT, and it is regarded as the predictable part of the total SAT. This idea is somewhat analogous to the “Predictable Mode Analysis” (Wang et al. 2015a, b; Li and Wang 2016). The STPM aims to directly forecast each of the first three PCs, and then the succeeding 5–30-day SAT patterns can be reconstructed as the summation of the products of the forecasted PCs and the corresponding observed EOF patterns. In the present study, the data from 1960 to 1999 (40 years) are used for training, and the data from 2000 to 2013 (14 years) are used to assess the model performance for the independent forecast.
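The EOF-filtered reconstruction can be illustrated with a toy example. The sketch below assumes the retained EOF patterns are already available and orthonormal (it does not compute them), and the names `project` and `eof_filter` are hypothetical. The four-station patterns are invented purely for illustration and are not the patterns in Fig. 2.

```python
def project(field, pattern):
    """Principal component at one time: inner product of field and pattern."""
    return sum(f * p for f, p in zip(field, pattern))


def eof_filter(field, patterns):
    """Rebuild a field from its projections onto the retained EOF patterns.

    field:    anomaly values at each station for one pentad
    patterns: retained EOF patterns, assumed orthonormal
    Returns (pcs, rebuilt), where rebuilt is the sum of PC_k * EOF_k.
    """
    pcs = [project(field, p) for p in patterns]
    rebuilt = [
        sum(pc * p[s] for pc, p in zip(pcs, patterns))
        for s in range(len(field))
    ]
    return pcs, rebuilt


# Toy patterns on four "stations": monopole, dipole and a third pattern.
toy_patterns = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
toy_pcs, toy_filtered = eof_filter([4.0, 0.0, 2.0, 0.0], toy_patterns)
```

Whatever part of the field lies outside the retained patterns is discarded, which is the sense in which the higher modes are treated as noise. Replacing the observed PCs with forecasted PCs in the same sum gives the forecasted SAT pattern.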

Because the STPMs are based on the extended singular value decomposition (ESVD) method (Zhu et al. 2015), they can capture the temporally varying coupled modes of the preceding predictor and following predictand (Bretherton et al. 1992). Six pentads of predictors and predictands are concatenated in the ESVD to construct the STPMs for forecasts at 5–30-day (5-, 10-, 15-, 20-, 25- and 30-day, respectively) lead times. For instance, the predictors at pentads 25–30, pentads 26–31, … and pentads 50–55 are applied to forecast the predictand at pentads 31–36, pentads 32–37, … and pentads 56–61. During the 40-year training period of 1960–1999, the large-scale predictor (and predictand) involves a data matrix X (Y) with t points in time and \(i_1 \times j_1 \times 6\) (1 × 6) points in space, where 6 is the number of preceding (succeeding) pentads corresponding to the forecast time point [as denoted by Eqs. (1) and (2)]:

$$X(t,{i_1} \times {j_1} \times 6) \approx \sum\limits_{k=1}^K {{V_k}({i_1} \times {j_1} \times 6)} {v_k}(t);$$
(1)
$$Y(t,1 \times 6) \approx \sum\limits_{k=1}^K {{U_k}(1 \times 6)} {u_k}(t).$$
(2)

The parameter K, which is equal to 6 in the present study, is the number of coupled modes derived during the training period. Only the “persistent” modes are retained through a cross-validation (leave-1-year-out) procedure, which helps to avoid overfitting. The cross-validation procedure involves the following three steps: (1) leave 1 year’s data out (i.e., 26 pentads) and conduct the ESVD analysis using the remaining 39 years’ training data; (2) project the left-out year’s predictor and predictand fields onto the corresponding predictor–predictand ESVD modes derived from the remaining years, yielding a pair of projected expansion coefficients (26-pentad time series) for each of the K ESVD modes; (3) repeat steps (1) and (2) for each of the other 39 years, so that every year is left out once, combine the projected expansion coefficients for the total 40 years (26 × 40 pentads) for each coupled mode, and retain the m coupled “persistent” modes whose projected expansion coefficients are significantly correlated (passing the 99% confidence level).

The expansion coefficient (\(v_m\)) is reproduced by projecting the temporally varying predictor field (X) onto the retained persistent predictor modes (\(V_m\)), as denoted by Eq. (3). Because of the high correlation between the two expansion coefficients, the independent forecast (from time point \(t_f\)) of the predictand (Y) can be made as the summation of the products of the reproduced expansion coefficients of the predictors (\(v_m\)) and the corresponding retained ESVD patterns of the predictand (\(U_m\)), as indicated by Eq. (4).

$${v_m}({t_f})=\sum\limits_{k=1}^{{i_1} \times {j_1} \times 6} {X({t_f},{i_1} \times {j_1} \times 6)} \times {V_m}({i_1} \times {j_1} \times 6);$$
(3)
$$Y({t_f},1 \times 6) \approx \sum\limits_{m=1}^M {{U_m}(1 \times 6)} {v_m}({t_f}).$$
(4)
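The forecast step in Eqs. (3) and (4) amounts to a projection followed by a weighted sum, as the following sketch illustrates. It assumes the retained predictor modes \(V_m\) and predictand modes \(U_m\) have already been obtained from the training period, with each predictor field flattened to one vector of length \(i_1 \times j_1 \times 6\). The function names are hypothetical, and Eq. (4) is implemented as written, without any additional scaling between the two sets of expansion coefficients.

```python
def expansion_coefficient(x_tf, v_mode):
    """Eq. (3): project the flattened predictor field onto one predictor mode."""
    return sum(x * v for x, v in zip(x_tf, v_mode))


def forecast_predictand(x_tf, predictor_modes, predictand_modes):
    """Eq. (4): sum over retained modes of U_m times v_m(t_f).

    x_tf:             flattened predictor field at forecast time t_f
    predictor_modes:  list of retained V_m, each the same length as x_tf
    predictand_modes: list of retained U_m, each of length 6 (one per lead)
    Returns the forecast at the six lead pentads (5 to 30 days).
    """
    n_leads = len(predictand_modes[0])
    forecast = [0.0] * n_leads
    for v_mode, u_mode in zip(predictor_modes, predictand_modes):
        coeff = expansion_coefficient(x_tf, v_mode)
        for lead in range(n_leads):
            forecast[lead] += u_mode[lead] * coeff
    return forecast
```

In the study this step is applied per predictor variable and per PC, and the resulting forecasts are then averaged into an ensemble.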

According to the thermodynamic equation, temperature change is determined by horizontal temperature advection, diabatic heating (associated with moisture) and adiabatic heating (associated with vertical motion). Therefore, 850 hPa temperature, 850 hPa horizontal temperature advection, 700 hPa moisture and 500 hPa vertical motion are used as potential predictors. Because the ISO at lower and upper levels has independent propagation properties, sea level pressure and 200 hPa geopotential height are additional potential predictors. Besides, because of the recurrent nature of the SAT mode itself, each PC also serves as a potential predictor. The ensemble forecast is the arithmetic mean of the forecast outputs from the seven predictors.

Four metrics are employed to evaluate the performance of the STPMs: the temporal correlation coefficient (TCC), root-mean-square error (RMSE) and pattern correlation coefficient (PCC) for SAT, and the hit rate for heat waves. Because intraseasonal data have reduced degrees of freedom, the effective sample size (Bretherton et al. 1999) is introduced to determine the threshold for a useful TCC. The effective sample size is \(N_e = N(1 - r_1^2)/(1 + r_1^2)\), where \(r_1\) is the 1-pentad lag autocorrelation and N is the original sample size, which is 364 [26 pentads per year over 14 years (2000–2013)] in the present study. The value of \(r_1\) is around 0.7 for the PC of 30–80 day SAT, nearly zero for the PC of 10–30 day SAT, and 0.23 for the PC of 10–80 day SAT. Therefore, their effective sample sizes are 125, 364 and 327, respectively. Based on the effective sample size, TCCs exceeding the thresholds of 0.23, 0.13 and 0.14 represent useful skill (significant at the 99% confidence level) for 30–80, 10–30 and 10–80 day SAT, respectively. Because there are 753 stations over China, a PCC above 0.11 is significant at the 99% confidence level.
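The effective sample sizes quoted above can be checked with a few lines of code. The first function implements the stated formula directly. The second is an assumption: the text does not say which significance test is used, so a two-tailed Fisher z approximation is taken here as a stand-in, and it happens to reproduce the quoted TCC thresholds to two decimal places. Both function names are hypothetical.

```python
import math
from statistics import NormalDist


def effective_sample_size(n, r1):
    """N_e = N (1 - r1^2) / (1 + r1^2), with r1 the lag-1 autocorrelation."""
    return n * (1 - r1 ** 2) / (1 + r1 ** 2)


def critical_correlation(n_eff, confidence=0.99):
    """Approximate two-tailed critical correlation for n_eff samples.

    Uses the Fisher z transform (an assumed test, not stated in the text):
    z = atanh(r) is roughly normal with standard error 1 / sqrt(n_eff - 3).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.tanh(z / math.sqrt(n_eff - 3))


for band, r1 in [("30-80 day", 0.7), ("10-30 day", 0.0), ("10-80 day", 0.23)]:
    n_eff = round(effective_sample_size(364, r1))
    print(band, n_eff, round(critical_correlation(n_eff), 2))
```

The same approximation applied to 753 independent stations gives a threshold somewhat below 0.11, so the PCC threshold in the text presumably rests on a different test or on a reduced spatial sample size; the sketch does not attempt to reproduce it.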

3 Predictability sources

Lead–lag correlation analysis between the PCs and the preceding predictor fields helps to identify the predictability sources. For PC1 of the 30–80 day SAT (Fig. S1), the predictability sources in different meteorological fields are detected over Eastern Europe/Northeast Asia at a 35-day lead time. The signals then propagate southeastwards/southwards from a lead time of 25 to 5 days, and finally influence the SAT over the whole of China, leading to the monopole pattern of the first EOF mode. Figure S2 shows the correlation map with PC2 of the 30–80 day SAT, in which the predictability sources appear with opposite signs over Eastern Europe and Northeast Asia at a 35-day lead time. Then, the predictability sources over Eastern Europe and Northeast Asia propagate southeastward and southward, respectively, from a lead time of 25 to 5 days. These two propagating signals arrive in northern and southern China together, resulting in the dipole SAT pattern of the second EOF mode. As to PC3 of the 30–80 day SAT (Fig. S3), one branch of predictability sources propagates southeastward from Europe to China, and the other propagates northwestward from the western tropical Pacific to China. These two branches of predictability sources induce the tripole pattern of Chinese 30–80 day SAT.

Figure S4 shows the predictability sources for PC1 of the 10–30 day SAT. The predictability sources of PC1 roughly appear over the Mediterranean Sea at a 35-day lead time, and propagate eastward from a 25- to a 5-day lead time to influence the SAT over China. The predictability sources of PC2 (Fig. S5) are detectable at a 35-day lead time over both the Mediterranean Sea and the western Pacific. They propagate eastward and westward, respectively, jointly affecting the Chinese SAT. Similar to PC2, PC3 also has two branches of predictability sources propagating from the west and east (Fig. S6) to China, but the signals over the Pacific are much farther to the east than those of PC2.

Based on the above correlation analyses, a simple illustration of the pathways of predictability sources for each PC is plotted in Fig. 3. For both 30–80 and 10–30 day SAT, three main propagation pathways of the predictability sources are determined. They propagate from Europe, northeastern Asia and the tropical Pacific to China, influencing the Chinese summer SAT. Based on these temporally varying predictability sources, the projection domains of the STPMs for each PC are selected, as listed in Table 1. Because PC1 and PC2 are significantly lag-correlated for both 10–30 and 30–80 day SAT, as mentioned above, it is understandable that PC1 and PC2 share some common predictability sources. For comparison, using the same method, the projection domains for each PC of 10–80 day SAT are also selected (Table 1).

Fig. 3 Illustration of the predictability sources for the PCs of (a) 30–80 day and (b) 10–30 day SAT. Arrows in different colors denote the three main pathways of predictability sources

Table 1 Projection domains for each PC of 30–80, 10–30 and 10–80 day Chinese summer SAT

Because of the stochastic nature of each individual ISO process, the propagation pathway of each particular predictability source (i.e., Rossby wave train) could be slightly different, and their actual origins/centers may be washed out in the correlation map. However, the correlation map still indicates some common features of the preceding (5–35-day lead) precursors on the way to the target region in China. Therefore, it helps to detect the predictability sources and to determine the projection domains of the STPMs. It should be noted that prediction models based on the “spatial projection method” differ from those based on the “multiple regression method”. The regression method, as used in many other studies, only picks up fixed areal-mean variables with high correlation coefficients as the predictors, whereas the projection method chooses a relatively large rectangular domain that includes both highly correlated regions and some less correlated regions. Therefore, the projection method used in the STPMs can better capture individual wave trains of the predictability source, which may differ slightly from one another, and this helps to avoid overfitting.

4 Forecasting skill

Using the defined projection domains, independent forecasts are produced for the period 2000–2013. Since the STPMs are originally set to forecast the three leading PCs, the forecasting skill for each PC is first presented. Because the 30–80, 10–30 and 10–80 day SAT are forecasted separately, their forecasting skills are also examined separately in this section.

The total SAT is reconstructed as the summation of the climatological low-frequency SAT variability (longer than 80 days) and the forecasted 10–80 day SAT. Note that the climatological low-frequency SAT variability is derived from the data during the training period, which ensures a fully independent forecast. The forecasted heat waves are then determined based on the forecasted total SAT.

4.1 Forecasting skill for SAT

Table 2 shows the TCCs and RMSEs between the 5–30 day lead forecasted and observed PCs. For 30–80 day SAT, the STPMs are able to forecast PC1 and PC2 at lead times of 5–30 days. Useful TCC skill for PC3 is up to a lead time of 20 days, but it fails to pass the 99% confidence level beyond the 20-day lead time. The 10–30 day SAT has much lower forecasting skill. Useful skills can only persist to a 15-day lead time for PC1 and PC2, and a 10-day lead time for PC3. The forecast limit for the three leading PCs of 10–80 day SAT is similar to that of 30–80 day SAT. Useful skill is up to a lead time of 30 days for PC1 and PC2, but 20 days for PC3. The poor forecasting skill for 10–30 day SAT, and the similar forecasting limit between 10–80 and 30–80 day SAT, both imply the STPMs have difficulty in reproducing the PCs of 10–30 day SAT.

Table 2 TCC and RMSE (in brackets) skill for PCs of 30–80, 10–30 and 10–80 day SAT at 5–30-day lead times, with TCCs failing at the 99% confidence level in italic

Figure 4a shows the spatial distribution of TCC skill for 30–80 day SAT forecasted at 5–30-day lead times versus the EOF-filtered 30–80 day SAT. The results indicate that useful skill occurs over most parts of China at 5–30-day lead times. The area-averaged TCC ranges from 0.64 to 0.28 for 5–30-day lead times, suggesting the STPMs are able to reproduce the predictable part of 30–80 day SAT. Figure 4b shows the TCC map between forecasted and observed 30–80 day SAT. It is clear that eastern China has higher TCC skills which persist to a 30-day lead time. Poor skill appears over western China and northeastern China after a 20-day lead time. Similar results are observed in the TCC map of forecasted 30–80 day SAT versus the observed 10–80 day SAT (Fig. 4c). The poor forecasting skills over western and northeastern China may relate to the low fractional variance of 30–80 day SAT variability over western and northeastern China, as shown in Fig. 1c. This is understandable because the running of STPMs and the detection of the predictability source are based on the variation of 30–80 day SAT. Therefore, poor forecasting skill is expected where the fractional variance of the 30–80 day SAT is relatively small.

Fig. 4 The distribution of TCC skill for the 5–30-day lead forecasted 30–80 day SAT against (a) EOF-filtered 30–80 day SAT, (b) observed 30–80 day SAT and (c) observed 10–80 day SAT during the independent forecasting period (2000–2013). The contour line is the threshold of the 99% confidence level; the areal-mean TCC skill is shown at the top center of each panel

Forecasting the Chinese summer SAT pattern is one of the motivations of this study, but to what extent is the Chinese summer SAT pattern predictable? To answer this, we further check the PCC skill during the independent period of 2000–2013 for the 5–30-day lead forecasted 30–80 day SAT against the EOF-filtered 30–80 day SAT, the observed 30–80 day SAT and the observed 10–80 day SAT, separately. At any lead time, if more than half of the forecasts during the independent period achieve a significant PCC (above 0.11), we consider the forecast at this lead time to be practically useful. Figure 5a suggests the STPMs reproduce the EOF-filtered 30–80 day SAT (i.e., its predictable part) quite well. During the whole independent forecast period of 2000–2013, for 5–25-day lead times, useful PCC skills account for more than 60% of forecasts and the averaged PCC skill is above 0.19, passing the 99% confidence level. A similar percentage of useful PCC skills can be found in the verification against the observed 30–80 day SAT (Fig. 5b), but with relatively lower averaged PCC skill. For the verification against the observed 10–80 day SAT (Fig. 5c), significant PCCs account for more than 50% of forecasts within a 20-day lead time.
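The "practically useful" criterion can be expressed compactly, as sketched below. The text does not specify whether the PCC is centered, so a centered Pearson correlation across stations is assumed here; the function names are hypothetical, and the default threshold of 0.11 is the value quoted in the text.

```python
import math


def pattern_correlation(forecast, observed):
    """Centered Pearson correlation across stations at one forecast time."""
    n = len(forecast)
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, observed))
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_f * var_o)


def useful_fraction(pccs, threshold=0.11):
    """Fraction of forecasts whose PCC exceeds the significance threshold."""
    return sum(1 for p in pccs if p > threshold) / len(pccs)


def practically_useful(pccs, threshold=0.11):
    """True if more than half of the forecasts have a significant PCC."""
    return useful_fraction(pccs, threshold) > 0.5
```

Applied to the series of PCCs at one lead time, `useful_fraction` corresponds to the percentage shown on the right of each panel in the PCC figures.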

Fig. 5 PCC skill evolution for the 5–30-day lead (from top to bottom panel) forecasted 30–80 day SAT against (a) EOF-filtered 30–80 day SAT, (b) observed 30–80 day SAT and (c) observed 10–80 day SAT during the independent forecasting period (2000–2013). The green line is the threshold of the 99% confidence level; black (grey) bar denotes the significant (insignificant) PCC skill. The percentage of the significant PCC skill and the averaged PCC skill are shown on the right of each panel

Figure 6 shows the TCC skill distribution for the 5–30-day lead forecasted 10–30 day SAT against the EOF-filtered 10–30 day SAT, the observed 10–30 day SAT and the observed 10–80 day SAT, separately. The TCC map indicates that, beyond a 15-day lead time, the STPMs are unable to reproduce even the predictable part of 10–30 day SAT. Significant TCCs only appear at 5- and 10-day lead times. Consequently, at a 5-day lead time, useful TCC skills against observed 10–30 day SAT appear over most parts of China; at a 10-day lead time, useful TCC skills are confined to the lower reaches of the Yangtze River basin and northeastern China. Beyond a 15-day lead time, no significant TCC skill can be found. For the verification against the observed 10–80 day SAT, useful TCC skill only appears at a 5-day lead time over parts of northern China (where the largest fractional variance of 10–30 day SAT is located, as shown in Fig. 1b).

Fig. 6 TCC skill distribution for the 5–30-day lead forecasted 10–30 day SAT against (a) EOF-filtered 10–30 day SAT, (b) observed 10–30 day SAT and (c) observed 10–80 day SAT during the independent forecast period (2000–2013). The contour line is the threshold of the 99% confidence level; the areal-mean TCC skill is shown at the top center of each panel

Figure 7 further presents the evolution of PCC skills during 2000–2013. Useful PCC skills, accounting for above 50%, can only persist to a 15-day lead time for the predictable part of 10–30 day SAT, a 10-day lead time for observed 10–30 day SAT and only a 5-day lead time for observed 10–80 day SAT.

Fig. 7 PCC skill evolution for the 5–30-day lead (from top to bottom panel) forecasted 10–30 day SAT against (a) EOF-filtered 10–30 day SAT, (b) observed 10–30 day SAT and (c) observed 10–80 day SAT during the independent forecast period (2000–2013). The green line is the threshold of the 99% confidence level; black (grey) bar denotes the significant (insignificant) PCC skill. The percentage of the significant PCC skill and the averaged PCC skill are shown on the right of each panel

The above results suggest that the STPMs are able to reproduce the 30–80 day SAT but struggle to capture the 10–30 day SAT. Can combining the forecast results of 30–80 and 10–30 day SAT achieve better forecasting skill than direct forecasting of 10–80 day SAT? To investigate this, we compare the forecasted pattern of 10–80 day SAT with the combined forecasted pattern of 10–30 and 30–80 day SAT. As shown in Fig. 8, the forecast from the combined 30–80 and 10–30 day SAT STPMs does not show superior skill compared with that of the 10–80 day SAT STPMs. Useful skills appear over eastern China for 5–25-day lead times, both in the combined result of the 10–30 and 30–80 day STPMs and in the direct forecasting result of the 10–80 day STPM. Although differences exist between the two results (e.g., at a 30-day lead time the combined result shows better skill over northeastern China and worse skill over eastern China than the 10–80 day STPM result), they possess broadly comparable forecasting skill (based on their averaged TCC skill). The PCC skill presented in Fig. 9 supports a similar conclusion. Over half of the forecasts achieve a significant PCC before a 20-day lead time, with the averaged PCC exceeding 0.1 for both forecasting strategies.

Fig. 8 TCC skill distribution for (a) the forecasted 10–80 day SAT and (b) the combined forecasted 10–30 and 30–80 day SAT at 5–30-day lead times against the observed 10–80 day SAT during the independent forecast period (2000–2013). The contour line is the threshold of the 99% confidence level; the areal-mean TCC skill is shown at the top center of each panel

Fig. 9 PCC skill evolution for (a) the forecasted 10–80 day SAT and (b) the combined forecasted 10–30 and 30–80 day SAT at 5–30-day lead times against observed 10–80 day SAT, during the independent forecast period (2000–2013). The green line is the threshold of the 99% confidence level; black (grey) bar denotes the significant (insignificant) PCC skill. The percentage of the significant PCC skill and the averaged PCC skill are shown on the right of each panel

Because of the poor forecasting skill of 10–30 day SAT, the strategy of summing up the forecasts of the 30–80 and 10–30 day SAT STPMs cannot improve the forecasting skills, as compared to using the 10–80 day SAT STPMs directly. The STPMs show encouraging forecasting skills for 30–80 day SAT, but relatively low skills for 10–30 day SAT. Since the forecasting skills using the above two strategies show no critical differences, for simplicity, the following verification of Chinese heat waves is based on the direct forecasting results of the 10–80 day SAT STPMs.

4.2 Forecasting skill for heat waves

According to Fischer and Schar (2010), a heat wave is defined as a spell of at least 6 consecutive days with maximum temperatures exceeding the local 95th percentile over a control period. Similarly, a heat wave in the present study is defined as a pentad whose mean daily maximum surface air temperature exceeds the local 95th percentile during the control period of 1960–1990. Thus, for every gauge station, each pentad has its own heat wave threshold. If the forecasted and observed SAT both meet the criterion of a heat wave in a certain pentad, a “hit” is achieved. The hit rate is then calculated as the ratio of hits to the total number of observed heat waves. Figure 10 shows the forecasted (red line) and observed (black dashed line) SAT (including both intraseasonal and climatological low-frequency SAT) at Nanjing gauge station at 5–30-day lead times. The STPMs reproduce the observed total SAT well for the independent forecast period (2000–2013) at all 5–30-day lead times. Based on the heat wave criterion, both forecasted and observed heat waves are determined. For 5–15-day lead times, more than one-third of heat waves are reproduced by the STPMs. The hit rates are 26% and 24% for 20- and 25-day lead times, respectively. At a 30-day lead time, the STPMs can barely reproduce the local heat waves, with a hit rate of only 19%.
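The hit-rate calculation for a single station can be sketched as follows. The per-pentad thresholds (the local 95th percentiles over 1960–1990) are taken as given inputs, because the percentile estimation method is not described in the text. The function names are hypothetical, and returning `None` when no heat wave is observed is a choice made here for illustration.

```python
def heat_wave_flags(sat, thresholds):
    """True for each pentad whose mean maximum SAT exceeds its threshold."""
    return [t > th for t, th in zip(sat, thresholds)]


def hit_rate(forecast, observed, thresholds):
    """Ratio of hits to observed heat waves at one station.

    A hit is a pentad in which both the forecasted and the observed SAT
    exceed the threshold. Returns None if no heat wave was observed.
    """
    forecast_hw = heat_wave_flags(forecast, thresholds)
    observed_hw = heat_wave_flags(observed, thresholds)
    n_observed = sum(observed_hw)
    if n_observed == 0:
        return None
    hits = sum(1 for f, o in zip(forecast_hw, observed_hw) if f and o)
    return hits / n_observed
```

Note that this ratio ignores false alarms: a forecasted heat wave in a pentad with no observed heat wave does not lower the hit rate.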

Fig. 10

The observed (black dashed line) and 5–30-day lead forecasted (red line, from top to bottom panel) SAT at Nanjing gauge station, along with the heat waves (black and red rectangles denote the observation and forecast, respectively), during the independent forecast period (2000–2013). The hit rate of heat waves is shown on the right of each panel

As for Nanjing station, the 5–30-day lead forecast hit rates of heat waves at the other gauge stations over China are shown in Fig. 11. The results indicate that the hit rates over large areas of China exceed 30% for 5–10-day lead times. Central and North China maintain a persistently high hit rate, whereas northeastern, southern and western China present relatively low hit rates.

Fig. 11

Distribution of the 5–30-day lead forecasted heat wave hit rate (shading, %, black contour line denotes 30%) during the independent forecast period (2000–2013)

Interestingly, the regions of low hit rate overlap the regions with high fractional variance of the 30–80 day SAT rather than those with low fractional variance (Fig. 1c). This suggests that heat waves may be more closely related to the 10–30 day component of SAT, which the STPMs have difficulty reproducing, than to the 30–80 day component. Although a large number of false alarms exist, the extended-range forecasting of Chinese heat waves by the STPMs generally shows promising skill.

5 Conclusion and discussion

Real-time forecasting of Chinese SAT and heat waves at 5–30-day lead times has become an urgent nationwide demand. Based on EOF analyses, predictability sources are detected for the first three leading modes of 10–30 and 30–80 day SAT. Both 10–30 and 30–80 day SAT present a uniform pattern, a dipole pattern and a tripole pattern for the first three leading modes. For both 10–30 and 30–80 day Chinese SAT, three pathways of predictability sources to the target region of China are detected: from Europe (for PC1, PC2 and PC3 of both 10–30 and 30–80 day SAT), from Northeast Asia (for PC1 and PC2 of 30–80 day SAT, and for PC1 of 10–30 day SAT), and from the tropical Pacific Ocean (for PC3 of 30–80 day SAT, and for PC2 and PC3 of 10–30 day SAT).

Based on the detected predictability sources for the PCs, projection domains are selected to construct the STPMs. The STPMs perform well in reproducing the PCs of 30–80 day SAT. Useful TCC skills are achieved at all 5–30-day lead times for PC1 and PC2, and at 5–20-day lead times for PC3. The Chinese SAT pattern is then reconstructed using these forecasted PCs. The STPMs reproduce the temporal variation of EOF-filtered 30–80 day SAT quite well at 5–30-day lead times. Verification against observed 30–80 and 10–80 day SAT shows that useful TCC skills over eastern China can persist to 25–30-day lead times. Useful PCC skills accounting for more than 50% extend up to a 30-day lead time against both EOF-filtered and observed 30–80 day SAT, and up to a 20-day lead time against observed 10–80 day SAT. The STPMs have little skill in reproducing 10–30 day SAT. Useful forecasts for PC1 and PC2 can be obtained only within a 15-day lead time. For PC3, the forecasts beyond a 10-day lead time are useless. Thus, owing to the poor forecasting skill for 10–30 day SAT, combining STPMs constructed separately for the MJO and QBWO components does not improve forecasting skill compared with the 10–80 day SAT STPM.

Note that, for the sake of real-time forecasting, we use the “nonconventional filtering” method instead of the traditional “centered filtering” method to extract the intraseasonal variability. One may ask whether the forecasting skill can be inflated by this approach. To answer this question, we generate a test product based on the centered filtering method to remove the low-frequency signal. The forecasting skill based on the centered filtering method shows no difference compared to the forecasting skill obtained in the present study (figure not shown). Therefore, we consider that the “nonconventional filtering” method is suitable for real-time forecasting and does not artificially inflate the forecasting skill.

Although a large number of false alarms exist, the STPMs show some capacity for reproducing the heat waves over central and North China at a 15-day lead time, suggesting the potential of the statistical model for forecasting climate extremes. The limited skill in forecasting heat waves may be related to the low forecasting skill for 10–30 day SAT. Compared to the 30–80 day mode of ISO, the frequency of the 10–30 day mode is dispersed (figure not shown). This might be one of the reasons for the low predictability of the 10–30 day mode of ISO. We note that the forecasting skill for Chinese summer SAT in terms of PCC has considerable interannual variation (Fig. 9). Given that ENSO strongly modulates boreal summer ISO (Liu et al. 2016), the relationship between ENSO and the extended-range forecasting skill merits further investigation.

Previous statistical predictability and prediction studies on the intraseasonal time scale have focused mainly on the MJO component. Verification has also been made against the 30–60 day component rather than the total intraseasonal variation (10–80 day). This may lead to an overestimate of forecasting skill on the subseasonal time scale. The present study suggests that the STPM can reproduce the predictable part of the MJO component quite well at all 5–30-day lead times, but the predictable part of the QBWO component only at 5–15-day lead times. The total QBWO component can be reproduced only at 5–10-day lead times.

We confirmed that combining the forecasts of separate STPMs for the MJO and QBWO components of SAT does not improve the forecasting skill for total 10–80 day SAT. Further improvement of extended-range forecasting should focus on how to better represent the QBWO component. Outputs from state-of-the-art dynamical models can be employed to offset the shortcomings of the statistical models in forecasting the 10–30 day mode of ISO. The dynamical–statistical method is expected to be a possible approach to achieving the best extended-range forecasting skill.