1 Introduction

Analyzing climate dynamics as an interacting complex network yields valuable insights into several climatic phenomena. Multiple climatic predictors influence the state and dynamics of a climatic phenomenon. The monsoon is a prime example and is widely studied (Rajeevan 2001; Gadgil 2003; Gadgil et al. 2005; Guhathakurta and Rajeevan 2008; Wang et al. 2015; Saha et al. 2016b; Saha and Mitra 2016). The dynamism of the monsoon results from its dependence on a number of global climatic variables. The variation in the quantity and distribution of the monsoon is high. In addition, the predictors influencing the monsoon evolve over time. Thus, it is important to reconsider the monsoon predictors and to explore different climatic variables over the world that affect the complex monsoon phenomenon. We concentrate our study on the Indian summer monsoon and use a complex network paradigm to explore and identify new climatic predictors influencing the phenomenon.

The use of climatic networks in earth science is an emerging direction toward analyzing and understanding climatic phenomena. Tsonis and Roebber (2004) suggested the concept of the climatic network and represented the phenomena as a network of dynamic processes. They revealed that the overall dynamics result from interactions of two subsystems, one working in the higher latitudes and the other in the tropics.

Climate networks are also built using complex network-based concepts and are used to identify interesting patterns in the climatic system (Donges et al. 2009a, b). Steinhaeuser et al. (2011) proposed the analysis and modeling of climatic events using a complex network-based approach. Clusters derived by a complex network approach were shown to be better predictors than those obtained from traditional clustering approaches. Clustering methods are also widely used to detect regions of importance in the climatic network (Noor and Awan 2005; Steinbach et al. 2003). Steinhaeuser et al. (2010) detected the communities within the climatic system, and the approach was used for the potential predictors’ identification in the climatic network. The new predictors elucidate reasons behind the changing climatic phenomenon and assist in analyzing its causes. Tsonis and Swanson (2008) built networks for La Niña and El Niño. They showed that the El Niño network is less stable, which supports a better predictability of La Niña events. Major climatic shifts have been explained as transitions between different equilibria of oscillators using the climatic network (Tsonis et al. 2007).

The proposed work is focused in two main directions: (1) identification of new monsoon predictors using a community detection approach and density-based clustering, and (2) prediction of the Indian summer monsoon (ISM) with the identified predictors.

In the proposed approach, climatic networks are built considering the spatial grids of the world as nodes of the network. The nodes are attributed with climatic variables, and weighted edges are added by considering the similarity between the nodes. After the networks are built, communities are detected in them to identify significant climatic regions. The community detection-based approach achieves higher performance in detecting similar groups than the clustering method because, unlike clustering, community detection also considers the architecture of the network in addition to the attributes of the nodes. Finally, density-based clustering is applied to the detected communities to obtain spatially localized regions, which are representative of the new monsoon predictors. The identified predictors are observed to be more correlated with the ISM than the existing predictors of the monsoon. Lastly, the Indian summer monsoon is predicted using the identified correlated monsoon predictors with an ensemble regression model. The identified predictors establish their superiority in forecasting the Indian summer monsoon.

Section 2 of the article describes the data, the building of climatic networks, followed by the proposed predictor identification approach using the community detection and density-based clustering methods. The non-linear model for predicting the Indian summer monsoon is elaborated in Sect. 3. The concept of uncertainty and its association with the monsoon forecast is explored in Sect. 4. The detailed exploration of the monsoon predictors is provided with their predicting skills for the Indian monsoon in Sect. 5. Lastly, the article is concluded in Sect. 6.

2 Climatic network-based approach for identifying the predictors of Indian summer monsoon

The proposed method for the identification of predictors influencing the summer monsoon of the sub-continent is shown in Fig. 1. It elaborates all the steps followed in the approach to identify novel monsoon predictors, and finally forecasts the summer monsoon of the country.

Fig. 1 Climate network-based method for identifying monsoon predictors and predicting the Indian summer monsoon

2.1 Data sources and preprocessing techniques

The climatic variables considered are surface pressure (SP) and zonal wind at 850 hPa (UWND), which are well-known influencing factors of the Indian monsoon phenomenon (Rajeevan et al. 2007; Saha and Mitra 2016; Saha et al. 2017). Surface pressure and zonal wind values are obtained from the NCEP reanalysis data NOAA/OAR/ESRL/PSD (http://www.esrl.noaa.gov) (Kalnay et al. 1996), available at \(2.5^{\circ } \times 2.5^{\circ }\) resolution. At this resolution there are 73 (180/2.5 + 1) latitudinal and 144 (360/2.5) longitudinal grid points, giving 10,512 nodes (73 \(\times\) 144) in each of the climatic networks built for surface pressure (Net_SP) and zonal wind (Net_UWND).

The other climatic variable examined is sea surface temperature (SST), which has a high impact on the climatic phenomenon of monsoon (Rajeevan et al. 2004, 2007; Saha et al. 2016a, b; Saha and Mitra 2016). Sea surface temperature data are collected from NOAA_OI_SST_V2 (http://www.esrl.noaa.gov) (Reynolds et al. 2002) at \(2^{\circ } \times 2^{\circ }\) resolution. We have considered the SST data at \(4^{\circ } \times 4^{\circ }\) grid points to reduce the computational overhead, and this network of sea surface temperature (Net_SST) has 4050 (180/4 \(\times\) 360/4) nodes. These are the initial grid locations at which sea surface temperature values are examined. Many of these locations are over land, where sea surface temperature values are not available. Thus, the post-processing step selects the grids over the sea by retaining only nodes with less than 20% null values over time. It also adds links between the nodes based on the similarity measure. These steps are elaborated in Sect. 2.2. Note that the final networks for SP, UWND, and SST have fewer nodes than the initial networks. SP, UWND, and SST data are examined for the period 1948–2018 on a monthly scale for the study.

The prediction of the Indian summer monsoon (ISM), which accounts for the total rainfall in June–September, is the primary focus of the study. Rainfall data are collected from the India Meteorological Department (IMD: http://www.imdpune.gov.in) for the period 1948–2017. The long period average (LPA) rainfall over the span is 890.1 mm.

As a preprocessing step, the SP, UWND and SST anomaly values are evaluated by deducting the monthly mean from the respective month values of the variables (Eq. 1).

$$\begin{aligned} \text {anomalyData}^{y}_{m} = \text {realData}^{y}_{m} - \text {mean(realData}_{m}), \end{aligned}$$
(1)

where \(\text {realData}^{y}_{m}\) denotes the value of the variable for mth month of yth year. The mean (\(\text {realData}_{m}\)) signifies the average value of all years under study for the mth month.
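The anomaly computation of Eq. (1) can be sketched as follows. This is a minimal illustration only: the function name `monthly_anomalies` and the nested-list layout (one list of 12 monthly values per year) are assumptions made for the example, and missing values such as the null SST entries over land are not handled.

```python
def monthly_anomalies(real_data):
    """real_data[y][m] is the value for year index y and month index m (0-11)."""
    n_years = len(real_data)
    # Mean of each calendar month over all years under study.
    monthly_mean = [sum(year[m] for year in real_data) / n_years for m in range(12)]
    # Eq. (1): subtract the month's mean from every value of that month.
    return [[year[m] - monthly_mean[m] for m in range(12)] for year in real_data]


# Toy example with two synthetic years (not real climate data).
data = [[float(m) for m in range(12)], [float(m) + 2.0 for m in range(12)]]
anoms = monthly_anomalies(data)
```

By construction, the anomalies of each calendar month sum to zero over the years used to compute the mean.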

2.2 Design of climatic network and link thresholding

The first step of the proposed approach is the creation of climatic networks for the variables surface pressure, zonal wind at 850 hPa, and sea surface temperature. The spatial grids at a resolution of \(2.5^{\circ } \times 2.5^{\circ }\) for SP and UWND, and \(4^{\circ } \times 4^{\circ }\) for SST over the world are considered as nodes in the respective networks. The networks built for SP and UWND have 10,512 nodes each, and that for SST has 4050 nodes in the initial phase. The latitude, longitude, and the variable values over time at grid points characterize the nodes of the network. The values of the variable SST over the land surface are null. Such null nodes are eliminated from the network in the post-processing phase. Weighted edges are inserted considering the similarity between every node pair in terms of the normalized Euclidean distance (NED). The NED is calculated as shown in Eq. (2).

$$\begin{aligned} \text {NED}_{\left( n,m\right) }&= \frac{\text {ED}_{\left( n,m\right) } - \min \limits _{x,y \in G,\, x \ne y} \text {ED}_{\left( x,y\right) }}{\max \limits _{x,y \in G,\, x \ne y} \text {ED}_{\left( x,y\right) } - \min \limits _{x,y \in G,\, x \ne y} \text {ED}_{\left( x,y\right) }}, \nonumber \\ \text {ED}_{\left( n,m\right) }&= \sqrt{\sum _{i=1}^{t} \left( n_{i} - m_{i} \right) ^{2}}, \end{aligned}$$
(2)

where \(\left( n,m\right)\) denotes an edge between the nodes n and m, G denotes the set of nodes, \(\text {ED}_{\left( n,m\right) }\) denotes the Euclidean distance between the climatic variable’s time series at nodes n and m; and t denotes the length of variable time series.
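As a small illustration of Eq. (2), the sketch below computes the pairwise Euclidean distances between node time series and rescales them by the minimum and maximum over all node pairs. The names `euclidean` and `normalized_distances` and the dictionary layout are assumptions for the example; a real network with thousands of nodes would need a vectorized implementation.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalized_distances(series):
    """series maps a node id to its time series; returns {(n, m): NED}."""
    ed = {(n, m): euclidean(series[n], series[m])
          for n, m in combinations(sorted(series), 2)}
    lo, hi = min(ed.values()), max(ed.values())
    span = hi - lo
    # Min-max normalization over all node pairs, as in Eq. (2).
    return {pair: ((d - lo) / span if span > 0 else 0.0) for pair, d in ed.items()}


# Toy example with three nodes and series of length two.
toy = {"A": [0.0, 0.0], "B": [3.0, 4.0], "C": [0.0, 1.0]}
ned = normalized_distances(toy)
```

In the toy example the closest pair (A, C) receives an NED of 0 and the most distant pair (A, B) an NED of 1.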

An edge is added between two nodes if the normalized Euclidean distance between them falls below a threshold, which is computed in the following manner. The range of NED between all pairs of nodes is divided into 100 intervals, and the frequency of NED values in each interval is plotted. The point of sharp descent in the plot is taken as the threshold, and edges with NED less than the threshold are added (the smaller the NED, the closer the nodes) (Fig. 2). The threshold NED values for Net_SP, Net_UWND, and Net_SST are 0.05, 0.03, and 0.02, respectively. We also varied the threshold around the selected value and repeated the proposed approach. The accuracies obtained with the varied thresholds are observed to be comparable. Thus, we used the selected threshold for building the edges of the climatic network.
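The thresholding step can be sketched as below. The example only builds the 100-interval frequency table and filters the node pairs for a given threshold; locating the sharp descent is left to inspection of the plot, as in the text. The function names are hypothetical, and storing the NED itself as the edge weight is an assumption of this sketch.

```python
def ned_histogram(ned, bins=100):
    """Count pairwise NED values in equal-width intervals over [0, 1]."""
    counts = [0] * bins
    for value in ned.values():
        idx = min(int(value * bins), bins - 1)  # NED == 1.0 goes in the last bin
        counts[idx] += 1
    return counts


def threshold_edges(ned, threshold):
    """Keep node pairs whose NED is below the threshold (assumed edge weight: NED)."""
    return {pair: w for pair, w in ned.items() if w < threshold}


# Toy NED values for four node pairs (not real data).
ned = {("A", "B"): 0.01, ("A", "C"): 0.04, ("B", "C"): 0.60, ("C", "D"): 1.0}
counts = ned_histogram(ned)
edges = threshold_edges(ned, threshold=0.05)
```

With a threshold of 0.05, as reported for Net_SP, only the two closest pairs of the toy example remain as edges.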

Fig. 2 The frequency of edge weights at various intervals to calculate the threshold for surface pressure variable

Lastly, the networks are post-processed by removing the isolated nodes to build the connected network. After the post-processing and link addition, the final network built with the surface pressure variable has 9614 nodes and 196,606 edges, that for the zonal wind at 850 hPa has 7464 nodes and 10,416 edges, and that for the sea surface temperature variable has 1543 nodes and 1,094,514 edges. Note that the network for SST is very dense, which signifies that the variation of sea surface temperature between spatial grids is smaller than that of surface pressure.

2.3 Community detection followed by the density-based clustering for identifying the monsoon predictors

The proposed approach consists of three major steps, which are discussed below.

  (a) Identifying the communities of the climatic network.

  (b) Filtering and selecting the detected communities.

  (c) Identifying the geographically localized regions of interest from the communities.

The fast-greedy community detection method (Clauset et al. 2004) is applied to the climatic network to detect communities. The communities help identify new potential climatic predictors influencing the Indian summer monsoon. The algorithm is selected for the following properties: (1) it uses the edge weights of the network, (2) it is well suited to large and dense networks, and (3) it is computationally efficient in finding the communities within the network.

The fast-greedy method is a hierarchical agglomerative algorithm that optimizes the modularity of the network. It performs a greedy optimization starting with each vertex as a community of its own. At every step, the two communities whose merger produces the largest increase in modularity are joined into a single community. The algorithm stops when no merger further improves the modularity.
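The agglomeration described above can be illustrated with the following naive sketch. It is not the optimized data structure of Clauset et al. (2004) and not the authors' implementation; it simply rescans all connected community pairs at each step. The function name, the edge dictionary layout, and the treatment of weights as connection strengths are assumptions (the text does not state how the NED enters the modularity), and self-loops are assumed to be absent.

```python
def greedy_modularity(edges):
    """edges maps (u, v) to a positive weight; returns a list of node sets."""
    total = sum(edges.values())   # total edge weight m
    comm, degree, between = {}, {}, {}
    for (u, v), w in edges.items():
        for node in (u, v):
            comm.setdefault(node, {node})              # each node starts alone
            degree[node] = degree.get(node, 0.0) + w   # weighted degree
        key = frozenset((u, v))
        between[key] = between.get(key, 0.0) + w       # weight between communities
    while True:
        best_gain, best_pair = 0.0, None
        for key, w in between.items():
            a, b = tuple(key)
            # Modularity change if communities a and b are merged.
            gain = w / total - degree[a] * degree[b] / (2.0 * total ** 2)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:                          # no merger improves modularity
            return list(comm.values())
        a, b = best_pair
        comm[a] |= comm.pop(b)
        degree[a] += degree.pop(b)
        pair_key, merged = frozenset((a, b)), {}
        for key, w in between.items():
            if key == pair_key:
                continue                               # now internal to the merged community
            new_key = frozenset(a if c == b else c for c in key)
            merged[new_key] = merged.get(new_key, 0.0) + w
        between = merged


# Toy graph: two triangles joined by a single bridge edge, unit weights.
toy_edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
             (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0, (2, 3): 1.0}
communities = greedy_modularity(toy_edges)
```

On the toy graph the procedure recovers the two triangles as separate communities, because merging them across the bridge would lower the modularity.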

The communities obtained by fast-greedy community detection are filtered according to the density of nodes in each community. The density threshold used for filtering is selected empirically.

The obtained communities may be spatially scattered, so they are processed to obtain geographically localized communities using density-based spatial clustering (DBSCAN) (Ester et al. 1996). The algorithm is used because it is a spatial clustering technique that helps extract a localized set of grids. Further reasons are that (1) the number of clusters is not required a priori, (2) it can detect arbitrarily shaped clusters, (3) it requires only a single scan of the data, and (4) it is robust to outliers.

The latitude and longitude of the grids in the communities are fed to DBSCAN to obtain a set of spatially localized dense clusters, which are representative of the potential monsoon predictors.
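A compact sketch of DBSCAN on grid coordinates is given below for illustration. The parameter values `eps` and `min_pts` are hypothetical (the text does not report them), and the plain Euclidean distance in degrees is an assumption that ignores longitude wrap-around and the curvature of the Earth.

```python
import math


def dbscan(points, eps, min_pts):
    """points is a list of (lat, lon); returns one label per point, -1 for noise."""
    def neighbours(i):
        p = points[i]
        return [j for j, q in enumerate(points)
                if math.hypot(p[0] - q[0], p[1] - q[1]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisional noise, may become a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:       # j is also a core point, so expand through it
                queue.extend(nbrs)
    return labels


# Toy grid points: two localized groups and one isolated point.
grid = [(0.0, 0.0), (0.0, 2.5), (2.5, 0.0), (2.5, 2.5),
        (50.0, 50.0), (50.0, 52.5), (52.5, 50.0), (-60.0, 100.0)]
labels = dbscan(grid, eps=3.0, min_pts=3)
```

Each non-negative label corresponds to one spatially localized cluster, and the isolated grid point is marked as noise.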

2.4 Identification of the climatic predictors from the clusters

The spatially localized clusters are used to derive the new monsoon predictors. Each cluster consists of a number of grid points. For a specific cluster, the mean time series is computed over the series of all grids within the cluster. This mean time series represents the newly identified potential predictor. The computation of the predictor variable is shown in Eq. (3). Thus, each cluster represents a newly identified potential monsoon predictor. A few identified predictors coincide with well-known monsoon predictors, which validates our proposed method of identifying the monsoon predictors, while the others represent new localized geographical regions that are significant for the Indian monsoon.

$$\begin{aligned} {\text {identified predictor}} = \frac{\sum _{i=1}^{k}\left( P_{i} \right) }{k}, \end{aligned}$$
(3)

where \(P_{i}\) denotes climatic variable time series of the ith grid of localized cluster, and k is the number of grid points within a localized cluster (representative of identified predictor).
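Eq. (3) amounts to a pointwise average of the grid time series in a cluster, as the short sketch below shows. The function name `cluster_predictor` and the list-of-lists layout are assumptions for the example.

```python
def cluster_predictor(grid_series):
    """grid_series holds k equal-length time series, one per grid point."""
    k = len(grid_series)
    # Eq. (3): average the k grid series at every time step.
    return [sum(values) / k for values in zip(*grid_series)]


# Toy cluster with two grid points and three time steps.
predictor = cluster_predictor([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```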

3 Prediction model with identified monsoon predictors

A fitted ensemble of regression trees with the bagging algorithm (RegTreeB) (MATLAB 2012) is used as the prediction model, which combines a number of trained weak learners to provide the forecast. The model predicts the ensemble response by aggregating the predictions obtained from the trained weak learner regression models. The bagging method is used for building and training the regression tree weak learners of the ensemble model.

The prediction model is built using this algorithm for the following reasons: (1) it uses bagging, a bootstrap aggregating method that improves estimation, (2) it increases the predictive ability of the underlying regression tree, and (3) it can work with a large number of training instances and high-dimensional data.
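The idea of bagging regression trees can be sketched as follows. This is not the MATLAB model used in the study: for brevity the weak learner is a one-split regression tree (a stump), and all function names, the number of trees, and the random seed are assumptions of the example.

```python
import random


def fit_stump(X, y):
    """Fit a one-split regression tree: (feature, threshold, left_mean, right_mean)."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:                       # no valid split: predict the sample mean
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, thr, lm, rm = stump
    return lm if row[f] <= thr else rm


def fit_bagged(X, y, n_trees=50, seed=0):
    """Train each weak learner on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_bagged(trees, row):
    """Aggregate the weak learners by averaging their predictions."""
    return sum(predict_stump(t, row) for t in trees) / len(trees)


# Toy data: a step function of a single predictor.
X = [[float(x)] for x in range(10)]
y = [0.0] * 5 + [10.0] * 5
trees = fit_bagged(X, y)
```

Averaging over bootstrap replicates reduces the variance of the individual trees, which is the estimation improvement referred to in reason (1).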

4 Uncertainty associated with monsoon prediction

Ascertaining the uncertainty involved in the monsoon forecast is important. Decision makers need to analyze the uncertainty involved in the monsoon to propose justifiable strategies. Uncertainty should be clearly communicated; otherwise, it may lead to a false sense of certainty, improper decision-making, and overall reduced performance of the forecast. Uncertainty in the forecast arises from the probabilistic nature of forecasting the phenomenon.

The uncertainty arises from different sources and it may be first-order or second-order uncertainty. The first-order uncertainty points toward the likelihood of a phenomenon occurring in accordance with a particular forecast or the risk involved in it, whereas the second-order uncertainty is cited as ‘uncertainty about uncertainty’. It results from how well the model has adapted to forecast or it highlights the model errors in execution. It is represented as a measure of reliability.

Taylor et al. (2015) propose communicating the uncertainty in climate forecasts in the format preferred by the recipients, based on user-needs surveys conducted in different organizations. Azad et al. (2015) showed that the uncertainty in predicting the monsoon of India is reduced, and the performance improved, by treating the periodic and random parts of the time series separately with wavelets and a neural network, respectively. We characterize the uncertainty in the prediction of the monsoon using measures such as root mean square error, bias, correlation coefficient, and Willmott index, which quantify the uncertainty involved with the model by measuring how well the forecasts match the actual values.

5 Experimental results and analysis

The proposed climate network-based approach to identifying the monsoon predictors is judged by the measure of performance of identified predictors in forecasting the Indian monsoon.

5.1 Identified monsoon predictors

A correlation analysis of the new monsoon predictors is performed against the prime monsoon period of India (i.e., the total rainfall during June–September) by considering leads of 1–12 months. The lead months are considered to find the best correlated month: a lead of 1 represents May of the same year (the monsoon starts in June), a lead of 2 represents April of the same year, and a lead of 12 represents June of the previous year. Pearson correlation (\(\gamma\)), shown in Eq. (4), is used for the purpose. The best lead month is the month in which the identified climatic predictor has the highest correlation with the monsoon of India. The variable values of the best correlated month are used for the subsequent forecast. The top correlated identified predictors are selected for all three variables and are shown in Table 1. The table gives the location of the identified predictors along with their correlation values and the best correlated month with the Indian summer monsoon.

$$\begin{aligned} \gamma =\frac{\sum \nolimits _{i=1}^{N}\left( X^{i} - \overline{X}\right) \left( Z_{m}^{i} - \overline{Z_{m}}\right) }{\sqrt{\sum \nolimits _{i=1}^{N}\left( X^{i} - \overline{X}\right) ^{2}}\sqrt{\sum \nolimits _{i=1}^{N}\left( Z_{m}^{i} - \overline{Z_{m}}\right) ^{2}}}, \end{aligned}$$
(4)

where \(X^{i}\) and \(Z_{m}^{i}\) represent the Indian summer monsoon rainfall (total rainfall for June–September) in the ith year and the identified climatic predictor of the mth month in the ith year, \(\overline{X}\) is the averaged monsoon rainfall, \(\overline{Z_{m}}\) is the averaged climatic predictor for the mth month, and N is the total number of years. The identified predictors are ordered by their correlation with the Indian summer monsoon (the first predictor having the highest correlation and the last having the lowest). The identified predictors of surface pressure, sea surface temperature, and zonal wind are shown in Fig. 3a–c, respectively.
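The lead-month convention and Eq. (4) can be sketched as below. The helper `lead_to_month` encodes the mapping described above (lead 1 is May of the same year, lead 12 is June of the previous year); both function names are assumptions for the example.

```python
import math


def lead_to_month(lead):
    """Map a lead of 1-12 months to (calendar month 1-12, year offset 0 or -1)."""
    month = 6 - lead                       # the monsoon season starts in June (month 6)
    return (month, 0) if month >= 1 else (month + 12, -1)


def pearson(x, z):
    """Pearson correlation of Eq. (4) between rainfall x and predictor z."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    cov = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sz = math.sqrt(sum((b - mz) ** 2 for b in z))
    return cov / (sx * sz)


# A lead of 6 months points to December of the previous year.
december = lead_to_month(6)
```

The best lead for a predictor would then be the lead whose monthly series gives the largest absolute value of `pearson` against the June–September rainfall totals.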

Table 1 Identified monsoon predictors (Pred.) for surface pressure, sea surface temperature, and zonal wind with geographical location, absolute correlation (Corr. value) and correlated month (Corr. month) (0 signifies the same year and − 1 signifies the previous year)
Fig. 3 Identified climatic predictors for a surface pressure (\(S_{1}\)–\(S_{7}\)), b sea surface temperature (\(T_{1}\)–\(T_{7}\)), and c zonal wind (\(U_{1}\)–\(U_{8}\)). Monsoon predictors are arranged in accordance with their correlation with the summer monsoon of India (i.e., \(S_{1}\) being the most highly correlated and \(S_{7}\) the least)

The predictors identified by the proposed climate network-based approach can be classified into two classes. The first class consists of predictors belonging to regions that are well-known monsoon predictors. Identifying the established predictors supports our proposed method. The other class consists of predictors belonging to new geographical regions whose impact on the Indian summer monsoon has not been studied in the literature. They are presented as a new set of monsoon predictors of the country.

The re-identified monsoon predictors include the surface pressure of the Equatorial South-Eastern Indian Ocean (S2); disturbances in this region affect the tropical climate (Achuthavarier et al. 2012). The region of the Pacific Ocean around Indonesia and Malaysia (S5) has been shown to influence the summer monsoon of India (Rajeevan et al. 2004). The South-Eastern Equatorial Indian Ocean is shown to be tele-connected with the tropical Indian Ocean influencing the monsoon. It is noted that the change in sea surface temperature over the band stretching around the Equator over the Pacific Ocean (T2) influences the Indian summer monsoon (Rajagopalan and Molnar 2012). The sea surface temperature of the Equatorial East Pacific Ocean corresponds to the El Niño region, a known regulating factor for the Indian summer monsoon (Cherchi and Navarra 2013). Regarding the 850 hPa zonal wind, the Central North–South Pacific Ocean (U1) is also considered an important monsoon predictor by the India Meteorological Department for predicting the monsoon (Rajeevan et al. 2007). Moreover, North Pacific Ocean–Gulf-of-Alaska 850 hPa UWND (U3) and Equatorial North Pacific Ocean 850 hPa UWND (U4) share similar regions of the North Pacific Ocean, which is a well-known monsoon predictor of India (Rajeevan et al. 2004, 2007). Finally, the Southern Indian Ocean 850 hPa UWND has emerged as an important predictor, which is also used by Rajeevan et al. (2004) to forecast the Indian monsoon.

Newly identified climatic predictors include the surface pressure of the Southern Indian Ocean (S1), the Tasmania–Southern Ocean (S3), and the region of the Southern Ocean (S7). Other new predictors are the sea surface temperature of the Solomon Islands–Fiji–Pacific Ocean (T1), the Tasmania–South Indian Ocean (T3), the Philippine Sea (T5), the region of the Southern Ocean (T6), and the South Pacific Ocean (T7). Newly identified zonal wind-based predictors include Equatorial South Pacific Ocean 850 hPa UWND (U2), United States–Mexico–Gulf-of-Mexico 850 hPa UWND (U5), South Pacific Ocean 850 hPa UWND (U7), and North-Central Russia–China 850 hPa UWND (U8). These regions are shown to correlate with the Indian summer monsoon at different lead months (refer to Table 1).

We have presented the top seven climatic predictors for the surface pressure and sea surface temperature variables, and the top eight for the zonal wind variable, corresponding to regions obtained from the proposed approach. These predictors are presented for two reasons: (1) the correlation values of the other identified predictors with the monsoon are lower than those of the presented set, and (2) the literature reports that predictor sets with five to six predictors perform best for monsoon prediction (Rajeevan et al. 2007).

5.2 Prediction of monsoon with identified predictors

5.2.1 Predictor sets

The predictor sets are built considering the correlation of the identified predictors with the monsoon of India and their lead months of forecasting. Predictors are chosen such that they forecast the monsoon at two different leads. D1_Y, D2_Y, D3_Y, and D4_Y denote the predictor sets (Y denotes either S for predictors of SP, U for predictors of UWND, T for those of SST, or S_U, S_T, U_T, and S_U_T for the respective combined predictors). Tables 2 and 3 show the identified predictors considered for each predictor set along with the lead in months that gives the best correlation with the monsoon. Finally, the lead months of the individual predictors in a set determine the month in which the monsoon prediction can be issued.

Table 2 Predictor sets (Pred. sets) with the individual predictors of SP, UWND and SST for forecasting the Indian summer monsoon
Table 3 Predictor sets (Pred. sets) with the combined predictors of SP + UWND, UWND + SST, SP + SST and SP + UWND + SST for forecasting the Indian summer monsoon

5.2.2 Prediction performance

A non-linear model named ensemble regression tree with the bagging method (Sect. 3) is considered to forecast the monsoon. The prediction model is trained in two ways. The first method segregates the total period under the study into a separate set of training and test years, and the model is trained only once with the set of training instances, and tested over test instances. The second method uses the strategy of moving-window training. We calculate an optimal training period and, for every test year, the model is trained using instances of the preceding optimal number of years. Thus, for testing t number of years, the model is required to be trained separately for all the cases (i.e., it is trained t number of times).

In our first approach, the total period (1948–2017) is divided into disjoint training and test sets in an approximately 70–30 ratio. The model is trained for the period 1948–1994 and tested for 1995–2017. The prediction is evaluated by the mean absolute error (MAE), expressed as follows.

$$\begin{aligned} \text {MAE} = \frac{\sum \nolimits _{i=1}^{N} \mid Y_{i} - X_{i} \mid }{N}, \end{aligned}$$

where \(X_{i}\) and \(Y_{i}\) are the observed and predicted monsoon rainfall for the ith year and N denotes the total number of years.
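The MAE can be computed as in the sketch below. The tables report the error in percent; expressing the MAE relative to the long period average of 890.1 mm is an assumption of this example (the normalization is not stated in the text), and the function names are hypothetical.

```python
def mae(observed, predicted):
    """Mean absolute error between observed and predicted rainfall (same units)."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)


def mae_percent(observed, predicted, lpa=890.1):
    """MAE as a percentage of the long period average (assumed normalization)."""
    return 100.0 * mae(observed, predicted) / lpa


# Toy values in mm (not actual IMD data).
err_mm = mae([900.0, 800.0], [890.0, 830.0])
err_pct = mae_percent([900.0, 800.0], [890.0, 830.0])
```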

The training errors for all the individual predictors with the ensemble regression tree model are presented in Table 4.

Table 4 Training errors as mean absolute errors (%) for Indian monsoon prediction with SP, UWND, and SST for 1948–1994

The prediction model and the identified predictors are evaluated over 23 years of the test period (1995–2017). The test errors are presented in terms of mean absolute errors. The predictions by individual predictor variable (SP, UWND, and SST) with static training period (first approach) are presented in Table 5 and those for combined identified predictors (SP + UWND, UWND + SST, SP + SST, and SP + UWND + SST) are shown in Table 6.

Table 5 Mean absolute errors (%) for forecasting the Indian monsoon, with a static training span using individual predictors of SP, UWND, and SST for the test period 1995–2017
Table 6 Mean absolute errors (%) for forecasting the Indian monsoon, with a static training span with the combined predictors of SP + UWND, UWND + SST, SP + SST, and SP + UWND + SST for the test period 1995–2017

The second approach, the moving-window training strategy, is also used for predicting the Indian summer monsoon. In this method, if the number of training years is n, then for testing the tth year, the training years for the model are taken from the \((t-n)\)th to the \((t-1)\)th year (e.g., with ten training years and the test year 1995, the model is trained with the data of 1985–1994). Thus, the model needs to be trained individually for every test year with the corresponding preceding training years. Training periods from 5 to 45 years are examined. The optimal period is observed to be 20 years, which gives comparatively less error. Results for individual and combined variables with the moving-window training strategy are shown in Tables 7 and 8. It is observed that the results are superior to those of the static training process followed in our first approach. The underlying reason is that the moving-window training method can capture the pattern of variability of the period close to the test year.
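The moving-window schedule can be made concrete with a short sketch. The function name `training_years` is an assumption for the example; the window length of 20 years and the test period 1995–2017 are taken from the text.

```python
def training_years(test_year, n_years):
    """Years used to train the model that forecasts `test_year`."""
    return list(range(test_year - n_years, test_year))


# One training window per test year, using the reported optimal span of 20 years.
schedule = {year: training_years(year, 20) for year in range(1995, 2018)}
example = training_years(1995, 10)   # the ten-year example from the text
```

The model is refitted once per entry of `schedule`, i.e., 23 times for the 23 test years.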

Table 7 Mean absolute errors (%) for forecasting the Indian monsoon, with a moving-window training span with individual predictors of SP, UWND, and SST for the test period 1995–2017

Predictor set with identified predictors of surface pressure shows a mean absolute error of 4.6% in predicting the monsoon in May. The UWND predictors provide 4.4% error in April. The identified predictors of SST predict monsoon with 5.1% error in March.

Table 8 Mean absolute errors (%) for forecasting the Indian monsoon, with a moving-window training span with combined predictors of SP + UWND, UWND + SST, SP + SST, and SP + UWND + SST for the test period 1995–2017

Moreover, the predictor sets with combined predictors also forecast the monsoon with good accuracy. The predictor set with surface pressure and zonal wind (SP + UWND) predicts the monsoon in April with 4.4% error. Similarly, the UWND + SST and SP + SST predictor sets predict the Indian monsoon at a 2-month lead with 4.5% and 4.9% errors, respectively. Finally, the predictor set built with all three variables (SP + UWND + SST) shows 4.2% error in forecasting the monsoon in May.

Figure 4a, b shows the actual and predicted rainfall as deviation from the long period average rainfall (LPA) for predictor sets with individual variable and combined variables, respectively. The result highlights that the predicted rainfall shows a similar deviation of rainfall (negative or positive departure from LPA) as actual for the majority of the test years.

All the drought years (2002, 2004, 2009, 2014, and 2015) are forecast with a negative anomaly from the LPA by the combined predictor sets. The predictors of surface pressure capture the drought year of 2014, and the SP + SST predictors correctly capture the drought of 2009. For numerical models, even the direction of the deviation of predicted rainfall from the LPA is incorrect in many years (Nanjundiah et al. 2013). Therefore, the predictors identified by the climate network-based method improve the accuracy of monsoon prediction for India.

Fig. 4 Forecasts by predictor set with the identified a individual predictors of SP (D4_S), UWND (D1_U), SST (D4_T), and b combined variables of SP + UWND (D1_S_U), UWND + SST (D1_U_T), SP + SST (D1_S_T), and SP + UWND + SST (D1_S_U_T) for the test period 1995–2017

5.2.3 Other evaluation measures of prediction

We have also evaluated the performance of the predictors using different statistical measures. The measures and their corresponding results are elaborated in this section.

  1. (a)

    Root mean square error (RMSE): calculates the variation of the model output against actual values.

    $$\begin{aligned} \text {RMSE} = \sqrt{\frac{\sum _{i=1}^{N}\left( Y_{i}-X_{i} \right) ^2}{N}}, \end{aligned}$$

    where \(X_{i}\) and \(Y_{i}\) are the observed and predicted monsoons for the ith year, and N is the total years as defined earlier.

  2. (b)

    Prediction yield (PY): is evaluated at three different error categories (5%, 10%, and 15%) to assess the overall prediction by judging the number of predicted years within the allowed errors range.

  3. (c)

    Multiplicative bias (MB): is the ratio of the predicted to actual value; a closer value to 1 signifies good performance.

  4. (d)

    Pearson correlation coefficient (\(\gamma\)): measures the strength of the linear association between the actual and predicted value (shown in Eq. 4).

  5. (e)

    Willmott index of agreement (WI): is a measure calculating the degree of model prediction, with higher values indicating a better fit of model. It is shown in Eq. (5).

    $$\begin{aligned} \text{ Index } \text{ of } \text{ agreement }= 1 - \frac{\sum _{i=1}^{N}\mid X_{i} - Y_{i} \mid ^{2} }{\sum _{i=1}^{N} ( \mid Y_{i} - \overline{X} \mid + \mid X_{i} - \overline{X} \mid )^{2} }. \end{aligned}$$
    (5)
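
The measures above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study. The function names and the sample values are hypothetical. Two details are assumptions, because the text does not fix them: the prediction yield takes the error relative to the observed value and is returned as a percentage of years, and the multiplicative bias is computed as the ratio of the summed predictions to the summed observations. The Pearson coefficient uses the standard formula.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observed (X) and predicted (Y) values."""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(obs, pred)) / len(obs))


def prediction_yield(obs, pred, tol_pct):
    """Percentage of years whose absolute error is within tol_pct percent.

    Assumption: the error is measured relative to the observed value.
    """
    hits = sum(1 for x, y in zip(obs, pred)
               if abs(y - x) <= tol_pct / 100.0 * abs(x))
    return 100.0 * hits / len(obs)


def multiplicative_bias(obs, pred):
    """Ratio of predicted to observed totals (assumed aggregation)."""
    return sum(pred) / sum(obs)


def pearson(obs, pred):
    """Standard Pearson correlation coefficient."""
    n = len(obs)
    mean_x, mean_y = sum(obs) / n, sum(pred) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(obs, pred))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in obs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in pred))
    return cov / (sx * sy)


def willmott_index(obs, pred):
    """Index of agreement following Eq. (5); X observed, Y predicted."""
    mean_x = sum(obs) / len(obs)
    num = sum(abs(x - y) ** 2 for x, y in zip(obs, pred))
    den = sum((abs(y - mean_x) + abs(x - mean_x)) ** 2
              for x, y in zip(obs, pred))
    return 1.0 - num / den


# Hypothetical rainfall values (% of LPA), for illustration only.
observed = [100.0, 90.0, 110.0, 95.0]
predicted = [98.0, 93.0, 105.0, 95.0]

for tol in (5, 10, 15):
    print(tol, prediction_yield(observed, predicted, tol))
print(rmse(observed, predicted), multiplicative_bias(observed, predicted))
print(pearson(observed, predicted), willmott_index(observed, predicted))
```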

Table 9 reports the verification statistics of the different predictor sets for forecasting the Indian summer monsoon. The monsoon prediction by the combined predictors of surface pressure, zonal wind, and sea surface temperature is observed to be superior to that of the other predictors and their combinations.

Table 9 Prediction evaluation measures for the model with the identified climatic predictors of SP, UWND, SST, SP + UWND, UWND + SST, SP + SST and SP + UWND + SST for the test period 1995–2017

We have also assessed the skill of the identified predictors by investigating their ability to predict the sign (negative or positive) of the rainfall deviation from the LPA. A confusion matrix (Table 10) is used for this purpose. We compare the predicted negative or positive deviation with the observed rainfall deviation from the LPA.

Table 10 Confusion matrix

True positive (TP) denotes the count of test years when both observed and predicted rainfall show positive deviation from LPA, true negative (TN) denotes the count of test years when both observed and predicted rainfall show negative deviation from LPA, false positive (FP) represents the count of test years where the observed rainfall shows negative deviation but it is predicted as positive deviation from LPA, and false negative (FN) represents the count of test years when the rainfall is predicted as negative deviation but the observed rainfall shows positive deviation from LPA. The related measures to the confusion matrix are defined as follows.

  1. (a)

    Sensitivity: proportion of years that is correctly predicted as positive deviation from total observed positive deviation (\(\text{TP}/(\text{TP} + \text{FN})\)).

  2. (b)

    Specificity: proportion of years that is correctly predicted as negative deviation from total observed negative deviation (\(\text{TN}/(\text{TN} + \text{FP})\)).

  3. (c)

    Precision: proportion of positive deviation that is predicted correctly from the total number of predicted positive deviations (\(\text{TP}/(\text{TP} + \text{FP})\)).

  4. (d)

    Negative predictive value: proportion of negative deviation that is predicted correctly from the total number of predicted negative deviations (\(\text{TN}/(\text{TN} + \text{FN})\)).

  5. (e)

    Accuracy: proportion of years in which the predicted deviation has the same sign as the observed deviation (\((\text{TP} + \text{TN})/(\text{TP} + \text{TN} + \text{FP} + \text{FN})\)).

  6. (f)

    F1 score: the harmonic mean of sensitivity and precision of the model (\((2* \text{TP})/(2*\text{TP} + \text{FP} + \text{FN})\)).
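
As a hedged sketch of how these measures follow from the four counts, the snippet below classifies each year by the sign of its deviation from the LPA and then applies the formulas listed above. The function names and sample values are hypothetical. Rainfall is assumed to be expressed as a percentage of the LPA (so the LPA equals 100), and a value exactly equal to the LPA is treated as a non-positive deviation, which is an assumption.

```python
def deviation_counts(obs, pred, lpa=100.0):
    """Count TP, TN, FP, FN for the sign of the deviation from the LPA.

    Assumption: a value exactly equal to the LPA counts as non-positive.
    """
    tp = tn = fp = fn = 0
    for x, y in zip(obs, pred):
        obs_pos, pred_pos = x > lpa, y > lpa
        if obs_pos and pred_pos:
            tp += 1
        elif not obs_pos and not pred_pos:
            tn += 1
        elif pred_pos:          # observed negative, predicted positive
            fp += 1
        else:                   # observed positive, predicted negative
            fn += 1
    return tp, tn, fp, fn


def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def deviation_scores(tp, tn, fp, fn):
    """Measures (a)-(f) computed from the confusion-matrix counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical rainfall values (% of LPA), for illustration only.
observed = [105.0, 95.0, 102.0, 90.0, 98.0]
predicted = [103.0, 97.0, 99.0, 101.0, 96.0]

counts = deviation_counts(observed, predicted)
scores = deviation_scores(*counts)
print(counts, scores)
```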

The confusion matrices giving the number of correctly predicted positive and negative deviations from the LPA for each predictor set are presented in Table 11a–g. The observed numbers of positive and negative deviations from the LPA rainfall during the test period 1995–2017 are 8 and 15, respectively. The measures calculated from the confusion matrices to evaluate the performance of the identified predictors in correctly predicting the positive or negative rainfall deviation are shown in Table 12.

Table 11 Confusion matrix for monsoon prediction by (a) SP, (b) UWND, (c) SST, (d) SP + UWND, (e) UWND + SST, (f) SP + SST, (g) SP + UWND + SST predictors for the test period 1995–2017
Table 12 Evaluation for positive and negative deviation rainfall prediction with the identified climatic predictors of SP, UWND, SST, SP + UWND, UWND + SST, SP + SST and SP + UWND + SST for the test period 1995–2017

5.2.4 Uncertainty analysis

The uncertainty involved in monsoon prediction is explained in terms of the ‘fraction of variance unexplained (FVU)’. The measure is defined as the fraction of the variance of the dependent variable (the monsoon in our case) that cannot be explained or correctly predicted by the explanatory variables (the identified monsoon predictors). FVU is one if the identified predictors do not convey anything about the monsoon, and the prediction becomes more accurate, with less uncertainty, as the FVU value approaches zero. The expression is shown in Eq. (6).

$$\begin{aligned} \text{FVU} = \frac{\text{VAR}_{\text{error}}}{\text{VAR}_{\text{total}}} = \frac{\text{SD}_{\text{error}}/N}{\text{SD}_{\text{total}}/N} = \frac{\text{SD}_{\text{error}}}{\text{SD}_{\text{total}}}, \end{aligned}$$
(6)

where

$$\begin{aligned} \text{ SD }_{{\text {error}}}= & {} \sum _{i=1}^{N}\left( X_{i} - Y_{i} \right) ^{2},\\ \text{ SD }_{{\text {total}}}= & {} \sum _{i=1}^{N}\left( X_{i} - \overline{X} \right) ^{2}. \end{aligned}$$

The terms are as defined in Sect. 5.2.3. The prediction based on surface pressure has an FVU of 0.38, and those based on zonal wind and sea surface temperature have FVUs of 0.33 and 0.55, respectively. Lower values indicate that a smaller fraction of the variance remains unexplained, which signifies a better prediction of the monsoon.
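
Eq. (6) can be illustrated with a short Python sketch. The function name and the sample values are hypothetical and serve only to show the two limiting cases discussed above: a prediction that always returns the observed mean gives an FVU of one, and a perfect prediction gives an FVU of zero.

```python
def fvu(obs, pred):
    """Fraction of variance unexplained, following Eq. (6)."""
    mean_obs = sum(obs) / len(obs)
    sd_error = sum((x - y) ** 2 for x, y in zip(obs, pred))   # SD_error
    sd_total = sum((x - mean_obs) ** 2 for x in obs)          # SD_total
    return sd_error / sd_total


# Hypothetical rainfall values (% of LPA), for illustration only.
observed = [100.0, 90.0, 110.0, 95.0]
predicted = [98.0, 93.0, 105.0, 95.0]
mean_only = [sum(observed) / len(observed)] * len(observed)

print(fvu(observed, predicted))   # between 0 and 1 for this sample
print(fvu(observed, mean_only))   # uninformative prediction
print(fvu(observed, observed))    # perfect prediction
```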

5.2.5 Comparisons with existing models

The prediction skill of the monsoon predictors identified by the proposed approach is compared with that of the existing monsoon prediction models of the India Meteorological Department (IMD). The models used by IMD are the 16-parameter power regression model (Gowariker et al. 1991) and the 8-parameter and 10-parameter regression models (Rajeevan et al. 2004). The results are shown in Fig. 5.

Fig. 5 Root mean square errors in monsoon prediction by the predictors of SP, SST, UWND, SP + SST, UWND + SST, SP + UWND, SP + UWND + SST; and IMD’s 16- (Gowariker et al. 1991), 10-, and 8-parameter models (Rajeevan et al. 2004) during 1996–2002

The predictor sets with SP, SST, UWND, SP + SST, UWND + SST, SP + UWND, and SP + UWND + SST provide root mean square errors of 5.5%, 6.1%, 5.9%, 5.7%, 5.9%, 4.5%, and 4.6%, respectively, in monsoon prediction for the period 1996–2002 [this period is chosen to allow comparison with the existing forecasts by Rajeevan et al. (2004)]. The three IMD models predict the monsoon with 10.8%, 6.4%, and 7.6% errors, respectively.

The predictions by the identified climatic predictors are also compared with the predictions provided by IMD’s current model, which uses projection pursuit regression (PPR) (Rajeevan et al. 2007). The comparison covers the period 2003–2017, for which forecasts by the IMD model are available. The PPR model issues the monsoon prediction in two stages: first in April (LRF1) and then in June (LRF2). The model predicts the monsoon with 6.8% and 6.1% mean absolute errors in April and June, respectively.

The identified predictors of SP, SST, and UWND provide errors of 4.9%, 5.2%, and 4.5%, respectively. The combined predictors of SP + SST, UWND + SST, SP + UWND, and SP + UWND + SST provide errors of 4.9%, 4.7%, 5.1%, and 4.6%, respectively. Thus, the predictors identified by the proposed climate network-based approach are comparable with the monsoon models used by IMD (Gowariker et al. 1991; Rajeevan et al. 2004, 2007). The results are presented as a bar chart in Fig. 6.

Fig. 6 Mean absolute errors (%) in monsoon prediction by the predictors of SP, UWND, SST, SP + UWND, UWND + SST, SP + SST, SP + UWND + SST, and IMD’s PPR model (LRF1 and LRF2) (Rajeevan et al. 2007) during 2003–2017

5.3 Analysis based on correlation of monsoon predictors with the Indian summer monsoon

The Pearson correlation (\(\gamma\)) (Eq. (4)) between the identified monsoon predictors and the monsoon of India is compared with the correlation between the well-known predictors and the monsoon (Rajeevan et al. 2007). The important well-known monsoon predictors, as considered by the India Meteorological Department, include the North Atlantic SST (NA_SST), the Equatorial South-Eastern Indian Ocean SST (ESE_IO_SST), the East Asia SP (EA_SP), the North Atlantic SP (NA_SP), the North-Central Pacific Ocean zonal wind (NC_PO_WV), and the North-West Europe SP (NW_Eu_SP). The identified predictors of SP, SST, and UWND, with \(\gamma\) of 0.35, 0.34, and 0.43, are comparable to the known IMD predictors in their correlation with the monsoon of India (shown in Fig. 7).

Fig. 7 Correlation between the Indian monsoon and IMD’s predictors as well as the identified predictors of surface pressure (\(S_{1}\)), sea surface temperature (\(T_{1}\)), and zonal wind (\(U_{1}\))

5.4 Monsoon of current year 2018

The Indian monsoon for the current year 2018 is also predicted. The identified predictors of SP, UWND, and SST forecast rainfall as 92.26%, 90.12%, and 92.15% of the long period average (LPA) in May, March, and March, respectively. Additionally, the combined predictors of SP + UWND, UWND + SST, SP + SST, and SP + UWND + SST predict the monsoon as 94.03%, 91.65%, 95.19%, and 95.22% of the LPA in May, April, April, and May, respectively. Thus, averaging all the values, we present the Indian monsoon for 2018 as 92.94% of LPA rainfall, which is below normal for the current year.
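
The averaging step can be checked directly from the seven forecasts quoted above. The sketch below assumes an unweighted arithmetic mean, which the text implies but does not state explicitly. The variable names are illustrative. The computed mean agrees with the reported 92.94% up to rounding in the last digit.

```python
# Forecasts for 2018 (% of LPA) as quoted in the text, keyed by predictor set.
forecasts_2018 = {
    "SP": 92.26,
    "UWND": 90.12,
    "SST": 92.15,
    "SP+UWND": 94.03,
    "UWND+SST": 91.65,
    "SP+SST": 95.19,
    "SP+UWND+SST": 95.22,
}

# Assumption: all predictor sets are weighted equally.
ensemble_mean = sum(forecasts_2018.values()) / len(forecasts_2018)
print(f"{ensemble_mean:.3f}% of LPA")
```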

6 Conclusions

The identification of monsoon predictors has always been a prime focus in earth science. In our work, a community detection approach is used to identify the monsoon predictors that are important for the monsoon of the subcontinent. The community detection method is followed by a density-based clustering method to obtain localized geographical regions. These regions represent the newly identified monsoon predictors. Some of the identified predictors correspond to known predictors of the monsoon, which validates the proposed predictor identification approach, while other new predictors are also found to have a high correlation with the Indian summer monsoon. The non-linear ensemble regression model, designed with the identified monsoon predictors, is observed to be comparable to IMD’s existing models for forecasting the Indian monsoon.

The future scope of the work comprises the inclusion of more climatic variables and the identification of new predictors from a combination of different variables. The focus will be on exploring new climatic predictors that are crucial to the summer monsoon and may prove to be even better estimators of the Indian summer monsoon.