Introduction

Drought remains one of the costliest climatic hazards facing many countries. In general terms, drought is a prolonged climatological period with below-average precipitation, which leads to deficits in water resources and to major problems for agriculture, the economy, ecosystems and human health (Wu et al. 2001; Quiring and Papakryiakou 2003; Belayneh et al. 2016; Wilhite and Buchanan-Smith 2005; Liu et al. 2021). Drought is often described as a creeping phenomenon: unlike other natural disasters such as floods, landslides and earthquakes, its damage accumulates gradually (Rossi 2000; Wilhite et al. 2007). Situated in the Middle East, Iran is generally classified as an arid to semi-arid country, with roughly two-thirds of its area being desert.

The definition of drought varies from one region to another, and its classification is region-specific. The American Meteorological Society (1997) classified drought into four main categories: meteorological (precipitation deficit), hydrological (streamflow and reservoir storage), agricultural (soil moisture deficit) and socio-economic. Drought is monitored using indices that detect drought conditions and trends based on precipitation deviations from normal levels (Paulo et al. 2012), soil moisture deficits and reductions in surface and groundwater flows (Zargar et al. 2011). The Palmer Drought Severity Index (Palmer 1968), the Standardized Precipitation Index (SPI) (McKee et al. 1993), the Standardized Precipitation Evapotranspiration Index (Vicente-Serrano et al. 2010), the Rainfall Anomaly Index (van Rooy 1965), the Crop Moisture Index (Palmer 1968) and the Surface Water Supply Index (Doesken and Garen 1991) are important examples of drought indices. SPI is among the best-known drought indicators (McKee et al. 1993) and has been adopted as a standard meteorological drought index at the global scale (Wardlow et al. 2012). SPI can be computed on different time scales, including 1, 3, 6, 9, 12, 18, 24 and 48 months, and is therefore used to track historical drought on short-, medium- and long-term scales. SPI also has several practical advantages; for instance, unlike soil moisture, it requires only precipitation data and can be estimated at a high confidence level. These benefits explain the popularity of this index in drought studies (see Hayes 1999; Szalai and Szinell 2000; Bordi and Sutera 2001; Lloyd-Hughes and Saunders 2002; Vicente-Serrano et al. 2004; Tsakiris et al. 2007; Shukla and Wood 2008; Raziei et al. 2009; Palchaudhuri and Biswas 2013; Portela et al. 2015; Ionita et al. 2016; Kadam et al. 2021).

Drought monitoring helps raise awareness of the onset of drought and identify its past magnitude and severity. More importantly, however, reliable drought forecasting is needed to gain insight into possible future droughts in a region and is essential for managing and mitigating their adverse effects. In recent years, researchers have shown great interest in regression models (Leilah and Al-Khateeb 2005), Auto-Regressive Integrated Moving Average (ARIMA) time series models (Han et al. 2010), probabilistic and analytical auto-covariance matrix models (Cancelliere et al. 2007), Artificial Neural Networks (ANN) (Morid et al. 2007; Barua et al. 2012), the Adaptive Neuro-Fuzzy Inference System (ANFIS) (Mokhtarzad et al. 2017; Kisi et al. 2019), extreme learning machines (Deo and Sahin 2015) and Support Vector Machines (SVM) (Khan et al. 2020) for predicting drought.

Mishra and Desai (2005) combined ANN and linear stochastic models based on SPI series in the Kansabati River Basin, India, and the resulting model predicted drought with high accuracy. Bacanli et al. (2009) investigated the efficiency of Feed Forward Neural Network (FFNN) and ANFIS models in predicting drought based on SPI series in Central Anatolia, Turkey, and found that the ANFIS model outperformed the FFNN model. Shirmohammadi et al. (2013) investigated the efficiency of ANN, ANFIS, Wavelet-ANFIS and Wavelet-ANN models in predicting drought based on SPI series in Azerbaijan, Iran. Their results demonstrated that all the aforementioned models could predict SPI; however, the hybrid Wavelet-ANFIS model outperformed the others. Mokhtarzad et al. (2017) compared the efficiencies of ANN, ANFIS and SVM in predicting drought based on SPI series using data from a meteorological station in Bojnourd, Iran, and reported that SVM outperformed ANN and ANFIS in terms of accuracy.

Kisi et al. (2019) investigated the precision of four evolutionary neuro-fuzzy methods, namely Adaptive Neuro-Fuzzy Inference System with Particle Swarm Optimization (ANFIS-PSO), ANFIS with Genetic Algorithm (ANFIS-GA), ANFIS with Ant Colony Algorithm (ANFIS-ACO) and ANFIS with Butterfly Optimization Algorithm (ANFIS-BOA). Then, they made a comparison between the precision of these methods and that of the classical ANFIS method in predicting SPI time series at Abbasabad and Biarjmand Stations, Semnan, Iran. According to the results obtained from Ebrahim-Abad Station, the ANFIS-PSO method exhibited the best prediction precision on different SPI time scales.

Iran, an arid to semi-arid country, receives a mean annual rainfall of about 250 mm, roughly a quarter of the world average (Mahdavi 2010). In recent years, Iran has experienced persistent and severe droughts that caused serious shortages of surface water and groundwater, with subsequent adverse environmental and agricultural effects. This has prompted further investigation and characterization of drought in different regions of Iran, as reflected in studies by Raziei et al. (2009), Zarch et al. (2011), Moradi et al. (2011), Mirabbasi et al. (2013), Saghafian and Mehdikhani (2014), Raziei et al. (2015) and Rezaei et al. (2016). Western Iran is a vital area holding a significant share of the country's water supply, as it is the source of three major rivers: the Karkheh, Dez and Karoon. The Karkheh basin is one of the regions frequently affected by drought (Byzedi et al. 2012; Ashraf Vaghefi et al. 2014; Kamali et al. 2015; Zamani et al. 2015; Kamali et al. 2017). The basin is shared by several Iranian provinces, including Hamedan, Kermanshah, Kurdistan, Ilam, Lorestan and Khuzestan. The Karkheh River, after the Karoon and Dez, is the third largest river in Iran and supplies water to many parts of the country, which is why droughts in this basin create major challenges for the agricultural and economic sectors of these provinces. Recently, tree-based models have attracted considerable attention worldwide, and several studies have reported that they are more effective and have higher predictive power than ANFIS, SVM and ANN models. Hussain and Khan (2020) found that a Random Forest (RF) model outperformed both ANN and SVM models in forecasting monthly river flow, and Shamshirband et al. (2020) reported that M5 model trees predicted the standardized streamflow index better than SVM and Gene Expression Programming.

In this regard, the present study pursues four main objectives: (a) to forecast drought at the Kermanshah synoptic station on time scales of 3, 6, 12 and 48 months using the SPI index, a standalone REPT model and its new integrations with Bagging (BA-REPT), Dagging (DA-REPT), Additive Regression (AR-REPT) and Random Committee (RC-REPT); (b) to compare the predictive power of meteorological variables against lag-time SPI (i.e., SPI(t−1), SPI(t−2) and so on) as inputs, which have so far been investigated only separately; (c) to determine which input scenario performs best in drought forecasting; and (d) to develop a predictive model that forecasts future drought from past-to-current data. The findings of this study can assist decision-makers and the Natural Resources Bureau in better managing the drought risk threatening the basin.

Study area

This research focuses on the Karkheh Watershed (Fig. 1), which covers 50,768 km² in the central and southwestern Zagros Mountains, between latitudes 30° 08ʹ and 35° 04ʹ N and longitudes 46° 06ʹ and 49° 10ʹ E. Nearly 55.5% of the basin lies in mountainous areas and the rest in plains and foothills. The Karkheh basin has a Mediterranean-type climate. Mean annual rainfall ranges from 150 mm in the south to more than 1000 mm in the northern and eastern parts of the basin, and mean annual air temperature ranges from less than 5 °C over the high mountains to 25 °C in the southern areas.

Fig. 1
figure 1

Kermanshah synoptic station location in Kermanshah province, Iran

Methodology

Dataset

A 30-year set of monthly records including maximum relative humidity (RHMax), minimum relative humidity (RHMin), maximum temperature (TMax), minimum temperature (TMin) and rainfall was compiled for the Kermanshah synoptic station. SPI was considered the target (output) variable, while the other variables and lag-time SPI (i.e., SPI(t−1), SPI(t−2), etc.) were used as inputs for predicting the target. The input and output data were divided into two subsets: 70% (January 1988 to December 2008) for model development (training) and 30% (January 2009 to December 2018) for model validation (testing). The 70:30 ratio is the most widely used split in ML modeling (Khosravi et al. 2018a, b; Khosravi et al. 2019; Venegas-Quiñones et al. 2020; Khosravi et al. 2021a, b, c; Kargar et al. 2021; Panahi et al. 2021). Table 1 presents the descriptive statistics of the training and testing datasets.

Table 1 Data characteristics of the training and testing datasets
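As an illustration of the data partitioning described above, the following Python sketch splits a monthly record chronologically into 70% training and 30% testing subsets. The file name and column names are assumptions for illustration only; the actual modeling in this study was carried out in WEKA and MATLAB.

```python
import pandas as pd

# Hypothetical file and column names; the real record is the Kermanshah station data.
df = pd.read_csv("kermanshah_monthly.csv", parse_dates=["date"])
df = df.sort_values("date").reset_index(drop=True)

split = int(len(df) * 0.7)              # first 70% of the record (1988-2008) for training
train, test = df.iloc[:split], df.iloc[split:]

features = ["RHMax", "RHMin", "TMax", "TMin", "Rainfall", "SPI_lag1"]  # assumed names
X_train, y_train = train[features], train["SPI"]
X_test, y_test = test[features], test["SPI"]
```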

In addition to the SPI, the input data include the monthly rainfall recorded at the Kermanshah synoptic station; specifically, the monthly precipitation dataset covers the 30-year record (30 × 12 = 360 months). The set of averaging periods, n = 3, 6, 12 and 48 months, represents typical time scales for precipitation deficits. For each month, a new value is determined from the previous n months. Each accumulated series is fitted to the Gamma function to define the relationship between probability and precipitation. The probability of each observed precipitation value is calculated and then transformed into the corresponding deviate of a normal distribution with zero mean and unit standard deviation; this value is the SPI for that precipitation data point.

Given that the available data are homogeneous, time series on the 3-, 6-, 12- and 48-month scales are constructed, and each series is fitted to the Gamma distribution. The probability density function is calculated as follows (Kisi et al. 2019):

$$g(x) = \frac{1}{{\beta^{\alpha } \Gamma (\alpha )}}x^{\alpha - 1} e^{ - x/\beta } \;\;{\text{for}}\;\,x > 0,$$
(1)

where \(\alpha\) and \(\beta\) represent the shape and scale parameters, respectively, and \(\Gamma (\alpha )\) is the Gamma function defined as follows (Kisi et al. 2019):

$$\Gamma (\alpha ) = \int\limits_{0}^{\infty } {x^{\alpha - 1} e^{ - x} {\text{d}}x},$$
(2)

where the parameters \(\alpha\) and \(\beta\) of the Gamma density function can be determined for each station, each time scale and every month of the year. McKee et al. (1993) estimated \(\alpha\) and \(\beta\) using the maximum likelihood method (Kisi et al. 2019):

$$\alpha = \frac{1}{4A}\left( {1 + \sqrt {1 + \frac{4A}{3}} } \right)$$
(3)
$$A = {\text{Ln}}(\overline{X}) - \frac{{\sum {{\text{Ln}}(X)} }}{n}$$
(4)
$$\beta = \frac{{\overline{X}}}{\alpha },$$
(5)

where n is the number of rainfall data samples (i.e., observations) and \(\overline{X}\) is the mean rainfall in the specific period. These parameters are then used to calculate the cumulative rainfall probability on a given time scale. With the substitution \(t = X/\beta\), the cumulative probability in Eq. 6 reduces to the incomplete Gamma function:

$$G(X) = \int\limits_{0}^{X} {g(x)\,{\text{d}}x}.$$
(6)

Since the Gamma function is undefined for X = 0 and a rainfall series may contain zero values, the cumulative probability in such cases is calculated as follows:

$$H\left( X \right) = q + \left( {1 - q} \right)G\left( X \right),$$
(7)

where q is the probability of zero rainfall (q = m/n) and m is the number of zero values in the rainfall time series. SPI is then computed through the following equations (Kisi et al. 2019):

$${\text{SPI}} = - \left[ {t - \frac{{c_{0} + c_{1} t + c_{2} t^{2} }}{{1 + d_{1} t + d_{2} t^{2} + d_{3} t^{3} }}} \right], \, 0 < H(X) \le 0.5$$
(8)
$${\text{SPI}} = + \left[ {t - \frac{{c_{0} + c_{1} t + c_{2} t^{2} }}{{1 + d_{1} t + d_{2} t^{2} + d_{3} t^{3} }}} \right], \, 0.5 < H(X) \le 1.0.$$
(9)

The intermediate variable t is calculated as follows (Kisi et al. 2019):

$$t = \sqrt {\ln \left( {\frac{1}{{H(X)^{2} }}} \right)} , \, 0 < H(X) \le 0.5$$
(10)
$$t = \sqrt {\ln \left( {\frac{1}{{(1 - H(X))^{2} }}} \right)} , \, 0.5 < H(X) \le 1.0.$$
(11)

Constant coefficient values in these equations are found in Table 2.

Table 2 Constant coefficient values in SPI equations (Kisi et al. 2019; McKee et al. 1993)
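For illustration, the SPI calculation outlined in Eqs. (1)–(11) can be sketched in Python with SciPy. This is a simplified version that fits a single Gamma distribution to the whole accumulated series, whereas a full implementation fits the parameters separately for each calendar month; the inverse-normal transform used below is equivalent to the polynomial approximation of Eqs. (8)–(11). All names are illustrative, and the synthetic rainfall stands in for the Kermanshah record.

```python
import numpy as np
from scipy import stats

def spi(monthly_rain, scale=3):
    """Minimal SPI sketch: n-month accumulation, Gamma fit, normal transform."""
    rain = np.asarray(monthly_rain, dtype=float)
    # n-month moving accumulation (averaging periods n = 3, 6, 12 or 48 months)
    acc = np.convolve(rain, np.ones(scale), mode="valid")

    nonzero = acc[acc > 0]
    # Maximum-likelihood Gamma fit: alpha = shape, beta = scale, location fixed at zero
    alpha, _, beta = stats.gamma.fit(nonzero, floc=0)

    q = (acc == 0).mean()                  # probability of zero accumulation, q = m/n
    G = stats.gamma.cdf(acc, alpha, loc=0, scale=beta)
    H = q + (1 - q) * G                    # Eq. (7)
    H = np.clip(H, 1e-6, 1 - 1e-6)         # guard against 0 and 1 before the inverse transform
    return stats.norm.ppf(H)               # equivalent to the approximation in Eqs. (8)-(11)

# Example with synthetic rainfall (mm); the real input is the Kermanshah monthly record.
rng = np.random.default_rng(0)
spi3 = spi(rng.gamma(2.0, 20.0, size=360), scale=3)
```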

Constructing input combinations

The importance of identifying the most effective input variables for the predictive power of the models cannot be overstated. First, the correlation coefficient (r) between each input variable and SPI on each time scale was computed, and these values were used as the basis for constructing different input combinations. The variable with the highest r value formed the first input combination; the variable with the second highest r was then added to form input combination No. 2, and this process continued until the variable with the lowest r value was added to form input combination No. 9. To identify the most effective combination, the models were applied to all inputs (Table 3), and the combination yielding the lowest Root Mean Square Error (RMSE) was selected as the best input scenario. This approach is a popular way of creating and examining input scenarios in ML modeling (Monteiro Junior et al. 2019; Nhu et al. 2020; Salih et al. 2020; Meshram et al. 2021).

Table 3 Various input combinations for SPI on different time scales
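The procedure described above can be sketched as follows: candidate inputs are ranked by their absolute correlation with SPI, nested combinations are built from the strongest variable downward, and each combination is scored by RMSE. The function and variable names are illustrative, and a scikit-learn regression tree stands in for the REPT-based models actually evaluated in this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def nested_input_combinations(X_train, y_train, X_test, y_test):
    """Rank inputs by |r| with the target and score nested combinations by RMSE."""
    r = X_train.apply(lambda col: abs(np.corrcoef(col, y_train)[0, 1]))
    order = r.sort_values(ascending=False).index.tolist()

    results = {}
    for k in range(1, len(order) + 1):
        cols = order[:k]                      # input No. k: the k most correlated variables
        model = DecisionTreeRegressor(random_state=0).fit(X_train[cols], y_train)
        pred = model.predict(X_test[cols])
        results[f"input_{k}"] = mean_squared_error(y_test, pred) ** 0.5
    return results   # the combination with the lowest RMSE is the chosen scenario
```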

Determining optimum values

Another step that strongly affects predictive power is determining the optimum values of the model parameters. All the models except ANFIS and SVM (implemented in MATLAB) were developed in the Waikato Environment for Knowledge Analysis (WEKA 3.9) software. The optimum parameter values were obtained by trial and error: the models were first applied with default values, then larger and smaller values were tested and the models were reapplied, and this procedure was repeated until the optimum values were found (Khosravi et al. 2021a, b, c). As in the previous section, the RMSE criterion was used to identify the optimum values.
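The trial-and-error procedure can be illustrated as follows: starting from the defaults, values above and below each default are tested and the setting with the lowest RMSE on the validation data is retained. The grid and the base learner below are assumptions for illustration, not the parameters actually tuned in WEKA.

```python
from itertools import product
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def tune_by_trial_and_error(X_train, y_train, X_test, y_test):
    """Test values around the defaults and keep the setting with the lowest RMSE."""
    best = (float("inf"), None)
    for max_depth, min_leaf in product([None, 5, 10, 20], [1, 2, 5, 10]):
        model = DecisionTreeRegressor(max_depth=max_depth,
                                      min_samples_leaf=min_leaf,
                                      random_state=0).fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        if rmse < best[0]:
            best = (rmse, {"max_depth": max_depth, "min_samples_leaf": min_leaf})
    return best   # (lowest RMSE, corresponding parameter values)
```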

Descriptions of the models

Reduced error pruning tree (REPT)

REPT is a radical simplification of the Decision Tree (DT) based on "if–then" rules; it links a set of predictors (xi) to a predicted variable (y) and searches for suitable parameters among a large number of trees (Wang et al. 2020). The cumulative results of several iterations yield several trees, and the Mean Square Error (MSE) is used in Reduced Error Pruning (REP) to prune the unsuitable tree initially produced by the regression tree (Lalitha et al. 2020). The splitting criteria adopted by the REP Tree are the information gain ratio and minimization of the error variance (Saha et al. 2020). The major benefit of the REP Tree is its ability to reduce the complexity of the DT, which is widely regarded as one of the most significant deficiencies of DT approaches; in addition, the error due to variance is considerably reduced (Abdar et al. 2020; Murwendo et al. 2020). Because the DT contains a large number of nodes, the error at each node is computed and compared for each class, the total aggregate error is recorded, and the branches with the largest errors are pruned; this process is referred to as "divide and conquer" (Li et al. 2020).
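WEKA's REPTree is not directly available in Python, but the core idea of reduced-error pruning, growing a full tree and then pruning it back according to the error on held-out data, can be approximated with scikit-learn's cost-complexity pruning, as in the hedged sketch below. This is an analogue under stated assumptions, not the WEKA implementation used in this study.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def pruned_tree(X, y, seed=0):
    """Grow a full regression tree, then pick the pruning level that minimizes
    the error on a held-out fold (a stand-in for reduced-error pruning)."""
    X_grow, X_prune, y_grow, y_prune = train_test_split(
        X, y, test_size=0.25, shuffle=False)          # hold-out fold used only for pruning

    path = DecisionTreeRegressor(random_state=seed).cost_complexity_pruning_path(X_grow, y_grow)
    best_rmse, best_tree = float("inf"), None
    for alpha in path.ccp_alphas:
        tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=seed).fit(X_grow, y_grow)
        rmse = mean_squared_error(y_prune, tree.predict(X_prune)) ** 0.5
        if rmse < best_rmse:
            best_rmse, best_tree = rmse, tree
    return best_tree
```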

Bootstrap aggregation (bagging)

To enhance the accuracy of individual decision tree models, ensembles of models have been proposed, which greatly improve the accuracy, precision and robustness of decision trees. Bootstrap aggregating, also called Bagging (BA), is one of the best-known such algorithms; it generates multiple models and aggregates them into a single coherent predictor (Sánchez-Medina et al. 2020). BA is an ML ensemble meta-algorithm designed to enhance the accuracy of ML models in both classification and regression; it also reduces variance and helps avoid overfitting. A major contribution of ensemble methods is their ability to decrease the variance of regression and classification errors (Chen et al. 2020) and to overcome the overfitting problem encountered with a single tree (Lee et al. 2020). BA draws each training pattern through Bootstrap sampling, so that n training samples yield n different sets of Out-of-Bag instances (Liu and Chen 2020). A BA model is built in three steps: first, the training dataset is randomly re-sampled to provide a set of training subsets of the same size; second, an individual model is designed and trained on each subset; and finally, a coherent aggregated predictor is constructed by averaging (Chen et al. 2020). BA improves unstable procedures such as ANN, classification and regression trees and subset selection in Linear Regression, and it can improve preimage learning; however, it can mildly degrade the performance of stable methods such as K-nearest neighbors.
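The three bagging steps described above (bootstrap resampling, training one model per resample, averaging) map directly onto scikit-learn's BaggingRegressor, whose default base learner is a decision tree; the sketch below is therefore only an analogue of BA-REPT, with the number of estimators chosen arbitrarily for illustration.

```python
from sklearn.ensemble import BaggingRegressor

# The default base estimator is a decision tree regressor, standing in for REPT here.
bagged = BaggingRegressor(
    n_estimators=50,      # number of bootstrap resamples (illustrative value)
    bootstrap=True,       # sample with replacement; unused rows form the out-of-bag set
    random_state=0,
)
# Usage (with a train/test split as described in the Dataset section):
# bagged.fit(X_train, y_train)
# spi_pred = bagged.predict(X_test)
```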

Disjoint aggregating (Dagging)

Disjoint aggregating (DA, or Dagging) is a meta-algorithm first introduced by Ting and Witten (1997). This meta-classifier creates several disjoint, stratified folds of the data and feeds each chunk to a copy of the supplied base classifier. All the generated base classifiers are combined in a Vote meta-classifier, so predictions are obtained by averaging (Chen et al. 2022; Zhao et al. 2020). DA is suitable for base classifiers whose time behavior is quadratic or worse in the number of training instances. Given a training dataset of N patterns, DA builds M disjoint subsets so that no pattern is shared between subsets; an individual model is built for each subset, and the final prediction is obtained by a voting (averaging) strategy (Ting and Witten 1997).
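A minimal sketch of the Dagging idea is given below: the training data are split into disjoint folds, a copy of the base learner is fitted to each fold, and the fold-level predictions are averaged. The sketch omits the stratification step performed by the WEKA implementation, and a scikit-learn tree stands in for the base learner; the class name and fold count are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class DaggingRegressor:
    """Disjoint aggregating sketch: one base learner per disjoint fold, averaged."""

    def __init__(self, n_folds=10):
        self.n_folds = n_folds

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.models_ = []
        for idx in np.array_split(np.arange(len(y)), self.n_folds):   # disjoint chunks
            self.models_.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        return np.mean([m.predict(np.asarray(X)) for m in self.models_], axis=0)
```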

Additive regression (AR)

The AR model is a type of nonparametric regression related to the Alternating Conditional Expectations (ACE) algorithm first introduced by Friedman and Stuetzle (1981). ACE and AR are flexible because they suffer less from the curse of dimensionality and can therefore be used to predict much more complex phenomena. AR is a general (potentially nonlinear) regression model that includes linear regression as a special case. Suppose that the response variable \({Y}_{i} (i=1, 2, \dots ,n)\) is modeled through unrestricted functions \({f}_{j} (j=1, 2, \dots ,p)\) of the input variables \({X}_{i1}, {X}_{i2},\dots ,{X}_{ip}\), respectively. The model is written as (Xu and Lin 2017):

$$Y_{i} = \mathop \sum \limits_{j = 1}^{p} f_{j} \left( {X_{ij} } \right) + \mu_{i} , \, \mu_{i} \sim iid\left( {0,\sigma^{2} } \right),$$
(12)

where \({f}_{j}({X}_{ij})\) is the nonparametric function that fits the data. The random error term \(({\mu }_{i})\) has zero mean and variance of \({\sigma }^{2}\).
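As a rough illustration of additive regression as commonly implemented in meta-learners, the sketch below fits the base learner repeatedly to the residuals of the current ensemble and adds a shrunken copy at each stage; the number of stages, shrinkage value and base learner are arbitrary choices, not the settings used in this study. (The backfitting form of Eq. (12), which fits one smooth function per predictor, is a related but distinct formulation.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class AdditiveRegression:
    """Forward-stagewise additive regression sketch: each stage fits the residuals."""

    def __init__(self, n_stages=20, shrinkage=0.5):
        self.n_stages, self.shrinkage = n_stages, shrinkage

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y, dtype=float)
        self.mean_ = y.mean()
        residual = y - self.mean_
        self.stages_ = []
        for _ in range(self.n_stages):
            tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
            residual -= self.shrinkage * tree.predict(X)   # shrink toward the remaining error
            self.stages_.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X)
        pred = np.full(len(X), self.mean_)
        for tree in self.stages_:
            pred += self.shrinkage * tree.predict(X)
        return pred
```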

Random committee (RC)

Hybrid models produced by combining two or more artificial intelligence techniques are generally called Committee Machines (CM). The major advantage of CMs is their ability to form a robust model that compensates for the deficiencies of the individual models (Ghiasi-Freez et al. 2012). RC is a CM learning approach for both classification and regression problems and is considered one of the promising ensemble models (Niranjan et al. 2017). In RC, an ensemble of randomizable base regressors or classifiers is developed, in which each member is built on identical data but uses a different random number seed; the final response is obtained by averaging the predictions of the individual models (Witten and Frank 2005; Lira et al. 2007).
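The Random Committee scheme can be sketched as follows: every member is trained on identical data with a unique random seed, and the committee output is the average of the member predictions. A randomized scikit-learn tree (splitter='random') stands in here for the randomizable base learner; the class name and committee size are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class RandomCommittee:
    """Members see identical data but use different seeds; predictions are averaged."""

    def __init__(self, n_members=10):
        self.n_members = n_members

    def fit(self, X, y):
        self.members_ = [
            DecisionTreeRegressor(splitter="random", random_state=seed).fit(X, y)
            for seed in range(self.n_members)          # identical data, unique seeds
        ]
        return self

    def predict(self, X):
        return np.mean([m.predict(X) for m in self.members_], axis=0)
```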

Model performance evaluation

The present study uses the visual method of scatter plots together with several quantitative metrics: RMSE, Mean Absolute Error (MAE), Nash–Sutcliffe Efficiency (NSE), Percentage of BIAS (PBIAS), Coefficient of Persistence (CP) and the ratio of RMSE to the standard deviation of the observations (RSR). These metrics are computed as follows:

$${\text{RMSE}} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {({\text{SPI}}_{e} - {\text{SPI}}_{o} )^{2} } }$$
(13)
$${\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {{\text{SPI}}_{e} - {\text{SPI}}_{o} } \right|}$$
(14)
$${\text{NSE}} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{e} - {\text{SPI}}_{o} )^{2} } }}{{\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{o} - \overline{{{\text{SPI}}}}_{o} )^{2} } }}$$
(15)
$${\text{PBIAS}} = \left( {\frac{{\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{o} - {\text{SPI}}_{e} )} }}{{\sum\nolimits_{i = 1}^{n} {{\text{SPI}}_{o} } }}} \right) \times 100$$
(16)
$${\text{RSR}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{e} - {\text{SPI}}_{o} )^{2} } }}{{\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{o} - \overline{{{\text{SPI}}}}_{o} )^{2} } }}}$$
(17)
$${\text{CP}} = 1 - \frac{{\sum {\left( {{\text{SPI}}_{o(i)} - {\text{SPI}}_{e(i)} } \right)^{2} } }}{{\sum {\left( {{\text{SPI}}_{o(i)} - {\text{SPI}}_{o(i - j)} } \right)^{2} } }}$$
(18)
$$R^{2} = \left[ {\frac{{\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{o} - \overline{{{\text{SPI}}}}_{o} )({\text{SPI}}_{e} - \overline{{{\text{SPI}}}}_{e} )} }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{o} - \overline{{{\text{SPI}}}}_{o} )^{2} } } \sqrt {\sum\nolimits_{i = 1}^{n} {({\text{SPI}}_{e} - \overline{{{\text{SPI}}}}_{e} )^{2} } } }}} \right]^{2},$$
(19)

where SPIo and SPIe are the observed and predicted values, respectively, \(\overline{{{\text{SPI}}}}_{e}\) and \(\overline{{{\text{SPI}}}}_{o}\) are the mean predicted and observed values, j is the prediction lead and n is the number of data points. The lower the RMSE and MAE, the better the model performance. NSE varies between −∞ and 1; the closer it is to 1, the better the model performance. In addition, the closer PBIAS and RSR are to zero, the higher the prediction power of the model. CP compares the performance of the model with that of a naive model that uses the previous observation as the prediction for the current time step; its maximum value of 1 indicates a perfect fit, whereas values below 0 suggest that simply taking the last observed SPI as the forecast is better than using the tested model. R2 varies between 0 and 1, and a model with R2 = 1 exhibits perfect performance.
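For reference, the evaluation metrics of Eqs. (13)–(18) can be computed directly from the observed and predicted SPI series, as in the sketch below (R2 is obtained here as the squared Pearson correlation). The function name and the default lead j = 1 are illustrative.

```python
import numpy as np

def evaluation_metrics(spi_obs, spi_pred, lead=1):
    """Compute RMSE, MAE, NSE, PBIAS, RSR, CP and R2 for observed vs. predicted SPI."""
    o = np.asarray(spi_obs, dtype=float)
    e = np.asarray(spi_pred, dtype=float)
    err = e - o
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nse = 1 - np.sum(err ** 2) / np.sum((o - o.mean()) ** 2)
    pbias = 100 * np.sum(o - e) / np.sum(o)
    rsr = rmse / np.std(o)
    # Coefficient of persistence: compare against the naive "previous observation" forecast
    cp = 1 - np.sum((o[lead:] - e[lead:]) ** 2) / np.sum((o[lead:] - o[:-lead]) ** 2)
    r2 = np.corrcoef(o, e)[0, 1] ** 2
    return {"RMSE": rmse, "MAE": mae, "NSE": nse, "PBIAS": pbias,
            "RSR": rsr, "CP": cp, "R2": r2}
```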

Results and analysis

Effectiveness of input variables

According to the findings of this study, for SPI prediction on the three-month time scale, Tmax had the highest effect on the modeling process (r = 0.61), followed by Tmin (r = 0.60), RHmin (r = 0.51), RHmax (r = 0.43) and rainfall (r = 0.41) (Table 4). Table 5 presents the correlation coefficients between SPI on different time scales and their lags. The results show that as the time scale increases (from 3 to 48 months), the rainfall variable gains importance relative to the other inputs. For example, rainfall had the smallest effect on SPI prediction on the three-month time scale, but the largest effect on the 12-month scale and the second largest on the 48-month scale. Moreover, the effectiveness of each individual variable declined at longer time scales, because prediction becomes more complicated on these scales, especially given the erratic nature of atmospheric variables.

Table 4 Correlation coefficient between input and output variables
Table 5 Correlation coefficients between SPI on different time scales and their lags

Best input scenario

Figure 2 shows the results of the best input scenario selection. The best input scenario clearly differs between models because each model has its own structure. For example, for SPI prediction on the three-month time scale, Input 7 is the most effective scenario for the standalone REP Tree model, Input 9 is the best for the BA-REP Tree model, and Input 8 has the highest effect for the remaining models. For SPI on the 6-month time scale, Input 4 is the optimal scenario with the lowest RMSE, while for the 12- and 48-month time scales, Input 6 has the highest effect on the modeling process.

Fig. 2
figure 2

Selection of the best input scenario

Model evaluation and comparison

Once the most effective input scenario for each model had been determined, the models were developed for each time scale (Fig. 3). According to this figure, on the 3-month time scale, BA-REP Tree exhibits the highest performance (R2 = 0.856), followed by RC-REP Tree (R2 = 0.790), DA-REP Tree (R2 = 0.761), AR-REP Tree (R2 = 0.721) and REP Tree (R2 = 0.690). On the 6-month time scale, BA-REP Tree has the highest performance (R2 = 0.842), followed by DA-REP Tree (R2 = 0.830), RC-REP Tree (R2 = 0.824), AR-REP Tree (R2 = 0.770) and REP Tree (R2 = 0.703). On the 12-month time scale, RC-REP Tree shows the highest performance (R2 = 0.774), followed by BA-REP Tree (R2 = 0.763), DA-REP Tree (R2 = 0.752), AR-REP Tree (R2 = 0.745) and REP Tree (R2 = 0.720). Finally, on the 48-month time scale, BA-REP Tree exhibits the highest performance (R2 = 0.867), followed by DA-REP Tree (R2 = 0.855), AR-REP Tree (R2 = 0.852), RC-REP Tree (R2 = 0.832) and REP Tree (R2 = 0.821). Thus, the best-performing model differs across the 3- to 48-month time scales, reflecting both the data pattern and the model structure.

Fig. 3
figure 3

Scatter plot of the measured and predicted SPI values (blue, green, red and gray colors denote 3, 6, 12 and 48 month time scales)

The R2 metric indicates model performance, yet it has several drawbacks, such as high sensitivity to outliers and extreme values. Another drawback is that it reflects model precision rather than accuracy; for example, a model with a high R2 value may be precise yet perform very poorly (low accuracy). To overcome this problem, several other quantitative metrics were also employed (Table 6).

Table 6 Model evaluation and comparison in the testing period

The results revealed that on the three-month time scale, BA-REP Tree outperformed the other models (RMSE = 0.269, MAE = 0.207, NSE = 0.798 and RSR = 0.449), followed by RC-REP Tree (RMSE = 0.306, MAE = 0.212, NSE = 0.739 and RSR = 0.511), AR-REP Tree (RMSE = 0.348, MAE = 0.246, NSE = 0.662 and RSR = 0.581), DA-REP Tree (RMSE = 0.352, MAE = 0.277, NSE = 0.654 and RSR = 0.588) and REP Tree (RMSE = 0.369, MAE = 0.281, NSE = 0.621 and RSR = 0.616). According to the PBIAS metric, all the models underestimated the SPI values (positive PBIAS).

On the 6-month time scale, DA-REP Tree outperformed the other models (RMSE = 0.387, MAE = 0.313, NSE = 0.759 and RSR = 0.449), followed by AR-REP Tree (RMSE = 0.306, MAE = 0.212, NSE = 0.739 and RSR = 0.511), RC-REP Tree (RMSE = 0.399, MAE = 0.323, NSE = 0.744 and RSR = 0.506), BA-REP Tree (RMSE = 0.399, MAE = 0.332, NSE = 0.743 and RSR = 0.515) and REP Tree (RMSE = 0.468, MAE = 0.364, NSE = 0.649 and RSR = 0.592).

On the time scale of 12 months, RC-REP Tree outperformed the other models (RMSE = 0.313, MAE = 0.205, NSE = 0.745 and RSR = 0.505), followed by BA-REP Tree (RMSE = 0.316, MAE = 0.212, NSE = 0.740 and RSR = 0.510), DA-REP Tree (RMSE = 0.327, MAE = 0.207, NSE = 0.721 and RSR = 0.528), AR-REP Tree (RMSE = 0.332, MAE = 0.209, NSE = 0.714 and RSR = 0.535) and REP Tree (RMSE = 0.353, MAE = 0.243, NSE = 0.675 and RSR = 0.570).

On the 48-month time scale, DA-REP Tree outperformed the other models (RMSE = 0.411, MAE = 0.300, NSE = 0.750 and RSR = 0.500), followed by AR-REP Tree (RMSE = 0.413, MAE = 0.302, NSE = 0.738 and RSR = 0.502), BA-REP Tree (RMSE = 0.453, MAE = 0.346, NSE = 0.697 and RSR = 0.551), RC-REP Tree (RMSE = 0.474, MAE = 0.350, NSE = 0.669 and RSR = 0.575) and REP Tree (RMSE = 0.494, MAE = 0.385, NSE = 0.639 and RSR = 0.601). Furthermore, the PBIAS metric showed that all the developed models underestimated the SPI values (positive PBIAS).

The results demonstrated that all the hybrid algorithms enhanced the performance of the standalone REP Tree algorithm. On the 3-month time scale, the BA, RC, AR and DA models improved the performance of the REP Tree model by about 22.25%, 15.96%, 6.1% and 5.1%, respectively, based on the NSE metric. On the 6-month time scale, the corresponding improvements were about 12.65%, 12.76%, 13.2% and 14.5%; on the 12-month scale, about 8.7%, 9.4%, 5.5% and 6.3%; and on the 48-month scale, 8.3%, 4.5%, 13.4% and 14.8%, respectively. Overall, the BA algorithm provided the largest single enhancement (22.25%).

According to the NSE metric, the standalone REP Tree model exhibited a good performance in all cases (0.65 < NSE < 0.75), whereas in most cases the hybrid models exhibited an excellent performance (0.75 < NSE < 1). Furthermore, comparing the models across the 3-, 6-, 12- and 48-month time scales shows lower model performance at longer time scales; for example, the best models on the 3-, 6-, 12- and 48-month scales (BA-REP Tree, DA-REP Tree, RC-REP Tree and DA-REP Tree) achieved NSE values of 0.798, 0.759, 0.745 and 0.750, respectively.

According to the NSE metric, none of the hybrid models reached an NSE above 0.85, whereas for many other variables NSE values are considerably higher: about 0.980 for suspended sediment load predicted by BA-M5P in a glacierized Andean catchment in Chile (Khosravi et al. 2018b), about 0.94 for bed load transport rate predicted by BA-M5P (Khosravi et al. 2020a), about 0.98 for bridge pier scour depth predicted by the weighted instance handler wrapper (WIHW-Kstar) model (Khosravi et al. 2021a, b, c, d, e), about 0.94 for fluoride concentration predicted by an instance-based K-nearest neighbors model (Khosravi et al. 2020b), about 0.99 for river water salinity predicted by AR-M5P (Melesse et al. 2020), about 0.90 for shear stress distribution predicted by an RF model (Khozani et al. 2020) and about 0.94 for the water quality index predicted by BA-RT (Tien Bui et al. 2020). This indicates that atmosphere-related variables, and drought prediction in particular, are more erratic than other variables and their prediction involves greater uncertainty.

To benchmark the prediction power of the newly developed models, the widely used conventional ML models SVM and ANFIS were also applied (Fig. 4). Based on the results, the models developed in this study have higher predictive power than both the ANFIS and SVM models.

Fig. 4
figure 4

Scatter plot of the measured and predicted SPI values (blue, green, red and gray colors denote 3, 6, 12 and 48 month time scales) for ANFIS and SVM models

One important factor affecting the results is the length of the training and testing datasets. Although there is no standard for what percentage of the data should be used for training and testing, previous studies (Liu et al. 2020; Zhao et al. 2021) showed that the testing dataset should represent approximately 10–40% of the whole dataset, and the 70:30 ratio is often used for training and testing in machine learning models (He et al. 2020, 2021; Chen et al. 2021; Che and Wang 2021; Liang et al. 2022). Another factor with a strong effect on the results is the selection of the best input combination: some input variables have a null or even negative effect on the result and must therefore be identified and removed from the modeling. Although some methods, such as Principal Component Analysis (PCA), derive the best input scenario automatically, Khosravi et al. (2020b) showed that constructing and evaluating different input combinations is more effective than the PCA approach. In the literature, some studies used meteorological variables as inputs to predict SPI, whereas others considered lag-time SPI as the input; the findings of this study confirm that combining SPI lags and meteorological variables as inputs can significantly enhance modeling performance. Because each model has a different structure and each dataset a different pattern, the most accurate model can differ across time scales; overall, the BA, DA and RC models were more effective than the AR model.

Conclusion

This research attempted to predict drought using the SPI as a drought indicator on time scales of 3, 6, 12 and 48 months, using the standalone REP Tree model and its hybrids with the BA, DA, RC and AR algorithms. The overall findings of this study are summarized below.

  1. Meteorological variables alone failed to predict SPI accurately.

  2. SPI with lag time as an input was much more effective than the meteorological variables.

  3. Combining lag-time SPI and meteorological variables as inputs improved the modeling prediction power.

  4. The best input scenario varied across time scales.

  5. The most accurate model did not perform equally well on all time scales.

  6. The standalone model performed well; however, the hybrid models exhibited excellent performance in most cases.

  7. Modeling performance decreased as the time scale increased from 3 to 48 months.

  8. The BA, DA and RC models were all much more effective than the AR model.

  9. As the time scale increased from 3 to 48 months, the contribution of Tmax to SPI prediction decreased and that of rainfall increased (based on the correlation coefficient).

  10. All the newly developed models performed better than the conventional ANFIS and SVM models.