1 Introduction

Drought is a recurring extreme climate event characterized by below-average precipitation in a given region over a period of months to years (Dai 2011). Drought is one of the most damaging natural disasters, has widespread and detrimental impacts on hydrology, agriculture, and the environment, and causes enormous economic loss (Botai et al. 2016; Chen et al. 2016; Li and Zhou 2015). Moreover, global warming has resulted in increased risk of drought-related stresses for natural and human systems (Touma et al. 2015).

Among natural disasters, drought is considered the most complex because its onset and end are difficult to identify. Hence, there is often confusion about whether a drought exists, because drought is difficult to define precisely (Wilhite 2000). Furthermore, the influence of drought often accumulates gradually over time and may linger long after the drought is over. Characterizing drought is therefore both crucial and difficult. Drought indices are regarded as valid measures for quantifying the characteristics of drought, such as intensity, magnitude, duration, severity, and spatial extent. The indices reflect different events and conditions and are easier to use than raw indicator data (Zargar et al. 2011). Researchers have developed more than 150 drought indices so far, which correspond to different types of drought, including meteorological, agricultural, and hydrological drought (Niemeyer 2008). Several of them are particularly important and widely used under global warming scenarios, such as the Palmer Drought Severity Index (PDSI; Wayne 1965), the Standardized Precipitation Index (SPI; McKee et al. 1993), the Reconnaissance Drought Index (RDI; Tsakiris and Vangelis 2005), the Standardized Precipitation Evapotranspiration Index (SPEI; Vicente-Serrano et al. 2010), and the Water Surplus Variability Index (WSVI; Gocic and Trajkovic 2014b). The PDSI is a landmark drought index that is still widely used and is based on the supply and demand concepts of the water balance equation. It considers not only precipitation but also evapotranspiration and soil moisture, and computes four terms in the water balance equation: evapotranspiration, runoff, soil recharge, and moisture. The SPI is based solely on precipitation data, can quantify drought levels at different timescales, and has been put forward by the World Meteorological Organization (WMO) as a universal drought index.
The RDI calculates the aggregated deficit between precipitation and the evaporative demand of the atmosphere based on the ratio between the aggregated quantities of precipitation and potential evapotranspiration (PET). Like the SPI, it can also be used to estimate drought severity at different timescales. The SPEI is another drought index that considers precipitation and PET. It is based on the monthly (or weekly) difference between precipitation and PET, which represents a simple climatic water balance and is then adjusted using a three-parameter log-logistic distribution. Moreover, it is multiscalar, like the SPI and RDI. The WSVI is similar to the SPEI and follows the concept of the RDI; it shows good agreement with the SPI, RDI, and SPEI for drought monitoring, especially in humid and sub-humid locations (Gocic and Trajkovic 2014a). Drought indices are indispensable tools for explaining the severity of drought events and are used extensively in drought modeling and forecasting.

In recent years, researchers have increasingly used data-driven techniques to model hydrological phenomena. Karimi et al. (2018b) established models using gene expression programming and support vector machine techniques for forecasting daily streamflow values, and evaluated local (within-station) and external (cross-station) data management scenarios. For simulating the Leaf Area Index, Karimi et al. (2018a) proposed externally trained gene expression programming and random forest models as valid alternatives to locally trained models. Additionally, intelligent algorithms can improve the performance of traditional models of hydrological phenomena (Azad et al. 2019). With the development of machine learning technology and drought indices, a number of studies have combined drought indices with other data (e.g., meteorological data and remotely sensed data) to establish drought prediction models based on various machine learning techniques, such as multivariate linear regression (Ortegren et al. 2011; Xing et al. 2016), artificial neural networks (Ali et al. 2017; Byakatonda et al. 2016), support vector machines (Ganguli and Reddy 2014; Gill et al. 2006), and ensemble methods (Belayneh et al. 2016; Rhee and Im 2017). Deo and Şahin (2015) developed artificial neural network models by optimizing hidden neurons, activation functions, and different combinations of training and testing algorithms for predicting the monthly SPEI in eastern Australia. Maca and Pech (2016) found that the integrated neural network model performed better than the feed-forward multilayer perceptron in predictions of the SPEI and SPI. In predicting the streamflow drought index of the Latian watershed in Iran, the support vector machine model was more efficient than the artificial neural network model (Borji et al. 2016).
Among the various machine learning techniques, penalized linear regression and ensemble methods are two of the most effective and widely used families of algorithms for the vast majority of predictive analytics (Caruana et al. 2008; Caruana and Niculescu-Mizil 2006). Drought forecasting is a function approximation problem, which is a subset of supervised learning. Linear regression is pervasive in data-driven predictive models that solve the function approximation problem, and ordinary least squares regression (OLS) is the most commonly used linear regression algorithm. However, OLS has problems such as becoming trapped in a local optimum and a high volume of computations; therefore, scientists and engineers rarely use the OLS algorithm to establish drought prediction models at present. Penalized linear regression represents a relatively recent development in ongoing research to improve on OLS and includes ridge regression (RR) and lasso regression (LR). In ensemble methods, which are currently some of the most effective predictive models, a set of learning algorithms is developed and combined to solve a problem, whereas in conventional learning approaches, a single learning algorithm is trained on the data (Zhou 2012). Bootstrap aggregation (Bagging), Boosting, and Random Forests (RF) are some of the most popular ensemble methods that can be used to improve the performance of a forecasting model. Zhang et al. (2017) compared seven data mining algorithms and found that the RF and AdaBoost methods resulted in higher accuracies in most cases. Moreover, the RF performed better than the AdaBoost for an unbalanced dataset in a multi-class task. For predicting drought impacts quantified from text-based reports, Bachmair et al. (2017) tested the predictive performance of three data-driven models and found that the RF model generally performed better than logistic regression and zero-altered negative binomial regression.
RF can also be used to develop short-term drought prediction models that produce high-resolution drought prediction maps for a very short timescale over East Asia (Park et al. 2018). With good impact data coverage, the RF machine learning approach proved to be a suitable tool for drought monitoring and early warning in Germany and the UK (Bachmair et al. 2016).

However, there are scarcely any drought prediction models based on penalized linear regression in previously published studies, and the suitability of ensemble methods for forecasting the SPEI has not yet been systematically assessed. Hence, it is worth evaluating the performance of data-driven models that are based on penalized linear regression and ensemble methods for forecasting the SPEI. Notably, the development of optimum drought prediction models for Northeast China is a new research step. In this study, we explore the ability of several machine learning models based on penalized linear regression and decision tree (DT)-based ensemble methods to predict the SPEI at timescales of 3, 6, 12, and 24 months in Northeast China. The objectives of this study were to (1) develop drought prediction models based on two representative algorithms of penalized linear regression (in this case the RR and LR algorithms) and compare their forecasting performance with that of the OLS model; (2) establish drought prediction models using DT-based ensemble methods (in this case the AdaBoost and RF algorithms) and compare their performance with that of the DT model; and (3) determine the optimum drought prediction model by comparing the forecasting performance of the penalized linear regression and DT-based ensemble methods, and investigate its performance.

2 Materials and methods

2.1 Study region

Northeast China is a vast geographical region spanning longitudes from 111 to 135°E and latitudes from 38 to 53°N; it includes Heilongjiang Province, Liaoning Province, Jilin Province, and the eastern part of the Inner Mongolia Autonomous Region (Fig. 1). Northeast China encompasses an area of 1.45 million square kilometers and has complex landforms, with the Changbai Mountains to the east, the Lesser Khingan Mountains to the north, and the Great Khingan Mountains to the west. The region is dominated by a typical temperate monsoon climate with four distinct seasons, hot and rainy summers, and cold and dry winters. The climate zones change from a humid zone to a semiarid zone from the southeast to the northwest, and the average annual precipitation is in the range of 300–1000 mm.

Fig. 1 Spatial distribution of the study locations for meteorological stations in Northeast China

Northeast China is a major agricultural region and plays a critical role in maintaining national food security. Additionally, Northeast China has well-developed grassland-based animal husbandry and abundant forest resources. Drought is one of the most damaging and disastrous hazards in Northeast China (Yu et al. 2014), and the risk of drought is increasing (Kong et al. 2015). Many researchers have focused on the analysis of drought characteristics (Wang et al. 2014, 2015) and the impact of drought on agriculture (Peng et al. 2012; Yin et al. 2016) in Northeast China.

2.2 Data

2.2.1 Meteorological data

Meteorological data from 1961 to 2016 for 118 meteorological stations in Northeast China (Fig. 1) were provided by the China Meteorological Data Service Center (CMDC; https://data.cma.cn/). The data included daily observations of minimum, average, and maximum air temperatures, precipitation, relative humidity, sunshine duration, wind speed (at 10-m height), ground surface temperature (at 0-cm height), and atmospheric pressure. We calculated the monthly accumulative meteorological data by summing the daily meteorological data and checked their quality according to Deo and Şahin (2015).

2.2.2 The standardized precipitation evapotranspiration index

Because of the complexity of drought, it is difficult to establish a unique and universally accepted drought index for a diverse group of users. However, it is crucial to select a relevant drought index to monitor and forecast drought severity. The PDSI has several deficiencies, including the strong influence of the calibration period, limited applicability in locations other than the US Great Plains conditions for which it was calibrated, relatively sophisticated computation, a lack of comparability between diverse climatological regions, and limited applicability to regions with extreme climates (Guttman 1998; Zargar et al. 2011). Although several modified drought indices have been developed to address the shortcomings of the PDSI, such as the self-calibrating Palmer Drought Severity Index (SC-PDSI; Wells et al. 2004), the fixed temporal scale remains the main shortcoming of the PDSI in comparison with other drought indices that can be calculated at different timescales. The SPI allows for comparison of drought severity through time and space, but it does not include the effects of temperature variability on drought severity. Under global warming scenarios, the inability of the SPI to capture an increased evaporative demand is a significant deficiency. Both the SPEI and the RDI are more sensitive and suitable in a changing environment because they take into account the effect of PET on drought severity and enable identification of different drought types. However, there are some differences between them. The essential difference is the calculation approach: the RDI is based on the quotient of precipitation and PET, whereas the SPEI uses the difference between them. Because the quotient of precipitation and PET is used as the input to standardization, the RDI gives no valid values when PET is equal to 0.
Besides, the RDI is insensitive to variations in the magnitude of precipitation and PET because of the way it combines these drought drivers (Vicente-Serrano et al. 2015). The WSVI is a newly developed drought index that has shown good agreement with the SPI, RDI, and SPEI in identifying dry and wet periods. However, the performance and limitations of this index need further verification because few studies have evaluated drought conditions using the WSVI. In contrast to the aforementioned drought indices, the SPEI does not have distinct shortcomings; it combines a multiscalar character with the capacity to integrate potential evaporation and thereby better represents the local water balance. As global warming intensifies, the spread of drought and the losses it causes will increase in many regions (Cook et al. 2014). Effective monitoring and prediction of drought are essential for reducing and mitigating the impacts on hydrology, agriculture, and the environment. Predictive models based on a drought index may significantly help decision-makers to assess the risk of drought occurrences efficiently and to implement appropriate drought mitigation strategies.

To apply the machine learning-based drought forecasting models described above, we computed the SPEI from monthly meteorological data following the methodology of Vicente-Serrano et al. (2010), but we used the Penman–Monteith (PM) method instead of the Thornthwaite method to estimate the PET. The Thornthwaite method, which has fewer data requirements, is the most straightforward approach to calculating PET, whereas the PM method incorporates the effects of solar radiation, temperature, wind speed, and relative humidity. Although the method used to calculate the PET is not critical for the calculation of the SPEI, Beguería et al. (2014) recommend the more robust PM equation when the data needed for this equation are available. The stations considered here are official sites with the complete weather data required by the PM equation, so we selected the PM method. In this study, the SPEI at timescales of 3, 6, 12, and 24 months was computed using the freely available SPEI package (version 1.7; https://cran.r-project.org/web/packages/SPEI/index.html) in R.
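The first stage of an SPEI calculation, accumulating the monthly climatic water balance (precipitation minus PET) over a trailing window of k months, can be sketched as follows. This is only an illustration and not the R SPEI package used in the study: the function name `accumulated_water_balance` and the sample values are hypothetical, and the subsequent log-logistic standardization step is omitted.

```python
def accumulated_water_balance(precip, pet, scale):
    """Sum the monthly water balance D = P - PET over a trailing window.

    precip, pet: equal-length lists of monthly totals (mm).
    scale: window length in months (e.g. 3, 6, 12, or 24).
    Entries for which the window is incomplete are returned as None.
    """
    if len(precip) != len(pet):
        raise ValueError("precip and pet must have the same length")
    balance = [p - e for p, e in zip(precip, pet)]
    out = [None] * len(balance)
    for i in range(scale - 1, len(balance)):
        out[i] = sum(balance[i - scale + 1 : i + 1])
    return out


# Hypothetical monthly values in mm, for illustration only.
precip = [10.0, 20.0, 60.0, 120.0, 90.0, 30.0]
pet = [15.0, 30.0, 70.0, 100.0, 95.0, 50.0]
d3 = accumulated_water_balance(precip, pet, scale=3)
```

In a full SPEI computation, the accumulated series would next be fitted to a three-parameter log-logistic distribution and transformed to standardized values.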

2.3 Drought forecasting model development

In this study, we explored drought states ranging from short-term to long-term. Therefore, the SPEI at 3-, 6-, 12-, and 24-month timescales (SPEI3, SPEI6, SPEI12, and SPEI24) was used for the analyses. To forecast the SPEI values at each timescale, a total of 10 input parameters were used to develop the drought prediction models: monthly precipitation, maximum temperature, minimum temperature, average temperature, relative humidity, sunshine duration, wind speed (at 10-m height), ground surface temperature (at 0-cm height), atmospheric pressure, and the synchronous SPEI value. The lag time of the models is one month; that is, the SPEI value of the next month was the target variable predicted from the ten input parameters of the current month. For example, to predict SPEI3 for a target month, the models used the meteorological parameters and SPEI3 of the previous month as input. We retained the available input data for 54 years (i.e., 1963–2016) to ensure the integrity and consistency of the dataset. We partitioned the dataset chronologically into two parts: the first 74% (i.e., 1963–2002) was the training dataset, and the final 26% (i.e., 2003–2016) was the testing dataset. All of the machine learning algorithms in this study are openly accessible. They were implemented with the Python library Scikit-learn (Pedregosa et al. 2011), which integrates a wide range of state-of-the-art machine learning algorithms for supervised and unsupervised problems. We scaled and translated each input feature individually so that it lies between zero and one on the training dataset, using the MinMaxScaler class in the preprocessing module of Scikit-learn.
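The data preparation described above (one-month lag, chronological split, and min-max scaling fitted on the training period) could be sketched roughly as follows. The array layout, variable names, and random data are assumptions made for illustration; only `MinMaxScaler` is named in the text, and assigning samples to a period by the year of the target month is one possible reading of the split.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for one station: 54 years x 12 months, 10 columns
# (nine meteorological variables plus the synchronous SPEI in the last column).
n_months = 54 * 12
data = rng.normal(size=(n_months, 10))
years = np.repeat(np.arange(1963, 2017), 12)

# One-month lag: the inputs of month t predict the SPEI of month t + 1.
X = data[:-1]
y = data[1:, -1]
target_years = years[1:]

# Chronological split by the year of the target month (an assumption).
train = target_years <= 2002
test = ~train
X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]

# Fit the scaler on the training period only, then apply it to both parts.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Because the scaler is fitted on the training period alone, test-period values may fall slightly outside the interval from zero to one.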

2.3.1 The ordinary least squares regression models

Linear regression is a straightforward and useful approach for predicting a quantitative response. In this study, the first drought forecasting model uses only OLS to predict the SPEI. Its purpose is to serve as a baseline that reveals the benefits of the penalized linear regression models for building forecasting models from data.

2.3.2 The penalized linear regression models

Selecting the variables of a linear model for a given response is usually a difficult task. Researchers may mistakenly select highly correlated variables on the basis of their p values, although these variables are not necessarily useful predictors. Moreover, other irrelevant variables may be included in the model, which leads to unnecessary complexity and reduced interpretability. If the number of observations is not much larger than the number of variables, the estimates can vary considerably, resulting in overfitting (an increased likelihood obtained by adding more parameters, but poorer predictions on future observations not used in the model training) (Pereira et al. 2016). The OLS model is therefore sometimes prone to overfitting. Penalized linear regression methods can avoid the overfitting problem by shrinkage or regularization, which involves fitting a model with all the predictors while shrinking the estimated coefficients towards zero relative to the classical estimates. Penalized linear regression may improve the overall prediction accuracy by trading a small increase in bias for a substantial decrease in the variance of the predictions. The RR and LR are two of the best-known penalized linear regression methods.

The RR, introduced by Hoerl and Kennard (1970), is very similar to the OLS, except that the coefficients are estimated by minimizing a slightly different quantity. The LR (Tibshirani 1996) is another useful algorithm, which shrinks some coefficients and sets others to 0. The difference between the RR and LR is the measure that each one applies to the vector of linear coefficients in the penalty term. The RR uses the squared Euclidean norm, whereas the LR uses the sum of the absolute values, which is called the taxicab or Manhattan norm. The different coefficient penalty functions cause some important and useful changes in the solutions. To ensure a fair comparison and the generalization of each model, we made sure that the RR and LR models were estimated using the same tenfold cross-validation. We used the default parameter values and found that changing these values did not make a noticeable difference in our predictions.
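A rough sketch of how OLS, RR, and LR models might be compared under the same tenfold cross-validation in Scikit-learn is given below. The synthetic data and the `alpha` values are assumptions chosen for illustration (the study reports using default parameter values), and the specific estimator classes are not named in the text.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic data (illustration only): 3 informative features out of 10.
X = rng.uniform(size=(200, 10))
true_coef = np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

models = {
    "OLS": LinearRegression(),
    "RR": Ridge(alpha=1.0),   # penalty on the squared Euclidean norm of w
    "LR": Lasso(alpha=0.01),  # penalty on the sum of absolute values of w
}

# The same tenfold splitter is reused so every model sees identical folds.
cv = KFold(n_splits=10)
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = float(-scores.mean())

# The L1 penalty can set some coefficients exactly to zero.
lasso_coef = Lasso(alpha=0.01).fit(X, y).coef_
n_zero = int(np.sum(lasso_coef == 0.0))
```

Inspecting `n_zero` illustrates the variable-selection behaviour of the LR, while the RR shrinks coefficients without typically setting them exactly to zero.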

2.3.3 The decision trees models

The DT is a non-parametric supervised learning method used to develop either a classification or a regression model. The DT algorithms build a model in the form of a tree structure that predicts the value of a target variable by learning a set of if–then–else decision rules inferred from the data features. When using the DT, the model splits into branches that represent the choices at each decision. The procedure is repeated recursively until terminal nodes are reached, which denote the result of following a combination of decisions. The most frequently used DT algorithms include C5.0, classification and regression trees (CART), the quick unbiased efficient statistical tree (QUEST), and the chi-squared automatic interaction detector (CHAID). In this study, the DT model was based on the CART algorithm, and we used the default settings. In practice, the trees are usually grown to their maximum size before a pruning step is applied to reduce overfitting (Reiss et al. 2015), and they are also grouped in ensembles to improve the stability of the process.

2.3.4 The ensemble methods

Ensemble methods are effective learning algorithms that combine multiple learning algorithms to obtain better predictive performance (Dietterich 2000). The principle of ensemble methods is to create a stronger learner by combining multiple weaker learners, and a large variety of ensemble methods exist that differ in the weak learners and in the way they are combined. Ensemble methods employ a hierarchy of two algorithms. The low-level algorithm is a base learner, and the upper-level algorithm manipulates the inputs to the base learners so that the models they generate are somewhat independent. Many different algorithms can conceivably be used as base learners, but the DT is one of the most widely accepted. Among the various upper-level algorithms, Bagging, Boosting, and RF are among the most widely applied.

The Bagging generates several training datasets by bootstrap sampling the original training data and then trains a base learner on each of these samples. Finally, in regression problems, the Bagging averages the predictions of the resulting models (Breiman 1996). The Bagging can perform quite well as long as it is used with relatively unstable learners, because unstable learners ensure the diversity of the ensemble despite only minor variations between the bootstrap training datasets (Lantz 2013). Thus, DTs are often used as base learners because of their instability. Strictly speaking, the RF is an extension of the Bagging: it generates its sequence of models by training them on subsets of the full training data in the same manner as the Bagging algorithm, and the principal difference from the Bagging is the incorporation of randomized feature selection (Zhou 2012). As a result of this randomness, the bias of the forest usually increases slightly relative to the bias of a single non-random tree; however, due to averaging, its variance decreases, usually more than compensating for the increase in bias. Hence, the RF is an overall more efficient predictive model (Breiman 2001).

The Boosting is a general approach for improving the accuracy of weak learners to attain the performance of stronger learners (Freund 1995). Like the Bagging, the Boosting takes a base learning algorithm and invokes it many times with different training sets. Nevertheless, the Boosting does not involve bootstrap sampling and is explicitly constructed to generate complementary learners (James et al. 2013). Adaptive boosting (AdaBoost), introduced by Freund and Schapire (1997), is one of the most important Boosting algorithms because it has a solid theoretical foundation, very accurate prediction, great simplicity, and comprehensive and successful applications (Wu et al. 2008). The core principle of the AdaBoost is to fit a sequence of weak learners on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction (Trevor et al. 2009). We used the ensemble module in Scikit-learn for the RF and AdaBoost models. All parameters were left at their default settings, except that the maximum number of estimators was set to 100 for these models.
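As an illustration of the setup described above, the sketch below fits a single DT, an AdaBoost model, and an RF model with 100 estimators on synthetic data. The use of `DecisionTreeRegressor`, `AdaBoostRegressor`, and `RandomForestRegressor` is an assumption (the text names only the ensemble module), and the data, split, and `random_state` values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic nonlinear data with 10 features (illustration only).
X = rng.uniform(-1.0, 1.0, size=(400, 10))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

models = {
    # Single CART-style regression tree with default settings.
    "DT": DecisionTreeRegressor(random_state=0),
    # Boosting: weak trees fitted sequentially on reweighted data.
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=0),
    # Bagging plus randomized feature selection, predictions averaged.
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(np.mean((y_test - pred) ** 2)))
```

Comparing the entries of `rmse` on held-out data mirrors, in miniature, the comparison between the simple DT and the DT-based ensembles; results on synthetic data say nothing about the study's actual findings.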

Performance measures

The following measures of goodness of fit were used in this study to evaluate the forecast performance of all the models above:

$${\text{The coefficient of determination }}\left( {R^{2} } \right){\text{ }} = 1 - \frac{{\mathop \sum \nolimits_{{i = 1}}^{N} \left( {y_{i} - \hat{y}_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{{i = 1}}^{N} \left( {y_{i} - \bar{y}} \right)^{2} }},$$
(1)

where \(\bar{y} = \frac{1}{N}\mathop \sum \nolimits_{{i = 1}}^{N} y_{i}\) is the mean of the observed values, \(y_{i}\) is the observed value, \(\hat{y}_{i}\) is the forecasted value, and N is the number of data points. The coefficient of determination measures the degree of association between the observed and predicted values.

$${\text{The Root Mean Squared Error }}\left( {{\text{RMSE}}} \right){\text{ }} = \sqrt {\frac{{{\text{SSE}}}}{N}} ,$$
(2)

where SSE is the sum of squared errors, and N is the number of samples used. SSE is given by:

$${\text{SSE}} = \mathop \sum \limits_{{i = 1}}^{N} \left( {y_{i} - \hat{y}_{i} } \right)^{2} ,$$
(3)

with the variables already having been defined.

$${\text{The Mean Absolute Error }}\left( {{\text{MAE}}} \right){\text{ }} = \mathop \sum \limits_{{i = 1}}^{N} \frac{{\left| {\hat{y}_{i} - y_{i} } \right|}}{N}$$
(4)

The MAE is used to measure how close forecasted values are to the observed values. It is the average of the absolute errors.
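The three measures can be written out in a few lines of plain Python, as sketched below. The function names and sample values are hypothetical, and the coefficient of determination follows the conventional definition of one minus the ratio of the sum of squared errors to the total sum of squares.

```python
import math


def sse(y_obs, y_pred):
    """Sum of squared errors between observed and forecasted values."""
    return sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))


def r2(y_obs, y_pred):
    """Coefficient of determination: 1 - SSE / total sum of squares."""
    mean_obs = sum(y_obs) / len(y_obs)
    sst = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - sse(y_obs, y_pred) / sst


def rmse(y_obs, y_pred):
    """Root mean squared error: sqrt(SSE / N)."""
    return math.sqrt(sse(y_obs, y_pred) / len(y_obs))


def mae(y_obs, y_pred):
    """Mean absolute error: average of |forecast - observation|."""
    return sum(abs(p - o) for o, p in zip(y_obs, y_pred)) / len(y_obs)


# Hypothetical observed and forecasted values, for illustration only.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
scores = {
    "R2": r2(y_obs, y_pred),
    "RMSE": rmse(y_obs, y_pred),
    "MAE": mae(y_obs, y_pred),
}
```

For these sample values, SSE is 0.5 and the total sum of squares is 5.0, so the coefficient of determination is 0.9, the RMSE is the square root of 0.125, and the MAE is 0.25.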

3 Results

In this study, we developed the drought forecasting models based on the OLS, RR, LR, DT, AdaBoost, and RF to predict the SPEI at different timescales of 3, 6, 12, and 24 months for 118 meteorological stations in Northeast China. In the following sections, we will evaluate the forecasting performance of the models to determine if the penalized linear regression and DT-based ensemble methods can provide performance improvements. Subsequently, we will identify the optimum model among all the drought forecasting models and assess the feasibility by analyzing its forecasting performance for each station in detail.

3.1 Penalized linear regression models

Figure 2 shows the probability density distributions of the RMSE for the LR, RR, and OLS models at timescales of 3, 6, 12, and 24 months, which allows a comparison of the forecasting performance of the penalized linear regression and OLS models. As can be seen from Fig. 2, the probability density distributions of the RMSE for the LR and RR models were closer to zero than that for the OLS model. This indicates that the forecast deviations of the penalized linear regression models were lower than those of the OLS model. In particular, the RMSE distributions for the LR model at each timescale were all noticeably lower than those for the other models. The probability density distributions of the MAE at the different timescales are similar to those shown in Fig. 2, in that the forecasting performance of the LR model was better than that of the other models. Table 1 lists the statistical properties of the performance measures of the OLS, RR, and LR models for predicting the SPEI at timescales of 3, 6, 12, and 24 months. For the forecasts of SPEI3, the RR model had the lowest average RMSE of 0.3960 (ranging from 0.2765 to 0.6034) and the highest average R2 of 0.8302 (ranging from 0.4471 to 0.9112). The LR model exhibited a good performance similar to that of the RR and had the lowest average MAE of 0.3236 (ranging from 0.2211 to 0.4814). The LR model exhibited the best forecasting performance among all models for predicting SPEI6, SPEI12, and SPEI24, with the lowest average RMSE, the lowest average MAE, and the highest average R2, along with a small range of the RMSE, MAE, and R2. In summary, these results demonstrate that the penalized linear regression models were more efficient than the OLS model for predicting the SPEI in Northeast China at timescales of 3, 6, 12, and 24 months.

Fig. 2 Probability density distribution of the RMSE for the prediction of (a) SPEI3, (b) SPEI6, (c) SPEI12, and (d) SPEI24 using the OLS, RR, and LR models during the test period (2003–2016) for the 118 meteorological stations in Northeast China

Table 1 RMSE, MAE, and R2 statistics of the 118 meteorological stations for the prediction of the SPEI3, SPEI6, SPEI12, and SPEI24 using the OLS and penalized linear regression models

A comparison of the number of stations that exhibited the highest R2 values for the LR, RR, and OLS models indicates that the LR model was the optimum model for predicting the SPEI at different timescales for most of the meteorological stations (Fig. 3). For the prediction of SPEI3, the LR model had the highest R2 at 51.7% of the stations, which was a much higher proportion than for the OLS (32.2%) and RR (16.1%) models. The forecasting performances of the models for the mid- and long-term SPEIs were similar with regard to the R2 values. For the prediction of SPEI6, SPEI12, and SPEI24, the percentages of the stations for which the LR model was the optimum model were 47.5%, 80.5%, and 78%, respectively. These results suggest that the LR is the optimum model among the penalized linear regression models for predicting the SPEI at timescales of 3, 6, 12, and 24 months in Northeast China.

Fig. 3 Percentage of the stations with the highest R2 for the prediction of (a) SPEI3, (b) SPEI6, (c) SPEI12, and (d) SPEI24 using the OLS, RR, and LR models

Interestingly, Fig. 4 shows a steady decrease in the forecast deviation as the timescale of the SPEI increased. At the same time, the range of the RMSE and MAE decreased as the timescale of the SPEI increased. For example, the ranges of RMSE for SPEI3, SPEI6, SPEI12, and SPEI24 based on the LR model were 0.3860, 0.2091, 0.0679, and 0.0664, respectively. The only exception was that the range of the MAE was slightly larger for the prediction of SPEI24 (0.1102) using the OLS model than for the prediction of SPEI12 (0.1090). These findings indicate a correlation between forecast deviation and the timescale of the SPEI.

Fig. 4 Boxplot of the RMSE and MAE for the prediction of the SPEI at multiple timescales using the OLS, RR, and LR models

3.2 Ensemble methods

The DT, AdaBoost, and RF models were developed to predict SPEI3, SPEI6, SPEI12, and SPEI24 for the 118 meteorological stations in Northeast China. Figure 5 shows the forecasting performance evaluated by the RMSE. A comparison of the probability density distributions of the RMSE and MAE of the simple DT model with those of the DT-based ensemble models shows that the ensemble models had lower forecast deviations than the simple DT model. The probability density distributions of the RMSE and MAE of the RF model were closest to zero, followed by those of the AdaBoost model and the simple DT model. For the prediction of SPEI3, the average RMSE and MAE of the RF model were 0.4745 and 0.3756, respectively; these values were lower than those of the AdaBoost model (average RMSE of 0.5526 and average MAE of 0.4361) and the DT model (average RMSE of 0.8161 and average MAE of 0.6437). The RF model also exhibited a lower forecast deviation than the AdaBoost and DT models in predicting SPEI6, with an average RMSE of 0.3075 and an average MAE of 0.2274, whereas the AdaBoost model had an average RMSE of 0.3947 and an average MAE of 0.3019 and the DT model had an average RMSE of 0.5725 and an average MAE of 0.4282. For the predictions of SPEI12 and SPEI24, the RF model again exhibited the best forecasting performance in terms of the RMSE and MAE; its average RMSE values were 0.1674 and 0.1537 and its average MAE values were 0.1120 and 0.0996, respectively. As with the short-term SPEIs, the average RMSE and MAE were lower for the AdaBoost model than for the DT model for the prediction of SPEI12 and SPEI24. The AdaBoost model had average RMSE values of 0.2517 and 0.2214 and average MAE values of 0.1869 and 0.1614, respectively. The DT model had average RMSE values of 0.3303 and 0.2598 and average MAE values of 0.2222 and 0.1771, respectively (Table 2).
The results (Table 2) indicate that the DT-based ensemble methods perform better than the simple DT model.
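For reference, the three evaluation statistics used throughout this section can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the toy values are hypothetical, and R2 is assumed here to be the coefficient of determination (it could instead be defined as the squared correlation coefficient).

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy SPEI-like values, invented for illustration only (not data from the study).
observed = [-1.0, 0.0, 1.0, 2.0]
predicted = [-0.5, 0.0, 0.5, 2.0]

print(rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))
```

Lower RMSE and MAE values and a higher R2 value indicate a closer agreement between the predicted and observed SPEI.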

Fig. 5

Probability density distribution of the RMSE for the prediction of (a) SPEI3, (b) SPEI6, (c) SPEI12, and (d) SPEI24 using the DT, AdaBoost, and RF models during the test period (2003–2016) for the 118 meteorological stations in Northeast China

Table 2 RMSE, MAE, and R2 statistics of the 118 meteorological stations for the prediction of SPEI3, SPEI6, SPEI12, and SPEI24 using the DT, AdaBoost, and RF models

To further compare the forecasting performances of the DT, AdaBoost, and RF models, we used the R2 value to determine the optimum model for each meteorological station in Northeast China. Table 3 shows the summary statistics for the DT, AdaBoost, and RF models. The RF model was the optimum model for the majority of the meteorological stations at all four timescales (3, 6, 12, and 24 months). For the prediction of SPEI3 and SPEI6, the RF model was the optimum model for 92.4% and 97.5% of the meteorological stations, respectively. Notably, the percentage was 100% for the prediction of both SPEI12 and SPEI24. These results demonstrate that the RF model was the optimum model among the DT-based models in this study.
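The station-wise selection summarized in Table 3 can be sketched as a simple tally: for each station, the model with the highest R2 is taken as the optimum, and the counts are converted to percentages. The station names, R2 values, and the function name below are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter

# Hypothetical R2 values for three made-up stations (not data from the study).
r2_by_station = {
    "station_A": {"DT": 0.61, "AdaBoost": 0.78, "RF": 0.84},
    "station_B": {"DT": 0.55, "AdaBoost": 0.81, "RF": 0.79},
    "station_C": {"DT": 0.70, "AdaBoost": 0.83, "RF": 0.90},
}

def tally_best_models(r2_by_station):
    """Return {model: (number of stations, percentage)} for the model with the highest R2."""
    best = Counter(max(scores, key=scores.get) for scores in r2_by_station.values())
    total = len(r2_by_station)
    return {model: (count, 100.0 * count / total) for model, count in best.items()}

summary = tally_best_models(r2_by_station)
for model, (count, percent) in summary.items():
    print(f"{model}: {count} stations ({percent:.1f}%)")
```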

Table 3 Number and percentage of the meteorological stations with the highest R2 for the prediction of the SPEI using the DT, AdaBoost and RF models

To investigate the correlation between the forecast deviation and the SPEI timescale, we compared the intercorrelations among the RMSE and MAE of the DT, AdaBoost, and RF models for the prediction of the SPEI at different timescales (Fig. 6). The distributions of the RMSE and MAE of the AdaBoost and RF models show a decreasing trend as the timescale of the SPEI increases. In contrast, the distributions of the RMSE and MAE of the DT model do not decrease as the timescale increases, and there is no significant correlation between the distributions and the timescale. Overall, these results indicate that there is a correlation between the forecast deviation and the timescale of the SPEI for the DT-based ensemble methods but not for the simple DT model.

Fig. 6

Boxplot of the RMSE and MAE for the prediction of the SPEI at multiple timescales using the DT, AdaBoost, and RF models

3.3 Comparison of penalized linear regression and ensemble methods

To assess the forecasting performance of the penalized linear regression and ensemble methods in this study, we compared the LR and RF models because they were the optimum models of the two respective methods. Figure 7 compares the distributions of the RMSE of the LR and RF models at the timescales of 3, 6, 12, and 24 months. The violin plot shows that the distribution ranges of the RMSE are smaller for the LR model than for the RF model at each timescale. For instance, the ranges of the RMSE of the LR model at the timescales of 3, 6, 12, and 24 months were 0.2722–0.6582, 0.1485–0.3577, 0.0405–0.1084, and 0.0185–0.0849, respectively. In contrast, the corresponding ranges of the RF model were 0.3324–0.7637, 0.2146–0.4599, 0.1045–0.2924, and 0.0627–0.5807, respectively. The results indicate that the forecasting performance of the LR model is superior to that of the RF model.
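The comparison of the ranges can be checked with a short calculation. The sketch below uses the RMSE ranges reported in the paragraph above; the variable and function names are illustrative assumptions.

```python
# RMSE ranges (minimum, maximum) reported above, keyed by SPEI timescale in months.
lr_ranges = {3: (0.2722, 0.6582), 6: (0.1485, 0.3577), 12: (0.0405, 0.1084), 24: (0.0185, 0.0849)}
rf_ranges = {3: (0.3324, 0.7637), 6: (0.2146, 0.4599), 12: (0.1045, 0.2924), 24: (0.0627, 0.5807)}

def width(bounds):
    """Width of a (minimum, maximum) range."""
    low, high = bounds
    return high - low

for months in sorted(lr_ranges):
    lr_width = width(lr_ranges[months])
    rf_width = width(rf_ranges[months])
    print(f"SPEI{months}: LR width {lr_width:.4f}, RF width {rf_width:.4f}")
```

At every timescale, the LR range is both narrower than the RF range and bounded by lower minimum and maximum RMSE values.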

Fig. 7

Violin plot of the RMSE for the prediction of the SPEI at multiple timescales using the LR and RF models

We next assessed the feasibility of the LR model for the prediction of the SPEI at different timescales for the 118 meteorological stations in Northeast China. The summary statistics for the forecasting performance of the LR model (Online Resource 1) indicate that the Xinbin station had the lowest R2 value (0.5143) and the highest RMSE (0.6582) and MAE (0.4814) values of the 118 meteorological stations for the prediction of SPEI3. Thus, the Xinbin station had the worst performance for the prediction of SPEI3. The Dalian station had the worst performance for the prediction of SPEI6, with the lowest R2 value (0.8102) and the highest RMSE (0.3577) and MAE (0.3083) values. For the prediction of SPEI12 and SPEI24, the R2 values were inconsistent with the RMSE and MAE values. In terms of the highest RMSE and MAE values, the Dalian station and the Zhurihe station had the worst performance, respectively. The Dalian station had the highest RMSE (0.1084) and MAE (0.0946) values for the prediction of SPEI12, whereas the Zhurihe station had the highest RMSE (0.0849) and MAE (0.0699) values for the prediction of SPEI24. However, their R2 values were 0.9852 and 0.9923, respectively, which were not the lowest values among the 118 meteorological stations for the prediction of SPEI12 and SPEI24. The Mingshui station had the lowest R2 values for the prediction of SPEI12 and SPEI24 (0.9750 and 0.9886, respectively). Although the R2 values of the Dalian station and the Zhurihe station were higher than those of the Mingshui station, the differences were slight and all of these R2 values were at least 0.975. Hence, we considered the Dalian station and the Zhurihe station to have the worst performances for the prediction of SPEI12 and SPEI24, respectively.

Figure 8 shows the monthly observed and predicted SPEI values during the test period (2003–2016) for the stations with the worst forecasting performance at each timescale (Xinbin for SPEI3, Dalian for SPEI6 and SPEI12, and Zhurihe for SPEI24). Even for these stations, there was very good agreement between the predicted and observed SPEI values. Other stations had a better goodness of fit, as indicated by their lower RMSE and MAE values. Moreover, the goodness of fit between the predicted and observed SPEI increased with increasing timescale. In summary, the LR model exhibited an acceptable forecasting performance for the SPEI at the timescales of 3, 6, 12, and 24 months for the 118 meteorological stations in Northeast China.

Fig. 8

Prediction results for the SPEIs at multiple timescales using the LR model at (a) Xinbin station, (b) Dalian station, (c) Dalian station, and (d) Zhurihe station

4 Discussion

In this study, we applied a variety of machine learning models to predict the monthly SPEI at the timescales of 3, 6, 12, and 24 months for 118 meteorological stations in Northeast China during the period of 1963–2016. Two types of algorithms were evaluated: penalized linear regression (the RR and LR) and DT-based ensemble methods (the AdaBoost and RF). The goal of this study was to investigate the feasibility of using penalized linear regression and DT-based ensemble methods for forecasting drought conditions in Northeast China. The primary findings of this study are as follows. (1) The penalized linear regression achieved better forecasting performance than the OLS algorithm, and the LR model had the best performance. (2) The DT-based ensemble methods had higher prediction accuracy than the simple DT algorithm for the prediction of the SPEI at different timescales; in particular, the RF model had the best forecasting performance. (3) A comparison of the optimum models of the two types of algorithms indicated that the LR model was superior to the RF model for the prediction of the SPEI at different timescales in Northeast China.

As expected, the results indicate that the penalized linear regression was more effective than the traditional OLS for the prediction of the SPEI at different timescales, as shown by the lower forecast deviations at the stations. Penalized linear regression mitigates the overfitting problem of the OLS by adding a penalty term to the least squares objective, which shrinks the coefficients toward zero or sets them to zero and thereby improves the prediction accuracy. In addition, we found that the LR model had a better forecasting performance than the RR model in most cases. The best performance for the majority of the meteorological stations was achieved using the LR model. These results seem to be consistent with those of another study, which reported that the LR resulted in significantly higher accuracy than the RR for electricity price forecasting (Uniejewski et al. 2016). However, one of the elastic net models was the best performing model in that study. We did not use the elastic net algorithm because its penalty term combines the penalties that were already included in the LR and RR. Elastic net methods introduce an additional parameter that adjusts the ratio between the RR and LR penalties. Further research should be conducted to investigate the forecasting performance of other penalized linear regression models, such as the elastic net model, for the prediction of the SPEI. One unanticipated finding was that the forecasting performance of the models for the prediction of the SPEI improved with increasing timescale. This result is in agreement with the results reported by Park et al. (2016), who stated that the prediction accuracy was higher for long-term drought conditions than for short-term drought conditions. It is difficult to determine the specific reason for this result, but it might be related to the reciprocal causal relationship between drought factors and the SPEI. Drought factors tend to represent the influence of precipitation shortages accumulated over long-term rather than short-term periods (Gessner et al. 2013).
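The relationship between the OLS, RR, LR, and elastic net estimators can be summarized by their objective functions. The formulation below is a standard textbook sketch, assuming that RR and LR denote ridge and lasso regression; the symbols (response y_i, predictor vector x_i, coefficients beta_j, penalty strength lambda, and mixing ratio alpha) are notational assumptions and may differ from those used in the methods section, and some formulations scale the individual terms differently.

```latex
\begin{align}
\hat{\beta}_{\mathrm{OLS}} &= \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 \\
\hat{\beta}_{\mathrm{RR}} &= \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\hat{\beta}_{\mathrm{LR}} &= \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\hat{\beta}_{\mathrm{EN}} &= \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \left[ \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right]
\end{align}
```

With lambda = 0, all three penalized estimators reduce to the OLS. With alpha = 1, the elastic net reduces to the LR, and with alpha = 0 it reduces to the RR, which is why the elastic net requires one more tuning parameter than either model.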

In the current study, the DT-based ensemble methods, especially the RF model, had better forecasting performance than the simple DT algorithm. We used the probability density distributions of the RMSE and MAE to determine the overall forecasting performance of the models. The results consistently indicated that the forecast deviations were lowest for the RF model at all timescales, followed by the AdaBoost model and the DT model. In addition, the use of the R2 values to determine the optimum model for each station showed that the RF model was the optimum model for most stations. It may be that the voting mechanism of the multiple tree predictors in the RF algorithm mitigates the overfitting problem of the DT algorithm. Also, the RF method is less time-consuming, which represents a considerable advantage for prediction tasks. A possible explanation for these results is that the prediction accuracy depends not only on the algorithms but also on the size, dimension, and integrity of the dataset and the degree of correlation between the variables.
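The combination of multiple tree predictors can be illustrated with a deliberately simplified sketch. The code below is not the RF algorithm used in this study: it performs plain bagging of one-split regression trees (stumps) on a single predictor, without the random feature selection of a full random forest, and for a regression target the "vote" is the average of the individual tree predictions. All function names and data values are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D data by minimising the squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skipping the minimum keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap resample (sampling with replacement) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_bagged(trees, x):
    """Ensemble prediction: the average of the individual tree predictions."""
    return mean(tree(x) for tree in trees)

# Toy data with a step between x = 4 and x = 5 (invented for illustration).
xs = list(range(10))
ys = [0.1, -0.2, 0.0, 0.2, -0.1, 1.1, 0.9, 1.0, 1.2, 0.8]
trees = fit_bagged_stumps(xs, ys)
print(predict_bagged(trees, 1), predict_bagged(trees, 8))
```

Because each tree is fitted to a different resample, the individual predictions vary, and averaging them damps the influence of any single overfitted tree.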

We compared the forecasting performance of the LR and RF models to identify the optimum model. The results indicate that the LR model performed better than the RF model for all stations according to the ranges and distributions of the RMSE, MAE, and R2 of the models. It seems that no single machine learning algorithm has outperformed the other algorithms for SPEI prediction in all regions. The reasons may be related to the differences in the characteristics of the SPEI datasets in the different study regions. In addition, the timescale of the SPEI has a significant impact on the performance of the forecasting models. Therefore, the selection of the most suitable SPEI timescale is more important than the type of machine learning algorithm. There is abundant opportunity to investigate specific models and to optimize the prediction performance. Further research should take into account the temporal characteristics of the SPEI and utilize deep learning methods to establish drought forecasting models. In addition, the use of different timescales of the SPEI or different drought indices may also improve the predictive performance of the models. Because the sample size was limited by the size of the study area, future research should use a dataset comprising a larger number of stations to improve the generalization ability of the drought forecasting models.

5 Conclusion

This study evaluated the ability of two types of machine learning methods (i.e., penalized linear regression and DT-based ensemble methods) to predict the SPEI at the timescales of 3, 6, 12, and 24 months in Northeast China. The penalized linear regression models provided better prediction results than the OLS model. Furthermore, the DT-based ensemble models had better forecasting performance than the simple DT model. Among all the drought forecasting models, the LR model consistently exhibited the best prediction accuracy regardless of the SPEI timescale. These findings suggest that the LR model may be applied to predict drought conditions in Northeast China. This research provides a framework for the exploration of machine learning approaches for the prediction of drought conditions. Considering the expected effect of global warming, the improvement of drought prediction models is necessary to mitigate drought losses and achieve the sustainable development of water resources.