1 Introduction

Drought is a natural hazard defined as a deficit in precipitation relative to average conditions. It causes great economic losses, which account for an overwhelming proportion of the global losses caused by natural disasters (Keshavarz et al. 2013). Drought is especially devastating in areas where mean precipitation at monthly or seasonal timescales is low (Mishra and Singh 2010; Husak et al. 2013). Because the Korean peninsula lies in a monsoon region whose climate varies with season, its precipitation depends on season and geographical position, and the region is frequently affected by drought and flood. In the Democratic People’s Republic of Korea (DPRK), drought mostly occurs in spring and autumn and damages agriculture and hydropower. In particular, spring drought impairs the growth of young crops. For example, during the springs of 2001, 2017 and 2019, most areas of the DPRK were affected by severe droughts, and rice seedlings suffered especially in the rice-cultivating areas of North and South Hwanghae Provinces and North and South Phyongan Provinces. Autumn drought may also reduce the output of hydroelectric power stations during winter, when precipitation is relatively low. Therefore, improving drought monitoring and prediction methods is essential for minimizing damage to agriculture and power production in the DPRK.

The damage caused by drought appears with time lags, so drought is generally assessed in terms of its timescale and persistence. It is very important to establish an early drought warning system and to take rational measures to prepare for the coming damage (Belayneh et al. 2012; Mouatadid et al. 2018). A drought warning system is built by applying effective forecasting models to drought indices.

Several drought indices have been suggested for quantifying precipitation deficits, such as the standardized precipitation index (SPI) (McKee et al. 1993), the standardized precipitation evapotranspiration index (SPEI) (Vicente-Serrano et al. 2010), the Palmer drought severity index (PDSI) (Palmer 1965) and the drought severity index (DSI) (Nalbantis and Tsakiris 2009). The SPI is the most widely used drought index because it is simple to calculate, its values are standardized, and it allows the multi-timescale characteristics of drought to be analyzed (WMO 2012). Therefore, the SPI is selected in this study as the multi-timescale drought index for assessing dry or wet conditions.

Both simple approaches such as the autoregressive integrated moving average (ARIMA) model and more complex nonlinear approaches using support vector regression (SVR) and artificial neural network (ANN) models can be applied to drought forecasting (Rezaeianzadeh et al. 2016; Belayneh et al. 2012; Zhang et al. 2017; Zhang et al. 2019; Fathabadi et al. 2009; Mokhtarzad et al. 2017; Deo et al. 2017a; Djerbouai and Souag 2016). Although ARIMA is the most common model for time series forecasting, it is not well suited to hydro-meteorological time series that are non-stationary and nonlinear (Zhang et al. 2017). Some recent studies show that SVR and ANN models capture the nonlinear relation between input and output more accurately than the ARIMA model in drought forecasting. The performance of SVR and ANN models, which are based on machine learning techniques, varies with the field of application. Deo et al. (2017) used multivariate adaptive regression, least square support vector machine and M5 Tree models to forecast SPI in eastern Australia. Khan and Coulibaly (2006) showed that an SVR model for lake water level forecasting outperformed ANN models. An SVR model for SPI drought forecasting in the Khorasan province of Iran was more effective than an ANN model (Mokhtarzad et al. 2017). In contrast, an SVR model for SPEI-12 forecasting in the Sanjiang Plain of China did not outperform the ARIMA model, which was the best of the three models used (Zhang et al. 2019). Although in some applications the ANN model did not outperform the ARIMA and SVR models, the ability of ANN is still high because its structure describes the nonlinear relation between input and output well (ASCE 2000).

The ability of machine learning methods to forecast non-stationary drought time series is limited (Belayneh et al. 2014). One way to overcome this limitation is to apply data pre-processing methods such as the wavelet transform and empirical mode decomposition. The wavelet transform is a useful data pre-processing tool that decomposes the original data successfully, and it is effective in forecasting nonlinear and non-stationary time series (Renaud et al. 2005; Murtagh et al. 2004). It can be implemented at multiple resolution levels to capture useful information, and using the decomposed data instead of the original data can improve the ability of a machine learning model. Belayneh et al. (2016) compared the bootstrap SVR, the bootstrap ANN, the boosting SVR, the boosting ANN, the wavelet bootstrap SVR, the wavelet bootstrap ANN, the wavelet boosting ANN and the wavelet boosting SVR models. The results showed that the wavelet boosting machine learning methods outperformed the other methods. Djerbouai and Souag (2016) designed wavelet ANN (WANN) models coupled with wavelet decomposition at several resolution levels using mother wavelets from db1 to db17, and compared them with ARIMA. A synthesis of previous research on SPI drought forecasting emphasizes the utility of wavelet analysis (Anshuka et al. 2019). Ali et al. (2019) used the multivariate empirical mode decomposition (MEMD) formalized by Huang et al. (1998) as a data pre-processing tool to predict SPI drought. Pre-processing 12 synoptic-scale climate indices with MEMD clearly improved the ability of the kernel ridge regression algorithms used for SPI forecasting. Although data pre-processing methods can improve the ability of the models to some extent, SVR and ANN models retain some disadvantages because of their inherent limitations (Tu 1996; Sapankevych and Sankar 2009).

Deep learning has made great strides in artificial intelligence in recent years. The augmentation of training data and improvements in the training algorithms of deep models have increased the usability of artificial intelligence techniques (Nikhil 2017; Grover et al. 2015). Shi et al. (2015) developed convolutional long short-term memory (ConvLSTM) layers for precipitation nowcasting. Veillette et al. (2018) demonstrated the usability of deep learning by building a ConvLSTM network to generate radar images and verifying its effectiveness. Du et al. (2018) evaluated the ability of a deep neural network for precipitation forecasting with training data from 200 to 2000. The results showed that a deep belief network based on multi-layer restricted Boltzmann machines can overcome the shortcomings of traditional forecasting methods. Although deep belief networks have been suggested for precipitation forecasting, they have not been applied to drought forecasting.

The purpose of this paper is to investigate the ability of an LSTM network coupled with wavelet decomposition at multiple resolution levels for SPI drought forecasting in the west area of the DPRK. In this study, the Haar function was used as the mother wavelet, and the wavelet-based LSTM network (WLSTMN) was compared with wavelet-based ANN (WANN) and wavelet-based SVR (WSVR) models. WANN and WSVR models for SPI drought forecasting have been developed in other studies, but they have not yet been investigated in the DPRK (Anshuka et al. 2019). In addition, because traditional SVR and ANN models did not outperform wavelet-based models, only wavelet-based models are investigated in this study. This study forecasted SPI-6 and SPI-12 (SPI with 6- and 12-month timescales), which are indicators of medium- and long-term drought conditions, and the performance of the models at different decomposition levels for a 1-month lead time was evaluated with standard performance measures.

The rest of the paper is organized as follows. Section 2 presents the study area, the data used and the SPI calculation method, and briefly describes the machine learning models coupled with wavelet decomposition and their performance measures. Section 3 presents the results of the three models, and the detailed discussion and conclusions are given in Sect. 4.

2 Materials and methodology

2.1 Study area and data

The study area encompasses the west area of the DPRK and includes six stations (Pyongyang, Kanggye, Sinuiju, Phyongsong, Sariwon and Haeju) (Fig. 1), which are the main weather stations of Pyongyang city, Jagang Province, North and South Phyongan Provinces and North and South Hwanghae Provinces. North and South Phyongan Provinces and North and South Hwanghae Provinces are the main granaries of the DPRK. The geographical locations and statistical characteristic values of precipitation for the six weather stations are presented in Table 1. The study area is located in the East Asian monsoon region, so precipitation varies greatly over the annual cycle: it is mostly concentrated in the rainy season from June to September, and drought mainly occurs from March to May.

Fig. 1 Study area

Table 1 Geographical locations and statistical characteristic values of precipitation for six meteorological stations

Daily precipitation data from the six weather stations covering the period 1960–2020 are used; these data were obtained from the State Hydro-Meteorological Administration (SHMA) of the DPRK. Meteorological observations in the study area started at different times: Pyongyang (1907), Sinuiju (1931), Haeju (1945), Kanggye (1952) and Sariwon (1954). Missing values account for 5% of the daily precipitation data before 1960 and are few after 1960. The maximum daily precipitation, 411.5 mm, was observed in Sinuiju, and the maximum and minimum monthly mean numbers of precipitation days were 14.1 in July in Haeju and 2.5 in January in Sinuiju, respectively. The data have been corrected and quality-controlled using contemporaneous data from neighboring stations.

2.2 Standardized precipitation index (SPI)

The SPI, suggested by McKee et al. (1993) and McKee et al. (1995), is a meteorological drought index that represents the degree of deviation from the mean precipitation. Its main advantages are that it is simple to calculate, because it depends only on precipitation records (Logan et al. 2010), and that it allows the strength and duration of drought to be analyzed at multiple timescales (Tsakiris and Vangelis 2004; Mishra and Desai 2006; Mishra et al. 2007). The detailed calculation method of the SPI is presented in Thom (1958), Edwards and McKee (1997), Abramowitz and Stegun (1965) and Zhang et al. (2017).
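The calculation can be sketched in a few lines of Python. This is a minimal illustration of the usual gamma-based procedure and not the code used in this study: it assumes that a gamma distribution is fitted to the nonzero totals with the Thom (1958) approximation, that the probability of zero precipitation is handled by a mixed distribution, and that the cumulative probability is mapped to a standard normal quantile. The function names and the sample totals are hypothetical, and in practice the fit would be made separately for each calendar window across all years.

```python
import math
from statistics import NormalDist, fmean


def fit_gamma_thom(values):
    """Thom (1958) approximation to the maximum-likelihood gamma parameters."""
    mean = fmean(values)
    a = math.log(mean) - fmean([math.log(v) for v in values])
    shape = (1.0 + math.sqrt(1.0 + 4.0 * a / 3.0)) / (4.0 * a)
    scale = mean / shape
    return shape, scale


def gamma_cdf(x, shape, scale):
    """Regularized lower incomplete gamma function, by series expansion."""
    if x <= 0.0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (shape + n)
        total += term
        n += 1
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))


def spi(accumulated):
    """SPI values for a list of accumulated precipitation totals."""
    wet = [v for v in accumulated if v > 0.0]
    q = 1.0 - len(wet) / len(accumulated)  # probability of a zero total
    shape, scale = fit_gamma_thom(wet)
    normal = NormalDist()
    result = []
    for v in accumulated:
        h = q + (1.0 - q) * gamma_cdf(v, shape, scale)
        h = min(max(h, 1e-6), 1.0 - 1e-6)  # keep the normal quantile finite
        result.append(normal.inv_cdf(h))
    return result


# Synthetic 6-month totals in mm (illustrative values, not station data).
totals = [310.0, 455.0, 280.0, 520.0, 390.0, 610.0, 205.0, 470.0, 365.0, 430.0]
index = spi(totals)
```

Larger totals map to larger SPI values, so the driest window in the sample receives the most negative index.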

Drought classification by SPI values is shown in Table 2 (WMO 2012).

Table 2 Drought classification of SPI value

In this study, the drought indices forecasted by the machine learning models are the SPI-6 and SPI-12 series, derived at 10-day intervals. An SPI series based on monthly precipitation contains 12 elements per year. However, because the main purpose of this study is to elucidate the ability of deep learning, we derived the SPI at 10-day intervals instead of monthly so that the number of training samples could be increased. SPIs were calculated on the 10th, the 20th and the last day of every month. For example, the SPI-6 of January 10, 1961 is calculated from the cumulative precipitation from July 11, 1960 to January 10, 1961, and the SPI-12 from the cumulative precipitation from January 11, 1960 to January 10, 1961. In a leap year, the precipitation of February 29 is added to that of February 28. In this way, SPIs are produced from the cumulative precipitation of 6 or 12 months, with the window moved forward every 10 or 11 days. The length of each SPI time series is (2020–1960) × 3 × 12 = 2160.
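The calendar logic of this scheme can be checked with a short Python sketch. It assumes that the series runs from January 1961 to December 2020 (the first dates for which a 12-month window is available), and the helper names `dekad_ends` and `window_start` are hypothetical.

```python
import calendar
from datetime import date, timedelta


def dekad_ends(first_year, last_year):
    """Return the 10th, 20th and last day of every month in the given years."""
    ends = []
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            last_day = calendar.monthrange(year, month)[1]
            for day in (10, 20, last_day):
                ends.append(date(year, month, day))
    return ends


def window_start(end, months):
    """First day of the accumulation window of `months` months ending on `end`."""
    # Step back `months` calendar months from the month of the end date.
    index = end.year * 12 + (end.month - 1) - months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    if end.day == calendar.monthrange(end.year, end.month)[1]:
        # A window ending on the last day of a month follows the last day
        # of the month reached above.
        previous_end = date(year, month, calendar.monthrange(year, month)[1])
    else:
        previous_end = date(year, month, end.day)
    return previous_end + timedelta(days=1)


ends = dekad_ends(1961, 2020)
series_length = len(ends)
spi6_start = window_start(date(1961, 1, 10), 6)
spi12_start = window_start(date(1961, 1, 10), 12)
```

Under these assumptions the number of dates equals the series length stated above, and the two window starts reproduce the example dates for January 10, 1961.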

2.3 Wavelet decomposition

The wavelet transform is a useful mathematical tool for analyzing nonstationary time series. Wavelet analysis can reveal breakdown points and discontinuities, and can compress or de-noise a signal (Kim and Valdes 2003). In addition, it can be used for the prediction of time series (Renaud et al. 2005; Murtagh et al. 2004). The wavelet transform can be implemented with two algorithms: the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT). The CWT of a time series \(f\left( t \right)\) is defined by Eq. (1) (Nason and Von Sachs 1999),

$$W_{{{\text{a}},{\text{b}}}} \left( t \right) = \int_{{ - \infty }}^{{ + \infty }} {f\left( t \right)\frac{1}{{\sqrt a }}\psi \left( {\frac{{t - b}}{a}} \right){\text{d}}t}$$
(1)

where \(a\) is the scale parameter, \(b\) is the position parameter, and \(\psi\) is the mother wavelet function (Kim and Valdes 2003).

The CWT is complex and computationally expensive, so the DWT is often used to forecast time series (Cannas et al. 2006):

$$W\left( {j,k} \right) = 2^{ - j/2} \sum\limits_{t} \psi \left( {2^{ - j} t - k} \right)f\left( t \right)$$
(2)

where integers j and k are known as the scale and position parameters, which control the scale and translation respectively.

A disadvantage of the DWT for forecasting is that the usual DWT is decimated. Decimation shortens the decomposed series at each level; therefore, the decimated (non-redundant) DWT cannot be applied to forecasting problems. Another disadvantage is that the DWT uses future data values, which is a difficulty for prediction problems, where attention must be paid to the boundary. This problem has been discussed in detail in Renaud et al. (2005), Murtagh et al. (2004), Adamowski and Sun (2010) and Belayneh et al. (2014). To overcome these disadvantages, a nondecimated or redundant version known as the à trous algorithm has been suggested by Mallat (1998). The smooth versions \(c_{j}\) and detail coefficients \(d_{j}\) of the original series \(f\left( t \right)\) at decomposition level \(j\) are given by Eqs. (3)–(5):

$$c_{0} \left( t \right) = f\left( t \right)$$
(3)
$$c_{j + 1} \left( t \right) = \sum\limits_{l = - \infty }^{ + \infty } h\left( l \right)c_{j} \left( {t + 2^{j} l} \right)$$
(4)
$$d_{j} \left( t \right) = c_{j - 1} \left( t \right) - c_{j} \left( t \right)$$
(5)

where \(c_{0} \left( t \right)\) is the original signal, \(h\) is the low-pass filter, \(j\) is the decomposition level and \(l\) is the shift index of the filter.
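A minimal Python sketch of this decomposition is given below. It assumes the causal Haar low-pass filter, which averages the current value with the value \(2^{j}\) steps in the past so that no future data are used, and it pads the start of the series by repeating the first value. Both choices, as well as the function name `a_trous_haar` and the sample values, are assumptions made for illustration.

```python
def a_trous_haar(series, levels):
    """Return (details, smooth): details[j - 1] holds d_j, smooth holds c_levels."""
    c_prev = list(series)
    details = []
    for j in range(levels):
        step = 2 ** j
        c_next = []
        for t in range(len(c_prev)):
            # Use the value `step` samples in the past; repeat the first
            # value where the series does not reach back far enough.
            past = c_prev[t - step] if t >= step else c_prev[0]
            c_next.append(0.5 * (c_prev[t] + past))
        details.append([a - b for a, b in zip(c_prev, c_next)])
        c_prev = c_next
    return details, c_prev


# Illustrative SPI-like values (not observed data).
spi_values = [0.5, -0.2, -1.1, -1.6, -0.9, 0.1, 0.7, 1.2]
details, smooth = a_trous_haar(spi_values, 3)
reconstructed = [
    smooth[t] + sum(d[t] for d in details) for t in range(len(spi_values))
]
```

Because every sub-series keeps the length of the original series, the detail series and the final smooth series add back to the original signal, and each of them can be used directly as a model input.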

2.4 Wavelet support vector regression (WSVR)

Support vector machines (SVM) can be applied to classification and regression problems (Gao et al. 2001). Since the purpose of this study is to forecast the SPI quantitatively, the SVM for regression, known as SVR, was used. SVR is a supervised learning model based on the structural risk minimization principle (Vapnik 1995). In other words, the goal of the SVR model is to minimize the generalization error, assuming a linear relation between a set of predictors \(\left\{ {\vec{x}_{1} ,\vec{x}_{2} , \cdots ,\vec{x}_{N} } \right\}\) and targets \(\left\{ {y_{1} ,y_{2} , \cdots ,y_{N} } \right\}\). If the relationship between predictor and target is not linear, the predictors can be mapped to a high-dimensional space. The function relating predictor and target is as follows:

$$f\left( x \right) = w^{T} \cdot \varphi \left( x \right) + b$$
(6)

where \(w\) and \(b\) are the coefficient vector and the scalar, respectively, that have to be estimated from the predictor and target data, and \(\varphi \left( x \right)\) defines the nonlinear mapping of \(x\). The minimization problem of the ε-insensitive loss SVR proposed by Vapnik (1995) is defined in Eq. (7). This optimization problem is converted to the quadratic programming problem in Eq. (8) by the Lagrange multiplier method, and the support vectors are the predictor vectors that correspond to positive Lagrange multipliers.

$$\mathop {\min }\limits_{w} R\left( f \right) = \frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{N} \left( {\xi_{i} + \xi_{i}^{*} } \right)$$
(7)

subject to

$$\begin{gathered} \forall i;\,y_{i} - \left( {w^{T} \cdot \varphi \left( {x_{i} } \right) + b} \right) \le \varepsilon + \xi_{i} \hfill \\ \forall i;\, \left( {w^{T} \cdot \varphi \left( {x_{i} } \right) + b} \right) - y_{i} \le \varepsilon + \xi_{i}^{*} \hfill \\ \forall i;\, \xi_{i} \ge 0,\xi_{i}^{*} \ge 0 \hfill \\ \end{gathered}$$

where C is the regularization parameter, and \(\xi_{i}\) and \(\xi_{i}^{ * }\) are nonnegative slack variables that represent the deviations between observation and prediction beyond \(\varepsilon\) in the two directions.

$${\text{maximize}}\; - \frac{1}{2}\sum\limits_{i,j = 1}^{N} \left( {\alpha_{i} - \alpha_{i}^{ * } } \right)\left( {\alpha_{j} - \alpha_{j}^{ * } } \right)K\left( {x_{i} ,x_{j} } \right) - \varepsilon \sum\limits_{i = 1}^{N} \left( {\alpha_{i} + \alpha_{i}^{ * } } \right) + \sum\limits_{i = 1}^{N} y_{i} \left( {\alpha_{i} - \alpha_{i}^{ * } } \right)$$
(8)

subject to \(\sum\nolimits_{i = 1}^{N} \left( {\alpha_{i} - \alpha_{i}^{*} } \right) = 0\) and \(\alpha_{i} ,\alpha_{i}^{*} \in \left[ {0,C} \right]\), where \(\alpha_{i}\) and \(\alpha_{i}^{ * }\) are the Lagrange multipliers and \(K\left( {x_{i} ,x_{j} } \right)\) is the kernel function, defined as the scalar product in the feature space. The regression function for a new \(x\) is given by Eq. (9).

$$f\left( x \right) = \sum\limits_{i = 1}^{N} \left( {\alpha_{i} - \alpha_{i}^{*} } \right)K\left( {x_{i} ,x} \right) + b$$
(9)

The kernel function used is the radial basis function (RBF), a nonlinear function defined as follows:

$$K\left( {x,y} \right) = \exp \left( { - \frac{{\left\| {x - y} \right\|^{2} }}{{2\gamma^{2} }}} \right)$$
(10)
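The regression function of Eq. (9) with an RBF kernel can be illustrated with the following Python sketch. The kernel is evaluated on the squared Euclidean distance between the two vectors, as is usual for the RBF. The support vectors, the dual coefficients (each standing for \(\alpha_{i} - \alpha_{i}^{*}\)), the bias and the kernel width are hypothetical values chosen for illustration; in practice they result from solving Eq. (8).

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel on the squared Euclidean distance between x and y."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * gamma ** 2))


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate Eq. (9); dual_coefs[i] stands for (alpha_i - alpha_i*)."""
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Hypothetical fitted quantities, not values from this study.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.5]
prediction = svr_predict([0.0, 0.0], support_vectors, dual_coefs, bias=0.1, gamma=1.0)
```

Only the support vectors enter the sum, which is why the fitted model can be stored compactly once training is finished.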

In this study, the predictors used to forecast the SPI series are the wavelet-decomposed series at different decomposition levels, and the models were trained using fivefold cross-validation. The optimal model was selected through the training and validation stages. Validation data correspond to the set of 2012–2020.

2.5 Wavelet artificial neural network (WANN)

ANN models are machine learning models that adhere to the empirical risk minimization principle, as opposed to the structural risk minimization principle used by traditional SVR (Vapnik 1995). One advantage of ANN is that it automatically establishes relationships between inputs (predictors) and targets without prior analysis of the variables (Hydrology 2000). Another advantage is that, through training, it copes with the noise contained in the training data by itself. With the recent development of statistical software, ANN has become very easy to apply, and many drought forecasting studies show that ANN models improve forecasting ability (Fathabadi et al. 2009; Belayneh et al. 2012; Zhang et al. 2017; Mokhtarzad et al. 2017; Djerbouai and Souag 2016). The multi-layer perceptron (MLP) feed-forward network has been used extensively to forecast SPI drought series, and a more detailed description of its architecture can be found in Mishra and Desai (2006) and Belayneh et al. (2012).

This study used an MLP feed-forward network with an input layer, one hidden layer and an output layer, trained with the Levenberg–Marquardt backpropagation algorithm. The decomposed series were used as input data of the ANN models, and the number of neurons in the hidden layer was determined by trial and error. The sampled data were partitioned into three sets for training, validation and testing.

2.6 Wavelet long short-term memory network (WLSTMN)

In recent years, deep learning has driven rapid progress in artificial intelligence fields such as image classification and time series prediction. The long short-term memory network (LSTMN), an extension of the traditional recurrent neural network (RNN), avoids the vanishing gradient problem of the RNN by adding a way to carry information across many time steps (Hochreiter and Schmidhuber 1997). The input layer of the LSTM network is a sequence input (SI) layer, which is a core component of the LSTMN and passes the time series data to the LSTM layer. The LSTM layer contains LSTM units, each composed of an input gate, a forget gate, a cell with a self-recurrent connection and an output gate, which can selectively remove or add information. The updating algorithm of an LSTM unit and the roles of the four components were introduced by Hochreiter and Schmidhuber (1997) and Felix et al. (2014). The stochastic gradient descent (SGD) (Murphy 2012), root-mean-square propagation (RMSProp) (Hinton 2012) and adaptive moment estimation (Adam) (Kingma and Ba 2014) optimizers can be applied to update the LSTMN parameters so as to minimize the loss function.
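For reference, the update of an LSTM unit with a forget gate is commonly written as follows. The notation is the standard textbook one and is an assumption here rather than the notation of this study: \(x_{t}\) is the input at step \(t\), \(h_{t}\) the hidden state, \(c_{t}\) the cell state, \(W\), \(U\) and \(b\) the trainable weights and biases, \(\sigma\) the logistic sigmoid and \(\odot\) the element-wise product.

$$\begin{aligned} i_{t} &= \sigma \left( W_{i} x_{t} + U_{i} h_{t - 1} + b_{i} \right) \\ f_{t} &= \sigma \left( W_{f} x_{t} + U_{f} h_{t - 1} + b_{f} \right) \\ o_{t} &= \sigma \left( W_{o} x_{t} + U_{o} h_{t - 1} + b_{o} \right) \\ \tilde{c}_{t} &= \tanh \left( W_{c} x_{t} + U_{c} h_{t - 1} + b_{c} \right) \\ c_{t} &= f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t} \\ h_{t} &= o_{t} \odot \tanh \left( c_{t} \right) \end{aligned}$$

The additive form of the cell update is what lets gradients flow across many time steps: when the forget gate \(f_{t}\) stays close to 1, the cell state is carried forward almost unchanged.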

In general, increasing the number of hidden layers may lead to overfitting (Srivastava et al. 2014). Overfitting can occur in any supervised learning problem, but it can be severe when the model is trained with insufficient data. One way to avoid overfitting is to reduce the size of the model. Another is to add weight regularization, known as L1 and L2 regularization (Neumaier et al. 1998), or a dropout layer (Srivastava et al. 2014). In addition, cross-validation (Loughrey et al. 2005) and early stopping (Kohavi 1995) can be adopted to reduce overfitting.

The LSTMN used to forecast the SPI series in this study has an SI layer, an LSTM layer with 100 neurons, two fully connected (FL) layers with 50 and 1 neurons, respectively, a dropout layer with 50 neurons and a regression output layer with 1 neuron; the structure of the model is shown in Fig. 2. The dimension of the SI layer depends on the wavelet decomposition level, and the dropout ratio in the dropout layer varies from 0.1 to 0.5. The sampled data were partitioned as in the WANN model. All the WSVR, WANN and WLSTMN models were developed with MATLAB (R2021a) software (Deep Learning Toolbox, Statistics and Machine Learning Toolbox and Wavelet Toolbox).

Fig. 2 The structure of WLSTMN to forecast SPI drought

2.7 Performance measures

The performance of the developed models was evaluated with the coefficient of determination (\(R^{2}\)), Lin's concordance correlation coefficient (LCCC) (Lin 1989), the root-mean-squared error (RMSE) and the mean absolute error (MAE), expressed as Eqs. (11)–(14):

$$R^{2} = 1 - \frac{{\sum\nolimits_{i = 1}^{N} (y_{i} - \hat{y}_{i} )^{2} }}{{\sum\nolimits_{i = 1}^{N} (y_{i} - \overline{y} )^{2} }},\quad \overline{y} = \frac{1}{N}\sum\limits_{i = 1}^{N} y_{i}$$
(11)
$${\text{LCCC}} = \frac{{2\rho \sigma_{o} \sigma_{{\text{p}}} }}{{\sigma_{o}^{2} + \sigma_{{\text{p}}}^{2} + \left( {\mu_{o} - \mu_{{\text{p}}} } \right)^{2} }}$$
(12)
$${\text{RMSE}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} (y_{i} - \hat{y}_{i} )^{2} }$$
(13)
$${\text{MAE}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left| {y_{i} - \hat{y}_{i} } \right|$$
(14)

where \(y_{i}\) is the observed SPI, \(\hat{y}_{i}\) is the predicted value, \(\rho\) is the Pearson correlation coefficient between the observed and predicted SPI, \(\sigma_{o}\) and \(\sigma_{p}\) are the standard deviations of the observed and predicted SPI, and \(\mu_{o}\) and \(\mu_{p}\) are the means of the observed and predicted SPI.

The coefficient of determination takes values from 0 to 1, and the model with the highest R2 is the best. The LCCC indicates the degree to which pairs of observed and predicted SPI fall on the 45° line through the origin. The smaller the values of RMSE and MAE, the better the performance of the developed model.
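The four measures can be computed as in the following Python sketch. It assumes that R2 is one minus the ratio of the residual to the total sum of squares, and that the LCCC uses population (1/N) variances, so that \(2\rho \sigma_{o} \sigma_{p}\) equals twice the covariance. The function name and the sample values are hypothetical.

```python
import math
from statistics import fmean


def performance_measures(observed, predicted):
    """Return R2, LCCC, RMSE and MAE for paired observed and predicted values."""
    n = len(observed)
    mu_o, mu_p = fmean(observed), fmean(predicted)
    var_o = sum((y - mu_o) ** 2 for y in observed) / n
    var_p = sum((p - mu_p) ** 2 for p in predicted) / n
    # 2 * rho * sigma_o * sigma_p equals twice the covariance.
    cov = sum((y - mu_o) * (p - mu_p) for y, p in zip(observed, predicted)) / n
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return {
        "R2": 1.0 - sse / (n * var_o),
        "LCCC": 2.0 * cov / (var_o + var_p + (mu_o - mu_p) ** 2),
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(y - p) for y, p in zip(observed, predicted)) / n,
    }


# Illustrative values: every prediction is 0.5 too high.
observed = [-1.0, 0.0, 1.0]
predicted = [-0.5, 0.5, 1.5]
scores = performance_measures(observed, predicted)
```

In this example the predictions are perfectly correlated with the observations, yet the constant bias still lowers the LCCC below 1, which is the reason for reporting it alongside R2.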

3 Results

In this study, WSVR, WANN and WLSTMN models for predicting the SPI 1 month ahead were developed for six stations in the west area of the DPRK (Pyongyang, Kanggye, Sinuiju, Phyongsong, Sariwon and Haeju). Predictions were generated with these three models for SPI-6 and SPI-12. The mother wavelet used to pre-process the data was the Haar wavelet, and the SPI time series were decomposed at levels 1–10. For the WSVR models, 85% of the data was used for training and validation and 15% for testing. All the WANN and WLSTMN models were trained using 70% of the data, while 15% was used for validation and the final 15% for testing. The testing period thus corresponds to 2012–2020. The SPI time series predicted by the optimal models, determined through optimization in the training and validation stages, were compared with the observed SPI time series for the testing period. The performance measures and the optimal decomposition level (ODL) for SPI-6 and SPI-12 are presented in Table 3. As shown in Table 3, all the models gave better results for SPI-12 than for SPI-6. In addition, the optimal models for the study area were the WLSTMN models, which had the highest R2 and LCCC and the smallest RMSE and MAE for most stations and both timescales. The detailed results of the models at the six stations are described in the following subsections.
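The partition can be sketched as follows in Python. The sketch assumes that the split is chronological (no shuffling) and that all 2160 values of a series are available as samples; in practice the first few values are consumed by the lagged inputs. The function name `chronological_split` is hypothetical.

```python
def chronological_split(samples, train=0.70, validation=0.15):
    """Split a time-ordered list into training, validation and testing parts."""
    n = len(samples)
    n_train = round(n * train)
    n_validation = round(n * validation)
    training = samples[:n_train]
    validating = samples[n_train:n_train + n_validation]
    testing = samples[n_train + n_validation:]
    return training, validating, testing


# One placeholder entry per 10-day SPI value (indices only, no real data).
samples = list(range(2160))
training, validating, testing = chronological_split(samples)
```

With 36 values per year, the final 15% of the series amounts to 324 values, that is, nine years, which agrees with a testing period of 2012–2020.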

Table 3 The performance measures and ODL for SPI-6 and SPI-12 during the period of testing

3.1 WSVR models

As mentioned earlier, the wavelet-decomposed series with time lags were used as the inputs of the WSVR models. The maximum lag with significant correlation was determined from the auto-correlation function (ACF). Figure 3 shows the ACF for SPI-6 and SPI-12 at Pyongyang station (other stations are omitted). The number of inputs was chosen between 1 and 16 lags for SPI-6, while values from 1 to 34 lags were tested for SPI-12. The number of lags and the decomposition level with the optimal performance, selected by trial and error, were 12 and 3 for SPI-6 and 15 and 5 for SPI-12, respectively, at all stations.
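The two steps described above, computing the ACF and building lagged inputs, can be sketched in Python as follows. The sketch assumes that a 1-month lead corresponds to three 10-day steps and uses a single series for brevity; with wavelet pre-processing the same windowing would be applied to each decomposed sub-series. The function names and the sample series are hypothetical.

```python
def acf(series, max_lag):
    """Sample auto-correlation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
        / denominator
        for k in range(max_lag + 1)
    ]


def lagged_samples(series, n_lags, lead):
    """Pair the latest `n_lags` values with the value `lead` steps ahead."""
    samples = []
    for t in range(n_lags - 1, len(series) - lead):
        inputs = series[t - n_lags + 1:t + 1]
        samples.append((inputs, series[t + lead]))
    return samples


# Illustrative series (not observed SPI values).
series = [float(v) for v in range(10)]
correlations = acf(series, 3)
pairs = lagged_samples(series, n_lags=3, lead=3)
```

Each pair holds the input window and its target, so a series of length N yields N − n_lags − lead + 1 training samples.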

Fig. 3 ACF for SPI-6 and SPI-12 time series for the Pyongyang station

The best WSVR model for SPI-6 was obtained at Phyongsong, with R2 of 0.705, LCCC of 0.820, RMSE of 0.602 and MAE of 0.427. The best WSVR model for SPI-12 was obtained at Sariwon, with R2 of 0.932, LCCC of 0.944, RMSE of 0.258 and MAE of 0.152. For SPI-6, the smallest R2 and LCCC were 0.625 and 0.728 in Kanggye, while the largest RMSE was 0.662 in Haeju. For SPI-12, the worst performance measures were R2 of 0.837, LCCC of 0.877, RMSE of 0.391 and MAE of 0.262 in Kanggye (Table 3).

3.2 WANN models

The R2 values of the WANN models at decomposition levels 1–10 for the six stations during the training and validation periods are shown in Figs. 4 and 5, in which the blue line corresponds to the training period and the red line to the validation period. For SPI-6, the decomposition level with the highest validation R2 was not the same at all six stations, and the WANN models with more than 6 decomposition levels had lower generalization capacity (Fig. 4a–f). In particular, as shown in Fig. 4d, the R2 for SPI-6 at Phyongsong station decreased more sharply than at other stations as the decomposition level increased beyond 6. For SPI-12 forecasts, increasing the decomposition level beyond 7 lowered the performance during the validation period, in contrast to the training period (Fig. 5). According to the comparison of the performance criterion at different decomposition levels, the optimal decomposition level for SPI-6 forecasting was 5 in Pyongyang, Sariwon and Haeju, and 2, 3 and 6 in Sinuiju, Kanggye and Phyongsong, respectively, while for SPI-12 it was 6 in Pyongyang, Kanggye and Sariwon, 5 in Sinuiju and Haeju, and 7 in Phyongsong. At the optimal decomposition level, the difference between the training and validation R2 does not exceed 0.02 in Pyongyang (Fig. 5a), Sinuiju (Fig. 5c), Sariwon (Fig. 5e) and Haeju (Fig. 5f), but reaches 0.04 and 0.07 in Kanggye and Phyongsong, respectively. These results indicate that the generalization capacity of the WANN models for Kanggye and Phyongsong is lower than that of the models for the other stations.

Fig. 4 The coefficient of determination measured by the WANN models at different decomposition levels for SPI-6 during the training and validation period

Fig. 5 The coefficient of determination measured by the WANN models at different decomposition levels for SPI-12 during the training and validation period

For SPI-6, the R2 and LCCC values of the WANN models during the testing period were greater than 0.64 and 0.72, respectively, with the highest values of 0.719 and 0.806 in Haeju, and the lowest RMSE and MAE values were 0.541 and 0.409, respectively, in Pyongyang (Table 3). For SPI-12, the R2 and LCCC values of the forecasts were greater than 0.87 and 0.92, respectively, with the highest values of 0.930 and 0.958 and the lowest RMSE and MAE values of 0.263 and 0.159, respectively, in Sariwon.

3.3 WLSTMN models

The results of the WLSTMN models vary with the maximum number of training epochs. The performance measures of all the WLSTMN models trained with more than 100 epochs were extremely high for the training data but low for the validation data. The optimal maximum numbers of training epochs, determined by trial and error, ranged from 30 to 50. The Adam optimizer was used as the solver for training the network.

Figures 6 and 7 show the R2 values of the WLSTMN models at decomposition levels 1–10 for SPI-6 and SPI-12, respectively. Figure 6 shows that the WLSTMN models with more than 5 levels have low R2 values for the validation data of SPI-6, and that the optimal decomposition level for the six stations ranges from 2 to 4. For SPI-6, the optimal levels were 3 in Pyongyang and Phyongsong, 2 in Sinuiju and 4 in Kanggye, Sariwon and Haeju. The results in Fig. 7 indicate that the R2 values for the validation data of SPI-12 follow the same pattern with decomposition level as those for SPI-6, and that the optimal decomposition level for SPI-12 is 3 in Pyongyang and 5 at all the other stations.

Fig. 6 The coefficient of determination measured by the WLSTMN models at different decomposition levels for SPI-6 during the training and validation period

Fig. 7 The coefficient of determination measured by the WLSTMN models at different decomposition levels for SPI-12 during the training and validation period

For the validation data of SPI-6, the R2 values of the best model (WLSTMN) were greater than 0.65, with the highest value of 0.75 in Pyongyang and the lowest value in Phyongsong (Fig. 6). For the validation data of SPI-12, the R2 values were greater than 0.88, with the highest value of 0.93 in Sariwon and the lowest value in Phyongsong, where R2 was lower than at the other stations (Fig. 7d). During the testing period of SPI-6, the R2 and LCCC values were greater than 0.66 and 0.75, respectively, with the highest R2 of 0.728 (Haeju) and LCCC of 0.838 (Pyongyang), and the lowest RMSE and MAE values were 0.522 and 0.390, respectively, in Pyongyang (Table 3). During the testing period of SPI-12, the R2 and LCCC values were greater than 0.90 and 0.94, respectively, with the highest values of 0.936 and 0.955 (Sariwon and Haeju), and the lowest RMSE and MAE values were 0.260 and 0.175 (Sariwon and Kanggye), respectively.

As shown in Table 3, the best forecasting results for SPI-6 in Pyongyang were obtained by the WLSTMN model, with R2, LCCC, RMSE and MAE values of 0.722, 0.838, 0.522 and 0.390, respectively. The best results for SPI-12 in Pyongyang were also obtained by the WLSTMN model, with R2, LCCC, RMSE and MAE values of 0.914, 0.947, 0.287 and 0.184, respectively. For SPI-6, the LCCC of WSVR in Sinuiju was greater than that of WLSTMN, but its other measures (R2, RMSE and MAE) were worse than those of WLSTMN. Also, for SPI-12 in Sariwon, the LCCC, RMSE and MAE of WANN were better than those of WLSTMN. At the other stations, all measures of WLSTMN were better than those of WANN and WSVR. These results show that WLSTMN has the best performance for SPI-12 at all stations except Sariwon, where WANN performs better.

The scatter plots of SPI observations and values predicted by the WSVR, WANN and WLSTMN models at the six stations for the testing period are shown in Figs. 8 and 9. In these figures, most data points lie around the trend line, while several points fall below or above it, indicating a certain degree of underestimation or overestimation. In particular, data points are closer to the trend line for SPI-12 than for SPI-6. Figures 8 and 9 show the improvement of the results predicted by WLSTMN compared to WSVR and WANN. For SPI-6, the differences in R2 and LCCC values between WLSTMN and the other models are particularly large in Pyongyang, and the data points in the scatter plot of WLSTMN (Fig. 8a) are closer to the trend line than those of the other models. These improvements are also visible, to varying degrees, at the other stations. In particular, the locations of data points around extreme values (SPI ≤ −1.5) were clearly improved (Figs. 8 and 9).

Fig. 8

Scatter plots of the three models at six stations (SPI-6)

Fig. 9

Scatter plots of the three models at six stations (SPI-12)

Figure 10 shows the average performance measures at the six stations for the testing period. The WLSTMN model performs better than the WSVR and WANN models, with the highest R2 of 0.709 and LCCC of 0.806, the lowest RMSE of 0.572 and the lowest MAE of 0.427 for SPI-6 (Fig. 10a), and the highest R2 of 0.919 and LCCC of 0.950, the lowest RMSE of 0.296 and the lowest MAE of 0.190 for SPI-12 (Fig. 10b). For SPI-6, the performances of the WSVR and WANN models were similar, with R2 values of 0.676 and 0.686, LCCC values of 0.795 and 0.766, RMSE values of 0.604 and 0.597, and MAE values of 0.434 and 0.442, respectively. For SPI-12, all the performance measures were also similar for the WSVR and WANN models. According to the results shown in Table 3 and Figs. 8, 9 and 10, it can be concluded that the WLSTMN model is the best SPI forecasting model for both the SPI-6 and SPI-12 time series.

Fig. 10

The averages of performance measures for the testing period

4 Discussion and conclusion

The results of this study indicate that the WLSTMN model was more accurate than the WSVR and WANN models. The use of the LSTM unit made the SPI prediction models more reliable. Although deep learning generally requires a very large amount of sample data, the WLSTMN model developed in this study outperforms the other models for both SPIs and at nearly all stations. Because of the limited sample size, the improvement in the performance measures is not remarkable. However, deep learning with the LSTM unit for time series forecasting can clearly overcome the shortcomings of traditional SVR and ANN models (Du et al. 2018; Zhao et al. 2019).

SVR, which is based on the structural risk minimization principle, may outperform ANN, which is based on the empirical risk minimization principle. However, with the introduction of wavelet decomposition of the input data, ANN can predict SPI time series with an accuracy comparable to that of SVR. The WANN model used in this study slightly outperformed WSVR (Fig. 10). This result is consistent with previous studies (Belayneh et al. 2016).

It is known that drought prediction performance in the Korean Peninsula is lower than in other regions. In particular, the smaller the temporal scale is, the larger the prediction error is (Anshuka et al. 2019). The results of this study likewise show that the performance of drought prediction is better for SPI-12 than for SPI-6. The performance measures for all models are given in Figs. 8 and 9 and Table 3.

Although the performance of models coupled with wavelet decomposition is good, it depends on the decomposition level (Djerbouai and Souag 2016). The optimal decomposition level differs with region, temporal scale and model. A reasonable decomposition level is decisive in improving the performance of SPI time series prediction. An overlarge decomposition level may increase the performance for the training data but decrease it for the validation data. When a model with many parameters has limited training data, overfitting is hard to avoid. This limitation of the WANN models in SPI time series prediction appears as somewhat large differences between the training and validation performances in Phyongsong (Fig. 5d) and Kanggye (Fig. 5b) for SPI-12, which shows that the generalization performance of WANN is low at the above-mentioned stations. In order to check whether this result was produced by the random separation of the training sample, we repeatedly re-drew the training and validation data from the whole data set, but the difference between their performances did not decrease. This implies that the WANN models for Kanggye and Phyongsong did not represent the characteristics of the time series as well as the models for the other stations. Nevertheless, the difference between the R2 values of the training and validation data has a maximum of 0.07, which compares reasonably well with the results of Zhang et al. (2019) and Deo et al. (2017a, b). For SPI-6, the difference is 0.15 in Phyongsong (Fig. 4d), larger than at the other stations. This may result from the largest decomposition level of 6 in Phyongsong, which gives the input layer more neurons than at the other stations. The performance of the model could be improved by increasing the number of hidden layers and introducing a dropout layer (Du et al. 2018; Grover et al. 2015).
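The link between decomposition level and input size can be illustrated with a small sketch. The code below is an assumption-laden example, not the decomposition used in this study: it implements an undecimated (à trous) Haar variant, chosen here only because every subseries keeps the length of the original series, and the function name `haar_a_trous` and the edge padding are hypothetical choices.

```python
import numpy as np


def haar_a_trous(series, level):
    """Undecimated Haar decomposition (illustrative assumption).

    Returns (details, approx), where `details` is a list of `level`
    detail subseries and `approx` is the final smooth subseries.
    All subseries have the same length as the input, and
    sum(details) + approx reconstructs the input exactly.
    """
    approx = np.asarray(series, dtype=float)
    n = len(approx)
    details = []
    for j in range(1, level + 1):
        shift = 2 ** (j - 1)
        # Causal lag of 2**(j-1) steps; the start is padded with the
        # first value so that no future information is used.
        lagged = np.concatenate([np.full(shift, approx[0]), approx])[:n]
        smoother = 0.5 * (approx + lagged)   # Haar low-pass (average)
        details.append(approx - smoother)    # Haar high-pass (difference)
        approx = smoother
    return details, approx


# Hypothetical usage on a synthetic series (not data from the study).
rng = np.random.default_rng(0)
x = rng.normal(size=120)
details, approx = haar_a_trous(x, level=6)
n_subseries = len(details) + 1  # 6 details + 1 approximation = 7
```

In this sketch a decomposition level of L yields L + 1 subseries, so a level of 6 produces seven candidate input series per lag. If each subseries feeds the input layer, as assumed here, a higher level enlarges the input layer and the number of trainable weights while the amount of training data stays fixed.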

This limitation of WANN was overcome to some degree through the application of WLSTMN (Figs. 6d, 7b and d). The differences between the performance measures of the training and validation data in Kanggye and Phyongsong were reduced in WLSTMN. The addition of the LSTM layer mitigated the vanishing-gradient problem and allowed the characteristics of the time series at earlier time steps to be carried reliably to later steps. Furthermore, the overestimation or underestimation around extreme values by WSVR and WANN was reduced in WLSTMN (Figs. 8 and 9). Figure 11 shows the observed values and the values predicted by the three models for the severely or extremely dry events with SPI less than −1.5 for SPI-6 and SPI-12 in Pyongyang (other stations are not shown). In some cases, WLSTMN has a greater error than WSVR or WANN, but for most drought events its predicted values are closer to the observed values than those of the other models. Given that prediction around extreme values is essential in drought forecasting, it can be concluded that WLSTMN is superior.

Fig. 11

The extreme value predictions of three models for the testing data in Pyongyang

Figure 12 shows the confusion matrices of drought classes obtained from the quantitative predicted values of SPI-6 and SPI-12 for the three models. As shown in the confusion matrices of the WSVR and WANN models, they were not effective for extreme droughts (Fig. 12a, b). In the testing data, two extreme droughts occurred at the 6-month time scale; WSVR and WANN predicted them as ‘moderately dry’ events, while WLSTMN predicted them as ‘severely dry’ events. Also, during the testing period, 31 severe droughts were observed, of which WSVR, WANN and WLSTMN correctly predicted 10, 11 and 13, respectively. For the moderate droughts, the number of correct predictions of WLSTMN is greater than that of the other models. However, WLSTMN also overestimated moderate droughts as ‘severely dry’ more often than the other models. For SPI-12, six extreme droughts occurred during the testing period; WLSTMN predicted them as ‘severely dry’, while the other models misclassified two extreme droughts as ‘moderately dry’. Generally, the classification accuracy was the highest in WLSTMN for extreme and severe droughts, but similar in all models for moderate droughts. For SPI-6 and SPI-12, the misclassification did not exceed two classes for any event, and dry events were not predicted as ‘wet’. This implies that the models developed in this study could be applied not only to drought forecasting but also to the prediction of drought classes. However, prior to such an application, further research is required to compare the classification performances of WSVR, WANN and WLSTMN.

Fig. 12

The confusion matrix of SPI-6 and SPI-12 for the testing data in Pyongyang
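A confusion matrix of this kind can be built by mapping each observed and predicted SPI value to a drought class and counting the pairs. The sketch below is illustrative only. It assumes the commonly used seven-class SPI scheme with thresholds at ±1.0, ±1.5 and ±2.0 (the class boundaries applied in this study are not restated in this section), and the names `spi_class` and `drought_confusion_matrix` are hypothetical.

```python
import numpy as np

# Assumed seven-class scheme, ordered from driest to wettest.
CLASSES = [
    "extremely dry", "severely dry", "moderately dry", "near normal",
    "moderately wet", "very wet", "extremely wet",
]


def spi_class(spi):
    """Map one SPI value to a class index (assumed thresholds)."""
    if spi <= -2.0:
        return 0
    if spi <= -1.5:
        return 1
    if spi <= -1.0:
        return 2
    if spi < 1.0:
        return 3
    if spi < 1.5:
        return 4
    if spi < 2.0:
        return 5
    return 6


def drought_confusion_matrix(obs, pred):
    """Rows are observed classes, columns are predicted classes."""
    cm = np.zeros((len(CLASSES), len(CLASSES)), dtype=int)
    for o, p in zip(obs, pred):
        cm[spi_class(o), spi_class(p)] += 1
    return cm


# Hypothetical values, not data from the study.
obs = [-2.3, -1.7, -1.2, 0.1]
pred = [-1.8, -1.6, -1.6, 0.3]
cm = drought_confusion_matrix(obs, pred)
# Largest class distance between an observed and a predicted value.
max_shift = max(abs(spi_class(o) - spi_class(p)) for o, p in zip(obs, pred))
```

With this layout, correct classifications lie on the diagonal, an extreme drought predicted as ‘severely dry’ falls one column to the right of the diagonal, and `max_shift` corresponds to the check that no event is misclassified by more than two classes.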

As shown in Fig. 10, the area-averaged performance measures of WLSTMN are better than those of WSVR and WANN. Also, the R2 and LCCC values of the three models are larger for SPI-12 than for SPI-6 (Figs. 8 and 9). Thus, the models developed in this study predict the SPI-12 series, which changes moderately, better than the SPI-6 series, which changes sharply. These results are consistent with Zhang et al. (2017) and Anshuka et al. (2019). The RMSE and MAE of WLSTMN were lower than those of the other models. The RMSE and MAE of all the developed models are larger than those of the data-driven models suggested by Belayneh et al. (2016) and Ali et al. (2019). This is because drought forecasting scores depend on the area and time scale, as mentioned in Anshuka et al. (2019). Considering the low performance of SPI drought forecasting in the Korean Peninsula, the RMSE and MAE values obtained in this study are acceptable, and the R2 and LCCC values indicate better accuracy. The numbers of neurons of the LSTM layer, FL and dropout layer were fixed at 100, 50 and 50, respectively, and the training data were not sufficient. In addition, the mother wavelet was restricted to Haar. Owing to these limitations, the WLSTMN model may not significantly outperform WSVR and WANN.

In this study, WSVR, WANN and WLSTMN models, which couple machine learning techniques with Haar wavelet decomposition, were proposed to forecast SPI drought series for the period 1961–2020 in the west area of the DPRK. WLSTMN, the model with the best performance, is in operational use for drought forecasting in the study area. From the comparison of the prediction performances of all the models, the following conclusions were derived.

First, the WLSTMN models provided better prediction results than the WANN and WSVR models for drought prediction in the west area of the DPRK with respect to the R2, LCCC, RMSE and MAE values. Second, the comparison of the area-averaged performances of the WANN and WSVR models showed that the WANN model exhibited a higher R2 and a lower LCCC during the testing period, while their RMSE and MAE values were approximately the same. Third, all machine learning models showed better prediction results for SPI-12 than for SPI-6 at all the stations.

Although wavelet analysis was an effective pre-processing tool, models with overlarge decomposition levels did not avoid overfitting. This study presents the drought predictions of the WLSTMN, WANN and WSVR models using the Haar wavelet at a 1-month lead time. Further studies are required to estimate the performance of machine learning models coupled with different Daubechies wavelets (Djerbouai and Souag 2016), and in particular WLSTMN with different numbers of neurons in the hidden layers should be explored. Also, these new models should be applied at other stations with different characteristics and at longer lead times.