Introduction

Drought is an environmental disaster that occurs around the world; it happens when precipitation falls below a specified amount for a period of time. It has wide-ranging negative impacts on the economy, agriculture, water resources, tourism, and ecosystems (Dai 2013; Maca and Pech 2015; Wambua et al. 2016). Arid and semi-arid regions of Iran are among the most vulnerable to the impacts of weather variation and drought. It is therefore necessary to find effective methods for accurate drought prediction in order to reduce its harmful effects on nature and the environment; artificial intelligence techniques have been widely used for this purpose. The standardized precipitation index (SPI), developed by McKee et al. (1993), is used extensively to forecast drought over specific precipitation time scales. The SPI expresses the severity and probability of drought: more negative SPI values indicate more severe drought, while positive values indicate wet conditions (Lloyd-Hughes and Saunders 2002; Barker et al. 2016). Other indices such as the Palmer Drought Severity Index (PDSI) and the Standardized Precipitation Evapotranspiration Index (SPEI) have been developed to monitor, forecast, and evaluate drought severity (Palmer 1965; McKee et al. 1993; Paulo et al. 2012; Moreira et al. 2016).

Artificial intelligence techniques such as ANN, ANFIS, and SVM have recently been accepted as effective alternative tools for drought forecasting (Shirmohammadi et al. 2013).

Artificial neural networks (ANNs) provide a practical approach to drought forecasting. The ANN is a nonlinear model used for solving system modeling problems (Rezaeianzadeh et al. 2016; Sepahi et al. 2016). ANFIS is an artificial intelligence model, classified as a system-theoretic model, that can produce acceptable simulations of complicated, nonlinear processes even when data are sparse (Kadhim 2011; Akbari and Vafakhah 2016). SVM is a newer machine learning method often regarded as one of the best models for intricate classification problems (Ghosh et al. 2014; Suess et al. 2015).

Previous studies on these topics include the following. Maca and Pech (2015) used two different artificial neural network models and two drought indices, SPI and SPEI, for two watersheds in the USA. Both neural network models were trained with an adaptive version of differential evolution. The results showed that the integrated neural network model was superior to the feed-forward multilayer perceptron. Nguyen et al. (2015) examined correlations with sea surface temperature anomalies and used both the SPI and SPEI with an ANFIS model in the Cai River Basin in Vietnam; they found that the long-term ANFIS model was the best forecasting model. Keskin et al. (2009) applied the SPI for meteorological drought analysis at nine stations around the Lakes District, Turkey, at different time scales, using ANFIS and fuzzy logic models. Comparison of the observed values and the modeling results showed better agreement for long-term ANFIS models than for fuzzy logic models. Shirmohammadi et al. (2013) used ANFIS, ANN, Wavelet-ANN, and Wavelet-ANFIS models to forecast drought over the next 3 months on the basis of the SPI for the Azerbaijan Province of Iran. The results demonstrated that all of the considered methods were able to forecast the SPI, but the hybrid Wavelet-ANFIS model performed best. Ustaoglu et al. (2008) used three different intelligent system methods to predict minimum, maximum, and daily average temperature. Hosseinpour et al. (2011) used an ANFIS model with different input variables to forecast autumn droughts in eastern Iran. The results showed that the appropriate inputs differed across lead times, and that using a single fixed input could not lead to optimal modeling.

Belayneh and Adamowski (2013) used artificial neural networks (ANN), wavelet neural networks (WNN), and support vector regression (SVR) to forecast drought conditions in the Awash River Basin of Ethiopia. The 3-month standard precipitation index (SPI 3) and SPI 12 were forecast over lead times of 1 and 6 months in each sub-basin. The performance of all the models was assessed and compared using the RMSE and the coefficient of determination (R2). The WNN models showed superior correlation between observed and predicted SPI compared with the simple ANN and SVR models. The main aim of the present research is to investigate artificial intelligence techniques such as ANN, ANFIS, and SVM in order to find a proper model for drought forecasting in Bojnourd. The rest of the article is organized as follows. The "Materials and methods" section briefly describes the drought indices, training methods, datasets, and model performance measures. The "Results and discussion" section presents the results of implementing the methodologies and compares the performance of the models. Conclusions are given in the last section.

Materials and methods

Study area and data

Bojnourd, the capital city of North Khorasan Province, is located 701 km from Tehran at latitude 37°28′30″ N and longitude 57°20′00″ E, as shown in Fig. 1. Bojnourd lies in a semi-arid region. The classification of the SPI is shown in Table 1.

Fig. 1
figure 1

Map of the Bojnourd containing SPI index

Table 1 Drought classification according to the SPI index (Jinal and Parekh 2013)

Artificial Neural Network (ANN)

An artificial neural network (ANN) is a computational model for information processing inspired by the human brain (Maier et al. 2010; Wambua et al. 2016). In this study, among ANN approaches, the multilayer perceptron neural network is applied. A multilayer perceptron (MLP) can solve mathematical problems requiring nonlinear mappings by learning appropriate weights (Scarselli and Tsoi 1998). A typical MLP consists of at least three layers: the first is the input layer, the last is the output layer, and the remaining layers are hidden layers (Zhang et al. 2003). The structure of the MLP is shown in Fig. 2.

Fig. 2
figure 2

Structure of multilayer perceptron

The log-sigmoid function is used as the activation function between the input layer and the hidden layer, and a linear activation function is used between the hidden layer and the output layer (Adam et al. 2016). These functions are given below:

$$ Y = f\left( x \right) = 1 / (1 + e^{ - x} ) $$
(1)
$$ Y = f\left( x \right) = x $$
(2)
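These two activation functions are simple to implement; the sketch below is illustrative only (the function names mirror MATLAB's logsig and purelin for convenience):

```python
import numpy as np

def logsig(x):
    # Eq. (1): log-sigmoid activation, used between input and hidden layer
    return 1.0 / (1.0 + np.exp(-x))

def purelin(x):
    # Eq. (2): linear activation, used between hidden and output layer
    return x

print(logsig(0.0))  # 0.5
```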

Among the different training methods, the Levenberg–Marquardt (LM) algorithm is one of the most efficient for training neural networks. It is typically the fastest method and provides a numerical solution for minimizing the mean squared error (Kayri 2016).

In this part, the dataset is first divided into two parts: 85% is used for the training phase and 15% for the testing phase.
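A minimal sketch of this 85/15 split, using placeholder arrays in place of the study's meteorological dataset (all values here are synthetic):

```python
import numpy as np

# Hypothetical dataset: rows are monthly records, columns are illustrative
# predictors (e.g., temperature, precipitation, humidity); SPI is the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # placeholder predictor values
y = rng.normal(size=200)        # placeholder SPI values

split = int(0.85 * len(X))      # 85% training / 15% testing, as in the study
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(len(X_train), len(X_test))  # 170 30
```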

Prediction models require metrics to measure their accuracy. In this stage, we use statistical parameters, namely the correlation coefficient (R) and the root mean square error (RMSE), to measure the difference between estimated and observed values (Han and Kamber 2006; Gonzalez-Sanchez et al. 2014; Arabasadi et al. 2017).

$$ R = \sqrt {1 - \frac{{\sum\nolimits_{i = 1}^{N} {\left( {y_{o} - y_{p} } \right)^{2} } }}{{\sum\nolimits_{i = 1}^{N} {\left( {y_{o} - \frac{{\sum y_{o} }}{N}} \right)^{2} } }}} $$
(3)
$$ {\text{RMSE}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{N} {\left( {y_{o} - y_{p} } \right)^{2} } }}{N}} $$
(4)

where y o denotes the observed data, y p the estimated data, and N the number of data points. The best output is produced when the RMSE approaches 0 and R approaches 1. The steps of the artificial neural network are as follows (Devi et al. 2012):

  • Step 1: Preparation of the training and testing dataset.

  • Step 2: Decide the number of nodes and, as initialization, set all weights and threshold values of the network to random numbers.

  • Step 3: For every neuron in every layer j = 1, 2, …, M, from input to output layer, find the output from the neuron:

    $$ Y_{j,i} = f\left( {\sum\limits_{k = 1}^{N} {Y_{{\left( {j - 1} \right)k}} W_{jik} } } \right)\quad {\text{where}}\;\;f\left( x \right) = \frac{1}{{1 + \exp \left( { - x} \right)}} $$
    (5)
  • Step 4: Calculate error value:

    $$ E\left( w \right) = \frac{ 1}{2}\sum {\sum\limits_{d \in D} {\sum\limits_{{k \in {\text{outputs}}}} {\left( {t_{kd} - o_{kd} } \right)^{2} } } } $$
    (6)
  • Step 5: For each network output unit k, calculate its error term:

    $$ \delta_{k} = y_{k} \left( {1 - y_{k} } \right)\left( { \, t_{k} - y_{k} } \right) $$
    (7)

    For each hidden unit h, calculate its error term:

    $$ \delta_{h} = y_{h} \left( {1 - y_{h} } \right)\sum\limits_{{k \in {\text{outputs}}}} {w_{kh} \delta_{k} } $$
    (8)
  • Step 6: Update each network weight w ji :

    $$ w_{ji} = w_{ji} +\Delta w_{ji} \quad {\text{where}}\;\Delta w_{ji} = \eta \delta_{j} x_{ji} $$
    (9)
  • Step 7: Update bias θ j in network

    $$ \Delta \theta_{j} = \eta \,{\text{Err}}_{j} $$
    (10)
    $$ \theta_{j} = \theta_{j} + \Delta \theta_{j} $$
    (11)
  • Step 8: If termination condition is met then stop, else go to step 3 (Devi et al. 2012).
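Steps 1–8 above can be sketched as a small batch gradient-descent loop. The NumPy code below is an illustrative implementation on synthetic data (the network size, learning rate, and epoch budget are arbitrary choices, and the bias update of Step 7 is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: toy training data (illustrative only, not the study's dataset)
X = rng.normal(size=(100, 3))
t = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# Step 2: random initial weights; 3 inputs, 4 hidden neurons, 1 output
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))
eta = 0.5  # learning rate

def forward(X):
    h = sigmoid(X @ W1)        # Step 3: hidden-layer outputs (Eq. 5)
    return h, sigmoid(h @ W2)  # output-layer outputs

mse_before = np.mean((t - forward(X)[1]) ** 2)  # Step 4: error (Eq. 6)

for epoch in range(1000):                         # Step 8: fixed epoch budget
    h, y = forward(X)
    delta_out = y * (1 - y) * (t - y)             # Step 5: output term (Eq. 7)
    delta_hid = h * (1 - h) * (delta_out @ W2.T)  # hidden term (Eq. 8)
    W2 += eta * h.T @ delta_out / len(X)          # Step 6: weight updates (Eq. 9)
    W1 += eta * X.T @ delta_hid / len(X)

mse_after = np.mean((t - forward(X)[1]) ** 2)
```

Because the weight updates follow the negative gradient of the squared-error criterion, the mean squared error decreases over the training epochs.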

Fuzzy inference system (FIS)

The fuzzy logic theory was proposed by Lotfi Zadeh to handle problems and ambiguous features that lack precise mathematical solutions (Ramlan et al. 2016). Neural networks and fuzzy systems share some characteristics: both can solve problems such as pattern recognition, time series forecasting, or diagnostics when no closed-form mathematical solution exists. A comparison of the two methods is shown in Table 2 (Kruse 2008).

Table 2 Comparison of neural control and fuzzy system (Kruse 2008)

Adaptive neuro-fuzzy inference system (ANFIS)

The ANFIS model (Demyanova et al. 2017) can approximate any real continuous function on a compact set to any degree of accuracy (Jang et al. 1997). The structure of a neuro-fuzzy network combines neural networks and fuzzy systems: it employs the inference capability of fuzzy systems together with the learning capability of neural networks, improving inference precision under uncertain conditions (Alipour et al. 2014).

A basic Sugeno inference system generates an output function f from input variables x 1 and x 2 using a Gaussian membership function (Patel and Parekh 2014; Demyanova et al. 2017). Assume the Sugeno-type ANFIS model contains two fuzzy IF–THEN rules, as follows (Patel and Parekh 2014):

$$ {\text{Rule}}\; 1 :\;{\text{If}}\;x_{1} \;{\text{is}}\;A_{1} \;{\text{and}}\;x_{2} \;{\text{is}}\;B_{1} ,\;\;{\text{then}}\;f_{1} = a_{1} x_{1} + b_{1} x_{2} + c_{1} $$
(12)
$$ {\text{Rule}}\; 2 :\;{\text{If}}\;x_{1} \;{\text{is}}\;A_{2} \;{\text{and}}\;x_{2} \;{\text{is}}\;B_{2} ,\;\;{\text{then}}\;f_{2} = a_{2} x_{1} + b_{2} x_{2} + c_{2} $$
(13)

where A 1, A 2 and B 1, B 2 are the fuzzy sets (membership functions) of the input variables x 1 and x 2, and a i , b i , and c i are the parameters of the output functions f 1 and f 2. The ANFIS structure is composed of five layers. In the first (input) layer, each input is assigned membership grades in the fuzzy sets defined by the user. In the second layer, the firing strength (weight) of each rule is obtained by multiplying the incoming membership values at each node. In the third layer, the relative (normalized) rule weights are computed. In the fourth layer, each node computes the contribution of its rule to the overall output (Mohammadi et al. 2014). The last layer is the output layer of the network, which aims to minimize the discrepancy between the computed output and the actual output. The goal of training adaptive networks is to approximate the unknown functions underlying the training data. A suitable ANFIS structure is specified according to the input data, the type of input and output membership functions, the number of functions, and the IF–THEN rules (Alipour et al. 2014).
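A single forward pass through the five layers, for the two rules of Eqs. (12)–(13), can be sketched as follows; all centers, widths, and consequent parameters below are made-up illustrative values:

```python
import numpy as np

def gauss_mf(x, c, sigma):
    # Gaussian membership function, as used in the Sugeno-type ANFIS
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

x1, x2 = 0.5, -0.2  # example crisp inputs

# Layer 1: membership grades (centers/widths are illustrative)
mu_A1, mu_A2 = gauss_mf(x1, 0.0, 1.0), gauss_mf(x1, 1.0, 1.0)
mu_B1, mu_B2 = gauss_mf(x2, 0.0, 1.0), gauss_mf(x2, 1.0, 1.0)

# Layer 2: rule firing strengths (product of memberships)
w1 = mu_A1 * mu_B1
w2 = mu_A2 * mu_B2

# Layer 3: normalized (relative) firing strengths
w1n, w2n = w1 / (w1 + w2), w2 / (w1 + w2)

# Layer 4: rule consequents f_i = a_i*x1 + b_i*x2 + c_i (Eqs. 12-13)
f1 = 0.5 * x1 + 0.3 * x2 + 0.1
f2 = -0.2 * x1 + 0.4 * x2 + 0.6

# Layer 5: overall output as the weighted average of the consequents
output = w1n * f1 + w2n * f2
```

In training, the membership parameters and the consequent coefficients would be tuned (e.g., by the hybrid least-squares/backpropagation scheme of Jang et al. 1997); the sketch shows only the inference pass.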

The ANFIS method is implemented using the MATLAB toolbox (R2014b). The steps and pseudocode of ANFIS are shown in Fig. 3.

Fig. 3
figure 3

Flowchart of ANFIS model

Support vector machine (SVM)

The support vector machine (SVM) is a supervised learning model introduced by Vapnik (1999); it can be used for various tasks such as natural language processing, diagnostics, and voice recognition. Its advantages include a more convenient training phase and higher efficiency than many other models. The algorithm finds the separating boundary that classifies the data with the largest safety margin; the data points that determine this boundary are termed support vectors (Vapnik and Chervonenkis 1991; Sujay Raghavendra and Deka 2014; Vieira et al. 2017).

According to Fig. 4, two parallel hyperplanes are created on either side of the "maximum margin hyperplane" to separate the data belonging to each class (Chihaoui et al. 2016). The maximum margin hyperplane is the hyperplane that maximizes the spacing between these two parallel hyperplanes. It is assumed that the classification error decreases as the separating margin, i.e., the distance between the two parallel hyperplanes, increases (Vapnik 1999).

Fig. 4
figure 4

Hyperplane with a maximum margin along with the separation boundaries for two classes of data. The samples located on the borders are support vectors

Overall, a hyperplane with a linear decision boundary (Demyanova et al. 2017) can be defined as follows:

$$ {\mathbf{w}} \cdot {\mathbf{x}} + b = 0 $$
(14)

In Eqs. (14, 15), x is a point on the decision boundary, w is an n-dimensional weight vector orthogonal to the hyperplane, and b is the bias.

$$ {\text{If}}\;x_{i} \;{\text{is}}\;{\text{a}}\;{\text{support}}\;{\text{vector,}}\;{\text{then}}\;y_{i} \left( {{\mathbf{w}} \cdot {\mathbf{x}}_{{\mathbf{i}}} + b} \right) = 1 $$
(15)

The optimal decision boundary is the boundary with the maximum margin. It is found by detecting the hyperplane that maximizes the margin between the two classes while minimizing the magnitude of the weights, as follows:

$$ \mathop {\hbox{max} }\limits_{w,b} \mathop {\hbox{min} }\limits_{i = 1 \ldots L} \left[ {y_{i} \frac{{\left( {wx_{i} + b} \right)}}{\left| w \right|}} \right] $$
(16)

According to Eq. (15), and after a series of mathematical operations, the above equation is converted to:

$$ \mathop {\hbox{min} }\limits_{w,b} \frac{1}{2}\left| w \right|^{2} ,\quad y_{i} \left( {wx_{i} + b} \right) - 1 \ge 0\quad i = 1, \ldots ,L $$
(17)
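The equivalence of Eqs. (16) and (17) follows from a standard argument: for the support vectors the constraint holds with equality, \( y_{i} \left( {wx_{i} + b} \right) = 1 \), so the margin in Eq. (16) equals \( 1/\left| w \right| \), and

$$ \mathop {\hbox{max} }\limits_{w,b} \frac{1}{\left| w \right|}\; \Leftrightarrow \;\mathop {\hbox{min} }\limits_{w,b} \frac{1}{2}\left| w \right|^{2} $$

where squaring and halving \( \left| w \right| \) leaves the minimizer unchanged while making the objective differentiable.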

Using the Lagrange multipliers λ i , this optimization problem can be converted to the dual problem below (Sujay Raghavendra and Deka 2014; Chao and Horng 2015; Wang et al. 2015). This formulation applies when the classes are separable:

$$ \begin{array}{*{20}l} { \mathop {\hbox{max} }\limits_{{\lambda_{1} \ldots \lambda_{L} }} \left[ {\sum\limits_{i = 1}^{L} {\lambda_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{L} {\sum\limits_{j = 1}^{L} {\lambda_{i} y_{i} \left( {x_{i} \cdot x_{j} } \right)y_{j} \lambda_{j} } } } \right]} \hfill \\ { \lambda_{i} \ge 0\quad i = 1, \ldots ,L\quad \sum\limits_{i = 1}^{L} {\lambda_{i} y_{i} = 0} } \hfill \\ \end{array} $$
(18)

where λ i is the Lagrange multiplier. Equation (18) provides a linear boundary between two completely separable classes; in non-separable cases, the error increases when overlapping classes are separated by a linear decision boundary (Demyanova et al. 2017). Consequently, the Karush–Kuhn–Tucker (KKT) conditions (Chao and Horng 2015) are used to solve the optimization problem, as stated in Eqs. (19, 20):

$$ \begin{aligned} & w = \sum\limits_{i = 1}^{L} {\lambda_{i} \,y_{i} x_{i} } \\ & \lambda_{i } \left( {y_{i } \left( {{\mathbf{w}} \cdot {\mathbf{x}}_{i} + b} \right) - 1} \right) = 0\quad i = 1, \ldots ,L \\ \end{aligned} $$
(19)
$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{{\lambda_{1} \ldots \lambda_{L} }} \left[ {\sum\limits_{i = 1}^{L} {\lambda_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{L} {\sum\limits_{j = 1}^{L} {\lambda_{i} y_{i} \left( {\varphi \left( {x_{i} } \right)\varphi \left( {x_{j} } \right)} \right)y_{j} \lambda_{j} } } } \right] \\ & C \ge \lambda_{i} \ge 0 \\ & \sum\limits_{i = 1}^{L} {y_{i} \lambda_{i} } = 0 \\ \end{aligned} $$
(20)

In Eq. (20), the Lagrange multipliers for nonlinearly separable data are bounded as \( 0 \le \lambda_{i} \le C \) (Kumar 2016).

$$ K\left( {x_{i} \cdot x_{j} } \right) = \varphi \left( {x_{i} } \right)\varphi \left( {x_{j} } \right) $$
(21)

By Eq. (21), x is mapped to a high-dimensional space via φ(x i ). Computation in the non-separable or high-dimensional case is carried out using kernel functions. Kernel functions such as the linear kernel, polynomial kernel, radial basis kernel (RBF), and sigmoid kernel are used to calculate the inner products in the high-dimensional feature spaces (Hsu et al. 2013), as shown below:

  • Linear kernel

    $$ K\left( {x_{i} \cdot x_{j} } \right) = x_{i} \cdot x_{j} $$
    (22)
  • Polynomial kernel

    $$ K\left( {x,y} \right) = \left( {xy + 1} \right)^{p} $$
    (23)
  • Radial basis kernel (Gaussian)

    $$ K\left( {x,y} \right) = e^{{ - \left\| {x - y} \right\|^{2} /2\sigma^{2} }} $$
    (24)
  • Sigmoid

    $$ K\left( {x,y} \right) = \tanh \left( {kxy - \delta } \right) $$
    (25)

The RBF (radial basis function) kernel is selected in this study because it is appropriate for both large and small datasets and for various dimensions (Jiao et al. 2016).
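The kernel functions above can be written directly in code; a small NumPy sketch (parameter values are illustrative):

```python
import numpy as np

def linear_kernel(x, y):
    # Eq. (22): inner product in the original space
    return np.dot(x, y)

def polynomial_kernel(x, y, p=2):
    # Eq. (23): polynomial kernel of degree p
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    # Eq. (24): Gaussian (radial basis) kernel, used in this study
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(rbf_kernel(x, x))  # identical points give the maximum value, 1.0
```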

Kolmogorov–Smirnov hypothesis test

This article uses an analytical procedure and a statistical hypothesis test to ensure that the solution is acceptable. Nonparametric statistical methods such as the Chi-square and Kolmogorov–Smirnov tests are useful in this case. The two-sample Kolmogorov–Smirnov test is a nonparametric test that compares the cumulative distributions of two datasets to determine whether they differ significantly. The Kolmogorov–Smirnov test is named in honor of two Russian statisticians, A. N. Kolmogorov and N. V. Smirnov (Sahoo 2013).

$$ {\text{Ds}} = \hbox{max} \left| {F\left( x \right) - F\left( {x_{0} } \right)} \right| $$
(26)

F(x) refers to the empirical distribution of the predicted data and F(x 0) to the empirical distribution of the observed data. Ds is the greatest difference between the two cumulative distribution functions (CDFs) (Hassani and Sirimal-Silva 2015). The two-sample K–S test hypotheses are stated as follows:

H 0

The two samples are drawn from the same distribution.

H 1

The two samples are drawn from different distributions.

In the two-sample K–S test, the decision to reject the null hypothesis is based on the p value: if the p value is > 0.05, the null hypothesis is not rejected; if the p value is < 0.05, the null hypothesis is rejected at a confidence level of 0.95.
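The statistic of Eq. (26) can be computed directly from the two empirical CDFs. The example below uses synthetic stand-in data; in practice a library routine such as SciPy's two-sample K–S test would also supply the p value:

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    # Eq. (26): maximum distance between the two empirical CDFs
    a, b = np.sort(sample_a), np.sort(sample_b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(3)
obs = rng.normal(size=500)                 # stand-in "observed" series
good = obs + 0.05 * rng.normal(size=500)   # close predictions -> small Ds
bad = rng.normal(2.0, 1.0, size=500)       # biased predictions -> large Ds

d_good = ks_statistic(obs, good)
d_bad = ks_statistic(obs, bad)
```

A small Ds (and correspondingly large p value) means the predicted and observed distributions are statistically indistinguishable, which is how the model comparisons in the next section are read.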

Results and discussion

In the ANN approach, the correlation coefficient in the training phase is shown in Fig. 5. The most common way to determine the number of neurons in the hidden layer is trial and error.

Fig. 5
figure 5

Correlation coefficient output

This stage is part of the network training and development phase. In general, the number of neurons in the hidden layer should be adjusted until a satisfactory answer is produced. According to Table 3, the best neural network structure is number four, with 10 neurons in the hidden layer.

Table 3 Output of neural networks after training network

In the ANFIS approach, we applied the ANFIS method to our datasets. The criteria chosen for the ANFIS model, shown in Table 4, are as follows:

Table 4 ANFIS base modeling criterion
  • Membership function type

  • Epoch size

  • Data size

  • Learning algorithm

  • Output type

Finding proper membership functions (MFs) that minimize the output error measure and maximize performance is essential. The Gaussian function is used as the membership function in this research (Folorunsho et al. 2012). The number of training epochs in the neuro-fuzzy system was set to 3000, which showed that the error decreases as the number of epochs increases. The fuzzy system is defined by linguistic variables.

The range of each input is divided into three subsets and converted to linguistic variables, defined as min, average, and max. IF–THEN rules are then inserted into the fuzzy system, for example:

$$ {\text{IF}}\,\left( {Temp\;{\text{is}}\;{\text{average}}} \right)\,{\text{AND}}\,\left( {Rain\;{\text{is}}\;\hbox{max} } \right)\,{\text{AND}}\,\left( {Humid\;{\text{is}}\;\hbox{max} } \right)\,{\text{THEN}}\,\left( {SPI\;{\text{is}}\;\hbox{min} } \right) $$
(27)

Temp refers to temperature and Rain to precipitation. After the rules are added, the FIS surface presents the relationship between the inputs and the SPI output.

Figure 6 shows a reasonable pattern: for example, when the temperature is extremely high, the SPI is high, and the SPI tends to decrease when precipitation is high. The performance of these models for drought forecasting, according to the regression and the K–S test, is presented in Table 5. The three methods, ANN, ANFIS, and SVM, are applied to forecast drought in Bojnourd in Fig. 7. The results show that the data predicted by SVM are closest to the actual values of the meteorological data.

Fig. 6
figure 6

Output surface for SPI output versus temperature and rain inputs

Table 5 Correlation coefficient for the predicted results
Fig. 7
figure 7

Comparing the calculated values using three methods with actual values

In the SVM approach, as in the ANN model, 85% of the data are used for model training and 15% to test the performance of the model. The results are shown in Table 5. According to these results, the SVM model is recommended for obtaining a suitable approximation of the real dataset. A comparison of the predicted and observed data using the two-sample K–S test is also shown in Table 5, and the cumulative distribution function plots are depicted in Fig. 8. The K–S test reports the maximum difference between the cumulative distributions. The results confirm that the SVM model performs best on the K–S test and produces the best overall result.

Fig. 8
figure 8

Comparison of two cumulative distribution functions (CDF), a real data and neural networks output, b real data and ANFIS, c real data and SVM

Conclusions

The results of the data predicted by these models indicate low errors and high accuracy. Moreover, the regression of 0.9974 achieved by the SVM model shows that this model can be applied at other meteorological stations. The results also indicate that the high flexibility and accuracy of the SVM model make it a powerful tool for simulation and forecasting. Meteorological factors such as precipitation, temperature, and humidity are the most effective factors for increasing prediction accuracy. Finally, the two-sample K–S test fails to reject the null hypothesis for the SVM model, since its p value of 0.9303 is greater than the significance level; compared with ANN and ANFIS, the SVM model therefore reaches an acceptable response.