1 Introduction

Accurate streamflow forecast is very important in water resources system planning, design, operation and management as well as identifying hydrologic drought spells [6], controlling flood events [28], optimizing hydrologic system [17], determining environmental flow portions [33], modeling surface water-groundwater interactions [10], and modeling suspended sediment load in rivers [18]. Traditionally, conceptual simple models have been developed by numerous researchers to describe the rainfall-runoff process for computing the total amount of surface water flows. Although such models do not require more detailed information on the physical parameters, they can produce acceptable results in some cases [34]. In the contrary, physically-based models of river flow forecast are generally time consuming and complex which need lots of input variables for simulating river flow magnitudes [2]. So, using autoregressive moving average (ARMA) approaches for forecasting the streamflow magnitudes using the previously recorded flow magnitudes have been proposed as alternatives for physically-based models [23].

As a substitute, application of heuristic models e.g. adaptive neuro-fuzzy inference system (ANFIS) in streamflow forecasting has become viable. For instance, Wang et al. [35], Kagoda et al. [16] and Humphrey et al. [14] applied artificial neural networks (ANNs) models for streamflow forecasting. Shiri and Kisi [30] introduced a wavelet-neuro-fuzzy model of streamflow forecasting. Sharma et al. [29] compared neuro-fuzzy model with a physically based watershed model for streamflow forecasting and concluded that the neuro-fuzzy model was equally comparable to physical model especially when rain gauge stations were not adequate. Ballini et al. [3] applied ANFIS for seasonal river flow forecast. Nayak et al. [25] ANFIS modeling approach to model the long-term daily river flow magnitudes in India and reported that ANFIS gave promising results in this case. Vernieuwe et al. [34] compared data-driven Takagi–Sugeno models for rainfall–discharge dynamics modeling. Zounemat Kermani and Teshnelab [39] introduced ANFIS approach as a strong method of daily streamflow prediction when compared to ANN and traditional regression models. El-Shafie et al. [9] proposed ANFIS technique to forecast the inflow for the Nile River at Aswan High Dam on monthly basis. Wang et al. [36] examined different heuristic models for forecasting monthly discharge time series and introduced the ANFIS models as the most accurate technique in this field. Rath et al. [26] applied hierarchical neuro-fuzzy model for real-time flood forecasting. He et al. [11] compared ANN, ANFIS and SVM for forecasting riverflow in a semiarid mountain region and found that the performance of the applied models in river flow forecasting was satisfactory. Yarar [37] introduced a hybrid wavelet-ANFIS model for forecasting monthly streamflow time series. Yilmaz and Muttil [38] utilized different machine learning techniques including ANFIS for runoff estimation. Talei et al. [32] applied Takagi–Sugeno neuro-fuzzy model with online learning for runoff forecasting. Anusree and Varghese [1] compared ANFIS, ANN and MNLR models for daily streamflow forecasting and found the ANFIS as the superior model.

The main aim of this study is to forecast long period daily streamflows using three different adaptive neuro fuzzy techniques, i.e. ANFIS with Grid Partition (ANFIS-GP), ANFIS with subtractive clustering (ANFIS-SC) and ANFIS with fuzzy C-Means clustering (ANFIS-FCM). Two different ANFIS-GP methods were considered in the present study: ANFIS-GP with constant output and ANFIS-GP with linear output. The models were also compared according to their complexity and training durations.

2 Methods

2.1 Adaptive Neuro-fuzzy Inference System (ANFIS)

ANFIS is a merger of an adaptive neural network (ANN) and a fuzzy inference system (FIS), where the parameters of FIS are identified by the ANN learning algorithms. ANFIS is able to estimate real continuous functions on a compact set of parameters with any degree of accuracy [15]. There are two approaches for FIS, namely, Mamdani and Assilian [24] and Takagi and Sugeno [31]. The differences between the two approaches corresponds to the consequent part where Mamdani’s method uses fuzzy membership functions, while linear or constant functions are utilized in Sugeno’s method.

2.1.1 ANFIS Architecture

Let’s assume a FIS having two input variables of x and y and one output variable f. The first-order Sugeno fuzzy model, a typical rule set with two fuzzy If-Then rules would read:

$$ {\text{Rule}}\, 1 :{\text{ If}}\,x\,{\text{is A}}_{ 1} {\text{and}}\,y\,{\text{is B}}_{ 1,} {\text{then}}\,f_{1} = p_{1} x + q_{1} y + r_{1} $$
(1)
$$ {\text{Rule}}\, 2 :\,{\text{If}}\,x\,{\text{is A}}_{ 2} \,{\text{and}}\,y\,{\text{is}}\,{\text{B}}_{ 2,} {\text{then}}\,f_{2} = p_{2} x + q_{2} y + r_{2} $$
(2)

where A1, A2 and B1, B2 are the membership functions (MFs) of the inputs x and y, respectively and p1, q1, r1 and p2, q2, r2 are the parameters of the output function. Here, the output f is the weighted mean of the single rule outputs.

The output of the ith node in layer l is shown as O l,i . Every node i in Layer 1 is an adaptive node with node \( O_{l,i} \, = \,\varphi A_{i} \left( x \right) \) , for i = 1, 2, or \( O_{l,i} \, = \,\varphi B_{i - 2} \left( y \right) \), for i = 3, 4, where x (or y) is the input to the ith node and A i (or Bi-2) is a linguistic label (such as ‘low’ or ‘high’) associated with this node. The MFs for A and B are generally described by generalized bell functions as:

$$ \varphi A_{i} \left( x \right) = \frac{1}{{1 + \left[ {\left( {x - c_{i} } \right)/a{}_{i}} \right]^{{2b_{i} }} }} $$
(3)

where {a i , b i , c i } is the parameter set. Parameters in this layer are called as premise parameters. The outputs of this layer are the membership values of the premise part. Layer 2 includes the nodes labeled Π which multiply incoming signals and sending the product out. For instance,

$$ O_{2,i} = w_{i} = \varphi A_{i} (x)\varphi B_{i} (y), \,i = { 1},\, 2. $$
(4)

Each node output shows the firing strength of a rule. The nodes labeled N computes the ratio of the ith rule’s firing strength to the sum of all rules’ firing strengths in Layer 3,

$$ O_{3,i} = \overline{{w_{i} }} = \frac{{w_{i} }}{{w_{1} + w_{2} }},\,i = { 1},\, 2. $$
(5)

The outputs of this layer are referred to as normalized firing strengths. The nodes of the Layer 4 are adaptive with node functions

$$ O_{4,i} = \overline{{w_{i} }} f_{i} = \overline{{w_{i} }} \left( {p_{i} x + q_{i} y + r_{i} } \right) $$
(6)

where \( \overline{{w_{i} }} \) is the output of Layer 3, and \( \left\{ {p_{i} ,\,q_{i} ,\,r_{i} } \right\} \) are the parameter set. Parameters of this layer are called as consequent parameters. The single fixed node of the Layer 5 labeled Σ computes the final output as the summation of all incoming signals

$$ O_{5,i} = \sum\limits_{i = 1}^{{}} {\overline{{w_{i} }} f_{i} } = \frac{{\sum\nolimits_{i} {w_{i} f_{i} } }}{{\sum\nolimits_{i} {w_{i} } }} $$
(7)

So, an adaptive network which is functionally equivalent to a Sugeno first-order fuzzy inference system is built.

2.1.2 Grid Partition Method

Grid partition (GP) is one of the commonly used methods for producing initial FIS rules for ANFIS building, where the space including input- output parameters is divided into certain partitions called as grids. Each grid expresses a fuzzy surface, and interference areas between grids make a continuous output surface [13, 22]. There is no fixed rule for defining the number of MFs for each variable, so they are identified through a trial and error process. The learning process is begun from zero output and during the learning process, functions and fuzzy rules are trained gradually [7]. Although they are different membership function types which can be applied in modeling various procedures, the literature review shows that triangular MFs are commonly used and the most optimal MF types for practical applications [27]. Nevertheless, other studies (e.g. [34]) have confirmed that the type of MFs cannot affect the results of simulation majorly, though other studies have demonstrated the major effect of MFs in modeling accuracy (e.g. [19]).

2.1.3 Subtractive Clustering Method

The subtractive clustering method assumes that each data point is a potential cluster center and calculates a measure of the likelihood that each data point would define the cluster center on the basis of the density of surrounding data points. Considering a set of n data points \( \left\{ {x_{1} ,\,x_{2} ,\, \ldots ,\,x_{i} } \right\} \) in m-dimensional space, it is assumed that all data points within a cubic space have been normalized. In subtractive clustering, each of the data points is considered as a potential cluster center. As a result, the density index Di corresponding to the data xi can be expressed as follows:

$$ D_{i} = \sum\limits_{j = 1}^{n} {\exp \left( { - \frac{{\left\| {x_{i} - x_{j} } \right\|}}{{(r_{a} /2)^{2} }}} \right)} $$
(8)

Here, ra is a positive quantity called cluster radius. If many data points are adjacent to a data point, hence, that data point has the maximum density. After measuring the density of each data point, data point with the highest density is selected as the first data center clustering [12]. If the effect limited area of the center of the first cluster center is removed, following formula is used to measure the other points density.

$$ D_{i} = D_{i} - D_{c1} \sum\limits_{j = 1}^{n} {\exp \left( { - \frac{{\left\| {x_{i} - x_{cl} } \right\|^{2} }}{{\left( {r_{b} /2} \right)^{2} }}} \right)} $$
(9)

Here, xc1 and Dc1 are the selected points and density potential, respectively. r b is a positive constant. To avoid approaching the cluster centers, the r b constant value is normally larger than r a (r b is considered 1.5r a ). After measuring the density for each data point, the next cluster center xc2 is selected and all the measured density for data points will be recalculated. This process continues until a sufficient number of cluster center produce [2, 20].

2.1.4 Fuzzy C-Means Method

In fuzzy clustering, each pattern might belong to several clusters or segment. One of the most functional clustering algorithms is K-mean algorithm. This unsupervised algorithm in large datasets, exposures with some limitations in the process, may not work properly. To deal with the disadvantages, different clustering algorithms have been proposed. Among them, fuzzy c-means as a proper alternative method is used [21]. Fuzzy C-means (FCM) was developed by Dunn [8] and Bezdek [4] improved it.

The FCM method blocks a set of N vector x i , i = 1,…, n, into c fuzzy clusters, where each pattern is corresponded to a cluster with a degree specified by a membership grade uij between 0 and 1. The final object by the FCM algorithm is to find c cluster centers so that the cost function of the dissimilarity measure can be minimized. The aim is minimizing the objective function that is defined as below:

$$ Min J_{FCM} = \mathop \sum \limits_{c = 1}^{C} \mathop \sum \limits_{i = 1}^{N} w_{ic}^{p} \left\| {w_{i} } \right. - \left. {v_{c} } \right\|^{2} \quad s.t. \sum\limits_{c = 1}^{C} {w_{ic} } = 1 , i = 1,2, \ldots , N $$
(10)

which p (\( 1 < p \)) is known as fuzzifier portion and N, is the number of data points; C, the number of clusters; \( w_{ic} \), the number of belongings of the ith data point to the cth cluster; v, is the clusters center and x is the number of the input. for calculating the amount of \( w_{ic} \) the following formula is used [5]:

$$ w_{ic} = \frac{1}{{\mathop \sum \nolimits_{L = 1}^{C} \left( {{{d_{ic}^{2} } \mathord{\left/ {\vphantom {{d_{ic}^{2} } {d_{ij}^{2} }}} \right. \kern-0pt} {d_{ij}^{2} }}} \right)^{{{1 \mathord{\left/ {\vphantom {1 {(p - 1)}}} \right. \kern-0pt} {(p - 1)}}}} }}\quad for \;i = 1,2, \ldots ,N\;and\;c = 1,2, \ldots ,C $$
(11)

For beginning of the center vectors, centers are calculated by:

$$ v_{c} = \frac{{\mathop \sum \nolimits_{j = 1}^{N} w_{jc}^{p} x_{j} }}{{\mathop \sum \nolimits_{j = 1}^{N} w_{jc}^{p} }} $$
(12)

FCM procession continues until a convergence condition is obtained.

3 Case Study

Daily streamflow data from two stations, Murder Creek near Evergreen (Hydrologic Unit Code 03140304, Latitude 31°25′06″, Longitude 86°59′12″, Drainage area 176.00 square miles, Gage datum 178.29 feet a.s.l.) and Choctawhatchee River Near Newton (Hydrologic Unit Code 03140201, Latitude 31°20′34″, Longitude 85°36′38″, Drainage area 686.00 square miles, Gage datum 138.56 feet a.s.l.), Alabama, USA were used in this research. The location of the study area and stations are illustrated in Fig. 1. The reason of selection of these stations is due to having long data period. Data covering the period of 1936–2016 for the both stations were divided into three parts, training (01/01/1938–12/31/1976, 14245 values), validation (01/01/1977–12/31/1996, 7305 values) and testing (01/01/1997–12/31/2016, 7305).

Fig. 1
figure 1

The location of the studied area

The summary of statistical properties is reported in Table 1 for the used streamflow data. From the table, it is clear that data have highly skewed distributions, skewness coefficient ranging between 8.19 and 20.7. Data of the Choctawhatchee River has more autocorrelations than those of the Murder Creek. This may be due to the high discharge volume of the Choctawhatchee.

Table 1 The statistical properties of the streamflow data sets

4 Application and Results

In this study, the ability of cluster based neuro fuzzy methods, ANFIS-SC and ANFIS-FCM, was investigated in forecasting daily streamflows which have long data period (1936–2016). The results of these methods were compared with the ANFIS-GP which uses all possible rule combinations and generally has higher complexity and computational time when compared to cluster based neuro fuzzy methods. For the ANFIS-GP, two different outputs, constant and linear, were applied to detect the difference with each other. The models were compared according to the four different statistics, root mean square error (RMSE), mean absolute relative error (MARE), determination coefficient (R2) and Nash-Sutcliffe efficiency (NE) which can be expressed as

$$ RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {Q_{m,i} - Q_{o,i} } \right)^{2} } } $$
(13)
$$ MARE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{\left| {Q_{m,i} - \left. {Q_{o,i} } \right|} \right.}}{{Q_{o,i} }}} 100 $$
(14)
$$ NSE = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {\left( {Q_{m,i} - Q_{o,i} } \right)} }}{{\sum\nolimits_{i = 1}^{n} {\left( {Q_{o,i} - \overline{Q}_{o} } \right)} }} $$
(15)

where Q o,i and Q m,i are the observed and estimated streamflows, N is the number of time steps,\( \overline{Q}_{o} \) is the mean of the observed streamflows. First, auto and partial auto correlation analysis were employed and they suggested three previous lags. In the applications, however, five input scenarios comprising five previous lags were used from input(i) to input(iv) comprising Qt-1 to Qt-1, Qt-2, Qt-3, Qt-4 and Qt-5.

The results of the ANFIS-GP models with constant and linear outputs are presented for the Choctawhatchee River in Tables 2 and 3. For the ANFIS-GP models, different number of triangular membership functions were tried and the best one that gave the minimum RMSE in the validation period was selected. It is clear from the tables that the ANFIS-GP with constant output has the best accuracy in test period for the 3rd input combination while the 1st input provides the best results for the ANFIS-GP with linear output. First model comprising constant output seems to be superior to the second model. This can also be seen from the mean of the all input combinations. From the mean statistics, it is clear that the training accuracy of ANFIS-GP models with linear output is better compared to the other one. The 2nd model with linear output can approximate better than the 1st model with constant output because it has higher number of parameters and more flexible than the latter one. Assume that we used 2 Gaussian membership functions (each has 2 parameters) and 5 inputs for each model. In this case, the premise parameters of the both models will be 2 * 25 = 64. The 1st model will have 32 rules, each has 1 constant output parameter and totally it will have 32 output parameters while the 2nd model will have 32 rules, each has 6 output parameters and totally 32 * 6 = 192 output parameters. For this reason, the training duration of the ANFIS-GP models with linear output is also much higher than those of the ANFIS-GP models with constant output especially when the number of inputs is higher than 2.

Table 2 Results of ANFIS-GP models with constant output—Choctawhatchee River
Table 3 Results of ANFIS-GP models with linear output—Choctawhatchee River

Tables 4 and 5 report the results of ANFIS-FCM and ANFIS-SC models with respect to RMSE, MARE, R2 and NE for the Choctawhatchee River. For the ANFIS-FCM models, different number of cluster numbers (vary between 1 and 8) which decides the number of rules were tried and the best one that gave the minimum RMSE in the validation period was selected. For the ANFIS-SC models, different number of radii values (vary between 0.1 and 1) which decides the number of membership functions and rules were tried. It is apparent from the tables, the 3rd input combination has the best accuracy for the both methods and after 3 lags input, the accuracy of the models does not considerably increase. Comparison with the ANFIS-GP models indicates that the ANFIS-FCM model slightly performs superior to the ANFIS-GP with constant output and ANFIS-SC models. The training duration of the ANFIS-GP with constant output is less than those of the cluster based ANFIS-FCM and ANFIS-SC models. The reason of this might be fact that the ANFIS-FCM and ANFIS-SC have linear output comprising more consequent parameters. However, the main advantage of the cluster based neuro fuzzy methods is that their rules are automatically determined based on the selected cluster number or radii value. For example, in case of 8 clusters, we will have only 8 rules for the whole fuzzy model while the ANFIS-GP has 32 rules when the input number is 5.

Table 4 Results of ANFIS-FCM models—Choctawhatchee River
Table 5 Results of ANFIS-SC models—Choctawhatchee River

The statistics of the ANFIS-GP models with constant and linear outputs are compared in Tables 6 and 7 for the Murder Creek. As seen from the tables, the both methods have the best accuracy in the test period for the 2nd input combination. After 2nd input, the accuracy of ANFI-GP model comprising linear output is worsening. It can be said that increasing input number increases the complexity of the model and this results in worse streamflow forecasts. Similar to the Choctawhatchee River, the training durations of the ANFIS-GP with linear output is higher than those of the models with constant output especially for the inputs higher than 2. Tables 8 and 9 present the training, validation and testing results of ANFIS-FCM and ANFIS-SC models for the Murder Creek. ANFIS-FCM model has the best accuracy in 2nd input combination while the 3rd input combination provides the best accuracy for ANFIS-SC model. Comparison of the Tables 6, 7 and 8 clearly shows that the ANFIS-GP model with linear output slightly performs superior to the ANFIS-GP with constant output and ANFIS-FCM models. ANFIS-SC model has the worst accuracy even though it has the least training duration. Comparison of two stations obviously indicates that the accuracy of the applied models is better for the Choctawhatchee River compared to Murder Creek. The main reason of this might be the fact that the data of first station has lower autocorrelations than the latter one.

Table 6 Results of ANFIS-GP models with constant output—Murder Creek
Table 7 Results of ANFIS-GP models with linear output—Murder Creek
Table 8 Results of ANFIS-FCM models—Murder Creek
Table 9 Results of ANFIS-SC models—Murder Creek

The time variation of observed and forecasted streamflows by using the optimal ANFIS-GP-constant, ANFIS-FCM and ANFIS-SC models can be seen from Fig. 2 for the Choctawhatchee River. From the figures, it is clear that the ANFIS-FCM model catches the high streamflow values better than the other models. The ANFIS-SC also seems to be better than ANFIS-GP model. Figure 3 makes the scatterplot comparison of the applied models. ANFIS-GP model has slightly higher R2 than the ANFIS-FCM. However, the a and b coefficients of the fit line equation (assume that the fit line is y = ax + b) are respectively closer to the 1 and 0 (exact line is y = x) for the ANFIS-FCM compared to the ANFIS-GP model. Figure 4 illustrates the time variation graphs of the observed and forecasted streamflows by using the optimal ANFIS-GP-linear, ANFIS-FCM and ANFIS-SC models for the Murder Creek. The ANFIS-GP-linear and ANFIS-FCM models considerably underestimate peak discharges while the ANFIS-SC overestimates. The scatter diagrams of the three methods are given in Fig. 5. As seen from the figure, the ANFIS-GP model has slightly higher R2 than the ANFIS-FCM while the slope coefficient of the latter model closer to the 1 compared to the first model. The ANFIS-SC seems to be insufficient in forecasting daily streamflows of Murder Creek.

Fig. 2
figure 2

The time variation of the observed and forecasted streamflows by using the optimal ANFIS-GP-constant, ANFIS-FCM and ANFIS-SC models—Choctawhatchee River

Fig. 3
figure 3

The scatterplots of the observed and forecasted streamflows by using the optimal ANFIS-GP-constant, ANFIS-FCM and ANFIS-SC models—Choctawhatchee River

Fig. 4
figure 4

The time variation of the observed and forecasted streamflows by using the optimal ANFIS-GP-linear, ANFIS-FCM and ANFIS-SC models—Murder Creek

Fig. 5
figure 5

The scatterplots of the observed and forecasted streamflows by using the optimal ANFIS-GP-linear, ANFIS-FCM and ANFIS-SC models—Murder Creek

5 Conclusion

Long period streamflow data from two hydrometric stations in USA were used in the present research to forecast streamflow magnitudes in daily forecast horizon. Adaptive neuro-fuzzy inference system (ANFIS) with three different running algorithms, namely, ANFIS grid partitioning (ANFIS-GP), ANFIS sub clustering (ANFIS-SC) and ANFIS fuzzy C means (ANFIS-FCM) were then trained, validated and tested using these data. Five input combinations were tried by also considering the auto- and partial-auto-correlation functions of the streamflow records during the study period to see the effect of 5 time lags on the predictions. Using different models and input combinations it was revealed that the best input combination (which can be used to feed the predictive models) is somewhat model-specific, where introducing more input parameters (beyond the double-input combination) has deteriorated the ANFIS-GP accuracy. This might be linked to the model complexity by using more inputs and might dictate a risk of redundancy when using inputs roughly based on linear measures (e.g. auto correlation). Nevertheless, the models architectures had monotonous influence on the predictive models performance that showed the necessity of performing sensitivity analysis on the models architectures. This might be crucially important when using short period data, where the time domain is limited and general trend of data which can affect the predictions are not involved in model training. It was seen that the ANFIS-GP model with linear output produce complex model structure especially in case of high number of inputs compared to ANFIS-GP with constant output.

In this study, high number of membership functions were not tried for that ANFIS-GP model because its parameters exponentially increase when the number of MFs was increased. In future studies, the effect of MFs numbers may be investigated by using high speed computers.