Introduction

Reliable estimation of streamflow, one of the fundamental components of the hydrological cycle, is a key step in the effective management of water resources and in disaster reduction, early warning, and response (Sharma and Machiwal, 2021). Daily and hourly flow forecasts are of great importance for flood management systems, while monthly and annual flow forecasts are valuable for reservoir operation, irrigation system management, and hydroelectric generation (Yaseen et al., 2015; Wegayehu and Muluneh, 2022). However, streamflow depends on many meteorological parameters such as precipitation, temperature, evaporation, soil moisture, and infiltration, which makes it difficult to estimate. Artificial intelligence techniques, which can readily model such nonlinear relationships, have therefore recently gained popularity for flow prediction (Zhang et al., 2018a, 2018b).

In recent years, many studies across different disciplines have modeled data with artificial neural networks, and such networks are widely used to model hydrological and hydrometeorological data. Several artificial neural network (ANN) types, such as feed forward neural networks (FFNN), generalized regression neural networks (GRNN), and radial basis neural networks (RBNN), often trained with the Levenberg–Marquardt (LM) algorithm, have been used to model hydrometeorological variables such as streamflow, precipitation, and evaporation. Recurrent artificial neural networks, on the other hand, have been applied frequently in recent years to polyphonic music, sound and video modeling, speech signals, language models, and sequential time series in general, and the technique has come to be recommended thanks to the successful results it has widely produced (Sattari et al., 2012; Bahdanau et al., 2014; Cho et al., 2014; Chung et al., 2014; Shoaib et al., 2016; Xiao et al., 2017).

Elman (1990) introduced the recurrent neural network technique, applying it to classify sentence constituents into noun and verb categories with successful results. Connor et al. (1994) applied the technique to modeling nonlinear load time series and achieved successful results by integrating a robust learning algorithm into recurrent neural networks. Coulibaly et al. (2000) studied the effect of climatic trends on forecasting annual flow values using the RNN approach; they used the wavelet transform to identify climatic patterns and showed that this modeling technique could successfully capture the climatic trend effect. Kumar et al. (2004) applied two different networks, feed forward and recurrent, and demonstrated the use of ANNs to forecast monthly river flows. Coulibaly and Baldwin (2005) used the dynamic RNN technique to forecast non-stationary hydrological time series; comparing the RNN-based model with the multivariate adaptive regression splines (MARS) model, they found that the RNN-based model performed better. Cheng et al. (2008) proposed a three-stage indirect multi-step-ahead prediction model for long-term hydrologic forecasting. Banerjee et al. (2011) evaluated the prospect of ANN simulation over mathematical modeling in estimating safe pumping rates to maintain groundwater salinity in island aquifers. Chandra and Zhang (2012) suggested the ANN technique together with an alternative approach, real-time recurrent learning (RTRL); they generated various synthetic time series, fitted auto-regressive (AR) and moving average (MA) models, and found that the RTRL model produced more successful results than the other models. Sattari et al. (2012) evaluated the performance of the time lag recurrent neural network (TLRN) model in predicting daily inflow into the Elevian reservoir. Prasad and Prasad (2014) studied the ability of deep networks to extract high-level features and of recurrent networks to perform time series inference. Shoaib et al. (2016) explored the potential of wavelets for the first time in this context and modeled river flows using a coupled time-lagged recurrent neural network (TLRNN). Chang et al. (2018) proposed a deep learning-based model named memory time series networks for time series modeling and prediction. Che et al. (2018) suggested the GRU model, a then-new deep learning approach, to better model missing patterns; applying it to clinical data sets, they found that it gave more successful results, especially in modeling missing patterns. Alizadeh et al. (2021) compared GRU, LSTM, and SAINA-LSTM methods in four different basins in the USA; SAINA-LSTM gave promising results for the region, and the LSTM and GRU models performed better than RNN. Hu et al. (2018) used ANN and LSTM network models to model the precipitation-runoff relationship in the Fen River basin. Zhang et al. (2018a, 2018b) predicted and simulated the water level in combined sewer overflow structures using four different neural network models: MLP, WNN, LSTM, and GRU. Zhao et al. (2021a, 2021b) combined an improved gray wolf optimizer (IGWO) with the GRU method to estimate flow data and compared the resulting model against the LSSVM and ELM methods.
Wegayehu and Muluneh (2022) employed stacked LSTM (S-LSTM), bidirectional LSTM (Bi-LSTM), and GRU models, together with the classical multilayer perceptron (MLP) network, for the prediction of daily streamflow in the Awash River basin; the MLP and GRU models showed better prediction results than the other models.

The present study aims to test and model the performance of recurrent neural network algorithms on streamflow series with high variability. For this purpose, three flow observation stations with high coefficients of variation were selected, and gated algorithms were employed to avoid the vanishing gradient problem. Model performances were evaluated with various statistical parameters and graphical methods.

Material and method

Study area and data

In the present study, data for the water years between 1978 and 2015 obtained from three streamflow gauging stations, numbered E23A004, E14A022, and E21A019 and located in the Erzincan, Bayburt, and Gümüshane provinces, were used. Data statistics and station details are given in Table 1. The three selected gauging stations have approximately the same climatic conditions, and their schematic locations are shown in Fig. 1.

Table 1 Station details and data information
Fig. 1

The schematic locations of the gauging stations

Recurrent neural networks (RNN)

To understand RNNs, it is helpful to recall feed-forward artificial neural networks, since the operating logic of the two techniques is similar: both are structures that produce outputs by applying a set of mathematical operations to the information arriving at the neurons in the network (Coulibaly & Baldwin, 2005; Coulibaly et al., 2000; Kumar et al., 2004).

Information in a feed-forward network is processed in the forward direction only and cannot be fed back to any point. In this structure, input data is simply passed through the network and output data is obtained. The feed-forward neural network structure is shown in Fig. 2.

Fig. 2

Structure of feed forward neural networks

As shown in Fig. 3, in an RNN structure the network is affected not only by the input but also by context units that carry the previous output. For example, the decision made for the information at time (t-1) also affects the decision to be made at time t. In short, the inputs in such networks produce outputs by combining the current information with the previous information (Chandra & Zhang, 2012; Connor et al., 1994; Coulibaly et al., 2000; Donate & Cortez, 2014; Elman, 1990; Prasad & Prasad, 2014).

Fig. 3

Recurrent neural network cycle

The main aim of recurrent neural networks is to use sequential information. They are called "recurrent" because the output always depends on the previous calculation steps; in other words, RNNs store and make use of information about the steps calculated so far, and therefore work like a memory (Chang et al., 2002; Cheng et al., 2008; Bahdanau et al., 2014; Smith & Yin, 2014; Che et al., 2018).
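As a minimal sketch of this memory-like behavior (the dimensions and random weights below are illustrative placeholders, not a trained model), an Elman-type recurrence can be written in a few lines of Python:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a simple (Elman-type) recurrent cell: the new hidden
    state combines the current input with the previous hidden state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Illustrative dimensions: 1 input feature, 4 hidden units.
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(1, 4)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                            # initial hidden state
for x_t in ([0.2], [0.5], [0.1]):          # a short input sequence
    h = rnn_step(np.array(x_t), h, W_x, W_h, b)
print(h)  # the final state carries information from all earlier steps
```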

Gated recurrent neural networks

Long short-term memory unit

LSTM is an RNN structure that can remember values over arbitrary time intervals. This specific RNN type can learn long-term dependencies and is widely applied to problems in many different disciplines. Because it can accommodate unknown durations and time delays between important events, LSTM is a very convenient method for classifying, processing, and forecasting time series. Moreover, LSTM's relative insensitivity to gap length provides a significant advantage over alternative RNNs, hidden Markov models, and many other learning methods.

The recurrent module in standard RNNs has a very simple structure, such as a single tanh layer. An LSTM network, by contrast, contains LSTM units in place of ordinary network units, and an LSTM unit can retain information over long or short periods. The key to this capability is that no squashing activation function is applied along the recurrent cell-state path; therefore, the stored value is not recursively transformed and the gradient does not vanish over time during backpropagation (Hu et al., 2018; Kim et al., 2018; Zhang et al., 2018a, 2018b).
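For reference, a standard LSTM unit is commonly written with forget (f), input (i), and output (o) gates as below; note that the cell-state update in the fourth line is additive and gated, which is what keeps the stored value from being recursively squashed:

$$\begin{aligned}{f}_{t}&=\sigma \left({W}_{f}{x}_{t}+{U}_{f}{h}_{t-1}+{b}_{f}\right)\\ {i}_{t}&=\sigma \left({W}_{i}{x}_{t}+{U}_{i}{h}_{t-1}+{b}_{i}\right)\\ {o}_{t}&=\sigma \left({W}_{o}{x}_{t}+{U}_{o}{h}_{t-1}+{b}_{o}\right)\\ {c}_{t}&={f}_{t}\odot {c}_{t-1}+{i}_{t}\odot \mathrm{tanh}\left({W}_{c}{x}_{t}+{U}_{c}{h}_{t-1}+{b}_{c}\right)\\ {h}_{t}&={o}_{t}\odot \mathrm{tanh}\left({c}_{t}\right)\end{aligned}$$

where \(\sigma\) is the logistic sigmoid, \(\odot\) denotes element-wise multiplication, \(x_t\) is the input, \(h_t\) the hidden state, and \(c_t\) the cell state at time \(t\).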

Gated recurrent unit

The gated recurrent cells proposed by Cho et al. (2014) are a gating mechanism for recurrent neural networks. Their performance is similar to that of LSTM in many areas, and sometimes even better. GRUs have fewer parameters than LSTM because they have no output gate. The GRU was inspired by the LSTM unit but is considered simpler to compute and implement; it also has a memory mechanism, but with significantly fewer parameters than LSTM. GRU is often preferred when less data is available and is faster to compute (Chang et al., 2018; Hu et al., 2018; Kim et al., 2018; Zhang et al., 2018a, 2018b).
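The parameter saving can be checked directly in Keras; the dimensions below (12 lagged time steps, 1 input feature, 4 recurrent units) are illustrative assumptions, not the study's final configuration:

```python
import tensorflow as tf

# Compare trainable parameter counts of the three recurrent cells
# for identical (illustrative) input dimensions.
for cell in (tf.keras.layers.SimpleRNN, tf.keras.layers.LSTM, tf.keras.layers.GRU):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(12, 1)),  # 12 lagged months, 1 feature
        cell(4),                               # 4 recurrent units
        tf.keras.layers.Dense(1),              # one-step-ahead output
    ])
    print(f"{cell.__name__:9s}: {model.count_params()} parameters")
# Expected ordering: SimpleRNN < GRU < LSTM.
```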

Evaluation of model performance

This research follows the basic guideline for assessing the goodness of fit of the developed models. The guideline uses the correlation coefficient (r), root mean square error (RMSE), ratio of the RMSE to the standard deviation of the observations (RSR), Nash–Sutcliffe efficiency coefficient (NSE), index of agreement (d), and volumetric efficiency (VE) as fitness indices to evaluate model performance. All these fitness indices are calculated using Eqs. (1)–(6).

$$r=\frac{\sum_{i=1}^{l}\left({Q}_{{E}_{i}}-{\overline{Q}}_{E}\right)\left({Q}_{{O}_{i}}-{\overline{Q}}_{O}\right)}{\sqrt{\sum_{i=1}^{l}{\left({Q}_{{E}_{i}}-{\overline{Q}}_{E}\right)}^{2}\sum_{i=1}^{l}{\left({Q}_{{O}_{i}}-{\overline{Q}}_{O}\right)}^{2}}}$$
(1)
$${\text{RMSE}}=\sqrt{\frac{\sum\limits_{i=1}^{l}{\left({Q}_{{E}_{i}}-{Q}_{{O}_{i}}\right)}^{2}}{l}}$$
(2)
$$\mathrm{RSR}=\frac{\mathrm{RMSE}}{{\mathrm{STDEV}}_{\mathrm{obs}}}=\frac{\sqrt{\sum_{i=1}^{l}{\left({Q}_{{O}_{i}}-{Q}_{{E}_{i}}\right)}^{2}}}{\sqrt{\sum_{i=1}^{l}{\left({Q}_{{O}_{i}}-{\overline{Q}}_{O}\right)}^{2}}}$$
(3)
$$\mathrm{NSE}=1-\frac{\sum_{i=1}^{l}{\left({Q}_{{E}_{i}}-{Q}_{{O}_{i}}\right)}^{2}}{\sum_{i=1}^{l}{\left({Q}_{{O}_{i}}-{\overline{Q}}_{O}\right)}^{2}}$$
(4)
$$d=1-\left[\frac{\sum_{i=1}^{l}{\left({Q}_{{E}_{i}}-{Q}_{{O}_{i}}\right)}^{2}}{\sum_{i=1}^{l}{\left(\left|{Q}_{{E}_{i}}-{\overline{Q}}_{O}\right|+\left|{Q}_{{O}_{i}}-{\overline{Q}}_{O}\right|\right)}^{2}}\right],\quad 0\le d\le 1$$
(5)
$$\mathrm{VE}=1-\frac{\sum_{i=1}^{l}\left|{Q}_{{E}_{i}}-{Q}_{{O}_{i}}\right|}{\sum_{i=1}^{l}{Q}_{{O}_{i}}}$$
(6)

where \({Q}_{{E}_{i}}\) is the \(i\)th estimated monthly streamflow discharge from the models; \({Q}_{{O}_{i}}\) is the \(i\)th observed monthly streamflow discharge; \({\overline{Q}}_{E}\) is the mean of the estimated monthly streamflow discharges; \({\overline{Q}}_{O}\) is the mean of the observed monthly streamflow discharges; and \(l\) is the number of observations.
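As a compact sketch of how these indices can be computed (the function follows Eqs. (1)–(6) as written above; the function and variable names are our own):

```python
import numpy as np

def fitness_indices(obs, est):
    """Correlation (r), RMSE, RSR, NSE, index of agreement (d), and
    volumetric efficiency (VE) for observed/estimated flow series."""
    obs, est = np.asarray(obs, float), np.asarray(est, float)
    r = np.corrcoef(est, obs)[0, 1]                                       # Eq. (1)
    rmse = np.sqrt(np.mean((est - obs) ** 2))                             # Eq. (2)
    rsr = rmse / np.std(obs)                                              # Eq. (3)
    nse = 1 - np.sum((est - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)  # Eq. (4)
    d = 1 - np.sum((est - obs) ** 2) / np.sum(
        (np.abs(est - obs.mean()) + np.abs(obs - obs.mean())) ** 2)       # Eq. (5)
    ve = 1 - np.sum(np.abs(est - obs)) / np.sum(obs)                      # Eq. (6)
    return {"r": r, "RMSE": rmse, "RSR": rsr, "NSE": nse, "d": d, "VE": ve}
```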

Rank analysis

Determining the best model is a complex task when many statistical indicators are used together. For this reason, rank values were determined separately for each statistical indicator used in this study, and the most effective model was then determined according to the total rank values. In rank analysis, a rank is assigned to each performance parameter, from a maximum equal to the number of models (three in this study) down to a minimum of one: the best-performing model is assigned rank three and the worst-performing model rank one. The model with the highest total rank is the best, while the model with the lowest total rank is the worst (Zhang et al., 2020).
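A small sketch of this rank assignment is given below; the scores are placeholder values for illustration only, with r and NSE treated as "higher is better" and RMSE as "lower is better":

```python
import pandas as pd

# Placeholder fitness indices for the three models (illustrative values).
scores = pd.DataFrame(
    {"r": [0.913, 0.905, 0.904], "NSE": [0.80, 0.79, 0.82], "RMSE": [0.125, 0.127, 0.121]},
    index=["RNN", "LSTM", "GRU"],
)

lower_is_better = {"RMSE", "RSR"}  # error-type indices
ranks = scores.apply(
    lambda col: col.rank(ascending=col.name not in lower_is_better)
).astype(int)                      # rank 3 = best, rank 1 = worst

ranks["total"] = ranks.sum(axis=1)
print(ranks.sort_values("total", ascending=False))  # highest total = best model
```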

Results and discussion

This work aims to develop a forecasting model to predict streamflow using the GRU model. In addition, two similarly configured sequential models (RNN and LSTM) were tested to assess their applicability to streamflow forecasting. Furthermore, to check robustness, three different stations, namely Erzincan, Gümüshane, and Bayburt, were considered. The dataset was split into three parts: training (Oct 1978 to Apr 2004), validation (May 2004 to Sep 2009), and testing (Oct 2010 to Sep 2015), i.e., a 0.70/0.15/0.15 split. The training data is used to develop the models, the validation data is used to tune and select the best-performing models, and the test data is used to evaluate the performance of the developed models. As stated above, obtaining the final model configuration, i.e., hyper-parameter tuning, is a broad topic that requires care; therefore, multiple scenarios with different time horizons were trialled, and the final architecture for each station was selected based on the highest correlation coefficient (r).
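A sketch of this chronological 0.70/0.15/0.15 split, with a synthetic placeholder series standing in for the station records, could look as follows:

```python
import numpy as np

def chronological_split(series, train=0.70, val=0.15):
    """Split a time-ordered series into training, validation, and
    testing blocks without shuffling, preserving temporal order."""
    n = len(series)
    i, j = int(n * train), int(n * (train + val))
    return series[:i], series[i:j], series[j:]

# Placeholder monthly series (37 water years x 12 months = 444 values).
flows = np.random.default_rng(1).gamma(2.0, 5.0, size=444)
train, val, test = chronological_split(flows)
print(len(train), len(val), len(test))  # 310 67 67
```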

The sequential deep learning models (RNN, LSTM, and GRU) were developed in TensorFlow using the Keras deep learning library. A major problem in time series analysis is the selection of the random component; this study therefore treated monthly streamflow discharge as the random component when developing the models. The architecture of the sequential models consists of the model input (i.e., previous time steps), the number of memory cells (memory blocks), and the model output. This research trialled different combinations of inputs and memory cells for the three stations (Erzincan, Gümüshane, Bayburt). The previous time steps (i.e., the look-back window) were varied between 1 and 20, and the final architecture of each model was selected based on the highest correlation coefficient (r). Input selection (i.e., choosing the look-back) is one of the most unwieldy and important tasks during model development, and different researchers have adopted different look-backs. Ouyang and Lu (2018) considered 12-month previous time steps for the development of ANN, multi-gene genetic programming, and support vector machine models. Qin et al. (2019) tested the LSTM model for hydrological time series analysis by adopting different batch sizes and numbers of memory cells. Furthermore, Kumar et al. (2019) tested different time steps to check their effect on the performance of the LSTM model. The final models were compiled with the mean square error (MSE) loss function and the ReLU activation function. The final model configuration was found by skill-guided trial and error, which leaves the door open to exploring new and possibly better configurations. Table 2 shows the final configuration of the models at the three selected stations.
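A minimal sketch of this setup is shown below. The look-back of 12 months, the 4 memory cells, and the Adam optimizer are illustrative assumptions (the paper specifies the MSE loss and a 1-20 look-back search, but not the optimizer), not the final configurations of Table 2:

```python
import numpy as np
import tensorflow as tf

def make_supervised(series, look_back):
    """Frame a univariate series as (samples, look_back, 1) inputs
    with the next value as the one-step-ahead target."""
    X = np.stack([series[i:i + look_back] for i in range(len(series) - look_back)])
    return X[..., None], series[look_back:]

def build_model(cell, look_back, units):
    """cell is tf.keras.layers.SimpleRNN, LSTM, or GRU."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(look_back, 1)),
        cell(units),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mse", optimizer="adam")  # Adam is an assumption here
    return model

# Placeholder series; in the study this would be the (scaled) monthly flow.
series = np.random.default_rng(2).gamma(2.0, 5.0, size=444).astype("float32")
X, y = make_supervised(series, look_back=12)
model = build_model(tf.keras.layers.GRU, look_back=12, units=4)
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```

In practice, the look-back would be varied from 1 to 20 and the configuration with the highest validation r retained, as described above.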

Table 2 The final configuration of the models at different stations

Model performance for Erzincan station (E21A019)

Table 3 shows the performance indices of the models at the different stations. The results show that, during the training phase, the RNN model outperformed the others, followed by LSTM and GRU. The correlation coefficients (r) for the RNN and LSTM models are approximately equal, showing that the models were well trained, whereas the performance of GRU was slightly lower than that of RNN and LSTM (Figs. 4 and 5). During the validation period, the correlation coefficient (r) was approximately the same for all models (RNN (r = 0.906), LSTM (r = 0.904), and GRU (r = 0.904)). In addition, the other fitness parameters were also calculated; in general, the lower the error, the better the model. The lowest RMSE was found for GRU (RMSE = 0.121), followed by RNN (RMSE = 0.125) and LSTM (RMSE = 0.127). The other fitness indices confirmed that the GRU model is well developed and can give significant results for forecasting streamflow at the Erzincan station. Furthermore, all three models were also tested on the remaining 15% test dataset to check their robustness. From the analysis of the results, it was found that the GRU model is well capable, with performance similar to RNN and LSTM in terms of the fitness indices. The result signifies that the GRU model can be used in place of LSTM to predict streamflow at this station. For better representation, a Taylor diagram has been drawn to visualize the performance of the models in terms of correlation coefficient (r), standard deviation, and root mean square error (blue line). Figure 6a shows that the GRU model is clearly better than the RNN and LSTM models, as it gives a smaller standard deviation without much change in the correlation coefficient. In addition, a box-whisker plot (Fig. 6b) is drawn to map the spread of the error (observed minus predicted) and to visualize the range and median of the errors produced by the different models during the testing phase.

Fig. 4

Scatter plots displaying the performance of the models (RNN, LSTM, and GRU) at Erzincan station during the training phase

Fig. 5

Comparison of observed and forecasted streamflow using the proposed GRU model and the LSTM and RNN models at Erzincan station

Fig. 6

a Taylor plot displaying the performance of the models (RNN, LSTM, and GRU) at Erzincan station during the validation and testing phases. b Box-whisker plot displaying the testing error (observed minus predicted) at Erzincan station

Again, the GRU model shows better performance, as its error range is smaller than that of the LSTM and its median error lies toward the center of the box, whereas for RNN the median error is shifted toward the upper interquartile range. Therefore, the analysis of the results showed that the GRU model performed better than the LSTM at this station.

Model performance for Gümüshane station (E14A022)

To test the robustness of the GRU model, it was further tested on the Gümüshane station. During the training phase, the RNN model (r = 0.931) outperformed the others, followed by GRU (r = 0.914) and LSTM (r = 0.908) (Fig. 7). During the validation period, however, the performance of the RNN model dropped significantly, showing an underfit condition, and the LSTM model (r = 0.901) was found to be better than the RNN model (r = 0.895) (Table 3). During the testing period, the LSTM model reached an overfit condition (r = 0.913) and predicted higher values than observed, while the GRU model showed consistent performance in forecasting the streamflow (Fig. 8). In terms of the other performance indices, the GRU model (RMSE = 0.073, RSR = 0.439) showed an RMSE close to that of the LSTM (RMSE = 0.076, RSR = 0.435). The volumetric efficiency of the GRU model (VE = 0.953) is approximately equal to that of the LSTM model (VE = 0.956). The Taylor plot shows that the GRU and LSTM models give smaller standard deviations during testing than RNN, while the accuracy of GRU is approximately similar to that of LSTM (Fig. 9). In addition, the interquartile range of the box-whisker plot shows that the LSTM model has more outliers in the lower quantile, the opposite of the GRU. This signifies that the LSTM model frequently underpredicts the observed value, which is also visible in Fig. 8, where the LSTM model cannot capture the streamflow peaks. This analysis of the results showed that the GRU model is better than the LSTM model for forecasting peak streamflow (Fig. 9).

Fig. 7

Scatter plots displaying the performance of the models (RNN, LSTM, and GRU) at Gümüshane station during the training phase

Fig. 8

Comparison of observed and forecasted streamflow using the proposed GRU model and the LSTM and RNN models at Gümüshane station

Fig. 9

a Taylor plot displaying the performance of the models (RNN, LSTM, and GRU) at Gümüshane station during the validation and testing phases. b Box-whisker plot displaying the testing error (observed minus predicted) at Gümüshane station

Model performance for Bayburt station (E23A004)

The GRU model was also applied to the Bayburt station to check its performance. During the training phase, the RNN model (r = 0.943) outperformed the other models, but during the validation period its performance (r = 0.925) dropped significantly, showing an underfit condition. This underfit condition was observed consistently across the three stations during model development, which suggests that the RNN model fails to maintain its accuracy (Table 3). The performances of the GRU (r = 0.916) and LSTM (r = 0.916) models are the same during the training phase (Fig. 10). During the validation period, the performance of the LSTM model (r = 0.914) was more consistent than that of the GRU model (r = 0.902), but it failed during the testing period: the LSTM model overfitted and predicted slightly higher values than observed, while the GRU model showed consistent performance in forecasting the streamflow (Fig. 11). In terms of the other performance indices, the GRU model (RMSE = 0.159, RSR = 0.459) showed an RMSE close to that of the LSTM (RMSE = 0.154, RSR = 0.451) during testing. The volumetric efficiency of the GRU model (VE = 0.869) is approximately equal to that of the LSTM model (VE = 0.872). The Taylor diagram further confirms this, as the models coincide (Fig. 12a). In addition, the box-whisker plot also shows a similar error range and median for the GRU and LSTM (Fig. 12b). This analysis concludes that the GRU model can replace the LSTM model at this station.

Fig. 10

Scatter plots displaying the performance of the models (RNN, LSTM, and GRU) at Bayburt station during the training phase

Fig. 11

Comparison of observed and forecasted streamflow using the proposed GRU model and the LSTM and RNN models at Bayburt station

Fig. 12

a Taylor plot displaying the performance of the models (RNN, LSTM, and GRU) at Bayburt station during the validation and testing phases. b Box-whisker plot displaying the testing error (observed minus predicted) at Bayburt station

In Table 3, the success of the LSTM, GRU, and RNN models in estimating monthly flows is evaluated according to various statistical parameters. At the end of the analysis, the RNN model at the Erzincan and Bayburt stations and the RNN and GRU models at the Gümüshane station showed the most successful estimation results. In addition, all the developed deep learning models were found to predict monthly flows at a satisfactory level and with similar accuracy.

De Melo et al. (2019) determined that GRU and LSTM networks perform more effectively than MLP and ARIMA models. Sahoo et al. (2019) used LSTM and RNN models to estimate daily discharge in the Mahanadi River basin, India, and found that the LSTM-RNN model produced more successful predictions than the RNN model. Zhao et al. (2021a, 2021b) and Shu et al. (2021) found that deep learning models outperform machine learning models such as ANN, ELM, and SVR for monthly flow prediction. Compared with the existing literature, the present study agrees with these findings in that deep learning algorithms produce effective results for streamflow estimation.

Table 3 Evaluation of model performance

Conclusion

The application of these deep learning techniques is encouraged because such models can take advantage of the sequential nature of streamflow data, which can help achieve higher accuracy. This article discusses the feasibility of deep learning approaches such as GRU, RNN, and LSTM for estimating monthly streamflow: a new generation of sequential deep learning models is tested for streamflow forecasting, which remains a challenging problem in water resources. As a result of the rank analysis, the most successful prediction model was determined to be RNN.

Multiple benchmarks were comprehensively tested and compared with the proposed GRU forecasting framework on a real-world dataset. On this dataset, the proposed GRU framework achieved forecasting accuracy comparable to the LSTM model. The inconsistency of the monthly streamflow profile at different stations generally affects predictability; the higher the inconsistency, the more the GRU can contribute to forecasting improvement compared with the simple RNN and LSTM. As future work, methodologies for parameter tuning can be developed to further increase forecasting accuracy at different types of stations, especially stations with varying flow regimes. In this study, parameter optimization was done simply; in the future, the parameters could be tuned using hybrid structures. Moreover, although an individual streamflow forecast may be far from accurate, aggregating the individual predictions yields a better prediction at the aggregation level than the conventional strategy of directly forecasting the streamflow.