1 Introduction

One of the most important ways to reduce flood damage in large cities is to design a robust, reliable, and universal method for flood forecasting and early warning (Walton et al. 2019; Gholami et al. 2019). Due to climate change, water-related studies have attracted increasing attention from researchers. Climate change forecasts suggest that impacts on water resources will have devastating consequences for human and ecological health. The planning and management of water resources will therefore be the most critical issue faced by humankind (Gharabaghi and Sattar 2019). Climate change can increase the potential for extreme rainfall as well as the risk of flooding. River flow forecasting is thus vital for the management and planning of river basins, including water allocation for agriculture, generation of hydroelectric energy, navigation planning, risk appraisal, drought management, and flood control (Khatibi et al. 2012). Floods are presently the dominant natural disaster and tend to have the most significant associated economic cost (Serinaldi et al. 2018). One-third of all natural disasters in Canada between 1950 and 2012 were the result of floods (Kelly and Stodolak 2013).

1.1 Data-Driven Models

Data-driven models have been applied extensively in flood modelling and have gained a strong reputation in recent years. Among them, one of the most popular is the group method of data handling (GMDH) (Najafzadeh et al. 2015). The GMDH is a self-organized approach that is capable of producing explicit equations for practical applications. However, because of the intricate non-linear patterns in time series, the use of such methods requires professional programming knowledge. A further critical question with these methods is the selection of the best input parameters for predicting the results with high accuracy (Ebtehaj et al. 2020).

1.2 Problem Statement

The use of stochastic-based models has generally been rejected due to their linear nature in the modelling of hydrological processes, and in most studies, inefficiencies have been reported (Mosavi et al. 2018). Recently, Bonakdari et al. (2019) indicated that considering an appropriate linear methodology could result in improved modelling compared to a non-linear approach (such as ANFIS and neural network) in terms of accuracy and simplicity. Therefore, the following fundamental questions arise: 1) Can the identification of deterministic and stochastic terms of the time series also provide a useful solution in modelling river discharge with strong seasonal and non-seasonal correlations with a stochastic model?; and 2) What is the performance of this linear methodology in comparison with the non-linear model? This study seeks to address these questions for the case of a highly complex daily time series data set with strong seasonal and non-seasonal correlations.

1.3 Scope of the Current Study

This study compares a novel stochastic-based linear methodology with a new encoding of the GMDH, known as the generalized structure of the GMDH (GS-GMDH), for daily discharge forecasting in the Bow River in Alberta, Canada. Both the linear and non-linear approaches are coded in MATLAB. The linear methodology assesses the existence of deterministic and stochastic terms and applies a method to remove the deterministic terms. In the non-linear GS-GMDH model, to overcome the limitations of the classical GMDH method (which uses only a second-order polynomial and cannot use inputs from nonadjacent layers in each neuron), second- and third-order polynomials and inputs from nonadjacent layers are permitted in each neuron. Finally, the performance of the best linear stochastic model is compared with that of the best GS-GMDH model, as a non-linear approach, using multi-criteria statistical indices.

1.4 Hydrological Data Collection

The costliest natural disaster in Canadian history in terms of economic losses was the June 2013 flood event that affected the city of Calgary, Alberta, in which five lives were lost. As much as $6 billion CAD in economic losses was sustained (Pomeroy et al. 2016). This flood event was one of the largest causes of domestic insurance losses in Canada (Insurance Bureau of Canada 2017).

Moreover, the cost of infrastructure damages, recovery costs, and emergency response were $409 million, $323 million, and $55 million CAD, respectively. In addition to the June 2013 flood event, $186,831,824 was paid by insurance companies in response to 21,179 flood claims from the 2005 flood event that also affected Calgary (Dohy 2005).

The Bow River is located in Alberta, Canada, and flows through the city of Calgary. Its headwaters are located in the Rocky Mountains at the Bow Glacier; the river merges with the Oldman River and eventually forms the South Saskatchewan River (Fig. 1a). A hydrometric station (05BH004, located at 51°03′00″ N, 114°03′05″ W) collected daily discharge data in the Bow River near Calgary from 2000 to 2018.

Fig. 1 The location and daily discharge of the Bow River near Calgary, Alberta, Canada

The Bow River drains a gross area of 7870 km². The average daily discharge of the Bow River is 90.7 m³/s. The banks of this river overflow and structures are affected when the flow rate reaches 500 m³/s, and overland flooding occurs when the flow rate reaches 850 m³/s (City of Calgary 2018). The daily discharge data of the Bow River and the statistical indices of these data for the training and testing stages are presented in Fig. 1b and Table 1, respectively.

Table 1 Statistical Indices of Bow River daily discharge, divided into Total, Train, and Test Periods

2 Theoretical Conceptions

2.1 Linear Modeling Conceptions

The autoregressive integrated moving average (ARIMA) is one of the most popular linear methods for predicting time series. This method may also be defined seasonally, which is then known as the Seasonal ARIMA (SARIMA). The ARIMA method is defined as:

$$ \mathrm{ARIMA}\left(p,d,q\right):\kern0.5em \varphi (B){\left(1-B\right)}^dx(t)=\theta (B)\ \varepsilon (t) $$
(1)

where p and q are the orders of the autoregressive (AR) and moving average (MA) terms, d is the degree of differencing, φ and θ are the AR and MA parameters (respectively), B is the backshift operator, \( {\left(1-B\right)}^d \) is the dth non-seasonal differencing operator, x(t) is the raw time series, and ε is the residual. Denoting either set of non-seasonal parameters (φ or θ) by k, the corresponding polynomial in B is written as follows:

$$ k(B)=1-{k}_1B-{k}_2{B}^2-{k}_3{B}^3-\dots -{k}_n{B}^n $$
(2)

where n is the order of non-seasonal parameters (p and q). In the modelling of the time series using stochastic processes, the series should be subject to certain conditions. The Jarque-Bera (JB) (Jarque and Bera 1980) test is used to verify the normality of the series, and is defined as follows:

$$ JB=n\left(\frac{S_K^2}{6}+\frac{{\left({K}_u-3\right)}^2}{24}\right) $$
(3)

where Ku is the kurtosis, SK is the skewness, and n is the number of samples.
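As a rough illustration of Eq. (3), the JB statistic can be computed directly from the sample skewness and kurtosis. The sketch below is a minimal pure-Python version and not the MATLAB code used in the study; the function name `jarque_bera` and the use of biased (divide-by-n) moments are assumptions made for illustration.

```python
def jarque_bera(x):
    """Return the Jarque-Bera statistic of Eq. (3) for a sequence of numbers."""
    n = len(x)
    mean = sum(x) / n
    # Biased central moments (divided by n); this choice is an assumption.
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)


if __name__ == "__main__":
    symmetric = [1, 2, 3, 4, 5]
    skewed = [1, 1, 1, 1, 10]
    print(jarque_bera(symmetric), jarque_bera(skewed))
```

Under normality the statistic approximately follows a χ² distribution with two degrees of freedom, so values above roughly 5.99 would reject normality at the 0.05 level.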

If the time series is normal, its stationarity is evaluated in the next step; if it is not normal, the series is normalized using the transformation of Box and Cox (1964):

$$ {X}_n\left(\lambda \right)=\begin{cases}\dfrac{{\left(x+\alpha \right)}^{\lambda }}{\lambda }, & \lambda \ne 0\\ \log \left(x+\alpha \right), & \lambda =0\end{cases} $$
(4)

where Xn(λ) is the normalized time series, λ is the transformation parameter, and α is a constant chosen such that x(t) + α > 0. The stationarity of the time series is also evaluated so that any necessary transformations can be made before the series is modelled. One test applied for this purpose is the KPSS stationarity test (Kwiatkowski et al. 1992), defined as follows:

$$ {S}^2(l)=\frac{1}{n}\sum \limits_{t=1}^n{e}_t^2+\frac{2}{n}\sum \limits_{s=1}^lw\left(s,l\right)\sum \limits_{t=s+1}^n{e}_t{e}_{t-s} $$
(5)
$$ w\left(s,l\right)=1-\frac{s}{l+1} $$
(6)
$$ KPSS=\frac{1}{n^2}\sum \limits_{t=1}^n\frac{S_t^2}{S^2(l)} $$
(7)

where \( {S}_t=\sum_{i=1}^t{e}_i \) is the partial sum of the residuals \( {e}_t \), and l is the truncation lag. The KPSS statistic tests stationarity around a level or a trend. Each time series is formed from four terms: trend, jump, period, and a stochastic term. The existence of any of the first three terms causes the time series to become non-stationary. One of the most commonly used methods for making a time series stationary is differencing, in which the differenced series is created by subtracting consecutive data values (i.e., Diff(t) = X(t) − X(t−1)). The trend and seasonal changes in the series are thereby eliminated, and a stationary series is ultimately obtained. Alongside differencing (diff.), standardization (Std.) and spectral analysis (Sf.) can be used as pre-processing methods (Bonakdari et al. 2019). The non-parametric Mann-Kendall test is applied to test for a trend, that is, to identify gradual changes that occur over time in the time series (Jain and Kumar 2012). The standardized Mann-Kendall statistic (STDMK) is obtained as follows:

$$ STDMK=\begin{cases}\left( MK-1\right)\operatorname{var}{(MK)}^{-0.5}, & MK>0\\ 0, & MK=0\\ \left( MK+1\right)\operatorname{var}{(MK)}^{-0.5}, & MK<0\end{cases} $$
(8)

where MK is the Mann-Kendall statistic, and var(MK) represents the variance of MK. MK and var(MK) are calculated as:

$$ MK=\sum \limits_{i=1}^{N-1}\sum \limits_{j=i+1}^N\operatorname{sgn}\left({X}_j-{X}_i\right) $$

and

$$ \operatorname{var}(MK)=\left(\left(2{N}^3+3{N}^2-5N\right)-\sum \limits_{j=1}^g{Obs}_j\left({Obs}_j-1\right)\left(2{Obs}_j+5\right)\right)/18 $$
(9)

where X denotes the data values, Obsj is the number of observations in the jth tied group, g is the number of tied groups, N is the number of samples, and sgn is the sign function. Gradual changes in the time series may also recur seasonally, leading to a seasonal trend. In this case, the seasonal trend is identified using the seasonal Mann-Kendall test as follows:

$$ {S}_k=\sum \limits_{i=1}^{N_k-1}\sum \limits_{j=i+1}^{N_k}\operatorname{sgn}\left({X}_{kj}-{X}_{ki}\right) $$
(10)
$$ SMK=\sum \limits_{k=1}^{\omega}\left({S}_k-\operatorname{sgn}\left({S}_k\right)\right) $$
(11)
$$ \operatorname{var}(SMK)=2\sum \limits_{i=1}^{\omega -1}\sum \limits_{j=i+1}^{\omega }{\sigma}_{ij}+\sum \limits_{k=1}^{\omega}\left(2{N}_k^3+3{N}_k^2-5{N}_k\right)/18 $$
(12)
$$ {STD}_{SMK}= SMK\ \operatorname{var}{(SMK)}^{-0.5} $$
(13)

where σij is the covariance of the test statistics in seasons i and j, Nk is the number of samples in season k, and ω represents the number of seasons in a year. If the p-value of these test statistics is higher than the significance level of 0.05, the time series has no trend. A jump in the series can be tested with the following equation (Mann and Whitney 1947):

$$ {MW}_U=\frac{\sum \limits_{t=1}^{N_{m1}} Dg\left({X}_{ordered}(t)\right)-\frac{N_{m1}\left({N}_{m1}+{N}_{m2}+1\right)}{2}}{{\left({N}_{m1}{N}_{m2}\left({N}_{m1}+{N}_{m2}+1\right)/12\right)}^{0.5}} $$
(14)

In this relationship, Xordered is the sorted version of the original series X(t), Dg(Xordered) is the rank of each member in the sorted series, and Nm1 and Nm2 are the numbers of members of the two sub-series into which the original series is divided, with Nm1 + Nm2 = Ntotal. Periodicity in a time series can be inspected using the autocorrelation function (ACF) and partial autocorrelation function (PACF) diagrams. A test that examines periodicity numerically is the Fisher test (Kashyap and Rao 1976). The test statistic is calculated as:

$$ {F}^{\ast }=\frac{N\left(N-2\right)\left({\alpha}_k^2+{\beta}_k^2\right)}{4\sum \limits_{z=1}^k\left(x(t)-{\alpha}_z\cos \left({\Omega}_zt\right)-{\beta}_z\sin \left({\Omega}_zt\right)\right)} $$
(15)

where N is the number of sample data, F* is the Fisher test statistic, αz and βz are Fourier coefficients, and Ωz is the angular frequency obtained as follows:

$$ {\alpha}_z=\frac{2}{N}\sum \limits_{t=1}^Nx(t)\cos \left(2\pi {f}_zt\right),\kern1em z=1,2,\dots, k $$
(16)
$$ {\beta}_z=\frac{2}{N}\sum \limits_{t=1}^Nx(t)\sin \left(2\pi {f}_zt\right),\kern1em z=1,2,\dots, k $$
(17)
$$ {f}_z=\frac{z}{N},\kern1em {\Omega}_z=\frac{2\pi z}{N},\kern1em z=1,2,\dots, k $$
(18)

In the above relationships, fz is the z-th harmonic of the base frequency. The periodicity at Ωz is significant when the F* value exceeds the critical value F(2, N−2) at the chosen confidence level:

$$ {F}^{\ast}\ge F\left(2,N-2\right) $$
(19)
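The Fourier coefficients that enter the Fisher statistic can be illustrated with a short sketch of Eqs. (16) to (18). It assumes the usual cosine and sine projections for α_z and β_z; the function name `fourier_coefficients` is hypothetical, and the F* statistic itself is not computed here.

```python
import math


def fourier_coefficients(x, z):
    """Return (alpha_z, beta_z, omega_z) for harmonic z of the series x(1..N)."""
    n = len(x)
    omega = 2 * math.pi * z / n  # angular frequency of the z-th harmonic, Eq. (18)
    # Time index t starts at 1, matching the sums in Eqs. (16) and (17).
    alpha = 2 / n * sum(v * math.cos(omega * t) for t, v in enumerate(x, start=1))
    beta = 2 / n * sum(v * math.sin(omega * t) for t, v in enumerate(x, start=1))
    return alpha, beta, omega


if __name__ == "__main__":
    # Synthetic series with a single cosine component at harmonic z = 2.
    series = [3 * math.cos(2 * math.pi * 2 * t / 16) for t in range(1, 17)]
    print(fourier_coefficients(series, 2))
```

For a series dominated by one harmonic, the squared amplitude α_z² + β_z² is large at that harmonic and close to zero elsewhere, which is what the numerator of the test statistic measures.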

For a significance level of 0.05 and the large number of degrees of freedom in the denominator that a daily series provides, the critical value F(2, N−2) is approximately 3. The Ljung-Box test is used to validate the model by verifying the independence of the residuals of the time series (Ljung and Box 1978). The test statistic is calculated as follows:

$$ {\mathcal{Q}}_m=N\left(N+2\right)\sum \limits_{h=1}^m\frac{r_h^2}{N-h} $$
(20)

In this relationship, N is the number of samples, rh is the autocorrelation coefficient of the residuals (εt) at lag h, and m is equal to ln(N). If the p-value of the Ljung-Box test statistic under the χ² distribution is higher than the significance level α (in this case PQ > α = 0.05), the residual series is independent, and the model is appropriate.
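The residual check can be sketched as follows. This is a minimal illustration that assumes the standard Ljung-Box form, with squared residual autocorrelations divided by N − h (Ljung and Box 1978); the function name `ljung_box` and the rounding of m = ln(N) to an integer are assumptions.

```python
import math


def ljung_box(residuals, m=None):
    """Return the Ljung-Box Q statistic for the first m residual lags."""
    n = len(residuals)
    if m is None:
        m = max(1, round(math.log(n)))  # m = ln(N), rounded (assumption)
    mean = sum(residuals) / n
    c0 = sum((e - mean) ** 2 for e in residuals)
    total = 0.0
    for h in range(1, m + 1):
        # Lag-h autocovariance sum, normalized by the lag-0 sum to give r_h.
        ch = sum((residuals[t] - mean) * (residuals[t - h] - mean)
                 for t in range(h, n))
        r_h = ch / c0
        total += r_h ** 2 / (n - h)
    return n * (n + 2) * total


if __name__ == "__main__":
    strongly_correlated = [1, -1] * 10
    print(ljung_box(strongly_correlated, m=1))
```

The resulting Q is then compared with the χ² distribution to obtain the probability referred to in the text; that step is omitted here because it needs a χ² distribution function.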

2.2 Group Method of Data Handling (GMDH)

The GMDH neural network is built from a set of neurons, each of which combines a pair of inputs through a quadratic polynomial. The network describes an approximate function \( \hat{f} \) with output \( \hat{y} \) for the inputs X = (x1, x2, …, xn), with the lowest possible error compared to the actual output y. Therefore, for M observed samples, each including n inputs and one output, the actual results are represented in the form of:

$$ {y}_i=f\ \left({x}_{i1},{x}_{i2},{x}_{i3},\dots, {x}_{in}\right)\left(i=1,2,\dots, M\right) $$
(21)

In the GMDH method, a network that can predict the output value \( \hat{y} \) for any input vector x can be calculated according to:

$$ {\hat{y}}_i=\hat{f}\left({x}_{i1},{x}_{i2},{x}_{i3},\dots, {x}_{in}\right)\kern0.75em \left(i=1,2,\dots, M\right) $$
(22)

So that the mean square error between the observed values and the estimated values is minimized, as:

$$ MSE=\frac{\sum \limits_{i=1}^M{\left({\hat{y}}_i-{y}_i\right)}^2}{M}\to \mathit{\operatorname{Min}} $$
(23)

The general formula of structure between the input and output variables can be represented using the polynomial function, as:

$$ y={a}_0+\sum \limits_{i=1}^n{a}_i{x}_i+\sum \limits_{i=1}^n\sum \limits_{j=1}^n{a}_{ij}{x}_i{x}_j+\sum \limits_{i=1}^n\sum \limits_{j=1}^n\sum \limits_{k=1}^n{a}_{ijk}{x}_i{x}_j{x}_k+\dots $$
(24)

This general form can be approximated by second-order polynomials of two variables, expressed as:

$$ \hat{y}=G\left({x}_i,{x}_j\right)={a}_0+{a}_1{x}_i+{a}_2{x}_j+{a}_3{x}_i^2+{a}_4{x}_j^2+{a}_5{x}_i{x}_j $$
(25)

The unknown coefficients ai in the above equation are estimated by regression in such a way that the difference between the true output y and the estimated output \( \hat{y} \) for each pair of input variables xi and xj is minimized. A set of polynomials is constructed using Eq. (25), and all unknown coefficients are calculated by the least squares (LS) method. The coefficients of each neuron equation (for each function Gi) are obtained by minimizing its total error so that it optimally fits all input-output data sets:

$$ E=\frac{\sum \limits_{i=1}^M{\left({y}_i-{G}_i\right)}^2}{M}\to \mathit{\operatorname{Min}} $$
(26)

In the GMDH algorithm, neurons are constructed from all pairwise combinations of the n input variables, and the unknown coefficients of all neurons are calculated using the LS method. Therefore, the number of neurons needed to build the second layer is \( \left(\begin{array}{c}n\\ {}2\end{array}\right)=\frac{n\left(n-1\right)}{2} \), and the corresponding data can be represented as the following set:

$$ \left\{\left({y}_i,{x}_{ip},{x}_{iq}\right)\mid \left(i=1,2,\dots, M\right)\&p,q\in \left(1,2,\dots, n\right)\right\} $$
(27)

Writing the quadratic form of Eq. (25) for each of the M data triples gives a set of equations that can be expressed in the following matrix form:

$$ Aa=Y $$
(28)

where a is the vector of unknown coefficients of the second-order equation shown in Eq. (25), Y is the vector of observed outputs, and A is the matrix of input terms:

$$ a=\left\{{a}_0,{a}_1,\dots {a}_5\right\} $$
(29)
$$ Y={\left\{{y}_1,{y}_2\dots {y}_M\right\}}^T $$
(30)
$$ A=\left[\begin{array}{cccccc}1& {x}_{1p}& {x}_{1q}& {x}_{1p}^2& {x}_{1q}^2& {x}_{1p}{x}_{1q}\\ {}1& {x}_{2p}& {x}_{2q}& {x}_{2p}^2& {x}_{2q}^2& {x}_{2p}{x}_{2q}\\ {}\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {}1& {x}_{Mp}& {x}_{Mq}& {x}_{Mp}^2& {x}_{Mq}^2& {x}_{Mp}{x}_{Mq}\end{array}\right] $$
(31)

The least squares method of multiple regression analysis gives the solution of these equations in the following form:

$$ a={\left({A}^T\ A\right)}^{-1}{A}^TY $$
(32)

This equation yields the vector of coefficients of Eq. (25) from the complete set of M data triples.
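To make the least squares step concrete, the sketch below fits the six coefficients of one quadratic neuron by forming and solving the normal equations. It is a pure-Python illustration under stated assumptions, not the authors' MATLAB implementation: the helper names `design_row`, `solve`, and `fit_neuron` are hypothetical, and Gaussian elimination is used in place of an explicit matrix inverse.

```python
def design_row(xp, xq):
    """One row of A: [1, x_p, x_q, x_p^2, x_q^2, x_p * x_q]."""
    return [1.0, xp, xq, xp * xp, xq * xq, xp * xq]


def solve(matrix, rhs):
    """Solve matrix * a = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, k):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= factor * aug[c][j]
    a = [0.0] * k
    for r in range(k - 1, -1, -1):  # back substitution
        s = sum(aug[r][j] * a[j] for j in range(r + 1, k))
        a[r] = (aug[r][k] - s) / aug[r][r]
    return a


def fit_neuron(xp, xq, y):
    """Least squares coefficients a0..a5 from the normal equations (A^T A) a = A^T Y."""
    rows = [design_row(p, q) for p, q in zip(xp, xq)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(ata, aty)


if __name__ == "__main__":
    grid = [(float(p), float(q)) for p in range(4) for q in range(4)]
    xs_p = [p for p, _ in grid]
    xs_q = [q for _, q in grid]
    ys = [1 + 2 * p - q + 0.5 * p * p + 0.25 * q * q - 0.1 * p * q for p, q in grid]
    print(fit_neuron(xs_p, xs_q, ys))
```

Solving the normal equations directly keeps the sketch short; for ill-conditioned inputs a QR-based or SVD-based solver would usually be the safer choice.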

2.3 Generalized Structure of GMDH

Although GMDH has a great ability to model non-linear problems, the method is subject to some limitations: 1) it uses only a second-order polynomial; 2) the inputs of each neuron are taken only from neurons in the adjacent layer; and 3) each neuron has only two inputs. Therefore, this method may not be suitable for problems of considerable complexity. To address these limitations, the generalized structure of the GMDH (GS-GMDH) is introduced in this study. In this method, each neuron can have two or three inputs. In addition to second-order polynomials, third-order polynomials can also be used. The inputs of each neuron can come from neurons in the adjacent layer as well as from neurons in nonadjacent layers.
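The effect of allowing two or three inputs and second- or third-order polynomials can be seen by enumerating the terms a single neuron would need. The sketch below assumes that each neuron uses a full polynomial up to the chosen order, which the text does not state explicitly; the function name `polynomial_terms` is hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_terms(inputs, order):
    """Evaluate all monomials of the inputs up to the given total order.

    The constant term comes first, followed by terms of degree 1, 2, ...
    """
    terms = [1.0]
    for degree in range(1, order + 1):
        for combo in combinations_with_replacement(inputs, degree):
            terms.append(prod(combo))
    return terms


if __name__ == "__main__":
    for n_inputs in (2, 3):
        for order in (2, 3):
            count = len(polynomial_terms([1.0] * n_inputs, order))
            print(n_inputs, "inputs, order", order, "->", count, "coefficients")
```

Under this assumption, a two-input second-order neuron has 6 coefficients, as in Eq. (25), while a three-input third-order neuron has 20, which illustrates how quickly model complexity grows. Note that the term ordering here (x_i², x_i x_j, x_j²) differs from the column ordering of Eq. (25).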

2.4 The Structure of the Proposed Models

In this study, a linear and a non-linear method for modelling the daily discharge of the Bow River are presented. The non-linear method is the GMDH algorithm, whose structure, as explained in the previous section, has been modified to give it advantages over the classical GMDH method. The linear method is the ARIMA method, which has been used by several previous researchers. First, the normality of the training time series is evaluated with the Jarque-Bera test. In the case of non-normality, the series is normalized with the Box-Cox transformation. The stationarity of the time series is then assessed using the KPSS test. Jump and period are the other deterministic terms; their existence is examined using the Mann-Whitney and Fisher tests, respectively, and they are removed by differencing, standardization, and spectral analysis. After the deterministic terms are eliminated, the time series is modelled using the ARIMA method. After modelling, the independence of the residuals is evaluated using the Ljung-Box test. Following this verification step, the accuracy of the linear modelling results and of the GS-GMDH method is evaluated using the test data (Fig. 2).

Fig. 2 The structure of the proposed model
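Two of the pre-processing steps in this workflow, the Box-Cox transformation and first-order differencing, can be sketched in a few lines. The Box-Cox form below follows Eq. (4) as printed in the text; the function names and the `shift` argument (standing in for α) are assumptions for illustration.

```python
import math


def box_cox(x, lam, shift=0.0):
    """Transform x following Eq. (4): (x + shift)^lam / lam, or log for lam = 0."""
    if lam == 0:
        return [math.log(v + shift) for v in x]
    return [(v + shift) ** lam / lam for v in x]


def difference(x):
    """First-order differencing: Diff(t) = X(t) - X(t-1)."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]


def undifference(first_value, diffs):
    """Invert first-order differencing given the first value of the series."""
    series = [first_value]
    for d in diffs:
        series.append(series[-1] + d)
    return series


if __name__ == "__main__":
    discharge = [90, 95, 120, 110]  # made-up values, not Bow River data
    diffs = difference(discharge)
    print(diffs, undifference(discharge[0], diffs))
```

Because differencing is invertible once the first value is kept, forecasts made on the differenced series can be mapped back to discharge values.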

3 Modelling Evaluation Measures

Due to the stochastic nature of hydrological variables, a single criterion is not enough to assess the performance of a statistical model. In this study, the coefficient of determination (R2), as well as two relative indices (mean absolute percentage error (MAPE) and root mean square relative error (RMSRE)), are used to establish the efficacy of the linear and non-linear models.

$$ {R}^2\left(\%\right)=100\times {\left(\frac{\sum \limits_{i=1}^n\left({X}_{obs,i}-{\overline{X}}_{obs}\right)\left({X}_{P,i}-{\overline{X}}_P\right)}{\sqrt{\sum \limits_{i=1}^n{\left({X}_{obs,i}-{\overline{X}}_{obs}\right)}^2\sum \limits_{i=1}^n{\left({X}_{P,i}-{\overline{X}}_P\right)}^2}}\right)}^2 $$
(33)
$$ MAPE=\frac{100}{n}\sum \limits_{i=1}^n\left|\frac{X_{obs,i}-{X}_{P,i}}{X_{obs,i}}\right| $$
(34)
$$ RMSRE=100\times \sqrt{\frac{1}{n}\sum \limits_{i=1}^n{\left(\frac{X_{obs,i}-{X}_{P,i}}{X_{obs,i}}\right)}^2} $$
(35)

The value of R2, bounded by [0, 1], describes the proportion of the variance in the actual daily discharge data that can be explained by the predicting model, but it rests on linear assumptions (Krause et al. 2005). The R2, RMSRE, and MAPE are insensitive to outliers (Legates and McCabe 1999). The Nash-Sutcliffe coefficient (EN-S) is employed to overcome the limitation of the previously mentioned indices. Since neither the R2, RMSRE, MAPE, nor EN-S considers the complexity of the model, the Akaike information criterion (AIC) is used to compare the performance of the linear and non-linear models in terms of accuracy and complexity simultaneously.

$$ {E}_{N-S}\left(\%\right)=\left[1-\frac{\sum \limits_{i=1}^N{\left({X}_{obs,i}-{X}_{P,i}\right)}^2}{\sum \limits_{i=1}^N{\left({X}_{obs,i}-{\overline{X}}_{obs}\right)}^2}\right]\times 100 $$
(36)
$$ AIC=N\ln \left(\sum \limits_{i=1}^N{\left({X}_{obs,i}-{X}_{P,i}\right)}^2\right)+2k $$
(37)

In the above equations, k is the number of parameters, N (or n) is the number of samples, Xobs,i and XP,i are the ith observed and predicted values, respectively, and an overbar denotes a mean value.
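The five indices can be sketched together as follows. The sketch assumes the conventional forms of the measures: a squared correlation for R², an absolute value inside MAPE, a square root over the whole mean in RMSRE, and one minus the error ratio in the Nash-Sutcliffe coefficient. The AIC follows Eq. (37) as printed. All function names are hypothetical, and the sample values are made up.

```python
import math


def r2_percent(obs, pred):
    """Squared Pearson correlation between observed and predicted, in percent."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = sum((o - mo) ** 2 for o in obs)
    sp = sum((p - mp) ** 2 for p in pred)
    return 100 * cov ** 2 / (so * sp)


def mape(obs, pred):
    """Mean absolute percentage error."""
    return 100 / len(obs) * sum(abs((o - p) / o) for o, p in zip(obs, pred))


def rmsre(obs, pred):
    """Root mean square relative error, in percent."""
    return 100 * math.sqrt(sum(((o - p) / o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nash_sutcliffe(obs, pred):
    """Nash-Sutcliffe efficiency, in percent."""
    mo = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mo) ** 2 for o in obs)
    return 100 * (1 - sse / sst)


def aic(obs, pred, k):
    """AIC = N ln(sum of squared errors) + 2k, with k model parameters."""
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return len(obs) * math.log(sse) + 2 * k


if __name__ == "__main__":
    observed = [10, 20, 30, 40]
    predicted = [11, 19, 33, 38]
    print(mape(observed, predicted), rmsre(observed, predicted),
          nash_sutcliffe(observed, predicted), aic(observed, predicted, 2))
```

Because MAPE and RMSRE divide by the observed value, they are undefined when an observation is zero; that is not a concern for a perennial river but would matter for intermittent flows.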

4 Development of Linear and Non-linear Modeling

4.1 Linear Modelling

For linear modelling using the ARIMA model, the features of the time series need to be well identified and, if necessary, the series must be made stationary and normal using appropriate pre-processing. Therefore, the autocorrelations of the daily discharge (DD) series are first plotted (Fig. 3a). It can be seen that the DD time series is volatile and has strong seasonal and non-seasonal correlations. Non-seasonal correlations exist for up to the first 52 lags, and seasonal correlations exist for up to four lags with 365-day steps. The periodic term should be eliminated by appropriate methods, leaving a residual time series.

Fig. 3 Autocorrelation function plots of the BRDD data for N/4 of the data: a) daily data, b) data pre-processed with the three proposed methods

Table 2 presents the results of the different tests used for the numerical verification of the DD time-series features. The table shows that the DD time series has seasonal and non-seasonal trends. Also, based on the Fisher and JB test statistics, the time series is periodic and is not normally distributed. As the figure shows, the DD time series has a jump in the validation period, which is confirmed by the Mann-Whitney test. Consistent with these features, the DD time series is non-stationary according to the KPSS test.
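The trend check reported here relies on the Mann-Kendall statistic of Eqs. (8) and (9). A minimal sketch is given below; it assumes a series without tied values and uses the standard no-ties variance N(N − 1)(2N + 5)/18, and the function name `mann_kendall` is hypothetical.

```python
import math


def mann_kendall(x):
    """Return (MK, standardized MK) for a series, ignoring tie corrections."""
    n = len(x)
    mk = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = x[j] - x[i]
            mk += (d > 0) - (d < 0)  # sign of the later-minus-earlier difference
    var = n * (n - 1) * (2 * n + 5) / 18  # no-ties variance (assumption)
    if mk > 0:
        z = (mk - 1) / math.sqrt(var)
    elif mk < 0:
        z = (mk + 1) / math.sqrt(var)
    else:
        z = 0.0
    return mk, z


if __name__ == "__main__":
    print(mann_kendall([1, 2, 3, 4, 5]))
```

The double loop costs O(N²) comparisons, which is acceptable for a few thousand daily values but would call for a faster algorithm on much longer records.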

Table 2 Test results of applied tests on BRDD data and pre-processed outcomes

The ACF graphs of the series were re-drawn (Fig. 3b) to investigate the changes in the pre-processed series. In this figure, diff, Std and Sf represent differencing, non-seasonal standardization and spectral analysis, respectively, and the changes caused by pre-processing are clearly seen. The degree of seasonal and non-seasonal correlation in the differenced series has been greatly reduced, and the stationarity of this pre-processed time series is evident. The standardization and spectral analysis methods have reduced the seasonal and non-seasonal correlations, but they have not been able to make the series stationary, and these correlations remain high. Therefore, the ARMA model cannot be used for modelling these series directly. Differencing is therefore applied to examine the possibility of modelling the data using the ARIMA model.

The results are presented in Table 2, which shows that both the seasonal and non-seasonal trends and the jumps in the series have been eliminated. Although a periodic term has been created by the standardization and spectral analysis methods, the series are considered stationary. Changes in the correlation diagrams of these series are shown in Fig. 4. Seasonal correlations have been eliminated, and significant correlations extend to a maximum of two lags. Therefore, using the ARIMA linear model with a maximum of two for the non-seasonal parameters p and q and one differencing is very suitable.

Fig. 4 Autocorrelation function plots of the differenced (ARIMA differencing operator) pre-processed BRDD data with the proposed methods: a) diff., b) Std., c) Sf., for N/4 of the data
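The autocorrelation values behind plots of this kind can be computed with a short function. The sketch below uses the common biased estimator, in which every lag is normalized by the lag-0 sum of squares; the function name `acf` is an assumption, and the sample series is synthetic.

```python
def acf(x, max_lag):
    """Return the sample autocorrelations r_0 .. r_max_lag of a series."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - h] - mean) for t in range(h, n)) / c0
        for h in range(max_lag + 1)
    ]


if __name__ == "__main__":
    alternating = [1, -1] * 10
    print(acf(alternating, 2))
```

A lag is usually judged significant when its autocorrelation falls outside roughly ±1.96/√N, which is the band typically drawn on ACF plots.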


4.2 Nonlinear Modeling

Using the graphs presented in Fig. 2 and considering that in the GS-GMDH method, at least two variables should be considered as inputs, several models were considered as follows:

$$ \mathrm{M}1:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right)\right) $$
$$ \mathrm{M}2:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right),Q\left(t-3\right)\right) $$
$$ \mathrm{M}3:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right),Q\left(t-3\right),Q\left(t-4\right)\right) $$
$$ \mathrm{M}4:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right),Q\left(t-3\right),Q\left(t-4\right),Q\left(t-5\right)\right) $$
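The input sets M1 to M4 differ only in how many previous days are supplied to the model. A minimal sketch of building such lagged inputs from a discharge series is shown below; the function name `lagged_inputs` and the sample values are made up for illustration.

```python
def lagged_inputs(q, n_lags):
    """Build (X, y) where each row of X is [Q(t-1), ..., Q(t-n_lags)] and y is Q(t)."""
    rows = [[q[t - lag] for lag in range(1, n_lags + 1)]
            for t in range(n_lags, len(q))]
    targets = [q[t] for t in range(n_lags, len(q))]
    return rows, targets


if __name__ == "__main__":
    discharge = [10, 20, 30, 40, 50]  # made-up values, not Bow River data
    # M1 uses two lags; M2, M3, and M4 would use n_lags = 3, 4, and 5.
    print(lagged_inputs(discharge, 2))
```

Each additional lag shortens the usable sample by one day, which is negligible for a record of many years of daily data.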

Using the GS-GMDH method and considering the four models above, various relationships are proposed to predict the Bow River discharge, as shown in Table 3.

Table 3 The proposed GS-GMDH equations for M1 to M4

5 Results and Discussion

Figure 5 shows the scatter plots of the ARIMA-based linear methods (diff, Std, Sf) and the GS-GMDH (M1-M4) models in daily discharge prediction. Comparison of the GS-GMDH models with the linear models shows relatively similar performance for the non-linear and linear methods, with the difference that the GS-GMDH method predicts the maximum discharge in the testing period better than the linear methods (Std, Sf).

Fig. 5 Scatter plot of the linear (diff, Std, Sf) and non-linear (M1-M4) models in daily discharge prediction

Figure 6a depicts the box plot of the observed and predicted daily discharge. The performance of all methods (linear and non-linear) across the different ranges is approximately the same, so that the estimated averages are approximately equal to the observed average. The spread of these values (observed and estimated) between the first and third quartiles is also similar. As observed in the scatter plot, the main difference between the performances of the models is in the peak discharges.

Fig. 6 Box plots of a) observed and predicted daily discharge; and b) relative error of the ARIMA (diff, Std, Sf) and GS-GMDH (M1-M4) models

The qualitative comparison of the two sets of models presented in this study (Figs. 5 and 6a) depicts the good and similar performance of both models in estimating the Bow River daily discharge.

Figure 6b presents the box plot of the relative error of the ARIMA (diff, Std, Sf) and GS-GMDH (M1-M4) models for estimation of the Bow River daily discharge. The distribution of the relative error in the non-linear and linear methods shows that the average relative error of all models is less than 10%. Excluding outliers, the maximum relative error of the methods used is less than 20%. Considering the outlier relative errors, the largest error occurs in the linear methods, especially the diff method, while the smallest maximum outlier error is that of the GS-GMDH (M1) model.

The performance evaluation of the ARIMA (diff, Std, Sf) and GS-GMDH (M1-M4) methods qualitatively confirmed the ability of these two methods to predict the Bow River daily discharge. For a closer comparison of these two models and determination of the superior model, quantitative measures are required. The indices presented in Table 4 confirm the strong performance of the proposed models, which were examined qualitatively above. The average relative error of these models is about 6%, and all models have a very high correlation coefficient. An important point in choosing the superior model is the use of an index that accounts for both accuracy and simplicity.

Table 4 Statistical indices for linear and non-linear methods

The complexity of the model is evaluated using the AIC index; the superior model is the one with the lower values of this index. For linear models, the number of p and q parameters used in the ARIMA model is taken as k in the definition of the AIC, while for the GS-GMDH model, k is the number of coefficients used in the GS-GMDH equation. Among the linear methods, the Std method has the lowest AIC. The AIC value of this method is slightly better than that of Sf, but its difference from diff is significant.

In non-linear methods, the values of all indices, except for AIC, are constant in all models, so that with increasing inputs, not only is the accuracy of the model not significantly changed, but this also leads to the increased complexity of the model relative to the model with lower input parameters. Therefore, considering that the GS-GMDH (M1) method has the lowest AIC among both linear and non-linear methods, this model is selected as the superior model for predicting the Bow River daily discharge.

6 Conclusions

In this study, the accuracy of a linear stochastic model was compared with that of non-linear GMDH models for daily discharge forecasting. The linear stochastic method incorporates three input data pre-processing methods: differencing (diff), standardization (Std), and spectral analysis (Sf). In addition to the linear methodology, a non-linear method based on the GMDH was developed. The most notable results are summarized as follows:

  • The proposed GS-GMDH improved the results of classical GMDH by considering more than two input parameters in each neuron, admissibility of the input of each neuron from nonadjacent layers and employing second- and third-order polynomials to build the structure between the input and output variables.

  • Comparison of the linear stochastic and the non-linear GMDH methods showed that all linear methods (diff, Std and Sf) and non-linear methods (M1-M4) have high accuracy in forecasting the Bow River daily discharge with an average relative error below 6%.

  • This study showed that an appropriate pre-processing process can improve the results of a stochastic model and it can provide a similar forecast accuracy of the daily discharge compared to the more complex non-linear GMDH model.

  • Comparison of all methods using an index that considers simultaneously the accuracy and simplicity of the model (AIC) showed that the GS-GMDH (M1) method has the best performance among all considered methods and can be used in practical applications.