A Comparative Study of Linear Stochastic with Nonlinear Daily River Discharge Forecast Models

Bonakdari, Hossein; Binns, Andrew D.; Gharabaghi, Bahram

doi:10.1007/s11269-020-02644-y

Published: 08 August 2020

Water Resources Management, Volume 34, pages 3689–3708 (2020)

Abstract

Accurate forecasts of the magnitude and timing of flood peak river discharge and of the extent of inundated areas during major storm events are a vital component of early warning systems around the world, which save countless lives every year. This study assesses the forecast accuracy of a linear and a non-linear approach to predicting daily river discharge. A new linear stochastic method is developed by comparing in detail the pre-processing approaches of differencing, standardization, spectral analysis, and trend removal. Daily discharge values of the Bow River in Alberta, Canada, which exhibit strong seasonal and non-seasonal correlations, were used in this study. The stochastic term of this daily flow time series is calculated with an auto-regressive integrated moving average (ARIMA) model. We found that seasonal differencing is the best stationarization method for eliminating periodic effects. Moreover, the proposed non-linear Group Method of Data Handling (GMDH) model overcomes the known accuracy limitations of classical GMDH models, which use only two inputs in each neuron, taken from the adjacent layer. The proposed new non-linear GMDH-based method (named GS-GMDH) improves the structure of the classical GMDH. The GS-GMDH model produced the most accurate forecasts in the Bow River case study, with statistical indices such as the coefficient of determination and the Nash-Sutcliffe coefficient for the daily discharge time series higher than 97% and a relative error of less than 6%. Finally, an explicit equation for estimating the daily discharge of the Bow River is developed using the proposed GS-GMDH model to showcase the practical application of the new method in flood forecasting and management.

1 Introduction

One of the most important ways to reduce flood damage in large cities is to design a robust, reliable, and universal method for flood forecasting and early warning (Walton et al. 2019; Gholami et al. 2019). Because of climate change, water-related studies have gained increasing importance among researchers. Climate change forecasts suggest that impacts on water resources will have devastating consequences for human and ecological health. The planning and management of water resources will therefore be the most critical issue faced by humankind (Gharabaghi and Sattar 2019). Climate change can increase the potential for extreme rainfall as well as the risk of flooding. River flow forecasting is thus vital for the management and planning of river basins, including water allocation for agriculture, generation of hydroelectric energy, navigation planning, risk appraisal, drought management, and flood control (Khatibi et al. 2012). Floods are presently the dominant natural disaster and tend to have the most significant associated economic cost (Serinaldi et al. 2018). One-third of all natural disasters in Canada between 1950 and 2012 were the result of floods (Kelly and Stodolak 2013).

1.1 Data-Driven Models

Data-driven models have found extended application in flood modelling and have gained a heightened reputation in recent years. Among them, one of the most popular methods is the group method of data handling (GMDH) (Najafzadeh et al. 2015). The GMDH is a self-organized approach that is capable of producing explicit equations for practical applications. However, because of the intricate non-linear patterns in time series, using these models requires professional programming knowledge. A further critical question with these methods is the selection of the best input parameters for predicting the results with high accuracy (Ebtehaj et al. 2020).

1.2 Problem Statement

The use of stochastic-based models has generally been rejected due to their linear nature in the modelling of hydrological processes, and in most studies, inefficiencies have been reported (Mosavi et al. 2018). Recently, Bonakdari et al. (2019) indicated that considering an appropriate linear methodology could result in improved modelling compared to a non-linear approach (such as ANFIS and neural network) in terms of accuracy and simplicity. Therefore, the following fundamental questions arise: 1) Can the identification of deterministic and stochastic terms of the time series also provide a useful solution in modelling river discharge with strong seasonal and non-seasonal correlations with a stochastic model?; and 2) What is the performance of this linear methodology in comparison with the non-linear model? This study seeks to address these questions for the case of a highly complex daily time series data set with strong seasonal and non-seasonal correlations.

1.3 Scope of the Current Study

This study compares a novel stochastic-based linear methodology with a new encoding of the GMDH, known as the generalized structure of the GMDH (GS-GMDH), for daily discharge forecasting in the Bow River in Alberta, Canada. Both the linear and the non-linear approaches are coded in MATLAB. The linear methodology assesses the existence of deterministic and stochastic terms and suggests a method to remove the deterministic terms. In the non-linear GS-GMDH model, to overcome the limitations of the classical GMDH method (which uses only a second-order polynomial and lacks nonadjacent inputs in each neuron), second- and third-order polynomials are used and each neuron may take inputs from both adjacent and nonadjacent layers. Finally, the performance of the best linear stochastic model is compared with that of the best GS-GMDH model, as the non-linear approach, using multi-criteria statistical indices.

1.4 Hydrological Data Collection

The costliest natural disaster in Canadian history in terms of economic losses was the June 2013 flood event that affected the city of Calgary, Alberta, where five lives were lost. As much as $6 billion CAD in economic losses were sustained (Pomeroy et al. 2016). This flood event was one of the top causes of domestic insurance losses in Canada (Insurance Bureau of Canada 2017).

Moreover, the cost of infrastructure damages, recovery costs, and emergency response were $409 million, $323 million, and $55 million CAD, respectively. In addition to the June 2013 flood event, $186,831,824 was paid by insurance companies in response to 21,179 flood claims from the 2005 flood event that also affected Calgary (Dohy 2005).

The Bow River is located in Alberta, Canada, and flows through the city of Calgary. Its headwaters are located in the Rocky Mountains at the Bow Glacier; the river merges with the Oldman River to form the South Saskatchewan River (Fig. 1a). A hydrometric station on the Bow River near Calgary (05BH004, located at 51°03′00″ N, 114°03′05″ W) collected daily discharge data from 2000 to 2018.

The Bow River drains a gross area of 7870 km². The average daily discharge of the Bow River is 90.7 m³/s. The river overflows its banks and affects structures when the flow rate reaches 500 m³/s, and overland flooding occurs when the flow rate reaches 850 m³/s (City of Calgary 2018). The daily discharge data for the Bow River and the statistical indices of these data for the training and testing stages are presented in Fig. 1b and Table 1, respectively.

Table 1 Statistical Indices of Bow River daily discharge, divided into Total, Train, and Test Periods


2 Theoretical Conceptions

2.1 Linear Modeling Conceptions

The autoregressive integrated moving average (ARIMA) is one of the most popular linear methods for predicting time series. This method may also be defined seasonally, which is then known as the Seasonal ARIMA (SARIMA). The ARIMA method is defined as:

$$ \mathrm{ARIMA}\left(p,d,q\right):\quad \varphi (B){\left(1-B\right)}^dx(t)=\theta (B)\,\varepsilon (t) $$

(1)

where p and q are the orders of the autoregressive (AR) and moving average (MA) terms, d is the degree of differencing, φ and θ are the AR and MA parameters (respectively), B is the backshift operator, (1-B)^d is the d^th non-seasonal differencing operator, x(t) is the raw time series, and ε(t) is the residual. Denoting by k either of the non-seasonal parameters (φ or θ), the corresponding polynomial operator is written as follows:

$$ k(B)=1-{k}_1B-{k}_2{B}^2-{k}_3{B}^3-\dots -{k}_n{B}^n $$

(2)

where n is the order of non-seasonal parameters (p and q). In the modelling of the time series using stochastic processes, the series should be subject to certain conditions. The Jarque-Bera (JB) (Jarque and Bera 1980) test is used to verify the normality of the series, and is defined as follows:

$$ JB=n\left(\frac{S_K^2}{6}+\frac{{\left({K}_u-3\right)}^2}{24}\right) $$

(3)

where n is the number of samples, K_u is the kurtosis, and S_K is the skewness.
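As a brief illustration of Eq. (3), the following Python sketch computes the Jarque-Bera statistic from the sample skewness and kurtosis. The function name `jarque_bera` and the synthetic samples are assumptions made for illustration only; they are not the authors' MATLAB code or the Bow River record.

```python
import numpy as np

def jarque_bera(x):
    """Jarque-Bera statistic JB = n * (S_K^2 / 6 + (K_u - 3)^2 / 24), Eq. (3)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)                 # biased sample variance
    skew = np.mean(centered ** 3) / m2 ** 1.5   # sample skewness S_K
    kurt = np.mean(centered ** 4) / m2 ** 2     # sample kurtosis K_u
    return n * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)

# Synthetic data (hypothetical): a normal sample and a strongly skewed one.
rng = np.random.default_rng(0)
jb_normal = jarque_bera(rng.normal(size=2000))
jb_skewed = jarque_bera(rng.lognormal(size=2000))
print(f"JB normal sample: {jb_normal:.2f}")
print(f"JB skewed sample: {jb_skewed:.2f}")
```

Under the null hypothesis of normality, JB approximately follows a χ² distribution with two degrees of freedom, so values well above about 5.99 indicate non-normality at the 0.05 significance level.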

If the time series is normally distributed, its stationarity is evaluated in the next step; if it is not, the series is first normalized using the transformation of Box and Cox (1964):

$$ {X}_n\left(\lambda \right)=\begin{cases}\dfrac{{\left(x+\alpha \right)}^{\lambda }}{\lambda }, & \lambda \ne 0\\ \log \left(x+\alpha \right), & \lambda =0\end{cases} $$

(4)

where X_n(λ) is the normalized time series, λ is the transformation parameter, and α is a constant chosen such that x_t + α > 0. The stationarity of the time series is also evaluated so that any necessary transformations can be applied before the series is modelled. One test applied for this purpose is the KPSS stationarity test (Kwiatkowski et al. 1992), defined as follows:

$$ {S}^2(l)=\frac{1}{n}\sum \limits_{t=1}^n{e}_t^2+\frac{2}{n}\sum \limits_{j=1}^{l}w\left(j,l\right)\sum \limits_{t=j+1}^n{e}_t{e}_{t-j} $$

(5)

$$ w\left(j,l\right)=1-\frac{j}{l+1} $$

(6)

$$ KPSS=\frac{1}{n^2}\sum \limits_{t=1}^n\frac{S_t^2}{S^2(l)} $$

(7)

where e_t are the residuals, S_t = Σ_{i=1}^{t} e_i is their partial sum, and l is the truncation lag. The KPSS statistic tests the stationarity of a series around a level or a trend. Each time series is composed of four terms: trend, jump, period, and a stochastic term. The presence of any of the first three terms makes the time series non-stationary. One of the most commonly used methods for making a time series stationary is differencing. In this method, the differenced series is created by subtracting consecutive data values (i.e., Diff(t) = X(t) - X(t-1)). The trend and seasonal changes in the series are thereby eliminated, and a stationary series is ultimately obtained. More generally, differencing (diff), standardization (Std), and spectral analysis (Sf) can be used as stationarization methods (Bonakdari et al. 2019). The non-parametric Mann-Kendall test is applied to test for a trend, that is, to identify gradual changes that occur over time in the time series (Jain and Kumar 2012). The standardized Mann-Kendall statistic (STD_MK) is obtained as follows:

$$ {STD}_{MK}=\begin{cases}\left( MK-1\right)\operatorname{var}{(MK)}^{-0.5}, & MK>0\\ 0, & MK=0\\ \left( MK+1\right)\operatorname{var}{(MK)}^{-0.5}, & MK<0\end{cases} $$

(8)

where MK is the Mann-Kendall statistic, and var(MK) is the variance of MK. MK and var(MK) are calculated as:

$$ MK=\sum \limits_{i=1}^{N-1}\sum \limits_{j=i+1}^N\operatorname{sgn}\left({X}_j-{X}_i\right) $$

and

$$ \operatorname{var}(MK)=\frac{\left(2{N}^3+3{N}^2-5N\right)-\sum \limits_{j=1}^g{Obs}_j\left({Obs}_j-1\right)\left(2{Obs}_j+5\right)}{18} $$

(9)

where X denotes the data values, Obs_j is the number of observations in the j^th group of tied values, g is the number of tied groups, N is the number of samples, and sgn is the sign function. Gradual changes in the time series may also recur seasonally, leading to a seasonal trend in the time series. In this case, the seasonal trend is identified using the seasonal Mann-Kendall test as follows:

$$ {S}_k=\sum \limits_{i=1}^{N_k-1}\sum \limits_{j=i+1}^{N_k}\operatorname{sgn}\left({X}_{kj}-{X}_{ki}\right) $$

(10)

$$ SMK=\sum \limits_{k=1}^{\omega}\left({S}_k-\operatorname{sgn}\left({S}_k\right)\right) $$

(11)

$$ \operatorname{var}(SMK)=2\sum \limits_{i=1}^{\omega -1}\sum \limits_{j=i+1}^{\omega }{\sigma}_{ij}+\sum \limits_{k=1}^{\omega}\frac{2{N}_k^3+3{N}_k^2-5{N}_k}{18} $$

(12)

$$ {STD}_{SMK}= SMK\,\operatorname{var}{(SMK)}^{-0.5} $$

(13)

where N_k is the number of samples in season k, σ_ij is the covariance of the test statistics in seasons i and j, and ω is the number of seasons in a year. If the probability associated with these test statistics is higher than the significance level of 0.05, the time series has no trend. The presence of a jump in the series can be tested with the following equation (Mann and Whitney 1947):

$$ {MW}_U=\frac{\sum \limits_{t=1}^{N_{m1}} Dg\left({X}_{ordered}(t)\right)-\frac{N_{m1}\left({N}_{m1}+{N}_{m2}+1\right)}{2}}{{\left(\frac{N_{m1}{N}_{m2}\left({N}_{m1}+{N}_{m2}+1\right)}{12}\right)}^{0.5}} $$

(14)

In this relationship, X_ordered is the series arranged in order, Dg(X_ordered) is the rank (degree) of a member of X_ordered, and N_m1 and N_m2 are the numbers of members of the two sub-series of the original series, with N_m1 + N_m2 = N_total. Periodicity in a time series can be examined using the autocorrelation function (ACF) and partial autocorrelation function (PACF) diagrams. A test that examines periodicity numerically is the Fisher test (Kashyap and Rao 1976). The test statistic is calculated as:

$$ {F}^{\ast }=\frac{N\left(N-2\right)\left({\alpha}_k^2+{\beta}_k^2\right)}{4\sum \limits_{t=1}^N{\left(x(t)-{\alpha}_k\cos \left({\Omega}_kt\right)-{\beta}_k\sin \left({\Omega}_kt\right)\right)}^2} $$

(15)

where N is the number of sample data, F* is the Fisher test statistic, α_z and β_z are the Fourier coefficients, and Ω_z is the angular frequency, obtained as follows:

$$ {\alpha}_z=\frac{2}{N}\sum \limits_{t=1}^Nx(t)\cos \left(2\pi {f}_zt\right),\quad z=1,2,\dots, k $$

(16)

$$ {\beta}_z=\frac{2}{N}\sum \limits_{t=1}^Nx(t)\sin \left(2\pi {f}_zt\right),\quad z=1,2,\dots, k $$

(17)

$$ {f}_z=\frac{z}{N},\quad {\Omega}_z=\frac{2\pi z}{N},\quad z=1,2,\dots, k $$

(18)

In the above relationships, f_z is the z-th harmonic of the base frequency. The periodicity at Ω_z is significant when the F* value is not lower than the critical value F(2, N-2) at the chosen confidence level:

$$ {F}^{\ast}\ge F\left(2,N-2\right) $$

(19)

For a significance level of 0.05 and a large sample size, the critical value F(2, N-2) is approximately 3. The Ljung-Box test is used to check the validity of the model by verifying the independence of the residuals of the time series (Ljung and Box 1978). The test statistic is calculated as follows:

$$ {\mathcal{Q}}_m=N\left(N+2\right)\sum \limits_{h=1}^m\frac{r_h^2}{N-h} $$

(20)

In this relationship, N is the number of samples, r_h is the autocorrelation coefficient of the residuals ε(t) at lag h, and m is set equal to ln(N). If the probability of the Ljung-Box test statistic under the χ² distribution is higher than the significance level α (in this case P_Q > α = 0.05), the residual series is independent, and the model is appropriate.
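The Ljung-Box check can be sketched in a few lines of Python. This is a minimal illustration using the standard form of the statistic, Q = N(N+2) Σ r_h²/(N−h); the function name `ljung_box`, the rounding of ln(N) to an integer number of lags, and the toy residual series are assumptions, not part of the original study.

```python
import numpy as np

def ljung_box(residuals, m=None):
    """Return (Q, m): the Ljung-Box statistic over lags 1..m."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    if m is None:
        # The text sets m = ln(N); rounding to an integer is an assumption.
        m = max(1, int(round(np.log(n))))
    e = e - e.mean()
    denom = np.sum(e ** 2)
    total = 0.0
    for h in range(1, m + 1):
        r_h = np.sum(e[h:] * e[:-h]) / denom   # lag-h autocorrelation
        total += r_h ** 2 / (n - h)
    return n * (n + 2) * total, m

# Toy residuals (hypothetical): a strictly alternating series is far from
# independent, so Q is very large even with a single lag.
alternating = np.tile([1.0, -1.0], 50)
q_alt, _ = ljung_box(alternating, m=1)
print(f"Q for alternating residuals, m=1: {q_alt:.2f}")
```

The resulting Q is compared with the χ² distribution with m degrees of freedom (reduced by the number of fitted ARIMA parameters when the test is applied to model residuals).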

2.2 Group Method of Data Handling (GMDH)

The GMDH neural network is built from a set of neurons, each of which connects a pair of inputs through a quadratic polynomial. The network describes an approximate function $ \hat{f} $ with output $ \hat{y} $ for the inputs X = (x_1, x_2, …, x_n), with the lowest possible error compared to the actual output y. Therefore, for M observed samples, each including n inputs and one output, the actual results are represented in the form of:

$$ {y}_i=f\ \left({x}_{i1},{x}_{i2},{x}_{i3},\dots, {x}_{in}\right)\left(i=1,2,\dots, M\right) $$

(21)

In the GMDH method, a network that can predict the output value $ \hat{y} $ for any input vector X is identified according to:

$$ {\hat{y}}_i=\hat{f}\left({x}_{i1},{x}_{i2},{x}_{i3},\dots, {x}_{in}\right)\quad \left(i=1,2,\dots, M\right) $$

(22)

So that the mean square error between the observed values and the estimated values is minimized, as:

$$ MSE=\frac{\sum \limits_{i=1}^M{\left({\hat{y}}_i-{y}_i\right)}^2}{M}\to \mathit{\operatorname{Min}} $$

(23)

The general formula of structure between the input and output variables can be represented using the polynomial function, as:

$$ y={a}_0+\sum \limits_{i=1}^n{a}_i{x}_i+\sum \limits_{i=1}^n\sum \limits_{j=1}^n{a}_{ij}{x}_i{x}_j+\sum \limits_{i=1}^n\sum \limits_{j=1}^n\sum \limits_{k=1}^n{a}_{ijk}{x}_i{x}_j{x}_k+\dots $$

(24)

For two input variables, the second-order form of this polynomial is expressed as:

$$ \hat{y}=G\left({x}_i,{x}_j\right)={a}_0+{a}_1{x}_i+{a}_2{x}_j+{a}_3{x}_i^2+{a}_4{x}_j^2+{a}_5{x}_i{x}_j $$

(25)

The unknown coefficients a_i in the above equation are estimated by regression in such a way that the difference between the true output y and the estimated output $ \hat{y} $ for each pair of input variables x_i and x_j is minimized. A set of polynomials is constructed using Eq. (25), and all unknown coefficients are calculated by the least squares (LS) method. The coefficients of each neuron equation (for each function G_i) are obtained by minimizing its total error so as to fit all input-output data sets optimally:

$$ E=\frac{\sum \limits_{i=1}^M{\left({y}_i-{G}_i\right)}^2}{M}\to \mathit{\operatorname{Min}} $$

(26)

In the GMDH algorithm, neurons are constructed from all pairs of the n input variables, and the unknown coefficients of all neurons are calculated using the LS method. Therefore, the number of neurons in the second layer is $ \binom{n}{2}=\frac{n\left(n-1\right)}{2} $, built from the following set of data triples:

$$ \left\{\left({y}_i,{x}_{ip},{x}_{iq}\right)\mid i=1,2,\dots, M;\ p,q\in \left\{1,2,\dots, n\right\}\right\} $$

(27)

Applying the quadratic form of Eq. (25) to each of the M data triples, these equations can be expressed in the following matrix form:

$$ Aa=Y $$

(28)

where a is the vector of unknown coefficients of the second-order equation shown in Eq. (25), Y is the vector of observed outputs, and A is the matrix of input terms:

$$ a=\left\{{a}_0,{a}_1,\dots {a}_5\right\} $$

(29)

$$ Y={\left\{{y}_1,{y}_2\dots {y}_M\right\}}^T $$

(30)

$$ A=\left[\begin{array}{cccccc}1& {x}_{1p}& {x}_{1q}& {x}_{1p}^2& {x}_{1q}^2& {x}_{1p}{x}_{1q}\\ 1& {x}_{2p}& {x}_{2q}& {x}_{2p}^2& {x}_{2q}^2& {x}_{2p}{x}_{2q}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1& {x}_{Mp}& {x}_{Mq}& {x}_{Mp}^2& {x}_{Mq}^2& {x}_{Mp}{x}_{Mq}\end{array}\right] $$

(31)

The least squares method of multiple regression analysis yields the solution of the normal equations in the following form:

$$ a={\left({A}^T\ A\right)}^{-1}{A}^TY $$

(32)

This equation generates the vector of coefficients of Eq. (25) for the complete set of M data triples.
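As an illustration of Eqs. (25) to (32), the following Python sketch fits a single quadratic neuron by least squares. The function names and the synthetic data are assumptions for illustration and do not reproduce the authors' MATLAB implementation. `np.linalg.lstsq` is used in place of the explicit inverse in Eq. (32); it returns the same coefficients when AᵀA is invertible and is numerically more stable.

```python
import numpy as np

def design_matrix(xp, xq):
    """Rows [1, x_p, x_q, x_p^2, x_q^2, x_p*x_q], matching matrix A in Eq. (31)."""
    xp = np.asarray(xp, dtype=float)
    xq = np.asarray(xq, dtype=float)
    return np.column_stack([np.ones_like(xp), xp, xq, xp ** 2, xq ** 2, xp * xq])

def fit_neuron(xp, xq, y):
    """Least-squares estimate of a = (a_0, ..., a_5) in Eq. (25)."""
    A = design_matrix(xp, xq)
    a, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return a

def predict_neuron(a, xp, xq):
    """Evaluate G(x_p, x_q) for fitted coefficients a."""
    return design_matrix(xp, xq) @ a

# Synthetic check (hypothetical data): recover known coefficients.
rng = np.random.default_rng(1)
xp = rng.uniform(0.0, 1.0, 50)
xq = rng.uniform(0.0, 1.0, 50)
true_a = np.array([0.5, 1.0, -2.0, 0.3, 0.7, -1.1])
y = design_matrix(xp, xq) @ true_a
a_hat = fit_neuron(xp, xq, y)
print(np.round(a_hat, 6))
```

In a full GMDH layer, this fit would be repeated for every pair (p, q) of inputs and the best neurons retained according to the error criterion of Eq. (26).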

2.3 Generalized Structure of GMDH

Although GMDH has a great ability to model non-linear problems, the method is subject to some limitations: 1) it uses only a second-order polynomial; 2) the inputs of each neuron are provided only from neurons of the adjacent layer; and 3) each neuron has only two inputs. Therefore, the method may not be adequate for problems of considerable complexity. To address these limitations, this study introduces the generalized structure of GMDH (GS-GMDH). In this method, each neuron can have two or three inputs. In addition to second-order polynomials, third-order polynomials can also be used. The inputs of each neuron can come from neurons of the adjacent layer as well as from neurons of nonadjacent layers.
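The generalized neuron described above can be sketched as a polynomial feature builder that accepts two or three inputs and a degree of two or three. This is a hypothetical illustration of the idea, not the GS-GMDH code used in the study; the names `poly_features` and `fit_generalized_neuron` are assumptions, and layer construction and neuron selection are omitted.

```python
from itertools import combinations_with_replacement
import numpy as np

def poly_features(inputs, degree):
    """All monomials of the input columns up to `degree`, including the constant."""
    X = np.asarray(inputs, dtype=float)
    m, n_in = X.shape
    cols = [np.ones(m)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n_in), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_generalized_neuron(inputs, y, degree):
    """Least-squares coefficients of a neuron with 2 or 3 inputs, degree 2 or 3."""
    A = poly_features(inputs, degree)
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic example (hypothetical): a three-input cubic interaction cannot be
# represented by a classical two-input quadratic neuron.
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(60, 3))
y = 2.0 + X[:, 0] * X[:, 1] * X[:, 2]
coef = fit_generalized_neuron(X, y, degree=3)
residual = poly_features(X, 3) @ coef - y
print(f"terms: {coef.size}, max |residual|: {np.max(np.abs(residual)):.2e}")
```

Inputs from nonadjacent layers fit into the same sketch: the columns passed as `inputs` may mix original variables with outputs of neurons from any earlier layer.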

2.4 The Structure of the Proposed Models

In this study, a linear and a non-linear method for modelling the daily discharge of the Bow River are presented. In the non-linear method, the GMDH algorithm is used; as explained in the previous section, its structure has been modified so that it has advantages over the classical GMDH method. The linear method is the ARIMA method, which has been used by several previous researchers. First, the normality of the training time series is evaluated with the Jarque-Bera test. In the case of non-normality, the series is normalized with the Box-Cox transformation. The stationarity of the time series is then assessed using the KPSS test. Jump and period are the other deterministic terms. Their existence is examined using the Mann-Whitney and Fisher tests (respectively), and they are removed by differencing, standardization, or spectral analysis. After the deterministic terms are eliminated, the time series is modelled using the ARIMA method. After modelling, the independence of the residuals is evaluated using the Ljung-Box test. Following this verification step, the accuracy of the linear modelling results and of the GS-GMDH method is evaluated using the test data (Fig. 2).
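The stationarity check in this procedure can be illustrated with a short Python sketch of the level-stationarity KPSS statistic of Eqs. (5) to (7), in its standard Kwiatkowski et al. (1992) form with Bartlett weights. The function name `kpss_level`, the choice of truncation lag, and the toy series are assumptions made for illustration.

```python
import numpy as np

def kpss_level(x, lag):
    """KPSS statistic for stationarity around a constant level."""
    x = np.asarray(x, dtype=float)
    n = x.size
    e = x - x.mean()                  # residuals from a constant (level) fit
    partial_sums = np.cumsum(e)       # S_t
    s2 = np.sum(e ** 2) / n           # first term of the long-run variance
    for j in range(1, lag + 1):
        w = 1.0 - j / (lag + 1.0)     # Bartlett weight w(j, l)
        s2 += 2.0 * w * np.sum(e[j:] * e[:-j]) / n
    return np.sum(partial_sums ** 2) / (n ** 2 * s2)

# Toy series (hypothetical): a pure linear trend versus an alternating series.
trend_stat = kpss_level(np.arange(100.0), lag=0)
flat_stat = kpss_level(np.tile([1.0, -1.0], 50), lag=0)
print(f"trend: {trend_stat:.3f}, alternating: {flat_stat:.3f}")
```

For the level-stationarity case, Kwiatkowski et al. (1992) tabulate a 5% critical value of 0.463; a statistic above this value leads to rejection of stationarity. In practice a non-zero truncation lag is used to account for serial correlation.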

3 Modelling Evaluation Measures

Due to the stochastic nature of hydrological variables, a single criterion is not sufficient to assess the performance of a statistical model. In this study, the coefficient of determination (R²) and two relative indices, the mean absolute percentage error (MAPE) and the root mean square relative error (RMSRE), are used to establish the efficacy of the linear and non-linear models.

$$ {R}^2\left(\%\right)=100\times {\left(\frac{\sum \limits_{i=1}^n\left({X}_{obs,i}-{\overline{X}}_{obs}\right)\left({X}_{P,i}-{\overline{X}}_P\right)}{\sqrt{\sum \limits_{i=1}^n{\left({X}_{obs,i}-{\overline{X}}_{obs}\right)}^2\sum \limits_{i=1}^n{\left({X}_{P,i}-{\overline{X}}_P\right)}^2}}\right)}^2 $$

(33)

$$ MAPE=\frac{100}{n}\sum \limits_{i=1}^n\left|\frac{X_{obs,i}-{X}_{P,i}}{X_{obs,i}}\right| $$

(34)

$$ RMSRE=100\times \sqrt{\frac{1}{n}\sum \limits_{i=1}^n{\left(\frac{X_{obs,i}-{X}_{P,i}}{X_{obs,i}}\right)}^2} $$

(35)

The value of R², bounded by [0, 1], describes the proportion of the variance in the actual daily discharge data that can be explained by the predictive model, but it rests on linear assumptions (Krause et al. 2005). The R², RMSRE, and MAPE are insensitive to outliers (Legates and McCabe 1999). The Nash-Sutcliffe coefficient (E_N-S) is employed to overcome this limitation of the previously mentioned indices. Since none of R², RMSRE, MAPE, and E_N-S considers the complexity of the model, the Akaike information criterion (AIC) is used to compare the performance of the linear and non-linear models with regard to accuracy and complexity simultaneously.

$$ {E}_{N-S}\left(\%\right)=\left[1-\frac{\sum \limits_{i=1}^N{\left({X}_{obs,i}-{X}_{P,i}\right)}^2}{\sum \limits_{i=1}^N{\left({X}_{obs,i}-{\overline{X}}_{obs}\right)}^2}\right]\times 100 $$

(36)

$$ AIC=N\ln \left(\sum \limits_{i=1}^N{\left({X}_{obs,i}-{X}_{P,i}\right)}^2\right)+2k $$

(37)

In the above equations, k is the number of model parameters, N (written n in Eqs. 33 to 35) is the number of samples, X_obs,i and X_P,i are the i^th observed and predicted values, respectively, and overbars denote mean values.
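The five indices of Eqs. (33) to (37) can be computed together, as in the following Python sketch. The function name `evaluate` and the four-point example are assumptions for illustration; the numbers are not taken from the Bow River data or from Table 4.

```python
import numpy as np

def evaluate(obs, pred, k):
    """R2, MAPE, RMSRE, Nash-Sutcliffe (all in %) and AIC for k parameters."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    n = obs.size
    err = obs - pred
    rel = err / obs                      # relative error; requires obs != 0
    sse = np.sum(err ** 2)               # must be > 0 for the logarithm in AIC
    r = np.corrcoef(obs, pred)[0, 1]
    return {
        "R2_pct": 100.0 * r ** 2,
        "MAPE_pct": 100.0 * np.mean(np.abs(rel)),
        "RMSRE_pct": 100.0 * np.sqrt(np.mean(rel ** 2)),
        "NSE_pct": 100.0 * (1.0 - sse / np.sum((obs - obs.mean()) ** 2)),
        "AIC": n * np.log(sse) + 2 * k,
    }

# Hypothetical discharges in m3/s and a model with k = 2 parameters.
metrics = evaluate([100.0, 200.0, 300.0, 400.0],
                   [110.0, 190.0, 310.0, 390.0], k=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the AIC term 2k grows with the number of parameters, two models with nearly identical errors are separated by their complexity, which is the basis of the model ranking discussed in the results.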

4 Development of Linear and Non-linear Modeling

4.1 Linear Modelling

For linear modelling with the ARIMA model, the features of the time series need to be well identified, and the series must, if necessary, be made stationary and normal through appropriate pre-processing. Therefore, the correlogram of the daily discharge (DD) series is plotted first (Fig. 3a). It can be seen that the DD time series is volatile and has strong seasonal and non-seasonal correlations. Non-seasonal correlations exist up to the first 52 lags, and seasonal correlations exist up to four lags with 365-day steps. The period should be eliminated from the time series by appropriate methods.

Table 2 presents the results of the different tests used for the numerical verification of the DD time-series features. The table shows that the DD time series has seasonal and non-seasonal trends. Also, based on the Fisher and JB test statistics, the time series is periodic and is not normally distributed. As the figure shows, the DD time series has a jump in the validation period, which is confirmed by the Mann-Whitney test. Despite these features, the DD time series is non-stationary based on the KPSS numerical test.

Table 2 Test results of applied tests on BRDD data and pre-processed outcomes

The ACF graphs of the series were re-drawn (Fig. 3b) to investigate the changes in the pre-processed series. In this figure, diff, Std, and Sf represent differencing, non-seasonal standardization, and spectral analysis, and the changes caused by pre-processing are clearly seen. The degree of seasonal and non-seasonal correlation in the differenced series has been greatly reduced, and the stationarity of this pre-processed time series is evident. The standardization and spectral analysis methods have reduced the seasonal and non-seasonal correlations, but they have not made the series stationary, and these correlations remain high. Therefore, the ARMA model cannot be used for modelling. Differencing is applied to examine the possibility of modelling the data using the ARIMA model.

The results are presented in Table 2, which shows that both the seasonal and non-seasonal trends and the jumps in the series have been eliminated. Although a periodic term has been created in the standardization and spectral analysis methods, the series are considered stationary. Changes in the correlation diagrams of these series are shown in Fig. 4. Seasonal correlations have been eliminated, and the remaining correlations extend to a maximum of two lags. Therefore, an ARIMA linear model with a maximum of two for the non-seasonal parameters p and q and one differencing is very suitable.
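The effect of differencing on a series with a trend and a 365-day cycle can be illustrated with a small Python sketch. The synthetic series and the helper `difference` are assumptions for illustration only; they are not the Bow River data or the pre-processing code used in the study.

```python
import numpy as np

def difference(x, lag=1):
    """Lagged difference Diff(t) = X(t) - X(t - lag)."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]

# Synthetic daily series (hypothetical): linear trend plus an annual cycle.
t = np.arange(3 * 365)
series = 0.01 * t + 10.0 * np.sin(2.0 * np.pi * t / 365.0)

seasonal = difference(series, lag=365)   # removes the 365-day periodic term
both = difference(seasonal, lag=1)       # removes the level left by the trend

print(f"std of raw series:        {series.std():.3f}")
print(f"std after seasonal diff:  {seasonal.std():.3e}")
print(f"std after both diffs:     {both.std():.3e}")
```

In this idealized case the seasonal difference leaves only the constant 0.01 × 365 contributed by the trend, and one further non-seasonal difference reduces the series to zero; real discharge data retain a stochastic term that the ARIMA model then describes.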

4.2 Nonlinear Modeling

Using the graphs presented in Fig. 2 and considering that in the GS-GMDH method, at least two variables should be considered as inputs, several models were considered as follows:

$$ \mathrm{M}1:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right)\right) $$

$$ \mathrm{M}2:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right),Q\left(t-3\right)\right) $$

$$ \mathrm{M}3:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right),Q\left(t-3\right),Q\left(t-4\right)\right) $$

$$ \mathrm{M}4:Q(t)=f\left(Q\left(t-1\right),Q\left(t-2\right),Q\left(t-3\right),Q\left(t-4\right),Q\left(t-5\right)\right) $$

Using the GS-GMDH method and considering the four models above, various relationships are proposed to predict the Bow River discharge, as shown in Table 3.

Table 3 The proposed GS-GMDH equations for M1 to M4
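The input structures M1 to M4 differ only in the number of lagged discharges supplied to the model. A minimal Python sketch of how such lagged input matrices might be assembled is given below; the helper `lagged_inputs`, the dictionary `MODELS`, and the toy series are assumptions for illustration.

```python
import numpy as np

def lagged_inputs(q, n_lags):
    """Return (X, y) with columns Q(t-1), ..., Q(t-n_lags) and target Q(t)."""
    q = np.asarray(q, dtype=float)
    cols = [q[n_lags - k: q.size - k] for k in range(1, n_lags + 1)]
    return np.column_stack(cols), q[n_lags:]

# Number of lags used by each input combination M1 to M4.
MODELS = {"M1": 2, "M2": 3, "M3": 4, "M4": 5}

# Toy discharge series (hypothetical values).
q = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
for name, n_lags in MODELS.items():
    X, y = lagged_inputs(q, n_lags)
    print(name, X.shape, y.shape)
```

Each row of X pairs the previous discharges with the discharge to be predicted, so the first n_lags observations of the record are used only as inputs.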


5 Results and Discussion

Figure 5 shows the scatter plots of the ARIMA-based linear methods (diff, Std, Sf) and the GS-GMDH (M1-M4) models in daily discharge prediction. A comparison of the GS-GMDH models with the linear models shows relatively similar performance, with the difference that the GS-GMDH method performs better than the linear methods (Std, Sf) at the maximum discharge in the testing period.

Figure 6a depicts the box plot of the observed and predicted daily discharge. The performance of all methods (linear and non-linear) in different ranges is approximately the same, and the estimated average values are approximately equal to the observed average. The spread of the observed and estimated values between the first and third quartiles is also similar. As observed in the scatter plot, the main difference between the performances of the models is in the peak discharges.

The qualitative comparison of the two sets of models presented in this study (Figs. 5 and 6a) shows the good and similar performance of both in estimating the Bow River daily discharge.

Figure 6b presents the box plot of the relative error of the ARIMA (diff, Std, Sf) and GS-GMDH (M1-M4) models for estimation of the Bow River daily discharge. The distribution of the relative error in the non-linear and linear methods shows that the average relative error of all methods is less than 10%. Disregarding outliers, the maximum relative error of the methods used is less than 20%. With respect to outlier relative errors, the largest errors occur in the linear methods, especially the diff method, and the smallest maximum outlier error is obtained by the GS-GMDH (M1) method.

The performance evaluation of the ARIMA (diff, Std, Sf) and GS-GMDH (M1-M4) methods qualitatively confirmed the ability of both approaches to predict the Bow River daily discharge. For a closer comparison of the two approaches and determination of the superior model, a quantitative analysis is required. The indices presented in Table 4 confirm the strong performance of the proposed models, which were examined qualitatively above. The average relative error of these models is about 6%, and all models have a very high correlation coefficient. The superior model should therefore be chosen using an index that accounts for both accuracy and simplicity.

Table 4 Statistical indices for linear and non-linear methods


The complexity of the model is evaluated using the AIC index. The superior model is the one with the smaller value of this index. For linear models, the number of parameters p and q used in the ARIMA model is taken as k in the definition of the AIC, while for the GS-GMDH model, k is the number of coefficients in the model equation. Among the linear methods, the Std method has the lowest AIC. Its AIC value is only slightly better than that of Sf, but the difference compared to diff is significant.

In the non-linear methods, the values of all indices except the AIC are nearly constant across the models: increasing the number of inputs does not significantly change the accuracy of the model, but it does increase the complexity relative to the model with fewer input parameters. Therefore, considering that the GS-GMDH (M1) method has the lowest AIC among both linear and non-linear methods, this model is selected as the superior model for predicting the Bow River daily discharge.

6 Conclusions

In this study, the accuracy of a linear stochastic model and of non-linear GMDH models for daily discharge forecasting was compared. The linear stochastic method incorporates three input data pre-processing methods: differencing (diff), standardization (Std), and spectral analysis (Sf). In addition to the linear methodology, a non-linear method based on the GMDH was developed. The most notable results are summarized as follows:

The proposed GS-GMDH improved on the results of the classical GMDH by allowing more than two input parameters in each neuron, allowing each neuron to take inputs from non-adjacent layers, and employing second- and third-order polynomials to build the structure between the input and output variables.
Comparison of the linear stochastic and the non-linear GMDH methods showed that all linear methods (diff, Std and Sf) and non-linear methods (M1-M4) forecast the Bow River daily discharge with high accuracy, with an average relative error below 6%.
This study showed that appropriate pre-processing can improve the results of a stochastic model, giving it a forecast accuracy for the daily discharge similar to that of the more complex non-linear GMDH model.
Comparison of all methods using an index that simultaneously considers the accuracy and the simplicity of the model (AIC) showed that the GS-GMDH (M1) method performs best among all the methods considered and can be used in practical applications.

References

Bonakdari H, Moeeni H, Ebtehaj I, Zeynoddin M, Mahoammadian A, Gharabaghi B (2019) New insights into soil temperature time series modeling: linear or non-linear? Theor Appl Climatol 135:1155–1177
Article Google Scholar
Box GE, Cox DR (1964) An analysis of transformations. J R stat Soc series B:211-252
City of Calgary (2018) Understanding river flow rates. Retrieved from Calgary: http://www.calgary.ca/UEP/Water/Pages/Flood-Info/Types-of-flooding-in-Calgary/Understanding-river-flow-rates.aspx
Dohy L (2005) Flood costs soaring in Alberta. Infomart, Postmedia Network Inc., Don Mills
Google Scholar
Ebtehaj I, Zeynoddin M, Bonakdari H (2020) Discussion of “comparative assessment of time series and artificial intelligence models to estimate monthly streamflow: a local and external data analysis approach” by Saeid Mehdizadeh, Farshad Fathian, Mir Jafar Sadegh safari and Jan F. Adamowski J Hydrol 583:124614
Article Google Scholar
Gharabaghi B, Sattar A (2019) Empirical models for longitudinal dispersion coefficient in natural streams. J Hydrol 575:1359–1361
Article Google Scholar
Gholami A, Bonakdari H, Mohammadian M, Zaji AH, Gharabaghi B (2019) Assessment of geomorphological bank evolution of the alluvial threshold rivers based on entropy concept parameters. Hydrolog Sci J 64(7):856–872
Article Google Scholar
Insurance Bureau of Canada (2017) Facts of the property and casualty insurance industry in Canada 2017 . Insurance Bureau of Canada
Jarque CM, Bera AK (1980) Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Econ Lett 6(3):255–259
Article Google Scholar
Kashyap RL, Rao AR (1976) Dynamic stochastic models from empirical data. Academic Press, New York, USA
Kelly G, Stodolak P (2013) Why insurers fail, natural disasters and catastrophes. Casualty Insurance Compensation Corperation, Toronto
Google Scholar
Khatibi R, Sivakumar B, Ghorbani MA, Kisi O, Kocak K, FarsadiZadeh D (2012) Investigating chaos in river stage and discharge time series. J Hydrol 414–415:108–117
Article Google Scholar
Krause P, Boyle DP, Base F (2005) Comparison of different efficiency criteria for hydrological model assessment. Adv Geosci 5:89–97
Article Google Scholar
Kwiatkowski D, Phillips PC, Schmidt P, Shin Y (1992) Testing the null hypothesis of stationarity against the alternative of a unit root: how sure are we that economic time series have a unit root? J Econ 54(1–3):159–178
Article Google Scholar
Jain SK, Kumar V (2012) Trend analysis of rainfall and temperature data for India. Curr Sci 102(1):37–49
Legates DR, Mccabe GJ (1999) Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resour Res 35:233–241
Article Google Scholar
Ljung GM, Box GEP (1978) On a measure of lack of fit in time series models. Biometrika 65(2):297–303
Article Google Scholar
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18:50–60
Article Google Scholar
Mosavi A, Ozturk P, Chau KW (2018) Flood prediction using machine learning models: literature review. Water 10(11):1536
Article Google Scholar
Najafzadeh M, Barani GA, Hessami-Kermani MR (2015) Evaluation of GMDH networks for prediction of local scour depth at bridge abutments in coarse sediments with thinly armored beds. Ocean Eng 104:387–396
Article Google Scholar
Pomeroy J, Stewart RE, Whitfield PH (2016) The 2013 flood event in the South Saskatchewan and Elk River basins: causes, assessment and damages. Can Water Resour J 41(1–2):105–117
Article Google Scholar
Serinaldi F, Loecker F, Kilsby CG, Bast H (2018) Flood propagation and duration in large river basins: a data-driven analysis for reinsurance purposes. Nat Hazards 94:71–92
Article Google Scholar
Walton R, Binns A, Bonakdari H, Ebtehaj I, Gharabaghi B (2019) Estimating 2-year flood flows using the generalized structure of the group method of data handling. J Hydrol 575:671–689
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Soils and Agri-Food Engineering, Laval University, Québec, G1V0A6, Canada
Hossein Bonakdari
School of Engineering, University of Guelph, Guelph, Ontario, N1G 2W1, Canada
Andrew D. Binns & Bahram Gharabaghi

Corresponding author

Correspondence to Hossein Bonakdari.

Ethics declarations

Conflict of Interest

None

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1

(DOCX 80 kb)

About this article

Cite this article

Bonakdari, H., Binns, A.D. & Gharabaghi, B. A Comparative Study of Linear Stochastic with Nonlinear Daily River Discharge Forecast Models. Water Resour Manage 34, 3689–3708 (2020). https://doi.org/10.1007/s11269-020-02644-y

Received: 08 January 2020
Accepted: 29 July 2020
Published: 08 August 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s11269-020-02644-y
