1 Introduction

Suspended sediment transport modeling has progressed significantly in recent years, in step with advances in hydrological and environmental modeling. Understanding the sediment transport process and modeling such a complex phenomenon are important in water resources management [29, 30]. The suspended sediment concentration (SSC) in rivers is a key concern in environmental, hydraulic, and water resources engineering. Sediments alter several features of river systems, such as water quality and health (through pollutant transport) and the geography and navigability of rivers and channels [20, 44]. Sediment conveyed within the flow that remains in suspension over a considerable length of river and time is referred to as the suspended load, and its prediction is challenging because it depends on several hydrological and meteorological parameters of a particular watershed [14, 47]. Conventionally, the sediment rating curve method is widely applied for SSC computation. It describes an exponential relationship between river discharge and SSC obtained through regression analysis. Because the exponential regression equation is fitted to the entire data set, it may over-fit and generate poor results on other data sets. Therefore, a more precise modeling approach is needed for such a complex problem [27]. Modeling sediment transport in rivers with theoretical equations and mathematical models requires a wide range of data, and where such extensive data are lacking, these models do not provide accurate estimates [47, 48].

Recently, numerous studies have applied a variety of machine learning methods to sediment transport modeling. For example, Rajaee et al. [33] applied several methods, including neuro-fuzzy (NF), artificial neural network (ANN), and multi-linear regression (MLR) models, to simulate the daily SSC in the USA. In that study, both the ANN and NF models performed well in predicting SSC. Cobaner et al. [8] developed an adaptive neuro-fuzzy model for SSC computation using streamflow, rainfall, and suspended sediment data. The adaptive neuro-fuzzy model was reported to outperform several types of ANN, namely the generalized regression neural network (GRNN), radial basis function neural network (RBNN), and multi-layer perceptron (MLP). Altunkaynak [2] and Zhang et al. [50] applied the genetic algorithm (GA) and found that it was more accurate than other approaches. Using neural differential evolution (NDE), NF, ANN, and sediment rating curve (SRC) methods, Kisi [20] modeled daily SSC and found that NDE outperformed the other techniques. Comparing linear genetic programming (LGP) with the adaptive neuro-fuzzy inference system (ANFIS) and SRC methods for SSC estimation, Kisi and Guven [21] reported that the LGP model provided more accurate results than the other models. Singh and Chakrapani [43] examined the capability of the feed-forward backpropagation (FFBP) ANN method for simulating SSC with rainfall, temperature, and discharge as model inputs. They determined that increasing the number of input parameters improved the accuracy of the developed model. Kumar et al. [23] used the classification and regression tree (CART), RBNN, MLR, ANN, M5 model tree, and least square support vector regression (LSSVR) to estimate the suspended sediment in a river basin in India. Both the ANN and LSSVR models were found to generate accurate results.

The studies mentioned above applied standalone machine learning algorithms for SSC modeling. Given the complexity of the problem and the difficulty standalone models have in tuning their many parameters, hybrid algorithms can be reliable approaches for modeling the SSC in rivers. As an example from the literature, because of the complicated relationship between SSC and streamflow (Q), Liu et al. [24] used the wavelet-artificial neural network (W-ANN) model to predict the SSC in the Kuye River. They decomposed the daily time series of SSC and Q into subseries at different scales as inputs for the model. The W-ANN model performed best, with higher forecasting precision than other models such as SRC and ANN. Kisi and Zounemat-Kermani [22] proposed ANFIS-FCM (ANFIS with fuzzy c-means clustering) to estimate the SSC in the USA. ANFIS-FCM gave better results than the other models, including ANFIS, ANN, and SRC. Zounemat-Kermani et al. [52] used data from hydrometric stations in different parts of the USA, such as Arkansas, Delaware, and Idaho, to assess four support vector regression (SVR) models and three ANN methods for daily SSC prediction and compared them with the MLR and SRC methods. The statistical measures indicated that the SVR and ANN models were superior to the traditional methods. Malik et al. [26] reported that the co-active neuro-fuzzy inference system (C-ANFIS) outperformed MLR, multiple nonlinear regression (MNLR), MLP, and SRC. Ghose and Samantaray [16] used ANN-FFBP and regression approaches together with their hybridized versions, GA-BPNN and GA-regression, for the same purpose at various basins of the Suktel River to assess their sensitivity at the regional scale. Liu et al. [25] modeled SSC time series for the Kuye River watershed, China, using hybrid ensemble empirical mode decomposition models (EEMD-ANN and EEMD-MLR) together with the ANN and MLR methods. The EEMD-ANN and EEMD-MLR models improved on the ANN and MLR methods by 52.9% and 41.0%, respectively.

Advanced hybrid machine learning algorithms have also been implemented for river suspended sediment estimation as a form of evolutionary hydrological modeling. For instance, [40] evaluated a hybrid model that merges the support vector machine (SVM) with the whale optimization algorithm (SVM-WOA) and compared it with SVM-PSO (SVM with particle swarm optimization), conventional SVM, and RBFN models for estimating SSC at the Sundargarh and Salebhata stations on the Mahanadi River, India. SVM-WOA performed better than the SVM-PSO, SVM, and RBFN models for five different input scenarios. Roushangar et al. [36] modeled SSC and river discharge at two stations on the Mississippi River and improved the performance of the implemented models using wavelet transform (WT) and ensemble empirical mode decomposition (EEMD). Data processing with WT was more suitable than EEMD for enhancing the models' performance, and data processing improved performance by up to 15%. It was also found that, with the merged kernel extreme learning machine (KELM) method, data from the previous stations could be applied successfully for SSC and river discharge modeling when a station's own data were not available. Dang [10] developed coupled models of the discrete wavelet transform (DWT) with ANFIS, named DWT-ANFIS, and of principal component analysis (PCA) with ANFIS, named PCA-ANFIS, for SSC simulation. The coupled and single ANFIS models were trained and tested using long-term daily SSC and river discharge measured on the Schuylkill and Iowa Rivers in the United States. PCA-ANFIS performed better than both the single ANFIS and the coupled DWT-ANFIS. Mehri et al. [28] used four intelligent methods, ANFIS-PSO, ANFIS-GA, ANFIS, and the group method of data handling (GMDH), to estimate the sediment concentration distribution. Optimizing the ANFIS model with the GA and PSO approaches improved its efficiency, and ANFIS-PSO performed better than the ANFIS-GA, ANFIS, and GMDH models in estimating the suspended sediment distribution.

While the strong performance of RF and MLP has already been reported in the relevant literature, this study is designed to enhance the accuracy of the RF and MLP models for SSC modeling through hybridization with GA and SGD, developing novel GA-MLP, GA-RF, SGD-MLP, and SGD-RF models. The weakness of conventional regression formulas in accurately estimating suspended sediment has also been reported in the literature. To this end, this study attempts to estimate the suspended sediment concentration accurately using efficient methods such as MLP and RF and to improve the results using GA and SGD. Historical SSC and river discharge values from two stations were used for the modeling. To the best of our knowledge, hybrid SGD-MLP and SGD-RF models have not previously been used for SSC estimation.

2 Materials and methods

2.1 Study area

Daily discharge (Q) and SSC data of the Minnesota and San Joaquin Rivers (Fig. 1) for the period 2000–2017 were acquired from the United States Geological Survey (USGS). The Minnesota River, a tributary of the Mississippi River, is almost 534 km long and lies in the US state of Minnesota. The station name is Minnesota, with station number 05325000, basin area of 38,590 km², latitude of 44° 10′ 08″ N, and longitude of 94° 00′ 11″ W. The San Joaquin River is located in central California. The corresponding USGS station name is San Joaquin, with station number 11303500, basin area of 35,065 km², latitude of 37° 40′ 34″ N, and longitude of 121° 15′ 55″ W. Table 1 shows the statistical characteristics of the SSC and Q parameters at both the Minnesota and San Joaquin stations. Table 1 shows that both SSC and Q have skewed distributions.

Fig. 1 Geographical location of the Minnesota and San Joaquin Rivers

Table 1 Statistical features of the applied data

2.2 Multi-layer perceptron (MLP)

The multi-layer perceptron (MLP) is among the most common neural networks for accurate nonlinear fitting. To achieve high precision, an appropriate number of neurons and layers must be selected for its structure. The training process finds suitable weights for the connections between neurons. Backpropagation is a common training algorithm for feed-forward neural networks, in which the output of each layer serves as the input of the next [13]. The MLP comprises input, hidden, and output layers. Several algorithms can be used to train MLP networks, and the choice affects both the accuracy and the learning rate of the network [6]. Figure 2 illustrates the flowchart of the MLP model. The number of hidden layers and the number of neurons in each hidden layer should be chosen so that the model performs well. Within the MLP, the Levenberg–Marquardt (LM) algorithm is the one most commonly used for training. Several hidden layers can be adopted in the MLP design; however, one hidden layer provides satisfactory results in hydrological problems [38].

Fig. 2 Flowchart of the MLP model
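To make the layer-by-layer computation concrete, the following minimal sketch evaluates a one-hidden-layer MLP in which each layer's output feeds the next. The layer sizes, the random weights, the tanh hidden activation, and the linear output are illustrative assumptions, not the configuration used in this study, and no training step is shown.

```python
import math
import random

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer MLP: tanh hidden units followed by a linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

rng = random.Random(0)
n_inputs, n_hidden = 3, 4  # hypothetical sizes, e.g. three lagged inputs

# Untrained, randomly initialized weights (assumption for illustration only).
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b_out = 0.0

prediction = mlp_forward([0.2, 0.5, 0.1], w_hidden, b_hidden, w_out, b_out)
```

Training would adjust `w_hidden`, `b_hidden`, `w_out`, and `b_out` so that `prediction` approaches the observed target.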

2.3 Random forest (RF)

The random forest (RF) is an ensemble learning method proposed by Breiman [5]. To achieve better generalization, ensemble learning builds multiple base learners, here by combining several trees [31, 37]. Among rule generation approaches, RF is effective and practical. It uses the decision tree algorithm (classification and regression tree, CART) as its base learner. Notably, RF is more robust than other decision tree ensembles [9]. It can identify the relevant predictors, and unlike some other techniques it does not require re-scaling the data. A conventional regression tree performs poorly because it tends to over-fit the training data set. RF uses randomness to overcome this problem [42]. Each decision tree is built from a bootstrap sample of the calibration data set, which contains about two-thirds of the distinct samples. The remaining samples are treated as out-of-bag data. At each node, candidate variables are drawn at random and the best split is the one with the lowest Gini index. With each bootstrap repetition, values for the out-of-bag data are obtained. Each tree keeps growing until a predefined stop condition is reached. The following RF parameters are optimized on the calibration data set using the mean square error (MSE): confidence, number of trees, minimal leaf size, maximal depth, minimal size for split, subset ratio, and number of prepruning alternatives [12]. RF is well suited to the analysis of large data sets. Figure 3 illustrates the structure of the RF method.

Fig. 3 Flowchart of the RF model
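The bootstrap step described above can be illustrated with a short sketch. It draws one bootstrap sample and separates in-bag from out-of-bag indices; the sample size and seed are arbitrary assumptions. For large samples the expected share of distinct in-bag samples is 1 − 1/e ≈ 63%, which is the "about two-thirds" figure.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(42)
n_samples = 5000  # hypothetical size of the calibration data set
in_bag, out_of_bag = bootstrap_split(n_samples, rng)

# Share of distinct samples that ended up in the bootstrap sample.
unique_fraction = len(set(in_bag)) / n_samples
```

Each tree of the forest would be grown on its own `in_bag` draw, and its `out_of_bag` indices would serve to estimate the error of that tree.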

2.4 Genetic algorithm (GA)

Holland [18] and Goldberg [17] developed the genetic algorithm (GA). GA is a powerful heuristic for large-scale combinatorial optimization problems. GA encodes the problem as a set of strings made up of genes, and then modifies the strings to simulate the process of gradual evolution. Unlike local search algorithms, which work with a single candidate solution, GAs consider a population of individuals. Working with a population makes it possible to examine the main structures and characteristics of different individuals, which leads to the discovery of more efficient solutions [15]. In this study, GA retains the fittest strings and eliminates those that are less fit within the population. Each member of the population, which is an approximation of the final answer, is coded as a string of letters or numbers. These strings are called chromosomes. The most common representation uses the digits zero and one. Other representations, using ternary digits, real numbers, or integers, are also used. The values on the chromosomes have no specific meaning on their own; they must be decoded into decision variables to acquire meaning. It should be noted that the search is performed on the encoded information, unless the genes hold real values. Once the chromosomes have been decoded, the fitness of each member of the population can be calculated. Fitness is a relative measure that indicates how suitable an individual is for producing the next generation. In nature, fitness corresponds to an individual's ability to survive. The objective function plays a decisive role in determining the fitness of individuals. During reproduction, the fitness of each individual is determined from the raw values returned by the objective function. These values guide the selection process toward suitable individuals. The higher an individual's fitness relative to the population, the greater its chance of being selected; the lower its relative fitness, the less likely it is to be selected to produce the next generation. The crossover operator exchanges genetic information between two or more individuals. The simplest type is single-point crossover. After this step, a mutation operator is applied to the generated strings with some probability. In mutation, each individual can change on its own according to the laws of probability; in a binary string, mutation flips the value of one cell from zero to one or from one to zero. After the crossover and mutation steps, the chromosomes are decoded, the value of the objective function is calculated, and a fitness is assigned. If necessary, the selection and reproduction steps are performed again. During this process, the average fitness of the population is expected to increase. The algorithm ends when a specific criterion is met, for example when a certain number of generations has been created, when the fitness of the individuals reaches a certain level, or when a certain point is reached in the search space [4].
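The selection, crossover, and mutation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: binary chromosomes, a toy objective (`one_max`, the number of ones), roulette-wheel selection, single-point crossover, bit-flip mutation, a fixed number of generations as the stopping rule, and one elite individual copied unchanged into each new generation. None of the parameter values are those of this study.

```python
import random

def one_max(chromosome):
    """Hypothetical objective: number of ones in the string (to be maximized)."""
    return sum(chromosome)

def select(population, fitnesses, rng):
    """Fitness-proportional (roulette-wheel) selection of one parent."""
    weights = [f + 1e-9 for f in fitnesses]  # avoid an all-zero weight vector
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def run_ga(fitness, n_genes=20, pop_size=30, generations=40, rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []  # best fitness per generation
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        best = max(population, key=fitness)
        history.append(fitness(best))
        children = [best]  # elitism: keep the best individual unchanged
        while len(children) < pop_size:
            parent_a = select(population, fitnesses, rng)
            parent_b = select(population, fitnesses, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rate, rng))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = run_ga(one_max)
```

Because the elite individual is carried over, the best fitness in `history` never decreases from one generation to the next; in a hyperparameter search, `one_max` would be replaced by a function that decodes the chromosome and returns a model score.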

2.5 Stochastic gradient descent (SGD)

Stochastic gradient descent (SGD) is an iterative method for optimizing a differentiable function, called the target function, and is a stochastic approximation of the gradient descent (GD) method. In effect, SGD is an algorithm that finds, over several iterations, the minimum value of a function and the parameter values at which that minimum is attained. The difference between SGD and standard GD is that standard GD uses all the training data to optimize the target function, whereas SGD uses a randomly selected subset of the training data. This method has many applications in statistical and machine learning problems [34].

In machine learning, a common problem is to determine a function f of the data with one or more parameters and then to set these parameters so that the sum (or average) of the values of f over all data points is as small as possible. Assume a data set for which the function f depends only on the parameter \(\theta \); passing the i-th data point to f then yields a function of \(\theta \) denoted \({\vartheta }_{\mathrm{i}}(\theta )\). The problem thus reduces to finding a \(\theta \) that minimizes the following expression:

$$ \vartheta \left( \theta \right) = \frac{1}{n}\sum\limits_{i = 1}^{n} \vartheta_{{\text{i}}} \left( \theta \right) = E\left[ \vartheta_{{\text{i}}} \left( \theta \right) \right] $$
(1)

where \(\vartheta \left(\theta \right)\) is the target function. In many cases, the target function is simple enough that applying standard GD is neither intricate nor time consuming, and standard GD is used; an example is the one-parameter exponential family of functions used to appraise economic functions. However, because GD requires the gradient of the objective function in each loop, the calculation in each loop can be very time consuming and intricate when the target function has many parameters or the training data set is very large. For this reason, SGD is used, which performs this operation in each loop on only part of the training data set. In practice, each SGD loop does not operate on a single randomly selected member of the training data set but on a subset of it, for two reasons:

  1. Averaging over a subset reduces the dispersion of the parameter update in each loop, so convergence is more stable.

  2. The computation can use matrix operations, which execute very fast.
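The first reason can be checked numerically. The sketch below compares the spread of a single-sample gradient with that of a gradient averaged over a random subset, for a one-parameter least-squares term \(\vartheta_{\mathrm{i}}(\theta) = \tfrac{1}{2}(\theta x_{\mathrm{i}} - y_{\mathrm{i}})^{2}\). The synthetic data, the subset size of 32, and the number of trials are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(0)
n = 2000
xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]  # synthetic data, true slope 3

def sample_gradient(theta, i):
    """Gradient of the i-th term 0.5 * (theta * x_i - y_i)**2 w.r.t. theta."""
    return (theta * xs[i] - ys[i]) * xs[i]

def subset_gradient(theta, size):
    """Average gradient over a random subset of the training data."""
    subset = rng.sample(range(n), size)
    return statistics.fmean([sample_gradient(theta, i) for i in subset])

theta = 1.0
full_gradient = statistics.fmean([sample_gradient(theta, i) for i in range(n)])

single = [subset_gradient(theta, 1) for _ in range(500)]
mini = [subset_gradient(theta, 32) for _ in range(500)]
var_single = statistics.variance(single)  # spread of one-sample estimates
var_mini = statistics.variance(mini)      # spread of 32-sample estimates
```

Both estimators target `full_gradient`, but the subset average scatters far less around it, which is what stabilizes the parameter updates.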

2.6 GA-MLP

Determining the number of neurons in the hidden layers, the training cycles, learning rate, momentum, error epsilon, and local random seed is one of the complicated steps in MLP modeling. For this purpose, a hybrid algorithm combining GA and MLP was used to model the SSC in rivers. The process starts by selecting a random initial population in which each individual encodes a different number of hidden layer neurons. Then, the elite population containing the best individuals is selected. The model is run repeatedly; for each individual, the fitness function is calculated and stored. In the last stage, if the termination criterion is satisfied, the individual with the best fitness is saved. Otherwise, the process continues with a new population. The Levenberg–Marquardt algorithm is mostly used in the training stage of this process, but its results have a random component. Using GA protects the model against this problem and selects the best transfer function for the hidden and output layers. Figure 4 displays the flowchart of the GA-MLP method.

Fig. 4 Flowchart of the GA-MLP model

2.7 GA-RF

To improve the RF model’s efficiency and accuracy, the key parameters of the RF structure need to be optimized [35]. To achieve higher performance, Zhou et al. [51] applied a novel approach in which a small number of high-quality learners was used. According to [1], individual trees contribute unequally to the precision of a traditional RF; some of them make incorrect predictions and reduce the performance and efficiency of the model. Different strategies are used to increase the model’s accuracy, including hill climbing and greedy algorithms; however, they have deficiencies. For instance, a greedy method used to improve the RF model can become trapped in local optima. Therefore, GA is implemented to address this problem by choosing the subset of features that best improves RF performance. Consequently, the GA-optimized RF model is more accurate than the traditional one. Figure 5 shows the flowchart of the GA-RF method.

Fig. 5 Flowchart of the GA-RF model

2.8 SGD implementation on MLP and RF

In the general implementation of SGD, \(\theta \) is the vector that contains all the parameters of the cost function. First, \(\theta \) is set to an initial vector. Then, for each update of this vector, a member of the training data set is selected at random, and the gradient of the cost function at point \(\theta \), scaled by the rate \(\alpha \), is subtracted from \(\theta \):

$$ \theta = \theta - \alpha \nabla_{{\uptheta }} \vartheta_{{\text{i}}} \left( {\theta ;x^{{\left( {\text{i}} \right)}} ,y^{{\left( {\text{i}} \right)}} } \right) $$
(2)

where \(\vartheta \) is the cost function, \(({x}^{\left(\mathrm{i}\right)},{y}^{(\mathrm{i})})\) is a randomly selected member of the training data, and \({\vartheta }_{\mathrm{i}}(\theta ;{x}^{\left(\mathrm{i}\right)},{y}^{(\mathrm{i})})\) denotes the i-th term of the objective function. \(\alpha \) is the rate at which \(\theta \) is updated; its value is chosen empirically, since convergence is slow if it is too small and may not happen at all if it is too large [45].

In another implementation, a random member of the data set is not selected in each loop; instead, the whole data set is randomly rearranged once per loop, and the update is then performed in the order \({\vartheta }_{1}\), \({\vartheta }_{2}\), …, \({\vartheta }_{\mathrm{n}}\), where n is the size of the training data set. The following pseudocode describes this implementation:

  1. Give \(\theta \) and \(\alpha \) their initial values.

  2. Repeat steps 3 and 4 until the minimum is reached.

  3. Randomly shuffle the training data.

  4. For i from 1 to n: \(\theta = \theta - \alpha {\nabla }_{\theta }{\vartheta }_{\mathrm{i}}(\theta ;{x}^{\left(\mathrm{i}\right)},{y}^{(\mathrm{i})})\)
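The pseudocode above can be turned into a small runnable sketch. Here it fits a straight line by per-sample updates of Eq. (2) on the squared error; the noiseless synthetic data, the rate `alpha`, and the use of a fixed number of passes in place of "until the minimum is reached" are assumptions made for illustration.

```python
import random

def sgd_fit(xs, ys, alpha=0.05, epochs=100, seed=0):
    """Fit y ~ w * x + b with one SGD update per training sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # step 1: initial theta = (w, b)
    order = list(range(len(xs)))
    for _ in range(epochs):              # step 2: fixed number of passes (assumption)
        rng.shuffle(order)               # step 3: reshuffle the training data
        for i in order:                  # step 4: update on each sample in turn
            error = (w * xs[i] + b) - ys[i]
            w -= alpha * error * xs[i]   # gradient of 0.5 * error**2 w.r.t. w
            b -= alpha * error           # gradient of 0.5 * error**2 w.r.t. b
    return w, b

xs = [i / 10 for i in range(21)]         # 0.0, 0.1, ..., 2.0
ys = [2.0 * x + 1.0 for x in xs]         # noiseless line with slope 2, intercept 1
w, b = sgd_fit(xs, ys)
```

On this noiseless example the estimates approach the slope and intercept used to generate the data; with a much larger `alpha` the updates would overshoot and might fail to converge.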

Usually, the update is not performed on \(\vartheta \) for a single member of the training data set but for a subset of the data called a mini-batch. Figure 6 shows how GD works for single-input and dual-input functions.

Fig. 6 SGD operation for one-input and two-input functions

The SGD algorithm offers advanced options, including epochs, rho, L1 or L2 regularization, momentum training, adaptive learning rate, and rate annealing, that enable high prediction precision with both the MLP and RF models. When MLP results are optimized using SGD, the network can contain many hidden layers whose neurons use the tanh (hyperbolic tangent), rectifier (the maximum of (0, x), where x is the input value), maxout (the maximum coordinate of the input vector), or ExpRectifier (exponential rectifier) activation functions.
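The four activation functions named above can be written out as a brief sketch. The tanh, rectifier, and maxout definitions follow the descriptions in the text; the exponential rectifier is written in the common ELU form with a scale parameter `alpha` of 1, which is an assumption because the text does not give its formula.

```python
import math

def tanh(x):
    """Hyperbolic tangent activation."""
    return math.tanh(x)

def rectifier(x):
    """Rectifier: the maximum of (0, x)."""
    return max(0.0, x)

def maxout(inputs):
    """Maxout: the largest coordinate of the input vector."""
    return max(inputs)

def exp_rectifier(x, alpha=1.0):
    """Exponential rectifier (assumed ELU form): x if x > 0, else alpha * (exp(x) - 1)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

Unlike the rectifier, the exponential rectifier returns small negative values for negative inputs instead of exactly zero.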

When the adaptive learning rate is disabled, the size of the weight updates is set by the user-specified learning rate and is a function of the difference between the predicted value and the target value. That difference, commonly called delta, is available only at the output layer. Backpropagation is applied to correct the output at each hidden layer. The momentum is ramped up slowly, since excessive momentum can cause oscillation. The rate annealing, dropout rate annealing, and momentum training parameters take effect only if the adaptive learning rate is disabled.

2.9 Performance criteria

In this study, three performance indices, the correlation coefficient (CC), Willmott’s index of agreement (WI), and the scatter index (SI), are used to measure the models’ accuracy. The mathematical expressions of CC, SI, and WI are, respectively, as follows:

$$ {\text{CC}} = \frac{{\sum_{i = 1}^{n} {O_{{\text{i}}} P_{{\text{i}}} } - \frac{1}{n}\sum_{i = 1}^{n} {O_{{\text{i}}} } \sum_{i = 1}^{n} {P_{{\text{i}}} } }}{{\sqrt {\left( {\sum_{i = 1}^{n} {O_{{\text{i}}}^{2} } - \frac{1}{n}\left( {\sum_{i = 1}^{n} {O_{{\text{i}}} } } \right)^{2} } \right)\left( {\sum_{i = 1}^{n} {P_{{\text{i}}}^{2} } - \frac{1}{n}\left( {\sum_{i = 1}^{n} {P_{{\text{i}}} } } \right)^{2} } \right)} }} $$
(3)
$$ {\text{SI}} = \frac{{\sqrt {\frac{1}{n}\sum_{i = 1}^{n} {\left( {P_{{\text{i}}} - O_{{\text{i}}} } \right)^{2} } } }}{{\overline{O} }} $$
(4)
$$ {\text{WI}} = 1 - \left[ {\frac{{\sum_{i = 1}^{n} {\left( {O_{{\text{i}}} - P_{{\text{i}}} } \right)^{2} } }}{{\sum_{i = 1}^{n} {\left( {\left| {P_{{\text{i}}} - \overline{O} } \right| + \left| {O_{{\text{i}}} - \overline{O} } \right|} \right)^{2} } }}} \right] $$
(5)

in which \({P}_{\mathrm{i}}\) is the ith predicted value, \({O}_{\mathrm{i}}\) is the ith observed value, \(\overline{O}\) is the mean of the observed values, and n is the number of data points. CC measures the degree and the type (direct or inverse) of the linear relationship between two quantitative variables. It lies between −1 and 1 and equals zero when there is no relationship between the two variables. If the range of the data is large, the root mean square error (RMSE) will also be large, which does not by itself indicate that the model is inaccurate. For this reason, the SI is used in this study; it is obtained by dividing the RMSE by the mean of the test data. The closer the SI is to zero, the more accurate the model. WI is a standardized measure of the model prediction error that lies between zero and one: WI = 1 indicates perfect agreement and WI = 0 indicates no agreement. Because it uses squared differences, this index is also highly sensitive to extreme values [49]. Furthermore, Taylor diagrams were used for additional investigation of the performance of the hybrid GA-MLP, GA-RF, SGD-MLP, and SGD-RF models in SSC estimation [46].
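As a minimal sketch of how the three indices can be computed, the functions below implement CC, SI, and WI. CC is written in the standard Pearson form, with a square root over the product in the denominator, and WI uses the mean of the observed values. The sample values are hypothetical and are not taken from the study's data.

```python
import math

def correlation_coefficient(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    s_o, s_p = sum(obs), sum(pred)
    cov = sum(o * p for o, p in zip(obs, pred)) - s_o * s_p / n
    var_o = sum(o * o for o in obs) - s_o ** 2 / n
    var_p = sum(p * p for p in pred) - s_p ** 2 / n
    return cov / math.sqrt(var_o * var_p)

def scatter_index(obs, pred):
    """RMSE divided by the mean of the observed values."""
    n = len(obs)
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)
    return rmse / (sum(obs) / n)

def willmott_index(obs, pred):
    """Willmott's index of agreement (1 means perfect agreement)."""
    mean_o = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den

observed = [12.0, 30.0, 45.0, 80.0, 150.0]   # hypothetical SSC values
predicted = [15.0, 28.0, 50.0, 70.0, 140.0]  # hypothetical model output

cc = correlation_coefficient(observed, predicted)
si = scatter_index(observed, predicted)
wi = willmott_index(observed, predicted)
```

A perfect prediction gives CC = 1, SI = 0, and WI = 1, which is a convenient sanity check for any implementation of these indices.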

3 Results and discussion

In the current study, the MLP, RF, GA-MLP, GA-RF, SGD-MLP, and SGD-RF models with different input combinations were used to estimate the SSC at two stations, and their results were examined in terms of estimation accuracy. Because data-driven methods offer no direct rule for dividing the data into training and testing sets, different proportions have been used in the literature: Choubin [7] used 63% of the data for training, Qasem et al. [32] and Kargar et al. [19] used 67%, Dodangeh et al. [11], Asadi et al. [3], Shabani et al. [41], and Samadianfard et al. [39] used 70%, and Zounemat-Kermani et al. [53] used 80% of the whole data for model development. Therefore, for model development in this study, the data were split into training (70%) and testing (30%) sets. Accordingly, the period 2000–2012 was used to train the models and the 2013–2017 data were used as the test data set.

Table 2 provides the input parameters for each model, where the SSC and Q parameters are given at the current time (SSCt and Qt) and at previous daily lag times. Table 2 shows that six scenarios were considered for SSC estimation, using different input combinations of the SSCt-1, SSCt-2, SSCt-3, Qt, Qt-1, Qt-2, and Qt-3 parameters. Note that the scenarios were selected based on the autocorrelation of the SSC and Q variables. For predicting SSC with MLP and RF optimized by the SGD algorithm at the Minnesota and San Joaquin rivers, the adaptive learning rate was activated because it performed better, with the rate set to 0.004. Moreover, the values of epochs, rho, L1, and L2 were set to 10, 1, 0.000009, and 0, respectively. In addition, ExpRectifier was chosen as the activation function in the SGD-MLP model because it was more efficient than the other activation functions for predicting SSC. Tables 3 and 4 show the default and optimized parameters used in developing the RF and GA-RF models for estimating the SSC at the Minnesota and San Joaquin stations, respectively. Similarly, the parameters of the MLP and GA-MLP models are given in Tables 5 and 6 for the same stations.

Table 2 Implemented models and their input parameters
Table 3 Parameters of the RF and GA-RF models (Minnesota station)
Table 4 Parameters of the RF and GA-RF models (for San Joaquin station)
Table 5 Parameters of the MLP and GA-MLP models (for Minnesota station)
Table 6 Parameters of the MLP and GA-MLP models (for San Joaquin station)
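The lagged input combinations can be assembled from the daily series as in the following sketch. The function name, the short series, and the lag lists are illustrative assumptions; the example builds the inputs SSCt-1, Qt, and Qt-1 with SSCt as the target.

```python
def make_lagged_rows(ssc, q, ssc_lags, q_lags):
    """Build (inputs, target) pairs; a lag of 0 means the current day."""
    start = max(ssc_lags + q_lags)  # first day for which all lags exist
    rows = []
    for t in range(start, len(ssc)):
        inputs = [ssc[t - k] for k in ssc_lags] + [q[t - k] for k in q_lags]
        rows.append((inputs, ssc[t]))
    return rows

ssc = [10.0, 12.0, 15.0, 11.0, 9.0, 14.0]       # hypothetical daily SSC
q = [100.0, 120.0, 160.0, 130.0, 90.0, 150.0]   # hypothetical daily Q

# Inputs SSC(t-1), Q(t), Q(t-1); the target is SSC(t).
rows = make_lagged_rows(ssc, q, ssc_lags=[1], q_lags=[0, 1])
```

Other scenarios follow by changing the two lag lists, for example `ssc_lags=[1, 2]` and `q_lags=[0, 1, 2]`.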

Table 7 gives the results of the RF, GA-RF, and SGD-RF models at the Minnesota station. Several input combinations were considered for these models. Among the standalone RF models, RF-5 was the most accurate, with CC of 0.938, SI of 0.325, and WI of 0.968; among the hybrid RF models, GA-RF-5 and SGD-RF-4 were the most accurate, with CC of 0.944 and 0.943, respectively, SI of 0.308, and WI of 0.971. GA-RF-5 is more accurate than RF-5: GA reduced the SI by 5.2%. In GA-RF-5, the number of trees is 27, the maximal depth is 27, the confidence is 0.383, the minimal leaf size is 55, the minimal size for split is 51, the number of prepruning alternatives is 77, and the subset ratio is 0.9009 (Table 3). GA evidently plays a vital role as an optimizer in SSC estimation. At the Minnesota station, based on the results in Table 8, MLP-4 has CC of 0.948, SI of 0.296, and WI of 0.973, and GA-MLP-5 has CC of 0.950, SI of 0.290, and WI of 0.974. These two models are more accurate than the other models. With GA, GA-MLP-5 gives more accurate predictions; as shown in Table 8, GA decreases the SI by 2%. The GA-MLP-5 model has 77 training cycles, a learning rate of 0.1825, a momentum of 0.0774, an error epsilon of 0, and a local random seed of 77 (Table 5). Table 8 also shows that although SGD improved the prediction accuracy of the standalone MLP model, it was less capable than GA of reducing the prediction errors. By and large, at the Minnesota station, GA-MLP-5 is more accurate than the other optimized models. Among the SGD-MLP and SGD-RF models, SGD-MLP-6 with CC of 0.948, SI of 0.299, and WI of 0.973 and SGD-RF-5 with CC of 0.943, SI of 0.308, and WI of 0.971 were the most accurate.

Table 7 General results of the computations for the RF, GA-RF and SGD-RF models (Minnesota station)
Table 8 General results of the computations for the MLP, GA-MLP and SGD-MLP models (Minnesota station)

Table 9 displays the results of the RF, GA-RF, and SGD-RF models at the San Joaquin station. RF-4, with a CC of 0.884, SI of 0.366, and WI of 0.938, is the best of the RF combinations, and GA-RF-4, with a CC of 0.892, SI of 0.353, and WI of 0.942, performs best among the GA-RF models. GA-RF-4 has 27 trees with a maximal depth of 27, a confidence of 0.3826, a minimal leaf size of 55, a minimal size for split of 51, 0 prepruning alternatives, and a subset ratio of 0.9009 (Table 4). According to Table 9, the GA and SGD improved the precision of all standalone models. At the San Joaquin station, SGD-MLP-4 and SGD-RF-5 perform best, both with a CC of 0.901 and a WI of 0.941, and with SI of 0.335 and 0.339, respectively (Table 10). At this station, SGD reduced the SI by 6.9% and 8.6% relative to the RF-5 and MLP-4 models, respectively.
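For readers who wish to reproduce indices of the kind reported in Tables 7 to 10, the following is a minimal sketch of CC, SI, and WI. It assumes the commonly used definitions (CC as the Pearson correlation coefficient, SI as the RMSE divided by the mean of the observations, and WI as Willmott's index of agreement); the exact formulas used in the study may differ in detail, and the function names are illustrative.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def cc(obs, est):
    """Pearson correlation coefficient between observed and estimated values."""
    mo, me = _mean(obs), _mean(est)
    cov = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    se = math.sqrt(sum((e - me) ** 2 for e in est))
    return cov / (so * se)


def si(obs, est):
    """Scatter index: RMSE divided by the observed mean (assumed definition)."""
    rmse = math.sqrt(_mean([(o - e) ** 2 for o, e in zip(obs, est)]))
    return rmse / _mean(obs)


def wi(obs, est):
    """Willmott index of agreement (assumed definition)."""
    mo = _mean(obs)
    num = sum((o - e) ** 2 for o, e in zip(obs, est))
    den = sum((abs(e - mo) + abs(o - mo)) ** 2 for o, e in zip(obs, est))
    return 1.0 - num / den


# Hypothetical values for illustration only, not data from the study
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [2.0, 3.0, 4.0, 5.0]
print(cc(observed, estimated), si(observed, estimated), wi(observed, estimated))
```

A higher CC and WI and a lower SI indicate a better fit, which is the sense in which the models are ranked in the text.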

Table 9 General results of the computations for the RF, GA-RF, and SGD-RF models (San Joaquin station)
Table 10 General results of the computations for the MLP, GA-MLP and SGD-MLP models (San Joaquin station)

As seen in Figs. 7 and 8, the GA-RF-5 and GA-MLP-6 models estimated SSC more accurately than the other models at the Minnesota station, whereas at the San Joaquin station the SGD-RF-5 and SGD-MLP-4 models performed better. At both stations, the hybrid optimized models predicted the SSC values accurately, whereas the other models yielded poorer outcomes. Figs. 9 and 10 show scatter plots for the most accurate MLP, RF, GA-MLP, GA-RF, SGD-MLP, and SGD-RF models at the two studied stations. Overall, the combination of input parameters has no significant effect on the outcomes. For instance, the best model at the Minnesota station was GA-MLP-5, with SSCt-1, SSCt-2, Qt, Qt-1, and Qt-2 as input parameters, and the best model at the San Joaquin station was SGD-MLP-4, with SSCt-1, Qt, and Qt-1 as inputs.
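The input combinations named above are built from lagged values of SSC and discharge Q. The sketch below shows one way such input rows could be assembled from two daily series. The helper `make_lagged` and the toy series are hypothetical and are not taken from the study.

```python
def make_lagged(ssc, q, ssc_lags, q_lags):
    """Build (inputs, targets) for predicting SSC on day t.

    ssc_lags and q_lags are day offsets. An offset of 0 for discharge
    means the same-day value Qt. SSC lags must be at least 1 so that the
    target never appears among its own inputs.
    """
    if len(ssc) != len(q):
        raise ValueError("ssc and q must have the same length")
    if any(k < 1 for k in ssc_lags):
        raise ValueError("SSC lags must be at least 1")
    start = max(list(ssc_lags) + list(q_lags))
    rows, targets = [], []
    for t in range(start, len(ssc)):
        rows.append([ssc[t - k] for k in ssc_lags] + [q[t - k] for k in q_lags])
        targets.append(ssc[t])
    return rows, targets


# Hypothetical toy series, not data from the study
ssc = [10.0, 12.0, 15.0, 11.0, 9.0]
q = [100.0, 120.0, 160.0, 130.0, 90.0]

# Inputs SSCt-1, SSCt-2, Qt, Qt-1, Qt-2 (best combination at Minnesota)
X5, y5 = make_lagged(ssc, q, ssc_lags=[1, 2], q_lags=[0, 1, 2])
# Inputs SSCt-1, Qt, Qt-1 (best combination at San Joaquin)
X4, y4 = make_lagged(ssc, q, ssc_lags=[1], q_lags=[0, 1])
```

The first usable day is set by the largest lag, so the five-input combination yields one fewer training row than the three-input one on the same series.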

Fig. 7 Observed and estimated values with MLP, GA-MLP, SGD-MLP, RF, GA-RF, SGD-RF models at Minnesota station

Fig. 8 Observed and estimated SSC values with MLP, GA-MLP, SGD-MLP, RF, GA-RF, and SGD-RF models at San Joaquin station

Fig. 9 The scatter plots of observed and estimated SSC values of the most accurate MLP, GA-MLP, SGD-MLP, RF, GA-RF, SGD-RF models at Minnesota station

Fig. 10 The scatter plots of observed and estimated SSC values of the most accurate MLP, GA-MLP, SGD-MLP, RF, GA-RF, SGD-RF models at San Joaquin station

Additionally, Taylor diagrams are used to examine the standard deviation and correlation values of the observed and estimated SSC. Fig. 11 displays the RF-5, GA-RF-5, SGD-RF-5, MLP-4, GA-MLP-5, and SGD-MLP-6 models for the Minnesota station and the RF-4, GA-RF-4, SGD-RF-5, MLP-4, GA-MLP-4, and SGD-MLP-4 models for the San Joaquin station. The distance from the green point (the reference point) to each model point represents the centered root mean square error (RMSE). Consequently, the model whose point lies closest to the green point is the most precise [46]. According to Fig. 11, the red point (GA-MLP-5) is nearest to the reference point at the Minnesota station, and the light blue point (SGD-MLP-4) is nearest to it at the San Joaquin station; these models therefore provide the best SSC estimates.
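The geometric reading of the Taylor diagram follows from the law of cosines: the squared centered RMSE equals the sum of the two squared standard deviations minus twice their product times the correlation. The sketch below checks this identity numerically on hypothetical values; the function name `taylor_stats` is illustrative, and population (divide by n) statistics are assumed.

```python
import math


def taylor_stats(obs, est):
    """Return (std_obs, std_est, correlation, centered_rmse)."""
    n = len(obs)
    mo, me = sum(obs) / n, sum(est) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    se = math.sqrt(sum((e - me) ** 2 for e in est) / n)
    r = sum((o - mo) * (e - me) for o, e in zip(obs, est)) / (n * so * se)
    # Centered RMSE: both series have their means removed before differencing
    crmse = math.sqrt(
        sum(((e - me) - (o - mo)) ** 2 for o, e in zip(obs, est)) / n
    )
    return so, se, r, crmse


# Hypothetical values for illustration only, not data from the study
obs = [10.0, 12.0, 15.0, 11.0, 9.0]
est = [11.0, 12.5, 13.0, 10.0, 9.5]

so, se, r, crmse = taylor_stats(obs, est)
# Distance from the reference point on the diagram equals the centered RMSE
law_of_cosines = so ** 2 + se ** 2 - 2 * so * se * r
assert math.isclose(crmse ** 2, law_of_cosines)
```

Because the means are removed, the centered RMSE does not penalize a constant bias, so the Taylor diagram complements rather than replaces indices such as SI.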

Fig. 11 Taylor diagrams of the studied models at both stations

Sediment transport is a complicated process, and estimating SSC is a difficult and fundamental hydrological problem. Numerous conventional regression alternatives are available in the literature; however, applying them to rivers under different climate conditions is challenging. Over the past decade, runoff and suspended sediment load modeling has attracted interest in machine learning algorithms for developing robust models with high computational ability. This study therefore focused on the applicability of MLP and RF, hybridized with the GA and SGD methods, for SSC prediction. A variety of scenarios were implemented to find the best input combination from historical records. It was found that the SSCt-1, Qt, Qt-1 and SSCt-1, SSCt-2, Qt, Qt-1, Qt-2 combinations provided more accurate results, which means that records from the preceding one and two days are linked to the upcoming day's SSC value. The results obtained in this study showed that the GA-MLP, GA-RF, SGD-MLP, and SGD-RF models successfully estimated SSC in two different rivers. The present study could be extended by applying the suggested algorithms to other rivers with different climate conditions. In the current research, the GA and SGD algorithms were implemented to optimize the MLP and RF models. Future studies may consider alternative optimization algorithms for SSC modeling.
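To illustrate how a GA can act as an optimizer of model settings, the following is a minimal, generic sketch of a genetic search over integer parameters. It is not the implementation used in the study: the function `genetic_search`, the operators (tournament selection, uniform crossover, random-reset mutation, elitism), and the toy objective `toy_validation_error` are all assumptions made for illustration. In practice the objective would train an RF or MLP model with the candidate settings and return a validation error such as SI.

```python
import random


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Minimise `fitness` over integer genes, each within its (low, high) bounds."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.randint(lo, hi) for lo, hi in bounds]

    def mutate(ind):
        # Random-reset mutation: occasionally redraw a gene within its bounds
        return [rng.randint(lo, hi) if rng.random() < mutation_rate else g
                for g, (lo, hi) in zip(ind, bounds)]

    def crossover(a, b):
        # Uniform crossover: each gene comes from either parent
        return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

    def tournament(scored):
        # Keep the better of two randomly chosen individuals
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] <= b[0] else b[1]

    population = [random_individual() for _ in range(pop_size)]
    best_score, best = min((fitness(ind), ind) for ind in population)
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]
        gen_best = min(scored)
        if gen_best[0] < best_score:
            best_score, best = gen_best
        # Elitism: the best individual survives unchanged
        children = [gen_best[1]]
        while len(children) < pop_size:
            child = crossover(tournament(scored), tournament(scored))
            children.append(mutate(child))
        population = children
    final_score, final = min((fitness(ind), ind) for ind in population)
    if final_score < best_score:
        best_score, best = final_score, final
    return best, best_score


def toy_validation_error(params):
    """Hypothetical stand-in for a validation error; minimum at (40, 12)."""
    n_trees, max_depth = params
    return (n_trees - 40) ** 2 / 100 + (max_depth - 12) ** 2 / 10


# Hypothetical search ranges for number of trees and maximal depth
bounds = [(1, 100), (1, 30)]
best_params, best_error = genetic_search(toy_validation_error, bounds, seed=1)
```

Because the best individual is always carried forward, the best error found can never get worse from one generation to the next, which is the property that makes such a search usable as a drop-in tuner.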

4 Conclusions

The sedimentation problem is an essential issue in hydrological sciences because of its central role in river hydraulics. In this research, MLP and RF methods and their hybrid forms with GA and SGD were proposed to estimate SSC in the Minnesota and San Joaquin rivers. For each method, different combinations of input parameters were tested, and the performance of the models was examined to identify the best one. The results demonstrated that GA and SGD optimization increased the accuracy of the standalone models. Overall, GA-MLP and SGD-MLP were found to be robust techniques for modeling SSC according to the SI, CC, and WI indices. Moreover, the standalone models were less accurate in predicting SSC than their hybrid counterparts. In conclusion, these models can be used in water resources management and other fields of engineering with a high degree of confidence. In addition to the streamflow and suspended sediment load variables, incorporating further hydrological parameters into the modeling appears worthwhile for improving model credibility.