Introduction

Measuring water quality parameters (WQPs) plays a significant role in environmental monitoring. Poor water quality negatively affects living organisms in aquatic ecosystems and water bodies such as natural rivers, lakes, dam reservoirs, and confined and unconfined aquifers. Several parameters that characterize the physical, chemical, and biological (or biochemical) properties of water are treated as WQPs in the literature; among the most widely used are chemical oxygen demand (COD), biochemical oxygen demand (BOD), phosphate, turbidity, and electrical conductivity.

Accurate measurement of WQPs is often time-consuming and requires procedures that are difficult to carry out. Measuring some of these WQPs, such as BOD and COD, is more complicated than measuring the others. To obtain a reasonable estimate of BOD concentration, two quantities must be determined: first, the amount of oxygen required to oxidize all the organic matter in a specific volume of water, and second, the amount of oxygen absorbed by other living organisms. Field and laboratory investigations may therefore yield inaccurate results when the volume of absorbed oxygen is not taken into account. Similarly, for COD measurement, the results of experimental studies are distorted in the presence of different inorganic radicals (Verma & Singh, 2013).

To circumvent these difficulties, many researchers have proposed estimating the hard-to-measure WQPs (for example, chemical oxygen demand, biochemical oxygen demand, and dissolved oxygen) as a function of other, more easily measured WQPs (for instance, nitrate, turbidity, electrical conductivity, and pH) instead of measuring them directly. To this end, different regression models have been considered. In previous investigations, artificial neural networks (ANNs) (Ay & Kisi, 2011; Emamgholizadeh et al., 2014; Singh et al., 2009), the adaptive neuro-fuzzy inference system (ANFIS) (Emamgholizadeh et al., 2014; Soltani et al., 2010), and support vector machines (SVMs) (Bozorg-Haddad et al., 2017; Li et al., 2017) have been employed frequently to estimate WQPs in various water bodies throughout the world. Furthermore, other techniques such as gene-expression programming (GEP), evolutionary polynomial regression (EPR), the M5 model tree (Najafzadeh et al., 2018), multivariate adaptive regression splines (MARS) (Heddam & Kisi, 2018; Najafzadeh & Ghaemi, 2019), and linear genetic programming (LGP) have proved efficient for the estimation of WQPs. In addition, wavelet decomposition techniques, locally weighted linear regression, and multigene genetic programming have recently been applied to estimate different WQPs in natural streams (Ahmadianfar et al., 2020; Jamei et al., 2020).

One of the most successful of these regression algorithms is support vector regression (SVR), a member of the support vector machine (SVM) family developed for regression analysis (Cortes & Vapnik, 1995; Smola & Schölkopf, 2004). This algorithm has been used widely and has proved efficient in different fields for parameter estimation and for time-series analysis and forecasting (Mukherjee et al., 1997; Niazmardi et al., 2013; Tuia et al., 2011; Wu et al., 2004; Yu et al., 2006). However, SVR suffers from the same shortcoming as the other members of the SVM family, i.e., its performance depends strongly on the proper selection of its kernel function (Abbasnejad et al., 2012). Selecting (or constructing) an optimal kernel function for a learning problem is not a straightforward task; thus, in recent decades, several kernel learning approaches have been proposed (Abbasnejad et al., 2012). Multiple-kernel learning (MKL) algorithms are one category of kernel learning approach that, owing to their sound theoretical background and outstanding results, have gained the attention of many researchers (e.g., Bucak et al., 2014; Gönen & Alpaydın, 2011; Niazmardi et al., 2016, 2018).

One of the main issues of the SVM algorithm is that its performance depends highly on the choice of kernel function and on fine-tuning the parameters of the selected kernel. During the last decade, several techniques have been proposed to assist this choice, among which the MKL framework is the most promising (e.g., Niazmardi et al., 2018; Qiu & Lane, 2005, 2009; Yeh et al., 2011). In other words, MKL algorithms can solve the kernel selection problem of kernel-based learning algorithms. Different kernel functions are combined using a combination function, which can be either linear or nonlinear, and the parameters of the combination function are estimated by solving the associated optimization problem. The algorithm proposed in this paper is among the MKL algorithms that can handle both types of combination functions.

MKL algorithms address the problem of selecting a proper kernel function by learning an optimal task-specific kernel through either a linear or a nonlinear combination of some precomputed kernels (Bucak et al., 2014). Most MKL algorithms have been proposed for classification purposes (Niazmardi et al., 2018), and only a few have been proposed for regression analysis (Gonen & Alpaydin, 2010; Qiu & Lane, 2005, 2009; Yeh et al., 2011). These few algorithms cannot learn nonlinear combinations of kernels and usually rely on complex optimization strategies. To address these issues, we propose the multiple-kernel support vector regression (MKSVR) algorithm for accurate estimation of WQPs. The MKSVR benefits from a flexible MKL structure that can learn both linear and nonlinear combinations of kernels for regression analysis. Besides, this algorithm uses the particle swarm optimization algorithm to optimize the combination of kernels, which makes its implementation very easy.

The rest of this paper is organized as follows. First, brief reviews of the theory underlying the SVR and MKL algorithms are presented. After that, the structure of the proposed MKSVR algorithm and its optimization strategy are described. Next, the water quality data and experimental setups are presented, followed by the results in terms of qualitative and quantitative performance. Conclusions are drawn in the final section.

Methodology

Regression and Support Vector Regression Algorithms

Suppose we are provided with a set \(T = \left\{ {{\mathbf{x}}_{i} ,y_{i} } \right\}_{i = 1}^{n}\) of \(n\) training samples \({\mathbf{x}}_{i} \in {\mathbb{R}}^{p}\), each of which is assigned a real-valued target \(y_{i} \in {\mathbb{R}}\). Assume that these samples are obtained by sampling an unknown function \(g:{\mathbb{R}}^{p} \to {\mathbb{R}}\). The main purpose of regression is to estimate a function \(f:{\mathbb{R}}^{p} \to {\mathbb{R}}\) that approximates the unknown function \(g\) using the set of training samples (Mukherjee et al., 1997); accordingly, regression is also known as function approximation in the literature.

Several mathematical methods have been proposed for solving regression problems. Among them, support vector regression has attracted great attention (Camps-Valls et al., 2006; Gunn, 1998; Niazmardi et al., 2013; Qiu & Lane, 2009; Smola & Schölkopf, 2004). The SVR approximates a linear function \(f({\mathbf{x}}) = {\mathbf{w}}^{T} \varphi ({\mathbf{x}}) + b\), in which \({\mathbf{w}}\) and \(b\) are regression parameters that should be estimated from the training data. In this function, also known as the prediction function or regressor, \(\varphi (\cdot)\) is a mapping from the original data space into the kernel space. The SVR algorithm estimates \({\mathbf{w}}\) and \(b\) by optimizing a loss function together with a regularization term (Mukherjee et al., 1997). The loss function penalizes the error of the estimated function, while the regularization term controls its flatness (Rojo-Álvarez et al., 2018).

One of the most widely used loss functions in the SVR structure is the ε-insensitive loss function proposed by Vapnik (2013). Its value \(L_{\varepsilon }\) for an error value \(e\) is calculated as:

$$L_{\varepsilon } (e) = \max (0,|e| - \varepsilon )$$
(1)
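As a minimal numeric illustration (not part of the original formulation), Eq. (1) can be evaluated directly; the tolerance value ε = 0.1 used below is an arbitrary example:

```python
# Minimal sketch of the epsilon-insensitive loss of Eq. (1); eps = 0.1 is an
# arbitrary example value, not one used in the study.
import numpy as np

def eps_insensitive_loss(e, eps=0.1):
    """Return max(0, |e| - eps) element-wise for an array of error values."""
    return np.maximum(0.0, np.abs(e) - eps)

# Errors within the eps-tube incur no penalty; larger errors are penalized linearly.
print(eps_insensitive_loss(np.array([-0.05, 0.08, 0.3])))  # -> [0.  0.  0.2]
```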

The flatness of the SVR regressor can be quantified by its norm \(\left\| {\mathbf{w}} \right\|^{2}\). Accordingly, the SVR optimization problem is written as:

$$\mathop {\min }\limits_{{{\mathbf{w}},b}} \left\{ {C\sum\limits_{i = 1}^{n} {L_{\varepsilon } (y_{i} - {\mathbf{w}}^{T} \varphi ({\mathbf{x}}_{i} ) - b) + \frac{1}{2}\left\| {\mathbf{w}} \right\|^{2} } } \right\}$$
(2)

where \(C\) is a positive real number (known as the trade-off parameter) that controls the trade-off between the flatness and the error of the estimated function. It can be shown that the minimization of Eq. (2) is equivalent to the following constrained optimization problem (Scholkopf & Smola, 2001):

$$\mathop {\min }\limits_{{{\mathbf{w}},b,\xi_{i} ,\xi_{i}^{*} }} \left\{ {C\sum\limits_{i = 1}^{N} {\left( {\xi_{i} + \xi_{i}^{*} } \right)} + \frac{1}{2}\left\| {\mathbf{w}} \right\|^{2} } \right\}$$
(3)

Constrained to:

$$y_{i} - \left\langle {w,\varphi ({\mathbf{x}}_{i} )} \right\rangle - b \le \varepsilon + \xi_{i} \;\forall i = 1, \ldots ,N$$
(4)
$$\left\langle {w,\varphi ({\mathbf{x}}_{i} )} \right\rangle + b - y_{i} \le \varepsilon + \xi_{i}^{*} \;\forall i = 1, \ldots ,N$$
(5)
$$\xi_{i} ,\xi_{i}^{*} \ge 0\;\forall i = 1, \ldots ,N.$$
(6)

In Eq. (3), \(\xi_{i}\) and \(\xi_{i}^{*}\) are positive slack variables used to cope with otherwise infeasible constraints (Smola & Schölkopf, 2004). With the aid of the Lagrange multiplier technique, Eq. (3) can be rewritten as the following dual problem:

$$\begin{gathered} \mathop {\max }\limits_{{\alpha ,\alpha^{*} }} \left\{ { - \frac{1}{2}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{N} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)\left( {\alpha_{j} - \alpha_{j}^{*} } \right)} } K({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) - \varepsilon \sum\limits_{i = 1}^{N} {\left( {\alpha_{i} + \alpha_{i}^{*} } \right)} + \sum\limits_{i = 1}^{N} {y_{i} \left( {\alpha_{i} - \alpha_{i}^{*} } \right)} } \right\} \hfill \\ \begin{array}{*{20}l} {{\text{subject}}\;{\text{to:}}} \hfill & {\sum\limits_{i = 1}^{N} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right) = 0} } \hfill \\ {} \hfill & {\alpha_{i} ,\alpha_{i}^{*} \in \left[ {0,C} \right]} \hfill \\ \end{array} \hfill \\ \end{gathered}$$
(7)

where \(\alpha\) and \(\alpha^{*}\) are the dual variables associated with the inequality constraints of Eqs. (4) and (5), respectively. This is a convex optimization problem that can be solved conveniently. After solving it and obtaining the optimum values of the dual variables, b is estimated from the Karush–Kuhn–Tucker conditions and the training samples, and \({\mathbf{w}}\) is calculated as:

$${\mathbf{w}} = \sum\limits_{i = 1}^{N} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)\varphi \left( {{\mathbf{x}}_{i} } \right)}$$
(8)

After computing b and \({\mathbf{w}}\), the target value y is estimated as:

$$y = \sum\limits_{i = 1}^{N} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)K({\mathbf{x}}_{i} ,{\mathbf{x}})} + b$$
(9)

where K is the kernel function. Apart from the parameters of the kernel, ε and C are the only open parameters of the SVR algorithm.
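For readers who wish to verify Eq. (9) numerically, the following sketch reproduces the SVR prediction from the dual variables; the use of scikit-learn's SVR as the solver of Eq. (7) and the synthetic data are assumptions of this illustration, not choices made in this paper:

```python
# Sketch: reproduce the prediction of Eq. (9) from the dual variables of a fitted SVR.
# Assumptions: scikit-learn's SVR as the solver of Eq. (7); an RBF kernel; synthetic data.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

gamma, C, eps = 1.0, 10.0, 0.1
svr = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=eps).fit(X, y)

x_new = rng.normal(size=(1, 3))
K = rbf_kernel(X[svr.support_], x_new, gamma=gamma)     # K(x_i, x) for the support vectors
# dual_coef_ holds (alpha_i - alpha_i^*) for the support vectors only
manual = (svr.dual_coef_ @ K).ravel() + svr.intercept_  # Eq. (9)
assert np.allclose(manual, svr.predict(x_new))
```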

Multiple-Kernel Learning

Choosing a kernel function and fine-tuning its parameters have a profound effect on the performance of kernel-based learning algorithms such as support vector regression (Yeh et al., 2011). There are several different kernel functions available for a specific learning task, from which the user should choose the best performing one without any prior knowledge about their performances. There is a wide range of kernel learning methods that are employed to either assist this choice or to estimate a valid kernel function from the available training data (Abbasnejad et al., 2012). The most promising category of these algorithms is the MKL algorithms.

MKL algorithms estimate a (sub-)optimal kernel function, known as the composite kernel, for a specific learning task by combining a group of precomputed basis kernels (Gönen & Alpaydın, 2011). The basis kernels are combined into the composite kernel by means of a parametric combination function. Thus, the main goal of MKL algorithms is to estimate the optimal values of the parameters of the combination function (Niazmardi et al., 2018), which they achieve by optimizing a target function with respect to these parameters. Although most MKL algorithms have been proposed for classification problems, their optimization techniques and combination functions can also be used for regression problems (Bucak et al., 2014; Gönen & Alpaydın, 2011; Kloft et al., 2011; Niazmardi et al., 2016). In the MKL literature, it is common practice to replace the kernel function of the dual problem of the kernel-based learning task (Eq. 7 in the case of the SVR algorithm) with the composite kernel and to treat the resulting problem as the target function of the MKL algorithm (Bucak et al., 2014). Following this strategy, the target function of the MKSVR can be written as the following min–max problem:

$$\begin{gathered} \mathop {\min }\limits_{\eta } \;\mathop {\max }\limits_{{\alpha ,\alpha^{*} }} \;\left\{ { - \frac{1}{2}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{N} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right)\left( {\alpha_{j} - \alpha_{j}^{*} } \right)} } K_{c} ({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) - \varepsilon \sum\limits_{i = 1}^{N} {\left( {\alpha_{i} + \alpha_{i}^{*} } \right)} + \sum\limits_{i = 1}^{N} {y_{i} \left( {\alpha_{i} - \alpha_{i}^{*} } \right)} } \right\} \hfill \\ \begin{array}{*{20}l} {{\text{subject}}\;{\text{to:}}} \hfill & {\sum\limits_{i = 1}^{N} {\left( {\alpha_{i} - \alpha_{i}^{*} } \right) = 0} } \hfill \\ {} \hfill & {\alpha_{i} ,\alpha_{i}^{*} \in \left[ {0,C} \right]} \hfill \\ {} \hfill & {\eta \in \Delta } \hfill \\ \end{array} \hfill \\ \end{gathered}$$
(10)

where \(\eta\) and \(\Delta\), respectively, denote the parameters of the combination function and their feasible set, and \(K_{c}\) denotes the composite kernel. An alternating optimization strategy is often used to solve this problem because optimizing Eq. (10) with respect to \(\eta\) is not a convex problem (Gönen & Alpaydın, 2011).

As mentioned previously, one of the most important characteristics of an MKL algorithm is how the basis kernels are combined into the composite kernel, which is controlled by the combination function. According to the linearity of this function, MKL algorithms can be categorized into two groups, linear and nonlinear (Niazmardi et al., 2016). The linear MKL algorithms apply the following function to the n available basis kernels \(K_{i} ,i = 1,...,n\) to construct the composite kernel:

$$K_{c} = \sum\limits_{i = 1}^{n} {d_{i} K_{i} }$$
(11)

where \(d_{i} ,i = 1, \ldots ,n\) are non-negative weights associated with the basis kernels, which should be optimized by using the MKL algorithm.
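Computationally, Eq. (11) amounts to a weighted sum of precomputed Gram matrices; a minimal sketch, assuming the basis kernels are available as numpy arrays, is given below:

```python
# Sketch of the linear combination of Eq. (11): the composite kernel is a
# non-negative weighted sum of the basis Gram matrices (assumed numpy arrays).
import numpy as np

def linear_composite(basis_kernels, d):
    """basis_kernels: list of n (N x N) Gram matrices; d: n non-negative weights."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0), "kernel weights must be non-negative"
    return sum(w * K for w, K in zip(d, basis_kernels))
```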

Several functions can be used as the combination function of nonlinear MKL algorithms. Among them, the polynomial function of degree \(d \ge 1\) is the most common (Cortes et al., 2009); it is expressed as:

$$K_{c} = \sum\limits_{q \in Q} {\mu_{{q_{1} \ldots q_{n} }} K_{1}^{{q_{1} }} K_{2}^{{q_{2} }} } \ldots K_{n}^{{q_{n} }}$$
(12)

where \(Q = \left\{ {q:q \in {\mathbb{Z}}_{ + }^{n} ,\sum\nolimits_{i = 1}^{n} {q_{i} \le d} } \right\}\) and \(\mu_{{q_{1} \ldots q_{n} }} \ge 0\). Because of its many open parameters, adopting Eq. (12) as the combination function leads to an optimization problem of high complexity. To reduce this complexity, the following combination function is often used instead (Cortes et al., 2009):

$$K_{c} = \sum\limits_{q \in R} {\mu_{1}^{{q_{1} }} \mu_{2}^{{q_{2} }} \ldots \mu_{n}^{{q_{n} }} } K_{1}^{{q_{1} }} K_{2}^{{q_{2} }} \ldots K_{n}^{{q_{n} }}$$
(13)

where \(R = \left\{ {q:q \in {\mathbb{Z}}_{ + }^{n} ,\sum\nolimits_{i = 1}^{n} {q_{i} = d} } \right\}\) and \(\mu_{i} \ge 0,\; i = 1, \ldots ,n\).
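In matrix terms, the products in Eq. (13) are element-wise (Hadamard) products of the basis Gram matrices; a sketch of this restricted combination, enumerating the exponent vectors in R, is given below:

```python
# Sketch of the restricted nonlinear combination of Eq. (13). The kernel products
# K_1^{q_1}...K_n^{q_n} are computed as element-wise (Hadamard) powers/products of
# the basis Gram matrices (assumed numpy arrays); mu holds the n weights of Eq. (13).
import numpy as np
from itertools import combinations_with_replacement

def polynomial_composite(basis_kernels, mu, d=2):
    """Every exponent vector q with sum(q) = d corresponds to picking d kernel
    indices with repetition, so R is enumerated via combinations with replacement."""
    Kc = np.zeros_like(basis_kernels[0])
    for idx in combinations_with_replacement(range(len(basis_kernels)), d):
        term = np.ones_like(basis_kernels[0])
        for i in idx:
            term = term * (mu[i] * basis_kernels[i])   # accumulates mu_i^{q_i} * K_i^{q_i}
        Kc += term
    return Kc
```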

According to the literature, quite a few linear and nonlinear MKL algorithms have been employed for data classification, adopting various target functions and optimization strategies, whereas few of these algorithms have been applied to regression problems. To the best of our knowledge, there is no general guideline for selecting fixed values of the SVR parameters; thus, these parameters are usually estimated using an n-fold cross-validation technique. Although this technique delivers acceptable performance, it is highly time-consuming. Besides, the user must provide a set of candidate values from which the optimum values of the SVR parameters are selected. These issues may yield sub-optimal parameter values. They can, however, be avoided through simultaneous optimization of the SVR parameters and the kernel combination parameters in the MKSVR algorithm. We refer the reader to Gönen and Alpaydın (2011), Bucak et al. (2014) and Niazmardi et al. (2018) for details of MKL algorithms.

Proposed MKSVR

The MKSVR algorithm must optimize Eq. (10) jointly with respect to the dual variables and the parameters of the considered combination function. This is usually achieved using an alternating optimization (AO) strategy. AO is a two-stage strategy that, at each stage, optimizes either the dual variables or the parameters of the combination function while keeping the other fixed, and it iterates until a termination criterion is met.

Optimizing the target function of the MKSVR with respect to the dual variables is a convex problem that can be solved easily. However, optimization with respect to the parameters of the combination function can be highly challenging because this problem is non-convex. The gradient descent method has been employed widely to solve it (e.g., Rakotomamonjy et al., 2008; Varma & Babu, 2009). However, gradient descent is an iterative method that requires several evaluations of the gradient of the target function at each step; adopting it therefore increases the computational complexity of the MKL algorithm.

To address this issue, we propose an AO framework that replaces the gradient descent method with the particle swarm optimization (PSO) algorithm. In the proposed optimization strategy, the parameters of the combination function are encoded as particles, whose search space is the feasible set of these parameters. To evaluate the fitness of each particle, a composite kernel is constructed by treating the particle's values as the parameters of the combination function. With this composite kernel adopted as the kernel function of Eq. (7), that optimization problem can be solved easily by a convex optimization method. After solving it, the prediction accuracy of the resulting regressor is evaluated using five-fold cross-validation, and this accuracy is taken as the fitness value of the particle.
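A minimal sketch of this fitness evaluation is given below; the use of scikit-learn's SVR with a precomputed kernel as the solver of Eq. (7), the linear combination of Eq. (11), and RMSE as the cross-validation score are illustrative assumptions rather than prescribed choices:

```python
# Sketch of the particle-fitness evaluation of the proposed strategy. Assumptions:
# scikit-learn's SVR (precomputed kernel) solves Eq. (7); the linear combination of
# Eq. (11) builds the composite kernel; the fitness is the mean five-fold CV RMSE.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def particle_fitness(particle, basis_kernels, y, n_folds=5):
    """particle = [d_1, ..., d_n, C, epsilon]; a lower fitness (RMSE) is better."""
    n = len(basis_kernels)
    weights, C, eps = particle[:n], particle[n], particle[n + 1]
    Kc = sum(w * K for w, K in zip(weights, basis_kernels))    # composite kernel, Eq. (11)
    errors = []
    for tr, te in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(y):
        svr = SVR(kernel="precomputed", C=C, epsilon=eps)
        svr.fit(Kc[np.ix_(tr, tr)], y[tr])                     # solve Eq. (7) on the fold
        pred = svr.predict(Kc[np.ix_(te, tr)])
        errors.append(np.sqrt(mean_squared_error(y[te], pred)))
    return float(np.mean(errors))                              # value the PSO minimizes
```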

The proposed optimization strategy of the MKSVR has several advantages over the gradient descent method. Firstly, it is very flexible and can work with any combination function. Secondly, it is cost-effective because the gradient never needs to be estimated or evaluated. Finally, in addition to the parameters of the combination function, the proposed strategy can optimize the SVR parameters (i.e., C and \(\varepsilon\)), which are otherwise usually estimated with time-consuming n-fold cross-validation. In this case, each particle consists of two separate parts: the parameters of the combination function and the SVR parameters (C and \(\varepsilon\)). Table 1 summarizes the optimization strategy of the MKSVR algorithm, and Fig. 1 illustrates the flowchart of its processing steps. It is noteworthy that the PSO algorithm was adopted because of its continuous search space and satisfactory performance (Sengupta et al., 2019); however, it can be replaced by other meta-heuristic optimization techniques.

Table 1 Optimization strategy of MKSVR algorithm
Figure 1
figure 1

Flowchart of MKSVR

Study Area, Dataset Description, and Experimental Setup

Study Area

We evaluated the performance of the proposed strategy for estimating WQPs of the Karun River in Khuzestan Province, Iran. This river drains the Bakhtiari area in the central Zagros Mountains, follows a tortuous course across the Khuzestan plain, and joins the Shatt al-Arab in Bousher before its final discharge into the Persian Gulf. The Karun River (Fig. 2), with a length of 829 km and a watershed area of 65,230 km2, is the longest river and the only navigable waterway in Iran. Several dams have been constructed on the Karun River, mainly for hydro-power generation and flood control; these dams play a key role in riverine issues such as land use, sediment transport, and water quality management. The Karun River is also the main source of water for several cities, the largest of which is Ahvaz with just above 1.3 million residents. Thus, assessment of the water quality of this river is of high practical importance.

Figure 2
figure 2

Drainage basin of Karun River (https://en.wikipedia.org/wiki/Karun)

Dataset Description

In this paper, 11 different WQPs were considered, namely BOD, COD, electrical conductivity (EC), sodium (Na+), calcium (Ca2+), magnesium (Mg2+), phosphate (PO43−), nitrite (NO2−), nitrate nitrogen (NO3−), turbidity, and pH. The WQPs were measured monthly at eight hydrometric stations along the Karun River between 1995 and 2011; the locations of these stations are shown in Fig. 3 and listed in Table 2. As mentioned, COD and BOD are harder to measure than the other WQPs, so these two parameters were taken as the target variables to be estimated from the other nine parameters. In the case of the Karun River, Emamgholizadeh et al. (2014) were the first to use these WQP data to estimate BOD and COD with ANN and ANFIS models. Najafzadeh et al. (2018) subsequently derived explicit formulations for predicting BOD and COD using EPR, MT, and GEP models, and Najafzadeh and Ghaemi (2019) recently applied MARS and SVM techniques, using an improved yet simple version of SVM; in contrast, the present research employs a newly developed version of SVM based on kernel learning. The main statistics of the WQPs used in this study are given in Table 3.

Figure 3
figure 3

Location of hydrometry stations along the Karun River

Table 2 Names and coordinates of the hydrometry stations
Table 3 Statistical properties of water quality parameters in the Karun River

According to Table 3, magnesium and calcium, at 60 mg/l and 58.4 mg/l, respectively, have the highest concentrations among the measured parameters. These two parameters also have relatively large standard deviations, indicating high dispersion in their concentration levels. Nitrite and nitrate nitrogen remained almost stable over the measurement period and accordingly show the smallest standard deviations among the parameters.

Experimental Setup

To assess the performance of the proposed MKSVR, we designed two different experiments. In the first experiment, the performance of the MKSVR itself was evaluated. The MKSVR algorithm with the proposed optimization strategy was implemented for both linear and nonlinear (second-degree polynomial) combination functions. The values of the trade-off and epsilon parameters were optimized along with the parameters of the combination function. For this algorithm, 19 different kernel functions were constructed as the basis kernels: nine radial basis function (RBF) kernels and 10 polynomial kernels, whose parameters were selected, respectively, from the ranges \(\left\{ {10^{ - 4} ,10^{ - 3} , \ldots ,10^{4} } \right\}\) and \(\left\{ {1,2, \ldots ,10} \right\}\). To run the MKSVR algorithm, the control parameters of the PSO algorithm, namely the swarm size, the number of iterations, the inertia weight, and the acceleration constants (usually denoted \(c_{1}\) and \(c_{2}\) in the literature), need to be set. Based on the suggestions of Shi and Eberhart (1998), Trelea (2003) and Bansal et al. (2011), the swarm size, the number of iterations, and the inertia weight were fixed at 20, 300, and 0.72, respectively, and both acceleration parameters were set to 2.
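For illustration, the 19 basis kernels described above can be precomputed as follows; interpreting the RBF parameter as scikit-learn's gamma is an assumption of this sketch:

```python
# Sketch of the 19 basis kernels used in the first experiment (assumption: the RBF
# parameter drawn from {1e-4, ..., 1e4} is interpreted as scikit-learn's gamma).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def build_basis_kernels(X):
    kernels = [rbf_kernel(X, gamma=g) for g in np.logspace(-4, 4, 9)]   # 9 RBF kernels
    kernels += [polynomial_kernel(X, degree=d) for d in range(1, 11)]   # 10 polynomial kernels
    return kernels                                                      # 19 Gram matrices
```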

In the second experiment, the results of the proposed method were compared with those obtained using other regression algorithms. In this experiment, random forest regression (RFR) and SVR were selected as benchmarks. The SVR algorithm was implemented with both the polynomial and the RBF kernel functions. Besides the kernel parameter (i.e., the spread of the RBF kernel or the degree of the polynomial), the SVR trade-off and epsilon parameters had to be set; here, a fivefold cross-validation strategy was used to tune them. The spread of the RBF kernel and the epsilon parameter were both selected from the range \(\left\{ {10^{ - 4} ,10^{ - 3} , \ldots ,10^{4} } \right\}\), the trade-off parameter from the range \(\left\{ {10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} } \right\}\), and the degree of the polynomial kernel from the range \(\left\{ {1,2, \ldots ,10} \right\}\).
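A sketch of this fivefold grid search for the benchmark SVR, assuming scikit-learn's GridSearchCV and the parameter ranges quoted above, is given below (X_train and y_train are hypothetical names for the training split):

```python
# Sketch of the five-fold parameter tuning of the benchmark SVR (assumption:
# scikit-learn's GridSearchCV; X_train and y_train denote the 75% training split).
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"kernel": ["rbf"], "gamma": np.logspace(-4, 4, 9),          # spread of the RBF kernel
     "epsilon": np.logspace(-4, 4, 9), "C": np.logspace(-3, 3, 7)},
    {"kernel": ["poly"], "degree": list(range(1, 11)),           # degree of the polynomial kernel
     "epsilon": np.logspace(-4, 4, 9), "C": np.logspace(-3, 3, 7)},
]
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="neg_root_mean_squared_error")
# search.fit(X_train, y_train); search.best_params_
```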

In both experiments, 75% of the dataset was used for training the algorithms and the remaining 25% for validation. The performance of the algorithms was measured by means of different validity measures, including the correlation coefficient (R), the root mean squared error (RMSE), and the mean absolute error (MAE), which are used frequently in the literature for evaluating environmental processes (e.g., Ahmadianfar et al., 2020; Jamei & Ahmadianfar, 2020; Jamei et al., 2020; Najafzadeh & Ghaemi, 2019; Pourrajab et al., 2020). In addition, some more recent statistical measures were used, namely the uncertainty at the 95% confidence level (denoted as U95), reliability, and resilience (Zhou et al., 2017).
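The three basic measures can be computed directly from the observed and estimated vectors; a short sketch, using hypothetical array names y_obs and y_est, is:

```python
# Sketch of the basic validity measures (R, RMSE, MAE); y_obs and y_est are
# hypothetical numpy arrays of observed and estimated values.
import numpy as np

def validity_measures(y_obs, y_est):
    r = np.corrcoef(y_obs, y_est)[0, 1]              # correlation coefficient R
    rmse = np.sqrt(np.mean((y_obs - y_est) ** 2))    # root mean squared error
    mae = np.mean(np.abs(y_obs - y_est))             # mean absolute error
    return r, rmse, mae
```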

Framework of Random Forest Regression

The RFR algorithm is based on an ensemble of tree-like structures and is capable of establishing a consistent mapping between a set of input and output variables (Jamei et al., 2021). In general, the RFR model grows a number of decision trees (DTs), each of which is learned from a bootstrapped sample of the input dataset, and its final output is calculated by averaging the predictions of these trees. According to Svetnik et al. (2003), RFR is built in three steps. First, the matrix X of training data with N samples is defined. Then, k samples are drawn randomly using the bootstrap resampling approach to generate k regression trees. At this stage, the probability P that a given sample is excluded from a bootstrap sample is calculated as (Jamei et al., 2021):

$$P = \left( {1 - \frac{1}{N}} \right)^{N}$$
(14)

Based on Eq. (14), as \(N\) tends to infinity this probability approaches approximately 37% of the training data; these samples, which are not drawn for a given tree, are known as the out-of-bag data and are used for the testing stage. In the second stage of RFR development, k unpruned regression trees (RTs) are grown from the k bootstrapped data samples. As a tree grows, at each internal node an input variable (attribute) is selected randomly from all input variables (A), and the minimum Gini index is used to measure the contribution of each attribute to the elements of the tree structure (i.e., nodes and leaves). In this way, the optimum input variable is chosen as the splitting variable to generate the hierarchy of branches. In the last phase of RFR development, the final model is composed of the k extracted regression trees. Two statistical measures, the coefficient of determination and the mean squared error, are commonly used to assess the accuracy of RFR.
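A minimal RFR baseline consistent with this description can be set up as follows; the use of scikit-learn's RandomForestRegressor and the value of 500 trees are assumptions of this sketch, not settings reported in the study:

```python
# Sketch of an RFR baseline. Assumptions: scikit-learn's RandomForestRegressor;
# k = 500 trees (illustrative value); the out-of-bag samples of Eq. (14), roughly
# 37% of the data per tree, serve as an internal validation set.
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(
    n_estimators=500,   # number k of bootstrapped regression trees (assumed value)
    oob_score=True,     # evaluate on the out-of-bag samples
    random_state=0,
)
# rfr.fit(X_train, y_train)
# rfr.oob_score_  # out-of-bag coefficient of determination (R^2)
```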

Results and Discussion

Results of the First Experiment

Table 4 presents the quantitative performance of the MKSVR algorithm developed by linear and nonlinear combination functions for the estimation of BOD and COD.

Table 4 Accuracy level of the MKSVR algorithms in the estimation of the WQP parameters

Analysis of the accuracy levels obtained with the MKSVR algorithm shows that it yielded acceptable performance in estimating both BOD and COD. However, marginally better performance was obtained with the nonlinear combination functions, because their flexibility lets them better model the underlying structure of the data. These results are in line with those of Cortes et al. (2009), where a nonlinear combination of kernels was used for classification. It should be noted, however, that linear combination functions are computationally less complex than nonlinear ones, so adopting them reduces the computational complexity of the algorithm.

The performance of MKL algorithms depends highly on the optimization strategy by which they find the optimal combination of kernels; the good performance of the MKSVR algorithm with the proposed strategy therefore substantiates the effectiveness of PSO as the optimization algorithm. Figures 4 and 5 show the qualitative performance of the MKSVR algorithm for BOD and COD, respectively. For measured BOD values between 5 and 10 mg/l, Fig. 4 indicates that some estimated values fell outside the ±25% allowable error range. As shown in Fig. 5, both MKSVR variants slightly over-predicted COD values between 2 and 5 mg/l, and for COD of 5–7 mg/l all models showed a remarkable over-estimation of some measured values.

Figure 4
figure 4

Scatter plot of observed BOD values versus estimated ones by MKSVR

Figure 5
figure 5

Scatter plot of observed COD values versus estimated ones by MKSVR

Comparison to Other Regression Algorithms

Table 5 summarizes the obtained accuracies of the regression algorithms used as comparison benchmarks.

Table 5 Performance of benchmark regression algorithms

As observed from the results, the RMSE values of the SVR algorithm for the estimation of COD and BOD with the polynomial kernel were 5.79 and 6.32, respectively; these values decreased to 4.85 and 5.97 when the RBF kernel was adopted. The better performance of the RBF kernel compared with the polynomial kernel stems from its greater ability to characterize the data; the RBF kernel also suffers from fewer numerical difficulties, making it a more appropriate choice for the SVR algorithm. Comparison of the MKSVR, RFR, and SVR algorithms shows that, in most cases, the MKSVR outperformed the other algorithms. For example, the RMSE of the MKSVR with the nonlinear combination function for estimating BOD was 4.76 mg/l, whereas the RMSEs of the SVR with the RBF kernel and of the RFR algorithm were 5.97 mg/l and 5.15 mg/l, respectively. The poorer performance of the SVR compared with the MKSVR is due to the fact that a single kernel cannot be guaranteed to yield the best model for the data under consideration, whereas different kernel functions provide different models of the data and their combination can lead to the best possible data model.

Figures 6 and 7 illustrate the qualitative performance of the SVR and RFR models for BOD and COD, respectively. In Fig. 6, for BOD of 5–10 mg/l, the two SVR models and the RFR technique showed relatively high over-estimation, with the corresponding points falling outside the error bound; for BOD of 25–40 mg/l, BOD was slightly under-estimated. In Fig. 7, for COD of 2.5–15 mg/l, all three models over-estimated COD to some extent, and the SVR with the polynomial kernel remarkably under-estimated COD between 25 and 35 mg/l.

Figure 6
figure 6

Scatter plot of observed BOD values versus estimated ones by SVR and RFR models

Figure 7
figure 7

Scatter plot of observed COD values versus estimated ones by SVR and RFR models

Conclusions

In this paper, regression analysis was used to estimate hard-to-measure water quality parameters, such as BOD and COD, as a function of other easily measured field parameters. To this end, various regression algorithms, including a newly improved SVR model, were considered. Because the performance of the SVR algorithm depends on the mathematical structure of its kernel function, a novel multiple-kernel support vector regression (MKSVR) algorithm was proposed to address the issues associated with kernel selection. This algorithm is capable of learning an optimal kernel through a linear or a nonlinear combination of some precomputed basis kernels. From this study, the following conclusions are drawn.

  • Using the SVR algorithm, both BOD and COD were estimated using other water quality parameters with acceptable accuracy.

  • The performance of the SVR algorithm was highly dependent on the kernel function and on fine-tuning its parameters. Additionally, the RBF kernel applied in the SVR algorithm yielded better results than those obtained by the second-order polynomial kernel and RFR model.

  • The MKSVR algorithm could increase the performance of the SVR algorithm for the estimation of BOD and COD. For this algorithm, the nonlinear combination functions yielded better performance than the linear ones.

For future studies, it is recommended to investigate the effect of the different setting parameters of the SVR and MKSVR models on the accuracy of BOD and COD estimation. Moreover, although a second-degree polynomial was used to construct the nonlinear combination of kernels in this research, higher-degree polynomial combinations are worth exploring in the design of the MKSVR algorithm.