Introduction

Poor surface water quality is a serious global problem that threatens human health, ecosystems, and plant and animal life. Water quality (WQ) is, therefore, a main concern in water resources, environmental systems and ecosystems. The term describes the chemical, physical, and biological characteristics of water in relation to a set of standards (Liou et al. 2004). WQ assessment can be used to evaluate water properties in reference to natural quality and human health effects (Fernández et al. 2004). It can be assessed by measuring a broad range of variables that represent the water pollution level. Hence, a robust mathematical technique is required to combine the physico-chemical characteristics of water into a single variable that describes the water quality. To this end, the water quality index (WQI) was developed as a single number that uses a set of physico-chemical water variables to describe the water quality at a certain place and time (Zandbergen and Hall 1998).

WQI is a unit-less number which reflects the status of water quality in wetlands, lakes, streams, rivers, and reservoirs. The concept of WQI is based on the comparison of water quality parameters with their respective regulatory standards (Khan et al. 2003). Several WQI equations are used in different countries such as the US, Canada, and Malaysia, and these were developed based on the standards of the US National Sanitation Foundation (Said et al. 2004). In 1974, the Department of Environment (DoE) Malaysia recommended an index to assess the quality of surface waters in Malaysia. In total, six parameters were chosen as the main water quality variables (WQVs) to develop the WQI for surface water, namely dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), ammoniacal nitrogen (AN), suspended solid (SS) and pH (DoE 2005; Khuan et al. 2002; Norhayati et al. 1997). These variables should be converted into non-dimensional parameters by sub-index functions. The conventional method recommended by DoE requires lengthy transformations to calculate the sub-indices. In addition, each sub-index is defined by several different equations, so estimating the final WQI takes considerable effort and time. Therefore, estimation of such a WQI is cumbersome and can lead to occasional mistakes (Gazzaz et al. 2012), and robust techniques can be employed to solve these problems (Mohammadpour et al. 2013a). Gene expression programming (GEP) and artificial neural networks (ANNs) can be suggested as alternative techniques for estimation of WQI, as both use the raw data instead of sub-indices.

In the last decade, GEP and genetic programming (GP) have been successfully used in water resources modelling (Azamathulla et al. 2010; Zakaria et al. 2010; Azamathulla and Ghani 2011). Furthermore, these methods have been recommended as significant tools for environmental and river engineering problems (Chen et al. 2008; Aras et al. 2007; Mohammadpour et al. 2013b; Ghani and Azamathulla 2014; Mohammadpour et al. 2015b). Vink and Schot (2002) developed GP for optimization of drinking water and compared its performance with the analytic solution of a series of hypothetical case studies. Hashmi et al. (2011) developed GEP for downscaling of watershed precipitation in Canada. Azamathulla (2012) applied GEP to predict scour depth downstream of sills. Ni et al. (2012) evaluated water storage in wetlands using the GP technique, and the results indicated that GP can be used to estimate water fluctuation in wetlands. Azamathulla and Ahmad (2012) used the GEP approach to predict the transverse mixing coefficient in open channel flows. Xu and Qin (2013) addressed agricultural water quality management using a combination of GA and fuzzy simulation. Orouji et al. (2013) investigated the performance of GP and ANFIS-GP to estimate water quality parameters. Different combinations of data sets were employed in their study, and the results showed that GP is superior to ANFIS for prediction of water quality parameters. Zaman Zad Ghavidel and Montaseri (2014) employed GEP and other artificial intelligence approaches to predict total dissolved solids in a river basin. A comparison of all selected approaches emphasized the superiority of GEP over the other intelligent methods.

Recently, many studies have reported the application of ANNs in fields such as water quality, wastewater treatment and other water resources problems (Singh et al. 2009; Civelekoglu et al. 2009; Verma and Singh 2013; Mohammadpour et al. 2013c, 2014b, 2015a). In the area of river management, ANNs have been used to simplify and speed up the calculation of the water quality index (Khuan et al. 2002; Juahir et al. 2004; Gazzaz et al. 2012). ANNs have also been employed to determine water quality parameters and simulate wetland processes (Wang et al. 2012; Kashefi Alasl et al. 2012; Li et al. 2013; Song et al. 2013). Schmid and Koskiaho (2006) developed ANNs to model concentrations of dissolved oxygen in free surface wetlands. They also used ANNs to estimate the relative influence of flow rate and wind shear on near-bottom oxygen saturation. The results indicated that ANNs were able to produce estimates of convective oxygen transport. Dadaser-Celik and Cengiz (2013) simulated the water level in wetlands using ANNs and found that the method can successfully predict water levels in wetlands. Karthikeyan et al. (2013) developed ANNs to predict ground water levels in the upland of a tropical coastal wetland with fairly accurate results.

The main objective of this research is to substantially reduce the time and effort required to calculate the WQI in free surface constructed wetlands. GEP and ANNs were employed as robust techniques to determine the WQI. Seventeen points in a wetland were monitored twice a month over a period of 14 months, and an extensive data set was collected for 11 water quality variables. A principal factor analysis (PFA) was used to determine and interpret the correlation between variables. To develop the GEP and ANN models, the significant variables were chosen using sensitivity analysis. Finally, the accuracy of each method was evaluated by comparing the obtained results.

Materials and methods

Study area

In this research, the free surface constructed wetland (FSCW) at Universiti Sains Malaysia (USM) in Penang, Malaysia, was chosen as a case study. The landscape area is about 320 hectares, and it is covered by oil palm plantation (Shaharuddin et al. 2013; Mohammadpour et al. 2014a). The wetland is located at latitude 5° 9′ 7.8294″ North and longitude 100° 29′ 53.1672″ East. The FSCW was designed based on the Stormwater Management Manual for Malaysia (Zakaria et al. 2003). Seventeen sampling points with different plant species and water depths were chosen to monitor the water quality. These points include the inlet, six stations in the macrophyte area (W1–W6), nine points in the micropool (MA1–MC3), and the outlet (Fig. 1). The points were chosen so that they cover the full range of plant species and water depths in the wetland (Table 1).

Fig. 1 Seventeen sample points in the constructed wetland of USM

Table 1 Plant species and the water depth in the USM wetland

The data were collected twice a month over a period of 14 months (from Oct. 2010 to Dec. 2011). In total, 11 water quality variables (WQVs) were measured in the wetland: dissolved oxygen (DO), pH, temperature, conductivity, suspended solid (SS), nitrite, nitrate, ammoniacal nitrogen (AN), chemical oxygen demand (COD), biochemical oxygen demand (BOD), and phosphate. Table 2 gives descriptive statistics of the collected data.

Table 2 Descriptive statistics of wetland parameters

The local water quality index

As mentioned earlier, to determine the WQI of surface water, the DoE (2005) recommended six variables: DO, BOD, COD, AN, SS and pH. These variables should be converted into non-dimensional variables using sub-index (SI) functions. Table 3 shows the functions used to estimate the sub-indices. In this table, X is the concentration in mg/L, except for pH and DO: for DO, X refers to the percentage of saturation, and for pH it refers to the pH value. Finally, the WQI can be calculated using the following equation (DoE 2005; Khuan et al. 2002):

$${\text{WQI}} = 0.22\,{\text{SI}}_{\text{DO}} + 0.19\,{\text{SI}}_{\text{BOD}} + 0.16\,{\text{SI}}_{\text{COD}} + 0.15\,{\text{SI}}_{\text{AN}} + 0.16\,{\text{SI}}_{\text{SS}} + 0.12\,{\text{SI}}_{\text{pH}}$$
(1)

where SI stands for sub-index.
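The weighted sum in Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the weights are those of Eq. (1), while the function name, the dictionary layout and the example sub-index values are assumptions introduced here. The sub-indices themselves would first have to be obtained from the functions in Table 3.

```python
# Weights taken from Eq. (1); names and layout are illustrative assumptions.
WQI_WEIGHTS = {
    "DO": 0.22, "BOD": 0.19, "COD": 0.16, "AN": 0.15, "SS": 0.16, "pH": 0.12,
}


def wqi_from_subindices(sub_indices):
    """Return the weighted sum of Eq. (1) for a dict of sub-index values."""
    missing = set(WQI_WEIGHTS) - set(sub_indices)
    if missing:
        raise ValueError(f"missing sub-indices: {sorted(missing)}")
    return sum(WQI_WEIGHTS[name] * sub_indices[name] for name in WQI_WEIGHTS)


# Hypothetical sub-index values, for illustration only (not from the study).
example = {"DO": 90.0, "BOD": 85.0, "COD": 80.0, "AN": 75.0, "SS": 95.0, "pH": 98.0}
wqi = wqi_from_subindices(example)
print(round(wqi, 2))
```

Because the six weights sum to one, a sample whose sub-indices all equal 100 yields a WQI of 100.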

Table 3 The sub-index equation for WQI in Malaysia (DoE 2005)

WQI is a unit-less number which varies between 0 and 100, where a high value of WQI represents high (good) water quality and a low value of WQI represents low (poor) water quality. Based on this index, the water quality can be classified into five classes. Table 4 shows the water quality classes suggested by the DoE.

Table 4 Water quality classes, WQI and water status (DoE 2005)

Principal factor analysis

In this study, principal factor analysis (PFA) was employed to determine the correlation between the variables and the WQI. This analysis also identifies insignificant variables. To prevent variables with large numerical values from dominating the PFA, the z-scale transformation was used to standardize the collected data set. The KMO (Kaiser–Meyer–Olkin) test and Bartlett’s test of sphericity were employed to evaluate sampling adequacy and to verify the applicability of PFA, respectively.
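A minimal sketch of the z-scale transformation is given below, assuming the usual definition (subtract the column mean, divide by the column standard deviation). The use of the sample standard deviation (`ddof=1`), the function name and the example rows are assumptions; the study does not state these details.

```python
import numpy as np


def z_standardize(data):
    """Column-wise z-scale: subtract the mean, divide by the standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)  # ddof=1 is an assumption, not stated in the text
    return (data - mean) / std


# Hypothetical rows of (pH, conductivity, SS); values are for illustration only.
raw = np.array([[6.5, 120.0, 14.0],
                [7.2, 310.0, 55.0],
                [8.1, 205.0, 30.0],
                [7.0, 150.0, 22.0]])
z = z_standardize(raw)
print(z.round(3))
```

After the transformation every column has zero mean and unit standard deviation, so a variable measured in large numbers (such as conductivity) no longer outweighs one measured in small numbers (such as pH).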

The PFA was applied to a matrix of 442 objects and twelve variables (the WQI and 11 WQVs). The KMO test produced a value of 0.822, which indicates that the amount of collected data is adequate. In addition, Bartlett’s test of sphericity, with an approximate chi-square of 3792.804 (p = 0.000 < 0.05 and df = 66), reveals that principal factor analysis can be used to explain the WQVs.
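As a consistency check on the reported degrees of freedom: Bartlett’s test of sphericity on m variables has m(m − 1)/2 degrees of freedom (the number of distinct off-diagonal correlations). The symbol m is introduced here for illustration only. With the twelve variables used in this analysis,

$${\text{df}} = \frac{m(m - 1)}{2} = \frac{12 \times 11}{2} = 66$$

which agrees with the value quoted for the test.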

As shown in Table 5, three factors with eigenvalues greater than or equal to one were extracted by the PFA. To estimate the effect of each variable in the PFA, Varimax rotation was employed to determine the rotated factor loadings. A factor loading less than 0.4 was regarded as weak (Lambrakis et al. 2004; Gazzaz et al. 2012). The strong and moderate loadings (bigger than 0.40) are shown in bold in Table 5.
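The two selection rules in this paragraph (retain factors with eigenvalue of at least one, and treat loadings below 0.4 as weak) can be sketched as follows. This is a simplified illustration that works on the eigenvalues of a correlation matrix and does not perform the factor extraction or the Varimax rotation. The function names and the small correlation matrix are hypothetical, and treating a loading of exactly 0.4 as salient is an assumption.

```python
import numpy as np


def kaiser_retained(corr):
    """Return the eigenvalues of a correlation matrix that are >= 1, largest first."""
    eigvals = np.linalg.eigvalsh(np.asarray(corr, dtype=float))  # ascending order
    eigvals = eigvals[::-1]
    return eigvals[eigvals >= 1.0]


def salient_loadings(loadings, threshold=0.4):
    """Boolean mask of loadings whose absolute value reaches the threshold."""
    return np.abs(np.asarray(loadings, dtype=float)) >= threshold


# Hypothetical correlation matrix with two correlated pairs of variables.
corr = np.array([[1.0, 0.9, 0.0, 0.0],
                 [0.9, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.6],
                 [0.0, 0.0, 0.6, 1.0]])
retained = kaiser_retained(corr)

# A few loadings quoted in the text of this section: 0.90, 0.82, 0.47, -0.69, -0.09.
mask = salient_loadings([0.90, 0.82, 0.47, -0.69, -0.09])
print(retained, mask)
```

The sign of a loading is ignored when judging its strength, which is why the WQI loading of −0.69 counts as salient while −0.09 does not.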

Table 5 Matrix of the weights for the principal components

Eight variables including the WQI are loaded on the first factor, which explains 49 % of the variation. The WQVs and their factor loadings are SS (0.85), nitrate (0.84), phosphate (0.84), AN (0.81), nitrite (0.81), BOD (0.79), COD (0.77), and WQI (−0.62). The negative factor loading for WQI indicates that the WQI increases as the values of the other variables in the first factor decrease. Among these variables, SS has the highest factor loading and is therefore a significant parameter for the WQI.

Suspended solids (SS) are solid materials, both organic and inorganic, that are suspended in the water. High concentrations of SS increase the amount of light absorbed by the water. In this condition, the water becomes warmer and loses its ability to hold oxygen. Aquatic plants also receive less light and therefore produce less oxygen by photosynthesis. The combination of less light, warmer water and less oxygen decreases the water quality.

The variables loaded on the second factor are pH (0.90), conductivity (0.82), DO (0.47) and WQI (−0.69). The high loadings of pH (0.90) and WQI (−0.69) illustrate that pH is another significant variable. In addition, the negative coefficient indicates that WQI decreases with increasing pH within the observed range of 6.11 to 9.19 (Table 2).

The third factor received its highest loadings from temperature (0.82) and DO (0.68), while the WQI is loaded on this factor with a very low value (−0.09). Because DO is already correlated with the WQI through the second factor, the low WQI loading here indicates that temperature alone has no effect on the WQI. This may be due to the low variation of wetland temperature in tropical areas, with minimum, maximum and average values of 27.3, 35.15 and 31.12, respectively (Table 2). Consequently, temperature is an insignificant variable for wetlands located in tropical areas.

Artificial neural networks (ANNs) methods

Artificial neural networks (ANNs) are computational models that attempt to represent and compute a mapping from one multivariate data set (the inputs) to another (the outputs). The neuron is the smallest unit of a neural network, and artificial neurons are arranged in a network-like structure. In this study, a feed forward back propagation (FFBP) neural network was used to predict WQI in the wetland. The network consists of neurons arranged in three layers (input, hidden and output) to approximate a multivariate function f(x). The number of neurons in the hidden layer can be determined by trial and error. The learning procedure searches for the weight vector that gives the best approximation of f(x). Firstly, a set of input data \(x_{1}, x_{2}, \ldots, x_{R}\) is fed to the input layer, and the output of each neuron can be determined from the following equation:

$$n = \sum {w_{ij}\, x_{i}} + b_{i}$$
(2)

where n is the neuron output, \(w_{ij}\) is the weight of the connection between the jth neuron in the present layer and the ith neuron in the previous layer, \(x_{i}\) is the neuron value in the previous layer and \(b_{i}\) is the bias. The sigmoid function can be used as a transfer function to generate the output of each neuron (Bateni et al. 2007), given by:

$$y_{i} = \frac{1}{{1 + e^{{ - C_{1} \left( {\sum {w_{ij}\, x_{i}} + b_{i}} \right)}} }},\quad C_{1} > 0$$
(3)
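The computation in Eqs. (2) and (3) for a single neuron can be sketched as follows. The function name, the default value of the sigmoid constant and the example weights are assumptions for illustration; they are not values from the trained network.

```python
import math


def neuron_output(weights, inputs, bias, c1=1.0):
    """Weighted sum (Eq. 2) passed through the sigmoid of Eq. (3); requires c1 > 0."""
    if c1 <= 0:
        raise ValueError("c1 must be positive")
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    n = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-c1 * n))


# Hypothetical weights, inputs and bias for a neuron with three inputs.
y = neuron_output([0.5, -0.25, 0.1], [1.0, 2.0, 3.0], bias=-0.3)
print(y)
```

The sigmoid maps any weighted sum into the interval (0, 1), and a larger value of the constant makes the transition around zero steeper.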

A comparison between the target value and obtained results was used to estimate network errors, while the back propagation algorithm corrects the weight between neurons. The back-propagation (BP) method is a descent algorithm, which tries to minimize the error at each iteration. The network weights are set by the algorithm such that the network error decreases along a descent direction (gradient descent). Generally two parameters, called momentum factor (MF) and learning rate (LR), are used to control the weight adjustment in the descent direction.
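The roles of the learning rate and momentum factor can be illustrated with the commonly used momentum form of gradient descent, in which each weight change is the negative gradient scaled by LR plus a fraction MF of the previous change. The sketch below assumes that form, since the exact update rule is not given here, and applies it to a hypothetical one-dimensional error function rather than to a network.

```python
def momentum_step(weight, gradient, prev_delta, learning_rate=0.1, momentum=0.9):
    """One update: step against the gradient plus a fraction of the previous step."""
    delta = -learning_rate * gradient + momentum * prev_delta
    return weight + delta, delta


# Toy problem (hypothetical): minimise E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, delta = 0.0, 0.0
for _ in range(300):
    w, delta = momentum_step(w, 2.0 * (w - 3.0), delta)
print(w)
```

In this toy case the weight oscillates around the minimum at w = 3 with shrinking amplitude and settles there. A larger LR takes bigger steps, while a larger MF carries more of the previous step forward.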

Sensitivity analysis using ANNs

In this study, ANNs were employed to reduce the number of independent variables for prediction of the WQI. The range of data used for the sensitivity analysis is shown in Table 6.

Table 6 Range of data for training and testing

A feed forward back propagation (FFBP) network was developed for the sensitivity analysis. The number of neurons in the input layer was determined by the number of input variables. Since the WQI was chosen as the network output, the output layer has one neuron. A single hidden layer was used, and the optimum number of neurons in this layer was found to be 5 by trial and error.

The leave-one-out method was used to assess the effect of each variable on the WQI. In this method, two indicators, the ratio of error and its rank, were estimated by removing one input variable at a time (Ha and Stenstrom 2003). The ratio of error is the error obtained after eliminating an individual variable divided by the error obtained using all variables. A high ratio indicates that the variable is important, and vice versa (Table 7).
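The leave-one-out procedure can be sketched as below. To keep the example short and deterministic, an ordinary least-squares fit is used as a stand-in for the ANN, and RMSE is used as the error measure; both choices, the function names and the synthetic data are assumptions for illustration and do not reproduce the study's models or Table 7.

```python
import numpy as np


def fit_rmse(X, y):
    """RMSE of an ordinary least-squares fit; a simple stand-in for the ANN."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


def leave_one_out_ratios(X, y, names):
    """Error ratio (one variable removed / all variables) and the resulting ranking."""
    base = fit_rmse(X, y)
    ratios = {}
    for j, name in enumerate(names):
        reduced = np.delete(X, j, axis=1)  # drop the j-th input variable
        ratios[name] = fit_rmse(reduced, y) / base
    ranked = sorted(ratios, key=ratios.get, reverse=True)  # most important first
    return ratios, ranked


# Synthetic data (hypothetical): y depends strongly on "a", weakly on "b", not on "c".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
ratios, ranked = leave_one_out_ratios(X, y, ["a", "b", "c"])
print(ranked)
```

A ratio close to one means that removing the variable barely changes the error, so the variable is a candidate for exclusion.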

Table 7 Sensitivity analysis using ANNs

A further analysis was conducted to determine the influence of the input variables on WQI. Table 8 compares ANN models with one of the independent variables removed in each case. As shown in this table, pH, COD, DO, AN and SS are significant variables (\(R^{2}\) = 0.9882, RMSE = 0.0179 and MAE = 0.0136) and have a non-negligible influence on WQI. These parameters were chosen to develop the GEP and ANN models in this study. Other parameters such as BOD, phosphate, nitrate, nitrite and conductivity do not have any significant effect on WQI and can be ignored.

Table 8 Sensitivity analysis using different variables

In light of these findings, pH, which has the highest rank, can be considered the main parameter for WQI in the wetlands (Table 7), although it carries the smallest weight of the six variables in the conventional WQI equation. This equation (Eq. 1) is intended for estimation of WQI in rivers, and the difference between the ranking of pH in Eq. (1) and in the present study may be due to the discharge of both point source and non-point source pollution loads to rivers, whereas the selected wetland is mainly polluted by non-point source pollution from storm water. This point can be considered when establishing a new WQI equation for wetlands and other water resources that receive non-point source pollution.

Development of GEP for water quality index

Gene expression programming (GEP) is a learning algorithm developed from genetic programming (GP) and genetic algorithms (GA). In each population, the chromosomes are generated randomly and evaluated using a fitness function. Mutation is an effective genetic operator for modifying chromosomes. The following steps were used to develop the GEP model.

In the first step, the size of the population was chosen equal to 30 as optimum size. Ferreira (2001) recommended a population size between 30 and 100 chromosomes as being able to provide an accurate result.

Secondly, the root relative squared error (RRSE) was chosen as fitness function in the GEP.

In the third step, a basic mathematical function (power), and four basic arithmetic operators (+, −, ×, /) were chosen to create chromosomes in each gene.

In the fourth step, the chromosome architecture was chosen based on the head length, the number of genes, and the tail length. The optimum result was obtained with a head length of seven and three genes per chromosome (Ferreira 2001; Mohammadpour et al. 2011, 2013b).

In the fifth step, both addition and multiplication operators were evaluated as linking functions, and the results showed that addition is more accurate. This function was employed to link the sub-expression trees of the genes in the GEP.

In the last step, the GEP operators (mutation, transposition, inversion, and cross-over) were employed to develop the GEP model.
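The fitness measure named in the second step can be sketched as follows, assuming the standard definition of the root relative squared error: the squared error of the model divided by the squared error of always predicting the mean of the observations, under a square root. The function name and the example values are hypothetical, and how the GEP software converts this error into a fitness score is not covered here.

```python
import math


def rrse(observed, predicted):
    """Root relative squared error: model error relative to a mean-only predictor."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    num = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return math.sqrt(num / den)


# Hypothetical observations and predictions, for illustration only.
print(rrse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```

A value of zero corresponds to a perfect fit, and a value of one means the candidate expression is no better than predicting the mean of the observed WQI.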

The performance of the GEP and ANNs was assessed using three statistical parameters: the coefficient of determination (\(R^{2}\)), the mean absolute error (MAE) and the root mean square error (RMSE). Expressions for these measures are given as follows:

$$R^{2} = 1 - \frac{{\sum\limits_{i = 1}^{n} {(O_{i} - P_{i} )^{2} } }}{{\sum\limits_{i = 1}^{n} {(O_{i} - \overline{O} )^{2} } }}$$
(4)
$${\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {O_{i} - P_{i} } \right|}$$
(5)
$${\text{RMSE}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {(O_{i} - P_{i} )^{2} } }}{n}}$$
(6)

where \(O_{i}\) is the observed value, \(P_{i}\) is the predicted value, \(\overline{O}\) is the average of the observed values and n is the number of samples.
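The three measures in Eqs. (4)–(6) can be computed with a short NumPy function. The function name and the example values are hypothetical and serve only to show the arithmetic.

```python
import numpy as np


def regression_metrics(observed, predicted):
    """Return R2, MAE and RMSE as defined in Eqs. (4)-(6)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    residual = o - p
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((o - o.mean()) ** 2)
    mae = np.mean(np.abs(residual))
    rmse = np.sqrt(np.mean(residual ** 2))
    return {"R2": float(r2), "MAE": float(mae), "RMSE": float(rmse)}


# Hypothetical observed and predicted WQI values, for illustration only.
observed = [80.0, 85.0, 90.0, 75.0]
predicted = [81.0, 84.0, 90.0, 77.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

MAE and RMSE carry the units of the predicted quantity, so they should only be compared between models evaluated on the same scale, whereas \(R^{2}\) is dimensionless.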

Results and discussion

The 442 data sets were divided randomly into training and testing subsets: 80 % (353 data sets) for training and 20 % (89 data sets) for testing (Table 6). Based on the sensitivity analysis, five main variables (pH, COD, DO, AN and SS) were employed to develop the ANN and GEP models.
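A random 80/20 split of this kind can be sketched as follows. The function name, the seed and the rounding-down of the training size are assumptions; the actual random assignment used in the study is not known.

```python
import numpy as np


def random_split(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and testing index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(train_fraction * n_samples)  # rounds down
    return order[:n_train], order[n_train:]


train_idx, test_idx = random_split(442)
print(len(train_idx), len(test_idx))
```

With 442 samples, rounding the 80 % share down gives the 353 training and 89 testing samples stated above.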

Figure 2 shows the architecture of the FFBP network with five neurons in the input layer and one neuron in the output layer. Based on trial and error, the ANNs-FFBP network trained for 2000 epochs provided better results than the other networks.

Fig. 2 Architecture of ANNs-FFBP for free constructed wetland

The ANNs was developed with a different number of neurons in the hidden layer to find ANNs with the best performance. To assess over-fitting of network (low training error but high test error), the root mean square error (RMSE) was employed as a criterion. As shown in Fig. 3, the RMSE decreases dramatically with increasing number of neurons in the hidden layer. Table 9 indicates the performance of ANN-FFBP with different neurons in the hidden layer. The testing data was assessed to find the optimum number of neurons in hidden layer.

Fig. 3 Variation of RMSE for training and testing data in terms of number of neurons

Table 9 Performance of ANN with different neurons in hidden layer

The best performance was obtained for the network with five neurons in the hidden layer. This ANNs-FFBP network predicts WQI in the wetland with high accuracy (\(R^{2}\) = 0.9887, RMSE = 0.0173 and MAE = 0.0130). Over-fitting was observed in the testing data for networks with more than five neurons.
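The selection rule described here (choose the hidden-layer size with the lowest testing RMSE, and flag sizes where the training error keeps falling while the testing error rises) can be sketched as below. The function names and all RMSE values are hypothetical placeholders, not the values of Table 9.

```python
def best_hidden_size(test_rmse_by_size):
    """Return the hidden-layer size with the lowest test RMSE (ties: smaller network)."""
    return min(sorted(test_rmse_by_size), key=lambda size: test_rmse_by_size[size])


def overfit_sizes(train_rmse, test_rmse):
    """Sizes whose training error beats the best model while their test error is worse."""
    best = best_hidden_size(test_rmse)
    return [s for s in sorted(test_rmse)
            if train_rmse[s] < train_rmse[best] and test_rmse[s] > test_rmse[best]]


# Hypothetical RMSE values per number of hidden neurons, for illustration only.
train = {3: 0.030, 4: 0.024, 5: 0.018, 6: 0.016, 7: 0.015}
test = {3: 0.031, 4: 0.026, 5: 0.019, 6: 0.022, 7: 0.025}
print(best_hidden_size(test), overfit_sizes(train, test))
```

Using the testing error rather than the training error as the criterion is what guards against choosing an over-fitted network.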

To evaluate the WQI, the GEP model has been developed using the same data set employed for the ANNs. The GEP expression tree is shown in Fig. 4. The simplified analytic form of the GEP model can be expressed as:

$${\text{WQI}} = \left[ {\left( {\frac{{8.5 + 0.85\,{\text{SS}}}}{\text{DO}}} \right)\left( {{\text{AN}} - 0.81} \right)\left( {\text{pH}} \right) + {\text{AN}} - 7.68} \right]\left( {\text{AN}} \right) - \frac{{{\text{DO}}^{2} - 7.63\,{\text{DO}}}}{\text{COD}} - 0.19\,{\text{COD}} - \left( {{\text{pH}} - 7.31} \right)^{2} + 96.63$$
(7)

Fig. 4 Expression trees for the GEP equation

This equation predicts WQI in constructed wetlands directly from five measured variables instead of sub-index variables. Therefore, it is more convenient and faster to apply than Eq. (1).
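Eq. (7) translates directly into a short function. The sketch below assumes the inputs are raw measurements in the same units as the collected data (Table 2); the function name and the sample values are hypothetical and are not taken from the study's data set or its Appendix.

```python
def gep_wqi(do, ss, an, ph, cod):
    """Evaluate Eq. (7) on raw (non-normalised) measurements."""
    if do == 0 or cod == 0:
        raise ValueError("DO and COD must be non-zero (they appear as divisors)")
    bracket = ((8.5 + 0.85 * ss) / do) * (an - 0.81) * ph + an - 7.68
    return (bracket * an
            - (do ** 2 - 7.63 * do) / cod
            - 0.19 * cod
            - (ph - 7.31) ** 2
            + 96.63)


# Hypothetical sample, for illustration only.
wqi = gep_wqi(do=5.0, ss=10.0, an=0.5, ph=7.0, cod=20.0)
print(round(wqi, 2))
```

Since the expression was fitted to the wetland data, it should only be evaluated for inputs within the range of those data.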

A comparison between predicted and observed WQI for both GEP and ANN-FFBP is shown in Fig. 5 and Table 10. It should be noted that the raw data set was used to develop the GEP (Fig. 5a), while the normalized data set was employed for prediction of WQI by the ANN-FFBP (Fig. 5b). The error measures of the two models are therefore on different scales, and \(R^{2}\) is the more direct basis for comparison. The prediction of the proposed GEP (\(R^{2}\) = 0.983, RMSE = 0.379 and MAE = 0.295) is comparable with that of the ANNs (\(R^{2}\) = 0.988, RMSE = 0.017 and MAE = 0.013). The results indicate that both GEP (Eq. 7) and ANN-FFBP can be used as reliable and precise methods within the range of the collected data (Table 2). Furthermore, these methods offer some advantages over the traditional method.

Fig. 5 Comparison between predicted and observed WQI using a GEP; b ANN-FFBP

Table 10 Statistical parameters to predicted WQI using the GEP and ANNs

Firstly, BOD is excluded from both the GEP and ANN models, which were developed using five variables. Therefore, fewer variables are required than in the traditional method, which needs six sub-indices. Furthermore, measurement of BOD requires significant time, cost and commitment. The BOD test is run in the dark at 20 °C for 5 days. The temperature is specified because the rate of oxygen consumption is temperature dependent, and the absence of light eliminates the possibility of photosynthesis. Determination of BOD is therefore very time-consuming in comparison with the other variables, and the recommended methods are more rapid and cost effective.

Secondly, the conventional equation recommended by DoE (2005) employs six sub-indices, so converting the six raw measurements into their sub-indices is cumbersome and time-consuming (Table 3). In addition, instead of using the original parameters, Eq. (1) is based entirely on sub-indices, which must be obtained from rating curves. In contrast, both the GEP and ANN approaches use the raw variables rather than the sub-indices, which leads to a direct prediction of the WQI. Most importantly, the GEP and ANN techniques are more direct, rapid, and convenient than the conventional method.

Thirdly, the proposed GEP (Eq. 7) is more practical than the ANNs, because raw data can be used in the equation without normalization. In comparison with the conventional equation, GEP is more direct, convenient, and rapid. An example is given in the Appendix to compare the calculation of WQI by the proposed and traditional methods. The WQI obtained by GEP (84.15) is close to the value obtained by the traditional equation (84.34). The results show that GEP is an accurate, simple and quick way to calculate WQI. In this sample, the water was classified as Class II, which covers WQI values between 76.5 and 92.7 (Table 4).

Accordingly, this research highlights that GEP and ANN-FFBP can be employed as valuable techniques for estimation of water quality in the FSCW. These methods simplify the calculation of the WQI and substantially reduce the time and effort required. These approaches are highly recommended for water quality assessment of any aquatic system in the world. This research should encourage researchers and managers to apply the GEP and ANN-FFBP methods as more direct and reliable alternatives to estimate water quality in wetlands and other water bodies.

Conclusions

In this study, GEP and ANN techniques were employed to estimate the WQI in free surface constructed wetlands. Seventeen points of the wetland were monitored twice a month over a period of 14 months, and an extensive data set was collected for 11 water quality variables. The PFA was employed to interpret the correlation between WQI and the other variables. This analysis indicated that WQI was greatly affected by pH and SS, while temperature had no significant effect on the WQI in tropical areas. A sensitivity analysis was carried out using ANNs to reduce the number of variables. Subsequently, five significant parameters (pH, COD, DO, AN and SS) were chosen to develop the GEP and ANN models. A high coefficient of determination (\(R^{2}\) = 0.983) and low error (MAE = 0.295) indicated that the GEP method was able to predict the WQI with high accuracy. The statistical parameters indicate that, although the ANN-FFBP with \(R^{2}\) = 0.988 and MAE = 0.013 produced better results than GEP, the GEP-based formula is more useful for practical purposes. This research highlights that GEP and ANN-FFBP can be employed as powerful and highly reliable methods to estimate water quality in wetlands and other water bodies. These two techniques are highly recommended for accurate, quick and cost effective water quality assessments for any aquatic system in the world.