1 Introduction

River water is a principal source of irrigation for agriculture and also supports daily human needs such as drinking. Forecasting the future quality of river water with machine learning models is therefore essential. The Water Quality Index (WQI) of a river depends on a range of quality parameters, and many such parameters have been proposed in the literature. Researchers have accordingly combined different sets of parameters with various machine learning models to forecast river water quality, with promising results. One study used total dissolved solids, chlorophyll a, total suspended solids, turbidity, and blue-green algae phycocyanin with several machine learning models, including extreme learning machine regression, support vector machine regression, Gaussian process regression, linear regression, and partial least-squares regression, to predict these variables [1]. In another study, physicochemical parameters such as the concentrations of Ca²⁺, Mg²⁺, Na⁺, SO₄²⁻ and Cl⁻ were used as inputs to estimate the salinity of river water [2]. Researchers have used different predictive models, namely standalone machine learning (ML), deep learning (DL), and hybrid models, to forecast river water quality. The input data were monitored and obtained from each country’s research center and were collected at various intervals, such as hourly, every 4 h, daily, and monthly [3,4,5,6,7]. Previous work indicates that a more user-centric approach, with user-friendly tools and an interactive environment, is required to mitigate water quality issues [8]. Earlier studies also found no established method for identifying the best network structure for forecasting water quality parameters [9, 10].

Artificial intelligence approaches have been applied in many countries to forecast water quality parameters. ML models are the most commonly used, followed by hybrid models and DL models. Deep learning methods were less commonly used for prediction because they require a vast amount of data in the training stage; in other words, their performance is highly dependent on the amount of data. To evaluate the predictive models, performance indicators such as the coefficient of determination (R2), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), and Mean Absolute Error (MAE) were used. Comparisons between different models for predicting river water quality found that DL models performed better than ML models in research conducted in China [11]. However, other studies showed that hybrid machine learning models were more accurate [2, 12, 13] and sometimes outperformed deep learning models [14,15,16].

This study reviews the research carried out between 2009 and 2023 on forecasting water quality. A summary of the modelling approaches used in the respective studies is presented. The performance of the predictive models used for this purpose is compared and evaluated. Additionally, the input water parameters used to train and test the models are examined through the reported performance indicators. The limitations of current water quality prediction methods and directions for future research are highlighted. Moreover, this paper proposes the use of the deep-learning-based generative model known as Generative Adversarial Networks (GANs), which have not yet been employed for water quality prediction.

The contributions of this review are as follows:

  1. We present 83 recently published studies on water quality prediction from many countries.

  2. We explore methods that use various input parameters, such as chemical and meteorological inputs, to show the possible combinations of input parameters and their impact on water quality prediction.

  3. We describe the modelling algorithms used in the research articles, including machine learning, deep learning, and hybrid models, to highlight their advantages and drawbacks for modelling water quality outputs in forecasting tasks.

  4. We present the time-scale scenarios, such as hourly, daily, weekly, and monthly, that research articles commonly use to conduct experiments and analyze results.

  5. We describe the performance evaluation metrics (RMSE, MSE, MAE, R2) used in the reviewed studies to highlight their advantages and limitations.

2 Research Methodology and Literature Review

2.1 Research Methodology

Google Scholar was used in the preliminary step of this study to search for relevant scientific research articles. The results returned by the search engine were then filtered and analyzed according to their relevance to the keywords, which were “water quality” and any term equivalent in meaning to “prediction”. Only research articles that contained the keywords were considered. The search showed that many research works have been published in recent years. To the best of our knowledge, no comprehensive review has been published on water quality estimation. This paper therefore seeks an answer to the open question “What is the best network structure to predict the water quality parameters?” [9]. Hence, it is critical to analyze the most recent predictive models and algorithms, including data pre-processing and prediction.

A search equation was then defined to query Google Scholar for water quality prediction studies. Several combinations of keywords were applied to compose this search equation:

$$\left( {\text{A1 OR A2}} \right){\text{ AND }}\left( {\text{B1 OR B2 OR B3 OR B4}} \right)$$

where A1 and A2 are “river water quality” and “water quality index”, respectively, and B1, B2, B3, and B4 are “modelling”, “forecasting”, “prediction”, and “machine learning”, respectively. The research articles were selected from 2009 to 2023.

Figure 1 illustrates the process of filtering and selecting the articles for this review, where nos stands for number of studies. A total of 83 articles were selected from the 801 articles that matched the search equation. Of the remainder, 44 articles were rejected due to duplication, and 674 articles were excluded because they were not about water quality forecasting or their main findings were not relevant to water quality prediction.

Fig. 1 Flow chart of process of selecting articles

The articles in this review were selected to cover experiments focused specifically on water quality prediction. We found 83 research articles, as shown in Table 1 and Fig. 1. Most of these articles were published in the last 5 years, as shown in Fig. 2. We also selected these articles because they used various input parameters to predict water quality, as shown in Table 2, where the input parameters are divided into meteorological inputs and chemical inputs. The country in which each study was located is illustrated in Fig. 3. Furthermore, the articles were selected to cover numerous modelling algorithms such as traditional machine learning (ML), ensemble learning (EL), deep learning (DL), and hybrid models, as shown in Table 3 and Fig. 7. Finally, the selection also covers various performance metrics such as RMSE, MAE, and R2, with various output parameters predicted in several scenarios including hourly, daily, weekly, and monthly, as can be seen in Table 4.

Table 1 Ranking of selected journals
Fig. 2 Annual number of articles published

Table 2 Research works on water quality prediction
Fig. 3 Article frequency grouped by the country where the study was located

Table 3 The methods used in each research article
Table 4 Various scenarios for time scales of the WQI and the evaluation metrics

2.2 Literature Review

The selected articles focused on water quality prediction, the input parameters, and the performance indicators used to evaluate the results. These 83 research articles came from 46 journals. The number of articles selected per journal, from highest to lowest, is shown in Table 1. The Journal of Hydrology contributed the most articles (8), followed by Water (7) and Sustainability (5). Environmental Science and Pollution Research and Science of The Total Environment contributed 4 articles each, whereas the Journal of Environmental Management contributed 3. Complexity, Environment Pollution, IEEE Access, Marine Pollution Bulletin, Neural Computing and Applications, Water Research and Water Supply contributed 2 articles each, and the remaining journals contributed 1 article each. The year 2022 had the most reviewed articles, as shown in Fig. 2. A summary of the research on water quality prediction is given in Table 2. The table includes the location of each study, the data size (initial and end dates) used to train the predictive model, and the water quality input parameters used. These input parameters can be classified into 2 categories: chemical and meteorological.

Figure 3 shows the reviewed studies grouped by the location where the experiments were conducted. As can be seen, the majority of the research on water quality prediction was done in China and India.

Figure 3 presents a bar chart of the countries where the water quality forecasting studies were located. China ranked first, followed by India, Malaysia and Iran. The studies conducted in these 4 countries accounted for more than half of the reviewed articles. The remaining studies were spread over another 20 countries: Algeria, Australia, Bangladesh, Czech Republic, Germany, Ghana, Greece, Hong Kong, Iraq, Ireland, Italy, Kenya, South Korea, New Zealand, Pakistan, Spain, Taiwan, Turkey, USA, and Vietnam. Figure 4 shows the general framework usually found in the reviewed studies for water quality modelling, including the input parameters (chemical and meteorological), pre-processing techniques, and modelling algorithms.

Fig. 4 Water quality index predicting process

Pre-processing is an important stage before modelling. Water or river data usually contain missing values that result from sensor limitations, so identifying and handling these values is necessary to clean the data for further processing. In the literature, several statistical methods have been used to fill missing values. Additionally, some recorded values are unrealistic and far from their actual values. Such values are considered outliers and need to be detected at an early stage to avoid errors in the modelling process. Furthermore, the water input parameters do not share the same scale: some take large values while others are bounded. Scaling the input parameters can therefore speed up modelling and produce more robust results. Some features or input parameters are correlated with each other, and some play no role in the modelling process, so removing them can enhance the prediction. When a large number of parameters is used, selecting only a subset of them is a good way to improve prediction. Features can be engineered or learned, depending on the modelling algorithm. In conventional ML algorithms, for example, feature engineering is a well-known stage before modelling, whereas deep learning models aim to learn features automatically to improve the prediction. Applying general ML methods without pre-processing is a common cause of degraded prediction performance.
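
The pre-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the procedure of any reviewed study: median imputation, the interquartile-range rule, min-max scaling, and a Pearson-correlation filter are one common choice among many, and the function names, the 0.95 threshold, and the sample readings are all hypothetical.

```python
import math
import statistics


def fill_missing(values):
    """Replace missing readings (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Mark readings outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def min_max_scale(values):
    """Rescale readings linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant series."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Illustrative (made-up) readings, not data from any reviewed study.
turbidity = fill_missing([2.0, 2.5, None, 3.0, 2.2, 40.0])
outlier_flags = flag_outliers(turbidity)
scaled = min_max_scale([2.0, 4.0, 6.0])
selected = drop_redundant({
    "EC": [100.0, 200.0, 300.0, 400.0],
    "TDS": [64.0, 128.0, 192.0, 256.0],   # proportional to EC, hence redundant
    "pH": [7.1, 6.9, 7.4, 7.0],
})
```

In this toy example the gap in the turbidity series is filled with the median, the reading of 40.0 is flagged as an outlier, and the TDS column is dropped because it is perfectly correlated with EC.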

3 Classification of Studies

Numerous types of input data can be used to predict water quality indices. Each reviewed work used structured data that can be arranged in tabular form. Many of the reviewed publications used chemical inputs for prediction, some used meteorological data, and others combined chemical and meteorological inputs. The types of input data used to estimate the river water quality index in the research papers are shown in Fig. 5.

Fig. 5 Predictive variable variations

3.1 Chemical Inputs

Chemical inputs help characterize lakes, rivers, oceans, and even groundwater. They also indicate the maximum degree of pollution that a body of water can absorb without harming the aquatic ecosystem, its inhabitants, and anyone drinking the water. Examples of chemical parameters include pH, alkalinity, and chloride. Biochemical oxygen demand (BOD5), fluoride, salinity, manganese, potassium, calcium, iron, chemicals, sulphate, chloride, silica, magnesium, pH, phosphate, nitrate, and ammonium are among the 25 water quality factors used as inputs to the SVM and ANN models in [45]. These inputs were used to predict dissolved solids, total solids, and total suspended solids.

3.2 Meteorological Inputs

Meteorological inputs are parameters related to the study of the atmosphere and its phenomena, notably as a way of predicting the weather; examples include relative humidity, temperature, and solar radiation. Because it influences so many other aspects of weather, temperature is the single most influential factor in both meteorology and ecology. In one study, air temperature, humidity, sunshine, and rainfall were used together with a few chemical inputs such as pH, dissolved oxygen, turbidity and electrical conductivity. According to the authors, the list of potential predictors was narrowed down by prioritizing parameters that were field-measurable, had readily accessible statistical data, and had a substantial effect on water quality [48]. Another study predicted water quality characteristics from temperature, dissolved oxygen, pH, total phosphorus, turbidity, and trophic level (position of an organism in the food chain); electrical conductivity (EC), total dissolved solids (TDS), and discharge; and nutrient budget (balance between crop inputs and outputs) [28].

4 Water Quality Index Modeling Techniques

In this section, we discuss various water quality modelling techniques. The modelling methodologies used for water quality index (WQI) prediction are summarized in Fig. 6.

Fig. 6 Types of algorithms in predicting WQI

Because this review concerns the application of machine learning methods to water quality prediction, we targeted the techniques that can be used for water quality forecasting. These techniques are divided into traditional machine learning (ML), ensemble learning (EL), deep learning (DL), and hybrid models. ML methods [90] include decision tree (DT), k-nearest neighbor (KNN), multi-layer perceptron (MLP), support vector machine (SVM), multiple linear regression (MLR), and adaptive neuro-fuzzy inference system (ANFIS). To produce more powerful models, several models are combined under ensemble learning, for example random forest (RF), bagging, gradient boosting (GB), and stacking. Deep learning methods, on the other hand, have been found to produce superior performance when big data is available. They include the deep neural network (DNN), convolutional neural network (CNN), and recurrent neural network (RNN) variants such as long short-term memory (LSTM). Furthermore, hybrid models have been used to boost performance by combining various techniques. Algorithms that could not be readily classified into any of these groups were labelled as other (O). The modelling algorithms used in each research article are summarized in Table 3. Additionally, Fig. 7 shows how often each AI method was employed in research articles published between 2009 and 2023.

Fig. 7 Quantity analysis to show frequency of each algorithm used in reviewed papers

The selection of a method depends on factors such as data size (small to large), the complexity of the hidden patterns (easy to difficult to learn), and data type (spatial or temporal). With small datasets that have few features, traditional ML methods and ensemble learning usually perform well, provided the data contain learnable patterns. A larger number of features with more complex patterns motivates the use of DNN or CNN to learn features before mapping them to a prediction. Time series with sequential data call for RNN and LSTM to predict future values.

4.1 Multiple Linear Regression (MLR)

In ML, MLR stands out as one of the most basic and standard algorithms. The idea behind it is straightforward, and the method performs reliably. In one comparison, however, MLR had the lowest predictive accuracy of the models tested; a possible explanation is that inputs and outcomes were strongly related in a nonlinear fashion [83]. MLR models a system’s linear interactions and is frequently used as a standard against which other, non-linear models can be evaluated. In [1], for example, MLR was employed to provide a baseline against which other ML-based methods could be evaluated.
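
To make the baseline role of MLR concrete, the following sketch fits a multiple linear regression by solving the normal equations with Gaussian elimination. It is a minimal, dependency-free illustration; the function names and the toy data (generated from y = 1 + 2·x1 + 3·x2) are assumptions made for this example, not taken from the reviewed studies.

```python
def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk via the normal equations."""
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    p = len(rows[0])
    # Build A = X^T X and b = X^T y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


def predict_mlr(coef, x):
    """Evaluate the fitted linear model for one input vector."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


# Toy data generated from y = 1 + 2*x1 + 3*x2 (illustrative only).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [9.0, 8.0, 19.0, 18.0, 26.0]
coef = fit_mlr(X, y)
```

Because the toy targets are exactly linear in the inputs, the fitted coefficients recover the generating values up to rounding; with nonlinear relationships of the kind discussed above, the same fit would leave systematic residuals.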

4.2 Artificial Neural Network (ANN)

In one study, an artificial neural network was employed as a benchmark. The independent variables are multiplied by weights and added to a constant (bias) in the intermediate layer, and the result is passed on as the layer’s output. The hidden layer performs nonlinear processing of the data, while the output layer produces the learning outcomes [43]. In an ANN, the signal is transmitted in one direction, while errors are relayed in the other direction: the output error is “back propagated”, that is, sent layer by layer from the output through the hidden layer towards the input layer [83].
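
The forward signal and backward error flow described above can be illustrated with a very small network. The sketch below assumes one sigmoid hidden layer, a linear output, a squared-error loss, and plain gradient descent on a single sample; the weights, learning rate, and sample are arbitrary illustrative values, and real studies use larger networks and full datasets.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Hidden layer: weighted sum plus bias, then sigmoid. Output layer: linear."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sum(w * hj for w, hj in zip(W2, h)) + b2
    return h, y


def train(x, target, W1, b1, W2, b2, lr=0.1, steps=200):
    """Repeated backpropagation steps on one sample; returns the loss history."""
    losses = []
    for _ in range(steps):
        h, y = forward(x, W1, b1, W2, b2)
        err = y - target                      # dLoss/dy for Loss = 0.5 * err**2
        losses.append(0.5 * err ** 2)
        # Error relayed backwards: hidden-layer deltas use the current output weights.
        delta_h = [err * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for j in range(len(h)):
            W2[j] -= lr * err * h[j]
            b1[j] -= lr * delta_h[j]
            for i in range(len(x)):
                W1[j][i] -= lr * delta_h[j] * x[i]
        b2 -= lr * err
    return losses


# Arbitrary starting weights and a single made-up training sample.
losses = train(
    x=[1.0, 2.0], target=1.0,
    W1=[[0.1, -0.2], [0.3, 0.4]], b1=[0.0, 0.0],
    W2=[0.5, -0.5], b2=0.0,
)
```

The loss history decreases as the error is propagated back and the weights are adjusted, which is the behaviour the text attributes to back propagation.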

One of the neural networks’ main strengths is that they can simulate nonlinear relationships with little prior information about those relationships. Several studies advocated neural networks as a reliable method for estimating river water quality, and they anticipate future applications to enhance comprehension of contamination patterns in rivers [5]. For an intelligent early warning system, monitoring and predicting water quality metrics using machine learning models is essential. It is possible that the suggested optimization of hyperparameters in the ANN modelling approaches may result in adequate prediction accuracy for DO, but this might be enhanced by using additional AI models like Random Forest and Boosted Tree method [31].

One study built a model to capture the connections among the parameters and their dependency on each other, and in doing so successfully addressed the problem of missing variables; the most important input parameters were determined by a thorough sensitivity study [64]. A variety of MLP models were built and evaluated to find the optimal hidden-layer sizes and transfer functions. The complexity of an MLP grows with the number of hidden layers, which increases the number of connections and parameters in the artificial neural network (ANN). Like MLPs, RBF networks were used to model nonlinear data, but they were trained in a single stage rather than iteratively [17].

4.3 Support Vector Machine (SVM)

Among supervised machine learning methods, the support vector machine (SVM) family of algorithms is useful for both classification and regression [72]; although commonly employed for classification, SVMs can also be used for regression [91, 92]. To reduce misclassification, an SVM defines a hyperplane between the classes, treating the data points as projected into a feature space and maximizing the margin between them [21]. Based on the structural risk minimization principle, SVM can overcome the issue of overfitting. The SVM model’s estimates are derived from the support vectors, which are a small subset of the training data [43]. SVM can also handle multiclass classification. Its primary objective is to maximize the shortest distance from the hyperplane to the nearest example; in the multiclass case, more parameters and constraints are used to classify or forecast the classes effectively [65].
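
The margin-maximizing idea can be illustrated with a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss. This is a simplified stand-in chosen for brevity: the reviewed studies typically rely on library solvers and kernels, and the function names, hyperparameters, and two-class toy data below are assumptions made for this example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Full-batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]       # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:                  # point violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Linearly separable toy data with labels +1 / -1 (illustrative only).
X = [[2.0, 2.0], [-2.0, -2.0], [3.0, 3.0], [-3.0, -3.0], [2.0, 3.0], [-2.0, -3.0]]
y = [1, -1, 1, -1, 1, -1]
w, b = train_linear_svm(X, y)
predictions = [classify(w, b, x) for x in X]
```

Only the points that violate the margin contribute to the weight update, which mirrors the observation that the model is determined by a small subset of the training data.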

4.4 Adaptive Neuro-Fuzzy Inference System (ANFIS)

By fusing the power of neural networks with the flexibility of a fuzzy inference system (FIS), the neuro-fuzzy method can learn and adapt to new situations. Any real continuous function on a compact set can be approximated by an FIS to arbitrary precision. When constructing an ANFIS, it is also important to carefully pick the most suitable membership functions (MFs) [17]. In terms of accuracy and precision, the adaptive neuro-fuzzy inference system (ANFIS) performed admirably. The Takagi–Sugeno fuzzy inference system is the foundation for this artificial neural network implementation, and the model is among the most widely used in water analysis [54].

As an artificial intelligence model, ANFIS can function beyond the bounds of traditional fuzzy inference and ANN. Because it combines the strengths of ANN and fuzzy logic, the ANFIS model can deal with complicated non-linear interactions between input and output. In calculating the WQI, it fared better than the MLR model [6]. To map the input space to the desired output region, ANFIS applies neural network learning techniques and fuzzy reasoning across several layers of a feed-forward network. The wavelet de-noising technique ANFIS (WDT-ANFIS) model was introduced to reduce the impact of noise on data mining results, and it surpassed all the other models in accurately forecasting the water quality metrics [9].

4.5 Decision Tree (DT)

Due to its ease of use, DT has gained widespread popularity. A decision tree is a network of nodes and links (called “edges”) in which choices and their consequences are organized in a hierarchical framework [55]. Decision trees use this tree-like structure to create models for classification and regression. When a dataset is fed to the model, it is automatically split into progressively smaller subsets. One study using DT proposed a faster and cheaper methodology for calculating and forecasting WQIs [71]; its results demonstrate the capacity of the proposed prediction model to correctly forecast the WQI class.

The tree is constructed by splitting the input data into inner nodes, which may have descendants, and leaf nodes. Splitting terminates if the subset reaching a node has the same target output values, or if further splitting adds no new value to the forecast [65]. When it comes to classifying data, the M5 model tree was reported to outperform other decision tree models, and its emphasis on numerical values makes it more useful for benchmarking against other models [33]. The strengths of decision-tree-based models lie in their efficiency, versatility, and insensitivity to missing data or features. While other machine learning models may be faster, on the whole, decision-tree-based models excel in making short-term forecasts [43].
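
The splitting and stopping rules described above can be sketched for a regression tree on a single input feature. The squared-error split criterion, the depth limit, and the toy values are assumptions made for this illustration; the reviewed studies use multi-feature implementations and, in some cases, model trees such as M5.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, error) of the single split minimizing squared error."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                            # no threshold between equal inputs
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        error = sse([t for _, t in pairs[:i]]) + sse([t for _, t in pairs[i:]])
        if error < best[1]:
            best = (threshold, error)
    return best


def build_tree(x, y, depth=0, max_depth=2):
    """Inner nodes are (threshold, left, right) tuples; leaves are mean targets."""
    if depth >= max_depth or len(set(y)) == 1:  # identical targets: stop splitting
        return sum(y) / len(y)
    threshold, _ = best_split(x, y)
    if threshold is None:
        return sum(y) / len(y)
    left = [(v, t) for v, t in zip(x, y) if v <= threshold]
    right = [(v, t) for v, t in zip(x, y) if v > threshold]
    return (
        threshold,
        build_tree([v for v, _ in left], [t for _, t in left], depth + 1, max_depth),
        build_tree([v for v, _ in right], [t for _, t in right], depth + 1, max_depth),
    )


def predict(tree, v):
    """Follow the thresholds from the root down to a leaf."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if v <= threshold else right
    return tree


# Made-up one-feature data with two clearly separated regimes.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.0, 5.0, 9.0, 9.0, 9.0]
tree = build_tree(x, y)
```

For this toy input the tree needs a single split, after which both subsets have identical target values and the recursion stops, as in the termination rule above.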

4.6 Ensemble Model (EM)

Machine learning ensemble models were used to boost the accuracy of predictions. An ensemble may be built in two ways: independently or in a coordinated manner. Bagging and random forest (RF) are two examples of independent approaches, whereas gradient boosting (GB) models are an example of a coordinated approach [53]. Random forest was also used to address multiclass classification: several decision trees were combined into a forest, the forecast from each tree was aggregated, and the class with the most votes was taken as the forest’s output class. It is a quick and adaptable technique; however, it does have its limitations [65].
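
The aggregation step of independent ensembles can be sketched as follows. The helper names are hypothetical, the class labels and values are made up, and the base learners themselves are omitted: the sketch only shows bootstrap resampling of the training rows and the combination of member outputs by majority vote (classification) or averaging (regression).

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each bagged learner."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(class_predictions):
    """Return the class predicted by the largest number of ensemble members."""
    return Counter(class_predictions).most_common(1)[0][0]


def average(value_predictions):
    """Combine numeric predictions of the ensemble members by their mean."""
    return sum(value_predictions) / len(value_predictions)


rows = list(range(5))
sample = bootstrap_sample(rows, random.Random(0))

# Hypothetical outputs of three trees for one water sample.
wqi_class = majority_vote(["good", "poor", "good"])
wqi_value = average([60.0, 70.0, 80.0])
```

Each member would be trained on its own bootstrap sample; the vote or average then yields the ensemble output described above.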

Nonetheless, ensemble models based on decision trees, such as Random Forest (RF) and Gradient Boosting (GB), nearly always perform better than an individual decision tree [43]. While both RF and DNN provide extensive latitude to account for non-linear correlations between drivers and modelled parameters, this flexibility carries a risk of overfitting that increases as more drivers are included in the model. The authors of [56] compared how well the two models performed by gradually introducing new drivers and documenting the resulting change in performance. GB uses an additive model whose performance improves over successive iterations, and it can optimize differentiable loss functions [72].

The majority of current competitions employ this recent algorithm, which optimizes a differentiable loss function using an additive model [21]. The effectiveness of the various ML algorithms varies, to a greater or lesser extent, with the location in question. Consequently, it remains a challenge to investigate and design a generic ML model for water quality assessment applications [29].

4.7 Deep Learning (DL)

As a deep learning approach, the long short-term memory (LSTM) model is well-suited to predicting time-series data when the size of the time step is uncertain. In one study, a logistic sigmoid activation function was applied in the LSTM model, and the authors noted that this WQI forecasting approach was not widely used in the literature [66]. Deep learning networks, which function similarly to the neural network of a human brain, can reveal data relationships and hidden patterns through their multiple processing layers.
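
To show where the logistic sigmoid enters an LSTM, the sketch below implements one time step of a standard LSTM cell with scalar input and state. The parameter names (`wf`, `uf`, `bf`, and so on) and the zero-valued parameters in the usage example are assumptions chosen for readability; trained models use vector states and learned weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell with sigmoid gates and tanh candidate."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate state
    c = f * c_prev + i * g                                   # new cell state
    h = o * math.tanh(c)                                     # new hidden state
    return h, c


def run_sequence(xs, p):
    """Feed a time series through the cell, starting from a zero state."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, p)
    return h


# With all parameters at zero every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved at each step.
names = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]
zero_params = {name: 0.0 for name in names}
h, c = lstm_step(0.7, 0.0, 2.0, zero_params)
```

The forget and input gates decide how much of the previous cell state and of the new candidate are kept, which is what allows the cell to carry information across time steps.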

Another deep learning model is the convolutional neural network (CNN). Each neuron in a CNN is connected to a feature extracted from a lower neural layer. CNNs can reduce the number of computations required and help to prevent overfitting, and they have been implemented in several studies that analyzed the content of digital photographs [24]. To improve on the mediocre accuracy of previous approaches, one study developed a prediction model using LSTM deep neural networks, with water quality monitoring data used for training and testing [11].

4.8 Hybrid Techniques

Because some algorithms are constrained in their processing of stochastic data, experts often resort to hybrid modelling strategies. The prediction accuracy of a model may be greatly enhanced by combining two or more algorithms at various phases of the modelling process. In this article, we examined several hybrid machine learning and hybrid deep learning models that have been used for WQI prediction. In one study, with the long short-term memory (LSTM) model as a reference point, a transfer learning LSTM (TL-LSTM) model was used to generate deterministic point predictions when data were unavailable. The findings showed that the Multivariate Bayesian Uncertainty Processor (MBUP) strategy, which combines deep learning with post-processing, successfully identified the complicated dependency structure between the model’s output and the observed water quality [22].

Water quality characteristics were predicted using a three-part hybrid neural network model built from one-dimensional residual convolutional neural networks (1-DRCNN) and bi-directional gated recurrent units (BiGRU). The 1-DRCNN-BiGRU hybrid neural network outperformed the single reference deep learning models, better capturing the local change direction of the three predicted parameters and tracking the fluctuations of their real values [4]. Predicting water quality data using a hybrid model can be an effective option because combining several models makes it possible to capture more of the underlying patterns. In one study, the hybrid model outperformed both the Auto Regressive Integrated Moving Average (ARIMA) and the neural network models in terms of accuracy due to its superior recognition of time series patterns and nonlinear properties [39].

For predicting water quality, the attention-based LSTM (AT-LSTM) model performed better than the LSTM model [14]. Its fundamental premise is to improve prediction accuracy by minimizing the influence of unimportant factors and amplifying that of significant factors through adaptive weighting of components in the neural network’s hidden layers.
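
The adaptive weighting idea can be illustrated with a generic softmax attention step over a sequence of hidden states. This is a sketch of the general mechanism, not the AT-LSTM architecture of [14]: how the relevance scores are produced is model-specific, so here they are simply passed in as hypothetical values.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    top = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, scores):
    """Weight each hidden state by its softmax score and sum them."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


# Two made-up hidden states; equal scores give a plain average.
states = [[1.0, 0.0], [3.0, 2.0]]
uniform_weights, uniform_context = attend(states, [0.0, 0.0])
# A higher score for the second state shifts the weight towards it.
skewed_weights, skewed_context = attend(states, [0.0, 2.0])
```

States with higher scores dominate the resulting context vector, while states with low scores are suppressed, which corresponds to amplifying significant factors and damping unimportant ones.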

5 Performance Evaluation Metrics

WQI prediction requires validation data to evaluate the performance of the models. The data collected for analysis may range from minute-level intervals to records spanning more than a season. The time scale refers to the frequency at which the water parameter data are collected at the stations. The data may be taken daily, weekly, or monthly, depending on the design and purpose of the studies. Various inputs or independent variables may be used to estimate the water quality index. The most common inputs are temperature, dissolved oxygen, pH, turbidity and total phosphorus. To evaluate model performance under various conditions, the models can be designed with varying inputs. Typically, when identifying the best predictive models, the comparison should be based on the scenario in which all inputs were used.

Performance metrics are critical for determining how effectively the proposed models provide predictions that are comparable or close to the actual values. It is therefore important to select relevant performance indicators for model evaluation.

Several performance metrics are available to measure prediction performance in forecasting models. These include the coefficient of determination (R2), mean absolute error (MAE), root mean square error (RMSE) and the Nash–Sutcliffe efficiency coefficient (NSE). Most of the reviewed articles used R2, RMSE, and MAE. For deeper evaluation of performance, other metrics were also used, such as the global performance indicator (GPI), correlation factor (R), and Willmott index of agreement (WI).

The coefficient of determination (R2) typically takes a value between 0 and 1, where 1.0 indicates a perfect fit. R2 describes the relationship between the independent and dependent variables and measures how well a statistical model predicts an outcome. A limitation of R2 is that it cannot by itself indicate whether a regression model provides an adequate fit to the data; a good model may sometimes have a low R2 value. Additionally, it cannot reveal whether the data and predictions are biased.

$$R^{2} = 1 - \frac{RSS}{{TSS}}$$
(1)

where R2 = coefficient of determination, RSS = sum of squares of residuals, TSS = total sum of squares.

Mean absolute error (MAE) measures the average absolute difference between the model prediction and the target value. A lower MAE indicates a better model. MAE is a robust and unbiased estimator, which is useful if the training data contain outliers. Its limitations are that it is not differentiable at zero and that it is scale-dependent.

$${\text{MAE}} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left| {y_{i} - \widehat{{y_{i} }}} \right|}}{n}$$
(2)

where MAE is mean absolute error, \(y_{i}\) is the target value, \(\widehat{{y_{i} }}\) is the predicted value, and \(n\) is the number of samples.

Root mean square error (RMSE) is the square root of the average squared difference between the values predicted by a model and the actual values. Lower values of RMSE indicate a better fit. Unlike MSE, which is heavily inflated by large error values, RMSE is expressed in the units of the target variable and better reflects performance when large errors are present. A limitation of RMSE is that it is sensitive to outliers.

$${\text{RMSE}} = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \widehat{{y_{i} }}} \right)^{2} }}{n}}$$
(3)

where RMSE is root mean squared error.
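The sketch below implements Eq. (3) and compares RMSE with MAE on two hypothetical sets of predictions that share the same total absolute error. It is intended only to illustrate the sensitivity of RMSE to a single large error; all names and values are assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def mae(y_true, y_pred):
    """Mean absolute error, included here only for comparison."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


observed = [10.0, 10.0, 10.0, 10.0]
pred_even = [11.0, 11.0, 11.0, 11.0]     # four errors of size 1
pred_outlier = [10.0, 10.0, 10.0, 14.0]  # a single error of size 4

rmse_even = rmse(observed, pred_even)
rmse_outlier = rmse(observed, pred_outlier)
mae_even = mae(observed, pred_even)
mae_outlier = mae(observed, pred_outlier)

print(rmse_even, rmse_outlier, mae_even, mae_outlier)
```

Both prediction sets have the same MAE, but the RMSE of the set with one large error is twice as high, because squaring weights large errors more heavily.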

The Nash–Sutcliffe efficiency (NSE) coefficient is a reliable statistic used for assessing the goodness of fit and predictive skill of a model. It is equal to one minus the ratio of the error variance of the modelled time series to the variance of the observed time series. NSE ranges between −∞ and 1.0 (1 inclusive), with NSE = 1 being the optimal value. Values between 0.0 and 1.0 are generally viewed as acceptable levels of performance, whereas values < 0.0 indicate unacceptable performance.

$$NSE = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {q_{o,i} - q_{s,i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} \left( {q_{o,i} - \hat{q}} \right)^{2} }}$$

where NSE is the Nash–Sutcliffe efficiency coefficient, \(q_{o,i}\) is the observed value, \(q_{s,i}\) is the simulated value, \(\widehat{q}\) is the average of the observed values, and \(n\) is the number of samples.
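A minimal sketch of the NSE formula is given below. The function name and the series are hypothetical. The examples show three reference cases: a perfect simulation, a simulation that always returns the observed mean, and a simulation that performs worse than the observed mean.

```python
def nash_sutcliffe(observed, simulated):
    """NSE: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    error_ss = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    obs_ss = sum((o - mean_obs) ** 2 for o in observed)
    if obs_ss == 0:
        raise ValueError("NSE is undefined when observations are constant")
    return 1.0 - error_ss / obs_ss


# Hypothetical observed series, for illustration only
obs = [3.0, 5.0, 7.0, 9.0]

nse_perfect = nash_sutcliffe(obs, obs)                 # simulation equals observations
nse_mean = nash_sutcliffe(obs, [6.0, 6.0, 6.0, 6.0])   # always predicts the observed mean
nse_bad = nash_sutcliffe(obs, [9.0, 7.0, 5.0, 3.0])    # reversed series

print(nse_perfect, nse_mean, nse_bad)
```

An NSE of 0 therefore means the model is no better than using the mean of the observations as the prediction, which is why negative values are treated as unacceptable.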

Table 4 presents a summary of the scenarios considered in the reviewed studies. These scenarios relate to the time scale of the data used, such as hourly, daily, weekly, and monthly. In a daily scenario, for example, the data collected for each day are used to predict future values. Additionally, the table lists the performance indicators that the reviewed articles used to evaluate model prediction capability.
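As an illustration of how a finer time scale can be converted to a coarser one, the sketch below averages sub-daily readings into daily values. This is only one possible aggregation (a simple mean), assumed here for illustration; the reviewed studies may have prepared their data differently. The function name and readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_means(readings):
    """Average (timestamp, value) readings per calendar day."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[timestamp.date()].append(value)
    # Sort by day so the result follows chronological order
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Hypothetical sub-daily readings of a water quality parameter
hourly = [
    (datetime(2023, 1, 1, 0), 7.0),
    (datetime(2023, 1, 1, 12), 7.4),
    (datetime(2023, 1, 2, 0), 7.8),
    (datetime(2023, 1, 2, 12), 8.0),
]

daily = daily_means(hourly)
for day, value in daily.items():
    print(day, round(value, 2))
```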

6 Conclusion and Recommendations

The objective of this study was to assess the performance of the predictive models used in water quality prediction with different water parameters, based on the reported results and limitations. This paper reviewed 83 studies conducted between 2009 and 2023 that predicted the water quality index (WQI) using machine learning methods. In this review, we identified and categorized various types of modelling algorithms, input parameters, and outputs. It was found that machine learning techniques were effective in simulating and predicting the water quality index in many regions around the world. These methods found connections between the water quality index and hydrological and meteorological variables without knowledge of the physical characteristics of the modelled system. In other words, when it is difficult to design a knowledge-based model, machine learning techniques appear to be useful without the need to build physical models of the observed system. For a successful estimate of the water quality index, studies followed specific steps in the modelling process, such as data preprocessing, dividing the data into training, validation, and testing sets, and selecting suitable predictors.

Advances in modelling techniques have brought machine learning (ML) and hybrid models into water quality index forecasting. In this study, it was observed that hybrid models improved WQI estimation performance significantly. Additionally, since DL models have performed better than ML models in several studies, hybrid-DL methods may also show superior performance compared to hybrid-ML methods. However, since studies employing hybrid-DL models for WQI estimation were limited, this comparison was not made in this review.

Most of the studies were conducted in the Middle East and Asian countries. Therefore, we recommend more research on water quality index prediction for regions where the availability of surface water is limited, such as the African continent and parts of Europe and South America. Among the modelling techniques used in the reviewed works, ensemble learning methods were rarely applied, even though they are often among the most accurate methods.

When it is possible to collect large water quality datasets, more powerful deep learning models such as the convolutional neural network (CNN), long short-term memory (LSTM), and transformer can take the place of traditional ML methods and produce significant improvements in prediction performance. Recent DL methods, specifically the transformer [86], may open the door to capturing the temporal relations in the history of previously collected water quality samples in order to forecast future quality values with remarkable performance. The transformer uses an attention mechanism that allows the model to focus on specific samples in a data sequence by assigning different weights to the different samples applied at the input. This technique was found to outperform LSTM in several applications [87,88,89].
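The weighting idea behind attention can be sketched with scaled dot-product attention for a single query, as shown below. This is a simplified illustration and not the full transformer of [86]: a real model learns projection matrices for the queries, keys, and values, whereas here the vectors are hypothetical and fixed by hand.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: weighted sum of the value vectors
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Hypothetical encodings of three past samples and their measured values
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[7.0], [7.5], [8.0]]
query = [1.0, 0.0]

weights, context = attention(query, keys, values)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

Past samples whose keys align with the query receive larger weights, so they contribute more to the context vector used for the forecast.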

Generative Adversarial Networks (GANs) have been discussed and evaluated in several domains and were able to give better prediction results. However, research on using GANs for predicting the water quality index, and on comparing them with standalone and hybrid models, is still required. GANs may play a significant role in addressing the data-hungry nature of deep learning models by generating synthetic data. Synthetic samples generated by a GAN could alleviate problems related to the cost of data collection and the lack of data from which most applications suffer. GANs can increase the size of a dataset, which opens the door to using recent deep learning models such as CNN, LSTM, and transformers for water quality prediction. With GANs, it may also become easier to recover some of the missing values in the history of water quality data collected in previous years. However, a GAN requires a powerful machine to train and run, and its hyperparameters must be fine-tuned to reach the expected performance because GANs are extremely sensitive to hyperparameter settings. The conclusions drawn from this review can serve as guidance for future studies to enhance the performance of water quality prediction using GAN-generated data followed by the existing state-of-the-art methods.