Introduction

Unscientific utilization of water resources has resulted in its scarcity and contamination across different parts of the globe (Srinivas et al. 2018), especially in developing nations. Recent statistics indicate that riverine ecosystems of the largest rivers of the world such as Ganges (India), Mississippi (USA), Jordan (Israel), Sarno (Italy), and Yellow (China) are severely affected due to agricultural and domestic sewages, multi-purpose projects, dams, and industrial manufacturing processes (Srinivas et al. 2017; Dwivedi et al. 2018). Hence, planners and scientific bodies need a robust approach to evaluate the water quality (Tripathi and Singal 2019; Srinivas et al. 2020b).

Water Quality Index (WQI) has been extensively used to determine the appropriateness of water for various beneficial purposes (Shah and Joshi 2017; Shil et al. 2019). Zotou et al. (2020) applied 7 different WQI methods to examine their application in water bodies of the Mediterranean region. Conventional WQI tools and other techniques such as multiple linear and nonlinear regression and auto-regressive aggregated moving average are unable to (i) capture the nonlinear and non-stationary aspects of water quality; (ii) analyze the sensitivity of individual water quality parameters; (iii) assess uncertain and subjective behavior of parameters; (iv) consider complex inter-relationships among parameters; and (v) develop a relationship with climatic parameters (Melesse et al. 2011; Oladipo et al. 2021). To overcome these shortcomings, researchers have also used artificial intelligence (AI)-based approaches such as fuzzy logic (Jha et al. 2020; Srinivas et al. 2017; Srinivas and Singh 2018a), neural networks (García-Alba et al. 2019; Ucun Ozel et al. 2020), neuro-fuzzy models (Aghaarabi et al. 2017; Azad et al. 2019), genetic programming (Sotomayor et al. 2018), and different hybrid models (Yaseen et al. 2018; Lu and Ma 2020). Several models such as PTMApp, ACPF, and HSPF-SAM have been coupled with AI techniques to assess the impact of various pollutant sources on impaired waters of watersheds in the USA (Srinivas et al. 2020a). Machine learning (ML) models such as adaptive neuro-fuzzy inference system (ANFIS) and hybrid ANN models can capture the nonlinear patterns of water quality by considering each parameter’s sensitivity and subjective behavior to develop a robust WQI forecasting model (Tabbussum and Dar 2021). In addition, these models can also tackle problems such as high errors, outliers, noise, and missing data.

Kadam et al. (2019) projected the suitability of groundwater from the river basin Shivaganga to fulfill drinking purposes. Matta et al. (2020) classified spatio-temporal variation of water quality of the River Ganges (India) using WQI, multivariate statistical models, and principal component analysis (PCA). Azad et al. (2019) improved water quality classification and prediction by integrating artificial neural networks with nature-inspired algorithms like Cuckoo search, particle swarm optimization (PSO), and genetic algorithm. Juan et al. (2017) used ANN to simulate and forecast the variation of runoff in the Three-River Headwater Region (TRHR), Qinghai-Tibet Plateau. Al-Mukhtar and Al-Yaseen (2019) used ANN, ANFIS, and MLR to analyze electrical conductivity (EC) and total dissolved solids (TDS) in southern Iraq. Results established the supremacy of ANFIS among the three techniques. Fu et al. (2020) proposed an optimized wavelet de-noising ANFIS for estimating the water quality of wastewater. Pramanik and Panda (2009) compared ANN and ANFIS using daily data released from the Hirakud Reservoir, India. Sahu et al. (2011) demonstrated the capabilities of ANFIS to model actual and predicted water quality. They dealt with the data uncertainty and impreciseness by fuzzifying the parameter values.

Alizamir and Sobhanardakani (2018) used the ANN-PSO approach to predict heavy metals such as arsenic, copper, lead, and zinc in the groundwater of Toyserkan Plain (Hamedan Province, Iran). Chen and Liu (2015) simulated DO, phosphorus, chlorophyll ‘a’, and Secchi depth using ANFIS, radial basis function network, and a multiple linear regression model for Mingder reservoir (Taiwan). Aghel et al. (2019) modeled and predicted water quality parameters and achieved high computational speed and accuracy to predict unknown parameters using a hybrid PSO–neural-fuzzy technique. Azad et al. (2019) enhanced the performance of ANFIS using GA, differential evolution, and ant colony optimization for continuous domains and determined water quality parameters of river Gorganroud along with sensitivity analysis. Zanganeh (2020) employed PSO algorithms to address some of the limitations of ANFIS and enhanced its accuracy.

Rivers are the most vulnerable of all ecosystems to climate change’s effects. Hence, it is imperative to evaluate the role of climate variability on water quality (Avand and Moradi 2020). Variations in air temperature and rainfall are expected to have a significant impact on river flows, and as a result, they impact the mobility and dilution of pollutants as well (Srinivas et al. 2020c). Increased water temperatures can have an impact on chemical kinetics, which would impair the freshwater ecology and overall water quality (Mujere and Moyce 2018; Laanaya et al. 2017). Precipitation affects river flows, and as a result, sediment loads would vary, potentially altering river morphology and sediment transfer to water bodies, affecting freshwater ecosystems (Stryker et al. 2018). Avand and Moradi (2020) used integrated machine learning models with remote sensing to explore the impact of climate change and land use on water quality and flood probability. Hence, studying the uncertain behavior of climate parameters on river water quality is essential.

Recently, cloud computing is increasingly becoming popular to perform water quality analysis due to the increase in the computational requirements of analytical models such as machine learning (Geetha and Gouthami 2016; Sagan et al. 2020). The computational requirements may not necessarily be satisfied by locally available compute resources. To overcome the computational constraints, cloud computing platforms such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure Cloud can be utilized for large-scale water quality analysis. The users can request for the required amount and type of computational resources on these cloud platforms. They will be charged with a monetary fee in turn for utilizing the cloud platform.

There are very few artificial intelligence-based models which simultaneously deal with water quality, climate change, and river flow along with sensitivity analysis of water quality parameters. Considering this, the present study proposes a cloud-based optimized hydro-climatic ML-based modeling framework using ANN, ANFIS, and ANN coupled with advanced PSO. Five critical water quality parameters namely BOD, DO, temperature, pH, and total coliform are used for predicting the WQI of the River Ganges (India) under changing climate. This study is aimed at.

  1. (i).

    analyzing the sensitivity of water quality parameters toward predicting WQI using ANN, ANFIS, and ANN-PSO;

  2. (ii).

    assessing the performance of proposed models using R2, RMSE, and NMSE;

  3. (iii).

    incorporating uncertain behavior of climate parameters (air temperature and precipitation) and river flow as well as their impact on water quality parameters using ANN and ANFIS;

  4. (iv).

    comparing the modeling efficiency of ANN and ANFIS to predict WQI and assess the impact of hydro-climatology; and

  5. (v).

    enabling all the above-mentioned analyses in the cloud by sending the data from the monitoring sensors placed in the river to the cloud.

Overall, the proposed advanced machine learning-based models would enable the watershed managers and decision-makers to predict water quality accurately in a continuous manner while incorporating uncertainties concerning climate change, water quality parameters, and river flow.

Methodology

Study area and data

The Ganges basin occupies around one-fourth (26.3%) of India’s total area. River Ganges originates as Bhagirathi in the Garhwal Himalaya (30 55′ N, 79 7′ E). The catchment area of the Ganges basin is 8.6 × 105 km2 (26.4% of India’s area) (CPCB 2013). In Uttarakhand, the river traverses a length of 450 km. Raiwala is situated at an altitude of 356 m (above sea level) with an average rainfall of 2136.7 mm. Minimum and maximum summer temperatures are 29 ℃ and 40 ℃, respectively. In winters, the temperature drops lowest to 5 ℃ and goes up to 20 ℃. The following lists the characteristics of the River Ganges in the study area (Raiwala district of Uttarakhand) (Fig. 1):

  1. 1)

    The downstream stretch of the River Ganges at Raiwala and Rishikesh is not suitable for bathing with respect to the BOD criteria.

  2. 2)

    Total coliform does not meet the primary water criteria based on the designated best use for category ‘C’ at d/s Raiwala and d/s Haridwar (CPCB 2013).

  3. 3)

    Drains in Rishikesh discharge 178.5 MLD directly to the Ganges River. Open defecation and discharge of untreated sewage (CPCB 2016) leads to higher concentrations of coliform bacteria and organic pollution.

Fig. 1
figure 1

Map of Ganges River basin highlighting the study area

The interaction of rising temperatures and changing discharge patterns due to climate change and the discharge of heavy chemicals and sewage into the river imbalances its ecosystem. Water quality has its unique relationship with climate variables (e.g., precipitation, temperature, daylight hours, wind speed). On the other hand, river discharge also plays a key role in regulating riverine ecosystems and overall self-purifying capacity (Srinivas et al. 2018b). Limited research has been performed to explore the relationships among climate change, river discharge, and water quality (Mimikou et al. 2000; Zhang et al. 2015). In this study, monthly datasets of water quality, climate parameters, and river discharge are collected for the time period of nine years, i.e., 2011–2019. Water quality parameters data (pH, BOD, DO, water temperature, and total coliform) at the Raiwala station are obtained from the Central Pollution Control Board (CPCB 2019). River discharge data is provided by Central Water Commission (CWC 2019), while the climate data (rainfall and river temperature) is procured from Indian Meteorological Department (IMD 2019). For illustration purposes, water quality, climate parameters, and river discharge values only for the year 2011 are presented (Fig. 2). On analyzing WQI values for the study period (2011–2019), it is clear that the water quality of the river has significantly degraded from 2011 to 2019 primarily due to increased concentrations of total coliform and BOD.

Fig. 2
figure 2

Observed data of water quality, climate parameters, river discharge, and WQI at Raiwala (2011)

Optimized artificial intelligence models

The overall goal of the study is to accurately assess the impact of variation in hydro-climatic conditions and river discharge on water quality using ANN, ANFIS, and ANN-PSO. The methodology used in this study has been described in Fig. 3. We also propose collecting all the data from the monitoring sensors in the river and sending them to the cloud where the aforementioned analysis will be carried out; this is represented in Fig. 4. Firstly, a sensitivity analysis of five water quality parameters with respect to WQI of River Ganges has been performed. These results are analyzed on the basis of root mean square errors (RMSE), normalized mean square error (NMSE), and the determination coefficient (R2) for every combination of data using equations. Based on sensitivity analysis, critical parameters are obtained. Then, the impact of hydro-climatology on water quality is accessed using ANN and ANFIS. The models are applied to the case study on River Ganges in the following manner:

  • WQI calculation: Firstly, monthly WQI values are calculated using the weighted arithmetic mean method for the 9-year monthly data (Sutadian et al. 2016) using Eqs. (1)–(3).

    $${q}_{n}= \frac{100 ({X}_{n}-{X}_{io})}{({P}_{n}-{X}_{io})}$$
    (1)
    $${W}_{n}= {~}^{K}\!\left/ \!{~}_{{P}_{n}}\right.$$
    (2)
    $$WQI=\frac{\sum {q}_{n}{W}_{n}}{\sum {W}_{n}}$$
    (3)

where n is the number of water quality parameters, \({q}_{n}\) is the qualitative rating of the nth parameter, \({X}_{n}\) is the observed value of the nth parameter at the sampling station, \({P}_{n}\) is the standard prescribed value of the nth parameter, \({V}_{io}\) is the ideal value of the nth parameter in pure water, i.e., zero for all other parameters except pH and DO (7.0 mg/L and 14.6 mg/L, respectively), K is the constant for proportionality, and \({W}_{n}\) is the unit weight of the nth parameter. Table 1 provides the drinking water standards and unit weights of parameters provided by the Bureau of Indian Standards (BIS) and the World Health Organization (WHO).

Fig. 3
figure 3

Methodology adopted in this study

Fig. 4
figure 4

Utilizing the cloud for continuous monitoring and analysis

Table 1 Drinking water standards recommending agencies and unit weights
  • Normalization: The entire input datasets (water quality parameters) are normalized on a scale of 0 to 1 (Srinivas et al. 2017).where f is the output of the model, i.e., WQI and important water quality parameters in our study, p = 1,2,3…….N and N is the number of data samples.

  • Identifying sensitive water quality parameters: The normalized values of five water quality parameters and WQI are considered as input and output, respectively, to each of the ANN, ANFIS, and ANN-PSO models. Models are trained using different combinations of four water quality parameters at a time. Additionally, models are also trained by feeding all five water quality parameters as input. By using statistical methods (R2, RMSE, and NMSE) [Eqs. (4)–(6)], sensitive parameters are obtained for further analysis.where f is the output of the model, i.e., WQI and important water quality parameters in our study, p = 1,2,3…….N and N is the number of data samples.

    $$R^2=\frac{\sum_{k=1}^N\left[\left(f\left(p\right)-\overline{f\left(p\right)}\right)\left(f\left(\widehat p\right)-\overline{f\left(p\right)}\right)\right]}{\sqrt{\sum_{k=1}^N{(f\left(p\right)-\overline{f\left(p\right)})}^2}\sum_{k=1}^N{(f\left(\widehat p\right)-\overline{f\;\left(\widehat p\right)})}^2}$$
    (4)
    $$RMSE= \sqrt{\sum_{K=1}^{N}\frac{1}{N-1}{[f\left(p\right)-f\left(\widehat{p}\right)]}^{2}}$$
    (5)
    $$NMSE=\frac{\sum_{k=1}^{N}{[f\left(p\right)-f\left(\widehat{p}\right)]}^{2}}{\sum_{k=1}^{N}{[f\left(p\right)-\stackrel{-}{f\left(p\right)}]}^{2}}$$
    (6)
  • Assessing hydro-climatic impact: Another set of ANN and ANFIS models are trained for identifying the impact of climate and river discharge on water quality. A separate ANN and ANFIS models are built for each of the sensitive water quality parameters. Each model consists of flow rate of the River Ganges, air temperature, and rainfall at Raiwala as inputs while the water quality parameters as outputs. The trained ANN and ANFIS models are analyzed to assess the level of impact of climate and river flow on water quality using R2 and RMSE.

ANN and ANFIS models

In artificial neural networks (ANNs), an activation function (which is generally nonlinear) is applied to inputs (either water quality parameters or climate and discharge data) arriving at each neuron for generating an output signal (either WQI or water quality parameters). An additional bias input is added to the weighted sum of inputs to increase/decrease the overall input to the activation function. Such synthesized functions depend on the network architecture and interconnections between the processing units. Most ANN algorithms are feedforward networks which use error backpropagation algorithms for training. The input and output nodes represent independent and dependent variables, respectively. Hidden layers are added to perform nonlinear transformations (such as sigmoid functions) on the input space. Detailed information on ANN can be found in (Rajaee et al., 2020).

An adaptive neuro-fuzzy inference system (ANFIS) uses a fuzzy inference system (FIS) to express the uncertainty associated with water quality, river discharge, and climate parameters. FIS uses a nonlinear mapping from the input to the output space using a set of ‘IF–THEN’ fuzzy laws, each defining the mapping’s local behavior. In general, backpropagation alone or backpropagation along with the least square approach is used to optimize the fuzzy membership parameters. ANFIS model is a stronger model than ANN and independent fuzzy logic models in terms of its flexibility, adaptivity, learning speed, and memorization accuracy (Srisaeng et al., 2015). It is also able to better capture the nonlinearity, uncertainty, and inaccuracy in the data.

Sugeno-type FIS has been used in this study as it models the real-world data more accurately and is more computationally efficient than Mamdani (Wang and Chen 2014; Şahin and Erol 2017). The Sugeno system has either linear or constant output membership functions, whereas the Mamdani system can have triangular, Gaussian output membership functions. The Mamdani type is more reliant on expert knowledge and requires fuzzy rules. On the other hand, the Sugeno-type FIS utilizes real data for training, and it develops the rules itself while training according to the input and output data. In general, Sugeno FIS is used for one output and Mamdani FIS is used if the number of outputs is more than one. In this study, there is only one output, i.e., WQI, and the model needs to be trained based on real-time data; hence, Sugeno FIS is chosen. Figure 5 shows the architecture of an ANFIS model, where Oj,i represents the output of the ith node and jth layer, x is the input (water quality parameters) to the ith node, and Ai is the linguistic label of inputs denoted by membership functions (triangular, trapezoidal, and Gaussian).

Fig. 5
figure 5

Architecture of ANFIS model having two inputs, two rules, and one output

The fuzzy IF–THEN rules for a first-order Sugeno fuzzy model are given below:

$$\mathrm{Rule}\;1:\;\mathrm{If}\;x\;\mathrm{is}\;A_1\;\mathrm{and}\;y\;\mathrm{is}\;B_1,\;\mathrm{then}\;f_1=a_1x\;+\;b_1y+c_1,$$
$$\mathrm{Rule}\;2\;:\;\mathrm{If}\;x\;\mathrm{is}\;A_2\;\mathrm{and}\;y\;\mathrm{is}\;B_2,\;\mathrm{then}\;f_2=c_2x+b_2y+c_2,$$

where x and y are inputs, Ai and Bi are membership functions, fi is the output, and ai, bi, and ci are determined during the training process (Şahin and Erol 2017). The layers of the ANFIS model are explained below.

Layer 1: Nodes are represented through the following functions:

$$\begin{array}{cc}O_{1,i}\;=\;\mu_{Ai}\;\left(x\right)&for\;i\;=1,2\end{array}$$
$$\begin{array}{cc}O_{1,i}\;=\;\mu_{Bi-2}\;\left(y\right)&for\;i\;=3,4\end{array}$$

Layer 2: Firing strength of a rule is computed as follows:

$$\begin{array}{cc}{O}_{2,i}={w}_{i}={\mu }_{Ai}\left(x\right)\times {\mu }_{Bi}\left(y\right)& where i=\mathrm{1,2}\end{array}$$

Layer 3: Normalized firing strengths are calculated as

$$\begin{array}{cc}{O}_{3,i}={\overline{w} }_{1}={~}^{{w}_{i}}\!\left/ \!{~}_{{w}_{1}+{w}_{2}}\right.& i=\mathrm{1,2}\end{array}$$

Layer 4: The node function of this layer calculates the contribution of the ith rule to the total output

$${O}_{4,i}=\overline{{w }_{1}} \times {f}_{i}= \overline{{w }_{1}} ({a}_{i} \times x+ {b}_{i} \times y+ {c}_{i})$$

where \({w}_{i}\) is the output of layer 3 and {ai, bi, ci} is the parameter set.

Layer 5: Single output node computes the overall output of the ANFIS as

$${O}_{5, i}= \sum_{i}\overline{{w }_{1}} \times {f}_{i}= {~}^{\sum_{i}{w}_{i}\times {f}_{i} }\!\left/ \!{~}_{\sum_{i}{w}_{i}}\right.$$

In this study, a backpropagation learning algorithm is used to reduce the error between observed and expected data. A well-designed ANFIS has been developed to address the nonlinearity or complexity associated with inputs and outputs with high precision.

ANN training and architecture

Feedforward backpropagation neural network: A total of 108 normalized (on a scale of 0–1) input and output values are fed into the hidden layer, and desired weights are assigned. The number of hidden layers affects the error estimation. The weights connect each link between an input and a hidden layer, as well as a bias value. The output layer then connects to the hidden layer, which gives the output. All the neural network models generated in this study are analyzed using R2 and RMSE.

Training the neural network

Training achieves highly efficient and accurate modeling of a given dataset. For sensitivity analysis, input data includes five water quality parameters and output includes the WQI of the river. For assessing the impact of hydro-climatic changes, river discharge, rainfall, and air temperature are considered as inputs; and the output dataset includes pH, DO, BOD, temperature, and total coliform. The backpropagation algorithm is used for the training. Using the fitnet function in MATLAB 2020a, Levenberg–Marquardt (LM) function has been chosen as the training algorithm and tansig as the activation function for hidden as well as output layers.

Neural network architecture

The structural attributes of a feedforward neural network such as the number of hidden layers, hidden neurons, and training, testing, and validation datasets directly affect the performance of the model and are important for adequate mapping. The trial-and-error process has been used extensively to arrive at a perceptron with three layers consisting of a number of functional nodes and connection weights. Input and output values are randomly partitioned into 70%, 15%, and 15% as training, testing, and validation sets, respectively. Figure 6 represents the architecture of the ANN model for analyzing the impact of variation in hydro-climatology on water quality. Because of the lack of availability of a huge dataset, overfitting in the ANN model is dealt with by changing the network complexity. The network structure was changed in a manner that the parameters (weights) of the model remain small. Small parameters suggest a less complex and, in turn, more stable model that is less sensitive to statistical fluctuations in the input data. A trial-and-error process was undertaken till the model achieved a suitable performance on a validation dataset, and the corresponding network structure was finalized.

Fig. 6
figure 6

Architecture of ANN model for climate change analysis

Following the above-mentioned concept, while training an ANN model with five water quality parameters as input and Water Quality Index (WQI) as the output, taking one hidden layer between the input and output layer was the basic starting point so that number of weights in the model could be optimized to a minimum level. The model was developed with 15% of the dataset as a validation dataset. For deciding the number of hidden neurons in that layer, trial and error was done with 5, 6, and 7 neurons. The values of the regression coefficient for all the three models with 5, 6, and 7 hidden neurons were similar (around 0.99). This indicated that all these three models were able to train well. However, the predictive error on the validation dataset was the least for the model with 5 neurons with also the training error being the least (7.5 × 10−7). For 6 and 7 neurons, the predictive error of the validation dataset was very high as compared to the low training error. Therefore, with 5 input parameters, 5 hidden neurons in 1 hidden layer and 1 output node (WQI) in the output layer, the neural network model was developed to achieve, if any, the least amount of overfitting.

ANFIS training and architecture

Both sensitivity and hydro-climatic analysis use a combination of input and output values similar to that of the feedforward backpropagation of ANN, as described earlier. Also, R2 and RMSE are calculated.

Sugeno-FIS architecture

The ANFIS model’s ability to meet the output target is determined by internal ANFIS parameters including the number and shape of membership functions. Each input parameter’s contribution to the regression parameter and the desired output is determined by the type and number of membership functions used. Optimization of internal parameters is therefore a necessary step during ANFIS modeling. If the target RMSE can be met, the trial-and-error method is considered to be efficient for such optimization. The trial-and-error method also has the benefit of producing an information rule base with a lower likelihood of overfitting the training data collection. Hence, internal ANFIS learning parameters are tuned to render the optimal parameters capable of mimicking the given data pattern sequences after a comprehensive trial-and-error process. For every input, three Gaussian membership functions are used. For illustration, Fig. 7 shows the membership functions used for pH in this study. The Gaussian membership function (\({\mu }_{A}\)) is generally represented as Gaussian (x: b, s), where b and s represent the mean and standard deviation, respectively (Eq. (7)).

$${\mu }_{A}\left(x,b,s,n\right)=\mathrm{exp}[-\frac{1}{2}{\left|\frac{x-b}{n}\right|}^{n}$$
(7)

where n represents the fuzzification factor.

Fig. 7
figure 7

Gaussian membership function plot for an input variable pH

The output membership function of the ANFIS model is linear. The number of inputs, their types, and the number of fuzzy membership functions used in the model are used to determine the number of fuzzy rules and the optimum number of parameters required for describing the FIS to obtain the best results.

Training the ANFIS

Training is done with the aid of a neural-fuzzy designer in MATLAB 2020a. For illustration, the architecture of the ANFIS model for sensitivity analysis has been shown in Fig. 8. The data space is divided into rectangular subspaces using grid partition and axis-parallel partition on the basis of a predetermined number of membership functions (MFs) and their types in the dimensions. Backpropagation along with the least-squares algorithm (hybrid) adjusts the parameters of the membership functions. The parameters of input membership functions are adjusted using backpropagation, while those of output membership functions are adjusted using least-squares estimation in this hybrid optimization process. In this study, the training stops after 50 epochs as there is no change in error after 50 epochs.

Fig. 8
figure 8

Architecture of ANFIS model in Neuro-Fuzzy Designer for sensitivity analysis

Hybridized particle swarm optimization (PSO)-ANN model

Meta-heuristic optimization algorithms are used to train ANNs instead of classic algorithms (backpropagation algorithm) to tackle the disadvantages associated with the latter like getting stuck in local minima/maxima and plateaus of the error function landscapes (Chatterjee et al. 2017). In a hybrid ANN-PSO model, PSO is used to reduce the ANN’s errors by selecting the best values for the model’s weights and biases. As a result, the weights and biases are variables in this problem, and the problem’s feasible space is determined by the interval at which these variables differ.

In ANN-PSO, each particle contains one set of weights and the value of weights corresponding to a particle is termed as the particle’s position. A particle’s fitness function can be expressed in terms of NMSE. Firstly, a number of hidden layers of neurons are chosen, and a neural network is developed with initial weights and biases. The initial weights and biases are then improved in order to be able to locate the particle in the D-dimensional, where D refers to the total number of weights and biases. In every iteration, each particle’s output values are predicted, followed by the calculation of normal mean squared error (cost function in this study). The location of particles is updated for a specified number of populations and iterations until the cost function is minimized (Poli et al. 2007). For this study, the weights refer to the weights of the ANN model with water parameters as input and WQI as output for sensitivity analyses. A total of 50 particles (population) have been considered in the PSO algorithm. Out of these 50 particles, the most optimum one whose position when fixed as the weights in the ANN model gives the least NMSE after training.

In the proposed work, PSO along with controlled parameters has been proposed to improve the ANN model. To investigate particle movement patterns in PSO with controlled parameters, base frequency (F) and variance (\({M}_{c}\)) are used to estimate particle position coefficients. The term ‘base frequency’ is used to describe the movement as a measure of positional correlation. The range of movement patterns is characterized by variance (\({M}_{c}\)). The frequency and variance are measured theoretically, and the inertial weight (w) is calculated using Eqs. (8)–(10).

$$f\left(w\right)=\frac{-48V_cw^2+48M_c}{28M_c+w-20M_cw+1}-1-w+2\;\cos\;\left(2\pi F\right)\sqrt w\;;\;f\left(w\right)=0$$
(8)

For a given frequency and variance, values of coefficients are calculated as

$${c}_{1}=1+w-2\mathrm{cos}(2\pi F)\sqrt{w}$$
(9)
$${c}_{2=}\frac{-48{V}_{c}{w}^{2}+48{M}_{c}}{28{M}_{c}+w-20{M}_{c}w+1}$$
(10)

In comparison to a long-run problem, the number of iterations for a short-run problem is lower. If a short run and a long run are conducted, the experimental values of \({M}_{c}\) and F are given as 25.6 and 0.2, respectively, for a short run, 6.4 and 0.2, respectively, for a long run (Bonyadi and Michalewicz, 2016).

ANN-PSO architecture

Particle swarm optimization has been used to improve the ANN modeling of water quality parameters and WQI while performing sensitivity analysis. R2 and NMSE are used to compare the models and evaluate the sensitivity of each water quality parameter.

Neural network architecture

A feedforward perceptron with one hidden layer is improved using advanced PSO. The training algorithm and activation functions are the same as in the ‘ANN training and architecture’ section. Input (pH, DO, BOD, temperature, and total coliform) and output (WQI) values are randomly divided into 70%, 15%, and 15% as training, testing, and validation sets, respectively.

Optimization of weights

Obtaining suitable weights between ‘input and hidden layers’ and ‘hidden and output layers’ using a meta-heuristic nature-inspired algorithm has proven to be extremely efficient as compared to classical methods. The number of weights to be optimized is calculated using Eq. (11).

$$\mathrm{No}.\;\mathrm{of}\;\mathrm{weights}=(\mathrm{No}.\;of\;\mathrm{input}\;\mathrm{nodes}\ast n)+n+n+1$$
(11)

where n is the number of hidden nodes.

In the PSO algorithm, values of these weights are considered as positions corresponding to each particle. Particle size is known as population. The velocity of each particle is also generated, and after each iteration, both the position and velocity of the particle are updated based on global best values. Following a trial-and-error method to get the best performance, iterations are fixed to a maximum of 200, the population is 50, and the damping coefficient is set to be 0.99. In order to enhance PSO, controlled parameters are used (Bonyadi and Michalewicz, 2016) as given below:

$$\mathrm{Inertial}\;\mathrm{weight}\;(w)=0.836$$
$$\mathrm{Acceleration}\;\mathrm{constants}\;(c_1\;\mathrm{and}\;c_2)=1.271$$

Fitness function

The fitness (cost) function in the PSO algorithm equals NMSE. After being trained with PSO updated weights, ANN gives the error for a specific input and output dataset used for modeling the trained neural network. These error and output values are used to calculate the NMSE of those combinations of weights. The weights that produce the least value of the fitness function (NMSE) are chosen for modeling the ANN. Figure 9 shows how error decreases as the iterations progress to finally reach the least NMSE.

Fig. 9
figure 9

Relation between iterations and NMSE for ANN-PSO model with five water quality inputs

Results and discussion

The AI models described in the previous sections have been applied using the nine-year (2011–2019) datasets of water quality and climate parameters, and discharge at the Raiwala sampling station of the River Ganges. This section presents the results for each of the ANN, ANFIS, and ANN-PSO models.

Sensitivity analysis of water quality parameters

ANN, ANFIS, and ANN-PSO have been applied to carry out a sensitivity analysis to identify the water quality parameters which affect the WQI of the Ganges River in a significant manner. Table 4 presents the results of the ANN models. Based on the results, the combination of one hidden layer with five hidden neurons is selected, as it yields minimum RMSE (7.50 × 10−7). However, on selecting six or seven hidden neurons, RMSE increases. Therefore, five hidden neurons are selected for water quality parameters sensitivity analysis. Five models are trained, each time leaving one parameter out, using the fitnet function (nnstart toolbox) in MATLAB 2020a. On removing any parameter from the analysis, the RMSE and R2 values of the neural models increase or decrease, signifying their importance accordingly. ‘ANN-WQI-ALL’ is taken as the reference model. When all five parameters are taken as input, a correlation coefficient of 0.999 is obtained. On the other hand, if BOD is not taken as input, a correlation coefficient of 0.856 and RMSE of 0.0659 is obtained. Hence, comparing the values of statistical measures, it is observed that pH, BOD, and total coliform have the largest influence on the WQI. DO (R2 = 0.965 and RMSE = 0.044) and water temperature (R2 = 0.999 and RMSE = 5.53 × 10−6) have comparatively lesser impact on WQI (Table 2).

Table 2 Overall results of eight ANN-WQI models developed for sensitivity analysis

Table 3 presents the results of the ANFIS models. The ANFIS-WQI-ALL model has the lowest RMSE value (1.02 × 10−5) and shows good accuracy as well as little residual errors when compared to other models. Hence, it is taken as the reference model. RMSE increases greatly (0.000131) when total coliform is not taken into consideration. This reveals that total coliform can be considered as the most crucial parameter for WQI prediction. Low errors corresponding to ‘ANFIS-WQI-BOD’ and ‘ANN-WQI-pH’ also indicate the importance of BOD and pH as inputs when compared to temperature and DO. The R2 values do not vary significantly from 0.99 indicates that ANFIS has been able to achieve very good training accuracy on the provided data.

Table 3 Overall results of six ANFIS-WQI models developed for sensitivity analysis

Table 4 gives the results of sensitivity analysis for ANN-PSO models. On taking 7 hidden neurons, all parameters when considered into the model yield an R2 value of 0.998 and NMSE of 0.0024. Optimized weights through PSO gave the best results when the ANN architecture had one hidden layer and 7 hidden nodes. The highest dip in R2 value (0.7196) is seen in the case of ANN PSO-WQI-COLIFORM. Consequently, the NMSE value of 0.3103 proves that total coliform is the most significant input parameter. The remaining models are statistically close with one another.

Table 4 Overall result of eight ANN PSO-WQI models developed for sensitivity analysis

Because of their ability to yield more accurate results, ANFIS and ANN-PSO are considered more reliable compared to ANN models. The observations obtained from the results of these models can be easily verified based on the works performed by governmental bodies and other secondary sources (CPCB 2016; Srinivas et al. 2017; UPCB 2020). In Rishikesh, a total of 187.92 MLD of wastewater sewage and 1.27 tonnes/day of BOD is discharged into the River Ganges (CPCB 2016). On major religious days, the discharge of municipal sewage increases significantly as more than 1.5 million people gather near the River Ganges. In addition, Uttarakhand is the home of 71 grossly polluting industries discharging 94,992 KLD of industrial waste and 2150 kg/day of BOD into the River Ganges and its tributaries (ENVIS 2016). These industries include paper pulp, sugar, distilleries, and pesticides. Since most of them discharge BOD, pH at high temperatures, a strong correlation is obtained using ANN, ANFIS, and ANN-PSO models due to the presence of a common source. BOD, pH, and coliform are also present in sewage water; thus, coliform is also closely related to other parameters.

Due to constant discharge from sewage treatment plants, untreated drains, and industries across the Dehradun-Rishikesh stretch, BOD and coliform exceed the water quality criteria (CPCB 2019). pH levels increase tremendously, particularly in the monsoon seasons. Haritash and Gaur (2016) conclude that Rishikesh’s water cannot be advised for drinking or other domestic needs unless it is treated.

This study would enable governmental bodies such as CPCB to take necessary precautions against the discharge of sewage, human, and animal waste into the waters of the River Ganges and devise suitable measures to assess the climatic impact on aquatic life. It would also help the concerned authorities to plan and manage big pilgrimages and the establishment of industries and factories in a safe and efficient manner.

Suitable models for WQI forecasting

ANN, ANFIS, and ANN-PSO show great capability during the training stage when used for the numerical modeling of WQI, considering pH, BOD, water temperature, and total coliform as inputs. However, ANFIS models are the most robust and accurate, with no major fluctuations in the values of R2 and lesser residual errors (Table 3). ANFIS models are observed to be superior over ANN and hybrid ANN models (Kisi 2015). They have proven to efficiently deal with areas that are ill-defined and ambiguous, such as water quality projections (Aghaarabi et al. 2017; Tiwari et al. 2018) and suitable for onsite water quality evaluation (Khadr and Elshemy 2017).

Figure 10 depicts the deviation of the water quality indices predicted using six ANFIS models from the calculated WQI values. %parameter_name indicates the graph of the model in which that particular parameter was not considered as one of the inputs. The most evident deviations from WQI values are observed for blue, orange, and yellow lines with one or two spikes of red and green lines. This suggests the importance of BOD, pH, and total coliform as input parameters as compared to water temperature and DO. As BOD, pH, and coliform are associated with the River Ganges locale, utilizing them as three input parameters for WQI prediction via ANFIS may result in good model reproducibility. The less significant inputs could be removed because it contributes less variance to the WQI forecast.

Fig. 10
figure 10

Deviation of all ANFIS models in comparison to actual WQI values

Impact of hydro-climatic variations

Although BOD, pH, and coliform turn out to be the most sensitive parameters based on the ANFIS model, all parameters are considered for studying the effects of hydro-climatology on water quality. This is due to all parameters being somewhat sensitive toward WQI. Discharge of the Ganges, air temperature, and rainfall are taken as input, and one by one, all water quality parameters are taken as output for modeling the unknown and uncertain relationship among climate change and water quality using ANN and ANFIS. ANN models with six hidden neurons show that water temperature is significantly affected by rainfall, air temperature, and flow rate of the river with an R2 value of 0.47 (Fig. 11). Although the error of ANN-COLIFORM is the lowest (0.0942), and its R2 value is also very low. Similarly, other ANN models have not achieved good accuracy in the training stage. This indicates ANN’s inability to efficiently capture nonlinear relationships and map ambiguous data with time variations.

Fig. 11
figure 11

Comparison of R2 and RMSE values for ANN and ANFIS models for hydro-climatic impact assessment

On the other hand, ANFIS emerges successful in providing better clarity on how water quality depends on flow and climate parameters (Fig. 11) with relatively good R2 values. The highest correlation is found for total coliform (0.927) and the least for DO (0.587). Models with DO and BOD as outputs have the least R2 and maximum error values. ‘ANFIS-WATER.TEMP’ and ‘ANFIS-PH’ are also comparable based on their RMSE values. Results of ANFIS modeling can be employed for understanding time-dependent uncertain relationships among water quality parameters and climate. In addition, the watershed modelers would also get clarity on framing seasonal policies and planning treatment practices to enhance the water quality under varying climates.

Both the water quality measures and river flow are highly dependent on cumulative antecedent precipitation (Srinivas et al. 2020c). At the Raiwala sampling station, during the rainy season, individual household septic tanks get overcrowded, allowing untreated wastes to spill into drainage ditches and neighboring water bodies, which results in increased coliform concentration. When rain falls on a water source that is not well buffered, runoff can lower the pH of neighboring water. Global warming (increased levels of CO2 in the air) affects the atmospheric temperature and river flows which in turn is responsible for changing pH and DO levels in surface water (Nepal 2016). River water flow has an impact on both low DO and high BOD concentrations. The heavier the rainfall, the more diluted the river water will be, lowering the BOD value (Srinivas et al. 2017). Therefore, studying the impact of climate variations on the water quality of the Ganges is essential.

Future work

The outcomes of the cloud-based neuro-fuzzy hydro-climatic model proposed in this study have the capability to enhance the existing conventionally available software such as PTMApp (BWSR 2021) and ACPF (ACPF 2021) by incorporating the uncertainty associated with water quality parameters (Srinivas et al. 2020a). Authors have discussed in detail about how these software would help in developing a decision support system in their previous work (Srinivas et al. 2020b). Some of the prescribed changes in these software are given below.

PTMApp

  • Individual practices: PTMApp now has over 20 individual practices based on USDA NRCS practice types (such as farm ponds, saturated buffers, denitrifying bioreactors, and multi-stage ditch) as compared to the 6 treatment groups of practices (source reduction, protection, etc.) in the earlier versions.

  • Economics: In the original PTMApp, the only upfront cost of installation of the practice was considered. But now, PTMApp outputs that number as well as the total life cycle costs of a practice. Life cycle costs look at the long-term costs for the effective life of the practice and then annualized to get a different perspective on costs. This is helpful when looking at practices like cover crops that have low installation costs but over the long term may cost just as much or more than a structural practice that has high upfront installation costs.

ACPF

The new version of ACPF (version 4) would be released in Spring 2022. Some of the expected changes include.

  • inclusion of economics similar to PTMApp

  • addition of pollution reduction benefits and potentially hydrology and carbon benefits as well, and

  • also, it would be compatible with the latest versions of ArcGIS.

Looking at the future enhancements of PTMApp and ACPF, there is a great scope to integrate a cloud-based neuro-fuzzy hydro-climatic model to deal with uncertainty and sensitivity associated with costs, locations and sizes of practices, water quality benefits, and stakeholder preferences.

Conclusions

The study presents various running AI-based models in the cloud to identify sensitive water quality parameters and assess the impact of hydro-climatic conditions on water quality. The collection of data from the monitoring sensors in the river and sending this data to the cloud can enable continuous analysis and assessment. Relationships among water quality and climate change derived using the proposed models would guide the hydro-climatologists to plan advanced treatments based on predicted climatic changes. The results demonstrate that artificial intelligence has been proven effectively equipped for forecasting the future health of the rivers in a real-time network, especially when data is scarce and lacks proper detail. Machine learning models (ANN, ANFIS, ANN-PSO) generate the desired results in a fast, accurate, and inexpensive manner, even with small and imperfect datasets. ANFIS could very well handle the uncertainties associated with water quality parameters. Overall, this study proposes a framework to develop machine learning models (ANN, ANN-PSO, and ANFIS) for predicting the Water Quality Index (WQI) of River Ganges and to assess the impact of climate change on water quality. The novelty of this work lies in evaluating the sensitivity of water parameters with respect to WQI which enhances the model development for WQI prediction. In addition, ANN has been modified using fuzzy logic and PSO algorithms. A unique meta-heuristic method of linking flow rate and climate change parameters with water quality characteristics has been developed using ANN and ANFIS, where ANFIS provided accurate results. ANN, hybrid ANN (weights optimized by advanced PSO), and ANFIS capture the linear relationship between parameters and numerically computed WQI with satisfactory R2 values of 0.999, 0.998, and 0.999, respectively. Sensitivity analysis shows the significance of ‘BOD, pH, and total coliform’ as inputs for WQI forecasting as compared to water temperature and DO. Moreover, ANFIS is revealed to be a more reliable tool than ANN for assessing the impact of hydro-climatic change on a river’s water quality, with a maximum R2 of 0.927 and least NMSE of 0.027 in the case of total coliform. Uncertainty in our study is dealt with by amalgamating fuzzy inference systems in the ANN model (ANFIS). Results clearly indicate that ANFIS achieves pattern recognition for unknown nonlinear and linear relationships better than ANN and hybrid ANN algorithms. The proposed machine learning framework is flexible and replicable for any number of water quality and climate parameters in any water body in the world. The future scope includes the coupling of classical and probabilistic ways with AI and Geographical Information Systems (GIS) to incorporate uncertainty for large datasets in a more satisfactory manner.