Interpreting optimised data-driven solution with explainable artificial intelligence (XAI) for water quality assessment for better decision-making in pollution management

Mallick, Javed; Alqadhi, Saeed; Hang, Hoang Thi; Alsubih, Majed

doi:10.1007/s11356-024-33921-7

Research Article
Published: 17 June 2024

Volume 31, pages 42948–42969, (2024)

Environmental Science and Pollution Research

Javed Mallick ORCID: orcid.org/0000-0002-6155-3720¹,
Saeed Alqadhi¹,
Hoang Thi Hang² &
…
Majed Alsubih¹

Abstract

In Saudi Arabia, water pollution and drinking water scarcity pose a major challenge and jeopardise the achievement of sustainable development goals. The urgent need for rapid and accurate monitoring and assessment of water quality requires sophisticated, data-driven solutions for better decision-making in water management. This study aims to develop optimised data-driven models for comprehensive water quality assessment to enable informed decisions that are critical for sustainable water resources management. We used an entropy-weighted arithmetic technique to calculate the Water Quality Index (WQI), which integrates the World Health Organization (WHO) standards for various water quality parameters. Our methodology incorporated advanced machine learning (ML) models, including decision trees, random forests (RF) and correlation analyses to select features essential for identifying critical water quality parameters. We developed and optimised data-driven models such as gradient boosting machines (GBM), deep neural networks (DNN) and RF within the H2O API framework to ensure efficient data processing and handling. Interpretation of these models was achieved through a three-pronged explainable artificial intelligence (XAI) approach: model diagnosis with residual analysis, model parts with permutation-based feature importance and model profiling with partial dependence plots (PDP), accumulated local effects (ALE) plots and individual conditional expectation (ICE) plots. The quantitative results revealed insightful findings: fluoride and residual chlorine had the highest and lowest entropy weights, respectively, indicating their differential effects on water quality. Over 35% of the water samples were categorised as ‘unsuitable’ for consumption, highlighting the urgency of taking action to improve water quality. 
Amongst the optimised models, the Random Forest (model 79) and the Deep Neural Network (model 81) proved to be the most effective and showed robust predictive abilities, with R² values of 0.96 and 0.97, respectively, for the testing dataset. Model profiling within the XAI framework highlighted the significant influence of key parameters such as nitrate, total hardness and pH on WQI predictions. These findings enable targeted water quality improvement measures that are in line with sustainable water management goals. Therefore, our study demonstrates the potential of advanced, data-driven methods to revolutionise water quality assessment in Saudi Arabia. By providing a more nuanced understanding of water quality dynamics and enabling effective decision-making, these models contribute significantly to the sustainable management of valuable water resources.

Introduction

In the Kingdom of Saudi Arabia (KSA), the challenge of ensuring water quality is particularly acute due to the unique geographical and climatic conditions (Al-Omran et al. 2016). As a country largely devoid of natural freshwater bodies like rivers and lakes, KSA relies primarily on groundwater and desalination of seawater to meet its water needs for drinking, irrigation and industrial purposes. This dependence is exacerbated by the country’s rapid economic development and population growth (Al-Omran et al. 2016). The extensive use of groundwater, particularly in agriculture, has led to considerable challenges. The use of artificial fertilisers, which is necessary to meet the increased demand for food, often leads to excessive nutrient extraction from the soil (Fallatah 2020). This excess fertiliser, which is rich in nitrates and phosphates, then seeps into the groundwater and contaminates it. Furthermore, the transport of these pollutants through the groundwater system contributes to the broader problem of water pollution, which affects both human health and the environment (Saud and Abdullah 2009; Alghamdi et al. 2020).

Saudi Arabia’s geographical landscape, characterised by thick Mesozoic and Cenozoic sedimentary rocks, forms productive aquifers that are central to the country’s groundwater resources (Alharbi and Zaidi 2018). However, over-reliance on these aquifers in an arid environment is leading to declining groundwater levels and deteriorating quality, posing a major environmental challenge (Alharbi 2018). Rapid urbanisation and the expansion of agricultural activities across the country have further increased the demand for freshwater resources. As a result, the task of assessing and managing water quality is becoming increasingly complex and requires more efficient and reliable methods (Khanfar 2008; Al-Hammad and Abd El-Salam 2016). The traditional approach to water quality assessment is the use of water quality indices (WQIs) (Khuan et al. 2002; Asadollah et al. 2021). WQIs are an important tool for water resources management as they represent a single measure that encompasses various physical and chemical parameters of water quality (Zali et al. 2011; Hameed et al. 2017). However, the calculation of these indices is often associated with challenges, such as being time-consuming, complex and prone to inconsistencies due to the use of different equations and methods (Kouadri et al. 2021). This complexity is exacerbated by the lack of a universal WQI method, leading to different interpretations and assessments of water quality (Leong et al. 2021).

In response to these challenges, artificial intelligence (AI)-based WQI prediction emerges as a promising solution (Aldhyani et al. 2020; Hmoud Al-Adhaileh and Waselallah Alsaade 2021; Sajib et al. 2024). AI-based models offer a transformative solution by eliminating the need for tedious sub-index calculations, enabling fast and efficient water quality assessment (Sarafaraz et al. 2024; Tiyasha et al. 2021). These models are characterised by their non-linear structures that can handle large data sets with different scales and are resistant to missing data (Elbeltagi et al. 2022). The strength of AI algorithms in predicting complex phenomena lies in their ability to analyse data and recognise patterns (Irwan et al. 2023; Sidek et al. 2024). In this process, algorithms are constructed using a subset of the data set (training data) and the prediction performance is validated using a separate subset (test set) (Irwan et al. 2023). Notable AI algorithms that have been successfully applied in water quality prediction include adaptive boosting (Adaboost), gradient boosting (GBM), extreme gradient boosting (XGBoost), decision trees (DT), extra trees (ExT), random forest (RF), multilayer perceptron (MLP), radial basis function (RBF), deep feed-forward neural network (DFNN) and convolutional neural network (CNN) (Kim et al. 2022; Khoi et al. 2022; Aldrees et al. 2022; Nayan et al. 2020; Talukdar et al. 2023a; Al-Sulttani et al. 2021; Sinha 2023; Yusri et al. 2022; Ho et al. 2019; Sakaa et al. 2022; Sheikh Khozani et al. 2022; El-Shebli et al. 2023; Kogekar et al. 2021; Mei et al. 2022; Sidek et al. 2024). These AI models have been used in various contexts, e.g. in the prediction of manganese removal (Bhagat et al. 2020; Erickson et al. 2021), flood susceptibility studies (Talukdar et al. 2020a; 2023a; Islam et al. 2021; Saha et al. 2021; Ahmed et al. 2022; Mahato et al. 2021), the identification of pollution sources (Mia et al. 2023) and the general prediction of water quality (Talukdar et al. 2023b; Sinha 2023; Yusri et al. 2022; Khoi et al. 2022), with varying degrees of accuracy. The application of AI in water quality assessment represents a significant advance as it offers a more streamlined, accurate and efficient approach compared to conventional methods (Lap et al. 2023).

Grid search optimisation in machine learning is a methodological approach to improve the performance of models by fine-tuning their hyperparameters (Wu et al. 2019; Kim and Seo 2024). In contrast to general model parameters derived from training data, hyperparameters are predefined settings that control the learning process. For example, while the coefficients of a logistic regression (LR) model are determined during training, the number of decision trees in an RF model is a hyperparameter that is set before training (Talukdar et al. 2020b). The importance of hyperparameters in machine learning cannot be overstated, as they directly influence the accuracy, speed and reliability of the models. Grid search involves exhaustively exploring a given range of hyperparameters, iteratively running through all possible combinations and evaluating their performance to select the optimal set (Dodangeh et al. 2020). This process ensures that the chosen hyperparameters maximise the effectiveness of the model within the searched grid (Talukdar et al. 2024). However, this task can be computationally intensive and time-consuming, particularly for complex models with numerous hyperparameters (Raheja et al. 2024). Nevertheless, the meticulousness of grid search makes it an indispensable tool in machine learning, especially for applications where precision and model performance are critical (Fang et al. 2021).

Although advanced AI models, including hybrid and deep learning systems, can provide highly accurate predictions, they often operate as "black boxes" that offer little insight into how they arrive at these predictions (Alshehri and Rahman 2023). This ambiguity limits their applicability, particularly in areas where understanding the motivations behind decisions is critical (Park et al. 2022a). Explainable artificial intelligence (XAI) addresses this fundamental challenge in the field of AI and machine learning for decision-making processes (Talukdar et al. 2023a, 2024). XAI aims to demystify these complex models and provide clarity on the internal mechanisms and decision-making processes (Talukdar et al. 2023b; Park et al. 2022a). Techniques such as SHapley Additive exPlanations (SHAP), partial dependence plot (PDP), permutation-based feature importance, accumulated local effect (ALE) have gained prominence in this field (Ahmed et al. 2023, 2024, Mia et al. 2023). SHAP, an additive feature assignment method, and PDP decompose the prediction of a model into the contribution of each feature, providing insights into the interaction and relative importance of the different variables to the model’s results (Ahmed et al. 2024). This interpretive approach not only increases the transparency of AI models, but also builds trust with users and stakeholders and ensures that AI-driven decisions are understandable and justifiable (Talukdar et al. 2023b). The application of XAI is critical in areas such as geohazard prediction and environmental monitoring, where understanding the basis of predictive modelling is as important as the predictions themselves (Ahmed et al. 2024).

This study addresses critical gaps in water quality assessment in Saudi Arabia, where conventional methods are inadequate given the dynamic environmental changes and the region’s dependence on groundwater and seawater desalination, exacerbated by agricultural and industrial pressures. We use advanced machine learning (ML) models to conduct real-time data-driven analyses with an entropy-weighted WQI and integrate XAI to identify key water quality parameters. This enables targeted policy measures and improves the understanding of water quality interactions. In addition, the H2O API in R programming, which is central to our methodology, facilitates both grid search optimisation and XAI integration, simplifying the exploration of multiple parameter combinations to optimise model configuration and interpretability (Talukdar et al. 2024; Šandera and Štych 2024). This integration of grid search optimisation with XAI within the H2O framework ensures that model performance is not only improved, but that the results are transparent, trustworthy, and meet the requirements of complex data-driven decision-making processes. This approach represents a significant innovation in environmental monitoring as it improves both the performance and interpretability of ML models, which is crucial for effective environmental management and policy formulation. These innovations hold great potential for global sustainable water management and could influence future academic and policy directions.

Materials and methodology

Study area

The Asir Province in Saudi Arabia is located within the coordinates 18°12′029.355″N, 18° 12′051.436″N latitude and 42° 29′05.157″E, 42° 29′019.795″E longitude, covering an area of 84,250 square kilometres (as shown in Fig. 1b). Asir experiences various climatic conditions, including hot desert, cold desert, cold semi-arid and hot semi-arid zones. It receives an annual rainfall of 350 mm, making it a significant region for agriculture. The geological composition of Asir consists of aquifers such as quaternary alluvium, quartz sandstone and conglomerates, with secondary aquifers primarily composed of calcareous deposits undergoing lateral diagenetic modifications. These aquifers exhibit greater porosity and karstification (Mallick et al. 2018). Figure 1B provides a geological map of the Asir region. One crucial source of groundwater in the region is the unconfined quaternary alluvial aquifers, which are replenished by runoff from the Asir highlands. These shallow aquifers have an estimated annual recharge of 1196 × 10⁶ m³ and exhibit varying water quality, ranging from poor to good (Dabbagh and Abderrahman 1997). The high-quality groundwater found in Wadi-al-Dawassir is attributed to a 100-m-thick layer of alluvial fill.

The Asir region is rapidly advancing its rainwater harvesting efforts through the construction of check dams. These dams enable the collection of sufficient water to cultivate 15,000 hectares of agricultural land. Presently, if just a quarter of the runoff water currently lost could be effectively harvested, it would fulfil all of Saudi Arabia’s existing agricultural water requirements. The majority of Saudi Arabia’s runoff occurs along the escarpment in the Asir region, where wadis flow towards the coastal area, contributing approximately 60% of the nation’s total runoff. Most wadi structures are filled with sand and gravel, and after a short distance, the runoff seeps into subsurface water bodies (wadi) and forms a sub-flow, recharging the groundwater. Storm runoff can occur in the Asir region at any time of the year. Saudi Arabia has a longstanding tradition of constructing dams, particularly in the Hijaz and Asir regions. As of 2018, according to the Saudi Arabia Ministry of Environment, Water and Agriculture, there are 509 dams throughout the kingdom, with 117 of them located in the Asir region. The primary purpose behind constructing these dams is to capture runoff and replenish the groundwater network, although some dams also serve as sources of drinking water and direct irrigation for agriculture.

Sampling and laboratory analysis

A total of 62 groundwater samples were taken from various wells within the study region, as shown in Fig. 1b. The locations of these wells were randomly selected to ensure a representative distribution across the different geological and hydrological conditions in the area. This random selection was intended to provide a comprehensive overview of the water quality in the entire study region. During sampling, several parameters, including electrical conductivity (EC), pH and total dissolved solids (TDS), were measured to assess water quality. Handheld sensors manufactured by HANNA were employed for these measurements. Prior to each sampling session, the sensor was calibrated daily using standard solutions with pH values of 4.0, 7.0 and 10.0, as well as EC standards at 84 µS/cm, 1413 µS/cm and 12.8 mS/cm. For each sampling site, two groundwater samples were collected in high-density polyethylene (HDPE) bottles. One of these samples was acidified in the field using a 1:1 nitric acid solution and was subsequently utilised for cation analysis. The second sample was left unaltered and was designated for laboratory-based anion estimation. To determine the concentrations of cations such as calcium (Ca²⁺), sodium (Na⁺), magnesium (Mg²⁺), potassium (K⁺) and iron (Fe), an atomic absorption spectrophotometer from Thermo Scientific’s M series was employed. On the other hand, the analysis of anions, including chloride (Cl⁻), fluoride (F⁻), nitrate (NO₃⁻) and sulphate (SO₄²⁻), was carried out using an ion chromatograph (Dionex) in gradient mode. Additionally, bicarbonate (HCO₃⁻) was determined using a titrimetric method, while total alkalinity and hardness were assessed following the standard procedures outlined in APHA 1995. All the reagents, standards and chemicals utilised in these analyses were of analytical grade and sourced from Merck.

Water quality index estimation using entropy weighted arithmetic method

The theoretical method for estimating the WQI using the entropy-weighted arithmetic method revolves around the application of entropy theory to objectively determine the weighting of each water quality parameter, thereby eliminating the subjective biases typically associated with expert opinion (Singh et al. 2019). In this method, entropy, a concept derived from information theory, is used to quantify the degree of disorder or uncertainty associated with each water quality parameter (Verma et al. 2022). The greater the variability of a parameter across different water samples, the more information it carries and, in the usual formulation of the method, the lower its entropy value and the larger its weight, indicating a more significant role in the overall water quality assessment. This process begins with the collection and normalisation of water quality data, followed by the calculation of entropy values for each parameter. These entropy values are then used to derive objective weights that reflect the relative importance of each parameter to overall water quality. These weights are applied to the normalised values of each parameter and the entropy-weighted arithmetic mean of these values is calculated to obtain the final WQI. This entropy-based weighting approach ensures that the WQI is a more accurate and objective representation of water quality, as it minimises the influence of human bias and subjectivity that is inevitable in methods that rely heavily on expert opinion. In this way, it provides a more reliable and scientifically sound basis for water quality assessment, which is crucial for effective environmental monitoring and management.
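The steps described above (normalisation, entropy per parameter, weights, weighted sum) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the commonly used formulation with min-max normalisation, Shannon entropy and weights proportional to one minus the entropy, and it assumes a quality rating of the form 100 × concentration / standard. The function names and the three sample rows are made up for the example; only the two limits (1.5 mg/L for fluoride, 10 mg/L for nitrate) come from this article.

```python
import math

def entropy_weights(samples):
    """Return one entropy-based weight per parameter (column) of `samples`."""
    m = len(samples)
    diversities = []
    for j in range(len(samples[0])):
        column = [row[j] for row in samples]
        low, high = min(column), max(column)
        span = (high - low) or 1.0  # avoid division by zero for a constant column
        # Min-max normalise, then shift slightly so that no share is exactly zero.
        y = [(v - low) / span + 1e-4 for v in column]
        total = sum(y)
        shares = [v / total for v in y]
        entropy = -sum(p * math.log(p) for p in shares) / math.log(m)
        diversities.append(1.0 - entropy)
    scale = sum(diversities)
    return [d / scale for d in diversities]

def wqi(sample, standards, weights):
    """Weighted sum of quality ratings q = 100 * concentration / standard (assumed form)."""
    return sum(w * 100.0 * c / s for c, s, w in zip(sample, standards, weights))

# Made-up concentrations (mg/L) for three samples: [fluoride, nitrate].
samples = [[0.4, 2.0], [0.5, 6.0], [2.9, 10.0]]
standards = [1.5, 10.0]  # limits quoted in this article
weights = entropy_weights(samples)
scores = [wqi(s, standards, weights) for s in samples]
```

In this toy data the fluoride column is dominated by a single high value, so its entropy is lower and it receives the larger weight.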

Methods for machine learning–based feature selection techniques

In the theoretical framework of machine learning-based feature selection techniques, in particular the decision tree, random forest and correlation methods, each technique fulfils a specific role in identifying the most important features for predicting the WQI. The decision tree method works by creating a tree-like model of decisions where the importance of features is determined by how effectively they contribute to the partitioning of the data, indicating their influence on the outcome variable, in this case the WQI. The random forest approach, an ensemble of decision trees, further refines this process by constructing multiple trees and aggregating their results, thereby improving the reliability and generalisability of the feature importance scores. This method is particularly effective in dealing with overfitting and provides a more comprehensive understanding of feature relevance. Finally, the correlation method involves statistical analysis to assess the strength and direction of the relationship between each feature and the WQI. By assessing the correlation coefficients, this method helps to identify features that have a significant linear relationship with the WQI. Together, these three methods provide a robust framework for feature selection, ensuring that the most predictive and relevant features are identified for use in WQI prediction models. This multi-layered approach utilises both statistical and machine learning techniques to improve the accuracy and efficiency of water quality assessments.
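As an illustration of the correlation method only, the following sketch ranks candidate features by the absolute value of their Pearson correlation with the WQI. The feature values and WQI values are invented for the example, and the helper names `pearson` and `rank_by_correlation` are hypothetical; the study itself does not state which correlation coefficient was used, so Pearson is an assumption here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rank_by_correlation(features, target):
    """Sort features by the strength (absolute value) of their correlation."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up measurements for five samples.
features = {
    "nitrate": [2.0, 4.0, 6.0, 8.0, 10.0],
    "iron": [0.30, 0.10, 0.25, 0.10, 0.20],
}
wqi_values = [30.0, 52.0, 69.0, 95.0, 110.0]
ranking = rank_by_correlation(features, wqi_values)
```

The sign of each coefficient is kept in the output so that the direction of the relationship stays visible alongside its strength.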

Selection and optimisation of ML models in H2O API for assessing water quality

The selection of the GBM, DNN and RF models in the H2O API for this water quality assessment study is based on their different capabilities in dealing with complex, non-linear data patterns commonly found in environmental datasets. These models are favoured due to their robustness, their ability to handle a large number of input features and their resistance to overfitting, which makes them particularly suitable for water quality analysis. Grid search optimisation is chosen for these models to systematically explore a wide range of hyperparameter combinations to determine the optimal model configuration. This approach is crucial for improving model performance and prediction accuracy. The H2O API is used in this study due to its user-friendly interface, scalability and efficient handling of large datasets for optimising and deploying ML models. It provides a comprehensive environment that simplifies the implementation of complex models and the grid search optimisation process, making it an ideal choice for this application.
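The grid search loop itself is simple to sketch. The example below is not the H2O API: it swaps in a tiny k-nearest-neighbour regressor as a stand-in model so that the block stays self-contained, and the hyperparameter names (`k`, `weighting`), the data and the function names are all hypothetical. What it shows is the exhaustive enumeration of every combination in the grid and the selection of the one with the lowest validation error.

```python
from itertools import product

def knn_predict(train, x, k, weighting):
    """Predict y at x from the k nearest training points (stand-in model)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    if weighting == "uniform":
        return sum(y for _, y in nearest) / k
    # Inverse-distance weighting; the small offset avoids division by zero.
    w = [1.0 / (abs(px - x) + 1e-9) for px, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)

def grid_search(train, valid, grid):
    """Evaluate every hyperparameter combination and keep the best one."""
    names = list(grid)
    results = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        mse = sum((knn_predict(train, x, **params) - y) ** 2
                  for x, y in valid) / len(valid)
        results.append((params, mse))
    best_params, best_mse = min(results, key=lambda r: r[1])
    return best_params, best_mse, results

# Made-up (feature, WQI) pairs.
train = [(1.0, 31.0), (2.0, 38.0), (3.0, 52.0), (4.0, 57.0), (5.0, 71.0), (6.0, 80.0)]
valid = [(1.5, 35.0), (3.5, 55.0), (5.5, 75.0)]
grid = {"k": [1, 3, 5], "weighting": ["uniform", "distance"]}
best_params, best_mse, results = grid_search(train, valid, grid)
```

The number of model fits grows as the product of the grid sizes (here 3 × 2 = 6), which is why grid search becomes expensive for models with many hyperparameters.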

GBM

The GBM model in H2O is an ensemble learning method that sequentially builds a series of decision trees, where each tree is designed to correct the errors of its predecessor (Talukdar et al. 2023a). This method combines the predictions from multiple trees to produce a final, more accurate prediction (Ahmed et al. 2024) (see Eq. 1). GBM is particularly effective in assessing water quality due to its ability to model complex interactions between parameters and its high prediction accuracy. The iterative nature of GBM allows it to focus on difficult-to-predict instances, making it highly adaptable to varying water quality data patterns.

$$GBM\left(x\right)=\sum_{t=1}^{T}{\gamma }_{t}{h}_{t}(x)$$

(1)

where ${h}_{t}(x)$ represents the decision trees and ${\gamma }_{t}$ are the weights for each tree.
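Eq. 1 can be made concrete with a toy booster. The sketch below assumes squared-error loss, one-split regression trees (stumps) as the $h_t$, a single input feature and a constant weight $\gamma_t = \gamma$ for every tree; these are simplifying assumptions for illustration and do not reproduce H2O's GBM. Each new stump is fitted to the residuals left by the trees before it, and the prediction is the weighted sum in Eq. 1.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single feature."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def fit_gbm(xs, ys, n_trees=50, gamma=0.1):
    """Return a list of (gamma_t, h_t) pairs built sequentially on residuals."""
    trees = []
    preds = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        trees.append((gamma, h))
        preds = [p + gamma * h(x) for p, x in zip(preds, xs)]
    return trees

def gbm_predict(trees, x):
    """Eq. 1: sum over t of gamma_t * h_t(x)."""
    return sum(g * h(x) for g, h in trees)

def train_mse(trees, xs, ys):
    return sum((gbm_predict(trees, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up single-feature data (e.g. a concentration and a WQI value).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [30.0, 34.0, 41.0, 55.0, 62.0, 80.0, 95.0, 120.0]
mse_1 = train_mse(fit_gbm(xs, ys, n_trees=1), xs, ys)
mse_50 = train_mse(fit_gbm(xs, ys, n_trees=50), xs, ys)
```

Comparing `mse_1` with `mse_50` shows the training error shrinking as more trees are added, which is the error-correcting behaviour described in the text.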

DNN

DNNs in the H2O framework consist of multiple layers of interconnected nodes or neurons, each designed to progressively extract and refine features from the input data. The transformation in each layer of a DNN can be mathematically represented as:

$${a}_{i+1}=f({w}_{i}\cdot {a}_{i}+{b}_{i})$$

(2)

where ${a}_{i}$ are the activations from the previous layer, ${w}_{i}$ and ${b}_{i}$ represent the weight matrix and bias vector of the current layer, respectively, and $f$ denotes a non-linear activation function, such as ReLU or sigmoid. This structure allows DNNs to capture intricate relationships within the data, modelling complex patterns that are not readily apparent (Ahmed et al. 2024). The depth of the network (number of layers) and the number of nodes in each layer can be adjusted to suit the complexity of the dataset, with deeper networks generally being more capable of learning nuanced features of the data (Talukdar et al. 2023a).
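A forward pass that applies Eq. 2 layer by layer can be sketched as follows. The weights, biases and input are arbitrary numbers chosen so that the arithmetic is easy to follow by hand; they are not taken from the study, and the function names are hypothetical.

```python
def relu(values):
    return [max(0.0, z) for z in values]

def identity(values):
    return values

def dense(w, a, b, f):
    """One layer of Eq. 2: f(w . a + b) for weight matrix w and bias vector b."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
         for row, b_i in zip(w, b)]
    return f(z)

def forward(layers, x):
    """Feed the input through every (w, b, f) layer in turn."""
    a = x
    for w, b, f in layers:
        a = dense(w, a, b, f)
    return a

# Two inputs -> two hidden units (ReLU) -> one output (identity).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[2.0, 1.0]], [0.5], identity),
]
output = forward(layers, [3.0, 1.0])
```

For the input `[3.0, 1.0]` the hidden activations are `[2.0, 1.0]` and the output is `[5.5]`, which can be checked directly against Eq. 2.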

RF

RF, as implemented in H2O, operates as an ensemble of decision trees, each constructed from a randomly selected subset of the training data and features, according to the bagging approach. The final prediction of the RF model is typically the average of the predictions from all the individual trees, which can be represented as:

$$RF\left(x\right)=\frac{1}{T}\sum_{t=1}^{T}{h}_{t}(x)$$

(3)

where ${h}_{t}(x)$ are the individual trees. Each tree is built independently, and the random selection of features and samples for each tree reduces variance and avoids overfitting, making RF particularly robust against model bias and variance issues (Palkar et al. 2022). This methodology not only enhances the stability and accuracy of the predictions but also makes RF an excellent tool for handling datasets with a high dimensional feature space.
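The averaging in Eq. 3 can be illustrated with a deliberately small forest. In this sketch each tree is a one-split stump grown on a bootstrap sample and on one randomly chosen feature; real random forests, including H2O's, grow much deeper trees and sample features at every split, so this is a simplified stand-in. The data and function names are made up.

```python
import random

def fit_stump(rows, ys, feature):
    """Least-squares stump on one feature of the given rows."""
    best = None
    for thr in sorted(set(r[feature] for r in rows))[:-1]:
        left = [y for r, y in zip(rows, ys) if r[feature] <= thr]
        right = [y for r, y in zip(rows, ys) if r[feature] > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # the bootstrap sample had a single distinct value
        mean = sum(ys) / len(ys)
        return lambda row: mean
    _, thr, lm, rm = best
    return lambda row: lm if row[feature] <= thr else rm

def fit_forest(rows, ys, n_trees=25, seed=0):
    """Bagging: each tree sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feature = rng.randrange(len(rows[0]))
        trees.append(fit_stump([rows[i] for i in idx], [ys[i] for i in idx], feature))
    return trees

def forest_predict(trees, row):
    """Eq. 3: the mean of the individual tree predictions."""
    return sum(h(row) for h in trees) / len(trees)

# Made-up rows: [nitrate, TDS] with a WQI value for each.
rows = [[2.0, 300.0], [4.0, 450.0], [6.0, 500.0], [8.0, 700.0], [10.0, 900.0], [12.0, 950.0]]
ys = [30.0, 45.0, 52.0, 78.0, 96.0, 110.0]
trees = fit_forest(rows, ys)
prediction = forest_predict(trees, [7.0, 600.0])
```

Because every stump predicts a mean of observed targets, the averaged prediction always stays within the range of the training values.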

Assessment of the performance of ML models

When evaluating the performance of machine learning models for water quality prediction, a comprehensive set of metrics is used, each providing a unique perspective on the accuracy and reliability of the model. Mean square error (MSE) and root mean square error (RMSE) are used to quantify the mean square difference and square root of this difference between the predicted and actual values, respectively, providing insight into the overall prediction error of the model. The mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions without considering their direction. The Root Mean Squared Logarithmic Error (RMSLE) is particularly useful when dealing with exponential growth as it evaluates the logarithmic difference between the predicted and actual values, making it less sensitive to large errors in predicting higher values. Similar to the MSE, the mean residual deviance is a measure of the variance of the prediction errors and indicates the deviation of the model from the observed data. The coefficient of determination (R²) indicates the proportion of the variance in the dependent variable that can be predicted by the independent variables and is therefore a measure of the explanatory power of the model. In addition, the Taylor diagram is used as a visual tool to compare the statistical summary of the model’s performance, including its correlation, standard deviation and RMSE, with the observed data. This comprehensive approach to performance evaluation ensures a thorough assessment of the accuracy, reliability and suitability of the model for predicting water quality.
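For reference, the scalar metrics named above can be computed as in the sketch below. The formulas are the standard textbook definitions; RMSLE is written with `log(1 + value)`, which is the usual convention and is assumed here. The observed and predicted values are made up.

```python
import math

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmsle(actual, predicted):
    """Root mean squared difference of log(1 + value); inputs must exceed -1."""
    return math.sqrt(sum((math.log1p(p) - math.log1p(a)) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up observed and predicted WQI values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = {
    "MSE": mse(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "RMSLE": rmsle(actual, predicted),
    "R2": r_squared(actual, predicted),
}
```

With these numbers the squared errors are 4, 4, 9 and 1, giving an MSE of 4.5, an MAE of 2.0 and an R² of 1 − 18/500 = 0.964.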

Interpreting optimised ML models through XAI

Interpreting optimised machine learning models in the H2O API, in particular GBM, DNN and RF, through XAI is critical to understanding the decision-making process of these models. XAI provides transparency and insight into the complex workings of these advanced models and makes them more interpretable for users and stakeholders (Talukdar et al. 2023b; Ahmed et al. 2023, 2024; Mia et al. 2023). To this end, the DALEX package in R is used, which provides a set of tools and methods to explain and understand the behaviour and predictions of machine learning models. Using the features of DALEX, we can analyse the models to gain a comprehensive understanding of how they work, to understand how different features influence the predictions, and to identify possible biases or inconsistencies.

Model diagnostic

Model diagnosis in XAI involves evaluating the performance and reliability of the optimised ML models. This is primarily done by residual analysis, which analyses the difference between the observed values and the predictions of the model (residuals). Residual analysis helps to identify patterns or anomalies in the predictions, such as systematic biases, overfitting or underfitting. By analysing these residuals, we gain insights into the accuracy of the model in different data segments and can diagnose problems that could affect the performance of the model. This diagnostic process is important to validate the effectiveness of the model and ensure that it generalises well to new, unseen data.
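A very small residual check along these lines might look as follows. It is a sketch rather than the DALEX diagnostic used in the study: it computes residuals as observed minus predicted, reports their mean as a rough indicator of systematic bias, and flags samples whose residual lies more than `z` standard deviations from that mean. The threshold, the function name and the data are hypothetical.

```python
import statistics

def residual_report(observed, predicted, z=2.0):
    """Summarise residuals and flag unusually large ones."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    flagged = [i for i, r in enumerate(residuals) if abs(r - mean) > z * sd]
    return {"residuals": residuals, "mean": mean, "sd": sd, "flagged": flagged}

# Made-up WQI values; the last sample is badly under-predicted.
observed = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 130.0, 300.0]
predicted = [52.0, 58.0, 71.0, 79.0, 92.0, 99.0, 111.0, 118.0, 131.0, 200.0]
report = residual_report(observed, predicted)
```

Here the single large residual both shifts the mean away from zero and is the only sample flagged, the kind of pattern that would prompt a closer look at extreme WQI values.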

Model parts

The ‘model parts’ aspect of XAI focuses on understanding the contribution of each feature to the model’s predictions. This is achieved through permutation-based feature importance, a technique that assesses the impact of reshuffling each feature on the accuracy of the model. By randomly permuting the values of each feature and measuring the change in the model’s performance, we can determine the importance of each feature to the predictions. This method provides a ranking of the features based on their importance and helps to identify the most influential variables in the model and understand their role in the prediction process.
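The permutation procedure can be sketched independently of any particular library. In the example below the "model" is a hypothetical function that uses only the first feature, so shuffling the second feature cannot change the error; the loss is MSE, and the number of repeats and the seed are arbitrary choices made for the illustration.

```python
import random

def model_mse(model, rows, ys):
    return sum((model(r) - y) ** 2 for r, y in zip(rows, ys)) / len(rows)

def permutation_importance(model, rows, ys, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = model_mse(model, rows, ys)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(model_mse(model, shuffled, ys) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical fitted model: depends on feature 0 and ignores feature 1."""
    return 3.0 * row[0]

# Made-up rows with two features; targets follow the toy model exactly.
rows = [[1.0, 9.0], [2.0, 4.0], [3.0, 7.0], [4.0, 1.0], [5.0, 8.0], [6.0, 2.0]]
ys = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, ys)
```

The first feature receives a positive importance because shuffling it breaks the link to the target, while the ignored feature scores exactly zero.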

Model profile

Model profiling in XAI includes techniques such as Partial Dependence Plots (PDP), Accumulated Local Effects (ALE) and Individual Conditional Expectation (ICE) plots. These methods provide a deeper insight into the relationship between features and the predicted outcome. PDPs show the average impact of a feature on the model’s predictions and allow us to identify patterns and trends in how feature values influence the outcome. ALE plots provide a similar view, but focus on the local effects of features and provide a more accurate representation in the presence of correlated features. ICE plots, on the other hand, show how predictions for individual instances change as a feature is varied, providing a detailed insight into the behaviour of the model. Together, these profiling techniques provide a comprehensive picture of how different features and their interactions affect the model’s predictions, improving the interpretability of complex ML models.
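The relationship between ICE curves and the PDP can be shown with a short sketch: each ICE curve varies one feature for a single sample while holding its other features fixed, and the PDP is the pointwise average of those curves. The toy model below is hypothetical and includes an interaction term so that the ICE curves differ in slope; ALE is not implemented here.

```python
def ice_curves(model, rows, feature, grid):
    """One curve per sample: predictions as `feature` sweeps over `grid`."""
    return [[model(r[:feature] + [g] + r[feature + 1:]) for g in grid]
            for r in rows]

def partial_dependence(model, rows, feature, grid):
    """PDP: the average of the ICE curves at each grid value."""
    curves = ice_curves(model, rows, feature, grid)
    return [sum(curve[k] for curve in curves) / len(curves)
            for k in range(len(grid))]

def toy_model(row):
    """Hypothetical model with an interaction between the two features."""
    return 2.0 * row[0] + row[0] * row[1]

rows = [[1.0, 0.0], [2.0, 2.0]]   # made-up samples
grid = [0.0, 1.0, 2.0]            # values at which feature 0 is evaluated
ice = ice_curves(toy_model, rows, 0, grid)
pdp = partial_dependence(toy_model, rows, 0, grid)
```

The two ICE curves are `[0, 2, 4]` and `[0, 4, 8]`, and their average `[0, 3, 6]` is the PDP; the differing slopes are what reveal the interaction that the averaged curve alone would hide.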

Results

Assessment of water quality condition

Water quality was assessed by measuring physicochemical parameters such as TDS, conductivity, pH and concentrations of ions such as nitrates, sulphates, chlorides, iron and magnesium. The descriptive statistics show considerable variability. For example, the TDS value showed a high standard deviation of 479.47, which indicates varying mineral content across the samples (supplementary Section 1). The confidence intervals for pH (7.61–7.92) indicate consistent, slightly alkaline conditions, while wider intervals for TDS (583.89–822.59) and conductivity (1045.30–1397.61) reflect greater variability in these readings. Such discrepancies indicate different water sources and possible contaminants. Probability distribution plots further illustrate this variability: ammonia and nitrite levels are predominantly low, as indicated by right-skewed distributions, while broader distributions for TDS and conductivity indicate a range of dissolved concentrations (Supplementary Fig. 1). Parameters such as pH show narrow peaks, indicating uniformity in the samples, while flatter distributions for TDS and conductivity indicate greater variation. In terms of overall water quality impairment, parameters with a broad distribution are of particular concern, especially if they include ranges that exceed environmental or health standards. They may indicate the need for targeted water treatment procedures or further investigation of possible sources of pollution. The shape and spread of these distributions can inform water quality management decisions, such as whether to focus on general treatment methods or specific pollutants.

The descriptive statistics and distributions emphasise the challenges of predicting overall water quality from isolated parameters alone. Therefore, we calculated the WQI using the entropy weighting method, which is based on WHO standards. The WHO sets acceptable limits such as 0.50 mg/L for ammonia (weight 0.05), 0.10 mg/L for nitrite (weight 0.09) and 10 mg/L for nitrate (weight 0.06) to ensure water safety. TDS has a higher allowable limit of 600 mg/L (weight 0.02), indicating the ubiquitous presence of TDS in water, while the limits for chloride and sulphate are both set at 250 mg/L (weight 0.03). Total hardness has a limit of 500 mg/L (weight 0.02), and the limit for total calcium is 75 mg/L with a lower weight of 0.02. Magnesium and iron have limits of 30 mg/L (weight 0.05) and 0.30 mg/L (weight 0.05), respectively, reflecting their moderate impact on the WQI. Fluoride has the highest weighting of 0.26, with a limit of 1.50 mg/L, as it has a significant impact on health at varying concentrations. The standard for alkalinity is 80 mg/L (weighting 0.03), while conductivity is set at 1000 µS/cm (weighting 0.01). The standard for pH is 8.50 (weight 0.00), which means that it has less direct impact. Turbidity and residual chlorine are weighted at 0.17 and 0.10, with standards of 5.00 NTU and 0.50 mg/L respectively, emphasising their importance in determining water clarity and microbial safety. These parameter weights, reflecting the degree of impact of each parameter on overall water quality, are used to calculate the WQI, which has a mean of 188.14 and a high standard deviation of 794.92, indicating considerable variation in water quality between samples. The WQI was then categorised into five categories—excellent, good, poor, unsuitable and very poor—as in Alam et al. (2021). 
The distribution analysis showed that over 35% of the samples fell into the unsuitable category, indicating the need for treatment, while less than 10% were very poor, indicating heavy contamination. Conversely, around 40% of samples were categorised as excellent or good and 15% as poor, illustrating how water quality, and hence treatment requirements, vary between regions (Fig. 2).
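The weighting and aggregation steps can be sketched in a few lines. The example below assumes the commonly used form of the entropy weight method (min-max normalisation, Shannon entropy per parameter, weights proportional to one minus entropy) and a weighted arithmetic index in which each quality rating is 100 times the concentration divided by its limit. The exact formulas used in the study may differ, pH in particular is often rated relative to an ideal value of 7 rather than by a direct ratio, and the function names are hypothetical. The limits and weights in `STANDARDS` are those quoted in the text.

```python
import math

def entropy_weights(samples):
    """Entropy weights for a list of rows (one row per sample, one column
    per parameter). Requires at least two samples."""
    n = len(samples)
    m = len(samples[0])
    divergences = []
    for j in range(m):
        col = [row[j] for row in samples]
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0
        # Min-max normalise, then shift slightly so every proportion is > 0
        shifted = [(v - lo) / span + 1e-4 for v in col]
        total = sum(shifted)
        probs = [v / total for v in shifted]
        entropy = -sum(p * math.log(p) for p in probs) / math.log(n)
        divergences.append(1.0 - entropy)   # low entropy -> high weight
    norm = sum(divergences)
    return [d / norm for d in divergences]

def wqi(sample, limits, weights):
    """Weighted arithmetic index: sum of weight * (100 * concentration / limit)."""
    return sum(w * 100.0 * c / s for c, s, w in zip(sample, limits, weights))

# Limits and weights as quoted in the text: parameter -> (limit, weight)
STANDARDS = {
    "ammonia": (0.50, 0.05), "nitrite": (0.10, 0.09), "nitrate": (10.0, 0.06),
    "tds": (600.0, 0.02), "chloride": (250.0, 0.03), "sulphate": (250.0, 0.03),
    "total_hardness": (500.0, 0.02), "total_calcium": (75.0, 0.02),
    "magnesium": (30.0, 0.05), "iron": (0.30, 0.05), "fluoride": (1.50, 0.26),
    "alkalinity": (80.0, 0.03), "conductivity": (1000.0, 0.01), "ph": (8.50, 0.00),
    "turbidity": (5.00, 0.17), "residual_chlorine": (0.50, 0.10),
}
limits = [s for s, _ in STANDARDS.values()]
weights = [w for _, w in STANDARDS.values()]

# A hypothetical sample sitting exactly at every limit scores 100 * sum(weights)
at_limit = wqi(limits, limits, weights)
```

Under this formulation, a sample that exceeds the limit of a heavily weighted parameter such as fluoride or turbidity raises the index far more than the same relative exceedance in conductivity or pH.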

Analysis of feature selection

Feature selection is a crucial step in both machine learning and deep learning modelling, as it improves model performance by reducing complexity, preventing overfitting and increasing computational efficiency. By identifying and retaining only the most informative features, models can achieve higher accuracy with simpler, more interpretable results. We therefore used three feature selection approaches in this study: correlation of all variables with WQI, a decision tree model and a random forest model (Fig. 3). In the correlation analysis (panel a), the five most influential parameters are residual chlorine, nitrate, TDS, sulphate and total hardness, which all show strong positive correlations with WQI, suggesting that they are significant predictors of water quality. The parameters with the lowest correlations are ammonia, fluoride, alkalinity, total calcium and iron, indicating a weaker linear relationship with WQI. For the decision tree model (panel b), the parameters with the highest importance values are nitrite, turbidity, sulphate, magnesium and TDS. These parameters are the most critical in determining the splits in the decision tree and therefore have a large influence on the model’s predictions. Conversely, the parameters that contribute least to the splits are iron, alkalinity, total calcium, fluoride and ammonia. Finally, for the random forest model (panel c), the left graph shows that the five parameters whose omission increases the mean squared error the most (i.e. the most important) are residual chlorine, fluoride, pH, nitrate and TDS. This indicates that omitting these features significantly degrades the performance of the model. The graph on the right shows IncNodePurity, where the most important factors for node purity are residual chlorine, nitrate, TDS, pH and conductivity. 
Based on these quantitative assessments, ammonia was found to have minimal impact on model performance and was therefore removed from the dataset for further WQI assessment with DL models. The low importance of the features and the minimal impact on model accuracy and node purity justify the exclusion of ammonia, allowing the models to focus on parameters with stronger predictive relationships to WQI.
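The correlation-based ranking (panel a) is the simplest of the three screens to reproduce. The following is a minimal sketch, assuming Pearson correlation and ranking by absolute value; the function names and the toy values are hypothetical, and the tree-based importances from panels b and c are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def rank_by_correlation(features, target):
    """features: dict of name -> list of values.
    Returns [(name, r), ...] sorted by |r| in descending order."""
    scores = {name: pearson(vals, target) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical toy values, for illustration only (not data from the study)
wqi_values = [40.0, 75.0, 120.0, 160.0, 230.0]
features = {
    "nitrate": [2.0, 4.5, 7.0, 9.5, 14.0],       # rises with the index
    "ammonia": [0.10, 0.05, 0.12, 0.04, 0.09],   # little linear relation
}
ranking = rank_by_correlation(features, wqi_values)
```

A correlation screen only captures linear association, which is one reason to cross-check it against tree-based importances before discarding a parameter.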

Implementation of ML models in H2O API for assessing water quality

GBM, DNN and RF algorithms were used in the implementation of ML models within the H2O API framework to assess the WQI. The H2O API facilitates the streamlined application of these complex algorithms and enables efficient optimisation and assessment of WQI, which is crucial for the development of a data-driven system for fast and accurate water quality monitoring. This approach not only expedites the processing of large data sets but also provides a scientific interface for in-depth analyses that support more informed decision making. By utilising the computing power and user-friendly features of the H2O API, the application of these water quality assessment models becomes more accessible, enabling continuous improvements in environmental management.

Optimization of ML models using grid search algorithm

The grid search functionality of the H2O API enables the identification of optimal hyperparameters for robust ML models, which is essential for WQI assessments. For the GBM model, a grid search across 36 hyperparameter combinations (e.g. balance_classes, col_sample_rate, max_depth) identified the best model, gbm_grid1_model_38, based on the lowest RMSE. This model, shown in detail in Supplementary Fig. 2, consists of 658 trees with depths of 3 to 6 and balances complexity against fit to the training data. The DNNs underwent a more extensive optimisation that tested 97 hyperparameter variations, including activation, epochs, hidden layers and L1 and L2 regularisation (Supplementary Fig. 3). The optimal DNN model, DNNs_model_81, was characterised by an architecture with 206 weights/biases, 15 input neurons and multiple hidden layers of 5 neurons each, using a Tanh activation function. This configuration reflects the depth of the model and its tailored complexity for accurate WQI prediction. Meanwhile, the RF model was tested in 389 trials, optimising parameters such as max_depth, mtries, ntrees and sample_rate, without any errors occurring (Supplementary Fig. 4). The best RF model, RF_grid1_model_79, configured with 50 trees and a maximum depth of 18, is able to capture complex, non-linear patterns in the dataset. These optimisation efforts for the GBM, DNN and RF models ensure a data-driven, science-based approach to rapid WQI assessment. The optimal hyperparameters identified by the H2O API grid search improve model performance and promote accurate water quality monitoring that supports informed environmental decision making.
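The selection logic behind such a search (train one model per hyperparameter combination, score each on held-out data, keep the lowest RMSE) can be sketched generically. The example below does not use H2O and is not the study's code: it is an exhaustive grid search in plain Python, with a toy one-dimensional k-nearest-neighbour regressor standing in for GBM, DNN or RF. All names, the grid and the data are hypothetical.

```python
import itertools
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest (x, y) training pairs."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    # Inverse-distance weights; the small constant avoids division by zero
    w = [1.0 / (abs(px - x) + 1e-9) for px, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)

def grid_search(param_grid, train, valid):
    """Evaluate every combination in param_grid and return a list of
    (validation RMSE, params) tuples sorted from best to worst."""
    names = list(param_grid)
    results = []
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        preds = [knn_predict(train, x, **params) for x, _ in valid]
        score = rmse([y for _, y in valid], preds)
        results.append((score, params))
    results.sort(key=lambda r: r[0])
    return results

# Hypothetical toy data: y = 2x with alternating noise in the training set
train = [(float(x), 2.0 * x + (1.0 if x % 2 else -1.0)) for x in range(20)]
valid = [(x + 0.5, 2.0 * (x + 0.5)) for x in range(0, 20, 3)]

grid = {"k": [1, 2, 4, 8], "weighted": [False, True]}
results = grid_search(grid, train, valid)
best_rmse, best_params = results[0]
```

The number of models grows as the product of the option counts per hyperparameter (here 4 × 2 = 8), which is why larger grids are often searched with a budget or a random sampling strategy instead.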

To validate the model optimization, the learning curves for the three best models—gbm_grid1_model_38, DNNs_model_81 and RF_grid1_model_79—were analysed (Fig. 4). For gbm_grid1_model_38, the curve stabilises at 600 trees, with training and cross-validation error rates close to 30%, indicating an optimal number of trees without overfitting. DNNs_model_81 shows a decrease in error rates and reaches a plateau at 80 epochs, indicating effective learning. RF_grid1_model_79 stabilises the error reduction at around 40 trees, confirming the adequate complexity and generalizability of the model. These curves show that the models are well tuned and capture the necessary data patterns without being too specific to the training set.

Assessment of optimised ML models

The performance assessment of machine learning models is a crucial step to ensure their reliability and effectiveness in predictive tasks. In the context of water quality analysis, evaluating the accuracy and generalisability of models such as GBM, DNN and RF is essential to determine their practical utility in predicting WQI from various water parameters (Table 1).

Table 1 Statistical analysis of model performance in predicting WQI using GBM, DNN and RF models

The GBM model (gbm_grid1_model_38) was evaluated using various regression metrics. The performance of the model on the training data shows an MSE of 501.84, RMSE of 22.40, MAE of 8.89, RMSLE of 0.16 and mean residual deviance of 501.84. These results indicate that the model fits the training data well, with a relatively low error rate. However, when evaluating the validation data, the error metrics are higher, with an MSE of 1868.66, RMSE of 43.23, MAE of 19.33, RMSLE of 0.26 and mean residual deviance of 1868.66. The increase in error rates in the validation set compared to the training set indicates that the model may not generalise as effectively to new, unseen data, which could be a sign of overfitting. Cross-validation, a more robust metric, reports an MSE of 1723.09, RMSE of 41.51 and MAE of 18.11. The summary of the cross-validation metric shows variation in the model’s performance across different folds, with an RMSE ranging from 19.92 to 65.39 and an R² (coefficient of determination) metric ranging from 0.56 to 0.87, highlighting some variability in the model’s predictive accuracy. The average MAE value of 18.73 and the RMSE value of 38.52 from the cross-validation are higher than the values from the training data, but lower than the values from the validation data. This suggests that while the model may be slightly overfitting, it still has a reasonable degree of predictive power that could be applicable to new data, especially when considering the mean R² value of 0.64, which suggests that a good proportion of the variance is explained by the model. The slight overfitting observed does not preclude the use of the model for new data, but indicates that the predictions of the model should be considered with an awareness of its limitations and potential for error.
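The metrics reported for each model are related in simple ways: RMSE is the square root of MSE (for example, √501.84 ≈ 22.40), and the mean residual deviance matching the MSE is consistent with a squared-error (Gaussian) loss, under which the two coincide. The following is a minimal sketch of the standard definitions; the function name and the toy values are hypothetical and are not taken from the study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Standard regression metrics. RMSLE requires all values > -1."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Mean squared difference of log(1 + value)
    msle = sum((math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot   # 1 - SS_res / SS_tot
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "rmsle": math.sqrt(msle),
        "r2": r2,
    }

# Hypothetical observed and predicted index values, for illustration only
observed = [50.0, 100.0, 150.0, 200.0]
predicted = [60.0, 90.0, 160.0, 190.0]
metrics = regression_metrics(observed, predicted)
```

Because MSE and RMSE square the residuals, a few badly predicted samples with very high index values can dominate them, whereas MAE and RMSLE are less sensitive to such extremes.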

The DNN model (DNNs_model_81) shows different performance levels on the training, validation and cross-validation data sets. For the training data, the model has an MSE of 118.91, an RMSE of 10.90, an MAE of 4.99, an RMSLE of 0.12 and a mean residual deviance of 118.91. These metrics indicate strong performance on the training set with relatively low error rates, suggesting that the model has learnt to fit the training data effectively. On the validation data, the error increases, with an MSE of 354.97, an RMSE of 18.84, an MAE of 9.34, an RMSLE of 0.13 and a mean residual deviance of 354.97. Although the error is higher than in training, it remains low, so the predictive ability of the DNN model for unseen data is still very good. The cross-validation results, which allow a more robust assessment by training the model on multiple folds of the data, show further increased error rates: an MSE of 601.77, RMSE of 24.53, MAE of 11.69 and RMSLE of 0.31, with a mean residual deviance of 601.77. The summary of cross-validation metrics illustrates the variability in the model’s performance across different folds, with RMSE ranging from 9.84 to 35.55 and R² values ranging from 0.77 to 0.90. The mean cross-validation RMSE of 22.02 and MAE of 11.32 are higher than the training metrics, but not excessively so, indicating that the model maintains its predictive ability across different subsets of the data.

The RF model (RF_grid1_model_79) shows different performance metrics for the training, validation and cross-validation data. On the training data, the RF model reports an MSE of 943.01, an RMSE of 30.71, an MAE of 10.39, an RMSLE of 0.27 and a mean residual deviance of 943.01. These figures suggest that, when scored on the training data (in particular on out-of-bag samples), the model has a moderate level of error, which is to be expected with a diverse dataset. On the validation data, the performance of the model improves, with a lower MSE of 532.33, RMSE of 23.07, RMSLE of 0.22 and mean residual deviance of 532.33, although the MAE of 12.80 is slightly higher than in training. The reduction in MSE and RMSE in the validation set compared to the training set is atypical, as models typically perform better on the training data. This could indicate that the model is very robust and does not overfit, as it maintains its performance when exposed to unseen data. The cross-validation metrics provide a comprehensive assessment of the generalisability of the model. In a fivefold cross-validation of the training data, the model shows an MSE of 980.10, an RMSE of 31.31, an MAE of 11.63 and an RMSLE of 0.27, with a mean residual deviance of 980.10. The summary of the cross-validation metrics shows some variability in the model’s performance, with an average RMSE of 24.62 and a standard deviation of 17.65, indicating that the model’s prediction error can fluctuate but generally maintains a consistent level of performance across different data subsets. The R² values, which indicate the proportion of variance explained by the model, range from 0.75 to 0.97 and average 0.86, indicating that the model captures a substantial proportion of the variance in the data. The RF model RF_grid1_model_79 thus shows robust performance with consistency across training and cross-validation datasets, with a slight increase in error rates in cross-validation. 
The ability of the model to maintain a relatively stable error rate across different subsets of the data, without a significant increase in error rate during cross-validation, suggests that the model is not overfitting and is likely to perform well on new data. The high R² values also speak in favour of the model’s ability to reliably predict water quality.
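For each model, two cross-validation RMSE figures are quoted: an overall value (41.51 for GBM, 24.53 for DNN, 31.31 for RF) and a mean across folds (38.52, 22.02 and 24.62, respectively). One plausible explanation, stated here as an assumption rather than a confirmed detail of the study, is that the first pools all held-out predictions before computing the metric, while the second averages the per-fold values. A minimal sketch with hypothetical residuals shows how the two can differ:

```python
import math

def fold_rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def cv_rmse_summary(fold_errors):
    """fold_errors: one list of held-out residuals per fold."""
    per_fold = [fold_rmse(errs) for errs in fold_errors]
    # Pooled: a single RMSE over every held-out residual
    pooled = fold_rmse([e for errs in fold_errors for e in errs])
    # Mean of folds: average of the per-fold RMSE values
    mean_of_folds = sum(per_fold) / len(per_fold)
    return {"per_fold": per_fold, "mean_of_folds": mean_of_folds, "pooled": pooled}

# Hypothetical residuals for five equally sized folds, one of them much harder
folds = [
    [5.0, -5.0, 5.0, -5.0],
    [10.0, -10.0, 10.0, -10.0],
    [20.0, -20.0, 20.0, -20.0],
    [8.0, -8.0, 8.0, -8.0],
    [60.0, -60.0, 60.0, -60.0],
]
summary = cv_rmse_summary(folds)
```

With equally sized folds, the pooled RMSE is never smaller than the mean of the per-fold RMSEs, and the gap widens when one fold is much harder than the rest, which matches the direction of the differences reported for all three models.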

Comparisons of the performance of ML models

The performance of the ML models in predicting water quality indices was compared statistically using the scatter heat map (Fig. 5) and the Taylor diagram (Fig. 6). For the GBM model (gbm_grid1_model_38), an R² of 0.92 for training indicates that 92% of the variability in the training dataset is captured by the model, dropping slightly to 0.90 in the testing phase. This drop means that although the GBM model is robust, there may be some overfitting, as the model is slightly less effective at predicting unseen data. The DNN model (DNNs_model_81) has R² values of 0.98 for training and 0.97 for the test phase, indicating exceptional performance and generalisation from training to unseen data. The minimal decrease in the R² value from training to testing indicates that the DNN model captured the underlying patterns in the data very well without overfitting. The RF model (RF_grid1_model_79) has the highest R² of 0.99 during training, which indicates almost perfect predictability. During testing, however, the R² drops to 0.96, which is a slight decrease, but still indicates a highly predictive model with strong generalisation capabilities.

A Taylor diagram provides a visual summary of several aspects of model performance by simultaneously displaying the standard deviation of the model predictions (radial distance from the origin), their correlation coefficient with the observed values (azimuthal angle) and the centred RMSE (contours around the ‘observed’ point) (Fig. 6). The DNN and RF models, which have the highest R² values in both training and testing, lie closer to the ‘observed’ point in the diagram, indicating better performance. The GBM model still performs well, but is slightly further away due to its lower R² values.
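The three statistics shown in a Taylor diagram are linked by a law-of-cosines identity, E′² = σ_p² + σ_o² − 2σ_pσ_oR, which is what allows them to share a single two-dimensional plot. The following is a minimal sketch of how they can be computed; the function name and the toy values are hypothetical and are not taken from the study.

```python
import math

def taylor_stats(observed, predicted):
    """Standard deviations, correlation and centred RMSE for a Taylor diagram.
    Population statistics (divide by n) are used so the identity holds exactly."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    sd_obs = math.sqrt(sum((o - mo) ** 2 for o in observed) / n)
    sd_pred = math.sqrt(sum((p - mp) ** 2 for p in predicted) / n)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted)) / n
    corr = cov / (sd_obs * sd_pred)
    # Centred RMSE: RMSE after removing each series' mean (bias is ignored)
    crmse = math.sqrt(
        sum(((p - mp) - (o - mo)) ** 2 for o, p in zip(observed, predicted)) / n
    )
    return {"sd_obs": sd_obs, "sd_pred": sd_pred, "corr": corr, "crmse": crmse}

# Hypothetical observed and predicted index values, for illustration only
observed = [40.0, 75.0, 120.0, 160.0, 230.0]
predicted = [50.0, 70.0, 110.0, 170.0, 215.0]
stats = taylor_stats(observed, predicted)

# Law-of-cosines identity that underlies the diagram's geometry
identity_rhs = (stats["sd_pred"] ** 2 + stats["sd_obs"] ** 2
                - 2.0 * stats["sd_pred"] * stats["sd_obs"] * stats["corr"])
```

Because the centred RMSE removes the mean of each series, a model with a constant bias can still sit close to the reference point, so the diagram is best read alongside uncentred metrics such as RMSE or MAE.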

Overall, although the RF model shows a slight decrease in R² from training to testing, its high R² values indicate that it performs best and captures and predicts the variance in the water quality data well. The DNN model also shows excellent performance and is comparable to the RF model, but the slightly higher training R² of the RF model gives it an advantage. Despite its good fit, the GBM model is slightly outperformed by the other models, as its R² is lower in both training (0.92) and testing (0.90), which can be crucial in model selection for predictive tasks where the highest accuracy is required.

Interpreting optimised WQI-ML models using XAI for better decision making in water pollution management

The interpretation of optimised WQI-ML models using XAI is crucial for informed decision making in water pollution management. XAI facilitates understanding and confidence in ML models by providing insights into their decision-making processes.