Introduction

People in many countries rely heavily on groundwater for drinking, agriculture, and industrial needs (Jat Baloch et al., 2020; Rehman et al., 2019; Ullah et al., 2022c). Groundwater supplies drinking water for one-third of the world's population and is the source of freshwater in arid and semi-arid areas of Pakistan (Ghani et al., 2022; Jat Baloch et al., 2023). Because of rapid population growth and expanding agricultural and industrial activities, groundwater withdrawal has steadily increased, prompting concerns about assessing and managing groundwater resources for sustainable development (Iqbal et al., 2023b; Rashid et al., 2023; Ullah et al., 2022a). Groundwater is one of the essential water resources in Pakistan (Jat Baloch et al., 2020, 2023). However, several chemical elements, such as arsenic (As), increasingly threaten groundwater quality. Over 47 million people are currently exposed to As contamination in Pakistan (Jat Baloch et al., 2022a; Rashid et al., 2023). Thus, understanding groundwater quality is critical for effective water management and long-term sustainability (Zhang et al., 2022; Zhang et al., 2022b).

The International Agency for Research on Cancer (IARC) classifies As as a class 1 human toxic element (Tropea et al., 2021). The World Health Organization (WHO) has reduced the As level in drinkable water from 50 to 10 µg/L due to the high carcinogenic risk (Zhou et al., 2021). Water contamination is a significant concern worldwide, notably in developing countries such as China, Pakistan, India, Bangladesh, and Vietnam. The consumption of As in drinking water has affected over 2 million people worldwide (Rahman et al., 2021). Weathering, evapotranspiration, and volcanic emissions are all geological factors that influence groundwater quality. Other recent studies have found that human activities contribute significantly to groundwater contamination. Subsurface contamination is caused by activities such as petroleum refining, herbicides, pesticides, and mining (Dilpazeer et al., 2023; Li et al., 2023; Stojanović Bjelić et al., 2023; Tariq et al., 2023). As contamination in eggs, water, milk, food, and meat can result in many health problems. The ingestion of bovine milk is among the most significant sources of toxicants in the food chain (Ullah et al., 2021). Groundwater As concentration rises due to physicochemical and geochemical conditions and rock-water interaction. People can be exposed to As through various mechanisms, including breathing, drinking, and skin contact (Çiner et al., 2021). Numerous national and international cases demonstrate that drinking contaminated water endangers people's health via these interconnected pathways (Iqbal et al., 2023a; Tabassum et al., 2019). This high As concentration in drinking water may cause various health issues, including hair loss, kidney failure, and cardiovascular disease (Rashid et al., 2019). Geochemical compositions, concentration levels, and bedrock geology all have an impact on groundwater quality around the world. 
Freshwater resources are critical for all life forms and are required for the survival of life and the natural environment. Overconsumption and poor management threaten freshwater resources (Jat Baloch & Mangi, 2019). To identify trends and ensure sustainability, groundwater modeling, quality analysis, and monitoring are required.

In recent water studies, Machine Learning (ML) methods are often used to solve various issues (Hussain et al., 2022; Sahin et al., 2021; Sun & Scanlon, 2019). ML approaches generally emphasize the relationship between the model's outputs and inputs rather than the mechanisms that enable the process. Sophisticated nonlinear associations between many variables can be appropriately documented with or without previous knowledge of the investigated system by learning a massive amount of data (Abbas et al., 2023; Hussain et al., 2021; Iqbal et al., 2020; Jamil et al., 2019). The presence of F, As, and other contaminants in groundwater has thus been estimated using various ML techniques, such as Artificial neural networks (ANN) (Ahmadi et al., 2017). The Random Forest (RF) model is most widely used for regression and classification. RF has many valuable features for classification. Because RF is a non-parametric, nonlinear method, it can handle large datasets with numerical and categorical data and complicated nonlinearity and factor interactions (Ranjgar et al., 2021).

Furthermore, logistic regression defines and clarifies the relationship between one or more independent nominal, ordinal, interval, or ratio-level variables and the dependent binary variable (Erguzel et al., 2019). Many researchers used RF, ANN, and LR to forecast groundwater pollution worldwide. ANN was used in China to predict geogenic groundwater F contamination across the country (Cao et al., 2022). It was also used to predict high NO3 in the groundwater of Harran Plain, Turkey (Yesilnacar et al., 2008). In the Yinchuan Region of central China, RF was employed to forecast NO3 pollution in groundwater (He et al., 2022). The RF method is used in Southern Spain to forecast NO3 in the groundwater and factors related to intrinsic and specific susceptibility (Rodriguez-Galiano et al., 2014). In Nigeria, research has demonstrated the appropriateness of utilizing ANN models for monitoring and evaluating water quality (Egbueri, 2021). ANN and multiple linear regression (MLR) exhibited strong reliability in monitoring groundwater resources. Both models demonstrated excellent performance, with MLR (ranging from 95 to 100%) outperforming ANN (ranging from 85 to 99%) in modeling the majority of potentially toxic elements (PTEs) and water quality indices (Agbasi & Egbueri, 2023). In the southern region of Nigeria, both MLR and multilayer perceptron neural networks (MLP-NN) methodologies were used to estimate and predict water quality indices, as well as the index of pollution (OIP) and water quality index (WQI). Remarkably minimal modeling errors were observed for both approaches, signifying the models' robust and concurrent predictive capabilities (Egbueri & Agbasi, 2022a). In the context of water quality analysis in Nigeria, the recent investigation synergistically integrated various soft computing algorithms. 
The outcomes confirm that combining multiple models typically yields more robust and improved assessments than relying on an individual model (Egbueri & Agbasi, 2022b). However, these algorithms have been used independently to predict groundwater pollution, leaving open the question of which ML model best predicts groundwater contamination. The current study compares three machine learning models, RF, ANN, and LR, for predicting As in groundwater using binary classification analysis.

Pakistan is contending with a significant challenge of groundwater pollution caused by As, which has degraded groundwater quality across multiple regions, notably Punjab Province. The Pakistan Council of Research in Water Resources (PCRWR) identified elevated levels of As in Punjab's groundwater, surpassing the permissible drinking water limit set by the WHO (2022). Moreover, in Sindh Province, the consumption of arsenic-contaminated drinking water has affected 36% of the local population. Tragically, heavy metal contamination in drinking water was responsible for the loss of 40 lives in the Hyderabad district in 2004 (Ullah et al., 2021). Groundwater pollution has increased with the rapid population growth in the Indus plains of Punjab Province. Numerous studies have investigated groundwater As contamination across diverse settings, encompassing rural, urban, and peri-urban zones of Pakistan (Fatima et al., 2018; Shahid et al., 2018a). As contamination of groundwater in the Vehari district is a notable case. Shah et al. (2020) reported elevated As levels, mainly attributed to geological factors and agricultural practices. These findings echo the concerns raised by Jat Baloch et al. (2022b), highlighting the urgency of assessing health risks and implementing effective mitigation strategies.

A comprehensive analysis of potential drinking water contaminants remains imperative to safeguard the local population's well-being. Remarkably, minimal attention has been directed toward investigating As contamination and prediction within the drinking water sources of the Vehari district. Thus, this study assumes significance in pioneering: (i) an in-depth exploration of groundwater hydrogeochemistry, emphasizing the spatial distribution of arsenic contamination; (ii) an innovative approach utilizing ANN, RF, and LR classifiers to unravel determinants influencing groundwater As; (iii) an assessment of the suitability of groundwater for human consumption through the Water Quality Index (WQI). The research innovation is particularly highlighted by its pioneering use of machine learning models, a previously unexplored approach in the study area. This utilization significantly improves the accuracy of arsenic prediction, leading to a substantial enhancement in our comprehension of local water safety.

Materials and methods

Study area

Vehari holds significance as a prominent district within the Punjab Province of Pakistan. Burewala, Mailsi, and Vehari emerge within this district as key sub-districts (refer to Fig. 1). Geographically, the area is bounded by the Sutlej and Ravi rivers, positioned between coordinates 30°04′19′′ N and 72°35′28′′ E (Fig. 2). With a population of approximately three million, the Vehari district witnesses a climate characterized by scorching summers, temperatures peaking at 50 °C, and chilly winters, where temperatures can drop to about 5 °C. The summer also brings frequent dust storms, while the annual precipitation hovers around 125 mm. Groundwater is critical for agricultural and domestic needs in Vehari, Pakistan. Its accessibility promotes irrigation, which is essential to the local economy and ensures crop growth and food security. Groundwater also serves as a reliable buffer during droughts, protecting against erratic surface water availability. Significant human activities influencing groundwater quality and As levels in Vehari, Pakistan, include intensive agricultural practices involving fertilizer and pesticide use, industrial operations with potential chemical releases, insufficient waste disposal practices, and possible contamination from unregulated domestic and municipal wastewater (Jat Baloch et al., 2022b).

Fig. 1 Study area map showing the sampling location

Fig. 2 Hydrogeology map of the study area

Geology and hydrogeology

The geology and hydrogeology of the Vehari district reveal a strong relationship between borehole depths and fundamental aquifer properties, shedding light on subsurface characteristics and groundwater dynamics. The region is dominated by alluvial deposits, with borehole depths ranging from shallow to deep, revealing a stratigraphic succession of sediments such as silts, sands, and gravels. The Satluj and Ravi rivers run through the study area and serve as groundwater recharge sources (Khalid et al., 2020). The South Indus River forms the alluvial plain deposit, and its five major tributaries transport Pleistocene and Holocene sediments carried by the Ravi and Sutlej rivers (Fig. 2). The aquifer is made up of loose alluvial deposits that contain varying amounts of sand, a high percentage of fine sand and silt, and very little organic matter. Since the late Tertiary period, the Indus Rivers and streams have deposited these materials in the vast alluvial plain stretching from the Himalayan foothills to the Arabian Sea. The mineralogical evaluation identified aragonite, anhydrite, calcite, dolomite, gypsum, goethite, hematite, and halite as minerals (Ahmad et al., 2002). During the Pleistocene epoch, the Indus River sediment deposits formed a substantial 400 m thick layer. The groundwater in the Punjab region is a mix of alluvial sand and alternating silt layers. The two main aquifer systems in the hydrogeological structure are the upper unconfined aquifer and the lower confined aquifer. Borehole data provides critical insights into aquifer depths, revealing that the upper aquifer is generally at shallower depths than the confined aquifer. The properties of aquifers are crucial to understanding the region's hydrogeology. Aquifer yield is a measure of water provisioning capability, whereas storativity is a measure of water storage capacity. 
The rate of groundwater movement is significantly influenced by transmissivity, which measures the aquifer's ability to transmit water (Shahid et al., 2018b). Borehole logs also reveal various lithological structures, such as fault zones and permeable layers, significantly impacting groundwater movement and distribution. The interaction of borehole depths, aquifer properties, and lithological structures shapes the groundwater flow regime (Ali et al., 2023).

Sampling and analysis

Four hundred groundwater samples were collected from the Vehari district. These groundwater samples were obtained explicitly from drinking wells at diverse depths ranging from 50 to 400 feet (Table 2). All wells were flushed for at least 5 min to obtain fresh water before collecting groundwater samples. Groundwater samples (1000 mL each) were taken in duplicate in two separate plastic bottles having airtight caps. The samples were filtered through a 0.45 μm filter for further analysis. One water sample was acidified on-site by adding 2–3 drops of concentrated nitric acid (HNO3) to stabilize As and metal ions and reduce precipitation (Shah et al., 2020). The acidified water samples were used to analyze total As contents and other elements. The second water sample was kept non-acidified to analyze various cations and anions. The American Public Health Association's recommended procedures were implemented (Jat Baloch et al., 2022b). Using a multi-parameter analyzer, the pH, electrical conductivity (EC), total dissolved solids (TDS), total hardness (TH), and turbidity of the study area were measured in situ (Hanna HI9829). The groundwater samples were then tested in the water quality laboratory of the Pakistan Council for Research in Water Resources (PCRWR) for further analysis. The samples were examined for significant anions such as NO3, SO42−, and PO43− using a UV–VIS spectrophotometer. The concentration of F was determined using "Mohr's method and Fluoride Analyzer" ISE (ion-selective electrode) (Rashid et al., 2018a). The titration method was used to assess bicarbonate (HCO3) and chloride (Cl). Volumetric titration with ethylene diamine tetra acetic acid was utilized to determine calcium (Ca2+) and magnesium (Mg2+) concentrations. The sodium (Na+) and potassium (K+) concentrations were measured using a flame photometer (Zhou et al., 2021). 
As levels in the samples were measured using an atomic absorption spectrophotometer (AAS Vario 6, Analytik Jena, Jena, Germany) (Baloch et al., 2022). The charge balance error (CBE) of each sample was calculated, with ionic concentrations expressed in meq/L, to check the accuracy of the results. Only groundwater samples with a CBE within ±5% were retained for further examination (Jat Baloch et al., 2022a).
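The charge balance check described above can be sketched as follows. This is a minimal illustration, assuming the common form CBE = 100 × (Σcations − Σanions) / (Σcations + Σanions) with concentrations in meq/L; the text does not state the exact formula or the ions included, so the ion list, the function names, and the equivalent weights (molar mass divided by charge) are assumptions made for the example.

```python
# Equivalent weights in g/eq (molar mass / |charge|); standard chemistry values,
# not taken from the paper.
EQ_WEIGHT = {
    "Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10,
    "HCO3": 61.02, "Cl": 35.45, "SO4": 48.03, "NO3": 62.00,
}
CATIONS = ("Ca", "Mg", "Na", "K")
ANIONS = ("HCO3", "Cl", "SO4", "NO3")


def charge_balance_error(sample_mg_per_l):
    """Return the CBE in percent for one sample given as {ion: mg/L}."""
    # Convert every concentration from mg/L to meq/L.
    meq = {ion: conc / EQ_WEIGHT[ion] for ion, conc in sample_mg_per_l.items()}
    cations = sum(meq[ion] for ion in CATIONS)
    anions = sum(meq[ion] for ion in ANIONS)
    return 100.0 * (cations - anions) / (cations + anions)


def is_acceptable(sample_mg_per_l, tolerance_percent=5.0):
    """Keep a sample only if its CBE lies within +/- tolerance_percent."""
    return abs(charge_balance_error(sample_mg_per_l)) <= tolerance_percent


# Hypothetical sample with exactly 1 meq/L of every ion, so it balances.
balanced = dict(EQ_WEIGHT)
# Same sample with doubled calcium: 5 meq/L cations against 4 meq/L anions.
unbalanced = dict(EQ_WEIGHT, Ca=40.08)
```

A sample that fails `is_acceptable` would be excluded from the later statistical and modeling steps.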

Statistical and hydrochemical analysis

The statistical software XLSTAT 2021 was used to compute descriptive statistics (minimum, maximum, mean, and standard deviation) for each parameter. A Piper diagram was used to determine the hydrogeochemical type and the concentrations of major anions and cations in the water samples and to identify the geochemical processes that control groundwater quality (Ullah et al., 2022b). The Piper diagram was produced using Grapher, and a Gibbs diagram was used to assess groundwater evolution. Saturation indices were calculated using PHREEQC Interactive to measure the water's mineral balance and the reactivity of dissolved minerals.

Preprocessing of data for machine learning model

The input parameters were EC, pH, TDS, Turbidity, Hardness, Cl, HCO3−, Ca2+, Mg2+, SO42−, K+, Na+, Fe2+, NO3, and F, and the dependent variable was As. All As concentrations less than 10 μg/L were assigned a value of zero (0), and concentrations greater than 10 μg/L were assigned a value of one (1). To improve the models' speed and accuracy, the independent variables for the three algorithms were scaled between 0 and 1 (Nafouanti et al., 2021). Subsequently, the dataset was randomly partitioned into two segments: 80% for training and 20% for testing. Converting the measured As concentrations into two classes serves a scientific purpose in the modeling process. By categorizing As concentrations as below 10 μg/L (assigned as 0) or above 10 μg/L (assigned as 1), the study creates a binary classification framework that aligns with the regulatory threshold for safe drinking water. This approach offers several benefits: it simplifies the modeling task to classifying water as safe or contaminated, and it reduces the influence of noise and variability in the dataset. Moreover, it matches real-world decision-making, where the primary concern is identifying water sources whose As levels exceed permissible limits. This categorization facilitates efficient model training and convergence and supports prediction accuracy, contributing to a more practical and actionable outcome for groundwater quality assessment and management strategies.
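The preprocessing steps above can be sketched with scikit-learn. This is an illustrative outline only: the arrays are synthetic placeholders rather than the study's measurements, and the variable names and random seeds are assumptions. The order follows the text (scaling first, then a random 80/20 split).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_samples, n_features = 400, 15  # 400 samples and 15 inputs, as in the text

# Synthetic placeholders standing in for the measured inputs and As values.
X_raw = rng.uniform(0.0, 1000.0, size=(n_samples, n_features))
as_ug_per_l = rng.uniform(0.1, 52.0, size=n_samples)

# Binary target: 1 if As exceeds the 10 ug/L threshold, otherwise 0.
AS_THRESHOLD_UG_PER_L = 10.0
y = (as_ug_per_l > AS_THRESHOLD_UG_PER_L).astype(int)

# Scale every independent variable to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_raw)

# Random 80/20 partition into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.20, random_state=42
)
```

A common alternative is to fit the scaler on the training portion only and apply it to the test portion, which avoids information from the test set influencing the scaling.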

Choosing the appropriate input

Feature selection is crucial in classification because it enhances the classifier's performance while reducing computation complexity by eliminating duplicated data (Zebari et al., 2020). In this study, filter methods were used to select the relevant inputs. These approaches are faster than wrapper methods because they do not require model training. They can also link the independent and dependent variables (Coulibaly et al., 2000). The chi-squared method can create independent comparison tests (Zebari et al., 2020). For feature selection, chi-squared analysis was used to compute the chi-squared score of each class, resulting in a ranking list of all features. The numeric attributes were discretized to use the chi-squared statistic to find inconsistencies in the data (Kim, 2017). The following equation was used to calculate a feature's chi-squared score.

$$X^{2} = \mathop \sum \limits_{i = 1}^{r} \mathop \sum \limits_{j = 1}^{c} \frac{{\left( {n_{ij} - \mu_{ij} } \right)^{2} }}{{\mu_{ij} }}$$
(1)

where c is the total number of classes and r is the number of discrete intervals for the specified feature. \(n_{ij}\) is the observed frequency of groundwater samples in the ith interval and jth class.

The number of samples in the ith interval of a feature is \(n_{i} = \sum_{j = 1}^{c} n_{ij}\), the number of samples in class j is \(n_{j} = \sum_{i = 1}^{r} n_{ij}\), the total number of samples is n, and the expected frequency of \(n_{ij}\) is \(\mu_{ij} = n_{i} n_{j} / n\).

When the observed frequencies are close to the expected frequencies, the chi-squared value is small and the feature is considered independent of the class. A higher chi-squared value indicates that a variable is relevant to the outcome and should be used to train the models. The variables were selected with the "SelectKBest" function of Python's sklearn module, which keeps the k highest-scoring input variables (Table 1). The twelve (12) variables with the highest chi-squared values were chosen as the groundwater inputs for As prediction: pH, TDS, SO42−, Na+, Fe2+, Cl, HCO3, Ca2+, Mg2+, NO3, K+, and F.

Table 1 Selection of Relevant Inputs by using the Chi-Squared analysis
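The selection step can be sketched with scikit-learn's `SelectKBest` and `chi2`. The data below are synthetic placeholders and the feature labels are plain-text stand-ins for the fifteen inputs, so the ranking produced here is not the one in Table 1. Note also that scikit-learn's `chi2` treats non-negative feature values as frequencies, which may differ in detail from the interval-based statistic in Eq. (1).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

feature_names = [
    "EC", "pH", "TDS", "Turbidity", "Hardness", "Cl", "HCO3", "Ca",
    "Mg", "SO4", "K", "Na", "Fe", "NO3", "F",
]

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 500.0, size=(400, len(feature_names)))  # synthetic inputs
y = rng.integers(0, 2, size=400)                              # synthetic 0/1 As class

# chi2 requires non-negative inputs; min-max scaling guarantees that.
X_nonneg = MinMaxScaler().fit_transform(X)

# Keep the k = 12 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=12).fit(X_nonneg, y)
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

# Full ranking, highest score first.
ranking = sorted(zip(feature_names, selector.scores_), key=lambda p: p[1], reverse=True)
```

`selector.transform(X_nonneg)` would then return the reduced input matrix used for model training.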

Artificial neural networks (ANN)

Artificial neural networks have proven effective for categorization, clustering, pattern recognition, and prediction. ANN is an ML model that can outperform conventional regression and statistical models (Musa et al., 2019). ANNs are multilayered, biologically inspired computer models with input, hidden, and output layers. The primary processing unit of an ANN is the neuron, which connects all layers (Afzaal et al., 2019). The multilayer perceptron (MLP) neural networks used in this study are among the most common types of ANN. An MLP includes an input layer with source neurons, one or more hidden layers of neurons, and an output layer. The number of nodes in the input and output layers is set by the number of input and output variables (Fig. 3). The generalization potential of the network is determined by the number of hidden layers and the number of nodes per hidden layer; the network used in this study contains two such layers. Too few hidden layers and neurons may cause underperformance.

Fig. 3 Artificial neural network structure with the input variables

In contrast, too many hidden nodes may overfit the training data and generalize poorly to new inputs (Otchere et al., 2021). In this study, the "adam" optimizer was used to update the network weights. Permutation feature importance was used to identify the critical variables in the relationship between the predictors and the dependent variable. It describes the impact of eliminating a variable on network accuracy.
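A minimal sketch of such a network, using scikit-learn's `MLPClassifier` with the "adam" solver and `permutation_importance`, is given here. The data are generated synthetically, and the hidden layer sizes (16 and 8 nodes), iteration limit, and seeds are assumptions chosen for illustration; the text does not report these settings.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 12 selected inputs and the binary As class.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Two hidden layers (sizes are hypothetical), weights updated with adam.
mlp = MLPClassifier(
    hidden_layer_sizes=(16, 8), solver="adam", max_iter=1000, random_state=0
)
mlp.fit(X_train, y_train)
test_accuracy = mlp.score(X_test, y_test)

# Permutation importance: mean drop in accuracy when one input is shuffled.
perm = permutation_importance(mlp, X_test, y_test, n_repeats=10, random_state=0)
importances = perm.importances_mean
```

Inputs whose shuffling causes the largest drop in accuracy are the ones the network depends on most.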

Random forest modeling

Random forest (RF) avoids the overfitting and instability that arise when only one decision tree is used. The primary goal of RF is to generate many decision trees from random subsets of the original training dataset. The predictions of these individual trees are averaged to improve the model's generalization (Wu et al., 2020). RF classification was used in this research to predict As pollution in groundwater. To generate a training subset for every tree, a bootstrapping technique divides the training dataset into an "in-bag" subset used for decision tree training and an "out-of-bag" (oob) subset that is not used in the training process. Because each tree has its own partition, the oob samples provide an internal validation: the oob samples of each tree are used to assess its performance, and the mean of all oob predictions provides an estimate of the accuracy of the RF model. The in-bag and out-of-bag samples of a decision tree make up about 66.67% and 33.33% (2:1) of the original training dataset, respectively. After the model has been fitted with the training dataset, its performance is assessed using the test dataset. Consequently, the model produces predictions for both the training set (through the oob samples) and the test set. To evaluate model performance, metrics including mean absolute error, root mean square error, and the coefficient of determination (R2) are employed to measure the disparities between observed and predicted response variables (Markwart et al., 2019). In addition, the trained and validated RF model evaluates predictor importance to determine how each predictor influences the response variable. The RF model in the current study was built using 100 trees. RF can identify and characterize the critical predictive variables associated with groundwater contamination. The permutation function was used to determine the significant factors in the relationship between the predictors and the dependent variable (Hussain et al., 2021). A split that produces a considerable decrease in impurity is an important split. The greater the mean impurity reduction attributed to a variable, the greater its importance.
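The random forest setup can be sketched with scikit-learn's `RandomForestClassifier`. Only the number of trees (100) comes from the text; the data are synthetic and the remaining settings are library defaults or assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 12 selected inputs and the binary As class.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 100 bootstrapped trees; oob_score=True scores each sample using only
# the trees that did not see it during training (internal validation).
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_           # internal out-of-bag accuracy estimate
test_accuracy = rf.score(X_test, y_test)

# Impurity-based importance: mean decrease in impurity over all trees.
impurity_importance = rf.feature_importances_
```

The out-of-bag accuracy and the held-out test accuracy are two separate estimates; comparing them gives a quick check on whether the forest generalizes beyond its training data.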

Logistic regression

Logistic regression (LR) is a statistical model that uses a logistic function to model a binary dependent variable. In regression analysis, LR estimates the parameters of a logistic model (Wasserman & Pattison, 1996). This study uses LR to predict As in groundwater. The LR equation is as follows:

$$F\left( x \right) = \frac{1}{{1 + e^{{ - \left( {\beta_{0} + \beta_{1} x} \right)}} }}$$
(2)

where \(\beta_{0}\) and \(\beta_{1}\) are estimated parameters.
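To make Eq. (2) concrete, the sketch below evaluates the logistic function directly and checks it against the probability returned by scikit-learn's `LogisticRegression` for a single predictor. The data are synthetic and the helper name `logistic` is introduced here for illustration only.

```python
import math

import numpy as np
from sklearn.linear_model import LogisticRegression


def logistic(x, beta0, beta1):
    """Eq. (2): probability of the positive class for one predictor x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


# Synthetic one-feature dataset: larger x makes class 1 more likely.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(x.reshape(-1, 1), y)
beta0 = model.intercept_[0]   # estimated intercept
beta1 = model.coef_[0, 0]     # estimated slope

# The fitted model's probability should match Eq. (2) with the estimates.
x_new = 0.3
p_sklearn = model.predict_proba([[x_new]])[0, 1]
p_manual = logistic(x_new, beta0, beta1)
```

With several predictors, the term in the exponent generalizes to an intercept plus a weighted sum of all inputs.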

Machine learning model assessment criteria

Data analysis in a confusion matrix is a standard method for assessing predictive performance. The accuracy, specificity, sensitivity, and error have been computed to determine the model prediction. AUC (Area Under the Curve) is a metric commonly used in binary classification to assess the performance of machine learning models. It represents the area beneath the receiver operating characteristic (ROC) curve, reflecting the model's ability to distinguish between positive and negative classes. A higher AUC value (closer to 1) indicates better model discrimination and classification accuracy. LR was also evaluated using the ROC and AUC. A confusion matrix determines the ability to predict binary classification correctly and accurately. It demonstrates how the model distinguished predicted and actual values (Nafouanti et al., 2021). To analyze the classified data percentages, the prediction has been evaluated by comparing them to the identified concentration. The sensitivity is the percentage of As correctly classified, while the specificity is the percentage of non-arsenic correctly classified. The Python 3.9 programming language was used to create the three models.

The confusion matrix metrics are calculated as follows:

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}}$$
(3)
$${\text{Sensitivity}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(4)
$${\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}$$
(5)
$${\text{Error rate}} = \frac{{{\text{FP}} + {\text{FN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(6)

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.
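These metrics can be computed from predicted and observed classes as in the following sketch. It uses the standard definitions, with sensitivity as TP / (TP + FN) and specificity as TN / (TN + FP); the label vectors are made-up examples and the function name is an assumption.

```python
from sklearn.metrics import confusion_matrix


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and error rate for 0/1 labels."""
    # With labels=[0, 1], ravel() returns counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of contaminated samples detected
        "specificity": tn / (tn + fp),   # share of safe samples detected
        "error_rate": (fp + fn) / total,
    }


# Hypothetical labels: 1 = As above the threshold, 0 = below.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, sensitivity, and specificity are all 0.75 and the error rate is 0.25.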

Water quality index

The study employed the WQI to evaluate the suitability of groundwater for drinking purposes in the study area, following previous studies (Agbasi & Egbueri, 2023; Egbueri & Agbasi, 2022a; Omeka & Egbueri, 2023; Onyemesili et al., 2022). The WQI values were calculated based on the World Health Organization (WHO, 2022) drinking water standards for thirteen parameters: TDS, pH, Turbidity, Ca2+, Mg2+, Na+, K+, Cl, SO42−, HCO3, NO3, F, and Fe2+. The WQI was calculated in three steps. First, weights (wi) were assigned to each parameter based on their significance in determining groundwater quality, with Table 2 providing the weight and relative weight of all hydrochemical parameters.

Table 2 The weight and relative weight of hydrochemical parameters

The relative weight (Wi) for each parameter was computed using Eq. (7), where Wi denotes the relative weight, wi represents the weight of the specific parameter, and n indicates the total number of parameters. This step aimed to weigh the importance of each parameter proportionally.

$$W_{i} = \frac{{w_{i} }}{{\mathop \sum \nolimits_{i = 1}^{n} w_{i} }}$$
(7)

The second step involved determining each parameter's quality rating scale (qi) using Eq. (8). In this equation, qi represents the quality rating, Ci is the concentration of the parameter in milligrams per liter (mg/L), and Si is the WHO (2022) standard for that parameter. This calculation evaluates each parameter against its established standard.

$$q_{i} = \frac{{C_{i} }}{{S_{i} }}$$
(8)

The sub-index (SIi) for each parameter was computed using Eq. (9) to consolidate the various parameter sub-indices into a single representative value. In this equation, SIi represents the sub-index of the ith parameter, Wi signifies the relative weight of that parameter, and qi corresponds to the rating associated with the concentration of the specific parameter. This step aimed to reflect the significance of each parameter in contributing to the overall assessment.

$${\text{SI}}_{i} = W_{i} \times q_{i}$$
(9)

Finally, the comprehensive WQI was determined by summing up all the individual sub-indices using Eq. (10). This final index provided a holistic understanding of the drinking water quality in the research area, aligning with the WHO 2022 drinking water quality standards for the specified hydrochemical parameters.

$${\text{WQI}} = \mathop \sum \limits_{i = 1}^{n} {\text{SI}}_{i}$$
(10)
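The WQI steps in Eqs. (7) to (10) can be sketched as a short function. The weights and standards below are hypothetical values for three parameters, chosen only to show the arithmetic; they are not the values in Table 2. The rating follows Eq. (8) as written (Ci / Si); some WQI formulations multiply this ratio by 100.

```python
def water_quality_index(concentrations, standards, weights):
    """Compute the WQI as the sum of sub-indices SI_i = W_i * q_i."""
    total_weight = sum(weights[p] for p in concentrations)
    wqi = 0.0
    for parameter, c_i in concentrations.items():
        relative_weight = weights[parameter] / total_weight  # Eq. (7)
        q_i = c_i / standards[parameter]                      # Eq. (8)
        wqi += relative_weight * q_i                          # Eqs. (9) and (10)
    return wqi


# Hypothetical standards (mg/L) and weights for three parameters.
standards = {"TDS": 1000.0, "Cl": 250.0, "F": 1.5}
weights = {"TDS": 5, "Cl": 3, "F": 4}

# A sample sitting exactly at every standard gives WQI = 1 under Eq. (8).
at_limit = dict(standards)
# A sample at half of every standard gives WQI = 0.5.
half_limit = {p: s / 2.0 for p, s in standards.items()}

wqi_at_limit = water_quality_index(at_limit, standards, weights)
wqi_half = water_quality_index(half_limit, standards, weights)
```

Because the relative weights sum to one, the index is a weighted average of the concentration-to-standard ratios, so a parameter with a larger weight pulls the WQI more strongly when it exceeds its standard.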

Results and discussion

Hydrogeochemical analysis of groundwater

The hydrogeochemical characteristics of the groundwater samples are displayed in Table 3 and compared with the WHO (2022) standards for drinking water quality. EC reflects the ability of water to conduct an electric current through its dissolved salts. EC ranged from 85 to 4550 μS/cm with a mean value of 1363.10 μS/cm, suggesting that groundwater mineralization is responsible for the elevated EC and salinity in the groundwater system. The findings suggest that the groundwater chemistry in the study area is affected by geochemical processes, rock-water interactions, and human activities (Adimalla et al., 2021). Total dissolved solids (TDS) measurements are essential for reporting dissolved chemical concentrations. TDS concentrations varied from 234 to 3173 mg/L, with a mean of 968.20 mg/L. The elevated TDS is attributed to salt leaching and sewage infiltration (Khan et al., 2018). Higher salinity in groundwater causes high EC and TDS levels and is typically related to semi-arid and arid climatic conditions (Herczeg et al., 2001). The groundwater pH varied from 6.78 to 8.18, with a mean value of 7.16, indicating slightly alkaline conditions. The chemical composition of groundwater changes with pH, and this variation is primarily determined by lithology, including weathering and chemical reactions of plagioclase feldspar in sedimentary rocks (Ali et al., 2023). The groundwater turbidity ranged from 0.3 to 188.0 NTU (Nephelometric Turbidity Units), with a mean of 6.34 NTU. Poorly constructed and too-shallow wells can cause high turbidity (Azis, 2015). Furthermore, the alkaline condition increases conductivity over time due to the dissolution process. Total hardness ranged between 100 and 820 mg/L, with an average of 361.39 mg/L. Water with a hardness of > 500 mg/L is unsafe for human consumption (WHO, 2022). The amount of CO2 in the soil increases due to humus decomposition and respiration in the topsoil. High soil CO2 accelerates the breakdown of feldspar and carbonate minerals, resulting in high groundwater alkalinity (Roy et al., 2018).

Table 3 Statistical analysis of physicochemical parameters for groundwater samples for the study area

The cations in the groundwater follow the order of dominance Na+ > Ca2+ > Mg2+ > K+ > Fe2+. The concentration of Na+ ranged from 13 to 662 mg/L, with an average of 148.73 mg/L. The high levels of Na+ in the groundwater are attributed to ion exchange associated with silicate weathering, saline water infiltration, or clay minerals (Mitchell et al., 2018). Furthermore, agricultural activities in the research region may affect the prevalence of Na+ in groundwater. The average concentration of Ca2+ was 86.20 mg/L, with a range of 8–208 mg/L. The higher Ca2+ content derives from geological sources, such as the dissolution of carbonate and evaporite minerals within rock formations (Chidambaram et al., 2018). Mg2+ levels varied between 2 and 104 mg/L, with a mean value of 35.25 mg/L. The elevated Mg2+ originates from minerals such as mica, gypsum, and dolomite and could also arise through ion exchange (Chidambaram et al., 2018). Groundwater with higher Ca2+ and Mg2+ concentrations is classified as hard water. The K+ concentrations in the study region ranged from 2.6 to 74.0 mg/L, with an average of 9.072 mg/L, and were influenced by agricultural activities and water seepage from agricultural land. Natural sources such as silicate minerals also contribute K+ ions to groundwater. The maximum permissible Fe2+ concentration in groundwater is 0.3 mg/L, according to the WHO (2022). The Fe2+ levels in the study region ranged from 0.01 to 3.92 mg/L, with an average of 1.1455 mg/L, and were primarily sourced from ferruginous minerals on the Earth's surface (Raju, 2006).

The anions in the groundwater follow the order of dominance HCO3 > SO42− > Cl > NO3 > F. The concentrations of HCO3 ranged from 80 to 900 mg/L, with an average of 319.90 mg/L, making it the most prominent anion. The presence of HCO3 in groundwater can be attributed to the breakdown of carbonate minerals and the interaction of atmospheric CO2 with silicate minerals (Fornes et al., 2020). The SO42− concentrations varied from 18 to 1432 mg/L, with a mean value of 255.02 mg/L. The higher SO42− levels in groundwater resulted from agricultural activities (Manjusree et al., 2009). Groundwater Cl concentrations ranged from 10 to 518 mg/L, with a mean value of 92.48 mg/L. Higher Cl content in the aquifers is caused by saline water infiltration and evaporite dissolution (Gopinath et al., 2018). NO3 varied from 0.01 to 17.66 mg/L, with a mean value of 1.76 mg/L. Fertilizer runoff, septic systems, and improperly treated wastewater are the anthropogenic sources of NO3 (Selvakumar et al., 2017). The As concentrations in the study area ranged from 0.1 to 52 µg/L, with an average of 11.64 µg/L. Elevated levels of As in groundwater are due to natural and anthropogenic sources (Adimalla et al., 2018). The higher levels of HCO3 in the groundwater of Vehari district are associated with increased As concentrations, indicating oxidizing conditions in the aquifers. Most regions in southern Punjab contain high levels of As in groundwater due to arsenic minerals, making most of the water resources unsuitable for drinking. Vehari district faces a critical challenge due to widespread As contamination (Ullah et al., 2021).

Hydrogeochemical evolutional processes

The hydrochemical facies diagram depicts groundwater interactions within a lithological formation (Boateng et al., 2016). The chemical differences among the groundwater samples are shown in the Piper diagram (Fig. 4). Most of the samples fall into Zones 1 (CaHCO3 type), 4 (mixed CaMgCl type), and 5 (CaCl type), as well as the no-dominance type, indicating that rock-water interaction plays a significant role in determining the hydrochemical composition. Zone 1 (CaHCO3 type) represents fresh recharge water. Regarding cations, the groundwater samples can be classified into Zone B (mixed type) or Zone D (Na + K type), highlighting the significance of silicate weathering and ion exchange. From the anions' perspective, the majority of the groundwater samples are classified into Zone B (mixed type) and Zone E (HCO3 type), with a few samples falling into Zone G (Cl type). This implies that carbonate weathering and evaporite dissolution are the dominant processes, whereas gypsum dissolution is negligible in the study area. For the cations, the majority of the samples fall into Zone B (no dominance), Zone A (calcium type), and Zone C (magnesium type). For the Ca2+ and Mg2+ components in water samples, limestone and sandstone weathering significantly influences the groundwater system (Mallick et al., 2021). In the Piper plot, the water types indicate that Na+, SO42−, Ca2+, and Mg2+ were released by sedimentary rocks into groundwater. As mobilization in groundwater is caused by several key mechanisms, including calcium dissolution, salt mineral dissolution, and desorption (Jat Baloch et al., 2020).

Fig. 4 Geochemical evolution of groundwater types

Gibbs diagram

The Gibbs diagram portrays the variables that influence groundwater chemistry: evaporation dominance, precipitation dominance, and rock (weathering) dominance (Jat Baloch et al., 2021; Rashid et al., 2018b; Salem et al., 2015). Most of the samples plot in the rock dominance region in Fig. 5, signifying that rock weathering controls the chemistry of most of the groundwater, while a few samples plot in the evaporation dominance region. Rock weathering is the main driver of the elevated mineral content of the groundwater system. This enrichment is facilitated by the mixing of soluble salts and minerals within the groundwater. Furthermore, the prolonged residence time extends water–rock interaction and allows further mineral dissolution. These observations underscore the interplay between geological processes and groundwater composition (Tariq et al., 2022).

Fig. 5 Gibbs diagram showing the ionic composition of the groundwater samples: a Na/(Na + Ca) (mg/L) versus log TDS, b Cl/(Cl + HCO3) (mg/L) versus log TDS
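The two Gibbs ratios and the log TDS axis can be computed per sample as in the minimal sketch below. The function name `gibbs_coordinates` is hypothetical, and the inputs are the mean concentrations reported in the text, used only for illustration (a ratio of means is not the mean of per-sample ratios). Concentrations are taken in mg/L, following the figure caption.

```python
import math

def gibbs_coordinates(na, ca, cl, hco3, tds):
    """Return (Na/(Na+Ca), Cl/(Cl+HCO3), log10(TDS)) for one sample.

    All inputs are concentrations in mg/L, as stated in the figure caption.
    """
    cation_ratio = na / (na + ca)
    anion_ratio = cl / (cl + hco3)
    return cation_ratio, anion_ratio, math.log10(tds)

# Illustrative input only: mean concentrations reported in the text (mg/L).
cation_ratio, anion_ratio, log_tds = gibbs_coordinates(
    na=148.73, ca=86.20, cl=92.48, hco3=319.90, tds=968.20
)
print(f"Na/(Na+Ca)   = {cation_ratio:.3f}")
print(f"Cl/(Cl+HCO3) = {anion_ratio:.3f}")
print(f"log10(TDS)   = {log_tds:.3f}")
```

Each sample then becomes one point in panel a (cation ratio versus log TDS) and one point in panel b (anion ratio versus log TDS).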

Pearson correlation

The results of Pearson's correlation analysis are displayed in Table 4. Conventionally, correlation coefficients (r) of < 0.5, 0.5–0.7, and > 0.7 signify weak, moderate, and strong relationships, respectively (Onyemesili et al., 2022). The correlation matrix for Vehari helps to identify the geochemical processes in the study area. The strong correlations of EC with TDS, HCO3−, Cl, SO42−, Mg2+, Na+, TH, and F indicate higher ion exchange possibilities in the aquifers. TDS was significantly correlated with HCO3, Cl, SO42−, Ca2+, Mg2+, Na+, and TH: when the TDS value increases, all ionic concentrations also increase, primarily due to the weathering of sedimentary rocks. TH was positively correlated with EC, TDS, HCO3, Cl, SO42−, and Ca2+, illustrating that the elevated hardness of groundwater in the study area is due to Ca2+, Mg2+, and other ions (Xue-Jie et al., 2013). The significant correlation of HCO3− with Cl, SO42−, Mg2+, and Na+ suggests a significant contribution from multiple anthropogenic sources, such as improper waste disposal, agricultural activity, sanitation, discharge of industrial effluents, and organic decomposition. In Vehari, As exhibited a negative correlation with EC, TDS, pH, turbidity, HCO3−, Cl, Mg2+, Na+, K+, hardness, NO3, Fe2+, and F. Such correlations highlight the influence of pH on As concentration in groundwater (Jia et al., 2023).

Table 4 Pearson correlation matrix of the study area’s physicochemical parameters for groundwater
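The correlation coefficient and the weak/moderate/strong bands described above can be sketched as follows. The function names and the EC and TDS values are hypothetical illustrations, not the study's data. Applying the thresholds to the absolute value of r, so that negative correlations are graded too, is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Grade |r| using the bands quoted in the text (< 0.5, 0.5-0.7, > 0.7)."""
    magnitude = abs(r)  # assumption: the bands apply to the magnitude of r
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "weak"

# Hypothetical EC (uS/cm) and TDS (mg/L) values for five samples.
ec = [400, 900, 1500, 2300, 3100]
tds = [280, 610, 1050, 1580, 2200]
r = pearson_r(ec, tds)
print(f"r(EC, TDS) = {r:.3f} -> {strength(r)}")
```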

Principal component analysis (PCA)

Principal component analysis was applied to find and classify the sources that influenced the groundwater variables. The groundwater variables were subjected to a PCA, as shown in Table 5. Varimax rotation was applied to the PCA results to better understand the factors that affect groundwater (Rashid et al., 2020; Zhang et al., 2020). Four components were obtained, with eigenvalues of 6.476, 1.874, 1.359, and 1.016, accounting for 38.096, 11.026, 7.993, and 5.974% of the total variability, respectively (Fig. 6). PC1 explained 38.096% of the variability with an eigenvalue of 6.476. The significant loading factors of EC, TDS, HCO3−, Cl, SO42−, Mg2+, Na+, and TH were 0.959, 0.980, 0.763, 0.865, 0.911, 0.718, 0.868, and 0.723, respectively. Thus, PC1 showed the highest contribution of strong loading factors in the PCA results, reflecting the geogenic and anthropogenic sources in the study area. The PC1 results indicate the ionic composition of groundwater resulting from ion exchange, dissolution of minerals, and weathering of rocks. PC2 explained 11.026% of the variability with an eigenvalue of 1.874. The moderate loading factors in PC2 were Ca2+ (0.731) and TH (0.645). The levels of Ca2+ and hardness are likely to be influenced by anthropogenic and weathering activities (Li et al., 2020a). PC3 and PC4 showed the lowest contributions in the PCA results, with 7.993 and 5.974% of the variability and eigenvalues of 1.359 and 1.016, respectively. The moderate loadings of variables in PC3 and PC4 could be associated with anthropogenic activities that influence the hydrochemical characteristics of the groundwater aquifers. These results highlight the contribution of anthropogenic and natural factors to groundwater contamination in the study region.

Table 5 Principal component analysis of selected groundwater parameters for Vehari district
Fig. 6 a Sum of all the calculated factors, b Contribution of the four loading factors F1, F2, F3 and F4 after varimax rotation
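For a PCA on standardized variables, each eigenvalue divided by the number of variables gives that component's share of the total variance. The sketch below checks the reported eigenvalues against the reported percentages. The variable count of 17 is an assumption inferred from the ratio of the reported values, not a figure stated in the text, and `explained_variance_percent` is a hypothetical helper name.

```python
# Eigenvalues and percentages as reported in the text for PC1..PC4.
eigenvalues = [6.476, 1.874, 1.359, 1.016]
reported_percent = [38.096, 11.026, 7.993, 5.974]

# Assumption: 17 standardized input variables (inferred, not stated in the text).
N_VARIABLES = 17

def explained_variance_percent(eigenvalue, n_variables):
    """Share of total variance for one component of a correlation-matrix PCA."""
    return 100.0 * eigenvalue / n_variables

percentages = [explained_variance_percent(ev, N_VARIABLES) for ev in eigenvalues]
cumulative = sum(percentages)

for i, (calc, rep) in enumerate(zip(percentages, reported_percent), start=1):
    print(f"PC{i}: computed {calc:.3f}% vs reported {rep:.3f}%")
print(f"Cumulative variance of the four components: {cumulative:.2f}%")
```

Under that assumption the computed shares agree with the reported ones to within rounding of the eigenvalues, and the four components together account for roughly 63% of the total variance.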

Machine learning model evaluation and comparison

After model development and training, the test predictor data were used to evaluate how accurately the models predict the presence of As in groundwater. The evaluation metrics for ANN, RF, and LR were obtained from their confusion matrices; Tables S1, S2, and S3 provide more information. The RF model achieved accuracy, error rate, specificity, and sensitivity values of 0.85, 0.10, 0.79, and 0.95, respectively (Table 6). A sensitivity higher than the specificity means fewer false negatives in binary classification, indicating a good prediction model. RF's capability to forecast groundwater F pollution has previously been investigated, which supports our study (Nafouanti et al., 2021). In this work, RF accuracy was improved by identifying appropriate inputs and employing many trees. The ANN's accuracy, error rate, specificity, and sensitivity were 0.80, 0.20, 0.73, and 0.88, respectively. This finding agrees with previous research on forecasting water quality indicators for irrigation applications using ANN (Abrahart et al., 2005; Awu et al., 2015; El Bilali et al., 2021). In this study, increasing the number of hidden layers during network training improved ANN performance; adding hidden layers can significantly enhance accuracy (Karsoliya, 2012). A network with two hidden layers proved suitable for training.

Table 6 Performance for Logistic Regression using ROC (AUC) curve
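The four evaluation metrics follow directly from the counts in a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and are not taken from Tables S1 to S3.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, error rate, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # correct predictions / all predictions
        "error_rate": (fp + fn) / total,    # equals 1 - accuracy
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }

# Hypothetical counts for a test set of 100 samples.
metrics = classification_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

By these definitions, accuracy and error rate always sum to 1, and a high sensitivity corresponds to few false negatives, that is, few contaminated wells predicted as safe.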

The LR's accuracy, error rate, specificity, and sensitivity were 0.59, 0.41, 0.52, and 0.63, respectively; the model's capability was also assessed using the area under the ROC curve (AUC) (Fig. 7). LR's AUC was 0.73, which agrees with previous research on groundwater NO3 pollution in a semi-arid environment using the integrated parametric IPNOA and data-driven logistic regression (Rizeei et al., 2018). The lower effectiveness of the ANN model relative to the RF model arises from the ANN's limitation in making predictions outside the range of its training data; consequently, overfitting to the training data becomes pronounced (Al-Mukhtar, 2019). The RF model, by contrast, limits overfitting by combining many trees to generate a prediction. Regarding accuracy, specificity, and sensitivity, the LR model performed the worst of the three models (Table 6). Low-dimensional data in the training set can reduce LR performance, and the model may overfit and therefore predict the test data set incorrectly. Despite their weaker performance in the current study, ANN and LR have shown advantages in forecasting groundwater contamination in previous studies. Because numerous variables are involved, the process of groundwater pollution is difficult to comprehend. Hence, a model's precision and dependability increase with the adaptability of the algorithm (Tsoar et al., 2007). An algorithm's structure, the data type, and the parameter selections influence its performance (Üstün et al., 2005). In this classification task, feature selection should be considered to produce a better predictive model.

Fig. 7 Performance for Logistic Regression using ROC (AUC) curve
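The AUC can be read as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, which is equivalent to the area under the ROC curve. The function name, labels, and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = As above the limit) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value of 0.73 indicates moderate discrimination.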

Identifying the variables that impact arsenic mobilization

The mean decrease in impurity (MDI), a variable importance metric from RF, was used to determine the relationship between predictors and As (He et al., 2022). It is a tree-specific variable importance metric calculated with the feature importance implementation of the RF module in Python's scikit-learn. Each time a variable is chosen to split a node, the impurity decrease is recorded, and the MDI per feature is accumulated across all trees in the forest. Variables that split nodes closer to the tree root have a higher importance value (Fig. 8). The F, Fe2+, K+, Mg2+, Ca2+, Cl, SO42−, NO3, HCO3, and Na+ variables are at the top of the plot with the highest MDI values. The plot shows that pH and TDS have a low MDI in the research area.

Fig. 8 Important features for arsenic according to the mean decrease in impurity in Random Forest
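The reason splits near the root tend to carry more importance can be seen from the weighted impurity decrease of a single split. The sketch below is a simplified illustration using Gini impurity for binary labels, not the library's own code. The function names and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the node's share of all samples."""
    n_parent = len(parent)
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / n_parent
    return (n_parent / n_total) * (gini(parent) - weighted_children)

# Hypothetical node: 8 samples split into two children of 4 samples each.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [1, 0, 0, 0]

# Same split at the root (node holds all 8 samples) and at a deeper node
# (node holds 8 of 16 samples).
at_root = impurity_decrease(parent, left, right, n_total=8)
deeper = impurity_decrease(parent, left, right, n_total=16)
print(f"decrease at root: {at_root:.4f}, at deeper node: {deeper:.4f}")
```

Summing these weighted decreases over every node where a feature is used, then averaging over the trees, gives that feature's MDI. Because the weight is the fraction of samples reaching the node, the same split counts for less the deeper it sits in the tree.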

The MDI has previously been used to find the essential components in data that influence the dependent variable (Bylander, 2002). It has also identified significant variables related to the dependent variable in microarray analysis and facies estimation, where it was employed to find essential predictors of the dependent variables (Bhattacharya & Mishra, 2018). According to the MDI results, the variables influencing As in the study region are F, Fe2+, K+, Mg2+, Ca2+, Cl, SO42−, NO3, HCO3, and Na+, which is consistent with previous research findings (Jat Baloch et al., 2022b; Rashid et al., 2018a; Tahir & Rasheed, 2013). To evaluate variable importance for the ANN, the permutation feature was used to find the most influential inputs for the output. When a single important variable is removed, the final model score decreases (Chae et al., 2016). Twelve (12) networks were tested to discover the most significant factors in the outcome. Each showed the change in network accuracy after a variable was removed (Table 7).

Table 7 Feature importance for the ANN from the permutation feature, showing the change in accuracy after a variable is eliminated
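A minimal sketch of permutation importance follows, in the common variant that shuffles one input column at a time and records the drop in accuracy. This differs from the elimination-and-retraining procedure described in the text, and it works with any fitted model. The toy classifier, data, and function names are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        total_drop = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the labels
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            total_drop += baseline - accuracy(model, permuted, labels)
        drops.append(total_drop / n_repeats)
    return baseline, drops

def toy_model(row):
    """Hypothetical classifier that uses only feature 0 and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0

rows = [[0.9, 0.1], [0.8, 0.7], [0.7, 0.3], [0.6, 0.9],
        [0.4, 0.2], [0.3, 0.8], [0.2, 0.4], [0.1, 0.6]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

baseline, drops = permutation_importance(toy_model, rows, labels, n_features=2)
print(f"baseline accuracy: {baseline:.2f}")
print(f"mean accuracy drop per feature: {drops}")
```

A feature the model ignores produces no drop in accuracy, which is the same pattern as an unchanged accuracy after removing an uninformative variable.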

After omitting pH and TDS, the accuracy is 0.80, the same as the accuracy of the original model. pH and TDS could therefore be excluded from the model, given their limited impact on network accuracy. This observation indicates that pH and TDS have little influence on As concentration in the study region. In contrast, when other variables such as F, Fe2+, K+, Mg2+, Ca2+, Cl, SO42−, NO3, HCO3, and Na+ are removed from the model, the model's accuracy decreases, showing their significance to the As model. Permutation was previously used to identify critical variables for dissolved oxygen (DO) (Matayoshi et al., 2019). In the current research, the permutation feature and the MDI give similar outcomes for the variables affecting As in the study region: F, Fe2+, K+, Mg2+, Ca2+, Cl, SO42−, NO3, HCO3, and Na+. When analyzing the relationship between the input and output variables, the permutation feature outperforms the MDI feature. The permutation technique can be used to evaluate the significant contributors to the output of any algorithm, whereas the MDI is specific to the RF algorithm.

Arsenic mechanism in groundwater

Arsenic concentrations in groundwater of the Vehari district varied from low to high levels of enrichment, as depicted in Fig. 9. Results indicated that 43.5% of the samples in the Vehari district exceeded the WHO 2022 permissible limit for As (10 µg/L). Correlations between As and selected parameters were plotted to investigate the As release mechanism and are presented as scatter diagrams in Fig. 10. The results showed a trend of oxidative desorption together with an evaporative concentration mechanism. This interpretation is based on the alkaline pH (6.7–8.2), low iron, high bicarbonate, high sulfate, the negative correlation of iron with arsenic, the significant positive correlations between As–HCO3 and As–SO42−, and the slight positive correlation of As with pH in the groundwater of Vehari. The Gibbs diagram also supported the evaporative mechanism, showing that evaporation is also an important natural process controlling the water chemistry of the study area (Fig. 5). Ion-exchange processes and the adsorption of ions in the study region significantly affected the concentration of As. Previous research has indicated that Ca2+ can interfere with As adsorption through ion reactions on mineral surfaces (Xie et al., 2008). High concentrations of competing ions (HCO3, SO42−, Na+, K+, Mg2+, and Ca2+) can thus aid arsenic desorption. The findings were consistent with the previous studies of Brahman et al. (2016) and Shahab et al. (2019), which reported high As concentrations in Sindh province, Pakistan. Moreover, variations in As enrichment in high-pH groundwater could be attributed to soil salinization and subsurface environmental conditions (Li et al., 2020b). The weak correlation between As and pH observed in this study may be due to alkaline desorption, which can affect the release of As into groundwater.
Additionally, the aquifers in the study region have been reported to be alluvial, composed of silt, sand, and gravel, and to have elevated As levels in the Punjab province (Shahab et al., 2019). Punjab province has a high evaporation rate, with 74–80 percent of groundwater being highly evaporated (Yu et al., 2015). In this study, no statistically significant correlation of As with NO3 or F was found, as their concentrations were very low in almost all of the groundwater samples. The saturation index (SI) estimation helps to understand the reaction pathways and to quantify mineral dissolution and precipitation. In the geochemical simulation model (Fig. 11), aquifer conditions were undersaturated (SI < 0) with respect to calcium carbonate and rock salt minerals, including aragonite, calcite, dolomite, and halite. These mineral phases had negative SI values and were unlikely to precipitate, but they may have played an important role in releasing As into aquifers through dissolution (Rashid et al., 2022). In contrast, the SI was positive for anhydrite, gypsum, and iron oxide mineral phases, including goethite and hematite. These minerals tended to precipitate in groundwater (Fig. 11). Bhattacharya et al. (2009) found that iron oxides in the sediments of the flood plain in Bangladesh inhibited As mobility in groundwater.

Fig. 9 Spatial distribution of groundwater As in the study area

Fig. 10 Scatter diagrams showing the correlation between arsenic and different variables in groundwater

Fig. 11 Relationships between As and saturation indices
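The sign convention for the saturation index can be illustrated with its standard definition, SI = log10(IAP / Ksp), where IAP is the ion activity product and Ksp the solubility product of the mineral. The sketch below assumes that definition, since the text does not spell it out. The ion activity product is hypothetical, the function names are illustrative, and the calcite solubility product of 10^-8.48 is the commonly tabulated value at 25 °C.

```python
import math

def saturation_index(iap, ksp):
    """SI = log10(IAP / Ksp) for one mineral phase."""
    return math.log10(iap / ksp)

def saturation_state(si):
    """Interpret the sign of SI as in the text: SI < 0 means undersaturated."""
    if si < 0:
        return "undersaturated (mineral tends to dissolve)"
    if si > 0:
        return "oversaturated (mineral tends to precipitate)"
    return "at equilibrium"

# Commonly tabulated solubility product of calcite at 25 degrees C.
KSP_CALCITE = 10 ** -8.48

# Hypothetical ion activity product a(Ca2+) * a(CO3 2-) for one water sample.
iap_sample = 1.0e-9

si = saturation_index(iap_sample, KSP_CALCITE)
print(f"SI(calcite) = {si:.2f}: {saturation_state(si)}")
```

In practice the activities come from a speciation calculation in a geochemical modeling code rather than from raw concentrations.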

WQI

The WQI is a popular method for determining the suitability of groundwater for drinking (Narsimha & Sudarshan, 2017), and it was used here to check the suitability of groundwater in the research region. The WQI classes include excellent (< 50), good (50–100), poor (100–200), and unsuitable (> 200). Table 8 shows that 40.25% of the samples (n = 161) were classified as "Poor" and 30% as unsuitable, while 1.5% and 27.75% of the samples were classified as "Excellent" and "Good," respectively. Most samples had poor to unsuitable drinking water quality, showing that the study area's groundwater sources are unsafe to drink. The water quality suitability map is depicted in Fig. 12.

Table 8 Water Quality Index (WQI) classification of the study area
Fig. 12 Groundwater suitability assessment for drinking purposes in the study area
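The class assignment and the class percentages can be sketched as below. The thresholds are the ones quoted in the text. How values falling exactly on a boundary (50, 100, 200) are assigned is an assumption, and the function names are hypothetical.

```python
def classify_wqi(wqi):
    """Map a WQI value to the classes quoted in the text.

    Assumption: boundary values 100 and 200 belong to the lower class,
    and 50 belongs to "Good".
    """
    if wqi < 50:
        return "Excellent"
    if wqi <= 100:
        return "Good"
    if wqi <= 200:
        return "Poor"
    return "Unsuitable"

def class_share_percent(n_in_class, n_total):
    """Percentage of samples falling in one class."""
    return 100.0 * n_in_class / n_total

# Hypothetical WQI values for four samples.
for value in (30, 75, 150, 250):
    print(f"WQI {value}: {classify_wqi(value)}")

# Reported count for the "Poor" class out of the 400 samples in the study.
poor_share = class_share_percent(161, 400)
print(f"Poor: {poor_share:.2f}% of samples")
```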

Conclusions

High levels of As in drinking water sources can make the water unsuitable for consumption. In the current study, 174 of the 400 samples (43.5%) had As concentrations exceeding the permissible limit of 10 μg/L set by the World Health Organization (WHO, 2022) for drinking water. The measured As levels ranged from 0.1 to 52 μg/L. Ion-exchange processes and the adsorption of ions in the study region significantly affected the concentration of As. Basic physicochemical parameters, such as EC, TDS, HCO3, and Na+, exceeded the permissible limits set by WHO, rendering the water unsafe for drinking. The multivariate statistical approaches suggest that geogenic and anthropogenic activities in the region cause As enrichment in groundwater. The hydrochemical analysis of groundwater samples indicates a combination of CaMgCl and CaCl types. The Gibbs plot demonstrated that rock weathering substantially influences the groundwater's chemical makeup. Moreover, geochemical modeling showed negative saturation indices for calcium carbonate and salt minerals, including aragonite, calcite, dolomite, and halite. According to the WQI, most of the water samples from the Vehari district had poor water quality. Artificial Neural Network, Random Forest, and Logistic Regression machine learning techniques were used to predict As levels in the study region. The results indicate that Random Forest was the most effective, with an accuracy of 0.85. The permutation feature and the MDI were employed to identify the variables influencing arsenic levels in the region. These approaches identified F, Fe2+, K+, Mg2+, Ca2+, Cl, SO42−, NO3, HCO3, and Na+ as contributing factors to As concentration.
These findings suggest that the Random Forest model can be used as a reliable algorithm for forecasting groundwater arsenic in the Vehari region and can be extended to other locations for predicting groundwater contamination. However, future research should focus on developing more adaptive models to improve the accuracy of groundwater pollution prediction.