Introduction

In arid regions, groundwater is the principal source of water for irrigation, drinking, and industrial use (Kawo and Karuppannan 2018), and the Sidi Okba region in Algeria is no exception. Groundwater provides 96% of the world’s freshwater and supplies around 2.4 billion people (Duran-Llacer et al. 2022; Pandey et al. 2023). Climate change, the expansion of agricultural areas, and scarce precipitation have increased the number of pumping wells in these regions (Hamamouche et al. 2018; Boudibi et al. 2021a; Li et al. 2020; Neshat and Pradhan 2017; Şen 2019). Overexploitation, poor farming practices, and ineffective management of this valuable and scarce resource have contributed to the deterioration of groundwater quality (Aouidane and Belhamra 2017; Afrasinei et al. 2017).

Groundwater salinization, expressed in terms of electrical conductivity (EC), is one of the major constraints on agricultural production in the study area, because saline irrigation water alters the physicochemical properties of the soil, which causes soil salinization and reduces plant productivity (Boudibi 2021; Bradaï et al. 2016; Pulido-Bosch et al. 2018). Currently, salinization is rising at a 10% annual rate (Barbieri 2023), and it is the primary issue with irrigation-related groundwater quality. Aquifer salinity can be directly or indirectly impacted by agricultural activities: changes in salinity brought about by the application of irrigation water are categorized as direct impacts, and those arising from abstraction for irrigation as indirect impacts (Pulido-Bosch et al. 2018). Thus, managing irrigation in any region requires precise prediction and mapping of the aquifer’s groundwater salinity (GWS). To improve the capacity of decision makers and land use planners to assess the spatial variability of GWS accurately, and to provide a foundation for groundwater quality risk assessment in other regions of the world, it is necessary to identify the most accurate machine learning (ML) technique for assessing the risk level of groundwater salinization using suitable digital elevation model (DEM) derivatives.

ML algorithms and geostatistical models are the most suitable methods for digital mapping (DM) (Zhang et al. 2017; Qu et al. 2024). Ordinary kriging (OK) and its derivatives, such as cokriging and regression kriging, are the most widely applied geostatistical methods in GWS modeling studies. In this study, OK is used to estimate the salinity at unsampled points and to gain insight into the spatial distribution of GWS. ML-based DM uses DEM derivatives and satellite image indices as covariates and is widely applied for modeling soil properties (Qu et al. 2024). To the best of our knowledge, the most recent studies are restricted to using GWS controlling factors (e.g., evaporation, transmissivity, water table, precipitation, and water cations and anions) to predict water salinity at unsampled locations. Applying such data to model GWS is not always possible because they are not available everywhere, and they are costly and time consuming to acquire because they require extensive sampling. The major limitation associated with the application of ML-based DM for modeling GWS is the availability of adequate covariates (inputs) such as DEM derivatives (e.g., elevation, slope, curvature, distance to rivers and streams, longitude, latitude, and aspect) (Sahour et al. 2020). Many scholars have applied ML techniques to deal with the non-linear and complex relationships between the target variable and the independent variables in order to provide accurate predictions (Xiao et al. 2023; Tran et al. 2021; Meyer and Pebesma 2021).

In recent years, ML techniques have demonstrated their potential as effective tools for predicting GWS. For instance, Sahour et al. (2020) applied multiple linear regression (MLR), deep neural network (DNN), and extreme gradient boosting (EGB) to estimate the GWS in a coastal aquifer of the Caspian Sea and concluded that EGB was the best alternative given its better performance in the testing phase. Cui et al. (2021) used Gaussian processes (GPs) for GWS prediction in the NE part of South Australia, and their findings suggested that GPs should be actively promoted in groundwater prediction research. Araya et al. (2023) employed random forest (RF) to make spatial predictions of GWS in the Horn of Africa and reported that RF is a powerful tool for geospatial predictive modeling. Al-Waeli et al. (2022) illustrated the ability of artificial neural networks (ANNs) to predict GWS at the Najaf–Kerbala plateau in Iraq using cations and anions as input data. Jamei et al. (2022) studied the GWS distribution of multi-aquifers in Bangladesh using an adaptive neuro-fuzzy inference system (ANFIS) and Boruta-random forest (B-RF), and emphasized the strong predictive ability of the applied methods. Lal and Datta (2020) compared the performance of four ML techniques, namely ANNs, genetic programming (GP), Gaussian process regression (GPR), and support vector regression (SVR), for GWS prediction in a coastal aquifer system and found that GPR performed better than the other models.

All of the above studies have demonstrated the ability of ML algorithms to handle the complexity of GWS prediction using numerous controlling factors, including aquifer characteristics and groundwater quality elements. Unfortunately, such extensive data are not available in many regions, and collecting them is time consuming. Moreover, the accuracy of these predictions varies widely depending on the adopted technique (Muniappan et al. 2023). Although predictor selection is highly important for the accuracy of ML techniques, using the maximum number of inputs increases computation time and may worsen learning accuracy (Cai et al. 2018). The literature lacks a comprehensive application of relevant feature selection methods and readily available influencing factors for modeling GWS using various ML techniques. In this study, random forest (RF) with few input parameters is identified as the most effective ML method, and its performance is shown numerically. The main objectives of this study are (1) to map digitally the GWS using readily available DEM derivatives and a small sample dataset of EC; (2) to compare the performance of five commonly used ML techniques, namely random forest (RF), hybrid neuro-fuzzy inference system (HyFIS), K-nearest neighbors (KNN), cubist regression model (CRM), and support vector machine (SVM), for spatial modeling and digital mapping of GWS; and (3) to explore the effect of different feature selection methods and the number of candidate inputs on the accuracy of these modeling techniques.

Material and methods

Description of the study area

The research was carried out in the Algerian Sahara in the Sidi Okba region (Wilaya of Biskra), which is located 19 km southeast of Biskra, between longitudes 5°45′ E–6°2′ E and latitudes 34°39′ N–34°52′ N, with a total surface area of 280 km2 and an elevation ranging from 2 m in the southern part of the study zone to 126 m in the northern part. It is characterized by an arid climate with cold winters and hot, dry summers, and annual rainfall of less than 150 mm (Hamamouche et al. 2018). The mean annual temperature and evapotranspiration are 23 °C and 2500 mm, respectively (Boudibi et al. 2021b). It is crossed by two wadis, namely wadi Biskra and wadi El-Biraz, which ultimately flow into the natural depression of Chott Melghir (Fig. 1).

Fig. 1 Location map of the study area

Geologically, the study area is a transitional zone, characterized by both structural and sedimentary features, positioned between the mountainous and folded Atlas domain in the north and the expansive, flat desert domain of the northern Sahara in the south (Abdennour et al. 2020; Chebbah 2016; Ghiglieri et al. 2020). The sedimentary formations in this region form a succession from the Mesozoic to the Cenozoic (Guiraud and Bosworth 1997). The Neogene stretches over a large surface area and unevenly covers formations of various ages, including Oligocene, Eocene, and Upper Cretaceous (Guiraud 1990; Ghiglieri et al. 2020). A large Quaternary formation discordantly overlies and covers these Neogene deposits (Chebbah 2016).

Hydrogeologically, the study area is characterized by the superposition of two main aquifer systems (Fig. 2): the continental intercalary aquifer (CIa), which is the deeper of the two, and the terminal complex aquifer (CTa) (Besbes et al. 2003). These two aquifer systems are separated by an impermeable Cenomanian horizon and are part of the North Western Sahara Aquifer System (NWSAS), also known as the Système Aquifère du Sahara Septentrional (SASS), which extends over an area of 1 million km2 shared by Libya, Tunisia, and Algeria, with the major part in Algeria (about 700 000 km2) (Al-Gamal 2011; Besser et al. 2018). The CTa comprises several minor aquifers extending from the Upper Cretaceous to the Mio-Pliocene (Edmunds et al. 2003; Ghiglieri et al. 2020). The Mio-Pliocene aquifer (called the aquifer of sands) consists of alternating layers of clay, sand, and gravel (Reghais et al. 2024). It is the primary exploited aquifer in the eastern part of Biskra province, including the Sidi Okba region. The thickness of this aquifer reaches 1000 m and its depth varies from 90 to 300 m in the study area (Hamamouche et al. 2017; Reghais et al. 2024).

Fig. 2 Hydrogeological map of the study area

Successive droughts and the expansion of irrigated agriculture in the Sidi Okba region have led to intensive exploitation of groundwater through deep wells tapping the Mio-Pliocene aquifer (MPa). In the last decade, groundwater from the MPa has become the main source of water for irrigation and drinking in the study area (Hamamouche et al. 2015), despite the existence of the Foum Elgherza dam, which is used only to irrigate the palm groves of the Sidi-Okba, Gharta, and Seriana oases (Fig. 1). As the study area is experiencing a shortage of surface water from the dam, pumped groundwater is fed into the pre-existing irrigation infrastructure, resulting in an integrated surface water and groundwater system. The agricultural sector is the largest groundwater consumer in the study area (more than 90% of the pumped groundwater) (Hamamouche et al. 2018).

Dataset acquisition and preparation

Groundwater salinity measurement

To obtain a representative network of groundwater wells covering the entire study area (Fig. 1) and tapping the Mio-Pliocene aquifer in the Sidi Okba region, a total of 56 boreholes used for agricultural and drinking purposes were selected. On-site measurements were carried out in May 2020 to obtain the electrical conductivity (EC, in mS/cm), which is used to express the GWS. Most of the boreholes were operational during field sampling. Otherwise, the well water was pumped for about 20 min before sampling to ensure that it represented the aquifer’s current state. The EC measurements were carried out using a portable digital multiparameter meter (WTW Multi 3430). The adopted methodology is summarized in Fig. 3.

Fig. 3 Flowchart methodology

Digital elevation model (DEM) derivatives

According to recent studies, GWS is influenced by several factors, including climatic, topographic, hydrologic, geologic, land use and land cover (LULC), and aquifer characteristics. In this study, the focus is on the most easily accessible influencing factors, which are the DEM derivatives, namely slope, flow direction (FlowD), elevation, curvature, aspect, topographic wetness index (TWI), distance to streams (DTS), and distance to wadi El-Biraz (DTBW). Avand et al. (2020) stated that these factors can affect GWS directly or indirectly. Slope, elevation, curvature, and aspect play important roles in flushing and exporting saline materials from the soil into fluvial plains through the transportation and accumulation of these materials in lowland areas (Mosavi et al. 2020). Lower elevation areas often have higher GWS levels due to the accumulation of salts from evaporative concentration, whereas higher elevations might show lower GWS levels due to increased groundwater recharge and less evaporation (Leaney et al. 2003; Mosavi et al. 2021). Aspect, which represents the orientation of slopes and the direction of water flow, indirectly affects the amount of water infiltrating into the ground by influencing land cover, wind speed, precipitation direction, and evapotranspiration (Benjmel et al. 2022). Slope and curvature influence the flow and accumulation of water, and hence the rate of groundwater recharge (Avand et al. 2020; Benjmel et al. 2022), thus affecting the distribution and concentration of salts in groundwater storage. The TWI is related to soil moisture patterns (Kalantar et al. 2019). Areas with high TWI values have higher soil moisture and can influence groundwater recharge through the infiltration of surface water, waterlogging, and leaching, which can dilute salinity levels (Mosavi et al. 2021; Benjmel et al. 2022).

Streams and wadis play a crucial role in groundwater recharge within the study area. In addition to serving as a primary source of groundwater recharge, streams and wadis also significantly influence the mobilization and distribution of salts within the aquifers (Balakrishnan et al. 2024). Geographic coordinates are correlated with precipitation and temperature conditions (Zhao et al. 2007), which affect groundwater recharge and evaporation rates; longitude (X) and latitude (Y) are therefore also considered as covariates. The ASTER DEM data of the study area are provided by the US Geological Survey (USGS) (https://earthexplorer.usgs.gov) at a spatial resolution of 30 m (raster cell size of 30 × 30 m). This DEM is used to prepare the maps (30 × 30 m pixel resolution) of the 10 abovementioned derivatives, which are extracted and calculated using ArcGIS 10.8 software (Fig. 4). The generated raster maps are imported into the R environment and processed using the raster package for GWS modeling over the entire study area.
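For reference, the TWI mentioned above is commonly defined as follows. The text does not state which formulation was used in ArcGIS, so this standard form and its symbols are an assumption rather than the authors' exact computation:

$$TWI=\ln\left(\frac{a}{\tan\beta}\right)$$

where a is the upslope contributing area per unit contour width (specific catchment area) and β is the local slope angle. Flat cells with a large contributing area therefore receive high TWI values, which is consistent with the interpretation of high TWI as wetter ground.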

Fig. 4 Digital elevation model (DEM) derivatives

Modeling procedure and performance evaluation

All the steps of the modeling process are performed in RStudio (version 2022.12.00) using the caret package.

Data standardization

Standardization, also known as centering and scaling, is a preprocessing technique commonly used in ML. It transforms the features of a dataset so that each has a mean of 0 (centering) and a standard deviation of 1 (scaling) (Kraiem et al. 2024; Ouameur et al. 2020). This is typically achieved by subtracting the mean value of each feature from all data points and then dividing by the standard deviation (Müller and Guido 2016). Standardization ensures that features are on similar scales, which can improve the performance of many ML algorithms, particularly those sensitive to the scale of features (Shanker et al. 1996). In this study, standardization is applied automatically during feature selection and model training using the preProcess argument passed to the train function in the R environment.
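The centering and scaling step can be illustrated with a minimal Python sketch. It is not the caret implementation used in the study; the function name standardize, the covariate names, and the values are hypothetical, and the sample standard deviation (n - 1 divisor) is assumed.

```python
import statistics

def standardize(columns):
    """Center and scale each feature column to mean 0 and standard deviation 1."""
    scaled = {}
    params = {}
    for name, values in columns.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation (assumed convention)
        params[name] = (mean, sd)      # keep these to transform new data the same way
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled, params

# Hypothetical covariate values, for illustration only.
features = {
    "elevation_m": [12.0, 45.0, 80.0, 126.0, 2.0],
    "slope_deg": [0.5, 1.2, 2.0, 3.1, 0.2],
}
scaled, params = standardize(features)
```

Storing the mean and standard deviation estimated on the training data matters because test data should be transformed with the same parameters rather than with their own.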

Feature selection

Feature selection (FS) serves as a valuable tool in ML, offering numerous benefits such as mitigating overfitting, enhancing model performance, and reducing computational complexity by strategically removing irrelevant or redundant features (Cai et al. 2018; Tran et al. 2021). FS methods can be divided into three principal groups: unsupervised, supervised, and semi-supervised (Cai et al. 2018). Supervised FS methods in turn fall into three primary categories: embedded methods, filters, and wrappers (Lualdi and Fasano 2019; Cai et al. 2018). Wrapper methods use machine learning algorithms and search strategies to iteratively train and test feature subsets, integrating feature selection into model training (Lualdi and Fasano 2019; Jamei et al. 2022). Empirical evidence favors wrappers in terms of performance. In this study, three wrapper FS methods, i.e., backward feature selection (BFS), forward feature selection (FFS), and recursive feature elimination (RFE), are used to pick the best candidate input combinations for the different ML modeling techniques.
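The wrapper idea can be sketched with a greedy forward selection loop in Python. This is an illustrative sketch only: the study used the caret package in R, whereas here the scoring model is a small KNN regressor evaluated by leave-one-out RMSE, and the function names and the synthetic data are assumptions.

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def loo_rmse(X, y, cols, k=3):
    """Leave-one-out RMSE of the KNN regressor restricted to feature indices cols."""
    squared = 0.0
    for i in range(len(X)):
        train_X = [[row[c] for c in cols] for j, row in enumerate(X) if j != i]
        train_y = [v for j, v in enumerate(y) if j != i]
        query = [X[i][c] for c in cols]
        squared += (y[i] - knn_predict(train_X, train_y, query, k)) ** 2
    return math.sqrt(squared / len(X))

def forward_selection(X, y, k=3):
    """Add the feature that most reduces the error until no feature helps."""
    remaining = list(range(len(X[0])))
    selected, best = [], math.inf
    while remaining:
        score, col = min((loo_rmse(X, y, selected + [c], k), c) for c in remaining)
        if score >= best:
            break  # no candidate improves the current subset
        selected.append(col)
        remaining.remove(col)
        best = score
    return selected, best

# Synthetic data: feature 0 drives the target, feature 1 is unrelated.
X = [[float(i), float((7 * i) % 12)] for i in range(12)]
y = [2.0 * i for i in range(12)]
selected, best = forward_selection(X, y)
```

Backward selection follows the same pattern but starts from all features and removes one at a time, which is why the two searches can end with different subsets.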

Machine learning models

For modeling GWS, five ML models were employed, namely, RF, HyFIS, KNN, SVM, and CRM. A succinct overview of each ML model is provided below.

RF is a tree-based machine learning algorithm (Cutler et al. 2012) that harnesses the collective strength of multiple decision trees. Ho (1995) developed the first such algorithm, and Breiman (2001) and Cutler et al. (2012) expanded upon her work, refining and popularizing the algorithm for broader applications in predictive modeling. The fundamental concept is to construct numerous decision trees from the dataset and then amalgamate them into a single predictive model known as the random forest (Parzinger et al. 2022; Kim et al. 2024). RF is attractive and widely applied by researchers due to its high accuracy and efficiency, rapid convergence, and lower susceptibility to overfitting (Wang et al. 2024; Li et al. 2021). Another notable advantage of this method is its high flexibility, as it does not rely on assumptions about data distribution or necessitate detailed physical models (Parzinger et al. 2022). Additionally, RF can effectively handle missing data and outliers, and it can be used for both classification and regression problems (Kim et al. 2024; Zhang et al. 2024).

HyFIS is a hybrid neuro-fuzzy system proposed by Kim and Kasabov (1999) for constructing and enhancing fuzzy models by combining fuzzy logic principles with the learning capabilities of ANNs (Saleh et al. 2023). It is widely applied in ML, where learning is optimized by a hybrid scheme that consists of two phases: rule finding using a knowledge acquisition module in the initial phase, followed by a parameter learning phase using an error backpropagation learning scheme for a neural fuzzy system (Kim and Kasabov 1999; Hassan and Arman 2023). Wang and Mendel (1992) proposed the technique used to extract fuzzy rules in the HyFIS model, which is a simple method that segments the input and output data into an optimal number of fuzzy sets and then assigns a fuzzy membership function (MF) to each segment (Verma et al. 2022). The procedure is a supervised learning approach that employs gradient descent-based learning algorithms with a multilayer perceptron (Hassan and Arman 2023). In HyFIS, a Gaussian function is applied as the MF and, subsequently, during the prediction stage the standard Mamdani methodology can be employed (Ali et al. 2018).

KNN is a nonparametric, lazy learning algorithm (Silverman and Jones 1989) proposed by Fix and Hodges (1951) and expanded by Cover and Hart (1967). It is one of the most widely employed supervised ML algorithms for forecasting, classification, and regression problems (Chacón et al. 2023). The fundamental concept behind the KNN algorithm is that, in the feature space, if a significant proportion of the k nearest neighbors surrounding a particular sample belong to a specific group, then that sample is also assigned to that group (Liu et al. 2022; Chacón et al. 2023; Gomez-Gil et al. 2024). In its simplest form (k = 1), the KNN algorithm categorizes an unknown data point by selecting the category of the most similar data point from the training dataset, where similarity is often determined by calculating the Euclidean distance between them (Motevalli et al. 2019; Zamri et al. 2022).
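The neighbor-voting idea described above can be sketched in a few lines of Python. The sketch assumes Euclidean distance and unweighted neighbors; for a continuous target such as EC, the majority vote is replaced by the average of the neighbors' values. The function names and the small dataset are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train_X, x, k):
    """Indices of the k training points closest to x (Euclidean distance)."""
    return sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]

def knn_classify(train_X, labels, x, k=3):
    """Assign the most common label among the k nearest neighbors."""
    votes = Counter(labels[i] for i in k_nearest(train_X, x, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, values, x, k=3):
    """Predict a continuous value as the mean over the k nearest neighbors."""
    idx = k_nearest(train_X, x, k)
    return sum(values[i] for i in idx) / len(idx)

# Hypothetical standardized covariates with class labels and EC values (mS/cm).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["low", "low", "low", "high", "high", "high"]
ec = [2.0, 2.2, 2.4, 6.0, 6.3, 6.6]

label = knn_classify(train_X, labels, [0.2, 0.2])
value = knn_regress(train_X, ec, [0.2, 0.2])
```

Because distances drive the result, the features should be standardized first, otherwise a covariate with a large numeric range dominates the neighbor search.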

SVM is a powerful supervised ML technique developed by Cortes and Vapnik (1995), originally conceived for classification tasks (He et al. 2022) and later extended to regression problems owing to its empirical success and application in various research areas (Onyekwena et al. 2022). The SVM gained widespread attention in the first two decades of the twenty-first century due to its robust statistical and mathematical foundations, which support the principles of generalization and optimization (Wang et al. 2024; Onyekwena et al. 2022), and to its notable characteristics, including independence from data distribution, straightforward algorithm structure, manageable computational complexity, and remarkable generalization capabilities (Joshi 2020). The SVM prioritizes fitting the best line within a specified margin over minimizing the variation between observed and predicted values (Ouameur et al. 2020). This margin delineates the separation between the boundary line and the hyperplane, with the nearest data points on both sides designated as support vectors (Majumdar et al. 2023).
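In the regression setting, the margin idea described above is usually formalized with the ε-insensitive loss. The text does not give this formula, so the notation below is an assumption based on the standard formulation of support vector regression:

$$L_{\varepsilon}\left(y,f(x)\right)=\max\left(0,\left|y-f(x)\right|-\varepsilon\right)$$

where y is the observed value, f(x) the prediction, and ε the half-width of the margin. Errors smaller than ε cost nothing, so only points on or outside the margin (the support vectors) influence the fitted function.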

CRM is a rule-based model built on Quinlan’s M5 model tree (Quinlan 1992) that uses ensemble learning principles to enhance accuracy by combining multiple model trees (Ao et al. 2024). In the ensemble model, each tree mimics a regression tree but substitutes the constant values in terminal nodes with linear regression models (Quinlan 1992; Li et al. 2020). Terminal nodes represent distinct areas in the input space, and explanatory variables are included in the linear regression models if they significantly influence the response variable within those areas (Li et al. 2020). The main advantage of the cubist model lies in its ability to handle complex non-linear relationships between the inputs (explanatory variables) and the output (target variable) (Ao et al. 2024).

Validation and performance criteria

Validation assesses the efficiency of the applied models. The models are built using the hold-out method, where 70% of the dataset is used for training, while the remaining 30% is reserved for testing. K-fold cross-validation is an appropriate evaluation method for a limited dataset (Suleymanov et al. 2023; Chacón et al. 2023). In this study, fivefold cross-validation (K = 5), repeated five times, is applied on the training set for all the modeling techniques via the caret package.
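The resampling scheme can be sketched in Python as follows. This is a simplified illustration, not the caret routine used in the study: a plain random 70/30 split is assumed, and the function names and seeds are hypothetical.

```python
import random

def holdout_split(n, train_fraction=0.7, seed=0):
    """Randomly split sample indices 0..n-1 into training and testing sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(train_fraction * n)
    return idx[:cut], idx[cut:]

def repeated_kfold(indices, k=5, repeats=5, seed=0):
    """Yield (training, validation) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # k nearly equal folds
        for i in range(k):
            validation = folds[i]
            training = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield training, validation

# 56 boreholes, as in the study; the split itself is only an example.
train_idx, test_idx = holdout_split(56)
splits = list(repeated_kfold(train_idx))
```

With 56 samples, this gives 39 training and 17 testing points, and 25 resamples in which each validation fold holds 7 or 8 wells. The test indices are never touched during cross-validation.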

Three performance metrics are utilized to describe the accuracy of the ML models, namely root mean squared error (RMSE), mean absolute error (MAE), and the correlation coefficient (R).

RMSE is defined as

$$RMSE=\sqrt{\frac{\sum_{i=1}^{n}{({GWS}_{i}^{O}-{GWS}_{i}^{P})}^{2}}{n}}$$
(1)

MAE is defined as

$$MAE=\frac{1}{n}\sum\nolimits_{i=1}^{n}\left|{GWS}_{i}^{O}-{GWS}_{i}^{P}\right|$$
(2)

R is defined as

$$R=\frac{\sum_{i=1}^{n}({GWS}_{i}^{O}-\overline{{GWS}^{O}})({GWS}_{i}^{P}-\overline{{GWS}^{P}})}{\sqrt{\sum_{i=1}^{n}{({GWS}_{i}^{O}-\overline{{GWS}^{O}})}^{2}}\sqrt{\sum_{i=1}^{n}{({GWS}_{i}^{P}-\overline{{GWS}^{P}})}^{2}}}$$
(3)

where ${GWS}_{i}^{O}$ denotes the measured groundwater salinity at location i, ${GWS}_{i}^{P}$ the predicted groundwater salinity at location i, $\overline{{GWS}^{O}}$ and $\overline{{GWS}^{P}}$ the means of the measured and predicted values, respectively, and n the total number of sampling points.
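The three criteria can be computed with a short Python sketch, with R taken as the standard Pearson correlation coefficient between observed and predicted values. The function names and the observed and predicted values are hypothetical and serve only to illustrate the formulas.

```python
import math

def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

# Hypothetical EC values in mS/cm.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

scores = (rmse(observed, predicted), mae(observed, predicted),
          pearson_r(observed, predicted))
```

RMSE and MAE are expressed in the units of EC, whereas R is dimensionless. RMSE penalizes large errors more heavily than MAE, so reporting both gives a fuller picture.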

Spatial interpolation of groundwater salinity using kriging

Spatial interpolation methods are frequently applied in various fields to estimate values of a variable at locations lacking direct measurements. Kriging is a robust geostatistical interpolation method founded upon the theory of regionalized variables (Delhomme 1978; Şen 1989; Miao and Wang 2024). Ordinary kriging (OK) seeks to offer unbiased and optimal estimates of variables by examining the spatial relationships between data points within the analyzed area through the semi-variance (Qu et al. 2024; Zhu et al. 2021). The Geostatistical Analyst extension of ArcGIS 10.8 is used for conducting OK and spatial prediction of GWS.
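For readers unfamiliar with OK, its two ingredients can be written compactly. The notation below is the textbook form and is an assumption, since the text does not give the formulas. The estimate at an unsampled location $s_0$ is a weighted sum of the n measured values, with weights constrained to sum to one so that the estimate is unbiased:

$$\hat{Z}(s_{0})=\sum_{i=1}^{n}\lambda_{i}Z(s_{i}),\qquad \sum_{i=1}^{n}\lambda_{i}=1$$

The weights $\lambda_i$ are derived from the semivariogram, whose empirical estimate at lag distance h is

$$\hat{\gamma}(h)=\frac{1}{2N(h)}\sum_{i=1}^{N(h)}{\left[Z(s_{i})-Z(s_{i}+h)\right]}^{2}$$

where $Z(s_i)$ is the measured EC at location $s_i$ and N(h) is the number of sample pairs separated by h.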

Results and discussion

Descriptive statistics of GWS

The descriptive statistics of GWS in terms of EC (mS/cm) are summarized in Table 1. The EC values range from 1.45 to 9.62 mS/cm, with a mean value of 4.573 mS/cm. The results indicate considerable variability, with a coefficient of variation of 41.5%. The Shapiro–Wilk (SW) test is used to assess the normality of EC. The test yielded a significance value (p-value) of 0.122, which exceeds the conventional 0.05 level, so normality is not rejected. Additionally, a skewness of 0.226 (close to zero) and a kurtosis of 2.239 (close to three) are consistent with the normal distribution assumption for EC (Fig. 5).
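The summary statistics discussed above can be reproduced with a short Python sketch. The estimators used in the study are not stated, so the conventions here are assumptions: the coefficient of variation uses the sample standard deviation, moments use a 1/n divisor, and kurtosis is reported in its non-excess form (3 for a normal distribution). The input values are hypothetical.

```python
import statistics

def describe(values):
    """Return mean, CV (%), skewness, and non-excess kurtosis of a sample."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    # Central moments with a 1/n divisor (one common convention; an assumption).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "cv_percent": 100 * sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # equals 3 for a normal distribution
    }

# Hypothetical EC values in mS/cm, for illustration only.
summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])
```

A skewness near zero indicates a symmetric distribution, and a kurtosis near three indicates tails similar to those of a normal distribution, which is the reading applied to the EC data.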

Table 1 Descriptive statistics of EC (mS/cm)

Fig. 5 QQ plot of EC distribution

Spatial interpolation of groundwater salinity

In this research, OK is employed to generate the spatial distribution map of GWS using only the EC field data. The Shapiro–Wilk test indicated that the EC data follow a normal distribution; therefore, no transformation is required. The primary processing step in kriging approaches involves fitting a theoretical semivariogram model to the empirical semivariogram. Various models, such as spherical, Gaussian, and exponential, are tested, and the most suitable model is selected based on the results of cross-validation, namely by identifying the model with the lowest MAE and RMSE. The parameters of the best-fitted semivariogram are given in Table 2. The cross-validation results of the OK method show good accuracy, with a low RMSE of 1.045 mS/cm and MAE of 0.775 mS/cm, as well as a high R value of 0.832.
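The three candidate semivariogram models can be written as small Python functions. This is a sketch under stated assumptions: the nugget, partial sill, and range parameterization and the factor 3 (practical range) in the exponential and Gaussian models follow one common convention and may differ from the ArcGIS implementation. The parameter values are hypothetical and are not those of Table 2.

```python
import math

def spherical(h, nugget, partial_sill, rng):
    """Spherical model: reaches the sill exactly at the range."""
    if h == 0:
        return 0.0
    if h >= rng:
        return nugget + partial_sill
    r = h / rng
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, partial_sill, rng):
    """Exponential model: approaches the sill asymptotically."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / rng))

def gaussian(h, nugget, partial_sill, rng):
    """Gaussian model: parabolic near the origin, smooth spatial variation."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * (h / rng) ** 2))

# Hypothetical parameters: nugget 0.2, partial sill 1.0, range 10 km.
lags = [0.0, 2.5, 5.0, 10.0, 20.0]
curves = {
    model.__name__: [model(h, 0.2, 1.0, 10.0) for h in lags]
    for model in (spherical, exponential, gaussian)
}
```

Each model is then used in kriging, and the one with the lowest cross-validation MAE and RMSE is retained, as described above.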

Table 2 Parameters of the best fitted semivariogram

The map of GWS generated using OK (Fig. 6) shows the spatial distribution of the various GWS classes in the Sidi Okba region. According to the US Salinity Laboratory Staff (Richards 1954), EC (mS/cm) measurements categorize water salinity for irrigation into five classes: 0 ≤ low (C1) ≤ 0.25, 0.25 < medium (C2) ≤ 0.75, 0.75 < high (C3) ≤ 2.25, 2.25 < very high (C4) ≤ 5, and excessive (C5) > 5. Each class delineates the suitability of water for different crops and soil types, ranging from low risk of salinization to unsuitability for irrigation. C1 is suitable for most crops, while C5 is deemed unsuitable. Classes C2 to C4 require varying degrees of caution and crop selection based on salt tolerance and drainage conditions. Ayers and Westcot (1988) showed through extensive experiments that a groundwater EC of 3 mS/cm is acceptable for irrigating most crops. In this research, since the EC values range from 1.45 to 9.62 mS/cm, the groundwater of the study area falls into three classes: C3, C4, and C5.
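The class limits quoted above translate directly into a small lookup, sketched here in Python. The function names are hypothetical, and the example computes shares of sample points, not the raster surface areas reported in the study.

```python
def salinity_class(ec):
    """Return the irrigation salinity class for an EC value in mS/cm."""
    if ec < 0:
        raise ValueError("EC must be non-negative")
    if ec <= 0.25:
        return "C1"  # low
    if ec <= 0.75:
        return "C2"  # medium
    if ec <= 2.25:
        return "C3"  # high
    if ec <= 5:
        return "C4"  # very high
    return "C5"      # excessive

def class_shares(ec_values):
    """Percentage of values falling in each class."""
    counts = {}
    for ec in ec_values:
        cls = salinity_class(ec)
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: 100 * c / len(ec_values) for cls, c in counts.items()}

# The minimum, mean, and maximum EC reported in the study, plus the 3 mS/cm threshold.
shares = class_shares([1.45, 3.0, 4.573, 9.62])
```

Applied to every cell of a predicted EC raster instead of a list of points, the same lookup yields the class map and the area percentages.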

Fig. 6 EC spatial distribution map using OK

From the spatial distribution map in Fig. 6 and the values in Table 3, groundwater at very high risk of salinity (C4) is dominant, covering 47.85% of the total surface area. It is localized particularly in the middle and southern parts of the study area. Groundwater at excessive risk of salinity (C5) occupies a large part of the study area (38.57% of the total surface area). This class is located specifically in the northwestern part, with an extension into the middle of the study area (Sidi Okba oasis) along the main road linking the city of Biskra and the town of Sidi Okba. The least saline groundwater (C3) is located in the northeastern part of the study area near the Foum El Gherza dam and the beginning of Wadi Elbiraz, covering 13.58% of the total surface area. According to Ayers and Westcot’s recommendations, only 27% of the groundwater in the study area is deemed acceptable for irrigation, and it is located in the eastern part of the study area along Wadi Elbiraz.

Table 3 Surface areas of different classes predicted using OK and RF

Digital mapping of groundwater salinity using machine learning techniques

Covariate selection and model performance

The results in Tables 4, 5, and 6 show the covariates selected using RFE, FFS, and BFS, respectively, as inputs to the five ML techniques, together with their tuning parameters and performance statistics in terms of the RMSE, R, and MAE criteria.

Table 4 Model performance metrics for GWS prediction using RFE selection method
Table 5 Model performance metrics for GWS prediction using FFS selection method
Table 6 Model performance metrics for GWS prediction using BFS selection method

For the RFE selection method (Table 4), only five of the original ten candidate covariates were selected as inputs to the RF, HyFIS, KNN, CRM, and SVM modeling techniques. From the most to the least important, they are the distance to Elbiraz Wadi (DTBW), longitude (X), elevation, distance to streams (DTS), and aspect. During the training phase, the SVM model achieved slightly better predictions of GWS (RMSE = 1.010, R = 0.865, and MAE = 0.750) than the RF, CRM, KNN, and HyFIS models. During the testing phase, the RF model gave better predictions of GWS (RMSE = 1.069, R = 0.831, and MAE = 0.921) than the SVM model. Both the SVM and RF models outperformed the CRM, HyFIS, and KNN models.

For the FFS selection method (Table 5), nine out of 10 covariates (Y, TWI, slope, curvature, elevation, aspect, FlowD, DTBW, and DTS) were selected as inputs for the different ML techniques. The RF model outperformed all other models in both the training and testing phases. In the training phase, it achieved an RMSE of 1.113, an R-value of 0.817, and an MAE of 0.903. In the testing phase, it continued to perform well, with an RMSE of 1.150, an R-value of 0.862, and an MAE of 0.878. The CRM model was the second-best performer, ahead of SVM and the other models.

For the BFS method (Table 6), all the covariates, i.e., Y, X, TWI, slope, curvature, elevation, aspect, FlowD, DTBW, and DTS, were used as inputs for the various ML techniques. The RF model was again the best performer, outperforming all other models. In the training phase, the RF model achieved an RMSE of 1.171, an R-value of 0.812, and an MAE of 0.958. In the testing phase, it attained an RMSE of 1.163, an R-value of 0.858, and an MAE of 0.918. These results indicate the strong performance of the RF model in both phases. Among the remaining models, the CRM model performed best, achieving the lowest RMSE and MAE as well as the highest R-value in both phases.

Performance during the training phase measures how well the model fits the training dataset. However, this measure alone does not evaluate the model’s prediction and generalization abilities (Tran et al. 2021). In contrast, the model’s predictive performance, which evaluates its accuracy during the testing phase, better demonstrates its ability to predict outcomes reliably (Rahmati et al. 2019; Tran et al. 2021). Therefore, the optimal model to predict GWS in the study area was selected by evaluating both its goodness-of-fit performance during the training phase and its generalization capabilities as demonstrated by its prediction performance during the testing phase.

The model chosen for predicting and mapping GWS in the study area is the RF alternative that uses DTBW, X, elevation, DTS, and aspect as inputs, selected through the RFE selection method. This model combined low errors with high correlation in both phases, with RMSE values of 1.016 and 1.069, MAE values of 0.759 and 0.831, and R-values of 0.854 and 0.831 for the training and testing phases, respectively.

Groundwater salinity map

While all input variables are available as continuous raster maps across the entire study area, GWS in terms of EC (mS/cm) is known only at the sampled wells distributed within it. The selected best model is therefore applied to these rasters to generate a digital map of GWS (Fig. 7) across the entire Sidi Okba region. The generated GWS map is then imported into ArcMap for classification and layout.

Fig. 7 Digital map of GWS in terms of EC generated using RF

The GWS classes of the digital map generated using RF are consistent with those of the spatial distribution map from OK, displaying the same general structure and distribution; the two maps differ only in the extent of each class. This congruence enhances the reliability of the modeling results. The surface area of each class is presented in Table 3. Analysis of Fig. 7 and this table reveals that the RF model tends to underestimate the high-risk salinity class (C4), which accounts for 7.04% of the total surface area. Conversely, RF tends to overestimate the excessive salinity class (C5), which covers 46.68% of the total surface area. Additionally, 46.28% of the area is underlain by groundwater in the C3 category, indicating a very high salinity level. According to Ayers and Westcot’s thresholds, groundwater acceptable for irrigation covers less than 15% of the entire study area.
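
The share of the area falling in each salinity class can be derived from the predicted EC grid by classifying every cell and counting. The sketch below assumes equal-area cells; the class breaks in `ASSUMED_BREAKS` are placeholders chosen for illustration only and are not the class limits used in this study.

```python
from bisect import bisect_right
from collections import Counter

# Placeholder class breaks in mS/cm (assumption, not the study's limits).
ASSUMED_BREAKS = [0.7, 3.0, 6.0, 9.0]
CLASS_LABELS = ["C1", "C2", "C3", "C4", "C5"]

def classify_ec(ec, breaks=ASSUMED_BREAKS, labels=CLASS_LABELS):
    """Return the class label for one EC value.

    A value equal to a break is assigned to the higher class.
    """
    return labels[bisect_right(breaks, ec)]

def class_area_shares(ec_grid, breaks=ASSUMED_BREAKS, labels=CLASS_LABELS):
    """Percentage of valid (non-None) cells in each class, assuming equal-area cells."""
    counts = Counter(
        classify_ec(value, breaks, labels)
        for row in ec_grid
        for value in row
        if value is not None
    )
    total = sum(counts.values())
    return {label: 100.0 * counts.get(label, 0) / total for label in labels}
```

Multiplying each share by the total mapped area converts the percentages into surface areas comparable to those reported per class.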

Discussions and predictor variables’ importance

The type and number of input combinations play a crucial role in the accuracy of ML techniques, as demonstrated by the results of the three selection methods (RFE, FFS, and BFS) applied in this study to identify appropriate predictors (DEM derivatives) of GWS in the Sidi Okba region. The results revealed that different selection methods can yield different input combinations, an aspect often overlooked by researchers. This finding is consistent with the results of Theng and Bhoyar (2024), who concluded that the presence of redundant and irrelevant features can make ML algorithms less effective. The RFE selection method identified five DEM derivatives as top predictors (Table 4), the FFS selection method identified nine variables (Table 5), and BFS retained all variables as top predictors (Table 6). Applying these input combinations to five ML modeling techniques (RF, HyFIS, KNN, CRM, and SVM) demonstrated that using only the five most important predictors consistently yielded the highest accuracy across all models during both the training and testing phases. This result is supported by the comprehensive survey conducted by Li et al. (2017), which provides empirical evidence that fewer features often lead to better model performance, especially in terms of accuracy and generalization.
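
To illustrate why different selection procedures can return different subsets, the following is a generic sketch of greedy forward selection. It is not the implementation used in this study: `score` is a hypothetical callable that returns an error (for example, a cross-validated RMSE) for a candidate subset, and `min_gain` is an assumed stopping tolerance.

```python
def forward_selection(features, score, min_gain=0.0):
    """Greedy forward feature selection.

    features: iterable of candidate feature names.
    score:    callable(list of names) -> error to minimize (hypothetical).
    min_gain: stop when the best addition improves the error by no more
              than this amount (assumed tolerance).
    Returns the selected features in order of inclusion and the final error.
    """
    selected = []
    remaining = list(features)
    best_error = float("inf")

    while remaining:
        # Score every subset formed by adding one remaining feature.
        trial_errors = {f: score(selected + [f]) for f in remaining}
        candidate = min(trial_errors, key=trial_errors.get)

        # Stop when no addition improves the error enough.
        if best_error - trial_errors[candidate] <= min_gain:
            break

        selected.append(candidate)
        remaining.remove(candidate)
        best_error = trial_errors[candidate]

    return selected, best_error
```

Because the procedure adds one feature at a time and stops at the first step without improvement, its result depends on the scoring model and the stopping rule, which is one reason elimination-based and addition-based methods need not agree.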

The results of this research indicated that RF is the most effective modeling technique for predicting GWS using distance to Elbiraz wadi (DTBW), X, elevation, distance to streams (DTS), and aspect as input variables. The best RF model was employed to generalize the predictions across the entire study area and to generate the digital map of GWS from the raster layers of the five best predictors. This result aligns with recent research showing the superior predictive capability of RF compared to other ML methods in predicting groundwater quality parameters such as groundwater nitrate pollution (Ouedraogo et al. 2019), groundwater contamination by ammonia (Madani et al. 2022), and groundwater arsenic contamination (Guo et al. 2023; Iqbal et al. 2024).

The feature importance analysis indicated that the distance to Elbiraz wadi, which is fed by releases from the Foum Elgherza dam and other tributaries, is the most impactful variable for predicting GWS in the Mio-Pliocene aquifer in the Sidi Okba region, with a variable importance score (VIS) of 16.37 (see Fig. 8). The digital map clearly shows that GWS increases with distance from the Foum Elgherza dam and Elbiraz wadi. This increase is likely due to the reduced recharge of the aquifer away from the wadi. In this context, Balakrishnan et al. (2024) confirmed the essential role of streams and wadis not only in groundwater recharge but also in the dissolution, mobilization, and distribution of salts within aquifers. Longitude (X) is the second most important variable affecting the accuracy of the prediction results (VIS = 10.93). This is due to its strong negative correlation with GWS, which generally decreases from the western to the eastern parts of the study area. Moreover, the northeastern part of the study area, which is characterized by lower GWS, is associated with higher elevations. In contrast, the northwestern and southeastern parts, which display excessive GWS levels, are associated with lower elevations. Elevation (VIS = 8.91) can influence groundwater flow paths, as water typically flows from higher to lower elevations, transporting minerals and salts along the way. This process can increase salinity in lower elevation areas. Similar observations were made by Mosavi et al. (2020), who indicated that lower elevation areas in the Sarvastan plain (Iran) are highly susceptible to GWS evolution. DTS is the fourth most important variable (VIS = 6.45): the middle of the study area, which exhibits excessive GWS, coincides with the areas farthest from streams. This can be attributed to the low recharge rates in the middle of the study area, which is also characterized by intensive agriculture and a high number of wells. Under these circumstances, the overexploitation of groundwater contributes to the increase of GWS. Pulido-Bosch et al. (2018) and Foster et al. (2018) indicated that poor irrigation practices can intensify GWS through the leaching of soil salts into groundwater. Aspect, which refers to the orientation of the slope, is the least important of the five predictors, with the lowest VIS of 0.75. It affects GWS dynamics through its impact on precipitation patterns and runoff direction, which influence the movement of water and salts within the landscape (Benjmel et al. 2022).
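
The relative weight of the five predictors is easier to compare when the VIS values quoted above are expressed as shares of their sum. The short sketch below performs this normalization; the rescaling to percentages is an illustrative convenience and is not part of the original analysis.

```python
# VIS values as reported in the text for the five selected predictors.
vis = {"DTBW": 16.37, "X": 10.93, "elevation": 8.91, "DTS": 6.45, "aspect": 0.75}

def relative_importance(scores):
    """Rank predictors by score and express each as a percentage of the total."""
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, 100.0 * value / total) for name, value in ranked]

for name, share in relative_importance(vis):
    print(f"{name}: {share:.1f}% of summed VIS")
```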

Fig. 8 Predictor variables’ importance

The GWS classes on the digital map generated using RF align with those on the spatial distribution map from OK, sharing the same general structure and distribution. This congruence enhances the reliability of the modeling results. However, the extent of each class varies because the RF model tends to moderately underestimate the high-risk salinity class (C4) and slightly overestimate the excessive salinity class (C5). The final digital map of GWS illustrates the alarming state of groundwater quality in the study area. The high levels of GWS may lead to significant environmental problems and economic costs, posing a substantial risk to global health (Jamei et al. 2022) and emphasizing the urgent need for intervention and remedial measures. Despite this alarming situation, farmers in the study area continue to use groundwater for irrigation without implementing sufficient measures to mitigate the ongoing deterioration of groundwater quality. Authorities and policymakers must respond to increasing groundwater salinity with urgent interventions, including: careful management and planning of freshwater resources (e.g., the Foum Elgherza dam); communicating to the public, especially farmers, the need to prevent overexploitation and contamination of groundwater resources by anthropogenic activities; raising farmers’ awareness of the hazards of saline irrigation water and supporting the implementation of effective methods such as leaching, drainage systems, reverse osmosis desalination of groundwater, and fertigation using modern technologies, to prevent the infiltration of saltwater back into the aquifer and the subsequent increase of GWS; and ceasing pumping from wells that are no longer suitable for irrigation due to excessive groundwater salinity.

Overall, comparison of the main observations in this study with similar groundwater salinity (GWS) modeling studies in various regions worldwide, such as Bangladesh (Jamei et al. 2022), Vietnam (Tran et al. 2021), Algeria (Tachi et al. 2023), and Iran (Gharechaee et al. 2024), demonstrates that the combination of appropriate machine learning techniques and the effective selection of readily available digital elevation model (DEM) derivatives can result in accurate GWS predictions, even in the absence of specific aquifer parameters, which are costly and not available everywhere.

Conclusion

Modeling and mapping of groundwater salinity (GWS) are essential for groundwater resources management in any region, and especially in arid regions. In this study, five machine learning (ML) techniques are applied to electrical conductivity measurements and digital elevation model (DEM) derivatives for GWS mapping in the Sidi Okba region. A limited number of strategically positioned wells and 10 DEM derivatives are used, with input combinations selected through the RFE, FFS, and BFS methods. The ML models (RF, HyFIS, KNN, CRM, and SVM) are evaluated using the RMSE, R, and MAE metrics. The main findings of this study are as follows.

  • The type and number of input combinations significantly influence the accuracy of machine learning (ML) techniques. The three selection methods used in the study (RFE, FFS, and BFS) identified different sets of predictors (DEM derivatives) for GWS in the Sidi Okba region, and fewer, strategically selected features led to better model performance.

  • The random forest (RF) model with five key predictors (distance to Elbiraz wadi, X, elevation, distance to streams, and aspect) is found to be the most effective option for GWS prediction. This model outperformed the other ML techniques and is therefore used to generate a digital map of GWS in the study area.

  • The most impactful variables for predicting GWS are identified as the distance to Elbiraz wadi, longitude (X), and elevation. These variables are closely linked to the spatial distribution of GWS, with higher salinity levels found further away from the wadi and in lower elevation areas.

  • The study emphasizes the need for authorities and policymakers to implement interventions such as careful management of groundwater as a freshwater resource, public awareness campaigns, and the adoption of effective irrigation practices to mitigate the ongoing deterioration of groundwater quality due to salinity.

This study has some limitations that point to future research directions: the dataset used is relatively small, and the resolution of the DEM derivatives is limited to 30 × 30 m. Expanding the dataset to include more groundwater wells covering the entire Mio-Pliocene aquifer in the region, and using higher resolution remote sensing derivatives, could enhance the accuracy and reliability of GWS predictions. In addition, long-term monitoring and temporal analysis of groundwater salinity are necessary to assess changes and salinity trends over time. Despite these limitations, the results of this study carry substantial significance for groundwater managers in the study area, as well as for researchers worldwide engaged in similar investigations.