Introduction

Groundwater is an important natural resource. It constitutes the main source of water for industry and irrigated agriculture in arid and semiarid areas (Nampak et al. 2014). The effective management of groundwater quality and quantity has become a major issue, since climate change, rapid population growth, and the overuse of groundwater for irrigation can all have major effects on the resource. Therefore, to ensure the sustainable management of groundwater, an assessment of groundwater resources and of the associated pressures at the local scale is strongly required (Hasiniaina et al. 2010).

Nitrate (NO3) is the most abundant pollutant in groundwater (Laftouhi et al. 2003; Moore et al. 2006). NO3 concentrations increase with the expansion and intensification of agricultural activities, owing to the overuse of nitrogen fertilizers (Nolan 2001; Puckett et al. 2011; Ki et al. 2015). The consumption of water polluted by NO3 can be associated with health problems, such as methemoglobinemia and cancers for adults (Ward et al. 2005).

In the Sais basin, two aquifers are present: the lias and the plioquaternary aquifers. They are mainly used for drinking water and irrigation. These aquifers have been the subject of several geomorphological, geological, hydrogeological, and geophysical studies (Taltasse 1953; Chamayou et al. 1975; Fassi 1999; Essahlaoui et al. 2001; Amraoui 2005). The plioquaternary aquifer is more heavily used for irrigation and as drinking water for the rural population, because it is shallower than the lias aquifer.

The geological and hydrogeological characteristics of the Sais basin may favor the NO3 pollution of the plioquaternary aquifer (Tabyaoui et al. 2003). Therefore, the assessment of its degree of vulnerability can be an important tool for groundwater resource management, since it allows identifying the areas of the basin that are most affected by, or at high risk of, NO3 contamination.

Groundwater vulnerability is defined as the degree of protection that the natural environment provides against groundwater pollution (National Research Council 1993). There are two types of groundwater vulnerability. The first is intrinsic vulnerability, which is assessed from the characteristics of the natural environment, including aquifer, soil, and climatic characteristics (Schnebelen et al. 2002). This type of vulnerability is considered static and invariable. Several methods have been proposed for assessing intrinsic vulnerability, including the DRASTIC, GOD, SI, and SINTACS frameworks (Ghazavi and Ebrahimi 2015; Al-Shatnawi et al. 2015; Baghapour et al. 2016; El Himer et al. 2013). The second is specific vulnerability, which concerns a specific pollutant or group of pollutants. It is assessed using the intrinsic properties of the basin, the characteristics of the pollutant, and the anthropogenic factors related to the pollutant (Ribeiro et al. 2017). Specific vulnerability is assumed to be dynamic and closer to reality: unlike intrinsic vulnerability, it can change over time.

In recent years, machine learning techniques such as artificial neural network (ANN), support vector machine (SVM), random forest (RF), and decision tree (CART) have been applied in several fields. The RF model is robust and easy to apply compared to other machine learning techniques, and it can determine the importance of each explanatory variable in the prediction result. In addition, the RF model can provide good results compared to multivariate statistics or other machine learning methods such as SVM and ANN (Breiman 2001; Liaw and Wiener 2002; Loosvelt et al. 2012; Ouedraogo et al. 2018).

In groundwater research, the RF method has been used to predict NO3 and arsenic (As) concentrations in groundwater (Anning et al. 2012; Wheeler et al. 2015) and to assess groundwater vulnerability (Rodriguez-Galiano et al. 2014; Mendes et al. 2016). These studies revealed that RF has good prediction performance.

To the best of our knowledge, no previous study has assessed groundwater vulnerability using machine learning in Morocco. Furthermore, no study has aimed to assess the specific groundwater vulnerability to NO3 in the Sais basin. Sadkaoui et al. (2013) applied intrinsic methods in the Sais basin to assess groundwater vulnerability. However, the ratings proposed by some intrinsic frameworks such as DRASTIC (Aller et al. 1987) may differ depending on the specificities of the study area. Additionally, intrinsic methods may ignore some important parameters that affect groundwater vulnerability. Consequently, the RF model offers a novel technique for groundwater vulnerability assessment in Morocco.

The main objective of this study was to develop an accurate RF model to assess the specific groundwater vulnerability to NO3 of the plioquaternary aquifer of the Sais basin, using 14 parameters that may contribute to NO3 pollution.

The output of this research will help to:

  1. Identify the areas most vulnerable to NO3 pollution in the Sais basin;

  2. Determine the most important factors that control the groundwater vulnerability to NO3 pollution of the plioquaternary aquifer.

Materials and methods

Research area

The research area is the Sais basin, part of the Fez-Meknes region in Morocco (Fig. 1). The surface area of the basin is approximately 2100 km2. The basin is located between latitudes 33°38′ and 34°4′N and longitudes 5°49′ and 4°53′W. It is bounded by the Middle Atlas ranges to the south and the Rif ranges to the north (Fig. 2). The geological setting is mainly dominated by the lacustrine limestone of the lias. The altitude of the study area varies between 185 m in the north and 1047 m in the south at the Middle Atlas ranges, with an average of 600 m. The study area is characterized by a Mediterranean climate (Amraoui 2005). The mean annual rainfall recorded by three stations located at Douyet (northeast of the basin), Meknes, and Ain Taoujdate during the period 1981–2018 is 468 mm. The Sais basin is characterized by intensive agricultural activity owing to good soil fertility. Agriculture is practised under both rainfed and irrigated conditions. Moreover, the Sais basin contains several lithological classes, including sandstone, siltstone, marlstone, alluvium, and oncolite limestone, representing, respectively, 39, 18.4, 18, 11.6, and 10.6% of the total surface area of the basin.

Fig. 1 Geographic location of the Sais basin

Fig. 2 Geological map of the Sais basin

In the Sais plain, two aquifers are distinguished: the lias aquifer, which consists of dolomitic limestone, and the plioquaternary aquifer, composed of pliovillafranchien sandstone, conglomerate sand, and lake limestone (Essahlaoui et al. 2001; Tabyaoui et al. 2004; Amraoui 2005; Belhassan et al. 2010). The latter has a substratum from the upper Miocene with a depth exceeding 1000 m in some parts of the northeastern Sais basin. The plioquaternary aquifer is recharged mainly by the infiltration of rainfall and irrigation water, as well as by drainage from the lias aquifer in the southern part of the Sais basin (Sadkaoui et al. 2013).

Random forest

Random forest is a supervised nonparametric machine learning method developed by Breiman (2001). It is based on an ensemble of decision trees (Rodriguez-Galiano et al. 2014; Catani et al. 2013; Micheletti et al. 2013). The method is used for data prediction and interpretation purposes. An RF model can be built from classification trees or from regression trees (Zabihi et al. 2016).

RF can compute an unbiased error estimate by bootstrapping (Siroky 2009). For each tree, the dataset is divided into two parts: a training sub-dataset containing roughly 2/3 of the observations, drawn at random with replacement, and a validation sub-dataset containing the remaining 1/3. The validation sub-dataset is called out-of-bag (OOB) (Breiman 2001; Catani et al. 2013). It can be used to assess both the prediction performance of RF and the importance of the input variables. In addition, RF has other characteristics that justify its application in groundwater vulnerability assessment:

  • It can manage both categorical and numerical variables;

  • It can determine the importance of each explanatory variable in the prediction result;

  • It can learn complex patterns without assuming a linear relationship between the explanatory variables and the dependent variable;

  • It can handle outliers in the data;

  • It can handle a large dataset with high dimensionality;

  • Its implementation is less complex compared to other machine learning techniques such as ANN and SVM.
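The 2/3 versus 1/3 division described above comes from sampling with replacement. The following minimal sketch, which assumes a hypothetical training set of 123 observations (roughly 80% of the 154 samples), simulates bootstrap draws and reports the average share of observations left out-of-bag. The function names are illustrative and do not belong to any package.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample and return (in-bag draws, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def mean_oob_fraction(n_samples, n_trees, seed=0):
    """Average share of observations that are out-of-bag over n_trees draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        _, out_of_bag = bootstrap_split(n_samples, rng)
        total += len(out_of_bag) / n_samples
    return total / n_trees


# Assumption: 123 training observations, 500 simulated trees.
print(round(mean_oob_fraction(123, 500), 3))
```

The simulated share is close to 1/3 (the theoretical limit is about 0.37), which is why the OOB observations can serve as a built-in validation set for each tree.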

The RF model uses two methods to assess the importance of the explanatory variables used in the prediction. The first one is the mean decrease accuracy (MDA), which is an indirect measure of the effect of each explanatory variable on the prediction accuracy (Calle and Urrea 2010). To compute MDA, RF uses the out-of-bag (OOB) dataset and permutes each explanatory variable in turn while the others are kept fixed. An increase in the RF model error after permutation indicates that the permuted variable is important (Naghibi et al. 2017). The RF model error is calculated from the OOB sub-dataset based on the following formula (Grömping 2009):

$$\text{OOB-MSE} = \frac{1}{n_{\text{OOB},t}} \sum\limits_{\begin{subarray}{l} i = 1 \\ i \in \text{OOB}_{t} \end{subarray}}^{n} \left(y_{i} - \hat{y}_{i}\right)^{2}$$
(1)

where \(y_{i}\) and \(\hat{y}_{i}\) are, respectively, the observed value and the mean of the predicted values from all trees; \(n_{\text{OOB},t}\) is the number of OOB observations in tree \(t\) and \(i\) is the OOB observation for the tree. Therefore, MDA can be an accurate tool for variable selection.
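The permutation idea behind MDA can be sketched as follows. This is a simplified illustration on toy data, not the computation performed by any specific RF package (which may scale the result differently): the model, the data, and the function names are all hypothetical. The response depends only on the first column, so permuting that column should raise the error, while permuting the second column should leave it unchanged.

```python
import random


def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def permutation_increase(predict, rows, observed, column, rng):
    """Percent increase in MSE after shuffling one column of held-out rows."""
    base = mse(observed, [predict(r) for r in rows])
    if base == 0:
        raise ValueError("baseline error is zero; percent increase is undefined")
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return 100.0 * (mse(observed, [predict(r) for r in permuted]) - base) / base


# Toy held-out data (assumption): the response depends on column 0 only.
rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(50)]
observed = [2.0 * r[0] + rng.gauss(0.0, 0.1) for r in rows]


def toy_model(row):
    """Stand-in for a fitted forest; it ignores column 1 entirely."""
    return 2.0 * row[0]


print(permutation_increase(toy_model, rows, observed, 0, rng))  # large increase
print(permutation_increase(toy_model, rows, observed, 1, rng))  # no increase
```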

The second method is the mean decrease in the GINI index, which is based on the decrease in node heterogeneity (impurity). It measures the importance of explanatory variable \(j\) as the weighted sum of the decreases in node heterogeneity obtained at the splits that use this variable, averaged over all trees. The GINI index can be used to explain the strength of the variables used as input in the RF model (Al-Abadi and Shahid 2016). A higher GINI value assigned to a variable indicates that it is more important in the prediction than the other variables (Yang et al. 2019).
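As a minimal sketch of the heterogeneity decrease at a single node, assuming binary labels (0 = no nitrate pollution, 1 = nitrate pollution), the Gini impurity of a parent node can be compared with the size-weighted impurity of its two children. The function names are illustrative only.

```python
def gini(labels):
    """Gini impurity of a node holding binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# A split that separates the two classes perfectly removes all impurity.
print(gini_decrease([0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5
# A split that leaves both children as mixed as the parent removes none.
print(gini_decrease([0, 0, 1, 1], [0, 1], [0, 1]))  # 0.0
```

Summing such decreases over every node where a given variable is used, and averaging over the trees, gives the kind of importance score described above.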

Observed nitrate concentrations

A total of 154 water samples from the plioquaternary aquifer in the rural area of the Sais basin were collected for NO3 analysis. Sampling campaigns were carried out in the spring and autumn seasons of 2013 (56 samples) and 2018 (98 samples). The samples were collected and stored at 2–4 °C and then analyzed within 24 h using the UV spectrophotometric method. The distribution of observed NO3 concentrations in the different sampling campaigns is shown in Fig. 3. The mean NO3 concentrations in spring and autumn were, respectively, 60 and 64 mg/L in 2013 and 77 and 70 mg/L in 2018. Overall, the highest NO3 concentrations were observed in the northern, northwestern, and central parts of the basin.

Fig. 3 Spatial distribution of NO3 concentrations in spring (a) and autumn (b) of 2013 and spring (c) and autumn (d) of 2018

Explanatory variables

To assess the groundwater vulnerability to NO3 in the Sais basin, a total of 14 explanatory variables related to the intrinsic and specific groundwater vulnerability to NO3 were used as RF model inputs (Fig. 4). All variables were mapped using a geographic information system (GIS). Table 1 presents the 14 explanatory variables, their data sources, and their estimation methods. These variables are rainfall, texture (sand, silt, and clay), lithology, organic matter, piezometric level, altitude, land use, calcium carbonate (CaCO3), carbon/nitrogen ratio (C/N), slope, hydraulic gradient, and soil classification. All variables were compiled within a circular area of 500 m radius.

Fig. 4 Raster layers of explanatory variables used in random forest: a (Land use), b (Hydraulic gradient), c (Slope degrees), d (C/N ratio), e (% CaCO3), f (Soil classification), g (% Silt), h (% Sand), i (% Clay), j (Altitude), k (Annual rainfall), l (% Organic matter), m (Piezometric level), n (Lithology)

Table 1 Explanatory variables used in the RF model

The explanatory variables were selected based on the following reasons:

  • The slope is an important parameter that controls the runoff. A low slope contributes to water retention and therefore increases the probability of groundwater contamination (Tilahun and Merkel 2009).

  • The altitude was selected based on the hydrogeology of the Sais basin. A part of the plioquaternary recharge is provided by the lias aquifer in the southern part of the basin, where the altitude is high, which may reduce NO3 pollution by dilution.

  • The piezometric level indicates whether NO3 can rapidly reach the groundwater surface. A shallower water depth can increase the probability of NO3 contamination (Stigter et al. 2005).

  • The rainfall contributes to groundwater recharge, which leads to the leaching of soil NO3 (Aslam et al. 2018).

  • The hydraulic gradient is related to the groundwater flow direction (Rodriguez-Galiano et al. 2014), which may contribute to NO3 accumulation.

  • Lithology can affect groundwater quality. It influences the ease with which pollutants are transferred to the aquifer (Chenini et al. 2015).

  • Soil classification and texture can influence NO3 loss. NO3 leaching may be more important in sandy soils (Ahirwar and Shukla 2018). The texture components (sand, silt, and clay) were introduced in the RF model separately, to determine the most important component.

  • Organic matter and C/N ratio are parameters related to the soil nitrogen cycle, which can contribute to NO3 losses. Moreover, Berdai et al. (2004) considered these two parameters as important in the specific groundwater vulnerability to NO3.

  • Calcareous soils are characterized by high CaCO3 content. The latter is considered a dominant factor in the ammonification and nitrification processes, which may increase NO3 leaching (Zarabi and Jalali 2012; Kutiel and Shaviv 1992).

  • Land use is a parameter that represents a potential anthropogenic factor related to NO3 pollution (Huang et al. 2017).

Modeling approach using RF

Groundwater vulnerability is generally understood as a probability of contamination. Therefore, to obtain the map of groundwater vulnerability to NO3, the first step was to rescale the NO3 concentrations. The NO3 concentrations observed in the 154 samples were divided into two groups based on a threshold value of 50 mg/L. Concentrations exceeding the threshold were assigned a value of 1 (nitrate pollution), and concentrations lower than or equal to the threshold were assigned a value of 0 (no nitrate pollution). The rescaled NO3 concentrations were used as the output variable of the RF, and the specific and intrinsic parameters as input variables. Second, the dataset (inputs and output) was split randomly into two sub-datasets. The first sub-dataset, which contains 80% of the data, was used for training and validation, and the remaining 20% was used for testing the RF model. It should be noted that the RF model itself splits the first sub-dataset (80% of the data) into two groups: about 2/3 for training and the remaining 1/3 for validation. Figure 5 shows the methodology flowchart used for this study. The distribution of the training, validation, and testing samples is shown in Fig. 6. The RF implementation requires the number of trees and the number of variables (m) considered for the split at each node. Breiman (2001) recommends using a value of m close to 1/3 of the number of input variables. For this study, we used a maximum of 10,000 trees and tested different numbers of random input variables (mtry = 1, 2, 3 and 4) at each node. The optimal mtry is the one that yields the lowest error. The randomForest package in R software (V 1.1.4) was used for the RF model. The optimal mtry was determined using the tuneRF function of this package.
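The rescaling and the random 80/20 split described above can be sketched as follows. This is an illustrative Python outline rather than the R workflow used in the study; the function names, the seed, and the rounding of the test-set size are assumptions.

```python
import random

THRESHOLD_MG_L = 50.0  # NO3 threshold used for rescaling


def rescale(concentrations, threshold=THRESHOLD_MG_L):
    """Return 1 where the concentration exceeds the threshold, otherwise 0."""
    return [1 if c > threshold else 0 for c in concentrations]


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Randomly split sample indices into (train/validation, test) subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# A concentration equal to the threshold is labelled 0, as in the text.
print(rescale([49.9, 50.0, 50.1]))  # [0, 0, 1]

train_idx, test_idx = split_indices(154)
print(len(train_idx), len(test_idx))  # 123 31
```

The bootstrap division of the 80% subset into in-bag and out-of-bag observations is then handled inside the RF itself, tree by tree.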

Fig. 5 Random forest flowchart used in this study

Fig. 6 Location of training, validation, and testing samples used in random forest

Two modeling approaches based on variable importance were used for this study. The first approach (RF1) used all 14 selected explanatory variables as model input. The second approach selected the most important variables in the RF1 result and used them as input for a new RF implementation (RF2). The predicted values obtained by both RF models were considered as Groundwater Vulnerability Indexes (GVI).
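One way to see why a forest trained on 0/1 targets returns values between 0 and 1 is that the forest output is the average of the individual tree outputs. The sketch below assumes four hypothetical one-split trees; the variable names and thresholds are invented for illustration and are not taken from the fitted models.

```python
def forest_predict(trees, row):
    """Average the outputs of all trees for one observation."""
    outputs = [tree(row) for tree in trees]
    return sum(outputs) / len(outputs)


# Hypothetical one-split trees (thresholds are illustrative assumptions).
toy_trees = [
    lambda row: 1 if row["rainfall_mm"] > 450 else 0,
    lambda row: 1 if row["sand_pct"] > 40 else 0,
    lambda row: 1 if row["clay_pct"] < 30 else 0,
    lambda row: 1 if row["piezometric_level_m"] < 20 else 0,
]

site = {"rainfall_mm": 500, "sand_pct": 60, "clay_pct": 35, "piezometric_level_m": 10}
print(forest_predict(toy_trees, site))  # 0.75
```

Here three of the four trees flag the site as polluted, so the averaged output, read as a vulnerability index, is 0.75.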

Validation and testing are essential steps in any modeling study based on machine learning techniques. First, the GVI predicted by RF1 and RF2 were validated and compared based on the error computed for each mtry, and the result with the lowest error was retained. Second, the predictive accuracy of the RF model was tested using receiver operating characteristic (ROC) analysis, through the ROCR package in R software. The ROC curve plots the false-positive rate on the X-axis and the true-positive rate on the Y-axis, and it describes the trade-off between the two rates (Sezer et al. 2011; Ozdemir and Altural 2013; Akgun 2011). The Area Under the Curve (AUC) is calculated from the ROC curve. The prediction accuracy can be classified from the AUC as follows: AUC > 0.9, excellent; 0.8 < AUC < 0.9, very good; 0.7 < AUC < 0.8, good; 0.6 < AUC < 0.7, average; and 0.5 < AUC < 0.6, poor (Pourghasemi and Kerle 2016; Bradley 1997; Fawcett 2006).
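The AUC can be illustrated without any plotting: it equals the probability that a randomly chosen polluted sample receives a higher predicted value than a randomly chosen unpolluted one. The sketch below computes it by pairwise comparison and maps it to the accuracy classes listed above. It is a plain-Python illustration, not the ROCR implementation, and the treatment of values falling exactly on a class boundary is an assumption.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy_class(value):
    """Map an AUC value to the qualitative classes quoted in the text."""
    if value > 0.9:
        return "excellent"
    if value > 0.8:
        return "very good"
    if value > 0.7:
        return "good"
    if value > 0.6:
        return "average"
    if value > 0.5:
        return "poor"
    return "below the listed ranges"


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(accuracy_class(0.822))  # very good
```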

Mapping groundwater vulnerability to nitrate

After validation and testing, the vulnerability maps were created from all the GVI predicted by the RF1 and RF2 models, using the Kriging interpolation method in GIS. The interpolation retained as the most reliable was the one that generated the lowest error. Where several GVI values coincided, their average was used in the interpolation.

The GVI were categorized into four vulnerability classes, namely low, medium, high, and very high. The most accurate map was obtained by comparing the different classification methods proposed by the GIS (quantile, natural breaks, geometrical interval, and equal interval), using the Spearman rank correlation (ρ) and one-way ANOVA between the vulnerability classes and the observed NO3 concentrations. All statistical tests were carried out in R software.
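The two agreement measures used to compare classification methods can be sketched as follows. This is a plain-Python illustration of the Spearman rank correlation (with average ranks for ties) and of the squared eta coefficient from a one-way ANOVA decomposition. The toy class codes and concentrations are invented, and the functions are not those of any R package.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def eta_squared(groups, values):
    """Share of the total variance explained by group membership."""
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    by_group = {}
    for g, v in zip(groups, values):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in by_group.values()
    )
    return ss_between / ss_total


# Toy example (assumption): classes coded 1 (low) to 4 (very high), NO3 in mg/L.
classes = [1, 1, 2, 2, 3, 3, 4, 4]
no3 = [12.0, 20.0, 35.0, 30.0, 62.0, 55.0, 90.0, 110.0]
print(spearman(classes, no3), eta_squared(classes, no3))
```

A classification method whose classes track the observed concentrations more closely yields higher values of both measures, which is the basis of the comparison described above.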

Results and discussion

Random forest results

Accuracy of the random forest

Figure 7 shows the error computed as a function of the number of trees, for each number of explanatory variables randomly sampled at each node (mtry), using the RF1 model. The error decreased as more trees were used. From 2000 trees onward, the error was low and stable, and the same behavior was observed for every mtry tested. The mtry that yielded the lowest error was 4, which is consistent with the recommendation of Breiman (2001). Furthermore, the mean error value obtained was 0.1100, with minimum and maximum values of 0.1091 and 0.1545, respectively.

Fig. 7 Impact of the number of trees and random split variable (mtry) on the out-of-bag (OOB) error computed by random forest applied to all variables (RF1)

Selection of the most important explanatory variables

The variable importance of the RF model is an output indicator of the relative contribution of each input variable to the prediction result. The comparison of variable importance was based on MDA (% increase in MSE) and the mean decrease in the GINI (increase in node purity). The importance of each explanatory variable is presented in Fig. 8. A higher value indicates that the variable is more important.

Fig. 8 Relative importance of variables using % increase in MSE (a) and increase in node purity (b)

As shown in Fig. 8a, the relative increase in MSE was high for all explanatory variables, ranging between 55 and 130.3%. This finding indicates that all the selected explanatory variables can be considered controlling factors of groundwater NO3 pollution. Nevertheless, rainfall, sand, clay, piezometric level, organic matter, and lithology are the most important explanatory variables. The same set of variables was obtained using the mean decrease in GINI (Fig. 8b), although with different importance ranks.

According to the MDA results, rainfall has the highest importance in GVI prediction, followed by sand and clay contents, with values of 130, 118, and 116%, respectively. These results can be explained by the fact that rainfall contributes to groundwater recharge and therefore to NO3 leaching. Indeed, the areas where NO3 concentrations are high are located within areas of high soil sand content, mainly in the central and western parts of the basin. The importance of clay can be explained by its capability to protect groundwater against NO3 contamination owing to its high retention capacity. The piezometric level was also an important variable, with a value of 115%. For organic matter, the increase in MSE was 103.5%. This finding suggests that groundwater may receive high loads of organic nitrogen: NO3 leaching increases as a result of high mineralization where soil organic matter content is high (Hoffmann and Johnsson 1999; Kulabako et al. 2007). The same importance value was observed for lithology. Silt, CaCO3, C/N, and altitude showed medium importance. In contrast, land use, slope, hydraulic gradient, and soil classification are the least important parameters, with values of 54.91, 72.86, 76.81, and 76.82%, respectively.

According to the RF1 importance result, we selected the most important variables (Increase in MSE above 100%) as input for RF2, which are: rainfall, sand content, clay content, piezometric level, organic matter, and lithology.

For RF2, the error trend is similar to that of RF1. The lowest errors were obtained from 2000 trees onward (Fig. 9). However, the best mtry for RF2 was 2, which yielded the lowest error compared to the other mtry values. The mean error value obtained was 0.1099, with minimum and maximum values of 0.1083 and 0.1750, respectively. Therefore, using only the most important parameters can slightly decrease the OOB error.

Fig. 9 Impact of the number of trees and random split variable (mtry) on the out-of-bag (OOB) error computed by random forest applied to the most important variables (RF2)

Receiver operating characteristic (ROC) curve

The ROC curves for both RF models are shown in Fig. 10. The AUC results are very similar: the AUC values were 0.822 and 0.82 for the RF1 and RF2 models, respectively, corresponding to prediction accuracies of 82.2 and 82%. Therefore, both RF models produce very good prediction performance.

Fig. 10 ROC curve computed using RF1 and RF2 models

Mapping groundwater vulnerability to nitrate

As seen in Fig. 11, the predicted GVI increases significantly with the observed NO3 concentrations, and this finding was similar for both models (RF1 and RF2). However, RF2 predicted the GVI slightly more accurately than RF1. The predicted values range from 0.003 to 0.998 for RF1 and from 0.0019 to 0.999 for RF2. Therefore, the removal of the less important explanatory variables caused a slight increase in the GVI prediction accuracy. This finding is consistent with the OOB errors computed.

Fig. 11 Predicted GVI through the RF1 (a) and RF2 (b); the red points represent the observed values (0 and 1)

The GVI predicted using both RF models were classified into four vulnerability classes (low, medium, high, and very high). The comparison between the classification methods, based on the Spearman rank correlation (ρ) and the Eta coefficient (η), showed that geometric interval and equal interval are the most appropriate methods for RF1 and RF2, respectively (Table 2 and Table 3). These classification methods were used to create the vulnerability classes.

Table 2 Comparison between classification methods applied to RF1
Table 3 Comparison between classification methods applied to RF2

The observed NO3 concentrations for each vulnerability class are presented as boxplots in Fig. 12. These plots summarize the observed NO3 concentrations by a central point that indicates the median, a box that indicates the variability around the median (25th and 75th percentiles), whiskers that indicate the range of the values, and points that indicate the outliers. The vulnerability classes agree with the observed NO3 concentrations: the low class presents the lowest concentrations, while the very high class contains the highest NO3 concentrations. However, the comparison between the two RF models (Table 2 and Table 3) showed that the RF2 model was more reliable in GVI prediction. The Spearman rank correlation (ρ) and the Eta coefficient (η) between vulnerability classes and observed NO3 concentrations were up to 0.6645 and 0.2837, respectively, which are slightly greater than those obtained by the RF1 model (0.6547 and 0.2800, respectively).

Fig. 12 Box plot of observed nitrate concentrations and vulnerability classes obtained through the RF1 (a) and RF2 (b)

The vulnerability maps obtained using both RF models are shown in Fig. 13. They show that the northern, central, northeastern, and western parts of the basin are the areas where the groundwater vulnerability to NO3 is classified as high to very high. These two classes cover, respectively, 25.04 and 22.9% of the total area for RF1 and 36.38 and 26.5% for RF2 (Table 4). In these areas, the annual rainfall varies between 430 and 550 mm, the sand content between 40 and 84%, the clay content between 2 and 30%, and the organic matter content between 1.5 and 5%. Moreover, three lithological classes are dominant in these areas, namely sandstone, marlstone, and oncolite limestone.

Fig. 13 Vulnerability maps obtained using RF1 (a) and RF2 (b)

Table 4 Statistics of the groundwater vulnerability surface area

The medium vulnerability class occupies 27.71 and 26.14% of the total area for RF1 and RF2, respectively. This class is located mainly in some eastern, northern, and western parts of the basin. The area of low vulnerability does not exceed 24.80 and 11% of the total area for RF1 and RF2, respectively, and is located mainly in some southern (Middle Atlas limits) and eastern parts of the basin. These areas are characterized by high clay content, which varies between 30 and 52%. Moreover, two lithological classes are dominant, namely marlstone and siltstone.

Based on these results, the RF model performs well in determining groundwater vulnerability, owing to its ability to learn non-linear relationships between NO3 concentrations and the explanatory variables used in this study. However, the maps of groundwater vulnerability to NO3 can be improved over time as new input variables, such as groundwater recharge and nitrogen fertilizer application, are considered.

Conclusion

Improving water management strategies requires a robust method to assess groundwater vulnerability. The present study aimed to develop an accurate RF model for the prediction of groundwater vulnerability to NO3. The observed NO3 concentrations in the Sais basin were rescaled to 0 and 1, based on the drinking-water threshold for NO3 (50 mg/L). The predicted values were considered as GVI. Fourteen explanatory variables related to the intrinsic and specific groundwater vulnerability were used as inputs in the RF model. These variables were rainfall, organic matter, soil texture (sand, clay, and silt), altitude, lithology, land use, C/N ratio, piezometric level, CaCO3, slope, hydraulic gradient, and soil classification. The OOB error and AUC were 0.1100 and 82.2%, respectively. Moreover, the study revealed that all explanatory variables used can be considered controlling factors of groundwater NO3 pollution, with differing degrees of importance. Rainfall, sand content, clay content, organic matter, piezometric level, and lithology were the most important predictors of GVI. Using only these important parameters as RF input gave an OOB error and AUC of 0.1099 and 82%, respectively. The comparison between the observed NO3 concentrations and the vulnerability classes showed that the RF2 model can produce a slightly more accurate groundwater vulnerability map.

The results revealed that about 48 and 63% of the total surface area are under high to very high vulnerability to NO3, using RF1 and RF2, respectively. About 27.7 and 26.1% of the surface area are in medium vulnerability, and 24.8 and 11% are in low vulnerability, using RF1 and RF2, respectively.
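As a quick arithmetic check, the rounded shares of high to very high vulnerability follow from the class percentages reported in the results (25.04 and 22.9% for RF1; 36.38 and 26.5% for RF2):

$$25.04 + 22.90 = 47.94 \approx 48\% \;(\text{RF1}), \qquad 36.38 + 26.50 = 62.88 \approx 63\% \;(\text{RF2})$$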

Based on the RF results, the most important factors in the prediction result should be taken into consideration when recommending nitrogen fertilization, since agricultural activity is intense in the Sais basin.

Nevertheless, NO3 pollution can be affected by other variables related to the biogeochemical process, overuse of nitrogen fertilizers and the land use change. Consequently, including these factors in the RF model may also improve the groundwater vulnerability map to NO3 in the Sais basin.

The current study is a novel application of a machine learning technique to groundwater vulnerability assessment in Morocco. In the future, the RF model performance can be compared with that of other machine learning methods. This study provides valuable information for groundwater management in the study area.