Introduction

Groundwater is a vital resource that supports residential life, agricultural and industrial activities, and ecosystem sustainability (Anand et al., 2021; Cui & Shao, 2005). Nearly a quarter of the world’s freshwater resources come from groundwater, according to statistics (Panahi et al., 2020). In arid and semi-arid regions, the proportion of groundwater can be even higher due to the scarcity of surface water resources. In Northwest China, which is a typical arid and semi-arid region, the population represents only 7.3% of the country, while groundwater reserves account for 1/8 of the country’s total (Chen, 1986; Wang et al., 2008). Effective utilization of groundwater can help alleviate freshwater scarcity in these regions (Arabameri et al., 2019). However, the availability of groundwater is currently threatened by over-exploitation, climate change, and land-use changes. In semi-arid regions like China’s Guanzhong Basin, where groundwater resources are limited and coupled with high population density and rapid economic development, demand for groundwater resources has intensified (Kong et al., 2019). Therefore, it is crucial to conduct accurate assessments of groundwater potential to promote sustainable groundwater management in these areas.

The assessment of groundwater potential aims to identify areas with a high likelihood of containing groundwater resources (Jhariya et al., 2021). This assessment is crucial for optimizing the placement of groundwater wells, estimating the potential yield of groundwater, and promoting sustainable use of groundwater resources (Tegegne, 2022). Typically, the assessment uses geological, hydrological, and environmental information to identify favorable zones for groundwater development (Farhat et al., 2023). However, traditional methods for determining groundwater potential, such as pumping tests and borehole drilling, are time-consuming and expensive. To overcome this challenge, more researchers have been turning to advanced technologies such as geographic information systems (GIS) (Bera et al., 2021), remote sensing (Shamsudduha & Taylor, 2020; Sun et al., 2019), and drones (Jansen, 2019) to predict and assess groundwater potential (Wang et al., 2022). The biggest advantage of these methods is their ability to analyze the entire study area by combining satellite images with regional geographic or geological information. Their main disadvantage is that the resulting indicators are only indirectly related to the groundwater itself, which can make them less efficient and reliable than drilling or pumping tests. To tackle this issue, researchers have developed several techniques for analyzing the data, and two main approaches have emerged for assessing groundwater potential. The first approach weights the various factors that affect groundwater and overlays them (Singh et al., 2019). Weight-based methods include the analytic hierarchy process (Ahmad et al., 2023; Arefin, 2020), which depends on human intervention and expert ratings, and the entropy weight method (Al-Abadi et al., 2016; Zhang et al., 2021), which considers the distribution characteristics of the factors. Although these methods are more interpretable during the calculation process, their final results often do not meet expectations. The second approach divides the research area into multiple points and then scores or ranks each point using methods such as the technique for order of preference by similarity to ideal solution (TOPSIS) (Zaree et al., 2019), or compresses the factor features of these points into a one-dimensional vector using principal component analysis (Sun et al., 2021). Point-based methods usually produce more accurate results than weight-based methods. The two types of methods can also be combined, as in the entropy-TOPSIS method (Li et al., 2019), to statistically determine the distribution of groundwater potential. However, these methods are difficult to integrate with hydrogeology itself, and valuable drilling or pumping test data are often not fully used.

In recent decades, the rapid development of machine learning algorithms has brought them into various industries (Kaur & Sood, 2020; Reichstein et al., 2019), including hydrogeology. Machine learning is used in various areas of hydrology, such as hydrogeological modeling (Sun et al., 2019), model parameter inversion (Mo et al., 2020; Zhou et al., 2014), groundwater pollution source identification (Han et al., 2020), and groundwater potential assessment. The machine learning approach combines the factors that affect groundwater potential with borehole data or field survey site information. By using the factors as features and the borehole data as target values, machine learning can capture potential non-linear associations between different features and build complex black box models (Wang et al., 2022). One advantage of using machine learning for groundwater potential assessment is that the drilling and pumping test data can be used for verification to ensure the reliability of the prediction results. Common machine learning models, such as decision trees (Duan et al., 2016), linear regression, support vector machines (Panahi et al., 2020), k-nearest neighbors, and Bayes models (Pham et al., 2021), as well as ensemble models such as random forests (RF) (Wang et al., 2022), adaptive boosting (Rizeei et al., 2019), and extreme gradient boosting (XGB) (Ibrahem Ahmed Osman et al., 2021), are widely used for groundwater potential assessments. Ensemble models enhance the accuracy and usability of machine learning compared to single base learner models (Arabameri et al., 2021). In most cases, predictions of groundwater potential from ensemble learning models are more accurate (Pham et al., 2021). However, the complexity of the geological environment means that model performance and feature engineering can differ greatly between study areas. Therefore, it is necessary to investigate and find the most suitable machine learning model to predict groundwater potential, while keeping in mind the importance of integrating and validating these models with the reality of the area being studied.

The Guanzhong Basin is a semi-arid region heavily reliant on groundwater resources for agricultural practices and other human activities (Zhang et al., 2022). As a region experiencing rapid urbanization and economic growth, the local water supply faces an increasing demand for water resources, which poses significant challenges. Thus, assessing the groundwater potential of the Guanzhong Basin is crucial to ensure sustainable water resource management in the area. In this study, we collected factors that influence groundwater potential in the Guanzhong Basin and used three ensemble learning models, RF, XGB, and local cascade ensemble (LCE) to predict the groundwater potential. We conducted multiple calculations and cross-validation to determine the most suitable parameter sets enabling us to provide a comprehensive and accurate assessment of the groundwater potential of the Guanzhong Basin, using GIS-based ensemble learning models. The results allow for the identification of groundwater enrichment areas while minimizing costs. This information can provide valuable guidance for subsequent drilling and other related activities. Furthermore, this study can also serve as a reference for other semi-arid regions that share similar characteristics with the Guanzhong Basin.

Data and data processing

Description of the study area

The Guanzhong Basin is a large geological basin located in the central part of China (Fig. 1), covering an area of approximately 20,722.63 km² (Zhang et al., 2022). It extends from 107° to 110°30′E and from 34° to 35°10′N, with an altitude range of about 200–2000 m. The basin is an important agricultural and industrial region and is home to several large cities, including Xi’an, the capital of Shaanxi Province. The geology of the Guanzhong Basin is dominated by a thick layer of sedimentary rock that was deposited during the Mesozoic Era (Xu et al., 2019). The climate of the Guanzhong Basin is temperate, with four distinct seasons. The area has an average annual temperature of 12–13.6 ℃ and an average annual rainfall of about 500–800 mm, falling mainly in summer, but it is prone to droughts and water shortages (Ren et al., 2021). The annual average evaporation is 800–1200 mm (Bei et al., 2016). The main river in the Guanzhong Plain is the Weihe River, which is the largest tributary of the Yellow River. The Weihe River is an important source of water for the basin and interacts closely with the groundwater in the basin (Kong et al., 2019). Given its importance as a center of agriculture and industry, as well as its vulnerability to water scarcity, accurate assessments of groundwater potential in the Guanzhong Basin are critical for sustainable development and management of the region’s water resources.

Fig. 1

a Location of study area in China. b Location of the samples and groundwater pumping rate.

Borehole datasets

The accuracy of machine learning models used for groundwater potential assessment is largely dependent on the quality of training data (Chen et al., 2019a; Panahi et al., 2020). The borehole datasets utilized in this study were collected from two sources, namely GeoCloud (http://geoscience.cn) and field observations. In total, 205 sets of borehole data were collected and analyzed. The distribution of these boreholes is illustrated in Fig. 1. To facilitate model training and testing, the borehole data were divided into two categories based on the water pumping rate, with a threshold of 5 t/h used to distinguish between enriched groundwater (> 5 t/h) and lack of groundwater (≤ 5 t/h). Subsequently, the data in each category were randomly partitioned into training and test sets in a ratio of 0.7:0.3 (Panahi et al., 2020). The training set was used for model calibration, while the test set was used to assess the accuracy of the model. The partitioning was performed randomly so that the training and test sets would be representative of the entire dataset (Wang et al., 2022).
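The labelling and partitioning described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the function names, the random seed, and the pumping rates are hypothetical, and the split is done per class so that both sets keep roughly the same class balance.

```python
import random

def label_boreholes(pumping_rates, threshold=5.0):
    """Label each borehole 1 (enriched, > threshold t/h) or 0 (lacking, <= threshold)."""
    return [1 if rate > threshold else 0 for rate in pumping_rates]

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately at train_fraction."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical pumping rates in t/h (not the study's data).
rates = [0.8, 2.5, 5.0, 5.1, 7.3, 12.0, 3.3, 9.9, 4.9, 20.4]
labels = label_boreholes(rates)
train_idx, test_idx = stratified_split(labels)
```

Note that a borehole pumping exactly 5 t/h falls into the "lack of groundwater" class, matching the ≤ 5 t/h rule stated in the text.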

Database of conditioning factors

The selection of indicators that influence groundwater potential is critical to accurately assess the potential for groundwater in a given area (Zaree et al., 2019). These indicators are variables that can affect the recharge and availability of groundwater. By identifying and analyzing these factors, it is possible to map the groundwater potential of a region, which can aid in groundwater management, water resources planning, and sustainable development. Based on the characteristics of semi-arid areas (Arabameri et al., 2019) and previous literature reviews (Díaz-Alcaide & Martínez-Santos, 2019), this study selected 14 factors as features for ensemble learning, including landform, slope, slope aspect, curvature, precipitation, evapotranspiration, distance to fault, distance to river, road density, topographic wetness index (TWI), soil type, lithology, land cover, and normalized difference vegetation index (NDVI) (Figs. 2–4).

Fig. 2

Conditioning factors in the groundwater potential assessment: a Landform. b Slope (°). c Curvature. d Slope aspect. e Precipitation (mm). f Evapotranspiration (mm)

Landform is a crucial factor in groundwater potential assessment since it determines the surface water recharge (Razandi et al., 2015). In this study, landform was classified into five categories: floodplain, hill, mountain, plain, and plateau (Fig. 2a). Generally, high groundwater levels are common in floodplain areas near rivers and streams, while hills and mountains have lower groundwater potential due to steep slopes and limited surface water recharge.

Slope is another critical factor that influences groundwater potential. The slope of a land surface affects the rate of water infiltration and runoff, which ultimately impacts groundwater recharge (Doke et al., 2021). In the Guanzhong Basin, slopes were calculated using a digital elevation model (DEM) with a 30 m resolution (obtained from https://www.gscloud.cn) and ranged from 0 to 74.88° (Fig. 2b). In general, concave slopes have higher groundwater recharge than flat or convex slopes. Slope aspect also affects groundwater potential, as it determines the amount of solar radiation and wind exposure a surface receives (Naghibi et al., 2015b). Flat slopes are more likely to be recharged due to their even exposure to solar radiation and wind. The slope aspect was also computed from the DEM (Wang et al., 2020) and classified into ten categories based on the angle of the slope aspect: flat, north, northeast, east, southeast, south, southwest, west, northwest, and north (Fig. 2c). Curvature, defined as the rate of change of slope along a contour line, can also affect groundwater flow and recharge by influencing the direction and velocity of water movement (Arabameri et al., 2019). We classified the factor into three types — concave, flat, and convex — based on the size of its value (Fig. 2d). Concave areas exhibit negative curvature, suggesting lower slope angles in the center and higher angles on the periphery. Such areas accelerate the convergence of surface water bodies and augment their interaction with groundwater (Chen et al., 2019b). Conversely, convex areas featuring higher slope angles in the center and lower angles on the periphery often promote the divergence of surface water bodies, preventing their interaction with groundwater. Flat areas, characterized by zero curvature and a uniform slope, serve a role somewhere between concave and convex areas.
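The ten aspect classes listed above include "north" twice, which is consistent with a common GIS convention in which north covers both 0–22.5° and 337.5–360°. The sketch below illustrates that convention in plain Python. It is an assumption about how the classes were derived, not the authors' code; the function name and the use of a negative value to mark flat cells are likewise assumptions.

```python
def classify_aspect(aspect_deg):
    """Map an aspect angle (degrees clockwise from north) to a class name.

    Assumption: negative values mark flat cells that have no downslope direction.
    """
    if aspect_deg < 0:
        return "flat"
    sectors = ["north", "northeast", "east", "southeast",
               "south", "southwest", "west", "northwest", "north"]
    # Shift by half a sector (22.5 deg) so each 45-deg sector is centred
    # on its compass direction; the ninth entry closes the circle at north.
    index = int(((aspect_deg % 360.0) + 22.5) // 45.0)
    return sectors[index]
```

For example, `classify_aspect(350.0)` and `classify_aspect(10.0)` both fall in the "north" class, while `classify_aspect(90.0)` is "east".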

Precipitation and evapotranspiration are two additional factors that affect groundwater potential. Precipitation provides the source of water for groundwater recharge, while evapotranspiration limits the amount of water available for recharge (Jia et al., 2011; Jin et al., 2013). Therefore, areas with high precipitation and low evapotranspiration have higher groundwater potential. In this study, precipitation data were obtained from WorldClim2 (Fick & Hijmans, 2017), with a range of 500–757 mm (Fig. 2e), while evapotranspiration data (obtained from https://data.cma.cn/) ranged from 794 to 1277 mm (Fig. 2f). The overall evapotranspiration of the Guanzhong Basin thus exceeds the rainfall, which is typical of a semi-arid area.

Fault zones can serve as pathways for groundwater to interact with surface water (Ahmad et al., 2021), while rivers and streams serve as surface water sources (Golkarian et al., 2018). In this study, the distance to faults was divided into five categories: 1, 5, 10, 20, and 50 km (Fig. 3a), while the distance to rivers was divided into six categories: 1, 2, 5, 10, 20, and 50 km (Fig. 3b). Road density is defined as the length of road per unit area and is calculated from the road network data (Velis et al., 2017). High road density can lead to increased surface runoff, reducing infiltration and groundwater recharge. In this study, the calculated road density in the Guanzhong Basin ranges from 0 to 2.29 km/km² (Fig. 3c).

Fig. 3

Conditioning factors in the groundwater potential assessment: a Distance to fault (km). b Road density (km/km²). c Distance to river (km). d TWI

TWI is an important factor that indicates the degree of land surface wetness, which is used to estimate the spatial distribution of soil moisture and potential groundwater recharge (Sørensen & Seibert, 2007). TWI is calculated based on the slope and contributing area of each pixel in the DEM (Wang et al., 2022). The calculation method assumes that water accumulates in areas with low slope and high contributing area. The TWI is computed using the following formula (Wang et al., 2022):

$$TWI=\ln\left(\frac{a}{\tan\beta}\right)$$
(1)
$$a=\frac{\tan\alpha}{S}+1$$
(2)

where a is the specific catchment area and β is the local slope. The specific catchment area (a) is defined as the upslope contributing area per unit contour length. α is the local flow direction, and S is the local slope. The local flow direction and slope are calculated from the DEM. The TWI values range from negative to positive infinity, with higher values indicating areas with higher potential for groundwater recharge. In Guanzhong Basin, the TWI ranges from 5.44 to 30.56 (Fig. 3d).
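As a small worked illustration of Eq. 1, the sketch below evaluates the TWI for a single cell in plain Python. The function name, the sample values, and the minimum-slope clamp (which avoids division by zero on perfectly flat cells) are assumptions added for illustration; the study itself derives both inputs from the DEM.

```python
import math

def twi(specific_catchment_area, slope_rad, min_slope_rad=1e-3):
    """Topographic wetness index ln(a / tan(beta)) for one cell (Eq. 1)."""
    # Assumption: clamp the slope so that flat cells (tan(beta) = 0)
    # do not cause a division by zero.
    beta = max(slope_rad, min_slope_rad)
    return math.log(specific_catchment_area / math.tan(beta))

# Hypothetical cell: the same catchment area on a gentle and on a steep slope.
gentle = twi(100.0, math.radians(2.0))
steep = twi(100.0, math.radians(20.0))
```

With the same contributing area, the gentler slope yields the larger index, which is the behaviour the text describes: water accumulates where slope is low and contributing area is high.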

Soil type and lithology are important factors in assessing groundwater potential. Soil types affect water infiltration and storage, while lithology affects subsurface permeability and porosity (Arabameri et al., 2021). In the Guanzhong Basin, soil types were categorized into seven types: paddy soils, brown earths, cinnamon soils, fluvo-aquic soils, dark loessial soils, cultivated loessial soils, and gray-cinnamon soils (Fig. 4a). Lithology was divided into seven categories: intrusive rock, Lower Quaternary, Upper Quaternary, Tertiary, Paleozoic, Proterozoic, and Archean (Fig. 4b).

Fig. 4

Conditioning factors in the groundwater potential assessment: a Soil type. b Lithology. c Land cover. d NDVI

Land cover is another factor that influences groundwater potential. The Guanzhong Basin has different land use types, including cropland, forest, grassland, shrub, wetland, aquatic, artificial surface, and bareland (Fig. 4c) (data from http://www.globallandcover.com/). Changes in land use can significantly affect groundwater recharge and discharge, affecting groundwater potential. For example, deforestation can reduce groundwater recharge by decreasing the interception and infiltration of precipitation. Conversely, land use types like wetlands and grasslands can enhance groundwater recharge by increasing infiltration and reducing surface runoff.

NDVI is a remote sensing-based conditioning factor that measures the density of green vegetation cover. NDVI is calculated using the spectral reflectance values of the red and near-infrared bands of remote sensing data. NDVI values range from −1 to 1, with higher values indicating denser vegetation cover (Han et al., 2021). Vegetation cover can influence groundwater recharge by reducing surface runoff and increasing infiltration. In the Guanzhong Basin, the NDVI ranges from 0.18 to 0.90 (Fig. 4d) (data from https://www.resdc.cn/).
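The text does not spell out the NDVI formula; the standard definition is the difference between near-infrared and red reflectance divided by their sum. The sketch below shows it for a single pixel in plain Python. The function name, the sample reflectances, and the handling of a zero denominator are assumptions for illustration only, since the study used a ready-made NDVI product.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        # Assumption: treat pixels with no signal as 0 rather than undefined.
        return 0.0
    return (nir - red) / total

# Hypothetical reflectances: dense vegetation reflects strongly in the NIR band.
dense = ndvi(nir=0.50, red=0.05)
sparse = ndvi(nir=0.25, red=0.20)
```

Because the result is a normalized ratio of non-negative reflectances, it always lies between −1 and 1, as stated above.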

Methodology

In this study, we began by dividing the study area into 230,224 points based on a size of 300 × 300 m. The values corresponding to the 14 factors influencing groundwater potential in the Guanzhong Basin were extracted, resulting in a database of conditioning factors with a specification of 230,224 × 14. A sample dataset of 205 × 15, including a column of result values, was created and used to train and validate the RF, XGB, and LCE algorithms. The loop and cross-validation techniques were employed to obtain optimal prediction parameters. These three models were then applied to the database of 230,224 points, and the results were converted into raster data (Wang et al., 2022). The best model was identified by comparing the model results. Figure 5 shows the multi-phase methodological framework employed in this study.

Fig. 5

Flow chart of the research

Random forest (RF)

RF is a powerful machine learning algorithm used for various tasks such as classification and regression (Breiman, 2001). It is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model (Paul et al., 2018). The basic principle of the RF algorithm is to build multiple decision trees on randomly sampled subsets of the training data and combine their predictions. This randomness ensures that the trees are diverse and not overfitting to the training data. The calculation process of the RF algorithm involves several steps (Breiman, 2001):

Step 1: Randomly select a subset of the training data (with replacement) to build a decision tree. This subset is known as the bootstrap sample.

Step 2: At each node in the decision tree, randomly select a subset of features to use for splitting. This subset is known as the random feature subset.

Step 3: Repeat steps 1 and 2 to create multiple decision trees.

Step 4: To make a prediction for a new instance, pass it through each decision tree in the forest and take the majority vote of the predictions. For regression tasks, the average of the predictions is taken.
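Steps 1 to 4 can be sketched in plain Python as follows. This is a deliberately simplified illustration rather than the implementation used in the study: each "tree" is a one-split decision stump, so the random feature subset of Step 2 is drawn once per tree, and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) among feature_ids with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_cls = Counter(left).most_common(1)[0][0]
            r_cls = Counter(right).most_common(1)[0][0] if right else l_cls
            errors = sum(y != l_cls for y in left) + sum(y != r_cls for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_cls, r_cls)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):  # Step 3: repeat steps 1 and 2 for every tree.
        # Step 1: bootstrap sample, drawn with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Step 2: random feature subset considered for the split.
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    # Step 4: majority vote over the predictions of all trees.
    votes = [l_cls if row[f] <= t else r_cls for f, t, l_cls, r_cls in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical two-feature samples (not the study's data).
rows = [(1, 10), (2, 12), (3, 11), (8, 30), (9, 28), (7, 31)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Individual stumps trained on an unlucky bootstrap sample can be poor, but the vote across many of them is stable, which is the point of the ensemble.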

The RF algorithm is a highly flexible and versatile tool that finds application in various fields, including finance, medicine, and environmental science (Gislason et al., 2006; Wang et al., 2020). Its capabilities have also been leveraged in groundwater potential prediction (Naghibi et al., 2015a).

Extreme gradient boosting (XGB)

XGB is a gradient boosting algorithm that supports both tree learners and linear models and can carry out parallel computation on a single machine (Chen & Guestrin, 2016). It is faster than other gradient boosting algorithms because it uses a block structure for parallel learning. It also uses a distributed weighted quantile sketch algorithm to handle weighted data. XGB minimizes an objective function that consists of a loss function and a regularization term. The objective function can be written as (Ibrahem Ahmed Osman et al., 2021):

$$obj\left(\theta\right)=\sum_{i=1}^{n}l\left(Y_i,\widehat{Y}_i\right)+\sum_{k=1}^{K}\Omega\left(f_k\right)$$
(3)

where θ is the set of model parameters, l is a differentiable convex loss function that measures how well the model fits the data, Ω is a regularization term that controls the complexity of each tree fk, and K is the number of trees. The regularization term Ω(fk) can be defined as:

$$\Omega\left(f_k\right)=\gamma T+\frac{1}{2}\lambda\left\Vert w\right\Vert^2$$
(4)

where T is the number of leaves in tree fk, w is a vector of leaf scores, γ is a parameter that penalizes the number of leaves, and λ is a parameter that penalizes large leaf scores.
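To make Eqs. 3 and 4 concrete, the sketch below evaluates the objective for a hand-written set of trees in plain Python. It assumes the logistic (log) loss for l, which suits a binary label but is not stated in the text, and it represents each tree only by its list of leaf scores. All names and numbers are hypothetical.

```python
import math

def log_loss(y, p, eps=1e-12):
    """Assumed loss l: binary cross-entropy for one sample."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def omega(leaf_scores, gamma, lam):
    """Eq. 4: gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    return gamma * len(leaf_scores) + 0.5 * lam * sum(w * w for w in leaf_scores)

def objective(y_true, y_prob, trees, gamma=1.0, lam=1.0):
    """Eq. 3: summed loss over samples plus summed penalty over trees."""
    data_term = sum(log_loss(y, p) for y, p in zip(y_true, y_prob))
    penalty = sum(omega(leaves, gamma, lam) for leaves in trees)
    return data_term + penalty

# Hypothetical tree with two leaves scoring +0.5 and -0.5.
penalty_one_tree = omega([0.5, -0.5], gamma=1.0, lam=2.0)
```

Here the tree has T = 2 leaves and a squared norm of 0.5, so the penalty is 1.0 × 2 + 0.5 × 2.0 × 0.5 = 2.5. Larger γ discourages extra leaves, and larger λ shrinks the leaf scores.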

The loss function measures how well the model fits the data, and the regularization term controls the complexity of the model. XGB uses gradient descent to update the model parameters based on the gradients of the objective function. The gradient descent update rule can be written as:

$$\theta^{\left(t+1\right)}=\theta^{\left(t\right)}-\eta g^{\left(t\right)}$$
(5)

where t is an iteration index, η is a learning rate, and g(t) is a vector of partial derivatives of the objective function with respect to each element in θ. In XGB, each element in θ corresponds to a leaf score in one tree. XGB grows trees sequentially, adding one tree at a time that fits the current pseudo-residuals (the negative gradients of the objective function). Each tree is a weak learner that makes a small improvement over the previous prediction. The final prediction is a weighted sum of all the trees.
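The sequential fitting of pseudo-residuals can be illustrated with a stripped-down boosting loop in plain Python. This sketch is not XGB: it uses squared-error loss, one-dimensional inputs, and one-split regression stumps, and it omits the regularization term and second-order information. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump: (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the maximum so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_trees=20, eta=0.3):
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        # For the loss 0.5 * (y - pred)^2, the negative gradient is y - pred.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        trees.append((t, lv, rv))
        # Each new tree is added with weight eta (the learning rate).
        preds = [p + eta * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, trees

def predict(model, x, eta=0.3):
    base, trees = model
    return base + eta * sum(lv if x <= t else rv for t, lv, rv in trees)

# Hypothetical one-dimensional data with a step at x = 3.5.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
```

On this toy data each round removes 30% of the remaining residual, so the prediction approaches the labels geometrically. That also shows why a smaller learning rate needs more trees to reach the same fit.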

XGB is a popular and efficient ensemble model that has been applied to various domains (Rasool et al., 2022). Following the RF, we used XGB to predict groundwater potential in the Guanzhong Basin.

Local cascade ensemble (LCE)

The LCE algorithm is a new machine learning method that enhances the prediction performance of RF and XGB by combining their strengths and adopting a complementary diversification approach (Fauvel et al., 2022). It can be used for classification and regression tasks. The LCE algorithm consists of two main steps. First, it trains an RF model on the original dataset and obtains its predictions as new features for each instance. Second, it trains an XGB model on a subset of instances that are selected based on their proximity to the decision boundary of the RF. The proximity is measured by a score function that depends on the number of trees that agree on the predicted class for each instance. The final prediction of LCE is obtained by combining the predictions of both models using a weighted average scheme. The calculation formula for LCE is given by (Fauvel et al., 2022):

$$\widehat{y}=\alpha\widehat{y}_{RF}+\left(1-\alpha\right)\widehat{y}_{XGB}$$
(6)

where ŷ is the final prediction, ŷRF is the prediction of random forest, ŷXGB is the prediction of XGB, and α is a weight parameter that controls the balance between both models.
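Eq. 6 is a convex combination of the two model outputs. A minimal sketch in plain Python is given below; the function name, the default weight, and the sample probabilities are hypothetical and only illustrate how the weight shifts the result between the two models.

```python
def combine(p_rf, p_xgb, alpha=0.5):
    """Eq. 6: weighted average of the RF and XGB predictions."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_rf + (1.0 - alpha) * p_xgb

# Hypothetical predicted probabilities for one grid point.
blended = combine(p_rf=0.8, p_xgb=0.4, alpha=0.25)
```

With a weight of 0.25 the result is 0.25 × 0.8 + 0.75 × 0.4 = 0.5. A weight of 1 reproduces the RF prediction and a weight of 0 reproduces the XGB prediction.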

Results and discussion

Application of the three models

To ensure that the models are as accurate as possible and to avoid potential overfitting, we ran multiple calculations with different parameter values to characterize the performance of the three models, namely RF, XGB, and LCE. Each model has dozens of parameters, but not all of them have a significant impact on accuracy. Thus, we selected the key parameters that have the greatest impact on each model. For the RF model, the key parameters are n estimators, max depth, min samples split, and max features (Breiman, 2001). For the XGB model, the key parameters are n estimators, learning rate, max depth, and subsample (Chen & Guestrin, 2016). For the LCE model, the key parameters are n estimators and max depth (Fauvel et al., 2022). For each parameter, we stepped through a reasonable range at a fixed interval, trained the model on the training set, and recorded its accuracy score on the test set. Using this approach, we obtained the parameter ranges that give high accuracy scores. Restricting the subsequent cross-validation and grid search to these ranges significantly reduces the computational cost and improves accuracy (Pedregosa et al., 2011). The parameter tests of the three models are shown in Figs. 6–8.
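The range-screening step can be sketched generically in plain Python. The text does not state the exact rule used to turn an accuracy curve into a range, so the tolerance-based rule below is an assumption, and the scoring function is a hypothetical stand-in for "train the model with this value and return its test accuracy".

```python
def sweep(values, score_fn):
    """Evaluate score_fn at each candidate value; return a list of (value, score)."""
    return [(v, score_fn(v)) for v in values]

def promising_range(results, tolerance=0.01):
    """Assumed rule: keep the span of values scoring within tolerance of the best."""
    best = max(score for _, score in results)
    good = [v for v, score in results if score >= best - tolerance]
    return min(good), max(good)

def toy_accuracy(n_estimators):
    # Hypothetical accuracy curve peaking at 150 trees (not measured data).
    return 0.9 - abs(n_estimators - 150) / 1000

results = sweep(range(50, 301, 50), toy_accuracy)
low, high = promising_range(results, tolerance=0.06)
```

The resulting interval would then be handed to a cross-validated grid search, which only has to explore the narrowed ranges instead of the full parameter space.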

Fig. 6

The parameters of RF: a n estimators. b max depth. c min samples split. d max features

The “n estimators” parameter of the RF refers to the number of decision trees. Increasing the number of decision trees can enhance the generalization ability of the RF model, but it also raises computational costs and the risk of overfitting (Sexton & Laake, 2009). In this study, we observed that the accuracy of the RF model increased rapidly as the number of decision trees increased, reached a peak at about 150 trees, and then gradually declined (Fig. 6a). Therefore, we set the “n estimators” range of RF to 100–150. The parameter “max depth” determines the maximum depth of each tree. If the value is too small, the model cannot capture the details of the data, and if the value is too large, the generalization ability of the model will decrease. In this study, we found that the accuracy of the RF model did not change after the “max depth” reached 10 (Fig. 6b). Thus, to reduce computational cost, we set max depth to 10–12. The “min samples split” parameter specifies the minimum number of samples required to split a node. As shown in Fig. 6c, the accuracy gradually decreases as min samples split increases from 2 to 6. Thus, we chose 2 as the value of this hyperparameter. The “max features” parameter refers to the number of predictors that the model examines at each split. The size of this parameter affects the diversity and accuracy of the trees. In this study, the performance of “sqrt” and “log2” was basically the same (Fig. 6d). Thus, we kept both as candidates for grid search.

The n estimators parameter of the XGB model represents the number of boosting trees (Chen & Guestrin, 2016). As with the RF model, the n estimators range for XGB should not be too large or too small. We performed sequential calculations for the model from 30 to 800 with a step size of 1 and determined the range of this parameter to be 250–350 based on the results shown in Fig. 7a. The learning rate controls the weight of new trees added to the model. Generally, a smaller learning rate leads to a higher accuracy rate but requires more trees and longer training time. Figure 7b shows that after the learning rate reaches 0.2, the accuracy rate gradually stabilizes. Therefore, we selected the learning rate range of 0.2–0.4. The max depth parameter controls the depth of the boosting tree. Figure 7c indicates that the accuracy rate reaches its peak at max depth values of 6 and 7 and does not change after reaching 10. We therefore chose a range of 5–8 for adjusting the max depth of the XGB model. The subsample parameter controls the proportion of random sampling during each training, which can reduce the risk of overfitting. However, too little sampling will reduce accuracy by affecting the training samples of the model. Therefore, we selected the hyperparameter selection range of 0.4–0.8.

Fig. 7

The parameters of XGB: a n estimators. b learning rate. c max depth. d subsample

As the LCE model is a hybrid ensemble method, it encompasses nearly all the parameters in both the RF and XGB models (Fauvel et al., 2022). However, the two hyperparameters that have the most significant impact are n estimators and max depth. The n estimators parameter denotes the number of base learners for each division of LCE. As depicted in Fig. 8a, it is challenging to significantly improve the model’s accuracy beyond 10 base learners, and there are significant fluctuations. Thus, we chose the range of 9–12 for adjusting this parameter. The max depth parameter controls the maximum depth of the base learners. Based on Fig. 8b, we determined the range of 6–10 for this parameter.

Fig. 8

The parameters of LCE: a n estimators. b max depth

Based on these tests, we narrowed the parameter ranges for the three models, cross-validated the training data, and performed a grid search on the hyperparameters to find the optimal parameter combination for each model. The optimal hyperparameter values for the RF model were 123, 10, 2, and log2; for the XGB model, they were 304, 0.38, 6, and 0.77; and for the LCE model, they were 10 and 8. Using these hyperparameters, the three models were constructed, and the 230,224 points to be predicted were passed to each model. The results were converted into raster data and split into 5 categories with an interval of 0.2 for easy comparison: very low [0.0–0.2), low [0.2–0.4), moderate [0.4–0.6), high [0.6–0.8), and very high [0.8–1.0]. The resulting prediction map is shown in Fig. 9.
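The five-way classification of predicted probabilities can be written compactly, as in the plain-Python sketch below. The function name is hypothetical, and the treatment of a probability of exactly 1.0 (assigned to the top class) is an assumption needed to close the last interval.

```python
CLASSES = ["very low", "low", "moderate", "high", "very high"]

def potential_class(p):
    """Map a predicted probability in [0, 1] to one of five 0.2-wide classes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Intervals are left-closed; min() keeps p == 1.0 in the top class (assumption).
    return CLASSES[min(int(p * 5), 4)]

# Hypothetical predictions for four grid points.
labels = [potential_class(p) for p in (0.05, 0.35, 0.5, 0.93)]
```

Applied to every predicted point, this yields the categorical raster that the maps compare.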

Fig. 9

Groundwater potential assessment using: a RF, b XGB, and c LCE

Based on the results presented in Fig. 9, it can be observed that the three models provide relatively similar spatial distribution of groundwater potential in the Guanzhong Basin. Specifically, the central and southern parts of the model are characterized by high and very high groundwater potential, while the northeastern and northwestern parts of the model are associated with low and very low groundwater potential. This spatial distribution of groundwater potential seems reasonable given that the central part of the model corresponds to Xi’an, the most densely populated area in the region. In addition, higher groundwater potential tends to be distributed on both sides of the Weihe River and its tributary, the Bahe River. The RF model seems to have a larger uncertainty in the assessment of groundwater potential since the distribution of very low and very high groundwater potential is small, and the prediction results are mainly centered on moderate groundwater potential. This could limit the help to decision-makers in formulating next-step water resource management policies. On the other hand, XGB and LCE models tend to predict more clearly either very high or very low groundwater potential, which could provide more insights into the regional groundwater potential.

Validation of groundwater potential maps

To better assess the quality of the groundwater potential predictions in the Guanzhong Basin, we used the ROC (receiver operating characteristic) curve and the AUC (area under the curve) to evaluate the accuracy of the three models (Arabameri et al., 2021). ROC and AUC are commonly used to evaluate the performance of binary classification models. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for different classification thresholds, while AUC measures the area under the ROC curve (Chen et al., 2019b; Wang et al., 2020). A perfect classification model would have an AUC of 1, indicating that it can perfectly distinguish between the two classes. On the other hand, a model with an AUC of 0.5 would perform as well as random guessing, while a model with an AUC below 0.5 would perform worse than random guessing. Given that the training set accuracies of all three models were high (either close to or reaching 1), we primarily conducted ROC-AUC analysis on the test set, as shown in Fig. 10.
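The AUC can be computed without drawing the curve, because the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The plain-Python sketch below uses this pairwise form. It is an illustration with hypothetical labels and scores, not the evaluation code of the study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels (1 = enriched) and predicted probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this example three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. A constant score for every sample gives 0.5, the random-guessing baseline mentioned above.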

Fig. 10

ROC-AUC for the three algorithms

Based on Fig. 10, the XGB model achieved the highest AUC (0.874), followed by RF (0.859) and LCE (0.810). The superiority of the XGB model could be attributed to its ability to prune unimportant features and reduce model complexity, whereas the RF model randomly selects a subset of features without pruning. Additionally, gradient boosting could better fit the data distribution and loss function while limiting the risk of overfitting. Although the LCE model integrates the XGB and RF models, it failed to improve accuracy and instead increased the computational load, as shown in Table 1. The LCE model required 7.7 s, 11 min 10.7 s, and 64 min 32.6 s to run 1, 100, and 1000 times, respectively, while the XGB model required 0.1 s, 18.1 s, and 2 min 45.3 s, and the RF model required 0.1 s, 20.8 s, and 9 min 6.3 s. Because the computation time of the LCE model grows considerably with the number of runs, it can place a heavy burden on the entire calculation process. The LCE model took much longer than the other two models because it trains two layers of base learners, with RF in the first layer and XGB in the second, which increases the computational cost and time. Furthermore, the LCE model makes multiple predictions per sample with different base learner subsets and requires parameter tuning for each base learner. Therefore, in general, we do not recommend using the LCE model to predict groundwater potential. In contrast, the XGB model calculates the gain of each boosting tree in parallel, which speeds up training, while the RF model only builds each decision tree in parallel. In conclusion, XGB was the best model for predicting groundwater potential in the Guanzhong Basin.

Table 1 Time consumed by the three models for 1, 100, and 1000 runs
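To make the comparison of run times concrete, the short Python sketch below converts the times reported in the text to seconds and computes how many times slower LCE was than each of the other two models. The dictionary layout and function names are illustrative assumptions; only the run times themselves come from the study.

```python
# Run times reported in the text, as (minutes, seconds) for 1, 100, and 1000 runs.
REPORTED = {
    "LCE": [(0, 7.7), (11, 10.7), (64, 32.6)],
    "XGB": [(0, 0.1), (0, 18.1), (2, 45.3)],
    "RF": [(0, 0.1), (0, 20.8), (9, 6.3)],
}
RUNS = [1, 100, 1000]


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) pair to seconds."""
    return 60 * minutes + seconds


def slowdown(model, baseline):
    """Ratio of run times model / baseline for each number of runs."""
    return [
        to_seconds(*m) / to_seconds(*b)
        for m, b in zip(REPORTED[model], REPORTED[baseline])
    ]


for baseline in ("XGB", "RF"):
    ratios = slowdown("LCE", baseline)
    summary = ", ".join(f"{n} runs: {r:.1f}x" for n, r in zip(RUNS, ratios))
    print(f"LCE vs {baseline}: {summary}")
```

Because the single-run times of XGB and RF are reported to only one decimal place (0.1 s), the ratio for one run is the least precise of the three.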

Figure 11 shows the distribution of the predictions of the three models for the 230,224 points. The blue histogram displays the share of rasters in each groundwater potential class. The larger the proportion of very low and very high groundwater potential, the more clearly a model indicates whether groundwater is present in an area; conversely, a larger proportion of moderate groundwater potential makes the prediction more ambiguous. The RF predictions for the Guanzhong Basin fall mostly into the low, moderate, and high classes, which account for 96.81% of the study area, while only 3.19% is predicted as very low or very high groundwater potential. This indicates that the groundwater potential predicted by the RF model is concentrated around 0.5. Notably, 0.5 is the threshold that separates the test set labels into enriched groundwater (labeled 1) and lack of groundwater (labeled 0). Because the probabilities that the RF model assigns to the two classes are so similar, the model is not sufficiently reliable in separating the test set. The XGB model predicted that 61.18% of the study area has very high or very low groundwater potential, while the LCE model predicted 46.85%. This suggests that the XGB and LCE models are more decisive in determining the groundwater potential of the basin.
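The class shares discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the equal-interval breaks at 0.2, 0.4, 0.6, and 0.8 are hypothetical (this section does not state the class breaks that were used), and the probabilities in the example are made up.

```python
from bisect import bisect_right
from collections import Counter

CLASSES = ["very low", "low", "moderate", "high", "very high"]
# Assumed equal-interval class breaks; the breaks used in the study may differ.
BREAKS = [0.2, 0.4, 0.6, 0.8]


def classify(probability):
    """Map a predicted probability in [0, 1] to a groundwater potential class."""
    return CLASSES[bisect_right(BREAKS, probability)]


def class_shares(probabilities):
    """Share of points falling in each class (the shares sum to 1)."""
    counts = Counter(classify(p) for p in probabilities)
    total = len(probabilities)
    return {name: counts[name] / total for name in CLASSES}


def extreme_share(probabilities):
    """Combined share of the very low and very high classes."""
    shares = class_shares(probabilities)
    return shares["very low"] + shares["very high"]


# Hypothetical predictions: one model hedges near 0.5, the other is more decisive.
hedging_model = [0.45, 0.50, 0.55, 0.48, 0.52]
decisive_model = [0.05, 0.10, 0.50, 0.90, 0.95]
print(extreme_share(hedging_model), extreme_share(decisive_model))
```

A model whose probabilities cluster around 0.5 places almost every point in the moderate class, which is the behavior described for RF.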

Fig. 11

Raster distributions of groundwater potential of different classes: a RF, b XGB, and c LCE

The green and red histograms in Fig. 11 show how the 205 samples are distributed across the groundwater potential zones of the Guanzhong Basin. The green histogram represents the samples with scarce groundwater: the larger the share of these samples that falls in the very low and low groundwater potential areas, the more accurate the model’s prediction of low groundwater potential. The red histogram represents the samples with enriched groundwater: the larger the share that falls in the very high and high groundwater potential areas, the more accurate the model’s prediction of high groundwater potential. In areas predicted to have very low and low groundwater potential, the proportions of samples with scarce groundwater were 57.14%, 66.67%, and 74.29% for the RF, XGB, and LCE models, respectively. In areas predicted to have very high and high groundwater potential, the proportions of samples with enriched groundwater were 33.66%, 69.31%, and 52.45%, respectively. Overall, the LCE model was more accurate in predicting low groundwater potential, while the XGB model was more accurate in predicting high groundwater potential. By contrast, the credibility of the RF output is questionable, because the probability values it assigns to enriched groundwater and to the lack of groundwater are nearly identical, which makes its results inconclusive.
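One way to compute such proportions is sketched below in Python. It assumes that each percentage is the share of samples of a given label that falls in the matching pair of predicted zones; this reading of the histograms, the function name, and the six toy samples are assumptions made for illustration.

```python
LOW_ZONES = {"very low", "low"}
HIGH_ZONES = {"high", "very high"}


def zone_hit_rates(samples):
    """Return (scarce_rate, enriched_rate) for a list of (label, zone) pairs.

    label: 1 for enriched groundwater, 0 for scarce groundwater.
    zone: the groundwater potential class predicted at the sample location.
    scarce_rate is the share of label-0 samples in the low or very low zones;
    enriched_rate is the share of label-1 samples in the high or very high zones.
    """
    scarce = [zone for label, zone in samples if label == 0]
    enriched = [zone for label, zone in samples if label == 1]
    scarce_rate = sum(zone in LOW_ZONES for zone in scarce) / len(scarce)
    enriched_rate = sum(zone in HIGH_ZONES for zone in enriched) / len(enriched)
    return scarce_rate, enriched_rate


# Hypothetical samples, not taken from the study.
toy_samples = [
    (0, "very low"), (0, "low"), (0, "moderate"), (0, "low"),
    (1, "high"), (1, "moderate"),
]
toy_scarce_rate, toy_enriched_rate = zone_hit_rates(toy_samples)
print(f"{toy_scarce_rate:.2%} {toy_enriched_rate:.2%}")
```

Samples that land in the moderate zone count against both rates, so a model that concentrates its predictions in the moderate class scores low on both.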

Of the three ensemble models used to assess groundwater potential in the Guanzhong Basin, the XGB model achieved the highest AUC and required the least computation time. Although the LCE model was better at predicting low groundwater potential, its AUC was lower, and its computation time was roughly 23 to 77 times that of the XGB model according to the run times in Table 1. The RF model had moderate accuracy and computational performance but provided vaguer results, mostly within the moderate class. In conclusion, the XGB model is the most suitable for evaluating and predicting groundwater potential in the study area.

Conclusion

The present study aimed to assess the groundwater potential of the Guanzhong Basin, China, using GIS-based ensemble learning models. To achieve this, fourteen influencing factors were considered, namely landform, slope, slope aspect, curvature, precipitation, evapotranspiration, distance to fault, distance to river, road density, TWI, soil type, lithology, land cover, and NDVI. The values of these factors were extracted for 205 samples and for 230,224 points discretized from the study area. To train and validate the models, the 205 samples were divided into training and test sets at a ratio of 0.7:0.3. The three ensemble models, RF, XGB, and LCE, were trained and cross-validated using the samples, and their hyperparameters were tuned to obtain the best models. The three models were then applied to the 230,224 points to predict the groundwater potential of the Guanzhong Basin. The results showed that the AUC values of the RF, XGB, and LCE models were 0.859, 0.874, and 0.810, respectively. The predictions of the RF model were concentrated in areas of moderate groundwater potential, indicating some uncertainty in predicting the test set labels. In contrast, the XGB and LCE models performed better in identifying and distinguishing areas with very high and very low groundwater potential in the study area. In areas predicted to have very low and low groundwater potential, the proportions of samples without groundwater were 57.14%, 66.67%, and 74.29% for RF, XGB, and LCE, respectively. In areas predicted to have very high and high groundwater potential, the proportions of samples with abundant groundwater were 33.66%, 69.31%, and 52.45% for RF, XGB, and LCE, respectively. In terms of computational load, LCE required the most time and resources, while XGB required the least.
Hence, based on the results, the XGB model was found to be the best for predicting the groundwater potential of the Guanzhong Basin. This study provides valuable insights into groundwater potential assessment using GIS-based ensemble learning models. The results can be useful for policymakers and water resource management authorities for sustainable management of groundwater resources in the Guanzhong Basin and similar regions.