Introduction

Groundwater is an essential resource for supporting human existence and global socio-economic progress (Anand et al. 2021). It serves as a critical source of freshwater for drinking, farming, and industrial activities. Because of its wide availability and affordability, it acts as the major source of water for many suburban and rural settlements in developing countries (Wang et al. 2023a). Nevertheless, groundwater depletion presents an increasing worldwide concern since it is not a renewable source. Providing an uninterrupted groundwater supply to fulfill the upsurging demands of the community continues to be a daunting issue (Sun et al. 2019). Hence, forecasting groundwater potential is imperative for productive management, utilization, and preservation of this scarce asset.

Groundwater potential refers to either the availability of groundwater in a particular area or the volume of water that can be withdrawn from an aquifer without affecting the surrounding environment (Jhariya et al. 2021). The formation and change process are complex and determined by various environmental factors (Tegegne 2022). In the past, forecasting groundwater potential relied on traditional methods, such as drilling and geophysical surveys, which were both costly and time-consuming, and limited to specific areas. However, with the advancement in artificial intelligence algorithms and computer performance (Reichstein et al. 2019), researchers are increasingly turning to machine learning models to forecast groundwater potential. By collecting and quantifying environmental factors through remote sensing (Shamsudduha and Taylor 2020) and geographic information systems (Bera et al. 2021), researchers can employ machine learning algorithms to predict the spatial distribution of groundwater potential or examine the intricate relationships between various environmental factors and groundwater potential. Commonly, logistic regression (Rizeei et al. 2019), decision trees (Lee and Lee 2015), random forests (Golkarian et al. 2018), gradient boosting machines (Sachdeva and Kumar 2021), support vector machines (SVM) (Naghibi et al. 2017), deep neural networks (Wang et al. 2022), and convolutional neural networks (Panahi et al. 2020) were used to forecast groundwater potential. Additionally, hybrid models that combine machine learning with other evaluation methods like analytic hierarchy process (Ahmad et al. 2023), Technique for Order of Preference by Similarity to Ideal Solution (Mahnaz Zaree et al. 2019), rank sum ratio (Wang et al. 2023a), multi-criteria decision-making (Farhat et al. 2023), and genetic algorithms were also employed to improve predictions.

Machine learning algorithms provide efficient methods for evaluating regional groundwater potential. However, previous studies focused mainly on comparing and evaluating the performance of various models (Wang et al. 2023b). To predict groundwater potential in a particular area, researchers collected existing data, typically obtained from boreholes that pump groundwater (Arabameri et al. 2021), and used them as the training dataset for the model. Researchers often trained two or more separate models and compared their accuracy to identify the most precise one (Razandi et al. 2015). However, collecting sufficient training data is challenging because economic and natural constraints limit sample collection. Sample sizes in several studies range from a few tens to two hundred (Bera et al. 2021), which may be inadequate for machine learning models to offer precise predictions. Furthermore, to obtain a more credible estimate of model performance, many studies split the dataset into a training set of 70% of the samples and a test set of 30% (Pham et al. 2021), which removes nearly one-third of an already small sample from training. Additionally, the prediction of groundwater potential can be affected by the choice of training features, as some factors may be more sensitive to regional groundwater potential than others (Thanh et al. 2022). Researchers often aim to use as many factors as possible in training to increase accuracy, but this approach can lead to multicollinearity problems, particularly in commonly used linear models, such as linear regression, logistic regression, SVM, and naive Bayes. Furthermore, selecting too many factors can cause the curse of dimensionality (Pedregosa et al. 2011), which limits the accuracy of the model, particularly when the number of borehole samples is small. Therefore, in the context of predicting groundwater potential, the selection and number of factors, as well as the sample size, can have a greater impact on the accuracy of the model than the choice of model or its parameters.

In this study, we aim to enhance the accuracy of groundwater potential prediction by exploring the impact of feature selection methods and sample size on prediction performance. A total of 16 factors that affect groundwater potential in Qinghai Province, China were collected. Unlike previous studies that train directly on all factors, this research compares four distinct methods. The “ALL” method uses all 16 factors affecting groundwater potential, whereas the other three techniques reduce the dimension of the factors: Principal Component Analysis (PCA) (Sun et al. 2021), Entropy (Naghibi et al. 2015b), and Criteria Importance Through Intercriteria Correlation (CRITIC) (Rostamzadeh et al. 2018). Each of these three methods retains only eight features for training, providing a comprehensive comparison of their performance. Furthermore, an extensive analysis of the influence of sample size on prediction performance is conducted, a factor often overlooked in previous studies. As our research was not focused on finding the ideal model or parameter, the automated machine learning (AutoML) method (Wang et al. 2021) was used to identify and select the optimal model and parameter set. The trained models were then applied to assess groundwater potential across the entire Qinghai Province. The results of this study will be helpful for guiding future research on groundwater potential prediction by assisting with the selection of appropriate sample sizes and features.

Data and data processing

Description of the study area

Qinghai Province is situated in the northeast region of the Qinghai-Tibet Plateau in Northwest China, spanning from 89°35′E to 103°04′E for longitude and 31°9′N to 39°19′N for latitude (Liu et al. 2013). The province covers an area of approximately 720,000 km² and has an altitude ranging from 2000 to 5000 m (Han et al. 2016). Its geological conditions chiefly arose from the Himalayan orogeny and the Qinghai-Tibet Plateau uplift. The region is endowed with a plethora of rock strata from various geological periods from the Paleozoic to the Late Cenozoic, along with some scattered volcanic and intrusive rocks. The study area experiences a plateau continental climate and its annual mean temperature oscillates between −5.1 and 9.0 °C, while the annual mean rainfall varies substantially across different locations, ranging from 18 to 780 mm (Han et al. 2021). The average annual evaporation is around 1012–3335 mm (Fig. 1). The region is the origin of numerous rivers and lakes, including those that flow outward into the ocean, such as the Yangtze River and the Yellow River (Cao et al. 2020), as well as inland rivers that pour into the Qaidam Basin (Wang et al. 2022). Qinghai Province is crucial to China’s water conservancy and mining industries, as it is abundant in a broad range of non-ferrous metals, coal, oil, gas, and salt minerals (Kong et al. 2017). Despite the abundance of local water resources, the province still faces a notable disparity between water supply and demand. Therefore, it is necessary to provide an accurate assessment of groundwater resources in Qinghai Province to achieve sustainable development and efficient management, given the importance of water for industrial and agricultural activities in the region.

Fig. 1 (a) Location of Qinghai Province in China. (b) Topographic features and location of the samples in the study area

Sample dataset

The reliability of prediction results and the accuracy of machine learning algorithms used for groundwater potential prediction depend on the number of training samples (Panahi et al. 2020). However, economic and natural constraints in the geosciences make it difficult to collect sufficient samples for the prediction process. Therefore, many researchers use a limited number of samples to estimate the groundwater potential of various study areas. In this study, we investigated the influence of the number of samples on the prediction results by constructing datasets of different sizes. A total of 800 samples were extracted randomly from the hydrogeological map of GeoCloud (http://geoscience.cn) (Wang et al. 2023b) and grouped into five datasets of 50, 100, 200, 600, and 800 samples (Fig. 1). We also used 1500 validation samples to evaluate the model’s performance. Based on groundwater abundance at the source location indicated on the hydrogeological map, the samples were divided into two categories: type 1 for enriched groundwater and type 0 for groundwater scarcity. Moreover, the entire study area was discretized at a resolution of 1 km × 1 km into 699,016 points, which were used to map the spatial distribution of groundwater potential in the study area after model training.

Database of conditioning factors

The generation, enrichment, migration, and discharge of regional groundwater are directly or indirectly controlled by groundwater conditioning factors (Mahnaz Zaree et al. 2019). These factors serve as features in the machine learning model used for predicting groundwater potential. Therefore, before proceeding with model training and predicting groundwater potential, it is necessary to collect the values of all conditioning features that could potentially impact groundwater across the entire study area. Furthermore, the spatial resolution of the factor data must be no coarser than that of the discrete units to ensure their accuracy. In our study, every factor had a resolution finer than 1 km × 1 km. We analyzed 16 characteristics in the Qinghai region that affect groundwater (Díaz-Alcaide and Martínez-Santos 2019), including precipitation, evapotranspiration, normalized difference vegetation index (NDVI), landcover, slope, topographic wetness index (TWI), slope aspect, curvature, distance to rivers, distance to roads, fault density, residential density, landform, vegetation types, soil, and lithology (Figs. 2, 3, 4, and 5). Each characteristic and its impact on predicting groundwater potential is discussed separately below.

Fig. 2 Factors affecting groundwater potential: (a) precipitation (mm), (b) evapotranspiration (mm), (c) NDVI, and (d) vegetation types

Fig. 3 Factors affecting groundwater potential: (a) slope (degree), (b) TWI, (c) slope aspect, and (d) curvature

Fig. 4 Factors affecting groundwater potential: (a) distance to rivers (km), (b) distance to roads (km), (c) fault density (km−1), and (d) residential density (km−2)

Fig. 5 Factors affecting groundwater potential: (a) landform, (b) landcover, (c) soil, and (d) lithology

Precipitation and evaporation are, respectively, a principal source and a principal sink of groundwater. During rainfall, some water infiltrates into the ground and replenishes groundwater (Jin et al. 2013). Precipitation also enhances the replenishment of surface runoff, indirectly influencing groundwater replenishment (Jia et al. 2011). On the other hand, evaporation is the process through which water is removed from soil and groundwater reservoirs, thereby reducing groundwater storage. For this study, we acquired precipitation data for Qinghai Province from WorldClim (Fick and Hijmans 2017), with values ranging between 18 and 780 mm, and evaporation data from https://data.cma.cn, with values ranging between 1012 and 3335 mm. The spatial variability of precipitation and evaporation across the study area was significant, as shown in Fig. 2(a) and (b). The Qaidam Basin, situated in the northwest of Qinghai Province, experienced minimal precipitation, paired with high evaporation. Conversely, the eastern part of the study area received relatively abundant rainfall.

Vegetation type and coverage are critical factors in regulating and using groundwater. Vegetation can impact soil water-holding capacity and the rate of evaporation, which leads to changes in groundwater recharge and the water table level (Orellana et al. 2012). In areas with ample vegetation, plants can reduce the evaporation rate of groundwater, and the root system can increase infiltration capacity. Consequently, this can decrease surface water runoff and improve groundwater recharge. However, vegetation also absorbs more groundwater and subsequently releases it into the atmosphere through transpiration, which can lead to groundwater depletion. To assess the impact of vegetation on groundwater potential, this study considers vegetation types and the normalized difference vegetation index (NDVI) as factors. We divided the vegetation in Qinghai Province into six types: coniferous forests, shrublands, deserts, grasslands, meadows, and bare grounds. NDVI is calculated using multispectral remote sensing data using the following formula (Han et al. 2021):

$${\text{NDVI}}=\frac{NIR-RED}{NIR+RED}$$
(1)

where NIR represents the near-infrared band’s reflectance value and RED represents the reflectance value of the red band. Both NIR and RED are reflectance values ranging from 0 to 1, so NDVI can in principle range from −1 to 1, with higher values indicating greater vegetation growth and coverage. We calculated the NDVI of Qinghai Province using MODIS images (accessed from https://glovis.usgs.gov); the resulting values ranged from 0 to 0.92 (Fig. 2(c)).
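As a minimal sketch of Eq. (1), the snippet below computes NDVI element-wise for two small reflectance grids. The function name `ndvi`, the sample arrays, and the choice to mark pixels with zero total reflectance as NaN are illustrative assumptions, not part of the original workflow.

```python
import numpy as np

def ndvi(nir, red):
    """Compute NDVI = (NIR - RED) / (NIR + RED) element-wise (Eq. 1)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    # Assumption: pixels with zero total reflectance have no defined NDVI,
    # so they are returned as NaN instead of raising a division warning.
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out

# Hypothetical 2 x 2 reflectance grids (values between 0 and 1).
nir_band = np.array([[0.50, 0.30], [0.00, 0.60]])
red_band = np.array([[0.10, 0.30], [0.00, 0.20]])
print(ndvi(nir_band, red_band))
```

Equal reflectance in both bands gives an NDVI of 0, while a strong near-infrared response relative to red pushes the value toward 1.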

Slope is the measure of the degree of rise or fall in the vertical direction as the surface moves a particular distance in the horizontal direction, typically expressed in angles. On steeper slopes, surface runoff flows more quickly and infiltrates less into the groundwater. In contrast, in areas with gentle slopes, surface water is more likely to recharge the groundwater (Naghibi et al. 2015a). The study area’s slope data can be extracted from the digital elevation model (DEM; obtained from https://gscloud.cn), which is calculated based on the elevation difference and horizontal distance from one DEM grid to the surrounding eight DEM grids. The slope range of Qinghai Province is 0–30.37° (Fig. 3(a)). Slope aspect is the surface slope’s orientation, specifically, the direction with the steepest slope expressed as an angle relative to the north direction. Slope aspect affects the direction of water flow in surface runoff, leading to variations in groundwater recharge across space (Singh et al. 2019). The slope aspect data were obtained by extracting the direction with the most significant elevation difference of the surrounding grid values within the DEM. Based on the aspect angle, the slope aspect of the study area can be categorized into nine categories: flat, north, northeast, east, southeast, south, southwest, west, and northwest (Fig. 3(b)). Curvature is a measure used to describe the topography of a surface. This metric has a direct impact on the flow and infiltration of surface runoff. In regions with lower curvature, concave areas can develop, which are prone to retaining and accumulating surface water (Arabameri et al. 2019). These areas facilitate the complete recharge of groundwater. In contrast, regions with higher curvature are convex, allowing for the swift flow and flooding of surface runoff and resulting in a decreased supply of groundwater. 
To determine the curvature data for the study area, we extracted the values from the DEM data and sorted them into three categories based on magnitude: concave (< 0), flat (0), and convex (> 0) (Fig. 3(c)).

TWI is a metric that quantifies the potential for water retention in soil and vegetation within a specific region. It is calculated based on DEM using the following formula:

$$\text{TWI}=\ln\left(\frac\alpha{\tan\;\beta}\right)$$
(2)
$$\alpha =\left(FA+1\right)\times S$$
(3)

where α denotes the catchment area, which can range from 0 to the total area of the catchment. β represents the local slope angle, so tan β can range from 0 (for a flat surface) to infinity (for a vertical surface). FA is the flow accumulation value, which can range from 0 (for areas with no inflow) to a large number representing the total inflow into a point. S corresponds to the grid cell area, which is a fixed value determined by the grid resolution. A higher TWI value indicates poor drainage and longer water retention times, which can contribute to maintaining soil moisture and increasing groundwater supply (Rinderer et al. 2014). Conversely, a lower TWI value indicates good drainage, lower soil moisture content, and relatively lower groundwater supply. In Qinghai Province, the TWI was calculated to be between 14.60 and 34.61 (Fig. 3(d)).
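Equations (2) and (3) can be sketched as follows, assuming that flow accumulation and slope (in degrees) are already available as grids derived from the DEM. The function name `twi` and the small minimum slope used to avoid division by zero on flat cells are assumptions made for this illustration only.

```python
import numpy as np

def twi(flow_acc, slope_deg, cell_size, min_slope_deg=0.1):
    """Topographic wetness index from flow accumulation and slope.

    flow_acc      : number of upstream cells draining into each cell (FA)
    slope_deg     : slope angle beta in degrees
    cell_size     : grid spacing, so the cell area S is cell_size ** 2
    min_slope_deg : assumed lower bound on slope, keeps tan(beta) > 0
    """
    flow_acc = np.asarray(flow_acc, dtype=float)
    slope_deg = np.asarray(slope_deg, dtype=float)
    alpha = (flow_acc + 1.0) * cell_size ** 2                 # Eq. (3)
    beta = np.radians(np.maximum(slope_deg, min_slope_deg))   # clamp flat cells
    return np.log(alpha / np.tan(beta))                       # Eq. (2)

# Hypothetical 1 km cells: more inflow or a gentler slope raises the index.
print(twi(flow_acc=[10, 100, 100], slope_deg=[5.0, 5.0, 20.0], cell_size=1000.0))
```

The clamp is a common practical workaround rather than part of the definition; without it, perfectly flat cells would yield an infinite index.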

The connection between rivers and groundwater is clear. During periods of high water flow, water infiltrates into the groundwater, boosting groundwater recharge. Conversely, during droughts, groundwater sustains the river’s flow (Golkarian et al. 2018). Groundwater reservoirs located near rivers are typically permeable and mobile, allowing for faster water flow and spreading. In this study, the distance to rivers in Qinghai Province was divided into six classes: < 5, 5–15, 15–30, 30–50, 50–100, and > 100 km (Fig. 4(a)). Road construction can interrupt soil connectivity, hindering groundwater flow and drainage as well as changing the direction of groundwater flow (Velis et al. 2017). Similarly, we classified the distance to roads in Qinghai Province into six categories: < 5, 5–15, 15–30, 30–50, 50–100, and > 100 km (Fig. 4(b)).

Varied fault densities can have diverse effects on groundwater distribution and flow. Faults can be influenced by vertical stresses and are susceptible to deformation and rupture, which, in turn, facilitate the flow and penetration of groundwater, resulting in abundant underground water supply in those areas (Ahmad et al. 2021). However, high-density subsurface faults can also interfere with water flow during certain times. The fault density in Qinghai Province has been calculated to range from 0 to 0.28 km−1 (Fig. 4(c)). In areas with high population density, there is an inherent rise in water demand which results in the extraction of groundwater and a resultant drop in the groundwater table. Furthermore, urban areas with dense populations have increased conversion rates between surface water and groundwater, for example, rainwater seeping into sewers and underground pipes. The study area houses most of its population in the eastern region, where the resident density ranges from 0 to 0.51 km−2 (Fig. 4(d)).

The undulations within the terrain dictate the height and rate of groundwater flow, as well as its direction. In comparison to flat terrain, areas with tortuous topography and significant fluctuations encounter more fluctuations in groundwater levels and experience complex water flow dynamics. In steep and irregular areas, ground rainfall quickly gathers to form rivers and streams, leading to water loss, while mountainous areas with pitted terrain are more favorable to groundwater accumulation and retention (Subba Rao 2006). Additionally, solid precipitation, like snow, can influence hydrological processes through subsurface processes, in varying geomorphological regions. The study area is characterized by four basic landform types: mountain, plateau, plain, and glacier (Fig. 5(a)). Differential land use and land management practices have a profound impact on recharge rates and aquifer storage capacity. For instance, urban expansion contributes to enlarged impervious surfaces such as buildings and roads that reduce infiltration, cause surges in stormwater runoff, and thus cause a decline in groundwater recharge rates. Conversely, intensive agricultural practices, such as irrigation, can exhaust or deplete groundwater reservoirs. In Qinghai Province, the types of land cover (data obtained from http://globallandcover.com) can be categorized into nine types: cropland, forest, grassland, aquatic, artificial, tundra, sandy land, Gobi, and saline-alkali land (Fig. 5(b)).

The interaction between surface water and groundwater and the connection between them is reflected in soil and lithology, as they both play a crucial role in this process. Different types of soil and lithology exhibit varying levels of permeability, water storage, and drainage capabilities, and thus, influence the hydrodynamic attributes of groundwater (Wang et al. 2022). The study area encompasses ten soil categories: black soil, brown soil, desert soil, meadow soil, saline soil, felty soil, glacier and snow cover, alpine meadow soil, alpine desert soil, and salt crust (Fig. 5(c)). Lithology, on the other hand, is divided into seven distinct categories: intrusive rock, Cenozoic, Mesozoic, early Paleozoic, middle Paleozoic, early Proterozoic, and middle Proterozoic (Fig. 5(d)).

Methodology

Figure 6 illustrates the systematic flow chart of this study, which is composed of four key steps:

  (1) The 16 conditioning factors were divided into 5 categorical variables and 11 numerical variables. One hot encoding was employed to transform the categorical variables into 26 categories, while the numerical variables were standardized and used directly as continuous data.

  (2) During the machine learning training phase, four distinct methods were used to select different factors as sample features. The ALL feature subset incorporated all the factors, while the PCA method projected the factors onto an 8-dimensional space. The Entropy and CRITIC methods were used as weighting techniques to quantify the weights of the 16 factors, and in each case the 8 factors with the highest weights were selected. This step is crucial for reducing dimensionality and focusing on the most influential features.

  (3) Training sets of different sizes were input sequentially into the AutoML models, and the model yielding the highest accuracy was chosen. This step allows for an unbiased and automated selection of the best performing model. Furthermore, we conducted an experiment by incrementally increasing the sample size from 50 to 800 to assess the accuracy of AutoML on the test set after each training iteration.

  (4) Finally, the study area was divided into 699,016 points, which were used to create raster data to map the groundwater potential of the entire study area. By comparing the accuracy and generalization of model predictions obtained with different factors and sample sizes, we evaluated the effect of the selected samples and factors on prediction.

Fig. 6 Flowchart of the methodology

One hot encoding and Principal Component Analysis

One hot encoding is a technique used for encoding discrete features. It involves encoding a discrete factor into binary form, such that each type’s binary encoding is unique and distinct. To implement one hot encoding in the context of a groundwater potential analysis, discrete types are first assigned unique integers (Bai et al. 2022). Each integer is then converted to a binary vector in which only the bit for that type is set to 1, as depicted in Fig. 7. The resulting matrix has rows that represent individual samples and columns that represent the encoded bits of the various discrete types. Each code bit can only have a value of 0 or 1, indicating whether a specific sample belongs to the corresponding discrete type (Pedregosa et al. 2011).

Fig. 7 One hot encoding

One hot encoding ensures that every type of a categorical input is treated equally by the model. If integer labels were used to encode different types, the machine learning model could interpret the magnitudes of the encoded values as a quantitative relationship between types, potentially leading to inaccurate predictions. One hot encoding turns each type into a binary indicator, thereby increasing the interpretability of the model’s predictions. For this study, we used one hot encoding to encode five features: landform, vegetation types, soil, lithology, and landcover. These features cannot be expressed numerically; therefore, one hot encoding was used to encode them accurately.
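The encoding can be illustrated with a small sketch. It uses the four landform types named in this study as sample labels; the helper `one_hot` and the alphabetical ordering of the columns are assumptions for illustration and do not reproduce the exact tooling used in the study.

```python
import numpy as np

def one_hot(labels):
    """Return (categories, matrix) with matrix[i, j] = 1 if labels[i] == categories[j]."""
    categories = sorted(set(labels))                   # one column per distinct type
    index = {c: j for j, c in enumerate(categories)}   # type -> unique integer
    matrix = np.zeros((len(labels), len(categories)), dtype=int)
    for i, label in enumerate(labels):
        matrix[i, index[label]] = 1                    # exactly one bit set per sample
    return categories, matrix

# Hypothetical landform labels for five sample points.
landform = ["mountain", "plain", "glacier", "plateau", "plain"]
categories, encoded = one_hot(landform)
print(categories)
print(encoded)
```

Each row contains a single 1, so no ordering between landform types is implied by the encoded values.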

PCA is a data dimensionality reduction method that transforms high-dimensional data into a lower dimensional space (Pan et al. 2016). Specifically, it performs a linear transformation of the original data to a new coordinate system, finding the direction that maximizes the variance of the data in the new coordinate system, referred to as the first principal component. The second principal component is then found, which is orthogonal to the first principal component, and subsequent principal components are found successively, until the first k principal components are generated (Helena et al. 2000). These principal components comprise a new, lower dimensional space and have certain explanatory properties that can aid in understanding the data distribution. For instance, assume that there is a groundwater potential assessment data set x containing n samples and m features, where m, n ∈ N (set of natural numbers). These data must first be standardized into y:

$${y}_{ij}=\frac{{x}_{ij}-{\mu }_{j}}{{\delta }_{j}},\left(i=1,2,3,\dots ,n;\;j=1,2,3,\dots ,m\right)$$
(4)

where δj represents the standard deviation of feature j and μj denotes its mean. Both δj and μj are real numbers (δj, μj ∈ R). Once y has been obtained, the covariance matrix C can be computed. For any pair of features j and k, the covariance is calculated as follows, where ȳj denotes the mean of the standardized feature j (which equals zero after standardization):

$${C}_{jk}=\frac{1}{n-1}\sum \limits_{i=1}^{n}\left({y}_{ij}-{\bar{y}}_{j}\right)\left({y}_{ik}-{\bar{y}}_{k}\right)$$
(5)

The eigenvalue decomposition of the covariance matrix C produces m eigenvalues λ1, λ2, … λm (λi ∈ R), and corresponding eigenvectors v1, v2, … vm. Each eigenvector vi defines the direction of one axis of the new coordinate system, and each of its elements is the weight of the corresponding original feature along that axis. The eigenvalue λi represents the variance of the data along the direction vi and thus measures the degree of dispersion of the data. The eigenvectors are sorted according to the eigenvalues, from largest to smallest, and the top k eigenvectors corresponding to the largest eigenvalues are selected as the principal components. The data are then projected onto the principal components to obtain the dimensionally reduced data matrix Z:

$$Z={V}_{k}Y$$
(6)

Here, Vk represents the matrix composed of the first k principal components (k ∈ N and k ≤ m), and Y is the normalized original data matrix. PCA is a commonly used technique, and its strengths include the ability to compress data dimension while retaining maximum information. During groundwater potential prediction, too many features may lead to the curse of dimensionality, due to the lack of training samples. As such, we reduced the 16 factors to eight features after applying PCA processing.
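The PCA steps in Eqs. (4) to (6) can be sketched with NumPy as follows. The synthetic 200 × 16 matrix, the function name `pca_reduce`, and the row-per-sample layout (so the projection is written as `y @ v_k`) are assumptions for illustration; every feature is assumed to have non-zero variance.

```python
import numpy as np

def pca_reduce(x, k):
    """Project samples (rows of x) onto the top-k principal components."""
    x = np.asarray(x, dtype=float)
    # Eq. (4): standardize each feature (column) to zero mean and unit std.
    y = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Eq. (5): covariance matrix of the standardized features (1 / (n - 1)).
    c = np.cov(y, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(c)
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    v_k = eigvecs[:, order]                    # columns = principal directions
    z = y @ v_k                                # Eq. (6), samples stored in rows
    explained = eigvals[order] / eigvals.sum() # share of variance per component
    return z, explained

# Synthetic stand-in for 200 samples with 16 factors (not the study data).
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 16))
samples[:, 1] = samples[:, 0] + 0.1 * rng.normal(size=200)  # correlated pair
z, explained = pca_reduce(samples, k=8)
print(z.shape, round(float(explained.sum()), 3))
```

The `explained` vector is a convenient check on how much variance is retained when 16 factors are compressed to eight components.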

Entropy weight method and criteria importance through intercriteria correlation

The Entropy Weight Method (EWM) utilizes fuzzy mathematics theory and information entropy theory to calculate the weight of indicators (Zhang et al. 2021b). A matrix N of m indicators and n samples can be formed for data that has been standardized or normalized, where m, n ∈ N. The proportion P of the jth index in the ith sample can be obtained, which reflects the variation of the index, such that (Li et al. 2019)

$${P}_{ij}=\frac{{N}_{ij}}{\sum_{i=1}^{n}{N}_{ij}}$$
(7)

Using P, the information entropy of the jth index is calculated as follows:

$$E_j=-\frac1{\ln\;n}\sum \limits_{i=1}^nP_{ij}\ln P_{ij},\left(j=1,2,\cdots,m\right)$$
(8)

By using the above formula, the information entropy of each indicator can be calculated. Based on this, the weight γj of indicator j can be obtained through the information entropy of all indicators:

$${\gamma }_{j}=\frac{1-{E}_{j}}{\sum_{k=1}^{m}\left(1-{E}_{k}\right)}$$
(9)

The information entropy reflects how much the values of an indicator differ across samples. The weight decreases as information entropy increases, because a high entropy means that the indicator takes similar values in all samples and therefore carries little discriminating information. Conversely, the weight increases as information entropy decreases, indicating that the indicator varies strongly across samples.
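A minimal sketch of the entropy weighting in Eqs. (7) and (8), followed by the weight computed as one minus the entropy and normalized over all indicators. The function name `entropy_weights`, the toy matrix, and the convention that 0 · ln 0 is treated as 0 are assumptions; the input is assumed to be non-negative with a positive sum in every column.

```python
import numpy as np

def entropy_weights(n_matrix):
    """Entropy weights for a (samples x indicators) non-negative matrix."""
    n_matrix = np.asarray(n_matrix, dtype=float)
    n_samples = n_matrix.shape[0]
    p = n_matrix / n_matrix.sum(axis=0)               # Eq. (7): column proportions
    # Assumption: terms with p = 0 contribute 0 to the entropy sum.
    safe_p = np.where(p > 0, p, 1.0)
    e = -(p * np.log(safe_p)).sum(axis=0) / np.log(n_samples)  # Eq. (8)
    d = 1.0 - e                                       # degree of divergence
    return d / d.sum()                                # weight of each indicator

# Toy data: indicator 0 is constant, indicator 1 varies across samples.
toy = np.array([[1.0, 0.1],
                [1.0, 0.9],
                [1.0, 0.5]])
weights = entropy_weights(toy)
print(weights)
```

In this toy case the constant indicator has the maximum entropy of 1 and receives essentially no weight, which matches the interpretation given above.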

CRITIC calculates the correlation coefficient between indicators to determine the degree of mutual influence between indicators, and subsequently, calculates their weight. Unlike the EWM, the CRITIC method requires normalization, not standardization, before processing (Giao et al. 2023). This is because the weight evaluation criterion is based on the standard deviation, which standardization would force to the same value for every indicator. Different indicators require different normalization methods (Zhang et al. 2021a). For instance, indices that have a positive impact on groundwater potential enrichment are calculated using the expression:

$${y}_{ij}=\frac{{x}_{j}-{x}_{\mathrm{min}}}{{x}_{\mathrm{max}}-{x}_{\mathrm{min}}}$$
(10)

For indicators that have a negative impact on groundwater potential enrichment, yij can be calculated using

$${y}_{ij}=\frac{{x}_{\mathrm{max}}-{x}_{j}}{{x}_{\mathrm{max}}-{x}_{\mathrm{min}}}$$
(11)

Using this, the amount of information C of the jth index can be calculated, as follows:

$${C}_{j}={\sigma }_{j}\sum \limits_{k=1}^{m}\left(1-{r}_{jk}\right)$$
(12)

where σj represents the standard deviation of index j (σj ∈ R) and rjk denotes the correlation coefficient between index j and k (rjk ∈ R). All C values are calculated based on the above formula, and ultimately, the weight vector for each indicator is obtained as follows:

$${W}_{j}=\frac{{C}_{j}}{{\sum }_{k=1}^{m}{C}_{k}}$$
(13)

The CRITIC method pays attention to the relationship between indicators, unlike the EWM, but requires prior knowledge of the correlation between indicators. Through the EWM and CRITIC methods, the weights of the 16 factors were obtained. The eight factors with larger weights were then selected to predict groundwater potential, thereby reducing the model’s complexity and making groundwater prediction more interpretable.
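The CRITIC procedure in Eqs. (10) to (13), together with the selection of the highest-weighted factors, can be sketched as below. The random 50 × 16 matrix, the function name `critic_weights`, and the choice of which columns count as negative indicators are purely hypothetical.

```python
import numpy as np

def critic_weights(x, negative=()):
    """CRITIC weights for a (samples x indicators) matrix.

    negative: column indices of indicators that reduce groundwater potential.
    """
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    y = (x - x_min) / (x_max - x_min)          # Eq. (10): positive indicators
    for j in negative:
        y[:, j] = 1.0 - y[:, j]                # Eq. (11): (max - x) / (max - min)
    sigma = y.std(axis=0, ddof=1)              # contrast intensity of each indicator
    r = np.corrcoef(y, rowvar=False)           # pairwise correlation r_jk
    c = sigma * (1.0 - r).sum(axis=1)          # Eq. (12): information content
    return c / c.sum()                         # Eq. (13): normalized weights

# Synthetic stand-in for 16 conditioning factors (not the study data).
rng = np.random.default_rng(1)
factors = rng.random((50, 16))
w = critic_weights(factors, negative=[2, 5])
top8 = np.argsort(w)[::-1][:8]                 # indices of the eight largest weights
print(np.round(w, 3))
print(sorted(top8.tolist()))
```

An indicator gains weight when it varies strongly and is weakly correlated with the others, which is why a near-duplicate of another factor tends to drop out of the top eight.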

Automated machine learning

Machine learning, a subfield of artificial intelligence, constructs general paradigms for predicting or classifying new data by recognizing patterns and rules in known datasets. However, the multitude of machine learning models available, each with varying effectiveness for different problems, makes choosing the best model a challenge. Existing machine learning models, whether simple single models like decision trees or complex ensemble models, all contain a wealth of hyperparameters. Therefore, the traditional process of building machine learning models involves algorithm selection and manual adjustment of hyperparameters, which requires a significant amount of time and effort.

AutoML is a process that automates these time-consuming iterative tasks in machine learning model development. It simplifies the application of machine learning by automatically selecting models, adjusting hyperparameters, and optimizing model performance (Feurer et al. 2015). It can quickly build high-quality machine learning models without the need for laborious manual tuning. The process of AutoML usually includes data preprocessing, feature engineering, model selection, hyperparameter tuning, and post-evaluation. This study mainly involves model selection and hyperparameter optimization. Hyperparameter tuning can be represented by the following formula:

$$x^{*}=\underset{x}{\arg\min}\;f\left(x\right)$$
(14)

In this formula, x* represents the optimal hyperparameter configuration and f(x) represents the model’s loss function for a configuration x. We use the Mean Squared Error (MSE) as the loss:

$$f\left(x\right)=\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{z}_{i}\right)}^{2}$$
(15)

Here, n is the total number of training samples (n ∈ N), yi is the true value of the ith sample (yi ∈ R), and zi is the value predicted for the ith sample by the machine learning model (zi ∈ R). Traditional hyperparameter optimization methods include grid search and random search, which respectively enumerate and randomly sample parameter combinations in the parameter space to find the best one. However, in high-dimensional parameter spaces, both methods consume too much time and can yield unreliable parameter selections. This study instead adjusts machine learning hyperparameters with Bayesian optimization, which establishes a probability model of the loss function. The specific steps are as follows:

  • (1) Select some initial hyperparameter samples x, and calculate their target function values f(x).

  • (2) Based on the existing samples x and their corresponding target function values f(x), establish a surrogate model p(·) for the target function f(·), usually a Gaussian process.

  • (3) Based on the surrogate model and observed data points, use an acquisition function to determine the next query point. Common acquisition functions include Expected Improvement (EI) and Probability of Improvement. This study uses EI, whose formula is as follows:

    $$EI\left(x\right)=E\left[\mathrm{max}\left(f\left(\mathbf{x}\right)-f\left({x}_{i}^{*}\right),0\right)\right]$$
    (16)

Here, xi represents a batch of candidate points generated in the ith iteration. We calculate the EI values of all candidate points and select the point ximax with the maximum EI value as the next query point. We then calculate its corresponding target function value f(ximax) and add this data point to the sample set x.

  • (4) Repeat steps 2 and 3 until a stopping condition is met, such as when the number of iterations reaches a preset value or when the target function value falls below a certain threshold. In this way, we can balance exploration (searching regions that have not yet been evaluated) and exploitation (searching near points already known to perform well), thereby efficiently approaching a global optimal solution.
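The four steps above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: it tunes a single scalar hyperparameter on [0, 1], uses a Gaussian-process surrogate with a squared-exponential kernel (length scale 0.2 and noise 1e-4 are arbitrary choices), and draws 50 random candidates per iteration. Because Eq. (14) minimizes the loss, EI is written in its minimization form, where improvement means a reduction below the best loss found so far. All function names and the toy loss are hypothetical.

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel for scalar inputs (length scale assumed)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(xs, ys, noise=1e-4):
    """Step 2: fit a Gaussian-process surrogate; returns x -> (mean, std)."""
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, [y - mean_y for y in ys]))

    def predict(x_new):
        k_star = [rbf(a, x_new) for a in xs]
        mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
        v = forward_sub(L, k_star)
        var = max(1.0 - sum(t * t for t in v), 1e-12)
        return mu, math.sqrt(var)

    return predict

def expected_improvement(mu, sigma, best):
    """Step 3: closed-form EI for minimization under a Gaussian posterior."""
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf

def bayes_opt(f, n_init=3, n_iter=12, n_candidates=50, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]   # step 1: initial samples
    ys = [f(x) for x in xs]
    for _ in range(n_iter):                      # step 4: fixed iteration budget
        predict = fit_gp(xs, ys)                 # step 2: surrogate model
        best = min(ys)
        candidates = [rng.random() for _ in range(n_candidates)]
        scores = [expected_improvement(*predict(c), best) for c in candidates]
        x_next = candidates[scores.index(max(scores))]  # step 3: max-EI point
        xs.append(x_next)
        ys.append(f(x_next))
    i = ys.index(min(ys))
    return xs[i], ys[i]

def loss(x):
    """Toy stand-in for a validation loss such as the MSE in Eq. (15)."""
    return (x - 0.3) ** 2

best_x, best_loss = bayes_opt(loss)
```

In a real tuning run, `loss` would train a model with the candidate hyperparameters and return its validation error, which is why keeping the number of evaluations small matters.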

The emergence of AutoML has considerably reduced the difficulty of machine learning modeling, making it more efficient and user-friendly. In this study, we selected six candidate models for predicting groundwater potential: Extra Trees (ET) (Geurts et al. 2006), Light Gradient Boosting Machine (LGBM) (Fan et al. 2019), L1-Regularized Logistic Regression (LRL1), Random Forest (RF) (Breiman 2001), Extreme Gradient Boosting (XGB) (Chen and Guestrin 2016), and XGB with limited depth (XGBLD). Apart from LRL1, which is a linear model, these are tree-based ensemble models, which can achieve better performance and more robust generalization than single models. We used FLAML (Wang et al. 2021), a Python AutoML framework, to automate model selection and hyperparameter tuning. This allowed us to select and optimize the most effective machine learning model for predicting groundwater potential under varying feature sets and sample sizes.
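A hedged sketch of how such a search might be set up with FLAML is given below. The estimator names are FLAML's built-in identifiers for the six candidate models. The classification task, the accuracy metric, and the 60-second time budget are illustrative assumptions, not settings reported in this study, and `fit_automl` is a hypothetical helper name.

```python
# Paper abbreviation -> FLAML built-in estimator name.
ESTIMATORS = {
    "ET": "extra_tree",
    "LGBM": "lgbm",
    "LRL1": "lrl1",
    "RF": "rf",
    "XGB": "xgboost",
    "XGBLD": "xgb_limitdepth",
}

def fit_automl(X_train, y_train, time_budget=60):
    """Search the six candidate learners with FLAML and return the fitted object.

    The task, metric, and time budget (seconds) are assumptions made for
    illustration; they are not taken from the study.
    """
    # Imported lazily so the mapping above can be used without FLAML installed.
    from flaml import AutoML

    automl = AutoML()
    automl.fit(
        X_train=X_train,
        y_train=y_train,
        task="classification",
        metric="accuracy",
        estimator_list=list(ESTIMATORS.values()),
        time_budget=time_budget,
    )
    return automl
```

After fitting, `automl.best_estimator` and `automl.best_config` report the selected learner and its hyperparameters, and `automl.predict(X_test)` produces predictions for held-out samples.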

Results and discussion

Factor correlation and importance

The study used Pearson’s correlation coefficient to determine the correlations between potential groundwater influencing factors (Chen et al. 2018) in Qinghai Province (Fig. 8). For curvature, the null hypothesis of no correlation could not be rejected with respect to vegetation types (p = 0.9986), soil (0.1684), lithology (0.1374), residential density (0.9214), fault density (0.7576), distance to roads (0.8059), and distance to rivers (0.9462), indicating that curvature has no statistically significant correlation with these factors. Apart from these and the pairs distance to roads and TWI (0.6287) and slope and soil (0.7254), the p-values of all other factor pairs were less than 0.01, indicating a statistically significant correlation between most factors. Furthermore, a significant negative correlation exists between evaporation and landform, reflected by a correlation coefficient of − 0.8. This can be explained by altitude: Qinghai Province belongs to the high-altitude Qinghai Plateau, whereas the Qaidam Basin in the northwest of the province has a lower altitude and a higher annual evaporation of over 3000 mm (Fig. 2(b)). The correlation coefficient between precipitation and NDVI is 0.8, which can be attributed to the scarce precipitation in the plateau desert climate of the study area, where increased precipitation leads to more vegetation growth and therefore a higher NDVI value. Moreover, although some factors, such as slope, slope aspect, and TWI, were all extracted from DEM data, their mutual linear correlations were relatively low, with correlation coefficients between − 0.5 and 0.5. In conclusion, the generally low Pearson correlation coefficients suggest only weak linear correlation among the factors. Each factor therefore shows a high degree of independence, which allows it to contribute effectively to predicting the groundwater potential of the study area.
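The distinction drawn above between a statistically significant correlation (small p-value) and a strong correlation (large coefficient) can be illustrated with a short sketch. It computes Pearson's r and an approximate two-sided p-value using the Fisher z transformation, which is swapped in here for the exact t-test so that only the standard library is needed. The function names and the toy series are hypothetical and unrelated to the study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation (Fisher z).

    This is a large-sample approximation, not the exact t-test.
    """
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy series (made-up): one nearly linear in x, one with a repeating pattern.
x = list(range(30))
strong = [2 * v + (v % 3) for v in x]
weak = [(7 * v) % 5 for v in x]

r_strong = pearson_r(x, strong)
p_strong = approx_p_value(r_strong, len(x))
r_weak = pearson_r(x, weak)
p_weak = approx_p_value(r_weak, len(x))
```

Here the first pair rejects the null hypothesis of no correlation at the 0.01 level, while the second does not at the 0.05 level. With large samples, even a small coefficient can yield a small p-value, which is consistent with factors being statistically correlated yet only weakly so.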

Fig. 8 Heatmap of factor correlations

Figure 9 displays the weights of all factors calculated with both the CRITIC and EWM methods. The EWM approach ranked the weights of the 16 factors from largest to smallest as follows: evapotranspiration (0.390), landform (0.325), curvature (0.107), soil (0.051), lithology (0.041), fault density (0.021), distance to rivers (0.021), NDVI (0.011), precipitation (0.010), distance to roads (0.006), residential density (0.005), slope (0.004), land cover (0.004), TWI (0.002), slope aspect (0.002), and vegetation types (0.002). The CRITIC method ranked them in the following descending order: landform (0.472), evapotranspiration (0.248), precipitation (0.102), slope aspect (0.064), distance to roads (0.050), distance to rivers (0.027), soil (0.014), land cover (0.009), lithology (0.007), vegetation types (0.003), slope (0.002), TWI (0.002), NDVI (< 0.001), curvature (< 0.001), fault density (< 0.001), and residential density (< 0.001). Because the EWM weights each indicator by its information entropy alone, whereas CRITIC also accounts for the correlations between indicators, the weights assigned to some indicators differ. Nevertheless, both methods show that landform and evapotranspiration are critical factors in groundwater enrichment in Qinghai. Curvature (0.107 under the EWM versus < 0.001 under CRITIC) and precipitation (0.010 versus 0.102) are among the factors that differ most between the two methods. For each method, the eight factors with the highest weights were selected as training features for machine learning.
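For comparison with the CRITIC weights, the entropy weighting idea can be sketched as follows. The sketch assumes the commonly used form of the EWM (column-wise proportions, Shannon entropy scaled by ln n, weights proportional to 1 − entropy) and non-negative, already-normalized inputs. The function name and toy data are hypothetical.

```python
import math

def entropy_weights(rows):
    """Entropy weight method for non-negative indicator columns.

    rows: list of n samples, each a list of m indicator values.
    """
    n, m = len(rows), len(rows[0])
    divergences = []
    for j in range(m):
        col = [row[j] for row in rows]
        total = sum(col)
        p = [v / total for v in col]
        # Shannon entropy scaled to [0, 1]; terms with p = 0 contribute 0.
        e = -sum(q * math.log(q) for q in p if q > 0) / math.log(n)
        divergences.append(1.0 - e)
    s = sum(divergences)
    return [d / s for d in divergences]

# Toy data (made-up): the first column is almost uniform across samples,
# the second is concentrated in a single sample.
data = [
    [0.9, 0.1],
    [1.0, 0.1],
    [1.1, 0.1],
    [1.0, 3.0],
]
w = entropy_weights(data)
```

The nearly uniform column receives a much smaller weight than the concentrated one, because an indicator that barely varies across samples carries little discriminating information. Unlike CRITIC, this calculation never looks at the correlation between columns.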

Fig. 9 Factor weights calculated using the EWM and CRITIC

Influence of factor selection and sample quantity on groundwater potential prediction

Using AutoML, we trained machine learning models on 200 samples under four distinct factor selection modes (ALL, PCA, Entropy, and CRITIC). The models selected by AutoML for these four modes were XGB, XGB, RF, and ET, respectively. Figure 10(a) and (b) show the scores of the four prediction models on the 1500 test samples. The accuracy scores were 0.783, 0.685, 0.745, and 0.703, respectively, with areas under the curve (AUC) of 0.819, 0.724, 0.779, and 0.747. For a more comprehensive performance assessment, Table 1 provides detailed metrics, namely accuracy, precision, AUC, recall, and F1 score. Notably, apart from precision, XGB-ALL outperforms the other models, followed by RF-Entropy, ET-CRITIC, and XGB-PCA.
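For readers unfamiliar with these criteria, the following sketch shows how accuracy, precision, recall, F1 score, and AUC can be computed from binary labels and predicted scores. The 0.5 decision threshold, the function names, and the eight toy samples are assumptions for illustration and do not reproduce the values in Table 1.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, precision, recall, and F1 at a fixed decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted scores (made-up).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

metrics = classification_metrics(y_true, scores)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model can lead on most criteria without leading on precision.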

Fig. 10 a Accuracy and AUC values of the models. b ROC curves of the models. c Density distribution of groundwater potential in the study area

Table 1 The performance of the four models under various assessment criteria

The ALL factor selection mode yielded the highest accuracy, which suggests that all of the chosen factors carry useful information: removing some factors reduced the model’s accuracy. Furthermore, because the prediction models were mainly tree-based ensembles rather than linear models, our predictions were less exposed to multicollinearity. Linear models often suffer from multicollinearity when their input features are linearly related to each other. In contrast, each node in a tree splits the data on a single optimal feature, which limits the influence of the relationships between features (Paul et al. 2018). Additionally, in ensemble models such as RF or XGB, the features used in each tree are diverse, further mitigating collinearity issues. Consequently, a tree-based ensemble model trained with an appropriate number of samples can achieve higher accuracy in prediction tasks involving many features.

XGB-PCA shows the lowest accuracy in predicting groundwater potential. This is because the correlations among the factors are weak, especially after one-hot encoding, where the encoded categories are largely uncorrelated with one another. PCA works well when variables are strongly correlated and loses more information when the correlations are weak. Therefore, when the 16 factors are reduced to eight dimensions, the predictions of XGB-PCA become less accurate. RF-Entropy and ET-CRITIC achieve intermediate performance in predicting groundwater potential in Qinghai Province. These methods use their calculated weights to retain only the eight factors with the largest weights for training. Notably, RF-Entropy is more accurate than ET-CRITIC, which suggests that information entropy is the better weight-assignment technique for selecting factors that influence groundwater potential in Qinghai Province.
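The information-loss argument can be made concrete with an idealized illustration, which is an assumption for exposition and not a computation from the study data. Let the p input features be standardized, with a correlation matrix whose eigenvalues are λ1 ≥ … ≥ λp (these sum to p). The share of total variance retained by the first k principal components, denoted here by ηk, is

$$\eta_{k}=\frac{\sum_{i=1}^{k}\lambda_{i}}{\sum_{i=1}^{p}\lambda_{i}}=\frac{1}{p}\sum_{i=1}^{k}\lambda_{i}$$

If the features were completely uncorrelated, the correlation matrix would be the identity and every eigenvalue would equal 1, so that ηk = k/p. With p = 16 and k = 8, this limiting case retains only 8/16 = 50% of the variance. With strongly correlated features, by contrast, the leading eigenvalues dominate and ηk approaches 1 even for small k. One-hot encoding increases the number of input columns beyond 16, so the share actually retained in this study would differ from this limiting value.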

The spatial distribution of groundwater potential in Qinghai Province under the four factor selection modes is shown in Fig. 11, and its density distribution is shown in Fig. 10(c). The prediction results were classified into five categories using the natural breaks method, namely very low, low, moderate, high, and very high, to differentiate the levels of groundwater potential. The values of groundwater potential in the region are generally higher than 0.5, and the areas with high and very high potential are primarily concentrated in the southwest and southeast of the study area, which are the source regions of three major rivers in China. Conversely, the low groundwater potential areas are concentrated in the northwestern part of the study area, an arid region with major salt lake industries (Wang et al. 2022). Although the overall spatial distribution of the four models is similar, the density distribution of groundwater potential values varies. The predictions of ET-CRITIC are concentrated around 0.7, while those of XGB-PCA lie between 0.9 and 1.0. In comparison, the results of the RF-Entropy and XGB-ALL models are similar to each other (around 0.8). This suggests that using the Entropy method to screen factors with higher weights can achieve effective dimensionality reduction with little loss of accuracy.

Fig. 11 Spatial distribution of groundwater potential: a XGB-ALL, b XGB-PCA, c RF-Entropy, and d ET-CRITIC

Besides feature selection, the number of samples significantly affects the model results. Figure 12 depicts the performance of the model in predicting groundwater potential in Qinghai Province for varying sample sizes. As the number of samples increases under the AutoML training framework, the model’s accuracy shows an upward trend with fluctuations. As the sample size increased from 50 to 200, the accuracy fluctuated around 0.7 without increasing significantly. Accuracy and AUC improved once the number of samples exceeded 200, with accuracy reaching approximately 0.9 after training with 600 samples and then gradually stabilizing. In practice, the number of available samples is often limited. Therefore, to attain a model accuracy and AUC of 0.75 or above, the assessment of regional groundwater potential should include a minimum of 200 samples. For a more detailed characterization of regional groundwater potential, the sample size should exceed 600.

Fig. 12 Comparison of model performance with different sample sizes

Because AutoML was re-run each time the number of samples changed, different model types were selected on different runs. Among the six candidate models, XGB was selected most often (44 times), followed by ET and RF (12 times). The LGBM, LRL1, and XGBLD models were selected no more than five times. Thus, when predicting groundwater potential, priority can be given to the XGB model.

In conclusion, the primary hurdle in accurately assessing the groundwater potential of a specific area is not the performance of machine learning algorithms, but rather the scarcity of available samples. While various machine learning models, particularly well-established ensemble learning models, may yield different results across research areas and feature sets, these differences are typically minor. On the other hand, collecting a sufficient number of samples in a vast research area is a daunting task that requires significant time and financial resources. This challenge conflicts with our initial intention in using machine learning algorithms, which is to achieve the most accurate predictions at the lowest cost. Furthermore, a lack of samples limits the number of features that can be incorporated into the machine learning evaluation, because a high-dimensional feature space with few samples suffers from the curse of dimensionality. Therefore, the primary challenge in predicting groundwater potential lies in balancing sample quantity and feature selection to achieve the most accurate results.

In this study, we used AutoML to streamline machine learning model selection and hyperparameter tuning, allowing us to concentrate on the samples and features themselves. Groundwater potential was predicted under various sample sizes and feature selection methods using AutoML, thereby minimizing potential biases arising from manual model and hyperparameter choices. When the sample size permits, the groundwater potential prediction model should include as many factors as possible to enhance accuracy. However, this approach depends on using a tree-based ensemble model, such as XGBoost: excessive multicollinearity among numerous factors may degrade predictive performance, and tree models can alleviate this effect. In situations with limited sample sizes, it is advisable to limit the number of input features in the machine learning model. We observed that, among the dimensionality reduction approaches tested, using the entropy method to evaluate the importance of all factors and selecting those with high weights for training gave the most accurate groundwater potential predictions. Although principal component analysis (PCA) reduces the number of factors and removes linear correlations between them, it may lose information when the correlations are weak, ultimately leading to less accurate predictions. Moreover, while increasing the sample size improves the accuracy of groundwater potential prediction, this improvement tends to plateau beyond a certain scale. Given limited sample availability, we recommend a minimum of 200 samples for the research area of Qinghai Province to achieve an accuracy of about 0.75. For higher accuracy requirements, approximately 600 samples are needed.

This research outcome presents a fresh perspective on how to approach the issue of groundwater potential prediction and offers a novel method for tackling problems related to limited sample quantity and feature selection. We believe that this will contribute to the advancement of groundwater potential prediction for future research. Nevertheless, due to variations in geological, geographical, and human activity conditions across different regions, it is essential to use the methods detailed in this study to reevaluate and determine the optimal sample size and feature selection approach when assessing groundwater potential, rather than directly applying the recommended values from this research.

Conclusions

This study leveraged AutoML technology to predict groundwater potential in Qinghai Province, with a particular focus on analyzing the influence of feature selection and sample size on the predictions. The models were trained using 50 to 800 samples, while an additional 1500 samples were used for model evaluation. Sixteen groundwater conditioning factors in Qinghai Province were classified into categorical and numerical variables based on feature types. Categorical variables underwent one-hot encoding to prevent the model from being misled by a spurious quantitative relationship between integer class codes. Four different feature selection modes, namely ALL, PCA, Entropy, and CRITIC, were employed to train the model. Upon training completion, the entire research area was discretized into 699,016 points, which were fed into the trained model. The output results were subsequently transformed into maps of groundwater potential in Qinghai Province. The study revealed that despite the general statistical correlation among the 16 groundwater conditioning factors, their Pearson correlation coefficients were low. This implies that when using a tree model to predict groundwater potential, a larger number of features can be used as long as there are sufficient samples, thereby enhancing the accuracy of the model. Due to the weak linear correlation between factors, the PCA method struggled to reduce model dimensionality effectively, negatively impacting prediction performance. Conversely, using the Entropy method to screen factors with higher weights ensured better accuracy while also reducing dimensionality, thus circumventing the potential curse of dimensionality. Results from model training revealed that as the number of samples increases, so do the accuracy and AUC of the groundwater potential prediction model. Training with 8 factors and 200 samples resulted in an accuracy of 0.745, sufficient for evaluating regional groundwater potential. On the other hand, training with 600 samples led to a model accuracy of about 0.9, thus realizing accurate prediction of groundwater potential. In summary, when dealing with small sample sizes and low degrees of linear correlation between factors, we recommend using the Entropy method to screen factors with higher weights based on sample size and employing the XGB model for groundwater potential prediction. This study provides both theoretical and practical support for decision-makers dealing with groundwater resource management in Qinghai Province. The findings underscore the importance of feature selection and sample size in machine learning models for groundwater prediction. Furthermore, the model and methodology developed in this research can also be applied to predicting groundwater potential in other regions.