
25.1 Introduction

Groundwater is the primary source of water for about two billion people and accounts for 33% of total water withdrawal worldwide (Famiglietti 2014). It is a crucial freshwater resource for domestic use, industrial development, and irrigation (Mohanty and Rao 2019; Behera et al. 2019; Kaur et al. 2020). However, groundwater resources are highly vulnerable to human activities (Ma et al. 2019a; Brouwer et al. 2018; Graaf et al. 2019) and natural variation (Kagabu et al. 2020; Giambastiani et al. 2018), especially in coastal regions, which face groundwater overexploitation, seawater intrusion, climate change, and sea-level rise (Ferguson and Gleeson 2012). In such regions, groundwater salinity is likely to increase due to paleo-seawater intrusion (Delsman et al. 2014), modern seawater intrusion (Han and Currell 2018), and brines leaking from oil fields and irrigation activities (Paine 2003). High salt concentrations in groundwater may cause various environmental and health problems. For example, high salinity in irrigation water may cause physiological drought and reduce crop yield (Nishanthiny et al. 2010). High salt levels in drinking water increase the risk of hypertension (Vineis et al. 2011), coronary heart disease (Park and Kwock 2015), and chronic kidney disease. Therefore, assessing groundwater quality, especially the salinization level, is crucial to protect the environment and human health (Melloul and Goldenberg 1997; Guhl et al. 2006; Gallardo and Marui 2007; Carretero et al. 2013; Larsen et al. 2017).

For the last several decades, mathematical models have been widely used to predict groundwater dynamics and seawater intrusion into coastal aquifers (Lal and Datta 2019; Abdelhamid et al. 2016; Mahmoodzadeh and Karamouz 2019; Stein et al. 2019; Voss and Souza 1987). However, mathematical groundwater modelling requires expert knowledge of the physical characteristics of the hydrogeological system and its governing processes, as well as various types of input data (e.g., topography, soil properties, geology, initial and boundary conditions, and hydrological and climate data), and the accuracy of the model simulation depends on reliable input parameters (Lal and Datta 2019; Kim and Yang 2018). In contrast, machine learning is a data-driven approach that requires little knowledge of the physical processes and can still provide accurate predictions (Sun et al. 2016; Yadav et al. 2018). Therefore, machine learning methods have been considered as an alternative, e.g., genetic algorithms (Sreekanth and Datta 2010), artificial neural networks (Banerjee et al. 2011), multi-objective optimization (Javadi et al. 2015), multivariate adaptive regression splines (Roy Dilip and Datta 2017), support vector regression (Lal and Datta 2019; Isazadeh et al. 2017; Nadiri et al. 2018), ensemble multiadaptive boosting logistic regression (Rizeei et al. 2019), Gaussian process regression (Yadav et al. 2018; Kopsiaftis et al. 2019), and hybrid computational intelligence models (Pham et al. 2019a; Chen et al. 2019). A common conclusion from the above works is that machine learning is a highly flexible tool that can handle complex non-linear relationships between groundwater salinity and its influencing factors (Naghibi et al. 2015; Ransom et al. 2017; Sajedi-Hosseini et al. 2018). Nonetheless, no studies have identified the most important factors influencing groundwater salinity in coastal areas, while rapid developments in computer science have introduced more advanced methods.

In spite of the many advantages of applying machine learning to environmental prediction, this approach has some limitations, such as a lack of good data, deterministic problems, and misapplication. In particular, the predictions are based mainly on statistical relationships rather than on a direct representation of physical processes as in numerical models; an in-depth understanding of the relationship between the target variable and the independent variables is therefore required to improve the reliability and accuracy of ML models. In this research, we therefore propose and validate a new artificial intelligence approach based on Extreme Gradient Boosting (XGB) and Genetic Optimization (GO), named GO-XGB, for predicting groundwater salinity in the coastal aquifers of the Mekong River Delta (Vietnam). To the best of our knowledge, this is the first time that GO-XGB has been considered for groundwater salinity modelling. We also compare the performance of our model with that of traditional models, namely random forests and Gaussian processes, to understand whether this approach adds value to the field of groundwater salinity prediction. In addition, the role of various influencing factors in aquifer salinization is assessed. The proposed models were tested using groundwater salinity data and its controlling factors in the multi-aquifer system of the Mekong Delta, Vietnam.

25.2 Background of the Machine Learning Algorithms Used

In this section, we first review two traditional machine learning models that have already been applied to predict groundwater salinity, namely random forests and Gaussian processes. We then introduce the combination of Extreme Gradient Boosting and Genetic Optimization that forms a new hybrid algorithm. The performance of the two traditional models is then used as a benchmark to assess our model.

25.2.1 Gaussian Processes

Gaussian processes (GP) are a supervised learning method for both regression and classification problems (Kopsiaftis et al. 2019; Rasmussen et al. 2003; Hall et al. 2012; Azimi et al. 2018). The principal idea of Gaussian processes is that every point in the input space x = [x1, …, xn]T is associated with a random variable, so that their joint distribution can be modelled as a multivariate Gaussian and a function f can be modelled using an infinite-dimensional multivariate Gaussian distribution (Ma et al. 2019b). Consider a salinity dataset M = {(Xi, yi), i = 1, 2, …, n}, where Xi is a vector of m input variables for the i-th of n observations and yi ∈ R is the output variable (Cl concentration in groundwater). A GP regression model formulates the relation between the input and output variables by the following equation (Rasmussen et al. 2003; Hoa et al. 2019):

$$ y \left( x \right) = \mathop \sum \limits_{i = 1}^{n} \alpha_{i} K\left( {X_{i} ,X} \right) $$
(25.1)

where αi is the weight and K is the Radial Basis kernel function (RBF) (Eq. 25.2) (Park and Sandberg 1991; Scholkopf et al. 1997).

$$ K\left( {X_{i} ,X} \right) = \beta \times \exp \left( { - \sum\limits_{k = 1}^{m} {\frac{{\left( {X_{i}^{k} - X^{k} } \right)^{2} }}{{2\sigma^{2} }}} } \right) $$
(25.2)

where β is the scaling factor, σ is the kernel parameter, and the superscript k indexes the m input variables.

The performance of the GP model depends on the parameter β and the weights αi, which can be automatically tuned and optimized by maximizing the marginal likelihood (Rasmussen et al. 2003).
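As an illustration of Eqs. (25.1) and (25.2), the following minimal Python sketch evaluates the RBF kernel as a scaled exponential of the squared distance between two input vectors and forms the weighted kernel sum. The function names, the toy inputs, and the weights are hypothetical and chosen only for illustration; in practice the weights and kernel parameters would be obtained by maximizing the marginal likelihood, which is not shown here.

```python
import math


def rbf_kernel(x_i, x, beta=1.0, sigma=1.0):
    """RBF kernel: beta * exp(-sum_k (x_i[k] - x[k])^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return beta * math.exp(-sq_dist / (2.0 * sigma ** 2))


def gp_predict(x, train_inputs, alphas, beta=1.0, sigma=1.0):
    """Weighted kernel sum of Eq. (25.1): y(x) = sum_i alpha_i * K(X_i, x)."""
    return sum(alpha * rbf_kernel(x_i, x, beta, sigma)
               for alpha, x_i in zip(alphas, train_inputs))


# Hypothetical normalized inputs and weights (not taken from the study data).
train_inputs = [[0.0, 0.0], [1.0, 1.0]]
alphas = [2.0, -1.0]
y_at_first_point = gp_predict([0.0, 0.0], train_inputs, alphas)
```

The kernel equals β when the two inputs coincide and decays towards zero as they move apart, so each training point mainly influences predictions in its own neighbourhood.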

25.2.2 Random Forests

A random forest (RF) is a method for both classification and regression based on an ensemble of decision trees (Breiman 2001). A decision tree is a top-down tree-like structure in which each non-leaf node is a test, each branch is an outcome of the test, and each leaf node is a decision. Regression with a single decision tree may lead to overfitting (high variance) and depends on the distribution of the training set. A large number of decorrelated decision trees can form a random forest, which can then reduce the variance and boost model performance (Criminisi 2011). The procedure for developing RFs is as follows: (1) n random subsets (called “bootstrapped subsets”) are sampled from the training dataset, together with a random selection of its features; a subset may share data with other subsets; (2) n decision trees are built using these n bootstrapped subsets (Fig. 25.1). The number of trees n is decided using either cross-validation or the out-of-bag (OOB) error. A detailed description of the statistical formulation of RF can be found in Breiman (2001).

Fig. 25.1

Example of the partitions (left) and the classification tree structure (right) with two classes coloured in green and red

25.2.3 Extreme Gradient Boosting

Similar to the random forest, Extreme Gradient Boosting (XGB) is an ensemble machine learning algorithm based on decision trees (weak learners) (Friedman 2001). However, a boosting model constructs the “forest” of decision trees sequentially: each decision tree is built on the learning experience inherited from the previous trees (Chen and Guestrin 2016; Johnson et al. 2018). The second tree focuses on the cases for which the first tree gives a poor prediction, and this learning process is repeated many times so that the combination of trees better captures the relationship between predictands and predictors. Gradient boosting is a form of boosting in which poorly predicted cases are assessed by whether they contribute to minimizing the overall loss function (also called the prediction error) (Lim and Chi 2019). A case is considered highly valuable if the additional decision tree built for it reduces the prediction error significantly, whereas no change in the error implies a case of no value; thus, only useful decision trees are kept. This may give XGB models an advantage in complex problems such as quantifying saline concentrations in groundwater, since measurements in the underground environment may contain many special cases. It is also worth noting that the learning efficiency of any machine learning algorithm is controlled by its model parameters; for the XGB model, these fall into three groups: tree-specific, boosting, and miscellaneous parameters. Selecting these parameters is a challenging task that depends on user experience and does not always yield an optimal parameter set. Thus, we propose to use a genetic algorithm to search the parameter space automatically and improve the accuracy of the numerical forecasts.
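The sequential idea described above can be sketched in a few lines of Python. The sketch below is plain gradient boosting with a squared-error loss and one-split trees (stumps) on a single predictor; it is not the XGB algorithm itself, which additionally uses regularization and second-order information. The function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single predictor."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Each new stump is fitted to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data: one predictor and a chloride-like response.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chloride = [5.0, 7.0, 6.0, 40.0, 42.0, 41.0]
mean_only = boost(predictor, chloride, n_trees=0)
boosted = boost(predictor, chloride, n_trees=20)
```

Comparing `mean_squared_error(boosted, predictor, chloride)` with the value for `mean_only` shows how the training error shrinks as trees are added; the learning rate controls how much of each tree's correction is accepted.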

25.2.4 Genetic Algorithm

A genetic algorithm (GA) takes its idea from the Darwinian theory of natural selection and evolves solutions by using computing capacity to tune model parameters as an alternative to manual effort (Forrest 1993). The most crucial concept of the GA is the chromosome, which consists of the model parameters that define a solution (called an individual) (Jennings et al. 2019). A certain number of individuals then form a population. At the lower level, each chromosome consists of genes, which are often encoded as 0s and 1s or as real values in the unit interval (X ≡ (x1, x2, …, xn), xk ∈ [0.0, 1.0] ∀ k). Each individual is evaluated by its fitness value, the result of a fitness function.

The basic operations performed during training of the XGB-based model are as follows: (1) a number of individuals are initialized to form a population; (2) the individuals with the best fitness values are selected to generate a mating pool; (3) parents are selected from the mating pool by either sequential or random selection; and (4) crossover and mutation operators are applied to each pair of parents to generate their offspring. This process retains high-quality individuals to create further individuals, so that the solutions evolve towards the desired one.
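The four steps can be sketched as a short Python routine. This is a minimal illustration under stated assumptions, not the implementation used in the study: genes are real values in [0, 1], the fitter half of the population forms the mating pool, parents are drawn at random, and one-point crossover with per-gene mutation is used. The fitness function `toy_fitness` is a hypothetical stand-in; in a GO-XGB setting it would be replaced by a measure of model skill for the hyperparameters encoded in the chromosome.

```python
import random


def evolve(fitness, n_genes, pop_size=30, n_generations=40,
           mutation_rate=0.1, seed=0):
    """Maximize `fitness` over chromosomes whose genes lie in [0, 1]."""
    rng = random.Random(seed)
    # (1) Initialize a population of random individuals.
    population = [[rng.random() for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(n_generations):
        # (2) Individuals with the best fitness form the mating pool.
        population.sort(key=fitness, reverse=True)
        mating_pool = population[: pop_size // 2]
        offspring = []
        while len(mating_pool) + len(offspring) < pop_size:
            # (3) Random selection of two distinct parents from the pool.
            mother, father = rng.sample(mating_pool, 2)
            # (4) One-point crossover followed by per-gene mutation.
            cut = rng.randint(1, n_genes - 1) if n_genes > 1 else 0
            child = mother[:cut] + father[cut:]
            child = [rng.random() if rng.random() < mutation_rate else gene
                     for gene in child]
            offspring.append(child)
        # The mating pool survives unchanged, so the best fitness never drops.
        population = mating_pool + offspring
    return max(population, key=fitness)


def toy_fitness(chromosome):
    """Hypothetical fitness: highest when every gene equals 0.5."""
    return -sum((gene - 0.5) ** 2 for gene in chromosome)


best = evolve(toy_fitness, n_genes=3)
```

Keeping the mating pool in the next generation (elitism) is a design choice of this sketch; it guarantees that the best solution found so far is never lost.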

25.3 Study Area and Data

25.3.1 Description of the Study Area

The study area, Soc Trang province, lies in the coastal area of the Mekong River Delta. It covers 3,312 km² with an elevation ranging from 0.5 to 2.5 m above mean sea level (Fig. 25.2). The province is bordered by the Hau River (one of the main branches of the Mekong River) to the Northwest and the Vietnamese East Sea (South China Sea) to the Southwest. Since this area has a dense river system connected to the sea, its hydrological regime is complex and strongly influenced by the flow regime of the Mekong River and by tidal fluctuation.

Fig. 25.2

Location of the study area (Soc Trang province) in the Vietnamese Mekong River Delta

The study area is in a tropical monsoon climate region with two distinct seasons: the rainy season from May to November and the dry season from December to April of the following year. The annual average rainfall is about 1772 mm, with substantial seasonal variation; about 85% of the annual rainfall occurs during the rainy season. The study area has been recognized as one of the regions most vulnerable to climate change and sea-level rise in the world.

Soc Trang province has around 1.20 million people, the majority of whom depend on agriculture for their livelihoods; agriculture contributes 42% of the total GDP of the province (Hoang et al. 2019). Agricultural land is dominant, accounting for 84.77% of the total area (276,690 ha), and includes rice fields (52.98%), fishponds (19.69%), orchards (15.51%), land under other vegetable types (6.75%), and other types of land use (Decision No. 108/NQ-CP of the Government 2018).

In the study area, groundwater is the dominant source of water for domestic, industrial, and agricultural activities, resulting in rapid groundwater level depletion in the irrigated areas (Hoang and Bäumle 2018; Minderhoud et al. 2017). Groundwater salinization has been identified as one of the significant threats to the groundwater resource in this region (An et al. 2018). The extent of groundwater salinization in the study area has recently increased due to the rapid increase in groundwater demand (Minderhoud et al. 2017; Nam et al. 2019).

The hydrogeological setting of the study area is characterized by a multi-layered aquifer system formed between the Miocene and Holocene epochs (Wagner et al. 2012; Hung et al. 2019). Groundwater in the Pleistocene aquifers is the primary source of drinking water because these aquifers have high yields and good-quality water compared to the other aquifers (An et al. 2018). In this study, we focus on assessing the vulnerability and risk of groundwater in the Pleistocene aquifers to salinity.

25.3.2 Data Preparation and Variables Selection

In this research, 215 groundwater samples from the Pleistocene aquifers were collected between 2013 and 2018 during both the rainy and dry seasons. On-site measurements were conducted to obtain physical parameters such as groundwater temperature T (°C), pH, dissolved oxygen DO, and electrical conductivity EC using the HANNA portable instruments (Hanna Instruments Inc. 2015). The chloride concentration in groundwater samples was analyzed using Ion Liquid Chromatography (Shimadzu Co. Ltd., Japan) at the University of Tsukuba, Japan.

The accumulation of salinity in groundwater is a complex process because it is controlled by many influencing factors (Mahlknecht et al. 2017; Kanagaraj et al. 2018). The selection of influencing factors for groundwater salinization prediction is based on the possible pathways of saltwater migration into the aquifers. In the Pleistocene aquifers, groundwater salinity originates from (1) downward or upward leakage of paleo-saline water (Khaska et al. 2013; Chatton et al. 2016), (2) halite dissolution in the topsoil layer (Walter et al. 2017; Blasco et al. 2019), (3) seawater intrusion (Han and Currell 2018; Kanagaraj et al. 2018; Werner et al. 2013), and (4) irrigation return flow (Essaid and Caldwell 2017; Lapworth et al. 2017; Malki et al. 2017; Tweed et al. 2018). The downward or upward leakage of paleo-saline water may relate to the formation of the aquifers, which is incorporated into the lithology influencing factor. Furthermore, the thickness of aquitards, distance to hydraulic windows, distance to faults, fault density, and vertical hydraulic conductivity could also affect the leakage rate (Elmahdy and Mohamed 2013; Liu et al. 2018). In addition, other geographical variables such as distance from main rivers, distance to drainage, and drainage density are widely considered as factors influencing groundwater salinity (Winkel et al. 2008). The halite dissolution process is characterized by salt rock/sediment properties, soil type, and horizontal and vertical hydraulic conductivity. The variables that represent the effect of human activities on groundwater salinity in the study area are groundwater level, extraction capacity, well density, extraction density, and operation time. The severity of seawater intrusion may also depend on the distance to the sea, groundwater level, well density, extraction capacity, extraction density, and horizontal hydraulic conductivity (Lee et al. 2016; Yechieli et al. 2019). The four processes mentioned above interact with each other and result in a complex salinization process in the study area (An et al. 2018). Based on this analysis, 20 influencing factors were selected for predicting the spatial distribution of salinity in groundwater (Table 25.1).

Table 25.1 Influencing factors for prediction of groundwater salinity using machine learning models

25.4 The Proposed Methodology for the Prediction of Groundwater Salinity in Coastal Aquifers with Artificial Intelligence Techniques

The modelling framework used in this study consists of the following steps: (1) data pre-processing, (2) feature selection, (3) model configuration, (4) model performance evaluation, and (5) data post-processing (Fig. 25.3).

Fig. 25.3

Methodological chart of the present study

25.4.1 Data Pre-processing

Prior to modelling, 215 groundwater samples from the middle and lower Pleistocene aquifers were selected; each sample consists of 20 variables (Table 25.1). The measured Cl concentration is assigned as the dependent variable, while the 20 influencing factors are assigned as independent variables. The dataset was then randomly split into training and testing datasets: 80% of the dataset was used for training and 20% for testing.

Since the influencing factors for predicting groundwater salinization have significantly different ranges, normalization was used to convert the values of numeric columns into a range from 0 to 1 using the following equation:

$$ X_{n} = \frac{{X_{i} - X_{\min } }}{{X_{\max } - X_{\min } }} $$
(25.3)

where Xn and Xi represent the normalized and raw training and testing data; Xmin and Xmax are the minimum and maximum of the training and testing data.
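The pre-processing described in this section (an 80/20 random split and min-max scaling of each numeric column to the range 0 to 1) can be sketched as follows. The helper names, the random seed, and the example column of values are hypothetical; the integer list merely stands in for the 215 samples.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into (training, testing) subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def min_max_normalize(values):
    """Scale one numeric column to [0, 1]: (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Stand-in identifiers for the 215 groundwater samples.
samples = list(range(215))
train, test = train_test_split(samples)

# Hypothetical column of one influencing factor (values for illustration only).
levels = [-12.0, -8.0, -4.0, 0.0]
normalized = min_max_normalize(levels)
```

With 215 samples this split yields 172 training and 43 testing samples, and the smallest and largest values of a column map to 0 and 1, respectively.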

25.4.2 Feature Selection

As many factors control groundwater salinization processes in coastal aquifers, the selection of influencing factors plays a vital role in reducing computation time and cost and in improving the accuracy of the predictions. For several decades, numerous variable selection methods, such as filters, wrappers, and embedded techniques, have been applied to identify significant variables before they are fed to machine learning algorithms to construct predictive models (Kohavi and John 1997; Guyon and Elisseeff 2006; Hira and Gillies 2015).

Recently, Random Forests (RF) and improved algorithms such as XGB have been widely used not only for prediction but also for selecting essential variables, as an embedded technique in predictive models (Rodriguez-Galiano et al. 2014; Zeng et al. 2018; Zhao et al. 2019). In this study, the RF algorithm is employed to select input parameters for predicting chloride concentrations in the middle and lower Pleistocene aquifers of the study area. The procedure consists of the following steps:

Step 1: Estimation of the out-of-bag mean squared error (MSE) of each tree, the baseline for the permutation-based MSE reduction, as in Eq. (25.4):

$$ MSE_{OOB}^{t} = \frac{1}{{n_{OOB} \left( t \right)}}\sum\limits_{i = 1}^{{n_{OOB} \left( t \right)}} {\left( {y_{i} - \hat{y}_{iOOB,t} } \right)^{2} } $$
(25.4)

where \(MSE_{OOB}^{t}\) is the mean squared error of decision tree t, \(n_{OOB}(t)\) is the number of out-of-bag (OOB) samples of tree t, yi is the measured Cl concentration in the i-th groundwater sample, and \(\hat{y}_{iOOB,t}\) is the Cl concentration of the i-th OOB sample predicted by decision tree t.

Step 2: Estimation of the MSE after permuting the input variable xj using the following equation:

$$ MSE_{OOB}^{t} \left[ {x_{j} \;{\text{permuted}}} \right] = \frac{1}{{n_{OOB} \left( t \right)}}\sum\limits_{i = 1}^{{n_{OOB} \left( t \right)}} {\left( {y_{i} - \hat{y}_{iOOB,t} \left[ {x_{j} \;{\text{permuted}}} \right]} \right)^{2} } $$
(25.5)

Step 3: Estimation of the variable importance score for variable xj using the following equation:

$$ VI\left( {x_{j} } \right) = \frac{1}{{T_{tree} }}\sum\limits_{t = 1}^{{T_{tree} }} {\left( {MSE_{OOB}^{t} \left[ {x_{j} \;{\text{permuted}}} \right] - MSE_{OOB}^{t} } \right)} $$
(25.6)

where \(T_{tree}\) is the number of trees in the forest.
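The three steps can be illustrated with a short Python sketch. It assumes that each tree is available as a prediction function together with its own out-of-bag rows and targets; this data layout, the function names, and the toy "tree" below are hypothetical and serve only to show the arithmetic of the permutation-based score.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of one tree on its out-of-bag samples (Step 1)."""
    return sum((y - predict(r)) ** 2
               for r, y in zip(rows, targets)) / len(targets)


def permutation_importance(trees, feature, seed=0):
    """Average increase in OOB MSE after shuffling one input column.

    `trees` is a list of (predict, oob_rows, oob_targets) tuples.
    """
    rng = random.Random(seed)
    total = 0.0
    for predict, rows, targets in trees:
        baseline = mse(predict, rows, targets)
        column = [r[feature] for r in rows]
        rng.shuffle(column)  # break the link between this feature and y
        permuted_rows = [r[:feature] + [v] + r[feature + 1:]
                         for r, v in zip(rows, column)]
        # Step 2 (permuted MSE) minus Step 1 (baseline MSE).
        total += mse(predict, permuted_rows, targets) - baseline
    # Step 3: average the difference over all trees.
    return total / len(trees)


def toy_tree(row):
    """Hypothetical 'tree' that uses only the first input variable."""
    return 10.0 * row[0]


rows = [[0.1, 5.0], [0.4, 3.0], [0.5, 9.0], [0.8, 1.0], [0.9, 7.0]]
targets = [toy_tree(r) for r in rows]
trees = [(toy_tree, rows, targets)]

vi_used = permutation_importance(trees, feature=0)
vi_unused = permutation_importance(trees, feature=1)
```

A variable that the model ignores, such as the second column here, receives a score of zero, while permuting a variable the model relies on can only increase the error in this toy setting.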

25.4.3 Model Configuration and Training

The configuration and training of the three machine learning models are conducted using the training dataset (80% of the measured data). For the RF model, the forest is built from 1000 trees with a maximum of 4 nodes per tree and a maximal tree depth of 17. For the GP model, the radial basis function (RBF) kernel with gamma = 0.014 is chosen to predict chloride concentrations in groundwater. In the GO-XGB model, each XGB prediction rule is trained with tenfold cross-validation to identify the number of trees (ntree) that minimizes the objective function. The prediction rule is fine-tuned by identifying the combination of hyperparameters that further minimizes the objective function for each area. The hyperparameters include the number of base learners (n_estimators), the maximum depth of each tree (max_depth), the learning rate (learning_rate, also called eta), the minimum number of observations required in each leaf node of the tree (min_child_weight), the minimum loss reduction required to further partition a leaf node of a tree (gamma), the L1 regularization term (reg_alpha), the proportion of the observed data used by the XGB algorithm to grow each tree (subsample), and the proportion of predictor variables used when building each tree (colsample_bytree). An improper setting of n_estimators will result in model failure. The max_depth parameter was selected to limit model complexity. It is crucial in controlling under- and overfitting: values that are too small cause underfitting, while values that are too large result in overfitting. The learning_rate represents the weight-reduction factor of each base learner. The min_child_weight parameter represents the minimum weight of the samples in a leaf node and is used to improve the generalization of the model. The value of gamma ranges from 0 to infinity and represents the minimum loss reduction required to make a further partition on a leaf node of the tree; it thus controls the drop in the model loss function required when a node splits. The subsample parameter controls the proportion of random sampling for each tree, typically between 0.5 and 1. The regularization parameter alpha (reg_alpha) denotes the L1 regularization term on the weights, which is used to reduce the complexity of the model.

In this study, the XGB algorithm is used to construct the model, and its parameters are optimized with the GA. The detailed framework is described in Fig. 25.3. The main parameters of the XGB algorithm that need to be optimized are max_depth, learning_rate, min_child_weight, subsample, alpha, and gamma. After tuning with the genetic optimization function, the best values found were max_depth = 15, learning_rate = 0.153, min_child_weight = 1, subsample = 1, alpha = 0.005, and gamma = 0.0015. In addition, n_estimators = 1200 and colsample_bytree = 0.635 were selected. The decision rule was then retrained with the optimal hyperparameter values and number of trees and applied to the withheld testing data to predict chloride concentrations and evaluate its accuracy. The variable importance of each predictor variable was also obtained using the XGB algorithm.
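For reference, the reported configuration can be collected in a single Python dictionary. The key names below follow the scikit-learn style interface of the xgboost package, which is an assumption, since the text does not state which interface was used; alpha is written as `reg_alpha`, and the number of trees is read as 1200. The small check function is a hypothetical helper, not part of any library.

```python
# Hyperparameter values as reported in the text; key names are assumed to
# follow the scikit-learn style wrapper of the xgboost package.
go_xgb_params = {
    "n_estimators": 1200,
    "max_depth": 15,
    "learning_rate": 0.153,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 0.635,
    "reg_alpha": 0.005,
    "gamma": 0.0015,
}


def check_params(params):
    """Hypothetical sanity check on the admissible ranges of the settings."""
    return (params["n_estimators"] > 0
            and params["max_depth"] > 0
            and 0.0 < params["learning_rate"] <= 1.0
            and 0.0 < params["subsample"] <= 1.0
            and 0.0 < params["colsample_bytree"] <= 1.0
            and params["reg_alpha"] >= 0.0
            and params["gamma"] >= 0.0)


params_are_valid = check_params(go_xgb_params)
```

A dictionary of this form could be unpacked into a regressor constructor, but no library call is shown here because the original software setup is not documented in the text.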

25.4.4 Performance Assessment

The performance criteria used for evaluating a model depend on the type of its output variable, e.g., categorical or continuous (Tien Bui et al. 2016). For models with continuous output values, criteria such as the root mean square error (RMSE), the mean absolute percentage error (MAPE), the mean absolute error (MAE), and Pearson's correlation coefficient (r) are used (Pham et al. 2019a). Each criterion provides specific information about predictive performance (Li et al. 2016). RMSE is a quadratic scoring rule that measures the average magnitude of errors. It gives a relatively high weight to large errors; hence, it is most useful when large errors are undesirable. MAPE is the average of the absolute errors divided by the observed values, expressed as a percentage. MAE measures the average magnitude of errors in a set of predictions without considering their direction. It is a linear score, implying that all individual differences between predictions and corresponding observed values are weighted equally in the average. The coefficient r is a measure of the linear correlation between observed and predicted values. RMSE, MAPE, MAE, and r are estimated by the following equations (Pham et al. 2019b):

$$ RMSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {\left( {y_{i}^{obs} - y_{i}^{pr} } \right)^{2} } }}{n}} $$
(25.7)
$$ MAE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {y_{i}^{obs} - y_{i}^{pr} } \right|} $$
(25.8)
$$ MAPE = \sum\limits_{i = 1}^{n} {\frac{{\left| {\frac{{y_{i}^{obs} - y_{i}^{pr} }}{{y_{i}^{obs} }}} \right|}}{n}} \times 100 $$
(25.9)
$$ r = \frac{{\sum\nolimits_{i = 1}^{n} {\left( {y_{i}^{obs} - \overline{{y_{obs} }} } \right) \times \left( {y_{i}^{pr} - \overline{{y_{pr} }} } \right)} }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {\left( {y_{i}^{obs} - \overline{{y_{obs} }} } \right)^{2} } } \times \sqrt {\sum\nolimits_{i = 1}^{n} {\left( {y_{i}^{pr} - \overline{{y_{pr} }} } \right)^{2} } } }} $$
(25.10)

where \(y_{i}^{obs}\) and \(y_{i}^{pr}\) are the measured and predicted Cl concentrations for observation i, \(\overline{{y_{obs} }}\) and \(\overline{{y_{pr} }}\) are their means, and n is the number of observations. Higher values of r (i.e., close to 1) indicate better model performance and a regression line that fits the data well. Conversely, lower values of RMSE, MAPE, and MAE indicate better model performance.
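The four criteria can be computed with a few lines of Python. The sketch below uses the standard definitions (absolute differences for MAE and MAPE, squared deviations under the square roots in r); the function names and the three observation and prediction values are hypothetical.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def mape(obs, pred):
    """Mean absolute percentage error (in %); observations must be non-zero."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)


def pearson_r(obs, pred):
    """Pearson's correlation coefficient between observations and predictions."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    covariance = sum((o - mean_obs) * (p - mean_pred)
                     for o, p in zip(obs, pred))
    spread_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs))
    spread_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred))
    return covariance / (spread_obs * spread_pred)


# Hypothetical chloride concentrations in mg/L (for illustration only).
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "MAPE": mape(observed, predicted),
    "r": pearson_r(observed, predicted),
}
```

Note that RMSE and MAE carry the unit of the target variable (mg/L here), whereas MAPE is a percentage and r is dimensionless.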

25.4.5 Generating Groundwater Salinity Map

The results from the three machine learning models are then used to create chloride concentration maps. The prediction maps are constructed in four main steps: (i) interpolating chloride concentrations in groundwater based on the prediction results, (ii) reclassifying chloride concentrations based on the WHO drinking water standard, (iii) estimating the salinity-affected area, and (iv) estimating the number of people in each class of salinity-affected area. In the first step, the predicted chloride concentrations are interpolated to create maps using the Kriging method in the Spatial Analysis Tool of ArcGIS 10.3. In the second step, the interpolated results are reclassified into four main classes: low (Cl < 250 mg/L), moderate (250 ≤ Cl ≤ 500 mg/L), high (500 < Cl ≤ 1000 mg/L), and very high (Cl > 1000 mg/L). In the third step, the salinity-affected area for each class of salinity concentration in groundwater is calculated using geometry functions in ArcGIS 10.3. In the final step, the number of people within each salinity-affected area is estimated from the salinity-affected areas and the population density.
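Steps (ii) to (iv) amount to a threshold-based reclassification followed by simple area and population arithmetic, which can be sketched as follows. The class label "very high" for concentrations above 1000 mg/L, the assignment of the boundary values 500 and 1000 mg/L to the lower class, and all numerical inputs are assumptions made for this illustration; the interpolation step itself is not reproduced.

```python
from collections import Counter


def salinity_class(chloride_mg_per_l):
    """Assign a chloride concentration (mg/L) to one of four classes."""
    if chloride_mg_per_l < 250:
        return "low"
    if chloride_mg_per_l <= 500:
        return "moderate"
    if chloride_mg_per_l <= 1000:
        return "high"
    return "very high"


def affected_population(area_km2, density_per_km2):
    """Estimate the population as area times population density."""
    return area_km2 * density_per_km2


# Hypothetical interpolated values, one per map cell (mg/L).
cell_values = [120.0, 250.0, 500.0, 750.0, 1000.0, 1800.0]
cell_classes = [salinity_class(v) for v in cell_values]
cells_per_class = Counter(cell_classes)

# Hypothetical cell size and population density, for illustration only.
cell_area_km2 = 1.0
density_per_km2 = 360.0
people_per_class = {
    name: affected_population(count * cell_area_km2, density_per_km2)
    for name, count in cells_per_class.items()
}
```

Using a uniform population density is the simplest possible choice; a gridded population dataset would give a more realistic estimate but is outside the scope of this sketch.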

25.5 Results and Discussion

25.5.1 Feature Selection for the Groundwater Salinity Modelling

Table 25.2 shows the variable importance scores, expressed as the permutation-based decrease in MSE; the scores of the selected variables range from 4.03 to 0.69.

Table 25.2 Variable importance (permutation based MSE decreased)

In the study area, the ten most important influencing factors are groundwater level (4.03), vertical hydraulic conductivity (2.50), lithology (2.37), extraction capacity (2.10), horizontal hydraulic conductivity (1.85), distance to saline sources (1.73), well density (1.26), distance to hydraulic windows (0.85), depth of screen wells (0.79), and thickness of aquitards (0.69). The result reveals that groundwater salinization depends not only on hydrogeological features (vertical and horizontal hydraulic conductivities, lithology, paleo-saline sources, hydraulic connection, depth of screen well, and thickness of aquitard) but also on groundwater extraction practices (groundwater level, extraction capacity, well density). These influencing factors also play an important role in the transport of other solutes such as arsenic, fluoride, and nitrate in groundwater (Ransom et al. 2017; Winkel et al. 2008; Podgorski et al. 2018). The hydrogeological features influence the movement of saline groundwater from shallow to deeper aquifers (Hung et al. 2019), while groundwater exploitation exacerbates groundwater salinization (Hoang and Bäumle 2018; An et al. 2018). The result may also suggest that leakage of saline groundwater from upper to lower layers is a dominant process, resulting in an increase of chloride concentration in the groundwater of the study area. Hydraulically, an increased hydraulic gradient due to groundwater depletion, coupled with high vertical hydraulic conductivity, thin aquitards, and high density gradients, increases the vertical flow rate, as shown in the following equations (Ma et al. 2015).

$$ q_{v} = - \delta \times K_{v} \left[ {\frac{{h_{up} - h_{low} }}{\Delta L} + \varepsilon \left( {\frac{{C_{up} + C_{low} }}{2}} \right)} \right] $$
(25.11)
$$ \delta = \frac{{\mu_{0} }}{\mu } = 1 - \xi \times \varepsilon $$
(25.12)

where δ is the ratio of the dynamic viscosity of freshwater to that of seawater; Kv is the vertical hydraulic conductivity (m d−1); hup and hlow denote the freshwater-equivalent hydraulic heads of the upper and lower layers (m); ∆L is the distance from the upper to the lower layer (m); μ0 and μ denote the dynamic viscosities (kg m−1 d−1); \(\xi\) is a constant; Cup is the average observed salinity of pore water in the upper aquifers (kg/m3); Clow is the observed salinity of pore water in the lower aquifers (kg/m3); and ε is a constant. Similar findings were also observed in other coastal aquifers around the world (Chatton et al. 2016; Cary et al. 2015; Delsman et al. 2014; Larsen et al. 2017), indicating a strong influence of groundwater overexploitation on seawater intrusion in coastal aquifers (Yechieli et al. 2019; Yu and Michael 2019; Han et al. 2015).

The remaining influencing factors have permutation-based MSE values ranging from 0.64 for groundwater temperature to 0.10 for distance to the sea. The low score of 0.10 for the distance to the sea indicates that it contributes little to the groundwater salinization processes. This result may suggest that direct seawater intrusion from the sea into the coastal aquifers is not dominant in the study area.

25.5.2 Model Performance Evaluation and Comparison

In this study, the predictive models for groundwater salinization are built using the training and testing datasets, drawing upon a total of 215 observation wells and 20 variables. The results of the goodness-of-fit assessment of the three machine learning models (GO-XGB, RF, and GP) for the training and testing steps are shown in Fig. 25.4 and summarized in Tables 25.3 and 25.4, respectively.

Fig. 25.4
A set of six graphs represents the observed versus predicted chloride concentration for models including the G O-X G B, R F, and the G P for both training and testing steps.

Observed versus predicted chloride concentration for training and test data for the a GO-XGB, b RF, and c GP models

Table 25.3 Goodness-of-fit of the groundwater salinity models on the training dataset
Table 25.4 Prediction performance of the groundwater salinity models on the testing dataset

The training performance (Table 25.3) shows that the GO-XGB model has the lowest RMSE (18.450 mg/L), compared with the GP (RMSE = 219.329 mg/L) and RF (RMSE = 244.754 mg/L) models. The GO-XGB model likewise has the lowest MAE and MAPE (MAE = 4.864 mg/L, MAPE = 2.070%), compared with the RF (MAE = 58.286 mg/L, MAPE = 29.410%) and GP (MAE = 71.802 mg/L, MAPE = 61.42%) models. It also has the highest r-value (0.999), compared with the GP (r = 0.882) and RF (r = 0.786) models.

In the testing step, the predictive models are validated using the testing dataset, which consists of a random 20% sample of the original dataset (Fig. 25.4). The testing results show that the GO-XGB model outperforms the RF and GP models (Table 25.4). For example, GO-XGB achieves the highest correlation (r = 0.787), followed by the RF model (r = 0.596) and the GP model (r = 0.214). Similarly, the GO-XGB model shows the lowest errors (RMSE = 141.042 mg/L, MAE = 74.993 mg/L, MAPE = 87.250%), followed by the RF (RMSE = 176.179 mg/L, MAE = 84.708 mg/L, MAPE = 95.780%) and GP (RMSE = 305.782 mg/L, MAE = 127.355 mg/L, MAPE = 130.840%) models.
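
As a minimal sketch of such a random 80/20 partition, the snippet below splits the indices of 215 wells. The function name, the seed, and the rounding rule are assumptions for illustration; the chapter does not state how the random sample was drawn.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle sample indices and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 215 observation wells, as reported for this study
train_idx, test_idx = train_test_split_indices(215)
print(len(train_idx), len(test_idx))  # 172 43
```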

Overall, the GO-XGB model gives the best predictive performance among the three models, with the highest r-value in both the training (r = 0.999) and testing (r = 0.787) steps. It also has the lowest RMSE, MAE, and MAPE in both steps.
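
For reference, the four goodness-of-fit measures compared above can be computed as in the following sketch. It uses the standard textbook definitions, which are assumed (not confirmed here) to match those used in this chapter, and the observed and predicted chloride values are hypothetical.

```python
import math


def rmse(obs, pred):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error, in the units of the data."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def mape(obs, pred):
    """Mean absolute percentage error (%); observed values must be non-zero."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)


def pearson_r(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical chloride concentrations (mg/L), for illustration only
observed = [100.0, 200.0, 400.0, 800.0]
predicted = [110.0, 190.0, 420.0, 760.0]

print(f"RMSE = {rmse(observed, predicted):.3f} mg/L")
print(f"MAE  = {mae(observed, predicted):.3f} mg/L")
print(f"MAPE = {mape(observed, predicted):.3f} %")
print(f"r    = {pearson_r(observed, predicted):.3f}")
```

Note that RMSE and MAE carry the units of the predicted variable (mg/L), whereas MAPE is a percentage and r is dimensionless.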

Although we considered various influencing factors to predict groundwater salinity accurately in a coastal area of the Mekong River Delta, seawater intrusion into fresh aquifers depends not only on human activities but also on natural variations. Therefore, for broader applicability, these models would need to include additional influencing factors such as the regional groundwater flow system, tidal fluctuation, climate change, and sea-level rise. In addition, the performance of the prediction models should be compared with that of numerical models and other stochastic models.

25.5.3 Mapping Salt-Groundwater-Affected Area

In general, the results obtained from the three machine learning models, including the GO-XGB (Fig. 25.5), RF (Fig. 25.6), and GP (Fig. 25.7) models, show that the main salinity-affected region extends from the My Thanh River to the center of Soc Trang City. The predictions of the GO-XGB model agree closely with the salinity observations in this study (Fig. 25.8) and in previous studies (An et al. 2018). Accordingly, high chloride concentrations exceeding the drinking water standard (Cl > 250 mg/L) are predicted in areas close to paleo-saline sources with high extraction rates and significant groundwater level depletion. The severely affected areas are the Tran De estuary, the My Thanh River, and the central region including Soc Trang City and My Xuyen district, where chloride concentrations in wells reach 2000 mg/L. Surprisingly, low chloride concentrations (Cl < 250 mg/L) are predicted in coastal areas, even in production wells located only about 2 km from the sea and at an elevation of −10.5 m above mean sea level (m a.m.s.l.). Meanwhile, Soc Trang City, which lies approximately 40 km from the sea, is predicted to have high chloride concentrations in groundwater. This reveals that the processes of salinity accumulation in aquifers are very complex, depending not only on natural processes but also on human-induced activities.

Fig. 25.5
A map of chloride concentration in groundwater using the GO-XGB model. Most of the areas display chloride concentrations of less than 250 milligrams per liter and very few areas have concentrations from 500 through 1000 milligrams per liter.

Predicted chloride concentration in groundwater of the study area using GO-XGB model

Fig. 25.6
A map of chloride concentration in groundwater using the RF model. Most of the areas display chloride concentrations of less than 250 milligrams per liter and very few areas have concentrations from 250 through 500 milligrams per liter.

Predicted chloride concentration in groundwater of the study area using RF model

Fig. 25.7
A map of chloride concentration in groundwater using the GP model. Most of the areas display chloride concentrations of less than 250 milligrams per liter and very few areas in Soc Trang have more than 1000 milligrams per liter.

Predicted chloride concentration in groundwater of the study area using GP model

Fig. 25.8
A map of measured chloride concentration in groundwater. Most of the areas display chloride concentrations of less than 250 milligrams per liter and few areas around Soc Trang have concentrations from 500 through 1000 milligrams per liter.

Measured chloride concentration in groundwater of the study area

The spatial distributions of the areas affected by moderate and high chloride concentrations differ among the models. For example, in the GO-XGB model, the affected area is predicted to extend from the coastline to the central part of the study area (Fig. 25.5).

In addition, the most severely affected areas coincide with locations of substantial groundwater extraction. These locations lie close to the paleo-saline groundwater sources and also have high groundwater extraction rates and significant groundwater level depletion. This indicates that these influencing factors play an essential role in increasing chloride concentrations in groundwater, a finding in line with recent studies (Hoang and Bäumle 2019; Tran et al. 2019). In contrast, the results from the RF (Fig. 25.6) and GP (Fig. 25.7) models show that the moderately affected areas lie in the central part of the study region.

The three models give different estimates of the affected area (Table 25.5). The RF model predicted the largest area with low chloride concentration (Cl < 250 mg/L; 3118.50 km²), followed by the GP model (3055.35 km²) and the GO-XGB model (2879.0 km²). Meanwhile, the largest area with moderate chloride concentration (Cl = 250–500 mg/L) is predicted by the GO-XGB model (433 km²), followed by the GP model (256.65 km²) and the RF model (193.50 km²). Both the GO-XGB and GP models predicted large areas with high (Cl = 500–1000 mg/L) and very high (Cl > 1000 mg/L) chloride concentrations, whereas the RF model predicted no areas in these two classes.

Table 25.5 Predicted affected areas (in km²) for the four classes of chloride concentration in groundwater
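
The per-class areas reported above can be obtained by classifying each cell of a predicted chloride map and summing cell areas. The sketch below illustrates this under stated assumptions: the grid values and the cell size are hypothetical, and the treatment of values lying exactly on a class boundary (assigned to the higher class) is a choice made here, not taken from the chapter.

```python
from bisect import bisect_right

# Class boundaries in mg/L, following the four classes used in the text
CLASS_LIMITS = [250.0, 500.0, 1000.0]
CLASS_NAMES = ["low", "moderate", "high", "very high"]


def area_by_class(chloride_mg_l, cell_area_km2):
    """Sum the area of grid cells falling into each chloride class."""
    areas = {name: 0.0 for name in CLASS_NAMES}
    for value in chloride_mg_l:
        # bisect_right puts a value equal to a limit into the higher class
        # (assumption: e.g. exactly 250 mg/L counts as "moderate")
        areas[CLASS_NAMES[bisect_right(CLASS_LIMITS, value)]] += cell_area_km2
    return areas


# Hypothetical predicted chloride values (mg/L) on 0.25 km² cells
predicted_grid = [120.0, 240.0, 300.0, 650.0, 1500.0, 90.0]
areas = area_by_class(predicted_grid, cell_area_km2=0.25)
for name in CLASS_NAMES:
    print(f"{name}: {areas[name]:.2f} km2")
```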

25.6 Concluding Remarks

In this study, three advanced machine learning models, GO-XGB, RF, and GP, were employed to predict chloride concentration in groundwater and to assess the impacts of salinity on water users in a coastal area of the Mekong River Delta, Vietnam. Twenty influencing factors were evaluated using the RF model based on score estimation. The factors that most strongly influence high salinity are related to both groundwater exploitation (groundwater level depletion, extraction capacity, and well density) and hydrogeological features (vertical hydraulic conductivity, lithology, horizontal hydraulic conductivity, distance to the saline source, distance to the hydraulic window, depth of the well screen, and thickness of the aquitard). This finding confirms previous studies in which groundwater exploitation was identified as one of the most important factors influencing seawater intrusion in coastal lowland regions.

All three models perform well in predicting groundwater salinity. However, the GO-XGB model provides the most accurate predictions on the training dataset, with RMSE = 18.450 mg/L, MAE = 4.864 mg/L, MAPE = 2.070%, and r = 0.999, compared with the GP model (RMSE = 219.329 mg/L, MAE = 71.329 mg/L, MAPE = 61.42%, and r = 0.882) and the RF model (RMSE = 244.754 mg/L, MAE = 58.286 mg/L, MAPE = 29.410%, and r = 0.786). This indicates that the GO-XGB model could be a useful tool for predicting groundwater salinization in coastal aquifers.

All three models predicted that approximately 35% of the total population might have to use groundwater with a chloride concentration exceeding the WHO drinking water standard (Cl > 250 mg/L). More seriously, urban areas lie close to paleo-saline sources. Where aquitards are thin and groundwater levels decline quickly, leakage of paleo-saline water becomes more severe and causes groundwater salinization. This process is driven by the hydraulic connection between aquifers and by groundwater overexploitation. Given the rapid increase in water demand, significant groundwater depletion, and the unpredictable impacts of climate change and sea-level rise, water authorities must take immediate action to find a suitable solution to this environmental crisis.