Introduction

Dense nonaqueous phase liquids (DNAPLs), which have caused serious environmental and health hazards around the world (Fernandez-Garcia et al. 2012), have low solubility, high toxicity, high interfacial tension, and a high tendency to sink in water (Qin et al. 2007). There are many difficulties in DNAPL-contaminated aquifer remediation such as low contaminant removal rates, long remediation durations, and high remediation costs. Thus, selecting a reasonable and efficient remediation strategy based on information about the DNAPL contamination source in the aquifer is critical.

However, groundwater contamination is typically concealed, and its discovery usually lags behind the contamination event or events. This results in minimal knowledge about the groundwater contamination sources, including their number, locations, and release histories (Atmadja and Bagtzoglou 2001; Sun et al. 2006; Sun 2009), which makes groundwater contamination source identification (GCSI) especially important.

GCSI is accomplished by inversely solving a simulation model that describes contaminant transport in the aquifer based on limited groundwater contamination monitoring data. GCSI can be used to take effective action in protecting groundwater resources, estimating risks, mitigating disaster, and designing remediation strategies (Mirghani et al. 2012).

There have been several comprehensive reviews of GCSI (Atmadja and Bagtzoglou 2001; Michalak and Kitanidis 2004; Bagtzoglou and Atmadja 2005). Among the proposed solutions, the simulation–optimization method (Ayvaz and Karahan 2008; Mirghani et al. 2009; Ayvaz 2010; Datta et al. 2011; Zhao et al. 2016) and the Bayesian method (Michalak and Kitanidis 2003; Wang and Jin 2013; Zeng et al. 2012; Zhang et al. 2015, 2016) are effective tools for solving GCSI problems. The effectiveness of the simulation–optimization method for formulating and solving identification problems has been confirmed in many fields; however, running a multiphase flow numerical simulation model of a DNAPL-contaminated aquifer is time consuming, and the high computational burden of invoking the numerical simulation model repeatedly limits the applicability of GCSI simulation–optimization modeling at DNAPL-contaminated sites.

Previous studies (e.g., Mirghani et al. 2009, 2010) have mostly relied on parallelization and grid computing to decrease the computation time of the simulation model. An emerging alternative is the surrogate model, which reproduces the input–output relationship of the simulation model but can be evaluated several orders of magnitude faster (Queipo et al. 2005; Sreekanth and Datta 2010).

The most crucial requirement of the surrogate model is its approximation accuracy, because it greatly influences the reliability of the simulation–optimization model. Many surrogate model techniques have been applied to groundwater remediation strategy optimization problems such as polynomial regression (He et al. 2008), radial basis function artificial neural networks (RBFANN; Bagtzoglou and Hossain 2009; Luo et al. 2013), the Kriging algorithm (Hou et al. 2016), and support vector regression (SVR; Hou et al. 2015).

Asher et al. (2015) present a review of surrogate models and their application to groundwater modeling. Surrogate modeling techniques fall into three categories: data-driven, projection-based, and hierarchical approaches. The techniques mentioned above are all data-driven surrogates, which approximate a groundwater model through an empirical model that captures the input–output mapping of the original model, and they are the most widely used. Artificial neural networks (ANNs) are the most popular surrogate of the numerical simulation model for GCSI problems (Singh et al. 2004; Rao 2006; Mirghani et al. 2012; Srivastava and Singh 2014, 2015); however, they suffer from instability and overfitting problems that are difficult to resolve. Zhao et al. (2016) applied the Kriging model to GCSI problems and tested its accuracy, calculation time, and robustness in three cases. However, the applicability of the Kriging model to GCSI in DNAPL-contaminated aquifers has not previously been reported, and there are few applications of other surrogate models to GCSI problems.

This study therefore proposes utilizing the SVR and kernel extreme learning machine (KELM) models to enrich the set of surrogate models available for solving GCSI problems, especially for DNAPL-contaminated aquifers. Additionally, the effectiveness of the proposed models is examined through a comparative study of the Kriging, SVR, and KELM models, which shows that the disparities in applicability and approximation accuracy between these models for DNAPL-contaminated aquifer solute migration and transformation problems are significant. It is therefore necessary to select the best-fit surrogate model for the target problem.

In addition to the modeling method, the parameters and the structure of the training sample dataset also strongly affect the approximation accuracy of the surrogate model; however, these aspects have been insufficiently investigated, and previous work has generally determined the parameters and the number of training samples empirically (Mirghani et al. 2012; Luo et al. 2013; Jiang et al. 2015; Zhao et al. 2016). As an extension of previous studies, this paper presents two further comparative studies analyzing the influence of these factors on the approximation accuracy of the surrogate model: first, an examination of the differences between surrogate models with and without parameter optimization, and second, a comparison of surrogate models built with different numbers of training samples.

Methodology

Multiphase flow numerical simulation model

Any meaningful approach to GCSI problems must obey the principles of flow and transport. The simulation model is the principal part of the simulation–optimization model, in which the simulation model is set as an equality constraint (Datta et al. 2011). An overview of this process is shown in Fig. 1.

Fig. 1 Flow chart of the proposed GCSI solution process

The fundamental mass conservation equation for each multiphase flow component can be written as follows (Hou et al. 2015; Jiang et al. 2015):

$$ \frac{\partial \left(\phi {\tilde{C}}_k{\rho}_k\right)}{\partial t}+\overrightarrow{\nabla}\cdot \left[\sum \limits_{l=1}^2{\rho}_k\left({C}_{kl}{\overrightarrow{v}}_l-\phi {S}_l{\overrightarrow{\overrightarrow{K}}}_{kl}\overrightarrow{\nabla}{C}_{kl}\right)\right]={R}_k $$
(1)

where k is a component index and l is a phase index including water and oil. The initial and boundary conditions were integrated with the mass conservation equation to build the mathematical model, which was solved by UTCHEM.

Kriging

Kriging represents the model output as the sum of two components: a linear regression model and a systematic departure (Hemker et al. 2008). The basic formulation can be expressed as (Bagtzoglou et al. 1991, 1992)

$$ y\left(\mathbf{x}\right)={\mathbf{f}}^{\mathrm{T}}\left(\mathbf{x}\right)\boldsymbol{\upbeta} +Z\left(\mathbf{x}\right)=\sum \limits_{j=1}^k{f}_j\left(\mathbf{x}\right){\beta}_j+Z\left(\mathbf{x}\right) $$
(2)

where f(x) = [f1(x), f2(x), ⋯, fk(x)]T is a vector of deterministic regression functions and β = (β1, β2, ⋯, βk)T is the vector of regression coefficients to be estimated from the training samples. Z(x) is the local deviation from the regression model. A detailed introduction to the Kriging method can be found in Hou et al. (2015) and Zhao et al. (2016).
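As a minimal sketch only (the surrogates in this study were built in MATLAB), a Kriging-type surrogate of the form in Eq. (2) can be approximated with an off-the-shelf Gaussian-process regressor. The array shapes mirror the 8-input/10-output surrogate described later in the paper; the training data, kernel choice, and all parameter values are placeholders.

```python
# Illustrative Kriging-type surrogate (regression trend + correlated deviation, cf. Eq. 2)
# fitted with scikit-learn's Gaussian-process regressor. X_train/Y_train stand in for
# sampled input vectors and the corresponding simulation-model outputs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rng = np.random.default_rng(0)
X_train = rng.random((30, 8))      # 30 training samples, 8 input variables
Y_train = rng.random((30, 10))     # 10 outputs (5 wells x 2 observation times)

kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(8))
kriging = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
kriging.fit(X_train, Y_train)

X_test = rng.random((20, 8))
Y_pred = kriging.predict(X_test)   # surrogate estimate of the simulation outputs
```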

Support vector regression

SVR is a support vector machine (SVM)-based regression method that balances fitting accuracy and prediction accuracy (Hu et al. 2014; Zhang et al. 2014). For training inputs X = [x1, x2, ⋯, xm]T (where each element is an N-dimensional input vector xi = (xi,1, xi,2, ⋯, xi,N), i = 1, 2, ⋯, m) and outputs Y = (y1, y2, ⋯, ym)T, the nonlinear regression function can be expressed as:

$$ f\left({\mathbf{x}}_i\right)=\left\langle \mathbf{w},\varPhi \left({\mathbf{x}}_i\right)\right\rangle +b $$
(3)

where 〈w, Φ(xi)〉 denotes the dot product of the coefficient vector w = (w1, w2, ⋯, wN) and the mapped input Φ(xi), and b is the bias term. The goal is to find a function f(xi) that deviates from the target output yi by at most ε for all training inputs, while keeping the norm ‖w‖ of w as small as possible.

A kernel function is applied to project the samples from low-dimensional space to high-dimensional space:

$$ k\left(\mathbf{x},{\mathbf{x}}^{\prime}\right)=\exp \left[-\frac{{\left\Vert \mathbf{x}-{\mathbf{x}}^{\prime}\right\Vert}^2}{2\sigma}\right] $$
(4)

The regression problem can be expressed as an optimization problem:

$$ {\displaystyle \begin{array}{l}\operatorname{minimize}\kern1.00em \frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2+C\sum \limits_{i=1}^m\left({\xi}_i+{\xi}_i^{\ast}\right)\\ {}\mathrm{subject}\kern0.5em \mathrm{to}\kern1em \left\{\begin{array}{l}{y}_i-\left\langle \mathbf{w},\varPhi \left({\mathbf{x}}_{\mathbf{i}}\right)\right\rangle -b\le \varepsilon +{\xi}_i\\ {}\left\langle \mathbf{w},\varPhi \left({\mathbf{x}}_{\mathbf{i}}\right)\right\rangle +b-{y}_i\le \varepsilon +{\xi}_i^{\ast}\\ {}{\xi}_i,{\xi}_i^{\ast}\ge 0\end{array}\right.\end{array}} $$
(5)

where the constant C determines the trade-off between the flatness of the regression function and the amount by which deviations larger than ε are tolerated, and ξi and \( {\xi}_i^{\ast } \) are slack variables measuring deviations above and below the ε-insensitive band. The optimization problem in Eq. (5) is often solved in its Lagrange dual form (Smola and Scholkopf 2004; Hou et al. 2015):

$$ \mathbf{w}=\sum \limits_{i=1}^m\left({\alpha}_i-{\alpha}_i^{\ast}\right)\varPhi \left({\mathbf{x}}_{\mathbf{i}}\right)\kern0.5em ,\kern1em f\left(\mathbf{x}\right)=\sum \limits_{i=1}^m\left({\alpha}_i-{\alpha}_i^{\ast}\right)k\left({\mathbf{x}}_{\mathbf{i}},\mathbf{x}\right)+b $$
(6)

where αi and \( {\alpha}_i^{\ast } \) are Lagrange multipliers. The bias term b can be computed by exploiting the Karush-Kuhn-Tucker (KKT) conditions.
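For illustration (the study itself used the Libsvm toolbox), the ε-insensitive regression of Eqs. (3)–(6) can be reproduced with a generic SVR implementation. Standard SVR is single-output, so one model is trained per output element; C, ε, and the kernel parameter below are placeholder values that would normally be tuned.

```python
# Illustrative Gaussian-kernel SVR surrogate: one single-output SVR per element of the
# 10-dimensional output vector. Hyperparameter values are placeholders, not tuned values.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.random((30, 8))
Y_train = rng.random((30, 10))

svr_models = [
    SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma="scale").fit(X_train, Y_train[:, j])
    for j in range(Y_train.shape[1])
]

X_test = rng.random((20, 8))
Y_pred = np.column_stack([m.predict(X_test) for m in svr_models])
```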

Kernel extreme learning machine (KELM)

KELM generalizes the extreme learning machine (ELM) by replacing its explicit activation function with an implicit kernel mapping (Shi et al. 2014; Chen et al. 2014). Given N training samples (xj, tj), j = 1, ⋯, N, KELM is expressed as an optimization model:

$$ {\displaystyle \begin{array}{l}\min \kern1.5em \left\{\frac{1}{2}{\left\Vert \boldsymbol{\upbeta} \right\Vert}^2+\frac{C}{2}\sum \limits_{j=1}^N{\xi}_j^2\right\}\\ {}\mathrm{subject}\kern0.5em \mathrm{to}\kern1em \mathbf{m}{\left({\mathbf{x}}_j\right)}^{\mathrm{T}}\cdot \boldsymbol{\upbeta} ={t}_j-{\xi}_j\end{array}} $$
(7)

where β denotes a vector in the feature space F, C denotes the regularization coefficient, m(xj) maps the input xj to a vector in F, and ξj denotes the error (Wang and Han 2014).

The optimization problem can be transformed into Lagrange dual (L D) form

$$ {L}_D=\frac{1}{2}{\left\Vert \boldsymbol{\upbeta} \right\Vert}^2+\frac{C}{2}\sum \limits_{j=1}^N{\xi}_j^2-\sum \limits_{j=1}^N{\theta}_j\left(\mathbf{m}{\left({\mathbf{x}}_j\right)}^{\mathrm{T}}\cdot \boldsymbol{\upbeta} -{t}_j+{\xi}_j\right) $$
(8)

where θj is the jth Lagrange multiplier. This problem can be solved by exploiting the KKT optimality conditions (Jiang et al. 2015).

The kernel matrix of the ELM can be defined as

$$ {\mathbf{K}}_{\mathbf{ELM}}={\mathbf{MM}}^{\mathrm{T}} $$
(9)

and

$$ {K}_{\mathrm{ELM}\left(i,j\right)}=\mathbf{m}{\left({\mathbf{x}}_i\right)}^{\mathrm{T}}\cdot \mathbf{m}\left({\mathbf{x}}_j\right)=K\left({\mathbf{x}}_i,{\mathbf{x}}_j\right) $$
(10)

where M is the mapping matrix of training sample inputs in the feature space F.

Finally, the KELM output function can be written as

$$ f(x)=\mathbf{m}{\left(\mathbf{x}\right)}^{\mathrm{T}}{\mathbf{M}}^{\mathrm{T}}{\left({\mathbf{M}\mathbf{M}}^{\mathbf{T}}+\frac{\mathbf{I}}{C}\right)}^{-1}\mathbf{T}={\left[\begin{array}{l}K\left(\mathbf{x},{\mathbf{x}}_1\right)\\ {}\kern1.2em \vdots \\ {}K\left(\mathbf{x},{\mathbf{x}}_N\right)\end{array}\right]}^{\mathrm{T}}{\left({\mathbf{K}}_{\mathbf{ELM}}+\frac{\mathbf{I}}{C}\right)}^{-1}\mathbf{T} $$
(11)
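Because Eq. (11) reduces KELM prediction to a single regularized kernel-matrix solve, it can be written in a few lines of linear algebra. The sketch below is a minimal illustration assuming a Gaussian kernel; the regularization coefficient C, the kernel width σ, and the synthetic data are placeholders rather than the settings used in the study (which was implemented in MATLAB).

```python
# Minimal KELM sketch implementing Eq. (11): f(x) = k(x, X) (K_ELM + I/C)^(-1) T,
# assuming a Gaussian kernel. X: training inputs (N x d), T: training targets (N x p).
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def kelm_fit(X, T, C=100.0, sigma=1.0):
    """Solve (K_ELM + I/C)^(-1) T once; the result is reused for every prediction."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + np.eye(len(X)) / C, T)

def kelm_predict(X_new, X_train, coef, sigma=1.0):
    """Evaluate Eq. (11) at new inputs."""
    return gaussian_kernel(X_new, X_train, sigma) @ coef

# usage with placeholder data (8 inputs, 10 outputs, 90 training samples)
rng = np.random.default_rng(0)
X_train, T_train = rng.random((90, 8)), rng.random((90, 10))
coef = kelm_fit(X_train, T_train)
Y_pred = kelm_predict(rng.random((20, 8)), X_train, coef)
```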

Case study

Site overview

To analyze the practical application of different surrogate models to DNAPL-contaminated aquifer GCSI problems, a hypothetical chlorobenzene-contaminated site was set up as a case study. The site was located in the saturated zone of a 20-m-deep aquifer composed of a complex mixture of clay and sand deposits, in which the groundwater flowed from right to left. There are three potential contamination sources at the site. The goal was to simultaneously identify the actual source, its release strength and release duration, and to estimate the aquifer parameters. Five observation wells were placed downgradient of the potential sources to obtain groundwater quality data (Fig. 2).

Fig. 2 Locations of potential contamination sources (S1, S2, S3) and observation wells (O1–O5)

Multiphase flow numerical simulation model

A three-dimensional (3D) multiphase flow numerical model was developed in which the aquifer was assumed to be homogeneous, with known initial and boundary conditions. The left and right boundaries of the site were set as first-type (specified-head) boundaries, while the other boundaries were no-flux boundaries. The horizontal hydraulic gradient was set to 0.0112. The simulation domain was discretized into 10 vertical layers, each further discretized into 40 × 20 grid cells.

Surrogate models of the multiphase flow numerical simulation model

In order to identify the DNAPL source, an optimization model was established with the minimal deviation between actual observations and model predictions as its objective function; this model will be demonstrated in future research, as the present study focuses on the surrogate model. The size of the surrogate model output should match the actual groundwater quality observation data. There were two sets of observation data, taken 6 months apart. Each set contained five values, i.e., the chlorobenzene concentrations at the middle of the aquifer in the five observation wells (Fig. 3).

Fig. 3 Schematic diagram of chlorobenzene concentration observation location

The middle of the aquifer was chosen as the observation location because, in real-world situations, an oil phase may be present when sampling at the bottom of an aquifer, and the proportions of water and oil captured in a sample are random; this leads to significant deviation between the laboratory analysis results and the actual volume fraction of oil in the groundwater at the bottom of the aquifer. Thus, the output variables of the surrogate model were the chlorobenzene concentrations at the middle of the aquifer in the five observation wells at the two observation times, for a total of 10 elements.

The release strengths and duration of the three potential DNAPL sources were treated as controllable input variables when building the surrogate model. In addition, calibration and verification cannot be carried out without contaminant source information; thus, the contaminant source and aquifer parameters should be identified simultaneously (Starn et al. 2015). Finally, the input vectors of the surrogate model consist of eight elements: the release durations and strengths of sources S1, S2, and S3; porosity; aqueous phase dispersivity; oleic phase dispersivity; and permeability.

Four groups of training samples and 20 testing samples were generated within the feasible regions of the input variables using Latin hypercube sampling (LHS; Hossain et al. 2006), with each training group consisting of 30 samples. The release strength and release duration were uniformly distributed over (0, 1.5 m3 day−1) and (600, 900 days), respectively. The aquifer parameters were assumed to be normally distributed for LHS sampling, with porosity, dispersivity, and permeability taken as N(0.3, 0.0001), N(1 m, 0.01), and N(8,500 md, 100,000), respectively; as the case is hypothetical, these distribution characteristics were assumed. The corresponding outputs for the 140 input vectors were obtained by running the developed simulation model.
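The sketch below illustrates how such a training set could be drawn with Latin hypercube sampling and rescaled to the stated ranges. The column layout (three release strengths, one common release duration, porosity, aqueous and oleic dispersivity, permeability), the reading of the normal distributions as N(mean, variance), and the use of the same dispersivity distribution for both phases are assumptions made only to match the values quoted above; the study used its own LHS implementation.

```python
# Illustrative LHS design for the eight surrogate inputs. Distribution parameters
# follow the text, with the second argument of N(.,.) interpreted as a variance.
import numpy as np
from scipy.stats import norm, qmc

n_samples = 30
u = qmc.LatinHypercube(d=8, seed=0).random(n_samples)         # uniform [0, 1) design

X = np.empty_like(u)
X[:, 0:3] = u[:, 0:3] * 1.5                                   # release strengths, m3/day, U(0, 1.5)
X[:, 3] = 600.0 + u[:, 3] * 300.0                             # release duration, days, U(600, 900)
X[:, 4] = norm.ppf(u[:, 4], loc=0.3, scale=np.sqrt(1e-4))     # porosity, N(0.3, 0.0001)
X[:, 5] = norm.ppf(u[:, 5], loc=1.0, scale=np.sqrt(1e-2))     # aqueous dispersivity, m, N(1, 0.01)
X[:, 6] = norm.ppf(u[:, 6], loc=1.0, scale=np.sqrt(1e-2))     # oleic dispersivity, m, N(1, 0.01)
X[:, 7] = norm.ppf(u[:, 7], loc=8500.0, scale=np.sqrt(1e5))   # permeability, md, N(8500, 100000)
```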

There are three factors that affect approximation accuracy: surrogate modeling method, number of training samples, and surrogate model parameters. To analyze the influence of each of these factors, three comparative studies of different surrogate models were conducted.

Comparison between surrogate models built using different methods

In this experiment, the Kriging, SVR, and KELM models were built with the same training samples, and the uncertain parameters of the surrogate models were optimized with a genetic algorithm (GA) to improve their approximation accuracy to the simulation model (Hou et al. 2015). The three models were then compared using test samples. The Kriging and KELM models were built in MATLAB, and the Libsvm toolbox (Chang and Lin 2001) was used to train and test the SVR model (Hou et al. 2015). The comparison showed that the KELM model performed best, so only the KELM model was retained as the research object in the "Comparison between surrogate models with and without parameter optimization" and "Comparison between surrogate models built with different numbers of training samples" sections.

Comparison between surrogate models with and without parameter optimization

The KELM models with and without parameter optimization were compared using the testing samples to analyze the improvement achieved by parameter optimization. The KELM parameters were optimized by formulating an optimization model whose objective function was to minimize the sum of the relative errors obtained from threefold cross-validation on 90 training samples; the regularization coefficient in Eq. (7) and the kernel parameter served as the decision variables, and the constraints were the allowable ranges of these parameters. A GA was used to solve the optimization model.
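A hedged sketch of this step is given below: the objective is the summed relative error from threefold cross-validation, and the decision variables are the KELM regularization coefficient C and the Gaussian-kernel width. It reuses the kelm_fit/kelm_predict helpers from the KELM sketch above; SciPy's differential evolution is used purely as a stand-in for the GA employed in the study, and the search bounds are placeholders.

```python
# Sketch of surrogate parameter optimization: minimize a threefold cross-validated
# relative-error objective over (C, sigma). Differential evolution substitutes here
# for the genetic algorithm used in the study.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.model_selection import KFold

def cv_relative_error(params, X, Y, n_splits=3):
    C, sigma = params
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    errors = []
    for train_idx, val_idx in folds.split(X):
        coef = kelm_fit(X[train_idx], Y[train_idx], C=C, sigma=sigma)
        pred = kelm_predict(X[val_idx], X[train_idx], coef, sigma=sigma)
        errors.append(np.mean(np.abs(pred - Y[val_idx]) / np.abs(Y[val_idx])))
    return float(np.sum(errors))

bounds = [(1e-2, 1e4), (1e-2, 1e2)]          # placeholder search ranges for C and sigma
result = differential_evolution(cv_relative_error, bounds,
                                args=(X_train, T_train), seed=0)
best_C, best_sigma = result.x
```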

Comparison between surrogate models built with different numbers of training samples

To analyze the influence of the training sample dataset structure on the approximation accuracy of the surrogate model, three KELM models were built and compared. The numbers of training samples for the three surrogate models were 60, 90, and 120, and the parameters of each model were optimized by a GA. LHS was used to obtain four groups of 30 training samples each; thus, the training datasets of the three surrogate models were assembled from different combinations of these groups.

Surrogate model performance evaluation indices

Three indices were applied to evaluate the performance of the surrogate models (a short computational sketch follows their definitions below):

  1. Certainty coefficient \( R^2 \)

$$ {R}^2=1-\frac{\sum \limits_{i=1}^n\sum \limits_{j=1}^m{\left({y}_{i,j}-{\widehat{y}}_{i,j}\right)}^2}{\sum \limits_{i=1}^n\sum \limits_{j=1}^m{\left({y}_{i,j}-\overline{y}\right)}^2} $$
(19)

where n is the number of samples, m is the dimension of the simulation model output vector, yi,j is the jth element of the ith simulation model output vector, \( {\widehat{y}}_{i,j} \) is the jth element of the ith surrogate model output vector, and \( \overline{y} \) is the average of the simulation model outputs. The closer \( R^2 \) is to 1, the better the surrogate model approximates the simulation model.

  2. Mean relative error (MRE)

$$ \mathrm{MRE}=\frac{1}{n}\sum \limits_{i=1}^n\sum \limits_{j=1}^m\frac{\left|{y}_{i,j}-{\widehat{y}}_{i,j}\right|}{{y}_{i,j}} $$
(20)
  3. Maximum relative error

$$ \max \kern1em \frac{\left|{y}_{i,j}-{\widehat{y}}_{i,j}\right|}{{y}_{i,j}} $$
(21)
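A minimal computational sketch of the three indices in Eqs. (19)–(21) follows; y_true and y_pred are placeholder arrays standing in for the simulation-model and surrogate-model outputs, each of shape (n, m).

```python
# Sketch of the performance indices in Eqs. (19)-(21).
import numpy as np

def evaluate_surrogate(y_true, y_pred):
    n = y_true.shape[0]
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rel_err = np.abs(y_true - y_pred) / y_true
    mre = rel_err.sum() / n              # Eq. (20): summed relative error divided by n
    max_re = rel_err.max()               # Eq. (21)
    return r2, mre, max_re

# usage with placeholder data (20 test samples, 10 output elements)
rng = np.random.default_rng(0)
y_true = rng.random((20, 10)) + 0.5
y_pred = y_true * (1.0 + 0.05 * rng.standard_normal((20, 10)))
print(evaluate_surrogate(y_true, y_pred))
```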

Results and discussion

The outputs of 20 testing samples obtained using the trained surrogate models (a total of 200 values) were compared with those obtained using the developed simulation model. Figure 4 shows boxplots of the relative error metrics corresponding to the three different surrogate models.

Fig. 4 Boxplot of relative errors of different surrogate models

The transport of organic contaminants in multiphase flow is complicated, and the solubility of chlorobenzene in water is particularly low, making the relationship between the inputs and outputs of the simulation model difficult to approximate. Consequently, the relative errors of the three surrogate models were higher than those reported for the same surrogate methods applied to other problems (Luo et al. 2013; Hou et al. 2015, 2016; Zhao et al. 2016).

Figure 4 clearly shows that the number of relative errors larger than 20% for the Kriging model is much larger than for the other two models, and the maximum relative error for the Kriging model is 39.9035%. These findings indicate that the Kriging model performance was unstable for this problem. The three surrogate models were also evaluated using the three indices described previously (Table 1). The closer the certainty coefficient \( R^2 \) is to 1, the more accurate the surrogate model; Table 1 shows that the accuracy of the KELM and SVR models is higher than that of the Kriging model. Furthermore, the KELM model was better than the SVR model in all indices, and its maximum relative error was less than 20%; thus, the KELM model is an acceptable method for creating a surrogate model.

Table 1 Performance evaluation of different surrogate models

Figure 5 illustrates the distribution of the relative errors of the surrogate models. The relative error values were concentrated between 0.5 and 7%, and most were less than 12%. According to the relative error cumulative frequency curves, the KELM and SVR models were clearly superior to the Kriging model.

Fig. 5 Distributions of relative errors for different surrogate models. a Distribution of relative errors in different intervals; b relative error cumulative frequency curve

Figures 6 and 7 show the results corresponding to the KELM surrogate models with and without parameter optimization. The parameters of the KELM model greatly affect its approximation accuracy. After parameter optimization, all performance evaluation indices of the KELM model were significantly improved (Table 2).

Fig. 6 Boxplot of relative errors of KELM models with and without parameter optimization

Fig. 7 Distributions of relative errors for KELM models with and without parameter optimization. a Distribution of relative errors in different intervals; b relative error cumulative frequency curve

Table 2 Performance evaluation of KELM models with and without parameter optimization

Using 20 testing samples, the maximum and average relative errors of the groundwater contamination monitoring data predicted by the KELM model without parameter optimization (41.2639 and 5.9290%) were far larger than those of the optimized KELM model (18.2611 and 4.2053%). The relative error cumulative frequency curve of the optimized KELM model was located below that of the unoptimized KELM model throughout.

Figures 8 and 9 compare the results for the KELM surrogate models built with different training sample datasets. When the number of training samples increased from 60 to 90, the approximation accuracy of the KELM model improved significantly. However, as shown in Figs. 8 and 9 and Table 3, the KELM model built with 120 training samples performed no better than, and by some measures worse than, the model built with 90 training samples.

Fig. 8 Boxplot of relative errors of KELM models built with different training sample datasets

Fig. 9 Distributions of relative errors for KELM models built with different training sample datasets. a Distribution of relative errors in different intervals; b relative error cumulative frequency curve

Table 3 Performance evaluation of KELM models built with different training sample datasets

The structure of the training dataset affects the approximation accuracy of the surrogate model; however, the approximation accuracy does not simply improve with increasing numbers of training samples. It is necessary to provide sufficient training samples to improve the performance of the surrogate model, while avoiding unnecessary computation.

The optimal number of training samples depends on the surrogate modeling method, the number of input variables, the number of output variables, and many other factors. Too few training samples cannot cover the input variable intervals well, while too many are unhelpful for improving approximation accuracy; thus, further research on a technique for estimating the number of training samples required for the KELM model is needed.

A conventional simulation–optimization model requires 20,000 runs of the simulation model. Each simulation of the chlorobenzene-contaminated site required nearly 500 s of CPU time on a PC with a 3.2 GHz Intel Core i5 CPU and 4 GB of RAM, whereas each run of the KELM model took only 0.9 s. Thus, replacing the simulation model with the KELM model in the optimization process reduces the CPU time from roughly 10,000,000 s (116 days) to 18,000 s (5 h).
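The quoted saving follows directly from the run count and the per-run times:

$$ 20{,}000\times 500\ \mathrm{s}=1.0\times {10}^7\ \mathrm{s}\approx 116\ \mathrm{days},\kern2em 20{,}000\times 0.9\ \mathrm{s}=1.8\times {10}^4\ \mathrm{s}=5\ \mathrm{h} $$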

Though the approximation accuracy of the surrogate model was acceptable when the optimal surrogate method and parameters were selected, the maximum relative error of the groundwater contamination monitoring data predicted by the KELM model was greater than 15%. Future studies will be needed to further improve the approximation accuracy of the surrogate model and make simulation-surrogate-optimization-based GCSI results more reliable.

Conclusions

This study demonstrates the applicability of the Kriging, SVR, and KELM models for optimal identification of unknown groundwater pollution sources by presenting performance evaluations for different surrogate models. The proposed methodology overcomes some of the severe computational limitations of the embedded simulation–optimization approach.

Three comparative studies were carried out to select the optimal surrogate model and analyze the influence of parameters and the structure of the training dataset on the approximation accuracy of the surrogate model. Several general conclusions that can be drawn from this study are summarized in the following:

  1. The KELM model was the most reliable of the Kriging, SVR, and KELM surrogate models, and it reasonably predicted system responses for the given operating conditions.

  2. The performance of the KELM model was significantly improved through parameter optimization. Using 20 test samples, the maximum and average relative errors of the groundwater contamination monitoring data predicted by the KELM model without parameter optimization were 41.2639 and 5.9290%, whereas those of the optimized KELM model were only 18.2611 and 4.2053%.

  3. The structure of the training dataset significantly affects the approximation accuracy of the surrogate model; however, additional training samples do not always lead to higher approximation accuracy. Determining and utilizing the appropriate number of training samples is critical for improving the performance of the surrogate model and avoiding unnecessary computation.