1 Introduction

Identification of groundwater pollution sources is the process of inferring the location, release intensity, and release time of a pollution source. It is carried out by establishing a groundwater solute transport model and using the monitoring data from monitoring wells. In essence, it is an inverse problem in which the parameters of the solute transport model are identified from the monitoring data. At present, the methods for solving the inverse problem mainly include the Bayesian statistical method (Sohn et al. 2000; Chen et al. 2018), geostatistical method (Snodgrass and Kitanidis 1997), differential evolution algorithm (Ruzek and Kvasnicka 2001), genetic algorithm (Giacobbo et al. 2002), simulated annealing algorithm (Dougherty and Marryott 1991), and Kalman filter (Wang et al. 2018). Among them, the Bayesian statistical method obtains parameter information from the monitoring data by combining the prior probability density function of the parameters with the likelihood function of the samples, so it is regarded as a flexible and intuitive approach to the inverse problem and is increasingly widely applied.

In the inversion of model parameters, it is often necessary to obtain the posterior estimates or posterior distributions of the parameters by Bayesian statistical methods. However, when the dimension of the model parameters is large, numerical integration becomes complicated and difficult, so the Monte Carlo method (MC) (Roberts and Casella 2004) is used to obtain an approximate solution. The Markov chain Monte Carlo method (MCMC) (Metropolis et al. 1953; Hastings 1970; Tierney and Mira 1999; Mira 2002; Haario et al. 2001; Haario et al. 2006) is widely used as a classical sampling method. Several methods of constructing Markov chains have been developed, such as the Metropolis-Hastings algorithm (MH) (Metropolis et al. 1953; Hastings 1970), the delay rejection algorithm (DR) (Tierney and Mira 1999; Mira 2002), the adaptive Metropolis algorithm (AM) (Haario et al. 2001), and the delay rejection adaptive Metropolis algorithm (DRAM) (Haario et al. 2006). The DRAM algorithm combines the DR algorithm and the AM algorithm, which not only ensures the local adaptation of the Markov chain but also guarantees the global adaptive adjustment of the chain. Wei et al. (2016) applied the DRAM algorithm to identify the source information after a sudden water contamination incident. Zhang (2017) used the DRAM algorithm to invert the parameters of a groundwater model and also pointed out its defects: for example, DRAM is a single-chain MCMC algorithm and is suited to cases in which the parameter posterior distribution is unimodal. Therefore, an improved multi-chain delay rejection adaptive Metropolis algorithm based on Latin hypercube sampling (Gao 2008) is proposed in this paper.

On the other hand, the results of model parameter inversion are affected by the monitoring scheme, including the position and number of monitoring wells and the monitoring frequency. However, the monitoring scheme is often limited by monitoring funds and other objective conditions, which leads to ill-posedness (Carrera and Neuman 1986). To obtain a satisfactory inversion result, it is necessary to optimize the monitoring scheme. First, an objective function needs to be defined to quantify the amount of information provided by a monitoring scheme. Objective functions developed so far include the signal-to-noise ratio (SNR) (Gabriela et al. 2008) and the relative entropy based on the Bayesian formula (Huan and Marzouk 2013; Lindley 1956). However, the SNR only considers the interference of monitoring error with the monitoring data, and the relative entropy does not consider the influence of the prior distribution of the parameters on the posterior distribution. Shannon (1948) pointed out that information entropy is a measure of information uncertainty: the greater the uncertainty, the larger the information entropy. This paper combines the Bayesian formula with information entropy (Shannon 1948; Zhang et al. 2019) to optimize the monitoring scheme.

In the optimization design of the monitoring scheme and the identification of the pollution source, the groundwater solute transport model must be called repeatedly, which makes the calculation load very high. A surrogate model can effectively reduce this load. Commonly used methods of constructing surrogate models include polynomial regression (Knill et al. 1999) and Kriging (Kuhnt and Steinberg 2010; Luo et al. 2019). Kriging is an improved method of polynomial regression analysis, and a Kriging surrogate model can be established in MATLAB by using the DACE toolbox (Lophaven et al. 2002). The Kriging method is therefore widely used for constructing surrogate models.

In this paper, a two-dimensional solute transport simulation model for phreatic groundwater was established. Given the initial monitoring time and monitoring frequency, the Kriging method was used to establish a surrogate model of the solute transport simulation model. The optimized single-objective monitoring scheme MP1, with the minimum information entropy, and the optimized multi-objective monitoring scheme MP2, with the minimum information entropy and the shortest monitoring time, were then calculated. The improved multi-chain delay rejection adaptive Metropolis algorithm was used to identify the pollution source parameters based on the two optimized monitoring schemes. This paper provides a reference for the identification of groundwater pollution sources and the optimization of monitoring schemes.

2 Study Methods

2.1 Bayesian Formula

The Bayesian formula is expressed as follows:

$$ p\left(\boldsymbol{\alpha} |\boldsymbol{d}\right)=\frac{p\left(\boldsymbol{d}|\boldsymbol{\alpha} \right)p\left(\boldsymbol{\alpha} \right)}{p\left(\boldsymbol{d}\right)}\propto p\left(\boldsymbol{d}|\boldsymbol{\alpha} \right)p\left(\boldsymbol{\alpha} \right) $$
(1)

Where,

  • α is the unknown model parameter;

  • d is the monitoring data;

  • p(α| d) is the posterior probability density function of the model parameter;

  • p(α) is the prior probability density function of the model parameter;

  • p(d| α) is the conditional probability density function;

  • p(d) = ∫ p(d| α)p(α)dα is the normalizing constant, also called the marginal probability of the monitoring data d.

Assuming that

  • The number of the unknown parameters in the model is m, namely α = (α1, α2, ⋯, αm);

  • The environmental hydraulic model parameters are all distributed in a specific range;

  • Each parameter obeys uniform distribution;

  • α1, α2, ⋯, αm are mutually independent.

So the prior probability density function of model parameter αi can be defined as follows:

$$ p\left({\alpha}_i\right)=\begin{cases}\dfrac{1}{B_i-{A}_i}, & {\alpha}_i\in \left[{A}_i,{B}_i\right]\\ 0, & \text{otherwise}\end{cases} $$
(2)

The joint prior distribution p(α) can then be expressed as follows:

$$ p\left(\boldsymbol{\alpha} \right)=\prod \limits_{i=1}^mp\left({\alpha}_i\right) $$
(3)

Assuming that

  • The monitoring data in the model is recorded as d = (d1, d2, ..., dn);

  • F(α) indicates the calculated values of the model under the condition of parameters α, and ε = d − F(α) represents the error;

  • ε = (ε1, ε2, ..., εn) obeys normal distribution with the mean of 0;

  • ε1, ε2, ..., εn are mutually independent.

So the conditional probability density function can be expressed as follows:

$$ p\left(d|\alpha \right)=\frac{1}{{\left(2\pi \right)}^{n/2}\mid C\left(\varepsilon \right){\mid}^{1/2}}\exp \left\{-\frac{1}{2}{\left(d-F\left(\alpha \right)\right)}^TC{\left(\varepsilon \right)}^{-1}\left(d-F\left(\alpha \right)\right)\right\}, $$
(4)

where,

$$ C\left(\boldsymbol{\varepsilon} \right)=\left[\begin{array}{cccc}{\sigma}_1^2& 0& \dots & 0\\ {}0& {\sigma}_2^2& \dots & 0\\ {}\vdots & \vdots & \vdots & \vdots \\ {}0& 0& \dots & {\sigma}_n^2\end{array}\right]; $$
  • |C(ε)| is the determinant of matrix C(ε);

  • C(ε)−1 is the inverse matrix of matrix C(ε);

  • σi > 0(i = 1, 2, ⋯, n).

Combining the above functions (1), (2), (3), and (4), the posterior probability density function p(α| d) of α can be expressed as follows:

$$ {\displaystyle \begin{array}{c}p\left(\alpha |d\right)=\frac{\prod \limits_{i=1}^mp\left({\alpha}_i\right)}{p(d){\left(2\pi \right)}^{n/2}\mid C\left(\varepsilon \right){\mid}^{1/2}}\exp \left\{-\frac{1}{2}{\left(d-F\left(\alpha \right)\right)}^TC{\left(\varepsilon \right)}^{-1}\left(d-F\left(\alpha \right)\right)\right\}\\ {}=\lambda \exp \left\{-\frac{1}{2}{\left(d-F\left(\alpha \right)\right)}^TC{\left(\varepsilon \right)}^{-1}\left(d-F\left(\alpha \right)\right)\right\}\end{array}} $$
(5)

where \( \lambda =\frac{\prod \limits_{i=1}^mp\left({\alpha}_i\right)}{p\left(\boldsymbol{d}\right){\left(2\pi \right)}^{n/2}{\left|C\left(\boldsymbol{\varepsilon} \right)\right|}^{1/2}} \) is a constant for α within the prior ranges, because the uniform prior density does not vary there.

Equation (5) can be viewed as a function of the parameters α for fixed measured values. Since it is difficult to evaluate Eq. (5) explicitly by numerical integration, the Markov chain Monte Carlo method was employed to sample from it.
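As an illustration of how Eqs. (2) to (5) are typically evaluated inside a sampler, the following minimal sketch computes the unnormalized log posterior for a uniform prior and independent Gaussian errors. The function name `log_posterior` and the two-parameter linear model `toy_forward` are assumptions made for this example only; they stand in for the solute transport model F(α) and are not part of the original study.

```python
import numpy as np


def log_posterior(alpha, data, forward, lower, upper, sigma):
    """Unnormalized log of Eq. (5): uniform prior times independent Gaussian errors."""
    alpha = np.asarray(alpha, dtype=float)
    # Eqs. (2)-(3): the uniform prior density is zero outside [A_i, B_i]
    if np.any(alpha < lower) or np.any(alpha > upper):
        return -np.inf
    # Eq. (4) with diagonal C(eps): the quadratic form is a sum of scaled squares
    resid = (np.asarray(data, dtype=float) - forward(alpha)) / sigma
    return -0.5 * np.sum(resid ** 2)


# Hypothetical linear forward model, used only for illustration
def toy_forward(alpha):
    return np.array([alpha[0] + alpha[1], alpha[0] - alpha[1], 2.0 * alpha[0]])


lower = np.array([0.0, 0.0])
upper = np.array([1.0, 1.0])
sigma = np.array([0.1, 0.1, 0.1])
data = toy_forward(np.array([0.4, 0.2]))  # noise-free synthetic observations
```

The constant λ is dropped because MCMC acceptance ratios only need the posterior up to a multiplicative constant.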

2.2 Optimization Design of Monitoring Scheme Based on Bayesian Formula and Information Entropy

The optimization design of the monitoring schemes mainly includes the optimization of the number, positions of the monitoring wells, and monitoring frequency. Under the condition of single well monitoring, both position D (D indicates the serial number of the monitoring wells) and monitoring time interval Δt of the monitoring wells will be optimized simultaneously.

Assume that

  • The initial monitoring time is t1 (fixed value);

  • Monitoring scheme MP = (D, Δt);

  • Monitoring data is still recorded as d.

The Bayesian formula can be rewritten as follows:

$$ p\left(\boldsymbol{\alpha} |\boldsymbol{d},D,\varDelta t\right)=\frac{p\left(\boldsymbol{\alpha} |D,\varDelta t\right)p\left(\boldsymbol{d}|\boldsymbol{\alpha}, D,\varDelta t\right)}{\int p\left(\boldsymbol{\alpha} |D,\varDelta t\right)p\left(\boldsymbol{d}|\boldsymbol{\alpha}, D,\varDelta t\right)d\boldsymbol{\alpha}} $$
(6)

The prior distribution p(α| D, Δt) expresses the preliminary knowledge of the unknown parameters α and is not affected by D and Δt. So p(α| D, Δt) can be written as p(α). Equation (6) becomes the following:

$$ p\left(\boldsymbol{\alpha} |\boldsymbol{d},D,\varDelta t\right)=\frac{p\left(\boldsymbol{\alpha} \right)p\left(\boldsymbol{d}|\boldsymbol{\alpha}, D,\varDelta t\right)}{\int p\left(\boldsymbol{\alpha} \right)p\left(\boldsymbol{d}|\boldsymbol{\alpha}, D,\varDelta t\right)d\boldsymbol{\alpha}} $$
(7)

where ∫p(α)p(d| α, D, Δt)dα indicates the probability of obtaining the monitoring value d at well position D with monitoring time interval Δt. So it can be denoted as p(d| D, Δt) in the following format:

$$ p\left(\boldsymbol{d}|D,\varDelta t\right)=\int p\left(\boldsymbol{\alpha} \right)p\left(\boldsymbol{d}|\boldsymbol{\alpha}, D,\varDelta t\right)d\boldsymbol{\alpha} $$
(8)

Assuming that the probability density function of one-dimensional continuous random variable Θ is f(θ), the information entropy (Shannon 1948) of Θ in the interval [a, b] can be defined as follows:

$$ H\left(\varTheta \right)=-{\int}_a^bf\left(\theta \right)\ln f\left(\theta \right) d\theta $$
(9)

The monitoring data d obtained at position D can therefore be used to back-calculate the unknown parameters α, which gives the posterior probability density function p(α| d, D, Δt). The information entropy of the posterior distribution of α can be similarly expressed as follows:

$$ H\left(D,\varDelta t,\boldsymbol{d}\right)=-\int p\left(\boldsymbol{\alpha} |\boldsymbol{d},D,\varDelta t\right)\ln \kern0.5em p\left(\boldsymbol{\alpha} |\boldsymbol{d},D,\varDelta t\right)d\boldsymbol{\alpha} $$
(10)

The left side of Eq. (10) contains the monitoring data d, which cannot be obtained before the monitoring scheme has been designed. So d is treated as a random variable with probability density function p(d| D, Δt). In order to obtain a function of D and Δt only, both sides of Eq. (10) are multiplied by p(d| D, Δt) and then integrated over d. The expectation of the information entropy H(D, Δt, d) can be written as follows:

$$ {\displaystyle \begin{array}{l}E\left(H\left(D,\varDelta t,d\right)\right)\\ {}=-\int \left[\int p\left(\alpha |d,D,\varDelta t\right)\ln p\left(\alpha |d,D,\varDelta t\right) d\alpha \right]\;p\left(d|D,\varDelta t\right) d d\\ {}=-\iint p\left(\alpha |d,D,\varDelta t\right)p\left(d|D,\varDelta t\right)\ln p\left(\alpha |d,D,\varDelta t\right) d\alpha dd\end{array}} $$
(11)

where E(H(D, Δt, d)) is only affected by D and Δt, and is a continuous function on D and Δt. Therefore, E(H(D, Δt, d)) can be expressed as E(D, Δt), and the optimal monitoring scheme MP can be obtained by finding the minimum value of E(D, Δt). According to the concept of information entropy, when the monitoring values d from this monitoring scheme MP are used to back-calculate the unknown parameters α, the information entropy of the posterior distribution of α is the smallest, indicating that the uncertainty of α is also minimal and the inversion effect is optimal.

Equation (11) is very complicated to solve, and it is difficult to obtain an explicit expression. This paper obtains an approximate result by using the Monte Carlo method (Huan and Marzouk 2013; Zhang et al. 2019).
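One common way to approximate Eq. (11) is a nested Monte Carlo estimator of the kind described by Huan and Marzouk (2013): parameters are drawn from the prior, synthetic data are drawn from the likelihood, and the evidence p(d| D, Δt) of Eq. (8) is estimated by averaging the likelihood over the same prior draws. The sketch below is a simplified illustration under these assumptions, not the implementation used in this paper; the function name, the sample size, and the reuse of the prior samples for the evidence are choices made for the example. In an application, `forward` would be the surrogate model of well D evaluated at the monitoring times implied by Δt.

```python
import numpy as np


def expected_posterior_entropy(forward, lower, upper, sigma, n_samples=1000, seed=0):
    """Nested Monte Carlo estimate of Eq. (11) for a uniform prior and Gaussian errors."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    alpha = rng.uniform(lower, upper, size=(n_samples, lower.size))  # alpha_i ~ prior
    sim = np.array([forward(a) for a in alpha])                      # F(alpha_i)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), sim.shape[1:])
    data = sim + sigma * rng.normal(size=sim.shape)                  # d_i ~ p(d | alpha_i)

    log_prior = -np.sum(np.log(upper - lower))                       # Eqs. (2)-(3)
    # log p(d_i | alpha_j) for every pair (i, j), Eq. (4) with diagonal C(eps)
    resid = (data[:, None, :] - sim[None, :, :]) / sigma
    log_norm = -0.5 * sim.shape[1] * np.log(2.0 * np.pi) - np.sum(np.log(sigma))
    log_like = log_norm - 0.5 * np.sum(resid ** 2, axis=2)

    # Eq. (8): evidence p(d_i) as a prior average, computed stably in log space
    peak = log_like.max(axis=1, keepdims=True)
    log_evidence = peak[:, 0] + np.log(np.mean(np.exp(log_like - peak), axis=1))

    # Eq. (7): log posterior density at the sample that generated d_i
    log_post = log_prior + np.diag(log_like) - log_evidence
    return -np.mean(log_post)
```

With a hypothetical one-parameter model F(α) = α on [0, 1], a small error standard deviation yields a markedly lower expected entropy than a large one, which is the ordering that the scheme selection relies on.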

2.3 Improved Multi-chain Delay Rejection Adaptive Metropolis Algorithm Based on Latin Hypercube Sampling

2.3.1 Improved Multi-chain Delay Rejection Adaptive Metropolis Algorithm

The delayed rejection adaptive Metropolis algorithm (DRAM) was first proposed by Haario et al. (2006), where the specific steps of the algorithm can be found. However, with the single-chain DRAM algorithm the inversion result easily converges to a local solution or fails to converge (Zhang 2017). This paper proposes an improved multi-chain delay rejection adaptive Metropolis algorithm (multi-chain DRAM) based on Latin hypercube sampling. Latin hypercube sampling is a multi-dimensional stratified random sampling method whose samples are well dispersed, uniform, and representative. The specific algorithm of Latin hypercube sampling is given in Gao (2008).

Specific steps of the improved multi-chain DRAM algorithm based on Latin hypercube sampling are as follows:

  (1) q sets of initial samples are randomly extracted from the prior ranges of the model parameters by the Latin hypercube sampling method.

  (2) Taking the q sets of samples from step (1) as initial points, q parallel Markov chains are generated by the DRAM algorithm.

  (3) Convergence judgment. If the Markov chains satisfy the Gelman-Rubin convergence criterion (Gelman and Rubin 1992), the calculation terminates; otherwise the parallel sequences continue to evolve.

  (4) The averages of the calculated results of the q Markov chains are taken as the final results.
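The overall structure of these steps can be sketched as follows. To keep the example short, a plain random-walk Metropolis update stands in for the DRAM update, so the delayed rejection and covariance adaptation of Haario et al. (2006) are deliberately omitted, and the chains run for a fixed length instead of stopping on the convergence judgment. All function names, the step size, and the chain length are assumptions for illustration.

```python
import numpy as np


def latin_hypercube(n, lower, upper, rng):
    """One random point in each of n equal strata per dimension, strata shuffled."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    strata = np.array([rng.permutation(n) for _ in range(lower.size)]).T
    unit = (strata + rng.uniform(size=strata.shape)) / n
    return lower + unit * (upper - lower)


def multi_chain_sampler(log_post, lower, upper, q=4, length=4000, step=0.1, seed=0):
    """q parallel chains started from Latin hypercube points (random-walk Metropolis)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    starts = latin_hypercube(q, lower, upper, rng)           # initial samples
    chains = np.empty((q, length, lower.size))
    for c in range(q):                                       # one chain per start
        x, lp = starts[c], log_post(starts[c])
        for t in range(length):
            prop = x + step * (upper - lower) * rng.normal(size=x.size)
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:         # Metropolis acceptance
                x, lp = prop, lp_prop
            chains[c, t] = x
    return chains


def pooled_mean(chains):
    """Average of the last half of all chains, used as the final estimate."""
    half = chains.shape[1] // 2
    return chains[:, half:, :].mean(axis=(0, 1))
```

Here `log_post` is expected to return negative infinity outside the prior ranges, so proposals that leave the prior support are always rejected.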

2.3.2 Convergence Judgment of Improved Multi-chain DRAM Algorithm

In this study, the convergence of the multi-chain DRAM algorithm is assessed on the last 50% of the samples by the Gelman-Rubin convergence diagnosis method (Gelman and Rubin 1992). The convergence indicator is as follows:

$$ {\hat{R}}_i=\sqrt{\frac{g-1}{g}+\frac{q+1}{q}\cdot \frac{B_i}{W_i}} $$

where

  • \( {\hat{R}}_i\;\left(i=1,2,\cdots, m\right) \) is the judgment indicator of the ith parameter;

  • g is half the length of each Markov chain in the multi-chain DRAM algorithm;

  • q is the number of Markov chains used for the judgment;

  • Bi is the variance of the means of the last 50% samples in the q Markov chains of the ith parameter;

  • Wi is the average of the variance of the last 50% samples in the q Markov chains of the ith parameter.

When \( {\hat{R}}_i<1.2 \), the Markov chain is considered to have converged; when \( {\hat{R}}_i\ge 1.2 \), the Markov chain has not converged.
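The indicator can be computed directly from the stored chains. The sketch below follows the definitions given above, with Bi taken as the sample variance of the q chain means and Wi as the mean of the q within-chain sample variances; the array layout (chains, iterations, parameters) and the function name are assumptions for illustration.

```python
import numpy as np


def gelman_rubin(chains):
    """R-hat per parameter from the last 50% of each chain; chains has shape (q, length, m)."""
    chains = np.asarray(chains, dtype=float)
    q, length, _ = chains.shape
    g = length // 2                                # half the chain length
    tail = chains[:, length - g:, :]               # last 50% of every chain
    b = tail.mean(axis=1).var(axis=0, ddof=1)      # variance of the q chain means
    w = tail.var(axis=1, ddof=1).mean(axis=0)      # mean of the q chain variances
    return np.sqrt((g - 1) / g + (q + 1) / q * b / w)
```

Chains that sample the same distribution give values close to 1, whereas chains stuck around different modes give values well above the 1.2 threshold.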

2.4 Kriging Surrogate Model and Sampling Method

In order to reduce the calculation load generated by repeatedly calling the groundwater solute transport numerical simulation model during the optimization design of the monitoring scheme and the parameter inversion process, the Kriging method (Lophaven et al. 2002) is used to construct the surrogate model of the numerical simulation model.

Both the establishment of the surrogate model and the selection of the initial samples of the improved DRAM algorithm require a sampling method. When constructing the surrogate model, the samples should be evenly distributed over the entire space of the prior distribution so that the surrogate model can capture the trend of the target function; therefore, the optimal Latin hypercube sampling method (Hickernell 1998) with the centered L2 discrepancy (CL2) as the optimization index is used to extract samples. The improved DRAM algorithm is a multi-chain MCMC algorithm, and its initial samples are extracted within the prior ranges of the parameters by the Latin hypercube sampling method (Kuhnt and Steinberg 2010), which reduces the impact of random initial sample selection on the inversion results.
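A minimal sketch of this idea is given below. It computes the squared centered L2 discrepancy of Hickernell (1998) for a design on the unit hypercube and, as a simple stand-in for a true optimal Latin hypercube search, keeps the best of a number of random Latin hypercube designs. The best-of-random search, the function names, and the number of candidates are assumptions for illustration and are not the optimization procedure used in this paper.

```python
import numpy as np


def latin_hypercube_unit(n, m, rng):
    """Random Latin hypercube design with n points in [0, 1]^m."""
    strata = np.array([rng.permutation(n) for _ in range(m)]).T
    return (strata + rng.uniform(size=(n, m))) / n


def centered_l2_discrepancy_sq(x):
    """Squared centered L2 discrepancy of the points x in [0, 1]^m."""
    n, m = x.shape
    dev = np.abs(x - 0.5)
    term1 = (13.0 / 12.0) ** m
    term2 = 2.0 / n * np.prod(1.0 + 0.5 * dev - 0.5 * dev ** 2, axis=1).sum()
    pair = np.abs(x[:, None, :] - x[None, :, :])
    term3 = np.prod(
        1.0 + 0.5 * dev[:, None, :] + 0.5 * dev[None, :, :] - 0.5 * pair, axis=2
    ).sum() / n ** 2
    return term1 - term2 + term3


def best_of_random_lhs(n, m, n_candidates=200, seed=0):
    """Keep the random Latin hypercube design with the smallest discrepancy."""
    rng = np.random.default_rng(seed)
    designs = [latin_hypercube_unit(n, m, rng) for _ in range(n_candidates)]
    scores = [centered_l2_discrepancy_sq(d) for d in designs]
    best = int(np.argmin(scores))
    return designs[best], scores[best], scores[0]
```

A unit-cube design is mapped to the prior ranges by `lower + design * (upper - lower)`.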

3 Example Application

3.1 Model Establishment and Problem Overview

3.1.1 Model Establishment

The study area was assumed to be a rectangle 1000 m long and 600 m wide, and the aquifer a sandy aquifer with a thickness of 35 m (see Table 1 for the hydrogeological parameters). Both the western boundary Γ1 and the eastern boundary Γ3 were given-head boundaries, with a western head of 30 m and an eastern head of 25 m. Both the northern boundary Γ2 and the southern boundary Γ4 were impermeable boundaries.

Table 1 Known hydrological parameters in the study area

A total of 58 monitoring wells were set up in the study area. The initial concentration of aquifer pollutant was zero. Pollutant x was found downstream of the study area on a certain day. The pollution source was initially determined in a certain upstream region S (a priori range). And the pollutant was continuously and constantly injected into the aquifer in the form of a water injection well (200 m3/day) over a period of time. So the groundwater flow can be generalized as a two-dimensional homogeneous isotropic unsteady flow flowing from west to east. The schematic diagram of the study area is shown in Fig. 1.

Fig. 1 Sketch of example model

The coordinate system was established with the southwest corner as the coordinate origin, and the numerical model of groundwater flow was established according to the hydrogeological conditions of the study area:

$$ \Big\{{\displaystyle \begin{array}{l}\frac{\partial }{\partial x}\left[K\left(H-B\right)\frac{\partial H}{\partial x}\right]+\frac{\partial }{\partial y}\left[K\left(H-B\right)\frac{\partial H}{\partial y}\right]+w=\mu \frac{\partial H}{\partial t}\kern1.25em \left(x,y\right)\in \varOmega, t\ge 0\\ {}{\left.H\left(x,y,t\right)\right|}_{t=0}={H}_0\left(x,y\right)\kern12em \left(x,y\right)\in \varOmega, t=0\\ {}\begin{array}{l}{\left.H\left(x,y,t\right)\right|}_{\varGamma_1,{\varGamma}_3}={H}_1\left(x,y,t\right)\kern11em \left(x,y\right)\in {\varGamma}_1,{\varGamma}_3,t\ge 0\\ {}{\left.\frac{\partial H}{\partial \tau}\right|}_{\varGamma_2,{\varGamma}_4}=0\kern16.5em \left(x,y\right)\in {\varGamma}_2,{\varGamma}_4,t\ge 0\end{array}\end{array}} $$

where K is the permeability coefficient, m/day; H is the water level, m; B is the aquifer floor elevation, m; w is the source and sink term; μ is the specific yield, dimensionless; Ω is the scope of the study area; H0(x, y) is the initial water level, m; H1 is the known head on the first boundary, m; Γ1, Γ3 are the given head boundaries with Dirichlet boundary condition; Γ2, Γ4 are the impervious boundaries with Neumann boundary condition; τ is the outer normal direction for the Neumann boundary.

A numerical model of groundwater solute transport can be established based on the numerical model of groundwater flow. The boundaries of the simulated area can be generalized as follows: Γ1 was the zero concentration boundary with Dirichlet boundary condition; Γ3 was the convective diffusion flux boundary with Cauchy boundary condition; Γ2 and Γ4 were the zero diffusion flux boundaries with Neumann boundary condition. The solute transport model in the study area is as follows:

$$ \Big\{{\displaystyle \begin{array}{l}n\frac{\partial c}{\partial t}=\frac{\partial }{\partial x}\left(n{D}_x\frac{\partial c}{\partial x}\right)+\frac{\partial }{\partial y}\left(n{D}_y\frac{\partial c}{\partial y}\right)-\frac{\partial }{\partial x}\left({v}_xc\right)-\frac{\partial }{\partial y}\left({v}_yc\right)+{C}_{inj}{\mathrm{Q}}_{inj}\kern1.25em \left(x,y\right)\in \varOmega, t\ge 0\\ {}{\left.c\left(x,y,t\right)\right|}_{t=0}=0\kern19.75em \left(x,y\right)\in \varOmega, t=0\\ {}\begin{array}{l}{\left.c\left(x,y,t\right)\right|}_{\varGamma_1}=0\kern20em \left(x,y\right)\in {\varGamma}_1,t>0\\ {}{\left.D\frac{\partial c}{\partial \tau}\right|}_{\varGamma_2,{\varGamma}_4}=0\kern20.25em \left(x,y\right)\in {\varGamma}_2,{\varGamma}_4,t>0\\ {}{\left.\left(-n{D}_x\frac{\partial c}{\partial x}+c{v}_x\right)\right|}_{\varGamma_3}=f\left(x,y,t\right)\kern14.25em \left(x,y\right)\in {\varGamma}_3,t>0\kern1em \end{array}\end{array}} $$

where Dx and Dy are the components of the hydrodynamic dispersion coefficient in the x and y directions, m2/day; vx and vy are the percolation velocities of groundwater in the x and y directions respectively, m/day; n is the porosity of the aquifer medium, dimensionless; c is the mass concentration of the pollutant, mg/L; Qinj is the amount of liquid injected into the aquifer, m3/day; Cinj is the concentration of the pollutant entering the aquifer, mg/L; f(x, y, t) indicates the solute mass passing through a unit area of flow sections in unit time only under the action of the hydrodynamic dispersion.

The established groundwater flow and solute transport models were solved with the GMS (Groundwater Modeling System) software. In order to ensure that each grid center corresponded to a potential pollution source position, the study area was divided into 150 rows and 250 columns, with a basic cell side length of 4 m.

3.1.2 Problem Overview

Given the ranges of the potential pollution source parameters, the monitoring scheme was to be optimized using the existing 58 candidate monitoring wells. Both a single-objective and a multi-objective monitoring scheme were optimized. The single-objective monitoring scheme was optimized for the minimum information entropy of the posterior distribution, and the multi-objective monitoring scheme was optimized for the minimum information entropy and the shortest monitoring time. Then the pollution source parameters, including the position of the pollution source, the start and stop times of pollutant emission, and the mass concentration of the pollutant, were identified based on the optimized schemes. That is, the unknown parameters of the pollution source α = (XS, YS, T1, T2, QS) were solved, where (XS, YS) is the position of the pollution source, m; T1 and T2 are the start and stop times of pollutant emission, d; QS is the mass concentration of the pollutant, mg/L.

3.1.3 Parameter Prior Range

The initial time, t = 0, was set at a time when no pollution had yet occurred. Assuming that the above five parameters α = (XS, YS, T1, T2, QS) followed uniform prior distributions, the prior ranges of the five parameters were as follows:

80 m ≤ XS ≤ 200 m, 260 m ≤ YS ≤ 380 m, 10th day ≤ T1 ≤ 15th day, 25th day ≤ T2 ≤ 30th day, 3,000 mg/L ≤ QS ≤ 3,500 mg/L.

3.2 Establishment of the Kriging Surrogate Model

Fifty sets of samples of α were evenly extracted from the prior distribution by using the optimal Latin hypercube sampling method. The samples were taken as the input dataset of the Kriging surrogate model (at this time, CL2(Φ50, 5) = 0.0054). The fifty sets of samples are shown in Table 2.

Table 2 Fifty sets of training input dataset obtained from the prior distribution

A Kriging surrogate model was established for each of the 58 candidate monitoring wells.

The 50 sets of parameters in Table 2 were input into the GMS software to obtain the daily pollutant mass concentrations of each of the 58 candidate monitoring wells within [450th day, 649th day]. The daily pollutant mass concentrations were taken as the output dataset of the Kriging surrogate models. The 50 sets of input and output data were then imported into MATLAB as training samples, and the Kriging surrogate model of each monitoring well was trained with the DACE toolbox.
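To illustrate what a Kriging surrogate does with such training samples, the following sketch implements ordinary Kriging with a constant trend and a Gaussian correlation function in NumPy. It is not the DACE toolbox: the correlation parameter `theta` is fixed by the user here, whereas DACE estimates it from the data, and the class name, the nugget value, and the one-dimensional test function in the usage note are assumptions for illustration.

```python
import numpy as np


class KrigingSketch:
    """Ordinary Kriging with a Gaussian correlation and a fixed, user-chosen theta."""

    def __init__(self, theta, nugget=1e-10):
        self.theta = np.asarray(theta, dtype=float)
        self.nugget = nugget  # small diagonal term for numerical stability

    def _corr(self, a, b):
        # Gaussian correlation exp(-sum_k theta_k * (a_k - b_k)^2) for all pairs
        diff = a[:, None, :] - b[None, :, :]
        return np.exp(-np.sum(self.theta * diff ** 2, axis=2))

    def fit(self, x, y):
        self.x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        r = self._corr(self.x, self.x) + self.nugget * np.eye(len(y))
        ones = np.ones(len(y))
        # Generalized least-squares estimate of the constant trend
        self.beta = ones @ np.linalg.solve(r, y) / (ones @ np.linalg.solve(r, ones))
        self.weights = np.linalg.solve(r, y - self.beta)
        return self

    def predict(self, x_new):
        r_new = self._corr(np.asarray(x_new, dtype=float), self.x)
        return self.beta + r_new @ self.weights
```

For example, `KrigingSketch(theta=50.0).fit(x, y).predict(x_test)` with 15 equally spaced training points of sin(2πx) on [0, 1] reproduces the training values and interpolates closely between them. In the setting of this section, the inputs would be the scaled parameter sets of Table 2 and the outputs the simulated concentrations at one well.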

In order to test the accuracy of the Kriging surrogate models of the 58 monitoring wells, 10 further sets of parameters (Table 3) were extracted from the prior distributions of α by the Latin hypercube sampling method and taken as the input values of the test samples. The 10 sets of parameters were then input into the GMS software to obtain the daily pollutant mass concentrations of each of the 58 candidate monitoring wells within [450th day, 649th day]. These daily concentrations were taken as the output values of the test samples and recorded as yi, out = (yi, 1, yi, 2, ⋯, yi, 2000), where i = 1, 2, ⋯, 58 indicates the ith candidate monitoring well; the 2000 values per well correspond to 10 test samples over 200 days. The 10 sets of input values were also input into the Kriging surrogate models to obtain output values, which were recorded as \( {\hat{y}}_{i, out}=\left({\hat{y}}_{i,1},{\hat{y}}_{i,2},\cdots, {\hat{y}}_{i,2000}\right) \), where i = 1, 2, ⋯, 58 indicates the ith candidate monitoring well.

Table 3 Ten sets of testing input dataset obtained from the prior distribution

Taking the 58th monitoring well as an example, the output values of the test samples from the numerical model were taken as the abscissa, and the output values of the Kriging surrogate model were taken as the ordinate. The comparison between the output values of the surrogate model and the test samples is plotted in Fig. 2. Figure 2 shows that the points cluster along the line y = x, which indicates that the surrogate model can be a good substitute for the numerical model.

Fig. 2 Output comparison between the Kriging surrogate model and numerical model

Then the coefficient of determination, the mean absolute error, and the root mean square error were used to further test and evaluate the accuracy of the surrogate models, as shown in Table 4.

Table 4 R2, MAE, and RMSE of the Kriging surrogate models for the 58 monitoring wells

  (1) Coefficient of determination (R2):

$$ {R_i}^2=1-\frac{\sum \limits_{j=1}^{2000}{\left({y}_{i,j}-{\hat{y}}_{i,j}\right)}^2}{\sum \limits_{j=1}^{2000}{\left({y}_{i,j}-{\overline{y}}_i\right)}^2},i=1,2,\cdots, 58, $$

where \( {\overline{y}}_i=\frac{\sum \limits_{j=1}^{2000}{y}_{i,j}}{2000} \) represents the mean of the numerical model output values.

  (2) Mean absolute error (MAE):

$$ {\mathrm{MAE}}_i=\frac{\sum \limits_{j=1}^{2000}\mid {y}_{i,j}-{\hat{y}}_{i,j}\mid }{2000},i=1,2,\cdots, 58; $$

  (3) Root mean square error (RMSE):

$$ {\mathrm{RMSE}}_i=\sqrt{\frac{\sum \limits_{j=1}^{2000}{\left({y}_{i,j}-{\hat{y}}_{i,j}\right)}^2}{2000-1}},i=1,2,\cdots, 58. $$

It can be seen from Table 4 that the Kriging surrogate models have high prediction accuracy, indicating that the surrogate models can be a good substitute for the numerical model.
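The three indicators can be computed for one well as in the short sketch below, which follows the formulas above, including the n − 1 denominator in the RMSE. The function name and the four-point example in the usage note are assumptions for illustration and are unrelated to the values in Table 4.

```python
import numpy as np


def surrogate_metrics(y_true, y_pred):
    """Return (R2, MAE, RMSE) of surrogate outputs against numerical model outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    sse = np.sum(resid ** 2)
    r2 = 1.0 - sse / np.sum((y_true - y_true.mean()) ** 2)
    mae = np.mean(np.abs(resid))
    rmse = np.sqrt(sse / (y_true.size - 1))  # n - 1 in the denominator, as above
    return r2, mae, rmse
```

For instance, `surrogate_metrics([1, 2, 3, 4], [1, 2, 3, 5])` gives R2 = 0.8, MAE = 0.25, and RMSE = √(1/3).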

3.3 Optimization of Monitoring Schemes

3.3.1 Single-Objective Optimization Model Based on the Minimum Information Entropy

The value ranges of the parameters in the study area were described in Section 3.1.3. The first monitoring was set at time t1 = 450th day, and monitoring was carried out 10 times in total. The interval between two adjacent monitoring times was recorded as Δt, a positive integer with 1 day ≤ Δt ≤ 20 days. The last monitoring therefore took place no later than the 630th day (450 + 9 × 20); that is, the pollution source identification task was completed by the 630th day. The purpose of this section was to select the optimal monitoring scheme MP1 = (D, Δt) from the 58 candidate monitoring well positions D and 20 monitoring intervals Δt. Therefore, the value ranges of the monitoring scheme (D, Δt) can be written as follows:

$$ \varOmega =\left\{1\le D\le 58,1\;\mathrm{day}\le \varDelta t\le 20\;\mathrm{day}\mathrm{s},\mathrm{and}\ D\ \mathrm{and}\;\varDelta t\ \mathrm{are}\ \mathrm{positive}\ \mathrm{integers}\ \mathrm{respectively}\right\}. $$
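Because Ω is finite, it can be enumerated explicitly. The sketch below lists all candidate schemes and the monitoring days implied by each interval; the names `monitoring_days`, `scheme_space`, and `select_scheme`, and the idea of passing the objective as a function, are assumptions for illustration.

```python
def monitoring_days(first_day, interval, n_times=10):
    """Days on which the well is sampled: t1, t1 + dt, ..., t1 + (n_times - 1) * dt."""
    return [first_day + k * interval for k in range(n_times)]


# All (D, dt) pairs in Omega: 58 wells times 20 intervals
scheme_space = [(well, interval) for well in range(1, 59) for interval in range(1, 21)]


def select_scheme(objective):
    """Return the scheme in Omega with the smallest objective value."""
    return min(scheme_space, key=objective)
```

With t1 = 450, the longest interval gives monitoring days 450, 470, ..., 630, so every scheme stays inside the [450th day, 649th day] window covered by the surrogate models.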

It can be seen from Section 2.2 that the optimal monitoring scheme based on the minimum information entropy can be generalized to the minimum value of function (11), namely:

$$ E\left({\mathrm{MP}1}^{\ast}\right)=\underset{\left(D,\varDelta t\right)\in \varOmega }{\min }E\left(D,\varDelta t\right) $$
(12)

The posterior probability density function in Eq. (11) is given by Eq. (5), and the covariance matrix C(ε) in Eq. (5) needs to be specified. When the monitoring schemes are optimized by using the Kriging surrogate models, both the surrogate models and the measurements carry error.

Assuming that

  • The error \( {\varepsilon}_i^{\prime } \) of the Kriging surrogate model for the ith (i = 1, 2, ⋯, 58) monitoring well follows the normal distribution \( N\left(0,{\left({\sigma}_i^{\prime}\right)}^2\right) \), with mean \( E\left({\varepsilon}_i^{\prime}\right)=0 \) and standard deviation \( {\sigma}_i^{\prime }={\mathrm{RMSE}}_i \);

  • The measurement error ε'' follows the normal distribution N(0, (σ'')2), with mean E(ε'') = 0 and standard deviation σ'' = 0.01;

  • ε'i and ε'' are mutually independent.

The global error \( {\overline{\varepsilon}}_i={\varepsilon}_i^{\prime }+{\varepsilon}^{{\prime\prime} } \) for the ith (i = 1, 2, ⋯, 58) monitoring well then follows the normal distribution \( N\left(0,{\left({\sigma}_i^{\prime}\right)}^2+{\left({\sigma}^{{\prime\prime}}\right)}^2\right) \). According to this, C(ε) in Eqs. (4) and (5) can be determined.
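Under these assumptions, C(ε) for one well is a diagonal matrix whose entries are the sum of the two variances. A minimal sketch is given below; the function name and the example RMSE value of 0.03 in the usage note are hypothetical and are not taken from Table 4.

```python
import numpy as np


def error_covariance(rmse_i, sigma_measure=0.01, n_obs=10):
    """Diagonal C(eps) for one well: surrogate variance plus measurement variance."""
    total_variance = rmse_i ** 2 + sigma_measure ** 2
    return total_variance * np.eye(n_obs)
```

For example, a well with a hypothetical RMSE of 0.03 gives a total variance of 0.03² + 0.01² = 0.001 for each of the 10 monitoring values.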

The expected information entropy of every monitoring scheme can be obtained according to Eq. (11). Because D and Δt take only positive integer values, the minimum of the objective function E(D, Δt) was found by comparing all candidate schemes in Ω, giving \( \underset{\left(D,\varDelta t\right)\in \varOmega }{\min }E\left(D,\varDelta t\right)=12.16 \). So the optimal monitoring scheme is MP1 = (37, 20). That is, the best monitoring well was no. 37, and the best monitoring interval was Δt = 20 days.

In order to verify the optimization design of the monitoring schemes based on the Bayesian formula and information entropy, another 9 monitoring schemes were randomly selected from Ω = {1 ≤ D ≤ 58, 1 day ≤ Δt ≤ 20 days, D and Δt positive integers}. MP1 and the 9 new monitoring schemes were denoted by (Di, Δti) (i = 1, 2, ⋯, 10). The 10 monitoring schemes were then evaluated using the information entropy E(D, Δt) and the mean relative error MRE(D, Δt) of the inversion results as indicators. It would be unfair to evaluate the inversion results of the monitoring schemes using a single set of “true values of parameters”, so 20 sets of parameter values were extracted from the prior distributions of α by the Latin hypercube sampling method. These were recorded as “the true values of parameters” (Table 6) and written as χ = [χ(j, k)]20 × 5. For the 10 monitoring schemes, the 20 sets of “true values of parameters” generated 200 groups of concentration monitoring values through the Kriging surrogate models. The parameters α were then inverted from the generated monitoring values by the improved DRAM algorithm with 10 parallel chains. The length of each Markov chain was 34,000. At a chain length of 30,000, the convergence indicators of the 5 parameters satisfied \( {\hat{R}}_i<1.2\;\left(i=1,2,\cdots, 5\right) \). To ensure the accuracy of the inversion results, only the last 4000 samples, taken after the chains had stabilized, were used for posterior statistics, and the posterior mean estimates \( {M}_{\left({D}_i,\varDelta {t}_i\right)} \) for the 20 sets of “true values of parameters” χ were calculated and recorded as \( {M}_{\left({D}_i,\varDelta {t}_i\right)}={\left[{M}_{\left({D}_i,\varDelta {t}_i\right)}\left(j,k\right)\right]}_{20\times 5} \). The posterior means and the true values of parameters in Table 6 were then substituted into MRE(Di, Δti), which is expressed as follows:

$$ \mathrm{MRE}\left({D}_i,\varDelta {t}_i\right)=\left[\sum \limits_{j=1}^{20}\sum \limits_{k=1}^5\left|\frac{M_{\left({D}_i,\varDelta {t}_i\right)}\left(j,k\right)-\chi \left(j,k\right)}{\chi \left(j,k\right)}\right|\right]/100 $$
(13)

MRE(Di, Δti) of the 10 monitoring schemes were obtained by function (13). The results are shown in Table 5.
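Function (13) sums 20 × 5 = 100 absolute relative errors and divides by 100, so it is simply the mean absolute relative error over all entries. A minimal sketch of that calculation follows. The function name `mean_relative_error` and the small demonstration arrays are illustrative assumptions, not values from Table 6.

```python
import numpy as np


def mean_relative_error(posterior_means, true_values):
    """Mean of |M(j, k) - chi(j, k)| / |chi(j, k)| over all entries.

    For 20 x 5 inputs this equals the double sum in function (13)
    divided by 100.
    """
    m = np.asarray(posterior_means, dtype=float)
    chi = np.asarray(true_values, dtype=float)
    if m.shape != chi.shape:
        raise ValueError("posterior means and true values must share a shape")
    return float(np.mean(np.abs((m - chi) / chi)))


# Illustrative numbers only (NOT taken from Table 6):
# 2 parameter sets x 2 parameters.
chi_demo = np.array([[100.0, 200.0], [50.0, 10.0]])
m_demo = np.array([[110.0, 190.0], [50.0, 11.0]])

# Relative errors are 0.10, 0.05, 0.00 and 0.10, so the mean is 0.0625.
mre_demo = mean_relative_error(m_demo, chi_demo)
print(mre_demo)
```

Because every entry is normalized by its own true value, parameters of very different magnitude contribute on the same scale.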

Table 5 E(Di, Δti) and MRE(Di, Δti) of 10 monitoring schemes

A linear fit of MRE(Di, Δti) against E(Di, Δti) in Table 5 shows a good positive linear relationship, which can be written as

MRE(D, Δt) = 0.0072E(D, Δt) − 0.0345 (R² = 0.9101), and is shown in Fig. 3.

Fig. 3 The fitting diagram of the relationship between E(D, Δt) and MRE(D, Δt)

It can be seen from Fig. 3 and Table 5 that MRE(D, Δt) and E(D, Δt) show a good positive linear relationship, which indicates that E(D, Δt) is an effective measure of the accuracy of the parameter inversion results: the smaller E(D, Δt), the higher the accuracy of the parameter inversion. However, the scheme with the minimum E(D, Δt) does not have the minimum MRE(D, Δt). For example, E(D1, Δt1) = 12.16 is less than E(D4, Δt4) = 12.88, but MRE(D1, Δt1) = 0.060 is greater than MRE(D4, Δt4) = 0.053. The main reason is that, although the 20 sets of “true values of parameters” in Table 6 were extracted as uniformly as possible from the prior distributions of α by the Latin hypercube sampling method, 20 sets are too few to cover the prior range evenly in any real sense. The information entropy, by contrast, is solved by the Monte Carlo method (MC method) using 40,000 samples extracted from the parameter prior ranges by Latin hypercube sampling. It is therefore more reliable to take the minimum value of E(MP) as the index for selecting the optimal monitoring scheme; the minimum value of MRE(MP) cannot be used for this purpose.
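As a quick arithmetic check of the scatter around the fitted line, the sketch below evaluates the reported fit MRE = 0.0072E − 0.0345 at the two entropies quoted above. The function name `mre_from_entropy` is an illustrative assumption; the coefficients and the observed MRE values are those stated in the text.

```python
def mre_from_entropy(entropy, slope=0.0072, intercept=-0.0345):
    """Evaluate the reported linear fit MRE = slope * E + intercept."""
    return slope * entropy + intercept


# Entropies and observed MRE values quoted in the text for schemes 1 and 4.
pred_1 = mre_from_entropy(12.16)  # fitted value, about 0.0531
pred_4 = mre_from_entropy(12.88)  # fitted value, about 0.0582

observed_1, observed_4 = 0.060, 0.053

# The residuals have opposite signs: scheme 1 lies above the line and
# scheme 4 below it, which is why the MRE ranking of the two is reversed.
print(round(pred_1, 4), round(observed_1 - pred_1, 4))
print(round(pred_4, 4), round(observed_4 - pred_4, 4))
```

Both residuals are under 0.01 in magnitude, so the reversal is within the scatter of a fit with R² = 0.9101.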

Table 6 Twenty sets of true values of parameters obtained from the prior distribution

In summary, the smaller the information entropy E(D, Δt), the smaller the uncertainty of the parameter posterior distribution and the higher the accuracy of the inversion result. This verifies that the monitoring well optimization design method based on the Bayesian formula and information entropy is well suited to parameter inversion.

3.3.2 Multi-objective Optimization Model Based on Minimum Information Entropy and Minimum Monitoring Time

To find the pollution source as soon as possible, the monitoring scheme should not only minimize the information entropy but also take the least time. The trade-off between the information entropy and the duration of the monitoring scheme should therefore be considered. With the number of monitoring times still set to 10, a multi-objective optimization model was established with the minimum information entropy and the shortest monitoring time as objectives. The mathematical formulation is as follows:

Objective 1: minimum information entropy

$$ \underset{\left(D,\varDelta t\right)\in \varOmega }{\min }E\left(D,\varDelta t\right) $$
(14)

Objective 2: shortest monitoring time

$$ \underset{\varDelta t\in \left[1,20\right],\mathrm{positive}\ \mathrm{integers}}{\min }T=\left(10-1\right)\times \varDelta t=9\varDelta t $$
(15)

Generally speaking, the optimal solution of a multi-objective optimization problem is not unique. The optimal solution set consists of the solutions for which one objective function can be reduced only at the cost of increasing the value of another objective function. This optimal solution set is called the Pareto domain. A continuous multi-objective optimization problem can be solved by using the non-dominated sorting genetic algorithm with elite strategy (NSGA-II) (Deb et al. 2002). Since the values of D and Δt in this multi-objective optimization model are all positive integers, this paper instead used the exhaustive method to find the D with minimum information entropy for each of the 20 values of Δt. The resulting 20 combinations of D and Δt form the Pareto domain. The Pareto front is shown in Fig. 4.

Fig. 4 Pareto front of the objective function

If the total monitoring time must be less than or equal to 20 days, the optimal monitoring scheme MP2 is as follows: the best monitoring well was no. 37, the best monitoring interval was Δt = 2 days, and the monitoring time was T = 18 days. In this case, E(MP2) = 13.59.
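The exhaustive procedure described above can be sketched as follows: for each Δt, pick the well with minimum entropy, keep the non-dominated (E, T) pairs, and then apply the time limit. The entropy function `toy_entropy` is a hypothetical stand-in invented for illustration. It is not function (11), it does not reproduce the paper's entropy values, and its minimum at well 37 is built in by construction. All function names are likewise assumptions.

```python
def monitoring_time(dt, n_obs=10):
    """Objective 2: T = (n_obs - 1) * dt, i.e. 9 * dt for 10 observations."""
    return (n_obs - 1) * dt


def best_well_per_interval(entropy, wells, intervals):
    """For every interval dt, keep the well d with minimum entropy."""
    candidates = []
    for dt in intervals:
        d_best = min(wells, key=lambda d: entropy(d, dt))
        candidates.append(
            (d_best, dt, entropy(d_best, dt), monitoring_time(dt))
        )
    return candidates


def non_dominated(candidates):
    """Keep candidates that no other candidate beats on both E and T."""
    front = []
    for c in candidates:
        dominated = any(
            o[2] <= c[2] and o[3] <= c[3] and (o[2] < c[2] or o[3] < c[3])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


def toy_entropy(d, dt):
    # HYPOTHETICAL entropy surface: lowest at well 37, decreasing in dt.
    return 12.0 + 0.002 * (d - 37) ** 2 + 1.5 / dt


wells = range(1, 59)      # 1 <= D <= 58
intervals = range(1, 21)  # 1 <= dt <= 20 days

candidates = best_well_per_interval(toy_entropy, wells, intervals)
front = non_dominated(candidates)

# Apply the time limit T <= 20 days, then take the lowest-entropy scheme.
feasible = [c for c in front if c[3] <= 20]
mp2_like = min(feasible, key=lambda c: c[2])
print(mp2_like[:2], mp2_like[3])
```

With this toy surface, entropy falls and monitoring time rises as Δt grows, so all 20 candidates are non-dominated, and only Δt = 1 and Δt = 2 satisfy T ≤ 20 days.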

3.4 Identification of Pollution Source Based on Optimized Monitoring Schemes

Taking the first set of true parameter values in Table 6 (XS = 111.99, YS = 277.54, T1 = 13.54, T2 = 27.00) as an example, the pollution source was identified by using the single-objective and multi-objective optimized monitoring schemes respectively.

3.4.1 Identification of Pollution Source Based on Single-Objective Optimized Scheme

The monitoring values obtained under the single-objective optimized monitoring scheme MP1 were used with the improved DRAM algorithm (10 chains in total) to invert the pollution source parameters. In the parameter inversion process, the length of each Markov chain was 34,000, comprising a non-adaptive stage of length 4000 and an adaptive stage of length 30,000. The ergodic mean plots of the model parameters based on MP1 are shown by the solid lines in Fig. 5. When the length of the Markov chain reached 30,000, the convergence judgment indexes of the 5 parameters satisfied \( {\hat{R}}_i<1.2\;\left(i=1,2,\cdots, 5\right) \), so the Markov chains of all parameters had converged. The first 30,000 samples of each chain, drawn before stabilization, were then discarded, and only the last 4000 stable samples were used for the posterior statistical analysis. The results are shown in Table 7.
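The convergence check and the discarding of early samples can be illustrated with a short sketch. It assumes that \( {\hat{R}}_i \) is the standard Gelman-Rubin potential scale reduction factor, and that "the last 4000 samples" means the last 4000 of each chain, pooled over the 10 chains; neither is confirmed by this section. The chains are synthetic random draws standing in for DRAM output, and `gelman_rubin` is an illustrative name.

```python
import numpy as np


def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: array of shape (m, n) holding m parallel chains of length n.
    """
    x = np.asarray(chains, dtype=float)
    _, n = x.shape
    w = x.var(axis=1, ddof=1).mean()       # mean within-chain variance
    b = n * x.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (n - 1) / n * w + b / n      # pooled variance estimate
    return float(np.sqrt(var_hat / w))


rng = np.random.default_rng(0)
n_chains, n_total, n_keep = 10, 34_000, 4_000

# SYNTHETIC stand-in for sampler output: every chain draws from the same
# distribution, so the chains agree and R-hat should be close to 1.
chains = rng.normal(loc=100.0, scale=5.0, size=(n_chains, n_total))

# Convergence index evaluated on the first 30,000 samples of each chain.
r_hat = gelman_rubin(chains[:, :30_000])

# Keep only the last 4000 samples of each chain for posterior statistics.
posterior_sample = chains[:, -n_keep:].reshape(-1)
posterior_mean = posterior_sample.mean()
print(r_hat < 1.2, posterior_sample.size)
```

Chains that have not mixed, for example chains centred on different values, give a between-chain variance much larger than the within-chain variance and push the index above the 1.2 threshold.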

Fig. 5 Ergodic mean plots of model parameters based on MP1 and MP2

Table 7 Posterior statistical results of model parameters based on MP1 and MP2, and the convergence judgment indicators \( {\hat{R}}_i \)

3.4.2 Identification of Pollution Source Based on Multi-objective Optimized Scheme

The monitoring values obtained under the multi-objective optimized monitoring scheme MP2 were used with the improved DRAM algorithm (10 chains in total) to invert the pollution source parameters. In the inversion process, the Markov chain settings were the same as in Section 3.4.1. The ergodic mean plots of the model parameters based on MP2 are shown by the dotted lines in Fig. 5, and the posterior statistical results are shown in Table 7.

It can be seen from Table 7 that the mean relative errors of the posterior means of the 5 parameters by MP1 and MP2 are 3.06% and 5.87% respectively. The precision of parameter inversion by MP1 is higher than that by MP2, as illustrated by the inversion positions of the pollution source shown in Fig. 6. This is mainly due to E(MP1) = 11.90 < E(MP2) = 13.35. Moreover, when the pollutant concentrations at the no. 37 monitoring well are obtained by forward simulation from the pollution sources inverted by MP1* and MP2*, the monitoring concentration residual between the pollution source inverted by MP1* and the true pollution source is smaller than that for MP2*. The comparison is shown in Fig. 7. This further verifies that the smaller the information entropy of the monitoring scheme, the smaller the uncertainty of the parameter posterior distribution and the higher the accuracy of the inversion result.

Fig. 6 Inversion positions of pollution source by MP1* and MP2*

Fig. 7 Comparison of pollutant concentrations observed at the no. 37 monitoring well from the inversion pollution sources by MP1* and MP2*

Compared with the inversion results based on MP1*, the mean relative error of the posterior means of the 5 parameters increases by 2.81 percentage points under MP2* (from 3.06% to 5.87%), but the monitoring time is shortened from 180 to 18 days. Therefore, the multi-objective optimized monitoring scheme is of more practical significance for the rapid identification of the pollution source.

3.5 Sensitivity Analysis

As can be seen from Table 7, the relative errors of the posterior means of the parameters XS, YS, T1, and T2 by MP1 are all small, whereas that of QS is not. The relative errors of the posterior means of the parameters XS and QS by MP2 are large. This is mainly because the monitored mass concentration values are not equally sensitive to each parameter.

Local sensitivity analysis does not consider how interactions between different parameters influence the output. To avoid this defect, the global sensitivity analysis method (Sobol’ method) (Lenhart et al. 2002) and the Kriging surrogate model were used to obtain the first-order sensitivity coefficients of the parameters α with respect to the 10 monitoring datasets by MP1 and MP2 respectively. The results are shown in Table 8. According to the parameter sensitivity classification (Table 9) (Lenhart et al. 2002), the parameters XS, YS, T1, and T2 by MP1 are medium sensitive or sensitive parameters with respect to the monitoring values, but QS is an insensitive parameter. Similarly, the parameters YS, T1, and T2 by MP2 are medium sensitive or sensitive parameters, but XS and QS are insensitive parameters.
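First-order Sobol’ coefficients measure the share of output variance explained by each parameter alone. The sketch below estimates them with a Saltelli-type pick-freeze Monte Carlo estimator, which is one common choice; the section does not state which estimator was used. The linear `toy_model` is a hypothetical stand-in for the Kriging surrogate, with inputs assumed uniform on [0, 1], and all names are illustrative.

```python
import numpy as np


def first_order_sobol(model, n_params, n_samples, rng):
    """Pick-freeze estimate of first-order Sobol' indices.

    model maps an (n_samples, n_params) array of inputs in [0, 1]
    to a 1-D array of n_samples outputs.
    """
    a = rng.random((n_samples, n_params))
    b = rng.random((n_samples, n_params))
    y_a, y_b = model(a), model(b)
    y_all = np.concatenate([y_a, y_b])
    mean_y, var_y = y_all.mean(), y_all.var(ddof=1)

    indices = np.empty(n_params)
    for i in range(n_params):
        ab_i = a.copy()
        ab_i[:, i] = b[:, i]  # matrix A with column i taken from B
        y_ab_i = model(ab_i)
        # Centring y_b leaves the expectation unchanged and reduces noise.
        indices[i] = np.mean((y_b - mean_y) * (y_ab_i - y_a)) / var_y
    return indices


def toy_model(x):
    # HYPOTHETICAL additive model; exact first-order indices are
    # proportional to the squared coefficients 9, 1 and 0.01.
    return 3.0 * x[:, 0] + 1.0 * x[:, 1] + 0.1 * x[:, 2]


sobol_demo = first_order_sobol(
    toy_model, n_params=3, n_samples=200_000, rng=np.random.default_rng(1)
)
print(np.round(sobol_demo, 3))
```

In this toy case the third input has a first-order index near zero, which mirrors how a parameter such as QS can be classed as insensitive: the monitored concentrations change little when it varies, so the data constrain it weakly.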

Table 8 The first-order sensitivity coefficients of α
Table 9 Parameter sensitivity classification

4 Conclusion

A high-accuracy surrogate model of the numerical simulation model can be established by using optimal Latin hypercube sampling and the Kriging method. The surrogate model reproduces an input-output relationship similar to that of the numerical simulation model with a small amount of calculation. It can therefore significantly reduce the calculation load generated by repeatedly calling the groundwater solute transport numerical simulation model during monitoring scheme optimization design and pollution source identification.

The mean relative errors of the parameter inversion results and the information entropy of the parameter posterior distribution show a good positive linear relationship, which indicates that information entropy is an effective measure of the accuracy of the inversion results. The smaller the information entropy, the higher the accuracy of the inversion results. The monitoring well optimization design method based on the Bayesian formula and information entropy is an effective method for determining the monitoring scheme of groundwater pollution.

Compared with the single-objective optimized monitoring scheme, the multi-objective optimized monitoring scheme increases the error of the inversion results but significantly shortens the monitoring time. The multi-objective optimized monitoring scheme is therefore of more practical significance for the rapid identification of pollution sources.

The multi-chain DRAM algorithm based on Latin hypercube sampling can keep the Markov chains from falling into a local optimum and from having difficulty converging. The algorithm can greatly enhance the accuracy of the parameter inversion results.