1 Introduction

Radiological characterisation is one of the main challenges encountered in the nuclear industry for the decommissioning and dismantling (D&D) of old infrastructures such as buildings (see, e.g., Attiogbe et al. 2014; EPRI 2016; CEA/DEN 2017). Its main goal is to evaluate the quantity and spatial distribution of radionuclides. To this end, measurements are made to constitute a data set and obtain preliminary information. Many problems can arise during these measurement campaigns. The radioactivity present on site can endanger operators and limits the number of measurements that can be taken. In some extreme cases, drones and robots have to be used, making measurements more expensive and further reducing the size of the data sets (see, e.g., Goudeau et al. 2015; CEA/DEN 2017). It is therefore quite common in nuclear D&D characterisation to have only a small number of available data: a balance has to be found between data acquisition costs and the information provided by the data. Statistical tools make it possible to optimise the information extracted from the data, within a rigorous mathematical framework and with associated confidence intervals (in the D&D field, see, e.g., Zaffora et al. 2016; Blatman et al. 2017; Pérot et al. 2020).

More precisely, as in many other environmental and industrial fields (see, e.g., Webster and Oliver 2007; Daya Sagar et al. 2018), spatial statistics and geostatistical methods are used to predict the variable of interest at unobserved locations (prediction of the expected value), together with an indication of the expected prediction error (prediction variance). The methodology is often based on two steps: first, the construction of a statistical model and the estimation of its parameters; then, prediction with this model at any unobserved point. The ordinary kriging model (see, e.g., Chilès and Delfiner 2012; Cressie 1993) is one of the most widely used models in the industrial practice of D&D (see, e.g., Attiogbe et al. 2014; Goudeau et al. 2015; EPRI 2016). However, a common criticism is that its predictions do not take into account the uncertainty in the estimation of the model parameters. As a result, the prediction variances are often too optimistic, and these neglected parameter uncertainties can have a significant impact. This problem is aggravated for smaller data sets, which are common in D&D projects. For radiological characterisation in D&D, the first kriging examples shown in Desnoyers (2010) and Desnoyers et al. (2011) studied practical cases based on many measurements and did not consider this issue. The more realistic studies by Boden et al. (2013), Lajaunie et al. (2020) and Desnoyers et al. (2020), carried out on smaller data sets, have instead highlighted the errors generated by the estimation errors of the kriging parameters.

To overcome this kriging issue, a Bayesian approach was first proposed by Kitanidis (1986). Its main goal was to take into account the uncertainties in the scale and mean parameters of the kriging model. The work of Handcock and Stein (1993) then completed the full Bayesian approach, which treats all the parameters of the model as unknown. More recently, a slightly different approach, called empirical Bayesian kriging, was presented by Krivoruchko and Gribov (2019). It differs slightly from the methodology of Kitanidis (1986), since the prior distributions of the kriging parameters are obtained through unconstrained simulations of the random field. This approach was adapted to allow for multi-fidelity applications, where Bayesian theory is used to update the initial data with new, more accurate data (classically handled with cokriging if correlations between old and new data exist). Examples can be found in meteorology in Gupta et al. (2017) or for oil extraction in Al-Mudhafar (2019). Note that a more complete description of Bayesian kriging, with an extension to generalised linear models, is presented in Diggle and Ribeiro (2007).

In this framework, our work aims to assess the usefulness of the Bayesian kriging approach, compared to the ordinary kriging approach, for the radiological characterisation of contaminated buildings. In particular, we study the specification of prior distributions for the parameters in Bayesian kriging, which allows a more robust estimation of these parameters when only a few observations are available. The performance of ordinary and Bayesian kriging is compared on several numerical examples. For this, we focus not only on the accuracy of the kriging predictor but also on the accuracy of the kriging predictive variance. Indeed, the kriging variance is often used by practitioners to estimate predictive intervals on predicted quantities, to justify their choice of sampling, or to find locations of new (potentially expensive) measurements (Bechler et al. 2013). To ensure a certain level of confidence in the use of the predictive variance, the works of Marrel et al. (2012), Bachoc (2013a), Demay et al. (2022) and Acharki et al. (2023) on kriging model validation have emphasised the usefulness of several validation criteria, such as the predictive variance adequacy (PVA) and the \(\alpha \) confidence interval (\(\alpha \)-CI) plot. In addition, new validation criteria are required for an accurate comparison in the case of the Bayesian kriging model; they are proposed in the present work.

The following section describes the kriging models under study, while Sect. 3 develops the associated classical validation criteria before introducing the newly proposed ones. Section 4 presents the results of the model comparison obtained on several numerical tests. Section 5 then illustrates the application to a real case study coming from the decommissioning project of the CEA Marcoule G3 reactor. Section 6 gives some conclusions. Finally, two appendices present prior specification and parameter estimation results, which are not discussed in the main body of this article.

2 The Ordinary and Bayesian Kriging Models

This section provides reminders on kriging principles, within the framework of the Gaussian random field model.

2.1 The Gaussian Random Field Model

The variable of interest is assumed to be a random field \(\{Z({\varvec{x}}), {\varvec{x}}\in D\}\), with \(D\subset {\mathbb {R}}^{2}\). \(Z(\cdot )\) is assumed to be isotropic and stationary, meaning that

$$\begin{aligned}&\forall {\varvec{x}} \in D, \quad {\mathbb {E}}[Z({\varvec{x}})]=\beta , \\&\forall {\varvec{x}},\varvec{x'} \in D, \quad \text {Cov}(Z({\varvec{x}}),Z(\varvec{x'}))=\sigma ^2 C_{\phi }(|{\varvec{x}}-\varvec{x'}|), \end{aligned}$$

where \(C_{\phi }\) is the correlation function, satisfying \(C_{\phi }(0)=1\), and \(\beta , \sigma ^2, \phi \) denote the mean, variance and range (or correlation length) parameters, respectively. For ease of notation, conditioning on parameters will be abbreviated, e.g., from \(Z|\beta ={\widehat{\beta }}\) to \(Z|\beta \). The term \(C_{\phi }\) corresponds to a positive semi-definite function. Moreover, by definition of a Gaussian process, every finite collection of values of Z follows a multivariate normal distribution (denoted \({{{\mathcal {N}}}}(\cdot ,\cdot )\)). Thus, for n observations at positions \({\varvec{x}}_1,\ldots ,{\varvec{x}}_n\), we obtain the Gaussian random vector \({\varvec{Z}}=(Z({\varvec{x}}_1),\ldots ,Z({\varvec{x}}_n))'\) with

$$\begin{aligned} {\varvec{Z}}|\beta ,\sigma ^2,\phi \sim {{{\mathcal {N}}}}(\beta {\varvec{1}}_n,\sigma ^{2}{\varvec{R}}_{\phi }), \end{aligned}$$

where \({\varvec{1}}_n\) is the vector \((1,\ldots ,1)'\) of length n, and the covariance matrix is \(\sigma ^{2}{\varvec{R}}_{\phi }~=~\left( \text{ Cov }(Z({\varvec{x}}_i),Z({\varvec{x}}_j))\right) _{1\le i,j \le n}\). The observation sample of Z is written as \({\varvec{z}}~=~(z({\varvec{x}}_1),\ldots ,z({\varvec{x}}_n))'\).

The positive semi-definite function \(C_{\phi }\) is often modelled using a common covariance function. In this work, two covariance models will be used (see, e.g., Chilès and Delfiner 2012 for an extensive list of covariance functions). The first one is the Gaussian covariance function written as

$$\begin{aligned} \forall h \in {\mathbb {R}}, C_{\phi }(h)=e^{-h^2/\phi ^2}, \end{aligned}$$

while the second one is the Matérn covariance function written as

$$\begin{aligned} \forall h \in {\mathbb {R}}, C_{\phi ,\nu }(h)=\frac{2^{1-\nu }}{\varGamma (\nu )}\left( \sqrt{2\nu }\frac{h}{\phi }\right) ^\nu K_{\nu }\left( \sqrt{2\nu }\frac{h}{\phi }\right) , \end{aligned}$$ (1)

with \(\nu \) a strictly positive parameter, \(\varGamma (\cdot )\) the gamma function and \(K_\nu (\cdot )\) the modified Bessel function of the second kind of order \(\nu \). The parameter \(\nu \), which drives the regularity of the process trajectories, is not estimated. It is chosen from a set of possible values, the most commonly used being \(\nu \in \left\{ \frac{1}{2},\frac{3}{2},\frac{5}{2}\right\} \). In addition, a nugget effect can be included, written as

$$\begin{aligned} \forall h \in {\mathbb {R}}, C_{\tau ^2}(h)=\tau ^2\delta (h), \end{aligned}$$

with \(\tau ^2\) a variance and \(\delta \) the Dirac function, where \(\delta (h)=1\) if \(h=0\) and \(\delta (h)=0\) otherwise. The nugget effect is often used to model micro-scale variations and measurement uncertainties. In our case studies, it will mainly be used to improve the conditioning of the matrix \({\varvec{R}}_{\phi }\), and thereby the stability of its numerical inversion (especially in the case of a Gaussian covariance function).

The model is therefore specified by three different parameters: the trend parameter \(\beta \in D_\beta \), the variance parameter \(\sigma ^2 \in D_{\sigma ^2}\) and the range parameter \(\phi \in D_{\phi }\). In the case of ordinary kriging and for the covariance functions considered here, the parameter spaces are

$$\begin{aligned} D_\beta = {\mathbb {R}}, D_{\sigma ^2} = ]0,+\infty [, D_{\phi } = ]0,+\infty [. \end{aligned}$$

The first step of the kriging methodology in practice is to estimate these parameters. Two main procedures are commonly used: variographic analysis and maximum likelihood estimation (MLE). An extensive literature is available on parameter estimation by variographic analysis, such as Chilès and Delfiner (2012) and Webster and Oliver (2007). In this work, we use MLE to take advantage of the probabilistic framework and to avoid manual or automatic fitting of variograms, especially since our numerical tests will require parameter estimation for many simulated data sets. Moreover, the automatic fitting of variograms is strongly discouraged in most of the literature (see, e.g., Chilès and Delfiner 2012; Webster and Oliver 2007). Note that when kriging is used to interpolate and predict numerical experiments with a large number of inputs, a multi-start optimisation procedure is often used for the MLE to avoid the known pitfall of local extrema and to better explore the input parameter space. This procedure is not used here because preliminary studies have shown that it is unnecessary in our case, owing to the small dimension of the problem (i.e., a two-dimensional random field) and the regularity of the likelihood function. This choice reduces computation times without compromising parameter estimation.
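As an illustration, a minimal R sketch of such an MLE fit with the geoR package (used throughout this work) is given below. The objects coords and z are hypothetical placeholders for the measurement positions and values, and the initial covariance parameters are arbitrary.

```r
library(geoR)
## Hypothetical data layout: 'coords' is an n x 2 matrix of measurement
## positions and 'z' the vector of n observed values.
gd  <- as.geodata(cbind(coords, z))
fit <- likfit(gd,
              cov.model    = "matern",
              kappa        = 0.5,            # exponential covariance
              ini.cov.pars = c(0.1, 4.5),    # initial (sigma^2, phi), arbitrary
              fix.nugget   = TRUE, nugget = 0,
              lik.method   = "ML")           # maximum likelihood (not REML)
fit$beta; fit$sigmasq; fit$phi               # estimated beta, sigma^2, phi
```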

2.2 Kriging Model Principles

The kriging predictor is a linear interpolator whose expressions are derived from supplementary conditions, such as unbiasedness and minimisation of the prediction variance. For a detailed description of kriging and its construction, the reader can refer to the reference books of Chilès and Delfiner (2012) and Cressie (1993) for geostatistics, and to Rasmussen and Williams (2006) for the Gaussian process regression point of view. Let \({\varvec{x}}_0\) be an unobserved position at which we wish to predict the expected value and the variance of \(Z({\varvec{x}}_0)|\sigma ^2,\phi ,{\varvec{Z}}={\varvec{z}}\) (the mean being considered unknown). The ordinary kriging equations are then

$$\begin{aligned} {\mathbb {E}}[Z({\varvec{x}}_0)|\sigma ^2,\phi ,{\varvec{Z}}={\varvec{z}}]&=\left( {\varvec{r}}+{\varvec{1}}_n\frac{1-{\varvec{1}}_n'{\varvec{R}}_{\phi }^{-1}{\varvec{r}}}{{\varvec{1}}_n'{\varvec{R}}_{\phi }^{-1}{\varvec{1}}_n}\right) '{\varvec{R}}_{\phi }^{-1}{\varvec{Z}}, \\ \text {Var}[Z({\varvec{x}}_0)|\sigma ^2,\phi ,{\varvec{Z}}={\varvec{z}}]&=\sigma ^2\left( 1-{\varvec{r}}'{\varvec{R}}_{\phi }^{-1}{\varvec{r}}+\frac{{(1-{\varvec{1}}_n'{\varvec{R}}_{\phi }^{-1}{\varvec{r}})}^2}{{\varvec{1}}_n'{\varvec{R}}_{\phi }^{-1}{\varvec{1}}_n}\right) , \end{aligned}$$

with \({\varvec{r}} \in {\mathbb {R}}^n\) the correlation vector defined as \(\sigma ^2{\varvec{r}}=\left( \text {Cov}(Z({\varvec{x}}_0), Z({\varvec{x}}_j))\right) _{1 \le j \le n}\).
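These equations can be implemented directly. The following R sketch is purely illustrative, assuming known \(\sigma ^2\) and \(\phi \) and an exponential correlation function; it is not the implementation used in this work (geoR is used instead).

```r
## Direct transcription of the ordinary kriging equations above.
## X: n x 2 matrix of observed positions, z: observed values,
## x0: unobserved position, sigma2 and phi assumed known.
ok_predict <- function(x0, X, z, sigma2, phi) {
  n   <- nrow(X)
  C   <- function(h) exp(-h / phi)             # exponential correlation
  R   <- C(as.matrix(dist(X)))                 # n x n correlation matrix R_phi
  r   <- C(sqrt(colSums((t(X) - x0)^2)))       # correlations with x0
  Ri  <- solve(R)
  one <- rep(1, n)
  c1  <- as.numeric(t(one) %*% Ri %*% one)     # 1' R^-1 1
  c2  <- as.numeric(1 - t(one) %*% Ri %*% r)   # 1 - 1' R^-1 r
  lam <- Ri %*% (r + one * c2 / c1)            # kriging weights
  m   <- as.numeric(t(lam) %*% z)              # kriging mean
  v   <- sigma2 * as.numeric(1 - t(r) %*% Ri %*% r + c2^2 / c1)
  c(mean = m, var = v)                         # prediction and its variance
}
## Example call: ok_predict(c(5, 5), X, z, sigma2 = 0.1, phi = 4.5)
```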

A major concern when applying these equations is that they are conditional on the knowledge of the variance and range parameters, which is mostly unrealistic since these parameters are estimated. This assumption yields overoptimistic prediction variances and too-narrow predictive intervals. The problem is aggravated for small data sets, where parameter estimation is sensitive to each observation. To address this issue, Bachoc (2013b) uses a cross-validation procedure instead of MLE to estimate the model parameters in a more robust way, especially in the case of model misspecification. However, this approach still results in a single set of parameter values, tainted by an estimation error that is not taken into account. Another solution is to consider the parameters as random variables, and then to quantify and propagate their uncertainties through the kriging model. The Bayesian approach is natural for this purpose and leads to Bayesian kriging.

2.3 Bayesian Kriging Principles

Bayesian kriging deals simultaneously with estimation and prediction by considering the parameters as random variables that must be predicted conditionally on the observed data (Diggle and Ribeiro 2002). Bayesian kriging predictions are derived from the predictive distribution as

$$\begin{aligned} p_{Z({\varvec{x}}_0)}(Z({\varvec{x}}_0)|{\varvec{Z}}={\varvec{z}})&=\int _{D_{\beta }\times D_{\sigma ^2}\times D_{\phi }}p_{Z({\varvec{x}}_0),\beta ,\sigma ^2,\phi }(Z({\varvec{x}}_0),\beta ,\sigma ^2,\phi |{\varvec{Z}}={\varvec{z}})\,\textrm{d}\beta \,\textrm{d}\sigma ^2\,\textrm{d}\phi \\&=\int _{D_{\beta }\times D_{\sigma ^2}\times D_{\phi }}p_{Z({\varvec{x}}_0)}(Z({\varvec{x}}_0)|\beta ,\sigma ^2,\phi ,{\varvec{Z}}={\varvec{z}})\,p_{\beta ,\sigma ^2,\phi }(\beta ,\sigma ^2,\phi |{\varvec{Z}}={\varvec{z}})\,\textrm{d}\beta \,\textrm{d}\sigma ^2\,\textrm{d}\phi . \end{aligned}$$

The density \(p_{Z({\varvec{x}}_0)}(Z({\varvec{x}}_0)|\beta ,\sigma ^2,\phi ,{\varvec{Z}}={\varvec{z}})\) is known to be a Student's t-density under the assumption that the prior belongs to the same family as the one presented at the end of this section (as demonstrated in Le and Zidek 1992), but the integral is usually intractable. In practice, it must therefore be estimated numerically by simulation. One solution is to sample from the target distribution using a Monte Carlo approach; one such method is given in Tanner (1993) and is used in the geoR package (Ribeiro and Diggle 2001) of the R software. A slightly different approach relies on a Markov chain Monte Carlo (MCMC) algorithm, as described in Gaudard et al. (1999) and Carlin and Louis (2013). Algorithm 1 describes the procedure implemented in the geoR package, which will be used in the following to estimate the Bayesian prediction.

Algorithm 1 Monte Carlo approximation for Bayesian kriging

M is chosen large enough for the predictive distribution to be sufficiently well sampled. For our application cases, \(M=1000\). Finally, a joint prior distribution is chosen for \(\beta ,\sigma ^2, \phi \), namely

$$\begin{aligned} \pi (\beta ,\sigma ^2,\phi ) \propto \frac{1}{\sigma ^2}. \end{aligned}$$

The resulting parameter space is

$$\begin{aligned} D_\beta = {\mathbb {R}}, D_{\sigma ^2} = ]0,+\infty [, D_{\phi } = ]0,+\infty [. \end{aligned}$$

Note that a sensitivity analysis is presented in the Appendix (Sect. A) to explain our choice of priors.
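For illustration, a possible geoR call implementing this scheme is sketched below. The flat prior on \(\beta \) combined with the reciprocal prior on \(\sigma ^2\) corresponds to the joint prior above; the discretised uniform prior on \(\phi \), and its support, are practical assumptions of this sketch (the actual prior study is discussed in Appendix A).

```r
library(geoR)
## 'gd' is the geodata object from the Sect. 2.1 sketch; 'grid0' is a
## hypothetical grid of prediction locations.
grid0 <- expand.grid(x = seq(0, 10, by = 0.5),
                     y = seq(0, 10, by = 0.5))
bk <- krige.bayes(gd, locations = grid0,
        model  = model.control(cov.model = "matern", kappa = 0.5),
        prior  = prior.control(beta.prior    = "flat",
                               sigmasq.prior = "reciprocal",
                               phi.prior     = "uniform",          # discretised
                               phi.discrete  = seq(0.5, 10, length.out = 20)),
        output = output.control(n.posterior  = 1000,  # M = 1000 draws
                                n.predictive = 1000))
## Predictive summaries: bk$predictive$mean, bk$predictive$variance;
## posterior draws of (beta, sigma^2, phi): bk$posterior$sample.
```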

3 Validation Criteria

Choosing an “optimal” covariance model for geostatistical predictions is a classical issue in geostatistics (Chilès and Delfiner 2012).

This topic has recently been studied in depth in Demay et al. (2022), where different validation criteria are investigated to assess the quality of the model predictions, the reliability of the associated prediction variances and, more generally, the accuracy of the whole predictive law. Depending on the number of observations available, these criteria can be computed either on a test sample separate from the training sample or, as here, by cross-validation. Their expressions, with some new adaptations, are given in this section in their leave-one-out cross-validation form. The extension to K-fold cross-validation or to test-set cases is immediate.

3.1 Predictivity Coefficient (\(Q^2\))

The main goal of this coefficient, often called the Nash–Sutcliffe criterion (Nash and Sutcliffe 1970), is to evaluate the predictive accuracy of the model by normalising the errors, allowing a direct interpretation in terms of explained variance. Its practical definition (Marrel et al. 2008) is

$$\begin{aligned} Q^2=1-\frac{\sum _{i=1}^{n}(z({\varvec{x}}_i) - {\widehat{z}}_{-i})^2}{\sum _{i=1}^{n}(z({\varvec{x}}_i) - {\widehat{\mu }})^2}, \end{aligned}$$

where \({\widehat{z}}_{-i}\) is the value predicted at location \({\varvec{x}}_i\) by the model built without the ith observation (the one located at \({\varvec{x}}_i\)), and \({\widehat{\mu }}\) is the empirical mean of the data set. Its theoretical definition can be found in Fekhari et al. (2023).

The \(Q^2\) coefficient measures the quality of the predictions, that is, how close they are to the observed values. Its formula is similar to the coefficient of determination used for regression (with independent observations), but it is estimated here in prediction (using cross-validation residuals). The closer its value is to 1, the better the predictions are (relative to the observations).
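In practice, the leave-one-out quantities can be obtained with the xvalid function of geoR, after which the \(Q^2\) follows in one line. A sketch, reusing the hypothetical objects gd and fit introduced in Sect. 2.1, is

```r
## Leave-one-out cross-validation with geoR, then Q2.
xv    <- xvalid(gd, model = fit)   # LOO is the default behaviour
z_obs <- xv$data                   # observations z(x_i)
z_loo <- xv$predicted              # predictions without the ith observation
q2 <- 1 - sum((z_obs - z_loo)^2) / sum((z_obs - mean(z_obs))^2)
```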

3.2 Predictive Variance Adequacy (PVA)

This second criterion aims to quantify the quality of the prediction variances given by the kriging model. Studied in detail in Bachoc (2013a, 2013b) and Demay et al. (2022), it is defined by

$$\begin{aligned} \textrm{PVA} = \left| \log \left( \frac{1}{n}\sum _{i=1}^{n}\frac{(z({\varvec{x}}_i)-{\widehat{z}}_{-i})^2}{{\widehat{s}}_{-i}^2}\right) \right| , \end{aligned}$$

where \({\widehat{s}}_{-i}^2\) is the prediction variance (at location \({\varvec{x}}_i\)) of the model built without the ith observation (the one located at \({\varvec{x}}_i\)).

This coefficient estimates the average ratio between the squared observed prediction errors and the prediction variances. It therefore indicates how much larger or smaller the prediction variances are than expected. The closer the PVA is to 0, the better the prediction variances are. For example, a PVA close to 0.7 (\(\approx \log 2\)) indicates prediction variances that are on average two times larger or smaller than the squared errors.
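A sketch of the corresponding computation, reusing the leave-one-out output xv of Sect. 3.1, could read

```r
## PVA from leave-one-out quantities ('krige.var' holds the prediction
## variances of the models built without each observation).
s2_loo <- xv$krige.var
pva <- abs(log(mean((z_obs - z_loo)^2 / s2_loo)))
```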

3.3 Predictive Interval Adequacy (PIA)

The PVA is a criterion of variance adequacy but does not take into account a possible skewness in the predictive distribution. In the Gaussian case (as in ordinary kriging), the mean and variance completely characterise the distribution. But in the case of Bayesian kriging, where the predictive distribution is no longer Gaussian, the \(Q^2\) and PVA are not sufficient to evaluate the quality of the model and its predictions. We therefore propose a new complementary geometrical criterion, called the predictive interval adequacy (PIA), defined as

$$\begin{aligned} \textrm{PIA}=\left| \log \left( \frac{1}{n}\sum _{i=1}^{n}\frac{(z({\varvec{x}}_i)-{\widehat{z}}_{-i})^2}{\left( {\widehat{q}}_{0.31,-i}-{\widehat{q}}_{0.69,-i} \right) ^2} \right) \right| , \end{aligned}$$

where \({\widehat{q}}_{0.31,-i}\) (respectively \({\widehat{q}}_{0.69,-i}\)) is the estimated quantile of order 0.31 (respectively 0.69) of the predictive distribution (at location \({\varvec{x}}_i\)) of the model built without the ith observation.

The PIA is defined so as to be identical to the PVA for a Gaussian predictive distribution. However, rather than comparing squared errors to the predictive variance, it compares them with the squared width of a predictive interval. Another difference is that the intervals considered by the PIA are centred on the median, while those of the PVA are centred on the mean. Finally, computing this criterion in practice requires an estimation of the predictive distribution, whereas the PVA only requires the predictive mean and variance.
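A possible computation is sketched below; the matrix sims_loo of leave-one-out predictive draws is an assumed input (e.g., collected from krige.bayes predictive simulations).

```r
## PIA sketch: 'sims_loo' is an M x n matrix whose ith column contains
## draws from the leave-one-out predictive distribution at x_i.
q31 <- apply(sims_loo, 2, quantile, probs = 0.31)
q69 <- apply(sims_loo, 2, quantile, probs = 0.69)
pia <- abs(log(mean((z_obs - z_loo)^2 / (q69 - q31)^2)))
```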

3.4 \(\alpha \)-CI Plot

The Gaussian process model allows us to build predictive intervals of any level \(\alpha \in ]0,1[\) written as

$$\begin{aligned} \textrm{CI}_{\alpha }(z({\varvec{x}}_i))=\left[ {\widehat{z}}_{-i}-{\widehat{s}}_{-i}q_{(1+\alpha )/2}^{{\mathcal {N}}};\,{\widehat{z}}_{-i}+{\widehat{s}}_{-i}q_{(1+\alpha )/2}^{{\mathcal {N}}}\right] , \end{aligned}$$

where \(q^{{\mathcal {N}}}_{(1+\alpha )/2}\) is the quantile of order \((1+\alpha )/2\) of the standard normal distribution. This expression is only valid if all parameters are known. For example, if the variance parameter is poorly estimated, the width of the predicted confidence intervals will not reflect what is actually observed. But how can a predictive interval be validated without prior knowledge of the model parameters? The idea behind this criterion (see Marrel et al. 2012; Demay et al. 2022) is to evaluate empirically the proportion of observations falling into the predictive intervals and to compare this empirical estimate with the theoretical value expected, with

$$\begin{aligned} \varDelta _{\alpha }=\frac{1}{n}\sum _{i=1}^{n}\delta _i\quad \text { where }\quad \delta _i= \left\{ \begin{array}{ll} 1 &{}\quad \text { if } z({\varvec{x}}_i)\in \textrm{CI}_{\alpha }(z({\varvec{x}}_i))\\ 0 &{}\quad \text { otherwise. } \end{array} \right. \end{aligned}$$

This value can be computed for varying \(\alpha \) and then plotted against the theoretical values, yielding what Demay et al. (2022) call the \(\alpha \)-CI plot; an example is given in Fig. 1.
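For the Gaussian case, \(\varDelta _{\alpha }\) can be estimated as follows (a sketch reusing the leave-one-out quantities defined above).

```r
## Empirical coverage Delta_alpha for the Gaussian (ordinary kriging) case.
delta_alpha <- function(alpha, z_obs, z_loo, s2_loo) {
  q <- qnorm((1 + alpha) / 2)                  # standard normal quantile
  mean(abs(z_obs - z_loo) <= sqrt(s2_loo) * q) # proportion covered
}
delta_alpha(0.9, z_obs, z_loo, s2_loo)         # e.g., 90% intervals
```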

Fig. 1 Example of two \(\alpha \)-CI plots and corresponding values of MSE\(\alpha \)

Similarly to the PIA, the \(\alpha \)-CI plot must be adapted to Bayesian kriging since the predictive distribution is no longer Gaussian. We therefore introduce a slightly different criterion based on the quantiles of the predictive distribution. More precisely, this modified \(\alpha \)-CI plot now relies on credible intervals defined as

$$\begin{aligned} {\widetilde{\text {CI}}}_{\alpha }(z({\varvec{x}}_i))=\left[ {\widehat{q}}_{\frac{1-\alpha }{2},-i};\,{\widehat{q}}_{\frac{1+\alpha }{2},-i}\right] , \end{aligned}$$

where \({\widehat{q}}_{\frac{1-\alpha }{2},-i}\) (respectively \({\widehat{q}}_{\frac{1+\alpha }{2},-i}\)) is the estimated quantile of order \(\frac{1-\alpha }{2}\) (respectively \(\frac{1+\alpha }{2}\)) of the predictive distribution (at location \({\varvec{x}}_i\)) of the model built without the ith observation. Once again, the criterion is identical for both methods when the predictive distribution is Gaussian.

3.5 Mean Squared Error \(\alpha \) (MSE\(\alpha \))

Finally, to summarise the \(\alpha \)-CI plot, we also introduce a quantitative criterion called “mean squared error \(\alpha \)” and defined as

$$\begin{aligned} \textrm{MSE}\alpha = \frac{1}{n_{\alpha }}\sum _{j=1}^{n_{\alpha }}(\varDelta _{\alpha _j}-\alpha _j)^2, \end{aligned}$$

where the considered levels \(\alpha \) are discretised over ]0, 1[ into \(n_\alpha \) values. In practice, a regular discretisation is used to compute MSE\(\alpha \). The closer this criterion is to 0, the better the predictive/credible intervals are on average. To illustrate the values taken by the criterion, Fig. 1 gives the \(\alpha \)-CI plots corresponding to a “good” and a “bad” model fit. In this graph, the bad model yields an MSE\(\alpha \) of 0.0101, against 0.0013 for a model with more accurate predictive intervals. In the context of dismantling and decommissioning of nuclear sites, an MSE\(\alpha \) of 0.01 will be considered to correspond to a model with wrong predictive intervals, while a model with an MSE\(\alpha \) of 0.001 will be deemed to have correct predictive intervals. Similarly to the PVA, the MSE\(\alpha \) does not explain whether poorly fitted predictive intervals are due to badly centred intervals or to a badly estimated predictive variance (nor whether that variance was underestimated or overestimated). This criterion must therefore be used in conjunction with the previous criteria to better assess the model's qualities and weaknesses. Finally, it also offers a quantitative tool for comparing different models when the \(\alpha \)-CI plots do not clearly distinguish the performance of competing models. This will be illustrated in the numerical tests of Sect. 4.2 (Fig. 8).
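A sketch of this computation, reusing the delta_alpha function of Sect. 3.4 (the grid of levels chosen here is an arbitrary example of a regular discretisation), is

```r
## MSE_alpha over a regular discretisation of ]0, 1[.
alphas    <- seq(0.05, 0.95, by = 0.05)        # n_alpha = 19 levels
delta_hat <- sapply(alphas, delta_alpha,
                    z_obs = z_obs, z_loo = z_loo, s2_loo = s2_loo)
mse_alpha <- mean((delta_hat - alphas)^2)
```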

The different aforementioned criteria provide complementary information to evaluate the prediction quality of the kriging model, either in terms of mean, variance or predictive/credible intervals. They will be used in the following to compare the performance of ordinary and Bayesian kriging.

4 Numerical Tests and Results

Our goal is to compare Bayesian and ordinary kriging, the latter being the more commonly used kriging method (the R code corresponding to these tests is available at https://gitlab.com/biooss/r-code-for-wieskotten-et-al-2023-paper). To do so, the different criteria of Sect. 3 are computed on data sets (i.e., samples of observations) of different sizes, coming from different models. Parameter estimation results are not discussed further here; an analysis is given in Appendix B.

4.1 Data Sets from Two-Dimensional Gaussian Process Simulations

First, we consider samples simulated from an analytical Gaussian process model with known parameters. More precisely, the samples are simulated in the input space \([0,10]^2\) from a Gaussian process with an exponential covariance (i.e., the Matérn covariance of Eq. (1) with \(\nu =0.5\)) and the parameters

$$\begin{aligned} \beta = 0.5,\quad \sigma ^2=0.1,\quad \phi =4.5. \end{aligned}$$

We simulate data sets of different sizes, varying from 16 to 81 observations, sampled on regular square grids in the input space. This choice complies with the application context of D&D of buildings: most of the time, radiological measurements inside buildings are made at regular (equidistant) locations along investigation lines (see, e.g., Attiogbe et al. 2014; EPRI 2016). For each size, the process is repeated 100 times with independent random Gaussian process simulations.
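One replicate of this protocol could be simulated as follows (a sketch with geoR; the 6 × 6 grid shown here corresponds to one of the sizes considered, 36 observations).

```r
library(geoR)
## Zero-mean Gaussian random field with exponential covariance on a
## regular k x k grid over [0, 10]^2, shifted by the mean beta = 0.5.
k    <- 6
grid <- as.matrix(expand.grid(seq(0, 10, length.out = k),
                              seq(0, 10, length.out = k)))
sim  <- grf(grid = grid, cov.model = "exponential",
            cov.pars = c(0.1, 4.5))            # (sigma^2, phi)
z    <- 0.5 + sim$data                         # observations z(x_i)
```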

For each data set, Bayesian and ordinary kriging models are estimated, and the different validation criteria are computed by cross-validation. Every kriging prediction (Bayesian and ordinary) is made with the R package geoR (Ribeiro and Diggle 2001). Results are given in Fig. 2 as box plots (over the 100 random replicates) with respect to the data set size.

The results for the validation criteria indicate that Bayesian kriging performs better in terms of both mean and prediction variance for small sample sizes. More precisely, Bayesian kriging outperforms ordinary kriging on most criteria for data sets with fewer than 40 observations (with the exception of the PIA, for which ordinary kriging outperforms Bayesian kriging at 36 observations).

Fig. 2 Distribution of validation criteria (\(Q^2\), PVA, PIA and MSE\(\alpha \)) with respect to the size of data sets, for Gaussian process simulation data sets

Looking first at the median \(Q^2\) values, these increase with the data size from \(-0.07\) to 0.64 for ordinary kriging. Bayesian kriging gives better \(Q^2\) for smaller data sets, starting from a median value of 0.10 and rising to 0.64. For a fixed sample size, the dispersion of \(Q^2\) is quite similar between the two kriging methods (for example, the standard deviation is 0.21 for both methods at 36 observations).

Regarding the median PVA, the values range from 0.25 down to 0.04 for ordinary kriging, compared to 0.14 down to 0.06 for Bayesian kriging. For the PIA, the results are identical for ordinary kriging, but Bayesian kriging performs slightly worse, starting at 0.21 and decreasing to 0.05. The dispersion of the PVA and PIA estimates also differs between the two kriging methods for small data sets. This is explained by the fact that the PVA and PIA are sensitive to the parameter estimation process: since the number of observations is low, maximum likelihood estimates are not robust, yielding large variations in parameter estimates, and therefore in PVA and PIA estimates. Finally, we observe that for data sets of 49 observations or more, Bayesian kriging seems to perform slightly worse than ordinary kriging.

The MSE\(\alpha \) graph shares similarities with the other graphs, since predictive and credible intervals both depend on the prediction mean and variance. For ordinary kriging, the median MSE\(\alpha \) ranges from 0.0072 to 0.0012, while for Bayesian kriging the values are lower for small data sets, ranging from 0.0063 to 0.0015. The evolution is similar to that of the PVA and PIA, with Bayesian kriging yielding better results for smaller data sets.

It can also be noted that for larger data sets, Bayesian kriging yields slightly worse results. It can therefore be argued that Bayesian kriging becomes less advantageous and relevant for data sets with more than 40 observations. Note that the \(Q^2\) values are also extremely low for 49 or fewer observations, but this is to be expected for very small data sets.

4.2 Data Sets from a Two-Dimensional Deterministic Function

In order to test the kriging models on cases that do not fall within the theoretical framework of the Gaussian process hypothesis, we consider a sample coming from the following two-dimensional deterministic function (Iooss et al. 2010)

$$\begin{aligned} f(x,y)=\frac{e^x}{5}-\frac{y}{5}+\frac{y^6}{3}+4y^4-4y^2+\frac{7x^2}{10}+x^4+\frac{3}{4x^2+4y^2+1}, \end{aligned}$$ (2)

where \((x,y)\) are the function inputs. Figure 3 shows this function over the input space \(D=[-1,1]^2\).
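For reference, a direct R transcription of f is given below (the evaluation grid is only for visualisation, as in Fig. 3).

```r
## The deterministic test function f of Eq. (2).
f <- function(x, y) {
  exp(x) / 5 - y / 5 + y^6 / 3 + 4 * y^4 - 4 * y^2 +
    7 * x^2 / 10 + x^4 + 3 / (4 * x^2 + 4 * y^2 + 1)
}
## Evaluation on a fine grid over [-1, 1]^2
g <- expand.grid(x = seq(-1, 1, length.out = 101),
                 y = seq(-1, 1, length.out = 101))
z <- f(g$x, g$y)
```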

Fig. 3 Illustration of the deterministic function f (Iooss et al. 2010)

We consider two steps for studying this test function. First, the validation criteria are used to compare the results obtained by using different covariance functions in order to identify the most appropriate one for the data set (as done in Demay et al. 2022).

For this first step, a regular square grid of 144 observations is used to sample the input space. On this data set, the ordinary kriging model is fitted with different covariance functions, namely three Matérn covariances and the Gaussian covariance, the latter with a nugget effect of \(10^{-6}\) (to improve the numerical stability of the covariance matrix inversion). For each of these covariances, the validation criteria are estimated by cross-validation. The results are presented in Table 1 for ordinary kriging, in Table 2 for Bayesian kriging and in Fig. 4.

The main goal of this procedure is to identify the covariance function reliably, so that this choice raises no concern in the rest of our study. A fairly large data set of 144 observations is therefore used to ensure a sound analysis of the covariance function through the aforementioned validation criteria.

Table 1 Validation criteria for the ordinary kriging with different covariance functions, on the sample of \(n=144\) observations of function f
Table 2 Validation criteria for the Bayesian kriging with different covariance functions, on the sample of \(n=144\) observations of function f
Fig. 4 \(\alpha \)-CI plots for the ordinary and Bayesian kriging with different covariance functions, on the sample of \(n=144\) observations of function f

The results show that, in this case, the Gaussian covariance is the most appropriate covariance function with respect to the different criteria. This is not surprising, since the test function is smooth and shows large correlations between observations. Although the differences in \(Q^2\) are very small between the Gaussian and Matérn models (except for the Matérn 1/2 model), significant differences appear for the PVA and PIA. These differences become smaller for the MSE\(\alpha \). This shows the importance of using several criteria simultaneously for a better assessment of model performance and accuracy.

Once the covariance model is chosen (the Gaussian one in this case), we can apply a test protocol similar to that of Sect. 4.1. Since the function is deterministic, choosing a specific geometry for a fixed data set size would not allow us to generate different data sets, so the protocol must be slightly modified. We therefore discard the regular grid here and sample random positions in the input space. This allows us to generate different data sets from the same deterministic function, even though such random sampling would not be recommended in practice; the dispersion observed in the results of this section is affected by this choice. The sampling is repeated 100 times for each data set size, up to 150 observations.

The results are presented in Fig. 5. The values of the \(Q^2\) criterion lead to the same conclusions as for the Gaussian process data of the previous section: Bayesian kriging again performs better, especially for small sample sizes. Note that the \(Q^2\) values are higher than in the previous test case, owing to the high regularity of the function f.

Fig. 5 Distribution of validation criteria (\(Q^2\), PVA, PIA and MSE\(\alpha \)) with respect to the size of data sets, for the deterministic function f

Significant differences arise with the PVA, PIA and MSE\(\alpha \) criteria. Indeed, these criteria do not decrease steadily and monotonically with the number of observations. Moreover, they behave differently depending on the type of kriging. More precisely, for Bayesian kriging, the PVA, PIA and MSE\(\alpha \) increase between 20 and 50 observations before decreasing, whereas they keep increasing for ordinary kriging. For data sets of 50 observations or fewer, Bayesian kriging seems to underperform ordinary kriging, but it outperforms ordinary kriging for more than 50 observations. Still, once the size of the data sets exceeds 80 observations, we observe results similar to those obtained with the simulated data sets.

To explain these results, recall that the initial assumption, whereby the function f is a trajectory of a Gaussian process, is not verified here, at least for data sets of 50 observations or fewer. It is therefore possible to obtain poorer criteria as the data set size increases. The prediction accuracy remains good, since the median \(Q^2\) stays between 0.7 and 1 for all data set sizes and kriging methods, but the predicted variances do not seem very accurate, yielding poorly estimated predictive and credible intervals. Once the data set size exceeds 80 observations, the evolution of the validation criteria shows that the initial assumption becomes acceptable.

In conclusion, Bayesian kriging outperforms ordinary kriging on average in this case, where the initial assumption of a Gaussian random field does not hold. Caution is still advised, since in some cases ordinary kriging seems to perform better than Bayesian kriging, as illustrated by the data sets of \(n=40\) or \(n=50\) observations. The conclusion of Sect. 4.1 cannot be repeated identically here, because for small data sets Bayesian kriging does not consistently give better validation criteria.

5 Real Application Case: G3’s Data Set

This data set is made of 70 observations of radioactivity measurements from the decommissioning project of the CEA Marcoule G3 reactor (CEA 2009). They are sampled in the input domain \([0,6] \times [0,4]\). The data set is mapped in Fig. 6.

Fig. 6 Mapping of G3 observations

Figure 7 shows the predictions of Bayesian and ordinary kriging for a given data set of \(n=20\) observations (randomly sampled from the original data set). More precisely, the prediction maps obtained with ordinary and Bayesian kriging are given, with an exponential covariance for both models. The figure highlights the differences between the two predictions. A difference appears between the predicted standard deviations, which are much higher for Bayesian kriging. This is explained by the fact that for a small number of observations, Bayesian kriging takes more uncertainty into account, resulting in higher prediction variances. In the practice of D&D projects, this has a direct impact, since the estimates (or more precisely the upper quantiles or margins given by the predictive law) will be more conservative. Note that as the sample size increases, the differences between the Bayesian and ordinary kriging maps are no longer visible. Indeed, the uncertainty of parameter estimation (taken into account only by Bayesian kriging) becomes negligible compared to the interpolation uncertainty (common to the two kriging methods).

Fig. 7 Predictions for 20 observations sampled from the original data set with ordinary and Bayesian kriging

Let us now examine the effects of varying sample sizes and covariance models. A test protocol similar to that of Sect. 4 is applied to assess the behaviour of the kriging models according to n. First, let us consider ordinary kriging with different covariance functions, applied to the initial set of 70 observations. The validation criteria estimated by cross-validation are given in Table 3 and Fig. 8; for Bayesian kriging, they are given in Table 4 and Fig. 8. The results indicate that the Matérn 1/2 model is the best choice with regard to our criteria, since it maximises the \(Q^2\) criterion while minimising both the PVA and PIA criteria (it also performs well for the MSE\(\alpha \) criterion, although it does not minimise it overall). Therefore, only the Matérn 1/2 covariance function is considered in the following.

Table 3 Validation criteria for the ordinary kriging with different covariance functions, on the G3 sample of \(n=70\) observations
Table 4 Validation criteria for the Bayesian kriging with different covariance functions, on the G3 sample of \(n=70\) observations
Fig. 8 \(\alpha \)-CI plots for the ordinary and Bayesian kriging with different covariance functions, on the G3 sample of \(n=70\) observations

To generate multiple data sets, we resample without replacement data sets of various sizes (\(n=20,30,40,50,60,70\)), the last one being the original data set. Once again, the process is repeated 100 times for each sample size (except for 70 observations), and for each sample, cross-validation is applied to estimate the validation criteria.
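This resampling step is straightforward; a sketch (with g3 an assumed geodata object holding the 70 observations) is

```r
## 100 subsamples of size n drawn without replacement from the 70
## G3 observations (here n = 20; the seed is arbitrary).
set.seed(1)
subsamples <- lapply(1:100, function(r) {
  idx <- sample(70, size = 20)
  list(coords = g3$coords[idx, ], data = g3$data[idx])
})
```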

The obtained results are summarised in Fig. 9. For the \(Q^2\) criterion, the median values increase from about 0 for the smallest data sets to 0.38 (\(n=70\)) for both kriging methods. Slightly higher values are obtained for Bayesian kriging, especially for small sample sizes. The dispersion of \(Q^2\) is similar between the two kriging methods. The \(Q^2\) estimates obtained here are very low, which would normally mean that the model is not predictive enough. As our objective is only to compare the kriging methods, this issue is not investigated further here.

Fig. 9 Distribution of validation criteria (\(Q^2\), PVA, PIA and MSE\(\alpha \)) with respect to the size of data sets, for the G3 data set

Regarding the PVA, the median values decrease from 0.47 to 0.16 for ordinary kriging, compared to much lower values for Bayesian kriging, namely from 0.19 to 0.06. For the PIA, the values are very close to those of the PVA. For the MSE\(\alpha \), the median values range from 0.011 to 0.0017 for ordinary kriging, against 0.008 down to 0.0017 for Bayesian kriging. Once again, Bayesian kriging yields better results, especially for smaller data sets. The results of the two methods then become almost identical for data sets of 40 or more observations, as is especially visible for the MSE\(\alpha \).

We can also remark that the variance of each validation criterion decreases as the data set size grows. This is explained both by the larger data sets and by our protocol: since observations are randomly drawn without replacement from the original 70 observations, the samples differ less and less as the data set size increases.

6 Discussion and Conclusions

In conclusion, the use of Bayesian kriging for the spatial interpolation of data sets in support of decommissioning and dismantling projects shows promising results. Its main advantage is that it takes into account the uncertainty of the parameters of the kriging model. The results obtained in the three application cases show that, on average, Bayesian kriging outperforms ordinary kriging. Still, the second case (dealing with a deterministic function) gives a clear and interesting counter-example. Even though this result can be explained by the fact that the Gaussian assumption is not verified there, it advocates for a cautious use of Bayesian kriging. As the sample size increases, ordinary kriging, which is less computationally expensive, becomes preferable for large data sets. Bayesian kriging also has the drawback of requiring a prior specification, which is often difficult to choose and can strongly influence the predictions. Therefore, the use of Bayesian kriging should be restricted to smaller data sets or to cases in which prior information on the parameters is well known.

Another important advantage of Bayesian kriging is that it allows us to evaluate the information brought by the data on the parameter characterisation (e.g., by comparing their prior and posterior distributions) and to split the prediction uncertainty between the data interpolation uncertainty and that coming from the parameters. It then allows us to judge whether the latter is negligible compared to the former, in order to bring some confidence in this statistical tool to the user. Another fruitful perspective is that the evolution of the posterior distribution could be used to define a new design of experiments, allowing comparison of the information brought by new observations.

In our work, the nugget effect was not used as a modelling tool but only as a regularisation of the Gaussian covariance function. Future work will aim at adding this parameter to the model. This could be taken further by considering a heteroscedastic model (Ng and Yin 2012), since the usual nugget effect corresponds to a homoscedastic model. This could be extremely useful in the framework of D&D of nuclear sites, since radioactive measurements are prone to varying measurement uncertainties depending on the measuring technique.

The results presented in this paper also show that the main differences between the two kriging methods lie in the prediction variances, which are often larger with Bayesian kriging. This can lead to predictions with more conservative associated uncertainties, potentially increasing the difficulty of decision-making. However, this disadvantage must be put into perspective in the framework of D&D projects, because in this context it is preferable, for safety reasons, to overestimate contamination rather than underestimate it.