1 Introduction

When considering a spatial process, the dependence of the process is typically modeled by its covariance as a function of spatial locations, and stationarity is often further assumed, meaning that the covariance is a function of the difference between two spatial locations. The dependence structure of the spatial process affects estimation in spatial regression and spatial prediction. Statistical methods for analyzing spatial data enable us not only to make statistical inference about the spatial distribution, but also to predict values of a variable of interest at unmonitored locations (Cressie 1993; Gelfand et al. 2010; Stein 1999), which requires a covariance estimate obtained in either a parametric or a non-parametric way. An empirical covariance function, estimated non-parametrically at different lags, can be used to construct the covariance matrix, but positive definiteness may fail depending on the estimation method (Cressie 1993). Instead, one can assume a parametric covariance function such as the spherical, Matérn, or powered exponential and estimate its unknown parameters by least squares or maximum likelihood. However, this approach is vulnerable to misspecification of the covariance structure, which may cause inaccurate inference for regression coefficients or poor prediction of the variable of interest. It also requires a matrix inversion, which imposes a great computational burden for high-dimensional spatial data since the spatial covariance matrix is dense (Aune et al. 2014; Gelfand et al. 2010). Furthermore, the typical isotropy assumption is rather restrictive, as it is frequently violated in real-world applications.

Recently developed methodologies have made great progress in fitting complicated spatial distributions and handling large spatial data. Low-rank approximation approaches represent the spatial process as a linear combination of a fixed number of a priori designed basis functions (Cressie and Johannesson 2008; Katzfuss and Cressie 2011; Stein 2008). Lattice kriging considers multiresolution radial basis functions, which results in faster computation (Nychka et al. 2002, 2015). The predictive process approach (Banerjee et al. 2008; Finley et al. 2009) uses a set of knot locations on which the process is approximated through a basis function representation under a Bayesian framework. Multiresolution approximations (Katzfuss 2017; Katzfuss and Gong 2017) also use basis function representations with compactly supported basis functions at different resolutions, which can be adapted to any given covariance function. Stochastic partial differential equation approaches (Lindgren et al. 2011) approximate a Gaussian process with a Matérn covariance function by a Markov random field, which brings efficient likelihood calculation.

Unlike the approaches that rely on a basis function representation of the spatial process, covariance tapering creates a sparse covariance matrix by multiplying the covariance function by a compactly supported one to increase computational efficiency; such approximations are theoretically investigated in Furrer et al. (2006), Kaufman et al. (2008) and Du et al. (2009). Spatial partitioning also creates a sparse covariance matrix by assuming independence between observations across partitioned subregions, and various partitioning schemes are suggested in Sang et al. (2011), Kim et al. (2005), Heaton et al. (2017) and Konomi et al. (2014). The nearest-neighbor Gaussian process uses a conditional specification of the spatial process to induce a sparse covariance structure, which enables efficient computation (Datta et al. 2016).

There are also algorithmic approaches to handling spatial data. Metakriging (Guhaniyogi and Banerjee 2018) is an approximate Bayesian method that combines subset posteriors from partitioned locations into a pooled posterior. The gapfill method (Gerber et al. 2018) is purely algorithmic and distribution-free: it selects a subset of observations in a neighborhood of the prediction location and predicts using sorting algorithms and quantile regression. The local approximate Gaussian process approach (Gramacy and Apley 2015) focuses on prediction by training a Gaussian process predictor on observations near the prediction location, selected by a criterion related to mean squared prediction error; the algorithm adaptively chooses the number of neighbors used for training.

An alternative way to study spatially structured processes is the spectral representation approach. Any stationary process can be represented as a superposition of random harmonic oscillations, i.e. by some modification of the conventional Fourier integral (Whittle 1954; Yaglom 1987). Spectral analysis is the study of the spectral measure or spectral density function, which gives the Fourier coefficients of the sinusoidal components of a covariance function (Gelfand et al. 2010). Once we define a spectral density for a covariance function in the space domain, spatial dependence can also be modeled through the spectral density because of the one-to-one correspondence between the two. In time series analysis, spectral methods are widely studied and the related theory is well established (Brillinger 2001; Priestley 1981). In particular, several studies of spectral density estimation and its application in time series regression have already been presented (Carter and Kohn 1997; Choudhuri et al. 2004; Dey et al. 2018). If we regard a temporal structure as a one-dimensional spatial structure, many aspects of spectral methods in time series analysis can be generalized to processes in more than one dimension. Moreover, in a wide range of applications the data are available on a grid, so spectral methods apply naturally.

Royle and Wikle (2005) and Paciorek (2007) consider representing a spatial process through a spectral process so that the corresponding covariance matrix decomposes into an orthogonal matrix of Fourier basis functions and a diagonal matrix of spectral density values. This construction yields more efficient computation, but it relies on a parametric model of the spectral density. Reich and Fuentes (2012) used a Dirichlet process prior for the spectral density so that the resulting covariance function is flexible. Guinness and Fuentes (2017) consider a discrete spectral approximation of the covariance function so that the approximated covariance matrix has a nested block circulant structure, which is computationally efficient, and circulant embedding can be done at a smaller size than in Stroud et al. (2017); however, this approach again relies on a parametric model of the spectral density. Guinness (2019) proposes an iterative imputation approach to estimate the spectral density non-parametrically from incomplete lattice data, and this line of work has been extended to multivariate and spatial-temporal data (Guinness 2018).

We introduce non-parametric modeling of the spectral density under a Bayesian framework by placing a Gaussian process prior on the log spectral density, which makes estimation of the spatial covariance matrix more flexible. Prediction is made concurrently during Bayesian inference. Our work extends Carter and Kohn (1997) and Dey et al. (2018) in that we consider a spatial process, and we further extend the method to handle incomplete lattice data. Given the Gaussian process prior, we expect our approach to produce robust prediction results regardless of the covariance structure, as we assume neither a parametric nor an isotropic model for the covariance function. The works by Guinness (2019, 2018) are comparable to ours in that the spectral density is estimated non-parametrically from incomplete lattice data, but they have a different flavor, as we handle estimation within a Bayesian framework. The work by Reich and Fuentes (2012) is a Bayesian approach that proposes flexible modeling of the spectral density, but it can be computationally demanding due to the nature of posterior sampling with a Dirichlet process prior.

In the empirical results section, we compare our approach with a parametric Bayesian approach in a simulation study and find that our approach performs well for smooth processes. We then compare our approach with several other methods in terms of prediction using two real datasets, where it performs reasonably well in terms of mean absolute error and root mean squared error.

The rest of the paper is organized as follows: Sect. 2 introduces spectral methods and describes the model in detail. Section 3 provides simulation results for estimation and prediction under various scenarios, together with real data analyses of two Korean ozone exposure studies. Section 4 provides a conclusion and related discussion. The code implementing the proposed method is available at https://github.com/junpeea/NSBSR.

2 Models and methods

2.1 Preliminaries

A spatially distributed variable is typically modeled as a continuously indexed stochastic process, \(\{Y(\varvec{s}): \varvec{s} \in D \subset R^d\}\), where D is the study region of interest and \(\varvec{s}\) denotes a coordinate point in D. Under a common regression structure, we consider \(Y(\varvec{s})= \mu (\varvec{s};X)+ \epsilon (\varvec{s})\), where \(\mu (\varvec{s};X)\) is a deterministic mean function including explanatory variables X, with the popular choice \(\mu (\varvec{s};X) = X(\varvec{s})\varvec{\beta }\), and \(\epsilon (\varvec{s})\) is a zero mean stationary spatial process with a spatial dependence structure. We can further decompose \(\epsilon (\varvec{s})= \sigma _{\epsilon } e(\varvec{s})\), where \(\sigma _{\epsilon }^2=Var(\epsilon (\varvec{s}))\) is a marginal variance and \(e(\varvec{s})\) is a normalized process characterized by a correlation function \(c(\cdot )\) such that \(Cov(e(\varvec{s}),e(\varvec{t}))= c(\varvec{s}- \varvec{t})\), which is a common assumption for spatial data. In other words, we consider the following model.

$$\begin{aligned} Y(\varvec{s}) = X(\varvec{s})\varvec{\beta }+ \sigma _{\epsilon }e(\varvec{s};c),\ \varvec{s} \in D \subset R^d. \end{aligned}$$
(1)

In addition, we assume that Y is a Gaussian process, which is widely accepted for tractable modeling.

We define \(e(\cdot )\) in (1) as a zero mean stationary Gaussian process in \(R^d\) with a correlation function \(c(\cdot )\). Under the additional assumption of mean square continuity, the correlation function can be represented in the following Fourier integral form

$$\begin{aligned} c(\varvec{s}) = \int _{R^d}\exp (\iota \varvec{w}^{t} \varvec{s})F(d \varvec{w}), \end{aligned}$$
(2)

where F is a positive finite measure called a spectral measure. We further assume that F is absolutely continuous so that it has a Radon–Nikodym derivative with respect to Lebesgue measure, \( f = \frac{dF}{d \varvec{w}}\), which is called a spectral density. The spectral density can be recovered by inverse Fourier transformation from \(c(\cdot )\):

$$\begin{aligned} f(\varvec{w}) = \frac{1}{(2\pi )^d}\int _{R^d}\exp (-\iota \varvec{w}^{t} \varvec{s})c(\varvec{s})d \varvec{s}. \end{aligned}$$
(3)
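As a concrete example (stated up to a normalizing constant, which we omit to avoid parameterization ambiguity), the Matérn correlation with smoothness \(\alpha \) and range \(\phi \) used later in Sect. 3 has the well-known spectral density (Stein 1999)

$$\begin{aligned} f(\varvec{w}) \propto \left( \frac{1}{\phi ^2} + \Vert \varvec{w}\Vert ^2\right) ^{-(\alpha + d/2)}, \end{aligned}$$

so the exponential covariance (\(\alpha = 1/2\)) in \(d=2\) corresponds to \(f(\varvec{w}) \propto (\phi ^{-2}+\Vert \varvec{w}\Vert ^2)^{-3/2}\).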

The periodogram is a well-known non-parametric estimate of the spectral density for data observed on a regularly spaced lattice. For a two-dimensional domain (\(d=2\)), assume that the observed data lie on an \(n_1 \times n_2\) regular grid over a rectangular study region \(D \subset R^2\). Let \(\varDelta = (\delta _1,\delta _2)\) be the spacing between neighboring observations in each direction. Then the periodogram is defined as follows:

$$\begin{aligned} {\mathcal {I}}_{n_1n_2}(w_1,w_2)= \frac{1}{4\pi ^2n_1n_2} \Big |{\mathcal {D}}_{n_1n_2}(w_1,w_2)\Big |^2, \end{aligned}$$
(4)

where

$$\begin{aligned} {\mathcal {D}}_{n_1n_2}(w_1,w_2)= \sum _{j=0}^{n_1-1}\sum _{k=0}^{n_2-1}e(j\delta _1,k\delta _2)\exp [- \iota (w_1j\delta _1+w_2k\delta _2)] \end{aligned}$$
(5)

for \(\varvec{w}= (w_1,w_2) \in W_{\varDelta }^{2}=[-\pi /\delta _1, \pi /\delta _1)\times [-\pi /\delta _2, \pi /\delta _2)\).
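Since \({\mathcal {D}}_{n_1n_2}\) in (5) coincides with the two-dimensional discrete Fourier transform of the field when \(\delta _1 = \delta _2 = 1\), the periodogram can be computed in one pass with R's built-in fft. The following is a minimal sketch under that unit-spacing assumption (the function name periodogram2d is ours, not from the paper's code):

```r
# A sketch: 2D periodogram (4) via the FFT, assuming unit spacing
# (delta1 = delta2 = 1), so that (5) is the plain 2D DFT of the field e.
periodogram2d <- function(e) {
  n1 <- nrow(e); n2 <- ncol(e)
  D <- fft(e)                      # D_{n1 n2} at frequencies (2*pi*j/n1, 2*pi*k/n2)
  Mod(D)^2 / (4 * pi^2 * n1 * n2)  # |D|^2 / (4 pi^2 n1 n2), Eq. (4)
}

# Example: periodogram of white noise on a 32 x 32 grid
set.seed(1)
I <- periodogram2d(matrix(rnorm(32 * 32), 32, 32))
```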

The periodogram ordinates \({\mathcal {I}}_{n_1n_2}(\varvec{w})\) are asymptotically exponentially distributed with mean \(f_{\delta _1\delta _2}(\varvec{w})=\sum _{Q_1 \in {\mathcal {Z}}}\sum _{Q_2 \in {\mathcal {Z}}} f \left( w_1+ \frac{2\pi Q_1}{\delta _1},w_2+\frac{2\pi Q_2}{\delta _2}\right) \), where \({\mathcal {Z}}\) is the set of integers, and they are asymptotically independent at distinct Fourier frequencies. These properties follow from the same arguments used for time series together with the Gaussian assumption (Brillinger 2001), since we consider a spatial process on a lattice where \(\varDelta \) is fixed and the observation domain grows with the sample size. Similar results for a spatial lattice process whose spacing decreases while the observation domain stays fixed are given in Lim and Stein (2008). Also, \({\mathcal {I}}_{n_1n_2}\) is symmetric about the half of the Fourier frequencies, i.e. \({\mathcal {I}}_{n_1n_2}(w_1,w_2) = {\mathcal {I}}_{n_1n_2}\left( \frac{2\pi }{\delta _1} - w_1,\frac{2\pi }{\delta _2} - w_2 \right) \) for \(\varvec{w} \in W_{\varDelta }^{2}\).
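The aliased (folded) spectral density \(f_{\delta _1\delta _2}\) can be approximated numerically by truncating the double sum. A sketch, assuming a spectral density f that is vectorized in both arguments (the function name aliased_sd, the truncation bound Qmax, and the example f are ours):

```r
# A sketch of the folded spectral density f_{delta1 delta2}: sum f over
# 2*pi/delta shifts, truncated at |Q| <= Qmax (the exact sum is infinite).
aliased_sd <- function(f, w1, w2, delta1 = 1, delta2 = 1, Qmax = 20) {
  Q <- -Qmax:Qmax
  sum(outer(Q, Q, function(q1, q2)
    f(w1 + 2 * pi * q1 / delta1, w2 + 2 * pi * q2 / delta2)))
}

# e.g. with an (unnormalized) exponential-covariance spectral density,
# phi = 10, in d = 2: f proportional to (1/phi^2 + |w|^2)^(-3/2)
f <- function(w1, w2) (1 / 100 + w1^2 + w2^2)^(-3 / 2)
aliased_sd(f, 0.1, 0.2)
```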

2.2 Proposed model

Assuming an \(n_1 \times n_2\) regular grid over a rectangular study region \(D \subset R^2\), let \(|D_1|\) be the length of D along the x-axis, \(|D_2|\) the length along the y-axis, and \(N=n_1n_2\) the sample size. We denote the complete set of regularly spaced locations by \(S_{com}^{\varDelta } = \{\varvec{s}_{jk} = (s_j,s_k) = (j\delta _1,k\delta _2); j = 0,1,\ldots ,(n_1-1), k = 0,1,\ldots ,(n_2-1)\}\), where \( \delta _1 = \frac{|D_1|}{n_1}, \delta _2 = \frac{|D_2|}{n_2}\). We first consider completely observed samples \((\varvec{Y},\varvec{X}) = \{(Y_{jk},\varvec{X}_{jk}) = (Y(\varvec{s}_{jk}),X_{1}(\varvec{s}_{jk}),\ldots ,X_{p}(\varvec{s}_{jk})); \forall \varvec{s}_{jk} \in S^{\varDelta }_{com}\}\), where p is the number of covariates. Then the model (1) for these data becomes \(Y_{jk} = \sum _{r=1}^{p}X_{rjk}\beta _{r} + \sigma _{\epsilon }e_{jk},\text { for } j = 0,1,\ldots ,(n_1-1), k = 0,1,\ldots ,(n_2-1)\) and its matrix form is

$$\begin{aligned} \varvec{Y} = \varvec{X}\varvec{\beta } + \sigma _{\epsilon }\varvec{e}, \end{aligned}$$

where \(\varvec{\beta } = (\beta _1,\ldots ,\beta _p)^t\) and \( \varvec{e} = (e_1,\ldots ,e_N)^t\).

Given \(\varvec{e}\), we can obtain the periodogram \({\mathcal {I}}_{n_1n_2}\) at the Fourier frequencies. Due to the symmetry, we only need the first half of them. Recall that the periodogram ordinates are asymptotically exponentially distributed and independent at distinct Fourier frequencies. The exponential density expression for \({\mathcal {I}}_{n_1n_2}\) can be viewed as a Whittle likelihood, i.e. as an approximation of the Gaussian density of \(\varvec{e}\) (Whittle 1954). Carter and Kohn (1997) introduced a five-component Gaussian mixture as an approximation to the distribution of the logarithm of an exponential random variable, and we use their approximation for \(\log {\mathcal {I}}_{n_1n_2}\). That is,

$$\begin{aligned} \log {\mathcal {I}}_{n_1n_2}(\varvec{w}) = \log f_{\delta _1\delta _2}(\varvec{w}) + \xi (\varvec{w}) \end{aligned}$$
(6)

with \(\xi \) having distribution \(\pi (\xi )\) such that

$$\begin{aligned} \pi (\xi ) = \sum _{l=1}^{5}p_l\phi _{v_l}(\xi -\kappa _l), \end{aligned}$$
(7)

where \(\phi _v(\cdot - \kappa )\) is a normal density function with mean \(\kappa \) and variance \(v^2\). The weights \((p_l)\), means \((\kappa _l)\) and standard deviations \((v_l)\) of the five mixture components, chosen to match the density of the logarithm of an exponential random variable, are provided in Carter and Kohn (1997) and reproduced in the Appendix.
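To see how good the approximation is, one can compare the exact density of \(\log E\) for \(E \sim Exp(1)\), namely \(\exp (\xi - e^{\xi })\), with the mixture (7). A sketch follows; the vectors p, kappa, v are placeholders for the tabulated values in the Appendix:

```r
# A sketch comparing the exact density of log E, E ~ Exp(1), with the
# five-component Gaussian mixture (7). The vectors p, kappa, v stand for
# the Carter and Kohn (1997) values tabulated in the Appendix.
mix_density <- function(xi, p, kappa, v) {
  rowSums(mapply(function(pl, kl, vl) pl * dnorm(xi, kl, vl), p, kappa, v))
}
log_exp_density <- function(xi) exp(xi - exp(xi))  # exact density of log E

# With (p, kappa, v) filled in from the Appendix:
# curve(log_exp_density(x), -8, 3)
# curve(mix_density(x, p, kappa, v), add = TRUE, lty = 2)
```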

Let \(\psi \) be a latent variable that indicates a component in (7), \(\varvec{\varphi }\) be a vector of \(\log {\mathcal {I}}_{n_1n_2}(\varvec{w})\) and \(\varvec{\theta }\) be a vector of \(\log f_{\delta _1\delta _2}(\varvec{w})\). We pursue a hierarchical model and Bayesian inference by considering a Gaussian process prior (GP) for \(\log f_{\delta _1\delta _2}(\varvec{w})\) with mean function \(\nu (\cdot )\) and covariance function \(\tau (\cdot , \cdot )\), and appropriate priors for hyper-parameters. The model and prior specifications are summarized as follows:

  1. Data model:

    (a) \(\varvec{Y} = \varvec{X}\varvec{\beta } + \sigma _{\epsilon }\varvec{e}\) (space domain)

    (b) \(\varvec{\varphi } = \varvec{\theta }+ \varvec{\xi }\) (frequency domain)

  2. Process model:

    \(\varvec{\theta } \sim GP(\nu (\cdot ),\tau (\cdot ,\cdot ))\) with \(\nu (\varvec{w})\equiv 0\) and

    \(\tau (\varvec{w}_1, \varvec{w}_2) = \tau _{\theta }^{-1}\exp (-\rho _{\theta _1} |w_{11} - w_{21}| -\rho _{\theta _2} |w_{12} - w_{22}|)\), \(\varvec{w}_i= (w_{i1},w_{i2}) \) for \(i=1,2\).

  3. Parameter models:

    \(\varvec{\beta }\sim N(\mu _{\beta } \varvec{1}, \sigma _{\beta }^2\varvec{I})\),

    \(P(\psi = l)=p_l\), for \(l=1,\ldots , 5\); \(\rho _{\theta _1}, \rho _{\theta _2} \sim \) \( Unif(0,\rho _0) \) for some \(\rho _0 >0\), \(\tau _{\epsilon }=1/\sigma _{\epsilon }^2 \sim G(a,b)\), \(\tau _{\theta } \sim G(c,d)\),

    where \(G(a,b)\) is the Gamma distribution with mean ab.

We consider a Gibbs sampler based on the above hierarchical structure, obtaining the conditional posterior distribution of each parameter given the data and the other parameters. The detailed construction of the conditional distributions is given in the Appendix.
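For reference, the separable exponential kernel in the process model can be evaluated over the grid of Fourier frequencies as follows; this is a sketch (the function name prior_cov and the vectorized form are ours):

```r
# A sketch of the process-model prior covariance: the separable exponential
# kernel tau(w1, w2) evaluated over m frequencies. W is an m x 2 matrix of
# frequencies; tau_theta, rho1, rho2 are the hyper-parameters in the text.
prior_cov <- function(W, tau_theta, rho1, rho2) {
  d1 <- abs(outer(W[, 1], W[, 1], "-"))
  d2 <- abs(outer(W[, 2], W[, 2], "-"))
  exp(-rho1 * d1 - rho2 * d2) / tau_theta
}
```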

Once we obtain R Gibbs samples, we predict Y over the study region D at unmonitored locations. In the Bayesian framework, prediction of Y is based on the conditional expectation \(E(Y | Y_{obs})\) given the observed data, \(Y_{obs}\). Given the Gibbs samples, the prediction of \(Y(\varvec{s}_0)\) at an unmonitored location \(\varvec{s}_0 \in D\) is \({\hat{Y}}(\varvec{s}_0) = \frac{1}{R}\sum _{r=1}^R E(Y(\varvec{s}_0) | Y_{obs};\hat{\varvec{\beta }}^{(r)},\hat{\sigma _\epsilon }^{(r)},\hat{\varvec{\theta }}^{(r)}) \) with \(E(Y(\varvec{s}_0) | Y_{obs};\hat{\varvec{\beta }},\hat{\sigma _\epsilon },\hat{\varvec{\theta }})= X(\varvec{s}_0)\hat{\varvec{\beta }} + {\hat{\varvec{h}}^t}\varvec{\tilde{\varGamma }}^{-1}(Y_{obs}-X\hat{\varvec{\beta }})\), where \(\hat{\varvec{h}} = {\widehat{Cov}}(\varvec{e},e(\varvec{s}_0))\) and \(\tilde{\varvec{\varGamma }}= {\widehat{Cov}}(\varvec{e})\) (Cressie 1993). The prediction error variance of Y is obtained similarly.
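A sketch of one term of this Monte Carlo average, for a single Gibbs draw (the function and argument names are ours; in practice \(\hat{\varvec{h}}\) and \(\varvec{\tilde{\varGamma }}\) are handled via the BCCB machinery described next):

```r
# A sketch of E(Y(s0) | Y_obs; beta_hat, theta_hat) for one Gibbs draw.
# x0: 1 x p covariate row at s0; h_hat: Cov-hat(e, e(s0)); Gamma_hat: Cov-hat(e).
predict_one_draw <- function(x0, X, Y_obs, beta_hat, h_hat, Gamma_hat) {
  drop(x0 %*% beta_hat) +
    drop(crossprod(h_hat, solve(Gamma_hat, Y_obs - X %*% beta_hat)))
}
# The final predictor averages predict_one_draw over the R Gibbs draws.
```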

To obtain Gibbs samples and prediction results, we need to compute matrix-vector multiplications involving \(\varvec{\tilde{\varGamma }}^{-1}\). If the sites in \(S_{com}^{\varDelta }\) are ordered from top to bottom and from left to right, Eq. (5) implies that the covariance matrix \(\tilde{\varvec{\varGamma }}\) is an \(n_2 \times n_2\) block circulant matrix whose blocks are themselves \(n_1 \times n_1\) circulant matrices. We adopt the approach of Anitescu et al. (2012), which exploits this block-circulant with circulant blocks (BCCB) structure for efficient computation of \(\varvec{\tilde{\varGamma }}^{-1} \hat{\varvec{h}}\). A detailed explanation is given in the Appendix.
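The key computational primitive is that a BCCB matrix is diagonalized by the 2D DFT, so a matrix-vector product costs \({\mathcal {O}}(N\log N)\) and never forms the dense matrix. A sketch of the forward product (Anitescu et al. 2012 embed such products in an iterative solver to apply \(\varvec{\tilde{\varGamma }}^{-1}\); the function name is ours):

```r
# A sketch of a BCCB matrix-vector product via 2D FFTs. 'base' is the
# n1 x n2 array of covariances c(j*delta1, k*delta2) defining the first
# column of Gamma; x is an n1 x n2 array. Returns Gamma %*% vec(x),
# reshaped to n1 x n2. Cost: O(n1*n2*log(n1*n2)).
bccb_matvec <- function(base, x) {
  Re(fft(fft(base) * fft(x), inverse = TRUE)) / length(base)
}
```

Combined with, e.g., a conjugate gradient solver, this product suffices to apply \(\varvec{\tilde{\varGamma }}^{-1}\hat{\varvec{h}}\) without storing the \(N \times N\) matrix.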

The computation of \(\hat{\varvec{h}}\) requires an additional technique. The covariance estimates recovered from the estimated spectral density are available only at lags of the form \((j \delta _1, k\delta _2)\), which is not enough to reconstruct \(\hat{\varvec{h}}\), since that requires covariance estimates at a finer resolution. However, an interpolation of the estimated spectral density analogous to the approach of Dey et al. (2018) is no longer applicable, due to the aliasing effect in the spectrum of sampled observations (Gelfand et al. 2010). To resolve this issue, we interpolate the covariance estimates available at the coarser resolution to obtain estimates at a finer resolution and thereby construct \(\hat{\varvec{h}}\). For example, \({\hat{c}} (0.5\delta _1, 0.5\delta _2)\) is obtained by bilinear interpolation of the four neighboring values \({\hat{c}} (0, 0)\), \({\hat{c}} (\delta _1, 0)\), \({\hat{c}} (0, \delta _2)\) and \({\hat{c}} (\delta _1, \delta _2)\).
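The interpolation itself is elementary; a sketch (the function name is ours):

```r
# A sketch of the bilinear interpolation of covariance estimates: the value
# at lag (u*delta1, v*delta2), 0 <= u, v <= 1, from the four neighboring
# estimates on the coarse lag grid.
bilinear_cov <- function(c00, c10, c01, c11, u, v) {
  (1 - u) * (1 - v) * c00 + u * (1 - v) * c10 +
    (1 - u) * v * c01 + u * v * c11
}
# The example in the text uses equal weights of 1/4 at the midpoint:
# bilinear_cov(c00, c10, c01, c11, 0.5, 0.5)
```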

Our approach requires sampling \(\varvec{\theta }\), the logarithm of the spectral density at the Fourier frequencies, at each Gibbs iteration, which can be time consuming. However, we argue that it does not lose much computational efficiency compared to a conventional Bayesian spatial regression method under a parametric set-up. This is partly because we compute the discrete Fourier transform (DFT) with the fast Fourier transform (FFT) algorithm (Bracewell 1986; Cooley and Tukey 1965), whose computational cost is \({\mathcal {O}}(n_1n_2\log (n_1n_2))\). Also, due to the symmetry of the spectral density and the periodogram about the origin, we only need to consider the first half of the Fourier frequencies. If we impose further restrictions on the true spectral density, such as isotropy, we can improve computation speed and save memory by considering only about one fourth of the frequencies.

2.3 Proposed model for observations on an incomplete grid

Let \(\zeta _{jk}\) be an indicator of whether Y is observed at \(\varvec{s}_{jk} \in S^{\varDelta }_{com}\). That is, \(\zeta _{jk}= 1\) if \(Y(\varvec{s}_{jk})\) is observed and \(\zeta _{jk}= 0\) if it is missing. We now consider a complete set that includes observations as well as missing values, together with the indicators:

$$\begin{aligned} (\varvec{Y},\varvec{\zeta },\varvec{X}) = \left\{ (Y_{jk},\zeta _{jk},\varvec{X}_{jk}) = (Y(\varvec{s}_{jk}),\zeta (\varvec{s}_{jk}),X_{1}(\varvec{s}_{jk}),\ldots ,X_{p}(\varvec{s}_{jk}))\right\} . \end{aligned}$$

Recall that the matrix form of the model using the data is

$$\begin{aligned} \varvec{Y} = \varvec{X}\varvec{\beta } + \sigma _{\epsilon }\varvec{e}. \end{aligned}$$

Suppose that \(\varvec{X}\) and \(\varvec{Y}_{obs}\) are observed but \(\varvec{Y}_{mis}\) is missing at random given the observations, where \(\varvec{Y}_{obs}\) is the observed part and \(\varvec{Y}_{mis}\) the missing part of \(\varvec{Y}\). Let \(\varvec{\varTheta } = (\varvec{\beta }^t,\sigma _{\epsilon },\varvec{\theta }^t,\psi ,\tau _{\theta },\rho _{\theta _1},\rho _{\theta _2})^t\) be the vector of all model parameters. In Bayesian inference, it is common to treat \(\varvec{Y}_{mis}\) as a vector of latent variables. We can then augment the missing observations by sampling from their conditional distribution, \(\varvec{h}_{mis} = P(\varvec{Y}_{mis}|\varvec{Y}_{obs},\varvec{\zeta },\varvec{\varTheta })\), within the MCMC procedure described in Sect. 2.2.

Under the Gaussian assumption on \(\varvec{Y}\), we can easily show that \(\varvec{h}_{mis}\) follows a multivariate normal distribution. Note that

$$\begin{aligned} \left( \begin{array}{c} \varvec{Y}_{obs} \\ \varvec{Y}_{mis} \end{array}\right) \sim \varvec{N} \left( \left( \begin{array}{c} \varvec{X}_{obs}\varvec{\beta } \\ \varvec{X}_{mis}\varvec{\beta } \end{array}\right) , \left[ \begin{array}{cc} \varvec{\varSigma }_{11} &{} \varvec{\varSigma }_{12} \\ \varvec{\varSigma }_{21} &{} \varvec{\varSigma }_{22} \end{array}\right] \right) \end{aligned}$$

so that the conditional distribution of \(\varvec{Y}_{mis}\) given \(\varvec{Y}_{obs}\) and \(\varvec{\varTheta }\) is

$$\begin{aligned} \varvec{Y}_{mis} | (\varvec{Y}_{obs},\varvec{\varTheta }) \sim \varvec{N}\left( \varvec{\mu }_{mis|obs},\varvec{\varSigma }_{mis|obs}\right) , \end{aligned}$$

where \( \varvec{\mu }_{mis|obs} = \varvec{X}_{mis}\varvec{\beta } + \varvec{\varSigma }_{21}\varvec{\varSigma }^{-1}_{11}\left( \varvec{Y}_{obs}-\varvec{X}_{obs}\varvec{\beta }\right) \) and \( \varvec{\varSigma }_{mis|obs} = \varvec{\varSigma }_{22} - \varvec{\varSigma }_{21}\varvec{\varSigma }^{-1}_{11}\varvec{\varSigma }_{12} \). The blocks \(\varvec{\varSigma }_{11}\), \(\varvec{\varSigma }_{12}\), \(\varvec{\varSigma }_{21}\) and \(\varvec{\varSigma }_{22}\) are recovered from our model parameters. Bayes' formula then gives

$$\begin{aligned} \varvec{h}_{mis} = P(\varvec{Y}_{mis}|\varvec{Y}_{obs},\varvec{\zeta },\varvec{\varTheta }) = \frac{\pi (\varvec{Y}_{mis}|\varvec{Y}_{obs},\varvec{\varTheta })P(\varvec{\zeta }|\varvec{Y},\varvec{\varTheta })}{\int \pi (\varvec{Y}_{mis}|\varvec{Y}_{obs},\varvec{\varTheta })P(\varvec{\zeta }|\varvec{Y},\varvec{\varTheta })d\varvec{Y}_{mis}}. \end{aligned}$$

When we assume that the missingness of \(\varvec{Y}\) occurs at random conditionally on the observed data and model parameters, i.e. \(P(\varvec{\zeta }|\varvec{Y},\varvec{\varTheta }) = P(\varvec{\zeta }|\varvec{Y}_{obs},\varvec{\varTheta })\), then this term is constant with respect to the unknown quantity \(\varvec{Y}_{mis}\) given \(\varvec{Y}_{obs}\) and \(\varvec{\varTheta }\) (Kim and Shao 2013). Therefore, we can obtain samples from \(\varvec{h}_{mis}\) by sampling from \(\pi (\varvec{Y}_{mis}|\varvec{Y}_{obs},\varvec{\varTheta })\) within the proposed Gibbs sampler.
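The augmentation step thus reduces to one multivariate normal draw per iteration; a sketch (the function name and the dense linear algebra are ours, chosen for clarity rather than efficiency):

```r
# A sketch of one data-augmentation draw Y_mis ~ pi(Y_mis | Y_obs, Theta).
# Sigma: N x N covariance recovered from the current spectral-density draw;
# obs: indices of the observed sites; X: N x p design matrix.
draw_Ymis <- function(Y_obs, X, beta, Sigma, obs) {
  mis <- setdiff(seq_len(nrow(X)), obs)
  S11 <- Sigma[obs, obs]; S21 <- Sigma[mis, obs]; S22 <- Sigma[mis, mis]
  mu  <- X[mis, , drop = FALSE] %*% beta +
    S21 %*% solve(S11, Y_obs - X[obs, , drop = FALSE] %*% beta)
  S   <- S22 - S21 %*% solve(S11, t(S21))
  drop(mu) + drop(t(chol(S)) %*% rnorm(length(mis)))  # mean + L %*% z
}
```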

3 Empirical results

3.1 Simulation study

In this section, we evaluate the performance of the proposed approach in terms of estimation and prediction and compare it with a parametric Bayesian approach under various simulation settings. We consider a regular grid over a rectangular study region D, denoted by \(S_{\delta , n}\), in which the distance between neighboring observations in each direction is \(\delta \) and the length of each direction is n. In other words, \(S_{\delta ,n} = \{\varvec{s}_{jk} = (s_j,s_k) = (j\delta ,k\delta ),\, j,k=0,\ldots ,\lfloor (n-1)/\delta \rfloor \}\), where \(\lfloor x \rfloor \) is the greatest integer less than or equal to x. We consider two covariates \(X_1\) and \(X_2\) in addition to a constant term. \(X_1\) is generated from a mixture of two normal distributions, i.e. \( X_1 = p\xi _1+(1-p)\sqrt{5}\xi _2;\, p \sim Ber(0.5), \xi _1, \xi _2 \sim N(0,1) \), and \(X_2\) is generated from a standard exponential distribution. The regression coefficients are set to \(\varvec{\beta }= (\beta _0, \beta _1, \beta _2)^{t} = (0.01,0.02,0.03)^{t}\).
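A sketch of this covariate generation (one draw per grid site; variable names are ours):

```r
# A sketch of the simulation covariates: X1 from a two-component normal
# mixture, X2 from a standard exponential, one draw per grid site.
set.seed(1)
N  <- 32^2                                   # e.g. the sites of S_{1,32}
p  <- rbinom(N, 1, 0.5)                      # Ber(0.5) mixture indicator
X1 <- p * rnorm(N) + (1 - p) * sqrt(5) * rnorm(N)
X2 <- rexp(N)
beta <- c(0.01, 0.02, 0.03)                  # (beta0, beta1, beta2)
```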

For Bayesian inference, we choose the values of the hyper-parameters as \(\mu _{\beta } = 0\), \(\sigma ^2_{\beta } = 100\), \(a = 100\), \(b = 10\), \(c = 100\), \(d = 100\) and \(\rho _0 = 0.001\). The small value of \(\rho _0\) implies weak dependence in the covariance kernel \(\tau (\varvec{w}_1,\varvec{w}_2)\) of the prior for \(\varvec{\theta }\). The choices of a, b, c and d are made so that the prior distributions have large variability. We also performed a sensitivity analysis (not shown) for the choice of a, b, c and d and found that the prediction results do not differ much. Three chains of 10,000 iterations each, with 9000 burn-in, are obtained. We refer to our proposed approach, non-parametric spectral density Bayesian spatial regression, as NSBSR, and to the usual parametric Bayesian spatial regression as PBSR.

Fig. 1 Estimated log-scale spectral densities assuming an exponential covariance model with \(\sigma ^2 e^{-\Vert Ah\Vert /\phi }\), \(\sigma ^2 = 1\), \(\phi = 10\), \(A = \left( {\begin{matrix} 1&{}0\\ 0&{}1 \end{matrix}}\right) \) (isotropy), and \(A = \left( {\begin{matrix} 1&{}0\\ 0&{} 1/2 \end{matrix}}\right) \) (anisotropy). The first row shows the true log-scale spectral densities; the second row shows the estimates from the proposed method

First, we consider simulated datasets with two different grid sizes (\(S_{1, 16}\) and \(S_{1, 32}\)), assuming an exponential covariance model \(\sigma ^2 e^{-\Vert Ah\Vert /\phi }\) with \(\sigma ^2 = 1\) and \(\phi = 10\), to investigate the estimation of the spectral density by comparison with the truth. Two choices of A are considered: \(A = \left( {\begin{matrix} 1&{}0\\ 0&{}1 \end{matrix}}\right) \) (isotropy) and \(A = \left( {\begin{matrix} 1&{}0\\ 0&{}\ 1/2 \end{matrix}}\right) \) (anisotropy). The anisotropic choice of A implies that the x-direction is stretched by a factor of two relative to the y-direction. Figure 1 shows the estimated log-scale spectral densities in three-dimensional visualization. The first row is the true spectral density and the second row is the NSBSR estimate. Compared to the truth, the estimated spectral densities tend to over-estimate near the boundaries, but they capture the anisotropic patterns. Note that this is a single example dataset, so the result could vary with a different simulated dataset.

Next, we consider simulated datasets on \(S_{1, 32}\), assuming a Matérn covariance model, \(c(\varvec{h};\sigma ^2,\phi ,\alpha ) = \sigma ^2 \frac{2^{1-\alpha }}{\varGamma (\alpha )}\left( \frac{\Vert A \varvec{h}\Vert }{\phi }\right) ^{\alpha }{\mathcal {K}}_{\alpha }\left( \frac{\Vert A \varvec{h}\Vert }{\phi }\right) \), at various smoothness levels \(\alpha \) to investigate prediction performance. We set \(\sigma ^2=1\) and \(\phi =10\) as before. For this simulation, we consider \(A = \left( {\begin{matrix} 1&{}0\\ 0&{}1 \end{matrix}}\right) \) (isotropy) and \(A = \left( {\begin{matrix} 1\big /\sqrt{2}&{}0\\ 0&{} 1\big /(2\sqrt{2}) \end{matrix}}\right) \) (anisotropy). The anisotropic choice of A implies that the x-direction is stretched by a factor of two relative to the y-direction, while the norm is scaled by \(1/\sqrt{2}\). We then fit the model using only the data on \(S_{2,32}\), a subsample of the data on \(S_{1,32}\) in which the neighboring distance in each direction is twice as large. Prediction is made on \(S_{1,32}\) and compared with the data generated on \(S_{1,32}\). Figure 2 shows prediction results from our NSBSR approach together with the observed (simulated) values. Our approach captures the observed patterns in both the isotropic and anisotropic cases. Again, this is a single example dataset, so the result could vary with a different simulated dataset.

Fig. 2 True simulated Y (left) and NSBSR-predicted Y (right) under isotropic and anisotropic Matérn covariance models. Each row corresponds to a different smoothness parameter, \(\alpha =0.1, 0.5, 2.0\), respectively

Now, we investigate several covariance models: (1) pure nugget (i.i.d.), (2) bumpy Matérn (\(\alpha = 0.1\)), (3) Matérn (\(\alpha = 0.5\)), (4) smooth Matérn (\(\alpha = 2.0\)), (5) bumpy powered exponential (\(\alpha = 0.5\)), (6) smooth powered exponential (\(\alpha = 1.5\)) and (7) Gaussian. The form of the Matérn covariance function was introduced earlier, and the Matérn model with \(\alpha =0.5\) corresponds to the exponential covariance model. The powered exponential covariance function is \(c(\varvec{h};\sigma ^2,\phi ,\alpha ) = \sigma ^2 \exp \left( -(\Vert A \varvec{h}\Vert /\phi )^{\alpha }\right) \). We again set \(\sigma ^2=1\) and \(\phi =10\). In addition to the isotropic cases, we investigate anisotropic models with \(A = \left( {\begin{matrix} 1&{}0\\ 0&{} 1/4 \end{matrix}}\right) \), which implies that the x-direction is stretched by a factor of four relative to the y-direction. We simulate 100 datasets for each covariance setting under both isotropy and anisotropy. As in the second simulation (Fig. 2), we first generate the data on \(S_{1,32}\) and use the data on \(S_{2,32}\) to fit the model. Prediction is made on \(S_{1,32}\) and compared with the data generated on \(S_{1,32}\).

Fig. 3 Box plots of RMSPE: the isotropic case is the first row and the anisotropic case is the second row. Boxplots over 100 datasets of root mean squared prediction errors averaged over locations. Each block represents a different covariance model for data generation: pure nugget, bumpy Matérn (\(\alpha = 0.1\)), exponential, bumpy powered exponential (\(\alpha = 0.5\)), smooth powered exponential (\(\alpha = 1.5\)), smooth Matérn (\(\alpha = 2.0\)), and Gaussian. Within each block, boxplots are ordered by estimation model: NSBSR (red), bumpy PBSR (\(\alpha \) fixed at 0.1; green), exponential PBSR (\(\alpha \) fixed at 0.5; blue), smooth PBSR (\(\alpha \) fixed at 2.0; purple), general PBSR with \(\alpha \) unfixed (yellow), and UK with \(\alpha \) unfixed (gray)

Figure 3 shows the prediction performance of NSBSR (red) and of PBSR with a Matérn model with smoothness parameter fixed at \(\alpha =0.1\) (green), \(\alpha =0.5\) (blue), \(\alpha =2.0\) (purple), or left unfixed (yellow) for each simulation setting. We also compare with universal kriging (gray) based on maximum likelihood estimates, where the smoothness parameter \(\alpha \) is estimated as well; we call this approach UK. A fixed smoothness parameter means that we estimate the other parameters while holding the smoothness parameter fixed; an unfixed smoothness parameter means we estimate it as well. The figure shows boxplots of root mean squared prediction errors (RMSPE) between observations and predicted values over 100 datasets, where RMSPE is averaged over locations for each dataset. From the left block (blocks are divided by dotted vertical lines), the covariance models for data generation are, in turn, pure nugget (i.i.d.), bumpy Matérn (\(\alpha =0.1\)), exponential, bumpy powered exponential (\(\alpha =0.5\)), smooth powered exponential (\(\alpha =1.5\)), smooth Matérn and Gaussian. RMSPEs for NSBSR are overall comparable to those for PBSR and UK for the less smooth models (first four blocks) and lower for the smoother models (last three blocks). The sample visualizations for NSBSR in Fig. 2 likewise suggest that predicted values under a smoother covariance model are relatively less biased than those under a bumpy covariance model. RMSPEs for NSBSR are quite robust across the various covariance models compared to those of PBSR and UK. Note that the results of PBSR with unfixed smoothness are not impressive, even though that model is more flexible than the fixed-smoothness ones. The results of UK, a non-Bayesian approach, show less variability overall. Results for the anisotropic cases are similar, although the difference in prediction performance for smooth covariance models is reduced.

Table 1 shows an additional prediction performance measure and the estimation performance for each regression coefficient. The row labeled \(R^2\) shows the average coefficient of determination between observations and predicted values over 100 datasets. The rows labeled \(\beta _0, \beta _1, \beta _2\) show the root mean squared error (RMSE) of the regression coefficient estimates. These results are for the simulated data (isotropic case) used in Fig. 3. The \(R^2\) values are small for all approaches when the data are independent or from less smooth processes, and they grow as the processes become smoother. For both prediction and estimation, the proposed NSBSR approach and the parametric approaches are overall comparable in these measures.

Table 1 Prediction and estimation results

We also compare NSBSR with PBSR when the data lie on an incomplete grid. We generate three example datasets from Matérn covariance models with \(\alpha =0.1, 0.5, 2.0\), respectively, on the complete grid \(S_{1, 32}\). We then consider the data only on \(S_{2,16}\) for fitting, but randomly remove grid points according to the missing ratio (MR, %). Our approach, NSBSR, is compared to PBSR with Matérn models with \(\alpha =0.1\) (P01), \(\alpha =0.5\) (P05), \(\alpha =2.0\) (P20) and unfixed \(\alpha \) (P00). Table 2 shows the mean squared prediction error (MSPE) and \(R^2\) between observations and predicted values, where MSPE is averaged over locations. The results in Table 2 show that the MSPE of NSBSR tends to increase as MR increases, but the increase is comparable to that of PBSR. Likewise, the \(R^2\) of NSBSR decreases as MR increases, but the decrease is comparable to that of PBSR. For example, with data from the Matérn covariance model with \(\alpha =0.1\) and MR of 10%, the MSPE of NSBSR is 0.814 while that of PBSR with the smoothness fixed at \(\alpha =0.1\) (P01) is 0.820. The missing ratio also does not much affect the estimation of the regression coefficients for either NSBSR or PBSR (results not shown for brevity). However, we need longer MCMC chains for convergence when the missing ratio is higher. Although the impact of the missing rate is not apparent in this particular simulation study, convergence can be an issue in conditional simulation for imputing the missing data, as discussed in Guinness and Fuentes (2017).

Table 2 Mean squared prediction error (MSPE) and R squares (\(R^2\)) between observations and predicted values under various missing ratios (MR)

For implementation, we used the software R (www.r-project.org). When fitting the model to one dataset on \(S_{2,16}\), NSBSR took only an additional 0.25 min per 1000 iterations with three chains compared to PBSR, on a machine with an Intel(R) Core(TM) i5-4690 CPU and 8.00 GB of RAM.

3.2 Real data analysis

In this section, we apply our approach to two ozone datasets. One is from Moderate Resolution Imaging Spectroradiometer (MODIS) Terra Level-3 Aerosol Cloud Water Vapor Ozone Daily Global product (MOD08D3) (https://ladsweb.modaps.eosdis.nasa.gov/search/). The other is from AURA (EOS CH-1) which is a multi-national NASA scientific research satellite studying the Earth’s ozone layer, air quality, and climate (https://disc.gsfc.nasa.gov/datasets?keywords=aura&page=1).

3.2.1 MODIS application

MOD08D3 contains daily-averaged values of atmospheric parameters related to aerosol particle properties, cloud optical and physical properties, atmospheric water vapor, atmospheric profile and stability indices, and total ozone burden on a \(1^{\circ } \times 1^{\circ }\) grid. From these, we obtained quality-controlled ozone exposure measurements; missing values were left untreated. To properly impose a Gaussian assumption, log-transformed ozone exposure is used as the response variable Y. We focus on a neighborhood of the Korean peninsula, i.e. longitude ranging from \(112^\circ \) to \(141^\circ \) and latitude from \(24^\circ \) to \(53^\circ \). The daily average on February 5, 2019 was used for the analysis, as an example of forecasting short-term ambient exposure. For covariates, we used the world geodetic system (WGS 84) information, so that \(X_1\) refers to longitude and \(X_2\) to latitude.

MOD08D3 consists of rectangular image pixels at \(1^\circ \) resolution, as mentioned above. We take a subset of the pixels at \(2^\circ \) resolution for model fitting and then predict the values at \(1^\circ \) resolution. The missing rate is 13.0% for the original dataset; both the sample size and the missing rate are moderate. As our approach assumes stationarity, we checked the data with the stationarity test introduced by Taylor et al. (2014), which is designed for testing stationarity of random fields on a regular lattice. The test is available as an R function (TOS2D) in the LS2Wstat package. As it requires a complete set of data on a grid, we imputed the data by ordinary kriging with an exponential correlation function before applying the test. The p value is 0.851, which indicates that we cannot reject the stationarity assumption for these data. Predicted values of ozone concentration can be used for exposure assessment, for example in estimating health effects of ambient air pollution on mortality/morbidity in epidemiological studies (Kim and Song 2017; Laden et al. 2006).

Fig. 4 MODIS result: the first plot is the MODIS dataset (\(1^\circ \) resolution) (original). The second plot is the training set (\(2^\circ \) resolution) (training). The third plot is the prediction from NSBSR and the fourth plot is the prediction from PBSR with a Matérn covariance function, whose smoothness parameter was also estimated. (Longitude: \(112^\circ \sim 141^\circ \); Latitude: \(24^\circ \sim 53^\circ \); \(30 \times 30\) pixels; Time: February 5, 2019)

First, we compare prediction results with a parametric approach via prediction maps and by computing the MSPE and \(R^2\) between observations and predicted values. Figure 4 shows prediction results for the ozone data. The first plot shows the original dataset, in which some values are missing. The second plot shows the data used to fit the model. The third and fourth plots are prediction maps from our approach (NSBSR) and the parametric approach (PBSR). For PBSR, we use a Matérn covariance model with unfixed smoothness parameter. The prediction map from our approach resembles the original data. The \(R^2\) of NSBSR (0.54) is higher than that of PBSR (0.46), and the MSPE of NSBSR (0.0793) is lower than that of PBSR (0.0903).

Table 3 List of spatial prediction methods in Heaton et al. (2019) we compared with for prediction performance

Second, we compare prediction results with the methods reviewed in Heaton et al. (2019). Table 3 lists the methods we compare with and their abbreviations; a brief summary of each method is given in the Introduction. For implementation, we used the code available from Heaton et al. (2019) and R packages where available, with default settings for any tuning parameters. We therefore note that the results in this section may not be the best achievable for each method. Given the reliance on available code, we omit two approaches, metakriging and the multiresolution approximation, from the comparison: the computation time was rather long for metakriging, and we were not successful in implementing the code for the multiresolution approximation.

Table 4 shows performance results for the MODIS data. We provide the mean absolute error (MAE), root mean squared error (RMSE), the average length of the 95% confidence intervals (for non-Bayesian approaches) or credible intervals (for Bayesian approaches) (INT), and the prediction coverage (PC), defined as the proportion of grid cells whose 95% prediction intervals contain the observed values. The results are sorted in ascending order by MAE. Although most methods are comparable, with similar MAE and RMSE values, NSBSR provides the best result in terms of MAE (tied with NNGP) and the second best in terms of RMSE. NSBSR has a relatively large INT, though. It is interesting that the INT values differ widely among the methods compared to MAE and RMSE, which indicates that interval estimation is more challenging. For prediction coverage, NSBSR performs well, as its coverage is the closest to 95%.

Table 4 Prediction results for the MODIS data from various methods based on mean absolute error (MAE), root mean squared error (RMSE), confidence/credible interval length (INT), and prediction coverage (PC)

3.2.2 AURA application

The second dataset is total column ozone data from the TOMS-Like Ozone and Radiative Cloud Fraction L3 1 day \(0.25^{\circ } \times 0.25^{\circ }\) V3 product (DOI: 10.5067/Aura/OMI/DATA3002) of the Ozone Monitoring Instrument (OMI) onboard the AURA satellite. We consider the following covariates, which can affect the level of ozone concentration since ozone is a secondary pollutant: (1) radiative cloud fraction (DOI: 10.5067/Aura/OMI/DATA3002); (2) solar zenith angle (DOI: 10.5067/Aura/OMI/DATA3002); (3) total column of nitrogen dioxide (DOI: 10.5067/Aura/OMI/DATA3007); (4) total column of formaldehyde (DOI: 10.5067/Aura/OMI/DATA2016); (5) ultraviolet aerosol index (DOI: 10.5067/Aura/OMI/DATA2025); (6) total column of sulfur dioxide (DOI: 10.5067/Aura/OMI/DATA2025). Covariates (3) and (4) are log-transformed for better interpretability. The archived OMI/AURA dataset has global coverage at \(0.25^{\circ } \times 0.25^{\circ }\) resolution. We again focus on a neighborhood of the Korean peninsula, i.e. longitude ranging from \(112^\circ \) to \(141^\circ \) and latitude from \(24^\circ \) to \(53^\circ \). This time, we consider data averaged from June 1 to August 31, 2019. There were missing values in the hourly data due to the satellite's orbits and other random sources, but we averaged the data wherever available. As a result, there are no missing values for ozone concentration, while some covariates still have missing values, which we imputed by ordinary kriging. Stationarity of the AURA data is also tested by the method used for the MODIS data. The corresponding p value is 0.672, so we cannot reject the stationarity assumption for the AURA data either, although the p value is smaller than for the MODIS data.

Table 5 Prediction results for the AURA data from various methods based on mean absolute error (MAE), root mean squared error (RMSE), confidence/credible interval length (INT), and prediction coverage(PC)

For the AURA data, we further randomly removed 20% of the data to see the impact of missing values. We then took a subset at \(0.5^\circ \) resolution for fitting each method, and prediction is made at \(0.25^\circ \) resolution. Table 5 shows the prediction performance; as in Table 4, the results are sorted in ascending order by MAE, and RMSE, INT and PC are provided as well. NSBSR provides the fifth best result in terms of MAE and RMSE this time. The ranking of the methods differs from that for the MODIS data; for example, NNGP, which showed the best result for the MODIS data, places eighth for the AURA data. The MAE and RMSE of the first six methods are relatively small compared to those of the remaining methods. The results for our approach are not as good as those for the MODIS data; however, it shows reasonable performance given that our approach relies on a stationarity assumption while several of the other methods allow more flexible non-stationarity. The INT values vary more among the methods than MAE and RMSE do, as with the MODIS data. Note that LK is best in terms of MAE and RMSE but has the largest INT. The prediction coverage results also vary, but our approach still shows reasonable performance.

4 Discussion

We have proposed a Bayesian spatial regression with non-parametric modeling of the spectral density based on the Fourier transform. Our approach, NSBSR, achieves reasonable computational efficiency in terms of storage and speed by using the Whittle likelihood approximation and the fast Fourier transform algorithm, even though there are more parameters to estimate than in parametric covariance models. Simulation studies show that NSBSR is relatively robust compared to models relying on parametric covariance functions and/or an isotropy assumption. NSBSR also shows better prediction results for smoother processes, in the sense that its RMSPE is lower than that of parametric covariance models. Our approach requires a stationarity assumption, which is rather limiting given that several methods for non-stationary spatial data are available. However, the comparison analysis using two ozone concentration datasets (see Tables 4, 5) shows that our approach provides reasonable prediction, given the variation in prediction performance among methods across datasets. Thus, NSBSR is a good alternative to existing prediction approaches in spatial data analysis.

Our approach could be used as a baseline for capturing spatial dependence structures more complicated than those of stationary Gaussian fields. We can let the marginal variance \(\sigma _{\epsilon }\) in Eq. (1) be spatially varying, \(\sigma _{\epsilon }(\varvec{s})\), so that the resulting process becomes non-stationary. We could then apply our approach to the stationary error process component, \(e(\varvec{s})\), and thereby handle a class of non-stationary processes.

The estimated spectral densities are not as good as the prediction results. This could be due to the DFT approximation, or to the Whittle likelihood approximation combined with the further approximation by the five-component Gaussian mixture. Yaglom (1987) pointed out that the DFT approximation, rather than the exact Fourier transform, could cause accuracy issues in covariance estimation. In addition, an insufficient sample size or a truncated study region could have a negative effect on spectral density estimation. A possible remedy would be a likelihood approximation other than the Whittle likelihood, so that we can avoid using the periodogram itself, but this requires theoretical justification. An empirical choice of hyperparameters, such as \(\rho _0\) for \(\rho _{\theta _1}\) and \(\rho _{\theta _2}\), might also enhance prediction accuracy. However, such choices are rather subjective, and we followed the usual practice of vague priors in our analysis.

The proposed method requires observations on a spatial lattice. We introduced a way to handle observations on an incomplete lattice, which can also be viewed as irregularly spaced data. For completely irregular observation locations not on a spatial lattice, one can consider the idea of Fuentes (2007) when the sample size is large: Fuentes (2007) proposed aggregating the data points within each grid cell and treating them as observations of an integrated process on a spatial lattice. This is a possible future direction for extending our approach.