
1 Introduction

Spatial models provide a suitable way of analyzing data when observations are thought to be correlated because of their locations in space. Bayesian inference has proven useful when dealing with spatial models and modeling local dependence. In Bayesian analysis (see, e.g., Gelman et al. 2003), inference about the vector of model parameters \( {\mathbf x} \) is based on computing their joint posterior distribution given the vector of observed data \( {\mathbf y} \). This is done by means of Bayes’ rule:

$$ \pi ({\mathbf x}|{\mathbf y})\propto \pi ({\mathbf y}|{\mathbf x})\pi ({\mathbf x}) $$

Here \( \pi ({\mathbf y}|{\mathbf x}) \) represents the likelihood of the model given its parameters and \( \pi ({\mathbf x}) \) is the prior distribution of the parameters of the model. Hence, the posterior distribution depends on the mechanism which generates the data (i.e., the likelihood) and the previous information about the model parameters (i.e., the prior distribution). Note that \( \pi ({\mathbf x}) \) is often supposed to depend on some hyperparameters which in turn have their own prior distributions.
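As a minimal sketch of Bayes' rule as written above, the following Python snippet evaluates an unnormalized posterior on a grid for a single parameter and then normalizes it numerically. The binomial likelihood, the flat prior, the data values, and the function name `grid_posterior` are illustrative assumptions and do not come from the text.

```python
def grid_posterior(successes, trials, grid_size=201):
    """Posterior of a binomial probability x on a regular grid (flat prior assumed)."""
    step = 1.0 / (grid_size - 1)
    grid = [i * step for i in range(grid_size)]
    # Unnormalized posterior: likelihood pi(y | x) times the flat prior pi(x) = 1.
    unnorm = [x ** successes * (1.0 - x) ** (trials - successes) for x in grid]
    total = sum(unnorm) * step  # Riemann-sum approximation of the normalizing constant
    density = [u / total for u in unnorm]
    return grid, density, step

# Hypothetical data: 7 successes in 20 trials.
grid, density, step = grid_posterior(successes=7, trials=20)
post_mean = sum(x * d for x, d in zip(grid, density)) * step
```

With a flat prior this toy posterior is a Beta(8, 14) distribution, so the grid-based mean can be checked against 8/22.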

The joint posterior \( \pi ({\mathbf x}|{\mathbf y}) \) is a multivariate distribution over all model parameters and is often hard to obtain. In many applications it is sufficient to obtain a separate posterior distribution for some of the parameters in the model because no joint inference is needed (e.g., the estimates of the relative risk in different areas). These distributions are called posterior marginals and can be denoted \( \pi ({x_i}|{\mathbf y}) \).

As these are univariate distributions, they are often easier to compute or approximate than the joint posterior distribution.

Given that in most cases there is no closed form for the posterior distributions of most parameters in the model, Markov chain Monte Carlo (MCMC, see Gelman et al. 2003) techniques have been employed to estimate the joint posterior. Furthermore, a number of sound techniques for model criticism, comparison, and selection make Bayesian inference appealing.

For models with complex spatial dependence or large datasets, MCMC may not be a convenient solution due to computational time. For this reason, Rue et al. (2009) propose the use of approximate inference based on what they have called the integrated nested Laplace approximation (INLA). This approximation will focus on the posterior marginals which are easier to compute than obtaining an approximation to the joint posterior distribution. Also, INLA will only consider approximations for hierarchical models whose latent effects can be expressed as a Gaussian Markov random field (GMRF).

Successful applications of INLA include disease mapping (Schroedle et al. 2011), geostatistics (Eidsvik et al. 2009), point patterns (Illian et al. 2012), and others (Martino and Rue 2010).

2 Integrated Nested Laplace Approximation

The integrated nested Laplace approximation (INLA) focuses on providing a good approximation to the posterior marginal distributions of the parameters in the model. In particular, this approximation has been developed for latent Gaussian models. These cover a general class of models which appear in many areas of interest. Spatial statistics is one of them, as spatial correlation can be introduced by means of correlated random effects.

First of all, let us assume that we have \( n \) observed variables \( {y_i},\:i=1,\ldots,n \) with a distribution (usually from the exponential family) with a mean \( {\mu_i} \) which is related to a linear predictor \( {\eta_i} \) through a convenient link function. In turn, \( {\eta_i} \) is modeled additively on different effects:

$$ {\eta_i}=\alpha +\sum\limits_{j=1}^{{{n_f}}}\,{f^{(j) }}({u_{ji }})+\sum\limits_{k=1}^{{{n_{\beta }}}}\,{\beta_k}{z_{ki }}+{\epsilon_i} $$

Here, \( {f^{(j) }} \) represents some nonlinear function or random effects (of which there are \( {n_f} \)) on a set of covariates u, \( {\beta_k} \) are coefficients for linear effects on a vector of covariates z, and \( {\epsilon_i} \) are unstructured terms. The latent effects \( {\mathbf x}=\left\{ {\{{\eta_{\it i}}\},\alpha, \{{\beta_k}\},\ldots } \right\} \) are assumed to be Gaussian with zero mean and precision matrix \( {\mathbf Q}({\theta_1}) \), where \( {\theta_1} \) is a vector of hyperparameters. Hence, the observations will have a likelihood which will depend on the latent effects x and a set of parameters \( {\theta_2} \). Furthermore, the observations \( {y_i} \) are supposed to be independent given x and \( {\theta_2} \).

In the particular case of spatial statistics, the terms \( {f^{(j) }}({u_{ji }}) \) can be taken as \( f_i^{(j) } \) (or \( {u_i} \), with some abuse of notation) to represent a random effect at a spatial location i. Hence, covariate \( {u_{ji }} \) acts as the spatial index i of area i for the set of random effects j. For example, taking \( {n_f}=2 \) we can define \( {u_i}={f^{(1)}}({u_{1i }}) \) and \( {v_i}={f^{(2)}}({u_{2i }}) \), where \( {\mathbf u}=\{u_1,\ldots,u_n\} \) is a vector of spatially correlated random effects and \( {\mathbf v}=\{v_1,\ldots,v_n\} \) is a vector of independent random effects.

Rue et al. (2009) focus on the posterior distribution of x and the vector of hyperparameters \( \theta =({\theta_1},{\theta_2}) \):

$$ \begin{aligned} \pi ({\mathbf x},\theta |{\mathbf y}) &\propto \pi (\theta )\,\pi ({\mathbf x}|\theta )\prod\limits_{i\in \mathcal{I}}\pi (y_i|x_i,\theta ) \\ &\propto \pi (\theta )\,|{\mathbf Q}(\theta )|^{n/2}\exp \left\{ -\frac{1}{2}{\mathbf x}^{T}{\mathbf Q}(\theta ){\mathbf x}+\sum\limits_{i\in \mathcal{I}}\log \pi (y_i|x_i,\theta ) \right\} \end{aligned} $$

Here \( \mathcal{I} \) is the subset of indices (from 1 to length of x, the number of latent effects) that are observed with observations y and their respective linear predictors \( \{{\eta_i}\} \). Note that \( {\eta_i} \) is the only observed latent effect (through \( {y_i} \)) and that all the other latent effects are not observed directly and need to be estimated. In addition, the latent effects may be subject to some linear constraints of the form \( {\mathbf Ax}={\mathbf e} \). Finally, the latent field is supposed to have conditional independence properties, so that x becomes a Gaussian Markov random field (GMRF). As we will show later, these Markov properties play an important role when modeling spatial data.

The likelihood of the data \( \pi ({\mathbf y}|{\mathbf x},\theta ) \) is not constrained to be Gaussian. At the moment, INLA can deal with several likelihoods from the exponential family as well as with mixtures, such as zero-inflated distributions. Furthermore, INLA is flexible enough to allow different observations to have different likelihoods. Hence, INLA can deal with a myriad of models.

Instead of aiming at the full posterior distribution of the model parameters x and \( \theta \), Rue et al. (2009) focus on obtaining an approximation to the posterior marginal distributions \( \pi ({x_i}|{\mathbf y}) \) and \( \pi ({\theta_j}|{\mathbf y}) \). These marginals can be written down as

$$ \pi ({x_i}|{\mathbf y})\propto \int \pi ({x_i}|\theta, {\mathbf y})\,\pi (\theta |{\mathbf y})\,d\theta $$

and

$$ \pi ({\theta_j}|{\mathbf y})\propto \int \pi (\theta |{\mathbf y})\,d{\theta_{-j }} $$

Here \( {\theta_{-j }} \) denotes θ minus component \( {\theta_j} \).

The approximations target the distributions inside the integrals on the right-hand sides of the previous expressions. Note that an approximation to \( \pi (\theta |{\mathbf y}) \) is also required and that numerical integration will be feasible only if the dimension of θ is small (as often happens in practice).

A first approximation to \( \pi (\theta |{\mathbf y}) \) using Gaussian distributions can be constructed as follows:

$$ \tilde{\pi}(\theta |{\mathbf y})\propto \left. \frac{\pi ({\mathbf x},\theta, {\mathbf y})}{\tilde{\pi}_G({\mathbf x}|\theta, {\mathbf y})} \right|_{{\mathbf x}={\mathbf x}^{*}(\theta )} $$

Here \( \tilde{\pi}_G({\mathbf x}|\theta, {\mathbf y}) \) is the Gaussian approximation to the full conditional of x and \( {\mathbf x}^{*}(\theta ) \) is the mode of the full conditional for a given value of θ.

Hence, the marginals of interest can be computed using numerical integration over a multidimensional grid of values of θ. For example,

$$ \tilde{\pi}({x_i}|{\mathbf y})=\sum\limits_k\tilde{\pi}({x_i}|{\theta_k},{\mathbf y})\times \tilde{\pi}({\theta_k}|{\mathbf y})\times {\Delta_k} $$

where \( {\Delta_k} \) represents the weights for each vector of values \( {\theta_k} \) in the grid.
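The two steps above (approximating the posterior of θ at the mode of the full conditional, then summing over a grid of θ values) can be sketched on a toy model. Everything model-specific here is an assumption chosen for illustration: a single latent effect with \( y_i|x\sim N(x,1) \), \( x|\theta \sim N(0,1/\theta ) \), an Exponential(1) prior on θ, a regular grid with equal weights, and all function names. Because this toy model is Gaussian, the Gaussian approximation to the full conditional is exact, so only the integration over θ is approximate. This is not the algorithm as implemented in the INLA software.

```python
import math

def log_joint(x, theta, y):
    """log pi(x, theta, y) up to an additive constant for the toy model."""
    log_prior_theta = -theta                                   # theta ~ Exponential(1)
    log_prior_x = 0.5 * math.log(theta) - 0.5 * theta * x * x  # x | theta ~ N(0, 1/theta)
    log_lik = -0.5 * sum((yi - x) ** 2 for yi in y)            # y_i | x ~ N(x, 1)
    return log_prior_theta + log_prior_x + log_lik

def full_conditional(theta, y):
    """Mean and precision of x | theta, y (Gaussian, so the mode equals the mean)."""
    prec = theta + len(y)
    return sum(y) / prec, prec

def normal_pdf(x, mean, prec):
    return math.sqrt(prec / (2.0 * math.pi)) * math.exp(-0.5 * prec * (x - mean) ** 2)

def log_post_theta(theta, y):
    """log of pi(x, theta, y) / pi_G(x | theta, y), evaluated at the mode x*(theta)."""
    mode, prec = full_conditional(theta, y)
    return log_joint(mode, theta, y) - math.log(normal_pdf(mode, mode, prec))

def theta_weights(y, thetas, delta):
    """Normalized weights pi(theta_k | y) * Delta_k over the grid of theta values."""
    logs = [log_post_theta(t, y) for t in thetas]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) * delta for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def marginal_x(x, y, thetas, weights):
    """Finite-sum approximation to the posterior marginal pi(x | y)."""
    return sum(w * normal_pdf(x, *full_conditional(t, y))
               for t, w in zip(thetas, weights))

y = [0.8, 1.3, 0.4, 1.1, 0.9]                    # hypothetical observations
delta = 0.05
thetas = [delta * (k + 1) for k in range(200)]   # regular grid on (0, 10]
weights = theta_weights(y, thetas, delta)
```

Calling `marginal_x(x, y, thetas, weights)` over a range of `x` values then traces out the approximate posterior marginal as a finite mixture of Gaussians.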

Rue and Martino (2007) and Rue et al. (2009) stress the importance of having a good approximation to \( \pi ({x_i}|\theta, {\mathbf y}) \). A Gaussian approximation \( \mathop{\tilde{\pi}}\nolimits_G({x_i}|\theta, {\mathbf y}) \) is based on using a normal distribution with mean \( {\mu_i}(\theta ) \) and marginal variance \( \sigma_i^2(\theta ) \). The approximation provided by INLA (and in particular the Gaussian approximation for \( \pi ({\mathbf x}|\theta, {\mathbf y}) \)) is exact for Gaussian data, in which case the only error comes from the numerical integration with respect to θ. For non-Gaussian data, the Gaussian approximation may be a good starting point, but it may not suffice: it can be inaccurate if it is not centered at the correct point, and it cannot capture skewness.

For this reason, they also propose two alternatives: the Laplace approximation and the simplified Laplace approximation. Firstly, an improved approximation may be obtained by using a Laplace approximation:

$$ \mathop{\tilde{\pi}}\nolimits_{LA }({x_i}|\theta, {\bf y})\propto \frac{{\pi ({\bf x},\theta, {\bf y})}}{{\mathop{\tilde{\pi}}\nolimits_{GG }({{{\mathbf x}}_{-i }}|{x_i},\theta, {\bf y})}}{|_{{{{{\mathbf x}}_{-i }}={\bf x}_{-i}^{*}({x_i},\theta )}}} $$

Here \( \mathop{\tilde{\pi}}\nolimits_{GG }({{{\mathbf x}}_{-i }}|{x_i},\theta, {\mathbf y}) \) is a Gaussian approximation to \( {{{\mathbf x}}_{-i }}|{x_i},\theta, {\mathbf y} \) which is centered around the mode \( {\mathbf x}_{-i}^{*}({x_i},\theta ) \). As this approximation must be computed for every \( {x_i} \), some numerical techniques are required to speed up computation.

Finally, Rue et al. (2009) derive a simplified Laplace approximation by means of a series expansion of \( \mathop{\tilde{\pi}}\nolimits_{LA }({x_i}|\theta, {\mathbf y}) \) around \( {x_i}={\mu_i}(\theta ) \). This improves on the Gaussian approximation by correcting it for location and skewness. As \( \mathop{\tilde{\pi}}\nolimits_{LA }({x_i}|\theta, {\mathbf y}) \) is very expensive to compute, the simplified Laplace approximation seems the best trade-off between speed and accuracy.

It should be noted that while these approximations will center on the posterior marginal of a single latent effect \( {x_i} \) or hyperparameter \( {\theta_i} \), the methodology behind them could be applied to obtain an approximation of the joint posterior of any subset S of latent effects \( {{{\mathbf x}}_S} \) (see Sect. 6.1, Rue et al. 2009). However, in that case, the approximations become more complex and the numerical integration needed is more demanding.

2.1 Gaussian Markov Random Fields

Approximate inference using INLA is based on the assumption that the latent field x is Gaussian and fulfills some conditional independence properties. In particular, many pairs of latent effects \( {x_i} \) and \( {x_j} \) in x should be conditionally independent given the remaining latent effects \( {{{\mathbf x}}_{-ij }} \), so that the precision matrix of x is sparse. Furthermore, the number of hyperparameters appearing in the distribution of x is assumed to be small.

Rue and Held (2005) provide a description of methods for efficient computation of Gaussian Markov random fields (GMRF) which can be used to speed up computations and provide fast approximations. GMRF are the key to providing good Gaussian approximations for the posterior marginals. INLA is based on providing Gaussian approximations to densities like

$$ \pi ({\mathbf x}|\theta, {\mathbf y})\propto \exp \left\{ -\frac{1}{2}{\mathbf x}^T{\mathbf Q}{\mathbf x}+\sum\limits_{i\in \mathcal{I}}\log \pi (y_i|x_i,\theta ) \right\} $$

where Q is the precision matrix of the GMRF. Note that if Q is a known matrix, its determinant (sometimes termed Jacobian) can be ignored at this stage as the posterior distribution can be rescaled later. This distribution may be subject to a set of linear constraints \( {\mathbf Ax}={\mathbf e} \). In any case, the approximation will result in a Gaussian distribution with mean \( {\mathbf x}^{*} \) and precision matrix \( {\mathbf Q}^{*}={\mathbf Q}+\mathrm{diag}({\mathbf c}^{*}) \) (see Rue et al. 2009, Sect. 2 for details). If linear constraints are present, the mean and precision matrix of the Gaussian approximation are corrected accordingly.
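To make the form \( {\mathbf Q}+\mathrm{diag}({\mathbf c}^{*}) \) concrete, here is a scalar sketch of such a Gaussian approximation, found by Newton iterations on the log-density. The assumptions are ours: a single latent effect with prior precision `q`, a Poisson observation with log link, no linear constraints, and the hypothetical function name `gaussian_approximation`. In this case \( c^{*}=\exp (x^{*}) \) is minus the second derivative of the log-likelihood at the mode.

```python
import math

def gaussian_approximation(y, q, tol=1e-10, max_iter=100):
    """Gaussian approximation to pi(x | y) for x ~ N(0, 1/q), y | x ~ Poisson(exp(x)).

    Returns the mode x* and the precision q* = q + c*, with c* = exp(x*).
    """
    x = 0.0
    for _ in range(max_iter):
        grad = -q * x + y - math.exp(x)   # first derivative of the log-density
        curv = q + math.exp(x)            # minus the second derivative
        step = grad / curv                # Newton step
        x += step
        if abs(step) < tol:
            break
    return x, q + math.exp(x)

# Hypothetical values: one count y = 3 and prior precision q = 1.
mode, precision = gaussian_approximation(y=3, q=1.0)
```

In the multivariate case the same iteration would require solving a sparse linear system in \( {\mathbf Q}+\mathrm{diag}({\mathbf c}) \) at each step, which is where the sparsity of the GMRF pays off.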

These constrained models are useful for fitting geostatistical models and adjacency-based spatial correlation effects for areal data (e.g., using an intrinsic conditional autoregressive model). Other spatial and temporal random effects can be modeled by using intrinsic GMRFs with linear constraints (see Rue and Held 2005, Chap. 3). Linear constraints are often employed to impose a sum-to-zero constraint on intrinsic GMRFs in order to make these effects identifiable. This is particularly important when dealing with complex spatiotemporal effects (Knorr-Held 2000).

2.2 Priors

So far, we have dealt with how the likelihood and the latent Gaussian Markov random fields are defined. As in all Bayesian approaches, a set of priors needs to be assigned to the parameters.

First of all, covariate coefficients in the linear predictor will be assigned a normal distribution with zero mean and precision \( \tau \). A similar distribution will be used for the random errors \( {\epsilon_i} \).

In principle, the latent random effects will be all Gaussian with zero mean. Hence, only the parameters in the precision matrix will need a prior. For the case in which the precision matrix is of the form \( \tau {\mathbf Q} \), where \( {\mathbf Q} \) is a known matrix, \( \tau \) can be assigned either a gamma, truncated normal, or improper flat (in the log-scale) prior. If the whole precision matrix is to be assigned a prior, then a Wishart distribution is available for correlated random effects of small dimension (up to 5). Finally, the INLA software provides other prior distributions. For example, correlation parameters, such as the ones used to model spatial autocorrelation, can be assigned a beta prior.

Note that, for simple models, these choices are equivalent to setting a conjugate prior distribution and that in all cases the prior parameters are supposed to be known (i.e., these cannot be assigned a prior in turn). It should be mentioned that these priors are the ones implemented in the INLA software (available from http://www.r-inla.org), but user-defined priors can be used as well by providing the mathematical expression for them.

Other priors can be built upon simpler prior specifications. For example, spatially varying coefficients on a covariate can be implemented by using a prior which is the sum of independent and spatially correlated random effects. More information about how priors can be specified is available at http://www.r-inla.org/models/priors.

2.3 Model Criticism and Selection

INLA provides a number of ways of comparing and assessing models. First of all, an approximation to the marginal likelihood \( \pi ({\mathbf y}) \) is provided. This approximation is based on

$$ \tilde{\pi}({\mathbf y})=\int \left. \frac{\pi ({\mathbf x},\theta, {\mathbf y})}{\tilde{\pi}_G({\mathbf x}|\theta, {\mathbf y})} \right|_{{\mathbf x}={\mathbf x}^{*}(\theta )}d\theta $$

where \( \pi ({\mathbf x},\theta, {\mathbf y})=\pi (\theta )\pi ({\mathbf x}|\theta )\pi ({\mathbf y}|{\mathbf x},\theta ) \). Models with a larger value of the marginal likelihood will be preferred. Also, marginal likelihood can be used to compute Bayes factors in order to compare models.

Predictive measures can also be computed very easily. In particular, INLA can compute the predictive distribution of \( {y_i} \) given all the other observations, that is, \( \pi ({y_i}|{{{\mathbf y}}_{-i }}) \). Following Pettit (1990), INLA reports the probability integral transform (PIT):

$$ \mathrm{PIT}_i=\mathrm{Prob}(y_i^{\mathrm{new}}\leq {y_i}|{\mathbf y}_{-i }) $$

This criterion has been used to assess the validity of spatial models in disease mapping and it avoids the use of other sampling-based methods which may be less accurate (Marshall and Spiegelhalter 2003).

Roos and Held (2011) discuss sensitivity to priors for binary data using the conditional predictive ordinate (CPO, Geisser 1993), which is defined as \( \pi ({y_i}|{{{\mathbf y}}_{-i }}) \). They use the mean logarithmic CPO to build the following statistic as a measure of the predictive quality of the model:

$$ \overline{CPO}=-\frac{1}{n}\sum\limits_{i=1}^n\log \pi ({y_i}|{\mathbf y}_{-i }) $$

Lower values of \( \overline{CPO} \) indicate a better model. As the authors state, this criterion can easily be extended to other hierarchical models. Held et al. (2010) compare the CPO and PIT between “exact” Bayesian inference (using MCMC) and approximate inference (with INLA) showing that the approximated values are very close in general to the exact ones.
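The mean logarithmic CPO statistic defined above is simple to compute once the individual CPO values are available. The sketch below assumes the values \( \pi ({y_i}|{\mathbf y}_{-i }) \) have already been obtained from a fitted model; the numbers and the function name `mean_log_cpo` are hypothetical.

```python
import math

def mean_log_cpo(cpo_values):
    """Negative mean of log CPO_i; lower values indicate better predictive quality."""
    if not cpo_values or any(c <= 0.0 for c in cpo_values):
        raise ValueError("CPO values must be positive and non-empty")
    return -sum(math.log(c) for c in cpo_values) / len(cpo_values)

# Hypothetical leave-one-out predictive densities for two competing models.
cpo_model_a = [0.20, 0.35, 0.10, 0.25]
cpo_model_b = [0.15, 0.30, 0.05, 0.20]
score_a = mean_log_cpo(cpo_model_a)
score_b = mean_log_cpo(cpo_model_b)
```

Here model A assigns a higher predictive density to every held-out observation, so its score is lower and it would be preferred under this criterion.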

Finally, INLA can also compute the deviance information criterion (DIC, Spiegelhalter et al. 2002) which is a popular way of comparing Bayesian hierarchical models. The DIC also computes a measure of the effective number of parameters which is a measure of the complexity of the model.

2.4 Implementation

Besides the original paper, the authors have released software (called INLA) that implements all the techniques mentioned here. In addition, an interface for the R programming language (R Development Core Team 2011) can be downloaded from http://www.r-inla.org; it makes the software easier to use and can produce summary statistics and plots of the results.

2.5 Other Features

In addition to an easy to use interface, the INLA software provides some other features. The joint posterior distribution of the hyperparameters can be computed. In addition, it is possible to define several linear combinations of the latent effects so that their posterior marginals are computed. Furthermore, if several of these linear combinations are computed, the joint correlation matrix can be computed as well, and this can be used to approximate the joint posterior distribution.

3 Spatial Models

Spatial dependence can be modeled in different ways in Bayesian hierarchical models (Banerjee et al. 2004). Given that INLA focuses on latent Gaussian models and given that the latent effects are Gaussian, spatial correlation can be embedded in the precision matrix. Furthermore, because of the Markov properties of the latent field, these precision matrices are often very sparse. How these methods can be applied to the different areas of spatial statistics is discussed below.

3.1 Geoadditive Mixed-Effects Models

Geoadditive models appear when regression models on a set of covariates are combined with other types of random effects (Kammann and Wand 2003). A geoadditive model will be based on modeling the mean \( {\mu_i} \) at each location \( i \) on the sum of a set of fixed and random effects:

$$ {\mu_i}=\mu +{\mathbf z}_i\beta +{u_i}+{v_i} $$

where \( {\mathbf z}_i \) is a vector of covariates and β the associated coefficients, \( {\mathbf u} \) is a vector of spatially correlated random effects, and \( {\mathbf v} \) is a vector of independent random effects.

Note that this modeling can be done regardless of the likelihood employed for the data. In the case of a generalized linear model, a convenient link function will be used to transform the linear predictor accordingly.

Other nonparametric approaches can be implemented taking advantage of this approach. Kammann and Wand (2003) and Ruppert et al. (2003) show how penalized splines (P-splines) can be expressed as a mixed-effects model. Lee and Durbán (2009) describe how P-splines and a CAR model can be used to model spatial data. They develop an expression of these models as mixed-effects models. Although this is not a fully Bayesian approach, these models could be fitted with INLA using the following representation:

$$ \mu ={\mathbf X}\beta +{\mathbf Zu} $$

Here X and Z represent design matrices for the fixed and random effects which have a particular structure derived from the fact that this mixed model represents a P-spline (see Sect. 4.9 in Ruppert et al. 2003, for details). A fully Bayesian approach to P-splines can be found in Lang and Brezger (2004), and it is based on imposing a prior on the coefficients γ of a design matrix B (based on the basis functions):

$$ \mu ={\mathbf B}\gamma $$

Different priors on γ lead to different types of splines (Fahrmeir and Kneib 2011). For producing smoothed values of an observed covariate using P-splines, the prior should be a random walk. To achieve spatial smoothing, the prior on γ should be a GMRF with spatial structure. See Lang and Brezger (2004) for details on how to define B and the prior of γ for spatial smoothing.
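As an illustration of a random walk prior on γ, the sketch below builds the structure matrix of a first-order random walk, \( {\mathbf R}={\mathbf D}^T{\mathbf D} \), where D takes first differences of consecutive coefficients. The choice of a first-order walk and the function name `rw1_structure` are assumptions made for illustration; the text does not fix the order of the random walk.

```python
def rw1_structure(m):
    """Structure matrix R = D^T D of a first-order random walk on m coefficients."""
    # D is the (m-1) x m first-difference matrix: (D gamma)_k = gamma_{k+1} - gamma_k.
    D = [[0] * m for _ in range(m - 1)]
    for k in range(m - 1):
        D[k][k], D[k][k + 1] = -1, 1
    # R[i][j] = sum_k D[k][i] * D[k][j]
    return [[sum(D[k][i] * D[k][j] for k in range(m - 1)) for j in range(m)]
            for i in range(m)]

R = rw1_structure(5)
```

The resulting matrix is tridiagonal and its rows sum to zero, so the prior penalizes differences between neighboring coefficients but not their overall level.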

3.2 Disease Mapping

The analysis of public health data has played an important role in the development of spatial statistics in the last two decades. Besag et al. (1991) provided a suitable model in which spatial correlation and unstructured variation are combined in a geoadditive way which is also computationally appealing. Other authors have extended this model later, some of them for spatiotemporal disease mapping.

It should not be forgotten that disease mapping is a particular example of the analysis of lattice data. In this case, observations are aggregated over some region (counties, states, health districts, etc.) and spatial models assume that neighboring areas will have similar behavior. Here, dependence is between neighbors and a popular criterion is that two areas are neighbors if they share a common boundary.

Besag et al. (1991) proposed the use of two latent random effects: a spatially correlated one u and an independent one v. The first accounts for any spatial correlation and the second accounts for any other unstructured difference between the regions. While the unstructured random effects are Gaussian with zero mean and precision \( \tau {I_n} \) (where \( {I_n} \) is the identity matrix of size \( n\times n \)), the spatially correlated random effects are defined using conditional distributions given the values at the neighbors. This is equivalent to using an intrinsic GMRF (Rue and Held 2005, Chap. 3), which is known as the intrinsic conditional autoregressive (CAR) model.

In order to encode this spatial information into a GMRF with zero mean and precision Q, we will make use of the Markov property to note that if areas i and j are independent given the remaining areas, then \( {Q_{ij }}={Q_{ji }}=0 \). Hence, the precision matrix Q will be very sparse, and the algorithms described in Rue and Held (2005) can be used for fast sampling from this GMRF.

In particular, the intrinsic CAR precision matrix is defined as

$$ {Q_{ij }}=\kappa \begin{cases} {n_i} & i=j \\ -1 & i\sim j \\ 0 & \text{otherwise} \end{cases} $$

Here \( i\sim j \) means that areas \( i \) and \( j \) are neighbors, \( \kappa \) is a conditional precision, and \( {n_i} \) is the number of neighbors of area \( i \). This makes the conditional distribution of \( {u_i}|{\mathbf u}_{-i},\kappa \) Gaussian with mean \( \tfrac{1}{{n_i}}\sum\nolimits_{j\sim i}{u_j} \) and variance \( \tfrac{1}{\kappa {n_i}} \).

Note that the intrinsic CAR is an improper GMRF of rank \( n-1 \). For this reason the constraint \( \sum\nolimits_i\,{u_i}=0 \) is added so that these effects can be identified. This is a common assumption for random effects based on intrinsic GMRF (Martino and Rue 2010).
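The intrinsic CAR precision matrix can be assembled directly from a neighborhood structure. The following sketch assumes the map is given as an adjacency list; the 2 × 2 grid of areas and the function name `icar_precision` are hypothetical. Each row of the matrix sums to zero, which is one way to see that the matrix is singular and the GMRF is improper.

```python
def icar_precision(neighbors, kappa=1.0):
    """Intrinsic CAR precision matrix from an adjacency list {area: set of neighbors}."""
    areas = sorted(neighbors)
    index = {a: k for k, a in enumerate(areas)}
    n = len(areas)
    Q = [[0.0] * n for _ in range(n)]
    for a in areas:
        i = index[a]
        Q[i][i] = kappa * len(neighbors[a])   # kappa * n_i on the diagonal
        for b in neighbors[a]:
            Q[i][index[b]] = -kappa           # -kappa for each pair of neighbors
    return Q

# Hypothetical map: four areas on a 2 x 2 grid; neighbors share an edge.
neighbors = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
Q = icar_precision(neighbors, kappa=2.0)
row_sums = [sum(row) for row in Q]
```

For large maps a sparse representation would be preferable to the dense nested lists used here, since most entries are zero.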

A proper version of the intrinsic CAR model is available and it has a precision matrix similar to the previous one but adding a term \( d>0 \) to the diagonal elements, so that they become \( {Q_{ii }}={n_i}+d \). \( \log (d) \) is assigned a log-gamma prior distribution by default. Note that the main point of this model is to make the precision matrix strictly diagonally dominant so that it becomes invertible and the prior distribution is a proper one.

A more general approach is obtained when the precision matrix is defined as

$$ {\mathbf Q}={\mathbf I}-\frac{\rho }{\lambda_{\max }}{\mathbf C} $$

This can be used to define a general CAR spatial effect by taking C as a matrix of spatial weights (see Chap. 9 in Bivand et al. 2008, to see how different spatial weights can be defined). The parameter ρ represents the spatial correlation and can be assigned a prior; by default, a Gaussian prior is placed on \( \mathrm{logit}(\rho ) \). It takes values between 0 and 1 because the weight matrix is C divided by \( {\lambda_{\max }} \), its maximum eigenvalue. Note that this will produce a proper distribution for the spatially correlated random effects. Negative spatial autocorrelation is often ignored in disease mapping.

In this general case, the conditional distribution of \( {u_i} \) is

$$ {u_i}|{\mathbf u}_{-i},\kappa \sim N\left( \rho \frac{\sum\nolimits_{j\ne i}{w_{ij }}{u_j}}{w_{i+ }},\frac{1}{\kappa {w_{i+ }}} \right) $$

where \( {w_{ij }}={c_{ij }}/{\lambda_{\max }} \) and \( {w_{i+ }}=\sum\nolimits_{j=1}^n{w_{ij }} \). Note that if C is row standardized, then \( {\lambda_{\max }}=1 \) and \( {w_{i+ }}=1 \), and the conditional distribution has a simpler form.

3.3 Geostatistical Models

In addition to fitting a model to the data, geostatistics focuses on predicting a continuous surface (often approximated by a discrete grid of points) so these models are often computationally very expensive. Spatially correlated random effects are built for the set of sampling locations, which may lead to trouble if the number of locations is large. Geostatistical models are not restricted to Gaussian likelihoods, as described in Banerjee et al. (2004) and Diggle and Ribeiro (2007), and they can be used to model other types of data using a geostatistical latent effect.

Spatial correlation in geostatistical models is built upon the distances between the sampling points, usually through a function that decays with distance. For example, a simple covariance function is \( {\Sigma_{ij }}={\sigma^2}\exp (-{d_{ij }}/\varphi ) \). Here \( {d_{ij }} \) is the distance between points i and j, and \( \varphi \) is a parameter that controls the spatial scale. Once the model is fitted, prediction relies on the posterior distributions of the parameters and the covariances for the points in the grid.
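A minimal sketch of the exponential covariance just described is given below. It assumes planar coordinates and Euclidean distance; the coordinates, parameter values, and the function name `exponential_covariance` are illustrative only.

```python
import math

def exponential_covariance(coords, sigma2, phi):
    """Covariance matrix with entries sigma2 * exp(-d_ij / phi), Euclidean d_ij."""
    n = len(coords)
    return [[sigma2 * math.exp(-math.dist(coords[i], coords[j]) / phi)
             for j in range(n)] for i in range(n)]

# Hypothetical sampling locations, 5 units apart along a line.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
Sigma = exponential_covariance(coords, sigma2=2.0, phi=5.0)
```

Unlike the sparse precision matrices of a GMRF, this covariance matrix is dense, which is one reason why such models become expensive when the number of locations is large.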

A more general class of spatial covariance is provided by the Matérn correlation function, of which the exponential decaying function is a particular example. The Matérn covariance is defined as

$$ {\Sigma_{ij }}={\sigma^2}\frac{\tau^{\kappa }K(\tau, \kappa )}{2^{\kappa -1}\Gamma(\kappa )},\quad \tau ={\alpha_{\kappa }}{d_{ij }}/\varphi $$

Here \( K(\cdot, \kappa ) \) is the modified Bessel function of the second kind of order κ and \( \Gamma(\cdot ) \) is the gamma function. \( {\alpha_{\kappa }} \) and \( \varphi \) can be used to control the scale of the spatial variation. Setting κ to \( 0.5 \) leads to an exponential covariance. Other values of κ will lead to other known spatial covariance functions (Eidsvik et al. 2009).
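The reduction to the exponential covariance for \( \kappa =0.5 \) can be checked by hand. The short derivation below takes \( K(\cdot, \kappa ) \) to be the modified Bessel function of the second kind and uses its standard closed form of order one half, \( K(\tau, 1/2)=\sqrt{\pi /(2\tau )}\,e^{-\tau } \), together with \( \Gamma(1/2)=\sqrt{\pi } \):

$$ {\Sigma_{ij }}={\sigma^2}\frac{\tau^{1/2}\sqrt{\pi /(2\tau )}\,e^{-\tau }}{2^{-1/2}\sqrt{\pi }}={\sigma^2}\sqrt{\frac{\pi }{2}}\cdot \frac{\sqrt{2}}{\sqrt{\pi }}\,e^{-\tau }={\sigma^2}e^{-\tau } $$

With \( \tau ={\alpha_{\kappa }}{d_{ij }}/\varphi \), this coincides with the exponential covariance \( {\sigma^2}\exp (-{d_{ij }}/\varphi ) \) when \( {\alpha_{\kappa }}=1 \) for \( \kappa =0.5 \); that value of \( {\alpha_{\kappa }} \) is an assumption made here for the comparison, since the text does not state it.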

To predict on the grid, INLA treats the observation at each grid point as a missing value. INLA then computes the marginal posterior distribution at that point, from which summary statistics can be obtained.

In this approach, modeling and prediction occur on a regular grid, and observations need to be matched to some location in the grid. Lindgren et al. (2011) formulate the geostatistical model on a mesh based on a triangulation of the sampling points (rather than a regular grid), using stochastic partial differential equations (SPDE). In their approach, the spatially distributed effect \( {\mathbf u} \) is

$$ u(s)=\sum\limits_{k=1}^n\,{\psi_k}(s){w_k},\:s\in \mathbb{R}{^2} $$

where \( \{{\psi_k}\} \) are some basis functions, \( \{{w_k}\} \) are Gaussian distributed weights, and n is the number of points in the triangulation used to split the study area. As this is a more complex approach, the reader is referred to the original paper (Lindgren et al. 2011) and the gentle introduction by Cameletti et al. (2011) for details on how the basis functions and weights are chosen.

Finally, INLA can be used for geostatistical design. Methods and results discussed in Diggle et al. (2010) for preferential sampling can be reproduced with INLA (see the Case Studies section in http://www.r-inla.org). Anisotropic models could also be employed, as discussed in Fuglstad (2011), and use of these models is being integrated into the software package.

3.4 Point Process Models

Rue et al. (2009) show an example of the analysis of a point pattern with INLA using a Poisson process. Rather than modeling the continuous intensity of the point process, they divide the study area into N disjoint cells (not necessarily of equal size) and model the data as coming from a counting process. Hence, the response variable \( {y_i} \) represents the number of occurrences of the process in cell \( {w_i};\:i=1,\ldots,N \). For simplicity a square lattice may be employed. In a square lattice all the squares have the same area, and spatially correlated random effects can be defined as for lattice data (i.e., two squares are neighbors if they have a common boundary).

In their example, Rue et al. (2009) use a hierarchical Poisson process to model the number of trees in each square using a log-Gaussian Cox process (LGP). In this case, the intensity function is \( \lambda (s)=\exp \{Z(s)\};\:s\in \mathbb{R}{^2} \), where \( Z(s) \) is a Gaussian field at \( s\in \mathbb{R}{^2} \).

Hence, \( {y_i} \) is the observed number of occurrences in cell \( {w_i} \). If \( {\eta_i} \) is the realization of \( Z({s_i}) \), then \( \pi ({\mathbf y}|\eta )=\prod\nolimits_i\,\pi ({y_i}|{\eta_i}) \), where \( \pi ({y_i}|{\eta_i}) \) represents a Poisson distribution with mean \( |{w_i}|\exp ({\eta_i}) \). \( |{w_i}| \) is the area of cell \( {w_i} \).
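The cell-level likelihood described above can be written down in a few lines. The sketch below evaluates the Poisson log-likelihood with mean \( |{w_i}|\exp ({\eta_i}) \) for given values of the linear predictor; the counts, areas, and predictor values are hypothetical, as is the function name `poisson_loglik`.

```python
import math

def poisson_loglik(counts, areas, eta):
    """Sum over cells of log Poisson(y_i; mean = |w_i| * exp(eta_i))."""
    total = 0.0
    for y, area, e in zip(counts, areas, eta):
        mean = area * math.exp(e)
        # log of mean^y * exp(-mean) / y!
        total += y * math.log(mean) - mean - math.lgamma(y + 1)
    return total

# Hypothetical data for three cells of unequal area.
counts = [0, 2, 5]
areas = [1.0, 1.0, 2.0]
eta = [-1.0, 0.5, 1.0]
loglik = poisson_loglik(counts, areas, eta)
```

The cell area enters only as a multiplicative offset on the mean, so cells of unequal size are handled without changing the model for \( {\eta_i} \).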

In turn, \( {\eta_i} \) is modeled according to a number of covariates plus some random effects:

$$ {\eta_i}={X_i}\beta +{u_i}+{v_i} $$

The effects u and v are modeled as in the lattice data case. The \( {v_i} \) are independent Gaussian with zero mean and variance \( \sigma_v^2 \) so that they represent independent variation between the squares. On the other hand, the \( {u_i} \) are modeled using a second-order polynomial intrinsic GMRF. In this way, first-, second-, and third-order neighbors are taken into account, each one with a different weight, to mimic thin plate splines. See Rue and Held (2005) for details.

Simpson et al. (2011) extend the ideas in Lindgren et al. (2011) to model the latent LGP in a continuous way using a mesh on the study area. They show that this is a better approach that reduces the computational burden as a mesh is used instead of a regular grid and there is no need to aggregate cases into small cells.

More complex models cannot be fully addressed using INLA, in particular, those for which a closed likelihood does not exist as, for example, Gibbs processes. In a Gibbs process, future observations depend on present observations and, hence, producing a likelihood in closed form is not feasible.

4 Examples

As it happens, INLA is one of many alternatives for fitting Bayesian hierarchical models. In this section we provide a comparison to other software available for the R programming language, including computing times. Our aim here is not to provide a full comparison of computation times but to indicate how different approaches compare in terms of time and accuracy of results when used to fit a similar model to the same data set.

4.1 Geostatistics

For geostatistical models, we will use the Rongelap data set, which has been analyzed in several works on model-based geostatistics (Diggle and Ribeiro 2007). This data set records radionuclide concentration at 157 different locations, and the interest lies in providing an estimate of the concentration over the whole of Rongelap Island.

Because INLA performs its computations on a regular grid, we have considered a 5 × 5 regular grid on one of the clusters in the northeast part of the island to make a fair comparison between computing times. We have used the INLA software (using the Laplace approximation) and the R package geoRglm, which provides model fitting using MCMC. The different computation times are shown in Table 71.1, while a map comparing the different estimates is shown in Fig. 71.1.

Table 71.1 Summary of computation times for different problems, software, and fitting methods

Fig. 71.1 Estimates of the radionuclide concentration using different methods: integrated nested Laplace approximation (INLA) and MCMC (using geoRglm)

4.2 Lattice Data

For the case of lattice data, we have used the total number of deaths from malignant neoplasms in Georgia in 1999. We have fitted the model proposed in Besag et al. (1991) with population density as a covariate. In this case, we have used the INLA software as well as WinBUGS. Computing times are available in Table 71.1 and a graphical comparison of the estimates is available in Fig. 71.2.

Fig. 71.2 Estimates of the relative risk using different methods: standardized mortality ratio (SMR), integrated nested Laplace approximation (INLA), and MCMC (using WinBUGS)

4.3 Point Patterns

Finally, a point pattern has been included; we have performed an analysis of the Japanese pines data set available in the R package spatstat. This data set provides the locations of Japanese pine saplings in a square region of a natural forest. Again, model fitting with INLA requires the use of a regular square grid, so that the data are the number of saplings in each grid square. A 10 × 10 square grid has been used in this case, and the model to account for spatial dependence is the same as in the previous example (Besag et al. 1991). This will also give us an idea of how INLA behaves as the grid size increases.

Figure 71.3 summarizes the fitted number of saplings and computing times are available in Table 71.1. It is worth noting how the differences between INLA and WinBUGS have increased now.

Fig. 71.3 Estimates of the number of saplings per square using two different methods: integrated nested Laplace approximation (INLA) and MCMC (using WinBUGS)

5 Conclusions

The integrated nested Laplace approximation developed in Rue et al. (2009) provides a series of approximations for the posterior marginals of the parameters of a Bayesian hierarchical model in which the latent effects are a Gaussian Markov random field. This family covers a good number of Bayesian hierarchical models, including several of those most used in spatial statistics. In addition, Markov properties are very convenient in dealing with spatial data and they can be used to model local dependence. Besides an approximation to the posterior marginals of the parameters in the model, INLA can compute several criteria for model criticism and selection, such as the PIT and the DIC.

Regarding spatial models, INLA has been used to tackle problems in the analysis of lattice data, geostatistics, and point processes. In all cases, spatial dependence is modeled via the precision matrix of Gaussian random effects. The recent developments by Lindgren et al. (2011) allow for continuous modeling of latent spatial effects, which avoids the use of a grid and provides a good computational approach as well.

The availability of associated software that implements all these methods provides a suitable framework for their wider use. Other external software may be required to display the results in maps or create adjacency matrices for the analysis of lattice data. For this reason, the authors of the INLA software have provided an interface to the R programming language. The R-INLA web site (http://www.r-inla.org) provides the latest version of the software and its documentation as well as an updated list of published and working papers.