The likelihood function plays an important role in Bayesian inference, since it connects the observed data with the statistical model. Both simulation-based (e.g. MCMC) and optimisation-based (e.g. variational Bayes) algorithms require the likelihood to be evaluated pointwise, up to an unknown normalising constant. However, there are situations where this evaluation is analytically and computationally intractable. For example, when the complexity of the likelihood grows at a combinatorial rate in the number of observations, likelihood-based inference quickly becomes infeasible for the scale of data that is regularly encountered in applications.

Intractable likelihoods arise in a variety of contexts, including models for DNA mutation in population genetics [43, 64], models for the spread of disease in epidemiology [46, 60], models for the formation of galaxies in astronomy [12], and estimation of the model evidence in Bayesian model choice [29]. This chapter will mainly focus on Markov random field (MRF) models with discrete state spaces, such as the Ising, Potts, and exponential random graph models (ERGM). These models are used for image segmentation or analysis of social network data, two areas where millions of observations are commonplace. There is therefore a need for scalable inference algorithms that can handle these large volumes of data.

The Ising, Potts, or ERGM likelihood functions can be expressed in the form of an exponential family:

$$\displaystyle \begin{aligned} p(\mathbf{y} \mid \boldsymbol\theta) = \frac{\exp\left\{\boldsymbol\theta^T \mathbf{s}(\mathbf{y}) \right\}}{\mathcal{C}(\boldsymbol\theta)}, \end{aligned} $$
(6.1)

where the observed data \(\mathbf{y} = (y_1, \dots, y_n)\) is in the form of an undirected graph, θ is a vector of unknown parameters, s(y) is a corresponding vector of jointly-sufficient statistics for these parameters, and \(\mathcal {C}(\boldsymbol \theta )\) is an intractable normalising constant, also known as a partition function:

$$\displaystyle \begin{aligned} \mathcal{C}(\boldsymbol\theta) = \sum_{\mathbf{y} \in \mathcal{Y}} \exp\left\{\boldsymbol\theta^T \mathbf{s}(\mathbf{y}) \right\}, \end{aligned} $$
(6.2)

where the sum is over all possible configurations of states, \(\mathbf {y} \in \mathcal {Y}\).

In the case of an Ising model, a single node can take one of two possible values, \(y_i \in \{0, 1\}\). For example, in image analysis the value 1 might represent a foreground pixel, while 0 represents the background. The q-state Potts model generalises this construction to more than two states, so \(y_i \in \{1, \dots, q\}\). The cardinality of the configuration space, \(\#\mathcal {Y}\), is then \(q^n\). Even with only 2 states and n = 100 pixels, computation of (6.2) requires more than \(10^{30}\) floating point operations. It would take a supercomputer with 100 PetaFLOPS over 400,000 years to find an answer.
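To make this concrete, the following R sketch evaluates the partition function (6.2) by brute force for a tiny binary lattice, using the bond statistic defined in (6.3) below. The function name `ising_partition` and the edge-list representation are illustrative choices, not code from the chapter, and the approach is only feasible for very small n.

```r
# Brute-force evaluation of the partition function (6.2) for a tiny Ising
# model. 'edges' is a two-column matrix listing the neighbour pairs in the
# graph (a hypothetical input). The cost grows as 2^n.
ising_partition <- function(n, edges, beta) {
  configs <- as.matrix(expand.grid(rep(list(0:1), n)))   # all 2^n configurations
  s <- apply(configs, 1, function(y) sum(y[edges[, 1]] == y[edges[, 2]]))  # Eq. (6.3)
  sum(exp(beta * s))                                     # Eq. (6.2)
}

# Example: a 2 x 2 lattice (4 nodes, 4 neighbour pairs)
edges <- rbind(c(1, 2), c(3, 4), c(1, 3), c(2, 4))
ising_partition(4, edges, beta = 0.8)
```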

Both the Ising and Potts models possess a single parameter, β, known as the inverse temperature. The corresponding sufficient statistic is then:

$$\displaystyle \begin{aligned} s(\mathbf{y}) = \sum_{i \sim \ell \in \mathcal{E}} \delta(y_i, y_\ell), \end{aligned} $$
(6.3)

where \(\mathcal {E}\) is the set of all unique pairs of neighbours \(i \sim \ell\) in the observed graph, and δ(x, y) is the Kronecker delta function, which equals 1 when x = y and 0 otherwise. We assume a first-order neighbourhood structure, so a given pixel \(y_i\) would have up to 4 neighbours in a regular 2D lattice, or 6 neighbours in 3D. Pixels on the boundary of the image domain have fewer than 4 (or 6) neighbours, so \(\#\mathcal {E} = 2(n - \sqrt {n})\) for a square 2D lattice, or \(3(n - n^{2/3})\) for a cube.
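For a regular 2D lattice, the statistic (6.3) can be computed in vectorised form by comparing each pixel with its east and south neighbours. A minimal R sketch, where the function name `potts_stat` and the matrix representation of the labels are assumptions made for illustration:

```r
# Sufficient statistic (6.3) for a q-state Potts model on a regular 2D
# lattice with a first-order neighbourhood. 'labels' is a matrix of states.
potts_stat <- function(labels) {
  horiz <- sum(labels[, -ncol(labels)] == labels[, -1])   # east-west neighbour pairs
  vert  <- sum(labels[-nrow(labels), ] == labels[-1, ])   # north-south neighbour pairs
  horiz + vert
}

# Example: a 4 x 4 lattice with q = 3 states, so #E = 2(16 - 4) = 24 pairs
set.seed(1)
y_lab <- matrix(sample(1:3, 16, replace = TRUE), nrow = 4)
potts_stat(y_lab)
```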

The observed data for an ERGM can be represented as a binary adjacency matrix Y, encoding the presence or absence of a neighbourhood relationship between nodes i and j: \([Y]_{ij} = 1\) if i ∼ j; \([Y]_{ij} = 0\) otherwise. \(\#\mathcal {Y}\) for an ERGM is equal to \(2^M\), where M = n(n − 1)∕2 is the maximum number of ties in an undirected graph with n nodes. As with the Ising or Potts models, computing the normalising constant (6.2) is therefore intractable for non-trivial graphs. Various kinds of ERGM can be defined by the choice of sufficient statistics. The simplest example is the Bernoulli random graph [23], which has a single statistic \(s_1(Y) = m\), the number of ties (edges) in the graph. In an undirected graph, this is half the number of nonzero entries in the adjacency matrix. An important class of graph statistics is the number of k-stars [26], which can be defined in terms of the degree distribution [59]:

$$\displaystyle \begin{aligned} n_k = \sum_{i=1}^n \binom{d_i}{k}, {} \end{aligned} $$
(6.4)

where the degree d i is the number of neighbours of node i:

$$\displaystyle \begin{aligned} d_i = \sum_{j=1}^n [Y]_{ij}. {} \end{aligned} $$
(6.5)

Note that under this definition \(n_1 = 2m\), since each tie is counted twice. An alternative definition, which avoids double-counting, is given by:

$$\displaystyle \begin{aligned} \begin{array}{ll} n_1 = \sum_{i<j} [Y]_{ij}\ & \mbox{number of edges}\\ n_2 = \sum_{i<j<k} [Y]_{ik} [Y]_{jk} & \mbox{number of 2-stars}\\ n_3 = \sum_{i<j<l<k} [Y]_{ik} [Y]_{jk} [Y]_{lk} & \mbox{number of 3-stars.} \end{array} \end{aligned}$$
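The degree-based definition (6.4)–(6.5) translates directly into R; the function name `kstar_stats` below is a hypothetical choice, and the Bernoulli random graph is used only to generate example input.

```r
# k-star statistics from a binary adjacency matrix Y (symmetric, zero diagonal)
kstar_stats <- function(Y, k_max = 3) {
  d <- rowSums(Y)                                         # degrees d_i, Eq. (6.5)
  sapply(seq_len(k_max), function(k) sum(choose(d, k)))   # n_k, Eq. (6.4)
}

# Example: a Bernoulli random graph on 10 nodes with tie probability 0.3
set.seed(42)
n <- 10
Y <- matrix(0, n, n)
Y[upper.tri(Y)] <- rbinom(n * (n - 1) / 2, 1, 0.3)
Y <- Y + t(Y)                # symmetrise
kstar_stats(Y)               # note that n_1 = 2m under the definition (6.4)
```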

The remainder of this chapter will describe various MCMC methods that target the posterior distribution π(θ ∣ y), or some approximation thereof. This will be in the context of a random walk Metropolis (RWM) algorithm that proposes a new value of θ at iteration t using a (multivariate) Gaussian proposal distribution, \(q(\cdot \mid \boldsymbol \theta ^{(t-1)}) \sim \mathcal {N}(\boldsymbol \theta ^{(t-1)}, \varSigma _t)\). Methods for tuning the proposal bandwidth \(\varSigma _t\) have been described by Andrieu and Thoms [3] and Roberts and Rosenthal [67]. Normally, the proposed parameter value θ′ would be accepted with probability \(\min \{1, \rho _t\}\), or else rejected, where ρ t is the Radon–Nikodým derivative:

$$\displaystyle \begin{aligned} \rho_t = \frac{q\left(\boldsymbol\theta^{(t-1)} \mid \boldsymbol\theta'\right) p\left(\mathbf{y} \mid \boldsymbol\theta'\right) \pi_0\left(\boldsymbol\theta'\right)}{q\left(\boldsymbol\theta' \mid \boldsymbol\theta^{(t-1)}\right) p\left(\mathbf{y} \mid \boldsymbol\theta^{(t-1)}\right) \pi_0\left(\boldsymbol\theta^{(t-1)}\right)}, \end{aligned} $$
(6.6)

\(\pi_0(\boldsymbol\theta)\) is the prior density for the parameter(s), and \(p\left (\mathbf {y} \mid \boldsymbol \theta \right )\) is the likelihood (6.1). If we use a symmetric proposal distribution q and a uniform prior \(\pi_0\), then these terms will cancel, leaving:

$$\displaystyle \begin{aligned} \rho_t = \frac{\psi\left(\mathbf{y} \mid \boldsymbol\theta'\right)}{\psi(\mathbf{y} \mid \boldsymbol\theta^{(t-1)})} \frac{\mathcal{C}(\boldsymbol\theta^{(t-1)})}{\mathcal{C}\left(\boldsymbol\theta'\right)}, \end{aligned} $$
(6.7)

which is the ratio of unnormalised likelihoods \(\psi = \exp \left \{\boldsymbol \theta ^T \mathbf {s}(\mathbf {y})\right \}\), multiplied by the ratio of intractable normalising constants (6.2). It is clearly infeasible to evaluate (6.7) directly, so alternative algorithms are required. One option is to estimate ρ t by simulation; we categorise these approaches as auxiliary variable methods: pseudo-marginal algorithms, the exchange algorithm, and approximate Bayesian computation (ABC). Other methods include path sampling, also known as thermodynamic integration (TI), pseudolikelihood, and composite likelihood.
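Taking logarithms of (6.7) makes explicit which quantities are actually required: the sufficient statistics of the observed data and the difference of log normalising constants,

$$\displaystyle \begin{aligned} \log \rho_t = \left(\boldsymbol\theta' - \boldsymbol\theta^{(t-1)}\right)^T \mathbf{s}(\mathbf{y}) + \log \mathcal{C}(\boldsymbol\theta^{(t-1)}) - \log \mathcal{C}(\boldsymbol\theta'). \end{aligned} $$

The methods that follow either estimate the intractable terms by simulation (pseudo-marginal and exchange algorithms), estimate the log-ratio of normalising constants by path sampling (thermodynamic integration), or replace the likelihood with a tractable approximation (ABC, pseudolikelihood, and composite likelihood).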

1 Auxiliary Variable Methods

1.1 Pseudo-Marginal Algorithms

Pseudo-marginal algorithms [2, 6] are computational methods for fitting latent variable models, that is, where the observed data y can be considered as noisy observations of some unobserved or hidden states, x. For example, hidden Markov models (HMMs) are commonly used in time series analysis and signal processing. Models of this form can also arise as the result of data augmentation approaches, such as for mixture models [17, 73]. The marginal likelihood is of the following form:

$$\displaystyle \begin{aligned} p(\mathbf{y} \mid \boldsymbol\theta) = \int_{\mathcal{X}} p(\mathbf{y} \mid \mathbf{x}) \,p(\mathbf{x} \mid \boldsymbol\theta) \,d\mathbf{x}, \end{aligned} $$
(6.8)

which can be intractable if the state space is very high-dimensional and non-Gaussian. In this case, we can substitute an unbiased, non-negative estimate of the likelihood.

O’Neill et al. [60] introduced the Monte Carlo within Metropolis (MCWM) algorithm, which replaces both \(p\left (\mathbf {y} \mid \boldsymbol \theta '\right )\) and \(p\left (\mathbf {y} \mid \boldsymbol \theta ^{(t-1)}\right )\) in the Metropolis–Hastings ratio ρ t (6.6) with importance-sampling estimates:

$$\displaystyle \begin{aligned} \tilde{p}_{IS}(\mathbf{y} \mid \boldsymbol\theta) \approx \frac{1}{M} \sum_{m=1}^M p(\mathbf{y} \mid X_m) \frac{p(X_m \mid \boldsymbol\theta)}{q(X_m \mid \boldsymbol\theta)}, \end{aligned} $$
(6.9)

where the samples \(X_1, \dots, X_M\) are drawn from a proposal distribution \(q(X_m \mid \boldsymbol\theta)\), for both \(\boldsymbol\theta'\) and \(\boldsymbol\theta^{(t-1)}\). MCWM is generally considered an approximate algorithm, since it does not target the exact posterior distribution for θ. However, Medina-Aguayo et al. [47] have established some conditions under which MCWM converges to the correct target distribution as M → ∞. See also [55] and [1] for further theoretical analysis of approximate pseudo-marginal methods.

Beaumont [6] introduced the grouped independence Metropolis–Hastings (GIMH) algorithm, which does target the exact posterior. The key difference is that \(\tilde {p}_{IS}\left (\mathbf {y} \mid \boldsymbol \theta ^{(t-1)}\right )\) is reused from the previous iteration, rather than being recalculated every time. The theoretical properties of this algorithm have been an active area of research, with notable contributions by Andrieu and Roberts [2], Maire et al. [41], Andrieu and Vihola [4], and Sherlock et al. [69]. Andrieu et al. [5] introduced the particle MCMC (PMCMC) algorithm, which is a pseudo-marginal method that uses sequential Monte Carlo (SMC) in place of importance sampling. This is particularly useful for HMMs, where SMC methods such as the bootstrap particle filter provide an unbiased estimate of the marginal likelihood [62]. Although importance sampling and SMC both provide unbiased estimators, it is necessary to use a large enough value of M so that the variance is kept at a reasonable level. Otherwise, the pseudo-marginal algorithm can fail to be variance-bounding or geometrically ergodic [39]. Doucet et al. [18] recommend choosing M so that the standard deviation of the log-likelihood estimator is between 1 and 1.7.
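The distinction between MCWM and GIMH amounts to whether the likelihood estimate at the current parameter value is refreshed at each iteration. Below is a minimal R sketch of a pseudo-marginal random walk Metropolis sampler under that assumption; `loglike_hat` is a user-supplied function (not part of the chapter) returning the log of a non-negative, unbiased likelihood estimate such as (6.9).

```r
# Sketch of a pseudo-marginal random walk Metropolis sampler for a scalar
# parameter. loglike_hat(theta) returns the log of an unbiased estimate of
# p(y | theta), e.g. the importance-sampling estimate (6.9).
pseudo_marginal <- function(loglike_hat, theta0, n_iter, sd_prop = 0.1,
                            log_prior = function(th) 0, mcwm = FALSE) {
  theta <- theta0
  ll_curr <- loglike_hat(theta)
  draws <- numeric(n_iter)
  for (t in seq_len(n_iter)) {
    if (mcwm) ll_curr <- loglike_hat(theta)   # MCWM: re-estimate at every iteration
    theta_prop <- rnorm(1, theta, sd_prop)    # symmetric random walk proposal
    ll_prop <- loglike_hat(theta_prop)
    log_rho <- ll_prop + log_prior(theta_prop) - ll_curr - log_prior(theta)
    if (log(runif(1)) < log_rho) {            # accept with probability min{1, rho}
      theta <- theta_prop
      ll_curr <- ll_prop                      # GIMH: reuse the accepted estimate
    }
    draws[t] <- theta
  }
  draws
}
```

With `mcwm = FALSE` this corresponds to the GIMH scheme, which targets the exact posterior; with `mcwm = TRUE` it is the approximate MCWM scheme.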

Pseudo-marginal algorithms can be computationally intensive, particularly for large values of M. One strategy to reduce this computational burden, known as the Russian Roulette algorithm [40], is to replace \(\tilde {p}_{IS}(\mathbf {y} \mid \boldsymbol \theta )\) (6.9) with a truncated infinite series:

$$\displaystyle \begin{aligned} \tilde{p}_{RR}(\mathbf{y} \mid \boldsymbol\theta) = \sum_{j=0}^\tau V_{\boldsymbol\theta}^{(j)}, \end{aligned} $$
(6.10)

where τ is a random stopping time and \(V_{\boldsymbol \theta }^{(j)}\) are random variables such that (6.10) is almost surely finite and \(\mathbb {E}[\tilde {p}_{RR}(\mathbf {y} \mid \boldsymbol \theta )] = p(\mathbf {y} \mid \boldsymbol \theta )\). There is a difficulty with this method, however, in that the likelihood estimates are not guaranteed to be non-negative. Jacob and Thiery [37] have established that there is no general solution to this sign problem, although successful strategies have been proposed for some specific models.

Another important class of algorithms for accelerating pseudo-marginal methods involves approximating the intractable likelihood function using a surrogate model. For example, the delayed-acceptance (DA) algorithm of [14] first evaluates the Metropolis–Hastings ratio (6.6) using a fast, approximate likelihood \(\tilde {p}_{DA}(\mathbf {y} \mid \boldsymbol \theta )\). The proposal θ′ is rejected at this screening stage with probability \(1 - \min \{1, \rho _t\}\). Otherwise, a second ratio \(\rho ^{(2)}_{DA}\) is calculated using a full evaluation of the likelihood function (6.9). The acceptance probability at the second stage is \(\min \{1, \rho ^{(2)}_{DA}\}\), where:

$$\displaystyle \begin{aligned} \rho_{DA}^{(2)} = \frac{\tilde{p}_{IS}(\mathbf{y} \mid \boldsymbol\theta') \,\pi_0(\boldsymbol\theta')}{\tilde{p}_{IS}(\mathbf{y} \mid \boldsymbol\theta^{(t-1)}) \,\pi_0(\boldsymbol\theta^{(t-1)})} \frac{\tilde{p}_{DA}(\mathbf{y} \mid \boldsymbol\theta^{(t-1)}) \,\pi_0(\boldsymbol\theta^{(t-1)})}{\tilde{p}_{DA}(\mathbf{y} \mid \boldsymbol\theta') \,\pi_0(\boldsymbol\theta')}, \end{aligned} $$
(6.11)

which corrects for the conditional dependence on acceptance at the first stage and therefore preserves the exact target distribution. DA has been used for PMCMC by Golightly et al. [33], where the linear noise approximation [25] was used for \(\tilde {p}_{DA}\). Sherlock et al. [70] instead used k-nearest-neighbours for \(\tilde {p}_{DA}\) in a pseudo-marginal algorithm.
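A rough sketch of a single delayed-acceptance step is given below, assuming a symmetric proposal and a uniform prior. The function names `loglike_approx` (standing in for \(\tilde{p}_{DA}\)) and `loglike_full` (standing in for \(\tilde{p}_{IS}\)) are placeholders; in a pseudo-marginal implementation the stochastic estimate at the current value would normally be stored and reused rather than recomputed.

```r
# Two-stage delayed acceptance for a symmetric proposal and uniform prior.
da_step <- function(theta, theta_prop, loglike_approx, loglike_full) {
  # Stage 1: screen the proposal using the cheap surrogate likelihood
  log_rho1 <- loglike_approx(theta_prop) - loglike_approx(theta)
  if (log(runif(1)) >= log_rho1) return(theta)       # early rejection
  # Stage 2: correct using the full likelihood estimate, as in (6.11)
  log_rho2 <- loglike_full(theta_prop) - loglike_full(theta) +
              loglike_approx(theta) - loglike_approx(theta_prop)
  if (log(runif(1)) < log_rho2) theta_prop else theta
}
```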

Drovandi et al. [22] proposed an approximate pseudo-marginal algorithm, using a Gaussian process (GP) as a surrogate log-likelihood. The GP is trained using a pilot run of MCWM, then at each iteration \(\log \tilde {p}(\mathbf {y} \mid \boldsymbol \theta ')\) is either approximated using the GP or else using SMC or importance sampling, depending on the level of uncertainty in the surrogate model for θ. MATLAB source code is available from http://www.runmycode.org/companion/view/2663. Stuart and Teckentrup [71] have shown that, under certain assumptions, a GP provides a consistent estimator of the negative log-likelihood, and they provide error bounds on the approximation.

1.2 Exchange Algorithm

Møller et al. [50] introduced an MCMC algorithm for the Ising model that targets the exact posterior distribution for β. An auxiliary variable x is defined on the same state space as y, so that \(\mathbf {x}, \mathbf {y} \in \mathcal {Y}\). This is a data augmentation approach, where we simulate from the joint posterior π(β, x ∣ y), which admits the posterior for β as its marginal. Given a proposed parameter value β′, a proposal x′ is simulated from the model to obtain an exact draw from (6.1). This requires perfect simulation methods, such as coupling from the past [65], perfect slice sampling [49], or bounding chains [8, 35]. Refer to [36] for further explanation of perfect simulation. Instead of (6.7), the joint ratio for β′ and x′ becomes:

$$\displaystyle \begin{aligned} \rho_t = \frac{\psi\left(\mathbf{y} \mid \beta'\right)}{\psi\left(\mathbf{y} \mid \beta^{(t-1)}\right)} \frac{\psi\left(\mathbf{x}' \mid \tilde\beta\right)}{\psi\left({\mathbf{x}}^{(t-1)} \mid \tilde\beta\right)} \frac{\psi({\mathbf{x}}^{(t-1)} \mid \beta^{(t-1)})}{\psi\left(\mathbf{x}' \mid \beta'\right)}, \end{aligned} $$
(6.12)

where \(\tilde\beta\) is a fixed value of the parameter (for example, an approximate maximum likelihood or maximum pseudolikelihood estimate of β), and the normalising constants \(\mathcal {C}(\beta ')\) and \(\mathcal {C}(\beta ^{(t-1)})\) cancel out with each other. This is analogous to an importance-sampling estimate of the normalising constant with M = 1 samples, since:

$$\displaystyle \begin{aligned} \mathbb{E}_{\mathbf{x}}\left[ \frac{\psi\left(\mathbf{x} \mid \beta\right)}{q(\mathbf{x} \mid \beta)} \right] = \mathcal{C}(\beta), \end{aligned} $$
(6.13)

where the proposal distribution q(x ∣ β) is (6.1). This algorithm is therefore closely related to pseudo-marginal methods such as GIMH.

Murray et al. [54] found that (6.12) could be simplified even further, removing the need for a fixed value of \(\tilde \beta \). The exchange algorithm replaces (6.7) with the ratio:

$$\displaystyle \begin{aligned} \rho_t = \frac{\psi\left(\mathbf{y} \mid \beta'\right)}{\psi\left(\mathbf{y} \mid \beta^{(t-1)}\right)} \frac{\psi(\mathbf{x}' \mid \beta^{(t-1)})}{\psi\left(\mathbf{x}' \mid \beta'\right)}. \end{aligned} $$
(6.14)

However, perfect sampling is still required to simulate x at each iteration, which can be infeasible when the state space is very large. Cucala et al. [15] proposed an approximate exchange algorithm (AEA) by replacing the perfect sampling step with 500 iterations of Gibbs sampling. Caimo and Friel [9] were the first to employ AEA for fully-Bayesian inference on the parameters of an ERGM. AEA for the hidden Potts model is implemented in the R package ‘bayesImageS’ [51] and AEA for ERGM is implemented in ‘Bergm’ [10].
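The following R sketch illustrates AEA for the single-parameter Potts model, reusing `potts_stat()` from earlier. The simulator `gibbs_sample(nr, nc, q, beta, sweeps)` is a hypothetical stand-in for the auxiliary draw: perfect sampling in the exact exchange algorithm, or a fixed number of Gibbs sweeps in AEA. A uniform prior on β ≥ 0 is assumed.

```r
# Sketch of the approximate exchange algorithm (AEA) for the 2D Potts model.
# gibbs_sample() is a hypothetical simulator of the Gibbs distribution (6.1).
aea_potts <- function(y_lab, q, n_iter, sd_prop = 0.05, sweeps = 500) {
  s_obs <- potts_stat(y_lab)
  beta <- 0
  draws <- numeric(n_iter)
  for (t in seq_len(n_iter)) {
    beta_prop <- rnorm(1, beta, sd_prop)
    if (beta_prop >= 0) {                    # flat prior on beta >= 0 assumed
      x <- gibbs_sample(nrow(y_lab), ncol(y_lab), q, beta_prop, sweeps)
      # log of (6.14): the intractable normalising constants cancel
      log_rho <- (beta_prop - beta) * (s_obs - potts_stat(x))
      if (log(runif(1)) < log_rho) beta <- beta_prop
    }
    draws[t] <- beta
  }
  draws
}
```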

1.3 Approximate Bayesian Computation

Like the exchange algorithm, ABC uses an auxiliary variable x to decide whether to accept or reject the proposed value of θ. In the terminology of ABC, x is referred to as “pseudo-data.” Instead of a Metropolis–Hastings ratio such as (6.7), the summary statistics of the pseudo-data and the observed data are directly compared. The proposal is accepted if the distance between these summary statistics is within the ABC tolerance, 𝜖. This produces the following approximation:

$$\displaystyle \begin{aligned} p\left(\boldsymbol\theta \mid \mathbf{y} \right) \;\approx\; \pi_\epsilon\left(\boldsymbol\theta \mid \| \mathbf{s}(\mathbf{x}) - \mathbf{s}(\mathbf{y}) \| < \epsilon\right), \end{aligned} $$
(6.15)

where ∥⋅∥ is a suitable norm, such as Euclidean distance. Since s(y) are jointly-sufficient statistics for the Ising, Potts, or ERGM, the ABC approximation (6.15) approaches the true posterior as n → ∞ and 𝜖 → 0. In practice there is a tradeoff between the number of parameter values that are accepted and the size of the ABC tolerance.
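As a concrete (if inefficient) illustration, the R sketch below implements ABC rejection sampling for the single-parameter Potts model, reusing `potts_stat()` and the hypothetical `gibbs_sample()` simulator from the earlier sketches, with a uniform prior on [0, prior_max].

```r
# ABC rejection sampling targeting the approximation (6.15).
abc_potts <- function(y_lab, q, n_draws, epsilon, prior_max = 2, sweeps = 500) {
  s_obs <- potts_stat(y_lab)
  accepted <- numeric(0)
  for (i in seq_len(n_draws)) {
    beta <- runif(1, 0, prior_max)                               # draw from the prior
    x <- gibbs_sample(nrow(y_lab), ncol(y_lab), q, beta, sweeps) # pseudo-data
    if (abs(potts_stat(x) - s_obs) < epsilon)                    # compare summary statistics
      accepted <- c(accepted, beta)
  }
  accepted   # approximate draws from the ABC posterior
}
```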

Grelaud et al. [34] were the first to use ABC to obtain an approximate posterior for β in the Ising/Potts model. Everitt [24] used ABC within sequential Monte Carlo (ABC-SMC) for Ising and ERGM. ABC-SMC uses a sequence of target distributions \(\pi _{\epsilon _t} \left (\boldsymbol \theta \mid \| \mathbf {s}(\mathbf {x}) - \mathbf {s}(\mathbf {y}) \| < \epsilon _t \right )\) such that \(\epsilon_1 > \epsilon_2 > \dots > \epsilon_T\), where the number of SMC iterations T can be determined dynamically using a stopping rule. The ABC-SMC algorithm of [19] uses multiple MCMC steps for each SMC iteration, while the algorithm of [16] uses multiple replicates of the summary statistics for each particle. Everitt [24] has provided a MATLAB implementation of ABC-SMC with the online supplementary material accompanying his paper.

The computational efficiency of ABC is dominated by the cost of drawing updates to the auxiliary variable, as reported by Everitt [24]. Thus, we would expect that the execution time for ABC would be similar to AEA or pseudo-marginal methods. Various approaches to improving this runtime have recently been proposed. “Lazy ABC” [63] involves early termination of the simulation step at a random stopping time, hence it bears some similarities with Russian Roulette. Surrogate models have also been applied in ABC, using a method known as Bayesian indirect likelihood [BIL; 20, 21]. Gaussian processes (GPs) have been used as surrogate models by Wilkinson [75] and Meeds and Welling [48]. Järvenpää et al. [38] used a heteroskedastic GP model and demonstrated how the output of the precomputation step could be used for Bayesian model choice. Moores et al. [52] introduced a piecewise linear approximation for ABC-SMC with Ising/Potts models. Boland et al. [7] derived a theoretical upper bound on the bias introduced by this and similar piecewise approximations. They also developed a piecewise linear approximation for ERGM. Moores et al. [53] introduced a parametric functional approximate Bayesian (PFAB) algorithm for the Potts model, which is a form of BIL where \(\tilde {p}_{BIL}(\mathbf {y} \mid \boldsymbol \theta )\) is derived from an integral curve.

2 Other Methods

2.1 Thermodynamic Integration

Since the Ising, Potts, and ERGM are all exponential families of distributions, the expectation of their sufficient statistics can be expressed in terms of the normalising constant:

$$\displaystyle \begin{aligned} \mathbb{E}_{\mathbf{y} | \boldsymbol\theta}[\mathbf{s}(\mathbf{y})] = \frac{\mathrm{d}}{\mathrm{d}\boldsymbol\theta} \log\{\mathcal{C}(\boldsymbol\theta)\}. \end{aligned} $$
(6.16)

Gelman and Meng [31] derived an approximation to the log-ratio of normalising constants for the Ising/Potts model, using the path sampling identity:

$$\displaystyle \begin{aligned} \log\left\{\frac{\mathcal{C}(\beta_{t-1})}{\mathcal{C}(\beta')}\right\} = \int_{\beta'}^{\beta_{t-1}} \mathbb{E}_{\mathbf{y} | \beta}[s(\mathbf{y})] \, \mathrm{d}\beta , \end{aligned} $$
(6.17)

which follows from (6.16). The value of the expectation can be estimated by simulating from the Gibbs distribution (6.1) for fixed values of β. At each iteration, \(\log \{\rho _t\}\) (6.7) can then be approximated by numerical integration methods, such as Gaussian quadrature or the trapezoidal rule. Figure 6.1 illustrates linear interpolation of \(\mathbb {E}_{\mathbf {y} | \beta }[s(\mathbf {y})]\) on a 2D lattice for q = 6 labels and β ranging from 0 to 2 in increments of 0.05. This approximation was precomputed using the algorithm of [72].

Fig. 6.1 Approximation of \( \mathbb {E}_{\mathbf {y} | \beta }[s(\mathbf {y})]\) by simulation for fixed values of β, with linear interpolation
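The precomputation behind Fig. 6.1 can be sketched in R as follows: \(\mathbb{E}_{\mathbf{y}|\beta}[s(\mathbf{y})]\) is estimated by simulation on a grid of β values (again using the hypothetical `gibbs_sample()` simulator), and the log-ratio of normalising constants in (6.17) is then obtained by numerically integrating a linear interpolant of that curve.

```r
# Estimate E[s(y) | beta] on a grid of beta values by averaging the sufficient
# statistic over n_rep simulated lattices at each grid point.
estimate_curve <- function(nr, nc, q, beta_grid, n_rep = 10, sweeps = 500) {
  sapply(beta_grid, function(b)
    mean(replicate(n_rep, potts_stat(gibbs_sample(nr, nc, q, b, sweeps)))))
}

# Approximate log{ C(beta_old) / C(beta_new) } as in (6.17), by integrating a
# linear interpolant of the precomputed curve. beta_grid is assumed to span
# the range of integration.
log_C_ratio <- function(beta_old, beta_new, beta_grid, E_s) {
  f <- approxfun(beta_grid, E_s)
  sgn <- if (beta_old >= beta_new) 1 else -1
  sgn * integrate(f, lower = min(beta_old, beta_new),
                  upper = max(beta_old, beta_new))$value
}
```

\(\log\rho_t\) in (6.7) can then be approximated as \((\beta' - \beta^{(t-1)})\, s(\mathbf{y})\) plus `log_C_ratio(beta_old = beta[t-1], beta_new = beta_prop, ...)`.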

TI is explained in further detail by Chen et al. [13, chap. 5]. A reference implementation in R is available from the website accompanying [42]. Friel and Pettitt [29] introduced the method of power posteriors to estimate the marginal likelihood or model evidence using TI. Calderhead and Girolami [11] provide bounds on the discretisation error and derive an optimal temperature schedule by minimising the variance of the Monte Carlo estimate. Oates et al. [56] introduced control variates for further reducing the variance of TI.

The TI algorithm has an advantage over auxiliary variable methods because the additional simulations are performed prior to fitting the model, rather than at each iteration. This is particularly the case when analysing multiple images that all have approximately the same dimensions. Since these simulations are independent, they can make use of massively parallel hardware. However, the computational cost is still slightly higher than pseudolikelihood, which does not require a pre-computation step.

2.2 Composite Likelihood

Pseudolikelihood is the simplest of the methods that we have considered and also the fastest. Rydén and Titterington [68] showed that the intractable distribution (6.1) could be approximated using the product of the conditional densities:

$$\displaystyle \begin{aligned} \tilde{p}_{PL}(\mathbf{y} \mid \boldsymbol\theta) \approx \prod_{i=1}^n p(y_i \mid y_{\setminus i}, \boldsymbol\theta). \end{aligned} $$
(6.18)

This enables the Metropolis–Hastings ratio ρ t (6.6) to be evaluated using (6.18) to approximate both \(p\left (\mathbf {y} \mid \boldsymbol \theta '\right )\) and \(p\left (\mathbf {y} \mid \boldsymbol \theta ^{(t-1)}\right )\) at each iteration. The conditional density function for the Ising/Potts model is given by:

$$\displaystyle \begin{aligned} p(y_i \mid y_{\setminus i}, \beta) = \frac{\exp\left\{\beta\sum_{\ell \in \partial(i)}\delta(y_i,y_\ell)\right\}}{\sum_{j=1}^q \exp\left\{\beta\sum_{\ell \in \partial(i)}\delta(j,y_\ell)\right\}}, \end{aligned} $$
(6.19)

where \(\partial(i)\) denotes the set of first-order (nearest) neighbours of pixel i. The conditional density for an ERGM is given by the logistic function:

$$\displaystyle \begin{aligned} p([Y]_{ij} = 1 \mid [Y]_{\setminus ij}, \boldsymbol\theta) = \mathrm{logit}^{-1}\left\{\boldsymbol\theta^T \mathbf{s}(Y) \right\}. \end{aligned} $$
(6.20)

Pseudolikelihood is exact when θ = 0 and provides a reasonable approximation for small values of the parameters. However, the approximation error increases rapidly for the Potts/Ising model as β approaches the critical temperature, \(\beta_{crit}\), as illustrated by Fig. 6.2. This is due to long-range dependence between the labels, which is inadequately modelled by the local approximation. Similar issues can arise for ERGM, which can also exhibit a phase transition.
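A direct (unvectorised) R sketch of the log point pseudolikelihood (6.18)–(6.19) for the Potts model, using the same matrix representation of the labels as the earlier sketches; the function name is again a placeholder.

```r
# Log point pseudolikelihood (6.18)-(6.19) for a q-state Potts model on a
# 2D lattice with first-order neighbourhood.
potts_pseudolikelihood <- function(labels, beta, q) {
  nr <- nrow(labels); nc <- ncol(labels)
  logPL <- 0
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      # labels of the first-order neighbours of pixel (i, j)
      nbrs <- c(if (i > 1) labels[i - 1, j], if (i < nr) labels[i + 1, j],
                if (j > 1) labels[i, j - 1], if (j < nc) labels[i, j + 1])
      counts <- tabulate(nbrs, nbins = q)      # neighbours in each state
      logPL <- logPL + beta * counts[labels[i, j]] -
               log(sum(exp(beta * counts)))    # log of (6.19)
    }
  }
  logPL
}
```

The ratio ρ t in (6.6) is then approximated by exponentiating the difference of the log pseudolikelihoods evaluated at θ′ and θ^{(t−1)}.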

Fig. 6.2 Approximation error of pseudolikelihood for n = 12, q = 3 in comparison to the exact likelihood calculated using a brute force method: (a) expectation, \(\sum _{\mathbf {y} \in \mathcal {Y}} s(\mathbf {y}) p(\mathbf {y} \mid \beta )\), using either Eq. (6.1) or (6.18); (b) standard deviation, \(\sqrt {\sum _{\mathbf {y} \in \mathcal {Y}} \left ( s(\mathbf {y}) - \mathbb {E}_{\mathbf {y}|\beta }[s(\mathbf {y})] \right )^2 p(\mathbf {y} \mid \beta )}\)

Rydén and Titterington [68] referred to Eq. (6.18) as point pseudolikelihood, since the conditional distributions are computed for each pixel individually. They suggested that the accuracy could be improved using block pseudolikelihood. This is where the likelihood is calculated exactly for small blocks of pixels, then (6.18) is modified to be the product of the blocks:

$$\displaystyle \begin{aligned} \tilde{p}_{BL}(\mathbf{y} \mid \boldsymbol\theta) \approx \prod_{i=1}^{N_B} p({\mathbf{y}}_{B_i} | {\mathbf{y}}_{\setminus B_i}, \boldsymbol\theta) {} \end{aligned} $$
(6.21)

where \(N_B\) is the number of blocks, \({\mathbf {y}}_{B_i}\) are the labels of the pixels in block \(B_i\), and \({\mathbf {y}}_{\setminus B_i}\) are all of the labels except for \({\mathbf {y}}_{B_i}\). This is a form of composite likelihood, where the likelihood function is approximated as a product of simplified factors [74]. Friel [27] compared point pseudolikelihood to composite likelihood with blocks of 3 × 3, 4 × 4, 5 × 5, and 6 × 6 pixels, and showed that (6.21) outperformed (6.18) for the Ising (q = 2) model with \(\beta < \beta_{crit}\). Okabayashi et al. [58] discuss composite likelihood for the Potts model with q > 2 and have provided an open source implementation in the R package ‘potts’ [32].

Evaluating the conditional likelihood in (6.21) involves the normalising constant for \({\mathbf {y}}_{B_i}\), which is a sum over all of the possible configurations \(\mathcal {Y}_{B_i}\). This is a limiting factor on the size of blocks that can be used. The brute force method that was used to compute Fig. 6.2 is too computationally intensive for this purpose. Pettitt et al. [61] showed that the normalising constant can be calculated exactly for a cylindrical lattice by computing eigenvalues of a \(k^r \times k^r\) matrix, where k is the number of possible states and r is the smaller of the number of rows or columns. The value of (6.2) for a free-boundary lattice can then be approximated using path sampling. Friel and Pettitt [28] extended this method to larger lattices using a composite likelihood approach.

The reduced dependence approximation (RDA) is another form of composite likelihood. Reeves and Pettitt [66] introduced a recursive algorithm to calculate the normalising constant using a lag-r representation. Friel et al. [30] divided the image lattice into sub-lattices of size \(r_1 < r\), then approximated the normalising constant of the full lattice using RDA:

$$\displaystyle \begin{aligned} \mathcal{C}(\beta) \approx \frac{\mathcal{C}_{r_1 \times n}(\beta)^{r - r_1 + 1}}{\mathcal{C}_{r_1 - 1 \times n}(\beta)^{r - r_1}} {} \end{aligned} $$
(6.22)

McGrory et al. [44] compared RDA to pseudolikelihood and the exact method of [50], reporting similar computational cost to pseudolikelihood but with improved accuracy in estimating β. Ogden [57] showed that if r is chosen proportional to n, then RDA gives asymptotically valid inference when β < β crit. However, the error increases exponentially as β approaches the phase transition. This is similar to the behaviour of pseudolikelihood in Fig. 6.2. Source code for RDA is available in the online supplementary material for McGrory et al. [45].

3 Conclusion

This chapter has surveyed a variety of computational methods for Bayesian inference with intractable likelihoods. Auxiliary variable methods, such as the exchange algorithm and pseudo-marginal algorithms, target the exact posterior distribution. However, their computational cost can be prohibitive for large datasets. Algorithms such as delayed acceptance, Russian Roulette, and “lazy ABC” can accelerate inference by reducing the number of auxiliary variables that need to be simulated, without modifying the target distribution. Bayesian indirect likelihood (BIL) algorithms approximate the intractable likelihood using a surrogate model, such as a Gaussian process or piecewise function. As with thermodynamic integration, BIL can take advantage of a precomputation step to train the surrogate model in parallel. This enables these methods to be applied to much larger datasets by managing the tradeoff between approximation error and computational cost.