
6.1 Introduction

In this chapter we describe the use of graphical models in a Bayesian setting, in which parameters are treated as random quantities on equal footing with the random variables. This allows complex stochastic systems to be modelled. This is one of the most successful application areas of graphical models; we give only a brief introduction here and refer to Albert (2009) for a more comprehensive exposition.

The paradigm used in Chaps. 2, 4 and 5 was that of identifying a joint distribution of a number of variables based on independent and identically distributed samples, with parameters unknown apart from restrictions determined by a log-linear, Gaussian, or mixed graphical model.

In contrast, Chap. 3 illustrated how a joint distribution for a Bayesian network may be constructed from a collection of conditional distributions; the network can subsequently be used to infer values of interesting unobserved quantities given evidence, i.e. observations of other quantities. As parameters and random variables are on an equal footing in the Bayesian paradigm, we may think of the interesting unobserved quantities as parameters and the evidence as data.

In the present chapter we follow this idea through in a general statistical setting. We focus mainly on constructing full joint distributions of a system of observed and unobserved random variables by specifying a collection of conditional distributions for a graphical model given as a directed acyclic graph with nodes representing all these quantities. Bayes’ theorem is then invoked to perform the necessary inference.

6.2 Bayesian Graphical Models

6.2.1 Simple Repeated Sampling

In the simplest possible setting we specify the joint distribution of a parameter θ and data x through a prior distribution π(θ) for θ and a conditional distribution p(x | θ) of data x for fixed value of θ, leading to the joint distribution

$$p(x,\theta)=p(x\,|\,\theta)\pi(\theta).$$

The prior distribution represents our knowledge (or rather uncertainty) about θ before the data have been observed. After observing X=x, our posterior distribution \(\pi^*(\theta)\) of θ is obtained by conditioning on the data x to obtain

$$\pi^*(\theta)=p(\theta|x) = \frac{p(x|\theta)\pi(\theta)}{p(x)}\propto L(\theta)\pi(\theta),$$

where L(θ)=p(x | θ) is the likelihood. Thus the posterior is proportional to the likelihood times the prior, and the normalizing constant is the marginal density \(p(x)=\int p(x\,|\,\theta)\pi(\theta)\,d\theta\).
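To make the prior-to-posterior calculation concrete, the following small R sketch computes a posterior numerically on a grid of θ values; the Beta(2,2) prior and the binomial data (7 successes in 10 trials) are purely illustrative choices, not taken from the text.

  theta <- seq(0, 1, length.out = 201)        # grid of parameter values
  prior <- dbeta(theta, 2, 2)                 # prior density pi(theta) (illustrative)
  lik   <- dbinom(7, size = 10, prob = theta) # likelihood L(theta): 7 successes in 10 trials
  post  <- lik * prior                        # unnormalized posterior
  post  <- post / sum(post * (theta[2] - theta[1]))  # normalize numerically
  sum(theta * post * (theta[2] - theta[1]))   # approximate posterior mean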

If the data is a sample \(x=(x^{1},x^{2},x^{3},x^{4},x^{5})\) we can represent this process by a small Bayesian network as shown to the left in Fig. 6.1. This network represents the model

$$p(x^1,\dots,x^5,\theta) = \pi(\theta) \prod_{\nu=1}^5 p(x^\nu\,|\,\theta),$$

reflecting that the individual observations are conditionally independent and identically distributed given θ. We can make a more compact representation of the network by introducing a plate which indicates repeated observations, as shown to the right in Fig. 6.1.

Fig. 6.1 Representation of a Bayesian model for simple sampling. The graph to the left indicates that observations are conditionally independent given θ; the picture to the right represents the same, but the plate allows a more compact representation

For a more sophisticated example, consider a graphical Gaussian model given by the conditional independence \(X_1\perp\!\!\!\perp X_3\,|\,X_2\) for fixed value of the concentration matrix K. In previous chapters we would have represented this model with its dependence graph:

[Figure: the undirected dependence graph with edges \(X_1 - X_2\) and \(X_2 - X_3\)]

However, in the Bayesian setting we need to include the parameters explicitly in the model, and could for example do so by means of the graph in Fig. 6.2.

Fig. 6.2 A chain graph representing N independent observations of \(X=(X_1,X_2,X_3)\) from a Bayesian graphical Gaussian model in which \(X_1\perp\!\!\!\perp X_3\,|\,X_2\) for fixed K, and K follows a hyper Markov prior distribution

The model is now represented by a chain graph, where the first chain component describes the structure of the prior distribution for the parameters in the concentration matrix. We have here assumed a so-called hyper Markov prior distribution (Dawid and Lauritzen 1993): conditionally on \(k_{22}\), the parameters \((k_{11},k_{12})\) are independent of \((k_{23},k_{33})\). The plate indicates that there are N independent observations of X, so the graph has 3N+5 nodes. The chain component on the plate reflects the factorization

$$p(x_1,x_2,x_3\,|\,K)=p(x_2\,|\,K)\,p(x_1\,|\,x_2,K)\,p(x_3\,|\,x_2,K)$$

for each of the individual observations of \(X=(X_1,X_2,X_3)\).

6.2.2 Models Based on Directed Acyclic Graphs

A key feature of Bayesian graphical models is that explicitly including parameters and observations themselves in the graphical representation enables much more complex observational patterns to be accommodated. Consider for example a linear regression model

$$Y_i \sim N(\mu_i, \sigma^2) \quad \mbox{with}\ \mu_i = \alpha+\beta x_i \mbox{ for } i=1,\ldots,N.$$

To obtain a full probabilistic model we must specify a joint distribution for (α,β,σ), whereas the independent variables \(x_i\) are assumed known (observed). If we specify independent prior distributions for these quantities, Fig. 6.3 shows a plate-based representation of this model with α, β, and σ being marginally independent and the \(Y_i\) conditionally independent given these parameters.

Fig. 6.3 Graphical representations of a traditional linear regression model with unknown intercept α, slope β, and variance \(\sigma^2\). In the representation to the left, the means \(\mu_i\) have been represented explicitly

Note that the \(\mu_i\) are deterministic functions of their parents and the same model can also be represented without explicitly including these nodes. However, there can be specific advantages to representing the means directly in the graph. If the independent variables \(x_i\) are not centered, i.e. \(\bar{x}\neq 0\), the model would change if \(x_i\) were replaced with \(x_i-\bar{x}\), as α would then be the conditional mean when \(x_i=\bar{x}\) rather than when \(x_i=0\), inducing a different distribution of \(\mu_i\).

For a full understanding of the variety and complexity of models that can easily be described by DAGs with plates, we refer to the manual for BUGS (Spiegelhalter et al. 2003), which also gives the following example.

Weights have been measured weekly for 30 young rats over five weeks. The observations \(Y_{ij}\) are the weights of rat i measured at age \(x_j\). The model is essentially a random effects linear growth curve:

$$Y_{ij} \sim\mathcal{N}(\alpha_{i} + \beta_i(x_j - \bar{x}), \sigma_c^{2})$$

and

$$\alpha_i \sim\mathcal{N}(\alpha_c, \sigma_\alpha^{2}),\qquad \beta_i \sim\mathcal{N}(\beta_c, \sigma_\beta^{2}),$$

where \(\bar{x} = 22\). Interest particularly focuses on the intercept at zero time (birth), denoted \(\alpha_{0} = \alpha_{c} -\beta_{c} \bar{x}\). The graphical representation of this model is displayed in Fig. 6.4.

Fig. 6.4 Graphical representation of a random coefficient regression model for the growth of rats
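In the BUGS language the rat growth model can be written along the following lines; this is a sketch in which the vague priors on \(\alpha_c\), \(\beta_c\) and the precisions are illustrative choices, and the model is held as an R character string so that it can be written to a model file.

  rats.model <- "
  model {
    for (i in 1:N) {
      for (j in 1:T) {
        Y[i, j] ~ dnorm(mu[i, j], tau.c)
        mu[i, j] <- alpha[i] + beta[i] * (x[j] - x.bar)
      }
      alpha[i] ~ dnorm(alpha.c, tau.alpha)
      beta[i]  ~ dnorm(beta.c, tau.beta)
    }
    x.bar     <- mean(x[])
    alpha.c   ~ dnorm(0, 1.0E-6)     # illustrative vague priors
    beta.c    ~ dnorm(0, 1.0E-6)
    tau.c     ~ dgamma(1.0E-3, 1.0E-3)
    tau.alpha ~ dgamma(1.0E-3, 1.0E-3)
    tau.beta  ~ dgamma(1.0E-3, 1.0E-3)
    sigma.c   <- 1 / sqrt(tau.c)
    alpha0    <- alpha.c - beta.c * x.bar   # intercept at birth
  }
  "
  writeLines(rats.model, "rats.bug")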

For a final illustration we consider the chest clinic example in Sect. 3.1.1. Figure 6.5 shows a directed acyclic graph with plates representing N samples from the chest clinic network.

Fig. 6.5 A graphical representation of N samples from the chest clinic network, with parameters unknown and marginally independent for seven of the nodes

Here we have introduced a parameter node for each of the variables. Each of these nodes may contain parameters for the conditional distribution of a node given any configuration of its parents, so that, following Spiegelhalter and Lauritzen (1990), we would write for the joint model

$$p(x,\theta)=\prod_{v\in V}\pi(\theta_v)\prod_{\nu=1}^N p(x^\nu _v\,|\,x^\nu_{\operatorname{pa}(v)}, \theta_v).$$

6.3 Inference Based on Probability Propagation

If the prior distributions of the unknown parameters are concentrated on a finite number of possibilities, i.e. the parameters are all discrete, the marginal posterior distribution of each of these parameters can simply be obtained by probability propagation in a Bayesian network with 7+8N nodes, inserting the observations as observed evidence. The moral graph of this network is shown in Fig. 6.6. This graph can be triangulated by just adding edges between \(x^{\nu}_{L}\) and \(x^{\nu}_{B}\) and the associated junction tree would thus have 10N cliques of size at most 4. Thus, propagation would be absolutely feasible, even for large N.

Fig. 6.6 Moral and triangulated graph of N samples from the chest clinic network, with seven unknown parameters

We illustrate this procedure in the simple case of N=3 where we only introduce unknown parameters for the probability of visiting Asia and the probability of a smoker having lung cancer, each having three possible levels, low, medium and high. We first define the parameter nodes

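One way to do this in R is with the gRain package; the node names theta.A and theta.L and the uniform prior weights below are our own illustrative choices.

  library(gRain)
  param.levels <- c("low", "medium", "high")
  ## parameter node for the probability of visiting Asia, with a uniform prior
  theta.A <- cptable(~theta.A, values = c(1, 1, 1), levels = param.levels)
  ## parameter node for the probability of lung cancer for a smoker
  theta.L <- cptable(~theta.L, values = c(1, 1, 1), levels = param.levels)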

and then specify a template for the probabilities, where we note that A and L have an extra parent

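A possible template is a function that builds the eight conditional probability tables for case i, giving asia and lung the extra parents theta.A and theta.L; this is a sketch, and the numerical values attached to the three parameter levels are illustrative only.

  yn <- c("yes", "no")
  chest.case <- function(i) {
    nm <- function(v) paste0(v, i)               # node names asia1, tub1, ...
    fo <- function(s) as.formula(s)              # build a cptable formula from a string
    list(
      ## asia has the extra parent theta.A (illustrative values per level)
      cptable(fo(paste0("~", nm("asia"), "|theta.A")),
              values = c(.005, .995, .01, .99, .05, .95), levels = yn),
      cptable(fo(paste0("~", nm("tub"), "|", nm("asia"))),
              values = c(5, 95, 1, 99), levels = yn),
      cptable(fo(paste0("~", nm("smoke"))), values = c(1, 1), levels = yn),
      ## lung has the extra parent theta.L (child varies fastest, then smoke, then theta.L)
      cptable(fo(paste0("~", nm("lung"), "|", nm("smoke"), ":theta.L")),
              values = c(.05, .95, .01, .99,
                         .10, .90, .01, .99,
                         .20, .80, .01, .99), levels = yn),
      cptable(fo(paste0("~", nm("bronc"), "|", nm("smoke"))),
              values = c(6, 4, 3, 7), levels = yn),
      cptable(fo(paste0("~", nm("either"), "|", nm("lung"), ":", nm("tub"))),
              values = c(1, 0, 1, 0, 1, 0, 0, 1), levels = yn),
      cptable(fo(paste0("~", nm("xray"), "|", nm("either"))),
              values = c(98, 2, 5, 95), levels = yn),
      cptable(fo(paste0("~", nm("dysp"), "|", nm("bronc"), ":", nm("either"))),
              values = c(9, 1, 7, 3, 8, 2, 1, 9), levels = yn))
  }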

We create three instances of the pattern defined above. In these instances the variable name asia[i] is replaced by asia1, asia2 and asia3 respectively.

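Continuing the sketch, the three instances can be created with the template function:

  case.cpts <- c(chest.case(1), chest.case(2), chest.case(3))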

We then proceed to the specification of the full network which is displayed in Fig. 6.7:

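Continuing the sketch, the full network is obtained by combining the two parameter nodes with the three instantiated cases:

  plist    <- compileCPT(c(list(theta.A, theta.L), case.cpts))
  chest.bn <- compile(grain(plist))
  chest.bn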
Fig. 6.7 Bayesian network for the chest clinic example with two unknown parameter nodes and two potential observations of the network. Parameters appear as nodes in the graph

Finally we insert evidence for three observed cases, none of whom have been to Asia, all being smokers, one of them presenting with dyspnoea, one with a positive X-ray, one with dyspnoea and a negative X-ray; we then query the posterior distribution of the parameters:

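Continuing the sketch, the evidence for the three cases can be entered and the parameter nodes queried as follows:

  chest.ev <- setEvidence(chest.bn,
      nodes  = c("asia1", "smoke1", "dysp1",
                 "asia2", "smoke2", "xray2",
                 "asia3", "smoke3", "dysp3", "xray3"),
      states = c("no", "yes", "yes",
                 "no", "yes", "yes",
                 "no", "yes", "yes", "no"))
  querygrain(chest.ev, nodes = c("theta.A", "theta.L"))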

We see that the probability of visiting Asia is now more likely than before to be low, whereas the probability of having lung cancer for a smoker is more likely to be high.

In the special case where all cases have been completely observed, it is not necessary to form the full network with 7+8N nodes, but updating can be performed sequentially as follows.

Let \(p^{*}_{n}(\theta)\) denote the posterior distribution of θ given n observations \(x^{1},\dots,x^{n}\), i.e. \(p^{*}_{n}(\theta)=p(\theta\,|\,x^{1},\dots,x^{n})\). We then have the recursion:

$$p^{*}_{n}(\theta)\propto p(x^{n}\,|\,\theta)\,p^{*}_{n-1}(\theta).$$

Hence we can incorporate evidence from the n-th observation by using the posterior distribution from the first n−1 observations as a prior distribution for a network representing only a single case. It follows from the moral graph in Fig. 6.6 that if all nodes in the plates are observed, the seven parameters are also conditionally independent in the posterior distribution after n observations. If cases are incomplete, such a sequential scheme can only be used approximately (Spiegelhalter and Lauritzen 1990).

6.4 Computations Using Monte Carlo Methods

In most cases the posterior distribution

$$ \pi^*(\theta) = p(\theta|x) = \frac{p(x|\theta)\pi(\theta)}{p(x)}\propto p(x|\theta)\pi(\theta)$$
(6.1)

of the parameters of interest cannot be calculated or represented in a simple fashion. This would for example be the case if the parameter nodes in Fig. 6.5 had values in a continuum and there were incomplete observations, such as in the example given in the previous section.

In such models one will often resort to Markov chain Monte Carlo (MCMC) methods: we cannot calculate \(\pi^*(\theta)\) analytically but if we can generate samples \(\theta^{(1)},\dots,\theta^{(M)}\) from the distribution \(\pi^*(\theta)\), we can do just as well.

6.4.1 Metropolis–Hastings and the Gibbs Sampler

Such samples can be generated by the Metropolis–Hastings algorithm. In the following we change the notation slightly.

We suppose that we know p(x) only up to a normalizing constant. That is to say, p(x)=k(x)/c, where k(x) is known but c is unknown. We partition x into blocks, for example \(x=(x_1,x_2,x_3)\).

We wish to generate samples \(x^{1},\dots,x^{M}\) from p(x). Suppose we have a sample \(x^{t-1}=(x_{1}^{t-1},x_{2}^{t-1},x_{3}^{t-1})\) and that \(x_1\) has already been updated to \(x_{1}^{t}\) in the current iteration. The task is to update \(x_2\). To do so we need to specify a proposal distribution \(h_2\) from which we can sample candidate values for \(x_2\). The single component Metropolis–Hastings algorithm works as follows:

1. Draw \(x_{2} \sim h_{2}(\cdot\,|\,x_{1}^{t},x_{2}^{t-1},x_{3}^{t-1})\). Draw \(u\sim U(0,1)\).

2. Calculate the acceptance probability

$$ \alpha=\min\biggl(1,\frac{p(x_2\,|\,x_1^t, x_3^{t-1})h_2(x_2^{t-1}\,|\,x_1^{t},x_2, x_3^{t-1})}{p(x_2^{t-1}\,|\,x_1^t, x_3^{t-1})h_2(x_2\,|\,x_1^{t},x_2^{t-1}, x_3^{t-1})}\biggr)$$
(6.2)

3. If \(u<\alpha\), set \(x_{2}^{t}=x_{2}\); else set \(x_{2}^{t}=x_{2}^{t-1}\).

The samples \(x^{1},\dots,x^{M}\) generated this way will form an ergodic Markov chain that, under certain conditions, has p(x) as its stationary distribution so that the expectation of any function of x can be calculated approximately as

$$\int f(x)p(x)\,dx= \lim_{M\to\infty} \frac{1}{M}\sum_{\nu=1}^Mf(x^\nu)\approx\frac{1}{M}\sum_{\nu=1}^M f(x^\nu).$$

Note that \(p(x_{2}\,|\,x_{1}^{t}, x_{3}^{t-1}) \propto p(x_{1}^{t},x_{2},x_{3}^{t-1})\propto k(x_{1}^{t},x_{2},x_{3}^{t-1})\) and therefore the acceptance probability can be calculated even though p(x) may only be known up to proportionality.
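As an illustration, the following R sketch implements a single component Metropolis–Hastings update for a toy two-dimensional target; the unnormalized density k and the random-walk proposal are illustrative choices, and with a symmetric proposal the h-terms in (6.2) cancel.

  set.seed(1)
  k <- function(x) exp(-sum(x^2)/2 - (x[1]*x[2] - 1)^2)  # unnormalized target density (toy example)
  M <- 5000
  out <- matrix(NA, M, 2)
  x <- c(0, 0)
  for (t in 1:M) {
    for (j in 1:2) {                            # update one component at a time
      cand    <- x
      cand[j] <- rnorm(1, mean = x[j], sd = 1)  # symmetric random-walk proposal
      alpha   <- min(1, k(cand) / k(x))         # acceptance probability, cf. (6.2)
      if (runif(1) < alpha) x <- cand
    }
    out[t, ] <- x
  }
  colMeans(out)                                 # Monte Carlo estimate of E(X)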

A special case of the single component Metropolis–Hastings algorithm is the Gibbs sampler: if as proposal distribution \(h_2\) we choose \(p(x_{2}\,|\,x_{1}^{t},x_{3}^{t-1})\) then the acceptance probability becomes 1 because terms cancel in (6.2). The conditional distribution of a single component \(X_2\) given all other components \((X_1,X_3)\) is known as the full conditional distribution.

For a directed graphical model, the density of the full conditional distributions can be easily identified:

$$p(x_i\,|\,x_{V\setminus\{i\}}) = p(x_i\,|\,x_{\operatorname{bl}(i)}) \propto p(x_i\,|\,x_{\operatorname{pa}(i)})\prod_{v\in\operatorname{ch}(i)}p(x_v\,|\,x_{\operatorname{pa}(v)}),$$
(6.3)

where \(\operatorname{bl}(i)\) is the Markov blanket of node i:

$$\operatorname{bl}(i)=\operatorname{pa}(i)\cup\operatorname{ch}(i)\cup\biggl\{\bigcup_{v\in \operatorname{ch}(i)}\operatorname{pa}(v)\setminus\{i\}\biggr\}$$

or, equivalently, the neighbours of i in the moral graph, see Sect. 1.4.1. Note that (6.3) holds even if some of the nodes involved in the expression correspond to values that have been observed. To sample from the posterior distribution of the unobserved values given the observed ones, only unobserved variables should be updated in the Gibbs sampling cycle.

In this way, a Markov chain of pseudo-observations from all unobserved variables is generated, and those corresponding to quantities (parameters) of interest can be monitored.
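To illustrate, here is a small Gibbs sampler in R for normally distributed observations with unknown mean μ and precision τ; the data, the priors \(\mu\sim N(0,1/0.01)\) and \(\tau\sim\mbox{Gamma}(0.01,0.01)\), and the standard semi-conjugate full conditionals used below are illustrative choices, not from the text.

  set.seed(1)
  y <- c(4.1, 5.2, 3.8, 4.9, 5.5, 4.4)       # illustrative data
  n <- length(y)
  mu0 <- 0; tau0 <- 0.01; a <- 0.01; b <- 0.01
  M <- 5000
  mu <- numeric(M); tau <- numeric(M)
  mu[1] <- mean(y); tau[1] <- 1 / var(y)
  for (t in 2:M) {
    ## full conditional of mu given tau and the data
    prec  <- tau0 + n * tau[t - 1]
    mu[t] <- rnorm(1, mean = (tau0 * mu0 + tau[t - 1] * sum(y)) / prec,
                   sd = 1 / sqrt(prec))
    ## full conditional of tau given mu and the data
    tau[t] <- rgamma(1, shape = a + n / 2, rate = b + sum((y - mu[t])^2) / 2)
  }
  mean(mu[-(1:1000)])                        # posterior mean of mu after burn-in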

6.4.2 Using WinBUGS via R2WinBUGS

The program WinBUGS (Gilks et al. 1994) is based on the idea that the user specifies a Bayesian graphical model in terms of a DAG, including the conditional distribution of every node given its parents. WinBUGS then identifies the Markov blanket of every node and, using properties of the full conditional distributions in (6.3), automatically generates a sampler. As the name suggests, WinBUGS is available on Windows platforms only. WinBUGS can be interfaced from R via the R2WinBUGS package (Sturtz et al. 2005) and to do this, WinBUGS must be installed. R2WinBUGS works by calling WinBUGS, doing the computations there, shutting WinBUGS down and returning control to R.

The model described in Fig. 6.3 can be specified in the BUGS language as follows (notice that the dispersion of a normal distribution is parameterized in terms of the concentration τ, where \(\tau=\sigma^{-2}\)):

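One way of writing the model, held here as an R character string so that it can be saved to a file in the next step, is the following sketch; the vague priors are illustrative choices.

  lines.model <- "
  model {
    for (i in 1:N) {
      Y[i] ~ dnorm(mu[i], tau)
      mu[i] <- alpha + beta * x[i]
    }
    alpha ~ dnorm(0, 1.0E-6)       # illustrative vague priors
    beta  ~ dnorm(0, 1.0E-6)
    tau   ~ dgamma(1.0E-3, 1.0E-3)
    sigma <- 1 / sqrt(tau)
  }
  "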

BUGS comes with a Windows interface in the program WinBUGS. To analyse this model in R we can use the package R2WinBUGS. First we save the model specification to a plain text file:

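Using the model string sketched above:

  writeLines(lines.model, "lines.bug")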

We specify data:

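As an illustration (the values below are the classic BUGS line-example data, standing in for the actual data):

  lines.data <- list(Y = c(1, 3, 3, 3, 5), x = c(1, 2, 3, 4, 5), N = 5)  # illustrative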

As the sampler must start somewhere, we specify initial values for the unknowns:

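For a single chain, one possibility is:

  lines.inits <- list(list(alpha = 0, beta = 0, tau = 1))  # illustrative starting values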

We may now ask WinBUGS for a sample from the model:

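A sketch of the call, assuming WinBUGS is installed in its default location (the iteration settings are illustrative):

  library(R2WinBUGS)
  lines.res <- bugs(data = lines.data, inits = lines.inits,
                    parameters.to.save = c("alpha", "beta", "sigma"),
                    model.file = "lines.bug",
                    n.chains = 1, n.iter = 5000, n.burnin = 1000, n.thin = 1)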

The object lines.res contains the output. A simple summary of the samples is

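With the sketch above, this can be obtained by printing the returned object:

  print(lines.res)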

We next convert the output to a format suitable for analysis with the coda package:

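One way to do this is to turn the matrix of simulations stored in the result into an mcmc object:

  library(coda)
  lines.coda <- mcmc(lines.res$sims.matrix)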

A summary of the posterior distribution of the monitored parameters is as follows:

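With the mcmc object created above:

  summary(lines.coda)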

As the observations are very informative, the posterior distributions of the regression parameters α and β are similar to the sampling distributions obtained from a standard linear regression analysis:

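For comparison, with the illustrative data used above, the classical fit is:

  summary(lm(Y ~ x, data = data.frame(Y = lines.data$Y, x = lines.data$x)))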

A traceplot (see Fig. 6.8) of the samples is useful for visually checking for indications that the sampler has not converged. There appears to be no problem here:

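Using the lattice-based plotting methods provided by coda:

  library(lattice)
  xyplot(lines.coda)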

A plot of the marginal posterior densities (see Fig. 6.9) provides a supplement to the numeric summaries shown above:

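Again using coda's lattice-based methods:

  densityplot(lines.coda)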
Fig. 6.8 A traceplot of the samples produced by BUGS is a useful tool for visual inspection of indications that the sampler has not converged

Fig. 6.9 A plot of each posterior marginal distribution provides a supplement to the numeric summary statistics

6.5 Various

An alternative to WinBUGS is OpenBUGS (Spiegelhalter et al. 2011). The two programs have the same genesis and the model specification languages are very similar. OpenBUGS can be interfaced from R via the BRugs package, and OpenBUGS/BRugs is available for all platforms. The modus operandi of BRugs is fundamentally different from that of R2WinBUGS: a sampler created using BRugs remains alive, in the sense that one may call the sampler repeatedly from within R. Yet another alternative is the package rjags, which interfaces the JAGS program; this must be installed separately and is available for all platforms.