
6.1 Introduction

In this chapter we describe the use of graphical models in a Bayesian setting, in which parameters are treated as random quantities on equal footing with the random variables. This allows complex stochastic systems to be modelled. This is one of the most successful application areas of graphical models; we give only a brief introduction here and refer to Albert (2009) for a more comprehensive exposition.

The paradigm used in Chaps. 2, 4 and 5 was that of identifying a joint distribution of a number of variables based on independent and identically distributed samples, with parameters unknown apart from restrictions determined by a log-linear, Gaussian, or mixed graphical model.

In contrast, Chap. 3 illustrated how a joint distribution for a Bayesian network may be constructed from a collection of conditional distributions; the network can subsequently be used to infer values of interesting unobserved quantities given evidence, i.e. observations of other quantities. As parameters and random variables are on an equal footing in the Bayesian paradigm, we may think of the interesting unobserved quantities as parameters and the evidence as data.

In the present chapter we follow this idea through in a general statistical setting. We focus mainly on constructing full joint distributions of a system of observed and unobserved random variables by specifying a collection of conditional distributions for a graphical model given as a directed acyclic graph with nodes representing all these quantities. Bayes’ theorem is then invoked to perform the necessary inference.

6.2 Bayesian Graphical Models

6.2.1 Simple Repeated Sampling

In the simplest possible setting we specify the joint distribution of a parameter θ and data x through a prior distribution π(θ) for θ and a conditional distribution p(x | θ) of data x for fixed value of θ, leading to the joint distribution

$$p(x,\theta)=p(x\,|\,\theta)\pi(\theta).$$

The prior distribution represents our knowledge (or rather uncertainty) about θ before the data have been observed. After observing X=x, our posterior distribution \(\pi^*(\theta)\) of θ is obtained by conditioning on the data x to obtain

$$\pi^*(\theta)=p(\theta|x) = \frac{p(x|\theta)\pi(\theta)}{p(x)}\propto L(\theta)\pi(\theta),$$

where L(θ)=p(x | θ) is the likelihood. Thus the posterior is proportional to the likelihood times the prior, and the normalizing constant is the marginal density \(p(x)=\int p(x\,|\,\theta)\pi(\theta)\,d\theta\).
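To make the prior-to-posterior calculation concrete, the following small R sketch computes a posterior numerically on a grid of θ values; the Beta(2,2) prior and the binomial data (7 successes in 10 trials) are purely illustrative choices, not taken from the text.

  theta <- seq(0, 1, length.out = 201)        # grid of parameter values
  prior <- dbeta(theta, 2, 2)                 # prior density pi(theta) (illustrative)
  lik   <- dbinom(7, size = 10, prob = theta) # likelihood L(theta): 7 successes in 10 trials
  post  <- lik * prior                        # unnormalized posterior
  post  <- post / sum(post * (theta[2] - theta[1]))  # normalize numerically
  sum(theta * post * (theta[2] - theta[1]))   # approximate posterior mean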

If the data is a sample \(x=(x^{1},x^{2},x^{3},x^{4},x^{5})\) we can represent this process by a small Bayesian network as shown to the left in Fig. 6.1. This network represents the model

$$p(x^1,\dots,x^5,\theta) = \pi(\theta) \prod_{\nu=1}^5 p(x^\nu\,|\,\theta),$$

reflecting that the individual observations are conditionally independent and identically distributed given θ. We can make a more compact representation of the network by introducing a plate which indicates repeated observations, as shown to the right in Fig. 6.1.

Fig. 6.1 Representation of a Bayesian model for simple sampling. The graph to the left indicates that observations are conditionally independent given θ; the picture to the right represents the same, but the plate allows a more compact representation

For a more sophisticated example, consider a graphical Gaussian model given by the conditional independence \(X_1\perp\!\!\!\perp X_3\,|\,X_2\) for fixed value of the concentration matrix K. In previous chapters we would have represented this model with its dependence graph:

[Figure: the undirected dependence graph with edges \(X_1 - X_2\) and \(X_2 - X_3\)]

However, in the Bayesian setting we need to include the parameters explicitly in the model, and could for example do so by means of the graph in Fig. 6.2.

Fig. 6.2 A chain graph representing N independent observations of \(X=(X_1,X_2,X_3)\) from a Bayesian graphical Gaussian model in which \(X_1\perp\!\!\!\perp X_3\,|\,X_2\) for fixed K, and K follows a hyper Markov prior distribution

The model is now represented by a chain graph, where the first chain component describes the structure of the prior distribution for the parameters in the concentration matrix. We have here assumed a so-called hyper Markov prior distribution (Dawid and Lauritzen 1993): conditionally on \(k_{22}\), the parameters \((k_{11},k_{12})\) are independent of \((k_{23},k_{33})\). The plate indicates that there are N independent observations of X, so the graph has 3N+5 nodes. The chain component on the plate reflects the factorization

$$p(x_1,x_2,x_3\,|\,K)=p(x_2\,|\,K)\,p(x_1\,|\,x_2,K)\,p(x_3\,|\,x_2,K)$$

for each of the individual observations of \(X=(X_1,X_2,X_3)\).

6.2.2 Models Based on Directed Acyclic Graphs

A key feature of Bayesian graphical models is that explicitly including parameters and observations themselves in the graphical representation enables much more complex observational patterns to be accommodated. Consider for example a linear regression model

$$Y_i \sim N(\mu_i, \sigma^2) \quad \mbox{with}\ \mu_i = \alpha+\beta x_i \mbox{ for } i=1,\ldots,N.$$

To obtain a full probabilistic model we must specify a joint distribution for (α,β,σ), whereas the independent variables \(x_i\) are assumed known (observed). If we specify independent prior distributions for these quantities, Fig. 6.3 shows a plate-based representation of this model with α, β, and σ being marginally independent and the \(Y_i\) conditionally independent given these parameters.

Fig. 6.3 Graphical representations of a traditional linear regression model with unknown intercept α, slope β, and variance \(\sigma^2\). In the representation to the left, the means \(\mu_i\) have been represented explicitly

Note that the \(\mu_i\) are deterministic functions of their parents and the same model can also be represented without explicitly including these nodes. However, there can be specific advantages to representing the means directly in the graph. If the independent variables \(x_i\) are not centered, i.e. \(\bar{x}\neq 0\), the model would change if \(x_i\) were replaced with \(x_i-\bar{x}\), as α would then be the conditional mean when \(x_i=\bar{x}\) rather than when \(x_i=0\), inducing a different distribution of \(\mu_i\).

For a full understanding of the variety and complexity of models that can easily be described by DAGs with plates, we refer to the manual for BUGS (Spiegelhalter et al. 2003), which also gives the following example.

Weights have been measured weekly for 30 young rats over five weeks. The observations \(Y_{ij}\) are the weights of rat i measured at age \(x_j\). The model is essentially a random effects linear growth curve:

$$Y_{ij} \sim\mathcal{N}(\alpha_{i} + \beta_i(x_j - \bar{x}), \sigma_c^{2})$$

and

$$\alpha_i \sim\mathcal{N}(\alpha_c, \sigma_\alpha^{2}),\qquad \beta_i \sim\mathcal{N}(\beta_c, \sigma_\beta^{2}),$$

where \(\bar{x} = 22\). Interest particularly focuses on the intercept at zero time (birth), denoted \(\alpha_{0} = \alpha_{c} -\beta_{c} \bar{x}\). The graphical representation of this model is displayed in Fig. 6.4.

Fig. 6.4 Graphical representation of a random coefficient regression model for the growth of rats
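In the BUGS language the rat growth model can be written along the following lines; this is a sketch in which the vague priors on \(\alpha_c\), \(\beta_c\) and the precisions are illustrative choices, and the model is held as an R character string so that it can be written to a model file.

  rats.model <- "
  model {
    for (i in 1:N) {
      for (j in 1:T) {
        Y[i, j] ~ dnorm(mu[i, j], tau.c)
        mu[i, j] <- alpha[i] + beta[i] * (x[j] - x.bar)
      }
      alpha[i] ~ dnorm(alpha.c, tau.alpha)
      beta[i]  ~ dnorm(beta.c, tau.beta)
    }
    x.bar     <- mean(x[])
    alpha.c   ~ dnorm(0, 1.0E-6)     # illustrative vague priors
    beta.c    ~ dnorm(0, 1.0E-6)
    tau.c     ~ dgamma(1.0E-3, 1.0E-3)
    tau.alpha ~ dgamma(1.0E-3, 1.0E-3)
    tau.beta  ~ dgamma(1.0E-3, 1.0E-3)
    sigma.c   <- 1 / sqrt(tau.c)
    alpha0    <- alpha.c - beta.c * x.bar   # intercept at birth
  }
  "
  writeLines(rats.model, "rats.bug")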

For a final illustration we consider the chest clinic example in Sect. 3.1.1. Figure 6.5 shows a directed acyclic graph with plates representing N samples from the chest clinic network.

Fig. 6.5 A graphical representation of N samples from the chest clinic network, with parameters unknown and marginally independent for seven of the nodes

Here we have introduced a parameter node for each of the variables. Each of these nodes may contain parameters for the conditional distribution of a node given any configuration of its parents, so that, following Spiegelhalter and Lauritzen (1990), we would write for the joint model

$$p(x,\theta)=\prod_{v\in V}\pi(\theta_v)\prod_{\nu=1}^N p(x^\nu _v\,|\,x^\nu_{\operatorname{pa}(v)}, \theta_v).$$

6.3 Inference Based on Probability Propagation

If the prior distributions of the unknown parameters are concentrated on a finite number of possibilities, i.e. the parameters are all discrete, the marginal posterior distribution of each of these parameters can simply be obtained by probability propagation in a Bayesian network with 7+8N nodes, inserting the observations as observed evidence. The moral graph of this network is shown in Fig. 6.6. This graph can be triangulated by just adding edges between \(x^{\nu}_{L}\) and \(x^{\nu}_{B}\) and the associated junction tree would thus have 10N cliques of size at most 4. Thus, propagation would be absolutely feasible, even for large N.

Fig. 6.6 Moral and triangulated graph of N samples from the chest clinic network, with seven unknown parameters

We illustrate this procedure in the simple case of N=3 where we only introduce unknown parameters for the probability of visiting Asia and the probability of a smoker having lung cancer, each having three possible levels, low, medium and high. We first define the parameter nodes

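One way to do this in R is with the gRain package; the node names theta.A and theta.L and the uniform prior weights below are our own illustrative choices.

  library(gRain)
  param.levels <- c("low", "medium", "high")
  ## parameter node for the probability of visiting Asia, with a uniform prior
  theta.A <- cptable(~theta.A, values = c(1, 1, 1), levels = param.levels)
  ## parameter node for the probability of lung cancer for a smoker
  theta.L <- cptable(~theta.L, values = c(1, 1, 1), levels = param.levels)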

and then specify a template for the probabilities, where we note that A and L have an extra parent

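A possible template is a function that builds the eight conditional probability tables for case i, giving asia and lung the extra parents theta.A and theta.L; this is a sketch, and the numerical values attached to the three parameter levels are illustrative only.

  yn <- c("yes", "no")
  chest.case <- function(i) {
    nm <- function(v) paste0(v, i)               # node names asia1, tub1, ...
    fo <- function(s) as.formula(s)              # build a cptable formula from a string
    list(
      ## asia has the extra parent theta.A (illustrative values per level)
      cptable(fo(paste0("~", nm("asia"), "|theta.A")),
              values = c(.005, .995, .01, .99, .05, .95), levels = yn),
      cptable(fo(paste0("~", nm("tub"), "|", nm("asia"))),
              values = c(5, 95, 1, 99), levels = yn),
      cptable(fo(paste0("~", nm("smoke"))), values = c(1, 1), levels = yn),
      ## lung has the extra parent theta.L (child varies fastest, then smoke, then theta.L)
      cptable(fo(paste0("~", nm("lung"), "|", nm("smoke"), ":theta.L")),
              values = c(.05, .95, .01, .99,
                         .10, .90, .01, .99,
                         .20, .80, .01, .99), levels = yn),
      cptable(fo(paste0("~", nm("bronc"), "|", nm("smoke"))),
              values = c(6, 4, 3, 7), levels = yn),
      cptable(fo(paste0("~", nm("either"), "|", nm("lung"), ":", nm("tub"))),
              values = c(1, 0, 1, 0, 1, 0, 0, 1), levels = yn),
      cptable(fo(paste0("~", nm("xray"), "|", nm("either"))),
              values = c(98, 2, 5, 95), levels = yn),
      cptable(fo(paste0("~", nm("dysp"), "|", nm("bronc"), ":", nm("either"))),
              values = c(9, 1, 7, 3, 8, 2, 1, 9), levels = yn))
  }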

We create three instances of the pattern defined above. In these instances the variable name asia[i] is replaced by asia1, asia2 and asia3 respectively.

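Continuing the sketch, the three instances can be created with the template function:

  case.cpts <- c(chest.case(1), chest.case(2), chest.case(3))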

We then proceed to the specification of the full network which is displayed in Fig. 6.7:

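Continuing the sketch, the full network is obtained by combining the two parameter nodes with the three instantiated cases:

  plist    <- compileCPT(c(list(theta.A, theta.L), case.cpts))
  chest.bn <- compile(grain(plist))
  chest.bn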
Fig. 6.7 Bayesian network for the chest clinic example with two unknown parameter nodes and two potential observations of the network. Parameters appear as nodes in the graph

Finally we insert evidence for three observed cases, none of whom have been to Asia, all being smokers, one of them presenting with dyspnoea, one with a positive X-ray, one with dyspnoea and a negative X-ray; we then query the posterior distribution of the parameters:

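Continuing the sketch, the evidence for the three cases can be entered and the parameter nodes queried as follows:

  chest.ev <- setEvidence(chest.bn,
      nodes  = c("asia1", "smoke1", "dysp1",
                 "asia2", "smoke2", "xray2",
                 "asia3", "smoke3", "dysp3", "xray3"),
      states = c("no", "yes", "yes",
                 "no", "yes", "yes",
                 "no", "yes", "yes", "no"))
  querygrain(chest.ev, nodes = c("theta.A", "theta.L"))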

We see that the probability of visiting Asia is now more likely than before to be low, whereas the probability of having lung cancer for a smoker is more likely to be high.

In the special case where all cases have been completely observed, it is not necessary to form the full network with 7+8N nodes, but updating can be performed sequentially as follows.

Let \(p^{*}_{n}(\theta)\) denote the posterior distribution of θ given n observations \(x^{1},\dots,x^{n}\), i.e. \(p^{*}_{n}(\theta)=p(\theta\,|\,x^{1},\dots,x^{n})\). We then have the recursion:

$$p^{*}_{n}(\theta)\propto p(x^{n}\,|\,\theta)\,p^{*}_{n-1}(\theta).$$

Hence we can incorporate evidence from the n-th observation by using the posterior distribution from the first n−1 observations as a prior distribution for a network representing only a single case. It follows from the moral graph in Fig. 6.6 that if all nodes in the plates are observed, the seven parameters are also conditionally independent in the posterior distribution after n observations. If cases are incomplete, such a sequential scheme can only be used approximately (Spiegelhalter and Lauritzen 1990).

6.4 Computations Using Monte Carlo Methods

In most cases the posterior distribution

$$ \pi^*(\theta) = p(\theta|x) = \frac{p(x|\theta)\pi(\theta)}{p(x)}\propto p(x|\theta)\pi(\theta)$$
(6.1)

of the parameters of interest cannot be calculated or represented in a simple fashion. This would for example be the case if the parameter nodes in Fig. 6.5 had values in a continuum and there were incomplete observations, such as in the example given in the previous section.

In such models one will often resort to Markov chain Monte Carlo (MCMC) methods: we cannot calculate \(\pi^*(\theta)\) analytically but if we can generate samples \(\theta^{(1)},\dots,\theta^{(M)}\) from the distribution \(\pi^*(\theta)\), we can do just as well.

6.4.1 Metropolis–Hastings and the Gibbs Sampler

Such samples can be generated by the Metropolis–Hastings algorithm. In the following we change the notation slightly.

We suppose that we know p(x) only up to a normalizing constant. That is to say, p(x)=k(x)/c, where k(x) is known but c is unknown. We partition x into blocks, for example \(x=(x_1,x_2,x_3)\).

We wish to generate samples \(x^{1},\dots,x^{M}\) from p(x). Suppose we have a sample \(x^{t-1}=(x_{1}^{t-1},x_{2}^{t-1},x_{3}^{t-1})\) and that \(x_1\) has already been updated to \(x_{1}^{t}\) in the current iteration. The task is to update \(x_2\). To do so we need to specify a proposal distribution \(h_2\) from which we can sample candidate values for \(x_2\). The single component Metropolis–Hastings algorithm works as follows:

1. Draw \(x_{2} \sim h_{2}(\cdot\,|\,x_{1}^{t},x_{2}^{t-1},x_{3}^{t-1})\). Draw \(u\sim U(0,1)\).

2. Calculate the acceptance probability

$$ \alpha=\min\biggl(1,\frac{p(x_2\,|\,x_1^t, x_3^{t-1})h_2(x_2^{t-1}\,|\,x_1^{t},x_2, x_3^{t-1})}{p(x_2^{t-1}\,|\,x_1^t, x_3^{t-1})h_2(x_2\,|\,x_1^{t},x_2^{t-1}, x_3^{t-1})}\biggr)$$
(6.2)

3. If \(u<\alpha\), set \(x_{2}^{t}=x_{2}\); else set \(x_{2}^{t}=x_{2}^{t-1}\).

The samples \(x^{1},\dots,x^{M}\) generated this way will form an ergodic Markov chain that, under certain conditions, has p(x) as its stationary distribution so that the expectation of any function of x can be calculated approximately as

$$\int f(x)p(x)\,dx= \lim_{M\to\infty} \frac{1}{M}\sum_{\nu=1}^Mf(x^\nu)\approx\frac{1}{M}\sum_{\nu=1}^M f(x^\nu).$$

Note that \(p(x_{2}\,|\,x_{1}^{t}, x_{3}^{t-1}) \propto p(x_{1}^{t},x_{2},x_{3}^{t-1})\propto k(x_{1}^{t},x_{2},x_{3}^{t-1})\) and therefore the acceptance probability can be calculated even though p(x) may only be known up to proportionality.
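As an illustration, the following R sketch implements a single component Metropolis–Hastings update for a toy two-dimensional target; the unnormalized density k and the random-walk proposal are illustrative choices, and with a symmetric proposal the h-terms in (6.2) cancel.

  set.seed(1)
  k <- function(x) exp(-sum(x^2)/2 - (x[1]*x[2] - 1)^2)  # unnormalized target density (toy example)
  M <- 5000
  out <- matrix(NA, M, 2)
  x <- c(0, 0)
  for (t in 1:M) {
    for (j in 1:2) {                            # update one component at a time
      cand    <- x
      cand[j] <- rnorm(1, mean = x[j], sd = 1)  # symmetric random-walk proposal
      alpha   <- min(1, k(cand) / k(x))         # acceptance probability, cf. (6.2)
      if (runif(1) < alpha) x <- cand
    }
    out[t, ] <- x
  }
  colMeans(out)                                 # Monte Carlo estimate of E(X)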

A special case of the single component Metropolis–Hastings algorithm is the Gibbs sampler: if as proposal distribution \(h_2\) we choose \(p(x_{2}\,|\,x_{1}^{t},x_{3}^{t-1})\) then the acceptance probability becomes 1 because terms cancel in (6.2). The conditional distribution of a single component \(X_2\) given all other components \((X_1,X_3)\) is known as the full conditional distribution.

For a directed graphical model, the density of the full conditional distributions can be easily identified:

$$p(x_i\,|\,x_{V\setminus\{i\}}) = p(x_i\,|\,x_{\operatorname{bl}(i)}) \propto p(x_i\,|\,x_{\operatorname{pa}(i)})\prod_{v\in\operatorname{ch}(i)}p(x_v\,|\,x_{\operatorname{pa}(v)}),$$
(6.3)

where \(\operatorname{bl}(i)\) is the Markov blanket of node i:

$$\operatorname{bl}(i)=\operatorname{pa}(i)\cup\operatorname{ch}(i)\cup\biggl\{\bigcup_{v\in \operatorname{ch}(i)}\operatorname{pa}(v)\setminus\{i\}\biggr\}$$

or, equivalently, the neighbours of i in the moral graph, see Sect. 1.4.1. Note that (6.3) holds even if some of the nodes involved in the expression correspond to values that have been observed. To sample from the posterior distribution of the unobserved values given the observed ones, only unobserved variables should be updated in the Gibbs sampling cycle.

In this way, a Markov chain of pseudo-observations from all unobserved variables is generated, and those corresponding to quantities (parameters) of interest can be monitored.
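To illustrate, here is a small Gibbs sampler in R for normally distributed observations with unknown mean μ and precision τ; the data, the priors \(\mu\sim N(0,1/0.01)\) and \(\tau\sim\mbox{Gamma}(0.01,0.01)\), and the standard semi-conjugate full conditionals used below are illustrative choices, not from the text.

  set.seed(1)
  y <- c(4.1, 5.2, 3.8, 4.9, 5.5, 4.4)       # illustrative data
  n <- length(y)
  mu0 <- 0; tau0 <- 0.01; a <- 0.01; b <- 0.01
  M <- 5000
  mu <- numeric(M); tau <- numeric(M)
  mu[1] <- mean(y); tau[1] <- 1 / var(y)
  for (t in 2:M) {
    ## full conditional of mu given tau and the data
    prec  <- tau0 + n * tau[t - 1]
    mu[t] <- rnorm(1, mean = (tau0 * mu0 + tau[t - 1] * sum(y)) / prec,
                   sd = 1 / sqrt(prec))
    ## full conditional of tau given mu and the data
    tau[t] <- rgamma(1, shape = a + n / 2, rate = b + sum((y - mu[t])^2) / 2)
  }
  mean(mu[-(1:1000)])                        # posterior mean of mu after burn-in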

6.4.2 Using WinBUGS via R2WinBUGS

The program WinBUGS (Gilks et al. 1994) is based on the idea that the user specifies a Bayesian graphical model in terms of a DAG, including the conditional distribution of every node given its parents. WinBUGS then identifies the Markov blanket of every node and, using properties of the full conditional distributions in (6.3), automatically generates a sampler. As the name suggests, WinBUGS is available on Windows platforms only. WinBUGS can be interfaced from R via the R2WinBUGS package (Sturtz et al. 2005) and to do this, WinBUGS must be installed. R2WinBUGS works by calling WinBUGS, doing the computations there, shutting WinBUGS down and returning control to R.

The model described in Fig. 6.3 can be specified in the BUGS language as follows (notice that the dispersion of a normal distribution is parameterized in terms of the concentration τ, where \(\tau=\sigma^{-2}\)):

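One way of writing the model, held here as an R character string so that it can be saved to a file in the next step, is the following sketch; the vague priors are illustrative choices.

  lines.model <- "
  model {
    for (i in 1:N) {
      Y[i] ~ dnorm(mu[i], tau)
      mu[i] <- alpha + beta * x[i]
    }
    alpha ~ dnorm(0, 1.0E-6)       # illustrative vague priors
    beta  ~ dnorm(0, 1.0E-6)
    tau   ~ dgamma(1.0E-3, 1.0E-3)
    sigma <- 1 / sqrt(tau)
  }
  "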

BUGS comes with a Windows interface in the program WinBUGS. To analyse this model in R we can use the package R2WinBUGS. First we save the model specification to a plain text file:

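Using the model string sketched above:

  writeLines(lines.model, "lines.bug")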

We specify data:

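As an illustration (the values below are the classic BUGS line-example data, standing in for the actual data):

  lines.data <- list(Y = c(1, 3, 3, 3, 5), x = c(1, 2, 3, 4, 5), N = 5)  # illustrative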

As the sampler must start somewhere, we specify initial values for the unknowns:

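For a single chain, one possibility is:

  lines.inits <- list(list(alpha = 0, beta = 0, tau = 1))  # illustrative starting values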

We may now ask WinBUGS for a sample from the model:

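A sketch of the call, assuming WinBUGS is installed in its default location (the iteration settings are illustrative):

  library(R2WinBUGS)
  lines.res <- bugs(data = lines.data, inits = lines.inits,
                    parameters.to.save = c("alpha", "beta", "sigma"),
                    model.file = "lines.bug",
                    n.chains = 1, n.iter = 5000, n.burnin = 1000, n.thin = 1)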

The object lines.res contains the output. A simple summary of the samples is

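With the sketch above, this can be obtained by printing the returned object:

  print(lines.res)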

We next convert the output to a format suitable for analysis with the coda package:

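One way to do this is to turn the matrix of simulations stored in the result into an mcmc object:

  library(coda)
  lines.coda <- mcmc(lines.res$sims.matrix)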

A summary of the posterior distribution of the monitored parameters is as follows:

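With the mcmc object created above:

  summary(lines.coda)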

As the observations are very informative, the posterior distributions of the regression parameters α and β are similar to the sampling distributions obtained from a standard linear regression analysis:

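For comparison, with the illustrative data used above, the classical fit is:

  summary(lm(Y ~ x, data = data.frame(Y = lines.data$Y, x = lines.data$x)))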

A traceplot (see Fig. 6.8) of the samples is useful for visually checking for indications that the sampler has not converged. There appears to be no problem here:

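Using the lattice-based plotting methods provided by coda:

  library(lattice)
  xyplot(lines.coda)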

A plot of the marginal posterior densities (see Fig. 6.9) provides a supplement to the numeric summaries shown above:

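Again using coda's lattice-based methods:

  densityplot(lines.coda)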
Fig. 6.8 A traceplot of the samples produced by BUGS is a useful tool for visual inspection of indications that the sampler has not converged

Fig. 6.9 A plot of each posterior marginal distribution provides a supplement to the numeric summary statistics

6.5 Various

An alternative to WinBUGS is OpenBUGS (Spiegelhalter et al. 2011). The two programs have the same genesis and the model specification languages are very similar. OpenBUGS can be interfaced from R via the BRugs package, and OpenBUGS/BRugs is available for all platforms. The modus operandi of BRugs is fundamentally different from that of R2WinBUGS: a sampler created using BRugs remains alive, in the sense that one may call the sampler repeatedly from within R. Yet another alternative is the package rjags, which interfaces the JAGS program; this must be installed separately and is available for all platforms.