
1 Introduction

Graphical models provide a framework for describing statistical dependencies in (possibly large) collections of random variables. At their core lie various correspondences between the conditional independence properties of a random vector, and the structure of an underlying graph used to represent its distribution. They have been used and studied within many sub-disciplines of statistics, applied mathematics, electrical engineering and computer science, including statistical machine learning and artificial intelligence, communication and information theory, statistical physics, network control theory, computational biology, statistical signal processing, natural language processing and computer vision among others.

The purpose of these notes is to provide an introduction to the basic material of graphical models and associated message-passing algorithms. We assume only that the reader has undergraduate-level background in linear algebra, multivariate calculus, probability theory (without needing measure theory), and some basic graph theory. These introductory lectures should be viewed as a precursor to the monograph [64], which focuses primarily on some more advanced aspects of the theory and methodology of graphical models.

2 Probability Distributions and Graphical Structure

In this section, we define various types of graphical models, and discuss some of their properties. Before doing so, let us introduce the basic probabilistic notation used throughout these notes. Any graphical model corresponds to a family of probability distributions over a random vector \(X = (X_{1},\ldots,X_{N})\). Here for each \(s \in [N]:=\{ 1,2,\ldots,N\}\), the random variable \(X_{s}\) takes values in some space \(\mathcal{X}_{s}\), which (depending on the application) may either be continuous (e.g., \(\mathcal{X}_{s} = \mathbb{R}\)) or discrete (e.g., \(\mathcal{X}_{s} =\{ 0,1,\ldots,m - 1\}\)). Lower case letters are used to refer to particular elements of \(\mathcal{X}_{s}\), so that the notation \(\{X_{s} = x_{s}\}\) corresponds to the event that the random variable \(X_{s}\) takes the value \(x_{s} \in \mathcal{X}_{s}\). The random vector \(X = (X_{1},X_{2},\ldots,X_{N})\) takes values in the Cartesian product space \(\prod _{s=1}^{N}\mathcal{X}_{s}:= \mathcal{X}_{1} \times \mathcal{X}_{2} \times \ldots \times \mathcal{X}_{N}\). For any subset \(A \subseteq [N]\), we define the subvector \(X_{A}:= (X_{s},\;s \in A)\), corresponding to a random vector that takes values in the space \(\mathcal{X}_{A} =\prod _{s\in A}\mathcal{X}_{s}\). We use the notation \(x_{A}:= (x_{s},\;s \in A)\) to refer to a particular element of the space \(\mathcal{X}_{A}\). With this convention, note that \(\mathcal{X}_{[N]}\) is shorthand notation for the full Cartesian product \(\prod _{s=1}^{N}\mathcal{X}_{s}\). Given three disjoint subsets \(A,B,C\) of \([N]\), we use \(X_{A} \perp \!\!\!\perp X_{B}\mid X_{C}\) to mean that the random vector \(X_{A}\) is conditionally independent of \(X_{B}\) given \(X_{C}\). When \(C\) is the empty set, this notion reduces to marginal independence between the random vectors \(X_{A}\) and \(X_{B}\).

2.1 Directed Graphical Models

We begin our discussion with directed graphical models, which (not surprisingly) are based on the formalism of directed graphs. In particular, a directed graph \(\mbox{ $\mathcal{D} = (\mathcal{V},\overrightarrow{\mathcal{E}})$}\) consists of a vertex set \(\mathcal{V} =\{ 1,\ldots,N\}\) and a collection \(\overrightarrow{\mathcal{E}}\) of directed pairs \((s \rightarrow t)\), meaning that \(s\) is connected by an edge directed to \(t\). When there exists a directed edge \((t \rightarrow s) \in \overrightarrow{\mathcal{E}}\), we say that node \(s\) is a child of node \(t\), and conversely that node \(t\) is a parent of node \(s\). We use \(\pi (s)\) to denote the set of all parents of node \(s\) (which might be an empty set). A directed cycle is a sequence of vertices \((s_{1},s_{2},\ldots,s_{\ell})\) such that \((s_{\ell} \rightarrow s_{1}) \in \overrightarrow{\mathcal{E}}\), and \((s_{j} \rightarrow s_{j+1}) \in \overrightarrow{\mathcal{E}}\) for all \(j = 1,\ldots,\ell-1\). A directed acyclic graph, or DAG for short, is a directed graph that contains no directed cycles. As an illustration, the graphs in panels (a) and (b) of Fig. 1 are both DAGs, whereas the graph in panel (c) is not a DAG, since it contains (among others) a directed cycle on the three vertices {1, 2, 5}.

Any bijection \(\rho: [N] \rightarrow [N]\) defines an ordering of the vertex set \(\mbox{ $\mathcal{V} =\{ 1,2,\ldots,N\}$,}\) and certain orderings are of particular interest to us.

Definition 1

The ordering \(\{\rho (1),\ldots,\rho (N)\}\) of the vertex set \(\mathcal{V}\) of a DAG is topological if for each \(s \in \mathcal{V}\), we have \(\rho (t) <\rho (s)\) for all \(t \in \pi (s)\).

Alternatively stated, in a topological ordering, children always come after their parents. It is an elementary fact of graph theory that any DAG has at least one topological ordering, and this fact plays an important role in our analysis of directed graphical models. So as to simplify our presentation, we assume throughout these notes that the canonical ordering \(\mathcal{V} =\{ 1,2,\ldots,N\}\) is topological. Note that this assumption entails no loss of generality, since we can always re-index the vertices so that it holds. With this choice of topological ordering, vertex 1 cannot have any parents (i.e., \(\pi (1) =\emptyset\)), and moreover vertex \(N\) cannot have any children.
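To make the existence claim concrete, a topological ordering can be computed by repeatedly removing parentless vertices (Kahn's algorithm). The sketch below is our own illustration, not part of the text; the edge list is inferred from the parent sets appearing in Example 2 for the DAG of Fig. 1b.

```python
from collections import deque

def topological_order(num_vertices, edges):
    """Return a topological ordering of vertices {1, ..., num_vertices},
    or None if the directed graph contains a directed cycle."""
    children = {s: [] for s in range(1, num_vertices + 1)}
    indegree = {s: 0 for s in range(1, num_vertices + 1)}
    for s, t in edges:                       # a pair (s, t) encodes the edge s -> t
        children[s].append(t)
        indegree[t] += 1
    ready = deque(s for s in range(1, num_vertices + 1) if indegree[s] == 0)
    order = []
    while ready:
        s = ready.popleft()                  # s has no remaining unprocessed parents
        order.append(s)
        for t in children[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    return order if len(order) == num_vertices else None

# Hypothetical edge list matching the DAG of Fig. 1b (from Example 2's factorization).
dag_edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 5), (4, 5), (5, 6), (5, 7)]
order = topological_order(7, dag_edges)
```

On a cyclic graph such as Fig. 1c, the queue empties before all vertices are processed, and the function returns None.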

With this set-up, we are now ready to introduce probabilistic notions into the picture. A directed graphical model is a family of probability distributions defined by a DAG. This family is built by associating each node \(s\) of a DAG with a random variable \(X_{s}\), and requiring that the joint probability distribution over \((X_{1},\ldots,X_{N})\) factorize according to the DAG. Consider the subset of vertices \((s,\pi (s))\) corresponding to a given vertex \(s\) and its parents \(\pi (s)\). We may associate with this subset a real-valued function \(f_{s}: \mathcal{X}_{s} \times \mathcal{X}_{\pi (s)} \rightarrow \mathbb{R}_{+}\) that maps any given configuration \((x_{s},x_{\pi (s)}) \in \mathcal{X}_{s} \times \mathcal{X}_{\pi (s)}\) to a real number \(f_{s}(x_{s},x_{\pi (s)}) \geq 0\). We assume moreover that \(f_{s}\) satisfies the normalization condition

$$\displaystyle\begin{array}{rcl} \sum _{x_{s}}f_{s}(x_{s},\,x_{\pi (s)})& = 1\quad \mbox{ for all $x_{\pi (s)} \in \mathcal{X}_{\pi (s)}$.}&{}\end{array}$$
(1)

Definition 2 (Factorization for Directed Graphical Models)

The directed graphical model based on a given DAG \(\mathcal{D}\) is the collection of probability distributions over the random vector \((X_{1},\ldots,X_{N})\) that have a factorization of the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& = \frac{1} {Z}\,\prod _{s=1}^{N}f_{ s}(x_{s},\,x_{\pi (s)}),&{}\end{array}$$
(2)

for some choice of non-negative parent-to-child functions \((f_{1},\ldots,f_{N})\) that satisfy the normalization condition (1). We use \(\mathcal{F}_{\mbox{ Fac}}(\mathcal{D})\) to denote the set of all distributions that factorize in the form (2).

In the factorization (2), the quantity Z denotes a constant chosen to ensure that \(p\) sums to one.

Let us illustrate this definition with some examples.

Example 1 (Markov Chain as a Directed Graphical Model)

Perhaps the simplest example of a directed acyclic graph is the chain on \(N\) nodes, as shown in panel (a) of Fig. 1. Such a graph underlies the stochastic process \((X_{1},\ldots,X_{N})\) known as a Markov chain, used to model various types of sequential dependencies. By definition, any Markov chain can be factorized in the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& = p(x_{1})\;p(x_{2}\,\vert \,x_{1})\;p(x_{3}\,\vert \,x_{2})\cdots p(x_{N}\,\vert \,x_{N-1}).&{}\end{array}$$
(3)

Note that this is a special case of the factorization (2), based on the functions \(f_{s}(x_{s},\,x_{\pi (s)}) = p(x_{s}\,\vert \,x_{s-1})\) for each \(s = 2,\ldots,N\), and \(f_{1}(x_{1},\,x_{\pi (1)}) = p(x_{1})\). ♣
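As a numerical sanity check of the factorization (3), the sketch below builds a hypothetical discrete chain with randomly chosen initial and transition distributions (the names `init` and `trans` are our own illustrative choices), and confirms by exhaustive summation that the product of conditionals is already a valid distribution, so that Z = 1.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
N, m = 4, 3                          # chain of length 4, each X_s in {0, 1, 2}

# Initial law p(x1) and one transition matrix p(x_s | x_{s-1}) per edge.
init = rng.dirichlet(np.ones(m))
trans = [rng.dirichlet(np.ones(m), size=m) for _ in range(N - 1)]

def p_chain(x):
    """Joint probability of x = (x1, ..., xN) via the factorization (3)."""
    prob = init[x[0]]
    for s in range(1, N):
        prob *= trans[s - 1][x[s - 1], x[s]]
    return prob

# Sum over all m**N configurations; no explicit normalization is needed.
total = sum(p_chain(x) for x in itertools.product(range(m), repeat=N))
```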

Fig. 1

(a) The simplest example of a DAG is a chain, which underlies the familiar Markov chain. The canonical ordering \(\{1,2,\ldots,N\}\) is the only topological one. (b) A more complicated DAG. Here the canonical ordering \(\{1,2,\ldots,7\}\) is again topological, but it is no longer unique: for instance, \(\{1,4,2,3,5,7,6\}\) is also topological. (c) A directed graph with cycles (non-DAG). It contains (among others) a directed cycle on vertices \(\{1,2,5\}\)

We now turn to a more complex DAG.

Example 2 (Another DAG)

Consider the DAG shown in Fig. 1b. It defines the family of probability distributions that have a factorization of the form

$$\displaystyle\begin{array}{rcl} & & p(x_{1},\ldots,x_{7}) {}\\ & & \qquad \,\propto \,\,f_{1}(x_{1})f_{2}(x_{2},\,x_{1})f_{3}(x_{3},\,x_{1},x_{2})f_{4}(x_{4},\,x_{1})f_{5}(x_{5},\,x_{3},x_{4})f_{6}(x_{6},\,x_{5})f_{7}(x_{7},\,x_{5}).{}\\ \end{array}$$

for some collection of non-negative and suitably normalized functions \(\{f_{s},s \in \mathcal{V}\}\). ♣

The factorization (3) of the classical Markov chain has an interesting property, in that the normalization constant Z = 1, and all the local functions \(f_{s}\) are equal to conditional probability distributions. It is not immediately apparent whether or not this property holds for the more complex DAG discussed in Example 2, but in fact, as shown by the following result, it is a generic property of directed graphical models.

Proposition 1

For any directed acyclic graph \(\mathcal{D}\) , any factorization of the form (2) with Z = 1 defines a valid probability distribution. Moreover, we necessarily have \(f_{s}(x_{s},\,x_{\pi (s)}) = p(x_{s}\,\vert \,x_{\pi (s)})\) for all \(s \in \mathcal{V}\) .

Proof

Throughout the proof, we assume without loss of generality (re-indexing as necessary) that \(\{1,2,\ldots,N\}\) is a topological ordering. In order to prove this result, it is convenient to first state an auxiliary result.

Lemma 1

For any distribution p of the form (2), we have

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{t})& = \frac{1} {Z}\prod _{s=1}^{t}f_{ s}(x_{s},\,x_{\pi (s)})\quad \mbox{ for each $t = 1,\ldots,N$.}&{}\end{array}$$
(4)

We first use this result to establish the main claims before returning to prove it. If we apply Lemma 1 with \(t = 1\), then we obtain that \(p(x_{1}) = f_{1}(x_{1})/Z\), and hence that Z = 1 by the normalization condition on \(f_{1}\). Next, for any \(t \in \{ 2,\ldots,N\}\), applying the representation (4) to both \(t\) and \(t - 1\) yields

$$\displaystyle\begin{array}{rcl} \frac{p(x_{1},\ldots,x_{t})} {p(x_{1},\ldots,x_{t-1})}& = f_{t}(x_{t},\,x_{\pi (t)})\qquad \mbox{ for all $(x_{1},\ldots,x_{t})$.}& {}\\ \end{array}$$

Since the right-hand side depends only on \(x_{t}\) and \(x_{\pi (t)}\), so must the left-hand side. But the left-hand side is the conditional \(p(x_{t}\,\vert \,x_{1},\ldots,x_{t-1})\), and a conditional that depends on \((x_{1},\ldots,x_{t-1})\) only through \(x_{\pi (t)}\) is equal to \(p(x_{t}\,\vert \,x_{\pi (t)})\); we conclude that \(p(x_{t}\,\vert \,x_{\pi (t)}) = f_{t}(x_{t},\,x_{\pi (t)})\) as claimed.

It remains to prove Lemma 1. Since \(\{1,2,\ldots,N\}\) is a topological ordering, node \(N\) has no children, so that we may write

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N-1},x_{N})& = \frac{1} {Z}\,\biggr [\prod _{s=1}^{N-1}f_{ s}(x_{s},\,x_{\pi (s)})\;\biggr ]\;f_{N}(x_{N},\,x_{\pi (N)}).& {}\\ \end{array}$$

Marginalizing over \(x_{N}\) yields that

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N-1})& =& \frac{1} {Z}\biggr [\prod _{s=1}^{N-1}f_{s}(x_{s},\,x_{\pi (s)})\;\biggr ]\biggr [\;\sum _{x_{N}}f_{N}(x_{N},\,x_{\pi (N)})\biggr ] {}\\ & =& \frac{1} {Z}\;\prod _{s=1}^{N-1}f_{s}(x_{s},\,x_{\pi (s)}), {}\\ \end{array}$$

where we have used the facts that \(x_{N}\) appears only in one term (since \(N\) is a leaf node), and that \(\sum _{x_{N}}f_{N}(x_{N},\,x_{\pi (N)}) = 1\). We have thus shown that if the claim holds for \(t = N\), then it holds for \(t = N - 1\). By recursively applying this same argument, the claim of Lemma 1 follows. □ 
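Proposition 1 lends itself to a brute-force check on a small state space. In the sketch below (our own illustrative construction), we place random normalized parent-to-child functions on the DAG of Fig. 1b, verify that the product in (2) already sums to one (so Z = 1), and confirm that one factor coincides with the corresponding conditional probability.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
m = 2                                      # binary variables
parents = {1: (), 2: (1,), 3: (1, 2), 4: (1,), 5: (3, 4), 6: (5,), 7: (5,)}

# f[s][i] is the distribution of X_s given the i-th parent configuration,
# so each row is normalized as required by condition (1).
f = {s: rng.dirichlet(np.ones(m), size=m ** len(pa)) for s, pa in parents.items()}

def joint(x):
    """Evaluate the product in (2); x is a 0/1 tuple (x1, ..., x7)."""
    prob = 1.0
    for s, pa in parents.items():
        idx = 0
        for t in pa:                       # encode the parent configuration as an index
            idx = idx * m + x[t - 1]
        prob *= f[s][idx, x[s - 1]]
    return prob

states = list(itertools.product(range(m), repeat=7))
Z = sum(joint(x) for x in states)          # Proposition 1 predicts Z = 1

# Check f_5(x5, (x3, x4)) = p(x5 | x3, x4) at (x3, x4, x5) = (1, 0, 1) by marginalization.
num = sum(joint(x) for x in states if (x[2], x[3], x[4]) == (1, 0, 1))
den = sum(joint(x) for x in states if (x[2], x[3]) == (1, 0))
```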

Proposition 1 shows that the terms \(f_{s}\) in the factorization (2) have a concrete interpretation as the child-parent conditional probabilities (i.e., \(f_{s}(x_{s},\,x_{\pi (s)})\) is equal to the conditional probability of \(X_{s} = x_{s}\) given that \(X_{\pi (s)} = x_{\pi (s)}\)). This local interpretability, which (as we will see) is not shared by the class of undirected graphical models, has some important consequences. For instance, sampling a configuration \((\tilde{X}_{1},\ldots,\tilde{X}_{N})\) from any DAG model is straightforward: assuming the canonical topological ordering, we first sample \(\tilde{X}_{1} \sim f_{1}(\cdot )\), and then for \(s = 2,\ldots,N\), sample \(\tilde{X}_{s} \sim f_{s}(\cdot,\,\tilde{X}_{\pi (s)})\). This procedure is well-specified: due to the topological ordering, we are guaranteed that the variable \(\tilde{X}_{\pi (s)}\) has been sampled before we move on to sampling \(\tilde{X}_{s}\). Moreover, by construction, the random vector \((\tilde{X}_{1},\ldots,\tilde{X}_{N})\) is distributed according to the probability distribution (2).
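The ancestral sampling procedure just described takes only a few lines. The sketch below again uses a hypothetical random parameterization of the Fig. 1b DAG (the names `parents` and `f` are our own), and checks that the empirical frequency of \(X_{1}\) tracks \(f_{1}\).

```python
import numpy as np

rng = np.random.default_rng(2)
m = 2
parents = {1: (), 2: (1,), 3: (1, 2), 4: (1,), 5: (3, 4), 6: (5,), 7: (5,)}
f = {s: rng.dirichlet(np.ones(m), size=m ** len(pa)) for s, pa in parents.items()}

def ancestral_sample():
    """Draw (X1, ..., X7) by visiting each vertex after its parents."""
    x = {}
    for s, pa in parents.items():          # canonical order 1..7 is topological
        idx = 0
        for t in pa:                       # parents of s are already sampled
            idx = idx * m + x[t]
        x[s] = rng.choice(m, p=f[s][idx])
    return tuple(x[s] for s in sorted(x))

samples = [ancestral_sample() for _ in range(20000)]
freq_x1 = np.mean([x[0] for x in samples])  # should approach f_1(1) = P(X1 = 1)
```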

2.1.1 Conditional Independence Properties for Directed Graphs

Thus far, we have specified a joint distribution over the random vector \(X = (X_{1},\ldots,X_{N})\) in terms of a particular parent-to-child factorization. We now turn to a different (but ultimately equivalent) characterization in terms of conditional independence. (The reader should recall our standard notation for conditional independence properties from the beginning of Sect. 2.) Throughout the discussion to follow, we continue to assume that the canonical ordering \(\{1,2,\ldots,N\}\) is topological.

Given any vertex \(s \in \mathcal{V}\setminus \{1\}\), our choice of topological ordering implies that the parent set \(\pi (s)\) is contained within the set \(\{1,2,\ldots,s - 1\}\). Note that for \(s = 1\), the parent set must be empty. We then define the set \(\nu (s) =\{ 1,2,\ldots,s - 1\}\setminus \pi (s)\). Our basic conditional independence properties are based on the three disjoint subsets \(\{s\}\), \(\pi (s)\) and \(\nu (s)\).

Definition 3 (Markov Property for Directed Graphical Models)

The random vector \(X = (X_{1},\ldots,X_{N})\) is Markov with respect to a directed graph if

$$\displaystyle\begin{array}{rcl} X_{s} \perp \!\!\!\perp X_{\nu (s)}\mid X_{\pi (s)}\quad \mbox{ for all $s \in \mathcal{V}$.}& &{}\end{array}$$
(5)

We use \(\mathcal{F}_{\text{Mar}}(\mathcal{D})\) to denote the set of all distributions that are Markov with respect to \(\mathcal{D}\).

Let us illustrate this definition with our running examples.

Example 3 (Conditional Independence for Directed Markov Chain)

Recall the directed Markov chain first presented in Example 1. Each node \(s \in \{ 2,\ldots,N\}\) has a unique parent, so that \(\pi (s) =\{ s - 1\}\), and the (non-trivial) basic conditional independence properties are of the form \(X_{s} \perp \!\!\!\perp (X_{1},\ldots,X_{s-2})\mid X_{s-1}\) for \(s \in \{ 3,\ldots,N\}\). Note that these basic conditional independence statements imply other (non-basic) properties as well. Perhaps the most familiar is the assertion that

$$\displaystyle\begin{array}{rcl} (X_{s},X_{s+1},\ldots,X_{N}) \perp \!\!\!\perp (X_{1},\ldots,X_{s-2})\,\mid \,X_{s-1}\qquad \mbox{ for all $s \in \{ 3,\ldots,N\}$},& & {}\\ \end{array}$$

corresponding to the fact that the past and future of a Markov chain are conditionally independent given the present (\(X_{s-1}\) in this case). ♣

As a second example, let us now return to the DAG shown in Fig. 1b.

Example 4 (Conditional Independence for a More Complex DAG)

For this DAG, the basic conditional independence assertions (using the canonical ordering) can again be read off from the graph. In particular, the non-trivial relations are \(X_{4} \perp \!\!\!\perp (X_{2},X_{3})\mid X_{1}\), \(X_{5} \perp \!\!\!\perp (X_{1},X_{2})\mid (X_{3},X_{4})\), as well as

$$\displaystyle\begin{array}{rcl} X_{6} \perp \!\!\!\perp (X_{1},X_{2},X_{3},X_{4})\mid X_{5},\quad \text{and}\quad X_{7} \perp \!\!\!\perp (X_{1},X_{2},X_{3},X_{4},X_{6})\mid X_{5}.& & {}\\ \end{array}$$

2.1.2 Equivalence of Representations

For any directed graph \(\mathcal{D}\), we have now defined two families of probability distributions: the family \(\mathcal{F}_{\text{Fac}}(\mathcal{D})\) of all distributions with a factorization of the form (2), and the family \(\mathcal{F}_{\text{Mar}}(\mathcal{D})\) of all distributions that satisfy the basic Markov properties (5). It is natural to ask how these two families are related; pleasingly, they are equivalent, as shown in the following result.

Theorem 1 (Equivalence for Directed Graphical Models)

For any directed acyclic graph \(\mathcal{D}\) , we have \(\mathcal{F}_{\text{Fac}}(\mathcal{D}) = \mathcal{F}_{\text{Mar}}(\mathcal{D})\) .

Proof

The proof of this result is straightforward given our development thus far. We begin with the inclusion \(\mathcal{F}_{\text{Fac}}(\mathcal{D}) \subseteq \mathcal{F}_{\text{Mar}}(\mathcal{D})\). From Lemma 1 used in the proof of Proposition 1, for any vertex \(t \in \mathcal{V}\), we have

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{t})& =& \prod _{s=1}^{t}p(x_{ s}\,\vert \,x_{\pi (s)})\; =\;\prod _{ s=1}^{t-1}p(x_{ s}\,\vert \,x_{\pi (s)})\;\;p(x_{t}\,\vert \,x_{\pi (t)}) {}\\ & =& p(x_{1},\ldots,x_{t-1})\;\;p(x_{t}\,\vert \,x_{\pi (t)}). {}\\ \end{array}$$

By the definition of \(\nu (t)\), we have \(\{1,\ldots,t - 1\} =\pi (t) \cup \nu (t)\), and consequently we can write

$$\displaystyle\begin{array}{rcl} \frac{p(x_{\pi (t)},x_{\nu (t)},x_{t})} {p(x_{\pi (t)})} & = \frac{p(x_{\pi (t)},x_{\nu (t)})} {p(x_{\pi (t)})} \;\;p(x_{t}\,\vert \,x_{\pi (t)})\; =\; p(x_{\nu (t)}\,\vert \,x_{\pi (t)})\;p(x_{t}\,\vert \,x_{\pi (t)}),& {}\\ \end{array}$$

which shows that \(X_{t} \perp \!\!\!\perp X_{\nu (t)}\mid X_{\pi (t)}\).

In order to establish the reverse inclusion, suppose that the basic Markov properties hold, where we are still using \(\{1,2,\ldots,N\}\) as our topological ordering. Using the chain rule for probability, we have

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& = & p(x_{1})\;\prod _{t=2}^{N}p(x_{t}\,\vert \,x_{1},\ldots,x_{t-1}) {}\\ & \stackrel{(i)}{=}& p(x_{1})\;\prod _{t=2}^{N}p(x_{t}\,\vert \,x_{\pi (t)}), {}\\ \end{array}$$

where equality (i) follows by applying the Markov properties. □ 

2.2 Undirected Graphical Models

We now turn to a discussion of undirected graphical models, which are also known as Markov random fields or Gibbs distributions. Naturally, these models are built using undirected graphs, by which we mean a pair \(\mbox{$ \mathcal{G} = (\mathcal{V},\mathcal{E})$}\), where \(\mathcal{V} =\{ 1,\ldots,N\}\) is the vertex set (as before), and \(\mathcal{E}\) is a collection of undirected edges, meaning that there is no distinction between the edge \((s,t)\) and the edge \((t,s)\). As before, we associate a random variable \(X_{s}\) with each vertex \(s \in \mathcal{V}\) of the graph, and our interest is in characterizing the joint distribution of the random vector \(X = (X_{1},\ldots,X_{N})\).

As with directed graphical models, there are two different ways in which the probabilistic structure of the random vector \(X\) can be linked to the graphical structure: factorization and conditional independence properties. Let us begin our exploration with the former property.

2.2.1 Factorization for Undirected Models

For undirected graphical models, the factorization properties are specified in terms of cliques of the graph. A clique of an undirected graph \(\mathcal{G}\) is a fully connected subset \(C\) of the vertex set \(\mathcal{V}\) (i.e., \((s,t) \in \mathcal{E}\) for all distinct \(s,t \in C\)). A clique is maximal if it is not contained within any other clique. Thus, any singleton set \(\{s\}\) is always a clique, but it is not maximal unless \(s\) has no neighbors in the graph. See Fig. 2a for an illustration of some other types of cliques.
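For small graphs, cliques can be enumerated by brute force. The sketch below encodes an edge list consistent with the caption of Fig. 2a (our own reconstruction, since the figure itself is not reproduced here) and recovers its maximal cliques.

```python
import itertools

# Edge list consistent with the caption of Fig. 2a (an assumption of ours).
edges = {frozenset(e) for e in
         [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5), (4, 6), (5, 7)]}
vertices = range(1, 8)

def is_clique(C):
    """A vertex set is a clique if every pair of distinct members is joined by an edge."""
    return all(frozenset((s, t)) in edges for s, t in itertools.combinations(C, 2))

# Every subset of a clique is a clique; maximal cliques are not properly contained
# in any other clique.
cliques = [set(C) for r in range(1, 8)
           for C in itertools.combinations(vertices, r) if is_clique(C)]
maximal = [C for C in cliques if not any(C < D for D in cliques)]
```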

Fig. 2

(a) Illustration of cliques in an undirected graph. Sets \(A =\{ 1,2,3\}\) and \(B =\{ 3,4,5\}\) are 3-cliques, whereas sets \(C =\{ 4,6\}\) and \(D =\{ 5,7\}\) are 2-cliques (that can be identified with edges). (b) Illustration of a vertex cutset: removal of the vertices in S breaks the graph into the separate pieces indexed by subsets A and B

Given an undirected graph, we use \(\mathfrak{C}\) to denote the set of all its cliques. With each clique \(C \in \mathfrak{C}\), we associate a compatibility function \(\psi _{C}: \mathcal{X}_{C} \rightarrow \mathbb{R}_{+}\): it assigns a non-negative number \(\psi _{C}(x_{C})\) to each possible configuration \(x_{C} = (x_{s},s \in C) \in \mathcal{X}_{C}\). The factorization property for an undirected graphical model is stated in terms of these compatibility functions.

Definition 4 (Factorization for Undirected Graphical Models)

A probability distribution p factorizes over the graph \(\mathcal{G}\) if

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& = \frac{1} {Z}\,\prod _{C\in \mathfrak{C}}\psi _{C}(x_{C}),&{}\end{array}$$
(6)

for some choice of non-negative compatibility function \(\psi _{C}: \mathcal{X}_{C} \rightarrow \mathbb{R}_{+}\) for each clique \(C \in \mathfrak{C}\). We use \(\mathcal{F}_{\text{Fac}}(\mathcal{G})\) to denote the set of all distributions that have a factorization of the form (6).

As before, the quantity Z is a constant chosen to ensure that the distribution is appropriately normalized. (In contrast to the directed case, we cannot take Z = 1 in general.) Note that there is a great deal of freedom in the factorization (6); in particular, in the way that we have written it, there is no unique choice of the compatibility functions and the normalization constant Z. For instance, if we multiply any compatibility function by some constant α > 0 and do the same to Z, we obtain the same distribution.

Let us illustrate these definitions with some examples.

Example 5 (Markov Chain as an Undirected Graphical Model)

Suppose that we remove the arrows from the edges of the chain graph shown in Fig. 1a; we then obtain an undirected chain. Any distribution that factorizes according to this undirected chain has the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{5}) = \frac{1} {Z}\prod _{s=1}^{5}\psi _{s}(x_{s})\;\psi _{12}(x_{1},x_{2})\;\psi _{23}(x_{2},x_{3})\;\psi _{34}(x_{3},x_{4})\;\psi _{45}(x_{4},x_{5})& &{}\end{array}$$
(7)

for some choice of non-negative compatibility functions. Unlike the directed case, in general these compatibility functions are not equal to conditional distributions, nor to marginal distributions. Moreover, we have no guarantee that Z = 1 in this undirected representation of a Markov chain. The factorization (7) makes use of both maximal cliques (edges) and non-maximal cliques (singletons). Without loss of generality, we could restrict to edge-based factors alone, since any singleton factor can be absorbed into an adjacent edge factor. In applications, however, it is often convenient to include both types of compatibility functions for reasons of interpretability. ♣
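It is instructive to compute the normalization constant Z for a small undirected chain by exhaustive summation. The sketch below uses arbitrary positive edge compatibility functions (our own choices) and also illustrates the rescaling freedom noted above: multiplying one factor by a constant rescales Z but leaves the distribution unchanged.

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)
N, m = 5, 2
# One positive edge compatibility function psi_{s,s+1} per edge of the chain.
edge_psi = [rng.random((m, m)) + 0.1 for _ in range(N - 1)]

def unnormalized(x):
    val = 1.0
    for s in range(N - 1):
        val *= edge_psi[s][x[s], x[s + 1]]
    return val

states = list(itertools.product(range(m), repeat=N))
Z = sum(unnormalized(x) for x in states)   # in general Z != 1 here
p = {x: unnormalized(x) / Z for x in states}

# Rescale one compatibility function by alpha = 2: Z doubles, p is unchanged.
edge_psi[0] = 2.0 * edge_psi[0]
Z2 = sum(unnormalized(x) for x in states)
p2 = {x: unnormalized(x) / Z2 for x in states}
```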

Example 6

As a second example, consider the undirected graph shown in Fig. 2a. Any distribution that respects its clique structure has the factorization

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{7})& = \frac{1} {Z}\;\psi _{123}(x_{1},x_{2},x_{3})\;\psi _{345}(x_{3},x_{4},x_{5})\;\psi _{46}(x_{4},x_{6})\;\psi _{57}(x_{5},x_{7}),& {}\\ \end{array}$$

for some choice of non-negative compatibility functions. In this case, we have restricted our factorization to the maximal cliques of the graph. ♣

2.2.2 Markov Property for Undirected Models

We now turn to the second way in which graph structure can be related to probabilistic structure, namely via conditional independence properties. In the case of undirected graphs, these conditional independence properties are specified in terms of vertex cutsets of the graph. A vertex cutset is a subset \(S\) of the vertex set \(\mathcal{V}\) whose removal breaks the graph into two or more disconnected subsets of vertices. For instance, as shown in Fig. 2b, removing the subset S of vertices breaks the graph into two disconnected components, as indexed by the vertices in A and B respectively. For an undirected graph, the Markov property is defined by these vertex cutsets:

Definition 5 (Markov Property for Undirected Graphical Models)

A random vector \((X_{1},\ldots,X_{N})\) is Markov with respect to an undirected graph \(\mathcal{G}\) if, for all vertex cutsets S and associated components A and B, the conditional independence condition \(X_{A} \perp \!\!\!\perp X_{B}\,\mid \,X_{S}\) holds. We use \(\mathcal{F}_{\text{Mar}}(\mathcal{G})\) to denote the set of all distributions that satisfy all such Markov properties defined by \(\mathcal{G}\).

For example, for any vertex \(s\), its neighborhood set

$$\displaystyle\begin{array}{rcl} \mathcal{N}(s):=\{ t \in \mathcal{V}\,\mid \,(s,t) \in \mathcal{E}\}& &{}\end{array}$$
(8)

is always a vertex cutset. Consequently, any distribution that is Markov with respect to an undirected graph satisfies the conditional independence relations

$$\displaystyle\begin{array}{rcl} X_{s} \perp \!\!\!\perp X_{\mathcal{V}\setminus (\{s\}\cup \mathcal{N}(s))}\mid X_{\mathcal{N}(s)}.& &{}\end{array}$$
(9)

The set of variables \(X_{\mathcal{N}(s)} = (X_{t},\,t \in \mathcal{N}(s))\) is often referred to as the Markov blanket of \(X_{s}\), since it is the minimal subset required to render \(X_{s}\) conditionally independent of all other variables in the graph. This particular Markov property plays an important role in pseudolikelihood estimation of parameters [11], as well as related approaches for graphical model selection by neighborhood regression (see the papers [48, 52] for further details).
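The Markov blanket property (9) can be checked numerically. In the sketch below (an illustrative construction of ours), we place random positive compatibility functions on the maximal cliques of Fig. 2a and verify that the conditional law of \(X_{1}\) given all other variables depends only on its neighbors \(X_{2},X_{3}\).

```python
import itertools

import numpy as np

rng = np.random.default_rng(4)
m = 2
cliques = [(1, 2, 3), (3, 4, 5), (4, 6), (5, 7)]   # maximal cliques of Fig. 2a
psi = {C: rng.random((m,) * len(C)) + 0.1 for C in cliques}

def unnorm(x):
    val = 1.0
    for C in cliques:
        val *= psi[C][tuple(x[s - 1] for s in C)]
    return val

states = list(itertools.product(range(m), repeat=7))

def cond_x1_given_rest(rest):
    """p(X1 = 1 | (X2, ..., X7) = rest), directly from the factorization."""
    return unnorm((1,) + rest) / sum(unnorm((x1,) + rest) for x1 in range(m))

def cond_x1_given_blanket(x2, x3):
    """p(X1 = 1 | X2 = x2, X3 = x3), marginalizing out all remaining variables."""
    num = sum(unnorm(x) for x in states if (x[0], x[1], x[2]) == (1, x2, x3))
    den = sum(unnorm(x) for x in states if (x[1], x[2]) == (x2, x3))
    return num / den

# Largest discrepancy over all conditioning configurations: should be ~0.
gap = max(abs(cond_x1_given_rest(rest) - cond_x1_given_blanket(rest[0], rest[1]))
          for rest in itertools.product(range(m), repeat=6))
```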

Let us illustrate with some additional concrete examples.

Example 7 (Conditional Independence for a Markov Chain)

Recall from Example 5 the view of a Markov chain as an undirected graphical model. For this chain graph, the minimal vertex cutsets are singletons, with the non-trivial ones being the subsets \(\{s\}\) for \(s \in \{ 2,\ldots,N - 1\}\). Each such vertex cutset induces the familiar conditional independence relation

$$\displaystyle\begin{array}{rcl} \mathop{\underbrace{(X_{1},\ldots,X_{s-1})}}\limits _{\text{Past}} \perp \!\!\!\perp \mathop{\underbrace{ (X_{s+1},\ldots,X_{N})}}\limits _{\text{Future}}\mid \mathop{\underbrace{X_{s}}}\limits _{\text{Present}},& &{}\end{array}$$
(10)

corresponding to the fact that the past and future of a Markov chain are conditionally independent given the present. ♣

Example 8 (More Markov Properties)

Consider the undirected graph shown in Fig. 2b. As previously discussed, the set S is a vertex cutset, and hence it induces the conditional independence property \(X_{A} \perp \!\!\!\perp X_{B}\mid X_{S}\). ♣

2.2.3 Hammersley-Clifford Equivalence

As in the directed case, we have now specified two ways in which graph structure can be linked to probabilistic structure. This section is devoted to a classical result that establishes a certain equivalence between the factorization and Markov properties, that is, between the two families of distributions \(\mathcal{F}_{\text{Fac}}(\mathcal{G})\) and \(\mathcal{F}_{\text{Mar}}(\mathcal{G})\). For strictly positive distributions \(p\), this equivalence is complete.

Theorem 2 (Hammersley-Clifford)

If the distribution p of random vector \(X = (X_{1},\ldots,X_{N})\) factorizes over a graph \(\mathcal{G}\) , then X is Markov with respect to \(\mathcal{G}\) . Conversely, if a random vector X is Markov with respect to \(\mathcal{G}\) and p(x) > 0 for all \(x \in \mathcal{X}_{[N]}\) , then \(p\) factorizes over the graph.

Proof

We begin by proving that \(\mathcal{F}_{\text{Fac}}(\mathcal{G}) \subseteq \mathcal{F}_{\text{Mar}}(\mathcal{G})\). Suppose that the factorization (6) holds, and let \(S\) be an arbitrary vertex cutset of the graph such that subsets A and B are separated by \(S\). We may assume without loss of generality that both A and B are non-empty, and we need to show that \(X_{A} \perp \!\!\!\perp X_{B}\mid X_{s}\). Let us define subsets of cliques by \(\mathfrak{C}_{A}:=\{ C \in \mathfrak{C}\,\mid \,C \cap A\neq \emptyset \}\), \(\mathfrak{C}_{B}:=\{ C \in \mathfrak{C}\,\mid \,C \cap B\neq \emptyset \}\), and \(\mathfrak{C}_{s}:=\{ C \in \mathfrak{C}\,\mid \,C \subseteq S\}\). We claim that these three subsets form a disjoint partition of the full clique set—namely, \(\mathfrak{C} = \mathfrak{C}_{A} \cup \mathfrak{C}_{s} \cup \mathfrak{C}_{B}\). Given any clique C, it is either contained entirely within S, or must have non-trivial intersection with either A or B, which proves the union property. To establish disjointness, it is immediate that \(\mathfrak{C}_{s}\) is disjoint from \(\mathfrak{C}_{A}\) and \(\mathfrak{C}_{B}\). On the other hand, if there were some clique \(C \in \mathfrak{C}_{A} \cap \mathfrak{C}_{B}\), then there would exist nodes \(a \in A\) and \(b \in B\) with \(\{a,b\} \subseteq C\), which contradicts the fact that A and B are separated by the cutset \(S\).

Consequently, we may write

$$\displaystyle\begin{array}{rcl} p(x_{A},x_{s},x_{B}) = \frac{1} {Z}\;\mathop{\underbrace{\big[\prod _{C\in \mathfrak{C}_{A}}\psi _{C}(x_{C})\big]}}\limits _{\varPsi _{A}(x_{A},x_{s})}\quad \mathop{\underbrace{\big[\prod _{C\in \mathfrak{C}_{s}}\psi _{C}(x_{C})\big]}}\limits _{\varPsi _{s}(x_{s})}\quad \mathop{\underbrace{\big[\prod _{C\in \mathfrak{C}_{B}}\psi _{C}(x_{C})\big]}}\limits _{\varPsi _{B}(x_{B},x_{s})}.& & {}\\ \end{array}$$

Defining the quantities

$$\displaystyle{ Z_{A}(x_{s}):=\sum _{x_{A}}\varPsi _{A}(x_{A},x_{s}),\quad \text{and}\quad Z_{B}(x_{s}):=\sum _{x_{B}}\varPsi _{B}(x_{B},x_{s}), }$$

we then obtain the following expressions for the marginal distributions of interest

$$\displaystyle{ p(x_{s})=\frac{Z_{A}(x_{s})\;Z_{B}(x_{s})} {Z} \:\!\varPsi _{s}(x_{s})\quad \mbox{ and}\quad p(x_{A},x_{s})=\frac{Z_{B}(x_{s})} {Z} \,\varPsi _{A}(x_{A},x_{s})\;\varPsi _{s}(x_{s}), }$$

with a similar expression for \(p(x_{B},x_{s})\). Consequently, for any \(x_{s}\) for which \(p(x_{s}) > 0\), we may write

$$\displaystyle\begin{array}{rcl} \frac{p(x_{A},x_{s},x_{B})} {p(x_{s})} & =& \frac{ \frac{1} {Z}\,\varPsi _{A}(x_{A},x_{s})\varPsi _{s}(x_{s})\varPsi _{B}(x_{B},x_{s})} {\frac{Z_{A}(x_{s})\;Z_{B}(x_{s})} {Z} \,\varPsi _{s}(x_{s})} \\ & =& \frac{\varPsi _{A}(x_{A},x_{s})\varPsi _{B}(x_{B},x_{s})} {Z_{A}(x_{s})\;Z_{B}(x_{s})}. {}\end{array}$$
(11)

Similar calculations yield the relations

$$\displaystyle\begin{array}{rcl} \frac{p(x_{A},x_{s})} {p(x_{s})} & = \frac{\frac{Z_{B}(x_{s})} {Z} \,\varPsi _{A}(x_{A},x_{s})\;\varPsi _{s}(x_{s})} {\frac{Z_{A}(x_{s})Z_{B}(x_{s})} {Z} \varPsi _{s}(x_{s})} \; =\; \frac{\varPsi _{A}(x_{A},x_{s})} {Z_{A}(x_{s})},\qquad \text{and}&{}\end{array}$$
(12a)
$$\displaystyle\begin{array}{rcl} \frac{p(x_{B},x_{s})} {p(x_{s})} & = \frac{\frac{Z_{A}(x_{s})} {Z} \,\varPsi _{B}(x_{B},x_{s})\;\varPsi _{s}(x_{s})} {\frac{Z_{A}(x_{s})Z_{B}(x_{s})} {Z} \varPsi _{s}(x_{s})} \; =\; \frac{\varPsi _{B}(x_{B},x_{s})} {Z_{B}(x_{s})}.&{}\end{array}$$
(12b)

Combining Eq. (11) with Eqs. (12a) and (12b) yields

$$\displaystyle\begin{array}{rcl} p(x_{A},x_{B}\,\mid \,x_{s})& =& \frac{p(x_{A},x_{B},x_{s})} {p(x_{s})} \; =\; \frac{p(x_{A},x_{s})} {p(x_{s})} \,\frac{p(x_{B},x_{s})} {p(x_{s})} {}\\ & =& p(x_{A}\,\mid \,x_{s})\;p(x_{B}\,\mid \,x_{s}), {}\\ \end{array}$$

thereby showing that \(X_{A} \perp \!\!\!\perp X_{B}\mid X_{s}\), as claimed.

In order to prove the opposite inclusion—namely, that the Markov property for a strictly positive distribution implies the factorization property—we require a version of the inclusion-exclusion formula. Given a set \([N] =\{ 1,2,\ldots,N\}\), let \(\mathcal{P}([N])\) be its power set, meaning the set of all subsets of \([N]\). With this notation, the following inclusion-exclusion formula is classical:

Lemma 2 (Inclusion-Exclusion)

For any two real-valued functions Ψ and Φ defined on the power set \(\mathcal{P}([N])\) , the following statements are equivalent:

$$\displaystyle\begin{array}{rcl} \varPhi (A)& =\sum _{B\subseteq A}(-1)^{\vert A\setminus B\vert }\varPsi (B)\quad \mbox{ for all $A \in \mathcal{P}([N])$.}&{}\end{array}$$
(13a)
$$\displaystyle\begin{array}{rcl} \varPsi (A)& =\sum _{B\subseteq A}\varPhi (B)\quad \mbox{ for all $A \in \mathcal{P}([N])$.}&{}\end{array}$$
(13b)
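The equivalence of Eqs. (13a) and (13b) is an instance of Möbius inversion over the power set, and is easy to check numerically. The following sketch (the helper names and the arbitrarily chosen function are ours, purely for illustration) applies (13a) to a function on the power set of {1, 2, 3} and verifies that (13b) recovers it exactly:

```python
from itertools import chain, combinations

def subsets(A):
    """All subsets of the tuple A, as tuples (in the same order as A)."""
    return chain.from_iterable(combinations(A, r) for r in range(len(A) + 1))

# An arbitrary integer-valued function Psi on the power set of {1, 2, 3}.
N = (1, 2, 3)
Psi = {B: len(B) ** 2 + 3 * sum(B) for B in subsets(N)}

# Eq. (13a): Phi(A) = sum over B subset of A of (-1)^{|A \ B|} Psi(B).
# Since B is a subset of A, |A \ B| = len(A) - len(B).
Phi = {A: sum((-1) ** (len(A) - len(B)) * Psi[B] for B in subsets(A))
       for A in subsets(N)}

# Eq. (13b) should then recover Psi exactly.
for A in subsets(N):
    assert Psi[A] == sum(Phi[B] for B in subsets(A))
print("inversion verified on all 8 subsets")
```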

Proofs of this result can be found in standard texts in combinatorics (e.g., [61]). Returning to the main thread, let \(y \in \mathcal{X}_{[N]}\) be some fixed element, and for each subset \(A \in \mathcal{P}([N])\), define the function \(\phi _{A}: \mathcal{X}_{[N]} \rightarrow \mathbb{R}\) via

$$\displaystyle\begin{array}{rcl} \phi _{A}(x)&:=\sum _{B\subseteq A}(-1)^{\vert A\setminus B\vert }\log \frac{p(x_{B},y_{B^{c}})} {p(y)}.&{}\end{array}$$
(14)

(Note that taking logarithms is meaningful since we have assumed \(p(x) > 0\) for all \(x \in \mathcal{X}_{[N]}\).) From the inclusion-exclusion formula, we have \(\log \frac{p(x_{A},y_{A^{c}})} {p(y)} =\sum _{B\subseteq A}\phi _{B}(x)\), and setting \(A = [N]\) yields

$$\displaystyle\begin{array}{rcl} p(x)& = p(y)\exp \big\{\sum _{B\in \mathcal{P}([N])}\phi _{B}(x)\big\}.&{}\end{array}$$
(15)

From the definition (14), we see that \(\phi _{A}\) is a function only of \(x_{A}\). In order to complete the proof, it remains to show that \(\phi _{A} = 0\) for any subset \(A\) that is not a graph clique. As an intermediate result, we claim that for any \(t \in A\), we can write

$$\displaystyle\begin{array}{rcl} \phi _{A}(x)& =\sum _{B\subseteq A\setminus \{t\}}(-1)^{\vert A\setminus B\vert }\,\log \frac{p(y_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})} {p(x_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})}.&{}\end{array}$$
(16)

To establish this claim, we write

$$\displaystyle\begin{array}{rcl} \phi _{A}(x)& =& \sum _{\begin{array}{c}B\subseteq A \\ B\not\ni t \end{array}}(-1)^{\vert A\setminus B\vert }\log \frac{p(x_{B},y_{B^{c}})} {p(y)} +\sum _{\begin{array}{c}B\subseteq A \\ B\ni t \end{array}}(-1)^{\vert A\setminus B\vert }\log \frac{p(x_{B},y_{B^{c}})} {p(y)} \\ & =& \sum _{B\subseteq A\setminus \{t\}}(-1)^{\vert A\setminus B\vert }\,\biggl \{\log \frac{p(x_{B},y_{B^{c}})} {p(y)} -\log \frac{p(x_{B\cup t},y_{B^{c}\setminus \{t\}})} {p(y)} \biggr \} \\ & =& \sum _{B\subseteq A\setminus \{t\}}(-1)^{\vert A\setminus B\vert }\,\log \frac{p(x_{B},y_{B^{c}})} {p(x_{B\cup t},y_{B^{c}\setminus \{t\}})}. {}\end{array}$$
(17)

Note that for any \(B \subseteq A\setminus \{t\}\), we are guaranteed that \(t\notin B\), whence

$$\displaystyle\begin{array}{rcl} \frac{p(x_{B},y_{B^{c}})} {p(x_{B\cup t},y_{B^{c}\setminus \{t\}})}& = \frac{p(y_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})} {p(x_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})}.& {}\\ \end{array}$$

Substituting into Eq. (17) yields the claim (16).

We can now conclude the proof. If \(A\) is not a clique, then there must exist some pair \((s,t)\) of vertices in \(A\) not joined by an edge. Using the representation (16), we can write \(\phi _{A}\) as a sum of four terms (i.e., \(\phi _{A} =\sum _{ i=1}^{4}T_{i}\)), where

$$\displaystyle\begin{array}{rcl} T_{1}(x)& =& \sum _{B\subseteq A\setminus \{s,t\}}(-1)^{\vert A\setminus B\vert }\log p(y_{ t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}}), {}\\ T_{2}(x)& =& \sum _{B\subseteq A\setminus \{s,t\}}(-1)^{\vert A\setminus (B\cup \{t\})\vert }\log p(x_{ t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}}) {}\\ T_{3}(x)& =& \sum _{B\subseteq A\setminus \{s,t\}}(-1)^{\vert A\setminus (B\cup \{s\})\vert }\log p(y_{ t}\,\vert \,x_{B\cup \{s\}},y_{B^{c}\setminus \{s,t\}}),\quad \text{and} {}\\ T_{4}(x)& =& \sum _{B\subseteq A\setminus \{s,t\}}(-1)^{\vert A\setminus (B\cup \{s,t\})\vert }\log p(x_{ t}\,\vert \,x_{B\cup \{s\}},y_{B^{c}\setminus \{s,t\}}). {}\\ \end{array}$$

Combining these separate terms, we obtain

$$\displaystyle\begin{array}{rcl} \phi _{A}(x)& =\sum _{B\subseteq A\setminus \{s,t\}}(-1)^{\vert A\setminus B\vert }\log \frac{p(y_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})p(x_{t}\,\vert \,x_{B\cup \{s\}},y_{B^{c}\setminus \{s,t\}})} {p(x_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})p(y_{t}\,\vert \,x_{B\cup \{s\}},y_{B^{c}\setminus \{s,t\}})}.& {}\\ \end{array}$$

But using the Markov properties of the graph, each term in this sum is zero. Indeed, since \(s\notin \mathcal{N}(t)\), the variable \(X_{t}\) is conditionally independent of \(X_{s}\) given the remaining variables, so that

$$\displaystyle\begin{array}{rcl} p(x_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})& =& p(x_{t}\,\vert \,x_{B\cup \{s\}},y_{B^{c}\setminus \{s,t\}}),\quad \text{and} {}\\ p(y_{t}\,\vert \,x_{B},y_{B^{c}\setminus \{t\}})& =& p(y_{t}\,\vert \,x_{B\cup \{s\}},y_{B^{c}\setminus \{s,t\}}), {}\\ \end{array}$$

which completes the proof. □ 

2.2.4 Factor Graphs

For large graphs, the factorization properties of a graphical model, whether undirected or directed, may be difficult to visualize from the usual depictions of graphs. The formalism of factor graphs provides an alternative graphical representation, one which emphasizes the factorization of the distribution [43, 46].

Let \(\mathcal{F}\) represent an index set for the set of factors defining a graphical model distribution. In the undirected case, this set indexes the collection \(\mathcal{C}\) of cliques, while in the directed case \(\mathcal{F}\) indexes the set of parent–child neighborhoods. We then consider a bipartite graph \(\mathcal{G} = (\mathcal{V},\mathcal{F},\mathcal{E})\), in which \(\mathcal{V}\) is (as before) an index set for the variables, and \(\mathcal{F}\) is an index set for the factors. Given the bipartite nature of the graph, the edge set \(\mathcal{E}\) now consists of pairs \((s,a)\) with \(s \in \mathcal{V}\) and \(a \in \mathcal{F}\) such that the variable \(x_{s}\) participates in the factor indexed by \(a\); equivalently, \(s \in \mathcal{N}(a)\), or \(a \in \mathcal{N}(s)\). See Fig. 3b for an illustration.

Fig. 3

Illustration of undirected graphical models and factor graphs. (a) An undirected graph on \(N = 7\) vertices, with maximal cliques {1, 2, 3, 4}, {4, 5, 6} and {6, 7}. (b) Equivalent representation of the undirected graph in (a) as a factor graph, assuming that we define compatibility functions only on the maximal cliques in (a). The factor graph is a bipartite graph with vertex set \(\mathcal{V} =\{ 1,\ldots,7\}\) and factor set \(\mathcal{F} =\{ a,b,c\}\), one for each of the compatibility functions of the original undirected graph

For undirected models, the factor graph representation is of particular value when \(\mathcal{C}\) consists of more than the maximal cliques. Indeed, the compatibility functions for the nonmaximal cliques do not have an explicit representation in the usual representation of an undirected graph—however, the factor graph makes them explicit. This explicitness can be useful in resolving certain ambiguities that arise with the standard formalism of undirected graphs. As one illustration, consider the three-vertex cycle shown in Fig. 4a. Two possible factorizations that are consistent with this graph structure are the pairwise-only factorization

$$\displaystyle\begin{array}{rcl} p(x_{1},x_{2},x_{3})& = \frac{1} {Z}\psi _{12}(x_{1},x_{2})\;\psi _{23}(x_{2},x_{3})\;\psi _{13}(x_{1},x_{3}),&{}\end{array}$$
(18)

which involves only the non-maximal pairwise cliques, versus the generic triplet factorization \(p(x_{1},x_{2},x_{3}) = \frac{1} {Z'}\psi _{123}(x_{1},x_{2},x_{3})\). The undirected graph in panel (a) does not distinguish between these two possibilities. In contrast, the factor graph representations of these two models are distinct, as shown in panels (b) and (c) respectively. This distinction between the pairwise and triplet factorization is important in many applications, for instance when testing for the presence of ternary interactions between different factors underlying a particular disease.
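To make the distinction concrete, one can write out the two bipartite graphs explicitly. The following sketch (the dictionary representation and factor labels are our own, not notation from the text) records each factor together with the variables it touches, and shows that the two models, while Markov with respect to the same three-vertex cycle, have different factor graphs:

```python
# A factor graph, stored minimally as a dict mapping each factor label to the
# tuple of variable indices in its scope.

# Pairwise factorization (18): one factor per pairwise clique of the cycle.
pairwise = {"a": (1, 2), "b": (2, 3), "c": (1, 3)}

# Generic triplet factorization: a single factor touching all three variables.
triplet = {"d": (1, 2, 3)}

def edges(factor_graph):
    """Bipartite edge set: pairs (s, a) such that variable s appears in factor a."""
    return {(s, a) for a, scope in factor_graph.items() for s in scope}

print(sorted(edges(pairwise)))  # six variable-factor edges
print(sorted(edges(triplet)))   # three variable-factor edges
```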

Fig. 4

(a) A three-vertex cycle graph that can be used to represent any distribution over three random variables. (b) Factor graph representation of the pairwise model (18). (c) Factor graph representation of the triplet interaction model

3 Exact Algorithms for Marginals, Likelihoods and Modes

In applications of graphical models, one is typically interested in solving one of a core set of computational problems. One example is the problem of likelihood computation, which arises in parameter estimation and hypothesis testing with graphical models. A closely related problem is that of computing the marginal distribution \(p(x_{A})\) over a particular subset \(A \subset \mathcal{V}\) of nodes, or similarly, computing the conditional distribution \(p(x_{A}\mid x_{B})\), for disjoint subsets \(A\) and \(B\) of the vertex set. This marginalization problem is needed in order to perform filtering or smoothing of time series (for chain-structured graphs), or the analogous operations for more general graphical models. A third problem is that of mode computation, in which the goal is to find a configuration \(\hat{x} \in \mathcal{X}_{[N]}\) of highest probability—that is, \(\hat{x} \in \arg \max _{x\in \mathcal{X}_{[N]}}p(x)\). In typical applications, some subset of the variables are observed, so these computations are often performed for conditional distributions, with the observed values fixed. It is straightforward to incorporate observed values into the algorithms that we discuss in this section.

At first sight, these problems—likelihood computation, marginalization and mode computation—might seem quite simple and indeed, for small graphical models (e.g., \(N \approx 10\)–\(20\)), they can be solved directly with brute force approaches. However, many applications involve graphs with thousands (if not millions) of nodes, and the computational complexity of brute force approaches scales very poorly in the graph size. More concretely, let us consider the case of a discrete random vector \((X_{1},\ldots,X_{N}) \in \mathcal{X}_{[N]}\) such that, for each node \(s \in \mathcal{V}\), the random variable \(X_{s}\) takes values in the state space \(\mathcal{X}_{s} =\{ 0,1,\ldots,m - 1\}\). A naive approach to computing a marginal at a single node—say \(p(x_{s})\)—entails summing over all configurations of the form \(\{x' \in \mathcal{X}_{[N]}\mid x'_{s} = x_{s}\}\). Since this set has \(m^{N-1}\) elements, it is clear that a brute force approach will rapidly become intractable. Given a graph with \(N = 100\) vertices (a relatively small problem) and binary variables (\(m = 2\)), the number of terms is \(2^{99}\), which is already larger than estimates of the number of atoms in the universe. This is a vivid manifestation of the “curse of dimensionality” in a computational sense.
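To see the scaling concretely, consider a direct implementation of brute-force marginalization. The sketch below (our own illustration; the factor representation is an assumption, not notation from the text) enumerates all \(m^{N}\) configurations, which is feasible only for tiny models:

```python
from itertools import product

def brute_force_marginal(psi_factors, N, m, s):
    """Marginal p(x_s) by summing the unnormalized product of compatibility
    functions over all m**N configurations: exponential in N."""
    p = [0.0] * m
    for x in product(range(m), repeat=N):       # m**N terms in this sum
        weight = 1.0
        for scope, table in psi_factors:        # product of compatibility fns
            weight *= table[tuple(x[t] for t in scope)]
        p[x[s]] += weight
    Z = sum(p)
    return [v / Z for v in p]

# Toy example: a 3-chain with binary variables and attractive pairwise factors.
psi = {(0, 0): 2.0, (1, 1): 2.0, (0, 1): 1.0, (1, 0): 1.0}
factors = [((0, 1), psi), ((1, 2), psi)]
print(brute_force_marginal(factors, N=3, m=2, s=1))  # symmetric: [0.5, 0.5]
```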

A similar curse applies to the problem of mode computation for discrete random vectors, since it is equivalent to solving an integer programming problem. Indeed, the mode computation problem for Markov random fields includes many well-known instances of NP-complete problems (e.g., 3-SAT, MAX-CUT etc.) as special cases. Unfortunately, for continuous random vectors, the problems are no easier and typically harder, since the marginalization problem involves computing a high-dimensional integral, whereas mode computation corresponds to a generic (possibly non-convex) optimization problem. An important exception to this statement is the Gaussian case where the problems of both marginalization and mode computation can be solved in polynomial-time for any graph (via matrix inversion), and in linear-time for tree-structured graphs via the Kalman filter.

In the following sections, we develop various approaches to these computational inference problems, beginning with discussion of a relatively naive but pedagogically useful scheme known as elimination, then moving onto more sophisticated message-passing algorithms, and culminating in our derivation of the junction tree algorithm.

3.1 Elimination Algorithm

We begin with an exact but relatively naive method known as the elimination algorithm. It operates on an undirected graphical model equipped with a factorization of the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& = \frac{1} {Z}\prod _{C\in \mathfrak{C}}\psi _{C}(x_{C}).& {}\\ \end{array}$$

As input to the elimination algorithm, we provide it with a target vertex \(T \in \mathcal{V}\) (or more generally a target subset of vertices), an elimination ordering \(\mathcal{I}\), meaning an ordering of the vertex set \(\mathcal{V}\) in which the target \(T\) appears first, as well as the collection of compatibility functions \(\{\psi _{C},C \in \mathfrak{C}\}\) defining the factorization (6). As output, the elimination algorithm returns the marginal distribution \(p(x_{T})\) over the variables \(x_{T}\) at the target vertices.

Elimination Algorithm for Marginalization

  1. 1.

    Initialization:

    1. (a)

      Initialize active set of compatibility functions \(\mathcal{A} =\{\psi _{C},C \in \mathfrak{C}\}\).

    2. (b)

      Initialize elimination ordering \(\mathcal{I}\) (with the target vertex T appearing first).

  2. 2.

    Elimination: while \(\mathop{\mathrm{card}}\nolimits (\mathcal{I}) >\mathop{ \mathrm{card}}\nolimits (T)\):

    1. (a)

      Activate last vertex \(s\) on the current elimination order:

    2. (b)

      Form product

      $$\displaystyle\begin{array}{rcl} \phi (x_{s},x_{\mathcal{N}(s)})& =\prod _{\psi _{ C}\mbox{ involving $x_{s}$}}\psi _{C}(x_{C}),& {}\\ \end{array}$$

      where \(\mathcal{N}(s)\) indexes all variables that share active factors with \(s\).

    3. (c)

      Compute partial sum:

      $$\displaystyle\begin{array}{rcl} \tilde{\phi }_{\mathcal{N}(s)}(x_{\mathcal{N}(s)})& =\sum _{x_{s}}\phi (x_{s},x_{\mathcal{N}(s)}),& {}\\ \end{array}$$

      and remove all \(\{\psi _{C}\}\) that involve \(x_{s}\) from active list.

    4. (d)

      Add \(\tilde{\phi }_{\mathcal{N}(s)}\) to active list, and remove \(s\) from index set \(\mathcal{I}\).

  3. 3.

    Termination: when \(\mathcal{I} =\{ T\}\), form the product \(\phi _{T}(x_{T})\) of all remaining functions in \(\mathcal{A}\), and then compute

    $$\displaystyle\begin{array}{rcl} Z =\sum _{x_{T}}\phi _{T}(x_{T})\quad \text{and}\quad p(x_{T}) = \frac{\phi _{T}(x_{T})} {Z}.& & {}\\ \end{array}$$

We note that in addition to returning the marginal distribution \(p(x_{T})\), the algorithm also returns the normalization constant \(Z\). The elimination algorithm is best understood in application to a particular undirected graph.

Example 9 (Elimination in Action)

Consider the 7-vertex undirected graph shown in Fig. 2a. As discussed in Example 6, it induces factorizations of the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{7})& \propto \psi _{123}(x_{1},x_{2},x_{3})\;\psi _{345}(x_{3},x_{4},x_{5})\;\psi _{46}(x_{4},x_{6})\;\psi _{57}(x_{5},x_{7}).& {}\\ \end{array}$$

Suppose that our goal is to compute the marginal distribution \(p(x_{5})\). In order to do so, we initialize the elimination algorithm with \(T = 5\), the elimination ordering \(\mathcal{I} =\{ 5,4,2,3,1,6,7\}\), and the active set \(\mathcal{A} =\{\psi _{123},\psi _{345},\psi _{46},\psi _{57}\}\).

Eliminating \(x_{7}\):

At the first step, \(s = 7\) becomes the active vertex. Since \(\psi _{57}\) is the only function involving \(x_{7}\), we form \(\phi (x_{5},x_{7}) =\psi _{57}(x_{5},x_{7})\), and then compute the partial sum \(\tilde{\phi }_{5}(x_{5}) =\sum _{x_{7}}\psi _{57}(x_{5},x_{7})\). We then form the new active set \(\mathcal{A} =\{\psi _{123},\psi _{345},\psi _{46},\tilde{\phi }_{5}\}\).

Eliminating \(x_{6}\):

This phase is very similar. We compute the partial sum \(\tilde{\phi }_{4}(x_{4}) =\sum _{x_{6}}\psi _{46}(x_{4},x_{6})\), and end up with the new active set \(\mathcal{A} =\{\psi _{123},\psi _{345},\tilde{\phi }_{5},\tilde{\phi }_{4}\}\).

Eliminating \(x_{1}\):

We compute \(\tilde{\phi }_{23}(x_{2},x_{3}) =\sum _{x_{1}}\psi _{123}(x_{1},x_{2},x_{3})\), and conclude with the new active set \(\mathcal{A} =\{\tilde{\phi } _{23},\psi _{345},\tilde{\phi }_{5},\tilde{\phi }_{4}\}\).

Eliminating \(x_{3}\):

In this phase, we first form the product \(\phi _{2345}(x_{2},x_{3},x_{4},x_{5}) =\tilde{\phi } _{23}(x_{2},x_{3})\psi _{345}(x_{3},x_{4},x_{5})\), and then compute the partial sum

$$\displaystyle\begin{array}{rcl} \tilde{\phi }_{245}(x_{2},x_{4},x_{5}) =\sum _{x_{3}}\phi _{2345}(x_{2},x_{3},x_{4},x_{5}).& & {}\\ \end{array}$$

We end up with the active set \(\mathcal{A} =\{\tilde{\phi } _{245},\tilde{\phi }_{5},\tilde{\phi }_{4}\}\).

Eliminating \(x_{2}\):

The output of this phase is the active set \(\mathcal{A} =\{\tilde{\phi } _{45},\tilde{\phi }_{5},\tilde{\phi }_{4}\}\), where \(\tilde{\phi }_{45}\) is obtained by summing out \(x_{2}\) from \(\tilde{\phi }_{245}\).

Eliminating \(x_{4}\):

We form the product \(\phi _{45}(x_{4},x_{5}) =\tilde{\phi } _{4}(x_{4})\,\tilde{\phi }_{45}(x_{4},x_{5})\), and then the partial sum \(\hat{\phi }_{5}(x_{5}) =\sum _{x_{4}}\phi _{45}(x_{4},x_{5})\). We conclude with the active set \(\mathcal{A} =\{\tilde{\phi } _{5},\hat{\phi }_{5}\}\).

After these steps, the index set \(\mathcal{I}\) has been reduced to the target vertex T = 5. In the final step, we form the product \(\phi _{5}(x_{5}) =\tilde{\phi } _{5}(x_{5})\hat{\phi }_{5}(x_{5})\), and then re-normalize it to sum to one. ♣
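The steps of Example 9 can be carried out mechanically. Below is a sketch of the elimination algorithm for discrete variables (the data structures and helper names are ours; the compatibility tables are filled with arbitrary positive values purely for illustration), applied to the 7-vertex factorization with the ordering \(\mathcal{I} =\{ 5,4,2,3,1,6,7\}\):

```python
from itertools import product
import random

def eliminate(factors, order, target, m):
    """Compute the marginal p(x_target) and the normalization constant Z.

    `factors` is a list of (scope, table) pairs: scope is a tuple of variable
    indices, and table maps value tuples (ordered as in scope) to nonnegative
    reals.  Variables are eliminated from the end of `order`, which lists the
    target first, as in the text.
    """
    active = list(factors)
    for s in reversed(order):
        if s == target:
            break
        involved = [f for f in active if s in f[0]]      # factors touching x_s
        active = [f for f in active if s not in f[0]]
        # Neighborhood N(s): variables sharing an active factor with s.
        nbrs = tuple(sorted({t for scope, _ in involved for t in scope} - {s}))
        new_table = {}
        for x_nbrs in product(range(m), repeat=len(nbrs)):
            assign = dict(zip(nbrs, x_nbrs))
            total = 0.0
            for x_s in range(m):                         # partial sum over x_s
                assign[s] = x_s
                w = 1.0
                for scope, table in involved:
                    w *= table[tuple(assign[t] for t in scope)]
                total += w
            new_table[x_nbrs] = total
        active.append((nbrs, new_table))                 # add phi-tilde to active set
    # Termination: every remaining factor involves only the target variable.
    phi = [1.0] * m
    for scope, table in active:
        for x_t in range(m):
            phi[x_t] *= table[tuple(x_t for _ in scope)]
    Z = sum(phi)
    return [v / Z for v in phi], Z

# Example 9: binary variables, arbitrary positive compatibility tables.
random.seed(0)
def rand_table(scope):
    return {x: random.uniform(0.5, 2.0)
            for x in product(range(2), repeat=len(scope))}

scopes = [(1, 2, 3), (3, 4, 5), (4, 6), (5, 7)]
factors = [(sc, rand_table(sc)) for sc in scopes]
marg, Z = eliminate(factors, order=[5, 4, 2, 3, 1, 6, 7], target=5, m=2)
print(marg)  # the marginal p(x_5)
```

Because each step sums over only a single variable, the cost is governed by the largest scope created during elimination, in agreement with the complexity discussion to follow.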

A few easy extensions of the elimination algorithm are important to note. Although we have described it in application to an undirected graphical model, a simple pre-processing step allows it to be applied to any directed graphical model as well. More precisely, any directed graphical model can be converted to an undirected graphical model via the following steps:

  • remove directions from all the edges

  • “moralize” the graph by adding undirected edges between all parents of a given vertex—that is, in detail

    $$\displaystyle\begin{array}{rcl} \mbox{ for all $s \in \mathcal{V}$, and for all $t,u \in \pi (s)$, add the edge $(t,u)$.}& & {}\\ \end{array}$$
  • the added edges yield cliques \(C = (s,\pi (s))\); on each such clique, define the compatibility function \(\psi _{C}(x_{C}) = p(x_{s}\,\vert \,x_{\pi (s)})\).

Applying this procedure yields an undirected graphical model to which the elimination algorithm can be applied.

A second extension concerns the computation of conditional distributions. In applications, it is frequently the case that some subset of the variables are observed, and our goal is to compute a conditional distribution of \(X_{T}\) given these observed variables. More precisely, suppose that we wish to compute a conditional probability of the form \(p(x_{T}\,\mid \bar{x}_{\mathcal{O}})\), where \(\mathcal{O}\subset \mathcal{V}\) indexes the observed values \(X_{\mathcal{O}} =\bar{ x}_{\mathcal{O}}\). In order to do so, it suffices to compute a marginal of the form \(p(x_{T},\bar{x}_{\mathcal{O}})\). This marginal can be computed by adding binary indicator functions to the active set given as input to the elimination algorithm. In particular, for each \(s \in \mathcal{O}\), let \(\mathbb{I}\,_{s;\bar{x}_{s}}(x_{s})\) be a zero-one indicator for the event \(\{x_{s} =\bar{ x}_{s}\}\). We then provide the elimination algorithm with the initial active set \(\mathcal{A} =\{\psi _{C},C \in \mathfrak{C}\} \cup \{ \mathbb{I}\,_{s;\bar{x}_{s}},s \in \mathcal{O}\}\), so that it computes marginals for the modified distribution

$$\displaystyle\begin{array}{rcl} q(x_{1},\ldots,x_{N})& \propto \prod _{C\in \mathfrak{C}}\psi _{C}(x_{C})\;\prod _{s\in \mathcal{O}}\mathbb{I}\,_{s;\bar{x}_{s}}(x_{s}),& {}\\ \end{array}$$

in which q(x) = 0 for all configurations \(x_{\mathcal{O}}\neq \bar{x}_{\mathcal{O}}\). The output of the elimination algorithm can also be used to compute the likelihood \(p(\bar{x}_{\mathcal{O}})\) of the observed data.
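As a minimal illustration of the indicator trick, consider a toy model of our own construction (two binary variables, one observed), with the modified distribution marginalized here by brute force rather than by elimination:

```python
from itertools import product

def indicator(s, value, m):
    """Zero-one factor I_{s; xbar_s}(x_s) pinning variable s to its observed value."""
    return ((s,), {(v,): 1.0 if v == value else 0.0 for v in range(m)})

# Toy model: two binary variables, an attractive pairwise factor, x_1 observed as 1.
factors = [((0, 1), {(0, 0): 2.0, (1, 1): 2.0, (0, 1): 1.0, (1, 0): 1.0}),
           indicator(1, 1, m=2)]

# Summing the modified distribution q over x_1 yields p(x_0, xbar_1) up to
# normalization, hence the conditional p(x_0 | x_1 = 1) after normalizing.
q = [0.0, 0.0]
for x in product(range(2), repeat=2):
    w = 1.0
    for scope, table in factors:
        w *= table[tuple(x[t] for t in scope)]
    q[x[0]] += w
Z = sum(q)
print([v / Z for v in q])  # p(x_0 | x_1 = 1), i.e. [1/3, 2/3]
```

Note that the unnormalized mass \(Z\) computed here is proportional to the likelihood of the observed data, which is how the elimination output yields \(p(\bar{x}_{\mathcal{O}})\).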

3.1.1 Graph-Theoretic Versus Analytical Elimination

Up to now, we have described the elimination algorithm in an analytical form. It can also be viewed from a graph-theoretic perspective, namely as an algorithm that eliminates vertices, but adds edges as it does so. In particular, during the elimination step, the graph-theoretic version removes the active vertex \(s\) from the current elimination order (as well as all edges involving \(s\)), but then adds an edge for each pair of vertices \((t,u)\) that were connected to \(s\) in the current elimination graph. These adjacent vertices define the neighborhood set \(\mathcal{N}(s)\) used in Step 2(c) of the algorithm. This dual perspective—the analytical and graph-theoretic—is best illustrated by considering another concrete example.

Example 10 (Graph-Theoretic Versus Analytical Elimination)

Consider the 3 × 3 grid-structured graph shown in panel (a) of Fig. 5. Given a probability distribution that is Markov with respect to this grid, suppose that we want to compute the marginal distribution \(p(x_{5})\), so that the target vertex is \(T = 5\). In order to do so, we initialize the elimination algorithm using the ordering \(\mathcal{I} =\{ 5,2,8,6,4,1,3,7,9\}\). Panel (b) shows the graph obtained after eliminating vertex 9, now shown in dotted lines to indicate that it has been removed. At the graph-theoretic level, we have removed edges (8, 9) and (6, 9), and added edge (6, 8). At the analytical level, we have performed the summation over \(x_{9}\) to compute

$$\displaystyle\begin{array}{rcl} \tilde{\phi }_{68}(x_{6},x_{8})& =\sum _{x_{9}}\psi _{69}(x_{6},x_{9})\psi _{89}(x_{8},x_{9}),& {}\\ \end{array}$$

and added it to the active list of compatibility functions. Notice how the new function \(\tilde{\phi }_{68}\) depends only on the variables \((x_{6},x_{8})\), which are associated with the edge (6, 8) added at the graph-theoretic level. Panel (c) shows the result after having eliminated vertices {1, 3, 7, 9}, in reverse order. Removal of these three vertices (in addition to 9) has a symmetric effect on each corner of the graph. In panel (d), we see the effect of eliminating vertex 4; in this case, we remove edges (4, 5), (4, 2), and (4, 8), and then add edge (2, 8). At the analytical stage, this step introduces a new compatibility function of the form \(\tilde{\phi }_{258}(x_{2},x_{5},x_{8})\). Panel (e) shows the result of removing vertex 6; no additional edges are added at this stage. Finally, panel (f) shows the reconstituted graph \(\tilde{\mathcal{G}}\), meaning the original graph plus all the edges that were added during the elimination process. Note that the added edges have increased the size of the maximal clique(s) to four; for instance, the set {2, 5, 6, 8} is a maximal clique in the reconstituted graph. This reconstituted graph plays an important role in our discussion of junction tree theory to follow. ♣

Fig. 5

(a) A 3 × 3 grid graph in the plane. (b) Result of eliminating vertex 9. Dotted vertices/edges indicate quantities that have been removed during elimination. (c) Remaining graph after having eliminated vertices {1, 3, 7, 9}. (d) Resulting graph after eliminating vertex 4. (e) Elimination of vertex 6. (f) Final reconstituted graph: original graph augmented with all edges added during the elimination process

3.1.2 Complexity of Elimination

What is the computational complexity of the elimination algorithm? Let us do some rough calculations in the discrete case, when \(\mathcal{X}_{s} =\{ 0,1,\ldots,m - 1\}\) for all vertices \(s \in \mathcal{V}\). Clearly, it involves at least a linear cost in the number of vertices \(N\). The other component is the cost of computing the partial sums \(\tilde{\phi }\) at each step. Supposing that we are eliminating \(s\), let \(\mathcal{N}(s)\) be the current set of its neighbors in the elimination graph. For instance, in Fig. 5, when we eliminate vertex \(s = 4\), its current neighborhood set is \(\mathcal{N}(4) =\{ 2,5,8\}\), and the function \(\tilde{\phi }_{258}\) at this stage depends on \((x_{2},x_{5},x_{8})\). Computing \(\tilde{\phi }_{258}\) incurs a complexity of \(\mathcal{O}(m^{4})\), since we need to perform a summation (over \(x_{4}\)) involving \(m\) terms for each of the \(m^{3}\) possible instantiations of \((x_{2},x_{5},x_{8})\).

In general, then, the worst-case cost over all elimination steps will scale as \(\mathcal{O}(m^{\vert \mathcal{N}(s)\vert +1})\), where \(\vert \mathcal{N}(s)\vert \) is the neighborhood size in the elimination graph when eliminating \(s\). It can be seen that the worst-case neighborhood size (plus one) is equivalent to the size of the largest clique in the reconstituted graph. For instance, in Fig. 5f, the largest cliques in the reconstituted graph have size 4, consistent with the \(\mathcal{O}(m^{4})\) calculation from above. Putting together the pieces, we obtain a conservative upper bound on the cost of the elimination algorithm—namely, \(\mathcal{O}(m^{\,c}N)\), where c is the size of the largest clique in the reconstituted graph.

A related point is that for any given target vertex T, there are many possible instantiations of the elimination algorithm that can be used to compute the marginal p(x T ). Remember that our only requirement is that \(T\) appear first in the ordering, so that there are actually \((N - 1)!\) possible orderings of the remaining vertices. Which ordering should be chosen? Based on our calculations above, in order to minimize the computational cost, one desirable goal would be to minimize the size of the largest clique in the reconstituted graph. In general, finding an optimal elimination ordering of this type is computationally challenging; however, there exist a variety of heuristics for choosing good orderings, as we discuss in Sect. 4.

3.2 Message-Passing Algorithms on Trees

We now turn to discussion of message-passing algorithms for models based on graphs without cycles, also known as trees. So as to highlight the essential ideas in the derivation, we begin by deriving message-passing algorithms for trees with only pairwise interactions, in which case the distribution has form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& = \frac{1} {Z}\prod _{s\in \mathcal{V}}\psi _{s}(x_{s})\;\prod _{(s,t)\in \mathcal{E}}\psi _{st}(x_{s},x_{t}),&{}\end{array}$$
(19)

where \(\mathcal{T} = (\mathcal{V},\mathcal{E})\) is a given tree, and \(\{\psi _{s},s \in \mathcal{V}\}\) and \(\{\psi _{st},(s,t) \in \mathcal{E}\}\) are compatibility functions. For a tree, the elimination algorithm takes a very simple form—assuming that the appropriate ordering is followed. Recall that a vertex in a graph is called a leaf if it has degree one.

The key fact is that for any tree, it is always possible to choose an elimination ordering \(\mathcal{I}\) such that:

  1. (a)

    The last vertex in \(\mathcal{I}\) (first to be eliminated) is a leaf; and

  2. (b)

    For each successive stage of the elimination algorithm, the last vertex in \(\mathcal{I}\) remains a leaf of the reduced graph.

This claim follows because any tree always has at least one leaf vertex [17], so that the algorithm can be initialized in Step 1. Moreover, in removing a leaf vertex, the elimination algorithm introduces no additional edges to the graph, so that it remains a tree at each stage. Thus, we may apply the argument recursively, again choosing a leaf vertex to continue the process. We refer to such an ordering as a leaf-exposing ordering.

In computational terms, when eliminating leaf vertex \(s\) from the graph, the algorithm marginalizes over \(x_{s}\), and then passes the result to its unique parent \(\pi (s)\). The result \(\tilde{\phi }_{\pi (s)}\) of this intermediate computation is a function of \(x_{\pi (s)}\); following standard notation used in presenting the sum-product algorithm, we can represent it by a “message” of the form \(M_{s\rightarrow \pi (s)}\), where the subscript reflects the direction in which the message is passed. Upon termination, the elimination algorithm will return the marginal distribution \(p(x_{T})\) at the target vertex \(T \in \mathcal{V}\); more specifically, this marginal is specified in terms of the local compatibility function \(\psi _{T}\) and the incoming messages as follows:

$$\displaystyle\begin{array}{rcl} p(x_{T})& \propto \psi _{T}(x_{T})\prod _{s\in \mathcal{N}(T)}M_{s\rightarrow T}(x_{T}).& {}\\ \end{array}$$

Assuming that each variable takes at most \(m\) values (i.e., \(\vert \mathcal{X}_{s}\vert \leq m\) for all \(s \in \mathcal{V}\)), the overall complexity of running the elimination algorithm for a given root vertex scales as \(\mathcal{O}(m^{2}N)\), since computing each message amounts to summing \(m\) numbers a total of \(m\) times, and there are \(\mathcal{O}(N)\) rounds in the elimination order. Alternatively, since the leaf-exposing ordering adds no new edges to the tree, the reconstituted graph is equivalent to the original tree. The maximal cliques in any tree are the edges, hence of size two, so that the complexity estimate \(\mathcal{O}(m^{2}N)\) follows from our discussion of the elimination algorithm.

3.2.1 Sum-Product Algorithm

In principle, by running the tree-based elimination algorithm \(N\) times—once for each vertex acting as the target in the elimination ordering—we could compute all singleton marginals in time \(\mathcal{O}(m^{2}N^{2})\). However, as the attentive reader might suspect, such an approach is wasteful, since it neglects to consider that many of the intermediate operations in the elimination algorithm would be shared between different orderings. Herein lies the cleverness of the sum-product algorithm: it re-uses these intermediate results in the appropriate way, and thereby reduces the overall complexity of computing all singleton marginals to \(\mathcal{O}(m^{2}N)\). In fact, as an added bonus, without any added complexity, it also computes all pairwise marginal distributions over variables \((x_{s},x_{t})\) such that \((s,t) \in \mathcal{E}\).

For each edge \((s,t) \in \mathcal{E}\), the sum-product algorithm maintains two messages—namely, \(M_{s\rightarrow t}\) and \(M_{t\rightarrow s}\)—corresponding to the two possible directions of the edge. With this notation, the sum-product message-passing algorithm for discrete random variables takes the following form:

Sum-Product Message-Passing Algorithm (Pairwise Tree):

  1. 1.

    At iteration \(k = 0\):

    For all \((s,t) \in \mathcal{E}\), initialize messages

    $$\displaystyle\begin{array}{rcl} M_{t\rightarrow s}^{0}(x_{ s}) = 1\quad \mbox{ for all $x_{s} \in \mathcal{X}_{s}$}\quad \mbox{ and}\quad M_{s\rightarrow t}^{0}(x_{ t}) = 1\quad \mbox{ for all $x_{t} \in \mathcal{X}_{t}$.}& & {}\\ \end{array}$$
  2. 2.

    For iterations \(k = 1,2,\ldots\):

    1. (i)

      For each \((t,s) \in \mathcal{E}\), update messages:

$$\displaystyle\begin{array}{rcl} M_{t\rightarrow s}^{k}(x_{ s})& \leftarrow \alpha \;\sum _{x'_{t}}\biggl \{\psi _{st}(x_{s},x'_{t})\,\psi _{t}(x'_{t})\;\prod _{u\in \mathcal{N}(t)\setminus \{s\}}M_{u\rightarrow t}^{k-1}(x'_{ t})\biggr \},& {}\end{array}$$
      (20)

      where \(\alpha > 0\) is chosen such that \(\sum _{x_{s}}M_{t\rightarrow s}^{k}(x_{s}) = 1\).

    2. (ii)

      Upon convergence, compute marginal distributions:

      $$\displaystyle\begin{array}{rcl} p(x_{s})& =\;\alpha _{s}\;\psi _{s}(x_{s})\;\prod _{t\in \mathcal{N}(s)}\;\;M_{t\rightarrow s}(x_{s}),\quad \text{and}& {}\end{array}$$
      (21a)
      $$\displaystyle\begin{array}{rcl} p_{st}(x_{s},x_{t})& =\;\alpha _{st}\;\psi _{st}(x_{s},x_{t})\;\prod _{u\in \mathcal{N}(s)\setminus \{t\}}\;\;M_{u\rightarrow s}(x_{s})\;\prod _{u\in \mathcal{N}(t)\setminus \{s\}}\;M_{u\rightarrow t}(x_{t}),& {}\end{array}$$
      (21b)

      where \(\alpha _{s} > 0\) and \(\alpha _{st} > 0\) are normalization constants (to ensure that the marginals sum to one).

The initialization given in Step 1 is the standard “uniform” one; it can be seen that the final output of the algorithm will be the same for any initialization of the messages that has strictly positive components. As with our previous discussion of the elimination algorithm, we have stated a form of the message-passing updates suitable for discrete random variables, involving finite summations. For continuous random variables, we simply need to replace the summations in the message update (20) with integrals. For instance, in the case of a Gauss-Markov chain, we obtain the Kalman filter as a special case of the sum-product updates.
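For concreteness, here is a compact sketch of the message update (20) and the marginal computation (21a) under the flooding schedule, for binary variables on a small tree (the data structures and function names are our own):

```python
from itertools import product

def sum_product(nodes, tree_edges, psi_node, psi_edge, m, iters):
    """Sum-product on a pairwise tree, flooding schedule (every message
    updated in every round).  `tree_edges` holds unordered pairs (s, t);
    psi_edge[(s, t)][(x_s, x_t)] is defined for (s, t) in `tree_edges`."""
    nbrs = {s: [] for s in nodes}
    for s, t in tree_edges:
        nbrs[s].append(t)
        nbrs[t].append(s)
    # Uniform initialization of all 2|E| messages, one per edge direction.
    msg = {(s, t): [1.0] * m for s, t in tree_edges}
    msg.update({(t, s): [1.0] * m for s, t in tree_edges})

    def psi(s, t, x_s, x_t):
        """Look up psi_st regardless of the stored edge orientation."""
        if (s, t) in psi_edge:
            return psi_edge[(s, t)][(x_s, x_t)]
        return psi_edge[(t, s)][(x_t, x_s)]

    for _ in range(iters):
        new = {}
        for (t, s) in msg:                     # update M_{t -> s} via Eq. (20)
            vals = []
            for x_s in range(m):
                total = 0.0
                for x_t in range(m):
                    incoming = 1.0
                    for u in nbrs[t]:
                        if u != s:
                            incoming *= msg[(u, t)][x_t]
                    total += psi(s, t, x_s, x_t) * psi_node[t][x_t] * incoming
                vals.append(total)
            z = sum(vals)                      # normalize the message to sum to one
            new[(t, s)] = [v / z for v in vals]
        msg = new
    # Singleton marginals via Eq. (21a).
    marginals = {}
    for s in nodes:
        p = [psi_node[s][x] for x in range(m)]
        for t in nbrs[s]:
            p = [p[x] * msg[(t, s)][x] for x in range(m)]
        z = sum(p)
        marginals[s] = [v / z for v in p]
    return marginals

# A star-shaped tree on four binary variables, with attractive edge factors.
nodes = [0, 1, 2, 3]
tree = [(0, 1), (0, 2), (0, 3)]
psi_node = {s: [1.0, 1.0] for s in nodes}
psi_edge = {e: {(0, 0): 2.0, (1, 1): 2.0, (0, 1): 1.0, (1, 0): 1.0} for e in tree}
marg = sum_product(nodes, tree, psi_node, psi_edge, m=2, iters=2)
print(marg[0])  # by symmetry, [0.5, 0.5]
```

The star has diameter 2, so two rounds of flooding already bring all messages to their fixed point, consistent with Proposition 2 below.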

There are various ways in which the order of message-passing can be scheduled. The simplest to describe, albeit not the most efficient, is the flooding schedule, in which every message is updated during every round of the algorithm. Since each edge is associated with two messages—one for each direction—this protocol involves updating \(2\vert \mathcal{E}\vert \) messages per iteration. The flooding schedule is especially well-suited to parallel implementations of the sum-product algorithm, such as on a digital chip or field programmable gate array, in which a single processing unit can be devoted to each vertex of the graph.

In terms of minimizing the total number of messages that are passed, a more efficient scheduling protocol is based on the following rule: any given vertex \(s\) updates the message \(M_{s\rightarrow t}\) to its neighbor \(t \in \mathcal{N}(s)\) only after it has received messages from all its other neighbors \(u \in \mathcal{N}(s)\setminus \{t\}\). In order to see that this protocol will both start and terminate, note that any tree has at least one leaf vertex. All leaf vertices will pass a message to their single neighbor at round 1. As in the elimination algorithm, we can then remove these vertices from further consideration, yielding a sub-tree of the original tree, which also has at least one leaf node. Again, we see that a recursive leaf-stripping argument will guarantee termination of the algorithm. Upon termination, a total of \(2\vert \mathcal{E}\vert \) messages have been passed over all iterations, as opposed to per iteration in the flooding schedule.
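The serial scheduling rule just described can be sketched as a small scheduler: a directed message \(M_{s\rightarrow t}\) becomes "ready" once all messages \(M_{u\rightarrow s}\), for \(u \in \mathcal{N}(s)\setminus \{t\}\), have been sent. The star-shaped tree below is an illustrative example, not one from the text.

```python
# Serial message schedule via leaf-stripping: vertex s sends M_{s->t}
# only after it has received messages from all neighbors other than t.
nbrs = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}   # a star with center 1

def serial_schedule(nbrs):
    pending = {(s, t) for s in nbrs for t in nbrs[s]}   # all directed messages
    sent, order = set(), []
    while pending:
        ready = [(s, t) for (s, t) in pending
                 if all((u, s) in sent for u in nbrs[s] if u != t)]
        for m in ready:           # leaves fire first, then interior vertices
            pending.discard(m)
            sent.add(m)
            order.append(m)
    return order

order = serial_schedule(nbrs)
# Exactly 2|E| messages are sent in total, each directed edge once.
assert len(order) == 2 * 3
assert order[0] in {(0, 1), (2, 1), (3, 1)}   # a leaf message goes first
```

In the first round only the three leaf messages qualify; in the second round the center vertex can send all of its outgoing messages, mirroring the leaf-stripping argument above.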

Regardless of the particular message-update schedule used, the following result states the convergence and correctness guarantees associated with the sum-product algorithm on a tree:

Proposition 2

For any tree \(\mathcal{T}\) with diameter \(d(\mathcal{T} )\) , the sum-product updates have a unique fixed point \(M^{{\ast}}\) . The algorithm converges to it after at most \(d(\mathcal{T} )\) iterations, and the marginals obtained by Eqs. (21a) and (21b) are exact.

Proof

We proceed via induction on the number of vertices. For \(N = 1\), the claim is trivial. Now assume that the claim holds for all trees with at most \(N - 1\) vertices, and let us show that it also holds for any tree with \(N\) vertices. It is an elementary fact of graph theory [17] that any tree has at least one leaf node. By re-indexing as necessary, we may assume that node \(N\) is a leaf, and its unique parent is node 1. Moreover, by appropriately choosing the leaf vertex, we may assume that the diameter of \(\mathcal{T}\) is achieved by a path ending at node \(N\). By definition of the sum-product updates, for all iterations \(k = 1,2,\ldots\), the message sent from \(N\) to 1 is given by

$$\displaystyle\begin{array}{rcl} M_{N\rightarrow 1}^{{\ast}}(x_{ 1})& =\,\alpha \;\sum _{x_{N}}\psi _{N}(x_{N})\,\psi _{N1}(x_{N},x_{1}).& {}\\ \end{array}$$

(It never changes since vertex \(N\) has only one neighbor.) Given this fixed message, we may consider the probability distribution defined on variables \((x_{1},\ldots,x_{N-1})\) given by

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N-1})& \;\propto \; M_{N\rightarrow 1}^{{\ast}}(x_{1})\;\big[\prod _{s=1}^{N-1}\psi _{s}(x_{s})\big]\;\prod _{(s,t)\in \mathcal{E}\setminus \{(1,N)\}}\psi _{st}(x_{s},x_{t}).& {}\\ \end{array}$$

Our notation is consistent, in that \(p(x_{1},\ldots,x_{N-1})\) is the marginal distribution obtained by summing out \(x_{N}\) from the original problem. It corresponds to an instance of our original problem on the new tree \(\mathcal{T}' = (\mathcal{V}',\mathcal{E}')\) with \(\mathcal{V}' =\{ 1,\ldots,N - 1\}\) and \(\mathcal{E}' = \mathcal{E}\setminus \{(1,N)\}\). Since the message \(M_{N\rightarrow 1}^{{\ast}}\) remains fixed for all iterations \(k \geq 1\), the iterates of the sum-product algorithm on \(\mathcal{T}'\) are indistinguishable from the iterates on \(\mathcal{T}\) (for all vertices \(s \in \mathcal{V}'\) and edges \((s,t) \in \mathcal{E}'\)).

By the induction hypothesis, the sum-product algorithm applied to \(\mathcal{T}'\) will converge after at most \(d(\mathcal{T}')\) iterations to a fixed point \(M^{{\ast}} =\{ M_{s\rightarrow t}^{{\ast}},M_{t\rightarrow s}^{{\ast}}\,\mid \,(s,t) \in \mathcal{E}'\}\), and this fixed point will yield the correct marginals at all vertices \(s \in \mathcal{V}'\) and edges \((s,t) \in \mathcal{E}'\). As previously discussed, the message from \(N\) to 1 remains fixed for all iterations after the first, and the message from 1 to \(N\) will be fixed once the iterations on the sub-tree \(\mathcal{T}'\) have converged. Taking into account the extra iteration for the message from vertex 1 to \(N\), we conclude that sum-product on \(\mathcal{T}\) converges in at most \(d(\mathcal{T} )\) iterations, as claimed.

It remains to show that the marginal at node \(N\) can be computed by forming the product \(p(x_{N}) \propto \psi _{N}(x_{N})\,M_{1\rightarrow N}^{{\ast}}(x_{N})\), where

$$\displaystyle\begin{array}{rcl} M_{1\rightarrow N}^{{\ast}}(x_{ N})&:=\alpha \sum _{x_{1}}\psi _{1}(x_{1})\;\psi _{1N}(x_{1},x_{N})\,\prod _{u\in \mathcal{N}(1)\setminus \{N\}}M_{u\rightarrow 1}^{{\ast}}(x_{ 1}).& {}\\ \end{array}$$

By elementary probability theory, we have \(p(x_{N}) =\sum _{x_{1}}p(x_{N}\,\mid x_{1})p(x_{1})\). Since \(N\) is a leaf node with unique parent 1, we have the conditional independence relation \(X_{N} \perp \!\!\!\perp X_{\mathcal{V}\setminus \{1\}}\mid X_{1}\), and hence, using the form of the factorization (19),

$$\displaystyle\begin{array}{rcl} p(x_{N}\,\mid \,x_{1})& = \frac{\psi _{N}(x_{N})\,\psi _{1N}(x_{1},x_{N})} {\sum _{x_{N}}\psi _{N}(x_{N})\,\psi _{1N}(x_{1},x_{N})}\; \propto \; \frac{\psi _{N}(x_{N})\,\psi _{1N}(x_{1},x_{N})} {M_{N\rightarrow 1}^{{\ast}}(x_{1})}.&{}\end{array}$$
(22)

By the induction hypothesis, the sum-product algorithm, when applied to the distribution (22), returns the marginal distribution

$$\displaystyle\begin{array}{rcl} p(x_{1})& \;\propto \;\big [M_{N\rightarrow 1}^{{\ast}}(x_{1})\psi _{1}(x_{1})\big]\;\prod _{t\in \mathcal{N}(1)\setminus \{N\}}M_{t\rightarrow 1}^{{\ast}}(x_{1}).& {}\\ \end{array}$$

Combining the pieces yields

$$\displaystyle\begin{array}{rcl} p(x_{N})& =& \sum _{x_{1}}p(x_{N}\,\mid x_{1})p(x_{1}) {}\\ & \propto & \psi _{N}(x_{N})\sum _{x_{1}} \frac{\psi _{1N}(x_{1},x_{N})} {M_{N\rightarrow 1}^{{\ast}}(x_{1})}\;\big[M_{N\rightarrow 1}^{{\ast}}(x_{ 1})\psi _{1}(x_{1})\big]\;\prod _{k\in \mathcal{N}(1)\setminus \{N\}}M_{k\rightarrow 1}^{{\ast}}(x_{ 1}) {}\\ & \propto & \psi _{N}(x_{N})\;M_{1\rightarrow N}^{{\ast}}(x_{ N}), {}\\ \end{array}$$

as required. A similar argument shows that Eq. (21b) yields the correct form of the pairwise marginal for the edge \((1,N) \in \mathcal{E}\); we leave the details as an exercise for the reader. □ 

3.2.2 Sum-Product on General Factor Trees

In the preceding section, we derived the sum-product algorithm for a tree with pairwise interactions; in this section, we discuss its extension to arbitrary tree-structured factor graphs. A tree factor graph is simply a factor graph without cycles; see Fig. 3b for an example. We consider a distribution with a factorization of the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& = \frac{1} {Z}\prod _{s\in \mathcal{V}}\psi _{s}(x_{s})\prod _{a\in \mathcal{F}}\psi _{a}(x_{a}),&{}\end{array}$$
(23)

where \(\mathcal{V}\) and \(\mathcal{F}\) are sets of vertices and factor nodes, respectively.

Before proceeding, it is worth noting that for the case of discrete random variables considered here, the resulting algorithm is not more general than sum-product for pairwise interactions. Indeed, any distribution over discrete random variables with a tree-structured factor graph—regardless of the order of interactions that it involves—can be converted to an equivalent tree with pairwise interactions. We refer the reader to Appendix E.3 in the monograph [64] for further details on this procedure. Nonetheless, it is often convenient to apply the non-pairwise form of the sum-product algorithm directly to a factor tree, as opposed to going through this conversion.

We use \(M_{s\rightarrow a}\) to denote the message from vertex \(s\) to the factor node \(a \in \mathcal{N}(s)\), and similarly \(M_{a\rightarrow s}\) to denote the message from factor node a to the vertex \(s \in \mathcal{N}(a)\). Both messages (in either direction) are functions of \(x_{s}\), where \(M_{a\rightarrow s}(x_{s})\) (respectively \(M_{s\rightarrow a}(x_{s})\)) denotes the value taken for a given \(x_{s} \in \mathcal{X}_{s}\). We let \(x_{a} =\{ x_{s},s \in \mathcal{N}(a)\}\) denote the sub-vector of random variables associated with factor node \(a \in \mathcal{F}\). With this notation, the sum-product updates for a general tree factor graph take the following form:

Sum-Product Updates for Tree Factor Graph:

$$\displaystyle\begin{array}{rcl} M_{s\rightarrow a}(x_{s})& \leftarrow \;\alpha \;\psi _{s}(x_{s})\;\prod _{b\in \mathcal{N}(s)\setminus \{a\}}M_{b\rightarrow s}(x_{s}),\quad \text{and}&{}\end{array}$$
(24a)
$$\displaystyle\begin{array}{rcl} M_{a\rightarrow s}(x_{s})& \leftarrow \;\alpha \sum _{x_{t},\;t\in \mathcal{N}(a)\setminus \{s\}}\,\Big[\psi _{a}(x_{a})\prod _{t\in \mathcal{N}(a)\setminus \{s\}}M_{t\rightarrow a}(x_{t})\Big].&{}\end{array}$$
(24b)

Here the quantity \(\alpha\) again represents a positive constant chosen to ensure that the messages sum to one. (As before, its value can differ from line to line.) Upon convergence, the marginal distributions over \(x_{s}\) and over the variables \(x_{a}\) are given, respectively, by the quantities

$$\displaystyle\begin{array}{rcl} p(x_{s})& =\alpha \:\psi _{s}(x_{s})\;\prod _{a\in \mathcal{N}(s)}M_{a\rightarrow s}(x_{s}),\quad \text{and}&{}\end{array}$$
(25a)
$$\displaystyle\begin{array}{rcl} p(x_{a})& =\alpha \:\psi _{a}(x_{a})\;\prod _{t\in \mathcal{N}(a)}M_{t\rightarrow a}(x_{t}).&{}\end{array}$$
(25b)

We leave it as an exercise for the reader to verify that the message-passing updates (24) will converge after a finite number of iterations (related to the diameter of the factor tree), and when equation (25) is applied using the resulting fixed point \(M^{{\ast}}\) of the message updates, the correct marginal distributions are obtained.
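The factor-graph updates (24a)–(24b) and the marginal formula (25a) can be sketched on the smallest interesting factor tree: three binary variables attached to a single three-way factor. All potentials and names below are illustrative choices, not from the text.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
V = [0, 1, 2]
psi = {s: rng.uniform(0.5, 1.5, size=2) for s in V}   # singleton potentials
psi_a = rng.uniform(0.5, 1.5, size=(2, 2, 2))         # factor psi_a(x_0, x_1, x_2)

# Variable-to-factor messages (24a): each variable here has only the factor
# a as neighbor, so the product over other factors is empty.
M_to_a = {s: psi[s] / psi[s].sum() for s in V}

# Factor-to-variable messages (24b): sum out all variables except x_s.
M_from_a = {}
for s in V:
    others = [t for t in V if t != s]
    m = np.zeros(2)
    for xs in range(2):
        for xo in itertools.product(range(2), repeat=2):
            x = [0, 0, 0]
            x[s] = xs
            for t, xt in zip(others, xo):
                x[t] = xt
            m[xs] += psi_a[tuple(x)] * np.prod([M_to_a[t][x[t]] for t in others])
    M_from_a[s] = m / m.sum()

# Marginals via (25a), checked against brute-force enumeration.
joint = np.array([[[psi_a[x0, x1, x2] * psi[0][x0] * psi[1][x1] * psi[2][x2]
                    for x2 in range(2)] for x1 in range(2)] for x0 in range(2)])
joint /= joint.sum()
for s in V:
    marg = psi[s] * M_from_a[s]
    marg /= marg.sum()
    assert np.allclose(marg, joint.sum(axis=tuple(t for t in V if t != s)))
```

Because the factor tree here is a star, one pass of (24a) followed by one pass of (24b) already reaches the fixed point.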

3.2.3 Max-Product Algorithm

Thus far, we have discussed the sum-product algorithm that can be used to solve the problems of marginalization and likelihood computation. We now turn to the max-product algorithm, which is designed to solve the problem of mode computation. So as to clarify the essential ideas, we again present the algorithm in terms of a factor tree involving only pairwise interactions, with the understanding that the extension to a general factor tree is straightforward.

The problem of mode computation amounts to finding a vector

$$\displaystyle\begin{array}{rcl} x^{{\ast}}& \in \arg \max _{ x\in \mathcal{X}_{[N]}}p(x_{1},\ldots,x_{N}),& {}\\ \end{array}$$

where we recall our shorthand notation \(\mathcal{X}_{[N]} = \mathcal{X}_{1} \times \mathcal{X}_{2} \times \cdots \times \mathcal{X}_{N}\). As an intermediate step, let us first consider the problem of computing the so-called max-marginals associated with any distribution defined by a tree \(\mathcal{T} = (\mathcal{V},\mathcal{E})\). In particular, for each vertex \(s \in \mathcal{V}\), we define the singleton max-marginal

$$\displaystyle\begin{array}{rcl} \nu _{s}(x_{s})&:=\max _{\{x'\in \mathcal{X}_{[N]}\,\mid \,x'_{s}=x_{s}\}}p(x'_{1},\ldots,x'_{N}).&{}\end{array}$$
(26)

For each edge \((s,t) \in \mathcal{E}\), the pairwise max-marginal is defined in an analogous manner:

$$\displaystyle\begin{array}{rcl} \nu _{st}(x_{s},x_{t})&:=\max _{\{x'\in \mathcal{X}_{[N]}\,\mid \,(x'_{s},x'_{t})=(x_{s},x_{t})\}}p(x'_{1},\ldots,x'_{N}).&{}\end{array}$$
(27)

Note that the singleton and pairwise max-marginals are the natural analogs of the usual marginal distributions, in which the summation operation has been replaced by the maximization operation.

Before describing how these max-marginals can be computed by the max-product algorithm, first let us consider how max-marginals are relevant for the problem of mode computation. This connection is especially simple if the singleton max-marginals satisfy the unique maximizer condition—namely, if for all \(s \in \mathcal{V}\), the maximum over \(\nu _{s}(x_{s})\) is achieved at a single element, meaning that

$$\displaystyle\begin{array}{rcl} \arg \max _{x_{s}\in \mathcal{X}_{s}}\nu _{s}(x_{s})& = x_{s}^{{\ast}}\quad \mbox{ for all $s \in \mathcal{V}$.}&{}\end{array}$$
(28)

In this case, it is straightforward to compute the mode, as summarized in the following:

Lemma 3

Consider a distribution p whose singleton max-marginals satisfy the unique maximizer condition (28). Then the vector \(x^{{\ast}} = (x_{1}^{{\ast}},\ldots,x_{N}^{{\ast}}) \in \mathcal{X}_{[N]}\) is the unique maximizer of \(p(x_{1},\ldots,x_{N})\) .

Proof

The proof of this claim is straightforward. Note that for any vertex \(s \in \mathcal{V}\), by the unique maximizer condition and the definition of the max-marginals, we have

$$\displaystyle{ \nu _{s}(x_{s}^{{\ast}})\; =\;\max _{ x_{s}\in \mathcal{X}_{s}}\nu _{s}(x_{s})\; =\;\max _{x\in \mathcal{X}_{[N]}}p(x_{1},\ldots,x_{N}). }$$

Now for any other configuration \(\tilde{x}\neq x^{{\ast}}\), there must be some index \(t\) such that \(x_{t}^{{\ast}}\neq \tilde{x}_{t}\). For this index, we have

$$\displaystyle{ \max _{x\in \mathcal{X}_{[N]}}p(x_{1},\ldots,x_{N})\; =\;\nu _{t}(x_{t}^{{\ast}})\;\stackrel{(i)}{>}\;\nu _{ t}(\tilde{x}_{t})\;\stackrel{(ii)}{\geq }\;p(\tilde{x}), }$$

where inequality (i) follows from the unique maximizer condition, and inequality (ii) follows by definition of the max-marginal. We have thus established that \(\mbox{ $p(\tilde{x}) <\max \limits _{x\in \mathcal{X}_{[N]}}p(x)$}\) for all \(\tilde{x}\neq x^{{\ast}}\), so that \(x^{{\ast}}\) uniquely achieves the maximum. □ 

If the unique maximizer condition fails to hold, then the distribution \(p\) must have more than one mode, and it requires a bit more effort to determine one. Most importantly, it is no longer sufficient to extract some \(x_{s}^{{\ast}}\in \arg \max _{x_{s}\in \mathcal{X}_{s}}\nu _{s}(x_{s})\), as illustrated by the following toy example.

Example 11 (Failure of Unique Maximizer)

Consider the distribution over a pair of binary variables given by

$$\displaystyle\begin{array}{rcl} p(x_{1},x_{2})& = \left \{\begin{array}{@{}l@{\quad }l@{}} 0.45\quad &\mbox{ if $x_{1}\neq x_{2}$, and}\\ 0.05\quad &\text{otherwise.} \end{array} \right.&{}\end{array}$$
(29)

In this case, we have \(\nu _{1}(x_{1}) =\nu _{2}(x_{2}) = 0.45\) for all \((x_{1},x_{2}) \in \{ 0,1\}^{2}\), but only configurations with \(x_{1}\neq x_{2}\) are globally optimal. As a consequence, a naive procedure that looks only at the singleton max-marginals has no way of determining such a globally optimal configuration. ♣
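Example 11 is easy to verify numerically; the short check below computes both singleton max-marginals by brute force and confirms that they are flat, so they cannot distinguish the two optimal configurations.

```python
import numpy as np

# p(x1, x2) from Eq. (29), indexed as p[x1, x2].
p = np.array([[0.05, 0.45],
              [0.45, 0.05]])

nu1 = p.max(axis=1)   # singleton max-marginal nu_1(x_1)
nu2 = p.max(axis=0)   # singleton max-marginal nu_2(x_2)
assert np.allclose(nu1, [0.45, 0.45]) and np.allclose(nu2, [0.45, 0.45])

# Picking maximizers independently can produce (x1, x2) = (0, 0), which
# has probability only 0.05: not a mode.  Only x1 != x2 is optimal.
assert p[0, 0] < p.max()
```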

Consequently, when the unique maximizer condition no longer holds, it is necessary to also take into account the information provided by the pairwise max-marginals (27). In order to do so, we consider a back-tracking procedure that samples some configuration \(x^{{\ast}}\in \arg \max _{x\in \mathcal{X}_{[N]}}p(x)\). Let us assume that the tree is rooted at node 1, and that \(\mathcal{V} =\{ 1,2,\ldots,N\}\) is a topological ordering (i.e., such that \(\pi (s) < s\) for all \(s \in \{ 2,\ldots,N\}\), where \(\pi (s)\) denotes the parent of vertex \(s\)). Using this topological ordering, we may generate an optimal configuration \(x^{{\ast}}\) as follows:

Back-Tracking for Choosing a Mode \(x^{{\ast}}\)

  1. 1.

    Initialize the procedure at node 1 by choosing any \(x_{1}^{{\ast}}\in \arg \max \limits _{x_{1}\in \mathcal{X}_{1}}\nu _{1}(x_{1})\).

  2. 2.

    For each \(s = 2,\ldots,N\), choose a configuration \(x_{s}^{{\ast}}\) at vertex \(s\) such that

    $$\displaystyle\begin{array}{rcl} x_{s}^{{\ast}}\in \arg \max \limits _{ x_{s}\in \mathcal{X}_{s}}\nu _{s,\pi (s)}(x_{s},x_{\pi (s)}^{{\ast}}).& & {}\\ \end{array}$$

Note that this is a recursive procedure, in that the choice of the configuration \(x_{s}^{{\ast}}\) depends on the previous choice of \(x_{\pi (s)}^{{\ast}}\). Our choice of topological ordering ensures that \(x_{\pi (s)}^{{\ast}}\) is always fixed before we reach vertex \(s\). We establish that this back-tracking procedure is guaranteed to output a vector \(x^{{\ast}}\in \arg \max _{x\in \mathcal{X}_{[N]}}p(x)\) in Proposition 3 below.

Having established the utility of the max-marginals for computing modes, let us now turn to an efficient algorithm for computing them. As alluded to earlier, the max-marginals are the analogs of the ordinary marginals when summation is replaced with maximization. Accordingly, it is natural to consider the analog of the sum-product updates with the same replacement; doing so leads to the following max-product updates:

$$\displaystyle\begin{array}{rcl} M_{t\rightarrow s}(x_{s})& \leftarrow \alpha \;\max _{x_{t}}\biggr \{\psi _{st}(x_{s},x_{t})\,\psi _{t}(x_{t})\;\prod _{u\in \mathcal{N}(t)\setminus \{s\}}M_{u\rightarrow t}(x_{t})\biggr \},&{}\end{array}$$
(30)

where \(\alpha > 0\) is chosen such that \(\max _{x_{s}}M_{t\rightarrow s}(x_{s}) = 1\). When the algorithm converges to a fixed point \(M^{{\ast}}\), we can compute the max-marginals via Eqs. (21a) and (21b).
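To make the whole pipeline concrete, the sketch below runs the max-product update (30) on a three-vertex chain, forms the max-marginals from the message fixed point (with the node potentials \(\psi_s, \psi_t\) included in the pairwise formula), and then applies the back-tracking procedure. The chain, the parent map, and the random potentials are illustrative choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
nbrs = {0: [1], 1: [0, 2], 2: [1]}      # chain 0 - 1 - 2, binary variables
pi = {1: 0, 2: 1}                       # parent of each non-root vertex
psi_s = {s: rng.uniform(0.5, 1.5, size=2) for s in nbrs}
psi_st = {(0, 1): rng.uniform(0.5, 1.5, size=(2, 2)),
          (1, 2): rng.uniform(0.5, 1.5, size=(2, 2))}

def pot(s, t, xs, xt):
    return psi_st[(s, t)][xs, xt] if (s, t) in psi_st else psi_st[(t, s)][xt, xs]

# Max-product messages (30), flooding schedule, max-normalized.
M = {(t, s): np.ones(2) for t in nbrs for s in nbrs[t]}
for _ in range(3):
    M = {(t, s): np.array([
            max(pot(s, t, xs, xt) * psi_s[t][xt]
                * np.prod([M[(u, t)][xt] for u in nbrs[t] if u != s])
                for xt in range(2))
            for xs in range(2)])
         for (t, s) in M}
    M = {k: v / v.max() for k, v in M.items()}

# Singleton and pairwise max-marginals (up to positive scaling).
nu = {s: psi_s[s] * np.prod([M[(t, s)] for t in nbrs[s]], axis=0) for s in nbrs}
def nu_pair(s, t):
    out = np.zeros((2, 2))
    for xs in range(2):
        for xt in range(2):
            out[xs, xt] = (pot(s, t, xs, xt) * psi_s[s][xs] * psi_s[t][xt]
                * np.prod([M[(u, s)][xs] for u in nbrs[s] if u != t])
                * np.prod([M[(u, t)][xt] for u in nbrs[t] if u != s]))
    return out

# Back-tracking: root first, then children in topological order.
x_star = {0: int(np.argmax(nu[0]))}
for s in [1, 2]:
    x_star[s] = int(np.argmax(nu_pair(s, pi[s])[:, x_star[pi[s]]]))

# Compare against brute-force mode computation.
def p(x):
    return (np.prod([psi_s[s][x[s]] for s in nbrs])
            * pot(0, 1, x[0], x[1]) * pot(1, 2, x[1], x[2]))
best = max(itertools.product(range(2), repeat=3), key=p)
assert tuple(x_star[s] for s in range(3)) == best
```

Since argmax is invariant to positive rescaling, the max-normalization of the messages does not affect either the max-marginal maximizers or the back-tracking choices.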

Let us summarize the properties of the max-product algorithm on any undirected tree:

Proposition 3

For any tree \(\mathcal{T}\) with diameter \(d(\mathcal{T} )\) , the max-product updates converge to their unique fixed point \(M^{{\ast}}\) after at most \(d(\mathcal{T} )\) iterations. Moreover, Eqs.  (21a) and  (21b) can be used to compute the max-marginals, and the back-tracking procedure yields an element \(x^{{\ast}}\in \arg \max \limits _{x\in \mathcal{X}_{[N]}}p(x)\) .

Proof

The proof of convergence and correct computation of the max-marginals is formally identical to that of Proposition 2, so we leave the details to the reader as an exercise. Let us verify the remaining claim—namely, the optimality of any configuration \(x^{{\ast}}\in \mathcal{X}_{[N]}\) returned by the back-tracking procedure.

In order to do so, recall that we have assumed without loss of generality (re-indexing as necessary) that \(\mathcal{V} =\{ 1,2,\ldots,N\}\) is a topological ordering, with vertex 1 as the root. Under this ordering, let us use the max-marginals to define the following cost function:

$$\displaystyle\begin{array}{rcl} J(x_{1},\ldots,x_{N};\nu )& =\nu _{1}(x_{1})\;\prod _{s=2}^{N}\frac{\nu _{s\,\pi (s)}(x_{s},x_{\pi (s)})} {\nu _{\pi (s)}(x_{\pi (s)})}.&{}\end{array}$$
(31)

In order for division to be defined in all cases, we take 0∕0 = 0.

We first claim that there exists a constant \(\alpha > 0\) such that \(J(x;\nu ) =\alpha \: p(x)\) for all \(x \in \mathcal{X}_{[N]}\), which implies that \(\arg \max _{x\in \mathcal{X}_{[N]}}J(x;\nu ) =\arg \max _{x\in \mathcal{X}_{[N]}}p(x)\). By virtue of this property, we say that \(\nu\) defines a reparameterization of the original distribution. (See the papers [65, 66] for further details on this property and its utility in analyzing the sum and max-product algorithms.)

To establish the reparameterization property, we first note that the cost function J can also be written in the symmetric form

$$\displaystyle\begin{array}{rcl} J(x;\nu )& =\prod _{ s=1}^{N}\nu _{s}(x_{s})\;\prod _{(s,t)\in \mathcal{E}} \frac{\nu _{st}(x_{s},x_{t})} {\nu _{s}(x_{s})\nu _{t}(x_{t})}.&{}\end{array}$$
(32)

We then recall that the max-marginals are specified by the message fixed point \(M^{{\ast}}\) via Eqs. (21a) and (21b). Substituting these relations into the symmetric form (32) of J yields

$$\displaystyle\begin{array}{rcl} J(x;\nu )& \propto & \;\Big[\prod _{s\in \mathcal{V}}\psi _{s}(x_{s})\prod _{u\in \mathcal{N}(s)}M_{u\rightarrow s}^{{\ast}}(x_{ s})\Big]\;\Big[\prod _{(s,t)\in \mathcal{E}} \frac{\psi _{st}(x_{s},x_{t})} {M_{s\rightarrow t}^{{\ast}}(x_{t})M_{t\rightarrow s}^{{\ast}}(x_{s})}\Big] {}\\ & =& \;\prod _{s\in \mathcal{V}}\psi _{s}(x_{s})\;\prod _{(s,t)\in \mathcal{E}}\psi _{st}(x_{s},x_{t}) {}\\ & \propto & \;p(x), {}\\ \end{array}$$

as claimed.

Consequently, it suffices to show that any \(x^{{\ast}}\in \mathcal{X}_{[N]}\) returned by the back-tracking procedure is an element of \(\arg \max _{x\in \mathcal{X}_{[N]}}J(x;\nu )\). We claim that for each vertex \(s \in \mathcal{V}\) with a parent \(\pi (s)\),

$$\displaystyle\begin{array}{rcl} \frac{\nu _{s\,\pi (s)}(x_{s}^{{\ast}},x_{\pi (s)}^{{\ast}})} {\nu _{\pi (s)}(x_{\pi (s)}^{{\ast}})} & \geq \frac{\nu _{s\,\pi (s)}(x_{s},x_{\pi (s)})} {\nu _{\pi (s)}(x_{\pi (s)})} \quad \mbox{ for all $(x_{s},x_{\pi (s)}) \in \mathcal{X}_{s} \times \mathcal{X}_{\pi (s)}$.}&{}\end{array}$$
(33)

By definition of the max-marginals and our choice of \(x^{{\ast}}\), the left-hand side is equal to one. On the other hand, the definition of the max-marginals implies that \(\nu _{\pi (s)}(x_{\pi (s)}) \geq \nu _{s\,\pi (s)}(x_{s},x_{\pi (s)})\), showing that the right-hand side is less than or equal to one.

By construction, any \(x^{{\ast}}\) returned by back-tracking satisfies \(\nu _{1}(x_{1}^{{\ast}}) \geq \nu _{1}(x_{1})\) for all \(x_{1} \in \mathcal{X}_{1}\). This fact, combined with the pairwise optimality (33) and the definition (31), implies that \(J(x^{{\ast}};\nu ) \geq J(x;\nu )\) for all \(x \in \mathcal{X}_{[N]}\), showing that \(x^{{\ast}}\) is an element of the set \(\arg \max _{x\in \mathcal{X}_{[N]}}J(x;\nu )\), as claimed. □ 

In summary, the sum-product and max-product algorithms are closely related, both being instances of the basic principle of “divide-and-conquer”. A careful examination shows that both algorithms rely on repeated exploitation of a common algebraic property, namely the distributive law. More specifically, for arbitrary real numbers \(a,b,c \in \mathbb{R}\), we have

$$\displaystyle{ a\, \cdot \, (b + c) = a \cdot b + a \cdot c,\quad \text{and}\quad a\, \cdot \,\max \{ b,c\} =\max \{ a \cdot b,\;a \cdot c\}. }$$

The sum-product (respectively max-product) algorithm derives its power by exploiting this distributivity to re-arrange the order of summation and multiplication (respectively maximization and multiplication) so as to minimize the number of steps required. Based on this perspective, it can be shown that similar updates apply to any pair of operations \((\oplus,\otimes )\) that satisfy the distributive law \(a \otimes (b \oplus c) = (a \otimes b)\; \oplus \;(a \otimes c)\), and for which \(\otimes \) is a commutative operation (meaning \(a \otimes b = b \otimes a\)). Here the elements a, b, c need no longer be real numbers, but can be more exotic objects; for instance, they could be elements of the ring of polynomials, where \((\otimes,\oplus )\) correspond to multiplication and addition of polynomials. We refer the interested reader to the papers [1, 28, 55, 62] for more details on such generalizations of the sum-product and max-product algorithms.
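The semiring viewpoint can be made explicit with a single generic "message" routine that is parameterized by the pair \((\oplus, \otimes)\). The toy two-variable model and the numbers below are illustrative; the same routine recovers sum-product with \((+, \times)\), max-product with \((\max, \times)\), and the min-sum ("tropical") variant with \((\min, +)\).

```python
import numpy as np

psi1 = np.array([0.6, 0.4])
psi2 = np.array([0.7, 0.3])
psi12 = np.array([[0.9, 0.1], [0.2, 0.8]])

def message(oplus, otimes):
    """Generic message from vertex 2 to vertex 1: oplus over x2 of otimes-products."""
    m = []
    for x1 in range(2):
        acc = None
        for x2 in range(2):
            term = otimes(psi12[x1, x2], psi2[x2])
            acc = term if acc is None else oplus(acc, term)
        m.append(acc)
    return m

sum_prod = message(lambda a, b: a + b, lambda a, b: a * b)
max_prod = message(max, lambda a, b: a * b)
min_sum = message(min, lambda a, b: a + b)   # as used on negative log-potentials

assert np.allclose(sum_prod, psi12 @ psi2)                # ordinary marginalization
assert np.allclose(max_prod, (psi12 * psi2).max(axis=1))  # mode computation
assert np.allclose(min_sum, (psi12 + psi2).min(axis=1))   # tropical semiring
```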

4 Junction Tree Framework

Thus far, we have derived the sum-product and max-product algorithms, which are exact algorithms for tree-structured graphs. Consequently, given a graph with cycles, it is natural to think about some type of graph transformation—such as grouping its vertices into clusters—so as to form a tree to which these fast and exact algorithms can be applied. It turns out that this “clustering”, if not done carefully, can lead to incorrect answers. Fortunately, there is a theory that formalizes and guarantees correctness of such a clustering procedure, which is known as the junction tree framework.

4.1 Clique Trees and Running Intersection

In order to motivate the development to follow, let us begin with a simple (but cautionary) example. Consider the graph with vertex set \(\mathcal{V} =\{ 1,2,3,4,5\}\) shown in Fig. 6a, and note that it has four maximal cliques—namely {2, 3, 4}, {1, 2}, {1, 5} and {4, 5}. The so-called clique graph associated with this graph has four vertices, one for each of these maximal cliques, as illustrated in panel (b). The vertices corresponding to cliques \(C_{1}\) and \(C_{2}\) are joined by an edge in the clique graph if and only if \(C_{1} \cap C_{2}\) is not empty. In this case, we label the edge joining \(C_{1}\) and \(C_{2}\) with the set \(S = C_{1} \cap C_{2}\), which is known as the separator set. The separator sets are illustrated in gray boxes in panel (b). Finally, one possible tree contained within this clique graph—known as a clique tree—is shown in panel (c).

Fig. 6
figure 6

(a) A graph with cycles on five vertices. (b) Associated clique graph in which circular vertices are maximal cliques of \(\mathcal{G}\); square gray boxes sitting on the edges represent separator sets, corresponding to the intersections between adjacent cliques. (c) One clique tree extracted from the clique graph. It can be verified that this clique tree fails to satisfy the running intersection property

Now suppose that we were given an MRF distribution defined by the graph in panel (a); by the Hammersley-Clifford theorem, it would have the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{5})& \propto \;\psi _{234}(x_{2},x_{3},x_{4})\psi _{12}(x_{1},x_{2})\psi _{15}(x_{1},x_{5})\psi _{45}(x_{4},x_{5}).&{}\end{array}$$
(34)

We can provide an alternative factorization of (essentially) the same model on the clique graph in panel (b) by making two changes: first, introducing an extra copy of each variable from the original graph that appears in multiple places on the clique graph, and second, introducing indicator functions on the edges to enforce equality between the different copies. In our particular example, we introduce extra variables \(x'_{s}\) for \(s \in \{ 1,2,4,5\}\), since these variables appear twice in the clique graph. We then form a distribution \(q(x,x') = q(x_{1},x_{2},x_{3},x_{4},x_{5},x'_{1},x'_{2},x'_{4},x'_{5})\) that factorizes as

$$\displaystyle\begin{array}{rcl} q(x,x')& \propto \psi _{234}(x_{2},x_{3},x_{4})\psi _{12}(x_{1},x'_{2})\psi _{15}(x'_{1},x_{5})\psi _{45}(x'_{4},x_{5})\;\times \;\prod _{s\in \mathcal{V}\setminus \{3\}}\mathbb{I}\,[x_{s} = x'_{s}],&{}\end{array}$$
(35)

where \(\mathbb{I}\,[x_{s} = x'_{s}]\) is a {0, 1}-valued indicator function for the event \(\{x_{s} = x'_{s}\}\). By construction, for any \(s \in \{ 1,2,3,4,5\}\), if we were to compute the distribution \(q(x_{s})\) in the expanded model (35), it would be equal to the marginal \(p(x_{s})\) from the original model (34). Although the expanded model (35) involves additional copies \(x'_{s}\), the indicator functions that have been added enforce the needed equalities to maintain model consistency.

However, the clique graph in panel (b) and the distribution (35) are still not trees, which leads us to the key question: when is it possible to extract a tree from the clique graph, and have a factorization over the resulting clique tree that remains consistent with the original model (34)? Panel (c) of Fig. 6 shows one clique tree, obtained by dropping the edge labeled with separator set {5} that connects the vertices labeled with cliques {4, 5} and {1, 5}. Accordingly, the resulting factorization \(r(\cdot )\) over the tree would be obtained by dropping the indicator function \(\mathbb{I}\,[x_{5} = x'_{5}]\), and would take the form

$$\displaystyle\begin{array}{rcl} r(x,x')& \propto \psi _{234}(x_{2},x_{3},x_{4})\psi _{12}(x_{1},x'_{2})\psi _{15}(x'_{1},x_{5})\psi _{45}(x'_{4},x_{5})\;\times \;\prod _{s\in \mathcal{V}\setminus \{3,5\}}\mathbb{I}\,[x_{s}\,=\,x'_{s}],&{}\end{array}$$
(36)

Of course, the advantage of this clique tree factorization is that we now have a distribution to which the sum-product algorithm could be applied, so as to compute the exact marginal distributions \(r(x_{i})\) for \(i = 1,2,\ldots,5\). However, the drawback is that these marginals will not be equal to the marginals \(p(x_{i})\) of the original distribution (34). This problem occurs because—in sharp contrast to the clique graph factorization (35)—the distribution (36) might assign non-zero probability to some configuration for which \(x_{5}\neq x'_{5}\). Of course, the clique tree shown in Fig. 6c is only one of the spanning trees contained within the clique graph in panel (b). However, the reader can verify that none of these clique trees has the desired property.

A bit more formally, the property that we require is that the clique tree factorization always have enough structure to enforce the equivalences \(x_{s} = x'_{s} = x''_{s} =\ldots\) for all copies of a given variable indexed by some \(s \in \mathcal{V}\). (Although variables appeared only twice in the example of Fig. 6, the clique graphs obtained from more complicated graphs could have any number of copies.) Note that there are copies \(x_{s}\) and \(x'_{s}\) of a given variable if and only if \(s\) appears in at least two distinct cliques, say \(C_{1}\) and \(C_{2}\). In any clique tree, there must exist a unique path joining these two clique vertices, and what we require is that the equivalence \(\{x_{s} = x'_{s}\}\) be propagated along this path. In graph-theoretic terms, the required property can be formalized as follows:

Definition 6

A clique tree has the running intersection property if for any two clique vertices \(C_{1}\) and \(C_{2}\), all vertices on the unique path joining them contain the intersection \(C_{1} \cap C_{2}\). A clique tree with this property is known as a junction tree.

To illustrate this definition, the clique tree in Fig. 6c fails to satisfy the running intersection property: vertex 5 belongs to both cliques {4, 5} and {1, 5}, but does not belong to the clique {2, 3, 4} along the path joining these two cliques in the clique tree. Let us now consider a modification of this example so as to illustrate a clique tree that does satisfy running intersection, and hence is a junction tree.
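Definition 6 translates directly into a small checker: for every pair of cliques, walk the unique tree path between them and verify that their intersection is contained in every clique along the way. The encodings of the clique trees of Figs. 6c and 7c below (cliques indexed 0, 1, …) are my own illustrative representation.

```python
from itertools import combinations

def path(tree, a, b, seen=None):
    """Unique path between vertices a and b of a tree (adjacency dict)."""
    if a == b:
        return [a]
    seen = seen or {a}
    for u in tree[a]:
        if u not in seen:
            sub = path(tree, u, b, seen | {u})
            if sub:
                return [a] + sub
    return None

def has_running_intersection(cliques, tree):
    for i, j in combinations(range(len(cliques)), 2):
        common = cliques[i] & cliques[j]
        if any(not common <= cliques[k] for k in path(tree, i, j)):
            return False
    return True

# Clique tree of Fig. 6c: {2,3,4} joined to {1,2} and {4,5}; {1,2} to {1,5}.
cliques6 = [{2, 3, 4}, {1, 2}, {1, 5}, {4, 5}]
tree_6c = {0: [1, 3], 1: [0, 2], 2: [1], 3: [0]}
assert not has_running_intersection(cliques6, tree_6c)   # vertex 5 breaks it

# Clique tree of Fig. 7c: the path {2,3,4} - {1,2,4} - {1,4,5}.
cliques7 = [{2, 3, 4}, {1, 2, 4}, {1, 4, 5}]
tree_7c = {0: [1], 1: [0, 2], 2: [1]}
assert has_running_intersection(cliques7, tree_7c)
```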

Figure 7a shows a modified version of the graph from Fig. 6a, obtained by adding the extra edge (1, 4). Due to this addition, the modified graph now contains three maximal cliques, each of size three. Panel (b) shows the associated clique graph; it contains three vertices, one for each of the three maximal cliques, and the gray boxes on the edges represent the separator sets. Panel (c) shows one clique tree extracted from the clique graph in panel (b). In contrast to the clique tree from Fig. 6c, this clique tree does satisfy the running intersection property, and hence is a junction tree. (For instance, vertex 4 belongs to both cliques {2, 3, 4} and {1, 4, 5}, and to every clique on the unique path joining these two cliques in the tree.)

Fig. 7
figure 7

(a) A modified version of the graph from Fig. 6a, containing the extra edge (1, 4). The modified graph has three maximal cliques, each of size three. (b) The associated clique graph has one node for each of three maximal cliques, with gray boxes representing the separator sets. (c) A clique tree extracted from the clique graph; it can be verified that this clique tree satisfies the running intersection property

4.2 Triangulation and Junction Trees

Thus far, we have studied in detail two particular graphs, one (Fig. 6a) for which it was impossible to obtain a junction tree, and a second (Fig. 7a) for which a junction tree could be found. In this section, we develop a principled basis on which to produce graphs for which the associated clique graph has a junction tree, based on the graph-theoretic notion of triangulation.

A cycle in a graph is a sequence of vertices \((s_{1},s_{2},\ldots,s_{\ell},s_{1})\) such that \((s_{\ell},s_{1}) \in \mathcal{E}\) and \((s_{i},s_{i+1}) \in \mathcal{E}\) for all \(i = 1,\ldots,\ell-1\). The cycle is chordless if there are no edges in the graph joining non-successive vertices in the cycle—that is, the graph does not contain any edges apart from those listed above that form the cycle. For example, the cycle (1, 2, 4, 5, 1) in the graph from Fig. 6a is chordless, whereas in contrast, the cycle (1, 2, 3, 4, 5, 1) contains the chord (2, 4).

Definition 7

A graph is triangulated if it contains no chordless cycles of length greater than three.

Thus, the graph in Fig. 6a is not triangulated (due to the chordless cycle (1, 2, 4, 5, 1)), whereas it can be verified that the graph in Fig. 7a is triangulated. We now state the fundamental connection between triangulation and junction trees:

Theorem 3

The clique graph associated with a graph \(\mathcal{G}\) has a junction tree if and only if \(\mathcal{G}\) is triangulated.

We provide a proof of this result as part of a more general set of graph-theoretic equivalences in the Appendix. The practical significance of Theorem 3 is that it enables us to construct a modified graph \(\tilde{\mathcal{G}}\)—by triangulating the original graph \(\mathcal{G}\)—such that the clique graph associated with \(\tilde{\mathcal{G}}\) is guaranteed to have at least one junction tree. There are a variety of algorithms for obtaining triangulated graphs. For instance, recall the vertex elimination procedure discussed in Sect. 3.1; it takes as input a graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\) and then processes the vertices in a pre-specified order so as to produce a new graph.

Lemma 4

The output \(\tilde{\mathcal{G}} = (\mathcal{V},\tilde{\mathcal{E}})\) of the vertex elimination algorithm is a triangulated graph.

We leave the proof of Lemma 4 as an exercise for the reader. This fact establishes that there is a connection between the elimination algorithm and the property of triangulation.

Example 12 (Triangulation and Junction Tree)

To illustrate the triangulation procedure, consider the 3 × 3 grid shown in Fig. 8a. If we run the vertex elimination algorithm using the ordering \(\mathcal{I} = (1,3,7,9,4,6,2,8)\), then we obtain the graph \(\tilde{\mathcal{G}}\) shown in panel (b). (Note that the graph would not be triangulated if the additional edge joining vertices 2 and 8 were not present. Without this edge, the 4-cycle (2, 4, 8, 6, 2) would lack a chord.) Note that \(\tilde{\mathcal{G}}\) has six maximal cliques: two 4-cliques in the middle of the graph, and four 3-cliques on the boundary. The clique graph associated with \(\tilde{\mathcal{G}}\) is illustrated in panel (c), and one clique tree is shown in panel (d). It can be verified that this clique tree satisfies the running intersection property, and so is a junction tree. ♣
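To make the elimination step concrete, here is a minimal Python sketch (the helper name `eliminate` is ours, not from the text) that reproduces the fill edges for the grid and ordering of Example 12:

```python
from itertools import combinations

def eliminate(edges, order):
    """Vertex elimination: process vertices in the given order, connecting
    the neighbors of each eliminated vertex into a clique. Returns the set
    of fill edges; the triangulated graph is the original plus the fill."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    fill = set()
    for v in order:
        nbrs = adj.pop(v)
        for a, b in combinations(sorted(nbrs), 2):
            if b not in adj[a]:
                adj[a].add(b)
                adj[b].add(a)
                fill.add((a, b))
        for u in nbrs:
            adj[u].discard(v)
    return fill

# 3 x 3 grid of Fig. 8a, vertices numbered row by row
grid = [(1,2),(2,3),(4,5),(5,6),(7,8),(8,9),
        (1,4),(4,7),(2,5),(5,8),(3,6),(6,9)]
fill = eliminate(grid, [1, 3, 7, 9, 4, 6, 2, 8])
print(sorted(fill))
```

With this ordering the fill edges are (2, 4), (2, 6), (4, 8), (6, 8) and, crucially, (2, 8); adding them to the grid yields exactly the six maximal cliques described in Example 12.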

Fig. 8

Illustration of junction tree construction. (a) Original graph is a 3 × 3 grid. (b) Triangulated version of original graph. Note the two 4-cliques in the middle. (c) Corresponding clique graph for triangulated graph in (b), with maximal cliques depicted within ellipses, and separator sets within rectangles. (d) A junction tree extracted from the clique graph in (c). It is a clique tree that satisfies the running intersection property

4.3 Constructing the Junction Tree

Given that any triangulated graph has a junction tree (JT), the first step in the JT procedure is to triangulate the graph. At least in principle, this operation is straightforward, since it entails adding edges so as to remove chordless cycles from the graph. (See the end of Sect. 4 for discussion of optimal triangulations.) The focus of this section is the next step of the JT procedure: how to form a junction tree from a triangulated graph. Any graph with cycles has more than one clique tree. Given a triangulated graph, we are guaranteed that at least one of these clique trees has the running intersection property, and so is a junction tree. How do we find such a junction tree? It turns out that there is a simple procedure, based on solving a certain maximum weight spanning tree problem, that is always guaranteed to find one.

Given a triangulated graph, we again consider the clique graph, as previously defined. Recall that vertices \(C_{1}\) and \(C_{2}\) (corresponding to cliques from the original graph) are connected by an edge if and only if the associated separator set \(S = C_{1} \cap C_{2}\) is non-empty. Accordingly, with each edge \(e = (C_{1},C_{2})\) of the clique graph, we can associate the positive weight \(w(e) = \mathrm{card}(C_{1} \cap C_{2})\). We refer to the resulting object as the cardinality-weighted clique graph. Given this weighted graph, the maximum weight spanning tree problem corresponds to finding the tree T (whose vertex set includes every node) such that the weight \(w(T):=\sum _{e\in T}w(e)\) is maximized.

Proposition 4

Any maximum weight spanning tree of the cardinality-weighted clique graph is a junction tree for the original graph.

Proof

Suppose that the triangulated graph \(\tilde{\mathcal{G}}\) has \(M\) maximal cliques, so that (by definition) the weighted clique graph has \(M\) vertices. Let us consider an arbitrary spanning tree T of the weighted clique graph, say with vertices \(\{C_{1},\ldots,C_{M}\}\) and an associated edge set \(\{S_{1},\ldots,S_{M-1}\}\), where we have identified each edge with its associated separator set. The weight of this spanning tree can be written as \(w(T) =\sum _{j=1}^{M-1}\mathrm{card}(S_{j})\). Letting \(\mathbb{I}\,[t \in S_{j}]\) be a \(\{0,1\}\)-valued indicator function for the event \(\{t \in S_{j}\}\), we may write \(\mathrm{card}(S_{j}) =\sum _{t\in \mathcal{V}}\mathbb{I}\,[t \in S_{j}]\), and hence we have

$$\displaystyle\begin{array}{rcl} w(T)& =\sum _{ j=1}^{M-1}\sum _{t\in \mathcal{V}}\mathbb{I}\,[t \in S_{j}]\; =\;\sum _{t\in \mathcal{V}}\sum _{j=1}^{M-1}\mathbb{I}\,[t \in S_{j}].&{}\end{array}$$
(37)

We now claim that for any \(t \in \mathcal{V}\), we have the inequality

$$\displaystyle\begin{array}{rcl} \sum _{j=1}^{M-1}\mathbb{I}\,[t \in S_{ j}]& \leq \sum _{i=1}^{M}\mathbb{I}\,[t \in C_{ i}] - 1,&{}\end{array}$$
(38)

with equality if and only if the subgraph induced by t is connected. To establish this claim, consider the subgraph of T induced by vertex t—meaning the subgraph formed by the clique vertices \(C_{i}\) that include t, along with the associated edges (separator sets \(S_{j}\)) that also include t. Since it is a subgraph of the tree T, it must also be acyclic, from which the inequality (38) follows. If this subgraph has a single connected component, then equality holds. When T is actually a junction tree, this equality holds for every vertex t.

Substituting the bound (38) into the earlier inequality (37), we conclude that

$$\displaystyle\begin{array}{rcl} w(T)& \leq \sum _{t\in \mathcal{V}}\biggl \{\sum _{i=1}^{M}\mathbb{I}\,[t \in C_{i}] - 1\biggr \},& {}\\ \end{array}$$

where equality holds if and only if T is a junction tree. Since the given tree T was arbitrary, the claim follows. □ 

Putting together the pieces of our development, we are led to the following algorithm:

Junction Tree Algorithm Given an undirected graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\):

  1.

    Run a triangulation procedure to obtain a triangulated graph \(\tilde{\mathcal{G}}\).

  2.

    Form the cardinality-weighted clique graph, and find a maximum weight spanning tree via a standard algorithm (e.g., Kruskal’s algorithm or Prim’s algorithm).

  3.

    Given the junction tree, define appropriate compatibility functions on its vertices and edges to reproduce the original distribution, and run the sum-product algorithm.
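Steps 1 and 2 can be sketched concretely on the (already triangulated) graph of Fig. 7a. The following Python fragment is a sketch under stated assumptions—maximal cliques are enumerated by brute force, which is viable only for tiny graphs, and `satisfies_rip` is a hypothetical helper name—building the cardinality-weighted clique graph, extracting a maximum weight spanning tree via Kruskal’s algorithm, and checking the running intersection property:

```python
from itertools import combinations

# graph of Fig. 7a: the 5-cycle (1,2,3,4,5) with chords (2,4) and (1,4)
edges = {(1,2),(2,3),(3,4),(2,4),(4,5),(1,5),(1,4)}
nodes = range(1, 6)
adjacent = lambda u, v: (u, v) in edges or (v, u) in edges

def is_clique(S):
    return all(adjacent(u, v) for u, v in combinations(S, 2))

# brute-force enumeration of maximal cliques (fine for 5 vertices)
cliques = [set(S) for r in range(2, 6) for S in combinations(nodes, r)
           if is_clique(S)]
maximal = [C for C in cliques if not any(C < D for D in cliques)]

def max_weight_spanning_tree(cliques):
    """Kruskal's algorithm on the cardinality-weighted clique graph."""
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    pairs = [(len(A & B), i, j) for (i, A), (j, B)
             in combinations(enumerate(cliques), 2) if A & B]
    tree = []
    for w, i, j in sorted(pairs, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

tree = max_weight_spanning_tree(maximal)

def satisfies_rip(cliques, tree):
    """Running intersection: for each vertex t, the cliques containing t
    must induce a connected (hence spanning) subtree of the clique tree."""
    for t in nodes:
        idx = [i for i, C in enumerate(cliques) if t in C]
        sub = [(i, j) for i, j in tree if i in idx and j in idx]
        if len(idx) > 1 and len(sub) != len(idx) - 1:
            return False
    return True

print(sorted(map(tuple, map(sorted, maximal))), satisfies_rip(maximal, tree))
```

The three maximal cliques recovered are {1, 2, 4}, {1, 4, 5} and {2, 3, 4}, matching the discussion of Fig. 7, and the extracted spanning tree passes the running intersection check.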

There are various techniques for finding some triangulation of a graph; as noted in Lemma 4, running the vertex elimination algorithm (using any ordering) yields a triangulation. Note, however, that the complexity of tree inference depends on the size of the state spaces at each node in the junction tree, which are determined by the clique sizes in the triangulated graph. It would be desirable, then, to obtain a triangulation with minimal maximal clique size, or equivalently, with minimal treewidth.

Definition 8

The treewidth \(\kappa (\mathcal{G})\) of an undirected graph \(\mathcal{G}\) is the size of the largest clique, minus one, in an optimal triangulation (that is, a triangulation minimizing the maximal clique size).

As an illustration, any ordinary tree has treewidth one, whereas the graph in Fig. 7a has treewidth two, and the grid in Fig. 8 has treewidth three.

For any fixed treewidth \(\kappa =\kappa (\mathcal{G})\), there are efficient algorithms—meaning polynomial-time in the graph size—to test whether the graph has treewidth \(\kappa\), and if so, to determine an associated junction tree (e.g., [16, 60]). However, the complexity of these algorithms grows exponentially in the treewidth \(\kappa\), so they are feasible only for graphs of bounded treewidth. For a general graph, the problem of finding an optimal triangulation is NP-hard [6, 67]. Despite this negative result, there are a variety of heuristic algorithms for obtaining “good” triangulations of a graph. Some of the most widely studied are the minimum degree and the minimum fill heuristics [4, 36]. The minimum degree heuristic follows the elimination ordering obtained by choosing a minimum degree vertex (in the partially reduced graph) at each step. The minimum fill heuristic operates similarly, but instead chooses a vertex \(s\) so as to minimize the total number of edges required to make the neighborhood \(\mathcal{N}(s)\) a clique, a quantity known as the “fill”.
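The minimum degree heuristic admits a very short sketch (the function name `min_degree_order` is ours, not standard): at each step it eliminates a vertex of minimum degree in the partially reduced graph, connecting that vertex’s neighbors into a clique.

```python
from itertools import combinations

def min_degree_order(edges, n):
    """Greedy minimum degree elimination ordering on vertices 1..n.
    Returns the ordering together with the number of fill edges added."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    order, fill = [], 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))  # minimum degree vertex
        order.append(v)
        nbrs = adj.pop(v)
        for a, b in combinations(sorted(nbrs), 2):
            if b not in adj[a]:                  # fill edge
                adj[a].add(b)
                adj[b].add(a)
                fill += 1
        for u in nbrs:
            adj[u].discard(v)
    return order, fill

grid = [(1,2),(2,3),(4,5),(5,6),(7,8),(8,9),
        (1,4),(4,7),(2,5),(5,8),(3,6),(6,9)]
order, fill = min_degree_order(grid, 9)
print(order, fill)
```

On the 3 × 3 grid of Fig. 8a, the heuristic eliminates the four degree-two corners first and produces five fill edges, the same fill count as the fixed ordering of Example 12.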

5 Basics of Graph Estimation

Up to this point, we have considered various types of “forward problems”, by which we mean that the graphical model—both the graph and the functions in its factorization—are known quantities. In this section, we turn our attention to various classes of “inverse problems”, in which the functions, the structure of the graph, or both are unknown. Given a black box that generates samples from the distribution, we consider two possible problems:

  • assuming that the clique structure is known, how to estimate the functions in the factorization?

  • given no a priori information, how to estimate the clique structure and the functions?

The former problem is one of parameter estimation, since it involves estimating the parameters that specify either the conditional probabilities (for a directed graphical model) or the compatibility functions (for an undirected graphical model). In contrast, the latter problem is a form of model selection, since we need to select the correct graph from competing families of graphs. In the machine learning community, the latter problem is also known as structure learning. We begin our discussion with the former problem, as it is simpler both conceptually and computationally.

5.1 Parameter Estimation for Directed Graphs

The problem of parameter estimation is particularly simple for directed graphical models, in which case, as long as the data is fully observed, the solution can be obtained in closed form.

More concretely, for a known directed graph \(\mathcal{D} = (\mathcal{V},\overrightarrow{\mathcal{E}})\), consider the set of all distributions that factorize in the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N})& =\prod _{s\in \mathcal{V}}p(x_{s}\,\vert \,x_{\pi (s)}),& {}\\ \end{array}$$

where \(\pi (s) \subseteq \mathcal{V}\) is the parent set of vertex \(s\). Suppose we observe a collection of \(n\) samples, each of the form \(X_{i\,\bullet } = (X_{i1},\ldots,X_{iN}) \in \mathcal{X}_{[N]}\). Given the collection of samples \(X_{1\,\bullet }^{n}:=\{ X_{1\,\bullet },X_{2\,\bullet },\ldots,X_{n\,\bullet }\}\), our goal is to estimate the conditional distributions at each vertex \(s \in \mathcal{V}\). In these notes, we focus exclusively on the case of discrete variables, in particular with \(\mathcal{X}_{s} =\{ 0,1,\ldots,m - 1\}\) for all vertices \(s \in \mathcal{V}\). (The extension to discrete variables with differing numbers of states per vertex is straightforward.)

By definition, each conditional distribution can be written as a ratio of two marginal distributions—namely, as

$$\displaystyle\begin{array}{rcl} p(x_{s}\,\vert \,x_{\pi (s)})& = \frac{p(x_{s},x_{\pi (s)})} {p(x_{\pi (s)})}.& {}\\ \end{array}$$

A natural estimate of each marginal distribution is the corresponding empirical marginal distribution—namely, the quantities

$$\displaystyle\begin{array}{lcl} \hat{p}(x_{s}) = \frac{1} {n}\sum _{i=1}^{n}\mathbb{I}\,(X_{ i\,s} = x_{s}),\quad \text{and}& {}\\ \hat{p}(x_{s},x_{\pi (s)}) = \frac{1} {n}\sum _{i=1}^{n}\mathbb{I}\,(X_{ i\,s} = x_{s},X_{i\,\pi (s)} = x_{\pi (s)}), {}\\ \end{array}$$

where \(\mathbb{I}\,(X_{i\,s} = x_{s})\) is a zero-one valued indicator function for the event that \(\{X_{i\,s} = x_{s}\}\), with the other indicator similarly defined. In words, the empirical marginal \(\hat{p}(x_{s})\) is simply the relative fraction of times that \(X_{s} = x_{s}\) in the data.

Given these estimates of the marginal distributions, a natural “plug-in” estimate of the conditional distribution is given by

$$\displaystyle\begin{array}{rcl} \hat{p}(x_{s}\mid x_{\pi (s)})& = \frac{\hat{p}(x_{s},x_{\pi (s)})} {\hat{p}(x_{\pi (s)})}.&{}\end{array}$$
(39)

(To be clear, we take 0∕0 to be equal to 0, since these correspond to configurations that never appear in the data set.)
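As a sketch, the plug-in estimator (39) amounts to counting. Assuming fully observed discrete samples (the helper name `fit_directed` and the data layout below are ours, purely for illustration), each empirical conditional is a ratio of counts:

```python
from collections import Counter

def fit_directed(samples, parents):
    """Plug-in MLE for a discrete directed graphical model: empirical
    conditionals p-hat(x_s | x_pi(s)) as in Eq. (39). `samples` is a list
    of dicts {vertex: value}; `parents` maps a vertex to its parent tuple.
    Configurations absent from the data simply never appear as keys,
    which implements the 0/0 = 0 convention."""
    n = len(samples)
    cpds = {}
    for s, pa in parents.items():
        joint = Counter((x[s],) + tuple(x[t] for t in pa) for x in samples)
        marg  = Counter(tuple(x[t] for t in pa) for x in samples)
        cpds[s] = {k: c / marg[k[1:]] for k, c in joint.items()}
    return cpds

# toy chain 1 -> 2 on binary variables
data = [{1: 0, 2: 0}, {1: 0, 2: 1}, {1: 1, 2: 1}, {1: 1, 2: 1}]
cpds = fit_directed(data, {1: (), 2: (1,)})
print(cpds[2][(1, 1)])   # p-hat(x2 = 1 | x1 = 1)
```

On this toy data, \(\hat{p}(x_{2} = 1\mid x_{1} = 1) = 2/2 = 1\) and \(\hat{p}(x_{1} = 0) = 2/4 = 0.5\), exactly the ratios of empirical marginals in Eq. (39).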

5.2 Parameter Estimation for Undirected Graphs

For undirected graphical models, the problem of parameter estimation is not so simple in general. In particular, given an undirected graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\), consider the set of all distributions that factorize in the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N};\psi )& = \frac{1} {Z}\prod _{C\in \mathfrak{C}}\psi _{C}(x_{C}),&{}\end{array}$$
(40)

where \(\mathfrak{C}\) is the set of all maximal cliques of the graph. In this setting, our goal is to estimate the unknown compatibility functions \(\psi _{C}: \mathcal{X}_{C} \rightarrow (0,\infty )\) associated with each clique of the graph. A challenge here is that, at least in general, the compatibility functions are not directly related to conditional or marginal distributions of the graph. The exception to this rule is given by undirected trees (and more generally, junction trees), in which case there is always a choice of compatibility functions that is related to the marginal and/or conditional distributions.

In these notes, we limit our discussion to the method of maximum likelihood for estimating the compatibility functions. It is based on choosing the compatibility functions \(\psi =\{\psi _{C},C \in \mathfrak{C}\}\) so as to maximize the (rescaled) log likelihood of the data, namely the quantity

$$\displaystyle\begin{array}{rcl} \ell(\psi;X_{1\,\bullet },\ldots,X_{n\,\bullet })& = \frac{1} {n}\sum _{i=1}^{n}\log p(X_{ i\,\bullet };\psi ).&{}\end{array}$$
(41)

We use \(\hat{\psi }=\{\hat{\psi } _{C},C \in \mathfrak{C}\}\) to denote the maximum likelihood estimate, namely a choice of compatibility functions that maximizes the log likelihood. A remark about parameterization before proceeding: since we are considering discrete random variables, the estimate \(\hat{\psi }\) can be viewed as a vector of real numbers. In particular, the estimate \(\hat{\psi }_{C}\) can be represented as a sub-vector of \(m^{\vert C\vert }\) real numbers, one for each possible configuration \(J =\{ J_{s},s \in C\} \in \mathcal{X}_{C}\) that the variables \(x_{C}\) may assume. By concatenating all of these sub-vectors, we obtain a vector \(\hat{\psi }\in \mathbb{R}^{D}\), where \(D =\sum _{C\in \mathfrak{C}}m^{\vert C\vert }\). Consequently, the rescaled log likelihood is a function from \(\mathbb{R}^{D}\) to \(\mathbb{R}\).

5.2.1 Maximum Likelihood for Undirected Trees

Let us begin by considering the maximum likelihood problem for an undirected tree \(\mathcal{T} = (\mathcal{V},\mathcal{E})\), say with a factorization of the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N};\psi )& = \frac{1} {Z}\prod _{s\in \mathcal{V}}\psi _{s}(x_{s})\,\prod _{(s,t)\in \mathcal{E}}\psi _{st}(x_{s},x_{t}).&{}\end{array}$$
(42)

For any tree, we claim that the maximum likelihood estimate \(\hat{\psi }\) can be expressed in terms of the empirical marginals at the vertices and edges of the tree—namely, the quantities

$$\displaystyle\begin{array}{rcl} \hat{p}(x_{s}=j)& = \frac{1} {n}\sum _{i=1}^{n}\mathbb{I}\,_{ s;j}(X_{is}),\quad \text{and}\quad \hat{p}((x_{s},x_{t}) = (j,k)) = \frac{1} {n}\sum _{i=1}^{n}\mathbb{I}\,_{ st;jk}(X_{is},X_{it}),&{}\end{array}$$
(43)

where \(\mathbb{I}\,_{s;j}(x_{s})\) is a zero-one indicator for the event that \(\{x_{s} = j\}\), with \(\mathbb{I}\,_{st;jk}(x_{s},x_{t})\) similarly defined. With these empirical marginals computed from the data, the maximum likelihood estimates of the compatibility functions are given by

$$\displaystyle\begin{array}{rcl} \hat{\psi }_{s}(x_{s}) =\hat{ p}(x_{s}),\quad \text{and}\quad \hat{\psi }_{st}(x_{s},x_{t}) = \frac{\hat{p}(x_{s},x_{t})} {\hat{p}(x_{s})\hat{p}(x_{t})}.& &{}\end{array}$$
(44)

(As before, we take 0∕0 = 0 for concreteness.) We prove in the following section, as a corollary of a more general result, that the relations (44) do in fact define a global maximizer of the log likelihood (41) for a tree-structured graph.
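The claim that Z = 1 and that the marginals match can be checked numerically on a toy example. The following sketch (toy data of our own devising, on a three-vertex chain with binary variables) evaluates the factorization (51) with the compatibility functions (44) by brute-force enumeration:

```python
from itertools import product
from collections import Counter

# toy data on the chain 0 - 1 - 2 with binary variables
data = [(0,0,0), (0,1,1), (1,1,1), (1,1,0), (0,0,1)]
n = len(data)
V, E = [0, 1, 2], [(0, 1), (1, 2)]

ps  = {s: Counter(x[s] for x in data) for s in V}                 # vertex counts
pst = {(s, t): Counter((x[s], x[t]) for x in data) for s, t in E}  # edge counts

def p_hat(x):
    """Evaluate the tree factorization (51) with the compatibility
    functions of Eq. (44), using the 0/0 = 0 convention."""
    val = 1.0
    for s in V:
        val *= ps[s][x[s]] / n
    for s, t in E:
        joint = pst[(s, t)][(x[s], x[t])] / n
        denom = (ps[s][x[s]] / n) * (ps[t][x[t]] / n)
        val *= joint / denom if joint > 0 else 0.0
    return val

Z = sum(p_hat(x) for x in product([0, 1], repeat=3))
marg0 = sum(p_hat(x) for x in product([0, 1], repeat=3) if x[0] == 1)
print(round(Z, 10), round(marg0, 10))
```

Here Z evaluates to 1 and the model marginal \(p(x_{0} = 1;\hat{\psi })\) equals the empirical value 2∕5, illustrating the marginal-matching condition (52) without any explicit normalization.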

5.2.2 Maximum Likelihood on General Undirected Graphs

In the case of graphs with cycles, the MLE no longer has a convenient closed-form expression. In order to appreciate the challenges, it is convenient to discuss the maximum likelihood objective using an alternative exponential parameterization. Let us do so for a graph with general clique set \(\mathfrak{C}\), understanding that the undirected tree is a special case. For a given clique C and configuration \(J =\{ J_{s},s \in C\} \in \mathcal{X}_{C}\), we define the function

$$\displaystyle\begin{array}{rcl} \mathbb{I}\,_{C;J}(x_{C})& = \left \{\begin{array}{@{}l@{\quad }l@{}} 1\quad &\mbox{ if $x_{C} = J$} \\ 0\quad &\text{otherwise,} \end{array} \right.&{}\end{array}$$
(45)

corresponding to a binary-valued indicator for the event that {x C  = J}. Note that these functions are generalizations of the vertex-based functions \(\mathbb{I}\,_{s;j}\) and edge-based functions \(\mathbb{I}\,_{st;jk}\) discussed previously.

We use \(\mathbb{I}\,_{C} =\{ \mathbb{I}\,_{C;J},J \in \mathcal{X}_{C}\}\) to denote the vector of all \(m^{\vert C\vert }\) such indicator functions associated with clique C. With this notation, any function \(\psi _{C}: \mathcal{X}_{C} \rightarrow (0,\infty )\) can be parameterized in the log-linear form

$$\displaystyle\begin{array}{rcl} \log \psi _{C}(x_{C})& =\langle \theta _{C},\, \mathbb{I}\,_{C}(x_{C})\rangle:=\sum _{J\in \mathcal{X}_{C}}\theta _{C;J}\mathbb{I}\,_{C;J}(x_{C})& {}\\ \end{array}$$

for some real vector \(\theta _{C} =\{\theta _{C;J},\;J \in \mathcal{X}_{C}\}\) in \(m^{\vert C\vert }\) dimensions. If we adopt this parameterization for each clique in the graph, then the overall factorization can be re-written in the form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N};\theta )& =\exp \Big\{\sum _{C\in \mathfrak{C}}\langle \theta _{C},\, \mathbb{I}\,_{C}(x_{C})\rangle -\varPhi (\theta )\Big\}&{}\end{array}$$
(46)

where \(\theta =\{\theta _{C},C \in \mathfrak{C}\}\) is a real-valued vector of parameters to be estimated, and

$$\displaystyle\begin{array}{rcl} \varPhi (\theta )&:=\log \sum _{x\in \mathcal{X}_{[N]}}\exp \Big\{\sum _{C\in \mathfrak{C}}\langle \theta _{C},\, \mathbb{I}\,_{C}(x_{C})\rangle \Big\}&{}\end{array}$$
(47)

is the log normalization constant. The representation (46) shows that an undirected graphical model can be associated with an exponential family, namely one with sufficient statistics \(\{\mathbb{I}\,_{C}(\cdot ),C \in \mathfrak{C}\}\) and natural parameter \(\theta =\{\theta _{C},C \in \mathfrak{C}\} \in \mathbb{R}^{D}\). This connection is useful, since there is a rich and classical statistical literature on the properties of exponential families (e.g., [19, 30, 64]). We make use of some of these properties here. Moreover, an attractive property of the maximum likelihood estimate is its invariance to reparameterizations: namely, if we can compute the MLE \(\hat{\theta }\) in our new, exponential parameterization, then \(\hat{\psi }\equiv \exp (\hat{\theta })\) will define the MLE in the original parameterization. We use this fact freely in the discussion to follow.

One convenient property of the exponential parameterization is that the log likelihood \(\ell(\theta;X_{1\,\bullet },\ldots,X_{n\,\bullet }):= \frac{1} {n}\sum _{i=1}^{n}\log p(X_{ i\,\bullet };\theta )\) can be written in terms of the empirical marginals

$$\displaystyle\begin{array}{rcl} \hat{\mu }_{C;J}& = \frac{1} {n}\sum _{i=1}^{n}\mathbb{I}\,_{ C;J}(X_{iC}),&{}\end{array}$$
(48)

at each clique. Note that \(\hat{\mu }_{C;J} =\hat{ p}(x_{C} = J)\) corresponds to the fraction of samples for which \(\{X_{C} = J\}\) in the data set. Let us define the vector \(\hat{\mu }_{C} =\{\hat{\mu } _{C;J},J \in \mathcal{X}_{C}\} \in \mathbb{R}^{m^{\vert C\vert }}\), as well as the inner product \(\langle \theta _{C},\,\hat{\mu }_{C}\rangle =\sum _{J\in \mathcal{X}_{C}}\theta _{C;J}\;\hat{\mu }_{C;J}\). With this notation, the maximum likelihood estimate \(\hat{\theta }\) takes the form

$$\displaystyle\begin{array}{rcl} \hat{\theta }& =\arg \max _{\theta \in \mathbb{R}^{D}}\Big\{\sum _{C\in \mathfrak{C}}\langle \theta _{C},\,\hat{\mu }_{C}\rangle -\varPhi (\theta )\Big\}.&{}\end{array}$$
(49)

In order to understand the structure of this optimization problem, we require some useful properties of the function \(\varPhi\):

Lemma 5

The function \(\varPhi\) is convex and infinitely differentiable on \(\mathbb{R}^{D}\) , and in particular, we have

$$\displaystyle\begin{array}{rcl} \frac{\partial \varPhi } {\partial \theta _{C;J}}(\theta )& =\mathop{\underbrace{ \sum _{x\in \mathcal{X}_{[N]}}p(x;\theta )\;\mathbb{I}\,_{C;J}(x_{C})}}\limits _{p(x_{C}=J;\theta )}.& {}\\ \end{array}$$

This result shows that taking the partial derivative of \(\varPhi\) with respect to the parameters over clique C yields the marginal distribution over that clique. The proofs of these properties are relatively straightforward, so we leave them as an exercise for the reader.

Lemma 5 allows us to characterize the structure of the maximum likelihood estimate (49). First, it shows that the rescaled log likelihood \(\ell\) is a concave function of θ, and hence has no local optima that can trap an iterative algorithm. Second, it can be used to derive the stationary conditions that characterize a global optimum; more precisely, by taking derivatives of the rescaled log likelihood and setting them to zero, we find that the MLE \(\hat{\theta }\) satisfies the stationary conditions

$$\displaystyle\begin{array}{rcl} p(x_{C} = J;\hat{\theta })& =\hat{\mu } _{C;J}\quad \mbox{ for all $C \in \mathfrak{C}$, and $J \in \mathcal{X}_{C}$}.&{}\end{array}$$
(50)

In words, the maximum likelihood estimate \(\hat{\theta }\) is such that the marginals computed under the distribution \(p(\cdot;\,\hat{\theta })\) are equal to the empirical marginals determined by the data. This type of moment-matching condition characterizes the MLE in any exponential family; see the monograph [64] for further details.

Let us now return to our previously open question: for a tree-structured graph \(\mathcal{T}\) with the factorization (42), why does Eq. (44) specify the maximum likelihood estimate? We simply need to verify that the moment-matching conditions are satisfied. Given the suggested form of \(\hat{\psi }\) from Eq. (44), we have

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N};\hat{\psi })& = \frac{1} {Z}\prod _{s\in \mathcal{V}}\hat{p}(x_{s})\;\prod _{(s,t)\in \mathcal{E}} \frac{\hat{p}(x_{s},x_{t})} {\hat{p}(x_{s})\hat{p}(x_{t})}.&{}\end{array}$$
(51)

Our claim is that in this factorization, we must necessarily have Z = 1 and moreover

$$\displaystyle\begin{array}{rcl} p(x_{s};\hat{\psi })=\hat{p}(x_{s})\quad \mbox{ for all $s \in \mathcal{V}$, and}\quad p(x_{s},x_{t};\hat{\psi })\,=\,\hat{p}(x_{s},x_{t})\quad \mbox{ for all $(s,t) \in \mathcal{E}$.}& &{}\end{array}$$
(52)

In order to verify this claim, we first observe that any undirected tree can be converted to an equivalent directed form as follows: first, designate one vertex, say r, as the root, and then orient all edges away from the root toward the leaves. See Fig. 9 for an illustration of this procedure.

Fig. 9

(a) Undirected tree. (b) Conversion of undirected tree to a directed tree rooted at vertex r. Each vertex \(s \in \mathcal{V}\setminus \{r\}\) now has a unique parent \(\pi (s)\). Any given parent may have multiple children

After this conversion, each vertex \(s \in \mathcal{V}\setminus \{r\}\) has a unique parent \(\pi (s)\), and we can associate the product \(\hat{\psi }_{s}(x_{s})\;\hat{\psi }_{s\pi (s)}(x_{s},x_{\pi (s)})\) with the directed edge \((\pi (s) \rightarrow s)\) and the term \(\hat{\psi }_{r}(x_{r})\) with the root vertex. In this way, the factorization (51) can be written in the directed form

$$\displaystyle\begin{array}{rcl} p(x_{1},\ldots,x_{N};\hat{\psi })& =& \frac{1} {Z}\hat{\psi }_{r}(x_{r})\;\prod _{s\in \mathcal{V}\setminus \{r\}}\hat{\psi }_{s}(x_{s})\;\hat{\psi }_{s\pi (s)}(x_{s},x_{\pi (s)}) {}\\ & =& \frac{1} {Z}\mathop{\underbrace{\hat{p}(x_{r})}}\limits _{f_{r}(x_{r})}\;\prod _{s\in \mathcal{V}\setminus \{r\}}\mathop{\underbrace{ \frac{\hat{p}(x_{s},x_{\pi (s)})} {\hat{p}(x_{\pi (s)})} }}\limits _{f_{s}(x_{s},\,x_{\pi (s)})}. {}\\ \end{array}$$

Note that this is now a factorization of the general directed form (2), with functions \(\{f_{s},s \in \mathcal{V}\}\) that satisfy the normalization condition (1). Consequently, an application of Proposition 1 implies that

$$\displaystyle\begin{array}{rcl} p(x_{r};\hat{\psi }) =\hat{ p}(x_{r}),\quad \text{and}\quad \frac{p(x_{s},x_{\pi (s)};\hat{\psi })} {p(x_{\pi (s)};\hat{\psi })} = \frac{\hat{p}(x_{s},x_{\pi (s)})} {\hat{p}(x_{\pi (s)})} \quad \mbox{ for all $\pi (s) \rightarrow s$.}& & {}\\ \end{array}$$

A recursive argument, moving from the root r out to the leaves, then shows that the marginal-matching condition (52) holds. We leave the details of this final step as an exercise for the reader. (Start with the case of a Markov chain to see the general idea.)

5.2.3 Iterative Proportional Scaling

Thus far, we have seen that in the special case of a tree-structured graph, it is straightforward to find a set of compatibility functions \(\hat{\psi }\) (or equivalently, a vector of parameters \(\hat{\theta }\)) that satisfy the moment-matching conditions (50). The same basic idea can be extended to any triangulated graph, by first building a junction tree, and then following the analogous procedure on the junction tree.

In contrast, for a general undirected graph (with cycles, not triangulated), the MLE problem no longer has a closed-form solution, and needs to be solved by an iterative algorithm. Since the moment-matching conditions (50) characterize the global optimum of the convex program (49) that defines the MLE, a variety of standard algorithms from convex programming [9, 18] could be applied. Here we describe one particular algorithm, attractive due to the simplicity of its updates, known as iterative proportional scaling, or IPS for short. As we will see, it amounts to a block co-ordinate ascent of the objective function. The IPS algorithm takes as input the empirical marginals \(\{\hat{\mu }_{C},C \in \mathfrak{C}\}\), computed from the data set as in Eq. (48), as well as a particular ordering of the clique set.

Iterative Proportional Scaling (IPS)

  1.

    Initialize estimate θ C  = 0 for all cliques \(C \in \mathfrak{C}\).

  2.

    Cycling over cliques in the given order:

    (a)

      Activate clique \(A\), and given current parameter estimate θ′, compute marginal probabilities

      $$\displaystyle\begin{array}{rcl} \mu '_{A;J}& = p(x_{A} = J;\theta ')\quad \mbox{ for all $J \in \mathcal{X}_{A}$,}& {}\end{array}$$
      (53)

      and form the vector \(\mu '_{A} =\{\mu '_{A;J},J \in \mathcal{X}_{A}\}\).

    (b)

      Update the parameter estimate \(\theta '\mapsto \theta\) via

      $$\displaystyle\begin{array}{rcl} \theta _{C}& = \left \{\begin{array}{@{}l@{\quad }l@{}} \theta '_{C} \quad &\mbox{ for all $C \in \mathfrak{C}\setminus \{A\}$} \\ \theta '_{A} +\log \frac{\hat{\mu }_{A}} {\mu '_{A}}\quad &\mbox{ for $C = A$.}\end{array} \right.& {}\\ \end{array}$$
  3.

    Iterate until convergence.

Observe that Step 2(a) of the algorithm requires computation of the marginal distribution \(p(x_{A};\theta ')\) over the currently active clique \(A\). Doing so requires running the elimination algorithm, or building a junction tree on which to run a message-passing algorithm. This step can be computationally expensive if the graph treewidth is high, in which case approximate methods for marginalization become important (see the monograph [64] for more details). The update in Step 2(b) fixes the parameter estimate on all cliques except for the active one \(A\); for this clique, the update can be equivalently written in the form

$$\displaystyle\begin{array}{rcl} e^{\theta _{A;J}}& = e^{\theta '_{A;J}}\;\frac{\hat{\mu }_{A;J}} {\mu '_{A;J}}\quad \mbox{ for all $J \in \mathcal{X}_{A}$.}& {}\\ \end{array}$$

This form of the update clarifies the origins of the name “proportional scaling”, since the parameters are being adjusted by the relative proportions of the empirical marginal \(\hat{\mu }_{A;J}\) and the current model marginal \(\mu '_{A;J}\).
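A minimal sketch of IPS on a binary 4-cycle follows, under two simplifying assumptions of ours: the clique marginals of Step 2(a) are computed by brute-force enumeration over all 16 states (rather than by elimination or junction tree message-passing), and the target marginals are taken from a known positive joint distribution, standing in for the empirical marginals of Eq. (48).

```python
from itertools import product
import math

# pairwise cliques of a 4-cycle on binary variables
cliques = [(0, 1), (1, 2), (2, 3), (3, 0)]
configs = list(product([0, 1], repeat=2))
states  = list(product([0, 1], repeat=4))

# target marginals, taken from a known positive joint q (these play the
# role of the empirical marginals mu-hat computed from data)
q = {x: math.exp(0.5 * sum(2 * (x[s] == x[t]) - 1 for s, t in cliques))
     for x in states}
Zq = sum(q.values())
mu_hat = {C: {j: sum(q[x] / Zq for x in states
                     if (x[C[0]], x[C[1]]) == j) for j in configs}
          for C in cliques}

theta = {C: {j: 0.0 for j in configs} for C in cliques}

def model_marginals(C):
    """Exact clique marginals p(x_C; theta) by brute-force enumeration
    (what Step 2(a) would obtain via elimination or a junction tree)."""
    w = {x: math.exp(sum(theta[D][(x[D[0]], x[D[1]])] for D in cliques))
         for x in states}
    Z = sum(w.values())
    return {j: sum(w[x] / Z for x in states
                   if (x[C[0]], x[C[1]]) == j) for j in configs}

for _ in range(100):            # Step 2: cycle over the cliques
    for A in cliques:
        mu_prime = model_marginals(A)
        for j in configs:       # Step 2(b): proportional scaling update
            theta[A][j] += math.log(mu_hat[A][j] / mu_prime[j])

# at convergence, model marginals match the targets, as in Eq. (50)
fit = model_marginals(cliques[0])
print({j: round(fit[j], 6) for j in configs})
```

Because the target marginals here are strictly positive (the MLE exists), the fitted clique marginals match the targets to machine precision after a modest number of sweeps, illustrating the moment-matching characterization (50).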

It is clear from the IPS updates that any fixed point must satisfy the moment-matching conditions (50), and hence be a global optimum of the log likelihood. Less clear, however, is whether convergence is guaranteed (though it is assumed in the statement of Step 3). Among other properties, the following result shows that the IPS algorithm is indeed guaranteed to converge to the MLE:

Proposition 5 (IPS Properties)

The IPS algorithm has the following properties:

  1. (a)

    We have \(\varPhi (\theta ') =\varPhi (\theta )\) over all updates \(\theta '\mapsto \theta\) of the algorithm.

  2. (b)

    After updating clique \(A\) , the estimate θ satisfies

    $$\displaystyle\begin{array}{rcl} p(x_{A} = J;\theta )& =\hat{\mu } _{A;J}\quad \mbox{ for all $J \in \mathcal{X}_{A}$.}& {}\end{array}$$
    (54)
  3. (c)

    The algorithm iterates converge to an MLE solution \(\hat{\theta }\) .

Proof

(a), (b):

By definition of the update, for each \(J \in \mathcal{X}_{A}\), we have

$$\displaystyle\begin{array}{rcl} p(x_{A} = J;\theta )& =& \sum _{x\in \mathcal{X}_{[N]}}p(x;\theta )\mathbb{I}\,_{A;J}(x_{A}) {}\\ & =& \sum _{x\in \mathcal{X}_{[N]}}p(x;\theta ')\frac{e^{\varPhi (\theta ')-\varPhi (\theta )}\,\hat{\mu }_{A;J}} {\mu '_{A;J}} \mathbb{I}\,_{A;J}(x_{A}) {}\\ & =& \Big\{\frac{\sum _{x\in \mathcal{X}_{[N]}}p(x;\theta ')\mathbb{I}\,_{A;J}(x_{A})} {\mu '_{A;J}} \Big\}\;\hat{\mu }_{A;J}e^{\varPhi (\theta ')-\varPhi (\theta )} {}\\ & =& \hat{\mu }_{A;J}e^{\varPhi (\theta ')-\varPhi (\theta )}. {}\\ \end{array}$$

Summing both sides over \(J \in \mathcal{X}_{A}\) yields that \(\varPhi (\theta ') =\varPhi (\theta )\) as claimed in part (a), and we then see that \(p(x_{A} = J;\theta ) =\hat{\mu } _{A;J}\) as claimed in part (b).

(c):

Note that the rescaled log likelihood can be block partitioned in the form \(\ell(\theta '_{C},C \in \mathfrak{C})\). With θ′ fixed in all co-ordinates except those indexed by \(A\), let us consider the function

$$\displaystyle\begin{array}{rcl} \theta _{A}\mapsto \ell(\theta _{A},\theta '_{C},C \in \mathfrak{C}\setminus A).& & {}\\ \end{array}$$

It is a concave function of \(\theta _{A}\). Moreover, by combining Lemma 5 with the result of part (b), we see that, after updating clique \(A\), the IPS algorithm has maximized this function over \(\theta _{A}\). Thus, the IPS algorithm is equivalent to performing block co-ordinate ascent of the log-likelihood function. By known results on the convergence of such procedures [9], it follows that IPS converges to a global optimum. □ 

5.3 Tree Selection and the Chow-Liu Algorithm

In our discussion thus far, we have assumed that the graph structure is known, and that only the compatibility functions need to be estimated. In this section, we consider an instance of the more general problem in which neither is known. As previously mentioned, this problem is known as model selection for graphical models, or graph structure learning. We limit ourselves to the problem of selecting tree-structured graphical models. See the bibliographic section for discussion of other methods suitable for more general forms of graph structure learning.

Given our data set \(\{X_{1\,\bullet },\ldots,X_{n\,\bullet }\}\), suppose that our goal is to choose the “best-fitting” spanning tree graphical model. More formally, let \(\mathbb{T}\) be the set of all spanning trees on \(N\) vertices. For a given tree \(\mathcal{T} \in \mathbb{T}\) and a given distribution \(p(x;\theta (\mathcal{T} ))\) that is Markov with respect to \(\mathcal{T}\), we measure its fit in terms of its rescaled log likelihood \(\ell(\theta (\mathcal{T} )) = \frac{1} {n}\sum _{i=1}^{n}\log p(X_{ i\,\bullet };\theta (\mathcal{T} ))\).

In principle, the solution to this problem is straightforward: for each tree \(\mathcal{T} \in \mathbb{T}\), we compute the maximum likelihood estimate

$$\displaystyle\begin{array}{rcl} \hat{\theta }(\mathcal{T} )& =\arg \max _{\theta (\mathcal{T} )}\ell(\theta (\mathcal{T} )),& {}\\ \end{array}$$

thereby obtaining a list \(\{\ell(\hat{\theta }(\mathcal{T} )),\;\mathcal{T} \in \mathbb{T}\}\) of maximized likelihoods for every tree \(\mathcal{T}\) in the set \(\mathbb{T}\) of all spanning trees. We then choose the tree \(\hat{\mathcal{T}}\) with the highest maximized likelihood, that is \(\hat{\mathcal{T}} =\arg \max _{\mathcal{T}\in \mathbb{T}}\ell(\hat{\theta }(\mathcal{T} ))\). However, the complete graph on \(N\) vertices has an enormous number of spanning trees, \(N^{N-2}\) to be precise [20], so that a brute force approach is infeasible. Fortunately, as previously shown (44), the MLE for a tree-structured graph has a very special structure, which can be exploited to obtain an efficient algorithm for finding the best-fitting tree.

Chow-Liu Algorithm for Maximum Likelihood Tree

  1.

    Compute the empirical marginals (43) associated with all vertices \(s \in \mathcal{V}\) and all distinct vertex pairs \((s,t)\).

  2.

    For each distinct pair of vertices \((s,t)\), compute the empirical mutual information

    $$\displaystyle\begin{array}{rcl} \hat{I}_{st}& =\sum _{(x_{s},x_{t})\in \mathcal{X}_{s}\times \mathcal{X}_{t}}\hat{p}(x_{s},x_{t})\log \frac{\hat{p}(x_{s},x_{t})} {\hat{p}(x_{s})\,\hat{p}(x_{t})}.& {}\end{array}$$
    (55)
  3.

    Using \(\{\hat{I}_{st}\}\) as edge weights, compute a maximum weight spanning tree

    $$\displaystyle\begin{array}{rcl} \hat{\mathcal{T}}& =\arg \max _{\mathcal{T}\in \mathbb{T}}\sum _{(s,t)\in \mathcal{E}(\mathcal{T} )}\hat{I}_{st}.& {}\end{array}$$
    (56)
  4.

    Return the maximum likelihood estimate \(\hat{\theta }(\mathcal{T} )\) on the tree from the previous step:

    $$\displaystyle\begin{array}{rcl} \hat{\theta }_{s}(x_{s};\hat{\mathcal{T}})& =& \log \hat{p}_{s}(x_{s})\quad \mbox{ for all $s \in \mathcal{V}$, and} {}\\ \hat{\theta }_{st}(x_{s},x_{t};\hat{\mathcal{T}})& =& \log \frac{\hat{p}(x_{s},x_{t})} {\hat{p}(x_{s})\hat{p}(x_{t})}\quad \mbox{ for all $(s,t) \in \mathcal{E}(\hat{\mathcal{T}})$.} {}\\ \end{array}$$

Note that Step 3 requires solving a maximum weight spanning tree problem. As mentioned previously during our treatment of the junction tree algorithm, there are various efficient methods for this problem, including Kruskal’s algorithm and Prim’s algorithm. See the bibliographic section for further discussion.
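The four steps above can be sketched in code. The following is a minimal illustration (the function names are our own, not from the text), using plug-in estimates of the empirical mutual information (55) and Kruskal's algorithm with a union-find structure for the maximum weight spanning tree (56):

```python
import numpy as np
from itertools import combinations

def empirical_mutual_information(xs, xt):
    """Plug-in estimate of I(X_s; X_t) from paired discrete samples, in nats."""
    joint = np.zeros((xs.max() + 1, xt.max() + 1))
    for a, b in zip(xs, xt):
        joint[a, b] += 1          # empirical joint counts
    joint /= joint.sum()
    ps, pt = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0              # 0 log 0 = 0 convention
    return np.sum(joint[mask] * np.log(joint[mask] / np.outer(ps, pt)[mask]))

def chow_liu(X):
    """Return the edges of a maximum weight spanning tree under MI weights.

    X has shape (n_samples, N); Kruskal's algorithm greedily adds the
    heaviest remaining edge that does not create a cycle.
    """
    N = X.shape[1]
    weights = sorted(
        ((empirical_mutual_information(X[:, s], X[:, t]), s, t)
         for s, t in combinations(range(N), 2)),
        reverse=True)
    parent = list(range(N))       # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for _, s, t in weights:
        rs, rt = find(s), find(t)
        if rs != rt:              # edge joins two components: keep it
            parent[rs] = rt
            edges.append((s, t))
    return edges
```

Since every weight \(\hat{I}_{st}\) is non-negative, the greedy construction returns a spanning tree of maximum total weight; the MLE parameters on that tree are then read off from the empirical marginals as in Step 4.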

We claim that the Chow-Liu algorithm is guaranteed to return a maximum likelihood tree—namely, one satisfying

$$\displaystyle\begin{array}{rcl} \hat{\mathcal{T}}& \in \arg \max _{\mathcal{T}\in \mathbb{T}}\ell(\hat{\theta }(\mathcal{T} ))\; =\;\arg \max _{\mathcal{T}\in \mathbb{T}}\;\Big\{\max _{\theta (\mathcal{T} )}\ell(\theta (\mathcal{T} ))\Big\}.& {}\\ \end{array}$$

The proof of this claim is straightforward given our previous discussion. For any fixed tree \(\mathcal{T}\), the maximum likelihood solution has the form (44), whence

$$\displaystyle\begin{array}{rcl} \ell(\hat{\theta }(\mathcal{T} ))& =& \frac{1} {n}\sum _{i=1}^{n}\log p(X_{ i\,\bullet };\hat{\theta }(\mathcal{T} )) {}\\ & =& \frac{1} {n}\sum _{i=1}^{n}\Big\{\sum _{ s\in \mathcal{V}}\log \hat{p}(X_{is}) +\sum _{(s,t)\in \mathcal{E}(\mathcal{T} )}\log \frac{\hat{p}(X_{is},X_{it})} {\hat{p}(X_{is})\,\hat{p}(X_{it})}\Big\} {}\\ & =& \sum _{s\in \mathcal{V}}\sum _{x_{s}\in \mathcal{X}_{s}}\hat{p}(x_{s})\log \hat{p}(x_{s}) +\sum _{(s,t)\in \mathcal{E}(\mathcal{T} )}\sum _{(x_{s},x_{t})\in \mathcal{X}_{s}\times \mathcal{X}_{t}}\hat{p}(x_{s},x_{t})\log \frac{\hat{p}(x_{s},x_{t})} {\hat{p}(x_{s})\,\hat{p}(x_{t})} {}\\ & =& C +\sum _{(s,t)\in \mathcal{E}(\mathcal{T} )}\hat{I}_{st}, {}\\ \end{array}$$

where C denotes a constant that is independent of the tree \(\mathcal{T}\). Consequently, the maximum weight spanning tree obtained in Step 3 corresponds to the maximum likelihood tree.

One might hope that an analogous procedure could be extended to junction trees, in which case one would solve a maximum weight spanning hypertree problem in place of a maximum weight spanning tree problem. Unfortunately, unlike the case of spanning trees, computing a maximum weight hypertree is computationally expensive, so that in practice, exact methods are limited to ordinary spanning trees, as in the Chow-Liu algorithm.

6 Bibliographic Details and Remarks

Graphical models were developed independently in a number of different communities. The notion of a Gibbs distribution appeared initially in the statistical physics literature, a notable example being the Ising model [39] that was introduced to model ferromagnetism. Other communities in which graphical models were developed include information theory and coding theory [34, 59], contingency tables and discrete multivariate analysis [13, 38], image processing [12, 35], and artificial intelligence [50].

The Hammersley-Clifford theorem derives its name from a manuscript that was never published (J.M. Hammersley, P. Clifford, 1971, Markov fields on finite graphs and lattices, unpublished). Besag [10] and Grimmett [37] were the first to publish proofs of the result; in these notes, we have adapted the latter proof. See Clifford [22] for some discussion of the history of the result. Lauritzen [44] provides discussion of how the Markov-factorization equivalence can break down when the strict positivity condition fails to hold.

Forms of the sum-product and max-product algorithms were independently developed in many different communities, including the work of Kalman in control theory [41], work on LDPC codes [34] and Viterbi decoding [32, 63] in error-control coding, the alpha-beta or forward-backward algorithms in signal processing [51], Pearl’s work [50] in artificial intelligence, and work on non-serial dynamic programming [8]. Dating back to the 1960s, a number of authors (e.g., [49, 54]) studied the relation between graph triangulation and the vertex elimination algorithm, and pointed out the connection to the complexity of Gaussian elimination algorithms for solving linear systems of equations. (In the statistical setting, this problem can be cast as an inference problem for a multivariate Gaussian.) At the end of the 1980s, several groups of researchers independently recognized that the significance of tree decompositions and triangulation extended beyond solving linear equations to more general classes of graph-theoretic problems [14, 45]. Within the statistics and artificial intelligence communities, the junction tree framework was developed by Lauritzen and Spiegelhalter [45]. The paper [15] contains an extensive description of treewidth and related notions of graph complexity. There are various algorithms for solving the maximum weight spanning tree problem, including Kruskal’s algorithm and Prim’s algorithm [23].

Early work on iterative proportional fitting and scaling algorithms includes the papers [27, 31, 58]. These algorithms are actually special cases of more general Bregman-based projection algorithms [24–26, 29]; see also the work of Amari [2, 3] on the information geometry of exponential families. The Chow-Liu algorithm for finding the maximum likelihood tree appeared in the paper [21]. Karger and Srebro [42, 57] studied generalizations of this problem to junction trees of higher treewidth, establishing the hardness of the problem (in a complexity-theoretic sense) but also proposing an approximation algorithm. There is a very large literature on methods for graphical model selection, including local forms of \(\ell_{1}\)-regularized regression (e.g., [48, 52]), \(\ell_{1}\)-regularization and global likelihoods (e.g., [7, 33, 47, 53]), as well as multiple testing and related approaches (e.g., [5, 40, 56]).