
5.1 Introduction

This chapter introduces mixed interaction models, a class of models for discrete and continuous variables that combine log-linear models for discrete variables (described in Chap. 2) with graphical Gaussian models for continuous variables (described in Chap. 4). The exposition given here is restricted to homogeneous mixed interaction models. Homogeneity in this context means that the covariance matrix of the Gaussian variables does not depend on the values of discrete variables. More general types of mixed interaction models that do not assume homogeneity are described in Lauritzen (1996) and Edwards (2000). An important advantage of the homogeneous models is that they can be specified using model formulae that are similar to the model formulae for log-linear models and for graphical Gaussian models.

5.2 Example Datasets

To introduce the models we consider three datasets that are in gRbase. The first dataset, milkcomp1, comes from a study comparing the composition of sow milk in terms of fat, protein and lactose content under 8 different diets. The control diet consisted of soybean meal, barley and wheat. The other diets added 8% fat to this basis diet: animal fat, rapeseed oil, fish oil, coconut oil, palm oil or sunflower oil. Sow milk was analysed for the concentration of dry matter, protein, fat and lactose: here we consider the data recorded four days after farrowing (i.e., giving birth). For further details see Lauridsen and Danielsen (2004). The first rows of the dataset are:

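The listing would be produced along the following lines (a sketch, assuming milkcomp1 ships with gRbase):

```r
## load the data and list the first rows
library(gRbase)
data(milkcomp1, package = "gRbase")
head(milkcomp1)
```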

The second dataset, wine, contains the results of a study of the chemical constituents of three varieties of grape, grown in the same region in Italy. There are 178 observations on 14 variables, of which one is discrete (grape variety) and the rest (chemical constituents) are continuous. For more information on this dataset see http://archive.ics.uci.edu/ml/datasets/Wine.

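A sketch of the corresponding listing, again assuming the data are in gRbase:

```r
## first rows of the wine data (grape variety plus 13 constituents)
data(wine, package = "gRbase")
head(wine)
```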

The third dataset, Nutrimouse, stems from a study of the effect of nutrition on lipid levels and gene expression in mice. Forty mice were each assigned one of five different diets, with different fatty acid compositions. Two strains of mice were used: one with the PPARα gene knocked out, and the other wild-type (i.e., with the PPARα gene present). The PPARα gene is known to affect fatty acid metabolism. The concentrations of 21 lipids (fatty acids) in the liver were recorded. In addition, the data include the expression levels of 120 genes in the liver; these 120 were selected from a much greater number as potentially relevant for nutrition. Thus the dataset contains N=40 observations of 143 variables: two discrete design variables, genotype (with two levels) and diet (with five levels); 120 gene expression values; and 21 lipid values. For more details see Martin et al. (2007).

The following code fragment lists a small subset of the data.

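For instance (a sketch; the assumption here is that the columns are ordered by block, with the two design variables first, then the 120 gene expressions, then the 21 lipids):

```r
## a few columns from each block, for the first mice
data(Nutrimouse, package = "gRbase")
head(Nutrimouse[, c(1:2, 3:5, 123:125)], 3)
```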

5.3 Mixed Data and CG-densities

Suppose that N observations of d discrete variables and q continuous variables are available. We denote the set of discrete variables by Δ, the set of continuous variables by Γ, and the combined variable set by \(V=\varDelta\cup\varGamma\).

An observation has the form \(x=(i,y)=(i_1,\ldots,i_d,y_1,\ldots,y_q)\). This combines the notation of Chap. 2 and Chap. 4. As in Chap. 2, we write the set of possible cells \(i=(i_1,\ldots,i_d)\) as \(\mathcal{I}\).

We construct a homogeneous conditional Gaussian density, or CG-density for short, for x=(i,y) in the following way. Firstly, the probability of the discrete variables falling in cell i is denoted p(i). We assume that p(i)>0 for all \(i \in\mathcal{I} \). Secondly, the conditional distribution of the continuous variables given that the discrete variables fall in cell i is multivariate Gaussian \(\mathcal{N}\{\mu(i), \varSigma\}\). Observe that the mean may depend on i but the variance does not. The density takes the form

$$ f(i,y) = p(i) (2\pi)^{-q/2} \det(\varSigma)^{-1/2}\exp\biggl[- \frac{1}{2} \{y-\mu(i)\}^{\top}\varSigma^{-1}\{y-\mu(i)\}\biggr]$$
(5.1)

The parameters \(\{p(i), \mu(i), i \in\mathcal{I} ; \varSigma\}\), that is, the cell probability and mean vector for each cell i and the common covariance matrix, are called the moment parameters.

It is convenient to represent (5.1) in exponential family form as

$$ f(i,y) = \exp\biggl\{ g(i) + h(i)^{\top}y - \frac{1}{2} y^{\top}K y \biggr\}$$

(5.2)

The parameters \(\{g(i), h(i), i \in\mathcal{I} ; K\}\) are called the canonical parameters. Note that the canonical parameters have the same dimensions as the moment parameters: for each i, g(i) is a scalar (the discrete canonical parameter) and h(i) is a q-vector (the linear canonical parameter); also, the concentration matrix K is a symmetric positive definite q×q matrix.

Occasionally it is convenient to use the mixed parameters which are given as \(\{p(i), h(i), i \in\mathcal{I} ; K\}\). We allow ourselves to write the parameters briefly as {p,μ,Σ}, {g,h,K} and {p,h,K}.

We can transform back and forth between the different parameterizations using the relations

$$K=\varSigma^{-1},\qquad h(i)=\varSigma^{-1}\mu(i),\qquad g(i)=\log p(i)-\frac{1}{2}\mu(i)^{\top}\varSigma^{-1}\mu(i)-\frac{1}{2}\log\det(\varSigma)-\frac{q}{2}\log(2\pi)$$

(5.3a)

and

$$\varSigma=K^{-1},\qquad \mu(i)=K^{-1}h(i),\qquad p(i)=\frac{\exp\bigl\{g(i)+\frac{1}{2}h(i)^{\top}K^{-1}h(i)\bigr\}}{\sum_{j\in\mathcal{I}}\exp\bigl\{g(j)+\frac{1}{2}h(j)^{\top}K^{-1}h(j)\bigr\}}.$$

(5.3b)
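As an illustration of (5.3a), the following sketch converts the moment parameters of a single cell to canonical form; the function name is ours, not part of gRbase or gRim:

```r
## moment parameters {p, mu, Sigma} of one cell -> canonical {g, h, K},
## following (5.3a); a sketch, not a gRim function
mom2can <- function(p, mu, Sigma) {
  K <- solve(Sigma)                 # concentration matrix
  h <- K %*% mu                     # linear canonical parameter
  q <- length(mu)
  g <- log(p) - sum(mu * (K %*% mu))/2 - log(det(Sigma))/2 - (q/2)*log(2*pi)
  list(g = g, h = h, K = K)
}

## tiny usage example with q = 2
mom2can(p = 0.25, mu = c(1, 2), Sigma = diag(2))
```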

5.4 Homogeneous Mixed Interaction Models

The homogeneous mixed interaction models, which for brevity we here refer to as MI-models, are defined by constraining the canonical parameters of CG-densities so as to follow factorial expansions.

For example, let Δ={A,B} and Γ={X,Z} and let the levels of the factors A and B be denoted j and k. So in this case i=(j,k) and y=(x,z). The joint density can be written

$$ f(i,y)= \exp\biggl\{g(i) + h^x(i)x + h^z(i)z - \frac{1}{2} (k_{xx} x^2 + 2k_{xz} xz + k_{zz}z^2) \biggr\}$$
(5.4)

and we can write the unrestricted (or saturated) model as

$$g(i) = u + u^{a}_{j} + u^{b}_{k} + u^{ab}_{jk}$$

(5.5)

$$h^{x}(i) = v + v^{a}_{j} + v^{b}_{k} + v^{ab}_{jk}$$

(5.6)

$$h^{z}(i) = w + w^{a}_{j} + w^{b}_{k} + w^{ab}_{jk}$$

(5.7)

$$K = \begin{pmatrix} k_{xx} & k_{xz}\\ k_{xz} & k_{zz}\end{pmatrix}$$

(5.8)

where the u’s, v’s and w’s are interaction terms. In this model g(i), \(h^{x}(i)\) and \(h^{z}(i)\) are unrestricted functions of the cells i=(j,k). Estimating the interaction terms uniquely would require further constraints, but we do not pursue this here: we use the factorial expansions to constrain the way the canonical parameters vary over \(\mathcal{I}\), and are not usually interested in the values of the interaction terms per se.

Models are defined by setting certain interaction terms to zero. The usual hierarchical rule, that if a term is set to zero then all higher-order terms must also be zero, is respected. So by this rule, if we set \(v^{a}_{j}\) to zero for all j, we must also set \(v^{ab}_{jk}\) to zero for all j and k.

Conditional independence constraints can be imposed by setting interaction terms to zero. For example, to make \(A\perp\!\!\!\perp X \mid (B,Z)\) we must set all terms involving A and X in (5.4) to zero, that is, \(v^{a}_{j}=v^{ab}_{jk}=0, \ \forall j,k\). To make \(A\perp\!\!\!\perp B \mid (X,Z)\) we must set all terms involving A and B to zero, i.e., \(u^{ab}_{jk}=v^{ab}_{jk}=w^{ab}_{jk}=0, \ \forall j,k\). Finally, to obtain \(X\perp\!\!\!\perp Z \mid (A,B)\) we set \(k_{xz}=0\).

For example, consider again the milkcomp1 data introduced in Sect. 5.2.

The CGstats() function calculates the number of observations and the means of the continuous variables for each cell i, as well as (by default) a common covariance matrix:

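A sketch of the call (assuming CGstats() accepts the full data frame, with its discrete and continuous columns detected automatically):

```r
library(gRim)
data(milkcomp1, package = "gRbase")
SS <- CGstats(milkcomp1)
SS
```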

Note that the mean of fat (and to a lesser extent of protein) varies over the treatments whereas the lactose means seem to be more or less constant. The coefficients of variation are:

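One way to compute them (a sketch, assuming SS$center is the variables-by-cells matrix of means returned by CGstats()):

```r
## coefficient of variation of each variable's cell means across treatments
apply(SS$center, 1, sd) / apply(SS$center, 1, mean)
```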

The corresponding canonical parameters are

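A sketch using the relations (5.3a) directly, so that we do not rely on any particular gRim helper:

```r
## K = Sigma^{-1} is common to all cells; h(i) = Sigma^{-1} mu(i), one column per cell
K <- solve(SS$cov)
h <- K %*% SS$center
K
h
```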

Let j refer to a level of the treatment factor. Then h(j) takes the form

$$h(j) = (h^{\mathtt{fat}}(j), h^{\mathtt{protein}}(j), h^{\mathtt{lactose}}(j)).$$

The coefficients of variation for h are

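(computed as above, from the matrix h just constructed)

```r
apply(h, 1, sd) / apply(h, 1, mean)
```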

which suggests that \(h^{\mathtt{lactose}}(j)\) is constant as a function of j; that is

$$h(j) = (h^{\mathtt{fat}}(j), h^{\mathtt{protein}}(j), h^{\mathtt{lactose}}).$$

If we insert this in (5.2) and use the factorization criterion (1.1) we find that \(\mathtt{lactose}\perp\!\!\!\perp\mathtt{treat}\mid(\mathtt{fat},\mathtt{protein})\).

The partial correlation matrix is more informative than the concentration matrix:

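For example, using cov2pcor() from gRbase on the common covariance matrix:

```r
## partial correlations implied by the common covariance matrix
cov2pcor(SS$cov)
```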

This suggests that the partial correlation between fat and lactose is zero. If we set \(k_{\mathtt{fat},\mathtt{lactose}}=0\) in (5.2) and use the factorization criterion we find that \(\mathtt{fat}\perp\!\!\!\perp\mathtt{lactose}\mid(\mathtt{treat},\mathtt{protein})\).

5.5 Model Formulae

In this section we describe how to specify MI-models using model formulae and show how they may be represented as dependence graphs. Here and later we refer to the models and graphs shown in Table 5.1.

Table 5.1 Some homogeneous mixed interaction models

As we have seen above in Sect. 5.4, we define an MI-model by constraining g(i) and the \(h^{u}(i)\) for \(u\in\varGamma\) to satisfy factorial expansions, and by constraining some off-diagonal elements of K to be zero. So in principle we can define an MI-model by giving a list of generating classes, one for g(i) and one for \(h^{u}(i)\) for each \(u\in\varGamma\), together with a list of the off-diagonal elements of K that are allowed to be non-zero. Together these specifications define an MI-model, although some restrictions between the different components are necessary, as we describe below.

To give all these generating classes would be very cumbersome, however. It is much more convenient to specify a model using a single generating class \(\mathcal{C}= \{G_{1},\ldots, G_{m}\}\), with \(G_{j}\subseteq V\) for each \(j=1,\ldots,m\). We now explain how this is done.

We use the following convention. We write a generator G as a pair (a,b), where \(a=G\cap\varDelta\) are the discrete variables and \(b=G\cap\varGamma\) the continuous variables of G. For \(a\subseteq\varDelta\), by \(g_{a}(i_{a})\) we mean a function which depends on a cell i only through \(i_{a}\). Let q be the number of variables in Γ and suppose that y is a q-vector. For \(b\subseteq\varGamma\) we write the corresponding subvector of y as \(y_{b}\). Furthermore, we take \([y_{b}]\) to mean the q-vector obtained by padding \(y_{b}\) with zeros in the right places to obtain full dimension.

Using this convention we can define the restrictions which a generating class \(\mathcal{C}\) imposes on a general (homogeneous) CG-density.

1.

    The discrete canonical parameter g(i) is constrained to follow the factorial expansion

    $$g(i) = \sum_{(a,b)\in\mathcal{C}} g_a(i_a)$$

    That is to say, the generators for g(i) are the maximal elements of \(\{a \mid(a,b) \in\mathcal{C}\}\), which we write compactly as \(\max(\{ a \mid(a,b) \in\mathcal{C}\})\). These are called the discrete generators of the model.

2.

    The linear canonical parameter h is constrained to follow the factorial expansion

    $$h(i) = \sum_{(a,b)\in\mathcal{C}} [h^b_a(i_a)].$$

    It follows that \(h(i)^{\top}y = \sum_{(a,b)\in\mathcal{C}}h^{b}_{a}(i_{a})^{\top}y_{b}\). For each \(u\in\varGamma\), the generators for \(h^{u}(i)\) are \(\mathcal{C}^{u}=\max(\{ a \mid(a,b) \in\mathcal{C}\wedge u \in b\})\); that is, the discrete components of those generators containing u. These are termed the linear generators of the model.

3.

    Finally, the quadratic canonical parameter K is constrained as follows: elements \(k_{uv}\) of K are set to zero unless \(\{u,v\}\subseteq b\) for some generator \((a,b)\in\mathcal{C}\). The sets \(\{b\mid(a,b) \in\mathcal{C}\}\) induce a graph whose edges correspond to those \(k_{uv}\) which are not set to zero. The cliques of this graph are called the quadratic generators of the model.

For example, the last model in Table 5.1 has the generating class

$$\{(A,B),(A,Z),(B,X,Z)\}.$$

The derived formulae for g(i), \(h^{x}(i)\) and \(h^{z}(i)\) are {(A,B)}, {(B)} and {(A),(B)}, respectively. Hence g(i) is unrestricted, \(h^{x}(i)\) satisfies \(h^{x}(i) = v+v^{b}_{k}\) for all i=(j,k), and \(h^{z}(i)\) satisfies \(h^{z}(i) = w+w^{a}_{j}+w^{b}_{k}\) for all i=(j,k). Since \(\{X,Z\}\subseteq\{B,X,Z\}\), \(k_{xz}\) is not set to zero.

It can be shown that to ensure location and scale invariance, the formula for g(i) must be “larger” than the formulae for each \(h^{u}(i)\), in the sense that each generator for \(h^{u}(i)\) must be contained in a generator for g(i). This constraint is automatically fulfilled by the above construction.

The model formula notation for MI-models used here has the disadvantage that distinct formulae can specify the same model. For example, if Δ={I} and Γ={X,W,Z} then the formulae I*X*W+X*W*Z and I*X*W+X*Z+W*Z give identical models. This is not usually problematic, but it can impact the efficiency of the iterative estimation procedure, as we describe later. We can define a particular representation, termed the maximal form of the model. This has generators defined as the maximal sets \(\mathcal{A}\subseteq \varDelta \cup\varGamma\) such that:

1.

    \(\mathcal{A}\cap \varDelta \) is contained in a generator of g(i),

2.

    for each \(u \in \mathcal{A}\cap\varGamma\), \(\mathcal{A}\cap \varDelta \) is contained in a generator of h u(i), and

3.

    for each \(u,v \in\mathcal{A}\cap\varGamma\) with u≠v, \(k_{uv}\) is not set to zero.

For example, I*X*W+X*W*Z is of maximal form but I*X*W+X*Z+W*Z is not.

The mmod() function in the gRim package allows MI-models to be defined using model formulae. For example, to define the model for the milk composition dataset with the conditional independences arrived at in Sect. 5.4, we specify the generating class with generators {treat,fat,protein} and {protein,lactose}, as follows:

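A sketch of the call; the colon-separated formula interface is the one the text describes:

```r
library(gRim)
milkmod <- mmod(~ treat:fat:protein + protein:lactose, data = milkcomp1)
milkmod
```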

The discrete, linear and quadratic generators of the model are

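One way to inspect them is a sketch like the following; the component name modelinfo is an assumption about gRim's internal object layout:

```r
## str() on the fitted object reveals how the generators are stored;
## the 'modelinfo' component name is an assumption
str(milkmod$modelinfo, max.level = 1)
```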

To construct the dependence graph of an MI-model defined using such a formula, we connect with an edge all variable pairs appearing in the same generator. By convention, discrete variables are drawn with filled circles and continuous variables with hollow circles. The global Markov property (Sect. 1.3) can be used for reading conditional independencies from the dependence graph in the usual way. For example, the dependence graph for the model milkmod just discussed is shown in Fig. 5.1. It can be obtained using the plot function:

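A sketch:

```r
## draw the dependence graph of the fitted model
plot(milkmod)
```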
Fig. 5.1

Mixed interaction model for milk composition data. Discrete variables are shown as grey nodes while continuous variables are white

5.6 Graphical and Decomposable MI-models

Suppose we are given an undirected graph with vertex set \(\varDelta\cup\varGamma\), and consider the MI-model for \(\varDelta\cup\varGamma\) whose generators are the cliques of the graph. An MI-model that can be formed in this way is termed a graphical MI-model. Table 5.1 shows some graphical MI-models.

As with log-linear models, it is possible to set higher-order interaction terms to zero without introducing new conditional independence relations. Such models are called non-graphical. For example, consider model (b) in Table 5.1. Since the generators of the formula correspond to the cliques of the graph, the model is graphical. The model implies that the term \(h^{y}(i)\) is unrestricted, say as

$$h^y(i)=w+w^{a}_{j}+w^{b}_{k}+w^{ab}_{jk}.$$

If we constrain \(w^{ab}_{jk}=0, \ \forall j,k\), then \(h^{y}(i)\) has the additive form \(h^{y}(i)=w+w^{a}_{j}+w^{b}_{k}, \ \forall j,k\). This does not correspond to a conditional independence restriction, but results in model (f) in Table 5.1. So model (f) is non-graphical. Since no further conditional independence restrictions have been added, model (f) has the same dependence graph as model (b).

We now turn to a subclass of the graphical MI-models, the decomposable MI-models. These build on a more basic concept, that of a decomposition, which we describe first.

The notion of a decomposition of a graph \(\mathcal{G}\) with mixed variables relates to the question of how and when the analysis of a graphical MI-model may be broken down into analyses of smaller models. This notion is slightly more elaborate than in the purely discrete and purely continuous cases. Let A, B and S be disjoint non-empty subsets of V such that \(A\cup B\cup S=V\). We define (A,B,S) to be a decomposition of \(\mathcal{G}\) if the following conditions hold:

1.

    A and B are separated by S in \(\mathcal{G}\),

2.

    S is complete in \(\mathcal{G}\), and

3.

    \(S\subseteq\varDelta\) or \(B\subseteq\varGamma\).

It can be shown that when (A,B,S) is a decomposition of \(\mathcal {G}\), the maximum likelihood estimator \(\hat{f}\) of the density of the graphical MI-model with dependence graph \(\mathcal{G}\) is given by

$$\hat{f}=\frac{\hat{f}_{[A\cup S]}\hat{f}_{[B\cup S]}}{\hat{f}_{[S]}}$$

where \(\hat{f}_{[A\cup S]}\), \(\hat{f}_{[B\cup S]}\), \(\hat{f}_{[S]}\) are the estimates of densities based on the models corresponding to the relevant induced subgraphs and based on marginal data only. Indeed they are weak marginals of \(\hat{f}\), see Sect. 5.7.5.1 below.

A graph with mixed variables \(\mathcal{G}\) is called decomposable if it is complete or it can be successively decomposed into complete graphs.

Various characterizations of graphs with this property are useful. One is based on the forbidden path property: a forbidden path is a path between two non-adjacent discrete vertices that passes through only continuous vertices. It can be shown that a graph is decomposable if and only if it is triangulated and has no forbidden paths. A simple example of a graph with mixed variables that is not decomposable is:

figure m

Another characterization is that the cliques of a decomposable graph with mixed variables can be ordered as \((C_{1},\ldots,C_{k})\) satisfying a modified version of the running intersection property. For j>1 define \(H_{j} = \bigcup_{t=1}^{j-1} C_{t}\) and \(S_{j}=C_{j}\cap H_{j}\). The modified condition is that

1.

    for each j>1, \(S_{j}\subseteq C_{i}\) for some i<j, and

2.

    for each j>1 it holds that \(C_{j}\setminus S_{j}\subseteq\varGamma\) or \(S_{j}\subseteq\varDelta\).

The additional condition (2) states that continuous variables cannot be prior to discrete ones. A graph with mixed variables is decomposable if and only if there exists an ordering of its cliques fulfilling conditions (1) and (2).

A decomposable MI-model is a graphical MI-model whose dependence graph is decomposable. For such a model, the maximum likelihood estimates take the closed form

$$ \hat{f}(x) = \prod_{j=1}^k \frac{\hat{f}_{[C_j]} (x_{C_j})}{\hat {f}_{[S_j]} (x_{S_j})}$$
(5.9)

where we have let S 1=∅ and \(\hat{f}_{\emptyset}=1\).

To check whether a graph with mixed variables is decomposable, the so-called star graph construction can be used. That is, let \(\mathcal{G}^{\star}\) be a new graph obtained by adding an extra vertex, ⋆, to \(\mathcal{G}\) and adding edges between ⋆ and all discrete variables. Then \(\mathcal{G}^{\star}\) is triangulated (which can be checked with maximum cardinality search) if and only if \(\mathcal{G}\) is decomposable.

It can also be shown that a graph \(\mathcal{G}\) with mixed variables is decomposable if and only if the vertices of \(\mathcal{G}\) can be given a perfect ordering. For such graphs this is defined as an ordering \((v_{1},v_{2},\ldots,v_{T})\) such that (i) \(S_{k}=\mathrm{ne}(v_{k})\cap\{v_{1},v_{2},\ldots,v_{k-1}\}\) is complete in \(\mathcal{G}\) and (ii) \(S_{k}\subseteq\varDelta\) if \(v_{k}\in\varDelta\). The mcsmarked() function is based on constructing \(\mathcal{G}^{\star}\) as described above and returns a perfect ordering if the graph is decomposable.

As an example consider the following two graphs shown in Fig. 5.2. If a and d are discrete and b and c are continuous then the graph on the left is not decomposable whereas the graph on the right is. Note that since a graph object contains no information about whether the nodes are discrete or continuous, mcsmarked() has to be supplied this information explicitly.

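The following sketch illustrates the idea. The two graphs are our guess at Fig. 5.2 (a four-cycle a–b–d–c with chord b–c on the left and chord a–d on the right); the point is the usage of mcsmarked() with an explicit discrete argument:

```r
library(gRbase)
uG1 <- ug(~ a:b + b:d + c:d + a:c + b:c)  # chord between the continuous vertices
uG2 <- ug(~ a:b + b:d + c:d + a:c + a:d)  # chord between the discrete vertices
mcsmarked(uG1, discrete = c("a", "d"))    # fails to find a perfect ordering
mcsmarked(uG2, discrete = c("a", "d"))    # returns a perfect ordering
```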
Fig. 5.2

Decomposable graphs with mixed variables. If a and d are discrete and b and c are continuous then the first graph is not decomposable whereas the second graph is

5.7 Maximum Likelihood Estimation

In this section we derive expressions for the likelihood and describe algorithms that maximize it.

5.7.1 Likelihood and Deviance

In this section we derive some expressions for the likelihood and the deviance. The log density can be written as

$$\log f(i,y) = \log p(i) - \frac{q}{2}\log(2\pi) - \frac{1}{2}\log\det(\varSigma) - \frac{1}{2}\{y-\mu(i)\}^{\top}\varSigma^{-1}\{y-\mu(i)\}$$

so the log-likelihood of a sample \((i^{\nu},y^{\nu})\), \(\nu=1,\ldots,N\) is

$$\ell = \sum_{i\in\mathcal{I}} n(i)\log p(i) - \frac{Nq}{2}\log(2\pi) - \frac{N}{2}\log\det(\varSigma) - \frac{1}{2}\sum_{\nu=1}^{N}\{y^{\nu}-\mu(i^{\nu})\}^{\top}\varSigma^{-1}\{y^{\nu}-\mu(i^{\nu})\}.$$

We can simplify the last term using the identity

$$\sum_{\nu=1}^{N}\{y^{\nu}-\mu(i^{\nu})\}^{\top}\varSigma^{-1}\{y^{\nu}-\mu(i^{\nu})\} = \sum_{i\in\mathcal{I}} n(i)\bigl[\operatorname{tr}\{\varSigma^{-1}S_{i}\} + \{\bar{y}(i)-\mu(i)\}^{\top}\varSigma^{-1}\{\bar{y}(i)-\mu(i)\}\bigr]$$

where \(S_{i}\) is the sample covariance matrix of the observations in cell i. So an alternative expression for the log-likelihood is

$$\ell = \sum_{i\in\mathcal{I}} n(i)\log p(i) - \frac{Nq}{2}\log(2\pi) - \frac{N}{2}\log\det(\varSigma) - \frac{1}{2}\sum_{i\in\mathcal{I}} n(i)\operatorname{tr}\{\varSigma^{-1}S_{i}\} - \frac{1}{2}\sum_{i\in\mathcal{I}} n(i)\{\bar{y}(i)-\mu(i)\}^{\top}\varSigma^{-1}\{\bar{y}(i)-\mu(i)\}.$$
The full homogeneous model has MLEs \(\hat{p}(i)=n(i)/N\), (so that \(\hat{m}(i)=N\hat{p}(i)\)), \(\hat{\mu}(i)=\bar{y}(i)\), and \(\hat{\varSigma}=S=\sum_{i} n(i)S_{i}/N\), so the maximized log likelihood for this model is

$$\hat{\ell}_s=\sum_in(i)\log\{n(i)/N\}-Nq\log(2\pi)/2 -N\log\det (S)/2-Nq/2, $$
(5.10)

and the deviance of a homogeneous model \(\mathcal{M}\) with MLEs \(\hat{p}(i)\), \(\hat{\mu}(i)\) and \(\hat{\varSigma}\) with respect to the full homogeneous model simplifies to

$$D = 2\sum_{i\in\mathcal{I}} n(i)\log\frac{n(i)}{N\hat{p}(i)} + N\log\frac{\det(\hat{\varSigma})}{\det(S)} + N\operatorname{tr}(S\hat{\varSigma}^{-1}) - Nq + \sum_{i\in\mathcal{I}} n(i)\{\bar{y}(i)-\hat{\mu}(i)\}^{\top}\hat{\varSigma}^{-1}\{\bar{y}(i)-\hat{\mu}(i)\}.$$

Note that in contrast to the models considered in Chap. 4, we do not necessarily have \(\operatorname{tr}(S\hat{\varSigma}^{-1})=q\), so the term \(N\operatorname{tr}(S\hat{\varSigma}^{-1})-Nq\) does not disappear.

5.7.2 Dimension of MI-models

The dimension of a mixed interaction model may be calculated simply, by adding the dimensions of the component models for g(i) and for each \(h^{u}(i)\) to the number of free elements of the covariance matrix, and finally subtracting one for the normalisation constant.

5.7.3 Inference

Under \(\mathcal{M}\), the deviance D is asymptotically χ 2(k) where the degrees of freedom k is the difference in dimension (number of free parameters) between the saturated model and \(\mathcal{M}\). Similarly, for two nested models \(\mathcal{M}_{1} \subseteq\mathcal{M}_{2}\), the deviance difference D 1D 2 is asymptotically χ 2(k) where the degrees of freedom k is the difference in dimension (number of free parameters) between the two models.

5.7.4 Likelihood Equations

Suppose we have a sample of N independent, identically distributed observations \((i^{\nu},y^{\nu})\) for \(\nu=1,\ldots,N\). Let \((n(i),t(i),\bar{y}(i))_{i \in\mathcal{I}}\) be the observed counts, variate totals and variate means for cell i, and let SS and S be the matrices of uncorrected sums of squares and of sample covariances, i.e.

$$n(i)=\#\{\nu: i^{\nu}=i\},\qquad t(i)=\sum_{\nu: i^{\nu}=i} y^{\nu},\qquad \bar{y}(i)=t(i)/n(i),$$

$$SS=\sum_{\nu=1}^{N} y^{\nu}(y^{\nu})^{\top},\qquad S=\frac{1}{N}\sum_{\nu=1}^{N}\{y^{\nu}-\bar{y}(i^{\nu})\}\{y^{\nu}-\bar{y}(i^{\nu})\}^{\top}.$$

For \(a\subseteq\varDelta\), we write the marginal cell corresponding to i as \(i_{a}\), and likewise for \(b\subseteq\varGamma\) we write the subvector of y as \(y_{b}\). Similarly, we write the marginal cell counts as \(\{n({i_{a}})\}_{i_{a} \in\mathcal{I}_{a}}\), marginal variate totals as \(\{t^{b}({i_{a}})\}_{i_{a} \in\mathcal{I}_{a}}\) and marginal variate means as \(\{\bar{y}_{b}({i_{a}})\}_{i_{a} \in\mathcal{I}_{a}}\). Define

$$SSD^b_a(i_a)=\sum_{\nu:i^\nu_a=i_a} \{y^{\nu}_b-\bar{y}_b(i_a)\}\{y^{\nu}_b-\bar{y}_b(i_a)\}^{\top}$$

and let

$$SSD^b_a = \sum_{i_{a} \in\mathcal{I}_{a}} SSD^b_a(i_a) = SS^b - \sum _{i_{a} \in\mathcal{I}_{a}} n(i_a)\bar{y}_b(i_a)\bar{y}_b(i_a)^{\top}$$

where SS b is the b-submatrix of the sums-of-squares matrix SS.

The log-likelihood for the sample is

$$\ell = \sum_{i\in\mathcal{I}} n(i)g(i) + \sum_{i\in\mathcal{I}} h(i)^{\top}t(i) - \frac{1}{2}\sum_{u\in\varGamma}\sum_{v\in\varGamma} k_{uv}SS_{uv}$$

(5.11)

where in the last term there is a contribution from \(SS_{uv}\) only if \(k_{uv}\neq 0\), that is, if \(\{u,v\}\subseteq b\) for some generator \((a,b)\in\mathcal{C}\).

Consider now a given model with generators \(\mathcal{C}= \{G_{1}, \dots,G_{m}\}\) and derive the formulae for g(i) and each \(h^{u}(i)\) as described in Sect. 5.5. Then a set of minimal sufficient statistics is given by

1.

    A set of marginal tables of cell counts \(\{n(i_{a})\}_{i_{a} \in\mathcal{I}_{a}}\) for each discrete generator a.

2.

    For each \(u\in\varGamma\), a set of marginal variate totals \(\{t^{u}(i_{a})\}_{i_{a} \in\mathcal{I}_{a}}\) for each linear generator a of u.

3.

    A set of marginal tables of uncorrected sums of squares \(\{SS^{b}\}\) for each quadratic generator b.

From exponential family theory, we know that the MLE of {p(i),μ(i),Σ} can be found by equating the expectations of these minimal sufficient statistics to their observed values. Equating the minimal sufficient statistics to their observed values for a generator (a,b) yields:

$$n(i_{a}) = N\sum_{j\in\mathcal{I}: j_{a}=i_{a}} p(j), \quad i_{a}\in\mathcal{I}_{a}$$

(5.12)

$$t^{b}(i_{a}) = N\sum_{j\in\mathcal{I}: j_{a}=i_{a}} p(j)\mu^{b}(j), \quad i_{a}\in\mathcal{I}_{a}$$

(5.13)

$$SS^{b} = N\sum_{j\in\mathcal{I}} p(j)\bigl\{\varSigma^{bb} + \mu^{b}(j)\mu^{b}(j)^{\top}\bigr\}$$

(5.14)

Each generator \((a,b)\in\mathcal{C}\) defines a set of equations of the form (5.12)–(5.14) and the collection of these equations are the likelihood equations for the model. The MLEs, when they exist, are the unique solution to these equations that also satisfy the model constraints.

For example, for the saturated model on \(V=\varDelta\cup\varGamma\), we set a=Δ and b=Γ. Here there are no model constraints, and from the equations we find that the MLEs are given by \(\hat{p}(i)=n(i)/N\), \(\hat{\mu}(i)= \bar{y}(i)\) and \(\hat{\varSigma}=S\).

5.7.5 Iterative Proportional Scaling

As with discrete log-linear models and graphical Gaussian models, iterative methods to find the maximum likelihood parameter estimates are generally necessary. The iterative proportional scaling algorithm for mixed interaction models proceeds by equating observed and expected margins, in much the same way as with discrete and continuous models. An important conceptual difference, however, relates to marginalization. Whereas multinomial and Gaussian distributions are preserved under marginalization, the same is not generally true in the mixed case: the marginal distribution of a CG-distribution is not necessarily CG. For this reason the concept of weak marginals is needed.

5.7.5.1 Weak Marginals

Consider a CG-density \(f_{V}\) defined over the variables \(V=\varDelta\cup\varGamma\). Letting \(a\subseteq\varDelta\) and \(b\subseteq\varGamma\), we wish to obtain the marginal density \(f_{a\cup b}\). This density is obtained by first integrating over \(y_{\varGamma\setminus b}\) to produce \(f_{\varDelta\cup b}\), which is again a CG-density. The next step is to sum over \(i_{\varDelta\setminus a}\) to form \(f_{a\cup b}\). This summation may involve forming a mixture of normal densities, which does not generally have the form of a CG-density. However, even though \(f_{a\cup b}\) is not in general a CG-density, we can find its moments using standard formulae, namely

$$p_{[a]}(i_{a})=\sum_{j: j_{a}=i_{a}} p(j),\qquad \mu^{b}_{[a]}(i_{a})=\frac{1}{p_{[a]}(i_{a})}\sum_{j: j_{a}=i_{a}} p(j)\mu^{b}(j)$$

and

$$\varSigma^{b}_{[a]}(i_{a})=\varSigma^{bb}+\frac{1}{p_{[a]}(i_{a})}\sum_{j: j_{a}=i_{a}} p(j)\{\mu^{b}(j)-\mu^{b}_{[a]}(i_{a})\}\{\mu^{b}(j)-\mu^{b}_{[a]}(i_{a})\}^{\top}.$$

These moments \(\{ p_{[a]}(i_{a}), \mu^{b}_{[a]}(i_{a}), \varSigma^{b}_{[a]}(i_{a})\}_{i_{a} \in\mathcal{I}_{a}} \) define a CG-density \(f_{[a\cup b]}\), denoted the weak marginal density (which is not homogeneous).

Furthermore, we define the homogeneous weak marginal variance to be

$$\varSigma^{b}_{[a]} = \sum_{i_{a}\in\mathcal{I}_{a}} p_{[a]}(i_{a})\varSigma^{b}_{[a]}(i_{a}).$$

The moments \(\{ p_{[a]}(i_{a}), \mu^{b}_{[a]}(i_{a}), \varSigma^{b}_{[a]}\}_{i_{a} \in\mathcal{I}_{a}} \) define a CG-density \(f^{h}_{[a\cup b]}\) which is denoted the homogeneous weak marginal density.

The weak marginal density is the CG-density which best approximates the true marginal \(f_{a\cup b}\) in the sense of minimizing the Kullback–Leibler distance; see Lauritzen (1996), p. 162. The same proof yields that the analogous statement holds for the homogeneous weak marginal.
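To make the formulae concrete, the following sketch computes the weak marginal moments when all discrete variables are summed out (a=∅, b=Γ), for a toy two-cell example; none of this is gRim code:

```r
## two cells, q = 2 continuous variables
p     <- c(0.4, 0.6)               # cell probabilities p(i)
mu    <- cbind(c(0, 1), c(2, 3))   # q x |I| matrix with mu(i) in the columns
Sigma <- diag(2)                   # common covariance matrix

mu.w    <- as.vector(mu %*% p)     # weak marginal (mixture) mean
B       <- sweep(mu, 1, mu.w)      # deviations mu(i) - mu.w
Sigma.w <- Sigma + B %*% diag(p) %*% t(B)  # homogeneous weak marginal variance
mu.w
Sigma.w
```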

5.7.5.2 Likelihood Equations Revisited

It is illustrative to rewrite the likelihood equations as follows. Observe that

$$SS^{b} = SSD^{b}_{a} + \sum_{i_{a}\in\mathcal{I}_{a}} n(i_{a})\bar{y}_{b}(i_{a})\bar{y}_{b}(i_{a})^{\top}.$$

(5.15)

Using the definitions of the parameters of weak marginal models, (5.12) and (5.13) imply that

$$ n(i_a)/N = p_{[a]}(i_a), \qquad\bar{y}^b(i_a) = t^b(i_a)/n(i_a) = \mu _{[a]}^b(i_a).$$
(5.16)

Using (5.15) and (5.16) we get from (5.14) that

$$SSD^{b}_{a}/N = \varSigma^{b}_{[a]}.$$
The MLEs under the saturated MI-model for the variables \(a\cup b\) (whose density is denoted \(\tilde{f}_{a\cup b}\)) are \(\{\tilde{p}_{a}(i_{a}), \tilde{\mu}^{b}_{a}(i_{a}), \tilde{S}^{b}_{a}\}_{i_{a} \in \mathcal{I}_{a}} \), where

$$\tilde{p}_a(i_a)=n(i_a)/N, \qquad \tilde{\mu}^b_a(i_a)=\bar{y}^b(i_a)\quad \mbox{and}\quad \tilde{S}^b_a=SSD^b_a/N.$$

In other words, the likelihood equations are:

$$\tilde{p}_{a}(i_{a}) = p_{[a]}(i_{a}), \quad i_{a}\in\mathcal{I}_{a}$$

(5.17)

$$\tilde{\mu}^{b}_{a}(i_{a}) = \mu^{b}_{[a]}(i_{a}), \quad i_{a}\in\mathcal{I}_{a}$$

(5.18)

$$\tilde{S}^{b}_{a} = \varSigma^{b}_{[a]}$$

(5.19)

thus the homogeneous weak marginal density on \(a\cup b\) should be identical to the density of the saturated MI-model on \(a\cup b\), i.e. \(f^{h}_{[a\cup b]} = \tilde{f}_{a\cup b}\).

5.7.5.3 General IPS Update Step

Here we describe the iterative algorithm for general MI-models implemented in gRim and MIM (Edwards 2000). Equations (5.17)–(5.19) suggest the following IPS update step for a generator (a,b):

$$ f^*(i,y) \propto f(i,y) \frac{f^{\mathrm{sat}}_{a\cup b}(i_a,y_b)}{f^h_{[a\cup b]}(i_a,y_b)}$$
(5.20)

Note that the right-hand side of (5.20) will not in general be a density: integrating over \(y_{\varGamma\setminus b}\) and summing over \(i_{\varDelta\setminus a}\) gives

$$f_{a\cup b}(i_a,y_b)f^{\mathrm{sat}}_{a\cup b}(i_a,y_b) / f^h_{[a\cup b]}(i_a,y_b)$$

which will not be a density unless the marginal density \(f_{a\cup b}(i_{a},y_{b})\) equals the homogeneous weak marginal density \(f^{h}_{[a\cup b]}(i_{a},y_{b})\).

It is convenient to perform the update (5.20) on log-densities using the canonical parametrisation, since it then involves only addition and subtraction of canonical parameters. From (5.17)–(5.19), to update (g,h,K) we first transform the moment parameters \(\{\tilde{p}_{a},\tilde{\mu}^{b}_{a}, \tilde{S}^{b}_{a}\}\) and \(\{p_{[a]},\mu^{b}_{[a]},\varSigma^{b}_{[a]}\}\) of \(\tilde{f}_{a\cup b}\) and \(f^{h}_{[a\cup b]}\) to canonical parameters \((\tilde{g}_{a}, \tilde{h}^{b}_{a}, \tilde{K}^{b}_{a})\) and \((g_{[a]}, h^{b}_{[a]}, K^{b}_{[a]})\). Then we

1.

    Update g as follows: For each \({i_{a} \in\mathcal{I}_{a}} \) do for all j for which j a =i a :

    $$ g(j) \leftarrow g(j) + \{\tilde{g}_a(i_a) - g_{[a]}(i_a)\}.$$
    (5.21)
2.

    Update the b subvector of h as follows: For each \({i_{a} \in \mathcal{I}_{a}} \) do for all j for which j a =i a :

    $$ h^b(j) \leftarrow h^b(j) + \{\tilde{h}^b_a(i_a) - h^b_{[a]}(i_a) \}.$$
    (5.22)
3.

    Update the b submatrix K bb of K as follows:

    $$ K^{bb} \leftarrow K^{bb} + \{\tilde{K}^b_a - K^b_{[a]}\}.$$
    (5.23)

After the update steps (5.21)–(5.23) we know h and K and hence the conditional distribution of y given i. To complete the update we must transform (g,h,K) to moment form (p,μ,Σ), normalize p to sum to one and transform back to canonical form (g,h,K) again before moving on to the next generator. Running through the generators (a 1,b 1),(a 2,b 2),…,(a M ,b M ) as described above constitutes one cycle of the iterative fitting process.
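In outline, one cycle might be organized as below. This is pseudocode-style R: the helper functions are hypothetical stand-ins for gRim internals, and only the additive canonical updates (5.21)–(5.23) are spelled out.

```r
## one IPS cycle over the generators; 'can' holds the canonical parameters (g, h, K);
## mom2can(), sat_marginal(), weak_marginal(), cells(), cells_matching() and
## normalize() are hypothetical helpers, not gRim functions
ips_cycle <- function(can, data, generators) {
  for (gen in generators) {   # gen$a: discrete part, gen$b: continuous part
    sat  <- mom2can(sat_marginal(data, gen$a, gen$b))  # canonical parms of saturated marginal
    weak <- mom2can(weak_marginal(can, gen$a, gen$b))  # canonical parms of homog. weak marginal
    for (ia in cells(gen$a)) {
      j <- cells_matching(can, gen$a, ia)              # all cells j with j_a = i_a
      can$g[j] <- can$g[j] + (sat$g[[ia]] - weak$g[[ia]])                # (5.21)
      can$h[gen$b, j] <- can$h[gen$b, j] + (sat$h[, ia] - weak$h[, ia])  # (5.22)
    }
    can$K[gen$b, gen$b] <- can$K[gen$b, gen$b] + (sat$K - weak$K)        # (5.23)
    can <- normalize(can)    # convert to moment form, rescale p, convert back
  }
  can
}
```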

A measure of how much the updates (5.21)–(5.23) change the parameter estimates may be obtained by comparing the moments of \(\tilde{f}_{a\cup b}\) and \(f^{h}_{[a\cup b]}\). Following Edwards (2000) we use the quantity:

$$\begin{aligned}[b]\mathtt{mdiff}(a,b)=\max_{{i_{a} \in\mathcal{I}_{a}} ,u,v\in b}\biggl\{&\frac {N | p_{[a]}(i_a)-\tilde{p}_a(i_a)|}{\sqrt{N p_{[a]}(i_a)+1}},\frac {|\mu^u_{[a]}(i_a)-\tilde{\mu}^u_{a}(i_a)|}{\sqrt{(\varSigma^b_{[a]})_{uu}}}, \\& \frac {|(\varSigma^b_{[a]})_{uv}-(\tilde{\varSigma}^b_{a})_{uv}|}{\sqrt{(\varSigma^b_{[a]})_{uu}(\varSigma^b_{[a]})_{vv}+(\varSigma^b_{[a]})_{uv}^2}}\biggr\}\end{aligned} $$
(5.24)

It sometimes happens that the updates (5.21)–(5.23) lead to a decrease in the likelihood. To avoid this situation we first calculate mdiff(a,b) in (5.24). If mdiff(a,b) is smaller than some prespecified criterion we do not update the model but proceed to the next generator. If this is true for all generators we exit the iterative process, as this essentially only happens when we are close to the MLE.

Since the estimation algorithm in the mmod() function is based on the model formula, which is not unique, there will be efficiency differences between the different representations of the same model. The maximal form is the most efficient.

5.7.5.4 Step-Halving Variant

It can happen that the updates (5.21)–(5.23) fail to increase the likelihood, or lead to a K that is not positive definite. The step-halving variant of the algorithm (currently not implemented in gRim) replaces the three update steps with

$$g(j) \leftarrow g(j) + \kappa\{\tilde{g}_{a}(i_{a}) - g_{[a]}(i_{a})\},\qquad h^{b}(j) \leftarrow h^{b}(j) + \kappa\{\tilde{h}^{b}_{a}(i_{a}) - h^{b}_{[a]}(i_{a})\},\qquad K^{bb} \leftarrow K^{bb} + \kappa\{\tilde{K}^{b}_{a} - K^{b}_{[a]}\}.$$

Initially κ=1. The update is attempted, and it is then checked whether (1) K is positive definite and (2) the likelihood has increased. If either condition fails, κ is halved and the update attempted again. The step-halving variant is therefore slower than the general algorithm. Edwards (2000, p. 312) shows an example with contrived data where step-halving is necessary.

5.7.5.5 Mixed Parameterisation Variant

If the model is collapsible onto the discrete variables, the estimate \(\hat{p}(i)\) is identical to the estimate obtained in the log-linear model with the same discrete generators, and this permits another variant, based on the mixed parametrisation {p,h,K}, to be used. The model is collapsible onto Δ if and only if every connected component of the subgraph induced by the continuous variables has a complete boundary in the subgraph induced by the discrete variables (Frydenberg 1990b). This variant is currently not implemented in gRim.

5.8 Using gRim

The function mmod() in the gRim package allows homogeneous mixed interaction models to be defined and fitted to data.

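A sketch, repeating the model from Sect. 5.5:

```r
library(gRim)
milkmod <- mmod(~ treat:fat:protein + protein:lactose, data = milkcomp1)
milkmod
```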

This model is shown in Fig. 5.1. More details about the model are obtained with

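presumably along these lines:

```r
summary(milkmod)
```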

The parameters are obtained using coef() where the desired parameterization can be specified. For example, the canonical parameters are

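a sketch, assuming coef() takes a type argument naming the parameterization:

```r
coef(milkmod, type = "ghk")   # canonical parameters (g, h, K)
```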

5.8.1 Updating Models

Models are changed using the update() method. A list with one or more of the components add.edge, drop.edge, add.term and drop.term is specified. The updates are made in the order given. For example:

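A sketch; the component names come from the text, but the right-hand-side formula syntax for naming edges is an assumption:

```r
milkmod2 <- update(milkmod, list(add.edge = ~ fat:lactose,
                                 drop.edge = ~ treat:protein))
milkmod2
```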

5.8.2 Inference

Functions such as ciTest(), testInEdges(), testOutEdges(), etc. behave more or less as for purely discrete and purely continuous variables; the sketches below indicate the intended usage.
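Here the set argument of ciTest() names the two variables to be tested followed by the conditioning variables, and the edge-testing functions are assumed to default to the edges and non-edges of the fitted model (assumptions consistent with their use in Chaps. 2 and 4):

```r
## test lactose _||_ treat | (fat, protein) in the milk data
ciTest(milkcomp1, set = c("lactose", "treat", "fat", "protein"))

## test the edges of milkmod for deletion, and the missing edges for addition
testInEdges(milkmod)
testOutEdges(milkmod)
```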

5.8.3 Stepwise Model Selection

The stepwise() function in the gRim package implements stepwise selection for mixed interaction models. The functionality is very similar to that described above in Sect. 2.4 and Sect. 4.4.1, for discrete graphical models and undirected graphical Gaussian models respectively. We refer to those sections for further details, and illustrate using the wine dataset described in Sect. 5.2. We start from the saturated model and use the BIC criterion:

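A sketch, assuming (as for the other model classes in the book) that the saturated model can be written ~.^. and that BIC is requested via the penalty k = log N:

```r
data(wine, package = "gRbase")
msat <- mmod(~ .^., data = wine)
mbic <- stepwise(msat, k = log(nrow(wine)))
mbic
```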

The selected model is shown in the printed output above; its dependence graph can be drawn with plot(mbic).

We note that the model is non-decomposable, since there are several chordless four-cycles in the graph. Since the graph is connected, it appears that all constituents differ over the grape varieties. Seven constituents are adjacent to the discrete variable. The model implies that these seven are sufficient to predict grape variety, since the remaining six are independent of variety given the seven, and so would not increase predictive ability.

5.9 An Example of Chain Graph Modelling

In this section we illustrate an approach that is appropriate when the variables have a clear overall ordering: some variables are prior or explanatory to others, which are themselves prior or explanatory to yet others, and so on. The variables can thus a priori be divided into blocks whose mutual ordering in this respect is clear. The goal of the analysis is to model the data respecting this ordering between blocks, but without assuming any ordering within blocks. Chain graph models fit this purpose well.

The Nutrimouse dataset described in Sect. 5.2 is used here as an example. The variables fall into three blocks: two discrete design variables (genotype and diet), 120 gene expression variables, and 21 lipid measurements. Clearly the design variables, which are subject to the control of the experimenter, are causally prior to the others. It is also natural, as a preliminary working hypothesis, to suppose that the gene expression measurements are causally prior to the lipid measurements, and this is the approach taken here. More advanced methods would be necessary to study whether there is evidence of influence in the opposite direction.

The chain graph is constructed using two graphical models: the first is for the gene expressions (block 2) given the design variables (block 1), and the second is for the lipids (block 3) given blocks 1 and 2. We use the gRapHD package described in Chap. 7. This package supports decomposable mixed models, both homogeneous and heterogeneous, exploiting the closed-form expressions for the MLEs (5.9). This restriction also means that models can simply be specified as graphs, rather than using model formulae.

To model the conditional distribution of block 2 given block 1, we restrict attention to models in which block 1 is complete, that is, in which there is an edge between the two design variables (see Fig. 4.33). The following code first finds the minimal BIC forest containing this edge, and then uses this as the initial model in a forward selection process to find the minimal BIC decomposable model. This takes a few seconds.

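A sketch of the two model searches. The column indices and the cond argument (which forces the listed vertex sets to be complete) are assumptions about the data layout and the gRapHD interface:

```r
library(gRapHD)
data(Nutrimouse, package = "gRbase")
D1  <- Nutrimouse[, 1:122]                # blocks 1 and 2: design variables + gene expressions
gF1 <- minForest(D1, cond = list(1:2))    # minimal BIC forest containing the block-1 edge
gD1 <- stepw(model = gF1, dataset = D1)   # forward selection to the minimal BIC decomposable model
```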

We display the two graphs in Figs. 5.3 and 5.4, using the same vertex coordinates for clarity. The vertex coordinates are saved in a matrix xyD1.

Fig. 5.3

A tree model for the gene expression variables (block 2) given the design variables (block 1)

Fig. 5.4

A decomposable model for the gene expression variables (block 2) given the design variables (block 1)

We now turn to modelling the conditional distribution of block 3 variables given the prior blocks. We adopt the same approach as before, first finding a minimal BIC forest and then using this as start model in a forward selection process. As before we restrict the search space to conditional models by including all edges between prior variables in the models considered. The forward selection process seeks the decomposable MI-model with minimum BIC in this search space.

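A sketch, in the same style as before:

```r
## all 143 variables; the block 1-2 variables (columns 1:122) are forced complete
gF2 <- minForest(Nutrimouse, cond = list(1:122))
gD2 <- stepw(model = gF2, dataset = Nutrimouse)
```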

The stepw() function is computationally intensive, taking around 10 minutes on an ordinary laptop running Windows. We display the decomposable model in Fig. 5.5.


Now we construct a graph gD3 by adding to gD1 those edges in gD2 that have a vertex in block 3:

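A sketch; that gRapHD model objects carry their edges as a two-column matrix in an edges slot is an assumption:

```r
## keep the edges of gD2 that have at least one vertex in block 3 (columns > 122)
E2   <- gD2@edges
keep <- (E2[, 1] > 122) | (E2[, 2] > 122)
gD3  <- gD1
gD3@edges <- rbind(gD1@edges, E2[keep, , drop = FALSE])
```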

Note that gD3 is an undirected graph rather than a chain graph. We use the igraph package to display it as a chain graph, using different colours for the blocks and displaying interblock edges as arrows, in a layout in which the different blocks are separated for clarity. See Fig. 5.6.

Fig. 5.5

A decomposable model for the lipid variables given the gene expression and design variables. The subgraph induced by the gene expression and design variables is complete, and is shown as a compact splat

Fig. 5.6

A chain graph model for the nutrimouse data. The design variables are shown as open circles, the gene expression variables as blue circles, and the lipid variables as red circles. The variables are shown as column numbers

5.10 Various

Several other R packages are designed for graphical modelling with mixed discrete and Gaussian variables. The package CoCo (Badsberg 1991) implements undirected graphical (and hierarchical) models with mixed variables. The package deal (Bøttcher and Dethlefsen 2003) allows a Bayesian analysis of models for mixed variables based on DAGs, using the conditional Gaussian distribution: prior distributions for the model parameters are set, and posterior distributions given data are derived. A heuristic search strategy for structural learning is also supported. The package RHugin also supports the use of Bayesian network models with mixed variables: see Chap. 3.