5.1 The Generalized Linear Model (GLM)

Log-linear models for contingency tables are members of the family of generalized linear models (GLMs). The GLM is a broad class of statistical models, introduced by Nelder and Wedderburn (1972), that allows for unified consideration and treatment of many models for different types of response variables and error structures. Characteristic special cases of the GLM are the models of regression, logistic regression, Poisson regression, and the log-linear models. The GLM is an extension of the classical regression model that relates a response variable Y to a set of q explanatory variables \(X_{j}\), \(j = 1,\ldots,q\), by equating a function of the expected response E(Y) to a linear predictor based on \(\mathbf{X} = (X_{1},\ldots,X_{q})\).

Under the classical linear regression model, if \(\mathbf{y} = (y_{1},\ldots,y_{n_{y}})^{\prime}\) is a sample of size \(n_{y}\) of the response variable Y and \(\mathbf{X} = \left (x_{ij}\right )_{n_{y}\times q}\) is the \(n_{y} \times q\) matrix with the corresponding sample values on the explanatory variables \(X_{j}\), \(j = 1,\ldots,q\), then in matrix notation we have

$$\displaystyle{\mathbf{y} = \mathbf{X}\boldsymbol{\beta }+\boldsymbol{\varepsilon },}$$

where \(\boldsymbol{\beta }= (\beta _{1},\ldots,\beta _{q})^{\prime}\) is the parameter vector and \(\boldsymbol{\varepsilon }= (\varepsilon _{1},\ldots,\varepsilon _{n_{y}})^{\prime}\) the vector of errors. The distributional assumptions are that (i) the \(Y_{i}\) are independently normally distributed with \(\text{E}(Y_{i}) =\mu _{i}\) (\(i = 1,\ldots,n_{y}\)) and common variance \(\text{Var}(Y_{i}) {=\sigma }^{2}\) and (ii) the errors are also independently normally distributed with zero mean and common variance \(\sigma _{\varepsilon }^{2}\). In summary, the regression model has a random component, the response variable Y, and a systematic component, the linear combination \(\mathbf{X}\boldsymbol{\beta }\) of the explanatory variables, that links to the vector of the expected response values, i.e.,

$$\displaystyle{ \boldsymbol{\mu }= \text{E}(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta }, }$$
(5.1)

with \(\boldsymbol{\mu }= (\mu _{1},\ldots,\mu _{n_{y}})^{\prime}\) and \(\mathbf{Y} = (Y _{1},\ldots,Y _{n_{y}})^{\prime}\).

The GLM extends the regression model by relaxing the assumption of a normally distributed response variable Y and by linking the systematic component not directly to \(\boldsymbol{\mu }\) but to a function \(g(\boldsymbol{\mu })\) of it. Thus, the systematic component of the GLM is

$$\displaystyle{ \boldsymbol{\eta }= g(\boldsymbol{\mu }) = \mathbf{X}\boldsymbol{\beta }, }$$
(5.2)

with \(\boldsymbol{\eta }= (\eta _{1},\ldots,\eta _{n_{y}})^{\prime}\). The function g is called the link function. The linear model (5.1) is a special case of (5.2) for the identity link, i.e., for \(\boldsymbol{\eta }= g(\boldsymbol{\mu }) =\boldsymbol{\mu }\).

Under GLM, the distribution of the response Y may be any member of the exponential family. For univariate responses, as considered in this book, the corresponding density function is

$$\displaystyle{ f(y_{i};\ \theta _{i},\psi,\omega _{i}) = \text{exp}\left \{\frac{y_{i}\theta _{i} - b(\theta _{i})} {\psi } \omega _{i} + c(y_{i},\psi,\omega _{i})\right \}, }$$
(5.3)

where \(\omega _{i}\) is a weight, with

$$\displaystyle{\omega _{i} = \left \{\begin{array}{ll} 1,\ &\text{ungrouped data}\ (i = 1,\ldots,n_{y}) \\ n_{i}^{c},\ &\text{grouped data}\ \ \ \ (i = 1,\ldots,g)\\ \end{array},\right.}$$

and c = 1 or − 1, according to whether the group response is taken to be the average or the sum of the individuals’ responses in a group, respectively. Parameter θ is called the natural parameter, because it determines the mean, since

$$\displaystyle{ \mu = \text{E}(Y ) = b^{\prime}(\theta ). }$$
(5.4)

Parameter ψ controls the variance

$$\displaystyle{{ \sigma }^{2} = \text{Var}(Y ) = \frac{\psi } {\omega _{i}}b^{\prime\prime}(\theta ) }$$
(5.5)

and is therefore called the dispersion parameter. The functions b(⋅) and c(⋅) are specific functions, determined by the type of the exponential family.

Many commonly used distributions are members of the exponential family, like the normal, the gamma, the binomial, the multinomial, and the Poisson. For one-parameter families the dispersion parameter ψ is fixed. For example, the Poisson \(\mathcal{P}(\theta )\) and the binomial \(\mathcal{B}(n,\theta )\), for fixed n, have ψ = 1. These distributions belong to the simpler natural exponential family. Furthermore, for the Poisson ω = 1, while for the binomial ω = n or \(\omega = {n}^{-1}\), according to whether the response y is taken to be the success proportion or the number of successes.
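For instance, writing the Poisson probability mass function in the form (5.3) identifies its natural parameter and verifies (5.4) and (5.5):

$$\displaystyle{f(y;\ \theta ) = \frac{{\mu }^{y}{e}^{-\mu }} {y!} = \text{exp}\left \{y\log \mu -\mu -\log y!\right \},}$$

so that \(\theta =\log \mu\), \(b(\theta ) = {e}^{\theta } =\mu\), ψ = ω = 1, and \(c(y) = -\log y!\); indeed, \(b^{\prime}(\theta ) =\mu = \text{E}(Y )\) and \(b^{\prime\prime}(\theta ) =\mu = \text{Var}(Y )\).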

The link function \(\eta _{i} = g(\mu _{i})\) can theoretically be any monotonic and differentiable function. However, the link options are practically limited, since the link is chosen so that the inverse \(\mu _{i} = {g}^{-1}(\eta _{i})\) leads to admissible values for \(\mu _{i}\) and to simple functions of \(\theta _{i}\). A characteristic example is the case of a binomial response \(\mathcal{B}(n,\pi _{i})\). Then \(\mu _{i} =\pi _{i}\), which must lie in (0, 1). The three links most often used for binomial data are the logit, the probit, and the complementary log–log. In Chap. 8, we will apply the logit link \(g(\pi ) =\log \left ( \frac{\pi }{1-\pi }\right )\) and refer briefly to the other options. The link function specifies the nature of the distribution considered for the error \(\varepsilon _{i}\). A convenient link with nice properties is the canonical link, which expresses \(\theta _{i}\) in terms of \(\mu _{i}\), i.e., the canonical link is \(g(\mu _{i}) = {B}^{-1}(\mu _{i}) =\theta _{i}\), where B = b′. Under the canonical link, \(\mathbf{X}^{\prime}\mathbf{Y}\) is a sufficient statistic for \(\boldsymbol{\beta }.\)
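Similarly, writing the binomial \(\mathcal{B}(n,\pi )\) distribution of the success proportion y in the form (5.3), with ψ = 1 and ω = n,

$$\displaystyle{f(y;\ \theta ) = \text{exp}\left \{n\left [y\theta -\log (1 + {e}^{\theta })\right ] +\log \binom{n}{ny}\right \},\ \ \ \theta =\log \left ( \frac{\pi }{1-\pi }\right ),}$$

verifies that the natural parameter is the log odds and thus that the logit is the canonical link for binomial data.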

In summary, the GLM is a framework that unifies a wide range of models, flexible through the choices for the distribution of its random component, for the link, and eventually for the error distribution. Beyond the powerful theoretical setup, it is practically attractive because it allows inference for all possible GLMs to be drawn by the same algorithm, thus simplifying their implementation in statistical software.

5.2 Log-Linear Model: Member of the GLM Family

Classical log-linear models, presented in Chap. 4, can be viewed in the framework of GLM for specific selection of the link function and the error distribution, as will be stated next. Doing so has specific advantages. Beyond convenience in model selection and inference by adopting the procedures developed for the GLM family, it allows for easy handling of the structural zeros in log-linear modeling (see Sect.5.5) and it provides a platform for extending the log-linear model to model the marginals as well (see Sect. 5.6).

In order to adjust to GLM’s notation, contingency tables are expanded to vectors. Thus, the I × J table \(\mathbf{n} = (n_{ij})\) is expanded (by rows) to the \(n_{y} \times 1\) vector y as

$$\displaystyle{\mathbf{y} = (y_{1},y_{2},\ldots,y_{n_{y}})^{\prime} = (n_{11},n_{12},\ldots,n_{1J},n_{21},\ldots,n_{IJ})^{\prime},}$$

with \(n_{y} = IJ\). Additionally, this vector approach ensures unified treatment for tables of any dimension. Throughout this book, whenever tables are expanded to vectors, the expansion is considered by rows, followed by columns, layers, etc.

Under the GLM setup, the log-linear models for contingency tables are most easily derived by considering the Poisson distribution for the random component, i.e., \(Y _{i} \sim \mathcal{P}(\theta _{i})\), and the link \(g(\mu _{i}) =\log \mu _{i}\), \(i = 1,\ldots,n_{y}\). The log link is the canonical link for the Poisson distribution. These models are referred to as Poisson log-linear models. Considering Poisson sampling is not restrictive, due to the equivalence of the three possible sampling schemes (see Sect. 2.2.1). Recall that also in the classical log-linear framework, estimation was based on the Poisson likelihood (2.33).

Thus, the log-linear models for I × J tables discussed in this section can be expressed in matrix notation, as follows:

$$\displaystyle{ \log (\boldsymbol{\mu }) = \mathbf{X}\boldsymbol{\beta }, }$$
(5.6)

where \(\boldsymbol{\mu }\) is the IJ × 1 vector of expected cell frequencies under the model, \(\boldsymbol{\beta }\) is the q × 1 vector of parameters, and X is the IJ × q associated design matrix. The table of expected cell frequencies \(\mathbf{m}_{I\times J}\) is expanded the same way as the table of observed frequencies.

For example, the model of independence (4.1) subject to last category zero constraints is equivalently expressed by (5.6), where the IJ × 1 vector of expected frequencies is \(\boldsymbol{\mu } = (m_{11},m_{12},\ldots,m_{1J},m_{21},\ldots,m_{IJ})^{\prime}\), the \((I + J - 1) \times 1\) vector of parameters is \(\boldsymbol{\beta } = (\lambda,\lambda _{1}^{X},\ldots,\lambda _{I-1}^{X},\lambda _{1}^{Y },\ldots,\lambda _{J-1}^{Y })^{\prime}\), and

$$\displaystyle{\mathbf{X} = \left (\begin{array}{lll} \mathbf{1}&{\mathbf{1}}^{(1)} & {\mathbf{I}}^{{\ast}} \\ \mathbf{1}&{\mathbf{1}}^{(2)} & {\mathbf{I}}^{{\ast}}\\ \vdots &\vdots &\vdots \\ \mathbf{1}&{\mathbf{1}}^{(I-1)} & {\mathbf{I}}^{{\ast}} \\ \mathbf{1}&\mathbf{0}_{J\times (I-1)} & {\mathbf{I}}^{{\ast}} \end{array} \right )}$$

is the \(IJ \times (I + J - 1)\) design matrix, with \(\mathbf{1}\) the J × 1 vector of 1’s, \({\mathbf{1}}^{(i)}\) the J × (I − 1) matrix with 1’s in the ith column and 0’s in all other entries, \(\mathbf{0}_{s\times t}\) the s × t matrix of 0’s, and

$$\displaystyle{{\mathbf{I}}^{{\ast}} = \left (\begin{array}{l} \mathbf{I}_{J-1} \\ \mathbf{0}_{1\times (J-1)} \end{array} \right ),}$$

where I s is the s × s identity matrix.
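As a quick sanity check, this design matrix can be constructed and inspected in R; the following is a minimal sketch (the sizes I = 3 and J = 5 are chosen for illustration, and all object names are ours):

> # build the design matrix X of (5.6) for the independence model
> # under last category zero constraints
> I <- 3; J <- 5
> ones <- rep(1, J)                         # the J x 1 vector of 1's
> Istar <- rbind(diag(J-1), rep(0, J-1))    # the block I*
> X <- NULL
> for (i in 1:I) {
+   block <- matrix(0, J, I-1)
+   if (i < I) block[,i] <- 1               # the block 1^(i)
+   X <- rbind(X, cbind(ones, block, Istar))
+ }
> dim(X)      # IJ x (I+J-1)
> qr(X)$rank  # full column rank I+J-1

Supplying this X directly to glm (suppressing the automatic intercept) should reproduce the fit of the independence model of Sect. 5.4.1.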

The application of the independence model through local odds ratios (2.52), though simpler in expression, is more advanced and computationally involved, because it is not in the GLM family. It does not apply to the expected cell frequencies directly but to a function of them. For this, a generalization of the GLM is needed, briefly discussed in Sect.5.6.

5.3 Inference for GLMs

5.3.1 ML Estimation for GLMs

For the maximum likelihood estimation of \(\boldsymbol{\beta }\) for model (5.2), the log-likelihood of a given sample needs to be maximized with respect to \(\boldsymbol{\beta }\). Thus, for a random sample y of size n y , from a population distributed by (5.3), the log-likelihood is

$$\displaystyle{ \ell=\sum _{ i=1}^{n_{y} }\log f(y_{i};\ \theta _{i},\psi,\omega _{i}) =\sum _{ i=1}^{n_{y} }\frac{y_{i}\theta _{i} - b(\theta _{i})} {\psi } \omega _{i} +\sum _{ i=1}^{n_{y} }c(y_{i},\psi,\omega _{i}) }$$
(5.7)

and is a function of \(\boldsymbol{\beta }\) due to (5.2) and (5.4).

The first derivative of the log-likelihood function is Fisher’s score function

$$\displaystyle{s(\boldsymbol{\beta }) = \frac{\partial \ell} {\partial \boldsymbol{\beta }} = \left (\frac{\partial \ell(\boldsymbol{\beta })} {\partial \beta _{1}},\ldots, \frac{\partial \ell(\boldsymbol{\beta })} {\partial \beta _{q}} \right )^{\prime}.}$$

Equating the score function’s components to zero, the corresponding likelihood equations are obtained

$$\displaystyle{s(\beta _{j}) = \frac{\partial \ell} {\partial \beta _{j}} = \frac{\partial } {\partial \beta _{j}}\left (\sum _{i=1}^{n_{y} }\log f(y_{i};\ \theta _{i},\psi,\omega _{i})\right ) = 0,\ \ \ j = 1,\ldots,q,}$$

and are finally equal to

$$\displaystyle{ \sum _{i=1}^{n_{y} }\left (\frac{y_{i} -\text{E}(Y _{i})} {\text{Var}(Y _{i})} \cdot \frac{\partial {g}^{-1}(\eta _{i})} {\partial \eta _{i}} \cdot x_{ij}\right ) = 0,\ \ \ j = 1,\ldots,q, }$$
(5.8)

where \(\eta _{i} =\sum _{ j=1}^{q}\beta _{j}x_{ij}\). The likelihood equations (5.8) are derived by applying the chain rule, since \(\theta _{i} = {(b^{\prime})}^{-1}(\mu _{i})\) and \(\mu _{i} = {g}^{-1}(\eta _{i})\), and using (5.4) and (5.5).

For a specific distributional assumption for \(Y_{i}\) and a particular link function g, the likelihood equations (5.8) take their explicit form and specify the MLE \(\hat{\boldsymbol{\beta }}\). For the canonical link, \(\eta _{i} =\theta _{i}\) and \({g}^{-1} = b^{\prime}\), leading to \(\frac{\partial {g}^{-1}(\eta _{ i})} {\partial \eta _{i}} = b^{\prime\prime}(\theta _{i})\). Thus, by (5.5), the equations (5.8) simplify to

$$\displaystyle{ \sum _{i=1}^{n_{y} }\left [y_{i} -\text{E}(Y _{i})\right ]x_{ij} = 0,\ \ \ j = 1,\ldots,q, }$$
(5.9)

stating that the likelihood equations for the canonical link equate the sufficient statistic \(\sum _{i=1}^{n_{y}}y_{i}x_{ij}\) of \(\beta _{j}\) to its expected value, for \(j = 1,\ldots,q\).

The asymptotic covariance matrix of \(\hat{\boldsymbol{\beta }}\) is derived from the second derivative of the log-likelihood, since it is equal to

$$\displaystyle{\text{Cov}(\hat{\boldsymbol{\beta }}) = \mathbf{\mathcal{I}}_{F}^{-1},}$$

where \(\mathbf{\mathcal{I}}_{F} = \text{Cov}(s(\boldsymbol{\beta }))\) is the expected Fisher information matrix. In our case

$$\displaystyle{\mathbf{\mathcal{I}}_{F} = \text{Cov}(s(\boldsymbol{\beta })) = \text{E}\left (\frac{\partial \ell} {\partial \boldsymbol{\beta }} \frac{\partial \ell} {\partial \boldsymbol{\beta }^{\prime}}\right ) = \text{E}\left (-\frac{{\partial }^{2}\ell} {\partial \boldsymbol{\beta }\partial \boldsymbol{\beta }^{\prime}}\right ) = \mathbf{X}^{\prime}\mathbf{W}\mathbf{X},}$$

where W is a diagonal matrix with diagonal entries

$$\displaystyle{ w_{i} = {(\partial \mu _{i}/\partial \eta _{i})}^{2}{[\text{Var}(Y _{ i})]}^{-1}. }$$
(5.10)

For large n y ,

$$\displaystyle{\boldsymbol{\hat{\beta }}\sim \mathcal{N}_{q}(\boldsymbol{\beta },\ \mathbf{\mathcal{I}}_{F}^{-1}).}$$

The negative of the matrix of second derivatives of the log-likelihood is the observed information matrix

$$\displaystyle{\mathbf{\mathcal{I}}_{F}^{obs} = -\mathbf{H} = -\frac{{\partial }^{2}\ell} {\partial \boldsymbol{\beta }\partial \boldsymbol{\beta }^{\prime}},}$$

where the matrix of second derivatives H is usually referred to as the Hessian matrix. It holds that

$$\displaystyle{ \mathbf{\mathcal{I}}_{F} = \text{E}\left (\mathbf{\mathcal{I}}_{F}^{obs}\right ) = \text{E}\left (-\mathbf{H}\right ). }$$
(5.11)

For GLMs with canonical link functions, \(\eta _{i} =\theta _{i}\) implies \(\frac{\partial \mu _{i}} {\partial \eta _{i}} = \frac{\partial \mu _{i}} {\partial \theta _{i}}\) and the Hessian matrix becomes

$$\displaystyle{ \mathbf{H} = -\mathbf{X}^{\prime}\mathbf{W}\mathbf{X}, }$$
(5.12)

with W a diagonal matrix with entries \(w_{i} =\omega _{i}\left [{g}^{-1}(\theta _{i})\right ]^{\prime}/\psi\), \(i = 1,\ldots,n_{y}\), independent of y. Hence

$$\displaystyle{\mathbf{\mathcal{I}}_{F} = \text{E}\left (-\mathbf{H}\right ) = -\mathbf{H} = \mathbf{\mathcal{I}}_{F}^{obs},}$$

i.e., for canonical link functions, the expected and observed information matrices are identical.

The likelihood equations (5.8) or (5.9) do not usually lead to closed form expressions for \(\hat{\boldsymbol{\beta }}\) and have to be solved iteratively. The two algorithms usually applied for solving them are the Newton–Raphson and the Fisher scoring.

If \(\boldsymbol{{\beta }}^{(t)}\) is the value assigned to \(\hat{\boldsymbol{\beta }}\) at stage t of the iterative procedure (t = 0, 1, 2, …), then the updating equations of the Newton–Raphson algorithm at stage t + 1 are

$$\displaystyle{{ \boldsymbol{\beta }}^{(t+1)} {=\boldsymbol{\beta } }^{(t)} -{\left ({\mathbf{H}}^{(t)}\right )}^{-1}s{(\boldsymbol{\beta }}^{(t)}), }$$
(5.13)

where \(s{(\boldsymbol{\beta }}^{(t)})\) and \({\mathbf{H}}^{(t)}\) are the score function \(s(\boldsymbol{\beta })\) and the Hessian matrix H evaluated at \(\boldsymbol{{\beta }}^{(t)}\). For matrix inversion to be possible, \({\mathbf{H}}^{(t)}\) has to be non-singular.

The algorithm converges and stops when a termination criterion is met, say after \(t_{c}\) iterations, leading to \(\hat{\boldsymbol{\beta }}{=\boldsymbol{\beta } }^{(t_{c})}\). A termination criterion checks whether \({\boldsymbol{\beta }}^{(t)}\) and \({\boldsymbol{\beta }}^{(t+1)}\) are sufficiently close, for example, whether

$$\displaystyle{\vert \ell{(\boldsymbol{\beta }}^{(t+1)}) -\ell {(\boldsymbol{\beta }}^{(t)})\vert \leq \epsilon \ \ \ \text{or}\ \ \ \frac{\left \|{\boldsymbol{\beta }}^{(t+1)} -{\boldsymbol{\beta }}^{(t)}\right \|} {\left \|{\boldsymbol{\beta }}^{(t)}\right \|} \leq \epsilon,}$$

for a pre-chosen small positive ε.

Fisher’s scoring algorithm is similar to the Newton–Raphson algorithm, with the only difference being that it is based on the expected information matrix instead of the observed information matrix. In particular, the updating equations of the Fisher scoring algorithm are

$$\displaystyle{{ \boldsymbol{\beta }}^{(t+1)} {=\boldsymbol{\beta } }^{(t)} +{ \left (\mathbf{\mathcal{I}}_{ F}^{(t)}\right )}^{-1}s{(\boldsymbol{\beta }}^{(t)}), }$$
(5.14)

where \(\mathbf{\mathcal{I}}_{F}^{(t)}\) is \(\mathbf{\mathcal{I}}_{F}\) evaluated at \(\boldsymbol{{\beta }}^{(t)}\).

The asymptotic covariance matrix of \(\hat{\boldsymbol{\beta }}\) is estimated for the Fisher’s scoring algorithm by \(\widehat{\text{Cov}}(\hat{\boldsymbol{\beta }}) =\widehat{ \mathbf{\mathcal{I}}}_{F}^{-1}\) and for the Newton–Raphson algorithm by \(\widehat{\text{Cov}}(\widehat{\boldsymbol{\beta }}) = {(-\hat{\mathbf{H}})}^{-1}\), where \(\hat{\mathbf{\mathcal{I}}}_{F}\) and \(\hat{\mathbf{H}}\) are \(\mathbf{\mathcal{I}}_{F}\) and H, respectively, evaluated at \(\hat{\boldsymbol{\beta }}\).

Due to (5.11), the Newton–Raphson and the Fisher scoring algorithms coincide for GLMs with canonical link function. For noncanonical link functions, the choice between the algorithms relates to issues of ease of application, convergence, and efficiency of implementation. It is a choice between the observed and the expected information matrix. For a related discussion, we refer to the classical discussion paper by Efron and Hinkley (1978) and to Palmgren (1981). Alternatively, other methods have been proposed, like the quasi-Newton (or Newton’s unidimensional) method, which is easier to apply since it does not require matrix inversion but does not provide an estimate of the asymptotic covariance matrix. We will illustrate Newton’s unidimensional method for association models in Sect. 6.2.
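To make the updating equations concrete, a minimal R sketch of Fisher scoring for a Poisson log-linear model follows (for the canonical log link it coincides with Newton–Raphson). The response vector y and the design matrix X are assumed given, and the function name is ours:

fisher.scoring <- function(y, X, eps=1e-8, maxit=25) {
  beta <- rep(0, ncol(X))                  # initial value beta^(0)
  for (t in 1:maxit) {
    mu <- as.vector(exp(X %*% beta))       # inverse of the log link
    score <- t(X) %*% (y - mu)             # score function, cf. (5.9)
    IF <- t(X) %*% (X * mu)                # expected information X'WX, w_i = mu_i
    beta.new <- as.vector(beta + solve(IF, score))   # updating equation (5.14)
    if (max(abs(beta.new - beta)) < eps) break       # termination criterion
    beta <- beta.new
  }
  list(beta=beta.new, cov=solve(IF))       # MLE and estimated asymptotic covariance
}

Its output should agree with the estimates returned by glm with family=poisson, up to the convergence tolerance.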

The solutions of the likelihood equations actually correspond to local maxima and not necessarily to the global maximum of the log-likelihood function \(\ell\), as is required for the MLE \(\hat{\boldsymbol{\beta }}\). Whenever \(\ell\) is concave, the local and global maxima are identical. For non-concave \(\ell\), the choice of the initial estimate \(\boldsymbol{{\beta }}^{(0)}\) is important, to ensure that it lies in the region of the global maximum.

5.3.2 Evaluating Model Fit for GLMs

Given a sample y of \(n_{y}\) observations, let \(\hat{\boldsymbol{\mu }}\) denote the corresponding ML estimate of \(\boldsymbol{\mu }= \text{E}(\mathbf{Y})\) under a model \(\mathcal{M}\) of q parameters. The quality of the model fit is assessed by comparing the maximum log-likelihood for the model, \(\ell(\hat{\boldsymbol{\mu }};\mathbf{y})\), to the maximum log-likelihood for the model that describes the data perfectly, i.e., the saturated model. A saturated model has as many parameters as observations in the sample. We have seen saturated models so far in the context of log-linear models. For the saturated GLM, the number of parameters is \(n_{y}\), \(\hat{\boldsymbol{\mu }}= \mathbf{y}\), and the corresponding log-likelihood is \(\ell(\mathbf{y};\mathbf{y})\). Obviously, always \(\ell(\hat{\boldsymbol{\mu }};\mathbf{y}) \leq \ell (\mathbf{y};\mathbf{y})\), with model \(\mathcal{M}\) fitting better the closer its log-likelihood is to the saturated log-likelihood. Hence, the goodness of fit of a model is expressed in terms of their difference by the test statistic

$$\displaystyle{-2\left [\ell(\hat{\boldsymbol{\mu }};\mathbf{y}) -\ell (\mathbf{y};\mathbf{y})\right ],}$$

which for the exponential family (5.3) becomes

$$\displaystyle{ \frac{D(\mathbf{y};\hat{\boldsymbol{\mu }})} {\psi } = \frac{2} {\psi } \sum _{i=1}^{n_{y} }\omega _{i}\left (y_{i}(\tilde{\theta }_{i} -\hat{\theta }_{i}) - [b(\tilde{\theta }_{i}) - b(\hat{\theta }_{i})]\right ), }$$
(5.15)

where \(\hat{\theta }_{i}\) is the ML estimate of parameter \(\theta _{i}\) under the model \(\mathcal{M}\) and \(\tilde{\theta }_{i}\) is the corresponding estimate under the saturated model. The statistic \(D(\mathbf{y};\hat{\boldsymbol{\mu }})\) is known as the deviance. Analogously, Pearson’s \({X}^{2}\) statistic can be used for testing the adequacy of model \(\mathcal{M}\). In this context

$$\displaystyle{ {X}^{2}(\mathcal{M}) =\sum _{ i=1}^{n_{y} }\frac{{(y_{i} -\hat{\mu }_{i})}^{2}} {\hat{\mu }_{i}}. }$$
(5.16)

For Poisson and binomial GLMs, the deviance (5.15) turns out to equal the LR statistic for testing the null hypothesis that model \(\mathcal{M}\) holds against the saturated model

$$\displaystyle{ {G}^{2}(\mathcal{M}) = 2\sum _{ i=1}^{n_{y} }y_{i}\log (\frac{y_{i}} {\hat{\mu }_{i}} ). }$$
(5.17)
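Indeed, setting \(\theta =\log \mu\), \(b(\theta ) = {e}^{\theta }\), and ψ = ω<sub>i</sub> = 1 in (5.15), the deviance of a Poisson GLM becomes

$$\displaystyle{D(\mathbf{y};\hat{\boldsymbol{\mu }}) = 2\sum _{i=1}^{n_{y} }\left [y_{i}\log \left (\frac{y_{i}} {\hat{\mu }_{i}} \right ) - (y_{i} -\hat{\mu }_{i})\right ],}$$

and the second term vanishes for models containing an intercept, since then \(\sum _{i}y_{i} =\sum _{i}\hat{\mu }_{i}\), leading to (5.17).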

The statistics above can be used for testing the goodness of fit of \(\mathcal{M}\), if their asymptotic distribution can be specified. For this to be possible, the data have to be grouped (each \(y_{i}\) occurs \(n_{i}\) times), with the number of observations in each group \(n_{i}\) being sufficiently large. In this case, the distribution of the statistics (5.15)–(5.17) is approximately \(\mathcal{X}_{df}^{2}\), with \(df = n_{y} - q\), the difference between the number of parameters of the saturated model (\(n_{y}\)) and of the model under testing (q). For more on these test statistics, refer to McCullagh and Nelder (1989).

These goodness-of-fit tests do not account for model complexity, and they increase with the sample size \(n_{y}\), thus giving significant values even for good models if the sample size is large. Alternatively, the fit of a model \(\mathcal{M}\) can be evaluated by Akaike’s information criterion (Akaike 1974)

$$\displaystyle{ AIC = -2\ell(\hat{\boldsymbol{\mu }};\mathbf{y}) + 2q. }$$
(5.18)

It is based on the maximum likelihood under \(\mathcal{M}\) but penalizes its value for model complexity. Furthermore, the Bayesian information criterion (Schwarz 1978)

$$\displaystyle{ BIC = -2\ell(\hat{\boldsymbol{\mu }};\mathbf{y}) + (\log n)q }$$
(5.19)

is another maximum likelihood-based measure, incorporating Bayesian thinking, that beyond complexity also takes the sample size n into account. The AIC and BIC are used for comparing models, with smaller values indicating better models. They can also be used to compare non-nested models. They will be illustrated in the log-linear model context in Sect. 5.4.1.

5.3.3 Residuals

Residuals are critical for diagnosing lack of model fit and identifying possible underlying patterns. The types of residuals used in GLM analysis are the same as those discussed in the context of independence testing for two-way tables (see Sect. 2.2.4). In the GLM setup, the raw residuals \(e_{i} = y_{i} -\hat{\mu }_{i}\) (\(i = 1,\ldots,n_{y}\)) are transformed to the Pearsonian residuals

$$\displaystyle{ e_{i}^{P} = \frac{y_{i} -\hat{\mu }_{i}} {\sqrt{\widehat{\text{Var} }(y_{i } )}},\ \ \ i = 1,\ldots,n_{y}. }$$
(5.20)

For the Poisson GLM, \(\widehat{\text{Var}}(y_{i}) =\hat{\mu } _{i}\) in (5.20) above, while for testing independence in two-way tables, (5.20) is (2.40), expressed in vector form. Pearson’s residuals are asymptotically normally distributed but not standard normal, as explained in Sect. 2.2.4. Thus, dividing the raw residuals by their asymptotic standard errors, the standardized residuals are derived

$$\displaystyle{ e_{i}^{s} = \frac{e_{i}^{P}} {\sqrt{1 -\hat{ h}_{i}}} = \frac{e_{i}} {\sqrt{\widehat{\text{Var} }(y_{i } )(1 -\hat{ h}_{i } )}},\ \ \ i = 1,\ldots,n_{y}, }$$
(5.21)

where \(\hat{h}_{i}\) is the estimate of the diagonal element \(h_{i}\), \(i = 1,\ldots,n_{y}\), of the \(n_{y} \times n_{y}\) matrix

$$\displaystyle{\mathbf{Hat} ={ \mathbf{W}}^{1/2}\mathbf{X}{({\mathbf{X}}^{{\prime}}\mathbf{W}\mathbf{X})}^{-1}{\mathbf{X}}^{{\prime}}{\mathbf{W}}^{1/2},}$$

known as hat matrix, with W the diagonal matrix with entries (5.10).
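In R, assuming Mfit is the output object of glm (see Sect. 5.4), the leverages \(\hat{h}_{i}\) and the standardized residuals (5.21) can be obtained by

> h <- hatvalues(Mfit)                        # diagonal entries of the hat matrix
> residuals(Mfit, type="pearson")/sqrt(1-h)   # standardized residuals (5.21)
> rstandard(Mfit, type="pearson")             # equivalent built-in shortcut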

The deviance residuals decompose the deviance into the individual contributions of each observation i. Hence, for the exponential family (5.3), they are equal to

$$\displaystyle{ e_{i}^{d} = \text{sign}(y_{ i} -\hat{\mu }_{i}) \cdot {\left [2\omega _{i}\left (y_{i}(\tilde{\theta }_{i} -\hat{\theta }_{i}) - [b(\tilde{\theta }_{i}) - b(\hat{\theta }_{i})]\right )\right ]}^{1/2},\ \ \ i = 1,\ldots,n_{ y}, }$$
(5.22)

satisfying \(D(\mathbf{y};\hat{\boldsymbol{\mu }}) =\sum _{ i=1}^{n_{y}}{\left (e_{i}^{d}\right )}^{2}\). For testing independence in two-way tables, (5.22) simplifies to (2.43).

5.3.4 Model Selection in GLMs

Deviance plays a predominant role in comparing GLMs, via the likelihood ratio criterion, for responses \(y_{i}\), \(i = 1,\ldots,n_{y}\), in the exponential family with ψ = 1. In this case, by (5.15), the deviance of a model is equal to the corresponding LR statistic (4.33) for testing its fit.

Let \(\mathcal{M}_{1}\) be a GLM of q 1 parameters. Let also \(\mathcal{M}_{0}\) be a simpler GLM, produced from \(\mathcal{M}_{1}\) by eliminating r of its q 1 parameters. Then, \(\mathcal{M}_{0}\) is said to be nested in \(\mathcal{M}_{1}\) and denoted by \(\mathcal{M}_{0} \subset \mathcal{M}_{1}\). Model \(\mathcal{M}_{0}\) has \(q_{0} = q_{1} - r\) parameters and is more parsimonious than \(\mathcal{M}_{1}\).

If \(\hat{\boldsymbol{\mu }}_{0}\) and \(\hat{\boldsymbol{\mu }}_{1}\) are the ML estimates of \(\boldsymbol{\mu }\) under \(\mathcal{M}_{0}\) and \(\mathcal{M}_{1}\), respectively, then, for ψ = 1, the deviances of models \(\mathcal{M}_{0}\) and \(\mathcal{M}_{1}\) are

$$\displaystyle\begin{array}{rcl} D(\mathbf{y};\hat{\boldsymbol{\mu }}_{0})& =& -2\left [\ell(\widehat{\boldsymbol{\mu }_{0}};\mathbf{y}) -\ell (\mathbf{y};\mathbf{y})\right ] {}\\ D(\mathbf{y};\hat{\boldsymbol{\mu }}_{1})& =& -2\left [\ell(\widehat{\boldsymbol{\mu }_{1}};\mathbf{y}) -\ell (\mathbf{y};\mathbf{y})\right ]. {}\\ \end{array}$$

Since reducing the number of a model’s parameters increases the model’s distance from the perfect fit of the saturated model, it will always be \(D(\mathbf{y};\hat{\boldsymbol{\mu }}_{0}) > D(\mathbf{y};\hat{\boldsymbol{\mu }}_{1})\).

Models \(\mathcal{M}_{0}\) and \(\mathcal{M}_{1}\) both apply on the same y; thus the difference of their deviances is

$$\displaystyle{D(\mathbf{y};\hat{\boldsymbol{\mu }}_{0}) - D(\mathbf{y};\hat{\boldsymbol{\mu }}_{1}) = -2\left [\ell(\widehat{\boldsymbol{\mu }_{0}};\mathbf{y}) -\ell (\widehat{\boldsymbol{\mu }_{1}};\mathbf{y})\right ] = \text{LRS}(\mathcal{M}_{0},\mathcal{M}_{1}),}$$

where \(\text{LRS}(\mathcal{M}_{0},\mathcal{M}_{1})\) is the LR statistic for testing the null hypothesis that \(\mathcal{M}_{0}\) holds against the alternative that \(\mathcal{M}_{1}\) holds. In particular, by (5.15), the difference in deviances equals

$$\displaystyle{ D(\hat{\boldsymbol{\mu }}_{0};\hat{\boldsymbol{\mu }}_{1}) = D(\mathbf{y};\hat{\boldsymbol{\mu }}_{0}) - D(\mathbf{y};\hat{\boldsymbol{\mu }}_{1}) = 2\sum _{i=1}^{n_{y} }\omega _{i}\left (y_{i}(\hat{\theta }_{i1} -\hat{\theta }_{i0}) - [b(\hat{\theta }_{i1}) - b(\hat{\theta }_{i0})]\right ). }$$
(5.23)

Under \(\mathcal{M}_{0}\), (5.23) is approximately \(\mathcal{X}_{r}^{2}\) distributed, where \(r = q_{1} - q_{0}\) is the difference between the numbers of parameters of the two compared models. This asymptotic result is the key for model comparison.

For Poisson log-linear models, (5.23) simplifies to (4.34), i.e.,

$$\displaystyle{{G}^{2}(\mathcal{M}_{ 0}\vert \mathcal{M}_{1}) = 2\sum _{i=1}^{n_{y} }\hat{\mu }_{i1}\log \left (\frac{\hat{\mu }_{i1}} {\hat{\mu }_{i0}}\right ) = {G}^{2}(\mathcal{M}_{ 0}) - {G}^{2}(\mathcal{M}_{ 1}),}$$

where \({G}^{2}(\mathcal{M}_{0})\) and \({G}^{2}(\mathcal{M}_{1})\) are as in (5.17).
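In R, this comparison is carried out by the anova() function applied on two nested glm fits; a minimal sketch, with fit0 and fit1 hypothetical glm objects for \(\mathcal{M}_{0} \subset \mathcal{M}_{1}\):

> # LR test of M0 against M1, based on the difference of deviances (5.23)
> anova(fit0, fit1, test="Chisq")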

Upon considering a sequence of nested models from a very simple \(\mathcal{M}_{0}\) up to the saturated \(\mathcal{M}_{\text{sat}}\),

$$\displaystyle{\mathcal{M}_{0} \subset \mathcal{M}_{1} \subset \mathcal{M}_{2} \subset \ldots \subset \mathcal{M}_{\text{sat}},}$$

the importance of the parameters added gradually can be evaluated by successive comparisons of neighboring models. Thus, the appropriate model can be built by selecting the model \(\mathcal{M}_{s}\) for which \(D(\hat{\boldsymbol{\mu }}_{s};\hat{\boldsymbol{\mu }}_{s+1})\) is nonsignificant and \(D(\hat{\boldsymbol{\mu }}_{s-1};\hat{\boldsymbol{\mu }}_{s})\) is significant. This means that adding more parameters would complicate the model without improving its fit significantly, while removing any further parameters would lead to a model of significantly poorer fit. Hence, comparisons of nested models serve for developing “best model” selection procedures. Furthermore, once the “best model” is selected, model comparison can serve as a tool for evaluating the individual importance of each parameter or group of parameters. Model selection can also be based on AIC and BIC. For a comparative study of AIC and BIC and a finite-sample corrected version of AIC, with emphasis on their role in model selection, we refer to Burnham and Anderson (2004). These criteria will be illustrated in the context of log-linear models for multi-way tables next (see Sect. 5.4.1).

5.4 Software for GLMs

All general-purpose statistical packages (like SAS, SPSS, Stata, and SYSTAT) have procedures for GLM analysis. For example, GLMs are fitted in SAS by the procedure GENMOD. The corresponding R function is glm, which is based on the S-function “glm” (Hastie and Pregibon 1992). The basic form for calling the glm function is

> Mfit <- glm(formula, family=…, data=…)

where formula defines the model to be fitted, family determines the error distribution and link function of the model, and data specifies the data frame on which the model will be applied. Mfit is the object where the output of glm is saved. formula is provided in a form of the type Y ∼ X1+X2+X3+X1:X2, where Y is the dependent variable, X1, X2, X3 the independent ones, and X1:X2 denotes the interaction between X1 and X2. The expression above is equivalent to Y ∼ X3+X1*X2, where X1*X2 stands for the generating term of a hierarchical model, i.e., it is equivalent to Y ∼ X1+X2+X1:X2. For log-linear models the choice for family is family=poisson(link="log"). The specification of the data frame is optional. If it is omitted, the variables are taken from the environment from which glm is called.

The minimum output is printed on screen by simply typing Mfit, while more detailed output is provided by summary(Mfit). The content of object Mfit can be viewed by names(Mfit). An item, say A, of Mfit is located in Mfit$A and can be saved in a variable for further use (e.g., V1 <- Mfit$A). Due to the predominant role deviance plays in GLM analysis, the residuals returned by default by the residuals() function applied on Mfit are the deviance residuals. For results not provided in Mfit, a variety of special functions is available that apply on the glm output. Function step() for model selection between nested models and anova() for analysis of variance can also be activated in the glm framework, as will be illustrated in the examples that follow.

For historical reasons, let us note that GLIM (generalized linear interactive modeling) was the first package with the ability of fitting a variety of GLMs in a unified manner. It was developed by the GLIM working party of the Royal Statistical Society in 1974. GLIM4, the latest release (1993), had many links as standard options and was convenient for GLM fit and model selection. A rich macro library was available while users could write their own macros in GLIM language. The associated journal GLIM Newsletter, issued from 1979 to 1998, was publishing GLIM macros.

Table 5.1 Summary output of the independence model applied on Table 2.3, fitted by glm

5.4.1 Example 2.4 by glm

The log-linear model of independence (4.1) will be fitted on Table 2.3, by glm of R. The variables are required in vector form; thus we apply glm on the data frame nt.frame, constructed in Sect.4.2.1. Model (4.1) is then fitted by

> I.glm <- glm(freq ∼ WELFARE+DEGREE,family=poisson,data=nt.frame)

and the extended output (provided in Table 5.1) is obtained by

> summary(I.glm)

The value of the \({G}^{2}\) statistic is reported under “Residual Deviance” and is saved in I.glm$deviance, as can be verified by typing names(I.glm). Its asymptotic p-value is not provided but can easily be calculated by

> p.value <- 1-pchisq(I.glm$deviance, I.glm$df.residual)

We find p-value = 0.240; thus the independence model describes this data set adequately. Furthermore, the value of the AIC is given (AIC = 110.74), while the BIC, defined by (5.19), can be computed as

> n <- sum(nt.frame$freq); q <- I.glm$df.null-I.glm$df.residual

> BIC <- I.glm$aic-(2-log(n))*q

giving BIC = 139.91. The level of the AIC and BIC values can be judged in comparison to alternative models. In this case, for the saturated model

> sat <- glm(freq ∼ WELFARE*DEGREE,family=poisson,data=nt.frame)

AIC = 116.4, while for the models with only one main effect

> welfr <- glm(freq ∼ WELFARE,family=poisson,data=nt.frame)

and

> degr <- glm(freq ∼ DEGREE,family=poisson,data=nt.frame)

we get AIC = 548.1 and AIC = 129, respectively. Hence, the choice of the independence model is justified.

Function glm produces parameter estimates subject to the first category zero constraints. Recall that only the effect differences between different categories are of interest and these remain invariant under different types of constraints. Observe that \(\hat{\lambda }_{3}^{X} -\hat{\lambda }_{1}^{X} = 0.347 - 0\), equal to the corresponding value derived in Sect.4.2.1 subject to the sum to zero constraints.

The residuals saved in object I.glm are the working residuals. The Pearsonian residuals are calculated by residuals(I.glm, type=c("pearson")) and the deviance residuals by changing the type option to "deviance". Standardized residuals are obtained by rstandard(I.glm).

The items of the output object are all in vector form but can easily be transformed to the more friendly table form by xtabs(). For example, the ML estimates of the expected cell frequencies under independence and the standardized residuals are derived in table form by

> MLEs <- xtabs(I.glm$fitted.values ∼ WELFARE+DEGREE,data=nt.frame)

> stdres <- xtabs(rstandard(I.glm) ∼ WELFARE+DEGREE,data=nt.frame)

Thus, the standardized residuals are

> stdres

The only standardized residual that exceeds 1.96 in absolute value corresponds to cell (1, 1). That is, respondents with educational level lower than high school tend to believe that welfare spending is too little with higher probability than expected under the independence model.

The sequence of commands followed above is unified in function fit.I() of the web appendix (see Sect. A.3.4), which additionally provides the value of Pearson’s \({X}^{2}\) along with its p-value, the dissimilarity index (4.18), and the BIC. The function requires the vector of frequencies (by rows) and the numbers of rows and columns of the table. For this example, it is called as fit.I(freq,3,5).

The standardized residuals can be displayed on the mosaic plot as shown below. We apply

> mosaic(natfare, gp=shading_Friendly, residuals=stdres,

+ residuals_type="Std\nresiduals", labeling=labeling_residuals)

where stdres is the table of standardized residuals derived above. The mosaic plot derived is given in Fig.5.1 (right). The figure on the left is the mosaic plot for standardized residuals for Example 2.2 and is derived analogously.

Fig. 5.1 Mosaic plots of standardized residuals for the independence model applied on Table 2.2 (left) and Table 2.3 (right)

The residuals illustrated in the mosaic plots so far were all for the independence model (default). To refer to residuals of a different model, the output object of the assumed model has to take the position of the data matrix as input in mosaic(). Thus,

> mosaic(natfare, gp=shading_hcl, residuals_type="deviance")

is equivalent to

> mosaic(I.glm, gp=shading_hcl, residuals_type="deviance")

To incorporate the residuals of the model with only the row (opinion) main effect

> X.glm <- glm(freq ∼ WELFARE, family=poisson, data=nt.frame)

the mosaic plot function should be

> mosaic(X.glm, gp=shading_hcl, residuals_type="deviance")

From the ML estimates it can be verified that the local odds ratios \(\hat{\theta }_{ij}\) estimated under independence (i = 1, 2, \(j = 1,\ldots,4\)) are, as expected, all equal to 1. The same holds also for the global and cumulative odds ratios. The ML estimates of any set of generalized odds ratios expected under the assumed model can be calculated in R, using the corresponding functions of the web appendix (see Sect. A.3.2). The procedure is that described for the sample generalized odds ratios at the end of Sect. 2.2.5 and illustrated in the example of Sect. 2.2.6. Only the vector of observed frequencies has to be replaced by the vector of ML estimates of the expected cell frequencies under the assumed model. The equivalent independence model (2.52) in terms of the local odds ratios will be illustrated for this example in Sect. 5.6.

5.4.2 Example 3.1 (Revisited)

For the example of Table 3.1, we have seen in Sect.3.3, applying the Breslow–Day test (or the Woolf test), that the association between smoking and depression is homogeneous for males and females. At this point, we shall select the appropriate log-linear model for describing the underlying association structure of Table 3.1. The data are available in R in matrix depsmok3. In order to fit the models in the GLM setup applying glm, the data have to be expanded from a matrix to a vector and the factors corresponding to the classification variables have to be defined. This is carried out easily as follows:

> obs <- as.vector(depsmok3)

> row <- rep(1:2, 4); col <- rep(1:2, each=2,2)

> lay <- rep(1:2, each=4); row.lb <- c("yes","no")

> col.lb <- c("yes","no"); lay.lb <- c("male", "female")

> S <- factor(row,labels=row.lb); D <- factor(col,labels=col.lb)

> G <- factor(lay, labels=lay.lb)

> depres.fr <- data.frame(obs,S,D,G)

The appropriate log-linear model is selected via the backward stepwise procedure based on AIC. Thus, we first save the saturated model under object saturated and then proceed with the backward model selection procedure as follows:

> saturated <- glm(obs ∼ S*D*G, poisson, data = depres.fr)

> step(saturated, direction="backward")

The stepwise procedure concludes with the model of no three-factor interaction (SD,  DG,  SG).

Model (SD,  DG,  SG) is also the model of homogeneous association, since under this model the association in all two-way partial tables is homogeneous across the levels of the remaining third classification variable, as explained in Sect. 4.3. This model is fitted in R as shown below, giving the output provided in Table 5.2.

> hom.assoc <- glm(obs ∼ S*D+S*G+D*G, poisson, data=depres.fr)

> summary(hom.assoc)

Table 5.2 Output for model (SD,  DG,  SG), fitted on data in Table 3.1

The p-value of the model fit test based on the \({G}^{2}\) statistic is 0.380, which is close to the corresponding p-values of the Woolf and Breslow–Day tests (Sect. 3.3.3).

Relation (4.27), adjusted in our setup, becomes

$$\displaystyle{\log \theta _{(k)}^{SD} =\log \left (\frac{\pi _{11\vert k}\pi _{22\vert k}} {\pi _{12\vert k}\pi _{21\vert k}}\right ) =\lambda _{ 22}^{SD} {=\log \theta }^{SD},\ k = 1,2,}$$

due to the identifiability constraints \(\lambda _{11}^{SD} =\lambda _{ 12}^{SD} =\lambda _{ 21}^{SD} = 0\). Thus, the ML estimate of the common odds ratio \({\theta }^{SD}\) under the log-linear model of homogeneous association is

$$\displaystyle{\hat{{\theta }}^{SD} = \mathrm{exp}\left (\hat{\lambda }_{ 22}^{SD}\right ) =\exp (0.91871) = 2.506,}$$

close in value to \(\hat{\theta }_{MH}\) and \(\hat{\theta }_{W}\), calculated in Sect.3.3.3.

Furthermore, the asymptotic Wald \((1-\alpha )100\%\) CI for \({\theta }^{SD}\) is

$$\displaystyle{\mathrm{exp}\left[\log \hat{{\theta }}^{SD}\ \pm \ z_{\alpha /2}s.e.\left (\log \hat{{\theta }}^{SD}\right )\right ],}$$

where \(s.e.{(\log \hat{\theta }}^{SD})\) is the standard error of \(\log \hat{{\theta }}^{SD}\), equal to \(s.e.{(\log \hat{\theta }}^{SD}) = s.e.(\hat{\lambda }_{22}^{SD}) = 0.17059\).

This CI can easily be computed via the function

> CI <- function(t, SE, conf.level=0.95)

{exp(t+c(-1,1)*qnorm(0.5*(1+conf.level))*SE)}

with t and SE standing for \(\log \hat{{\theta }}^{SD}\) and its standard error, respectively. Hence, the 95% CI for \({\theta }^{SD}\) in this case is computed as

> logSD <- 0.91871; SE.SD <- 0.17059

> CI(logSD, SE.SD)

[1] 1.793842 3.501041

The xtabs() function, used in the previous example (Sect.5.4.1), is especially useful in multi-way tables, since it provides a straightforward way to extract marginal and partial tables of observed or expected cell frequencies. In this example for instance, the smoking-depression marginal table of the ML estimates of the expected cell frequencies under (SD,  DG,  SG) is

> MLE.SD <- xtabs(hom.assoc$fitted.values ∼ S + D)

and, as expected, coincides with the corresponding marginal table of observed frequencies, which for arrays is obtained by

> margin.table(depsmok3, c(1,2))

or

> apply(depsmok3, c(1,2), sum)

However, were the data available only in the data frame format (depres.fr), with obs the vector of observed frequencies, then the smoking-depression observed marginal table would be

> MLE.SD <- xtabs(obs ∼ S + D)

5.5 Independence for Incomplete Tables

In case structural zeros exist (see also Sect. 4.9.1), the corresponding cells are of zero probability and must be excluded from the analysis. Thus, any model assumed will not apply on all cells of the contingency table under consideration but only on the subset of its nonstructural zero cells. Hence, structural zeros affect the assumed model in substance. A table with structural zeros is known as an incomplete or truncated table.

As an illustration, we will consider the independence model for an I × J table. Independence is considered not for all IJ cells but only for the subset of nonstructural zero cells \(S =\{ (i,j):\ \pi _{ij} > 0\}\). The model of independence applied on an incomplete table is known as the quasi-independence (QI) model, a term introduced by Goodman (1968).

QI is defined naturally in the log-linear models framework, as the classical model of independence (4.1), applied on a subset S of the table

$$\displaystyle{ \log m_{ij} =\lambda +\lambda _{i}^{X} +\lambda _{ j}^{Y },\qquad (i,j) \in S. }$$
(5.24)

The main effect parameters satisfy the identifiability constraints (4.4), and the associated degrees of freedom are \(df = (I - 1)(J - 1) - s\), where \(s = IJ -\vert S\vert \) is the number of structural zeros, i.e., the cardinality of the set of structural zeros \({S}^{c}\).
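For instance, for the 4 × 2 table of Example 5.1 below, which has one structural zero after merging over age, \(df = (4 - 1)(2 - 1) - 1 = 2\).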

The restriction (i, j) ∈ S can be incorporated in the model by introducing s additional parameters in (4.1), one for each structural zero. Hence, (5.24) is equivalent to

$$\displaystyle{ \log m_{ij} =\lambda +\lambda _{i}^{X} +\lambda _{ j}^{Y } + q_{ ij}\text{I}_{ij}^{{S}^{c} },\ \ i = 1,\ldots,I,\ j = 1,\ldots,J, }$$
(5.25)

where \(\text{I}_{ij}^{{S}^{c} }\) is the indicator function for structural zeros

$$\displaystyle{ \text{I}_{ij}^{{S}^{c} } = \left \{\begin{array}{ll} 1,&(i,j)\notin S\\ 0, &(i, j) \in S \\ \end{array}.\right. }$$

This way, the structural zero cells are fitted exactly at their observed counts (\(n_{ij} = m_{ij} = 0\) for (i, j)∉S), thus sacrificing s df. Structural zeros have no contribution to the value of the \({X}^{2}\) or \({G}^{2}\) test statistic.

QI is expressed directly on the cell probabilities, as

$$\displaystyle{\pi _{ij} =\alpha _{i}\beta _{j},\qquad \qquad (i,j) \in S,}$$

where the parameters \(\alpha _{i}\) and \(\beta _{j}\) are no longer the marginal probabilities.

Additionally, structural zeros serve as a powerful tool in contingency table analysis, since they can be activated by the needs of the analysis to exclude a specific cell or region of the table that is nonzero but exhibits “special behavior” and degrades the fit of the assumed model. This is often the case for mobility tables or panel studies, where the tables are square with inflated diagonal entries, corresponding to non-change. It is thus natural to exclude the diagonal from the analysis by considering \(S =\{ (i,j):\ i\neq j\}\). Other incomplete square tables that have received special attention are triangular tables. We will return to special QI models for square tables in Sect. 9.3. References on conditions for the existence of ML estimates for truncated tables are provided in Sect. 5.7.1.

Structural zeros are easily incorporated in log-linear model analysis in the GLM framework. A cell (i, j) is excluded from the model by the inclusion in the log-linear model (5.25) of the additional parameter \(q_{ij}\), which is responsible for fixing the cell to its observed frequency (\(e_{ij} = 0\)). In practice, this is achieved in standard software by adding the indicator variable of (5.25) to the log-linear model as an explanatory variable. In the presence of more structural zeros, additional indicator variables are added to the model, one for each structural zero. Alternatively, in the GLM context, all structural zeros can be indicated in one single variable that is used to determine the subset of cells on which model (5.24) will be applied. SPSS handles structural zeros in the “general log-linear analysis” straightforwardly. An index variable has to be added in the data file, taking value 0 for structural zero cells and 1 otherwise. This index variable has to be declared in the “Cell Structure” field of the window:

Analyze > Loglinear > General…

QI will be illustrated in R, using Example 5.1 below.

When interaction is significant, model (4.5) is expressed for two-way incomplete tables as

$$\displaystyle{ \log m_{ij} =\lambda +\lambda _{i}^{X} +\lambda _{ j}^{Y } +\lambda _{ ij}^{XY },\qquad \qquad (i,j) \in S. }$$
(5.26)

The main effect parameters satisfy constraints (4.4), while the sum to zero constraints in (4.6) for the interaction parameters are adjusted to sum over the nonstructural zero cells only:

$$\displaystyle{\sum _{i:\,(i,j)\in S}\lambda _{ij}^{XY } =\sum _{ j:\,(i,j)\in S}\lambda _{ij}^{XY } = 0.}$$

Log-linear models for multi-way incomplete contingency tables can be defined and fitted in an analogous manner.

5.5.1 Example 5.1

A typical example of a contingency table with structural zeros is a survey on teenagers’ health concerns. Teenagers are cross-classified according to their health concerns (in four categories), gender, and age (in two categories: 12–15, 16–17) in a 4 × 2 × 2 table. The table has two structural zeros, since the health concerns category “menstrual problems” cannot refer to boys. This example is analyzed by Grizzle and Williams (1972) and Fienberg (2007, pp. 148–150). Ignoring age, i.e., merging over age, the data are provided in Table 5.3; there is then one structural zero, and the test of QI is based on 2 df. QI is rejected, since \({G}^{2}(QI) = 12.60\) (p-value = 0.0018) and \({X}^{2}(QI) = 12.39\) (p-value = 0.0020). The ML estimates of the cell frequencies expected under QI, along with the standardized residuals, are provided in Table 5.3 in parentheses. Observing them, we conclude that the greatest difference between genders lies in the category “how healthy I am,” for which girls are significantly less concerned and boys more than under independence, followed by “sex, reproduction,” for which boys are significantly less interested and girls more, though not as significantly. Finally, boys are more free of health concerns than expected under independence and girls less, but these differences are at the limit of 5% significance.

Table 5.3 Teenagers’ cross-classification by gender and their health concerns (Brunswick 1971)

This model was fitted in R by the function fit.QI(), provided in the web appendix (see Sect. A.3.4). This function fits the QI model by (5.24), excluding the structural zero cells from the analysis. It needs to read the numbers of rows I and columns J of the table, the cell frequencies in a vector (by rows) of length IJ, where 0s are put in place of structural zeros, and an index vector of length IJ with entries the \(\text{I}_{ij}^{{S}^{c} }\) indices, given by rows. Thus, for our example, the analysis is carried out by the commands

> freq<-c(6,16,0,12,49,29,77,102)

> zer<- c(0,0,1,0,0,0,0,0)

> fit.QI(freq,zer,4,2)

The output of fit.QI(), beyond the results presented above, includes the overview of the fit provided by summary() and the estimates of the log-linear model parameters in vector forms for possible further use.

Alternatively, without restricting the cells on which the model applies, the QI model can be fitted by (5.25), including s extra parameters in the model, one for each structural zero. For this example, s = 1 and we would have

> NI <- 4

> NJ <- 2

> row<-gl(NI,NJ,length=NI*NJ)

> col<-gl(NJ,1,length=NI*NJ)

> example <- data.frame(row, col, freq, zer)

> QI.model <- glm(freq ∼ row+col+zer, poisson)

Under this approach, in the presence of s > 1 structural zeros, the index vector zer used in glm() above needs to be replaced by a factor of s + 1 levels. Level 0 is assigned to the non-structural zero cells and a different level (from 1 to s) is assigned to every structural zero cell.
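As a hypothetical sketch, for a table expanded to a vector of length 8 with two structural zeros at positions 3 and 6, such a factor could be constructed as

> zer.num <- rep(0,8); zer.num[3] <- 1; zer.num[6] <- 2
> zer.f <- factor(zer.num)    # s+1 = 3 levels: 0, 1, 2
> QI.model <- glm(freq ∼ row+col+zer.f, poisson)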

If sampling zeros exist as well, they will not differ from the structural zeros in the frequency vector but only in their index vector entry.

5.6 Models for Joint and Marginal Distributions

Model (5.6) applies directly on the cell entries of the table. In certain frameworks, it is of interest to model or test hypotheses about linear functions of the cell entries. For this, (5.6) is extended to

$$\displaystyle{ \log (\mathbf{M}\mathbf{m}) = \mathbf{X}\boldsymbol{\beta }, }$$
(5.27)

with M a matrix suitably defined in order to form the desired functions of the expected cell entries when applied on m.

The most famous models of this type are those modeling the marginals of a table, since some structures can be expressed more easily in terms of marginal distributions, leading to the marginal models. Marginal models for contingency tables impose structural restrictions on certain marginals of the classification variables and are usually of log-linear type. A characteristic example is the marginal homogeneity model for a square I × I table, presented in Sect. 9.2.2. For higher dimensional tables, modeling the marginal distributions is important for clustered and longitudinal categorical data (see Sects. 5.7.2 and 9.7.4).

However, if we would like to model the local odds ratios of an I × J table, model (5.27) is not appropriate; a further extension is needed. A broader family of models is the generalized log-linear model (GLLM)

$$\displaystyle{ \mathbf{C}\log (\mathbf{M}\mathbf{m}) = \mathbf{X}\boldsymbol{\beta }. }$$
(5.28)

Matrix C provides more flexibility and allows an even broader variety of models to be included in this class. The GLLM was introduced by Lang and Agresti (1994) and opened new directions in the analysis of multivariate categorical data, providing a powerful and flexible framework for modeling association structures. Model (5.28) is suitable for modeling, among others, the log of local or global odds ratios (see Sect. 2.2.5). Recall the matrix definition of the generalized odds ratios, given by (2.54) and (2.55), which corresponds to the left-hand side of (5.28).

The GLLM is itself a member of the broader class of multinomial-Poisson homogeneous (MPH) models, which are of the very general form

$$\displaystyle{ \text{L}(\mathbf{m}) = \mathbf{X}\boldsymbol{\beta }, }$$
(5.29)

where L is a link function. Details on inference for the MPH model are beyond the scope of this book and can be found in Lang (2004, 2005). Setting L(m) = Clog(Mm), (5.29) reduces to (5.28).

Another special case of the MPH model (5.29) is the constraint model

$$\displaystyle{ \text{h}(\mathbf{m}) = \mathbf{0}, }$$
(5.30)

where h() is a smooth constraint function with the constraints in (5.30) being nonredundant. With an adequate choice of the constraint function h(), model (5.30) reduces to the independence model (2.52), expressed in terms of the local odds ratios.

Though inference for the MPH model is not straightforward, it can be implemented in R by the mph function of Lang or by the package hmmm of Colombi et al. (2013). We will illustrate mph, which is a powerful and flexible function that fits a wide variety of general models via maximum likelihood. We limit its use to the GLLM models (5.28) and to model (5.30), both considered for the local and the global odds ratios of a contingency table.

Function mph is available on request. The file “mph.Rcode.txt” is then sent and the routine mph is activated in R by

> source("c://…//mph.Rcode.txt")

The data are read in vector form, which has to be defined as a matrix. Thus, the I × J table of observed frequencies is expanded (by rows) in an IJ × 1 vector freq, and this vector finally forms the IJ × 1 data matrix

> y <- matrix(freq)

The derived vector of expected cell frequencies m is also a matrix of size IJ × 1.

The typical expression of the mph function for fitting (5.29) is

> mph.out <- mph.fit(y=y,L.fct=L.fct,X=X, strata=1)

where L.fct is the link function and X the design matrix of the MPH model (5.29) under consideration. The link for the GLLM model (5.28) is defined by

> L.fct <- function(m)C%*%log(M%*%m)

with C and M appropriately defined matrices. In the sequel, the command

> mph.summary(mph.out,cell.stats=T,model.info=T)

produces summary output of the model, i.e., goodness-of-fit statistics, parameter estimates, expected cell frequency estimates under the assumed model, and information on the model applied and its convergence.

Model (5.30) is fitted by

> mph.constr <- mph.fit(y, constraint=h.fct, strata=1)

where h.fct is the constraints function. For example, in order to fit the independence model (2.52), it should be

> h.fct <- function(m) {C%*%log(m)}

with C an appropriate \((I - 1)(J - 1) \times IJ\) matrix.
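For a 2 × 2 table, for instance, C reduces to the single row (1, −1, −1, 1), since then (with m expanded by rows)

$$\displaystyle{\mathbf{C}\log (\mathbf{m}) =\log m_{11} -\log m_{12} -\log m_{21} +\log m_{22} =\log \theta,}$$

and independence corresponds to the single constraint logθ = 0.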

Examples of fitting the GLLM model through the L.fct option will be discussed in Sects. 6.6.4 and 7.1, for the local and the global odds ratios, respectively. The standard expression of mph.fit() assumes one single multinomial sample (strata=1). The extra option for defining more strata of data will be discussed in Sect. 5.6.2. At this point we will use mph to fit model (2.52) for our familiar Example 2.4, illustrating the use of h.fct.

5.6.1 Example 2.4 by mph

The function local.odds.DM() in the web appendix (see Sect. A.3.2) produces the matrix C needed to derive the logs of the local odds ratios when multiplied by log(m), for tables of any size I × J.

Hence, after activating mph in R, model (2.52) is fitted for our example by

> NI <- 3; NJ <- 5

> freq <- c(45,116,19,48,23,40,167,33,68,41,47,185,34,63,26)

> y <- matrix(freq)

> C<-local.odds.DM(NI,NJ)

> h.fct <- function(m) {C%*%log(m)}

> ind.odds <- mph.fit(y, constraint=h.fct, strata=1)

The corresponding output is derived by

> mph.summary(ind.odds,cell.stats=T,model.info=T)

Part of this output is provided in Table 5.4.

Table 5.4 Output of the mph function, fitting the independence model on the local odds ratios of Example 2.4

If we wanted to express the independence model in terms of the global odds ratios, then h(m) in (5.30) equals h(m) = Clog(Mm), with matrices C and M appropriately defined. Function global.odds.DM() of the web appendix (see Sect. A.3.2) returns these two matrices for tables of size I × J. The procedure above has to be adjusted as follows:

> C <- global.odds.DM(NI,NJ)$C; M <- global.odds.DM(NI,NJ)$M

> h.fct <- function(m) {C%*%log(M%*%m)}

> ind.glob <- mph.fit(y, constraint=h.fct, strata=1)

5.6.2 Example 3.3 by mph

The hypothesis of homogeneous association (3.7) in 2 × 2 × K tables can also be treated in the GLLM framework, expressed by (5.28) with m the expected cell frequencies under the homogeneous association hypothesis, expanded in a 4K × 1 matrix form, \(\mathbf{X} = \left (1\right )_{K\times 1}\), and C the K × 4K block-diagonal matrix \(\mathbf{C} = diag\left (\mathbf{C}_{1},\ldots,\mathbf{C}_{K}\right )\) with \(\mathbf{C}_{k} = \mathbf{C}_{0} = (1,\ -1,\ -1,\ 1)\), for \(k = 1,\ldots,K\). \(\mathbf{C}_{0}\) is the matrix constructing the log odds ratio when applied on m. It has this form, provided that the expected frequency table is expanded by columns. In this case the parameter β is scalar and equals the assumed common log odds ratio of all partial 2 × 2 tables under the homogeneous association hypothesis, i.e., β = logθ.

This approach is illustrated in mph for Example 3.3, as follows. Function bdiag() of library Matrix is applied to produce the block-diagonal matrix C.

> source("c://Program Files//R//mph.Rcode.txt");

freq <- c(79,68,5,17,89,221,4,46,141,77,6,18,45,26,29,21,81,112,

3,11,168,51,13,12);

y<- matrix(freq); K <- 6; X1 <- matrix(rep(1, K));

library(Matrix); C0<-c(1, -1, -1, 1);

C <- t(bdiag(C0,C0,C0,C0,C0,C0)); # the K x 4K block-diagonal matrix C

L.fct <- function(m){C%*%log(m)};

mph.out <- mph.fit(y=y,strata=K,L.fct=L.fct,X=X1);

mph.summary(mph.out,cell.stats=T,model.info=T)

From the output we have that \(G^{2} = 7.950\) (p-value = 0.159, df = 5) and \(X^{2} = 7.896\) (p-value = 0.162, df = 5), while the ML estimate of the common log odds ratio under homogeneous association is \(\hat{\beta }= 1.0759\), i.e., \(\hat{\theta }= 2.9326\). This model is equivalent to the homogeneous association log-linear model applied on the cell frequencies (see Sect. 4.6.1.1). Recall from Sect. 3.3.4 that the Mantel–Haenszel estimate was \(\hat{\theta }_{MH} = 2.96\).
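As a quick numerical cross-check, not part of the mph output above, the Mantel–Haenszel estimate can be reproduced directly from freq, with the partial 2 × 2 tables expanded by columns as before:

> tabs <- matrix(freq, nrow=4)  # one partial 2x2 table per column

> n11 <- tabs[1,]; n21 <- tabs[2,]; n12 <- tabs[3,]; n22 <- tabs[4,]

> n <- n11 + n21 + n12 + n22

> sum(n11*n22/n) / sum(n12*n21/n)  # Mantel-Haenszel estimate: 2.961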

5.7 Overview and Further Reading

The classical reference for GLMs is McCullagh and Nelder (1989). Additionally, a comprehensive reference is Fahrmeir and Tutz (2001). For the application of GLMs in S-Plus and R, we refer to Venables and Ripley (2002, Chap. 7). Dobson and Barnett (2008) provide an easy-to-follow introduction to GLMs that covers the underlying theory but focuses on the analysis of particular types of data and their implementation in standard software, categorical data included. They also consider Bayesian analysis and Markov chain Monte Carlo (MCMC) methods to fit GLMs. A formulation and presentation of models for categorical data through the GLM family can be found in Agresti (2007, 2013).

GLMs have been extended in various directions, for instance to incorporate nonconstant variance, to model dispersion, or to generalize the link function (McCullagh and Nelder 1989). In the categorical data context, characteristic cases are, for example, the consideration of a negative binomial instead of a Poisson response or the introduction of a dispersion effect in the cumulative link model (McCullagh 1980).

The Fisher information matrix plays an important role in many different aspects of statistics, the two most characteristic being the determination of the variance of an estimator and the derivation of “noninformative” priors in the Bayesian setup. Spall (2005) reviews basic principles associated with the information matrix and presents a resampling-based method for computing it.

When the \(n_{i}\)'s are small, the residuals are not approximately normally distributed. For such cases the transformed Anscombe residuals have been proposed (see McCullagh and Nelder 1989). For a survey of residuals for GLMs, we refer to Pierce and Schafer (1986). For goodness-of-fit testing of GLMs for sparse data, see Farrington (1996).

5.7.1 Incomplete Contingency Tables

Incomplete tables attracted researchers' attention very early. Stigler (1992), in an interesting and enlightening historical review, points out that in 1913 Karl Pearson was the first to consider the independence model for two-way incomplete tables. The historical fingerprint data set in Waite (1915) contains structural and sampling zeros, while Harris and Treloar (1927) and Harris and Tu (1929) address the problems that occur in applying the contingency coefficient to incomplete tables.

The existence of ML estimates for models considered on incomplete tables became a central issue in the late 1960s and 1970s. The most well-known model for incomplete tables is the QI model, presented in Sect. 5.5. Very popular, especially in the context of rater agreement and mobility tables, is the QI model for square tables having the main-diagonal entries missing or excluded. The key reference for the QI model is Goodman (1968), though the QI model for diagonal truncated square tables had been considered earlier by Savage and Deutsch (1960) and Goodman (1963a) in transaction flows analysis and White (1963) and Goodman (1965) in mobility table analysis. Fundamental papers in developing inference for QI in the log-linear model setup were Bishop and Fienberg (1969), Fienberg (1970a), and Haberman (1973a), with the last two providing conditions for existence of unique nonzero ML estimates. The QI model is discussed in detail in Bishop et al. (1975).

An interesting approach is that of Fienberg (1969), which locates the cells exhibiting interaction, when the number of such cells is relatively small compared to the total number of cells in the table, and then applies the QI model excluding these cells. Mantel (1970) focused on determining the appropriate degrees of freedom and considered, beyond independence, also symmetry testing for incomplete square tables. Goodman (1971a) proposed a procedure for testing the hypothesis of QI simultaneously for several different subsets of the cells of a table. Enke (1977) considered incomplete two-way tables of special structure that decompose into separable tables and lead to closed-form MLEs. For the ML estimation of the diagonal truncated independence model, Morgan and Titterington (1977) compared the performance of the EM, Newton–Raphson, and iterative scaling algorithms, concluding empirically that the last is the least efficient.

Another special type of incomplete square table is the triangular table. Incomplete tables of this form occurred already in Waite (1915) and are specifically addressed in Goodman (1968) and Bishop and Fienberg (1969). Devoted to QI for triangular tables are Goodman (1979a, 1994) and Altham (1975), who also considered the Bayesian analysis with a conjugate prior. For ordinal triangular tables, Sarkar (1989) interpreted QI in terms of likelihood ratio dependence and Tsai and Sen (1995) provided an alternative test of QI. We considered in Sect. 5.5 the problem of incorporating structural zeros in the simple independence model for two-way tables. The diagonal and triangular truncated tables will be presented in Sect. 9.3.

Colombo and Ihm (1988) applied the QI model in an unusual context to estimate failure rates of components classified by two qualitative covariates. QI allows for different operating times in the various cells, zero operating time included.

Incomplete tables may occur in tables of higher dimension and of more complex association structures. Klimova et al. (2012) introduce a general family of models for contingency tables, the relational models, which provide a unified framework for the analysis of complete and incomplete tables by log-linear models and others, like association models (Chap. 6) and rater agreement models (Sect. 9.5.2). They provide sufficient conditions for the existence of the ML estimates under this general model and prove the classical equivalence between the Poisson and multinomial likelihoods.

A nice review of the literature on the sensitivity analysis of overparameterized models for incomplete categorical data, Bayesian and frequentist, is provided by Poleto et al. (2011).

5.7.2 Marginal Distributions Modeling

Marginal models have been mainly developed by Lang and Agresti (1994), Lang (1996a), Lang et al. (1999), and Bergsma and Rudas (2002a,b). Their approach is based on earlier work by Haber (1985) and Haber and Brown (1986). Bartolucci et al. (2007) generalized the model of Bergsma and Rudas (2002a) to allow for global and continuation-type logits, which may be more adequate for ordinal data analysis. Rudas et al. (2010) formed conditional independence models in a marginal log-linear parameterization. Becker et al. (1998) explored similarities and differences between standard log-linear and marginal models in the social sciences framework, with special emphasis on square tables and reference to multi-way tables as well. For a detailed presentation of marginal models and their features, we refer to the book by Bergsma et al. (2009).

Marginal models are applied for modeling repeated (or clustered) categorical data (see also Sect. 9.7.4).