
1 Introduction

Hierarchical logistic regression models involve inherent correlation arising from different sources of variation. At each level of the hierarchy, we have random intercepts and sometimes random slopes, as well as the appropriate fixed effects. We have done extensive work with the GLIMMIX and NLMIXED procedures in fitting hierarchical models and have noted the trials and tribulations, attested by others, in computing regression estimates and covariance estimates associated with hierarchical models in SAS. We have had several occasions when our models did not converge. In some cases, we found that the convergence criterion was satisfied, but the standard error for the covariance parameters was given as “.” This problem has gained the attention of many (Hartzel et al. 2001; Wilson and Lorenz 2015, to name a few). We do not know with certainty why certain convergence problems exist. As such, we provide some understanding and make some suggestions based on our own work as well as work done by others. We also provide the steps and results of a simulation study, which can be expanded upon for further exploration of the problem and its remedies.

In this chapter, we discuss the use of two-level and three-level hierarchical models for binary data, although it is possible to analyze data at higher levels. We discuss models with effects at level 2 and level 3 representing random intercepts and random slopes. These random effects are added to the model to account for unobservable effects that are known to exist but were not measured or cannot be measured. We also discuss the use of simulations as a means of investigating issues or irregularities. This process is presented as an exercise in simulating hierarchical binary data, which for simplicity is restricted to the two-level case, although the techniques discussed can readily be expanded to higher levels. The simulated models incorporate a random intercept and a random slope at level 2. We implement a hierarchical model using the GLIMMIX procedure in SAS to identify factors that contribute to AIDS knowledge in Bangladesh, and investigate models that do and do not converge based on the number of fixed-effect predictors.

2 Generalized Linear Model

The introduction of generalized linear models unified many statistical methods (Nelder and Wedderburn 1972). These models consist of a set of n independent random variables \(\mathrm{Y}_1, \ldots , \mathrm{Y}_\mathrm{n} \), each with a distribution from the exponential family. We define a generalized linear model as having three components: the random component, the systematic component, and the link function. We define the log-likelihood function based on unknown mean parameters, a dispersion parameter, and a weight parameter, denoted by \({\uptheta }_{\mathrm{i}}, {\upphi }\), and \(\upomega _\mathrm{i}\) respectively, and of the form (Smyth 1989),

$$ \mathrm{l}({\upphi }_{\mathrm{i}}^{-1}, {\upomega }_{\mathrm{i}};\mathrm{y}_{\mathrm{i}})=\sum _{\mathrm{i}}\{{\upomega }_{\mathrm{i}}{{\upphi }}_{\mathrm{i}}^{-1}[\mathrm{y} _{\mathrm{i}}{{\uptheta }_{\mathrm{i}}}-\mathrm{b}({\uptheta }_{\mathrm{i}})]-\mathrm{c}(\mathrm{y}_{\mathrm{i}},{\upomega }_{\mathrm{i}}{\upphi }_{\mathrm{i}}^{-1})\} $$

with \({\upphi } _{\mathrm{i}}\) unknown, and we assume that

$$ \mathrm{c}(\mathrm{y}_{\mathrm{i}}, {\upomega }_{\mathrm{i}}{\upphi }_{\mathrm{i}}^{-1})={\upomega }_{\mathrm{i}}{\upphi }_{\mathrm{i}}^{-1}\mathrm{a}(\mathrm{y}_{\mathrm{i}})-\frac{1}{2}\mathrm{s}(-{\upomega }_{\mathrm{i}}{\upphi }_{\mathrm{i}}^{-1})+\mathrm{t}(\mathrm{y}_{\mathrm{i}}) $$

Thus we have a generalized linear model for the mean such that

$$ \mathrm{g}(\upmu _{\mathrm{i}})=\mathbf{x}_{\mathbf{i}}^{\prime }\varvec{\upbeta }, \quad \text { where } \upmu _{\mathrm{i}}=\mathrm{E}(\mathrm{Y}_{\mathrm{i}})=\mathrm{b}^{\prime }({\uptheta }_{\mathrm{i}}) $$

where \(\mathrm{g}\) is the link function, \(\mathbf{x}_\mathbf{i}^{\prime } =(\mathrm{x}_1, \ldots , \mathrm{x}_\mathrm{p})^{\prime }\) is the vector of covariates and \({\varvec{\upbeta }}\) is the vector of regression parameters. The functions \(\mathrm{a}\left( \mathrm{y} \right) \) and \(\mathrm{b}\left( {\uptheta }_{\mathrm{i}} \right) \) are known functions. We also present the generalized linear model as

$$ \mathbf{Y}=\mathbf{X}{\upbeta }+\upvarepsilon $$

where the random component belongs to the exponential family of distributions; in the marginal form we present \(\mathrm{g}(\mathrm{E}(\mathbf{Y}))=\mathbf{X}{\upbeta }\). However, when the outcomes \(\mathrm{Y}_\mathrm{i} \) are not independent, the generalized linear model in its pure form is no longer appropriate and we must use generalized linear mixed models.
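The fitting algorithm behind a GLM with a logit link can be sketched in a few lines. The following Python example (an illustrative aside; the chapter itself works in SAS, and all names here are our own) estimates \(\varvec{\upbeta }\) by iteratively reweighted least squares on simulated independent binary data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate independent binary responses from a logistic GLM:
# g(E[Y]) = X beta with g the logit link.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.2])
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p)

# Iteratively reweighted least squares (Fisher scoring for the logit link)
beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))      # inverse link
    w = mu * (1.0 - mu)                  # GLM working weights
    z = eta + (y - mu) / w               # working response
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

print(beta)  # estimates should lie near beta_true
```

This independence of the \(\mathrm{Y}_\mathrm{i}\) is exactly what breaks down in the hierarchical settings discussed next.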

3 Hierarchical Models

It is common in fields such as public health, education, demography, and sociology to encounter data structures where the information is collected based on a hierarchy. For instance, in health studies, we often see patients nested within doctors and doctors nested within hospitals. In these types of cases, there is variability at each level of the hierarchy, resulting in intraclass correlation due to the clustering. As a result of the correlation inherent at each level of these hierarchical structures, standard logistic regression is inappropriate (Rasbash et al. 2012). Ignoring these levels of the design when modeling the outcome is likely to lead to erroneous results unless the intraclass correlation is negligible (Irimata and Wilson 2017). Others have demonstrated that ignoring a level of nesting in the data can impact variance estimates and the available power to detect significant covariates (Wilson and Lorenz 2015). When seeking to appropriately analyze these types of correlated data, we must extend the generalized linear models by accounting for the association among the responses.
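For a two-level random-intercept logistic model, a common way to quantify this clustering is the intraclass correlation on the latent (logit) scale, \(\sigma _u^2/(\sigma _u^2+\pi ^2/3)\), where \(\pi ^2/3\) is the variance of the standard logistic distribution. A minimal Python sketch (our own illustration, not from the chapter):

```python
import math

def latent_icc(var_intercept: float) -> float:
    """Intraclass correlation for a random-intercept logistic model,
    using the latent-variable formulation: the level-1 residual on the
    logit scale has the standard logistic variance pi^2 / 3."""
    return var_intercept / (var_intercept + math.pi ** 2 / 3)

print(latent_icc(1.0))  # roughly 0.233
```

Values near zero suggest clustering may safely be ignored; appreciable values indicate that a hierarchical model is needed.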

Hierarchical models, also referred to as nested models or mixed models, are statistical models that extend the class of generalized linear models (GLMs) to address and account for the hierarchical (correlated) nesting of data (Hox 2002; Raudenbush and Bryk 2002; Snijders and Bosker 1998). We will refer to these as hierarchical generalized linear models (HGLMs). This approach incorporates random effects, usually assumed to follow a normal distribution, although non-normal random effects can also be used. The extension required in HGLMs is not as involved when the outcomes follow a conditional normal distribution and the random effects are normally distributed. However, when dealing with outcomes that are not normally distributed (e.g., binary, categorical, ordinal), the extension is not as straightforward. In these cases, we often use a link other than the identity and must specify an appropriate error distribution for the response at each level. We thus present the conditional mean interpretation rather than the marginal mean.

While most work has concentrated on random intercepts, we have often been confronted with data requiring multiple random intercepts and even random slopes. When using the GLIMMIX procedure in SAS, we often find that models which include multiple random intercepts, or even one random intercept with one random slope, may not converge. Therefore, this chapter introduces the reader to hierarchical models with dichotomous outcomes (i.e., hierarchical generalized linear models), and provides concrete examples of non-convergence and possible remedies in these situations.

We present hierarchical models as

$$ \mathbf{Y}=\mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }+{\upvarepsilon } $$

where the random effects \({\uptheta }\) have a multivariate normal distribution with mean vector zero and covariance matrix G, and the errors \(\varepsilon \) are normal with mean vector 0 and covariance matrix R. The matrix X consists of the fixed effects with vector of regression parameters \({\upbeta }\), while the matrix Z consists of columns, each representing the random effects, with vector of parameters \(\uptheta \). Researchers refer to this as compensating for the correlation through the systematic component. Thus we often write the conditional response form as

$$ \mathrm{g}(\mathrm{E}[\mathbf{Y}|\uptheta ])=\mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta } $$

where \({\uptheta }\sim \mathcal{N}(0, \mathrm{G})\). The unconditional covariance matrix for Y is

$$ \mathrm{var}(\mathbf{Y})=\mathbf{A}^{\mathbf{1/2}}{} \mathbf{RA}^{\mathbf{1/2}}+\mathbf{Z}{\varvec{G}}\mathbf{Z}^{\prime } $$

and the conditional covariance matrix, given the random effects is given by

$$ \mathrm{var}( \mathbf{Y}|{\uptheta })=\mathbf{A}^{\mathbf{1/2}}{} \mathbf{RA}^{\mathbf{1/2}}=\mathbf{V}. $$

Thus, it is common in the literature to refer to the G-side and R-side effects, which refer to the covariance matrix of the random effects and the covariance matrix of the residual effects, respectively.

In SAS, the GLIMMIX procedure distinguishes between the G-side and R-side effects and can model the random effects as well as correlated errors. This procedure fits generalized linear mixed models based on linearization and relies on a restricted pseudo-likelihood method of estimation. We revisit the method here as it helps us to understand the problems regarding non-convergence. This estimation is essentially based on the following.

Consider the conditional mean as

$$ \mathrm{E}[\mathrm{Y}|\uptheta ]=\mathrm{g}^{-1}(\mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }) $$

and using Taylor series expansion we linearize \(\mathrm{g}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }\right) \) about the points \(\tilde{\beta }\) and \(\tilde{\theta }\) which gives

$$\begin{aligned} \mathrm{g}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta } \right)&\cong \mathrm{g}^{-1}\left( \mathbf{X} \tilde{\upbeta }+ \mathbf{Z}\tilde{\uptheta }\right) +\frac{\partial {\mathrm{g}}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta } \right) }{\partial {\upbeta }}\left( {\upbeta }- \tilde{\upbeta }\right) \\&\,\,+\frac{\partial \mathrm{g}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta } \right) }{\partial {\uptheta }}({\uptheta }- \tilde{\uptheta }) \end{aligned}$$
$$ \mathrm{g}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta } \right) \cong \mathrm{g}^{-1}\left( \mathbf{X}\tilde{\upbeta } +\mathbf{Z}\tilde{\uptheta }\right) +\Omega _{|\tilde{\beta }\tilde{\theta }} \mathbf{X}\left( \upbeta - \tilde{\upbeta }\right) +\Omega _{|\tilde{\beta }\tilde{\theta }} \mathbf{Z}({\uptheta }-\tilde{\uptheta } ) $$

where \(\Omega _{|\tilde{\beta }\tilde{\theta }}\) denotes the matrix of derivatives of \(\mathrm{g}^{-1}\) evaluated at \(\tilde{\upbeta }\) and \(\tilde{\uptheta }\). Thus

$$\begin{aligned}&{\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}\{\mathrm{g}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }\right) \}\\&\qquad \qquad \quad \cong {\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}\{\mathrm{g}^{-1}\left( \mathbf{X}\tilde{\upbeta }+\mathbf{Z}\tilde{\uptheta }\right) \}+\mathbf{X}\left( {\upbeta }-\tilde{\upbeta }\right) +\mathbf{Z}({\uptheta }-\tilde{\uptheta }) \end{aligned}$$

So

$$\begin{aligned}&{\Omega _{|\tilde{\beta } \tilde{\theta }}}^{-1} \{\mathrm{g}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }\right) \}-{\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}\{\mathrm{g}^{-1}\left( \mathbf{X}\tilde{\upbeta }+\mathbf{Z}\tilde{\uptheta }\right) \} \\&\qquad \qquad \quad \cong \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }-(\mathbf{X}\tilde{\upbeta } +{{\varvec{Z}}}\tilde{\uptheta }) \end{aligned}$$

and

$$\begin{aligned} \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }&\cong \left( \mathbf{X}\tilde{\upbeta }+{{\varvec{Z}}}\tilde{\uptheta }\right) + {\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}\{\mathrm{g}^{-1}\left( \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }\right) \}\\&\qquad -{\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}\{\mathrm{g}^{-1}\left( \mathbf{X}\tilde{\upbeta }+\mathbf{Z}\tilde{\uptheta }\right) \} \end{aligned}$$
$$ \mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta } \cong \left( \mathbf{X}\tilde{\upbeta } +{{\varvec{Z}}}\tilde{\uptheta }\right) +{\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}\{\mathrm{E}[\mathbf{Y}|\uptheta ] -\mathrm{g}^{-1}\left( \mathbf{X}\tilde{\upbeta } +\mathbf{Z}\tilde{\uptheta }\right) \} $$

Hence we consider this approximation, in which \(\mathbf{X}\tilde{\upbeta }+{{\varvec{Z}}}\tilde{\uptheta }\) represents the matrix of fixed effects multiplied by a beta-like term plus the matrix of random effects multiplied by a theta-like term, and we denote \({\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}\{\mathrm{E}[\mathbf{Y}|\uptheta ] -\mathrm{g}^{-1}\left( \mathbf{X}\tilde{\upbeta }+\mathbf{Z}\tilde{\uptheta }\right) \} = \zeta \) as an error-like term. So we can think of the approximation as a linear model defined as

$$ \mathrm{Y}_{\mathrm{approx}} =\mathbf{X}{\upbeta }+\mathbf{Z}{\uptheta }+{\upzeta } $$

with the variance

$$ \mathrm{var}[\mathrm{Y}_{\mathrm{approx}} |\uptheta ]={\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1}{} \mathbf{A}^{\mathbf{1/2}}{} \mathbf{RA}^{\mathbf{1/2}}{\Omega _{|\tilde{\beta }\tilde{\theta }}}^{-1} $$

As such this can be seen as a linear approximation, given by \(\mathrm{Y}_{\mathrm{approx}}\) with fixed effects \({\upbeta }\), and random effects \({\uptheta }\) and variance of \({\upzeta }\) given by \(\mathrm{var}[\mathrm{Y}_{\mathrm{approx}} |\uptheta ]\).

3.1 Approaches with Binary Outcomes

Binary outcomes are very common in healthcare research, among many other fields. For example, one may investigate whether or not a patient has improved or recovered after discharge from the hospital. For healthcare and other types of research, the logistic regression model is one of the preferred methods of modeling data when the outcome variable is binary. In its standard form, it is a member of the class of generalized linear models with a binomial random component. As is customary in regression analysis, the model makes use of several predictor variables that may be either numerical or categorical. However, a standard logistic regression model assumes that the observations obtained from each unit are independent. If we were to fit a standard logistic regression to nested data, the assumption of independent observations would be seriously violated. This violation can lead to an underestimation of the standard errors, which in turn can lead to conclusions of a significant effect when in fact there is none.

Multilevel approaches for nested data can also be applied to analysis of dyadic data to take into account the nested sources of variability at each level (Raudenbush 1992). Many researchers have explored the use of these two-level approaches with binary outcomes (see for example McMahon et al. 2006).

4 Three-Level Hierarchical Models

In the analysis of multilevel data, each level provides a component of variance that measures intraclass correlation. For instance, consider a hierarchical model at three levels for the \(\mathrm{k}{\mathrm{th}}\) patient seeing the \(\mathrm{j}{\mathrm{th}}\) doctor in the \(\mathrm{i}{\mathrm{th}}\) hospital. The patients are at the lowest level (level 1) and are nested within doctors (level 2), who are nested within hospitals (level 3). We consider the hospital as the primary unit, doctors as the secondary unit, and patients as the observational unit. These clusters are treated as random effects. We make use of random effects because we believe there are some non-measurable influences on patient outcomes based on the doctor and also based on the hospital. Some effects may be positive and some may be negative, but overall we assume their average effect is zero.

4.1 With Random Intercepts

At level 1, we may take responses from different patients, while noting their age (Age) and length of stay (LOS). The outcomes are modeled through a logistic regression model

$$\begin{aligned} \log \left[ {\frac{\mathrm{p}_{\mathrm{ijk}}}{1-\mathrm{p}_{\mathrm{ijk}}}}\right] ={\upgamma }_{\mathrm{oij}} +{\upgamma }_{1\mathrm{ij}} \mathrm{Age}_{\mathrm{ijk}} +{\upgamma }_{2\mathrm{ij}} \mathrm{Los}_{\mathrm{ijk}} \end{aligned}$$
(4.1)

where \(\upgamma _{\mathrm{oij}}\) is the intercept, \({\upgamma }_{1\mathrm{ij}}\) is the coefficient associated with the predictor \(\mathrm{Age}_{\mathrm{ijk}},\) and \({\upgamma }_{2\mathrm{ij}}\) is the coefficient associated with the predictor \(\mathrm{Los}_{\mathrm{ijk}}\) (length of stay), for \(\mathrm{k}=1,2, {\ldots }, \mathrm{n}_{\mathrm{ij}}\) patients, \(\mathrm{j}=1, 2, {\ldots }, \mathrm{n}_{\mathrm{i}}\) doctors, and \(\mathrm{i}=1, {\ldots }, \mathrm{n}\) hospitals. Each doctor has a separate logistic model. If we allow the effects of Age and LOS on the outcome to be the same for each doctor, but allow the intercept to differ, we have parallel planes on the logit scale for their predictive models. The intercept \({\upgamma }_{\mathrm{oij}}\) represents those differential effects among doctors.

At level 2, we assume that the intercept \({\upgamma }_{\mathrm{oij}}\) (which allows a different intercept for doctors within hospitals) depends on unobserved factors specific to the \(\mathrm{i}{\mathrm{th}}\) hospital, covariates associated with the doctors within the \(\mathrm{i}{\mathrm{th}}\) hospital, and a random effect \(\mathrm{u}_{\mathrm{oij}}\) associated with doctor j within hospital i. Thus,

$$\begin{aligned} {\upgamma }_{\mathrm{oij}} ={\upgamma }_{\mathrm{oi}} +{\upgamma }_{1\mathrm{i}} \mathrm{Experience}_{\mathrm{ij}} +u_{\mathrm{oij}} \end{aligned}$$
(4.2)

where \(\mathrm{Experience}_{\mathrm{ij}}\) is the experience for the \(\mathrm{j}{\mathrm{th}}\) doctor within the \(\mathrm{i}{\mathrm{th}}\) hospital. Similarly, hospital administration policies may have different effects on doctors. At level 3, the model assumes that differential hospital policies depend on the overall fixed intercept \({\upbeta }_0 \) and the random intercept \(\mathrm{u}_{\mathrm{oi}} \) associated with the unmeasurable effect for hospital \(\mathrm{i}\). Thus,

$$\begin{aligned} \gamma _\mathrm{oi} =\beta _0 +u_\mathrm{oi} \end{aligned}$$
(4.3)

By substituting the expression for \({\upgamma }_{\mathrm{oi}} \) in (4.3) into (4.2), and then substituting the resulting expression for \({\upgamma }_{\mathrm{oij}}\) into (4.1), we obtain

$$\begin{aligned} \log \left[ \frac{\mathrm{p}_{\mathrm{ijk}}}{1-\mathrm{p}_{\mathrm{ijk}}}\right] ={\upbeta }_0 +{\upgamma }_{1\mathrm{i}}\mathrm{Experience}_{\mathrm{ij}} +{\upgamma }_{1\mathrm{ij}} \mathrm{Age}_{\mathrm{ijk}} +{\upgamma }_{2\mathrm{ij}} \mathrm{Los}_{\mathrm{ijk}} +\mathrm{u}_{\mathrm{oi}} +\mathrm{u}_{\mathrm{oij}} \end{aligned}$$
(4.4)

The combination of random and fixed terms results in a generalized linear mixed model with two random effects: hospitals, denoted by \(\mathrm{u}_{\mathrm{oi}} \sim \mathcal{N}(0,{\upsigma }^{2}_{\mathrm{u}_\mathrm{i}})\), and doctors, denoted by \(\mathrm{u}_{\mathrm{oij}} \sim \mathcal{N}(0,{\upsigma }^{2}_{\mathrm{u}_{\mathrm{ij}}})\), with covariance \(\sigma _{\mathrm{u}_{\mathrm{oi}} , \mathrm{u}_{\mathrm{oij}}}\). From Eq. (4.4), the model consists of the overall mean, plus the experience of the doctor, the age of the patient, and the length of stay, plus effects due to hospitals and effects due to doctors for each individual. Hence, we have a subject-specific model.
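To make the subject-specific structure of (4.4) concrete, the following Python sketch simulates binary outcomes from the three-level random-intercept model; all dimensions and coefficient values are hypothetical choices of ours, not the chapter's:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dimensions and parameters (illustrative only)
n_hosp, n_doc, n_pat = 10, 5, 20        # hospitals, doctors/hospital, patients/doctor
sd_hosp, sd_doc = 1.0, 0.5              # random-intercept standard deviations
b0, b_exp, b_age, b_los = -3.0, 0.05, 0.02, 0.10

u_i = rng.normal(0.0, sd_hosp, n_hosp)            # hospital effects u_oi
u_ij = rng.normal(0.0, sd_doc, (n_hosp, n_doc))   # doctor effects u_oij

records = []
for i in range(n_hosp):
    for j in range(n_doc):
        experience = rng.uniform(1, 30)           # doctor-level covariate
        for k in range(n_pat):
            age, los = rng.uniform(20, 90), rng.uniform(1, 15)
            # Linear predictor of model (4.4)
            logit = (b0 + b_exp * experience + b_age * age
                     + b_los * los + u_i[i] + u_ij[i, j])
            p = 1.0 / (1.0 + np.exp(-logit))
            records.append((i, j, k, rng.binomial(1, p)))

print(len(records))  # 10 * 5 * 20 = 1000 patient records
```

Fitting such simulated data with a random-intercept logistic model should recover the variance components, which makes this a useful check on estimation software.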

4.2 Three-Level Logistic Regression Models with Random Intercepts and Random Slopes

Consider the three-level random intercept and random slope model consisting of a logistic regression model at level 1,

$$\begin{aligned} \log \left[ \frac{\mathrm{p}_{\mathrm{ijk}}}{1-\mathrm{p}_{\mathrm{ijk}}} \right] ={\upgamma }_{\mathrm{oij}} +{\upgamma }_{1\mathrm{ij}} \mathrm{Age}_{\mathrm{ijk}} +{\upgamma }_{2\mathrm{ij}} \mathrm{Los}_{\mathrm{ijk}} \end{aligned}$$
(4.5)

where both \({\upgamma }_{\mathrm{oij}}\) and \({\upgamma }_{2\mathrm{ij}}\) are random, for \(\mathrm{k}=1,2, {\ldots }, \mathrm{n}_{\mathrm{ij}}\); \(\mathrm{j}=1, 2, {\ldots }, n_i\); and \(\mathrm{i}=1, {\ldots }, \mathrm{n}\). So each doctor has a different intercept, and the rates of change with respect to length of stay are not the same for all doctors. There are some unobserved effects related to LOS that impact the outcome, and the doctors' impacts on patients vary as LOS varies. The intercept represents a group of unidentifiable factors that impact the overall effect of the doctor on the patient’s success, while the slope represents the differential impact that the particular variable (LOS) has, resulting in differences among patients.

So, at level 2, \({\upgamma }_{\mathrm{oij}}\) and \({\upgamma }_{2\mathrm{ij}}\) are treated as response variables within the model,

$$\begin{aligned} {\upgamma }_{\mathrm{oij}} ={\upgamma }_{\mathrm{oi}} +{\upgamma }_{1\mathrm{i}} \mathrm{Experience}_{\mathrm{ij}} +\mathrm{u}_{\mathrm{oij}} \end{aligned}$$
(4.6)
$$\begin{aligned} {\upgamma }_{2\mathrm{ij}} ={\upgamma }_{2\mathrm{i}} +\mathrm{u}_{2\mathrm{ij}} \end{aligned}$$
(4.7)

where \({\upgamma }_{\mathrm{oi}}\) and \({\upgamma }_{2\mathrm{i}}\) are random effects. Equation (4.6) assumes the intercept \({\upgamma }_{\mathrm{oij}}\) for doctor \(\mathrm{j}\) nested within hospital \(\mathrm{i}\) depends on the unobserved intercept specific to the \(\mathrm{i}{\mathrm{th}}\) hospital, the effects associated with the doctor’s experience in the hospital, and a random term \(\mathrm{u}_{\mathrm{oij}}\) associated with doctor \(\mathrm{j}\) within hospital \(\mathrm{i}\). The slope \({\upgamma }_{2\mathrm{ij}}\) depends on the overall slope \({\upgamma }_{2\mathrm{i}} \) for hospital \(\mathrm{i}\) and a random term \(\mathrm{u}_{2\mathrm{ij}}\).

At level 3, the model shows that the hospitals vary based on random effects

$$\begin{aligned} {\upgamma }_{\mathrm{oi}} ={\upbeta }_{00} +u_{oi} \end{aligned}$$
(4.8)
$$\begin{aligned} {\upgamma }_{2\mathrm{i}} ={\upbeta }_{22} +\mathrm{u}_{2\mathrm{i}} \end{aligned}$$
(4.9)

The intercept \({\upgamma }_{\mathrm{oi}}\) depends on the overall fixed intercept \(\beta _{00}\) and the random term \(u_{oi}\) associated with hospital i, while the hospital slope \({\upgamma }_{2\mathrm{i}}\) depends on the overall fixed slope \({\upbeta }_{22}\) and the random effect \(u_{2i}\) associated with the slope for hospital \(\mathrm{i}\). By substituting the expressions for \({\upgamma }_{\mathrm{oi}}\) and \({\upgamma }_{2\mathrm{i}}\) in (4.8) and (4.9) into (4.6) and (4.7), and then substituting the resulting expressions for \({\upgamma }_{\mathrm{oij}}\) and \({\upgamma }_{2\mathrm{ij}}\) into (4.5), we obtain

$$\begin{aligned}&\qquad \log \left[ \frac{\mathrm{p}_{\mathrm{ijk}}}{1-\mathrm{p}_{\mathrm{ijk}}} \right] ={\upbeta }_{00} +{\upgamma }_{1\mathrm{ij}} \mathrm{Age}_{\mathrm{ijk}} +{\upgamma }_{1\mathrm{i}} \mathrm{Experience}_{\mathrm{ij}} +\mathrm{u}_{\mathrm{oi}} +\mathrm{u}_{\mathrm{oij}} +\nonumber \\&\left( \beta _{22} +\mathrm{u}_{2\mathrm{i}} +\mathrm{u}_{2\mathrm{ij}} \right) \mathrm{Los}_{\mathrm{ijk}} \end{aligned}$$
(4.10)

Thus, we have a generalized linear mixed model with random effects \(\mathrm{u}_{\mathrm{oi}}, \mathrm{u}_{\mathrm{oij}}, \mathrm{u}_{2\mathrm{i}}\), and \(\mathrm{u}_{2\mathrm{ij}}\). Therefore, \(\mathrm{Los}_{\mathrm{ijk}}\) is associated with both a fixed and a random part. We take advantage of this regrouping of terms to incorporate the random effects and their variance-covariance matrix, so that \(\mathrm{u}_{\mathrm{oi}}, \mathrm{u}_{\mathrm{oij}}, \mathrm{u}_{2\mathrm{i}}\), and \(\mathrm{u}_{2\mathrm{ij}}\) are jointly normally distributed with mean zero and a covariance matrix reflecting the relationships among the random effects.
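The regrouping in (4.10) can be seen directly by constructing the doctor-specific LOS coefficient \({\upbeta }_{22} + \mathrm{u}_{2\mathrm{i}} + \mathrm{u}_{2\mathrm{ij}}\). A small Python sketch with hypothetical variance components of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

beta22 = 0.10                 # overall fixed LOS slope (hypothetical)
n_hosp, n_doc = 8, 4
u2_i = rng.normal(0.0, 0.3, n_hosp)             # hospital-level slope deviations u_2i
u2_ij = rng.normal(0.0, 0.2, (n_hosp, n_doc))   # doctor-level slope deviations u_2ij

# Doctor-specific LOS coefficients: beta_22 + u_2i + u_2ij
los_coef = beta22 + u2_i[:, None] + u2_ij

print(los_coef.shape)  # (8, 4): one LOS slope per doctor within hospital
```

Every doctor thus carries a distinct slope, centered at the fixed \({\upbeta }_{22}\) and perturbed by the hospital- and doctor-level deviations.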

4.3 Nested Higher Level Logistic Regression Models

For nesting at more than three levels, we can easily present the model, though executing the necessary computations may be tedious. Imagine that we had data with another level, hospitals nested within cities (level 4, denoted by h). Cities may have their own ways of monitoring healthcare within their jurisdictions. We also believe that the number of beds within the hospital is a necessary variable. For such data, we will have the k\({\mathrm{th}}\) patient nested within the j\({\mathrm{th}}\) doctor, who is nested within the i\({\mathrm{th}}\) hospital, which is nested in the h\({\mathrm{th}}\) city. Then the model is:

$$\begin{aligned}&\qquad \log \left[ \frac{\mathrm{p}_{\mathrm{hijk}}}{1-\mathrm{p}_{\mathrm{hijk}}}\right] ={\upbeta }_{00} +{\upgamma }_{1\mathrm{hij}} \mathrm{Age}_{\mathrm{hijk}} +{\upgamma }_{1\mathrm{hi}} \mathrm{Experience}_{\mathrm{hij}}+\nonumber \\&{\upgamma }_{1\mathrm{h}} \mathrm{Bed}_{\mathrm{hi}} + \mathrm{u}_{\mathrm{oh}} +\mathrm{u}_{\mathrm{ohi}} +\mathrm{u}_{\mathrm{ohij}} +\left( \beta _{22} +\mathrm{u}_{2\mathrm{hi}} +\mathrm{u}_{2\mathrm{hij}}\right) \mathrm{Los}_{\mathrm{hijk}} \end{aligned}$$
(4.11)

5 Possible Problems with Hierarchical Models

5.1 Issues in Hierarchical Modeling

We found that convergence of parameter estimates can sometimes be difficult to achieve, especially when fitting models with random slopes or higher levels of nesting. Some researchers have found that convergence problems may occur if the outcome is skewed for certain clusters or if there is quasi- or complete separation. Such phenomena destroy the variability within clusters that is essential to obtaining the solutions. In addition, including too many random effects may not be computationally possible (Schabenberger 2005).

Our experience echoes that of other researchers: for hierarchical logistic models for nested binary data, it is often not feasible to estimate random effects for both intercepts and slopes in the same model. Newsom (2002) showed that we can have models with too many parameters to be estimated given the number of covariance elements included. Others found that such models can lead to severe convergence problems, which can limit the modeling. Before fitting these conditional models, McMahon et al. (2006) suggested that one should determine whether there is significant cluster interdependence to justify the use of multilevel modeling. Irimata and Wilson (2017) gave some further guidance through simulation.

Regardless of the number of clusters, Austin (2010) found that for all statistical software procedures, the estimation of variance components tended to be poor when there were only five subjects per cluster, while the effect of the number of clusters on the mean number of quadrature points was negligible. However, when the random effects were large, Rodriguez and Goldman (1995) found substantial underestimation of fixed effects and/or variance components. They also found bias in the estimation when the number of subjects per cluster was small.

These hierarchical models can be fitted in SAS with the GLIMMIX or NLMIXED procedure, as well as in SPSS and R. Maas and Hox (2004) claimed that only one RANDOM statement is supported in the NLMIXED procedure, so that nonlinear mixed models cannot be assessed at more than two levels. However, Hedeker et al. (2008, 2012) showed how more than one random effect can be accommodated for continuous data in the NLMIXED procedure with more than two levels.

5.2 Parameter Estimations

The conditional distribution of the responses and the distribution of the random effects provide a joint likelihood which cannot necessarily be written down in closed form. However, we still need to estimate the regression coefficients and the random components, so it is imperative to use some form of approximation. Some researchers have used the quasi-likelihood approach through a Taylor series expansion to approximate the joint likelihood; the approximate likelihood is maximized to produce maximum quasi-likelihood estimates. The disadvantage which many researchers have pointed out with this approach is the bias involved with quasi-likelihoods (Wedderburn 1974). Other researchers have resorted to numerical integration, using quadrature, to obtain approximations of the true likelihood. More integration points increase the accuracy, but also increase the number of computations and thus slow convergence. Each added random component increases the dimension of the integral: a random intercept is one dimension (one added parameter), and adding a random slope makes it two dimensions. Our experience is that three-level nested models with random intercepts and slopes often create convergence problems.

5.3 Convergence Issues in SAS

We spent considerable time overcoming the challenges of the GLIMMIX procedure. We reviewed the available literature and consulted those with experience using SAS. Although there is no guarantee that there will not be challenges, we provide in this chapter our experiences, underscored by others, as well as suggestions for improving the performance of this procedure.

Non-convergence in the GLIMMIX procedure can be identified by looking at the output and the log. The most obvious indication of issues is in the convergence criterion, which is provided below the iteration history. When convergence is not obtained, SAS will provide the following warning: “Did not converge”.

A convergence message does not by itself guarantee that the model converged. In some cases, the convergence criterion will be satisfied, but the standard error for one or more of the (non-zero) covariance parameters will be missing. When this occurs, the standard error will be given as “.” instead of an actual estimate. In these cases, the output may look similar to the following:

figure a

When there is non-convergence, a number of remedies are possible. Many authors, such as Kiernan et al. (2012), have offered possible solutions. Researchers using the GLIMMIX procedure may choose to:

  • Drop certain variables

  • Relax the convergence criterion

  • Increase the value of ABSCONV =

  • Change the covariance structure using TYPE =

  • Adjust the quadrature using QUAD =

  • Utilize different optimization algorithms, such as TECH=NRRIDG or TECH=NEWRAP, in the NLOPTIONS statement

  • Increase the number of iterations using MAXITER = in the NLOPTIONS statement

  • Control the number of outer iterations using the INITGLM option

  • Increase the number of optimizations using the MAXOPT = option

  • Rescale data values to reduce issues relating to extreme values

  • Utilize an alternative approach, such as the %HPGLIMMIX MACRO (Xie and Madden 2014)

For a more complete discussion of the procedure itself, Ene et al. (2015) provided a thorough introduction to the use and interpretation of the GLIMMIX procedure in SAS.

6 Simulation of Data

The IML procedure in SAS was used to simulate two-level data following a generalized linear mixed model with random intercepts and random slopes. In this example, we explored the effects of including an increasing number of fixed effects when using the GLIMMIX procedure to fit a logistic regression model with one random intercept and one random slope. The approaches discussed in this section can readily be expanded to simulate data with more than two levels, although only two levels are discussed for ease of interpretation and understanding.

6.1 Simulation Setup

Here we set the parameters for the simulation. We assume that the random intercept has variance \(\sigma _{INT}^2 {=}7\) and that the random slope has variance \(\sigma _{SLOPE}^2 {=}15\). We also assume that there are six continuous fixed effects. Each of the fixed effects has a mean of 1, with some random noise added so that the means are not all equal. The fixed effects are assumed to be independent of one another and also pairwise independent of the random slope. The simulated data will include 15 clusters of observations, each with a randomly chosen number of observations between 2 and 40.

figure b
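The chapter's own SAS/IML code appears in the figure above. Purely as an illustrative sketch of the same setup, the parameters can be expressed in Python with NumPy; the seed, the variable names and the noise scale for the fixed effect means are our own assumptions, not the chapter's:

```python
import numpy as np

rng = np.random.default_rng(2011)  # seed chosen arbitrarily for reproducibility

# Simulation parameters matching the values described in the text
var_int = 7.0        # variance of the random intercept
var_slope = 15.0     # variance of the random slope
n_fixed = 6          # number of continuous fixed effects
n_clusters = 15      # number of clusters

# Fixed effect means: all near 1, with random noise so they are not all equal
fixed_means = 1.0 + rng.normal(0.0, 0.1, size=n_fixed)

# Each cluster receives a randomly chosen size between 2 and 40 observations
cluster_sizes = rng.integers(2, 41, size=n_clusters)
```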

Once the parameters for the simulation are chosen, the cluster level data are created. Each of the random (cluster) intercepts is drawn from an independent standard normal distribution with a mean of 0 and a standard deviation of 1. The random (cluster) slope coefficients are likewise drawn from independent normal distributions with a mean of -1 and our specified variance. In effect, each of the 15 clusters is assigned a unique cluster level intercept and slope term. Our design matrix is created using these random values.

figure c
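The corresponding SAS/IML code is in the figure above; as an illustrative sketch only (names and seed are our own), the cluster level draws look like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(2011)
n_clusters = 15
var_slope = 15.0

# Random (cluster) intercepts: independent standard normals (mean 0, sd 1)
u_int = rng.normal(0.0, 1.0, size=n_clusters)

# Random (cluster) slope coefficients: independent normals with mean -1
# and the specified slope variance
u_slope = rng.normal(-1.0, np.sqrt(var_slope), size=n_clusters)

# Each of the 15 clusters now carries a unique intercept and slope term
cluster_effects = np.column_stack([u_int, u_slope])
```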

Once the cluster level data are created, we can generate the observation level data. We create a matrix of independent normal realizations to serve as the observations for each of the six continuous fixed effects as well as for the random slope variable. The realizations of each variable are generated from a multivariate normal distribution. The fixed effect predictors are also transformed for better model fitting.

figure d
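Again, the chapter's SAS/IML code is shown in the figure above. As a rough NumPy sketch of the same step (the total sample size, the identity covariance and the centering-and-scaling transformation are our assumptions; the chapter does not state which transformation it uses):

```python
import numpy as np

rng = np.random.default_rng(2011)
n_obs = 300      # illustrative total number of observations across clusters
n_fixed = 6

# Means near 1 for the fixed effects, and mean -1 for the random-slope variable
means = np.concatenate([1.0 + rng.normal(0.0, 0.1, size=n_fixed), [-1.0]])

# Independent realizations via a multivariate normal with identity covariance,
# reflecting the independence assumptions stated in the text
X = rng.multivariate_normal(means, np.eye(n_fixed + 1), size=n_obs)

# "Transformed for better model fitting": here we simply center and scale
# each fixed effect column (the exact transformation is an assumption)
X[:, :n_fixed] = (X[:, :n_fixed] - X[:, :n_fixed].mean(axis=0)) / X[:, :n_fixed].std(axis=0)
```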

We combine our simulated data to create two matrices. The first matrix is used to combine all fixed and random effects information, while the second matrix provides a reduced set of information for use in simulation of the response. This second matrix removes information on the true random slope coefficient and the true cluster ID and thus contains information on the six fixed effects and the random intercept term.

figure e
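The figure above holds the chapter's SAS/IML code for this bookkeeping step. An illustrative NumPy sketch (with stand-in arrays for the pieces built earlier; all shapes and names are our own) shows the two matrices side by side:

```python
import numpy as np

rng = np.random.default_rng(2011)
n_obs, n_fixed = 300, 6

# Stand-ins for quantities created in the previous steps
cluster_id = rng.integers(0, 15, size=n_obs)                          # true cluster ID
u_int = rng.normal(size=15)[cluster_id]                               # random intercept per obs
u_slope_coef = rng.normal(-1.0, np.sqrt(15.0), size=15)[cluster_id]   # true slope coefficient
X = rng.normal(1.0, 1.0, size=(n_obs, n_fixed))                       # six fixed effect columns
z = rng.normal(-1.0, 1.0, size=n_obs)                                 # random-slope variable

# First matrix: all fixed and random effects information
full = np.column_stack([cluster_id, u_int, u_slope_coef, X, z])

# Second ("reduced") matrix: drops the true slope coefficient and the true
# cluster ID, keeping the six fixed effects and the random intercept term
reduced = np.column_stack([u_int, X])
```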

The coefficients for the fixed effect predictors are set according to those specified at the start of the simulation. The cluster level (random) intercept is assigned a coefficient equal to the square root of the random intercept variance term; since the random intercepts were originally simulated from a standard normal distribution, this coefficient introduces the specified variance into the simulation. These coefficients are also standardized based on the standard deviation of the respective observations.

figure f
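The chapter's code for assigning coefficients is in the figure above. The following NumPy sketch illustrates the scaling logic; the particular fixed effect coefficient values are our own placeholders, not those used in the chapter:

```python
import numpy as np

rng = np.random.default_rng(2011)
n_fixed = 6
var_int = 7.0

# Placeholder fixed effect coefficients "as specified at the start of
# the simulation" (illustrative values only)
beta_fixed = np.array([0.5, -0.3, 0.8, 0.2, -0.6, 0.4])

# The random intercepts were simulated as standard normal, so multiplying
# by sqrt(var_int) introduces the specified variance of 7
beta_int = np.sqrt(var_int)

# Standardize each coefficient by the standard deviation of its column
X = rng.normal(1.0, 1.0, size=(300, n_fixed))
beta_std = beta_fixed / X.std(axis=0)
```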

We create our response as a function of these covariates. The simulated data are multiplied by the coefficients and the effect of the random slope is added in. The resulting value is then converted into a probability and used to create a binary response according to the Bernoulli distribution. This response is then combined with a “blinded” data matrix which has the value of the cluster intercept and the random slope coefficients removed. The final matrix is then output to a SAS data set with specified variable names.

figure g
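The response-generation code from the chapter is in the figure above. The core logic — linear predictor, inverse logit, Bernoulli draw — can be sketched in NumPy as follows (stand-in inputs and coefficient values are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(2011)
n_obs = 300

# Stand-ins for the combined matrix and coefficients built earlier
X = rng.normal(1.0, 1.0, size=(n_obs, 6))                  # fixed effects
u_int = rng.normal(size=n_obs) * np.sqrt(7.0)              # scaled random intercept
slope_coef = rng.normal(-1.0, np.sqrt(15.0), size=n_obs)   # cluster slope coefficient
z = rng.normal(-1.0, 1.0, size=n_obs)                      # random-slope variable
beta = np.array([0.5, -0.3, 0.8, 0.2, -0.6, 0.4])          # illustrative coefficients

# Linear predictor: fixed effects plus the random intercept, with the
# effect of the random slope added in
eta = X @ beta + u_int + slope_coef * z

# Convert to a probability via the inverse logit, then draw a binary
# response according to the Bernoulli distribution
p = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, p)
```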

The output data set is then analyzed using the GLIMMIX procedure in SAS. Each of the fixed effect predictors is added to the model one by one to determine the point, if any, at which the procedure fails. A partial example of these analyses is shown below.

figure h

6.2 Simulation Results

Although the GLIMMIX procedure is a powerful tool for fitting generalized linear mixed models, it is not uncommon to find that the procedure fails to provide results. We utilized a simulation study similar to the one described in the previous section to investigate the effect of the number of predictors on failure rates in the GLIMMIX procedure. A SAS macro was implemented to run the simulation across a variety of conditions, and the GLIMMIX procedure was used to analyze the data under each condition for 1000 replications per condition. Each simulated data set contained information on a binary outcome, an identifier label for cluster number, one (random) cluster level predictor, and six fixed effect predictors. For each simulated data set, the GLIMMIX procedure was used to analyze the data set six times, where each call to the procedure included one additional fixed effect predictor.

In particular, the conditions examined were the number of data clusters, the variances of the random intercept and random slope, and the strength of the fixed effect coefficients. The simulation took into account data sets with either 3, 15 or 45 clusters of data. The random effect variances we investigated included all combinations of low, medium and high variances for the random intercept and random slope, yielding a total of nine different variance combinations. The fixed effects also took three levels of strength: weak, moderate or strong.

Table 1 Failure rates for the GLIMMIX procedure (three clusters)
Table 2 Failure rates for the GLIMMIX procedure (fifteen clusters)
Table 3 Failure rates for the GLIMMIX procedure (forty-five clusters)

The results of this simulation are given in Tables 1, 2 and 3 and are also displayed graphically in Fig. 1. These displays provide the failure rates for the 1000 simulations conducted under each of the specified conditions; thus, higher values indicate poorer performance, as a higher proportion of the calls to the GLIMMIX procedure failed to provide results. Tables 1, 2 and 3 divide the results of the simulations based on the number of clusters in each simulation: Table 1 summarizes the simulations with 3 data clusters each, Table 2 those with 15 data clusters each and Table 3 those with 45 data clusters each. The first column in each table provides the strength of the fixed effect predictors (weak, moderate or strong). The second and third columns denote the simulation settings for the variance of the random intercept and slope, respectively, where each variance term takes one of three levels (low, medium, high). The remaining six columns contain the failure rates, as proportions, for the GLIMMIX procedure with a given number of fixed effect predictors. For instance, we can see from the first data row and last column of Table 1 that, of the 1000 simulations with three clusters, weak fixed effects, low intercept variance and low slope variance, 84.4% of the models with six fixed effect predictors failed to converge.

Fig. 1
figure 1

Failure rates for the GLIMMIX procedure

Figure 1 provides a graphical representation of the same simulated data presented in Tables 1, 2 and 3. Each individual plot contains three lines representing the failure rates for each of the three strengths of the fixed effects: the blue line represents the simulations with weak predictors, the red line those with moderate predictors and the green line those with strong predictors. The vertical (Y) axis of each individual plot denotes the failure rate as a percentage, where higher values indicate higher rates of failure. The horizontal (X) axis within each of the individual plots represents the number of fixed effects included in the model for those simulations. The individual plots are organized into three columns according to the number of data clusters and are further grouped into nine rows according to the strength of the random effects. For example, the individual plot in the last column of the first row contains information on the 1000 simulations in which there were 45 clusters, with weak fixed effect predictors, low random intercept variance and low random slope variance.

In general, as the number of predictors increased, the failure rates also increased. Notable exceptions include the cases where there is very little variance in the random effects: with low random intercept variance and low random slope variance, the failure rates may actually decrease, or increase only slightly. The effect of increasing the number of predictors is also suppressed when there are more data clusters; in general, the GLIMMIX procedure is more successful in analyzing data with more clusters, as illustrated by the lower failure rates. Similarly, data with stronger overall random effect variance are less susceptible to failure as the number of predictors in the model increases. This holds true with respect to both the random intercept variance and the random slope variance.

7 Analysis of Data

7.1 Description

A subset of data from the 2011 Bangladesh Demographic and Health Survey is used in this study. This subset contains information on 1000 women between the ages of 10 and 49 living in Bangladesh. The data in this study are hierarchical in nature in that each woman is nested within one of seven different districts, which correspond approximately to administrative regions in Bangladesh (NIPORT 2013). A simplified version of this structure is represented in Fig. 2.

The outcome of interest in this data set is a binary variable representing the woman’s knowledge of AIDS, taking the value 1 for knowledge of AIDS and 0 for no knowledge of AIDS. In addition to this outcome, the data set also includes information on the woman’s wealth index, age and number of living children, as well as whether the woman lives in an urban or rural setting. Wealth index had five possible levels representing the quintile to which the woman belonged. Age represented the woman’s age at the time of the survey, while number of living children represented how many living children the woman had at that time. The urban/rural variable was a district level predictor, as its values were partially driven by the administrative region.

Fig. 2
figure 2

Hierarchical structure in 2011 DHS Study

Please note to use the included DHS subset data, you must register as a DHS data user at: http://www.dhsprogram.com/data/new-user-registration.cfm. This subset data must not be passed on to others without the written consent of DHS (archive@dhsprogram.com). You are required to submit a copy of any reports/publications resulting from using this subset data to: archive@dhsprogram.com.

7.2 Data Analysis

We fit a logistic regression model with one random intercept and one random slope for the urban/rural variable. For these data, the random effects were used to address the clustering present due to districts. Each of these models was fitted using the GLIMMIX procedure in SAS. The first model included one fixed effect predictor for wealth index.

$$ \log \left[ \frac{p_{jk}}{1-p_{jk}} \right] = \beta_0 + \gamma_1 \mathrm{Urban}_{j} + \gamma_{1j} \mathrm{Wealth}_{jk} + u_{0j} $$

As in the data simulation section, these data can be analyzed in SAS using code similar to the example given below. Note that additional fixed effects predictors can be included in the model statement to fit additional models.

figure i

The output indicates that the convergence criterion was satisfied and that standard errors are provided for the random effects. Therefore, we see that the procedure was successful in fitting the model.

figure j

We also fit the model which included fixed effects for both wealth and age.

$$ \log \left[ \frac{p_{jk}}{1-p_{jk}} \right] = \beta_0 + \gamma_1 \mathrm{Urban}_{j} + \gamma_{1j} \mathrm{Wealth}_{jk} + \gamma_{2j} \mathrm{Age}_{jk} + u_{0j} $$

In this case, we can similarly see that the convergence criterion is satisfied and that estimates of the standard errors of the random effects are provided. Thus, we see that the GLIMMIX procedure was successful in fitting a model.

figure k

We added a third predictor for number of living children to our mixed model.

$$ \log \left[ \frac{p_{jk}}{1-p_{jk}} \right] = \beta_0 + \gamma_1 \mathrm{Urban}_{j} + \gamma_{1j} \mathrm{Wealth}_{jk} + \gamma_{2j} \mathrm{Age}_{jk} + \gamma_{3j} \mathrm{Children}_{jk} + u_{0j} $$

With the inclusion of this third predictor, we see that the GLIMMIX procedure fails to converge and consequently does not provide estimates of the standard errors of the random effects. Hence, we see that, although SAS is able to fit the model with two fixed effects, the inclusion of a third fixed effect leads to failure.

figure l

Although we do not explore its use in depth here, the %hpglimmix macro provides an alternative approach in SAS for fitting generalized linear mixed models (Xie and Madden 2014). This macro offers improvements in memory usage and processing time and supports the fitting of more complicated models than the GLIMMIX procedure. Although the macro does not currently provide standard errors of the covariance parameter estimates or Type III test results, it can be useful when other approaches fail to resolve convergence issues in the GLIMMIX procedure. We fit the previously discussed model, which includes three fixed effect predictors as well as one random intercept and one random slope, to the Bangladesh data. After loading the macro into the current SAS session, the model can be run using code similar to the following.

figure m

Though this model fails to converge in the GLIMMIX procedure, we see that %hpglimmix provides results for the model which includes three fixed effect predictors.

figure n

Another possible remedy in this case is found in the NLMIXED procedure in SAS. This procedure utilizes likelihood-based approaches to fit nonlinear mixed models (Wolfinger 1999). It is readily available in SAS software and provides techniques similar to those available in the GLIMMIX procedure. Although the models that can be fit in the two procedures are similar, it is worth noting that the procedures use different estimation techniques, so results may vary between the two approaches. Because different estimation techniques are employed, there are also cases in which the NLMIXED procedure will converge while the GLIMMIX procedure will not.

The NLMIXED procedure is implemented differently from many other procedures in SAS software. In particular, one must provide starting values for each of the parameters of interest, which can be obtained in a number of ways. In this example, we first used the LOGISTIC procedure to obtain estimates of the fixed effect parameters and specified a generic value of 1 for the variance of each of our random effects (intercept and slope). We then wrote an equation in terms of our parameters and observed predictor values, and used this equation in the MODEL statement to compute the probability through the logit link. Finally, each of the random effects and its distribution was specified, and the subject identifier was assigned.

figure o
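The NLMIXED call itself is shown in the figure above. To illustrate the starting-value step in a language-neutral way, the following sketch fits an ordinary logistic regression by Newton-Raphson (a stand-in for the PROC LOGISTIC fit; the data, names and iteration count are our own) and then packages the starting values as described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Illustrative design: intercept column plus three continuous predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([0.2, 0.5, -0.4, 0.3])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

# Newton-Raphson iterations for the ordinary logistic fit
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                       # score vector
    hess = X.T @ (X * (p * (1 - p))[:, None])  # observed information
    beta = beta + np.linalg.solve(hess, grad)

# NLMIXED-style starting values: fixed effects from the logistic fit,
# and a generic value of 1 for each random-effect variance
starts = {"fixed": beta, "var_intercept": 1.0, "var_slope": 1.0}
```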

We found that the NLMIXED procedure converges successfully and also provides solutions for both our fixed and random effects for the model which includes three fixed effects predictors. Wolfinger (1999) provides a good introduction to the NLMIXED procedure and its usage, as well as some of the underlying calculations.

figure p

In general, we found that the results of our data analysis agree with the findings of the simulation study. The GLIMMIX procedure was successful in fitting the models with fewer fixed effect predictors. However, once we included additional fixed effects, the GLIMMIX procedure failed to converge. In these cases, we may choose to investigate only the smaller subset of predictors in order to obtain successful analyses. Alternatively, if the larger number of predictors is of interest, we can use the %hpglimmix macro, which is able to achieve convergence, although its output is reduced. We may also turn to the NLMIXED procedure, which employs different estimation methods.

8 Conclusions

Fitting hierarchical logistic regression models to binary survey data is common in a number of disciplines. These models are useful in analyzing survey data in the presence of clustering or correlation, which would otherwise make standard approaches inappropriate due to the lack of independence among the outcomes. Although there are a number of powerful approaches for fitting these models, such as the GLIMMIX and NLMIXED procedures in SAS, the computational complexity of the algorithms can often lead to failures in convergence.

Through the use of simulations, we obtained useful information for exploring the reasons for non-convergence, as well as steps to avoid these issues. In particular, when using the GLIMMIX procedure, researchers should be careful in selecting predictors to include in the model. The inclusion of too many predictors can lead to convergence issues, regardless of whether these predictors are fixed or random. When many predictors must be included due to research or knowledge constraints, and the GLIMMIX procedure fails to converge, other options can be explored to fit similar models. Because it utilizes different approaches, the NLMIXED procedure is a viable option for obtaining convergence in the mixed model setting when the GLIMMIX procedure fails. Recent advances, such as the %hpglimmix macro, can also be utilized as a remedy.

While we concentrated on, and presented results applicable only to, the convergence issue in the GLIMMIX procedure for two-level hierarchical logistic regression models, we believe that these approaches can be readily adapted and expanded to explore different or more complex problems. In general, Monte Carlo simulation offers a fast and inexpensive avenue for investigating problems such as convergence, as well as appropriate solutions.