Logistic regression is the most common method used to model binary response data. A binary response typically takes the form 1/0, with 1 generally indicating a success and 0 a failure. What 1 and 0 actually stand for varies widely, depending on the purpose of the study. For example, in a study of the odds of failure in a school setting, 1 may denote fail, and 0 not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed.

Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally (Gaussian) distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of these assumptions are violated. Just as the normal regression model is based on the Gaussian probability distribution function (pdf), a binary response model is derived from the Bernoulli distribution, which is a special case of the binomial pdf with the binomial denominator taking the value of 1. The Bernoulli pdf may be expressed as:

$$ f({y}_{i};{\pi }_{i}) = {\pi }_{i}^{\;{y}_{i} }{(1 - {\pi }_{i})}^{1-{y}_{i} }. $$
(1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression, count response models such as Poisson and negative binomial regression, and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential family distribution appropriate for generalized linear models may be expressed as

$$f({y}_{i};{\theta }_{i},\phi ) =\exp \{ ({y}_{i}{\theta }_{i} - b({\theta }_{i}))/\alpha (\phi ) + c({y}_{i};\phi )\},$$
(2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM model.

We may structure the Bernoulli distribution (1) into exponential family form (2) as:

$$f({y}_{i};{\pi }_{i}) =\exp \{ {y}_{i}\ln ({\pi }_{i}/(1 - {\pi }_{i})) +\ln (1 - {\pi }_{i})\}.$$
(3)

The link function is therefore \(\ln (\pi /(1 - \pi )),\) and cumulant \(-\ln (1 - \pi )\) or \(\ln (1/(1 - \pi )).\) For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative, π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.
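The two derivatives quoted above can be checked by writing the cumulant as a function of θ. As a brief sketch, take \(\theta =\ln (\pi /(1 - \pi )),\) so that \(\pi ={e}^{\theta }/(1 + {e}^{\theta })\) and \(1 - \pi = 1/(1 + {e}^{\theta })\); then

$$b(\theta ) = -\ln (1 - \pi ) =\ln (1 + {e}^{\theta }),\qquad b'(\theta ) = \frac{{e}^{\theta }} {1 + {e}^{\theta }} = \pi ,\qquad b''(\theta ) = \frac{{e}^{\theta }} {{(1 + {e}^{\theta })}^{2}} = \pi (1 - \pi ).$$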

Estimation of statistical models using the GLM algorithm, as well as by MLE, is based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf which seeks to estimate π, for example, rather than y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π, when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as:

$$L({\mu }_{i};{y}_{i}) ={ \sum \limits _{i=1}^{n}}\{{y}_{ i}\ln ({\mu }_{i}/(1 - {\mu }_{i})) +\ln (1 - {\mu }_{i})\},$$
(4)

or

$$L({\mu }_{i};{y}_{i}) ={ \sum \limits _{i=1}^{n}}\{{y}_{ i}\ln ({\mu }_{i}) + (1 - {y}_{i})\ln (1 - {\mu }_{i})\}.$$
(5)
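As a quick numerical check that (4) and (5) are the same function, the following Python sketch evaluates both forms. The function names and the small data set are hypothetical, chosen only for illustration.

```python
import math

def loglik_canonical(y, mu):
    """Equation (4): sum of y*ln(mu/(1-mu)) + ln(1-mu)."""
    return sum(yi * math.log(mi / (1.0 - mi)) + math.log(1.0 - mi)
               for yi, mi in zip(y, mu))

def loglik_standard(y, mu):
    """Equation (5): sum of y*ln(mu) + (1-y)*ln(1-mu)."""
    return sum(yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
               for yi, mi in zip(y, mu))

# Hypothetical binary responses and fitted probabilities (illustration only).
y = [1, 0, 1, 1, 0]
mu = [0.8, 0.3, 0.6, 0.9, 0.2]

ll4 = loglik_canonical(y, mu)
ll5 = loglik_standard(y, mu)
print(ll4, ll5)  # the two values agree up to floating-point rounding
```

The two forms differ only in how the terms are grouped: (4) keeps the exponential family layout, while (5) is the form usually coded in software.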

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2{\sum \limits _{i=1}^{n}}\{{y}_{ i}\ln ({y}_{i}/{\mu }_{i}) + (1 - {y}_{i})\ln ((1 - {y}_{i})/(1 - {\mu }_{i}))\}.$$
(6)
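The deviance in (6) can be sketched in a few lines of Python. For a binary response the saturated model reproduces each observation exactly, so its log-likelihood is 0 and the deviance reduces to minus twice the model log-likelihood; the code checks this. The helper names and data are hypothetical, and the convention 0·ln(0) = 0 is assumed for the terms that vanish.

```python
import math

def dev_term(a, b):
    """a*ln(a/b), using the convention that the term is 0 when a == 0."""
    return 0.0 if a == 0 else a * math.log(a / b)

def bernoulli_deviance(y, mu):
    """Equation (6) for binary y and fitted probabilities mu."""
    return 2.0 * sum(dev_term(yi, mi) + dev_term(1 - yi, 1.0 - mi)
                     for yi, mi in zip(y, mu))

def bernoulli_loglik(y, mu):
    """Model log-likelihood, as in equation (5)."""
    return sum(yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
               for yi, mi in zip(y, mu))

# Hypothetical binary responses and fitted probabilities (illustration only).
y = [1, 0, 1, 1, 0]
mu = [0.8, 0.3, 0.6, 0.9, 0.2]

dev = bernoulli_deviance(y, mu)
ll = bernoulli_loglik(y, mu)
print(dev, -2.0 * ll)  # equal for a binary response
```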

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′ β. For the normal model, the predicted fit, \(\hat{y}\), is identical to x′ β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, \(\ln ({\mu }_{i}/(1 - {\mu }_{i})).\) We have, therefore,

$${x}_{i}'\beta =\ln ({\mu }_{i}/(1 - {\mu }_{i})) = {\beta }_{0} + {\beta }_{1}{x}_{1} + {\beta }_{2}{x}_{2} + \cdots + {\beta }_{n}{x}_{n}.$$
(7)

The value of \({\mu }_{i}\), for each observation in the logistic model, is calculated as

$${\mu }_{i} = 1/\left (1 +\exp \left (-{x}_{i}'\beta \right )\right ) =\exp \left ({x}_{i}'\beta \right )/\left (1 +\exp \left ({x}_{i}'\beta \right )\right ).$$
(8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.
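A minimal Python sketch of the link in (7) and the inverse link in (8) follows. The function names and the coefficient values are assumptions made for illustration; the branch on the sign of the linear predictor simply picks whichever of the two equivalent forms in (8) avoids overflow in the exponential.

```python
import math

def logit(mu):
    """Link function: the log-odds ln(mu/(1-mu))."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link (8): converts a linear predictor to a probability."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))   # first form in (8)
    e = math.exp(eta)
    return e / (1.0 + e)                      # second form in (8)

# Hypothetical coefficients and predictor value (illustration only).
b0, b1 = -1.0, 0.5
x = 3.0
eta = b0 + b1 * x        # linear predictor x'beta
mu = inv_logit(eta)      # fitted probability
print(eta, mu, logit(mu))  # logit(mu) recovers eta
```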

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′ β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta )} {\partial \beta } ={ \sum \limits _{i=1}^{n}}({y}_{ i} - {\mu }_{i}){x}_{i}$$
(9)
$$\frac{{\partial }^{2}L(\beta )} {\partial \beta \partial \beta '} = -{\sum \limits _{i=1}^{n}} \left \{{x}_{ i}{x'}_{i}{\mu }_{i}(1 - {\mu }_{i}) \right \}$$
(10)
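To make the roles of (9) and (10) concrete, here is a minimal Newton–Raphson sketch in plain Python for a model with an intercept and a single predictor. It is an illustration under stated assumptions, not the algorithm of any particular package: the function name, starting values, tolerance, and data are all hypothetical. With a single binary predictor the maximum likelihood estimates have a closed form (the log-odds in the x = 0 group, and the difference in log-odds between groups), which the example uses as a check.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def newton_raphson_logistic(x, y, tol=1e-10, max_iter=50):
    """Fit ln(mu/(1-mu)) = b0 + b1*x; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient (9), design row (1, x_i)
        h00 = h01 = h11 = 0.0    # negative Hessian, i.e. minus (10)
        for xi, yi in zip(x, y):
            mu = inv_logit(b0 + b1 * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve (-H) d = g for the 2x2 case, then update beta.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data: 2 of 5 successes when x = 0, 3 of 4 when x = 1.
x = [0, 0, 0, 0, 0, 1, 1, 1, 1]
y = [1, 1, 0, 0, 0, 1, 1, 1, 0]
b0, b1 = newton_raphson_logistic(x, y)

# Closed-form check for a single binary predictor.
b0_exact = math.log((2 / 5) / (3 / 5))
b1_exact = math.log((3 / 4) / (1 / 4)) - b0_exact
print(b0, b0_exact, b1, b1_exact)
```

The inverse of the negative Hessian at convergence is the estimated variance-covariance matrix of the parameter estimates; the square roots of its diagonal give the standard errors mentioned above.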

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of \(\mu ,\,\ln (\mu /(1 - \mu )),\) where the odds are understood as the success of μ over its failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

The odds of survival for adult passengers is 442/765, or 0.578. The odds of survival for children is 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.52709552. This value is referred to as the odds ratio, or ratio of the two component odds relationships. Using a logistic regression procedure to estimate the odds ratio of age produces the following results
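The arithmetic above can be reproduced directly from the four counts quoted in the text. The variable names in this short Python sketch are illustrative only.

```python
# Counts quoted in the text: survivors and non-survivors by age group.
adult_survived, adult_died = 442, 765
child_survived, child_died = 57, 52

odds_adult = adult_survived / adult_died   # about 0.578
odds_child = child_survived / child_died   # about 1.096
odds_ratio = odds_adult / odds_child       # about 0.527

print(round(odds_adult, 3), round(odds_child, 3), round(odds_ratio, 8))
```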

Table 1

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

Table 2

The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above (1/0.527 ≈ 1.9), we may conclude that children had some 90%, or nearly two times, greater odds of surviving than did adults.

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data, and the odds ratio was calculated as 1.015, we would interpret the relationship as:

The odds of surviving are one and a half percent greater for each increasing year of age.
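Because the model is linear on the log-odds scale, the odds ratio for a change of several units is the one-unit odds ratio raised to that power. The sketch below applies this to the illustrative value of 1.015 used above; the function name and the ten-year comparison are assumptions added for illustration.

```python
import math

def odds_ratio_for_change(or_per_unit, delta):
    """Odds ratio for a change of `delta` units, given the one-unit odds ratio."""
    beta = math.log(or_per_unit)    # coefficient on the log-odds scale
    return math.exp(beta * delta)   # same as or_per_unit ** delta

or_one_year = 1.015
or_ten_years = odds_ratio_for_change(or_one_year, 10)
print(or_ten_years)  # roughly 1.16, i.e. about 16% greater odds over ten years
```

Note that the ten-year effect is a little larger than ten times 1.5%, since odds ratios compound multiplicatively rather than add.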

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates; i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator (y), indicating the number of successes (1s) for a specific covariate pattern, and a denominator (m), the number of observations having the specific covariate pattern. The response, y successes out of m, is binomially distributed as:

$$f({y}_{i};{\pi }_{i},{m}_{i}) = \left (\begin{matrix} {m}_{i} \\ {y}_{i} \end{matrix} \right ){\pi }_{i}^{\;{y}_{i} }{(1-{\pi }_{i})}^{{m}_{i}-{y}_{i} },$$
(11)

with a corresponding log-likelihood function expressed as

$$L({\mu }_{i};{y}_{i},{m}_{i}) ={ \sum \limits _{i=1}^{n}} \left \{ {y}_{i}\ln ({\mu }_{i}/(1 - {\mu }_{i})) + {m}_{i}\ln (1 - {\mu }_{i}) +\ln \left (\begin{matrix} {m}_{i} \\ {y}_{i} \end{matrix} \right ) \right \}.$$
(12)

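Taking derivatives of the cumulant, \(-{m}_{i}\ln (1 - {\mu }_{i}),\) in which \({\mu }_{i}\) stands for the success probability \({\pi }_{i}\), as we did for the binary response model, produces a mean of \({m}_{i}{\pi }_{i}\) and a variance of \({m}_{i}{\pi }_{i}(1 - {\pi }_{i}).\) Writing the mean as \({\mu }_{i} = {m}_{i}{\pi }_{i}\), the variance may equivalently be expressed as \({\mu }_{i}(1 - {\mu }_{i}/{m}_{i}).\)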
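The link between the grouped and the binary forms can be illustrated numerically. In the following Python sketch the grouped log-likelihood (12), including the log of the binomial coefficient, differs from the individual-level Bernoulli log-likelihood only by a constant that does not involve the parameters, which is why both data layouts lead to the same estimates. The function names, counts, and probabilities are hypothetical.

```python
import math

def binomial_loglik(y, m, p):
    """Equation (12), with p the per-observation success probability."""
    return sum(yi * math.log(pi / (1.0 - pi)) + mi * math.log(1.0 - pi)
               + math.log(math.comb(mi, yi))
               for yi, mi, pi in zip(y, m, p))

def bernoulli_loglik(y, p):
    """Individual-level log-likelihood, as in equation (5)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))

# Hypothetical grouped data: successes, cases, and success probabilities.
y_grp = [1, 2, 3]
m_grp = [3, 4, 3]
p_grp = [0.3, 0.5, 0.8]

# Expand each covariate pattern into individual 1/0 observations.
y_ind, p_ind = [], []
for yi, mi, pi in zip(y_grp, m_grp, p_grp):
    y_ind += [1] * yi + [0] * (mi - yi)
    p_ind += [pi] * mi

ll_grouped = binomial_loglik(y_grp, m_grp, p_grp)
ll_individual = bernoulli_loglik(y_ind, p_ind)
constant = sum(math.log(math.comb(mi, yi)) for yi, mi in zip(y_grp, m_grp))
print(ll_grouped - constant, ll_individual)  # equal up to rounding
```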

Consider the data below:

y indicates the number of times a specific pattern of covariates is successful. Cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having predictor values of \({x}_{1} = 1,{x}_{2} = 0,\mbox{ and }{x}_{3} = 1.\) Of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

Table 3

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, 10,000 observations, being grouped into 10–15 rows or observations as described above. Data in tables is nearly always expressed in grouped format.

Table 4

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once well used, is now only used with caution. The test is heavily influenced by the manner in which tied data is classified. Because the test compares observed with expected probabilities across levels, it is now preferred to construct several tables of risk having different numbers of levels. If there is consistency in results across tables, then the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC) (see Akaike’s Information Criterion and Akaike’s Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC both are biased when data is correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.
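The article does not spell out the formulas, so the sketch below assumes the commonly used definitions, AIC = −2L + 2k and BIC = −2L + k ln(n), where L is the model log-likelihood, k the number of estimated parameters, and n the number of observations. Some texts and packages use scaled or adjusted variants. The log-likelihood values are hypothetical.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion (common form, assumed here)."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion (common form, assumed here)."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical log-likelihoods for two competing models fit to n = 100 cases.
n = 100
aic_small, bic_small = aic(-62.0, 2), bic(-62.0, 2, n)
aic_large, bic_large = aic(-61.5, 4), bic(-61.5, 4, n)

# Lower values indicate the preferred model.
print(aic_small, aic_large)
print(bic_small, bic_large)
```

In this made-up comparison the larger model improves the log-likelihood only slightly, so both criteria favor the smaller model once the penalty for the extra parameters is applied.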

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author

Joseph M. Hilbe is an emeritus professor, University of Hawaii and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory, at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association (1994–1996). Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press), and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2002, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (2001, 2007, Stata Press), and with Robert Muenchen is co-author of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for 12 years from 1997–2009. He was founding editor of the Stata Technical Bulletin (1991), and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico in 2009, two years following his induction into the University’s Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References

Case-Control Studies

Categorical Data Analysis

Generalized Linear Models

Multivariate Data Analysis: An Overview

Probit Analysis

Recursive Partitioning

Regression Models with Symmetrical Errors

Robust Regression Estimation in Generalized Linear Models

Statistics: An Overview

Target Estimation: A New Approach to Parametric Estimation