Logistic Regression

Hilbe, Joseph M.

doi:10.1007/978-3-642-04898-2_344

Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of 1/0, with 1 generally indicating a success and 0 a failure. However, the actual meanings that 1 and 0 carry vary widely, depending on the purpose of the study. For example, for a study of the odds of failure in a school setting, 1 may have the value of fail, and 0 of not-fail, or pass. The important point is that 1 indicates the foremost subject of interest for which a binary response study is designed.

Modeling a binary response variable using normal linear regression introduces substantial bias into the parameter estimates. The standard linear model assumes that the response and error terms are normally (Gaussian) distributed, that the variance, σ², is constant across observations, and that observations in the model are independent. When a binary variable is modeled using this method, the first two of these assumptions are violated.

Just as the normal regression model is based on the Gaussian probability distribution function (pdf), a binary response model is derived from the Bernoulli distribution, which is the special case of the binomial pdf in which the binomial denominator takes the value 1. The Bernoulli pdf may be expressed as:

$$ f({y}_{i};{\pi }_{i}) = {\pi }_{i}^{\;{y}_{i} }{(1 - {\pi }_{i})}^{1-{y}_{i} }. $$

(1)

Binary logistic regression derives from the canonical form of the Bernoulli distribution. The Bernoulli pdf is a member of the exponential family of probability distributions, which has properties allowing for a much easier estimation of its parameters than traditional Newton–Raphson-based maximum likelihood estimation (MLE) methods.

In 1972 Nelder and Wedderburn discovered that it was possible to construct a single algorithm for estimating models based on the exponential family of distributions. The algorithm was termed generalized linear models (GLM), and became a standard method to estimate binary response models such as logistic, probit, and complementary log-log regression; count response models such as Poisson and negative binomial regression; and continuous response models such as gamma and inverse Gaussian regression. The standard normal model, or Gaussian regression, is also a generalized linear model, and may be estimated under its algorithm. The form of the exponential family distribution appropriate for generalized linear models may be expressed as

$$f({y}_{i};{\theta }_{i},\phi ) =\exp \{ ({y}_{i}{\theta }_{i} - b({\theta }_{i}))/\alpha (\phi ) + c({y}_{i};\phi )\},$$

(2)

with θ representing the link function, α(ϕ) the scale parameter, b(θ) the cumulant, and c(y; ϕ) the normalization term, which guarantees that the probability function sums to 1. The link, a monotonically increasing function, linearizes the relationship of the expected mean and explanatory predictors. The scale, for binary and count models, is constrained to a value of 1, and the cumulant is used to calculate the model mean and variance functions. The mean is given as the first derivative of the cumulant with respect to θ, b′(θ); the variance is given as the second derivative, b″(θ). Taken together, the above four terms define a specific GLM.

We may structure the Bernoulli distribution (1) into exponential family form (2) as:

$$f({y}_{i};{\pi }_{i}) =\exp \{ {y}_{i}\ln ({\pi }_{i}/(1 - {\pi }_{i})) +\ln (1 - {\pi }_{i})\}.$$

(3)

The link function is therefore $\ln (\pi /(1 - \pi )),$ and cumulant $-\ln (1 - \pi )$ or $\ln (1/(1 - \pi )).$ For the Bernoulli, π is defined as the probability of success. The first derivative of the cumulant is π, the second derivative, π(1 − π). These two values are, respectively, the mean and variance functions of the Bernoulli pdf. Recalling that the logistic model is the canonical form of the distribution, meaning that it is the form that is directly derived from the pdf, the values expressed in (3), and the values we gave for the mean and variance, are the values for the logistic model.
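As a quick numerical check of the relationships among (1), (2), and (3), the following sketch evaluates the Bernoulli pdf in both forms and approximates the derivatives of the cumulant by finite differences. It is an illustration only: the function names, the test value π = 0.3, and the step size `h` are assumptions made here, and the cumulant is rewritten in terms of θ as ln(1 + exp(θ)), which equals −ln(1 − π) when θ = ln(π/(1 − π)).

```python
import math

def bernoulli_pdf(y, pi):
    """Equation (1): pi^y * (1 - pi)^(1 - y) for y in {0, 1}."""
    return pi ** y * (1.0 - pi) ** (1 - y)

def bernoulli_expfam(y, pi):
    """Equation (3): exp{y * ln(pi / (1 - pi)) + ln(1 - pi)}."""
    theta = math.log(pi / (1.0 - pi))      # the link (logit)
    return math.exp(y * theta + math.log(1.0 - pi))

def cumulant(theta):
    """b(theta) = -ln(1 - pi), expressed in theta as ln(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

pi = 0.3                                   # illustrative probability of success
theta = math.log(pi / (1.0 - pi))
h = 1e-4                                   # finite-difference step (assumption)

# Central differences approximate b'(theta) and b''(theta).
b_prime = (cumulant(theta + h) - cumulant(theta - h)) / (2 * h)
b_double_prime = (cumulant(theta + h) - 2 * cumulant(theta)
                  + cumulant(theta - h)) / h ** 2

print(bernoulli_pdf(1, pi), bernoulli_expfam(1, pi))   # the two forms agree
print(b_prime, pi)                                     # mean function
print(b_double_prime, pi * (1 - pi))                   # variance function
```

Up to finite-difference error, the first derivative reproduces π and the second reproduces π(1 − π), the mean and variance functions stated above.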

Estimation of statistical models by the GLM algorithm and by MLE is in both cases based on the log-likelihood function. The likelihood is simply a re-parameterization of the pdf in which the parameter (π, for example) is treated as the unknown to be estimated, given the observed y. The log-likelihood is formed from the likelihood by taking the natural log of the function, allowing summation across observations during the estimation process rather than multiplication.

The traditional GLM symbol for the mean, μ, is typically substituted for π, when GLM is used to estimate a logistic model. In that form, the log-likelihood function for the binary-logistic model is given as:

$$L({\mu }_{i};{y}_{i}) ={ \sum \limits _{i=1}^{n}}\{{y}_{ i}\ln ({\mu }_{i}/(1 - {\mu }_{i})) +\ln (1 - {\mu }_{i})\},$$

(4)

or

$$L({\mu }_{i};{y}_{i}) ={ \sum \limits _{i=1}^{n}}\{{y}_{ i}\ln ({\mu }_{i}) + (1 - {y}_{i})\ln (1 - {\mu }_{i})\}.$$

(5)
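Forms (4) and (5) are algebraically identical, which the short sketch below confirms numerically. The response vector `y` and the fitted values `mu` are made-up illustrative numbers, not data from this entry.

```python
import math

def loglik_logit_form(y, mu):
    """Equation (4): sum of y*ln(mu/(1-mu)) + ln(1-mu)."""
    return sum(yi * math.log(mi / (1.0 - mi)) + math.log(1.0 - mi)
               for yi, mi in zip(y, mu))

def loglik_standard_form(y, mu):
    """Equation (5): sum of y*ln(mu) + (1-y)*ln(1-mu)."""
    return sum(yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
               for yi, mi in zip(y, mu))

# Hypothetical binary responses and fitted probabilities.
y = [1, 0, 1, 1, 0]
mu = [0.8, 0.3, 0.6, 0.9, 0.2]

ll4 = loglik_logit_form(y, mu)
ll5 = loglik_standard_form(y, mu)
print(ll4, ll5)   # the two values coincide up to rounding
```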

The Bernoulli-logistic log-likelihood function is essential to logistic regression. When GLM is used to estimate logistic models, many software algorithms use the deviance rather than the log-likelihood function as the basis of convergence. The deviance, which can be used as a goodness-of-fit statistic, is defined as twice the difference of the saturated log-likelihood and model log-likelihood. For the logistic model, the deviance is expressed as

$$D = 2{\sum \limits _{i=1}^{n}}\{{y}_{ i}\ln ({y}_{i}/{\mu }_{i}) + (1 - {y}_{i})\ln ((1 - {y}_{i})/(1 - {\mu }_{i}))\}.$$

(6)
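A minimal sketch of (6) follows, using the usual convention that 0·ln(0) is taken as 0 so that terms with y = 0 or y = 1 are well defined. For a strictly binary response the saturated log-likelihood is zero, so the deviance reduces to −2 times the model log-likelihood; the code checks this on made-up values (`y` and `mu` are illustrative assumptions).

```python
import math

def xlogy_ratio(a, b):
    """a * ln(a / b), with the convention that the term is 0 when a == 0."""
    return 0.0 if a == 0 else a * math.log(a / b)

def deviance(y, mu):
    """Equation (6) for a binary (1/0) response."""
    return 2.0 * sum(xlogy_ratio(yi, mi) + xlogy_ratio(1 - yi, 1.0 - mi)
                     for yi, mi in zip(y, mu))

def loglik(y, mu):
    """Equation (5), the Bernoulli-logistic log-likelihood."""
    return sum(yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
               for yi, mi in zip(y, mu))

# Hypothetical binary responses and fitted probabilities.
y = [1, 0, 1, 1, 0]
mu = [0.8, 0.3, 0.6, 0.9, 0.2]

D = deviance(y, mu)
print(D, -2.0 * loglik(y, mu))   # equal for binary y
```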

Whether estimated using maximum likelihood techniques or as GLM, the value of μ for each observation in the model is calculated on the basis of the linear predictor, x′ β. For the normal model, the predicted fit, $\hat{y}$, is identical to x′ β, the right side of (7). However, for logistic models, the response is expressed in terms of the link function, $\ln ({\mu }_{i}/(1 - {\mu }_{i})).$ We have, therefore,

$${x}_{i}'\beta =\ln ({\mu }_{i}/(1 - {\mu }_{i})) = {\beta }_{0} + {\beta }_{1}{x}_{1} + {\beta }_{2}{x}_{2} + \cdots + {\beta }_{n}{x}_{n}.$$

(7)

The value of μ_i, for each observation in the logistic model, is calculated as

$${\mu }_{i} = 1/\left (1 +\exp \left (-{x}_{i}'\beta \right )\right ) =\exp \left ({x}_{i}'\beta \right )/\left (1 +\exp \left ({x}_{i}'\beta \right )\right ).$$

(8)

The functions to the right of μ are commonly used ways of expressing the logistic inverse link function, which converts the linear predictor to the fitted value. For the logistic model, μ is a probability.
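The link (7) and inverse link (8) can be sketched as a pair of small functions. The coefficient vector `beta` and covariate vector `x` below are hypothetical values chosen only to show the round trip from linear predictor to probability and back; the branch on the sign of `eta` is a common way to avoid overflow in `exp` and is an implementation choice, not something prescribed by the text.

```python
import math

def logit(mu):
    """Link function: ln(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link (8); both algebraic forms are used to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Hypothetical coefficients (intercept first) and one covariate pattern.
beta = [-1.5, 0.8, 0.05]
x = [1.0, 1.0, 20.0]          # leading 1.0 multiplies the intercept

eta = sum(b * xi for b, xi in zip(beta, x))   # linear predictor x'beta
mu = inv_logit(eta)                            # fitted probability
print(eta, mu, logit(mu))                      # logit(mu) recovers eta
```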

When logistic regression is estimated using a Newton–Raphson type of MLE algorithm, the log-likelihood function is parameterized in terms of x′ β rather than μ. The estimated fit is then determined by taking the first derivative of the log-likelihood function with respect to β, setting it to zero, and solving. The first derivative of the log-likelihood function is commonly referred to as the gradient, or score function. The second derivative of the log-likelihood with respect to β produces the Hessian matrix, from which the standard errors of the predictor parameter estimates are derived. The logistic gradient and Hessian functions are given as

$$\frac{\partial L(\beta )} {\partial \beta } ={ \sum \limits _{i=1}^{n}}({y}_{ i} - {\mu }_{i}){x}_{i}$$

(9)

$$\frac{{\partial }^{2}L(\beta )} {\partial \beta \partial \beta '} = -{\sum \limits _{i=1}^{n}} \left \{{x}_{ i}{x'}_{i}{\mu }_{i}(1 - {\mu }_{i}) \right \}$$

(10)
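To make the roles of (9) and (10) concrete, here is a minimal Newton–Raphson sketch for a model with an intercept and a single predictor, so that the 2×2 Hessian can be inverted by hand. It is an illustration under stated assumptions, not the algorithm of any particular package: the function names, tolerance, and the nine-observation toy data set are hypothetical.

```python
import math

def inv_logit(eta):
    """Inverse logit, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def score_and_hessian(b0, b1, xs, ys):
    """Gradient (9) and Hessian (10) for logit(mu) = b0 + b1 * x."""
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        mu = inv_logit(b0 + b1 * x)
        w = mu * (1.0 - mu)
        g0 += (y - mu)            # intercept column is 1
        g1 += (y - mu) * x
        h00 -= w
        h01 -= w * x
        h11 -= w * x * x
    return (g0, g1), (h00, h01, h11)

def fit_newton(xs, ys, tol=1e-10, max_iter=25):
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), (h00, h01, h11) = score_and_hessian(b0, b1, xs, ys)
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta - H^{-1} g, with the 2x2 inverse written out.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 - step0, b1 - step1
        if max(abs(step0), abs(step1)) < tol:
            break
    # Standard errors: square roots of the diagonal of (-H)^{-1} at the solution.
    _, (h00, h01, h11) = score_and_hessian(b0, b1, xs, ys)
    det = h00 * h11 - h01 * h01
    return (b0, b1), (math.sqrt(-h11 / det), math.sqrt(-h00 / det))

# Hypothetical data: x = 0 has 1 success in 4; x = 1 has 3 successes in 5.
xs = [0, 0, 0, 0, 1, 1, 1, 1, 1]
ys = [1, 0, 0, 0, 1, 1, 1, 0, 0]

(b0, b1), (se0, se1) = fit_newton(xs, ys)
print(b0, b1, se0, se1)
print(math.exp(b1))   # equals the sample odds ratio (3/2) / (1/3) for these data
```

Because the single predictor is binary, the fitted model reproduces the two observed proportions exactly, which gives a convenient check on the iteration.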

One of the primary values of using the logistic regression model is the ability to interpret the exponentiated parameter estimates as odds ratios. Note that the link function is the log of the odds of $\mu ,\,\ln (\mu /(1 - \mu )),$ where the odds are understood as the probability of success, μ, over the probability of failure, 1 − μ. The log-odds is commonly referred to as the logit function. An example will help clarify the relationship, as well as the interpretation of the odds ratio.

We use data from the 1912 Titanic accident, comparing the odds of survival for adult passengers to children. A tabulation of the data is given as:

The odds of survival for adult passengers are 442/765, or 0.578. The odds of survival for children are 57/52, or 1.096. The ratio of the odds of survival for adults to the odds of survival for children is (442/765)/(57/52), or 0.52709552. This value is referred to as the odds ratio, or ratio of the two component odds. Using a logistic regression procedure to estimate the odds ratio of age produces the following results:

Table 1

With 1 = adult and 0 = child, the estimated odds ratio may be interpreted as:

Table 2

The odds of an adult surviving were about half the odds of a child surviving.

By inverting the estimated odds ratio above (1/0.527 ≈ 1.9), we may conclude that children had some 90% greater odds of surviving than did adults, or nearly twice the odds.
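The arithmetic behind the Titanic odds ratio can be reproduced directly from the four counts quoted in the text (442 and 765 for adults, 57 and 52 for children). The variable names are illustrative; the log of the odds ratio is the quantity a logistic regression with the coding 1 = adult, 0 = child would report as the unexponentiated coefficient on age.

```python
import math

# Counts quoted in the text: survived / did not survive.
adult_survived, adult_died = 442, 765
child_survived, child_died = 57, 52

odds_adult = adult_survived / adult_died      # odds of survival, adults
odds_child = child_survived / child_died      # odds of survival, children
odds_ratio = odds_adult / odds_child          # adults relative to children
log_odds_ratio = math.log(odds_ratio)         # log-odds (logit) scale
inverse_ratio = 1.0 / odds_ratio              # children relative to adults

print(f"odds (adult) = {odds_adult:.3f}")
print(f"odds (child) = {odds_child:.3f}")
print(f"odds ratio   = {odds_ratio:.8f}")
print(f"inverse      = {inverse_ratio:.1f}")
```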

For continuous predictors, a one-unit increase in a predictor value indicates the change in odds expressed by the displayed odds ratio. For example, if age was recorded as a continuous predictor in the Titanic data, and the odds ratio was calculated as 1.015, we would interpret the relationship as:

The odds of surviving are one and a half percent greater for each additional year of age.
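Because the model is linear on the log-odds scale, a change of k units in a continuous predictor multiplies the odds by the odds ratio raised to the power k. The sketch below applies this to the illustrative odds ratio of 1.015 from the text; the helper name `odds_multiplier` and the ten-year span are assumptions made for the example.

```python
import math

def odds_multiplier(odds_ratio, delta):
    """Factor by which the odds change when the predictor changes by delta units."""
    return odds_ratio ** delta

odds_ratio_per_year = 1.015                 # illustrative value used in the text
beta = math.log(odds_ratio_per_year)        # corresponding log-odds coefficient

one_year = odds_multiplier(odds_ratio_per_year, 1)
ten_years = odds_multiplier(odds_ratio_per_year, 10)

print(one_year)                             # 1.5% greater odds per year
print(ten_years, math.exp(10 * beta))       # the same factor computed two ways
```

The ten-year factor is larger than 1 + 10 × 0.015 because the yearly changes compound multiplicatively rather than add.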

Non-exponentiated logistic regression parameter estimates are interpreted as log-odds relationships, which carry little meaning in ordinary discourse. Logistic models are typically interpreted in terms of odds ratios, unless a researcher is interested in estimating predicted probabilities for given patterns of model covariates; i.e., in estimating μ.

Logistic regression may also be used for grouped or proportional data. For these models the response consists of a numerator, indicating the number of successes (1s) for a specific covariate pattern, and the denominator (m), the number of observations having the specific covariate pattern. The response y ∕ m is binomially distributed as:

$$f({y}_{i};{\pi }_{i},{m}_{i}) = \left (\begin{matrix} {m}_{i} \\ {y}_{i} \end{matrix} \right ){\pi }_{i}^{\;{y}_{i} }{(1-{\pi }_{i})}^{{m}_{i}-{y}_{i} },$$

(11)

with a corresponding log-likelihood function expressed as

$$L({\mu }_{i};{y}_{i},{m}_{i}) ={ \sum \limits _{i=1}^{n}}\left \{{y}_{i}\ln ({\mu }_{i}/(1 - {\mu }_{i})) + {m}_{i}\ln (1 - {\mu }_{i}) +\ln \left (\begin{matrix} {m}_{i} \\ {y}_{i} \end{matrix} \right )\right \}.$$

(12)

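Taking derivatives of the cumulant, $-{m}_{i}\ln (1 - {\pi }_{i}),$ as we did for the binary response model, produces a mean of ${\mu }_{i} = {m}_{i}{\pi }_{i}$ and variance ${\mu }_{i}(1 - {\mu }_{i}/{m}_{i}).$ Note that here μ_i denotes the binomial mean, whereas in (12) it stands in for the probability of success, π_i.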
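The following sketch evaluates the binomial pdf (11) and the log-likelihood obtained by taking its logarithm, and checks the stated mean and variance by direct summation over the support. To sidestep any ambiguity of notation, the per-trial probability of success is written `pi` throughout; the values m = 3 and π = 0.4 are illustrative assumptions.

```python
import math

def binomial_pdf(y, m, pi):
    """Equation (11): C(m, y) * pi^y * (1 - pi)^(m - y)."""
    return math.comb(m, y) * pi ** y * (1.0 - pi) ** (m - y)

def binomial_loglik(ys, ms, pis):
    """Log of (11) summed over covariate patterns, including ln C(m, y)."""
    return sum(
        y * math.log(p / (1.0 - p)) + m * math.log(1.0 - p)
        + math.log(math.comb(m, y))
        for y, m, p in zip(ys, ms, pis)
    )

m, pi = 3, 0.4                       # illustrative denominator and probability
support = range(m + 1)

total = sum(binomial_pdf(y, m, pi) for y in support)
mean = sum(y * binomial_pdf(y, m, pi) for y in support)
variance = sum((y - mean) ** 2 * binomial_pdf(y, m, pi) for y in support)

print(total)                              # the pdf sums to 1
print(mean, m * pi)                       # mean is m * pi
print(variance, mean * (1 - mean / m))    # variance is mu * (1 - mu / m)
```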

Consider the data below:

y indicates the number of times a specific pattern of covariates is successful. Cases is the number of observations having the specific covariate pattern. The first observation in the table informs us that there are three cases having predictor values of ${x}_{1} = 1,{x}_{2} = 0,\mbox{ and }{x}_{3} = 1.$ Of those three cases, one has a value of y equal to 1, the other two have values of 0. All current commercial software applications estimate this type of logistic model using GLM methodology.

Table 3

The data in the above table may be restructured so that it is in individual observation format, rather than grouped. The new table would have ten observations, having the same logic as described. Modeling would result in identical parameter estimates. It is not uncommon to find an individual-based data set of, for example, 10,000 observations, being grouped into 10–15 rows or observations as described above. Data in tables are nearly always expressed in grouped format.
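The equivalence of grouped and individual formats can be sketched as follows. Only the first row of `grouped` (one success among three cases with x1 = 1, x2 = 0, x3 = 1) is taken from the text; the remaining rows, the coefficient vector `beta`, and all function names are hypothetical and serve only to bring the total to ten observations. The check shows that the individual-level Bernoulli log-likelihood equals the grouped binomial log-likelihood without its ln C(m, y) term; since that term does not involve β, both formats are maximized by the same parameter estimates.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Columns: y (successes), cases, x1, x2, x3.
# First row follows the text; the other rows are hypothetical.
grouped = [
    (1, 3, 1, 0, 1),
    (1, 1, 1, 1, 1),
    (2, 2, 0, 0, 1),
    (2, 4, 0, 1, 0),
]

def expand(rows):
    """Turn each grouped row into y ones and (cases - y) zeros."""
    out = []
    for y, cases, *xs in rows:
        out.extend([(1, *xs)] * y)
        out.extend([(0, *xs)] * (cases - y))
    return out

beta = (-0.5, 0.4, -0.3, 0.6)   # hypothetical coefficients, intercept first

def prob(xs):
    """Probability of success for one covariate pattern."""
    return inv_logit(beta[0] + sum(b * x for b, x in zip(beta[1:], xs)))

individual = expand(grouped)

ll_individual = sum(
    y * math.log(prob(xs)) + (1 - y) * math.log(1.0 - prob(xs))
    for y, *xs in individual
)
ll_grouped_kernel = sum(
    y * math.log(prob(xs)) + (m - y) * math.log(1.0 - prob(xs))
    for y, m, *xs in grouped
)
# Constant part of the grouped log-likelihood; it does not depend on beta.
log_binom_const = sum(math.log(math.comb(m, y)) for y, m, *_ in grouped)

print(len(individual))
print(ll_individual, ll_grouped_kernel, log_binom_const)
```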

Table 4

Logistic models are subject to a variety of fit tests. Some of the more popular tests include the Hosmer–Lemeshow goodness-of-fit test, ROC analysis, various information criteria tests, link tests, and residual analysis. The Hosmer–Lemeshow test, once widely used, is now used only with caution. The test is heavily influenced by the manner in which tied data are classified. Because the test compares observed with expected probabilities across levels, it is now preferred to construct tables of risk having different numbers of levels. If results are consistent across tables, the statistic is more trustworthy.

Information criteria tests, e.g., the Akaike Information Criterion (AIC) (see Akaike’s Information Criterion and Akaike’s Information Criterion: Background, Derivation, Properties, and Refinements) and the Bayesian Information Criterion (BIC), are the most used of this type of test. Information tests are comparative, with lower values indicating the preferred model. Recent research indicates that AIC and BIC are both biased when data are correlated to any degree. Statisticians have attempted to develop enhancements of these two tests, but have not been entirely successful. The best advice is to use several different types of tests, aiming for consistency of results.
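The entry does not state formulas for these criteria. The sketch below uses the most common definitions, AIC = −2L + 2k and BIC = −2L + k ln(n), where L is the model log-likelihood, k the number of estimated parameters, and n the number of observations; some texts and packages use rescaled variants, so this should be read as one convention among several. The log-likelihood values for the two competing models are hypothetical.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion in its common form: -2L + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion in its common form: -2L + k*ln(n)."""
    return -2.0 * loglik + k * math.log(n)

n = 200                                    # hypothetical sample size
# Hypothetical fits: (log-likelihood, number of parameters).
models = {"smaller": (-120.4, 3), "larger": (-118.9, 6)}

for name, (ll, k) in models.items():
    print(name, aic(ll, k), bic(ll, k, n))

# Lower values indicate the preferred model under each criterion.
best_by_aic = min(models, key=lambda name: aic(*models[name]))
best_by_bic = min(models, key=lambda name: bic(*models[name], n))
print(best_by_aic, best_by_bic)
```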

Several types of residual analyses are typically recommended for logistic models. The references below provide extensive discussion of these methods, together with appropriate caveats. However, it appears well established that m-asymptotic residual analysis is most appropriate for logistic models having no continuous predictors. m-asymptotics is based on grouping observations with the same covariate pattern, in a similar manner to the grouped or binomial logistic regression discussed earlier. The Hilbe (2009) and Hosmer and Lemeshow (2000) references below provide guidance on how best to construct and interpret this type of residual.

Logistic models have been expanded to include categorical responses, e.g., proportional odds models and multinomial logistic regression. They have also been enhanced to include the modeling of panel and correlated data, e.g., generalized estimating equations, fixed and random effects, and mixed effects logistic models.

Finally, exact logistic regression models have recently been developed to allow the modeling of perfectly predicted data, as well as small and unbalanced datasets. In these cases, logistic models which are estimated using GLM or full maximum likelihood will not converge. Exact models employ entirely different methods of estimation, based on large numbers of permutations.

About the Author

Joseph M. Hilbe is an emeritus professor, University of Hawaii and adjunct professor of statistics, Arizona State University. He is also a Solar System Ambassador with NASA/Jet Propulsion Laboratory, at California Institute of Technology. Hilbe is a Fellow of the American Statistical Association and Elected Member of the International Statistical Institute, for which he is founder and chair of the ISI astrostatistics committee and Network, the first global association of astrostatisticians. He is also chair of the ISI sports statistics committee, and was on the founding executive committee of the Health Policy Statistics Section of the American Statistical Association (1994–1996). Hilbe is author of Negative Binomial Regression (2007, Cambridge University Press), and Logistic Regression Models (2009, Chapman & Hall), two of the leading texts in their respective areas of statistics. He is also co-author (with James Hardin) of Generalized Estimating Equations (2002, Chapman & Hall/CRC) and two editions of Generalized Linear Models and Extensions (2001, 2007, Stata Press), and with Robert Muenchen is coauthor of R for Stata Users (2010, Springer). Hilbe has also been influential in the production and review of statistical software, serving as Software Reviews Editor for The American Statistician for 12 years from 1997–2009. He was founding editor of the Stata Technical Bulletin (1991), and was the first to add the negative binomial family into commercial generalized linear models software. Professor Hilbe was presented the Distinguished Alumnus award at California State University, Chico in 2009, two years following his induction into the University’s Athletic Hall of Fame (he was a two-time US champion track & field athlete). He is the only graduate of the university to be recognized with both honors.

Cross References

Case-Control Studies

Categorical Data Analysis

Generalized Linear Models

Multivariate Data Analysis: An Overview

Probit Analysis

Recursive Partitioning

Regression Models with Symmetrical Errors

Robust Regression Estimation in Generalized Linear Models

Statistics: An Overview

Target Estimation: A New Approach to Parametric Estimation

References and Further Reading

Collett D (2003) Modelling binary data, 2nd edn. Chapman & Hall/CRC, London
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. Chapman & Hall, London
Hardin JW, Hilbe JM (2007) Generalized linear models and extensions, 2nd edn. Stata Press, College Station
Hilbe JM (2009) Logistic regression models. Chapman & Hall/CRC Press, Boca Raton
Hosmer D, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
Kleinbaum DG (1994) Logistic regression; a self-teaching guide. Springer, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage, Thousand Oaks
McCullagh P, Nelder J (1989) Generalized linear models, 2nd edn. Chapman & Hall, London

Author information

Authors and Affiliations

University of Hawaii, Honolulu, HI, USA
Joseph M. Hilbe (Emeritus Professor, Adjunct Professor of Statistics, Solar System Ambassador)
Arizona State University, Tempe, AZ, USA
Joseph M. Hilbe (Emeritus Professor, Adjunct Professor of Statistics, Solar System Ambassador)
California Institute of Technology, Pasadena, CA, USA
Joseph M. Hilbe (Emeritus Professor, Adjunct Professor of Statistics, Solar System Ambassador)

Editor information

Editors and Affiliations

Department of Statistics and Informatics, Faculty of Economics, University of Kragujevac, City of Kragujevac, Serbia
Miodrag Lovric

About this entry

Cite this entry

Hilbe, J.M. (2011). Logistic Regression. In: Lovric, M. (eds) International Encyclopedia of Statistical Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04898-2_344

DOI: https://doi.org/10.1007/978-3-642-04898-2_344
Published: 02 December 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04897-5
Online ISBN: 978-3-642-04898-2
eBook Packages: Mathematics and Statistics; Reference Module Computer Science and Engineering
