Introduction

Prior to 1975 there was little awareness within statistics, or the applied sciences generally, that a single observation can influence a statistical analysis to the point where inferences drawn with the observation included are diametrically opposed to those drawn without it. The recognition that such influential observations occur with notable frequency began with the 1977 publication of Cook’s distance, a statistic for assessing the influence of individual observations on the estimated coefficients in a linear regression analysis (Cook 1977). Today the detection of influential observations is widely acknowledged as an important part of any statistical analysis, and Cook’s distance is a mainstay of linear regression analysis. Generalizations of Cook’s distance and of the underlying ideas have been developed for application in diverse statistical contexts. Extensions of Cook’s distance for linear regression, along with a discussion of the surrounding methodology, were presented by Cook and Weisberg (1982).

Cook’s distance and its direct extensions are based on the idea of contrasting the results of an analysis with and without an observation. Implementation of this idea beyond linear and generalized linear models can be problematic. For such applications the related concept of local influence (Cook 1986) is used to study the sensitivity of an analysis to local perturbations of the model or the data. Local influence analysis continues to be an area of active investigation (see, for example, Zhu et al. 2007).

Cook’s Distance

Consider the linear regression of a response variable Y on p predictors X 1, …, X p represented by the model

$${Y }_{i} = {\beta }_{0} + {\beta }_{1}{X}_{i1} + \cdots + {\beta }_{p}{X}_{ip} + {\epsilon }_{i},$$

where i = 1, …, n indexes observations, the β’s are the regression coefficients and ε is an error that is independent of the predictors and has mean 0 and constant variance σ 2. This classic model can be represented conveniently in matrix terms as Y = X β + ε. Here, Y = (Y i ) is the n × 1 vector of responses, X = (X ij ) is the n × (p + 1) matrix of predictor values, including a constant column to account for the intercept β 0, and ε = (ε i ) is the n × 1 vector of errors. For clarity, the ith response Y i in combination with its associated values of the predictors X i1, …, X ip is called the ith case. Let \(\widehat{\beta }\) denote the ordinary least squares (OLS) estimator of the coefficient vector β based on the full data and let \({\widehat{\beta }}_{(i)}\) denote the OLS estimator based on the data after removing the ith case. Let s 2 denote the estimator of σ 2 based on the OLS fit of the full dataset: s 2 is the residual sum of squares divided by (n − p − 1).
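To make the notation concrete, the following minimal Python sketch (not part of the original entry; the data are simulated and all variable names are illustrative assumptions) forms the design matrix with a constant column and computes the OLS estimate \(\widehat{\beta }\), the fitted values and s 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data purely for illustration: n cases and p predictors.
n, p = 19, 3
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])        # n x (p + 1) design matrix
beta_true = np.array([1.0, 0.5, -0.3, 0.0])          # assumed intercept plus p coefficients
Y = X @ beta_true + rng.normal(scale=0.5, size=n)    # responses with mean-0, constant-variance errors

# Ordinary least squares fit on the full data.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)     # OLS estimate of beta
Y_hat = X @ beta_hat                                 # fitted values
rss = np.sum((Y - Y_hat) ** 2)                       # residual sum of squares
s2 = rss / (n - p - 1)                               # estimate of sigma^2
```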

Cook (1977) proposed to assess the influence of the ith case on \(\widehat{\beta }\) by using a statistic D i , which subsequently became known as Cook’s distance, that can be expressed in three equivalent ways:

$$\begin{array}{rcl}
{D}_{i} & = & \frac{{(\widehat{\beta } -{\widehat{\beta }}_{(i)})}^{T}{\mathbf{X}}^{T}\mathbf{X}(\widehat{\beta } -{\widehat{\beta }}_{(i)})}{(p + 1){s}^{2}} \qquad (1) \\[2ex]
 & = & \frac{{(\widehat{\mathbf{Y}} -{\widehat{\mathbf{Y}}}_{(i)})}^{T}(\widehat{\mathbf{Y}} -{\widehat{\mathbf{Y}}}_{(i)})}{(p + 1){s}^{2}} \qquad (2) \\[2ex]
 & = & \frac{{r}_{i}^{2}}{p + 1} \times \frac{{h}_{i}}{1 - {h}_{i}} \qquad (3)
\end{array}$$

The first form (1) shows that Cook’s distance measures the difference between \(\widehat{\beta }\) and \({\widehat{\beta }}_{(i)}\) relative to the elliptical contours determined by the estimated covariance matrix \({s}^{2}{({\mathbf{X}}^{T}\mathbf{X})}^{-1}\) of \(\widehat{\beta }\), that is, using its inverse as the metric, and scales by the number of terms (p + 1) in the model. The second form shows that Cook’s distance can also be viewed as the squared length of the difference between the n × 1 vector of fitted values \(\widehat{\mathbf{Y}} = \mathbf{X}\widehat{\beta }\) based on the full data and the n × 1 vector of fitted values \({\widehat{\mathbf{Y}}}_{(i)} = \mathbf{X}{\widehat{\beta }}_{(i)}\) obtained when β is estimated without the ith case, again scaled by (p + 1)s 2.
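The equivalence of forms (1) and (2) can be checked numerically. The sketch below (again on simulated data, with illustrative names) refits the model with the ith case deleted to obtain \({\widehat{\beta }}_{(i)}\) and evaluates both expressions; they agree to rounding error because \(\widehat{\mathbf{Y}} -{\widehat{\mathbf{Y}}}_{(i)} = \mathbf{X}(\widehat{\beta } -{\widehat{\beta }}_{(i)})\).

```python
import numpy as np

# Same synthetic setup as the earlier sketch (illustrative only).
rng = np.random.default_rng(0)
n, p = 19, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
s2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p - 1)
XtX = X.T @ X

D1 = np.empty(n)   # form (1): distance between coefficient vectors
D2 = np.empty(n)   # form (2): squared distance between fitted-value vectors
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)   # fit without case i
    diff = beta_hat - beta_i
    D1[i] = diff @ XtX @ diff / ((p + 1) * s2)
    D2[i] = np.sum((X @ beta_hat - X @ beta_i) ** 2) / ((p + 1) * s2)

assert np.allclose(D1, D2)   # the two forms agree up to rounding
```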

The final form (3) shows the general characteristics of cases with relatively large values of D i . The ith leverage h i , 0 ≤ h i ≤ 1, is the ith diagonal element of the projection matrix \(\mathbf{H} = \mathbf{X}{({\mathbf{X}}^{T}\mathbf{X})}^{-1}{\mathbf{X}}^{T}\) that puts the “hat” on Y, \(\widehat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}\). It measures how far the predictor values X i = (X i1, …, X ip )T for the ith case are from the average predictor value \(\overline{\mathbf{X}}\). If X i is far from \(\overline{\mathbf{X}}\) then the ith case will have substantial pull on the fit, h i will be near its upper bound of 1, and the second factor of (3) will be very large. Consequently, D i will be large unless the first factor in (3) is small enough to compensate. The second factor tells us about the leverage or pull that X i has on the fitted model, but it does not depend on the response and thus says nothing about the actual fit of the ith case. That goodness-of-fit information is provided by r i 2 in the first factor of (3): r i is the Studentized residual for the ith case, the ordinary residual for the ith case divided by \(s\sqrt{1 - {h}_{i}}\). The squared Studentized residual r i 2 will be large when Y i does not fit the model and thus can be regarded as an outlier, but it says nothing about leverage. In short, the first factor gives information on the goodness of the fit of Y i but says nothing about leverage, while the second factor gives the leverage information but says nothing about goodness of fit. When multiplied, these factors combine to give a measure of the influence of the ith case.
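Form (3) requires only the full-data fit. The sketch below (same simulated setup and illustrative names as before) computes the leverages from the diagonal of H, the Studentized residuals, and D i without deleting any case.

```python
import numpy as np

# Same synthetic setup as the earlier sketches (illustrative only).
rng = np.random.default_rng(0)
n, p = 19, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(H)                            # leverages h_i
e = Y - H @ Y                             # ordinary residuals
s2 = np.sum(e ** 2) / (n - p - 1)
r = e / np.sqrt(s2 * (1.0 - h))           # Studentized residuals r_i
D = (r ** 2 / (p + 1)) * (h / (1.0 - h))  # Cook's distance, form (3), no refitting needed
```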

The Studentized residual r i is a common statistic for testing the hypothesis that Y i is not an outlier. That test is most powerful when h i is small, so X i is near \(\overline{\mathbf{X}}\), and least powerful when h i is relatively large. However, leverage or pull is weakest when h i is small and strongest when h i is large. In other words, the ability to detect outliers is strongest where the outliers tend to be the least influential and weakest where the outliers tend to be the most influential. This gives another reason why influence assessment can be crucial in an analysis.
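These power statements reflect a standard identity for the ordinary residuals, added here as a reminder rather than as part of the original text: since e = (I − H)ε,

$$\mathrm{Var}({e}_{i}) = {\sigma }^{2}(1 - {h}_{i}),$$

so when h i is near 1 the residual of the ith case is forced toward zero no matter how poorly Y i conforms to the model, and the Studentized residual carries little information about outlyingness at exactly the cases with the greatest pull.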

Cook’s distance is not a test statistic and should not by itself be used to accept or reject cases. It may indicate an anomalous case that is extraneous to the experimental protocol, or it may indicate the most important case in the analysis, one that points to a relevant phenomenon not reflected by the other data. Cook’s distance does not distinguish these possibilities.

Illustration

The data that provided the original motivation for the development of Cook’s distance came from an experiment on the absorption of a drug by rat livers. Nineteen rats were given various doses of the drug and, after a fixed waiting time, the rats were sacrificed and the percentage Y of the dose absorbed by the liver was measured. The predictors were dose, body weight and liver weight. The largest absolute Studentized residual is max | r i | = 2.1, which is unremarkable after adjusting for multiple testing. The case with the largest leverage, 0.85, has a modest Studentized residual of 0.80 but a relatively large Cook’s distance of 0.93; the second largest Cook’s distance is 0.27. Body weight and dose have significant effects in the analysis of the full data, but there are no significant effects after the influential case is removed. It is always prudent to study the impact of cases with relatively large values of D i and of all cases for which D i > 0.5. The most influential case in this analysis meets both of these criteria. The rat data are discussed in Cook and Weisberg (1999) and are available from the accompanying software.
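In practice these diagnostics are available from standard regression software. The following sketch uses the statsmodels Python package on simulated data; the package, the variable names and the flagging heuristic are assumptions of this illustration, not the software referenced above, and the rat data themselves accompany Cook and Weisberg (1999) rather than being reproduced here.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for a small regression dataset (illustrative only).
rng = np.random.default_rng(0)
n, p = 19, 3
X = sm.add_constant(rng.normal(size=(n, p)))
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=n)

results = sm.OLS(Y, X).fit()
influence = results.get_influence()
D, _ = influence.cooks_distance          # Cook's distances D_i
leverage = influence.hat_matrix_diag     # leverages h_i

# Flag cases with D_i > 0.5 and, as a rough proxy for "relatively large",
# cases whose D_i stands well apart from the rest (a heuristic, not from the text).
flagged = np.flatnonzero((D > 0.5) | (D > 3.0 * np.median(D)))
print(flagged, D[flagged])
```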

Acknowledgments

Research for this article was supported in part by National Science Foundation Grant DMS-0704098.

About the Author

Dennis Cook is Full Professor, School of Statistics, University of Minnesota. He served a ten-year term as Chair of the Department of Applied Statistics, and a three-year term as Director of the Statistical Center, both at the University of Minnesota. He has served as Associate Editor of the Journal of the American Statistical Association (1976–1982; 1988–1991; 2002–2005), The Journal of Quality Technology, Biometrika (1991–1993), Journal of the Royal Statistical Society, Series B (1992–1997) and Statistica Sinica (1999–2005). He is a three-time recipient of the Jack Youden Prize for Best Expository Paper in Technometrics as well as the Frank Wilcoxon Award for Best Technical Paper. He was awarded the 2005 COPSS Fisher Lectureship. He is a Fellow of the ASA and IMS, and an elected member of the ISI.

Cross References

Influential Observations

Regression Diagnostics

Robust Regression Estimation in Generalized Linear Models

Simple Linear Regression