Introduction

Prior to 1975 there was little awareness within statistics, or the applied sciences generally, that a single observation can influence a statistical analysis to the point where inferences drawn with the observation included are diametrically opposed to those drawn without it. The recognition that such influential observations occur with notable frequency began with the 1977 publication of Cook’s distance, a statistic for assessing the influence of individual observations on the estimated coefficients in a linear regression analysis (Cook 1977). Today the detection of influential observations is widely acknowledged as an important part of any statistical analysis, and Cook’s distance is a mainstay of linear regression analysis. Generalizations of Cook’s distance and of the underlying ideas have been developed for application in diverse statistical contexts. Extensions of Cook’s distance for linear regression, along with a discussion of the surrounding methodology, were presented by Cook and Weisberg (1982).

Cook’s distance and its direct extensions are based on the idea of contrasting the results of an analysis with and without an observation. Implementation of this idea beyond linear and generalized linear models can be problematic. For such applications the related concept of local influence (Cook 1986) is used to study the sensitivity of an analysis to local perturbations of the model or the data. Local influence analysis continues to be an area of active investigation (see, for example, Zhu et al. 2007).

Cook’s Distance

Consider the linear regression of a response variable Y on p predictors X 1, …, X p represented by the model

$${Y }_{i} = {\beta }_{0} + {\beta }_{1}{X}_{i1} + \cdots + {\beta }_{p}{X}_{ip} + {\epsilon }_{i},$$

where i = 1, …, n indexes observations, the β’s are the regression coefficients and ε is an error that is independent of the predictors and has mean 0 and constant variance σ 2. This classic model can be represented conveniently in matrix terms as Y = X β + ε. Here, Y = (Y i ) is the n × 1 vector of responses, X = (X ij ) is the n × (p + 1) matrix of predictor values, including a constant column to account for the intercept β 0, and ε = (ε i ) is the n × 1 vector of errors. For clarity, the ith response Y i in combination with its associated values of the predictors X i1, …, X ip is called the ith case. Let \(\widehat{\beta }\) denote the ordinary least squares (OLS) estimator of the coefficient vector β based on the full data and let \({\widehat{\beta }}_{(i)}\) denote the OLS estimator based on the data after removing the ith case. Let s 2 denote the estimator of σ 2 based on the OLS fit of the full dataset: s 2 is the residual sum of squares divided by (n − p − 1).
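To make the notation concrete, the following minimal Python sketch (not part of the original entry; the data are simulated and all variable names are illustrative assumptions) forms the design matrix with a constant column and computes the OLS estimate \(\widehat{\beta }\), the fitted values and s 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data purely for illustration: n cases and p predictors.
n, p = 19, 3
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])        # n x (p + 1) design matrix
beta_true = np.array([1.0, 0.5, -0.3, 0.0])          # assumed intercept plus p coefficients
Y = X @ beta_true + rng.normal(scale=0.5, size=n)    # responses with mean-0, constant-variance errors

# Ordinary least squares fit on the full data.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)     # OLS estimate of beta
Y_hat = X @ beta_hat                                 # fitted values
rss = np.sum((Y - Y_hat) ** 2)                       # residual sum of squares
s2 = rss / (n - p - 1)                               # estimate of sigma^2
```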

Cook (1977) proposed to assess the influence of the ith case on \(\widehat{\beta }\) by using a statistic D i , which subsequently became known as Cook’s distance, that can be expressed in three equivalent ways:

$$\begin{array}{rcl}
{D}_{i} & = & \frac{{(\widehat{\beta } -{\widehat{\beta }}_{(i)})}^{T}{\mathbf{X}}^{T}\mathbf{X}(\widehat{\beta } -{\widehat{\beta }}_{(i)})}{(p + 1){s}^{2}} \qquad (1) \\[2ex]
 & = & \frac{{(\widehat{\mathbf{Y}} -{\widehat{\mathbf{Y}}}_{(i)})}^{T}(\widehat{\mathbf{Y}} -{\widehat{\mathbf{Y}}}_{(i)})}{(p + 1){s}^{2}} \qquad (2) \\[2ex]
 & = & \frac{{r}_{i}^{2}}{p + 1} \times \frac{{h}_{i}}{1 - {h}_{i}} \qquad (3)
\end{array}$$

The first form (1) shows that Cook’s distance measures the difference between \(\widehat{\beta }\) and \({\widehat{\beta }}_{(i)}\) relative to the elliptical contours determined by the estimated covariance matrix \({s}^{2}{({\mathbf{X}}^{T}\mathbf{X})}^{-1}\) of \(\widehat{\beta }\), that is, using its inverse as the metric, and scales by the number of terms (p + 1) in the model. The second form shows that Cook’s distance can also be viewed as the squared length of the difference between the n × 1 vector of fitted values \(\widehat{\mathbf{Y}} = \mathbf{X}\widehat{\beta }\) based on the full data and the n × 1 vector of fitted values \({\widehat{\mathbf{Y}}}_{(i)} = \mathbf{X}{\widehat{\beta }}_{(i)}\) obtained when β is estimated without the ith case, again scaled by (p + 1)s 2.
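The equivalence of forms (1) and (2) can be checked numerically. The sketch below (again on simulated data, with illustrative names) refits the model with the ith case deleted to obtain \({\widehat{\beta }}_{(i)}\) and evaluates both expressions; they agree to rounding error because \(\widehat{\mathbf{Y}} -{\widehat{\mathbf{Y}}}_{(i)} = \mathbf{X}(\widehat{\beta } -{\widehat{\beta }}_{(i)})\).

```python
import numpy as np

# Same synthetic setup as the earlier sketch (illustrative only).
rng = np.random.default_rng(0)
n, p = 19, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
s2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p - 1)
XtX = X.T @ X

D1 = np.empty(n)   # form (1): distance between coefficient vectors
D2 = np.empty(n)   # form (2): squared distance between fitted-value vectors
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)   # fit without case i
    diff = beta_hat - beta_i
    D1[i] = diff @ XtX @ diff / ((p + 1) * s2)
    D2[i] = np.sum((X @ beta_hat - X @ beta_i) ** 2) / ((p + 1) * s2)

assert np.allclose(D1, D2)   # the two forms agree up to rounding
```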

The final form (3) shows the general characteristics of cases with relatively large values of D i . The ith leverage h i , 0 ≤ h i ≤ 1, is the ith diagonal element of the projection matrix \(\mathbf{H} = \mathbf{X}{({\mathbf{X}}^{T}\mathbf{X})}^{-1}{\mathbf{X}}^{T}\) that puts the “hat” on Y, \(\widehat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}\). It measures how far the predictor values X i = (X i1, …, X ip )T for the ith case are from the average predictor value \(\overline{\mathbf{X}}\). If X i is far from \(\overline{\mathbf{X}}\) then the ith case will have substantial pull on the fit, h i will be near its upper bound of 1, and the second factor of (3) will be very large. Consequently, D i will be large unless the first factor in (3) is small enough to compensate. The second factor tells us about the leverage or pull that X i has on the fitted model, but it does not depend on the response and thus says nothing about the actual fit of the ith case. That goodness-of-fit information is provided by r i 2 in the first factor of (3): r i is the Studentized residual for the ith case, the ordinary residual for the ith case divided by \(s\sqrt{1 - {h}_{i}}\). The squared Studentized residual r i 2 will be large when Y i does not fit the model and thus can be regarded as an outlier, but it says nothing about leverage. In short, the first factor gives information on the goodness of the fit of Y i but says nothing about leverage, while the second factor gives the leverage information but says nothing about goodness of fit. When multiplied, these factors combine to give a measure of the influence of the ith case.
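Form (3) requires only the full-data fit. The sketch below (same simulated setup and illustrative names as before) computes the leverages from the diagonal of H, the Studentized residuals, and D i without deleting any case.

```python
import numpy as np

# Same synthetic setup as the earlier sketches (illustrative only).
rng = np.random.default_rng(0)
n, p = 19, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(H)                            # leverages h_i
e = Y - H @ Y                             # ordinary residuals
s2 = np.sum(e ** 2) / (n - p - 1)
r = e / np.sqrt(s2 * (1.0 - h))           # Studentized residuals r_i
D = (r ** 2 / (p + 1)) * (h / (1.0 - h))  # Cook's distance, form (3), no refitting needed
```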

The Studentized residual r i is a common statistic for testing the hypothesis that Y i is not an outlier. That test is most powerful when h i is small, so X i is near \(\overline{\mathbf{X}}\), and least powerful when h i is relatively large. However, leverage or pull is weakest when h i is small and strongest when h i is large. In other words, the ability to detect outliers is strongest where the outliers tend to be the least influential and weakest where the outliers tend to be the most influential. This gives another reason why influence assessment can be crucial in an analysis.
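These power statements reflect a standard identity for the ordinary residuals, added here as a reminder rather than as part of the original text: since e = (I − H)ε,

$$\mathrm{Var}({e}_{i}) = {\sigma }^{2}(1 - {h}_{i}),$$

so when h i is near 1 the residual of the ith case is forced toward zero no matter how poorly Y i conforms to the model, and the Studentized residual carries little information about outlyingness at exactly the cases with the greatest pull.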

Cook’s distance is not a test statistic and should not by itself be used to accept or reject cases. It may indicate an anomalous case that is extraneous to the experimental protocol, or it may indicate the most important case in the analysis, one that points to a relevant phenomenon not reflected by the other data. Cook’s distance does not distinguish these possibilities.

Illustration

The data that provided the original motivation for the development of Cook’s distance came from an experiment on the absorption of a drug by rat livers. Nineteen rats were given various doses of the drug and, after a fixed waiting time, the rats were sacrificed and the percentage Y of the dose absorbed by the liver was measured. The predictors were dose, body weight and liver weight. The largest absolute Studentized residual is max | r i | = 2.1, which is unremarkable after adjusting for multiple testing. The case with the largest leverage, 0.85, has a modest Studentized residual of 0.80 but a relatively large Cook’s distance of 0.93; the second largest Cook’s distance is 0.27. Body weight and dose have significant effects in the analysis of the full data, but there are no significant effects after the influential case is removed. It is always prudent to study the impact of cases with relatively large values of D i and of all cases for which D i > 0.5. The most influential case in this analysis meets both of these criteria. The rat data are discussed in Cook and Weisberg (1999) and are available from the accompanying software.
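In practice these diagnostics are available from standard regression software. The following sketch uses the statsmodels Python package on simulated data; the package, the variable names and the flagging heuristic are assumptions of this illustration, not the software referenced above, and the rat data themselves accompany Cook and Weisberg (1999) rather than being reproduced here.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for a small regression dataset (illustrative only).
rng = np.random.default_rng(0)
n, p = 19, 3
X = sm.add_constant(rng.normal(size=(n, p)))
Y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=n)

results = sm.OLS(Y, X).fit()
influence = results.get_influence()
D, _ = influence.cooks_distance          # Cook's distances D_i
leverage = influence.hat_matrix_diag     # leverages h_i

# Flag cases with D_i > 0.5 and, as a rough proxy for "relatively large",
# cases whose D_i stands well apart from the rest (a heuristic, not from the text).
flagged = np.flatnonzero((D > 0.5) | (D > 3.0 * np.median(D)))
print(flagged, D[flagged])
```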

Acknowledgments

Research for this article was supported in part by National Science Foundation Grant DMS-0704098.

About the Author

Dennis Cook is Full Professor, School of Statistics, University of Minnesota. He served a ten-year term as Chair of the Department of Applied Statistics, and a three-year term as Director of the Statistical Center, both at the University of Minnesota. He has served as Associate Editor of the Journal of the American Statistical Association (1976–1982; 1988–1991; 2002–2005), The Journal of Quality Technology, Biometrika (1991–1993), Journal of the Royal Statistical Society, Series B (1992–1997) and Statistica Sinica (1999–2005). He is a three-time recipient of the Jack Youden Prize for Best Expository Paper in Technometrics as well as the Frank Wilcoxon Award for Best Technical Paper. He was awarded the 2005 COPSS Fisher Lectureship. He is a Fellow of the ASA and IMS, and an elected member of the ISI.

Cross References

Influential Observations

Regression Diagnostics

Robust Regression Estimation in Generalized Linear Models

Simple Linear Regression