Correlation is a tool for understanding the relationship between two quantities. Regression considers how one quantity is influenced by another. In correlation analysis the two quantities are treated symmetrically; in regression analysis one is supposed to depend on the other, in an asymmetric way. Extensions to sets of quantities are important.

Suppose that for each value of a quantity x, another quantity y has a probability distribution p(y | x), the probability of y, given x. The mean value of this distribution, alternatively called the expectation of y, given x, and written E(y | x), is a function of x and is called the regression of y on x. The quantity x is often called the independent variable, though a better term is regressor variable; y is the dependent variable. The regression tells us something about how y depends on x. The simplest case is linear regression, where E(y | x) = α + βx, with parameters α and β: the latter is called the regression coefficient (of y on x). Other features of the conditional distribution p(y | x) are usually considered in addition to the mean. The variance (or standard deviation) measures the spread of the y-values, for fixed x. A common case is where this is constant over x: the regression is then said to be homoskedastic. A further common assumption is that p(y | x) is normal, or Gaussian. Then y is normally distributed about α + βx with constant variance \( \sigma^2 \).

The concept of the regression of y on x does not involve a probability distribution for the regressor x. If x does have one, p(x), then x and y have a joint distribution given by p(x, y) = p(y | x)p(x). This joint distribution yields variances, σxx and σyy, for x and y, and a covariance σxy. The correlation between x and y is then defined as \( {\rho}_{xy}={\sigma}_{xy}/{\left({\sigma}_{xx}{\sigma}_{yy}\right)}^{1/2} \). It is the ratio of the covariance to the product of the standard deviations and is clearly unaffected by a change of scale in either x or y (and, since the variances and covariance are unchanged by a change of origin, by a change of origin as well). It is easy to show that −1 ≤ ρxy ≤ 1, and that if x and y are independent, ρxy is zero. When ρxy = 0, x and y are said to be uncorrelated. The correlation measures the association between x and y. If x and y have a joint distribution, then not only is there a regression of y on x, considered above, but also of x on y.

The linear, homoskedastic case is easily the most common one used in practice and has several important properties. We may write y = α + βx + ε, where ε has zero mean and variance \( \sigma^2 \). If x has a distribution, then the factorization p(x, y) = p(y | x)p(x) shows that ε is independent of x (in the homoskedastic case its distribution, given x, does not depend on x), so that ε and x are uncorrelated. Averaging, we have μy = α + βμx, relating the means, μx and μy, of x and y. A change of origin enables both of these to be put equal to zero, when α = 0 and E(y | x) = βx, or y = βx + ε. Multiplying this last result by x and taking expectations, σxy = βσxx, as ε and x are uncorrelated. Consequently the regression coefficient of y on x equals σxy/σxx. Similarly the regression coefficient of x on y (if that regression is also linear and homoskedastic) is σxy/σyy, and the square of the correlation coefficient equals the product of the two regression coefficients.
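As a minimal numerical sketch of these identities (Python with NumPy; the coefficients, noise level and sample size are invented for the illustration, not taken from the text), the two regression coefficients σxy/σxx and σxy/σyy can be computed from a simulated bivariate sample and their product compared with the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative bivariate sample: x normal, y linear in x plus independent noise.
n = 200_000
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=2.0, size=n)

sigma_xx, sigma_yy = x.var(), y.var()
sigma_xy = np.cov(x, y, bias=True)[0, 1]

beta_y_on_x = sigma_xy / sigma_xx          # regression coefficient of y on x
beta_x_on_y = sigma_xy / sigma_yy          # regression coefficient of x on y
rho_xy = sigma_xy / np.sqrt(sigma_xx * sigma_yy)

# The product of the two regression coefficients equals the squared correlation.
print(beta_y_on_x * beta_x_on_y, rho_xy**2)
```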

Returning to the relation y = βx + ε and considering the variances of both sides, we obtain \( {\sigma}_{yy}={\beta}^2{\sigma}_{xx}+{\sigma}^2 \) (again using the lack of correlation between x and ε). Hence \( {\sigma}^2={\sigma}_{yy}-{\sigma}_{xy}^2/{\sigma}_{xx} \), on using β = σxy/σxx, and we have the important relationship \( {\sigma}^2={\sigma}_{yy}\left(1-{\rho}_{xy}^2\right) \), showing that the variance \( \sigma^2 \), of y about the regression, is a proportion \( \left(1-{\rho}_{xy}^2\right) \) of the total variance of y, σyy. In the form \( {\sigma}_{yy}={\beta}^2{\sigma}_{xx}+{\sigma}^2 \), we have the result that the total variance of y is made up of two additive components: that due to x, \( {\beta}^2{\sigma}_{xx} \), and that about the regression line. The former is called the component of variance ascribable to x; the latter is the residual variance and, as we have just seen, is a proportion \( \left(1-{\rho}_{xy}^2\right) \) of the total. That ascribable to x is a proportion \( {\rho}_{xy}^2 \). This decomposition of variance is at the heart of analysis of variance techniques.
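The decomposition can be checked numerically in the same way; the values of β and \( \sigma^2 \), and the distribution of x below, are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Homoskedastic linear model y = beta * x + eps, with made-up beta and sigma^2.
n = 500_000
beta, sigma2 = 0.7, 1.2
x = rng.normal(scale=2.0, size=n)
y = beta * x + rng.normal(scale=np.sqrt(sigma2), size=n)

sigma_xx, sigma_yy = x.var(), y.var()
sigma_xy = np.cov(x, y, bias=True)[0, 1]
rho_sq = sigma_xy**2 / (sigma_xx * sigma_yy)

# Total variance of y = component ascribable to x + residual variance,
# and the residual variance is the proportion (1 - rho^2) of the total.
print(sigma_yy, beta**2 * sigma_xx + sigma2)
print(sigma_yy * (1 - rho_sq), sigma2)
```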

The ideas of regression and correlation are due to Galton and Pearson. The classic example has x the height of a father and y that of his son. Both regressions are linear, homoskedastic and normal, having positive regression coefficients which are less than one. Galton noticed that tall (short) fathers have sons who are, on average, shorter (taller) than themselves. This follows since, centering the values at the mean, or average, height, E(y | x) = βx < x if x > 0, corresponding to tall fathers, and βx > x if x < 0 for short ones. This is the phenomenon of regression (of heights) towards the mean and is necessary if the variability in heights is not to increase from one generation to the next. An illustration from economics might have x as the price of an item and y the number sold. There β will be negative, reflecting the average decrease in numbers sold as the price increases. Here x might not have a probability distribution but might instead be under the control of the seller.

The modern tendency is to make increasing use of regression and less of correlation. Part of the explanation for this is the importance of dependency relations, rather than associations, between quantities. Another reason is that in so many examples (such as the item price above) the regressor variable is not random, so that σxx and σxy are meaningless and correlation ideas are unavailable. A third consideration is that correlation can be misleading. As an illustration of this, let x be a random quantity distributed symmetrically about zero. Let \( y={x}^2 \). Then \( {\sigma}_{xy}=E(xy)=E\left({x}^3\right)=0 \) by the symmetry about zero. Hence the correlation is zero whilst y and x are highly associated, one being the square of the other. Correlation ideas work well when all variables are normally distributed but less well otherwise. (If \( y={x}^2 \), y cannot be normal.)
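A quick simulation of this illustration (the normal choice for x is merely one convenient symmetric distribution, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(3)

# x is symmetric about zero and y = x^2 is an exact function of x,
# yet the two are uncorrelated.
x = rng.normal(size=1_000_000)
y = x**2

print(np.cov(x, y, bias=True)[0, 1])   # ~ 0, since E(x * x^2) = E(x^3) = 0
print(np.corrcoef(x, y)[0, 1])         # correlation ~ 0 despite exact dependence
```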

The ideas and definitions extend to the case where there are several regressor variables x1, x2, …, xm. Write x = (x1, x2, …, xm). Then E(y | x) is the (multiple) regression of y on x. In the linear case with means at zero, E(y | x) = Σβixi, and βi is the partial regression coefficient of y on xi. The notation and nomenclature here are too brief and can be misleading, for βi only measures the dependence of y on xi in the presence of the other quantities in x. Were, say, xm to be omitted, βi, i < m, would typically change: indeed, the regression might not remain linear. The cumbersome notation exemplified by β2.134 (i = 2, m = 4) is sometimes used: in words, the coefficient of y on x2, allowing for x1, x3 and x4. The variance about the regression remains, and the homoskedastic case, where this is constant, is the one usually considered.
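A sketch of this sensitivity to an omitted regressor, with hypothetical data in which x1 and x2 are correlated (all coefficients invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two correlated regressors; y depends on both.  All coefficients are made up.
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Partial regression coefficients of y on (x1, x2) jointly.
beta_full, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)

# Omit x2: the coefficient on x1 changes, because x1 now also stands in for x2.
beta_x1_only, *_ = np.linalg.lstsq(x1[:, None], y, rcond=None)

print(beta_full)      # close to [1.0, 2.0]
print(beta_x1_only)   # close to 1.0 + 2.0 * 0.8 = 2.6
```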

In the linear case E(y | x) = Σβixi, the x’s can be functionally related. A common case is where \( {x}_i={x}^i \), the powers of a single quantity x; this is referred to as polynomial regression. It is usually more convenient to work with polynomials Pi(x) of degree i in x which are orthogonal with respect to some measure. Then E(y | x) = ΣβiPi(x). Another possibility is where the xi are periodic, say \( {x}_i=\cos\, it \). Notice that the linearity is in the terms Pi(x), or rather in the coefficients βi, not in x.
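As an illustrative sketch, the orthogonal-polynomial case can be fitted with NumPy's Legendre routines; the true curve, the noise level and the degree below are arbitrary choices, not taken from the text:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(5)

# Polynomial regression using orthogonal (Legendre) polynomials P_i(x).
x = rng.uniform(-1, 1, size=500)
y = 1.0 - 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)

coefs = legendre.legfit(x, y, deg=3)   # least-squares fit of sum_i beta_i P_i(x)
fitted = legendre.legval(x, coefs)

print(coefs)                           # estimated coefficients beta_0 .. beta_3
print(np.mean((y - fitted) ** 2))      # residual variance about the fitted curve
```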

If the regressor variables have a joint distribution, then the covariances σyi, between y and xi, and σij, between xi and xj, are available. With more than one regressor variable, additional concepts can be introduced. For example, if all the x’s are held fixed except for xi, there is a conditional joint distribution of y and xi, given all the x’s except xi. This has a correlation, defined as above as the ratio of the conditional covariance to the product of the conditional standard deviations, and is called the partial correlation between y and xi. The notation is exemplified by ρy2.134. This will, in general, depend on the fixed values of the regressor variables but is normally only used when it is constant. This happens if the joint distribution of y and x is multivariate normal.
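Under multivariate normality the partial correlation can equivalently be computed as the correlation of the residuals of y and xi after regressing each on the remaining regressors; the following sketch, with invented data, uses that route:

```python
import numpy as np

rng = np.random.default_rng(6)

def residual(v, w):
    """Residual of v after least-squares regression on w (with a constant)."""
    W = np.column_stack([np.ones_like(w), w])
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef

# Illustrative trivariate normal data: x2 drives both y and x1.
n = 200_000
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Partial correlation rho_{y1.2}: correlate the two sets of residuals on x2.
print(np.corrcoef(residual(y, x2), residual(x1, x2))[0, 1])
```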

In the case of a single regressor variable we saw that \( 1-{\rho}_{xy}^2={\sigma}^2/{\sigma}_{yy} \), where \( \sigma^2 \) is the residual variance of y, conditional on x. In the multiple case, continue to define \( \sigma^2 \) in this way, conditional on all the quantities in x. Then define \( R^2 \) by \( 1-{R}^2={\sigma}^2/{\sigma}_{yy} \), in analogy with the single-variable case. The positive square root R is called the multiple correlation coefficient (of y on x). As before, we may write \( {\sigma}_{yy}={\sigma}^2+{R}^2{\sigma}_{yy} \), expressing the total variance of y additively in terms of the residual variance \( \sigma^2 \) and that due to the regression on x. It is more common nowadays to work in terms of the variance components than \( R^2 \).
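A brief numerical check of this decomposition, with made-up data; \( R^2 \) is computed from the residual and total variances as defined above:

```python
import numpy as np

rng = np.random.default_rng(7)

# Multiple correlation coefficient R of y on x = (x1, x2, x3); data are invented.
n = 50_000
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

R2 = 1 - resid.var() / y.var()
print(R2)
# Total variance of y splits (approximately, in the sample) into the residual
# variance and the variance due to the regression.
print(y.var(), resid.var() + fitted.var())
```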

The mathematical theory of regression and correlation is now well understood. Centering at the means, all the concepts depend on the matrix of variances and covariances of y, the dependent variable, and x, the set of regressor variables: the σyi and σij. The calculations are merely ways of rearranging these elements in convenient forms: correlations and components of variance in regression are just two possibilities. The real difficulty, and the real interest, in regression lies in the interpretation of the results.

As an illustration consider the simple case of linear, homoskedastic regression of y on a single regressor variable x, written y = βx + ε, with β as the regression coefficient and ε as the residual variation, with zero mean and variance \( \sigma^2 \). All this says is that for any fixed x, y has mean βx and variance \( \sigma^2 \); and it is only this aspect of the dependence of y on x that is described. Suppose a large amount of data consisting of pairs (xi, yi) is collected and the fit y = 2x + ε with \( {\sigma}^2=2 \) is established. (We discuss how this might be done below.) This shows a fairly close association between y and x. In order therefore to increase y it might be thought reasonable to set x to a high value. Suppose this is done: will it necessarily cause y to increase? Surprisingly, no. Suppose there is another quantity z and the real relationships are y = −x + z + ε1 and \( x=\frac{1}{3}z+{\varepsilon}_2 \), so that z is the basic quantity determining the situation. Substituting for z, this yields y = 2x + ε, with ε = ε1 − 3ε2, the observed relation. If now x is controlled at a large value without affecting z, which is, under natural conditions, the main determinant of x, the effect will be to decrease y through y = −x + z + ε1. Consequently a strong positive relationship between y and x need not imply an increase in y when x is increased. There can be an enormous difference between the association of y with x when x is uncontrolled and allowed to vary freely, and the association when x is controlled. The reason is the presence of another quantity z whose influence on x in the free system is disturbed by the control.
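A simulation of this example may make the point vivid. The variances below are illustrative (the noise in x is kept small so that the observed slope comes out close to 2), and 'controlling' x is represented by fixing it at the value 10 without altering z:

```python
import numpy as np

rng = np.random.default_rng(8)

# The text's example: z is the basic quantity, with
#   y = -x + z + eps1   and   x = z/3 + eps2.
n = 500_000
z = rng.normal(scale=3.0, size=n)
eps1 = rng.normal(size=n)
eps2 = rng.normal(scale=0.1, size=n)

x = z / 3 + eps2
y = -x + z + eps1

# Free (uncontrolled) system: the fitted slope of y on x is about 2.
print(np.cov(x, y, bias=True)[0, 1] / x.var())

# Control x at a high value without affecting z: y falls, despite the
# strong positive observed relationship.
x_set = 10.0
y_controlled = -x_set + z + eps1
print(y.mean(), y_controlled.mean())   # roughly 0 versus roughly -10
```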

Whenever the regression of y on a set of quantities x is discussed, one has to beware of the possible presence of other, unobserved quantities z that could affect the relationship. A laboratory scientist, or even a social scientist doing a planned survey, can often guard against such hidden quantities by careful design or by appropriate randomization; but an economist, or anyone who has to rely on data from unplanned studies, has always to be on his guard against their effects. Another way of describing the difficulty is to distinguish carefully between association and causation. All regression and correlation analyses can do is study association: the underlying causal mechanism is not necessarily revealed. It is remarkable how little attention has been paid by statisticians to the meaning of causation, and to how it can be revealed by statistical analysis. Economists have had to rely on statistical analyses of randomly obtained data and some of the causal inferences they have drawn are totally unjustified by that data and the analyses.

We now consider the nature of these statistical analyses, confining ourselves predominantly to the case of homoskedastic, linear regression y = Σβixi + ε, ε having mean zero and variance \( \sigma^2 \). There the means have been supposed zero. There is usually no difficulty over this, as the mean of each variable can ordinarily be estimated by the sample means, \( \bar{y} \) and \( {\bar{x}}_i \). The quantities being discussed here are, in terms of the original data, the deviations \( y-\bar{y} \) and \( {x}_i-{\bar{x}}_i \) from the sample means. The standard method of estimating the β’s and \( \sigma^2 \) is least squares. This has been in use for two centuries and is still adopted by almost all data analysts. If the data are (yj, xji: i = 1, 2, …, m; j = 1, 2, …, n), consisting of n independent observations of y and the m regressor variables, then the least-squares estimates of the βi are provided by minimizing the sum of squares of the residuals y − Σβixi over the n observations: that is, \( {\varSigma}_j{\left({y}_j-{\varSigma}_i{\beta}_i{x}_{ji}\right)}^2 \). Matrix notation is most convenient. Write \( y={\left({y}_1,{y}_2,\dots, {y}_n\right)}^T \), \( \beta ={\left({\beta}_1,{\beta}_2,\dots, {\beta}_m\right)}^T \) and X as the matrix with elements xji, observation j on variable xi. Then y = Xβ + residual, and the sum of squares to be minimized over β is \( {\left(y-X\beta \right)}^T\left(y-X\beta \right) \), with minimum given by \( \widehat{\beta}={\left({X}^TX\right)}^{-1}{X}^Ty \). The variance \( \sigma^2 \) is estimated by the sum of squares at \( \widehat{\beta} \), divided by (n − m). The \( {\widehat{\beta}}_i \) are called the least-squares estimates of the βi.
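A compact sketch of the calculation, using the closed-form expressions above; the data, and the choices of n, m and the true coefficients, are invented:

```python
import numpy as np

rng = np.random.default_rng(9)

# Least squares via beta_hat = (X^T X)^{-1} X^T y, with sigma^2 estimated by
# the residual sum of squares divided by (n - m).
n, m = 1_000, 3
X = rng.normal(size=(n, m))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - m)

print(beta_hat)     # close to [0.5, -1.0, 2.0]
print(sigma2_hat)   # close to 1.5**2 = 2.25
```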

The method is deservedly popular because it is relatively easy to use and interpret, and many convenient computer programs are available. Its long and successful history testifies to its merits. Unfortunately it has been discovered that there can be very real difficulties when m, the number of variables, is large. With the availability of fast computers capable of handling a lot of data, it is not uncommon to have 40 or more variables. The difficulties then become noticeable. Before the arrival of such computing power, least squares was only used with few variables and the difficulties were scarcely noticeable. It is easy to appreciate what could go wrong: it is not so easy to correct it. Consider the case where the sum of squares is \( {\varSigma}_j{\left({y}_j-{\beta}_j\right)}^2 \). This apparently special and degenerate case is, in fact, a canonical form for least squares, and any multiple regression situation can be transformed to it by linear transformations. (In so doing, the meanings of the y’s and β’s will change.) The minimization is trivial, with estimate \( {\widehat{\beta}}_j={y}_j \), and the minimum value is zero. But we know that yj differs from its expectation, here βj but in general Σiβixji, by an amount which has variance \( \sigma^2 \), so the average of \( {\varSigma}_j{\left({y}_j-{\widehat{\beta}}_j\right)}^2 \) ought to be about \( \sigma^2 \), and indeed this is the usual estimate of \( \sigma^2 \), as mentioned above. But here the estimate is zero, which is absurd. The first rigorous demonstration that least squares is unsatisfactory was given by Charles Stein. He showed that whenever the number of variables exceeds two, there is an estimate which is, for every value of the regression coefficients, better than least squares. Better here means having smaller mean-square error, though the statement remains true under many other meanings. The efficiency of least squares varies with the true values of the β’s; the result just quoted says that it is always less than one. It can be as low as 2/m: with m = 40 this gives only 5% efficiency, a rather serious loss.
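The following sketch compares least squares with one shrinkage estimate of the kind Stein's result guarantees, the positive-part James-Stein estimate, in the canonical form; the values of m, \( \sigma^2 \) and the true coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)

# Canonical form: y_j is normal with mean beta_j and known variance sigma^2,
# and least squares estimates beta_j by y_j.  The (positive-part) James-Stein
# shrinkage estimate beats it in total mean-square error whenever m > 2.
m, sigma2, reps = 40, 1.0, 2_000
beta = rng.normal(scale=0.5, size=m)          # fixed "true" coefficients

sse_ls = sse_js = 0.0
for _ in range(reps):
    y = beta + rng.normal(scale=np.sqrt(sigma2), size=m)
    shrink = max(0.0, 1.0 - (m - 2) * sigma2 / (y @ y))
    sse_ls += np.sum((y - beta) ** 2)
    sse_js += np.sum((shrink * y - beta) ** 2)

# Average total squared error: shrinkage is markedly smaller than least squares.
print(sse_ls / reps, sse_js / reps)
```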

It is surprising how little attention Stein’s result has received outside a small group, largely of theoreticians, yet its practical value could be enormous. Stein, and others, have produced estimates which improve on least squares, but none has had much acceptance. Fairly early in the use of computers for regression analysis, it was appreciated that difficulties could arise when the matrix \( {X}^TX \), which has to be inverted to obtain the least-squares estimates, is ill-conditioned, with determinant near zero. This is the matrix of sample variances and covariances of the regressor variables, a typical element being \( {\varSigma}_r{x}_{ri}{x}_{rj} \), where the x’s are deviations from their means \( {\bar{x}}_i \). It will be ill-conditioned if, in the data, there is a near-linear relationship between the regressor variables. One suggestion was to put the matrix into correlation form, dividing each row and each column by the sample standard deviation of the variable corresponding to that row or column, so making all diagonal elements one and each off-diagonal element equal to a sample correlation coefficient between xi and xj, and then adding a constant λ to each unit diagonal element. This leads to ridge regression estimates, and ways of choosing λ have been proposed. It often works well but can fail.
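A sketch of the ridge recipe just described, on invented and deliberately near-collinear data; the choice λ = 0.1 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(11)

# Ridge regression: put the regressors into correlation form and add a constant
# lambda to each unit diagonal element before solving.
n, m, lam = 200, 5, 0.1
X = rng.normal(size=(n, m))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)        # near-linear relation: ill-conditioning
y = X @ np.array([1.0, 1.0, 0.5, 0.0, -0.5]) + rng.normal(size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)            # standardized regressors
yc = y - y.mean()

R = Xs.T @ Xs / n                                    # correlation matrix of the regressors
r = Xs.T @ yc / n                                    # covariances of centred y with them

beta_ls = np.linalg.solve(R, r)                      # least squares: unstable on the collinear pair
beta_ridge = np.linalg.solve(R + lam * np.eye(m), r) # ridge: shrunk, typically stabler

print(beta_ls)
print(beta_ridge)
```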

These ideas all lie within the frequentist school of inference. In principle, a solution is available within the Bayesian paradigm for inference. Here, in addition to the distribution of y given x, a probability distribution is included for the regression parameter β = (β1, β2, …, βm). Inference is then made by calculating the revised probability distribution of β, given the data. This procedure always avoids Stein’s criticism, provided the original distribution of β has total integral unity. (Least squares results from this procedure only if all the values of β are equally probable, a form which is not finitely integrable.) The practical difficulty is the choice of a distribution for β. The ridge method can be produced for certain types of exchangeable distribution for β. In the case of polynomial regression, a reasonable possibility is to suppose that the coefficients of the higher-degree polynomials are likely to be smaller than those of lower degree. When the regressor variables refer to different quantities, a possibility is to suppose that few of them have an appreciable coefficient, and therefore influence y, but that it is not known which are the determining ones.

This idea that only a few regressors matter has led to a lot of work on the choice of which to include in the regression. There are two broad ways to proceed. One can fit all the quantities available and then discard them one by one, as long as the discarding has little effect. Or one can proceed in the reverse direction, introducing them one at a time, only if they have an appreciable effect. In both of these methods it has to be decided how the effect is to be measured. The usual criterion is the change in the variance of y ascribable to x, the quantity denoted above by \( {R}^2{\sigma}_{yy} \); alternatively expressed, it is the change in the multiple correlation coefficient. For example, in the method where variables are discarded, \( R^2 \) will decrease when a variable is omitted from the regression. Only if this decrease is small will the omission be granted. There are two difficulties here. First, it is possible for two quantities separately to have little effect, but jointly to be of considerable importance, so that tests of them one at a time may be misleading. (The possibility of computing all \( 2^m \) regressions is too extravagant.) Second, it is not clear what is meant by saying the change in \( R^2 \) is “small”: how small? One possibility is to use an ordinary significance test, here a t-test, at some suitably chosen significance level: if significant, the regressor causing the change can be included; if not, it is omitted. This has been thought unsatisfactory by some, and other criteria have been proposed. It is here that the Bayesian and frequentist views part company. The usual Bayesian criterion for ‘small’ depends on the assumed distribution for the regression coefficients, but, in general, it seems to need more evidence to introduce a regressor when using the Bayesian approach than when employing a significance test. The former has been accused of favouring the hypothesis that the variable is not worth including. The Bayesian reply is that some ‘significant’ effects are spurious. Multiple regression techniques are so widely used today that one wonders how many effects of xi on y reported in the literature are meaningful.
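A sketch of the 'discard one at a time' strategy, using the size of the t-statistic as the criterion for 'small'; the data, the number of regressors and the cut-off |t| > 2 (roughly a 5% two-sided test for large n) are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(12)

# Backward elimination: drop the least significant regressor, refit, and stop
# when everything remaining is 'significant'.
n = 500
X = rng.normal(size=(n, 6))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)   # only the first two matter

cols = list(range(X.shape[1]))
while cols:
    Xc = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    sigma2 = resid @ resid / (n - len(cols))
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xc.T @ Xc)))
    t = beta / se
    weakest = int(np.argmin(np.abs(t)))
    if abs(t[weakest]) > 2.0:
        break                      # every remaining regressor is 'significant'
    cols.pop(weakest)              # discard the least significant one and refit

print(cols)                        # typically [0, 1]
```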

Regression concerns a relation, to take the linear, one-variable form, y = βx + ε, between y and x. This treats y and x asymmetrically and does not lead to \( x={\beta}^{-1}y+{\varepsilon}^{\prime } \) with ε′ unrelated to y. There is, however, a symmetric formulation that is sometimes useful. Suppose two quantities, ξ and η, are exactly linearly related, η = βξ, or equally \( \xi ={\beta}^{-1}\eta \). Suppose that each is measured with error, giving y = η + ε, x = ξ + ε′. Then the pair (x, y) may have linear regressions, but the real interest lies in β, the coefficient of the exact relationship. This is often referred to as the case where both variables, dependent and regressor, are subject to error. Ordinary least-squares techniques, even with a single regressor variable, require modification.
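A sketch of the attenuation that makes the modification necessary: when the regressor itself is observed with error, the ordinary least-squares slope of y on x is biased towards zero (all variances below are invented):

```python
import numpy as np

rng = np.random.default_rng(13)

# Exact relation eta = beta * xi, but only noisy versions x and y are observed.
n, beta = 200_000, 2.0
xi = rng.normal(size=n)                      # true, unobserved regressor
x = xi + rng.normal(scale=0.5, size=n)       # observed with error
y = beta * xi + rng.normal(scale=0.5, size=n)

slope = np.cov(x, y, bias=True)[0, 1] / x.var()
print(slope)   # about beta / (1 + 0.25) = 1.6, attenuated from the true 2.0
```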

Linear multiple regression is part of the general theory of linear models in which, to use the notation above, E(y | X) = Xβ, the linearity being in the parameter β. Least squares and its Stein-type modifications are the standard techniques for analysis, together with the analysis of variance.

See Also