Correlation is a tool for understanding the relationship between two quantities. Regression considers how one quantity is influenced by another. In correlation analysis the two quantities are treated symmetrically; in regression analysis one is supposed to depend on the other, in an asymmetric way. Extensions to sets of quantities are important.

Suppose that for each value of a quantity x, another quantity y has a probability distribution p(y | x), the probability of y, given x. The mean value of this distribution, alternatively called the expectation of y, given x, and written E(y | x), is a function of x and is called the regression of y on x. The quantity x is often called the independent variable, though a better term is regressor variable; y is the dependent variable. The regression tells us something about how y depends on x. The simplest case is linear regression, where E(y | x) = α + βx, with parameters α and β: the latter is called the regression coefficient (of y on x). Other features of the conditional distribution p(y | x) are usually considered in addition to the mean. The variance (or standard deviation) measures the spread of the y-values, for fixed x. A common case is where this is constant over x: the regression is then said to be homoskedastic. A further common assumption is that p(y | x) is normal, or Gaussian. Then y is normally distributed about α + βx with constant variance \( \sigma^2 \).

The concept of the regression of y on x does not involve a probability distribution for the regressor x. If x does have one, p(x), then x and y have a joint distribution given by p(x, y) = p(y | x)p(x). This joint distribution yields variances, σxx and σyy, for x and y, and a covariance σxy. The correlation between x and y is then defined as \( {\rho}_{xy}={\sigma}_{xy}/{\left({\sigma}_{xx}{\sigma}_{yy}\right)}^{1/2} \). It is the ratio of the covariance to the product of the standard deviations and is clearly unaffected by a change of scale in either x or y (and, since the variances and covariance are unchanged by a change of origin, by a change of origin as well). It is easy to show that −1 ≤ ρxy ≤ 1, and that if x and y are independent, ρxy is zero. When ρxy = 0, x and y are said to be uncorrelated. The correlation measures the association between x and y. If x and y have a joint distribution, then not only is there a regression of y on x, considered above, but also of x on y.

The linear, homoskedastic case is easily the most common one used in practice and has several important properties. We may write y = α + βx + ε, where ε has zero mean and variance \( \sigma^2 \). If x has a distribution, then the factorization p(x, y) = p(y | x)p(x) shows that ε is independent of x (in the homoskedastic case its distribution, given x, does not depend on x), so that ε and x are uncorrelated. Averaging, we have μy = α + βμx, relating the means, μx and μy, of x and y. A change of origin enables both of these to be put equal to zero, when α = 0 and E(y | x) = βx, or y = βx + ε. Multiplying this last result by x and taking expectations, σxy = βσxx, as ε and x are uncorrelated. Consequently the regression coefficient of y on x equals σxy/σxx. Similarly the regression coefficient of x on y (if that regression is also linear and homoskedastic) is σxy/σyy, and the square of the correlation coefficient equals the product of the two regression coefficients.
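As a minimal numerical sketch of these identities (Python with NumPy; the coefficients, noise level and sample size are invented for the illustration, not taken from the text), the two regression coefficients σxy/σxx and σxy/σyy can be computed from a simulated bivariate sample and their product compared with the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative bivariate sample: x normal, y linear in x plus independent noise.
n = 200_000
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=2.0, size=n)

sigma_xx, sigma_yy = x.var(), y.var()
sigma_xy = np.cov(x, y, bias=True)[0, 1]

beta_y_on_x = sigma_xy / sigma_xx          # regression coefficient of y on x
beta_x_on_y = sigma_xy / sigma_yy          # regression coefficient of x on y
rho_xy = sigma_xy / np.sqrt(sigma_xx * sigma_yy)

# The product of the two regression coefficients equals the squared correlation.
print(beta_y_on_x * beta_x_on_y, rho_xy**2)
```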

Returning to the relation y = βx + ε and considering the variances of both sides, we obtain \( {\sigma}_{yy}={\beta}^2{\sigma}_{xx}+{\sigma}^2 \) (again using the lack of correlation between x and ε). Hence \( {\sigma}^2={\sigma}_{yy}-{\sigma}_{xy}^2/{\sigma}_{xx} \), on using β = σxy/σxx, and we have the important relationship \( {\sigma}^2={\sigma}_{yy}\left(1-{\rho}_{xy}^2\right) \), showing that the variance \( \sigma^2 \), of y about the regression, is a proportion \( \left(1-{\rho}_{xy}^2\right) \) of the total variance of y, σyy. In the form \( {\sigma}_{yy}={\beta}^2{\sigma}_{xx}+{\sigma}^2 \), we have the result that the total variance of y is made up of two additive components: that due to x, \( {\beta}^2{\sigma}_{xx} \), and that about the regression line. The former is called the component of variance ascribable to x; the latter is the residual variance and, as we have just seen, is a proportion \( \left(1-{\rho}_{xy}^2\right) \) of the total. That ascribable to x is a proportion \( {\rho}_{xy}^2 \). This decomposition of variance is at the heart of analysis of variance techniques.
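The decomposition can be checked numerically in the same way; the values of β and \( \sigma^2 \), and the distribution of x below, are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Homoskedastic linear model y = beta * x + eps, with made-up beta and sigma^2.
n = 500_000
beta, sigma2 = 0.7, 1.2
x = rng.normal(scale=2.0, size=n)
y = beta * x + rng.normal(scale=np.sqrt(sigma2), size=n)

sigma_xx, sigma_yy = x.var(), y.var()
sigma_xy = np.cov(x, y, bias=True)[0, 1]
rho_sq = sigma_xy**2 / (sigma_xx * sigma_yy)

# Total variance of y = component ascribable to x + residual variance,
# and the residual variance is the proportion (1 - rho^2) of the total.
print(sigma_yy, beta**2 * sigma_xx + sigma2)
print(sigma_yy * (1 - rho_sq), sigma2)
```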

The ideas of regression and correlation are due to Galton and Pearson. The classic example has x the height of a father and y that of his son. Both regressions are linear, homoskedastic and normal, having positive regression coefficients which are less than one. Galton noticed that tall (short) fathers have sons who are, on average, shorter (taller) than themselves. This follows since, centering the values at the mean, or average, height, E(y | x) = βx < x if x > 0, corresponding to tall fathers, and βx > x if x < 0 for short ones. This is the phenomenon of regression (of heights) towards the mean and is necessary if the variability in heights is not to increase from one generation to the next. An illustration from economics might have x as the price of an item and y the number sold. There β will be negative, reflecting the average decrease in numbers sold as the price increases. Here x might not have a probability distribution but might instead be under the control of the seller.

The modern tendency is to make increasing use of regression and less of correlation. Part of the explanation for this is the importance of dependency relations, rather than associations, between quantities. Another reason is that in so many examples (such as the item price above) the regressor variable is not random, so that σxx and σxy are meaningless and correlation ideas are unavailable. A third consideration is that correlation can be misleading. As an illustration of this, let x be a random quantity distributed symmetrically about zero. Let \( y={x}^2 \). Then \( {\sigma}_{xy}=E(xy)=E\left({x}^3\right)=0 \) by the symmetry about zero. Hence the correlation is zero whilst y and x are highly associated, one being the square of the other. Correlation ideas work well when all variables are normally distributed but less well otherwise. (If \( y={x}^2 \), y cannot be normal.)
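A quick simulation of this illustration (the normal choice for x is merely one convenient symmetric distribution, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(3)

# x is symmetric about zero and y = x^2 is an exact function of x,
# yet the two are uncorrelated.
x = rng.normal(size=1_000_000)
y = x**2

print(np.cov(x, y, bias=True)[0, 1])   # ~ 0, since E(x * x^2) = E(x^3) = 0
print(np.corrcoef(x, y)[0, 1])         # correlation ~ 0 despite exact dependence
```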

The ideas and definitions extend to the case where there are several regressor variables x1, x2, …, xm. Write x = (x1, x2, …, xm). Then E(y | x) is the (multiple) regression of y on x. In the linear case with means at zero, E(y | x) = Σβixi, and βi is the partial regression coefficient of y on xi. The notation and nomenclature here are too brief and can be misleading, for βi only measures the dependence of y on xi in the presence of the other quantities in x. Were, say, xm to be omitted, βi, i < m, would typically change: indeed, the regression might not remain linear. The cumbersome notation exemplified by β2.134 (i = 2, m = 4) is sometimes used: in words, the coefficient of y on x2, allowing for x1, x3 and x4. The variance about the regression remains, and the homoskedastic case, where this is constant, is the one usually considered.
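A sketch of this sensitivity to an omitted regressor, with hypothetical data in which x1 and x2 are correlated (all coefficients invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two correlated regressors; y depends on both.  All coefficients are made up.
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Partial regression coefficients of y on (x1, x2) jointly.
beta_full, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)

# Omit x2: the coefficient on x1 changes, because x1 now also stands in for x2.
beta_x1_only, *_ = np.linalg.lstsq(x1[:, None], y, rcond=None)

print(beta_full)      # close to [1.0, 2.0]
print(beta_x1_only)   # close to 1.0 + 2.0 * 0.8 = 2.6
```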

In the linear case E(y | x) = Σβixi, the x’s can be functionally related. A common case is where \( {x}_i={x}^i \), the powers of a single quantity x; this is referred to as polynomial regression. It is usually more convenient to work with polynomials Pi(x) of degree i in x which are orthogonal with respect to some measure. Then E(y | x) = ΣβiPi(x). Another possibility is where the xi are periodic, say \( {x}_i=\cos\, it \). Notice that the linearity is in the terms Pi(x), or rather in the coefficients βi, not in x.
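As an illustrative sketch, the orthogonal-polynomial case can be fitted with NumPy's Legendre routines; the true curve, the noise level and the degree below are arbitrary choices, not taken from the text:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(5)

# Polynomial regression using orthogonal (Legendre) polynomials P_i(x).
x = rng.uniform(-1, 1, size=500)
y = 1.0 - 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)

coefs = legendre.legfit(x, y, deg=3)   # least-squares fit of sum_i beta_i P_i(x)
fitted = legendre.legval(x, coefs)

print(coefs)                           # estimated coefficients beta_0 .. beta_3
print(np.mean((y - fitted) ** 2))      # residual variance about the fitted curve
```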

If the regressor variables have a joint distribution, then the covariances σyi, between y and xi, and σij, between xi and xj, are available. With more than one regressor variable, additional concepts can be introduced. For example, if all the x’s are held fixed except for xi, there is a conditional joint distribution of y and xi, given all the x’s except xi. This has a correlation, defined as above as the ratio of the conditional covariance to the product of the conditional standard deviations, and is called the partial correlation between y and xi. The notation is exemplified by ρy2.134. This will, in general, depend on the fixed values of the regressor variables but is normally only used when it is constant. This happens if the joint distribution of y and x is multivariate normal.
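Under multivariate normality the partial correlation can equivalently be computed as the correlation of the residuals of y and xi after regressing each on the remaining regressors; the following sketch, with invented data, uses that route:

```python
import numpy as np

rng = np.random.default_rng(6)

def residual(v, w):
    """Residual of v after least-squares regression on w (with a constant)."""
    W = np.column_stack([np.ones_like(w), w])
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef

# Illustrative trivariate normal data: x2 drives both y and x1.
n = 200_000
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Partial correlation rho_{y1.2}: correlate the two sets of residuals on x2.
print(np.corrcoef(residual(y, x2), residual(x1, x2))[0, 1])
```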

In the case of a single regressor variable we saw that \( 1-{\rho}_{xy}^2={\sigma}^2/{\sigma}_{yy} \), where \( \sigma^2 \) is the residual variance of y, conditional on x. In the multiple case, continue to define \( \sigma^2 \) in this way, conditional on all the quantities in x. Then define \( R^2 \) by \( 1-{R}^2={\sigma}^2/{\sigma}_{yy} \), in analogy with the single-variable case. The positive square root R is called the multiple correlation coefficient (of y on x). As before, we may write \( {\sigma}_{yy}={\sigma}^2+{R}^2{\sigma}_{yy} \), expressing the total variance of y additively in terms of the residual variance \( \sigma^2 \) and that due to the regression on x. It is more common nowadays to work in terms of the variance components than \( R^2 \).
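A brief numerical check of this decomposition, with made-up data; \( R^2 \) is computed from the residual and total variances as defined above:

```python
import numpy as np

rng = np.random.default_rng(7)

# Multiple correlation coefficient R of y on x = (x1, x2, x3); data are invented.
n = 50_000
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

R2 = 1 - resid.var() / y.var()
print(R2)
# Total variance of y splits (approximately, in the sample) into the residual
# variance and the variance due to the regression.
print(y.var(), resid.var() + fitted.var())
```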

The mathematical theory of regression and correlation is now well understood. Centering at the means, all the concepts depend on the matrix of variances and covariances of y, the dependent variable, and x, the set of regressor variables: the σyi and σij. The calculations are merely ways of rearranging these elements in convenient forms: correlations and components of variance in regression are just two possibilities. The real difficulty, and the real interest, in regression lies in the interpretation of the results.

As an illustration consider the simple case of linear, homoskedastic regression of y on a single regressor variable x, written y = βx + ε, with β as the regression coefficient and ε as the residual variation, with zero mean and variance \( \sigma^2 \). All this says is that for any fixed x, y has mean βx and variance \( \sigma^2 \); and it is only this aspect of the dependence of y on x that is described. Suppose a large amount of data consisting of pairs (xi, yi) is collected and the fit y = 2x + ε with \( {\sigma}^2=2 \) is established. (We discuss how this might be done below.) This shows a fairly close association between y and x. In order therefore to increase y it might be thought reasonable to set x to a high value. Suppose this is done: will it necessarily cause y to increase? Surprisingly, no. Suppose there is another quantity z and the real relationships are y = −x + z + ε1 and \( x=\frac{1}{3}z+{\varepsilon}_2 \), so that z is the basic quantity determining the situation. Substituting for z, this yields y = 2x + ε, with ε = ε1 − 3ε2, the observed relation. If now x is controlled at a large value without affecting z, which is, under natural conditions, the main determinant of x, the effect will be to decrease y through y = −x + z + ε1. Consequently a strong positive relationship between y and x need not imply an increase in y when x is increased. There can be an enormous difference between the association of y with x when x is uncontrolled and allowed to vary freely, and the association when x is controlled. The reason is the presence of another quantity z whose influence on x in the free system is disturbed by the control.
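A simulation of this example may make the point vivid. The variances below are illustrative (the noise in x is kept small so that the observed slope comes out close to 2), and 'controlling' x is represented by fixing it at the value 10 without altering z:

```python
import numpy as np

rng = np.random.default_rng(8)

# The text's example: z is the basic quantity, with
#   y = -x + z + eps1   and   x = z/3 + eps2.
n = 500_000
z = rng.normal(scale=3.0, size=n)
eps1 = rng.normal(size=n)
eps2 = rng.normal(scale=0.1, size=n)

x = z / 3 + eps2
y = -x + z + eps1

# Free (uncontrolled) system: the fitted slope of y on x is about 2.
print(np.cov(x, y, bias=True)[0, 1] / x.var())

# Control x at a high value without affecting z: y falls, despite the
# strong positive observed relationship.
x_set = 10.0
y_controlled = -x_set + z + eps1
print(y.mean(), y_controlled.mean())   # roughly 0 versus roughly -10
```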

Whenever the regression of y on a set of quantities x is discussed, one has to beware of the possible presence of other, unobserved quantities z that could affect the relationship. A laboratory scientist, or even a social scientist doing a planned survey, can often guard against such hidden quantities by careful design or by appropriate randomization; but an economist, or anyone who has to rely on data from unplanned studies, has always to be on his guard against their effects. Another way of describing the difficulty is to distinguish carefully between association and causation. All regression and correlation analyses can do is study association: the underlying causal mechanism is not necessarily revealed. It is remarkable how little attention has been paid by statisticians to the meaning of causation, and to how it can be revealed by statistical analysis. Economists have had to rely on statistical analyses of randomly obtained data and some of the causal inferences they have drawn are totally unjustified by that data and the analyses.

We now consider the nature of these statistical analyses, confining ourselves predominantly to the case of homoskedastic, linear regression y = Σβixi + ε, ε having mean zero and variance \( \sigma^2 \). There the means have been supposed zero. There is usually no difficulty over this, as the mean of each variable can ordinarily be estimated by the sample means, \( \bar{y} \) and \( {\bar{x}}_i \). The quantities being discussed here are, in terms of the original data, the deviations \( y-\bar{y} \) and \( {x}_i-{\bar{x}}_i \) from the sample means. The standard method of estimating the β’s and \( \sigma^2 \) is least squares. This has been in use for two centuries and is still adopted by almost all data analysts. If the data are (yj, xji: i = 1, 2, …, m; j = 1, 2, …, n), consisting of n independent observations of y and the m regressor variables, then the least-squares estimates of the βi are provided by minimizing the sum of squares of the residuals y − Σβixi over the n observations: that is, \( {\varSigma}_j{\left({y}_j-{\varSigma}_i{\beta}_i{x}_{ji}\right)}^2 \). Matrix notation is most convenient. Write \( y={\left({y}_1,{y}_2,\dots, {y}_n\right)}^T \), \( \beta ={\left({\beta}_1,{\beta}_2,\dots, {\beta}_m\right)}^T \) and X as the matrix with elements xji, observation j on variable xi. Then y = Xβ + residual, and the sum of squares to be minimized over β is \( {\left(y-X\beta \right)}^T\left(y-X\beta \right) \), with minimum given by \( \widehat{\beta}={\left({X}^TX\right)}^{-1}{X}^Ty \). The variance \( \sigma^2 \) is estimated by the sum of squares at \( \widehat{\beta} \), divided by (n − m). The \( {\widehat{\beta}}_i \) are called the least-squares estimates of the βi.
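A compact sketch of the calculation, using the closed-form expressions above; the data, and the choices of n, m and the true coefficients, are invented:

```python
import numpy as np

rng = np.random.default_rng(9)

# Least squares via beta_hat = (X^T X)^{-1} X^T y, with sigma^2 estimated by
# the residual sum of squares divided by (n - m).
n, m = 1_000, 3
X = rng.normal(size=(n, m))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - m)

print(beta_hat)     # close to [0.5, -1.0, 2.0]
print(sigma2_hat)   # close to 1.5**2 = 2.25
```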

The method is deservedly popular because it is relatively easy to use and interpret, and many convenient computer programs are available. Its long and successful history testifies to its merits. Unfortunately it has been discovered that there can be very real difficulties when m, the number of variables, is large. With the availability of fast computers capable of handling a lot of data, it is not uncommon to have 40 or more variables. The difficulties then become noticeable. Before the arrival of such computing power, least squares was only used with few variables and the difficulties were scarcely noticeable. It is easy to appreciate what could go wrong: it is not so easy to correct it. Consider the case where the sum of squares is \( {\varSigma}_j{\left({y}_j-{\beta}_j\right)}^2 \). This apparently special and degenerate case is, in fact, a canonical form for least squares, and any multiple regression situation can be transformed to it by linear transformations. (In so doing, the meanings of the y’s and β’s will change.) The minimization is trivial, with estimate \( {\widehat{\beta}}_j={y}_j \), and the minimum value is zero. But we know that yj differs from its expectation, here βj but in general Σiβixji, by an amount which has variance \( \sigma^2 \), so the average of \( {\varSigma}_j{\left({y}_j-{\widehat{\beta}}_j\right)}^2 \) ought to be about \( \sigma^2 \), and indeed this is the usual estimate of \( \sigma^2 \), as mentioned above. But here the estimate is zero, which is absurd. The first rigorous demonstration that least squares is unsatisfactory was given by Charles Stein. He showed that whenever the number of variables exceeds two, there is an estimate which is, for every value of the regression coefficients, better than least squares. Better here means having smaller mean-square error, though the statement remains true under many other meanings. The efficiency of least squares varies with the true values of the β’s; the result just quoted says that it is always less than one. It can be as low as 2/m: with m = 40 this gives only 5% efficiency, a rather serious loss.
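The following sketch compares least squares with one shrinkage estimate of the kind Stein's result guarantees, the positive-part James-Stein estimate, in the canonical form; the values of m, \( \sigma^2 \) and the true coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)

# Canonical form: y_j is normal with mean beta_j and known variance sigma^2,
# and least squares estimates beta_j by y_j.  The (positive-part) James-Stein
# shrinkage estimate beats it in total mean-square error whenever m > 2.
m, sigma2, reps = 40, 1.0, 2_000
beta = rng.normal(scale=0.5, size=m)          # fixed "true" coefficients

sse_ls = sse_js = 0.0
for _ in range(reps):
    y = beta + rng.normal(scale=np.sqrt(sigma2), size=m)
    shrink = max(0.0, 1.0 - (m - 2) * sigma2 / (y @ y))
    sse_ls += np.sum((y - beta) ** 2)
    sse_js += np.sum((shrink * y - beta) ** 2)

# Average total squared error: shrinkage is markedly smaller than least squares.
print(sse_ls / reps, sse_js / reps)
```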

It is surprising how little attention Stein’s result has received outside a small group, largely of theoreticians, yet its practical value could be enormous. Stein, and others, have produced estimates which improve on least squares, but none has had much acceptance. Fairly early in the use of computers for regression analysis, it was appreciated that difficulties could arise when the matrix \( {X}^TX \), which has to be inverted to obtain the least-squares estimates, is ill-conditioned, with determinant near zero. This is the matrix of sample variances and covariances of the regressor variables, a typical element being \( {\varSigma}_r{x}_{ri}{x}_{rj} \), where the x’s are deviations from their means \( {\bar{x}}_i \). It will be ill-conditioned if, in the data, there is a near-linear relationship between the regressor variables. One suggestion was to put the matrix into correlation form, dividing each row and each column by the sample standard deviation of the variable corresponding to that row or column, so making all diagonal elements one and each off-diagonal element equal to a sample correlation coefficient between xi and xj, and then adding a constant λ to each unit diagonal element. This leads to ridge regression estimates, and ways of choosing λ have been proposed. It often works well but can fail.
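A sketch of the ridge recipe just described, on invented and deliberately near-collinear data; the choice λ = 0.1 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(11)

# Ridge regression: put the regressors into correlation form and add a constant
# lambda to each unit diagonal element before solving.
n, m, lam = 200, 5, 0.1
X = rng.normal(size=(n, m))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)        # near-linear relation: ill-conditioning
y = X @ np.array([1.0, 1.0, 0.5, 0.0, -0.5]) + rng.normal(size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)            # standardized regressors
yc = y - y.mean()

R = Xs.T @ Xs / n                                    # correlation matrix of the regressors
r = Xs.T @ yc / n                                    # covariances of centred y with them

beta_ls = np.linalg.solve(R, r)                      # least squares: unstable on the collinear pair
beta_ridge = np.linalg.solve(R + lam * np.eye(m), r) # ridge: shrunk, typically stabler

print(beta_ls)
print(beta_ridge)
```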

These ideas all lie within the frequentist school of inference. In principle, a solution is available within the Bayesian paradigm for inference. Here, in addition to the distribution of y given x, a probability distribution is included for the regression parameter β = (β1, β2, …, βm). Inference is then made by calculating the revised probability distribution of β, given the data. This procedure always avoids Stein’s criticism, provided the original distribution of β has total integral unity. (Least squares results from this procedure only if all the values of β are equally probable, a form which is not finitely integrable.) The practical difficulty is the choice of a distribution for β. The ridge method can be produced for certain types of exchangeable distribution for β. In the case of polynomial regression, a reasonable possibility is to suppose that the coefficients of the higher-degree polynomials are likely to be smaller than those of lower degree. When the regressor variables refer to different quantities, a possibility is to suppose that few of them have an appreciable coefficient, and therefore influence y, but that it is not known which are the determining ones.

This idea that only a few regressors matter has led to a lot of work on the choice of which to include in the regression. There are two broad ways to proceed. One can fit all the quantities available and then discard them one by one, as long as the discarding has little effect. Or one can proceed in the reverse direction, introducing them one at a time, only if they have an appreciable effect. In both of these methods it has to be decided how the effect is to be measured. The usual criterion is the change in the variance of y ascribable to x, the quantity denoted above by \( {R}^2{\sigma}_{yy} \); alternatively expressed, it is the change in the multiple correlation coefficient. For example, in the method where variables are discarded, \( R^2 \) will decrease when a variable is omitted from the regression. Only if this decrease is small will the omission be granted. There are two difficulties here. First, it is possible for two quantities separately to have little effect, but jointly to be of considerable importance, so that tests of them one at a time may be misleading. (The possibility of computing all \( 2^m \) regressions is too extravagant.) Second, it is not clear what is meant by saying the change in \( R^2 \) is “small”: how small? One possibility is to use an ordinary significance test, here a t-test, at some suitably chosen significance level: if significant, the regressor causing the change can be included; if not, it is omitted. This has been thought unsatisfactory by some, and other criteria have been proposed. It is here that the Bayesian and frequentist views part company. The usual Bayesian criterion for ‘small’ depends on the assumed distribution for the regression coefficients, but, in general, it seems to need more evidence to introduce a regressor when using the Bayesian approach than when employing a significance test. The former has been accused of favouring the hypothesis that the variable is not worth including. The Bayesian reply is that some ‘significant’ effects are spurious. Multiple regression techniques are so widely used today that one wonders how many effects of xi on y reported in the literature are meaningful.
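A sketch of the 'discard one at a time' strategy, using the size of the t-statistic as the criterion for 'small'; the data, the number of regressors and the cut-off |t| > 2 (roughly a 5% two-sided test for large n) are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(12)

# Backward elimination: drop the least significant regressor, refit, and stop
# when everything remaining is 'significant'.
n = 500
X = rng.normal(size=(n, 6))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)   # only the first two matter

cols = list(range(X.shape[1]))
while cols:
    Xc = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    sigma2 = resid @ resid / (n - len(cols))
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xc.T @ Xc)))
    t = beta / se
    weakest = int(np.argmin(np.abs(t)))
    if abs(t[weakest]) > 2.0:
        break                      # every remaining regressor is 'significant'
    cols.pop(weakest)              # discard the least significant one and refit

print(cols)                        # typically [0, 1]
```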

Regression concerns a relation, to take the linear, one-variable form, y = βx + ε, between y and x. This treats y and x asymmetrically and does not lead to \( x={\beta}^{-1}y+{\varepsilon}^{\prime } \) with ε′ unrelated to y. There is, however, a symmetric formulation that is sometimes useful. Suppose two quantities, ξ and η, are exactly linearly related, η = βξ, or equally \( \xi ={\beta}^{-1}\eta \). Suppose that each is measured with error, giving y = η + ε, x = ξ + ε′. Then the pair (x, y) may have linear regressions, but the real interest lies in β, the coefficient of the exact relationship. This is often referred to as the case where both variables, dependent and regressor, are subject to error. Ordinary least-squares techniques, even with a single regressor variable, require modification.
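A sketch of the attenuation that makes the modification necessary: when the regressor itself is observed with error, the ordinary least-squares slope of y on x is biased towards zero (all variances below are invented):

```python
import numpy as np

rng = np.random.default_rng(13)

# Exact relation eta = beta * xi, but only noisy versions x and y are observed.
n, beta = 200_000, 2.0
xi = rng.normal(size=n)                      # true, unobserved regressor
x = xi + rng.normal(scale=0.5, size=n)       # observed with error
y = beta * xi + rng.normal(scale=0.5, size=n)

slope = np.cov(x, y, bias=True)[0, 1] / x.var()
print(slope)   # about beta / (1 + 0.25) = 1.6, attenuated from the true 2.0
```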

Linear multiple regression is part of the general theory of linear models in which, to use the notation above, E(y | X) = Xβ, the linearity being in the parameter β. Least squares and its Stein-type modifications are the standard techniques for analysis, together with the analysis of variance.

See Also