
8.1 Causation and Correlation

Suppose we find a direct correlation between two variables, X and Y. This does not by itself mean that the change in variable “Y” is a direct cause of the change in variable “X.” Conversely, if the change in “Y” is directly associated with a change in the variable “X,” then X and Y will certainly be correlated. The existence of correlation may be due to any one of the following:

8.2 One Variable Being a Cause of Another

The cause variable is taken as the independent variable (X), and the effect variable is considered the dependent one (Y). Suppose “age” and “height” are correlated. Age is the independent variable, which is a cause of the change in height, the dependent variable.

8.2.1 Both Variables Being the Result of a Common Cause

A group of women was followed up after a given operation, and the duration of survival and the number of children borne by each woman were recorded. These factors were found to show a high degree of “positive correlation.” It would be tempting to interpret this data in either of the following two ways:

  1. Prolonged life of a woman tends to result in her bearing more children.

  2. Bearing of children tends to prolong the life of a woman.

Note

Both these interpretations are absurd: neither does prolonged life have any effect on the bearing of children, nor does the bearing of children increase the life span of a woman. One can therefore think of some other factors, such as age and state of health at the time of the operation, which could tend to affect both the survival time and the bearing of children.

8.2.2 Chance

The rainfall at some place in the north may show a high degree of correlation with the per-acre yield of rice in the south. It would be meaningless to think that the rainfall recorded in the north has any effect on the yield of rice in the south. Such correlations are called spurious or chance correlations. Hence, one must reasonably consider whether any likely relationship exists between the two variables under study.

So, one should be very careful in interpreting the relationship whenever correlation between two variables is found.

8.3 Methods of Studying Correlation

  1. The scatter diagram

  2. Pearson’s coefficient of correlation

  3. The regression line

8.3.1 The Scatter Diagram

Usually the scale on the Y-axis starts from zero, though the scale on the X-axis need not start from zero. In the case of a “scatter diagram,” however, this restriction on the Y-axis is also removed: both the X- and Y-axes may be started at the minimum values of the respective variables.
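As an illustration, such a scatter diagram can be drawn in Python with matplotlib. The sketch below uses hypothetical height and weight values (illustrative only, not the textbook data) and starts both axes near the minimum observed values, as just described:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations (illustrative only)
heights = [66, 68, 69, 70, 72]   # X variable
weights = [58, 66, 70, 75, 81]   # Y variable

fig, ax = plt.subplots()
ax.scatter(heights, weights)
ax.set_xlabel("Height (X)")
ax.set_ylabel("Weight (Y)")
# Start both axes near the minimum values of the respective variables
# rather than at zero, as the text permits for scatter diagrams
ax.set_xlim(min(heights) - 1, max(heights) + 1)
ax.set_ylim(min(weights) - 2, max(weights) + 2)
plt.show()
```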

8.3.2 Pearson Coefficient of Correlation for Ungrouped Data

Pearson’s coefficient of correlation is a measure of the degree of linear relationship between two variables. It is denoted by “r” in the case of the sample estimate and by “ρ” in the case of the correlation obtained from the whole population. It is also known as the product moment coefficient of correlation. The computational formulae for both are given in Table 8.1.

Table 8.1 Pearson’s coefficient of correlation formulae

The formulae may also be written in different forms for the sake of convenience in calculations. These are as below:

$$ \mathbf{r}=\frac{\sum_{i=1}^n\left({X}_i-\overline{X}\right)\cdot \left({Y}_i-\overline{Y}\right)}{\sqrt{\sum_{i=1}^n{\left({X}_i-\overline{X}\right)}^2\cdot \sum_{i=1}^n{\left({Y}_i-\overline{Y}\right)}^2}} $$
$$ \mathbf{r}=\frac{\sum_{i=1}^n{X}_i{Y}_i-n\overline{X}\,\overline{Y}}{\sqrt{\left[\sum_{i=1}^n{X}_i^2-n{\overline{X}}^2\right]\cdot \left[\sum_{i=1}^n{Y}_i^2-n{\overline{Y}}^2\right]}} $$
$$ \mathbf{r}=\frac{\sum_{i=1}^n{X}_i{Y}_i-\frac{\sum_{i=1}^n{X}_i\cdot \sum_{i=1}^n{Y}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{X}_i^2-\frac{{\left(\sum_{i=1}^n{X}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{Y}_i^2-\frac{{\left(\sum_{i=1}^n{Y}_i\right)}^2}{n}\right]}} $$
$$ \mathbf{r}=\frac{\sum_{i=1}^n{u}_i{v}_i-\frac{\sum_{i=1}^n{u}_i\cdot \sum_{i=1}^n{v}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{u}_i^2-\frac{{\left(\sum_{i=1}^n{u}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{v}_i^2-\frac{{\left(\sum_{i=1}^n{v}_i\right)}^2}{n}\right]}} $$

In the above formula, “u” and “v” are new variables used to simplify the computation: \( u = X-{X}_0 \) and \( v = Y-{Y}_0 \), where \( {X}_0 \) and \( {Y}_0 \) are the assumed means.
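As a minimal sketch of the assumed-mean (coded) formula above, the following Python function (its name and structure are illustrative, not from the text) computes r from u = X − X₀ and v = Y − Y₀; shifting by the assumed means leaves r unchanged:

```python
def pearson_r_coded(xs, ys, x0, y0):
    """Pearson's r via the assumed-mean formula, u = X - X0, v = Y - Y0."""
    n = len(xs)
    u = [x - x0 for x in xs]
    v = [y - y0 for y in ys]
    su, sv = sum(u), sum(v)
    suu = sum(ui * ui for ui in u)
    svv = sum(vi * vi for vi in v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    # Corrected sum of products and sums of squares, as in the formula
    num = suv - su * sv / n
    den = ((suu - su**2 / n) * (svv - sv**2 / n)) ** 0.5
    return num / den
```

For instance, with the chick data of Example 5 below and assumed means X₀ = 170 and Y₀ = 40, this returns r ≈ 0.896.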

Pearson’s coefficient of correlation (r) can also be computed by the “difference formula” as given below:

$$ \mathbf{r}=\frac{\sum_{i=1}^n{x}_i^2+\sum_{i=1}^n{y}_i^2-\sum_{i=1}^n{d}_i^2}{2\sqrt{\sum_{i=1}^n{x}_i^2\cdot \sum_{i=1}^n{y}_i^2}} $$

in which \( \sum_{i=1}^n{d}_i^2=\sum_{i=1}^n{\left({x}_i-{y}_i\right)}^2 \), \( x = X-\overline{X} \), and \( y = Y-\overline{Y} \).

When the raw observations are used directly instead of deviations from the means, the above equation can be modified as below:

$$ \mathbf{r}=\frac{n\left[\sum_{i=1}^n{X}_i^2+\sum_{i=1}^n{Y}_i^2-\sum_{i=1}^n{\left({X}_i-{Y}_i\right)}^2\right]-2\left(\sum_{i=1}^n{X}_i\right)\left(\sum_{i=1}^n{Y}_i\right)}{2\sqrt{\left[n\sum_{i=1}^n{X}_i^2-{\left(\sum_{i=1}^n{X}_i\right)}^2\right]\left[n\sum_{i=1}^n{Y}_i^2-{\left(\sum_{i=1}^n{Y}_i\right)}^2\right]}} $$
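The difference formula can also be checked numerically. The following minimal sketch (function name hypothetical) first converts the observations to deviations from their means and then applies r = (Σx² + Σy² − Σd²) / (2√(Σx²·Σy²)); since Σd² = Σx² + Σy² − 2Σxy, this reduces algebraically to the standard product-moment formula:

```python
def pearson_r_difference(xs, ys):
    """Pearson's r via the difference formula, with d = x - y and
    x, y taken as deviations from their respective means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    x = [xi - mx for xi in xs]
    y = [yi - my for yi in ys]
    sxx = sum(v * v for v in x)                    # sum of x**2
    syy = sum(v * v for v in y)                    # sum of y**2
    sdd = sum((a - b) ** 2 for a, b in zip(x, y))  # sum of d**2
    return (sxx + syy - sdd) / (2 * (sxx * syy) ** 0.5)
```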

8.3.2.1 Examples Illustrating the Computations of r-Test

Example 1

Computations of “r” when deviations are taken from their means. Data of height and weight of five students has been tabulated in Table 8.2.

Table 8.2 Data of height and weight for r-test
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\sum_{i=1}^n\left({X}_i-\overline{X}\right)\cdot \left({Y}_i-\overline{Y}\right)}{\sqrt{\sum_{i=1}^n{\left({X}_i-\overline{X}\right)}^2\cdot \sum_{i=1}^n{\left({Y}_i-\overline{Y}\right)}^2}}\\ {}=\frac{\sum_{i=1}^n{x}_i{y}_i}{\sqrt{\sum_{i=1}^n{x}_i^2\cdot \sum_{i=1}^n{y}_i^2}}=\frac{55}{\sqrt{20\times 750}}=\mathbf{0.449}\end{array}} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.
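Incidentally, the table value \( {r}_{0.05}=0.878 \) for df = 3 can be reproduced from the t-distribution through the identity \( r = t/\sqrt{t^2+\mathrm{df}} \). A short sketch, assuming SciPy is available:

```python
from scipy import stats

df = 3                                    # n - 2 = 5 - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)    # two-tailed 5% point, ~3.182
r_crit = t_crit / (t_crit**2 + df) ** 0.5
print(round(r_crit, 3))                   # 0.878, matching the table value
```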

Example 2

Computations of “r” when deviations are taken from the assumed means. Data of height and weight of five students has been tabulated in Table 8.3.

Table 8.3 Data of height and weight for r-test
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\sum_{i=1}^n{u}_i{v}_i-\frac{\sum_{i=1}^n{u}_i\cdot \sum_{i=1}^n{v}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{u}_i^2-\frac{{\left(\sum_{i=1}^n{u}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{v}_i^2-\frac{{\left(\sum_{i=1}^n{v}_i\right)}^2}{n}\right]}}\\ {}=\frac{55-\frac{\left(-5\right)\cdot (0)}{5}}{\sqrt{\left[25-\frac{{\left(-5\right)}^2}{5}\right]\cdot \left[750-\frac{(0)^2}{5}\right]}}=\frac{55}{\sqrt{20\times 750}}=\mathbf{0.449}\end{array}} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.

Example 3

Computations of “r” from observed data without taking deviations. Data of height and weight of five students has been tabulated in Table 8.4.

Table 8.4 Data of height and weight for r-test
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\sum_{i=1}^n{X}_i{Y}_i-\frac{\sum_{i=1}^n{X}_i\cdot \sum_{i=1}^n{Y}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{X}_i^2-\frac{{\left(\sum_{i=1}^n{X}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{Y}_i^2-\frac{{\left(\sum_{i=1}^n{Y}_i\right)}^2}{n}\right]}}=\frac{24205-\frac{345\times 350}{5}}{\sqrt{\left[23825-\frac{(345)^2}{5}\right]\cdot \left[25250-\frac{(350)^2}{5}\right]}}\\ {}=\frac{24205-24150}{\sqrt{\left[23825-23805\right]\cdot \left[25250-24500\right]}}=\frac{55}{\sqrt{20\times 750}}=\frac{55}{122.47}=\mathbf{0.449}\end{array}} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.
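Since Example 3 supplies all the required totals, the computation can be verified directly from those sums, for instance in Python:

```python
# Totals from Example 3 (Table 8.4), n = 5 students
n = 5
sum_x, sum_y = 345, 350          # sum of heights, sum of weights
sum_xy = 24205                   # sum of cross-products X*Y
sum_x2, sum_y2 = 23825, 25250    # sums of squares

num = sum_xy - sum_x * sum_y / n                                  # 55
den = ((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n)) ** 0.5  # sqrt(20*750)
print(round(num / den, 3))                                        # 0.449
```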

Example 4

Computations of “r” by the “difference formula.” Data of height and weight of five students has been tabulated in Table 8.5.

Table 8.5 Data of height and weight for r-test
$$ \mathbf{r}=\frac{\sum_{i=1}^n{x}_i^2+\sum_{i=1}^n{y}_i^2-\sum_{i=1}^n{d}_i^2}{2\sqrt{\sum_{i=1}^n{x}_i^2\cdot \sum_{i=1}^n{y}_i^2}}=\frac{20+750-660}{2\sqrt{20\times 750}}=\frac{110}{2\sqrt{20\times 750}}=\frac{55}{\sqrt{20\times 750}}=\mathbf{0.449} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.

Example 5

The body weights of five chicks were 180, 170, 170, 190, and 190 g, and their comb weights were found to be 50, 40, 20, 60, and 60 g, respectively. Find out whether there is any correlation between the body weight and comb weight of the chicks.

Solution

Data has been transformed by subtracting 170 from the body weights and 40 from the comb weights as shown in Table 8.6.

Table 8.6 Body weights and comb weights of chicks
$$ {\displaystyle \begin{array}{c}\Sigma {u}^2=100+400+400=900\\ {}\Sigma {v}^2=100+400+400+400=1300\\ {}\Sigma uv=100+400+400=900\end{array}} $$
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\Sigma uv-\frac{\Sigma u\cdot \Sigma v}{n}}{\sqrt{\left(\Sigma {u}^2-\frac{{\left(\Sigma u\right)}^2}{n}\right)\left(\Sigma {v}^2-\frac{{\left(\Sigma v\right)}^2}{n}\right)}}\\ {}=\frac{900-\frac{50\times 30}{5}}{\sqrt{\left(900-\frac{(50)^2}{5}\right)\left(1300-\frac{(30)^2}{5}\right)}}\\ {}=\frac{900-300}{\sqrt{\left(900-500\right)\left(1300-180\right)}}\\ {}=\frac{600}{\sqrt{400\times 1120}}=\frac{600}{20\sqrt{1120}}=\frac{30}{\sqrt{1120}}=\frac{30}{33.5}=+\mathbf{0.895}\end{array}} $$

df = n − 2 = 5 − 2 = 3; \( {r}_{0.05}=0.878 \)

Decision

The calculated r = +0.895 is greater than \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is rejected (p < 0.05). Hence, there is a direct (positive) correlation between the body weights and comb weights of the chicks.
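Because Example 5 gives the full raw data, the result can also be checked against a library routine. A minimal sketch using scipy.stats.pearsonr (its r ≈ 0.896 differs from 0.895 only because the hand calculation rounded √1120 to 33.5):

```python
from scipy import stats

body = [180, 170, 170, 190, 190]   # body weights (g)
comb = [50, 40, 20, 60, 60]        # comb weights (g)

r, p = stats.pearsonr(body, comb)
print(round(r, 3), round(p, 4))    # r = 0.896, p < 0.05: reject H0
```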

8.3.3 Regression Line

To determine the amount of change that normally takes place in the Y variable for a unit change in the X variable, a line is fitted to the points plotted on the scatter diagram. This line is described by Y = a + bX and is said to be the line of regression of Y on X. Here, “a” and “b” are two constants: a is the Y-intercept, and b is the slope of the regression line. The “b” may also be written as \( {b}_{YX} \), the regression coefficient of Y on X. It is also possible to find \( {b}_{XY} \), the regression coefficient of X on Y. There is a definite relationship between “r” and the two regression coefficients \( {b}_{YX} \) and \( {b}_{XY} \): “r” is the geometric mean of \( {b}_{YX} \) and \( {b}_{XY} \).

Therefore:

$$ \mathbf{r}=\sqrt{{b}_{YX}\cdot {b}_{XY}};\kern0.5em \mathrm{but}\kern0.5em {b}_{YX}=\frac{\Sigma \left(X-\overline{X}\right)\left(Y-\overline{Y}\right)}{\Sigma {\left(X-\overline{X}\right)}^2} $$

whereas:

$$ \mathbf{r}=\frac{\sum_{i=1}^n\left({X}_i-\overline{X}\right)\cdot \left({Y}_i-\overline{Y}\right)}{\sqrt{\sum_{i=1}^n{\left({X}_i-\overline{X}\right)}^2\cdot \sum_{i=1}^n{\left({Y}_i-\overline{Y}\right)}^2}} $$

Hence it can be proved that \( {b}_{YX}=r\sqrt{\frac{\Sigma {\left(Y-\overline{Y}\right)}^2}{\Sigma {\left(X-\overline{X}\right)}^2}}=r\,\frac{s_Y}{s_X} \), where \( {s}_Y \) and \( {s}_X \) are the standard deviations of Y and X.
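The statement that r is the geometric mean of the two regression coefficients can be verified numerically, for instance with the chick data of Example 5; the helper function below is illustrative:

```python
def slope(xs, ys):
    """Regression coefficient of ys on xs:
    b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)**2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

body = [180, 170, 170, 190, 190]
comb = [50, 40, 20, 60, 60]

b_yx = slope(body, comb)                # comb weight on body weight: 1.5
b_xy = slope(comb, body)                # body weight on comb weight: ~0.536
print(round((b_yx * b_xy) ** 0.5, 3))   # 0.896 = r, their geometric mean
```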

8.4 Properties of “r”

  1. The “r” values range from −1.00 through 0.00 to +1.00.

  2. It is a pure number, independent of the units of measurement of the variables X and Y.

  3. If r = −1, a perfect inverse linear relationship exists between the variables (e.g., volume ∝ \( \frac{1}{\mathrm{pressure}} \)).

  4. If r = 0, no linear relationship exists between the two variables X and Y (e.g., number of births registered vs number of cars registered).

  5. If r = +1, there is a perfect direct linear relationship (e.g., diameter vs circumference).

  6. If r = −0.7 or +0.7 in a large data set, the degree of relationship between the two variables is regarded as high.

  7. If r = +0.6, it does not mean that 60% of the values are related.

  8. The computation of r is valid only if the variables are approximately normally distributed.

  9. The r² is known as the coefficient of determination. If r² = 0.756, it means that approximately 75.6% of the variation in Y is due to the linear regression of Y on X; see the sketch following this list.
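As an illustration of point 9, the coefficient of determination for the chick data of Example 5 can be obtained either as r² or, equivalently, as 1 − SSE/SST from the fitted regression line; a brief sketch:

```python
body = [180, 170, 170, 190, 190]   # body weights (g), Example 5
comb = [50, 40, 20, 60, 60]        # comb weights (g)

n = len(body)
mx, my = sum(body) / n, sum(comb) / n

# Fit the regression line Y = a + bX of comb weight on body weight
b = (sum((x - mx) * (y - my) for x, y in zip(body, comb))
     / sum((x - mx) ** 2 for x in body))
a = my - b * mx

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body, comb))  # residual SS
sst = sum((y - my) ** 2 for y in comb)                         # total SS
print(round(1 - sse / sst, 3))   # 0.804 = r**2: about 80% of the variation
                                 # in comb weight is explained by the
                                 # regression on body weight
```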