
8.1 Causation and Correlation

Suppose we find a direct correlation between two variables, X and Y. This does not by itself mean that the change in variable “Y” is a direct cause of the change in variable “X.” Conversely, if the change in “Y” is directly associated with a change in the variable “X,” then X and Y will certainly be correlated. The existence of correlation may be due to any one of the following:

8.2 One Variable Being a Cause of Another

The cause variable is taken as the independent variable (X), and the effect variable is considered the dependent one (Y). Suppose “age” and “height” are correlated. Age is the independent variable, which is a cause of the change in height, the dependent variable.

8.2.1 Both Variables Being the Result of a Common Cause

A group of women was followed up after a given operation, and the duration of survival and the number of children borne by each woman were recorded. These factors were found to show a high degree of “positive correlation.” It would be tempting to interpret this data in either of the following two ways:

  1. Prolonged life of a woman tends to result in her bearing more children.

  2. Bearing of children tends to prolong the life of a woman.

Note

Both these interpretations are absurd: neither does prolonged life have any effect on the bearing of children, nor does the bearing of children increase the life span of a woman. One can therefore think of some other factors, such as age and state of health at the time of the operation, which could tend to affect both the survival time and the bearing of children.

8.2.2 Chance

The rainfall at some place in the north may show a high degree of correlation with the per-acre yield of rice in the south. It would be meaningless to think that the rainfall recorded in the north has any effect on the yield of rice in the south. Such correlations are called spurious or chance correlations. Hence, one must reasonably consider whether any likely relationship exists between the two variables under study.

So, one should be very careful in interpreting the relationship whenever correlation between two variables is found.

8.3 Methods of Studying Correlation

  1. The scatter diagram

  2. Pearson’s coefficient of correlation

  3. The regression line

8.3.1 The Scatter Diagram

Usually the scale on the Y-axis starts from zero, though the scale on the X-axis need not start from zero. In the case of a “scatter diagram,” however, this restriction on the Y-axis is also removed: both the X- and Y-axes may be started at the minimum values of the respective variables.
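As an illustration, such a scatter diagram can be drawn in Python with matplotlib. The sketch below uses hypothetical height and weight values (illustrative only, not the textbook data) and starts both axes near the minimum observed values, as just described:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations (illustrative only)
heights = [66, 68, 69, 70, 72]   # X variable
weights = [58, 66, 70, 75, 81]   # Y variable

fig, ax = plt.subplots()
ax.scatter(heights, weights)
ax.set_xlabel("Height (X)")
ax.set_ylabel("Weight (Y)")
# Start both axes near the minimum values of the respective variables
# rather than at zero, as the text permits for scatter diagrams
ax.set_xlim(min(heights) - 1, max(heights) + 1)
ax.set_ylim(min(weights) - 2, max(weights) + 2)
plt.show()
```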

8.3.2 Pearson Coefficient of Correlation for Ungrouped Data

Pearson’s coefficient of correlation is a measure of the degree of linear relationship between two variables. It is denoted by “r” in the case of the sample estimate and by “ρ” in the case of the correlation obtained from the whole population. It is also known as the product moment coefficient of correlation. The computational formulae for both are given in Table 8.1.

Table 8.1 Pearson’s coefficient of correlation formulae

The formulae may also be written in different forms for the sake of convenience in calculations. These are as below:

$$ \mathbf{r}=\frac{\sum_{i=1}^n\left({X}_i-\overline{X}\right)\cdot \left({Y}_i-\overline{Y}\right)}{\sqrt{\sum_{i=1}^n{\left({X}_i-\overline{X}\right)}^2\cdot \sum_{i=1}^n{\left({Y}_i-\overline{Y}\right)}^2}} $$
$$ \mathbf{r}=\frac{\sum_{i=1}^n{X}_i{Y}_i-n\overline{X}\,\overline{Y}}{\sqrt{\left[\sum_{i=1}^n{X}_i^2-n{\overline{X}}^2\right]\cdot \left[\sum_{i=1}^n{Y}_i^2-n{\overline{Y}}^2\right]}} $$
$$ \mathbf{r}=\frac{\sum_{i=1}^n{X}_i{Y}_i-\frac{\sum_{i=1}^n{X}_i\cdot \sum_{i=1}^n{Y}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{X}_i^2-\frac{{\left(\sum_{i=1}^n{X}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{Y}_i^2-\frac{{\left(\sum_{i=1}^n{Y}_i\right)}^2}{n}\right]}} $$
$$ \mathbf{r}=\frac{\sum_{i=1}^n{u}_i{v}_i-\frac{\sum_{i=1}^n{u}_i\cdot \sum_{i=1}^n{v}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{u}_i^2-\frac{{\left(\sum_{i=1}^n{u}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{v}_i^2-\frac{{\left(\sum_{i=1}^n{v}_i\right)}^2}{n}\right]}} $$

In the above formula, “u” and “v” are new variables used to simplify the computation: \( u = X-{X}_0 \) and \( v = Y-{Y}_0 \), where \( {X}_0 \) and \( {Y}_0 \) are the assumed means.
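As a minimal sketch of the assumed-mean (coded) formula above, the following Python function (its name and structure are illustrative, not from the text) computes r from u = X − X₀ and v = Y − Y₀; shifting by the assumed means leaves r unchanged:

```python
def pearson_r_coded(xs, ys, x0, y0):
    """Pearson's r via the assumed-mean formula, u = X - X0, v = Y - Y0."""
    n = len(xs)
    u = [x - x0 for x in xs]
    v = [y - y0 for y in ys]
    su, sv = sum(u), sum(v)
    suu = sum(ui * ui for ui in u)
    svv = sum(vi * vi for vi in v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    # Corrected sum of products and sums of squares, as in the formula
    num = suv - su * sv / n
    den = ((suu - su**2 / n) * (svv - sv**2 / n)) ** 0.5
    return num / den
```

For instance, with the chick data of Example 5 below and assumed means X₀ = 170 and Y₀ = 40, this returns r ≈ 0.896.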

Pearson’s coefficient of correlation (r) can also be computed by the “difference formula” as given below:

$$ \mathbf{r}=\frac{\sum_{i=1}^n{x}_i^2+\sum_{i=1}^n{y}_i^2-\sum_{i=1}^n{d}_i^2}{2\sqrt{\sum_{i=1}^n{x}_i^2\cdot \sum_{i=1}^n{y}_i^2}} $$

in which \( \sum_{i=1}^n{d}_i^2=\sum_{i=1}^n{\left({x}_i-{y}_i\right)}^2 \), \( x = X-\overline{X} \), and \( y = Y-\overline{Y} \).

When the raw observations are used directly instead of deviations from the means, the above equation can be modified as below:

$$ \mathbf{r}=\frac{n\left[\sum_{i=1}^n{X}_i^2+\sum_{i=1}^n{Y}_i^2-\sum_{i=1}^n{\left({X}_i-{Y}_i\right)}^2\right]-2\left(\sum_{i=1}^n{X}_i\right)\left(\sum_{i=1}^n{Y}_i\right)}{2\sqrt{\left[n\sum_{i=1}^n{X}_i^2-{\left(\sum_{i=1}^n{X}_i\right)}^2\right]\left[n\sum_{i=1}^n{Y}_i^2-{\left(\sum_{i=1}^n{Y}_i\right)}^2\right]}} $$
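The difference formula can also be checked numerically. The following minimal sketch (function name hypothetical) first converts the observations to deviations from their means and then applies r = (Σx² + Σy² − Σd²) / (2√(Σx²·Σy²)); since Σd² = Σx² + Σy² − 2Σxy, this reduces algebraically to the standard product-moment formula:

```python
def pearson_r_difference(xs, ys):
    """Pearson's r via the difference formula, with d = x - y and
    x, y taken as deviations from their respective means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    x = [xi - mx for xi in xs]
    y = [yi - my for yi in ys]
    sxx = sum(v * v for v in x)                    # sum of x**2
    syy = sum(v * v for v in y)                    # sum of y**2
    sdd = sum((a - b) ** 2 for a, b in zip(x, y))  # sum of d**2
    return (sxx + syy - sdd) / (2 * (sxx * syy) ** 0.5)
```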

8.3.2.1 Examples Illustrating the Computations of r-Test

Example 1

Computations of “r” when deviations are taken from their means. Data of height and weight of five students has been tabulated in Table 8.2.

Table 8.2 Data of height and weight for r-test
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\sum_{i=1}^n\left({X}_i-\overline{X}\right)\cdot \left({Y}_i-\overline{Y}\right)}{\sqrt{\sum_{i=1}^n{\left({X}_i-\overline{X}\right)}^2\cdot \sum_{i=1}^n{\left({Y}_i-\overline{Y}\right)}^2}}\\ {}=\frac{\sum_{i=1}^n{x}_i{y}_i}{\sqrt{\sum_{i=1}^n{x}_i^2\cdot \sum_{i=1}^n{y}_i^2}}=\frac{55}{\sqrt{20\times 750}}=\mathbf{0.449}\end{array}} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.
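Incidentally, the table value \( {r}_{0.05}=0.878 \) for df = 3 can be reproduced from the t-distribution through the identity \( r = t/\sqrt{t^2+\mathrm{df}} \). A short sketch, assuming SciPy is available:

```python
from scipy import stats

df = 3                                    # n - 2 = 5 - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)    # two-tailed 5% point, ~3.182
r_crit = t_crit / (t_crit**2 + df) ** 0.5
print(round(r_crit, 3))                   # 0.878, matching the table value
```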

Example 2

Computations of “r” when deviations are taken from the assumed means. Data of height and weight of five students has been tabulated in Table 8.3.

Table 8.3 Data of height and weight for r-test
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\sum_{i=1}^n{u}_i{v}_i-\frac{\sum_{i=1}^n{u}_i\cdot \sum_{i=1}^n{v}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{u}_i^2-\frac{{\left(\sum_{i=1}^n{u}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{v}_i^2-\frac{{\left(\sum_{i=1}^n{v}_i\right)}^2}{n}\right]}}\\ {}=\frac{55-\frac{\left(-5\right)\cdot (0)}{5}}{\sqrt{\left[25-\frac{{\left(-5\right)}^2}{5}\right]\cdot \left[750-\frac{(0)^2}{5}\right]}}=\frac{55}{\sqrt{20\times 750}}=\mathbf{0.449}\end{array}} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.

Example 3

Computations of “r” from observed data without taking deviations. Data of height and weight of five students has been tabulated in Table 8.4.

Table 8.4 Data of height and weight for r-test
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\sum_{i=1}^n{X}_i{Y}_i-\frac{\sum_{i=1}^n{X}_i\cdot \sum_{i=1}^n{Y}_i}{n}}{\sqrt{\left[\sum_{i=1}^n{X}_i^2-\frac{{\left(\sum_{i=1}^n{X}_i\right)}^2}{n}\right]\cdot \left[\sum_{i=1}^n{Y}_i^2-\frac{{\left(\sum_{i=1}^n{Y}_i\right)}^2}{n}\right]}}=\frac{24205-\frac{345\times 350}{5}}{\sqrt{\left[23825-\frac{(345)^2}{5}\right]\cdot \left[25250-\frac{(350)^2}{5}\right]}}\\ {}=\frac{24205-24150}{\sqrt{\left[23825-23805\right]\cdot \left[25250-24500\right]}}=\frac{55}{\sqrt{20\times 750}}=\frac{55}{122.47}=\mathbf{0.449}\end{array}} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.
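Since Example 3 supplies all the required totals, the computation can be verified directly from those sums, for instance in Python:

```python
# Totals from Example 3 (Table 8.4), n = 5 students
n = 5
sum_x, sum_y = 345, 350          # sum of heights, sum of weights
sum_xy = 24205                   # sum of cross-products X*Y
sum_x2, sum_y2 = 23825, 25250    # sums of squares

num = sum_xy - sum_x * sum_y / n                                  # 55
den = ((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n)) ** 0.5  # sqrt(20*750)
print(round(num / den, 3))                                        # 0.449
```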

Example 4

Computations of “r” by the “difference formula.” Data of height and weight of five students has been tabulated in Table 8.5.

Table 8.5 Data of height and weight for r-test
$$ \mathbf{r}=\frac{\sum_{i=1}^n{x}_i^2+\sum_{i=1}^n{y}_i^2-\sum_{i=1}^n{d}_i^2}{2\sqrt{\sum_{i=1}^n{x}_i^2\cdot \sum_{i=1}^n{y}_i^2}}=\frac{20+750-660}{2\sqrt{20\times 750}}=\frac{110}{2\sqrt{20\times 750}}=\frac{55}{\sqrt{20\times 750}}=\mathbf{0.449} $$

df = n − 2 = 5 − 2 = 3

\( {r}_{0.05}=0.878 \)

Decision

The computed value of r = 0.449 is less than the table value \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is accepted. Hence, there is no significant correlation between the height and weight of the students.

Example 5

The body weights of five chicks were 180, 170, 170, 190, and 190 g, and their comb weights were found to be 50, 40, 20, 60, and 60 g, respectively. Find out whether there is any correlation between the body weight and comb weight of the chicks.

Solution

Data has been transformed by subtracting 170 from the body weights and 40 from the comb weights as shown in Table 8.6.

Table 8.6 Body weights and comb weights of chicks
$$ {\displaystyle \begin{array}{c}\Sigma {u}^2=100+400+400=900\\ {}\Sigma {v}^2=100+400+400+400=1300\\ {}\Sigma uv=100+400+400=900\end{array}} $$
$$ {\displaystyle \begin{array}{c}\mathbf{r}=\frac{\Sigma uv-\frac{\Sigma u\cdot \Sigma v}{n}}{\sqrt{\left(\Sigma {u}^2-\frac{{\left(\Sigma u\right)}^2}{n}\right)\left(\Sigma {v}^2-\frac{{\left(\Sigma v\right)}^2}{n}\right)}}\\ {}=\frac{900-\frac{50\times 30}{5}}{\sqrt{\left(900-\frac{(50)^2}{5}\right)\left(1300-\frac{(30)^2}{5}\right)}}\\ {}=\frac{900-300}{\sqrt{\left(900-500\right)\left(1300-180\right)}}\\ {}=\frac{600}{\sqrt{400\times 1120}}=\frac{600}{20\sqrt{1120}}=\frac{30}{\sqrt{1120}}=\frac{30}{33.5}=+\mathbf{0.895}\end{array}} $$

df = n − 2 = 5 − 2 = 3; \( {r}_{0.05}=0.878 \)

Decision

The calculated r = +0.895 is greater than \( {r}_{0.05}=0.878 \). So, the null hypothesis (\( {H}_0 \)) is rejected (p < 0.05). Hence, there is a direct (positive) correlation between the body weights and comb weights of the chicks.
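Because Example 5 gives the full raw data, the result can also be checked against a library routine. A minimal sketch using scipy.stats.pearsonr (its r ≈ 0.896 differs from 0.895 only because the hand calculation rounded √1120 to 33.5):

```python
from scipy import stats

body = [180, 170, 170, 190, 190]   # body weights (g)
comb = [50, 40, 20, 60, 60]        # comb weights (g)

r, p = stats.pearsonr(body, comb)
print(round(r, 3), round(p, 4))    # r = 0.896, p < 0.05: reject H0
```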

8.3.3 Regression Line

To determine the amount of change that normally takes place in the Y variable for a unit change in the X variable, a line is fitted to the points plotted on the scatter diagram. This line is described by Y = a + bX and is said to be the line of regression of Y on X. Here, “a” and “b” are two constants: a is the Y-intercept, and b is the slope of the regression line. The “b” may also be written as \( {b}_{YX} \), the regression coefficient of Y on X. It is also possible to find \( {b}_{XY} \), the regression coefficient of X on Y. There is a definite relationship between “r” and the two regression coefficients \( {b}_{YX} \) and \( {b}_{XY} \): “r” is the geometric mean of \( {b}_{YX} \) and \( {b}_{XY} \).

Therefore:

$$ \mathbf{r}=\sqrt{{b}_{YX}\cdot {b}_{XY}};\kern0.5em \mathrm{but}\kern0.5em {b}_{YX}=\frac{\Sigma \left(X-\overline{X}\right)\left(Y-\overline{Y}\right)}{\Sigma {\left(X-\overline{X}\right)}^2} $$

whereas:

$$ \mathbf{r}=\frac{\sum_{i=1}^n\left({X}_i-\overline{X}\right)\cdot \left({Y}_i-\overline{Y}\right)}{\sqrt{\sum_{i=1}^n{\left({X}_i-\overline{X}\right)}^2\cdot \sum_{i=1}^n{\left({Y}_i-\overline{Y}\right)}^2}} $$

Hence it can be proved that \( {b}_{YX}=r\sqrt{\frac{\Sigma {\left(Y-\overline{Y}\right)}^2}{\Sigma {\left(X-\overline{X}\right)}^2}}=r\,\frac{s_Y}{s_X} \), where \( {s}_Y \) and \( {s}_X \) are the standard deviations of Y and X.
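The statement that r is the geometric mean of the two regression coefficients can be verified numerically, for instance with the chick data of Example 5; the helper function below is illustrative:

```python
def slope(xs, ys):
    """Regression coefficient of ys on xs:
    b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)**2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

body = [180, 170, 170, 190, 190]
comb = [50, 40, 20, 60, 60]

b_yx = slope(body, comb)                # comb weight on body weight: 1.5
b_xy = slope(comb, body)                # body weight on comb weight: ~0.536
print(round((b_yx * b_xy) ** 0.5, 3))   # 0.896 = r, their geometric mean
```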

8.4 Properties of “r”

  1. The “r” values range from −1.00 through 0.00 to +1.00.

  2. It is a pure number, independent of the units of measurement of the variables X and Y.

  3. If r = −1, a perfect inverse linear relationship exists between the variables (e.g., volume ∝ \( \frac{1}{\mathrm{pressure}} \)).

  4. If r = 0, no linear relationship exists between the two variables X and Y (e.g., number of births registered vs number of cars registered).

  5. If r = +1, there is a perfect direct linear relationship (e.g., diameter vs circumference).

  6. If r = −0.7 or +0.7 in a large data set, the degree of relationship between the two variables is regarded as high.

  7. If r = +0.6, it does not mean that 60% of the values are related.

  8. The computation of r is valid only if the variables are approximately normally distributed.

  9. The r² is known as the coefficient of determination. If r² = 0.756, it means that approximately 75.6% of the variation in Y is due to the linear regression of Y on X; see the sketch following this list.
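As an illustration of point 9, the coefficient of determination for the chick data of Example 5 can be obtained either as r² or, equivalently, as 1 − SSE/SST from the fitted regression line; a brief sketch:

```python
body = [180, 170, 170, 190, 190]   # body weights (g), Example 5
comb = [50, 40, 20, 60, 60]        # comb weights (g)

n = len(body)
mx, my = sum(body) / n, sum(comb) / n

# Fit the regression line Y = a + bX of comb weight on body weight
b = (sum((x - mx) * (y - my) for x, y in zip(body, comb))
     / sum((x - mx) ** 2 for x in body))
a = my - b * mx

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body, comb))  # residual SS
sst = sum((y - my) ** 2 for y in comb)                         # total SS
print(round(1 - sse / sst, 3))   # 0.804 = r**2: about 80% of the variation
                                 # in comb weight is explained by the
                                 # regression on body weight
```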