
1 Introduction

Current medical research depends on statistics from the early stages (choosing the study design, estimating the sample size) to the final analysis of data and publication of results. Statistical methods can be broadly divided into descriptive statistics and inferential statistics. Descriptive statistics refers to describing, organizing and summarizing data. Inferential statistics involves making inferences about a population by analysing the observations of a sample. Hypothesis testing is a form of inferential statistics. Statistical tests involving two groups are the most commonly used univariate analyses. A clear understanding of the principles and concepts underlying these statistical tests is essential for conducting them and interpreting their results. Table 30.1 summarizes the parametric and non-parametric tests enumerated in this chapter. It also provides a simplified way of choosing an appropriate statistical test for a given set of variables, study design and type of analysis [1]. Parametric tests are listed in the first row (R1), non-parametric tests in the second row (R2), and tests for the analysis of categorical variables in the third row (R3). Once the type of analysis and the exposure and outcome variables are known, the appropriate statistical test can be identified from the corresponding row and cell. Readers are advised to read the related Chaps. 28 and 29.

Table 30.1 How to choose a statistical test?

At the end of this chapter, readers should be able to

  1. List the various parametric and nonparametric statistical tests.

  2. Explain the principle of various statistical tests and their assumptions.

  3. Enumerate the methods to check the assumptions and interpret the results of the analysis.

2 Parametric Tests for Comparing Two Groups/Datasets

2.1 Etymology

A numerical value that describes the population is known as a parameter and that which describes the sample is called a statistic [2]. A mnemonic to remember this is provided in Box 30.1. For example, the mean of the population is a parameter whereas the mean of the sample is a statistic.

Box 30.1

Mnemonic

P for P and S for S

Parameter—Population; Statistic—Sample

Statistical tests which estimate the underlying population’s parameters and then use them to test the null hypothesis are known as parametric tests [3]. These tests rely on an assumed probability distribution for making inferences. Statistical tests which are not based on a probability distribution and do not test hypotheses concerning parameters are called nonparametric tests. These are also known as distribution-free tests.

2.2 Parametric Tests

There are various parametric tests, and they are categorized according to the number of groups/dataset and the design of the study. Figure 30.1 provides a flowchart for selecting various parametric tests.

Fig. 30.1

Parametric tests for the analysis of data, organized by the number of groups/datasets (one group with two or more datasets, two groups, more than two groups). *If the result turns significant, a suitable post-hoc test should be performed (see Sect. 30.6)

2.3 Student’s t Test

2.3.1 History

Student’s t test was developed by William S. Gosset, who was employed at a brewery (Arthur Guinness Son & Co.) in Dublin, Ireland. He was entrusted with quality control in the brewery and wanted to perform it with small numbers of samples. As the Guinness brewery prohibited its employees from publishing work done in the brewery, he published under the pseudonym ‘Student’. There is speculation that he carried out this work during afternoon tea breaks, and hence named the distribution the ‘t’ (tea) distribution [4].

2.3.2 Principle

The t test is carried out when the means of two groups/datasets are to be compared (see Sect. 30.2.3.3 below). The difference in means between the two groups/datasets is divided by the standard error of that difference. For large samples, if this ratio is 1.96 or more, the difference is unlikely to have occurred by chance (P < 0.05); for smaller samples, the critical value is taken from the t distribution with the appropriate degrees of freedom [5].
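
A minimal sketch of how such a comparison might be run in Python with scipy; the glucose values below are hypothetical and only illustrate the call, not data from this chapter.

```python
# Hypothetical sketch of an unpaired (Student's) t test.
from scipy import stats

group_a = [182, 175, 190, 168, 177, 185, 172, 180]   # e.g., postprandial glucose with drug A
group_b = [165, 158, 172, 150, 161, 169, 155, 163]   # e.g., postprandial glucose with drug B

# Student's t test assumes equal variances; equal_var=False gives Welch's test instead.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```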

2.3.3 Criteria for Performing t Test

Student (independent) t test can be done under the following circumstances.

Outcome variable: Numerical continuous (summarized as mean)

Analysis type: Comparison of mean

Number of groups: two groups/datasets

Study design: Unpaired (two independent groups)

Distribution of data: Normal distribution

2.3.4 Assumptions of t Test

All parametric tests have some assumptions. The t test has three:

  1. The samples are selected randomly from their respective populations.

  2. The variances (square of the standard deviation) of both groups are equal.

  3. The variable follows a normal distribution.

2.3.5 Methods for Checking the Normal Distribution

A frequency histogram of the observed data may be created for each variable and the normal distribution curve superimposed on it. By comparing the two visually, we can judge whether the data follow a normal distribution [6]. There are also statistical tests, such as the Kolmogorov-Smirnov and Shapiro-Wilk tests, for checking the normality of data. These tests have their own assumptions, so making one significance test conditional on another significance test is not recommended [5].
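
As a rough sketch, both checks described above can be done in Python with scipy and matplotlib; the sample values are hypothetical.

```python
# Hypothetical sketch: histogram with a superimposed normal curve, plus the Shapiro-Wilk test.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.array([4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.6, 5.9, 6.1, 6.4, 6.8, 7.2])

# Visual check: frequency histogram with the fitted normal curve superimposed
plt.hist(data, bins=6, density=True, alpha=0.6)
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, stats.norm.pdf(x, loc=data.mean(), scale=data.std(ddof=1)))
plt.xlabel("Observed value")
plt.ylabel("Density")
plt.show()

# Formal check: Shapiro-Wilk test (P > 0.05 gives no evidence against normality)
w_stat, p_value = stats.shapiro(data)
print(f"Shapiro-Wilk W = {w_stat:.3f}, P = {p_value:.3f}")
```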

Departure from symmetry of the distribution is called skewness. A skewed distribution can be detected from the summary statistics: for a variable that can take only positive values, if the mean is smaller than twice the standard deviation, the data are likely to be skewed.

2.3.6 The Way Out for Assumptions

Most often convenience sampling is followed (and not random sampling) in medical research. If the intervention is randomly allocated (randomization), then the difference in mean between the two groups behaves like the difference between two random samples [7]. Hence if randomization is followed, the assumption of random sampling will be satisfied.

The same argument made for checking normality holds good for checking equality of variance. As a rule of thumb, if the ratio of the two standard deviations (larger standard deviation ÷ smaller standard deviation) is greater than 2, the variances may be considered unequal. Most statistical software provides output for both equal and unequal variances; if the variances of the two groups are not equal, the output for unequal variances (Welch’s t test) may be used.

If the data is not extremely skewed and if the number of samples (sample size) is the same in both groups, then t test will be valid. If the data is extremely skewed and if we wish to perform t test, then transformation of the data should be attempted. Log transformation is preferred among all transformations as back transformation is possible and meaningful [8]. If the data does not follow normal distribution even after transformation, then a nonparametric test should be done.

2.4 Paired t Test

2.4.1 Conditions for Conducting a Paired t Test

The paired t test is used when there is one group and two datasets (before and after intervention).

Paired t test is carried out under the following circumstances.

Outcome variable: Numerical continuous (summarized as mean)

Analysis type: Comparison of mean

Number of groups: One group and two datasets (baseline and after intervention)

Study design: Paired (matched)

Distribution of data: Normal distribution

2.4.2 Principle

In this type of analysis, every participant serves as his/her own control. The observations are paired as both baseline and after intervention measurements are made on the same subject. Thus, interindividual variation is eliminated in this type of study. The difference between the baseline and after intervention measurement is estimated. Then the standard deviation of the difference is calculated (the standard deviation of the difference is not equal to the difference between the standard deviations of the baseline and after intervention data). For the mean difference the t statistic, significance level, and 95% confidence interval are calculated.
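
A minimal sketch of a paired t test on hypothetical baseline and post-intervention readings from the same subjects (not data from this chapter):

```python
# Hypothetical sketch of a paired t test.
from scipy import stats

baseline = [150, 162, 148, 170, 155, 168, 159, 164]
after    = [138, 151, 140, 158, 149, 152, 150, 153]

# ttest_rel works on the within-subject differences (baseline - after)
t_stat, p_value = stats.ttest_rel(baseline, after)
print(f"paired t = {t_stat:.2f}, P = {p_value:.4f}")
```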

2.4.3 Applications of Paired t Test

Apart from its application for analysing two datasets from one group of individuals (baseline and after intervention), paired t test may be used to analyse data from studies that have eliminated the interindividual variability. For example, in patients having a bilateral fungal infection of the hand, the intervention is administered on one side (say left side) and the standard therapy (comparator) on the other side. These two datasets are analysed by paired t test.

2.5 Example for t Test (Unpaired and Paired)

A pilot study was conducted in patients with type 2 diabetes mellitus to evaluate the efficacy of a new hypothetical drug, jipizide, as compared to glipizide. The drug was administered for one month. The postprandial blood glucose levels are provided in Table 30.2: the baseline postprandial glucose and the postprandial glucose after one month of therapy were measured in the study. For analysing these data, we need to create a new variable, viz. the reduction in postprandial glucose after one month of therapy. Student’s (unpaired) t test has to be performed to compare the reduction in postprandial blood glucose between the two groups. The postprandial glucose values obtained after one month of therapy should not be used as such for the analysis, as they depend on the baseline value, which differs from patient to patient. Within a group, the baseline postprandial glucose level and the level after one month of therapy may be compared by a paired t test to assess whether the drug is effective. If the efficacy has to be compared between the two groups, then Student’s (unpaired) t test has to be used.

Table 30.2 The antidiabetic effect of jipizide in patients with type 2 diabetes mellitus

2.6 Interpretation of Results

Before interpreting the results, we need to make sure that the assumptions are satisfied. The P value indicates the probability of the results occurring by chance alone. If the P value is less than 5% (0.05), it is considered that there is a significant difference between the two groups/interventions (although there is still up to a 5% probability that the difference arose purely by chance). If we conclude that there is no significant difference (P > 0.05), we should calculate the power and confirm that the study was adequately powered (>0.8 or 80%) to detect the difference.

A significant P value (statistical significance) does not mean clinical or biological significance. When data from groups with large sample sizes are tested, even a small difference will be picked up as statistically significant which need not be clinically significant. For example, in a clinical trial involving 10,000 participants, the new drug produced a mean decrease in systolic blood pressure by 4 mm Hg compared to the reference drug (P < 0.05). Now the reduction in systolic blood pressure is statistically significant but not clinically significant as the mean reduction in systolic blood pressure is just 4 mm Hg.

The P value does not provide any information regarding the effect size. The 95% confidence interval, however, indicates that there is a 95% chance that the interval includes the population parameter. For example, if the 95% confidence interval of the mean difference in systolic blood pressure is −12 to −5 mm Hg, it means that if the new drug is used in the real-world population (hypertensives), we can expect a reduction in systolic blood pressure of 5 to 12 mm Hg.

3 Nonparametric Tests for Comparing Two Groups/Datasets

3.1 Nonparametric Tests

Nonparametric tests are used when the population distribution does not follow a normal distribution. These tests are simple to perform and make no assumptions about the shape of the population distribution. This does not mean that we can do away with parametric tests and use nonparametric tests for all data: as parametric tests are more powerful, they are preferred over nonparametric tests [9]. The various nonparametric tests are shown in Fig. 30.2; they are the nonparametric counterparts of the parametric tests provided in Fig. 30.1.

Fig. 30.2

Nonparametric tests according to the number of groups/datasets (one group with two or more datasets, two groups, more than two groups). *If the result turns significant, a suitable post-hoc test should be performed (see Sect. 30.6)

3.2 Etymology

If the variable in the population does not follow a normal distribution, then we need to use nonparametric tests. These tests are not based on a probability distribution. For any meaningful inference about the population, however, we still need to compare parameters; hence ‘nonparametric test’ is something of a misnomer.

3.3 Wilcoxon Test

3.3.1 Types

Based on the number of groups, the Wilcoxon test can be of two types, viz. the Wilcoxon rank sum test and the Wilcoxon signed rank test. A simple mnemonic for this is provided in Box 30.2. Mann and Whitney also described the rank sum test independently. By convention, the Wilcoxon test is now ascribed to paired data and the Mann-Whitney U test to unpaired data.

Box 30.2

Mnemonic

U for U

Unpaired (two independent groups)—Wilcoxon rank sum test

or

Mann Whitney U test

3.3.2 Principle

The Mann-Whitney U test checks whether the medians (as opposed to the means in the t test) of two independent groups are different. The first step is to arrange all the observations (both groups put together) in ascending or descending order and then rank them, ignoring their group. When observations/scores are tied, the average of the ranks is assigned to all the tied observations. After all the observations are ranked, they are separated back into their groups and the sum of the ranks of each group is calculated (hence the name rank sum test). These ranks are then analysed as though they were the original observations to find the level of significance (P value) [10].

3.3.3 Example for Assigning Ranks

The adverse effects of a new inhaled beta-2 adrenergic receptor agonist for bronchial asthma are compared with those of a reference drug, salbutamol, in a pilot study with a sample size of 20 (10 in each group). Tremor is scored from 0 to 5 (0 = no tremor; 5 = intense tremor). The observed data are provided in Table 30.3.

Table 30.3 The tremor score of the new drug as compared to salbutamol in patients with bronchial asthma

Arrange all observations in ascending order

0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4

Assign ranks

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20

Assign to each tied observation the average of the ranks it would otherwise occupy.

1.5, 1.5, 5, 5, 5, 5, 5, 11, 11, 11, 11, 11, 11, 11, 16.5, 16.5, 16.5, 16.5, 19.5, 19.5

The rank assigned for each observation is mentioned in Table 30.4.

Table 30.4 Rank for the data given in the example in Sect. 30.3.3.3

T1—the sum of ranks of group 1

T2—the sum of ranks of group 2

T—Smaller of T1 and T2

Mean sum of ranks (m) = {n1(n1 + n2 + 1)}/2

Z = (|m − T| − 0.5)/SD, where SD = √[n1n2(n1 + n2 + 1)/12]

If Z < 1.96, the null hypothesis is accepted (P > 0.05)

If Z > 1.96, the null hypothesis is rejected at a 5% significance level.

If Z > 2.58, P < 0.01 and if Z > 3.29, P < 0.001
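
A minimal sketch of these steps in Python. The pooled tremor scores are taken from the text above, but the split into the two groups of ten is hypothetical, since Table 30.3 is not reproduced here.

```python
# Rank the pooled observations (ties get the average rank) and run the Mann-Whitney U test.
import numpy as np
from scipy import stats

new_drug   = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 3])   # hypothetical allocation of the 20 scores
salbutamol = np.array([1, 2, 2, 2, 2, 3, 3, 3, 4, 4])   # hypothetical allocation of the 20 scores

pooled = np.concatenate([new_drug, salbutamol])
ranks = stats.rankdata(pooled)           # gives the tied ranks 1.5, 5, 11, 16.5, 19.5 as above
t1 = ranks[:len(new_drug)].sum()         # T1, sum of ranks of group 1
t2 = ranks[len(new_drug):].sum()         # T2, sum of ranks of group 2
print(f"T1 = {t1}, T2 = {t2}")

# The same comparison via the Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(new_drug, salbutamol, alternative="two-sided")
print(f"U = {u_stat}, P = {p_value:.3f}")
```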

3.3.4 Wilcoxon Signed Rank Test

The Wilcoxon signed rank test is done to assess the difference in median between two datasets from a group (baseline scores and scores after intervention). The individual difference in score is calculated for each pair of observations. These are arranged in ascending order (ignoring their sign) and ranked accordingly. The ranking is done as explained previously in Sect. 30.3.3.2. The zero difference values are ignored [11]. Then the ranks of the positive signed difference values and the negative signed ones are separated into two groups (hence the name signed rank test). These are then analysed to find out the significance level.
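
A minimal sketch of the Wilcoxon signed rank test on hypothetical paired scores; scipy ranks the absolute differences, discards zero differences and sums the signed ranks internally.

```python
# Hypothetical sketch of the Wilcoxon signed rank test on paired data.
from scipy import stats

baseline = [5, 6, 5, 7, 4, 6, 5, 6, 7, 5]
after    = [3, 4, 5, 4, 3, 5, 2, 4, 6, 4]

w_stat, p_value = stats.wilcoxon(baseline, after)   # zero differences are dropped by default
print(f"W = {w_stat}, P = {p_value:.3f}")
```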

3.3.5 Example for Wilcoxon Signed Rank Test

A pilot study was conducted to assess the reduction in salivary secretion by a new anticholinergic drug. Salivation was scored from 1 to 10 (1 = completely dry; 5 = normal secretion; 10 = profuse secretion). The data are provided in Table 30.5; ranks are assigned as described in Sect. 30.3.3.3.

Table 30.5 Effect of a new anticholinergic on salivary secretion

T+—sum of ranks of positive values = 40

T−—sum of ranks of negative values = 15

T—Smaller of T+ and T−

T = 15

Sum of positive and negative ranks = n (n + 1)/2

Mean sum of ranks (m) = n(n + 1)/4

Z = (|T − m| − 0.5)/SD, where SD = √[n(n + 1)(2n + 1)/24]

If Z < 1.96, the null hypothesis is accepted (P > 0.05)

If Z > 1.96, the null hypothesis is rejected (P < 0.05)

4 Parametric Tests for Comparing Three or More Groups/Datasets

4.1 Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical tool to be used when more than two independent group means are to be compared. In statistical terms, the exposure variable is categorical with more than two levels, and the outcome variable is a numerical continuous one. It is a parametric test like the t test and was developed by Fisher [12].

4.2 Principle

The principle behind ANOVA is to test the differences among the group means by estimating the variance across the groups and the variance within the groups, hence the name analysis of variance. The ratio of these variances is the F ratio. When the variability across the groups is greater than the variability within the groups, we conclude that at least one group mean is significantly different from the other group means. Thus, ANOVA is an omnibus test: it tells only whether at least one group mean differs from the rest. If ANOVA returns a significant result, a further post-hoc test should be performed to identify which specific pair(s) of means are significantly different (see Sect. 30.6).

4.3 Assumptions of ANOVA

  1. Each data point in the group is independent of the others and is randomly selected from the population. This should be planned at the design stage itself.

  2. The data in each group follows a normal distribution.

  3. There is homogeneity of variance among groups, i.e., the variance is similar in all the groups. It is commonly assessed with Bartlett’s test or Levene’s test in statistical packages (a minimal sketch of these checks follows this list).

    These assumptions must be checked before performing the actual analysis.
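
A minimal sketch of these assumption checks in Python on hypothetical data: Shapiro-Wilk within each group for normality, and Levene’s and Bartlett’s tests for homogeneity of variance.

```python
# Hypothetical sketch of ANOVA assumption checks.
from scipy import stats

groups = {
    "vehicle": [2.1, 3.0, 2.5, 2.8, 3.2, 2.6],
    "drug_A":  [4.0, 4.5, 3.8, 4.2, 4.9, 4.1],
    "drug_B":  [26.0, 27.5, 25.8, 26.9, 28.1, 26.4],
}

# Normality within each group
for name, values in groups.items():
    w, p = stats.shapiro(values)
    print(f"{name}: Shapiro-Wilk P = {p:.3f}")

# Homogeneity of variance across the groups
lev_stat, lev_p = stats.levene(*groups.values())
bart_stat, bart_p = stats.bartlett(*groups.values())
print(f"Levene P = {lev_p:.3f}, Bartlett P = {bart_p:.3f}")
```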

4.4 Types of ANOVA

Depending on the number of factors in exposure variables, the ANOVA model can be a one-way, two-way, or multifactorial ANOVA (three-way ANOVA). If there is a relatedness of the datasets, then repeated measures ANOVA is used. In this section, one-way, two-way, and one-way repeated measures of ANOVA are discussed. Interested readers may refer to Doncaster and Davey [13] for learning other complex models [nested, split plot, mixed model ANOVA and analysis of covariance (ANCOVA)].

4.5 One-Way ANOVA

One-way ANOVA is performed when a continuous outcome variable needs to be compared across one exposure variable with more than two levels. It is called one-way because the exposure groups are classified by just one variable; this variable is also referred to as a factor, hence one-factor ANOVA. The experimental design corresponding to one-way ANOVA is termed a completely randomized design.

4.5.1 Example for One-Way ANOVA

A researcher wishes to evaluate the effect of three different treatments (drugs A, B, and C) on the reduction of blood cholesterol level (mg/dL) at the end of one month in a cafeteria-diet-induced obesity rat model, against a positive control (statin) and a vehicle control. Table 30.6 provides the data set.

Table 30.6 Effect of treatment on change in blood cholesterol in experimental rats
  • Outcome variable: Blood cholesterol (continuous)

  • Number of exposures with their levels: One (Drug treatment, five levels)

Table 30.7 provides the output of the ANOVA. One-way ANOVA divides the total variation, represented by the total sum of squares, into two distinct components: the sum of squares (SS) due to differences between the group means and the sum of squares due to differences between the observations within each group (known as the residual sum of squares or residual/error SS). The degrees of freedom for the between-group SS are k − 1, where k is the number of groups, and the residual SS has n − k degrees of freedom, where n is the total number of observations. The F ratio is estimated by dividing the between-group mean square by the within-group mean square under the above pair of degrees of freedom. The F test statistic is compared with the F distribution table to locate the critical threshold. If the test statistic exceeds the critical value, we reject the null hypothesis (P < 0.05).

Table 30.7 Results of one-way Analysis of variance (ANOVA)
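
A minimal sketch of the computation just described, on hypothetical cholesterol-change values (not those of Table 30.6): the total variation is split into between-group and residual sums of squares, the F ratio is formed, and the result is cross-checked against scipy’s one-way ANOVA.

```python
# Hypothetical sketch of the one-way ANOVA sum-of-squares decomposition.
import numpy as np
from scipy import stats

groups = [
    np.array([2.0, 3.1, 2.4, 2.9, 3.3, 2.5]),        # e.g., vehicle
    np.array([4.1, 3.6, 4.4, 3.9, 4.7, 4.2]),        # e.g., drug A
    np.array([26.5, 27.1, 25.9, 26.8, 27.7, 26.2]),  # e.g., drug B
]

k = len(groups)                              # number of groups
n = sum(len(g) for g in groups)              # total number of observations
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between, df_within = k - 1, n - k
f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between},{df_within}) = {f_ratio:.2f}")

# Cross-check against scipy's one-way ANOVA
f_check, p_value = stats.f_oneway(*groups)
print(f"scipy: F = {f_check:.2f}, P = {p_value:.4g}")
```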

4.5.2 Interpretation of Results

In the example scenario, since the P value is less than 0.05, we can infer that there are statistically significant differences in the mean decrease in cholesterol among the five groups as determined by one-way ANOVA (F(4, 25) = 238.35, P < 0.001). This only indicates that at least one group mean is significantly different from the others. To identify which specific pair or pairs of means differ, post-hoc analysis needs to be done (refer to Sect. 30.6).

After performing a post-hoc test, Tukey’s honestly significant difference (HSD) test (refer to Sect. 30.6), further comparisons can be made. When compared to vehicle, the greatest mean reduction in cholesterol was observed in the statin group [39.5 (CI: 34.0, 44.5); P < 0.001], followed by drug C [38.5 (CI: 33.4, 43.5); P < 0.001] and drug B [26.7 (CI: 21.6, 31.7); P < 0.001]. Drug A did not show a significant mean reduction in cholesterol as compared to vehicle. The mean reduction in cholesterol observed with drug B was lower than that with both drug C and statin.

4.6 Two-Way ANOVA

Two-way ANOVA is done to compare continuous variables across two exposure variables. In the above example of one-way ANOVA (Sect. 30.4.5.1), if the researcher wants to evaluate the gender effect (male/female) in addition to the treatment effect (five treatment groups), the data sets (5) would be stratified into (5 × 2 = 10) data sets. The total variation would be partitioned for both the exposure factors and their interaction [12].
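
A minimal sketch of a two-way ANOVA with interaction using statsmodels; the catecholamine values and labels below are hypothetical stand-ins, not the data of Table 30.8.

```python
# Hypothetical sketch of a two-way ANOVA (treatment x antagonist) with interaction.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "treatment":  (["control"] * 4 + ["drug_A"] * 4 + ["drug_B"] * 4) * 2,
    "antagonist": ["absent"] * 12 + ["present"] * 12,
    "catecholamine": [9.8, 10.1, 9.5, 10.0,  6.2, 5.9, 6.5, 6.0,  4.1, 3.8, 4.3, 4.0,
                      9.9, 10.2, 9.7, 10.1,  8.5, 8.2, 8.8, 8.4,  7.6, 7.3, 7.8, 7.5],
})

# Main effects of treatment and antagonist plus their interaction
model = ols("catecholamine ~ C(treatment) * C(antagonist)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```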

4.6.1 Example of Two-Way ANOVA

In an experiment, the effect of an antagonist on agonist-induced depletion of catecholamines (measured in nanomolar units) in the rabbit heart was studied. There were three levels of treatment, namely control (placebo), drug A and drug B, and each of these treatments was tested in the presence and absence of an antagonist. Thus, there were six (3 × 2) factor-level combinations. Rabbits (4 per treatment) were randomly assigned to each of the six factor-level combinations. We are interested in any possible interaction between treatment and antagonist in addition to the individual effects of each factor. The data are given in Table 30.8.

Table 30.8 Effect of drug treatment by antagonist on depletion of catecholamines in rabbit heart
  • Outcome variable: catecholamine levels in nanomolar units (continuous variable)

  • Number of exposure variables with their levels (Factors): Two

    Factor 1: Treatment- three levels

    Factor 2: Presence of Antagonist—two levels

4.6.2 Interpretation of Results

As shown in Table 30.9, the F ratio is determined for treatment, antagonist and the treatment-by-antagonist interaction. A statistically significant effect was noted for both treatment and antagonist status, i.e., at least one of the two drugs had a significant effect on depleting catecholamines as compared to the untreated control, and the antagonist had a significant effect on the agonist-induced depletion of catecholamines relative to the placebo (control). Since the treatment-by-antagonist interaction is also statistically significant, it can be inferred that the drugs responded differently in the presence of the antagonist.

Table 30.9 Results of two-way analysis of variance

4.7 Repeated Measures ANOVA (RM ANOVA)

Repeated measures ANOVA is considered an extension of the paired t test when there is one group and three or more related datasets. It is applied to situations where repeated measurements of the same variable are taken at different time points or under different conditions. The assumption specific to RM ANOVA is that the variances of the differences between all combinations of related groups must be equal (sphericity); this is assessed by Mauchly’s test of sphericity. If this assumption is violated, corrections such as Greenhouse-Geisser or Huynh-Feldt need to be applied to avoid inflating the type I error.
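
A minimal sketch of a one-way repeated measures ANOVA with statsmodels, using ten hypothetical subjects measured at four time points (loosely mirroring the example in Sect. 30.4.7.1, not the data of Table 30.10):

```python
# Hypothetical sketch of a one-way repeated measures ANOVA (within-subject factor: time).
import pandas as pd
from statsmodels.stats.anova import AnovaRM

readings = {
    "baseline": [160, 158, 165, 170, 155, 162, 168, 159, 164, 161],
    "1_week":   [152, 150, 158, 161, 148, 154, 160, 151, 157, 153],
    "1_month":  [145, 143, 150, 154, 141, 147, 152, 144, 149, 146],
    "3_months": [138, 136, 144, 147, 135, 140, 146, 137, 142, 139],
}

# Long format: one row per subject per time point (AnovaRM needs balanced data)
records = [{"subject": s, "time": t, "sbp": values[s]}
           for t, values in readings.items() for s in range(10)]
df = pd.DataFrame(records)

result = AnovaRM(df, depvar="sbp", subject="subject", within=["time"]).fit()
print(result)
```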

4.7.1 Example of One-Way Repeated Measures ANOVA

A physician started prescribing a new antihypertensive available on the market. He wants to assess the efficacy of the drug in a group of 10 newly diagnosed hypertensive individuals. He measured the blood pressure at baseline before prescribing the drug and collected the mean systolic blood pressure readings of the same individuals at 1 week, 1 month and 3 months after prescribing the drug. The dataset is given in Table 30.10.

Table 30.10 Effect of the new drug on blood pressure recordings at three time-points

4.7.2 Interpretation of Results

The results are laid out in a manner similar to a two-way ANOVA table, with the variation partitioned between participants and across time points (Table 30.11). A one-way repeated measures ANOVA was run on a sample of 10 participants to determine whether there was a reduction in BP over three months of therapy with the new drug. The results showed that the new drug elicited statistically significant differences in mean BP over its time course, F(3, 27) = 139.48, P < 0.005.

Table 30.11 Results of repeated measures analysis of variance

5 Non-parametric Tests for Three or More Groups/Datasets

As shown in Fig. 30.2, the Kruskal-Wallis and Friedman tests are the two non-parametric tests used to analyse three or more groups/datasets.

5.1 Kruskal-Wallis Test

The Kruskal-Wallis test is the non-parametric equivalent of one-way ANOVA for comparison of three or more groups/datasets. It is used when continuous variables are not normally distributed, when datasets have small, unbalanced sample sizes without homogeneity of variances, or when the variables are ordinal/discrete. The calculation of the test statistic requires rank ordering of the data, as in the Mann-Whitney U test. Like ANOVA, it performs only a global assessment, and a suitable post-hoc test needs to be performed to identify which group is significantly different from the others.
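
A minimal sketch of the Kruskal-Wallis test in Python on hypothetical VAS scores (not the values of Table 30.12):

```python
# Hypothetical sketch of the Kruskal-Wallis test for three independent groups.
from scipy import stats

formulation_1 = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
formulation_2 = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]
standard_drug = [4, 5, 4, 6, 5, 4, 5, 6, 4, 5]

h_stat, p_value = stats.kruskal(formulation_1, formulation_2, standard_drug)
print(f"Kruskal-Wallis chi-squared = {h_stat:.3f}, P = {p_value:.4f}")
```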

5.1.1 Example of Kruskal-Wallis ANOVA

A physician wants to compare the visual analogue scale (VAS) score for two new formulations of an analgesic against the standard drug in the treatment of osteoarthritis (n = 10). The VAS ranges from 1 to 10, where one indicates less pain and ten indicates severe pain (Table 30.12).

Table 30.12 Effect of three formulations on change in VAS (Visual analog scale) score

5.1.2 Interpretation of Results

Table 30.13 depicts the results of the Kruskal-Wallis test, showing the rank sum for each group and the chi-squared statistic with its P value. The results can be reported as follows: the Kruskal-Wallis test showed that there was a statistically significant difference in the VAS score among the three groups, χ2(2) = 8.473, P = 0.0145. Subsequently, Dunn’s test should be carried out to identify the individual median differences.

Table 30.13 Results of Kruskal-Wallis equality-of-populations rank test

5.2 Friedman Test

The Friedman test is the non-parametric counterpart of repeated measures ANOVA for outcome variables that are ordinal or non-normally distributed continuous data. It was developed by the economist Milton Friedman (later a Nobel laureate) in 1937 [4]. An example scenario for using the Friedman test would be to analyse the effect of two interventions on a numerical pain rating scale measured at three different time points. Paired Wilcoxon (signed rank) tests with Bonferroni correction should be performed as post-hoc analysis if the result of the Friedman test is significant.
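
A minimal sketch of the Friedman test on hypothetical pain scores measured in the same subjects at three time points:

```python
# Hypothetical sketch of the Friedman test for three related datasets.
from scipy import stats

baseline = [8, 7, 9, 8, 7, 8, 9, 7, 8, 8]
week_2   = [6, 5, 7, 6, 6, 5, 7, 5, 6, 6]
week_4   = [4, 3, 5, 4, 4, 3, 5, 3, 4, 4]

chi2_stat, p_value = stats.friedmanchisquare(baseline, week_2, week_4)
print(f"Friedman chi-squared = {chi2_stat:.3f}, P = {p_value:.4g}")
```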

6 Post-hoc Tests

ANOVA performs a global assessment; it does not convey which pair of group means differ from each other. To compare two groups, we know that a t test needs to be performed. The fallacy of doing multiple t tests is explained by an example. We wish to compare the mean reduction in HbA1C at the end of three months of therapy with three different anti-diabetic drugs A, B and C. For each pairwise comparison we need a t test, so three t tests need to be done. When multiple analyses are done on the same experimental dataset, the likelihood of finding a false positive (type 1 error) increases. This is known as the familywise error rate. For the above example, the inflated alpha error can be calculated from the simple formula 1 − (1 − α)^N, where N is the number of comparisons; for three comparisons this is 1 − (0.95)^3 ≈ 14.3%, which is unacceptable as it exceeds the conventional error rate of 5%.

To account for such multiple comparisons, post-hoc tests should be employed to do the pairwise comparisons. There are numerous post-hoc tests available; some commonly used tests and their characteristics are given in Table 30.14. These tests basically differ in how they safeguard against α errors [14, 15]. For example, the Bonferroni correction adjusts the α value by dividing it by the number of comparisons to be made. In the above example of three comparisons, the adjusted significance threshold becomes approximately 0.017 (0.05/3) instead of 0.05. Tests that are stringent in adjusting the α error are termed conservative and those that are not are labelled liberal. The Scheffé test is the most conservative of all and is used in complex comparisons involving a large number of groups. Tukey’s HSD is the most commonly used post-hoc test in biomedical research settings. Games-Howell is preferred for samples with unequal variances and Dunn’s test is used after non-parametric tests. Each has its own pros and cons, and no single test is universally applicable in all settings; a researcher has to choose a suitable test based on the research objective.

Table 30.14 Characteristics of some common post-hoc tests
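
A minimal sketch of two of the approaches described above, on hypothetical HbA1C reductions: Tukey’s HSD for all pairwise comparisons, and a Bonferroni adjustment of the P values from the three pairwise t tests.

```python
# Hypothetical sketch of post-hoc pairwise comparisons after a significant omnibus test.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

drug_a = np.array([1.1, 0.9, 1.3, 1.0, 1.2, 1.1])
drug_b = np.array([1.6, 1.8, 1.5, 1.7, 1.9, 1.6])
drug_c = np.array([2.1, 2.3, 2.0, 2.2, 2.4, 2.1])

# Tukey's honestly significant difference on all three pairs
values = np.concatenate([drug_a, drug_b, drug_c])
labels = ["A"] * 6 + ["B"] * 6 + ["C"] * 6
print(pairwise_tukeyhsd(values, labels, alpha=0.05))

# Bonferroni: adjust the raw P values from the three pairwise t tests
raw_p = [stats.ttest_ind(x, y).pvalue
         for x, y in [(drug_a, drug_b), (drug_a, drug_c), (drug_b, drug_c)]]
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
print("adjusted P:", adjusted_p, "reject:", reject)
```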

7 Chi-square Test

Chi-square test (χ2) (pronounced as ki as in kite [10]) is a statistical tool to assess the relationship between two categorical variables. The categorical variables can be either in nominal or ordinal scale. It is a versatile test that can assess both the association of variables as well as test the difference in proportions of two variables [12]. The test in principle compares the observed numbers in each of the four categories in the 2 × 2 contingency table with the numbers to be expected if there were no difference between the two groups.

7.1 Criteria to Do Chi-Square Test

Outcome variable: Categorical (nominal/ordinal) summarized as a proportion

Analysis type: Comparison of proportion/association

Number of groups: two/more than two groups

Study design: unpaired

Distribution of data: dichotomous distribution

7.2 Contingency Table

Constructing a contingency table is the first step in finding the relationship between two categorical variables. Both exposure and outcome variables with two levels are cross-tabulated to form a frequency distribution matrix, conventionally known as a 2 × 2 contingency table [12]. In this table, the two levels of the variables are arranged as rows and columns; by convention, the exposure variable is depicted in the rows and the outcome variable in the columns. Individuals are assigned to one of the cells of the contingency table according to their values for the two variables. The data in the cells should be entered as frequencies/counts, not as derived values such as percentages. Table 30.15 shows the 2 × 2 contingency table for testing the association between smoking and lung cancer. If there are more than two levels in a variable, the table is written as an r × c table, where r denotes the number of rows and c the number of columns. Even numerical discrete or continuous variables can be grouped into categories and presented as a larger contingency table.

Table 30.15 Contingency table (2 × 2) showing an association between smoking and Lung cancer

7.3 Assumptions for Chi-Square Test

Assumptions to be satisfied for performing the Chi-square test are listed below [5]:

  1. The data should be random observations from the sample.

  2. The expected frequency (not the observed) should be 5 or more in at least 80% of the cells, and no cell should have an expected frequency of less than 1.

  3. The sample size should be at least the number of cells multiplied by 5; for a 2 × 2 table, the minimum sample size is 20.

7.4 Example for Chi-Square Test

A new antibiotic is compared with a reference antibiotic in curing patients with urinary tract infections. A total of 380 patients were randomly assigned to receive either the reference or the new antibiotic (200 received the new antibiotic and 180 the reference antibiotic); 160 achieved microbiological cure in the new antibiotic group and 120 in the reference antibiotic group. Microbiological cure was recorded as the absence of organisms in urine culture after seven days of treatment.

The null hypothesis for this scenario is that there is no difference in the proportions cured between the new and reference antibiotic groups; the alternative hypothesis is that there is a difference. Assuming the null hypothesis, the expected frequency of each cell can be calculated by multiplying the marginal row total and the marginal column total and dividing by the grand total. Thus, in the given example, for the first cell (a) the expected frequency is (200 × 280)/380 = 147.4, and the values for the other cells are estimated similarly (refer to Table 30.16). The difference between the observed (O) and expected (E) frequencies is then derived (O − E). The Chi-square (χ2) test statistic is obtained from the formula (Box 30.3):

Table 30.16 Contingency table (2 × 2) depicting two antibiotic groups and their cure rates

Box 30.3 Chi-square Statistic Formula

$$ {\chi}^2=\sum \frac{{\left(O-E\right)}^2}{E} $$

where O is the observed number, E is the expected number, and χ2 is the test statistic.

Box 30.4 Worked Out Example

$$ {\chi}^2=\frac{(12.6)^2}{147.4}+\frac{{\left(-12.6\right)}^2}{132.6}+\frac{{\left(-12.6\right)}^2}{52.6}+\frac{(12.6)^2}{47.4}=8.641 $$

The degrees of freedom for the Chi-square statistic are (rows − 1) × (columns − 1) = (2 − 1)(2 − 1) = 1. From the Chi-square distribution table, for 1 degree of freedom at a significance level of 0.05, the critical χ2 value is 3.84. Since the test statistic (8.641) is higher than the Chi-square critical value (3.84), we reject the null hypothesis. We can conclude that the new antibiotic has a significantly higher cure rate than the reference drug (P < 0.05).
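
The worked example can be verified with scipy; note that the hand calculation above rounds (O − E) to 12.6, so the software value (about 8.69 without continuity correction) differs slightly from 8.641.

```python
# Chi-square test on the 2x2 table from the antibiotic example in the text.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[160, 40],    # new antibiotic: cured, not cured
                  [120, 60]])   # reference antibiotic: cured, not cured

# correction=False reproduces the uncorrected statistic computed by hand;
# the default (correction=True) applies Yates' continuity correction.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-squared = {chi2:.3f}, df = {dof}, P = {p_value:.4f}")
print("expected frequencies:\n", expected.round(1))
```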

7.5 Strength of Association and Its Interpretation

The Chi-square test only informs whether the proportion of one group is significantly different from the other group; it does not tell the magnitude of the relationship between two variables. The strength of association should be determined by absolute measure (risk difference) or relative measures like Relative risk (RR) or Odds ratio (OR).

7.5.1 Risk Difference

Risk difference measures the difference in the outcome variable between two groups. In the given example, the cure rate with the new antibiotic is 80% (160/200) and with the reference drug 66.7% (120/180); the risk difference is therefore 13.3%. The interpretation is straightforward, and the precision of the estimate can be given by calculating the confidence interval for this estimate.

7.5.2 Relative Risk (RR)

Relative risk is obtained by computing the ratio of the risk of occurrence of an event in the test (exposed) group to the risk of occurrence of the event in the control (unexposed) group. An RR of one indicates that there is no difference in the occurrence of events between the exposed and unexposed groups. An RR greater than one implies more events occur in the exposed group. Conversely, an RR less than one indicates that the occurrence of events is lower in the exposed group than in the non-exposed group. The risk for each group is calculated as the number of events in that group divided by the total population at risk in that group. Thus, it can be calculated from the 2 × 2 contingency table by the formula [a/(a + b)] ÷ [c/(c + d)] (see Table 30.16). For the example scenario, the relative risk is estimated to be (160/200) ÷ (120/180) = 1.2, meaning there is a 20% higher probability of cure in the new antibiotic group as compared to the reference drug group. Relative risk is the preferred measure in prospective study designs like cohort studies and randomized controlled trials, as the population at risk is defined in these settings.

7.5.3 Odds Ratio

The odds ratio is the ratio of the odds of occurrence of events in the exposed group to the odds of occurrence of events in the unexposed group. The odds for each group are calculated as the number of individuals who had the event divided by the number of individuals who did not. It is computed from the 2 × 2 contingency table by the formula ad/bc. An odds ratio of one indicates that there is no difference in the odds of the event between the exposed and unexposed groups. An odds ratio greater than one denotes higher odds of the event in the exposed group, and the reverse is true for an odds ratio less than one. For the example scenario, the odds ratio is 2 [(160/40) ÷ (120/60)], meaning there are two-fold higher odds of cure in the new antibiotic group compared to the reference antibiotic group. For retrospective studies, the odds ratio should be reported. It can also be used in prospective study designs, especially when adjustment for confounding factors is required [4].
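
A short sketch computing the three measures from the 2 × 2 table of the antibiotic example (a = 160, b = 40, c = 120, d = 60):

```python
# Risk difference, relative risk and odds ratio from the worked example.
a, b = 160, 40    # new antibiotic: cured, not cured
c, d = 120, 60    # reference antibiotic: cured, not cured

risk_new = a / (a + b)                    # 0.800
risk_ref = c / (c + d)                    # 0.667
risk_difference = risk_new - risk_ref     # about 0.133
relative_risk = risk_new / risk_ref       # [a/(a+b)] / [c/(c+d)] = 1.2
odds_ratio = (a * d) / (b * c)            # ad/bc = 2.0

print(f"risk difference = {risk_difference:.3f}")
print(f"relative risk   = {relative_risk:.2f}")
print(f"odds ratio      = {odds_ratio:.2f}")
```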

7.6 Closely Related Tests to the Chi-Square Test

  • If the sample size of the study is too small (n less than 20) or if any of the cells has an expected frequency of less than 5, the assumptions of the Chi-square test are violated and the Chi-square approximation becomes unreliable. In these situations, the Chi-square test with Yates’ continuity correction or Fisher’s exact test should be performed. Yates’ continuity correction is applied only to 2 × 2 contingency tables.

  • If the observations are paired data/related samples, McNemar’s Chi-square test can be employed if there are only two datasets and Cochran’s Q in case of more than two sets of observations. It is used in the setting of paired design as in paired t test and matched case-control studies [12]. However, for matched case-control studies, conditional logistic regression would be more appropriate.

  • When the levels of groups are more than two, the significance derived from Chi-square test provides only global assessment like ANOVA (Refer to Sect. 30.4.2). To find the significance between the specific pair/pairs of groups of proportions, a partitioned Chi- square should be used.

  • The Cochran-Armitage trend test can be used to test the dose-response relationship of categorical variables, i.e., to assess whether there is an increasing (or decreasing) trend in the proportions over the exposure categories. This is useful in pharmacogenetic studies to identify the association between different allele frequencies and a clinical phenotype. It is also useful in toxicity studies where teratogenicity is evaluated at various dose levels [4].

8 Correlation and Regression

The correlation and regression analysis are performed when the association between two quantitative variables is studied.

8.1 Correlation

Correlation is the statistical tool used to assess the association between two quantitative variables, both measured in the same individuals. The first step is to create a scatter plot (Fig. 30.3) and visualize the association between the two variables. In a scatter plot, one variable is plotted on the X-axis and the other on the Y-axis; by convention, the exposure (independent) variable is plotted on the X-axis and the outcome (dependent) variable on the Y-axis. The assumptions for performing correlation are random sample selection and that the two variables, X and Y, follow a bivariate normal distribution [10]. The strength of association is given by Pearson’s correlation coefficient (r), a parametric measure which ranges from −1 to +1. The plus or minus sign indicates the direction of the relationship; a value of 1 indicates a strong correlation and 0 indicates no correlation. If the assumption of a normal distribution is violated or if either of the variables is ordinal, the non-parametric equivalent, Spearman’s rank correlation (rho), can be employed. Correlation only expresses the strength of association and does not tell the magnitude of change in one variable for a given change in the other. Also, statistical correlation does not imply causal association.
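
A minimal sketch of both correlation coefficients in Python on hypothetical mid-arm circumference and body weight values (not those of Table 30.17):

```python
# Hypothetical sketch of Pearson and Spearman correlation.
from scipy import stats

midarm_cm = [13.5, 14.0, 14.8, 15.2, 15.9, 16.3, 17.0, 17.6, 18.1, 18.9]
weight_kg = [11.0, 12.1, 13.4, 14.2, 15.8, 16.5, 17.9, 19.0, 20.2, 21.8]

r, p_r = stats.pearsonr(midarm_cm, weight_kg)        # parametric
rho, p_rho = stats.spearmanr(midarm_cm, weight_kg)   # non-parametric equivalent
print(f"Pearson r = {r:.3f} (P = {p_r:.4g}), r^2 = {r**2:.3f}")
print(f"Spearman rho = {rho:.3f} (P = {p_rho:.4g})")
```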

Fig. 30.3

Scatter plot of the data in Table 30.17: body weight of children (kg) versus mid-arm circumference (cm), showing an increasing trend

8.1.1 Example of Correlation

A researcher is interested in studying the relationship between mid-arm circumference and body weight in children. He collected the body weight and mid-arm circumference of 10 school-going children. Table 30.17 provides the data set and Table 30.18 gives the results.

Table 30.17 The midarm circumference and body weight of children
Table 30.18 Results of Pearson correlation test

8.1.2 Interpretation of Results

In this scenario, the r value is close to 1, which implies a strong correlation. Thus, the results can be written as follows: mid-arm circumference is strongly associated with the body weight of school-going children, with a Pearson correlation coefficient of 0.997 (P < 0.0001). The r2 in Table 30.18 is the coefficient of determination, which indicates the proportion of variance in one variable that is explained by the other variable.

8.2 Regression

Regression analysis is done in medical research for two main purposes. The first is when the research question focuses on the prediction of one variable from another variable. The second is when we need to infer an association between two variables by estimating the effect size after adjusting for potential confounders and effect modifiers. In the above example (Sect. 30.8.1.1), if the researcher wants to predict the body weight of individuals from the mid-arm circumference alone, regression analysis can be performed, provided the variation in body weight is adequately captured by mid-arm circumference. In this analysis, mid-arm circumference is the independent variable (predictor or explanatory variable) and body weight is the dependent variable (outcome variable). Depending on the type of outcome variable, regression analysis may be logistic, linear, Cox or Poisson regression, as described in Table 30.19 [16]. When multiple independent variables are assessed, it is called multiple regression. When we use a regression model to predict, the predictor values must be within the range of values that were used to develop the model (refer to Sect. 30.8.2.3).

8.2.1 Simple Linear Regression

In simple linear regression, one independent variable is used to predict an outcome. The different types of regression based on the outcome variable are provided in Table 30.19. A scatter plot is made as in correlation, and a line of best fit is drawn through the data points. The differences between the observed data points and the fitted line are called residuals; the best-fit line is obtained by the least squares method, which minimizes the sum of the squared residuals. To perform linear regression, the following assumptions are checked: for each value of X, the Y values follow a normal distribution; the means of the Y values lie on the regression line (linearity); and the variance of Y is similar for each value of X (homoscedasticity). The simple linear regression model is given by Y = β0 + β1X + ε, which is quite similar to the straight-line equation studied in elementary geometry (y = a + bx). β1 (the regression coefficient) is the slope of the line and estimates the extent of change in Y when X changes by 1 unit. β0 is the Y-intercept, the value of Y when X is 0. ε is the error term included in the statistical model to account for random variation.

Table 30.19 Types of regression based on the outcome variable
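
A minimal sketch of fitting a simple linear (least squares) regression and using it for prediction; the body weight and serum creatinine pairs are hypothetical, chosen only to illustrate the calls.

```python
# Hypothetical sketch of simple linear regression: Y = b0 + b1*X.
from scipy import stats

weight_kg  = [70, 74, 78, 82, 85, 88, 91, 94, 97, 100]
creatinine = [0.93, 1.00, 1.08, 1.16, 1.21, 1.27, 1.33, 1.39, 1.44, 1.50]

fit = stats.linregress(weight_kg, creatinine)
print(f"slope (b1) = {fit.slope:.4f}, intercept (b0) = {fit.intercept:.4f}")
print(f"r^2 = {fit.rvalue**2:.4f}, P = {fit.pvalue:.4g}")

# Predict serum creatinine for an individual weighing 80 kg
# (valid only within the 70-100 kg range used to fit the model)
predicted = fit.intercept + fit.slope * 80
print(f"predicted creatinine at 80 kg = {predicted:.3f} mg/dL")
```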

8.2.2 Example of Simple Linear Regression

A researcher wants to investigate the association of body weight and serum creatinine and collected both variables in ten individuals. He also wants to know whether body weight can predict the serum creatinine level. Table 30.20 provides a dataset for both variables.

Table 30.20 Dataset showing body weight and serum creatinine levels

8.2.3 Interpretation of Results

From the results in Table 30.21, the regression equation can be written as: serum creatinine = −0.402 + 0.019 × body weight in kg, where −0.402 is the Y-intercept (β0) and 0.019 is the slope of the regression line (β1). For a one-kg increase in body weight, serum creatinine increases by 0.019 mg/dL (CI: 0.016, 0.02). From the regression equation, we can easily predict the serum creatinine of an individual weighing 80 kg to be 1.118 mg/dL. This model should not be used for prediction in those who weigh less than 70 kg or more than 100 kg, since the regression model was constructed from individuals weighing 70 to 100 kg.

Table 30.21 Results of simple linear regression analysis

R2 can be interpreted as follows: according to the model, body weight in kilograms accounts for 98.16% of the variation in serum creatinine in mg/dL. We should keep in mind that such a near-perfect explanation of a physiological variable by another single variable is very rare; the variation is usually explained by a combination of variables. In that case, R2 reflects the explanatory power of all the variables in the model acting together.

9 Concluding Remarks

Statistical methods are used for drawing inferences about a population from the observations made in a sample. The statistical test for analysis of data is selected based on the type of variable, the distribution of the data, the number of groups/datasets and the study design. Most statistical tests come with a set of assumptions that the data must fulfil, and these assumptions have to be checked before interpreting the results. Statistical tests calculate the probability of the results occurring by chance alone (the P value). Parametric tests are more powerful than nonparametric tests and should be preferred when their assumptions are met. A significant P value does not mean clinical significance. In addition to the statistical significance, the point estimate of the effect and its precision (95% confidence interval) should be reported.