
1 Introduction to Statistical Data Analysis

This chapter provides readers (e.g., researchers, data analysts, statisticians) with basic guidelines toward a comprehensive understanding of the different types of statistical data analysis and methods, particularly for scientific research. This includes a description of when best to apply each type of analysis, building on the prerequisites for statistical data analysis discussed in the previous chapters of the book (see Chaps. 4 and 5). Quite often, researchers choose the type of analysis for their work based on their expert knowledge of, or experience with, the readily available tools and methods, rather than on the fact that, in practice, the appropriate type of analysis depends on the type of data collected and/or the variables being considered.

To this effect, this chapter discusses in detail the different types of statistical data analysis methods to guide the work of the researchers and data analysts when carrying out their investigations. In addition, it provides some examples of use case scenarios for each of the methods being discussed.

2 Statistical Data Analysis and Methods in Scientific Research

Some of the methods most frequently applied in the literature to carry out statistical data analysis and hypothesis testing are discussed here. Before we look at the different types of statistical methods, it is important for researchers at all stages, particularly during the planning and design stage of their research, to bear in mind that the type of data analysis or method to be used depends on the type of data collected and the research design (see Chaps. 2 and 5).

2.1 Linear Regression

Linear regression (the best-known type is often referred to as ordinary least squares—OLS) (Zdaniuk, 2014) is a statistical method used to estimate the “relationships between variables” (Kronthaler & Zöllner, 2021). It comprises different techniques for modeling and analyzing the association between two or more variables. Linear regression is mostly applied when the focus is on the relationship between a dependent variable (DV) and one or more independent variables (IV) (see Chap. 5).

The linear regression analysis assumes that the data points generally, but not exactly, fall along a straight line as shown in Fig. 6.1 (Montgomery et al., 2012).

Fig. 6.1 Linear regression graph: the dependent variable (y-axis) plotted against the independent variable (x-axis), with a fitted straight line y′ = b + ax (y-intercept b, slope a)

Example of Use Case Scenario for Linear Regression:

A setting where the linear regression analysis can be applied is when a dependent variable (e.g., the student’s grades) is expected to increase in proportion to their study time (independent variable).
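As a minimal illustration of this scenario, the following sketch fits an OLS line with SciPy's `linregress`; the study-time and grade values are invented for demonstration purposes only.

```python
# Hypothetical sketch: grades (DV) regressed on study hours (IV) by OLS.
from scipy import stats

study_hours = [2, 4, 5, 7, 9, 11, 13, 15]       # independent variable (x)
grades = [52, 58, 61, 67, 72, 78, 84, 88]       # dependent variable (y)

# Fit the straight line y' = b + a*x of Fig. 6.1 by ordinary least squares.
result = stats.linregress(study_hours, grades)
print(f"slope a = {result.slope:.2f}")          # estimated grade gain per hour
print(f"intercept b = {result.intercept:.2f}")  # estimated grade at zero hours
print(f"R-squared = {result.rvalue ** 2:.3f}")  # strength of the linear fit
```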

2.2 Logistic Regression

Logistic regression is a type of statistical method used to predict the outcome of a categorical (dependent) variable as a function of the independent (predictor) variables (Connelly, 2020). The method is useful for modeling the probability of an event occurring as a function of other factors. The logistic test is mostly applied in machine learning and data prediction, that is, as a statistical method for measuring the likelihood of an outcome given the data.

In a logistic regression test (see Fig. 6.2), separation occurs when a linear combination of the predictors can perfectly classify part or all of the observations in the sample (Ghosh et al., 2018). Consequently, a finite maximum likelihood estimate of the regression coefficients may fail to exist.

Fig. 6.2 Graphical representation of the linear regression versus logistic regression models: with linear regression the predicted y can exceed the range of 0 to 1, whereas with logistic regression the predicted y always lies within the range of 0 to 1

Linear Regression versus Logistic Regression:

In the example shown in Fig. 6.2, it can be seen that while linear regression is used to estimate the relationships between variables, logistic regression is most useful for classifying samples into a range of categories, and it can make use of different types of data to perform the classification.

The logistic regression method can also be used to determine or assess which variables are useful for classifying the data samples.

Example of Use Case Scenario of Logistic Regression:

Consider a dataset containing information about students who are considered to have a learning difficulty. Certain features, such as cognitive impairment, visual impairment, and mobility impairment, may be seen as the determining factors. The data analyst's or researcher's task could therefore be to find the correlation between those listed features and their dependencies on each other.

Thus, for research purposes, the following questions can be answered using the logistic regression:

  • Are cognitively impaired students more likely to be classified as students with a learning difficulty?

  • What is the probability that a visually impaired student belongs to the group of students considered to have learning difficulties?

  • Does mobility impairment have any impact on classifying a student as having a learning difficulty?

In essence, a logistic regression model fitted with the above variables (features) will suit the dataset in question. For instance, the analyst or researcher can use the (logistic) regression to build a predictive model capable of determining, for a new set of records, whether a student has a learning difficulty or not. In any case, the most important factor to consider in this type of analysis is the predictive accuracy of the resultant model.
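A minimal sketch of such a predictive model is shown below, using scikit-learn's `LogisticRegression`; the feature matrix, labels, and the new record are all invented for illustration.

```python
# Hypothetical sketch: classifying learning difficulty from three
# binary impairment features (1 = present, 0 = absent).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: cognitive, visual, mobility impairment.
X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0],
              [1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 1, 1]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = learning difficulty

model = LogisticRegression().fit(X, y)
print("coefficients:", model.coef_)  # each feature's pull on the log-odds

# Predicted probability of a learning difficulty for a new student
# record with cognitive impairment only.
print("P(difficulty):", model.predict_proba([[1, 0, 0]])[:, 1])
```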

2.3 Linear-Log Model

The coefficients of a linear-log model represent the estimated unit change in the dependent variable (DV) for a percentage change in the independent variable (IV). Thus, if we use natural log values for the independent variable (x) and keep the dependent variable (y) in its original scale, the econometric specification is called a linear-log model (essentially the mirror image of the log-linear model).

Linear-log models are typically used when the impact of the independent variable (x) on the dependent variable (y) diminishes as the independent variable increases (in contrast to the linear regression case). The resultant linear-log models can sometimes correct for the lack of homoscedasticity associated with linear regression analysis, since they allow for heteroscedasticity in the residual distribution (Glick & Figliozzi, 2019).

For instance, when the researcher estimates a linear-log regression, the sign of the estimated coefficient β1 on the logged independent variable determines which of the two most likely relationships holds, as described in Fig. 6.3.

Fig. 6.3 Linear-log model graphs of y versus x: for β1 > 0 the impact of the independent variable is positive, and for β1 < 0 the impact of the independent variable is negative

Where:

  • Part (a)—Fig. 6.3 shows a linear-log function where the impact of the independent variable is positive and

  • Part (b)—Fig. 6.3 shows a linear-log function where the impact of the independent variable is negative.
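A linear-log fit can be sketched by regressing y on ln(x); the following minimal example uses NumPy's `polyfit`, with data values invented so that β1 > 0, i.e., the shape of part (a) in Fig. 6.3.

```python
# Hypothetical sketch: fit y = beta0 + beta1 * ln(x) by least squares.
import numpy as np

x = np.array([1, 2, 4, 8, 16, 32], dtype=float)  # independent variable
y = np.array([3.1, 4.0, 4.8, 5.5, 6.1, 6.9])     # dependent variable

# polyfit on ln(x) returns [slope, intercept] for a degree-1 fit.
beta1, beta0 = np.polyfit(np.log(x), y, deg=1)
print(f"y = {beta0:.2f} + {beta1:.2f} * ln(x)")

# Interpretation: a 1% increase in x shifts y by roughly beta1/100 units,
# so the impact of x on y diminishes as x grows.
```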

2.4 T-test

The t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups of a variable that may be related in certain features or characteristics (Novak, 2020). The t-test is one of the most common tests used for hypothesis testing in statistics and research experiments. The method can be regarded as one of the “multivariate” types of statistical analysis techniques used to analyze datasets that contain more than one group of a variable, and such methods are especially valuable when working with correlated variables (Chatfield, 2018; Novak, 2020).

The following formula is used to calculate the t-test:

$$T = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$

whereby

  • A non-trivial (large) T-value indicates that the groups are different.

  • A trivial (small) T-value indicates that the groups are similar.

Example of Use Case Scenario for T-test:

Consider a situation whereby a researcher is interested in whether men and women have different average heights. In a real-world scenario, it is not practical to measure the height of every man and woman across the globe. Instead, the researcher can measure a selected sample of each, say, 500 men and 500 women, in order to determine the difference between the sample means. The t-test then seeks to determine whether that difference is probably representative of a real difference between men and women in general, or whether it is most likely a meaningless statistical artifact. Considering the scenario above, one may therefore ask whether there is, in fact, no difference between the average heights of men and women, or what the chances are that groups randomly selected from those populations (men and women across the globe) would show such a difference by chance alone.
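This scenario can be sketched with SciPy's independent two-sample t-test; the heights below are simulated from assumed distributions rather than taken from any real survey.

```python
# Hypothetical sketch: two-sample t-test on simulated height data (cm).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
men = rng.normal(loc=175, scale=7, size=500)    # simulated male heights
women = rng.normal(loc=162, scale=6, size=500)  # simulated female heights

t_stat, p_value = stats.ttest_ind(men, women)   # assumes equal variances
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")

# A small p-value suggests a real difference in mean heights; for unequal
# group variances, pass equal_var=False to run Welch's t-test instead.
```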

Limitations of the T-test

  • The results of inferential statistics, such as the t-test, can only be applied to populations that resemble the sample being tested.

  • In t-test analysis, the sample and population are expected to be normally distributed; that is, most scores cluster around the mean, with fewer scores further out, resembling a bell curve.

  • Each group in a t-test is expected to have about the same number of data points and a similar distribution. In other words, measuring large and small groups together may give inaccurate results.

  • To perform a t-test, all data points must be independent; that is, the scores should not influence each other.

  • The datasets used to perform a t-test must be at approximately the interval level or higher, so that each unit of measurement can be considered about equal to any other unit.

How to Resolve the Shortcomings with the T-test

  • Non-parametric tests, such as the Mann–Whitney U and Kruskal–Wallis H tests, can perform the same type of analysis as the t-test, with the added benefit of being applicable to non-normal distributions and ordinal-level data. However, these tests (i.e., the Mann–Whitney U and Kruskal–Wallis H tests) can also be less powerful in some settings, depending on the type of analysis being done.

2.5 Analysis of Variance—ANOVA (F-test)

Researchers can make use of the ANOVA (short for Analysis of Variance) test to determine the influence that one or more independent variables have on the dependent variable in a regression study. With the ANOVA F-test (regarded as one of the multivariate families of analysis), the variation in the data is split into two parts, namely, systematic factors and random factors (Chatfield, 2018; Christensen, 2020; Nibrad, 2019).

The analysis of variance (ANOVA), as the name implies, is defined as a collection of models and their associated procedures in which the variance is partitioned into components due to different explanatory variables. Similar to the t-test, ANOVA uses the variance between the groups and the variance within the groups to calculate a ratio. Thus, the result of the ANOVA, shown in the following formula (i.e., the F-statistic, also called the F-ratio), allows multiple groups of data to be analyzed to determine the variability between and within the samples. Researchers can use the ANOVA test results (F-test) to generate additional data or draw conclusions in alignment with the corresponding regression models. For instance, if there is no real difference between the tested groups (the null hypothesis), the ANOVA F-ratio will be close to 1; the larger the ratio, the more likely it is that the groups are different.

The formula for ANOVA test is as follows:

$$F = \frac{\text{MST}}{\text{MSE}}$$

where

  • F = ANOVA coefficient (F-statistic),

  • MST = mean sum of squares due to treatment, and

  • MSE = mean sum of squares due to error.

Consequently, if most of the variation is between groups, then the researcher or data analyst can reasonably claim that there is probably a significant effect. On the other hand, if most of the variation is within the groups, then there is probably not a significant effect.
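A minimal one-way ANOVA sketch is shown below using SciPy's `f_oneway`, which computes exactly this F-ratio; the three groups of scores are invented for illustration.

```python
# Hypothetical sketch: one-way ANOVA across three invented groups.
from scipy import stats

group_a = [85, 86, 88, 75, 78, 94]
group_b = [91, 92, 93, 85, 87, 84]
group_c = [79, 78, 88, 94, 92, 85]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# F close to 1 is consistent with the null hypothesis of no difference;
# a large F (small p) suggests at least two groups differ, although the
# test does not say which ones.
```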

Interestingly, the ANOVA analysis can also be used to test for fuzziness in datasets (Ahmed & Kilic, 2019). For instance, the one-way ANOVA (also referred to as a between-subjects ANOVA or one-factor ANOVA) can help determine statistical differences between the means of a continuous dependent variable across the categories of an independent variable. The method cannot tell which specific groups of data are significantly different from each other; it only indicates that at least two of the groups are (Ahmed & Kilic, 2019).

Types of ANOVA and Use Case Scenarios:

There are two main types of ANOVA analysis, namely, (i) one-way ANOVA, and (ii) two-way ANOVA (Christensen, 2020; Nibrad, 2019).

  (i) The one-way ANOVA test for between groups can be used when the researcher wants to test two or more groups to see if there is a difference between them. In order to conduct a one-way ANOVA analysis, the following six assumptions must be satisfied (Ahmed & Kilic, 2019):

    (1) The dependent variable must be continuous.

    (2) The independent variable(s) must consist of two or more categorical, independent groups.

    (3) There should be no relationship between the observations in each group or between the groups themselves, i.e., independence of observations must hold.

    (4) There should be no significant outliers, which might have a negative effect on the one-way ANOVA and thus reduce the validity of the results.

    (5) The dependent variable should be approximately normally distributed for each category of the independent variable. One-way ANOVA only requires approximately normal data because it is quite “robust” to violations of normality, meaning that this assumption can be slightly violated and the test will still provide valid results.

    (6) Homogeneity of variances must hold.

  (ii) The two-way ANOVA can be conducted with or without replication:

    • The two-way ANOVA with replication is used for two groups whose members are doing more than one thing, for example, two groups of patients from different hospitals trying two different therapies.

    • The two-way ANOVA without replication is used when the analyst has one group and is double-testing that same group, for example, an experiment testing one set of individuals before and after they take a particular medication to see if it works.

Difference between ANOVA versus T-test:

  • The t-test compares the means of two groups of a variable.

  • ANOVA calculates the ratio of the variance between groups to the variance within groups.

Also, as shown in Fig. 6.4, a t-test compares two groups of a variable, whereas ANOVA tests more than two groups to determine the differences or dependency among the variables. Thus, to conduct a test with three or more categories of a variable, one must use an Analysis of Variance (ANOVA).

Fig. 6.4 Flowchart showing the choice between the t-test and ANOVA: a data sample with one group calls for a one-sample t-test; more than two groups call for the ANOVA F-test; two dependent groups call for a paired-sample t-test; two independent groups with equal variances call for an independent-sample t-test with pooled variance, and with unequal variances an independent-sample t-test with approximated degrees of freedom

2.6 Mann–Whitney U Test

The Mann–Whitney U test is a non-parametric equivalent of the independent-sample t-test. The test is used to compare whether there is a difference in the dependent variable between two independent groups (McKnight & Najab, 2010), as shown in Fig. 6.5. For example, the probable effect of an exam-administration mode (the independent variable) on the test-takers' scores (dependent variable) can be analyzed using the Mann–Whitney U test (Oz & Ozturan, 2018), provided the exam-administration mode has two categories or groups (e.g., online and paper-based). Quite often, researchers interpret the Mann–Whitney U test by comparing the medians of the two populations.

Fig. 6.5 Description of the Mann–Whitney U test: a population is sampled into two groups (A and B); the test is applicable when two independent groups are being considered, the outcome is ordinal, and the assumption of independent observations is met

The formula to calculate the Mann–Whitney test is as follows:

$$\begin{aligned} U_{1} &= n_{1} n_{2} + \frac{n_{1}\left(n_{1} + 1\right)}{2} - R_{1} \\ U_{2} &= n_{1} n_{2} + \frac{n_{2}\left(n_{2} + 1\right)}{2} - R_{2} \end{aligned}$$

where

  • \(R_{1}\) = sum of the ranks for group 1 and

  • \(R_{2}\) = sum of the ranks for group 2.

It is important to mention that the Mann–Whitney U method works by pooling the observations from the two samples into one combined sample, keeping track of which sample each observation comes from, and then ranking the observations from lowest to highest, i.e., from 1 to \(n_{1} + n_{2}\).

For instance, the Mann–Whitney U test has proved to be one of the many multivariate tests that can combine the primary and mortality endpoints of a dataset into a single composite endpoint, which can then be analyzed through the ranking of the combined outcomes (Matsouaka et al., 2016). The testing of such combined endpoints can be performed as a weighted test, where the optimal weights are determined by maximizing the power of the statistical analysis under a particular alternative hypothesis.

Example of Use Case Scenario of Mann–Whitney U Test:

Consider a students' assessment system designed to determine the effectiveness of a new teaching program or strategy for improving students' learning outcomes. To this effect, a total of n participants are randomly assigned to undergo either the new program or a previously existing one. The students are asked to record the number of times they feel overwhelmed by the assigned program over a specified period of time. The Mann–Whitney U test in this scenario can be used to determine:

  • Whether there is a difference in the number of times the students feel overwhelmed during the new program compared to those undergoing the previously existing program, and

  • If so, whether the observed difference is statistically significant.
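The scenario can be sketched with SciPy's `mannwhitneyu`; the per-student counts below are invented for illustration.

```python
# Hypothetical sketch: Mann–Whitney U test on two independent groups.
from scipy import stats

new_program = [3, 5, 1, 4, 3, 5, 2, 4]       # times overwhelmed (new)
existing_program = [4, 8, 6, 2, 7, 9, 6, 5]  # times overwhelmed (existing)

u_stat, p_value = stats.mannwhitneyu(new_program, existing_program,
                                     alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")

# A small p-value indicates a statistically significant difference
# between the two independent groups' distributions.
```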

2.7 Chi-Squared (χ²)

Chi-squared2) statistics is a test that measures how expectations compare to the actual observed data (or model results) (McHugh, 2012). Datasets used in calculating a chi-squared statistic must be random, raw, mutually exclusive, drawn from independent groups or population, and from a large enough sample size. For example, the results of tossing a coin 100 times meet these criteria.

Chi-squared tests have proved effective and are often used in hypothesis testing, for instance for calculating new similarity distance measures, which are important in applications such as image analysis and statistical inference (Ren et al., 2019).

The formula for calculating the chi-square (χ2) statistics is as follows:

$$\chi_{c}^{2} = \sum \frac{\left(O_{i} - E_{i}\right)^{2}}{E_{i}}$$

where

  • c = degrees of freedom,

  • O = observed value(s), and

  • E = expected value(s).

As shown in Table 6.1, the numbers denoted with (O) represent the observed values, O, whereas the numbers denoted with (E) represent the expected values, E.

Table 6.1 Example of a chi-squared (χ²) data distribution

Example of Use Case Scenario for Chi-squared (χ²):

A chi-squared test can be applied to determine, for instance, the level of effect or impact that gender bias has on students' evaluations of teaching or on expectations about academic professors' performance.
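One way to frame this scenario is as a chi-squared test of independence on a contingency table of observed counts; the sketch below uses SciPy's `chi2_contingency`, and the counts are invented for illustration.

```python
# Hypothetical sketch: chi-squared test of independence between
# professor gender and evaluation outcome.
from scipy import stats

#           positive eval  negative eval
observed = [[45, 15],   # evaluations of male professors
            [30, 30]]   # evaluations of female professors

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print("expected counts (the E values):", expected)

# A small p-value suggests the evaluation outcome is not independent
# of gender, i.e., evidence of a gender effect in the (invented) data.
```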

2.8 Kruskal–Wallis H Test

The Kruskal–Wallis H test, proposed by William Kruskal and W. Allen Wallis (Kruskal & Wallis, 1952), is a non-parametric method used to determine whether several groups of data come from the same population. The Kruskal–Wallis H analysis is an alternative to the ANOVA in which the data are replaced by their ranks, and it can also be applied to ordinal-level data. In the same manner as the Mann–Whitney U test (but in this case for more than two groups of variables), the Kruskal–Wallis test can be used to determine whether statistical differences exist between the independent observations based on the dependent variables (Veerasamy et al., 2018).

In other words, the Kruskal–Wallis H test can be regarded as an extension of the Mann–Whitney U test, typically applied to three or more groups, as illustrated in Fig. 6.6.

Fig. 6.6 Description of the Kruskal–Wallis H test: a population is sampled into three (or more) independent groups; the test is applicable when three or more independent groups are being considered, the assumption of independent observations is met, and the normality and/or homogeneity-of-variance assumptions required for ANOVA have been violated

The formula for computing the Kruskal–Wallis H test is as follows:

$$H = \left( \frac{12}{n\left(n + 1\right)} \sum_{j = 1}^{k} \frac{R_{j}^{2}}{n_{j}} \right) - 3\left(n + 1\right)$$

where

  • k = number of comparison groups,

  • n = total sample size,

  • \(n_{j}\) = sample size of the jth group, and

  • \(R_{j}\) = sum of the ranks in the jth group.

The following are the key features of the Kruskal–Wallis H test:

  • All \(n = n_{1} + n_{2} + \cdots + n_{k}\) measurements are jointly ranked (i.e., treated as one large sample).

  • One can also use the sums of the ranks of the k samples to compare the distributions.

Example of Use Case Scenario of Kruskal–Wallis H Test:

Please refer to the use case scenario of the Mann–Whitney U test (Section 6.2.6), but in this situation applied to three or more groups.
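Extending that scenario to a third program group, the sketch below applies SciPy's `kruskal`; as before, the counts are invented for illustration.

```python
# Hypothetical sketch: Kruskal–Wallis H test across three programs.
from scipy import stats

program_a = [3, 5, 1, 4, 3, 5, 2, 4]   # times overwhelmed, program A
program_b = [4, 8, 6, 2, 7, 9, 6, 5]   # times overwhelmed, program B
program_c = [1, 2, 2, 3, 1, 4, 2, 3]   # times overwhelmed, program C

h_stat, p_value = stats.kruskal(program_a, program_b, program_c)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")

# A significant H says at least one group differs from the others;
# pairwise follow-up tests (e.g., Mann–Whitney U with a multiple-
# comparison correction) are needed to identify which ones.
```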

2.9 Correlation

Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship (whether positive or negative).

2.9.1 Kendall Rank Correlation

The Kendall rank correlation (also known as Kendall's tau) is a non-parametric test mostly used to measure the strength of dependence between two variables (Brossart et al., 2018). In theory, Kendall's rank correlation coefficient can be applied as an efficient and robust way of identifying monotone relationships between two data sequences. However, when applied to digital (i.e., discrete or discontinuous) data, the high number of ties can produce inconsistent results due to quantization (Couso et al., 2018).

The formula for Kendall Rank Correlation is as follows:

$$\text{Kendall's tau} = \frac{C - D}{C + D}$$

where C = number of concordant (agreeable) pairs and D = number of discordant (non-agreeable) pairs.

2.9.2 Spearman's Rank Correlation

Spearman rank (also known as Spearman’s rho analysis) is a non-parametric test that is mostly used to measure the degree of association between two variables (Wang et al., 2019; Zar, 2014). For example, the researchers can use Spearman’s rank correlation coefficient and multiple regression techniques to measure the relationship between some set of variables (Veerasamy et al., 2018).

The formula for Spearman’s Rank Correlation is as follows:

$$\text{Spearman's rho} = 1 - \frac{6\sum d_{i}^{2}}{n\left(n^{2} - 1\right)}$$

where \(d_{i}\) = difference between the two ranks of the ith observation and n = number of observations.

Kendall versus Spearman:

As shown in Table 6.2 and the formulas/calculations above, the difference between the Kendall tau and Spearman rho analyses can be illustrated as follows (Hauke & Kossowski, 2011):

Table 6.2 Kendall versus Spearman data distribution

When to use Kendall’s tau Analysis?

  • Kendall's tau analysis is most useful when the data analyst or researcher is interested in a test with better statistical properties.

  • The interpretation of Kendall's tau in terms of the probabilities of observing concordant (agreeable) and discordant (non-agreeable) pairs is very direct.

When to use Spearman’s rho Analysis?

  • Spearman’s rank correlation coefficient is the most widely used rank correlation coefficient analysis.

In summary, the interpretations of Kendall's tau and Spearman's rho rank correlation coefficients are quite often very similar, and thus both methods tend to lead to the same inferences.
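The following sketch computes both coefficients on the same invented data with SciPy, illustrating how the two usually agree in direction even though their magnitudes differ.

```python
# Hypothetical sketch: Kendall's tau versus Spearman's rho on one dataset.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 5, 6, 8]   # invented, roughly monotone in x

tau, p_tau = stats.kendalltau(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Kendall's tau = {tau:.3f} (p = {p_tau:.3f})")
print(f"Spearman's rho = {rho:.3f} (p = {p_rho:.3f})")

# Both coefficients point to the same inference here; tau is typically
# smaller in magnitude than rho on the same data.
```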

2.10 Wilcoxon Test (Signed-Rank and Rank-Sum)

The Wilcoxon test, which comes in the Wilcoxon signed-rank and Wilcoxon rank-sum variants (Wilcoxon, 1945), is a non-parametric test and alternative to the t-test (Rey & Neuhäuser, 2011). The test is mostly applied by researchers to compare two dependent samples by testing whether the median values of the two groups differ significantly from each other. The resultant models assume that the data come from two matched or dependent populations, following the same distribution through time or place (Hayes, 2023). The test can also be applied to test the hypothesis that the median of a symmetrical distribution equals a given constant. As the name implies, and as with the other non-parametric tests described in this chapter (Chap. 6) and in Chap. 5, this distribution-free test is based on ranks (Rey & Neuhäuser, 2011). It is assumed that the independent variable in a Wilcoxon test is dichotomous and that the dependent variable is a continuous variable whose measurement is at least ordinal.

The main types and summary of the Wilcoxon test include (Hayes, 2023):

  • The Wilcoxon test compares two paired or independent groups of variables and comes in two versions: (i) the rank-sum test and (ii) the signed-rank test.

  • The aim of the test is to determine whether two or more sets of pairs in a dataset differ from one another in a statistically significant manner.

  • The signed-rank test assumes that the paired observations come from the same matched (dependent) populations, whereas the rank-sum test applies to independent samples.

  • Unlike the t-test, which compares the mean difference between two groups of a variable, the Wilcoxon test assesses the median difference between the two groups.

The signed-rank version of the Wilcoxon test is calculated from the differences between the paired observations' scores, taking into account both the signs and the magnitudes of those differences.

As the non-parametric equivalent of the paired t-test, the signed-rank test can be used as an alternative to the t-test when the population data do not follow a normal distribution.

On the other hand, the Wilcoxon rank-sum test version is often used as the non-parametric equivalent of the independent or two-sample t-test.

The Wilcoxon rank-sum test is used to compare the median of two independent samples, while Wilcoxon signed-rank test is used to compare the median of two related (paired) samples.

The value of z (test statistics) in a Wilcoxon test is calculated with the following formula:

$$Z_{T} = \frac{T - \mu_{T}}{\sigma_{T}}$$

where

  • T = sum of the signed ranks of the differences in the sample, and \(\mu_{T}\) and \(\sigma_{T}\) are the mean and standard deviation of T under the null hypothesis.

Example of Use Case Scenario of Wilcoxon Test:

The following type of research questions can be answered using the Wilcoxon test:

  • Are the test scores of the same group of students different between, e.g., 5th grade and 6th grade?

  • Is the learning performance of a particular group of students better in the morning or in the evening?
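The first question can be sketched as a paired Wilcoxon signed-rank test with SciPy's `wilcoxon`; the scores are invented for illustration.

```python
# Hypothetical sketch: Wilcoxon signed-rank test on paired scores
# for the same students in 5th and then 6th grade.
from scipy import stats

grade5 = [72, 65, 80, 58, 90, 74, 68, 83]
grade6 = [75, 70, 78, 64, 92, 79, 66, 88]

w_stat, p_value = stats.wilcoxon(grade5, grade6)
print(f"W = {w_stat}, p = {p_value:.4f}")

# A small p-value suggests the median of the paired differences is
# significantly different from zero, i.e., the scores changed between
# grades. For two independent groups, use stats.mannwhitneyu instead.
```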

3 Summary

The type of research methodology or design chosen for a research investigation determines the type of data required for the research purpose, and vice versa. This includes the means or procedures that will be applied for collecting the samples (data collection) as well as the type of analysis (statistical data analysis) that will be performed. The authors have provided a matching of data types and methods in Table 6.3 as a guideline for researchers in selecting the most appropriate or suitable statistical analysis/method based on the type of data or sample (i.e., independent versus dependent variables).

Table 6.3 The different types of statistical data analysis described in terms of the independent versus dependent variables

Overall, a more comprehensive guide on how to choose the most suitable statistical data analysis method, based on the type of data (see Chap. 4) or the available statistical tools for the research investigation, can also be found in the following sources:

  • NYU Elmer Holmes Bobst Library (NYU Libraries, 2023) and

  • UCLA—Institute for Digital Research and Education (UCLA, 2023).