
1 Introduction

The Chi-squared (χ2) test is a statistical test that researchers and data analysts use to measure how expectations compare with the observed data or the results of a model (Biswal, 2023; Kishore & Jaswal, 2023). It is mainly used to determine whether a relationship exists between “categorical” variables. The data must be raw data in mutually exclusive categories, drawn at random from independent populations, with a sufficiently large sample size (Sayassatov & Cho, 2020; Turhan, 2020).

By default, the decision rule for the Chi-squared (χ2) test is as follows: IF the p-value of the test statistic is less than or equal to 0.05 (p ≤ 0.05), THEN we reject the null hypothesis (H0) and assume that a relationship exists between the targeted (categorical) variables that is unlikely to be due to chance (H1); ELSE IF the p-value is greater than 0.05 (p > 0.05), THEN we retain H0 and say that there is no evidence of a relationship, i.e., the analyzed (categorical) variables are treated as independent.

The formula for calculating the Chi-square (χ2) statistics is as follows (Biswal, 2023; Kishore & Jaswal, 2023):

$$\chi_{c}^{2} = \sum_{i} \frac{(O_{i} - E_{i})^{2}}{E_{i}}$$

where:

c = Degrees of freedom

O = Observed value(s), indexed by category (cell) i

E = Expected value(s), indexed by category (cell) i
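To make the formula concrete, the short sketch below computes the statistic by hand and compares it with the built-in chisq.test( ) function. The observed counts are hypothetical values chosen for illustration, and the expected counts assume equal proportions across the four categories.

```r
# Hypothetical observed counts for four categories (illustrative only)
observed <- c(20, 30, 25, 25)

# Expected counts, assuming all categories are equally likely
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-squared statistic: sum of (O - E)^2 / E over all categories
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for a single variable: number of categories minus 1
df <- length(observed) - 1

# Upper-tail probability of the Chi-squared distribution gives the p-value
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# The built-in function should agree with the manual calculation
builtin <- chisq.test(observed)
print(c(manual = chi_sq, builtin = unname(builtin$statistic)))
```

With these counts every expected value is 25, so only the first two categories contribute to the sum.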

Theoretically, there are two main types of analysis or test that can be performed using the Chi-squared (X2) statistics or method (Preacher, 2001; Turhan, 2020). These are:

  • Independence test: a test of “relationship” that allows the researcher or data analyst to compare two (categorical) variables and determine whether they are related. Here, the Chi-squared test indicates how likely it is that the differences between the observed (actual frequency) data and the expected data arose by chance alone; a small p-value (p ≤ 0.05) suggests that the relationship is not due to chance (McHugh, 2012; Taneichi et al., 2020).

  • Goodness of fit test: applied to determine whether the proportions in a data sample match those of the larger population. Here, the Chi-squared test allows the researcher to check how well the analyzed (drawn) sample matches the assumed (expected) characteristics of the larger population that the data is meant to represent (Cochran, 1952; Rolke & Gongora, 2020). If the analyzed data does not fit the assumed (expected) characteristics of the intended population, as indicated by the p-value (p ≤ 0.05), the researcher may not want to use the drawn sample to make definite conclusions about the studied (larger) population.
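The difference between the two types of test can be sketched with a pair of toy examples. All counts, group labels, and assumed population proportions below are hypothetical and serve only to show how the same function handles both cases.

```r
# Independence test: a hypothetical 2 x 3 table of counts for two variables
tab <- matrix(
  c(30, 20, 25, 25, 15, 35),
  nrow = 2,
  dimnames = list(Group = c("A", "B"), Choice = c("X", "Y", "Z"))
)
ind_test <- chisq.test(tab)

# Goodness of fit test: hypothetical counts for one variable, compared
# against assumed population proportions (these must sum to 1)
observed <- c(Low = 40, Medium = 35, High = 25)
assumed  <- c(Low = 0.5, Medium = 0.3, High = 0.2)
gof_test <- chisq.test(observed, p = assumed)

# Degrees of freedom: (rows - 1) * (columns - 1) versus (categories - 1)
print(c(independence_df = unname(ind_test$parameter),
        goodness_of_fit_df = unname(gof_test$parameter)))
```

In the first call the expected counts are derived from the table margins, whereas in the second they come from the proportions supplied through the p argument.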

As a general rule of thumb, the following assumptions must be met in order to perform the Chi-squared (X2) statistics or analysis (Turhan, 2020):

  • The “observed” and “expected” observations must be randomly collected or drawn.

  • All the groups of items in the data must be independent.

  • No group may contain very few items (e.g., each group should contain at least 10 items).

  • The data sample size must be large (n > 50).
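The size-related assumptions above can be checked before running the test. The following is a minimal sketch on simulated data; the variable names are hypothetical, and the thresholds (10 items per group, n > 50) are simply the rule-of-thumb values from the list above.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated categorical data (hypothetical variables, illustrative only)
demo <- data.frame(
  group  = sample(c("a", "b", "c"), 120, replace = TRUE),
  answer = sample(c("yes", "no"), 120, replace = TRUE)
)

# Contingency table of observed counts
tab <- table(demo$group, demo$answer)

# Rule-of-thumb checks taken from the assumptions listed above
n_total          <- sum(tab)
smallest_group   <- min(tab)
large_enough     <- n_total > 50
no_sparse_groups <- smallest_group >= 10

# Expected counts under independence, for inspection alongside the observed ones
expected_counts <- suppressWarnings(chisq.test(tab))$expected

print(list(n = n_total, smallest_group = smallest_group,
           large_enough = large_enough, no_sparse_groups = no_sparse_groups))
```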

It is also important to mention that, like many other non-parametric statistical methods (see Chap. 4), the Chi-squared (X2) test does not require the analyzed dataset or sample to meet the “equality of variance” (homoscedasticity) assumption among the groups of variables (McHugh, 2012).

In the next section of this chapter (Sect. 10.2), the authors will demonstrate how to conduct the Chi-squared (X2) test in R. Figure 10.1 outlines the steps that we will follow to perform the test in RStudio.

2 Chi-Squared (X2) Test in R

As defined in the previous section (Sect. 10.1), the Chi-squared (X2) test is a method for determining whether two categorical variables have a statistically significant association (correlation) between them. For the Chi-squared statistic, the two targeted variables must be categorical (e.g., sex, marital status, ethnicity, religious orientation, likelihood of events, etc.) and must be drawn from the same population.

In this section, the authors will demonstrate how to conduct the two types of tests (i.e., Independence and Goodness of fit) in R using the function chisq.test( ), following the steps outlined in Fig. 10.1.

Fig. 10.1
An illustration of five steps for constructing a Chi-squared test. It includes R packages, data, analysis, visualize, and interpret.

Steps to conducting Chi-squared (X2) test in R

To begin, open RStudio and create a new project or open an existing one. Once RStudio and an R project are open, create a new R script and name it “ChiSquare_Demo” or any name of the user's choice (see Chap. 1 for a refresher on these steps).

Once the R script has been created, download an example dataset that we will use to demonstrate the Chi-squared (X2) test (users are welcome to use any dataset or format of their choice).

As shown in Fig. 10.2, download the example data (Sample CSV Files) named as “sample-csv-file-for-testing” from the following link (https://www.learningcontainer.com/sample-excel-data-for-analysis/#Sample_CSV_file_download) and save the file on the local machine or computer.

Fig. 10.2
A screenshot of the CSV sample file download with the download button for statistical test in R.

Example of CSV file download. Source https://www.learningcontainer.com/sample-excel-data-for-analysis/#Sample_CSV_file_download. Note: users can also access the example file directly through the following link (https://doi.org/10.6084/m9.figshare.24728073), where the authors have uploaded all the example files used in this book

Once the user has successfully downloaded and saved the example file (named “sample-csv-file-for-testing” upon download) on the local machine, we can proceed to conduct the Chi-squared (X2) analysis in R.

# Step 1—Install and Load the required R Packages and Libraries

Install and Load the following R packages and libraries (Fig. 10.3, Step1, Lines 3 to 13) that we will be using to call the different R functions, data manipulations, and graphical visualizations for the Chi-squared (X2) test.

Fig. 10.3
A screenshot of the program code with two steps for conducting Chi-squared (X2) statistics in R. A syntax for the console is provided below.

Conducting Chi-squared (X2) statistics in R

The syntax and code to install and load the required R packages and libraries are as follows:

An 8-line code that installs and loads the R packages vcd, grid, ggpubr, and dplyr.
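As the original listing is only available as a figure, the following is a hedged sketch of one way to prepare the same packages rather than a reproduction of the authors' code. The helper ensure_packages( ) is a hypothetical name introduced here; it installs nothing unless install = TRUE is passed. Note that grid ships with R itself, so only vcd, ggpubr, and dplyr would need to be installed from CRAN.

```r
# Packages named in the text that are distributed through CRAN
cran_pkgs <- c("vcd", "ggpubr", "dplyr")

# Hypothetical helper (not from the book): optionally install whatever is
# missing, then report which packages can be loaded
ensure_packages <- function(pkgs, install = FALSE) {
  is_available <- function(p) requireNamespace(p, quietly = TRUE)
  missing_pkgs <- pkgs[!vapply(pkgs, is_available, logical(1))]
  if (install && length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  vapply(pkgs, is_available, logical(1))
}

# Report availability only; use ensure_packages(cran_pkgs, install = TRUE)
# to install the missing packages as well
available <- ensure_packages(cran_pkgs)
print(available)

# 'grid' is part of the standard R distribution and can be loaded directly
library(grid)

# Load the CRAN packages that are available on this machine
for (pkg in names(available)[available]) {
  library(pkg, character.only = TRUE)
}
```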

# Step 2—Import and Inspect the example dataset for Analysis.

As defined in Fig. 10.3 (Step 2, Lines 16 to 21), import the example dataset named “sample-csv-file-for-testing” (Sample CSV Files) that we downloaded earlier, and store this in an R object named “Chisqd.data” (the users can use any name of choice if they wish to do so).

The syntax for importing and attaching the example data file into R is as shown in the code below:

A 4-line R code that reads a CSV file, attaches it to the R session, displays it, and prints its structure.
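The import step can be sketched as follows. So that the example runs without the downloaded file, it first writes a tiny stand-in CSV to a temporary location; the three rows and four columns are hypothetical, whereas the real file contains 700 observations and 16 variables. In practice, csv_path would point to the saved “sample-csv-file-for-testing” file. The sketch omits the attach and display steps of the original listing.

```r
# Write a small stand-in CSV file (hypothetical rows, illustrative only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Segment,Country,Discount Band,Units Sold",
  "Government,Canada,None,1618.5",
  "Midmarket,France,Low,2178",
  "Enterprise,Germany,High,888"
), csv_path)

# Import the file; text columns are converted to factors (categorical data)
Chisqd.data <- read.csv(csv_path, stringsAsFactors = TRUE)

# Inspect the structure of the imported object
str(Chisqd.data)

# Remove the temporary file
unlink(csv_path)
```

By default, read.csv( ) converts column headers into syntactically valid R names, which is why a header such as “Discount Band” appears as Discount.Band after import.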

Once the data is successfully imported and stored in RStudio as an R object we called “Chisqd.data”, the users will be able to view the details of the file originally named “sample-csv-file-for-testing” (Sample CSV Files) as shown in Fig. 10.4 with 700 observations and 16 variables in the data sample.

Fig. 10.4
A screenshot of a table lists the segment, country, product, discount band, units sold, manufacturing price, sale price, and gross sales. The imported and stored data with the console is also presented.

Example dataset imported and stored as an R object in RStudio

# Step 3—Conduct Chi-squared (X2) test (used for categorical variables only)

Now we can proceed to analyze the imported dataset that we stored as Chisqd.data in the R environment (see: Fig. 10.4).

As defined in the introduction section (Sect. 10.1):

  • The Chi-squared (X2) statistic tests the relationship (association) between two variables, each with two or more levels or independent groups.

  • The target variable(s) must be a “categorical” data type.

To demonstrate how to perform the two main types of the Chi-squared (X2) test or analysis (i.e., Independence test, and Goodness of fit test), we will be using the R function called chisq.test( ) to conduct the following test in R:

  1. For the Independence test, we will test whether the two categorical variables “Segment” and “Discount.Band” in the imported dataset (see: Fig. 10.4) are related (correlated).

  2. For the Goodness of Fit test, we will check how well the observed groups in the “Discount.Band” variable match the expected population.

The syntax and code for performing the above Chi-squared (X2) tests in R are shown below and represented in Fig. 10.5 (Step 3, Lines 24 to 52).

Fig. 10.5
A screenshot of the program code to conduct chi-squared tests in the R object for the independence test, first create contingency, perform independence test, goodness fit test, create contingency, check the proportion, and visualize the expected proportion.

Chi-squared (X2) Test in R using the chisq.test( ) function

Two program codes. A. Test 3 conducts an independence test between 2 variables using a chi-square test. B. Test 3, The goodness of fit test. It creates contingency for a target variable group.
A program code for the contingency table with a null hypothesis to check the frequency of the target variable and to visualize the expected proportion.
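Since the listing itself is only described here, the sketch below illustrates the two tests on simulated data rather than on the downloaded file. The object names follow those used in the text; the Segment labels follow Fig. 10.7a, while the Discount.Band labels and the random values are assumptions, so the numerical results will differ from those reported in Fig. 10.6.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated stand-in for the example dataset (values are NOT the real data)
segments <- c("Channel Partners", "Enterprise", "Government",
              "Midmarket", "Small Business")
bands <- c("None", "Low", "Medium", "High")  # assumed labels
Chisqd.data <- data.frame(
  Segment       = factor(sample(segments, 700, replace = TRUE), levels = segments),
  Discount.Band = factor(sample(bands, 700, replace = TRUE), levels = bands)
)

# Test 3(A): Independence test between the two categorical variables
Ind_table <- table(Chisqd.data$Segment, Chisqd.data$Discount.Band)
ChiSqd_Ind_test <- chisq.test(Ind_table)
print(ChiSqd_Ind_test)

# Test 3(B): Goodness of Fit test for a single categorical variable;
# with no 'p' argument the null hypothesis is equal proportions
GoFit_table <- table(Chisqd.data$Discount.Band)
ChiSqd_GoFit_test <- chisq.test(GoFit_table)
print(ChiSqd_GoFit_test)
```

With five segments and four discount bands, the Independence test has (5 − 1) × (4 − 1) = 12 degrees of freedom and the Goodness of Fit test has 4 − 1 = 3, matching the degrees of freedom reported for the real data.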

Note: appending $expected to the result of the chisq.test( ) function (see: Line 36, Fig. 10.5) returns a Contingency table containing the counts that would be expected if the null hypothesis (H0) were true.
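To show where the expected counts come from, the following sketch rebuilds them from the row and column totals of a small hypothetical table and compares them with the $expected element returned by chisq.test( ).

```r
# Hypothetical 2 x 3 contingency table (illustrative counts only)
tab <- matrix(
  c(12, 18, 20, 30, 28, 22),
  nrow = 2,
  dimnames = list(Row = c("r1", "r2"), Col = c("c1", "c2", "c3"))
)

res <- chisq.test(tab)

# Under H0 (independence): expected count = row total * column total / grand total
manual_expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

print(res$expected)
print(manual_expected)
```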

Once the user has successfully run the code (Step 3, Fig. 10.5), the results are presented in the Console as shown in Figs. 10.6a and b, respectively.

As shown in the results in Figs. 10.6a and b, we conducted the “Independence” (Test 3(A)) and “Goodness of Fit” (Test 3(B)) tests in R using the Chi-squared (X2) function chisq.test( ). This was illustrated with the two categorical variables “Segment” and “Discount.Band” contained in the example dataset “sample-csv-file-for-testing” (Sample CSV Files), which we stored as an R object named “Chisqd.data”. The results were stored as R objects that we called “ChiSqd_Ind_test” and “ChiSqd_GoFit_test”, respectively (Figs. 10.6a and b).

# Step 4—Plot and visualize the categorical variables and data relationships.

In Fig. 10.7a (Step 4, Lines 57 to 62), the authors used the ggplot( ) function in R to visualize the relationship (correlation) between the two categorical variables “Segment” and “Discount.Band” in the example dataset named “Chisqd.data”.

The syntax and code used to plot the association are shown below, and the resulting chart is represented in Fig. 10.7a.

A program code to visualize the association of two categorical variables.
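A stacked bar chart of this kind can be sketched as follows. The data are simulated stand-ins (the Discount.Band labels are assumed), and the plot is only drawn if the ggplot2 package is installed; the exact styling of Fig. 10.7a is not reproduced.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated stand-in for the example dataset (values are NOT the real data)
segments <- c("Channel Partners", "Enterprise", "Government",
              "Midmarket", "Small Business")
bands <- c("None", "Low", "Medium", "High")  # assumed labels
Chisqd.data <- data.frame(
  Segment       = factor(sample(segments, 700, replace = TRUE), levels = segments),
  Discount.Band = factor(sample(bands, 700, replace = TRUE), levels = bands)
)

# Counts behind the chart: one row per Segment / Discount.Band combination
counts <- as.data.frame(table(Segment = Chisqd.data$Segment,
                              Discount.Band = Chisqd.data$Discount.Band))

# Stacked bar chart: count per discount band, filled by segment
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(Chisqd.data, aes(x = Discount.Band, fill = Segment)) +
    geom_bar() +
    labs(x = "Discount band", y = "Count", fill = "Segment")
  print(p)
}
```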

Additionally, in Fig. 10.7b (Step 4, Lines 65 to 71), we made use of the assocstats( ) and mosaic( ) functions in R to plot the results of the Chi-squared Independence test. Technically, the mosaic( ) method has the advantage of combining the Contingency table and the result of the Chi-square (X2) test of independence.

The code used to plot the contingency table and the result of the Chi-squared test of independence is shown below, and the resulting chart is represented in Fig. 10.7b.

A program code to combine plot and statistical test in an R object.
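The following is a minimal sketch of such a plot on simulated stand-in data (the Discount.Band labels are assumed). It uses assocstats( ) and mosaic( ) from the vcd package when that package is installed; otherwise it falls back to mosaicplot( ) from base R, which is a substitute introduced here and not part of the original listing.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated stand-in for the example dataset (values are NOT the real data)
segments <- c("Channel Partners", "Enterprise", "Government",
              "Midmarket", "Small Business")
bands <- c("None", "Low", "Medium", "High")  # assumed labels
tab <- table(
  Segment       = factor(sample(segments, 700, replace = TRUE), levels = segments),
  Discount.Band = factor(sample(bands, 700, replace = TRUE), levels = bands)
)

if (requireNamespace("vcd", quietly = TRUE)) {
  # Association statistics, including the Pearson Chi-squared test
  print(vcd::assocstats(tab))
  # Mosaic plot of the contingency table, shaded by Pearson residuals
  vcd::mosaic(tab, shade = TRUE, legend = TRUE)
} else {
  # Base R fallback: shaded mosaic plot of the same table
  mosaicplot(tab, shade = TRUE, main = "Segment by Discount.Band")
}
```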

# Step 5—Results Interpretation for the Chi-squared test.

The final step for the Chi-squared (X2) test and analysis is to interpret and understand the results of the test/method.

By default, the hypotheses for the two main tests (Independence test and Goodness of Fit test), considering the selected categorical variables “Segment” and “Discount.Band” (see: Fig. 10.4), are as follows:

Test of Independence:

  • (H1) IF the p-value of the test is less than or equal to 0.05 (p ≤ 0.05), THEN we can assume that the two categorical variables are associated, i.e., there is a relationship (correlation) between them. In other words, knowing the value of one variable helps to predict the value of the other.

  • (H0) ELSE IF the p-value is greater than 0.05 (p > 0.05), THEN we can say that the two categorical variables are not related, i.e., there is no relationship (correlation) between them. Therefore, knowing the value of one variable does not help to predict the value of the other, and vice versa.

A program code for Pearson's chi-squared test. The chi-square test result indicates a significant difference between observed and expected values of p equals 0.008. The higher X-squared value of 26.825 with 12 degrees of freedom supports this conclusion.

Fig. 10.6
A program code presents the results of the Chi-squared (X2) test in an R object to create the contingency tables for target variables, code to view the expected table values and perform an independence test.

a Results of the (Independence) Chi-squared (X2) test in R. b Results of the (Goodness of Fit) Chi-squared (X2) test in R

Fig. 10.7
a. Program code to visualize the association of variables, console, and environment syntax. A stacked bar graph plots the count versus the discount band for channel partners, enterprises, government, midmarket, and small business. b. Program code to combine plot and statistical test with the console and environment with a data plot and chi-squared result for the small business market, segment government, and enterprise partners.

a Plot representing correlation (relationship) between two categorical variables using the ggplot( ) function in R. b Plot representing the Contingency table and the Chi-squared test of Independence using the assocstats( ) and mosaic( ) functions in R

As shown in the results of the test represented above (see: Fig. 10.6a), the output of the Independence Chi-squared (X2) test is a list containing the following:

  • Statistic: X2 (X-squared) = 26.825 is the value of the Chi-squared test statistic.

  • p-value: p-value = 0.008189 is the p-value of the test, which is compared against the chosen significance level (0.05).

Statistically, we can see from the reported result (p-value = 0.008189) that the p-value is less than the stated significance level (p ≤ 0.05) deemed scientifically acceptable. Therefore, we reject H0 and accept H1, concluding that there is a statistically significant relationship (association) between the “Segment” and “Discount.Band” variables in the analyzed data.
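As a cross-check of this interpretation, the reported p-value can be recovered from the reported statistic and degrees of freedom using the Chi-squared distribution functions in base R. This is a minimal sketch; the variable names are chosen here for illustration.

```r
# Values reported for the Independence test (Fig. 10.6a)
x_squared <- 26.825
df        <- 12
alpha     <- 0.05  # significance level used in this chapter

# p-value: probability of a statistic at least this large if H0 were true
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# Equivalent view: compare the statistic with the critical value at alpha
critical_value <- qchisq(1 - alpha, df = df)

reject_h0 <- p_value <= alpha
print(c(p_value = p_value, critical_value = critical_value))
print(reject_h0)
```

Both views lead to the same decision: the statistic exceeds the critical value exactly when the p-value falls below the significance level.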

Likewise, for the Goodness of Fit test, which we conducted to determine whether the observed groups of the “Discount.Band” variable match the expected population:

Goodness of Fit test:

  • (H1) IF the p-value of the test is less than or equal to 0.05 (p ≤ 0.05), THEN we assume that the observed groups of the “Discount.Band” (categorical) variable do not match the expected distribution (i.e., they are not commonly distributed) and do not fit the expected population. In other words, the different groups in the “Discount.Band” variable are not representative of the studied population and may consequently not be used to make definite conclusions about it.

  • (H0) ELSE IF the p-value is greater than 0.05 (p > 0.05), THEN we can say that the observed groups of the “Discount.Band” (categorical) variable match the expected distribution (i.e., they are commonly distributed) and are capable of representing the expected population. Thus, the different groups in the “Discount.Band” variable are representative of the studied population and can be used to make conclusions about it.

A program code of the Chi-squared test for given probabilities (Goodness of Fit test). The values are as follows. X-squared = 10.58. df = 3. p-value = 0.01423.

Accordingly, as shown in the Goodness of Fit test result presented above (see: Fig. 10.6b), the p-value (p-value = 0.01423) is less than the stated significance level (p ≤ 0.05). Therefore, we reject H0 and accept H1, concluding that the different groups of the “Discount.Band” variable are not proportionately (commonly) distributed and are not representative of the expected population.
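The same decision can be written as a small reusable rule. In the sketch below, decide( ) is a hypothetical helper introduced for illustration; the statistic and degrees of freedom are those reported for the Goodness of Fit test, where the degrees of freedom equal the number of groups minus one.

```r
# Hypothetical helper: apply the decision rule used throughout this chapter
decide <- function(p_value, alpha = 0.05) {
  if (p_value <= alpha) "reject H0" else "retain H0"
}

# Values reported for the Goodness of Fit test (Fig. 10.6b)
x_squared <- 10.58
n_groups  <- 4             # implied by the reported df = 3
df        <- n_groups - 1

# Upper-tail probability of the Chi-squared distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

print(p_value)
print(decide(p_value))
```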

3 Conclusion

In this chapter, the authors demonstrated how to conduct the two main types of Chi-squared (X2) tests in R. This includes the “Independence” and “Goodness of Fit” tests covered in Sect. 10.2.

The chapter also covered how to plot the results of the Chi-squared tests and the Contingency table, and discussed in detail how to interpret and understand the results of the test in R.

In summary, the main contents covered in this chapter are as follows:

  • The Chi-squared (X2) statistic measures how expectations compare with the observed data or the results of a model. It is mainly used to determine whether a relationship (association) exists between two categorical variables.

  • The test of “Independence” checks the association between two categorical variables.

  • The “Goodness of fit” test checks whether there is a significant difference between the observed frequency values and the expected frequency values for a specific variable.

  • In either case (Independence test or Goodness of Fit test), the researcher or data analyst must create a Contingency table, otherwise referred to as “frequency table” before applying the Chi-squared (X2) method.