1 Introduction

The world produces extremely large volumes of data every day. Data can be produced by humans directly, through our interactions with social networks or by entering data into electronic systems, or solely by machines, such as systems logging the activities of humans or other machines. Data production and storage have become the norm in every aspect of our lives; data are used in healthcare, banking, education, and even in homes. With the adoption of new technologies such as the Internet of Things (IoT) and advancements in technologies to produce and store data, the world is expected to produce more and more data.

Data can be stored in many forms and in different ways. Data stored in relational databases and spreadsheets are called structured data. Alternatively, data such as text and multimedia content (e.g., videos, photographs, and audio) that do not fit a traditional row–column database are called unstructured data. Furthermore, data that contain semantic tags (such as e-mail messages, XML, and HTML) but are not stored in relational databases are called semistructured data. Because of this diversity, many methods and techniques have been used to mine each type. Data also differ in the scope and the field that they belong to, and data-mining (DM) experts deal with them differently. For instance, the objective in mining educational data can differ from the objective in mining healthcare data: it could be acceptable to build a model that predicts student performance with 85% accuracy, whereas 85% accuracy is not acceptable for predicting the success rate of a drug treatment.

DM techniques can be predictive or descriptive. Predictive methods use variables to predict unknown or future values of one or more variables. Examples of predictive methods are classification, regression, and deviation detection [1]. Descriptive methods such as clustering, association rule discovery, and sequential pattern discovery find human-interpretable patterns that describe data [1].

Regression techniques aim to predict the value of a continuous attribute on the basis of the values of other attributes [1]. The simplest form of regression is linear regression, where the class is a linear combination of attributes with predetermined weights. On the other hand, classification methods aim to predict the value of a discrete attribute on the basis of the values of other attributes. Classification has many approaches, including decision tree (DT), Bayesian classifier or network, neural network (NN), support vector machine (SVM), and logistic regression approaches. Logistic regression allows us to use regression for classification.

Clustering measures the similarity between data points or variables; in other words, it groups similar data points on the basis of their attributes, and each group is called a cluster [1]. Clustering methods are divided, in general, into two groups: partitioning methods and hierarchical methods. Association rule mining produces dependency rules from strongly associated attribute values [2]. There are many association rule–mining algorithms, such as Apriori, Equivalence Class Transformation (Eclat), and frequent pattern growth (FPGrowth) [3].

Many researchers have used DM techniques such as regression, classification, clustering, and association rule mining in the field of education. They have applied these analyses in traditional education, web-based education (e-learning), and learning management systems [4]. Their studies have been conducted to accomplish many tasks related to students’ learning processes, including providing feedback, predicting student performance, and detecting student behavior [4]. In traditional education, there are hidden patterns in data that are difficult for instructors and administrators to notice without the help of DM techniques. Such knowledge could be useful to students’ learning processes in many ways. Processes such as students registering in courses, and instructors making or changing major plans or advising students, take time and effort during every semester. With the use of students’ historical data that universities collect over the years, DM can help to enhance these processes. As a result, instructors and students can make better-informed decisions.

In this chapter, we report our work in finding an association between courses in a Computer Science (CS) program on the basis of students’ grades. We report our work in discovering interesting rules by using different parameters such as support, confidence, lift, Kulczynski (Kulc), and the imbalance ratio (IR). We used the Apriori algorithm to mine data on undergraduate students majoring in CS at our university. The rest of this chapter is structured as follows: we review research in DM and association mining in Sect. 13.2. In Sect. 13.3 we give a detailed description of our experiments in using association rule mining to find interesting rules. We give our conclusion in Sect. 13.4.

2 Related Work

In this section, we review some related work that has used DM in education. We start by discussing previous work in terms of the classification methods used and the challenges associated with them. Then, we give a brief background of association rule mining and how we measure rules’ interestingness. After that, we review related work that has used association rule mining in education.

2.1 Classification in Education

Classification is one of the most popular DM techniques that have been used in research to analyze educational data. Several studies have been conducted to review the literature in this area (e.g., a study by Shahiri et al. [5]) and to compare classification methods that have been used [5, 6]. For instance, Hämäläinen and Vinni [6] compared classification methods used on educational data on the basis of eight general criteria.

Many attributes have been used to predict students’ performance, such as previous courses, standardized examination scores, or preuniversity scores. Different studies have used different attributes in their analyses. Shahiri et al. [5] reviewed important attributes used in prediction, such as the Cumulative Grade Point Average (CGPA) [7, 8], internal assessment [8, 9], students’ demographic data [10, 11], external assessment, and psychometric factors. According to Shahiri et al. [5], the most popular task is classification, for which many algorithms have been used, such as DT [7,8,9,10, 12], artificial neural networks (ANNs) [7, 8, 12], naïve Bayes [9, 10, 12], K-nearest neighbor [10, 11], and SVM. Shahiri et al. [5] mentioned that most researchers have used CGPA and internal assessment as data attributes, and ANN and DT as classification techniques.

Several studies [7,8,9,10,11, 13] have focused on predicting students’ performance by using DM techniques alone. The prediction of overall performance or performance in a specific course could be based on different attributes such as students’ preuniversity data [7, 10], average of course attendance [11], grades in courses [13], marks in the course [8, 9], and behavioral features in an e-learning education system [12]. Most of the related studies that we reviewed used classification algorithms for prediction, including a nearest neighbor algorithm (IBk) [10], rule learners (OneR and JRip) [9, 10], classification based on an association rules algorithm [13], linear regression models [7], a classification and regression tree (C&RT), and chi-squared automatic interaction detection (CHAID) [8]. To compare the models, the studies used different evaluation measures such as accuracy (which was used by most of the studies), precision, recall, and F-measures [12].

2.2 Background to Association Rule Mining

Association rule mining is a DM technique that discovers interesting relations between attributes and then generates rules that represent these relations. These rules do not imply a causal relationship; for example, the rule (A ⇒ B) does not mean that A causes B. However, the rules do imply an association relationship between attributes (that is, those attributes go together) [14]. As we have mentioned above, the popular algorithms in association rule mining are Apriori, Eclat, and FPGrowth [3].

The Eclat algorithm uses a vertical data format where each item is stored together with its list of TIDs (that is, the IDs of the transactions that contain this item) [3, 15]. Eclat uses the Apriori join step only to generate the candidate itemsets [15]. To compute the support for an itemset, Eclat uses an intersection-based approach [15]: intersecting TID lists gives fast frequency counting, which is an advantage of the vertical format that Eclat uses [16]. However, Eclat has a drawback when the intermediate vertical TID lists become too large for memory [16].
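The vertical layout and intersection-based support counting can be illustrated with a short Python sketch (the transactions and item names below are invented for illustration, not taken from this chapter's data):

```python
from functools import reduce

# Toy transaction database: TID -> set of items (hypothetical data)
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"a", "d"},
    4: {"b", "e"},
}

# Vertical data format: item -> set of TIDs that contain it
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def support_count(itemset):
    """Support count of an itemset = size of the intersection of its TID lists."""
    return len(reduce(set.intersection, (tidlists[i] for i in itemset)))

print(support_count({"a"}))       # "a" occurs in TIDs 1, 2, 3 -> 3
print(support_count({"a", "c"}))  # {1, 2, 3} & {1, 2} = {1, 2} -> 2
```

The drawback mentioned above is visible here: for large databases, the intermediate TID sets produced by these intersections can themselves grow too large for memory.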

The FPGrowth algorithm was developed on the basis of a data structure called a frequent pattern tree (FP-tree) [17]. It stores the database in main memory using a combination of vertical and horizontal database layouts [17]: the transactions are stored in a tree structure, and each item has a linked list running through all transactions that contain that item [17]. By focusing on frequent pattern growth, it avoids the cost of the candidate generation that Apriori performs [3]. However, on data sets with long patterns, it consumes more memory and performs badly [17, 18].

Apriori is a basic algorithm for finding frequent itemsets [3], and it is the one that we used in this work. Apriori sometimes produces a huge number of rules, making it difficult to use all of them. Because of that, and for other reasons that will be mentioned later, we must use interestingness measures, which evaluate how interesting each rule is and make it easier to distinguish between the rules.

Measures of Interestingness

In Apriori, patterns can be represented in the form of association rules. For a rule (A ⇒ B), we have many measures of interestingness, such as support, confidence, lift, Kulczynski, and the IR. Such measures are useful for distinguishing between rules, especially when the number of resulting rules is huge. In addition, we should distinguish between uninteresting rules, which present an obvious fact, and new rules that could be interesting and useful; it is common in association rule mining to get a large number of rules that present facts we already know [19]. More details about the measures are as follows:

  • The absolute support (also known as the support count, count, or occurrence frequency) of itemsets A and B is the number of transactions that contain both itemsets, corresponding to the probability P(A ∪ B) [3]. The support of the rule (A ⇒ B) (sometimes referred to as relative support) is the ratio of the number of transactions containing A and B to the number of transactions in the database [20]:

    $$ \mathrm{Support}\left(A\Rightarrow B\right)=\frac{\mathrm{count}\left(A\cup B\right)}{n} $$
    (13.1)
  • The confidence of a rule (A ⇒ B) is the ratio of the number (or count) of transactions that contain A and B to the number of transactions that contain A; this is the conditional probability P(B|A) [3, 20]. The confidence value is calculated as shown below [20]:

    $$ \mathrm{Confidence}\left(A\Rightarrow B\right)=\frac{\mathrm{count}\left(A\cup B\right)}{\mathrm{count}(A)} $$
    (13.2)
  • The lift of a rule measures how many more times A and B occur together in transactions than would be expected if A and B were statistically independent (not correlated). A lift value equal to 1 means that A and B are independent and there is no correlation between them; a value less than 1 means that the occurrence of A is negatively correlated with the occurrence of B; and a value greater than 1 means that A and B are positively correlated, and the occurrence of one implies the occurrence of the other. For example, if the lift of the rule (A ⇒ B) is greater than 1, then we could say that the occurrence of A increases (or “lifts”) the likelihood of the occurrence of B by a factor of the lift’s value [3]. The lift of a rule can be calculated as follows [20]:

    $$ \mathrm{Lift}\left(A,B\right)=\frac{P\left(A\cup B\right)}{P(A)P(B)} $$
    (13.3)
    or, equivalently,
    $$ \mathrm{Lift}\left(A\Rightarrow B\right)=\frac{\mathrm{confidence}\left(A\Rightarrow B\right)}{\mathrm{support}(B)} $$
    (13.4)
  • The Kulczynski measure of A and B (abbreviated as Kulc) is the average of two confidence values, that is, of the two conditional probabilities: the probability of itemset B given itemset A, and the probability of itemset A given itemset B. Its range is from 0 to 1, and the higher the value, the closer the relationship between A and B [3]. Kulc = 0.5 signifies neutral or balanced skewness, whereas the further the value is from 0.5, the more skewed the relationship between the two itemsets [21]. Kulc is defined by Eq. (13.5).

    $$ \mathrm{Kulc}\left(A,B\right)=\frac{1}{2}\left(P\left(A|B\right)+P\left(B|A\right)\right) $$
    (13.5)
  • The imbalance ratio (IR) assesses the imbalance of two itemsets (A and B) in rule implications [3]. Its range is from 0 to 1; IR = 0 means that the two directional implications between A and B (A ⇒ B and B ⇒ A) are the same, which means the rule is not interesting, whereas IR = 1 means the rule is highly skewed, or very interesting [21]. IR is calculated by Eq. (13.6):

    $$ \mathrm{IR}\left(A,B\right)=\frac{\left|\sup (A)-\sup (B)\right|}{\sup (A)+\sup (B)-\sup \left(A\cup B\right)} $$
    (13.6)
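All five measures can be computed directly from transaction counts. The Python sketch below follows Eqs. (13.1), (13.2), and (13.4)–(13.6); the counts at the end are hypothetical values for a database of 100 transactions:

```python
def support(count_ab, n):
    """Relative support of (A => B): count(A u B) / n  (Eq. 13.1)."""
    return count_ab / n

def confidence(count_ab, count_a):
    """Confidence of (A => B): count(A u B) / count(A) = P(B|A)  (Eq. 13.2)."""
    return count_ab / count_a

def lift(count_a, count_b, count_ab, n):
    """Lift: confidence(A => B) / support(B)  (Eq. 13.4)."""
    return confidence(count_ab, count_a) / (count_b / n)

def kulc(count_a, count_b, count_ab):
    """Kulczynski: average of P(A|B) and P(B|A)  (Eq. 13.5)."""
    return 0.5 * (count_ab / count_b + count_ab / count_a)

def imbalance_ratio(count_a, count_b, count_ab):
    """IR: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A u B))  (Eq. 13.6)."""
    return abs(count_a - count_b) / (count_a + count_b - count_ab)

# Hypothetical counts in a database of n = 100 transactions
n, count_a, count_b, count_ab = 100, 60, 40, 36
print(support(count_ab, n))                         # 36/100 = 0.36
print(confidence(count_ab, count_a))                # 36/60 = 0.6
print(lift(count_a, count_b, count_ab, n))          # 0.6 / 0.4 = 1.5
print(kulc(count_a, count_b, count_ab))             # 0.5 * (0.9 + 0.6) = 0.75
print(imbalance_ratio(count_a, count_b, count_ab))  # 20 / 64 = 0.3125
```

Note that Kulc and IR take only the three counts as input, with no dependence on n; this is the null-invariance property discussed below, whereas lift changes with n.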

Mining association rules is a two-step process: first, we must find all frequent itemsets that satisfy the minimum support threshold (Min_sup) specified by the user; second, from these itemsets, we must generate the association rules that satisfy the minimum confidence threshold (Min_conf), and these rules are called strong [3]. However, items whose support is below (or far below) a user-specified minimum support threshold are called infrequent (or rare) items [22]. Rare items are caused by an imbalance in the data set, where some items have very high frequency, count, or support while others have very low support, and the resulting rules mostly cover only the items with high support [3, 22]. To mine rare items, several methods can be applied, such as balancing techniques or rare association rule–mining algorithms [23]. Another way, which is the simplest, is to set the minimum support threshold to a low value [22].
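The two steps can be sketched in Python as follows. This is a simplified illustration on toy transactions (real Apriori implementations add the candidate-pruning step and more efficient counting):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup_count):
    """Step 1: level-wise search for all itemsets with count >= the Min_sup count."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    freq, k = {}, 1
    while current:
        level = {s: count(s) for s in current}
        level = {s: c for s, c in level.items() if c >= min_sup_count}
        freq.update(level)
        k += 1
        # Simplified candidate generation: join frequent (k-1)-itemsets
        current = {a | b for a in level for b in level if len(a | b) == k}
    return freq

def strong_rules(freq, min_conf):
    """Step 2: from each frequent itemset, keep rules with confidence >= Min_conf."""
    rules = []
    for itemset, cnt in freq.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = cnt / freq[lhs]  # every subset of a frequent itemset is frequent
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Toy data: 5 transactions over items a, b, c
T = [frozenset(t) for t in ({"a","b"}, {"a","b","c"}, {"a","c"}, {"b","c"}, {"a","b","c"})]
freq = frequent_itemsets(T, min_sup_count=3)
print(len(freq))                      # 6 frequent itemsets: a, b, c, ab, ac, bc
print(len(strong_rules(freq, 0.75)))  # 6 strong rules, each with confidence 0.75
```

Lowering `min_sup_count` to 2 in this example would also admit the itemset {a, b, c}, illustrating how a low minimum support threshold inflates the number of strong rules.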

If we choose to set the minimum support threshold to a low value, we will produce a huge number of strong rules. When the resulting strong rules are huge in number, lift will help to rank or filter them [20]. Another reason to use lift is to avoid misleading “strong” rules, because not all strong rules are interesting [3]. However, lift is influenced by null transactions, that is, transactions that do not contain any of the itemsets being examined; for example, a transaction that contains neither itemset A nor itemset B is a null transaction with respect to the rule (A ⇒ B). If the value of an interestingness measure is not influenced by null transactions, it is called a null-invariant measure.

Null invariance is an important property for measuring association patterns in large transaction databases. Lift is not a null-invariant measure, whereas Kulc and IR are null-invariant because they are not influenced by null transactions [3, 21]. Because of that, lift has more difficulty than Kulc and IR in distinguishing interesting association relationships. As far as we know (from reviewing the studies by Jiawei [3], Gupta and Arora [21], Wu et al. [24], and Gopalakrishnan [25]), the three measures lift, Kulc, and IR can be used together as follows:

  • If the Kulc value is close to 1, then the left-hand side (LHS) and the right-hand side (RHS) are positively correlated [3, 24].

  • If the Kulc value is close to 0, then the LHS and RHS are negatively correlated [3, 24].

  • If Kulc = 0.5 (that is, neutral) [3, 21, 24], then check the IR value.

    • If IR = 0, then it is not an interesting rule [21].

    • If the IR value is close to 1, then the rule might be worth looking at [21].

2.3 Association Rule Mining in Education

Association rule mining has been applied in education, mostly using the Apriori algorithm. The relevant studies have used Apriori for several objectives. Some of them have used it to predict students’ performance [14, 26, 27] and to provide a good placement for a student by matching the organization’s requirement with the student’s profile [26]. In a study by Kasih et al. [14], the prediction of the students’ final results was based on their performance in eight courses in the first four semesters, while in a study by Borkar and Rajeswari [27], the prediction was achieved by finding the association between attributes such as attendance and assignments. In a study by Ahmed et al. [28], the authors used students’ academic and personal data to discover their impact on the students’ performance; they extracted the association rules related to the impacts of sex, residence, retention, etc.

Other studies have used students’ admission data [29, 30]. In a study by Mashat et al. [29], the data represented applicant student information and their status of being rejected or accepted for enrollment at the university. The researchers applied Apriori to the whole data set and then to the accepted applicants and the rejected applicants separately. The resulting rules were presented and interpreted with respect to the admissions office perspective [29]. Abdullah et al. [30] applied the SLPGrowth (Significant Least Pattern Growth) algorithm and two measures—lift and critical relative support (CRS)—to a student admission data set.

Damaševičius [31] aimed to improve the content of an informatics course. He used association rules and ranked course topics on the basis of their importance to the final course marks. He also proposed a novel metric called “cumulative interestingness” for assessing the strength of an association rule. Vranic et al. [19] used data on an electrical engineering fundamentals course with general data to predict the success of the next year’s students in this course.

Upendran et al. [32] proposed a course recommendation system that suggested courses for new students. They used the Apriori algorithm to generate rules using previous students’ marks in core courses and focused on rules with success as a consequence. These rules were used to suggest courses for new students where they had a high probability of success.

Some studies have associated courses with students’ grades; for example, Buldu and Üçgün [33] were interested in the relation between courses in which students failed. In the study by Ahmed et al. [28], the resulting association rules showed that the grade in one course might depend on prerequisite courses. In the study by Upendran et al. [32], marks in core subjects such as Mathematics, Physics, Chemistry, Biology, Computer Science, and English were considered as attributes. Table 13.1 summarizes some of the studies’ experimental settings and the Apriori parameters that they applied.

Table 13.1 Summary of studies that have used the Apriori algorithm

3 Experimental Settings and Results

In this section, we present some of the experiments that we did to mine association rules by using the Apriori algorithm to find associations between CS courses based on the students’ grades. We present the settings of the experiments, including the data preprocessing, and the results of these experiments.

3.1 Experimental Settings

In this subsection, we show how we preprocessed the data, then we show how we selected the values of the Apriori parameters.

Data Preprocessing

Our aim in this study was to find an association between CS courses based on students’ grades. Therefore, the items in the data set should be in the form (course = G) where course is the course code and G is the grade, G ∈ {A, B, C, D, F}. For example, CS140 = A, CS140 = C, CS322 = A, and CS322 = F.

We started by translating the raw data set from Arabic into the English language, and we selected CS students’ data only. Then we transformed the “date of birth” attribute from “dd/mm/yyyy” to “yyyy”; the results are presented as a screenshot (Fig. 13.1). Notice that a student is represented by multiple rows. For instance, if a student studied 50 courses, then he or she would have 50 rows: one row for each course.

Fig. 13.1 Data set after deletion of some attributes

After that, the grades were merged into five categories (A, B, C, D, F): A+ was merged with A, B+ with B, and so on. Grades that did not belong to any of the five categories were deleted. For instance, the grades in course CS480 (Practical Training) had two values—NP signified success without a grade and NF signified failure without a grade—and most of the students got NP, so we deleted these grades. We also deleted attributes that did not seem useful, especially those with the same value for all students, such as the major and residential area attributes.

Then, to represent each student by one row, we transformed rows into columns; the results are presented as a screenshot (Fig. 13.2), which shows some of the attributes. In this transformation, we kept only the first attempt (that is, the first grade that a student got when he or she studied the course for the first time). To handle missing data, we selected the graduated students, since their records had fewer missing data/grades. This left us with 833 graduated CS students.

Fig. 13.2 Data set after transformation of rows to columns
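The grade-merging, first-attempt, and row-to-column steps can be sketched in Python. The records, column layout, and merge map below are hypothetical stand-ins for the real data set:

```python
# Hypothetical long-format records: (student_id, course, attempt_no, raw_grade)
records = [
    ("s1", "CS140", 1, "A+"),
    ("s1", "CS140", 2, "A"),   # a retake; only the first attempt should be kept
    ("s1", "CS322", 1, "F"),
    ("s2", "CS140", 1, "B+"),
    ("s2", "CS480", 1, "NP"),  # grade outside the five categories: deleted
]

MERGE = {"A+": "A", "B+": "B", "C+": "C", "D+": "D"}  # assumed merge map
VALID = {"A", "B", "C", "D", "F"}

# One "row" per student: student_id -> {course: grade}
wide = {}
for student, course, attempt, grade in sorted(records, key=lambda r: (r[0], r[1], r[2])):
    grade = MERGE.get(grade, grade)  # merge A+ into A, B+ into B, etc.
    if grade not in VALID:
        continue                     # drop NP/NF and similar grades
    wide.setdefault(student, {}).setdefault(course, grade)  # first attempt wins

print(wide)  # {'s1': {'CS140': 'A', 'CS322': 'F'}, 's2': {'CS140': 'B'}}
```

Each remaining (course, grade) pair then becomes an item of the form (course = G) described above.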

Finally, to prepare the data set for association rule mining and to find the associations between courses, we kept only the attributes that represented CS courses—the general courses and the computer-specialized (mandatory and elective) courses—and deleted the rest of the attributes. The CS study plan had 60 courses, and for each course a student could get one of five grades (A, B, C, D, F), which meant that we had 60 × 5 = 300 items of the form (course = G).

3.1.1 Setting of Apriori Parameters

After conducting many experiments and reading related works, we set the minimum support at 0.01 and the minimum confidence at 0.8. This minimum support meant that approximately 94% of the items would be included in the mining process, while items with support of less than 0.01 would not be included. It also corresponded to 1% of the students, which was approximately 8 students out of 833 (1% × 833 ≈ 8.33) or 4 students out of 483 (1% × 483 ≈ 4.83) [29].

Besides using support and confidence to measure the rules’ interestingness, we decided to use lift, Kulc, and IR for the reasons mentioned in Sect. 13.2. In addition, on the basis of the relation between the three measures, we wrote the pseudocode below to evaluate the rules. As in the studies by Bramer [20] and Angeline [26], we used lift to rank or filter the rules, then we applied the pseudocode. Table 13.2 lists the values that we chose for Kulc and IR.

Table 13.2 Kulczynski (Kulc) and imbalance ratio (IR) values and their meanings

The itemset number was set at 2, 3, 4, and 5 to specify the length of the rules. This helped us to focus on each subset of the rules; for example, when we analyzed 2-itemset rules, we focused on the relationship between two courses only, so if the relation was positive, then getting a high grade in one course would be associated with getting a high grade in the other, and vice versa.

Pseudocode Used for Rule Filtering

Sort according to lift and lift > 1
If Kulc close to 1 then
    LHS and RHS are positively correlated
Else if Kulc close to 0 then
    LHS and RHS are negatively correlated
Else if Kulc close to 0.5 then            // Neutral, use IR to help find the imbalance
    If IR very close to 1 then            // Imbalanced
        The rule might be worth looking at: very imbalanced case
    Else if IR relatively close to 1 then // Imbalanced
        The rule might be worth looking at: imbalanced case
    Else if IR close to 0 then            // Balanced
        Not interesting rule
    Else "neutral"                        // Kulc close to 0.5 and IR between 0.3 and 0.6
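A possible Python rendering of this rule-filtering procedure is sketched below. The numeric thresholds for "close to" are our illustrative assumptions, not exact cut-offs from the chapter:

```python
def evaluate_rule(lift, kulc, ir):
    """Classify a rule following the filtering pseudocode above.
    Thresholds (0.8, 0.2, 0.9, 0.6, 0.3) are illustrative assumptions."""
    if lift <= 1:
        return "filtered out (lift <= 1)"
    if kulc >= 0.8:   # Kulc close to 1
        return "LHS and RHS positively correlated"
    if kulc <= 0.2:   # Kulc close to 0
        return "LHS and RHS negatively correlated"
    # Kulc close to 0.5 (neutral): use IR to judge the imbalance
    if ir >= 0.9:     # IR very close to 1
        return "worth looking at: very imbalanced case"
    if ir >= 0.6:     # IR relatively close to 1
        return "worth looking at: imbalanced case"
    if ir <= 0.3:     # IR close to 0
        return "not an interesting rule"
    return "neutral"  # Kulc near 0.5 and IR between 0.3 and 0.6

print(evaluate_rule(lift=1.4, kulc=0.55, ir=0.95))  # worth looking at: very imbalanced case
print(evaluate_rule(lift=1.2, kulc=0.50, ir=0.10))  # not an interesting rule
print(evaluate_rule(lift=0.9, kulc=0.85, ir=0.40))  # filtered out (lift <= 1)
```

In practice, such a function would be applied to each strong rule after ranking by lift, as described above.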

3.2 Results of the Experiments

In the experiments, we used two data sets. In the first data set, the instances were CS graduated students (833 students). In the second data set, we kept students who failed at least one course (483 students or rows) and we kept items that contained failing grades only (CS140 = F, CS322 = F, etc.). We present the two experiments as follows:

  • The first experiment focused on the association of courses based on success and failure, and it used data set 1.

  • The second experiment focused on association of courses based on failure, and it used data set 2.

Also, for each experiment, we generated rules with itemset numbers from 2 to 5. Because of space limitations, we present and discuss the resulting rules only for 2- and 3-itemsets, presenting 2-itemset rules for the first experiment and 3-itemset rules for the second experiment.

Association of Courses Based on Success and Failure

In this experiment, we set Apriori parameters as listed in Table 13.3.

Table 13.3 Apriori parameter values

The 2-itemset run produced 1138 rules, and we were interested in rules where the LHS and RHS were positively correlated (103 rules) or worth looking at, especially in a very imbalanced case (453 rules). We first explore and present the positive correlation rules, then those worth looking at. After that, we present the rules that contained grade F.

Positive Correlation Rules

In Table 13.4, we present the top five rules with the highest lift values where the LHS and RHS were positively correlated. As can be seen, these rules had high confidence, and the support values were relatively low to high, in the range of 0.3–0.8, which meant that 30–80% of the students achieved these grades in these courses. The confidence values were in the range of 0.8–0.9, indicating that these rules were found to be true 80–90% of the time; in other words, in 80–90% of the instances where a student achieved the LHS grades, he or she achieved the RHS grades too.

Table 13.4 Top five 2-itemset rules where the left-hand side (LHS) and right-hand side (RHS) are positively correlated

As an example of the rules presented in Table 13.4, rule (1043) {ARB104=A} ⇒ {IDE133=A} (sup = 0.384 and conf = 0.821) means that 38% of the students achieved an A in both course ARB104 and course IDE133, and the rule was found to be true for 82% of those students; in other words, in 82% of instances where a student got an A in ARB104, he or she would get an A in IDE133 too.

In addition, we noted the following about the positive correlation rules:

  • All rules that had a positive correlation associated getting grade A in LHS courses with getting grade A in RHS courses.

  • Examples of the courses that appeared in the rules are general courses and two mandatory courses, CS492 and CS493 (Senior Project in Computer Science 1 and Senior Project in Computer Science 2). However, grade A was the only grade that appeared.

  • Most of these rules were in the form ({X = A} ⇒ {QURxxx = A}), where “X” represented a course and “xxx” represented the code of the QUR (Holy Quran) courses. Table 13.5 shows the number of rules (in this form) for each QUR course.

Table 13.5 Number of 2-itemset rules containing one of the Holy Quran (QUR) courses

The high support indicates that these itemsets {course1 = A, course2 = A} appeared frequently in the data set; most of the students got a grade of A in these courses (the general courses and the two mandatory courses CS492 and CS493). For example, Fig. 13.3 shows the frequency (support) of the QUR courses with grades (A, B, C, D, F); as noted above, the (QURxxx = A) items were the most frequent. In conclusion, these rules may not be interesting, since getting an A in these courses could count as an obvious fact, and any rule that supports an obvious fact is not an interesting rule.

Fig. 13.3 Support of Holy Quran (QUR) courses with grades (A, B, C, D, and F)

Rules Worth Looking At

In Table 13.6 we present the top five rules that were worth looking at (very imbalanced cases). These rules had high confidence values and low to relatively low support values; the support values were in the range of 0.01–0.1, meaning that 1–10% of the students achieved the LHS grades (A, B, C, D, or F) in the LHS courses and grade A in the RHS courses. The confidence values were in the range of 0.8–1, which indicates that these rules were found to be true 80–100% of the time; in other words, in 80–100% of the instances where a student achieved the LHS grades, he or she achieved the RHS grades too. A “worth looking at” rule may or may not imply an interesting relationship.

Table 13.6 Top five “worth looking at” 2-itemset rules

We noticed the following about the “worth looking at” rules:

  • All rules that had a positive correlation associated getting grade (A, B, C, D, F) in LHS courses with getting grade A in RHS courses.

  • As we saw in the 2-itemset rules with positive correlation, most of the rules were in the form ({X = G} ⇒ {QURxxx = A}), where G represented one of the grades (A, B, C, D, F); as we said, these rules were mostly not interesting, since they could count as obvious facts because of the high support of the (QURxxx = A) items.

As an example of the rules presented in Table 13.6, rule (45) {CS439 = A} ⇒ {CS322 = A} implied that there might be an association between getting an A in course CS439 (Cloud Computing) and getting an A in course CS322 (Operating Systems). We know from the CS study plan that one of the prerequisites for registering in CS439 (Cloud Computing) was success in CS322 (Operating Systems), so an association between getting an A in both courses is logical, since it confirms what we already know. The same applies to rule (46) {CS439 = A} ⇒ {CS370 = A}, where there might be an association between getting an A in course CS439 (Cloud Computing) and getting an A in course CS370 (Introduction to Databases); since the CS study plan did not link these two courses, this could be a new rule and therefore an interesting one.

Table 13.7 shows some of the rules that contained CS courses on the LHS, arranged by support value from largest to smallest. For example, the first three rules applied to approximately 5% of the students (5% of 833 students ≈ 41 students), and they were true 90% or 80% of the time. They showed the association between getting an A in CS401 (Computational Numerical Analysis) and getting an A in CS493 (Senior Project in Computer Science 2), as in rule (262); getting an A in CS391 (Seminar), as in rule (258); and getting an A in CS492 (Senior Project in Computer Science 1), as in rule (264).

Table 13.7 “Worth looking at” 2-itemset rules where the left-hand side (LHS) items are Computer Science courses

“Worth Looking At” Rules Containing Failed Courses

In Table 13.8, we present some of the 2-itemset rules that contain grade F; they are worth looking at (a very imbalanced case) on the basis of both their Kulc and IR values. Note that they have high confidence, in the range of 0.8–1, and low support, in the range of 0.011–0.067; that is, 1.1–6.7% of the students got grade F in the LHS courses and grade A in the QUR courses (RHS). Also, these rules were found to be true 80–100% of the time. As we know, a “worth looking at” rule may or may not imply an interesting relationship. From our point of view, since most of the students got an A in the QUR courses, those students would get different grades (A, B, C, D, F) in the rest of their courses. So, this is an obvious fact, which means these rules do not seem to be interesting. For example, the support value of rule (13) is equal to the support value of the item (CS430 = F) on the LHS; the same students who got an F in CS430 (approximately 11 students) also got an A in the QUR courses.

Table 13.8 2-Itemset rules that contain an F grade on the left-hand side (LHS), on the right-hand side (RHS), or both

Association of Courses Based on Failure

In this experiment, to find the associations between failures (F grades) in courses, we used the second data set, which contained the graduated CS students who had failed at least one course (483 rows), with the 60 courses represented only by items of the form (course = F) (so, 60 items). Table 13.9 presents the Apriori parameter values that we used. We present the 3-itemset rules because no 2-itemset rules resulted.

Table 13.9 Apriori parameter values

The 3-itemset run produced 43 rules; there was no rule where the LHS and RHS were positively correlated. Table 13.10 shows the top five “worth looking at” rules (a very imbalanced case). These rules had high confidence values and low support values. The support values were in the range of 0.01–0.024, which meant that 1–2.4% of the students got grade F in these courses, while the confidence values in the range of 0.8–1 indicated that these rules were found to be true 80–100% of the time; in other words, 80–100% of the time that a student achieved the LHS, he or she achieved the RHS too. For example, rule (18) {CS340 = F, CS471 = F} ⇒ {CS401 = F} indicates that 1.2% of the students got an F in all three courses CS340 (Artificial Intelligence), CS471 (Database Management Systems), and CS401 (Computational Numerical Analysis), and that 85% of the time that a student failed CS340 and CS471, he or she failed CS401 too.

Table 13.10 Top five “worth looking at” 3-itemset rules for the failure data set

4 Conclusion

In this chapter, we have reviewed some of the literature that has used data-mining (DM) techniques (such as classification and association rule mining) in education for exploring patterns and extracting knowledge. We conducted experiments to extract and mine interesting rules from data on undergraduate Computer Science (CS) students using the Apriori algorithm, and we presented the settings of these experiments and some of the results. We answered the research questions using lift, Kulczynski (Kulc), and the imbalance ratio (IR) to measure the interestingness of rules, along with their support and confidence. We explained how we preprocessed the data and how we set the values for minimum support, minimum confidence, Kulc, and IR. Our results showed correlations between some courses when students obtained A and F grades; the other grades (B, C, and D) did not show correlations between courses in the 2-itemset rules. In addition, the results showed that most of the interesting rules had support higher than 1% and confidence higher than 80%. Nevertheless, we cannot confirm or deny previous opinions that have associated some courses with each other, such as associating success in mathematics with success in programming. We also plan to apply classification algorithms to the data set.