
6.1 Introduction

Information technology is reshaping higher education globally, and analytics can help provide insights into complex issues in higher education, such as student recruitment, enrollment, retention, student learning, and graduation. Student retention, in particular, has become a major issue in higher education, since it has an impact on students, institutions, and society. This study deals with the important issue of student retention at a micro level, i.e. at the course level, since course completion is considered to be “the smallest unit of analysis with respect to retention” (Hegedorn 2005). Student retention is a complex and widespread issue: 40 % of students leave higher education without earning a college degree (Porter 1989). Both the student and the institution need to shoulder the responsibility of ensuring that students can succeed in higher education, and institutions of higher education are therefore looking for strategies to improve their retention rates. It is important to identify at-risk students – those students who have a greater likelihood of dropping out of a course or program – as this allows instructors and advisors to proactively implement appropriate retention strategies. The problem is especially acute online: several studies have shown that the dropout rate in online courses is 10–40 % higher than in traditional, face-to-face courses (Carr 2000; Diaz 2000; Lynch 2001).

Online course dropout needs to be addressed to improve institutional effectiveness and student success. One of the key elements in reducing the e-learning dropout rate is the accurate and prompt identification of students who are at greater risk of dropping out (Lykourentzou et al. 2009). Previous studies have focused on identifying students who are more likely to drop out using academic and demographic data obtained from Student Information Systems (SIS) (Kiser and Price 2007; Park 2007). Online courses are generally taught using a Course Management System (CMS), which can provide detailed data about student activity in the course. There is a need for models that can predict real-time dropout risk for each student while an online course is being taught. Using both SIS and CMS data, a predictive model can provide a more accurate, real-time dropout risk estimate for each student while the course is in progress.

The model developed in this research study utilizes a combination of variables from the SIS to provide a baseline risk score for each student at the beginning of the course. Data from the CMS is used, in combination with the baseline prediction, to provide a dynamic risk score as the course progresses. This study identifies and analyzes various SIS-based and CMS-based variables to predict dropout risk for students in online courses and evaluates various data mining techniques for their predictive accuracy and performance to build the predictive model and risk scores. The model leverages both SIS (historical) and CMS (time-variant) data to improve on the predictive accuracy. The model provides a basis for building early alert and recommender systems so instructors and retention personnel can deploy proactive strategies before an online student drops out.

The rest of the paper is organized as follows: Sect. 6.2 presents the literature review, Sect. 6.3 describes the research approach, Sect. 6.4 presents the data analysis and results along with the proposed recommender system framework, and Sect. 6.5 concludes the paper.

6.2 Literature Review

6.2.1 Factors Affecting Student Retention and Dropout

Early detection of at-risk students and appropriate intervention strategies are key to retention. Seidman (2005) developed a formula for student retention: Retention = Early Identification + (Early + Intensive + Continuous) Intervention. The formula emphasizes the role of early identification and intervention in improving student retention. Tinto (1987) introduced the importance of student integration (both social and academic) in the prediction of student retention. Tinto (1993) reported that the factors in students dropping out include academic difficulty, adjustment problems, lack of clear academic and career goals, and poor social integration with the college community.

Several studies have investigated the role of academic and non-academic variables in student retention in face-to-face programs; high school GPA and ACT scores, for example, were found to be good predictors of retention in such programs (Campbell and Oblinger 2007; Lotkowski et al. 2004). Lotkowski et al. (2004) note that college performance was strongly related to ACT scores, high school GPA (HSGPA), and socio-economic status (SES), as well as academic self-confidence and achievement motivation. Other studies have shown that, once a student is in college, retention is influenced by college GPA (Cabrera et al. 1993; Mangold et al. 2002; O’Brien and Shedd 2001).

Non-academic factors, typically assessed once the student is enrolled, can also affect retention (Braxton 2000). Non-academic factors that have been known to influence retention include level of commitment to obtaining a degree, academic self-confidence, academic skills, time management skills, study skills, study habits, and level of academic and social integration into the institution (Lotkowski et al. 2004).

Kiser and Price (2007) examined the predictive accuracy of high school GPA (HSGPA), residence location, cumulative hours taken, mother’s education level, father’s education level, and gender on persistence of college freshmen to the sophomore year (freshman retention). Their model used logistic regression on a dataset of 1,014 students and found that cumulative hours taken was statistically significant for the overall model. Cumulative credit hours have financial implications for students; as the number of hours grows, student investment grows. According to Parker and Greenlee (1997), financial problems, family complications, work schedule conflicts, and poor academic performance (in order of importance) were the most important factors in the persistence of nontraditional students. Bean and Metzner (1985) identified four sets of factors that affect the persistence of students, especially nontraditional students: academic variables, background and defining variables, environmental variables, and academic and psychological outcomes.

Dutton and Perry (2002) examined the characteristics of students enrolled in online courses and how those students differed from their peers in traditional face-to-face courses. The study found that online students were older, more likely to have job responsibilities, and required more flexibility for their studies. In terms of student gender, Rovai (2003) and Whiteman (2004) found that females tend to be more successful in online courses than males. However, another study showed that gender had no correlation with persistence in an e-learning program (Kemp 2002). Some studies have pointed out the relevance of age as a predictor of dropout rates in e-learning (Muse 2003; Whiteman 2004). Diaz (2000), though, found that online students were older than traditional face-to-face students but that there was no significant correlation between age and retention. A more recent study by Park (2007) on a large population of about 89,000 students enrolled in online courses showed that age, gender, ethnicity, and financial aid eligibility were good predictors of successful course completion. Furthermore, Yu et al. (2007) reported that earned credit hours were linked with student retention in online courses; the study also showed a correlation between the location of the student – in-state or out-of-state – and retention.

The usage data from the CMS can provide useful insights into students’ learning behavior. Data on student online activity in the CMS can provide an early indicator of student academic performance (Wang and Newlin 2002). CMS data can provide useful information about study patterns, engagement, and participation in an online course (Dawson et al. 2008). It has been argued that institutional CMS data can offer new insights into student success and help identify students who are at risk of dropout or course failure (Campbell et al. 2007). Furthermore, Campbell et al. (2006) found that student SAT scores were mildly predictive of future student success; however, adding a second variable, CMS logins, tripled the predictive accuracy of the model.

Studies have used CMS activity logs to analyze learner paths and learning behavioral patterns (Bellaachia et al. 2006; Hung and Zhang 2008), to elicit the motivational level of the students towards the course (Cocea and Weibelzahl 2007), and to assess the performance of learners (Chen et al. 2007). Studies have indicated a significant relationship exists between CMS variables and academic performance (Macfadyen and Dawson 2010; Morris et al. 2005). The results indicate that engaged students are more likely to successfully complete a course than students who are less interactive and less involved with their peers. Research studies have been conducted to mine CMS data (such as number and duration of online sessions, discussion messages read or posted, content modules or pages visited, and number of quizzes or assignments completed or submitted) to identify students who are more likely to drop out of a course or receive a failing grade. CMS datasets can be captured in real time and can be mined to provide information about how a student is progressing in the course (Macfadyen and Dawson 2010).

6.2.2 Data Analysis and Mining Techniques in Dropout Prediction

Several statistical and data mining methods have been used to analyze student data. Logistic regression (LR) is heavily used in predictive models that have a binary dependent (response) variable. Logistic regression has been widely used in business to predict customer attrition, sales events for a product, or any other event that has a binary outcome (Nisbet et al. 2009). Many studies conducted to identify high-risk students used statistical models based on logistic regression (Araque et al. 2009; Newell 2007; Willging and Johnson 2004). Logistic regression and survival analysis were used to build a competing risk model of retention (Denson and Schumacker 1996). The rationale for using logistic regression for the retention problem is that the outcome is typically binary (enrolled or not enrolled) and that probability estimates can be calculated for combinations of multiple independent variables (Pittman 2008).

Park and Choi (2009) used logistic regression analysis to determine how well four variables (family support, organizational support, satisfaction, and relevance) predicted learners’ decisions to drop out. Furthermore, Roblyer et al. (2008) gathered data on student characteristics regarding dropout in an online environment; they analyzed a dataset of over 2,100 students using binary logistic regression, with an overall correct classification rate of 79.3 %. Logistic regression was also used by Allen and Robbins (2008) to predict persistence over a large dataset of 50,000 students. That study used three variables (students’ vocational interest, their academic preparation, and their first-year academic performance) and concluded that prior academic performance was the critical element in predicting persistence. A logistic regression model was also used to identify at-risk students using CMS variables such as number of messages posted, assignments completed, and total time spent; this model classified at-risk students with 73.7 % accuracy (Macfadyen and Dawson 2010).

Several machine learning techniques, such as decision trees, neural networks, Support Vector Machines (SVM), and naïve Bayes, are also appropriate for binary classification. In that regard, Etchells et al. (2006) used neural networks to predict students’ final grades. Furthermore, Herzog (2006) and Campbell (2008) compared neural networks with regression to estimate student retention and degree-completion time. One of the most commonly used data mining techniques is the decision tree, which is used to solve classification problems. In this context, Breiman et al. (1984) used a type of decision tree at the University of California San Diego Medical Center to identify high-risk patients. Furthermore, Cocea and Weibelzahl (2007) used a decision tree to analyze log files from the CMS to examine the relationship between time spent reading and student motivation. Muehlenbrock (2005) applied a C4.5 decision tree model to analyze user actions in the CMS to predict future uses of the learning environment.

Given the rich information provided by the CMS, it is argued that CMS data, in addition to SIS data, may provide a more accurate, real-time dropout prediction at the course level. In this study, we present an approach that utilizes a combination of SIS variables to provide a baseline prediction for each student at the beginning of the course. CMS data from the current course is used, along with the baseline prediction, to provide a real-time risk score as the course progresses. The purpose of dynamic prediction – adjusting the baseline prediction with CMS log activity on a regular basis (daily or weekly) – is to provide a more accurate prediction of likely dropouts. To the best of our knowledge, no study has utilized both SIS (static) and CMS (time-variant) data together to make a dynamic prediction while the course is in progress.

6.3 Research Approach

In this section we discuss the overall research process, including the identification of relevant constructs and predictor variables and the evaluation criteria. An overview of the research process is shown in Fig. 6.1. We approach the problem of predicting whether a student will complete or drop a course as a binary classification problem. First, we begin with a literature review focused on retention theories and dropout studies, identifying constructs and variables important to course completion or dropout. The literature review identified the following constructs as useful for predicting online course dropout: academic ability, financial support, academic goals, technology preparedness, demographics, course motivation and engagement, and course characteristics. The constructs were then mapped to appropriate variables, which were selected based on the literature review as well as the availability of data and resources. Data for the SIS variables was collected from the Student Information System.

Fig. 6.1 Research approach

The literature review also helped to identify the data mining techniques used in binary classification, in addition to the hybrid machine learning techniques proposed in this study. SIS data was analyzed using descriptive statistics and various data mining techniques (LR, ANN, SVM, DT, NB, GA-ANN, and GA-NB), and each technique was evaluated using the accuracy metrics. The most accurate technique (Algorithm I) is used to build the SIS Predictive Model, which provides a baseline dropout risk score for each student. CMS data is gathered from the Course Management System at the end of the third and seventh day from the beginning of the course; this data reflects the changes in student usage of the CMS in each online course. The data is analyzed using Algorithm II, which suits the time-sensitive nature of CMS data, since CMS usage is updated daily. The dynamic prediction (CMS Predictive Model) provides a real-time dropout risk score, which can be updated as frequently as the CMS data are updated.

The recommender system uses the predictive models to identify students who are at greater risk of dropping out of the course. The purpose of the recommender system is twofold: to create and send alerts to retention staff, students, and instructors at a predefined time and predefined score; and to recommend customized interventions to at-risk students. For example, if a student is given a high risk score because of financial aid status, the student will be directed to work with financial aid staff. Similarly, if a student has a high risk score due to not having taken an online course previously, then the student will be directed to online training videos relevant for e-learning. Figure 6.1 summarizes the key steps of the research approach.

6.3.1 Constructs and Variables

The following independent constructs were identified for this study to predict dropout: academic ability, financial support, academic goals, technology preparedness, demographics, course motivation and engagement, and course characteristics. Each identified construct was mapped to appropriate variables. The dependent construct is course completion, measured by the course completion status variable. For the independent constructs, the study identifies the following variables, grounded in the literature:

1. Academic ability: ACT score (Lotkowski et al. 2004; Vare et al. 2000), high school GPA (Vare et al. 2000), current college GPA (Diaz 2000).

2. Financial support: financial aid status (Newell 2007).

3. Academic goals: credit hours completed (Yu et al. 2007), previous drops, degree seeking status (Bukralia et al. 2009).

4. Technology preparedness: previous online course completion (Osborn 2001).

5. Demographics: gender (Kemp 2002; Ross and Powell 1990; Whiteman 2004) and age (Diaz 2000; Muse 2003; Whiteman 2004).

6. Course engagement and motivation: total number of logins in the CMS by the student, total time spent in the CMS, and number of days since last login in the CMS.

7. Course characteristics: credit hours, course level, and course prefix.

6.3.2 Evaluation Criteria

Since superfluous variables can lead to inaccuracies, it is important to remove them. Stepwise selection (in regression) and feature selection (in machine learning) are common approaches for removing superfluous variables. To build a binary classification based predictive model, the following questions should be carefully addressed: (a) Which classification algorithms are appropriate in relation to problem characteristics? (b) How is the performance of selected classification algorithms evaluated?

The data mining techniques are selected based on the nature of the prediction (whether both class labels and probability of class membership are needed), nature of independent variables (continuous, categorical, or both), and the algorithmic approach (black box or white box). Both statistical and machine learning algorithms are used to identify class membership (for example, whether a student completed or dropped the course). A statistical technique (logistic regression), machine learning techniques (ANN, DT, SVM, and NB), and hybrid machine learning techniques (or ensemble methods such as boosted trees) are used to compare their predictive accuracy. The classification models are evaluated using two criteria: discrimination and calibration. Discrimination is used to demonstrate how well two classes (0 or 1) are separated, while calibration provides the accuracy of probability estimates. Common measures of discrimination include accuracy, precision, specificity, and sensitivity where:

$$ \mathrm{Sensitivity}=\mathrm{TP}/\mathrm{P}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right) $$
$$ \mathrm{Specificity}=\mathrm{TN}/\mathrm{N}=\mathrm{TN}/\left(\mathrm{FP}+\mathrm{TN}\right) $$
$$ \mathrm{Accuracy}=\left(\mathrm{TP}+\mathrm{TN}\right)/\left(\mathrm{P}+\mathrm{N}\right) $$
$$ \mathrm{Precision}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FP}\right) $$
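As a concrete illustration, the short Python sketch below computes these four measures from the counts of a coincidence matrix; the counts in the example call are hypothetical and are not taken from the study's tables.

def discrimination_measures(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy, and precision
    from the counts of a coincidence (confusion) matrix."""
    sensitivity = tp / (tp + fn)                 # TP / P
    specificity = tn / (fp + tn)                 # TN / N
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # (TP + TN) / (P + N)
    precision = tp / (tp + fp)
    return sensitivity, specificity, accuracy, precision

# Hypothetical counts, for illustration only.
print(discrimination_measures(tp=400, fp=50, tn=40, fn=35))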

6.4 Data Analysis and Predictive Modeling

This section discusses the step-by-step process for collecting, preparing, and analyzing both SIS and CMS datasets. The process starts with the collection and preparation of SIS data for data analysis. The SIS datasets are analyzed using descriptive statistics and various data mining techniques to compare the accuracy in terms of identifying dropout cases. Similarly, CMS data are collected, cleansed, and analyzed. Various data mining techniques are evaluated for accuracy and performance with CMS datasets. Using the most accurate technique, a baseline risk scoring model is built using SIS data and a dynamic risk scoring model is built using both SIS and CMS data. The section concludes with the evaluation of the baseline and dynamic risk models to identify at-risk students.

6.4.1 SIS Data Analysis

Table 6.1 presents the variables that were chosen to be used in the SIS dataset. The dataset contains 1,376 student records from online courses offered in the Fall 2009 semester, and was extracted from the SIS (Datatel Colleague System: http://www.datatel.com/products/products_a-z/colleague_student_fa.cfm).

Table 6.1 Variables in the Student Information System (SIS) dataset

An examination of the dataset showed that it had some noisy data, missing values, and outliers. Many of the noisy data issues were related to human error in data entry. Of the 1,376 records, 1,113 were from students who completed an online course and 263 (19.11 %) were from students who dropped out of an online course. To describe the data, descriptive statistics were computed using the WEKA and SPSS statistics packages. The analysis revealed the following dataset characteristics:

1. 47.82 % of students did not have financial aid, while 52.18 % of students received financial aid.

2. College GPA had approximately 19 % missing values, with a mean of 2.5 and a standard deviation of 1.31.

3. High school GPA had 53 % missing values, with a mean of 3.18 and a standard deviation of 0.58; the relatively high mean was expected, since students with a high school GPA on record had completed high school before college admission.

4. The histograms show that the dataset had more records for female students than for male students (974 compared to 311).

In order to obtain accurate results, it is essential to prepare the dataset through preprocessing and formatting so that appropriate data mining techniques can be applied. As the dataset is not large, deleting records with missing values was not considered a good strategy, as it would have reduced the dataset to merely about 300 records. Records with outliers were removed, and missing data were replaced with the mean value for numerical variables (ACT score, age, and college GPA). Missing data for categorical variables (such as degree seeking status) were set to zero (non-degree seeking, no financial aid, no previous online course, and female gender). The initial dataset was imbalanced, as the dependent variable (course status) had more records for students completing the course (course status = 1) than for students who dropped out (course status = 0).
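A minimal pandas sketch of this preprocessing step is shown below. The file and column names are hypothetical stand-ins for the SIS variables; the mean and zero imputation rules mirror the ones described above.

import pandas as pd

# Hypothetical file and column names standing in for the SIS variables.
sis = pd.read_csv("sis_fall2009.csv")

# Numerical variables: replace missing values with the column mean.
for col in ["act_score", "age", "college_gpa"]:
    sis[col] = sis[col].fillna(sis[col].mean())

# Categorical variables: replace missing values with zero
# (non-degree seeking, no financial aid, no previous online course, female).
for col in ["degree_seeking", "financial_aid", "prev_online_course", "gender"]:
    sis[col] = sis[col].fillna(0)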

Using WEKA, the following data mining techniques were applied to the dataset using 10-fold cross-validation to create a predictive model: Decision Trees using the J.48 algorithm (DT), Naïve Bayes (NB), Logistic Regression (LR), Artificial Neural Networks with the Multilayer Perceptron algorithm (ANN), and Support Vector Machines (SVM). Table 6.2 shows the coincidence matrix for the comparative accuracy of each data mining technique using the unbalanced SIS dataset. A coincidence matrix, also known as a misclassification matrix, provides information about true positives, true negatives, false positives, and false negatives.
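The study ran these techniques in WEKA; a rough scikit-learn equivalent of the comparison is sketched below, with synthetic placeholder data standing in for the preprocessed SIS feature matrix and course status labels.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic placeholders: X would be the preprocessed SIS features and
# y the course status (0 = dropped, 1 = completed).
rng = np.random.default_rng(0)
X = rng.normal(size=(525, 10))
y = rng.integers(0, 2, size=525)

models = {
    "DT": DecisionTreeClassifier(),        # analogous to WEKA's J.48
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "ANN (MLP)": MLPClassifier(max_iter=1000),
    "SVM": SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean 10-fold accuracy = {scores.mean():.3f}")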

Table 6.2 Coincidence matrix for various techniques using unbalanced SIS dataset. n = 1,376

As Table 6.2 shows, each technique achieved an overall accuracy of around 80 %. The ANN (MLP) technique provided the best predictive accuracy both for students who completed the course (88.01 %) and for students who dropped out (55.55 %), and it had the best overall predictive accuracy among the tested techniques. With this dataset, however, accuracy on true negatives (students who actually dropped out) was low. Since the focus of this study was to predict students who are likely to drop out, this accuracy was concerning. The accuracy of predicting true negatives for DT, NB, LR, ANN (MLP), and SVM was 0 %, 21.73 %, 31.57 %, 55.55 %, and 0 %, respectively. DT and SVM were especially poor at identifying true negatives. ANN (MLP) worked best at finding true negatives (55.55 %); however, that percentage was not acceptable either.

In order to correct the bias in the dataset, most duplicate student records were removed. If a student registered for more than one online course and completed one or more courses and dropped one or more courses, then one record for each course was kept to remove bias. The removal of duplicate student records provided a balanced but much smaller dataset; the new dataset included an almost equal number of records for students who completed and students who dropped a course. Tables 6.3 and 6.4 show the coincidence matrix and the evaluation measures for each data mining technique using the balanced dataset of 525 student records.

Table 6.3 Coincidence matrix for various techniques using balanced SIS dataset. n = 525
Table 6.4 Evaluation measures for various techniques using balanced SIS dataset. n = 525

The overall predictive accuracy with the balanced dataset was slightly less than the accuracy with the unbalanced dataset. This was expected, because the unbalanced dataset provided an overly optimistic prediction for students who completed the course since it had more records for them in the dataset. The coincidence matrix shows that all techniques had a better accuracy of predicting dropout students from the balanced dataset (a predictive accuracy for dropout ranging from 69.1 % for naïve Bayes to 76.95 % for decision trees). In terms of predictive accuracy, decision trees performed better than other models.

Based on this observation, decision tree algorithms were explored further. In addition to the J.48 algorithm, the C5.0 algorithm is widely used for classification problems. C5.0 is a sophisticated decision tree algorithm that works by splitting the sample based on the field that provides the maximum information gain. Each subsample is then split again, possibly on different fields, and the process repeats until the subsamples cannot be split any further. The algorithm then examines the splits and prunes those that do not contribute significantly to the value of the model. C5.0 has been shown to work well with datasets that have missing values; although the test/training data was cleaned, SIS data in production would contain missing values (such as the absence of HS GPA or ACT score). Other benefits of C5.0 are that it does not require long training times and that it offers a powerful boosting method to increase classification accuracy. Accordingly, this study compared the C5.0 algorithm with J.48 for decision trees.

In order to achieve better predictive accuracy than the individual techniques, boosting was used. Boosting is an ensemble data mining technique. Ensemble techniques leverage the power of multiple models: a set of individually trained classifiers (such as decision trees) whose predictions are combined when classifying novel instances. Ensemble techniques have been found to improve the predictive accuracy of the classifier (Opitz and Maclin 1999). To create a boosted C5.0 model, a sequence of related models is built. The first model is an ordinary C5.0 tree. The second model is built by focusing on the records that were misclassified by the first model, a third model is built from the errors of the second model, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction.
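C5.0 boosting is specific to IBM Modeler, but the idea of building trees in sequence, concentrating on previously misclassified records, and combining them by weighted voting is the same idea behind AdaBoost. The sketch below uses scikit-learn's AdaBoostClassifier purely to illustrate the mechanism; it is not the C5.0 implementation used in the study, and the data is a synthetic placeholder.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic placeholders for the balanced SIS dataset (see previous sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(525, 10))
y = rng.integers(0, 2, size=525)

# Each boosting round fits a new tree that concentrates on the records
# misclassified so far; predictions are combined by weighted voting.
boosted_trees = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # base tree (hypothetical depth)
    n_estimators=10,                                # number of boosting rounds
)

scores = cross_val_score(boosted_trees, X, y, cv=10, scoring="accuracy")
print(f"Boosted trees: mean 10-fold accuracy = {scores.mean():.3f}")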

The balanced dataset containing 525 student records was analyzed with the C5.0 algorithm in IBM Modeler (WEKA does not provide this algorithm) using 10-fold cross-validation. Table 6.5 shows the comparative coincidence matrix and the ROC area for J.48, C5.0, and boosted C5.0 decision trees.

Table 6.5 Coincidence matrix for decision tree techniques for balanced SIS dataset. n = 525

The analysis of the SIS dataset with C5.0 decision trees (with and without boosting) provided greater accuracy than J.48 decision trees in predicting both types of cases – students who dropped the course and students who completed the course. Compared to 73.52 % accuracy for J.48, C5.0 provided 86.29 % accuracy, while boosted C5.0 decision trees identified 90.67 % of students accurately. Since the purpose of the model is to accurately identify students who are at greater risk of dropping out, it is important to examine the accuracy on true negatives. The boosted C5.0 decision tree model accurately identified 93.85 % of students who dropped out, compared to 88.97 % without boosting and 76.95 % for J.48. Figure 6.2 shows the boosted C5.0 decision tree for the SIS dataset. As the figure shows, the tree is initially split on the credit hours (college credit hours taken) node, with branches for credit hours less than or equal to 68 and greater than 68 hours. Furthermore, if college GPA is less than or equal to 3.563 and age is less than or equal to 22, the model predicts that the student will drop out.
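For illustration, the drop-out branch described above can be read as a simple nested condition. The snippet below is only a paraphrase of that single path; the actual boosted model combines many trees by weighted voting and is not reproduced here.

def sis_rule_path(credit_hours, college_gpa, age):
    """Paraphrase of one rule path from the boosted C5.0 tree (Fig. 6.2)."""
    if credit_hours <= 68 and college_gpa <= 3.563 and age <= 22:
        return "predicted dropout"
    return "not determined by this branch"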

Fig. 6.2 Boosted C5.0 decision tree model for balanced SIS dataset

The boosted C5.0 decision tree model also provided the relative importance of the predictors (independent variables). No predictor was found to be especially strong; however, college credit hours (Credit Hours) and Age were slightly more important than the other variables. The gain chart for the model (Fig. 6.3) suggests that the model can identify close to 100 % of dropout cases with only 60 % of the records selected. In a well-fitted model, the gains curve is well arched in the middle of the chart; the gain chart of the boosted C5.0 model shows such an arch, which indicates that the model is well fitted and has long-term stability.

Fig. 6.3 Gain chart for boosted C5.0 model

Based on the comparative accuracy and evaluation measures, the boosted C5.0 decision tree model was used to build a baseline risk score for students. The boosted C5.0 model predicted the most likely value of course status (0 or 1). This value in the model is represented as $C-Current_Status (predicted course status). The model also provides the propensity scores, represented as $CC-Current_Status, and adjusted propensity scores, represented as $CRP-Current_Status. Therefore, Baseline Risk Score is computed by the following equation:

Baseline Risk Score = 100 - (Adjusted Propensity Score * 100)

Baseline Risk Score = 100 - ($CRP * 100)
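In code, the baseline score is a one-line transformation of the adjusted propensity score ($CRP-Current_Status) produced by the model; a minimal Python sketch follows.

def baseline_risk_score(adjusted_propensity):
    """Convert the adjusted propensity of course completion (a value
    between 0 and 1) into a 0-100 baseline dropout risk score."""
    return 100 - adjusted_propensity * 100

# Example: an adjusted propensity of 0.80 to complete gives a risk score of 20.
print(baseline_risk_score(0.80))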

6.4.2 CMS Data Analysis

CMS data was collected from online courses taught in the fall 2009 semester at a university located in the Midwestern United States. The university uses Desire2Learn (D2L) as its Course Management System. 592 student records were collected from 29 randomly selected online courses at the end of the third day and the seventh day of the course. Table 6.6 shows the CMS variables and their data types.

Table 6.6 Variables in the Course Management System (CMS) Dataset

Although CMS data were available from the start date of the course, this study used CMS data from the end of day 3 and the end of day 7. Since the CMS did not provide automated reports suitable for data mining, a software program had to be written to extract data from individual courses for individual students. The rationale for using data from the end of day 3 was that sufficient data was available by that time; many students did not log in before then, so an earlier extraction would have provided little information. The rationale for using data from the end of day 7 was that, in practice, the institution has observed that most students who drop out do so within the first week of the course. There was an option to extract data at the end of day 10, but that data was not used because it was considered too late for preventing dropout, as many students drop out by day 10 – the institution’s census date. However, it is important to note that once the model is built, an institution (including staff and instructors) could customize it to run on any selected course dates; for example, CMS data extraction could be run on a daily basis. The descriptive statistics for the CMS data are shown in Table 6.7.

Table 6.7 Descriptive statistics of the 3-Day CMS dataset numeric variables. n = 592

In terms of the course level variable, 28.2 % of students were in courses coded as 1 (freshman level), while only 3.9 % were enrolled in level 0 (remedial) courses. This distribution was expected because the institution offers relatively more 100-level courses online. In terms of the Days Since Last Login variable, only 10.3 % of students logged in to their courses on day 3; however, 57.1 % of students had last logged in one day before the data extraction. In terms of the Credit Hours variable, the dataset contained only 2- and 3-credit-hour courses, and 97.5 % of students were enrolled in 3-credit-hour online courses. In terms of the Discipline variable, the dataset covers five disciplines: Business, Education, Humanities, Math, and Social Sciences. 28.4 % of students were enrolled in the Education discipline, which is consistent with the institution’s online course offerings being concentrated in education, business, and humanities. In terms of the Course Status variable, only 12.3 % of students dropped out, while 87.7 % completed their courses.

The 3-day CMS dataset was analyzed using J.48 decision trees, naïve Bayes, logistic regression, MLP neural networks, and SVM, each with 10-fold cross-validation. J.48 decision trees, logistic regression, and SVM provided the highest overall accuracy – 87.66 %. However, J.48, logistic regression, and SVM had low accuracy in identifying true negatives (students who actually dropped out): J.48 had only 50 % accuracy, and SVM had 0 % accuracy in identifying dropout students. All techniques performed poorly at identifying dropout students, which was understandable since only 12.5 % of the records in the dataset were for students who dropped out. Since the boosted C5.0 decision tree model was significantly better with SIS data, it was worth investigating that technique with the CMS dataset. The CMS dataset was therefore analyzed using boosted C5.0 decision trees and 10-fold cross-validation. As the coincidence matrix shows (Table 6.8), the boosted C5.0 decision tree model provided an overall accuracy of 89.7 % and was able to identify 92.85 % of true negatives (students who actually dropped their courses). The boosted C5.0 decision tree model was therefore chosen for the CMS data analysis due to its greater overall accuracy and greater accuracy in identifying true negatives.

Table 6.8 Coincidence matrix for CMS 3-day and 7-day dataset using C5.0 boosted trees

The boosted C5.0 decision tree model identified total logins as the only significant predictor. This was understandable, as the other numeric variables, such as days since last login and total time spent, would become more significant later in the course. In the boosted C5.0 decision tree, the first node was split on the total number of logins. The model indicated that if the total number of logins was greater than 9 by the end of day 3, only 9.5 % of students dropped out; if the total number of logins was less than or equal to 9, 40 % of those students dropped out. Of the students with 9 or fewer total logins by the end of day 3, 63 % dropped out if the course was in the business discipline, 100 % of those in the social sciences discipline dropped out, and 23 % of those in the education, humanities, and math disciplines dropped out. Among the business-course students with 9 or fewer total logins, 83 % of those who had not logged in by day 3 dropped out.

In addition to the day 3 analysis, the CMS dataset from the end of day 7 was analyzed using boosted C5.0 decision trees and 10-fold cross-validation. The coincidence matrix (Table 6.8) shows an overall accuracy of 89.86 %. This model was able to accurately identify 78.26 % of true negatives, which was significantly better than boosted C5.0 with the day 3 dataset.

The boosted C5.0 decision tree model used with the 7-day CMS dataset showed more contrast in relative predictor importance. The model showed that total logins, days since last login, and total time spent were strong predictors. This finding aligns with what is generally witnessed in practice – that students who access the course less frequently, spend less total time, and let more days lapse since their last login are more prone to drop out. Course level and discipline were weak predictors of course completion or dropout.

The boosted C5.0 decision tree for the day 7 dataset had fewer decision nodes. The first node was split on the total number of logins: less than or equal to 13, or greater than 13. If total logins were greater than 13 by the end of day 7, then 90 % of those students completed their course (only about 10 % dropped out). If total logins were less than or equal to 13 by the end of day 7, then only 57 % of those students completed their course (43 % dropped out). Of the students who had 13 or fewer logins by the end of day 7, 78 % of those in the business or social sciences disciplines dropped out, while only 25 % of those enrolled in the humanities, math, or education disciplines dropped out.

Raw propensity scores and adjusted propensity scores, along with the predicted value of the dependent variable (Course_Status), are computed using the boosted C5.0 decision tree model. The adjusted propensity score is used to create the CMS risk score from the day 3 and day 7 datasets. Since the adjusted propensity scores are decimal values between 0 and 1, they are multiplied by 100 to make them more readable and then subtracted from 100 to compute the dropout risk score. The following rule is used to compute the dropout risk score for individual students.

Dynamic CMS Risk Score =IF(Total_Logins=0,"100",(100-($CRP-Course_Status*100)))

OR

Dynamic CMS Risk Score =IF(Total_Logins =0,"100",(100-(Adjusted Propensity Score for Course_Status*100)))
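Rewritten as a small function (a Python sketch mirroring the spreadsheet rule above): a student with no logins is assigned the maximum risk of 100; otherwise the adjusted propensity is converted exactly as in the baseline score.

def cms_risk_score(total_logins, adjusted_propensity):
    """Dropout risk score from CMS data.

    total_logins        -- number of CMS logins at the extraction date
    adjusted_propensity -- $CRP-Course_Status from the boosted C5.0 model (0-1)
    """
    if total_logins == 0:
        return 100                        # no activity at all: maximum risk
    return 100 - adjusted_propensity * 100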

6.4.3 Computation of Dynamic Risk Scores

As previously mentioned, SIS data analysis provides a baseline risk score, which can be used at the beginning of the course. As the course progresses and CMS data are analyzed, the CMS risk scores can be computed. To compute a dynamic risk score, both the baseline risk score and CMS risk scores are averaged together. The 3-day dynamic risk score is computed by averaging the baseline risk score and day 3 CMS risk score. Similarly, the 7-day dynamic risk score is computed by averaging the baseline risk score and day 7 CMS risk score.

3_Day_Dynamic_Risk_Score = (Baseline_Risk_Score + 3_Day_Risk_Score) / 2

7_Day_Dynamic_Risk_Score = (Baseline_Risk_Score + 7_Day_Risk_Score) / 2

For example, if a student’s baseline score was 20 and the 3-day CMS score was 60, then the 3-day dynamic risk score would be (20 + 60)/2 = 40.
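A minimal sketch of the averaging step, reproducing the worked example above:

def dynamic_risk_score(baseline_score, cms_score):
    """Average the SIS baseline risk score with a day-3 or day-7 CMS risk score."""
    return (baseline_score + cms_score) / 2

# Worked example from the text: baseline 20, day-3 CMS score 60 -> 40.
print(dynamic_risk_score(20, 60))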

6.4.4 Evaluation of the Risk Scoring Model

Although the boosted C5.0 decision tree model selected for risk scoring used 10-fold cross-validation, the model was also evaluated with a new dataset to verify its performance and to address any possible bias. A new dataset of 205 students was extracted from the SIS and CMS. The SIS data was analyzed using a boosted C5.0 decision tree to calculate the baseline risk score from the adjusted propensity scores. Similarly, CMS datasets for day 3 and day 7 were created for the 205 selected students and analyzed using boosted C5.0 decision trees to compute the CMS risk scores for day 3 and day 7. The baseline SIS risk score and the 3-day CMS risk score were then used to create the 3-day dynamic risk score; the baseline SIS risk score and the 7-day CMS risk score were used to create the 7-day dynamic risk score. The data analysis for both SIS and CMS used 10-fold cross-validation.

Course_Status is the actual outcome of the course, with 0 representing course dropout and 1 representing course completion. Baseline_Risk_Score is the risk score using the SIS dataset. A student is labeled at higher risk for dropout if the baseline score is 50 or higher. A baseline risk score of less than 50 suggests a lower risk of dropout (higher likelihood of course completion). All 205 records were individually reviewed to examine the baseline risk score against course status. Accuracy_Baseline_Score provides information about whether the prediction was correct or not using the following rule:

=IF(OR(AND(Baseline_Risk_Score>=50,Course_Status=0),AND(Baseline_Risk_Score<50,Course_Status=1)),"Correct","Incorrect")

The rule codes the variable Accuracy_Baseline_Score as “Correct” if Baseline_Risk_Score was greater than or equal to 50 and Course_Status was 0 (dropout), or if Baseline_Risk_Score was less than 50 and Course_Status was 1 (completion); otherwise Accuracy_Baseline_Score was coded as “Incorrect.” Using this rule, the dataset was recoded for Accuracy_Baseline_Score.

The dataset was reviewed for accuracy of the 3-day and 7-day dynamic risk scores. The variable Accuracy_3_Day_DRS (DRS stands for dynamic risk score) was used to check for the accuracy of the 3_Day_Dynamic_Risk_Score. The variable Accuracy_7_Day_DRS was used to check for the accuracy of the 7_Day_Dynamic_Risk_Score.

Accuracy_3_Day_DRS provides information about whether the prediction was correct or not using the following rule:

=IF(OR(AND(3_Day_Dynamic_Risk_Score>=50,Course_Status=0),AND(3_Day_Dynamic_Risk_Score<50,Course_Status=1)),"Correct","Incorrect")

Accuracy_7_Day_DRS provides information about whether the prediction was correct or not using the following rule:

=IF(OR(AND(7_Day_Dynamic_Risk_Score>=50,Course_Status=0),AND(7_Day_Dynamic_Risk_Score<50,Course_Status=1)),"Correct","Incorrect")

The fields of Accuracy_3_Day_DRS and Accuracy_7_Day_DRS were coded as “Correct” if the above rule conditions were met. If the rule conditions were not met, then the fields were coded as “Incorrect.”
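The three spreadsheet rules above share the same structure; the sketch below expresses that structure as a single Python function with the 50-point cutoff used in the text.

def prediction_accuracy(risk_score, course_status, cutoff=50):
    """Label a prediction 'Correct' or 'Incorrect'.

    A record is correct when a high risk score (>= cutoff) coincides with an
    actual dropout (course_status == 0) or a low risk score (< cutoff)
    coincides with completion (course_status == 1)."""
    if (risk_score >= cutoff and course_status == 0) or \
       (risk_score < cutoff and course_status == 1):
        return "Correct"
    return "Incorrect"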

Table 6.9 provides the accuracy of the baseline risk score, the 3-day dynamic risk score, and the 7-day dynamic risk score on the evaluation dataset. As indicated in the table, 82.4 % of records were correctly predicted using the baseline risk score, whereas the accuracy of the 3-day dynamic risk score (which combines the baseline and day 3 CMS risk scores) was 78.5 %, lower than the baseline risk score alone. This lower accuracy resulted from the weaker performance of the 3-day CMS dataset, indicating that extracting CMS data at the end of day 3 does not improve on the baseline accuracy provided by the SIS dataset. The table also includes the accuracy of the 7-day dynamic risk score, derived from the baseline risk score and the 7-day CMS risk score. The accuracy of the 7-day dynamic risk score was 86.3 %, higher than both the baseline and the 3-day dynamic risk scores, indicating that by day 7 the CMS has accumulated sufficient data to improve the overall prediction.

Table 6.9 Accuracy of baseline, 3-day dynamic, and 7-day dynamic risk scores

6.4.5 Comparison of Pre-census and Post-census Dropout Accuracy

Since the university provides a full refund to students who drop out before the census date, it is important to investigate how accurate the risk scores are for students who drop out before and after the census date. There were 72 records for dropout students (out of a total of 205 records) in the evaluation dataset; 46 of those 72 students dropped before the census date, while the remaining 26 dropped after the census date.

Table 6.10 presents the pre-census and post-census dropout accuracy for the baseline risk score, 3-day dynamic risk score, and 7-day dynamic risk score. The baseline risk score was 91.3 % accurate in identifying students who dropped out before the census date. The 7-day dynamic risk score had the highest accuracy, 88.46 %, for students who dropped out after the census date. The 3-day dynamic risk score was only marginally accurate for both pre- and post-census dropouts, which is likely related to insufficient student activity through day 3. Based on this analysis, both the baseline risk score and the 7-day dynamic risk score are important for the accurate identification of dropout students.

Table 6.10 Comparison of pre-census and post-census dropout accuracy

6.4.6 Recommendation for Deployment of Predictive Models

Deployment is the final phase in the CRISP-DM process. The data analysis and evaluation phases indicated that the boosted C5.0 decision tree model was the most suitable for both the SIS and CMS datasets to predict at-risk students and to compute baseline and dynamic risk scores. To deploy the boosted C5.0 SIS and CMS predictive models, IBM Modeler and a database server, such as SQL Server, can be used. IBM Modeler can export the model in PMML, which can be loaded into a database server to score new datasets. A web-based interface connected to a back-end data source containing the risk scores can serve as the primary user interface for looking up student risk scores. The risk scores and the decision rules from the boosted C5.0 decision trees can serve as rules for triggering alerts. For example, a rule could specify that if the predictive model assigns a student a risk score of over 80, an e-mail alert is sent to the instructor and retention personnel.
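As an illustration only (the study proposes IBM Modeler, PMML export, and a SQL Server back end), an alert rule of the kind just described could be expressed as the hypothetical Python sketch below; the threshold, addresses, and mail-sending function are placeholders.

RISK_ALERT_THRESHOLD = 80  # institution-defined cutoff from the example above

def check_and_alert(student_id, risk_score, send_email):
    """Send an e-mail alert to the instructor and retention personnel when a
    student's current risk score exceeds the threshold.

    send_email is assumed to be provided by the institution's mail gateway."""
    if risk_score > RISK_ALERT_THRESHOLD:
        send_email(
            to=["instructor@example.edu", "retention@example.edu"],
            subject=f"Dropout risk alert for student {student_id}",
            body=f"Current dropout risk score: {risk_score}",
        )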

The purpose of early alerts and meaningful recommendations is to provide early intervention and prevent dropout before it occurs. This approach has been noted in key retention studies (Seidman 2005). As a possible application of the predictive models, this study proposes an early alert and recommender system, drawing on background research in recommender systems and student retention. The Seidman retention formula is based on Tinto’s model (Tinto 1987, 1993) and emphasizes the role of early identification and intervention in improving student retention. The formula states:

Retention = early identification + (early + intensive + continuous) intervention

The early identification of at-risk students allows for early intervention. The predictive models and risk scores developed in the previous section help identify students who are at risk of dropping out. They can provide insights that can be developed into recommendations for intervention. Accurate early identification of at-risk students, with effective corresponding recommendations, is helpful for intervention that is early, intensive, and continuous. Early identification of at-risk students is achieved through the SIS based predictive model (predictive model I) and baseline risk score when the student enrolls in a course. As the course progresses and the CMS predictive model (predictive model II) and dynamic risk score become available, there is an opportunity for the continuous monitoring of at-risk students. Predictive model I and predictive model II allow early identification of at-risk students and can help create student-specific recommendations, leading to early, continuous, and effective interventions. In this systemic view:

Intervention (early + intensive + continuous) = At-risk student identification (early + accurate + continuous) + student-specific recommendation

Figure 6.4 shows the predictive analytics based intervention.

Fig. 6.4 Predictive analytics based intervention

This study presents a framework for a recommender system, which uses rule-based predictive analytics. There are additional steps in deployment for creating early alerts and recommendations using the predictive models discussed in this section. The deployment steps for the recommender system are discussed in the next section.

6.4.7 Recommender System Architecture

The recommender system comprises student rules, system rules, and a recommendation repository. Student rules and system rules drive the early alert system, and alerts are combined with the appropriate recommendation from the recommendation repository. A system rule, for example, is the minimum risk score (a value preselected by the institution) at which an alert will be triggered. The alert system sends e-mails to retention personnel, the instructor, and the student that include recommendations from the recommendation repository. If the student has not previously taken an online course, the e-mail will recommend web-based CMS tutorials and links to student support services. If a student does not have financial aid, the early alert system can send an e-mail alert to financial aid and retention personnel so that the student can be apprised of relevant information and assistance. Previous alerts and recommendations can be stored in a repository so they can be tracked by instructors and retention personnel through a web-based interface. The recommender system rules, the predictive models, and the risk score/recommendation repository are used to build the web-based interface of the recommender system. Such a system can be built using a back-end database server, such as SQL Server or Oracle, with a web programming language (PHP, ASP.NET, Java Server Pages, etc.). Figure 6.5 presents the general recommender system architecture.

Fig. 6.5 Recommender system architecture
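As a minimal, hypothetical sketch of the rule-to-recommendation mapping described above (the rule names and recommendation texts are placeholders, not the study's actual repository):

# Hypothetical recommendation repository keyed by the triggering student rule.
RECOMMENDATIONS = {
    "no_previous_online_course": "Review the web-based CMS tutorials and student support links.",
    "no_financial_aid": "Contact the financial aid office for information and assistance.",
}

def recommend(student):
    """Return the recommendations that apply to one student record (a dict of flags)."""
    advice = []
    if not student.get("prev_online_course"):
        advice.append(RECOMMENDATIONS["no_previous_online_course"])
    if not student.get("financial_aid"):
        advice.append(RECOMMENDATIONS["no_financial_aid"])
    return advice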

6.5 Conclusion

This study identified four primary research objectives. Each objective of the study was met, as described below.

1. Identify and analyze various SIS and CMS based predictors relative to their strengths in predicting dropout risk for students in online courses.

   • This study identified 10 SIS independent variables from the literature review that have been shown to affect student dropout, and it concludes that all of these SIS variables play a role in student dropout. The study also identified 7 CMS variables: total logins, course prefix, course level, total time spent, course credit hours, days since last login, and course discipline. The analysis showed that the end-of-day-7 CMS dataset provides sufficient information to develop a predictive model, with total logins, days since last login, and total time spent as significant predictors.

2. Evaluate various statistical and machine learning techniques (such as neural networks, SVM, and hybrid methods) for their predictive accuracy in building predictive models that leverage both SIS (historical) and CMS (time-variant) data.

   • Based on this comparative analysis, the C5.0 decision tree algorithm performed best, and its predictive performance was further enhanced using boosting. The boosted C5.0 decision tree models provided the best accuracy and performance for both the SIS and CMS datasets.

3. Construct a baseline predictive model using SIS variables and a dynamic predictive model using both SIS and CMS variables to model dynamic dropout risk in the form of a risk score.

4. Propose a recommender system architecture based on the predictive models to identify at-risk students and to offer meaningful alerts and recommendations based on their risk score.