Keywords

1 Introduction

Likewise, most industries, data also plays an important role in higher education. Apart from traditional data, nowadays the education organizations are collecting data from social media and location-based streams. The collected data can be used to mine and construct predictive analytical models to enhance students’ success rate. The data mining and analytics software can be utilized to provide immediate outcomes to the instructors about the learner’s academic performance. Such tools can analyze patterns and predict outcomes such as potential cases of dropping out, requiring extra assistance or demanding challenging assignments [1]. Each learner has different behavior in different modules due to varying strengths and weaknesses. Predictive analytics in education can assist in differentiated learning [2]. Nowadays, higher education institutions are investing in building predictive analysis tools. A prediction model was built in the Nottingham Trent University (NTU) [3]. The application pointed out the four important factors such as library usage, attendance on campus, attendance in the tutorial and the use of the study portal. The software triggers the NTU instructor when the student’s rate of engagement is decreasing. The use of predictive analysis can be useful for institutions to raise their profit, revenue, and financial planning. Education organizations are adopting data science practices to predict enrollment trends and operational needs [4]. Moreover, the number of dropouts has become a serious issue for the education organization. Completion and persistence rates are important because they measure how well an institution is serving its students.

According to the National Student Clearinghouse Research Center (2016), the persistence rate from year one to year two is 72.1%. Persistence rates continue to decline during and after the second year of colleges. The total completion rate of students who start college and completed within six years is 54.8% nationally. The persistence and completion rates are lower for part-time students. Rates of completion and persistence vary between two and four-year colleges, public and private, full-time versus part-time students [5]. In order to contribute to the domain of predictive analytics in higher education, in this research, we carried out an in-depth review of the research in the area and compared machine learning algorithms to evaluate their accuracy rate. The rest of this paper is structured as follows. Section 2 elaborates the predictive analytics models and other approaches developed using machine learning algorithms to achieve a wide variety of goals in the education sector. Section 3 briefly describes the methodology of this research. Section 4 contains a detailed discussion of our experiment results. Finally, we have discussed the future direction of this research in Sect. 5.

2 Related Work

A predictive analysis system was built to measure student satisfaction level towards an online course program based on data collected from students who enrolled in the summer-session at western university [6]. Each of the learners was given a survey that included questions on demographics, student satisfaction, and five predictor variables. Correlation, regression, and anova analysis were used to build the prediction model. The study found that the interaction framework with the inclusion of two predictors is internet self-efficacy and self-regulation. Instructors are encouraged to design more collaborative activities to enhance learners’ interaction, and student satisfaction level can be improved by performing internet related-tasks. Furthermore, student retention is one of the major issues in higher institutions, especially in the online course program. Several researchers found that the rate of dropouts is increasing [7,8,9]. Shimin et al. [10] built prediction models using RapidMiner 5.3 that can predict whether a student will engage further after registering to a specific course. The prediction variables are built using J-48 and J-Rip decision tree [10]. The two algorithms tested resulted in producing a high-performance model that provides indicators for predicting the future of a student who has registered in a specific course program. A study was conducted in Delhi Technological University on building predictive analytics model based on the data collected from the National Informatics Center Delhi with the goals of predicting which students will enroll in the particular course and what is the current demand, which programs are trending, and which are becoming obsolete? [11]. The prediction models were built using the decision tree and neural network techniques. The outcomes show that the highest precision indicators were found in decision tree modeling.

The Social Networks Adapting Pedagogical Practice (SNAPP) is a predictive analytics software that is used by University of Wollongong to generate data visualization of user interaction, activity patterns of behavior on the forum [12]. SNAPP is used to map learners’ level on engagement and activity to identify learners who are at risk of failing a subject due to lower participation. The tools are used to generate data reports, which include monitoring login frequency, dwell time and number of downloads. Connect for Success is another example of predictive analysis software used by Edith Cowan University (ECU), the system works based on enrolment data, and it is an early warning tool that is used to improve learner success and to improve graduation rates [13]. Automated Wellness Engine (AWE) is an alert system that is designed and built to improve learner engagement and retention rates at the University of New England. AWE is a software-based prediction model that is using emoticons to identify activity embedded in the student portal (myUNE) and another system that identify learners’ interaction with the university and instructor. Based on the data collected by AWE, it is smart enough to predict learners who are at high-risk or struggling or may be experiencing disengagement from their course [14]. Open University Australia (OUA) developed Personalized Adaptive Study Success (PASS). PASS is a software tool that is used for predictive analysis to enhance student engagement and retention in an online environment. The model was built based on the individual characteristic, social web, curriculum and physical data collected from some systems [15]. The software assists learners to be aware of their upcoming academic performance and suggests what the students could do better to improve their performance.

3 Methodology

The data was obtained from the Kaggle [16, 17]. It contains 480 student records in rows and 16 features in the columns. The features are classified into three major groups such as Demographic Features (DF), Academic Background Features (AF), and Behavioral Features (BF). The demographic features are nationality, gender, place of birth, and parent responsible for the student. The academic background features consist of the educational stage, grade level, section ID, semester, topics and student absence days. The behavioral features are discussion groups, raised a hand in class, opening resources, viewing announcements, answering the survey by parents and parent school satisfaction. We have utilized these features using traditional classification methods such as random forest and Naive Bayes. Moreover, we applied ensemble algorithms to choose the correct features and to predict the students’ performance with high accuracy. The data provider already pre-processed the data. There were no missing fields. The data was pre-processed again before the analysis using R software. The data was split into 75% for training and 25% for testing.

4 Results and Discussion

We used a combination of ensemble techniques that will improve the accuracy of machine learning algorithms results by reducing the variance (bagging), bias (boosting) and improve the accuracy of the prediction (stacking). However, we have compared the results using traditional methods such as random forest and naive Bayes. We tested the behavioral features such as group discussion and resources visited to determine whether these features are contributing factors in students’ performance. Firstly, we checked the importance of the features of the dataset. The variables are shown in Fig. 1 circular bar chart and valued by importance in Table 1.

Fig. 1.
figure 1

The categories of the features.

Table 1. The importance of features.

We conclude that behavioral features are the most important features among the three groups of the features from the circular bar chart as shown in Fig. 1. Secondly, academic background features and finally demographic features. We used traditional classification methods to classify the students’ performance and to determine the independent variables that offer the highest accuracy. The students’ performance was classified using the random forest and Naive Bayes using all the independent characters. The most important 10 features are highlighted with bold font in Table 1. The best (10, 11 and 12) features were selected to classify the performance of the students. The data with the 11 features using random forest gave the best accuracy value (0.7333) as shown in Table 2. The more features taken does not promise higher accuracy. However, we must choose the correct features to improve the accuracy of the algorithms.

Table 2. Accuracy using Random forest and Naive Bayes based on the importance of the features.

All six behavioral variables were not selected as features to classify the performance in this case. However, we tried using one behavioral feature either group discussion or visited resources. The accuracy increased to 0.6917 when the feature visited resources was included in the classification algorithms. This shows that the behavioral variable visited resources is an important feature, and it should be included in the feature selection. This can be seen in Tables 1 and 3. It has established that the better importance of the features improves accuracy. Finally, we have utilized three ensemble methods namely boosting, bagging and stacking to improve the accuracy using all the sixteen features. The boosting machine learning algorithm namely basic tree model (C5.0) and stochastic gradient Boosting model (gbm) were used. The bagging algorithms such as bagged CART (treebag) and random forest (rf) were used. The results are given in Table 4. The ensemble stacking sub-models namely random forest, gbm, K Nearest Neighbors (KNN) and Linear Discriminant Analysis (LDA) were used and the results can be seen in Table 5. The accuracy of bagging random forest had outperformed the other methods with the accuracy of 0.7959. In this study, the ensemble bagging random forest had given the best result. We need to take more combination of features to increase the accuracy of the performance of the student. We can also improve the result by including more features such as hours spent in the module and the students’ interest in the particular module to classify the students’ performance in the future study.

Table 3. Accuracy with and without behavioral features using Random forest and Naive Bayes.
Table 4. Accuracy using boosting and bagging ensemble algorithm.
Table 5. Accuracy using an ensemble stacking algorithm.

5 Future Direction

Likewise, most of the recent research trends of technology such as Big Data, Cloud Computing, and Edge Computing, predictive analytics has several barriers to its implementation in education organization and various other sectors where security, privacy, quality and transparency of data play a critical role. Both instructors and students have raised data privacy and ownership concerns. The management must undertake the initiative to ensure ethics and permissions are maintained at all levels. As an example the University of California has addressed the concern of how students data privacy is violated, as the institution use a third party service platforms to gather data, the management realized that the vendors were using the student data to commercialize or sell their product [18]. There is an utmost need to formulate secure and privacy preserved data collection as well as analysis techniques. Most importantly, users of the data must be aware of the fair usage and treatment of their records. In education organizations, predictive analytics must be carried out in compliance with legal standards such as FERPA. Another concern raised on predictive analytics is the ethics, transparency and legal compliance on the use of data. Furthermore, data must be gathered from trusted sources, and systems must be in place to ensure that data is reliable, and analyzing such data would lead to suitable insights and appropriate actions. Predictive analytics also requires domain experts as well as data scientists [15]. The human brain has to make the decision, not analytic tools, hence the number of experts, background and wisdom do matter in making a decision. Data is just a help to produce a result. Lack of vision and familiarity is considered as the major barriers to predictive analytics.

Formulating predictive analysis modeling requires a lot of thought process to address the problem and the goal of predictive analytics modeling. The lack of data warehousing among institution makes the process of predictive analysis as a challenge. An institution cannot get the right type of data if there is no central repository of data that is accessible and transparent. The lack of data warehousing is also affecting the complexity of analytical tools. One of the major barriers to implementing data analytics in higher education is cost. Management tends to view analytics as an investment as the tools are expensive and tend to think it will not return the investment. To come up with predictive analysis, the first spending would be in human resources such as hiring a data analyst. The second comes to the data warehousing which includes the infrastructure, platform and services that are being purchased to build predictive analytics [19]. Apart from these barriers, the accuracy of the predictive model is also very important to make better decisions. Depending on the features of the dataset, a predictive model should be generated by trying a combination of various machine learning algorithms, and the model should be validated to obtain optimum accuracy. For example, in this research, based on our experience of data analytics, first, we found that random forest and naive bayes gave the accurate results, and then we enhanced the accuracy by using ensemble techniques with random forest. In the future direction of this research, we plan to use advanced deep learning algorithms to build a predictive analytics model for higher education.