1 Introduction

Data mining is rapidly becoming a part of software engineering projects, and standard methods are constantly revisited to integrate the software engineering point of view. Data mining can be defined as the extraction of data from a dataset and the discovery of useful information from it [28, 34]. This is followed by the analysis of the collected data in order to enhance the decision-making process [17]. Data mining uses different algorithms to uncover patterns in data [1]. These techniques have proved to be effective solutions in a variety of fields including education, network security, and business [29, 50, 66]. Hence, they have the potential to also be effective in other fields such as medicine. Educational Data Mining (EDM), a sub-field of data mining, has emerged that specializes in educational data with the goal of better understanding students’ behavior and improving their performance [12, 22]. Moreover, this sub-field also aims at enhancing the learning and teaching processes [17]. EDM often takes into consideration various types of data, such as administrative data, students’ performance data, and student activity data, to gain insights and provide appropriate recommendations [35, 48].

The rapid growth of technology and the Internet has introduced interactive opportunities that help the education field improve the teaching and learning processes. In turn, this has led to the emergence of the field of e-learning. This field can be defined as “the use of computer network technology, primarily over an intranet or through the Internet, to deliver information and instruction to individuals” [33, 61]. There are various challenges facing e-learning platforms and environments [49]. These include the assorted styles of learning and challenges arising from cultural differences [16]. Other challenges also exist, such as pedagogical e-learning, technological and technical training, and e-learning time management [38]. To this end, personalized learning has emerged as a necessity in order to better cater to learners’ needs [30]. This personalization process is a challenging task [13], as it requires adapting courses to meet different individuals’ needs. This calls for adaptive techniques to be implemented [8, 14], which can be done by automatically collecting data from the e-learning environment [8] and analyzing the learner’s profile to customize the course according to the participant’s needs and constraints, such as his/her location, language, currency, seasons, etc. [8, 44, 46].

Many of the previous works in the literature focused on predicting students’ performance by adopting a binary classification model. However, some educators prefer to identify not only two classes of students (i.e. Good vs. Weak), but instead divide the students into several groups and consider the associated multi-class classification problem [58]. This is usually done because the binary model often identifies a large number of weak students, many of whom are not truly at risk of failing the course. Accordingly, this work considers two datasets at two different stages of the course, namely at 20% and 50% of the coursework, and divides the students into three groups, namely Weak, Fair, and Good students. The datasets are then analyzed as a set of multi-class classification problems.

Multi-class classification problems can be solved by naturally extending the binary classification techniques of some algorithms [3]. In this work, we consider various classification algorithms, compare their performances, and use Machine Learning (ML) techniques aiming to predict the students’ performance as accurately as possible. Indeed, we consider K-nearest neighbor (k-NN), random forest (RF), Support Vector Machine (SVM), Multinomial Logistic Regression (LR), Naïve Bayes (NB), and Neural Networks (NN), and use an optimized systematic ensemble model selection approach coupled with ML hyper-parameter tuning using grid search optimization.

In this paper, we built a bagging of each type of model, and these baggings, rather than single models, were used to construct the ensembles. Bagging is itself an ensemble algorithm: it groups several models of the same type and defines a linear combination of the individual predictions as the final prediction on an external test sample, as explained in Section 6. Bagging is one of the best procedures to improve the performance of classifiers as it helps reduce the variance in many hard decision problems [10, 52]. The empirical fact that bagging improves classifiers’ performance is widely documented [9], and in fact ensemble methods have placed first in many prestigious ML competitions, such as the Netflix Competition [54], KDD 2009 [24], and Kaggle [32]. Furthermore, a multi-split framework is considered for the studied datasets in order to reduce the bias of the ML models investigated as part of the bagging ensemble models.

The main disadvantage of bagging, and of other ensemble algorithms, is the lack of interpretability. For instance, a linear combination of decision trees is much harder to interpret than a single tree. In the same way, bagging several variable selections gives few clues about which of the predictor variables are actually important. In this paper, in order to have a rough idea of which variables are the best predictors for each algorithm, we average, for each variable, its importance over every model; this average is assigned to the variable and defined to be its averaged importance. This is done in order to better highlight the features that are truly important across the multiple splits under consideration.

The remainder of this paper is organized as follows: Section 2 presents some of the previous related work and their limitations; Section 3 summarizes the research contributions of this work; Section 4 describes the datasets under consideration and defines the corresponding target variables for both datasets; Section 5 describes the performance measurement approach adopted; Section 6 presents the methodology used to choose the best classifiers for the multi-class classification problem; Section 7 discusses the architecture used for training NN and shows the features’ importance for each classifier for each dataset; Section 8 presents and discusses the experimental results both in terms of Gini Indices (also called Gini coefficient) and by using confusion matrices; and finally, Section 9 lists the research limitations, proposes multiple future research opportunities, and concludes the paper.

2 Related work and limitations

2.1 Related work

Educational data mining has become a rich field of research, with the demand for empirical studies by academia increasing in recent years. This is due to the competitive advantages that can be gained from such research. Data mining can be used to evaluate and analyze the different factors that improve learners’ knowledge gain and skills, and it enables educational institutions to offer a better learning experience with highly qualified students or trainees [60].

Several researchers have explored the use of data mining techniques in an educational setting. The authors of [37] used data mining techniques to analyze learners’ web usage and content-based profiles to build an on-line automatic recommendation system. In contrast, Chang et al. proposed a k-NN classification model to classify the learner’s style [11]. The results of this model were used to help educational institution management and faculty improve the courses’ contents to satisfy the learners’ needs [11].

Another related study, presented in [26], used simple linear regression to check the effect of the student’s mother’s education level and the family’s income on the learner’s academic level.

On the other hand, Baradwaj and Pal used classification methods to evaluate the students’ performance using decision trees [6]. The study was conducted using data collected from the previous year’s database to predict the students’ results at the end of the current semester. Their study aimed to provide a prediction that helps the next term’s instructors identify students who may need help.

Other researchers [7] applied Naïve Bayes classification algorithm to predict students’ grades based on their previous performance and other important factors. The authors discovered that, other than students’ efforts, factors such as residency, the qualification standards of the mother, hobbies and activities, the total income of the family, and the state of the family had a significant effect on the students’ performance.

Later, the same authors used Iterative Dichotomiser 3 (ID3) decision tree algorithm and if-then rules to accurately predict the performance of the students at the end of the semester [56] based on different variables like Previous Semester Marks, Class Test Grades, Seminar Performance, Assignments, Attendance, Lab Work, General Proficiency, and End Semester Marks.

Similarly, Moubayed et al. [51, 53] studied the student engagement level using the K-means algorithm and derived a set of rules that relate student engagement to academic performance using the Apriori association rules algorithm. The results analysis showed a positive correlation between students’ engagement level and their academic performance in an e-learning environment.

Prasad et al. [57] used J48 (C4.5) algorithm and concluded that this algorithm is the best choice for making the best decision about the students’ performance. The algorithm was also preferred because of its accuracy and speed.

Ahmed and Elaraby conducted a similar research in 2014 [2] using classification rules. They analyzed data from a course program across 6 years and were able to predict students’ final grades. In similar fashion, Khan et al. [36] used J48 (C4.5) algorithm for predicting the final grade of Secondary School Students based on their previous marks.

Kostiantis et al. [40] proposed an incremental majority voting-based ensemble classifier based on 3 base classifiers, namely NB, k-NN, and Winnow algorithms. The authors’ experimental results showed that the proposed ensemble model outperformed the single base models in a binary classification environment.

Saxena [62] used the k-means clustering and J48 (C4.5) algorithms and compared their performance in predicting students’ grades. The author concluded that the J48 (C4.5) algorithm is more efficient, since it gave higher accuracy values than the k-means algorithm. The authors in [59] used and compared the K-means and hierarchical clustering algorithms. They concluded that the K-means algorithm is preferable to hierarchical clustering due to its better performance and faster model building time.

Wang et al. proposed an e-Learning recommendation framework using deep learning neural networks model [65]. Their experiments showed that the proposed framework offered a better personalized e-learning experience. Similarly, Fok et al. proposed a deep learning model using TensorFlow to predict the performance of students using both academic and non-academic subjects [21]. Experimental results showed that the proposed model had a high accuracy in terms of student performance prediction.

Asogbon et al. proposed a multi-class SVM model to correctly predict students’ performance in order to admit them into the appropriate faculty program [4]. The performance of the model was examined using an educational dataset collected at the University of Lagos, Nigeria. Experimental results showed that the proposed model adequately predicted the performances of students across all categories [4].

In a similar fashion, Athani et al. also proposed the use of a multi-class SVM model to predict the performance of high school students and classify them into one of five letter grades A-F [5]. The goal was to predict student performance to provide a better illustration of the education level of the schools based on their students’ failure rate. The authors used a Portuguese high school dataset consisting mostly of the students’ socio-economic descriptors as features. Their experiments showed that the proposed multi-class SVM model achieved high prediction accuracy close to 89% [5].

Jain and Solanki proposed a comparative study between four tree-based models to predict the performance of students based on a three-class output [31]. Similar to the work of Athani et al., the authors in this work also considered the Portuguese high school dataset consisting mostly of the students’ socio-economic descriptors as features. Experimental results showed that the proposed tree-based model also achieved high prediction accuracy with a low execution time [31].

2.2 Limitations of related work

The limitations of the related work can be summarized as follows:

  • Do not analyze the features before applying any ML model. Any classification model is directly applied without studying the nature of the data being considered.

  • Mostly consider the binary classification case. Such cases often lead to identifying too many students who are not truly in danger of failing the course and hence do not need as much help and attention. Even when multi-class models were considered, the features used were mostly focused on students’ socio-economic status rather than their performance in different educational tasks.

  • Often use a single classification model or an ensemble model built upon a randomly chosen group of base classifiers. Moreover, to the best of our knowledge, only majority voting-based ensemble models are considered.

  • Often predict the performance of students from one course to the other or from one year to the other. Performance prediction is rarely considered during the course delivery.

  • Often use the default parameters of the utilized algorithms/techniques without optimization.

3 Research contribution

To overcome the limitations presented in Section 2.2, our research aims to predict the students’ performance during the course delivery, as opposed to previous works that perform the prediction at the end of the course. The multi-class classification problem assumes that there is a proportional relationship between the students’ efforts and seriousness in the course and their final course performance and grade.

More specifically, our work aims to:

  • Analyze the collected datasets and visualize the corresponding features by applying different graphical and quantitative techniques (e.g. dataset distribution visualization, target variable distribution, and feature importance).

  • Optimize hyper-parameters of the different ML algorithms under consideration using grid search algorithm.

  • Propose a systemic approach to build a multi-split-based (to reduce bias) bagging ensemble (to reduce variance) learner to select the most suitable model depending on multiple performance metrics, namely the Gini index (for better statistical significance and robustness) and the target class score.

  • Study the performance of the proposed ensemble learning classification model on multi-class datasets.

  • Evaluate the performance of the proposed bagging ensemble learner in comparison with classical classification techniques.

Note that in this work, the term Gini index refers to the Gini coefficient that is calculated based on the Lorenz curve and area under the curve [43]. Therefore, the remainder of this work adopts the term Gini index.

4 Dataset and target variable description

4.1 Dataset description

In this section, the two datasets under consideration are described at the two course delivery stages (20% and 50% of the coursework). They correspond to the results of a series of tasks performed by university students. Moreover, Principal Component Analysis (PCA) is conducted to better visualize the considered datasets.

  • Dataset 1: The experiment was conducted at the University of Genoa on a group of 115 first year engineering major students [63]. The dataset consists of data collected using a simulation environment named Deeds (Digital Electronics Education and Design Suite). This e-Learning platform allows students to access the courses’ contents using a special browser and asks the students to solve problems that are distributed over different complexity levels.

    Table 1 shows a summary of the different tasks for which the data was collected. It is worth mentioning that 52 students out of the original 115 students registered were able to complete the course.

    Table 1 Dataset 1 - Features

    The 20% stage consists of the grades of tasks ES 1.1 to ES 3.5. On the other hand, the 50% stage consists of tasks ES 1.1 to ES 5.1.

    To improve the accuracy of the classification model, empty marks were replaced with 0. Moreover, all tasks’ marks were converted to a scale out of 100. Furthermore, all decimal marks were rounded to the nearest integer to maintain consistency.

  • Dataset 2: This dataset was collected at the University of Western Ontario for a second year undergraduate Science course. The dataset is composed of two main parts. The first part is an event log of the 486 students enrolled. This event log dataset consists of 305933 records. In contrast, the other part, which is under consideration in this research, is the grades of the 486 students in the different evaluated tasks. This includes assignments, quizzes, and exams.

    Table 2 summarizes the different tasks evaluated within this course. The 20% stage consists of the results of Assignment 01 and Quiz 01. On the other hand, the 50% stage consists of the grades of Quiz 01, Assignments 01 and 02, and the midterm exam.

    Table 2 Dataset 2 - Features

    Similar to Dataset 1, all empty marks were replaced with a value of 0 for better classification accuracy. Moreover, all marks were scaled out of 100. Additionally, decimal marks were rounded to the nearest integer.

4.2 Target variable description

For the two datasets under consideration, the target variables were constructed from the final grade. More specifically, the students were grouped into three groups as follows (a minimal sketch of this mapping is given after the list):

  1. Good (G) – the student will finish the course with a good grade (70 − 100%);

  2. Fair (F) – the student will finish the course with a fair grade (51 − 69%);

  3. Weak (W) – the student will finish the course with a weak grade (≤ 50%).
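
A minimal R sketch of this mapping is given below (the handling of grade boundaries, e.g. a mark of exactly 70, is an assumption for illustration, since all marks are rounded to integers anyway):

```r
# Minimal sketch: derive the three-class target variable from the final grade (0-100).
# Boundary handling is illustrative: Weak <= 50, Fair 51-69, Good 70-100.
grade_to_class <- function(final_grade) {
  cut(final_grade,
      breaks = c(-Inf, 50, 69, 100),
      labels = c("W", "F", "G"))
}

grade_to_class(c(35, 55, 88))   # returns W, F, G
```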

In this case, the target group is the Weak students (W), who are predicted to receive a mark below 50%, meaning that they are at risk of failing the course. Figure 1 shows that Dataset 1 is small sized and Dataset 2 is unbalanced. These two issues make the classification problem more challenging. It can be seen that for the first dataset, the three classes are relatively evenly distributed, but each class consists of only a few students. On the other hand, the second dataset is not small sized but is strongly unbalanced, having only 8 Weak students out of 486 students.

Fig. 1 Dataset 1 and Dataset 2 - Target Variables

To better visualize the three classes, we applied PCA to the datasets (both considered at Stage 50%), as shown in Figs. 2 and 3. Looking at these two figures, we note that it is possible to draw a boundary that separates Weak students from the rest of the students, whereas Fair and Good students are too close to be separated by a boundary. We will see in the next sections that the performance of the models is affected by this distribution and that most of the algorithms fail to distinguish between Fair and Good students, especially for Dataset 1.

Fig. 2 Dataset 1 - multi-class target visualization

Fig. 3 Dataset 2 - multi-class target visualization

5 Performance evaluation metrics description

In general, there are two standard approaches to choosing multi-class performance measures [3, 25]. One approach, namely OVA (One-versus-all), reduces the problem of classifying among N classes into N binary problems; in this case, every class is discriminated from all the other classes. In the second approach, called AVA (All-versus-all), each class is compared to each other class. In other words, it is necessary to build a classifier for every pair of classes, i.e. \(\frac {N(N-1)}{2}\) classifiers, while discarding the rest of the classes.

Due to the size of our datasets, we chose to follow the first method as opposed to the second one. In fact, if we were to use the second approach for Dataset 1, we would need to train three binary models, one for each pair of classes (G,F), (F,W), and (G,W). In particular, the subset of data for the (F,W) model would consist of only 28 students, which would be split into Training Sample (70%) and Test Sample (30%). This corresponds to training a model using 20 students and testing it using only 8 students. Due to the relatively small size of the (F,W) model, we determined that the AVA approach would not be suitable for accurate prediction.

It is well-known that the Gini Index metric, as well as the other metrics (Accuracy, ROC curve etc.) can be generalized to the multi-class classification problem. In particular, we choose the Gini Index metric instead of the Accuracy because the latter depends on the choice of a threshold whereas the Gini Index metric does not. This makes it statistically more significant and robust than the accuracy, particularly given that it provides a measure of the statistical dispersion of the classes [27].

In particular, we implemented a generalization of the Gini Index metric: during the training phase, we compute the Gini Index of each of the three one-vs-all binary classifications and optimize (i.e. maximize) the average of the three performances, i.e. the performances corresponding to classes G, F, and W.
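
As a rough illustration, this computation can be sketched in R as follows (a minimal sketch: it uses the rank-based AUC formula together with the relation Gini = 2·AUC − 1 given in Section 6, and assumes a score matrix with one named column per class):

```r
# Minimal sketch: averaged Gini Index over the three one-vs-all problems.
# AUC is computed with the rank (Mann-Whitney) formula; Gini = 2*AUC - 1.
gini_index <- function(score, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(score)                                               # ranks of all scores
  auc <- (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  2 * auc - 1
}

averaged_gini <- function(scores, actual, classes = c("G", "F", "W")) {
  # 'scores': matrix with one column of scores per class (named G, F, W);
  # 'actual': vector of true classes.
  per_class <- sapply(classes, function(cl) gini_index(scores[, cl], actual == cl))
  mean(per_class)
}
```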

6 Methodology

For the multi-class classification problem we used several algorithms. More specifically we explored RF, SVM - RBF, k-NN, NB, LR, and NN with 1, 2 and 3 layers (i.e. 3 different NN models), for a total of eight classifiers per dataset.

In order to achieve better performances, we did not build only one individual model for each algorithm; instead, we constructed baggings of classifiers. In fact, as explained in the previous section, bagging reduces the variance.

We built a bagging of models for each algorithm in the following way: we started by splitting each dataset into Training and Test samples in proportions 70%-30%, and then used the Training sample to build the baggings of models. More precisely, the Training sample was split 200 times into sub-Training and sub-Test samples at random, while forcing the percentages of Fair, Good, and Weak students to be the same as those in the entire dataset.

The models resulting from the 200 splits were trained on the sub-Training samples and inferred on the corresponding sub-Test samples. If the Averaged Gini Index was above a certain fixed threshold (the lowest acceptable Gini Index), then the model was kept; otherwise it was discarded. For each algorithm we obtained in this way a set of models having the best performances, and we averaged their scores on the (external) Test sample, class by class. This procedure is explained in Fig. 4.

Fig. 4 Bagging Ensemble Model Building Methodology
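
The procedure of Fig. 4 can be sketched as follows (a minimal R sketch under simplifying assumptions: train_model() and predict_scores() are hypothetical wrappers around the algorithm being bagged, averaged_gini() is the helper sketched in Section 5, and both the 70% sub-split proportion and the acceptance threshold of 0.5 are illustrative):

```r
# Minimal sketch of the bagging procedure for one algorithm (Fig. 4).
build_bagging <- function(train_x, train_y, test_x,
                          n_splits = 200, min_gini = 0.5) {
  kept_scores <- list()
  for (i in seq_len(n_splits)) {
    # stratified sub-split: keep the class proportions of the full dataset
    idx <- unlist(lapply(split(seq_along(train_y), train_y),
                         function(ix) sample(ix, size = round(0.7 * length(ix)))))
    model <- train_model(train_x[idx, ], train_y[idx])       # hypothetical wrapper
    sub_scores <- predict_scores(model, train_x[-idx, ])     # hypothetical wrapper
    # keep the model only if its Averaged Gini Index on the sub-Test sample is acceptable
    if (averaged_gini(sub_scores, train_y[-idx]) >= min_gini) {
      kept_scores[[length(kept_scores) + 1]] <- predict_scores(model, test_x)
    }
  }
  # bagged prediction: average the kept models' scores on the external Test sample
  Reduce(`+`, kept_scores) / length(kept_scores)
}
```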

Once we had the eight baggings of models (one for each algorithm), we considered all the possible ensembles that could be constructed with them and compared their performances in terms of Gini Index, as explained in Section 5. Moreover, for each dataset, we computed the p-values corresponding to each one of the 256 possible ensembles and aimed to choose as the final ensemble the one that had best Gini Index and, at the same time, that was statistically significant.

The Gini Index, also commonly referred to as the Gini coefficient, can be seen geometrically as the area between the Lorenz curve [43] and the diagonal line representing perfect equality. The higher the Gini Index, the better the performance of the model. Formally the Gini index is defined as follows:

Let F(z) be the cumulative distribution of z and let a and b be the lowest and the highest values of z respectively; then half of Gini’s expected mean difference can be calculated as:

$$ 2 {{\int}_{a}^{b}} F(z)[1-F(z)] dz $$
(1)

Alternatively, the Gini index can be calculated as 2 ∗ Area Under Curve − 1.

On the other hand, the statistical significance of our results is determined by computing p-values. The general approach is to test the validity of a claim, called the null hypothesis, made about a population. The alternative hypothesis is the one that would be believed if the null hypothesis is concluded to be untrue. A small p-value (≤ 0.05) indicates strong evidence against the null hypothesis, in which case the null hypothesis is rejected. For our purposes, the null hypothesis states that the Gini Indices were obtained by chance. We generated 1 million random scores from a normal distribution and calculated the p-value. The ensemble learners selected have p-values ≤ 0.05, indicating that there is strong evidence against the null hypothesis. Therefore, choosing an ensemble model using a combination of Gini Index and p-value allows us to have a more statistically significant and robust model.
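
One way such a test can be carried out is sketched below (illustrative only: it reuses the gini_index() helper from Section 5 and assumes one standard normal score vector per simulated run; the exact simulation details of our implementation may differ):

```r
# Minimal sketch: empirical p-value for an observed Gini Index under the null
# hypothesis that the scores carry no information (random normal scores).
# n_sim follows the text (1 million runs); reduce it for a quick check.
gini_p_value <- function(observed_gini, is_positive, n_sim = 1e6) {
  n <- length(is_positive)
  random_ginis <- replicate(n_sim, gini_index(rnorm(n), is_positive))
  mean(random_ginis >= observed_gini)   # fraction of random runs at least as good
}
```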

The classifiers were inferred on the test sample, giving as output three vectors of predictions to be analyzed. These three vectors express the chance that each student is classified as Weak, Fair, and Good. In order to build the confusion matrices, we fixed a threshold for each class, namely τF, τG, and τW. To determine each threshold, a one-vs-all method is considered for each class, with the threshold chosen as the score for which the point on the ROC curve is closest to the top-left corner (commonly referred to as the Youden Index) [20]. This is done in order to find the point that simultaneously maximizes the sensitivity and specificity.

For each student belonging to the Test sample, we defined the predicted class according to the following steps (a code sketch is given after the list):

  1. The 3 scores corresponding to the 3 classes were normalized in order to make them comparable.

  2. For each class, if the probability is higher than the corresponding threshold, then the target variable for the binary classification problem associated with that class is predicted to be 1; otherwise it is 0.

  3. In this way we obtained a 3-column matrix of 1’s and 0’s. Comparing the 3 predictions, if a student has only one possible outcome (i.e. only one 1 and two 0’s), then the student is predicted to belong to the corresponding class. Otherwise, if there is uncertainty because more than one 1 is predicted for the student, then the class with the highest score is chosen as the prediction.
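
A minimal sketch of the per-class threshold choice (the point of the one-vs-all ROC curve closest to the top-left corner) and of the three-step decision rule is given below (names are illustrative; the score matrix is assumed to have one normalized column per class):

```r
# Minimal sketch: per-class threshold via the ROC point closest to the (0, 1) corner.
best_threshold <- function(score, is_positive) {
  cand <- sort(unique(score))
  dist <- sapply(cand, function(t) {
    sens <- mean(score[is_positive] >= t)      # true positive rate at threshold t
    spec <- mean(score[!is_positive] < t)      # true negative rate at threshold t
    sqrt((1 - sens)^2 + (1 - spec)^2)          # distance to the top-left corner
  })
  cand[which.min(dist)]
}

# Three-step decision rule: one class above its threshold -> that class;
# zero or several classes above -> the class with the highest normalized score.
predict_class <- function(scores, thresholds, classes = c("F", "G", "W")) {
  apply(scores, 1, function(s) {
    above <- s[classes] >= thresholds[classes]
    if (sum(above) == 1) classes[above]
    else classes[which.max(s[classes])]
  })
}
```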

For instance, consider the following example:

Example 1

Suppose we have trained a classifier using 70% of Dataset 1. When we infer the model on the test sample (remaining 30%, consisting of 15 students), we obtain 3 vectors of scores, one for each class and we can compute their Gini Indices, see Fig. 5.

Fig. 5 Example - Averaged Gini Index Computation

In this example, the Gini Indices of Classes F, G, W are 97.2%, 76.8%, 98% respectively, hence the Averaged Gini Index is 90.7%.

We map the three scores linearly to the interval [0,1], i.e. we normalize them to make them comparable. The normalized scores are represented in Table 3 in columns score F, score G, score W.

Table 3 Example - Predicting Classes

The column Actual Class corresponds to the actual target variable that we aim to predict. Treating each score as if it were the score associated with a binary classification problem, we need to set a threshold for each class such that if the score is greater than the threshold then the student belongs to that class; otherwise he/she does not (i.e., he/she belongs to one of the other two classes). Therefore we set three thresholds τF, τG, and τW for Classes F, G, and W respectively. For instance, let τF = 0.267, τG = 0.323, and τW = 0.740. For student 1 in Table 3, the chance of being classified as F is 0.365 ≥ τF, whereas the probabilities of belonging to classes G and W are less than τG and τW respectively. In conclusion, once the three thresholds are set, we can claim that student 1 is a Fair student.

Student 6 has score F = 0.389 ≥ τF and score G = 0.620 ≥ τG so he/she belongs either to Class F or to class G. Since the scores are normalized and are comparable, we set the predicted class to be the one corresponding to the highest score, hence we predict student ID= 6 to belong to class G.

For student 2 (and similarly students 7 and 14), note that the three scores are all below the thresholds, so the predicted class is the one corresponding to the greatest score, i.e. the student is predicted as Weak.

The max probability associated to each student is expressed in column Max Pred., and if we compare this column with column Actual Class we note that taking the max score as the predicted class would not have been a good strategy.

By setting the three thresholds τF, τG, and τW and considering the max score only in case of uncertainty we obtained for each student a predicted class, expressed in column Predicted Class. If we compare the actual class with the predicted class we can build the corresponding confusion matrix (Table 4).

Table 4 Example - Confusion Matrix

The threshold for class W in Dataset 1 is typically higher than those for the other two classes due to a combination of two reasons. The first is that the test sample is fairly small. The second is that the number of class W instances is also small. Since the threshold is determined by finding the score that yields the ROC curve point closest to the top-left corner, the threshold has to be high in order to ensure that the class W points are identified correctly: because the number of class W points is low, missing one of them would result in a significant drop in sensitivity and specificity. Thus, the optimal threshold has to be high to identify and classify them correctly.

7 ML parameter tuning and application

We chose one algorithm for each area of ML, aiming to cover all types of classification methods, including tree-based (RF), vector-based (SVM-RBF), distance-based (k-NN), regression-based (LR), probabilistic (NB), and neural network-based (NN1, NN2, and NN3, with 5 neurons per layer). The corresponding bagging ensemble models consist of all possible combinations of the aforementioned base models. In Section 7.1, we explain how we train a NN. In the following sections, for each dataset, we show the impact of each variable on the performance of each classifier. As explained in Section 1, in order to understand which variables are the best predictors for each algorithm, we average, for each variable, its importance over every model, and this average is assigned to the variable as its averaged importance. In Section 8 we will show that the most important variables affect the performances of some classifiers.
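
As a rough sketch, this averaging of importances over the models of a bagging can be written as follows (the per-model importance vectors are assumed to come from whichever importance measure the base algorithm provides, e.g. caret::varImp; names and values are illustrative):

```r
# Minimal sketch: average each variable's importance across the models of a bagging.
# 'importance_list' is assumed to hold one named importance vector per kept model.
average_importance <- function(importance_list) {
  imp_matrix <- do.call(rbind, importance_list)   # models x predictors
  avg <- colMeans(imp_matrix)                     # averaged importance per predictor
  sort(avg, decreasing = TRUE)                    # rank predictors by averaged importance
}

# Hypothetical example with three bagged models and three predictors
set.seed(1)
example_list <- replicate(3, setNames(runif(3), c("ES1.1", "ES2.2", "ES3.3")),
                          simplify = FALSE)
average_importance(example_list)
```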

7.1 Neural network tuning

Finding the optimal number of neurons for NN is still an open field of research and requires a lot of computational resources. The authors in [64] summarize some formulas for the computation of the optimal number of hidden neurons Nh:

  • \(N_{h} = \frac {\sqrt {1+8 N_{i}}-1}{2}\)

  • \(N_{h} = \sqrt {N_{i} N_{o}}\)

  • \(N_{h} = \frac {4 {N_{i}^{2}} + 3}{{N_{i}^{2}} - 8}\)

where Ni is the number of input neurons (number of variables) and No is the number of output neurons (3 classes). Applying these formulas to our datasets at the two different stages, we obtained a number of neurons between 2 and 6. Considering that we adopted the early stopping technique in order to prevent over-fitting and reduce variance, we decided to choose this number in the high range of the interval [2,6] and set it equal to 5, instead of performing a full optimization (i.e., brute-force searching).
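
For illustration, the three rules of thumb can be evaluated directly; the input count Ni = 10 used below is only an assumption (the actual number of input features differs per dataset and stage), with No = 3 output classes:

```r
# Illustration of the three rules of thumb for the number of hidden neurons.
# Ni = 10 input features is an assumption for illustration; No = 3 output classes.
Ni <- 10; No <- 3
c(rule1 = (sqrt(1 + 8 * Ni) - 1) / 2,      # ~4.0
  rule2 = sqrt(Ni * No),                   # ~5.5
  rule3 = (4 * Ni^2 + 3) / (Ni^2 - 8))     # ~4.4
```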

The results obtained by using 1 hidden layer with 5 neurons were so promising that we decided to stress our hypothesis about early stopping and tried NN with 2 and 3 hidden layers with 5 neurons each, obtaining similar results.

Fig. 6 NN with 1 hidden layer

Fig. 7 NN with 2 hidden layers

Fig. 8 NN with 3 hidden layers

The NN models we built are shown in Figs. 6, 7 and 8.

The initialization of the weights of neural networks was implemented by using the Nguyen-Widrow Initialization Method [55] whose goal is to speed up the training process by choosing the initial weights instead of generating them randomly. Simply put, this method assigns to each hidden node its own interval at the start of the training phase. By doing so, during the training each hidden layer has to adjust its interval size and location less than if the initial weights are chosen randomly. Consequently, the computational cost is reduced.

Levenberg-Marquardt backpropagation was used to train the models. This algorithm was first introduced by Levenberg and Marquardt in [47], and is derived from Newton’s method, which was designed for minimizing functions that are sums of squares of nonlinear functions [45]. This method has been confirmed to be the best choice in various learning scenarios, both in terms of time spent and performance achieved [15]. Moreover, the datasets were normalized in input by mapping linearly to [− 1,1] (the activation function used in the input layer is the hyperbolic tangent) and in output to [0,1] (the activation function in the output layer is linear) in order to avoid saturation of neurons and make the training smoother and faster.
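
A minimal sketch of these linear scalings is given below (plain min-max mappings; per-feature application and degenerate constant columns are not handled):

```r
# Minimal sketch of the linear scalings described above:
# inputs mapped to [-1, 1] (tanh input layer), outputs mapped to [0, 1].
scale_input  <- function(x) 2 * (x - min(x)) / (max(x) - min(x)) - 1
scale_output <- function(y) (y - min(y)) / (max(y) - min(y))

scale_input(c(0, 25, 50, 100))    # -1.0 -0.5  0.0  1.0
```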

7.2 ML algorithms’ parameter tuning

Hyper-parameter tuning has become an essential step to improve the performance of ML algorithms. This is due to the fact that each ML algorithm is governed by a set of parameters that dictate its predictive performance [39]. Several methods have been proposed in the literature to optimize and tune these parameters such as grid search algorithm, random search, evolutionary algorithms, and Bayesian optimization method [29, 39].

This work adopts the grid search method to perform hyper-parameter tuning. Grid search is a well-known optimization method often used to tune the hyper-parameters of ML classification techniques. Simply put, it discretizes the values of the set of techniques’ parameters [39]. For every possible combination of parameters, the corresponding classification models are trained and assessed. Mathematically speaking, this can be formulated as follows:

$$ \max\limits_{parm} f(parm) $$
(2)

where f is an objective function to be maximized (typically the accuracy of the model) and parm is the set of parameters to be tuned. Despite the fact that this may seem computationally heavy, grid search method benefits from the ability to perform the optimization in parallel, which results in a lower computational complexity [39].

In contrast to traditional hyper-parameter tuning algorithms that perform the optimization with the objective of maximizing the accuracy of the ML model, this work tunes the parameters used for each model using the grid search optimization method to maximize the average Gini index (for more statistical significance and robustness [27]) over multiple splits [42]. More specifically, the objective function is:

$$ \max\limits_{parm} AverageGiniIndex = \max\limits_{parm} \frac{1}{N}\sum\limits_{i=1}^{N} GiniIndex_{i}(parm) $$
(3)

where parm is the set of parameters to be tuned for each ML algorithm and N is the number of different splits considered. For example, in the case of the k-NN algorithm, parm = {K}, which is the number of neighbors used to determine the class of a data point.
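
As an illustration of Eq. (3), the sketch below tunes K for k-NN by maximizing the average Gini Index over several random 70%-30% splits. It assumes the caret package’s knn3 classifier, a factor target y, and the averaged_gini() helper sketched in Section 5; the split generation is simplified (not stratified) and the number of splits is reduced for readability:

```r
library(caret)   # provides knn3 (any probabilistic k-NN classifier would do)

# Minimal sketch: grid search over K maximizing the average Gini Index (Eq. 3).
tune_knn <- function(x, y, k_grid = 1:15, n_splits = 10) {
  avg_gini <- sapply(k_grid, function(k) {
    mean(sapply(seq_len(n_splits), function(i) {
      idx <- sample(nrow(x), size = round(0.7 * nrow(x)))
      fit <- knn3(x[idx, ], y[idx], k = k)
      scores <- predict(fit, x[-idx, ], type = "prob")   # one probability column per class
      averaged_gini(scores, y[-idx], classes = levels(y))
    }))
  })
  k_grid[which.max(avg_gini)]    # parameter value with the highest average Gini Index
}
```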

R was used to implement the eight classifiers and the corresponding ensemble learners. As mentioned above, the eight classifiers considered in this work are SVM-RBF, LR, NB, k-NN, RF, NN1, NN2, and NN3. All the classifiers were trained using all the variables available. Moreover, the parameters of the algorithms were tuned by maximizing the Gini Index of each split. Furthermore, 200 different splits of data were used to reduce the bias of the models under consideration.

Table 5 summarizes the range of values for the parameters of the different ML algorithms considered in this work.

Table 5 Grid Search Parameter Tuning Range

Note the following:

  • For the NB algorithm, the density estimator used by the algorithm is represented by the usekernel parameter. In particular, usekernel = false means that the data distribution is assumed to be Gaussian, whereas usekernel = true means that the data distribution is assumed to be non-Gaussian.

  • The LR algorithm was not included in the table. This is due to the fact that it has no parameters to optimize. The sigmoid function, which is the default function, was used by the grid search method to maximize the Gini index.

  • The NN method was not included in the table because it was explained in the previous Section 7.1.

The features are ordered according to their importance. This is done for the two datasets and for each of the algorithms used, and it provides better insight into which features are important for each algorithm and each dataset. The importance of the features is determined using the CARET package available for the R language [41]. Depending on the classification model adopted, the importance is calculated in one of several ways. For example, when using the RF method, the prediction accuracy on the out-of-bag portion of the data is recorded; this is done iteratively after permuting each predictor variable. The difference between the two accuracy values is then averaged over all trees and normalized by the standard error [41]. In contrast, when the k-NN method is used, the difference between the class centroid and the overall centroid is used to measure the variable influence; accordingly, the separation between the classes is larger whenever the difference between the class centroids is larger [41]. On the other hand, when using the NN method, the CARET package uses the feature importance method proposed by Gevrey et al., which uses combinations of the absolute values of the weights [23]. This importance is reflected in the weights calculated for each feature for each classification model, with more important features contributing more towards the prediction.
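
For a single RF model, for instance, this importance can be extracted as sketched below (train_df and the column name class are hypothetical placeholders; passing importance = TRUE requests the permutation-based measure described above):

```r
library(caret)

# Illustration: feature importance for one RF model via the CARET package.
# 'train_df' and the target column 'class' are hypothetical placeholders.
fit <- train(class ~ ., data = train_df, method = "rf",
             importance = TRUE)   # permutation-based importance, as described for RF
varImp(fit)                       # scaled importance per predictor
```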

The final step consists of selecting the most suitable bagging ensemble learner for both datasets at the two course delivery stages.

7.3 Features importance: Dataset 1 - Stage 20%

  • RF: The variables’ importance in terms of predictivity is described in Table 6 that shows that the most relevant features are ES2.2 and ES3.3.

    Table 6 Dataset 1 - Stage 20% - Features’ importance for Different Base Classifiers
  • SVM-RBF: The variables’ importance for SVM is described in Table 6, that shows that the most relevant features are ES2.2 and ES3.3.

  • NN1: For NN1, the variables’ importance in terms of predictivity is described in Table 6 that shows that the most relevant features are ES2.2 and ES3.5.

  • NN2: The most important variables for NN2 are ES2.2 and ES3.2, as shown in Table 6.

  • NN3: The variables’ importance in terms of predictivity is described in Table 6 that shows that the most relevant features are ES2.2 and ES3.2.

  • k-NN: Table 6 shows that the most relevant features for k-NN are ES2.2 and ES3.3.

  • LR: The variables’ importance in terms of predictivity is described in Table 6 that shows that the most relevant features are ES1.1 and ES1.2.

  • NB: Table 6 shows that the most relevant features are ES2.2 and ES3.3.

7.4 Features importance: Dataset 1 - Stage 50%

It is important to point out that, for Dataset 1 at stage 50%, features ES4.1 and ES4.2 are the most important for every classifier.

  • RF: For RF, the variables’ importance in terms of predictivity is described in Table 7 that shows that the most relevant features are ES4.1 and ES4.2.

  • SVM-RBF: The variables’ importance in terms of predictivity is described in Table 7 that shows that the most relevant features are ES4.1 and ES4.2.

  • NN1: The variables’ importance in terms of predictivity is described in Table 7 that shows that the most relevant features are ES4.1 and ES4.2.

  • NN2: The variables’ importance in terms of predictivity is described in Table 7 that shows that the most relevant features are ES4.1 and ES4.2.

  • NN3: The variables’ importance in terms of predictivity is described in Table 7 that shows that the most relevant features are ES4.1 and ES4.2.

  • k-NN: Table 7 shows that the most relevant features for k-NN are ES4.1 and ES4.2.

  • LR: Table 7 shows that the most relevant features for LR are ES4.1 and ES4.2.

  • NB: The variables’ importance in terms of predictivity is described in Table 7 that shows that the most relevant features are ES4.1 and ES4.2.

Table 7 Dataset 1 - Stage 50% - Features’ importance for Different Base Classifiers

In general, the most important features for almost all the classifiers are ES4.1 and ES4.2. These features correspond to the Evaluate category of Bloom’s taxonomy, which represents one of the highest levels of comprehension of the course material from the educational point of view. Therefore, it makes sense for these features to be suitable indicators and predictors of student performance.

7.5 Features importance: Dataset 2 - Stage 20%

Dataset 2 at Stage 20% has only two features; for all the classifiers, the list of features ordered by importance is shown in Table 8.

Table 8 Dataset 2 - Stage 20% - Features’ importance

Since Dataset 2 at stage 20% has only two variables, we can represent it graphically in order to better understand the situation and to explain why all the algorithms agree that Assignment01 is the most important predictor.

Figure 9 shows that it is straightforward to identify the categories of students by setting some thresholds on the Assignment01 feature. For instance, most of the Weak students have grade zero in Assignment01.

Fig. 9 Dataset 2 - Stage 20% - scatter plot

7.6 Features importance: Dataset 2 - Stage 50%

The variables’ importance for NN1, NN2, k-NN, and NB is described in Table 9, whereas the variables’ importance for NN3, LR, RF, and SVM is described in Table 10.

Table 9 Dataset 2 - Stage 50% - NN1, NN2, k-NN, and NB, Features’ importance
Table 10 Dataset 2 - Stage 50% - NN3, LR, RF, and SVM-RBF, Features’ importance

Based on the aforementioned results, it can be seen that assignments are better indicators of the student performance. This can be attributed to several factors. The first is the fact that assignments typically allow instructors to assess the three higher levels of cognition as per Bloom’s taxonomy, namely analysis, synthesis, and evaluation [18]. As such, assignments provide a better indicator of the learning level that a student has achieved and consequently can give insights about his/her potential performance in the class overall. Another factor is that students tend to have more time to complete assignments. Moreover, they are often allowed to discuss issues and problems among themselves. Thus, students not performing well in the assignments may be indicative of them not fully comprehending the material. This can result in the students receiving a lower overall final course grade.

8 Experimental results and discussion

Matlab 2018 was used to build the Neural Networks classifiers, whereas all the other models were built using R.

All possible combinations of ensembles of the eight baggings of models (256 in total) were computed for the initial Train-Test split and for 5 extra splits. For each dataset, the average of the performances, namely the Averaged Gini Index, over the 6 splits was used to select the most robust ensemble learner. In addition, we computed the p-values of all the ensembles for all the splits, aiming to select the ensemble learner with the highest averaged Gini Index that was also statistically significant on every split. Note that the contribution of each feature is determined by the base learner model being used in the ensemble, as per the ranking determined for each dataset at each stage. For example, if the RF learner is part of the ensemble being considered for Dataset 1 at the 50% stage, the first split is done over feature ES 4.1, the second split over feature ES 4.2, and so on.
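
The enumeration of the 2^8 = 256 possible ensembles can be sketched as follows (a minimal sketch: bagging_scores is assumed to be a named list of the eight baggings’ score matrices on the same test sample, and averaged_gini() is the helper sketched in Section 5):

```r
# Minimal sketch: evaluate all 256 possible ensembles of the eight baggings.
evaluate_ensembles <- function(bagging_scores, actual) {
  members <- names(bagging_scores)
  combos <- expand.grid(rep(list(c(FALSE, TRUE)), length(members)))  # 256 subsets
  apply(combos, 1, function(include) {
    if (!any(include)) return(NA)                                 # skip the empty ensemble
    ens <- Reduce(`+`, bagging_scores[include]) / sum(include)    # average member scores
    averaged_gini(ens, actual)                                    # ensemble Averaged Gini
  })
}
```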

In the following sections we will see the results obtained for the two datasets at each stage.

8.1 Results: Dataset 1 - Stage 20%

If we based our choice only on the Gini index corresponding to the initial split, the ensemble learner we would have selected for Dataset 1 at Stage 20% would have been formed by NB, NN1, and SVM-RBF. Instead, the ensemble learner that appears to be the most stable on every split and with statistical significance is the one formed by a bagging of the NN2 model and the combination of the bagging of NN2 and NB as a bagging ensemble. Figure 10 shows the results obtained by inferring the ensemble on the initial test sample.

Fig. 10 Dataset 1 - Stage 20% - Ensemble Learner

Classes G, F, and W have Gini Indices equal to 46.4%, 38.9%, and 94.0% respectively; hence the Averaged Gini Index is 59.8%. On average, over the Test sample and the 5 extra splits, the Averaged Gini Index is 62.1%. The corresponding p-values are all less than 0.03.

The confusion matrix for the Test sample (consisting of 15 students), obtained as explained in Section 6, is shown in Table 11.

Table 11 Dataset 1 - Stage 20% Ensemble (NN2) Confusion Matrix τF = 0.158,τG = 0.310,τW = 0.682

Table 12 illustrates the performances of the ensemble learner in terms of precision, recall, F-measure and false positive rate per class and on average. These quantities depend on the thresholds τF, τG and τW and the way we defined the predictions. The Accuracy is 66.7%. Although this may seem to be low, it actually outperforms all of the base learners used to create the bagging ensemble. Note that the low accuracy may be attributed to the fact that the dataset itself is small and hence did not have enough instances to learn from.

Table 12 Dataset 1 - Stage 20% - Ensemble Performances

8.2 Results: Dataset 1 - Stage 50%

For Dataset 1 at Stage 50%, none of the ensembles we constructed were statistically significant, even though their Averaged Gini Indices are on average higher than those obtained for Dataset 1 at Stage 20%. In fact, the performance for class F gets worse when we add the three new variables. More precisely, when we add features ES4.1, ES4.2, and ES5.1 to Dataset 1 at stage 20% to obtain Dataset 1 at stage 50%, they end up being the variables with the main impact on the predictions. These variables help distinguish between W and G, and indeed the performance corresponding to these two classes improves. However, since the Fair students are closely correlated with the Good students class, the classifier becomes less confident in predicting the Fair students.

The best ensemble in terms of performance is the one obtained from a bagging of NB and k-NN. The Averaged Gini Index over the 6 splits is 74.9%, and on the initial test sample it is 86.5%. Figure 11 shows the performance obtained on Split 1, with an Averaged Gini Index equal to 50% and Gini Indices of -22.2%, 76.8%, and 86.0% on Classes F, G, and W respectively. On a different split, the ensemble formed by a bagging of NB and k-NN on Dataset 1 at stage 20% gives Gini Indices of 77.8%, 53.6%, and 48.0% on Classes F, G, and W respectively. This shows that the performance depends heavily on the split. In general, when we add the three new features (obtaining Dataset 1 at stage 50%), the performance improves on classes G and W, whereas it gets much worse for class F.

Fig. 11 Dataset 1 - Stage 50% - Ensemble Learner

The confusion matrix obtained is shown in Table 13.

Table 13 Dataset 1 - Stage 50% Ensemble (NB and k-NN) Confusion Matrix τF = 0.10,τG = 0.29,τW = 0.88

Table 14 illustrates the performances of the ensemble learner in terms of precision, recall, F-measure, and false positive rate, per class and on average. These quantities depend on the thresholds τF, τG and τW and the way we defined the predictions. The Accuracy is 66.7%. Again, although this may seem low, the bagging ensemble outperforms all of the base learners used to create it. This is due to the fact that the dataset itself is small and hence did not have enough instances for the ensemble to learn from. Note that we cannot compute the F-measure for class F, as its Precision and Recall are zero.

Table 14 Dataset 1 - Stage 50% - Ensemble Performances

It is worth noting that the low average Gini Index can be attributed to two main reasons:

  • The dataset is small.

  • The Fair class is highly correlated with the Good students class, which causes some confusion for the models being trained.

This is further highlighted by the large false positive rate obtained for the Fair class.

8.3 Results: Dataset 2 - Stage 20%

The ensemble learner selected for Dataset 2 at Stage 20% is formed by baggings of NB, k-NN, LR, NN2, and SVM-RBF. As an illustration, we show the results corresponding to the initial test sample. For each class, we normalized the scores obtained by the five baggings of models on the test sample in order to make these probabilities comparable, and then averaged them. The performance obtained is shown in Fig. 12.

Fig. 12 Dataset 2 - Stage 20% - Ensemble Learner

Classes G, F, W have Gini Indices equal to 48.1%, 38.6% and 99.7% respectively. The confusion matrix associated is shown in Table 15.

Table 15 Dataset 2 - Stage 20% Ensemble (NB,k-NN,LR,NN2,SVM) Confusion Matrix τF = 0.12,τG = 0.62,τW = 0.40

Furthermore, Table 16 illustrates the performances of the ensemble learner in terms of precision, recall, F-measure, and false positive rate, per class and on average. These quantities depend on the thresholds τF, τG and τW and the way we defined the predictions. The Accuracy is 88.2%, which is very good compared with the performances obtained for Dataset 1. In a similar fashion to Dataset 1, the bagging ensemble outperforms the base learners in terms of classification accuracy.

Table 16 Dataset 2 - Stage 20% - Ensemble Performances

8.4 Results: Dataset 2 - Stage 50%

The ensemble learner selected for Dataset 2 at Stage 50% is formed by a bagging of LR models only. The performances obtained on the initial test sample are shown in Fig. 13. On average, almost all the ensembles we constructed have very good performances and are statistically significant. The ensemble we selected is very robust on every split.

Fig. 13 Dataset 2 - Stage 50% - Ensemble Learner

Classes G, F, W have Gini Indices equal to 92.3%, 90.7% and 99.3% respectively.

The confusion matrix obtained is shown in Table 17.

Table 17 Dataset 2 - stage 50% Ensemble (LR) Confusion Matrix τF = 0.12,τG = 0.62,τW = 0.30

Table 18 illustrates the performances of the ensemble learner in terms of precision, recall, F-measure and false positive rate per class and on average. These quantities depend on the thresholds τF, τG and τW and the way we defined the predictions. The Accuracy is 93.1%. Again, the bagging ensemble at this stage also outperforms the base learners in terms of classification accuracy.

Table 18 Dataset 2 - Stage 50% - Ensemble Performances

8.5 Performance comparison with base learners

Table 19 shows the classification accuracy of the different base learners in comparison with the average accuracy of the bagging ensemble across the 256 splits. It can be seen that the bagging ensemble on average outperforms all of the base learners at the two course delivery stages for both datasets. This is despite the fact that some of the splits may have had a poor distribution, which often leads to lower classification accuracy of the ensemble. This further highlights and emphasizes the effectiveness of the proposed ensemble in accurately predicting and identifying students who may need help.

Table 19 Performance of Bagging Ensemble and Base Learners

8.6 Results summary

The performances obtained for Dataset 1 and Dataset 2 are very different. For Dataset 1, the models’ performances depend strongly on the splits. For instance, the same ensemble might perform very well on certain splits but have a very low Averaged Gini Index on others, due to a negative Gini Index on class F. Moreover, only 25% of the ensembles for Dataset 1 at the 20% stage had an Averaged Gini Index above 50%, and of all the ensembles only one is statistically significant, namely the one corresponding to a bagging of NN2 models.

Although the evidence shows that this ensemble performs decently on each split we considered in our experiments, we cannot assume that this holds on every other possible split we might have chosen instead. The problem is so dependent on the selected split that even the chosen ensemble suffers from a lack of robustness and poor performance.

For Dataset 1 at stage 50%, the averaged Gini Index is in general higher than the one obtained at stage 20%, because the Gini Indices corresponding to class G (Good students) and class W (Weak students) improve when we add the three features ES4.1, ES4.2, and ES5.1. Since the Fair students class is highly correlated with the Good students class, the consequence is that when we add the best predictors, they incorrectly predict the Fair students. Consequently, the Gini Index for class F for each ensemble and for almost every split is negative or very low, leading to statistically insignificant results. In particular, there is no ensemble among the 256 constructed such that the p-value corresponding to class F is lower than 0.03 on every split. Note that the ensemble chosen at the 50% stage is the bagging of NB and k-NN; although this ensemble was not statistically significant due to class F, it was statistically significant for the target class W.

For this reason, even though for completeness we show the results for Dataset 1 at both stages, it is important to point out that if we were aiming to classify the students of Dataset 1 correctly and to use the classifier in real-world applications, we should not include the last three features, i.e. we should use Dataset 1 at stage 20%.

Dataset 2 was easier to deal with, and the choice of the best ensemble was straightforward. For Dataset 2 at stage 50%, 88% of the ensembles have Averaged Gini Indices above 90%, and 96% of the ensembles are statistically significant.

For Dataset 2, the highest averaged Gini Index led us to choose:

  • the ensemble of baggings of NB, k-NN, LR, and NN2 for the 20% stage.

  • the ensemble consisting of bagging of LR for the 50% stage.

Note that, in general, it is better to perform the prediction at the 50% stage rather than at the 20% stage. This is due to the fact that more features are collected by the 50% stage, allowing the learners to gain more information. Although this observation was not evident for Dataset 1, this is because the dataset is small, with only a few instances of the F class lying at the border between the G and W classes. However, it was observed that the F-measure was high at both stages for the target class W.

For Dataset 2, the results showed that predicting at the 50% stage is indeed better, since the performance of the ensemble improved with the added number of features. However, the results at the 20% stage were still valuable, as they provided vital insights at an extremely early stage of the course delivery, as evidenced by the F-measure being close to 0.7 at that stage.

9 Conclusion, research limitations, and future work

In this paper, we investigated the problem of identifying students who may need help during the course delivery period in an e-Learning environment. The goal was to predict the students’ performance by classifying them into one of three possible classes, namely Good, Fair, and Weak. In particular, we tackled this multi-class classification problem for two educational datasets at two different course delivery stages, namely at the 20% and 50% marks. We trained eight baggings of models for each dataset and considered all the possible ensembles that could be generated from them, based on the scores produced by inferring them on a test sample.

We compared the performances, and concluded that the ensemble learners to be selected are formed by:

  • a bagging of NN2 models for Dataset 1 at stage 20%.

  • a bagging of NB and k-NN models for Dataset 1 at stage 50%.

  • a bagging of NB, k-NN, LR, NN2, and SVM-RBF for Dataset 2 at stage 20%.

  • a bagging of LR models for Dataset 2 at stage 50%.

although for Dataset 1 at stage 50% none of the ensembles was statistically significant, so no fully satisfactory ensemble could be selected for that stage.

The results for Dataset 2 are good both in terms of Averaged Gini Index and p-values, especially considering the issues encountered, which are mainly the small size of Dataset 1 and the unbalanced nature of Dataset 2. In turn, these issues make the multi-class classification problem more complex. This was evident from the fact that it was impossible to find a good classifier for Dataset 1 at stage 50% and that the performance obtained for Dataset 1 at stage 20% was poor.

Based on the aforementioned research limitations, below are some suggestions for our future work:

  • The best way to face the dataset size issue would be to have more data available, by collecting training and testing datasets for every time the course is offered.

  • We also suggest to perform several additional splits for Dataset 1 at Stage 20% to check the robustness of the model as well as the statistical significance.

  • It might be worth trying to optimize the topology of the neural network with a dedicated algorithm. Even though our choice was based on recent literature, it is unlikely that we reached the optimum. One could consider trying, for instance, all possible combinations with 1, 2, and 3 layers and 1, ..., 20 neurons in each layer. If we considered all such combinations, we would have \(20 + 20^{2} + 20^{3}\) NNs to train. Of course this would be computationally unviable and would probably result in massive over-fitting. However, there are several approaches proven to be effective in this kind of task, such as genetic optimization or pre-trained models capable of predicting the optimal topology of a network for a given problem, considering parameters such as the dimension of the dataset and the intensity of the noise [19].

Datasets’ Permissions:

  • Dataset 1: The dataset is publicly available at: https://sites.google.com/site/learninganalyticsforall/data-sets/epm-dataset. Use of this data set in publications was acknowledged by referencing [63].

  • Dataset 2: All permissions to use this dataset were obtained through The University of Western Ontario’s Research Ethics Office. This office approved the use of this dataset for research purposes.