1 Introduction

Student performance is an important concern of higher education institutions because it is a criterion for high-quality universities, whose standing rests on an excellent record of academic achievement. According to [1], student performance can be obtained by measuring the learning assessment and curriculum. However, most studies define student performance in terms of measured student success [2].

Universities issue warnings to students and their families when students perform poorly. For example, at Can Tho University, in the first semester of the academic year 2018–2019, more than 800 warnings were issued for one semester of poor performance and more than 100 for two consecutive semesters of poor performance. These numbers rose to 986 and 196, respectively, in the academic year 2019–2020. One of the main reasons for poor performance is that students have not selected courses appropriate to their competencies. This results in longer study time and higher costs for families, higher education institutions, and society. Therefore, predicting student performance is an important research topic in exploiting educational data, and it is of interest to many researchers [3].

For students to obtain the best academic performance, recommender systems are needed for course selection, to predict which courses a student should take. Recommender systems for course selection already exist. For example, data mining is one of the most popular approaches in education, and classification is one of the most popular techniques for predicting student performance. Several algorithms are used for classification, such as Decision Tree, Artificial Neural Networks, Naive Bayes, K-Nearest Neighbor and Support Vector Machines [2]. However, existing research primarily uses the learning results of previous semesters to predict student performance in the next semester or the current Grade Point Average (GPA), and does not analyze additional factors that affect performance, such as English entrance test grades and activity incentive grades. Moreover, researchers have not sufficiently compared techniques, especially deep learning techniques against traditional machine learning techniques.

This study proposes a course selection recommendation system that uses deep learning techniques [4], specifically the multi-layer perceptron (MLP), to build a model that predicts student performance in upcoming semesters from the course results of previous semesters. To improve the predictions, we also include additional factors in the proposed model, such as English entrance test grades and activity incentive grades. Moreover, we compare deep learning techniques with traditional machine learning ones. Experimental data are collected from the student information system of a Vietnamese multidisciplinary university. The results show that the proposed model provides fairly accurate predictions and can be applied to other practical cases.

2 Related Work

Course recommendation is important for building a good study strategy for students. Mark prediction is required to provide valuable advice for course recommendation. Numerous studies have attempted to propose efficient methods for predicting student performance. The authors in [5] introduced an approach that uses Tensor Factorization (TF) to predict student performance. The proposed method makes it possible to personalize the prediction for a specific student. The authors reported promising experimental results on two large datasets.

Recently, open source libraries have been used more widely for predictive work. The authors of [6] developed a student performance prediction system using the open source recommendation system MyMediaLite. For the grade databases collected from the academic management system of a university, the authors proposed using the Biased Matrix Factorization (BMF) technique to predict learning results. These results can help students choose more appropriate courses. The authors in [7] proposed a novel approach that uses recommender system techniques for educational data mining, especially for predicting student performance. To validate this approach, the recommender system techniques were compared with traditional regression methods, such as linear regression, on educational data from intelligent tutoring systems. The experimental results showed that the proposed approach can improve prediction results.

The work in [8] presented an approach to boost student performance prediction in interactive online question pools by considering student interaction features and the similarity between questions. The proposed approach was evaluated on a dataset from a real-world interactive question pool using four typical machine learning models. The results showed that, across these models, the approach achieves much higher accuracy for student performance prediction in interactive online question pools than the traditional way of using statistical features.

The authors in [9] exhaustively evaluated prediction performance over all possible combinations of four types of attributes: behavioral features, demographic features, academic background, and parent participation. Using a support vector machine and feature selection, the authors concluded that behavioral features are the most crucial, since they yield the best prediction performance. In [10], the authors proposed a method for predicting student performance in computing majors. This method is based on collaborative filtering with enhanced similarity and yields personalized predictions of student grades in the courses required for each computing major. Prediction accuracy is enhanced by analyzing major-specific course characteristics, such as core courses, course prerequisites, and course levels.

The authors of [11] proposed a methodology in which data collection and pre-processing are carried out first, followed by grouping students with similar patterns of academic performance. The authors then selected the most appropriate supervised learning algorithm. The experimental results showed the effectiveness of machine learning techniques for predicting student performance. The authors in [12] examined the efficacy of Matrix Factorization for solving the prediction problem. The study uses Singular Value Decomposition (SVD), a Matrix Factorization technique that has been used successfully in recommender systems. The technique was benchmarked against the use of student and course average marks as predictors of performance. The results suggest that Matrix Factorization performs better than both benchmarks. The work in [13] proposed a direct search for the optimal number of latent factors, where the calculation for each number of latent factors is driven by a metaheuristic that simultaneously selects the optimal values of the learning rate and the regularization factor. Using this method, the authors can determine the best number of latent factors to apply in further predictions on similar databases.

Another study in [14] introduced a supervised content-aware matrix factorization for mutual reinforcement of academic performance prediction and library book recommendation. The proposed model was evaluated on three consecutive years of book-loan history and cumulative grade point average of 13047 undergraduate students in one university. The results showed that the proposed model outperforms the competing baselines on both tasks, and that academic performance not only is predictable from the book-loan history but also improves the recommendation of library books for students.

An optional course recommendation system based on score prediction was introduced in [15]. It uses a novel cross-user-domain collaborative filtering algorithm to predict each student's score in an optional course from the course score distribution of the most similar senior students. The experimental results showed that the proposed method can accurately recommend optional courses in which students will achieve relatively high scores. Another work [16] proposed a method to predict student performance using various deep learning techniques. Several data pre-processing techniques, such as Quantile Transforms and MinMax Scaler, were applied before feeding the data into well-known deep learning models, such as Long Short Term Memory and Convolutional Neural Networks, for the prediction tasks. The experimental results showed that the proposed method provides good prediction results, especially when data transformation is used.

Some studies also introduced methods for course recommendation. [17] proposed a collaborative recommender system that recommends university elective courses to students by exploiting courses that other similar students had taken. The system used an association rules mining algorithm to discover patterns between courses. The experiments were conducted with real datasets to assess the overall performance of the proposed approach.

To help students choose subjects that match their capability, [18] proposed a collaborative recommendation system using data mining and natural language processing, two rapidly emerging fields; the latter helps convert human-readable text into a machine-readable format. In addition, [19] proposed a novel course recommendation system based on collaborative filtering that considers repeated courses and the student's grade in each repetition. The authors experimented with different Ordered Weighted Averaging operators, which aggregate the grades of each student's repeated courses, to enhance recommendation quality. [20] developed an optional course recommendation system based on score prediction. A novel cross-user-domain collaborative filtering algorithm is designed to predict each student's score in an optional course from the course score distribution of the most similar senior students. After the predicted scores of all optional courses are generated, the top t optional courses with the highest predicted scores and no time conflict are recommended to the student.

Based on these previous research results, we propose a deep learning approach that uses an MLP to build a student performance prediction model. Besides course grades, additional factors such as English entrance test grades and activity incentive grades are also considered to improve the proposed prediction model.

3 Proposed Approach

The general framework of the prediction models in the proposed approach is shown in Fig. 1. First, we collect real data sets from the Student Management System of a university; the data are then pre-processed to remove noise, redundant attributes, etc. Traditional approaches usually use the full data set to build a single prediction model for all students (shown at the bottom of Fig. 1 and denoted MLP). However, we find that this approach does not fit the data well, since a “very good performance student” should not be used to predict a “poor performance student” or vice versa. In this work, we propose using four prediction models for four groups of students with similar performance based on their marks (shown at the top of Fig. 1). The proposed approach is called GroupMLP.
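The grouping step can be illustrated with a minimal sketch. The group labels follow the mark levels reported later in the results (Very Good, Good, Fair, Poor), but the numeric cut-offs in `GROUP_BOUNDS` and the helper names are assumptions made only for illustration; the text does not state the boundaries used.

```python
from collections import defaultdict

# Hypothetical lower bounds on a 0-4 average mark; the paper does not give them.
GROUP_BOUNDS = [(3.2, "very_good"), (2.5, "good"), (2.0, "fair")]


def assign_group(average_mark):
    """Return the performance group for one student's average mark."""
    for lower_bound, name in GROUP_BOUNDS:
        if average_mark >= lower_bound:
            return name
    return "poor"


def split_by_group(records):
    """records: iterable of (student_id, average_mark) pairs.

    Returns a dict mapping each group name to its student ids, so that
    a separate prediction model can be trained on each group.
    """
    groups = defaultdict(list)
    for student_id, average_mark in records:
        groups[assign_group(average_mark)].append(student_id)
    return dict(groups)
```

Under this sketch, one model would be trained per key of the returned dictionary, and a new student would be routed to the model of the group that `assign_group` returns.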

3.1 Data Pre-processing

To evaluate the proposed model, we collected real data at Can Tho University, Vietnam; however, the model can also be used in other settings, such as other universities, schools, and colleges. The collected data cover students, courses, marks, and other information from 2007 to 2019, with 3,828,879 records, 4,699 courses (subjects), and 83,993 students. The data distributions are shown in Fig. 2.

Fig. 1. General diagram of the proposed approach

Fig. 2. Data set distribution by grading score

The considered data set is collected from Student Management system and pre-processed as described in Algorithm 1.

After carefully analyzing the data, we selected the input attributes for the learning model as described in Table 1. This selection is based on preliminary experimental results and previous analyses of student performance prediction [6, 21].

Algorithm 1. Data pre-processing

Fig. 3. The proposed MLP architecture [22].

Table 1. Input attributes

3.2 Deep Learning Architecture

The proposed model is presented in Fig. 3 and was used in the work of [22]. The input attributes are those selected in Table 1, and the output (prediction) of the model takes one of eight values on the four-point grading scale (i.e., the output belongs to {0, 1, 1.5, 2, 2.5, 3, 3.5, 4}).

The proposed MLP architecture includes 6 layers. The first layer contains 18 input features, while the last layer consists of 1 output giving the predicted mark. The first four hidden layers each contain 256 neurons but are followed by different activation functions: the first and fourth hidden layers use the Rectified Linear Unit (ReLU), while the second and third are followed by the Sigmoid function. Before the predicted score is produced, a fifth hidden layer with 8 nodes and ReLU activation is applied.
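As a rough illustration of the size of this network, the sketch below lists the layers in the order described and counts the trainable parameters. It assumes fully connected layers with bias terms and a linear output unit; neither detail is stated in the text, and the names `LAYERS` and `count_parameters` are hypothetical.

```python
# (units, activation) for each layer after the 18-feature input, in the order
# given in the text. The output activation is an assumption.
N_INPUTS = 18
LAYERS = [
    (256, "relu"),     # hidden layer 1
    (256, "sigmoid"),  # hidden layer 2
    (256, "sigmoid"),  # hidden layer 3
    (256, "relu"),     # hidden layer 4
    (8, "relu"),       # hidden layer 5
    (1, "linear"),     # output: predicted mark
]


def count_parameters(n_inputs, layers):
    """Count weights and biases, assuming dense layers with one bias per unit."""
    total = 0
    fan_in = n_inputs
    for units, _activation in layers:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total


print(count_parameters(N_INPUTS, LAYERS))  # 204305 under these assumptions
```

Almost all of these parameters sit in the three 256-by-256 connections between the wide hidden layers.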

The Sigmoid function [23] usually appears in the output layers of deep learning architectures. It transforms input values in the domain \(\mathrm{I\!R}\) to outputs in the range (0, 1). The Sigmoid function is also called a “squashing” function because it squashes any input in the range \((-\infty , +\infty )\) to the range (0, 1). With the shift to gradient-based learning, the Sigmoid was considered a natural choice because it is a smooth and differentiable approximation to a thresholding unit. The Sigmoid function is given by the formula:

$$\begin{aligned} Sigmoid(x)=\frac{1}{1+e^{-x}} \end{aligned}$$
(1)

where x denotes the data computed by the preceding neural layer.

Another activation function, ReLU [24], is also used in our architecture. ReLU follows the formula:

$$\begin{aligned} f(x)=max(0,x) \end{aligned}$$
(2)

where x denotes the data processed by the preceding neural layer.
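Both activation functions can be written in a few lines. The sketch below is a scalar illustration of Eqs. (1) and (2), not the implementation used in the experiments; the branch on the sign of x is an added numerical-stability detail that avoids overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(x):
    """Eq. (1): 1 / (1 + e^(-x)), computed without overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


def relu(x):
    """Eq. (2): max(0, x)."""
    return max(0.0, x)
```

For example, `sigmoid(0.0)` is 0.5, and `relu` leaves positive inputs unchanged while mapping negative inputs to 0.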

ReLU is the most widely used activation function in deep learning architectures and underlies many state-of-the-art results to date. ReLU helps models achieve better performance and generalization than the Sigmoid and Tanh activation functions. It is a nearly linear function, so it preserves the properties that make linear models easy to optimize with gradient-descent methods [25, 26].

To reduce overfitting, dropout is applied with a rate of 0.015. In addition, we use Early Stopping with a patience of 5 epochs: if the loss does not improve for 5 consecutive epochs, training stops; otherwise, training continues for up to 500 epochs. The network is trained with the Adam optimizer, using a batch size of 255 and the default learning rate of 0.001.
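The early stopping rule can be sketched independently of any deep learning library. The function below takes a sequence of per-epoch loss values and reports how many epochs would run under a patience of 5 and a cap of 500 epochs. Whether the monitored loss is the training or the validation loss is not stated in the text, and the function name is hypothetical.

```python
def run_with_early_stopping(epoch_losses, patience=5, max_epochs=500):
    """Return (epochs_run, best_loss) for a stream of per-epoch losses.

    Training stops once the loss has failed to improve on the best value
    seen so far for `patience` consecutive epochs, or at `max_epochs`.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in epoch_losses:
        if epochs_run >= max_epochs:
            break
        epochs_run += 1
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_loss
```

With the loss sequence 1.0, 0.8, 0.7 followed by five values that never drop below 0.7, this rule stops after the eighth epoch and keeps 0.7 as the best loss.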

4 Evaluation

To evaluate the proposed model, we describe the baselines and state-of-the-art methods used for comparison and two popular metrics for measuring performance. We have not applied cross validation since the data are ordered in time, i.e., the students’ marks from previous semesters are used to predict the marks of the current semester.

4.1 Baseline Methods

In this work, to compare with other methods, we have used two baselines which are User Average and Item Average. Furthermore, we also compared with Collaborative Filtering methods since previous works [5, 27] showed that using Collaborative Filtering such as state-of-the-art Matrix Factorization [28] provided very good results in predicting student performance.

Let u denote the student, i the course, and r the mark of the student on that course. The User Average method, referred to as Student Average in this work, generates the predicted mark (\(\hat{r}_{ui}\)) for student u on course i using Eq. (3).

$$\begin{aligned} \hat{r}_{ui} = \frac{\sum _{(u',i,r) \in \mathcal {D}^{train} | u'=u }{r}}{|\{ (u',i,r) \in \mathcal {D}^{train} | u'=u \}|} \end{aligned}$$
(3)

The Item Average method, referred to as Course Average in this work, predicts the mark for student u on course i using Eq. (4).

$$\begin{aligned} \hat{r}_{ui} = \frac{\sum _{(u,i',r) \in \mathcal {D}^{train} | i'=i }{r}}{|\{ (u,i',r) \in \mathcal {D}^{train} | i'=i \}|} \end{aligned}$$
(4)
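The two averaging baselines in Eqs. (3) and (4) amount to grouping the training marks by student or by course and taking the mean. A minimal sketch, assuming the training set is a list of (student, course, mark) tuples and using a hypothetical function name:

```python
from collections import defaultdict


def fit_averages(train):
    """train: list of (student, course, mark) tuples.

    Returns (student_avg, course_avg): the mean training mark of each
    student (Eq. 3) and of each course (Eq. 4).
    """
    marks_by_student = defaultdict(list)
    marks_by_course = defaultdict(list)
    for student, course, mark in train:
        marks_by_student[student].append(mark)
        marks_by_course[course].append(mark)
    student_avg = {u: sum(m) / len(m) for u, m in marks_by_student.items()}
    course_avg = {i: sum(m) / len(m) for i, m in marks_by_course.items()}
    return student_avg, course_avg
```

The Student Average prediction for any pair (u, i) is then `student_avg[u]`, regardless of the course, and the Course Average prediction is `course_avg[i]`, regardless of the student.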

Matrix Factorization is a well-known method in recommender systems. It decomposes a matrix X (each row of X is a user/student, each column is an item/course, and each element is the mark of that student on that course) into two small matrices W and H such that X can be approximately reconstructed from them:

$$\begin{aligned} X \approx WH^T \end{aligned}$$

where \(W\in \mathbb {R}^{|U| \times K}\); \(H \in \mathbb {R}^{|I| \times K}\); K is the number of latent factors, \(K \ll |U|\), \(K \ll |I|\). The latent factors W and H can be obtained by minimizing the objective function:

$$\begin{aligned} \mathcal {O}^{MF} = \sum _{(u,i) \ \in \ \mathcal {D}^{train}} \left( r_{ui} - \sum _{k=1}^{K} {w_{uk}h_{ik}}\right) ^2 + \lambda \cdot (||W||^2_F + ||H||^2_F) \end{aligned}$$

where \(\lambda \in (0, 1)\) is the regularization weight and \(||\cdot ||_{F}\) is the Frobenius norm.
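One common way to minimize this objective is stochastic gradient descent over the observed (student, course, mark) triples. The sketch below is an illustrative pure-Python version, not the implementation used in the experiments; the learning rate, number of epochs, initialization, and function names are all assumptions, and students and courses are assumed to be indexed by integers.

```python
import random


def train_mf(train, n_users, n_items, k=2, lr=0.02, lam=0.02, epochs=300, seed=0):
    """Fit W (n_users x k) and H (n_items x k) by SGD on the regularized squared error."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    H = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in train:
            err = r - sum(W[u][f] * H[i][f] for f in range(k))
            for f in range(k):
                w_uf, h_if = W[u][f], H[i][f]
                # Gradient step on (r - w.h)^2 + lam * (|w|^2 + |h|^2)
                W[u][f] += lr * (err * h_if - lam * w_uf)
                H[i][f] += lr * (err * w_uf - lam * h_if)
    return W, H


def predict(W, H, u, i):
    """Predicted mark: the dot product of the student and course factors."""
    return sum(w * h for w, h in zip(W[u], H[i]))


def squared_error(W, H, data):
    """Sum of squared errors over (u, i, r) triples, without the penalty term."""
    return sum((r - predict(W, H, u, i)) ** 2 for u, i, r in data)
```

The regularization term shrinks the factors toward zero at every step, which is what keeps the model from fitting the observed marks exactly.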

One benefit of the Matrix Factorization approach is its flexibility in dealing with various aspects of the data. However, much of the variation in rating values is due to effects associated with either users or items, known as biases. Researchers therefore extended the method to Biased Matrix Factorization.

Biased Matrix Factorization [31], an improvement of MF, models the characteristics of each user and each item, as well as the global tendency, independently of user-item interactions.

The objective function then becomes:

$$\begin{aligned} \mathcal {O}^{BMF} = \sum _{(u,i) \ \in \ \mathcal {D}^{train}} \left( r_{ui} - \mu - b_{u} - b_{i} - \sum _{k=1}^{K} {w_{uk}h_{ik}}\right) ^2 + \lambda \cdot (||W||^2_F + ||H||^2_F) \end{aligned}$$

where:

  • \(\mu \) is the global average (average performance of all students on all courses in \(\mathcal {D}^{train}\)), as shown in Eq. (5).

  • \(b_{u}\) (Eq. (6)) is the student bias (the average deviation of student u's performance from the global average).

  • \(b_{i}\) (Eq. (7)) is the course bias (the average deviation of performance on course i from the global average).

$$\begin{aligned} \mu =\frac{\sum _{(u,i,r) \in \mathcal {D}^{train}}{r}}{|\mathcal {D}^{train}|} \end{aligned}$$
(5)
$$\begin{aligned} b_u=\frac{\sum _{(u',i,r) \in \mathcal {D}^{train} | u'=u}{(r-\mu )}}{|\{(u',i,r) \in \mathcal {D}^{train} | u'=u\}|} \end{aligned}$$
(6)
$$\begin{aligned} b_i=\frac{\sum _{(u,i',r) \in \mathcal {D}^{train} | i'=i}{(r-\mu )}}{|\{(u,i',r) \in \mathcal {D}^{train} | i'=i\}|} \end{aligned}$$
(7)

Details of these methods are described in [5, 27–31].
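The global average and the two bias terms in Eqs. (5) to (7) can be computed in one pass over the training triples. The sketch below is illustrative; the function names are hypothetical, and falling back to a bias of 0 for a student or course that does not appear in the training data is an assumption, not something specified in the text.

```python
from collections import defaultdict


def compute_biases(train):
    """train: list of (student, course, mark). Returns (mu, b_u, b_i)."""
    mu = sum(mark for _, _, mark in train) / len(train)
    deviations_by_student = defaultdict(list)
    deviations_by_course = defaultdict(list)
    for student, course, mark in train:
        deviations_by_student[student].append(mark - mu)
        deviations_by_course[course].append(mark - mu)
    b_u = {u: sum(d) / len(d) for u, d in deviations_by_student.items()}
    b_i = {i: sum(d) / len(d) for i, d in deviations_by_course.items()}
    return mu, b_u, b_i


def bias_only_prediction(mu, b_u, b_i, student, course):
    """mu + b_u + b_i, i.e. the BMF prediction without the latent-factor term.

    Unseen students or courses get a bias of 0 (an assumption for this sketch).
    """
    return mu + b_u.get(student, 0.0) + b_i.get(course, 0.0)
```

In the full model, the latent-factor term \(\sum _{k} w_{uk}h_{ik}\) is added to this bias-only prediction.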

4.2 Evaluation Metrics

The root mean squared error (RMSE) and mean absolute error (MAE) are used to evaluate the models. They are calculated by Eqs. (8) and (9), respectively.

$$\begin{aligned} \mathrm {RMSE} = \sqrt{\frac{1}{n} \sum _{i=1}^n {(y_i - \hat{y}_i)^2}} \end{aligned}$$
(8)
$$\begin{aligned} \mathrm {MAE} = \frac{1}{n} \sum _{i=1}^n {|y_i - \hat{y}_i|} \end{aligned}$$
(9)

where \(y_i\) is the true value, \(\hat{y}_i\) is the predicted value, and n is the number of predictions.
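Both metrics are straightforward to compute from paired lists of true and predicted marks. A minimal sketch with illustrative function names:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (8)."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error, Eq. (9)."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
```

Because RMSE squares each error before averaging, a few large errors raise it more than they raise MAE, so reporting both gives a view of typical errors as well as occasional large ones.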

4.3 Experimental Results

The MAE and RMSE results are presented in Fig. 4. GroupMLP denotes using four different models for the four groups of student performance, and MLP denotes using one model to predict the results of all students, as shown in Fig. 1. Using different prediction models for different levels of student performance clearly improves the results.

Fig. 4. RMSE and MAE comparisons between GroupMLP and MLP

Figure 5 compares the proposed GroupMLP with the other baselines from recommender systems. The results show that GroupMLP works better than the baselines. However, with a time-based split of 67% training and 33% test data, the recommender system techniques may suffer from the “cold-start problem”, i.e., new users and new items may appear in the test set.
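A time-ordered split and a check for cold-start cases can be sketched as follows. The record layout (a sortable semester key followed by student, course, and mark) and the function names are assumptions for illustration; the sketch also cuts by record count, so a single semester may straddle the boundary, which may differ from how the split was actually made.

```python
def split_by_time(records, train_fraction=0.67):
    """records: list of (semester_key, student, course, mark).

    Sorts by semester_key and puts the earliest `train_fraction` of the
    records in the training set and the remainder in the test set.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def cold_start_cases(train, test):
    """Return test records whose student or course never appears in train."""
    seen_students = {rec[1] for rec in train}
    seen_courses = {rec[2] for rec in train}
    return [
        rec for rec in test
        if rec[1] not in seen_students or rec[2] not in seen_courses
    ]
```

Counting the records returned by `cold_start_cases` gives a rough sense of how much of the test set the collaborative filtering baselines cannot personalize.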

Fig. 5. RMSE and MAE comparisons between GroupMLP and baselines

Figure 6 presents the results for each model in the group (as presented in Fig. 1). The Very Good and Good levels show lower errors and promising performance. However, the lower mark levels, Fair and Poor, show high errors in both MAE and RMSE.

Fig. 6. Performance comparison with mark levels in MAE and RMSE

Moreover, in analyzing the results, we find that the students' English skill is a very important attribute. For example, without the English marks (BSC1, BSC2, BSC3 in Table 1), the MAE of the MLP model is 0.24564, whereas with these English course attributes the MAE drops to 0.23882, as shown in Table 2. This result suggests that students with better English can make better use of learning resources, most of which are in English.

The deep learning approach may take more time to train the models; however, this is not a problem, since the models are trained only once or twice per semester, when the marks change after the mid-term and final examinations.

Table 2. Feature analysis on English skill

4.4 Model for Early Warning and Course Recommendation

After evaluating the model, we applied the proposed method to a real case study, as shown in Fig. 7. Given the student information (or a list of students) and the course information described in Table 1, the model generates a predicted result for the student on that course. The result can be used in two cases:

  • Case 1: If the predicted result is less than 1.0 (or another selected threshold on the four-point grading scale, from 0 to 4), a warning is issued.

  • Case 2: If the predicted result is more than 3.25 (or another selected threshold) for an elective course, the course is recommended to the student.
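The two cases translate directly into a small decision rule. The sketch below uses the thresholds given above as defaults; the function name, the returned labels, and the `is_elective` flag are illustrative.

```python
WARNING_THRESHOLD = 1.0      # Case 1 default threshold
RECOMMEND_THRESHOLD = 3.25   # Case 2 default threshold


def triage(predicted_mark, is_elective,
           warning_threshold=WARNING_THRESHOLD,
           recommend_threshold=RECOMMEND_THRESHOLD):
    """Map a predicted mark on the 0-4 scale to an action label.

    Returns "warning" for Case 1, "recommend" for Case 2 (elective
    courses only), and "none" otherwise.
    """
    if predicted_mark < warning_threshold:
        return "warning"
    if is_elective and predicted_mark > recommend_threshold:
        return "recommend"
    return "none"
```

For example, `triage(0.8, False)` yields a warning, while `triage(3.5, True)` yields a recommendation; a predicted 3.5 on a compulsory course triggers neither.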

Fig. 7. The proposed early warning and course recommendation model for students

5 Conclusion

In this study, we proposed a method for predicting student performance using deep learning techniques on data collected from the information system of a Vietnamese multidisciplinary university. We analyzed and proposed several data pre-processing techniques applied before feeding the data into an MLP for the prediction tasks. The proposed method provides good prediction results and is expected to be applicable in practical cases. Using these results, educational managers, academic advisors, and students can receive early warnings so that students can plan their studies better. Moreover, evaluation across various training courses can help managers propose appropriate policies.

We will continue to perform experiments on other published data sets and to adjust the model settings for better performance. Further research should investigate grouping based on student performance in more depth, to better support students’ course selection and improve prediction performance.