
1 Introduction

Our study falls within the field of Educational Data Mining, a relatively new area that investigates academic performance and identifies steps to improve it. This study targets Higher Secondary Certificate (HSC) (12th-grade equivalent) candidates; all data were collected from recent HSC examinees who are now university freshers.

The goal of our study is to predict the HSC result from socioeconomic, psychological and academic factors and to discover patterns in the attributes, allowing students to visualize how these attributes influence their educational performance and thereby improve their HSC outcome. Guardians of students and responsible authorities could also use the information generated by our model to help students improve their HSC performance.

2 Related Work

In [1], the authors used model-based clustering, other clustering methods and the K-means technique to group students according to their skill level. In [2], the authors used quiz scores, midterm marks, final marks, lab marks and CGPA as attributes to predict the semester-final grade of individual courses; using ANN, C4.5 and Naïve Bayes for classification, they reached an accuracy of around 82%. To predict academic performance, [3] used class attendance, seminar participation, class test scores and assignment marks as attributes, and designed the predictive model with the C4.5, CART and ID3 algorithms.

3 Designing Survey Questionnaire and Data Set

Designing a predictive model depends greatly on the quality of the data set. We were very careful while designing the survey questionnaire, as the survey is the source of the attributes for the predictive model. To obtain relevant attributes, we interviewed experts and explored different factors related to academic performance, and we made sure that every attribute has been shown to be related to academic performance in published research. We surveyed 423 students from different locations and economic backgrounds to maintain diversity in the data set. Our data set consists of 33 attributes grouped into 3 main categories, namely socioeconomic, psychological and academic, as shown in Table 1.

Table 1 All the attributes used in designing the predictive model

4 Overview of Designing Predictive Model

The design of the predictive model is shown in Fig. 1. After obtaining the raw data, we realized that the data set needed to be pre-processed before applying any learning algorithm in order to improve predictive accuracy. As pre-processing, we used a data balancing technique to balance the data set, Principal Component Analysis (PCA) for dimensionality reduction, Optimal Equal Width binning for discretization, and finally a normalization technique. We then applied 3 different classifiers, namely Artificial Neural Network (ANN), K-Nearest Neighbors (K-NN) and Support Vector Machine (SVM), and after a comparative study of the classifiers we derived the best model for the prediction.

Fig. 1 An overview of our system
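A minimal end-to-end sketch of the pipeline in Fig. 1 follows, assuming a pandas DataFrame df with the 33 numeric survey attributes and a target column "HSC_result" (both names assumed for illustration). Plain SMOTE stands in for SMOTEBoost, and fitting the transforms before the split is a simplification; this is a sketch, not the exact implementation used in the paper.

```python
# Sketch of the pipeline: balance, reduce, discretize, normalize, then compare classifiers.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = df.drop(columns=["HSC_result"]), df["HSC_result"]

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)              # balance the classes
X_pca = PCA(n_components=0.95).fit_transform(X_bal)                  # dimensionality reduction
X_bin = KBinsDiscretizer(n_bins=8, encode="ordinal",
                         strategy="uniform").fit_transform(X_pca)    # equal-width binning
X_norm = MinMaxScaler().fit_transform(X_bin)                         # min-max normalization

X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y_bal, test_size=0.3, stratify=y_bal, random_state=0)    # 70:30 stratified split

for clf in (KNeighborsClassifier(), SVC(kernel="linear"), MLPClassifier(max_iter=1000)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```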

5 Data Pre-processing

5.1 Data Balancing

Our data set is not balanced, as shown in Fig. 2. The problem with imbalanced data can be visualized with mammographic images used as a data set: the images contain 98% normal pixels and 2% abnormal pixels, so simply by guessing that a pixel is normal one can be right 98% of the time [4]. We used the SMOTEBoost technique [5], which combines the Synthetic Minority Oversampling Technique with a boosting procedure, to give the majority and minority classes equal representation and thereby improve the accuracy of the classifiers.

Fig. 2 Instances of the target class (HSC result)
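imbalanced-learn does not provide SMOTEBoost itself, so the sketch below uses plain SMOTE oversampling to illustrate the balancing step; X and y denote the attribute matrix and HSC-result labels (assumed names).

```python
# Illustrative class balancing with SMOTE (the oversampling part of SMOTEBoost).
from collections import Counter
from imblearn.over_sampling import SMOTE

print("before:", Counter(y))                          # skewed distribution, as in Fig. 2
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_bal))                      # every class now matches the majority count
```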

5.2 Dimensionality Reduction

The data set consists of 33 attributes; many of them are dispersed from each other and some are redundantly related to the target class. Removing this redundancy helps the classifiers by avoiding the curse of dimensionality [6]. We used principal component analysis (PCA) to reduce the dimensionality of our data set. PCA applies an orthogonal transformation, so the resulting components are uncorrelated with one another [7].
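A minimal PCA sketch with scikit-learn, assuming the balanced attribute matrix X_bal from the previous step; keeping 95% of the variance is an illustrative setting, not the one reported in the paper.

```python
# Project the survey attributes onto mutually uncorrelated principal components.
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)              # keep enough components for 95% of the variance (illustrative)
X_pca = pca.fit_transform(X_bal)
print(X_pca.shape[1], "components retained")
print(pca.explained_variance_ratio_)      # variance explained by each retained component
```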

5.3 Discretization

Our data set contains attributes such as the SSC and HSC results, which are continuous-valued. To discretize the data set we applied Optimal Equal Width binning [8], which dynamically searches for the optimal width and number of bins for the target class. For our data set this produced 8 bins with a width of 0.5. Since the HSC result lies in the range 1–5, the resulting bins are: [−∞, 1.5], [1.5, 2], [2, 2.5], [2.5, 3], [3, 3.5], [3.5, 4], [4, 4.5] and [4.5, ∞].
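Optimal Equal Width binning searches for the width itself; the sketch below simply reproduces the reported 0.5-wide bins with pandas, assuming a Series hsc_result of grade points (name assumed).

```python
# Reproduce the reported equal-width bins for the HSC grade point (range 1-5).
import numpy as np
import pandas as pd

edges = [-np.inf, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, np.inf]   # 8 bins of width 0.5
hsc_binned = pd.cut(hsc_result, bins=edges)
print(hsc_binned.value_counts().sort_index())                    # instances per bin
```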

5.4 Normalization

Our data set consists of attributes with different scales and units: for instance, “Weekly study time” is measured in hours, average family income in Taka, “Family involvement” on a scale of 5, and “Health status” on a scale of 10. To ensure that each attribute contributes equally to predicting the target class, we used the min-max normalization technique [9] to rescale the data set and produce a better training set for the classifiers, using Eq. (1).

$$ x^{{\prime }} = \frac{{x_{i} - x_{min} }}{{x_{max} - x_{min} }} $$
(1)

This technique preserves all the relationships among the attributes while rescaling them so that each attribute contributes equally to predicting the target class.
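Eq. (1) written out directly with NumPy (scikit-learn's MinMaxScaler is equivalent), assuming a numeric attribute matrix X with one column per attribute:

```python
# Column-wise min-max rescaling to [0, 1], per Eq. (1).
import numpy as np

def min_max(X):
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)   # assumes no constant column (x_max != x_min)

X_norm = min_max(np.asarray(X, dtype=float))
```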

6 Learning Algorithms

After applying the pre-processing techniques, we used learning algorithms to predict HSC examination performance. We split the data set 70:30 into training and testing sets. To keep the class distribution consistent across both parts, we used stratified sampling, which removes the risk of an inconsistent split adversely affecting predictive accuracy.
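A sketch of the 70:30 stratified split with scikit-learn, assuming the pre-processed matrix X_norm and labels y_bal from the earlier steps:

```python
# 70:30 split that preserves the class distribution in both parts.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y_bal, test_size=0.3, stratify=y_bal, random_state=0)
# With stratify set, train and test show (approximately) the same class proportions.
```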

6.1 K-Nearest Neighbors (K-NN)

K-NN classifies each instance from the known outputs of the other instances in the data set. All attributes are transformed into a vector space, where each of the N dimensions of an attribute is represented by a vector [10]. K-NN uses the Euclidean distance of Eq. (2) as the similarity metric amongst neighbors.

$$ d_{E} (x,y) = \sqrt {\sum\limits_{i = 1}^{N} {(x_{i} - y_{i} )^{2} } } $$
(2)

It works the following way in our system (a code sketch follows the list):

  (a) Finding an appropriate k-value to obtain similar neighbors for classification.

  (b) Calculating the distance between the training samples and the testing sample.

  (c) Sorting the samples based on the distance.

  (d) Applying a majority vote across the k nearest neighbors to produce the predicted HSC result.
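A sketch of steps (a)–(d) with scikit-learn, where the k-value of step (a) is chosen by cross-validation on the training set; the candidate range of k is illustrative.

```python
# Steps (a)-(d): pick k by cross-validation, then classify by majority vote of the k nearest neighbors.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

search = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                      param_grid={"n_neighbors": list(range(1, 21))}, cv=5)
search.fit(X_train, y_train)                 # step (a): find an appropriate k
knn = search.best_estimator_
y_pred = knn.predict(X_test)                 # steps (b)-(d): distances, sorting, majority vote
print("best k:", search.best_params_["n_neighbors"], "accuracy:", knn.score(X_test, y_test))
```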

6.2 Support Vector Machine (SVM)

SVM separates the class labels by constructing hyperplanes in a multidimensional feature space. It iteratively applies a training algorithm that produces an optimal hyperplane while minimizing the error function at each iteration [11]. The error function is given in Eq. (3).

$$ \frac{1}{2}w^{t} w + c\sum\limits_{i = 1}^{N} {\xi_{i} } $$
(3)
$$ {\text{Subject}}\,{\text{to}}:\;y_{i} \left( {w^{T} \phi (x_{i} ) + b} \right) \ge 1 - \xi_{i} \;{\text{and}}\;\xi_{i} \ge 0,\;i = 1, \ldots ,N $$

where,

  • C = capacity constant (regularization parameter)

  • w = Vectors of coefficients

  • b = Constant

  • \( \xi_{i } \) = Parameter that handles non-separable data

  • i = 1, …, N indexes the training cases

  • \( y \in \pm 1 \) = Class labels

  • \( x_{i} \) = Independent variable

  • Kernel \( \phi \) transforms data to the feature space from the input.

As C increases, the error gets penalized more; the value of C can be set to avoid the problem of overfitting. Dot kernel has been applied for our system as shown below in Eq. (4).

$$ K\left( {X_{i} ,X_{j} } \right) = \left\{ {\begin{array}{*{20}l} {X_{i} \cdot X_{j} } \hfill & {Linear} \hfill \\ {(\gamma \,X_{i} \cdot X_{j} + C)^{d} } \hfill & {Polynomial} \hfill \\ {\exp ( - \gamma \left| {X_{i} - X_{j} } \right|^{2} )} \hfill & {RBF} \hfill \\ {\tanh (\gamma \,X_{i} \cdot X_{j} + C)} \hfill & {Sigmoid} \hfill \\ \end{array} } \right. $$
(4)
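A sketch of the dot (linear) kernel SVM with scikit-learn; the value of C is illustrative and would normally be tuned to balance training error against overfitting, as noted above.

```python
# Linear (dot-product) kernel SVM; larger C penalizes the slack variables more heavily.
from sklearn.svm import SVC

svm = SVC(kernel="linear", C=1.0)            # C = 1.0 is illustrative, not the tuned value
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```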

6.3 Artificial Neural Network

Artificial neurons perform their actions much like their biological counterparts: they transmit information among themselves and help each other ‘fire’ based on the given inputs. To ‘fire’, the activation function shown in Eq. (5) must be satisfied [12].

$$ a_{i} = \sum\limits_{j = 1}^{N} {W_{ji} x_{j} } + \theta_{i} $$
(5)

where, \( x_{j} \) is the output from a neuron or an external input, \( W_{ji} \) is the weight and \( \theta_{i} \) is the threshold.

We have used a ‘feed forward’ architecture for our system: the outputs of one layer of neurons are fed into the inputs of the neurons in the next layer, with the ‘hidden’ layer(s) lying between the input and output layers. The number of neurons in the hidden layer of our system is calculated using Eq. (6).

$$ \begin{aligned} & (Number\,of\,Attributes + Number\,of\,Classes)/2 + 1 \\ & \quad = (32 + 1)/2 + 1 \approx 17\,{\text{hidden}}\,{\text{neurons}} \\ \end{aligned} $$
(6)
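A feed-forward network with a single hidden layer of 17 neurons, as suggested by Eq. (6), sketched with scikit-learn's MLPClassifier; the activation, solver defaults and iteration budget are illustrative choices, not the exact configuration used in the paper.

```python
# Feed-forward ANN with one hidden layer of 17 neurons (Eq. 6).
from sklearn.neural_network import MLPClassifier

ann = MLPClassifier(hidden_layer_sizes=(17,), activation="logistic",
                    max_iter=1000, random_state=0)
ann.fit(X_train, y_train)
print("test accuracy:", ann.score(X_test, y_test))
```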

7 Result Analysis

7.1 K-Nearest Neighbors

With no pre-processing, K-NN provides an accuracy of 30%, as shown in Table 2, but after introducing the pre-processing techniques it starts to achieve better accuracy. After applying PCA and SMOTEBoost its accuracy increases to 50.35%; with Optimal Equal Width binning it reaches around 63%, and after normalization the accuracy notably rises to 70%. This is because K-NN uses the Euclidean distance between neighbors, so rescaling the attributes into a normalized range helps K-NN achieve such a high score.

Table 2 Performance comparison of the different predictive models

7.2 Artificial Neural Network (ANN)

ANN provides 70% accuracy without any pre-processing technique, and SMOTEBoost and PCA do not bring any promising change in accuracy. Discretization with Optimal Equal Width binning improves its accuracy by 7%, while normalization has no significant impact on the accuracy of ANN.

7.3 Support Vector Machine (SVM)

SVM provides an accuracy of 70.88% without any pre-processing technique. After balancing the data set with SMOTEBoost the accuracy increases by 8%; because SMOTEBoost gives the majority and minority classes equal representation, it helps SVM achieve better accuracy. Binning and normalization lower the accuracy of SVM.

7.4 Key Findings

We have generated key findings using decision trees, as shown below. One observation is that “higher family involvement notably increases performance, producing better results for the same weekly study time”. Family involvement is measured on a scale of 1–5, where 1 is the least involvement and 5 is the highest. As the rule below shows, performance improves with increasing family involvement even for students with lower study time.

SSC result = 4.250
| Family Involvement = 1: (HSC result) 2.750
| Family Involvement = 3
| | Weekly Study Time = 1 hour: (HSC result) 3.750
| | Weekly Study Time = 2 hours: (HSC result) 4.000
| Family Involvement = 4: (HSC result) 4.250
| Family Involvement = 5
| | Weekly Study Time = 1 hour: (HSC result) 4.000
| | Weekly Study Time = 2 hours: (HSC result) 4.500
| | Weekly Study Time = 3 hours: (HSC result) 4.750

Students with previous failures achieve better results as family involvement in their studies increases, as shown in the tree below.

Previous failures = 3 (Number of previous failures)
| Family Involvement = 2: (HSC result) 2.750
| Family Involvement = 3: (HSC result) 3.250
| Family Involvement = 4: (HSC result) 3.500

Another observation is that “when parents live apart, being in a romantic relationship hampers a student's academic attainment”. Conversely, a romantic relationship can improve academic performance when the parents live together.

Parent Status = Living Apart
| Romantic Relation = no: (HSC result) 2.750
| Romantic Relation = yes: (HSC result) 1
Parent Status = Living Together
| Romantic Relation = no: (HSC result) 2.250
| Romantic Relation = yes: (HSC result) 3.250
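Rule trees like those above can be reproduced with any decision-tree learner. The following is a hedged sketch using scikit-learn's DecisionTreeRegressor and export_text; the column names are assumed labels from our survey data, and the library prints numeric threshold splits rather than the exact categorical branches shown above.

```python
# Fit a shallow regression tree on a few attributes and print its rules,
# analogous to the trees above (thresholded splits instead of exact categories).
from sklearn.tree import DecisionTreeRegressor, export_text

features = ["SSC_result", "Family_Involvement", "Weekly_Study_Time"]   # assumed column names
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(df[features], df["HSC_result"])
print(export_text(tree, feature_names=features))
```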

8 Conclusion

The highest accuracy of the system is 78.5%, obtained by SVM when SMOTEBoost is used along with PCA, and the second highest is 77.78%, obtained by ANN with PCA, SMOTEBoost, binning and normalization. Pre-processing has a significant impact on the classifiers most of the time. The predictive model and the key findings obtained by visualizing the data set provide students and their parents with an important instrument for achieving better academic performance in the HSC examination.