Abstract
In the area of educational data mining (EDM), it is important to develop technologically sophisticated solutions. The exponential growth in educational data raises the possibility that conventional methods will be constrained or misapplied. The field of education is therefore increasingly interested in applying data mining methods. This work thoroughly analyzes and predicts students’ academic success using logistic regression, linear discriminant analysis (LDA), and principal component analysis (PCA) in order to anticipate students’ future performance. Logistic regression is enhanced by comparing LDA and PCA in a bid to improve precision. The findings demonstrate that LDA improved the accuracy of the logistic regression classifier by 8.86% over PCA’s output, correctly classifying 35 more instances. These results demonstrate that the model is effective for forecasting students’ performance from historical data.
Keywords
- Educational data mining
- Linear discriminant analysis
- Principal component analysis
- Logistic regression
- Data mining
1 Introduction
The use of statistics, learning algorithms, and data mining methodologies is the primary emphasis of data mining research in the field of EDM. The importance of data mining technology in the educational setting has grown over the last few decades, and it has soared to prominence in recent years as a result of the accessibility of open datasets and learning algorithms [1]. EDM entails the creation and implementation of data mining techniques that interpret the substantial amounts of data produced at various educational levels. Anticipating the learning process and evaluating student success are important objectives in the study of EDM [2]. It is a field which uncovers underlying relationships and trends in educational data. Heterogeneous data contributes to the big data paradigm in the education sector. In order to adaptively extract relevant information from educational datasets, specialized data mining techniques are required [3]. Many educational domains, including learning outcomes, dropout prediction, educational data analysis, and academic and behavioral analysis, have used data mining methods [4]. EDM has always placed a premium on assessing and forecasting students’ academic success. Higher education institutions must examine students based not only on their test results but also on how they learn, make projections about how they will perform academically in the future, and issue timely academic warnings. This work will assist students in raising their performance, which will enhance the management of educational resources while also assisting higher education in raising the quality of instruction [5]. The challenge of interpreting and making judgments from this enormous amount of information is growing progressively more onerous. High dimensionality is one of the primary challenges, although it can be addressed by employing dimensionality reduction techniques.
Dimensionality reduction refers to the process of converting high-dimensional data into a meaningful lower-dimensional representation. Among the various dimensionality reduction approaches that have been developed, PCA [6] and LDA [7] are two popular techniques that have been extensively employed in classification applications. Because LDA employs label information, it can produce better classification results than PCA, which is unsupervised. This study applied the PCA and LDA algorithms for dimensionality reduction, and their efficiency and effectiveness are systematically evaluated [8]. This work focuses on evaluating students’ academic achievement and predicting future success based on current performance. To reduce the dataset's dimensionality, this study uses PCA and LDA, with logistic regression as the dataset's classifier. Section 2 offers an analysis of previous works created by other researchers in the field of academic prediction. Section 3 discusses the experimental methods. The experimental results are described and discussed in Sects. 4 and 5. The conclusion and prospective future approaches are identified in Sect. 6.
2 Related Study
Academic performance prediction has been one of the key goals of academic practitioners. Collaboration research has shown that effective procedures can be created for academic prediction using computational methods (such as data mining). For academic prediction, numerous researchers have created a variety of prediction models incorporating data mining.
Karthikeyan et al. [9] developed a novel method known as a hybrid educational data mining framework to evaluate academic achievement and effectively enhance the educational experience. Crivei et al. [10] examined the applicability of unsupervised machine learning methods, particularly PCA and association rule mining, to assessing student academic performance. According to Javier et al. [11], EDM incorporates data mining techniques with educational data; their work lists well-known data mining methods, including correlation mining, factor analysis, and regression. Zuva et al. [12] provided a model which compares four classifiers in order to identify the best method for forecasting a learner's performance.
A key objective of this research is to improve on current prediction algorithms in light of the need for an efficient prediction method. As a result, a model must be put forward to improve the classification process.
3 Methodology
In this research work, the methodology was implemented by integrating the benefits of dimensionality reduction and classification. PCA and LDA are utilized in this work to lower the dimensionality, and their results are compared. PCA helps to eliminate features that are not essential to the model's goals, which reduces training time and expense and improves model performance [13]. LDA transforms high-dimensional data into a low-dimensional space by increasing the between-class scatter and decreasing the within-class scatter. After dimensionality reduction, logistic regression is employed as the supervised classifier for the dataset. Figure 1 depicts the implemented methodology.
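As a minimal sketch of this two-stage pipeline (reduce, then classify), assuming scikit-learn and synthetic data in place of the student dataset:

```python
# Sketch of the two-stage pipeline: dimensionality reduction followed by
# logistic regression. Synthetic data stands in for the student dataset.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 400 instances with 30 attributes, mirroring the student dataset's shape.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for reducer in (PCA(n_components=10), LinearDiscriminantAnalysis(n_components=1)):
    model = make_pipeline(reducer, LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)                    # reduce, then classify
    scores[type(reducer).__name__] = model.score(X_test, y_test)
```

Note that LDA's supervised projection is limited to at most (number of classes − 1) components, so for a binary pass/fail target it reduces the data to a single dimension.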
3.1 Dataset Description
The UCI machine learning repository's student dataset is used for this work. The dataset has 400 instances. The dataset consists of one target class and a total of 30 attributes. The dataset contains a total of 266 positive and 130 negative instances. The dataset's attributes are outlined below.
- Mother's Education
- Father's Education
- Home to School Travel Time
- Weekly Study Time
- Number of Past Class Failures
- Free Time After School
- Current Health Status
- Number of School Absences
- First Period Grade
- Second Period Grade
- Final Grade.
3.2 Data Preprocessing
Due to their enormous volume and likely origin from diverse sources, today's real-world databases are especially prone to noisy, missing, and inconsistent data [14]. In the data mining process, data quality is crucial, since poor data can produce erroneous predictions [15]. Data preprocessing's overarching goal is to eliminate undesirable variability or impacts for effective modeling [16]. As part of data preprocessing, normalization scales the existing data elements so that they fall within a narrow, comparable range, which increases speed and reduces complexity. A value V of the dataset is normalized using the Z-score method to create a normalized value V′ using the following equation:
$$V^{\prime} = \frac{V - Y}{Z}$$

- V′: Normalized value,
- V: Original value,
- Y: Mean,
- Z: Standard deviation.
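A minimal NumPy sketch of this normalization step (the grade values are illustrative, not taken from the student dataset):

```python
import numpy as np

def z_score(values):
    """Z-score normalization: V' = (V - Y) / Z, with Y the mean and Z the SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Illustrative first-period grades, not the actual dataset values.
grades = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
normalized = z_score(grades)   # result has mean 0 and standard deviation 1
```

After this transform every attribute is on the same scale (mean 0, unit standard deviation), so no single attribute dominates the covariance or scatter computations that follow.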
3.3 Implemented Model
The research work consists of two phases. For the processed dataset, dimensionality reduction was done in the first stage, and supervised classification was employed in the second. The well-known dimensionality reduction methods PCA and LDA are investigated in this work, with high-dimensional datasets used for performance analysis. Logistic regression was used to classify the data in order to compare how well each dimensionality reduction method performs. These results were used to infer the differences between the supervised and unsupervised dimensionality reduction methods.
3.4 Principal Component Analysis
Data analysis and machine learning frequently employ the dimensionality reduction method known as PCA. Its primary function is to maintain the majority of the original data while downscaling a high-dimensional dataset into a lower dimensional space. This is accomplished by locating the principal components, which are linear combinations of the original characteristics that encompass the broadest range of data variance.
PCA attempts to lower the dimension of the data by discovering a small subset of derived variables with the maximum variance, known as the principal components (PCs). The leading PCs account for the majority of the variance, so the remaining components can be discarded with little information loss [17]. PCA is used to keep as much of the given dataset's information as feasible while also reducing the dimensionality of the enormous data [18]. The goal is to convert the dataset X, which has p dimensions, into Y, which has L (L < p) dimensions, where Y holds the PCs of X. The procedure is as follows:

- (1) Configure Dataset: X consists of n vectors (x1, x2, …, xn), each a dataset instance.

- (2) Determine Mean:

$$\overline{x} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{i}$$(3)

- (3) Determine the Covariance:

$$C = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)\left( {x_{i} - \overline{x}} \right)^{T}$$(4)

- (4) Find Eigenvalues and Eigenvectors: The directions of the new feature space are determined by the eigenvectors of C, and their magnitudes by the corresponding eigenvalues.
Creating a feature vector: The eigenvectors are ranked from the highest eigenvalue to the lowest, which lists the components in descending order of importance. The eigenvector with the highest eigenvalue is the principal component of the dataset and is used to create the feature vector [19,20,21]. Creating a new dataset involves selecting the principal components to keep, forming a feature vector from them, and multiplying the mean-adjusted data by the transpose of that feature vector [19, 22,23,24,25].
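The steps above can be sketched in NumPy (a minimal illustration on random data, not the student dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 instances, 5 features

x_bar = X.mean(axis=0)                  # step (2): mean vector
X_c = X - x_bar                         # mean-adjusted data
C = X_c.T @ X_c / len(X)                # step (3): covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # step (4): eigendecomposition

order = np.argsort(eigvals)[::-1]       # rank eigenvectors by eigenvalue
W = eigvecs[:, order[:2]]               # feature vector: top-2 eigenvectors
Y = X_c @ W                             # new dataset: projection onto the PCs
```

`np.linalg.eigh` returns eigenvalues in ascending order, hence the explicit descending sort before selecting the top components.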
3.5 Linear Discriminant Analysis
By maximizing the between-class scatter and minimizing the within-class scatter, the LDA method reduces the dimensionality. It allows dimensionality reduction with minimal loss of discriminative information and is mostly used prior to classification [18].
- (1) Within-class scatter matrix:

$$s_{w} = \mathop \sum \limits_{j = 1}^{c} \mathop \sum \limits_{i = 1}^{{N_{j} }} \left( {x_{i}^{j} - \mu_{j} } \right)\left( {x_{i}^{j} - \mu_{j} } \right)^{T}$$(7)

- c: Number of classes,
- \(x_{i}^{j}\): ith sample of class j,
- μj: Mean of class j,
- Nj: Number of samples in class j.
- (2) Between-class scatter matrix:

$$s_{b} = \mathop \sum \limits_{j = 1}^{c} \left( {\mu_{j} - \mu } \right)\left( {\mu_{j} - \mu } \right)^{T}$$(8)

- µ: Mean of all classes.
LDA maximizes the ratio of the determinant of the between-class scatter matrix to the determinant of the within-class scatter matrix of the projected samples [18] (Table 1).
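A NumPy sketch of Eqs. (7) and (8) on synthetic two-class data (standing in for the pass/fail groups; the projection direction comes from the standard eigenproblem on the scatter matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic classes with different means, mimicking pass/fail groups.
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(2.0, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

mu = X.mean(axis=0)                    # mean of all classes
S_w = np.zeros((3, 3))                 # within-class scatter, Eq. (7)
S_b = np.zeros((3, 3))                 # between-class scatter, Eq. (8)
for j in np.unique(y):
    X_j = X[y == j]
    mu_j = X_j.mean(axis=0)
    S_w += (X_j - mu_j).T @ (X_j - mu_j)
    d = (mu_j - mu).reshape(-1, 1)
    S_b += d @ d.T

# The discriminant direction maximizes between-class relative to
# within-class scatter: the leading eigenvector of S_w^{-1} S_b.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = eigvecs[:, np.argmax(eigvals.real)].real
projected = X @ w                      # one-dimensional LDA projection
```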
3.6 Logistic Regression
Logistic regression is used to classify the data elements. In logistic regression, the target variable is binary: it contains only data that can be classified into two distinct groups, 1 or 0, corresponding to a student who will pass or fail academically. The aim of the logistic regression technique is to find the most reasonable model that describes the relationship between the target variable and the predictor variables [15]. The Sigmoid equation below serves as the foundation for the logistic regression model [15]:

$$\sigma \left( z \right) = \frac{1}{{1 + e^{ - z} }}$$

Figure 2 depicts the Sigmoid function graph.
The logistic regression classifier provides probability-based outcomes, with each class assigned a probability score between 0 and 1.
The cost function serves as the objective of optimization: it is minimized in order to develop a precise model with minimal error. The resulting model predicts the probability of a future event; the primary principle of logistic regression is to model the likelihood that an outcome will occur. Pseudocode 1 describes the logistic regression model used to train and test the data instances.
Pseudocode 1: Logistic Regression
- 1. Input: Featured Data
- 2. Output: Classified Data
- 3. For i = 1 to K
- 4. For each data instance dj
- 5. Set the target regression value

$$z_{j} = \frac{{y_{j} - P\left( {1|d_{j} } \right)}}{{P\left( {1|d_{j} } \right) \cdot \left( {1 - P\left( {1|d_{j} } \right)} \right)}}$$

- 6. Set the weight of instance dj to P(1|dj) · (1 − P(1|dj))
- 7. Fit a function f(j) to the data with class values (zj) and weights (wj)
- 8. Assign class label 1 if P(1|dj) > 0.5, otherwise class label 2.
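A compact gradient-descent sketch of the sigmoid-based classifier (a simplified stand-in for Pseudocode 1's weighted-regression scheme, run on synthetic pass/fail data):

```python
import numpy as np

def sigmoid(z):
    """The Sigmoid function underlying logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Minimize the logistic cost function with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)              # predicted P(pass | x)
        w -= lr * X.T @ (p - y) / len(y)    # gradient step on the weights
        b -= lr * (p - y).mean()            # gradient step on the bias
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)           # 0 = fail, 1 = pass

w, b = fit_logistic(X, y)
pred = (sigmoid(X @ w + b) > 0.5).astype(int)   # threshold at 0.5, as in step 8
accuracy = (pred == y).mean()
```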
4 Experimental Result
The student dataset, which has 400 instances and 30 attributes, is used in this work. The dataset statistics and description are given in Tables 2 and 3, respectively. The student dataset is the basis for performance analysis using the two dimensionality reduction techniques, PCA and LDA, together with the logistic regression classifier. Dimensionality reduction during preprocessing was accomplished using the LDA and PCA methods; logistic regression was then used to classify samples into the defined groups. Prior to deploying a predictive model, it is crucial to ensure its effectiveness and accuracy. The analysis and evaluation assess various criteria, including Precision, Recall, and Accuracy. Table 5 illustrates the implemented model's performance metrics.
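For reference, these criteria are computed from the confusion-matrix counts as follows (the counts below are illustrative, not the paper's results):

```python
# Illustrative confusion-matrix counts (not the paper's actual results):
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 250, 16, 20, 110

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction classified correctly
precision = tp / (tp + fp)                   # correctness of positive predictions
recall = tp / (tp + fn)                      # coverage of actual positives
```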
4.1 Employing Different Algorithms for Comparison
To further assess how the model works, the student dataset is modeled with three distinct algorithms using the original dataset, the PCA-processed data, and the LDA-processed data. The outcome is shown in Table 4. LDA enhanced the accuracy of the other algorithms, although Naive Bayes was an exception. As a result of PCA processing, Table 4 shows a decrease in Naive Bayes accuracy from 89 to 87%. LDA was also shown to improve the algorithms’ precision.
5 Discussion
The experimental findings show that LDA improves classification accuracy more than PCA. Jawad et al. [26] and Musso et al. [24] reported similar findings, with a precision of 96% (Table 1). According to the experimental findings, the proposed LDA approach increased logistic regression's classification accuracy on the student dataset. The accuracy of the model is assessed by comparing it to the classification results published by other researchers’ algorithms for academic prediction.
6 Conclusion and Future Work
This research work implemented an effective framework for predicting academic success. After carefully examining prior published works, the model combines logistic regression for classification with LDA for dimensionality reduction. First, the LDA approach is applied to the dataset with the goal of increasing classification accuracy. Although PCA is a widely used approach, its effectiveness with logistic regression has not received enough emphasis. In this research work, the integration of LDA and logistic regression yields better results for academic prediction. The logistic regression model also outperformed the other algorithms employed in this work, as well as findings from other studies, in terms of prediction performance.
References
Antonio HB, Boris HF, David T, Borja NC (2019) A systematic review of deep learning approaches to educational data mining. Complexity 2019:1306039
Tsiakmaki M, Kostopoulos G, Kotsiantis S, Ragos O (2020) Implementing AutoML in educational data mining for prediction tasks. Appl Sci 10(1):90–117
Kausar S, Huahu X, Hussain I, Zhu W, Zahid M (2018) Integration of data mining clustering approach in the personalized E-learning system. IEEE Access 6:72724–72734
Buenaño-Fernandez D, Villegas W, Luján-Mora S (2019) The use of tools of data mining to decision making in engineering education—a systematic mapping study. Comput Appl Eng Educ 27(3):744–758
Feng G, Fan M, Chen Y (2022) Analysis and prediction of students’ academic performance based on educational data mining. IEEE Access 10:19558–19571. https://doi.org/10.1109/ACCESS.2022.3151652
Turk M, Pentland A (2019) Face recognition using eigenfaces, computer vision and pattern recognition, proceedings CVPR’91. IEEE Comput Soc Conf Int J Emerg Technol Learn (iJET) 14(14):92
Belhumeur PN, Hespanha JP, Kriegman DJ (1997) Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans Pattern Anal Mach Intell
Vikram M, Pavan R, Dineshbhai ND, Mohan B (2019) Performance evaluation of dimensionality reduction techniques on high dimensional data. In: 2019 3rd international conference on trends in electronics and informatics (ICOEI), Tirunelveli, India, pp 1169–1174. https://doi.org/10.1109/ICOEI.2019.8862526
Karthikeyan VG, Thangaraj P, Karthik S (2020) ‘Towards developing hybrid educational data mining model (HEDM) for efficient and accurate student performance evaluation.’ Soft Comput 24(24):18477–18487
Crivei LM, Czibula G, Ciubotariu G, Dindelegan M (2020) Unsupervised learning based mining of academic data sets for students’ performance analysis. In: Proceedings of IEEE 14th international symposium on applied computational intelligence and informatics (SACI), Timisoara, Romania, May 2020, pp 11–16
Javier BA, Claire FB, Isaac S (2020) Data mining in foreign language learning. WIREs Data Min Knowl Discov 10(1):e1287
Li S, Liu T (2021) Performance prediction for higher education students using deep learning. Complexity 2021:1–10
Imran M, Latif S, Mehmood D, Shah MS. Student academic performance prediction using supervised learning techniques
Pang Y, Yuan Y, Li X (2008) Effective feature extraction in high dimensional space. IEEE Trans Syst Man Cybern B Cybern
Zhu C, Idemudia CU, Feng W (2019) Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques. Inform Med Unlock 17:100179
Archana HT, Sachin D (2015) Dimensionality reduction and classification through PCA and LDA. Int J Comput Appl 122(17):4–8. Available at https://doi.org/10.5120/21790-5104
Karalar H, Kapucu C, Gürüler H (2021) Predicting students at risk of academic failure using ensemble model during pandemic in a distance learning system. Int J Educ Technol Higher Educ 18(1)
Ramaphosa KIM, Zuva T, Kwuimi R (2018) Educational data mining to improve learner performance in Gauteng primary schools. In: 2018 international conference on advances in big data, computing and data communication systems (icABCD), pp 1–6. https://doi.org/10.1109/ICABCD.2018.8465478
Han J, Kamber M, Pei J (2012) Data mining concepts and techniques, 3rd edn. Morgan Kaufmann Publishers, USA
Mishra P, Biancolillo A, Roger JM, Marini F, Rutledge DN (2020) New data preprocessing trends based on ensemble of multiple preprocessing techniques. TrAC Trends Anal Chem 132:116045
Fan C, Chen M, Wang X, Wang J, Huang B (2021) A review on data pre-processing techniques toward efficient and reliable knowledge discovery from building operational data. Front Energy Res 9:652801
Smith LI (2002) A tutorial on principal components analysis
Yağcı M (2022) Educational data mining: prediction of students’ academic performance using machine learning algorithms. Smart Learn Environ 9(1)
Musso MF, Hernández CFR, Cascallar EC (2020) Predicting key educational outcomes in academic trajectories: a machine-learning approach. High Educ 80(5):875–894
Waheed H, Hassan SU, Aljohani NR, Hardman J, Alelyani S, Nawaz R (2020) Predicting academic performance of students from VLE big data using deep learning models. Comput Human Behav 104:106189
Jawad K, Shah MA, Tahir M (2022) Students’ academic performance and engagement prediction in a virtual learning environment using random forest with data balancing. Sustainability 14(22):14795
Sassirekha MS, Vijayalakshmi S (2022) Predicting the academic progression in student’s standpoint using machine learning. Automatika 63(4):605–617
Pujianto U, Agung Prasetyo W, Rakhmat Taufani A (2020) Students academic performance prediction with K-nearest neighbor and C4.5 on smote-balanced data. In: 2020 3rd international seminar on research of information technology and intelligent systems (ISRITI)
Tarbes BJ, Morales P, Levano M, Schwarzenberg P, Nicolis O, Peralta (2022) Explainable prediction of academic failure using Bayesian networks. In: 2022 IEEE international conference on automation/XXV congress of the Chilean association of automatic control (ICA-ACCA)
Echegaray-Calderon OA, Barrios-Aranibar D (2015) Optimal selection of factors using genetic algorithms and neural networks for the prediction of students’ academic performance. In: 2015 Latin America congress on computational intelligence (LA-CCI)
Xu X, Wang J, Peng H, Wu R (2019) Prediction of academic performance associated with internet usage behaviors using machine learning algorithms. Comput Hum Behav 98:166–173
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Vaidehi, B., Arunesh, K. (2024). An Evaluation of Prediction Method for Educational Data Mining Based on Dimensionality Reduction. In: Joby, P.P., Alencar, M.S., Falkowski-Gilski, P. (eds) IoT Based Control Networks and Intelligent Systems. ICICNIS 2023. Lecture Notes in Networks and Systems, vol 789. Springer, Singapore. https://doi.org/10.1007/978-981-99-6586-1_7
Print ISBN: 978-981-99-6585-4
Online ISBN: 978-981-99-6586-1