Abstract
Nowadays, student performance and its evaluation are a challenge in general terms. Frequently, the students’ scores for a specific curriculum contain several gaps due to different reasons. In this context, the lack of data for any of the student scores adversely affects any future analysis performed to reach conclusions. When this occurs, a data imputation process must be carried out in order to substitute the missing data with estimated values. This paper presents a comparison between two data imputation methods: the Adaptive Assignation Algorithm (AAA), developed by the authors in previous research and based on Multivariate Adaptive Regression Splines (MARS), and the Multivariate Imputation by Chained Equations (MICE) technique. The results obtained demonstrate that the proposed methods achieve good results, especially the AAA algorithm.
1 Introduction
According to the guidance of quality assurance systems under the European Higher Education Area (EHEA), the tracking of studies is regulated from a legal point of view and is, of course, obligatory for official university degrees [1]. From this point of view, the internal quality systems of educational institutions, with the aim of ongoing improvement, try to enhance their quality ratios or indicators in terms of academic results and performance [2]. This means that faculties and higher education schools need tools to support or assist them in this task [3, 4].
As previous work towards a decision-making tool, a way to obtain the required knowledge is usually necessary. Traditionally, in past research works, the common method is to obtain a model based on a historical dataset, whether through traditional techniques or through more advanced ones [5–8].
The above method can be a problem in general terms, given the need for previous cases with similar performance [9–13]. It is also necessary to remark that the case under study may change; if so, the model must adapt to novel cases with different casuistry and performance [14–17]. In this sense, imputation methods based on evolutionary methods could be a good solution to the problem described here.
This paper evaluates two imputation methods that allow the system to fill in the missing data in any of the students’ scores used in this research. One of the algorithms, the Adaptive Assignation Algorithm (AAA) [18], is based on Multivariate Adaptive Regression Splines; the other is Multivariate Imputation by Chained Equations (MICE) [19]. The first one performs well in general terms when the percentage of missing data in a case is low; when this is not so, the second method is more appropriate. The right combination of both algorithms is a good solution, which requires establishing the boundary between their fields of application.
This paper is structured in the following way. After the present section, the case study is described; it consists of the students’ scores dataset of the Electrical Engineering Studies Degree of the University of A Coruña. Then, the techniques for missing data imputation are shown. The results section shows the outcomes achieved with the imputation over the dataset for three different cases. After that, the conclusions and future work are presented.
2 Case Study
The dataset used in this research comprises the students’ scores in the Electrical Engineering Studies Degree of the University of A Coruña, from the 2001/2002 to the 2008/2009 academic year. The dataset includes the scores for each subject in the degree: nine subjects in the first year, another nine in the second year, seven in the last year, and the final project.
The data also include the scores obtained in, and the route used to access, the University studies; in Spain, there are two different access routes: from secondary school or from vocational education and training. Moreover, the scores for the subjects in the degree include not only the mark; the number of attempts needed to pass each subject is also included.
The dataset under study is complete. This is an important fact for testing the performance of the algorithms used in this study: it makes it possible to emulate several different percentages of missing values and to compare both methods with the aim of establishing the right frontier between their application ranges. Then, through their combination, a hybrid model is obtained that increases the applicability of the method to a wide range of cases.
3 Data Imputation Techniques Used
This section describes the data imputation techniques employed in the present research.
3.1 The MICE Algorithm
The MICE algorithm, developed by van Buuren and Groothuis-Oudshoorn [20], is a Markov Chain Monte Carlo method where the state space is the collection of all imputed values. Like any other Markov chain, in order to converge, the MICE algorithm needs to satisfy the following three properties [21–23]:
- Irreducibility: the chain must be able to reach all parts of the state space.
- Aperiodicity: the chain should not oscillate between different states.
- Recurrence: a Markov chain is recurrent if the probability that the chain starting from state i will return to i is equal to one.
In practice, the convergence of the MICE algorithm is achieved after a relatively low number of iterations, usually somewhere between 5 and 20 [23]. According to the experience of the algorithm’s creator, five iterations are in general enough, but some special circumstances may require a greater number. In the case of the present research, and given the performance of the results obtained when compared with the other methods applied, five iterations were considered to be enough. This number of iterations is much lower than in other applications of Markov Chain Monte Carlo methods, which often require thousands of iterations. In spite of this, and from the researchers’ point of view and experience, it must also be remarked that, in the most common applications, each iteration of the MICE algorithm may take several minutes or even a few hours. Furthermore, the duration of each iteration is mainly linked to the number of variables involved in the calculation and not to the number of cases. It must be taken into consideration that imputed data can have a considerable amount of random noise, depending on the strength of the relations between the variables; in those cases in which there are low correlations among variables, or they are completely independent, the algorithm will converge faster. Finally, high rates of missing data (20 % or more) slow down the convergence process.

The MICE algorithm [23] for the imputation of multivariate missing data consists of the following steps:
1. Specify an imputation model \( P(Y_{j}^{mis} |Y_{j}^{obs} ,Y_{ - j} ,R) \) for variable \( Y_{j} \), with \( j = 1, \ldots ,p \). The MICE algorithm obtains the posterior distribution of \( \emptyset \) by sampling iteratively from the conditional formula above. The parameters \( \emptyset \) are specific to the respective conditional densities and are not necessarily the product of a factorization of the true joint distribution.
2. For each \( j \), fill in starting imputations \( Y_{j}^{0} \) by random draws from \( Y_{j}^{obs} \).
3. Repeat for \( t = 1, \ldots ,T \) (iterations):
4. Repeat for \( j = 1, \ldots ,p \) (variables):
5. Define \( Y_{ - j}^{t} = (Y_{1}^{t} , \ldots ,Y_{j - 1}^{t} ,Y_{j + 1}^{t - 1} , \ldots ,Y_{p}^{t - 1} ) \) as the currently complete data except \( Y_{j} \).
6. Draw \( \emptyset_{j}^{t} \sim P\left( {\emptyset_{j}^{t} |Y_{j}^{obs}, Y_{ - j}^{t}, R} \right) \).
7. Draw imputations \( Y_{j}^{t} \sim P\left( {Y_{j}^{mis} |Y_{j}^{obs}, Y_{ - j}^{t},R,\emptyset_{j}^{t} } \right) \).
8. End repeat \( j \).
9. End repeat \( t \).
In the algorithm referred to, \( Y \) represents an n × p matrix of partially-observed sample data, \( R \) is an n × p matrix of 0–1 response indicators of \( Y \), and \( \emptyset \) represents the parameter space. Please note that in MICE imputation [24], initial guesses for all missing elements are provided for the n × p matrix of the partially observed sample. For each variable with missing elements, the data are divided into two subsets, one of them containing all the missing data. The subset with all available data is regressed on all the other variables; then, the missing subset is predicted from the regression, and the missing values are replaced with those obtained from it. This procedure is repeated for all variables with missing elements. After all the missing elements have been imputed according to the algorithm explained above, the regressions and predictions are repeated until the stop criterion is reached; in this case, until a certain number of consecutive iterates fall within the specified tolerance for each of the imputed values.
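The sweep structure of steps 2–9 can be sketched in a few lines of Python. This is a deliberately simplified, single-imputation version: the conditional models are plain least-squares regressions on the row-wise mean of the other columns, and the posterior draws of \( \emptyset_{j}^{t} \) (step 6) are omitted, so it illustrates the chained-equations loop rather than the full MICE algorithm. All names are illustrative and not part of any MICE implementation.

```python
import statistics

def simple_regress(xs, ys):
    """Ordinary least squares fit y = a + b*x on paired observations."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # degenerate predictor: fall back to the mean
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def chained_impute(cols, n_iter=5):
    """Single-imputation chained-equations sweep.

    cols: dict mapping column name -> list of values, None where missing.
    Step 2: initialise missing cells with the column mean.
    Steps 3-9: sweep the columns, re-estimating every missing cell from a
    regression on the row-wise mean of the other (currently filled) columns.
    """
    names = list(cols)
    n_rows = len(next(iter(cols.values())))
    filled, missing = {}, {}
    for name in names:
        observed = [v for v in cols[name] if v is not None]
        mean = statistics.fmean(observed)
        missing[name] = [i for i, v in enumerate(cols[name]) if v is None]
        filled[name] = [mean if v is None else v for v in cols[name]]
    for _ in range(n_iter):                      # t = 1..T
        for name in names:                       # j = 1..p
            if not missing[name]:
                continue
            others = [o for o in names if o != name]
            xs = [statistics.fmean(filled[o][i] for o in others)
                  for i in range(n_rows)]
            train = [i for i in range(n_rows) if i not in missing[name]]
            a, b = simple_regress([xs[i] for i in train],
                                  [filled[name][i] for i in train])
            for i in missing[name]:              # re-impute the missing cells
                filled[name][i] = a + b * xs[i]
    return filled
```

For instance, with columns a = [1, 2, 3, 4, 5] and b = [2, 4, None, 8, 10], the missing entry of b is recovered as 6.0, since b equals 2a exactly on the observed rows.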
3.2 The AAA Algorithm
In order to explain the AAA, let us assume that we have a dataset formed by \( n \) different variables \( v_{1} , v_{2} , \ldots , v_{n} \). To calculate the missing values of the i-th column, all the rows with no missing value in that column are employed, and a certain number of MARS models are calculated. It is possible to find rows with very different amounts of missing data, from 0 (no missing data) to \( n \) (all values missing). Those rows with all values missing will be removed and will neither be used for model calculation nor imputed. Therefore, any amount of missing data from 0 to \( n - 1 \) is feasible (all variables but one with missing values).
In other words, if the dataset is formed by variables \( v_{1} , v_{2} , \ldots , v_{n} \) and we want to estimate the missing values in column \( v_{i} \), then the maximum number of different MARS models that would be computed for this variable (and, in general, for each column) is: \( \sum\nolimits_{k = 1}^{n - 1} {\left( {\begin{array}{*{20}c} {n - 1} \\ k \\ \end{array} } \right)} \). For the data under study in this research, with 10 different variables, a maximum of 5,110 distinct MARS models would be trained (511 for each variable).
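Since \( \sum\nolimits_{k = 1}^{n - 1} \binom{n - 1}{k} = 2^{n - 1} - 1 \), the model count above can be checked in a couple of lines:

```python
from math import comb

n = 10  # number of variables in the dataset under study
per_variable = sum(comb(n - 1, k) for k in range(1, n))  # sum_{k=1}^{n-1} C(n-1, k)
total = per_variable * n
print(per_variable, total)  # 511 models per variable, 5110 in total
```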
After the calculation of all the available models, the missing data in each row will be estimated using those models that employ all the available (non-missing) variables of the row. As a general rule for the algorithm, it has been decided that when a certain value can be estimated using more than one MARS model, the model with the largest number of input variables must be used; in case of a tie, the value is estimated by one of those models chosen at random. Finally, in those exceptional cases in which no model is available for the estimation, the median value of the variable will be used for the imputation. Please note that, in the case of large datasets with a not-too-high percentage of missing data, this will be an infrequent situation.
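The model-selection rule just described can be sketched as follows. The MARS models themselves are stood in for by arbitrary predictor callables keyed by their input-variable sets (an assumption for illustration); the sketch only shows the rule of preferring the applicable model with the most inputs, breaking ties at random, and falling back to the column median.

```python
import random

def pick_model(models, available):
    """models: dict mapping frozenset of input variables -> fitted predictor
    (placeholders here, MARS models in the paper). Among the models whose
    inputs are all available for this row, choose one with the largest
    number of inputs, breaking ties at random; return None if none apply."""
    usable = [s for s in models if s <= available]
    if not usable:
        return None
    best = max(len(s) for s in usable)
    return random.choice([s for s in usable if len(s) == best])

def impute_row(row, target, models, column_median):
    """Estimate the missing `target` value of one row (dict of variable
    name -> value, None where missing), falling back to the column median."""
    available = frozenset(k for k, v in row.items() if v is not None and k != target)
    chosen = pick_model(models, available)
    if chosen is None:
        return column_median  # exceptional case: no applicable model
    return models[chosen](row)
```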
3.3 Models Validation
Leave-one-out cross-validation has been used to analyze the spatial error of the interpolated data [25, 26]. This procedure involves using eight of the nine stations in the model to obtain the estimated value at the ninth station (the one left out), in order to calculate the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) for this station. The process is repeated nine times, once for each station.
The performance of the methods has been evaluated using two common statistics, the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE):

\( RMSE = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {(G_{i} - \widehat{{G_{i} }})^{2} } } \qquad MAE = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {G_{i} - \widehat{{G_{i} }}} \right|} \)

where \( G_{i} \) and \( \widehat{{G_{i} }} \) are the measured and the model-estimated values, and \( n \) is the number of data points of the validation set. The RMSE weights large estimation errors more heavily than small errors, and it is considered a very important model validation metric. MAE is also a useful complement to the measured-modeled scatter plot near the 1-to-1 line [24].
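Both metrics follow directly from their definitions; as a reference, they can be computed as:

```python
from math import sqrt

def rmse(measured, estimated):
    """Root Mean Square Error: penalises large errors more heavily."""
    n = len(measured)
    return sqrt(sum((g - gh) ** 2 for g, gh in zip(measured, estimated)) / n)

def mae(measured, estimated):
    """Mean Absolute Error: average magnitude of the estimation errors."""
    n = len(measured)
    return sum(abs(g - gh) for g, gh in zip(measured, estimated)) / n
```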
4 Results
To assess the performance of each algorithm, several tests were made with different amounts of missing data. First of all, it is necessary to remark that, for the results shown in the tables, only ten columns of the total dataset have been taken into account. Each column represents a different subject, and the selection was made randomly. In all tests, the overall percentage of missing data is always the same, 10 %, but the number of missing values per case was varied from 1 to more than 3, depending on the test.
Table 1 shows the performance of each algorithm with only 1 missing value in each case. It can be appreciated that the AAA algorithm is clearly better than MICE.
In Table 2, the performance was calculated for 2 missing values. In this case, as in the previous one, the AAA algorithm is clearly better than MICE, but the difference between the performance of the two algorithms is smaller.
The results presented in Table 3 show that, when the number of missing values increases to 3, the MICE algorithm performs better than the AAA.
With the aim of obtaining the best results, a hybrid of the two algorithms was implemented. The results of this hybrid system are shown in Table 4. In this table, the percentage of missing values is fixed at 10 %, but the number of missing values per case is random. When there are fewer than 3 missing values, the AAA algorithm is selected; MICE is chosen in the other cases.
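The selection rule of the hybrid system amounts to a simple dispatch on the per-case missing count. In this sketch, the two imputer callables are placeholders standing in for the fitted AAA and MICE models:

```python
def hybrid_impute(row, aaa_imputer, mice_imputer, threshold=3):
    """Hybrid rule from the paper: use AAA when the row has fewer than
    `threshold` missing values, and MICE otherwise. The imputer arguments
    are illustrative callables, not the actual fitted models."""
    n_missing = sum(1 for v in row if v is None)
    imputer = aaa_imputer if n_missing < threshold else mice_imputer
    return imputer(row)
```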
Figure 1 shows the evolution of the RMSE for the two algorithms and the hybrid combination. The hybrid algorithm is not the best one in every case, but it keeps the RMSE constant regardless of the number of missing values. The blue continuous line represents the MICE algorithm, the red dotted line the AAA algorithm, and the black dashed line the results of the combined algorithm.
5 Conclusions
In general terms, very good results have been obtained with the data imputation techniques employed in this study.
It is possible to predict the scores of the students for the three cases contemplated, assuming the data do not exist and comparing the estimated results with the real dataset. The average RMSE for MICE was 0.50759, varying from 2.47e-3 to 1.54849; for AAA, the average RMSE was 0.29130, with a minimum of 3.11e-31 and a maximum of 1.29216. The hybrid combination of these two algorithms achieved an average RMSE of 4.92e-3, varying from 4.26e-5 to 9.72e-3.
These techniques could be used to predict missing data and then to carry out studies of the students’ performance taking all the cases into account.
In future research, the use of support vector machines (SVM) [26, 27] and hybrid methods [28–30] will be explored by the authors in order to find a new algorithm with even higher performance.
References
Ferreira, F.H.G., Gignoux, J.: The measurement of educational inequality: achievement and opportunity. World Bank Econ. Rev. 28(2), 210–246 (2014). doi:10.1093/wber/lht004
Grissom, J.A., Kalogrides, D., Loeb, S.: Using student test scores to measure principal performance. Educ. Eval. Policy Anal. 37, 3–28 (2015). doi:10.3102/0162373714523831
López-Vázquez, J.A., Orosa, J.A., Calvo-Rolle, J.L., Cos Juez, F.J., Casteleiro-Roca, J.L., Costa, A.M.: A new way to improve subject selection in engineering degree studies. In: International Joint Conference: CISIS 2015 and ICEUTE 2015 (2015). doi:10.1007/978-3-319-19713-5_47
Kokkinos, C.M., Kargiotidis, A., Markos, A.: The relationship between learning and study strategies and big five personality traits among junior university student teachers. Learn. Individ. Differ. 43, 39–47 (2015). ISSN 1041-6080, doi: 10.1016/j.lindif.2015.08.031
Freeman, S., Eddy, S.L., McDonough, M., Smith, M.K., Okoroafor, N., Jordt, H., Wenderoth, M.P.: Active learning increases student performance in science, engineering, and mathematics. Proc. Nat. Acad. Sci. 111(23), 8410–8415 (2014)
Cook, W.D., Tone, K., Zhu, J.: Data envelopment analysis: Prior to choosing a model. Omega 44, 1–4 (2014). ISSN 0305-0483, http://dx.doi.org/10.1016/j.omega.2013.09.004
Anderman, E.M., Gimbert, B., O’Connell, A.A., Riegel, L.: Approaches to academic growth assessment. Br. J. Educ. Psychol. 85(2), 138–153 (2015)
Calvo-Rolle, J.L., Machón-Gonzalez, I., López-Garcia, H.: Neuro-robust controller for non-linear systems. Dyna 86(3), 308–317 (2011). doi:10.6036/3949
Ghanghermeh, A., Roshan, G., Orosa, J.A., Calvo-Rolle, J.L., Costa, Á.M.: New climatic indicators for improving urban sprawl: a case study of Tehran City. Entropy 15(3), 999–1013 (2013)
Alaiz Moretón, H., Calvo Rolle, J.L., García, I., Alonso Alvarez, A.: Formalization and practical implementation of a conceptual model for PID controller tuning. Asian J. Control 13(6), 773–784 (2011)
Casteleiro-Roca, J.L., Calvo-Rolle, J.L., Meizoso-López, M.C., Piñón-Pazos, A.J., Rodríguez-Gómez, B.A.: Bio-inspired model of ground temperature behavior on the horizontal geothermal exchanger of an installation based on a heat pump. Neurocomputing 150, 90–98 (2015)
Casteleiro-Roca, J.L., Quintián, H., Calvo-Rolle, J.L., Corchado, E., Meizoso-López, M.C., Piñón-Pazos, A.: An intelligent fault detection system for a heat pump installation based on a geothermal heat exchanger. J. Appl. Logic (2015)
Osborn, J., CosJuez, F.J., Guzman, D., Butterley, T., Myers, R., Guesalaga, A.: Using artificial neural networks for open-loop tomography. Opt. Express 20(3), 2420–2434 (2012)
Guzmán, D., Cos Juez, F.J., Myers, R., Guesalaga, A., Sánchez-Lasheras, F.: Modeling a MEMS deformable mirror using non-parametric estimation techniques. Opt. Express 18(20), 21356–21369 (2010)
Cos Juez, F.J., Sánchez-Lasheras, F., García Nieto, P.J., Suárez Suárez, M.A.: A new data mining methodology applied to the modelling of the influence of diet and lifestyle on the value of bone mineral density in post-menopausal women. Int. J. Comput. Math. 86(10–11), 1878–1887 (2009)
García Nieto, P.J., Alonso Fernández, J.R., Sánchez Lasheras, F., Cos Juez, F.J., Díaz Muñiz, C.: A new improved study of cyanotoxins presence from experimental cyanobacteria concentrations in the Trasona reservoir (Northern Spain) using the MARS technique. Sci. Total Environ. 430, 88–92 (2012)
Crespo Turrado, C., Sánchez Lasheras, F., Calvo-Rolle, J.L., Piñón-Pazos, A.J., de Cos Juez, F.J.: A new missing data imputation algorithm applied to electrical data loggers. Sensors 15, 31069–31082 (2015). doi:10.3390/s151229842
Crespo Turrado, C., Meizoso-López, M.C., Sánchez Lasheras, F., Rodríguez-Gómez, B.A., Calvo-Rolle, J.L., de Cos Juez, F.J.: Missing data imputation of solar radiation data under different atmospheric conditions. Sensors 14 (2014). doi:10.3390/s141120382
Van Buuren, S., Groothuis-Oudshoorn, K.: Mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45, 3 (2011)
Tierney, L.: Introduction to general state-space Markov chain theory. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 59–71. Chapman & Hall, London (1996)
Van Buuren, S.: Flexible Imputation of Missing Data. Chapman & Hall/CRC, London, UK (2012)
Liu, Y., Brown, S.D.: Comparison of five iterative imputation methods for multivariate classification. Chemom. Intell. Lab. 120, 106–115 (2013)
Perez, R., Lorenz, E., Pelland, S., Beauharnois, M., van Knowe, G., Hemker, K., Heinemann, D., Remund, J., Müller, S.C., Traunmüller, W., et al.: Comparison of numerical weather prediction solar irradiance forecasts in the US, Canada and Europe. Sol. Energy 94, 305–326 (2013)
Gutierrez-Corea, F.V., Manso-Callejo, M.A., Moreno-Regidor, M.P., Velasco-Gómez, J.: Spatial estimation of sub-hour global horizontal irradiance based on official observations and remote sensors. Sensors 14, 6758–6787 (2014)
Tiengrod, P., Wongseree, W.: A comparison of spatial interpolation methods for surface temperature in Thailand. In: Proceedings of the International Computer Science and Engineering Conference (ICSEC), Nakorn Pathom, Thailand, 4–6 September 2013, pp. 174–178
Liu, Y., Brown, S.D.: Comparison of five iterative imputation methods for multivariate classification. Chemom. Intell. Lab. Syst. 120 (2013) doi:10.1016/j.chemolab.2012.11.010
García Nieto, P.J., Alonso Fernández, J.R., de Cos Juez, F.J., Sánchez Lasheras, F., Díaz Muñiz, C.: Hybrid modelling based on support vector regression with genetic algorithms in forecasting the cyanotoxins presence in the Trasona reservoir (Northern Spain). Environ. Res. 122 (2013) doi:10.1016/j.envres.2013.01.001
Quintian, H., Calvo-Rolle, J.L., Corchado, E.: A hybrid regression system based on local models for solar energy prediction. Informatica. 25 (2014) doi:10.15388/Informatica.2014.14
Vilar-Martinez, X.M., Montero-Sousa, J.A., Calvo-Rolle, J.L., Casteleiro-Roca, J.L.: Expert system development to assist on the verification of “TACAN” system performance. Dyna 89 (2014). doi:10.6036/5756
Acknowledgments
The authors appreciate the support from the Spanish Ministry of Economy and Competitiveness, through grant AYA2014-57648-P, and from the Government of the Principality of Asturias (Consejería de Economía y Empleo), through grant FC-15-GRUPIN14-017.
© 2016 Springer International Publishing Switzerland
Crespo-Turrado, C. et al. (2016). Student Performance Prediction Applying Missing Data Imputation in Electrical Engineering Studies Degree. In: Martínez-Álvarez, F., Troncoso, A., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2016. Lecture Notes in Computer Science(), vol 9648. Springer, Cham. https://doi.org/10.1007/978-3-319-32034-2_11