1 Introduction

In accordance with the quality assurance guidelines of the European Higher Education Area (EHEA), the tracking of studies is legally regulated and is obligatory for official university degrees [1]. Consequently, the internal quality systems of educational institutions, which aim at ongoing improvement, try to enhance their quality ratios or indicators in terms of academic results and performance [2]. As a result, faculties and higher education schools need tools that support or assist them in this task [3, 4].

Before a decision-making tool can be built, a way of obtaining the required knowledge is usually necessary. In past research, the common method has been to obtain a model from a historical dataset, either through traditional techniques or through more advanced ones [5–8].

This approach can be problematic in general terms, because it requires previous cases with a similar performance [9–13]. It should also be noted that the case under study may change; if so, the model must adapt to novel cases with different casuistry and performance [14–17]. In this sense, imputation methods based on evolutionary methods could be a good solution to the problem described.

This paper evaluates two imputation methods, which allow the system to fill in missing data in any of the students’ scores used in this research. One of the algorithms, the AAA (Adaptive Assignation Algorithm) [18], is based on multivariate adaptive regression splines (MARS); the other is MICE (Multivariate Imputation by Chained Equations) [19]. The first performs well in general terms when the percentage of missing data per case is small; otherwise, the second method is more appropriate. The right combination of both algorithms is a good solution, but it requires establishing the boundary at which each one is applied.

This paper is structured as follows. After the present section, the case of study is described; it consists of the students’ scores dataset of the Electrical Engineering Studies Degree of the University of A Coruña. Then, the techniques for missing data imputation are presented. The results section shows the outcomes of the imputation over the dataset for three different cases. After that, the conclusions and future work are presented.

2 Case of Study

The dataset used in this research comprises the students’ scores in the Electrical Engineering Studies Degree of the University of A Coruña from the academic year 2001/2002 to 2008/2009. The dataset includes the scores for each subject in the degree: nine subjects in the first year, another nine in the second year, seven in the last year, and the final project.

The data also include the scores and the route of access to university studies; in Spain there are two different routes, from secondary school or from vocational education and training. Moreover, the data for each subject in the degree include not only the mark but also the number of attempts needed to pass it.

The dataset under study is complete, with no missing values. This is important for testing the performance of the algorithms used in this study: it makes it possible to emulate several different percentages of missing values and to compare both methods in order to establish the boundary between their ranges of application. Then, by combining them, a hybrid model is obtained that increases the applicability of the method over a wide range of possibilities.
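As an illustration of how missing values can be emulated on a complete dataset, the following Python sketch removes a fixed number of values per row until a target fraction of all cells is missing. The function name `mask_values`, its parameters, and the masking rule itself are assumptions made for illustration; the exact procedure used in this research is not specified.

```python
import random


def mask_values(rows, missing_fraction, per_row, seed=0):
    """Return a copy of `rows` in which some cells are replaced by None.

    Assumed rule (hypothetical): `per_row` cells are removed in each affected
    row, and enough rows are affected for roughly `missing_fraction` of all
    cells to be missing.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    total_missing = round(missing_fraction * n_rows * n_cols)
    n_affected = min(n_rows, total_missing // per_row)
    masked = [list(row) for row in rows]          # leave the original intact
    for i in rng.sample(range(n_rows), n_affected):
        for j in rng.sample(range(n_cols), per_row):
            masked[i][j] = None
    return masked


# Toy complete dataset: 20 students, 10 subjects.
scores = [[float(i + j) for j in range(10)] for i in range(20)]
masked = mask_values(scores, missing_fraction=0.10, per_row=2)
```

Because the original values are kept, every imputed value can later be compared with the true one.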

3 The Used Data Imputation Techniques

In this section, the data imputation techniques employed in the present research are described.

3.1 The MICE Algorithm

The MICE algorithm developed by van Buuren and Groothuis-Oudshoorn [20] is a Markov chain Monte Carlo method in which the state space is the collection of all imputed values. Like any other Markov chain, in order to converge, the MICE algorithm needs to satisfy the following three properties [21–23]:

  • Irreducible: The chain must be able to reach all parts of the state space.

  • Aperiodic: The chain should not oscillate between different states.

  • Recurrent: A Markov chain is recurrent if the probability that the chain starting from state i will return to i is equal to one.

In practice, the convergence of the MICE algorithm is achieved after a relatively low number of iterations, usually somewhere between 5 and 20 [23]. According to the experience of the algorithm's creator, five iterations are generally enough, although some special circumstances require a greater number. In the present research, given the quality of the results obtained when compared with the other methods applied, five iterations were considered sufficient. This number is much lower than in other applications of Markov chain Monte Carlo methods, which often require thousands of iterations. Despite this, it should also be noted that, in the researchers' experience, each iteration of the MICE algorithm takes several minutes or even a few hours in most applications. Furthermore, the duration of each iteration depends mainly on the number of variables involved in the calculation and not on the number of cases. It must be taken into consideration that imputed data can contain a considerable amount of random noise, depending on the strength of the relations between the variables. Therefore, in those cases in which the correlations among variables are low or the variables are completely independent, the algorithm will converge faster. Finally, high rates of missing data (20 % or more) slow down the convergence process. The MICE algorithm [23] for the imputation of multivariate missing data consists of the following steps:

  1. Specify an imputation model \( P(Y_{j}^{mis} |Y_{j}^{obs} ,Y_{ - j} ,R) \) for variable \( Y_{j} \) with \( j = 1, \ldots ,p \).

     The MICE algorithm obtains the posterior distribution of the parameters \( \emptyset \) by sampling iteratively from conditional distributions of the above form. The parameters \( \emptyset_{1} , \ldots ,\emptyset_{p} \) are specific to the respective conditional densities and are not necessarily the product of a factorization of the true joint distribution.

  2. For each \( j \), fill in starting imputations \( Y_{j}^{0} \) by random draws from \( Y_{j}^{obs} \).

  3. Repeat for \( t = 1, \ldots ,T \) (iterations).

  4. Repeat for \( j = 1, \ldots ,p \) (variables).

  5. Define \( Y_{ - j}^{t} = (Y_{1}^{t} , \ldots ,Y_{j - 1}^{t} ,Y_{j + 1}^{t - 1} , \ldots ,Y_{p}^{t - 1} ) \) as the currently complete data except \( Y_{j} \).

  6. Draw \( \emptyset_{j}^{t} \sim P\left( {\emptyset_{j}^{t} |Y_{j}^{obs}, Y_{ - j}^{t}, R} \right) \).

  7. Draw imputations \( Y_{j}^{t} \sim P\left( {Y_{j}^{mis} |Y_{j}^{obs}, Y_{ - j}^{t},R,\emptyset_{j}^{t} } \right) \).

  8. End repeat \( j \).

  9. End repeat \( t \).

In the algorithm above, Y represents an n × p matrix of partially observed sample data, R is an n × p matrix of 0–1 response indicators of Y, and \( \emptyset \) represents the parameter space. Note that in MICE imputation [24], initial guesses are provided for all missing elements of the n × p matrix of partially observed data. For each variable with missing elements, the data are divided into two subsets, one of them containing all the missing data. The subset with all available data is regressed on all other variables. Then the missing subset is predicted from the regression, and the missing values are replaced with the predictions. This procedure is repeated for all variables with missing elements. Once all the missing elements have been imputed in this way, the regressions and predictions are repeated until the stop criterion is reached, in this case until a certain number of consecutive iterates fall within the specified tolerance for each of the imputed values.
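The regression-based procedure described above can be sketched in Python with NumPy. This is a simplified illustration, not the MICE implementation of [20]: it uses ordinary least-squares point estimates plus Gaussian residual noise instead of drawing the parameters from their posterior distribution, it runs a fixed number of iterations instead of checking a tolerance, and it assumes that every column has at least one observed value. The function name `chained_imputation` is hypothetical.

```python
import numpy as np


def chained_imputation(data, n_iter=5, seed=0):
    """Fill NaN entries column by column with linear regressions plus noise."""
    rng = np.random.default_rng(seed)
    filled = np.array(data, dtype=float)
    missing = np.isnan(filled)

    # Starting imputations: random draws from the observed values of a column.
    for j in range(filled.shape[1]):
        if missing[:, j].any():
            observed = filled[~missing[:, j], j]
            filled[missing[:, j], j] = rng.choice(observed, size=missing[:, j].sum())

    for _ in range(n_iter):                       # iterations t = 1..T
        for j in range(filled.shape[1]):          # variables j = 1..p
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            others = np.delete(filled, j, axis=1)  # currently complete data except Y_j
            design = np.column_stack([np.ones(len(others)), others])
            # Regress the observed part of column j on all other variables.
            coef, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
            sigma = (filled[obs, j] - design[obs] @ coef).std()
            # Replace the missing part with noisy predictions.
            pred = design[~obs] @ coef
            filled[~obs, j] = pred + rng.normal(0.0, sigma, size=pred.shape)
    return filled
```

Observed entries are never modified; only the positions that were NaN in the input are overwritten at each pass.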

3.2 The AAA Algorithm

In order to explain the AAA, assume that we have a dataset formed by \( n \) different variables \( v_{1} , v_{2} , \ldots , v_{n} \). To calculate the missing values of the i-th column, all the rows with no missing value in that column are employed. Then, a certain number of MARS models are calculated. It is possible to find rows with very different amounts of missing data, from 0 (no missing data) to \( n \) (all values are missing). Those columns with all values missing will be removed and will be neither used for the model calculation nor imputed. Therefore any amount of missing data from 0 to \( n - 2 \) is feasible (all variables but one with missing values).

In other words, if the dataset is formed by variables \( v_{1} , v_{2} , \ldots , v_{n} \) and we want to estimate the missing values in column \( v_{i} \), then the maximum number of different MARS models that would be computed for this variable (and in general for each column) is as follows: \( \sum\nolimits_{k = 1}^{n - 1} {\left( {\begin{array}{*{20}c} {n - 1} \\ k \\ \end{array} } \right)} \). For the case of the data under study in this research, with 10 different variables, a maximum of 5,110 distinct MARS models would be trained (511 for each variable).
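The count above can be checked with a few lines of Python. The sum over all non-empty subsets of the other \( n - 1 \) variables equals \( 2^{n - 1} - 1 \); the function name `max_mars_models_per_variable` is chosen here only for illustration.

```python
from math import comb


def max_mars_models_per_variable(n):
    """Number of non-empty subsets of the other n - 1 variables."""
    return sum(comb(n - 1, k) for k in range(1, n))


n = 10
per_variable = max_mars_models_per_variable(n)
print(per_variable, n * per_variable)  # 511 5110
```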

After all the available models have been calculated, the missing data of each row are estimated using those models that employ all the available non-missing variables of the row. As a general rule for the algorithm, when a certain value can be estimated using more than one MARS model, it is estimated using the MARS model with the largest number of input variables; if several models have that same number of input variables, one of them is chosen at random. In those exceptional cases in which no model is available for estimation, the missing value is replaced by the median of the column. Note that for large datasets with a not-too-high percentage of missing data, this will be an infrequent case.
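The selection rule can be sketched as follows. This is only an illustration under stated assumptions: each trained model is represented by the frozenset of its input-variable names, a model is considered usable when all of its inputs are available in the row, and the names `choose_model` and `impute_value` as well as the `predict` callback are hypothetical.

```python
import random
from statistics import median


def choose_model(models, available, rng=random):
    """Pick a usable model with the largest number of inputs, or None.

    `models` is a list of frozensets of input-variable names;
    `available` is the set of variables observed in the row.
    """
    usable = [m for m in models if m <= available]
    if not usable:
        return None
    largest = max(len(m) for m in usable)
    # Ties between equally large models are broken at random.
    return rng.choice([m for m in usable if len(m) == largest])


def impute_value(models, available, column_values, predict, rng=random):
    """Estimate one missing value, falling back to the column median."""
    model = choose_model(models, available, rng)
    if model is None:
        return median(column_values)
    return predict(model)
```

For example, with models on `{"v1"}`, `{"v1", "v2"}` and `{"v3"}` and a row in which only `v1` and `v2` are observed, the two-variable model is selected.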

3.3 Models Validation

Leave-one-out cross-validation has been used to analyze the spatial error of interpolated data [25, 26]. This procedure involves using eight of the nine stations in the model to obtain the estimated value at the ninth station (the one left out) in order to calculate the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) for this station. The process is repeated nine times, once for each station.

The performance of the three methods has been evaluated using two common statistics, RMSE and MAE, together with their percentage forms:

$$ RMSE = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {(\widehat{{G_{i} }} - G_{i} )^{2} } } $$
(1)
$$ MAE = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {\widehat{{G_{i} }} - G_{i} } \right|} $$
(2)
$$ RMSE({\%}) = \frac{RMSE}{{\frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} G_{i} }}\times100 $$
(3)
$$ MAE\left( \% \right) = \frac{MAE}{{\frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} G_{i} }} \times 100 $$
(4)

where \( G_{i} \) and \( \widehat{{G_{i} }} \) are the measured and the model-estimated values, respectively, and n is the number of data points in the validation set. The RMSE weights large estimation errors more strongly than small errors and is considered a very important model validation metric. MAE is also a useful complement to the measured-modeled scatter plot near the 1-to-1 line [24].
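As a small worked example, the four statistics can be computed in Python as shown below. The sketch uses the standard definition of RMSE (the square root of the mean squared error); the function names and the toy values are illustrative only and are not taken from the dataset.

```python
from math import sqrt


def rmse(estimated, measured):
    """Root mean square error between estimated and measured values."""
    n = len(measured)
    return sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured)) / n)


def mae(estimated, measured):
    """Mean absolute error between estimated and measured values."""
    n = len(measured)
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / n


def as_percentage(error, measured):
    """Express an error as a percentage of the mean measured value."""
    return 100 * error / (sum(measured) / len(measured))


measured = [5.0, 6.0, 7.0, 8.0]    # toy true scores
estimated = [5.0, 7.0, 7.0, 6.0]   # toy imputed scores
print(rmse(estimated, measured), mae(estimated, measured))
print(as_percentage(mae(estimated, measured), measured))
```

In this example the single error of 2 points raises the RMSE (about 1.118) well above the MAE (0.75), which shows how RMSE penalizes large errors more strongly.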

4 Results

To assess the performance of each algorithm, several tests were made with different quantities of missing data. First of all, it should be noted that, for the results shown in the tables, only ten columns of the total dataset have been taken into account. Each column represents a different subject, and the selection was made randomly. In all tests, the percentage of missing data is always the same, 10 %, but the number of missing values in each case was varied from 1 to more than three, depending on the test.

Table 1 shows the performance of each algorithm with only 1 missing value in each case. It can be seen that the AAA algorithm is clearly better than MICE.

Table 1. Results for the algorithms with 1 missing value

In Table 2, the performance was calculated for 2 missing values. In this case, as in the previous one, the AAA algorithm is clearly better than MICE, but the difference between the performance of the two algorithms is reduced.

Table 2. Results for the algorithms with 2 missing values

The results presented in Table 3 show that when the number of missing values increases to 3, the MICE algorithm performs better than the AAA.

Table 3. Results for the algorithms with 3 missing values

In order to obtain the best results, a hybrid of the two algorithms was implemented. The results of this hybrid system are shown in Table 4. In this table, the percentage of missing values is fixed at 10 %, but the number of missing values is random. When there are fewer than 3 missing values the AAA algorithm is selected, and MICE is chosen in the other cases.

Table 4. Results for the algorithms with a random number of missing values and for the hybrid combination
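The switching rule of the hybrid system can be expressed as a short Python sketch. The two imputation routines are passed in as callables, since their implementations are outside the scope of the example; the function name `hybrid_impute`, the use of `None` to mark a missing value, and the `threshold` parameter are assumptions made for illustration.

```python
def hybrid_impute(row, impute_aaa, impute_mice, threshold=3):
    """Route a row to AAA or MICE according to its number of missing values.

    `row` is a list in which None marks a missing score. Rows with fewer
    than `threshold` missing values go to AAA, the rest to MICE.
    """
    n_missing = sum(value is None for value in row)
    if n_missing == 0:
        return list(row)                  # nothing to impute
    method = impute_aaa if n_missing < threshold else impute_mice
    return method(row)
```

Keeping the threshold as a parameter makes it easy to re-evaluate the boundary between the two methods on other datasets.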

Figure 1 shows the evolution of the RMSE for the two algorithms and the hybrid combination. The hybrid algorithm is not the best one in every case, but its RMSE values remain constant regardless of the number of missing values. The solid blue line represents the MICE algorithm, the dotted red line the AAA algorithm, and the dashed black line the combined algorithm.

Fig. 1. Plot of the RMSE values for the algorithms

5 Conclusions

Very good results have been obtained in general terms with the data imputation techniques employed in this study.

It is possible to predict the students’ scores for the three cases considered by treating the data as missing and comparing the estimated results with the real dataset. The average RMSE for MICE was 0.50759, varying from 2.47e-3 to 1.54849; for AAA, the average RMSE was 0.29130, with a minimum of 3.11e-31 and a maximum of 1.29216. The hybrid combination of these two algorithms achieved an average RMSE of 4.92e-3, varying from 4.26e-5 to 9.72e-3.

These techniques could be used to predict missing data and then to carry out studies of students’ performance that take all the cases into account.

In future research, the authors will explore the use of support vector machines (SVM) [26, 27] and hybrid methods [28–30] in order to find a new algorithm with even higher performance.