Keywords

1 Introduction

1.1 Problem Statement

Student desertion is a phenomenon that refers to the abandonment of education. This can happen due to several reasons, such as: academic, social, economic and psychological situations. Dropping out affects not only students, but also the prestige of the education center [1]. This situation is common within Ecuador’s higher education, especially in the lower levels. Eight out of ten students that started public college in the year 2012 continued their studies in 2013, and seven out of ten continued in 2014 [2]. This phenomenon contributes to the waste of limited public resources designated for education, decreasing quality standards, and somehow, both rising unemployment and poverty in the country [3].

Early identification of university students who have a greater probability of abandoning their career would help improve the quality indicators in higher education. Applying data mining techniques has had a significant impact in the educational sector in recent years [3, 4]. Having a predictive system able to detect students with a high probability of dropping out and, even more, providing patterns of potential deserters is important to propose and apply preventive action plans to contribute to the improvement of student academic performance and mitigate the adverse effects previously mentioned.

The student welfare department of the “Técnica del Norte” University [5] is the entity in charge of, among other things, helping university students to continue their studies. However, the students most susceptible to dropping out have been detected and reported too late. For this reason, it is necessary to implement a system that allows opportune decision making with the purpose of mitigating this phenomenon, motivating and helping the students to continue with their college studies and life plans. This phenomenon negatively affects, mainly, undergraduates in the lower semesters (up to the fourth level).

A predictive model of pattern detection is proposed in this study, which detects college dropouts in the “Técnica del Norte” University (Ecuador), using the personal and academic data of undergraduate students (bachelor’s degree and engineering) from the last 5 years. The data were processed following the KDD process (Knowledge Discovery in Databases) and using Bayesian and decision trees data mining techniques [6]. Identifying patterns considering, as possible desertion factors, the ethnic group, disability level (physical, mental, intellectual or sensory), gender, and place of residence makes up the main contribution in this research. Additionally, this study is performed using free software (Weka), fulfilling the Executive Decree of Ecuador No. 1014 [7]. This regulation disposes to use open source software within computer systems and equipment in the public administration of the Ecuadorian government. This makes up the second contribution in this study.

1.2 Review of Literature

Some existing studies were relevant for this research because they were carried out in the educational field of higher education using, among others, predictive data mining techniques. These studies are listed below:

Lehr et al. [8] used data mining techniques to predict college student retention at Embry-Riddle Aeronautical University (USA). To accomplish this goal, the data of 972 students, enrolled in 2008, were used to create predictive models, taking into account previous preparation of the students, grades in the first year, and personal and financial data. The classification techniques used were: Logistic Regression, Naive Bayes, K-Nearest Neighbors, Random Forest, Multilayer perceptron, and Decision trees. Logistic regression obtained the best results according to the error rate.

Hernández et al. [3] developed a predictive system that detects students with high desertion probabilities and the profile of dropouts in the Information and Communication Technologies Engineering career, using logistic regression, clustering, decision trees and neural networks. The results obtained by using logistic regression were best fitted to the study.

Peralta et al. [9], using data mining techniques and statistic models, identified the most relevant variables which predicted desertion and graduation rates of students in the Temuco Catholic University (Mexico). The original database held 15183 students and 143 variables. They used linear statistic models, such as the Probit and Logit models, as well as the decision tree technique. The main result obtained was that there is a clear relation among variables that depend on factors such as average and ranking.

Merchan and Duarte [10] performed a predictive model of academic performance based on the academic and demographic data of 932 students in the Computer Science career in the “El Bosque” University (Colombia). J48, PART, and Ridor decision tree algorithms were used. The J48 algorithm proved to be the best, classifying correctly around 78% of new instances.

Zaffar et al. [11] carried out research about the performance of classifiers and feature selection algorithms on a set of student data. The evaluation of the algorithms’ effectiveness was performed through precision, recall, and f-measure. The best result was obtained using the Random Forest classifier.

Devasia et al. [1] proposed a web application to extract useful information about students dropping out in higher education. The experiment was carried out using the data of 700 students containing 19 attributes. The Naïve Bayes algorithm was the most precise compared to other methods such as decision trees, neural networks, and logistic regression.

Kotsiantis [4] worked with stored data belonging to the Hellenic Open University (Greece), a distance learning program. The main objective was to predict student grades, which helped the tutors better comprehend the population characteristics.

Mishra et al. [12] used clustering tasks on the data of 84 undergraduate students of psychology in the north of Spain, grouping them into three clusters according to the course final grades. The smallest cluster was the most understood while the two bigger clusters were more difficult to interpret.

Moscoso-Zea et al. [13] designed a data warehouse in an educational scenario, using the methodology proposed by Kimball and Ross [14]. This repository is intended to the knowledge discovery process and indicators analysis of academic performance in future researches.

Next, the research methodology used in this study is presented.

2 Materials and Methods

2.1 The KDD Process

The KDD (Knowledge Discovery in Databases) process was used to determine student desertion patterns and it consists of four phases [6]: (i) data collection; (ii) selection, cleansing and transformation of data; (iii) data mining; and (iv) evaluation and interpretation of the model. This research used personal and academic students data from 37 undergraduate careers of the UTN (25 bachelor’s degrees and 12 engineering) from last five years (2013–2017). Each phase of the KDD process used in this study is detailed below.

2.2 Data Collection

The purpose in this stage is to integrate all data into a single repository (data warehouse). The main source of the raw data was the academic database of the UTN, which is stored on an Oracle 11g server, containing 17.882 student records. The data are both qualitative and quantitative in nature.

A part of the database was selected, i.e. 9 tables containing personal and academic data. Thirteen variables were considered in this study: age, ethnicity, nationality, place of origin, place of residence, disability level, gender, marital status, family burdens, study modality, faculty, average grades, and current state. The current state variable (active/inactive; YES/NO) was the class attribute, that is, the qualitative variable used to predict the classification (student desertion). The variables age, average grades and disability level are originally of a quantitative nature, while the rest are qualitative attributes.

Some variables mentioned above are relevant for this research, considering that the UTN is in Zone-1 of Ecuador (made up of the provinces: Imbabura, Carchi, Esmeraldas, and Sucumbíos). This zone has 638.979 men and 634.353 women of the following ethnicities: 60.9% mestizos, 21.7% Afro-Ecuadorians and African descendants, 11.5% indigenous, and 4.3% Caucasian white [15]. Also, 36.878 people (2.9%) of this region have a disability [16]. This research is based on the pattern identification in student desertion considering such variables (ethnicity, disability, gender, and place of residence) as possible dropout factors.

A data warehouse was built from the raw data, using PostgreSQL version 9.4 [17]. To accomplish this step, the data were exported from Oracle to a MS Excel file in xlsx format, and then imported into PostgreSQL. This database management system was selected due to it is open source and also its easy connection to the Pentaho Community Suite [18], which is used in the next phase.

2.3 Selection, Cleansing, and Transformation of Data

This phase’s goal is to obtain clean data, without null or inconsistent values, allowing a highly reliable student desertion patterns. The data selection, cleansing, and transformation tasks were performed using the Pentaho Community Suite, obtaining a mineable view, that is, a dataset ready for applying data mining techniques [13].

This study used only data from the first four levels, considering that the student desertion phenomenon in the UTN frequently occurs at lower levels. The data transformation converted discrete quantitative variables (age, average grade, and disability level) into ordinal qualitative attributes using the categories indicated in Table 1. The age attribute was divided into three categories: low, medium and high. The low level was considered until the 25 years of age because the undergraduate students finish their career mostly in this period. The medium level up to 40 years because the postgraduate students (master’s degree) are common in this age range. The high level is considered more than 40 years because here the students are less frequent. The average grade variable was divided into three categories (low, medium and high) considering the academic regulation in the UTN. The disability level was divided into four categories (slight, moderate, severe and very severe) considering the disability regulation of Ecuador [19].

Table 1. Categories used in the transformation of quantitative data

After cleansing and transforming the data, 12620 instances were obtained, which conform the mineable view. Next, it was exported to a plain text file delimited by commas (*.csv) for the following data mining phase.

2.4 Data Mining

This phase aims to obtain predictive models representative of the data, using the WEKA software version 3.8 [20]. This software is open source and it contains a wide collection of machine learning algorithms to generate different models.

In this study, the prediction used classification techniques, which consists in finding a pattern in an uncategorized data group and classify it into a predefined set of classes [6]. The classifiers used were the following: (i) decision trees with the RandomTree algorithm, due to the simplicity of its model, easy interpretation, and speed to classify new data [6, 21, 22]; and (ii) Bayesian technique with the Naïve Bayes algorithm, because of its simplicity and high precision in several domains [1, 21].

2.5 Evaluation and Interpretation of the Predictive Model

In this stage, the predictive models’ performances were evaluated quantitatively. Subsequently, the best model was used to obtain student desertion patterns. The Cross-validation method (10-fold) was used for evaluating the models. For each classification algorithm, the following quality metrics were used [21, 22]: accuracy, error rate, Kappa coefficient, TP rate, FT rate, precision, recall, F1 score, and ROC area. The interpretation of the best model was carried out with the help of an expert using visualization techniques, which facilitate comprehension and discovery of desertion patterns.

Next, the results obtained in this study are presented.

3 Results and Discussion

3.1 Evaluation of Predictive Models

The overall evaluation of the classification was performed by analyzing a confusion matrix and computing several quality metrics [23], which are shown in Table 2.

Table 2. Overall evaluation metrics for the classifiers used in this study

The classifier with the greatest accuracy and the lowest error rate turned out to be the RandomTree algorithm. Regarding the Kappa coefficient, both classifiers keep a “Fair” strength of agreement, according to the classification proposed by Landis and Kock [24]. The coefficient Kappa indicates whether the results obtained in the confusion matrix are significantly better than those produced in a random classification [25].

Some specific metrics used in the evaluation of the classifiers are shown in Table 3. It is noted that similar values were obtained for both classifiers, except for the ROC area, where it was lower for the RandomTree. This discrepancy RandomTree having greater accuracy but a lower ROC area than Naïve Bayes, may be due to an imbalanced dataset [26, 27], i.e. a disparity in the frequencies of the observed classes. This is detailed later.

Table 3. Specific evaluation metrics for the classifiers used in this study (weighted average).

Table 4 shows the confusion matrices generated by both RandomTree and Naive Bayes algorithms, respectively.

Table 4. Confusion matrices for classification algorithms.

For the RandomTree classifier, 12318 instances were correctly classified and 3022 incorrectly. The results of the errors are the following:

  • This classifier attempted to classify 290 examples of the class status = YES. Of these, 41 instances were correctly classified and 249 incorrectly classified as class status = NO.

  • It tried to classify 12330 examples of the class status = NO. Of these, 12277 instances were correctly classified and 53 incorrectly classified as class status = YES.

For the Naive Bayes classifier, 12310 instances were correctly classified and 310 incorrectly. The results of the errors are the following:

  • This classifier attempted to classify 290 examples of the class status = YES. Of these, 46 were correctly classified and 244 incorrectly classified as class status = NO.

  • It tried to classify 12330 examples of the class status = NO. Of these, 12264 were correctly classified and 66 incorrectly classified as class status = YES.

Notably within the confusion matrices of both classifiers (Table 4), the cases for the number of negative examples NO (12330, 97.7%) is much higher than the positive examples YES (290, 2.3%), i.e. it is dealing with imbalanced dataset as mentioned before, than most supervised learning methods will skew the predicted probabilities, tending to predict the abundant class more often. In this case, the goodness of a classifier is best approximated using the F1 score metric, which can be interpreted as a weighted average of the precision and recall values [21, 28].

In this way, considering F1 score metric (Table 3), both classifiers keep similar performances. However, RandomTree provides a set of decision rules (representation equivalent to the decision tree) that facilitates the processing and interpretation of the predictive model [3]. Therefore, it was chosen in this study for pattern discovery in students’ desertion. The decision rules generated by this model can be seen online in Appendix 1 from https://bit.ly/2Mmy9a7. In contrast, the Naïve Bayes algorithm does not generate a model but rather it performs the classification just at the requested time.

3.2 Detection of Student Desertion Patterns

After selecting the best predictive model (tree and decision rules), the exploration and interpretation of the tree was completed from the root toward the leaves (class attribute) labeled with status = YES [22]. This led to the following student desertion patterns being identified:

  1. a.

    Older married women studying arts-related careers with a low average.

  2. b.

    Older married men residing in Imbabura and maintaining a low average.

  3. c.

    Younger married mestizo people who study in blended modality and live outside Imbabura.

  4. d.

    Single, indigenous people, regardless of their sex, do not reside in Imbabura and keep a medium or low average.

  5. e.

    Women in free union who study in the classroom-based modality with a medium or low average.

  6. f.

    People of Colombian nationality.

  7. g.

    Women with moderate disabilities and medicine-related studies.

  8. h.

    People in free union with studies related to health, art or agricultural sciences.

  9. i.

    Afro-Ecuadorian people studying engineering.

  10. j.

    Divorced men studying engineering.

The patterns of student desertion identified above are very useful for the UTN’s student welfare department and these will allow to propose action plans to mitigate such phenomenon.

3.3 Discussion

The desertion patterns identified in this study coincide with some results from prior researches. Merchan and Duarte [10] indicated that (i) students in free union are potential deserters, whereas in this work there is a higher desertion correlation with students studying health, art or agricultural sciences. Some commonalities with other works that were used in the discovery of student desertion patterns are the place of origin, average grade, study modality [1, 3], whereas some variables that are not consider in this study are the cumulate average and the ranking [9]. In addition, this study considers the data of students from 37 undergraduate careers of the UTN, unlike other works where a specific career is considered [3, 10].

Some limitations in this study are the following: (i) the problem of detection of student desertion patterns is addressed only from the academic point of view, i.e. based on average grades; and (ii) predictive classification techniques are solely used for this purpose.

Finally, some future works hold this research line are the following: (i) expand the study also considering socioeconomic and psychological data; (ii) generate other predictive models based on logistic regression and neural networks; (iii) include descriptive tasks of data mining as the clustering [3, 12], association, outlier detection; (iv) create and compare specific models for each faculty of the UTN.

4 Conclusions

Detection of student desertion patterns in the university academic field is addressed in this study by using predictive classification tasks based on the decision tree (RandomTree) and Bayesian (Naïve Bayes) approaches. The student data used are personal and academic types from 37 undergraduate careers of the “Técnica del Norte” University (Ecuador), and these are processed following the KDD process (Knowledge Discovery in Databases) and using free software (Weka). The classifiers evaluation is quantitatively analyzed considering quality metrics. The RandomTree algorithm provides, slightly, a better performance than Naive Bayes (Table 2). The interpretation of the best model allows to identify relevant patterns of student desertion (Sect. 3.2), which contributes to propose action plans to mitigate this phenomenon. As future work, it is recommended to consider socio-economic and psychological data and include descriptive tasks of data mining.