1 Introduction

The impact of data mining in industry is increasingly evident. Properly managing organizational data, and learning to identify valuable information and turn it into a commercial asset, is one of the pillars of an organization’s success [6]. The educational sector is no exception to this trend: data mining for educational environments has gained strength both as a research topic and as a source of innovation. It is also attractive as an administrative investment, because big data and analytics remain a top priority for CIOs owing to the return on investment (ROI) of this kind of project [9, 22].

Educational data mining (EDM) [7] describes the process of converting raw information from an educational platform or system into a knowledge asset for educational entities. With EDM, institutions seek to generate value from the information they can gather from the interaction of their students or teachers with their systems. This paper addresses the concept of EDM in e-learning platforms, and how data from the interaction records of students or tutors on these platforms can be used to define tools or policies that directly impact student retention and permanence. In traditional learning environments, teachers obtain feedback on learning through direct interaction with students, which enables continuous evaluation [21]. Interacting with students and observing their behavior in the classroom, together with analyzing the history of the courses, provides the data needed to choose an appropriate pedagogical strategy. In virtual learning environments, however, this monitoring is more complicated. Tutors must look for sources other than direct observation to audit the learning process of students in the virtual classroom. One such source is the Web platforms used by educational institutions, which automatically collect large amounts of information from interactions with their educational systems. After some EDM, the data these tools collect can reveal multiple dimensions of student behavior to teachers and the institution.

This paper presents a data analytics exercise with the records (logs) from a virtual learning platform for higher education programs offered in virtual mode, as an input for defining retention and permanence policies. The article is structured as follows: it starts with a description of the methodology implemented, followed by the exploration and use of data for recognizing records from educational platforms, and finally their contextualization and use in defining student retention and permanence policies.

2 Previous Work

Virtual learning environments (VLEs) or course management systems (CMSs) are part of modern pedagogical approaches. These platforms store information about how users interact with them. According to [8, 14, 23], this information has the potential to improve pedagogical approaches. Building on this premise, multiple interrelated approaches seek to exploit these records, or logs. Table 1 presents different approaches to the analysis of logs in virtual learning environments.

For this particular work, there is special interest in analyzing the behavior of students and tutors in VLEs. Drawing on the works related to this topic, Table 2 presents multiple approaches or algorithms centered on different dimensions of student behavior in virtual platforms.

Table 1. Approaches for the analysis of logs in virtual learning environments
Table 2. Mining techniques for the analysis of logs for behavior analysis.

3 Methodology

In this work we seek to develop an approach to identify the behaviors of students in virtual mode within an academic program, using the records (logs) they leave in the virtual learning platforms. The general approach is summarized in Fig. 1, where:

Fig. 1. Academic behavior analysis in academic programs

  • Extraction: A representative sample of the students enrolled in the different courses of an academic program is taken from the VLE databases. Once these students have finished their courses, their logs of interaction with the platforms and their grades are extracted.

  • Data Characterization: Once the logs have been extracted from the sample of students, these records are transformed by formulating metrics that can be readily associated with the students’ academic behavior.

  • Clustering: Once a database of coded information from the sample drawn from an academic program has been consolidated, an unsupervised clustering algorithm is used to identify the latent behavior patterns in the data sample.

  • Analysis and interpretation: The information related to the academic performance of the extracted sample is used to label the different groups identified with the clustering algorithms.

3.1 Data Characterization

In a virtual learning platform (course management system, CMS), a log is a sequential file of timestamped records for all the events in an academic course that result from the students’ interactions with the CMS. For finished courses, we can obtain a set of records of student and tutor behavior under a specific configuration of the CMS.

For this work, the CMS configuration is defined as follows: the set of activities \(A=\{a_1, a_2, \ldots ,a_n \}\) to be performed during a specific course, where n is the number of activities; the set of forums \(F=\{f_1, f_2, \ldots , f_n\}\) defined for each activity; the agenda (time intervals) \(T=\{[t_1,t_2 ],[t_3,t_4 ],\ldots ,[t_{n-k},t_n]\} \) defined for each activity; the evaluative weight \(P=\{p_1,p_2,\ldots ,p_n\}\) of each activity, where \(\sum _{i=1}^{n}p_i = 500\); the course materials \(M=\{m_1,m_2,\ldots ,m_w\} \), where w is the number of folders (folders with books, articles, videos, etc.); the students enrolled in the course \(E=\{e_1,e_2,\ldots ,e_m \}\), where m is the number of students; and the academic grade \(N=\{n_1,n_2,\ldots ,n_m\} \) of each student. Given this configuration, a course is defined as the function \(C(A, F, T, P, M, E)\rightarrow N\).
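
To make the configuration concrete, the following is a minimal Python sketch of a container for \(A, F, T, P, M, E\), with a check that the weights add up to 500. The class name `CourseConfig`, its field names and the helper `total_grade` are hypothetical and not part of the original work; the grade formula (weight times the fraction of each activity achieved) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

TOTAL_WEIGHT = 500  # the text states that the weights p_i sum to 500


@dataclass
class CourseConfig:
    """Hypothetical container for the CMS configuration (A, F, T, P, M, E)."""
    activities: List[str]               # A = {a_1, ..., a_n}
    forums: List[str]                   # F, one forum per activity
    agenda: List[Tuple[float, float]]   # T, one (start, end) interval per activity
    weights: List[float]                # P, evaluative weight of each activity
    materials: List[str]                # M = {m_1, ..., m_w}
    students: List[str]                 # E = {e_1, ..., e_m}

    def validate(self) -> None:
        """Raise ValueError if the configuration is inconsistent."""
        n = len(self.activities)
        if not (len(self.forums) == len(self.agenda) == len(self.weights) == n):
            raise ValueError("F, T and P need one entry per activity")
        if any(start > end for start, end in self.agenda):
            raise ValueError("each agenda interval needs start <= end")
        if abs(sum(self.weights) - TOTAL_WEIGHT) > 1e-9:
            raise ValueError("weights must add up to 500")


def total_grade(config: CourseConfig, fractions: List[float]) -> float:
    """Assumed grade n_i: sum of weight * fraction achieved per activity."""
    return sum(p * f for p, f in zip(config.weights, fractions))
```

Under these assumptions, a student who fully completes a 200-point activity and half of a 300-point activity would obtain a grade of 350 out of 500.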

The log list \(L=\{l_1, l_2, \ldots , l_m\} \in C\) holds one entry per student, where \(l_i = \{ f_i, u_i, ua_i, ec_i,c_i, en_i, o_i, ip_i \} \in e_i\) contains the temporal records of student \(e_i\); see Table 3.

Table 3. Log structure

The standard log representation is not enough to contextualize this information in academic terms [3, 13]. To transform the logs into academically relevant information, we carry out a characterization process that measures a set of variables allowing an academic interpretation of the behavior of students and tutors in the CMS. We propose a set of variables measured from the logs \(l_i \in e_i\) of each student; see Table 4.

Table 4. Logs characterization for each student

As a first approximation, \(\forall e_i \in E\) we can characterize each student as \(g(l_i) \rightarrow x_i \mid x_i=\{Tp_i,Tt_i,Ef_i,Er_i,Dp_i,De_i,Vm_i,Nl_i,Ls_i,Lf_i,Ld_i,Ln_i,Pa_i,Fa_i\}\). However, this characterization needs a special encoding to account for the particularities of multiple courses.
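
As an illustration of what a characterization function \(g(l_i)\) might look like, the sketch below derives a few behavioral metrics from a simplified log. It is not the authors’ implementation: the event layout (a timestamp plus a component name), the `"forum"` label, the 06:00 to 18:00 definition of daytime and the output keys are all assumptions, and the metrics only loosely mirror variables such as the number of logs or the day/night and weekday/weekend activity.

```python
import math
from datetime import datetime
from typing import Dict, List, Tuple

# Assumed, simplified log row: (timestamp, component that was accessed).
Event = Tuple[datetime, str]


def characterize(events: List[Event], day_start: int = 6,
                 day_end: int = 18) -> Dict[str, float]:
    """Map one student's events to a small vector of behavioral metrics."""
    total = len(events)
    if total == 0:
        # No activity: the shares are undefined, so they are reported as NaN.
        return {"n_logs": 0.0, "weekday_share": math.nan,
                "day_share": math.nan, "forum_events": 0.0}
    # datetime.weekday() returns 0 for Monday and 6 for Sunday.
    weekday = sum(1 for ts, _ in events if ts.weekday() < 5)
    daytime = sum(1 for ts, _ in events if day_start <= ts.hour < day_end)
    forum = sum(1 for _, component in events if component == "forum")
    return {"n_logs": float(total),
            "weekday_share": weekday / total,
            "day_share": daytime / total,
            "forum_events": float(forum)}
```

Returning NaN for students without activity keeps such cases visible, so that they can be filtered out explicitly in a later preprocessing step.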

3.2 Clustering

Let \(\varSigma \) be the database of encoded information for an academic program, and define the function \( F (\varSigma ) \rightarrow W \), where \( W = [c_1, c_2, ..., c_h] \) are the different types of behavior that students adopt in the CMS. Since the types of behavior are not known a priori, the problem is to segment the behavior observed inside the virtual courses. The function proposed is therefore an unsupervised clustering algorithm [26]; specifically, this work uses the self-organizing map (SOM) networks proposed by Kohonen [11], followed by hierarchical clustering.

SOM networks are an algorithm based on unsupervised neural networks. Their main feature is the ability to project nonlinear, high-dimensional data onto a regular low-dimensional grid (usually 2-D). The algorithm seeks to map points that are near each other in the input space to nearby map units in the SOM.

In our case study, each information element is a vector \( X_{i} \in \varSigma \), each of whose components is a variable with a defined meaning. The grid generated by the SOM network serves as a basis onto which vectors with similar characteristics can be projected using color-based coding; the resulting groupings can then be used to explain the types of behavior that may be found in virtual courses.

For our study, this topological mapping consists of projecting a set of k-dimensional vectors X onto a two-dimensional (2-D) discrete mesh of M positions; see Fig. 2. Each position in the output is characterized by a node \( h_j (j = 1,2, ..., M) \). Each node \( h_j \) is associated with a position \(\mathbf {w} \) in the output space, obtained through an optimization process that reduces the distance between all the inputs and the output in the new space of M positions.
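
The mapping described above can be sketched with a small, dependency-free SOM. This is a generic online Kohonen-style update written for illustration, not the configuration used in this study: the initialization from random samples, the linear decay of the learning rate and neighborhood radius, and the names `train_som` and `best_matching_unit` are all assumptions.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights: List[Vector], x: Vector) -> int:
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j], x))


def train_som(data: List[Vector], rows: int = 5, cols: int = 5,
              epochs: int = 50, lr0: float = 0.5, seed: int = 0
              ) -> Tuple[List[Tuple[int, int]], List[Vector]]:
    """Train a rows x cols SOM; returns grid positions and node weights."""
    rng = random.Random(seed)
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: each node starts as a copy of a randomly chosen sample.
    weights = [list(rng.choice(data)) for _ in positions]
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # radius shrinks to a floor
            br, bc = positions[best_matching_unit(weights, x)]
            for (r, c), w in zip(positions, weights):
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])   # pull node toward the sample
            step += 1
    return positions, weights
```

Each sample pulls its best matching unit, and to a lesser degree that unit’s grid neighbors, toward itself; this is what makes nearby inputs end up on nearby nodes. For real data, a maintained SOM library would normally be preferable to this sketch.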

The starting point is the database \(\varSigma _{14+ (w-1), E} \) (input space), where \( 14 + (w-1) \) is the number of characteristics, w is the number of classifications within an academic program, and E is the number of students used to create the database. As Fig. 3 shows, although the student condition is the same for any sub-classification within an academic program, it is assumed that student behavior may vary depending on the type of subject being studied.

Fig. 2. The SOM mapping passes from a high-dimensional space into a 2-D space.

Fig. 3. Input space configuration.

In terms of the clustering algorithm, w primary clusters are introduced, on each of which an additional sub-clustering is sought; see Fig. 4. This clustering is performed with hierarchical clustering, an algorithm that groups similar objects into groups called clusters. The end result is a set of clusters in which each cluster is distinct from the others and the objects within each cluster are broadly similar.
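
A minimal sketch of agglomerative (bottom-up) hierarchical clustering is given below, as it might be applied to the weight vectors of the SOM nodes. The average-linkage criterion and the function name `agglomerate` are assumptions; the text does not state which linkage was used, and this sketch does not enforce any adjacency constraint on the SOM grid.

```python
import math
from typing import List

Vector = List[float]


def agglomerate(vectors: List[Vector], n_clusters: int) -> List[int]:
    """Average-linkage agglomerative clustering; returns one label per vector."""
    if not 1 <= n_clusters <= len(vectors):
        raise ValueError("n_clusters must be between 1 and len(vectors)")
    # Start with every vector in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(a: List[int], b: List[int]) -> float:
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = avg_dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]   # merge q into p
        del clusters[q]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

With a trained map, a call such as `agglomerate(weights, 3)` would assign each node to one of three groups, and each student would inherit the group of their best matching node.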

Fig. 4. Clusterization scheme.

Fig. 5. SOM network \(5\times 5\) with cluster using hierarchical clustering of three. (Color figure online)

Fig. 6. Correlation of the input data x.

4 Results

The logs and grades from one semester of five courses in a virtual postgraduate program offered by the university were used. The 14 variables already described (Table 4 and N) were obtained, and students with NaN values in any variable were removed. After this preprocessing, the results reported here are based on a total of 175 students.
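
The NaN filtering step can be sketched as follows. The dictionary layout (student identifier mapped to a list of variable values) and the function name `drop_incomplete` are assumptions made for illustration.

```python
import math
from typing import Dict, List


def drop_incomplete(rows: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Keep only the students whose variables contain no NaN value."""
    return {student: values for student, values in rows.items()
            if not any(math.isnan(v) for v in values)}
```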

Visual analytics is often a better way to understand the results of data analytics than descriptive techniques, since it makes relations within the information easier to identify. The first exploration, presented in Fig. 5, shows the distribution of each variable along the diagonal of the plot matrix. The values of all variables are spread over the whole data range, although some variables can be approximated by known probability distributions. The relations between all the variables of vector X are also presented, and some correlations between variables can be seen, as shown in Fig. 6.
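
The pairwise correlations between variables can be computed with the Pearson coefficient, as in the sketch below. The text does not state which correlation measure was used, so Pearson is an assumption, as are the function names and the column-oriented input layout.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return math.nan
    return cov / (norm_x * norm_y)


def correlation_matrix(columns: Dict[str, List[float]]
                       ) -> Dict[str, Dict[str, float]]:
    """Correlation of every variable with every other variable."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}
```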

Fig. 7. SOM network \(5\times 5\) with cluster using hierarchical clustering of three.

The next step was to obtain a SOM network of dimension five by five. The nodes of this SOM network were then grouped using hierarchical clustering, which combines nodes that are similar and adjacent in the SOM grid. The results of the grouping into three groups are presented in Fig. 7. To analyze the behavior of each variable, Fig. 9 presents heat maps of the inputs, which make it possible to analyze the relationship between the clusters and the ranges of the data. Finally, Fig. 8 presents the histogram of the variable N for each cluster. The blue and green clusters are made up of students with low and medium academic performance, respectively, and are characterized by little participation in the forums (Ef), a low frequency of access (Fa) and little access to the course material (Vm). The opposite holds for the orange cluster, which is made up mostly of students with the best grades.
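
One simple way to label clusters by academic performance is to summarize the grade N within each cluster and rank the clusters by their mean. The sketch below is an illustration only: the function names are hypothetical, and using 250 (half of the 500 available points) as the low-grade threshold is an assumption.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def summarize_clusters(labels: List[Hashable], grades: List[float],
                       low_threshold: float = 250.0
                       ) -> Dict[Hashable, Dict[str, float]]:
    """Size, mean grade and share of low grades for each cluster."""
    groups = defaultdict(list)
    for label, grade in zip(labels, grades):
        groups[label].append(grade)
    summary = {}
    for label, values in groups.items():
        summary[label] = {
            "n": float(len(values)),
            "mean": sum(values) / len(values),
            "share_low": sum(v < low_threshold for v in values) / len(values),
        }
    return summary


def rank_clusters(summary: Dict[Hashable, Dict[str, float]]) -> List[Hashable]:
    """Cluster labels ordered from lowest to highest mean grade."""
    return sorted(summary, key=lambda label: summary[label]["mean"])
```

With three clusters, the first, second and third labels returned by `rank_clusters` could then be read as the low, medium and high performance groups.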

Fig. 8. Histogram of variable Notes (N) for the three clusters.

Fig. 9. Heat Maps of each variable over the SOM network architecture.

This procedure was repeated for different numbers of groups and different input variables, which yielded the following results:

  • The clusters differ in the time of participation in the forums (Dp), which is shorter in some groups; likewise, the number of contributions differs markedly between groups. The number of logs generated during the week and the frequency of access are also higher in certain groups.

  • Students with low grades, whose total is less than 250, are mainly those who participate little in both the activities and the deliveries; they represent \(10\%\) of the sample.

  • The time it takes a student to make the deliveries (Of), as well as the participation in the forums (Dp), does not have an impact on the grades.

  • Few contributions in the forums (Ef) and a low frequency of access (Fa) to the platform are associated with lower grades; that is, Ef and Fa are positively correlated with the grades.

  • The variables Ln, Ld, Ls, Lf, Vm, \(Tt_{min}\), \(Tp_{min}\), Nl, and Pa are only weakly correlated with the grades.

  • Students participate more in the forums during the daytime than at night, and more during the week than on weekends.

  • About \(55\%\) of student participation takes place during the day versus \(45\%\) at night; the contrast is sharper across the week, with \(76\%\) of the participation in the platform on weekdays compared to \(24\%\) on the weekend.

5 Discussion

Log analysis in educational platforms is not a new topic, nor is its use to characterize students’ behavior in this kind of technological tool. The research papers listed in Tables 1 and 2 take approaches similar to this one. However, the principal contribution of this study is the application of this kind of analysis to a particular population with its own socio-economic, cultural, geographical and political environment, so that future work can use this information to define policies and design new interfaces for technological tools, among other uses.

This work is framed within the analysis of the behavior of Colombian virtual students, principally to understand how to prevent desertion in virtual higher education programs. In this phase of the project, the characterization approach and the mining tool were defined, so that a future phase can use this information to understand the Colombian virtual student and to improve the technological tools.

6 Conclusion

Teaching or coaching in virtual environments implies new challenges in pedagogy. Virtual learning environments provide useful tools for the interaction between students, teachers, learning materials and the educational institutions. However, they also create barriers between the participants in the learning process, such as the difficulty of understanding student or teacher behavior beyond their actions in the virtual platforms.

Mathematical approaches that allow continuous monitoring of student behavior through log analysis in VLEs can help to define educational policies aimed at preventive action to reduce desertion. This also suggests the need for study designs that involve a greater number of variables related to the behavior of teachers, in order to determine how that behavior affects the performance of their students.

In future work, we will extend the analysis to a representative sample of a complete higher education institution, and include socioeconomic and geographic information in order to understand the particularities of virtual students. With this information we want to develop a support tool that helps policy makers plan actions to reduce desertion and increase the quality of service.