1 Introduction

Educational data mining (EDM) is an emerging discipline with a suite of computational and psychological methods and research approaches for understanding how students learn and the settings in which they learn [1].

Data of interest is not restricted to interactions of individual students with an educational system (e.g., navigation behavior, input to quizzes and interactive exercises) but might also include data from collaborating students (e.g., text chat), administrative data (e.g., school, school district, teacher), demographic data (e.g., gender, age, school grades), and data on student affect (e.g., motivation, emotional states) [2].

EDM can be applied to assess students’ learning performance, to improve the learning process and guide students’ learning, to provide feedback and adapt learning recommendations based on students’ learning behaviors, to evaluate learning materials and courseware, to detect abnormal learning behaviors and problems, and to achieve a deeper understanding of educational phenomena [3].

For example, Ayesha et al. [4] described the use of the k-means clustering algorithm to predict students’ learning activities. Pal [5] used machine learning to identify students who are likely to drop out in their first year of engineering. Parack et al. [6] used multiple data mining algorithms for student profiling and grouping based on their academic records, such as exam scores, term work grades, attendance and practical exams.

As the number of EDM studies in the literature has grown considerably over the last few years, this chapter provides a bibliographic review of these studies. Our goal is to discuss the data mining methods and tools used in computer-based learning environments (CBLE) to analyze learners’ behaviors and performance. We aim to make data mining techniques easier to use and understand, to help educational specialists give their feedback, and to identify promising research areas to be explored in the future.

The remainder of the chapter is organized as follows: Sect. 2 gives a detailed view of the EDM field: definition, related areas, goals, methods, analyzed data, process and tools. Section 3 presents some examples of the two principal EDM applications: analyzing learners’ behaviors and predicting learners’ performance. We compare and discuss these examples according to their goals, the data analyzed and the methods used. We end the chapter with a conclusion in Sect. 4.

2 Educational Data Mining

2.1 Definition

Different definitions have been provided for the term ‘Educational Data Mining’ or EDM. Educational data mining is defined by the Journal of Educational Data Mining and Baker [1] as “an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in”.

This definition does not mention data mining; it leaves the field open to exploring and developing other analytical methods that can be applied to education-related data [7].

However, the authors of [8] specify that: “EDM is both a learning science, as well as a rich application area for data mining, due to the growing availability of educational data. It enables data-driven decision making for improving the current educational practice and learning material”.

In the same way, Romero and Ventura [9, 10] define EDM as “the application of data mining (DM) techniques to specific type of dataset that come from educational environments to address important educational questions”.

Although different in some details, these definitions share an emphasis on discovering knowledge from educational data to improve educational systems. Note also that EDM is often confused with ‘learning analytics’, defined on the LAK (Learning Analytics and Knowledge) website as “the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs” [11].

Although there is no hard and fast distinction between these two fields, they have had somewhat different research histories and are developing as distinct research areas [12]. The objective of this chapter is not to draw up a comparative study between these two concepts (comparisons and details can be found in [10–12]).

However, it is worth mentioning that learning analytics is the field most closely related to EDM, as they share many goals and it is often difficult to decide whether an application belongs to one area or the other. The next subsection presents the fields related to EDM.

2.2 Areas in Relation to EDM

EDM can be viewed as the combination of three main areas (Fig. 1.1): computer science, education, and statistics. The intersection of these three areas also forms other subareas closely related to EDM, such as learning analytics (LA), CBLE, DM and machine learning [10].

Fig. 1.1 Areas in relation with EDM [10]

As an interdisciplinary area, EDM uses methods and applies techniques from statistics, machine learning, data mining, information retrieval, recommender systems, psycho-pedagogy, cognitive psychology, psychometrics, etc. The choice of method or technique depends on the educational issue being addressed.

2.3 Objectives of the EDM

In the last several years, EDM has been applied to address a wide range of goals that are all part of the general objective of improving learning [10]. Several studies [1, 8, 10, 12, 13] list these objectives.

Romero and Ventura [10] proposed classifying EDM objectives according to the viewpoint of the final user (learner, educator, administrator, and researcher) and the problem to be solved:

  • Learners. To support a learner’s reflections on the situation, to provide adaptive feedback or recommendations to learners, to respond to students’ needs, to improve learning performance, etc.

  • Educators. To understand their students’ learning processes and reflect on their own teaching methods, to improve teaching performance, to understand social, cognitive and behavioral aspects, etc.

  • Researchers. To develop and compare data mining techniques to be able to recommend the most useful one for each specific educational task or problem, to evaluate learning effectiveness when using different settings and methods, etc.

  • Administrators. To evaluate the best way to organize institutional resources (human and material) and their educational offer.

This viewpoint clearly shows the benefit of EDM applications to the end user, but it is difficult to classify all EDM application goals according to these four actors, especially when an objective is related to more than one actor. That is why, based on the work in [1, 12–14], which focused on the research goals of EDM applications, we distinguish between the following general EDM goals:

  • Student modeling. User modeling in the educational domain incorporates detailed information about students’ characteristics or states, such as knowledge, skills, motivation, satisfaction, meta-cognition, attitudes, experiences, learning progress, affect, learning styles, and preferences, as well as certain types of problems that negatively impact their learning outcomes (making too many errors, misusing or under-using help, gaming the system, inefficiently exploring learning resources, etc.). The common objective here is to create or improve a student model from usage information.

  • Predicting students’ performance and learning outcomes. The objective is to predict a student’s final grades or other types of learning outcomes (such as retention in a degree program or future ability to learn), based on data from course activities. Examples of predicting students’ performance can be found in Sect. 3.

  • Generating recommendations. The objective is to recommend to students which content (or tasks or links) is the most appropriate for them at the current time [15].

  • Analyzing learners’ behavior. This takes several forms, including applying educational data mining to answer questions in any of the three areas previously discussed (student modeling, prediction, and generating recommendations). It is also used to group students according to their profiles, and for adaptation and personalization purposes.

  • Communicating to stakeholders. The objective is to help course administrators and educators in analyzing students’ activities and usage information in courses. Macfadyen and Dawson [16] conducted a study confirming that pedagogically meaningful information extracted from e-learning systems can be used to develop a customizable dashboard-like reporting tool for educators that extracts and visualizes real-time data on student engagement and likelihood of success. Romero et al. [17] provided feedback to support decision making for improving student learning and taking the appropriate proactive action. Other examples and case studies for this category of applications can be found in [14].

  • Domain structure analysis. The objective is to determine domain structure and improve domain models that characterize the content to be learned and optimal instructional sequences, using the ability to predict the student’s performance as a quality measure of a domain structure model. Performance on tests or within a learning environment is utilized for this goal.

  • Maintaining and improving courses. It is related to the two previous goals. The objective here is to determine how to improve courses (contents, activities, links, etc.), using information (in particular) about student usage and learning.

  • Studying the effects of different kinds of pedagogical support that can be provided by learning software. For example, Anaya and Boticario [18] proposed a method to analyze collaboration using machine learning techniques.

  • Advancing scientific knowledge about learning and learners through building, discovering or improving models of the student, the domain, and the pedagogical support. For example, Siemens and Baker [19] developed and tested a scientific theory about improving learning technology, and formulated a new scientific hypothesis.

We note that these EDM objectives aim to improve several aspects of educational systems in general and CBLE in particular. In this specific context, learner modeling is key to accomplishing several goals and tasks (tutoring, adaptation, personalization, etc.). Indeed, the different objectives depend heavily on the first objective, “Student modeling”, which is often supplemented by behavior analysis and therefore enables predicting performance, generating recommendations, and providing administrators and educators with adequate information to maintain and improve the content and learning environments.

Thus, if EDM facilitates this modeling and thereby helps achieve the objectives mentioned above, several tasks become easier in CBLE. To accomplish these goals, educational data mining researchers use the categories of technical methods described below.

2.4 The Used Methods

To achieve the EDM objectives, the majority of traditional data mining techniques including but not limited to classification, clustering, and association analysis techniques have been applied successfully in the educational domain. Nevertheless, educational systems have special characteristics that require a different treatment of the mining problem [14]. That is why researchers involved in EDM apply not only data mining techniques, but also propose, develop and apply methods and techniques drawn from the variety of areas related to EDM (statistics, machine learning, text mining, web log analysis, psychometrics, etc.).

The most popular classification of these methods is the one proposed in Baker [1]: prediction, clustering, relationship mining, distillation for human judgment and discovery with models. Bienkowski et al. [12] and then Romero and Ventura [10] extended this taxonomy. Based on these studies and those in [20, 21], we group these techniques into the following methods:

  • Prediction. The goal is to develop a model that can infer a single aspect of the data (the predicted variable) from some combination of other aspects of the data (the predictor variables). Types of prediction methods are classification (when the predicted variable is categorical), regression (when the predicted variable is continuous), and density estimation (when the predicted value is a probability density function). Examples of EDM applications are predicting students’ academic success [4] and behaviors [6].

  • Clustering. Refers to finding instances that naturally group together and can be used to split a full dataset into categories. Typically, some kind of distance measure is used to decide how similar instances are. Once a set of clusters has been determined, new instances can be classified by determining the closest cluster. In EDM, clustering can be used for grouping students based on their learning patterns or cognitive strategies [22].

  • Relationship mining. Used for discovering relationships between variables in a dataset and encoding them as rules for later use. There are different types of relationship mining techniques, such as association rule mining (any relationships between variables), sequential pattern mining (temporal associations between variables), correlation mining (linear correlations between variables), and causal data mining (causal relationships between variables). In EDM, relationship mining is used to identify relationships between the students’ on-line activities and the final marks [23] and to model learners’ problem solving activity sequences [24].

  • Distillation of data for human judgment. It is a technique that involves depicting data in a way that enables a human to quickly identify or classify features of the data. This approach uses summarization, visualization and interactive interfaces to highlight useful information and support decision-making. On the one hand, it is relatively easy to compute descriptive statistics from educational data to obtain global data characteristics, summaries and reports on learners’ behavior. On the other hand, information visualization and graphic techniques help to see, explore, and understand large amounts of educational data at once. In [25], the visualization of sequences of students’ activity helps to understand the patterns of learning environment use.

  • Discovery with models. Its goal is to use a validated model of a phenomenon (using prediction, clustering, or knowledge engineering) as a component in further analysis such as prediction or relationship mining. It is used for example to identify the relationships between the student’s behavior and characteristics [26].

  • Outlier Detection. The goal of outlier detection is to discover data points that are significantly different from the rest of the data. An outlier is an observation (or measurement) that is usually much larger or smaller than the other values in the data. In EDM, outlier detection can be used to detect deviations in the learner’s or educator’s actions or behaviors, irregular learning processes, and for detecting students with learning difficulties [27].

  • Social Network Analysis. Social network analysis (SNA), or structural analysis, aims at studying relationships between individuals, instead of individual attributes or properties. SNA views social relationships in terms of network theory consisting of nodes (representing individual actors within the network) and connections or links (which represent relationships between the individuals, such as friendship, cooperative relations, etc.). In EDM, SNA can be used to interpret and analyze the structure and relations in collaborative tasks and interactions with communication tools [28].

  • Process Mining. Its goal is to extract process related knowledge from event logs recorded by an information system to have a clear visual representation of the whole process. It consists of three subfields: conformance checking, model discovery, and model extension. In EDM, process mining can be used for reflecting students’ behaviors in terms of their examination traces consisting of a sequence of course, grade, and timestamp triplets for each student [29].

  • Text Mining. It is an extension of data mining to text that is focused on finding and extracting useful or interesting patterns, models, directions, trends, or rules from unstructured text documents such as HTML files, chat messages and emails. Text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling [10]. Text mining is used to analyze the content of discussion boards, forums, chats, Web pages, documents, etc. [3].

  • Knowledge Tracing. Knowledge tracing (KT) is a popular method for estimating student mastery of skills that has been used in effective cognitive tutor systems. It uses both a cognitive model that maps a problem-solving item to the skills required, and logs of students’ correct and incorrect answers as evidence of their knowledge of a particular skill. KT tracks student knowledge over time and is parameterized by a small set of variables per skill (in the standard formulation, the probabilities of initial knowledge, learning, guessing, and slipping). There is an equivalent formulation of KT as a Bayesian network. In EDM, it is used, for example, for predicting students’ behavior [30].

  • Matrix Factorization. It is the decomposition of a matrix into a product of matrices. There are many matrix factorization techniques, such as Non-negative Matrix Factorization (NMF). NMF approximates a matrix of non-negative numbers as the product of two smaller non-negative matrices. For example, in the context of education, a matrix M that represents the observed examinees’ test outcome data can be decomposed into two matrices: Q, which represents the Q-matrix of items, and S, which represents each student’s mastery of skills [31]. Thai-Nghe et al. [32] used a matrix factorization model inspired by recommender systems to predict student performance.
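To make one of the methods listed above concrete, the following is a minimal sketch of the standard Bayesian knowledge tracing update, assuming the usual four parameters (initial knowledge, learning, slip and guess probabilities). The parameter values and function names are illustrative assumptions and are not taken from [30].

```python
def bkt_update(p_known, correct, p_learn, p_slip, p_guess):
    """Return P(skill known) after observing one answer (standard BKT update)."""
    if correct:
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # The student may also acquire the skill at this practice opportunity.
    return posterior + (1 - posterior) * p_learn


def trace_knowledge(answers, p_init=0.2, p_learn=0.1, p_slip=0.1, p_guess=0.2):
    """Track the mastery estimate over a sequence of correct/incorrect answers.

    The default parameter values are arbitrary and chosen for illustration only.
    """
    p_known = p_init
    history = []
    for correct in answers:
        p_known = bkt_update(p_known, correct, p_learn, p_slip, p_guess)
        history.append(p_known)
    return history


def predict_correct(p_known, p_slip=0.1, p_guess=0.2):
    """Probability that the next answer is correct, given the mastery estimate."""
    return p_known * (1 - p_slip) + (1 - p_known) * p_guess
```

Calling `trace_knowledge([True, True, False, True])` returns one mastery estimate per answer; the estimate rises after correct answers and drops after the incorrect one. In practice the four parameters are fitted per skill from logged answers rather than fixed by hand.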

We note here that an increasing number of techniques are used in EDM for the analysis of the different data produced in educational systems. The choice of which technique to use depends on the nature of the learning environment, the research objectives and the type of the available data. In what follows we discuss the type of the analyzed data.

2.5 The Analyzed Data

As with their objectives and techniques, the data analyzed in EDM studies vary. We can distinguish these data according to the following features:

  • Data availability:

    • Data already available, recorded over the years in institutional databases (e.g. students’ scores) or in the log files of learning software.

    • Data generated during experiments within a research work.

    • Data available to researchers in benchmark repositories (PSL-Datashop, MULCE).

  • Collection sources:

    • Manual. Performed by a human observer who takes notes on the learning situation to evaluate the participants’ activities.

    • Digital. Relies on the use of a hardware configuration that records the learner’s activity. The result of such collection is a digital trace that can be a log file, information stored in databases, or audio or video recordings.

    • Mixed. Where both methods are used simultaneously.

  • Learning environment [10]:

    • Traditional education. Primary, secondary, higher education, etc.

    • Computer-based education. Intelligent Tutoring System (ITS), Learning Management System (LMS), Adaptive Educational Hypermedia System (AEHS), Computer Supported Collaborative Learning (CSCL), serious games, test and quiz systems, etc.

  • The educational level described [1, 9]:

    • The keystroke level, the answer level, the session level, the student level, the classroom level, the teacher level, and the school level.

  • The type of data:

    • Qualitative or quantitative data.

    • Personal, administrative and/or demographic data (age, sex, etc.).

    • Answers to psychological questionnaires for measuring users’ satisfaction, motivation, skills, cognitive features, etc.

    • Answers to questions and/or test scores of the academic system.

    • Individual interactions with the educational system: from fine grained actions such as mouse click, to high level ones such as number of attempts, the learner browsing pattern, etc.

    • Social interaction (chat, sent messages, forum participation, etc.).

    • Visual and facial reactions, etc.
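The relation between fine-grained and high-level interaction data can be illustrated with a short sketch that aggregates event-level log records into student-level features. The record layout (student, session, action, correct) and all field names are hypothetical; they do not correspond to the log format of any particular system.

```python
from collections import defaultdict

# Hypothetical event-level trace: (student_id, session_id, action, correct)
EVENTS = [
    ("s1", "a", "click", None),
    ("s1", "a", "answer", False),
    ("s1", "a", "answer", True),
    ("s1", "b", "post", None),
    ("s2", "c", "answer", True),
    ("s2", "c", "click", None),
]


def student_features(events):
    """Aggregate event-level records into one feature dict per student."""
    sessions = defaultdict(set)
    counts = defaultdict(lambda: defaultdict(int))
    correct = defaultdict(int)
    for student, session, action, is_correct in events:
        sessions[student].add(session)
        counts[student][action] += 1
        if action == "answer" and is_correct:
            correct[student] += 1
    features = {}
    for student in sessions:
        attempts = counts[student]["answer"]
        features[student] = {
            "n_sessions": len(sessions[student]),
            "n_attempts": attempts,
            "n_posts": counts[student]["post"],
            # Undefined (None) when the student never answered a question.
            "success_rate": correct[student] / attempts if attempts else None,
        }
    return features
```

Here `student_features(EVENTS)` moves from the answer and session levels to the student level, which is the granularity most prediction studies work with.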

We note here that the data vary greatly depending on the type of environment. In this chapter, we are interested in EDM applications in computer-based education. In such systems, the collected data are often digital, and their size is often smaller than in traditional environments, which have much larger databases. However, several studies combine these two sources of data to give a complete view of the learner’s behavior and performance. For instance, the authors of [23] attempted to predict the success of students in the final exam based on their participation level in online forums. The fusion and processing of these different types of data require several steps to implement the EDM process, which we present in the following section.

2.6 Process of Applying the EDM

Romero and Ventura [10] and Sachin and Vijay [33] proposed a process for applying EDM that is close to that of KDD (knowledge discovery in databases) or other data mining application processes (Fig. 1.2).

Fig. 1.2 The process of data mining application in educational data mining

This process starts with collecting or choosing the data to study from the educational environment. The obtained raw data require cleaning and preprocessing (heterogeneous data fusion, treatment of missing and incorrect values, converting the data to an appropriate form, feature selection, etc.).

This phase often requires the use of some data mining techniques. Given its complexity, some works try to eliminate this phase, such as [34], which provides a data model to structure the data stored by Learning Management Systems and a tool that performs the actual structuring and export, implemented for the Moodle LMS.

Once the data are preprocessed, the appropriate EDM method or technique is applied. The last step is the interpretation and assessment of the obtained results. To apply this process, which is often difficult given the heterogeneity of the data in the educational context, several tools are used.
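As a rough illustration of the preprocessing step described above, the sketch below treats missing values by mean imputation and rescales each feature to the interval [0, 1]. These are only two of many possible choices, and the function names and data layout are hypothetical.

```python
def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to the interval [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]  # a constant feature carries no information
    return [(v - low) / (high - low) for v in column]


def preprocess(rows):
    """Clean a list of equally long feature rows, column by column."""
    columns = list(zip(*rows))
    cleaned = [min_max_scale(impute_mean(list(col))) for col in columns]
    return [list(row) for row in zip(*cleaned)]
```

For example, `preprocess([[10, 2], [None, 4], [30, None]])` fills the two gaps with the column means (20 and 3) before scaling. Mean imputation is a simple default; whether it is appropriate depends on why the values are missing.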

2.7 Some Technological Tools Used in EDM

There are several tools and technologies used in the EDM process that are not specifically designed for teaching and educational environments (Weka, R, etc.). However, in the last few years, a large number of data mining tools designed for educational purposes have been developed [10]. A summary of some of the most recent tools is presented in Table 1.1.

Table 1.1 Some tools for EDM applications

By analyzing these tools, we find that they are usually designed for computer-based educational systems. Moreover, apart from the benchmark repositories (PSL-Datashop and MULCE), these tools are not reused by other researchers in the EDM community.

This may be due to several reasons: their availability, the special format of the data to analyze, the difficulty of deploying them outside their development environment, or a lack of awareness of their existence. That is why an effort should be made to make these tools available to the different learning actors (teachers, designers, administrators and researchers) to fulfill the different objectives and to analyze data from different environments.

Finally, now that we have an overview of the EDM field, we focus in the following on examples of its applications in computer-based educational systems. We particularly focus on behavior analysis and performance prediction and assessment.

3 Examples of EDM Applications in Computer-Based Learning Environments

In recent years, a large number of EDM applications have been developed, as seen in the previous sections. There are applications dealing with the assessment of students’ learning performance, course adaptation and learning recommendations based on the student’s learning behavior, evaluation of learning material and web-based courses, providing feedback to both teachers and students in e-learning courses, and detection of students’ learning behaviors [21].

A review of these studies can be found in [9, 10, 12, 21, 44, 45]. Through these studies we noticed that the current mainstream EDM research is primarily focused on mining logs generated by the e-learning systems [13, 21]. We also found that the oldest and the most popular applications are the prediction of the student’s performance and the analysis of learning behavior.

The term ‘prediction’ is generally used to characterize models (based on EDM techniques) designed for predicting new outcomes or scenarios based on new observations. Prediction is different from ‘explanation’, where the goal is to build models that explain underlying causal structure and to assess the explanatory power of such models [46]. This term is then linked to the study of the learner’s behavior.

In the following, we present the most recent EDM studies from 2010 to 2013 related to these main objectives: learner’s performance and behaviors in computer-based learning environments.

3.1 EDM Applications for Predicting and Evaluating Learning Performance

In this subsection, we analyze the current state of EDM research on learners’ performance in CBLE. Table 1.2 summarizes some of the recent studies reviewed.

Table 1.2 Some EDM applications for predicting the learner’s performance in CBLE

We note that the majority of the studied works applied EDM in LMS and ITS. In addition, as collaboration activities are often part of LMS, some studies treated collaborative data in LMS [23, 48, 56], while others [28, 55] analyzed collaboration usage data coming from a dedicated environment.

The most tested LMS is Moodle. In [47], Jovanovic et al. applied classification models for predicting students’ performance, and cluster models for grouping students based on their cognitive styles in Moodle. They developed a Moodle module that allows automatic extraction of the data needed for educational data mining analysis and deploys the models developed in this study. They indicate that the classification models helped teachers, students and business people to engage early with students who are likely to excel in a selected topic.

Furthermore, they indicate that clustering students based on cognitive styles and their overall performance enables better adaptation of the learning materials to their cognitive styles. Along the same lines, Falakmasir and Jafar [48] applied data mining methods (Feature Selection, decision trees) to the web usage records of students’ activities in Moodle. As a result, they were able to identify and rank the students’ activities based on their impact on the performance of students in final exams/grades. Their findings suggest that students’ participation in virtual classrooms had the greatest impact on their final grades.
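One common way to rank activities by their relation to an outcome, and the split criterion used by many decision tree learners, is information gain. The sketch below is a generic illustration with invented data; it is not the procedure or the data set of [48], and all names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(table, labels):
    """Sort feature names by decreasing information gain."""
    gains = {name: information_gain(values, labels) for name, values in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Invented example: whether each student used an activity (1/0) and passed (1/0).
ACTIVITIES = {
    "virtual_classroom": [1, 1, 1, 0, 0, 0],
    "forum": [1, 0, 1, 1, 0, 0],
}
PASSED = [1, 1, 1, 0, 0, 0]
```

With this toy data, `rank_features(ACTIVITIES, PASSED)` places `virtual_classroom` first because it separates passing from failing students perfectly, whereas `forum` is only weakly informative.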

Romero et al. [49] conducted trials and demonstrated how web usage mining can be applied in the Moodle e-learning system to predict the marks that university students will obtain in the final exam of a course. They also identified several avenues for using classification in educational settings: discovering student groups with similar characteristics, identifying learners with low motivation, proposing remedial actions, and predicting and classifying students using intelligent tutoring systems. Similarly, the authors of [23] studied students’ usage data from a Moodle system related to quizzes, assignments and forum activities to evaluate the relation between the on-line activities and the final mark obtained by the students. They used several association rule mining algorithms. The discovered rules predict students’ exam results (fail or pass) based on their frequent activities and can also help the instructor to detect infrequent student behaviors or activities. In [46], Lauria et al. used another LMS, Sakai. They used demographic data and the LMS log data of individual course events to develop a predictive model of student success. They used many EDM methods (factor analysis and logistic regression, C4.5/C5.0 decision trees, support vector machine (SVM) classifiers, Bayesian networks) to build data mining models that can help predict students’ performance and take corrective actions in higher education institutions.
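Association rules of the kind used in [23] are usually characterized by their support and confidence. The sketch below computes both measures for single-item rules of the form "activity -> outcome" on invented data. It illustrates the measures only and is not one of the mining algorithms used in [23]; the thresholds and names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


def rules_for(transactions, consequent, min_support=0.3, min_confidence=0.7):
    """List single-item rules item -> consequent that pass both thresholds."""
    items = set().union(*transactions) - {consequent}
    found = []
    for item in sorted(items):
        s = support(transactions, {item, consequent})
        if s >= min_support:
            c = confidence(transactions, {item}, {consequent})
            if c >= min_confidence:
                found.append((item, consequent, s, c))
    return found


# Invented data: one set of observed activities and exam outcome per student.
STUDENTS = [
    {"quiz", "forum", "pass"},
    {"quiz", "assignment", "pass"},
    {"quiz", "forum", "assignment", "pass"},
    {"forum", "fail"},
    {"assignment", "fail"},
]
```

On this toy data, `rules_for(STUDENTS, "pass")` keeps only the rule quiz -> pass (support 0.6, confidence 1.0). Full algorithms such as Apriori additionally search over larger itemsets while pruning those below the support threshold.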

Regarding ITS studies, Dominguez et al. [50] created a system to generate personalized feedback and hints by mining the student data collected by Python Tutor, an online learning system. They found that students who used the hinting system achieved significantly better results than those who did not, and stayed active on the site longer. Gorissen et al. [51] analyzed the interactions of students with recorded lectures using educational data mining techniques. They found discrepancies as well as similarities between students’ verbal reports and actual usage as logged by the recorded lecture servers. The data suggests that students who do this have a significantly higher chance of passing the exams [3]. Thai-Nghe et al. [32] analyzed students’ interaction log files to build success and progress indicators in order to predict students’ performance using matrix factorization.

In [30, 52, 53], the authors carried out several experiments using data related to test scores and students’ responses on the ASSISTments tutor. They applied many EDM methods (classification, clustering, Knowledge Tracing, etc.) to improve the prediction of students’ performance. Toescher and Jahrer [54] analyzed students’ answers to questions from two ITS: Algebra and Bridge to Algebra. They used a set of collaborative filtering techniques adopted from the field of recommender systems (e.g., matrix factorization) to predict a student’s ability to answer questions correctly, based on historic results. Similarly, Desmarais [31] used Non-negative Matrix Factorization on students’ scores to determine the skills required for a given question, and how strong different students are in these skills.
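The idea of predicting unobserved answers from latent student and item factors can be sketched as follows. This is a plain regularized matrix factorization trained by stochastic gradient descent, without the non-negativity constraint of NMF; it is not the model of [31], [32] or [54], and the data, hyperparameters and names are invented for illustration.

```python
import random


def predict(S, Q, student, item):
    """Predicted score: dot product of the student and item factor vectors."""
    return sum(a * b for a, b in zip(S[student], Q[item]))


def factorize(observed, n_students, n_items, k=2, steps=2000, lr=0.05, reg=0.02, seed=0):
    """Fit latent factors to observed (student, item, score) triples with SGD."""
    rng = random.Random(seed)
    S = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_students)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for s, i, score in observed:
            error = score - predict(S, Q, s, i)
            for f in range(k):
                s_f, q_f = S[s][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                S[s][f] += lr * (error * q_f - reg * s_f)
                Q[i][f] += lr * (error * s_f - reg * q_f)
    return S, Q


# Invented data: (student index, item index, score), with two cells unobserved.
SCORES = [
    (0, 0, 1.0), (0, 1, 1.0),
    (1, 0, 1.0), (1, 1, 1.0), (1, 2, 0.0),
    (2, 0, 0.0), (2, 2, 1.0),
]
```

After `S, Q = factorize(SCORES, 3, 3)`, a call such as `predict(S, Q, 0, 2)` estimates the unobserved score of student 0 on item 2. Only the observed cells contribute to training, which is what makes the approach usable on sparse response data.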

Regarding collaboration, López et al. [55] used classification and clustering to predict students’ final marks from their participation in forums. In the same way, Rabbany et al. [28] analyzed students’ interactions in asynchronous discussion forums of online courses using Social Network Analysis to facilitate fairer evaluation of students’ participation in online courses. They also proposed Meerkat-ED, a specific, practical and interactive toolbox for analyzing students’ interactions in asynchronous discussion forums.
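A simple SNA measure of participation in a discussion forum is degree centrality in the network of replies. The sketch below is a generic illustration on invented data; it does not reproduce the analyses implemented in Meerkat-ED, and the names are hypothetical.

```python
from collections import defaultdict

# Invented reply log: (author of the reply, author being replied to)
REPLIES = [
    ("ana", "ben"),
    ("ben", "ana"),
    ("cam", "ana"),
    ("dee", "ana"),
    ("cam", "ben"),
]


def degree_centrality(replies):
    """Normalized degree of each student in the undirected reply network."""
    neighbors = defaultdict(set)
    for source, target in replies:
        if source != target:  # ignore self-replies
            neighbors[source].add(target)
            neighbors[target].add(source)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide by the maximum possible number of distinct neighbors.
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}
```

In this example `degree_centrality(REPLIES)` identifies `ana` as the most connected participant, since she exchanged messages with every other student. Counting distinct partners rather than messages is one design choice among several; weighted or directed variants would emphasize message volume or who initiates contact.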

In [56] Chang et al. used a web-based discussion board provided by an online educational platform to analyze students’ language production. They used statistical techniques (ANOVA test, least significant difference (LSD) analysis) to evaluate to what degree the different types of web-based discussion affected students’ language production performance.
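For readers unfamiliar with the statistic behind a one-way ANOVA test, the following sketch computes the F statistic that compares the variation between groups with the variation within groups. The scores are invented and the function name is hypothetical; the sketch does not reproduce the analysis or data of [56].

```python
def one_way_anova_f(groups):
    """Return the F statistic and its degrees of freedom for a one-way ANOVA.

    Assumes at least two groups and some variation within the groups.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented scores for three types of discussion activity.
SCORES_BY_TYPE = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
```

A large F value relative to the F distribution with the returned degrees of freedom indicates that at least one group mean differs. Post hoc comparisons such as LSD are then used to find which groups differ.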

Regarding the analyzed data, the majority of these studies used students’ question responses, since their general goal was the prediction of learners’ success in the final exam based on their responses to previous tests [31, 52, 54] or previous attempts [30]. However, some studies also used interaction traces [32, 48], communication traces [28, 56], or responses to a satisfaction questionnaire [50], or combined several types of data, as in [23, 44, 48, 51]. Another approach, found in [31], was to generate simulated data using a probability matrix in order to test several models.

We also notice that the data set size varies widely: from 27 [56] to 4,927 [51] participants producing data over several hours, weeks, months or years, and reaching very large data set sizes (over 20 million in [54]). This is related to the context of the study and the origin of the data: data collected in experiments, or datasets already available and used in previous experiments (e.g. the ASSiSTment tutor). This second alternative facilitates the analysis by avoiding the collection step, which is often far from straightforward. Moreover, it even allows testing several methods and environments, as in [30].

We also note that in these studies, the tools used are often not mentioned or are DM tools (Pentaho in [46], RapidMiner, R in [31]), except for Rabbany et al. [28], who proposed their own tool (Meerkat-ED). Concerning the methods used, clustering and prediction (classification) are the most frequently implemented techniques. However, several studies addressed the use of other techniques such as text mining, sequential pattern mining, SNA and matrix factorization. Statistical methods are also used in many studies, not only during the treatment phase of the EDM process but also during the preprocessing step, where it is often difficult to choose adequate features among the available data. For instance, in [46] “Feature Selection” is used to select the relevant attributes.
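The feature selection step mentioned above can take many forms, and the source does not specify which one was applied in [46]. As one simple possibility, the sketch below ranks candidate attributes by the absolute value of their Pearson correlation with the target and keeps the top ones. The attribute names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either variable is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(features, target, top_k):
    """Return the names of the top_k features most correlated with target."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical per-student attributes and final marks.
features = {
    "quiz_avg": [1, 2, 3, 4, 5],
    "forum_posts": [5, 3, 4, 1, 2],
    "session_minutes": [3, 1, 2, 3, 1],
}
final_mark = [10, 12, 14, 16, 18]
print(select_features(features, final_mark, top_k=2))
```

This filter only captures linear, one-attribute-at-a-time relationships; it is meant to show the principle rather than to recommend a particular selection method.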

Through this study, we note that applying different EDM techniques made it possible to identify the learner’s performance (usually measured as the learner’s success in the final exam) from simple data (previous results or question answers, participation in collaborative activities, productions, etc.).

This can be exploited to improve the learning systems in different ways. For instance, if the majority of learners perform poorly on a resource, this could indicate that the course resource and/or the learning material are inadequate and should therefore be changed and/or improved.

Some of the reviewed studies discussed results that contribute to better adaptation and personalization of CBLE: improving adaptation based on cognitive styles in [47], classifying the learners’ activities according to their influence on performance in [48], identifying the required skills for a learning resource in [31].

Other studies also discussed results that help educators in their tutoring and assistance tasks: identifying learners with little motivation as in [49], grouping students based on their characteristics in [48, 49], detecting infrequent behaviors in [23], etc. Thus we think that these results should be used to improve and adapt the content and the organization of the learning materials in CBLE, and could be used to the advantage of all the actors involved in the learning process.

3.2 EDM Applications for Analyzing Learners’ Behaviors

In this subsection, we analyze the current state of EDM research on analyzing (identifying, explaining, etc.) learners’ behaviors in computer-based learning environments. Table 1.3 summarizes some of the recent research reviewed.

Table 1.3 Some EDM applications for analyzing learners’ behaviors in CBLE

Among the reviewed studies we found three that concern LMS. Krüger et al. [34] aimed to build a data model to ease the analysis and mining of educational data. To test their model, they analyzed the data stored for the “Programming 1” course in the Moodle LMS to study learners’ behaviors related to solving self-evaluation exercises using association rules. They found that as the semester progresses, fewer students solve them. Macfadyen and Dawson [16] analyzed LMS tracking data from a Blackboard Vista-supported course. The goal was to explain the variation in students’ final grades. Using regression, they found a significant correlation between the students’ final grades and their learning behaviors on the LMS, based on key variables such as the total number of discussion messages posted and the number of assessments completed.
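To make the notion of an association rule more concrete, the following sketch computes the support and confidence of a rule over sets of exercises solved by each student. It only illustrates the two measures; the data, the exercise names and the rule are hypothetical and do not come from [34].

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical data: one set of solved self-evaluation exercises per student.
solved = [
    {"ex1", "ex2"},
    {"ex1", "ex2", "ex3"},
    {"ex1"},
    {"ex2"},
    {"ex1", "ex3"},
]
print("support(ex1, ex2):", support(solved, {"ex1", "ex2"}))
print("confidence(ex1 -> ex2):", confidence(solved, {"ex1"}, {"ex2"}))
```

An association rule miner such as Apriori enumerates the itemsets whose support exceeds a threshold and then keeps the rules whose confidence is high enough; the two functions above are the building blocks of that filtering.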

In [40] Bousbia et al. aimed to automatically identify the learner’s behavior and learning style, based on navigation trace analysis in a web-based learning environment: the eFAD LMS. They defined four browsing behaviors using a decision tree and carried out experiments using statistical techniques and machine learning classifiers (C4.5 decision tree, KNN, Bayesian networks, and neural networks).

Learner’s behavior is also studied in other types of CBLE. Peckham and McCalla [22] carried out an experiment in a learning environment designed to emulate hypermedia courses to identify patterns of students’ behaviors in a reading comprehension task using EDM techniques (k-means clustering, and ANOVA test).
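As an illustration of the clustering step, the sketch below is a minimal k-means implementation applied to two hypothetical behavioral attributes per student. The attributes, the values and the choice of initial centroids are assumptions made for the example and are unrelated to the data of [22].

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=100):
    """Plain k-means: returns (labels, centroids).
    Initial centroids are supplied by the caller to keep the run deterministic."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids


# Hypothetical (reading time in minutes, number of pages revisited) per student.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[-1]])
print(labels)
print(centroids)
```

The resulting groups can then be compared on an outcome variable, for instance with an ANOVA test, to check whether the behavioral patterns are associated with different results.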

Desmarais and Lemieux [25] also aimed to better understand the patterns of use of a learning environment. They applied clustering and activity sequence visualization on gathered logs of learners’ interactions in a self-regulated web based drill and practice learning environment.

In [3] a live video streaming (LVS) system was used to study the students’ patterns using data mining and text mining applied on data of online interaction. Bouchet et al. [26] analyzed students’ characteristics and learning behaviors in MetaTutor, an agent-based ITS. They used clustering and sequence mining to distinguish patterns of behaviors. Similarly, Kinnebrew and Biswas [58] used sequence mining to identify learning behaviors in Betty’s Brain, a learning-by-teaching environment.
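A very reduced form of sequence mining can be sketched as follows: counting the contiguous action patterns of a given length that occur in a minimum share of the students’ sequences. This is a simplification made for illustration; the sequential pattern mining algorithms used in studies such as [26, 58] are more elaborate (they typically allow gaps between actions), and the action names below are hypothetical.

```python
from collections import Counter


def frequent_ngrams(sequences, n, min_support):
    """Return {pattern: support} for contiguous patterns of length n that
    appear in at least min_support (a fraction) of the sequences."""
    counts = Counter()
    for seq in sequences:
        # Count each pattern at most once per student sequence.
        seen = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
        counts.update(seen)
    total = len(sequences)
    return {pattern: count / total
            for pattern, count in counts.items()
            if count / total >= min_support}


# Hypothetical action sequences, one per student.
sequences = [
    ["read", "note", "quiz"],
    ["read", "note", "read", "quiz"],
    ["quiz", "read", "note"],
]
print(frequent_ngrams(sequences, n=2, min_support=0.6))
```

Patterns that are frequent in one group of students but rare in another (for example, high versus low performers) are usually the ones of most interest for interpretation.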

In [57] Baker et al. carried out three studies in three CBLE: AutoTutor (a dialogue tutor), The Incredible Machine (TIM) (a problem-solving game in a simulation environment), and Aplusix (a problem-solving based ITS). The data studied were pre-test–intervention–post-test data and video recordings of the participants and their computer screens in the first study, and observers’ notes on cognitive-affective states in the second and third studies. Using human judgment and ANOVA tests, the authors found that boredom was very persistent across learning environments and was associated with poorer learning and problem behaviors, such as gaming the system. Also, confusion and engaged concentration were the most common states within the three learning environments. These findings suggest that significant effort should be put into detecting and responding to boredom and confusion.

Throughout Table 1.3, we notice that all the reviewed studies used interaction traces, which are generally low level (action/event) or specific to the analyzed activity (messages, reading task, etc.). These datasets are often structured as numerical attributes, where task scores or statistics on log data (frequencies of actions, time spent on actions) are the most used. Moreover, the sample size in these studies is less variable, since almost all these works are based on experiments (from 28 participants in [22] to 148 in [26]). We note that the tools used for analysis are often not mentioned in these studies. The two mentioned are Weka in [26] and TraMiner-R in [25].
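The step from low-level traces to numerical attributes can be sketched as follows. The example aggregates raw log events into per-student action frequencies and time spent per action; the event format `(student, action, seconds)` and the attribute names are assumptions made for the illustration, since each environment logs its traces differently.

```python
from collections import defaultdict


def aggregate_log(events):
    """Turn (student, action, seconds) events into one attribute dictionary
    per student: number of occurrences and total time for each action."""
    features = defaultdict(lambda: defaultdict(float))
    for student, action, seconds in events:
        features[student][f"count_{action}"] += 1
        features[student][f"time_{action}"] += seconds
    return {student: dict(attrs) for student, attrs in features.items()}


# Hypothetical log extract.
events = [
    ("s1", "read", 30),
    ("s1", "quiz", 60),
    ("s1", "read", 20),
    ("s2", "read", 15),
]
for student, attrs in aggregate_log(events).items():
    print(student, attrs)
```

The resulting table, one row per student and one column per attribute, is the kind of input expected by the clustering and classification methods discussed in this subsection.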

Regarding the methods used, clustering and classification are still the most common, as the majority of the presented works aim to identify common learning behaviors. Other methods were also used, such as text mining, sequence mining, statistical methods (e.g. to calculate some variables), as well as human judgment when the analyzed data referred to personal characteristics.

Thus, through the reviewed studies presented here, we find that EDM makes it possible to analyze and evaluate the student’s behavior from low-level traces. This task is often difficult, given the close relationship between behavior and personal characteristics such as learning styles and emotions, and given its frequent changes according to the learner’s state, the learning time, the type and content of the learning materials, the learner’s reaction to other actors, etc. Thus, behavioral analysis should be done in real time to provide better feedback to teachers as well as learners, in order to improve the tutoring and learning tracking tasks. This remains difficult even with EDM techniques, because the small size of the analyzed samples does not allow generalization of the obtained results, which remain specific to the studied environments and the context of the experiments carried out.

However, even if the majority of research, such as the studies presented here, focuses on analyzing past behaviors to explain a phenomenon such as abandonment, or to evaluate participation and the obtained results, their findings should be used to improve learning environments based on the students’ behavioral patterns.

3.3 Discussion

The 25 reviewed studies presented in this section give an overview of the typical educational environments, data and methods used in EDM applications. LMS, or online educational environments more generally, and ITS are the most exploited. This is probably due to their wide use in education, which facilitates carrying out experiments, which are often the data collection source of these studies.

The analyzed data are generally related to assessments (tests, quizzes, exams, etc.), fine-grained online interaction (action, event) and also participation in collaborative activities. This data type is related to the type of CBLE studied, which provide such information in their databases and log files, as well as to the targeted objectives of these studies. However, some studies combined these data with video recordings of learners or manual human observations during the learning sessions. This combination, although difficult to achieve, can refine the study, especially in CBLE where there are fewer face-to-face interactions between the teacher and the learner.

Regarding EDM methods, the most used were prediction (classification, association rules, and regression) and clustering. This finding can be explained by the two objectives studied (learner’s performance and behavior) and by the fact that these techniques are mature, widely known, tested and implemented in the DM tools used, and also provide satisfactory results even with the small sample sizes that we often find here.

Other methods were also used, depending on the analyzed data and the objective, to exploit other techniques and improve the results. Note, however, that it is not easy to identify, for a given CBLE type, a given type and size of data set, and a given goal, which EDM technique is the best to use. This information certainly helps in making a choice, but it neither restricts the choice nor confirms that it is the best one.

This observation explains why several studies used several techniques to achieve the best results. Note that we did not discuss in this chapter the percentages reported in the results, since they depend on the different contexts of the studied works (types of environment and analyzed data, student populations, the parameters and hypotheses set during the EDM process, etc.). Indeed, although the obtained results are generally satisfactory in their context, they remain at an experimental stage and cannot be generalized.

Finally, we note that although we focused on the study of EDM applications related to two main objectives, namely the prediction of performance and behavior analysis, the results of the presented research achieved other EDM objectives (student modeling, communicating to stakeholders, maintaining and improving courses, etc.). In addition, as the two studied objectives are closely related, we found studies dealing with both, where EDM techniques were applied to explore the relationships between the learner’s behavior and the learning performance in order to improve the learning environment.

For example, in [59] learners’ behaviors are used to predict the success or failure of students without requiring the results of formal assessments. In [60] Bayer et al. focused on predicting drop-outs and school failures when students’ data have been enriched with data derived from students’ social behaviors.

We think that this last objective of analyzing the reasons for failure, drop-out and abandonment is a promising research area in the EDM field that should be exploited in the future, especially for CBLE, where it is a common phenomenon. We also believe that the data collected from these environments should be enriched by other types of information, such as demographic data, to provide a better explanation of the observed phenomena. An effort should also be made to share the analyzed samples and provide significant benchmarks, in order to move beyond the experimental stage and generalize the established models and the results found in these studies to improve the learning environments.

For the same goal, EDM tools need to be improved. In fact, although DM tools allow the analysis, they require some expertise to set the parameters and interpret the results appropriately. It is therefore necessary for EDM to have its own tools that put these techniques within the reach of teachers, allow more advanced treatment combining multiple data sources, and propose methods according to the type of these data and the analysis goal. We can thus imagine these tools being included in learning environments to make them accessible to the different learning actors.

4 Conclusions

In this chapter, we discussed the use of EDM in educational systems. We studied recent EDM applications (2010–2013) by taking into account the educational system, the analyzed data, the method used for the analysis, the tool used, and the analysis goal, especially in computer-based learning environments.

We noticed through this study that a large body of research is interested today in applying EDM to educational systems in general and to CBLE in particular, in order to exploit the data available, or that can be collected, in these environments and to improve them through the various objectives. We can say that EDM brings a major advantage, drawn from the data mining and KDD fields: the ability to extract hidden information about learners and learning from recorded data.

We have reviewed, in some detail, recent research dealing with students’ performance and behaviors. We found that the use of EDM methods helps the prediction of students’ performance, especially final marks. It also helps to identify and explain usual and unusual learning behaviors, which should facilitate the assistance of learners and reduce the costs of educational personalization and adaptation processes.

However, these contributions have to leave the laboratory and be applied in the educational systems actually in use in order to improve learning. Studies with this goal have been initiated, especially in traditional educational environments, where the results of applying EDM to existing data are used to improve the educational system [5].

We expect, however, that this goal will also be pursued in software educational systems, to find new ways to improve learning materials and reduce the abandonment rate, which is considerable in such environments. In fact, to make this area more mature, the models established in these studies need to be tested in real environments under frequent use, in order to confirm the results found and exploit them to improve these environments.

A first step in this direction is the sharing and reuse of datasets through open data repositories and standard data formats to promote the exchange of data and models. It is also necessary to popularize the use of EDM through tools targeted at the different learning actors, which allow the analysis of educational data in a simple and intuitive way, while suggesting methods to apply for better results and facilitating the interpretation of these results. We think that it is also necessary to take the EDM process into account in the overall development process of computer-based learning environments to ensure a significant improvement.