1 Introduction

In this chapter, we discuss educational data mining (EDM), a research area and community with close ties to the learning analytics community discussed throughout this book. The chapter introduces the EDM community, its methods, and ongoing trends in the area, and offers some brief thoughts on its relationship to the learning analytics community.

EDM can be seen in two ways: as a research community or as an area of scientific inquiry. As a research community, EDM can be seen as a sister community to learning analytics. EDM first emerged in a workshop series starting in 2005, which became an annual conference in 2008 and spawned a journal in 2009 and a society, the International Educational Data Mining Society, in 2011. A timeline of key events in the formation of the EDM community can be seen in Fig. 4.1.

Fig. 4.1 Timeline of significant milestones in EDM

As of this writing, the EDM Society has 240 paid members, and the conference has an annual attendance around the same number. Many of the same people attend both EDM and the Learning Analytics and Knowledge (LAK) conference, and the general attitude between the two conferences is one of friendly collaboration and/or friendly competition.

As an area of scientific inquiry, EDM is concerned with the analysis of large-scale educational data, with a focus on automated methods. There is considerable thematic overlap between EDM and learning analytics. In particular, both communities share a common interest in data-intensive approaches to education research, and share the goal of enhancing educational practice. At the same time, there are several interesting differences, with one viewpoint on the differences given in (Siemens and Baker 2012). In that work, it was argued that there are five key areas of difference between the communities, including a preference for automated paradigms of data analysis (EDM) versus making human judgment central (LA), a reductionist focus (EDM) versus a holistic focus (LA), and a comparatively greater focus on automated adaptation (EDM) versus supporting human intervention (LA). Siemens and Baker noted that these differences reflected general trends in the two communities rather than hard-and-fast rules. They also noted differences in preferred methodology between the two communities, a topic which we will return to throughout this chapter. Another perspective on the difference between the communities was offered in a recent talk by John Behrens at the LAK 2012 conference, where Dr. Behrens stated that, somewhat contrary to the names of the two communities, EDM has a greater focus on learning as a research topic, while learning analytics has a greater focus on aspects of education beyond learning. In our view, the overlap and differences between the communities are largely organic, developing from the interests and values of specific researchers rather than reflecting a deeper philosophical split or antagonism.

In the remainder of this chapter, we will review the key methods of EDM and ongoing trends, returning to the issue of how EDM compares methodologically to learning analytics as we do so.

2 Key EDM Methods

A wide range of EDM methods has emerged over the last several years. Some are roughly similar to those seen in the use of data mining in other domains, whereas others are unique to EDM. In this section we will discuss four major classes of methods that are in particularly frequent use by the EDM community: (a) Prediction Models, (b) Relationship Mining, (c) Structure Discovery, and (d) Discovery with Models. This is not an exhaustive selection of EDM methods; more comprehensive reviews can be found in (Baker and Yacef 2009; Romero and Ventura 2007, 2010; Scheuer and McLaren 2011).

2.1 Prediction Methods

In prediction, the goal is to develop a model which can infer a single aspect of the data (the predicted variable, similar to dependent variables in traditional statistical analysis) from some combination of other aspects of the data (predictor variables, similar to independent variables in traditional statistical analysis).

In EDM, classifiers and regressors are the most common types of prediction models, and each has several subtypes, which we will discuss below. Classifiers and regressors have a rich history in data mining and artificial intelligence, which is leveraged by EDM research. The area of latent knowledge estimation is of particular importance within EDM, and work in this area largely emerges from the User Modeling, Artificial Intelligence in Education, and Psychometrics/Educational Measurement traditions.

Prediction requires labels for the predicted variable on a limited dataset, where a label represents trusted ground truth information about the predicted variable’s value in specific cases. Ground truth can come from “natural” sources, such as whether a student chooses to drop out of college (Dekker et al. 2009), state-standardized exam scores (Feng et al. 2009), or grades assigned by instructors. It can also come from labels created solely to serve as ground truth, using methods such as self-report (cf. D’Mello et al. 2008), video coding (cf. D’Mello et al. 2008), field observations (Baker et al. 2004), and text replays (Sao Pedro et al. 2010).

Prediction models are used for several applications. They are most commonly used to predict what a value will be in contexts where it is not desirable to directly obtain a label for that construct. This is particularly useful if it can be conducted in real time, for instance to predict a student’s knowledge (cf. Corbett and Anderson 1995) or affect (D’Mello et al. 2008; Baker et al. 2012) to support intervention, or to predict a student’s future outcomes (Dekker et al. 2009; San Pedro et al. 2013). Prediction models can also be used to study which specific constructs play an important role in predicting another construct (for instance, which behaviors are associated with the eventual choice to attend high school) (cf. San Pedro et al. 2013).

2.1.1 Classification

In classifiers, the predicted variable can be either a binary or categorical variable. Some popular classification methods in educational domains include decision trees, random forests, decision rules, step regression, and logistic regression. Note that step regression and logistic regression, despite their names, are classifiers rather than regressors. In EDM, classifiers are typically validated using cross-validation, where part of the dataset is repeatedly and systematically held out and used to test the goodness of the model. Cross-validation should be conducted at the level that matches the type of generalizability desired; for instance, it is standard in EDM for researchers to cross-validate at the student level in order to ensure that the model will work for new students, although researchers also cross-validate in terms of populations or learning content. Some common metrics used for classifiers include A’/AUC (Hanley and McNeil 1982), kappa (Cohen 1960), precision (Davis and Goadrich 2006), and recall (Davis and Goadrich 2006); accuracy, often popular in other fields, does not account for base rates and should only be used if base rates are also reported.
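The two validation ideas in this paragraph can be illustrated with a minimal sketch, assuming tutor log data in which each row is tagged with a student identifier. The function names, the round-robin fold assignment, and the toy data are illustrative assumptions rather than a method taken from the cited work. The sketch splits folds by student rather than by row, and computes Cohen's kappa to show why accuracy alone can mislead when base rates are skewed.

```python
def student_level_folds(student_ids, k):
    """Assign each student (not each row) to one of k folds.

    Returns k (train_rows, test_rows) pairs of row indices such that no
    student's rows appear on both sides of a split.
    """
    students = sorted(set(student_ids))
    fold_of = {s: i % k for i, s in enumerate(students)}  # round-robin
    folds = []
    for f in range(k):
        test = [i for i, s in enumerate(student_ids) if fold_of[s] == f]
        train = [i for i, s in enumerate(student_ids) if fold_of[s] != f]
        folds.append((train, test))
    return folds


def cohens_kappa(actual, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    expected = sum(
        (actual.count(c) / n) * (predicted.count(c) / n) for c in labels
    )
    if expected == 1.0:
        return 0.0  # degenerate case: a single label on both sides
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: six log rows belonging to three students.
rows_student = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = student_level_folds(rows_student, k=3)

# A detector that always predicts the majority class has high accuracy
# on skewed data, yet its agreement is no better than chance.
actual = [0, 0, 0, 0, 1]
always_zero = [0, 0, 0, 0, 0]
accuracy = sum(a == p for a, p in zip(actual, always_zero)) / len(actual)
kappa = cohens_kappa(actual, always_zero)
```

In practice a library routine for grouped cross-validation would normally replace the hand-written split; the point here is only that the unit held out is the student.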

2.1.2 Regression

In regression, the predicted variable is a continuous variable. The most popular regressor within EDM is linear regression, with regression trees also fairly popular. Note that a model produced through this method is mathematically the same as linear regression as used in statistical significance testing, but that the method for selecting and validating the model in EDM’s use of linear regression is quite different than in statistical significance testing. Regressors such as neural networks and support vector machines, which are prominent in other data mining domains, are somewhat less common in EDM. This is thought to be because the high degrees of noise and multiple explanatory factors in educational domains often lead to more conservative algorithms being more successful. Regressors can be validated using the same overall techniques as those used for classifiers, often using the metrics of linear correlation or root mean squared error (RMSE).
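The two regression metrics named above measure different things, which a small sketch can make concrete. The function names and the toy values below are illustrative assumptions. The example uses predictions that are off by a constant amount: linear correlation is perfect, while RMSE still reports the error.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical post-test scores and predictions that are each one
# point too high.
actual_scores = [1.0, 2.0, 3.0, 4.0]
predicted_scores = [2.0, 3.0, 4.0, 5.0]

error = rmse(actual_scores, predicted_scores)
correlation = pearson_r(actual_scores, predicted_scores)
```

Reporting both metrics, computed on held-out students, guards against a model that ranks students well but is systematically miscalibrated.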

2.1.3 Latent Knowledge Estimation

One special case of classification that is particularly important in EDM is latent knowledge estimation. In latent knowledge estimation, a student’s knowledge of specific skills and concepts is assessed by their patterns of correctness on those skills (and occasionally other information as well). The word “latent” refers to the idea that knowledge is not directly measurable; it must be inferred from a student’s performance. Inferring a student’s knowledge can be useful for many goals: it can be a meaningful input to other analyses (we discuss this use below, in the section on discovery with models), it can be useful for deciding when to advance a student in a curriculum (Corbett and Anderson 1995) or intervene in other ways (cf. Roll et al. 2007), and it can be very useful information for instructors (Feng and Heffernan 2007).

The models used for estimating latent knowledge in online learning typically differ from the psychometric models used in paper tests or in computer-adaptive testing, as the latent knowledge in online learning is itself dynamic. The models used for latent knowledge estimation in EDM come from two sources: new takes on classical psychometric approaches, and the user modeling/artificial intelligence in education literature. A wide range of algorithms exists for latent knowledge estimation. The classic algorithms are Bayes Nets (Martin and VanLehn 1995; Shute 1995) for complex knowledge structures, and Bayesian Knowledge Tracing (Corbett and Anderson 1995) for cases where each problem or problem step is primarily associated with a single skill at the point in time when it is encountered. Recently, there has also been work suggesting that an approach based on logistic regression, Performance Factors Analysis (Pavlik et al. 2009), can be effective for cases where multiple skills are relevant to a problem or problem step at the same time. Work by Pardos and colleagues (2012) has also found evidence that combining multiple approaches through ensemble selection can be more effective for large datasets than single models.
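To make the dynamic nature of these estimates concrete, the following is a minimal sketch of the standard Bayesian Knowledge Tracing update for a single skill. It assumes the usual four parameters (initial knowledge, learning rate, guess, and slip); the parameter values and function names are illustrative assumptions, not fitted values from any cited study. After each response, the probability that the student knows the skill is first revised by Bayes' rule and then adjusted for the chance of learning at that opportunity.

```python
def bkt_update(p_known, correct, p_transit, p_guess, p_slip):
    """One Bayesian Knowledge Tracing step for a single skill.

    p_known   : probability the skill was known before this response
    correct   : True if the response was correct
    p_transit : probability of learning the skill at this opportunity
    p_guess   : probability of a correct response when the skill is unknown
    p_slip    : probability of an incorrect response when the skill is known
    """
    if correct:
        known_and_seen = p_known * (1.0 - p_slip)
        unknown_and_seen = (1.0 - p_known) * p_guess
    else:
        known_and_seen = p_known * p_slip
        unknown_and_seen = (1.0 - p_known) * (1.0 - p_guess)
    posterior = known_and_seen / (known_and_seen + unknown_and_seen)
    # The student may also learn the skill from this practice opportunity.
    return posterior + (1.0 - posterior) * p_transit


def trace_knowledge(responses, p_init, p_transit, p_guess, p_slip):
    """Return the knowledge estimate after each response in order."""
    estimates = []
    p_known = p_init
    for correct in responses:
        p_known = bkt_update(p_known, correct, p_transit, p_guess, p_slip)
        estimates.append(p_known)
    return estimates


# Hypothetical parameters and a short response sequence on one skill.
estimates = trace_knowledge(
    [True, False, True, True],
    p_init=0.3, p_transit=0.1, p_guess=0.2, p_slip=0.1,
)
```

A tutoring system could, for example, advance the student once the estimate passes a chosen mastery threshold; the threshold itself is a design decision outside this sketch.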

2.2 Relationship Mining

In relationship mining, the goal is to discover relationships between variables in a dataset with a large number of variables. This may take the form of attempting to find out which variables are most strongly associated with a single variable of particular interest, or may take the form of attempting to discover which relationships between any two variables are strongest. Broadly, there are four types of relationship mining in common use in EDM: association rule mining, sequential pattern mining, correlation mining, and causal data mining. Association rule mining comes from the field of data mining, in particular from “market basket” analysis used in mining of business data (Brin et al. 1997); sequential pattern mining also comes from data mining, with some variants emerging from the bioinformatics community; correlation mining has been a practice in statistics for some time (and the methods of post hoc analysis came about in part to make this type of method more valid); causal data mining also comes from the intersection of statistics and data mining (Spirtes et al. 2000).

2.2.1 Association Rule Mining

In association rule mining, the goal is to find if-then rules of the form that if some set of variable values is found, another variable will generally have a specific value. For example, a rule might be found of the form:

  • IF student is frustrated OR has a stronger goal of learning than performance

  • THEN the student frequently asks for help

Rules uncovered by association rule mining reveal common co-occurrences in data which would have been difficult to discover manually. Association rule mining has been used for a variety of applications in EDM. For example, Ben-Naim and colleagues (2009) found association rules within student data from an engineering class, representing patterns of successful student performance, and Merceron and Yacef (2005) studied which student errors tend to go together.

There is ongoing debate as to which metrics lead to finding the most interesting and usable association rules; a discussion of this issue can be found in Merceron and Yacef (2008), who recommend in particular cosine and lift.
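The interestingness metrics mentioned here have simple definitions in terms of co-occurrence frequencies. The sketch below computes support, confidence, lift, and cosine for a single rule, using their common textbook definitions. The student records and attribute names are hypothetical and chosen only to echo the example rule above.

```python
import math


def rule_metrics(records, antecedent, consequent):
    """Support, confidence, lift, and cosine for 'IF antecedent THEN consequent'.

    records    : list of sets of attribute values, one set per student
    antecedent : set of values that must all be present (the IF part)
    consequent : set of values that must all be present (the THEN part)
    """
    n = len(records)
    p_a = sum(antecedent <= r for r in records) / n
    p_b = sum(consequent <= r for r in records) / n
    p_ab = sum((antecedent | consequent) <= r for r in records) / n
    return {
        "support": p_ab,
        "confidence": p_ab / p_a,
        "lift": p_ab / (p_a * p_b),
        "cosine": p_ab / math.sqrt(p_a * p_b),
    }


# Hypothetical observations for five students.
students = [
    {"frustrated", "asks_help"},
    {"frustrated", "asks_help"},
    {"frustrated"},
    {"asks_help"},
    set(),
]

metrics = rule_metrics(students, {"frustrated"}, {"asks_help"})
```

A lift above 1 indicates that the antecedent and consequent co-occur more often than independence would predict. Cosine ranges from 0 to 1 and, unlike lift, does not depend on the number of records that contain neither side of the rule.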

2.2.2 Sequential Pattern Mining

In sequential pattern mining, the goal is to find temporal associations between events. Two paradigms are seen that find sequential patterns—classical sequential pattern mining (Srikant and Agrawal 1996), which is a special case of association rule mining, and motif analysis (Lin et al. 2002), a method often used in bioinformatics to find common general patterns that can vary somewhat. These methods, like association rule mining, have been used for a variety of applications, including to study what paths in student collaboration behaviors lead to a more successful eventual group project (Perera et al. 2009), the patterns in help-seeking behavior over time (Shanabrook et al. 2010), and studying which patterns in the use of concept maps are associated with better overall learning (Kinnebrew and Biswas 2012). Sequential pattern mining algorithms, like association rule mining algorithms, depend on a number of parameters to select which rules are worth outputting.
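The role of such parameters can be seen in a deliberately simplified sketch. It counts only contiguous action subsequences of a fixed length and keeps those that occur in at least a minimum fraction of students. Classical sequential pattern mining also allows gaps between events, so this is a stand-in for illustration rather than the algorithm of Srikant and Agrawal. The action names and the threshold are hypothetical.

```python
from collections import Counter


def frequent_subsequences(sequences, length, min_support):
    """Contiguous subsequences of a given length with enough support.

    Support is the fraction of sequences (students) containing the
    pattern at least once.
    """
    counts = Counter()
    for seq in sequences:
        # A set ensures each student contributes at most once per pattern.
        seen = {
            tuple(seq[i:i + length])
            for i in range(len(seq) - length + 1)
        }
        counts.update(seen)
    total = len(sequences)
    return {
        pattern: count / total
        for pattern, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical action logs for three students.
logs = [
    ["attempt", "hint", "attempt", "correct"],
    ["attempt", "hint", "hint", "correct"],
    ["attempt", "correct"],
]

patterns = frequent_subsequences(logs, length=2, min_support=0.6)
```

Lowering `min_support` admits rarer patterns and raises the number of rules to inspect, which is the trade-off the parameter controls.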

2.2.3 Correlation Mining

In correlation mining, the goal is to find positive or negative linear correlations between variables. This goal is not a new one; it is a well-known goal within statistics, where a literature has emerged on how to use post hoc analysis and/or dimensionality reduction techniques in order to avoid finding spurious relationships. The False Discovery Rate paradigm (cf. Benjamini and Hochberg 1995; Storey 2003) has become increasingly popular among data mining researchers across a number of domains. Correlation mining has been used to study the relationship between student attitudes and help-seeking behaviors (Arroyo and Woolf 2005; Baker et al. 2008), and to study the relationship between the design of intelligent tutoring systems and whether students game the system (Baker et al. 2009).
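As a rough illustration of the False Discovery Rate paradigm, the following sketch applies the Benjamini and Hochberg step-up procedure to a list of p-values, such as those obtained from testing many correlations at once. The p-values and the target rate of 0.05 are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the original order, marking which
    hypotheses are rejected while controlling the false discovery
    rate at level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value falls under its threshold.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values from five correlation tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]

significant = benjamini_hochberg(p_values, q=0.05)
uncorrected = [p < 0.05 for p in p_values]
```

With these values, four of the five tests fall below an uncorrected 0.05 threshold, but only the two smallest survive the procedure.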

2.2.4 Causal Data Mining

In causal data mining, the goal is to find whether one event (or observed construct) was the cause of another event (or observed construct) (Spirtes et al. 2000). Causal data mining is distinguished from prediction in its attempts to find not just predictors but actual causal relationships, through looking at the patterns of covariance between those variables and other variables in the dataset. Causal data mining in packages such as TETRAD (Scheines et al. 1998) has been used in EDM to predict which factors will lead a student to do poorly in a class (Fancsali 2012), to analyze how different conditions of a study impact help use and learning differently (Rau and Scheines 2012), and to study how gender and attitudes impact behaviors in an intelligent tutor and consequent learning (Rai and Beck 2011).

2.3 Structure Discovery

Structure discovery algorithms attempt to find structure in the data without any ground truth or a priori idea of what should be found. In this way, this type of data mining contrasts strongly with prediction models, above, where ground truth labels must be applied to a subset of the data before model development can occur. Common structure discovery algorithms in educational data include clustering, factor analysis, and domain structure discovery algorithms. Clustering and factor analysis have been used since the early days of the field of statistics, and were refined and explored further by the data mining and machine learning communities. Domain structure discovery emerged from the field of psychometrics/educational measurement.

Because these methods discover structure without ground truth, less attention is generally given to validation than in prediction, though goodness-of-fit calculations are still used in determining if a specific structure is superior to another structure.

2.3.1 Clustering

In clustering, the goal is to find data points that naturally group together, splitting the full dataset into a set of clusters (Kaufman and Rousseeuw 1990). Clustering is particularly useful in cases where the most common categories within the dataset are not known in advance. If a set of clusters is optimal, each data point in a cluster will in general be more similar to the other data points in that cluster than to the data points in other clusters. Clusters can be created at several different grain sizes. For example, schools could be clustered together (to investigate similarities and differences among schools), students could be clustered together (to investigate similarities and differences among students), or student actions could be clustered together (to investigate patterns of behavior) (cf. Amershi and Conati 2009; Beal et al. 2006). Clustering algorithms typically split into two categories: hierarchical approaches such as hierarchical agglomerative clustering (HAC), and non-hierarchical approaches such as k-means, Gaussian mixture modeling (sometimes referred to as EM-based clustering), and spectral clustering. The key difference is that hierarchical approaches assume that clusters themselves cluster together, whereas non-hierarchical approaches assume that clusters are separate from each other.
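As a minimal sketch of a non-hierarchical approach, the following implements the basic k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its members. The student features (hints requested per problem, seconds per step), their values, and the choice of starting centroids are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)),
            key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, initial_centroids, iterations=10):
    """Basic k-means with caller-supplied starting centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assign(points, centroids)


# Hypothetical students: (hints per problem, seconds per step).
students = [
    (0.5, 10.0), (0.6, 12.0), (0.4, 11.0),
    (3.0, 40.0), (3.2, 42.0), (2.8, 38.0),
]

centroids, labels = kmeans(students, [students[0], students[3]])
```

Because the two features are on different scales, real analyses would usually standardize them first, and would try several starting points, since k-means can converge to different solutions depending on initialization.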

2.3.2 Factor Analysis

In factor analysis, the goal is to find variables that naturally group together, splitting the set of variables (as opposed to the data points) into a set of latent (not directly observable) factors (Kline 1993). Factor analysis is frequently used in psychometrics for validating or determining scales. In EDM, factor analysis is used for dimensionality reduction (e.g., reducing the number of variables), including in preprocessing to reduce the potential for overfitting and to determine meta-features. One example of its use in EDM is work to determine which features of intelligent tutoring systems group together (cf. Baker et al. 2009); another example is as a step in the process of developing a prediction model (cf. Minaei-Bidgoli et al. 2003). Factor analysis includes algorithms such as principal component analysis and exponential-family principal components analysis.

2.3.3 Domain Structure Discovery

Domain structure discovery consists of finding which items map to specific skills across students. The Q-Matrix approach for doing so is well-known in psychometrics (cf. Tatsuoka 1995). Considerable work has recently been applied to this problem in EDM, for both test data (cf. Barnes et al. 2005; Desmarais 2011), and for data tracking learning during use of an intelligent tutoring system (Cen et al. 2006). Domain structures can be compared using information criteria metrics (Koedinger et al. 2012), which assess fit compared to the complexity of the model (more complex models should be expected to spuriously fit data better). A range of algorithms can be used for domain structure discovery, from purely automated algorithms (cf. Barnes et al. 2005; Desmarais 2011; Thai-Nghe et al. 2011), to approaches that utilize human judgment within the model discovery process such as learning factors analysis (LFA; Cen et al. 2006).
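The trade-off between fit and complexity can be sketched with one widely used information criterion, the Bayesian Information Criterion (BIC), used here only as an example of the family. The two item-to-skill mappings, the assumption of four parameters per skill, the log-likelihoods, and the number of observations below are all hypothetical values chosen for illustration; in a real analysis they would come from fitting a student model under each mapping.

```python
import math


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian Information Criterion; lower values are preferred."""
    return n_parameters * math.log(n_observations) - 2.0 * log_likelihood


def count_skills(q_matrix):
    """Number of distinct skills used by an item-to-skill mapping."""
    return len(set().union(*q_matrix.values()))


# Two hypothetical mappings of the same three items to skills.
q_coarse = {
    "item1": {"fractions"},
    "item2": {"fractions"},
    "item3": {"fractions"},
}
q_fine = {
    "item1": {"add_fractions"},
    "item2": {"multiply_fractions"},
    "item3": {"add_fractions", "multiply_fractions"},
}

PARAMETERS_PER_SKILL = 4   # assumption, e.g. a four-parameter skill model
N_OBSERVATIONS = 1000      # hypothetical number of student responses

# Hypothetical fitted log-likelihoods: the finer mapping fits slightly
# better, as more complex models generally do.
bic_coarse = bic(-520.0, count_skills(q_coarse) * PARAMETERS_PER_SKILL,
                 N_OBSERVATIONS)
bic_fine = bic(-515.0, count_skills(q_fine) * PARAMETERS_PER_SKILL,
               N_OBSERVATIONS)
```

With these made-up numbers the coarse mapping has the lower BIC, because the small gain in fit from the finer mapping does not offset its additional parameters.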

2.4 Discovery with Models

In discovery with models, a model of a phenomenon is developed via prediction, clustering, or in some cases knowledge engineering (within knowledge engineering, the model is developed using human reasoning rather than automated methods). This model is then used as a component in a second analysis or model, for example in prediction or relationship mining. Discovery with models is not common in data mining in general, but is seen in some form in many computational science domains.

In the case of EDM, one common use is when an initial model’s predictions (which represent predicted variables in the original model) become predictor variables in a new prediction model. For instance, prediction models of robust student learning have generally depended on models of student meta-cognitive behaviors (cf. Baker et al. 2011a, b), which have in turn depended on assessments of latent student knowledge (cf. Aleven et al. 2006). These assessments of student knowledge have in turn depended on models of domain structure.

When using relationship mining, the relationships between the initial model’s predictions and additional variables are studied. This enables a researcher to study the relationship between a complex latent construct and a wide variety of observable constructs, for example investigating the relationship between gaming the system (as detected by an automated detector) and student individual differences (Baker et al. 2008).

Often, discovery with models leverages the generalization of a prediction model across contexts. For instance, Baker and Gowda (2010) used predictions of gaming the system, off-task behavior, and carelessness across a full year of educational software data to study the differences in these behaviors between an urban, rural, and suburban school in the same region.

3 Trends in EDM Methodologies and Research

Given that “educational data mining” has been around as a term for almost a decade at this writing, and several early EDM researchers had been working in this area even before the community had begun to coalesce, we can begin to see trends and changes in emphasis occurring over time.

One big shift in EDM is the relative emphasis given to relationship mining. In the early years of EDM, relationship mining was used in almost half of the articles published (Baker and Yacef 2009). Relationship mining methods have continued to be important in EDM since then, but it is fair to say that the dominance of relationship mining has reduced somewhat in the following years. For example, at the EDM 2012 conference, only 16% of papers used relationship mining as defined in this chapter.

Prediction and clustering were important methods in the early years of EDM (Baker and Yacef 2009), and have continued to be highly used. However, within the category of prediction modeling, the distribution of methods has changed substantially. Classification and regression were important in 2005–2009, and remain important to this day, but latent knowledge estimation has increased substantially in importance, with articles representing different paradigms for how to estimate student knowledge competing to see which algorithms are most effective in which contexts (Pavlik et al. 2009; Gong et al. 2011; Pardos et al. 2012).

A related trend is the increase in the prominence of domain structure discovery in recent EDM research. Although domain structure discovery has been part of EDM from the beginning (Barnes 2005), recent years have seen increasing work on a range of approaches for modeling domains. Some work has attempted to find better ways to find q-matrices expressing domain structure in a purely empirical fashion (Desmarais 2011; Desmarais et al. 2012), while other work attempts to leverage human judgment in fitting q-matrices (Cen et al. 2007; Koedinger et al. 2012). Additionally, in recent years there has been work attempting to automatically infer prerequisite structures in data (Beheshti and Desmarais 2012), and to study the impact of not following prerequisite structures (Vuong et al. 2011).

A third emerging emphasis in EDM is the continued trend towards modeling a greater range of constructs. Though the trends in latent knowledge estimation and domain structure discovery reflect the continued emphasis within EDM on modeling student knowledge and skill, there has been a simultaneous trend towards expanding the space of constructs modeled through EDM, with researchers expanding from modeling knowledge and learning to modeling constructs such as metacognition, self-regulation, motivation, and affect (cf. Goldin et al. 2012; Bouchet et al. 2012; Baker et al. 2012). The increase in the range of constructs being modeled in EDM has been accompanied by an increase in the number of discovery with models analyses leveraging those models to support basic discovery.

4 EDM and Learning Analytics

Many of the same methodologies are seen in both EDM and Learning Analytics. Learning analytics has a relatively greater focus on human interpretation of data and visualization (though there is a tradition of this in EDM as well—cf. Kay et al. 2006; Martinez et al. 2011). EDM has a relatively greater focus on automated methods. But ultimately, in our view, the differences between the two communities are more based on focus, research questions, and the eventual use of models (cf. Siemens and Baker 2012), than on the methods used.

Prediction models are prominent in both communities, for instance, although Learning Analytics researchers tend to focus on classical approaches of classification and regression more than on latent knowledge estimation. Structure Discovery is prominent in both communities, and in particular clustering has an important role in both communities. In terms of specialized/domain-specific structure discovery algorithms, domain structure discovery is more emphasized by EDM researchers while network analysis/social network analysis is more emphasized in learning analytics (Bakharia and Dawson 2011; Schreurs et al. 2013), again more due to research questions adopted by specific researchers than to a deep difference between the fields. Relationship mining methods are significantly more common in EDM than in learning analytics. It is not immediately clear to the authors of this chapter why relationship mining methods have been less utilized in learning analytics than in EDM, given the usefulness of these methods for supporting interpretation by analysts (this point is made in d’Aquin and Jay 2013, who demonstrate the use of sequential pattern mining in learning analytics). Discovery with models is significantly more common in EDM than in learning analytics, and much of its appearance at LAK conferences is in papers written by authors better known as members of the EDM community (e.g., Pardos et al. 2013). This is likely to again be due to differences in research questions and focus; even though both communities use prediction modeling, LAK papers tend to predict larger constructs (such as dropping out and course failure) whereas EDM papers tend to predict smaller constructs (such as boredom and short-term learning), which are more amenable to subsequent use in discovery with models analyses of larger constructs.

Finally, some methodological areas are more common in learning analytics than in EDM (though relatively fewer, owing to the longer history of EDM). The most prominent example is the automated analysis of textual data. Text analysis, text mining, and discourse analysis together form a leading area in learning analytics; they are only seen occasionally in EDM (cf. D’Mello et al. 2010; Rus et al. 2012).

5 Conclusion

In recent years, two communities have grown around the idea of using large-scale educational data to transform practice in education and education research. As this area emerges from relatively small and unknown conferences to a theme that is known throughout education research, and which impacts schools worldwide, there is an opportunity to leverage the methods listed above to accomplish a variety of goals. Every year, the potential applications of these methods become better known, as researchers and practitioners utilize these methods to study new constructs and answer new research questions.

While we learn where these methods can be applied, we are also learning how to apply them more effectively. Having multiple communities and venues to discuss these issues is beneficial; having communities that select work with different values and perspectives will support the development of a field that most effectively uses large-scale educational data. Ultimately, the question is not which methods are best, but which methods are useful for which applications, in order to improve the support for any person who is learning, whenever they are learning.