54.1 Introduction

Continuous advancement in data collection and storage techniques has changed many industries and research areas. The Internet is taking over the role of libraries, Twitter brings information to the public faster than any newspaper, and stock markets are run by high-frequency trading algorithms. In education, a substantial part still happens in the classroom; however, we also see new, global initiatives, exemplified by massive open online courses (MOOCs).

One of the key challenges of MOOC research is closing the gap between educational science and online education [2, 3]. An increasing number of computer scientists and data scientists are trying to solve educational problems without contextual knowledge, whereas educational scientists are often not familiar with modern modelling techniques.

Most educational experiments have been run on small groups of students, often from the same school and sharing a similar background. Online education not only allows us to see a bigger picture, with millions of students from all over the world, but also gives us the opportunity to approach each of these students individually.

New data streams require new methodology. In the classical approach, with, say, 50 students in each condition, we could simply apply a t-test or ANOVA. Since the datasets were small, only large effects were detectable, so the notion of significance implicitly implied practical relevance. Conversely, when the number of students is large, we can easily end up rejecting the null hypothesis and detecting an effect that is irrelevant in practice. Moreover, in the massive context, predictive models can be more accurate, provided that a large number of informative variables is available.

In this tutorial, we present a methodology for forming and testing hypotheses in this new setup. We also give practical guidance for building data-driven predictive models with state-of-the-art machine learning methods (Fig. 54.1).

Fig. 54.1 The flow of data-driven educational research is now altered by an information retrieval step, in which we find an adequate representation of a vast dataset

54.2 Dataset

We use data from the Introduction to C++ and Introduction to Java courses offered by EPFL in fall 2013. There were 13,787 students in the C++ course and 17,716 students in the Java course. Both courses had a very similar structure in terms of the number of weeks, the assignments, and the abstract object-oriented content.

54.3 Hypothesis

The main step of any analysis is the formulation of good questions. Classical experiment design still applies, and large datasets allow us to analyze deeper patterns, for example: To what extent is perceived video difficulty reflected by video interactions (pauses, speed-ups, etc.)? [5] Or: Do forum activities and in-video interactions reflect decreasing engagement over time? [9] In this tutorial, we predict students' grades based on their temporal behavior. We hypothesize that the model is independent of the course, since both courses have a similar structure and both were designed to deliver object-oriented programming paradigms. We therefore expect students to behave similarly.

54.4 Data Collection

As soon as we have formulated the hypothesis, we start gathering data to test it. In the online context, we can still use classical tools (e.g., questionnaires), but new sources of data are also available. We can record, among others: the clickstream (the sequence of pages clicked), mouse moves, typing patterns, video interactions (pauses, forwards, etc.), scroll depth, and a growing amount of other information provided by Web browsers.

In addition, in an experimental setup, we can ask users for access to their cameras or microphones. The increasing popularity of social networks may provide additional information about students' backgrounds and social contexts. Interesting research may arise just from the analysis of these streams of data. We can, for example, assess a student's attention from camera images, leveraging small-sample research [7].

In this work, we analyze students' time series: we look at the activities of each student over time, extract information relevant to the hypothesis, and build a predictive model. For each student, we have the time stamps of events of the following types: Forum View, Forum Subscription, Thread View, Lecture Re-View, Thread Subscription, Post on Thread, Quiz Submission (Video), Quiz Re-Submission (Video), Assignment Re-Submission, Thread Launch, Quiz Re-Submission (Quiz), Forum Upvote, Quiz Submission (Quiz), Forum Downvote, Lecture Download, Lecture Re-Download, Comment on Post, Quiz Submission (Survey), Quiz Re-Submission (Survey), Lecture View, Assignment Submission, Registration.
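The exact log format is platform specific; as a minimal sketch, such an event log could be loaded into R as follows (the file name and column names are illustrative, not those of the actual platform export):

# Illustrative sketch: one time-stamped record per student action
events <- read.csv("events.csv", stringsAsFactors = FALSE,
                   col.names = c("student_id", "timestamp", "event_type"))
# Parse time stamps for later time-based feature extraction
events$timestamp <- as.POSIXct(events$timestamp, tz = "UTC")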

54.5 Information Retrieval

After gathering the relevant data, we extract features that are important for the research question. First, we extract simple characteristics such as the number of videos watched, posts written, and posts read. Next, we add more sophisticated constructs. To that end, we use existing domain knowledge and we explore the dataset.

In our context, the visualization of a student's time series may give us insights about how to extract variables, as illustrated in Fig. 54.2. We can also look for well-established constructs, defining, for example, Procrastination as the number of times a student submitted an assignment just before the deadline, Persistence as the number of retries of assignment submissions, or Regularity as the variance of the intervals between consecutive watching sessions.

Fig. 54.2 Visualization of the time series of two students, one who succeeded with 87 % of the points (left) and one who failed with 66 % (right). Although both of them completed the assignments, the left one was clearly more engaged. Observations from visual assessment can help us engineer variables that are informative in the given context

As the output of this step, we have a large structured table, with one row per student and one column per extracted variable. In the next step, we use this table for training machine learning models.
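As a minimal sketch, assuming the illustrative events table from Sect. 54.4, such a feature table could be built as follows (the event-type strings match the list above; Procrastination would additionally require joining the assignment deadlines, so it is omitted here):

library(dplyr)

# One row per student; the counts and constructs below are illustrative
features <- events %>%
  group_by(student_id) %>%
  summarise(
    videos_watched = sum(event_type == "Lecture View"),
    posts_written  = sum(event_type == "Post on Thread"),
    threads_read   = sum(event_type == "Thread View"),
    # Persistence: number of assignment re-submissions
    persistence    = sum(event_type == "Assignment Re-Submission"),
    # Regularity: variance of the gaps (in days) between consecutive
    # lecture views; NA for students with fewer than two views
    regularity     = var(as.numeric(diff(sort(timestamp[event_type == "Lecture View"])),
                                    units = "days"))
  )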

54.6 Pattern Recognition

We identify two main branches of machine learning: supervised learning and unsupervised learning. The goal of supervised learning is to identify patterns within independent variables that explain a dependent variable. The key examples here are linear regression and logistic regression, known from classical statistics. Recent techniques such as support vector machines [1], random forests [6], and generalized boosted regression [8] are gaining popularity due to their robustness, computational feasibility, and effectiveness. Unsupervised learning applies whenever there is no dependent variable and we want to investigate patterns in the data; most commonly, it clusters similar observations.

For our example, we use supervised learning to predict the grades of students. To this end, we represent each student as a vector of his or her characteristics, as described in Sect. 54.5. To fit the model to the known instances from the training set, we need an accuracy measure. In our example, we use the root mean square error (RMSE), which, intuitively, expresses the mean distance of the predictions from the observed values.
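Formally, for observed grades $y_i$, predictions $\hat{y}_i$, and $n$ students:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$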

Listing 54.1 presents the process of model building using the dataset with the features described in Sect. 54.5. We use the very convenient R framework CARET [4] for the application of machine learning methods. In particular, the choice of the underlying supervised learning technique is governed by the method argument; for instance, with method = "rf" we apply random forests instead of an SVM. This allows us to quickly prototype and compare models.

Listing 54.1 Building an SVM model using the CARET package in R
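The original listing is reproduced as a figure; a minimal sketch of such a call, assuming a data frame students that holds the feature table from Sect. 54.5 together with a numeric grade column (both names are illustrative), could look as follows:

library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# SVM with a radial kernel; caret delegates to the kernlab package
model.svm <- train(grade ~ ., data = students, method = "svmRadial",
                   trControl = ctrl, metric = "RMSE")

# Swapping the learning technique only requires changing 'method'
model.rf  <- train(grade ~ ., data = students, method = "rf",
                   trControl = ctrl, metric = "RMSE")
model.gbm <- train(grade ~ ., data = students, method = "gbm",
                   trControl = ctrl, metric = "RMSE", verbose = FALSE)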

Since over 90 % of the students drop out before finishing any assignment, predicting their score to be 0 is easy; therefore, we focus only on the students who achieved at least 10 % of the points from the assignments.
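A minimal sketch of this filtering step, applied to the illustrative students table before training and assuming grades on a 0-100 scale:

# Keep only students who earned at least 10 % of the assignment points
students <- subset(students, grade >= 10)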

To assess the quality of the various models, we look at the estimated RMSEs. The simple commands to compute and plot these values are presented in Listing 54.2, where we assume that the models model.svm, model.rf, and model.gbm were built as described above.

Listing 54.2 Comparison of the performance of different models
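Again, the original listing is a figure; with CARET, a minimal sketch of such a comparison, assuming the three fitted models from above, is:

library(caret)

# Collect the cross-validated resampling results of the three models
results <- resamples(list(SVM = model.svm, RF = model.rf, GBM = model.gbm))

summary(results)                   # tabulates the estimated RMSE of each model
dotplot(results, metric = "RMSE")  # plots the estimated RMSEs side by side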

The best model, generalized boosted regression, achieves an RMSE of around 13, as presented in Fig. 54.3. We consider this result satisfactory, taking into account the simplistic, illustrative approach.

Fig. 54.3 Errors of each model in terms of the RMSE

Exploratory data analysis [10] can be useful for finding an appropriate technique, for adequate data transformations, for outlier detection, etc. Moreover, this exploration brings new insights and hypotheses and eventually closes the cycle in Fig. 54.1.

54.7 Discussion

The analysis of educational data in the massive context requires new techniques and methodologies. The goal of this tutorial was to shed light on the usage of machine learning and on the process of the analysis in the context of online education. Since it is not possible to introduce advanced techniques in detail during a short tutorial, we focused on illustrating the simplicity of applying state-of-the-art machine learning using the R package CARET.