Introduction

In the past forty years, cognitive science and machine learning have seen several major paradigm shifts [5]. Deep learning is now at a new peak of interest [32]. As a resurgence of neural networks, it has attracted considerable research effort owing to its outstanding performance in real-world applications and its deployment in state-of-the-art systems for speech recognition [17], image classification [13, 30, 42], text classification [23], image caption generation [47], and even playing Go [41].

The effectiveness of deep learning stems from its ability to learn representations automatically from the input features, without handcrafted feature design. The “deep” in deep learning refers to the multiple layers of neurons (or feature representations) between the input and the output of the neural network [16, 32]. The success of deep learning started with deep belief nets [19] and quickly extended to other neural networks, such as convolutional neural networks and recurrent neural networks [16, 30].

Recurrent neural networks (RNNs) are an effective tool in cognitive computing and machine learning for modeling dynamic temporal behavior over sequences of inputs [40, 50]. Recently, deep RNNs such as the long short-term memory (LSTM) and the gated recurrent unit (GRU) have been intensively investigated and have proved effective in real-world applications such as speech recognition, language modeling, and knowledge tracing [8, 20, 34, 38].

Knowledge tracing is one of the significant research topics in educational data mining (EDM) [26, 31]. The goal is to capture students’ knowledge states over time so that we can estimate their progress in mastering the required knowledge components [2, 9, 10]. Inferring students’ knowledge states allows us to adapt to different rates of learning progress and to recommend suitable personalized exercises or necessary teaching materials according to students’ needs [1, 12, 44]. Recently, massive open online course (MOOC) platforms, such as Coursera, edX, and Khan Academy, have provided high-quality open-access online courses and attracted a large number of enrolled users worldwide [11]. The abundance of data generated on these platforms enables researchers to investigate and monitor the learning process of students [26, 28, 31].

In the literature, the proposed methods for knowledge tracing can be divided into two main streams. One is Bayesian Knowledge Tracing (BKT) [3, 10, 36], which applies a Hidden Markov Model (HMM) to model the knowledge components. The hidden states are updated according to each student’s responses to the exercises. The other is Deep Knowledge Tracing (DKT) [38, 49], which utilizes deep recurrent neural networks to discover the hidden structure of the correlation among exercises by analyzing students’ responses to them. In the original work, DKT is shown to achieve a 25% gain in the area under the ROC curve (AUC) score over BKT [38]. More variants of DKT models are further investigated in [25, 48, 52].

Two main issues remain in the previously proposed DKT models. First, the complexity of DKT models makes them harder to interpret psychologically, whereas models with sufficient psychological interpretability are favored in cognitive science [43, 51]. Second, the input of DKT models is only a one-hot encoding of the exercise tags [38]. It excludes many rich and informative features, such as the exercise title, the number of attempts to answer, and the time taken to answer. These features are heterogeneous and exhibit different characteristics of students’ learning procedure. They not only provide additional information on the exercises but also capture students’ learning procedure, so analyzing and utilizing them properly helps to trace students’ learning states. Currently, researchers mainly focus on incorporating different types of features into learning. The proposed features capture different aspects of students’ learning procedure, such as measuring the effect of students’ individual characteristics, assessing the effect of help in a tutor system, controlling the difficulty level of exercises, and measuring the effect of subskills [24, 53]. In [53], a manual method is proposed to analyze the features and to select appropriate feature subsets. The selected features are then discretized based on their statistics, and their semantic meaning is learned via an autoencoder. This manual feature engineering offers further improvement in knowledge tracing but is still restricted in two respects: First, it requires sufficient domain knowledge to understand the data, which may introduce bias when practitioners cannot fully explore the data. Second, it is infeasible to extend to a very large number of features.

To tackle the above issues, we propose an automatic and intelligent approach to integrate the heterogeneous features into the DKT model. More specifically, we conduct a pre-processing step with tree-based classifiers, chosen for their effectiveness and interpretability [39]. The tree-based classifiers predict whether a student can answer an exercise correctly given the heterogeneous features. We then encode the predicted response and the true response into binary codes and concatenate them with the original one-hot encoding features as the input to train an LSTM model. Although the pre-processing step is simple, it provides additional information on students’ learning behaviors, in particular how a student deviates from others in the learning process.

We highlight the contributions of this article in the following:

  • We have proposed an automatic and effective way to pre-process the heterogeneous features based on tree-based classifiers. The splitting features can provide us insight into the characteristics of students’ learning behaviors.

  • We present a systematic framework to incorporate the learned response, the true response, and the original one-hot encoding features to train an LSTM model. The output can produce the predicted probability whether a student will answer the next exercise correctly. The learned response given the heterogeneous features allows us to exploit students’ learning behaviors.

  • We conduct a thorough evaluation on two educational datasets and demonstrate the effectiveness and merits of our proposal.

The rest of the paper is organized as follows: In “Related Work,” we review the related work in this paper, including techniques for knowledge tracing and the tree-based classifiers we adopt. In “Methods,” we detail the overall architecture of our proposal, especially, how the heterogeneous features are learned and incorporated. In “Experiments,” we present the educational datasets, the experimental setup, and the experimental results with detailed explanation. Finally, in “Conclusions,” we conclude the whole paper with some remarks.

Related Work

Knowledge Tracing

Knowledge tracing is an important research topic in EDM, aiming to capture students’ knowledge states based on their performance on the exercises. BKT is a dominant approach in the field, where the knowledge components are denoted as a series of binary variables in the hidden states modeled by an HMM [3, 10, 36]. The hidden states are updated according to each student’s responses to the given exercises. Researchers have since extended the HMM to explore different latent factors [15, 27, 37].

DKT [38] is a breakthrough that leverages a vanilla RNN or an LSTM model to solve this task. In the original work, DKT is shown to achieve a 25% gain in the AUC score over BKT, although some researchers later argue that, with suitable extensions, BKT can achieve performance comparable to DKT [26]. Owing to the good performance of DKT, variants of DKT models have been proposed [48, 52, 53]. For example, a memory-augmented neural network replaces the LSTM in the original DKT to capture long-term dependence and is extended with a key-value mechanism that stores the concept representation and the student’s understanding state of each concept, respectively [52].

Some researchers also try to include heterogeneous features to improve the performance of knowledge tracing [24, 53]. The existing publications show that employing these heterogeneous features properly can indeed further improve model performance. However, the existing work either relies on handcrafted features or does not fit into the DKT architecture. This motivates us to further explore the heterogeneous features and to exploit them in the DKT architecture.

Tree-Based Classifiers

Decision trees are among the most popular data mining algorithms for classification and decision-making [39]. They utilize entropy to split features and have triggered a family of feature discretization and feature selection techniques [14, 29]. Because feature selection and feature discretization are embedded in the training procedure, decision trees have been shown to largely reduce the engineering effort [45]. Moreover, the learned features are meaningful and interpretable, which motivates us to include them in our proposal.

Among various decision trees, Classification and Regression Trees (CART) [39] exhibit several significant advantages: First, they can handle both numerical and categorical features. Second, they can handle outliers properly [46] and avoid overfitting [35]. Hence, we apply CART in [7]. Having observed the power of CART, we further explore other tree-based classifiers, namely random forest [4] and the Gradient Boosted Decision Tree (GBDT) [18]. Both are ensemble classifiers: random forest combines decision trees to reduce the overall variance, while GBDT greedily builds a linear combination of weak learners to improve the performance of the entire ensemble. Because they retain robustness and improve predictive performance in classification applications [4, 18], we apply these two methods in the rest of the paper.

Methods

Figure 1 illustrates the overall architecture of our proposed model, Deep Knowledge Tracing with tree-based classifiers, where CART is illustrated as the representative tree-based classifier. It can be replaced by other tree-based classifiers, e.g., random forest or GBDT. In Fig. 1, the bottom part is the pre-processing procedure on the heterogeneous features. CART learns from these features to predict whether a student will answer the exercise correctly and outputs the predicted response. The predicted response and the true response are encoded into a 4-bit one-hot encoding, which is concatenated with the original one-hot encoding of the exercise tag to form a new input. This new input is fed into an LSTM [20] to learn the similarity of exercises and trace the knowledge components mastered by the students. Note that Fig. 1 shows a vanilla RNN for simplicity, whereas we deploy an LSTM in our evaluation.

Fig. 1 The architecture of our proposal with the 4-bit one-hot encoding for responses consists of three parts: (1) heterogeneous features learned via tree-based classifiers, (2) feature concatenation, and (3) model training and prediction by an RNN/LSTM. The solid large black nodes indicate the splitting of features on different sub-branches of the trees. The black, white, and gray dots in a vector indicate the values of 1, 0, and the probability of the prediction, respectively.

Input and Output

Consider a specific student practicing an exercise at the \(t\)-th time stamp, and let \(e_{t}\) and \(a_{t}\) be the exercise tag and the heterogeneous features, respectively. The notation \(c_{t} = 1\) means that the student answers the exercise correctly, while \(c_{t} = 0\) denotes an incorrect answer. As shown in Fig. 1, CART takes \(a_{t}\) as the input and tries to predict whether the student will answer the exercise correctly given these heterogeneous features. The corresponding predicted response is denoted by \(a^{\prime }_{t}\).

All features are represented in a binary form denoted by \(O(\cdot ,\cdot )\), where \(O(e_{t}, c_{t}) \in \{0,1\}^{2M}\) is the original one-hot encoding for an exercise tag, with \(M\) being the number of exercises, and \(O(a^{\prime }_{t}, c_{t})\) is the 4-bit one-hot encoding. Hence, the total size of \(O(e_{t}, c_{t})\) and \(O(a^{\prime }_{t}, c_{t})\) is \(2M + 4\). For \(O(e_{t}, c_{t})\), all elements are zero except for a single 1, which is placed at the \(i\)-th index when the answer to the \(i\)-th exercise is correct and at the \((i + M)\)-th index otherwise. \(O(a^{\prime }_{t}, c_{t})\) is constructed from the predicted response learned by the tree-based classifiers and the true response, as defined in Table 1.

Table 1 Explanation of the notation of \(O(a^{\prime }_{t}, c_{t})\)

After concatenating \(O(e_{t}, c_{t})\) and \(O(a^{\prime }_{t}, c_{t})\), we feed the result as the new LSTM input \(\mathbf {x}_{t}\) to train the corresponding model and output a vector \(\mathbf {y}_{t}\in \mathbb {R}^{M}\) of predicted probabilities that a student will answer each exercise correctly. At the RNN/LSTM level in Fig. 1, different color grades in the nodes of \(\mathbf {y}_{t}\) represent different levels of probability, where dark colors represent higher probabilities and light colors represent lower probabilities.
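The construction of the input vector can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: indices are 0-based here, the function names are hypothetical, and the position of each (predicted, true) pair inside the 4-bit code is an assumption, since the actual assignment is the one given in Table 1.

```python
def encode_exercise(exercise_idx, correct, num_exercises):
    """O(e_t, c_t): 2M-dim one-hot; index i if correct, else i + M (0-based)."""
    vec = [0] * (2 * num_exercises)
    offset = 0 if correct == 1 else num_exercises
    vec[exercise_idx + offset] = 1
    return vec


def encode_response_pair(predicted, correct):
    """O(a'_t, c_t): 4-bit one-hot over the (predicted, true) pairs.

    ASSUMPTION: position = 2 * predicted + correct. Table 1 defines the
    real assignment; this ordering is only illustrative.
    """
    vec = [0, 0, 0, 0]
    vec[2 * predicted + correct] = 1
    return vec


def build_input(exercise_idx, correct, predicted, num_exercises):
    """x_t = [O(e_t, c_t), O(a'_t, c_t)], of length 2M + 4."""
    return (encode_exercise(exercise_idx, correct, num_exercises)
            + encode_response_pair(predicted, correct))


# M = 3 exercises; the student answers exercise 1 incorrectly although
# the tree-based classifier predicted a correct answer.
x_t = build_input(exercise_idx=1, correct=0, predicted=1, num_exercises=3)
print(x_t)  # [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```

Exactly two entries of the vector are set, one in the exercise part and one in the response part, whatever the value of M.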

Models

Tree-Based Classifiers

We apply the following tree-based classifiers, CART [39], random forest [4], and GBDT [18], to automatically partition the feature space and output the predicted response whether a student will correctly answer an exercise. We briefly introduce these three models in the following.

At each node, CART performs a binary partition that groups interactions of the same class together, selecting the split by the Gini index or the information gain. Here, we take the information gain as an example in the following formulation. Given a set \(S\) at a node that contains training data \(a_{t} \in \mathbb {R}^{n}\) and the corresponding labels \(c_{t} \in \{0,1\}\), CART partitions the data into two subsets

$$S_{l} = \{(a_{t}, c_{t})|a_{t, j} < t\},\quad \text{ and }\quad S_{r} = S\setminus S_{l} $$

where \(j\) is the splitting variable and the threshold \(t\) (distinct from the time index) is determined by minimizing the weighted impurity of the two subsets, which is equivalent to maximizing the information gain:

$$(j^{*}, t^{*}) = \underset{j, t}{\arg\min}\, G(S, j, t) \triangleq \frac{|S_{l}|}{|S|}H(S_{l})+ \frac{|S_{r}|}{|S|}H(S_{r}), $$

where |⋅| denotes the cardinality of a set, i.e., the number of elements in the set, H(⋅) is the impurity measured by the cross entropy, and G(⋅) is the resulting weighted impurity of the split. For a region R with N observations, the cross entropy H is defined by

$$H(R) = -\sum\limits_{k} p_{k}\log(p_{k}), \text{ where } p_{k} = \frac{1}{N}\sum\limits_{a_{t}\in R} I(c_{t} = k). $$

In binary classification, \(k\) ranges over the label set, which can be taken as {0,1} or {− 1,+ 1}.

By minimizing the cross entropy, CART learns a set of classification rules. At time \(t\), the heterogeneous feature \(a_{t}\) is fed into the root of CART and follows the path assigned by the classification rules until it reaches a predicted response \(a^{\prime }_{t}\).
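The split search at a single node can be sketched as follows. This is a minimal, unoptimized illustration that assumes an exhaustive search over the observed feature values as candidate thresholds; the feature columns in the toy data are hypothetical and the function names are not from the original work.

```python
import math


def entropy(labels):
    """H = -sum_k p_k log(p_k) over the classes present in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for k in set(labels):
        p = labels.count(k) / n
        h -= p * math.log(p)
    return h


def weighted_entropy(rows, labels, j, threshold):
    """G(S, j, t): size-weighted entropy of S_l (a_j < t) and S_r."""
    left = [c for a, c in zip(rows, labels) if a[j] < threshold]
    right = [c for a, c in zip(rows, labels) if a[j] >= threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


def best_split(rows, labels):
    """Search all variables and observed values for the (j, t) minimizing G."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({a[j] for a in rows}):
            g = weighted_entropy(rows, labels, j, threshold)
            if best is None or g < best[0]:
                best = (g, j, threshold)
    return best


# Hypothetical feature columns: [attempt_count, hint_count]
rows = [[1, 0], [1, 2], [3, 0], [4, 1]]
labels = [1, 1, 0, 0]
g, j, threshold = best_split(rows, labels)
print(j, threshold)  # 0 3
```

On this toy node the first column with threshold 3 separates the two classes perfectly, so the weighted entropy drops to zero.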

Random forest learns an ensemble of decision trees by growing a bag of trees with bootstrap samples and variable subsets [4]. Suppose there are \(B\) trees in the bag; for each tree, a bootstrap sample \(Z\) of size \(N\) is drawn. At each node of the corresponding tree \(T_{b}\), a subset of variables is selected at random and the node is split according to the same criteria as CART until the minimum node size is reached. The class prediction is \(\hat{C}^{B}(a_{t}) = \text {majority vote}\{a_{t}^{{\prime }b}\}^{B}_{b = 1}\), where \(a_{t}^{{\prime }b}\) is the class prediction of a particular tree \(b\).
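The three random-forest ingredients named above (bootstrap sampling, random variable subsets, and majority voting) can be sketched as below. The helper names are hypothetical and tree growing itself is omitted; the per-tree predictions in the example are made-up values used only to show the vote.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (the sample Z of size N)."""
    return [rng.choice(data) for _ in data]


def random_variable_subset(num_variables, subset_size, rng):
    """Pick the candidate splitting variables for one node."""
    return sorted(rng.sample(range(num_variables), subset_size))


def majority_vote(tree_predictions):
    """Return the most frequent class among the B per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
sample = bootstrap_sample([10, 20, 30, 40, 50], rng)
candidates = random_variable_subset(num_variables=12, subset_size=3, rng=rng)

# Hypothetical predictions from B = 5 trees for one interaction.
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

Using an odd number of trees avoids ties in binary classification; with an even number, this sketch returns whichever tied class was seen first.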

GBDT is also a tree-based ensemble method; it achieves better performance by reducing bias rather than variance [18]. The ensemble is learned by iteratively adding a CART that minimizes the following objective function:

$$H^{(t)} = \sum\limits_{i = 1}^{n} l(y_{i}, \hat{y}_{i}^{(t)}) + \sum\limits_{i = 1}^{t}{\Omega}(f_{i}) $$

where \(n\) is the number of training samples, \(l(\cdot )\) is the loss function, and \({\Omega }(f_{i})\) is the regularization term. \(\hat {y}_{i}^{(t)}\) is the predicted value for the \(i\)-th sample at the \(t\)-th step and is defined by \(\hat {y}_{i}^{(0)} = 0\) and \(\hat {y}_{i}^{(t)} = {\sum }_{k = 1}^{t} f_{k}(a_{i}) = \hat {y}_{i}^{(t-1)} + f_{t}(a_{i})\), where \(f_{k}\) is the CART learned at the \(k\)-th step and \(f_{k}(a_{i})\) is the corresponding predicted value on the feature \(a_{i}\).
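The additive update of the prediction can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the original work: the loss is the squared error, each weak learner is a one-split regression stump fitted to the current residuals, the feature is one-dimensional, and the regularization term is dropped.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump (a depth-1 tree) to the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        if not left or not right:
            continue  # a valid split needs samples on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm


def boost(xs, ys, steps):
    """Start from y_hat = 0 and add one weak learner f_t per step."""
    preds = [0.0] * len(xs)
    trees = []
    for _ in range(steps):
        residuals = [y - p for y, p in zip(ys, preds)]
        f_t = fit_stump(xs, residuals)
        trees.append(f_t)
        # y_hat^(t) = y_hat^(t-1) + f_t(a)
        preds = [p + f_t(x) for p, x in zip(preds, xs)]
    return trees, preds


xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 1.0, 1.0]
trees, preds = boost(xs, ys, steps=2)
print(preds)  # [0.0, 0.0, 1.0, 1.0]
```

The prediction for a new sample is the sum of the outputs of all learned trees, which mirrors the running sum in the definition of the predicted value.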

It is noted that the information in the heterogeneous features is therefore implicitly captured by the predicted response \(a^{\prime }_{t}\), which embeds information about how a student deviates from others in the exercises. This personalized information is then utilized in the DKT models.

Recurrent Neural Networks

A recurrent neural network is a neural network that simulates a discrete-time dynamical system with an input \(\mathbf {x}_{t}\), an output \(\mathbf {y}_{t}\), and a hidden state \(\mathbf {h}_{t}\), where \(t\) represents time. The dynamical system is defined by

$$\mathbf{h}_{t} = f_{h}(\mathbf{x}_{t}, \mathbf{h}_{t-1}) \tag{1}$$
$$\mathbf{y}_{t} = f_{o}(\mathbf{h}_{t}), \tag{2}$$

where \(f_{h}\) and \(f_{o}\) are the state transition function and the output function, respectively. Each function is parameterized by a set of parameters, \(\theta _{h}\) and \(\theta _{o}\).

LSTM is proven to be an efficient RNN in modeling long-term dependency through a collection of cells and gates. The cell states are controlled by gates to decide whether to store or remove the information, which facilitates complex interaction of current input and history. The hidden state and output are updated by the following set of equations:

$$i_{t} = \sigma(\mathbf{W}_{xi}\mathbf{x}_{t} + \mathbf{W}_{hi}\mathbf{h}_{t-1} + \mathbf{W}_{ci}c_{t-1}+b_{i}) \tag{3}$$
$$f_{t} = \sigma(\mathbf{W}_{xf}\mathbf{x}_{t} + \mathbf{W}_{hf}\mathbf{h}_{t-1} + \mathbf{W}_{cf}c_{t-1}+b_{f}) \tag{4}$$
$$c_{t} = f_{t}c_{t-1} + i_{t}\tanh(\mathbf{W}_{xc}\mathbf{x}_{t} + \mathbf{W}_{hc}\mathbf{h}_{t-1}+b_{c}) \tag{5}$$
$$o_{t} = \sigma(\mathbf{W}_{xo}\mathbf{x}_{t} + \mathbf{W}_{ho}\mathbf{h}_{t-1} + \mathbf{W}_{co}c_{t-1}+b_{o}) \tag{6}$$
$$\mathbf{h}_{t} = o_{t}\tanh(c_{t}) \tag{7}$$

where σ is the sigmoid function. The components of LSTM, denoted as i, f, c, and o, are input gate, forget gate, cell activation vector, and output gate, respectively.
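A single update following Eqs. (3)–(7) can be written out directly. The sketch below uses plain Python lists, treats the cell-to-gate weights as dense matrices as the bold notation suggests (an assumption), and names the parameters after their subscripts; `zero_params` is a hypothetical helper that exists only to make the result easy to verify by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vectors, bias):
    """Compute sum_k W_k v_k + b for paired matrices and vectors."""
    out = list(bias)
    for W, v in zip(weights, vectors):
        for r, row in enumerate(W):
            out[r] += sum(w * x for w, x in zip(row, v))
    return out


def lstm_step(x, h_prev, c_prev, p):
    """One update following Eqs. (3)-(7); p maps names such as 'Wxi' to weights."""
    i = [sigmoid(z) for z in affine([p["Wxi"], p["Whi"], p["Wci"]],
                                    [x, h_prev, c_prev], p["bi"])]
    f = [sigmoid(z) for z in affine([p["Wxf"], p["Whf"], p["Wcf"]],
                                    [x, h_prev, c_prev], p["bf"])]
    g = [math.tanh(z) for z in affine([p["Wxc"], p["Whc"]],
                                      [x, h_prev], p["bc"])]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    o = [sigmoid(z) for z in affine([p["Wxo"], p["Who"], p["Wco"]],
                                    [x, h_prev, c_prev], p["bo"])]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def zero_params(input_dim, hidden_dim):
    """Hypothetical all-zero parameters, so every gate evaluates to 0.5."""
    p = {}
    for gate in "ifco":
        p["Wx" + gate] = [[0.0] * input_dim for _ in range(hidden_dim)]
        p["Wh" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        p["Wc" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        p["b" + gate] = [0.0] * hidden_dim
    return p


h, c = lstm_step([1.0, 0.0], [0.0], [2.0], zero_params(2, 1))
print(c)  # [1.0]: the forget gate of 0.5 halves the previous cell state
```

With zero weights the input and forget gates are both 0.5 and the candidate activation is 0, so the new cell state is half the previous one and the hidden state is 0.5 · tanh(1).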

In training an RNN/LSTM, we are given a set of \(N\) training sequences \(D=\{(\mathbf {x}_{t}^{(n)}, \mathbf {y}_{t}^{(n)})\}_{t = 1}^{T_{n}}\), where \(n = 1,\ldots , N\), and estimate the parameters by minimizing the following objective function:

$$ J(\theta) \propto \sum\limits_{n = 1}^{N}\sum\limits_{t = 1}^{T_{n}} \mathcal{L}(\mathbf{y}_{t}^{(n)}, f_{o}(\mathbf{h}_{t}^{(n)})), \tag{8}$$

where \(\theta \) denotes all the W’s and b’s defined in Eqs. (3)–(7). \(\mathcal {L}(\mathbf {a}, \mathbf {b})\) is a predefined divergence measure between a and b, such as the Euclidean distance or the cross-entropy.

The input \(\mathbf {x}_{t}=[O(e_{t}, c_{t}), O(a^{\prime }_{t}, c_{t})]\), capturing students’ exercise performance, is fed into an LSTM to learn the hidden structure of the sequence of exercises, which represents the knowledge components. The hidden state \(\mathbf {h}_{t}\) is then passed to a fully connected layer with a sigmoid activation function to obtain the output \(\mathbf {y}_{t}\), which denotes the predicted probability that a student will correctly answer the next exercise. In this setting, we predict the probability for all \(M\) exercises because we do not know which one is the next exercise.

Prediction

In the test, the average loss is computed by the binary cross-entropy defined as follows:

$$L = -{1\over N}\sum\limits_{n = 1}^{N}\sum\limits_{t={t_{0}^{n}}}^{{t_{0}^{n}}+T^{n}} \left[\mathbf{c}^{n}_{t + 1}\log\hat{\mathbf{y}}^{n}_{t}+(1-\mathbf{c}^{n}_{t + 1})\log(1-\hat{\mathbf{y}}^{n}_{t})\right], $$

where \(N\) is the number of students, \({t_{0}^{n}}\) is the starting index for the \(n\)-th student in the test set, and \(T^{n}\) is the number of exercises for the student. The predicted value \(\hat {\mathbf {y}}^{n}_{t}\) is the inner product of the predicted output and the one-hot encoding of the exercise conducted by student \(n\), i.e., \(\hat {\mathbf {y}}^{n}_{t}={{\mathbf {y}_{t}^{n}}}^{\top } O(\mathbf {e}_{t + 1}^{n})\), which picks out the predicted probability that student \(n\) answers the exercise at the next time stamp correctly.
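The selection of the relevant output entry and the loss computation can be sketched as follows. This is an illustration under stated assumptions: the loss is written as a negative log-likelihood, probabilities are clipped to avoid taking the logarithm of zero, and the outputs and responses in the example are made-up values for a single student.

```python
import math


def select_prediction(y_t, next_exercise):
    """Inner product of y_t with the one-hot encoding of exercise e_{t+1}."""
    one_hot = [1.0 if m == next_exercise else 0.0 for m in range(len(y_t))]
    return sum(y * o for y, o in zip(y_t, one_hot))


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Summed negative log-likelihood of the true responses c_{t+1}."""
    total = 0.0
    for c, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return total


# Hypothetical outputs y_t over M = 2 exercises at two time stamps.
outputs = [[0.9, 0.2], [0.3, 0.8]]
next_exercises = [0, 1]   # exercise attempted at t + 1
next_correct = [1, 1]     # true response c_{t+1}

probs = [select_prediction(y, e) for y, e in zip(outputs, next_exercises)]
loss = binary_cross_entropy(next_correct, probs)
print(probs)  # [0.9, 0.8]
```

Averaging this per-student sum over all N students in the test set gives the reported loss.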

Experiments

In the experiments, we address the following issues:

  1) What is the performance of our proposal compared to the baseline methods?

  2) What is the effect of the encoding scheme?

  3) What is the importance of the features learned?

  4) What is the effect of tree-based classifiers on DKT, and what do the visualized trees show?

“Model Comparison”–“Effect of Tree-Based Classifiers and Tree Results” answer the above questions in order.

Datasets

In the following, we conduct experiments on two popular educational datasets collected from the computer-based online learning platforms [7, 53]. The datasets are:

  • ASSISTments 2009-2010 (footnote 1) [53]: The dataset consists of 4,151 students exercising on 124 knowledge components with 332,343 interactions (records). It is also called the mastery learning data because a student is considered to have mastered a skill when meeting certain criteria.

  • Junyi Academy (footnote 2) [6]: The dataset is crawled from a Chinese e-learning platform built on the open-source code released by Khan Academy. The dataset contains students’ exercises in mathematics. We select the 1,000 most active students from the exercise log, which yields 666 knowledge components and 971,402 records.

Table 2 summarizes the basic statistics of the datasets, including the number of users, skills (knowledge components), records (exercise interactions), and the heterogeneous features used. In the following, ASSISTments and Junyi are set in bold to denote the corresponding datasets. In ASSISTments, we use the following 12 features:

  • original: a binary feature that records whether the exercise is the main problem or a scaffolding problem, i.e., whether the original problem is broken into several steps.

  • attempt_count: a numerical value that records the number of attempts a student makes to answer the exercise.

  • ms_first_response: a numerical value that records the time in milliseconds between the start time and the first action of the student, e.g., asking for a hint or entering an answer.

  • answer_type: a categorical feature that records whether the answer is entered by the student or selected from multiple choices.

  • assistments_position: the position of the assignment on the class assignments page.

  • type (problem_set_type): the organization of contents in the problem set. It has three classes:

    • Linear: The student completes all problems in a pre-determined order.

    • Random: The student completes all problems, but each student is presented with the problems in a different random order.

    • Mastery: The order is random, and the student must “master” the problem set by getting a certain number of questions correct in a row before being able to continue.

  • hint_count: an integer that records the number of hints a student requests while practicing the exercise.

  • hint_total: an integer that records the number of possible hints on the problem. Note that each problem has a different number of hints.

  • overlap_time: a numerical value that records the time in milliseconds the student takes to complete the problem.

  • first_action: a categorical feature that records the first action the student performs on the problem: 0 for attempting to solve it, 1 for asking for a hint, 2 for scaffolding, and 3 for doing nothing.

  • opportunity: an integer that records the number of opportunities the student has had to practice the skill.

  • tutor_mode: a categorical feature that indicates whether the exercise is in the tutor mode, the test mode, the pre-test mode, or the post-test mode.

Table 2 Summary of the datasets

In Junyi, there are ten heterogeneous features:

  • problem number: an integer that records how many times the student has practiced the exercise. For example, the value is 1 if the student tries to answer the exercise for the first time.

  • topic mode: a binary feature that records whether the student is assigned this exercise by clicking the topic icon.

  • suggested: a binary feature that records whether the exercise is suggested by the system according to prerequisite relationships on the knowledge map.

  • review mode: a binary feature that records whether the exercise is done by the student after he/she earns proficiency.

  • time: a numerical value that records the time (in seconds) a student spends on the exercise.

  • time taken: a numerical value that records the total number of seconds the student spends on this exercise.

  • attempt counts: an integer that records how many times the student attempts to answer the problem.

  • hints used: a binary feature that records whether the student requests hints.

  • count hints: an integer that records how many times the student requests hints.

  • earn proficiency: a binary feature that records whether the student reaches proficiency. Please refer to [21] for the algorithm for computing proficiency.

Both datasets also contain other non-numerical and non-categorical features, which are removed in the experiment. Detailed explanations of the features can be found at the following two links:

Model Comparison

A five-fold student-level cross-validation is conducted in the test. The results are evaluated by the area under the ROC curve (AUC) and \(R^{2}\), two standard metrics for evaluating predictive performance [22, 38, 53]. The following models with different feature processing are compared:

  • Deep Knowledge Tracing (DKT) [38]: the input feature is the one-hot encoding of the exercise tags.

  • DKT with Feature Engineering (DKT-FE) [53]: Feature engineering has been conducted by manually selecting a subset of heterogeneous features and discretizing them by a certain pre-determined criterion while reducing the dimensionality of the input via autoencoder. The learned feature is concatenated with the one-hot encoding of the exercise tags as the input.

  • DKT without Feature Engineering (DKT-W): The selected heterogeneous features are the same as those of DKT-FE but without any further feature processing. The selected feature is directly concatenated with the one-hot encoding of the exercise tags as the input.

  • DKT with Decision Trees (DKT-CART): The selected heterogeneous feature is learned by CART to output the predicted response, which is represented by a 2-bit binary code, and concatenated with the 2-bit binary code of the true response and the one-hot encoding as the input of the LSTM.

  • DKT with Random Forest (DKT-RF): The selected heterogeneous feature is learned by the random forest to predict the corresponding response. The setting of the input is the same as that of DKT-CART.

  • DKT with GBDT (DKT-GBDT): The selected heterogeneous feature is learned by GBDT to predict the corresponding response. The setting of the input is the same as that of DKT-CART.

For the LSTM, we set the hidden dimension to 200 and train it via stochastic gradient descent with a mini-batch size of 5. Other parameters are left at their defaults in TensorFlow. For the tree-based classifiers, the parameters are left at their defaults in the Python toolbox scikit-learn.
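The student-level split and the AUC metric used in this comparison can be sketched with the standard library alone. The function names and the seed are hypothetical, and the sketch is not the evaluation code of the original work: it only shows that folds are formed over students rather than over individual records, and computes the AUC as the probability that a positive example is scored above a negative one.

```python
import random


def student_level_folds(student_ids, k=5, seed=0):
    """Partition unique student ids into k folds so no student spans two folds."""
    students = sorted(set(student_ids))
    random.Random(seed).shuffle(students)
    return [students[i::k] for i in range(k)]


def auc(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


folds = student_level_folds(range(10), k=5)
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In each round, the records of the students in one fold form the test set and the records of all remaining students form the training set.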

Table 3 reports the results of all compared methods. From the results, we have the following observations:

  • Our proposed DKT with tree-based classifiers attains significantly better performance than the other methods in terms of both the AUC and \(R^{2}\) metrics on both datasets. In particular, our proposed DKT-CART attains a 13% gain over DKT in the \(R^{2}\) metric.

  • Among the DKT models with tree-based classifiers, DKT-GBDT attains the best performance on both datasets, while DKT-RF attains the second best. The results show that ensemble methods can further improve the model performance on these two datasets.

  • An interesting observation is that including heterogeneous features without appropriate pre-processing degrades the performance of DKT. We conjecture that this may be due to the introduction of noise, which interferes with DKT’s ability to extract the similarity between exercises.

  • The degrading effect of DKT-W is highly dependent on the size of the dataset. The Junyi dataset is much larger than ASSISTments, which may help to mitigate this effect when training the LSTM.

  • The performance of DKT-FE is slightly poorer than that of DKT-W on the Junyi dataset. The reason is that we adopt the same criterion to process the features as for the ASSISTments data shown in [53]. The provided criterion does not extend to the new Junyi dataset.

  • Overall, the experimental results show that including additional features may improve the prediction accuracy, but this requires proper pre-processing.

Table 3 Experimental results on the compared models. The predicted and true responses are encoded into 4-bits in the proposed models

Effect of Encoding Scheme

One design question is whether the predicted and true responses should be encoded into 2 bits or into 4 bits, the latter following the one-hot design used for the exercise tags. To test the effect of these two settings, we change the inputs and conduct the experiments accordingly.

Tables 4 and 5 report the results on ASSISTments and Junyi, respectively, with respect to the number of encoded response units. We can observe that

  • In ASSISTments, DKT-RF and DKT-GBDT achieve better performance than DKT-CART in both the AUC and \(R^{2}\) metrics. This shows that ensemble classifiers can further improve the model performance.

  • In Junyi, the performance is not significantly different. DKT-CART with 2-bit units and DKT-GBDT with 4-bit units attain the best performance in the AUC metric, while DKT-CART performs slightly better than DKT-GBDT in terms of the \(R^{2}\) metric.

Table 4 Experimental results of the DKT model with different tree-based classifiers on ASSISTments with respect to the number of encoded response units
Table 5 Experimental results of the DKT model with different tree-based classifiers on Junyi with respect to the number of encoded response bits

Generally speaking, the 4-bit encoding achieves better or at least comparable performance relative to the 2-bit encoding. We hypothesize that this may be a consequence of “division of labor.” As the input \(\mathbf {x}_{t}\) is the concatenation of two one-hot encoding vectors, it selects and adds two columns of the input-to-hidden weight matrix, which contribute to the update of the hidden layer. If the 4-bit encoding is applied, the last four columns of the input-to-hidden weight matrix are dedicated to learning how the cross-effect of the predicted response and the true response contributes to knowledge tracing. Meanwhile, the original one-hot encoding features are left to simply learn the effect of accumulating proficiency through exercises. In contrast, for the 2-bit encoding, no column in the weight matrix is assigned to learn the cross-effect. The original one-hot encoding features have to learn not only the accumulation of proficiency but also the cross-effect, which increases the learning complexity.
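The column-selection argument can be checked numerically. The sketch below uses a single hypothetical input-to-hidden matrix with made-up entries (an LSTM has one such matrix per gate) and an input with M = 2 exercises, so the last four columns correspond to the 4-bit response code.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def column(W, j):
    """Extract column j of W."""
    return [row[j] for row in W]


M = 2                    # number of exercises
input_dim = 2 * M + 4    # exercise one-hot plus the 4-bit response code
hidden_dim = 3

# Hypothetical weights whose entries encode their own (row, column) position.
W = [[float(10 * r + c) for c in range(input_dim)] for r in range(hidden_dim)]

x = [0] * input_dim
x[1] = 1           # exercise part: one active entry among the first 2M
x[2 * M + 2] = 1   # response part: one active entry among the last 4

selected = [a + b for a, b in zip(column(W, 1), column(W, 2 * M + 2))]
print(matvec(W, x) == selected)  # True
```

Because only one of the last four entries is ever active, each (predicted, true) combination owns its own column of weights, which is the dedicated capacity referred to above.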

Importance of the Features

Tree-based classifiers have the power to evaluate the importance of features. In [4, 33], the importance of a variable can be measured by mean decrease impurity (MDI). That is, the importance of a variable X m for predicting Y is measured by adding up the weighted impurity decreases p(ti(s t , t) for all nodes t where X m is involved, averaged over all N T trees:

$$\begin{array}{@{}rcl@{}} Importance(X_{m}) = \frac{1}{N_{T}}\sum\limits_{T}\sum\limits_{t\in T: v(s_{t}) = X_{m}}p(t){\Delta} i(s_{t}, t) \end{array} $$

and \({\Delta} i(s_{t}, t) = i(t) - p_{L}\, i(t_{L}) - p_{R}\, i(t_{R})\), where \(p(t)\) is the proportion \(\frac {N_{t}}{N}\) of samples reaching \(t\), and \(p_{L}\), \(p_{R}\) are \(\frac {N_{t_{L}}}{N_{t}}\), \(\frac {N_{t_{R}}}{N_{t}}\), the proportions of samples in the left and right child nodes, respectively. \(v(s_{t})\) is the variable involved in split \(s_{t}\), and \(i(t)\) can be any impurity measure.
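As a worked illustration of the MDI definition above, the following sketch evaluates it on a tiny hand-built tree. The tree, its class counts, and the feature names `x1` and `x2` are hypothetical, and the Gini impurity is used as one possible choice of \(i(t)\); this is not the implementation used in the experiments.

```python
from collections import defaultdict


def gini(counts):
    """Gini impurity of a node from its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


# Hypothetical tree: internal nodes name their split feature, leaves use None.
tree = {
    "feature": "x1", "counts": [5, 5],
    "left": {"feature": None, "counts": [4, 0]},
    "right": {
        "feature": "x2", "counts": [1, 5],
        "left": {"feature": None, "counts": [1, 1]},
        "right": {"feature": None, "counts": [0, 4]},
    },
}


def mdi_single_tree(root):
    """Sum p(t) * delta_i(s_t, t) over the internal nodes of one tree."""
    total = sum(root["counts"])            # N, samples at the root
    importance = defaultdict(float)
    stack = [root]
    while stack:
        node = stack.pop()
        if node["feature"] is None:        # leaf: no split, no contribution
            continue
        left, right = node["left"], node["right"]
        n_t = sum(node["counts"])
        p_t = n_t / total                  # p(t) = N_t / N
        p_l = sum(left["counts"]) / n_t    # p_L = N_{t_L} / N_t
        p_r = sum(right["counts"]) / n_t   # p_R = N_{t_R} / N_t
        delta = (gini(node["counts"])
                 - p_l * gini(left["counts"])
                 - p_r * gini(right["counts"]))
        importance[node["feature"]] += p_t * delta
        stack.extend([left, right])
    return dict(importance)


def mdi_forest(trees):
    """Average the per-tree sums over all N_T trees."""
    totals = defaultdict(float)
    for root in trees:
        for feature, value in mdi_single_tree(root).items():
            totals[feature] += value
    return {feature: value / len(trees) for feature, value in totals.items()}


importance = mdi_forest([tree])
```

For this toy tree, the root split on `x1` contributes \(1 \cdot (0.5 - 0.6 \cdot \frac{10}{36}) = \frac{1}{3}\), and the split on `x2` contributes \(0.6 \cdot (\frac{10}{36} - \frac{1}{3} \cdot 0.5) = \frac{1}{15}\).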

Figure 2 shows the importance of the features in random forest for ASSISTments and Junyi, respectively. We observe the following:

  • Figure 2a shows that the features of ASSISTments are grouped into five sectors based on their importance. The feature original is the most important, with a mean weight over 0.3. It is followed by the features attempt_count, ms_first_response, tutor_mode, and answer_type, which have a mean weight around 0.12. The third sector consists of the features position, type, and hint_count, which have a mean weight of 0.05. The fourth sector consists of the features hint_total and overlap_time. The final sector consists of the features first_action and opportunity, which show no importance for the classification.

  • Figure 2b shows that the features of Junyi are grouped into four sectors based on their importance. The feature problem_number has a weight over 0.7. The second sector consists of the features topic_mode and suggested, which exhibit a mean weight around 0.1. The third sector consists of the features review_mode, time_taken, and count_attempts, whose mean weights are less than 0.1. The remaining four features, hint_used, count_hints, earned_proficiency, and time, are in the final sector and have negligible weights.

Overall, the importance of the features can guide us in understanding the data and extracting meaningful information to improve model performance.

Effect of Tree-Based Classifiers and Tree Results

We also record the performance of the tree-based classifiers in predicting the correctness of the student’s answer to the current exercise in Tables 6 and 7 for ASSISTments and Junyi, respectively. The results show that the adopted tree-based classifiers usually attain good performance on the prediction at the current state. The ensemble classifiers perform better than CART and accordingly achieve better performance in the corresponding DKT models.

Fig. 2 Feature importance measured on both test datasets. The importance of the variables is evaluated by mean decrease impurity. a ASSISTments. b Junyi

Table 6 Prediction accuracy of different tree-based classifiers given the heterogeneous features on ASSISTments
Table 7 Prediction accuracy of different tree-based classifiers given the heterogeneous features on Junyi

Another advantage of tree-based classifiers is the interpretability of their results. The decision tree can be visualized so that teachers and researchers can identify the latent factors that affect the probability of a correct prediction. Figure 3 shows parts of the trees learned by CART on both datasets. The learned trees can then be further analyzed to understand the importance of the features.

Fig. 3 Parts of the trees learned by CART on ASSISTments and Junyi. The color of a block indicates the majority class, i.e., the class with more training samples, in that node: blue denotes that the sample is correctly assigned to the majority class and red denotes incorrect assignment. The lightness of the color reflects the value of the gini coefficient, with a lighter color for a larger gini coefficient. In each block, the first row denotes the selected feature and its splitting threshold. “gini” stands for the gini coefficient. “samples” is the total number of samples assigned to and classified in that node. “value” denotes the number of samples in each class. The last row, “class,” indicates whether the current block is in the correct state
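To make the per-node fields concrete, the following minimal sketch derives the “gini”, “samples”, and majority-class entries from a node's “value” counts. It assumes that “gini” is the Gini impurity \(1 - \sum_{k} p_{k}^{2}\), as commonly reported by CART implementations; the function name `describe_node` and the example counts are hypothetical.

```python
def describe_node(value):
    """Summarize a tree node from its per-class sample counts ("value")."""
    samples = sum(value)                                   # "samples" row
    gini = 1.0 - sum((c / samples) ** 2 for c in value)    # "gini" row
    majority = max(range(len(value)), key=lambda k: value[k])
    return {"samples": samples, "gini": gini, "majority_class": majority}


# Hypothetical node holding 30 samples of class 0 and 10 samples of class 1.
node = describe_node([30, 10])
```

For these counts the node holds 40 samples, its majority class is class 0, and its impurity is \(1 - (0.75^{2} + 0.25^{2}) = 0.375\). A perfectly pure node would have an impurity of 0 and, under the coloring described in the caption, the darkest shade.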

Conclusions

We have proposed an effective method to pre-process the heterogeneous features and integrate the learned features implicitly into the original deep knowledge tracing model. The pre-processing step is conducted by tree-based classifiers, i.e., CART, random forest, and gradient boosting decision tree, which predict whether a student will correctly answer the current exercise given the heterogeneous features. This allows us to capture students’ behaviors on the exercises and to provide a good initialization to the DKT model. Our experiments on two educational datasets demonstrate the effectiveness and merits of our proposal.

Several directions for future work are worth considering. For example, we do not fully utilize the importance of the features; how to include such information in the RNN models is a significant research topic. In addition, the current setting only predicts exercises in a fixed set. It would be valuable to extend the current model to provide personalized recommendations that help students select appropriate exercises and conduct selective practice.