
1 Introduction

Nowadays, online learning platforms enable access to high-quality learning resources without the constraints of time and space. Students can study flexibly on computers and mobile devices, and can independently arrange their study plans and tasks. Because of this, millions of students are learning a variety of courses through online learning platforms. However, the large number of learners makes it difficult for online learning platforms to supervise students and provide personalized learning guidance. To provide such guidance, it is essential for online learning platforms to evaluate students’ knowledge states, which is also an important research topic in the field of intelligent education [1].

Knowledge tracing (KT) is a widely used model for predicting students’ knowledge states in intelligent online learning platforms [2]. KT models the interaction process between students and exercises based on students’ past exercise records to trace their knowledge states dynamically [3]. The goal of KT can be described as follows: given the interaction sequence of a student’s past exercises \({ } X = x_{1} ,x_{2} , \ldots ,x_{t}\), KT estimates the knowledge state of the student, which is used to predict the probability of a correct answer to the next exercise. Each input \(x_{t} = \left( {q_{t} , a_{t} } \right)\) contains the exercise \(q_{t}\) and the actual answer \(a_{t}\) [4].
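To make the task concrete, here is a minimal, hypothetical sketch of the KT prediction interface in Python. The function name and the naive per-exercise-accuracy baseline are our own illustration of the input/output contract, not a KT model from the literature:

```python
# Sketch of the KT task interface: given past interactions (q_t, a_t),
# predict P(correct) for the next exercise. The baseline below simply
# returns the per-exercise correct rate (0.5 when the exercise is unseen).
from typing import List, Tuple

def predict_next(history: List[Tuple[int, int]], next_q: int) -> float:
    """history: list of (exercise id q_t, answer a_t) with a_t in {0, 1}."""
    attempts = [a for q, a in history if q == next_q]
    return sum(attempts) / len(attempts) if attempts else 0.5

history = [(3, 1), (3, 0), (5, 1), (3, 1)]
p = predict_next(history, 3)  # 2 correct out of 3 attempts on exercise 3
```

A real KT model replaces the body of `predict_next` with a learned function of the whole history, but the inputs and output are exactly this shape.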

Using a KT model, online learning platforms can not only customize learning materials for students based on their knowledge states, but also provide students and teachers with feedback reports. Students can thus allocate their study schedules reasonably to maximize learning efficiency, and teachers can adjust their teaching plans and schemes in a timely manner.

At present, the field of knowledge tracing comprises two kinds of models: traditional KT models and deep learning-based KT models. Among the traditional models, the most typical is Bayesian Knowledge Tracing (BKT) [5], which models the state of each concept separately. BKT is therefore limited in capturing the correlations between different concepts and cannot effectively simulate knowledge state transitions across complex concepts. Researchers later applied deep learning to the KT task and proposed Deep Knowledge Tracing (DKT) [6]. In contrast to BKT, DKT uses a single hidden state to summarize the knowledge state of all concepts. By considering correlations between multiple concepts, DKT simulates students’ knowledge states better. However, unlike BKT, DKT cannot pinpoint which specific concepts a student has or has not mastered. Combining the advantages of BKT and DKT, DKVMN uses an external memory to store the student’s knowledge state [7], and its prediction performance surpasses both BKT and DKT.

However, existing KT models ignore two aspects of how students’ knowledge states change. First, in terms of knowledge application, students apply different concepts to the same exercise depending on their knowledge states. Second, according to the Ebbinghaus forgetting curve [8], forgetting is not uniform: students forget knowledge they have just learned from exercises very quickly, while knowledge learned earlier is forgotten more slowly. Existing models are limited in distinguishing these degrees of forgetting.

Based on the external memory mechanism of DKVMN and inspired by the gating mechanism of GRU [9], this paper designs a knowledge update network and proposes a knowledge tracing model based on a Dynamic Key-Value Gated Recurrent Network (DKVGRU). From large-scale exercise data, DKVGRU uses the Key-Value matrices of DKVMN to explore the relationship between exercises and underlying concepts while tracing the knowledge state of each concept. We design two knowledge gates to simulate the change of students’ knowledge states: the knowledge application gate calculates the proportion of knowledge concepts applied by students in solving exercises, and the knowledge forgetting gate measures the degree of forgetting of learned knowledge.

2 Related Work

There are two main types of KT models: traditional KT models and deep learning-based KT models. In this chapter, we first introduce BKT, DKT, and DKVMN. Since DKVGRU is inspired by the gating mechanism, this chapter also introduces the Recurrent Neural Network (RNN) [10] and its variants, which can capture long-term relationships in sequence data.

2.1 Bayesian Knowledge Tracing

BKT is the most commonly used traditional KT model; it was introduced to the field of intelligent education by Corbett and Anderson and applied to intelligent tutoring systems in 1995 [11]. BKT assumes that concepts are independent of one another and that students have only two states for each concept: mastered or not mastered. As shown in Fig. 1, BKT uses a Hidden Markov Model (HMM) to model each concept separately, updating the state of a concept with the help of two learning parameters and two performance parameters. The original BKT assumes that students do not forget knowledge while learning, which is clearly at odds with students’ actual learning patterns [12]. Researchers have since optimized BKT along several dimensions, including forgetting parameters [13], exercise difficulty [14], personalized parameters [15], and emotions [16].

Fig. 1. The architecture of BKT
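As a sketch of the update described above, the following Python snippet implements the standard BKT equations: a Bayesian posterior on mastery given the observed answer (using the guess parameter \(P(G)\) and slip parameter \(P(S)\)), followed by the learning transition \(P(T)\). The parameter values are arbitrary illustrations, not fitted values:

```python
def bkt_update(pL, correct, pT=0.1, pG=0.2, pS=0.1):
    """One BKT step: posterior on mastery given the answer, then learning transition.
    pL is the current mastery estimate; pT, pG, pS are illustrative parameter values."""
    if correct:
        post = pL * (1 - pS) / (pL * (1 - pS) + (1 - pL) * pG)
    else:
        post = pL * pS / (pL * pS + (1 - pL) * (1 - pG))
    return post + (1 - post) * pT  # transition: unmastered -> mastered with prob. pT

def bkt_predict(pL, pG=0.2, pS=0.1):
    """Probability of a correct answer given the current mastery estimate."""
    return pL * (1 - pS) + (1 - pL) * pG

pL = 0.3                   # prior mastery P(L0)
pL = bkt_update(pL, True)  # the mastery estimate rises after a correct answer
```

Because each concept gets its own independent chain of this form, BKT cannot share evidence across related concepts, which is exactly the limitation discussed in the introduction.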

2.2 Deep Learning-Based Knowledge Tracing

In 2015, Piech et al. first applied deep learning to the KT task and proposed DKT based on RNN and Long Short-Term Memory (LSTM) [17]. As illustrated in Fig. 2, DKT represents a student’s continuous knowledge state using a high-dimensional hidden state, and without manual annotation it can automatically discover the relationships between concepts from exercises. Using the forgetting gate of LSTM, DKT can simulate the knowledge forgetting that occurs during learning. Khajah et al. showed that the advantage of DKT lies in its ability to exploit statistical regularities in the data that BKT cannot use [18]. Yeung et al. added a regularization term to the loss function to address two problems of DKT: inaccuracy and instability [19]. Xiong et al. argue that DKT is a promising KT method if more features can be modeled, such as student abilities, learning time, and exercise difficulty [20]. Many variants have since been proposed by incorporating dynamic student classification [21], side information [22], and other features [23] into DKT.

Fig. 2. The architecture of DKT based on RNN

The Memory Augmented Neural Network (MANN) [24] uses an external memory module to store information, which provides stronger storage capacity than a single high-dimensional hidden state, and it can rewrite local information through the external memory mechanism. Unlike a general MANN, which uses a simple memory matrix or two static memory matrices [25], DKVMN utilizes a key-value matrix pair to store all concepts and the knowledge state of each concept: the key matrix is used to calculate the correlation between exercises and concepts, and the value matrix is used to read and write the knowledge state of each concept. Ha et al. [26] optimized DKVMN in terms of knowledge growth and regularization.

2.3 Recurrent Neural Network

For sequence data, researchers generally use RNNs to capture relationships in the data. However, an RNN cannot effectively capture long-term dependencies because of its structural defects. Hochreiter et al. proposed LSTM in 1997 to solve this long-term dependency problem, using three gates to handle long-term and short-term dependence effectively. Cho et al. then proposed GRU in 2014 by simplifying the structure of LSTM, which preserves model performance while improving training efficiency [27]. GRU uses two gates to determine which information is memorized, forgotten, and output, effectively achieving long-term tracing of information. As shown in Fig. 3, the reset gate generates a weight that decides how much historical information is used given the current input, and the update gate determines the proportions of historical memory and current memory in the new memory.

Fig. 3. The architecture of GRU
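The gate computations of Fig. 3 can be sketched with NumPy as follows. Bias terms are omitted for brevity, and the weight shapes and random inputs are illustrative only:

```python
import numpy as np

def gru_cell(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step (biases omitted for brevity)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h_prev @ Uz)              # update gate: mix of old vs. new memory
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate: how much history to use
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate (current) memory
    return (1 - z) * h_prev + z * h_tilde          # new hidden state

d = 4                                              # illustrative hidden size
rng = np.random.default_rng(0)
x, h0 = rng.normal(size=(1, d)), np.zeros((1, d))
weights = [rng.normal(size=(d, d)) for _ in range(6)]
h1 = gru_cell(x, h0, *weights)                     # shape (1, d)
```

The final interpolation `(1 - z) * h_prev + z * h_tilde` is the pattern DKVGRU reuses in its write process (Sect. 3.4), with the forgetting gate playing the role of `z`.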

3 Model

DKVGRU can be divided into three parts, shown in Fig. 4: the correlation weight, the read process, and the write process. The correlation weight represents the weight of each concept contained in an exercise. The read process reads the student’s current memory, which is used to predict the student’s performance on a new exercise. The write process updates the student’s memory state after an exercise is answered. The correlation weight and read process follow DKVMN. The three parts are described in Sects. 3.2, 3.3, and 3.4, respectively.

Fig. 4. The framework of DKVGRU. The green part is the write process we designed; the blue and purple parts are the correlation weight and read process, which follow DKVMN.

3.1 Related Definitions

Given a student’s past interaction sequence of exercises \(X = x_{1} ,x_{2} , \ldots ,x_{t - 1}\), our task is to obtain the student’s current knowledge state from this sequence and predict the student’s performance on the next exercise. The interaction tuple \(x_{t} = \left( {q_{t} , a_{t} } \right)\) represents the student’s answer to the exercise \(q_{t}\), where \(a_{t} = 1\) indicates a correct answer and \(a_{t} = 0\) an incorrect one.

Table 1. Symbols

As illustrated in Table 1, the symbols used in the model are defined as follows: \(N\) denotes the number of concepts, which are stored in the key matrix \(K\)(\(N \times d_{k}\)), and the knowledge state of each concept is stored in the value matrix \(V\)(\(N \times d_{v}\)).

3.2 Correlation Weight

Each exercise contains multiple concepts. The exercise \(q_{t}\) is first mapped into a vector \(e_{t} \in R^{{d_{k} }}\) by an embedding matrix \(A\). The correlation weight \(w_{t} \in R^{N}\) is computed by taking the softmax of the inner products between \(e_{t}\) and each \(k_{i}\) of the key matrix \(K = \left( {k_{1} ,k_{2} , \ldots ,k_{N} } \right)\):

$$ w_{t} = Softmax\left( {e_{t} \cdot K^{T} } \right). $$
(1)

\(k_{i}\) is the key memory slot that stores the \(i^{th}\) concept, and \(w_{t}\) measures the correlation between the exercise and each concept.
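Eq. (1) can be sketched with NumPy as follows; the sizes \(N = 5\), \(d_k = 8\) and the random initialization are illustrative only (in the model, \(K\) and the embedding are learned):

```python
import numpy as np

def correlation_weight(e_t, K):
    """Eq. (1): softmax of the inner products between e_t and each key slot k_i."""
    scores = K @ e_t            # (N,) inner products e_t . k_i
    scores -= scores.max()      # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

N, d_k = 5, 8                   # illustrative sizes
rng = np.random.default_rng(42)
K = rng.normal(size=(N, d_k))   # key matrix: one slot per latent concept
e_t = rng.normal(size=d_k)      # embedding of exercise q_t
w_t = correlation_weight(e_t, K)  # non-negative weights summing to 1
```

The softmax guarantees that \(w_t\) is a distribution over concepts, so the same weights can be reused consistently by both the read and write processes.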

3.3 Read Process

The probability of answering \(q_{t}\) correctly depends on two factors: the student’s current knowledge state and the exercise difficulty. First, \(w_{t}\) is multiplied by each \(v_{i}\) of the value matrix \(V = \left( {v_{1} ,v_{2} , \ldots ,v_{N} } \right)\) to obtain the read content vector \(r_{t} \in R^{{d_{v} }}\):

$$ r_{t} = w_{t} \cdot V_{t} . $$
(2)

\(v_{i}\) is the value memory slot that stores the state of the \(i^{th}\) concept, and the read content \(r_{t}\) is regarded as the student’s overall mastery of \(q_{t}\).

Then, to account for the difficulty of \(q_{t}\), the exercise vector \(e_{t}\) is passed through a fully connected layer with a \(Tanh\) activation to obtain the difficulty vector \(d_{t} \in R^{{d_{k} }}\):

$$ d_{t} = Tanh\left( {e_{t} \cdot W_{1} + b_{1} } \right), $$
(3)
$$ Tanh\left( x \right) = \frac{{1 - {\text{e}}^{{ - 2{\text{x}}}} }}{{1 + {\text{e}}^{{ - 2{\text{x}}}} }}, $$
(4)

where \({\text{W}}_{{\text{i}}}\) and \( b_{i}\) denote the weight and bias of the corresponding fully connected layer.

The summary vector \(f_{t}\) is obtained by concatenating the read content vector \(r_{t}\) and the difficulty vector \(d_{t}\):

$$ f_{t} = Tanh\left( {\left[ {r_{t} ;d_{t} } \right] \cdot W_{2} + b_{2} } \right). $$
(5)

Finally, the probability \(p_{t}\) is computed from the summary vector \(f_{t}\):

$$ p_{t} = Sigmoid\left( {f_{t} \cdot W_{3} + b_{3} } \right), $$
(6)
$$ Sigmoid\left( x \right) = \frac{1}{{1 + {\text{e}}^{{ - {\text{x}}}} }}. $$
(7)

The \(Sigmoid\) function constrains the probability \(p_{t}\) to lie between 0 and 1.
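Eqs. (2)–(7) can be sketched together in NumPy. All weight shapes, including the assumed summary dimension `d_f`, and the random values are illustrative; in the model these parameters are learned:

```python
import numpy as np

def read_process(w_t, V_t, e_t, W1, b1, W2, b2, W3, b3):
    """Eqs. (2)-(7): read content r_t, difficulty d_t, summary f_t, probability p_t."""
    r_t = w_t @ V_t                                      # Eq. (2): weighted read of value slots
    d_t = np.tanh(e_t @ W1 + b1)                         # Eq. (3): difficulty vector
    f_t = np.tanh(np.concatenate([r_t, d_t]) @ W2 + b2)  # Eq. (5): summary vector
    logit = f_t @ W3 + b3
    return float(1.0 / (1.0 + np.exp(-logit[0])))        # Eq. (6): sigmoid probability

N, d_k, d_v, d_f = 5, 8, 8, 10   # illustrative sizes (d_f: assumed summary dimension)
rng = np.random.default_rng(0)
w_t = np.full(N, 1.0 / N)                      # correlation weight from Eq. (1)
V_t = rng.normal(size=(N, d_v))                # value matrix (knowledge states)
e_t = rng.normal(size=d_k)                     # exercise embedding
W1, b1 = rng.normal(size=(d_k, d_k)), np.zeros(d_k)
W2, b2 = rng.normal(size=(d_v + d_k, d_f)), np.zeros(d_f)
W3, b3 = rng.normal(size=(d_f, 1)), np.zeros(1)
p_t = read_process(w_t, V_t, e_t, W1, b1, W2, b2, W3, b3)
```

Note how the exercise-specific difficulty \(d_t\) enters only the prediction path, not the memory itself, so difficulty does not contaminate the stored knowledge state.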

3.4 Write Process

The knowledge state of each concept is updated after the student answers the exercise \(q_{t}\). The interaction tuple \(x_{t} = \left( {q_{t} ,a_{t} } \right)\) is encoded as a number \(y_{t} = q_{t} + a_{t} *E\), where \(y_{t}\) represents the student’s interaction information. Then \(y_{t}\) is converted into an interaction vector \(c_{t} \in R^{{d_{v} }}\) by an embedding matrix \(B \in R^{{E \times d_{v} }}\). Considering that students apply knowledge to an exercise according to their knowledge state, we add the interaction vector \(c_{t}\) to each value memory slot \(v_{i}\) of the value matrix \(V_{t}\) and pass the result through a fully connected layer with an activation function to obtain the knowledge application gate \(Z_{t} \in R^{{N \times d_{v} }}\):

$$ C_{t} = Concat\left( {c_{t} ,c_{t} , \ldots ,c_{t} } \right), $$
(8)
$$ Z_{t} = Sigmoid\left( {\left[ {V_{t} + C_{t} } \right] \cdot W_{z} + b_{z} } \right). $$
(9)

\(Z_{t}\) calculates the proportion of each concept applied in the exercise. The application knowledge state \(D_{t} \in R^{{N \times d_{v} }}\) is obtained by using \(Z_{t}\) to weight the value matrix \(V_{t}\):

$$ D_{t} = Z_{t} *V_{t} . $$
(10)

Then, we concatenate the interaction vector \(c_{t}\) with each row \(d_{i}\) of \(D_{t} = \left( {d_{1} ,d_{2} , \ldots ,d_{N} } \right)\) to obtain the knowledge growth matrix \(\tilde{V}_{t} \in R^{{N \times d_{v} }}\):

$$ \tilde{V}_{t} = Tanh\left( {\left[ {D_{t} ;C_{t} } \right] \cdot W_{r} + b_{r} } \right). $$
(11)

To measure the student’s degree of forgetting, we add the interaction vector \(c_{t}\) to each value memory slot \(v_{i}\) of the value matrix \(V_{t}\) to obtain the knowledge forgetting gate \(U_{t} \in R^{{N \times d_{v} }}\):

$$ U_{t} = Sigmoid\left( {\left[ {V_{t} + C_{t} } \right] \cdot W_{u} + b_{u} } \right). $$
(12)

Each concept state of the value matrix \(V_{t}\) is updated via \(U_{t}\): \(\left( {1 - U_{t} } \right)*V_{t}\) represents the retained part of the previous knowledge state, and \(U_{t} *\tilde{V}_{t}\) represents the newly absorbed knowledge gained from this exercise. \(V_{t + 1}\) is the student’s new knowledge state.

$$ V_{t + 1} = \left( {1 - U_{t} } \right)*V_{t} + U_{t} *\tilde{V}_{t} . $$
(13)
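Eqs. (8)–(13) of the write process can be sketched in NumPy as follows; the sizes and random parameters are illustrative (in the model they are learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def write_process(V_t, c_t, Wz, bz, Wr, br, Wu, bu):
    """Eqs. (8)-(13): application gate Z_t, growth V~_t, forgetting gate U_t, update."""
    N, d_v = V_t.shape
    C_t = np.tile(c_t, (N, 1))                     # Eq. (8): broadcast c_t to every slot
    Z_t = sigmoid((V_t + C_t) @ Wz + bz)           # Eq. (9): knowledge application gate
    D_t = Z_t * V_t                                # Eq. (10): applied knowledge state
    V_tilde = np.tanh(np.concatenate([D_t, C_t], axis=1) @ Wr + br)  # Eq. (11): growth
    U_t = sigmoid((V_t + C_t) @ Wu + bu)           # Eq. (12): knowledge forgetting gate
    return (1 - U_t) * V_t + U_t * V_tilde         # Eq. (13): updated value matrix

N, d_v = 5, 8                                      # illustrative sizes
rng = np.random.default_rng(1)
V_t = rng.normal(size=(N, d_v))                    # current knowledge states
c_t = rng.normal(size=d_v)                         # interaction embedding of (q_t, a_t)
Wz, bz = rng.normal(size=(d_v, d_v)), np.zeros(d_v)
Wr, br = rng.normal(size=(2 * d_v, d_v)), np.zeros(d_v)
Wu, bu = rng.normal(size=(d_v, d_v)), np.zeros(d_v)
V_next = write_process(V_t, c_t, Wz, bz, Wr, br, Wu, bu)
```

Because both gates depend on the current value matrix \(V_t\) as well as the interaction \(c_t\), the amount applied and forgotten differs per concept slot, which is exactly the non-uniform behavior motivated in the introduction.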

3.5 Optimization Process

The optimization goal of our model is to make the predicted probability \({p}_{t}\) close to the student’s actual answer \(a_{t}\), i.e., to minimize the cross-entropy loss \(L\).

$$ L = - \sum\nolimits_{t} {a_{t} \;log \left( {{{p}}_{{{t}}} } \right) + \left( {1 - a_{t} } \right)\; log \left( {1 - {{p}}_{{{t}}} } \right)} . $$
(14)
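Eq. (14) can be computed directly in Python; the small epsilon guarding \(\log 0\) is an implementation detail we add for numerical safety, not part of the formula:

```python
import math

def bce_loss(preds, answers, eps=1e-12):
    """Eq. (14): summed binary cross-entropy over an interaction sequence.
    eps guards log(0); it is an implementation detail, not part of Eq. (14)."""
    return -sum(a * math.log(p + eps) + (1 - a) * math.log(1 - p + eps)
                for p, a in zip(preds, answers))

loss = bce_loss([0.9, 0.2, 0.7], [1, 0, 1])  # small loss for well-calibrated predictions
```

In practice the sum runs over every timestep of every student sequence in a minibatch, and gradients of \(L\) with respect to all gate parameters are obtained by backpropagation through time.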

4 Experiments

4.1 Datasets

Table 2 summarizes the four datasets used to evaluate the models: Statics2011, ASSISTments2009, ASSISTments2015, and ASSISTment Challenge. All of these datasets come from real online learning systems.

Table 2. Dataset statistics
  1. Statics2011: This dataset has 1,223 exercise tags and 189,297 interaction records of 333 students, and comes from an engineering mechanics course at a university.

  2. ASSISTments2009: This dataset contains 110 exercise tags and 325,637 interaction records for 4,151 students, and comes from the ASSISTment education platform in 2009.

  3. ASSISTments2015: This dataset was collected from the ASSISTment education platform and has 100 exercise tags and 683,801 interaction records of 19,840 students.

  4. ASSISTment Challenge: This dataset was used in the ASSISTment competition in 2017 and contains 102 exercise tags and 942,816 interaction records of 686 students.

4.2 Evaluation Method

In the field of knowledge tracing, AUC is the usual evaluation criterion for model classification. The advantage of AUC is that it gives a credible evaluation result even when the samples are unbalanced [28]. This paper also uses AUC to evaluate the models; the higher the AUC value, the better the classification result. As shown in Fig. 5, the ROC curve is drawn according to the TPR and FPR, and the AUC is the area under the ROC curve.

Fig. 5. Description of the AUC calculation
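AUC can also be computed without plotting the ROC curve, as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting half). The following Python sketch illustrates this equivalent rank-based formulation:

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.
    Equivalent to the area under the ROC curve; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_score([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])  # perfectly ranked: AUC = 1.0
```

This O(|pos|·|neg|) double loop is fine for illustration; production evaluation would use a sorted-rank implementation or a library routine.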

4.3 Implementation Details

In this paper, each dataset was randomly split into a training set (70%) and a test set (30%). Five-fold cross-validation was used on the training set, with 20% of the training set held out as the validation set. We used early stopping and selected the model hyperparameters on the validation set, and the performance of the model was evaluated on the test set.

Parameters were randomly initialized from a Gaussian distribution. Stochastic gradient descent was adopted as the optimization method for training, with a batch size of 50 on all datasets. The maximum number of training epochs was set to 100, and the epoch with the best AUC on the validation set was selected for testing. The average AUC on the test set was used as the model evaluation result.

Using different initial learning rates, we compared the performance of the DKVMN and DKVGRU models with a sequence length of 200. We then compared the two models with sequence lengths of 100, 150, and 200.

4.4 Result Analysis

On the four datasets, the experiments used initial learning rates of 0.02, 0.04, 0.06, 0.08, and 0.1 to measure the AUC scores of DKVMN and DKVGRU. An AUC of 0.5 corresponds to random guessing; the higher the AUC score, the better the model’s predictions. Table 3 reports the test AUC scores of DKVMN and DKVGRU on all datasets. It can be clearly seen that DKVGRU outperforms DKVMN on all datasets.

Table 3. The test AUC scores of DKVMN and DKVGRU with different initial learning rates on all datasets

On the Statics2011 dataset, the average AUC of DKVGRU is 83.06%, 1.29% higher than DKVMN’s 81.76%. On the ASSISTments2009 dataset, DKVMN produces an average test AUC of 80.34%, compared with 80.70% for DKVGRU. On the ASSISTments2015 dataset, the average AUC of DKVGRU is 72.87% versus 72.54% for DKVMN. On the ASSISTment Challenge dataset, DKVGRU achieves an average AUC of 68.53%, 1.82% higher than DKVMN’s 66.72%. DKVGRU therefore performs better than DKVMN on all four datasets. For both models, we observe from these experiments that a larger initial learning rate tends to yield a better AUC score.

Then, we fixed the initial learning rate at 0.1 and evaluated the two models with sequence lengths of 100, 150, and 200. As shown in Table 4, the experimental results indicate that DKVGRU performs better than DKVMN at every sequence length.

According to Fig. 6, the AUC results of DKVGRU and DKVMN improve as the sequence length increases, except on the Statics2011 dataset. These findings suggest that the sequence length of the exercises is positively correlated with model performance: a longer sequence length yields better prediction performance, as the model can trace students’ knowledge states more accurately by utilizing more exercise records.

Table 4. The test AUC scores of DKVMN and DKVGRU with different sequence lengths on all datasets

On the Statics2011 dataset, the AUC results are negatively correlated with the sequence length because this dataset has the largest number of exercise tags among the four datasets (1,223). The more exercise tags in a sequence, the more complex the relationships between exercises and concepts that the model needs to consider. Nonetheless, the AUC score of DKVGRU on the Statics2011 dataset remains higher, which suggests that DKVGRU simulates students’ knowledge states better than DKVMN.

Fig. 6. The test AUC scores of DKVMN and DKVGRU with different sequence lengths on all datasets

In summary, DKVGRU outperforms DKVMN across different learning rates and sequence lengths, which shows that the gating mechanism of DKVGRU effectively simulates the changes in students’ knowledge states.

5 Conclusions and Prospects

To address the shortcomings of existing knowledge tracing models, namely ignoring that students apply different concepts to the same exercise and failing to consider the forgetting of learned concepts, we propose DKVGRU, a knowledge tracing model based on a dynamic Key-Value matrix and a gating mechanism. DKVGRU updates students’ knowledge states through the gating mechanism. Experiments on four public datasets demonstrate that DKVGRU performs better than DKVMN.

In addition to students’ exercise records, online learning platforms also record various learning activities, such as watching videos and viewing exercise explanations. In future work, we will consider these features in KT tasks. Using such data, we can also classify students according to their learning attitudes and habits, which would allow a more reasonable simulation of students’ knowledge states.