1 Introduction

Due to the spread of COVID-19, offline education has faced considerable difficulties, and many schools have turned to online education. With the rapid development of the Internet and artificial intelligence technology, the combination of education and the Internet, in the form of online intelligent learning platforms for students, has become inevitable. These platforms provide students with conditions for online intelligent learning and give teachers a better understanding of students' knowledge states, which in turn greatly helps in developing personalized teaching plans. Knowledge Tracing (KT) [1] is one of the fundamental tasks of an online learning platform for students: it estimates a student's mastery of knowledge concepts from the student's past interactions and predicts the student's future performance.

Fig. 1 Example of the domain adaptive KT task. The source domain is derived from mathematics exercises and the target domain from physics exercises

Research on knowledge tracing models has made considerable progress and is mainly divided into traditional knowledge tracing models and deep learning knowledge tracing models. Traditional models such as Bayesian knowledge tracing (BKT) [2], based on the Hidden Markov Model (HMM), have good interpretability because they build on specific assumptions and prior knowledge, making their outputs easier to understand. Deep learning knowledge tracing models achieve higher accuracy because they are typically composed of multiple layers of neural networks with powerful nonlinear modeling capabilities, and their rich feature representations allow them to model students' knowledge levels more accurately. The sequence-based deep knowledge tracing (DKT) [3], the attention-based AKT [4], and the transformer-based SAINT [5] are currently well-known deep learning knowledge tracing models. However, these KT models require a significant amount of data for sufficient training and are highly dependent on the current training data.

We argue that the adaptation of knowledge tracing models is a problem worth studying, since acquiring student exercise data is often difficult, and there are few studies on the adaptation of KT models. Currently, the only work that tries to solve domain adaptation in KT is AdaptKT [6]. They selected textually similar items from two datasets for knowledge transfer and adopted Maximum Mean Discrepancy (MMD) to minimize the distribution discrepancy between the two domains. However, such an approach is demanding in that it relies on an ample amount of textual information. We therefore aim to design an adaptive knowledge tracing model that can be trained on an existing source domain and then quickly adapted to a target domain using only a small amount of target domain data. The source domain consists of existing student exercises, possibly from different subjects or datasets, while the target domain consists of the current exercises, with only a small number of student sequences, from a new subject or dataset. To better illustrate the domain adaptation issue in knowledge tracing, we present a simple example in Fig. 1. The source domain has a substantial amount of data, while the target domain has very few samples; they could come from different subjects or from different datasets within the same subject. Our objective is to train the model on the source domain and use a small set of target domain samples to accurately predict students' mastery of knowledge points in the target domain.

It is worth noting that even though ample knowledge tracing data is available from various domains, adapting to a new domain on online intelligent learning platforms remains challenging. Two primary issues arise: first, there are significant distribution differences between domain data from different subjects; second, target domain data typically consists of only a small number of student interaction records. We therefore propose a novel Domain-Adaptive Knowledge Tracing model (DAKT), which includes a domain-shared answer embedding module and a domain-adaptive knowledge state changing module. The domain-shared answer embedding module leverages masked attention [7] to encode the answer sequences together with past answers, aiming to alleviate the effect of sequence differences among domains. The domain-adaptive knowledge state changing module leverages adaptation parameters to make the model adapt to the target domain. The main contributions of this paper are summarized as follows:

  • A rarely studied domain adaptation issue in KT. We approach the domain adaptation of the KT model when the acquisition of student interaction data is limited, i.e., even to a single interaction sequence.

  • A straightforward but effective KT model for domain adaptation. We design a novel Domain Adaptive Knowledge Tracing model (DAKT), which consists of a domain-shared answer embedding module and a domain-adaptive knowledge state changing module, leveraging adaptive parameters to quickly adapt to the target domain.

  • Promising results. We conduct extensive experiments on four real-world datasets to validate the effectiveness of the proposed model.

2 Related work

2.1 Knowledge tracing

Generally speaking, knowledge tracing models fall into two broad categories: traditional knowledge tracing models and deep learning knowledge tracing models. We introduce some representative models of each below.

Traditional knowledge tracing research is mainly divided into two directions: Bayesian knowledge tracing (BKT) [2], motivated by mastery learning [8], and factor analysis models. BKT is the earliest knowledge tracing model. It typically uses probabilistic graphical models, such as Hidden Markov Models (HMM) and Bayesian belief networks, to trace students' knowledge states; their core is Bayes' theorem. Factor analysis models are based on Item Response Theory (IRT) [9], which has been widely used in educational evaluation and whose history can be traced back almost to the 1920s [10]. The main idea is to learn a logistic function to evaluate student performance. In the original IRT, each item corresponds to only one concept; subsequent work extended it to involve multiple concepts. The additive factor model (AFM) [11] originated from learning factors analysis (LFA) [12]. Its main idea is that the probability of a student answering a question correctly is proportional to the cumulative combination of the student's ability, the difficulty of the knowledge concepts involved in the question, and the amount of learning the student acquires during each study session.

Inspired by deep learning, most recent studies on KT use deep learning techniques. Deep knowledge tracing (DKT) [3] is a model based on the Recurrent Neural Network (RNN) [13] and Long Short-Term Memory (LSTM) [14], which models the time series formed by students in the process of answering questions. Afterwards, to improve the ability of the model's hidden state to capture historical information in the time series, memory-network-based models were also introduced to KT. The dynamic key-value memory network (DKVMN) [15] adopts key-value memory to represent the knowledge state, which has a stronger representation ability than the hidden variables used in DKT. Later, self-attentive knowledge tracing (SAKT) [16], a simple but tough-to-beat baseline for knowledge tracing (SimpleKT) [17], and attentive knowledge tracing (AKT) [4] incorporated the attention mechanism into KT models, and separated self-attentive neural knowledge tracing (SAINT) [5] further adopted the encoder-decoder architecture of the Transformer [18] with scaled dot-product attention. In addition, there are KT models based on graph neural networks, such as graph-based knowledge tracing (GKT) [19] and pre-training embeddings via bipartite graph (PEBG) [20], as well as KT models based on textual features, such as the exercise-enhanced recurrent neural network (EERNN) [21] and exercise-aware knowledge tracing (EKT) [22]. These deep learning knowledge tracing models achieve higher prediction accuracy than traditional ones. However, all of these models are domain-specific and rely on a large amount of data for training, so directly adapting them to a new domain is very challenging. AdaptKT is one of the rare studies that simultaneously addresses knowledge tracing and domain adaptation. It proceeds in three stages: training an autoencoder on questions with similar text, using Maximum Mean Discrepancy (MMD) to lessen distribution differences between the source and target domains, and training a KT model on target domain data. This gives AdaptKT a certain level of generalization.

Remark

Note that the aforementioned methods AKT [4] and SimpleKT [17] differ from our proposed DAKT. (1) Those two models use the multi-head attention mechanism to predict students' knowledge states directly, while our DAKT only leverages attention to capture the features of answer sequences. (2) They are designed for domain-specific knowledge tracing, while our model is designed for domain adaptation of the KT model. Thus, our DAKT not only analyzes students' mastery of knowledge points but also introduces adaptive parameters that enable the model to adapt to a new domain. There are also significant differences between AdaptKT and our approach: AdaptKT not only places a high demand on textual information in the data but is also complicated by its multi-stage process. In contrast, the DAKT proposed in this paper alleviates the model's reliance on textual information during training and achieves good domain adaptation with only a small number of target domain samples.

2.2 Domain adaptation

Numerous research avenues are connected to the broader concept of generalization, including domain adaptation, meta-learning, and transfer learning, among others [23]. Over recent years, domain adaptation has emerged as a particularly prominent focus, with the goal of optimizing the performance of a specified target domain through the utilization of existing training data from a source domain. In contrast to domain generalization, domain adaptation can leverage actual target domain data, a characteristic that aligns more closely with real-world applications and provides practical benefits. A plethora of domain adaptation methods have been devised to bridge the gap between source and target domains [24]. Common approaches encompass instance-based domain adaptation [25], which minimizes errors by assigning weights to source samples and training on these weighted instances, as well as feature-based domain adaptation [26], which endeavors to uncover a shared space enabling the alignment of two datasets.

Directions related to domain adaptation also include the following. (1) Unsupervised Domain Adaptation (UDA) [27]: some target domain samples can be obtained, but without corresponding label information; UDA transfers knowledge from the source domain to the target domain by exploiting the difference in feature distributions between the two domains [28]. (2) Semi-supervised Domain Adaptation (SDA) [29]: target domain samples and label information for a small number of them are available; SDA trains with the labeled target domain samples and the source domain data to improve performance on the target domain. (3) Source-Free Domain Adaptation (SFDA) [30]: the source domain data cannot be accessed due to privacy or data transmission constraints, but a model trained on the source domain is available. All of these directions aim to address the differences between the source and target domains in order to achieve better performance on the target domain. Additionally, other research directions related to domain adaptation are worth noting. One is Domain Adversarial Training (DAT) [31], which introduces a domain classifier to minimize the distribution difference between the source and target domains; through adversarial training of this classifier, the model adapts better to the target domain and consequently performs better there. Moreover, some studies focus on Cross-Domain Learning [32], which aims to transfer knowledge learned in the source domain to the target domain through techniques such as shared representations and transfer networks, thereby achieving better generalization in the target domain.

Fig. 2 Overview of the DAKT method. We use the domain-shared answer embedding module to obtain the interaction representation \(x_t\), and the domain-adaptive knowledge state changing module, which includes the adaptation parameters \(W_\alpha\) and \(W_\beta\), to obtain the student's knowledge state \(h_t\). In the first module, the query and key are obtained from the question encoding sequence, while the value is obtained from the answer encoding sequence

3 Proposed method

In this section, we provide a comprehensive introduction to our proposed DAKT model. We first introduce the problem definition of the domain adaptive KT task, then present the key components of DAKT, namely the domain-shared answer embedding module and the domain-adaptive knowledge state changing module, and finally describe the algorithm employed by the entire model.

3.1 Problem definition

Domain adaptive KT task. Knowledge tracing traces students' knowledge states based on their past interactions. The dataset has N students, and the input of the KT task is a student's interaction sequence \(\textit{I}=((q_1,r_1),(q_2,r_2),...,(q_n,r_n))\), where n is the length of the sequence, \(q_t\in \mathbb {N}^+\) is the question ID of the t-th interaction (\(t\le n\)), and \(r_t\in \{0,1\}\) is the student's answer to question \(q_t\). Each question corresponds to at least one knowledge concept in \(\mathcal {C}\). Students should master every knowledge concept as much as possible in the process of answering questions. The goal of a KT model is to predict the probability of answering the next question correctly, i.e., \(P(r_{t+1}|\textit{I},q_{t+1})\).

The source domain \(\text {D}_s\) is a training set with a large amount of labeled data; the target domain \(\text {D}_t=\text {D}_{tl}\cup \text {D}_{tu}\) is a test set with a small amount of labeled data, where \(\text {D}_{tl}\) denotes the small number of labeled target domain samples and \(\text {D}_{tu}\) denotes the unlabeled target domain samples. \(\text {D}_{tl}\) contains only m students, each with an interaction sequence of length t: \(\text {D}_{tl}=\{ \textit{I}_i\}^m_{i=1}\ (m\ll N)\). During the training phase, the target domain samples \(\text {D}_t\) are not visible. The goal of the domain adaptive KT task is to train an adaptable KT model with the source domain data \(\text {D}_s\) and then use \(\text {D}_{tl}\) to adapt the model to the target domain for knowledge tracing. The main notations and their descriptions are listed in Table 1.
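To make the data layout concrete, the following minimal Python sketch (our illustration, not part of the original formulation) encodes an interaction sequence and the source/target split; the names `Interaction`, `Sequence`, and `DomainAdaptiveKTData` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One interaction (q_t, r_t): question ID and binary response.
Interaction = Tuple[int, int]
# One student's sequence I = ((q_1, r_1), ..., (q_n, r_n)).
Sequence = List[Interaction]

@dataclass
class DomainAdaptiveKTData:
    source: List[Sequence]            # D_s: large labeled source domain
    target_labeled: List[Sequence]    # D_tl: only m sequences (m << N)
    target_unlabeled: List[Sequence]  # D_tu: unlabeled target sequences

# KT objective: given I and the next question q_{t+1},
# predict P(r_{t+1} = 1 | I, q_{t+1}).
```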

Table 1 Notations

3.2 Model architecture

As shown in Fig. 2, our DAKT is mainly divided into two modules: the domain-shared answer embedding module and the domain-adaptive knowledge state changing module. The first module uses the masked attention mechanism to capture students' answering behavior features and obtain question encodings. The second module predicts students' mastery of the knowledge state and enhances the model's adaptation ability by updating the adaptation parameters. Both source and target domain data pass through these two modules sequentially; specifically, target domain data involves the computation of the adaptation parameters \(W_{\alpha }\) and \(W_{\beta }\), whereas source domain data does not. The detailed process is elaborated in the following subsections.

3.2.1 Domain-shared answer embedding module

The correspondence between knowledge concepts and questions varies among datasets. Each question may correspond to one or multiple knowledge points, but there is always a primary knowledge concept that the question focuses on; we model this knowledge concept as the relevant knowledge for the current question. Moreover, datasets from different subjects, or even different datasets within the same subject, can have varying numbers of knowledge concepts. We select the larger number of knowledge concepts between the source domain and the target domain datasets as the input dimension of the model.

The student’s interaction sequence \(\textit{I}=((q_1,r_1),(q_2,r_2),...,(q_n,r_n))\) represents the questions the students have done from the first moment to the n-th moment and the student’s answers to each question. The student’s future answering performance depends on the student’s mastery of all the knowledge concepts in the past interactions. In order to better capture the characteristics of student’s past interactive behavior, we split I into \(I_q\) and \(I_r\), where \(I_q\) represents \(\{q_1,q_2,...,q_n\}\), \(I_r\) represents \(\{r_1,r_2,...,r_n\}\), and use embedding encoding to change them as the input of the model.

The probability of a student answering the current question correctly is closely related to their past answers. Whether a student repeatedly practices the same question or answers a different question at each of the first \(n-1\) steps has a large impact on the correctness \(r_n\) of the question \(q_n\) the student is answering at the current step n. We use multi-head attention to describe the impact of past interactions on the student's current state. The standard attention mechanism involves queries, keys, and values; we use the question embedding sequence to generate the queries and keys, and the answer embedding sequence to generate the values. From the question embeddings, the influence factor \({\alpha }_{\tau ,t}\) \((1\le \tau \le t)\) of each past step is obtained with the softmax function. To ensure that the model cannot cheat, we apply a mask during the calculation. We thus obtain an interaction representation \(i_t\) that contains information about the student's past questions and answers:

$$\begin{aligned} {\alpha }_{\tau ,t}= & {} \text {Softmax}\left( \frac{q_{t}k^\intercal _{\tau }}{\sqrt{d_k}}\right) =\frac{\exp \left( \frac{q_{t}k^\intercal _{\tau }}{\sqrt{d_k}}\right) }{{\sum }_{\tau '\le t} \exp {\left( \frac{q_{t}k^\intercal _{\tau '}}{\sqrt{d_k}}\right) }}, \end{aligned}$$
(1)
$$\begin{aligned} i_t= & {} \text {Attention}(Q_t,K_t,V_t)={\sum }_{\tau \le t}{\alpha }_{\tau ,t}v_{\tau }. \end{aligned}$$
(2)

Note that \(i_t\) mainly captures the student's past answering behavior, and the student's past question-answering interactions are also very important: if a student repeatedly practices the same questions, such behavior will certainly affect the student's mastery of the relevant knowledge concepts. We therefore concatenate \(i_t\) with the question embedding sequence \(qu_t\) to obtain \(x_t\). Since \(x_t\) captures the behavioral features of the current answer sequence, it does not depend too heavily on any particular dataset: no matter which domain a student sequence comes from, the module can capture the student's answering behavior features. This answer embedding module is therefore domain-shared, and it lays the foundation for our subsequent adaptation work.
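A minimal PyTorch sketch of how such a domain-shared answer embedding module could look, under our reading of the text: queries and keys come from the question embeddings, values from the answer embeddings, a causal mask blocks future steps, and the attention output is concatenated with the question embeddings to form \(x_t\). Layer choices, dimensions, and names (e.g., `DomainSharedAnswerEmbedding`) are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DomainSharedAnswerEmbedding(nn.Module):
    """Sketch of the domain-shared answer embedding module (our reading):
    Q, K from question embeddings, V from answer embeddings, causal mask."""

    def __init__(self, num_questions: int, d: int = 256, n_heads: int = 8):
        super().__init__()
        self.q_emb = nn.Embedding(num_questions, d)  # question IDs
        self.r_emb = nn.Embedding(2, d)              # answers in {0, 1}
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, q_ids: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # q_ids, r: (batch, seq_len)
        qu = self.q_emb(q_ids)                       # queries and keys
        v = self.r_emb(r)                            # values
        n = q_ids.size(1)
        # Causal mask: step t attends only to steps <= t ("no cheating").
        mask = torch.triu(
            torch.ones(n, n, dtype=torch.bool, device=q_ids.device), diagonal=1
        )
        i_t, _ = self.attn(qu, qu, v, attn_mask=mask)
        # Concatenate i_t with the question embeddings to obtain x_t.
        return torch.cat([i_t, qu], dim=-1)          # (batch, seq_len, 2d)
```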

3.2.2 Domain-adaptive knowledge state changing module

LSTM has been proven useful in the field of KT [3]. Compared with the RNN, the LSTM alleviates the long-standing problem of learning long-term dependencies, and its performance is better than both the RNN and the HMM. Therefore, we use the LSTM as the base model and add two adaptation parameters \(W_{\alpha }\) and \(W_{\beta }\) to it, forming AdaptLSTM. We take \(x_t\) as the input of AdaptLSTM at the current time step t, and the hidden state from the previous time step is denoted \(h_{t-1}\). The output \(h_t\) of AdaptLSTM represents the student's mastery of knowledge concepts:

$$\begin{aligned} h_t=\text {AdaptLSTM}(x_t,h_{t-1};W_{\alpha },W_{\beta }). \end{aligned}$$
(3)

The LSTM is capable of capturing students' mastery of knowledge concepts from their answering sequences, but it lacks the ability to adapt: if applied directly to domain adaptation, it is difficult to obtain the results we expect. In other words, the current model structure is sufficient for regular KT tasks, but there is still room for improvement on the domain adaptive KT task; through the following modifications we obtain AdaptLSTM. When we train the model on the source domain data, the model parameters are optimized toward the source domain. To enable the pre-trained source-domain model to adapt when the target domain data \(x^{\text {target}}_t\) is fed, we add a linear layer before the LSTM and design an adaptation parameter \(W_{\alpha }\). \(W_{\alpha }\) adaptively fine-tunes the target domain data at the feature representation level, allowing the source domain model parameters to assist the target domain while retaining the information specific to the target domain. It also attenuates the differences between the source and target domain feature representations produced by the domain-shared answer embedding module. We set the adaptation weight matrix \(W_{\alpha }\) to zero so that it has no impact on the source domain data:

$$\begin{aligned} x^{\text {source}}_t= & {} W_l(x^{\text {source}}_t+{\textbf {0}}\cdot x^{\text {source}}_t)+b_l, \end{aligned}$$
(4)
$$\begin{aligned} x^{\text {target}}_t= & {} W_l(x^{\text {target}}_t+ W_{\alpha } \cdot x^{\text {target}}_t)+b_l. \end{aligned}$$
(5)
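Equations (4) and (5) apply the same linear layer \(W_l\), with target-domain features first augmented by \(W_\alpha x_t\); since \(W_\alpha\) is initialized to zero, the source-domain path is unaffected. A hedged sketch (class and argument names are ours):

```python
import torch
import torch.nn as nn

class AdaptiveInputLayer(nn.Module):
    """Linear layer before the LSTM with adaptation weight W_alpha.
    Source: x <- W_l(x + 0 * x) + b_l        (Eq. 4)
    Target: x <- W_l(x + W_alpha x) + b_l    (Eq. 5)"""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)  # W_l, b_l
        # Zero-initialized, so it has no impact on source domain data;
        # it is only updated during target domain fine-tuning.
        self.W_alpha = nn.Parameter(torch.zeros(d_in, d_in))

    def forward(self, x: torch.Tensor, is_target: bool) -> torch.Tensor:
        if is_target:
            x = x + x @ self.W_alpha.T
        return self.linear(x)
```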
Fig. 3 The AdaptLSTM with \(W_{\alpha }\) and \(W_{\beta }\)

In a conventional LSTM, the memory cell \({C}_t\) decides how much information is retained when a new interaction is input, controlled by several gating units. The forget gate determines which past information in \({C}_{t-1}\) should be discarded; the candidate memory cell \(\tilde{C}_t\) holds new information relevant to the current input; the input gate decides which information from the current input should be added to \({C}_t\); the cell update combines the forget gate and the input gate to obtain the new \({C}_t\); and the output gate controls which information is propagated to the next time step's hidden state and output. Students' learning behaviors may vary across datasets, but their knowledge state dynamics are similar: students gradually acquire relevant knowledge concepts during the learning process, differing mainly in the rate of mastery. Therefore, as shown in Fig. 3, we design a second adaptation weight matrix \(W_\beta\) inside AdaptLSTM. In contrast to \(W_{\alpha }\), \(W_\beta\) adapts the module to the target domain at each time step by adjusting the input gate. We incorporate \(W_\beta\) into the input gate because the other gating units are also influenced by changes to it. Similarly, when the source domain data \(x^{\text {source}}_t\) is fed, Equations (6) and (8) are executed, and when the target domain data \(x^{\text {target}}_t\) is fed, Equations (7) and (8) are executed:

$$\begin{aligned} {in}^{\text {source}}_t= & {} \sigma (W_i\cdot [h_{t-1},x^{\text {source}}_t+{\textbf {0}}\cdot x^{\text {source}}_t]+b_i), \end{aligned}$$
(6)
$$\begin{aligned} {in}^{\text {target}}_t= & {} \sigma (W_i\cdot [h_{t-1},x^{\text {target}}_t+W_\beta \cdot x^{\text {target}}_t]+b_i), \end{aligned}$$
(7)
$$\begin{aligned} C_t= & {} f_t\cdot C_{t-1}+{in}_t\cdot \tilde{C}_t. \end{aligned}$$
(8)
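Putting (6)-(8) together, the following sketch shows one AdaptLSTM step as we understand it: a standard LSTM cell in which only the input gate sees the \(W_\beta\)-adapted input when the data comes from the target domain. The gate decomposition and names follow common LSTM conventions and are our assumptions:

```python
import torch
import torch.nn as nn

class AdaptLSTMCell(nn.Module):
    """One AdaptLSTM step (sketch): only the input gate sees the
    W_beta-adapted input for target domain data (Eqs. 6-8)."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.f = nn.Linear(d_in + d_hidden, d_hidden)  # forget gate
        self.i = nn.Linear(d_in + d_hidden, d_hidden)  # input gate (W_i, b_i)
        self.o = nn.Linear(d_in + d_hidden, d_hidden)  # output gate
        self.g = nn.Linear(d_in + d_hidden, d_hidden)  # candidate cell
        self.W_beta = nn.Parameter(torch.zeros(d_in, d_in))

    def forward(self, x, h_prev, c_prev, is_target: bool):
        z = torch.cat([h_prev, x], dim=-1)
        f_t = torch.sigmoid(self.f(z))
        # Eq. (6) for source data, Eq. (7) for target data.
        x_in = x + x @ self.W_beta.T if is_target else x
        in_t = torch.sigmoid(self.i(torch.cat([h_prev, x_in], dim=-1)))
        c_tilde = torch.tanh(self.g(z))
        c_t = f_t * c_prev + in_t * c_tilde            # Eq. (8)
        h_t = torch.sigmoid(self.o(z)) * torch.tanh(c_t)
        return h_t, c_t
```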

The prediction layer in Fig. 2 maps the student knowledge state obtained by the model to the sample label space, but the label spaces of the source domain and the target domain will certainly differ. If the prediction layer parameters were used directly on target domain data, the output would suffer. Therefore, we also let the prediction layer parameters \(\theta _{\gamma }\) be updated together with \(W_\alpha\) and \(W_\beta\) when target domain data is fed. In this way, \(W_\alpha\) and \(W_\beta\) allow the domain-adaptive knowledge state changing module to obtain better sequence features for target domain data, and \(\theta _{\gamma }\) allows the module to adapt to the target domain's sample space.
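In our reading, target-domain fine-tuning therefore updates only \(W_\alpha\), \(W_\beta\), and the prediction layer \(\theta_\gamma\) while the remaining pre-trained parameters stay frozen; a hedged sketch (the parameter name matching is hypothetical):

```python
def prepare_for_target_finetuning(model):
    """Freeze all pre-trained parameters except W_alpha, W_beta, and the
    prediction layer theta_gamma (sketch; name matching is assumed)."""
    for name, p in model.named_parameters():
        p.requires_grad = (
            "W_alpha" in name or "W_beta" in name or "prediction" in name
        )
    return [p for p in model.parameters() if p.requires_grad]
```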

Our algorithm. For clarity, the overall procedure of DAKT is shown in Algorithm 1. \(N^s\) denotes the total number of interaction sequences in the source domain, \(N^t\) the total number of interaction sequences in the target domain (not used in this paper), and m the number of interaction sequences used for training in the target domain \((m\ll N^s, m\ll N^t)\). During training, we use the cross-entropy loss \(L_{CE}\) as the objective function for model optimization.

Algorithm 1 Domain adaptive knowledge tracing
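Since Algorithm 1 appears only as a figure, the following sketch reconstructs the two-stage procedure from the surrounding text: pre-train on the \(N^s\) source sequences with the cross-entropy loss \(L_{CE}\), then fine-tune only the adaptation parameters on the m target sequences (reusing `prepare_for_target_finetuning` from the sketch above). The model's call signature is a placeholder:

```python
import torch
import torch.nn.functional as F

def train_dakt(model, source_loader, target_sequences, epochs=500, lr=1e-4):
    # Stage 1: pre-train all parameters on the source domain D_s.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for q, r, q_next, r_next in source_loader:
            pred = model(q, r, q_next, is_target=False)
            loss = F.binary_cross_entropy(pred, r_next.float())  # L_CE
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: adapt on the m labeled target sequences (m << N^s),
    # updating only W_alpha, W_beta, and theta_gamma.
    opt = torch.optim.Adam(prepare_for_target_finetuning(model), lr=lr)
    for q, r, q_next, r_next in target_sequences:
        pred = model(q, r, q_next, is_target=True)
        loss = F.binary_cross_entropy(pred, r_next.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```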

4 Experiments

In this section, we evaluate our proposed DAKT on four existing knowledge tracing benchmark datasets, i.e., ASSISTment2009, ASSISTment2015, ASSISTment2017, and KDDcup. We take one dataset as the source domain and the other three datasets as target domains to evaluate both the KT task and the DA task.

Table 2 The specific information of the dataset

4.1 Datasets and settings

We use four benchmark datasets to evaluate our proposed method. The ASSISTments data, collected from the free online tutoring ASSISTments platform, comprises ASSISTment2009, ASSISTment2015, and ASSISTment2017 and is widely used to evaluate KT models. KDDcup comes from the KDDcup Educational Data Mining Challenge and contains answers to algebra questions from 13–14 year old students from 2005 to 2007, extracted from the intelligent tutoring system called the Cognitive Tutor. The specific information of each dataset can be found in Table 2, where Stu_num is the total number of students, Itr_num the total number of student-question interactions, Cpt_num the total number of knowledge concepts, and Avg_qID the average number of questions answered per student.

Each dataset primarily consists of three pieces of information: the length of the current student's answered-question sequence, the question ID sequence, and the student's answer to each question. The sequence length represents the number of questions answered by the student. Interaction sequences vary, both in the number of answered questions and in the student's answer patterns. During the training phase, whether in the source domain or the target domain, we allocate 80% of the data for training and 20% for testing. In the source domain, we train the model on all source training data and evaluate it on the source test set. In the target domain, we randomly select a single sequence from the training set as the only available target domain data for fine-tuning, and then test the model on the target domain's test set.
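A hedged sketch of this split, assuming sequences are held in a plain Python list: 80/20 train/test within each domain, with one randomly chosen training sequence kept as the sole target-domain fine-tuning data.

```python
import random

def split_domain(sequences, train_ratio=0.8, seed=0):
    """80/20 train/test split within one domain (sketch)."""
    rng = random.Random(seed)
    seqs = list(sequences)
    rng.shuffle(seqs)
    cut = int(len(seqs) * train_ratio)
    return seqs[:cut], seqs[cut:]

# target_sequences: list of interaction sequences, loaded elsewhere.
target_train, target_test = split_domain(target_sequences)
# A single randomly chosen sequence is the only fine-tuning data.
finetune_data = [random.Random(0).choice(target_train)]
```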

Table 3 One sequence information in the target domain

In the KT network in this paper, the dimension d of the feature embeddings is set to 256. We use Adam to optimize the model throughout, with a learning rate of 0.0001. The batch size is 64, and the maximum sequence length per student interaction is 50. We randomly select a student's question sequence from the target domain dataset as the data fed to the model, and this sequence is fixed throughout the experiments. For fairness, when selecting the target domain data, we count the average number of question IDs each student has answered, to ensure that the chosen student's question count in the current dataset is around the average. The specific information is shown in Table 3, where Avg_Cpt_num denotes the average number of knowledge points involved per student; when extracting sequences, we ensure that the number of knowledge concepts in the extracted samples is around this average. Seq_length is the length of the sequence we extract as target domain data, and Proportion indicates the proportion of this sequence length in the entire dataset. Evidently, when a dataset serves as the target domain, the available sample data is very limited, which aligns well with real-world scenarios. We evaluate our models using the test set of the target domain dataset.

For the metrics, we use the area under the ROC curve (AUC) [33] to evaluate the performance of KT models. The model's task is to predict, based on the student's historical answering behavior, whether the student will answer a new question correctly, thereby determining whether the student has acquired the corresponding knowledge point. AUC is a widely used performance metric for binary classification, measuring the trade-off between the true positive rate and the false positive rate at different thresholds. In knowledge tracing, correct answers are treated as positive cases and incorrect answers as negative cases. AUC values range from 0 to 1 [34], with higher values indicating better predictive performance.
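For illustration, AUC can be computed directly from the predicted correctness probabilities and the binary labels, e.g., with scikit-learn (the toy numbers below are ours):

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 for a correct answer, 0 for an incorrect one;
# y_score: predicted probability that the answer is correct.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.3, 0.6, 0.8, 0.4]
print(roc_auc_score(y_true, y_score))  # 1.0 here; 0.5 is chance level
```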

Table 4 The AUC(%) results of different models on domain adaptive KT tasks

4.2 Comparison results and analysis

We compare our method DAKT with five other representative KT models: DKT [3], AdaptKT [6], AKT [4], SAINT [5], and SimpleKT [17]. DKT, AKT, and SAINT are models that we re-implemented based on the original papers. AdaptKT is re-implemented without the phase of training an autoencoder on similar texts, as the original AdaptKT requires private textual information during training. The author of SimpleKT provided the source code, which we applied to our datasets for training and testing. We show the results of each model on the domain adaptive KT task in Table 4. For the statistical analysis, we assess the significance of differences between models by calculating the p-value of the AUC of DAKT in comparison with each other model. All experimental results are averaged over ten distinct runs. In Table 4, each model occupies two rows: the first row gives the AUC, and the second the corresponding p-value. We bold the best-performing results and any results that are not significantly different from the best. A p-value below 0.05 indicates a significant difference between DAKT and the compared model, i.e., that DAKT's predictions are better. In all experiments, the comparison model is trained on the source domain data and all of its parameters are fine-tuned on a single sequence from the target domain.

Additionally, to further explore the effectiveness of the parameters during DAKT's adaptation process, we introduce a variant, DAKT-E. For DAKT-E, in the fine-tuning stage, along with updating \(W_{\alpha }\), \(W_{\beta }\), and \({\theta }_{\gamma }\), we also update the parameters of the domain-shared encoder. To simplify the table, the dataset before \(\rightarrow\) denotes the source domain and the one after denotes the target domain. From the experimental results, our model obtains the highest average AUC scores and all p-values are below 0.05, which shows that our approach is effective on adaptive knowledge tracing tasks and significantly better than the other models. Taking the \(15\rightarrow 09\) scenario as an example, the ASSISTment2015 dataset contains more comprehensive and diverse information than ASSISTment2009, which better assists the target domain. Furthermore, the performance of the LSTM-based DKT and the attention-based AKT is significantly lower than that of our DAKT. The attention-based methods, including AKT, SAINT, and SimpleKT, are better at extracting past student interactions for domain adaptation, yet their performance is still inferior to DAKT, indicating the usefulness of our domain-adaptive knowledge state changing module. Comparing DAKT-E with DAKT, we infer that updating \(W_{\alpha }\), \(W_{\beta }\), and \({\theta }_{\gamma }\) is highly effective for domain adaptation, while additionally updating the encoder parameters does not help: given our limited target domain sample size, too many parameters can negatively impact training. We provide more detailed experiments and analyses in Sect. 4.3.

Fig. 4 The target domain AUC results obtained for different sequence numbers. The last entry in the legend represents training and testing the model directly using only target domain samples

Regarding the number of parameters, our model includes two main parts: the domain-shared answer embedding module and the domain-adaptive knowledge state changing module. The primary parameters consist of the fully connected layers in the first part and the linear layers and AdaptLSTM in the second part. The overall size of our model is approximately 3MB, similar to DKT but much smaller than AKT (around 14MB), SimpleKT (around 15MB), and SAINT (around 20MB). As for the time cost, the time required for target domain updates is negligible for every model, so we consider only the training time. DAKT takes approximately 10 min in total to complete 500 training epochs, almost identical to DKT, roughly half the time of SimpleKT and AKT, and one-fourth the time of SAINT.

Table 5 The AUC(%) results of different knowledge tracing models on conventional KT tasks

To further investigate the impact of different source domains on a specific target domain, as well as the influence of varying target domain sequence quantities on training outcomes, we conduct additional experiments, shown in Fig. 4. We primarily investigate the extent to which different datasets, used as source domains, contribute to learning in the target domain. Each subplot corresponds to a scenario where a specific dataset serves as the target domain while the other datasets act as source domains; as a baseline, we also train and test directly on the target domain dataset alone. The vertical axis shows the AUC on the target domain data, and the horizontal axis shows different quantities of target domain data. We set the target domain sequence quantity to \(n \in \{1,2,4,8,16,32,64,128,256,512\}\); when the sequence quantity is large, we approximate the values in the thousands. Comparing scenarios with the same quantity of target domain data, we observe that among the ASSISTment series datasets, DAKT benefits significantly from source domain data: the accuracy of the model trained with source domain data is higher than that of the model trained only on target domain data. This enables DAKT to rapidly adapt to the target domain using no more than \(5\%\) of the total data, even when the numbers of knowledge points or questions differ, and the results consistently surpass those obtained by training and testing solely on target domain data. However, the ASSISTment series datasets help the KDDcup dataset less markedly, so we extended the experiments targeting KDDcup. In the fourth subplot of Fig. 4, we observe that once the target domain data exceeds 512 sequences, the ASSISTment series data again assists the model in adapting to KDDcup. From Table 2, there are significant differences between the ASSISTment datasets and the KDDcup dataset, especially in the number of knowledge concepts and the number of questions per student, which creates a challenging gap to bridge during domain adaptation. For example, in the second subplot of Fig. 4, with ASSISTment2015 as the target domain and an average of only 34.4 questions per student, training and testing using only this dataset yields a much lower AUC than training with the assistance of source domain data.

In addition, although our focus is not conventional knowledge tracing, we also conducted experiments on conventional KT tasks. We present the results of each model in Table 5 and bold the best-performing results. DAKT not only achieves good results on adaptive knowledge tracing and demonstrates a certain level of adaptability, but also performs excellently compared with current state-of-the-art models on the conventional domain-specific knowledge tracing task.

Table 6 Ablation experiment of DAKT. 09, 15, 17 and kdd represent the ASSISTment2009, ASSISTment2015, ASSISTment2017 and KDDcup datasets, respectively

4.3 Ablation studies

In this subsection, we conducted several ablation experiments to investigate the model architecture and adaptive parameters.

  • For the exploration of the model architecture, we decompose our model into the domain-shared answer embedding module (abbreviated DS) and the domain-adaptive knowledge state changing module (abbreviated DA). We tried retaining only the DS module and fine-tuning the entire model, and also retaining only the DA module and fine-tuning the parameters \(W_{\alpha }\), \(W_{\beta }\), and \({\theta }_{\gamma }\).

  • For the adaptive parameters \(W_{\alpha }\), \(W_{\beta }\), and \({\theta }_{\gamma }\), we conducted seven experiments to observe their roles in the model adaptation process.

As shown in Table 6, we performed domain-adaptive knowledge tracing experiments on different datasets, using the AUC obtained by training and testing the complete DAKT model on the target domain as a reference. To reflect significant differences between models based on p-values, we bold not only the results of the best-performing models but also the results with p-values greater than 0.05. From the first two rows, the DS module effectively captures the time-series information of student interactions and thus better evaluates students' grasp of knowledge concepts, while the DA module contributes to the model's adaptation to the target domain. Moreover, first, comparing the experiments that use only one of the parameters (rows 4 to 6) with direct testing (row 3) shows that each parameter contributes to the model's adaptation capability to some extent, although these individual effects are not as good as the full DAKT. Second, comparing the last four rows makes it evident that the adaptability of \(W_{\beta }\) and \({\theta }_{\gamma }\) is the strongest: removing either leads to a significant drop in AUC. Finally, comparing any individual experiment with the complete DAKT (last row) makes clear that updating all three parameters \(W_{\alpha }\), \(W_{\beta }\), and \({\theta }_{\gamma }\) simultaneously yields the strongest adaptation capability.

4.4 Visualization

In this section, we visualize the attention weights of the domain-shared answer embedding module. We randomly selected 50 interactions of one student from the ASSISTment2009 dataset and generated the heatmap shown in Fig. 5. In the latter part of the sequence, for example, the student had two interaction segments \(I_1=((q_{41},r_{41}),(q_{42},r_{42}),(q_{43},r_{43}))\) and \(I_2=((q_{44},r_{44}),...,(q_{50},r_{50}))\), where \(q_{41\sim 43}=62\) and \(q_{44\sim 50}=98\). It is evident that when the student first attempted \(q_{44}\), there was very little correlation with past interactions; however, as the student continued to answer the following questions, the correlation among answers to the same question became significant. The domain-shared answer embedding module can therefore capture the impact of a student's past problem-solving behavior on the current state, providing more valuable information as input to the subsequent modules.
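For reference, a heatmap like Fig. 5 can be produced from the attention weights \(\alpha_{\tau,t}\) with matplotlib; a minimal sketch assuming a weight matrix of shape (seq_len, seq_len):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_attention_heatmap(alpha: np.ndarray, path: str = "heatmap.png"):
    """alpha[t, tau]: attention weight of past step tau at current step t."""
    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(alpha, cmap="viridis", aspect="auto")
    ax.set_xlabel("past interaction (tau)")
    ax.set_ylabel("current interaction (t)")
    fig.colorbar(im, ax=ax, label="attention weight")
    fig.savefig(path, dpi=150, bbox_inches="tight")
```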

Fig. 5 Heatmap of attention weights in the domain-shared answer embedding module

4.5 Parameter analysis

We conducted further analysis of the parameters in DAKT. \(W_{\alpha }\) helps the model handle the distribution differences between the source and target domains at the level of feature representation, while \(W_{\beta }\) assists the model's adaptation to the target domain at the knowledge state level. Considering that the amount of target domain data is much smaller than that of the source domain, setting too many adaptation parameters may lead to insufficient training, while setting too few may leave the model unable to adapt well. To investigate this, we experimented with setting \(W_{\alpha }\) and \(W_{\beta }\) as vectors, two-dimensional matrices, and three-dimensional matrices matching the input feature dimension. We primarily used ASSISTment2009 as the target domain and the other datasets as source domains. The results are shown in Fig. 6. Without updating \({\theta }_{\gamma }\) and without \(W_{\beta }\), we updated \(W_{\alpha }\) and found that setting \(W_{\alpha }\) as a two-dimensional matrix yielded the best results on the target domain. After fixing \(W_{\alpha }\), we experimented with different \(W_{\beta }\) settings without updating \({\theta }_{\gamma }\), which yielded the same conclusion and confirmed our earlier hypothesis that too many parameters lead to poor training results. Additionally, we analyzed the dimension d of the hidden knowledge states in the LSTM and found that d = 256 gave the best performance. Therefore, in our experiments, we set \(W_{\alpha }\) and \(W_{\beta }\) as two-dimensional matrices and d to 256.

Fig. 6 Parameter analysis for \(W_{\alpha }\), \(W_{\beta }\) and d

5 Conclusions

The DAKT model proposed in this paper effectively addresses the issue of limited target domain data in knowledge tracing tasks combined with domain adaptation. The model can perform knowledge tracing within specific domains and swiftly adapt to a target domain with only a small number of student interaction sequences. Extensive experiments demonstrate the effectiveness of our model in domain adaptation. There is still room for improvement, such as using methods like data augmentation to further enhance the model's accuracy in the target domain. We hope to raise awareness of the challenges of domain adaptive knowledge tracing, which has significant implications for applying knowledge tracing on online intelligent platforms.