
1 Introduction

The rapid advancement of AI technology has profoundly influenced individuals of diverse backgrounds and skill levels. In this context, online judge systems have emerged as an indispensable avenue for those seeking to boost their programming proficiency. However, despite the growing popularity of this learning modality, high dropout rates have been observed, attributable to the inadequate provision of personalized instruction tailored to learners' unique learning preferences [22].

Recently, recommender systems have been widely applied in online education scenarios to facilitate personalized learning. There are various recommendation models, including collaborative filtering (CF)-based methods [20, 31], content-based methods [1, 13], and deep-learning-based methods [7, 8]. These general models aim to provide personalized recommendations by capturing users' interests and needs through static preferences and individual interactions. In the context of programming, however, the learning process exhibits a dynamic and progressive nature. This represents an essential application of the sequential recommendation (SR) task, which predicts subsequent behavioral sequences based on historical records [12, 27, 29, 32].

While extant SR models have yielded successful results in e-learning contexts, significant gaps remain in directly deploying them to programming scenarios [18]. As illustrated in Fig. 1, programming learning differs from traditional learning in two crucial respects: i) it enables learners to make multiple attempts on the same exercise and edit their previous submissions based on the feedback received from the compiler, and ii) the platform can record fine-grained behavioral data related to programming, including code snippets, compilation time, and compilation status. Furthermore, current sequential models prioritize learners' interaction patterns with little regard for their intrinsic traits, such as learning styles. These styles reflect the ways in which learners process and comprehend information and are thus factors that cannot be ignored.

Consequently, the study of SR in programming learning confronts two significant challenges. First, it is imperative to model the distinctive and fine-grained patterns involved in programming, including code-related side features and iterative submission behavior (C1). Second, there is an urgent need to incorporate pedagogical theory into the model to bolster its interpretability and consistency with the actual learning process (C2).

To address the above challenges, we propose a new model named Programming Exercise Recommender with Learning Style (PERS). To simulate the iterative process in programming, we employ a two-step approach. First, we map programming exercises and code-related features (such as code snippets, execution time, and execution status) into embeddings using a representation module with positional encoding. Second, we formulate a differentiating module that calculates the changes between consecutive code submissions. This module can adeptly capture fine-grained learning patterns by effectively distinguishing between intra-exercise and inter-exercise attempts (for C1). To enhance the consistency between our proposed model and the actual learning process, we draw inspiration from a pedagogical theory known as the Felder-Silverman Learning Style Model (FSLSM) [4], which is widely utilized in educational scenarios for mining learning patterns and delivering personalized guidance. Considering the processing and understanding dimensions in FSLSM, we present a formal definition and detailed description of programming learning styles in this paper. On this foundation, we develop three latent vectors: programming ability, processing style, and understanding style, which are designed to track learners' intrinsic behavioral patterns during the programming process (for C2). After obtaining the above vectors, our model employs a multilayer perceptron (MLP) to generate personalized predictions that align with individuals' learning preferences. The main contributions of this paper are summarized as follows:

  • Our study endeavors to furnish personalized programming guidance by emulating the iterative and trial-and-error programming learning process, thereby offering a novel vantage point on programming education.

  • We have meaningfully incorporated the FSLSM pedagogical theory into our model, enabling us to effectively capture the intrinsic behavioral patterns of students while also enhancing its rationality and consistency.

  • We conduct experiments on two real-world datasets to validate the efficacy and interpretability of our approach.

2 Related Works

2.1 Sequential Recommendation

Sequential recommendation models aim to incorporate users' personalized and contextual information based on their historical interactions [16] to predict their future behaviors.

In earlier studies, researchers considered Markov chains a powerful method to model interaction processes and capture users' sequential patterns [2, 9]. Later, the advent of recurrent neural networks (RNNs) greatly expanded the potential of recommender systems to process multi-type input data and understand complex item transitions. For example, [10] first adopt RNNs for real-life session-based recommendations and then enhance the backbone with parallel RNNs to leverage richer features [11]. Various techniques have been designed to improve RNN-based models, such as data augmentation (GRU4Rec [29]), data reconstruction (SRGNN [32]), and unified module injection (SINE [28], CORE [12]). Recently, another line of work has sought to use the Transformer module to capture the global information that RNNs overlook. For instance, BERT4Rec [27] utilizes bidirectional self-attention with a Cloze task during training to enhance the hidden representation.

2.2 Sequential Recommendation in E-Learning

Existing research on SR in e-learning typically focuses on recommending the most appropriate resources, such as courses and exercises, to learners by capturing their static and dynamic characteristics through their past behavioral records [14]. For instance, [15] and [19] propose cognitive diagnostic methods that model students' proficiency on each exercise based on probabilistic matrix factorization. [25] apply a knowledge tracing model with enhanced self-attention to measure students' mastery states and assist recommendation. These methods effectively capture students' preferences and mastery of knowledge points. However, they often overlook the impact of students' internal learning styles.

In the field of programming, some preliminary attempts have been made to explore personalized recommendation. For example, [18] apply BERT [3] to encode students' source code and propose a knowledge tracing model to capture mastery of programming skills. However, the dynamic sequential patterns in existing works are not consistent with the real programming process because they ignore its iterative nature.

2.3 Learning Style Model

Learning styles refer to the ways in which students prefer to obtain, process, and retain information [5]. The most common theoretical models include the Felder-Silverman Learning Style Model (FSLSM), Kolb's learning style model [17], and the VARK model [23]. Previous research has demonstrated that the FSLSM is more comprehensible and appropriate for identifying learning styles in online learning than other models [21]. Based on learners' behavior patterns during the learning process, this model describes learning styles along four dimensions: perception (sensing/intuitive), information input (visual/verbal), processing (active/reflective), and understanding (sequential/global).

3 Preliminaries

3.1 Programming Learning Style Model

Inspired by the FSLSM, we define a programming learning style model (PLSM) centered on the problem-solving behavior observed in online judge systems.

As shown in Table 1, the PLSM delineates the inherent learning patterns during programming through two dimensions: processing and understanding. In terms of processing, learners can be classified as either active or reflective. When solving exercises, active learners tend to think through a complete answer before submitting their solution, while reflective learners prefer to attempt the same exercise multiple times and refine their previous submissions based on the compiler feedback. As for the dimension of understanding, learners can be labeled as sequential or global. Sequential learners tend to approach learning tasks in a progressive sequence, such as in numerical or knowledge concept order. In contrast, global learners tend to approach tasks in a non-linear fashion, such as by selecting tasks that they find most interesting or engaging. These distinct learning styles reflect learners’ preferences and can significantly impact the trajectory of their problem-solving process.

Table 1. Programming Learning Style Model

3.2 Problem Definition

To foster learners’ involvement and enhance their programming skills in online judge systems, we present a new task called programming exercise recommendation (PER). The definition of PER is as follows and an example of data model is depicted in Fig. 1.

Fig. 1. Data model for PER task

Definition: Programming Exercise Recommendation. Suppose there are n online learning users \(\mathcal {U} = \{u_1, u_2, \cdots , u_n\}\) with problem-solving behavior logs \(\mathcal {B}= \{B_1, B_2, \cdots , B_n\}\) and m programming exercises \(\mathcal {P} = \{p_1, p_2, \cdots , p_m\}\). Specifically, the i-th record \(B_i = \{b_1, b_2, \cdots , b_{l_i}\}\) represents the interaction sequence of the i-th learner \(u_i\), where \(l_i\) is the length of the sequence. Each element \(b_j\) in the sequence is a triple \(\langle p_{b_j}, c_{b_j}, r_{b_j} \rangle \) consisting of the problem \(p_{b_j}\), the code \(c_{b_j}\) and the compilation result \(r_{b_j}\). The ultimate goal of programming exercise recommendation is to predict learners' future learning preferences based on the past interaction behavior \(B_i\) between learners and exercises, that is, the next exercise \(p_{b_{l_i+1}}\) that will be attempted. Correspondingly, in machine learning methods, the optimization objective is:

$$\begin{aligned} \mathcal {L} = & \max _{\mathcal {A}} \sum _{(B_i, p_{b_{l_i + 1}}) \in \mathcal {B}\cup \mathcal {B}^{-}}\log \left[ \mathcal {A}(p_{b_{l_i + 1}}|B_i)^{y_i}\left( 1-\mathcal {A}(p_{b_{l_i + 1}}|B_i)\right) ^{(1-y_i)}\right] , \end{aligned}$$
(1)

where \(\mathcal {A}\) is a probabilistic prediction model, such as a neural network, whose output is the predicted probability of interacting with the next exercise \(p_{b_{l_i+1}}\) given the historical behavior sequence \(B_i\). \(\mathcal {B}^{-}\) is a set of negative samples, i.e., exercises that learner \(u_i\) has not interacted with. The label \(y_i = 1\) if and only if \((B_i, p_{b_{l_i + 1}}) \in \mathcal {B}\); otherwise \(y_i = 0\).
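In practice, Eq. (1) corresponds to a binary cross-entropy objective over the true next exercise and sampled negatives. Below is a minimal PyTorch sketch of this objective; the function and tensor names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def per_objective(scores_pos, scores_neg):
    """Binary cross-entropy form of Eq. (1).

    scores_pos: (batch,) logits for the true next exercise p_{b_{l_i+1}} given B_i.
    scores_neg: (batch, k) logits for k sampled exercises the learner never attempted.
    """
    pos_loss = F.binary_cross_entropy_with_logits(
        scores_pos, torch.ones_like(scores_pos))   # y_i = 1 terms
    neg_loss = F.binary_cross_entropy_with_logits(
        scores_neg, torch.zeros_like(scores_neg))  # y_i = 0 terms
    return pos_loss + neg_loss
```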

4 PERS Framework

In this section, we propose a deep learning framework, namely PERS, to solve programming exercise recommendation. As shown in Fig. 2, the architecture of PERS is mainly composed of four functional modules: representing, differentiating, updating and predicting. The details of the four modules are given in the following.

Fig. 2. PERS Architecture

4.1 Representing Module

The representing module mainly focuses on obtaining the embedding of the two inputs: exercises and codes.

Exercise Representation. As demonstrated in Fig. 1, learners typically attempt a programming exercise multiple times until they pass all test cases. Even for the same exercise, the compilation results of each attempt are distinct and progressive. Therefore, both the exercise embedding and its positional embedding in the sequence are critical. Suppose \(p_t\) denotes the programming exercise attempted by the learner at time step t. First, we use a projection matrix \(\textbf{E}_p \in \mathcal {R}^{(N+2)\times d_p}\) to represent each exercise by its id, where N is the total number of exercises and \(d_p\) is the embedding dimension. The first dimension of the projection matrix is \(N+2\) because two zero pads are added. The representation vector for the exercise \(p_t\) can then be obtained as \(\textbf{e}_{p_t} \in \mathcal {R}^{d_p}\). In addition, inspired by [30], we use the sinusoidal function to acquire the position embedding \(\textbf{pos}_{t}\) at time step t:

$$\begin{aligned} \textbf{pos}_{(t, 2i)} & = \sin (t/10000^{2i/d_{pos}}), \end{aligned}$$
(2)
$$\begin{aligned} \textbf{pos}_{(t, 2i+1)} & = \cos (t/10000^{2i/d_{pos}}). \end{aligned}$$
(3)

where \(d_{pos}\) denotes the dimension. Based on the exercise embedding \(\textbf{e}_{p_t}\) and the position embedding \(\textbf{pos}_t\), we obtain the enhanced exercise embedding \(\textbf{e}_{p_t}^{'}\) through an MLP:

$$\begin{aligned} \textbf{e}_{p_t}^{'} & = \textbf{W}_1^{T}[\textbf{e}_{p_t} \oplus \textbf{pos}_{t}] + \textbf{b}_1, \end{aligned}$$
(4)

where \(\oplus \) denotes the vector concatenation, \(\textbf{W}_1 \in \mathcal {R}^{(d_p + d_{pos}) \times d_{k}}\) , \(\textbf{b}_1 \in \mathcal {R}^{d_k}\) are learnable parameters.
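The following PyTorch sketch illustrates Eqs. (2)-(4); the module and variable names, default dimensions, and maximum sequence length are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ExerciseRepresentation(nn.Module):
    """Enhanced exercise embedding e'_{p_t} from Eqs. (2)-(4)."""

    def __init__(self, num_exercises, d_p=128, d_pos=128, d_k=128, max_len=512):
        super().__init__()
        # N + 2 rows to leave room for the two zero pads mentioned in the text.
        self.exercise_emb = nn.Embedding(num_exercises + 2, d_p, padding_idx=0)
        # Pre-compute the sinusoidal position encodings of Eqs. (2)-(3).
        pos = torch.arange(max_len).unsqueeze(1).float()
        i = torch.arange(0, d_pos, 2).float()
        pe = torch.zeros(max_len, d_pos)
        pe[:, 0::2] = torch.sin(pos / 10000 ** (i / d_pos))
        pe[:, 1::2] = torch.cos(pos / 10000 ** (i / d_pos))
        self.register_buffer("pos_emb", pe)
        self.proj = nn.Linear(d_p + d_pos, d_k)  # W_1, b_1 in Eq. (4)

    def forward(self, exercise_ids):
        # exercise_ids: (batch, seq_len) integer ids
        e_p = self.exercise_emb(exercise_ids)
        pos = self.pos_emb[: exercise_ids.size(1)].unsqueeze(0).expand(
            exercise_ids.size(0), -1, -1)
        return self.proj(torch.cat([e_p, pos], dim=-1))  # e'_{p_t}
```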

Code Representation. Suppose \(c_t\) denotes the code submitted by the learners at time step t. First, we apply a code pre-training model CodeBERT [6] to obtain the initial embedding of code \(\textbf{e}_{c_t} \in \mathcal {R}^{d_c}\). Additionally, we employ different projection matrices to obtain the representation vectors of code-related side features: the execution time \(\textbf{et}_{c_t}\in \mathcal {R}^{d_{ct}}\), the execution memory \(\textbf{em}_{c_t}\in \mathcal {R}^{d_{cm}}\), and the execution status \(\textbf{es}_{c_t}\in \mathcal {R}^{d_{cs}}\). After all the representation vectors are generated, we can obtain the enhanced code embedding \(\textbf{e}_{c_t}^{'}\) by an MLP:

$$\begin{aligned} \textbf{e}_{c_t}^{'} & = \textbf{W}_2^{T}[\textbf{e}_{c_t} \oplus \textbf{es}_{c_t} \oplus \textbf{et}_{c_t} \oplus \textbf{em}_{c_t}] + \textbf{b}_2, \end{aligned}$$
(5)

where \(\textbf{W}_2 \in \mathcal {R}^{(d_c + d_{cs} + d_{ct} + d_{cm}) \times d_{k}}\) is the weight matrix, \(\textbf{b}_2 \in \mathcal {R}^{d_k}\) is the bias term.
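A sketch of Eq. (5) follows, using the public microsoft/codebert-base checkpoint from Hugging Face. The pooling of the CodeBERT output (we take the [CLS] vector) and the discretization of execution time and memory into embedding bins are our assumptions, since the paper does not specify these details.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class CodeRepresentation(nn.Module):
    """Enhanced code embedding e'_{c_t} from Eq. (5)."""

    def __init__(self, num_statuses=10, n_time_bins=50, n_mem_bins=50,
                 d_ct=16, d_cm=16, d_cs=16, d_k=128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
        self.encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        d_c = self.encoder.config.hidden_size              # 768 for CodeBERT
        # Side features are assumed to be discretized into bins / categories.
        self.time_emb = nn.Embedding(n_time_bins, d_ct)    # execution time et
        self.mem_emb = nn.Embedding(n_mem_bins, d_cm)      # execution memory em
        self.status_emb = nn.Embedding(num_statuses, d_cs) # execution status es
        self.proj = nn.Linear(d_c + d_cs + d_ct + d_cm, d_k)  # W_2, b_2

    def forward(self, code_strings, time_bin, mem_bin, status):
        tokens = self.tokenizer(code_strings, return_tensors="pt",
                                truncation=True, padding=True)
        # Use the [CLS] vector as the code embedding (pooling choice is ours).
        e_c = self.encoder(**tokens).last_hidden_state[:, 0]
        feats = torch.cat([e_c, self.status_emb(status),
                           self.time_emb(time_bin), self.mem_emb(mem_bin)], dim=-1)
        return self.proj(feats)  # e'_{c_t}
```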

4.2 Differentiating Module

As the introduction highlights, one of the challenges in PER is to simulate the iterative and trial-and-error process of programming learning. In this paper, we develop a differentiating module to capture fine-grained learning patterns. To distinguish whether students are answering the same exercise or starting a new one, we first calculate the exercise difference embedding \(\varDelta \textbf{e}_{p_t}\) between students’ present exercise embedding \(\textbf{e}_{p_t}^{'}\) and previous exercise embedding \( \textbf{e}_{p_{t-1}}^{'}\) by subtraction. Then, we feed the above three embeddings into a multi-layer perceptron to output the final exercise difference embedding \(\varDelta ^{'} \textbf{e}_{p_t}\):

$$\begin{aligned} \varDelta \textbf{e}_{p_t} & = \textbf{e}_{p_t}^{'} - \textbf{e}_{p_{t-1}}^{'} \end{aligned}$$
(6)
$$\begin{aligned} \varDelta ^{'} \textbf{e}_{p_t} & = \textbf{W}_3^{T}[\varDelta \textbf{e}_{p_t} \oplus \textbf{e}_{p_t}^{'} \oplus \textbf{e}_{p_{t-1}}^{'}] + \textbf{b}_3, \end{aligned}$$
(7)

For the same exercise, the codes students submit are different at each attempt, which can indicate their progress in the trial-and-error process. Therefore, we use students’ present code embedding \(\textbf{e}_{c_t}^{'}\), previous code embedding \(\textbf{e}_{c_{t-1}}^{'}\) and the difference between them \(\varDelta \textbf{e}_{c_t}\) to obtain the final code difference embedding \(\varDelta ^{'} \textbf{e}_{c_t}\):

$$\begin{aligned} \varDelta \textbf{e}_{c_t} & = \textbf{e}_{c_t}^{'} - \textbf{e}_{c_{t-1}}^{'}, \end{aligned}$$
(8)
$$\begin{aligned} \varDelta ^{'} \textbf{e}_{c_t} & = \textbf{W}_4^{T}[\varDelta \textbf{e}_{c_t} \oplus \textbf{e}_{c_t}^{'} \oplus \textbf{e}_{c_{t-1}}^{'}] + \textbf{b}_4, \end{aligned}$$
(9)

where \(\textbf{W}_3, \textbf{W}_4 \in \mathcal {R}^{3d_k \times d_{k}}\) are weight matrices and \(\textbf{b}_3, \textbf{b}_4 \in \mathcal {R}^{d_k}\) are bias terms.
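A compact sketch of Eqs. (6)-(9) is given below; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class DifferentiatingModule(nn.Module):
    """Exercise and code difference embeddings from Eqs. (6)-(9)."""

    def __init__(self, d_k=128):
        super().__init__()
        self.proj_p = nn.Linear(3 * d_k, d_k)  # W_3, b_3
        self.proj_c = nn.Linear(3 * d_k, d_k)  # W_4, b_4

    def forward(self, e_p_cur, e_p_prev, e_c_cur, e_c_prev):
        delta_p = e_p_cur - e_p_prev                                          # Eq. (6)
        delta_p = self.proj_p(torch.cat([delta_p, e_p_cur, e_p_prev], -1))    # Eq. (7)
        delta_c = e_c_cur - e_c_prev                                          # Eq. (8)
        delta_c = self.proj_c(torch.cat([delta_c, e_c_cur, e_c_prev], -1))    # Eq. (9)
        return delta_p, delta_c
```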

4.3 Updating Module

The purpose of this module is to update the latent states that represent the learner’s intrinsic learning style. Inspired by the classic learning style model FSLSM, we propose two hidden vectors, processing style \(\textbf{PS}_{t}\) and understanding style \(\textbf{US}_{t}\), to capture the programming learning style of learners. In addition, motivated by the programming knowledge tracing research [33], we introduce another hidden vector called programming ability \(\textbf{PA}_{t}\) to enhance the modeling of programming behavior.

First, we assume that all learners start with the same programming ability \(\textbf{PA}_{0}\), and their programming ability will gradually improve as they progress through exercises. The learners’ programming ability \(\textbf{PA}_{t}\) at time step t depends on their performance in completing the current exercises as well as their previous programming ability \(\textbf{PA}_{t-1}\). The corresponding update process is as follows:

$$\begin{aligned} \varDelta \textbf{PA} & = \textbf{W}_5^T[\textbf{e}_{p_t}^{'} \oplus \textbf{e}_{c_t}^{'}] + \mathbf {b_5}, \end{aligned}$$
(10)
$$\begin{aligned} \textbf{PA}_{t} & = \textbf{W}_6^T [\varDelta \textbf{PA} \oplus \textbf{PA}_{t-1}] + \mathbf {b_6}, \end{aligned}$$
(11)

where \(\textbf{W}_5, \textbf{W}_6 \in \mathcal {R}^{ 2d_k \times d_{k}}\) are weight matrices, \(\textbf{b}_5, \textbf{b}_6 \in \mathcal {R}^{d_k}\) are bias terms. When \(t=0\), \( \textbf{PA}_{0} \in \mathcal {R}^{d_k}\) is initialized as a vector of all zeros.
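The recurrence in Eqs. (10)-(11) can be unrolled over a learner's sequence as in the following sketch; the zero initialization of \(\textbf{PA}_0\) follows the text, while everything else (names, interface) is illustrative.

```python
import torch
import torch.nn as nn

class ProgrammingAbility(nn.Module):
    """Recurrent update of PA_t from Eqs. (10)-(11)."""

    def __init__(self, d_k=128):
        super().__init__()
        self.perf = nn.Linear(2 * d_k, d_k)    # W_5, b_5
        self.update = nn.Linear(2 * d_k, d_k)  # W_6, b_6

    def forward(self, e_p_seq, e_c_seq):
        # e_p_seq, e_c_seq: (batch, seq_len, d_k) enhanced exercise / code embeddings
        batch, seq_len, d_k = e_p_seq.shape
        pa = e_p_seq.new_zeros(batch, d_k)     # PA_0 initialized to zeros
        for t in range(seq_len):
            delta = self.perf(torch.cat([e_p_seq[:, t], e_c_seq[:, t]], -1))  # Eq. (10)
            pa = self.update(torch.cat([delta, pa], -1))                      # Eq. (11)
        return pa                              # PA_t at the last time step
```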

Similarly, the initial processing style \(\textbf{PS}_{0} \in \mathcal {R}^{d_k}\) at time \(t=0\) is also initialized as a vector of all zeros. As shown in Table 1, the learner's processing style mainly manifests in their continuous trial-and-error behavior on the same exercise. Leveraging the exercise difference \(\varDelta ^{'} \textbf{e}_{p_t}\) and code difference \(\varDelta ^{'} \textbf{e}_{c_t}\) generated by the differentiating module, we introduce a gating mechanism to update the learner's processing style vector \(\textbf{PS}_{t}\). We first calculate a selection gate \(\textbf{g}_{ps}\) using \(\varDelta ^{'} \textbf{e}_{p_t}\), which determines whether the current exercise is identical to the previous one. Then \(\textbf{g}_{ps}\) is multiplied by \(\varDelta ^{'} \textbf{e}_{c_t}\) to determine how much semantic information should be learned from the code. Finally, we concatenate the result with the previous processing style \(\textbf{PS}_{t-1}\) and employ a multi-layer perceptron to fuse these vectors as follows:

$$\begin{aligned} \textbf{g}_{ps} & = \tanh (\textbf{W}_7^T \varDelta ^{'} \textbf{e}_{p_t} + \textbf{b}_7), \end{aligned}$$
(12)
$$\begin{aligned} \textbf{PS}_{t} & = \textbf{W}_8^T[\textbf{PS}_{t-1} \oplus (\textbf{g}_{ps} \odot \varDelta ^{'} \textbf{e}_{c_t})] + \textbf{b}_8, \end{aligned}$$
(13)

where \(\tanh \) is the non-linear activation function, \(\textbf{W}_7 \in \mathcal {R}^{ d_k \times d_{k}}\), \(\textbf{W}_8 \in \mathcal {R}^{ 2d_k \times d_{k}}\), \(\textbf{b}_7, \textbf{b}_8 \in \mathcal { R}^{d_k}\) are trainable parameters, \(\odot \) is the vector element-wise product operation.

Another latent vector is the understanding style \(\textbf{US}_t\), which indicates whether learners prefer to learn step-by-step or in leaps and bounds. It is derived from the learner’s historical records. Thus, the initial understanding style \(\textbf{US}_{0} \in \mathcal {R}^{d_k}\) is also initialized as a vector of zeros. Similar to the processing style, we also employ a gating mechanism to determine whether the learner is attempting the same exercise, and subsequently update the current understanding style \(\textbf{US}_t\) based on the previous one \(\textbf{US}_{t-1}\):

$$\begin{aligned} \textbf{g}_{us} & = \tanh (\textbf{W}_9^T \varDelta ^{'} \textbf{e}_{p_t} + \textbf{b}_9), \end{aligned}$$
(14)
$$\begin{aligned} \textbf{US}_t & = \textbf{US}_{t-1} + \textbf{W}_{10}^T (\textbf{g}_{us} \odot \textbf{e}_{p_t}^{'}), \end{aligned}$$
(15)

where \(\textbf{W}_9, \textbf{W}_{10} \in \mathcal {R}^{ d_k \times d_{k}}\) are weight matrices, \(\textbf{b}_9 \in \mathcal {R}^{ d_k}\) is the bias term.
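A sketch of one update step for both style vectors (Eqs. (12)-(15)) is shown below; the single-step interface and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleUpdater(nn.Module):
    """Gated updates of PS_t and US_t from Eqs. (12)-(15)."""

    def __init__(self, d_k=128):
        super().__init__()
        self.gate_ps = nn.Linear(d_k, d_k)              # W_7, b_7
        self.fuse_ps = nn.Linear(2 * d_k, d_k)          # W_8, b_8
        self.gate_us = nn.Linear(d_k, d_k)              # W_9, b_9
        self.proj_us = nn.Linear(d_k, d_k, bias=False)  # W_10 (no bias in Eq. 15)

    def step(self, ps_prev, us_prev, delta_e_p, delta_e_c, e_p_cur):
        g_ps = torch.tanh(self.gate_ps(delta_e_p))                      # Eq. (12)
        ps = self.fuse_ps(torch.cat([ps_prev, g_ps * delta_e_c], -1))   # Eq. (13)
        g_us = torch.tanh(self.gate_us(delta_e_p))                      # Eq. (14)
        us = us_prev + self.proj_us(g_us * e_p_cur)                     # Eq. (15)
        return ps, us
```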

4.4 Predicting Module

After obtaining the learner's programming ability \(\textbf{PA}_t\), processing style \(\textbf{PS}_t\), and understanding style \(\textbf{US}_t\), we can predict the next exercise in the predicting module. First, the three intrinsic vectors are concatenated and projected through a fully connected layer to obtain \(\textbf{Pre}_t\). After that, we project \(\textbf{Pre}_t\) into an m-dimensional output space to obtain the final probability vector \(\textbf{p}_n\) over the exercises to be recommended at the next step.

$$\begin{aligned} \textbf{Pre}_t &= \textbf{W}_{11}^T[\textbf{PA}_t \oplus \textbf{PS}_t \oplus \textbf{US}_t] + \textbf{b}_{11}, \end{aligned}$$
(16)
$$\begin{aligned} \textbf{p}_n &= \textbf{W}_{12}^T \textbf{Pre}_t + \textbf{b}_{12} \end{aligned}$$
(17)

where \(\textbf{W}_{11} \in \mathcal {R}^{3d_k \times d_{k}}\) and \(\textbf{W}_{12} \in \mathcal {R}^{d_k \times d_{m}}\) are weight matrices, and \(\textbf{b}_{11} \in \mathcal {R}^{d_k}\) and \(\textbf{b}_{12} \in \mathcal {R}^{d_m}\) are bias terms.
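The predicting module thus reduces to two linear layers, as in the sketch below; turning the resulting scores into a top-10 recommendation list is our assumption, consistent with the HR@10/MRR@10/NDCG@10 evaluation in Sect. 5.

```python
import torch
import torch.nn as nn

class PredictingModule(nn.Module):
    """Next-exercise scores from Eqs. (16)-(17)."""

    def __init__(self, d_k=128, num_exercises=1000):
        super().__init__()
        self.fuse = nn.Linear(3 * d_k, d_k)          # W_11, b_11
        self.output = nn.Linear(d_k, num_exercises)  # W_12, b_12

    def forward(self, pa_t, ps_t, us_t):
        pre = self.fuse(torch.cat([pa_t, ps_t, us_t], -1))  # Eq. (16)
        return self.output(pre)                             # Eq. (17): scores over m exercises

# Example: top-10 recommendations from the scores.
# topk_ids = torch.topk(module(pa, ps, us), k=10, dim=-1).indices
```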

5 Experiments

In this section, we aim to evaluate the effectiveness of PERS on programming exercise recommendation through empirical evaluation and answer the following research questions:

  • RQ1: How does PERS perform compared with state-of-the-art pedagogical methods and sequential methods on programming exercise recommendation?

  • RQ2: What is the impact of different components on the performance of PERS?

  • RQ3: How do the primary hyperparameters influence the performance of our model?

  • RQ4: Can the proposed method learn meaningful intrinsic representations of students during programming?

5.1 Experimental Settings

Datasets. We evaluate our proposed method PERS on two real-world datasets: BePKT [33] and CodeNet [24]. Both datasets are collected from online judge systems and include problems, codes, and rich contextual information such as problem descriptions and code compilation results. Due to the millions of behaviors and contextual records in CodeNet, memory overflow occurs when processing contextual features such as code and problem descriptions. Therefore, we sample the CodeNet dataset based on sequence length and submission time, resulting in two smaller-scale datasets: CodeNet-len and CodeNet-time. A brief overview of each dataset follows:

  • BePKT: collected from an online judge system targeted at university education, whose users are primarily college students beginning to learn programming.

  • CodeNet: collected and processed by IBM researchers from two large-scale online judge systems, AIZU and AtCoder. The dataset contains hundreds of thousands of programming learners from different domains.

  • CodeNet-len: a subset of the CodeNet dataset, which only keeps learners’ programming behavioral sequences with lengths between 500 and 600.

  • CodeNet-time: a subset from the CodeNet dataset with submission timestamps between March and April 2020.

Table 2 presents detailed statistics for the above datasets. Specifically, the calculation formula of #Sparsity is as follows:

$$\begin{aligned} \#\text {Sparsity} & = 1 - \frac{\#\text {Interactions}}{\#\text {Learners} \times \#\text {Exercises}}, \end{aligned}$$
(18)
Table 2. Detailed statistics of all datasets in experiments, where #Learners denotes the number of learners, #Interactions denotes the number of interactions, #Exercises denotes the number of exercises, #Sparsity denotes the sparsity of the dataset, #Pass-Rate denotes the proportion of successful submissions in all submissions, and #APE (short for Avg-Attempts-Per-Exercise) denotes the average number of attempts on the same programming exercise.
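For reference, Eq. (18) can be computed directly from an interaction log, as in the following sketch; the column names are placeholders.

```python
import pandas as pd

def dataset_sparsity(log: pd.DataFrame) -> float:
    """Eq. (18): 1 - #Interactions / (#Learners x #Exercises).

    `log` is assumed to have one row per submission with columns
    'learner_id' and 'exercise_id' (column names are illustrative).
    """
    n_interactions = len(log)
    n_learners = log["learner_id"].nunique()
    n_exercises = log["exercise_id"].nunique()
    return 1.0 - n_interactions / (n_learners * n_exercises)
```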

Baselines. We compare PERS with the following eight baselines, which can be grouped into two categories:

  • Pedagogical methods: ACKRec [8] and LPKT [26] are two representative methods in e-learning recommendation. ACKRec constructs a heterogeneous information network to capture entity relationships. LPKT develops a model by simulating students’ learning processes.

  • Sequential methods: We introduce six state-of-the-art sequential models: 1) GRU4Rec [29], which introduces data augmentation on a recurrent neural network to improve model performance; 2) GRU4Recf [11], which further integrates parallel recurrent neural networks to simultaneously represent clicks and feature vectors within interactions; 3) BERT4Rec [27], which introduces a bidirectional self-attention mechanism based on BERT [3]; 4) SRGNN [32], which converts user behavior sequences into graph-structured data and introduces a graph neural network to capture the relationships between items; 5) SINE [28], which proposes a sparse interest network to adaptively generate dynamic preferences; and 6) CORE [12], which designs a representation-consistency model to pull the vectors into the same space.

Since none of the above baselines incorporates code as model input, for a fair comparison we also implement a degraded version of PERS:

  • ERS: removes the code feature input and all related downstream modules from PERS.

Evaluation Metrics. To fairly compare different models, following previous work [8], we choose HR@10 (Hit Ratio), NDCG@10 (Normalized Discounted Cumulative Gain), and MRR@10 (Mean Reciprocal Rank) as the evaluation metrics.
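For single-target next-exercise prediction, these metrics reduce to simple rank statistics. The following sketch shows one plausible implementation; it assumes each test sequence has exactly one ground-truth next exercise, which matches the PER definition but is otherwise our own formulation rather than the evaluation code actually used.

```python
import math

def rank_metrics(ranked_lists, targets, k=10):
    """HR@k, MRR@k and NDCG@k for single-target next-exercise prediction.

    ranked_lists: list of lists of exercise ids sorted by predicted score.
    targets: list of ground-truth next exercises, one per sequence.
    """
    hr = mrr = ndcg = 0.0
    for ranking, target in zip(ranked_lists, targets):
        top_k = ranking[:k]
        if target in top_k:
            rank = top_k.index(target) + 1        # 1-based rank
            hr += 1.0
            mrr += 1.0 / rank
            ndcg += 1.0 / math.log2(rank + 1)     # one relevant item, so IDCG = 1
    n = len(targets)
    return hr / n, mrr / n, ndcg / n
```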

Table 3. The overall performance on two full datasets and two sample datasets. OOM refers to out of memory.

Training Details. For the pedagogical methods, we use the original code released by their authors. Additionally, we implement the PERS model and the other baseline models using PyTorch and the RecBole library. We run all experiments on a server with 64 GB of memory and two NVIDIA Tesla V100 GPUs. For all models, we set the maximum sequence length to 50, the training batch size to 2048, the test batch size to 4096, and the optimizer to Adam. For the PERS model, we set the exercise and code representation embedding dimensions to 128. We tune the learning rate over \(\{0.1, 0.01, 0.001\}\), the number of layers over \(\{1, 2, 3\}\), and the dropout rate over \(\{0.1, 0.3, 0.5\}\). For all methods, we fine-tune the hyperparameters to achieve the best performance and run each experiment three times to report the average results.
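As a concrete illustration, a sequential baseline such as GRU4Rec could be run with RecBole roughly as follows; the dataset name and any settings not stated above are placeholders, and the exact configuration used in the paper may differ.

```python
from recbole.quick_start import run_recbole

# Hyperparameters mirroring the training details above; the dataset name and
# any values not stated in the paper are assumptions.
config_dict = {
    "MAX_ITEM_LIST_LENGTH": 50,   # maximum sequence length
    "train_batch_size": 2048,
    "eval_batch_size": 4096,
    "learner": "adam",            # optimizer
    "learning_rate": 0.001,       # tuned over {0.1, 0.01, 0.001}
    "metrics": ["Hit", "MRR", "NDCG"],
    "topk": [10],
}

run_recbole(model="GRU4Rec", dataset="codenet-len", config_dict=config_dict)
```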

5.2 RQ1: Overall Performance

Table 3 summarizes the performance results. We evaluate the methods on four datasets under three evaluation metrics. The best results are highlighted in bold and the best baselines are underlined. From the results in Table 3, we make the following observations:

  • Our proposed models, PERS and ERS, demonstrate state-of-the-art performance on large-scale programming learning datasets. For instance, on the CodeNet dataset, our models exhibit a significant improvement of 1.41% on HR@10, 1.30% on MRR@10, and 1.12% on NDCG@10 over the best baseline.

  • Code features can significantly improve the performance of the model. On the BePKT, CodeNet-len, and CodeNet-time datasets, the PERS model with code features outperforms the ERS model. This finding highlights that code-related features contribute to modeling students' programming learning preferences.

  • RNN-based sequential models exhibit superior capabilities in capturing learning behaviors. Our PERS and ERS, which extend recurrent neural networks, achieve the best performance. Additionally, the GRU4Rec and GRU4Recf models, designed based on recurrent neural networks, outperform all other sequential methods. This observation suggests that RNNs are particularly adept at capturing sequential programming behaviors.

5.3 RQ2: Ablation Study

We conduct an ablation study on PERS to understand the importance of its primary components. We obtain five variants: 1) PERS-ep, which removes the exercise position encoding; 2) PERS-cr, which removes the code representation; and 3) PERS-pa, 4) PERS-ps, and 5) PERS-us, which remove \(\mathbf {PA_t}\), \(\mathbf {PS_t}\), and \(\mathbf {US_t}\), respectively. Figure 3 displays the results of the PERS model and its variants on the CodeNet-len and CodeNet-time datasets. From the figure, we can observe:

Fig. 3. Ablation study results on CodeNet-len (left) and CodeNet-time (right)

  • Both the representing and updating modules play a crucial role in capturing programming behavior. As can be observed, the removal of any component of PERS adversely affects its performance, which emphasizes the rationality and effectiveness of the proposed methods.

  • The impact of different components varies across different stages of learning. Specifically, on the CodeNet-len dataset, the performance is significantly affected when removing the position encoding of exercises (PERS-ep variant). On the other hand, on the CodeNet-time dataset, the performance sharply declines when the code representation (PERS-cr variant) is removed. This is because the CodeNet-len dataset comprises the latter part of students' behavioral sequences, where students have developed a fixed behavioral pattern. Consequently, the representation of the exercises significantly impacts the model's performance. Similarly, removing different intrinsic latent vectors leads to different degrees of performance decline. The finding indicates that the processing style is more critical in the initial learning stages, while the understanding style is more influential as the learning pattern becomes more fixed.

5.4 RQ3: Sensitivity Analysis of Hyperparameters

We conduct a sensitivity analysis on the hyperparameters of PERS with two datasets: CodeNet-len and CodeNet-time. In particular, we study three main hyperparameters: 1) sequence length \(\lambda \in \{50, 100, 150, 200\}\), 2) dimension of exercise embedding \(d_p\in \{32, 64, 128, 256\}\), and 3) dimension of code embedding \(d_c\in \{32, 64, 128, 256\}\). In our experiments, we vary one parameter at a time with the others fixed. Figure 4 illustrates the impact of each hyperparameter, from which we obtain the following observations:

Fig. 4. Influence of three key hyperparameters on the performance of PERS

  • Our model is capable of capturing long sequence dependencies. In Fig. 4(a), PERS performs better as the sequence length increases, while the results of GRU4Rec remain unchanged or even decline.

  • As shown in Fig. 4(b), the performance of both PERS and GRU4Rec initially improves and then declines as the dimension of exercise embedding increases. The optimal performance is achieved at approximately \(d_p=128\).

  • As the dimension of code embedding increases, the performance of PERS in Fig. 4(c) shows a consistent enhancement, highlighting the significance of code features in capturing programming learning patterns.

5.5 RQ4: Case Study on Visualization Analysis

To demonstrate the interpretability of our approach, we conduct a visualization analysis of the three latent vectors involved in PERS, i.e., programming ability \(\textbf{PA}_{t}\), processing style \(\textbf{PS}_{t}\), and understanding style \(\textbf{US}_{t}\). We randomly select the behavioral sequences of two students from the CodeNet dataset for the case study. From the exercise sequence of each student, we can observe that \(\textbf{u}_{222602662}\) tends to make multiple attempts at the same exercise and solve problems in a systematic manner, while \(\textbf{u}_{737111725}\) prefers solving problems by leaps and bounds. We extract the three intrinsic vectors from the last time step of the model and visualize the dimensionality reduction results in Fig. 5. We note the following observations in the visualization results:

  • The extent of variation in students' programming abilities differs between intra-exercise and inter-exercise attempts. Taking \(\textbf{u}_{222602662}\) as an example, as he made multiple attempts on \(\textbf{p}_{03053}\), his programming ability continuously improved. However, when he attempted the next exercise, his \(\textbf{PA}_{t}\) showed a noticeable decline. Therefore, fine-grained modeling of intra- and inter-exercise attempts contributes to better capturing students' learning states.

  • The changing patterns of learning styles among different students are consistent with their learning processes. For \(\textbf{u}_{222602662}\), the values of \(\textbf{PS}_{t}\) and \(\textbf{US}_{t}\) gradually approach 1 during the programming learning process, suggesting a reflective and sequential learning style. As for \(\textbf{u}_{737111725}\), his corresponding latent vectors exhibit a gradual tendency towards -1, indicating an active and global learning style. This shows that the latent vectors can learn valuable information, thereby validating the rationality of our model.

Fig. 5. Case study on latent vector visualization

6 Conclusions

In this paper, we study programming exercise recommendation (PER) to enhance engagement on online programming learning platforms. To solve PER, we propose a novel model called PERS that simulates learners' intricate programming behaviors. First, we extend the Felder-Silverman learning style model to the programming learning domain and present the programming learning style. On this basis, we construct latent vectors to model learners' states, including programming ability, processing style, and understanding style. In particular, we introduce a differentiating module to update the states based on enhanced context, namely positions for exercises and compilation results for codes. Finally, the updated states at the last time step are used for prediction. Extensive experiments on two real-world datasets demonstrate the effectiveness and interpretability of our approach. In future work, we will explore incorporating differences in the structural features of students' submitted code to further enhance the performance of the model.