
1 Introduction

In recent years, MOOCs have received widespread attention because they break through the constraints of time and space [1]. However, studies point out that fewer than 10% of users complete the courses they take and receive the corresponding certificates [9], which has become a major obstacle to the development of MOOCs. It is therefore crucial to identify users who show a tendency to drop out early in the learning process, so that timely and appropriate measures can be taken to keep them learning.

Most researchers have treated dropout prediction as a binary classification problem and addressed it with machine learning, predicting whether a user will drop out by modeling the user’s behaviors. For example, Chen et al. [2] combined decision trees and extreme learning machines to make predictions. Jin et al. [10] calculated and optimized the weights of training samples based on the definition of the max neighborhood. Nitta et al. [13] extracted the relationships among users’ actions with tensor decomposition and a Transformer. Zhang et al. [19] analyzed users’ learning behavior and pointed out that introductory learning resources help guide users and prevent them from dropping out. Feng et al. [5] proposed a model that uses a CNN to smooth the context and integrates the attribute information of users and courses with an attention mechanism. However, such studies use only user- or course-based statistics as contextual information and ignore the deeper correlations among entities, such as the classmate relationship between users who have taken the same course or the correlation between courses taken by the same user. These correlations are complex and diverse; if they can be exploited to describe the features of users and courses, dropout prediction will better reflect users’ actual situations.

Fig. 1. The architecture of the proposed approach

A meta-path [20] is a composite path connecting a pair of entities, through which we can capture the rich and diverse structural and semantic information in a network as well as introduce prior knowledge. It has therefore been widely applied to data mining tasks such as node classification [16], link prediction [3] and recommendation [4, 7], but to the best of our knowledge no existing work employs meta-paths for dropout prediction. The MOOC learning scenario typically contains three types of entities (i.e., user, course, video) and rich semantic relations among them (e.g., the elective relation between a user and a course, the subordinate relation between a video and a course, the watching relation between a user and a video). Inspired by meta-paths, we design multiple entity triads to explore the correlations among entities, such as \({<}\)user,course,user\({>}\) and \({<}\)course,video,user\({>}\). \({<}\)user,course,user\({>}\) implies that two users have taken the same course, while \({<}\)course,video,user\({>}\) indicates that a course is equipped with some videos, and these videos have recently been watched by some users.

Based on such entity triads, we propose an approach named structural and temporal learning (STL) for dropout prediction in MOOCs. On the one hand, a hierarchical neural network is proposed to extract the structural information of users and courses according to the entity triads designed for them. In this network, relevance calculation assists in generating the initial representations of nodes, and intra-correlation and inter-correlation calculations then enable the network to automatically focus on important neighbor nodes and correlations. On the other hand, the activity information is processed by a Bidirectional Long Short-Term Memory (Bi-LSTM) [8] network to extract temporal features. Finally, the structural and temporal features are fused to predict dropout. The proposed STL is evaluated on a public real-world dataset called MOOCCube [18] and compared with several state-of-the-art methods. The evaluations demonstrate the effectiveness of STL.

2 The Proposed Method

2.1 Problem Description

Given the video click stream data, we extract the set of users U, the set of courses C and the set of videos V. If a user \(u_i\in U\) takes a course \(c_j \in C\), the purpose of our study is to predict whether \(u_i\) will drop out of \(c_j\) in the future. Figure 1 illustrates the overall framework of the proposed model, which includes the structural feature extraction based on the hierarchical neural network shown in Fig. 2 and the temporal activity feature extraction based on Bidirectional Long Short-Term Memory. The main notations used in this paper and their explanations are presented in Table 1.

Table 1. Explanations of the main notations used in this paper.

2.2 Hierarchical Neural Network

Since the complex correlations among entities in MOOCs may affect dropout, a heterogeneous information graph G is constructed to model the MOOC scenario, and a hierarchical neural network, shown in Fig. 2, is proposed to extract the structural features among entities. The graph G contains user, course and video nodes based on the sets U, C and V, and edges between different types of nodes carry different meanings: a user-course edge represents the elective relationship, a user-video edge represents the watching relationship and a course-video edge represents the subordinate relationship. Then, based on prior knowledge, a triad set \(t_u\) = [\(t_u^1\), \(\dots \), \(t_u^m\)] with m triads for users and a triad set \(t_c\) = [\(t_c^1\), \(\dots \), \(t_c^n\)] with n triads for courses are designed. Each triad \(\eta \in t_u\) can be denoted as \({<}U, X, Y{>}\), where X = [\(x_1\), \(\dots \), \(x_{n_x}\)] and Y = [\(y_1\), \(\dots \), \(y_{n_y}\)] are subsets of entities, and \(n_x\) and \(n_y\) are the numbers of elements in X and Y, respectively. Similarly, each triad \(\xi \in t_c\) can be denoted as \({<}C, X, Y{>}\).
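As a concrete illustration, the following sketch (not the released implementation; function and attribute names are our own assumptions) builds such a heterogeneous graph with networkx and lists the triads used later in Sect. 3.2.

```python
# A minimal sketch of constructing the heterogeneous information graph G
# from the extracted relations; node/edge attribute names are illustrative.
import networkx as nx

def build_hin(enrollments, watches, course_videos):
    """enrollments: (user, course) pairs; watches: (user, video) pairs;
    course_videos: (course, video) pairs."""
    G = nx.Graph()
    for u, c in enrollments:
        G.add_node(u, ntype="user")
        G.add_node(c, ntype="course")
        G.add_edge(u, c, etype="elective")        # user takes course
    for u, v in watches:
        G.add_node(v, ntype="video")
        G.add_edge(u, v, etype="watching")        # user watches video
    for c, v in course_videos:
        G.add_edge(c, v, etype="subordinate")     # video belongs to course

    # Triads expressed as ordered type sequences, e.g. <user, course, user>.
    t_u = [("user", "course", "user"),
           ("user", "video", "course"),
           ("user", "video", "user")]
    t_c = [("course", "user", "course"),
           ("course", "video", "user")]
    return G, t_u, t_c
```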

Fig. 2. The framework of hierarchical neural network

We now present how to extract structural features with the hierarchical neural network, taking a triad \(\eta \in t_u\) as an example. First, we form a first-order neighbor set \(N_{\eta }^1(u_i)\) for the target node \(u_i\) by randomly sampling \(n_1\) nodes from the neighbor set \(X(u_i)\). Let \(u_i^{l_1}\) be a node of \(N_{\eta }^1(u_i)\). For each \(u_i^{l_1} \in N_{\eta }^1(u_i)\), a subset of the second-order neighbors of \(u_i\), \(N_{\eta }^1(u_i^{l_1})\), is obtained by randomly sampling \(n_2\) nodes from the neighbor set \(Y(u_i^{l_1})\). By the above operation, we obtain the sampled second-order neighbor set of \(u_i\): \(N_{\eta }^2(u_i)=\left\{ N_{\eta }^1(u_i^{l_1}), \forall u_i^{l_1} \in N_{\eta }^1(u_i)\right\} \). Let \(u_i^{l_2}\) be a node of \(N_{\eta }^2(u_i)\).
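A small sketch of this two-stage sampling, assuming the graph G and the triads built as in the earlier sketch (helper names are hypothetical):

```python
import random

def sample_neighbors(G, node, ntype, k):
    """Randomly sample up to k neighbors of `node` whose type is `ntype`."""
    cands = [n for n in G.neighbors(node) if G.nodes[n]["ntype"] == ntype]
    return random.choices(cands, k=k) if cands else []   # with replacement

def sample_triad_neighbors(G, u_i, triad, n1, n2):
    """triad = (type of u_i, type of X, type of Y); returns the sampled
    second-order neighbor set N^2_eta(u_i) as a flat list."""
    _, x_type, y_type = triad
    first_order = sample_neighbors(G, u_i, x_type, n1)    # N^1_eta(u_i)
    second_order = []
    for x in first_order:
        second_order.extend(sample_neighbors(G, x, y_type, n2))
    return second_order
```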

\(\bullet \) Relevance Calculation. To provide a suitable input to the hierarchical neural network, we use HeteSim [14] to calculate the relevance score between \(u_i^{l_2}\) and \(u_i\):

$$\begin{aligned} R(u_i^{l_2},u_i| \eta ) = \frac{|TI_{u_i^{l_2}\sim u_i}|}{|O(u_i^{l_2})||I(u_i)|}, \end{aligned}$$
(1)

where \(TI_{u_i^{l_2}\sim u_i}\) denotes the set of triad instances between \(u_i^{l_2}\) and \(u_i\) following the triad \(\eta \), \(O(u_i^{l_2})\) denotes the out-degree of \(u_i^{l_2}\), and \(I(u_i)\) denotes the in-degree of \(u_i\). Note that it is unreasonable for \(R(u_i^{l_2},u_i| \eta )\) to differ from 1 when \(u_i^{l_2}\) is the same node as \(u_i\). To address this, we normalize the score using the cosine of the probability distributions with which \(u_i^{l_2}\) and \(u_i\) arrive at the nodes \(x_j\) of the set X:

$$\begin{aligned} R^{\prime }(u_i^{l_2},u_i| \eta ) = \frac{R(u_i^{l_2},u_i| \eta )}{\sqrt{\sum \limits _{j=1}^{n_x} P^{2}\left( u_i^{l_2}, x_{j}\right) } \cdot \sqrt{\sum \limits _{j=1}^{n_x} P^{2}\left( x_{j}, u_i\right) }}, \end{aligned}$$
(2)

where \(P\left( u_i^{l_2}, x_{j}\right) \) and \(P\left( x_{j}, u_i\right) \) denote the probability of starting from \(u_i^{l_2}\) to \(x_j\) and the probability of starting from \(x_j\) to \(u_i\) under the triad \(\eta \), respectively. Then the relevance-guided embedding \(f(u_i^{l_2}) \in \mathbb {R}^{1 \times d_I}\) can be obtained as

$$\begin{aligned} f(u_i^{l_2}) = Xavier(u_i^{l_2}) *R^{\prime }(u_i^{l_2},u_i| \eta ). \end{aligned}$$
(3)

where \(Xavier(u_i^{l_2}) \in \mathbb {R}^{1 \times d_I}\) is a trainable parameter vector of dimension \(d_I\) initialized with the Xavier [6] initializer.
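The computation of Eqs. (1)-(3) can be sketched as follows; this is a hedged reading under the assumptions that the graph is undirected (so in-/out-degrees reduce to type-restricted degrees) and that arrival probabilities follow a uniform random walk, with illustrative names throughout.

```python
import numpy as np

def relevance(G, u_l2, u_i, x_type):
    """HeteSim-style relevance of Eqs. (1)-(2): count triad instances
    u_l2 - x - u_i through middle nodes of type x_type and normalize."""
    xs_from_l2 = {n for n in G.neighbors(u_l2) if G.nodes[n]["ntype"] == x_type}
    xs_to_i    = {n for n in G.neighbors(u_i)  if G.nodes[n]["ntype"] == x_type}
    if not xs_from_l2 or not xs_to_i:
        return 0.0
    r = len(xs_from_l2 & xs_to_i) / (len(xs_from_l2) * len(xs_to_i))   # Eq. (1)
    # Uniform random-walk probabilities of reaching each middle node x_j.
    p_out = np.full(len(xs_from_l2), 1.0 / len(xs_from_l2))
    p_in  = np.full(len(xs_to_i),    1.0 / len(xs_to_i))
    return r / (np.linalg.norm(p_out) * np.linalg.norm(p_in))          # Eq. (2)

def relevance_guided_embedding(xavier_vec, rel_score):
    """Eq. (3): scale the trainable Xavier-initialized vector by the score."""
    return xavier_vec * rel_score
```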

\(\bullet \) Intra-correlation Calculation. Based on the second-order neighbor set \(N_\eta ^2(u_i)\) and the relevance-guided embedding \(f(u_i^{l_2})\), the weight \(\alpha _\eta \) between \(u_i\) and its sampled neighbor \(u_i^{l_2}\) is obtained by \(\alpha _\eta = softmax(v \cdot tanh(f(u_i^{l_2}) \cdot w_1 + b_1))\), where v, \(w_1\) and \(b_1\) are trainable parameters. Then the correlation-specific feature \(f_\eta (u_i) \in \mathbb {R}^{1 \times d_I}\) can be calculated as:

$$\begin{aligned} f_\eta (u_i) = \sum \nolimits _{u_i^{l_2} \in N_\eta ^2(u_i)}(\alpha _\eta *f(u_i^{l_2})) \end{aligned}$$
(4)
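In code, this soft-attention aggregation (Eq. (4)) amounts to the following numpy sketch; parameter shapes are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def intra_correlation(neighbor_embs, w1, b1, v):
    """neighbor_embs: (num_neighbors, d_I) relevance-guided embeddings;
    w1: (d_I, d_I), b1: (d_I,), v: (d_I,) trainable parameters."""
    scores = np.tanh(neighbor_embs @ w1 + b1) @ v    # one scalar per neighbor
    alpha = softmax(scores)                          # attention weights
    return alpha @ neighbor_embs                     # f_eta(u_i), shape (d_I,)
```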

\(\bullet \) Inter-correlation Calculation. After iterating over the triad sets \(t_u\) and \(t_c\), respectively, we obtain the correlation features \(f_{t_u}(u_i)\) = \(f_{t_u^1}(u_i) \oplus \dots \oplus f_{t_u^m}(u_i)\) and \(f_{t_c}(c_j)\) = \(f_{t_c^1}(c_j) \oplus \dots \oplus f_{t_c^n}(c_j)\) for the user-course pair (\(u_i\), \(c_j\)), where \(f_{t_u}(u_i) \in \mathbb {R}^{m \times d_I}\), \(f_{t_c}(c_j) \in \mathbb {R}^{n \times d_I}\) (abbreviated as \(f_{t_u}\) and \(f_{t_c}\)) and \(\oplus \) denotes the concatenation operation. In order to incorporate multiple types of correlations, we adopt self-attention [15] to calculate the attention value between each pair of correlations. First, we compute \(Q_u = \sigma (f_{t_u}\cdot w_q + b_q)\), \(E_u = \sigma (f_{t_u}\cdot w_e + b_e)\) and \(Z_u = \sigma (f_{t_u}\cdot w_z + b_z)\), where \(Q_u\), \(E_u\) and \(Z_u \in \mathbb {R}^{m \times d_a}\), \(d_I {<} d_a\), \(\sigma (\cdot )\) is the sigmoid function with an output between 0 and 1, and \(w_q, w_e, w_z, b_q, b_e, b_z\) are trainable parameters. Then the converged structural feature \(\widetilde{f_{t_u}}\) \(\in \mathbb {R}^{m \times d_a}\) for user \(u_i\) is calculated as

$$\begin{aligned} \begin{aligned} \widetilde{f_{t_u}} = {\text {softmax}}\left( \frac{Q_u E_u^{T}}{\sqrt{d_{a}}}\right) *Z_u \end{aligned} \end{aligned}$$
(5)

where \(E_u^{T}\) denotes the transpose of \(E_u\). Similarly, the converged structural feature \(\widetilde{f_{t_c}} \in \mathbb {R}^{n \times d_a}\) for course \(c_j\) is calculated as

$$\begin{aligned} \begin{aligned} \widetilde{f_{t_c}} = {\text {softmax}}\left( \frac{Q_c E_c^{T}}{\sqrt{d_{a}}}\right) *Z_c \end{aligned} \end{aligned}$$
(6)

where \(Q_c\), \(E_c\) and \(Z_c \in \mathbb {R}^{n \times d_a}\) are obtained by feeding \(f_{t_c}\) into three linear layers, respectively, and \(E_c^{T}\) denotes the transpose of \(E_c\). Additionally, in order to further fuse the converged features, the final structural feature \(f_{S}(u_i, c_j) \in \mathbb {R}^{1 \times d_s}\) of dimension \(d_s\) is obtained by:

$$\begin{aligned} \begin{aligned} f_{S}(u_i, c_j)=\sigma (\delta (\widetilde{f_{t_u}} \oplus \widetilde{f_{t_c}}) \cdot w_2 + b_2), \end{aligned} \end{aligned}$$
(7)

where \(w_2\) and \(b_2\) are trainable parameters, and \(\delta (\widetilde{f_{t_u}} \oplus \widetilde{f_{t_c}})\) flattens the matrix \((\widetilde{f_{t_u}} \oplus \widetilde{f_{t_c}}) \in \mathbb {R}^{(m+n) \times d_a}\) into a row vector in \(\mathbb {R}^{1\times (m+n)d_a}\). The overall flow of the hierarchical neural network is given in Algorithm 1.

Algorithm 1. The overall flow of the hierarchical neural network
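A hedged numpy sketch of Eqs. (5)-(7), i.e., self-attention over the per-triad features followed by flattening and fusion (parameter shapes are our assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def inter_correlation(f_t, w_q, w_e, w_z, b_q, b_e, b_z):
    """f_t: (num_triads, d_I) stacked correlation-specific features."""
    Q = sigmoid(f_t @ w_q + b_q)                     # (num_triads, d_a)
    E = sigmoid(f_t @ w_e + b_e)
    Z = sigmoid(f_t @ w_z + b_z)
    d_a = Q.shape[-1]
    return row_softmax(Q @ E.T / np.sqrt(d_a)) @ Z   # Eq. (5) / Eq. (6)

def fuse_structural(f_tu_conv, f_tc_conv, w2, b2):
    """Eq. (7): flatten the concatenated (m+n, d_a) matrix and project to d_s."""
    flat = np.concatenate([f_tu_conv, f_tc_conv], axis=0).reshape(1, -1)
    return sigmoid(flat @ w2 + b2)                   # f_S(u_i, c_j), shape (1, d_s)
```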

2.3 Temporal Activity Feature Extraction

Users who drop out may exhibit dramatically different learning behaviors over time, especially in the early stage of their learning. Therefore, modeling users’ learning behaviors based on temporal relationships is crucial to dropout prediction. To this end, a recurrent network, Bidirectional Long Short-Term Memory (Bi-LSTM) [8], is applied in our model to extract the temporal activity feature \(f_T(u_i,c_j) \in \mathbb {R}^{1 \times d_t}\) for the user-course pair (\(u_i\), \(c_j\)).

From the video click stream data, some statistics can be extracted, such as the number of times the user watches videos and the number of days the user is active on the platform. To express the user’s activity information, an activity sequence \(A(u_i,c_j) = [a_1, \cdots , a_e, \cdots , a_d]\) is established based on the data of the first d days after \(u_i\) started learning in \(c_j\), where \(a_e\) is a row vector containing a fixed number of activity types.

In the Bi-LSTM model, a forward LSTM network and a reverse LSTM network jointly capture the past and future contextual information. We take the generation of the activity feature \({h_e}\) on the e-th day as an example. For the memory cell at the e-th time step, the forget gate \(f_e\), the input gate \(i_e\) and the output gate \(o_e\) control the information flowing into and out of the current memory cell. \(f_e\), \(i_e\) and \(o_e\) are calculated by the following equations,

$$\begin{aligned} \left\{ \begin{aligned} f_{e}=\sigma \left( w_{f} \cdot a_{e} + w_{f}^{\prime } \cdot h_{e-1} + b_{f}\right) \\ i_{e}=\sigma \left( w_{i} \cdot a_{e} + w_{i}^{\prime } \cdot h_{e-1} + b_{i}\right) \\ o_{e}=\sigma \left( w_{o} \cdot a_{e }+ w_{o}^{\prime } \cdot h_{e-1} + b_{o}\right) \end{aligned} \right. \end{aligned}$$
(8)

where \(w_{f}, w_{f}^{\prime },w_{i}, w_{i}^{\prime }, w_{o}, w_{o}^{\prime }, b_{f}, b_{i}, b_{o}\) are trainable parameters and \(h_{e-1}\) is the value of the previous hidden layer. Then the value of the current memory cell \(C_e\) is obtained by selectively forgetting the previous information and appropriately adding the current information as \(C_{e}=f_{e} * C_{e-1}+i_{e} * \tilde{C}_{e}\). Here \(\tilde{C}_{e}=\tanh \left( w_{c} \cdot a_{e} + w_{c}^{\prime } \cdot h_{e-1} + b_{c}\right) \) denotes the candidate information for the current time step, and \(w_{c}, w_{c}^{\prime }, b_{c}\) are trainable parameters. Once the current memory cell \(C_e\) is updated, the activity feature \(h_e\) for the e-th time step can be obtained as

$$\begin{aligned} \begin{aligned} h_{e}=o_{e} * \tanh \left( C_{e}\right) . \end{aligned} \end{aligned}$$
(9)

Similarly, we can obtain the activity feature for each time step of the forward and reverse LSTM networks: \(\overrightarrow{h} = [\overrightarrow{h_1}, \cdots , \overrightarrow{h_e}, \cdots , \overrightarrow{h_d}]\) and \(\overleftarrow{h} = [\overleftarrow{h_1}, \cdots , \overleftarrow{h_e}, \cdots , \overleftarrow{h_d}]\). In addition, to further represent the temporal relationship of user activities, the final temporal activity feature \(f_T(u_i, c_j)\) is obtained by adding the forward and reverse activity features and mapping them to a higher dimension,

$$\begin{aligned} \begin{aligned} f_T(u_i, c_j) = tanh(w_3(\overrightarrow{h} + \overleftarrow{h}) + b_3). \end{aligned} \end{aligned}$$
(10)

In the above equation, \(f_T(u_i, c_j) \in \mathbb {R}^{1\times d_t}\), \(d_t\) is the same as \(d_s\), \(w_3\) and \(b_3\) are trainable parameters.
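The temporal branch can be sketched in tf.keras roughly as follows; the layer sizes and the use of only the final hidden states are simplifying assumptions, not the paper's exact settings.

```python
import tensorflow as tf

d_days, n_activity_types, d_t = 7, 10, 64            # illustrative sizes

activity_seq = tf.keras.Input(shape=(d_days, n_activity_types))
# Bi-LSTM with forward and backward states summed, in the spirit of Eq. (10).
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(d_t), merge_mode="sum")(activity_seq)
f_T = tf.keras.layers.Dense(d_t, activation="tanh")(h)   # mapping of Eq. (10)
temporal_branch = tf.keras.Model(activity_seq, f_T)
temporal_branch.summary()
```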

2.4 Model Learning

Based on the set of users U and the set of courses C, if there exist K user-course elective pairs, then the prediction score \(\hat{y_k} \in (0,1)\) of whether a user \(u_i\) drops out of a course \(c_j\) can be obtained by

$$\begin{aligned} \begin{aligned} \hat{y_k}=sigmoid(MLP(f_S(u_i, c_j) \oplus f_T(u_i, c_j))), \end{aligned} \end{aligned}$$
(11)

where \(MLP(\cdot )\) is a Multi-Layer Perceptron and \(sigmoid(\cdot )\) is the sigmoid layer with an output between 0 and 1. All the parameters in our model are trained by minimizing the following objective function:

$$\begin{aligned} \begin{aligned} {\text {Loss}}(\varTheta )=-\sum _{k \in [1, K]}\left[ y_{k} \log \left( \hat{y}_{k}\right) +\left( 1-y_{k}\right) \log \left( 1-\hat{y}_{k}\right) \right] +\lambda ||\varTheta ||_2^2, \end{aligned} \end{aligned}$$
(12)

where \(\varTheta \) is the parameter set of the proposed model, \(y_k\) denotes the corresponding ground truth of user \(u_i\) in course \(c_j\) and \(\lambda \) is the regularization parameter.
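As a hedged illustration of Eqs. (11)-(12), the prediction head can be assembled in tf.keras as below; the hidden layer size is an assumption, and L2 regularization is attached to the dense layers to mimic the \(\lambda ||\varTheta ||_2^2\) term.

```python
import tensorflow as tf

d_s = d_t = 64                                        # illustrative dimensions
lam = 1e-4                                            # regularization parameter

f_S = tf.keras.Input(shape=(d_s,))                    # structural feature
f_T = tf.keras.Input(shape=(d_t,))                    # temporal feature
x = tf.keras.layers.Concatenate()([f_S, f_T])
x = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(lam))(x)
y_hat = tf.keras.layers.Dense(
    1, activation="sigmoid",
    kernel_regularizer=tf.keras.regularizers.l2(lam))(x)   # Eq. (11)

model = tf.keras.Model([f_S, f_T], y_hat)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="binary_crossentropy",             # cross-entropy of Eq. (12)
              metrics=[tf.keras.metrics.AUC()])
```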

3 Experiments

3.1 Dataset and the Definition of Dropout

The dataset used in this paper is from MOOCCube [18], a large-scale data repository, which stores more than 700 courses, 38k videos and 200k students. The user log file in MOOCCube records 4,873,530 video watch logs of 48,639 learners enrolled in 685 courses from 26 June 2015 to 16 April 2020.

It is difficult to define dropout, because a user can be inactive for a period of time without dropping out of a course, and continue learning later. Inspired by [12], we introduce the concept of the inactive period, i.e., the maximum of the periods between interactions and the period from the last interaction to the end of data collection. According to the statistics of MOOCCube, over 95% of users who are inactive for 365 days actually give up studying, so 365 days is chosen as the inactive period for considering dropout. In addition, unlike assignments and exams, videos, as a core resource of MOOCs, are widely available in different courses, so we give a novel definition of dropout that combines the inactive period and the percentage of watched videos. Specifically, if a user \(u_i\) has been inactive for more than 365 days and has not watched 80% of the videos in a course \(c_j\), then this enrollment record is marked as “dropout”.
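This labeling rule can be written compactly as follows; the field names and helper signature are our own assumptions.

```python
from datetime import timedelta

def is_dropout(watch_times, watched_videos, course_videos, collection_end,
               inactive_days=365, watch_ratio=0.8):
    """watch_times: sorted interaction datetimes of the user in the course;
    watched_videos / course_videos: sets of video ids."""
    gaps = [b - a for a, b in zip(watch_times, watch_times[1:])]
    gaps.append(collection_end - watch_times[-1])     # inactivity until data end
    inactive_period = max(gaps)
    coverage = len(watched_videos & course_videos) / max(len(course_videos), 1)
    return (inactive_period > timedelta(days=inactive_days)
            and coverage < watch_ratio)
```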

Based on the above definition, we obtained a dataset containing 232,864 enrollment records generated by 47,074 users in 556 courses. There are 220,045 enrollment records for dropouts. We divided the dataset into training and test sets in the ratio of 7:3, with the same proportion of positive and negative samples. In the following experiments, we use the user’s seven-day activity log to predict whether the user will drop out in the future.

3.2 Evaluation Metrics and Implementation Details

Considering the highly unbalanced proportion of positive and negative samples in the dataset, we use the Area Under the ROC Curve (abbreviated as AUC) to depict the ability of the model to distinguish between positive and negative samples under different thresholds. AUC corresponds to the probability that the model ranks a randomly chosen positive sample higher than a randomly chosen negative one; the higher the AUC, the better the model's ability to separate dropouts from non-dropouts.
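For reference, AUC can be computed with scikit-learn (toy labels and scores shown purely for illustration):

```python
from sklearn.metrics import roc_auc_score

y_true  = [1, 0, 1, 1, 0]            # ground-truth dropout labels (toy example)
y_score = [0.9, 0.2, 0.7, 0.6, 0.4]  # predicted dropout probabilities
print("AUC =", roc_auc_score(y_true, y_score))
```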

Fig. 3. The effect of different sampling numbers

We implement STL with TensorFlow. For our model, we design three triads for users, namely \(t_u^1\):\({<}\)user,course,user\({>}\), \(t_u^2\):\({<}\)user,video,course\({>}\) and \(t_u^3\):\({<}\)user,video,user\({>}\), and two triads for courses, namely \(t_c^1\):\({<}\)course,user,course\({>}\) and \(t_c^2\):\({<}\)course,video,user\({>}\). We randomly initialize the parameters with the Xavier [6] initializer. The Adam [11] optimizer with an initial learning rate of \(1 \times 10^{-3}\) is chosen to learn the parameters. The dropout rate is set to 0.5 and the regularization parameter \(\lambda \) is set to \(1 \times 10^{-4}\) to avoid overfitting.

3.3 Parameters in Hierarchical Neural Network

In this subsection, we explore the effect of several parameters of the hierarchical neural network. To isolate their effect, we use only the hierarchical neural network to extract structural features. Figure 3 illustrates the model performance with different numbers of aggregated neighbors, where the horizontal axis represents the number of first-order neighbors \(n_1\) and the differently colored lines represent different numbers of second-order neighbors \(n_2\). In general, the performance of the model steadily improves as the number of aggregated neighbors increases, which indicates that neighbor information is beneficial for enhancing the embedding of the target nodes and that richer neighbor information helps to characterize the nodes. However, it can be clearly observed that the growth of the red line slows down significantly at \(n_1=13\) and almost stops at \(n_1=19\). This suggests that as the number of neighbors increases, the neighbor information gradually saturates and may introduce noise. Similar conclusions can be drawn for second-order neighbors by comparing the different lines. To balance accuracy and complexity, \(n_1=19\) and \(n_2=15\) are chosen in the following experiments.

Table 2. Evaluation on several commonly used feature processing methods.

In the hierarchical neural network, we enrich and enhance the node representations of users and courses by aggregating intra-correlation information from different neighbor nodes and inter-correlation information from different triads. To explore the effectiveness of different feature processing methods, we evaluate MaxPool, Mean and Soft-Attention [17] for the intra-correlation calculation among nodes, and Concat, MaxPool, Soft-Attention and Self-Attention for the inter-correlation calculation among triads. As can be seen in Table 2, the combination of Soft-Attention and Self-Attention boosts the AUC by 0.63% to 4.08% compared to the other combinations. This suggests that Soft-Attention can capture the importance of triad-based neighbors and aggregate meaningful neighbor information, while the degree of dependency between different triads can be captured by Self-Attention, thus yielding enhanced representations of users and courses.

3.4 Comparison with Other Methods

To verify the validity of our method, we consider three versions of STL: STL without the structural feature, STL without the temporal feature, and the full STL that uses both. They are compared with machine learning based methods, namely LR (Logistic Regression), RF (Random Forest) and GBDT (Gradient Boosting Decision Tree), and with a deep learning based method named CFIN [5], which uses a CNN to learn the representation of each activity by leveraging its statistics and soft-attention to learn the importance of different activities by combining attribute information. For LR, RF, GBDT and STL without the structural feature, we use the activity sequence \(A(u,c)\) extracted from the video click stream data as input. For CFIN, we extract the activity matrix, the statistics of the activity matrix and the information of users and courses from the video click stream data as input. For STL without the temporal feature, we use the relevance-guided embeddings as input. To make a fair comparison, the most suitable parameters are chosen for each method. For RF, the number of trees in the forest is set to 500. For GBDT, the number of weak learners is set to 200 and the maximum depth is set to 7, with a learning rate of 0.1. CFIN is trained with the Adam optimizer with a learning rate of \(1\times 10^{-4}\) and an L2 regularization strength of \(1\times 10^{-5}\).

The results are given in Table 3. STL without the structural feature achieves an AUC of 90.62%, second only to GBDT, while the full STL obtains an AUC of 92.04%, an increase of 0.7% to 5.72% over LR, RF, GBDT and CFIN. Although STL is only 0.7% higher than GBDT in terms of AUC, it greatly outperforms GBDT in terms of time overhead, not only because GBDT is difficult to parallelize due to the dependencies among its weak learners, but also because GBDT requires grid search to find the optimal parameters. Meanwhile, STL is 1.61% higher than CFIN. The reasons are that STL enriches the representations of users and courses with the deep correlations among entities and extracts temporal features from the activity sequence. Overall, our proposed STL achieves the best performance and generalizes well.

Table 3. Comparison with other methods.

4 Conclusion

In this paper, a general approach named structural and temporal learning (STL) was proposed to improve dropout prediction in MOOCs. The multiple entities and the complex correlations among them were modeled as a heterogeneous information network (HIN). To take full advantage of the rich structural information in the HIN, we designed multiple triads to represent the correlations between different entities and proposed a hierarchical neural network in which relevance calculation, intra-correlation calculation and inter-correlation calculation are jointly used to bootstrap the node embeddings and learn the importance of neighbor nodes and triads. Besides, we used Bi-LSTM to fully exploit the temporal features of user activities based on activity sequences. Finally, the structural and temporal features were fused to predict dropout. The experiments on the MOOCCube dataset demonstrated the effectiveness of the proposed method. In the future, we will deploy STL on a MOOC platform and establish a complete intervention mechanism for users.