1 Introduction

Massive open online courses (MOOCs) are widely recognized as a revolution in education and have become an interdisciplinary research topic spanning education, information technology, psychology, pedagogy, and artificial intelligence [1]. With the rapid development of science and technology in academia and industry, many MOOC platforms have emerged, such as Coursera, EdX, Udacity, and XuetangX. These platforms provide a variety of well-designed courses to meet the different learning requirements of learners. On a MOOC platform, learners can not only access assignments, transcripts, and video lectures, but also communicate with other learners through online forums and wikis. MOOCs have therefore become the first choice for online learning for tens of millions of people around the world.

Unlike learners in traditional education, MOOC learners have widely varying motivations and therefore participate in different ways. Most learners do not have enough perseverance to complete their courses, and the high dropout rate is one of the major concerns about MOOCs. According to statistics, the completion rate of most courses is less than 10% [2, 3], which means that the vast majority of learners who register for a course eventually fail to complete it. Scholars are trying to find out why learners fail to complete courses and to identify and predict dropouts accurately and in time. If teachers find at an early stage that learners are likely to drop out, they can take timely measures during the teaching process, such as sending positive feedback to the learner by email or voice message.

In the field of education, Big Data has become a hot topic. Educational Data Mining [4] and Learning Analytics [5] try to make sense of educational data in order to improve the teaching and learning experience. From these two perspectives, dropout can be predicted with supervised learning methods. However, this process is very difficult for the following practical reasons. Firstly, dropout prediction is not a simple classification problem. The goal is to predict whether a learner will exhibit any learning behavior in the next few consecutive days; learners who do not are labeled as dropouts, and the others are labeled as retained. The concept of dropout is therefore simple, but the concept of retention is very complex. For example, retained learners may exhibit learning behavior only in a certain period or on certain days. Secondly, MOOC learners drop out for a variety of reasons [6]. Some learners are just trying to get a taste of the new MOOC learning style, and dropping out is a natural consequence of trying. Other learners may lack the fundamental knowledge required by the chosen course and drop out when they are faced with difficulties in the learning process. Some learners find the course boring, lose interest, and drop out. In addition, some learners do not have enough time to study because of work or other reasons and only stop studying the course temporarily. Therefore, it is difficult to evaluate learners’ behavior.

Although there are many good dropout prediction models in existing studies, the following deficiencies still exist:

(1) Most of the current studies based on static scenario design and implementation cannot flexibly handle dynamic changes and ignore the local relevance of learning behaviors.

(2) The current prediction model ignores the information related to learners’ learning behaviors for several consecutive days. We note that learners’ learning behavior usually remains the same for several days, for example, learners have no learning behavior for several days.

To address the above problems, this study focuses on predicting whether learners will exhibit learning behaviors in the next 10 days by learning behaviors in the previous days. If the learner does not exhibit any learning behavior within the next 10 days, it will be labeled as dropout, otherwise, it will be labeled as a retainer. In this study, the local correlation features and time changes of learning behavior are fully considered, and a convolutional neural network (CNN) model with a multidilation pooling module is further proposed to extract high-level features with learning behavior and time changes, which are specifically used for dropout prediction.

The main contributions of this study are as follows:

(1) A novel Lie Group regional covariance feature matrix and a CNN model with a multidilation pooling module are proposed, which are specially used for dropout prediction.

(2) We explore the possibility of using the CNN model to implement temporal and early dropout prediction.

(3) We carry out extensive experiments on the data set. Experimental results show that our method is superior to competitive methods and improves the accuracy of prediction.

The remainder of this paper is organized as follows. In Sect. 2, we briefly review previous studies and recent advances. In Sect. 3, we introduce our proposed prediction approach. In Sect. 4, we introduce the dataset. In Sect. 5, we describe the experimental results. Finally, Sect. 6 concludes the paper.

2 Related Work

In this section, we briefly introduce research related to dropout prediction in MOOCs. Dalipi et al. [6] reviewed recent research on machine learning applications for predicting, explaining, and solving the problem of dropout in MOOCs, and found that different research groups often choose different data sources and extract different characteristics. The reviewed studies used assignment grades, social networks, clickstream data, and even demographic information as features, combined with various classification models, to predict dropout.

In the existing research, many commonly used classification models have been applied to supervised dropout prediction. Kloft et al. [7] labeled a learner as a dropout if he or she did not exhibit any learning behavior within 7 days. Using clickstream logs as the data source, they extracted 19 types of features for each learner, such as the number of video views for a course, and attempted to predict dropout in the coming week. Specifically, the 19 types of features extracted in the first week were used to predict the dropout learners in the second week, the 38 types of features extracted in the first 2 weeks were used to predict dropout learners in the third week, and so on. Principal component analysis (PCA) was then used to process the features, and a support vector machine (SVM) was trained to produce the weekly predictions.

Taylor et al. [8] labeled a learner as a dropout once he or she stopped submitting exercises or assignments. For example, learners who submitted their final assignment in the sixth week were considered dropouts in the seventh week. They defined the time interval as weeks, combined the features of several weeks, and constructed a feature vector to predict dropouts. Several classification methods, including logistic regression, were used in this study. Liang et al. [9] labeled a learner who had no learning behavior over the past 10 days as a dropout. They used data from the first 30 days as input to the model, which contained 112 features such as enrollment information, number of registered learners, total engagement time of learners, and enrollment time. Four common classification models, SVM, gradient boosted decision trees, logistic regression, and random forest (RF), were used to predict dropouts.

Ramesh et al. [10] proposed dividing learners’ forum activity into disengaged, active, and passive as measurement indicators. They leveraged structural, linguistic, behavioral, and temporal features to construct and train probabilistic logic models, treating learner engagement as a latent variable.

Scholars have also used semi-supervised learning to predict dropout. Similar to the study by Liang et al. [9], Li et al. [11] used the learning behavior of learners in the first 30 days to predict whether they would exhibit learning behavior for a certain course in the next 10 days. The features extracted in this method include accessing forums, viewing videos, and accessing the wiki. They counted the number of learning behavior records per week for each learner, captured different types of features over the first 30 days, and constructed different views. On this basis, they proposed a multi-view semi-supervised learning algorithm that uses co-training to predict dropouts, and compared it with SVM, Naive Bayes (NB), decision tree, and logistic regression methods.

Several studies have used temporal model-based prediction of dropouts. Balakrishnan et al. [12] suggested that learners who did not exhibit learning behavior within 7 days were labeled as dropouts. The method in this study uses four types of data, such as the number of threads viewed on the forum, the number of times the course progress page was checked, the cumulative percentage of course videos watched and the number of posts posted on the forum. The Hidden Markov Model (HMM) is trained by using the learner’s label and four defined features. Fei and Yeung [13] also considered that if the learner did not exhibit learning behavior within 7 days, he was labeled as a dropout. Furthermore, the method also uses learners’ learning behavior in 1 week to predict the dropout in the next week. Five features of the EdX courses and seven features of Coursera are extracted based on the browser- or server-side events. Two new Recurrent Neural Network models (RNN), i.e., long short-term memory network and a variant of HMM are trained to predict dropout.

Few studies have used deep learning models to predict dropout. Whitehill et al. [14] extracted 37 features from personal information and clickstream data and used a proxy label or target label to represent dropouts. The feature values from the weeks up to week \((t-1)\) are combined into a new feature vector to predict dropout in the tth week. Then, a logistic regression model and a fully connected feed-forward network model with five hidden layers are used to predict dropouts. Similar to [9] and [11], Wang et al. [15] also used the learning behavior of the first 30 days to predict whether learning behavior would occur in the next 10 days. 186 features were extracted from the raw data to train baseline models such as Radial Basis Function (RBF), logistic regression, RF, and linear SVM. A combination of RNN and CNN, which takes 30 matrices as input, is trained with the mini-batch stochastic gradient descent algorithm. The size of each matrix is \(24 \times 48\).

Marque-Vera et al. [16] used data mining methods to deal with imbalanced behavior and high-dimensional data, to predict dropouts. Kim et al. [17] used a large number of learners’ watching video records to summarize various video viewing modes in online learning. Kizilcec et al. [18] combined different types of user data, such as learner demographic data, learner course enrollment data, and learner behavior data, and used an unsupervised machine learning algorithm to divide learners into a few stereotypes. Gerben et al. [19] extracted the behavior features of learners from the previous curriculum, combined with machine learning algorithms to predict dropout. In addition, Xu et al. proposed a variety of prediction and classification methods based on the CNN model [20,21,22].

According to the above references, machine learning can be used to predict dropout with good accuracy, and feature extraction is very important for dropout prediction. Many different feature extraction methods exist, but these approaches still have deficiencies. Firstly, most existing methods concatenate the features from different times into a one-dimensional feature vector, which ignores the correlation between different features. Secondly, the goal of dropout prediction is to predict whether learners will exhibit learning behavior on consecutive days in the future. Therefore, learning behavior over continuous time should be considered when training the classification model. However, to shorten the feature vectors and reduce the amount of computation, existing approaches aggregate the features of different days into weekly features and discard other details of learners’ learning behaviors. To address these deficiencies, we extract the features of each learner’s daily learning behavior from the clickstream logs and use them to construct a Lie Group regional covariance matrix. We then propose a novel CNN model that extracts high-level features of learning behavior and is dedicated to dropout prediction. Different from the above methods [14, 15], we construct the Lie Group regional covariance matrix and consider the correlation between features.

3 Proposed Approach

In this section, we will introduce our method of predicting dropouts. Firstly, the learning behavior patterns of learners in the first 30 days were statistically analyzed to verify the local correlation of learning behavior. Secondly, a Lie Group regional covariance feature matrix is extracted for each pair of learner-course. Finally, a CNN model with a multidilation pooling module is proposed to predict dropout.

3.1 Local Correlation Analysis

For a more intuitive understanding, we define the concept of the learner’s learning status.

Definition 1

If a learner exhibits a learning behavior on a certain day, then his/her learning status is defined as True (T), otherwise, his/her learning status is defined as False (F).

We measured the learning behavior patterns of learners during the first 30 days. We use a vector to represent the learner’s learning status in the first 30 days, \(LS =(LS_1, LS_2, \cdots , LS_i, \cdots , LS_{30})\), where \(LS_i\) represents the learning status of the learner on the ith day.

The vector representation of each learner’s learning status can be divided into several learning status subsegments, each consisting of consecutive days with the same status. For example, the learning status [T, T, T, F, F, T, T, F, F, F, F, T, T, T, T, F, T, F, T, F, F, T, F, T, T, T, F, T, F, T] can be divided into [T, T, T], [F, F], [T, T], [F, F, F, F], [T, T, T, T], [F], [T], [F], [T], [F, F], [T], [F], [T, T, T], [F], [T], [F], [T]. A subsegment that lasts only one day is recorded as a discontinuous learning status subsegment, and a subsegment that lasts multiple days is recorded as a continuous learning status subsegment. Each segmented learning status is thus valid for one day or for several consecutive days.

For different learner-courses, we define three cases of learning status segments:

Definition 2

If \(A < B\), it is recorded as less than the count, if \(A = B\), it is recorded as equal to the count, if \(A > B\), it is recorded as more than the count, where A represents the number of continuous learning status segments, and B represents the number of discontinuous learning status segments.
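The segmentation and the comparison in Definition 2 can be sketched as follows. This is a minimal illustration, assuming that subsegments are maximal runs of identical status; the function names are hypothetical and not taken from the original implementation.

```python
from itertools import groupby


def split_into_segments(status):
    """Split a daily status sequence into maximal runs of identical status."""
    return [list(run) for _, run in groupby(status)]


def compare_segment_counts(status):
    """Return (A, B, case) as in Definition 2.

    A counts continuous segments (more than one day),
    B counts discontinuous segments (exactly one day).
    """
    segments = split_into_segments(status)
    a = sum(1 for seg in segments if len(seg) > 1)
    b = sum(1 for seg in segments if len(seg) == 1)
    if a < b:
        case = "less than"
    elif a == b:
        case = "equal to"
    else:
        case = "more than"
    return a, b, case


# The 30-day example status vector used in the text.
example = list("TTTFFTTFFFFTTTTFTFTFFTFTTTFTFT")
print(compare_segment_counts(example))  # (7, 10, 'less than')
```

Under this assumption, the example learner has 7 continuous and 10 discontinuous segments and therefore falls into the "less than" case.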

We counted the learner-course pairs that fall into each of the above three cases, as shown in Fig. 1. Yellow represents the retained learners, and blue represents the dropout learners. The horizontal axis represents the three cases, and the vertical axis represents the number of learner-course pairs. It can be seen from Fig. 1 that, for most learners, the number of continuous learning status segments is larger than the number of discontinuous learning status segments, and this holds for both retained and dropout learners. Therefore, we believe that most MOOC learners prefer to keep a continuous learning status.

Fig. 1 Statistics of the learning status segments

The above content describes the two situations of learning state and analyzes the local correlation of learning behaviors. Next, we define the transformation of learning:

Definition 3

If the learning status of the second day is the same as that of the first day, it is recorded as a same-status transition. If the learning status of the second day is different from that of the first day, it is recorded as a different-status transition.

We have calculated the learning status transition, as shown in Tables 1 and 2. From Tables 1 and 2, we find that the probability that the learning status of the second day is the same as that of the first day is very large. Therefore, learners tend to maintain the same learning status and behavior in the adjacent time.
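Transition statistics of this kind can be computed by counting adjacent-day pairs. The sketch below is only illustrative: the function name and the short status sequence are made up for the example and do not come from the data set.

```python
from collections import Counter


def count_transitions(status):
    """Count day-to-day status transitions in one learner's status sequence.

    Returns a Counter keyed by (previous_status, current_status).
    """
    return Counter(zip(status, status[1:]))


# Hypothetical 8-day status sequence, for illustration only.
status = list("TTTFFFFT")
pairs = count_transitions(status)

same = sum(n for (prev, cur), n in pairs.items() if prev == cur)
different = sum(n for (prev, cur), n in pairs.items() if prev != cur)

print(dict(pairs))
print(same, different)  # 5 2
print(same / (same + different))  # share of same-status transitions
```

Summing such counters over all dropout or all retained learner-course pairs gives the kind of aggregate transition counts that a transition table summarizes.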

Through the above statistics and analysis, we find that learners’ learning behaviors in consecutive days exhibit strong correlation, indicating that learners’ behaviors are locally correlated. However, in the real world, learners are affected by various external factors, which often affect their learning behavior for several consecutive days.

Table 1 Statistics transition for dropout learner
Table 2 Statistics transition for retained learner

3.2 Regional Covariance Characteristic Matrix of Lie Groups

In the previous section, we found the local correlations of learning behaviors. Therefore, we directly extract the daily learning behavior records of learners, which has the advantage of being more intuitive. We extracted seven types of behavior features from the clickstream logs, as shown in Table 3. Dropout prediction is a binary classification problem. Specifically, each D(L, T) represents the data sample and its label. If the learner drops out from the course, it is labeled as False (F), otherwise it is labeled as True (T).

Firstly, we map the data samples onto the Lie Group manifold space to obtain the Lie Group samples.

$$\begin{aligned} M_{ij}=log(L_{ij}) \end{aligned}$$
(1)

where \(L_{ij}\) represents the statistical number of the jth learning behavior on the ith day for a learner, and \(M_{ij}\) represents the statistical number of the jth learning behavior of a learner on the manifold space of a Lie Group [23].
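A minimal NumPy sketch of the mapping in Eq. (1) is given below. Because a daily count can be zero and \(\log 0\) is undefined, the sketch adds an offset of 1 before taking the logarithm; this offset and the function name are assumptions made for illustration and are not stated in the text.

```python
import numpy as np


def to_lie_group_sample(counts, offset=1.0):
    """Element-wise logarithmic mapping of a (days x behaviors) count matrix.

    `offset` is an assumption (not from the text) that keeps zero counts finite.
    """
    counts = np.asarray(counts, dtype=float)
    return np.log(counts + offset)


# Hypothetical counts for 30 days and 7 behavior types.
rng = np.random.default_rng(0)
L = rng.integers(0, 20, size=(30, 7))
M = to_lie_group_sample(L)
print(M.shape)  # (30, 7)
```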

For the Lie Group sample data set, the attribute of each data sample can be expressed as I(x, y). In the manifold space of the Lie Group, let F denote the feature extracted from the sample of Lie Group:

$$\begin{aligned} F(x,y)=\phi (I,x,y) \end{aligned}$$
(2)

where \(\phi \) represents the mapping function that extracts a \(d\)-dimensional feature. For the Lie Group sample set, given a rectangular region \(R\subset F\), the features \(\phi \) of all elements in the region can be expressed as \(\{z_i\}_{i=1}^n\), where \(z_i\) is a \(d\)-dimensional real vector and n represents the number of elements in the region. The covariance of the feature points within region R is:

$$\begin{aligned} C_R={\frac{1}{n-1}\sum _{i=1}^{n}(z_i-\mu )(z_i-\mu )^T} \end{aligned}$$
(3)

where \(\mu \) represents the intrinsic mean of the data sample points [23]. For other detailed introductions, please refer to reference [24, 25]. The calculation method of the intrinsic mean of the Lie Group covariance matrix can be referred to references [26, 27].
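Eq. (3) can be illustrated with a short NumPy sketch. As a simplifying assumption, the arithmetic mean of the feature points is used for \(\mu \) here instead of the intrinsic mean described in [26, 27]; the function name is hypothetical.

```python
import numpy as np


def region_covariance(z):
    """Covariance of n feature points of dimension d, following Eq. (3).

    z has shape (n, d). The arithmetic mean stands in for the intrinsic
    mean (an assumption made to keep the sketch short).
    """
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    mu = z.mean(axis=0)
    centered = z - mu
    return centered.T @ centered / (n - 1)


# Hypothetical region with n = 30 feature points of dimension d = 7.
rng = np.random.default_rng(0)
z = rng.normal(size=(30, 7))
C_R = region_covariance(z)
print(C_R.shape)  # (7, 7)
```

With the arithmetic mean, the result coincides with `np.cov(z, rowvar=False)`, which is a convenient sanity check.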

3.3 Constructing CNN

To make full use of the local correlation characteristics and to exploit high-level features for predicting dropouts, we propose a novel deep learning model, developed through a large number of experiments.

The general architecture is shown in Fig. 2. Our proposed model mainly comprises two parts: a lightweight CNN and a multidilation pooling module. We use the lightweight CNN as a base network to extract high-level features. The input is the Lie Group region covariance matrix, whose size is \(30 \times 7\) (the 30 rows hold the counts of learning behavior for each of the 30 days, and the 7 columns correspond to the 7 types of behavior features (Table 3) extracted from the clickstream logs). Layer C1 is a convolutional layer with four feature maps of size \(28 \times 5\), in which each unit in each feature map is connected to a \(3 \times 3\) neighborhood in the input layer. The rectified linear unit (ReLU) is used as the activation function [28]. Layer P2 is a max-pooling layer with four feature maps of size \(14 \times 4\). Each unit in each feature map is connected to a \(2 \times 2\) neighborhood in the corresponding feature map in Layer C1. Based on a comparative analysis of the experiments, we set the stride along both the rows and the columns of the Lie Group region covariance matrix to 1. Then, the multidilation pooling module is applied to extract multidimensional features, and the resulting feature maps are concatenated to learn multiscale association information. Finally, the classification task is completed.

Fig. 2 The general architecture of the proposed network
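The feature-map sizes of layers C1 and P2 can be checked with the usual output-size formula for unpadded windows, \(\lfloor (n-k)/s \rfloor + 1\). The sketch below is only a consistency check: the text does not state the padding or the pooling stride, so the values used here (no padding, pooling stride 2 along rows and 1 along columns) are assumptions chosen because they reproduce the reported sizes.

```python
def window_output_size(size, kernel, stride=1):
    """Output length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


# Input: 30 days x 7 behavior types.
input_shape = (30, 7)

# C1: 3 x 3 convolution with stride 1 and no padding (assumed),
# which reproduces the reported 28 x 5 feature maps.
c1_shape = tuple(window_output_size(n, 3, 1) for n in input_shape)

# P2: 2 x 2 max pooling. The stride pair (2, 1) is an assumption that
# reproduces the reported 14 x 4 feature maps.
p2_shape = (
    window_output_size(c1_shape[0], 2, 2),
    window_output_size(c1_shape[1], 2, 1),
)

print(c1_shape)  # (28, 5)
print(p2_shape)  # (14, 4)
```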

To achieve a multidimensional feature representation, we added a multidilation pooling module, as shown in Fig. 3. This module is a pyramidal pooling module, whose idea originates from SPPNet [29]. SPPNet introduces the SPP layer to eliminate the fixed-size constraint of the network model: because the SPP layer can pool features extracted at variable scales, the input scale becomes very flexible. It can be seen from Fig. 3 that our module has multiple branches, which not only extract features of different dimensions but also provide good robustness. One branch directly applies global average pooling, and the other three branches are each composed of global average pooling, dilated convolution, and an SE layer. Finally, the features of the multiple branches are concatenated to capture the global multidimensional feature representation. Dilation rates of 2, 5, and 6 are set in the three dilated convolutions to extract features of different dimensions.

Fig. 3 Structure of the multidilation pooling module
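To make the branch structure more concrete, the following NumPy sketch applies one global-average-pooling branch and three dilated-convolution branches (rates 2, 5, and 6) to a single-channel feature map and concatenates the pooled outputs. It is a simplified illustration under several assumptions that are not taken from the text: a single channel, \(3 \times 3\) kernels, zero padding that preserves the map size, and no SE layer. All function names are hypothetical.

```python
import numpy as np


def dilated_conv2d_same(x, kernel, dilation):
    """Dilated 2-D cross-correlation with zero padding, output shape == input shape."""
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, pad)
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            r, c = i * dilation, j * dilation
            # Shifted view of the padded input for kernel tap (i, j).
            out += kernel[i, j] * padded[r:r + x.shape[0], c:c + x.shape[1]]
    return out


def multidilation_pooling(x, kernels, rates=(2, 5, 6)):
    """Concatenate one pooled value per branch (SE layer omitted in this sketch)."""
    features = [x.mean()]  # branch 1: global average pooling only
    for kernel, rate in zip(kernels, rates):
        branch = dilated_conv2d_same(x, kernel, rate)
        features.append(branch.mean())  # global average pooling of the branch
    return np.array(features)


# Hypothetical 14 x 4 feature map and identity kernels (centre tap = 1).
x = np.arange(56, dtype=float).reshape(14, 4)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
out = multidilation_pooling(x, [identity] * 3)
print(out.shape)  # (4,)
```

A \(3 \times 3\) kernel with dilation rate \(d\) covers an effective window of \(3 + 2(d-1)\) positions per side, that is, 5, 11, and 13 for rates 2, 5, and 6, which is why the branches respond to context at different scales.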

4 Data Preparation

We selected 39 categories of courses from the XuetangX platform [30], which are used in KDDCup2015 [31] and contain both labeled and unlabeled data. The labeled data contains 30 days of recorded information and a label representing whether the learner exhibited learning behavior during the last 10 days. If the learner does not exhibit any learning behavior in the last 10 days, it is labeled as a dropout, otherwise, it is labeled as a retainer.
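The labeling rule can be written as a one-line check on the activity observed in the 10-day prediction window. The input format (a sequence of daily event counts) and the function name are assumptions made for this sketch; the data set itself provides the labels directly.

```python
def label_learner(window_activity):
    """Label a learner-course pair from its activity in the 10-day window.

    window_activity: sequence of daily event counts (hypothetical format).
    """
    return "dropout" if sum(window_activity) == 0 else "retainer"


print(label_learner([0] * 10))               # dropout
print(label_learner([0, 0, 3] + [0] * 7))    # retainer
```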

To display learner behaviors conveniently and intuitively in the follow-up research, we statistically analyzed the frequency of the different types of events in the data set, as shown in Tables 3 and 4. From Table 4, we find that wikis are a resource that learners do not frequently access. The statistics of dropout and retained learners are shown in Table 5, from which we find that only 21% of the learners complete their studies.

Table 3 Types of events in the clickstream log
Table 4 Frequency of different types of learning behaviors
Table 5 Statistics of dropout and retained learners

5 Experimental Analysis

5.1 Evaluation Metrics

Five standard, widely used evaluation metrics for classification are adopted: Accuracy, F-Measure, Precision, Recall, and the Confusion Matrix (as shown in Table 6). The first four are defined as follows:

$$\begin{aligned} Accuracy= & {} \frac{TN+TP}{TN+TP+FN+FP} \end{aligned}$$
(4)
$$\begin{aligned} F\text {-}Measure= & {} \frac{2 \times Precision \times Recall}{Recall+Precision} \end{aligned}$$
(5)
$$\begin{aligned} Precision= & {} \frac{TP}{TP+FP} \end{aligned}$$
(6)
$$\begin{aligned} Recall= & {} \frac{TP}{TP+FN} \end{aligned}$$
(7)
Table 6 Confusion matrix
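The four scalar metrics can be computed directly from the entries of the confusion matrix. The sketch below uses the standard definition of recall, \(TP/(TP+FN)\); the counts in the example are made up for illustration and are not experimental results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts the accuracy is 0.7, the precision 0.8, the recall 2/3, and the F-measure 8/11.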

5.2 Experimental Setup

The experiments and the contrast experiments were carried out in the same environment. The experimental environment and relevant parameters are shown in Table 7. The parameter values were set based on reference to [32, 33].

Table 7 Experimental environment parameters

The classification algorithms used in this study include Classification And Regression Tree (CART), CNN, Gradient Boosted Decision Tree (GBDT), Linear Discriminant Analysis (LDA), Logistic Regression (LR), NB, RF, and SVM. Table 8 lists the parameters of some algorithms, where \(n\_estimator\) represents the number of base classifiers in RF and GBDT; C and \(\gamma \) represent the parameters of SVM, for which the default values in libSVM [34] are adopted; and \(num\_feature\) represents the number of features of the data samples, which is set to 200 in this study.

Table 8 Parameters setting

5.3 Experiment Results

5.3.1 Local Correlation Experiment Results

In the previous discussion, we found that learners’ learning behaviors during the MOOC learning process exhibited a strong local correlation over several consecutive days. In the feature matrix, each row represents the learner’s learning behavior statistics for a day. By arranging the feature vectors of each day’s learning behavior in time order, the obtained matrix maintains local relevance. If the vectors in the feature matrix are arranged randomly, the local relevance of learning behavior is destroyed. Can the proposed method still achieve its best prediction results in that case?

To answer this question, we randomly permute the row feature vectors in the matrix 200 times and average the results of the 200 experiments. The experimental results were calculated by 20 cross-validations. At the same time, we compare these results with the results obtained on the ordered matrix. As shown in Fig. 4, after the local correlation of learning behavior is destroyed, all indicators decline to a certain degree. The experimental results show that the local correlation of learning behavior cannot be ignored, and that our model can make good use of it.

Fig. 4 Comparison of the dropout predictions when ordered or disordered
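The row permutation used to destroy the time order can be sketched in a few lines of NumPy. The matrix values and the function name are hypothetical; only the idea of permuting whole day-rows while keeping each row intact comes from the text.

```python
import numpy as np


def shuffle_days(matrix, rng):
    """Return a copy of the (days x behaviors) matrix with its rows randomly permuted."""
    return matrix[rng.permutation(matrix.shape[0])]


# Hypothetical 30 x 7 feature matrix whose rows are in time order.
ordered = np.arange(210).reshape(30, 7)
rng = np.random.default_rng(0)

# 200 disordered copies, mirroring the number of repetitions in the experiment.
disordered = [shuffle_days(ordered, rng) for _ in range(200)]
print(disordered[0].shape)  # (30, 7)
```

Each disordered copy contains exactly the same daily feature vectors as the ordered matrix, so any drop in performance can be attributed to the lost time order rather than to changed feature values.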

5.3.2 Experimental Results Under Different Situations

In the actual scenario, we can predict the next stage of dropout based on the known stages. As shown in Fig. 5, the dropout of the third stage is predicted according to the learning behavior of the first two stages. As shown in Fig. 6, the dropout of the second stage is predicted based on the learning behavior of the first stage. As shown in Fig. 7, the dropout of the fourth stage is predicted according to the learning behavior of the first two stages, and that of the fourth stage is predicted according to the learning behavior of the first stage, as shown in Fig. 8.

Fig. 5 Temporal dropout prediction in Situation 1

Fig. 6 Temporal dropout prediction in Situation 2

Fig. 7 Early dropout prediction in Situation 3

Fig. 8 Early dropout prediction in Situation 4

Tables 9 and 10 show the experimental results in Situations 1 and 2, respectively. In Table 9, our approach achieved the best results in accuracy, F-measure, precision, and recall. Compared with the average, our approach achieves the best results in all evaluation metrics. In Table 10, our approach achieved the best results in accuracy, precision, and recall. Compared with the average, our approach has the best results in accuracy, precision, and recall.

Table 9 Experimental results of temporal dropout prediction in Situation1
Table 10 Experimental results of temporal dropout prediction in Situation2

Tables 11 and 12 show the experimental results in Situations 3 and 4, respectively. In Table 11, our approach achieved the best results in accuracy, F-measure, precision. Compared with the average, our approach achieved the best results in accuracy, F-measure, precision. In Table 12, our approach achieved the best results only in accuracy. But compared with the average, our approach achieved the best results in accuracy and recall.

Table 11 Experimental results of temporal dropout prediction in Situation3
Table 12 Experimental results of temporal dropout prediction in Situation4

It can be seen from the above experimental results that the classification accuracy of traditional methods (such as CART and GBDT) is relatively low, mainly because these methods treat the dimensions of feature vectors as independent individuals and fail to fully consider the correlation between features and the local correlation of learning behaviors. The CNN [35] method achieves higher accuracy than the above methods because it extracts more abstract and higher-level features. Compared with the above methods, our proposed method achieves superior performance. The main reason is that the features in the model are concatenated by the time order, instead of treating all dimensions of the feature vector as independent individuals, taking full advantage of the local correlation of learning behaviors. In our method, the local correlation of learning behavior is retained in the Lie Group regional covariance matrix, and the row feature vectors of different dates are arranged according to the time order. The local correlation feature is retained while the feature dimension is reduced, which has good robustness and computational performance.

In the analysis in Sect. 5.3.1, we found that learners’ learning behaviors in the MOOC learning process show local correlation; that is, the learning behaviors on several consecutive days are strongly correlated. Tables 9, 10, 11 and 12 show that different situations yield different experimental results. Specifically, the prediction accuracy of Situation 1 is the highest and that of Situation 2 is the lowest. Further analysis shows that the amount of known historical data influences the dropout prediction results. Situation 1 has more historical data and higher prediction accuracy, while Situation 2 has less historical data and lower prediction accuracy. Situation 4 is similar to Situation 2: although it has less historical data, it still retains some information about other stages (such as stage 2 and stage 3), so its prediction accuracy is relatively higher. We can therefore draw the following conclusions: (1) the amount of historical data influences the dropout prediction results; (2) when a large amount of historical data is used, the local correlation of learning behavior can be captured effectively. From the experimental results in Tables 9, 10, 11 and 12, we found that knowing the first two stages may be enough for dropout prediction. We speculate that this is why KDDCup2015 only provides the historical data of three stages.

6 Conclusion

In this study, we explored methods of predicting dropouts to improve the completion rate of MOOCs. Firstly, we conducted a detailed statistical analysis of MOOC learners’ learning behaviors. The results showed that learners exhibit similar learning behaviors on consecutive days, and that a learner’s learning status on one day is likely to be related to the status on the previous day. We then proposed a Lie Group regional covariance matrix to represent the local correlation information of learning behavior, and constructed a CNN model with a multidilation pooling module to extract high-level, locally correlated features of learning behavior for dropout prediction.

In the future, we will continue to study related CNN models to further improve the accuracy of dropout prediction.