Applying Machine Learning to Predict Whether Learners Will Start a MOOC After Initial Registration

Panagiotakopoulos, Theodor; Kotsiantis, Sotiris; Borotis, Spiros; Lazarinis, Fotis; Kameas, Achilles

doi:10.1007/978-3-030-79157-5_38

Theodor Panagiotakopoulos^18,19,
Sotiris Kotsiantis²⁰,
Spiros Borotis¹⁸,
Fotis Lazarinis¹⁸ &
…
Achilles Kameas¹⁸

Part of the book series: IFIP Advances in Information and Communication Technology ((IFIPAICT,volume 628))

Included in the following conference series:

IFIP International Conference on Artificial Intelligence Applications and Innovations

1104 Accesses
2 Citations

Abstract

Online learning has developed rapidly in the past decade, leading to increased scientific interest in e-learning environments. Specifically, Massive Open Online Courses (MOOCs) attract a large number of people with respective enrollments meeting an exponential growth during the COVID-19 pandemic. However, only a small number of enrolled learners successfully complete their studies creating an interest in early prediction of dropout. This paper presents the findings of a study conducted during a MOOC for smart city professionals, in which we analyzed demographic and personal information on their own and in tandem with a small set of interaction data between learners and the MOOC, in order to identify factors influencing the decision of starting the MOOC or not. We also applied different models for predicting whether a person previously registered to a MOOC will eventually start it or not, as well as for identifying the most informative attributes for the prediction process. Results show that prediction reached 85% accuracy based only on the number of the first days’ logins in the MOOC and few demographic data such as current job role or occupation and number of study hours that the learner estimates he/she can devote on a weekly basis. This information can be exploited by MOOC providers to implement learner engagement strategies in a timely fashion.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Towards the Differentiation of Initial and Final Retention in Massive Open Online Courses

MOOC Learner Behaviour: Attrition and Retention Analysis and Prediction Based on 11 Courses on the TELESCOPE Platform

Transfer Learning for Predictive Models in Massive Open Online Courses

Keywords

1 Introduction

The recent popularity of Massive Open Online Courses (MOOCs), with their huge number of enrolled learners—out of which only a fraction completes their studies successfully [1]—has led to increased interest in dropout prediction. Knowing which learners are likely to quit a MOOC helps distance learning institutions and MOOC providers to develop intervention strategies and remedial actions so as to provide personalized support [2]. According to the Coursera 2020 Impact Report [3], more than 80% of the learners report career benefits six months after completing a course on the platform. Given the fact that MOOCs are typically free of charge, people select courses to attend many times without being ready to commit to learning, or even disposing the required study skills. Therefore, MOOCs face increased drop-out rates compared to other learning modalities, an outcome that affects the business models of MOOC providers, their suppliers, and even the learners themselves. In this regard, it is essential for MOOC providers to understand early enough their audiences – ideally from the phase of registration in a course.

Educational Data Mining (EDM) is a proper tool for effectively analyzing learners’ behavior and predicting their performance. In this research, we explore the potential of demographic information in tandem with minimal interaction data between learners and MOOCs to identify factors of influence and predict whether a person previously registered in a MOOC will eventually start it or not. Moreover, we examine if such a prediction could be done in a timely manner giving the opportunity to implement early interventions.

The rest of the paper is organized as follows: Sect. 2 reviews recent studies concerning the implementation of machine learning techniques for detecting high-risk students in terms of course completion and performance. Section 3 describes the data collection process, provides a description of the dataset and illustrates the results of data mining and statistical methods. In Sect. 4, we present the classification experiments that were conducted and the results obtained. Finally, the paper concludes by summarizing the main findings of the study and considering some thoughts for future research.

2 Related Work

There are few empirical studies of the relationship between MOOC learners’ demographics and completion rates. Guo and Reinecke [4] studied four Open edX MOOCs with data from 140,546 students and found that age was positively correlated with the volume of coverage, which in turn was positively correlated with final grade. On the other hand, Cisel [5] found that employment status had an influence on performance. Unemployed learners achieved higher grades than working students. Morris et al. [6] also found that learners who are older, not working, as well as those with prior online experiences and with prior educational attainment are more likely to complete their course.

Brooks et al. [7] explored the predictive power of learner demographics compared to interaction trace data generated by students in two MOOCs in terms of student success. The authors showed that demographic information offers minimal predictive power in relation to interaction data. Interaction trace data helped models very early in the course. Kizilcec et al. [8] divided students into few stereotypes using a clustering algorithm. The authors used for clustering different data such as students’ demographic data, behavior data, geographic and students’ course enrollment data. Through their analysis, the authors presented a framework for the conversation about MOOC engagement.

Hone & Said [9] reported a survey study of 379 participants enrolled in a MOOC at university in Cairo. The authors reported that 32.2% out of the participants completed the entire course. The authors stated no significant differences in completion rates by gender and level of study (undergraduate or postgraduate). Qiu et al. [10] proposed a framework to predict learners at risk of not completing the MOOC. This framework used feature selection methods and logistic regression. The presented method searched for the most influential features for training and recurrently tested the predictive power of the model to improve its accuracy. The clickstream data was found as useful. Rizvi et al. [11] compared the influence of demographic characteristics on online learning outcomes using a sample of UK based learners across four Open University online courses. The authors found that region, neighborhood poverty level as well as prior education to have strong influence on overall learning outcomes. However, such influence varied as the course progressed.

The review shows that a number of attempts have been made to predict the completion rates based on demographics, interaction and previous e-learning experiences. The results show that prediction is challenging, especially when relying only on demographics. That is why we employed login data during the first days of the e-course in addition to demographics.

3 Data

Our study was performed in the context of the Erasmus + Sector Skills Alliance project “DevOps: Competences for Smart Cities”^{Footnote 1}. DevOps focuses on equipping current and prospective professionals in municipalities and regional authorities with appropriate competences to support the emerging smart city concept, needs and requirements [12]. Registrations in the DevOps MOOC^{Footnote 2},^{Footnote 3} lasted from September 15 to October 15, 2020, while the MOOC started on October 19, 2020. It lasted approximately 3 months and it was structured on a weekly format delivering 1 or 2 training modules (i.e. competences) per week. Each training module – available in English - comprised 2 to 5 learning units, each of which included an automatically graded assessment test. The content delivered to the learners was designed so as to address the European Qualifications Framework level 5, as this is the required level of autonomy and responsibility for smart city professionals. The registration form included a questionnaire asking applicants to provide personal and demographic data notifying them that all data would be acquired and used according to the General Data Protection Regulation Data protection (EU) 2016/679 and the Regulation (EU) No. 2018/1725 to evaluate the quality of the DevOps MOOC. All applicants were requested to provide their consent to store and use this data; otherwise they could skip this questionnaire and proceed with registration only providing their full name and email.

Personal and demographic information included sex, age, nationality, country of residence, mother tongue, education level, current employment status, current job role or occupation, years of experience in the role/occupation, average amount of daily working hours, level of technical English language skills, current digital proficiency, number of under aged children, available amount of study hours on a weekly basis and prior MOOC attendance. A total of 961 people applied for the DevOps MOOC, the vast majority of which were from the EU, where the MOOC was promoted. 944 of them provided sociographic information which, after being anonymized, consisted the initial set of attributes along with the number of the first two days logins in the MOOC. Registered learners who did not submit any of the first week’s assessment tests (the first training module included 5 assessment tests) are considered as “Not Starting the MOOC” class, while the remaining learners are considered as “Starting the MOOC” class. Figures 1–5 show the impact of various types of demographic and personal information on starting the MOOC or not.

Figure 1 shows the distribution between those who actually started the MOOC and those who did not start it in relation to their level of education, their employment status and their digital skills and their past e-learning experiences. Approximately 46% started the MOOC. Based on the diagrams of Fig. 1, there are no noticeable differences among the two groups, which corroborates with the findings of the literature review.

Figures 2, 3, 4 and 5 present the data of only those who have not started the MOOC with respect to language competence, digital skills, education level and previous e-learning experiences. Again, it is difficult to draw definite conclusions as the data indicate a balanced distribution among the different groups and among the different skill levels within each group. From Fig. 4, it could be argued that those who have not previously attended a MOOC have a higher chance of not starting the MOOC. Those with poor digital skills who have already attended a MOOC have a high probability of not starting the MOOC, although this observation needs more thorough analysis and more data to verify it. In general, these data should be combined with user logins to understand early in the training period who has a higher probability of not starting the MOOC and to provide more tailored assistance.

An examination of repeated patterns in the learners’ data was also performed using association rules [13]. First and foremost, all learners that logged in to the MOOC at least 16 times and declared initially that are able to devote on average at least 5 h per week for their learning, started the MOOC. Second, the learners that logged in to the MOOC less than two times the first (learning) week, did not eventually start the MOOC on a degree of 97%. Third, the learners that logged in to the MOOC less than fifteen times, and their current occupational status was declared as ‘employee’ (including public servants) or student or unemployed, and could devote between two and twelve hours per week for their learning, started the MOOC. These three rules strengthen the importance of monitoring learners’ login activity in the e-course so as to identify whether they shall eventually start their training. Complementary to that, they underline the - even limited - significance of the employment status and the available time for learning as characteristics of the target group that accumulates increased chances to start the MOOC.

4 Prediction Models and Experiments

As we have already said, it is important to recognize the learners that will most probably not start the MOOC at an early stage, in order to implement timely interventions. Thus, the attributes that were collected during registration (i.e. personal and demographic information), as well as those collected during the initial offer period of the MOOC (i.e. first two days logins in the MOOC) were used to identify learners that would potentially not start the MOOC using six different prediction models: Cart decision tree [14], Random forest [15], Multi-layer Perceptron (MLP) classifier [16], Adaboost Classifier [17], Naive Bayes algorithm and Logistic regression [18]. In order to calculate the classifiers’ accuracy, stratified ten cross validation was used. Figure 6 depicts the learning curves of the examined learning algorithms.

From the learning curve analysis, we can observe that we can take accurate results using only 100 instances as training set. This rather small number of instances enables MOOC providers with predicting results with even decreased number or resources invested for this effort. It must be mentioned that the four examined classifiers produce similar results. There is not one learning algorithm that produces statistically better results than the rest examined methods. However, Random Forest seems to produce slightly better results using less training data.

We then used the Random Forest algorithm for feature importance (see Fig. 7) implemented in scikit-learn [19]. Permutation feature importance is a technique for calculating relative importance scores that is independent of the model used. First, a model is fit on the dataset, such as a model that does not support native feature importance scores. Then the model is used to make predictions on a dataset, although the values of a feature (column) in the dataset are scrambled. This is repeated for each feature in the dataset. Then this whole process is repeated 10 times. The result is a mean importance score for each input feature (and distribution of scores given the repeats), as shown in Fig. 8.

Using ideas from coalitional game theory, the SHAP method [20] computes the Shapley value of a feature, which is the average of the marginal contributions of that feature to all predictions across all permutations of selected features (Fig. 9).

The most informative attributes according to the examined feature importance strategies are user total logins, current job role or occupation and number of hours learners can devote on average per week for learning. The two out of the three features are known before the MOOC starts, enabling MOOC providers to act early enough. Using only the three most informative attributes, the examined learning algorithms lose only about 1% of their accuracy produced with the whole feature space.

5 Conclusions and Future Work

This paper presented a study concerning the identification of factors affecting the decision of starting the MOOC or not of people who have registered in it. Taking into account learners’ logins in the first two days of the MOOC, we employed data mining techniques that revealed interesting patterns linking the actual number of logins on the start of the "MOOC journey" of the learners with characteristics related to their profile. It was shown that the learning algorithms can predict if a learner will start the MOOC after the initial registration. The accuracy reaches 85% based only on the number of the first days’ logins in the MOOC and some demographic data such as the current job role or occupation as well as the number of hours that the learner estimates that can devote on average per week for learning.

Being able to know and predict early on the training cycle those more likely to drop out of the course is quite important for MOOC providers to create and implement timely learner engagement strategies. More personalized content and support could be offered, especially for people that relate their learning with career advancement and decreased available time for learning respectively; micro-learning [21] and micro-credentials could be employed for those people as well, opening up education to more people as they support by nature flexibility and inclusiveness. Complementary list of runner up candidates could be exploited, so as to replace those who are likely to drop out; alternative registration policies could be coined; and personalized learning paths could be offered addressing different learning behaviors.

Notes

References

Onah, D.F., Sinclair, J., Boyatt, R.: Dropout rates of massive open online courses: behavioural patterns. In: EDULEARN14 Proceedings, pp. 5825–5834 (2014)
Google Scholar
Assami, S., Daoudi, N., Ajhoun, R.: Personalization criteria for enhancing learner engagement in MOOC platforms. In: 2018 IEEE Global Engineering Education Conference (EDUCON), pp. 1265–1272. IEEE (2018)
Google Scholar
Coursera. Coursera 2020 Impact Report. Serving the world through learning. https://about.coursera.org/press/wp-content/uploads/2020/09/Coursera-Impact-Report-2020.pdf. Accessed Mar 2021
Guo, P.J., Reinecke, K.: Demographic differences in how students navigate through MOOCS. In: 1st ACM Conference on Learning@scale, pp. 21–30. ACM (2014)
Google Scholar
Cisel, M.: Analysing completion rates in the First French xMOOC. In: EMOOCs 2014: European MOOCs Stakeholders Summit. Proceedings. Research Track, pp. 26–32 (2014)
Google Scholar
Morris, N.P., Swinnerton, B.J.D., Hotchkiss, S.: Can demographic information predict MOOC learner outcomes? In: Experience Track: Proceedings of the European MOOC Stakeholder. eMOOCs Conference, Mons, Belgium (2015)
Google Scholar
Brooks, C., Thompson, C., Teasley, S.: Who you are or what you do: comparing the predictive power of demographics vs. activity patterns in massive open online courses (MOOCs). In: 2nd ACM Conference on Learning@ Scale, pp. 245–248. ACM (2015)
Google Scholar
Kizilcec, R.F., Piech, C., Schneider, E.: Deconstructing disengagement: analyzing learner subpopulations in massive open online courses. In: Proceedings of the Third International Conference on Learning Analytics and Knowledge, pp. 170–179. ACM (2013)
Google Scholar
Hone, K.S., El. Said, G.R.: Exploring the factors affecting MOOC retention: a survey study. Comput. Educ. 98, 157–168 (2016)
Article Google Scholar
Qiu, L., Liu, Y., Liu, Y.: An integrated framework with feature selection for dropout prediction in massive open online courses. IEEE Access 6, 71474–71484 (2018)
Article Google Scholar
Rizvi, S., Rienties, B., Khoja, S.A.: The role of demographics in online learning; a decision tree based approach. Comput. Educ. 137, 32–47 (2019)
Article Google Scholar
Iatrellis, O., Panagiotakopoulos, T., Gerogiannis, V.C., Fitsilis, P., Kameas, A.: Cloud computing and semantic web technologies for ubiquitous management of smart cities-related competences. Educ. Inf. Technol. 26(2), 2143–2164 (2020). https://doi.org/10.1007/s10639-020-10351-9
Article Google Scholar
Abdullah, Z., Herawan, T., Ahmad, N., Deris, M.M.: Mining significant association rules from educational data using critical relative support approach. Procedia-Soc. Behav. Sci. 28, 97–101 (2011)
Article Google Scholar
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth, Belmont (1984)
MATH Google Scholar
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Article Google Scholar
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
Article Google Scholar
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
Article MathSciNet Google Scholar
Ng, A.Y., Jordan, M.I.: On discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes. Adv. Neural Inf. Process. Syst. 2, 841–848 (2002)
Google Scholar
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
MathSciNet MATH Google Scholar
Lundberg, S., Lee, S.I.: A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874 (2017)
Emerson, L.C., Berge, Z.L.: Microlearning: Knowledge management applications and competency-based training in the workplace. Knowl. Manage. E-Learn. 10(2), 125–132 (2018)
Google Scholar

Download references

Acknowledgement

Supported by the Erasmus+ KA2 under the project DEVOPS, “DevOps competences for Smart Cities” (Project No.: 601015-EPP-1–2018-1-EL-EPPKA2-SSA Erasmus+ Program, KA2: Cooperation for innovation and the exchange of good practices-Sector Skills Alliances).

The European Commission’s support for the production of this publication does not constitute an endorsement of the contents, which reflect the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

Author information

Authors and Affiliations

School of Science and Technology, Hellenic Open University, Patras, Greece
Theodor Panagiotakopoulos, Spiros Borotis, Fotis Lazarinis & Achilles Kameas
Business School, University of Nicosia, Nicosia, Cyprus
Theodor Panagiotakopoulos
Department of Mathematics, University of Patras, Patras, Greece
Sotiris Kotsiantis

Authors

Theodor Panagiotakopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Sotiris Kotsiantis
View author publications
You can also search for this author in PubMed Google Scholar
Spiros Borotis
View author publications
You can also search for this author in PubMed Google Scholar
Fotis Lazarinis
View author publications
You can also search for this author in PubMed Google Scholar
Achilles Kameas
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Theodor Panagiotakopoulos .

Editor information

Editors and Affiliations

University of Piraeus, Piraeus, Greece
Ilias Maglogiannis
University of Sunderland, Sunderland, UK
John Macintyre
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Panagiotakopoulos, T., Kotsiantis, S., Borotis, S., Lazarinis, F., Kameas, A. (2021). Applying Machine Learning to Predict Whether Learners Will Start a MOOC After Initial Registration. In: Maglogiannis, I., Macintyre, J., Iliadis, L. (eds) Artificial Intelligence Applications and Innovations. AIAI 2021 IFIP WG 12.5 International Workshops. AIAI 2021. IFIP Advances in Information and Communication Technology, vol 628. Springer, Cham. https://doi.org/10.1007/978-3-030-79157-5_38

Download citation

DOI: https://doi.org/10.1007/978-3-030-79157-5_38
Published: 22 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-79156-8
Online ISBN: 978-3-030-79157-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Federation for Information Processing (opens in a new tab)

Applying Machine Learning to Predict Whether Learners Will Start a MOOC After Initial Registration

Abstract

Similar content being viewed by others

Towards the Differentiation of Initial and Final Retention in Massive Open Online Courses

MOOC Learner Behaviour: Attrition and Retention Analysis and Prediction Based on 11 Courses on the TELESCOPE Platform

Transfer Learning for Predictive Models in Massive Open Online Courses

Keywords

1 Introduction

2 Related Work

3 Data

4 Prediction Models and Experiments

5 Conclusions and Future Work

Notes

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Societies and partnerships

Navigation

Applying Machine Learning to Predict Whether Learners Will Start a MOOC After Initial Registration

Abstract

Similar content being viewed by others

Towards the Differentiation of Initial and Final Retention in Massive Open Online Courses

MOOC Learner Behaviour: Attrition and Retention Analysis and Prediction Based on 11 Courses on the TELESCOPE Platform

Transfer Learning for Predictive Models in Massive Open Online Courses

Keywords

1 Introduction

2 Related Work

3 Data

4 Prediction Models and Experiments

5 Conclusions and Future Work

Notes

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Societies and partnerships

Search

Navigation