1 Introduction

With the rapid development of the Internet and Web technologies, online learning has attracted millions of registered students worldwide (Henderikx et al. 2017). There are several reasons why people tend to choose online learning. For example, learning online is cost-efficient, and there are no restrictions on the time and location of learning. Some universities and institutions also offer high-quality education free of charge on online learning platforms such as MOOCs, and the learning environment is diverse for learners worldwide (Castle and McGuire 2010). In addition, learning resources are multimedia objects, such as videos, animations, and images (Lau et al. 2013). However, confronted with a growing number of courses, it is challenging to select a satisfying course from the large course pool of an online learning platform, as the semantics of complex learning needs and courses are difficult to understand and match. A search engine is one solution to this problem; in this method, course descriptions and categories are the main factors used to produce the search results. In other words, if a learner's need is clear and can be specified by a few keywords, a search engine is sufficient for identifying preferred courses. In most cases, however, learners may not understand their learning needs very clearly. Furthermore, online courses usually have only limited text descriptions that can serve as keyword search indices, and these keywords cannot fully represent the high-level semantics of the courses. Therefore, a search engine is not a good solution when the learning needs are vague or the courses' indices are insufficient.

Recommender systems (Resnick and Varian 1997) have been another effective solution to the above problem. From e-commerce (e.g., Amazon (Linden et al. 2003) or Alibaba) to online services (e.g., hotels (Zhang et al. 2015) or movies (Bennett and Lanning 2007)), recommender systems assist users in clarifying their true needs and making decisions in everyday life. Recommender systems attempt to establish linkages between users and items based on user interactions and to predict users' preferences. Collaborative Filtering (CF) is one of the most popular techniques in recommender systems (Herlocker et al. 2004). In a CF-based recommender system, similar items (users) are recommended based on the user's historical preferences. Although CF has achieved excellent recommendation performance, this approach still has some problems. A well-known issue of the CF-based approach is the cold-start problem: the system cannot understand a new user's preferences because there is little or no interaction between the system and the new user.

Moreover, current course recommender systems rarely consider course contextual information in historical user records (Ma et al. 2017; Bridges et al. 2018; Hidasi et al. 2018). The position of a course in these records carries semantic information, by which courses can be clustered: relevant courses share a common context in the historical user records over a given period. Word2vec (Mikolov et al. 2013a, b) is one of the most commonly used Natural Language Processing (NLP) techniques for transforming unstructured natural language into normalized, structured data that preserves semantic information. Another issue is that learning is a spiral process of knowledge acquisition (Diamond et al. 2008). If someone tries to gain a skill, she will choose several similar courses in the first step and mostly light-similar courses in the second step. Usually, similar courses are in the same category and have similar social tags, while light-similar courses are not. It should be noted that the first and second steps are not independent but alternate. However, Word2vec assumes that words sharing a common context will have similar word vectors. This hypothesis makes it difficult to distinguish two courses that share the same context but belong to different categories.

In this paper, we propose a course recommendation method based on the Word2vec paradigm and social tags. First, we treat historical user records as the training data and use a content-based recommendation method to overcome the cold-start problem. Then we introduce Laplacian Eigenmaps as the objective function, with course social tags and course-user interactions as the penalty factor, to fine-tune the vectors generated by the Word2vec language model. One advantage of the proposed method is that it maintains stable performance for new users with limited interaction data. The other advantage is that the penalty factors weaken the influence of light-similar semantics and improve the recommendation accuracy.

The remaining sections of this paper are organized as follows: Sect. 2 formalizes the problem of course recommendation. In Sect. 3, we introduce the proposed model, which exploits Course2vec and integrates social tags into our course recommender system. The experimental results are presented in Sect. 4. Finally, we summarize this research and discuss future directions in Sect. 5.

2 Problem Formalization

We formalize recommending courses based on historical user records and course attributes as an embedding problem. The goal is to learn \( X_{C} \in {\mathbb{R}}^{\left| V \right| \times d} \), where \( X_{C} \) contains the low-dimensional vectors of the courses and \( d \) is a small number of latent dimensions. For each user \( u \in {\text{U}} \), \( {\text{U}} = \left\{ {u_{1} , \cdots ,u_{\left| U \right|} } \right\} \), the profile is the historical course record \( Q_{u} = \left( {c_{1} , \cdots ,c_{n} } \right) \) with \( c_{i} \in C \); \( Q = \left( {Q_{1} , \cdots ,Q_{\left| U \right|} } \right) \) denotes the set of all users' historical course records, and the course attributes are \( C = \left\{ {T_{1} , \cdots ,T_{\left| C \right|} } \right\} \), where \( T_{i} \) is the set of social tags of course \( c_{i} \). We aim to obtain the course vectors from the historical course records \( Q \) and the course attributes \( C \), as in Eq. (1).

$$ X_{C} = f\left( {Q,C} \right) $$
(1)
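To make the notation concrete, the following is a minimal toy sketch (all course IDs, user IDs, and tags are hypothetical, not drawn from any real dataset) of how the records \( Q \) and the tag sets \( T_{i} \) can be represented in Python.

```python
# Toy inputs for the formalization above; all identifiers are hypothetical.
# Q: the ordered historical course records, one sequence per user.
Q = {
    "u1": ["c1", "c3", "c7"],
    "u2": ["c1", "c2", "c3"],
}

# T: the social-tag set of each course (the course attributes).
T = {
    "c1": {"programming", "python"},
    "c2": {"statistics", "data analysis"},
    "c3": {"algorithms", "programming"},
    "c7": {"machine learning", "data analysis"},
}
```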

3 The Proposed Model

This section first gives an overview of the proposed model, then introduces a language model to transform historical user records into vectors, and finally uses Laplacian Eigenmaps as the objective function to train the entire model. The core contribution of our study is that we further divide course contextual information into two types of semantic information based on the principle of knowledge acquisition. We define these two types of semantic information as follows:

Similar semantics:

Courses that share the same contextual information in the historical user records, belong to the same category, and share common social tags.

Light-similar semantics:

Courses that share the same contextual information in the historical user records but belong to different categories and have different social tags.

These two kinds of semantic information occur interleaved in the user learning records. Therefore, the significance of the Word2vec paradigm with social tags in a recommender system includes the following two aspects. First, by extracting semantic information from user behavior records, courses that have the same contextual information are located close to each other in the low-dimensional vector space. Second, the course vectors can be fine-tuned with social tags to distinguish similar semantics from light-similar semantics.

3.1 Generic Framework

We propose a course recommender system that exploits Word2vec to obtain semantic information (e.g., Bengio et al. 2013) from historical user records and integrates course social tags and course-user interactions. Figure 1 illustrates the generic framework of the proposed model. The left part of Fig. 1 shows the use of Word2vec to obtain the course vectors, while the right part shows how social tags are integrated to adjust the vectors.

Fig. 1. The overall framework of the proposed model

3.2 Course2vec

Word2vec is one of the most commonly used techniques in NLP. It is a two-layer neural network model for natural language processing introduced by Google in 2013 (Mikolov et al. 2015). In Word2vec, sentences are regarded as ordered sequences of words and serve as the model input; the network is trained to generate a vector representing each word, based on the assumption that words sharing the same context should occupy close positions in the target space. Compared to other language models such as TF-IDF and Latent Semantic Analysis, the generated vectors carry richer semantic information (Zhao and Shang 2010) and can easily be used in downstream tasks such as word clustering and classification. User historical records can be modeled as sentences in Word2vec because both are generated from basic elements according to semantic rules: words co-occurring in a sentence carry semantic information, while relevant courses are located close together in the records during a given period.

Word2vec contains two sub-models, the CBOW model and the Skip-gram model, which are used in different scenarios. In the CBOW model, the surrounding words are the input and the model estimates the likelihood of the center word, while in the Skip-gram model, the center word is the input and the model aims to predict its neighbors. As our goal is to predict the next selected item based on the existing records, this study employs the Skip-gram model.
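To illustrate how the Skip-gram model consumes a user's record, the sketch below (a minimal illustration with hypothetical course IDs; `skipgram_pairs` is not part of the proposed system) enumerates the (center, context) training pairs produced from one sequence.

```python
def skipgram_pairs(sequence, window=2):
    """Enumerate (center, context) pairs from an ordered course record."""
    pairs = []
    for t, center in enumerate(sequence):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(sequence):
                pairs.append((center, sequence[t + j]))
    return pairs

# One user's ordered record (hypothetical course IDs):
print(skipgram_pairs(["c1", "c2", "c3", "c4"]))
# [('c1', 'c2'), ('c1', 'c3'), ('c2', 'c1'), ('c2', 'c3'), ('c2', 'c4'), ...]
```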

Similar to Word2vec, we propose Course2vec to model the course representation. Given the set of courses \( Q_{u} = \left( {c_{1} , \cdots ,c_{n} } \right) \) taken by one user, the sequences \( Q = \left( {Q_{1} , \cdots ,Q_{\left| U \right|} } \right) \) are the ordered course lists. We aim to maximize the probability of the contextual courses in the sequence:

$$ y_{t} = \frac{1}{T}\sum\nolimits_{t = 1}^{T} {\left( {\sum\nolimits_{ - c \le j \le c,j \ne 0} {{\text{log }}p\left( {c_{t + j} |c_{t} } \right)} } \right)} $$
(2)

where \( c_{1} , \ldots ,c_{T} \) are the courses in the training corpus, and \( c \) is the length of the window around the target course \( c_{t} \). The network is trained by feeding it course pairs \( {<}c_{t + j} , c_{t} {>} \). The probability \( \Pr \left( {c_{t + j} |c_{t} } \right) \), the key part of the objective \( y_{t} \), is given by the softmax:

$$ \Pr \left( {c_{t + j} |c_{t} } \right) = \frac{{e^{{v_{{c_{t + j} }} \cdot v_{{c_{t} }} }} }}{{\mathop \sum \nolimits_{{c^{\prime} \in C}} e^{{v_{{c^{\prime}}} \cdot v_{{c_{t} }} }} }} $$
(3)

where \( v_{{c_{t} }} \) and \( v_{{c_{t + j} }} \) \( \in {\mathbb{R}}^{d} \) are the vector representations of courses \( c_{t} \) and \( c_{t + j} \), respectively. Training the whole neural network means maximizing the function \( \Pr \left( {c_{t + j} |c_{t} } \right) \); however, each update of \( \Pr \left( {c_{t + j} |c_{t} } \right) \) adjusts all the neural network cells, which slows down the training process. Negative sampling addresses this problem by modifying only a small number of negative courses when updating the weights. The objective function can be defined as follows:

$$ J = \log \sigma \left( {v_{{c_{t + j} }} \cdot v_{{c_{t} }} } \right) + \sum\nolimits_{i = 1}^{k} {\log \sigma \left( { - v_{{c_{N,i} }} \cdot v_{{c_{t} }} } \right)} $$
(4)

where \( \upsigma \) denotes the sigmoid function, \( c_{N,i} \) are the sampled negative courses, and \( k \) is the total number of randomly selected negative courses.
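The sketch below is a minimal NumPy illustration of Eq. (4) on toy vectors; the vector dimensionality and the random vectors are assumptions for demonstration, not trained embeddings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(v_context, v_center, v_negatives):
    """Eq. (4): the log-sigmoid score of the true (context, center) pair
    plus the log-sigmoid scores of k randomly drawn negative courses."""
    positive = np.log(sigmoid(np.dot(v_context, v_center)))
    negative = sum(np.log(sigmoid(-np.dot(v_n, v_center))) for v_n in v_negatives)
    return positive + negative

# Toy 4-dimensional vectors standing in for course embeddings:
rng = np.random.default_rng(0)
v_ctx, v_ctr = rng.normal(size=4), rng.normal(size=4)
v_neg = [rng.normal(size=4) for _ in range(5)]  # k = 5 negative courses
print(negative_sampling_objective(v_ctx, v_ctr, v_neg))
```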

3.3 Laplacian Eigenmaps

Although the users’ historical records contain contextual information, language models can transform contextual information into vectors. There are two types of semantic information: similar semantics and the other is the light-similar semantics. To deal with this issue, we propose to use Laplacian Eigenmaps as the objective function to generate the low-dimensional representation and fine-tune the course vectors obtained by Course2vec, especially courses share common context but in different categories. The objective function is defined as follows:

$$ {\mathcal{L}} = \sum\nolimits_{i,j} {\upmu_{ij} \left| {\left| {y_{i} - y_{j} } \right|} \right|_{2}^{2} } $$
(5)

where \( y_{i} \) and \( y_{j} \) are the final vectors of courses \( i \) and \( j \), and \( \upmu_{ij} \) is the weight between them. In the model, \( \upmu_{ij} \) acts as a penalty factor: if the penalty is slight, courses \( i \) and \( j \) may be located far apart in the final space. We aim to minimize the objective function to ensure that \( y_{i} \) and \( y_{j} \) are close whenever courses \( i \) and \( j \) carry a heavy penalty.

$$ argmin\,{\mathcal{L}} = argmin\,\sum\nolimits_{i,j} {\upmu_{ij} \left| {\left| {y_{i} - y_{j} } \right|} \right|_{2}^{2} } $$
(6)

In our model, two factors determine \( \upmu_{ij} \): the first is the number of common interactive users, and the second is the number of common social tags. \( \upmu_{ij} \) is defined as follows.

$$ \upmu_{ij} = \frac{{\left| {U_{i} \cap U_{j} } \right|}}{{\left| {U_{i} \cup U_{j} } \right|}} + \frac{{\left| {T_{i} \cap T_{j} } \right|}}{{\left| {T_{i} \cup T_{j} } \right|}} $$
(7)

where \( U_{i} \) and \( U_{j} \) are the sets of users enrolled in courses \( i \) and \( j \), and \( T_{i} \) and \( T_{j} \) are the sets of social tags of courses \( i \) and \( j \), respectively. These social tags are obtained by applying Text-CNN (Kim 2014) to the course introductions and user online comments. To solve the model's objective function, we regard each course as a node in a graph. Given a network \( G = \left( {V,E} \right) \), in which \( V \) and \( E \) are the sets of course nodes and edges, respectively, we obtain the course adjacency matrix \( S \). Each row of \( S \) is \( w_{i} = \left\{ {\mu_{ij} } \right\}_{j = 1}^{n} \), where \( \mu_{ij} \) encodes the link between \( c_{i} \) and \( c_{j} \). Therefore, the objective function can be rephrased as follows:

$$ argmin\,{\mathcal{L}} = argmin\,\sum\nolimits_{i,j} {\upmu_{ij} \left| {\left| {y_{i} - y_{j} } \right|} \right|_{2}^{2} } = argmin\,2tr\left( {Y^{T} LY} \right) $$
(8)

where \( L = D - S \), \( D \) is the diagonal degree matrix with \( D_{i,i} = \mathop \sum \limits_{j} s_{i,j} \), and \( Y \) is the matrix of predicted vectors. Finally, the model generates the final representation of each course from the user historical behavior information. In this paper, cosine similarity over the low-dimensional vectors is employed to measure course similarity, and the top-N most similar courses are recommended to the user.
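As a concrete illustration, the following minimal sketch builds the penalty matrix \( S \) from Eq. (7) and solves the Laplacian Eigenmaps problem of Eq. (8) by eigen-decomposition. It assumes the enrolled-user sets and tag sets are available as Python sets; how the resulting coordinates are combined with the Course2vec vectors during fine-tuning is omitted here.

```python
import numpy as np

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def laplacian_embedding(users, tags, dim=32):
    """Sketch of Eqs. (5)-(8): penalty weights from shared users and shared
    tags, L = D - S, and the bottom non-trivial eigenvectors as coordinates."""
    courses = sorted(users)
    n = len(courses)
    S = np.zeros((n, n))
    for i, ci in enumerate(courses):
        for j, cj in enumerate(courses):
            if i != j:
                S[i, j] = jaccard(users[ci], users[cj]) + jaccard(tags[ci], tags[cj])
    D = np.diag(S.sum(axis=1))
    L = D - S
    _, eigvecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    return courses, eigvecs[:, 1:dim + 1]  # skip the trivial constant eigenvector
```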

4 Experiment

The dataset used in this paper consists of MOOC data selected from XuetangX. XuetangX, launched in October 2012, provides over 1,000 courses distributed across 12 categories, from art and biology to computer science, and more than 10,000,000 users have registered on the platform. In this experiment, we selected enrollment behaviors from October 1st, 2016 to December 30th, 2017 as samples. Each instance in the training set was a sequence of historical records. For each course, we constructed social tags by applying Text-CNN to the course introduction and selected the top-5 social tags as the course tags. Table 1 shows the social tags of some courses.

Table 1. The social tags of sample courses

First, we sorted the users' historical selection records during the period as the dataset. Then, we built the Course2vec model using Python and gensim to transform the semantic information into word representations. Finally, to prevent courses in different categories that share a common context from being conflated in Course2vec, we used the Laplacian Eigenmaps method as the objective function to fine-tune the Course2vec vectors. The penalty factor in the objective function was based on the user-item-tag interactions.
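The Course2vec training step can be sketched with gensim 4.x as follows; the toy `user_records` here are hypothetical placeholders for the sorted per-user course sequences, and the hyperparameters anticipate the Parameter Settings described below.

```python
from gensim.models import Word2Vec

# Toy records: one ordered course-ID list per user (hypothetical IDs,
# repeated so every course clears the min_count threshold).
user_records = [["c101", "c205", "c310"], ["c101", "c310", "c412"]] * 10

model = Word2Vec(
    sentences=user_records,
    vector_size=64,  # dimensionality before Laplacian Eigenmaps reduces it to 32
    window=5,        # five neighbors as the context
    negative=5,      # five negative courses per training pair
    min_count=5,     # ignore courses occurring fewer than five times
    sg=1,            # Skip-gram, as described in Sect. 3.2
)
course_vectors = {c: model.wv[c] for c in model.wv.index_to_key}
```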

Metrics and Baselines.

We compared our proposed method with the following two methods, measured by two metrics: Recall@20 and Precision@20 (a minimal sketch of these metrics follows the baseline definitions below).

Collaborative Filtering (CF):

This method regards the user-item interactions as the original user vector: if the user enrolled in a course, the corresponding entry of the interaction matrix equals 1, and 0 otherwise.

Matrix Factorization (MF):

This method models user preferences by decomposing the user-item interaction matrix to obtain user and item embeddings.
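The two metrics can be sketched as follows for a single user; `recommended` and `relevant` are assumed inputs (the ranked recommendation list and the held-out enrolled courses), and the per-user scores are averaged over all test users.

```python
def precision_recall_at_k(recommended, relevant, k=20):
    """Precision@k and Recall@k for one user.

    recommended: ranked list of course IDs produced by the model.
    relevant: set of held-out courses the user actually enrolled in.
    """
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```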

Parameter Settings.

To obtain the course vectors with Course2vec, we constructed the model with a 64-dimensional course vector as output, selected five neighbors as the context and five courses as negatives, and ignored courses that occurred fewer than five times. We chose a relatively high-dimensional vector here because a high dimension preserves rich information; in the following step, Laplacian Eigenmaps reduced the number of vector dimensions to 32. Based on each course's selection records, we computed the similarity between courses from the course vectors and recommended the top-N courses.
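The final retrieval step can be sketched as below, assuming `vectors` maps course IDs to their fine-tuned 32-dimensional embeddings (an assumed structure; the paper does not fix a storage format).

```python
import numpy as np

def top_n_similar(course_id, vectors, n=20):
    """Rank all other courses by cosine similarity to the target course."""
    target = vectors[course_id]
    scores = {}
    for cid, vec in vectors.items():
        if cid != course_id:
            scores[cid] = np.dot(target, vec) / (
                np.linalg.norm(target) * np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:n]
```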

Table 2 shows the results of course recommendations in terms of Recall@20 and Precision@20. Course2vec++ is the Course2vec model extended with course social tags and course-user interactions. The proposed Course2vec and Course2vec++ models outperformed all baselines. The second-best performance, achieved by Course2vec, also verifies that incorporating semantic information can improve the effectiveness of recommendation. Course2vec++ achieved the best performance and was slightly better than Course2vec. Table 3 shows a qualitative example of recommendation results for a specific user (user id 58 in the dataset) produced by the two models (Course2vec and Course2vec++).

Table 2. Performance of course recommendations by using different models
Table 3. A qualitative example of recommendation results of a specific user (id 58)

5 Conclusion

This paper proposed a method of course recommendation. First, we represented users' historical records as sentences and exploited the skip-gram model with negative sampling to obtain course embeddings. To prevent the two types of semantic information that share a common context in the records from being conflated, we integrated course social tags and course-user interactions as penalty factors to adjust the course embeddings across different categories. Experimental results showed that Course2vec captures semantic information from the records, and compared to recommendations based on Course2vec, Course2vec++ improves the recommendation accuracy. However, this paper focused on observed interactions between users and courses from the perspective of their bipartite graph. In future research, we plan to integrate information extracted from unobserved items to improve the accuracy and serendipity of recommendations.