
1 Introduction

The rapid development of modern networks and Web 2.0 has spawned many open online platforms, among them the online course platforms Coursera and Xuetang Online, which let learners study conveniently from home. Following this trend, a large amount of knowledge data has been created, including course videos and their subtitles. However, it is difficult for learners to understand and analyze this knowledge from a global perspective, whereas course concepts can describe the knowledge points contained in these classrooms or textbooks. Grasping the overall concepts makes the subject easier to learn and helps learners understand the text.

Although quite a few studies [5, 14, 19, 25] on course concept extraction from teaching materials and course subtitles have been conducted, the problem of concept extraction from course subtitles in MOOCs is far from solved. Course concept extraction is non-trivial and challenging for three reasons: the single-short-context problem, the low-frequency problem, and the poor diversity of concepts.

Related research topics, including keyphrase extraction [6, 12, 15, 21, 22] and term extraction [8, 11], are popular and effective in the information retrieval domain. Pan et al. [19] introduce external knowledge to explore the relationships between different concepts that have the same meaning. However, they exploit external knowledge only through the word-embedding relatedness of candidates, and thus fail to utilize global embedding features. Furthermore, their work is based on multiple documents, while ours is simple and requires only the current document (a single document) rather than an entire corpus. Also, their work relies heavily on seed sets, yet such seed sets are difficult to acquire in some cases.

Fig. 1. Our proposed framework LTWNN. Note that we do not need to extract candidates in the training phase.

To address the above problems, in this paper we propose learning to weight with neural networks, using sentence embeddings, for course concept extraction, as shown in Fig. 1. The key idea is that our model can not only improve the diversity of extracted course concepts by introducing external knowledge, but also automatically learn weights that balance internal statistical information and external knowledge. First, we extract keyphrases as candidates with a Part-of-Speech (POS) rule template, and we introduce external knowledge by representing each document with a sentence embedding model. Then, to improve the diversity of extracted concepts, we introduce the MMR algorithm and adapt its formula to our task. Next, we combine the MMR score with statistical information (i.e., PMI), and our model learns to weight them with a neural network classifier (e.g., an MLP). Finally, in the prediction phase, the MMR score and PMI score of each candidate concept are fed into the trained model. Note that we do not care about the label the model predicts; we simply take the maximal probability in the result, as shown in Fig. 2. After predicting each candidate concept, we rank candidates by this maximal probability.

Fig. 2. For the candidate concept 'bigram model', the maximal probability corresponds to the classification result 'Java statements'. Here we simply take the value P = 0.34 as the score of the concept.
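The paper specifies only that candidates are extracted with a POS rule template; as an illustrative sketch, the snippet below uses NLTK with an assumed noun-phrase pattern (zero or more adjectives followed by nouns), which may differ from the authors' actual template.

```python
# Sketch of POS-template candidate extraction (assumed pattern: (ADJ)*(NOUN)+).
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

chunker = nltk.RegexpParser("CANDIDATE: {<JJ>*<NN.*>+}")  # assumed template

def extract_candidates(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, _ in st.leaves())
            for st in tree.subtrees(filter=lambda t: t.label() == "CANDIDATE")]

print(extract_candidates("The quick sort algorithm uses a recursive partition step."))
```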

The main contributions of our model are summarized as follows:

  • We propose to introduce the MMR algorithm and utilize external knowledge to calculate the relatedness between candidate concepts and documents, which effectively improves the diversity of extracted concepts.

  • We propose to combine internal statistical information and external knowledge properly, applying neural networks to learn a weight for each feature automatically.

  • We propose LTWNN, which incorporates neural networks into the course concept extraction model without relying on a multi-document corpus or seed sets.

2 Related Work

2.1 Course Concept Extraction (CCE)

Building on keyphrase extraction, Pan et al. [19] compared the task with keyword extraction and designed a novel graph-based propagation process. Chen et al. [5] extended Pan's approach, improving the quality of candidate concepts via a novel automated phrase mining method called AutoPhrase [24]. Moreover, based on Pan's approach, Yu et al. [25] achieved course concept expansion with an interactive game.

Different from the architectures listed above, which regard CCE as a ranking problem, Lu et al. [14] applied deep learning to CCE by defining three types of tags for educational textbooks. Their model mainly adopts a gated recurrent unit (GRU) network. However, their application scenario is national curriculum standards of mathematics, which differs from ours, since the colloquial nature of course subtitle data makes our task harder. All the above approaches provide valuable references for our work on course concept extraction.

2.2 Word and Sentence Embeddings

We introduce external knowledge via embeddings in this paper; next, we review their development. Word embedding (word2vec) [16] was proposed to capture semantics by representing words as vectors in a continuous vector space. To make up for the weaknesses of word2vec, GloVe [20] was proposed to train embeddings using global corpus statistics: it integrates global matrix factorization with local context-window methods, which enriches the semantic and syntactic information between words.

Representing entire sentences and documents is needed to measure the relatedness between two sentences. Similar to word2vec, Skip-Thought [9] provides sentence embeddings trained to predict neighboring sentences. Based on Skip-Thought, Logeswaran et al. [13] proposed Quick-Thoughts, which classifies neighboring sentences instead of generating a new sentence and therefore trains much faster than Skip-Thought. Different from general word vectors, Sent2Vec [18] produces word and n-gram vectors that, after special training, can be combined to form sentence vectors. Additionally, experiments conducted by [1] suggest that sentence representations based on averaged word vectors are effective. We use this property in our embedding method because it is accessible and effective.
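As a concrete illustration of this averaging scheme, the sketch below computes a sentence embedding as the mean of its word vectors; `embeddings` is assumed to be a pretrained word-vector lookup (e.g., a loaded GloVe table), and the default dimension of 300 matches the Z = 300 embeddings used later in Fig. 5.

```python
# Sketch: a sentence embedding as the average of its word vectors,
# the simple scheme from [1] that our embedding method builds on.
import numpy as np

def sentence_embedding(tokens, embeddings, dim=300):
    vecs = [embeddings[w] for w in tokens if w in embeddings]
    if not vecs:                  # no token covered by the vocabulary
        return np.zeros(dim)
    return np.mean(vecs, axis=0)  # element-wise average of word vectors
```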

3 LTWNN: Learning to Weight with Neural Networks

Next, we describe each procedure of the proposed method. Note that the extraction of candidates has been described above, so it will not be detailed again.

3.1 Statistical Information

Statistical information is usually regarded as an important quantitative indicator for extracting keyphrases, e.g., TF-IDF [21], Log-Likelihood (LL) [7], and Pointwise Mutual Information (PMI) [6]. Because our documents are single short texts, we adopt PMI to obtain sufficient statistical features. The basis of these methods is that if the constituents of a multi-word candidate form a collocation rather than co-occurring by chance, the candidate is more likely to be a true phrase [10]. Specifically, for an N-gram candidate concept \({P=\{c_{1},c_{2},...,c_{N}\}}\), where \({N>1}\), the PMI is calculated by

$$\begin{aligned} {PMI(c_{1},c_{2})}=\frac{2\times {freq(c_{1},c_{2})}}{freq(c_{1})+freq(c_{2})} \end{aligned}$$
(1)

where freq(P) indicates the frequency of the candidate concept P in one document \({d\in {Cor}}\). For candidates that are N-grams with N > 2, the PMI is defined as

$$\begin{aligned} {PMI_{t}}={max(\{PMI(P,B)\})} \end{aligned}$$
(2)

where \({P=\{c_{1},c_{2},...,c_{i}\}}\) and \({B=\{c_{i+1},c_{i+2},...,c_{N}\}}\) range over the binary splits of the candidate, and the maximum is taken over all split points i.
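The following minimal sketch illustrates Eqs. (1) and (2). The tuple-keyed frequency table `freq` and the unigram fallback of 0.001 (the value mentioned in Sect. 4.1) are our assumptions for illustration.

```python
# Sketch of the PMI-style score of Eqs. (1)-(2). `freq` maps a tuple of
# tokens (unigram, bigram, or longer n-gram) to its frequency in the document.
def pmi_pair(a, b, freq):
    # Eq. (1): association strength of the two constituents a and b
    return 2.0 * freq.get(a + b, 0) / max(freq.get(a, 0) + freq.get(b, 0), 1)

def pmi_score(ngram, freq):
    # Eq. (2): for longer candidates, take the best binary split P|B
    if len(ngram) < 2:
        return 0.001  # unigrams get a fixed small value (Sect. 4.1)
    return max(pmi_pair(ngram[:i], ngram[i:], freq)
               for i in range(1, len(ngram)))

freq = {("quick",): 3, ("sort",): 4, ("quick", "sort"): 2}
print(pmi_score(("quick", "sort"), freq))  # 2*2/(3+4) ≈ 0.571
```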

3.2 From Embedding to Candidate Concepts with MMR

The low-frequency and single-short-context problems lead to apparent weaknesses. For example, most candidates appear only once (i.e., \({freq(c_{1},c_{2})=1}\)), which shows that the semantic relatedness between candidate concepts provided by internal statistics is limited. Therefore, we propose to represent candidates with information from external knowledge.

Typical embedding methods (e.g., word, sentence, and document embeddings) perform well at capturing semantic relatedness between different words within a shared vector space. Word embeddings represent each phrase and word as a low-dimensional vector, and the relatedness between two phrases is reflected by the cosine distance of their vectors. Here, we use trained word embeddings \({vec=\{v_{w1},v_{w2},...,v_{wi}\}}\), where \({v_{wi}}\) is the real-valued vector of word \({w_{i}}\). Then, for each candidate of length L, \({P=\{char_{1},char_{2},...,char_{L}\}}\), we obtain its vector sequence \({vp=\{v_{1},v_{2},...,v_{L}\}}\), where \({v_{i}}\) is the word vector of \({char_{i}}\) taken from vec.

Obtaining word vectors from external knowledge helps improve semantic relatedness for low-frequency words, and it raises the probability of extracting informal expressions such as "Q sort" for "quick sort". However, it also brings new problems. For example, we may extract the concepts "bubble sort algorithm" and "heap algorithm", while another candidate "algorithm methods" is also extracted merely because it contains the keyword "algorithm". Pan et al. [19] called this the "overlapping problem" and simply introduced a penalty factor to overcome it. In fact, that method may incorrectly filter out gold concepts containing "algorithm", because it is hard to choose a proper value for the penalty factor.

To address this problem, inspired by [4], we introduce Maximal Marginal Relevance (MMR), one of the simplest and most effective ways to balance query-document relevance and document diversity. Next, we show how to adapt the MMR algorithm to our course concept extraction task.

The original MMR is used to improve diversity in the information retrieval and recommendation domains. Specifically, given the set of all retrieved documents R, an input query Q, and a set S that accumulates good answers for Q, in each iteration the next document is selected by computing MMR as in formula (3), where Sim denotes the cosine similarity between two documents or between a document and the query, \({\lambda }\) is a balance factor that controls the relevance and diversity of the result, and \(D_{i}\) and \(D_{j}\) are retrieved documents.

$$\begin{aligned} \mathbf{MMR}:=\underset{D_{i} \in R \backslash S}{\arg \max } [\lambda \cdot {Sim}_{1}\left( D_{i}, Q\right) -(1-\lambda ) \max _{D_{j} \in S} {Sim}_{2}\left( D_{i}, D_{j}\right) ] \end{aligned}$$
(3)

To use MMR here, we adapt the formula to our task [2], as follows:

$$\begin{aligned} \mathbf{MMR}:=\underset{C_{i} \in C \backslash K}{\arg \max }[\gamma \cdot \widehat{\cos }_{sim}\left( C_{i}, doc\right) -(1-\gamma ) \max _{C_{j} \in K} \widehat{\cos }_{sim}\left( C_{i}, C_{j}\right) ] \end{aligned}$$
(4)

where C is the set of candidate concepts, K is the set of already extracted concepts, doc is the full embedding of each preprocessed course corpus (described below), and \({C_{i}}\) and \({C_{j}}\) are the embeddings of candidate concepts i and j, respectively. We set \({\gamma }\) to 0.5 so that the relatedness and diversity parts of the equation have equal importance. Note that \({\widehat{\cos }}\) is a normalized cosine similarity [17], described by the following equations.

$$\begin{aligned} \widehat{\cos }_{sim}\left( C_{i}, doc\right) =0.5+\frac{ncos_{sim}\left( C_{i}, doc\right) -\overline{ncos_{sim}(C, doc)}}{\sigma \left( ncos_{sim}(C, doc)\right) } \end{aligned}$$
(5)
$$\begin{aligned} ncos_{sim}(C_{i}, doc)=\frac{\cos _{sim}\left( C_{i}, doc\right) -\min _{C_{j} \in C} \cos _{sim}\left( C_{j}, doc\right) }{\max _{C_{j} \in C} \cos _{sim}\left( C_{j}, doc\right) } \end{aligned}$$
(6)

To compute the cosine similarity between each candidate concept and the corresponding entire course corpus, we need the full embedding of each document (i.e., a single video's corpus). Compared with word embeddings, sentence embeddings have been shown to retain key sentence information, which improves the semantic relatedness between a concept and its corpus document.
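Putting Eqs. (4)-(6) together, a minimal sketch of the adapted MMR loop is shown below. For simplicity it normalizes only the candidate-document similarities and uses the raw cosine for the candidate-candidate diversity term; this simplification, like the helper names, is our assumption rather than the authors' exact procedure.

```python
# Sketch of the adapted MMR of Eq. (4) with the normalized cosine of
# Eqs. (5)-(6): gamma = 0.5 balances relevance against diversity.
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def mmr_scores(cand_vecs, doc_vec, gamma=0.5, top_k=10):
    sims = np.array([cos(v, doc_vec) for v in cand_vecs])
    ncos = (sims - sims.min()) / (sims.max() + 1e-12)         # Eq. (6)
    norm = 0.5 + (ncos - ncos.mean()) / (ncos.std() + 1e-12)  # Eq. (5)
    selected, remaining, scores = [], list(range(len(cand_vecs))), {}
    while remaining and len(selected) < top_k:
        best, best_val = None, -np.inf
        for i in remaining:
            diversity = max((cos(cand_vecs[i], cand_vecs[j]) for j in selected),
                            default=0.0)                      # max sim to chosen set
            val = gamma * norm[i] - (1 - gamma) * diversity   # Eq. (4)
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        remaining.remove(best)
        scores[best] = best_val  # this MMR score is later fed to the classifier
    return scores
```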

3.3 Learn to Weight and Concepts Ranking

Learn to Weight. To properly allocate a weight to each feature, we apply a Multi-Layer Perceptron (MLP) to predict the concept label \({y_{c}\in \{0,1,2,...,n\}}\), where n is the total number of gold-concept labels, as follows:

$$\begin{aligned} p(y_{c} \mid c)=M L P(c), \end{aligned}$$
(7)

where c is a candidate concept to be classified.

Concepts Ranking. In the prediction phase, \({X=[PMI;MMR]}\) is used as input, and we obtain the classification probability via the MLP, as follows:

$$\begin{aligned} pro = softmax(ReLU(XW_{h} + b_{h})) \end{aligned}$$
(8)
$$\begin{aligned} score = max(pro) \end{aligned}$$
(9)

Again, we do not care about the classification label of each candidate but focus on the maximal classification probability. We take the maximum of the softmax output as the score of each candidate and then select the Top-K candidates as concepts by ranking these scores.
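A minimal PyTorch sketch of Eqs. (7)-(9) follows. The class name and any training details (e.g., cross-entropy over the n gold-concept labels) are our assumptions; the paper specifies only the forward computation.

```python
# Sketch of Eqs. (7)-(9): a single linear layer (W_h, b_h) followed by ReLU
# and softmax maps the 2-d feature X = [PMI; MMR] to label probabilities;
# the maximal probability, not the predicted label, is the ranking score.
import torch
import torch.nn as nn

class LTWNNScorer(nn.Module):                  # hypothetical name
    def __init__(self, n_labels):
        super().__init__()
        self.hidden = nn.Linear(2, n_labels)   # W_h and b_h of Eq. (8)

    def forward(self, x):
        return torch.softmax(torch.relu(self.hidden(x)), dim=-1)  # Eq. (8)

def concept_score(model, pmi, mmr):
    x = torch.tensor([[pmi, mmr]], dtype=torch.float32)  # X = [PMI; MMR]
    with torch.no_grad():
        pro = model(x)
    return pro.max().item()                    # Eq. (9): score = max(pro)
```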

Table 1. The four datasets we use. Columns are: the domain of each dataset; the number of documents (i.e., course subtitles); the number of gold concepts; the average number of candidates per document; the average number of tokens per document; the average number of gold concepts per document.

4 Experiments

4.1 Dataset and Experiments Setup

We evaluate the proposed model on online MOOC datasets. The datasets include two course corpora, in the Computer Science and Economics domains, in two different languages. The statistics of the MOOC datasets are reported in Table 1. In the training phase, our method uses the gold concepts as classification labels; for the statistical feature PMI, we calculate word frequencies over the entire corpus (i.e., all documents). For concepts with N = 1 (i.e., of length 1), we directly set PMI to 0.001. In the evaluation phase, we extract feature information on each single document separately, not on the entire corpus.

4.2 Evaluation Measure

In this paper, we select Mean Average Precision (MAP) as an evaluation metric. To measure the precision of the ranked list, we also use R-precision [26], a standard information retrieval metric distinct from Recall and Precision. Specifically, given a ranked list of candidate concepts, it computes the proportion of gold concepts (i.e., the precision) among the K highest-ranked candidates, and the value of K is varied in the experiments.
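For reference, the sketch below shows how the two metrics can be computed for one document; `ranked` is the ranked candidate list, `gold` the set of gold concepts, and the helper names are ours (MAP is the mean of average precision over all documents).

```python
# Average precision for one ranked list (MAP averages this over documents)
def average_precision(ranked, gold):
    hits, total = 0, 0.0
    for i, c in enumerate(ranked, start=1):
        if c in gold:
            hits += 1
            total += hits / i      # precision at each correct position
    return total / max(len(gold), 1)

# Precision over the K highest-ranked candidates (R-precision style)
def precision_at_k(ranked, gold, k):
    return sum(c in gold for c in ranked[:k]) / k
```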

Table 2. Comparison of the proposed method with CGP on the four datasets. MAP and R-precision at K = 5, 10, 15 are reported, together with two ablation experiments on diversity.
Fig. 3. Comparison of our method with baselines on the four datasets under the MAP metric; the green lines show the performance of our method.

4.3 Comparison Method with Baseline Models

We compare the proposed method LTWNN against the following baselines:

PMI [6]: In the Pointwise Mutual Information (PMI) method, we directly rank each candidate concept by the score calculated as described in Sect. 3.1.

TextRank [15]: TextRank is a well-known graph-based algorithm inspired by PageRank [3]. It regards each candidate as a vertex and word relatedness as an edge; on this undirected weighted graph, TextRank iteratively computes the rank value of each vertex (see the sketch below).
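As an illustration only (not the baseline's original implementation), TextRank scoring can be sketched with networkx, treating candidates as vertices and sentence-level co-occurrence counts as edge weights:

```python
# Sketch of TextRank scoring: build an undirected weighted co-occurrence
# graph over candidates and rank vertices with PageRank.
import itertools
import networkx as nx

def textrank_scores(sentences):        # sentences: lists of candidate words
    g = nx.Graph()
    for sent in sentences:
        for u, v in itertools.combinations(set(sent), 2):
            w = g[u][v]["weight"] + 1 if g.has_edge(u, v) else 1
            g.add_edge(u, v, weight=w)
    return nx.pagerank(g, weight="weight")  # iterative rank computation
```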

CGP [19]: Concept Graph Propagation is the state-of-the-art method for course concept extraction on the MOOC dataset. It constructs a concept graph for each course corpus, similar to TextRank; unlike TextRank, however, it calculates concept scores with PMI and external knowledge via generalized voting scores.

4.4 Result Analysis

As shown in Table 2, at the overall level, our method LTWNN outperforms the existing methods on three of the four datasets in both MAP and R-precision.

On the English data, LTWNN outperforms the other methods at K = 5. Moreover, at K = 10 and K = 15, LTWNN performs on par with the state-of-the-art model.

On the CSZH dataset, LTWNN shows clear robustness and effectiveness compared with the other methods. From Table 1, the average number of gold concepts per document is only 1.86, which indicates that the low-frequency and poor-diversity phenomena are more pronounced on this dataset than on the others. The experiment therefore suggests that LTWNN is effective at solving the low-frequency and poor-diversity problems on a single document.

The performance of LTWNN on the EcoZH dataset is worse than that of the existing models CGP and PMI. We conduct an experiment on the influence of the diversity factor on the EcoZH dataset, as shown in Fig. 4 and formula (4): as \({\gamma }\) increases (i.e., diversity decreases), the performance of LTWNN rises continuously. The experiments also show that the robustness and effectiveness of TextRank on the CSZH dataset are more evident than on CSEN and EcoEN, because the average number of tokens per document is smaller in the former dataset.

Ablation Study. In our approach, the diversity of concepts plays a critical role in improving course concept extraction. As can be seen in Fig. 5, we show a concrete example, using one 300-dimensional vector to represent a single document and a 300-dimensional vector for each candidate concept. We select the top-10 gold concepts out of 23 candidates; the closer a candidate is to the document vector, the higher its probability of being a gold concept. Furthermore, as shown in Table 2 and Fig. 3, the comparison of LTWNN-Without-PMI and LTWNN-Without-MMR (on all datasets except EcoZH) suggests that poor diversity hampers performance.

Fig. 4. The influence of the diversity factor on the EcoZH dataset.

Fig. 5. The effect of diversity on the distribution of extracted concepts. Embedding space (visualized by multidimensional scaling with cosine distance on the original Z = 300-dimensional embeddings) of one document, including the sentence embedding and the word embeddings of candidates such as "Q sort" and "unstable sorting algorithm".

5 Conclusion and Future Work

This study demonstrates how course concepts are extracted from a MOOC corpus, in which each online course may attract more than 100,000 learners [23]. Because of the open nature of such courses, learners have diverse knowledge backgrounds, and this study aims to extract core knowledge for students of different backgrounds. The content of MOOC courses is usually rich and complex, making it difficult for students to understand and analyze the knowledge from a global perspective. Course-related concepts represent the core knowledge and will help students grasp it.

Moreover, constructing an educational knowledge graph based on course concept entities is helpful for both students and teachers, enabling personalized education and deep knowledge tracing. With course concept extraction, we can also build interactive systems that help students better grasp core knowledge.

In future work, incorporating other external knowledge, such as topic knowledge that classifies course knowledge into several groups, is a promising way to further improve the performance of course concept extraction.