
1 Introduction

The lecture videos captured in classrooms contain a substantial portion of the instructional content [12]. These videos have several advantages over graphic and textual media, e.g., the portrayal of concepts involving motion, the alteration of space and time, and the observation of dangerous processes in a safe environment [14]. They also let the viewer pause and review at leisure, which is not possible in traditional classrooms. Such videos are often hosted on university intranets as well as on numerous online sites like Coursera, Khan Academy, etc. Leading universities like MIT and Stanford have made their lectures available online for distance learning, and numerous lecture videos are uploaded to video sharing sites like YouTube and Videolectures.net to facilitate further learning. With the increase in Internet speed globally, the demand for multimedia content has been growing over the past few years, with a projected Internet video consumption of 56,800 PB/month in 2018 [1]. E-learning and smart learning videos constitute a large fraction of this consumption, so the demand for e-learning multimedia content continues to rise.

The entire e-learning process is not limited to videos alone. Text-based blogs and websites like Wikipedia and Edublogs are also important sources of learning material. Wikipedia is a user-generated knowledge base on the Internet that is frequently consulted for learning and reference by users from both academia and industry; in essence, wikis offer an online space for collaborative authorship and writing. A survey shows that students frequently, if not always, consult Wikipedia at some point during their course-related research [8], often as a convenient go-to source when they are stuck on a concept and cannot move forward [8]. However, Wikipedia usually provides only a brief explanation of concepts without detailed descriptions. To provide comprehensive learning for students over the Internet, content from multiple sources, e.g., text and videos, is required and should be provided in an integrated platform [15]. For example, Wikipedia content can be enhanced with multimedia such as videos of online lectures covering the topic described in the wikipage.

In this paper, we design a system for integrating blogs and videos containing educational multimedia data and propose a novel recommendation system called Videopedia. We provide the necessary matching between the text in webpages and lecture videos based on their contents. Retrieving suitable videos for a particular webpage is challenging mainly because webpages and videos are represented in different forms, and finding a correlation between them is not straightforward. Moreover, e-learning lecture videos mostly consist of lecture slides, whiteboard/blackboard content and sometimes the instructor delivering the lecture, so extracting the content of these videos from visual words is not a plausible solution. Video transcripts generated by automatic speech recognition (ASR) of the audio have been tried for content detection [16], but most ASR-extracted text contains errors due to poor audio recording quality, which degrades the spoken text's usability for video retrieval [6]. Recently, most video sharing platforms like TED and YouTube provide video transcripts in the form of closed captions (CC); these are often created manually and contain fewer errors than ASR output. These transcripts provide the content of the lecture videos, which we can then mine to find the correlation with the text-based blogs.

Our novel algorithm, implemented as a part of Videopedia, thus employs topic modeling to extract the contents of the webpages and the videos. We use Latent Dirichlet Allocation (LDA) [3] to find the topics in a particular website and those in a particular video. These topics provide a better representation of the content of a video than its metadata. The topics for the videos are extracted from video transcripts, stored on the server and indexed by their topic distributions; the topics representing the content of a blog are generated at run-time. Videopedia determines the similarity between the topics of a webpage and those of a video and recommends the relevant videos based on this similarity. In summary, our specific contributions in this paper are as follows:

  • For a given webpage, recommend relevant videos that elaborate the concepts it introduces.

  • Provide a comprehensive framework that combines multimodal data, i.e., text and video, for an enhanced e-learning experience.

  • Effectively use video transcripts for lecture video recommendation.

The rest of the paper is organized as follows. In Sect. 2 we provide a brief summary of the related work. In Sect. 3 we give a description of the model of our system and Sect. 4 summarizes our experiments and the results. We conclude with the direction of our future work in Sect. 5.

2 Related Work

Chen et al. attempted to link web-videos to wikipages by leveraging Wikipedia categories (WikiCs) and content-duplicated open resources (CDORs) [5]. Later they adopted a multiple tag property exploration (mTagPE) approach to hyperlink videos to webpages [6]. Okuoka et al. used Wikipedia entries to label news videos [15]; this labeling was primarily done using date information, with labels extended along the topic thread structure following a time-series semantic structure. Roy et al. used Wikipedia articles to model the topic of a news story [17]. Topic modeling has been used in many research areas to find the latent topics in documents, several of them in the e-learning scenario. Wang and Blei combined probabilistic topic modeling with collaborative filtering to recommend scientific papers [19]. Chen, Cooper, Joshi, and Girod proposed a Multi-modal Language Model (MLM) that uses latent variable techniques to explore the co-occurrence relation between multi-modal data [4]; they mainly use MLMs to index text from slides and speech in lecture videos. One of the major contributions to indexing and searching lecture videos is by Adcock et al. [2], who designed a system that provides keyword-based search over a database of 62,406 video lectures and talks. Zhu, Shyu and Wang used topic modeling in video recommendation and designed VideoTopic [20], which generates a personalized video recommendation list to fit the user's interests; the bag-of-words model used in VideoTopic contains both visual and textual features.

In contrast, Videopedia is designed to automatically recommend relevant videos based on the content of the webpages, not for a particular user. The aim of our system is to integrate e-learning materials from different online sources, not to develop a latent variable model like MLM [4]. One of the main problems in extracting the content of lecture videos is the absence of meaningful visual words; hence we need to extract the content covered in the speech of the lecture videos. Compared to Chen et al. [5], our system extracts the educational material covered in the videos and not just metadata. Also, the videos in Videopedia are indexed by the latent topics present in them, not by timing information as in Okuoka et al. [15]. We recommend lecture videos for Wikipedia entries rather than using the content of wikipages to mine topics in our videos as Roy et al. do [17]. In the domain of online education, topic modeling has so far been used only for mining text, not for lecture video content analysis; ours is a first attempt in that direction. Also, by indexing the videos by topics, we propose a content-based search and not just a keyword-based search as proposed by Adcock et al. [2].

3 System Model

Videopedia proposes a novel video search technique based on topic extraction and automatically recommends videos based on the content of a particular webpage or educational blog. Existing e-learning systems provide keyword-based video search, e.g., the system of Adcock et al. [2], and hence cannot automatically recommend videos for wikipages and other blogs. The webpage and the video are two different media formats, and finding a correlation between them is the main challenge in designing such a system. An important consideration is that we are recommending lecture videos, so mining the visual content of the videos through visual words does not provide much semantic information: the lecture content is delivered through slides or blackboards/whiteboards, and the visual content does not change much throughout the video. However, the instructor explains most of the points written on the slides and/or boards, so the speech of a video represents its content. Therefore, we use the transcripts of these videos as representatives of the content of the lecture videos.

Figure 1 provides a pictorial description of the proposed model. A user views a webpage (e.g., a Wikipedia page) in a browser, which requests videos by sending the URL of the webpage to the server. The content of the webpage is extracted and the latent topics are generated from it. The video repository is stored on the server, and the topic models of these videos are generated beforehand. The matching between the topic models of the webpage and those of the video transcripts in the repository is done on the server side in real time, and the recommended videos are then linked to the webpage in the browser.
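The paper does not publish server code; the following is a hypothetical Python/Flask sketch of the request flow in Fig. 1, with stub helpers standing in for the components of Sects. 3.3, 3.5 and 3.6. None of these function names appear in the paper.

```python
# Hypothetical sketch of the Videopedia server endpoint (Fig. 1).
from flask import Flask, jsonify, request

app = Flask(__name__)

def extract_page_text(url: str) -> str:      # Sect. 3.3: fetch and clean the wikipage (stub)
    raise NotImplementedError

def infer_topics(text: str):                 # Sect. 3.5: LDA topic distribution (stub)
    raise NotImplementedError

def rank_videos(topics, top_n=5):            # Sect. 3.6 / Algorithm 1: similarity ranking (stub)
    raise NotImplementedError

@app.route("/recommend")
def recommend():
    url = request.args.get("url")            # URL of the page the user is viewing
    topics = infer_topics(extract_page_text(url))
    return jsonify(rank_videos(topics))      # video links sent back to the browser
```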

Fig. 1. System model for Videopedia

3.1 Dataset Used for Recommendation

For this research, we collected videos from the YouTube channel of the National Programme on Technology Enhanced Learning (NPTEL) and from Videolectures.net. Our video repository consists of 3,000 videos covering the following subjects: Humanities and Social Sciences, Metallurgical and Material Sciences, Civil Engineering, Electrical Engineering, Chemistry, Mathematics, Electronics Engineering and Management. All the videos have English as the medium of instruction; hence the generated transcripts consist of English words.

3.2 Extracting Video Content

The contents of the videos in Videopedia are represented by their transcripts. For YouTube videos, the transcripts are generated using YouTube's closed caption (CC) feature. We also extract the metadata of each video in the form of its title, description and keywords. For videos from Videolectures.net, we use the subtitle files available along with the uploads; removing the timing information from the subtitle files yields the transcript text. These text files are then processed as illustrated in Sect. 3.4.
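As an illustration of the timing-removal step, here is a minimal Python sketch that converts a SubRip (.srt) subtitle file into plain transcript text. The exact subtitle format is an assumption; the paper only states that timing information is removed.

```python
# Minimal sketch, assuming SubRip (.srt) subtitle files.
import re

def srt_to_transcript(srt_text: str) -> str:
    """Strip cue numbers and timestamp lines from an .srt file, keeping only the text."""
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue                      # blank separator or cue index
        if re.match(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->", line):
            continue                      # timing line, e.g. 00:01:02,000 --> 00:01:05,000
        kept.append(line)
    return " ".join(kept)

# Usage: transcript = srt_to_transcript(open("lecture.srt", encoding="utf-8").read())
```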

3.3 Extracting Webpage Content

When the browser sends a request for relevant videos, the content of the Wikipedia page is extracted at the server. For the metadata of the webpage, we take the title of the Wikipedia page and the summary at the top of the page. The content of the remaining part of the wikipage, along with all the subsections, is extracted for topic modeling. Processing according to Sect. 3.4 then gives us the bag of words used to represent the webpage.
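A rough sketch of this extraction step is given below, using requests and BeautifulSoup; the paper does not name its HTML-parsing tools. The lead paragraph is taken as the metadata summary and the remaining paragraphs as the content for topic modeling.

```python
# Sketch of webpage content extraction, assuming requests + BeautifulSoup.
import requests
from bs4 import BeautifulSoup

def extract_wiki_content(url: str):
    """Return (title, summary, body_text) for a Wikipedia page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    paragraphs = [p for p in paragraphs if p]
    summary = paragraphs[0] if paragraphs else ""   # lead section used as metadata
    body = " ".join(paragraphs[1:])                 # remaining sections for topic modeling
    return title, summary, body
```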

3.4 Processing the Extracted Text

Table 1. Comparison of precision before and after dictionary matching.

The transcripts extracted from the videos are not fully accurate, which necessitates correcting them against a dictionary. We match the words against a dictionary provided by WordNet and use the nearest word as a substitute for each incorrect word. Next, we remove the stop-words from the remaining words in the transcripts, which leaves us with the main bag of words of the documents. Since the videos are educational in nature, the topics mined from the transcripts should also be educational. Hence we generate the representative words of these transcripts by removing the non-academic words from the bag of words, matching them against a set of 20,000 academic words [7]. To find the best matching between the wikipages and the lecture videos, we extend the bag of words with synonyms of the academic words already present in it; this minimizes errors when a webpage and a video illustrate the same topic but use synonymous words. We apply the word embedding model Word2vec [13] to extract words similar to those in a video's transcript and use the resulting vector of similar words as the representative bag of words for the video. A similar approach is taken while processing the content of the Wikipedia pages. The improvement in recommendation precision from dictionary matching is shown in Table 1. We define the methods VSM and PLSA in Sect. 4.1.
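A condensed sketch of this pipeline follows. The use of NLTK's WordNet and stop-word list, the file name for the academic word list [7], and the particular pretrained Word2vec model are all assumptions; the paper does not specify these resources.

```python
# Sketch of the cleaning pipeline in Sect. 3.4 (assumed resources noted below).
import difflib
import gensim.downloader as api
from nltk.corpus import stopwords, wordnet   # requires nltk data to be downloaded

w2v = api.load("word2vec-google-news-300")   # assumed pretrained Word2vec model [13]
stop = set(stopwords.words("english"))
vocab = set(wordnet.words())                 # WordNet lemmas as the spell-check dictionary
academic = set(open("academic_words.txt").read().split())  # hypothetical file: ~20,000 academic words [7]

def clean_and_expand(tokens, topn=3):
    bag = []
    for t in tokens:
        t = t.lower()
        if t in stop or not t.isalpha():
            continue
        if t not in vocab:                   # replace a misspelling by the nearest dictionary word
            close = difflib.get_close_matches(t, vocab, n=1)  # simple but slow nearest-word lookup
            if not close:
                continue
            t = close[0]
        if t in academic:                    # keep only academic words
            bag.append(t)
            if t in w2v:                     # extend with distributionally similar words
                bag.extend(w for w, _ in w2v.most_similar(t, topn=topn))
    return bag
```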

As explained in Sects. 3.2 and 3.3, we extract metadata for both the videos and the Wikipedia pages. The metadata in both cases are also matched against the academic dictionary to remove non-academic words. As with the content of the videos and webpages, the bags of words for the metadata are extended by Word2vec: for each word in the metadata of a video or wikipage, we find the vector of similar words using Word2vec with the set of academic words as the dictionary. The union of the vectors returned by Word2vec forms the bag of words for the metadata.

3.5 Topic Modeling

After the transcripts are processed into a bag of words for each video, the topics of the videos are generated using LDA. LDA posits that each word in a document is generated by a topic and that each document is a mixture of a finite number of topics, where each topic is a multinomial distribution over words. Among the outputs of LDA, we consider \(Z = \{P(z_{i})\}\), the probability distribution of the topics inside a document. These topics, along with their probability distribution, are then stored on the server indexed by the corresponding video-ids.
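A minimal sketch with gensim's LDA implementation is shown below; the paper does not state which LDA implementation it uses, and the choice of num_topics=10 follows the finding reported in Sect. 4.2.

```python
# Sketch: per-document topic distributions Z via gensim's LDA.
from gensim import corpora
from gensim.models import LdaModel

def topic_distributions(docs, num_topics=10):
    """docs: list of token lists (processed transcripts); returns (Z, model)."""
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
    # minimum_probability=0.0 keeps the full distribution over all topics
    Z = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
    return Z, lda
```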

3.6 Definition of Similarity Matching

In probability theory and information theory, the Kullback-Leibler divergence (KL Div) is a measure of the difference between two probability distributions P and Q [10]. Since KL Div is not symmetric, we calculate the KL Div of P from Q and of Q from P and take the average of the two. The lower the KL Div between two distributions, the closer they are semantically; hence the similarity between a video and a webpage is calculated as the inverse of the averaged KL Div (\(D_{KL}\)). Let \(Z_{LDA}(Blog_{wiki}) = \{ P(z_{i})\}\), where \(z_{i}\) are the latent topics returned by LDA for the wikipage \(Blog_{wiki}\), and \(Z_{LDA}(Trans_{v^{m}}) = \{ P(z_{k})\}\), where \(z_{k}\) are the latent topics present in the transcript \(Trans_{v^{m}}\) of video \(v^{m}\). We define the similarity measure in Videopedia as:

$$\begin{aligned} \begin{aligned} Sim_{\textit{Videopedia}}(Blog_{wiki},Trans_{v^{m}}) =[\frac{1}{2}(D_{KL}(Z_{LDA}(Blog_{wiki})\Vert Z_{LDA}(Trans_{v^{m}})) \\ + D_{KL}(Z_{LDA}(Trans_{v^{m}}) \Vert Z_{LDA}(Blog_{wiki})))]^{-1} \end{aligned} \end{aligned}$$
(1)
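Equation (1) translates directly into code. The sketch below uses scipy's entropy, which computes \(D_{KL}(p \Vert q)\) when given two distributions, and adds a small smoothing constant to guard against zero probabilities, a detail the paper does not discuss.

```python
# Sketch of Eq. (1): inverse of the symmetrised KL divergence.
import numpy as np
from scipy.stats import entropy

def sim_videopedia(z_page, z_video, eps=1e-12):
    p = np.asarray(z_page, dtype=float) + eps     # smoothing against zero probabilities
    q = np.asarray(z_video, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    avg_kl = 0.5 * (entropy(p, q) + entropy(q, p))  # average of D_KL(p||q) and D_KL(q||p)
    return 1.0 / avg_kl                             # smaller divergence => higher similarity

# Usage: sim = sim_videopedia([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
```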


3.7 Algorithm for Video Recommendation

The algorithm used to recommend the videos is given in Algorithm 1. As described in the procedure PrepareVideoTopicModels, our system extracts the transcripts of the videos, removes the stop-words, and checks the remaining words against an academic dictionary to eliminate spelling mistakes and extract the academic words. This facilitates the extraction of topics related to academia. The topic distribution of each video is then obtained by running LDA on its transcript. The metadata of the videos are extracted as described in Sect. 3.2.

When a request is received from a website, the text of the webpage is retrieved using the provided URL and relevant videos are recommended as described in the procedure RecommendVideo. The actual matching between the webpage and the videos is divided into two stages. In the filtering stage, the metadata is compared in the vector space model by cosine similarity, which removes most non-relevant videos: the bag of words for the metadata of each video is compared against that of the incoming webpage, as illustrated in the procedure MatchMeta. The top 10 % of the videos with the maximum similarity are then selected for topic-based matching. Let the content of the webpage \(W_{j}\) after checking against a dictionary be \(C_{j}^{'}\). LDA returns a set of topics \(l_{C_{j}^{'}}\) along with their probability distribution; similarly, the transcript \(t_{i}^{'}\) of a video \(v_{i}\) has a set of topics \(l_{t_{i}^{'}}\) with their distribution. To find the video closest to \(C_{j}^{'}\), we calculate the KL Div between \(l_{C_{j}^{'}}\) and the \(l_{t_{i}^{'}}\) of each video \(v_{i}\) in the set \(V^{'}\) returned by the procedure MatchMeta. Equation 1 is used to calculate the similarity, as described in the procedure MatchTopics; a sketch of the two-stage procedure is given after Algorithm 1.

Algorithm 1. Video recommendation in Videopedia (procedures PrepareVideoTopicModels, RecommendVideo, MatchMeta and MatchTopics)
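Below is a compact Python sketch of the two-stage matching, assuming precomputed tf-idf metadata vectors and LDA topic distributions. The structure echoes the procedures of Algorithm 1, but the code is illustrative, not the published implementation; the sim argument is the function implementing Eq. (1).

```python
# Sketch of the two-stage recommendation: metadata filter, then topic matching.
import numpy as np

def recommend(page_meta_vec, page_topics, video_meta_vecs, video_topics, sim, top_n=5):
    """page_meta_vec / video_meta_vecs: tf-idf vectors of metadata bags of words;
    page_topics / video_topics: LDA topic distributions; sim: similarity of Eq. (1)."""
    # Stage 1 (MatchMeta): cosine similarity on metadata, keep the top 10 %
    V = np.asarray(video_meta_vecs, dtype=float)
    p = np.asarray(page_meta_vec, dtype=float)
    cos = V @ p / (np.linalg.norm(V, axis=1) * np.linalg.norm(p) + 1e-12)
    keep = np.argsort(cos)[::-1][: max(1, len(cos) // 10)]
    # Stage 2 (MatchTopics): Eq. (1) on the surviving candidates
    scores = [(i, sim(page_topics, video_topics[i])) for i in keep]
    scores.sort(key=lambda t: t[1], reverse=True)
    return [i for i, _ in scores[:top_n]]      # indices of the recommended videos
```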

4 Experimental Results

The videos used in this paper were extracted from the repository of publicly available videos on YouTube and Videolectures.net, and the webpages used as input to Videopedia were Wikipedia pages. In this section we evaluate our recommendation system against three baselines; the baselines are defined in Sect. 4.1 and the results are presented in Sect. 4.2.

4.1 Baselines Used for Comparison

Topic modeling methods find latent topics in a collection of documents through probabilistic analysis. However, to find the semantic similarity between documents, we need to find the similarity between their topics. There are broadly two categories of similarity measures in the existing literature: (1) count similarity measures [11], and (2) probabilistic similarity models [10].

In this paper, we choose one widely accepted technique from each category. Cosine similarity represents the count similarity measures and KL divergence represents the probabilistic similarity models. To compare the efficacy of our system, we use the VSM model [18], which uses cosine similarity, and PLSA [9], another probabilistic topic model; after generating the topics through PLSA we use KL divergence to find the similarity. Since Videopedia first prunes the set of videos based on metadata before applying LDA, we also compare our algorithm against computing the similarity between the topics generated by LDA without the metadata matching. The following subsections give the definitions necessary for the similarity measures and the topic models.

Vector Space Model. Vector space models represent text documents as vectors of identifiers [18], often the term frequency-inverse document frequency (tf-idf) weights of the words in the document. Cosine similarity, the cosine of the angle between two vectors, measures how close two such vectors are. Let the Wikipedia page be denoted by \(Blog_{wiki}\) and the transcript of the \(m^{th}\) video \(v^{m}\) by \(Trans_{v^{m}}\), where \(m = 1,\ldots ,n\) and n is the number of videos in the repository. For this baseline we represent both documents as tf-idf vectors over a common vocabulary of k terms: \(v_{wiki} = (v_{wiki_{1}}, v_{wiki_{2}}, \ldots , v_{wiki_{k}})\) for the wikipage and \(v_{vid}^{m} = (v_{vid_{1}}^{m}, v_{vid_{2}}^{m}, \ldots , v_{vid_{k}}^{m})\) for the transcript of \(v^{m}\). The similarity between the Wikipedia page and a video is then calculated in the VSM by the following formula:

$$\begin{aligned} Sim_{VSM}(Blog_{wiki},Trans_{v^{m}}) = {v_{wiki}\cdot v_{vid}^{m} \over \Vert v_{wiki}\Vert \, \Vert v_{vid}^{m}\Vert } = \frac{\sum \limits _{i=1}^{k}v_{{wiki}_i} \times v_{vid_i}^{m}}{ \sqrt{\sum \limits _{i=1}^{k}{(v_{{wiki}_i})^2}} \times \sqrt{\sum \limits _{i=1}^{k}{(v_{vid_i}^{m})^2}} } \end{aligned}$$
(2)
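This baseline can be sketched with scikit-learn's tf-idf vectorizer and cosine similarity; the original implementation details are not given in the paper.

```python
# Sketch of the VSM baseline (Eq. 2) with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sim_vsm(wiki_text, transcripts):
    """Cosine similarity between a wikipage and each video transcript."""
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform([wiki_text] + list(transcripts))  # shared vocabulary
    return cosine_similarity(X[0], X[1:]).ravel()             # one score per video
```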

PLSA. Probabilistic Latent Semantic Analysis (PLSA) is a technique to statistically analyze the co-occurrence of words and documents [9]. Let \(Z_{PLSA}(Blog_{wiki}) = \{ P(z_{i})\}\), where \(z_{i}\) are the latent topics present in \(Blog_{wiki}\) and \(Z_{PLSA}(Trans_{v^{m}}) = \{ P(z_{k})\}\), where \(z_{k}\) are the latent topics present in \(Trans_{v^{m}}\). Using the probability distribution of the topics in a Wikipedia page and that in the transcripts of the videos, we find the KL Div between them and subsequently the similarity with the following formula:

$$\begin{aligned} \begin{aligned} Sim_{PLSA}(Blog_{wiki}, Trans_{v^{m}}) =[\frac{1}{2}(D_{KL}(Z_{PLSA}(Blog_{wiki})\Vert Z_{PLSA}(Trans_{v^{m}})) \\ + D_{KL}(Z_{PLSA}(Trans_{v^{m}}) \Vert Z_{PLSA}(Blog_{wiki})))]^{-1} \end{aligned} \end{aligned}$$
(3)

We calculate \(Sim_{PLSA}\) of \(Blog_{wiki}\) with each of the n videos in the repository using the formula above; the video with the highest value is recommended as the relevant video for the wikipage.

4.2 Evaluation of Recommendation

We use Precision, Recall and F-measure to evaluate the recommendation by our system. The definitions for them are provided below:

$$\begin{aligned} \text {Precision} = \frac{|\text {Relevant Videos} \cap \text {Retrieved Videos}|}{|\text {Retrieved Videos}|} \end{aligned}$$
$$\begin{aligned} \text {Recall} = \frac{|\text {Relevant Videos} \cap \text {Retrieved Videos}|}{|\text {Relevant Videos}|} \end{aligned}$$
$$\begin{aligned} \text {F-measure} = 2\times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
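These three measures transcribe directly into code; the sketch below treats the relevant and retrieved videos as sets of video ids.

```python
# Precision, Recall and F-measure as set operations.
def precision_recall_f(relevant: set, retrieved: set):
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# Usage: precision_recall_f({"v1", "v2", "v3"}, {"v2", "v4"}) -> (0.5, 0.333..., 0.4)
```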

We tested the system by recommending videos for 1,000 Wikipedia pages categorized under the same eight subjects as the e-learning videos. The distribution of the videos across the subjects is given in Fig. 2a and the distribution of the subjects among the wikipages used for testing the system is shown in Fig. 2b. We created the ground truth of whether a video is relevant by checking if the wiki-category of the input Wikipedia page matches the subject in the metadata of the video. Hence the number of relevant videos for a particular webpage is often more than 1, and thus |Relevant Videos| may be greater than |Retrieved Videos|. The Precision, Recall and F-measure values for Videopedia and the three baselines VSM, PLSA and LDA are presented in Table 2. We then measure the effectiveness of our method when recommending more than one video per webpage: the columns @n report Precision (Table 2a), Recall (Table 2b) and F-measure (Table 2c) when we recommend n videos for a webpage, where n = 1, 5, 10. When Videopedia recommends more than one video for a particular webpage, we use the mean average precision (mAP), defined below, to measure precision.

Fig. 2. Distribution of videos and webpages according to subjects

$$\begin{aligned} {\text {mAP}} = \frac{\sum \limits _{i=1}^{n} AveragePrecision_{i}}{n} \end{aligned}$$

where n represents the total number of webpages for which videos are retrieved and \(AveragePrecision_{i}\) is the average precision of the video retrieval for the \(i^{th}\) webpage.
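A sketch of mAP follows, assuming the average precision per webpage is computed over the ranked recommendation list in the usual way (precision accumulated at each relevant hit); the paper does not spell out this per-page computation.

```python
# Sketch of mean average precision over a set of test webpages.
def average_precision(relevant: set, ranked: list) -> float:
    hits, score = 0, 0.0
    for rank, vid in enumerate(ranked, start=1):
        if vid in relevant:
            hits += 1
            score += hits / rank          # precision at each relevant hit
    return score / hits if hits else 0.0

def mean_average_precision(per_page) -> float:
    """per_page: iterable of (relevant_set, ranked_list) pairs, one per webpage."""
    aps = [average_precision(rel, ranked) for rel, ranked in per_page]
    return sum(aps) / len(aps) if aps else 0.0
```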

Table 2. Videopedia compared with VSM, PLSA and LDA

As is evident from the results, Videopedia performs much better than the previous methods by using a refining stage based on metadata. To decide the percentage of videos to select after matching the metadata of the incoming webpage against the videos in the repository, we calculate the precision of Videopedia's recommendations for different selection percentages; this is reported in Fig. 3b. From the precision values for each of the three recommendation settings, we find that Videopedia performs best when the top 10 % of the videos in the repository are selected after matching the metadata. We also perform a comparative study to select the optimal number of topics mined from the transcripts by LDA and report the findings in Fig. 3a.

From the results reported in Figs. 3a and b, we find that our algorithm in Videopedia performs best when it selects 10 % of the videos after matching the metadata of the incoming wikipage with that of the videos. The results also show that the best precision is obtained when LDA returns 10 topics from the content of the webpages and videos, which also provides a more comprehensive topic modeling of the documents.

Fig. 3. Comparison of various methods

5 Conclusions

We designed and demonstrated a novel technique in Videopedia that integrates multiple media formats, making educational blogs and Wikipedia pages more illustrative with multimedia content such as videos. The main contribution of this work is to automatically recommend relevant educational videos for blogs like Wikipedia. We effectively used topic modeling on video transcripts obtained automatically from various video sharing platforms, using them as a representation of the video content. Our promising results show that topic modeling is an effective basis for video recommendation. This integrated framework reduces users' effort in searching for relevant e-learning videos and provides a mechanism for content-based search. We aim to scale our system for global use in academia and industry in the near future. Although we focused on educational blogs in an e-learning setup, the technique can be extended to other blogs as well.