
1 Introduction

The internet is an ever-growing medium whose user base grows daily, driven by falling broadband and data prices and by increasing familiarity with its use, which has extended its reach to even the remotest places on the globe. As a result, the volume of content on the World Wide Web has skyrocketed. Video is one of the easiest ways to absorb information, since it combines pictures and sound in a manner that eases human comprehension. The sheer number of potentially interesting videos, however, makes it difficult for a user to choose one suited to their taste. Hence, a video recommender system is essential for providing users with relevant videos based on their interest in the content they consume.

A recommender system is an algorithm that suggests relevant content matching users' diverse tastes by taking into account their past interests and habits. It filters content to the liking of the user, helping them decide when faced with a large variety of choices, as on the World Wide Web. A video recommender system makes use of prior data on the user's interests, such as favourite channels, favourite topics within a channel, video length, the duration of a video actually watched, and the video chosen after the previous one finishes. This reduces the time the user spends searching separately for videos of interest by instantly suggesting related videos based on the content currently being watched.

Motivation: The meteoric rise in the number of World Wide Web users has led to an explosion in the amount of data on the web, making it difficult for a user to find content matching their tastes. On platforms such as YouTube, this translates directly into users being unable to find videos that suit their interests. There is thus a need for recommendation strategies tailored to the current structure of the web. Existing strategies do not work particularly well with this structure, which lies in transition between Web 2.0 and the Semantic Web 3.0. Consequently, a semantically inclined approach to recommending content from the World Wide Web is required, and knowledge representation and reasoning schemes are needed to keep the recommendation computationally tractable.

Contribution: A semantically compliant video recommendation approach, KTSVidRec, is proposed. It uses knowledge representation and reasoning schemes to handle the highly linked, hierarchical data present in the current structure of the World Wide Web, combining Structural Topic Modelling (STM) for entity population, XGBoost classification, and semantic similarity computed using Jaccard similarity, Kullback-Leibler (KL) Divergence, Cosine similarity and Normalized Pointwise Mutual Information (NPMI). First, the query terms are pre-processed to obtain the query words. These query words are subjected to STM and to entity aggregation from knowledge bases such as Freebase and from other popular video repositories. Categories are obtained from the YouTube trending statistics dataset and enriched using the Linked Open Data (LOD) cloud. The dataset is then classified with XGBoost using the query terms obtained earlier, and the top 50% of the classified videos are retained. Semantic similarity is computed between the enriched categories and the entities populated by structural topic modelling, and the resulting categories are stored in a HashSet. Finally, the semantic similarity between the categories in the HashSet and the classified video categories is computed, the videos are arranged in increasing order of semantic similarity, and the recommendations are presented to the user.

Organization: The remainder of this paper is structured as follows. Related work pertaining to the proposed approach is presented in Sect. 2. Section 3 describes the proposed architecture and explains KTSVidRec in detail. Section 4 elaborates on the implementation and on the evaluation of the performance of the proposed approach. The conclusions drawn from the proposed approach are presented in Sect. 5.

2 Related Works

Davidson et al. [1] put forward a system for recommending YouTube videos based on the user's activity on the site. The recommendations are evaluated using click-through rate (CTR) and recommendation coverage metrics, and an easy-to-use interface presents video thumbnails and descriptions together with a link explaining why a particular video was recommended. Deldjoo et al. [2] proposed a system that recommends videos based on stylistic features and visual feature sets extracted from the video itself. The model built on extracted visual features is pitted against conventional video recommendation systems that rely on features such as genre, and it exhibits better results with respect to relevance metrics such as recall. The system can be used standalone or in tandem with traditional content-based video recommendation systems. Chen et al. [3] put forward a personalized video recommendation system that uses tripartite graph propagation. In addition to the graph built from click-through data, further graphs capture the queries users raise and whether the videos associated with those queries appear in their recommendation lists. The tripartite graph takes these characteristics into account on a dataset of over 2,000 users, 23,000 queries and 55,000 videos. Lee et al. [4] propose a large-scale content-only video recommendation system in which a deep learning approach learns video embeddings, and similarity learning based on video content predicts the relationships between videos using only visual and audio signals on a large-scale dataset. The system is not fully dependent on video metadata and can recommend both recently relevant and previously uploaded videos. Tavakoli et al. [5] suggest a novel system for recommending educational videos to learners preparing for professional jobs. The model classifies and mines the text of job openings and the skills they require, then suggests relevant videos according to their quality using an open educational-video recommendation system. The approach was evaluated through interviews gauging the usefulness of the videos; of more than 250 recommended videos, 83% were found useful by the users. Soni et al. [6] put forth a video recommendation system that uses facial emotions rather than the former user statistics employed by conventional systems. The model addresses two questions: predicting user ratings from facial data, and predicting how likely the user is to recommend the video to another. The classifier that interprets the facial information takes into account facial expressions and skin pulse signals, and the influence of growing data size is also addressed. Belarbi et al. [7] present a system for recommending educational videos in accordance with a user's interests when enrolling in a small-scale online course, finding other learners with the same interests in order to recommend videos likely to be relevant to them.
The model analyses the user's click stream and builds a user profile to filter by taste; users from similar demographics are grouped together using the K-Means clustering algorithm. Liu et al. [8] suggest a system for recommending videos based on the tags in the description that associate a video with a specific category. A graph-based framework incorporating neural networks combines parameters such as the tags, the users, the video itself and the media source, and a neighbour-similarity loss is used to encode user preferences into the diverse node representations. Online and offline experiments are carried out on the prominent social media network WeChat. Covington et al. [9] propose a video recommendation system in which neural networks are used for neural ranking. The model follows a two-step approach: candidate generation followed by a model that ranks the candidates. Insights into the design and maintenance of a large-scale recommendation system are also presented. In [10,11,12,13,14,15,16,17,18,19,20] several ontological models and semantic approaches that support the proposed framework are discussed.

3 Proposed System Architecture

Figure 1 shows the architecture diagram for entity population, entity aggregation, category enrichment and classification in the proposed knowledge-driven framework for video recommendation, KTSVidRec. The model is semantically compliant, as it incorporates semantic similarity strategies, and it also includes machine learning classification schemes such as XGBoost; as a result, it is a form of semantics-infused machine learning.

Phase 1 of the proposed architecture consists of query pre-processing. First, the user query and the web usage data are fed as input to the framework and subjected to pre-processing, which includes tokenization, lemmatization, stop word removal and named entity recognition. Tokenization yields individual terms from the queries and the web usage data. Lemmatization derives the base form of each word from its inflectional form. Stop words such as "of", "and" and "they" are eliminated, as they increase noise and do not contribute to aggregating the terminologies. Named entity recognition indicates which domains and categories the query terms and the web usage data belong to. The web usage data is also normalized in certain cases depending on the available metadata; in particular, URL normalization is applied when URLs are included. Ultimately, the terms from the queries and the web usage data are obtained.
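
As a minimal illustrative sketch of this pre-processing stage (not the authors' actual implementation), the snippet below uses the spaCy library; the sample query and the en_core_web_sm model are assumptions made purely for illustration.

```python
# Illustrative sketch of Phase 1 pre-processing; spaCy and the sample query are
# assumptions, since the paper does not prescribe a specific NLP toolkit.
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer, lemmatizer, stop-word flags, NER

def preprocess(query: str):
    doc = nlp(query)
    # Tokenization, stop-word removal and lemmatization of the remaining tokens.
    terms = [tok.lemma_.lower() for tok in doc
             if not tok.is_stop and not tok.is_punct and tok.text.strip()]
    # Named entity recognition hints at the domains/categories of the query terms.
    entities = {ent.text: ent.label_ for ent in doc.ents}
    return terms, entities

terms, entities = preprocess("SpaceX Falcon 9 launch highlights on YouTube")
print(terms)     # lemmatized, noise-free query terms
print(entities)  # e.g. {'SpaceX': 'ORG', ...} depending on the model
```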

Fig. 1. KTSVidRec architecture for entity population, entity aggregation, video category enrichment and classification

Phase 2 of the proposed architecture consists of entity population based on topic modelling, entity aggregation from knowledge bases and video repositories, obtaining and enriching video categories, and classification using XGBoost.

The entity population is based on topic modelling, in order to increase the mapping of query terms and web usage data terms onto categories that exist in the real-world space. Structural Topic Modelling (STM) is used; STM resembles other forms of topic modelling but specialises in using document metadata to improve the allocation of the latent topics of the corpus. Entity aggregation is then performed from Freebase, a collaborative knowledge base composed mainly of user-provided data, which is preferred because it is a real-world knowledge source for a wide range of terminologies. Entity aggregation is also done from popular video repositories such as Netflix, Amazon Prime, MX Player, Vimeo and Dailymotion. These repositories are scraped using Beautiful Soup, an HTML-parsing Python package, and crawled to extract metadata. The resulting categories are placed in a localized repository and enriched.
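
A hedged sketch of the repository crawling step is shown below using requests and Beautiful Soup; the URL and the HTML classes are hypothetical placeholders, since each repository's real markup differs and is not described in the paper.

```python
# Hypothetical metadata scraping sketch; the URL and CSS classes are placeholders.
import requests
from bs4 import BeautifulSoup

def extract_video_metadata(page_url: str):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # Assume each listed video is a container exposing a title and a category label.
    for card in soup.find_all("div", class_="video-card"):
        title = card.find("h3")
        category = card.find("span", class_="category")
        records.append({
            "title": title.get_text(strip=True) if title else None,
            "category": category.get_text(strip=True) if category else None,
        })
    return records

# Example call (placeholder URL); the extracted categories feed the localized repository.
# metadata = extract_video_metadata("https://example.com/trending")
```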

From the YouTube trending video statistics dataset, which is a categorical dataset, the categories of the videos are obtained. These categories are enriched through the LOD cloud API; the LOD cloud stores a graph of real-world entities, and the subgraphs relevant to the dataset's categories are loaded and used for enrichment. The dataset is then classified using the XGBoost classifier, with the classification categories based on the initial query terms and the unique terms obtained from the web usage data. Under each class, the top 50% of the classified videos are obtained and their categories are loaded. Only the top 50% is retained because relevance is of utmost importance here; diversity in the results is also important, and it is taken care of through entity aggregation, topic modelling and the enrichment of categories from heterogeneous sources.
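
The classification and top-50% selection could look roughly like the sketch below; the toy video records, the TF-IDF features and the confidence-based cut-off are assumptions, and the XGBoost scikit-learn wrapper is used.

```python
# Sketch of XGBoost classification and per-class top-50% selection (toy data).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Hypothetical frame: one row per video with its text fields and category label.
videos = pd.DataFrame({
    "text": ["rocket launch recap", "guitar lesson basics",
             "space station tour", "live concert vlog"],
    "category_id": [0, 1, 0, 1],
})

X = TfidfVectorizer().fit_transform(videos["text"])
clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X, videos["category_id"])

# Confidence of each video for its predicted class.
proba = clf.predict_proba(X)
videos["pred"] = proba.argmax(axis=1)
videos["confidence"] = proba.max(axis=1)

# Keep the top 50% most confidently classified videos within each predicted class.
top_half = (videos.sort_values("confidence", ascending=False)
                  .groupby("pred", group_keys=False)
                  .apply(lambda g: g.head(max(1, len(g) // 2))))
print(top_half[["text", "pred", "confidence"]])
```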

Fig. 2. KTSVidRec architecture for computation of semantic similarity, mapping of categories and recommendation of video to the user.

Phase 3 of the proposed approach's architecture, illustrated in Fig. 2, carries out the computation of semantic similarity, the mapping of categories, the arrangement of the videos in increasing order of similarity and the recommendation of videos to the user.

The semantic similarity is computed between the enriched categories obtained from the dataset, the entities aggregated from Freebase and the popular video repositories, and the topic-modelled query terms and web usage data terms obtained in phases 1 and 2. Categories whose semantic similarity exceeds 0.75 are retained, and the video categories are then mapped. The semantic similarity is computed using the KL Divergence, Jaccard Similarity, Cosine Similarity and NPMI measures.

The KL Divergence is an estimate of how one probability distribution differs from another probability distribution taken as the reference. In simple terms it is known as a measure of surprise, and it is employed in diverse fields such as fluid mechanics, applied statistics and bioinformatics. Equation (1) gives the KL Divergence, where P and Q are the probability distributions.

$${D}_{\mathrm{KL}}(P\parallel Q)=\sum\nolimits_{x\in \mathcal{X}} P(x)\mathrm{log}\left(\frac{P(x)}{Q(x)}\right)$$
(1)

The Jaccard index estimates the similarity between two given finite sets. The Jaccard similarity coefficient is given in Eq. (2), where J(A, B) is the Jaccard similarity of the two given sets A and B.

$$J\left(A,B\right)=\frac{\left|A\cap B\right|}{\left|A\cup B\right|}=\frac{\left|A\cap B\right|}{\left|A\right|+\left|B\right|-\left|A\cap B\right|}$$
(2)

Cosine Similarity evaluates the similarity of two vectors in an inner product space. It is defined as the cosine of the angle between the two vectors, measures whether they point in a similar direction, and is widely used to gauge the similarity of documents. The Cosine Similarity is given in Eq. (3), where G and L are the two vectors being compared.

$$\text{Similarity }=\mathrm{cos}\left(\theta \right)=\frac{\mathbf{G}\cdot \mathbf{L}}{\parallel \mathbf{G}\parallel \parallel \mathbf{L}\parallel }=\frac{\sum_{i=1}^{n} {G}_{i}{L}_{i}}{\sqrt{\sum_{i=1}^{n} {G}_{i}^{2}}\sqrt{\sum_{i=1}^{n} {L}_{i}^{2}}}$$
(3)

Pointwise mutual information (PMI) measures the association between two events a and b; rather than averaging over all events as mutual information does, PMI considers individual events. NPMI normalizes this value to the range −1 to 1, where −1 means the events never occur together, 0 means they are independent and 1 means they always occur together. Equation (4) gives the NPMI, where h(a, b) is the joint self-information.

$$\mathrm{NPMI}\left(a;b\right)=\frac{\mathrm{PMI}\left(a;b\right)}{h\left(a,b\right)}$$
(4)
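
As a consolidated, hedged sketch of Eqs. (1)–(4), the snippet below implements the four measures with NumPy on toy inputs; the example distributions, sets and co-occurrence probabilities are illustrative assumptions only.

```python
# Illustrative implementations of Eqs. (1)-(4); all inputs are toy examples.
import numpy as np

def kl_divergence(p, q):
    # Eq. (1): sum_x P(x) * log(P(x)/Q(x)); assumes p and q are valid distributions.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jaccard(a: set, b: set):
    # Eq. (2): |A intersection B| / |A union B|
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cosine(g, l):
    # Eq. (3): (G . L) / (||G|| ||L||)
    g, l = np.asarray(g, dtype=float), np.asarray(l, dtype=float)
    return float(np.dot(g, l) / (np.linalg.norm(g) * np.linalg.norm(l)))

def npmi(p_ab, p_a, p_b):
    # Eq. (4): PMI(a;b) / h(a,b), with joint self-information h(a,b) = -log P(a,b)
    pmi = np.log(p_ab / (p_a * p_b))
    return float(pmi / -np.log(p_ab))

print(kl_divergence([0.6, 0.4], [0.5, 0.5]))          # small divergence
print(jaccard({"music", "live"}, {"music", "vlog"}))  # 1/3
print(cosine([1, 0, 1], [1, 1, 0]))                   # 0.5
print(npmi(0.1, 0.25, 0.2))                           # positive association, ~0.30
```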

From the mapped categories, a HashSet of similar categories is formed. The semantic similarity between the video categories present in the HashSet and the top 50% of the classified video categories is then computed against a threshold of 0.75. The videos are arranged in increasing order of semantic similarity and finally recommended to the user.
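
A minimal sketch of this final mapping and ordering step is given below; the token-overlap similarity, the example categories and the 0.75 threshold applied to them are assumptions, with a Python set standing in for the HashSet.

```python
# Hedged sketch of the HashSet mapping and ordering step; sample data is illustrative.
def token_similarity(cat_a: str, cat_b: str) -> float:
    a, b = set(cat_a.lower().split()), set(cat_b.lower().split())
    return len(a & b) / len(a | b)

# HashSet (Python set) of categories mapped in the previous step.
mapped_categories = {"science and technology", "space exploration"}

# Top 50% of classified videos with their categories (hypothetical records).
top_videos = [
    {"title": "Falcon 9 booster landing", "category": "space exploration"},
    {"title": "Guitar chords for beginners", "category": "music education"},
    {"title": "James Webb deep field", "category": "science and technology"},
]

THRESHOLD = 0.75
scored = []
for video in top_videos:
    score = max(token_similarity(video["category"], c) for c in mapped_categories)
    if score > THRESHOLD:
        scored.append((score, video["title"]))

# Arrange in increasing order of semantic similarity, as in the proposed approach.
for score, title in sorted(scored):
    print(f"{score:.2f}  {title}")
```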

4 Implementation and Performance Evaluation

The implementation of the proposed semantically compliant video recommendation framework was done in a Google Colab notebook for Python, on a machine with a 4.0 GHz i5 processor, 16 GB of RAM and 50 GB of hard disk storage. The R script is integrated into Python through a single subprocess call, which allows the STM library of R to be used from Python. The Freebase API is used to access the Freebase data dump for entity aggregation from the Freebase knowledge base. Entity aggregation from popular video repositories such as Netflix, Amazon Prime, Vimeo and MX Player was performed using Beautiful Soup, a Python HTML parser, crawling the top trending videos on these platforms. The XGBoost machine learning algorithm was implemented through its scikit-learn-compatible interface, and the HashSet was realised using Python's collection framework. Table 1 presents the algorithm for the proposed KTSVidRec.
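
The single subprocess call integrating the R STM script could look roughly as follows; the script name stm_topics.R, its command-line arguments and its CSV output are hypothetical placeholders.

```python
# Hedged sketch of invoking an R STM script from Python with one subprocess call.
# "stm_topics.R", its arguments and "topics.csv" are assumed for illustration only.
import csv
import subprocess

result = subprocess.run(
    ["Rscript", "stm_topics.R", "--input", "query_terms.csv", "--output", "topics.csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # diagnostics printed by the R script, if any

# Read back the topic/entity assignments produced by the STM run.
with open("topics.csv", newline="") as fh:
    topics = list(csv.DictReader(fh))
print(topics[:3])
```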

Table 1. Algorithm for the proposed KTSVidRec

The dataset used for the model is the Trending YouTube Video Statistics dataset. It incorporates metrics such as views, shares, likes and comments, collected over several months for trending videos across a wide demography that includes countries such as the USA, Great Britain, India and Germany, with up to roughly 200 trending videos listed per day; the different regions are stored in separate files. The dataset parameters used in the evaluation include the video title, channel name, upload time, the tags in the description associating the video with a particular domain, the numbers of views, likes and dislikes, the description text and the number of comments. The video categories were clustered by region. The experiment was carried out for 4047 queries, and the ground truth was either manually crawled or automatically incorporated from standard recommendation systems. The ground truths were further verified by 436 users who were well acquainted with the experimentation domain. Specifically, each user was given up to 45 queries and asked to provide the top 10 categories, as well as recommendations from YouTube and other video-sharing websites. The ground truth was compiled from the top categorizations such that at least five users received a similar set of queries. This was carried out over a period of eight months in two phases: 25 queries were given out in the first four months and 20 in the next four. The top recommendations for each query from the user recommendations were taken, and the top 20% were used to verify the ground truth.
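
As a hedged illustration of how one regional file of this dataset might be loaded and inspected, the snippet below uses pandas; the file name USvideos.csv and the column names follow the commonly distributed version of the dataset and should be treated as assumptions.

```python
# Illustrative loading of one regional file of the Trending YouTube Video Statistics
# dataset; the file name and column names are assumptions about the distributed CSVs.
import pandas as pd

us = pd.read_csv("USvideos.csv")

# Parameters listed for the evaluation: title, channel, upload time, tags,
# views, likes, dislikes, description and comment count.
columns = ["title", "channel_title", "publish_time", "tags",
           "views", "likes", "dislikes", "description", "comment_count"]
us = us[columns + ["category_id"]]

# Per-category counts and mean views, mirroring the region-wise grouping of categories.
print(us.groupby("category_id")["views"].agg(["count", "mean"]).head())
```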

The performance metrics of the proposed architecture must be evaluated, especially in the case of a recommendation system. Precision, Recall, Accuracy, F-Measure, False Discovery Rate (FDR) and Normalized Discounted Cumulative Gain (nDCG) were used as the evaluation metrics. FDR is the ratio of false positive results to the total number of positive test results, and nDCG estimates the diversity of the results obtained. Standard formulae are used to evaluate Average Precision, Average Recall, Accuracy Percentage, F-Measure, FDR and nDCG. The performance of the proposed KTSVidRec is evaluated and tabulated in Table 2, where it is compared with several baseline models and a benchmark approach; KTSVidRec clearly achieves better performance than all of them. The comparison is made with four baseline models, namely YVRS [1], CBVRS [2], PVRTGP [3] and LSCVR [4], all implemented in the same environment using the same user queries and web usage data, and with a benchmark Collaborative Filtering approach, a standard and widely used model in recommendation strategies.
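
A minimal sketch of how these standard metrics could be computed with scikit-learn is shown below; the toy labels and relevance scores are assumptions, and the FDR is derived as 1 − precision.

```python
# Hedged sketch of the evaluation metrics on toy labels, not the paper's actual data.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, ndcg_score,
                             precision_score, recall_score)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # ground-truth relevance of recommendations
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])  # system's relevant / not-relevant decisions

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
f_measure = f1_score(y_true, y_pred)
fdr = 1.0 - precision  # false discovery rate = FP / (FP + TP)

# nDCG over the graded relevance of a ranked recommendation list (toy scores).
true_relevance = np.array([[3, 2, 3, 0, 1, 2]])
predicted_scores = np.array([[2.4, 0.6, 3.0, 0.1, 1.2, 0.4]])
ndcg = ndcg_score(true_relevance, predicted_scores)

print(f"P={precision:.2f} R={recall:.2f} Acc={accuracy:.2f} "
      f"F={f_measure:.2f} FDR={fdr:.2f} nDCG={ndcg:.2f}")
```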

Table 2. Performance comparison for the proposed KTSVidRec with other approaches

It is evident from Table 2 that YVRS [1] has a Precision of 82.14%, Recall of 80.44%, Accuracy of 81.29%, F-Measure of 81.28%, FDR of 0.17 and nDCG of 0.81. The CBVRS [2] approach has a Precision of 83.48%, Recall of 81.14%, Accuracy of 82.31%, F-Measure of 82.29%, FDR of 0.16 and nDCG of 0.78. PVRTGP [3] furnishes a Precision of 86.54%, Recall of 83.66%, Accuracy of 85.1%, F-Measure of 85.07%, FDR of 0.13 and nDCG of 0.79. LSCVR [4] yields a Precision of 85.18%, Recall of 82.14%, Accuracy of 83.66%, F-Measure of 83.63%, FDR of 0.14 and nDCG of 0.81. The benchmark collaborative filtering approach gives a Precision of 81.16%, Recall of 78.14%, Accuracy of 79.65%, F-Measure of 79.62%, FDR of 0.18 and nDCG of 0.84. The proposed KTSVidRec yields the highest Precision of 94.87%, Recall of 91.18%, Accuracy of 93.02% and F-Measure of 92.98%, along with the lowest FDR of 0.05 and the highest nDCG of 0.96.

YVRS [1] uses a co-visitation-based graph traversal approach in which candidates are generated and MapReduce is used for further ranking and recommendation of the videos; user relevance is also an important element of this approach. The graph-based design makes the approach highly complicated. Because user relatedness is captured by a co-visitation graph over previously visited videos, relevance is not computed between the query and the video content; the main focus is on previous visits to a specific set of results, which yields a lower relevance score since it only reflects previous visitation counts. The nDCG values are very low because there is no factor that accounts for diversity in the results. CBVRS [2] mainly focuses on the stylistic features of the videos and incorporates KNN along with similarity measures. The approach is non-semantic and does not cope well when the dataset density increases; KNN is a fairly naïve method, the nDCG values are quite low, and the number of false positives becomes rather high. Stylistic feature extraction also requires a proper feature extraction strategy, so there is clear scope for improvement in such systems. PVRTGP [3] is based on tripartite graph traversal, where a pair of graphs connects users and videos. Most importantly, instead of a query-centric approach, the focus is user-centric, giving the user more importance than the query. While user feedback is valuable, driving recommendations from the user alone is a tedious task, and tripartite graph traversal adds further complexity; there is always scope for making such computationally expensive approaches more efficient. In LSCVR [4], only the video embeddings are learned with a neural network and similarity learning takes place. Since this is a deep learning approach, it is difficult to predict the kind of output the system will produce; although it yields distinct and acceptable results, it is a black-box approach in which the deep features and the learning algorithm do not contribute to explainable AI, so a better approach can be formulated. In the collaborative filtering approach, ratings are required and ratings can be biased, so such approaches can go awry. Hence, there is a need for a semantic approach that is knowledge-driven and query-centric but also takes user feedback into account, which is why the proposed KTSVidRec is formulated. KTSVidRec achieves the highest Precision, Recall and F-Measure because it is semantic and incorporates topic modelling using STM, so the density of the initial knowledge keeps increasing through entity population and entity aggregation from both Freebase and the crawled video repositories. XGBoost classification ensures that the approach performs inferencing along with learning, making it a semantically driven machine learning approach. KL Divergence is used to compute the divergence factor, and a combination of three similarity measures, Jaccard similarity, Cosine Similarity and NPMI, is used to compute semantic similarity, contributing to very high relevance and very low divergence; owing to the density of the knowledge supplied to the approach, its relevance is also high. As a result, the nDCG is naturally high.

Fig. 3. Precision Percentage vs. number of recommendations for the proposed KTSVidRec and the other baseline models

Figure 3 presents the precision percentage for numbers of recommendations ranging from 10 to 40 for KTSVidRec and the other baseline models. The precision percentage of KTSVidRec is higher than that of all the baseline models because the approach is semantic: it performs entity population through STM and follows a knowledge-base methodology through entity aggregation from Freebase and popular video repositories.

5 Conclusion

A novel, semantically inclined video recommendation framework, KTSVidRec, is proposed. It follows a query-centric approach to video recommendation based on user queries and web usage data, incorporates entity population using STM, and strengthens knowledge representation through entity aggregation from Freebase and several video repositories. The YouTube trending video statistics dataset is used; the top 50% of classified video categories are retained and the dataset's categories are further enriched through the LOD cloud. Semantic similarity is computed using KL Divergence, Jaccard Similarity, Cosine Similarity and NPMI. The proposed approach displays better performance than the other baseline models, mainly due to its semantics-driven design, entity population through STM and aggregation from a variety of knowledge sources. KTSVidRec attains a very low FDR of 0.05 and the highest nDCG value of 0.96, which makes it an efficient and semantically sound solution for recommending videos on the Web.