1 Introduction

With the rapid development of the Internet, data is growing at a very high pace, and many online movie platforms are flooded with new content every day. Recommendation systems have proved to be one of the successful kinds of information filtering system. In general, recommendation systems are used to predict how much a user may like a certain product or service, to compose a list of the N best items for a user, and to compose a list of N users for a product or service. With this growth of media, users have to spend a significant amount of time searching for the movies in which they are interested [1]. Here, the task of a recommendation system is to automatically suggest what movie to watch next based on the current movie, thus saving users the time spent searching for related content.

Movie recommendation is one of the most used applications on media streaming Web sites, and it has been studied extensively in both academic and commercial research. The Netflix Prize challenge [2] is one such example, where a prize of one million dollars was at stake. The aim of the competition was to beat Netflix’s own recommender system by ten percent. This attracted many researchers and companies, and more than forty thousand entries were submitted. Most of these recommenders use collaborative filtering, which has been developed over the past few years [3,4,5]. First, the ratings given to movies by each individual are collected, and then recommendations are made to a user based on people who showed similar taste in the past. Many popular online services such as netflix.com and youtube.com have used this collaborative filtering technique to suggest media to users.

This work is an attempt to implement a recommender system that uses genome tags to find similar movies and then applies a basic content-based filtering technique to further enhance the results. The MovieLens dataset [6] is used for the development of this recommender engine and is accessed through its public FTP interface [7].

2 Related Work

2.1 Recommendation Using Collaborative Filtering

A collaborative filtering recommender relies heavily on user data or user contributions in order to make recommendations. Contributions may include ratings, likes, dislikes, or other kinds of feedback that can be used to cluster similar users together. As the name suggests, it is a way of making recommendations for a user in “collaboration” with other users. The fundamental idea of this filtering is that if a person A shares the same interest as a person B in a certain object (here, a movie), then A is more likely to share B’s interest in a different object than the interest of a randomly chosen person [8]. Collaborative filtering is easy to implement, works well in most cases, and is able to find links between items that would otherwise be considered dissimilar [9]. One drawback is that this system suffers from the “cold start” problem, which occurs when a new user arrives or a new item is added and little information or feedback is available for that user or item [10, 19]. To overcome this problem, recommenders use content-based recommendation techniques coupled with collaborative recommendation.
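The idea that like-minded users predict each other's preferences can be illustrated with a minimal user-based sketch. It is written in Python for illustration only; the toy ratings, the choice of cosine similarity over co-rated movies, and the function names `similarity` and `predict` are assumptions and are not taken from the cited systems.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "A": {"m1": 5.0, "m2": 4.0, "m3": 1.0},
    "B": {"m1": 5.0, "m2": 4.0, "m3": 1.0, "m4": 5.0},
    "C": {"m1": 1.0, "m2": 2.0, "m3": 5.0, "m4": 1.0},
}

def similarity(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of the other users' ratings for `movie`."""
    pairs = [(similarity(user, other), r[movie])
             for other, r in ratings.items()
             if other != user and movie in r]
    total = sum(s for s, _ in pairs)
    return sum(s * x for s, x in pairs) / total if total else None

# User A has not rated m4; the estimate leans toward B, whose taste matches A's.
estimate = predict("A", "m4")
```

Because A agrees with B on every co-rated movie and disagrees with C, the estimate for m4 is pulled toward B's rating. The sketch also shows the cold start problem: a user with no ratings has no overlap with anyone, so `predict` has nothing to weight.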

2.2 Recommendation Using Content-Based Filtering

Here the recommendation is made purely on the attributes of the item, hence avoiding the “cold start” problem. The attributes of an item, for example a movie’s genre, year, running time, rating, and starring actors, can be used by the content-based recommender to make a movie recommendation. This concept has its roots in information retrieval theory, where document representation methods can abstractly encapsulate the features of an item for potential recommendation [11, 20].

Over time, a content-based recommender builds a profile for a person that captures the taste of that individual, which is extremely helpful for highly personalized recommendation [3]. This type of recommender does not require community data, as it relies solely on the individual’s preferences; hence an explanation can be given for why a certain item was recommended. One major disadvantage is that it requires content that can be broken down into meaningful attributes.

3 Proposed Approach

The dataset downloaded for this research contained 24404096 ratings and 668953 tag applications across 40110 movies. These data were created by 259137 users between January 09, 1995 and October 17, 2016, and the dataset was generated on October 18, 2016. The dataset contained the following files: “genome-scores.csv”, “genome-tags.csv”, “links.csv”, “movies.csv”, “ratings.csv”, and “tags.csv”. They are described as follows:

  • genome-scores.csv: Contains the genome score of the movies corresponding to the tags.

  • genome-tags.csv: Contains genome tag id and its corresponding string.

  • links.csv: Contains links to other sources of movie data.

  • movies.csv: Contains information about each movie, such as its title, movie id, and genres.

  • ratings.csv: Each row of this file contains the rating of one movie by one user.

  • tags.csv: Each row of this file represents one tag applied to one movie by one user.

3.1 The Genome Tags

Netflix and YouTube use a hybrid of collaborative and content-based filtering, a prominent feature of which is the genre of the video or movie. The major problem with genres is that they are binary in nature, i.e., they do not tell to what extent a genre applies to a given piece of content. A user may apply the tag ‘violent’ to ‘Fight Club’, indicating that it is a violent movie, but this does not indicate how violent the movie is. Just as the Human Genome Project identified and mapped all the genes in human DNA, researchers were inspired to find and index the building blocks of their media. Pandora developed its own Music Genome Project [12]; similarly, for movie recommendation, the GroupLens research laboratory developed the tag genome [13]. The genome tag extends the traditional tagging system to provide enhanced user interaction. It records, for each item, its relationship to a set of tags. These relevance values range between 0 and 1, where 1 is the most relevant and 0 the least. This creates a dense matrix in which every movie in the genome has a value for every tag. This matrix can be used to recommend similar content.

3.2 Data Preparation

Since the genome scores do not cover all the movies present in the dataset, the first task was to select only those movies for which genome scores are available. After filtering, we were left with around 10,000 movies. Next, we transformed the genome scores from the layout shown in Table 1 to the layout shown in Table 2.
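The filtering and reshaping step can be sketched as follows. This is an illustrative Python sketch, not the authors' R code. It assumes the long layout is one (movieId, tagId, relevance) row per movie and tag pair and that the target layout is one row per movie with one column per tag; the sample values and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format rows: (movieId, tagId, relevance)
long_rows = [
    (1, 1, 0.025), (1, 2, 0.900), (1, 3, 0.100),
    (2, 1, 0.750), (2, 2, 0.050), (2, 3, 0.300),
]
# Movies present in the ratings data; movie 3 has no genome scores.
rated_movie_ids = {1, 2, 3}

# Group relevance scores by movie.
by_movie = defaultdict(dict)
for movie_id, tag_id, relevance in long_rows:
    by_movie[movie_id][tag_id] = relevance

# Fixed column order: one column per tag.
tag_ids = sorted({tag_id for _, tag_id, _ in long_rows})

# Keep only movies that have genome scores, then build one wide row per movie.
movie_ids = sorted(m for m in rated_movie_ids if m in by_movie)
wide = {m: [by_movie[m][t] for t in tag_ids] for m in movie_ids}
```

Each value of `wide` is then a fixed-length vector of tag relevances, which is the form needed for the correlation and cosine computations later in the section.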

Table 1 Before transformation

After the transformation of the genome scores, the average rating and the number of users who rated each movie were calculated. These values are used later when coupling our model with content-based filtering. The average rating for a particular movie i was calculated as the sum of the ratings given to that movie by all users divided by the total number of users who rated it, i.e.,

$$\begin{aligned} {\textit{avg}\_\textit{rating}_{i} = \dfrac{\Sigma \textit{user}\_\textit{rating}_{i}}{\Sigma \textit{user}_{i}}} \end{aligned}$$
(1)
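Equation (1) and the per-movie rating count can be computed in a single pass over the ratings. The Python sketch below is illustrative; the (userId, movieId, rating) row layout, the sample values, and the helper name `rating_stats` are assumptions.

```python
from collections import defaultdict

# Hypothetical (userId, movieId, rating) rows
ratings = [
    (1, 10, 4.0), (2, 10, 3.0), (3, 10, 5.0),
    (1, 20, 2.0), (2, 20, 3.0),
]

def rating_stats(rows):
    """Return {movieId: (average rating, number of ratings)}."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _user, movie, rating in rows:
        total[movie] += rating
        count[movie] += 1
    return {m: (total[m] / count[m], count[m]) for m in total}

stats = rating_stats(ratings)
```

Keeping the count next to the average matters because an average over very few ratings is unreliable, which is why both values are carried forward.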

3.3 Feature Reduction

We have a total of 1128 genome tags, which is a large number, and many of them are redundant; this increases the computational complexity. The number of features needs to be reduced, which will not only remove redundancy in the data but also improve the performance of the model [14]. Principal component analysis (PCA) was run on the complete genome score in order to examine the variance explained in the data by the available tags [15].
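One way to inspect how much variance each principal component explains is to take the eigenvalues of the covariance matrix of the movie-by-tag data. The Python sketch below assumes NumPy is available and uses synthetic data with deliberately duplicated columns; the function name `explained_variance_ratio` and the data are hypothetical and do not reproduce the authors' R analysis.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)             # columns are tags
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # sort eigenvalues descending
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negatives
    return eigvals / eigvals.sum()

# Synthetic data: 200 "movies", 4 "tags", where tags 3 and 4 are
# near-copies of tags 1 and 2 (i.e. redundant features).
rng = np.random.default_rng(0)
base = rng.random((200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.01 * rng.random(200),
    base[:, 1] + 0.01 * rng.random(200),
])
ratio = explained_variance_ratio(X)
```

With redundant columns, almost all of the variance concentrates in the first few components and the remaining ones contribute close to nothing, which is the pattern that motivates reducing the number of tags.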

Table 2 After transformation
Fig. 1 PCA on complete set of tags

As is quite clear from Fig. 1, most of the tags show very low variance in the data and can be removed. We used a correlation-based feature selection technique, for which the Pearson correlation method was preferred [16]. Pearson correlation measures the linear correlation between two variables X and Y. It yields a value between −1 and \(+\)1, where −1 is total negative correlation and \(+\)1 is total positive correlation. The correlation between all pairs of tags was calculated; for two tags to be considered correlated, the value should lie between 0 and \(+\)1. We had to choose an optimal threshold above which tags are considered correlated and below which they are not; we call this the cutoff value. PCA was run after setting the cutoff to 0.6, 0.5, 0.4, and 0.3 (Figs. 2 and 3).

Fig. 2 PCA after cutoff was 0.6

Fig. 3 PCA after cutoff was 0.5

Fig. 4 PCA after cutoff was 0.4

Fig. 5 PCA after cutoff was 0.3

Choosing a cutoff of 0.6 or 0.5 would still leave some redundant tags, whereas with a cutoff of 0.3 we might lose some important tags, so for this model the cutoff was set to 0.4. With this cutoff, the number of tags was dramatically reduced from 1128 to 275 (Figs. 4 and 5).
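A correlation-based selection with a cutoff can be sketched as below. The text does not state how one tag of a correlated pair is chosen for removal, so the greedy rule used here (keep a tag only if its Pearson correlation with every already kept tag is at most the cutoff) is an assumption, as are the sample tags and the function names.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def select_tags(columns, cutoff=0.4):
    """Greedy filter (assumed rule): drop a tag that correlates above
    `cutoff` with any tag that has already been kept.

    columns: dict mapping tag -> list of relevance scores, one per movie.
    """
    kept = []
    for tag, values in columns.items():
        if all(pearson(values, columns[k]) <= cutoff for k in kept):
            kept.append(tag)
    return kept

# Hypothetical relevance scores for four movies
columns = {
    "violent": [0.9, 0.8, 0.1, 0.2],
    "brutal":  [0.8, 0.9, 0.2, 0.1],  # nearly duplicates "violent"
    "funny":   [0.1, 0.3, 0.2, 0.9],
}
kept = select_tags(columns)
```

In this toy data "brutal" correlates strongly with "violent" and is dropped, while "funny" is negatively correlated with "violent" and is kept. A lower cutoff removes more tags, which matches the trade-off described above.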

3.4 Distance Between Movies

We are now left with 275 attributes and over 10,000 rows, which we can use to find out which movies are similar to one another. A vector space model approach was used for this task [17]. In the vector space model, we represent documents (or any object in general, here a movie) as vectors of identifiers. The relevance between these vectors can be found by comparing the angles between them [18]. In practice, the cosine of the angle between two vectors is calculated (Fig. 6).

$$\begin{aligned} \cos (\theta ) = \dfrac{a_{1} \cdot a_{2}}{\Vert a_{1} \Vert \Vert a_{2}\Vert } \end{aligned}$$
(2)

where \(a_1 \cdot a_2\) is the dot product and the norm of a vector \(a_1\) with n components \(a_{1,i}\) is

$$\begin{aligned} \Vert a_1\Vert = \sqrt{\sum _{i=1}^{n} a_{1,i}^{2}} \end{aligned}$$
(3)

In our dataset, every row can be treated as a vector, hence we can compute the cosine of the angle between each pair of rows. For demonstration purposes, we chose a sample of 2000 movies in this experiment. We obtained a 2000 × 2000 matrix M, where \(M_{ij}\) contains the cosine of the angle between \(\textit{movieId}_i\) and \(\textit{movieId}_j\) (Fig. 7).
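The pairwise cosine computation can be sketched in plain Python as follows. The three two-dimensional vectors stand in for movie rows and are hypothetical, as are the function names; a real run would use the 275-dimensional tag vectors.

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cosine_matrix(vectors):
    """Symmetric matrix M with M[i][j] = cosine(vectors[i], vectors[j])."""
    n = len(vectors)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # fill the upper triangle and mirror it
            M[i][j] = M[j][i] = cosine(vectors[i], vectors[j])
    return M

# Hypothetical tag-relevance vectors for three movies
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
M = cosine_matrix(vectors)
```

Since tag relevances are non-negative, every entry lies between 0 and 1: movies with no tags in common give 0 and movies with proportional tag profiles give 1. Exploiting the symmetry halves the number of cosine evaluations.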

Fig. 6 Cosine angle between movies

3.5 Recommending Movies

Once the cosine matrix has been obtained, the recommendation is fairly straightforward. Suppose a user is watching a movie with movieId n; we go to the nth row of the matrix and sort that row in decreasing order. The greater the cosine value, the smaller the angle; the smaller the angle, the closer the movies are in vector space; and the closer the movies, the more similar they are. We pick the top K results from this sorted row. Then, using content-based filtering on the average rating of the movie, we recommend the top N movies to the user (Fig. 8).
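The two-stage selection can be sketched as below. The exact form of the rating-based filter is not spelled out at this point in the text, so the sketch assumes the thresholds reported in the Results section (rating above 2.4 and at least 100 ratings) and keeps the survivors in cosine order. The function name, the parameter names, the use of row indices in place of movieIds, and the sample data are all hypothetical.

```python
def recommend(M, avg_rating, n_ratings, n, k=3, top_n=2,
              min_rating=2.4, min_votes=100):
    """Return up to `top_n` row indices similar to row `n`.

    Stage 1: take the K most similar movies by cosine value.
    Stage 2: drop candidates that fail the rating thresholds (assumed filter).
    """
    others = [j for j in range(len(M)) if j != n]  # exclude the movie itself
    top_k = sorted(others, key=lambda j: M[n][j], reverse=True)[:k]
    eligible = [j for j in top_k
                if avg_rating[j] > min_rating and n_ratings[j] >= min_votes]
    return eligible[:top_n]

# Hypothetical 5 x 5 cosine matrix and rating statistics
M = [
    [1.0, 0.9, 0.8, 0.2, 0.7],
    [0.9, 1.0, 0.6, 0.3, 0.5],
    [0.8, 0.6, 1.0, 0.1, 0.4],
    [0.2, 0.3, 0.1, 1.0, 0.3],
    [0.7, 0.5, 0.4, 0.3, 1.0],
]
avg_rating = [3.5, 2.0, 4.1, 4.8, 3.9]
n_ratings = [500, 900, 150, 2000, 120]

result = recommend(M, avg_rating, n_ratings, n=0)
```

For row 0 the three nearest movies are rows 1, 2 and 4. Row 1 is the most similar but is removed for its low average rating, while row 3 has the best rating overall but never becomes a candidate because it is not similar enough.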

Fig. 7 Recommendation for “Zodiac”

Fig. 8 Recommendation for “Yes Man”

Fig. 9 Recommendation for “Step Up 2”

Fig. 10 Recommendation for “Spider Man 3”

4 Results

The system takes a movieId as input and recommends the top five similar movies based on it. A recommended movie must have a rating above 2.4 and must have been rated by at least 100 people. The complete experiment was performed in R 3.2.2 (Fig. 9).

5 Conclusion

This hybrid model appears to perform well in early testing and to give more personalized and accurate results. The genome tag is the key driver of this model, along with the content-based filters (Fig. 10).