1 Introduction

With the advancement of the Internet, the number of e-commerce sites, online customers, and products has grown dramatically. As online markets become more competitive, online stores need targeted marketing tools that can increase sales, profits, and customer satisfaction. Recommendation systems serve these business objectives of e-commerce sites by producing personalized recommendations for customers based on their past behavior, purchase records, and interests. In recent years, the widespread use of smart devices and social networks has enabled e-commerce sites to collect a large amount of information on users’ behaviors, activities, and preferences. Naturally, new technologies that can efficiently analyze and use such data in recommender systems are required. In addition, recommendation technologies are increasingly linked to other areas of expertise to improve the performance, coverage, and accuracy of recommender systems [9].

Collaborative filtering (CF), a technique used by recommendation systems, predicts and recommends items (information, products or services) that the user might like. Amazon.com’s recommender system is one of the most famous examples of CF. Recommendation systems are popular in both commercial and research sectors, and they are applied in a variety of applications such as movies, music, books, social connections and venues. In particular, movie recommendation systems produce personal recommendations for movies.

CF-based movie recommendation systems predict a rating or a list of top-N recommended movies for a given user based on the ratings (preference scores) given by many users. Since they use only the ratings explicitly given by users, their prediction accuracy faces certain limits [1, 3, 7]. A number of works that employ other item attributes have been conducted to achieve more precise recommendations [4, 6, 11–13]. This paper proposes an algorithm for movie recommendation systems that utilizes the genres of the movies as well as their ratings to increase rating prediction accuracy.

Each movie has various attributes such as title, genre, cast, director, theme or topic, mood, and setting. Movie genres (action, comedy, romance, etc.) can be categorized in several ways, but expert knowledge is usually used to assign genres to a movie. In general, movies of the same genre share certain similarities, but there are no concrete, quantified criteria for genre classification and assignment [4]. In addition, a movie can have more than one associated genre, so there is no way to computationally pinpoint a single representative genre for the movie when this information is not explicitly given.

Considering a combination of the genres associated with a movie gives more insight than considering each genre individually. For example, it makes more sense for a movie’s genres to be action and crime than children and crime. The fact that some genre combinations are more sensible than others indicates that movie genres are correlated [4, 13]. Viewers can often guess the story, mood, and setting of a movie from its genre(s), so a movie’s genre influences viewers’ interest in it and, eventually, the decision whether or not to watch it. Most people have at least one movie genre that they prefer. It is presumed that users who prefer action would enjoy recommended action movies more than users with other preferred genres would [4].

Existing CF algorithms for movie recommendation systems make recommendations using user-given ratings of the movies without taking into account their genre or genre correlation [2, 5, 8, 10]. The algorithm proposed in this paper calculates the correlation between genres using rated movie scores, and performs the classification of movies and the prediction of a list of recommended movies for a target user based on the calculated genre correlations.

2 Related work

In this paper, the prediction accuracy of our proposed movie recommendation algorithm is examined in connection with previous works: the user-based collaborative filtering in [2], and the item-based collaborative filtering in [10].

2.1 User-based collaborative filtering

The user-based CF calculates the similarity of users based on their ratings in order to find a set of users, known as neighbors, whose opinions are historically similar to those of the target user. It then combines the ratings of the neighbors to predict a rating or a top-N recommendation for the target user. User similarities are measured using the Pearson correlation coefficient, and rating predictions are performed using the preference prediction formula shown below. \( I_u \) and \( I_v \) are the sets of items rated by users u and v, respectively, so \( I_u \cap I_v \) is the set of items co-rated by both users. \( r_{u,i} \) denotes the rating of item i given by user u, \( {\overline{r}}_u \) is the average of the ratings of the items rated by user u, and the set \( N_u \) contains the k nearest neighbors of user u.

2.1.1 User similarity computation

$$ sim\left(u,v\right)=\frac{\sum_{i\in {I}_u\cap {I}_v}\left({r}_{u,i}-{\overline{r}}_u\right)\left({r}_{v,i}-{\overline{r}}_v\right)}{\sqrt{\sum_{i\in {I}_u\cap {I}_v}{\left({r}_{u,i}-{\overline{r}}_u\right)}^2}\sqrt{\sum_{i\in {I}_u\cap {I}_v}{\left({r}_{v,i}-{\overline{r}}_v\right)}^2}} $$
(1)

2.1.2 Preference prediction

$$ {P}_{u,i}={\overline{r}}_u+\frac{\sum_{v\in {N}_u} sim\left(u,v\right)\times \left({r}_{v,i}-{\overline{r}}_v\right)}{\sum_{v\in {N}_u}\left| sim\left(u,v\right)\right|} $$
(2)
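To make Eqs. (1) and (2) concrete, the sketch below computes the Pearson user similarity over co-rated items and the neighborhood-based prediction. This is a minimal illustration, not the reference implementation of [2]: the dict-of-dicts rating layout, the function names, and selecting neighbors by similarity magnitude are assumptions made for the example.

```python
import numpy as np

def pearson_user_sim(ratings, u, v):
    """Pearson similarity between users u and v over co-rated items (Eq. 1).
    `ratings[user][item] = score` is an assumed data layout."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    mean_u = np.mean(list(ratings[u].values()))
    mean_v = np.mean(list(ratings[v].values()))
    du = np.array([ratings[u][i] - mean_u for i in common])
    dv = np.array([ratings[v][i] - mean_v for i in common])
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom else 0.0

def predict_user_based(ratings, u, item, k=450):
    """Predict the rating of `item` for user u from the k most similar
    neighbors who rated it (Eq. 2)."""
    mean_u = np.mean(list(ratings[u].values()))
    sims = [(v, pearson_user_sim(ratings, u, v))
            for v in ratings if v != u and item in ratings[v]]
    neighbors = sorted(sims, key=lambda x: abs(x[1]), reverse=True)[:k]
    num = sum(s * (ratings[v][item] - np.mean(list(ratings[v].values())))
              for v, s in neighbors)
    den = sum(abs(s) for _, s in neighbors)
    return mean_u + num / den if den else mean_u
```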

2.2 Item-based collaborative filtering

The item-based CF calculates the similarity between items using user-given ratings. Recommendations for a user are computed by finding items that are similar to other items the user has liked. Once the most similar items are found, the rating prediction is computed by taking a weighted average of the target user’s ratings on these similar items. Item similarities are calculated using Pearson’s correlation coefficient, and rating predictions are performed using the weighted sum equation below [9, 10].

\( U_i \) and \( U_j \) are the sets of users who rated items i and j, respectively, so \( U_i \cap U_j \) is the set of users who rated both items i and j. \( I_u \) is the set of items rated by user u, and \( {\overline{r}}_i \) is the average of the ratings given by users to item i.

2.2.1 Item similarity computation

$$ sim\left(i,j\right)=\frac{\sum_{u\in {U}_i\cap {U}_j}\left({r}_{u,i}-{\overline{r}}_i\right)\left({r}_{u,j}-{\overline{r}}_j\right)}{\sqrt{\sum_{u\in {U}_i\cap {U}_j}{\left({r}_{u,i}-{\overline{r}}_i\right)}^2}\sqrt{\sum_{u\in {U}_i\cap {U}_j}{\left({r}_{u,j}-{\overline{r}}_j\right)}^2}} $$
(3)

2.2.2 Preference prediction

$$ {P}_{u,i}=\frac{\sum_{j\in {I}_u} sim\left(i,j\right)\times {r}_{u,j}}{\sum_{j\in {I}_u}\left| sim\left(i,j\right)\right|} $$
(4)
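The corresponding item-based sketch below follows Eqs. (3) and (4). As with the user-based example, the data layouts and names are illustrative assumptions rather than the implementation of [10].

```python
import numpy as np

def pearson_item_sim(ratings_by_item, i, j):
    """Pearson similarity between items i and j over co-rating users (Eq. 3).
    `ratings_by_item[item][user] = score` is an assumed layout."""
    common = set(ratings_by_item[i]) & set(ratings_by_item[j])
    if not common:
        return 0.0
    mean_i = np.mean(list(ratings_by_item[i].values()))
    mean_j = np.mean(list(ratings_by_item[j].values()))
    di = np.array([ratings_by_item[i][u] - mean_i for u in common])
    dj = np.array([ratings_by_item[j][u] - mean_j for u in common])
    denom = np.sqrt((di ** 2).sum()) * np.sqrt((dj ** 2).sum())
    return float((di * dj).sum() / denom) if denom else 0.0

def predict_item_based(user_ratings, ratings_by_item, i):
    """Weighted-sum prediction of the target user's rating for item i (Eq. 4).
    `user_ratings[item] = score` holds the target user's known ratings."""
    num = den = 0.0
    for j, r_uj in user_ratings.items():
        s = pearson_item_sim(ratings_by_item, i, j)
        num += s * r_uj
        den += abs(s)
    return num / den if den else 0.0
```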

3 Proposed method

The proposed algorithm includes a pre-processing step that measures the correlation between movie genres using rated scores and uses the measured correlations to categorize each movie into a single genre cluster. When a recommendation event occurs (i.e., a user requests a movie recommendation), the proposed algorithm determines the genres preferred by the target user, identifies the movies that belong to the target user’s preferred genres and their similar genres (i.e., genres highly correlated with the preferred genres), and creates a recommendation list consisting of the identified movies. Finally, the proposed algorithm predicts the ratings of the movies in the list and recommends them to the target user.

3.1 Genre correlation measurement

The genre of a movie is generally assigned by an expert’s subjective judgment, and it is hard to quantify the criteria for genre assignment. The proposed algorithm uses the movies’ rating scores to calculate the correlation between movie genres. The correlation between genres a and b, denoted genre_corr(a, b), is calculated using the formula below.

$$ genre\_ corr\left(a,b\right)=\omega \times genre\_ prob\left(a,b\right)+\left(1-\omega \right)\times genre\_ weight\left(a,b\right) $$
(5)

In (5), the genre probability genre_prob(a, b) and the genre weight genre_weight(a, b) are combined to form genre_corr(a, b), the correlation between movie genres. The genre_weight is calculated using the Pearson correlation coefficient, so the weight matrix is symmetric, whereas genre_prob is the co-occurrence probability of movie genres and its matrix is asymmetric. Because genre_corr(a, b) and genre_corr(b, a) may therefore differ, the resulting correlation matrix is also asymmetric. In Eq. (5), ω = 0.5 is used so that the characteristics of genre_weight and genre_prob are reflected in equal proportion (Figs. 1 and 2).
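As a small illustration of Eq. (5), the snippet below blends the two components. It assumes the probability and weight matrices are stored as dictionaries keyed by ordered genre pairs; these names and layouts are hypothetical.

```python
def genre_corr(genre_prob, genre_weight, a, b, omega=0.5):
    """Blend the asymmetric co-occurrence probability and the symmetric
    rating-based weight into one correlation score (Eq. 5)."""
    return omega * genre_prob[(a, b)] + (1 - omega) * genre_weight[(a, b)]
```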

Fig. 1 UBGC prediction accuracy (MAE) for each ω value

Fig. 2 UBGC prediction accuracy (RMSE) for each ω value

3.1.1 Genre probability

The interest of an action movie lover in adventure movies is not necessarily equivalent to the interest of an adventure movie lover in action movies. Thus, the correlation between genres needs to be calculated asymmetrically [4, 5]. In the proposed algorithm, a conditional probability is used to compute the genre probability.

$$ genre\_ prob\left(a,b\right)=P\left(b\mid a\right)=\frac{P\left(a\cap b\right)}{P(a)}=\frac{\left|{I}_{a\cap b}\right|}{\left|{I}_a\right|}, $$
(6)

where \( I_a \) is the set of movies belonging to genre a, and \( I_{a\cap b} \) is the set of movies belonging to both genres a and b.
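A minimal sketch of Eq. (6) follows, assuming the genre assignments are available as a mapping from movie id to its set of genres (a hypothetical layout):

```python
def genre_prob(movie_genres, a, b):
    """Co-occurrence probability P(b | a) = |I_{a∩b}| / |I_a| (Eq. 6)."""
    in_a = [m for m, genres in movie_genres.items() if a in genres]
    if not in_a:
        return 0.0
    in_both = [m for m in in_a if b in movie_genres[m]]
    return len(in_both) / len(in_a)

# Toy example: half of the Action movies are also Crime, so P(Crime | Action) = 0.5.
movies = {1: {"Action", "Crime"}, 2: {"Action"}, 3: {"Crime", "Drama"}}
assert genre_prob(movies, "Action", "Crime") == 0.5
```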

3.1.2 Genre weight

The genre weight equation, a variant of Pearson’s correlation coefficient, calculates the rating correlation of the movies belonging to genres a and b. In the equation below, \( pnt_i(a,b) \) denotes the penalty of movie i, \( s_{*,i} \) denotes the set of bias-removed ratings given by the users who rated movie i, and \( {\overline{s}}_a \) is the average of the ratings of the movies belonging to genre a.

$$ genre\_ weight\left(a,b\right)=\frac{\sum_{i\in {I}_{a\cap b}} pn{t}_i\left(a,b\right)\left({s}_{*,i}-{\overline{s}}_a\right)\times pn{t}_i\left(a,b\right)\left({s}_{*,i}-{\overline{s}}_b\right)}{\sqrt{\sum_{i\in {I}_{a\cap b}}{\left( pn{t}_i\left(a,b\right)\left({s}_{*,i}-{\overline{s}}_a\right)\right)}^2}\sqrt{\sum_{i\in {I}_{a\cap b}}{\left( pn{t}_i\left(a,b\right)\left({s}_{*,i}-{\overline{s}}_b\right)\right)}^2}} $$
(7)

As mentioned earlier, a single movie can be associated with more than one genre. The fewer genres a movie is associated with, the higher the correlation between its associated genres. Hence, \( pnt_i \) is assigned differentially according to the number of genres to which a movie belongs. The genre weight equation involves two target genres (genres a and b), so the numerator of the \( pnt_i \) formula is 2 and the denominator is the number of genres to which movie i belongs. \( G_i \) is the set of genres to which movie i belongs.

$$ pn{t}_i\left(a,b\right)=\frac{2}{\left|{G}_i\right|} $$
(8)

\( s_{u,i} \) denotes the bias-removed rating of movie i given by user u. It is calculated by subtracting the average of all ratings (μ), the user bias \( b_u \), and the movie bias \( b_i \) from the rating \( r_{u,i} \) [1, 8]. \( s_{u,i} \) is an element of the formula that computes the correlation between movie genres.

$$ {s}_{u,i}={r}_{u,i}-\mu -{b}_u-{b}_i $$
(9)
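The sketch below computes the bias-removed ratings of Eq. (9). How the user and item biases \( b_u \) and \( b_i \) are estimated is not spelled out here; taking them as deviations of the user and movie mean ratings from the global mean μ is an assumption of this sketch, one common reading of the bias terms in [1, 8].

```python
import numpy as np

def bias_removed_ratings(ratings):
    """Compute s_{u,i} = r_{u,i} - mu - b_u - b_i (Eq. 9).
    `ratings` is a list of (user, item, score) triples (assumed layout)."""
    mu = np.mean([r for _, _, r in ratings])
    users, items = {}, {}
    for u, i, r in ratings:
        users.setdefault(u, []).append(r)
        items.setdefault(i, []).append(r)
    b_u = {u: np.mean(rs) - mu for u, rs in users.items()}   # user bias (assumed form)
    b_i = {i: np.mean(rs) - mu for i, rs in items.items()}   # movie bias (assumed form)
    return {(u, i): r - mu - b_u[u] - b_i[i] for u, i, r in ratings}
```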

The average rating of genre a, denoted \( {\overline{s}}_a \), is calculated by subtracting the average of all ratings (μ), the average bias of the users who rated the movies in genre a (\( {\overline{b}}_{*} \)), and the average bias of the movies in genre a (\( {\overline{b}}_a \)) from the average rating of the movies in genre a (\( {\overline{r}}_a \)). \( {\overline{s}}_a \) corresponds to the average-rating term in Pearson’s correlation coefficient, i.e., the average rating of a user in the user-based CF and the average rating of an item in the item-based CF [5, 10]. The correlation between genres is computed using \( {\overline{s}}_a \) and \( s_{u,i} \).

$$ {\overline{s}}_a={\overline{r}}_a-\mu -{\overline{b}}_{*}-{\overline{b}}_a $$
(10)

The average bias of genre a is denoted \( {\overline{b}}_a \). It is calculated by subtracting the average of all ratings (μ), the average bias of the users who rated the movies in genre a (\( \overline{b}\left({U}_a\right) \)), and the average bias of the movies in genre a (\( \overline{b}\left({I}_a\right) \)) from the average rating of the movies in genre a (\( \overline{r}\left({I}_a\right) \)).

$$ {\overline{b}}_a=\overline{r}\left({I}_a\right)-\mu -\overline{b}\left({U}_a\right)-\overline{b}\left({I}_a\right) $$
(11)
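Putting Eqs. (7), (8), and (10) together, the following sketch computes genre_weight(a, b). Flattening the rating set \( s_{*,i} \) into a per-movie list of bias-removed ratings and passing the genre-level averages \( {\overline{s}}_a \) and \( {\overline{s}}_b \) in as precomputed values are assumptions made for illustration.

```python
import numpy as np

def genre_weight(movie_genres, s, s_bar_a, s_bar_b, a, b):
    """Penalised Pearson-style weight between genres a and b (Eqs. 7-8).
    `movie_genres[m]` is the set of genres of movie m and `s[m]` is a list of
    bias-removed ratings of movie m (assumed layouts)."""
    num = den_a = den_b = 0.0
    for m, genres in movie_genres.items():
        if a in genres and b in genres:
            pnt = 2.0 / len(genres)                 # penalty of Eq. (8)
            for rating in s[m]:
                da = pnt * (rating - s_bar_a)
                db = pnt * (rating - s_bar_b)
                num += da * db
                den_a += da ** 2
                den_b += db ** 2
    denom = np.sqrt(den_a) * np.sqrt(den_b)
    return num / denom if denom else 0.0
```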

3.2 Movie classification

The classification of movies is performed using their associated genres and the genre correlations [4]. In the MovieLens 100 k dataset, 18 genres were identified, so an 18 × 18 matrix was created. Note that movies with an ‘unknown’ genre in the dataset were excluded. The genres of each movie are correlated to classify the movie into the single class in the matrix that has the highest correlation score.

As an example, the correlations of seven genres computed by the proposed algorithm on the MovieLens 10 M dataset are shown in the matrix below. Suppose that three movies \( i_1 \), \( i_2 \), and \( i_3 \) are classified using these genre correlations. The matrix rows represent the genres of the movies and the columns represent the user’s preferred movie genres (Table 1).

Table 1 Example of the calculated genre correlations

To classify movie \( i_1 \), the possible pairs among its genres (action, adventure, and crime) are identified, and the correlation values of the identified genre pairs are compared. Movie \( i_1 \) is classified into the pair of genres (the class in the matrix) that has the highest correlation value. Note that a pair of two identical genres is not considered when the movie belongs to more than one genre (Table 2).

Table 2 Example of the movie classification using genre correlation

The seven movie genres are denoted by \( g_1 \) (Action), \( g_2 \) (Adventure), \( g_3 \) (Animation), \( g_4 \) (Children), \( g_5 \) (Comedy), \( g_6 \) (Crime), and \( g_7 \) (Documentary).

As shown in the table above, movie \( i_1 \) belongs to three different genres. Once pairs of identical genres are excluded, there are 6 possible pairs. Among these, (\( g_1 \), \( g_2 \)) has the highest correlation value, so movie \( i_1 \) is categorized into the (\( g_1 \), \( g_2 \)) class. Similarly, movie \( i_2 \) is classified into the (\( g_4 \), \( g_3 \)) class, which has the highest correlation score. Movie \( i_3 \) belongs to a single genre, so it is categorized into the (\( g_7 \), \( g_7 \)) class.
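A sketch of this classification rule is shown below, using an asymmetric correlation dictionary corr[(a, b)]; the handling of ties and missing pairs is ad hoc here and is not specified by the original description.

```python
from itertools import permutations

def classify_movie(genres, corr):
    """Assign a movie to the genre-pair class with the highest correlation (Section 3.2).
    `genres` is the movie's set of genres; identical-genre pairs are skipped
    when the movie has more than one genre, matching the example above."""
    if len(genres) == 1:
        g = next(iter(genres))
        return (g, g)
    pairs = list(permutations(genres, 2))            # ordered pairs with a != b
    return max(pairs, key=lambda p: corr.get(p, float("-inf")))
```

For instance, with the Table 2 example, a movie belonging to action, adventure, and crime would be assigned to the (\( g_1 \), \( g_2 \)) class whenever that pair has the highest correlation among the six candidate pairs.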

3.3 Movie recommendation

After the pre-processing described in Sections 3.1 and 3.2 is carried out, the proposed algorithm performs the movie recommendation process, which produces a list of recommended movies and predicts the ratings of the movies in the list.

If the target user prefers genre \( g_1 \), the genres to be recommended are chosen in the order \( g_1 \), \( g_2 \), \( g_6 \), \( g_5 \), \( g_4 \), and \( g_3 \), and the movies belonging to the chosen genre(s) are included in the recommendation list. For example, if the target user’s one preferred genre (\( g_1 \)) and two similar genres (\( g_1 \) and \( g_2 \)) are chosen, the movies classified in \( {c}_{g_1,\ {g}_1} \) and \( {c}_{g_1,{g}_2} \) are recommended to the target user.

3.3.1 Creation of a list of recommended movies

Using the ratings given by the target user, the rating frequency of each of the 18 identified genres is calculated, and the top-N most frequently rated genres are chosen as the target user’s preferred genres. Here, N corresponds to UPGC in the equation below. The movies in the target user’s preferred genres and in the genres similar to them are included in the movie recommendation list for the target user; the number of similar genres is denoted by SGC. The ratings of the movies in the recommendation list are predicted using the item-based and user-based CF algorithms [5, 10].

$$ RecommendedLis{t}_u={\bigcup}_{upg\in UP{G}_u}^{UPGC}\ {\bigcup}_{sg\in S{G}_{upg}}^{SGC}{c}_{upg, sg} $$
(12)
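The sketch below illustrates the list construction of Eq. (12): the UPGC most frequently rated genres of the target user are taken as preferred genres, and for each preferred genre the SGC most correlated genres (including the genre itself, as in the example above) select the classes whose movies are collected. The class and correlation dictionaries are assumed data structures, not part of the original specification.

```python
from collections import Counter

def recommendation_list(user_ratings, movie_genres, classes, corr, upgc=3, sgc=1):
    """Build the candidate movie list of Eq. (12).
    `user_ratings` maps rated movie ids to scores, `classes[(a, b)]` is the set
    of movies classified into class c_{a,b}, and `corr[(a, b)]` is the genre
    correlation (all assumed layouts)."""
    freq = Counter(g for m in user_ratings for g in movie_genres.get(m, ()))
    preferred = [g for g, _ in freq.most_common(upgc)]      # UPGC preferred genres
    recommended = set()
    for upg in preferred:
        similar = sorted((b for (a, b) in corr if a == upg),
                         key=lambda b: corr[(upg, b)], reverse=True)[:sgc]
        for sg in similar:                                   # SGC similar genres
            recommended |= classes.get((upg, sg), set())
    return recommended
```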

3.3.2 Prediction of movie ratings

The ratings of the movies in the recommendation list are predicted using classical user-based and item-based CF algorithms. In user-based CF, the preference prediction equation is used to predict the ratings (preference scores) that the target user would give to the recommended movies. In item-based CF, the weighted sum equation is used to perform movie rating predictions.

4 Experimental evaluation

4.1 Datasets description

The movie datasets available at MovieLens were used to assess the prediction accuracy of the proposed algorithm. The MovieLens 10 M dataset was used to compute movie genre correlations and the MovieLens 100 k dataset was used to make movie recommendations (Table 3).

Table 3 MovieLens dataset description

In the MovieLens 100 k dataset, 80 % of the data is used as the training set and 20 % as the test set (5-fold cross validation). Using this dataset, the accuracy of the movie ratings predicted by the proposed and conventional algorithms was evaluated. Five pairs of training and test sets were used to measure rating prediction accuracy, and the average results over these five experiments were used for comparison.
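For reference, the five pre-split training/test pairs can be read directly from the files distributed with the MovieLens 100 k dataset. The file names u1.base/u1.test through u5.base/u5.test and the tab-separated user–item–rating–timestamp format below reflect the published dataset layout, but the local path is an assumption of this sketch.

```python
import csv

def load_fold(path):
    """Load one MovieLens 100k split file (e.g. u1.base or u1.test).
    Each line is tab-separated: user id, item id, rating, timestamp."""
    with open(path) as f:
        return [(int(u), int(i), float(r))
                for u, i, r, _ in csv.reader(f, delimiter="\t")]

# Five pre-split 80/20 train/test pairs bundled with the dataset (assumed local path).
folds = [(load_fold(f"ml-100k/u{k}.base"), load_fold(f"ml-100k/u{k}.test"))
         for k in range(1, 6)]
```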

4.2 Evaluation metrics

The accuracy of the rating predictions of the proposed and conventional algorithms was measured using the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) [8, 10].

4.2.1 MAE

$$ MAE=\frac{\sum_{i\in N}\left|{p}_i-{q}_i\right|}{\left|N\right|} $$
(13)

4.2.2 RMSE

$$ RMSE=\sqrt{\frac{\sum_{i\in N}{\left({p}_i-{q}_i\right)}^2}{\left|N\right|}} $$
(14)

In (13) and (14), \( p_i \) is the real rating of movie i in the test set, \( q_i \) is the predicted rating of movie i, and N is the set of movies in the test set for which rating predictions are made.
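A direct transcription of Eqs. (13) and (14), assuming the real and predicted ratings are supplied as two aligned sequences:

```python
import numpy as np

def mae_rmse(real, predicted):
    """MAE and RMSE over the test-set predictions (Eqs. 13-14)."""
    p = np.asarray(real, dtype=float)
    q = np.asarray(predicted, dtype=float)
    errors = p - q
    return np.abs(errors).mean(), np.sqrt((errors ** 2).mean())
```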

4.3 Evaluation

The experiments were executed separately for the user-based and item-based CF methods. For each method, the classical CF algorithm and the proposed algorithm using genre correlations were evaluated and their prediction accuracies compared.

4.3.1 User-based CF (UBCF)

Figures 3 and 4 show the accuracy of the movie rating predictions produced by the UBCF algorithm, where k denotes the parameter of the k-nearest neighbor algorithm. The prediction accuracy was measured in terms of MAE and RMSE by varying k from 10 to 942. For both MAE and RMSE, the optimal result (the most accurate prediction) is obtained when k = 450.

Fig. 3 UBCF prediction accuracy (MAE) for each k value

Fig. 4 UBCF prediction accuracy (RMSE) for each k value

4.3.2 User-based CF using genre correlations (UBGC)

The parameters used in the proposed algorithm are the number of nearest neighbors, denoted k; the number of the target user’s preferred genres, denoted UPGC; and the number of genres similar to the target user’s preferred genres, denoted SGC. To find the optimal performance, changes in the accuracy of the proposed algorithm were examined by varying these parameter values. Parameter k was gradually increased from 10 to 900 (Figs. 5 and 6).

Fig. 5 Finding the optimal value of parameter k (median)

Fig. 6 Finding the optimal value of parameter k (median)

For each value of k, 81 results were produced by varying UPGC and SGC from 1 through 9, and the median and maximum values of these results were identified. In the conducted experiments, the number of recommended movies was small when UPGC and SGC were small, so the median values rather than the minimum values were used for the MAE and RMSE measures. In addition, the maximum values were used to find a value of k with a small range of deviation.

The prediction accuracy was best when k = 450. Hence, parameter k was fixed at 450 and the accuracy of the rating predictions was examined by varying UPGC and SGC. The results are presented in the tables below (Tables 4 and 5).

Table 4 Finding the optimal parameters UPGC and SGC (MAE)
Table 5 Finding the optimal parameters UPGC and SGC (RMSE)

The optimum performance of the UBGC was determined by considering the fact that the number of recommended movies is small when UPGC and SGC are small. The UBGC produces its optimum prediction quality (MAE of 0.74228 and RMSE of 1.01907) when UPGC = 3 and SGC = 1. With these values, the proposed algorithm employing user-based CF was compared to the conventional user-based CF algorithm (UBCF). The table below shows this comparison (Table 6).

Table 6 Comparison of MAE and RMSE

4.3.3 Classical item-based CF (IBCF)

The table below shows the accuracy of movie ratings predicted by the IBCF algorithm (Table 7).

Table 7 IBCF prediction accuracy of MAE and RMSE

4.3.4 Item-based CF using genre correlations (IBGC)

Like the UBGC, the IBGC produces 81 results by varying UPGC and SGC from 1 through 9. A difference from the UBGC is that the IBGC achieves better prediction accuracy when the target user’s preferred genre is fixed and the number of similar genres is increased beyond 2 (Tables 8 and 9).

Table 8 Finding the optimal parameters UPGC and SGC (MAE)
Table 9 Finding the optimal parameters UPGC and SGC (RMSE)

In the IBGC, the lowest MAE score is 0.76620 when UPGC = 5 and SGC = 2, and the lowest RMSE score is 1.04893 when UPGC = 1 and SGC = 5. Giving first priority to the number of recommended movies in determining the optimal performance, the case of UPGC = 5 and SGC = 2 was chosen and compared with the conventional item-based CF algorithm (IBCF), as shown below (Table 10).

Table 10 Comparison of MAE and RMSE

5 Conclusions

The proposed algorithm, assessed using the MovieLens datasets, yields better results than the conventional movie recommendation algorithms, although the differences in performance (prediction accuracy) are not very large. The proposed algorithm quantifies the correlation between movie genres using user-given ratings and uses the genre correlations to classify the movies in a dataset before the movie recommendation process begins. This pre-processing enables the proposed algorithm to make more accurate rating predictions, thus producing high-quality movie recommendations. Further work and verification are needed for genre correlation information to be reliable and useful for movie recommendations, but the approach proposed in this paper shows a way to exploit an item’s genre information and can be extended to other domains.

In the experiments, it was observed that the proposed algorithm has lower prediction accuracy than the conventional algorithms when weights are applied to the rating predictions. The reasons for and solution to this problem should be studied in the future. Since the movie genre is an attribute of the movie, the authors expected that the proposed algorithm would achieve better results when item-based CF is used. In the proposed algorithm using item-based CF, increasing SGC while fixing UPGC made the prediction accuracy gradually increase, whereas in the proposed algorithm using user-based CF, increasing UPGC rather than SGC led to higher prediction accuracy. When UPGC and SGC were small, the number of recommended movies was small but the prediction accuracy was better. In the proposed algorithm, the value of SGC should be determined based on movie characteristics when item-based CF is used; similarly, the value of UPGC should be determined based on user characteristics when user-based CF is used.

6 Future work

In this paper, the experiments were conducted with two MovieLens datasets, the MovieLens 10 M dataset for genre correlation calculation and the MovieLens 100 k dataset for movie rating predictions.

The MovieLens 10 M dataset was used for genre correlation measurement because the MovieLens 100 k dataset could not be used due to data sparsity: it produces a sparse genre correlation matrix. In order to obtain consistent and reliable results, it is important that both the genre correlation calculation and the movie recommendation be carried out on the same dataset, either the MovieLens 100 k dataset or the MovieLens 10 M dataset. Thus, a way to resolve this problem needs to be developed.

The proposed algorithm using user-based CF should address the problem of not being able to make rating predictions when movie ratings from some nearest neighbors are unavailable, and it is also less efficient than the conventional CF algorithms in terms of computational cost. The proposed algorithm using item-based CF is able to prepare a recommendation list in advance via the pre-processing step, thereby producing much faster recommendations. Since both user-based and item-based CF suffer from the fundamental data sparsity problem, another means of recommending movies that does not rely on rating prediction is needed. One way to address this issue is to recommend the movies in the user’s preferred genres and/or in genres similar to the user’s preferred genres.