1 Introduction

With the development of internet technologies, a deluge of data from all walks of life has led to the problem of information overload [1, 12, 23, 31]. To address this problem, many large web sites and e-commerce sites deploy convenient and efficient recommender systems to improve service quality and to attract and retain loyal users. Examples include the recommendation of books at Amazon [10], applications in app markets [13], videos on YouTube [4], and results in web search [41].

Collaborative filtering is one of the most successful techniques in recommender systems. It recommends items to a user by analyzing the user's data, which can be obtained by tracking browsing history, purchasing records, rating records, etc. [7, 14, 16, 18, 24, 28]. Over years of development, this recommendation technology has been classified into two main categories: the model-based approach and the memory-based approach [38]. The model-based approach first constructs a prediction model from the user-item rating matrix and then predicts ratings on target items. Differing from the former, the memory-based approach first calculates the similarity between users/items, selects the top-k similar users/items as the neighbors of the active user/target item, and then generates the predicted results. In addition to collaborative filtering, content-based approaches [27, 30, 36], hybrid filtering [11, 33, 39], and demographic filtering [22, 32, 40] have also been proposed for different applications. Furthermore, the memory-based approach can be further categorized into user-based and item-based methods. In this paper, we focus on improving the performance of recommender systems based on the user-based method, so as to reduce the impact of data sparsity [29, 34].

In previous related works, modifications and enhancements of collaborative filtering mainly concern two aspects: similarity measure modification and neighbor selection [15, 20, 26, 42]. Regarding similarity measure modification, traditional similarity measures such as the Pearson Correlation Coefficient (PCC) [9] and Cosine (COS) [35] have been widely used in recommender systems. Jamali and Ester [17] proposed a modified similarity measure based on PCC using a sigmoid function (SPCC), which emphasizes the importance of commonly rated items: intuitively, the more commonly rated items exist between two users, the more similar they are. Since the Cosine similarity measure does not take the rating scale into account, the adjusted Cosine similarity measure (ACOS) [37] was proposed to address this shortcoming. Ahn [2] introduced a new heuristic similarity measure that considers three factors of similarity: proximity, impact, and popularity of ratings, and is thus called the PIP method. However, PIP only considers local rating information and ignores the global user preference. Liu et al. [25] analyzed the shortcomings of PIP and proposed a new heuristic similarity model (NHSM), which not only inherits the advantages of the PIP method but also pays attention to the proportion of commonly rated items and to user preference. In addition to the above similarity measures, researchers have also proposed many modified neighbor selection approaches. For example, Kaleli [19] proposed an entropy-based optimization for forming a more qualified neighbor set, which assigns a degree of uncertainty (DU) to every user and demands neighbors with minimum differences in DU and maximum similarity with the active user. Boumaza and Brun [8] introduced the concept of global neighbors, i.e., neighbors of all active users.
Kim and Yang [21] presented a threshold-based neighbor selection approach, in which neighbors are determined within a certain selection range with respect to the similarity of preferences. Anand and Bharadwaj [3] introduced a recommendation framework combining both local and global similarities to alleviate data sparsity, which allows the importance given to the global user similarity to vary relative to the local user similarity.

In this paper, we present an effective collaborative filtering algorithm based on user preference clustering, which differs from the aforementioned ones. On the one hand, user groups are introduced to select more accurate and reliable neighbors for the active user. As we know, users with different preferences have different rating habits, so users can be clustered into different user groups: (1) the optimistic user group, in which users prefer to give high marks; (2) the pessimistic user group, in which users prefer to give low marks; (3) the neutral user group, in which users tend to give reasonable marks for items. On the other hand, we notice that most previous similarity measures are not suitable for capturing user preference, so we propose a new similarity measure to calculate the similarity between users in the clustering process. Moreover, extensive experiments show that our proposed algorithm can significantly improve performance on sparse rating data. The major contributions of this work can be summarized as follows:

  • Users are allocated into different user groups based on user preference clustering.

  • A new similarity measure method with the factor of user preference is proposed.

  • Extensive experimental results show that our proposed method is effective.

  • The proposed algorithm based on user preference clustering can be combined with other similarity measure methods freely.

The rest of this paper is organized as follows. We review traditional similarity measures and the user-based collaborative filtering approach in Section 2. Section 3 then describes our proposed algorithm in detail. Section 4 demonstrates and explains the experimental results. Finally, we conclude the work and discuss future work in Section 5.

2 Preliminaries

In recommender systems, a user-item rating matrix R is constructed to make recommendations for the active user, containing the ratings of m users on n items; U denotes the set of m users, and I represents the set of n items. Note that the rating matrix is sparse: missing or unknown ratings are denoted by the symbol ?, and \(r_{ui}\) denotes the rating of user u on item i.

According to the user information stored in rating matrix R, traditional similarity measure methods, such as PCC and COS, are widely used to calculate the similarity between users in the user-based collaborative filtering approach, as given in (1) and (2) respectively:

$$ sim(a,\!b)^{PCC}\,=\,\frac{{\sum}_{i\in I_{ab}}(r_{ai}\,-\,\overline{r}_{a})\!\times\!(r_{bi}\,-\,\overline{r}_{b})}{\sqrt{{\sum}_{i\in I_{ab}}(r_{ai}\,-\,\overline{r}_{a})^{2}}\!\times\!\sqrt{{\sum}_{i\in I_{ab}}(r_{bi}\,-\,\overline{r}_{b})^{2}}}, $$
(1)
$$ sim(a,b)^{COS}=\frac{{\sum}_{i\in I_{ab}}r_{ai}\times r_{bi}}{\sqrt{{\sum}_{i\in I_{ab}}r_{ai}^{2}}\times\sqrt{{\sum}_{i\in I_{ab}}r_{bi}^{2}}}, $$
(2)

where \(sim(a,b)\) denotes the similarity between user a and user b, \(I_{ab}\) is the set of items commonly rated by user a and user b, \(r_{ai}\) is the rating of user a on item i, and \(\overline{r}_{a}\) is the average rating of user a. After the similarities are calculated, the k most similar users are selected as the neighbors of the active user, and the prediction on the target item is then computed. The prediction formula is defined as follows:

$$ p_{ti}=\overline{r}_{t}+\frac{{\sum}_{u\in U_{nei}}sim(t,u)\times(r_{ui}-\overline{r}_{u})}{{\sum}_{u\in U_{nei}}| sim(t,u)|}, $$
(3)

where \(p_{ti}\) denotes the prediction for active user t on target item i, \(U_{nei}\) is the neighbor set of active user t, and \(|U_{nei}| = k\).
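As a concrete illustration, the PCC similarity (1) and the prediction rule (3) can be sketched in Python as follows. The dict-of-dicts rating layout and the function names are our own illustrative choices, not part of any reference implementation.

```python
import math

def pcc(ratings_a, ratings_b):
    """Pearson Correlation Coefficient (1) over the items co-rated by
    both users; `ratings_a`/`ratings_b` map item id -> rating."""
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / (den_a * den_b) if den_a and den_b else 0.0

def predict(active, neighbors, item, sim):
    """Prediction formula (3): the active user's mean rating plus the
    similarity-weighted deviation of each neighbor's rating on `item`."""
    mean_t = sum(active.values()) / len(active)
    num = den = 0.0
    for nb in neighbors:
        if item in nb:
            mean_u = sum(nb.values()) / len(nb)
            s = sim(active, nb)
            num += s * (nb[item] - mean_u)
            den += abs(s)
    return mean_t + num / den if den else mean_t
```

The COS similarity (2) would differ only in using raw ratings instead of mean-centered ones.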

3 The proposed algorithm

In collaborative filtering, the traditional way of searching for neighbors of the active user depends on the rating information of items commonly rated by two users. However, this traditional approach has some shortcomings: the factor of user preference is not taken into account, and only a small portion of the collected user data is utilized. To overcome these drawbacks, a novel and effective collaborative filtering algorithm based on user preference clustering is proposed. The flowchart of the proposed algorithm is shown in Fig. 1.

Fig. 1
figure 1

Algorithm’s flowchart

3.1 Clustering based on user preference

In practical recommender applications, users can have starkly different views on an item. For example, some users are lenient and tend to rate the items they like with high marks. Conversely, some users have strict attitudes toward rating and tend to give low marks. Finally, some users give reasonable marks for different items. As discussed above, users can thus be allocated into three different user groups. Suppose \(C_o\), \(C_p\), and \(C_n\) represent the optimistic, pessimistic, and neutral user groups respectively, and \(c_o\), \(c_p\), and \(c_n\) are the clustering centers of \(C_o\), \(C_p\), and \(C_n\) respectively. We now introduce the selection of clustering centers.

Definition 1 (Different user preferences)

Suppose \(U_h\) and \(U_l\) are two subsets of the user set U, where \(U_{h}=\{u\in U\mid\overline{r}_{u}\geq \alpha\}\) and \(U_{l}=\{u\in U\mid\overline{r}_{u}<\beta\}\). Then \(c_{o}\in U_{h}\) and \(c_{p}\in U_{l}\).

Here α is set to a high mark and β to a low mark. For example, on a 1–5 rating scale, α can be set to 4 and β to 2. Therefore, \(U_h\) is a subset of users who prefer to give high marks to items; similarly, users in \(U_l\) tend to rate items with low marks.

Definition 2 (Maximum rating number)

\(c_o\) is the user in subset \(U_h\) who has the maximum number of ratings; \(c_p\) is the user in subset \(U_l\) who has the maximum number of ratings.

The above definitions give two criteria for the selection of clustering centers: the expected \(c_o\) should have a preference for giving high marks and, at the same time, should have rated as many items as possible (and analogously for \(c_p\)). With these criteria, we can judge a user's preference by calculating the similarity between the user and all clustering centers. The clustering centers of the different user groups are defined as follows:

Definition 3

Clustering center \(c_o\) of the optimistic user group can be uniquely determined as follows:

$$ c_{o}=u\Leftarrow\underset{u}{\arg\max}{| I_{u}|}, $$
(4)

where \(u\in U_{h}\) and \(I_{u}=\{i\in I\mid r_{ui}\neq ?\}\). Clustering center \(c_p\) of the pessimistic user group can be uniquely determined as follows:

$$ c_{p}=u\Leftarrow\underset{u}{\arg\max}{| I_{u}|}, $$
(5)

where \(u\in U_{l}\) and \(I_{u}=\{i\in I\mid r_{ui}\neq ?\}\).

For every item \(i\in I\), the rating of \(c_n\) on i is defined as \(\overline{r}_{i}\), the average rating of item i. In this way, the clustering center \(c_n\) of the neutral user group is constructed.

From Definition 3, we know that \(c_o\), \(c_p\), and \(c_n\) are uniquely determined and beneficial for achieving user preference clustering. Both \(c_o\) and \(c_p\) come from the user set U, and they have totally opposite preferences. \(c_n\) is virtual and constructed for obtaining the neutral user group: given the enormous number of users stored in the user-item rating matrix, the average rating on an item can represent the majority view on that item, so \(c_n\) can be regarded as a typical user who prefers to give reasonable marks. Based on this, three typical users with different characteristics are found as the three clustering centers for generating user groups. Then, we can obtain user groups with different preferences through Definition 4.
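The center selection of Definitions 1–3 can be sketched as below. The `ratings` dict layout, parameter names, and default thresholds are illustrative assumptions; this is a sketch of the procedure, not the authors' code.

```python
def select_centers(ratings, item_ids, alpha=4.0, beta=2.0):
    """Pick clustering centers: c_o is the user in U_h (mean rating >=
    alpha) with the most ratings (4); c_p likewise in U_l (mean rating
    < beta) (5); c_n is a virtual user holding each item's average rating.
    `ratings` maps user -> {item: rating}."""
    mean = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    u_h = [u for u in ratings if mean[u] >= alpha]
    u_l = [u for u in ratings if mean[u] < beta]
    c_o = max(u_h, key=lambda u: len(ratings[u]))  # (4)
    c_p = max(u_l, key=lambda u: len(ratings[u]))  # (5)
    # virtual neutral center: item-wise average over the users who rated it
    c_n = {}
    for i in item_ids:
        rated = [r[i] for r in ratings.values() if i in r]
        if rated:
            c_n[i] = sum(rated) / len(rated)
    return c_o, c_p, c_n
```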

Definition 4

Suppose \(C=\{c_{o},c_{p}\}\). For every \(u\in U\setminus C\), the preference of u is determined as follows:

\(u\in C_{o}\), if u satisfies \(sim(u,c_{o})>sim(u,c_{p})\) and \(sim(u,c_{o})>sim(u,c_{n})\);

\(u\in C_{p}\), if u satisfies \(sim(u,c_{p})>sim(u,c_{o})\) and \(sim(u,c_{p})>sim(u,c_{n})\);

\(u\in C_{n}\), if u satisfies \(sim(u,c_{n})>sim(u,c_{o})\) and \(sim(u,c_{n})>sim(u,c_{p})\).

From Definition 4, we can easily identify the different preferences of users based on the similarity between users, and users with consistent preferences are assigned to the same user group. Therefore, different user groups can be obtained, i.e., the optimistic user group \(U_o\), the pessimistic user group \(U_p\), and the neutral user group \(U_n\).
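The assignment rule of Definition 4 amounts to a nearest-center test. A minimal sketch, with an arbitrary pluggable similarity function; note that the paper's strict inequalities leave ties unspecified, so this sketch breaks ties toward the neutral group, which is our own assumption.

```python
def assign_groups(users, c_o, c_p, c_n, sim):
    """Assign each user to the group whose center it is most similar to
    (Definition 4); `sim(u, c)` is any user-user similarity."""
    groups = {'optimistic': [], 'pessimistic': [], 'neutral': []}
    for u in users:
        if u in (c_o, c_p):
            continue  # centers themselves are excluded (u in U \ C)
        s_o, s_p, s_n = sim(u, c_o), sim(u, c_p), sim(u, c_n)
        if s_o > s_p and s_o > s_n:
            groups['optimistic'].append(u)
        elif s_p > s_o and s_p > s_n:
            groups['pessimistic'].append(u)
        else:
            groups['neutral'].append(u)  # assumed tie-breaking rule
    return groups
```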

3.2 User similarity

In the clustering process, the rating information of the clustering centers has special characteristics (e.g., \(c_o\) prefers to give high marks), and the determination of a user's preference depends on the similarity between the user and these clustering centers. Therefore, an effective similarity measure is essential for assigning the remaining users to the appropriate user groups. To highlight the importance of user preference, we propose a new similarity measure to calculate the similarity between users, as follows:

$$\begin{array}{@{}rcl@{}} sim(a,b)^{UPS}&=&\exp\left( -\frac{{\sum}_{i\in I_{ab}}| r_{ai}-r_{bi}|}{| I_{ab}|}\times|\overline{r}_{a}-\overline{r}_{b}|\right)\\ &&\times\frac{| I_{a}\cap I_{b}|}{| I_{a}\cup I_{b}|}, \end{array} $$
(6)

From (6), we see that two important factors are involved. From the global perspective, user preference is reflected by the average rating over all items: the larger the difference between two users' average ratings, the more their preferences differ. From the local perspective, the factor of commonly rated items is taken into account: the more commonly rated items two users share, and the smaller the differences between their ratings on them, the higher their similarity. Therefore, users with consistent preferences are easily assigned to the same user group.
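A direct transcription of (6) can be sketched as follows; the dict-based rating layout is again an assumption of the sketch.

```python
import math

def ups(ratings_a, ratings_b):
    """User-preference similarity (6): an exponential penalty combining
    the mean absolute rating gap on co-rated items with the gap between
    the two users' average ratings, scaled by the Jaccard overlap of
    their rated-item sets."""
    common = ratings_a.keys() & ratings_b.keys()
    union = ratings_a.keys() | ratings_b.keys()
    if not common:
        return 0.0
    local = sum(abs(ratings_a[i] - ratings_b[i]) for i in common) / len(common)
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)
    global_gap = abs(mean_a - mean_b)
    return math.exp(-local * global_gap) * len(common) / len(union)
```

Note how the measure behaves at the extremes: two users with identical ratings on the same items get similarity 1, while an optimistic and a pessimistic user get a similarity pushed toward 0 by both the local and the global term.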

Table 1 shows an example of a user-item rating matrix, in which \(u_1\)–\(u_5\) are users and \(i_1\)–\(i_9\) are items. We calculate the similarities between the users in Table 1 using the different similarity measures mentioned in Section 1, as shown in Fig. 2. Since the user similarity matrix is symmetric, part of the similarity values are omitted in Fig. 2.

Table 1 An example of a user-item rating matrix
Fig. 2
figure 2

Similarity matrices of users in Table 1

Figure 2a gives the user similarity matrix according to the COS method. From Table 1, we can see that \(u_1\) and \(u_5\) have similar ratings, both preferring to give low marks, whereas most of the ratings of \(u_2\) are 4. However, the similarity between \(u_1\) and \(u_2\) is 1 in Fig. 2a. This drawback also exists in the SPCC method, as shown in Fig. 2b: \(u_1\) and \(u_2\) obtain the highest correlation regardless of user preference. Compared with COS and SPCC, the similarities computed by the NHSM method are more accurate, as shown in Fig. 2c. However, in Fig. 2c the similarity between \(u_2\) and \(u_4\) is higher than the similarity between \(u_3\) and \(u_4\), even though \(u_3\) and \(u_4\) in fact have more similar preferences. Figure 2d gives the user similarity matrix according to the proposed UPS method. In Fig. 2d, the similarity between \(u_1\) and \(u_5\) is high. In addition, both \(u_3\) and \(u_4\) prefer to give high marks, and their similarity is 0.3153, which is higher than the similarity between \(u_2\) and \(u_4\). Based on these observations, we conclude that the proposed similarity measure is more suitable for depicting the characteristics of user preference.

3.3 Recommendation method

In this section, we design the algorithm that makes recommendations for the active user. Following the analysis in Sections 3.1 and 3.2, we first calculate the similarity between users with our proposed method, and the similarity matrix is denoted by \(sim^{UPS}\). Then, \(c_o\), \(c_p\), and \(c_n\) are determined as the clustering centers with different preferences. Finally, users are assigned to different user groups based on user similarity, which generates the optimistic user group \(U_o\), the pessimistic user group \(U_p\), and the neutral user group \(U_n\). After clustering is finished, we can obtain the k nearest neighbors of the active user; the neighbor selection approach is defined as follows:

Definition 5

Suppose \(U_{nei}\) is the neighbor set of active user t, \(|U_{nei}|=k\), and \(U_{nei}\subset U\). If \(t\in U_{o}\), then \(U_{nei}\subset U_{o}\cup U_{n}\). If \(t\in U_{p}\), then \(U_{nei}\subset U_{p}\cup U_{n}\). If \(t\in U_{n}\), then \(U_{nei}\subset U_{n}\).

From Definition 5, we know that a user from \(U_n\) can become a neighbor of any active user, because users from \(U_n\) give reasonable ratings and are valuable for predicting unrated items. Conversely, users from \(U_o\) (or \(U_p\)), who prefer to give high marks (or low marks), cannot become neighbors of an active user from \(U_n\). After the neighbor set \(U_{nei}\) is obtained for active user t, we can predict the rating \(p_{ti}\) as follows:

$$ p_{ti}=\overline{r}_{t}+\frac{{\sum}_{u\in U_{nei}}sim^{UPS}(t,u)\times(r_{ui}-\overline{r}_{u})}{{\sum}_{u\in U_{nei}}| sim^{UPS}(t,u)|}, $$
(7)
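Definition 5 and the prediction rule (7) together can be sketched as below. The group encoding follows the earlier sketches and the `sim` argument is any similarity over two users' rating dicts; these representational choices are assumptions, not the authors' implementation.

```python
def predict_rating(t, item, ratings, groups, sim, k=10):
    """Predict p_ti via (7): candidates come from the active user's own
    group plus the neutral group (Definition 5, neutral users stay within
    their own group), the k most similar candidates are kept, and their
    mean-centered ratings on `item` are similarity-weighted."""
    if t in groups['optimistic']:
        pool = groups['optimistic'] + groups['neutral']
    elif t in groups['pessimistic']:
        pool = groups['pessimistic'] + groups['neutral']
    else:
        pool = groups['neutral']
    candidates = [u for u in pool if u != t and item in ratings[u]]
    neighbors = sorted(candidates, key=lambda u: sim(ratings[t], ratings[u]),
                       reverse=True)[:k]
    mean_t = sum(ratings[t].values()) / len(ratings[t])
    num = den = 0.0
    for u in neighbors:
        mean_u = sum(ratings[u].values()) / len(ratings[u])
        s = sim(ratings[t], ratings[u])
        num += s * (ratings[u][item] - mean_u)
        den += abs(s)
    return mean_t + num / den if den else mean_t
```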

In order to provide a clear description, we display our proposed method in Algorithm 1.

figure e

To evaluate the performance of our proposed algorithm, an analysis of its time complexity is necessary. The selection of clustering centers requires an extra time cost of O(m), where m denotes the number of users, and calculating the similarity between users with the proposed method takes O(m(m+2)). Overall, although the time complexity of the similarity calculation increases slightly compared with the traditional user-based recommendation algorithm (i.e., O(m 2)), the similarity is usually computed off-line to lessen the time complexity burden.

4 Experiments

4.1 Data sets

We test our proposed algorithm on two well-known data sets, MovieLens (ML) and HetRec2011-MovieLens (HRML). The ML data set was collected by the GroupLens research team at the University of Minnesota; it contains 100000 ratings of 943 users on 1682 movies, and each user has rated at least 20 movies. The density of the ML data set is 6.3047 %. The second data set, HRML, was released at the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems. We randomly drew 1036 users and 1300 movies from the HRML data set, for a total of 106210 ratings; the sparsity of the extracted data set is 92.1139 %. In addition, each data set is divided into 5 groups: twenty percent of the data is selected as the test set, and the remaining data forms the training set. To keep the experimental results impartial, we adopt 5-fold cross-validation by choosing different test and training sets.
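The 5-fold protocol described above can be sketched as follows; the `(user, item, rating)` triple layout and the seeded shuffle are assumptions of the sketch, not details given in the paper.

```python
import random

def five_fold_splits(ratings, seed=0):
    """Partition the rating records into 5 equal folds; each fold in turn
    serves as the 20% test set, with the remaining 80% as training data.
    `ratings` is a list of (user, item, rating) triples."""
    records = list(ratings)
    random.Random(seed).shuffle(records)  # reproducible shuffle
    folds = [records[f::5] for f in range(5)]
    for f in range(5):
        test = folds[f]
        train = [r for g in range(5) if g != f for r in folds[g]]
        yield train, test
```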

4.2 Evaluation metrics

To date, researchers have presented many metrics to evaluate the performance of recommender systems [5, 6]. Generally, evaluation metrics are classified into two categories: (1) evaluation metrics of the prediction quality, such as mean absolute error (MAE), coverage, and accuracy; (2) evaluation metrics of the recommendation quality, such as precision, recall, and novelty. In order to estimate the performance of our proposed method, we utilize the MAE and coverage to measure the prediction quality, and the precision and recall to measure the quality of the recommendation set.

MAE

The MAE is one of the most widely used metrics for evaluating recommendation accuracy, and is defined as the average absolute difference between prediction values and actual ratings; the lower the MAE, the more accurate the predictions. Assuming \(I_{u}=\{i\in I\mid p_{ui}\neq ?\wedge r_{ui}\neq ?\}\) is the set of items rated by user u that have prediction values, and \(p_{ui}\) is the prediction of user u on item i, the MAE is calculated as follows:

$$ MAE=\frac{1}{\#U}\underset{u\in U}{\sum}\frac{{\sum}_{i\in I_{u}}| p_{ui}-r_{ui}|}{\#I_{u}}, $$
(8)

Coverage

The coverage is the proportion of predictable items out of the total number of items, and reflects the prediction capacity of a recommender system; it should be as high as possible for good prediction quality. Assuming there are n items and the number of predictable items is s, the coverage is calculated as follows:

$$ Coverage=\frac{s}{n}, $$
(9)

Precision

The precision is the proportion of relevant recommendations out of the total number of recommendations; the higher the precision, the better the recommendation performance. Assuming \(Z_u\) is the set of top-N recommendations to user u, 𝜃 is a relevancy threshold set to the median value of the ratings in the rating matrix. The precision is calculated as follows:

$$ Precision=\frac{1}{\#U}\underset{u\in U}{\sum}\frac{\#\{i\in{Z_{u}}|r_{ui}\geq\theta\wedge p_{ui}\geq\theta\}}{N}, $$
(10)

Recall

The recall is the average proportion of relevant recommendations out of the total number of relevant items that the user actually liked according to the actual ratings; it should also be as high as possible for good recommendation performance. Assuming \(T_u\) is the set of relevant items in the test set that are liked by user u, the recall is calculated as follows:

$$ Recall=\frac{1}{\#U}\underset{u\in U}{\sum}\frac{\#\{i\in{Z_{u}}|r_{ui}\geq\theta\wedge p_{ui}\geq\theta\}}{\#T_{u}}. $$
(11)
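The MAE (8) and the per-user precision/recall terms of (10) and (11) can be sketched in Python as below. For brevity, the precision/recall sketch assumes the recommendation list has already been filtered to items with \(p_{ui}\geq\theta\), a simplification of the text; the function names and data layouts are our own.

```python
def mae(predictions, actual):
    """MAE (8): average over users of the mean absolute error on the
    (user, item) pairs present in both dicts."""
    per_user = {}
    for (u, i), p in predictions.items():
        if (u, i) in actual:
            per_user.setdefault(u, []).append(abs(p - actual[(u, i)]))
    return sum(sum(e) / len(e) for e in per_user.values()) / len(per_user)

def precision_recall(recommended, actual, theta):
    """Per-user precision (10) and recall (11): `recommended` is the
    top-N list, `actual` maps item -> true rating."""
    hits = [i for i in recommended if actual.get(i, 0) >= theta]
    relevant = [i for i, r in actual.items() if r >= theta]
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```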

4.3 Experimental results

To verify the effectiveness of the proposed algorithm, we first show that exploiting user preference is conducive to performance improvement; several experiments are performed on two benchmark data sets, as shown in Figs. 3 and 4. Since the number of nearest neighbors can affect the performance of recommendation algorithms, the number of nearest neighbors k varies from 10 to 100 in the experiments.

Fig. 3
figure 3

Validity check of user preference clustering on ML data set

Fig. 4
figure 4

Validity check of user preference clustering on HRML data set

Figure 3 demonstrates the experimental results on the ML data set. In Fig. 3a, the result labeled COS-CF shows the MAE of the traditional COS-based collaborative filtering algorithm [35], and Modified-COS-CF presents the MAE of COS-CF modified with user preference, which first generates three user groups by our proposed method and then forms the nearest-neighbor set from the corresponding user group(s) based on the similarity calculated by COS. Comparing COS-CF and Modified-COS-CF, we can see that the recommendation accuracy of COS-CF is lower than that of Modified-COS-CF as the number of nearest neighbors increases. Similarly, Modified-PCC-CF is clearly superior to traditional PCC-based collaborative filtering (PCC-CF) [9], as shown in Fig. 3b. Therefore, we can conclude from Fig. 3a, b that Modified-COS-CF and Modified-PCC-CF achieve significant improvements over COS-CF and PCC-CF. Figure 3c compares a recommendation algorithm based on the UPS method (UPS-CF) with its modified counterpart (UPUC-CF). From Fig. 3c, we can see that the MAE of UPUC-CF is lower over the whole top-k range; however, since the user preference clustering of UPUC-CF depends on the similarity estimated by the UPS method, the difference between the results is slight. In addition, Fig. 4 demonstrates the experimental results on the HRML data set. From Fig. 4, we observe the same trend as in Fig. 3: all of the modified approaches have higher recommendation accuracy than the traditional algorithms based on a single similarity computation. Therefore, we can draw the conclusion that a recommendation algorithm based on user preference is more effective.

To further validate our proposed algorithm, we compare it with several state-of-the-art recommendation algorithms, i.e., COS-CF [35], PCC-CF [9], SPCC-CF [17], and NHSM-CF [25]. The results of the comparisons are shown in Figs. 5 and 6.

Fig. 5
figure 5

Algorithms comparison based on ML data set

Fig. 6
figure 6

Algorithms comparison based on HRML data set

Figure 5 shows the performance of the different algorithms with varying numbers of nearest neighbors on the ML data set; Fig. 5a, b, c, d demonstrate the results for MAE, coverage, precision, and recall respectively. In Fig. 5a, the MAE of all algorithms decreases as the number of nearest neighbors increases, and our proposed algorithm obtains the best MAE over the whole top-k range. In Fig. 5b, our proposed algorithm shows a remarkable improvement compared with the other algorithms, and its coverage exceeds 85 % when the number of nearest neighbors is 10. In short, our proposed algorithm has better prediction quality on the ML data set. In addition, in Fig. 5c, d the number of recommended items increases in step with the number of nearest neighbors, i.e., the number of recommended items also varies from 20 to 30 as k varies from 20 to 30. In Fig. 5c, d, both the precision and the recall of the proposed algorithm remain the best as the number of nearest neighbors increases. Therefore, the proposed algorithm is verified to improve the recommendation performance on the ML data set.

Figure 6 demonstrates the experimental results on the HRML data set. From Fig. 6, we can see that the behavior of the different algorithms as the number of nearest neighbors increases is basically the same as in Fig. 5. Differing from Fig. 5, Fig. 6 shows that NHSM-CF and SPCC-CF compare unfavorably with PCC-CF in terms of prediction quality and recommendation performance when the number of nearest neighbors exceeds 70. The experiments on the HRML data set therefore confirm that our proposed algorithm is superior to the other algorithms.

From Figs. 5 and 6, we conclude that our proposed algorithm obtains better prediction quality and recommendation performance than the other methods. As mentioned in Section 1, many modifications and enhancements of collaborative filtering have been published with the aim of discovering more accurate and reliable neighbors to improve the performance of recommender systems. Although the purpose of our work is similar to theirs, we propose a novel collaborative filtering algorithm based on user preference clustering. This method considers that users have different rating habits, and different typical users are defined to generate user groups with different preferences. A new similarity measure is also proposed, which not only considers the rating information on items commonly rated by users but also exploits the global information of user preference. In short, the experimental results on two benchmark data sets show that the proposed algorithm effectively improves prediction quality and recommendation performance.

5 Conclusion and future work

In this paper, we introduced a collaborative filtering algorithm based on user preference clustering. Our approach is based on the assumption that users have different rating habits. To distinguish different typical users, the primary work in this paper is a framework that assigns users into user groups with different preferences, so that neighbors with consistent preferences can be found for the active user. As we know, the traditional Pearson Correlation Coefficient and Cosine similarity measures have the shortcoming that they only consider the factor of items commonly rated by users. To solve this problem, we proposed a new similarity measure that considers user preference from both the local and the global perspective. In addition, a worked example illustrated that the proposed similarity measure is more effective and suitable for calculating the similarity between users. In the experiments, we evaluated the effectiveness of the proposed algorithm in terms of prediction quality and recommendation performance, and the results on two benchmark data sets demonstrated that it outperforms several state-of-the-art recommendation algorithms. In short, the proposed algorithm effectively improves the performance of recommender systems.

In the future, we will continue to analyze the impact of user behavior in recommender systems and study the mechanisms of user rating behavior.