1 Introduction

People spend an increasing amount of time on the Internet because of the vast quantity of information it provides across various fields [1,2,3,4,5]. A Recommender System (RS) collects information on websites about user preferences for a set of items and employs different online information sources to predict the users’ preferences for these items [6, 7]; therefore, it plays a vital role in the sale of products or services [8, 9]. Many users prefer to buy products based on the recommendations of other users; thus, the preferences of users for different products should be analyzed. From the perspective of a company, this helps to maximize profits and promote its products or services.

RS methods based on Collaborative Filtering (CF) are extremely popular among researchers and practitioners, as evidenced by the vast number of journal articles and real-life implementation cases [10,11,12,13]. CF recommends products or services based on the similarities in the preferences of a group of customers or online users, known as neighbors [14,15,16]. The advantage of a CF-based RS is that it is domain-independent and comparatively more accurate than content-based filtering (CBF) [17]. With the growth of online purchasing and electronic commerce, automated product recommendation has become an essential tool for enhancing the sales of products and services through Internet-based stores [18]. Specifically, CF recommends products to a customer based on the similarities between users or products, and customers’ past or historical preferences are used to find these similarities [16, 19,20,21]. Therefore, the critical component of CF is effectively identifying the similarities between users or items [18].

The Pearson Correlation Coefficient (PCC), Cosine (COS), Jaccard (JACC), and Jaccard Mean Squared Difference (JMSD) are the conventional methods used to compute the similarities between users or items. The PCC, COS, and mean squared difference are statistical metrics adopted in CF-based RS; their main advantage is that they are easy to implement and their similarity values are easy to interpret [22]. Similarly, JACC is the ratio of intersection to union, which measures the similarity between two users based on the number of items they have co-rated. These similarity measures provide high accuracy for CF-based RS. However, they suffer from cold-start problems, in which only a few items or users are rated; this leads to an extremely sparse user-item rating matrix for the RS [7, 16, 23,24,25]. A similarity matrix computed from such a sparse input matrix misleads an RS [18, 26, 27]. The cold-start is an example of a sparsity problem that occurs when a new user or item is introduced; it becomes difficult to compute the similarities among users or items because of insufficient rating information [25, 28, 29]. The sparse input matrix is a significant issue that decreases the performance of a CF-based RS [30]. Another problem is that when rating values fall on both the positive and negative sides of the rating scale, the conventional methods produce different similarity values; this is a further drawback of the conventional similarity measures. These misleading similarity values eventually lower the accuracy of the CF-based RS. Therefore, a more effective similarity measure is required to improve the performance of a CF-based RS. To address these issues, Ahn [18] proposed the Proximity-Impact-Popularity (PIP) measure, predominantly for use in CF, to provide a better solution to the sparsity problem. In this method, two agreement conditions are included, and similarity is computed by considering positive and negative ratings. However, the range of values for the PIP is so wide that the three components (proximity, impact, popularity) are not treated equally; in different scenarios, each component carries a different weight, and the component values are not normalized [26]. If users provide an extremely positive rating for the co-rated items, then proximity has a greater weight than impact and popularity. Similarly, if users provide opposite ratings for the co-rated items, then proximity and popularity are treated in the same manner but the impact value is very small. Each component contributes important information to the PIP calculation; however, because the components are treated in unequal proportions, prediction accuracy in CF-based RS suffers. This is one of the limitations of the PIP measure. To overcome it, a detailed analysis of the PIP measure has been performed and the shortcomings of the existing similarity measure have been identified. Based on this analysis, a modified PIP (MPIP) similarity measure has been developed to overcome the limitations of the PIP measure.

Generally, a similarity-based prediction expression predicts a rating from either the user-related average and its weighted average deviation or the item-related average and its weighted average deviation; only one of these is considered in the prediction process. Both user- and item-related information provide important cues for an accurate prediction, yet the existing prediction expression uses only one of them. This is one of the shortcomings of the existing prediction expression. Therefore, to improve prediction accuracy, a modified prediction expression is devised by adding the user-related deviation to the user-based prediction and the item-related deviation to the item-based prediction; further, the predicted rating is the average of the user- and item-based predictions.

In this study, we have modified the PIP similarity measure by converting the range of each component into 0 to 1. The PIP value is the product of the proximity, impact, and popularity values. These three components are weighted in different proportions in different scenarios, and the deviation between the minimum and maximum values of each component is very high. If two users provide extreme positive ratings, the resultant PIP values are of the order of 10^3; if the users provide opposite ratings, i.e., one user provides a positive rating and the other a negative rating, the PIP values are of the order of 10^{-1}. The variation between the two conditions is very high. Direct normalization procedures such as z-score and min-max normalization may be adopted to obtain normalized PIP values. However, different normalization procedures produce different ranges of values, and for different similarity matrices they produce different values. To overcome this issue, the expression itself is changed into the modified PIP measure, which computes improved similarity values between users or items. Existing similarity measures such as PCC, COS, JACC, and JMSD do not adopt agreement conditions, and their main problems are the flat-value and single-value problems; the PIP measure provides a better solution to both. The MPIP expression uses the same agreement conditions as PIP to distinguish similar from dissimilar rating pairs, and it likewise resolves the flat- and single-value problems. MPIP values are designed to have higher magnitudes for agreement conditions and very small values for disagreement conditions; if two users disagree in their ratings, MPIP yields much lower similarity values. In the existing PIP similarity measure, a constant penalty value is multiplied with the proximity component; instead of a constant, a variable penalty is used in the proposed proximity expression. This helps to compute an improved similarity matrix for CF-based RS. A modified prediction expression has also been introduced, which combines user- and item-related information to enhance the effectiveness of prediction using a sparse user-item rating matrix. The modified prediction expression is further derived for predicting unavailable ratings; based on it, one can obtain accurate rating predictions for the users. The modified PIP similarity measure and the modified prediction expression are combined as the proposed framework for CF-based RS. This improved CF-based RS recommends more relevant products or services to the customer, which in turn enhances customer satisfaction in e-commerce services.

Experiments are conducted using the MovieLens100KB (ML100KB) and Netflix datasets, which were also used by Ahn [18]. Benchmark datasets such as Epinions, CiaoDVD, MovieTweet, FilmTrust, and MovieLens1MB (ML1MB) are used to validate the proposed framework. Each dataset is divided into different sub-problems to test the proposed framework under sparse conditions. Performance criteria such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), precision, recall, and F1-measure are used to measure the effectiveness of the CF-based RS. The McNemar test is conducted to statistically compare the performance of the different methods, and the Friedman rank test is performed to compare the methods across all the sub-problems. The results obtained from the proposed framework are compared with eleven existing similarity measures. The proposed method provides the minimum MAE and RMSE values for all the datasets, and it provides higher accuracy than the conventional methods. The statistical tests are conducted on the MAE, RMSE, and confusion-matrix results. The results show that the proposed framework can improve prediction performance and quality with a sparse input matrix. The symbols used throughout the paper are listed in Table 1.

Table 1 Description of the symbols and abbreviations used in this paper

The main contributions of this study are as follows: The Modified PIP (MPIP) similarity measure is introduced to provide an improved similarity between users or items. In the existing PIP measure, the minimum and maximum values of each component vary over different ranges, so the resultant PIP value depends heavily on one of the component values. The proposed similarity measure (MPIP) addresses this issue by converting each component value into the range of zero to one. In MPIP, a variable penalty value is introduced in the proposed proximity expression to differentiate the values in various scenarios. Further, a Combined User- and Item-based Prediction (CUIP) expression is proposed to obtain a better predicted rating, and the CUIP is tuned to forecast unavailable ratings. The MPIP and CUIP are combined as a new framework to overcome the sparsity problem in CF-based RS. A schema is developed to generate different levels of sparse input matrices by giving equal importance to all the elements in the input user-item rating matrix; under this schema, the numbers of users, items, and ratings vary in different proportions. Finally, the McNemar test is explained with a graphical representation for better understanding of each component used in the McNemar table.

The remainder of this paper is organized as follows: Section 2 discusses the literature related to CF-based RS and the similarity measures adopted for CF-based RS. Section 3 presents a detailed analysis of the PIP and the issues identified in the PIP expression. Section 4 describes the proposed method, which is a combination of the modified PIP similarity measure and the modified prediction expression. Section 5 presents the experimental results, and Section 6 discusses our conclusions.

2 Related literature

RS is broadly classified into three categories: content-based filtering, collaborative filtering, and hybrid approaches. Content-based filtering predominantly utilizes text-mining concepts; collaborative filtering is further subdivided into model- and memory-based filtering; and the hybrid method combines the text and rating preferences provided by the user [31].

CF is a type of personalized recommendation technique that is widely used in many domains [13, 30, 32]. However, CF also suffers from several issues, for example, the cold-start problem, data sparsity [6, 33, 34], and scalability; these problems considerably reduce the user experience. Memory-based filtering is further divided into user- and item-based methods. In this study, literature related to user-based similarity methods is collected and listed below.

Many studies have been conducted to improve prediction accuracy, resulting in the development of new similarity measures. PCC is often used to compute a linear relationship (i.e., correlation) between a pair of objects; it ranges from −1 to 1, where −1 indicates a negative relationship between the users, the mid-value 0 indicates no relationship, and +1 indicates a strong positive relationship between the users.

$$ sim{\left({u}_j,{u}_h\right)}^{PC\mathrm{C}}=\frac{\sum_{i=1}^{m^{\prime }}\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}{\sqrt{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)}^2}\sqrt{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}^2}} $$
(1)

The COS similarity is a vector space model, which is highly used in information retrieval domains, where the cosine value of the angle between two vectors is used as the similarity between the users. This is calculated by using the following equation:

$$ sim{\left({u}_j,{u}_h\right)}^{COS}=\frac{\sum_{i=1}^{m^{\prime }}\left({r}_{u_j,{I}_i}\ \right)\times \left({r}_{u_h,{I}_i}\right)}{\sqrt{\sum_{i=1}^{m^{\prime }}{r_{u_j,{I}_i}}^2}\times \sqrt{\sum_{i=1}^{m^{\prime }}{r_{u_h,{I}_i}}^2}} $$
(2)
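As an illustration, Eqs. (1) and (2) can be computed as in the following Python sketch (ours, not code from any cited implementation); for simplicity the user means here are taken over the m′ co-rated items, whereas \( {\overline{r}}_{u_j} \) may equally be taken over all items the user rated:

```python
import numpy as np

def pcc(r_j, r_h):
    """Eq. 1: Pearson correlation over the m' co-rated ratings of users j and h."""
    d_j, d_h = r_j - r_j.mean(), r_h - r_h.mean()
    denom = np.sqrt((d_j ** 2).sum()) * np.sqrt((d_h ** 2).sum())
    return float((d_j * d_h).sum() / denom) if denom else 0.0

def cos_sim(r_j, r_h):
    """Eq. 2: cosine of the angle between the two co-rated rating vectors."""
    denom = np.sqrt((r_j ** 2).sum()) * np.sqrt((r_h ** 2).sum())
    return float((r_j * r_h).sum() / denom) if denom else 0.0
```

For example, pcc(np.array([1., 2., 3.]), np.array([2., 4., 6.])) returns 1.0, since the two rating vectors are perfectly linearly related.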

Bobadilla et al. [6] proposed a new similarity measure, the Jaccard Mean Squared Difference (JMSD), which combines the Jaccard similarity and the mean squared difference (MSD) to solve the cold-user problem.

$$ sim{\left({u}_j,{u}_h\right)}^{JACC}=\frac{\left|{I}_{u_j}\right|\cap \left|{I}_{u_h}\right|}{\left|{I}_{u_j}\right|\cup \left|{I}_{u_h}\right|} $$
(3)
$$ sim{\left({u}_j,{u}_h\right)}^{MSD}=1-\left(\frac{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right)}^2}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}\right) $$
(4)
$$ sim{\left({u}_j,{u}_h\right)}^{JMSD}=\left(\frac{\left|{I}_{u_j}\right|\cap \left|{I}_{u_h}\right|}{\left|{I}_{u_j}\right|\cup \left|{I}_{u_h}\right|}\right)\times \left(1-\frac{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right)}^2}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}\ \right) $$
(5)
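A companion sketch for Eqs. (3)-(5) (again our illustration; the ratings are assumed rescaled to [0, 1] so that the MSD term stays in range, as is conventional for MSD):

```python
def jaccard(items_j, items_h):
    """Eq. 3: co-rated items over all items rated by either user."""
    union = items_j | items_h
    return len(items_j & items_h) / len(union) if union else 0.0

def msd(r_j, r_h):
    """Eq. 4: 1 minus the mean squared difference over the co-rated items;
    r_j, r_h are equal-length sequences of co-rated ratings in [0, 1]."""
    return 1.0 - sum((a - b) ** 2 for a, b in zip(r_j, r_h)) / len(r_j)

def jmsd(items_j, items_h, r_j, r_h):
    """Eq. 5: product of the Jaccard and MSD terms."""
    return jaccard(items_j, items_h) * msd(r_j, r_h)
```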

The Spearman Rank Correlation (SRC) is another similarity measure that relies on the rank of the items instead of the rating provided by the users as in the PCC. The rankings are based on the higher to lower-rated items [35]. Ahn [18] has proposed the PIP similarity measure for CF for both agreement and disagreement situations.

Further, this similarity measure is modified into a non-linear model by Liu et al. [26], which is termed as a new heuristic similarity measure (NHSM). It comprises the proximity-significance-singularity (PSS) combined with the modified Jaccard (JACC′) function and the user rating preference (URP).

$$ Proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)=1-\left(\frac{1}{1+\mathit{\exp}\left(-\left|{r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right|\right)}\right) $$
(6)
$$ Significance\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)=\frac{\ 1}{1+\mathit{\exp}\left(-\left|{r}_{u_j,{I}_i}-{R}_{med}\right|\times \left|{r}_{u_h,{I}_i}-{R}_{med}\right|\right)} $$
(7)
$$ Singularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)=1-\left(\frac{1}{1+\mathit{\exp}\left(-\left|\frac{r_{u_j,{I}_i}+{r}_{u_h,{I}_i}}{2}-{\overline{r}}_{I_i}\right|\right)}\right) $$
(8)
$$ PSS\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)= Proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times Significance\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times Singularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(9)
$$ sim{\left({u}_j,{u}_h\right)}^{PSS}={\sum}_{i=1}^{m\prime } PSS\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(10)
$$ sim{\left({u}_j,{u}_h\right)}^{Jacc\prime }=\frac{\left|{I}_{u_j}\cap {I}_{u_h}\right|}{\left|{I}_{u_j}\right|\times \left|{I}_{u_h}\right|} $$
(11)
$$ sim{\left({u}_j,{u}_h\right)}^{URP}=1-\frac{1}{1+\exp \left(-\left|{\overline{r}}_{u_j}-{\overline{r}}_{u_h}\right|\times \left|{\sigma}_{u_j}-{\sigma}_{u_h}\right|\right)} $$
(12)
$$ sim{\left({u}_j,{u}_h\right)}^{NHSM}= sim{\left({u}_j,{u}_h\right)}^{PSS}\times sim{\left({u}_j,{u}_h\right)}^{Jacc^{\prime }}\times sim{\left({u}_j,{u}_h\right)}^{URP} $$
(13)
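To make the composition of Eqs. (6)-(13) concrete, a minimal sketch follows (our own; the dict-based interfaces are an assumption, and item_means maps each item id to its average rating):

```python
import math
from statistics import mean, pstdev

def pss_pair(r1, r2, item_mean, r_med=3.0):
    """Single-pair PSS value, Eqs. 6-9."""
    proximity = 1 - 1 / (1 + math.exp(-abs(r1 - r2)))
    significance = 1 / (1 + math.exp(-abs(r1 - r_med) * abs(r2 - r_med)))
    singularity = 1 - 1 / (1 + math.exp(-abs((r1 + r2) / 2 - item_mean)))
    return proximity * significance * singularity

def nhsm(ratings_j, ratings_h, item_means, r_med=3.0):
    """Eqs. 10-13; ratings_j and ratings_h map item id -> rating."""
    co = ratings_j.keys() & ratings_h.keys()
    sim_pss = sum(pss_pair(ratings_j[i], ratings_h[i], item_means[i], r_med)
                  for i in co)                                   # Eq. 10
    sim_jacc = len(co) / (len(ratings_j) * len(ratings_h))       # Eq. 11
    mu_j, mu_h = mean(ratings_j.values()), mean(ratings_h.values())
    sd_j, sd_h = pstdev(ratings_j.values()), pstdev(ratings_h.values())
    sim_urp = 1 - 1 / (1 + math.exp(-abs(mu_j - mu_h) * abs(sd_j - sd_h)))  # Eq. 12
    return sim_pss * sim_jacc * sim_urp                          # Eq. 13
```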

The Bhattacharyya Coefficient in CF was proposed by Patra et al. [17] by combining global and local similarity measures; the coefficient is treated as the global measure, and the local measure is calculated based on the correlation or cosine similarity.

$$ sim\left({u}_j,{u}_h\right)= JACC\left({u}_j,{u}_h\right)+\sum \limits_{i\epsilon {I}_{u_j}}\sum \limits_{q\epsilon {I}_{u_h}} BC\left({I}_i,{I}_q\right)\times loc\left({r}_{u_j},{r}_{u_h}\right) $$
(14)
$$ BC\left({I}_i,{I}_q\right)={\sum}_{j=1}^n\sqrt{\mathit{\Pr}\left({u}_j,{I}_i\right)\times \mathit{\Pr}\left({u}_h,{I}_q\right)} $$
(15)
$$ loc{\left({u}_j,{u}_h\right)}^{corr}=\frac{\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)\times \left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}{\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)}^2}\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}^2}} $$
(16)
$$ loc{\left({u}_j,{u}_h\right)}^{med}=\frac{\left({r}_{u_j,{I}_i}-{R}_{med}\right)\times \left({r}_{u_h,{I}_i}-{R}_{med}\right)}{\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_j,{I}_i}-{R}_{med}\right)}^2}\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_h,{I}_i}-{R}_{med}\right)}^2}} $$
(17)

Where ‘BC’ is the Bhattacharyya Coefficient and ‘loc’ is the local similarity measure. Eqs. 16 and 17 are used for calculating the local similarity: Eq. 16 uses the mean as a reference, whereas Eq. 17 uses the median as a reference.

Generally, the Jaccard similarity measure only counts the frequency of co-rated items between the users; this is one of its shortcomings. To overcome this issue, Bag et al. [31] proposed the Relevant Jaccard similarity measure (RJACC), which also includes the frequency of un-co-rated items. In addition, a new similarity measure for CF-based RS was evolved by combining RJACC and MSD, termed the Relevant Jaccard Mean Squared Deviation (RJMSD) [31].

$$ sim{\left({u}_j,{u}_h\right)}^{RJACC}=\frac{1}{1+\frac{1}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}+\frac{\left|\overline{I_{u_j}}\right|}{1+\left|\overline{I_{u_j}}\right|}+\frac{1}{1+\left|\overline{I_{u_h}}\right|}} $$
(18)
$$ sim{\left({u}_j,{u}_h\right)}^{RJMSD}=\frac{1}{1+\frac{1}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}+\frac{\left|\overline{I_{u_j}}\right|}{1+\left|\overline{I_{u_j}}\right|}+\frac{1}{1+\left|\overline{I_{u_h}}\right|}}\times \left(1-\frac{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right)}^2}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}\ \right) $$
(19)

Where ‘\( \overline{I_{u_j}} \)’ and ‘\( \overline{I_{u_h}} \)’ are the numbers of un-co-rated items of users ‘j’ and ‘h’. A sub-one quasi-norm-based similarity measure for CF-based RS was introduced to overcome the issues of similarity measures based on the Euclidean distance [7].

$$ sim{\left({u}_j,{u}_h\right)}^{SOQN}={\sum}_{i=1}^{m^{\prime }}1-\left(\frac{{\left|{r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right|}^g}{{\left(\frac{R_{ran}}{2}\right)}^g}\right) $$
(20)

Where ‘Rran’ is the range value, i.e., the deviation between Rmax and Rmin; ‘Rmax’ is the maximum value on the rating scale; ‘Rmin’ is the minimum value on the rating scale; and ‘g’ is a parameter that varies from 0 to 1.

2.1 Drawbacks in the existing similarity measure

PCC and COS are widely used methods in CF-based RS. Their limitations are as follows: the flat-value problem [18, 36], the single-value problem [36], the equal-ratio problem [36], and the opposite-value problem [36]. Cosine similarity is one of the most popular similarity measures in research applications. It is heavily used in text clustering to determine the similarity between two documents [37], and it is also used in the k-means clustering algorithm to compute the similarity between data objects and the corresponding centroids; based on the similarity values, the data objects are grouped into different clusters. It has also been used in a multi-objective function for the Krill Herd algorithm, where it provides higher accuracy than conventional methods [38]. The Jaccard similarity measure is well known as a binary similarity measure; it is mainly adopted to find the similarity between binary variables, and it has provided good solutions in evolutionary algorithms, for example as the fitness function of a genetic algorithm [39]. The research mentioned above clearly shows that cosine- and Jaccard-based similarity measures are widely used in real-life applications to solve various problems. In the case of CF-based RS, rating values (on an ordinal scale) are used. The PCC and COS give equal weight to both the positive and negative sides of a rating scale, which leads to misleading similarity values; this is one of the shortcomings of the PCC- and cosine-based similarity measures. In the Jaccard similarity measure, only the number of co-rated items is considered, not the intensity of the ratings. To overcome these shortcomings, Ahn [18] proposed the PIP similarity measure. However, the magnitude of the PIP values is high, and each component in the PIP measure is treated in different proportions in different scenarios. PSS is an extension of the PIP expression with a non-linear assumption, but the agreement conditions used in the PIP measure are not included in this expression; thus, the correlation between PIP and PSS is very low. NHSM is a combination of the PSS, the modified Jaccard (JACC′), and the User Rating Preference (URP) measure. In URP, the deviation of the means is multiplied by the deviation of the standard deviations; if both users have the same mean but different standard deviations, the standard-deviation term becomes negligible. The BCF similarity measure is a combination of global and local similarity measures, with the Bhattacharyya coefficient treated as the global measure; in this expression, only the number of ratings is used in the calculation, and the co-rated items are not considered. Relevant Jaccard (RJACC) is an extension of the JACC-based similarity measure; it considers only the frequencies of the co-rated and un-co-rated items, and the intensity (i.e., the magnitude) of the ratings is not considered in the similarity computation. SOQN is an improved version of the Euclidean distance; as the number of co-rated items increases, the magnitude of its similarity values also increases, so two users with a higher number of co-rated items receive higher similarity values and vice versa. These are the shortcomings of the existing similarity measures used in CF-based RS. To overcome them, a modified PIP similarity measure is proposed to obtain an improved similarity matrix for CF-based RS.

3 Detailed analysis of PIP

The PIP similarity measure [18] comprises two agreement conditions, which in turn include three components: proximity, impact, and popularity. The similarity between the two users’ ratings, ‘\( {r}_{u_j,{I}_i} \)’ and ‘\( {r}_{u_h,{I}_i} \)’, is calculated as follows:

$$ PIP\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)= proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times impact\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times popularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(21)
$$ sim{\left({u}_j,{u}_h\right)}^{PIP}={\sum}_{i=1}^{m^{\prime }} PIP\left({r}_{u_j,{I}_i},{\mathrm{r}}_{u_h,{I}_i}\right) $$
(22)

The agreement conditions play a vital role in the PIP measure. Compared with other similarity measures, the PIP is the only one that differentiates positive and negative ratings by using agreement conditions; these conditions discriminate between ratings based on the pattern of ratings provided by the two users. Let us consider the set of users U = {u1, u2, …, un} and the set of items I = {I1, I2, …, Im}, where ‘n’ is the total number of users and ‘m’ is the total number of items. They are associated with a rating matrix called the ⟨user × item⟩ rating matrix:

$$ \begin{array}{c}\ \\ u_1 \\ \vdots \\ u_n \end{array}\begin{array}{c}\begin{array}{ccc}I_1 & \cdots & I_m\end{array}\\ \left[\begin{array}{ccc}r_{u_1,I_1} & \cdots & r_{u_1,I_m}\\ \vdots & \ddots & \vdots \\ r_{u_n,I_1} & \cdots & r_{u_n,I_m}\end{array}\right]\end{array} $$
(23)

To understand better, let ‘r1’ and ‘r2’ be two ratings, where r1 is the rating provided by user ‘j’ for item ‘i’ and r2 is the rating provided by user ‘h’ for the same item ‘i’. ‘Rmax’ is the maximum rating on the rating scale and ‘Rmin’ is the minimum; further, let \( {R}_{med}=\frac{R_{max}+{R}_{min}}{2} \). If both ratings ‘r1’ and ‘r2’ are less than, or both greater than, ‘Rmed’, then Agreement(r1, r2) is TRUE. Conversely, if the ratings lie in opposite directions, i.e., one rating is greater than ‘Rmed’ and the other is less than ‘Rmed’, the situation is FALSE. The main use of these agreement conditions is to discriminate between the ratings. A Boolean function for the agreement conditions of ratings ‘r1’ and ‘r2’ is defined as follows:

$$ {\displaystyle \begin{array}{c} Agreement\left({r}_1,{r}_2\right)= FALSE\\ {} if\ \left({r}_1>{R}_{med}\ and\ {r}_2<{R}_{med}\right)\ or\ \left({r}_1<{R}_{med}\ and\ {r}_2>{R}_{med}\right)\ \end{array}} $$
(24)
$$ {\displaystyle \begin{array}{c} Agreement\left({r}_1,{r}_2\right)= TRUE\\ {} otherwise\end{array}} $$
(25)
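As a concrete illustration (ours, not from the original paper), the Boolean test of Eqs. (24)-(25) can be written as:

```python
def agreement(r1, r2, r_min=1, r_max=5):
    """Eqs. 24-25: FALSE only when the two ratings fall on strictly
    opposite sides of the scale median R_med; TRUE otherwise."""
    r_med = (r_max + r_min) / 2.0
    return not ((r1 > r_med and r2 < r_med) or (r1 < r_med and r2 > r_med))
```

On a 1 to 5 scale, agreement(4, 5) and agreement(3, 2) are TRUE, while agreement(2, 4) is FALSE.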

The pictorial representation of the agreement conditions, with r1 as the reference, is shown in Fig. 1 (a rating scale of 1 to 5 is considered for plotting this figure). The vertical lines denote the ratings inside the plot: a vertical black line indicates that the Agreement of r1 and r2 is TRUE, and a vertical red line denotes that the Agreement of r1 and r2 is FALSE.

Fig. 1
figure 1

Graphical representation of Agreement conditions

If both users provide similar kinds of ratings, then Agreement(r1, r2) = TRUE; the minimum and maximum deviations of the ratings in this situation are 0 and 2, which shows that both users agree on the rating values. If the two users provide different kinds of ratings for the same item, then Agreement(r1, r2) = FALSE; the minimum deviation in this situation is two and the maximum deviation is four. This clearly shows that the deviation between the two ratings provides useful information for calculating the similarity between the users. The agreement condition helps to differentiate users based on the variation of the ratings provided for the co-rated item.

3.1 Proximity

Generally, proximity is used to determine the closeness of two objects. Here, proximity is based on the absolute difference between the two ratings r1 and r2.

$$ {\displaystyle \begin{array}{c}D\left({r}_1,{r}_2\right)=\left|{r}_1-{\mathrm{r}}_2\right|\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(26)
$$ {\displaystyle \begin{array}{c}D\left({r}_1,{r}_2\right)=2\times \left|{r}_1-{r}_2\right|\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(27)
$$ proximity\left({r}_1,{r}_2\right)={\left\{\left(2\times \left({R}_{max}-{R}_{min}\right)+1\right)-D\left({r}_1,{r}_2\right)\right\}}^2 $$
(28)

Figure 2(a) is the graphical representation of the proximity values plotted for the rating scale of 1 to 5. The minimum proximity value for the Agreement(r1, r2) = TRUE condition is 49, which occurs when the two ratings are unequal (r1 ≠ r2); if both ratings are the same (i.e., r1 = r2), the proximity value is 81. Similarly, for the Agreement(r1, r2) = FALSE condition, the minimum value is 1 and the maximum value is 25.

Fig. 2
figure 2

Graphical comparison of PIP values for rating scale 1–5 (a) proximity values; (b) impact values; (c) popularity values

3.2 Impact

The impact is the second critical component computed in the PIP formula; it indicates how strongly the user preferred or disliked the particular item.

$$ {\displaystyle \begin{array}{c} Impa\mathrm{c}t\left({r}_1,{r}_2\right)=\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(29)
$$ {\displaystyle \begin{array}{c} Impact\left({r}_1,{r}_2\right)=\frac{1}{\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)}\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(30)

The impact lies in the range 0.11 to 9. If Agreement(r1, r2) = TRUE, the minimum impact is 1 and the maximum is 9; if Agreement(r1, r2) = FALSE, the minimum is 0.11 and the maximum is 0.25. The impact values for different combinations are shown in Fig. 2(b).

3.3 Popularity

Popularity is calculated based on the deviation from the average rating of the item ‘i’. Let ‘\( {\overline{r}}_{I_i} \)’ be the average rating of item ‘i’ over all users.

$$ {\displaystyle \begin{array}{c} Popularity\left({r}_1,{r}_2\right)=1+{\left(\left(\frac{r_1+{r}_2}{2}\right)-{\overline{r}}_{I_i}\right)}^2\\ {} if\ \left({r}_1>{\overline{r}}_{I_i}\ and\ {r}_2>{\overline{r}}_{I_i}\ or\ {r}_1<{\overline{r}}_{I_i}\ and\ {r}_2<{\overline{r}}_{I_i}\ \right)\end{array}} $$
(31)
$$ {\displaystyle \begin{array}{c} Popularity\left({r}_1,{r}_2\right)=1\\ {} otherwise\end{array}} $$
(32)

For the popularity factor, if Agreement(r1, r2) = TRUE, the minimum value is 1 and the maximum value is 5 (for computation purposes, the value of \( {\overline{r}}_{I_i} \) is taken as 3). If Agreement(r1, r2) = FALSE, both the minimum and maximum values are 1. The popularity values for different combinations are represented in Fig. 2(c).

The graphical comparison indicates that proximity has higher-magnitude values than impact and popularity in the Agreement(r1, r2) = TRUE condition. In the Agreement(r1, r2) = FALSE condition, the impact value is very small compared with the other two components, so the similarity value depends more heavily on the proximity and popularity values than on the impact. A detailed explanation is given in the following sub-section.
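Putting Eqs. (26)-(32) together, the per-pair PIP value can be sketched as follows (our illustration, reusing the agreement helper from above; item_mean stands for \( {\overline{r}}_{I_i} \)):

```python
def pip(r1, r2, item_mean, r_min=1, r_max=5):
    """Eqs. 21 and 26-32: PIP value of one rating pair."""
    r_med = (r_max + r_min) / 2.0
    agree = agreement(r1, r2, r_min, r_max)
    d = abs(r1 - r2) if agree else 2 * abs(r1 - r2)          # Eqs. 26-27
    proximity = ((2 * (r_max - r_min) + 1) - d) ** 2         # Eq. 28
    impact = (abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1)   # Eq. 29
    if not agree:
        impact = 1.0 / impact                                # Eq. 30
    same_side = ((r1 > item_mean and r2 > item_mean) or
                 (r1 < item_mean and r2 < item_mean))
    popularity = 1 + ((r1 + r2) / 2.0 - item_mean) ** 2 if same_side else 1.0
    return proximity * impact * popularity
```

With item_mean = 3, pip(5, 5, 3) returns 3645 and pip(1, 5, 3) returns about 0.11, matching the extremes discussed in Section 3.4.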

3.4 Issues identified in PIP

In PIP, each component value lies in a wide range, and proximity, impact, and popularity are treated in different proportions. The calculation is listed for the rating scale 1 to 5. The maximum value is calculated by considering r1 as 5 and r2 as 5 for the Agreement(r1, r2) = TRUE situation. The minimum value is computed by considering r1 as 1 and r2 as 5 for the Agreement(r1, r2) = FALSE condition, which corresponds to the extreme ratings provided by the two users (uj, uh) for item ‘Ii’.

The minimum and maximum proximity values for Agreement(r1, r2) = TRUE are 49 and 81, respectively; if Agreement(r1, r2) = FALSE, the minimum and maximum values are 1 and 25. The range value for proximity is 80. If Agreement(r1, r2) = TRUE, the minimum and maximum impact values are 1 and 9; if Agreement(r1, r2) = FALSE, they are 0.11 and 0.25, respectively. The impact range is 8.89. If Agreement(r1, r2) = TRUE, the minimum and maximum values for popularity are 1 and 5; if Agreement(r1, r2) = FALSE, the minimum and maximum values are both 1. The range for popularity is 4.

If the ‘uj’ rating is 5 and the ‘uh’ rating is 5, all three components take their highest values: 81 for proximity, 9 for impact, and 5 for popularity. The PIP is the product of proximity, impact, and popularity, so the resultant value is 3645; in this calculation, proximity carries a far greater weight than impact and popularity, and each component contributes in a different proportion. Similarly, for the worst scenario, i.e., if the ‘uj’ rating is 5 and the ‘uh’ rating is 1, then proximity is 1, impact is 0.11, and popularity is 1; the final ‘PIP(r1, r2)’ value is 0.11, which is very small. Here, proximity and popularity have equal weight, but the impact has a much smaller value. The maximum PIP is 3645 and the minimum PIP is 0.11, a very large gap between the minimum and maximum values. We conclude that each factor provides important information for computing the similarity between users; yet, their different value ranges mean that in any given scenario one component can dominate. This unequal scaling provides different ranges of values for different scenarios, and the component values are non-normalized. If a direct normalization procedure is adopted to convert the values into a particular range, the values change from scenario to scenario, which leads to lower prediction accuracy [40]; this constitutes a major drawback of the existing PIP similarity measure.

4 Proposed method

Our proposed framework aims to provide an improved solution for the CF-based RS with a sparse data matrix. It consists of a modified PIP (MPIP) measure and a combined user and item-based prediction (CUIP) expression.

4.1 Modified PIP (MPIP) similarity measure

In the existing PIP expression, each component takes different values in different scenarios, and greater priority can be given to any one of the components depending on the situation. To avoid this, the component ranges are converted into zero to one in our modified similarity measure by changing the expressions.

4.1.1 Proposed proximity

The proposed proximity is a normalized value that ranges from zero to one, calculated using the absolute deviation between the two ratings r1 and r2.

$$ D\left({r}_1,{r}_2\right)=\left|{r}_1-{r}_2\right| $$
(33)
$$ {\displaystyle \begin{array}{c} Proposed\ Proximity\left({r}_1,{r}_2\right)={\left(\frac{D\left({r}_1,{r}_2\right)-\frac{\left({med}^{+}+{med}^{-}\right)}{2}}{R_{max}-{R}_{min}}\right)}^2\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(34)

Where med+ is the median value of the positive ratings (i.e., the ratings above or equal to the median value of the rating scale) and med− is the median value of the negative ratings (i.e., the ratings below the median value of the rating scale). In the Agreement(r1, r2) = TRUE condition, the average of the positive and negative median values is subtracted from the absolute difference between the two ratings. To obtain a normalized value within the range of 0 to 1, the deviation between Rmax and Rmin is used. The positive and negative median values are included in the expression to capture the closeness of the ratings.

$$ {\displaystyle \begin{array}{c} Proposed\ Proximity\left({r}_1,{r}_2\right)=\delta \ast {\left(\frac{\frac{1}{D\left({r}_1,{r}_2\right)}}{R_{max}-{R}_{min}}\right)}^2\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(35)

For Agreement(r1, r2) = FALSE, the inverse of the deviation term is used to calculate the proximity value. Our reframed expression gives higher weight to the Agreement(r1, r2) = TRUE condition and lower weight to the Agreement(r1, r2) = FALSE condition. In the FALSE condition, the positive and negative median values are not included because the two users lie on different sides of the rating scale. In both situations, the values lie within the range of 0 to 1.

$$ \delta =\left\{\begin{array}{c}0.75\kern2.5em if\ D\left({r}_1,{r}_2\right)>{R}_{med}\ \\ {}0.5\kern1.25em else\ if\ D\left({r}_1,{r}_2\right)={R}_{med}\\ {}0.25\kern6.5em otherwise\end{array}\right. $$
(36)

In MPIP, a variable penalty (δ) is multiplied with the proximity value: a higher penalty is applied for a higher deviation value and a lower penalty for a lower deviation value. This conversion reduces the magnitude of the value.

4.1.2 Proposed impact

An exponential expression is used to compute the impact value for the Agreement(r1, r2) = TRUE case; this yields a normalized impact value for the TRUE situations.

$$ {\displaystyle \begin{array}{c} proposed\ impact\left({r}_1,{r}_2\right)=\mathit{\exp}\left(-\frac{1}{\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)}\right)\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(37)

If r1 and r2 are far from the median, the impact value is higher. When both ratings are near the median, the impact is lower, which shows that both users agree on near-median values, so the impact of the ratings is very small.

$$ {\displaystyle \begin{array}{c} proposed\ impact\ \left({r}_1,{r}_2\right)=\frac{1}{\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)}\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(38)

In the existing PIP measure, the impact values for the Agreement(r1, r2) = FALSE condition already lie within the range of zero to one. Therefore, the same expression is used in the proposed impact for the FALSE condition.

4.1.3 Proposed popularity

Popularity is the third component in the PIP similarity measure; the proposed expression accounts for both positive and negative popularity.

$$ {\displaystyle \begin{array}{c} proposed\ popularity\left({r}_1,{r}_2\right)={\mathit{\log}}_{10}\left(2+{\left(\frac{r_1+{r}_2}{2}-{\overline{r}}_{I_i}\right)}^2\right)\\ {} if\ \left({r}_1>{\overline{r}}_{I_i}\ and\ {r}_2>{\overline{r}}_{I_i}\ or\ {r}_1<{\overline{r}}_{I_i}\ and\ {r}_2<{\overline{r}}_{I_i}\right)\end{array}} $$
(39)
$$ {\displaystyle \begin{array}{c} Proposed\ popularity\left({r}_1,{r}_2\right)=0.3010\\ {} otherwise\end{array}} $$
(40)

If the two ratings are on the same side of the rating scale and their average is far from the item mean, then the popularity is very high: when two users provide similar kinds of ratings for a popular or unpopular item, they have a high similarity in the type of rating. In the existing method, the minimum value for the Agreement(r1, r2) = TRUE situation is 1, and on this basis the popularity value is set to 1 for the Agreement(r1, r2) = FALSE condition. In the proposed method, the minimum value for Agreement(r1, r2) = TRUE is 0.3010, so this value is chosen for all Agreement(r1, r2) = FALSE conditions. The proposed popularity values range from 0.3010 to 0.778 for the 1 to 5 rating scale, and the minimum popularity value is chosen for all Agreement(r1, r2) = FALSE situations.

According to Fig. 3, the proposed proximity, impact, and popularity values all lie within the range of zero to one, in contrast to the existing method, which gives a different weight to each component.

Fig. 3
figure 3

Graphical representation of MPIP the values are calculated for rating scale 1–5 (a) proposed proximity (b) proposed impact and (c) proposed popularity

4.1.4 Similarity measure computation

The MPIP expression is the product of proposed proximity, proposed impact, and proposed popularity, as shown in Eq. 41

$$ MPIP\left({r}_{u_j,{\mathrm{I}}_i},{r}_{u_h,{I}_i}\right)=\left\{\begin{array}{c} Proposed\ proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times \\ {} Proposed\ impact\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times \\ {} Proposed\ popularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\end{array}\right\} $$
(41)

The similarity between users is computed as follows:

$$ sim{\left({u}_j,{u}_h\right)}^{uMPIP}={\sum}_{i=1}^{m^{\prime }} MPIP\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(42)

The similarity between ‘uj’ and ‘uh’, sim(uj, uh), is the item-wise summation of the MPIP values over the m′ co-rated items.

The same procedure is used for computing the similarity between items, which helps to find similar items. A high similarity value shows that most users provide a similar pattern of ratings for the two items; likewise, a low similarity value shows that most users provide a highly positive rating for one item and a highly negative rating for the other.

$$ sim{\left({I}_i,{I}_q\right)}^{iMPIP}={\sum}_{j=1}^{n^{\prime }} MPIP\left({r}_{u_j,{I}_i},{r}_{u_j,{I}_q}\right) $$
(43)
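For clarity, Eqs. (33)-(42) can be sketched as follows (our illustration, reusing the agreement helper from Section 3; med_pos and med_neg are the medians of the positive and negative halves of the scale, taken here as 4 and 2 on a 1 to 5 scale to match the worked cases below):

```python
import math

def mpip(r1, r2, item_mean, r_min=1, r_max=5, med_pos=4, med_neg=2):
    """Eqs. 33-41: MPIP value of one rating pair."""
    r_med = (r_max + r_min) / 2.0
    d = abs(r1 - r2)                                                    # Eq. 33
    if agreement(r1, r2, r_min, r_max):
        proximity = ((d - (med_pos + med_neg) / 2.0) / (r_max - r_min)) ** 2  # Eq. 34
        impact = math.exp(-1.0 / ((abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1)))  # Eq. 37
    else:
        # d > 0 whenever the agreement is FALSE, so 1/d is safe.
        delta = 0.75 if d > r_med else (0.5 if d == r_med else 0.25)    # Eq. 36
        proximity = delta * ((1.0 / d) / (r_max - r_min)) ** 2          # Eq. 35
        impact = 1.0 / ((abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1))  # Eq. 38
    same_side = ((r1 > item_mean and r2 > item_mean) or
                 (r1 < item_mean and r2 < item_mean))
    popularity = (math.log10(2 + ((r1 + r2) / 2.0 - item_mean) ** 2)    # Eq. 39
                  if same_side else 0.3010)                             # Eq. 40
    return proximity * impact * popularity

def sim_users_mpip(ratings_j, ratings_h, item_means):
    """Eq. 42: item-wise sum of MPIP values over the co-rated items;
    ratings_j and ratings_h map item id -> rating."""
    co = ratings_j.keys() & ratings_h.keys()
    return sum(mpip(ratings_j[i], ratings_h[i], item_means[i]) for i in co)
```

With item_mean = 3, mpip(5, 5, 3) gives about 0.39 and mpip(1, 5, 3) about 0.0001, matching Cases 1 and 3 below; Eq. (43) is obtained by summing the same per-pair values over the users who rated both items.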

A comparison of the minimum and maximum values is shown in Table 2.

Table 2 Comparison of PIP and MPIP similarity measures showing the minimum and maximum value for each component

Case 1: Positive or Negative rating.

Both uj and uh give an extreme positive or negative rating, i.e., an extreme positive rating means both users uj and uh give rating 5 for item ‘i’, and an extreme negative rating means both users give rating 1 for item ‘i’. The existing PIP method provides a proximity value of 81, an impact of 9, a popularity of 5, and a PIP value of 3645; approximately 85% of the weight comes from the proximity component. When using the MPIP, the proposed proximity is 0.56, the proposed impact is 0.89, the proposed popularity is 0.778, and the MPIP value is 0.389.

Case 2: Median rating.

If uj and uh both give the median rating, i.e., both users give rating 3 for item ‘i’, then the proximity is 81, the impact is 1, the popularity is 1, and the PIP value is 81. For the MPIP, the proposed proximity is 0.56, the proposed impact is 0.50, the proposed popularity is 0.301, and the MPIP value is 0.084.

Case 3: Difference of opinion.

If uj and uh provide opposite extreme ratings for item ‘i’, i.e., uj gives rating 1 and uh gives rating 5 or vice versa, then the proximity is 1, the impact is 0.11, the popularity is 1, and the PIP value is 0.11. For the MPIP, the proposed proximity is 0.002, the proposed impact is 0.11, the proposed popularity is 0.301, and the MPIP value is 0.0001. The MPIP value is near zero in this difference-of-opinion situation, i.e., very little weight is given to the ratings because they lie at the extremes of the rating scale and no relationship exists between the two users. This comparison proves that the MPIP values lie within a range of 0 to 1, whereas the existing method has a vast range.

4.2 Validation for MPIP similarity measure

In the MPIP, each component value lies between 0 and 1. To validate the proposed similarity measure (i.e., the modified PIP), a rank correlation test is conducted. A set of rating pairs is generated for the rating scale 1 to 5, and the PIP, PSS, and MPIP values are computed for each pair. The proximity, impact, and popularity values of PIP and MPIP, as well as the PSS values, generated for each pair of ratings are listed in Table 3.

Table 3 Comparison of PIP, MPIP, and PSS values for a set of rating pairs

The magnitude of the values changes in the case of MPIP. In the PSS method, each component ranges from zero to one, but the relative ordering of the values changes because PSS gives equal weight to both agreement and disagreement conditions. The rank correlation test is conducted for this set of values; the results are shown in Table 4.

Table 4 Correlation between MPIP, PIP, and PSS

The correlation between the PIP and MPIP components is one, which shows that a strong positive relationship exists between them. Owing to the violation of the agreement conditions, the PSS measure has much lower correlation values: the correlation between proximity and PSS proximity is high, but for impact and popularity the values are minimal. Moreover, a negative relationship exists between impact and the PSS significance term, and similarly between popularity and singularity. These results clearly show that violating the agreement conditions yields misleading information about the similarity between two ratings; for this reason, the same agreement conditions are used in our proposed method. The high correlation between the PIP and MPIP components reveals that, although the magnitudes change in our proposed method, the equivalent proportions are retained.

4.3 Existing prediction expression

The main objective of the CF-based RS is to predict user ratings based on the similarity measure. Many expressions are used in the prediction process. The user mean-based prediction expression, a widely used method that provides a good solution for CF-based RS [5, 18, 36, 41,42,43], is as follows:

$$ {P}_{u_j,{I}_i}={\overline{r}}_{u_j}+\frac{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)} $$
(44)

In this expression, ‘\( {P}_{u_j,{I}_i} \)’ is computed from the mean of user ‘j’ plus the weighted average deviation of the neighbor users ‘h’; the weighted average deviation of user ‘j’ itself is not considered in the prediction process. Moreover, only user-related information is used in the weighted deviation; item-related information is not included in the prediction expression.

From Fig. 4, the user-related mean is considered for user-based prediction, and the item-related mean for item-based prediction. In a sparse user-item rating matrix, both the mean and the deviation provide important information for the rating prediction; this is one of the shortcomings of the conventional user-related or item-related prediction expressions.

Fig. 4
figure 4

Issues in user and item-based prediction expression

4.3.1 Combined user and item-based prediction expression (CUIP)

A modified prediction expression is derived by incorporating the components related to user ‘j’ in the user-based prediction; similarly, the components related to item ‘i’ are included in the item-based prediction.

Ma and Hu [44] used a hybrid prediction expression for predicting the rating. In this expression, an additional weight parameter ‘λ’ is multiplied by the user-based prediction, and the weight ‘1 − λ’ is multiplied by the item-based prediction; ‘λ’ varies between 0 and 1. If ‘λ’ equals one, the prediction depends purely on the user-based prediction; if ‘λ’ equals zero, it depends purely on the item-based prediction. Thus, ‘λ’ is an additional parameter that must be optimized for each problem.

In the modified prediction expression, the average of the user-based and item-based predictions is taken, giving equal weight to both expressions. The modified prediction expression is shown below.

$$ {UP}_{u_j,{I}_i}={\overline{r}}_{u_j}+\frac{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)\times \left(\frac{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)+\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)}{2}\right)}{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)} $$
(45)

Where ‘UP’ is the user-based prediction, j ≠ h, j ∈ {1, 2, …, n}, h ∈ {1, 2, …, n′}, and n′ is the number of neighbor users ‘h’ who have co-rated items with user ‘j’.

$$ {IP}_{u_j,{I}_i}=\overline{r_{I_i}}+\frac{\sum_{q=1}^{m^{\prime }} sim\left({I}_i,{I}_q\right)\times \left(\frac{\left({r}_{u_j,{I}_q}-{\overline{r}}_{I_q}\right)+\left({r}_{u_j,{I}_i}-{\overline{r}}_{I_i}\right)}{2}\right)}{\sum_{q=1}^{m^{\prime }} sim\left({I}_i,{I}_q\right)} $$
(46)

Where ‘IP’ is the item-based prediction, i ≠ q, i ∈ {1, 2, …, m}, q ∈ {1, 2, …, m′}, and m′ is the number of items, other than ‘i’, rated by user ‘j’.

$$ {CUIP}_{u_j,{I}_i}=\left(\frac{\left({UP}_{u_j,{I}_i}\right)+\left({IP}_{u_j,{I}_i}\right)}{2}\right) $$
(47)

The similarity values and the deviations play a vital role in the prediction expression. The existing prediction expressions use either the user-related mean with its weighted average deviation or the item-related mean with its weighted average deviation. The user-related mean \( {\overline{r}}_{u_j} \) and the item-related mean \( {\overline{r}}_{I_i} \) are calculated for user ‘j’ and item ‘i’, but the corresponding deviations are not included; this is one of the shortcomings of the existing prediction expression. In the modified prediction expression (CUIP), this shortcoming is removed by including the user- and item-related deviation terms.

Forecasting is a vital process in CF-based RS; the forecast values help a company to identify potential customers and promote sales. In CUIP, if the user has rated the particular item, i.e., \( {r}_{u_j,{I}_i}\ne \varnothing \), the deviation terms can be computed directly. If the rating is unavailable, i.e., \( {r}_{u_j,{I}_i}=\varnothing \), computing the user- and item-related deviations becomes complex. To avoid this situation, the average of the user (uj) related mean and the item (Ii) related mean is substituted for the unavailable rating. For an unavailable rating, the user-based prediction is calculated using Eq. (48).

$$ {UP}_{u_j,{I}_i}={\overline{r}}_{u_j}+\frac{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)\times \left(\frac{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)+\left(\frac{{\overline{r}}_{u_j}+{\overline{r}}_{I_i}}{2}-{\overline{r}}_{u_j}\right)}{2}\right)}{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)} $$
(48)

Similarly, the item-based prediction is calculated using Eq. (49)

$$ I{P}_{u_j,{I}_i}={\overline{r}}_{I_i}+\frac{\sum_{q=1}^{m\prime } sim\left({I}_i,{I}_q\right)\times \left(\frac{\left({r}_{u_j,{I}_q}-{\overline{r}}_{I_q}\right)+\left(\left(\frac{{\overline{r}}_{u_j}+{\overline{r}}_{I_i}}{2}\right)-{\overline{r}}_{I_i}\right)}{2}\right)}{\sum_{q=1}^{m\prime } sim\left({I}_i,{I}_q\right)} $$
(49)

Eq. (47) is used for computing the final prediction value for user ‘j’ and item ‘i’. The combination of the modified PIP (MPIP) similarity measure and the modified prediction expression (CUIP) is treated as the proposed method. The algorithm for the proposed method is sketched below:

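A minimal Python sketch of the CUIP prediction step (our reconstruction from Eqs. (45)-(49) and the description above, not the paper’s own listing; R is an n × m matrix with np.nan marking unavailable ratings, and sim_u, sim_i are the MPIP similarity matrices of Eqs. (42)-(43)):

```python
import numpy as np

def cuip_predict(R, sim_u, sim_i, j, i):
    """Eqs. 45-49: combined user- and item-based prediction for user j, item i."""
    u_mean = np.nanmean(R, axis=1)          # per-user means
    i_mean = np.nanmean(R, axis=0)          # per-item means
    # Target-cell deviation term; Eqs. 48-49 substitute the average of the
    # user and item means when r_{u_j, I_i} is unavailable.
    r_ji = R[j, i] if not np.isnan(R[j, i]) else (u_mean[j] + i_mean[i]) / 2.0

    num = den = 0.0                         # user-based prediction, Eq. 45 / 48
    for h in range(R.shape[0]):
        if h != j and not np.isnan(R[h, i]):
            dev = ((R[h, i] - u_mean[h]) + (r_ji - u_mean[j])) / 2.0
            num += sim_u[j, h] * dev
            den += sim_u[j, h]
    up = u_mean[j] + (num / den if den else 0.0)

    num = den = 0.0                         # item-based prediction, Eq. 46 / 49
    for q in range(R.shape[1]):
        if q != i and not np.isnan(R[j, q]):
            dev = ((R[j, q] - i_mean[q]) + (r_ji - i_mean[i])) / 2.0
            num += sim_i[i, q] * dev
            den += sim_i[i, q]
    ip = i_mean[i] + (num / den if den else 0.0)

    return (up + ip) / 2.0                  # Eq. 47
```

Sorting each user’s predicted ratings for unrated items in descending order then yields the top-k recommendations, as described below.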

The block diagram for our proposed framework is shown in Fig. 5.

Fig. 5
figure 5

The proposed framework for CF-based RS

The proposed framework consists of two phases: the first phase is the computation of the similarity matrices (user similarity and item similarity), and the second phase is the prediction process. Initially, an input user-item rating matrix is required for computing the similarity values. The similarity between users is computed by using the MPIP expression (Eq. (42)); similarly, the item similarity is computed by using the MPIP expression (Eq. (43)). Both the user- and item-related similarity values provide valuable information for rating prediction. For available ratings, Eqs. (45), (46), and (47) are used to compute the prediction values; for unavailable ratings, Eqs. (48), (49), and (47) are adopted. Based on the predicted values, the items are sorted in descending order, and the top ‘k’ items are recommended to the users.

5 Experiments

Datasets such as MovieLens1MB, Netflix, Epinions, CiaoDVD, MovieTweet, FilmTrust, and MovieLens100KB are used for comparing the conventional and proposed methods. We used MovieLens100KB and Netflix, which were used in [18], and other benchmark datasets that are often used by researchers for CF-based RS. To validate our proposed method, eleven state-of-the-art methods (PCC, COS, JACC, JMSD, PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN) are used for comparison.

5.1 Characteristics of sub-problems generated from the dataset

The dataset descriptions (number of users, number of items, number of ratings, and sparsity) are listed in Table 5. The datasets are arranged by sparsity rank (from higher to lower sparsity).

Table 5 Summary of the dataset used for this study

[a] http://grouplens.org/datasets/movielens/1m

[b] http://grouplens.org/datasets/movielens/100k

[c] http://www.netflixprize.com

[d] http://www.trustlet.org/downloaded epinions.html.

[e] http://www.librec.net/datasets/CiaoDVD.zip

[f] https://github.com/sidooms/MovieTweetings

[g] http://www.librec.net/datasets/filmtrust.zip

The above-mentioned datasets are large, and computing the similarity matrix and predicting the ratings for datasets of this size entails high computational complexity. Because of this, subsets with different levels of sparsity \( \left(\left(1-\left(\frac{\#R}{n\ast m}\right)\right)\ast 100\right) \) are generated from the user-item rating matrix to validate the proposed method; a small helper for this sparsity computation is sketched below. The schema used for creating the different sub-problems is shown in Fig. 6.
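The sparsity formula made explicit (variable names are ours):

```python
def sparsity_percent(num_ratings, n_users, m_items):
    """Sparsity of a sub-problem: (1 - #R / (n * m)) * 100."""
    return (1.0 - num_ratings / (n_users * m_items)) * 100.0
```

For example, 100,000 ratings over 943 users and 1,682 items (the usual ML100KB shape) give a sparsity of about 93.7%.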

Fig. 6
figure 6

Schema for generating different sub-problems

The schema comprises four levels: level 1 relates to the dataset, level 2 to the users, level 3 to the items, and level 4 to the ratings. The user level consists of two sub-levels: in one, the users vary without any constraints; in the other, the users are restricted to 25%, 50%, and 75% of the users in the dataset. The item level likewise consists of two sub-levels: in one, the items vary without any constraints; in the other, the items are restricted to 25%, 50%, and 75% of the items in the dataset. The final level concerns the ratings and varies the number of ratings from 1% to 50% of the dataset in increments of 1%.

For each dataset, one can create 800 different sub-problems by using the above schema. The number of sub-problems created by each path is shown in Table 6.

Table 6 Number of sub-problems generated by each path

5.2 Limitation of the above schema

In this schema, the maximum percentage of users and items is 75%. For this specific combination, it is difficult to create sub-problems for all the rating levels because the required percentage of ratings (1% to 50%) cannot be extracted for the higher-order matrices. For all the sub-problems, it is ensured that each user and item has a minimum of two co-rated values. In each sub-problem, it is challenging to generate exactly 1% to 50% of the ratings; to overcome this, an approximate percentage of the rating values is extracted. For each dataset, this results in fewer than 800 sub-problems, as shown in Table 6.

Table 7 provides the number of feasible sub-problems created for each dataset using the schema.

Table 7 Number of feasible sub-problems

5.3 Performance criteria

The Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are the most commonly used evaluation criteria for validating CF-based RS performance [45, 46].

MAE is the mean of absolute deviation between the actual and the predicted rating.

$$ MAE=\frac{1}{n}{\sum}_{j=1}^n\frac{1}{m}{\sum}_{i=1}^m\left|{r}_{u_j,{I}_i}-{P}_{u_j,{I}_i}\right| $$
(50)

Where ‘n’ is the total number of users and ‘m’ is the total number of items; ‘\( {P}_{u_j,{I}_i} \)’ is the predicted rating for the jth user and ith item, \( {r}_{u_j,{I}_i} \) is the actual rating for the jth user and ith item, j ∈ {1, 2, …, n}, and i ∈ {1, 2, …, m}.

RMSE is the square root of the mean squared deviation between the actual and predicted ratings; it is also known as the standard deviation of the residuals, or the forecasting error. The formula is given below:

$$ RMSE=\frac{1}{n}{\sum}_{j=1}^n\sqrt{\frac{1}{m}{\sum}_{i=1}^m{\left({r}_{u_j,{I}_i}-{P}_{u_j,{I}_i}\right)}^2} $$
(51)

Lower MAE and RMSE values indicate the better method. The predicted ratings (\( {P}_{u_j}{,}_{I_i} \)) for the existing similarity measures (PCC, COS, JACC, JMSD, PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN) are calculated by using Eq. (44); the predicted ratings for the proposed method are calculated by using Eqs. (45)-(47). Finally, the predicted ratings are rounded off to the nearest integer for both the existing and proposed methods.
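A sketch of the two criteria (ours; MAE is averaged over all rated cells for simplicity, while RMSE follows Eq. (51) as a per-user RMSE averaged over the users):

```python
import numpy as np

def mae_rmse(R_true, R_pred):
    """Eqs. 50-51 over the rated cells; np.nan marks unrated cells."""
    mask = ~np.isnan(R_true)
    mae = float(np.abs(R_true - R_pred)[mask].mean())
    per_user = [np.sqrt(((R_true[j][mask[j]] - R_pred[j][mask[j]]) ** 2).mean())
                for j in range(R_true.shape[0]) if mask[j].any()]
    return mae, float(np.mean(per_user))
```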

5.4 Friedman rank test

The Friedman rank test is a non-parametric test for finding differences among methods across multiple replications [47, 48]; it does not assume that the data come from a particular distribution (e.g., the normal distribution). The Friedman statistic is calculated by using the following equation:

$$ {F}_r=\frac{12\ast TS}{l\left(l+1\right)}\left({\sum}_{y=1}^l{R}_y^2-\frac{l{\left(l+1\right)}^2}{4}\right) $$
(52)

Where Fr is the calculated Friedman value, \( {R}_y^2 \) is the squared average rank of method ‘y’ (1 ≤ y ≤ l), ‘l’ is the total number of methods, ‘sp’ indexes the sub-problems, sp ∈ {1, 2, …, TS}, and ‘TS’ is the total number of sub-problems generated. The hypotheses framed for this test are: the null hypothesis (H0) states that no significant difference occurs between the performances of the different methods, whereas the alternative hypothesis (H1) states that a significant difference occurs between the performances of the different methods [49].

5.5 McNemar’s test

McNemar’s test is a non-parametric test used to analyze whether the performance of two methods has a statistically significant difference [50, 51]. The contingency table used to compare the two methods is given in Table 8.

Table 8 Contingency table to compare two methods

Where ‘A’ is the number of ratings correctly predicted by both Methods 1 and 2, B is the number of ratings correctly predicted by Method 1 but incorrectly predicted by Method 2, C is the number of ratings correctly predicted by Method 2 but incorrectly predicted by Method 1, and D is the number of ratings incorrectly predicted by both methods.

$$ {\chi}^2=\frac{{\left(\left|B-C\right|-1\right)}^2}{B+C} $$
(53)

The null hypothesis (H0) for this χ2 test is that the probabilities Pr(B) and Pr(C) are equal; the alternative hypothesis (H1) is that the performances of the two methods are not equal.
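Eq. (53) depends only on the discordant counts B and C, as the following minimal sketch shows; the counts used in the example are hypothetical:

```python
def mcnemar_chi2(B: int, C: int) -> float:
    """Eq. (53): chi-squared statistic with continuity correction,
    computed from the discordant counts of the contingency table."""
    return (abs(B - C) - 1) ** 2 / (B + C)

# Hypothetical counts, compared against the chi-squared critical value
# 3.84 (1 degree of freedom, significance level 0.05).
print(mcnemar_chi2(B=40, C=85) > 3.84)  # True: a significant difference
```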

For clarity, a graphical representation of the McNemar contingency table values is presented in Fig. 7. The A value gives the number of ratings correctly predicted by both method 1 and method 2, and the D value gives the number of ratings incorrectly predicted by both methods. The B and C values are used to find the better method: if B is greater than C, method 1 performs better than method 2; similarly, if C is greater than B, method 2 performs better than method 1.

Fig. 7 Graphical representation of McNemar table values

In addition to the McNemar test, the precision, recall, and F1-measure are calculated. These measures are computed from the confusion matrix obtained by converting each rating into a good or bad recommendation. The format of the confusion matrix used for computing precision, recall, and F1-measure is given in Table 9.

Table 9 Confusion matrix format for CF-based RS

Where True Positive (TP) is the number of actual good ratings provided by the user and correctly predicted by the CF-based RS, False Negative (FN) is the number of actual good ratings predicted as bad ratings, False Positive (FP) is the number of actual bad ratings predicted as good ratings, and True Negative (TN) is the number of actual bad ratings correctly predicted.

The rating values are classified into two groups, i.e., good and bad ratings. Ratings that lie between Rmed and Rmax are treated as good ratings; values less than Rmed are treated as bad ratings. For example, on a rating scale of 1 to 5, the good rating values are 3, 4, and 5, and the bad ratings are 1 and 2. The same process is carried out for other rating scales.
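A minimal sketch of this conversion and of the confusion-matrix counts of Table 9 follows; it assumes a dense user-item matrix with 0 marking unrated cells and Rmed = 3 for a 1-to-5 scale:

```python
import numpy as np

def confusion_counts(actual: np.ndarray, predicted: np.ndarray, r_med: float = 3.0):
    """Convert ratings into good/bad labels and tally the Table 9 counts."""
    rated = actual > 0                      # skip unrated cells (assumed 0 = unrated)
    good_actual = actual[rated] >= r_med    # good: Rmed to Rmax
    good_pred = predicted[rated] >= r_med
    TP = int(np.sum(good_actual & good_pred))    # good rating, predicted good
    FN = int(np.sum(good_actual & ~good_pred))   # good rating, predicted bad
    FP = int(np.sum(~good_actual & good_pred))   # bad rating, predicted good
    TN = int(np.sum(~good_actual & ~good_pred))  # bad rating, predicted bad
    return TP, FN, FP, TN
```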

Precision is the ratio of the number of good-rating (i.e., relevant) recommendations made to the total number of recommendations.

$$ precision=\frac{TP}{TP+ FP} $$
(54)

Recall is the ratio of the number of good ratings recommended to the number of actual good ratings.

$$ Recall=\frac{TP}{TP+ FN} $$
(55)

F1-measure is the harmonic mean of precision and recall:

$$ {F}_1=\frac{2\times Precision\times Recall}{Precision+ Recall} $$
(56)
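The three measures of Eqs. (54)–(56) then follow directly from the confusion-matrix counts; the counts in the example are hypothetical:

```python
def precision_recall_f1(TP: int, FN: int, FP: int):
    """Eqs. (54)-(56): precision, recall, and F1-measure from the counts."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(TP=80, FN=10, FP=20))  # (0.8, 0.888..., 0.842...)
```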

Higher precision, recall, and F1-measure values indicate better performance. The overall process of this comparison is shown in Fig. 8.

Fig. 8 Overall process carried out in this study

5.6 Results and discussion

The results obtained for the proposed framework and the other methods are compared below using the performance criteria.

The number of feasible sub-problems generated using the schema varies for each dataset. For each sub-problem, the conventional methods and the proposed framework are used to predict the rating values. The MAE and RMSE values are calculated for all the sub-problems by using Eqs. (50) and (51), respectively, producing a set of MAE and RMSE values for each dataset. From these sets, the minimum MAE and RMSE are chosen and listed in Tables 10 and 11.

Table 10 Comparison of minimum MAE values of the conventional and proposed method
Table 11 Comparison of minimum RMSE values of the conventional and proposed method

Our proposed method provides lower MAE values for all the datasets. The second-best results are obtained by SOQN for the Epinions, FlimTrust, NetFlix, and ML1MB datasets; RJMSD attains the next-best solution for the CiaoDVD and MovieTweet datasets, and NHSM provides the next-best result for the ML100KB dataset. The percentages of improvement of our proposed method over the second-best solution are 35.56% for Epinions, 45.21% for CiaoDVD, 33.8% for MovieTweet, 51.33% for FlimTrust, 46.03% for NetFlix, 47.78% for ML1MB, and 47.47% for ML100KB. For the existing similarity measures with the user-based prediction expression, high variations exist between the actual and predicted ratings; these variations lead to maximum MAE values, which decreases the prediction quality. In our proposed approach, the variation is minimal and the predicted ratings coincide with the actual ratings in most cases, which improves the effectiveness of the CF-based RS.

The table values indicate that the standard deviation of the prediction error is minimum for the proposed method compared to the other methods. The percentages of improvement of our proposed method over the next-lowest RMSE values are 42.16%, 41.62%, 42.50%, 61.54%, 48.59%, 44.49%, and 45.35% for the Epinions, CiaoDVD, MovieTweet, FlimTrust, NetFlix, ML1MB, and ML100KB datasets, respectively. The comparisons of maximum MAE and RMSE values for all the datasets are arranged in Tables 12 and 13.

Table 12 Comparison of maximum MAE values of the conventional and proposed method
Table 13 Comparison of maximum RMSE values of the conventional and proposed method

Tables 12 and 13 show the maximum MAE and RMSE chosen for each method. It is observed that the proposed method achieves better solutions than the conventional methods in terms of lower MAE and RMSE values: on average, improvements of 46.71% in MAE and 45.18% in RMSE are obtained over the next-best results. The comparisons of average MAE and RMSE values are listed in Tables 14 and 15.

Table 14 Comparison of average MAE values of the conventional and proposed method
Table 15 Comparison of average RMSE values of the conventional and proposed method

From Tables 14 and 15, it is noticed that, among all the methods listed, the proposed method attains the smallest MAE and RMSE values for all the datasets. The proposed framework is the combination of the MPIP similarity measure and the CUIP prediction expression. In MPIP, all three components are converted to the range 0 to 1; these normalized values help to find a better similarity value between users or items. The similarity then serves as an input to the prediction expression; in the proposed method, the item-related components and the deviation of user ‘j’ are included, which leads to accurate rating prediction. For calculating MAE and RMSE, the actual ranges of the rating scales are used. Compared to the existing methods, the predicted ratings of our proposed method are nearer to the actual ratings, which reduces the deviation between actual and predicted ratings.

The average MAE and RMSE values are computed by considering all the sub-problems. The percentage of improvement for the proposed method over other methods is calculated and listed in Table 16.

Table 16 Comparison of percentage of improvement of average MAE and RMSE for the proposed method

The results listed in Table 16 show that the SOQN method holds the second position by obtaining the lowest average MAE and RMSE among the existing methods for the Epinions, CiaoDVD, and ML1MB datasets. Similarly, RJMSD provides the second-best solution for the FlimTrust and NetFlix datasets, PCC provides the next-best solution for the MovieTweet dataset, and BCFcorr provides the second-best solution for the ML100KB dataset. Our proposed framework yields average improvements of 53.81% in MAE and 51.60% in RMSE over the conventional similarity measures, namely PCC, COS, JACC, and JMSD. Similarly, the average improvements in MAE and RMSE for the proposed framework are 49.44% and 49.24% over the specific similarity measures used for CF-based RS, namely PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN. This improved solution enhances the accuracy of the CF-based RS. The above-mentioned tables indicate that the proposed method exhibits superiority over the other methods.

The Friedman rank test is conducted to test whether the performance of all the methods is the same. The MAE values are computed for all the sub-problems and, for each sub-problem (sp), are ranked from 1 (best result) to ‘l’ (worst result). The Fr values are calculated based on the squared average rank values of each method; for each dataset, the calculated Fr is listed in Table 17.

Table 17 Comparison of Fr values using MAE as reference for all the dataset

The calculated Fr values are higher than the χ2 critical value of 19.67 (11 degrees of freedom at a significance level of 0.05). The p-values obtained from the Friedman test are near zero, which leads to acceptance of the alternate hypothesis for all the datasets: the performance of all the methods is not equal. The comparison of average rank values for the conventional methods and the proposed method for all datasets is shown in Table 18.

Table 18 Comparison of average rank values obtained using MAE as references for all dataset

The average rank of our proposed method is 1 for all the datasets. This indicates that the proposed framework provides the minimum MAE values for all the sub-problems generated by using the schema. Similarly, a comparison of Fr values for RMSE is listed in Table 19.

Table 19 Comparison of Fr values using RMSE as reference for all the dataset

The p-values of the Friedman rank test in Table 19 result in acceptance of the alternate hypothesis for all the datasets; hence, the performance of all the methods is not the same. The comparison of the average rank values obtained using RMSE as a reference for all the datasets is listed in Table 20.

Table 20 Comparison of average rank values obtained using RMSE as references for all dataset

The rank values show that our proposed method performs better than the existing methods: compared to the other methods, the proposed framework attains the top rank across all the sub-problems. The proposed method therefore yields a better solution to sparsity problems than the existing similarity measures with the user-based prediction expression.

To further validate the proposed method, McNemar’s test is conducted on the confusion matrix. The maximum and minimum MAE are chosen from the conventional methods for each dataset, and the corresponding confusion matrices are tested through McNemar’s test, with the proposed framework treated as the reference. The calculated χ2 values for all the datasets are listed in Table 21.

Table 21 The calculated χ2 value for minimum MAE of the proposed method against the conventional method

The calculated χ2 values are higher than the χ2 table value; this results in acceptance of the alternate hypothesis for McNemar’s test, which shows that the performances of the existing and proposed methods are not equal. The B and C values are the primary components required for conducting McNemar’s test, which is performed by considering our proposed framework as the reference: B is the number of ratings incorrectly predicted by our proposed framework but correctly predicted by the existing method, and C is the number of ratings correctly predicted by our proposed framework but incorrectly predicted by the existing method. The comparison of B and C values is listed in Table 22.

Table 22 Comparison of B and C values for minimum MAE of the conventional and proposed method

The comparison results in Table 22 clearly show that the C values are higher than the B values, i.e., the number of ratings correctly predicted by our proposed framework is higher than that of the existing methods.

The maximum MAE values are chosen from the feasible sub-problems generated for each dataset, and the corresponding predicted values are used to conduct McNemar’s test by means of a confusion matrix. The calculated χ2 and p-values are listed in Table 23.

Table 23 The calculated χ2 value for maximum MAE of the proposed method against the conventional method

The calculated χ2 values are greater than the table value, which agrees with the alternate hypothesis, i.e., a significant difference exists between the conventional and proposed methods; the proposed method is better than the existing methods for all the datasets. Table 24 compares the B and C values for the maximum MAE.

Table 24 Comparison of B and C values for maximum MAE of the conventional and proposed method

The comparison of B and C values for the maximum MAE reveals that C is greater than B for all the datasets. Compared to the existing methods, our proposed method correctly predicts more good and bad ratings.

The precision, recall, and F1-measure are calculated for each sub-problem. The summaries of average precision, recall, and F1-measure for the conventional and proposed methods are reported in Tables 25, 26, and 27.

Table 25 Comparison of average precision values for the conventional and proposed method
Table 26 Comparison of average recall values for the conventional and proposed method
Table 27 Comparison of average F1 measure for the conventional and proposed method

The comparison in Table 25 indicates that the average precision value is higher for our proposed framework. These results show that, out of the total predicted ratings, most of the actual good-rating items are correctly predicted by our proposed method.

The results arranged in Table 26 show that the recall values of our proposed method are greater than those of the other methods, indicating that more of the actual good-rating items are correctly predicted by our proposed method. Table 27 compares the average F1-measure values for the conventional and proposed methods.

The F1-measure combines the precision and recall values, and the proposed method attains better results for all the datasets. To validate the effectiveness of the proposed method, the Friedman rank test is conducted on the F1-measure values; the Fr values are listed in Table 28.

Table 28 Comparison of Fr values using F1 as a reference for all the dataset

The calculated Fr values confirm that the performance of all the methods is not equal. The average ranks used for the Fr calculation are shown in Table 29.

Table 29 Comparison of average rank values obtained using F1 as references for all dataset

Table 29 indicates that the proposed method provides the highest F1-measure for all the feasible sub-problems. Furthermore, most of the actual good-rating items are correctly recommended by the proposed method.

The main issue in CF-based RS is the sparsity problem. A new framework is introduced to solve the sparsity problem and to enhance the prediction performance, and experiments are conducted on various levels of sparse data. The MAE and RMSE results for our proposed approach are minimal compared to the conventional similarity measures, namely PCC, COS, JACC, and JMSD, with the user-based prediction expression. The specific similarity measures used for CF-based RS, namely PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN with the user-based prediction expression, also attain higher MAE and RMSE values than the proposed method for all the datasets. This clearly shows that the ratings predicted by the proposed framework are closer to the actual ratings, which minimizes the error values. The average ranking value for the MAE and RMSE of our proposed framework is 1, which denotes that, across the various sub-problems, our proposed method provides the better prediction ratings. The McNemar test results in acceptance of the alternate hypothesis, which shows that the performances of the existing and proposed methods are not the same; each conventional method is tested against our proposed method, and in all the comparisons our proposed approach offers an improved solution in terms of accurate prediction. Finally, the precision, recall, and F1-measure values for our proposed approach are above 0.8 and higher than those of the other methods. The results obtained from the analysis indicate that our proposed framework improves CF-based RS performance by correctly predicting more good-rating items and reducing the misclassification error. This improved prediction helps a company to identify and recommend products that are more relevant to online users.

6 Conclusion

CF-based RS depends on similarity measures, of which PIP is one of the most popular techniques for calculating the similarity between users. However, the ranges of the proximity-impact-popularity values are wide, and each component carries a different weight in different scenarios; i.e., the magnitude of each component differs. This is a serious limitation of the existing PIP measure; therefore, we have developed a modified PIP similarity measure that provides a common value range between 0 and 1 for all three components, resulting in equal priority. We have also developed a modified prediction expression that includes the item-based average and weighted average deviation along with the user-based average and weighted average deviation to improve accuracy. Finally, a procedure for forecasting unrated items has been introduced to improve the recommendation performance. The proposed framework was tested by using various benchmark datasets, namely ML1MB, NetFlix, ML100KB, CiaoDVD, Epinions, MovieTweet, and FlimTrust. The entire analysis was conducted on sub-problems with different levels of sparsity generated from the user-item rating matrix; the various levels of sparse data were created by varying the numbers of users, items, and ratings for all datasets. The results obtained from the proposed method are compared with those of the conventional methods. The proposed framework provides better results for all datasets, yielding lower MAE and RMSE than the existing methods. The McNemar test is conducted to validate the proposed method in terms of good and bad rating recommendations; this analysis results in the acceptance of the alternate hypothesis (i.e., the performances of the conventional and proposed methods are not equal). Also, precision, recall, and F1-measure are calculated to identify the best method based on the confusion matrix, and the proposed method attains higher precision, recall, and F1-measure. Finally, a Friedman rank test was conducted on the MAE, RMSE, and F1 values. The statistical results show that our proposed framework outperformed the other methods.