1 Introduction

People spend an increasing amount of time on the Internet because of the vast quantity of information it provides across various fields [1,2,3,4,5]. A Recommender System (RS) collects information on websites about user preferences for a set of items and employs different online information sources to predict the users’ preferences for these items [6, 7]; therefore, it plays a vital role in the sale of products or services [8, 9]. Many users prefer to buy products based on the recommendations of other users; thus, the preferences of users for different products should be analyzed. From the perspective of a company, this helps to maximize profits and promote its products or services.

RS methods based on Collaborative Filtering (CF) are extremely popular among researchers and practitioners, as evidenced by the vast number of journal articles and real-life implementation cases [10,11,12,13]. CF recommends products or services based on the similarities in the preferences of a group of customers or online users, known as neighbors [14,15,16]. The advantage of a CF-based RS is that it is domain-independent and comparatively more accurate than content-based filtering (CBF) [17]. With the growth of online purchasing and electronic commerce, automated product recommendation has become an essential tool for enhancing the sales of products and services through Internet-based stores [18]. Specifically, CF recommends products to a customer based on the similarities between users or products, and customers’ past or historical preferences are used to find these similarities [16, 19,20,21]. Therefore, the critical component of CF is effectively identifying the similarities between users or items [18].

The Pearson Correlation Coefficient (PCC), Cosine (COS), Jaccard (JACC), and Jaccard Mean Squared Difference (JMSD) are the conventional methods used to compute the similarities between users or items. The PCC, COS, and mean squared difference are statistical metrics adopted in CF-based RS; their main advantage is that they are easy to implement and their similarity values are easy to interpret [22]. Similarly, JACC is the ratio of intersection to union, which measures the similarity between two users based on the number of items they have co-rated. These similarity measures provide high accuracy for CF-based RS. However, they suffer from cold-start problems, in which only a few items or users are rated; this leads to an extremely sparse user-item rating matrix for the RS [7, 16, 23,24,25]. A similarity matrix computed from such a sparse input matrix misleads an RS [18, 26, 27]. The cold-start is an example of a sparsity problem that occurs when a new user or item is introduced; it becomes difficult to compute the similarities among users or items because of insufficient rating information [25, 28, 29]. The sparse input matrix is a significant issue that decreases the performance of a CF-based RS [30]. Another problem is that when rating values fall on both the positive and negative sides of the rating scale, the conventional methods produce different similarity values; this is a further drawback of the conventional similarity measures. These misleading similarity values eventually lower the accuracy of the CF-based RS. Therefore, a more effective similarity measure is required to improve the performance of a CF-based RS. To address these issues, Ahn [18] proposed the Proximity-Impact-Popularity (PIP) measure, predominantly for use in CF, to provide a better solution to the sparsity problem. In this method, two agreement conditions are included, and similarity is computed by considering positive and negative ratings. However, the range of values for the PIP is so wide that the three components (proximity, impact, popularity) are not treated equally; in different scenarios, each component carries a different weight, and the component values are not normalized [26]. If users provide an extremely positive rating for the co-rated items, then proximity has a greater weight than impact and popularity. Similarly, if users provide opposite ratings for the co-rated items, then proximity and popularity are treated in the same manner but the impact value is very small. Each component contributes important information to the PIP calculation; however, because the components are treated in unequal proportions, prediction accuracy in CF-based RS suffers. This is one of the limitations of the PIP measure. To overcome it, a detailed analysis of the PIP measure has been performed and the shortcomings of the existing similarity measure have been identified. Based on this analysis, a modified PIP (MPIP) similarity measure has been developed to overcome the limitations of the PIP measure.

Generally, a similarity-based prediction expression predicts a rating from either the user-related average and its weighted average deviation or the item-related average and its weighted average deviation; only one of these is considered in the prediction process. Both user- and item-related information provide important cues for an accurate prediction, yet the existing prediction expression uses only one of them. This is one of the shortcomings of the existing prediction expression. Therefore, to improve prediction accuracy, a modified prediction expression is devised by adding the user-related deviation to the user-based prediction and the item-related deviation to the item-based prediction; further, the predicted rating is the average of the user- and item-based predictions.

In this study, we have modified the PIP similarity measure by converting the range of each component into 0 to 1. The PIP value is the product of the proximity, impact, and popularity values. These three components are weighted in different proportions in different scenarios, and the deviation between the minimum and maximum values of each component is very high. If two users provide extreme positive ratings, the resultant PIP values are of the order of 10^3; if the users provide opposite ratings, i.e., one user provides a positive rating and the other a negative rating, the PIP values are of the order of 10^{-1}. The variation between the two conditions is very high. Direct normalization procedures such as z-score and min-max normalization may be adopted to obtain normalized PIP values. However, different normalization procedures produce different ranges of values, and for different similarity matrices they produce different values. To overcome this issue, the expression itself is changed into the modified PIP measure, which computes improved similarity values between users or items. Existing similarity measures such as PCC, COS, JACC, and JMSD do not adopt agreement conditions, and their main problems are the flat-value and single-value problems; the PIP measure provides a better solution to both. The MPIP expression uses the same agreement conditions as PIP to distinguish similar from dissimilar rating pairs, and it likewise resolves the flat- and single-value problems. MPIP values are designed to have higher magnitudes for agreement conditions and very small values for disagreement conditions; if two users disagree in their ratings, MPIP yields much lower similarity values. In the existing PIP similarity measure, a constant penalty value is multiplied with the proximity component; instead of a constant, a variable penalty is used in the proposed proximity expression. This helps to compute an improved similarity matrix for CF-based RS. A modified prediction expression has also been introduced, which combines user- and item-related information to enhance the effectiveness of prediction using a sparse user-item rating matrix. The modified prediction expression is further derived for predicting unavailable ratings; based on it, one can obtain accurate rating predictions for the users. The modified PIP similarity measure and the modified prediction expression are combined as the proposed framework for CF-based RS. This improved CF-based RS recommends more relevant products or services to the customer, which in turn enhances customer satisfaction in e-commerce services.

Experiments are conducted using the MovieLens100KB (ML100KB) and Netflix datasets, which were also used by Ahn [18]. Benchmark datasets such as Epinions, CiaoDVD, MovieTweet, FilmTrust, and MovieLens1MB (ML1MB) are used to validate the proposed framework. Each dataset is divided into different sub-problems to test the proposed framework under sparse conditions. Performance criteria such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), precision, recall, and F1-measure are used to measure the effectiveness of the CF-based RS. The McNemar test is conducted to statistically compare the performance of the different methods, and the Friedman rank test is performed to compare the methods across all the sub-problems. The results obtained from the proposed framework are compared with eleven existing similarity measures. The proposed method provides the minimum MAE and RMSE values for all the datasets, and it provides higher accuracy than the conventional methods. The statistical tests are conducted on the MAE, RMSE, and confusion-matrix results. The results show that the proposed framework can improve prediction performance and quality with a sparse input matrix. The symbols used throughout the paper are listed in Table 1.

Table 1 Description of the symbols and abbreviations used in this paper

The main contributions of this study are as follows: The Modified PIP (MPIP) similarity measure is introduced to provide an improved similarity between users or items. In the existing PIP measure, the minimum and maximum values of each component vary over different ranges, so the resultant PIP value depends heavily on one of the component values. The proposed similarity measure (MPIP) addresses this issue by converting each component value into the range of zero to one. In MPIP, a variable penalty value is introduced in the proposed proximity expression to differentiate the values in various scenarios. Further, a Combined User- and Item-based Prediction (CUIP) expression is proposed to obtain a better predicted rating, and the CUIP is tuned to forecast unavailable ratings. The MPIP and CUIP are combined as a new framework to overcome the sparsity problem in CF-based RS. A schema is developed to generate different levels of sparse input matrices by giving equal importance to all the elements in the input user-item rating matrix; under this schema, the numbers of users, items, and ratings vary in different proportions. Finally, the McNemar test is explained with a graphical representation for better understanding of each component used in the McNemar table.

The remainder of this paper is organized as follows: Section 2 discusses the literature related to CF-based RS and the similarity measures adopted for CF-based RS. Section 3 presents a detailed analysis of the PIP and the issues identified in the PIP expression. Section 4 describes the proposed method, which is a combination of the modified PIP similarity measure and the modified prediction expression. Section 5 presents the experimental results, and Section 6 discusses our conclusions.

2 Related literature

RS is broadly classified into three categories: content-based filtering, collaborative filtering, and hybrid approaches. Content-based filtering predominantly utilizes text-mining concepts; collaborative filtering is further subdivided into model- and memory-based filtering; and the hybrid method combines the text and rating preferences provided by the user [31].

CF is a type of personalized recommendation technique that is widely used in many domains [13, 30, 32]. However, CF also suffers from several issues, for example, the cold-start problem, data sparsity [6, 33, 34], and scalability; these problems considerably reduce the user experience. Memory-based filtering is further divided into user- and item-based methods. In this study, literature related to user-based similarity methods is collected and listed below.

Many studies have been conducted to improve prediction accuracy, resulting in the development of new similarity measures. PCC is often used to compute a linear relationship (i.e., correlation) between a pair of objects; it ranges from −1 to 1, where −1 indicates a negative relationship between the users, the mid-value 0 indicates no relationship, and +1 indicates a strong positive relationship between the users.

$$ sim{\left({u}_j,{u}_h\right)}^{PC\mathrm{C}}=\frac{\sum_{i=1}^{m^{\prime }}\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}{\sqrt{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)}^2}\sqrt{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}^2}} $$
(1)

The COS similarity is a vector space model, which is highly used in information retrieval domains, where the cosine value of the angle between two vectors is used as the similarity between the users. This is calculated by using the following equation:

$$ sim{\left({u}_j,{u}_h\right)}^{COS}=\frac{\sum_{i=1}^{m^{\prime }}\left({r}_{u_j,{I}_i}\ \right)\times \left({r}_{u_h,{I}_i}\right)}{\sqrt{\sum_{i=1}^{m^{\prime }}{r_{u_j,{I}_i}}^2}\times \sqrt{\sum_{i=1}^{m^{\prime }}{r_{u_h,{I}_i}}^2}} $$
(2)
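As an illustration, Eqs. (1) and (2) can be computed as in the following Python sketch (ours, not code from any cited implementation); for simplicity the user means here are taken over the m′ co-rated items, whereas \( {\overline{r}}_{u_j} \) may equally be taken over all items the user rated:

```python
import numpy as np

def pcc(r_j, r_h):
    """Eq. 1: Pearson correlation over the m' co-rated ratings of users j and h."""
    d_j, d_h = r_j - r_j.mean(), r_h - r_h.mean()
    denom = np.sqrt((d_j ** 2).sum()) * np.sqrt((d_h ** 2).sum())
    return float((d_j * d_h).sum() / denom) if denom else 0.0

def cos_sim(r_j, r_h):
    """Eq. 2: cosine of the angle between the two co-rated rating vectors."""
    denom = np.sqrt((r_j ** 2).sum()) * np.sqrt((r_h ** 2).sum())
    return float((r_j * r_h).sum() / denom) if denom else 0.0
```

For example, pcc(np.array([1., 2., 3.]), np.array([2., 4., 6.])) returns 1.0, since the two rating vectors are perfectly linearly related.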

Bobadilla et al. [6] proposed a new similarity measure, the Jaccard Mean Squared Difference (JMSD), which combines the Jaccard similarity and the mean squared difference (MSD) to solve the cold-user problem.

$$ sim{\left({u}_j,{u}_h\right)}^{JACC}=\frac{\left|{I}_{u_j}\right|\cap \left|{I}_{u_h}\right|}{\left|{I}_{u_j}\right|\cup \left|{I}_{u_h}\right|} $$
(3)
$$ sim{\left({u}_j,{u}_h\right)}^{MSD}=1-\left(\frac{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right)}^2}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}\right) $$
(4)
$$ sim{\left({u}_j,{u}_h\right)}^{JMSD}=\left(\frac{\left|{I}_{u_j}\right|\cap \left|{I}_{u_h}\right|}{\left|{I}_{u_j}\right|\cup \left|{I}_{u_h}\right|}\right)\times \left(1-\frac{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right)}^2}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}\ \right) $$
(5)
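A companion sketch for Eqs. (3)-(5) (again our illustration; the ratings are assumed rescaled to [0, 1] so that the MSD term stays in range, as is conventional for MSD):

```python
def jaccard(items_j, items_h):
    """Eq. 3: co-rated items over all items rated by either user."""
    union = items_j | items_h
    return len(items_j & items_h) / len(union) if union else 0.0

def msd(r_j, r_h):
    """Eq. 4: 1 minus the mean squared difference over the co-rated items;
    r_j, r_h are equal-length sequences of co-rated ratings in [0, 1]."""
    return 1.0 - sum((a - b) ** 2 for a, b in zip(r_j, r_h)) / len(r_j)

def jmsd(items_j, items_h, r_j, r_h):
    """Eq. 5: product of the Jaccard and MSD terms."""
    return jaccard(items_j, items_h) * msd(r_j, r_h)
```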

The Spearman Rank Correlation (SRC) is another similarity measure that relies on the rank of the items instead of the rating provided by the users as in the PCC. The rankings are based on the higher to lower-rated items [35]. Ahn [18] has proposed the PIP similarity measure for CF for both agreement and disagreement situations.

Further, this similarity measure is modified into a non-linear model by Liu et al. [26], which is termed as a new heuristic similarity measure (NHSM). It comprises the proximity-significance-singularity (PSS) combined with the modified Jaccard (JACC′) function and the user rating preference (URP).

$$ Proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)=1-\left(\frac{1}{1+\mathit{\exp}\left(-\left|{r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right|\right)}\right) $$
(6)
$$ Significance\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)=\frac{\ 1}{1+\mathit{\exp}\left(-\left|{r}_{u_j,{I}_i}-{R}_{med}\right|\times \left|{r}_{u_h,{I}_i}-{R}_{med}\right|\right)} $$
(7)
$$ Singularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)=1-\left(\frac{1}{1+\mathit{\exp}\left(-\left|\frac{r_{u_j,{I}_i}+{r}_{u_h,{I}_i}}{2}-{\overline{r}}_{I_i}\right|\right)}\right) $$
(8)
$$ PSS\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)= Proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times Significance\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times Singularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(9)
$$ sim{\left({u}_j,{u}_h\right)}^{PSS}={\sum}_{i=1}^{m\prime } PSS\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(10)
$$ sim{\left({u}_j,{u}_h\right)}^{Jacc\prime }=\frac{\left|{I}_{u_j}\cap {I}_{u_h}\right|}{\left|{I}_{u_j}\right|\times \left|{I}_{u_h}\right|} $$
(11)
$$ sim{\left({u}_j,{u}_h\right)}^{URP}=1-\frac{1}{1+\exp \left(-\left|{\overline{r}}_{u_j}-{\overline{r}}_{u_h}\right|\times \left|{\sigma}_{u_j}-{\sigma}_{u_h}\right|\right)} $$
(12)
$$ sim{\left({u}_j,{u}_h\right)}^{NHSM}= sim{\left({u}_j,{u}_h\right)}^{PSS}\times sim{\left({u}_j,{u}_h\right)}^{Jacc^{\prime }}\times sim{\left({u}_j,{u}_h\right)}^{URP} $$
(13)
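To make the composition of Eqs. (6)-(13) concrete, a minimal sketch follows (our own; the dict-based interfaces are an assumption, and item_means maps each item id to its average rating):

```python
import math
from statistics import mean, pstdev

def pss_pair(r1, r2, item_mean, r_med=3.0):
    """Single-pair PSS value, Eqs. 6-9."""
    proximity = 1 - 1 / (1 + math.exp(-abs(r1 - r2)))
    significance = 1 / (1 + math.exp(-abs(r1 - r_med) * abs(r2 - r_med)))
    singularity = 1 - 1 / (1 + math.exp(-abs((r1 + r2) / 2 - item_mean)))
    return proximity * significance * singularity

def nhsm(ratings_j, ratings_h, item_means, r_med=3.0):
    """Eqs. 10-13; ratings_j and ratings_h map item id -> rating."""
    co = ratings_j.keys() & ratings_h.keys()
    sim_pss = sum(pss_pair(ratings_j[i], ratings_h[i], item_means[i], r_med)
                  for i in co)                                   # Eq. 10
    sim_jacc = len(co) / (len(ratings_j) * len(ratings_h))       # Eq. 11
    mu_j, mu_h = mean(ratings_j.values()), mean(ratings_h.values())
    sd_j, sd_h = pstdev(ratings_j.values()), pstdev(ratings_h.values())
    sim_urp = 1 - 1 / (1 + math.exp(-abs(mu_j - mu_h) * abs(sd_j - sd_h)))  # Eq. 12
    return sim_pss * sim_jacc * sim_urp                          # Eq. 13
```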

The Bhattacharyya Coefficient in CF was proposed by Patra et al. [17] by combining global and local similarity measures; the coefficient is treated as the global measure, and the local measure is calculated based on the correlation or cosine similarity.

$$ sim\left({u}_j,{u}_h\right)= JACC\left({u}_j,{u}_h\right)+\sum \limits_{i\epsilon {I}_{u_j}}\sum \limits_{q\epsilon {I}_{u_h}} BC\left({I}_i,{I}_q\right)\times loc\left({r}_{u_j},{r}_{u_h}\right) $$
(14)
$$ BC\left({I}_i,{I}_q\right)={\sum}_{j=1}^n\sqrt{\mathit{\Pr}\left({u}_j,{I}_i\right)\times \mathit{\Pr}\left({u}_h,{I}_q\right)} $$
(15)
$$ loc{\left({u}_j,{u}_h\right)}^{corr}=\frac{\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)\times \left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}{\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)}^2}\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}^2}} $$
(16)
$$ loc{\left({u}_j,{u}_h\right)}^{med}=\frac{\left({r}_{u_j,{I}_i}-{R}_{med}\right)\times \left({r}_{u_h,{I}_i}-{R}_{med}\right)}{\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_j,{I}_i}-{R}_{med}\right)}^2}\sqrt{\sum_{i=1}^{m\prime }{\left({r}_{u_h,{I}_i}-{R}_{med}\right)}^2}} $$
(17)

Where ‘BC’ is the Bhattacharyya Coefficient and ‘loc’ is the local similarity measure. Eqs. 16 and 17 are used for calculating the local similarity: Eq. 16 uses the mean as a reference, whereas Eq. 17 uses the median as a reference.

Generally, the Jaccard similarity measure only counts the frequency of co-rated items between the users; this is one of its shortcomings. To overcome this issue, Bag et al. [31] proposed the Relevant Jaccard similarity measure (RJACC), which also includes the frequency of un-co-rated items. In addition, a new similarity measure for CF-based RS was evolved by combining RJACC and MSD, termed the Relevant Jaccard Mean Squared Deviation (RJMSD) [31].

$$ sim{\left({u}_j,{u}_h\right)}^{RJACC}=\frac{1}{1+\frac{1}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}+\frac{\left|\overline{I_{u_j}}\right|}{1+\left|\overline{I_{u_j}}\right|}+\frac{1}{1+\left|\overline{I_{u_h}}\right|}} $$
(18)
$$ sim{\left({u}_j,{u}_h\right)}^{RJMSD}=\frac{1}{1+\frac{1}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}+\frac{\left|\overline{I_{u_j}}\right|}{1+\left|\overline{I_{u_j}}\right|}+\frac{1}{1+\left|\overline{I_{u_h}}\right|}}\times \left(1-\frac{\sum_{i=1}^{m^{\prime }}{\left({r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right)}^2}{\left|{I}_{u_j}\cap {I}_{u_h}\right|}\ \right) $$
(19)

Where ‘\( \overline{I_{u_j}} \)’ and ‘\( \overline{I_{u_h}} \)’ are the numbers of un-co-rated items of users ‘j’ and ‘h’. A sub-one quasi-norm-based similarity measure for CF-based RS was introduced to overcome the issues of similarity measures based on the Euclidean distance [7].

$$ sim{\left({u}_j,{u}_h\right)}^{SOQN}={\sum}_{i=1}^{m^{\prime }}1-\left(\frac{{\left|{r}_{u_j,{I}_i}-{r}_{u_h,{I}_i}\right|}^g}{{\left(\frac{R_{ran}}{2}\right)}^g}\right) $$
(20)

Where ‘Rran’ is the range value, i.e., the deviation between Rmax and Rmin; ‘Rmax’ is the maximum value on the rating scale; ‘Rmin’ is the minimum value on the rating scale; and ‘g’ is a parameter that varies from 0 to 1.

2.1 Drawbacks in the existing similarity measure

PCC and COS are widely used methods in CF-based RS. Their limitations are as follows: the flat-value problem [18, 36], the single-value problem [36], the equal-ratio problem [36], and the opposite-value problem [36]. Cosine similarity is one of the most popular similarity measures in research applications. It is heavily used in text clustering to determine the similarity between two documents [37], and it is also used in the k-means clustering algorithm to compute the similarity between data objects and the corresponding centroids; based on the similarity values, the data objects are grouped into different clusters. It has also been used in a multi-objective function for the Krill Herd algorithm, where it provides higher accuracy than conventional methods [38]. The Jaccard similarity measure is well known as a binary similarity measure; it is mainly adopted to find the similarity between binary variables, and it has provided good solutions in evolutionary algorithms, for example as the fitness function of a genetic algorithm [39]. The research mentioned above clearly shows that cosine- and Jaccard-based similarity measures are widely used in real-life applications to solve various problems. In the case of CF-based RS, rating values (on an ordinal scale) are used. The PCC and COS give equal weight to both the positive and negative sides of a rating scale, which leads to misleading similarity values; this is one of the shortcomings of the PCC- and cosine-based similarity measures. In the Jaccard similarity measure, only the number of co-rated items is considered, not the intensity of the ratings. To overcome these shortcomings, Ahn [18] proposed the PIP similarity measure. However, the magnitude of the PIP values is high, and each component in the PIP measure is treated in different proportions in different scenarios. PSS is an extension of the PIP expression with a non-linear assumption, but the agreement conditions used in the PIP measure are not included in this expression; thus, the correlation between PIP and PSS is very low. NHSM is a combination of the PSS, the modified Jaccard (JACC′), and the User Rating Preference (URP) measure. In URP, the deviation of the means is multiplied by the deviation of the standard deviations; if both users have the same mean but different standard deviations, the standard-deviation term becomes negligible. The BCF similarity measure is a combination of global and local similarity measures, with the Bhattacharyya coefficient treated as the global measure; in this expression, only the number of ratings is used in the calculation, and the co-rated items are not considered. Relevant Jaccard (RJACC) is an extension of the JACC-based similarity measure; it considers only the frequencies of the co-rated and un-co-rated items, and the intensity (i.e., the magnitude) of the ratings is not considered in the similarity computation. SOQN is an improved version of the Euclidean distance; as the number of co-rated items increases, the magnitude of its similarity values also increases, so two users with a higher number of co-rated items receive higher similarity values and vice versa. These are the shortcomings of the existing similarity measures used in CF-based RS. To overcome them, a modified PIP similarity measure is proposed to obtain an improved similarity matrix for CF-based RS.

3 Detailed analysis of PIP

The PIP similarity measure [18] comprises two agreement conditions, which in turn include three components: proximity, impact, and popularity. The similarity between the two users’ ratings, ‘\( {r}_{u_j,{I}_i} \)’ and ‘\( {r}_{u_h,{I}_i} \)’, is calculated as follows:

$$ PIP\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)= proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times impact\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times popularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(21)
$$ sim{\left({u}_j,{u}_h\right)}^{PIP}={\sum}_{i=1}^{m^{\prime }} PIP\left({r}_{u_j,{I}_i},{\mathrm{r}}_{u_h,{I}_i}\right) $$
(22)

The agreement conditions play a vital role in the PIP measure. Compared with other similarity measures, the PIP is the only one that differentiates positive and negative ratings by using agreement conditions; these conditions discriminate between ratings based on the pattern of ratings provided by the two users. Let us consider the set of users U = {u1, u2, …, un} and the set of items I = {I1, I2, …, Im}, where ‘n’ is the total number of users and ‘m’ is the total number of items. They are associated with a rating matrix called the ⟨user × item⟩ rating matrix:

$$ \begin{array}{c}\ \\ u_1 \\ \vdots \\ u_n \end{array}\begin{array}{c}\begin{array}{ccc}I_1 & \cdots & I_m\end{array}\\ \left[\begin{array}{ccc}r_{u_1,I_1} & \cdots & r_{u_1,I_m}\\ \vdots & \ddots & \vdots \\ r_{u_n,I_1} & \cdots & r_{u_n,I_m}\end{array}\right]\end{array} $$
(23)

To understand better, let ‘r1’ and ‘r2’ be two ratings, where r1 is the rating provided by user ‘j’ for item ‘i’ and r2 is the rating provided by user ‘h’ for the same item ‘i’. ‘Rmax’ is the maximum rating on the rating scale and ‘Rmin’ is the minimum; further, let \( {R}_{med}=\frac{R_{max}+{R}_{min}}{2} \). If both ratings ‘r1’ and ‘r2’ are less than, or both greater than, ‘Rmed’, then Agreement(r1, r2) is TRUE. Conversely, if the ratings lie in opposite directions, i.e., one rating is greater than ‘Rmed’ and the other is less than ‘Rmed’, the situation is FALSE. The main use of these agreement conditions is to discriminate between the ratings. A Boolean function for the agreement conditions of ratings ‘r1’ and ‘r2’ is defined as follows:

$$ {\displaystyle \begin{array}{c} Agreement\left({r}_1,{r}_2\right)= FALSE\\ {} if\ \left({r}_1>{R}_{med}\ and\ {r}_2<{R}_{med}\right)\ or\ \left({r}_1<{R}_{med}\ and\ {r}_2>{R}_{med}\right)\ \end{array}} $$
(24)
$$ {\displaystyle \begin{array}{c} Agreement\left({r}_1,{r}_2\right)= TRUE\\ {} otherwise\end{array}} $$
(25)
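As a concrete illustration (ours, not from the original paper), the Boolean test of Eqs. (24)-(25) can be written as:

```python
def agreement(r1, r2, r_min=1, r_max=5):
    """Eqs. 24-25: FALSE only when the two ratings fall on strictly
    opposite sides of the scale median R_med; TRUE otherwise."""
    r_med = (r_max + r_min) / 2.0
    return not ((r1 > r_med and r2 < r_med) or (r1 < r_med and r2 > r_med))
```

On a 1 to 5 scale, agreement(4, 5) and agreement(3, 2) are TRUE, while agreement(2, 4) is FALSE.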

The pictorial representation of the agreement conditions, with r1 as the reference, is shown in Fig. 1 (a rating scale of 1 to 5 is considered for plotting this figure). The vertical lines denote the ratings inside the plot: a vertical black line indicates that the Agreement of r1 and r2 is TRUE, and a vertical red line denotes that the Agreement of r1 and r2 is FALSE.

Fig. 1
figure 1

Graphical representation of Agreement conditions

If both users provide similar kinds of ratings, then Agreement(r1, r2) = TRUE; the minimum and maximum deviations of the ratings in this situation are 0 and 2, which shows that both users agree on the rating values. If the two users provide different kinds of ratings for the same item, then Agreement(r1, r2) = FALSE; the minimum deviation in this situation is two and the maximum deviation is four. This clearly shows that the deviation between the two ratings provides useful information for calculating the similarity between the users. The agreement condition helps to differentiate users based on the variation of the ratings provided for the co-rated item.

3.1 Proximity

Generally, proximity is used to determine the closeness of two objects. Here, proximity is based on the absolute difference between the two ratings r1 and r2.

$$ {\displaystyle \begin{array}{c}D\left({r}_1,{r}_2\right)=\left|{r}_1-{\mathrm{r}}_2\right|\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(26)
$$ {\displaystyle \begin{array}{c}D\left({r}_1,{r}_2\right)=2\times \left|{r}_1-{r}_2\right|\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(27)
$$ proximity\left({r}_1,{r}_2\right)={\left\{\left(2\times \left({R}_{max}-{R}_{min}\right)+1\right)-D\left({r}_1,{r}_2\right)\right\}}^2 $$
(28)

Figure 2(a) is the graphical representation of the proximity values plotted for the rating scale of 1 to 5. The minimum proximity value for the Agreement(r1, r2) = TRUE condition is 49, which occurs when the two ratings are unequal (r1 ≠ r2); if both ratings are the same (i.e., r1 = r2), the proximity value is 81. Similarly, for the Agreement(r1, r2) = FALSE condition, the minimum value is 1 and the maximum value is 25.

Fig. 2
figure 2

Graphical comparison of PIP values for rating scale 1–5 (a) proximity values; (b) impact values; (c) popularity values

3.2 Impact

The impact is the second critical component computed in the PIP formula; it indicates how strongly the user preferred or disliked the particular item.

$$ {\displaystyle \begin{array}{c} Impa\mathrm{c}t\left({r}_1,{r}_2\right)=\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(29)
$$ {\displaystyle \begin{array}{c} Impact\left({r}_1,{r}_2\right)=\frac{1}{\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)}\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(30)

The impact lies in the range 0.11 to 9. If Agreement(r1, r2) = TRUE, the minimum impact is 1 and the maximum is 9; if Agreement(r1, r2) = FALSE, the minimum is 0.11 and the maximum is 0.25. The impact values for different combinations are shown in Fig. 2(b).

3.3 Popularity

Popularity is calculated based on the deviation from the average rating of the item ‘i’. Let ‘\( {\overline{r}}_{I_i} \)’ be the average rating of item ‘i’ over all users.

$$ {\displaystyle \begin{array}{c} Popularity\left({r}_1,{r}_2\right)=1+{\left(\left(\frac{r_1+{r}_2}{2}\right)-{\overline{r}}_{I_i}\right)}^2\\ {} if\ \left({r}_1>{\overline{r}}_{I_i}\ and\ {r}_2>{\overline{r}}_{I_i}\ or\ {r}_1<{\overline{r}}_{I_i}\ and\ {r}_2<{\overline{r}}_{I_i}\ \right)\end{array}} $$
(31)
$$ {\displaystyle \begin{array}{c} Popularity\left({r}_1,{r}_2\right)=1\\ {} otherwise\end{array}} $$
(32)

For the popularity factor, if Agreement(r1, r2) = TRUE, the minimum value is 1 and the maximum value is 5 (for computation purposes, the value of \( {\overline{r}}_{I_i} \) is taken as 3). If Agreement(r1, r2) = FALSE, both the minimum and maximum values are 1. The popularity values for different combinations are represented in Fig. 2(c).

The graphical comparison indicates that proximity has higher-magnitude values than impact and popularity in the Agreement(r1, r2) = TRUE condition. In the Agreement(r1, r2) = FALSE condition, the impact value is very small compared with the other two components, so the similarity value depends more heavily on the proximity and popularity values than on the impact. A detailed explanation is given in the following sub-section.
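Putting Eqs. (26)-(32) together, the per-pair PIP value can be sketched as follows (our illustration, reusing the agreement helper from above; item_mean stands for \( {\overline{r}}_{I_i} \)):

```python
def pip(r1, r2, item_mean, r_min=1, r_max=5):
    """Eqs. 21 and 26-32: PIP value of one rating pair."""
    r_med = (r_max + r_min) / 2.0
    agree = agreement(r1, r2, r_min, r_max)
    d = abs(r1 - r2) if agree else 2 * abs(r1 - r2)          # Eqs. 26-27
    proximity = ((2 * (r_max - r_min) + 1) - d) ** 2         # Eq. 28
    impact = (abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1)   # Eq. 29
    if not agree:
        impact = 1.0 / impact                                # Eq. 30
    same_side = ((r1 > item_mean and r2 > item_mean) or
                 (r1 < item_mean and r2 < item_mean))
    popularity = 1 + ((r1 + r2) / 2.0 - item_mean) ** 2 if same_side else 1.0
    return proximity * impact * popularity
```

With item_mean = 3, pip(5, 5, 3) returns 3645 and pip(1, 5, 3) returns about 0.11, matching the extremes discussed in Section 3.4.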

3.4 Issues identified in PIP

In PIP, each component value lies in a wide range, and proximity, impact, and popularity are treated in different proportions. The calculation is listed for the rating scale 1 to 5. The maximum value is calculated by considering r1 as 5 and r2 as 5 for the Agreement(r1, r2) = TRUE situation. The minimum value is computed by considering r1 as 1 and r2 as 5 for the Agreement(r1, r2) = FALSE condition, which corresponds to the extreme ratings provided by the two users (uj, uh) for item ‘Ii’.

The minimum and maximum proximity values for Agreement(r1, r2) = TRUE are 49 and 81, respectively; if Agreement(r1, r2) = FALSE, the minimum and maximum values are 1 and 25. The range value for proximity is 80. If Agreement(r1, r2) = TRUE, the minimum and maximum impact values are 1 and 9; if Agreement(r1, r2) = FALSE, they are 0.11 and 0.25, respectively. The impact range is 8.89. If Agreement(r1, r2) = TRUE, the minimum and maximum values for popularity are 1 and 5; if Agreement(r1, r2) = FALSE, the minimum and maximum values are both 1. The range for popularity is 4.

If the ‘uj’ rating is 5 and the ‘uh’ rating is 5, all three components take their highest values: 81 for proximity, 9 for impact, and 5 for popularity. The PIP is the product of proximity, impact, and popularity, so the resultant value is 3645; in this calculation, proximity carries a far greater weight than impact and popularity, and each component contributes in a different proportion. Similarly, for the worst scenario, i.e., if the ‘uj’ rating is 5 and the ‘uh’ rating is 1, then proximity is 1, impact is 0.11, and popularity is 1; the final ‘PIP(r1, r2)’ value is 0.11, which is very small. Here, proximity and popularity have equal weight, but the impact has a much smaller value. The maximum PIP is 3645 and the minimum PIP is 0.11, a very large gap between the minimum and maximum values. We conclude that each factor provides important information for computing the similarity between users; yet, their different value ranges mean that in any given scenario one component can dominate. This unequal scaling provides different ranges of values for different scenarios, and the component values are non-normalized. If a direct normalization procedure is adopted to convert the values into a particular range, the values change from scenario to scenario, which leads to lower prediction accuracy [40]; this constitutes a major drawback of the existing PIP similarity measure.

4 Proposed method

Our proposed framework aims to provide an improved solution for the CF-based RS with a sparse data matrix. It consists of a modified PIP (MPIP) measure and a combined user and item-based prediction (CUIP) expression.

4.1 Modified PIP (MPIP) similarity measure

In the existing PIP expression, each component takes different values in different scenarios, and greater priority can be given to any one of the components depending on the situation. To avoid this, the component ranges are converted into zero to one in our modified similarity measure by changing the expressions.

4.1.1 Proposed proximity

The proposed proximity is a normalized value that ranges from zero to one, calculated using the absolute deviation between the two ratings r1 and r2.

$$ D\left({r}_1,{r}_2\right)=\left|{r}_1-{r}_2\right| $$
(33)
$$ {\displaystyle \begin{array}{c} Proposed\ Proximity\left({r}_1,{r}_2\right)={\left(\frac{D\left({r}_1,{r}_2\right)-\frac{\left({med}^{+}+{med}^{-}\right)}{2}}{R_{max}-{R}_{min}}\right)}^2\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(34)

Where med+ is the median value of the positive ratings (i.e., the ratings above or equal to the median value of the rating scale) and med− is the median value of the negative ratings (i.e., the ratings below the median value of the rating scale). In the Agreement(r1, r2) = TRUE condition, the average of the positive and negative median values is subtracted from the absolute difference between the two ratings. To obtain a normalized value within the range of 0 to 1, the deviation between Rmax and Rmin is used. The positive and negative median values are included in the expression to capture the closeness of the ratings.

$$ {\displaystyle \begin{array}{c} Proposed\ Proximity\left({r}_1,{r}_2\right)=\delta \ast {\left(\frac{\frac{1}{D\left({r}_1,{r}_2\right)}}{R_{max}-{R}_{min}}\right)}^2\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(35)

For Agreement(r1, r2) = FALSE, the inverse of the deviation term is used to calculate the proximity value. Our reframed expression gives higher weight to the Agreement(r1, r2) = TRUE condition and lower weight to the Agreement(r1, r2) = FALSE condition. In the FALSE condition, the positive and negative median values are not included because the two users lie on different sides of the rating scale. In both situations, the values lie within the range of 0 to 1.

$$ \delta =\left\{\begin{array}{c}0.75\kern2.5em if\ D\left({r}_1,{r}_2\right)>{R}_{med}\ \\ {}0.5\kern1.25em else\ if\ D\left({r}_1,{r}_2\right)={R}_{med}\\ {}0.25\kern6.5em otherwise\end{array}\right. $$
(36)

In MPIP, a variable penalty (δ) is multiplied with the proximity value: a higher penalty is applied for a higher deviation value and a lower penalty for a lower deviation value. This conversion reduces the magnitude of the value.

4.1.2 Proposed impact

An exponential expression is used to compute the impact value for the Agreement(r1, r2) = TRUE case; this yields a normalized impact value for the TRUE situations.

$$ {\displaystyle \begin{array}{c} proposed\ impact\left({r}_1,{r}_2\right)=\mathit{\exp}\left(-\frac{1}{\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)}\right)\\ {} if\ Agreement\left({r}_1,{r}_2\right)= TRUE\end{array}} $$
(37)

If r1 and r2 are far from the median, the impact value is higher. When both ratings are near the median, the impact is lower, which shows that both users agree on near-median values, so the impact of the ratings is very small.

$$ {\displaystyle \begin{array}{c} proposed\ impact\ \left({r}_1,{r}_2\right)=\frac{1}{\left(\left|{r}_1-{R}_{med}\right|+1\right)\times \left(\left|{r}_2-{R}_{med}\right|+1\right)}\\ {} if\ Agreement\left({r}_1,{r}_2\right)= FALSE\end{array}} $$
(38)

In the existing PIP measure, the impact values for the Agreement(r1, r2) = FALSE condition already lie within the range of zero to one. Therefore, the same expression is used in the proposed impact for the FALSE condition.

4.1.3 Proposed popularity

Popularity is the third component in the PIP similarity measure; the proposed expression accounts for both positive and negative popularity.

$$ {\displaystyle \begin{array}{c} proposed\ popularity\left({r}_1,{r}_2\right)={\mathit{\log}}_{10}\left(2+{\left(\frac{r_1+{r}_2}{2}-{\overline{r}}_{I_i}\right)}^2\right)\\ {} if\ \left({r}_1>{\overline{r}}_{I_i}\ and\ {r}_2>{\overline{r}}_{I_i}\ or\ {r}_1<{\overline{r}}_{I_i}\ and\ {r}_2<{\overline{r}}_{I_i}\right)\end{array}} $$
(39)
$$ {\displaystyle \begin{array}{c} Proposed\ popularity\left({r}_1,{r}_2\right)=0.3010\\ {} otherwise\end{array}} $$
(40)

If the two ratings are on the same side of the rating scale and their average is far from the item mean, then the popularity is very high: when two users provide similar kinds of ratings for a popular or unpopular item, they have a high similarity in the type of rating. In the existing method, the minimum value for the Agreement(r1, r2) = TRUE situation is 1, and on this basis the popularity value is set to 1 for the Agreement(r1, r2) = FALSE condition. In the proposed method, the minimum value for Agreement(r1, r2) = TRUE is 0.3010, so this value is chosen for all Agreement(r1, r2) = FALSE conditions. The proposed popularity values range from 0.3010 to 0.778 for the 1 to 5 rating scale, and the minimum popularity value is chosen for all Agreement(r1, r2) = FALSE situations.

According to Fig. 3, the proposed proximity, impact, and popularity values all lie within the range of zero to one, in contrast to the existing method, which gives a different weight to each component.

Fig. 3
figure 3

Graphical representation of MPIP the values are calculated for rating scale 1–5 (a) proposed proximity (b) proposed impact and (c) proposed popularity

4.1.4 Similarity measure computation

The MPIP expression is the product of proposed proximity, proposed impact, and proposed popularity, as shown in Eq. 41

$$ MPIP\left({r}_{u_j,{\mathrm{I}}_i},{r}_{u_h,{I}_i}\right)=\left\{\begin{array}{c} Proposed\ proximity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times \\ {} Proposed\ impact\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\times \\ {} Proposed\ popularity\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right)\end{array}\right\} $$
(41)

The similarity between users is computed as follows:

$$ sim{\left({u}_j,{u}_h\right)}^{uMPIP}={\sum}_{i=1}^{m^{\prime }} MPIP\left({r}_{u_j,{I}_i},{r}_{u_h,{I}_i}\right) $$
(42)

The similarity between ‘uj’ and ‘uh’, sim(uj, uh), is the item-wise summation of the MPIP values over the m′ co-rated items.

The same procedure is used for computing the similarity between items, which helps to find similar items. A high similarity value shows that most users provide a similar pattern of ratings for the two items; likewise, a low similarity value shows that most users provide a highly positive rating for one item and a highly negative rating for the other.

$$ sim{\left({I}_i,{I}_q\right)}^{iMPIP}={\sum}_{j=1}^{n^{\prime }} MPIP\left({r}_{u_j,{I}_i},{r}_{u_j,{I}_q}\right) $$
(43)
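For clarity, Eqs. (33)-(42) can be sketched as follows (our illustration, reusing the agreement helper from Section 3; med_pos and med_neg are the medians of the positive and negative halves of the scale, taken here as 4 and 2 on a 1 to 5 scale to match the worked cases below):

```python
import math

def mpip(r1, r2, item_mean, r_min=1, r_max=5, med_pos=4, med_neg=2):
    """Eqs. 33-41: MPIP value of one rating pair."""
    r_med = (r_max + r_min) / 2.0
    d = abs(r1 - r2)                                                    # Eq. 33
    if agreement(r1, r2, r_min, r_max):
        proximity = ((d - (med_pos + med_neg) / 2.0) / (r_max - r_min)) ** 2  # Eq. 34
        impact = math.exp(-1.0 / ((abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1)))  # Eq. 37
    else:
        # d > 0 whenever the agreement is FALSE, so 1/d is safe.
        delta = 0.75 if d > r_med else (0.5 if d == r_med else 0.25)    # Eq. 36
        proximity = delta * ((1.0 / d) / (r_max - r_min)) ** 2          # Eq. 35
        impact = 1.0 / ((abs(r1 - r_med) + 1) * (abs(r2 - r_med) + 1))  # Eq. 38
    same_side = ((r1 > item_mean and r2 > item_mean) or
                 (r1 < item_mean and r2 < item_mean))
    popularity = (math.log10(2 + ((r1 + r2) / 2.0 - item_mean) ** 2)    # Eq. 39
                  if same_side else 0.3010)                             # Eq. 40
    return proximity * impact * popularity

def sim_users_mpip(ratings_j, ratings_h, item_means):
    """Eq. 42: item-wise sum of MPIP values over the co-rated items;
    ratings_j and ratings_h map item id -> rating."""
    co = ratings_j.keys() & ratings_h.keys()
    return sum(mpip(ratings_j[i], ratings_h[i], item_means[i]) for i in co)
```

With item_mean = 3, mpip(5, 5, 3) gives about 0.39 and mpip(1, 5, 3) about 0.0001, matching Cases 1 and 3 below; Eq. (43) is obtained by summing the same per-pair values over the users who rated both items.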

A comparison of the minimum and maximum values is shown in Table 2.

Table 2 Comparison of PIP and MPIP similarity measures showing the minimum and maximum value for each component

Case 1: Positive or Negative rating.

Both uj and uh give an extreme positive or negative rating, i.e., an extreme positive rating means both users uj and uh give rating 5 for item ‘i’, and an extreme negative rating means both users give rating 1 for item ‘i’. The existing PIP method provides a proximity value of 81, an impact of 9, a popularity of 5, and a PIP value of 3645; approximately 85% of the weight comes from the proximity component. When using the MPIP, the proposed proximity is 0.56, the proposed impact is 0.89, the proposed popularity is 0.778, and the MPIP value is 0.389.

Case 2: Median rating.

If uj and uh both give the median rating, i.e., both users give rating 3 for item ‘i’, then the proximity is 81, the impact is 1, the popularity is 1, and the PIP value is 81. For the MPIP, the proposed proximity is 0.56, the proposed impact is 0.50, the proposed popularity is 0.301, and the MPIP value is 0.084.

Case 3: Difference of opinion.

If uj and uh provide opposite extreme ratings for item ‘i’, i.e., uj gives rating 1 and uh gives rating 5 or vice versa, then the proximity is 1, the impact is 0.11, the popularity is 1, and the PIP value is 0.11. For the MPIP, the proposed proximity is 0.002, the proposed impact is 0.11, the proposed popularity is 0.301, and the MPIP value is 0.0001. The MPIP value is near zero in this difference-of-opinion situation, i.e., very little weight is given to the ratings because they lie at the extremes of the rating scale and no relationship exists between the two users. This comparison proves that the MPIP values lie within a range of 0 to 1, whereas the existing method has a vast range.

4.2 Validation for MPIP similarity measure

In the MPIP, each component value lies between 0 and 1. To validate the proposed similarity measure (i.e., the modified PIP), a rank correlation test is conducted. A set of rating pairs is generated for the rating scale 1 to 5, and the PIP, PSS, and MPIP values are computed for each pair. The proximity, impact, and popularity values of PIP and MPIP, as well as the PSS values, generated for each pair of ratings are listed in Table 3.

Table 3 Comparison of PIP, MPIP, and PSS values for a set of rating pairs

The magnitude of the values changes in the case of MPIP. In the PSS method, each component ranges from zero to one, but the relative ordering of the values changes because PSS gives equal weight to both agreement and disagreement conditions. The rank correlation test is conducted for this set of values; the results are shown in Table 4.

Table 4 Correlation between MPIP, PIP, and PSS

The correlation between the PIP and MPIP components is one, which shows that a strong positive relationship exists between them. Owing to the violation of the agreement conditions, the PSS measure has much lower correlation values: the correlation between proximity and PSS proximity is high, but for impact and popularity the values are minimal. Moreover, a negative relationship exists between impact and the PSS significance term, and similarly between popularity and singularity. These results clearly show that violating the agreement conditions yields misleading information about the similarity between two ratings; for this reason, the same agreement conditions are used in our proposed method. The high correlation between the PIP and MPIP components reveals that, although the magnitudes change in our proposed method, the equivalent proportions are retained.

4.3 Existing prediction expression

The main objective of the CF-based RS is to predict user ratings based on the similarity measure. Many expressions are used in the prediction process. The user mean-based prediction expression, a widely used method that provides a good solution for CF-based RS [5, 18, 36, 41,42,43], is as follows:

$$ {P}_{u_j,{I}_i}={\overline{r}}_{u_j}+\frac{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)}{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)} $$
(44)

In this expression, ‘\( {P}_{u_j,{I}_i} \)’ is computed from the mean of user ‘j’ plus the weighted average deviation of the neighbor users ‘h’; the weighted average deviation of user ‘j’ itself is not considered in the prediction process. Moreover, only user-related information is used in the weighted deviation; item-related information is not included in the prediction expression.

From Fig. 4, the user-related mean is considered for user-based prediction, and the item-related mean for item-based prediction. In a sparse user-item rating matrix, both the mean and the deviation provide important information for the rating prediction; this is one of the shortcomings of the conventional user-related or item-related prediction expressions.

Fig. 4
figure 4

Issues in user and item-based prediction expression

4.3.1 Combined user and item-based prediction expression (CUIP)

A modified prediction expression is derived by incorporating the components related to user ‘j’ in the user-based prediction; similarly, the components related to item ‘i’ are included in the item-based prediction.

Ma and Hu [44] used a hybrid prediction expression for predicting the rating. In this expression, an additional weight parameter ‘λ’ is multiplied by the user-based prediction, and the weight ‘1 − λ’ is multiplied by the item-based prediction; ‘λ’ varies between 0 and 1. If ‘λ’ equals one, the prediction depends purely on the user-based prediction; if ‘λ’ equals zero, it depends purely on the item-based prediction. Thus, ‘λ’ is an additional parameter that must be optimized for each problem.

In the modified prediction expression, the average of the user-based and item-based predictions is taken, giving equal weight to both expressions. The modified prediction expression is shown below.

$$ {UP}_{u_j,{I}_i}={\overline{r}}_{u_j}+\frac{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)\times \left(\frac{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)+\left({r}_{u_j,{I}_i}-{\overline{r}}_{u_j}\right)}{2}\right)}{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)} $$
(45)

Where ‘UP’ is the user-based prediction, j ≠ h, j ∈ {1, 2, …, n}, h ∈ {1, 2, …, n′}, and n′ is the number of neighbor users ‘h’ who have co-rated items with user ‘j’.

$$ {IP}_{u_j,{I}_i}=\overline{r_{I_i}}+\frac{\sum_{q=1}^{m^{\prime }} sim\left({I}_i,{I}_q\right)\times \left(\frac{\left({r}_{u_j,{I}_q}-{\overline{r}}_{I_q}\right)+\left({r}_{u_j,{I}_i}-{\overline{r}}_{I_i}\right)}{2}\right)}{\sum_{q=1}^{m^{\prime }} sim\left({I}_i,{I}_q\right)} $$
(46)

Where ‘IP’ is the item-based prediction, i ≠ q, i ∈ {1, 2, …, m}, q ∈ {1, 2, …, m′}, and m′ is the number of items, other than ‘i’, rated by user ‘j’.

$$ {CUIP}_{u_j,{I}_i}=\left(\frac{\left({UP}_{u_j,{I}_i}\right)+\left({IP}_{u_j,{I}_i}\right)}{2}\right) $$
(47)

The similarity values and the deviations play a vital role in the prediction expression. The existing prediction expressions use either the user-related mean with its weighted average deviation or the item-related mean with its weighted average deviation. The user-related mean \( {\overline{r}}_{u_j} \) and the item-related mean \( {\overline{r}}_{I_i} \) are calculated for user ‘j’ and item ‘i’, but the corresponding deviations are not included; this is one of the shortcomings of the existing prediction expression. In the modified prediction expression (CUIP), this shortcoming is removed by including the user- and item-related deviation terms.

Forecasting is a vital process in CF-based RS; the forecast values help a company to identify potential customers and promote sales. In CUIP, if the user has rated the particular item, i.e., \( {r}_{u_j,{I}_i}\ne \varnothing \), the deviation terms can be computed directly. If the rating is unavailable, i.e., \( {r}_{u_j,{I}_i}=\varnothing \), computing the user- and item-related deviations becomes complex. To avoid this situation, the average of the user (uj) related mean and the item (Ii) related mean is substituted for the unavailable rating. For an unavailable rating, the user-based prediction is calculated using Eq. (48).

$$ {UP}_{u_j,{I}_i}={\overline{r}}_{u_j}+\frac{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)\times \left(\frac{\left({r}_{u_h,{I}_i}-{\overline{r}}_{u_h}\right)+\left(\frac{{\overline{r}}_{u_j}+{\overline{r}}_{I_i}}{2}-{\overline{r}}_{u_j}\right)}{2}\right)}{\sum_{h=1}^{n\prime } sim\left({u}_j,{u}_h\right)} $$
(48)

Similarly, the item-based prediction is calculated using Eq. (49)

$$ I{P}_{u_j,{I}_i}={\overline{r}}_{I_i}+\frac{\sum_{q=1}^{m\prime } sim\left({I}_i,{I}_q\right)\times \left(\frac{\left({r}_{u_j,{I}_q}-{\overline{r}}_{I_q}\right)+\left(\left(\frac{{\overline{r}}_{u_j}+{\overline{r}}_{I_i}}{2}\right)-{\overline{r}}_{I_i}\right)}{2}\right)}{\sum_{q=1}^{m\prime } sim\left({I}_i,{I}_q\right)} $$
(49)

Eq. (47) is used for computing the final prediction value for user ‘j’ and item ‘i’. The combination of the modified PIP (MPIP) similarity measure and the modified prediction expression (CUIP) is treated as the proposed method. The algorithm for the proposed method is sketched below:

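A minimal Python sketch of the CUIP prediction step (our reconstruction from Eqs. (45)-(49) and the description above, not the paper’s own listing; R is an n × m matrix with np.nan marking unavailable ratings, and sim_u, sim_i are the MPIP similarity matrices of Eqs. (42)-(43)):

```python
import numpy as np

def cuip_predict(R, sim_u, sim_i, j, i):
    """Eqs. 45-49: combined user- and item-based prediction for user j, item i."""
    u_mean = np.nanmean(R, axis=1)          # per-user means
    i_mean = np.nanmean(R, axis=0)          # per-item means
    # Target-cell deviation term; Eqs. 48-49 substitute the average of the
    # user and item means when r_{u_j, I_i} is unavailable.
    r_ji = R[j, i] if not np.isnan(R[j, i]) else (u_mean[j] + i_mean[i]) / 2.0

    num = den = 0.0                         # user-based prediction, Eq. 45 / 48
    for h in range(R.shape[0]):
        if h != j and not np.isnan(R[h, i]):
            dev = ((R[h, i] - u_mean[h]) + (r_ji - u_mean[j])) / 2.0
            num += sim_u[j, h] * dev
            den += sim_u[j, h]
    up = u_mean[j] + (num / den if den else 0.0)

    num = den = 0.0                         # item-based prediction, Eq. 46 / 49
    for q in range(R.shape[1]):
        if q != i and not np.isnan(R[j, q]):
            dev = ((R[j, q] - i_mean[q]) + (r_ji - i_mean[i])) / 2.0
            num += sim_i[i, q] * dev
            den += sim_i[i, q]
    ip = i_mean[i] + (num / den if den else 0.0)

    return (up + ip) / 2.0                  # Eq. 47
```

Sorting each user’s predicted ratings for unrated items in descending order then yields the top-k recommendations, as described below.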

The block diagram for our proposed framework is shown in Fig. 5.

Fig. 5
figure 5

The proposed framework for CF-based RS

The proposed framework consists of two phases: the first phase is the computation of the similarity matrices (user similarity and item similarity), and the second phase is the prediction process. Initially, an input user-item rating matrix is required for computing the similarity values. The similarity between users is computed by using the MPIP expression (Eq. (42)); similarly, the item similarity is computed by using the MPIP expression (Eq. (43)). Both the user- and item-related similarity values provide valuable information for rating prediction. For available ratings, Eqs. (45), (46), and (47) are used to compute the prediction values; for unavailable ratings, Eqs. (48), (49), and (47) are adopted. Based on the predicted values, the items are sorted in descending order, and the top ‘k’ items are recommended to the users.

5 Experiments

Datasets such as MovieLens1MB, Netflix, Epinions, CiaoDVD, MovieTweet, FilmTrust, and MovieLens100KB are used for comparing the conventional and proposed methods. We used MovieLens100KB and Netflix, which were used in [18], and other benchmark datasets that are often used by researchers for CF-based RS. To validate our proposed method, eleven state-of-the-art methods (PCC, COS, JACC, JMSD, PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN) are used for comparison.

5.1 Characteristics of sub-problems generated from the dataset

The dataset descriptions (number of users, number of items, number of ratings, and sparsity) are listed in Table 5. The datasets are arranged by sparsity rank (from higher to lower sparsity).

Table 5 Summary of the dataset used for this study

[a] http://grouplens.org/datasets/movielens/1m

[b] http://grouplens.org/datasets/movielens/100k

[c] http://www.netflixprize.com

[d] http://www.trustlet.org/downloaded epinions.html.

[e] http://www.librec.net/datasets/CiaoDVD.zip

[f] https://github.com/sidooms/MovieTweetings

[g] http://www.librec.net/datasets/filmtrust.zip

The above-mentioned datasets are large, and computing the similarity matrix and predicting the ratings for datasets of this size entails high computational complexity. Because of this, subsets with different levels of sparsity \( \left(\left(1-\left(\frac{\#R}{n\ast m}\right)\right)\ast 100\right) \) are generated from the user-item rating matrix to validate the proposed method; a small helper for this sparsity computation is sketched below. The schema used for creating the different sub-problems is shown in Fig. 6.
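The sparsity formula made explicit (variable names are ours):

```python
def sparsity_percent(num_ratings, n_users, m_items):
    """Sparsity of a sub-problem: (1 - #R / (n * m)) * 100."""
    return (1.0 - num_ratings / (n_users * m_items)) * 100.0
```

For example, 100,000 ratings over 943 users and 1,682 items (the usual ML100KB shape) give a sparsity of about 93.7%.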

Fig. 6
figure 6

Schema for generating different sub-problems

The schema comprises four levels: level 1 relates to the dataset, level 2 to the users, level 3 to the items, and level 4 to the ratings. The user level consists of two sub-levels: in one, the users vary without any constraints; in the other, the users are restricted to 25%, 50%, and 75% of the users in the dataset. The item level likewise consists of two sub-levels: in one, the items vary without any constraints; in the other, the items are restricted to 25%, 50%, and 75% of the items in the dataset. The final level concerns the ratings and varies the number of ratings from 1% to 50% of the dataset in increments of 1%.

For each dataset, one can create 800 different sub-problems by using the above schema. The number of sub-problems created by each path is shown in Table 6.

Table 6 Number of sub-problems generated by each path

5.2 Limitation of the above schema

In this schema, the maximum percentage of users and items is 75%. For this specific combination, it is difficult to create sub-problems for all the rating levels because the required percentage of ratings (1% to 50%) cannot be extracted for the higher-order matrices. For all the sub-problems, it is ensured that each user and item has a minimum of two co-rated values. In each sub-problem, it is challenging to generate exactly 1% to 50% of the ratings; to overcome this, an approximate percentage of the rating values is extracted. For each dataset, this results in fewer than 800 sub-problems, as shown in Table 6.

Table 7 provides the number of feasible sub-problems created for each dataset using the schema.

Table 7 Number of feasible sub-problems

5.3 Performance criteria

The Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are the most commonly used evaluation criteria for validating CF-based RS performance [45, 46].

MAE is the mean of absolute deviation between the actual and the predicted rating.

$$ MAE=\frac{1}{n}{\sum}_{j=1}^n\frac{1}{m}{\sum}_{i=1}^m\left|{r}_{u_j,{I}_i}-{P}_{u_j,{I}_i}\right| $$
(50)

Where ‘n’ is the total number of users and ‘m’ is the total number of items; ‘\( {P}_{u_j,{I}_i} \)’ is the predicted rating for the jth user and ith item, \( {r}_{u_j,{I}_i} \) is the actual rating for the jth user and ith item, j ∈ {1, 2, …, n}, and i ∈ {1, 2, …, m}.

RMSE is the square root of the mean squared deviation between the actual and predicted ratings; it is also known as the standard deviation of the residuals, or the forecasting error. The formula is given below:

$$ RMSE=\frac{1}{n}{\sum}_{j=1}^n\sqrt{\frac{1}{m}{\sum}_{i=1}^m{\left({r}_{u_j,{I}_i}-{P}_{u_j,{I}_i}\right)}^2} $$
(51)

Lower MAE and RMSE values indicate the better method. The predicted ratings (\( {P}_{u_j}{,}_{I_i} \)) for the existing similarity measures (PCC, COS, JACC, JMSD, PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN) are calculated by using Eq. (44); the predicted ratings for the proposed method are calculated by using Eqs. (45)-(47). Finally, the predicted ratings are rounded off to the nearest integer for both the existing and proposed methods.
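A sketch of the two criteria (ours; MAE is averaged over all rated cells for simplicity, while RMSE follows Eq. (51) as a per-user RMSE averaged over the users):

```python
import numpy as np

def mae_rmse(R_true, R_pred):
    """Eqs. 50-51 over the rated cells; np.nan marks unrated cells."""
    mask = ~np.isnan(R_true)
    mae = float(np.abs(R_true - R_pred)[mask].mean())
    per_user = [np.sqrt(((R_true[j][mask[j]] - R_pred[j][mask[j]]) ** 2).mean())
                for j in range(R_true.shape[0]) if mask[j].any()]
    return mae, float(np.mean(per_user))
```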

5.4 Friedman rank test

The Friedman rank test is a non-parametric test for finding differences among methods across multiple replications [47, 48]; it does not assume that the data come from a particular distribution (e.g., the normal distribution). The Friedman statistic is calculated by using the following equation:

$$ {F}_r=\frac{12\ast TS}{l\left(l+1\right)}\left({\sum}_{y=1}^l{R}_y^2-\frac{l{\left(l+1\right)}^2}{4}\right) $$
(52)

Where Fr is the calculated Friedman value, \( {R}_y^2 \) is the squared average rank of method ‘y’ (1 ≤ y ≤ l), ‘l’ is the total number of methods, ‘sp’ indexes the sub-problems, sp ∈ {1, 2, …, TS}, and ‘TS’ is the total number of sub-problems generated. The hypotheses framed for this test are: the null hypothesis (H0) states that no significant difference occurs between the performances of the different methods, whereas the alternative hypothesis (H1) states that a significant difference occurs between the performances of the different methods [49].

5.5 McNemar’s test

McNemar’s test is a non-parametric test used to analyze whether the performance of two methods has a statistically significant difference [50, 51]. The contingency table used to compare the two methods is given in Table 8.

Table 8 Contingency table to compare two methods

Where ‘A’ is the number of ratings correctly predicted by both Methods 1 and 2, B is the number of ratings correctly predicted by Method 1 but incorrectly predicted by Method 2, C is the number of ratings correctly predicted by Method 2 but incorrectly predicted by Method 1, and D is the number of ratings incorrectly predicted by both methods.

$$ {\chi}^2=\frac{{\left(\left|B-C\right|-1\right)}^2}{B+C} $$
(53)

The null hypothesis (H0) for this χ2 test is that the probabilities Pr(B) and Pr(C) are equal; the alternative hypothesis (H1) is that the performances of the two methods are not equal.
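Eq. (53) depends only on the discordant counts B and C, as the following minimal sketch shows; the counts used in the example are hypothetical:

```python
def mcnemar_chi2(B: int, C: int) -> float:
    """Eq. (53): chi-squared statistic with continuity correction,
    computed from the discordant counts of the contingency table."""
    return (abs(B - C) - 1) ** 2 / (B + C)

# Hypothetical counts, compared against the chi-squared critical value
# 3.84 (1 degree of freedom, significance level 0.05).
print(mcnemar_chi2(B=40, C=85) > 3.84)  # True: a significant difference
```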

For clarity, a graphical representation of the McNemar contingency table values is presented in Fig. 7. The A value gives the number of ratings correctly predicted by both method 1 and method 2, and the D value gives the number of ratings incorrectly predicted by both methods. The B and C values are used to find the better method: if B is greater than C, method 1 performs better than method 2; similarly, if C is greater than B, method 2 performs better than method 1.

Fig. 7 Graphical representation of McNemar table values

In addition to the McNemar test, the precision, recall, and F1-measure are calculated. These measures are computed from the confusion matrix obtained by converting each rating into a good or bad recommendation. The format of the confusion matrix used for computing precision, recall, and F1-measure is given in Table 9.

Table 9 Confusion matrix format for CF-based RS

Where True Positive (TP) is the number of actual good ratings provided by the user and correctly predicted by the CF-based RS, False Negative (FN) is the number of actual good ratings predicted as bad ratings, False Positive (FP) is the number of actual bad ratings predicted as good ratings, and True Negative (TN) is the number of actual bad ratings correctly predicted.

The rating values are classified into two groups, i.e., good and bad ratings. Ratings that lie between Rmed and Rmax are treated as good ratings; values less than Rmed are treated as bad ratings. For example, on a rating scale of 1 to 5, the good rating values are 3, 4, and 5, and the bad ratings are 1 and 2. The same process is carried out for other rating scales.
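A minimal sketch of this conversion and of the confusion-matrix counts of Table 9 follows; it assumes a dense user-item matrix with 0 marking unrated cells and Rmed = 3 for a 1-to-5 scale:

```python
import numpy as np

def confusion_counts(actual: np.ndarray, predicted: np.ndarray, r_med: float = 3.0):
    """Convert ratings into good/bad labels and tally the Table 9 counts."""
    rated = actual > 0                      # skip unrated cells (assumed 0 = unrated)
    good_actual = actual[rated] >= r_med    # good: Rmed to Rmax
    good_pred = predicted[rated] >= r_med
    TP = int(np.sum(good_actual & good_pred))    # good rating, predicted good
    FN = int(np.sum(good_actual & ~good_pred))   # good rating, predicted bad
    FP = int(np.sum(~good_actual & good_pred))   # bad rating, predicted good
    TN = int(np.sum(~good_actual & ~good_pred))  # bad rating, predicted bad
    return TP, FN, FP, TN
```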

Precision is the ratio of the number of good-rating (i.e., relevant) recommendations made to the total number of recommendations.

$$ precision=\frac{TP}{TP+ FP} $$
(54)

Recall is the ratio of the number of good ratings recommended to the number of actual good ratings.

$$ Recall=\frac{TP}{TP+ FN} $$
(55)

F1-measure is the harmonic mean of precision and recall:

$$ {F}_1=\frac{2\times Precision\times Recall}{Precision+ Recall} $$
(56)
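The three measures of Eqs. (54)–(56) then follow directly from the confusion-matrix counts; the counts in the example are hypothetical:

```python
def precision_recall_f1(TP: int, FN: int, FP: int):
    """Eqs. (54)-(56): precision, recall, and F1-measure from the counts."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(TP=80, FN=10, FP=20))  # (0.8, 0.888..., 0.842...)
```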

Higher precision, recall, and F1-measure values indicate better performance. The overall process of this comparison is shown in Fig. 8.

Fig. 8 Overall process carried out in this study

5.6 Results and discussion

The results obtained for the proposed framework and the other methods are compared below using the performance criteria.

The number of feasible sub-problems generated using the schema varies for each dataset. For each sub-problem, the conventional methods and the proposed framework are used to predict the rating values. The MAE and RMSE values are calculated for all the sub-problems by using Eqs. (50) and (51), respectively, producing a set of MAE and RMSE values for each dataset. From these sets, the minimum MAE and RMSE are chosen and listed in Tables 10 and 11.

Table 10 Comparison of minimum MAE values of the conventional and proposed method
Table 11 Comparison of minimum RMSE values of the conventional and proposed method

Our proposed method provides lower MAE values for all the datasets. The second-best results are obtained by SOQN for the Epinions, FlimTrust, NetFlix, and ML1MB datasets; RJMSD attains the next-best solution for the CiaoDVD and MovieTweet datasets, and NHSM provides the next-best result for the ML100KB dataset. The percentages of improvement of our proposed method over the second-best solution are 35.56% for Epinions, 45.21% for CiaoDVD, 33.8% for MovieTweet, 51.33% for FlimTrust, 46.03% for NetFlix, 47.78% for ML1MB, and 47.47% for ML100KB. For the existing similarity measures with the user-based prediction expression, high variations exist between the actual and predicted ratings; these variations lead to maximum MAE values, which decreases the prediction quality. In our proposed approach, the variation is minimal and the predicted ratings coincide with the actual ratings in most cases, which improves the effectiveness of the CF-based RS.

The table values indicate that the standard deviation of the prediction error is minimum for the proposed method compared to the other methods. The percentages of improvement of our proposed method over the next-lowest RMSE values are 42.16%, 41.62%, 42.50%, 61.54%, 48.59%, 44.49%, and 45.35% for the Epinions, CiaoDVD, MovieTweet, FlimTrust, NetFlix, ML1MB, and ML100KB datasets, respectively. The comparisons of maximum MAE and RMSE values for all the datasets are arranged in Tables 12 and 13.

Table 12 Comparison of maximum MAE values of the conventional and proposed method
Table 13 Comparison of maximum RMSE values of the conventional and proposed method

Tables 12 and 13 show the maximum MAE and RMSE chosen for each method. It is observed that the proposed method achieves better solutions than the conventional methods in terms of lower MAE and RMSE values: on average, improvements of 46.71% in MAE and 45.18% in RMSE are obtained over the next-best results. The comparisons of average MAE and RMSE values are listed in Tables 14 and 15.

Table 14 Comparison of average MAE values of the conventional and proposed method
Table 15 Comparison of average RMSE values of the conventional and proposed method

From Tables 14 and 15, it is noticed that, among all the methods listed, the proposed method attains the smallest MAE and RMSE values for all the datasets. The proposed framework is the combination of the MPIP similarity measure and the CUIP prediction expression. In MPIP, all three components are converted to the range 0 to 1; these normalized values help to find a better similarity value between users or items. The similarity then serves as an input to the prediction expression; in the proposed method, the item-related components and the deviation of user ‘j’ are included, which leads to accurate rating prediction. For calculating MAE and RMSE, the actual ranges of the rating scales are used. Compared to the existing methods, the predicted ratings of our proposed method are nearer to the actual ratings, which reduces the deviation between actual and predicted ratings.

The average MAE and RMSE values are computed by considering all the sub-problems. The percentage of improvement for the proposed method over other methods is calculated and listed in Table 16.

Table 16 Comparison of percentage of improvement of average MAE and RMSE for the proposed method

The results listed in Table 16 show that the SOQN method holds the second position by obtaining the lowest average MAE and RMSE among the existing methods for the Epinions, CiaoDVD, and ML1MB datasets. Similarly, RJMSD provides the second-best solution for the FlimTrust and NetFlix datasets, PCC provides the next-best solution for the MovieTweet dataset, and BCFcorr provides the second-best solution for the ML100KB dataset. Our proposed framework yields average improvements of 53.81% in MAE and 51.60% in RMSE over the conventional similarity measures, namely PCC, COS, JACC, and JMSD. Similarly, the average improvements in MAE and RMSE for the proposed framework are 49.44% and 49.24% over the specific similarity measures used for CF-based RS, namely PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN. This improved solution enhances the accuracy of the CF-based RS. The above-mentioned tables indicate that the proposed method exhibits superiority over the other methods.

The Friedman rank test is conducted to test whether the performance of all the methods is the same. The MAE values are computed for all the sub-problems and, for each sub-problem (sp), are ranked from 1 (best result) to ‘l’ (worst result). The Fr values are calculated based on the squared average rank values of each method; for each dataset, the calculated Fr is listed in Table 17.

Table 17 Comparison of Fr values using MAE as reference for all the dataset

The calculated Fr values are higher than the χ2 critical value of 19.67 (11 degrees of freedom at a significance level of 0.05). The p-values obtained from the Friedman test are near zero, which leads to acceptance of the alternate hypothesis for all the datasets: the performance of all the methods is not equal. The comparison of average rank values for the conventional methods and the proposed method for all datasets is shown in Table 18.

Table 18 Comparison of average rank values obtained using MAE as references for all dataset

The average rank of our proposed method is 1 for all the datasets. This indicates that the proposed framework provides the minimum MAE values for all the sub-problems generated by using the schema. Similarly, a comparison of Fr values for RMSE is listed in Table 19.

Table 19 Comparison of Fr values using RMSE as reference for all the dataset

The p-values of the Friedman rank test in Table 19 result in acceptance of the alternate hypothesis for all the datasets; hence, the performance of all the methods is not the same. The comparison of the average rank values obtained using RMSE as a reference for all the datasets is listed in Table 20.

Table 20 Comparison of average rank values obtained using RMSE as references for all dataset

The rank values show that our proposed method performs better than the existing methods: compared to the other methods, the proposed framework attains the top rank across all the sub-problems. The proposed method therefore yields a better solution to sparsity problems than the existing similarity measures with the user-based prediction expression.

To further validate the proposed method, McNemar’s test is conducted on the confusion matrix. The maximum and minimum MAE are chosen from the conventional methods for each dataset, and the corresponding confusion matrices are tested through McNemar’s test, with the proposed framework treated as the reference. The calculated χ2 values for all the datasets are listed in Table 21.

Table 21 The calculated χ2 value for minimum MAE of the proposed method against the conventional method

The calculated χ2 values are higher than the χ2 table value; this results in acceptance of the alternate hypothesis for McNemar’s test, which shows that the performances of the existing and proposed methods are not equal. The B and C values are the primary components required for conducting McNemar’s test, which is performed by considering our proposed framework as the reference: B is the number of ratings incorrectly predicted by our proposed framework but correctly predicted by the existing method, and C is the number of ratings correctly predicted by our proposed framework but incorrectly predicted by the existing method. The comparison of B and C values is listed in Table 22.

Table 22 Comparison of B and C values for minimum MAE of the conventional and proposed method

The comparison results in Table 22 clearly show that the C values are higher than the B values, i.e., the number of ratings correctly predicted by our proposed framework is higher than that of the existing methods.

The maximum MAE values are chosen from the feasible sub-problems generated for each dataset, and the corresponding predicted values are used to conduct McNemar’s test by means of a confusion matrix. The calculated χ2 and p-values are listed in Table 23.

Table 23 The calculated χ2 value for maximum MAE of the proposed method against the conventional method

The calculated χ2 values are greater than the table value, which agrees with the alternate hypothesis, i.e., a significant difference exists between the conventional and proposed methods; the proposed method is better than the existing methods for all the datasets. Table 24 compares the B and C values for the maximum MAE.

Table 24 Comparison of B and C values for maximum MAE of the conventional and proposed method

The comparison of B and C values for the maximum MAE reveals that C is greater than B for all the datasets. Compared to the existing methods, our proposed method correctly predicts more good and bad ratings.

The precision, recall, and F1-measure are calculated for each sub-problem. The summaries of average precision, recall, and F1-measure for the conventional and proposed methods are reported in Tables 25, 26, and 27.

Table 25 Comparison of average precision values for the conventional and proposed method
Table 26 Comparison of average recall values for the conventional and proposed method
Table 27 Comparison of average F1 measure for the conventional and proposed method

The comparison in Table 25 indicates that the average precision value is higher for our proposed framework. These results show that, out of the total predicted ratings, most of the actual good-rating items are correctly predicted by our proposed method.

The results arranged in Table 26 show that the recall values of our proposed method are greater than those of the other methods, indicating that more of the actual good-rating items are correctly predicted by our proposed method. Table 27 compares the average F1-measure values for the conventional and proposed methods.

The F1-measure combines the precision and recall values, and the proposed method attains better results for all the datasets. To validate the effectiveness of the proposed method, the Friedman rank test is conducted on the F1-measure values; the Fr values are listed in Table 28.

Table 28 Comparison of Fr values using F1 as a reference for all the dataset

The calculated Fr values confirm that the performance of all the methods is not equal. The average ranks used for the Fr calculation are shown in Table 29.

Table 29 Comparison of average rank values obtained using F1 as references for all dataset

Table 29 indicates that the proposed method provides the highest F1-measure for all the feasible sub-problems. Furthermore, most of the actual good-rating items are correctly recommended by the proposed method.

The main issue in CF-based RS is the sparsity problem. A new framework is introduced to solve the sparsity problem and to enhance the prediction performance, and experiments are conducted on various levels of sparse data. The MAE and RMSE results for our proposed approach are minimal compared to the conventional similarity measures, namely PCC, COS, JACC, and JMSD, with the user-based prediction expression. The specific similarity measures used for CF-based RS, namely PIP, NHSM, BCFcorr, BCFMed, RJACC, RJMSD, and SOQN with the user-based prediction expression, also attain higher MAE and RMSE values than the proposed method for all the datasets. This clearly shows that the ratings predicted by the proposed framework are closer to the actual ratings, which minimizes the error values. The average ranking value for the MAE and RMSE of our proposed framework is 1, which denotes that, across the various sub-problems, our proposed method provides the better prediction ratings. The McNemar test results in acceptance of the alternate hypothesis, which shows that the performances of the existing and proposed methods are not the same; each conventional method is tested against our proposed method, and in all the comparisons our proposed approach offers an improved solution in terms of accurate prediction. Finally, the precision, recall, and F1-measure values for our proposed approach are above 0.8 and higher than those of the other methods. The results obtained from the analysis indicate that our proposed framework improves CF-based RS performance by correctly predicting more good-rating items and reducing the misclassification error. This improved prediction helps a company to identify and recommend products that are more relevant to online users.

6 Conclusion

CF-based RS depends on similarity measures, of which PIP is one of the most popular techniques for calculating the similarity between users. However, the ranges of the proximity-impact-popularity values are wide, and each component carries a different weight in different scenarios; i.e., the magnitude of each component differs. This is a serious limitation of the existing PIP measure; therefore, we have developed a modified PIP similarity measure that provides a common value range between 0 and 1 for all three components, resulting in equal priority. We have also developed a modified prediction expression that includes the item-based average and weighted average deviation along with the user-based average and weighted average deviation to improve accuracy. Finally, a procedure for forecasting unrated items has been introduced to improve the recommendation performance. The proposed framework was tested by using various benchmark datasets, namely ML1MB, NetFlix, ML100KB, CiaoDVD, Epinions, MovieTweet, and FlimTrust. The entire analysis was conducted on sub-problems with different levels of sparsity generated from the user-item rating matrix; the various levels of sparse data were created by varying the numbers of users, items, and ratings for all datasets. The results obtained from the proposed method are compared with those of the conventional methods. The proposed framework provides better results for all datasets, yielding lower MAE and RMSE than the existing methods. The McNemar test is conducted to validate the proposed method in terms of good and bad rating recommendations; this analysis results in the acceptance of the alternate hypothesis (i.e., the performances of the conventional and proposed methods are not equal). Also, precision, recall, and F1-measure are calculated to identify the best method based on the confusion matrix, and the proposed method attains higher precision, recall, and F1-measure. Finally, a Friedman rank test was conducted on the MAE, RMSE, and F1 values. The statistical results show that our proposed framework outperformed the other methods.