1 Introduction

Digital information is growing massively in this era of information technology, with data being generated at the speed of thought. The availability of the internet and the glamour of the online world have led people to share large amounts of information about themselves, and have drawn e-commerce companies to study user behavior so that they can attract more customers with better customized products and recommendations, and thereby increase business profits or revenue. Recommendation systems are information filtering systems that deal with the problem of information overload [8] by filtering the vital fragments out of a large amount of dynamically generated information according to the user’s preferences, interests, or observed behavior towards items [12]. A recommendation system can predict a user’s preference for an item by finding and analysing similar “User-Item-Rating” pairs.

The work presented here is a recommendation system with hybrid filtering in which an SVD model is applied on top of the S3M [9] metric. S3M, the Set Sequence Similarity Measure, is a linear combination, weighted by a parameter ‘p’, of a sequence similarity term and a set (Jaccard) similarity term. By varying the value of ‘p’ we compare the resulting recommendation values. The content we experiment on is a dataset of ratings of websites visited by users. The major problem in collaborative recommendation systems is sparsity, and our work aims to reduce sparsity by suggesting recommendation ratings ‘Ri’. The recommended ratings are compared while varying the weights ‘p’ and ‘q’, which are applied to the qualitative (sequence) and quantitative (content) parts of the linear similarity equation of S3M [9]. The paper thus focuses on web usage mining together with user-item ratings.

S3M is used to find the similarity among users, and rough set based upper approximation clustering is then applied to form soft clusters. The technique generates overlapping clusters: a user who shows interest in, or rates, several websites can belong to more than one cluster. We have compared the performance of our method with various conventional methods. The paper follows the work on web recommendation by Mishra et al. [10]; the recommendation system differs from sequential pattern mining algorithms. In our work we study the variation of ‘p’, giving more weight to content (quantitative) information than to the sequential (qualitative) information emphasised by Mishra et al. [10]. User similarity is calculated by S3M, SVD is used for sparsity removal, and Ri is calculated on the basis of the prediction quotient formula.

Our work was motivated by the major issues of recommender systems, foremost among them the sparsity problem in recommender systems with collaborative filtering. Most web users access webpages sequentially. The algorithm is inspired by the work on web recommendation by Mishra et al. [10], which is a web content mining and sequence mining based algorithm operating on sequential data. There, both the similarity calculation and the prediction weight calculation adhere to the sequential data and the sequential behavior of web sessions.

The method proposed here is novel in that it takes the user-item rating matrix as its data. By changing the value of ‘p’, the similarity is calculated with partial consideration of the sequential nature of the data and greater consideration of the content. The prediction is made not only on the basis of the cluster of similar users, but also on the similar users who gave the same rating as the new user for a particular website. The major contribution is that, in order to remove sparsity, the prediction vector is proportional to the rating values given by similar users in the cluster. The selection of users for computing the prediction quotient is also proportional to the rating. As collaborative filtering primarily works on user-item ratings, these two levels of rating based prediction make the approach promising in terms of accuracy.

The paper is organised as follows. This section introduces the work, with the basic idea of the proposed method, its motivation and its contribution. Section 2 reviews the literature on the sparsity problem and the work done to alleviate it. Section 3 explains the methodology adopted and the dataset. Section 4 presents the results, and Section 5 concludes the paper.

2 Related Work

The evolution of recommendation systems began with research that defined a recommendation system as a decision making strategy for users in complex information environments [14]. A recommendation system is also used as an e-commerce tool that helps users filter through records of knowledge related to their interests and preferences [16]. The phases of a recommendation system are information collection, learning (or filtering), and recommendation. In the filtering phase the approach may be content based, which predicts on the basis of the user’s own information and ignores the contribution of other users, or collaborative, which uses the user-item ratings of other, similar users. Hybrid filtering approaches harness the benefits of both techniques [18]. Collaborative filtering generates a matrix from user preferences or likes for items and finds similar users based on relevant interests. This approach is of two types: memory based and model based [1, 5, 6].

Memory based collaborative filtering computes the similarity between user-item ratings. Memory based algorithms are heuristics that make recommendations based on the entire collection of items already rated by users [4, 11, 15]. Model based collaborative filtering builds a descriptive model of the system from the user’s preferences, using data mining (DM) and machine learning (ML) techniques such as Bayesian models and clustering models.

Recommendation systems face several problems that affect their efficiency; recommendation quality is affected by the cold start problem, data sparsity, scalability, synonymy and the Matthew effect [2, 3, 13, 17]. These problems reduce commercial benefits to some extent. The sparsity problem is the lack of sufficient information that arises when users have rated only a few items in the matrix. Data sparsity directly affects the coverage of the recommendation results [17]: a sparse matrix makes it difficult to locate proper neighbours, which finally leads to weak recommendations. Model based techniques address the sparsity problem. The problem also exists in the user-product based Product Attribute Model, where it stems from the subjectivity of product reviews, since these reviews do not cover all aspects of a product. There it is resolved by the Multiplication Convergence Rule and Constraint Condition equations, which find replacements for the sparse values [19]. Mishra et al. [10] proposed a web recommendation system based on sequential mining and web mining that applies a weight calculation to substitute the next web page visit vector, which is a sparse vector. S3M [9] is the similarity measure for set and sequence similarity proposed by Kumar et al. [15]; it is a linear equation based on the weighting parameter ‘p’, in which the qualitative (sequence) part is given by SeqSim(A,B) and the quantitative (content matching) part by SetSim(A,B). S3M is used to find similar users. Singular value decomposition (SVD) is a model based technique that uses previous user-item ratings to improve the performance of collaborative filtering. SVD is used in the OptSpace algorithm by Keshavan et al. [7] to deal with the matrix completion problem.

In this paper we follow the work on web recommendation by Mishra et al. [10]. The recommendation system is different from sequential pattern mining algorithms. User similarity is computed with the S3M metric, the SVD model is used to reduce sparsity significantly, and Ri is calculated on the basis of the formula proposed by Mishra et al. [10].

3 Methodology

The work in this paper focuses on recommendations generated by applying the SVD model to soft clusters, which are formed from user similarity computed with S3M, following Mishra et al. [10]. Different values of ‘p’ are taken, and the resulting values of ‘Ri’ from the model proposed by Mishra et al. [10] are compared.

The methodology, shown in Fig. 1, is divided into three parts: Response Matrix ‘A’ Generation, Prediction Quotient Qij Calculation, and Recommendation Vector ‘Ri’ Generation.

Fig. 1 Methodology

Our experiments were performed on a dataset generated manually from a survey of university undergraduate (UG) students, who were asked which websites they visit and to rate them. The dataset contains the website ratings and the categories of websites of interest, along with a partial record of the preferred website sequence. It is represented as a user-website rating matrix. Users’ ratings for websites range from 1 to 5, where 5 is the highest rating and 1 the lowest. In the resulting user-item-rating matrix, each item is a frequently visited website. The similarity of users in the dataset is found with full use of content similarity and partial consideration of the order (sequence) in which the user visited the websites.

3.1 Response Matrix ‘A’ Generation

The work proceeds with the formation of clusters. These are soft clusters formed on the basis of websites, since a user may have multiple interests and may therefore belong to multiple clusters. A similarity upper approximation based clustering algorithm is used; that is, the recommendation system uses a rough set based clustering approach. The similarity between users is calculated with a similarity measure (metric). Many similarity metrics exist, such as cosine and Jaccard. The metric used by Kumar et al. [15] is the set sequence similarity measure, which considers not only the similarity between two sets of data (vectors or ordered sets) but also the sequence of their elements (Fig. 2).

Fig. 2 Response matrix ‘A’ generation

The similarity upper approximation algorithm we follow for cluster formation is the one proposed by Mishra et al. [15]. The similarity of users is measured on the basis of S3M [9]. S3M, the Set Sequence Similarity Measure of Kumar et al. [15], is calculated as follows:

$$ {\text{S}}^{3}{\text{M}}(A,B) = p \cdot {\text{SeqSim}}(A,B) + q \cdot {\text{SetSim}}(A,B) $$
(1)

where ‘p’ is the weight of the qualitative (sequence similarity) part [9] and q = (1 − p) is the weight of the quantitative (content similarity) part, so that p + q = 1.

The following example illustrates the similarity calculation. Let two users have the following preferred website sequences:

$$ U_{\text{A}} = \{1, 4, 18, 20, 11, 15, 6, 8, 5, 12\} \quad {\text{and}} \quad U_{\text{B}} = \{4, 5, 11, 8, 1, 2, 3, 7, 9, 18, 15, 20\} $$

The lengths of the two sequences are

$$ L_{\text{A}} = |U_{\text{A}}| = 10, \quad L_{\text{B}} = |U_{\text{B}}| = 12 $$

A longest common subsequence (LCS) of the two sequences is {4, 11, 8}, so the length of the LCS is

$$ {\text{LLCS}}(U_{\text{A}}, U_{\text{B}}) = |\{4, 11, 8\}| = 3 $$

and therefore

$$ {\text{SeqSim}}(U_{\text{A}}, U_{\text{B}}) = \frac{{\text{LLCS}}(U_{\text{A}}, U_{\text{B}})}{\max(|U_{\text{A}}|, |U_{\text{B}}|)} = \frac{3}{12} = 0.25 $$

For the set part,

$$ U_{\text{A}} \cap U_{\text{B}} = \{1, 4, 5, 8, 11, 15, 18, 20\} \quad {\text{and}} \quad U_{\text{A}} \cup U_{\text{B}} = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 15, 18, 20\} $$

$$ {\text{SetSim}}(U_{\text{A}}, U_{\text{B}}) = \frac{|U_{\text{A}} \cap U_{\text{B}}|}{|U_{\text{A}} \cup U_{\text{B}}|} = \frac{8}{14} \approx 0.57 $$

With p = 0.7 and q = 0.3, Eq. (1) then gives S3M(U_A, U_B) = 0.7 × 0.25 + 0.3 × 0.57 ≈ 0.35.
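The worked example above can be reproduced with a short Python sketch. The function names (`lcs_length`, `seq_sim`, `set_sim`, `s3m`) are illustrative and not taken from the cited works; the sketch assumes non-empty sequences and computes the LCS length with the standard dynamic programming recurrence.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def seq_sim(a, b):
    """Sequence similarity: LLCS divided by the longer sequence length."""
    return lcs_length(a, b) / max(len(a), len(b))


def set_sim(a, b):
    """Set (Jaccard) similarity: |A intersect B| / |A union B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def s3m(a, b, p):
    """Eq. (1) with q = 1 - p."""
    return p * seq_sim(a, b) + (1 - p) * set_sim(a, b)


u_a = [1, 4, 18, 20, 11, 15, 6, 8, 5, 12]
u_b = [4, 5, 11, 8, 1, 2, 3, 7, 9, 18, 15, 20]

print(seq_sim(u_a, u_b))               # 0.25
print(round(set_sim(u_a, u_b), 2))     # 0.57
print(round(s3m(u_a, u_b, p=0.7), 3))  # 0.346
```

Setting `p` close to 1 makes the measure depend mostly on visiting order, while `p` close to 0 reduces it to the Jaccard similarity of the visited sets.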

Let “Sm” be a similarity matrix (Table 1) such that Sm[i,j] = αij, where αij is the S3M similarity between users Ui and Uj. The value of αij is 1 whenever i = j. The values of α in Table 1 are calculated from the S3M formula, Eq. (1), with p = 0.7.

Table 1 Similarity matrix ‘Sm’ of the users in the dataset
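To illustrate how a similarity matrix can lead to overlapping (soft) clusters, the following is a deliberately simplified sketch: each user's cluster is the set of users whose similarity to that user reaches a threshold. This single-pass thresholding is an assumption made for illustration and is not the full similarity upper approximation algorithm of the cited works; the threshold value and the matrix entries are hypothetical and are not the values of Table 1.

```python
def soft_clusters(sm, threshold):
    """For each user i, collect every user j with sm[i][j] >= threshold.

    The returned clusters may overlap, so one user can belong to several.
    """
    n = len(sm)
    return [{j for j in range(n) if sm[i][j] >= threshold} for i in range(n)]


# Hypothetical 4-user similarity matrix (illustrative values only).
sm = [
    [1.0, 0.6, 0.2, 0.5],
    [0.6, 1.0, 0.1, 0.3],
    [0.2, 0.1, 1.0, 0.7],
    [0.5, 0.3, 0.7, 1.0],
]

clusters = soft_clusters(sm, threshold=0.5)
print(clusters)  # user 3 appears in the clusters of users 0, 2 and 3
```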

To classify web users, the response matrix A is formed by selecting the top ‘D’ clusters, choosing the higher density clusters in such a way that the cluster density ‘Z’ of each selected cluster satisfies Z > avg_cluster_density. The rows of the response matrix are the ‘D’ selected clusters, and its columns are all ‘N’ websites rated by users. Each row vector of ‘A’ holds the average rating of each website by the users of the corresponding cluster: Ai = {Aij: average rating of Wj by all the users of Ci}. The matrix A is therefore of size D × N, where ‘D’ is the number of top high density clusters and ‘N’ is the total number of websites rated.
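A minimal sketch of the response matrix construction is given below, under assumptions that the text does not fix: cluster density is taken to be the number of users in the cluster, a rating of 0 means "not rated", and the average for a website is taken over the cluster members who rated it (0 if none did). The function name `response_matrix` and the sample data are hypothetical.

```python
import numpy as np


def response_matrix(ratings, clusters):
    """Build A (D x N) from a users x websites rating array and soft clusters.

    ratings  : 2-D array, 0 = website not rated by that user
    clusters : list of collections of user indices (may overlap)
    """
    # Assumption: density of a cluster = number of users it contains.
    sizes = np.array([len(c) for c in clusters], dtype=float)
    selected = [c for c, s in zip(clusters, sizes) if s > sizes.mean()]

    n_sites = ratings.shape[1]
    rows = []
    for members in selected:
        block = ratings[sorted(members)]
        counts = (block > 0).sum(axis=0)  # raters per website
        sums = block.sum(axis=0)          # total rating per website
        rows.append(np.divide(sums, counts,
                              out=np.zeros(n_sites), where=counts > 0))
    return np.array(rows)


# Hypothetical ratings: 5 users x 4 websites, values 1..5, 0 = unrated.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
    [5, 4, 0, 0],
])
clusters = [[0, 1, 4], [2, 3, 4], [3]]

A = response_matrix(ratings, clusters)
print(A.shape)  # (2, 4): two clusters exceed the average density
print(A)
```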

3.2 Prediction Quotient ‘Q’ Generation

After the response matrix has been constructed, a Prediction Quotient is computed for each new user. A new user provides a small pattern of ratings for a few websites, which is the basis for finding similar users and identifying the corresponding cluster Ck. The prediction quotient is calculated from the ratios (proportions) of user ratings within that cluster, as follows.

The Prediction Quotient is defined as

$$ Q_{ij} = \frac{W_{ij}}{W_{i}} $$
(2)

where Wij is the total number of times the ith website was rated with the value “j” in the cluster Ck, and

Wi is the total number of times the ith website has been rated in the cluster Ck.

If Wi = 0, then Qij = 0. A recommended Prediction Quotient vector is formed by placing the value Qij at the position of each ith website rated by the new user and filling the unknown positions with ‘×’. The vector is thus an ordered set of ‘N’ elements, where ‘N’ is the number of websites rated. It is used together with the output matrices of the SVD applied to the response matrix ‘A’.
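The construction of the Prediction Quotient vector from Eq. (2) can be sketched as follows. The function name, the data layout (a rating of 0 meaning "not rated") and the sample values are assumptions made for illustration; `None` stands in for the ‘×’ placeholder used in the text.

```python
def prediction_quotient_vector(cluster_ratings, new_user_ratings, n_websites):
    """Build the Prediction Quotient vector for a new user.

    cluster_ratings  : list of per-user rating lists for cluster C_k (0 = unrated)
    new_user_ratings : dict {website index i: rating j} supplied by the new user
    Returns a list of length n_websites holding Q_ij at rated positions
    and None (the 'x' placeholder) elsewhere.
    """
    vector = [None] * n_websites
    for i, j in new_user_ratings.items():
        column = [row[i] for row in cluster_ratings]
        w_i = sum(1 for r in column if r > 0)    # times website i was rated
        w_ij = sum(1 for r in column if r == j)  # times it was rated exactly j
        vector[i] = w_ij / w_i if w_i else 0.0   # Q_ij = 0 when W_i = 0
    return vector


# Hypothetical cluster of three users and four websites.
cluster_ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [5, 4, 0, 0],
]
new_user = {0: 5, 2: 4, 3: 1}

q = prediction_quotient_vector(cluster_ratings, new_user, n_websites=4)
print(q)  # website 1 was not rated by the new user, so it stays None
```

In this example website 0 was rated three times in the cluster, twice with the value 5, so Q = 2/3; website 2 was never rated in the cluster, so Q = 0.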

3.3 Recommendation Vector ‘R’ Generation

The Singular Value Decomposition (SVD) model is applied to the response matrix ‘A’, which produces three matrices U, S and VN. U is of size D × N, S is of size N × N, and VN is also of size N × N. S is a diagonal matrix: its non-zero values lie on the diagonal and all other elements are zero.
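As a hedged illustration of this step, the decomposition can be computed with NumPy's `numpy.linalg.svd`. The matrix values below are hypothetical. Note that with `full_matrices=False` NumPy returns the reduced factors, whose shapes are D × K, K and K × N with K = min(D, N), so they match the sizes quoted above only when D ≥ N.

```python
import numpy as np

# Hypothetical response matrix with D = 2 clusters and N = 4 websites.
A = np.array([
    [4.5, 3.0, 0.0, 1.0],
    [3.0, 2.0, 5.0, 4.5],
])

# Reduced SVD: s holds the singular values, Vt is the transposed right factor.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)  # diagonal matrix; all off-diagonal entries are zero

print(U.shape, S.shape, Vt.shape)  # (2, 2) (2, 2) (2, 4)

# The product of the three factors reproduces A up to rounding error.
A_rebuilt = U @ S @ Vt
print(np.allclose(A_rebuilt, A))   # True
```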

4 Result

The dataset was created with university students, as described in Sect. 3, and consists of more than 5000 feedback records collected through their web access log files. For experimentation, RapidMiner, an open source data mining research tool, was used. The unknown (sparse) ratings are the ratings of websites that have not been rated.

We also compared the performance of our method with other collaborative recommendation methods such as BMF, KNN and Slope-one, using the operators provided by RapidMiner. Our method is very similar to the factor-wise matrix factorization performed by the RapidMiner FWMF operator. The following figures compare the proposed method with the traditional methods.

Figures 3, 4, 5 and 6 show the comparative results of the proposed method and the traditional methods. RapidMiner provides the following operators, which were used to evaluate the performance of the proposed recommender system. The matrix factorization with factor-wise learning (FWMF) operator follows the approach of “Modeling Relationships at Multiple Scales to Improve Accuracy of Large Recommender Systems”. The Biased Matrix Factorization (BMF) operator performs matrix factorization with explicit user and item bias; it uses the bold driver heuristic for learning rate adaptation and supports the “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent” method.

Fig. 3 Root mean square error

Fig. 4 Mean absolute error

Fig. 5 Normalized mean square error

Fig. 6 Proposed method with different factors
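For reference, two of the error measures reported in the figures, RMSE and MAE, can be computed as in the sketch below. The rating values are hypothetical and are not results from the experiments; NMSE is left out because the normalisation used is not stated in the text.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long rating lists."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error between two equally long rating lists."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


# Hypothetical true and predicted ratings on the 1..5 scale.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]

print(round(rmse(actual, predicted), 4))  # 1.2247
print(mae(actual, predicted))             # 1.0
```

RMSE penalises large individual errors more heavily than MAE, which is why the two measures can rank methods differently.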

5 Conclusion

This paper proposed a novel approach to collaborative recommender systems based on the S3M similarity of users and their item ratings. The calculation used to fill in the zeros rests on the similar users who gave the same rating as the new user for a particular website; this is the foundation for generating the prediction vector and makes it proportional to the rating at the cluster level as well as at the user level. In our experiments the proposed method outperforms the compared methods and gives the minimum errors with respect to RMSE, MAE and NMSE. The proposed method has also been evaluated with different factors, and it gives the minimum RMSE values for the rating predictions.