1 Introduction

Today, people expect their needs to be met quickly in every area of life, including shopping for or renting books. Recommendation systems offer a practical solution to this problem. They are a kind of expert system that helps in gathering related information [1]. Most recommendation systems serve a similar purpose: to recommend the items that are most relevant to the user. To this end, recommendation systems use different approaches, including collaborative, item-based, and hybrid filtering.

In this paper we use a collaborative filtering approach to provide recommendations to users. We train a model on book rating data, and the trained model is passed to a testing model, which predicts ratings for new users. On the basis of these predicted values, we propose a system that recommends books to new users according to their personal attributes: age, location, and interest. Using these three attributes, we propose three different models, all of which use the dataset produced by our training and testing models. To create the training model we used a real dataset of books, as described in Fig. 5. It has a large number of entries, which makes it suitable for our analysis. The main objective of this proposal is to assist new users of any book repository in finding the books they want. Many researchers have pursued a similar objective, as shown in Table 1. The main purpose of this research is to design a different approach to building recommendation systems. Our work provides a basis for building recommendation systems using the User k-NN prediction model.

Table 1 Output for recommendation

2 Theoretical Background

2.1 Recommendation System Overview

A great deal of work has been done on recommendation systems, but interest remains high because it is a problem-rich field with broad possibilities in both research and industry. It has a large number of practical applications that address information overload and provide personalized information [2]. The following survey of research in the field supports the observation that recommendation systems using User k-NN prediction are among the least explored and thus offer large opportunities for research (Fig. 1).

Fig. 1 Literature survey

Early work on recommendation systems can be traced to cognitive science [3], approximation theory [4], marketing models [5], and automatic text processing [6]. This work later evolved into rating estimation for new entries on the basis of the attributes and preferences of existing entries similar to them.

Recommendation systems can be categorized based on how recommendations are made [7]:

  • Content-based recommendations: Items are recommended on the basis of past preferences of the user.

  • Collaborative recommendations: Items are recommended on the basis of past preferences of users with similar taste.

  • Hybrid recommendations: These are the combinations of both content-based and collaborative recommendations.

We use collaborative recommendations and the User k-NN method for our system, which are explained in Sects. 2.4 and 2.5, respectively.

2.2 Performance Measures

RMSE: Root-mean-squared error is a widely used general-purpose error metric for numerical predictions [8]. Its value lies between 0 and ∞, where 0 is the best possible value and larger values are worse. Hence, a lower RMSE indicates better model performance.

MAE: Mean absolute error measures the average magnitude of the errors in a set of predictions [9]. Its value also lies between 0 and ∞, where 0 is the best possible value and larger values are worse. Our aim is therefore to minimize this value as well.
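As a minimal illustration of the two measures above, the following Python sketch computes RMSE and MAE for a handful of predictions. The function names and the sample ratings are hypothetical and are not taken from the dataset or from Rapid Miner.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings, for illustration only.
actual = [8, 5, 10, 7]
predicted = [7, 5, 8, 9]

print(rmse(actual, predicted))  # 1.5
print(mae(actual, predicted))   # 1.25
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE, which is why the two values differ on the same predictions.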

2.3 Similarity Measures

There are two main similarity measures which are present in Rapid Miner:

  • Cosine-based similarity: This treats the two items as vectors and calculates the similarity from the angle between them. It is also known as vector-based similarity.

  • Pearson-based similarity: This measures how far the rating given by a common user deviates from the average rating of that item.

We used the Pearson correlation mode because it provided more accurate results than Cosine for our dataset. The RMSE for Pearson is 10.66% lower than that for Cosine, as shown in Figs. 2 and 3.
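The difference between the two measures can be seen in a small Python sketch. It assumes two equal-length rating vectors and uses hypothetical function names; it is not the Rapid Miner implementation. Pearson correlation is computed here as the cosine of the mean-centered vectors, which is a standard identity.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # similarity is undefined for a zero vector; 0.0 by convention here
    return dot / (norm_x * norm_y)

def pearson_similarity(x, y):
    """Pearson correlation: cosine similarity after subtracting each mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)

# Hypothetical ratings with opposite trends.
x = [8, 9, 10]
y = [10, 9, 8]

print(cosine_similarity(x, y))   # close to 1: both vectors hold high ratings
print(pearson_similarity(x, y))  # close to -1: the trends are opposite
```

In this example the cosine measure reports the two rating vectors as nearly identical because all ratings are high, whereas Pearson exposes the opposite preference pattern once the means are removed.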

Fig. 2 RMSE of Pearson

Fig. 3 RMSE of Cosine

2.4 Collaborative Recommendation

Collaborative recommendations are provided on the basis of the preferences of users whose taste is similar to that of the new user [10]. We chose this approach over content-based filtering because content-based filtering cannot assess the quality of an item [11]. Collaborative recommendations rely on the collaborative filtering (CF) algorithm, which works as follows [12]:

  • Similarity values are calculated between two or more items in a dataset using one of the similarity measures. These measures are explained in Sect. 2.3.

  • These similarity values are used to predict ratings for the entries not present in dataset.
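The two steps above can be sketched in Python for the user-based case. This is a common textbook formulation (Pearson similarity over co-rated items, then a similarity-weighted average of neighbors' deviations from their mean ratings) and is an assumption; it is not necessarily what the Rapid Miner Recommender extension computes. All names and ratings are hypothetical.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items that both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)
                    * sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def predict_rating(ratings, user, item, k=2):
    """Predict the rating of `item` by `user` from the k most similar users."""
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    neighbours = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim > 0:  # assumption: keep only positively correlated users
            neighbours.append((sim, other))
    top = sorted(neighbours, reverse=True)[:k]
    if not top:
        return user_mean  # no usable neighbour: fall back to the user's mean
    weighted = 0.0
    for sim, other in top:
        other_mean = sum(ratings[other].values()) / len(ratings[other])
        weighted += sim * (ratings[other][item] - other_mean)
    # Note: the result is not clamped to the rating scale in this sketch.
    return user_mean + weighted / sum(sim for sim, _ in top)

# Hypothetical ratings: user -> {book: rating}
ratings = {
    "u1": {"b1": 8, "b2": 6, "b3": 9},
    "u2": {"b1": 7, "b2": 5, "b3": 8, "b4": 9},
    "u3": {"b1": 4, "b2": 9, "b4": 3},
    "new": {"b1": 9, "b2": 7},
}

print(predict_rating(ratings, "new", "b4"))  # 9.75
```

Here only "u2" agrees with the new user on the co-rated books, so the prediction is the new user's mean (8) plus the amount by which "u2" rated "b4" above their own mean (1.75).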

In this paper, collaborative filtering is used together with User k-NN to provide an approach for a recommendation system. Collaborative filtering addresses most of the shortcomings of content-based filtering [13]. Because recommendations are driven by the feedback of other users, they differ from user to user, which helps maintain effective performance. The k-NN approach used in this research is described in Sect. 2.5.

2.5 k-NN Algorithm

K-nearest neighbors (k-NN) is a method used for both classification and regression [14]. It is a type of instance-based learning, also called lazy learning. The k-NN approach works as follows.

Instances are represented as points in a Euclidean space, and a prediction for a new point is derived from its K nearest instances (K is a positive integer, typically small).

  • In k-NN classification (discrete values), an object is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors. If K = 1, the object is simply assigned to the class of its single nearest neighbor.

  • In k-NN regression (real values), the output is the mean of the values of the K nearest neighbors.
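A minimal Python sketch of both variants is given below, assuming Euclidean distance and training data supplied as (point, target) pairs. The function names and sample points are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (point, target) pairs whose points are closest to `query`."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    """Majority vote among the k nearest neighbours (discrete targets)."""
    labels = [label for _, label in k_nearest(train, query, k)]
    # On a tie, Counter returns the label that was encountered first.
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Mean target value of the k nearest neighbours (real-valued targets)."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical labelled points.
train_cls = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
             ((5, 5), "B"), ((6, 5), "B")]
train_reg = [((1,), 2.0), ((2,), 4.0), ((3,), 6.0), ((10,), 20.0)]

print(knn_classify(train_cls, (1.5, 1.5), k=3))  # A
print(knn_classify(train_cls, (5.2, 5.0), k=1))  # B
print(knn_regress(train_reg, (2,), k=3))         # 4.0
```

No model is fitted in advance: all work happens at query time, which is why k-NN is called lazy learning.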

3 Methodology

The methodology adopted for this research is depicted in Fig. 4:

Fig. 4 Methodology

The datasets from the three Excel sheets (BX-Book-Ratings, BX-User, and BX-Books) are integrated using data integration techniques.

  1. The integrated data is pre-processed.

  2. The User k-NN algorithm is used for predictive analysis of the book ratings in the training samples.

  3. The predictive model is designed using Rapid Miner.

  4. The model is tested using testing samples.

  5. The performance of the model is measured using RMSE, MAE, and NMAE.

3.1 Data Integration

The initial dataset consisted of three files with different attributes. These files are described in Fig. 5. To select the most suitable attributes, the Pearson R test was performed to calculate the similarity between attributes.

Fig. 5 Metadata of dataset

Attributes with high similarity were treated as a single attribute. The formula for the Pearson R test is given below:

$$r = \frac{\sum \left( x - \overline{x} \right)\left( y - \overline{y} \right)}{\sqrt{\sum \left( x - \overline{x} \right)^{2} \sum \left( y - \overline{y} \right)^{2}}}.$$

Manual integration was also performed to obtain the most suitable attributes. For example, the BX-Books Excel file contains image URLs, which are of no use to this research. Other attributes, such as publisher details and year of publication, were not relevant to this approach and were therefore removed from the attribute list.
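The integration step can be pictured as a join of the three sources followed by dropping the irrelevant attributes. The Python sketch below assumes each sheet has already been loaded as a list of dictionaries; the field names (user_id, isbn, image_url, and so on) and the sample records are hypothetical and do not reproduce the actual column headers of the files.

```python
def integrate(ratings, users, books):
    """Join ratings with user and book details, keeping only the useful fields."""
    users_by_id = {u["user_id"]: u for u in users}
    books_by_isbn = {b["isbn"]: b for b in books}
    merged = []
    for r in ratings:
        user = users_by_id.get(r["user_id"])
        book = books_by_isbn.get(r["isbn"])
        if user is None or book is None:
            continue  # inner join: skip ratings without a matching user or book
        merged.append({
            "user_id": r["user_id"],
            "isbn": r["isbn"],
            "rating": r["rating"],
            "title": book["title"],
            "author": book["author"],
            "location": user["location"],
            # publisher, year, and image_url are deliberately not copied
        })
    return merged

# Hypothetical records, for illustration only.
ratings = [{"user_id": "u-1", "isbn": "isbn-a", "rating": 8},
           {"user_id": "u-2", "isbn": "isbn-x", "rating": 5}]  # unknown book
users = [{"user_id": "u-1", "location": "city-1", "age": 30},
         {"user_id": "u-2", "location": "city-2", "age": 41}]
books = [{"isbn": "isbn-a", "title": "Title A", "author": "Author A",
          "publisher": "Publisher A", "year": 2001, "image_url": "http://example.invalid/a.jpg"}]

print(integrate(ratings, users, books))
```

An inner join is used here so that every integrated row carries complete user and book details; whether unmatched ratings were kept in the original work is not stated.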

3.2 Data Pre-processing

  • The dataset of book ratings, user details, and book details had 1,149,780 ratings for 271,379 books.

  • The user IDs were anonymized and mapped to integers.

  • Six attributes (User Id, ISBN No, Book Ratings, Title, Author, and Location) were selected from the full set of attributes.

  • Data cleaning was performed and repeated; invalid and null values were removed.

  • The dataset was reduced to 5000 user IDs to make the results easier to interpret.
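The pre-processing steps listed above might look as follows in Python. This is a sketch under stated assumptions: rows are dictionaries with hypothetical field names, a rating is valid if it parses as an integer, and the reduced sample consists of the first max_users distinct users encountered. How the 5000 users were actually chosen is not stated.

```python
def preprocess(rows, max_users=5000):
    """Drop invalid rows, map user ids to integers, and cap the number of users."""
    id_map = {}   # original user id -> anonymous integer id
    cleaned = []
    for row in rows:
        # Remove rows with missing (null or empty) key fields.
        if any(row.get(key) in (None, "") for key in ("user_id", "isbn", "rating")):
            continue
        # Remove rows whose rating is not a valid integer.
        try:
            rating = int(row["rating"])
        except (TypeError, ValueError):
            continue
        original_id = row["user_id"]
        if original_id not in id_map:
            if len(id_map) >= max_users:
                continue  # user falls outside the reduced sample
            id_map[original_id] = len(id_map) + 1
        cleaned.append({**row, "user_id": id_map[original_id], "rating": rating})
    return cleaned

# Hypothetical raw rows, for illustration only.
raw = [
    {"user_id": "u-17", "isbn": "isbn-a", "rating": "0"},
    {"user_id": "u-42", "isbn": "isbn-b", "rating": "5"},
    {"user_id": "u-42", "isbn": "", "rating": "7"},        # empty ISBN
    {"user_id": "u-58", "isbn": "isbn-c", "rating": "n/a"},  # invalid rating
    {"user_id": "u-99", "isbn": "isbn-d", "rating": "3"},  # beyond the user cap
]

print(preprocess(raw, max_users=2))
```

With max_users=2, only the first two rows survive: the third and fourth are removed as invalid, and the fifth belongs to a third user.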

4 Experimental Setup

4.1 Dataset Used

The dataset was collected in a 4-week crawl of the Book-Crossing community. It was downloaded from the official website of IIF [15]. The metadata of the original dataset is given and the pre-processed dataset is shown in Fig. 5.

4.2 Tool Used

The Rapid Miner data mining tool is used for research and analysis in data mining. It provides an integrated environment for data mining, machine learning, predictive analysis, and text mining. It supports the information mining process, including the presentation of results, validation, and optimization. It provides a large pool of data loading, data transformation, data modeling, and data visualization methods [16].

4.3 Model Construction for Training

The model constructed in Rapid Miner to train on the data, which is then used to predict user ratings, is shown in Fig. 6. The following steps describe the working and flow of the model:

Fig. 6 Model designed for training of data

  1. “Read Excel” imports an Excel file into the Rapid Miner process.

  2. “Set Role” specifies the role of each attribute present in the Excel file [17]. In this model, Book Ratings is specified as “label”, ISBN as “item identification”, User Id as “user identification”, and all other attributes as “regular”.

  3. “User k-NN” is a model for rating prediction. It becomes available after installing the “Recommender” extension in Rapid Miner.

  4. “Apply Model” applies the selected model and provides its final result. Here the model is User k-NN and the result is the rating prediction.

  5. “Performance” reports the accuracy and validity of the model.

4.4 Model Construction for Testing

Model constructed in Rapid Miner for testing of data is shown in Fig. 7. This model tests the prediction of ratings for the new users. Following steps describe the working and flow of the model:

Fig. 7 Model designed for testing of data

  1. “Read Excel”, “Set Role”, “User k-NN”, “Apply Model”, and “Performance” work the same way as in the training model.

  2. “Filter Example” separates empty user rating values from non-empty ones.

  3. The empty values are sent to “Apply Model2”, which uses the training data and provides predictions for the empty user ratings.
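The flow of the testing model (separate the examples with empty ratings, then predict values for them from the rated examples) can be sketched in Python as follows. The predictor used here is a deliberately simple stand-in, the mean rating of the same book, and is not the User k-NN model; all names and sample values are hypothetical.

```python
def split_by_rating(examples):
    """Separate examples that have a rating from those whose rating is empty."""
    rated = [e for e in examples if e["rating"] is not None]
    unrated = [e for e in examples if e["rating"] is None]
    return rated, unrated

def item_mean_predictor(rated, example):
    """Stand-in predictor (NOT User k-NN): mean rating of the same book,
    falling back to the global mean when the book has no ratings."""
    same_book = [e["rating"] for e in rated if e["isbn"] == example["isbn"]]
    pool = same_book or [e["rating"] for e in rated]
    if not pool:
        raise ValueError("no rated examples to predict from")
    return sum(pool) / len(pool)

def predict_missing(examples, predictor=item_mean_predictor):
    """Attach a prediction to every example whose rating is empty."""
    rated, unrated = split_by_rating(examples)
    return [{**e, "prediction": predictor(rated, e)} for e in unrated]

# Hypothetical examples, for illustration only.
examples = [
    {"user_id": 1, "isbn": "isbn-a", "rating": 8},
    {"user_id": 2, "isbn": "isbn-a", "rating": 6},
    {"user_id": 2, "isbn": "isbn-b", "rating": 9},
    {"user_id": 3, "isbn": "isbn-a", "rating": None},  # new user, no rating yet
]

print(predict_missing(examples))
```

Passing the predictor as a parameter keeps the filtering logic independent of the model, so a user-based k-NN predictor could be substituted without changing the flow.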

5 Result and Analysis

5.1 Output

Outputs of the training model and the testing model are shown in Figs. 8 and 9, respectively. The model designed for rating prediction was trained on our dataset on the basis of user ratings. The results of the training model are then used in testing. The testing model uses the output of the training model and provides predictions for new users. These results are used in the further analysis in this paper.

Fig. 8 Output of training model

Fig. 9 Output of testing model

5.2 Work Flow of Proposed Model

  • The new user enters a search item into the system.

  • The search item can be an author’s name or a book title.

  • The user is then asked for the required attributes: age, location, and area of interest.

  • The dataset created by the models is then used for the recommendation.

  • If the user searched by author, the highest-rated books by that author are recommended.

  • If the user searched by title, the books categorized in the same group as that title are recommended.

Example: New user XYZ searches for the following author:

“Manette Ansay”

All the books written by A. Manette Ansay are then retrieved from the dataset created by the testing model. The following is a sample of that data:

Here we have four books by the requested author, but only the three books with the highest ratings are sent as recommendations. The recommendations are:

  1. Midnight Champagne by A. Manette Ansay

  2. Sister by A. Manette Ansay

  3. Vinegar Hill by A. Manette Ansay
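The author search in this example can be sketched in Python as a case-insensitive substring match followed by a sort on predicted rating. The predicted ratings below are made-up placeholder values chosen only to reproduce the ordering above, and the fourth entry is a placeholder because its title is not given in the text.

```python
def recommend_by_author(predictions, author_query, top_n=3):
    """Return the titles of the top_n highest-rated books whose author matches."""
    query = author_query.casefold()
    matches = [p for p in predictions if query in p["author"].casefold()]
    matches.sort(key=lambda p: p["predicted_rating"], reverse=True)
    return [p["title"] for p in matches[:top_n]]

# Placeholder predictions: the rating values are hypothetical.
predictions = [
    {"title": "Sister", "author": "A. Manette Ansay", "predicted_rating": 7.6},
    {"title": "(fourth title, not named in the text)", "author": "A. Manette Ansay",
     "predicted_rating": 6.0},
    {"title": "Midnight Champagne", "author": "A. Manette Ansay", "predicted_rating": 8.1},
    {"title": "Vinegar Hill", "author": "A. Manette Ansay", "predicted_rating": 7.2},
    {"title": "Some Other Book", "author": "Another Author", "predicted_rating": 9.5},
]

print(recommend_by_author(predictions, "Manette Ansay"))
# ['Midnight Champagne', 'Sister', 'Vinegar Hill']
```

Substring matching lets the query “Manette Ansay” find books stored under “A. Manette Ansay”; a production system would likely need more robust name matching.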

5.3 Performance Measures

The performance of the prediction model is measured using the measures defined in Sect. 2.2. The following table lists the performance measures for both models (Table 2):

Table 2 Values of performance measures

5.4 Analysis

We follow the procedures defined below for our further analysis and research. On first access, the user is asked for the following attributes:

  • Age

  • Location

  • Area of Interest

Three possibilities are proposed using the attributes defined above and the data created by our training and testing models.

Case study 1: Recommendation using age. When recommendations are provided to a new user, ratings alone cannot serve as the sole basis. Suppose the new user is 25 years old and the recommended item is rated highly by people more than 60 years old. This would not be a fair recommendation for that user. So, using the output of the testing model, a new proposal is made that uses the age of the new user as the main attribute.

In Fig. 10, the predictions provided by the testing model are set against the ages of the users to show their distribution.

Fig. 10 Age-wise distribution of prediction

The model shown in Fig. 11 uses age as an attribute of test data and finds similar objects in data trained by our model.

Fig. 11 Recommendation using age

  1. Age groups with a range of 10 years are created using the data in Fig. 10.

  2. Suppose the user falls in Group 1 (ages 0–10); the three books with the highest ratings in that age group are then fetched from the training dataset.

  3. These results are passed to the recommender system and presented as recommendations to the new user.

  4. The next top three books are recommended if the user does not like the recommendations provided.
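The steps above can be sketched in Python as follows. Two points are assumptions, since the text does not specify them: the age groups are taken as half-open intervals (0 to 9, 10 to 19, and so on), and a book's rating within a group is taken as the mean of the ratings given by users in that group. Field names and sample rows are hypothetical.

```python
from collections import defaultdict

def age_group(age, width=10):
    """Return the (lower, upper) bounds of the age group, upper bound exclusive."""
    lower = (age // width) * width
    return lower, lower + width

def recommend_by_age(rows, new_user_age, top_n=3, skip=0):
    """Top books by mean rating within the new user's age group.

    `skip` allows fetching the next batch if the first one is rejected."""
    target = age_group(new_user_age)
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of ratings, count]
    for row in rows:
        if age_group(row["age"]) == target:
            totals[row["title"]][0] += row["rating"]
            totals[row["title"]][1] += 1
    ranked = sorted(totals, key=lambda t: totals[t][0] / totals[t][1], reverse=True)
    return ranked[skip:skip + top_n]

# Hypothetical rows, for illustration only.
rows = [
    {"age": 23, "title": "T1", "rating": 9},
    {"age": 27, "title": "T1", "rating": 7},
    {"age": 25, "title": "T2", "rating": 10},
    {"age": 21, "title": "T3", "rating": 6},
    {"age": 29, "title": "T4", "rating": 5},
    {"age": 64, "title": "T5", "rating": 10},  # different age group
]

print(recommend_by_age(rows, 25))          # ['T2', 'T1', 'T3']
print(recommend_by_age(rows, 25, skip=3))  # ['T4']
```

The highly rated book T5 is not recommended to the 25-year-old user because its rating comes from a different age group, which is the situation the case study is meant to avoid.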

Case study 2: Recommendation using location. As stated in case study 1, an additional attribute is needed to provide more relevant recommendations. In this case, it is the location of the new user. On this basis, a proposal is made for better recommendations.

In Fig. 12, the predictions provided by the testing model are set against the locations of the users to show their distribution.

Fig. 12 Location-wise distribution of prediction

The model in Fig. 13 uses location of users as an attribute of test data and finds similar objects in data trained by our model.

Fig. 13 Recommendation using location

  1. The addresses of the users in the training data and of the new users are converted to latitude and longitude values using the data in Fig. 12.

  2. The 10 values closest to those of the new user are selected.

  3. The three books with the highest ratings among those 10 entries are selected and sent to the recommender system.

  4. These results are presented as recommendations to the new user.

  5. The next top three books are recommended if the user does not like the recommendations provided.
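A Python sketch of these steps is given below. It assumes that addresses have already been converted to (latitude, longitude) pairs, and it measures closeness with the haversine great-circle distance; the text does not say which distance was used, so this choice is an assumption. Field names and sample rows are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def recommend_by_location(rows, new_user_coords, neighbours=10, top_n=3):
    """Top-rated distinct titles among the entries closest to the new user."""
    nearest = sorted(rows, key=lambda r: haversine_km(r["coords"], new_user_coords))
    nearest = nearest[:neighbours]
    nearest.sort(key=lambda r: r["rating"], reverse=True)
    titles = []
    for row in nearest:
        if row["title"] not in titles:  # avoid recommending the same book twice
            titles.append(row["title"])
    return titles[:top_n]

# Hypothetical entries placed along the equator, for illustration only.
rows = [{"coords": (0.0, 0.1 * i), "title": f"T{i}", "rating": i}
        for i in range(1, 13)]

print(recommend_by_location(rows, (0.0, 0.0)))  # ['T10', 'T9', 'T8']
```

T11 and T12 carry the highest ratings but lie outside the 10 closest entries, so they are not recommended.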

Case study 3: Recommendation using interest:

This model uses Area of Interest as an attribute of test data and finds similar objects in data trained by our model (Fig. 14).

Fig. 14 Recommendation using interest

  1. All books present in the training data are categorized into genres.

  2. The system provides a list of genres, and the new user selects one according to their interest.

  3. The three books with the highest ratings in that genre are selected and sent to the recommender system.

  4. These results are presented as recommendations to the new user.

  5. The next top three books are recommended if the user does not like the recommendations provided.

6 Conclusion

The predicted user ratings are well distributed with respect to our three main attributes. All case studies are suitable for development of the proposed models except case study 3, which cannot be certified for development because the dataset does not have entries categorized by area of interest. In future, the dataset can be categorized by genre, after which it can be used for recommendation on the basis of area of interest.