1 Introduction

In today’s web era, online users expect search engines to satisfy their queries with the most appropriate web pages. Current information retrieval algorithms therefore focus on personalization: query results must be customized to the user’s search behaviour on the web in order to provide user satisfaction. Recommendation systems are among the thriving research areas today, in which personalization is used to analyze a user’s search interest and provide better results even for users who do not reveal their search interest explicitly [1]. Web mining is categorized into three types, as explained in [2]: web usage mining, web structure mining and web content mining. This research is based on the analysis of the usage log and of the content at each URL logged by the corresponding user. Each web page is denoted by its logged URL. A user profile is constructed for each distinct user, identified by IP address [3]. To hide the identity of each user, the IP address is represented by a unique random number [4]. The profile of each user is created using eight characteristic features and two content-based features, as explained in Sect. 3.1.

Traditionally, collaborative filtering and k-Nearest Neighbor (kNN) approaches were predominantly applied in recommendation systems. Both approaches provide recommendations for an active web user based on other users who have similar interests and preferences [5]. Users with similar interests are called neighbors [6]. Unfortunately, collaborative filtering and kNN have their own drawbacks. For example, consider a web page ‘p’ that has been recently created or modified to hold updated content. Such a page ‘p’ might not yet have been visited or revisited by web users after the update. Hence, ‘p’ may not be included in further recommendations to the currently active user. This problem is termed the cold start problem.

The objective of this paper is to improve the accuracy of web page recommendation by extending the kNN method with Case-Based Reasoning (CBR) and Weighted Association Rule Mining (WARM) algorithms. The main contributions of this paper are as follows:

  • The kNN algorithm is applied to identify the initial set of ‘k’ similar neighbors for the current active user; these users are referred to as the k-NN. The ‘k’ neighbors are users with similar search interests who target web pages with similar content. For identifying these ‘k’ users, the collaborative filtering algorithm proposed in [7] is applied.

  • The CBR algorithm [8,9,10] is applied to the ‘k’ identified neighbors by analyzing user profiles comprised of characteristic and content-based features. CBR reduces the set of k-NN neighbors to n-NN neighbors (where n = k/2), resulting in reduced delay and increased performance of the system.

  • To further improve accuracy, WARM is applied after CBR. Here weights are assigned while computing support and confidence, so that more accurate rules are generated from the frequent item-sets. These rules are used for the final recommendation.

The rest of this paper is organized as follows. In Sect. 2, related work to this paper has been discussed. Section 3 discusses the concept of applying CBR for profile generation. Section 4 discusses the idea of using WARM for rule generation. Section 5 covers results and discussion. Section 6 concludes with the final findings and inferences observed in this paper.

2 Related work

Various traditional methods, such as collaborative filtering, association rules, clustering, sequential patterns, hybrid methods and the semantic web [11], are used for personalization and recommendation systems. Collaborative filtering, developed by [7] and [12], is one of the most common approaches to providing recommendations by finding similar users. The Pearson correlation coefficient and cosine-based measures can be used to find similar users [13]. This traditional approach can still be improved by applying normal recovery collaborative filtering [12]. However, recommendation using a pure collaborative filtering approach may lead to problems such as popularity bias, the cold start problem and difficulty handling dynamic pages. Therefore, in order to provide personalized results, this paper combines CBR with WARM. CBR [9] generates the user profile and uses similarity knowledge to predict relevant profiles for the currently active user. Such a profile includes page rank [14] as a major feature, computed using the HITS and PageRank algorithms. WARM is similar to traditional association rule mining but is more efficient, as it considers the importance of transactions and item-sets [15, 16].

2.1 Collaborative filtering

Collaborative filtering (CF) is one of the most common approaches used for recommendation. Collaborative filtering systems collect visitor opinions on a set of objects using ratings, either explicitly provided by the users or implicitly computed. With explicit ratings, users assign a rating to items or web pages, or a positive (or negative) vote to some web pages or documents [11]. Implicit ratings are computed by observing accesses to a web page. A rating matrix is constructed in which each row represents a user and each column represents an item or web page keyword [12]. Items can be any type of online information resource, such as web pages, videos, music tracks, photos, academic papers or books. Collaborative filtering systems predict a particular user’s interest in an item using the rating matrix; alternatively, the item–item matrix, which contains the pair-wise similarities of items, can be used. The rating matrix is the basis of CF methods, and the ratings it contains may be implicit, explicit or both. Although CF techniques based on implicit ratings are available for recommendation, most CF approaches are developed for recommending items for which users provide their preferences as explicit ratings.

The web log files are collected from the users’ browsing history and consist of the IP address, the date and time of visiting the web pages, the method/URL/protocol, the status code, the bytes received, etc. From the log file all web page contents are extracted, from which keywords are extracted. The page view count and page rank are calculated for each URL. Based on these values, the user profile is constructed and represented in matrix format. Based on the user profiles, user similarity is found by applying the normal recovery similarity measure [12], and the collaborative filtering approach called Normal Recovery Collaborative Filtering (NRCF) is applied to the similar users obtained, for web page recommendation. When a new user enters a search query that matches the query of other similar users, the web pages visited by those similar users are recommended to the new user. The normal recovery similarity measure calculates the degree of similarity between two users from their profiles using the following Eq. (1), stated in [12, 17]:

$$ \mathrm{Sim}(u,v) = 1 - \frac{\sqrt{\sum_{i \in I} \left( \dfrac{r_{u,i} - r_{u\min}}{r_{u\max} - r_{u\min}} - \dfrac{r_{v,i} - r_{v\min}}{r_{v\max} - r_{v\min}} \right)^{2}}}{\sqrt{|I|}} $$
(1)

where I is the set of web pages co-visited by users u and v, and |I| is its size, i.e. the total number of web pages co-visited by users u and v. $r_{u,i}$ is the value combining the web page keyword and the time spent on the particular web page by user u in the user–web page matrix. $r_{u\min}$ and $r_{u\max}$ are the lowest and highest values of user u; $r_{v\min}$ and $r_{v\max}$ are the lowest and highest values of user v [12].
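As an illustration only, the following Python sketch implements Eq. (1). It assumes each user is represented as a dictionary mapping page identifiers to the combined keyword/time-spent values of the user–web page matrix; the names and data layout are illustrative, not the paper’s implementation.

```python
import math

def nr_similarity(ratings_u, ratings_v):
    """Normal recovery similarity of Eq. (1) between users u and v.

    ratings_u, ratings_v: dicts {page_id: rating}, where a rating combines
    the web-page keyword value and time spent, as in the user-page matrix.
    Returns a value in [0, 1]; 1 means identical normalized rating profiles.
    """
    co_visited = set(ratings_u) & set(ratings_v)          # I: pages rated by both users
    if not co_visited:
        return 0.0                                        # no overlap, no evidence of similarity

    def normalize(r, values):
        lo, hi = min(values), max(values)                  # r_min, r_max of this user
        return 0.0 if hi == lo else (r - lo) / (hi - lo)   # guard against constant profiles

    u_vals, v_vals = ratings_u.values(), ratings_v.values()
    sq_sum = sum(
        (normalize(ratings_u[i], u_vals) - normalize(ratings_v[i], v_vals)) ** 2
        for i in co_visited
    )
    return 1.0 - math.sqrt(sq_sum) / math.sqrt(len(co_visited))

# Example: two users with partially overlapping page visits
u = {"p1": 120, "p2": 30, "p3": 60}
v = {"p1": 100, "p3": 80, "p4": 15}
print(round(nr_similarity(u, v), 3))
```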

2.2 Content-based recommendation

Content-based filtering is a type of information filtering in which web pages are selected based on the semantic similarity between the content of the web pages visited by users in their past history [18, 19]. Web content mining applications mostly rely on content-based filtering approaches, and content-based filtering offers predominant support for web page recommendation systems. In this technique, the keywords and their frequencies of occurrence in previously visited web pages are collected, and the semantic similarity between such keywords is analyzed for further processing. For example, consider two users “u1” and “u2” who frequently visit web pages in their domains of interest: u1 always focuses on health-related web pages and u2 focuses on gadget-related sites. Now, if an active academic user “ua” searches for the query “apple”, he is most likely interested in sites about Apple devices rather than the apple fruit, so he will be recommended the sites referred to by u2. Similarly, when a dietician “ub” searches for “apple”, he will be recommended the sites referred to by u1.

The recommendation engine classifies “ua” as an academic user and “ub” as a dietician based on the contents (keywords) of the web pages navigated in their past history. Along with the keywords, the semantic similarity between them is also analysed for more effective domain grouping. Content-based classification is used for grouping web users under various domains. For such classification, the keywords in web pages and their frequencies are represented using TF-IDF notation, where TF is the Term Frequency and IDF is the Inverse Document Frequency. The following Eqs. (2), (3) and (4) are used to determine the TF-IDF [20] of a term j within a document collection of size N.

$$ \text{tf-idf}(j) = \text{tf}(j) \times \text{idf}(j) $$
(2)

where,

$$ \text{tf}(j) = \text{frequency of term } j \text{ in a document} $$
(3)

and,

$$ \text{idf}(j) = \log\left(\frac{N}{\text{no. of documents containing } j \text{ at least once}}\right) $$
(4)
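A small sketch of Eqs. (2)–(4) follows, assuming raw term counts for tf and a base-10 logarithm for idf (the paper does not fix the base); the corpus layout is illustrative.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus_tokens):
    """tf-idf of Eqs. (2)-(4): tf(j) * idf(j).

    doc_tokens: list of tokens for one document.
    corpus_tokens: list of token lists, one per document (collection of size N).
    Assumption: raw count for tf and base-10 log for idf; neither is fixed by the paper.
    """
    tf = Counter(doc_tokens)[term]                                   # Eq. (3): frequency of term j
    docs_with_term = sum(1 for d in corpus_tokens if term in d)
    if docs_with_term == 0:
        return 0.0
    idf = math.log10(len(corpus_tokens) / docs_with_term)            # Eq. (4)
    return tf * idf                                                   # Eq. (2)

corpus = [["apple", "iphone", "review"], ["apple", "pie", "recipe"], ["android", "phone"]]
print(tf_idf("apple", corpus[0], corpus))
```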

3 Case based clustering for web page recommendation

3.1 Feature selection

Case-based clustering applies case-based reasoning (CBR) to cluster the user profiles. CBR is a process of finding solutions to new problems based on the solutions of similar past problems [8, 10]. In this paper, CBR is applied to the web page recommendation system [21]. A user profile that describes the user’s interests, searching pattern and web access behaviour is created. Each user profile comprises the following ten features [21, 22]:

  • Time on page (TOP)

  • Time on site (TOS)

  • Average time at this page (ATP)

  • Bounce rate (BR)

  • Exit rate (ER)

  • Conversion rate (CR)

  • Number of visitors (NOV)

  • Average page rank (APR)

  • Top similar keywords (SK)

  • Average similarity between keywords (ASM)

In [22], eight characteristic features along with a content-based feature called top similar keywords (SK) were defined. The proposed algorithm introduces another content-based feature, called Average Similarity between Keywords (ASM), in order to increase the accuracy of recommendation. Hence, the proposed system uses ten characteristic and content-based features for the development of the user profile. The methodology for identifying these features from a user’s web access log file is described in Sect. 3.2.

Another contribution of the proposed algorithm is the concept of employing weights (β) for each feature while developing the user profile. The advantage of adding weights is to give more strength to selected features that help enhance the accuracy of predicting web pages for recommendation. In the proposed system, the value of β ranges between 1.0 and 2.0: the contribution of the most significant features is doubled (β = 2), the strength of significant features is considerably increased (β = 1.75), the weight of most relevant features is marginally increased (β = 1.5), and the contribution of the remaining required features is maintained (β = 1.0). Table 1 shows the weight (β) assigned to each feature used in the user profile. Initially, the traditional collaborative filtering approach is used to filter ‘k’ users (neighbors) from the global set of web users. The value ‘k’ is a threshold that can be set by the recommendation engine to balance optimization against search accuracy.

Table 1 Assignment of weights (β) for each feature

Now, the profiles of all “k” users are analyzed and compared with the current active user’s profile, as illustrated in Table 2. The analysis of user profiles is based on the CBR approach [21]: the selected ten features of the “k” users are compared by calculating their similarity score with the current Active User (AU). From all such users, the top N user profiles whose score is below the threshold are selected. In the proposed system, the threshold is set dynamically as in the following Eq. (5):

$$ \mathrm{Threshold} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Sim}(\mathrm{AU}, U_i) $$
(5)
Table 2 Working principle of CBR based clustering approach (where k = 4)

The WARM algorithm is then applied to generate rules that filter the list of all web pages (URLs) most visited by the N users selected by the CBR approach. The working principle of CBR in web page recommendation, with an example of k = 4 users, is shown in Table 2, which was generated from sample training dataset 1 (described in Sect. 5.1).

For experimentation, the AOL web access log dataset [23] was used. The log file contains web query log data from ~650 k users. For privacy preservation, the IP addresses of individual users are represented by anonymous IDs, so each user is represented by a unique ID. The schema of this log dataset is {AnonID, Query, QueryTime, ItemRank, ClickURL}, where AnonID is an anonymous user ID number that preserves user privacy, Query is the query issued by the user, QueryTime is the time at which the query was submitted for search, ItemRank is the rank of the search result the user clicked on, if any, and ClickURL is the domain portion of the URL that the user clicked on in the search results. In the pre-processing stage, the log file is cleansed by removing unwanted information such as blocked URLs and inappropriate or incomplete entries. Finally, the user profile is constructed by analyzing the search pattern and URLs of each individual user, identified by AnonID.

3.2 User profile generation

The user profile based on eight characteristic features and two content-based features are created as explained below:

3.2.1 Time on page (TOP)

The parameter Time on Page is the total time spent by an active user on a particular page. The average time spent on all web pages is measured using the following Algorithm 1.
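Algorithm 1 is not reproduced in the text; the following is a hedged sketch of one plausible way to estimate time on page from an AOL-style log, approximating the time spent on a page by the gap until the user’s next request.

```python
from collections import defaultdict
from datetime import datetime

def time_on_page(log_entries):
    """Hedged sketch in the spirit of Algorithm 1 (not reproduced in the paper).

    log_entries: list of (query_time_str, clicked_url) tuples for one user,
    in the AOL schema (QueryTime, ClickURL). Time on a page is approximated
    as the gap until the next request; the final request gets no estimate.
    Returns {url: average seconds spent}.
    """
    entries = sorted(log_entries, key=lambda e: e[0])
    durations = defaultdict(list)
    for (t1, url), (t2, _) in zip(entries, entries[1:]):
        gap = (datetime.fromisoformat(t2) - datetime.fromisoformat(t1)).total_seconds()
        durations[url].append(gap)
    return {url: sum(v) / len(v) for url, v in durations.items()}

log = [("2006-03-01 07:17:12", "http://www.example.com/a"),
       ("2006-03-01 07:19:02", "http://www.example.com/b"),
       ("2006-03-01 07:25:40", "http://www.example.com/a")]
print(time_on_page(log))
```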

3.2.2 Time on site (TOS)

The time spent by an individual user within a website is computed as TOS. This time is calculated by the following Eqs. (6) and (7), where the times spent on each page $p_i$ with the same hostname (URI) are summed together to obtain $\tau S$.

$$ \tau S(\mathrm{URI}) = \sum_{\mathrm{uri}\,\in\,\langle \mathrm{URL},\,\tau p_i \rangle} \tau p_i(\mathrm{URI}) $$
(6)
$$ \tau S_{\mathrm{Avg}} = \frac{1}{n} \sum_{i=1}^{n} \tau S(\mathrm{URI}_i) $$
(7)

3.2.3 Average time on this page (ATP)

The average time spent by the corresponding user on any page $p_i$ is identified using Eq. (8), where $p_i$ is the page for which the average time spent is to be calculated, URL is the entire list of web page URLs visited by web users, $\tau p_i$ is the time spent on each occurrence of page $p_i$, and N is the total number of occurrences of page $p_i$.

$$ \overline{x(p_i)} = \frac{\sum_{p_i \in \mathrm{URL}} \tau p_i}{N} $$
(8)

3.2.4 Bounce rate (BR)

The percentage of accesses to a web page with respect to a session-wise grouping of the access pattern is called the bounce rate; BR plays a vital role in web analytics today. The web page access pattern is grouped into sessions based on the date and time difference between two consecutive page requests: if this difference exceeds a time limit of 10 min, a new session is started. The access rate of page $p_i$ across all such sessions is computed as BR using Eq. (9), where ‘s’ represents each session from the complete set of sessions ‘S’, TS represents the total number of sessions of a web user, and NS($p_i$) denotes the total number of sessions in which page $p_i$ has been accessed.

$$ \mathrm{BR}(p_i) = \frac{\sum_{s \in S} \sum_{(p_i \in s)\,\cap\,(i = 1)} \tau p_i}{\mathrm{NS}(p_i)} \times \mathrm{TS} $$
(9)

3.2.5 Exit rate (ER)

The rate at which the web page ($p_i$) appears at the end of a session is computed as ER. Here, the occurrences of $p_i$ as the last entry within a session are counted to identify the exit rate using Eq. (10):

$$ \mathrm{ER}(p_i) = \frac{\sum_{s \in S} \sum_{(p_i \in s)\,\cap\,(i = N)} \tau p_i}{\mathrm{NS}(p_i)} \times \mathrm{TS} $$
(10)

3.2.6 Conversion rate (CR)

The conversion rate of each web page is computed as the ratio of the total number of sessions of a user to the number of sessions that contain the page $p_i$. Equation (11) computes the conversion rate of page $p_i$, where TS denotes the total number of sessions grouped under each user and NS($p_i$) denotes the number of sessions containing page $p_i$.

$$ \mathrm{CR}(p_i) = \frac{\mathrm{TS}}{\mathrm{NS}(p_i)} \times 100 $$
(11)
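The session-based features can be sketched as follows, assuming the 10-minute session split described above. The time-weighted terms of Eqs. (9) and (10) are omitted for brevity, so BR and ER reduce to entry/exit session counts; only the CR formula of Eq. (11) is reproduced exactly.

```python
from datetime import datetime, timedelta

def sessionize(entries, gap_minutes=10):
    """Group a user's (time, url) entries into sessions split on gaps > gap_minutes."""
    entries = sorted(entries, key=lambda e: e[0])
    sessions, current = [], []
    for t, url in entries:
        ts = datetime.fromisoformat(t)
        if current and ts - current[-1][0] > timedelta(minutes=gap_minutes):
            sessions.append(current)
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return [[url for _, url in s] for s in sessions]

def session_rates(sessions, page):
    """Simplified BR / ER / CR for one page, as percentages over sessions.

    BR: sessions where `page` is the entry (first) page; ER: sessions where it
    is the exit (last) page; CR: Eq. (11), TS / NS(pi) * 100. The time terms of
    Eqs. (9)-(10) are not modelled here.
    """
    ts = len(sessions)
    ns = sum(1 for s in sessions if page in s)
    if ns == 0:
        return 0.0, 0.0, 0.0
    br = 100.0 * sum(1 for s in sessions if s and s[0] == page) / ns
    er = 100.0 * sum(1 for s in sessions if s and s[-1] == page) / ns
    cr = 100.0 * ts / ns
    return br, er, cr

log = [("2006-03-01 07:17:12", "p1"), ("2006-03-01 07:19:02", "p2"),
       ("2006-03-01 09:05:40", "p1"), ("2006-03-01 09:06:10", "p3")]
print(session_rates(sessionize(log), "p1"))
```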

3.2.7 Number of visitors (NV)

The total number of visitors, also called page views, of each web page visited by the corresponding user is computed to analyze the priority of a web page: the more users have visited a page, the higher its preference for further recommendation. The number of visitors of a particular page $p_i$ is computed using Eq. (12):

$$ \mathrm{NV}(p_i) = \sum_{j=1}^{N} \sum_{p_i \in s_j} n $$
(12)

where,

$$ n = \begin{cases} 0, & \text{if } p_i \text{ is not present at least once in } s_j \\ 1, & \text{if } p_i \text{ is present at least once in } s_j \end{cases} $$

3.2.8 Total page rank (TPR)

Page rank is a numerical value that measures a web page’s importance among a group of similar web pages. The page rank is computed based on the random surfer model [14]; this algorithm computes the rank from the link structure of the web page [24, 25]. A page obtains a high rank if the sum of the ranks of its backlinks is high. The rank of a given page is thus computed using the following Eq. (13):

$$ \mathrm{TPR} = \frac{1}{N}\left[(1-d) + d \sum_{v \in B(u)} \mathrm{Page_{Wt}} \times \frac{\mathrm{PR}(v)}{N_v}\right] $$
(13)

where u represents a web page, B(u) is the set of pages that point to u, PR(v) is the page rank of a page v that points to page u, $N_v$ is the number of outgoing links of page v, and d is the damping factor, set between 0 and 1. The damping factor is the decay factor representing the chance that a user stops clicking links within the current page and requests another random page [14]. $\mathrm{Page_{Wt}}$ is the page weight, calculated from frequency and duration as in Eq. (14).

$$ \mathrm{Page_{Wt}}(\mathrm{PW}) = \mathrm{NV}(p_i) \times \tau p_i $$
(14)

where $\tau p_i$ is the total time spent by the user on a particular web page, as computed by Algorithm 1. A quick jump away might also occur simply because a web page is short, so the size of the page may affect the actual visiting time; hence, the duration is normalized by the length of the web page, i.e. the total bytes of the page. NV($p_i$) is the number of times a page is accessed by different users, computed by Eq. (12).
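A hedged sketch of the page-weighted rank of Eqs. (13) and (14) follows, computed by simple fixed-point iteration over a toy link graph. The page weights NV($p_i$) × $\tau p_i$ are assumed to be precomputed, and the 1/N factor of Eq. (13) is applied literally.

```python
def weighted_pagerank(links, page_wt, d=0.85, iters=50):
    """Sketch of Eqs. (13)-(14): PageRank with a per-page weight factor.

    links: {page: set of pages it links to}.
    page_wt: {page: NV(pi) * tau(pi)}, the page weight of Eq. (14), assumed precomputed.
    Returns {page: rank}.
    """
    pages = set(links) | {q for out in links.values() for q in out}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    backlinks = {p: {q for q in pages if p in links.get(q, ())} for p in pages}
    for _ in range(iters):
        new_rank = {}
        for u in pages:
            incoming = sum(
                page_wt.get(v, 1.0) * rank[v] / max(len(links.get(v, ())), 1)
                for v in backlinks[u]
            )
            new_rank[u] = ((1 - d) + d * incoming) / n        # Eq. (13), including the 1/N factor
        rank = new_rank
    return rank

graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(weighted_pagerank(graph, {"a": 1.0, "b": 2.0, "c": 1.5}))
```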

3.2.9 Top similar keywords (SK)

The top similar keywords under each ranked page $p_i$ are considered for further recommendation. To identify such top keywords, tokenization and stemming are performed. The following Algorithm 2 is used to identify the top keywords based on their frequency of occurrence.
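Since Algorithm 2 is not listed in the text, the sketch below illustrates the idea under simple assumptions: regex tokenization, a small stop-word list and a crude suffix-stripping step standing in for a real stemmer.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def top_keywords(page_text, top_n=10):
    """Hedged sketch of Algorithm 2: tokenize, drop stopwords, apply a crude
    suffix-stripping step, and keep the top_n most frequent stems."""
    tokens = re.findall(r"[a-z]+", page_text.lower())

    def stem(w):                                   # simplistic stand-in for a real stemmer
        for suffix in ("ing", "ed", "es", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                return w[: -len(suffix)]
        return w

    stems = [stem(t) for t in tokens if t not in STOPWORDS]
    return [w for w, _ in Counter(stems).most_common(top_n)]

print(top_keywords("Apple releases new iPhone; the iPhone reviews praise Apple devices", 3))
```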

3.2.10 Average similarity between keywords (ASM)

The sets of top keywords gathered using Algorithm 2 for all “k” users are further investigated to find the semantic similarity between each user and the current Active User (AU). This similarity is used to find the distance between two users based on their search interests. The following Algorithm 3 is used to find the average similarity between keywords.
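Algorithm 3 is likewise not listed; the sketch below averages pairwise keyword similarities, using a character-trigram overlap as a stand-in for the semantic similarity measure, which can be swapped for a WordNet- or embedding-based score without changing the averaging step.

```python
def keyword_similarity(k1, k2):
    """Placeholder pairwise similarity: character-trigram Jaccard overlap.
    Any semantic similarity score can be substituted here."""
    grams = lambda w: {w[i:i + 3] for i in range(max(len(w) - 2, 1))}
    g1, g2 = grams(k1), grams(k2)
    return len(g1 & g2) / len(g1 | g2)

def average_keyword_similarity(keywords_user, keywords_active):
    """ASM: mean pairwise similarity between a user's top keywords and the
    active user's top keywords (hedged sketch of Algorithm 3)."""
    pairs = [(a, b) for a in keywords_user for b in keywords_active]
    if not pairs:
        return 0.0
    return sum(keyword_similarity(a, b) for a, b in pairs) / len(pairs)

print(average_keyword_similarity(["diet", "apple", "nutrition"], ["apple", "iphone"]))
```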

3.3 Finding similarity score

Finally, the similarity between the “k” existing user (EU) profiles and the current Active User (AU) is identified to filter the most similar neighbors. To identify this similarity, the following Eq. (15) is used:

$$ \mathrm{sim}(\mathrm{AU}, \mathrm{EU}) = \sum_{f=1}^{10} \left[\beta_f \times \left(\mathrm{EU}_f - \mathrm{AU}_f\right)^2\right] $$
(15)

where f ranges over the ten features retrieved from the individual user profiles. The similarity score is a weighted squared Euclidean distance between each existing user (EU$_{1..k}$) and the current active user (AU): each squared feature difference is multiplied by the weight (β) assigned in Table 1, so lower scores indicate more similar users. After calculating the similarity scores, the threshold value is determined as stated in Eq. (5). Finally, the most similar users, whose similarity score is less than the threshold, are selected for further analysis using the WARM algorithm, which is discussed in the following section. Thus CBR is applied to reduce the k nearest neighbors to the most similar n nearest neighbors; because k-NN is reduced to n-NN, the proposed CBR-based recommendation system works with enhanced performance and speed.
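The neighbor selection of Eqs. (5) and (15) can be sketched as below. The β values shown are hypothetical placeholders (the actual assignments are given in Table 1), and the SK feature is assumed to have been encoded as a numeric score before this step.

```python
# Hypothetical feature weights in the spirit of Table 1 (actual assignments are in the paper).
BETA = {"TOP": 1.5, "TOS": 1.0, "ATP": 1.0, "BR": 1.75, "ER": 1.75,
        "CR": 1.5, "NOV": 1.0, "APR": 2.0, "SK": 2.0, "ASM": 2.0}

def profile_distance(active, existing, beta=BETA):
    """Eq. (15): weighted squared-difference score between two 10-feature profiles.
    Lower scores mean more similar users. SK is assumed to be encoded numerically
    (e.g., as an overlap score) before this step."""
    return sum(beta[f] * (existing[f] - active[f]) ** 2 for f in beta)

def select_neighbours(active, candidates):
    """Keep the candidates whose score falls below the dynamic threshold of Eq. (5):
    the mean score over all k candidates."""
    scores = {uid: profile_distance(active, prof) for uid, prof in candidates.items()}
    threshold = sum(scores.values()) / len(scores)
    return [uid for uid, s in scores.items() if s < threshold]
```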

4 Recommendation using WARM

4.1 Identifying frequent item set

To further enhance the accuracy of recommendation, the Weighted Association Rule Mining algorithm is used. Following the CBR process, association rules are mined over the n-NN neighbor users. Association rule mining is another predominant algorithm used for effective product recommendation [24, 25]. Here, a weight is computed using Eq. (20) for each item (web page) that will be analyzed for recommendation to the active user; hence the rule mining algorithm is termed Weighted Association Rule Mining. The n-NN users’ most visited pages (fetched from their profiles) that match the current user’s query form the set S (Eq. (16)), from which frequent item-sets are mined. Association rules are generated based on these frequent item-sets [26]. The set of web pages that contain the query word(s) is filtered and called S’, represented using Eq. (17).

$$ {\text{S =}}\, \left\{{{\text{p1, p2, p3,}} \ldots , {\text{ps}}} \right\} $$
(16)
$$ S^{\prime} = \left\{{{\text{pi, pj, pk,}} \ldots , {\text{pn}}} \right\} $$
(17)
$$ \left\{{\text{pi, pj, pk}} \right\} \Rightarrow \left\{{\text{pm}} \right\} $$
(18)

For example, consider Eq. (18) above. This rule states that users who visited web pages $p_i$, $p_j$ and $p_k$, in any order, are most likely to also visit web page $p_m$; hence it might be most appropriate to recommend page $p_m$ to the currently active user. Here pages $p_i$, $p_j$, $p_k$ and $p_m$ are termed frequent item-sets. Association rules of the type shown in Eq. (18) are mined from the frequent item-sets of set S. The support and confidence values of all frequent item-sets x that constitute the mined association rules are computed to eliminate rules that are not suitable for the recommendation process [27]. The support and confidence of each mined rule are computed using the following Eqs. (19) and (21):

$$ \mathrm{Support}(x) = \mathrm{Wt}(x) \times \frac{\left|\{S' \in S;\ x \subseteq S'\}\right|}{|S|} $$
(19)

where Wt(x) is the weight of all web pages contained in the item-set x, computed as in Eq. (20):

$$ \mathrm{Wt}(x) = \sum_{p_i \in x} \frac{\mathrm{BR}(p_i) + \mathrm{ER}(p_i) + \mathrm{CR}(p_i)}{300} $$
(20)

where BR($p_i$), ER($p_i$) and CR($p_i$) are the bounce rate, exit rate and conversion rate of web page $p_i$.

The confidence of any rule \( {\text{p}}1 \Rightarrow {\text{p}}2 \) will be computed using the following Eq. (21):

$$ \mathrm{Confidence}(p_1 \Rightarrow p_2) = \frac{\mathrm{Support}(p_1 \cup p_2)}{\mathrm{Support}(p_1)} $$
(21)
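A minimal sketch of the weighted support and confidence computations of Eqs. (19)–(21) follows, treating each neighbor’s set of visited pages as one transaction; the variable names and toy data are illustrative. Note that, following the paper’s definitions literally, the weight Wt(x) grows with the size of the item-set, so the resulting values are not bounded by 1.

```python
def item_weight(pages, stats):
    """Eq. (20): sum of (BR + ER + CR)/300 over the pages in an item-set.
    stats: {page: (BR, ER, CR)} with rates expressed as percentages."""
    return sum(sum(stats[p]) / 300.0 for p in pages)

def weighted_support(itemset, transactions, stats):
    """Eq. (19): item-set weight times the fraction of transactions containing it."""
    containing = sum(1 for t in transactions if itemset <= t)
    return item_weight(itemset, stats) * containing / len(transactions)

def confidence(antecedent, consequent, transactions, stats):
    """Eq. (21): support(antecedent U consequent) / support(antecedent)."""
    denom = weighted_support(antecedent, transactions, stats)
    if denom == 0:
        return 0.0
    return weighted_support(antecedent | consequent, transactions, stats) / denom

# Toy example: each transaction is the set of pages one neighbour visited for the query.
transactions = [{"pi", "pj", "pm"}, {"pi", "pj", "pk", "pm"}, {"pi", "pk"}]
stats = {p: (40.0, 30.0, 120.0) for p in {"pi", "pj", "pk", "pm"}}
print(confidence({"pi", "pj"}, {"pm"}, transactions, stats))
```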

4.2 Generation of association rules

The overall rule generation process is described in Fig. 1. The association rules generated are sorted in decreasing order of confidence, so that the most appropriate rules, those with the highest confidence, are listed at the top. The candidate rules mined with the maximum confidence value for each query keyword given by the end user are listed in Table 3.

Fig. 1
figure 1

Process of generating weighted rules from user profiles

Table 3 Candidate Rules (with maximum confidence value) for the given query

4.3 Recommendation process

Finally, the top “m” rules ranked by confidence value are selected for the recommendation process. Here “m” is set by the web server/search engine. Various experiments were conducted with m values of 30, 40 and 50 to analyse the accuracy of the proposed algorithm. The right-hand sides (consequents) of the “m” selected rules are the final web pages recommended to the end user.
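The final selection step can be sketched as follows, assuming the mined rules are available as (antecedent, consequent, confidence) tuples; “m” corresponds to the server-side setting described above.

```python
def recommend(rules, m=30):
    """Sort mined rules by confidence and return the consequent (RHS) pages of the top m.

    rules: list of (antecedent_set, consequent_set, confidence) tuples.
    Returns the recommended pages in rank order, without duplicates.
    """
    ranked = sorted(rules, key=lambda r: r[2], reverse=True)[:m]
    recommended = []
    for _, rhs, _ in ranked:
        for page in rhs:
            if page not in recommended:          # keep first occurrence, preserve rank order
                recommended.append(page)
    return recommended
```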

5 Results and discussion

5.1 Data set

For the research experimentation and analysis, the AOL log dataset was used. The log file contains web query log data from ~650 k users. For privacy preservation, the IP addresses of individual users are represented by anonymous IDs, so each user is represented by a unique ID. The experiments were carried out with datasets covering 7175 web pages accessed by 287 different users. The schema of this log dataset is {AnonID, Query, QueryTime, ItemRank, ClickURL} [23], where AnonID is an anonymous user ID number that preserves user privacy [28, 29], Query is the query issued by the user, QueryTime is the time at which the query was submitted for search, ItemRank is the rank of the search result the user clicked on, if any [30, 31], and ClickURL is the domain portion of the URL that the user clicked on in the search results. The web access log dataset is divided into seven samples of equal size, with 50 records each, as listed in the following Table 4:

Table 4 Various sample datasets used for experimentation

5.2 Evaluation metrics

In order to verify the performance of the proposed algorithm, the following metrics were identified: F1-Measure, Miss Rate (MR), Fallout Rate (FR) and Matthews Correlation [32]. To compute these evaluation metrics, the contingency table shown in Table 5 is used.

Table 5 Contingency table used to compute precision and recall

5.2.1 F1-measure

The F1-Measure is computed from two metrics: Precision, or True Positive Accuracy (confidence), and Recall, or True Positive Rate [32]. Precision is defined as the ratio of recommended web pages that are relevant to the user query to the total number of recommended items [33,34,35], and is represented by Eq. (22).

$$ {\text{Precision}} = \frac{\text{TP}}{{{\text{TP }} + {\text{ FP}}}} $$
(22)

Recall, calculated as per Eq. (23), is defined as the ratio of recommended web pages that are relevant to the total number of relevant web pages [32] considered for experimentation purposes.

$$ {\text{Recall}} = \frac{\text{TP}}{{{\text{TP }} + {\text{ FN}}}} $$
(23)

The specifications of TP, FP, TN and FN are stated in Table 5 [32]. These precision and recall values are used to compute the F1-measure as given in Eq. (24).

$$ {\text{F1 }} = \frac{{ 2 \times {\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision }} + {\text{ Recall}}}} $$
(24)

5.2.2 Miss rate (MR)

The miss rate is calculated from the total number of relevant web pages that were not recommended [32]. This is also termed the False Negative Rate, as denoted in Eq. (25).

$$ {\text{MR}} = \frac{\text{FN}}{{{\text{TP }} + {\text{ FN}}}} $$
(25)

5.2.3 Fall out rate (FR)

The false positive rate, calculated using Eq. (26) and also called the Fallout Rate, is defined as the ratio of irrelevant pages that were recommended to the total number of irrelevant pages [32].

$$ {\text{FR}} = \frac{\text{FP}}{{{\text{FP }} + {\text{ TN}}}} $$
(26)

5.2.4 Matthews correlation (MC)

The Matthews Correlation is used to analyze the effectiveness of the proposed classification algorithm [32]. This is computed using the Eq. (27).

$$ {\text{MC}} = \frac{{\left( {{\text{TP}} \cdot{\text{TN}}} \right) - ({\text{FP}} \cdot{\text{FN)}}}}{{\sqrt {\left( {{\text{TP }} + {\text{ FN}}} \right)\cdot\left( {{\text{FP }} + {\text{ TN}}} \right)\cdot ( {\text{TP }} + {\text{ FP)}}\cdot ( {\text{FN }} + {\text{ TN)}}} }} $$
(27)
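For completeness, the four metrics of Eqs. (22)–(27) can be computed directly from the contingency counts of Table 5, as in the sketch below (the example counts are arbitrary).

```python
import math

def evaluation_metrics(tp, fp, tn, fn):
    """F1 (Eqs. 22-24), Miss Rate (Eq. 25), Fallout Rate (Eq. 26) and
    Matthews Correlation (Eq. 27) from the contingency counts of Table 5."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    miss_rate = fn / (tp + fn) if tp + fn else 0.0
    fallout = fp / (fp + tn) if fp + tn else 0.0
    denom = math.sqrt((tp + fn) * (fp + tn) * (tp + fp) * (fn + tn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return {"F1": f1, "MR": miss_rate, "FR": fallout, "MC": mcc}

print(evaluation_metrics(tp=42, fp=8, tn=35, fn=15))
```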

5.3 Experiment results

Experiments were conducted using the seven data samples under three algorithms: Collaborative Filtering (CF), Case-Based Reasoning (CBR), and Case-Based Reasoning with Weighted Association Rule Mining (CBR with WARM). The graphs for F1-Measure, Miss Rate (MR), Fallout Rate (FR) and Matthews Correlation (MC) are shown in Figs. 2, 3, 4 and 5, respectively.

Fig. 2
figure 2

Comparing F1-measure with varying “k” values (from k = 5 to 40)

Fig. 3
figure 3

Testing miss rate of three algorithms with seven data samples

Fig. 4
figure 4

Testing fallout rate of three algorithms with seven data samples

Fig. 5
figure 5

Comparing the efficiency of various algorithms using Matthews Correlation

The F1-measure was analyzed with varying values of “k” (k = 5, 10, 15, 20, 25, 30, 35 and 40). Figure 2 shows that, across all algorithms and all tested data samples, the optimum value of “k” lies within 20–25, where the F1-measure is highest. To verify the error behaviour of the proposed CBR with WARM algorithm, the miss rate and fallout rate were tested by conducting experiments with the same seven data samples described in Table 4.

Figures 3 and 4 show that the miss rate (false negative rate) and fallout rate (false positive rate) of the proposed algorithm are reduced compared to the CBR system and the existing traditional collaborative filtering approach. Figure 5 analyses the effectiveness of the proposed CBR with WARM algorithm using the Matthews Correlation (MC); the proposed algorithm was found to outperform the other two approaches.

6 Conclusions

In this paper, a novel approach to developing user profiles was proposed, in which eight characteristic features and two content-based features were identified for efficient classification of user profiles. In addition, a new algorithm based on Case-Based Reasoning was proposed that enhances the performance of collaborative filtering based web page recommendation. To further optimize and increase the accuracy of the recommendation process, the Weighted Association Rule Mining approach was applied along with CBR. To analyze the effectiveness of the proposed algorithms, experiments were conducted on seven test case samples for three algorithms, namely Collaborative Filtering, Case-Based Reasoning, and Case-Based Reasoning with Weighted Association Rule Mining. The experimental results on the AOL dataset show that the optimal ‘k’ value for selecting neighbors in the CBR approach lies between 20 and 25, and that the proposed algorithm has the lowest miss rate and fallout rate. In terms of classification efficiency, the proposed system was found to outperform the existing methods.