1 Introduction

With the ever-growing use of computing techniques across different applications, data mining has also gained rapid momentum in recent years. A number of data mining techniques have been reported in the literature, and one of the most widely studied problems in data mining research is association rule mining (ARM) [1].

ARM algorithms extract patterns from large transactional databases. Their goal is to uncover inherent relationships among items by extracting from the database all association rules that satisfy predefined minimum support and confidence thresholds. The problem has many applications in marketing, decision analysis and business management [2].

At the core of this problem is the mining of frequent itemsets. Frequent itemset mining identifies items that commonly appear together in transactional databases, i.e., itemsets bought by a large number of customers. If this number exceeds a user-defined threshold, the itemset is considered frequent. Algorithms such as Apriori and FP-growth have been used for mining frequent itemsets, and many algorithms in the research literature have evolved from these basic ones [2].

The conventional ARM model has been extended to weighted association rule mining (WARM), in which a weight is assigned to every individual item [3]. Through weight assignment, target itemsets are prioritized for selection according to their importance rather than their occurrence rate alone.

Web documents contain vast amounts of information and hence attract many researchers to the application of web mining techniques. Web content mining, web structure mining and web usage mining are the three types of web mining used for knowledge and information discovery. The first two use the real or primary data on the web and are used for analyzing web page contents. Web usage mining mines secondary data, the textual log data stored on web servers that is derived from users' interactions with the web; it generates data based on the browsing patterns or behaviors of the users [4].

Web usage mining, also known as web log mining, has four stages: data collection, data preprocessing, pattern discovery and pattern analysis [5]. Web log mining is a powerful tool for discovering knowledge about users' behavioral patterns and website usage information that can be used in various website design tasks. User behavioral patterns can be exploited to improve web content management and link connectivity and thus enhance website services.

This paper provides several data preprocessing techniques and algorithms that can be used to convert raw web server logs into user session files in order to carry out web usage mining.

The rest of the paper is structured as follows. Section 2 reviews related work on frequent patterns and WARM for web logs. Section 3 details the proposed system: preprocessing of the web server logs, weight estimation techniques, the T+weight tree structure and the weight-based WARM algorithm. Section 4 presents the experimental evaluation. Section 5 concludes the paper.

2 Related work

The FP-growth algorithm has been used to acquire frequent access patterns from web log data and to provide valuable information about the users [6].

An efficient mining methodology for WARM is proposed by Wang et al. [7]; it assigns a numerical attribute to every item, which determines the weight of the item within a particular weight domain. The problem of discovering significant binary relationships in transaction datasets in a weighted setting is addressed by Tao et al. [3], where each item is allowed to have a weight. Their approach builds on the idea of WARM [7], which is both scalable and efficient in discovering significant relationships in weighted settings.

A new algorithm called combined frequent pattern mining (CFPM) is proposed by Sun and Zhang [8] to cater specifically for web log data.

Recently used frequent web pages may be kept in a server’s cache to speed up web access [9]. The WARM technique finds pages of the user’s current interest and caches them to give faster internet access. This approach captures both the user’s habit and interest, whereas other approaches emphasize habit alone.

Web log files are examined to extract useful information that helps improve the services offered by web portals and by information access and retrieval tools, and that reveals problems faced by users. Mining frequent patterns from web log data can help optimize the structure of a web site and improve the performance of web servers, and web users benefit from these frequent patterns. Frequent pattern mining techniques for discovering different types of patterns in a web log are discussed in [10].

The conventional models in the existing literature do not take into account the differences between transactions, and WARM does not work on databases with only binary attributes. A new link-based measure called w-support takes the quality of transactions into consideration rather than preassigned weights [11].

An enhanced algorithm for mining frequently visited groups of web pages, considering both the site topology and the content of the web pages, is proposed by Yang et al. [12]. Here the support calculation includes two further arguments, the content-link ratio of the web page and the inter-linked degree of the page group, so that frequently visited web pages of greater interest to users are delivered.

The ARM model assumes that all items have the same significance and does not take their weights into account. WARM, in contrast, requires each item to be given a weight value reflecting its importance to the users; the weights may correspond to special promotions on some products or to the profitability of different items [13].

The proposed system mines the web log for frequent pagesets using an enhanced WARM approach built around a tree data structure called the T+weight tree.

3 Proposed system

The proposed system performs weighted rule mining from web logs by assigning a weight to each web page visited by users. Weights are assigned based on factors such as access frequency, the page's significance, dwelling time and visitor interest. The T+weight tree algorithm is proposed to discover the frequent and significant web pages from web log data.

Some pages may be visited by only a few visitors, yet those visitors may dwell on them for much longer, which indicates the importance of such pages. In the existing literature, the dwelling time of users, which may convey the importance of a page's contents, is not taken into consideration; only the number of times a page is visited is counted, even when the time spent on that page is small, which suggests the page may be of lesser significance. Among the visited pages, some may hold visitors longer because they contain more useful information than others. The proposed T+weight tree addresses this problem. Figure 1 shows the process involved in weighted rule mining. Preprocessing the web log is essential before weights are assigned to web pages. Cleaning the user log obtained from the web server is the first and foremost task in preprocessing; it removes irrelevant and redundant data. The cleaned log is then processed per IP address to obtain session-wise details. Next, the T+weight tree is constructed for frequent pageset mining and WARM, with page dwelling time used as the weight.

Fig. 1 System architecture

Table 1 Access sequence details for one session ID of an IP address
Table 2 Session details

3.1 Preprocessing

Data preprocessing plays an essential role in the web usage mining process. It involves a series of operations on the web log file, covering data cleaning, user identification, session identification, path completion and transaction identification. It removes invalid data, improves the quality of the mining model and can also reduce the mining time. Although data preprocessing is the more complex task in web usage mining, it provides a structured, reliable and integrated data source for pattern discovery. Raw log files contain unimportant details such as image accesses and failed entries, which affect the accuracy of pattern discovery and analysis [14]. Almost 80 % of the mining effort is often spent on improving data quality; hence web log analysis and data preprocessing are essential.
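As an illustration of the cleaning step, the following minimal Java sketch (not the authors' implementation) drops image accesses and failed entries from a raw access log; the space-separated common-log-format layout and the field positions are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the data cleaning step: keep only successful page requests,
// dropping image accesses and failed entries from a raw access log.
// The space-separated common log format (URL at field 6, status at field 8) is assumed.
class LogCleaner {
    static List<String> clean(List<String> rawLines) {
        List<String> cleaned = new ArrayList<>();
        for (String line : rawLines) {
            String[] fields = line.split(" ");
            if (fields.length < 9) continue;                      // malformed entry
            String url = fields[6];                               // requested URL
            String status = fields[8];                            // HTTP status code
            boolean isImage = url.matches(".*\\.(gif|jpg|jpeg|png|css|js)$");
            boolean failed = !status.startsWith("2") && !status.startsWith("3");
            if (!isImage && !failed) {
                cleaned.add(line);                                // keep relevant page request
            }
        }
        return cleaned;
    }
}
```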

The web log data are kept in the form of a relational database model. The weight of a visited page is taken as a function of the visitors' page dwelling time; therefore, the dwelling time of each URL must be present in the web log as one of the attributes for every page visited. This arrangement simplifies the mining process for obtaining the significant pages with their weights.

The web page request details are preserved in the web logs. Attributes such as page URL, IP address, session ID, requested time and dwelling time are recorded for each page request in the log.

The web log is preprocessed to obtain a backend database in relational form for mining, as shown in Table 1 (access sequence details) and Table 2 (session details). The tables are drawn from the web log of an educational institution, which serves as the dataset for the proposed work. Each page request is restructured as a separate entry with a unique session ID for the user [15]. The session details record, for a particular duration, the total number of sessions in the web log and the corresponding total number of pages across all sessions.

Dwelling time can be included as one of the attributes in Table 1 by using the following Eq. (1).

$$\begin{aligned} T_{dp} =\sum \limits _{p=1}^m {(R_{p+1} -R_p )} \end{aligned}$$
(1)

Since the requested time is available for each page in the table, the dwelling time of the pth page is calculated by subtracting the requested time of the pth page, R\(_\mathrm{p}\), from the requested time of the (p+1)th page, R\(_\mathrm{p+1}\), as given in Eq. (1), on a common time scale. Alternatively, by means of cookies at the client node, the dwelling time of each page can be recorded separately and sent to the web log when the session closes [16], so that the page dwelling time is available directly from the log.
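A minimal sketch of this calculation is given below; it is an illustration rather than the authors' code, the class and field names are assumed, and the treatment of a session's last page (which has no following request) is also an assumption.

```java
import java.util.List;

// Minimal sketch: derive per-page dwelling time from consecutive request
// timestamps within one session, following Eq. (1).
class PageRequest {
    String pageUrl;
    long requestTimeMillis;   // requested time R_p on a common time scale
    long dwellingTimeMillis;  // to be filled in as R_{p+1} - R_p

    PageRequest(String pageUrl, long requestTimeMillis) {
        this.pageUrl = pageUrl;
        this.requestTimeMillis = requestTimeMillis;
    }
}

class DwellingTimeCalculator {
    // 'requests' holds the page views of a single session, ordered by request time.
    static void fillDwellingTimes(List<PageRequest> requests, long lastPageDefaultMillis) {
        for (int p = 0; p < requests.size(); p++) {
            if (p + 1 < requests.size()) {
                requests.get(p).dwellingTimeMillis =
                        requests.get(p + 1).requestTimeMillis - requests.get(p).requestTimeMillis;
            } else {
                // The session's last page has no following request; a default value
                // (or a cookie-reported time [16]) is used instead.
                requests.get(p).dwellingTimeMillis = lastPageDefaultMillis;
            }
        }
    }
}
```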

Table 3 shows one such sample visitor's log derived from the web log. For illustration, it is assumed that in one day, 5 visitors (based on the total number of session IDs) have visited a total of 4 pages. Each element of a record in Table 3 represents the page dwelling time of the respective page.

If the user-defined minimum support is 2, at least 2 visitors should have visited a page; otherwise the page is not considered frequent. Hence, page p4 in Table 3 is not considered frequent, even though it has a longer dwelling time than the other pages and may be more informative and desirable. Dwelling time therefore plays an important role, and ignoring it can lead to loss of information. For this reason the proposed system gives due credit to weighted support.

Table 3 Sample Visitor’s log

3.2 Weight estimation

3.2.1 Weighted attributes

Tao et al. [17] have considered weight settings for a supermarket domain, where A = (a\(_{1}\), a\(_{2}\), ..., a\(_{k}\)) denotes the attributes chosen to compute weights. As the proposed algorithm's domain is web log mining, the visitor's page dwelling time is considered the essential attribute in the log. There are two categories of weights, namely the page weight and the pageset weight.

3.2.2 Page weight

Page weight is a value attached to a visited page URL, representing its significance, and is denoted w(p). It is a function of the selected weighting attributes, w(p) = f(a). The weight-related attribute here is the users' average dwelling time on that page.

3.2.3 Pageset weight

Based on the page weight w(p), the weight of a pageset, denoted w(ps), can be obtained from the weights of the pages visited by a particular user. The average value of the page weights is calculated using the following Eq. (2).

$$\begin{aligned} w(ps)=\frac{\sum \limits _{x=1}^{\left| {ps} \right| } {w(p_x )} }{\left| {ps} \right| } \end{aligned}$$
(2)

In a web log database, some pages may be visited by many people for short times, while other pages may be visited by few people but for much longer times. Both cases are significant: the former indicates the usual, frequently revisited pages that have attracted visitors' interest, whereas the latter indicates the importance of informative pages. This has motivated the authors to derive a formula for calculating the weight based on dwelling time. Equation (2) leads to the weight of an m-pageset ps, given by

$$\begin{aligned} w(ps)=\frac{n_{ps} }{n}\left( \,\sum \limits _{q=1}^n \sum \limits _{p=1}^m {T_{pq} } \right) \end{aligned}$$
(3)

Hence, the pageset weight is the product of the fraction of visitors who visited the pageset in a particular duration, out of the total number of visitors in that duration, and the sum of the dwelling times of all the visitors of that pageset. Here n\(_\mathrm{ps}\) and n are the number of visitors who visited the pageset and the total number of visitors, respectively, for a fixed duration; m is the number of pages in the pageset; and T\(_\mathrm{pq}\) represents the dwelling time of a page p\(\varepsilon \)ps for the qth visitor. The proposed methodology is based on weighted minimum support: if a pageset satisfies the user-defined weighted minimum support w\(_\mathrm{min-sup}\), it is considered a frequent pageset. In the case of a 1-pageset, the total pageset weight is equal to the total page weight and hence \(n_{ps}=n_{p}\).
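A minimal sketch of this weight computation follows; it is not the authors' implementation, and the representation of the log as one page-to-dwelling-time map per visitor, as well as all names, are assumptions. Following the worked example in Sect. 3.3, only visitors who visited the whole pageset contribute dwelling time.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the pageset weight of Eq. (3):
// w(ps) = (n_ps / n) * sum of dwelling times T_pq over the visitors of ps.
class PagesetWeight {
    // 'records' holds one page->dwelling-time map per visitor for the fixed duration (n = records.size()).
    static double weight(Set<String> pageset, List<Map<String, Double>> records) {
        int n = records.size();
        int nPs = 0;             // visitors who visited every page of the pageset
        double dwellSum = 0.0;   // their dwelling times summed over the pages of the pageset
        for (Map<String, Double> visitor : records) {
            if (visitor.keySet().containsAll(pageset)) {
                nPs++;
                for (String page : pageset) {
                    dwellSum += visitor.get(page);
                }
            }
        }
        return ((double) nPs / n) * dwellSum;
    }
}
```

With the values of Table 3 this reproduces, for instance, w({p2,p3}) = (2/5) * 13 = 5.2 as computed in Sect. 3.3.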

In the case of WARM, the downward closure property may not hold: it is not essential that all the subsets of a frequent pageset be frequent, because the notion of a frequent pageset is adapted to handle weighted support. This is overruled in the proposed T+weight tree method, where every page in a pageset has to be individually frequent with respect to the records of that pageset.

3.3 T+weight tree

The abstraction of the database is stored in memory and used to mine the frequent pagesets based on w\(_\mathrm{min-sup}\).

For Table 3, let w\(_\mathrm{min-sup}\) = 2. Applying the weight definition of Eq. (3), pages p1, p2, p3 and p4 are all frequent 1-pagesets, because an individual page weight is the ratio of the total dwelling time of all the visitors who visited that page to the total number of visitors, both for a stipulated duration. The individual page weights for our sample database are: w(p1) = (3*18)/5 = 10.8, w(p2) = (4*14)/5 = 11.2, w(p3) = (3*13)/5 = 7.8 and w(p4) = (1*15)/5 = 3. The 2-pageset {p2,p3} is also frequent, as weight({p2,p3}) = 2*((2+4)+(2+5))/5 = (2*13)/5 = 5.2; only the visitors who visited both p2 and p3 are considered when adding the dwelling times. Within these records, w(p2) = (2*4)/5 = 1.6, which is less than w\(_\mathrm{min-sup}\). Even though p2 appears with such a small weight, {p2,p3} is frequent under weighted support alone. This is the basic motivation behind the proposed T+weight tree method, a tree-based data structure for finding the frequent pagesets based on weight in a single scan of the database.

An m-pageset ps is frequent and satisfies the downward closure property if it meets both of the following conditions:

(i) The weight of each individual page in pageset ps, computed with respect to only the set of records containing ps, is \(\ge \) w\(_\mathrm{min-sup}\).

(ii) The total weight of ps is \(\ge \) w\(_\mathrm{min-sup}\).

The objective of the proposed system is to discover weighted association rules for a given input web log database D using the T+weight tree method. The weighted association rules are obtained through the following steps.

1. The T+weight tree is built with every element of every record in D.

2. All frequent pagesets are discovered using w\(_\mathrm{min-sup}\).

3. Weighted association rules are discovered by applying ARM with a user-defined minimum confidence [2].

Weighted association rules are rules that carry frequent, weighted pages in both the antecedent and the consequent of an association rule.

For finding frequent pagesets and weighted association rules from web logs, the proposed T+weight tree method performs better than the Weighted tree method [18].

Fig. 2 a Page node. b SID node

3.3.1 T+weight tree node structure

The T+weight tree has two types of nodes, as shown in Fig. 2.

1. Page node: It has a page URL with two pointers; one points to the nodes holding session IDs and weights, and the other points to the next page URL.

2. SID node: It has two parts, the session ID (SID) and the weight, as shown in Fig. 2b. The weight indicates the time dwelled on the page in that session. As many SID nodes as there are sessions containing that page URL are coupled to the respective page node by means of pointers.

The database is scanned only once to obtain the T+weight tree. While scanning, a separate page node is formed for each new page URL encountered, and the SID nodes for each visited page are affixed to the respective page node with the dwelling time as the weight.
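The node structure can be sketched in Java as follows; this is an illustration only, the field names are assumptions, and a list stands in for the pointer chain of SID nodes in Fig. 2.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the two T+weight tree node types of Fig. 2.
class SIDNode {
    int sessionId;     // SID of the session in which the page was visited
    double weight;     // dwelling time of the page in that session

    SIDNode(int sessionId, double weight) {
        this.sessionId = sessionId;
        this.weight = weight;
    }
}

class PageNode {
    String pageUrl;                                // page URL
    List<SIDNode> sidNodes = new ArrayList<>();    // pointer to the SID/weight nodes
    PageNode nextPage;                             // pointer to the next page URL

    PageNode(String pageUrl) {
        this.pageUrl = pageUrl;
    }
}
```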

3.4 WARM algorithm based on weights

The T+weight tree method of WARM has the following steps.

(i) Creating the T+weight tree.

(ii) Eliminating infrequent page URLs from the tree.

(iii) Finding frequent pagesets based on weights.

(iv) Performing WARM.

These steps can be achieved by means of the following algorithms (Algorithm A to Algorithm E).

Algorithm A: Creation of T+weight tree.

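Since Algorithm A is given as a figure, the following minimal sketch illustrates the single-scan construction described in Sect. 3.3.1, reusing the node classes sketched there; the class name TPlusWeightTree and the use of a map in place of the next-page pointer chain are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of Algorithm A: one scan over the preprocessed log creates a
// page node per distinct URL and appends an SID node (session ID, dwelling time)
// to it for every visit of that page.
class TPlusWeightTree {
    // Page nodes indexed by URL; the insertion order stands in for the next-page pointers.
    Map<String, PageNode> pageNodes = new LinkedHashMap<>();
    int totalVisitors;   // n: total number of sessions for the fixed duration

    void insert(int sessionId, String pageUrl, double dwellingTime) {
        PageNode node = pageNodes.get(pageUrl);
        if (node == null) {                      // first time this URL is encountered
            node = new PageNode(pageUrl);
            pageNodes.put(pageUrl, node);
        }
        node.sidNodes.add(new SIDNode(sessionId, dwellingTime));
    }
}
```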

Algorithm B: Reducing the T+weight tree

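A minimal sketch of the reduction step, under the same assumptions: a page node is removed when its 1-pageset weight, the 1-pageset form of Eq. (3), falls below w\(_\mathrm{min-sup}\).

```java
// Minimal sketch of Algorithm B: prune page nodes whose 1-pageset weight is
// below the user-defined weighted minimum support.
class TreeReducer {
    static void reduce(TPlusWeightTree tree, double wMinSup) {
        tree.pageNodes.values()
                .removeIf(page -> pageWeight(page, tree.totalVisitors) < wMinSup);
    }

    // w(p) = (n_p / n) * total dwelling time of the visitors of p (1-pageset form of Eq. (3)).
    static double pageWeight(PageNode page, int totalVisitors) {
        double dwellSum = 0.0;
        for (SIDNode sid : page.sidNodes) {
            dwellSum += sid.weight;
        }
        return ((double) page.sidNodes.size() / totalVisitors) * dwellSum;
    }
}
```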

Algorithm C: Finding S (the set of all subsets of F1 with at least two pages)

The total number of subsets of an n-element set is 2\(^\mathrm{n}\). By the mathematics of combinations, the number of r-element subsets of an n-element set is nC\(_\mathrm{r}\), given by

$$\begin{aligned} nC_r =\frac{n!}{r!( {n-r})!} \end{aligned}$$
(4)

Since single-element subsets and the null set are not required, only 2\(^\mathrm{n}\)-nC\(_{1}\)-1 subsets (i.e., 2\(^\mathrm{n}\)-n-1, since nC\(_{1}\) = n) need to be generated.

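A minimal sketch of this subset generation using bit masks (an implementation choice, not necessarily the authors'): subsets with fewer than two pages are skipped, leaving the 2\(^\mathrm{n}\)-n-1 candidates.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of Algorithm C: enumerate all subsets of F1 having at least
// two pages (2^n - n - 1 of them) via bit masks over the n pages of F1.
class SubsetGenerator {
    static List<Set<String>> subsetsOfSizeTwoOrMore(List<String> f1) {
        List<Set<String>> s = new ArrayList<>();
        int n = f1.size();
        for (int mask = 1; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) < 2) continue;   // skip the null set and 1-element subsets
            Set<String> subset = new LinkedHashSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) subset.add(f1.get(i));
            }
            s.add(subset);
        }
        return s;
    }
}
```

For {F1} = {p1, p2, p3, p4} this yields the 11 candidate pagesets listed in Sect. 3.5.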

Algorithm D: Discovering frequent pagesets

Let {x} be a set of attributes (pages) that is an element of {S}, and let \(\vert \)R\(\vert \) be the number of records in {R} (the visitors of the particular pageset, n\(_\mathrm{ps}\)).

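Since Algorithm D is given as a figure, the sketch below follows the description above and the illustration in Sect. 3.5: R is the set of sessions containing the whole candidate pageset, condition (i) of Sect. 3.3 is checked per page over R, and condition (ii) checks the total pageset weight. The names are illustrative and the tree classes are the earlier sketches.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of Algorithm D: a candidate pageset {x} from {S} is frequent only
// if every page of {x} is individually frequent over the records R containing the
// whole pageset (condition (i)) and the total pageset weight of Eq. (3) meets
// w_min-sup (condition (ii)).
class FrequentPagesetMiner {
    static List<Set<String>> discover(TPlusWeightTree tree, List<Set<String>> candidates,
                                      double wMinSup) {
        List<Set<String>> fAll = new ArrayList<>();
        for (Set<String> x : candidates) {
            Set<Integer> r = commonSessions(tree, x);     // R: sessions containing all of {x}
            if (r.isEmpty()) continue;
            double factor = (double) r.size() / tree.totalVisitors;   // n_ps / n
            double totalDwell = 0.0;
            boolean allPagesFrequent = true;
            for (String page : x) {
                double pageDwell = 0.0;
                for (SIDNode sid : tree.pageNodes.get(page).sidNodes) {
                    if (r.contains(sid.sessionId)) pageDwell += sid.weight;
                }
                totalDwell += pageDwell;
                if (factor * pageDwell < wMinSup) {       // condition (i) fails
                    allPagesFrequent = false;
                    break;
                }
            }
            if (allPagesFrequent && factor * totalDwell >= wMinSup) {  // condition (ii)
                fAll.add(x);
            }
        }
        return fAll;
    }

    static Set<Integer> commonSessions(TPlusWeightTree tree, Set<String> pageset) {
        Set<Integer> r = null;
        for (String page : pageset) {
            Set<Integer> sids = new HashSet<>();
            for (SIDNode sid : tree.pageNodes.get(page).sidNodes) {
                sids.add(sid.sessionId);
            }
            if (r == null) r = sids; else r.retainAll(sids);
        }
        return r == null ? new HashSet<>() : r;
    }
}
```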

Algorithm E: WARM

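A minimal sketch of the rule-generation step under the same assumptions: for each frequent pageset {x}, every non-empty proper subset {s} is tried as an antecedent, and the rule {s} -> {x}\{s} is output when w({x})/w({s}) meets the user-defined minimum confidence, as in the example of Sect. 3.5.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of Algorithm E (WARM): generate weighted association rules from
// the frequent pagesets, keeping a rule when w(x) / w(s) >= the minimum confidence.
class WeightedRuleMiner {
    static List<String> mineRules(TPlusWeightTree tree, List<Set<String>> frequentPagesets,
                                  double minConfidence) {
        List<String> rules = new ArrayList<>();
        for (Set<String> x : frequentPagesets) {
            double wX = pagesetWeight(tree, x);
            List<String> pages = new ArrayList<>(x);
            int n = pages.size();
            for (int mask = 1; mask < (1 << n) - 1; mask++) {   // non-empty proper subsets of x
                Set<String> s = new LinkedHashSet<>();
                Set<String> rest = new LinkedHashSet<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) s.add(pages.get(i)); else rest.add(pages.get(i));
                }
                double confidence = wX / pagesetWeight(tree, s);
                if (confidence >= minConfidence) {
                    rules.add(s + " -> " + rest + " (confidence = " + confidence + ")");
                }
            }
        }
        return rules;
    }

    // w(ps) = (n_ps / n) * sum of dwelling times of ps over the sessions containing ps (Eq. (3)).
    static double pagesetWeight(TPlusWeightTree tree, Set<String> pageset) {
        Set<Integer> r = FrequentPagesetMiner.commonSessions(tree, pageset);
        double dwellSum = 0.0;
        for (String page : pageset) {
            for (SIDNode sid : tree.pageNodes.get(page).sidNodes) {
                if (r.contains(sid.sessionId)) dwellSum += sid.weight;
            }
        }
        return ((double) r.size() / tree.totalVisitors) * dwellSum;
    }
}
```

For {x} = {p1,p2} and {s} = {p1}, this reproduces the confidence 7.6/10.8 \(\approx \) 0.7 computed in Sect. 3.5.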

3.5 Illustration for the sample Visitor’s log

The T+weight tree for the sample visitor's log of Table 3 is shown in Fig. 3.

Fig. 3 T+weight tree for the sample database

In Fig. 3, the ellipses are the pages in the database and the rectangular boxes are the session IDs with their weights. If w\(_\mathrm{min-sup}\) = 2, then, since all the 1-pagesets are frequent as shown in Sect. 3.3, there is no branch reduction in the tree. For this tree, {F1} = {p1, p2, p3, p4} and {S} is the set of subsets of {F1} excluding the 1-element and null sets. A 4-element set has 2\(^{4}\) subsets; leaving out the 1-element and null sets, 2\(^{4}\)-4C\(_{1}\)-1 = 11 subsets are generated. Hence,

$$\begin{aligned} {\{}S{\}}= & {} {\{}{\{}p1,p2{\}},{\{}p1,p3{\}},{\{}p1,p4{\}},{\{}p2,p3{\}},{\{}p2,p4{\}},\\&{\{}p3,p4{\}},{\{}p1,p2,p3{\}},{\{}p1,p2,p4{\}},{\{}p1,p3,p4{\}},\\&{\{}p2,p3,p4{\}},{\{}p1,p2,p3,p4{\}}{\}} \end{aligned}$$

Each element set of {S} is taken as {x}. Applying Algorithm D, first consider the element set {p1,p2}.

{x} = {p1, p2}
R = {1,3,4}
a = p2 (the element other than the first)
R = {1,3,4}\(\cap \){1,2,4,7} = {1,4}
R is non-empty.
For each element e in {x}:
e = p1: 2(8+6)/5 = 5.6 \(\ge \) w\(_\mathrm{min-sup}\)
e = p2: 2(3+2)/5 = 2 \(\ge \) w\(_\mathrm{min-sup}\)

If w\(_\mathrm{min-sup}\) were 3, then p2 would not be frequent and hence {p1,p2} would not be a frequent pageset.

Next, the weight of the pageset {x} is checked: 2(8+3+6+2)/5 = 7.6 \(\ge \) w\(_\mathrm{min-sup}\). Hence {x} = {p1,p2} is a frequent 2-pageset and becomes an element of {F\(_\mathrm{all}\)}.

All the attributes (pages) of {x} over the records in R have to be individually frequent. If at least one attribute does not meet the user-defined w\(_\mathrm{min-sup}\), there is no need to check the remaining attributes of {x}, and the next element of {S} is considered.

Checking all the subsets in {S} similarly gives {F\(_\mathrm{all}\)} = {{p1,p2},{p1,p3}}; the remaining sets of {S} are infrequent. The complete collection of frequent pagesets therefore consists of the individual elements of {F1}, the set of frequent 1-pages, together with {F\(_\mathrm{all}\)}. To mine a smaller number of more significant pages, w\(_\mathrm{min-sup}\) has to be raised.

Weighted association rules are then mined with the help of Algorithm E. For example, consider one frequent 2-pageset from {F\(_\mathrm{all}\)}, say {x} = {p1,p2}, with a user-defined minimum confidence c = 50 %, i.e. 0.5.

$$\begin{aligned}&{\{}s{\}} = {\{}p1{\}}\\&w({\{}x{\}})/w({\{}s{\}}) = w{\{}p1,p2{\}}/w{\{}p1{\}} = 7.6/10.8\\&\quad = 0.7 \ge c _{min} \end{aligned}$$

Hence the rule {p1} -> {p2} is output, as it is a strong association rule satisfying the minimum confidence. Strong association rules are mined similarly for the other frequent pagesets using WARM.

4 Experimental evaluation

To assess the performance of the proposed method, experiments were conducted on two different datasets: the EDI (educational institution) dataset and the msnbc dataset, available in the UCI machine learning repository and derived from Internet Information Server (IIS) logs for msnbc.com. The Weighted tree [18] and the proposed T+weight tree were compared in terms of both speed (CPU execution time in seconds) and space (number of frequent pagesets generated) for various values of w\(_\mathrm{min-sup}\). For the sake of comparison, both datasets cover one full day, i.e. twenty-four hours. Timing stubs were included in the program to measure the CPU execution time.

Fig. 4 Comparison of execution time for Weighted tree and T+weight tree for the msnbc dataset

Fig. 5 Comparison of execution time for Weighted tree and T+weight tree for the EDI dataset

Experiments were performed on an Intel Core i5 machine with a 3.2 GHz processor, 2 GB RAM and a 500 GB hard disk, running Windows XP. The T+weight tree algorithm is implemented in Java.

The proposed method shows better performance because of the difference in the formula used for calculating the pageset weight. Experimental results are provided in Figs. 4, 5, 6, 7 and 8. Tables 4 and 5 compare the execution time and the number of frequent pagesets generated by the Weighted tree and T+weight tree methods for the msnbc and EDI datasets. The empirical results confirm that, on these datasets, the proposed T+weight tree method takes less execution time than the Weighted tree method, since it produces a comparatively smaller number of more significant pagesets.

Fig. 6 Comparison of the number of pagesets generated by Weighted tree and T+weight tree for the msnbc dataset

Fig. 7 Comparison of the number of pagesets generated by Weighted tree and T+weight tree for the EDI dataset

Fig. 8 Scalability of T+weight tree on the EDI dataset

Table 4 Execution time and number of frequent pagesets generated for msnbc dataset
Table 5 Execution time and number of frequent pagesets generated for EDI dataset

The execution times of the two methods are compared in Figs. 4 and 5, and the numbers of pagesets they generate are compared in Figs. 6 and 7. As w\(_\mathrm{min-sup}\) increases, the execution time and space decrease more for the T+weight tree than for the Weighted tree.

Comparing the two datasets, the execution time and the number of frequent pagesets mined are higher for the msnbc dataset than for the EDI dataset, as msnbc is a global news repository. Mining the msnbc dataset clearly reveals which news affected people most, whether positively or negatively, and what its impact was; changes in trends can also be observed.

The T+weight tree is scalable, performing well even as the input dataset size increases. To test this, both datasets were split into smaller datasets of different sizes. The dataset size was varied from 40 to 200 KB, and Fig. 8 compares the results with the Weighted tree on the EDI dataset for a w\(_\mathrm{min-sup}\) of 6. The proposed method scales up well with increasing dataset size and can work efficiently on larger datasets. From Fig. 8 we infer that the T+weight tree is linearly scalable and scales better than the Weighted tree.

5 Conclusion

A method for mining frequent pagesets from web logs using page dwelling time as the weight has been discussed. It employs the T+weight tree structure and, in the experimental evaluation, is found to be more efficient than the Weighted tree method in both time and space. The time spent finding all the subsets of a set is, however, considerable; if an efficient way to reduce this time is found, the method would be even more effective at finding frequent pagesets.

The system discussed so far is significant for mining frequent pagesets and association rules based on weighted minimum support and confidence. However, those frequent pagesets may have occurred more often in the earlier part of the respective duration than in the later part. To distinguish recently accessed from less recently accessed frequent web pages, the log may be divided into n windows, with weights varying from a high value for the most recently accessed window to a low value for the least recently accessed one [9]. The pages in a particular window would then carry the window's weight multiplied by their dwelling time, after which the T+weight tree method may proceed as before.

The T+weight tree method of WARM is particularly useful for mining the web logs of educational institutions, to find the frequent pages that reveal the behavior of the students. Access to pages that harm them can be denied, and the informative ones can be taken into account when making future decisions.

The browsing behavior of users cannot be accurately predicted in advance for any organization and may vary randomly from day to day, depending on individual needs and current trends. In our experiments we considered one full day of data from each dataset, and the results may differ if data from another day are used. Hence, as future work, stochastic optimization models could be incorporated.