
1 Introduction

The continuous growth of the web is resulting in enormous websites, and the structure of these websites is also becoming more complex. Users often face difficulty in locating the desired information while navigating a website. From the website designer's perspective, the main challenge is to analyze user behavior and personalize the browsing experience. This not only helps users locate the required information but also improves their satisfaction level.

Web Navigation Prediction (WNP) is an emerging research area that addresses these issues. In WNP, a model is trained to predict the next web page(s) from the pages visited so far. WNP can be generalized and applied to different applications [14] such as search engines [16], caching systems and latency reduction [17], anomaly detection [8], personalization [5], website design [18], detection of malicious web pages [26, 27], recommendation systems [1], event detection [32], and location prediction [9, 28].

User navigation history is captured in the log file through cookies or web servers. A snapshot of a web log file is shown in Fig. 1. The fields of a web log are the user IP address, user authentication, date/time, action, return code, size, referral, and browser/platform. Each row in the log file [10] represents a single web page request and contains important information about the client and the requested web page. This information is recorded by the server to understand user behavior.

Fig. 1. Web log file [10]

Web logs are preprocessed and sessions are constructed from the log file, which are then used for building the prediction model. A session consists of a set of pages and can be of varied length. Longer sessions often contain noise, as pages may be repeated or the user may be following a longer path to reach the desired page. This results in a poor browsing experience and may harm the popularity of the website. According to Janrain [15], about 74% of online users get frustrated with a website when they do not find the content they require. According to Forrester research [11], a good website design can attract more users, and vice versa. Half of potential sales are negatively impacted if the user is unable to locate the desired information, and 40% of users may not return to a website after a negative experience on their first visit. In 2013, a Monetate/eConsultancy study [15] found that in-house marketers who personalized their users' browsing experience observed a 19% uplift in sales. Very short sessions, on the other hand, dilute the learning of the prediction model, so it is important to use a session length range that helps build the prediction model optimally.

This paper analyzes the performance of a prediction model over two different ranges of session length: Set A (3 to 7) and Set B (2 to 10). Set A has generally been used in past studies [4]. We compare this range with the longer range of two to ten (2–10) to find a suitable session length for model building, and thereby analyze the impact of varied session length on the prediction model.

1.1 Research Objectives

  1. This paper highlights the pre-investigation measures that are required to inject good quality inputs into the training model.

  2. Web navigations have been analysed and a detailed summary of how the pre-investigations affect the model is discussed.

  3. We have evaluated model performance using varied session lengths on three real datasets (MSWEB, BMS and Wikispeedia).

The rest of the paper is organized as follows. Section 2 gives preliminaries and the model representation. Related work is presented in Sect. 3. Experimental details are described in Sect. 4, and Sect. 5 concludes the paper.

2 Preliminaries

This section describes the basic terminology, representation and modeling of a session.

  • Sessions: A session represents the order in which a user visits web page(s) while navigating the website. A session S is represented as {P1, P2, …, Pn}, where n denotes the number of pages. Each user's browsing history is stored as a session.

  • N-grams: In WNP, the N-gram is the prominent representation used for the training model. An N-gram can be represented as <p1, p2, …, pN> and depicts a sequence of web page(s) navigated by the user. Each web page is represented by a unique page id. For example, consider a session of length six, S = <P11, P22, P5, P13, P20, P8>. In this example, the 1-gram representation contains five sessions <P11, P22>, <P22, P5>, <P5, P13>, <P13, P20>, <P20, P8>, and the 2-gram representation contains four sessions <P11, P22, P5>, <P22, P5, P13>, <P5, P13, P20>, <P13, P20, P8>. An N-gram is a fixed-length representation of sessions. Due to this fixed-length representation of the training set, the model complexity, state-space complexity and computational complexity required to build the model can be easily determined. A sliding-window sketch of this extraction is shown below.
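As an illustration only (not the authors' implementation), the following Python sketch extracts N-gram training examples from a session with a sliding window, reproducing the counts in the example above:

```python
from typing import List, Tuple

def extract_ngrams(session: List[str], n: int) -> List[Tuple[str, ...]]:
    """Slide a window of size n+1 over the session: the first n pages form
    the state (context) and the last page is the prediction target."""
    window = n + 1
    return [tuple(session[i:i + window]) for i in range(len(session) - window + 1)]

# The session of length six from the example above.
session = ["P11", "P22", "P5", "P13", "P20", "P8"]
print(extract_ngrams(session, 1))  # five 1-gram examples: ('P11', 'P22'), ('P22', 'P5'), ...
print(extract_ngrams(session, 2))  # four 2-gram examples: ('P11', 'P22', 'P5'), ...
```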

  • Markov Model (MM): The Markov model [2, 3, 12, 13] is a well-known representation used for WNP. User navigation behavior is captured in the log file and analyzed to predict the next desired information. The log file is pre-processed to obtain the sessions, which are used as input for building the Markov model. The MM is a graphical representation of sessions: each node represents a page, and the links between nodes carry the transition probability of moving from one state to another. Markov models can be formed of varied order. In a first-order MM, each state is represented by a single page. For instance, a link between states A and B is formed using the transition probability, defined as the ratio of the number of times <A, B> occurs to the number of times <A> occurs.

Transition probability to move from A to B is given by,

$$ P\left( A \to B \right) = \frac{\mu\left( A, B \right)}{\mu\left( A \right)} \quad \text{where } \mu \text{ denotes frequency} $$

In a second-order MM, each state is represented by two pages. For instance, a link between state <A, B> and <C> is formed using the transition probability, defined as the ratio of the number of times <A, B, C> occurs to the number of times <A, B> occurs.

Transition probability to move from <A, B> to C is given by,

$$ P\left( \left( A, B \right) \to C \right) = \frac{\mu\left( A, B, C \right)}{\mu\left( A, B \right)} \quad \text{where } \mu \text{ denotes frequency} $$

Similarly, higher-order MMs can be formed. In a Kth-order MM, each state is represented by K web pages. Since the accuracy of a single Kth-order MM is low, the All-Kth Markov model (KMM) was introduced. In KMM, all lower-order models are nested inside the higher-order model; if the higher-order model fails to predict, the search continues in the next lower-order model, as sketched below.
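The sketch below is our own illustration of the All-Kth idea (helper names are hypothetical): frequency counts μ are accumulated per order, and prediction starts at the highest order and backs off to lower orders when a state is unseen.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

State = Tuple[str, ...]

def build_all_kth_mm(sessions: List[List[str]], k_max: int) -> Dict[int, Dict[State, Counter]]:
    """For each order k, map a state of k pages to a Counter of next-page frequencies."""
    models: Dict[int, Dict[State, Counter]] = {k: defaultdict(Counter) for k in range(1, k_max + 1)}
    for s in sessions:
        for k in range(1, k_max + 1):
            for i in range(len(s) - k):
                models[k][tuple(s[i:i + k])][s[i + k]] += 1  # mu(state, next)
    return models

def predict(models: Dict[int, Dict[State, Counter]], recent: List[str], k_max: int) -> Optional[str]:
    """All-Kth fallback: try the highest order first, then lower orders."""
    for k in range(min(k_max, len(recent)), 0, -1):
        nexts = models[k].get(tuple(recent[-k:]))
        if nexts:
            return nexts.most_common(1)[0][0]  # argmax of mu(state, next) / mu(state)
    return None  # state unseen in every order
```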

  • All-Kth Modified Markov Model (KMMM): The accuracy of the MM is very low. Therefore, the Modified Markov model (MMM) was proposed by Mamoun et al. [2]. In this model, the order of the pages does not matter: sessions that contain the same set of pages are represented by the same state. To further enhance the performance of MMM, the All-Kth scheme is embedded into it, giving the All-Kth Modified Markov Model (KMMM). Jindal et al. [7] and Mamoun et al. [2] showed that KMMM is a compressed and effective prediction model. Therefore, in this work we choose KMMM as the prediction model to evaluate performance over varied session lengths.
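As a rough sketch of the order-insensitive state described above (our reading of MMM, not the authors' code), states can be keyed by the set of pages rather than their order, so contexts containing the same pages collapse into one state:

```python
from collections import Counter, defaultdict
from typing import Iterable

def mmm_state(pages: Iterable[str]) -> frozenset:
    """Order-insensitive state key: contexts with the same set of pages share a state."""
    return frozenset(pages)

model = defaultdict(Counter)
for context, nxt in [(("A", "B"), "C"), (("B", "A"), "C"), (("A", "B"), "D")]:
    model[mmm_state(context)][nxt] += 1

# ("A", "B") and ("B", "A") map to the same state, which compresses the model.
print(model[mmm_state(("A", "B"))])  # Counter({'C': 2, 'D': 1})
```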

3 Related Work

During website browsing, user navigation history is captured in the web log file. The web log file cannot be used directly for analysis and prediction, as it contains a lot of noisy information such as image, video, audio and robot (crawler) requests. Thus, the log files are cleaned and pre-processed. During this phase the noisy information is filtered out and users as well as sessions are identified. Sessions are the sequences of navigation trails of the users, and users are identified by their IP address.

In the past, several session generation techniques have been proposed that attempt to obtain relevant patterns from the web log file. Broadly, three session generation techniques have been used: time-based, navigation-based and integer programming.

  • Time-based: Catledge et al. [19] and Cooley et al. [21] have used page-stay time and session duration thresholds. Zhang et al. [20] proposed a dynamic time-oriented method. Sub-sessions are formed from a session when its time exceeds the respective threshold; a minimal sketch of such a timeout heuristic follows this item. Time-oriented heuristics do not consider the website structure, so many useful navigation patterns are missed during session generation. A generated session may also contain duplicate web pages; for example, {P2, P1, P1, P7, P3} or {P2, P1, P7, P1, P3} are allowed by the time-based heuristics. Here, {P1, P1} or {P1, P7, P1} causes unnecessary duplication of web page P1, which makes sessions longer. Moreover, these heuristics are not reliable, as users might get involved in other activities during web page navigation. Other factors such as web page content, content size, web page components and a busy communication line may also impact session formation.
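A minimal sketch of such a timeout heuristic (the 30-minute threshold is an assumption used for illustration only, not a value from the cited papers):

```python
from typing import List, Tuple

def split_sessions(requests: List[Tuple[str, float]], timeout: float = 30 * 60) -> List[List[str]]:
    """requests: (page, unix_timestamp) pairs of one user, sorted by time.
    A new session starts whenever the gap between consecutive requests exceeds `timeout`."""
    sessions: List[List[str]] = []
    current: List[str] = []
    last_ts = None
    for page, ts in requests:
        if last_ts is not None and ts - last_ts > timeout:
            sessions.append(current)
            current = []
        current.append(page)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```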

  • Navigation-based: Cooley et al. [22, 23] proposed a navigation-based graphical structure of web sessions. In this network, nodes represent web pages and edges represent direct links between web pages. For each navigated session, if no connection is found between two consecutive web pages, the backward-browsed web page is inserted. This artificial insertion generates longer sessions.

  • Integer programming: Dell et al. [24, 25] proposed integer programming based session generation techniques. Herein, web sessions are partitioned into chunks using IP and agent information through a logarithmic objective function. This objective function assigns each web page to the chunk of a particular session such that no duplicate web page is found in a session. For example, in the session {P1, P3, P6, P3, P6, P8, P7, P6, P6, P8, P10} there is actually no link present between pages P7 and P10. In this approach the session is split into two subsessions, {P1, P3, P6, P8, P7} and {P10}. However, according to the website topology, the correct subsessions should be {P1, P3, P6, P8, P7} and {P1, P3, P6, P8, P10}. In addition, the obtained subsession containing only web page P10 has no correlation with the other web pages, which is not correct.

The session generation techniques above present varied session identification methods, but they do not focus on deriving an optimal session length. West et al. [29] observed that the session length characterizes user navigation behavior: a shorter path means the user is stepping in the right direction, while a longer path means the user did not find the right path and might be circling around the desired page. In addition, a longer path requires more state-space complexity and a higher computational cost [30], which makes the development of the prediction model cumbersome [30] and degrades model performance. Since the success of pattern discovery depends on the quality of the input sessions injected into it [31], we have evaluated the impact of varied session lengths on the web navigation prediction model. The paper discusses the pre-investigation measures that need to be performed before generating a prediction model. These pre-investigations are required mainly to choose the optimal session length for web navigation prediction, as the prediction accuracy depends upon the input sessions. To the best of our knowledge, no past work has investigated the optimal session length for web navigation prediction. Although logs are generated, cleaned and later used for prediction in the many application areas mentioned in this paper, none of them has identified the session length as an important component that needs attention.

4 Experimental Details

Selecting an optimal session length is a major concern before developing the prediction model, because the accuracy of the model depends on the sessions taken as input. The main focus of this study is to analyze the effect of session length on the prediction model. This section presents the experimental details: the datasets used, the pre-investigation measures, the evaluation parameters and the results obtained. We analyze the performance of the prediction model over two different ranges of session length and perform experiments on two sets. Set A consists of sessions whose length lies between 3 and 7, while Set B consists of sessions whose length lies between 2 and 10. Set A has generally been used in past studies [4]; we compare this range with Set B. For both sets, the sessions are divided into training and testing in the ratio 0.7 to 0.3, as sketched below.
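A small sketch of this setup (hypothetical helper; the shuffle and seed are our assumptions, only the length ranges and the 0.7/0.3 split come from the text):

```python
import random
from typing import List, Tuple

Sessions = List[List[str]]

def prepare_set(sessions: Sessions, min_len: int, max_len: int,
                train_ratio: float = 0.7, seed: int = 42) -> Tuple[Sessions, Sessions]:
    """Keep sessions whose length lies in [min_len, max_len], then split into train/test."""
    kept = [s for s in sessions if min_len <= len(s) <= max_len]
    random.Random(seed).shuffle(kept)
    cut = int(train_ratio * len(kept))
    return kept[:cut], kept[cut:]

# Set A: session lengths 3 to 7; Set B: session lengths 2 to 10.
# train_a, test_a = prepare_set(all_sessions, 3, 7)
# train_b, test_b = prepare_set(all_sessions, 2, 10)
```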

4.1 Dataset Description

We have conducted experiments on three datasets: MSWEB, BMS and Wikispeedia. The detailed characteristics of each dataset are presented in Table 1.

Table 1. Dataset summary
  • Dataset 1: MSWEB

This dataset was collected from Microsoft web logs. The data consists of 38,000 sessions from random users in February 1998. Each row represents the sequence of areas of the website that a user visited in a period of one week.

  • Dataset 2: BMS

This dataset was collected from e-commerce web server logs (Gazelle.com) and was used as part of the KDD Cup 2000 competition. It contains 59,601 web sessions and 497 distinct items. The average session length is 2.42 items.

  • Dataset 3: Wikispeedia

Wikispeedia is a popular online web game in which each player is given the task of finding the shortest path from a source web page to a destination web page. The player navigates from source to destination using hyperlinks, has no knowledge of the global network structure, and therefore relies on the local information provided on the web pages. The players' navigations were collected in a web log file consisting of 4,606 articles and 3,326 distinct articles. It comprises 51K navigation paths collected over 2009.

The details of the training and testing sessions are summarized in Table 2. After the 0.7 (training) and 0.3 (testing) split, the sessions are further divided into N-grams using the sliding window concept. It can clearly be observed that Set B has more training and testing sessions than Set A, because Set B is a superset of Set A.

Table 2. Training and testing dataset

4.2 Pre-investigation Measures

Pre-investigation measures are metrics used to assess the effectiveness of the input data. Measuring the quality of the data before developing a model is very important: good quality input data injected into the model will produce better results. This section presents two pre-investigation measures: page loss and branching factor.

  (a) Page Loss

Page loss determines the percentage of pages missing from the training model. It is defined as the ratio of the number of web pages missing from the dataset to the total number of web pages of the website. Page loss yields unseen pages, for which the model cannot generate predictions, and thus impacts the model negatively. For example, if a web page P9 occurs in the test dataset but was not available in the training model, the training model will fail to generate predictions for P9. This measure is important for understanding the model's limitations before the model development phase. Table 3 depicts the page loss of the Set A and Set B training models. While investigating the datasets, we found that some web pages were lost while dividing the dataset into training and testing. It has been observed that page loss is lower in Set B than in Set A: since Set B has a longer session range, it produces more session subsets with a larger combination of pages. Addressing page loss is important because it gives rise to cold-start pages and cold-start sessions. A sketch of this computation follows Table 3.

Table 3. Training page loss
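A minimal sketch of the page loss computation, following the definition above (our illustration):

```python
from typing import Iterable, List

def page_loss(train_sessions: List[List[str]], all_pages: Iterable[str]) -> float:
    """Ratio of website pages that never appear in the training sessions."""
    all_pages = set(all_pages)
    seen = {page for session in train_sessions for page in session}
    return len(all_pages - seen) / len(all_pages)

# Example: if P9 never occurs in training, it contributes to page loss
# and the model cannot generate predictions for it.
```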
  (b) Branching Factor

The branching factor measures the network characteristics of the model. It is defined as the average number of outlinks present in the model corresponding to each state. The branching factor determines the model's prediction capability and gives network structure insights that help to understand how many predictions a model may generate from its current state. This pre-investigation measure is important for computing the average outlink percentage of the network states. Table 4 presents the branching factor of Set A and Set B over varied N-grams. The branching factor of Set B is higher on all the datasets, because Set B injects more sessions into the training model and therefore has more outlinks corresponding to each state. A sketch of this computation follows Table 4.

Table 4. Branching factor
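Following this definition, the branching factor of a trained model can be computed as the average number of distinct outlinks per state (a sketch, reusing the count structure assumed in Sect. 2):

```python
from collections import Counter
from typing import Dict, Tuple

def branching_factor(model: Dict[Tuple[str, ...], Counter]) -> float:
    """model maps a state (tuple of pages) to a Counter of next pages.
    Branching factor = average number of distinct outlinks per state."""
    if not model:
        return 0.0
    return sum(len(outlinks) for outlinks in model.values()) / len(model)
```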

4.3 Evaluation Parameters

In this section, we define the prediction parameters used to evaluate model performance [6, 7]. The definitions are given below:

Definition 1: Prediction Accuracy

Prediction accuracy is defined as the ratio of correct predictions to the total number of test cases.

$$ Prediction\;Accuracy = \frac{Correctly\;predicted\;test\;cases}{Total\;test\;cases} $$

Definition 2: Model Accuracy

Model accuracy is defined as the ratio of correct predictions to the total predictions.

$$ Model\;Accuracy = \frac{Correctly\;predicted\;test\;cases}{Total\;test\;cases\;matched\;with\;the\;training\;model} $$

Definition 3: Coverage

Coverage is defined as the ratio of total number of predictions to the number of total test cases.

$$ Coverage = \frac{Total\;Predictions}{Total\;test\;cases} $$
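A sketch tying Definitions 1–3 together (our illustration; `predict` is the hypothetical fallback predictor sketched in Sect. 2):

```python
from typing import List, Tuple

def evaluate(models, test_cases: List[Tuple[Tuple[str, ...], str]], k_max: int):
    """test_cases: (context pages, true next page) pairs built from the test sessions."""
    total = len(test_cases)
    predicted = correct = 0
    for context, true_next in test_cases:
        guess = predict(models, list(context), k_max)
        if guess is not None:          # test state matched the training model
            predicted += 1
            if guess == true_next:
                correct += 1
    prediction_accuracy = correct / total if total else 0.0       # Definition 1
    model_accuracy = correct / predicted if predicted else 0.0    # Definition 2
    coverage = predicted / total if total else 0.0                # Definition 3
    return prediction_accuracy, model_accuracy, coverage
```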

4.4 Experimental Results

  (1) Coverage

Coverage is the evaluative measure that defines the percentage of outlinks (prediction paths) covered by the test states. The value of coverage depends on the network structure. Table 5 presents the coverage of Set A and Set B over varied N-grams. The coverage of Set B is higher on all the datasets: since the branching factor of the Set B training models is higher, the Set B model covers more outlinks during prediction than the Set A model.

Table 5. Coverage
  (2) Prediction Accuracy

Table 6 presents the effect of varying the session length on the prediction accuracy of the model. It can clearly be seen that the prediction accuracy decreases as the N-gram order increases, because the number of training examples decreases as N increases (see Table 2). We have observed that the prediction accuracy of Set B is higher than that of Set A on all datasets, because Set B has lower page loss while having higher coverage for each test session. Due to the greater availability of sessions, Set B has a better chance of making correct predictions than Set A.

Table 6. Prediction accuracy of Set A and B
  (3) Model Accuracy

The difference between model accuracy and prediction accuracy is that, during the evaluation phase, model accuracy excludes unseen test sessions from the total test set. Unseen sessions are those that are not known to the training model.

Model accuracy with respect to varied session length is presented in Table 7. It shows the model's prediction ability with respect to the test sessions that are available in the training model. It has been observed that Set B has better correct-prediction ability than Set A on all datasets: since Set B has more outlinks per state than Set A, it generates more predictions and has a better chance of making correct predictions.

Table 7. Model accuracy of Set A and B

4.5 Discussion

From the experimental results, we infer that early investigation of the input yields better predictions. Before making predictions, the optimal split of the training and testing dataset and the optimal session length should be considered. To investigate the performance of the prediction model, two investigation parameters have been used. Page loss indicates the amount of pages lost in the training and testing datasets; it is important to consider because it provides insight into cold-start web pages and cold-start (unseen) sessions. The presence of unseen sessions makes the model difficult to learn and causes prediction failures. The second investigation parameter is the branching factor. This measure is important as it provides insight into the number of predictions possible from a training state. Set B has lower page loss and a higher branching factor than Set A, which indicates that Set B is preferable. Our experimental results reveal that the model trained with Set B attains better coverage, prediction accuracy and model accuracy. The experimental results thus confirm the inference drawn from the pre-investigation measures.

5 Conclusion and Future Work

In this paper, we conduct experiments to evaluate the performance of the prediction model over varied session lengths. For this, we select two sets of session lengths: in Set A, sessions with length 3 to 7 are selected, and in Set B, sessions with length 2 to 10 are selected. We evaluate the effectiveness of the input sessions injected into the model using two pre-investigation measures: page loss and branching factor. The set that has lower page loss and a higher branching factor should be preferred for prediction.

In addition, we evaluate the performance of the model using evaluative measures over varied N-grams: coverage, prediction accuracy and model accuracy. More crucially, it has been observed that Set B has higher coverage and higher accuracy than Set A. The session length thus does impact the coverage and accuracy of the prediction model, and a session length ranging from 2 to 10 is found to be best for the development of the prediction model. The model accuracy of Set B showed an improvement of 0.27 to 8.73% on MSWEB, 0.62 to 2.8% on BMS and 10.81 to 14.23% on the Wikispeedia dataset.

In the near future, we plan to focus on domain-centric session evaluation, as user browsing behaviour varies across domains. Moreover, other pre-investigation measures that are required to develop high quality sessions can be explored.