1 Introduction

Web Usage Mining is the process of extracting useful information from a Web Log Repository by applying Data Mining techniques. The extracted patterns represent user browsing behavior. Accurate analysis of these patterns leads to a better understanding of the users visiting the web site and thereby to improved user satisfaction, which is key to business success. Thus, Web Based Applications can improve their business by applying Web Usage Mining to the Web Log Repository. Targeted Marketing, Location Based Marketing, Web Personalization, Fraud Detection and Improved Web Administration are some of the application areas of Web Usage Mining.

Web Usage Mining consists of three main steps: Web Log Preprocessing, Knowledge Discovery and Pattern Analysis. Among these, Web Log Preprocessing is the most complex and is critical for the efficient extraction of useful patterns. Web Log Cleaning is especially demanding, because it must eliminate noisy and irrelevant data to make the Log Data suitable for Knowledge Discovery. The Web Log is also memory intensive, so pruning irrelevant data reduces the input load of the Knowledge Discovery phase. This paper presents dimensionality reduction techniques to eliminate the noisy data and combined methodologies to efficiently identify users and user sessions from the Web Log Repository.

The paper is organized as follows: Sect. 2 presents a brief literature review, Sect. 3 presents the structure of the Web Log, Sect. 4 presents the Web Log Preprocessor, Sect. 5 presents the Experimental Setup and Result analysis, and Sect. 6 presents the conclusion.

2 Literature Review

The method of extracting useful information from server log files and the different application areas of Web Usage Mining are presented in [1]. A framework for Web Usage Mining consisting of Preprocessing, Pattern Discovery and User Classification is proposed in [2]. This framework classifies the users based on country, site entry and access time. Information extraction from user navigation history using Web Usage Mining is explored and discussed in [3, 4]. A detailed survey of the data collection and pre-processing stages of Web Usage Mining is presented in [5]. Several data preparation techniques for access streams, used to identify unique sessions and unique users, are presented in [6]. Educational data mining techniques to analyze learners’ behavior, to help in learning evaluation and to enhance the structure of a given course are implemented in [7]. A new algorithm for preprocessing and clustering of web logs is proposed in [8]. A specific methodology to extract useful information from an e-commerce website is proposed in [9]. A critical analysis and comparison of the common web robot detection approaches is presented in [10].

3 The Web Log Repository

The Web Log Repository is a pool of user activities on a web site. When activated by the web site administrator, it automatically records the user's navigation activities from the moment the user enters the web site until the moment the user leaves it. In Extended Common Log Format (ECLF), a web log usually contains entries for Host IP Address, User Authentication, Date and Time of visit, HTTP Request, Request Status, Page Size, Referrer Field and User Agent Field. Details of each field are given below:

  • Host IP address: Used to identify the user visiting the web site.

  • User Authentication: Contains the authenticated username of the user visiting the web site; usually empty because most web sites allow open access.

  • Date and time of the visit: Tells when the user visited the web site.

  • HTTP Request: Represents collective information, namely the Request Method (GET, POST, HEAD, etc.), the Requested Resource (an HTML page, an image, a CGI program, a script, etc.) and the Protocol Version (the HTTP protocol being used along with its version number).

  • Request Status: Status of the request (200 series: successful transmission; 400 series: client error; etc.).

  • Page size: Size of the downloaded document in bytes.

  • Referring Agent (RA): Gives the details of the web site from which the user has traversed to this web site. If the user has entered this web site directly by typing its URL, this field will be “-”.

  • User Agent (UA): Gives the details of the browser and operating system of the client.

The web access log was collected from the web server of the Dr. T.M.A. Pai Polytechnic, Manipal web site [11]. The web site hosts information about the courses offered, admission details, available facilities, placement details, etc. A sample web log record is given in Fig. 1.

Fig. 1 Extract of experimental data

The above log entry indicates that the user with IP Address ***.***.***.*** (masked here) requested the page automobile-engineering under courses on 9th Jan 2015 at 10:04:32 AM, arriving from the link http://tmapaipolytechnic.com. The request was successful and a total of 2116 bytes were downloaded. The entry also indicates that the browser was Mozilla (compatible) 5.0 and the operating system was Windows NT 6.3.
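To make the field layout concrete, the following is a minimal Python sketch of how one ECLF record could be split into the fields listed above. It is an illustration only, not the paper's MATLAB implementation. The names LOG_PATTERN and parse_log_line are hypothetical, and the sample line is reconstructed from the description of Fig. 1: the IP address (a documentation-only address), the time zone, the protocol version and the exact user agent string are assumptions.

```python
import re
from datetime import datetime

# Extended common log format: host, identity, user, [time], "request",
# status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Split the collective HTTP Request field into method, resource, protocol.
    parts = record["request"].split()
    if len(parts) == 3:
        record["method"], record["resource"], record["protocol"] = parts
    else:
        record["method"] = record["resource"] = record["protocol"] = None
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

# Hypothetical sample line; masked or unknown details are assumed values.
SAMPLE = ('192.0.2.1 - - [09/Jan/2015:10:04:32 +0530] '
          '"GET /courses/automobile-engineering HTTP/1.1" 200 2116 '
          '"http://tmapaipolytechnic.com" '
          '"Mozilla/5.0 (compatible; Windows NT 6.3)"')

record = parse_log_line(SAMPLE)
print(record["method"], record["resource"], record["status"], record["size"])
```

Splitting the HTTP Request field into method, resource and protocol at parse time means later checks can work on single-valued fields.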

4 Web Log Preprocessor

Web Log Preprocessing plays an important role in Web Usage Mining. The data collected in the Web Log Repository is not directly suitable for Data Mining algorithms: the Log Data needs to be cleansed and converted into a structured format before being processed by the Knowledge Discovery Phase. The Web Log Preprocessor takes the Web Log Repository as input and identifies the Users and User Sessions. The Pre-processing phase begins with Feature Extraction and Time Stamp Creation, followed by Data Cleaning employing Dimensionality Reduction Techniques. The original Web Log Repository mixes relevant and irrelevant information, which leads to a huge log size, and direct processing of this raw data puts an unnecessary burden on the Knowledge Discovery Phase. Hence, efficient cleaning of the Web Log Repository is necessary to extract useful patterns from the Web Log. Once the log is cleansed, the Users visiting the web site are identified. User activities on the web site are then grouped into meaningful sessions before being processed by the Knowledge Discovery Phase. Thus, the Web Log Preprocessor comprises four steps: Feature Extraction and Time Stamp Computation, Data Cleaning, User Identification, and User Session Identification.

4.1 Feature Extraction and Timestamp Computation

In the Feature Extraction step, individual features are extracted from the fields that represent collective information so that the preprocessing algorithms can be applied to them. In addition, a time stamp is computed from the date and time entries of the web log in order to estimate the duration of the user’s visit to the web site and to maintain the sequence of web requests across days. The steps for creating the time stamp are as follows:

  (i) Compute the number of days between the web log entry date and a reference date.

  (ii) Multiply this number of days by 86,400.

  (iii) Find the time in seconds since midnight that is represented by the time in the web log entry.

  (iv) Add (ii) and (iii).
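The four steps above can be sketched in Python as follows. This is an illustrative sketch rather than the paper's MATLAB code. The reference date is an assumption (the first day of the collected log is used here, since the text does not state which date was chosen), and the names REFERENCE_DATE and compute_timestamp are hypothetical.

```python
from datetime import date, datetime

REFERENCE_DATE = date(2014, 12, 31)  # assumed reference date, not from the paper
SECONDS_PER_DAY = 86_400

def compute_timestamp(entry, reference=REFERENCE_DATE):
    """Seconds elapsed between midnight of the reference date and the entry."""
    days = (entry.date() - reference).days            # step (i)
    day_seconds = days * SECONDS_PER_DAY              # step (ii)
    since_midnight = (entry.hour * 3600
                      + entry.minute * 60
                      + entry.second)                 # step (iii)
    return day_seconds + since_midnight               # step (iv)

# The sample request of 9th Jan 2015 at 10:04:32:
# 9 days * 86,400 + 36,272 seconds since midnight.
ts = compute_timestamp(datetime(2015, 1, 9, 10, 4, 32))
print(ts)  # 813872
```

Because the result is a single integer that keeps growing across midnight, the difference between two time stamps gives the elapsed seconds even when the requests fall on different days.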

4.2 Web Log Cleaning

In Data Cleaning, all irrelevant entries are eliminated from the log to minimize the processing burden. A web log usually contains all the requests made to the web server, which includes both actual user requests and automated requests. Automated requests come from programs such as web bots, spiders and crawlers. Similarly, when a user requests a page from the web server, any images embedded in the requested page are downloaded along with it, and a log record is created for each such image download. As Web Usage Mining intends to model user browsing patterns, all such requests need to be eliminated. Unsuccessful requests are eliminated as well, and only requests that get a resource from the web server are retained. Thus, Web Log cleaning consists of the following sub-steps:

Robot Request Eliminator: Robot requests can be identified using two methods:

  1. Robot identification based on the Requested Page field

  2. Robot identification based on the User Agent field.

For efficient identification of web robots, the two methods were combined. The algorithm IsRobotRequest, which takes each log record as input and returns TRUE/FALSE, is given in Fig. 2.

Fig. 2 IsRobotRequest algorithm
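As a rough illustration of how the two methods might be combined, consider the Python sketch below. It is not the listing of Fig. 2. It assumes that the Requested Page check looks for requests to robots.txt and that the User Agent check looks for common robot keywords; the keyword list, the record keys "resource" and "agent", and the function name are all assumptions made for this example.

```python
# Assumed keyword list; a real deployment would use a maintained robot list.
ROBOT_AGENT_KEYWORDS = ("bot", "crawler", "spider", "slurp")

def is_robot_request(record):
    """Return True if the record looks like a request from a web robot."""
    resource = (record.get("resource") or "").lower()
    agent = (record.get("agent") or "").lower()

    # Method 1: Requested Page field (assumed here to mean robots.txt).
    if resource.split("?")[0].endswith("/robots.txt"):
        return True

    # Method 2: User Agent field contains a known robot keyword.
    return any(keyword in agent for keyword in ROBOT_AGENT_KEYWORDS)

print(is_robot_request({"resource": "/robots.txt", "agent": "Mozilla/5.0"}))
print(is_robot_request({"resource": "/courses", "agent": "ExampleBot/1.0"}))
print(is_robot_request({"resource": "/courses", "agent": "Mozilla/5.0"}))
```

This per-record version flags only the robots.txt request itself; a stricter variant could also flag every other request from the same host and user agent.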

Image Request Filter: Files with extensions such as GIF, JPEG and CSS are downloaded along with the requested page. They are not pages the user is actually interested in; they are merely documents embedded in the web page. They therefore need not be considered when identifying the pages of interest to the user, and the cleaning process eliminates these entries from the web log by scanning the Uniform Resource Identifier (URI) field of every record. This step drastically reduces the size of the web log. The algorithm for filtering out the Image Requests is given in Fig. 3. The algorithm checks each web log record and returns TRUE/FALSE.

Fig. 3 IsImageRequest algorithm
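The extension check on the URI field could look like the following Python sketch. It is only an illustration and not the listing of Fig. 3. The extensions GIF, JPEG and CSS come from the text; the remaining entries in the set, the record key "resource" and the function name are assumptions.

```python
import posixpath
from urllib.parse import urlsplit

# .gif, .jpeg and .css are named in the text; the others are assumed additions.
EMBEDDED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".ico", ".css"}

def is_image_request(record):
    """Return True if the requested URI points to an embedded file."""
    # Drop any query string so "/logo.gif?v=2" is still recognised.
    path = urlsplit(record.get("resource") or "").path
    extension = posixpath.splitext(path)[1].lower()
    return extension in EMBEDDED_EXTENSIONS

print(is_image_request({"resource": "/images/logo.GIF?v=2"}))
print(is_image_request({"resource": "/courses/automobile-engineering"}))
```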

Unsuccessful Request Remover: Successful web requests represent the user's actual requests to the web server, from which the user profile can be modeled. Hence, log records with status codes other than 200 (successful request) are removed. This cleaning step further reduces the time needed to determine the patterns of interest to the user. The algorithm for removal of unsuccessful HTTP requests is given in Fig. 4. The algorithm checks each record of the web log and returns TRUE/FALSE.

Fig. 4 IsSuccessfulRequest algorithm

Non-GET Request Remover: A GET method in the HTTP Request field indicates that the user has requested a resource from the web server. Hence, log records with the value GET in the Method field of the HTTP Request are retained, while all other records are eliminated. This step again reduces the volume of data to be processed further. The algorithm for elimination of non-GET methods from the web log is given in Fig. 5. The algorithm takes each record and returns TRUE/FALSE.

Fig. 5 IsGetMethod algorithm

The modified Data Cleaning algorithm, which uses the above algorithms, is given below. The algorithm scans each log record and either retains or discards the record by calling the above algorithms.
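A minimal Python sketch of such a cleaning loop is shown here, under stated assumptions. It implements the status and method checks as described in the text (status code 200, method GET) and accepts the robot and image checks as parameters. The order of the checks, the record keys and the toy records are assumptions for illustration, and the two lambda functions are simplified stand-ins rather than the algorithms of Figs. 2 and 3.

```python
def is_successful_request(record):
    """True only for status code 200, as stated in the text."""
    return record.get("status") == 200

def is_get_method(record):
    """True only when the request method is GET."""
    return (record.get("method") or "").upper() == "GET"

def clean_log(records, is_robot_request, is_image_request):
    """Retain records that pass all four checks (check order is assumed)."""
    cleaned = []
    for record in records:
        if is_robot_request(record):
            continue
        if is_image_request(record):
            continue
        if not is_successful_request(record):
            continue
        if not is_get_method(record):
            continue
        cleaned.append(record)
    return cleaned

# Toy records, invented for this example.
records = [
    {"resource": "/courses", "agent": "Mozilla/5.0", "status": 200, "method": "GET"},
    {"resource": "/robots.txt", "agent": "ExampleBot/1.0", "status": 200, "method": "GET"},
    {"resource": "/img/logo.gif", "agent": "Mozilla/5.0", "status": 200, "method": "GET"},
    {"resource": "/missing", "agent": "Mozilla/5.0", "status": 404, "method": "GET"},
    {"resource": "/form", "agent": "Mozilla/5.0", "status": 200, "method": "POST"},
]

cleaned = clean_log(
    records,
    # Simplified stand-ins for the robot and image checks.
    is_robot_request=lambda r: "bot" in r["agent"].lower(),
    is_image_request=lambda r: r["resource"].lower().endswith((".gif", ".jpg", ".css")),
)
print([r["resource"] for r in cleaned])  # ['/courses']
```

Running the cheaper, higher-volume filters first means later checks see fewer records, although the set of retained records does not depend on the order.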

4.3 Modified User Identification

Identification of each distinct user visiting the web site is an important and complex task in Web Usage Mining. Apart from the user-id field, the IP address, UA and RA fields can be employed for user identification. In this paper, User Identification based on a combination of all three fields has been implemented to uniquely identify the users. In the UA field, both the browser and the operating system are considered when distinguishing between two users. The Modified User Identification algorithm is given below:
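One possible way to combine the three fields is sketched below in Python. This is an assumption about how such a combination might work, not the paper's algorithm: a request is attributed to an existing user when the IP address and the full UA string (browser and operating system) match and the referrer is either “-” or a page that user has already visited; otherwise a new user is created. The record keys and the assumption that referrers are normalized to the same form as requested resources are hypothetical.

```python
def identify_users(records):
    """Group cleaned log records into users using IP, UA and RA fields."""
    users = []  # each user: {"ip", "agent", "pages", "records"}
    for record in records:
        match = None
        for user in users:
            if user["ip"] != record["host"] or user["agent"] != record["agent"]:
                continue
            # Same IP and agent: same user if the request is a direct entry
            # or comes from a page this user has already visited.
            if record["referrer"] == "-" or record["referrer"] in user["pages"]:
                match = user
                break
        if match is None:
            match = {"ip": record["host"], "agent": record["agent"],
                     "pages": set(), "records": []}
            users.append(match)
        match["pages"].add(record["resource"])
        match["records"].append(record)
    return users

# Toy records, invented for this example (192.0.2.1 is a documentation address).
log = [
    {"host": "192.0.2.1", "agent": "BrowserA/OS1", "referrer": "-", "resource": "/"},
    {"host": "192.0.2.1", "agent": "BrowserB/OS2", "referrer": "-", "resource": "/"},
    {"host": "192.0.2.1", "agent": "BrowserA/OS1", "referrer": "/", "resource": "/courses"},
    {"host": "192.0.2.1", "agent": "BrowserA/OS1", "referrer": "/admissions", "resource": "/fees"},
]
users = identify_users(log)
print(len(users))  # 3
```

In this toy log the second record differs in user agent and the fourth arrives from a page the first user never visited, so each starts a new user despite sharing the IP address.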

4.4 Modified User Session Identification

User Session Identification is the process of segmenting the access log of each user into individual access sessions. For Session Identification, heuristics based on Time and Navigation are employed. Time-based methods alone are not reliable because users may engage in other activities after opening a web page. Hence, in this paper, a combined technique based on both heuristics is employed for Session Identification. This method uses the web topology and the page stay time. The Session Identification algorithm is given below.
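The combination of the two heuristics can be illustrated with the Python sketch below, which is a simplified assumption and not the paper's algorithm. A new session starts when the gap since the previous request exceeds a page stay limit (time heuristic) or when the referrer is not a page already seen in the current session (navigation heuristic). The 600-second limit, the record keys and the function name are assumptions; the text does not state the threshold used.

```python
PAGE_STAY_LIMIT = 600  # seconds; assumed threshold, not taken from the paper

def identify_sessions(user_records, page_stay_limit=PAGE_STAY_LIMIT):
    """Split one user's records into sessions by time gap and navigation."""
    sessions = []
    current = []
    for record in sorted(user_records, key=lambda r: r["timestamp"]):
        if current:
            gap = record["timestamp"] - current[-1]["timestamp"]
            visited = {r["resource"] for r in current}
            reachable = record["referrer"] in visited
            # Close the session on a long pause or a jump from an unseen page.
            if gap > page_stay_limit or not reachable:
                sessions.append(current)
                current = []
        current.append(record)
    if current:
        sessions.append(current)
    return sessions

# Toy records for one user, invented for this example.
visits = [
    {"timestamp": 0, "referrer": "-", "resource": "/"},
    {"timestamp": 60, "referrer": "/", "resource": "/courses"},
    {"timestamp": 2000, "referrer": "/courses", "resource": "/contact"},
    {"timestamp": 2030, "referrer": "/admissions", "resource": "/fees"},
]
sessions = identify_sessions(visits)
print([len(s) for s in sessions])  # [2, 1, 1]
```

Here the third request opens a new session because of the long pause, and the fourth because its referrer was not visited in the current session.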

5 Experimental Setup and Results

The web access log was collected from the web server of the Dr. T.M.A. Pai Polytechnic web site from 31st Dec 2014, 12:09:56, through 15th Jan 2015, 11:18:07, a total of 15 days. A total of 5817 requests were recorded during this period. The algorithms were implemented in MATLAB.

5.1 Web Log Cleaning

The Web Log Cleaning Algorithm was applied to the log data after feature extraction and time stamp computation. The algorithm eliminated a total of 4648 records containing multimedia objects, robot requests and failed requests, leaving 1169 clean log records ready for further processing. This means that the log was reduced to about 20 % of its original size (1169 of 5817 records). Tables 1 and 2 show the statistics for the individual request categories and the aggregate results of the Data Cleaning step. Figure 6 depicts the distribution of irrelevant data in the Web Log. It is observed that a major portion of the Web Log consists of irrelevant and redundant data, which has to be eliminated to speed up the subsequent mining process.

Table 1 Statistics of individual request category
Table 2 Aggregate results of data cleaning
Fig. 6 Distribution of irrelevant data in web log

5.2 User Identification and User Session Identification

The User Identification Algorithm uniquely identifies the users of the web site. A total of 235 users were identified in the given log. Session identification splits all the pages accessed by each user into individual access sessions using the combined technique based on time-oriented and navigation-oriented heuristics. It was observed that each user had one session, with a maximum of 26 pages in a session.

5.3 Analysis of Web Log Pre-processing Results

Even a simple analysis of the Web Log Repository after Web Log Preprocessing can be useful for the web site administrator. The charts of requests across days and downloads across days are given in Figs. 7 and 8.

Fig. 7 Days versus no. of requests

Fig. 8 Days versus no. of bytes downloaded

6 Conclusion

Web Log Preprocessing is one of the complex tasks of Web Usage Mining. Modified Web Log Preprocessing eliminates the noisy data and drastically reduces the input log, thereby lessening the burden on subsequent tasks. In this paper, Web Log Preprocessing algorithms based on Dimensionality Reduction Techniques and Combined Methodologies have been implemented on a Web Log Repository from a real web server. The Web Log Preprocessing Algorithm identified around 16 % of the requests in the log as robot requests. Preprocessing reduced the input web log size by 80 %. The results show that Web Log Preprocessing based on dimensionality reduction techniques and a combination of methods improves the performance of the Web Log Preprocessor.