
1 Introduction

Web Usage Mining is the application of data mining techniques to usage data, termed weblog data, for different purposes, e.g., system enhancement and adaptation, personalization, recommendation and advertisement [1, 2]. Depending on the analysis purpose and context, the weblog data are complemented with other data [3] and are mined using different mining techniques, e.g., statistics, classification, clustering, association/sequential rules and dependency modelling. Beyond the conventional steps of a mining process, the Web Usage Mining preprocessing step tackles three critical tasks, i.e., data cleaning, data structuration into single/unique users and visits, and episode identification [1, 4,5,6]. The cleaning task conditions the relevance of the whole mining process [7,8,9]. It is meant to process the raw weblog data (R.WLDB) and provide a noise-free weblog database of clicks (C.WLDB). An R.WLDB refers to data recorded by webservers within Access Logfiles (ALF). It reflects usage activity on website servers, i.e., end-user clicks and their underlying user-agent hits [1, 8]. Notice that end-user clicks and user-agent hits relate to webpage URIs (Uniform Resource Identifiers) and their underlying displayed view components, respectively. Since Web Usage Mining is interested in end-user behavior, user-agent hits are referred to as noise to be identified and removed before mining. Separating hits from clicks is not trivial for two reasons: servers record requests interlaced in sequential order regardless of their source or type, and website resources may be set up as requestable interchangeably by end-users and user-agents [1, 8, 10].

On the basis of Web Usage Mining review papers from WEBKDD ’99 up to 2017 [1,2,3,4,5,6, 11,12,13], the cleaning task can be mapped into three layers. The first is meant to identify clicks from hits. The second consists of selecting, from the identified clicks, those meaningful for the analysis purpose and context, e.g., successful or unsuccessful requests, known single-IP proxies and robots [3, 6, 7]. The third is performed downstream of data structuration and is rather a behavioral cleaning aimed at robot and outlier detection [3, 8, 14]. Thus, the core cleaning layer that tackles generic noise is the identification of clicks from hits. In this regard, the most reported and implemented methods are the conventional and advanced cleaning. They are content-centric filtering heuristics based on the requested resource attribute of the R.WLDB. These cleaning methods are limited in terms of relevancy, workability and cost constraints within the context of dynamic and responsive web content. In order to deal with dynamic and responsive web constraints, a rule-based cleaning method, focused on the logging structure rules, is introduced in the current paper. As the logging structure is insensitive to the logging content, the experimentation of the rule-based cleaning method demonstrates significant advantages compared to the conventional and advanced cleaning.

This paper is structured as follows: Sect. 2 analyzes the cleaning methods and their limitations. Section 3 introduces the rule-based cleaning concept and method. Section 4 presents the experimental results analysis and evaluation. Section 5 concludes.

2 Cleaning Methods Analysis

2.1 Data and Formalism

To better describe the cleaning methods, the weblog data content, formats, attributes, and the underlying formalism are presented in the following. The most used ALF formats are the Common Log Format (CLF) and the Extended Common Log Format (ECLF), which contain information entries on usage activity [8]. The ALF entries described in Table 1 constitute the R.WLDB content. Consider an R.WLDB of n requests (REQ), each instantiating these entries as recorded in an ALF, e.g.: \( REQ_{n} \) = {127.0.0.1, −, frank, 10/Oct/2000:13:55:36 −0700, GET /apache.gif HTTP/1.0, 200, 2326, http://apache.org/example.html, Mozilla/4.08 (Win*; …; Brow*)}. Notice the following useful attributes and their abbreviations for the upcoming analysis: [GET] the request method (REQ.MET), [HTTP/1.0] the request protocol (REQ.PRO), [apache.gif] the requested resource (REQ.RES), [apache] the requested resource name (REQ.RES.NAM), [gif] the requested resource type (REQ.RES.TYP), [apache.org] the referring resource root (REF.RES.ROO), [example.html] the referring resource (REF.RES), [example] the referring resource name (REF.RES.NAM), [html] the referring resource type (REF.RES.TYP), and [Mozilla/4.08 (Win*; …; Brow*)] the user-agent (USE.AGE). Each attribute occurrence is indexed \( \mathrm{req}_{n}(\mathrm{att}_{m}) \). E.g., req.res = apache.gif in \( REQ_{n} \) above is indexed \( \mathrm{req}_{n}(\mathrm{req.res}) \).

Table 1. Logfiles entries.
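To make the formalism concrete, the following is a minimal base-R sketch (the paper's actual implementation runs under R within the Apache Spark API) that parses one ECLF entry into the attributes above. The quoted request, referrer and user-agent fields, the regular expression, and the column names are assumptions of the example, not part of the paper's algorithms.

```r
# Minimal sketch: parse one ECLF entry into the attributes used in this paper.
# Assumes the standard combined-log layout with quoted request, referrer and
# user-agent fields; the column names follow the abbreviations above.
parse_eclf <- function(line) {
  pattern <- paste0(
    '^([^ ]+) ([^ ]+) ([^ ]+) \\[([^]]+)\\] ',       # host, identd, user, timestamp
    '"([^ ]+) ([^ ]+) ([^"]+)" ([0-9]{3}) ([^ ]+) ', # method, resource, protocol, status, bytes
    '"([^"]*)" "([^"]*)"$'                           # referrer, user-agent
  )
  m <- regmatches(line, regexec(pattern, line))[[1]]
  if (length(m) == 0) return(NULL)                   # the line does not match the ECLF layout
  data.frame(req_met = m[6], req_res = m[7], req_pro = m[8],
             ref_res = m[11], use_age = m[12], stringsAsFactors = FALSE)
}

parse_eclf(paste0('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] ',
                  '"GET /apache.gif HTTP/1.0" 200 2326 ',
                  '"http://apache.org/example.html" "Mozilla/4.08"'))
```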

2.2 Related Methods

Conventional Cleaning.

The conventional cleaning is based on the assumption that end-user clicks relate to html or equivalent web resource types, as recommended by the World Wide Web Consortium (W3C) [1, 12]. All requests that relate to non-html resource types are assumed to be user-agent requests and referred to as hits/noise. Thus, the cleaning attribute is the REQ.RES.TYP. In practice, the undesired REQ.RES.TYP are set within a filtering knowledge database (FLT.KDB) that serves the cleaning heuristic; the matching requests are removed (FILTER-OUT) from the R.WLDB to come up with the cleaned weblog database (C.WLDB). The related cleaning process is given in Algorithm 1.

Algorithm 1. Conventional cleaning.
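As an illustration only, a minimal base-R sketch of the FILTER-OUT heuristic described above; the data frame `r_wldb`, its `req_res_typ` column, and the sample FLT.KDB entries are assumptions of the sketch, not the paper's Algorithm 1 itself.

```r
# Sketch of the conventional cleaning: FILTER-OUT every request whose requested
# resource type belongs to the FLT.KDB; everything else is kept as clicks.
conventional_cleaning <- function(r_wldb,
                                  flt_kdb = c("gif", "jpg", "png", "css", "js", "ico")) {
  keep <- !(tolower(r_wldb$req_res_typ) %in% flt_kdb)
  r_wldb[keep, , drop = FALSE]   # the resulting C.WLDB
}
```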

Advanced Cleaning.

The advanced cleaning is based on prior knowledge of the website URIs, or on a real-time extraction of those embedding clickable resources [10, 12]. The purpose is to set up a validation knowledge database (VLD.KDB) that contains all valid requestable resources embedded in the website URIs and intended for end-user clicks. All requests that contain resources belonging to the VLD.KDB are referred to as clicks and are filtered into (FILTER-IN) the C.WLDB. Thus, the cleaning attribute is the REQ.RES. The related cleaning process is given in Algorithm 2.

Algorithm 2. Advanced cleaning.
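Again as an illustration, a minimal base-R sketch of the FILTER-IN heuristic; the `vld_kdb` vector of clickable resources and the `req_res` column are assumptions of the sketch, not the paper's Algorithm 2 itself.

```r
# Sketch of the advanced cleaning: FILTER-IN only the requests whose requested
# resource belongs to the VLD.KDB extracted from the website URIs.
advanced_cleaning <- function(r_wldb, vld_kdb) {
  keep <- r_wldb$req_res %in% vld_kdb
  r_wldb[keep, , drop = FALSE]   # the resulting C.WLDB
}
```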

2.3 Limitation Analysis

In order to achieve high quality outputs, the conventional and advanced cleaning require exhaustive prior knowledge, extra-weblog data, and extra cost factors in addition to the cleaning process cost. The conventional cleaning needs exhaustive prior knowledge and extra-weblog data about which resource types are set up as requestable by end-users and which are intended for user-agents and should be referred to as noise. Since there are no mandatory standards in this regard, the construction and updating of this prior FLT.KDB represent an additional cost factor besides the cleaning process cost. In addition, the conventional cleaning becomes obsolete in the case of websites based on frames without filetype extensions, since the filetype extension is used as the cleaning attribute. The advanced cleaning is based on a VLD.KDB that consists of the REQ.RES embedded within the website structure; thus, it needs extra-weblog data related to the website structure. In case of dynamic web design, including automatic personalization and structure adaptation, where the website content and structure are shifting continuously, the advanced cleaning becomes burdensome for servers and even impossible in case of personalized content [3]. This is due to the fact that the construction and updating of the VLD.KDB need a continuous extraction of the website URIs intended for end-users. Such a process is complex to perform, represents an extra cost factor, and may miss new/cancelled content if not synchronized with the cleaning process in real time. Finally, the fact that the two methods need exhaustive prior knowledge of the website content and structure, and of its resources intended for end-users, in order to set up the filtering heuristic, represents a serious weakness. Obtaining such exhaustive prior knowledge within the analytics perspective of server owners is not obvious, because server owners, unlike website owners, do not necessarily have this knowledge [15].

3 Rule-Based Cleaning Approach

3.1 Targeted Contribution Concept and Method

The targeted contribution is meant to provide a cleaning method that matches the advantages of the advanced cleaning (noise-free output) without any need for extra-weblog data, prior knowledge, or extra cost factors beyond the cleaning process. Such a method aims to be workable within the context of the dynamic and responsive Web, in both analytics perspectives, i.e., those of server and website owners. This cleaning method filters the R.WLDB on the basis of the logging structure rules. Since the logging structure depends only on the logging format, the proposed method is insensitive to the repercussions of the dynamic web on website structure and content. Thus, the rule-based cleaning aims to overcome the relevancy, workability and cost constraints of the advanced and conventional cleaning.

The method consists of three main steps, i.e., (1) the identification of the logging structure features that distinguish click requests from hits, (2) the abstraction of these features into cleaning rules, (3) the implementation of the rules within a cleaning algorithm to be experimented and evaluated. The rule-based cleaning relies on the rules related to the regular features of the clicks logging structure. Thus, a browsing simulation, drawn on a predefined set of relevant browsing items, is performed through a dedicated software (Webserver Stress Tool 8.0.0.1010) on a mirrored website (Simple English Wikipedia) within a localhost server (Apache server) set to generate an ALF in the ECLF. The relevant browsing items cover the different manners in which an end-user browses URIs, e.g., access by clicking links from a search engine or a third-party website, typing the URI in the browser search bar, backward and bookmark navigation. The generated ALF is collected and the requests related to the browsed URIs (relevant items/end-user clicks) are highlighted within their underlying hits (irrelevant items/user-agent hits). The regular features identified in terms of the REQ.RES and REF.RES attributes of the clicks logging structure are abstracted to infer the rules that separate clicks from hits. The inferred rules are combined into a filtering process that filters-in the relevant items into the C.WLDB. An implementation algorithm under R within the Apache Spark API is set up to perform a cleaning test on third-party website ALFs. The outcomes of the rule-based cleaning are compared to the conventional and advanced cleaning.

3.2 Logging Structure Features Identification (Step 1)

The attributes and functions of the identified regular features of the logging structure are exhibited in Table 2. Three generic features (a, b, c) are identified within the ALF structure: (a) Website access by URN (homepage/Domain Name/Uniform Resource Name), which takes on an empty requested resource attribute between the request method and HTTP protocol attributes. (b) Website browsing by URL (content page), which is identified by the relationship between the requested resource/content page URL (Uniform Resource Locator) attribute and the referring resource attribute of its components. (c) Noise interfering with the regular features cited above, which takes the form of responsive requests, namely Style Queries (SQ), interlaced within the referring resource logged at the ALF referrer entry.

Table 2. Regular logging features.

3.3 Cleaning Rule (Step 2)

The identified regular features can be formalized in three rules, i.e., URN access/clicks, URL access/clicks, and regular interfering noise/hits.

$$ \text{URN ACCESS} \Rightarrow \mathrm{req}_{n}\left((\mathrm{req.met}\ \&\ \mathrm{req.pro}) \neq \varnothing \ \&\ \mathrm{req.res} = \varnothing\right) $$
(1)
$$ \text{URL ACCESS} \Rightarrow \mathrm{req}_{n}(\mathrm{req.res}) = \mathrm{req}_{n+1}(\mathrm{ref.res}) $$
(2)
$$ \text{INTERFERING NOISE} \Rightarrow \mathrm{req}_{n}(\mathrm{ref.res.typ}) \in \mathrm{SQ} $$
(3)

The automatic inference of the website DN and its fill-in within the empty attribute (\( \mathrm{req.res} \)) reduce the URN access rule to the URL access rule. The automatic inference of the DN given in Eq. 4 is justified as follows. The referrer entry of an ALF contains internal and external referrers (URIs). One click is referred by one external referrer and by as many internal referrers as the pointed page has components. Since the internal referring resources (URIs) share the same root (the website DN), while the external referring ones relate to different roots, the most frequent root within the referrer entry is the website DN. Thus, the website DN is the most frequent referring resource root.

$$ \mathrm{DN} = \arg\max_{\mathrm{freq}}\left(\mathrm{REQ}(\mathrm{ref.res.roo})\right) $$
(4)
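One possible base-R reading of Eq. 4, assuming `ref_res_roo` is a character vector holding the referring resource root of every logged request; dropping empty referrer entries is an assumption of the sketch.

```r
# Eq. 4 sketch: the website DN is taken as the most frequent referring resource root.
infer_dn <- function(ref_res_roo) {
  roo <- ref_res_roo[!ref_res_roo %in% c("", "-")]  # ignore empty referrer entries
  names(which.max(table(roo)))
}
```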

The automatic inference of the style/media queries noise (SQ) given in Eq. 5 is justified as follows. Considering that style queries are the only page components of their type that may appear within the referrer entry, the SQ type is the least frequent resource type within the intersection of the requested and referring resource types.

$$ \mathrm{SQ} = \arg\min_{\mathrm{freq}}\left(\mathrm{REQ.RES.TYP} \cap \mathrm{REF.RES.TYP}\right) $$
(5)
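One possible base-R reading of Eq. 5; counting the frequency over the referring resource types (rather than the requested ones) is an assumption of this sketch.

```r
# Eq. 5 sketch: the SQ type is the least frequent resource type among those shared
# by the requested and referring resource types.
infer_sq <- function(req_res_typ, ref_res_typ) {
  shared <- intersect(req_res_typ, ref_res_typ)
  counts <- table(factor(ref_res_typ[ref_res_typ %in% shared], levels = shared))
  names(which.min(counts))
}
```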

Equations 4 and 5 represent statistical properties of the logging structure within the ECLF of an ALF. They have been tested and validated on the 6 ALFs used in the experimentation section (Sect. 4); we therefore assume that they generalize to ALFs in the ECLF. After the automatic inference of the DN and the SQ, we end up with only two rules, i.e., the access/clicks rule and the noise/hits rule.

$$ \text{ACCESS} \Rightarrow \mathrm{req}_{n}(\mathrm{req.res}) = \mathrm{req}_{n+1}(\mathrm{ref.res}) $$
(6)
$$ \text{NOISE} \Rightarrow \mathrm{req}_{n}(\mathrm{ref.res.typ}) \in \mathrm{SQ} $$
(7)
Algorithm 3. Rule-based cleaning.

3.4 Rule-Based Cleaning Algorithm (Step 3)

Algorithm 3 outlines the rule-based cleaning. The algorithm begins by parsing the weblog data to, first, infer the DN and fill it into the empty requested resource attribute wherever the request method and HTTP protocol are not missing. It then infers the SQ type that may appear in the referrer entry. Once the SQ type is inferred, the requests whose referrers contain the SQ resource type are removed. Finally, the access rule is applied to filter-in the requests referred to as clicks into the C.WLDB.
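A minimal, self-contained base-R sketch of the steps just described; the data frame layout (columns `req_met`, `req_pro`, `req_res`, `req_res_typ`, `ref_res`, `ref_res_typ`, `ref_res_roo`, ordered as logged) is an assumption of the sketch, and the paper's Spark-based implementation is not reproduced here.

```r
# Rule-based cleaning sketch (cf. Algorithm 3): infer the DN, fill URN accesses,
# infer and remove the SQ noise, then apply the access rule to build the C.WLDB.
rule_based_cleaning <- function(r_wldb) {
  # Eq. 4: the DN is the most frequent referring resource root
  roo <- r_wldb$ref_res_roo[!r_wldb$ref_res_roo %in% c("", "-")]
  dn  <- names(which.max(table(roo)))

  # URN access (Eq. 1): fill the empty requested resource with the DN when the
  # request method and HTTP protocol are present
  urn <- r_wldb$req_res == "" & r_wldb$req_met != "" & r_wldb$req_pro != ""
  r_wldb$req_res[urn] <- dn

  # Eq. 5: the SQ type is the least frequent type shared by requested and
  # referring resources (counted here over the referring types)
  shared <- intersect(r_wldb$req_res_typ, r_wldb$ref_res_typ)
  counts <- table(factor(r_wldb$ref_res_typ[r_wldb$ref_res_typ %in% shared],
                         levels = shared))
  sq <- names(which.min(counts))

  # Noise rule (Eq. 7): discard requests whose referring resource type is the SQ type
  r_wldb <- r_wldb[r_wldb$ref_res_typ != sq, , drop = FALSE]

  # Access rule (Eq. 6): request n is a click when its requested resource is the
  # referring resource of request n + 1
  nxt_ref <- c(r_wldb$ref_res[-1], NA)
  r_wldb[!is.na(nxt_ref) & r_wldb$req_res == nxt_ref, , drop = FALSE]  # the C.WLDB
}
```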

4 Experimentation and Results Discussion

4.1 Experimental Data and Validation Reference

The rule-based cleaning algorithm, as well as the conventional and advanced cleaning ones, have been tested on a representative sample of 6 ALFs of third-party websites. The description of the experimental data is given in Table 3. This sample aims to be representative of the different practices in terms of web design, web content, and clicks/hits ratio. The involved ALFs relate to a public administration (AL1.GOV), commercial websites (AL2.COM, AL3.COM, AL4.COM, AL5.COM), and our laboratory website (AL6.LAB). The size of each ALF is given as a number of requests. The ALFs have been processed with the involved cleaning methods through their related Algorithms 1, 2 and 3, implemented under R within the Apache Spark API.

Table 3. Experimental data description.

The methods’ outputs are evaluated in terms of a confusion matrix where: Size (SIZ) refers to the processed ALF size given as a number of requests; Positive (P) and Negative (N) indicate, respectively, the number of clicks and hits the ALF contains; True Positive (TP) and True Negative (TN) indicate the requests correctly referred to as clicks and hits by the cleaning algorithm, while False Positive (FP) and False Negative (FN) indicate the hits wrongly referred to as clicks and the clicks wrongly referred to as hits; the False Positive Rate (\( FPR = \frac{FP}{FP + TN} \)) measures the noise rate; the False Negative Rate (\( FNR = \frac{FN}{FN + TP} \)) measures the miss rate; the Accuracy (\( ACC = \frac{TP + TN}{P + N} \)) measures the cleaning relevance. The validation reference (P and N numbers/labels) is composed of the resources validly requestable by an end-user within the websites of the involved ALFs. This validation reference is built with a website checker software (Xenu’s Link Sleuth 1.3.8) that reports, among others, the URIs embedding clickable resources intended for end-users.
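For completeness, a small helper consistent with the definitions above, deriving the noise rate, the miss rate, and the cleaning relevance from the confusion-matrix counts of one cleaning run; the function name is ours.

```r
# FPR, FNR and ACC as defined above (P + N = TP + FN + FP + TN).
cleaning_metrics <- function(tp, fp, tn, fn) {
  c(FPR = fp / (fp + tn),                   # noise rate
    FNR = fn / (fn + tp),                   # miss rate
    ACC = (tp + tn) / (tp + fp + tn + fn))  # cleaning relevance
}
```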

4.2 Results Analysis

The cleaning results are depicted in Table 4. The analysis of the noise rate (FPR) of the CON.CLE algorithm shows that 53% of it consists of frames without filetype extensions, in addition to unexpected responsive filetype extensions. 47% of the miss rate relates to different formats of clickable media content (non-html resources) that have been requested by end-users. The FPR results of the CON.CLE confirm its inappropriateness for frame-based web design, while its FNR results demonstrate the challenge of building an exhaustive prior FLT.KDB. Overall, the CON.CLE is very noisy (ACC 88%). Both the ADV.CLE and R.CLE tests give a noise-free FPR and a miss-free FNR. This is due to the fact that they are based on the same filtering attribute (REQ.RES), retrieved from two different sources: the R.CLE includes it in the access rule (Eq. 6), which is a function of the referring URIs within the ALF, whereas the ADV.CLE retrieves it from the website URIs. Since the extracted website URIs contain all the resources requestable by click, the ADV.CLE provides noise-free outputs, while the R.CLE reaches the same result in an optimized way, as it focuses only on the resources that have actually been requested by the clicks identified through the access rule.

Table 4. Cleaning results.

4.3 Results Evaluation

The R.CLE is the most advantageous method in terms of relevancy balanced against workability and cost constraints. It provides high quality outputs with only one cost factor (the cleaning process). It does not need any extra-weblog data, which makes it insensitive to the repercussions of the dynamic and responsive Web on the logging content. In addition, since it does not need any prior knowledge, it is workable in both the website and the server owner analytics perspectives. The ADV.CLE, which also gives noise-free results, is burdensome for servers and proves to be unworkable in case of automatically and continuously shifting content and structure, specifically in case of personalized content. The CON.CLE is very noisy and proves to be obsolete in the case of the responsive, frame-based Web. Both the ADV.CLE and CON.CLE need exhaustive prior knowledge of the involved website, which limits them to the website owner analytics perspective. Overall, the ADV.CLE and R.CLE can be combined, where the former is optimized by the latter, as they share a common filtering attribute retrieved from two different sources.

5 Conclusion

This contribution draws a critical analysis of weblog data cleaning. It describes the limitations of the related cleaning methods in terms of relevancy, workability and cost constraints. The limitations of the analyzed methods are due to the repercussions of the dynamic and responsive web on the logging content. These cleaning methods are content-centric and prove to be inappropriate in such a context, specifically within the analytics perspective of server owners. Since the proposed method is focused on the logging structure, it demonstrates significant advantages compared to the content-centric cleaning methods in terms of relevancy balanced against workability and cost constraints. Finally, despite the limited size of the experimented ALFs, the fact that the rule-based cleaning is based on the logging format makes it generalizable to ALFs in the ECLF, since the format does not depend on the logging size.