1 Introduction

Reliability is becoming increasingly important to web systems due to the popularity of web applications, and the need for highly reliable systems will only grow as companies continue to move their operations online. To increase reliability, a method to measure the reliability of current systems is required. However, existing methods to measure reliability (Lyu 1995; Musa et al. 1987; Trivedi 2001) cannot be applied directly to web systems because of the specific nature of these systems (Alagar and Ormandjieva 2002; Offutt 2002). Thus, these existing methods need to be modified to include new workload characteristics before the reliability of web systems can be estimated (Tian et al. 2004). More specifically, Tian et al. (2004) defined two special characteristics:

  • Massiveness and diversity: Web systems can interact with many different external systems. For example, one application may interact with Internet Explorer 6.5 and MySQL 3.23; another application may interact with Internet Explorer 5.5, Mozilla FireFox 1.5, SQLite 3.4.2 and Google Maps API 2.1. Not only that, every user with an Internet connection is considered to be a potential user of the web system. The workload characteristics selected need to reflect this diverse software configuration and massive and ill-defined user population.

  • Document and information focus: Traditional workloads are computation-focused, whereas web systems principally have a document and information focus. Newer web systems have increased computation; however, search and retrieval remains the dominant usage for web users. The workload types for a computational focus are fundamentally different from the workload types for a document and information focus.

To ensure accurate reliability estimation, generic workload measures suitable for traditional, computation-intensive systems cannot be used to measure web workloads. Hence, Tian et al. (2004) defined four different web workload characteristics for reliability calculations:

  • The number of hits: This workload is popular because each hit corresponds to a specific request to a web server, and each entry in the access log is a hit, which allows the data to be extracted easily. However, this workload can be misleading if there is high variability among individual hits (Tian et al. 2004).

  • The number of bytes transferred: This workload may be used as a measure of finer granularity than the hit count; the number of bytes transferred for each hit is recorded in the server logs and can be extracted with relative ease.

  • The number of users: This alternative workload can be used by organizations that support various web systems and want to examine reliability at the user level. To count the number of users per day, the total number of unique IP addresses for that day is counted, and each unique IP address is assumed to correspond to a unique user. In other words, all hits originating from the same IP address (which may be associated with one computer or multiple computers sharing the same IP address) are considered to be requests from a single user. A disadvantage of the user workload is its coarse granularity. This problem can be remedied by counting the number of user sessions.

  • The number of sessions: This workload can be calculated from the IP address and the access time. If consecutive hits from one IP address occur within a specified timeout period, all of these hits are considered to belong to one session (a simple sketch of this computation is given below). The session workload is finer grained than the user workload because each session is typically associated with a change in user activity or a change in user; the same user may exhibit different usage patterns in different sessions, and this is revealed by the session workload characteristic.
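
As an illustration of the session and user counts, the sketch below derives both from (IP address, timestamp) pairs extracted from an access log. This is a minimal sketch of the rule described above, not the tooling used in the original study; the 30 min default timeout and the example hits are ours.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def count_sessions(hits, timeout=timedelta(minutes=30)):
    """Count users and sessions from (ip, timestamp) pairs.

    A new session starts whenever the gap between consecutive hits
    from the same IP address exceeds `timeout`.
    """
    last_seen = {}                    # ip -> timestamp of its most recent hit
    sessions = defaultdict(int)       # ip -> number of sessions observed

    for ip, ts in sorted(hits, key=lambda h: h[1]):
        prev = last_seen.get(ip)
        if prev is None or ts - prev > timeout:
            sessions[ip] += 1         # first hit, or gap too long: new session
        last_seen[ip] = ts

    # session count, user count (one unique IP = one assumed user)
    return sum(sessions.values()), len(sessions)

# Three hits from one IP; the last follows a long gap, so two sessions result.
hits = [("10.0.0.1", datetime(2005, 11, 3, 9, 0)),
        ("10.0.0.1", datetime(2005, 11, 3, 9, 10)),
        ("10.0.0.1", datetime(2005, 11, 3, 14, 0))]
print(count_sessions(hits))           # (2, 1)
```

Running the same routine with `timeout=timedelta(hours=2)` reproduces the coarser session definition discussed in Section 4.3.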

Given the issues related to these workload estimates, this study will also examine simply using “days” as a workload characteristic. A “day” is defined as a 24 h period within a log file. Clearly, this alternative has a substantially coarser granularity than the alternatives discussed above. While the obvious temptation is to utilize a fine-grained workload metric, issues exist in estimating such metrics; hence, the question of whether they are actually a superior choice of normalizing term needs to be considered.

Although web traffic characteristics have been explored in detail, such as the characterization of workloads (Alagar and Ormandjieva 2002), traffic trends and patterns (Crovella and Bestavros 1997), and response times (Cremonesi and Serazzi 2002), only a few studies have investigated web error behavior and the measurement of web reliability. Several hypothetical approaches exist, but they lack empirical validation (Alagar and Ormandjieva 2002; Wang and Tang 2003). One practical approach to measuring the reliability of web systems is to use the information contained in server logs (Huynh and Miller 2005; Kallepalli and Tian 2001; Tian et al. 2004), such as system usage and failure codes. This information can be extracted and used to evaluate the system’s reliability and identify “areas” for reliability improvement.

In this paper, the approach of measuring reliability from server logs, as presented by Tian et al. (2004), is evaluated and analyzed to determine its viability and effectiveness. Results from the original study and from our new study are used in the analysis. Two websites were examined in the original study, and two additional websites are investigated in this new study. Initially, these two websites are analyzed using the same methodology as proposed in the original study (Tian et al. 2004). That is, the server logs from these two websites were parsed for all errors that occurred while the websites were serving content to their visitors. A reliability estimate is then calculated from the extracted errors. This paper extends the original study (Tian et al. 2004) by:

  • Applying the technique to two new websites, one of which is a commercial website; in fact, the site can be considered mission critical to the commercial organization. The logs investigated for this commercial website cover an extensive 15 month period. It is believed that this log represents the longest period of capture, and the only truly “mission critical” log, reported within the research literature.

  • Examining the error codes more rigorously; this will allow web administrators to focus on high value error codes.

  • Re-examining the workload models to provide alternative methods for web administrators to analyze and interpret reliability information.

The remaining sections of this paper are organized as follows: Section 2 describes the research methodology. Section 3 provides a brief overview of the characteristics of the websites used in the previous and the current study. Section 4 examines the workloads, the limitations of the workloads proposed, and the results from the two websites. Finally, Section 5 presents our conclusions.

2 Research Methodology

Tian et al. (2004) demonstrated, through an experiment on two websites, that the operational reliability of websites can be estimated from server logs. They identified three failure sources:

  • Host, network, or browser failures that prevent the delivery of requested information to web users. These errors can be analyzed and assured by existing techniques (Lyu 1995; Musa et al. 1987; Trivedi 2001) because they are similar to failures in regular computer systems, networks or software (Tian et al. 2004).

  • Source content failures that prevent the acquisition of the requested information by web users because of problems such as missing or inaccessible files, trouble with starting JavaScript, etc. These failures have characteristics unique to web systems (Crovella and Bestavros 1997; Montgomery and Faloutsos 2001; Offutt 2002); hence, special workload characteristics need to be defined before their reliability can be estimated.

  • User errors, such as improper usage, mistyped URLs, etc. These errors also include any external factors that are beyond the control of web service or content providers.

They noted that host, network, and browser failures, as well as user errors, can either be addressed by existing approaches or are outside the responsibility and control of the content provider. However, source content failures represent a significant part of the problem, and content providers can address these issues. Hence, Tian et al. (2004) focused on the web source content failures contained in error and access log files in their study. These files are created by all commercial HTTP Daemons.

The Nelson model (Nelson 1978), a widely used input domain reliability model, was used by Tian et al. (2004) to calculate reliability after the necessary information was extracted from the server logs. The formula for the Nelson model is:

$$R = \frac{n - f}{n} = 1 - \frac{f}{n} = 1 - r$$
(1)

where f is the total number of failures, n is the number of workload units and r is the failure rate. The mean time between failures (MTBF) was then calculated as:

$$\text{MTBF} = \frac{1}{f}\sum_i t_i$$
(2)

where t_i is the usage time for each workload unit i. If the usage time is not available, the number of workload units is then used as an approximation of the time period. Thus, the MTBF can be calculated as:

$$\text{MTBF} = \frac{n}{f}$$
(3)
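
For concreteness, a minimal sketch of these calculations is given below. The function names and the example figures are ours and purely illustrative; they are not taken from the tables presented later in the paper.

```python
def nelson_reliability(n, f):
    """Nelson model (Eq. 1): R = (n - f) / n = 1 - r."""
    return 1.0 - f / n

def mtbf(n, f, usage_times=None):
    """MTBF via Eq. 2 when per-unit usage times t_i are known,
    otherwise the approximation MTBF = n / f of Eq. 3."""
    if f == 0:
        return float("inf")        # no failures observed in the period
    if usage_times is not None:
        return sum(usage_times) / f
    return n / f

# Illustrative numbers: 500,000 workload units (e.g. hits) and 12 failures.
print(nelson_reliability(500_000, 12))   # 0.999976
print(mtbf(500_000, 12))                 # ~41,666.7 workload units per failure
```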

2.1 Removal of Automated Requests

The log files contain requests from robots and other automated systems; these should be removed because they are not actual requests from web users. Automated systems are classified as systems that repeatedly request a resource from the website after a set period of time. For example, upon investigation of Site A’s server log, requests from two monitoring services were identified. The first service requests a resource from Site A every 30 min while the second service requests a resource from Site A every 66 min. The resources these services request are unique and not publicly available; hence, removing them simply involves identifying these resources in the log files. Requests from robots that automatically fetch the “robots.txt” resource are also removed from both the Site A and ECE log files.

Although it is infeasible to remove all automated requests from the server logs, web administrators need to remove all identifiable requests. Web administrators can use several techniques to identify automated requests. Most well-known robots have a signature line that is included with every request as part of the USER AGENT field of the log file. For example, “Googlebot-Image/1.0” can be used to identify a robot from Google that is indexing the website’s images. For web monitoring services, web administrators can simply dedicate a special resource that only these services access. This resource can then be easily identified within the log files, as illustrated in the sketch below.
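
The filter below sketches these two techniques. The robot signatures, the monitoring-resource path and the sample entries are illustrative placeholders, not the actual values used for Site A or ECE.

```python
# Illustrative filter for automated requests.
ROBOT_SIGNATURES = ("Googlebot", "Googlebot-Image", "msnbot", "Slurp")
MONITORING_RESOURCES = {"/status/heartbeat.html"}   # hypothetical dedicated resource

def is_automated(entry):
    """Return True if a parsed log entry looks like a robot or a monitoring hit."""
    if entry["request_path"] == "/robots.txt":
        return True                                   # crawler fetching robots.txt
    if entry["request_path"] in MONITORING_RESOURCES:
        return True                                   # dedicated monitoring resource
    agent = entry.get("user_agent", "")
    return any(signature in agent for signature in ROBOT_SIGNATURES)

entries = [
    {"request_path": "/robots.txt", "user_agent": "Googlebot/2.1"},
    {"request_path": "/database/form.php", "user_agent": "Mozilla/4.0 (compatible; MSIE 6.0)"},
]
human_entries = [e for e in entries if not is_automated(e)]
print(len(human_entries))   # 1
```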

2.2 Analysis of Error Code Information

Error response codes can be extracted from either access or error logs. Due to the lack of error log files for the K Desktop Environment (KDE) website and Site A, only the access log files were used to extract the error information (Tian et al. 2004). Error response codes are embedded in the access logs, and these codes can be mapped to the error entries in the error log; for example, a “file not found” error in the error log usually corresponds to a 404 error code in the access log. Hence, as stated in Tian et al. (2004), using just the access logs is a reasonable method to gather error information unless detailed information about the errors is required. Figure 1 provides a sample entry that can be found within the access logs.

Fig. 1 A sample entry in an access log

This figure shows that on November 3, 2005, a remote user with the IP address 129.194.12.3 used the POST method to access a file called search.php. The server responded with a 200 code and returned 50482 bytes of data. The previous URL that the user visited (the referrer) is http://www.sitea.com/database/form.php. The user used Microsoft Internet Explorer version 6.0 to access the webpage. A sketch of how such an entry can be parsed is given below.
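
The sketch below parses a line in Apache’s “combined” log format into the fields described above. The sample line is invented to mirror the entry of Fig. 1 (the time of day and time zone are not given in the paper), and the field names are ours.

```python
import re

# Apache "combined" log format:
# host ident authuser [date] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

sample = ('129.194.12.3 - - [03/Nov/2005:10:15:12 -0700] '
          '"POST /search.php HTTP/1.1" 200 50482 '
          '"http://www.sitea.com/database/form.php" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

entry = LOG_PATTERN.match(sample).groupdict()
print(entry["ip"], entry["status"], entry["bytes"])   # 129.194.12.3 200 50482
```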

The Nelson model and the MTBF calculation require that the server logs capture the entire workload for the period under investigation. To ensure that the logs are complete, the parser used was customized to report suspicious gaps, defined as long periods of inactivity between two recorded hits (a simple version of this check is sketched below). These gaps were manually examined and discussed with the web administrators to ensure that they are naturally occurring and not due to external factors such as the hard drive being full.
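
A minimal sketch of such a gap check follows; the 6 h threshold and the example timestamps are ours, chosen only to illustrate the idea of flagging quiet periods for manual inspection.

```python
from datetime import datetime, timedelta

def suspicious_gaps(timestamps, threshold=timedelta(hours=6)):
    """Return (start, end) pairs where inactivity between hits exceeds `threshold`."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > threshold]

# A 9 h quiet period between two hits is flagged for manual inspection.
times = [datetime(2005, 5, 1, 8), datetime(2005, 5, 1, 9), datetime(2005, 5, 1, 18)]
print(suspicious_gaps(times))
```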

The error response codes in Tables 3, 4 and 5 are the standard HTTP error response codes defined in Request for Comments (RFC) 2616 (http://www.w3.org/Protocols/rfc2616/rfc2616.html), the specification of the HTTP protocol. The following is a list of the codes encountered, their descriptions, and their implications when used for reliability analysis:

  • 400 (Bad request)—the request could not be understood by the server due to its malformed syntax. This code should not be used for reliability analysis because the code is caused by a client that is not following the HTTP standard. Since this is a client-side issue, it does not make sense to estimate a website’s reliability based on this code.

  • 401 (Unauthorized)—the server does not accept the client’s authorization credentials (or they were not supplied). This error occurs when a user requests a resource that the user does not have permission to retrieve. If the referrer for this resource is external to the website, then this error can be ignored because the web administrators cannot control these external referrers. However, if the referrer is internal to the website and this is not the expected behavior of the server, then this error needs to be included in the reliability analysis. This situation, in which a single error response code encompasses both source content failures and external sources (human and system errors), occurs repeatedly; it needs to be resolved to provide accurate reliability information, and it is addressed later in the paper.

  • 403 (Forbidden)—the server is refusing to fulfill the client’s request. The cause for this error is similar to the 401 error code. Depending on the configuration of the HTTP daemon, this error may be returned instead of the 401 error code. Hence, it has the same issue as the 401 error response code, and will be discussed later.

  • 404 (Not found)—the server cannot find anything matching the Request-URI. This error is currently the dominant error code and represented the focus of the results in Tian et al.’s (2004) paper. However, this error response code again covers a multitude of different error types: some are source content failures, others lie outside the system, and some that appear to be source content failures are actually not. For example, an attacker utilizing a scanner (Spitzner 2001) can spoof the referrer field of the log file when scanning for a system’s vulnerability; the spoofed referrer field appears to be an internal link when it is actually from an external source. Links to old versions of the website can also create 404 error codes that appear to be internal bad links because the old version of the website is hosted on the same server as the current website. However, these internal bad links should be discarded because the user is using an incorrect version of the website. With the availability of powerful link checkers (NetMechanic HTML Toolbox, W3C Link Checker), it is highly likely that actual source content failures are on the decline.

  • 405 (Method not allowed)—the method specified in the Request-Line is not allowed for the resource identified by the Request-URI. The client performs a request that is not allowed by the server. For example, the client tries to perform a PUT request, but the server is configured to not accept PUT requests; hence, a 405 error code is generated. Since this error code only occurs due to a configuration issue, it should be discarded.

  • 406 (Not acceptable)—this error is returned if the web server detects that the client cannot accept the data it wants to return. This error code should be discarded because the server’s content does not support the client used to access it.

  • 407 (Proxy authentication required)—if the client does not authenticate itself with the proxy then this error is returned. This error code can be discarded because the client did not authenticate with the server before attempting to access restricted content.

  • 408 (Request timeout)—the client did not produce a request within the time that the server was prepared to wait. This is a network failure rather than a source content failure, and hence, it should be discarded.

  • 409 (Conflict)—the client is attempting to perform a request that conflicts with the server’s established rules. For example, the client attempts to upload a file that is older than the file currently available on the server, which results in a version control conflict. This error can be discarded because it is a browser failure, not a server failure.

  • 410 (Gone)—the server cannot find the requested resource and no alternative location can be found. This error code is related to the 404 response code, and hence it should follow the same rules as the 404 response code.

  • 411 (Length required)—the server is denying the data the client is uploading because the client is not specifying the size of the data. Because this error is a browser failure and not a server failure, it can be discarded.

  • 412 (Precondition failed)—the resource requested failed to match the established preconditions. This error should be included because the server failed to satisfy the preconditions; this implies that this error response code is a server failure.

  • 413 (Request entity too large)—the server is rejecting the data being uploaded from the client because the data size is too large. The size limit can be adjusted within the server configuration. Since this error code only occurs due to a configuration issue, it should be discarded.

  • 414 (Request-URI Too Long)—the server returns this error code in the following situations:

    • The client (usually a browser) has converted values from a POST request to a GET request. The POST request can handle larger values than the GET request; thus, the error occurs when an extremely large POST request is converted to a GET request.

    • The client is attempting to exploit some type of vulnerability in the server. Usually, these exploits involve a large amount of malicious code being injected into the Request-URI. Some of these vulnerabilities include: buffer overflows (Cowan et al. 1998; Evans and Larochelle 2002; Wagner et al. 2000), SQL injections (Boyd and Keromytis 2004; Grossman 2004; Huang et al. 2003), cross-site scripting (CGISecurity 2002; Cook 2003), etc.

      Generally, the first situation is rare, and hence it is usually safe to assume that a majority of 414 errors will correspond to attacks on the server or other users who are accessing the vulnerable website. Thus, by identifying these 414 errors, system administrators can identify attacks on their server system and take appropriate actions against the attackers. Although the 414 error code is useful to system administrators, it is not a source content failure and, hence, will be excluded from reliability analysis.

  • 415 (Unsupported media type)—the server is refusing the request because the resource is in a different format from the requested format. For example, the browser requests a resource and specifies it as a text document; however, the server recognizes the requested resource as a binary file and not a text document. A 415 response code would be generated in this scenario. Since this error code is a browser failure and not a source content failure, it should be discarded.

  • 416 (Requested range not satisfiable)—the client is requesting an invalid byte range of a file. This error occurs when the client, usually a download manager such as Getright (http://www.getright.com) or Wget (http://www.gnu.org/software/wget/wget.html), erred in its resume-point calculation. Hence, this error code should not be used in reliability analysis.

  • 500 (Internal error)—the server encountered an unexpected condition which prevented it from fulfilling the request. Bugs within various dynamic scripts running on the server cause this error code. Therefore, it must be included in any reliability calculation.

  • 501 (Not implemented)—the server does not support the request type that the client is sending. For example, the browser tries to retrieve the header information of an ASP enabled web page, so it sends a HEAD request to the server. However, the server does not understand this request for ASP enabled web pages, so it returns 501 error response code. This error code should be included in reliability analysis.

  • 502 (Bad gateway)—this error has two definitions depending on the HTTP daemon used. For Apache, this error occurs when the server, while acting as a gateway or proxy, received an invalid response from the upstream server it accessed in attempting to fulfill the request. Because this error response code only occurs when the Apache HTTP Daemon is acting in a different mode rather than actively serving web pages, this error should be discarded for servers using the Apache HTTP daemon. For IIS, Microsoft IIS’ support page (http://support.microsoft.com/default.aspx?scid=kb;en-us;318380) describes this error as “You receive this error message when you try to run a CGI script that does not return a valid set of HTTP headers.” In other words, this error code can be triggered by an error in the web application’s output code. Thus, this error should be included in reliability analysis if the web software is running on the IIS platform.

  • 503 (Service unavailable)—the server is overloaded and cannot serve further requests. For example, due to a popular marketing campaign for a website, many users decide to visit the site. The unexpected load caused by this sudden increase in traffic places a major strain on the server’s resources, which then leads to extremely slow response times or a server crash. Toys R Us’ website, for instance, received a surge in traffic after it released its Big Book catalog; this surge overloaded the system’s resources, which led to extremely slow response times, and numerous potential purchasers were turned away as a result (Masterson 1999).

    This failure response code is a host failure that can lead to extended availability issues if not resolved properly. Tian et al. (2004) stated that availability problems are generally perceived by web users as less serious than web software problems. They argued that users are more likely to be successful in accessing the required information after temporary unavailability, whereas software problems persist unless the underlying causes are identified and fixed. We believe this argument is questionable because web users are much more impatient and less forgiving than traditional users, as discussed by many studies (Galletta et al. 2004; Grant 2000; Masterson 1999; Nah 2002; Rose et al. 2001; Williams 2001). They typically move on to the next site if they encounter issues with the site they are currently browsing. From their perspective, if they cannot access the information they want, then it is an error. Hence, although the 503 error response code corresponds to a host failure and not a source content failure, it must be included in reliability analysis.

  • 504 (Gateway timeout)—this error only occurs when the server is acting as a gateway or proxy server, hence it should be discarded.

  • 505 (HTTP version not supported)—the server does not support the HTTP protocol version used by the client. This error can be discarded because the client is not using the proper HTTP protocol version.

It should be noted that web systems can be configured to catch error codes and respond with a 200 OK code instead. While this strategy hides technical information from users, if configured incorrectly it can prevent the error codes from being logged properly. Hence, web administrators should ensure that error codes are still logged if this strategy is to be used.
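
The per-code dispositions argued for above can be summarized programmatically, as in the sketch below. This mapping reflects our reading of the discussion in this section: “conditional” codes mix source content failures with external causes and need the referrer-based classification of Section 4.2 before they can be counted, and 502 depends on the HTTP Daemon in use.

```python
# Disposition of HTTP error response codes for reliability analysis,
# following the discussion in Section 2.2.
DISPOSITION = {
    400: "discard", 401: "conditional", 403: "conditional", 404: "conditional",
    405: "discard", 406: "discard", 407: "discard", 408: "discard",
    409: "discard", 410: "conditional", 411: "discard", 412: "include",
    413: "discard", 414: "discard", 415: "discard", 416: "discard",
    500: "include", 501: "include", 503: "include",
    504: "discard", 505: "discard",
}

def disposition(code, daemon="apache"):
    """Return "include", "discard" or "conditional" for an error response code."""
    if code == 502:                        # depends on the HTTP Daemon in use
        return "include" if daemon == "iis" else "discard"
    return DISPOSITION.get(code, "discard")

print(disposition(404), disposition(502, daemon="iis"))   # conditional include
```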

3 Overview of the Websites

Tian et al. (2004) applied the proposed approach to two websites. The first website analyzed was www.seas.smu.edu, the official website for the School of Engineering and Applied Science at Southern Methodist University (SMU/SEAS). The log files contained data covering 26 consecutive days in 1999. The second website analyzed was www.kde.org (KDE), the official website for the KDE project. The overall traffic and user population for this website are significantly larger than those of the SMU/SEAS website. The logs contained 31 days of traffic data, during which over 13 million hits were recorded. Both of these websites used the popular Apache HTTP Daemon (http://httpd.apache.org) to serve their web pages.

3.1 Overview of the Websites in This Study

This paper re-analyzes the approach presented in the original study (Tian et al. 2004). It initially applies this approach to two new websites and, based on these results, postulates an alternative approach. The first website is www.ece.ualberta.ca, the website for the Department of Electrical and Computer Engineering at the University of Alberta. This site, like SMU/SEAS and KDE, is important to its organization but is non-commercial and not mission critical. It is a dynamic website that utilizes the ColdFusion (http://www.macromedia.com/software/coldfusion) scripting language and the Apache HTTP Daemon (http://httpd.apache.org). To investigate the stability of the data, the log files were chosen to cover approximately 30 consecutive days in January 2005 (ECE1) and 30 consecutive days in March 2006 (ECE2). For January 2005, the ECE website received approximately 500,000 hits from 53,100 “unique” visitors and transferred a total of 4.8 Gbytes of data. During March 2006, it handled 470,000 hits from 61,000 “unique” visitors and transferred a total of 6.2 Gbytes of data.

The second website is the website of a publishing company that specializes in online databases (Site A). This website differs from the previous websites in that it is critical to Company A’s operation and hence needs to be extremely reliable. The website utilizes the PHP (http://www.php.net) scripting language, uses MySQL (http://www.mysql.com) for the backend database, and is hosted on an Apache HTTP Daemon. In order to observe potential trends and patterns for this mission-critical website, the log files chosen cover 15 months of operation, from January 2005 to March 2006. This website’s traffic is lower than that of the ECE website. However, it represents a typical business website. That is, the site is dynamic, with a mix of static pages and pages generated dynamically from the customers’ inputs; its users are customers who are either looking to purchase a product or to register for a training course. For the 15 months covered, Site A received approximately 1.9 million hits and 92,000 “unique” visitors, and transferred 34 Gbytes of data. Table 1 displays the technologies used by, and reliability requirements for, the two websites under investigation. Unfortunately, the ECE site administrator only has an approximate reliability target for their installation. These two websites were selected for this investigation because they utilize similar web development technologies while having different reliability requirements. The two websites use a scripting language in addition to an HTTP daemon, with one of the sites (A) also using a DBMS for data management. Although the technologies used are similar, their reliability objectives are quite different. ECE, due to its non-mission-critical nature, is expected to experience between one and ten failures per month. Site A requires high reliability because failures of the site result in lost customers and sales. In other words, Site A is expected to experience no more than one failure per month. Note that the two sites are not related in any way, nor do they have any personnel in common.

Table 1 Sites examined

Table 2 provides a summary of the properties of the logs used in previous studies and this study. Websites with an asterisk are commercial websites.

Table 2 Comparison of data sets

This table shows that the longest period over which previous studies collected data is 7 months, compared to 15 months in this study. Furthermore, studies that use logs from commercial websites cover extremely short periods (1 to 2 weeks). This study investigates the log file of a commercial website over a much longer period (15 months). Hence, it is believed that this study presents the first long-term analysis of a (mission-critical) commercial website.

4 Results and Discussions

This section presents the results for the four websites, discusses various issues encountered during this experiment, and explains the similarities and differences between the original study and this study.

4.1 Results from the Original Study

Tian et al. (2004) discovered many issues associated with the extraction of workload data for reliability estimation. However, the log files allow the hit count, byte count and user count to be extracted with ease. The session count can be derived using a timeout value and provides finer granularity than the user count.

They found that the four proposed workload characteristics allow reliability assessments from different perspectives. Hence, systems administrators can choose the best workload characteristic depending on the situation. For example, administrators concerned with data traffic measurement can examine the byte count, whereas the hit count can provide more useful information regarding web users. The next section presents the results found in this study and discusses whether they confirm the findings from the Tian et al. (2004) study.

4.2 Results from This Study

Tables 3, 4 and 5 provide a summary of the error response codes for all four websites. These tables contain the actual error counts and their corresponding percentages; they follow the analysis performed by Tian et al. (2004). That is, the access logs are parsed, and the errors are grouped together according to the error code without explicitly considering their cause. The original study provided only limited information for the KDE website; hence, cells containing “n/a” denote missing information that cannot be derived. Furthermore, the total percentage of errors recorded does not sum to 100 percent for this website. While Goseva-Popstojanova et al. (2006a, b) also analyzed the error codes, their results are combined into groups such as 4xx (all 400 level error codes) and 5xx (all 500 level error codes); hence, results from Goseva-Popstojanova et al. (2006a, b) cannot be included in these tables.

Table 3 Recorded errors
Table 4 Recorded errors (cont)
Table 5 Recorded errors (cont)

These tables show that the 404 error type dominates, as noted by Tian et al. (2004). They discovered that, for SMU/SEAS, 99.9% of the errors encountered were of types 403 and 404, with 404 errors accounting for 93.1% of the recorded errors. For KDE, 98.9% of the recorded errors were of type 404. According to the survey results from 1994 to 1998 by the Graphics, Visualization, and Usability Center of the Georgia Institute of Technology (http://www.gvu.gatech.edu/user_surveys/), 404 errors are the most common errors that users encounter while browsing the web. Ma and Tian (2003) found that a majority of these 404 errors are caused by internal bad links, while only a small percentage are caused by external factors such as users mistyping the URL, robots from search engines, external links (links from other websites), old bookmarks, etc. Tian et al. (2004) discovered that only 8.7% of the 404 errors encountered for SMU/SEAS were caused by external factors. Despite this conclusion, they did not provide convincing evidence that the majority of the recorded errors are in fact source content failures. Furthermore, these tables show that other error response codes also exist; and while the 404 error type may dominate numerically, no analysis exists of the “value” (of the information) encoded within the various error types for website administrators. Therefore, we will examine all of the error codes encountered to determine which errors are truly source content failures (have value) and which are attributable to other uncontrollable factors (no value). For example, we will show that the 404 response errors have no value for Site A because all of the recorded 404 errors are caused by factors outside of the site administrator’s control, whereas the 503 error response code is high in value because the site administrator is expected to respond to and correct 503 errors immediately due to the potential loss in sales and customers that this error code can cause.

One common argument is that, if information is available, external failures can also be resolved. This argument is not valid for several reasons. A site administrator can only react to external failures rather than proactively prevent them. That is, until an external failure occurs, a site administrator will not have enough information to resolve that failure. Furthermore, depending on the circumstances, the failure may not be resolvable. For example, suppose an external website has a link to a web page on the web system under examination, but, due to recent changes, that web page is no longer valid. The site administrator will not be aware of this issue until a user follows the link from the external website. Once the failure occurs, the site administrator can attempt to resolve it by contacting the external website’s Webmaster to get the link updated. However, this process requires cooperation from the external website’s Webmaster, and it becomes tedious when there are thousands of websites linking to this invalid web page. The site administrator can also attempt to redirect the user to the correct page; however, this requires the site administrator to have a complete mapping of all invalid pages to valid pages, which is clearly infeasible. Because of these potential issues, the site administrator cannot resolve external failures adequately.

Based on the information above, the error response codes can be associated with one or more failure sources. Table 6 displays this association for the error codes discussed. Error codes that are not associated with a source content failure or a host failure will not be investigated because they are beyond the control of content providers and website administrators.

Table 6 Failure sources for the error codes

Table 6 shows seven error codes, 401, 403, 404, 500, 501, 502 (IIS), and 503, that have either source content failure (SCF) or host failure as a potential failure source; hence, these seven error codes will be examined in detail in order to determine their exact failure sources. Further, the 401, 403 and 404 error codes have both source content failures and external failures as potential sources. After intensively investigating the log files for the two websites under study (Site A and ECE), we discovered that, for these websites, the source content failures can be classified into two types:

  • SCF1—these are errors on the website that should be identified and corrected by the site administrators or content providers. These errors can be identified by close examination of the referrer field:

    If the referrer field of an error contains the website’s URL, then the error belongs to the SCF1 category.

  • SCF2—these are usually links from external websites pointing to an old version of the website under investigation. This old version still exists on the HTTP Daemon for archival purposes and has no connections to the current website. Hence, it is not maintained and can contain many bad links. When a user visits this old version—through search engines, old bookmarks, old emails, etc.—and clicks on one of these bad links, the log data will record that the error is caused by an internal source. Since these errors are under the direct control of system administrators, we classify them as source content rather than external failures. However, an argument can be made that they are of lower value than SCF1 type errors. For example, for the ECE site, these errors are considered by the site administrator to be a “non-issue”; and a case can be made for either including them in or excluding them from reliability calculations. Errors belonging to the SCF2 type can be identified using the following method:

    For each error, the referrer URL should be noted and visited. If the URL leads to an old version of the website, then the error is of SCF2 type.

External failure sources—which account for the majority of the failures—can also be classified into two categories:

  • ES1—which are old links from external websites, search engines, old bookmarks, etc. These external links can be detected based on the referrer information—each entry in the log files contains a referrer field which provides the web page that links to the content the user is requesting:

    All 401, 403 and 404 errors having URLs not from the same domain as the website, or the character “-”, in the referrer field are of the ES1 type.

  • ES2—which are scanners being executed by attackers looking for known vulnerabilities contained in various web applications. These scanners can send spoofed information to the web server. The web server will generate internal 401 or 403 errors if the web administrators have set up security permissions for these applications, or internal 404 errors if the website does not use these web applications. ES2 errors can be identified by close examination of the errors:

    If the requested resources belong to web applications not installed for the website, then the errors are of ES2 type.

Errors of types 401, 403 and 404 belonging to the ES1 and ES2 categories should be detected and discarded during the data analysis phase; a sketch of this classification is given below. Tables 7 and 8 display the percentages of the different failure categories for the 401, 403 and 404 error codes. Due to unavailable information, the errors from the original study cannot be classified into the types discussed. These tables show that ECE (1 and 2) and Site A have extremely low (less than 0.5%) or no 401, 403, and 404 error codes as source content failures. All 500, 501, and 502 error codes were discovered to be source content failures, which is expected because of the associations shown in Table 6.

Table 7 Possible error codes for reliability analysis
Table 8 Possible error codes for reliability analysis (cont)
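
A hedged sketch of this referrer-based classification follows. The site domain, the list of uninstalled applications and the test for an archived old version are placeholders for site-specific knowledge that only the administrator has; they are not part of the original study.

```python
from urllib.parse import urlparse

SITE_DOMAIN = "www.sitea.com"                      # hypothetical site under study
UNINSTALLED_APPS = ("/phpmyadmin/", "/awstats/")   # placeholder scanner targets

def is_old_version(referrer_path):
    """Placeholder: True if the referrer points into an archived old site version."""
    return referrer_path.startswith("/old/")

def classify(error):
    """Classify a 401/403/404 error entry into SCF1, SCF2, ES1 or ES2."""
    ref = error.get("referrer", "-")
    if any(error["request_path"].startswith(app) for app in UNINSTALLED_APPS):
        return "ES2"            # scanner probing an application that is not installed
    if ref == "-" or urlparse(ref).netloc != SITE_DOMAIN:
        return "ES1"            # external link, old bookmark, search engine, etc.
    if is_old_version(urlparse(ref).path):
        return "SCF2"           # bad link inside an archived old version of the site
    return "SCF1"               # internal bad link that the administrator should fix

print(classify({"request_path": "/missing.php",
                "referrer": "http://www.sitea.com/index.php"}))   # SCF1
```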

Finally, Tables 9 and 10 display the error codes, generated from source content and host failures, that will be used for reliability analysis in this study. These tables contain the 500, 501, 502, and 503 error codes in addition to a subset of the error response codes from Tables 7 and 8. The 401 error code is not included because, as shown in Table 7, it does not correspond to any source content failures. Tables 9 and 10 effectively demonstrate the low number of “errors” of interest, or value, experienced by live websites (ECE and Site A). These numbers have significant implications for reliability analysis and models for these types of systems.

Table 9 Error codes to be used for reliability analysis
Table 10 Error codes to be used for reliability analysis (cont)

This section discussed the various error codes and how they may or may not contribute to reliability analysis. Care has to be taken when dealing with these error codes, as they have limitations that may affect the accuracy of a reliability estimate. The next section discusses the workloads, their limitations, and how those limitations can further impact reliability analysis.

4.3 Workload Analysis and Discussions

Table 11 contains the values of the four workloads explored by Tian et al. (2004). The session count uses the standard 2 h of inactivity to mark the end of a session (Montgomery and Faloutsos 2001), while “session count 2” uses a 30 min inactivity period, which was also used in many previous studies (Catledge and Pitkow 1995; Cooley et al. 1999; Fu et al. 1999; Goseva-Popstojanova et al. 2004; Goseva-Popstojanova et al. 2006a, b; Menasce et al. 2000a, b). This 30 min figure is based on a mean value of 25.5 min (rounded up) determined by Catledge and Pitkow (1995), and is also believed to be commonly used in many commercial web applications (Huang et al. 2004). For example, Google Inc. uses the 30 min timeout value for their Analytics web application.

Table 11 Workloads

Table 11 shows that when the timeout period is decreased, the session count increases. This behaviour is expected because a shorter timeout period means that some longer sessions will be split into multiple shorter sessions. Because the number of errors remains constant, the increased session count means the reliability estimation will increase. This effect can be seen in Tables 13 and 14. Hence, choosing the correct timeout period for the session count is important if an accurate estimation of reliability is to be obtained. This table shows that during the months of January to March 2006, there seems to be a steady increase in traffic for Site A; this “increase in traffic” is expected because there was a marketing campaign launched during this period to attract more users. However, the three available data points are not sufficient to numerically prove this conjecture.

In order to determine whether any correlation between the workload characteristics exists, Principal Component Analysis (Jolliffee 1986) was performed. Table 12 shows the results for Site A (Total) and Fig. 2 shows the scree plot. The plot shows that only one component has an eigenvalue over 1, and all components after Component 1 appear to level off. This suggests that only one component is of importance. Results for the other websites (ECE1 and ECE2) are similar but are omitted for brevity. These results show that all of the workload characteristics are highly correlated, which suggests that any workload characteristic can be used for reliability estimation. However, website administrators should select the workload characteristic most suitable for their requirements.

Table 12 Correlation matrix
Fig. 2 Scree plot
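
For readers who wish to repeat this check on their own logs, the sketch below computes the correlation matrix of a daily workload matrix and the eigenvalues that underlie a scree plot. The workload values here are synthetic placeholders; the real inputs would be the daily hit, byte, user and session counts behind Table 11.

```python
import numpy as np

# Synthetic daily workload matrix: rows are days, columns are
# hits, bytes, users, sessions and "sessions 2".
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 1))
workloads = np.hstack([base * s + rng.normal(scale=0.1, size=(30, 1))
                       for s in (1.0, 0.9, 0.8, 0.85, 0.95)])

corr = np.corrcoef(workloads, rowvar=False)     # correlation matrix (cf. Table 12)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]    # sorted for a scree plot (cf. Fig. 2)
print(eigenvalues.round(2))                     # one dominant component, the rest near zero
```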

Tian et al. (2004) discussed the potential issues in using the byte count as a workload because a variety of entries in the access log, including error entries, do not contain information on the number of bytes transferred. Upon further investigation, they discovered that the missing entries are associated with binary files already stored in the user cache. The byte count also treats resources with large file sizes as more important than smaller resources. For example, assume that resources A and B exist on a web server and that resource A is much larger than resource B. A user who requires both resources attempts to retrieve them. A failure for resource A will have a greater effect on the reliability estimate of the system, which is inappropriate because the reliability of the server is the same regardless of the size of the resource. Figure 3 shows the file size (in Kbytes) histogram for Site A, which illustrates this issue: the size of the resources on the far right is equivalent to the combined size of many resources on the left side.

Fig. 3 File size histogram for site A

Other issues also exist with using the user count and session count as workloads (Alagar and Ormandjieva 2002; Arlitt and Jin 1999; Rosentein 2000). In fact, since web workload characterization was extensively examined by Arlitt and Williamson (1997), many studies have been performed to further examine the individual workloads (Arlitt and Jin 1999; Cherkasova and Phaal 1998; Menasce et al. 1999, 2000). Tian et al. (2004) suggested that each unique IP address can be counted as one user. However, with the current explosion in the number of Internet users, the pool of available IP addresses is shrinking rapidly. Thus, many methods now exist that allow one public IP address to be used by a group of machines, including proxy servers and personal routers. Since the original study suggests counting one unique IP address as a user, there is a strong possibility that this “user” is actually a group of users. As personal routers and proxy servers become more dominant, this issue is also becoming more prominent. The session count suffers from the same problem because “one session” may actually be several sessions from several different users who share the same public IP address. Thus, a methodology needs to be developed to distinguish different users before accurate reliability analysis can be performed. Websites can use cookies to track users and sessions more effectively by placing a unique identifier and time-related information inside the cookie. However, limitations still exist, such as two users sharing the same machine to access the website. The effectiveness of using cookies as a method to track user and session workloads will be explored in our future work.
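
The cookie-based idea mentioned above could look roughly like the sketch below, which reuses a visitor identifier from a request cookie or mints a new one. The cookie name and the handling are hypothetical illustrations of the approach, not an implementation from this study.

```python
import uuid
from http import cookies

def assign_visitor_id(request_cookie_header):
    """Reuse the visitor id carried in the cookie, or mint a new one.

    Keying users and sessions on this identifier, instead of on the IP
    address, avoids merging distinct users who sit behind one proxy or
    home router.  The cookie name "visitor_id" is hypothetical.
    """
    jar = cookies.SimpleCookie(request_cookie_header or "")
    if "visitor_id" in jar:
        return jar["visitor_id"].value
    return uuid.uuid4().hex              # new visitor: issue a fresh identifier

print(assign_visitor_id("visitor_id=abc123; theme=dark"))   # abc123
print(len(assign_visitor_id(None)))                         # 32
```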

Results from this section confirm the issues with the extraction of workload data from server logs discussed in the original study (Tian et al. 2004). Issues not discussed in previous studies (Tian et al. 2004; Goseva-Popstojanova et al. 2006a, b), such as file size bias and proxy servers, are also presented to ensure that web administrators using this approach for reliability estimation are aware of these limitations.

4.4 Reliability Analysis and Discussions

The failures and workloads can be applied to the Nelson model to evaluate the overall operational reliability. Using Eq. 1, R, based on the hits workload, was calculated for the websites under examination; the results can be seen in Table 13. Not surprisingly, Site A, which has the highest reliability requirement, has a high reliability rate during the 15 month period (99.997% of the hits are successful). The sudden drop in reliability during May 2005 was examined; upon closer investigation and discussion with the administrator, we discovered that a configuration setting was not set up correctly; hence the website experienced several simultaneous server failures.

The hit reliability figures are consistent with previous studies (Tian et al. 2004; Goseva-Popstojanova et al. 2004, 2006a, b) in that they are very high. However, other workloads can be used to obtain different resolution for the reliability figure. As discussed by Tian et al. (2004) reliability based on other workloads (users, sessions, and bytes) can be calculated using:

$$R = 1 - \frac{f_w}{n_w}$$
(4)

where f_w is the number of workload units with at least one failure recorded (for example, f_users is the number of users who encountered at least one failure) and n_w is the total number of workload units. Goseva-Popstojanova et al. (2004, 2006a), using the Nelson model, discovered that reliability based on the session workload is lower than reliability based on the hit count. However, there is no straightforward relationship between hit reliability and session reliability (Goseva-Popstojanova et al. 2006a); hence, web administrators should not use these two metrics interchangeably. Table 14 displays reliability using the other workloads. This table shows that all workload units provide extremely high reliability numbers due to the low error counts associated with the websites under investigation. However, the “days” workload characteristic yields rates that are lower, especially for ECE (closer investigation revealed that the ECE website experienced a high failure rate per day, which results in the low reliability figure). Hence, the advantage of the four workloads, namely providing finer granularity than the daily error rate, is lost. In addition, significant issues still exist in accurately estimating the four proposed workloads. Hence, any future work on “live” (as opposed to test) websites should simply utilize days as their basis unless there are specific requirements that force web administrators to use other workload characteristics.

Table 13 Reliability analysis
Table 14 Reliability analysis using the other workloads
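
The distinction between Eq. 4 and the hit-level calculation of Eq. 1 is that f_w counts workload units that recorded at least one failure rather than raw failures. The sketch below illustrates this for the session workload; the session labels and numbers are arbitrary examples.

```python
def session_reliability(errors_per_session, total_sessions):
    """Eq. 4 with the session workload: f_w is the number of sessions
    that recorded at least one failure, not the raw failure count."""
    failed_sessions = sum(1 for n in errors_per_session.values() if n > 0)
    return 1.0 - failed_sessions / total_sessions

# Four sessions, two of which recorded at least one failure -> R = 0.5,
# even though the hit-level reliability over these sessions could be much higher.
errors = {"s1": 0, "s2": 3, "s3": 0, "s4": 1}
print(session_reliability(errors, total_sessions=4))   # 0.5
```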

The mean workload between failures (MWBF) can also be calculated using the model discussed in Section 2. This model may provide a better estimate because it does not have the same limitations as the Nelson model; furthermore, it allows web administrators to analyze failures with respect to time. The original study calculated the MWBF by substituting the number of workload units for time, effectively using Eq. 3 for the analysis; hence, this study also uses this formula to calculate the MWBF for the websites under investigation. The resulting MWBFs for the two websites can be seen in Table 15. Sites (or months) marked “n/f” experienced no failures during the time period measured. The MWBF data in Table 15 indicate that, on average, one error is encountered for each of the workload (bytes, hits, users and sessions) amounts specified. This table shows that ECE1 has, on average, a failure for every 44,671 hits, whereas Site A experiences one failure for every 365,138 hits. The “days” column shows that Site A meets its reliability requirement of no more than one failure per month (except in May), whereas ECE experiences at least one failure every week, which is also expected.

Table 15 MWBF

The MWBF calculated using the second MTBF formula can only provide a rough estimate of the actual MTBF. Although using the workload units as a substitute for time is a reasonable method in situations where the time is not available, for this analysis the time can be calculated from the daily failure rate; that is, MTBF = 24/(daily failure rate) hours.

ECE is an academic website; hence, it is not surprising that its MTBF values are 109.7 h (4.57 days) and 69.6 h (2.90 days) for the two periods examined, as opposed to Site A, which has an MTBF of 1,820 h (75.83 days) over the entire 15 months. Again, the relatively low MTBF for Site A during May 2005 can be attributed to the configuration issue discussed above.

This section shows that reliability can be estimated from server logs and expressed in different metrics. Different reliability metrics have been examined to provide system administrators with the flexibility of selecting the correct metric based upon the requirements. For example, the requirements of Site A and ECE were expressed in terms of failures per month. Hence, system administrators for these websites can choose the MTBF to express their estimated reliability.

4.5 Limitations of Log Files

Although log files can provide failure information, reliability can only be estimated from them; the actual reliability cannot be computed solely from web servers’ log files due to several issues. The workload information cannot be accurately computed, as mentioned in Section 4.1. However, with the help of web technologies such as cookies, developers are beginning to be able to track the session count and user count more accurately. Techniques for identifying the correct timeout value for the session workload are also being discussed by various researchers (He and Goker 2000; Huntington et al. 2008). As these technologies and new techniques are utilized, more accurate workload data will be gathered, which will increase the accuracy of reliability estimation.

Furthermore, errors that are not recorded in the log files may lead to an inflated reliability figure. For example, a website’s link may point to an incorrect webpage rather than a missing one. This type of error requires human intervention as the error is only defined by a deviation from the specification rather than an exception. That is, the error codes in the server logs can only identify resource availability issues such as missing resources, moved resources, etc., and not whether the resources contain incorrect content. In this scenario, an error would not be recorded in the log files and the error would only be known when the customer reports the issue. Reliability estimation based on log files alone would not include this error. Because the link is available, automated web crawlers would not be able to detect this error. In fact, this scenario requires manual user intervention to detect the error; hence the error would have to be added manually to the data to increase the accuracy of the proposed reliability estimation method.

5 Conclusions

This paper investigates the validity of evaluating web site reliability based on information extracted from existing web server logs. The investigation is a partial follow up to a previously conducted study (Tian et al. 2004). Two additional websites were examined using the methodology proposed in the original study. The log data for ECE contain 2 months of data that are 1 year apart. The log data for the second website (Site A) cover a continuous 15 months of operation. These two websites belong to two organizations that have different reliability requirements for their websites. During this study several findings were discovered:

  • Error codes such as 401, 403, and 404 can be divided into different types. Based on this classification of the error types, we discovered that most errors are no longer source content failures but are caused by external factors that cannot be controlled by website administrators and content providers. These external factors can be divided into two distinct categories.

  • Issues exist with the workload information extraction process. The original study explained the difficulties of extracting the byte count workload. However, unique challenges also exist in the extraction of the user, session and hit count workloads. For example, each IP address may be shared by many users; thus, counting each unique IP address as a user leads to a situation where the counted number of users is less than the actual number of users.

  • The number of high “value” errors is very low, as seen in Table 9, which displays the number of errors encountered per month. Hence, the other workloads examined cannot provide better granularity than the daily error rate.

  • The Nelson model, used for calculating reliability, is not applicable to some workloads without modifications. The MTBF for a website can be estimated because the total service time can be calculated from the total number of sessions. However, the MTBF will vary depending on the error codes used in the analysis. Thus, the correct error codes need to be selected before reliability evaluation is performed.

  • Some error codes can be generated both by benign requests and by requests containing malicious payloads. For example, the 414 error is returned when the URI is too long. A benign client can generate a long URI due to a bug in its code; however, the URI can also be too long because an attacker is trying to embed a large piece of JavaScript code to take advantage of a cross-site scripting vulnerability.

Our future work consists of a detailed examination of the user and session workloads. In particular, we plan to investigate the intra/inter-session characteristics, as defined by Goseva-Popstojanova et al. (2006a), in order to examine the behaviors of new users (or sessions) versus repeat users (or sessions) and how these behaviors may affect the reliability of the web server. Furthermore, the effectiveness of using cookies as a method to track user and session workloads will be explored.