1 Introduction

When services fail, service management tends to be interested in answering three questions. First: how do we fix it fast? Second: who is responsible? And third: how do we stop it from happening again? It is generally accepted that the most significant failures in information technology (IT) environments (as in other industries) are due to human error [13]. Famous examples range from the June 29, 2001 NASDAQ integrity failure, caused by an administrator during routine testing of a development system, to more recent service disruptions such as the April 29, 2011 Amazon Web Services (AWS) downtime of 47 h, which started with an incorrectly performed network traffic shift by a network technician [4]. So what should be the consequence of this insight, other than a twist on “what can go wrong will go wrong”? Failures are always an interaction of inchoate elements, where even a small error can lead to a disaster, and it is hardly possible to foresee all risks, even for the most formidable group of experts. With computational power ever increasing and analytical methods for data streams steadily improving, a growing body of research deals with the analysis of incident and monitoring data to improve stability on the one hand and to support IT service decisions on the other.

The evaluation of these new models, however, is very cumbersome, as there is hardly any usable real-world IT service incident or monitoring data available. A lot of research in cloud computing and decision support systems relies on a series of case studies or is tailored to a specific scenario with a project partner who is (understandably) not willing to publish their service incident data. Other available data, e.g. from cloud providers like Amazon Web Services or Salesforce, is aggregated in a way that makes it unusable for most purposes. Consequently, there is a need for the creation of synthetic incident data.

We thus state the following: it should be the aim of any researcher requiring service monitoring or incident data to validate their work with real-world data. Unfortunately, this data is very hard to come by, or, more precisely, almost impossible to come by with the properties necessary to make it comparable.

The goal of this research is the identification of characteristics of a procedure to create realistic, comparable and reproducible incident data to validate formal models from the realm of service science research.

This work is structured as follows: In Sect. 2 we analyse related work. Section 3 summarises the requirements for realistic incident data and describes its characteristics. We conclude our work in Sect. 4.

2 Literature Review

As stated in the previous section, it should be the aim of a researcher to validate their work with real-world data whenever possible. For this reason, research dealing with the creation of artificial incident data is sparse. Nevertheless, there has been significant research on identifying the failure patterns of service incidents. Franke has analysed empirical data sets and concluded that the Weibull distribution and the Log-normal distribution are well suited for fitting them [5, 6].

When exploring the connection between business impact costs and service incidents, Kieninger et al. have identified the Beta distribution as fitting their empirical data sets [7, 8]. Their focus is on finding the relation between these incidents and business costs.

Google researchers have conducted analyses on the failure of hard drives in their data centres [9]. However, they restrict themselves to exhibiting their findings without trying to fit them to probability distributions. Judging from these results, it can be assumed that disk failures are approximately normally distributed.

While these findings are very interesting indeed, their aim is never to reproduce these incident patterns for future experiments and simulation studies. They should rather serve as a basis when choosing the correct distributions, and thus the incident patterns, to be analysed.

3 Model Description

We first identify reasons why real-world monitoring data is insufficient or too hard to come by when dealing with decisions in IT service settings. The subsection thereafter comprises the characteristics necessary for simulated services.

3.1 The Need for Synthetic Incident Data

In this section we identify why incident data needs to be synthetic when validating decision models that rely on IT service incident and monitoring data.

Disposal and Aggregation. Most service providers dispose of their monitoring and incident data as soon as it is no longer needed for contractual obligations or internal analysis. In most cases, fine-grained monitoring data is deleted after a set time period and only aggregated data is stored over a longer course of time. Smaller service providers might not even keep this aggregated data.

Service Changes. In realistic settings IT infrastructures change over the course of time. A provider might increase computing power, change other hardware components or improve/change the software running on the infrastructure. This makes fair comparisons impossible.

Different Parameters. Besides their functionality, IT services have certain non-functional parameters (e.g. availability). Comparisons of these parameters can only be conducted after extracting them into a common format (e.g. WS-Agreement [10]) and reducing the set of parameters to those available for all services.

Segment Lengths. Depending on what kind of service is monitored, the time segment length might differ significantly. Some services are monitored on a millisecond basis, while the availability of other services is tested once an hour. This also makes comparisons between services questionable.

These limitations are most prominent when dealing with real-world incident data and create the need for synthetic data. For this data to be realistic and of use in IT service decision scenarios, a series of characteristics has to be simulated. These are listed in the following subsection.

3.2 Service Characteristics

It is assumed that IT services have unique generic types i (e.g. storage or database service) and are offered by multiple service providers j. Specific services are the combination of an IT service type and a provider offering that service \(s_{ij}\). Each service has a price \(p_{ij}\) per unit of time t it is contracted.

IT service incidents have a frequency \(m_f(s_{ij},t)\) and an expected failure duration \(d_f(s_{ij},t)\). This is common practice in reliability engineering, where the frequency is often labelled the Rate Of Occurrence Of Failures (ROCOF) and the failure duration the Mean Time To Repair (MTTR) [11]. Both are combined into \(\lambda _{ij}^t := (m_f(s_{ij},t),d_f(s_{ij},t)) \in \varLambda \). Some services \(s_{ij}\) come with a penalty agreement \(\mu _{ij}(\cdot )\) in case service objectives are not met. The penalty that has to be paid by the service provider is dependent on \(\lambda _{ij}^t\).
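The model above can be sketched as a small data structure. This is a minimal illustration with hypothetical names (`Service`, `IncidentProfile`), not an implementation from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Service:
    """A specific service s_ij: generic type i offered by provider j."""
    service_type: str      # generic type i, e.g. "storage" or "database"
    provider: str          # provider j
    price_per_unit: float  # price p_ij per contracted unit of time t

@dataclass(frozen=True)
class IncidentProfile:
    """lambda_ij^t = (m_f(s_ij,t), d_f(s_ij,t)): incident frequency
    (ROCOF) and expected failure duration (MTTR) in period t."""
    rocof: float  # incidents per unit of time
    mttr: float   # expected failure duration

s = Service("storage", "provider_a", price_per_unit=4.0)
lam = IncidentProfile(rocof=0.02, mttr=1.5)
```

A penalty agreement \(\mu_{ij}(\cdot)\) would then be modelled as a function of such an `IncidentProfile`.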

For incident data to be realistic in a service decision scenario, the components introduced above have to be simulated.

Pricing Strategy. IT services tend to be priced in tiers. Providers offer e.g. gold and platinum plans, where the platinum plan is significantly more expensive than the gold plan but offers a higher quality. Additionally, usage-based, performance-based, user-based and flat pricing should be implemented [12, 13].
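A tiered, usage-based pricing scheme could be sketched as follows. The tier table and all prices are invented for illustration and are not taken from the paper:

```python
# Hypothetical tier table: the platinum plan costs more than the gold
# plan but promises a higher availability.
PRICE_TIERS = {
    "gold":     {"price_per_hour": 2.0, "availability": 0.995},
    "platinum": {"price_per_hour": 5.0, "availability": 0.9999},
}

def usage_based_price(tier: str, hours: float) -> float:
    """Usage-based pricing: pay per contracted hour of the chosen tier."""
    return PRICE_TIERS[tier]["price_per_hour"] * hours
```

Flat, performance-based or user-based pricing would be analogous functions mapping a tier and a usage measure to a price.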

Penalty Agreements. Public cloud computing providers in particular offer no penalty payments when a service is unavailable. Traditional (outsourcing) service providers, however, are contractually obligated to pay a fee if their service is not usable. In practice, only a limited set of penalty functions is encountered, and these are generally capped by an upper bound [14, 15].
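One common shape of such a penalty function \(\mu_{ij}(\cdot)\) is a penalty that grows with downtime but is capped by an upper bound. A minimal sketch (parameter values are hypothetical):

```python
def capped_linear_penalty(downtime: float, rate: float, cap: float) -> float:
    """Penalty that grows linearly with downtime, limited by an upper bound,
    as typically found in outsourcing contracts."""
    return min(rate * downtime, cap)

# 2 h of downtime at 10 per hour -> 20; 10 h would exceed the cap of 50.
capped_linear_penalty(2.0, rate=10.0, cap=50.0)   # 20.0
capped_linear_penalty(10.0, rate=10.0, cap=50.0)  # 50.0
```

A public cloud provider without penalty payments corresponds to the degenerate case `cap = 0`.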

Service Failures. It is assumed that services fail with certain probabilities that can be approximated through failure distributions. This is a well-established fact for systems in reliability engineering research [16, 17], and it holds across the wide range of different IT services, whether considering human error [2], hardware failures [9] or other unanticipated failures [6, 8].
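Following the distributions found suitable in the literature review (Weibull for fitting incident data, Log-normal e.g. for durations), a simple incident generator can be sketched. All parameter values and the function name are illustrative assumptions, not results from the paper:

```python
import random

def simulate_incidents(horizon: float, shape: float, scale: float,
                       mttr_mu: float, mttr_sigma: float,
                       seed: int = 42) -> list[tuple[float, float]]:
    """Draw (failure_time, repair_duration) pairs up to the horizon:
    Weibull-distributed times between failures, log-normally distributed
    repair durations (a proxy for the MTTR behaviour)."""
    rng = random.Random(seed)
    t, incidents = 0.0, []
    while True:
        t += rng.weibullvariate(scale, shape)  # time to next failure
        if t >= horizon:
            break
        duration = rng.lognormvariate(mttr_mu, mttr_sigma)
        incidents.append((t, duration))
        t += duration  # the service is down while being repaired
    return incidents

incidents = simulate_incidents(1000.0, shape=1.2, scale=50.0,
                               mttr_mu=0.0, mttr_sigma=0.5)
```

Running such a generator per service \(s_{ij}\) yields the incident time series from which \(\lambda_{ij}^t\) can be computed for each period t.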

Each of the above characteristics is necessary for a simulative comparison of different IT services and their incident behaviour. The correct parametrisation of the failure distributions is vital for meaningful incident time series.

4 Conclusion

In this work we have shown that there is a need for synthetic incident data. Having evaluated a previously introduced decision method [18-20], we want to significantly improve an existing implementation, provide a thorough survey of the generated data, and package our model to make it available for general use. The work at hand is thus conducted as research in progress, in order to gather further input from fellow researchers and enhance our model.