1 Introduction

Error propagation analysis is a consolidated practice to gain insights into the dependability of software systems. It makes it possible to infer the error modes, intermediate paths and effects that pertain to the activation of faults. Analysis of propagation allows assessing the error behavior of a system, inferring error-prone components, and establishing where and what type of errors are likely to cause system-wide failures (Avizienis et al. 2004). This is of utmost importance to support practitioners in making informed decisions for designing and placing error detection mechanisms (EDMs) and error recovery mechanisms (ERMs) (Arora and Kulkarni 1998). To this aim, many existing approaches rely on rather convoluted data sources that entail a substantial degree of knowledge of system internals and source code visibility. For example, Jhumka and Leeke (2011), Abdelmoez et al. (2004), Popic et al. (2005), Cortellessa and Grassi (2007), and Voas (1997) require operational details, such as states and failure rates, for each system component; Hiller et al. (2004), Hiller et al. (2002a), Leeke and Jhumka (2010), and Michael and Jones (1997) leverage data obtained by instrumenting variables; while Tucek et al. (2007) uses dynamic binary instrumentation.

The application of these approaches is far from seamless when the systems under analysis allow a limited degree of intervention and/or provide a limited view of system internals. This is a common scenario in legacy and OTS-based systems, critical software systems and production environments. Moreover, we observe that the above-mentioned literature on error propagation –although valuable– falls short when it comes to adopting log files, which are used to collect the error events reported by built-in detection mechanisms, such as event logging and assertion checking.

Log files –or simply logs– are a byproduct of system execution and contain text messages on regular and error events encountered by a system under real workloads (Li et al. 2018; Kabinna et al. 2018). Current systems ubiquitously emit log files. Analysis of logs is well consolidated for troubleshooting field failures (Kalyanakrishnam et al. 1999; Tian et al. 2004; Chuah et al. 2015; Russo et al. 2015); when needed, analysis is accompanied by a debugging phase that typically leads to identifying and fixing the fault that caused the failure. Examples of works leveraging log files are Yuan et al. (2010) and Lyu et al. (1996). The approach in Yuan et al. (2010) uses logs to assess control- and data-flow; however, it requires static code analysis and was not originally conceived for error propagation. In Lyu et al. (1996) debug data are leveraged to build error propagation graphs; however, the approach neglects the error messages generated by the system under analysis, which prevents obtaining runtime information about the propagation of errors through system components.

This paper proposes an empirical analysis of error propagation. The analysis is based on logs, which are naturally emitted by a system. We do not address resource consumption metrics, such as CPU, memory and network usage, which are out of the scope of this work. We face the research challenge of obtaining insights into error modes and their propagation by means of logs.

We analyze faults and error events in the logs related to 2,042 failures of two real-world mission critical software systems: a middleware for data distribution and a standalone application for the management of flights and runway control, both used in the Air Traffic Control (ATC) domain by a top leading industry provider of electronic and information solutions for critical systems. We put forth an analysis approach based on the construction of error reporting graphs. The paper addresses the formalization of error modes from data and proposes a set of novel metrics, such as the error propagation reportability (EPR), which are computed from the graphs to quantify the error behavior of the system under assessment. We rely on a representation of the input data that decouples our approach from the data sources, which makes many steps of the approach amenable to automation.

The approach contributed to developing a deep understanding of error modes, propagation paths and capabilities of the error reporting mechanisms, which provided actionable insights to the industry provider for improving error detection. The key findings of our empirical study are:

  • With respect to the error modes –and their granularity– adopted in this study, different fault types lead to a small subset of error modes, which mainly concern type and value of variables. For example, data type errors and unexpected value errors are the most reported for event logging and assertion checking, respectively; this finding is consistent with the previous literature, such as (Leeke and Jhumka 2010).

  • Early error propagation steps are mostly silent. We observe that a software component affected by a fault might report no error notifications. For example, in our setup, the logs emitted by the component containing an algorithm fault report an error only in 33.87% and 19.20% of cases in the two target systems.

  • Although missed by the component originating the fault, errors might still be reported by other components along the propagation path. For example, the logs of the database component –belonging to the ATC middleware– report 23 out of 69 missing function call faults that go undetected by the originating faulty component. A similar finding is noted for assertion checking and for the logs generated by the arrival manager standalone application.

  • Latest error propagation steps determine the type of failure that will be encountered by the system. Our study reveals a strong relation between the last components reporting errors and the type of failure observed. This provides insights on the errors that should be handled to avoid failures.

  • The analysis of the graphs guides the improvement of the error detection mechanism of a complex software system, and makes it possible to quantify the extent of the improvement itself. By analyzing the graphs, practitioners can identify the components where to place more EDMs; experiments on the ATC middleware after the placement of new EDMs highlight an improvement of the error propagation reportability for algorithmic faults from 33.87% up to 94.50%.

The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 presents the systems under assessment and the datasets. Section 4 describes the proposed approach and the metrics. Section 5 discusses the error modes inferred from data, while Section 6 presents the insights achieved from error reporting graphs. Threats to validity are discussed in Section 7, while Section 8 concludes the paper.

2 Related Work

We position our research with respect to existing work on software error propagation analysis, distinguishing between architecture- and metrics-based approaches and code instrumentation approaches.

2.1 Architecture- and Metrics-Based Approaches

Several approaches that address software error propagation require a substantial degree of knowledge of system internals, such as architectural dependencies between system components and the evaluation of software metrics; moreover, some of the approaches are accompanied by static analysis of the source code.

The approach presented in Jhumka and Leeke (2011) leverages module coupling to identify potential data-value error detector locations at module level. Coupling is evaluated using information about modules, e.g., input and output data/control parameters, data/control global variables, and the number of called/calling modules. The approach is applied to an open-source flight simulator. The authors in Abdelmoez et al. (2004) propose a static analytical approach that leverages architecture specifications to estimate the probability of error propagation in a software architecture. The approach is based on a metric named error propagation probability; its evaluation requires architectural-level data, such as the states of components and the messages they can exchange. A similar approach is applied in Filieri et al. (2010) to component-based systems. The authors propose a methodology, based on a probabilistic model, to analyze the reliability of the system starting from the failure modes and failure probabilities of its components. The approach requires detailed architectural information on how the components are assembled, in terms of input/output ports and their connections. The model includes the formalization of probabilistic error propagation among components’ ports.

In Popic et al. (2005) an existing Bayesian methodology for the reliability prediction of component-based software systems is extended to account for error propagation. The methodology leverages the error propagation probability metric and requires knowledge of the failure rates of components, under the assumption of failure independence; each component is assumed to exhibit the same failure rate. Similarly, the work in Cortellessa and Grassi (2007) leverages the error propagation probability and requires detailed information on each component, such as the unconditional and conditional (e.g., subject to a given correct input) failure probability and the operational profile. The method has been shown to be beneficial for the placement of error detection and recovery mechanisms.

The impact of inter-modular data error propagation is assessed in Jhumka et al. (2001). The work characterizes data error propagation and derives a set of metrics that quantify inter-modular interactions. Results indicate that the metrics allow identifying candidate modules to be equipped with detection/recovery mechanisms. Khoshgoftaar et al. (1999) present an approach to identify software modules that do not propagate data errors. The work demonstrates –through experimentation on the Nethack adventure game– that static software metrics are good predictors for the identification of such modules, avoiding the evaluation of their error propagation probability.

The approach proposed in Voas (1997) studies information flows between components of a system. The approach is based on corrupting the information flowing through components and observing its impact during execution, in order to isolate the components that cannot tolerate failures of the others. SherLog (Yuan et al. 2010) is a diagnosis tool that analyzes the source code and the event logs it generates at runtime during the occurrence of failures, to automatically provide control-flow and data-flow information.

An approach to evaluate error propagation from debug data is presented in Lyu et al. (1996). The approach allows building error propagation graphs from the reports generated by analysts after the occurrence of failures. The graphs provide information about the fault causing the failure, the type of the first error, the error propagation mode and how the error has been detected.

Our work aims to overcome the drawbacks that hinder the application of architecture- and metrics-based approaches in real-life production environments. For example, with respect to Jhumka and Leeke (2011), Abdelmoez et al. (2004), Popic et al. (2005), Cortellessa and Grassi (2007), Voas (1997), and Filieri et al. (2010), our proposal does not require detailed information on software components (such as input and output data/control parameters, undesirable states, and failure rates), which is difficult to retrieve especially when the system is based on OTS and legacy components. The work in Zheng and Lyu (2010) questions the use of such component-level models for reliability prediction, especially when applied to web services, due to the lack of detailed system information and to network unpredictability. Similarly to our work, they propose to collect real data about the failures affecting the system, even if for a different purpose (reliability prediction). Differently from us, however, they rely on a user-based collaborative framework that collects failure data from past experiences with the web services to be composed. With respect to Jhumka and Leeke (2011), Jhumka et al. (2001), and Khoshgoftaar et al. (1999), our approach is not limited to data errors and does not require monitoring the output of each component (Voas 1997). In addition, with respect to Yuan et al. (2010), our approach does not require static analysis of the source code, which can be either expensive in a large system or inapplicable when the code is not available. Finally, with respect to Lyu et al. (1996), our proposal leverages the error messages naturally emitted by the target system to perform error propagation analysis, which allows obtaining valuable information about the propagation of errors through the system components.

2.2 Code Instrumentation Approaches

Code instrumentation approaches capitalize on monitoring code (either at source code or binary level) to generate error traces upon fault activation (Lattner and Adve 2004; Cinque et al. 2013).

EPIC (Hiller et al. 2004) is a framework based on variable instrumentation that traces the value of variables in order to estimate an error permeability metric (which evaluates the ability of a module to contain errors) and to place EDMs. PROPANE (Hiller et al. 2002a) analyzes the propagation of data errors in single-process C software systems, and identifies error paths and propagation frequency. PROPANE relies on fault injection to induce data errors in the system and on variable instrumentation to detect errors. In Cinque et al. (2013) a set of logging rules is proposed for the placement of log statements in the source code, in order to generate error traces upon the activation of software faults.

The work (Johansson and Suri 2005) proposes an approach for the analysis of errors in Windows CE .Net device drivers, to study how errors propagate to applications. Data errors are induced by means of fault injection at the interface level, while propagation is analyzed by instrumenting the code with assertions. A set of metrics is proposed to evaluate whether the target driver needs a wrapper to handle the errors.

The work (Leeke and Jhumka 2010) introduces the importance metric to measure the impact a given variable has on the dependability of a software system. The evaluation of the metric requires instrumenting the variables in order to understand when a variable is corrupted. The approach provides insights on the design and positioning of error detection and recovery mechanisms; an open-source flight simulator has been used to assess the proposal. In Tucek et al. (2007) the authors propose a system, called Triage, that automatically performs onsite software failure diagnosis. The system makes use of both kernel-level components and multiple re-executions of the target software to support failure diagnosis; during each re-execution, detailed data are collected via dynamic binary instrumentation to analyze the occurred failure and its causes.

An approach for error propagation analysis using invariants is presented in Chan et al. (2017). The approach, named IPA (Invariant Propagation Analysis), automatically derives invariants for multithreaded programs by instrumenting the source code at function entry and exit points. The approach has been evaluated with different fault types across six programs through fault injection experiments. An error propagation study for MPI applications is presented in Calhoun et al. (2017). The paper investigates how Silent Data Corruption due to soft errors propagates through HPC applications. An LLVM-based tool is developed to instrument MPI applications in order to inject faults and track error propagation at instruction and application variable level. The tool has been applied to three HPC applications.

Differently from these studies, we propose to capitalize on the data already produced by the system, such as the log files produced by event logging and/or the assertions already available in the source code. The idea is also to receive feedback on how these error reporting techniques work and whether/how they should be improved for better error propagation analysis. In contrast, the use of code instrumentation approaches is not straightforward in production environments, for systems adopting OTS components, or when there is limited knowledge of system internals. For example, the approaches (Hiller et al. 2004; 2002a; Leeke and Jhumka 2010; Calhoun et al. 2017; Chan et al. 2017) require instrumenting variables or function entry/exit points, which might be expensive in a complex software system encompassing several components and even inapplicable if the source code is not available. In addition, the approach in Hiller et al. (2004) requires measuring the error permeability for each input of each module, which limits its scalability, while the tool (Hiller et al. 2002a) addresses only single-process software. The system proposed in Tucek et al. (2007) uses kernel-level components and dynamic binary instrumentation, which is not allowed in critical production environments (e.g., mission critical systems) with stringent constraints imposed by certification standards and the use of obsolete kernel versions. Finally, the approaches (Hiller et al. 2004; 2002a; Johansson and Suri 2005) only address data errors, while those presented in Johansson and Suri (2005) and Calhoun et al. (2017) are conceived only for OS device drivers and MPI applications, respectively.

3 Systems and Datasets

The datasets available in this study consist of faults and error events that pertain to a total of 2,042 distinct failures of two systems. We analyze a middleware and a standalone application –named arrival manager– both used by the industry provider in the critical domain of Air Traffic Control (ATC). In the following we first present the systems, the testing applications and the error detection mechanisms; then, we describe how faults and error events are arranged into tabular failure data instances for propagation analysis.

3.1 Description of the Systems

Middleware (MW)

The middleware assessed in this study is an OMG-compliant data distribution service (DDS) layer among heterogeneous ATC applications. It provides a message-oriented application programming interface (API), which is based on the publish-subscribe paradigm and topics. Figure 1a shows a typical deployment of the middleware by the industry provider, where a flight data processor (FDP) and a controller working position (CWP), i.e., two ATC applications, generate the messages exchanged through the middleware. The source code of the middleware consists of 796,353 lines of C code, organized into the 8 components depicted in Fig. 1a:

  • abstraction: layer between the middleware and the operating system;

  • api: API provided to applications;

  • database: bridges data to a DB and vice versa;

  • ddsi2: provides QoS-driven real-time networking based on multiple reliable multicast channels;

  • durability: implements fault-tolerant storage for both state data and persistent settings;

  • kernel: the core of the middleware;

  • spliced: it is responsible for creating and initializing the database used to manage the middleware data;

  • user: intermediate level between api and kernel components.

Fig. 1 Overview of the systems

FDP and CWP are testing applications provided by the industry partner and serve as load generators to exercise the middleware. FDP and CWP implement a workload, i.e., the library of inputs that a generator submits to the target system (Hsueh et al. 1997), consisting of messages that are published under certain topics. Messages and topics reflect the nominal usage profile of the middleware by ATC operators. The leftmost column of Table 1 shows the top 10 invoked functions of the middleware over a sample of 15,409 invocations of 118 distinct functions. It is worth noting that the usage profile exercises the principal entities of the OMG DDS model, such as Data Reader/Writer, Publisher/Subscriber and Topic.

Table 1 Top 10 invoked functions by system

Arrival Manager (AM)

This is a standalone ATC application, which is intended to assist human operators in optimizing the runway capacity and regulating the flow of aircraft entering a given airspace. AM is fundamentally different from the middleware described above and is maintained by a different development team. AM continuously computes an optimal list of flight arrivals based on different parameters, such as the landing rate and spacing requirements. The application consists of 40,396 lines of C++ code. A high-level view of AM is given in Fig. 1b, which is characterized by 6 components:

  • AG (Arrival Generator): computes the arrival list and timing of flights based on the landing rate, spacing and other parameters;

  • ASD (Aircraft Situational Display): manages the position and flight data, e.g., location, altitude, airspeed, of aircraft;

  • Database: it is responsible for the interaction with a DB;

  • Eligibility: at any time provides the list of flights within the eligibility horizon (i.e., close enough to be handled by the AG component);

  • ASF (Aircraft Surveillance Function): determines the position of aircraft;

  • SPV (Supervisor): supervises OS processes that underlie the execution of the AM.

Again, the system is provided by the industry partner along with a testing application called flight orders (FO in Fig. 1b). The workload implemented by FO is a sequence of requests that insert and delete flights, which emulates aircraft entering/leaving an airspace. Requests reflect the nominal usage profile of the AM in production. The rightmost column of Table 1 shows the top 10 invoked functions of AM over a sample of 18,520 invocations of 202 distinct functions. All the components in Fig. 1b are represented among the invoked functions.

3.2 Error Detection

The systems assessed in this study natively implement event logging (EL) mechanisms to detect errors. EL consists of dedicated instructions that are inserted by developers during the coding phase with the aim of reporting error events at runtime upon certain conditions. Figure 2 (lines 4-7) shows a snippet of EL. The code produces an error event whenever the variable newQos equals NULL, which was judged to be an error symptom by developers at coding time. The source code of both MW and AM contains a large number of logging points to catch potential error conditions. Regarding MW, we also consider error detection by means of assertion checking (AC). Figure 2 (lines 16-18) shows a snippet of AC code from MW, which checks the value and type of some parameters passed to the function v_writerNew.

Fig. 2 Snippets of event logging (lines 4-5) and assertion checking (lines 14-16) code

Runtime events produced by EL and AC are typically stored into files –also known as logs– for post-mortem analysis. Logs are a byproduct of the system’s execution. In the systems assessed in this study, events produced by both EL and AC are written in the logs along with some context fields, such as the name of the file and function that contain the logging/assertion instruction that generated the event. The following lines show concrete examples of error events found in the logs of the middleware, i.e., EL (lines 1-4) and AC (lines 5-6).

[listing of error events from the middleware logs: EL (lines 1-4) and AC (lines 5-6)]

It is worth noting that lines 1-2 are produced by the instruction in Fig. 2 (line 5) at runtime; all the lines are accompanied by names of files (e.g., v_topic.c, u_service.c) and functions (e.g., v_topicNew and u_serviceFree).

3.3 Datasets

Data used in this study were collected through a campaign of experiments performed in a controlled monitoring setup (Cinque et al. 2016). Given a system under test –either MW or AM– each experiment consisted in (i) injecting a software fault into the system, (ii) exercising the system by means of the testing applications described in Section 3.1, (iii) observing/classifying the consequent failure, and (iv) storing the errors reported in the logs by either EL or AC. Noteworthy, a fault injection experiment does not necessarily cause a failure; moreover, a failure might go unreported, i.e., the logs contain no error events at all from either EL or AC.

The fault types used in the experiments follow the ODC classification proposed in Chillarege et al. (1992) and its subsequent refinement by Duraes and Madeira (2006), which are widely accepted by the software engineering community. Table 2 summarizes the types pertinent to our study and their mapping to the corresponding ODC class.

Table 2 Fault types used in the experiments (ALG-algorithm, ASG-assignment, CHK-checking, INT-interface)

Failure types denote the nature of the deviation with respect to the correct service expected from the system under test. Types are based on the well-established taxonomy in Avizienis et al. (2004):

  • CRASH: abrupt termination of the system;

  • SILENT: the system is up, but no output/functionality is provided within an expected timeout;

  • ERRATIC: bad output, exceptions, and other malfunctions that do not cause CRASH or SILENT.

In this study we consider the failures that are reported by either EL or AC with at least one error event in the logs. Table 3 shows the total number of failures that meet this criterion after the fault injection campaigns. For each system we provide the breakdown of the total failures by fault type and detection technique. For example, the value 69 in the cell (MFC, EL-MW) indicates that 69 failures caused by an MFC fault injected in the middleware (MW) are reported by at least one error event generated by EL. Overall, failures are grouped into three datasets, which are denoted by EL-MW, AC-MW and EL-AM hereinafter. The bottom row of Table 3 shows the cardinality of the datasets. For example, EL-MW is the set of failures of MW that are reported by EL. Noteworthy, 346 failures of the MW are reported by both EL and AC and –in turn– counted twice in EL-MW and AC-MW; as such, the datasets account for a total of 2,042 distinct failures.

Table 3 Total number of failure data instances by fault type and reporting mechanism (EL, AC)

3.4 Notion of Failure Data Instance

For each failure in the datasets, the corresponding fault and error events are arranged in a more convenient table format for the purposes of this study. The table will be referred to as a failure data instance hereinafter. In this study, the tables are created by means of bash scripts, which normalize faults and error events available across the various files and formats produced after the controlled injection campaigns. Table 4 shows the general format of a failure data instance, populated with real data from the middleware system. The table consists of two sections. The former, i.e., debug data (D), includes the location/type of the fault leading to the failure and the consequent failure type; the latter, i.e., error events (E), contains the lines in the logs generated by either EL or AC upon the occurrence of the failure. We regard the former section of the table as “debug data” because it is intended to contain the outcome of a typical debugging process. While in this study it is populated with faults, locations and failures obtained by means of controlled injections, debug data can typically be found within bug trackers in response to field failures, as discussed later on in this section.

Table 4 Example of failure data instance from the EL-MW dataset

As can be noted from Table 4, the failure data instance is characterized by the component/subcomponent of the fault and of the error events. In the context of the systems at hand, components are listed in Section 3.1. We established the set of components by analyzing the software documentation and through direct discussion with the industry partner. For each component the industry partner shared the list of corresponding source code files. As such, we could correctly map faults and error events to the originating component based on the knowledge of the source file. Examples of components are kernel and ddsi2 in Table 4; functions belonging to each component, such as v_networkReaderNew and v_readerQosNew, are regarded here as subcomponents. In Table 4, a wrong variable used in parameter of function call (WPFV fault type - f), located in the v_networkReaderNew subcomponent (SC) of the kernel component (C), causes a silent failure (failure type - F). Error events are reported by four functions, i.e., v_networkReaderNew, v_readerQosNew, main and u_networkReaderNew –belonging to the kernel, ddsi2 and user components– as shown by the bottom rows of Table 4.
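To make the format concrete, the following is a minimal Python sketch –class and field names are ours– of how a failure data instance could be encoded, mirroring the debug data (D) and error event (E) sections of Table 4.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DebugData:
    """Section D: outcome of the debugging process (or of the controlled injection)."""
    fault_type: str     # f,  e.g., "WPFV"
    component: str      # C,  e.g., "kernel"
    subcomponent: str   # SC, e.g., "v_networkReaderNew"
    failure_type: str   # F,  e.g., "SILENT"

@dataclass
class ErrorEvent:
    """Section E: one error line reported in the logs by either EL or AC."""
    component: str      # e.g., "ddsi2"
    subcomponent: str   # e.g., "main"
    message: str        # raw text of the error event

@dataclass
class FailureDataInstance:
    debug: DebugData
    events: List[ErrorEvent] = field(default_factory=list)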

The failure data instance provides a representation that aims to decouple our analysis approach from the data that can be encountered in practice. For example, in our study faults follow the ODC types and failures are based on Avizienis et al. (2004). However, the analysis approach does not depend on the naming scheme of faults and failures. Similarly, the definitions of component and subcomponent can be adapted to different systems. Moreover, as will be clarified in Section 4, our analysis approach can be used –although at a coarser grain– even if failure data instances miss some features, such as the fault/failure type or the distinction between component and subcomponent.

While our empirical analysis relies on failures collected in a controlled setup, we would like to point out that debug data can be inferred from bug reports and patches –usually available in bug trackers– created in response to field failures. Let us discuss a real-life motivating example of a bug report from the Tomcat server. The report clearly states the faulty component and function, i.e., Catalina and WsRemoteEndpointImplBase, respectively; moreover, it is accompanied by the error events observed in the field. By looking at the description provided by the user, which states “the browser waits for a response forever”, it can be assumed that a SILENT failure occurred. Finally, the analysis of the released patch –consisting of several additions and fixes to the code– makes it possible to state that the fault was an MLPA.

4 Proposed Approach

Our analysis approach infers a representation, namely the error reporting graph, of the errors leading from faults in a given component to failures. The representation is based on directed graphs. Graphs have already been used in the context of error propagation analysis, e.g., Lyu et al. (1996), Jhumka et al. (2001), and Hiller et al. (2004), because they can be easily understood by practitioners.

Let FDI denote a set of failure data instances where the originating component of the fault is the same. We use an iterative approach: for each instance in FDI we first obtain one reporting path; the reporting path is then merged into the error reporting graph. The steps of the analysis are summarized by Algorithm 1, which highlights the input and output of each step. We discuss the steps in the following by means of the illustrative failure data instance shown in Table 4.

4.1 Construction of the Reporting Table

A failure data instance is processed in order to generate a data structure called reporting table as per Algorithm 1 (line 3). The reporting table enriches each error event that accompanies the failure data instance with two attributes: (i) reporting stage and (ii) error mode. An example of reporting table is given in Table 5. Let us describe the attributes in the following.

Algorithm 1 Steps of the analysis
Table 5 Reporting table corresponding to the events shown in Table 4

4.1.1 Reporting Stage

The reporting stage indicates the spatial closeness of the error event with respect to the location of the fault. We use an object-like notation to obtain the component (C) and subcomponent (SC) of faults and error events. Let us denote by D the debug data section of the failure data instance: as such, D.C is the component of the fault, e.g., kernel in Table 4. Similarly, let us denote error events by E[i], with 0 ≤ i ≤ N-1; for example, E[0].C in Table 4 returns kernel, while E[1].SC is v_readerQosNew. For a given error event E[i], the reporting stage is obtained automatically and assumes one of the following values, which are adapted from Lyu et al. (1996):

  • immediate (I): the subcomponent that reports the error is also the location of the fault, i.e., (D.C==E[i].C) && (D.SC==E[i].SC);

  • quick (Q): the subcomponent that reports the error is not the location of the fault, although it belongs to the same component, i.e., (D.C==E[i].C) && (D.SC!=E[i].SC);

  • last (L): the subcomponent that reports the error is not the location of the fault and belongs to a different component, i.e., (D.C!=E[i].C) && (D.SC!=E[i].SC).

In case failure data do not come at the granularity of component-subcomponent, our approach is applied without distinguishing between the immediate and quick stages. The third column of Table 5 shows the reporting stage of the events in Table 4 according to the rules mentioned above. For example, E[1] is assigned “quick” because E[1].SC is v_readerQosNew while D.SC is v_networkReaderNew, and thus different; however, E[1].C and D.C are both kernel.
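For illustration, the stage-assignment rules can be sketched as follows, assuming the failure data instance representation sketched in Section 3.4; the function name is ours.

def reporting_stage(debug, event):
    """Assign the reporting stage of an error event E[i] with respect to the fault in D."""
    same_c = (debug.component == event.component)
    same_sc = (debug.subcomponent == event.subcomponent)
    if same_c and same_sc:
        return "I"   # immediate: error reported by the faulty subcomponent itself
    if same_c:
        return "Q"   # quick: same component as the fault, different subcomponent
    return "L"       # last: error reported by a different component

Applied to the events of Table 4, the function reproduces the stages in the third column of Table 5 (e.g., “Q” for E[1]).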

4.1.2 Error Mode

The error mode is a short description of the nature of the error. The mode is established manually only at the first occurrence of an error event, and automatically for subsequent occurrences of the same event. The process is illustrated by Fig. 3.

Fig. 3 Assignment of modes to error events

Manual Inspection

We scrutinize the event in order to gain insights into the cause of the error. The analysis is supplemented with the software documentation, on-line forum searches, and source code inspection. In the context of our data, E[0] in Table 4 is a “quality of service” error, while E[2] denotes an “unexpected result” error, which represent examples of error modes. Assigning a mode to events –in order to create a dictionary/taxonomy– is a well-known practice in log analysis. It is a cognitive process and requires a trade-off: for example, if a mode is too generic, its resolution might be too low for subsequent analysis. As such, we took a balanced approach by avoiding both overgeneralization and excessive fragmentation.

Once the event is scrutinized, we (i) extract the template of the event, where the variable parts of the event are replaced with a generic wildcard, (ii) formalize a regular expression to match future occurrences of the same template, and (iii) assign it to an error mode ei. For example, the template of the text message in E[0] is ⋆ not created inconsistent qos, where the token “NetworkReader” is replaced with ⋆.

Error Model Base

It contains the results of the manual inspections, i.e., templates and error modes, as shown in Fig. 3. Extracting templates from text logs is a common step in field data studies (Makanju et al. 2012). It should be noted that, in spite of the potentially large number of events, the number of unique templates is significantly lower, and thus addressable by human experts. In our study, out of a total of 63,546 error events reported by event logging we identified 258 unique templates. Templates are grouped by error mode because different templates might account for the same error mode. Table 6 shows some of the templates for the memory error mode, which have been extracted from event logging in the middleware system; all the templates in Table 6 are related to memory allocation and de-allocation issues.

Table 6 Examples of templates assigned to the memory error mode – event logging, middleware system

Automatic Analysis

An error event is first checked against the templates (and regular expressions) of the error model base. If the check is fruitful, the event is automatically marked with its corresponding error mode; if not, manual inspection takes place as discussed above and the base is updated with a new template/error mode. For example, E[1] in Table 4 –reporting an inconsistent qos– would be resolved automatically because the same mode is encountered in E[0].
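A minimal sketch of the automatic check follows, with the error model base modeled as a list of (regular expression, error mode) pairs; the entry shown corresponds to the template of E[0] discussed above, while the function and variable names are ours.

import re

# Error model base: one (compiled template, error mode) pair per manually inspected template.
ERROR_MODEL_BASE = [
    (re.compile(r".* not created inconsistent qos"), "quality of service error"),
]

def error_mode(message):
    """Return the mode of an error event, or None if manual inspection is still required."""
    for template, mode in ERROR_MODEL_BASE:
        if template.fullmatch(message):
            return mode
    return None  # unknown event: inspect manually and add a new (template, mode) entry

print(error_mode("NetworkReader not created inconsistent qos"))  # -> quality of service error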

Table 5 shows the reporting table corresponding to the failure data instance in Table 4. The rightmost columns denote the error modes.

4.2 Construction of the Reporting Path

The reporting table is automatically transformed into a reporting path, which is the second step of the approach as per Algorithm 1 (line 4). For any input failure data instance and corresponding reporting table, the reporting path can assume one of the seven configurations shown in Fig. 4, according to the following rules:

  • R1: fault type (f ) and failure type (F) are the first and last node of the path, respectively, as for all configurations in Fig. 4.

  • R2: immediate (I), quick (Q) and last (L) nodes are drawn if there is at least one error event in the table with Immediate, Quick and Last as reporting stage, respectively.

  • R3: if I exists, f is connected to I, as in configurations (a), (d), (f) and (g).

  • R4: if Q exists, it is the destination node of an arc starting from either i) f, if I does not exist in the table, as in configurations (b) and (e) or ii) I, otherwise, as in configurations (d) and (g).

  • R5: if L exists, it is the destination node of an arc starting from i) f, if both I and Q do not exist in the table, as in configuration (c); ii) I, if I exists and Q does not, as in configuration (f); iii) Q, otherwise, as in configurations (e) and (g). R5 places L as far as possible from f because if either I or Q exist, it means that there exists at least one error event closer to f.

  • R6: F is the destination node of an arc from i) I, if I exists and both Q and L do not, as in configuration (a); ii) Q, if Q exists and L does not, as in configurations (b) and (d); iii) L, otherwise, as in (c), (e), (f) and (g).

  • R7: if L does not exist –as in configurations (a), (b) and (d)– a self-loop is drawn on the node that is directly connected to the failure type node (F). A self-loop indicates that errors are reported only by the component affected by the fault.

Fig. 4 Configurations of a reporting path

Noteworthy, no configuration in Fig. 4 encompasses an arc from the fault type node (f) to the failure type node (F), because failure data instances contain at least one error event; as a consequence, there will always exist at least one node among I, Q and L, by construction. Let us provide a concrete example with the data in Tables 4 and 5, which lead to the reporting path in Fig. 5.

Fig. 5 Reporting path from Tables 4 and 5

Nodes

Fault (WPFV) and failure (SILENT) type are the first and last node of the path (R1). Since all stages occur in Table 5, immediate (I-KERNEL), quick (Q-KERNEL) and last (L-DDSI2-USER) nodes are drawn (R2). The names of I, Q and L nodes are obtained by concatenating I-, Q- and L- with the name of the component; if more than one component is labelled as Last, their names –without repetitions– are concatenated with L-, e.g., L-DDSI2-USER in Fig. 5. Reporting stages are annotated with tables that contain error modes, as in Fig. 5.

Arcs

Figure 5 contains four arcs: (i) from the fault type to the immediate node – R3; (ii) from the immediate to the quick node – R4; (iii) from the quick to the last node – R5; (iv) from the last to the failure type node – R6. Noteworthy, the path reaches the last reporting stage; therefore, no self-loops are drawn on the immediate and quick nodes – R7.
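Rules R1-R7 essentially chain the fault type, the reporting stages present in the table (in order of closeness to the fault), and the failure type; a minimal sketch, with names of ours, could be the following.

def reporting_path(fault_type, failure_type, stages):
    """Build the node chain of a reporting path from the set of stages {'I','Q','L'} present."""
    nodes = [fault_type]                 # R1: the fault type f is the first node
    for stage in ("I", "Q", "L"):        # R2-R5: each existing stage is attached to the
        if stage in stages:              #        closest existing predecessor
            nodes.append(stage)
    self_loop = "L" not in stages        # R7: no last stage -> self-loop before F
    nodes.append(failure_type)           # R1/R6: the failure type F closes the path
    return nodes, self_loop

# The instance of Tables 4 and 5 (all stages present) yields configuration (g) of Fig. 4:
print(reporting_path("WPFV", "SILENT", {"I", "Q", "L"}))
# (['WPFV', 'I', 'Q', 'L', 'SILENT'], False)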

4.3 Graph Update

The graph update step updates the current error reporting graph with an individual reporting path, as per Algorithm 1 (line 5). The update consists of the graph union operation (Bondy et al. 1976). Let P = (VP, EP) be the path to be inserted in the graph G = (VG, EG), where V and E are the sets of nodes and arcs, respectively. The union of P and G is P ∪ G = (VP ∪ VG, EP ∪ EG). In consequence, the resulting graph encompasses the nodes and arcs of both P and G with no repetitions.
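A minimal sketch of the update, with a graph represented as a pair of sets (nodes, arcs); the multiplicities used by the metrics below would be tracked alongside. Names are ours.

def graph_union(graph, path):
    """Union of the error reporting graph G = (VG, EG) with a reporting path P = (VP, EP)."""
    nodes_g, arcs_g = graph
    nodes_p, arcs_p = path
    return nodes_g | nodes_p, arcs_g | arcs_p   # nodes and arcs of both, with no repetitions

graph = (set(), set())                          # initially empty graph
path = ({"WPFV", "I-KERNEL", "SILENT"},
        {("WPFV", "I-KERNEL"), ("I-KERNEL", "SILENT")})
graph = graph_union(graph, path)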

Figure 6 shows a general error reporting graph. The graph indicates both multiplicity (M) and Error Propagation Probability (EPP) of each node/arc. M is the number of reporting paths that contain that node/arc; EPPs are discussed in Section 4.4. For better readability, the graph encompasses one FAILURE node, while the failure types are shown on the arcs connected to the FAILURE node (failures breakdown in Fig. 6).

Fig. 6 Example of error reporting graph

Overall, the graph provides insights into the spatial closeness of the errors with respect to faults and reporting stages (immediate, quick, last). It helps to understand how individual faults (e.g., of type X or Y) impact the system up to system-wide failures. Moreover, the graph allows inferring those cases where errors are reported only in the last stage, hence suggesting actionable improvements in terms of new EDMs; arcs to the FAILURE node (and associated error modes) provide indications for error handling.

4.4 Metrics Computation

We propose a set of metrics to accompany a graph: i) Error Propagation Probabilities, and ii) Error Propagation Reportability.

Error Propagation Probabilities

(EPPs). EPPs are computed for nodes and arcs. The EPP of a node is the ratio between the multiplicity of the node and the number of failure data instances used to obtain the graph; the EPP of an arc is the ratio between the multiplicity of the arc and the multiplicity of its originating node. The interpretation of EPP –which depends on the specific node/arc– is given in Table 7.

Table 7 Meaning of error propagation probabilities

Error Propagation Reportability

(EPR). As mentioned above, an error reporting graph pertains to the propagation of faults originated in a given component C. EPR quantifies the ability of the component at catching error propagation. Let (i) REC (Reported Errors by the Component) be the sum of the multiplicities of the arcs from a fault type node to either I or Q nodes (i.e., the cases where the component C reports at least one error event), and (ii) RE be the sum of the multiplicities of the fault type nodes in the graph. The EPR for the component C is:

$$ EPR_{C} = \frac{REC}{RE}\cdot100 $$

where EPR is in [0,100]%. The closer EPR is to 100%, the higher the ability of C at catching error propagation. A low value of EPR indicates the need for improving the error reporting mechanisms implemented by C.
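A sketch of how the metrics could be computed, assuming the multiplicities defined above are available; the variable and function names are ours.

def node_epp(node_multiplicity, num_instances):
    """EPP of a node: node multiplicity over the number of failure data instances."""
    return node_multiplicity / num_instances

def arc_epp(arc_multiplicity, origin_multiplicity):
    """EPP of an arc: arc multiplicity over the multiplicity of its originating node."""
    return arc_multiplicity / origin_multiplicity

def epr(rec, re_total):
    """EPR of component C, in [0, 100]%.

    rec      -- REC: sum of the multiplicities of the arcs from fault type nodes to I or Q nodes
    re_total -- RE: sum of the multiplicities of the fault type nodes in the graph
    """
    return 100.0 * rec / re_total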

5 Error Model

We analyze the error model obtained by applying our analysis approach to the three datasets of failures.

5.1 Event Logging - Middleware Dataset (EL-MW)

EL-MW consists of 714 failures, as shown in Table 3. The error model is shown in Table 8, where a short ID is assigned to each mode and used hereinafter to refer to that mode; an error event for each mode is shown for the sake of clarity.

Table 8 Error model of event logging - middleware (EL-MW)

We observe that event logging encompasses many errors concerning the high-level business logic and configuration of the application, such as e2-EL-MW, i.e., “Quality of Service error”, e7-EL-MW, i.e., “Topic error”, and e12-EL-MW, i.e., “Configuration error”. Moreover, a considerable number of errors pertain to interactions with OS facilities (e.g., mutex and thread), such as e8-EL-MW and e11-EL-MW.

We closely look at the data to gain insights into the most-likely error modes and their potential relationships with the fault types.

Tables 9 and 10 show the absolute number (Abs) and percentage (%) of reported errors by fault type and error mode. For example, the value 2 in the (MFC, Abs) cell of column e1-EL-MW in Table 9 indicates that 2 failures of the MFC fault type caused at least one error belonging to e1-EL-MW; this is 2.90% –cell (MFC, %)– of the 69 failure data instances where the MFC type led to a detection by EL (cell (MFC, EL-MW) of Table 3).

Table 9 EL-MW: absolute number (Abs) and percentage of reported errors (%) by fault and error mode - from e1-EL-MW to e7-EL-MW
Table 10 EL-MW: absolute number (Abs) and percentage of reported errors (%) by fault and error mode - from e8-EL-MW to e13-EL-MW

Tables 9 and 10 also show the data aggregated by ODC class (“total” rows highlighted in grey). Figure 7 shows the percentages of the ODC classes in Tables 9 and 10 by error mode. For example, the (ALG, e1-EL-MW) bar in Fig. 7 corresponds to the 12.74% of the cell (total ALG, e1-EL-MW %) in Table 9. It can be noted that the distribution of the error modes is similar across the ODC classes. On average, e3-EL-MW, e4-EL-MW and e5-EL-MW are the most likely modes regardless of the fault. For example, the mode e4-EL-MW –denoting the “data type” error– is observed in 27.10%, 34.87% and 41.35% of the instances where the failure is caused by ALG, ASG, or INT faults, respectively.

Fig. 7 EL-MW: percentage of reported errors by mode and ODC fault type

5.2 Assertion Checking - Middleware Dataset (AC-MW)

A similar analysis is done for the AC-MW dataset, i.e., the failures reported by assertions in the middleware system. Table 11 shows the error model. It can be noted that, differently from event logging, errors detected by assertion checking pertain to foundational correctness properties –e.g., data type/size, non-NULL variables– rather than to the overall business logic.

Table 11 Error model of assertion checking - middleware (AC-MW)

As for event logging, we show the absolute number (Abs) and percentage (%) of reported errors by fault type and error mode for the AC-MW dataset in Table 12. Figure 8 plots the percentages of each mode cumulated by ODC type. Again, we observe the predominance of certain error modes. In this case, the most likely modes are e2-AC-MW and e4-AC-MW for all the ODC types. The most frequent mode –i.e., e2-AC-MW, denoting “Unexpected value” errors– occurs in 52.67%, 45.34%, 58.62% and 52.47% of the failures caused by ALG, ASG, CHK and INT faults, respectively.

Fig. 8 AC-MW: percentage of reported errors by mode and ODC fault type

Table 12 AC-MW: absolute number (Abs) and percentage of reported errors (%) by fault and error mode

5.3 Event Logging - Arrival Manager Dataset (EL-AM)

We discuss the error model obtained by analyzing EL-AM, i.e., the dataset of failures reported by the event logging mechanism of the arrival manager system. Table 13 shows the error model and an error event for each mode. Similarly to EL-MW in Section 5.1, some errors pertain to the high-level business logic of the application, such as e2-EL-AM, i.e., “Data format error”, and e3-EL-AM, i.e., “Query error”.

Table 13 Error model of event logging - arrival manager (EL-AM)

Table 14 shows the absolute number (Abs) and percentage (%) of reported errors by fault type and error mode. For example, as described for the other datasets, the value 1 in the (MFC, Abs) cell of column e2-EL-AM indicates that 1 failure of the MFC fault type caused at least one error belonging to e2-EL-AM; this is 11.11% –cell (MFC, %)– of the 9 failure data instances where the MFC type led to a detection by EL (cell (MFC, EL-AM) of Table 3).

Table 14 EL-AM: absolute number (Abs) and percentage of reported errors (%) by fault and error mode

The percentages of the ODC classes by error mode –highlighted in Table 14– are plotted in Fig. 9. As for the previous datasets, we observe that two error modes are predominant, i.e., e2-EL-AM and e3-EL-AM; noteworthy, e2-EL-AM pertains to data-related errors.

Fig. 9 EL-AM: percentage of reported errors by mode and ODC fault type

5.4 Final Remarks on the Error Models

With respect to the error modes adopted in this study, it can be reasonably stated that different fault types concentrate in a small subset of error modes. Interestingly, these modes concern the type and value of variables expected during execution. This finding is consistent with the literature that highlights the importance of variables and data-error analysis in engineering dependable software (Leeke and Jhumka 2010; Hiller et al. 2002b; Jhumka et al. 2001; Jhumka and Leeke 2011; Hiller et al. 2004; Johansson and Suri 2005; Pattabiraman et al. 2011). Noteworthy, this finding is obtained here on top of data from logs naturally emitted by the target systems, rather than through substantial instrumentation, which makes our approach potentially applicable to a wider class of systems.

6 Propagation Analysis

In this section we discuss error reporting graphs and computation of the metrics by means of case studies encompassing different detection techniques (EL and AC) and different systems (MW and AM).

We start with the analysis of EL on MW (Case Study 1 in Section 6.1), which addresses the EL-MW dataset by building the reporting graph, computing the metrics, and inferring the paths leading to failures, which are then useful to gain insight into the errors that should be handled to avoid failures.

We then try to generalize the findings by replicating the analysis on the same system but with a different detection technique (i.e., the AC-MW dataset, Case Study 2 in Section 6.2) and on a different system with the same technique (i.e., the EL-AM dataset, Case Study 3 in Section 6.3). Finally, Section 6.4 shows how the insights inferred from the graphs can be used to improve the detection mechanism; to this aim we use the EL-MW dataset and draw additional considerations for the other cases as well.

In the following we focus on the graphs obtained for the most recurring faults in our data, i.e., ALG and ASG faults for MW and ALG faults for AM (as highlighted by Table 3), and provide summary results for the other ODC classes.

6.1 Case Study 1: Analysis of EL-MW

The major error propagation paths inferred from event logging for failures caused by ALG and ASG faults in the MW are shown in Figs. 10 and 11, respectively. By major we mean paths involving I/Q/L nodes with a multiplicity of at least 10.

Fig. 10 Error reporting graph for EL on MW - ALG faults

Fig. 11 Error reporting graph for EL on MW - ASG faults

It can be noted that many errors are not reported by the immediate and quick components (i.e., kernel in our data). For example, from Fig. 10 we can notice that (i) 6 (i.e., MFC→Q-KERNEL) out of 69 MFC faults, (ii) 4 (i.e., MIFS→I-KERNEL plus MIFS→Q-KERNEL) out of 12 MIFS faults, and (iii) 91 (i.e., MLPA→I-KERNEL plus MLPA→Q-KERNEL) out of 262 MLPA faults led to error events by the kernel component (either immediate or quick); the error propagation probabilities (EPPs) of the immediate and quick nodes are 0.13 and 0.22, respectively. Similar considerations apply to Fig. 11, where the immediate and quick components exhibit an EPP of 0.18 and 0.27, respectively. This results in a low error propagation reportability (EPR), which is 33.87% and 42.05% for ALG and ASG, respectively, as shown in Table 15, where the results for all ODC classes are summarized.

Table 15 Error Propagation Reportability (EPR) of EL on MW with respect to the ODC class

Reporting graphs make it possible to infer that the latest error propagation steps determine the type of failure encountered by the system, which allows providing indications on the errors that should be handled in the system to avoid their propagation into failures. For example, Figs. 10 and 11 show that in 65 (i.e., L-API-SPLICED-USER→FAILURE in Fig. 10) out of 66 data instances (EPP of 0.99) and in 16 (i.e., L-API-SPLICED-USER→FAILURE in Fig. 11) out of 19 data instances (EPP of 0.84), respectively, where errors propagated to the api, spliced and user components, a SILENT failure occurred in the system; similarly, in 115 (i.e., L-DATABASE→FAILURE in Fig. 10) out of 115 cases (EPP of 1.00) and in 71 (i.e., L-DATABASE→FAILURE in Fig. 11) out of 71 cases (EPP of 1.00), for Figs. 10 and 11 respectively, where an error reached the database, a CRASH occurred. We also note that ERRATIC failures occurred mainly when errors propagated to the non-faulty subcomponents of the kernel, i.e., the Q-KERNEL node of the graphs.

From these indications, we learn that the api, spliced or user components are good candidates to handle errors in order to mitigate SILENT failures; similarly, database might help to face CRASH failures. In more detail, the analysis of the error events generated by EL in the api, spliced and user components pointed out that most of the reported errors belong to e5-EL-MW (main daemon error in Table 8); therefore, to mitigate SILENT failures, these components should check the availability of the main daemon of the middleware and, when needed, attempt a reboot of the daemon. On the other hand, a closer look into the error events generated in the database component highlighted that the reported errors belong to e4-EL-MW (data type error in Table 8); therefore, requesting the data again or trying to continue the execution with default values can be useful to avoid CRASH failures or to move towards a graceful stop.

6.2 Case Study 2: Analysis of AC-MW

In this section we repeat the analysis on the same system as in the previous section, but focusing on a different detection technique, namely assertion checking (AC).

Figures 12 and 13 show the major error propagation paths inferred from assertion checking for failures caused by ALG and ASG faults, respectively, in the MW system. From the graph in Fig. 12 we can observe that (i) 335 (i.e., MLPA→I-KERNEL plus MLPA→Q-KERNEL) out of 684 MLPA faults and (ii) 46 (i.e., MFC→I-KERNEL plus MFC→Q-KERNEL) out of 160 MFC faults led to an error event by the kernel component, which translates into EPP values of 0.13 and 0.34 for the immediate and quick nodes, respectively. On the other hand, Fig. 13 shows that 229 (i.e., MVAE→I-KERNEL plus MVAE→Q-KERNEL) out of 364 MVAE faults –the most recurrent fault type for the ASG class– caused failures that are detected by the kernel component. This translates into EPP values of 0.22 and 0.41 for the immediate and quick nodes, respectively.

Fig. 12 Error reporting graph for AC on MW - ALG faults

Fig. 13 Error reporting graph for AC on MW - ASG faults

Similarly to EL, many errors are not reported by the immediate and quick components. In other terms, early error propagation steps are mostly silent, regardless of the detection technique.

Table 16 summarizes the results, in terms of EPR, for all ODC classes. The maximum EPR is obtained for ASG faults, i.e., 62.97%, which means that more than half of the errors have been reported by assertions located in the kernel component.

Table 16 Error Propagation Reportability (EPR) of AC on MW with respect to the ODC class

Overall –by comparing Tables 15 and 16– it can be stated that in the MW system assertion checking provides better detection than event logging; however, the EPRs in both tables still highlight that neither technique has a strong ability at reporting error propagation. In general, ASG and INT faults, which underlie variable-related problems, have better chances of being detected than algorithmic faults (ALG).

Reporting graphs obtained with AC can be used to infer information on the paths leading to failures, as done with EL. In particular, both Figs. 12 and 13 show that in almost all the cases where errors propagated to either the database or the ddsi2 component, a CRASH occurred in the system. The analysis of the error events generated by assertion checking in both components revealed that most of the reported errors belong to e2-AC-MW (unexpected value error) and e4-AC-MW (NULL value error) –according to the error model in Table 11– for database and ddsi2, respectively; again, it could be useful to attempt to avoid further propagation of value errors, either unexpected or NULL.

6.3 Case Study 3: Analysis of EL-AM

In this section we repeat the analysis by focusing on the same detection technique as Case Study 1, namely event logging, but on the AM system. Figure 14 depicts the reporting graph. It can be noted that many errors are not reported by the immediate and quick components (i.e., database in our data). Once again, we note that early error propagation steps are mostly silent and missed by EL.

Fig. 14 Error reporting graph for EL on AM - ALG faults

For example, Fig. 14 shows that (i) 3 (i.e., MFC→Q-DATABASE) out of 9 MFC faults, and (ii) 10 (i.e., MLPA→Q-DATABASE) out of 57 MLPA faults led to error events by the database component (either immediate or quick); the EPP of the quick node is 0.20. Noteworthy, there is no immediate node in the graph because no errors are reported by the faulty subcomponent in AM. This results in a low EPR, which is 19.70% for ALG, as reported in Table 17.

Table 17 Error Propagation Reportability (EPR) of EL on AM with respect to the ALG ODC class

From the graph we can note that in 37 (i.e., L-SPV→FAILURE) out of 42 data instances (EPP of 0.88) where errors propagated to the spv component, a CRASH failure occurred in the system; similarly, in all the cases where an error reached either the database component (i.e., Q-DATABASE→FAILURE) or the eligibility component (i.e., L-ELIGIBILITY→FAILURE), an ERRATIC failure occurred. These indications allow identifying the components where to handle given types of errors, similarly to what was observed in the previous two case studies. For example, spv might handle errors to mitigate CRASH failures; similarly, both the database and eligibility components might help to face ERRATIC failures.

The analysis of the error events generated by EL in the spv component pointed out that most of the reported errors belong to e3-EL-AM (query error in Table 13); therefore, this component should manage exceptions related to the execution of queries. On the other hand, a closer look into the error events generated in the eligibility component highlighted that the reported errors belong to e2-EL-AM (data format error in Table 13): recovering from data format errors can be useful to avoid ERRATIC failures in AM.

In summary, in all the datasets we were able to apply the proposed approach to build error reporting graphs, regardless of the detection technique and of the target system. The graphs are a useful instrument to quantify reporting performance – in terms of the proposed EPR and EPP metrics – to spot reporting inefficiencies, to identify errors to be handled with the aim of avoiding failures, and to improve the reporting mechanism, as discussed in the next section.
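To make the construction concrete, the following sketch builds a toy reporting graph from failure data instances and computes one plausible reading of the EPR (the percentage of instances whose errors are reported by an immediate or quick node). The input representation is a simplified stand-in for the decoupled data format used by the approach, and the class and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

# Simplified stand-in for a failure data instance: injected fault type, ordered
# (timing, component) error reports, and observed failure mode.
@dataclass
class FailureInstance:
    fault_type: str                                               # e.g. "MLPA"
    reports: List[Tuple[str, str]] = field(default_factory=list)  # e.g. [("Q", "DATABASE")]
    failure: str = "CRASH"                                        # e.g. "CRASH", "ERRATIC"

def build_reporting_graph(instances):
    """Count arc multiplicities along fault -> reporting nodes -> failure paths."""
    arcs = Counter()
    for inst in instances:
        path = [inst.fault_type] + [f"{t}-{c}" for t, c in inst.reports] + [inst.failure]
        for src, dst in zip(path, path[1:]):
            arcs[(src, dst)] += 1
    return arcs

def epr(instances):
    """One plausible reading of the EPR: percentage of instances whose errors are
    reported by an immediate ('I') or quick ('Q') node."""
    reported = sum(any(t in ("I", "Q") for t, _ in inst.reports) for inst in instances)
    return 100.0 * reported / len(instances)

data = [
    FailureInstance("MLPA", [("Q", "DATABASE"), ("L", "SPV")], "CRASH"),
    FailureInstance("MFC", [("L", "SPV")], "CRASH"),
]
print(build_reporting_graph(data))  # arc multiplicities of the toy graph
print(f"EPR = {epr(data):.2f}%")    # 50.00% in this toy example
```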

6.4 Improvement of the Detection Mechanism

As observed in all the case studies, the propagation graphs reveal that errors undetected by the immediate/quick component might still be reported by other components along the propagation path, which allows understanding how to improve the reporting mechanism.

With respect to the first case study, Figs. 10 and 11 show that many faults lead to failures reported only by late components, such as database, api, spliced and user, without involving the kernel. For instance, in Fig. 10: (i) 23 (i.e., MFC→L-DATABASE) out of 69 MFC faults, (ii) 83 (i.e., MLPA→L-DATABASE) out of 262 MLPA faults, and (iii) 5 (i.e., MIFS→L-DATABASE) out of 12 MIFS faults (with EPP values of 0.33, 0.32 and 0.42, respectively) led to failures reported only by the database component; similarly, 26 (i.e., MFC→L-API-SPLICED-USER) out of 69 MFC faults (EPP of 0.38) and 40 (i.e., MLPA→L-API-SPLICED-USER) out of 262 MLPA faults (EPP of 0.15) caused failures reported only by the api, spliced and user components. On the other hand, Fig. 11 shows that 59 (i.e., MVAE→L-DATABASE) out of 156 MVAE faults (with EPP value of 0.38) led to failures reported only by the database component.

Following this finding, we placed additional EDMs in the kernel component. To this aim we use the rule-based logging approach (Cinque et al. 2013), which consists in placing start-end events at the beginning and end of functions. Rule-based logging aims to detect errors that prevent the completion of invoked functions, as well as dirty function returns. The technique is applied to the source code of the files belonging to the kernel component, which leads to placing detectors in 118 functions.
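The additional EDMs were placed in the source code of the kernel component; purely to convey the start-end idea behind rule-based logging, the Python-flavoured sketch below wraps a hypothetical routine with start/end events, so that a missing end event in the log reveals that the function did not run to completion. The decorator and function names are illustrative and not taken from the systems under study.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("rule-based-logging")

def start_end_events(func):
    """Emit a start event before the function and an end event after it; if the
    function is interrupted, the end event is missing and an error is logged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("SRT %s", func.__name__)   # start event
        try:
            result = func(*args, **kwargs)
        except Exception:
            log.error("ERR %s did not complete", func.__name__)
            raise
        log.info("END %s", func.__name__)   # end event
        return result
    return wrapper

# Hypothetical kernel-like routine, used only to illustrate the instrumentation.
@start_end_events
def update_topic_record(record):
    if record is None:
        raise ValueError("NULL record")
    return dict(record, status="ok")
```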

We analyze the logs generated by MW –now equipped with additional EDMs– under controlled injection experiments and obtain the graph in Fig. 15 for ALG faults. Differently from the original graph in Fig. 10, we observe that:

  • most of the errors are reported by the kernel component, either immediate (EPP of 0.67) or quick (EPP of 0.50);

  • all the errors reported by the L-API-SPLICED-USER node are also reported by the kernel component, i.e., 66 (53 Q-KERNEL→L-API-SPLICED-USER plus 13 I-KERNEL→L-API-SPLICED-USER) out of 66 cases, with no arcs connecting faults with the L-API-SPLICED-USER node;

  • most of the errors reported by the database component, i.e., 86 (45 Q-KERNEL→L-DATABASE plus 41 I-KERNEL→L-DATABASE) out of 115 cases, are now reported also by the kernel, either immediate or quick.

The improvement is reflected in the EPR, which increases from 33.87% (as reported in the ALG row of Table 15) to 94.50%; hence the new placement significantly improves the reporting ability of the kernel component. Moreover, it is important to note that the new EDMs also make it possible to report previously undetected errors. In fact, errors caused by 11 MIEB, 24 MIFS, 90 MFC and 725 MLPA faults, undetected by the original built-in EL, are now reported by the kernel component, as can be noted by comparing the multiplicity of the fault type nodes in Figs. 10 and 15.

Fig. 15 Error reporting graph for EL on MW after improvement

Similar insights are obtained for the other case studies. Starting from assertion checking, Fig. 12 reveals that 250 (i.e., MLPA→L-DATABASE) out of 684 MLPA faults and 79 (i.e., MFC→L-DATABASE) out of 160 MFC faults (with EPP of 0.37 and 0.49, respectively) led to failures reported only by the database; similarly, 84 (i.e., MLPA→L-DDSI2) out of 684 MLPA faults and 33 (i.e., MFC→L-DDSI2) out of 160 MFC faults (with EPP of 0.12 and 0.20, respectively) caused failures reported only by the ddsi2 component. Moreover, Fig. 13 shows that 85 (i.e., MVAE→L-DATABASE) and 46 (i.e., MVAE→L-DDSI2) out of 364 MVAE faults (with EPP of 0.23 and 0.13, respectively) led to failures reported only by the database and ddsi2 components, respectively. A closer look into the error events generated by assertion checking in the database and ddsi2 components points out that most of the reported errors belong to the e2-AC-MW (unexpected value error) and e4-AC-MW (NULL value error) types, respectively. As such, further EDMs can be placed in the kernel subcomponents to reveal bad or NULL values exchanged with database or ddsi2.

Concerning the AM system, Fig. 14 shows that 5 (i.e., MFC→L-SPV) out of 9 MFC faults, and 37 (i.e., MLPA→L-SPV) out of 57 MLPA faults (with EPP values of 0.56 and 0.65, respectively) led to failures reported only by the spv component; similarly, 10 (i.e., MLPA→L-ELIGIBILITY) out of 57 MLPA faults (EPP of 0.175) caused failures reported only by the eligibility component. These data suggest the placement of EDMs inside the database component. Moreover, a closer look into the instances containing only errors generated by the spv component highlights that most of the reported errors belong to e1-EL-AM (memory error) and e3-EL-AM (query error), which further confirms the need for addressing the database component to improve error detection.

7 Threats to Validity

We discuss the validity of the study based on the most relevant aspects listed in Wohlin et al. (2000).

Construct Validity

This threat relates to the choice of the datasets used for the evaluation. We address it by collecting realistic failure data instances from two different real-world software systems of an industrial partner. The failure data instances were collected in our previous large-scale study on software systems monitoring (Cinque et al. 2016) by running controlled experiments, which aimed to elicit error events under different fault and failure conditions. The reference systems have been exercised with testing applications provided by the industrial partner, so as to reproduce realistic operational scenarios. Injected faults are based on the well-consolidated ODC scheme (Chillarege et al. 1992) and on fault types accounting for around 80% of representative faults found in real-world software systems, according to the estimates in Duraes and Madeira (2006).

Internal Validity

Internal validity might be threatened by the analysis of the relationships between errors and their granularity. As for any log analysis study, creating a taxonomy/dictionary of events is a cognitive process and requires a trade-off. For example, if an error type is too generic, the resolution might be too coarse for subsequent analysis. To mitigate the threat, we took a balanced approach in order to avoid both overgeneralization and excessive fragmentation, also considering error modes typically found in other studies. In addition, we adopted a mixture of diverse faults, failures and error reporting mechanisms. We used error events produced by two error reporting mechanisms, i.e., event logging and assertion checking, under different fault and failure conditions and from two different systems, to show the effectiveness of the approach for error propagation analysis through error data logged by the system. Noteworthy, our approach analyzes errors raised by the activation of individual faults because –in case of coincidental activation of multiple faults– it would be hard to discriminate which fault caused which errors. The key findings of the study are consistent across the mechanisms and target systems, providing a reasonable level of confidence in the analysis.

External and Conclusion Validity

Regarding the possibility of applying the approach to other systems, we provide concrete examples with two unrelated systems developed by independent teams. We are confident that the details provided should reasonably support the replication and generalization of the steps composing the approach. The reported findings, which are strongly supported by data, are useful to get an overall understanding of the insights that can be obtained; however, being based on two systems, they are not intended to establish general conclusions. The results show how the proposal can be used to understand error modes, propagation paths and capabilities of error reporting mechanisms, and to infer useful insights to improve them accordingly.

8 Conclusion

This paper proposed an empirical study on error propagation. The study leverages an approach based on the concept of error reporting graphs and on novel metrics, i.e., Error Propagation Probabilities and Error Propagation Reportability. The approach has been applied to 2,042 failure data instances from two real-world systems of an industrial partner, encompassing logs and error events generated by two error reporting mechanisms.

The use of the approach provided a deep understanding of error modes, propagation paths and capabilities of the error reporting mechanisms implemented by the systems at hand. For example, the study highlighted that a component affected by a fault is likely to report no error notifications, regardless of the error reporting mechanism. On the other hand, the obtained results pointed out that errors missed by the faulty component might still be reported along the propagation path; the analysis of those errors provided insights about the improvement of error reporting mechanisms and the placement of new EDMs. Finally, the study revealed that the latest error propagation steps determine the type of failure encountered by the system, which provides useful indications for making informed decisions on the errors that should be handled to avoid system failures.